Recognizing Phone and Mobile Numbers on Web Pages

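The most obvious way to recognize phone numbers on a web page is to design a regular expression for phone numbers in advance, match it against the page text, and extract the contact details it finds. This approach assumes that phone numbers appear on the page in a fairly ideal format. Precision is naturally high for numbers that do, but many others are missed. Unless you have worked closely with web page data, it is hard to imagine how irregular page formatting on the Internet really is.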
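To make the limitation concrete, here is a minimal sketch of the strict-regex approach. The class name StrictPhoneRegex and the pattern itself (a single "(ddd) ddd-dddd" layout) are assumptions chosen for illustration; they are not part of the method developed in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical baseline: one strict pattern for numbers written as "(404) 878-2276".
class StrictPhoneRegex {

	// Assumed layout: "(ddd) ddd-dddd". Any other layout is missed.
	private static final Pattern STRICT = Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}");

	// Returns every substring of the text that matches the strict layout.
	static List<String> extract(String text) {
		List<String> found = new ArrayList<String>();
		Matcher m = STRICT.matcher(text);
		while (m.find()) {
			found.add(m.group());
		}
		return found;
	}
}
```

A number written as "404.878.2276" or "404 878 2276" does not match this pattern, so each additional layout needs its own pattern, which is exactly the maintenance burden described above.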

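Here we implement a method for recognizing phone numbers on web pages that does not depend on a precise regular expression. Instead, it is designed around the most abstract features of a phone number.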

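A phone number is always a sequence that contains digits, and the digits may be separated by special or common characters such as commas, hyphens, spaces, or letters. When analyzing the text of a page, we therefore relax the definition of a digit string: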

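If two digit characters are adjacent, they belong to the same sequence. If the number of non-digit characters between two digit characters stays within a given threshold, they also belong to the same sequence. In essence, digit strings that lie close together are merged into one sequence, so analyzing the text of a page yields a set of digit sequences.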
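The relaxed definition can be sketched directly on raw characters. This is a simplified illustration, not the token-based implementation shown later in the article: the class name DigitRunMerger and the maxGap parameter are assumptions, and every non-digit character (including letters) counts toward the gap.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the relaxed definition: digit runs separated by at most
// maxGap non-digit characters are merged into one sequence.
class DigitRunMerger {

	static List<String> merge(String text, int maxGap) {
		List<String> sequences = new ArrayList<String>();
		StringBuilder current = new StringBuilder();
		int gap = 0; // non-digit characters seen since the last digit
		for (int i = 0; i < text.length(); i++) {
			char c = text.charAt(i);
			if (c >= '0' && c <= '9') {
				// a gap wider than the threshold closes the current sequence
				if (current.length() > 0 && gap > maxGap) {
					sequences.add(current.toString());
					current.setLength(0);
				}
				current.append(c);
				gap = 0;
			} else {
				gap++;
			}
		}
		if (current.length() > 0) {
			sequences.add(current.toString());
		}
		return sequences;
	}
}
```

With a threshold of 5, the text "ga 30303 phone: (404) 878-2276" yields the two sequences 30303 and 4048782276, because nine non-digit characters separate the postal code from the area code while the parts of the phone number are at most two characters apart.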

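However, this also picks up short numbers such as dates, ages, and ordinal numbers, so a filtering step is the natural next move. Here we use a recommendation model: compute a correlation score for each digit sequence, sort by that score, and then pick the phone numbers from the top-ranked sequences.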

Below, we implement this idea in Java and look at the results.

Define a sequence recommendation interface, SequenceRecommendation. Its recommend method holds the concrete logic, which you can design to suit your own needs.

package org.shirdrn.webmining.recommend;

public interface SequenceRecommendation {
	public void recommend() throws Exception;
}

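Next, we implement an algorithm that extracts digit sequences and computes their correlation, so that they can be ranked and recommended. The basic approach is as follows: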

1. Clean the raw web page: strip HTML tags and similar markup to obtain the final text content.

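2. Tokenize the text: the design calls for Lucene's SimpleAnalyzer (no stop-word filtering). The reason is that words with domain-specific meaning often appear before or after a digit sequence (for example phone or telephone around a phone number, or email or email us around an email address), and some of them may be stop words for analyzers such as StandardAnalyzer; we do not want them filtered out. We also record the position of every word. Note that the listing below instantiates EnglishAnalyzer instead, and the sample output was produced with it: tokens appear stemmed (signatur, cabl) and stop words such as "for" are dropped.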

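3. Group the digit sequences while recording a fixed number of words before and after each one (the core step): this requires careful processing of the text and careful design of the data structures, so that the result set is convenient for computing correlation.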

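4. Build a domain model (feature word vectors) from the results computed on a sample set, and use it to compute the correlation of each digit sequence: I collected a number of English web pages, analyzed them, and distilled a set of feature words. For simplicity, term frequency is used directly as the weight (note: term frequency is simple and reasonable here, but other weighting methods could be used, or contributions from other attributes added). We use two feature word vectors, shown below: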

Forward feature word vector (file forwards_feature_vector):

email                                     9124
e                                         3368
mail                                      4767
e-mail                                    2183
email us at                               178
fax                                       147
email address                             146
email us                                  121
fx                                        115
or                                        113
email us                                  102
email or                                  95
email us at                               76
or e-mail                                 67

Backward feature word vector (file backwards_feature_vector):

phone                          27407
call                           13697
free                           13092
toll						   10092
toll free                      9012
tel                            8710
call                           5247
telephone                      4052
call us                        3108
ph                             3067
t                              2838
p                              2830
contact us                     2150
or call                        1889
local                          1477
f                              1437
or                             1362
abn                            1257
call us at                     1194
office                         1183
call us today                  1152
customer service               1101
call toll free                 1080

Our feature word vectors are loaded from files and used in the test run later on.

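5. Sort by correlation and recommend: after sorting, the result is clear at a glance, with the sequences most likely to be phone numbers at the top.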
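Steps 4 and 5 can be sketched in isolation as follows. The class FeatureScoring and its method names are hypothetical helpers, not part of the article's classes; the weights 1.75 and 1.05 are the ones used by the implementation in this article. One assumption differs from that implementation: here the weight is taken from the last column of each line, so multi-word features such as "call us at" are kept rather than skipped.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: parses feature vector lines and scores one digit sequence.
class FeatureScoring {

	// Parses lines of the form "<feature words> <weight>"; the weight is the last column.
	static Map<String, Double> parse(List<String> lines) {
		Map<String, Double> vector = new HashMap<String, Double>();
		for (String line : lines) {
			String[] tokens = line.trim().split("\\s+");
			if (tokens.length < 2) {
				continue; // blank line or a line without a weight
			}
			double weight;
			try {
				weight = Double.parseDouble(tokens[tokens.length - 1]);
			} catch (NumberFormatException e) {
				continue; // last column is not a number: skip the line
			}
			StringBuilder feature = new StringBuilder(tokens[0]);
			for (int i = 1; i < tokens.length - 1; i++) {
				feature.append(' ').append(tokens[i]);
			}
			// if a feature appears twice, the later line overwrites the earlier one
			vector.put(feature.toString(), weight);
		}
		return vector;
	}

	// Sums the weights of the context words that appear in the feature vector.
	static double sum(List<String> words, Map<String, Double> vector) {
		double total = 0;
		for (String w : words) {
			Double weight = vector.get(w);
			if (weight != null) {
				total += weight;
			}
		}
		return total;
	}

	// Weighted sum: words before the sequence use the backward vector,
	// words after it use the forward vector.
	static double score(List<String> before, List<String> after,
			Map<String, Double> backwards, Map<String, Double> forwards) {
		return 1.75 * sum(before, backwards) + 1.05 * sum(after, forwards);
	}
}
```

As a worked example, a sequence preceded by phone, ga, atlanta, center and followed by fax, email scores 1.75 × 27407 + 1.05 × (147 + 9124) = 47962.25 + 9734.55 = 57696.8 with the weights listed above. Ranking is then a descending sort on this score.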

The design and implementation of the whole idea is the NumberSequenceRecommendation class, shown below:

package org.shirdrn.webmining.recommend;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class NumberSequenceRecommendation implements SequenceRecommendation {

	private byte[] content;
	private Charset charset;
	private String baseUri;
	
	/** Max number of non-digit characters allowed between two number words 
	 * for them to still be treated as one continuous number sequence. */
	private int maxGap = 5;
	/** Max word count after or before a number sequence */
	private int maxWordCount = 5;
	private Pattern numberPattern = Pattern.compile("^\\d+$");
	private String cleanedContent;
	
	/** All words analyzed by Lucene analyzer from specified page text. */
	private LinkedList<Word> wordList = new LinkedList<Word>();
	private LinkedList<NumberSequence> numberSequenceList = new LinkedList<NumberSequence>();
	/** Final result sorted by correlation */
	List<NumberSequence> sortedNumberSequenceSet = new ArrayList<NumberSequence>(1);
	private Map<String, Double> backwardsFeatureVector = new HashMap<String, Double>();
	private Map<String, Double> forwardsFeatureVector = new HashMap<String, Double>();
	
	private double backwardsWeight = 1.75;
	private double forwardsWeight = 1.05;
	
	public NumberSequenceRecommendation() {
		this(new byte[]{}, Charset.defaultCharset(), null, null, null);
	}
	
	public NumberSequenceRecommendation(byte[] content, Charset charset, String baseUri, 
			String backwordsFeatureVectorFile, String forwardsFeatureVectorFile) {
		super();
		this.baseUri = baseUri;
		this.content = content;
		this.charset = charset;
		loadFeatureVectors(backwordsFeatureVectorFile, forwardsFeatureVectorFile);
	}
	
	private void loadFeatureVectors(String backwordsFeatureVectorFile, String forwardsFeatureVectorFile) {
		load(backwordsFeatureVectorFile, backwardsFeatureVector);
		load(forwardsFeatureVectorFile, forwardsFeatureVector);
	}

	private void load(String featureVectorFile, Map<String, Double> featureVector) {
		// the no-arg constructor passes null file names; there is nothing to load then
		if(featureVectorFile==null) {
			return;
		}
		FileInputStream fis = null;
		BufferedReader reader = null;
		try {
			fis = new FileInputStream(featureVectorFile);
			reader = new BufferedReader(new InputStreamReader(fis, charset));
			String line = null;
			while((line = reader.readLine())!=null) {
				if(!line.isEmpty()) {
					// expected line format: <word> <weight>
					String pair[] = line.trim().split("\\s+");
					try {
						featureVector.put(pair[0].trim(), Double.parseDouble(pair[1].trim()));
					} catch (Exception e) {
						// lines whose second token is not a number (for example
						// multi-word features such as "toll free") are skipped
					}
				}
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			try {
				if(reader!=null) {
					reader.close();
				}
				if(fis!=null) {
					fis.close();
				}
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}

	@Override
	public void recommend() throws Exception {
		recommend(content, charset, baseUri);
	}

	private List<NumberSequence> recommend(byte[] content, Charset charset, String baseUri) {
		String html = new String(content, charset);
		Document doc = Parser.parse(html, baseUri);
		StringBuffer buf = new StringBuffer();
		parseHtmlText(doc.body(), buf);
		cleanedContent = buf.toString().trim();
		collectWords(cleanedContent);
		analyzeNumberWords();
		return sortByCorrelation();
	}
	
	/**
	 * Compute correlation, and sort result, for recommending.
	 * @return
	 */
	private List<NumberSequence> sortByCorrelation() {
		// sort numberSequenceList
		for(NumberSequence ns : numberSequenceList) {
			// backwards
			double backwardsCorrelation = 0;
			for(Word w : ns.backwardsWords) {
				if(backwardsFeatureVector.containsKey(w.text)) {
					backwardsCorrelation += backwardsFeatureVector.get(w.text);
				}
			}
			// forwards
			double forwardsCorrelation = 0;
			for(Word w : ns.forwardsWords) {
				if(forwardsFeatureVector.containsKey(w.text)) {
					forwardsCorrelation += forwardsFeatureVector.get(w.text);
				}
			}
			ns.correlation = backwardsWeight * backwardsCorrelation + forwardsWeight * forwardsCorrelation;
			sortedNumberSequenceSet.add(ns);
		}
		
		// sort by correlation
		Collections.sort(sortedNumberSequenceSet, new Comparator<NumberSequence>() {

			@Override
			public int compare(NumberSequence o1, NumberSequence o2) {
				if(o1.correlation<o2.correlation) {
					return 1;
				} else if(o1.correlation>o2.correlation) {
					return -1;
				}
				return 0;
			}
			
		});
		return sortedNumberSequenceSet;
	}

	/**
	 * Extract text data from a HTML page.
	 * @param node
	 * @param buf
	 */
	private void parseHtmlText(Node node, StringBuffer buf) {
		List<Node> children = node.childNodes();
		if(children.isEmpty() && node instanceof TextNode) {
			String text = node.toString().trim();
			for(String ch : ESCAPE_SEQUENCE) {
				// literal replacement; the entries are plain strings, not regular expressions
				text = text.replace(ch, "");
			}
			if(!text.isEmpty()) {
				buf.append(text.toLowerCase().trim()).append("\n");
			}
		} else {
			for(Node child : children) {
				parseHtmlText(child, buf);
			}
		}
	}
	
	/**
	 * Analyze text, extract terms by Lucene analyzer.
	 * @param content
	 */
	private void collectWords(String content) {
		StringReader reader = new StringReader(content);
		Analyzer a = new EnglishAnalyzer(Version.LUCENE_36);
		TokenStream ts = a.tokenStream("", reader);
		TermAttribute ta = ts.addAttribute(TermAttribute.class);
		OffsetAttribute oa = ts.addAttribute(OffsetAttribute.class);
		Pos pos = new Pos();
		try {
			while(ts.incrementToken()) {
				Pos nextPos = new Pos(oa.startOffset(), oa.endOffset());
				nextPos.gap = nextPos.startOffset - pos.endOffset;
				Word word = new Word(ta.term(), nextPos);
				wordList.addLast(word);
				pos = nextPos;
				// is number?
				Matcher m = numberPattern.matcher(word.text);
				if(m.find()) {
					word.isNumber = true;
				}
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	/**
	 * Compute number words relations.
	 */
	private void analyzeNumberWords() {
		for(int i=0; i<wordList.size(); i++) {
			Word w = wordList.get(i);
			if(w.isNumber) {
				NumberSequence ns = new NumberSequence();
				ns.numberWords.add(w);
				// compute backwards words
				for(int j=Math.max(0, i-1); j>=Math.max(i-maxWordCount, 0); j--) {
					if(!wordList.get(j).isNumber) {
						ns.backwardsWords.add(wordList.get(j));
					}
				}
				// recognize the nearest number words that belong to the same sequence
				int k = i+1;
				for(; k<wordList.size(); k++) {
					Word next = wordList.get(k);
					// stop at the first non-number word, or when the gap exceeds maxGap
					if(!next.isNumber || next.pos.gap>maxGap) {
						break;
					}
					ns.numberWords.add(next);
					ns.pos.gap += next.pos.gap;
				}
				// resume the outer scan after the last number word of this sequence
				i = k-1;
				// compute forwards words; the upper bound stays inside the list
				for(int p=i+1; p<=Math.min(wordList.size()-1, i+maxWordCount); p++) {
					if(!wordList.get(p).isNumber) {
						ns.forwardsWords.add(wordList.get(p));
					}
				}
				numberSequenceList.add(ns);
			}
		}
	}
	
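	/** HTML entity references that are stripped from the extracted text. */
	private static String[] ESCAPE_SEQUENCE = new String[] {
		"&quot;", "&amp;", "&mdash;", "&ndash;", "&permil;",
		"&nbsp;", "&ensp;", "&emsp;", "&thinsp;", "&zwnj;", "&zwj;",
		"&sbquo;", "&tilde;", "&circ;", "&lrm;", "&rlm;",
		"&times;", "&divide;", "&ldquo;", "&rdquo;", "&bdquo;", 
		"&lt;", "&gt;", "&lsaquo;", "&rsaquo;", "&lsquo;", "&rsquo;",
		"&iexcl;", "&cent;", "&pound;", "&curren;", "&yen;", "&brvbar;", 
		"&sect;", "&uml;", "&copy;", "&ordf;", "&laquo;", "&not;",
		"&shy;", "&reg;", "&macr;", "&deg;", "&plusmn;", "&sup2;",
		"&sup3;", "&acute;", "&micro;", "&para;", "&middot;", "&cedil;",
		"&sup1;", "&ordm;", "&raquo;", "&frac14;", "&frac12;", "&frac34;",
		"&iquest;", "&Agrave;", "&Aacute;", "&Acirc;", "&Atilde;", "&Auml;",
		"&Aring;", "&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;", "&Ecirc;",
		"&Euml;", "&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;",
		"&Ntilde;", "&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;",
		"&times;", "&Oslash;", "&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;",
		"&Yacute;", "&THORN;", "&szlig;", "&agrave;", "&aacute;", "&acirc;",
		"&atilde;", "&auml;", "&aring;", "&aelig;", "&ccedil;", "&egrave;",
		"&eacute;", "&ecirc;", "&euml;", "&igrave;", "&iacute;", "&icirc;",
		"&iuml;", "&eth;", "&ntilde;", "&ograve;", "&oacute;", "&ocirc;",
		"&otilde;", "&ouml;", "&divide;", "&oslash;", "&ugrave;", "&uacute;",
		"&ucirc;", "&uuml;", "&yacute;", "&yuml;"
	}; 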
	
	/**
	 * Number sequence that holds:
	 * <pre>
	 * a number {@link Word} list which we analyzed from text of a page
	 * a correlation index
	 * a forwards {@link Word} list 
	 * a backwards {@link Word} list
	 * a {@link Pos} which specifies this number sequence's position information
	 * </pre>
	 * @author shirdrn
	 */
	public static class NumberSequence {
		
		/** This sequence's position metadata */
		Pos pos = new Pos();
		/** Number word collection */
		List<Word> numberWords = new LinkedList<Word>();
		/** Words after the sequence (forwards) and before it (backwards) */
		List<Word> forwardsWords = new LinkedList<Word>();
		List<Word> backwardsWords = new LinkedList<Word>();
		double correlation;
		
		@Override
		public String toString() {
			return "[" +
				"correlation=" + correlation + ", " +
				"numberWords=" +numberWords + ", " +
				"forwardsWords=" + forwardsWords + ", " +
				"backwardsWords=" + backwardsWords + ", " + "]";
		}

	}
	
	/**
	 * Word unit analyzed by Lucene's {@link Analyzer}. Here
	 * a {@link Word} is the smallest unit and is never split further. 
	 * @author shirdrn
	 */
	static class Word {
		
		/** Word text */
		String text;
		/** Is this word a number? */
		boolean isNumber;
		/** Word's position metadata */
		Pos pos;
		
		public Word(String text, Pos pos) {
			super();
			this.text = text;
			this.pos = pos;
		}
		
		@Override
		public String toString() {
			return "[" +text + pos + "]";
		}
	}
	
	/**
	 * Position information
	 * @author shirdrn
	 */
	static class Pos {
		
		/** Start offset of a word */
		int startOffset;
		/** End offset of a word */
		int endOffset;
		/** Distance in characters from the end of the previous word */
		int gap;
		
		public Pos() {
			super();
		}
		
		public Pos(int startOffset, int endOffset) {
			super();
			this.startOffset = startOffset;
			this.endOffset = endOffset;
		}
		
		@Override
		public String toString() {
			return "<" + startOffset + ", " + endOffset + ", " + gap + ">";
		}
	}

	public List<NumberSequence> getSortedNumberSequenceSet() {
		return sortedNumberSequenceSet;
	}

	public String getCleanedContent() {
		return cleanedContent;
	}
}
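
The gap that collectWords stores in each Pos is simply the start offset of a token minus the end offset of the previous token. The following Lucene-free sketch reproduces that bookkeeping with java.util.regex so it can be tried on its own. OffsetTokenizer and Token are hypothetical names, and the tokenizer is an assumption: it splits on anything other than letters and digits and does no stemming or stop-word removal, so its tokens will differ from Lucene's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the analyzer step: tokens with offsets, gap, and a number flag.
class OffsetTokenizer {

	static class Token {
		final String text;
		final int start;
		final int end;
		final int gap; // characters between the previous token's end and this start
		final boolean isNumber;

		Token(String text, int start, int end, int gap, boolean isNumber) {
			this.text = text;
			this.start = start;
			this.end = end;
			this.gap = gap;
			this.isNumber = isNumber;
		}
	}

	// Assumed token definition: a run of lowercase ASCII letters or digits.
	private static final Pattern WORD = Pattern.compile("[a-z0-9]+");

	static List<Token> tokenize(String text) {
		List<Token> tokens = new ArrayList<Token>();
		Matcher m = WORD.matcher(text.toLowerCase());
		int previousEnd = 0;
		while (m.find()) {
			String t = m.group();
			tokens.add(new Token(t, m.start(), m.end(), m.start() - previousEnd, t.matches("\\d+")));
			previousEnd = m.end();
		}
		return tokens;
	}
}
```

For the text "Phone: (404) 878-2276" this gives the tokens phone, 404, 878, 2276 with gaps 0, 3, 2, 1, so with maxGap set to 5 the three number tokens would be grouped into one sequence.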

The output, including the text content of the page after cleaning the raw HTML, is as follows:

click here to go to our u.s. or arabic versions
close
cnn
edition: international
u.s.
mxico
arabic
tv
:
cnn
cnni
cnn en espaol
hln
sign up
log in
home
video
world
u.s.
africa
asia
europe
latin america
middle east
business
world sport
entertainment
tech
travel
ireport
about cnn.com/international
cnn.com/international:
the international edition of
cnn.com
is constantly updated to bring you the top news stories from around the world. it is produced by dedicated staff in london and hong kong, working with colleagues at cnn's world headquarters in atlanta, georgia, and with bureaus worldwide. cnn.com relies heavily on cnn's global team of over 4,000 news professionals.
cnn.com/international
features the latest multimedia technologies, from live video streaming to audio packages to searchable archives of news features and background information. the site is updated continuously throughout the day.
contact us:
help us make your comments count. use our
viewer comment page
to tell us what you think about our shows, our anchors, and our hot topics for the day.
help page:
visit our
extensive faqs
for answers to all of your questions, from cnn tv programming to rss to the cnn member center.
cnn:
back to top
what's on:
click here for the full rundown of all
cnn daily programming
.
who's on:
click here for full bios on all of
cnn's anchors, correspondents and executives
.
press office:
click here for information from
cnn international press offices
.
cnn's parent company:
time warner inc.
services:
back to top
your e-mail alerts:
your e-mail alerts, is a free, personalized news alerting service created for you.
with cnn's service you can:
sign up for your e-mail alerts and follow the news that matters to you.
select key words and topics across the wide range of news and information on the site.
create your own alerts.
customize your delivery options to fit your schedule and be alerted as a story is published on cnn.com. receive your alerts daily or weekly.
easily manage your alerts. edit, delete, suspend or re-activate them at any time.
register
to be a member and begin customizing your e-mail alerts today!
cnn.com preferences:
personalize your cnn.com page experience
today and receive breaking news in your e-mail inbox and on your cell phone, get your hometown weather on the home page and set your news edition to your world region.
cnn mobile:
cnn.com/international content is now available through your mobile phone. with
cnn mobile
, you can read up-to-the-minute news stories with color photos, watch live, streaming video or the latest video on demand clips and receive cnn breaking news text alerts. no matter where your on-the-go lifestyle takes you, cnn brings the news directly to you.
e-mail newsletters:
be the first to know with a variety of e-mail news services. receiving breaking news alerts, delivered straight to your e-mail address. follow the latest news on politics, technology, health or the topics that interest you most. or stay informed on what's coming up on your favorite cnn tv programs.
cnn offers e-mail updates as numerous and diverse as your tastes.
register now
and select from the various e-mails.
advertise on cnn.com:
advertise with us!
get information about advertising on the cnn web sites.
business development:
companies interested in partnering with cnn should contact cnn business development by sending an e-mail to
[email protected]
.
job search:
visit our web sites for information about internships or job opportunities with cnn international in
europe, middle east, africa
and
other regions
legal terms and conditions:
back to top
cnn interactive service agreement:
view the terms of the
cnn interactive services agreement
.
cnn comment policy:
cnn encourages you to add comment to our discussions. you may not post any unlawful, threatening, libelous, defamatory, obscene, pornographic or other material that would violate the law. please note that cnn makes reasonable efforts to review all comments prior to posting and cnn may edit comments for clarity or to keep out questionable or off-topic material. all comments should be relevant to the post and remain respectful of other authors and commenters. by submitting your comment, you hereby give cnn the right, but not the obligation, to post, air, edit, exhibit, telecast, cablecast, webcast, re-use, publish, reproduce, use, license, print, distribute or otherwise use your comment(s) and accompanying personal identifying information via all forms of media now known or hereafter devised, worldwide, in perpetuity.
cnn privacy statement
.
privacy statement:
to better protect your privacy, we provide this notice explaining our
online information practices
and the choices you can make about the way your information is collected and used
cnn's reprint and copyright information:
copyrights and copyright agent. cnn respects the rights of all copyright holders and in this regard, cnn has adopted and implemented a policy that provides for the termination in appropriate circumstances of subscribers and account holders who infringe the rights of copyright holders. if you believe that your work has been copied in a way that constitutes copyright infringement, please provide cnn's copyright agent the following information required by the online copyright infringement liability limitation act of the digital millennium copyright act, 17 u.s.c.  512:
a physical or electronic signature of a person authorized to act on behalf of the owner of an exclusive right that is allegedly infringed.
identification of the copyright work claimed to have been infringed, or, if multiple copyrighted works at a single online site are covered by a single notification, a representative list of such works at that site.
identification of the material that is claimed to be infringing or to be the subject of infringing activity and that is to be removed or access to which is to be disabled, and information reasonably sufficient to permit us to locate the material.
information reasonably sufficient to permit us to contact the complaining party.
a statement that the complaining party has a good-faith belief that use of the material in the manner complained of is not authorized by the copyright owner, its agent, or the law.
a statement that the information in the notification is accurate, and under penalty of perjury, that the complaining party is authorized to act on behalf of the owner of an exclusive right that is allegedly infringed.
cnn's copyright agent for notice of claims of copyright infringement on or regarding this site can be reached by sending an email to
[email protected]
or writing to-
copyright agent
one cnn center
atlanta, ga 30303
phone: (404) 878-2276
fax: (404) 827-1995
email:
[email protected]
for any questions or requests other than copyright issues, please view our
extensive faqs
.
weather forecast
home
|
video
|
world
|
u.s.
|
africa
|
asia
|
europe
|
latin america
|
middle east
|
business
|
world sport
|
entertainment
|
tech
|
travel
|
ireport
tools  widgets
|
podcasts
|
blogs
|
cnn mobile
|
my profile
|
e-mail alerts
|
cnn radio
|
cnn shop
|
site map
|
cnn partner hotels
cnn en espaol
|
cnn chile
|
cnn expansion
|
|
|
|
cnn tv
|
hln
|
transcripts
2010 cable news network.
turner broadcasting system, inc.
all rights reserved.
terms of service
|
privacy guidelines
|
advertising practices
|
advertise with us
|
about us
|
contact us
|
help
Finally, the computation simply returns the ranked results, so the effect of the ranking can be observed directly, as shown below:

[correlation=57696.8, numberWords=[[404<6705, 6708, 3>], [878<6710, 6713, 2>], [2276<6714, 6718, 1>]], forwardsWords=[[fax<6719, 6722, 1>], [email<6739, 6744, 1>]], backwardsWords=[[phone<6697, 6702, 1>], [ga<6688, 6690, 2>], [atlanta<6679, 6686, 1>], [center<6672, 6678, 1>]], ]
[correlation=57542.45, numberWords=[[404<6725, 6728, 3>], [827<6730, 6733, 2>], [1995<6734, 6738, 1>]], forwardsWords=[[email<6739, 6744, 1>], [copyrightag<6746, 6760, 2>], [turner.com<6761, 6771, 1>], [ani<6776, 6779, 5>], [question<6780, 6789, 1>]], backwardsWords=[[fax<6719, 6722, 1>], [phone<6697, 6702, 1>]], ]
[correlation=154.35, numberWords=[[30303<6691, 6696, 1>]], forwardsWords=[[phone<6697, 6702, 1>], [fax<6719, 6722, 1>]], backwardsWords=[[ga<6688, 6690, 2>], [atlanta<6679, 6686, 1>], [center<6672, 6678, 1>], [cnn<6668, 6671, 1>], [on<6664, 6667, 1>]], ]
[correlation=0.0, numberWords=[[17<5371, 5373, 2>]], forwardsWords=[[u.s.c<5374, 5379, 1>], [physic<5390, 5398, 5>], [electron<5402, 5412, 4>], [signatur<5413, 5422, 1>]], backwardsWords=[[act<5366, 5369, 1>], [copyright<5356, 5365, 1>], [millennium<5345, 5355, 1>], [digit<5337, 5344, 8>], [act<5326, 5329, 1>]], ]
[correlation=0.0, numberWords=[[512<5382, 5385, 3>]], forwardsWords=[[physic<5390, 5398, 5>], [electron<5402, 5412, 4>], [signatur<5413, 5422, 1>], [person<5428, 5434, 6>], [author<5435, 5445, 1>]], backwardsWords=[[u.s.c<5374, 5379, 1>], [act<5366, 5369, 1>], [copyright<5356, 5365, 1>], [millennium<5345, 5355, 1>]], ]
[correlation=0.0, numberWords=[[2010<7239, 7243, 1>]], forwardsWords=[[cabl<7244, 7249, 1>], [new<7250, 7254, 1>], [network<7255, 7262, 1>], [turner<7264, 7270, 2>], [broadcast<7271, 7283, 1>]], backwardsWords=[[transcript<7227, 7238, 3>], [hln<7221, 7224, 3>], [tv<7216, 7218, 1>], [cnn<7212, 7215, 9>], [expans<7194, 7203, 1>]], ]

Let us walk through the fields:

numberWords is the final set of digit strings (all numeric);

forwardsWords is the set of forward words, that is, the words that follow the digit sequence represented by numberWords;

backwardsWords is the set of backward words, that is, the words that precede the digit sequence represented by numberWords.

Formatted for readability, the results above look like this:

[correlation=57696.8, numberWords=[404-878-2276], forwardsWords=[fax, email], backwardsWords=[phone, ga, atlanta, center]]
[correlation=57542.45, numberWords=[404-827-1995], forwardsWords=[email, copyrightag, turner.com, ani, question], backwardsWords=[fax, phone]]
[correlation=154.35, numberWords=[30303], forwardsWords=[phone, fax], backwardsWords=[ga, atlanta, center, cnn, on]]
[correlation=0.0, numberWords=[17], forwardsWords=[u.s.c, physic, electron, signatur], backwardsWords=[act, copyright, millennium, digit, act]]
[correlation=0.0, numberWords=[512], forwardsWords=[physic, electron, signatur, person, author], backwardsWords=[u.s.c, act, copyright, millennium]]
[correlation=0.0, numberWords=[2010], forwardsWords=[cabl, new, network, turner, broadcast], backwardsWords=[transcript, hln, tv, cnn, expans]]
Compared against the page text above, the first entry has the highest score and is indeed the phone number, and the second is the fax number.

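Finally, to make the recognized phone numbers more precise, the candidates can be further filtered and validated in various ways, which improves the precision of the result to some extent.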
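
One possible filter, given here only as an illustration, is to discard candidates whose total digit count is implausible for a phone number. The class CandidateFilter and the lower bound of 7 digits are assumptions; the upper bound of 15 is the maximum length of an international number under ITU-T E.164.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-filter: keep candidates whose digit count is plausible.
class CandidateFilter {

	// Assumed lower bound (a short local number) and the E.164 maximum.
	static final int MIN_DIGITS = 7;
	static final int MAX_DIGITS = 15;

	// A candidate is the concatenated digits of one number sequence.
	static boolean plausible(String digits) {
		int n = digits.length();
		return n >= MIN_DIGITS && n <= MAX_DIGITS;
	}

	static List<String> filter(List<String> candidates) {
		List<String> kept = new ArrayList<String>();
		for (String c : candidates) {
			if (plausible(c)) {
				kept.add(c);
			}
		}
		return kept;
	}
}
```

Applied to the six sequences ranked above, this keeps 4048782276 and 4048271995 and drops 30303, 17, 512, and 2010. It cannot tell a phone number from a fax number, so it complements the correlation ranking rather than replacing it.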

