How htmlparser processes HTML pages

There are three main approaches.

Accessing HTML with a Visitor

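import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

try {
    Parser parser = new Parser();
    parser.setURL("http://www.google.com");
    // Re-apply the encoding the parser detected for the page.
    parser.setEncoding(parser.getEncoding());
    NodeVisitor visitor = new NodeVisitor() {
        // Called for each tag encountered during the traversal.
        public void visitTag(Tag tag) {
            System.out.println("testVisitorAll() Tag name is :"
                    + tag.getTagName() + " \n Class is :"
                    + tag.getClass());
        }
    };
    parser.visitAllNodesWith(visitor);
} catch (ParserException e) {
    e.printStackTrace();
}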
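To make the traversal idea concrete, here is a minimal stand-alone sketch of a visitor walking a small tag tree. It does not use htmlparser's own classes: `SimpleTag`, `SimpleVisitor`, and `visitAll` are hypothetical names invented for illustration, and the depth-first, document-order walk is an assumption about how such a traversal is typically done.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of visitor-style traversal; these are NOT htmlparser classes.
class VisitorSketch {
    // Hypothetical minimal node: a tag name plus child tags.
    static class SimpleTag {
        final String name;
        final List<SimpleTag> children = new ArrayList<>();

        SimpleTag(String name, SimpleTag... kids) {
            this.name = name;
            for (SimpleTag kid : kids) {
                children.add(kid);
            }
        }
    }

    // Hypothetical callback, playing the role visitTag plays in the text.
    interface SimpleVisitor {
        void visitTag(SimpleTag tag);
    }

    // Depth-first walk: visit the node itself, then each child in order.
    static void visitAll(SimpleTag root, SimpleVisitor visitor) {
        visitor.visitTag(root);
        for (SimpleTag child : root.children) {
            visitAll(child, visitor);
        }
    }

    // Collects tag names in the order the visitor sees them.
    static List<String> collectNames(SimpleTag root) {
        List<String> names = new ArrayList<>();
        visitAll(root, tag -> names.add(tag.name));
        return names;
    }

    public static void main(String[] args) {
        SimpleTag page = new SimpleTag("HTML",
                new SimpleTag("HEAD", new SimpleTag("TITLE")),
                new SimpleTag("BODY", new SimpleTag("A"), new SimpleTag("IMG")));
        System.out.println(collectNames(page)); // [HTML, HEAD, TITLE, BODY, A, IMG]
    }
}
```

The visitor sees every node, so any selection logic lives inside the callback; that is the main difference from the filter approach, where the selection rule is handed to the traversal.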

Accessing HTML with a Filter

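import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

try {
    // Accept only nodes that are LinkTag instances.
    NodeFilter filter = new NodeClassFilter(LinkTag.class);
    Parser parser = new Parser();
    parser.setURL("http://www.google.com");
    parser.setEncoding(parser.getEncoding());
    NodeList list = parser.extractAllNodesThatMatch(filter);
    for (int i = 0; i < list.size(); i++) {
        LinkTag node = (LinkTag) list.elementAt(i);
        System.out.println("testLinkTag() Link is :" + node.extractLink());
    }
} catch (Exception e) {
    e.printStackTrace();
}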
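The filter approach can be pictured as "walk the tree and keep what the predicate accepts". The sketch below is a stand-alone model of that idea and does not call htmlparser: `SimpleTag`, `collect`, and `linkTargets` are hypothetical names, and `java.util.function.Predicate` stands in for a node filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Stand-alone model of filter-based extraction; these are NOT htmlparser classes.
class FilterSketch {
    // Hypothetical tag: a name, an optional href (null if absent), and children.
    static class SimpleTag {
        final String name;
        final String href;
        final List<SimpleTag> children = new ArrayList<>();

        SimpleTag(String name, String href, SimpleTag... kids) {
            this.name = name;
            this.href = href;
            for (SimpleTag kid : kids) {
                children.add(kid);
            }
        }
    }

    // Recursively gathers every node the filter accepts, in document order.
    static void collect(SimpleTag node, Predicate<SimpleTag> filter, List<SimpleTag> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (SimpleTag child : node.children) {
            collect(child, filter, out);
        }
    }

    // Returns the href of every A tag that has one.
    static List<String> linkTargets(SimpleTag root) {
        List<SimpleTag> matches = new ArrayList<>();
        collect(root, tag -> tag.name.equals("A") && tag.href != null, matches);
        List<String> links = new ArrayList<>();
        for (SimpleTag tag : matches) {
            links.add(tag.href);
        }
        return links;
    }

    public static void main(String[] args) {
        SimpleTag page = new SimpleTag("BODY", null,
                new SimpleTag("A", "/a"),
                new SimpleTag("DIV", null, new SimpleTag("A", "/b")),
                new SimpleTag("A", null));
        System.out.println(linkTargets(page)); // [/a, /b]
    }
}
```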

Using org.htmlparser.beans

htmlparser also wraps some commonly used operations in org.htmlparser.beans to simplify them, for example:

import java.net.URL;
import org.htmlparser.beans.LinkBean;

// LinkBean fetches the page itself, so no separate Parser is needed here.
LinkBean linkBean = new LinkBean();
linkBean.setURL("http://www.google.com");
URL[] urls = linkBean.getLinks();
for (int i = 0; i < urls.length; i++) {
    URL url = urls[i];
    System.out.println("testLinkBean() -url is :" + url);
}

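Key packages in htmlparser

The core of htmlparser is not large, and reading the source carefully makes up for the thin documentation. The code comments and unit tests are also fairly complete and help in understanding how htmlparser is used.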

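3.1 org.htmlparser

Defines the basic classes of htmlparser. The most important of them is the Parser class.

Parser is the central class of htmlparser. It can be created through the static factory method Parser.createParser (String html, String charset) or through the constructors Parser (), Parser (Lexer lexer, ParserFeedback fb), Parser (URLConnection connection, ParserFeedback fb), Parser (String resource, ParserFeedback feedback), and Parser (String resource).

The usage and meaning of each constructor can be read from the source and is easy to follow.

Commonly used Parser methods:

• elements (): iterates over the nodes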

Parser parser = new Parser ("http://www.google.com");
// processMyNodes stands for the caller's own handling method.
for (NodeIterator i = parser.elements (); i.hasMoreNodes (); )
    processMyNodes (i.nextNode ());

• parse (NodeFilter filter): retrieves nodes through a NodeFilter

• visitAllNodesWith (NodeVisitor visitor): traverses nodes with a NodeVisitor

• extractAllNodesThatMatch (NodeFilter filter): extracts the nodes accepted by a NodeFilter

3.2 org.htmlparser.beans

Wraps the Visitor and Filter approaches and defines beans for common HTML elements, which simplifies extracting those elements.

Includes: FilterBean, HTMLLinkBean, HTMLTextBean, LinkBean, StringBean, BeanyBaby, and others.

3.3 org.htmlparser.nodes

Defines the basic nodes, including AbstractNode, RemarkNode, TagNode, and TextNode.

3.4 org.htmlparser.tags

Defines the various tags of htmlparser.

3.5 org.htmlparser.filters

Defines the filters that htmlparser provides. They are mainly passed to extractAllNodesThatMatch (NodeFilter filter) to select elements of a given kind from an HTML page. They include: AndFilter, CssSelectorNodeFilter, HasAttributeFilter, HasChildFilter, HasParentFilter, HasSiblingFilter, IsEqualFilter, LinkRegexFilter, LinkStringFilter, NodeClassFilter, NotFilter, OrFilter, RegexFilter, StringFilter, TagNameFilter, XorFilter.

3.6 org.htmlparser.visitors

Defines the visitors that htmlparser provides. They are mainly passed to visitAllNodesWith (NodeVisitor visitor) to traverse the elements of an HTML page. They include: HtmlPage, LinkFindingVisitor, NodeVisitor, ObjectFindingVisitor, StringFindingVisitor, TagFindingVisitor, TextExtractingVisitor, UrlModifyingVisitor.

3.7 org.htmlparser.parserapplications

Defines some practical tools, including LinkExtractor, SiteCapturer, StringExtractor, and WikiCapturer. These classes can also serve as htmlparser usage examples.

3.8 org.htmlparser.tests

Unit test cases for the various features, which can also serve as htmlparser usage examples.

1. Logical relations: AND, OR, NOT

AndFilter()
Creates a new instance of an AndFilter.

AndFilter(NodeFilter[] predicates)
Creates an AndFilter that accepts nodes acceptable to all given filters.

AndFilter(NodeFilter left, NodeFilter right)
Creates an AndFilter that accepts nodes acceptable to both filters.

OrFilter()
Creates a new instance of an OrFilter.

OrFilter(NodeFilter[] predicates)
Creates an OrFilter that accepts nodes acceptable to any of the given filters.

OrFilter(NodeFilter left, NodeFilter right)
Creates an OrFilter that accepts nodes acceptable to either filter.

NotFilter()
Creates a new instance of a NotFilter.

NotFilter(NodeFilter predicate)
Creates a NotFilter that accepts nodes not acceptable to the predicate.
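The AND / OR / NOT combinations above follow ordinary predicate logic. As a rough illustration, the sketch below models them with `java.util.function.Predicate` from the Java standard library rather than with htmlparser's filter classes; `SimpleTag`, `named`, `hasClass`, and `select` are hypothetical helpers introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Stand-alone model of AND / OR / NOT filter composition; NOT htmlparser classes.
class LogicFilterSketch {
    // Hypothetical tag: a name plus an optional class attribute (null if absent).
    static class SimpleTag {
        final String name;
        final String cssClass;

        SimpleTag(String name, String cssClass) {
            this.name = name;
            this.cssClass = cssClass;
        }
    }

    // Accepts tags with the given name, ignoring case.
    static Predicate<SimpleTag> named(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    // Accepts tags whose class attribute equals the given value.
    static Predicate<SimpleTag> hasClass(String value) {
        return tag -> value.equals(tag.cssClass);
    }

    // Returns the names of the tags the filter accepts, in input order.
    static List<String> select(List<SimpleTag> tags, Predicate<SimpleTag> filter) {
        List<String> out = new ArrayList<>();
        for (SimpleTag tag : tags) {
            if (filter.test(tag)) {
                out.add(tag.name);
            }
        }
        return out;
    }

    static List<SimpleTag> sampleTags() {
        return Arrays.asList(
                new SimpleTag("DIV", "menu"),
                new SimpleTag("DIV", null),
                new SimpleTag("A", "menu"),
                new SimpleTag("IMG", null));
    }

    public static void main(String[] args) {
        List<SimpleTag> tags = sampleTags();
        // AND: both conditions must hold.
        System.out.println(select(tags, named("DIV").and(hasClass("menu")))); // [DIV]
        // OR: either condition may hold.
        System.out.println(select(tags, named("A").or(named("IMG"))));        // [A, IMG]
        // NOT: inverts a single condition.
        System.out.println(select(tags, named("DIV").negate()));              // [A, IMG]
    }
}
```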

2. Content

StringFilter: simple and limited. For more complex matching, use RegexFilter (regular expressions).

StringFilter()
Creates a new instance of StringFilter that accepts all string nodes.

StringFilter(String pattern)
Creates a StringFilter that accepts text nodes containing a string.

StringFilter(String pattern, boolean sensitive)
Creates a StringFilter that accepts text nodes containing a string.

StringFilter(String pattern, boolean sensitive, Locale locale)
Creates a StringFilter that accepts text nodes containing a string.

RegexFilter()
Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.

RegexFilter(String pattern)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.

RegexFilter(String pattern, int strategy)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.
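The difference between substring matching and the regex "strategy" mentioned above can be shown with the Java standard library alone. This sketch does not call htmlparser. It assumes that a FIND-style strategy behaves like `Matcher.find()` and that the stricter alternatives behave like `Matcher.lookingAt()` and `Matcher.matches()`; the case-insensitive substring check by upper-casing with a locale is likewise an assumption, not a statement about StringFilter's internals.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Stand-alone comparison of substring and regex matching; NOT htmlparser classes.
class RegexStrategySketch {
    // Substring test, optionally case-insensitive for the given locale.
    static boolean containsText(String text, String pattern, boolean caseSensitive, Locale locale) {
        if (caseSensitive) {
            return text.contains(pattern);
        }
        return text.toUpperCase(locale).contains(pattern.toUpperCase(locale));
    }

    // FIND-like: the regex may match anywhere in the text.
    static boolean find(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // LOOKINGAT-like: the regex must match starting at the beginning of the text.
    static boolean lookingAt(String text, String regex) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    // MATCH-like: the regex must match the entire text.
    static boolean matches(String text, String regex) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "price: 42 USD";
        System.out.println(containsText("Hello World", "world", true, Locale.ENGLISH));  // false
        System.out.println(containsText("Hello World", "world", false, Locale.ENGLISH)); // true
        System.out.println(find(text, "\\d+"));      // true
        System.out.println(lookingAt(text, "\\d+")); // false
        System.out.println(matches(text, "\\d+"));   // false
    }
}
```

A pattern such as `\d+` succeeds under a find-anywhere strategy but fails when the whole text must match, which is why the choice of strategy matters more than the pattern itself for short expressions.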

3. Tags

TagNameFilter(): filters by tag name: div, img, ...

NodeClassFilter(): filters by tag class: LinkTag.class ...

HasAttributeFilter(): filters by attribute: HasAttributeFilter("class", "className")

LinkRegexFilter(): matches links with a regular expression

TagNameFilter()
Creates a new instance of TagNameFilter.

TagNameFilter(String name)
Creates a TagNameFilter that accepts tags with the given name.

NodeClassFilter()
Creates a NodeClassFilter that accepts Html tags.

NodeClassFilter(Class cls)
Creates a NodeClassFilter that accepts tags of the given class.

HasAttributeFilter()
Creates a new instance of HasAttributeFilter.

HasAttributeFilter(String attribute)
Creates a new instance of HasAttributeFilter that accepts tags with the given attribute.

HasAttributeFilter(String attribute, String value)
Creates a new instance of HasAttributeFilter that accepts tags with the given attribute and value.

LinkRegexFilter(String regexPattern)
Creates a LinkRegexFilter that accepts LinkTag nodes containing a URL that matches the supplied regex pattern.

LinkRegexFilter(String regexPattern, boolean caseSensitive)
Creates a LinkRegexFilter that accepts LinkTag nodes containing a URL that matches the supplied regex pattern.

LinkStringFilter(String pattern)
Creates a LinkStringFilter that accepts LinkTag nodes containing a URL that matches the supplied pattern.

LinkStringFilter(String pattern, boolean caseSensitive)
Creates a LinkStringFilter that accepts LinkTag nodes containing a URL that matches the supplied pattern.

4. Hierarchy

HasParentFilter()
Creates a new instance of HasParentFilter.

HasParentFilter(NodeFilter filter)
Creates a new instance of HasParentFilter that accepts nodes with the direct parent acceptable to the filter.

HasParentFilter(NodeFilter filter, boolean recursive)
Creates a new instance of HasParentFilter that accepts nodes with a parent acceptable to the filter.

HasChildFilter()
Creates a new instance of a HasChildFilter.

HasChildFilter(NodeFilter filter)
Creates a new instance of HasChildFilter that accepts nodes with a direct child acceptable to the filter.

HasChildFilter(NodeFilter filter, boolean recursive)
Creates a new instance of HasChildFilter that accepts nodes with a child acceptable to the filter.
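The `recursive` flag in the constructors above distinguishes a direct parent (or child) from any ancestor (or descendant). The following stand-alone sketch illustrates that distinction for the parent case. It is not htmlparser code: `SimpleTag` and `hasParent` are hypothetical, and the reading of `recursive` as "search all ancestors" is inferred from the constructor descriptions.

```java
import java.util.function.Predicate;

// Stand-alone model of direct vs. recursive parent checks; NOT htmlparser classes.
class HierarchySketch {
    // Hypothetical tag with a link to its parent (null for the root).
    static class SimpleTag {
        final String name;
        final SimpleTag parent;

        SimpleTag(String name, SimpleTag parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    // With recursive == false only the direct parent is tested;
    // with recursive == true every ancestor up to the root is tested.
    static boolean hasParent(SimpleTag node, Predicate<SimpleTag> filter, boolean recursive) {
        SimpleTag current = node.parent;
        while (current != null) {
            if (filter.test(current)) {
                return true;
            }
            if (!recursive) {
                return false;
            }
            current = current.parent;
        }
        return false;
    }

    public static void main(String[] args) {
        // DIV > P > A
        SimpleTag div = new SimpleTag("DIV", null);
        SimpleTag p = new SimpleTag("P", div);
        SimpleTag a = new SimpleTag("A", p);
        Predicate<SimpleTag> isDiv = tag -> tag.name.equals("DIV");

        System.out.println(hasParent(a, isDiv, false)); // false: direct parent is P
        System.out.println(hasParent(a, isDiv, true));  // true: DIV is an ancestor
        System.out.println(hasParent(div, isDiv, true)); // false: the root has no parent
    }
}
```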