Notes on Chinese Web Data Extraction in Java (part 2)

Reposted from: http://isaacyang.wordpress.com/

2. Parsing an HTML Page

There are a lot of choices when it comes to parsing an HTML page. For a complete list of the available parsers, you can search Google for “html parser”. Basically, there are two different ways to parse an HTML page: with an embedded parser, or with a local web browser.

For the first method, you can write your own HTML parser or use one of the libraries available on the internet, such as HTML Parser, JTidy, or NekoHTML. These parsers simply parse the content of the HTML page using the encoding you provide and correct syntax errors in the HTML tags. You can output the parsed HTML page or get the parsed DOM tree, and you can either manipulate the DOM tree directly or convert it into your own in-memory model. This kind of parser is usually fast but not robust. I’ve tried the three parsers above and I would recommend HTML Parser.

For the second method, you can use the JDesktop Integration Components (JDIC) or the Standard Widget Toolkit (SWT). Both provide a way to access the local web browser (IE and Firefox). With SWT, you can use XULRunner even if you don’t have a web browser installed. XULRunner is a Mozilla runtime package that can be used to bootstrap XUL + XPCOM applications that are as rich as Firefox and Thunderbird. Because the page is rendered by the local web browser, all the resources used by the page (like styles and images) are loaded from the remote server and the JavaScript inside the page is executed as well. Parsing a page therefore takes much longer than with the first method, but this method is extremely robust and the quality of the result is very high: what you get after parsing is exactly what you would see when browsing the page in a web browser. This characteristic is especially important when you extract data from web pages that are mainly generated by JavaScript, where the first method does not work. This method is also easy to use. Simply create an instance of the local browser and pass it the target URL; the browser will provide a parsed DOM tree or the parsed document. A small problem with this method is that the parsed web page keeps its original encoding, so you still have to detect the encoding of the web page and convert the page into your project’s encoding.
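The encoding conversion mentioned above can be sketched with the standard java.nio.charset classes. This is only a minimal sketch under some assumptions: the class name EncodingConverter and its methods are made up for illustration, the source charset (for example "GBK" or "Big5") is assumed to have been detected already, and UTF-8 is assumed to be the project encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in this article.
class EncodingConverter {

    // Decodes the raw page bytes using the page's own (already detected) charset.
    static String decode(byte[] rawPage, String sourceCharset) {
        return new String(rawPage, Charset.forName(sourceCharset));
    }

    // Re-encodes decoded text as UTF-8, which is assumed to be the project encoding.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Convenience method: source-charset bytes in, UTF-8 bytes out.
    static byte[] convertToUtf8(byte[] rawPage, String sourceCharset) {
        return toUtf8(decode(rawPage, sourceCharset));
    }
}
```

Note that Charset.forName() throws an exception if the charset name is unknown to the JVM, so the detected name may need to be validated first, for example with Charset.isSupported().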

In the following subsections, I will show one example for each of the above two methods.

2.1 Example for Working with HTML Parser

To use HTML Parser, you need to add htmllexer.jar and htmlparser.jar to your Java build path. The following snippet parses an HTML page and traverses the body of the DOM tree. Note that the class of “node” in the code is org.htmlparser.Node, which does not implement the standard W3C node interface org.w3c.dom.Node. You have to write your own adapter if you want to use the W3C node type.

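import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;

try {
    // content holds the HTML source as a String
    Parser parser = Parser.createParser(content, "utf-8");
    HtmlPage page = new HtmlPage(parser);
    parser.visitAllNodesWith(page);
    NodeList list = page.getBody();
    for (NodeIterator iterator = list.elements(); iterator.hasMoreNodes();) {
        Node node = iterator.nextNode(); // a top-level node of the body
    }
} catch (ParserException e) {
    e.printStackTrace();
}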
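As for the W3C node type mentioned above, one possible approach is to copy the parsed tree into a standard org.w3c.dom.Document. The sketch below only illustrates the idea using the JDK's own XML classes. SimpleTag is a hypothetical stand-in for a parser-specific node (it is not the org.htmlparser.Node API), and the code copies the tree rather than wrapping it, so it is a simplification of a real adapter.

```java
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class W3cAdapterSketch {

    // Hypothetical stand-in for a parser-specific node: a tag name,
    // optional text content, and a list of child tags.
    static final class SimpleTag {
        final String name;
        final String text;
        final List<SimpleTag> children;

        SimpleTag(String name, String text, List<SimpleTag> children) {
            this.name = name;
            this.text = text;
            this.children = children;
        }
    }

    // Builds a new W3C document whose root element mirrors the given tag.
    static Document toW3c(SimpleTag root) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(copy(doc, root));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML document builder available", e);
        }
    }

    // Recursively copies a tag, its text, and its children into W3C elements.
    private static Element copy(Document doc, SimpleTag tag) {
        Element element = doc.createElement(tag.name);
        if (tag.text != null && !tag.text.isEmpty()) {
            element.appendChild(doc.createTextNode(tag.text));
        }
        for (SimpleTag child : tag.children) {
            element.appendChild(copy(doc, child));
        }
        return element;
    }
}
```

In a real adapter, the copy() step would read the tag name, text, and children from the parser's own node objects instead of from SimpleTag.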

2.2 Example for Working with SWT & XULRunner

Simply add swt.jar to your Java build path and you will be able to work with SWT. Using XULRunner is a little more complicated. You have to download the right version of XULRunner for your platform and run the registration command from the XULRunner directory.
To make XULRunner available to all users, use the command:

  • Windows xulrunner --register-global
  • Linux ./xulrunner --register-global
  • Mac ./xulrunner-bin --register-global

To make it available only to the current user, use the command:

  • Windows xulrunner --register-user
  • Linux ./xulrunner --register-user
  • Mac ./xulrunner-bin --register-user

You also need to add MozillaGlue.jar and MozillaInterfaces.jar to your Java build path. These two JAR files can be found in the xulrunner-sdk archive. Now you can use XULRunner in your application.

The following snippet loads a URL using XULRunner. You can disable the browser’s caching by passing the header “Cache-Control: no-cache” when calling the setUrl() method. The completed() event is fired when the page has finished loading, so you can add code to this event to trigger the processing of the page. The sample code in the snippet gets the DOM tree of the page. Note that the DOM tree is not the standard W3C DOM tree.

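import org.eclipse.swt.SWT;
import org.eclipse.swt.browser.Browser;
import org.eclipse.swt.browser.ProgressAdapter;
import org.eclipse.swt.browser.ProgressEvent;
import org.eclipse.swt.layout.FillLayout;
import org.eclipse.swt.widgets.Display;
import org.eclipse.swt.widgets.Shell;
import org.mozilla.interfaces.nsIDOMDocument;
import org.mozilla.interfaces.nsIDOMElement;
import org.mozilla.interfaces.nsIDOMWindow;
import org.mozilla.interfaces.nsIWebBrowser;

Display display = new Display();
final Shell shell = new Shell(display);
FillLayout layout = new FillLayout();
shell.setLayout(layout);

// SWT.MOZILLA selects the XULRunner-based browser
final Browser browser = new Browser(shell, SWT.BORDER | SWT.MOZILLA);
browser.setUrl(url, null, new String[] { "Cache-Control: no-cache" });
browser.addProgressListener(new ProgressAdapter() {
    @Override
    public void completed(ProgressEvent event) {
        // The page has finished loading: fetch the Mozilla DOM tree
        nsIWebBrowser webBrowser = (nsIWebBrowser) browser.getWebBrowser();
        nsIDOMWindow domWindow = webBrowser.getContentDOMWindow();
        nsIDOMDocument document = domWindow.getDocument();
        nsIDOMElement documentElement = document.getDocumentElement();
    }
});
shell.open();
// Standard SWT event loop: runs until the window is closed
while (!shell.isDisposed()) {
    if (!display.readAndDispatch())
        display.sleep();
}
display.dispose();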
