Notes on Chinese Web Data Extraction in Java (part 3)

Reposted from: http://isaacyang.wordpress.com/

3. Web Crawler and Data Extraction

For a general-purpose web crawler, the task of data extraction is rather complicated. It needs to identify the data region first and then extract the data in that region using some algorithms plus some sort of NLP techniques. Detecting the data region is a rather difficult task: a seemingly simple problem like automatically finding the title and body of a news article has led to thousands of top conference and journal papers. I will skip this part and focus on a much easier scenario where manual labelling is involved. So the only problem here is how to extract the data, i.e., wrapper induction (WI). A wrapper is a program that extracts data from an information source (like a Web page) so that an information integration system can access the data in a unified way. Apart from manually writing wrappers, there are two main approaches to wrapper generation – the supervised approach and the unsupervised approach. The supervised approach uses supervised learning to learn data extraction rules (like a regular expression) from a set of manually labelled examples. Unsupervised approaches, such as RoadRunner [2], EXALG [1] and DEPTA [4], do not need any labelled examples. To make the extraction as accurate as possible, the approach discussed below combines manual labelling with the partial tree alignment algorithm of DEPTA.

A very common pattern in Web design is to build a list page which contains general information about records. The list can be a single page or multiple pages chained by “next page” links. Additionally, each record has a link to its corresponding record page with detailed information. Both the list page and the record pages are generated from predefined templates. Fig. 1 shows an example of a list page: it has multiple records, and each record has a record page link. The records share the same underlying template, and the record pages also share the same underlying template.

Figure 1: A list page contains records and links to record pages

WI systems like RoadRunner and EXALG are based on this observation. The need for multiple similar pages comes from the heavy noise in Web pages and the incompleteness of any single page. For example, in news body extraction, a Web page may contain several discontinuous parts identified as news bodies, and determining which part is the real news body is difficult. But if you put multiple news pages together, the part that is consistently identified as a news body is highly likely to be the real news body region. Another example is product information extraction: blank items may be invisible on a Web page, so a single page may not display all the product information. A reasonable approach is to compare as many pages as possible to determine all the product information items.

This observation also lets us organize the crawler efficiently. The crawler starts from the first list page, visits all the record-page links on it, and then moves on to the next list page, until no more “next page” link exists. A minimal sketch of this loop is shown below.
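The sketch uses jsoup, which is only one possible choice and not necessarily the HTML-to-DOM setup used earlier in this series; the start URL and the CSS selectors “a.record” and “a.next” are placeholders for whatever the labelled template nodes resolve to.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListCrawler {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com/list?page=1"; // hypothetical first list page
        while (url != null) {
            Document listPage = Jsoup.connect(url).get();
            // visit every record page linked from the current list page
            for (Element link : listPage.select("a.record")) {
                Document recordPage = Jsoup.connect(link.absUrl("href")).get();
                // ... run the template-tree extraction of Section 3.2 here ...
            }
            // follow the "next page" link, or stop when there is none
            Element next = listPage.select("a.next").first();
            url = (next == null || next.absUrl("href").isEmpty()) ? null : next.absUrl("href");
        }
    }
}
```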

The only remaining problem is how to find the common underlying template. This can be solved by running the partial tree alignment algorithm [4] on the DOM trees obtained in the previous section. The following subsection introduces the partial tree alignment algorithm. The tree obtained from the algorithm is called the template tree, and it is what drives the data extraction.

3.1 Partial tree alignment algorithm

The partial tree alignment algorithm was proposed by Yanhong Zhai and Bing Liu in [4]. Given a forest, it computes the tree edit distance between each pair of trees. The computation of the tree edit distance between two trees also yields a mapping between them, which enables us to merge the two trees together. Repeating this procedure on the forest eventually leaves a single tree.

The tree edit distance between two labelled ordered trees A and B is the cost of the minimum set of operations needed to transform A into B. The standard operations are node removal, node insertion and node replacement, each with an assigned cost. The concept of a mapping is formally defined as follows [3]:

Let X be a tree and let X[i] be the $i$-th node of tree X in a preorder walk of the tree. A mapping M between a tree A of size $n_1$ and a tree B of size $n_2$ is a set of ordered pairs $(i, j)$, one from each tree, satisfying the following conditions for all $(i_1, j_1), (i_2, j_2) \in M$:

  1. $i_1 = i_2$ iff $j_1 = j_2$;
  2. $A[i_1]$ is on the left of $A[i_2]$ iff $B[j_1]$ is on the left of $B[j_2]$;
  3. $A[i_1]$ is an ancestor of $A[i_2]$ iff $B[j_1]$ is an ancestor of $B[j_2]$.

In the matching, the hierarchical relation between nodes and the order between sibling nodes are both preserved. Fig. 2 shows a matching example.


Figure 2: A general tree mapping example (from [4])

The standard tree matching problem can be solved in $\mathcal{O}(n_1 n_2 h_1 h_2)$ time using dynamic programming, where $h_1$ and $h_2$ are the heights of the trees. In the standard setting, the mapping can cross levels (node a in tree A and node b in tree B in Fig. 2). The simple tree matching algorithm considers a restricted version where no node replacement and no level crossing are allowed; its time complexity is $\mathcal{O}(n_1 n_2)$. Let $A_i$ be the subtree rooted at the $i$-th child of A's root and $B_j$ the subtree rooted at the $j$-th child of B's root. Denote by $M[i, j]$ the size of the maximal matching between $\{A_1, \cdots, A_i\}$ and $\{B_1, \cdots, B_j\}$. Then the recursive formula is $M[i, j] = \max(M[i-1, j],\ M[i, j-1],\ M[i-1, j-1] + W[i, j])$, where $W[i, j]$ is the size of the maximal matching between $A_i$ and $B_j$. After the matching, we can trace back through the tables to find the matched nodes between the two trees.
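Here is a minimal sketch of the algorithm in Java, with a hypothetical TreeNode class standing in for a real DOM node:

```java
import java.util.ArrayList;
import java.util.List;

// TreeNode is a hypothetical stand-in for a DOM node with an ordered child list.
class TreeNode {
    String label;
    List<TreeNode> children = new ArrayList<>();
    TreeNode(String label) { this.label = label; }
}

class SimpleTreeMatching {
    // Returns the size of the maximal matching between the trees rooted at a and b.
    static int match(TreeNode a, TreeNode b) {
        if (!a.label.equals(b.label)) return 0; // no replacement, no level crossing
        int m = a.children.size(), n = b.children.size();
        int[][] M = new int[m + 1][n + 1]; // M[i][j]: maximal matching of A_1..A_i vs B_1..B_j
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int w = match(a.children.get(i - 1), b.children.get(j - 1)); // W[i][j]
                M[i][j] = Math.max(Math.max(M[i - 1][j], M[i][j - 1]), M[i - 1][j - 1] + w);
            }
        }
        return M[m][n] + 1; // +1 for the matched roots themselves
    }
}
```

Each pair of nodes is compared at most once, from their parents' table, which is where the $\mathcal{O}(n_1 n_2)$ bound comes from.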

Partial tree alignment works on the result of the simple tree matching algorithm: it merges two trees based on their matched nodes. Any merging strategy that preserves the order of the sibling nodes will work. After repeating the algorithm enough times, all the trees are merged into one tree, which corresponds to the underlying template of the pages.
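A simplistic merge sketch follows, assuming `matched` maps each matched node of the incoming tree to its counterpart in the template (recovered by tracing back the matching, as in Section 3.2). The original algorithm is more careful and defers nodes whose insertion position is ambiguous to a later pass.

```java
import java.util.Map;

// Merges tree b into the template, reusing TreeNode from the sketch above.
class TreeMerger {
    static void merge(TreeNode template, TreeNode b, Map<TreeNode, TreeNode> matched) {
        int insertAt = 0;
        for (TreeNode child : b.children) {
            TreeNode counterpart = matched.get(child);
            if (counterpart != null) {
                // matched child: descend, and remember its position for ordering
                insertAt = template.children.indexOf(counterpart) + 1;
                merge(counterpart, child, matched);
            } else {
                // unmatched child: graft it right after the last matched sibling
                template.children.add(insertAt++, child);
            }
        }
    }
}
```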

The above discussion works for general trees. For DOM trees, the node type also matters in matching; common node types are element, attribute, text and comment. I suggest the following rules for DOM tree matching (comment nodes are removed beforehand), with a scoring sketch after the list.

  1. Two nodes match only if their node types are the same.
  2. Two element nodes match iff they have the same tag name. A more rigorous rule is that two element nodes match iff they have the same tag name and class name.
  3. A mismatch scores 0 points and a match scores 1 point, except for the special cases below.
  4. Two text nodes always match, regardless of their texts.
  5. If two text nodes share a common prefix, the score is 2. If their texts belong to the same ontology (this requires NLP techniques), the score should be higher than 1.
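Below is a minimal scoring sketch of these rules on W3C DOM nodes; reading “share a common prefix” as “have a non-empty common prefix” is my own simplification, and the ontology test is left out since it needs NLP machinery.

```java
import org.w3c.dom.Node;

class NodeScorer {
    static double score(Node a, Node b) {
        if (a.getNodeType() != b.getNodeType()) return 0;                 // rule 1
        if (a.getNodeType() == Node.ELEMENT_NODE)                         // rule 2
            return a.getNodeName().equalsIgnoreCase(b.getNodeName()) ? 1 : 0;
        if (a.getNodeType() == Node.TEXT_NODE) {                          // rules 4 and 5
            String s = a.getTextContent(), t = b.getTextContent();
            boolean samePrefix = s != null && t != null && commonPrefix(s, t) > 0;
            return samePrefix ? 2 : 1;
        }
        return 1;                                                         // rule 3 default
    }

    static int commonPrefix(String s, String t) {
        int i = 0, n = Math.min(s.length(), t.length());
        while (i < n && s.charAt(i) == t.charAt(i)) i++;
        return i;
    }
}
```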

Here, the goal of tree matching is to find the matching with the maximal score, and two trees match if their matching score is above a specific threshold. Given a forest, define a score for each tree as a function of the number of trees in the forest it matches and the sum of the corresponding matching scores. The tree with the highest score is the reference tree of the forest, and its score is the matching score of the forest. Denoting the two quantities by n and s respectively, a possible scoring function is $n + s/n$, which combines the number of matched trees with the average matching score. A sketch is given below.
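This sketch reuses the TreeNode and SimpleTreeMatching sketches above; the match threshold is left as a parameter.

```java
import java.util.List;

class ReferenceTreeFinder {
    static TreeNode referenceTree(List<TreeNode> forest, double threshold) {
        TreeNode best = null;
        double bestScore = -1;
        for (TreeNode cand : forest) {
            int n = 0;     // number of trees in the forest matching cand
            double s = 0;  // sum of the corresponding matching scores
            for (TreeNode other : forest) {
                if (other == cand) continue;
                double score = SimpleTreeMatching.match(cand, other);
                if (score >= threshold) { n++; s += score; }
            }
            double total = (n == 0) ? 0 : n + s / n; // the suggested scoring function
            if (total > bestScore) { bestScore = total; best = cand; }
        }
        return best; // its score is the matching score of the forest
    }
}
```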

For a list page, find the node whose children (a forest) have the highest matching score; merging all the other children into the reference tree yields the template tree. The list page, this node and the template tree are called the list tree, the record zone tree and the record tree respectively. If the scoring function does not lead to the desired record zone tree, one can manually choose the record zone tree and compute its corresponding record tree. For a group of record pages (also a forest), find the reference tree among the pages and merge all the other pages into it; the resulting template tree is called the record detail tree.

3.2 Data extraction using the template tree

A template tree represents the underlying template of a series of Web pages; it is also a wrapper. Typical wrapper usage is to convert it into an automaton and use the automaton to match new pages. It is natural to think of regular expressions, since a regular expression describes a deterministic finite automaton. However, Java's regular expression implementation is not robust enough: a complicated pattern can hang in catastrophic backtracking, and Java provides no way to monitor the matching procedure and terminate it on timeout. Since the regular expression corresponding to a template tree is extremely complicated, regular expressions are not a good fit here.

An alternative approach is to compute the simple tree matching between the template tree and the DOM tree of the new page, and trace back to recover the node mapping. Then, for any node in the template tree, we can find its corresponding node in the DOM tree of the new page. As the time complexity is $\mathcal{O}(n_1 n_2)$ and a page usually has only thousands of nodes, the matching is rather fast. This approach turns out to be very efficient in practice. A trace-back sketch follows.
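The sketch again uses the hypothetical TreeNode class from Section 3.1; a real implementation would memoize match() instead of recomputing it at every level.

```java
import java.util.Map;

class Aligner {
    // Fills `mapping` with pairs: template node -> corresponding page node.
    static void align(TreeNode t, TreeNode p, Map<TreeNode, TreeNode> mapping) {
        if (!t.label.equals(p.label)) return;
        mapping.put(t, p);
        int m = t.children.size(), n = p.children.size();
        int[][] M = new int[m + 1][n + 1];
        int[][] w = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++) {
                w[i][j] = SimpleTreeMatching.match(t.children.get(i - 1), p.children.get(j - 1));
                M[i][j] = Math.max(Math.max(M[i - 1][j], M[i][j - 1]),
                                   M[i - 1][j - 1] + w[i][j]);
            }
        int i = m, j = n;
        while (i > 0 && j > 0) {                 // walk the DP table backwards
            if (M[i][j] == M[i - 1][j]) i--;
            else if (M[i][j] == M[i][j - 1]) j--;
            else {                               // children i and j were paired
                if (w[i][j] > 0) align(t.children.get(i - 1), p.children.get(j - 1), mapping);
                i--; j--;
            }
        }
    }
}
```

After align(templateTree, pageTree, new java.util.IdentityHashMap<>()) returns, the corresponding node of any labelled template node is a single map lookup.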

The real data extraction system works as follows.

  1. Labelling phase.
    1. Feed the target list page to the system; the system shows the detected record zone tree and record tree. If the detected record zone tree is the one you need, move to the next step; otherwise notify the system of the correct record zone tree and it shows the corresponding record tree.
    2. Label the node in the list tree which contains the “next page” link. Label the nodes in the record tree which contain the needed data. If the data in the record tree is enough, go directly to the extracting phase.
    3. Label the node in the record tree which contains the link to the record page.
    4. The system computes the corresponding node of the labelled node for every sibling of the record tree and visits all the retrieved record links to get the forest of record pages.
    5. The system computes the record detail tree of the forest and shows it to the user. If the tree is not the desired one, the user may point the system at the desired one and the system recomputes the record detail tree.
    6. Label the nodes in the record detail tree which contain the needed data.
  2. Extracting phase.
    1. Visit the target list page and convert it into a DOM tree. Compute the mapping from the list tree to the DOM tree. For each child of the record zone tree's corresponding node, compute the mapping from the record tree to the subtree corresponding to that child and extract the labelled nodes (including the link node).
    2. If the information on the record pages is needed, visit every extracted record page link and compute the mapping from the record detail tree to the DOM tree of the record page. Extract the corresponding node of every labelled node.
    3. Follow the corresponding node of the “next page” node to the next list page and loop until no more pages are available.

After you get the information in the nodes, all you need to do is convert it into your desired format; writing some regular expressions to match things like money, time and cellphone numbers will be necessary. A crawler using this method crawls about 100,000 houses for rent/sale from www.fang.com every day. The crawler runs on a dual-core machine with 4 GB of memory and a 4 Mbps Internet connection; it saturates the Internet bandwidth at an average CPU usage of about 25%.
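For instance (these patterns are purely illustrative; the real ones depend on the site):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldNormalizer {
    // an amount in CNY followed by 元, e.g. "租金:3500元/月" -> 3500
    static final Pattern MONEY = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*元");
    // a mainland China cellphone number: 11 digits starting with 1
    static final Pattern CELLPHONE = Pattern.compile("1\\d{10}");

    public static void main(String[] args) {
        Matcher m = MONEY.matcher("租金:3500元/月");
        if (m.find()) System.out.println(m.group(1)); // prints 3500
    }
}
```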

References
[1] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 337-348. ACM, 2003.
[2] V. Crescenzi, G. Mecca, P. Merialdo, et al. RoadRunner: Towards automatic data extraction from large web sites. In Proceedings of the international conference on very large data bases, pages 109-118. Citeseer, 2001.
[3] K.C. Tai. The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):422-433, 1979.
[4] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th international conference on World Wide Web, pages 76-85. ACM, 2005.

 
