A Search-based Chinese Word Segmentation Method
Xin-Jing Wang, IBM China Research Center, Beijing, China ([email protected])
Wen Liu, Huazhong Univ. of Sci. & Tech., Wuhan, China ([email protected])
Yong Qin, IBM China Research Center, Beijing, China ([email protected])
ABSTRACT
In this paper, we propose a novel Chinese word segmentation method that leverages the vast number of Web documents and search technology. It simultaneously solves the ambiguous phrase boundary resolution and unknown word identification problems. Evaluations demonstrate its effectiveness.
Keywords: Chinese word segmentation, search.
1. INTRODUCTION
Automatic Chinese word segmentation is an important technique for many areas including speech synthesis, text categorization, etc. [3]. It is challenging because 1) there is no standard definition of words in Chinese, and 2) word boundaries are not marked by spaces. Two main research issues are involved: ambiguous phrase boundary resolution and unknown word identification.
Previous approaches fall roughly into four categories:
1) Dictionary-based methods, which segment sentences by matching entries in a dictionary [3]. Their accuracy is determined by the coverage of the dictionary, and drops sharply as new words appear.
2) Statistical machine learning methods [1], which are typically based on co-occurrences of character sequences. They generally require large annotated Chinese corpora for model training, and lack the flexibility to adapt to different segmentation standards.
3) Transformation-based methods [4]. Initially used in POS tagging and parsing, these methods learn a set of n-gram rules from a training corpus and then apply them to new text.
4) Combining methods [3] which combine two or more of the above methods.
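To make the dictionary-based family concrete, the following is a minimal sketch of forward maximum matching, the classic greedy algorithm in that category. The toy dictionary and example sentence are hypothetical, not from the paper's evaluation:

```python
def fmm_segment(sentence, dictionary, max_len=4):
    """Forward maximum matching: at each position, greedily take the
    longest dictionary entry; fall back to a single character if none
    matches (illustrating the OOV weakness described above)."""
    segments = []
    i = 0
    while i < len(sentence):
        # Try the longest window first, shrinking until a match is found.
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + j]
            if j == 1 or word in dictionary:
                segments.append(word)
                i += j
                break
    return segments

# Hypothetical toy dictionary: 中文 (Chinese), 分詞 (segmentation), 方法 (method)
toy_dict = {"中文", "分詞", "方法"}
print(fmm_segment("中文分詞方法", toy_dict))
```

Note how an unseen word degrades into single characters, which is exactly why dictionary coverage dominates accuracy for this family of methods.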
As the Web prospers, it brings new opportunities to solve many previously "unsolvable" problems. In this paper, we propose to leverage the Web and search technology to segment Chinese words. The main advantages of this approach are:
1) It is free from the out-of-vocabulary (OOV) problem, a natural benefit of leveraging Web documents.
2) It adapts to different segmentation standards, since ideally all valid character sequences can be obtained by searching the Web.
3) It can be entirely unsupervised, requiring no training corpora.
2. THE PROPOSED APPROACH
The approach contains three steps:
1) segments collecting,
2) segments scoring,
3) segmentation scheme ranking.
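The three steps above can be sketched end-to-end as follows. The scheme-enumeration logic follows the paper's pipeline, but the scoring function (favoring longer segments) is a hypothetical placeholder for the Web-based segment scores, since the actual scoring scheme is described later in the paper:

```python
def enumerate_schemes(sentence, segments):
    """Enumerate every way to cover the sentence with collected segments,
    falling back to single characters for uncovered spans."""
    if not sentence:
        return [[]]
    schemes = []
    candidates = {s for s in segments if sentence.startswith(s)}
    candidates.add(sentence[0])  # single-character fallback
    for seg in candidates:
        for rest in enumerate_schemes(sentence[len(seg):], segments):
            schemes.append([seg] + rest)
    return schemes

def rank_schemes(sentence, segments, score=lambda seg: len(seg) ** 2):
    """Rank candidate segmentation schemes by total segment score.
    The quadratic length score is a hypothetical stand-in for the
    search-based scores the method actually uses."""
    schemes = enumerate_schemes(sentence, segments)
    return sorted(schemes, key=lambda sch: sum(score(s) for s in sch),
                  reverse=True)

# Segments as might be collected from search-result highlights (step 1).
best = rank_schemes("他高興地說", {"他", "高興", "高興地", "說"})[0]
print(best)
```

Under this toy score the top-ranked scheme keeps "高興地" as one segment, matching the highlight behavior discussed in Section 2.1.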
2.1 Segments Collecting
The segments are collected in two steps:
1) First, the query sentence is split at punctuation marks, which yields several semantically coherent sub-sentences.
2) Then each sub-sentence is submitted to a search engine to collect segments. Technically, if the search engine's inverted indices are inaccessible, as is the case with commercial search engines such as Google and Yahoo!, we collect the highlights (the red words in Figure 1) from the returned snippets as the segments. Otherwise, we check the character positions indicated by the inverted indices and find those characters that neighbor each other in the query.
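The two collection steps can be sketched as below. The `<b>…</b>` markup for highlights follows the convention visible in the Yahoo! HTML source mentioned at the end of this section; the sample query and snippet are hypothetical:

```python
import re

def split_by_punctuation(query):
    """Step 1: split the query sentence at (CJK or ASCII) punctuation
    into sub-sentences."""
    parts = re.split(r"[,。,、;;:!??!\s]+", query)
    return [p for p in parts if p]

def collect_segments(snippet_html):
    """Step 2: collect the highlighted terms (rendered red in Figure 1,
    marked <b>...</b> in the snippet HTML) as candidate segments."""
    return re.findall(r"<b>(.*?)</b>", snippet_html)

# Hypothetical query and returned snippet for illustration.
subs = split_by_punctuation("他高興地說,謝謝大家")
segs = collect_segments("<b>他高興地</b>說:<b>謝謝</b>大家")
```

This only covers the snippet-highlight path; the inverted-index path would instead read character positions directly from the index, which is not sketched here.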
Although search engines generally have local segmentors, we argue that their performance normally will not affect our results. For example, Figure 1 shows the search results for “他高興地說” (he said happily); our method assumes that the highlight “他高興地” (he happily) is a segment. However, by checking the HTML source, we found that Yahoo!’s local segmentor gives “<b>