Foreign Literature Translation: A Search-based Chinese Word Segmentation Method

                                           A Search-based Chinese Word Segmentation Method


 

 

Xin-Jing Wang                                       Wen Liu                                                         Yong Qin

IBM China Research Center                  Huazhong Univ. of Sci. & Tech.                      IBM China Research Center

Beijing, China                                       Wuhan, China                                                Beijing, China

[email protected]                          [email protected]                                [email protected]

 

 

ABSTRACT

     In this paper, we propose a novel Chinese word segmentation method which leverages the huge deposit of Web documents and search technology. It simultaneously solves ambiguous phrase boundary resolution and unknown word identification problems. Evaluations prove its effectiveness.

 

 

Keywords: Chinese word segmentation, search.

 

 


 

 

 

1. INTRODUCTION

      Automatic Chinese word segmentation is an important technique for many areas including speech synthesis, text categorization, etc. [3]. It is challenging because 1) there is no standard definition of words in Chinese, and 2) word boundaries are not marked by spaces. Two research issues are mainly involved: ambiguous phrase boundary resolution and unknown word identification.

      Previous approaches fall roughly into four categories:

      1) Dictionary-based methods, which segment sentences by matching entries in a dictionary [3]. Their accuracy is determined by the coverage of the dictionary, and drops sharply as new words appear.
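As an illustration, dictionary matching is often implemented as forward maximum matching; the sketch below is a minimal, hypothetical version (the toy dictionary is made up, not from the paper):

```python
# A minimal sketch of dictionary-based segmentation via forward maximum
# matching. The toy dictionary is hypothetical, not from the paper.
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedily take the longest dictionary entry at each position."""
    segments, i = [], 0
    while i < len(sentence):
        # Try the longest window first and shrink until a dictionary hit;
        # an unmatched character falls out as a single-character segment.
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + j]
            if j == 1 or candidate in dictionary:
                segments.append(candidate)
                i += j
                break
    return segments

toy_dict = {"他", "高興", "地", "說"}
print(forward_max_match("他高興地說", toy_dict))  # → ['他', '高興', '地', '說']
```

Note how an out-of-dictionary word is shattered into single characters, which is exactly the coverage problem noted above.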

      2) Statistical machine learning methods [1], which are typically based on co-occurrences of character sequences. Generally, large annotated Chinese corpora are required for model training, and these methods lack the flexibility to adapt to different segmentation standards.
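The co-occurrence idea can be illustrated with a toy sketch: count character bigrams in a corpus and cut the sentence wherever adjacent characters rarely co-occur. The corpus, threshold, and function names below are hypothetical stand-ins, not the method of [1]:

```python
# A toy illustration of the co-occurrence idea: count character bigrams in a
# (hypothetical) corpus and cut wherever adjacent characters rarely co-occur.
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent character pairs over the corpus sentences."""
    counts = Counter()
    for sent in corpus:
        counts.update(zip(sent, sent[1:]))
    return counts

def segment_by_cooccurrence(sentence, counts, threshold=2):
    """Place a word boundary wherever the bigram count falls below threshold."""
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if counts[(a, b)] >= threshold:
            current += b           # strong association: stay in the same word
        else:
            words.append(current)  # weak association: cut here
            current = b
    words.append(current)
    return words

corpus = ["他高興地說", "她高興地笑", "高興就好"]
print(segment_by_cooccurrence("他高興地說", bigram_counts(corpus)))  # → ['他', '高興地', '說']
```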

      3) Transformation-based methods [4]. These were initially used in POS tagging and parsing; they learn a set of n-gram rules from a training corpus and then apply them to new text.

     4) Combining methods [3], which combine two or more of the above methods.

     As the Web prospers, it brings new opportunities to solve many previously "unsolvable" problems. In this paper, we propose to leverage the Web and search technology to segment Chinese words. Its typical advantages include:

     1) It is free from the out-of-vocabulary (OOV) problem, a natural benefit of leveraging Web documents.

     2) It adapts to different segmentation standards, since ideally we can obtain all valid character sequences by searching the Web.

     3) It can be entirely unsupervised, requiring no training corpora.

 

 



 

2. THE PROPOSED APPROACH


       The approach contains three steps:

       1) segments collecting,

       2) segments scoring,

       3) segmentation scheme ranking.
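The three steps above can be sketched as a pipeline. Everything below is a hypothetical stand-in: the candidate segments and scores mimic what steps 1 and 2 would produce, and step 3 exhaustively enumerates and ranks the covering schemes:

```python
# Hypothetical skeleton of the three-step pipeline. The candidate scores stand
# in for steps 1-2 (collection and scoring); step 3 enumerates and ranks.
def enumerate_schemes(sentence, segments):
    """Yield every way to cover the sentence with the candidate segments."""
    if not sentence:
        yield []
        return
    for seg in segments:
        if sentence.startswith(seg):
            for rest in enumerate_schemes(sentence[len(seg):], segments):
                yield [seg] + rest

def rank_schemes(sentence, scores):
    """Step 3: pick the segmentation scheme with the highest total score."""
    schemes = list(enumerate_schemes(sentence, scores))
    if not schemes:
        return list(sentence)  # no covering scheme: fall back to characters
    return max(schemes, key=lambda s: sum(scores[seg] for seg in s))

# Made-up output of steps 1-2: candidate segments with scores.
scores = {"他": 1.0, "他高興地": 3.0, "高興": 2.0, "地": 0.5, "說": 1.0}
print(rank_schemes("他高興地說", scores))  # → ['他', '高興', '地', '說']
```

The exhaustive enumeration is for clarity only; a practical implementation would use dynamic programming over the same scores.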


 

2.1 Segments Collecting

     The segments are collected in two steps:

     1) First, the query sentence is segmented at punctuation marks, which yields several sub-sentences.

     2) Then each sub-sentence is submitted to a search engine to collect segments. Technically, if the search engine’s inverted indices are inaccessible, as is the case with commercial search engines such as Google and Yahoo!, we collect the highlights (the red words in Figure 1) from the returned snippets as the segments. Otherwise, we check the characters’ positions indicated by the inverted indices and find those that neighbor each other in the query.
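A minimal sketch of these two collection steps, assuming the engine returns snippets in which hits are wrapped in <b>…</b> tags (as Yahoo! does per the HTML-source discussion below); the snippet strings and function names are made up:

```python
# A hedged sketch of segment collection, assuming snippets highlight hits with
# <b>...</b> tags. The snippet strings below are made-up examples.
import re

# CJK and ASCII punctuation used to split the query into sub-sentences (step 1).
PUNCT = r"[,。!?;:、,.!?;:\s]+"

def split_sub_sentences(query):
    """Step 1: split the query at punctuation into sub-sentences."""
    return [s for s in re.split(PUNCT, query) if s]

def collect_highlights(snippets):
    """Step 2: harvest the <b>-highlighted spans of returned snippets as segments."""
    segments = set()
    for html in snippets:
        segments.update(re.findall(r"<b>(.*?)</b>", html))
    return segments

snippets = ["……<b>他高興地</b>告訴記者……", "他<b>高興</b>得<b>說</b>不出話"]
print(sorted(collect_highlights(snippets)))
```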

 


     Although search engines generally have local segmentors, we argue that their performance normally will not affect our results. For example, Figure 1 shows the search results of “他高興地說” (he said happily); our method assumes that the highlight “他高興地” (he happily) is a segment. However, by checking the HTML source, we found that Yahoo!’s local segmentor gives “<b>
