Notes - 2009 - An Error-Driven Word-Character Hybrid Model for Joint CWS and POS Tagging

An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging
Authors: Kobe University; Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara
Source: Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 513–521, Suntec, Singapore, 2–7 August 2009.
Word-character based tagging combined with the MIRA algorithm; a further improvement following Tetsuji Nakagawa's 2004-2007 line of work.

Introduction
Joint Chinese word segmentation and POS tagging received wide attention from 2004 to 2009 (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
The word-character hybrid tagging model was first proposed in 2004: a Markov model at the word level and an ME model at the character level (Nakagawa, 2004; Nakagawa and Uchimoto, 2007).
The MIRA algorithm (Crammer, 2004; Crammer et al., 2005; McDonald, 2006).
Corpus used
Penn Chinese Treebank (Xia et al., 2000), hereafter CTB.

Main text
Background
1 The search space is a lattice based on the word-character hybrid model (Nakagawa and Uchimoto, 2007).
2 Word level: look words up in the dictionary first; if found, tag them with their part of speech (POS). Character level: tag each character with a position-of-character (POC) tag and a POS tag (Asahara, 2003; Nakagawa, 2004).
3 Words found in the dictionary are handled at the word level; words not found are handled at the character level.
4 At test time, a dynamic programming algorithm searches the lattice for the best candidate path (see the sketch below).
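To make step 4 concrete, here is a minimal sketch of first-order Viterbi decoding over such a lattice. The Node layout and the node_score/edge_score functions are hypothetical stand-ins for the model's feature-weighted scores, not the paper's actual implementation:

```python
from collections import namedtuple

# A lattice node spans characters [start, end) and carries a POS (or POC) tag.
Node = namedtuple("Node", ["start", "end", "surface", "tag"])

def viterbi(lattice, sent_len, node_score, edge_score):
    """Search the lattice for the highest-scoring path.
    node_score(n) and edge_score(prev, n) are supplied by the trained model."""
    # Group nodes by their end position for predecessor lookup.
    ends_at = {i: [] for i in range(sent_len + 1)}
    for n in lattice:
        ends_at[n.end].append(n)

    best, back = {}, {}  # best path score reaching a node, and its predecessor
    for n in sorted(lattice, key=lambda x: x.end):
        if n.start == 0:  # paths start at the beginning of the sentence
            best[n], back[n] = node_score(n), None
            continue
        # Extend the best partial path that ends exactly where this node starts.
        cands = [(best[p] + edge_score(p, n), p)
                 for p in ends_at[n.start] if p in best]
        if cands:
            score, prev = max(cands)
            best[n], back[n] = score + node_score(n), prev

    # Trace back from the best node that reaches the end of the sentence.
    tail = max(ends_at[sent_len], key=lambda n: best.get(n, float("-inf")))
    path = []
    while tail is not None:
        path.append(tail)
        tail = back[tail]
    return path[::-1]
```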

Policies for correct path selection
If a word is very rare in the training corpus, it is likely to behave like an OOV word (Baayen and Sproat, 1996); the effectiveness of this idea has been verified repeatedly (Ratnaparkhi, 1996; Nagata, 1999; Nakagawa, 2004).
The paper adopts this idea as its baseline policy: count word frequencies in the training corpus and mark every word whose frequency falls below a threshold r as an artificial OOV word. The threshold r is then tuned by hand to balance the numbers of IV words and artificial OOV words for the best result. Question: does this mean that only words with frequency above r go into the dictionary, i.e., serve as the basis for word-level nodes?
10-fold cross validation: one fold for validation and nine for training, with r = 1; unidentified unknown words are collected from each validation fold.
Error-driven policy: unknown words are learned from three sources, namely 1) artificial OOV words obtained from the training corpus, 2) unidentified unknown words obtained from the validation folds, and 3) identified words (dictionary words). A sketch of the baseline frequency split follows.
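A minimal sketch of the baseline policy as I understand it: split the training vocabulary by frequency, treating rare words as artificial unknowns. Whether the paper's comparison is strict (< r) or inclusive (<= r) I am not certain; the sketch uses <= so that r = 1 captures words seen only once:

```python
from collections import Counter

def split_by_frequency(train_words, r=1):
    """Baseline policy sketch: words occurring at most r times in the
    training corpus become artificial unknown words; the rest are the
    identified (dictionary) words used for word-level lattice nodes."""
    freq = Counter(train_words)
    artificial_unknown = {w for w, c in freq.items() if c <= r}
    identified = {w for w, c in freq.items() if c > r}
    return identified, artificial_unknown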
Training method
McDonald's (2006) method: k-best MIRA with a 0/1 loss function.
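As a rough sketch of what a k-best MIRA update looks like: the genuine algorithm solves a small quadratic program over all k margin constraints at once (e.g., with Hildreth's procedure); the version below approximates that by applying the single-constraint closed-form update to each of the k candidates in turn. All names here are illustrative:

```python
import numpy as np

def mira_update(w, feat_gold, kbest_feats, losses, C=np.inf):
    """Approximate k-best MIRA step.  losses[i] is the 0/1 loss of the
    i-th candidate (1.0 if its analysis differs from the gold one)."""
    for f_hat, loss in zip(kbest_feats, losses):
        delta = feat_gold - f_hat      # feature-vector difference
        margin = w.dot(delta)          # current score margin over this candidate
        norm2 = delta.dot(delta)
        if norm2 == 0.0:
            continue                   # identical features, nothing to update
        # Smallest step that makes margin >= loss, clipped at C.
        tau = min(C, max(0.0, (loss - margin) / norm2))
        w = w + tau * delta
    return w
```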
Features cover two levels, characters and words: 27 unigram templates and 18 bigram templates. w denotes the word form and p the POS tag; the categories of T are given in Table 4 (TB denotes the first character of the word).
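I do not have the exact 27 + 18 templates in front of me, so the sketch below only shows the flavor of such templates, with a few hypothetical examples built from w, p, and the word's first character (the TB-style template):

```python
def unigram_features(node):
    """A few illustrative unigram templates over one lattice node
    (the paper defines 27).  node.surface is w, node.tag is p."""
    w, p = node.surface, node.tag
    return [
        f"w={w}",            # word form alone
        f"p={p}",            # POS tag alone
        f"w|p={w}|{p}",      # word form and POS together
        f"TB|p={w[0]}|{p}",  # first character of the word plus POS
    ]

def bigram_features(prev, node):
    """Illustrative bigram templates over two adjacent nodes (18 in the paper)."""
    return [
        f"pp|p={prev.tag}|{node.tag}",          # POS bigram
        f"pw|p={prev.surface}|{node.tag}",      # previous word plus current POS
        f"pw|w={prev.surface}|{node.surface}",  # word bigram
    ]
```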

Best settings: N = 10 iterations, k-best = 5, and infrequent-word threshold r = 3. Results: best segmentation F1 is 0.9787; best joint seg & tag F1 is 0.9364.
Compared with Ng and Low (2004) on CTB 3.0, Zhang and Clark (2008) on CTB 4.0, and Jiang et al. (2008) on CTB 5.0, this paper reports the best results.

Question: the error-driven method in this section is still not entirely clear to me. Why must artificial OOV words be created? What is their role? What knowledge is actually learned from them? How does that help discover real OOV words? Revisit in a few days.

We now describe our new approach to leverage additional examples of unknown words. Intuition suggests that even though the system can handle some unknown words (here, the artificial ones generated with threshold r), many unidentified unknown words remain that cannot be recovered by the system; we wish to learn the characteristics of such unidentified unknown words. We propose the following simple scheme:
• Divide the training corpus into ten equal sets and perform 10-fold cross validation to find the errors.
• For each trial, train the word-character hybrid model with the baseline policy (r = 1) using nine sets and estimate errors using the remaining validation set.
• Collect unidentified unknown words from each validation set.
Several types of errors are produced by the baseline model, but we only focus on those caused by unidentified unknown words, which can be easily collected in the evaluation process. As described later in Section 5.2, we measure the recall on out-of-vocabulary (OOV) words. Here, we define unidentified unknown words as OOV words in each validation set that cannot be recovered by the system. After ten cross validation runs, we get a list of the unidentified unknown words derived from the whole training corpus. Note that the unidentified unknown words in the cross validation are not necessarily infrequent words, but some overlap may exist. Finally, we obtain the artificial unknown words that combine the unidentified unknown words in cross validation and infrequent words for learning unknown words. We refer to this approach as the error-driven policy.
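Putting the scheme into rough pseudo-Python, as I read it; train_fn and decode_fn are hypothetical stand-ins for the hybrid model's training and decoding routines, and the membership test ignores token positions for simplicity:

```python
def error_driven_unknowns(folds, train_fn, decode_fn, r=1):
    """Sketch of the error-driven policy's cross-validation step.
    folds: ten lists of sentences, each sentence a list of (word, tag)."""
    unidentified = set()
    for i in range(len(folds)):
        held_out = folds[i]
        train_sents = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_fn(train_sents, r=r)  # baseline policy with r = 1
        vocab = {w for sent in train_sents for w, _ in sent}
        for sent in held_out:
            predicted = {w for w, _ in decode_fn(model, sent)}
            for word, _ in sent:
                # an OOV word (w.r.t. the nine training folds) that the
                # system failed to produce anywhere in its output
                if word not in vocab and word not in predicted:
                    unidentified.add(word)
    return unidentified

# Final artificial unknown words = unidentified unknown words from the
# cross validation, combined with the infrequent words from the baseline
# policy (split_by_frequency above).
```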

The paper uses the Baayen and Sproat (1996) idea as its baseline; that paper is probably worth reading. How are the artificial unknown words combined with the unidentified unknown words? And how exactly is the learning done?

