References:
SRILM installation: http://blog.csdn.net/zhoubl668/article/details/7759370
SRILM usage: http://hi.baidu.com/keyever/item/8fad8918b90b8e6b3f87ce87
Paper: SRILM - An Extensible Language Modeling Toolkit
For further reading:
SRILM source-code overview: http://download.csdn.net/download/yqzhao/4546985
SRILM source-code reading series: http://blog.chinaunix.net/uid/20658401/cid-67529-list-1.html
SRILM discounting algorithms: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
Two core modules
The SRILM toolkit has two core modules: ngram-count, which builds a language model from training data, and ngram, which evaluates a language model by computing the perplexity of a test set.
I. ngram-count
The ngram-count module offers a range of counting features: it can produce a count file from the training corpus on its own, build a language model from an existing count file, or do both steps in one go.
Suppose the training corpus is a file named train.data with the following contents:
it 's just down the hall . I 'll bring you some now . if there is anything else you need , just let me know .
No worry about that . I 'll take it and you need not wrap it up .
Do you do alterations ?
the light was red .
we want to have a table near the window .
it 's over there , just in front of the tourist information .
I twisted it playing tennis . it felt Okay after the game but then it started turning black - and - blue . is it serious ?
please input your pin number .
1. Counting: generating a count file
Command:
ngram-count -text train.data -order 3 -write train.data.count
Here -text points to the input file, -order 3 requests trigram (3-gram) counts, and -write names the output count file. If -order is not specified, the default is 3.
Part of the resulting train.data.count looks like this:
please 1
please input 1
please input your 1
<s> 8
<s> it 2
<s> it 's 2
<s> the 1
<s> the light 1
<s> we 1
<s> we want 1
...
up 1
up . 1
up . </s> 1
Do 1
Do you 1
Do you do 1
Here <s> and </s> mark the beginning and end of a sentence, respectively. Each line of the count file has the format:
a_z <tab> c(a_z)
a_z: an n-gram whose first word is a and whose last word is z, with _ standing for zero or more words in between
c(a_z): the count of a_z in the training corpus
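The counting step above can be sketched in a few lines of Python: pad each sentence with <s> and </s>, then count every n-gram of length 1 through order. This is a minimal illustration of the idea, not SRILM's actual implementation.

```python
from collections import Counter

def ngram_counts(sentences, order=3):
    """Count all 1..order-grams, padding each sentence with <s> and </s>
    the way ngram-count does. A sketch, not SRILM's exact code."""
    counts = Counter()
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["we want to have a table near the window ."])
print(counts[("we",)])        # 1
print(counts[("<s>", "we")])  # 1
```

Running this over all eight training sentences would reproduce counts such as "<s> 8" and "<s> we want 1" from the file above.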
2. Building a language model from a count file
Command:
ngram-count -read train.data.count -order 3 -lm train.lm
Here -read points to the input file, in this case train.data.count; -order is as above; -lm names the output file for the trained language model, here train.lm. The command can be followed by options selecting a specific discounting method, combined with backoff or interpolation: for example, appending -interpolate -kndiscount selects interpolated smoothing with modified Kneser-Ney discounting. If nothing is specified, the default is Good-Turing discounting with Katz backoff, which is what we use here.
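The default Good-Turing method discounts a raw count c to c* = (c+1) * N_{c+1} / N_c, where N_r is the number of distinct n-grams seen exactly r times; the probability mass freed up is what Katz backoff redistributes to unseen n-grams. The sketch below shows only this core formula, applied to small counts (SRILM discounts counts up to a gtmax cutoff, commonly 7), and omits SRILM's additional normalization:

```python
from collections import Counter

def good_turing_discounted(counts, max_c=7):
    """Good-Turing discounted counts: c* = (c+1) * N_{c+1} / N_c.
    Applied only to counts <= max_c and only when N_{c+1} > 0;
    a simplified sketch, not SRILM's full implementation."""
    # N_r = number of distinct n-grams seen exactly r times
    freq_of_freq = Counter(counts.values())
    discounted = {}
    for ngram, c in counts.items():
        if 1 <= c <= max_c and freq_of_freq.get(c + 1, 0) > 0:
            discounted[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            discounted[ngram] = float(c)
    return discounted

# Toy counts: three singletons (N_1 = 3), one pair (N_2 = 1)
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3}
print(good_turing_discounted(counts)["a"])  # (1+1) * N_2/N_1 = 2/3
```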
Part of the resulting train.lm file:
\data\
ngram 1=75
ngram 2=106
ngram 3=2
\1-grams:
-1.770852 'll -0.03208452
-1.770852 's -0.02453138
-1.770852 , -0.4659371
-1.770852 - -0.02832437
-1.030489 . -0.5141692
...
\2-grams:
-1.361728 'll bring
-1.361728 'll take
-1.361728 's just
-1.361728 's over
-0.1760913 , just
-1.361728 - and
-1.361728 - blue
...
\3-grams:
-0.1760913 . I 'll
-0.1760913 <s> it 's
The file format is:
log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))
where f(a_z) is the conditional probability P(z|a_) and bow(a_z) is the backoff weight. That is, the first column is the base-10 logarithm of the conditional probability P(z|a_), the second column is the n-gram itself, and the third column is the base-10 logarithm of the backoff weight, which distributes probability mass to unseen (n+1)-grams extending this context.
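Those two columns are all a Katz-style backoff model needs at query time: if the n-gram a_z is listed, use f(a_z) directly; otherwise multiply the backoff weight of the context a_ by the probability of z under the shortened context. A minimal sketch of that lookup, using a toy model with made-up log10 values rather than the actual train.lm entries:

```python
def log10_prob(model, ngram):
    """Katz-style backoff lookup. `model` maps n-gram tuples to
    (log10 prob, log10 backoff weight); a missing backoff weight
    is treated as 0 (i.e. weight 1). Returns log10 P(z | context)."""
    if ngram in model:
        return model[ngram][0]
    if len(ngram) == 1:
        return float("-inf")  # unseen unigram: an OOV word
    context = ngram[:-1]
    bow = model[context][1] if context in model else 0.0
    return bow + log10_prob(model, ngram[1:])  # back off to shorter context

# Toy model with made-up values, stored in log10 as in the ARPA file:
model = {
    ("just",): (-1.2, -0.1),
    ("down",): (-1.5, 0.0),
    ("just", "down"): (-0.5, 0.0),
    ("'s", "just"): (-1.36, -0.1),
}
# Trigram "'s just down" is unseen, so: bow("'s just") + log10 P(down | just)
print(log10_prob(model, ("'s", "just", "down")))  # -0.1 + (-0.5) = -0.6
```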
3. Combining the two steps
More commonly, the language model is built directly from the training corpus, combining the two steps above:
ngram-count -text train.data -lm train.lm
No -order is given here, so it defaults to 3; no discounting method is specified, so the default Good-Turing discounting with Katz backoff is used.
II. ngram: test-set perplexity
Suppose the test set is a file named test.data, as follows:
we want to have a table near the window .
read a list of sentence .
First run the following command:
ngram -ppl test.data -order 3 -lm train.lm
The -ppl option does two main things: it computes the total log probability of the test set (log10 P(T), where P(T) is the product of the probabilities of all sentences), and it computes the test-set perplexity, reported as two figures, ppl and ppl1.
The terminal output is:
file test.data: 2 sentences, 16 words, 3 OOVs
0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603
The first line says: 2 sentences, 16 words, 3 out-of-vocabulary (OOV) words.
The second line says: no zero-probability words, log10 P(T) = -17.9098, ppl = 15.6309, ppl1 = 23.8603.
ppl and ppl1 are computed as follows (per http://www.52nlp.cn/language-model-training-tools-srilm-details):
ppl = 10^{-logP(T)/(Sen+Word)};  ppl1 = 10^{-logP(T)/Word}
where Sen is the number of sentences and Word is the number of words that actually entered the probability computation, i.e. excluding OOVs and zero-probability words (here Word = 16 - 3 = 13 and Sen = 2).
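These formulas can be checked directly against the output above. The sketch below recomputes ppl and ppl1 from the fields that ngram -ppl reports (logprob, sentence count, word count, OOVs, zeroprobs):

```python
def ppl(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute SRILM's ppl/ppl1 from the -ppl output fields.
    The denominator excludes OOVs and zero-probability words;
    ppl additionally counts one </s> event per sentence."""
    n = words - oovs - zeroprobs
    return (10 ** (-logprob / (n + sentences)),  # ppl
            10 ** (-logprob / n))                # ppl1

p, p1 = ppl(logprob=-17.9098, sentences=2, words=16, oovs=3)
print(round(p, 2), round(p1, 2))  # 15.63 23.86, matching the output above
```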
For more detailed output, add the option -debug N with N from 0 to 4; -debug 0 corresponds to the default behaviour shown here. For example, running ngram -ppl test.data -order 3 -lm train.lm -debug 1 prints per-sentence statistics:
reading 75 1-grams
reading 106 2-grams
reading 2 3-grams
we want to have a table near the window .
1 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -12.4723 ppl= 13.6096 ppl1= 17.6697
read a list of sentence .
1 sentences, 6 words, 3 OOVs
0 zeroprobs, logprob= -5.43749 ppl= 22.8757 ppl1= 64.9379
file test.data: 2 sentences, 16 words, 3 OOVs
0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603