A Toolkit for Language Modeling: Notes on Using SRILM


References:


SRILM installation: http://blog.csdn.net/zhoubl668/article/details/7759370

SRILM usage: http://hi.baidu.com/keyever/item/8fad8918b90b8e6b3f87ce87

Paper: SRILM - An Extensible Language Modeling Toolkit


For those interested in going deeper:


SRILM source code architecture analysis: http://download.csdn.net/download/yqzhao/4546985

SRILM source code reading series: http://blog.chinaunix.net/uid/20658401/cid-67529-list-1.html

SRILM discounting algorithms: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html



Two Core Modules


The SRILM toolkit has two core modules: ngram-count, which builds a language model from training data, and ngram, which evaluates a language model (i.e., computes the perplexity of a test set).




I. ngram-count



The ngram-count module offers a range of counting functionality: you can generate a count file from the training corpus on its own, then read that count file back to build a language model, or do both steps in a single run.


Suppose the training corpus is a file named train.data with the following contents:


it 's just down the hall . I 'll bring you some now . if there is anything else you need , just let me know .

No worry about that . I 'll take it and you need not wrap it up .

Do you do alterations ?

the light was red .

we want to have a table near the window .

it 's over there , just in front of the tourist information .

I twisted it playing tennis . it felt Okay after the game but then it started turning black - and - blue . is it serious ?

please input your pin number .



1. Counting: generating the count file


Command:


ngram-count -text train.data -order 3 -write train.data.count


Here -text specifies the input file, -order 3 requests trigram counts, and -write specifies the output count file. If the -order option is omitted, it defaults to 3 (trigrams).


Part of the resulting train.data.count file is shown below:


please 1

please input 1

please input your 1

<s> 8

<s> it 2

<s> it 's 2

<s> the 1

<s> the light 1

<s> we 1

<s> we want 1

…...

up 1

up . 1

up . </s> 1

Do 1

Do you 1

Do you do 1


Here <s> and </s> mark the beginning and end of a sentence, respectively. The count file format is:


a_z <tab> c(a_z)


a_z: a is the first word of the n-gram, z is the last word, and _ stands for zero or more words between a and z.

c(a_z): the count of a_z in the training corpus.
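
The count file is easy to post-process. Below is a minimal Python sketch (an illustration based on the tab-separated format just described, not part of SRILM) that loads the counts into a dictionary:

# Minimal sketch: load an SRILM count file (format: a_z <tab> c(a_z)).
counts = {}
with open("train.data.count", encoding="utf-8") as f:
    for line in f:
        ngram, count = line.rstrip("\n").rsplit("\t", 1)
        counts[ngram] = int(count)

print(counts.get("please input your"))  # prints 1 for the example above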





2. Building a language model from the count file


Command:


ngram-count -read train.data.count -order 3 -lm train.lm


Here -read specifies the input file, in this case train.data.count; -order is as above; and -lm specifies the output file for the trained language model, here train.lm. The command can be followed by a specific discounting algorithm, combined with back-off or interpolation: for example, appending -interpolate -kndiscount selects interpolation smoothing with modified Kneser-Ney discounting. If nothing is specified, Good-Turing discounting with Katz back-off is used; we use that default here.
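
For instance, the Kneser-Ney variant of the command above would look like this (the output name train.kn.lm is just an illustrative choice):

ngram-count -read train.data.count -order 3 -lm train.kn.lm -interpolate -kndiscount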



Part of the resulting train.lm file is shown below:


\data\

ngram 1=75

ngram 2=106

ngram 3=2



\1-grams:

-1.770852 'll -0.03208452

-1.770852 's -0.02453138

-1.770852 , -0.4659371

-1.770852 - -0.02832437

-1.030489 . -0.5141692


…...


\2-grams:

-1.361728 'll bring

-1.361728 'll take

-1.361728 's just

-1.361728 's over

-0.1760913 , just

-1.361728 - and

-1.361728 - blue


…...

\3-grams:

-0.1760913 . I 'll

-0.1760913 <s> it 's



The file format is:

log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))


Note: f(a_z) is the conditional probability P(z|a_), and bow(a_z) is the back-off weight.


The first column is the base-10 logarithm of the conditional probability P(z|a_), the second column is the n-gram itself, and the third column is the base-10 logarithm of the back-off weight (which contributes probability mass to unseen (n+1)-grams).
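
To make the back-off mechanics concrete, here is a minimal Python sketch of the standard ARPA lookup rule (the dictionaries logprob and logbow are assumed to have been parsed from the columns above; this illustrates the file format, it is not SRILM's own code):

def log10_prob(words, logprob, logbow):
    # Standard ARPA back-off: if the n-gram a_z is listed, use its log10
    # probability directly; otherwise back off to the shorter context:
    # log10 P(z|a_) = log10 bow(a_) + log10 P(z|_).
    # A context with no listed bow counts as bow = 1 (log10 = 0.0).
    ngram = " ".join(words)
    if ngram in logprob:
        return logprob[ngram]
    if len(words) == 1:
        return float("-inf")  # unseen unigram; real setups handle this via <unk>
    context = " ".join(words[:-1])
    return logbow.get(context, 0.0) + log10_prob(words[1:], logprob, logbow)

# Toy example using entries visible in train.lm above (the second call just
# exercises the back-off branch):
logprob = {"<s> it 's": -0.1760913, "'s": -1.770852, "'ll": -1.770852}
logbow = {"'s": -0.02453138}
print(log10_prob(["<s>", "it", "'s"], logprob, logbow))  # listed: -0.1760913
print(log10_prob(["'s", "'ll"], logprob, logbow))        # -0.02453138 + (-1.770852)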




3. Combining the two steps


In practice the language model is usually built directly from the training corpus, combining the two steps above. The command:


ngram-count -text train.data -lm train.lm

-order is omitted here, so it defaults to 3; no discounting algorithm is specified, so Good-Turing discounting with Katz back-off is used by default.
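
If you also want to keep the intermediate count file in the same run, -write can be combined with -lm (both options were used separately above):

ngram-count -text train.data -order 3 -write train.data.count -lm train.lm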





II. ngram: computing test-set perplexity


Suppose the test set is a file test.data, as follows:


we want to have a table near the window .

read a list of sentence .


First, run the following command:


ngram -ppl test.data -order 3 -lm train.lm

The -ppl option performs two main tasks: it computes the log probability of the sentences (log10 P(T), where P(T) is the product of the probabilities of all sentences), and it computes the test-set perplexity, reported as two metrics, ppl and ppl1.


The terminal output is:

file test.data: 2 sentences, 16 words, 3 OOVs

0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603


The first line of the output: 2 sentences, 16 words, 3 out-of-vocabulary (OOV) words.

The second line: no zero-probability words; log P(T) = -17.9098, ppl = 15.6309, ppl1 = 23.8603.

ppl and ppl1 are computed as follows (per http://www.52nlp.cn/language-model-training-tools-srilm-details):


ppl = 10^{-log10 P(T) / (Sen + Word)};  ppl1 = 10^{-log10 P(T) / Word}

where Sen and Word are the numbers of sentences and words, respectively (OOV and zero-probability words are excluded from Word).
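
A quick sanity check with the numbers reported above (here Word = 16 - 3 OOVs = 13 and Sen = 2):

# Verify ppl and ppl1 from the ngram output above.
logprob_T = -17.9098
sen, word = 2, 16 - 3  # OOV words are excluded from the word count

ppl = 10 ** (-logprob_T / (sen + word))  # ~15.63
ppl1 = 10 ** (-logprob_T / word)         # ~23.86
print(round(ppl, 4), round(ppl1, 4))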


To get more detailed output, add the -debug option (levels 0 through 4) to the command above; -debug 0 corresponds to the default behavior shown here. For example, ngram -ppl test.data -order 3 -lm train.lm -debug 1 produces the following output:

reading 75 1-grams

reading 106 2-grams

reading 2 3-grams

we want to have a table near the window .

1 sentences, 10 words, 0 OOVs

0 zeroprobs, logprob= -12.4723 ppl= 13.6096 ppl1= 17.6697


read a list of sentence .

1 sentences, 6 words, 3 OOVs

0 zeroprobs, logprob= -5.43749 ppl= 22.8757 ppl1= 64.9379


file test.data: 2 sentences, 16 words, 3 OOVs

0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603 
