References:
SRILM installation: http://blog.csdn.net/zhoubl668/article/details/7759370
SRILM usage: http://hi.baidu.com/keyever/item/8fad8918b90b8e6b3f87ce87
Paper: SRILM - An Extensible Language Modeling Toolkit
For further reading:
SRILM source-code overview: http://download.csdn.net/download/yqzhao/4546985
SRILM source-code reading series: http://blog.chinaunix.net/uid/20658401/cid-67529-list-1.html
SRILM discounting algorithms: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
Two core modules
The SRILM toolkit has two core modules: ngram-count, which builds a language model from training data, and ngram, which evaluates a language model by computing the perplexity of a test set.
I. ngram-count
The ngram-count module offers a range of counting features: it can produce a count file from the training corpus on its own, build a language model from an existing count file, or do both steps in one go.
Suppose the training corpus is a file named train.data with the following contents:
it 's just down the hall . I 'll bring you some now . if there is anything else you need , just let me know .
No worry about that . I 'll take it and you need not wrap it up .
Do you do alterations ?
the light was red .
we want to have a table near the window .
it 's over there , just in front of the tourist information .
I twisted it playing tennis . it felt Okay after the game but then it started turning black - and - blue . is it serious ?
please input your pin number .
1. Counting: generating a count file
Command:
ngram-count -text train.data -order 3 -write train.data.count
Here -text points to the input file, -order 3 requests trigram (3-gram) counts, and -write names the output count file. If -order is not specified, the default is 3.
Part of the resulting train.data.count looks like this:
please 1
please input 1
please input your 1
<s> 8
<s> it 2
<s> it 's 2
<s> the 1
<s> the light 1
<s> we 1
<s> we want 1
...
up 1
up . 1
up . </s> 1
Do 1
Do you 1
Do you do 1
Here <s> and </s> mark the beginning and end of a sentence, respectively. Each line of the count file has the format:
a_z <tab> c(a_z)
a_z: an n-gram whose first word is a and whose last word is z, with _ standing for zero or more words in between
c(a_z): the count of a_z in the training corpus
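The counting step above can be sketched in a few lines of Python: pad each sentence with <s> and </s>, then count every n-gram of length 1 through order. This is a minimal illustration of the idea, not SRILM's actual implementation.

```python
from collections import Counter

def ngram_counts(sentences, order=3):
    """Count all 1..order-grams, padding each sentence with <s> and </s>
    the way ngram-count does. A sketch, not SRILM's exact code."""
    counts = Counter()
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["we want to have a table near the window ."])
print(counts[("we",)])        # 1
print(counts[("<s>", "we")])  # 1
```

Running this over all eight training sentences would reproduce counts such as "<s> 8" and "<s> we want 1" from the file above.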
2. Building a language model from a count file
Command:
ngram-count -read train.data.count -order 3 -lm train.lm
Here -read points to the input file, in this case train.data.count; -order is as above; -lm names the output file for the trained language model, here train.lm. The command can be followed by options selecting a specific discounting method, combined with backoff or interpolation: for example, appending -interpolate -kndiscount selects interpolated smoothing with modified Kneser-Ney discounting. If nothing is specified, the default is Good-Turing discounting with Katz backoff, which is what we use here.
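The default Good-Turing method discounts a raw count c to c* = (c+1) * N_{c+1} / N_c, where N_r is the number of distinct n-grams seen exactly r times; the probability mass freed up is what Katz backoff redistributes to unseen n-grams. The sketch below shows only this core formula, applied to small counts (SRILM discounts counts up to a gtmax cutoff, commonly 7), and omits SRILM's additional normalization:

```python
from collections import Counter

def good_turing_discounted(counts, max_c=7):
    """Good-Turing discounted counts: c* = (c+1) * N_{c+1} / N_c.
    Applied only to counts <= max_c and only when N_{c+1} > 0;
    a simplified sketch, not SRILM's full implementation."""
    # N_r = number of distinct n-grams seen exactly r times
    freq_of_freq = Counter(counts.values())
    discounted = {}
    for ngram, c in counts.items():
        if 1 <= c <= max_c and freq_of_freq.get(c + 1, 0) > 0:
            discounted[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            discounted[ngram] = float(c)
    return discounted

# Toy counts: three singletons (N_1 = 3), one pair (N_2 = 1)
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3}
print(good_turing_discounted(counts)["a"])  # (1+1) * N_2/N_1 = 2/3
```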
Part of the resulting train.lm file:
\data\
ngram 1=75
ngram 2=106
ngram 3=2
\1-grams:
-1.770852 'll -0.03208452
-1.770852 's -0.02453138
-1.770852 , -0.4659371
-1.770852 - -0.02832437
-1.030489 . -0.5141692
...
\2-grams:
-1.361728 'll bring
-1.361728 'll take
-1.361728 's just
-1.361728 's over
-0.1760913 , just
-1.361728 - and
-1.361728 - blue
...
\3-grams:
-0.1760913 . I 'll
-0.1760913 <s> it 's
The file format is:
log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))
where f(a_z) is the conditional probability P(z|a_) and bow(a_z) is the backoff weight. That is, the first column is the base-10 logarithm of the conditional probability P(z|a_), the second column is the n-gram itself, and the third column is the base-10 logarithm of the backoff weight, which distributes probability mass to unseen (n+1)-grams extending this context.
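Those two columns are all a Katz-style backoff model needs at query time: if the n-gram a_z is listed, use f(a_z) directly; otherwise multiply the backoff weight of the context a_ by the probability of z under the shortened context. A minimal sketch of that lookup, using a toy model with made-up log10 values rather than the actual train.lm entries:

```python
def log10_prob(model, ngram):
    """Katz-style backoff lookup. `model` maps n-gram tuples to
    (log10 prob, log10 backoff weight); a missing backoff weight
    is treated as 0 (i.e. weight 1). Returns log10 P(z | context)."""
    if ngram in model:
        return model[ngram][0]
    if len(ngram) == 1:
        return float("-inf")  # unseen unigram: an OOV word
    context = ngram[:-1]
    bow = model[context][1] if context in model else 0.0
    return bow + log10_prob(model, ngram[1:])  # back off to shorter context

# Toy model with made-up values, stored in log10 as in the ARPA file:
model = {
    ("just",): (-1.2, -0.1),
    ("down",): (-1.5, 0.0),
    ("just", "down"): (-0.5, 0.0),
    ("'s", "just"): (-1.36, -0.1),
}
# Trigram "'s just down" is unseen, so: bow("'s just") + log10 P(down | just)
print(log10_prob(model, ("'s", "just", "down")))  # -0.1 + (-0.5) = -0.6
```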
3. Combining the two steps
More commonly, the language model is built directly from the training corpus, combining the two steps above:
ngram-count -text train.data -lm train.lm
No -order is given here, so it defaults to 3; no discounting method is specified, so the default Good-Turing discounting with Katz backoff is used.
II. ngram: test-set perplexity
Suppose the test set is a file named test.data, as follows:
we want to have a table near the window .
read a list of sentence .
First run the following command:
ngram -ppl test.data -order 3 -lm train.lm
The -ppl option does two main things: it computes the total log probability of the test set (log10 P(T), where P(T) is the product of the probabilities of all sentences), and it computes the test-set perplexity, reported as two figures, ppl and ppl1.
The terminal output is:
file test.data: 2 sentences, 16 words, 3 OOVs
0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603
The first line says: 2 sentences, 16 words, 3 out-of-vocabulary (OOV) words.
The second line says: no zero-probability words, log10 P(T) = -17.9098, ppl = 15.6309, ppl1 = 23.8603.
ppl and ppl1 are computed as follows (per http://www.52nlp.cn/language-model-training-tools-srilm-details):
ppl = 10^{-logP(T)/(Sen+Word)};  ppl1 = 10^{-logP(T)/Word}
where Sen is the number of sentences and Word is the number of words that actually entered the probability computation, i.e. excluding OOVs and zero-probability words (here Word = 16 - 3 = 13 and Sen = 2).
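These formulas can be checked directly against the output above. The sketch below recomputes ppl and ppl1 from the fields that ngram -ppl reports (logprob, sentence count, word count, OOVs, zeroprobs):

```python
def ppl(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute SRILM's ppl/ppl1 from the -ppl output fields.
    The denominator excludes OOVs and zero-probability words;
    ppl additionally counts one </s> event per sentence."""
    n = words - oovs - zeroprobs
    return (10 ** (-logprob / (n + sentences)),  # ppl
            10 ** (-logprob / n))                # ppl1

p, p1 = ppl(logprob=-17.9098, sentences=2, words=16, oovs=3)
print(round(p, 2), round(p1, 2))  # 15.63 23.86, matching the output above
```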
For more detailed output, add the option -debug N with N from 0 to 4; -debug 0 corresponds to the default behaviour shown here. For example, running ngram -ppl test.data -order 3 -lm train.lm -debug 1 prints per-sentence statistics:
reading 75 1-grams
reading 106 2-grams
reading 2 3-grams
we want to have a table near the window .
1 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -12.4723 ppl= 13.6096 ppl1= 17.6697
read a list of sentence .
1 sentences, 6 words, 3 OOVs
0 zeroprobs, logprob= -5.43749 ppl= 22.8757 ppl1= 64.9379
file test.data: 2 sentences, 16 words, 3 OOVs
0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603