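Today I wanted to use KenLM to train an n-gram model and found that installing it is not entirely straightforward, so here is the installation process I followed: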
- Install the dependencies
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
- Download the code
git clone https://github.com/kpu/kenlm.git
- Build
cd kenlm
mkdir -p build && cd build
cmake ..
make -j 4
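Before moving on, it can help to confirm that the build actually produced the command-line tools. The snippet below is a minimal sketch: `check_kenlm_tools` is a hypothetical helper name, and the list of tools it looks for (`lmplz`, `build_binary`, `query`) is an assumption about which executables you care about.

```shell
# Report which of the expected KenLM executables exist in a given bin directory.
# Returns non-zero if any of them is missing.
# (check_kenlm_tools is a hypothetical helper, not part of KenLM.)
check_kenlm_tools() {
    bin_dir="$1"
    missing=0
    for tool in lmplz build_binary query; do
        if [ -x "$bin_dir/$tool" ]; then
            echo "found:   $bin_dir/$tool"
        else
            echo "missing: $bin_dir/$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Run from inside kenlm/build after make finishes.
check_kenlm_tools bin || echo "build looks incomplete"
```

If anything is reported missing, rerunning `make` without `-j` usually makes the first compile error easier to spot.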
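How can we test it?
Create a text file named text.txt in the build directory, with one sentence per line and tokens separated by spaces: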
我 是 個 程序員
我 們 都 是 中國人
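One way to create this file from the shell is a here-document, as sketched below. The sanity check at the end is my own addition; it only counts lines and whitespace-separated tokens so you can confirm the file was written as intended.

```shell
# Write the two-sentence sample corpus: one sentence per line,
# tokens separated by single spaces.
cat > text.txt <<'EOF'
我 是 個 程序員
我 們 都 是 中國人
EOF

# Sanity check: number of lines and number of whitespace-separated tokens.
wc -l < text.txt
awk '{ n += NF } END { print n, "tokens" }' text.txt
```

The file should contain 2 lines and 9 tokens.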
Then run the following command:
bin/lmplz -o 3 --verbose header --text text.txt --arpa test.arpa --discount_fallback
Check whether it runs successfully. My log was:
test@test-X10DAi:~/Documents/kenlm/build$ bin/lmplz -o 3 --verbose header --text text.txt --arpa test.arpa --discount_fallback
=== 1/5 Counting and sorting n-grams ===
Reading /home/test/Documents/kenlm/build/text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 9 types 10
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:120 2:18728921088 3:35116728320
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
Statistics:
1 10 D1=0.5 D2=1 D3+=1.5
2 10 D1=0.5 D2=1 D3+=1.5
3 9 D1=0.5 D2=1 D3+=1.5
Memory estimate for binary LM:
type B
probing 672 assuming -p 1.5
probing 776 assuming -r models -p 1.5
trie 438 without quantization
trie 3424 assuming -q 8 -b 8 quantization
trie 461 assuming -a 22 array pointer compression
trie 3447 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:120 2:160 3:180
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++********************************************************************************####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:120 2:160 3:180
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++********************************************************************************####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
--------------------------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++****************************************************************************************************
Name:lmplz VmPeak:52757736 kB VmRSS:6380 kB RSSMax:12150804 kB user:1.51267 sys:3.42447 CPU:4.9372 real:5.01352
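After a run like the one logged above, a quick follow-up check is to look at the `\data\` header of the generated ARPA file, which lists how many n-grams of each order were written. The sketch below uses a hypothetical helper, `arpa_counts`, to print those header lines. The last part is optional and assumes `bin/build_binary` and `bin/query` were built alongside `lmplz`; the output name `test.binary` is my own choice.

```shell
# Print the n-gram counts declared in the \data\ header of an ARPA file.
# (arpa_counts is a hypothetical helper, not part of KenLM.)
arpa_counts() {
    # Header lines look like "ngram 1=10"; stop once the "\1-grams:" section begins.
    awk '/^\\1-grams:/ { exit } /^ngram [0-9]+=/ { print }' "$1"
}

# Inspect the model produced by lmplz, if it exists in the current directory.
if [ -f test.arpa ]; then
    arpa_counts test.arpa
fi

# Optional: convert the ARPA file to KenLM's binary format and score a sentence.
if [ -x bin/build_binary ] && [ -x bin/query ] && [ -f test.arpa ]; then
    bin/build_binary test.arpa test.binary
    echo "我 是 個 程序員" | bin/query test.binary
fi
```

For this corpus, the counts in the header should agree with the Statistics lines in the log (orders 1, 2, and 3).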