Setting up and running a statistical machine translation system based on IRSTLM, GIZA++ and Moses on 32-bit Ubuntu 15.04

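Setting up Moses involves several modules that call one another, and their versions have to match. In my experience, mixing different module versions is very likely to make the build fail. After a little under a week of persistence I finally got it working; the exact versions I used are listed below, and I recommend using the same ones. PS: I did this on a freshly installed system, and the test case is Chinese -> English translation ^_^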

System: Ubuntu 15.04, 32-bit

Moses version: mosesdecoder-pre-MMT, released 2015-02-28
g++ version: 4.9.2
Boost version: the one installed by sudo apt-get install libboost-all-dev on 2015-08-17
IRSTLM version: 5.80.03
GIZA++ version: the revision obtained on 2015-08-17 with git clone https://github.com/moses-smt/giza-pp.git

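I. Before installing Moses, the system needs the G++ compiler and the Boost libraries.
1. Install G++: by default Ubuntu does not provide a C/C++ build environment, so it has to be installed by hand. Installing gcc and g++ one by one is tedious; fortunately Ubuntu provides the build-essential package, originally so that the Ubuntu kernel could be compiled. Installing it pulls in the packages needed to compile C/C++, so that is all you need in order to build C/C++ programs on Ubuntu. Install it with: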
sudo apt-get install build-essential

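2. Install the Boost libraries: sudo apt-get install libboost-all-dev
To check that the installation worked, create test.cpp in any directory:
#include <iostream>
#include <boost/bind.hpp>
using namespace std;
using namespace boost;

// Function to be bound: returns the sum of its two arguments.
int fun(int x, int y) { return x + y; }

int main() {
    int m = 1;
    int n = 2;
    // Bind fun to two placeholders, then call the result with m and n.
    cout << boost::bind(fun, _1, _2)(m, n) << endl;
    return 0;
}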

Build and run it (root privileges are not needed for this):
g++ test.cpp -o test
./test
The output should be: 3

which means Boost is installed correctly.

II. Install IRSTLM: run steps 2 and 3 first; only if they fail, run step 1 as well.
1. Download zlib-1.2.8 from http://zlib.net/ and unpack it into your home directory.

Run:

cd zlib-1.2.8

sudo ./configure
sudo make install
sudo make test
2. Two tools need to be installed before IRSTLM can be compiled.

Run:

sudo apt-get install automake

sudo apt-get install libtool
3. Download irstlm-5.80.03.tgz from http://sourceforge.net/projects/irstlm/files/ and unpack it into your home directory (resulting folder: irstlm-5.80.03).

Run:

tar zxvf irstlm-5.80.03.tgz

cd ~/irstlm-5.80.03
sudo ./regenerate-makefiles.sh
sudo ./configure --prefix=$HOME/irstlm

sudo make install

III. Install GIZA++:

git clone https://github.com/moses-smt/giza-pp.git

cd giza-pp
sudo make
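If the build succeeds, the commands above produce three executables: ~/giza-pp/GIZA++-v2/GIZA++, ~/giza-pp/GIZA++-v2/snt2cooc.out and ~/giza-pp/mkcls-v2/mkcls.
We copy them into a dedicated directory, external-bin-dir, so that Moses knows where to find them when it trains the translation model in step VIII.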
cd ~
mkdir external-bin-dir
cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out ~/giza-pp/mkcls-v2/mkcls external-bin-dir
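
Before moving on it can help to confirm that the copy worked. The sketch below is a hypothetical helper (the name check_giza_bins is not part of GIZA++ or Moses) that lists any of the three binaries that are missing or not executable in a given directory:

```shell
# Hypothetical helper, not part of GIZA++ or Moses.
# usage: check_giza_bins DIR
# Prints one line per missing or non-executable binary; returns 0 only if all are present.
check_giza_bins() {
    dir=$1
    missing=0
    for name in GIZA++ snt2cooc.out mkcls; do
        if [ ! -x "$dir/$name" ]; then
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return "$missing"
}
```

Called as check_giza_bins ~/external-bin-dir, it prints nothing and returns 0 when all three files are in place.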

IV. Install Moses:

From https://github.com/moses-smt/mosesdecoder/releases download the mosesdecoder-pre-MMT release of 2015-02-28 and unpack it into the mosesdecoder folder in your home directory.

Run:

cd ~/mosesdecoder

./bjam --with-irstlm=$HOME/irstlm --with-giza=giza-pp -j4

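When the build succeeds, the last lines in the terminal report SUCCESS. If there are errors or the build fails, track down the cause and get it to compile; otherwise training the translation model in step VIII will not work.
A bin directory now appears under ~/mosesdecoder, containing the various executables.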

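V. Getting the training corpus (it can be downloaded; in most cases you have to put it together yourself, and this article uses my own corpus): Corpus Preparation

Run: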

mkdir corpus
cd corpus
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar zxvf training-parallel-nc-v8.tgz
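VI. Corpus preprocessing:
Note: from here on the commands use my own directory layout, in which everything lives under ~/Dapeng; adjust the paths to match where you installed things.
1. Tokenisation: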
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <~/Dapeng/corpus/training/dictionary.ch-en.ch>~/Dapeng/corpus/dictionary.ch-en.ch.tok.ch
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/Dapeng/corpus/training/dictionary.ch-en.en > ~/Dapeng/corpus/dictionary.ch-en.en.tok.en

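2. Truecasing (restoring each word to its natural case, which reduces data sparsity)

Train the truecaser, which collects word statistics: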
~/Dapeng/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/Dapeng/corpus/truecase-model.ch --corpus ~/Dapeng/corpus/dictionary.ch-en.ch.tok.ch
~/Dapeng/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/Dapeng/corpus/truecase-model.en --corpus ~/Dapeng/corpus/dictionary.ch-en.en.tok.en

Apply it:
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model ~/Dapeng/corpus/truecase-model.ch<~/Dapeng/corpus/dictionary.ch-en.ch.tok.ch>~/Dapeng/corpus/dictionary.ch-en.true.ch

~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model ~/Dapeng/corpus/truecase-model.en<~/Dapeng/corpus/dictionary.ch-en.en.tok.en>~/Dapeng/corpus/dictionary.ch-en.true.en
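
To make the idea concrete, here is a toy sketch of truecasing in shell and awk. It is not the Moses implementation: truecase_sketch is a made-up name, and it assumes the simplified rule that the most frequent casing of a word, counted away from the start of a line, is used to rewrite the first word of each line. It also assumes a non-empty training file with single-space-separated tokens.

```shell
# Toy illustration only; train-truecaser.perl and truecase.perl do more than this.
# usage: truecase_sketch TRAIN_FILE < input > output
truecase_sketch() {
    awk '
        # Pass 1 (training file): count each surface form of every word,
        # skipping the first word of a line because its case is unreliable.
        NR == FNR {
            for (i = 2; i <= NF; i++) {
                key = tolower($i)
                count[key, $i]++
                if (count[key, $i] > best_count[key]) {
                    best_count[key] = count[key, $i]
                    best_form[key] = $i
                }
            }
            next
        }
        # Pass 2 (standard input): rewrite only the first word of each line.
        {
            key = tolower($1)
            if (key in best_form) $1 = best_form[key]
            print
        }
    ' "$1" -
}
```

With a training file in which "the" mostly appears in lower case, a line such as "The dog barked" comes out as "the dog barked", while a first word never seen in training is left alone.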

3. Cleaning (drop lines with fewer than 2 or more than 60 words):

~/Dapeng/mosesdecoder/scripts/training/clean-corpus-n.perl ~/Dapeng/corpus/dictionary.ch-en.true en ch ~/Dapeng/corpus/dictionary.ch-en.clean 2 60
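
The length filter described here can be sketched as follows. This is a simplified stand-in, not the Moses script: filter_pairs is a hypothetical name, the sketch only applies the minimum and maximum length test to both sides of each sentence pair, and it assumes the sentences contain no tab characters. The real clean-corpus-n.perl performs additional checks that are left out here.

```shell
# Simplified stand-in for the length filter; hypothetical helper.
# usage: filter_pairs SRC TGT OUT_SRC OUT_TGT MIN MAX
filter_pairs() {
    # Start with empty output files so they exist even if nothing is kept.
    : > "$3"
    : > "$4"
    # Join the two files line by line, then keep a pair only when both
    # sides have between MIN and MAX whitespace-separated tokens.
    paste "$1" "$2" | awk -F '\t' -v min="$5" -v max="$6" \
        -v out_src="$3" -v out_tgt="$4" '
        {
            n_src = split($1, a, " ")
            n_tgt = split($2, b, " ")
            if (n_src >= min && n_src <= max && n_tgt >= min && n_tgt <= max) {
                print $1 > out_src
                print $2 > out_tgt
            }
        }'
}
```

Filtering both sides together matters: dropping a line from only one file would shift every later sentence pair out of alignment.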

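VII. Language model training: the language model is there to keep the output fluent, so it is trained on the target language only, which in this experiment is English.
1. Preprocessing; run: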
mkdir ~/Dapeng/lm
cd ~/Dapeng/lm
~/Dapeng/irstlm/bin/add-start-end.sh <~/Dapeng/corpus/dictionary.ch-en.true.en>dictionary.ch-en.sb.en
export IRSTLM=$HOME/Dapeng/irstlm
~/Dapeng/irstlm/bin/build-lm.sh -i dictionary.ch-en.sb.en -t ./tmp -p -s improved-kneser-ney -o dictionary.ch-en.lm.en
~/Dapeng/irstlm/bin/compile-lm --text=yes dictionary.ch-en.lm.en.gz dictionary.ch-en.arpa.en
2. Binarize the language model
~/Dapeng/mosesdecoder/bin/build_binary dictionary.ch-en.arpa.en dictionary.ch-en.blm.en
3. Test the language model
echo "how old are you" | ~/Dapeng/mosesdecoder/bin/query dictionary.ch-en.blm.en
The output has the following form (this sample was produced for the sentence "is this an English sentence ?"):
is=35 2 -2.67038  this=287 3 -0.889891  an=295 3 -2.25232  English=7284 1 -5.277988  sentence=4468 2 -2.69927  ?=65 1 -3.3272662  </s>=21 2 -0.0308079  Total: -17.147924 OOV: 0
Perplexity including OOVs: 281.6459360332587
Perplexity excluding OOVs: 281.6459360332587
OOVs: 0
Tokens: 7

Name:query  VmPeak:35956 kB  VmRSS:3296 kB  RSSMax:32796 kB  user:0  sys:0.004  CPU:0.004  real:0.00516975
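
The perplexity line can be checked by hand from the Total and Tokens values: it is 10 raised to the negative total log10 probability divided by the number of tokens. A small sketch (perplexity is a hypothetical helper name; the numbers are the ones from the sample output above):

```shell
# Hypothetical helper: recompute perplexity from a total log10 probability
# and a token count (the count includes the </s> token).
# usage: perplexity TOTAL_LOG10_PROB TOKEN_COUNT
perplexity() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 10 ^ (-total / n) }'
}

perplexity -17.147924 7   # prints 281.65
```

The result agrees with the reported perplexity of about 281.65, which is a quick way to confirm how the summary lines relate to the per-word scores.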

VIII. Train the translation model:
Run:
mkdir ~/Dapeng/working
cd ~/Dapeng/working
nohup nice ~/Dapeng/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ~/Dapeng/corpus/dictionary.ch-en.clean -f ch -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/Dapeng/lm/dictionary.ch-en.blm.en -external-bin-dir ~/Dapeng/external-bin-dir >&training.out&
When this step succeeds, a moses.ini file is generated in the train/model folder.

IX. Tuning the weights
1. Download a development set (a small, high-quality parallel corpus; the one used in this article is again my own)
cd ~/Dapeng/corpus
wget http://www.statmt.org/wmt12/dev.tgz
tar zxvf dev.tgz

2. Preprocess the development set:
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <dev/dev.ch> dev.ch-en.tok.ch
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <dev/dev.en> dev.ch-en.tok.en
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.ch <dev.ch-en.tok.ch> dev.ch-en.true.ch
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en <dev.ch-en.tok.en> dev.ch-en.true.en
3. Tune the parameters
cd ~/Dapeng/working
nohup nice ~/Dapeng/mosesdecoder/scripts/training/mert-moses.pl ~/Dapeng/corpus/dev.ch-en.true.ch ~/Dapeng/corpus/dev.ch-en.true.en ~/Dapeng/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/Dapeng/mosesdecoder/bin &> mert.out &
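This step also takes an extremely long time. When it succeeds, the end of mert.out reports the moses.ini that was generated; use it to update the original moses.ini configuration file.
X. Decoding (translation) and measuring BLEU
1. Preprocess the test corpus
cd ~/Dapeng/corpus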
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <test/test.ch> test.ch-en.tok.ch
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en <test/test.en> test.ch-en.tok.en
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.ch <test.ch-en.tok.ch> test.ch-en.true.ch
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en <test.ch-en.tok.en> test.ch-en.true.en
2. Run the translation; the result is stored in ~/Dapeng/working/test.ch-en.translated.en
~/Dapeng/mosesdecoder/bin/moses -f ~/Dapeng/working/train/model/moses.ini <~/Dapeng/corpus/test.ch-en.true.ch > ~/Dapeng/working/test.ch-en.translated.en
3. Measure BLEU
~/Dapeng/mosesdecoder/scripts/generic/multi-bleu.perl -lc ~/Dapeng/corpus/test.ch-en.true.en < ~/Dapeng/working/test.ch-en.translated.en
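
The BLEU score is only meaningful if the reference and the translation line up sentence by sentence, so it is worth checking that the two files have the same number of lines before scoring. A minimal sketch, with same_line_count as a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
# usage: same_line_count REFERENCE HYPOTHESIS
same_line_count() {
    # Arithmetic expansion strips any padding that wc adds around the number.
    ref_lines=$(( $(wc -l < "$1") ))
    hyp_lines=$(( $(wc -l < "$2") ))
    [ "$ref_lines" -eq "$hyp_lines" ]
}
```

For example: same_line_count ~/Dapeng/corpus/test.ch-en.true.en ~/Dapeng/working/test.ch-en.translated.en || echo "line counts differ". A mismatch usually means the decoder stopped early or a preprocessing step dropped lines.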
