Setting up and running a statistical machine translation system with IRSTLM, GIZA++ and Moses on 32-bit Ubuntu 15.04

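Building a Moses environment involves several modules that call one another, and their versions have to match. In my experience, mixing different module versions is very likely to make the build fail. After a little under a week of persistence I finally got everything working. The exact versions that worked for me are listed below, and I recommend using the same ones. PS: I set this up on a freshly installed system, and the test case is Chinese -> English translation ^_^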

System: Ubuntu 15.04, 32-bit

Moses: mosesdecoder-pre-MMT, released 2015-02-28
g++: 4.9.2
Boost: the version installed by sudo apt-get install libboost-all-dev on 2015-08-17
IRSTLM: 5.80.03
GIZA++: the version obtained with git clone https://github.com/moses-smt/giza-pp.git on 2015-08-17
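
Since version mismatches were the main source of build failures, it can help to check which build tools are already present before starting. The following is only a small sketch; tool_status is a made-up helper name, and the list of tools is just an example.

```shell
# tool_status NAME: say whether NAME is available on the PATH.
# (Hypothetical helper for illustration only.)
tool_status() {
    if path=$(command -v "$1"); then
        echo "$1: $path"
    else
        echo "$1: not found"
    fi
}

# Example: check a few of the tools this guide relies on.
for tool in g++ make git; do
    tool_status "$tool"
done
```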

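Step 1. Before installing Moses, install the g++ compiler and the Boost libraries.
1. Install g++: By default Ubuntu does not ship a C/C++ build environment, so it has to be installed by hand. Installing gcc and g++ individually is tedious; fortunately, Ubuntu provides the build-essential package, originally so that the Ubuntu kernel could be compiled. Installing it pulls in everything needed to compile C/C++ programs, so this one package is all you need: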
sudo apt-get install build-essential

2. Install the Boost libraries: sudo apt-get install libboost-all-dev
To check the installation, create test.cpp in any directory:
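#include <iostream>
#include <boost/bind.hpp>
using namespace std;
using namespace boost;

// Function to be bound; returns the sum of its two arguments.
int fun(int x, int y) { return x + y; }

int main() {
    int m = 1;
    int n = 2;
    // Bind fun to the placeholders _1 and _2, then call it with m and n.
    cout << boost::bind(fun, _1, _2)(m, n) << endl;
    return 0;
}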

Build and run it:
g++ test.cpp -o test
./test
The output should be: 3

This means Boost was installed successfully.

Step 2. Install IRSTLM: run sub-steps 2 and 3 first; only if they fail, run sub-step 1.
1. Download zlib-1.2.8 from http://zlib.net/ and extract it into your home directory.

Run:

cd zlib-1.2.8

sudo ./configure
sudo make install
sudo make test
2. Two tools are needed before IRSTLM can be compiled:

Run:

sudo apt-get install automake

sudo apt-get install libtool
3. Download irstlm-5.80.03.tgz from http://sourceforge.net/projects/irstlm/files/ and extract it into your home directory (directory name: irstlm-5.80.03).

Run:

tar zxvf irstlm-5.80.03.tgz

cd ~/irstlm-5.80.03
sudo ./regenerate-makefiles.sh
sudo ./configure --prefix=$HOME/irstlm

sudo make install

Step 3. Install GIZA++:

git clone https://github.com/moses-smt/giza-pp.git

cd giza-pp
sudo make
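If the build succeeds, the commands above produce three executables: ~/giza-pp/GIZA++-v2/GIZA++, ~/giza-pp/GIZA++-v2/snt2cooc.out and ~/giza-pp/mkcls-v2/mkcls.
Copy them into a dedicated directory, external-bin-dir, so that Moses can find them when it trains the translation model in Step 8.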
cd ~
mkdir external-bin-dir
cp ~/giza-pp/GIZA++-v2/GIZA++ ~/giza-pp/GIZA++-v2/snt2cooc.out ~/giza-pp/mkcls-v2/mkcls external-bin-dir
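
Before moving on, it may be worth confirming that all three binaries really landed in external-bin-dir. The following is a minimal sketch; the function name check_giza_bins is made up for illustration and is not part of GIZA++ or Moses.

```shell
# check_giza_bins DIR: report any of the three GIZA++ tools missing from DIR.
# Returns 0 if all are present and executable, 1 otherwise.
# (Hypothetical helper; not part of GIZA++ or Moses.)
check_giza_bins() {
    dir=$1
    missing=0
    for tool in GIZA++ snt2cooc.out mkcls; do
        if [ ! -x "$dir/$tool" ]; then
            echo "missing: $dir/$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as: check_giza_bins ~/external-bin-dir && echo "all GIZA++ tools in place"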

Step 4. Install Moses:

Download the mosesdecoder-pre-MMT release dated 2015-02-28 from https://github.com/moses-smt/mosesdecoder/releases and extract it into a mosesdecoder folder in your home directory.

Run:

cd ~/mosesdecoder

./bjam --with-irstlm=$HOME/irstlm --with-giza=giza-pp -j4

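When the build succeeds, the last lines in the terminal report SUCCESS. If it reports errors or fails, be sure to track down the cause and get a clean build; otherwise training the translation model in Step 8 will not work.
A bin directory now appears under ~/mosesdecoder, containing the various executables.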

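Step 5. Obtaining the training corpus (one can be downloaded as shown here, but in most cases you have to assemble your own; this article uses my own corpus): Corpus Preparation

Run: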

mkdir corpus
cd corpus
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar zxvf training-parallel-nc-v8.tgz
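Step 6. Corpus preprocessing (the paths below use my own ~/Dapeng directory and corpus file names; adjust them to your setup):
1. Tokenisation: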
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/Dapeng/corpus/training/dictionary.ch-en.ch > ~/Dapeng/corpus/dictionary.ch-en.ch.tok.ch
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/Dapeng/corpus/training/dictionary.ch-en.en > ~/Dapeng/corpus/dictionary.ch-en.en.tok.en

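2. Truecasing (restores each word to its most likely casing, which reduces data sparsity)

Train the truecaser (collects casing statistics for each word):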
~/Dapeng/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/Dapeng/corpus/truecase-model.ch --corpus ~/Dapeng/corpus/dictionary.ch-en.ch.tok.ch
~/Dapeng/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/Dapeng/corpus/truecase-model.en --corpus ~/Dapeng/corpus/dictionary.ch-en.en.tok.en

Apply truecasing:
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model ~/Dapeng/corpus/truecase-model.ch<~/Dapeng/corpus/dictionary.ch-en.ch.tok.ch>~/Dapeng/corpus/dictionary.ch-en.true.ch

~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model ~/Dapeng/corpus/truecase-model.en<~/Dapeng/corpus/dictionary.ch-en.en.tok.en>~/Dapeng/corpus/dictionary.ch-en.true.en

3. Cleaning (remove sentence pairs with fewer than 2 or more than 60 words):

~/Dapeng/mosesdecoder/scripts/training/clean-corpus-n.perl ~/Dapeng/corpus/dictionary.ch-en.true en ch ~/Dapeng/corpus/dictionary.ch-en.clean 2 60
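
The length filter can be pictured with a few lines of shell. This is only a sketch of the idea, not the real script: clean-corpus-n.perl does more (for instance, it also drops pairs whose length ratio is too extreme), and the function name filter_by_length is invented here.

```shell
# filter_by_length SRC TGT MIN MAX
# Print tab-separated sentence pairs whose two sides both have
# between MIN and MAX whitespace-separated tokens.
# (Illustrative only; assumes the sentences contain no tab characters.)
filter_by_length() {
    paste "$1" "$2" | awk -F '\t' -v min="$3" -v max="$4" '
        {
            ns = split($1, s, " ")   # token count, source side
            nt = split($2, t, " ")   # token count, target side
            if (ns >= min && ns <= max && nt >= min && nt <= max) print
        }'
}
```

With the limits used above it would be called as: filter_by_length corpus.ch corpus.en 2 60 (the two file names are placeholders).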

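Step 7. Language model training: the language model is there to make the output fluent, so it is trained only on the target language, which in this experiment is English.
1. Preprocess the data and build the model: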
mkdir ~/Dapeng/lm
cd ~/Dapeng/lm
~/Dapeng/irstlm/bin/add-start-end.sh <~/Dapeng/corpus/dictionary.ch-en.true.en>dictionary.ch-en.sb.en
export IRSTLM=$HOME/Dapeng/irstlm
~/Dapeng/irstlm/bin/build-lm.sh -i dictionary.ch-en.sb.en -t ./tmp -p -s improved-kneser-ney -o dictionary.ch-en.lm.en
~/Dapeng/irstlm/bin/compile-lm --text=yes dictionary.ch-en.lm.en.gz dictionary.ch-en.arpa.en
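
For readers wondering what the add-start-end.sh command changes: it wraps every sentence in the boundary markers <s> and </s> that the language-model tools expect. A rough equivalent, offered only as an illustration (the function name mark_boundaries is made up, and the real script may differ in details):

```shell
# mark_boundaries: read sentences on stdin, write them wrapped in <s> ... </s>.
# (Illustrative stand-in, not the IRSTLM script itself.)
mark_boundaries() {
    sed -e 's/^/<s> /' -e 's/$/ <\/s>/'
}

echo "how old are you" | mark_boundaries
```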
2. Binarise the language model:
~/Dapeng/mosesdecoder/bin/build_binary dictionary.ch-en.arpa.en dictionary.ch-en.blm.en
3. Test the language model:
echo "how old are you" | ~/Dapeng/mosesdecoder/bin/query dictionary.ch-en.blm.en
The output has the following form (this sample is for the input "is this an English sentence ?"):
is=35 2 -2.67038 this=287 3 -0.889891 an=295 3 -2.25232 English=7284 1 -5.277988 sentence=4468 2 -2.69927 ?=65 1 -3.3272662 </s>=21 2 -0.0308079 Total: -17.147924 OOV: 0
Perplexity including OOVs: 281.6459360332587
Perplexity excluding OOVs: 281.6459360332587
OOVs: 0
Tokens: 7

Name:query VmPeak:35956 kB VmRSS:3296 kB RSSMax:32796 kB user:0 sys:0.004 CPU:0.004 real:0.00516975
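
The reported perplexity follows directly from the other numbers: query prints base-10 log probabilities, so the perplexity is 10 raised to minus the total divided by the number of scored tokens (7 here, counting </s>). A small sketch to reproduce the figure; the function name perplexity is just for illustration.

```shell
# perplexity TOTAL_LOG10 TOKENS: print 10^(-TOTAL_LOG10 / TOKENS) to two decimals.
perplexity() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.2f\n", exp(-total / n * log(10)) }'
}

# Total and token count taken from the sample output above.
perplexity -17.147924 7
```

This prints 281.65, which agrees with the perplexity in the sample output to two decimals.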

Step 8. Train the translation model:
Run:
mkdir ~/Dapeng/working
cd ~/Dapeng/working
nohup nice ~/Dapeng/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ~/Dapeng/corpus/dictionary.ch-en.clean -f ch -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/Dapeng/lm/dictionary.ch-en.blm.en -external-bin-dir ~/Dapeng/external-bin-dir >&training.out&
When this step succeeds, a moses.ini file is generated in the train/model folder.
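
The -lm argument in the training command packs several fields into one colon-separated string: the factor the model applies to, the n-gram order, and the path of the model file. The sketch below merely splits such a string to make the fields visible, assuming the three-field form used above; describe_lm is a made-up helper, not a Moses tool.

```shell
# describe_lm SPEC: split a factor:order:filename string into its fields.
# (Hypothetical helper for illustration only.)
describe_lm() {
    spec=$1
    factor=${spec%%:*}      # text before the first colon
    rest=${spec#*:}
    order=${rest%%:*}       # text between the first and second colon
    file=${rest#*:}         # everything after the second colon
    echo "factor=$factor order=$order file=$file"
}

describe_lm "0:3:$HOME/Dapeng/lm/dictionary.ch-en.blm.en"
```

So the value used in this guide asks for a 3-gram model over factor 0, loaded from the binarised file built in Step 7.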

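Step 9. Tuning (adjusting the weights)
1. Download a development set (a small, high-quality parallel corpus; here too I used one I assembled myself):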
cd ~/Dapeng/corpus
wget http://www.statmt.org/wmt12/dev.tgz
tar zxvf dev.tgz

2. Preprocess the development set:
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < dev/dev.ch > dev.ch-en.tok.ch
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < dev/dev.en > dev.ch-en.tok.en
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.ch <dev.ch-en.tok.ch> dev.ch-en.true.ch
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en <dev.ch-en.tok.en> dev.ch-en.true.en
3. Tune the weights:
cd ~/Dapeng/working
nohup nice ~/Dapeng/mosesdecoder/scripts/training/mert-moses.pl ~/Dapeng/corpus/dev.ch-en.true.ch ~/Dapeng/corpus/dev.ch-en.true.en ~/Dapeng/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/Dapeng/mosesdecoder/bin &> mert.out &
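This step also takes an extremely long time. When it succeeds, the end of mert.out reports the moses.ini that was generated; use it to update the original moses.ini configuration file.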
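
Assuming mert-moses.pl was run with its default working directory, mert-work, the tuned configuration ends up at mert-work/moses.ini under the directory the command was started from. The sketch below picks the tuned file when it exists and falls back to the untuned one otherwise; pick_moses_ini is a hypothetical helper name.

```shell
# pick_moses_ini WORKDIR: print the tuned config if tuning produced one,
# otherwise fall back to the untuned config from training.
# Assumes mert-moses.pl ran with its default working directory, mert-work.
pick_moses_ini() {
    if [ -f "$1/mert-work/moses.ini" ]; then
        echo "$1/mert-work/moses.ini"
    else
        echo "$1/train/model/moses.ini"
    fi
}
```

For example, pick_moses_ini ~/Dapeng/working prints the path that could then be passed to moses -f.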
Step 10. Decoding (translation) and measuring BLEU
1. Preprocess the test corpus:
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test/test.ch > test.ch-en.tok.ch
~/Dapeng/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test/test.en > test.ch-en.tok.en
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.ch <test.ch-en.tok.ch> test.ch-en.true.ch
~/Dapeng/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en <test.ch-en.tok.en> test.ch-en.true.en
2. Translate: the output is stored in ~/Dapeng/working/test.ch-en.translated.en
~/Dapeng/mosesdecoder/bin/moses -f ~/Dapeng/working/train/model/moses.ini <~/Dapeng/corpus/test.ch-en.true.ch > ~/Dapeng/working/test.ch-en.translated.en
3. Measure BLEU:
~/Dapeng/mosesdecoder/scripts/generic/multi-bleu.perl -lc ~/Dapeng/corpus/test.ch-en.true.en < ~/Dapeng/working/test.ch-en.translated.en
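
multi-bleu.perl compares the reference and the translation line by line, so a quick sanity check that both files contain the same number of lines can save some confusion. A minimal sketch (same_line_count is a hypothetical helper name):

```shell
# same_line_count FILE1 FILE2: succeed only if both files have equally many lines.
# (Hypothetical helper for illustration only.)
same_line_count() {
    a=$(( $(wc -l < "$1") ))
    b=$(( $(wc -l < "$2") ))
    if [ "$a" -ne "$b" ]; then
        echo "line count mismatch: $1 has $a, $2 has $b"
        return 1
    fi
}
```

It could be run on the reference and the translated file before computing BLEU, for example: same_line_count ~/Dapeng/corpus/test.ch-en.true.en ~/Dapeng/working/test.ch-en.translated.en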
