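This article walks through PaddlePaddle's sentiment analysis example.
The main steps are:
I. Download the IMDB dataset and copy the data
Run the get_imdb.sh script. Downloading with wget inside the virtual machine was too slow, so I downloaded the archive on my local machine and uploaded it to the virtual machine over FTP; the script only needs a small change to accommodate this. IMDB is a well-known online movie database, and the data downloaded here is the corpus provided by Stanford University.
II. IMDB dataset directory structure and notes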
- test
  - neg
    - 0_2.txt
    - …
  - pos
    - 1_10.txt
    - …
  - labeledBow.feat
  - urls_neg.txt
  - urls_pos.txt
- train
  - neg
    - 0_3.txt
    - …
  - pos
    - 1_10.txt
    - …
  - unsup
    - 2_0.txt
    - …
  - labeledBow.feat
  - unsupBow.feat
  - urls_neg.txt
  - urls_pos.txt
  - urls_unsup.txt
- imdb.vocab
- imdbEr.txt
- README
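Dataset notes:
1. The corpus contains 50,000 movie reviews with ratings: 25,000 for training and 25,000 for testing. It is a typical balanced corpus.
2. imdb.vocab is the vocabulary of the whole corpus; it can be used for tokenization and for looking up a word's id. imdbEr.txt gives the average polarity of each word in the vocabulary (a per-word score, loosely comparable to a weight such as TF-IDF).
3. Positive (pos) reviews have a rating of 7 or higher (out of 10); negative (neg) reviews have a rating of 4 or lower.
4. The test and train sets each contain 12,500 positive and 12,500 negative reviews, all stored as plain text files named {file id}_{rating}.txt. For example, 1_10.txt is the review with id 1 and a rating of 10.
5. The train set also contains an unlabeled set, unsup, with 50,000 reviews whose rating is always 0; it is intended for unsupervised learning.
6. The *.feat files store the data in the format that libSVM expects. Each line holds the features of one review: the first column (e.g. 10) is the rating, and the following columns are the features (words) and their counts. For example, 0:7 means that the first word in the vocabulary (imdb.vocab), namely "the", appears 7 times.
7. urls_*.txt holds the URLs of the reviews; they are not used here.
8. For details, see the dataset's README file and the official PaddlePaddle documentation.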
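The file naming rule and the *.feat line layout described in the notes above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sample line is made up to match the description rather than copied from the dataset.

```python
import re
from typing import Dict, Tuple


def parse_review_filename(name: str) -> Tuple[int, int]:
    """Split '{file id}_{rating}.txt' into (file id, rating)."""
    match = re.fullmatch(r"(\d+)_(\d+)\.txt", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def parse_feat_line(line: str) -> Tuple[int, Dict[int, int]]:
    """Parse one libSVM-style line: '<rating> <word index>:<count> ...'."""
    fields = line.split()
    rating = int(fields[0])
    counts: Dict[int, int] = {}
    for field in fields[1:]:
        index, count = field.split(":")
        counts[int(index)] = int(count)
    return rating, counts


# Made-up sample line for illustration only (not taken from labeledBow.feat).
rating, counts = parse_feat_line("10 0:7 1:4 5:2")
print(rating, counts)

file_id, file_rating = parse_review_filename("1_10.txt")
print(file_id, file_rating)
```

Here index 0 maps to the first line of imdb.vocab, so `counts[0]` is the number of times that word occurs in the review.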
III. Data preprocessing
Only the labeled training and test data are used here. Run the demo/sentiment/preprocess.sh script:
set -e
echo "Start to preprcess..."
data_dir="./data/imdb"
python preprocess.py -i $data_dir
echo "Done."
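After the script runs, it generates several data files:
1. dict.txt: a dictionary built from the training data. It is not the same as imdb.vocab above; tokenization here does not use imdb.vocab at all.
2. labels.list: just two lines.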
neg 0
pos 1
3. test.list and train.list: one line each.
./data/pre-imdb/test_part_000
./data/pre-imdb/train_part_000
4. test_part_000 and train_part_000: the labeled test and training data; the training data is randomly shuffled.
[Figure: test_part_000 data]
[Figure: train_part_000 data]
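As a quick check of the generated files, labels.list and the *.list files can be read back with a few lines of Python. This is a minimal sketch, not part of the demo: the helper names are hypothetical, and it assumes the two columns in labels.list are separated by whitespace. The sample files are recreated in a temporary directory so the sketch runs without the real preprocessing output.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def load_labels(path: Path) -> Dict[str, int]:
    """Read 'name id' pairs, one per line, e.g. 'neg 0'."""
    labels: Dict[str, int] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            name, label_id = line.split()
            labels[name] = int(label_id)
    return labels


def load_file_list(path: Path) -> List[str]:
    """Read a *.list file: one data file path per non-empty line."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Recreate the small files shown above in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    labels_path = Path(tmp) / "labels.list"
    train_list_path = Path(tmp) / "train.list"
    labels_path.write_text("neg 0\npos 1\n", encoding="utf-8")
    train_list_path.write_text("./data/pre-imdb/train_part_000\n", encoding="utf-8")

    labels = load_labels(labels_path)
    train_files = load_file_list(train_list_path)

print(labels)
print(train_files)
```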
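IV. Training the model
Run demo/sentiment/train.sh:
Training ran on a single CPU inside the virtual machine, so each pass (about 200 batches; 196 batches of 128 samples in the log below) takes a long time. After each pass, the model files are saved to the model_output folder, as shown in the figure above. Seven passes were run in total; the log of the seventh pass follows: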
I0923 02:12:09.063531 9787 TrainerInternal.cpp:162] Batch=10 samples=1280 AvgCost=0.000370172 CurrentCost=0.000370172 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:17:40.628078 9787 TrainerInternal.cpp:162] Batch=20 samples=2560 AvgCost=0.000381622 CurrentCost=0.000393071 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:23:12.219902 9787 TrainerInternal.cpp:162] Batch=30 samples=3840 AvgCost=0.000403263 CurrentCost=0.000446546 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:28:43.641633 9787 TrainerInternal.cpp:162] Batch=40 samples=5120 AvgCost=0.000405342 CurrentCost=0.000411579 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:34:22.490234 9787 TrainerInternal.cpp:162] Batch=50 samples=6400 AvgCost=0.000373985 CurrentCost=0.000248556 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:39:52.223538 9787 TrainerInternal.cpp:162] Batch=60 samples=7680 AvgCost=0.000345475 CurrentCost=0.000202924 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:45:28.569145 9787 TrainerInternal.cpp:162] Batch=70 samples=8960 AvgCost=0.000334409 CurrentCost=0.000268012 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:50:59.457522 9787 TrainerInternal.cpp:162] Batch=80 samples=10240 AvgCost=0.000321636 CurrentCost=0.000232229 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 02:56:40.102517 9787 TrainerInternal.cpp:162] Batch=90 samples=11520 AvgCost=0.00032322 CurrentCost=0.000335891 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:02:18.914818 9787 TrainerInternal.cpp:204] ___embedding_0__.w0 avg_abs_val=0.0346651 max_val=0.360826 avg_abs_grad=1.78088e-06 max_grad=0.0116757
I0923 03:02:18.916208 9787 TrainerInternal.cpp:204] ___fc_layer_0__.w0 avg_abs_val=0.127534 max_val=0.645288 avg_abs_grad=6.4037e-05 max_grad=0.0021109
I0923 03:02:18.916415 9787 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.0258548 max_val=0.0674928 avg_abs_grad=0.00216296 max_grad=0.0398809
I0923 03:02:18.917516 9787 TrainerInternal.cpp:204] ___lstmemory_0__.w0 avg_abs_val=0.103539 max_val=0.7114 avg_abs_grad=7.87959e-05 max_grad=0.00955594
I0923 03:02:18.917695 9787 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.0581983 max_val=0.298948 avg_abs_grad=0.00111475 max_grad=0.0390956
I0923 03:02:18.921948 9787 TrainerInternal.cpp:204] ___fc_layer_1__.w0 avg_abs_val=0.0352199 max_val=0.219293 avg_abs_grad=0.000131355 max_grad=0.00190035
I0923 03:02:18.923125 9787 TrainerInternal.cpp:204] ___fc_layer_1__.w1 avg_abs_val=0.0652509 max_val=0.532027 avg_abs_grad=3.66868e-05 max_grad=0.00089536
I0923 03:02:18.923316 9787 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.0186279 max_val=0.111975 avg_abs_grad=0.000686864 max_grad=0.0040742
I0923 03:02:18.924469 9787 TrainerInternal.cpp:204] ___lstmemory_1__.w0 avg_abs_val=0.0847814 max_val=0.610078 avg_abs_grad=1.47343e-05 max_grad=0.0033167
I0923 03:02:18.924633 9787 TrainerInternal.cpp:204] ___lstmemory_1__.wbias avg_abs_val=0.0388588 max_val=0.273343 avg_abs_grad=0.000145373 max_grad=0.0213416
I0923 03:02:18.929400 9787 TrainerInternal.cpp:204] ___fc_layer_2__.w0 avg_abs_val=0.0352939 max_val=0.229884 avg_abs_grad=0.0001623 max_grad=0.00334936
I0923 03:02:18.930450 9787 TrainerInternal.cpp:204] ___fc_layer_2__.w1 avg_abs_val=0.0491384 max_val=0.341062 avg_abs_grad=3.59661e-05 max_grad=0.00330794
I0923 03:02:18.930521 9787 TrainerInternal.cpp:204] ___fc_layer_2__.wbias avg_abs_val=0.0135314 max_val=0.153526 avg_abs_grad=0.000433104 max_grad=0.00527319
I0923 03:02:18.931517 9787 TrainerInternal.cpp:204] ___lstmemory_2__.w0 avg_abs_val=0.109626 max_val=0.713229 avg_abs_grad=3.93541e-05 max_grad=0.00424078
I0923 03:02:18.931588 9787 TrainerInternal.cpp:204] ___lstmemory_2__.wbias avg_abs_val=0.0814789 max_val=0.345273 avg_abs_grad=0.000283698 max_grad=0.00510029
I0923 03:02:18.931643 9787 TrainerInternal.cpp:204] ___fc_layer_3__.w0 avg_abs_val=0.0356766 max_val=0.162611 avg_abs_grad=0.00599002 max_grad=0.0319875
I0923 03:02:18.931684 9787 TrainerInternal.cpp:204] ___fc_layer_3__.w1 avg_abs_val=0.123792 max_val=0.23407 avg_abs_grad=0.00251986 max_grad=0.00756476
I0923 03:02:18.931721 9787 TrainerInternal.cpp:204] ___fc_layer_3__.wbias avg_abs_val=0.00120091 max_val=0.00120093 avg_abs_grad=0.00225513 max_grad=0.00225543
I0923 03:02:18.931780 9787 TrainerInternal.cpp:162] Batch=100 samples=12800 AvgCost=0.000319255 CurrentCost=0.000283566 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:08:00.537493 9787 TrainerInternal.cpp:162] Batch=110 samples=14080 AvgCost=0.000321684 CurrentCost=0.000345972 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:13:29.078312 9787 TrainerInternal.cpp:162] Batch=120 samples=15360 AvgCost=0.000326361 CurrentCost=0.000377817 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:18:56.538781 9787 TrainerInternal.cpp:162] Batch=130 samples=16640 AvgCost=0.000328759 CurrentCost=0.000357535 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:24:32.853353 9787 TrainerInternal.cpp:162] Batch=140 samples=17920 AvgCost=0.000325527 CurrentCost=0.000283505 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:30:16.785081 9787 TrainerInternal.cpp:162] Batch=150 samples=19200 AvgCost=0.00032612 CurrentCost=0.000334426 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:35:51.114645 9787 TrainerInternal.cpp:162] Batch=160 samples=20480 AvgCost=0.000321745 CurrentCost=0.000256123 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:41:19.795605 9787 TrainerInternal.cpp:162] Batch=170 samples=21760 AvgCost=0.000315861 CurrentCost=0.000221709 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:46:45.738401 9787 TrainerInternal.cpp:162] Batch=180 samples=23040 AvgCost=0.000322339 CurrentCost=0.000432473 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:52:27.157814 9787 TrainerInternal.cpp:162] Batch=190 samples=24320 AvgCost=0.000319989 CurrentCost=0.000277687 Eval: classification_error_evaluator=0 CurrentEval: classification_error_evaluator=0
I0923 03:55:21.358279 9787 TrainerInternal.cpp:179] Pass=7 Batch=196 samples=25000 AvgCost=0.000321772 Eval: classification_error_evaluator=0
I0923 04:29:03.403729 9787 Tester.cpp:111] Test samples=25000 cost=0.677085 Eval: classification_error_evaluator=0.1594
I0923 04:29:03.541514 9787 GradientMachine.cpp:112] Saving parameters to ./model_output/pass-00007
I0923 04:29:04.631625 9787 Util.cpp:219] copy trainer_config.py to ./model_output/pass-00007
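To compare passes without reading the whole log, the end-of-pass line and the test line can be pulled out with regular expressions. The following is a rough sketch that assumes the log lines keep exactly the layout shown above; the function name and the dictionary keys are my own, not part of PaddlePaddle.

```python
import re
from typing import Dict, Iterable

# End-of-pass summary line written by the trainer.
PASS_RE = re.compile(
    r"Pass=(\d+) Batch=\d+ samples=\d+ AvgCost=(\S+) "
    r"Eval: classification_error_evaluator=(\S+)"
)
# Evaluation line written by the tester after the pass.
TEST_RE = re.compile(
    r"Test samples=(\d+) cost=(\S+) "
    r"Eval: classification_error_evaluator=(\S+)"
)


def summarize_pass(log_lines: Iterable[str]) -> Dict[str, float]:
    """Collect training and test metrics for one pass from its log lines."""
    summary: Dict[str, float] = {}
    for line in log_lines:
        match = PASS_RE.search(line)
        if match:
            summary["pass"] = int(match.group(1))
            summary["train_avg_cost"] = float(match.group(2))
            summary["train_error"] = float(match.group(3))
        match = TEST_RE.search(line)
        if match:
            summary["test_samples"] = int(match.group(1))
            summary["test_cost"] = float(match.group(2))
            summary["test_error"] = float(match.group(3))
    return summary


# Two lines copied from the log above.
sample_log = [
    "I0923 03:55:21.358279 9787 TrainerInternal.cpp:179] Pass=7 Batch=196 "
    "samples=25000 AvgCost=0.000321772 Eval: classification_error_evaluator=0",
    "I0923 04:29:03.403729 9787 Tester.cpp:111] Test samples=25000 "
    "cost=0.677085 Eval: classification_error_evaluator=0.1594",
]
summary = summarize_pass(sample_log)
print(summary)
```

For this pass the summary puts the training error (0) next to the test error (0.1594), which makes the gap between the two easy to see.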
V. Testing and prediction
Run test.sh to select the best model from the training results above.
Run predict.sh to perform sentiment prediction with the best model.
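The selection step can be pictured as choosing the pass with the lowest test error. The sketch below assumes that "best" means the lowest classification error on the test set; it is not the logic of test.sh itself. Only the pass-00007 value comes from the log above; the other two entries are hypothetical placeholders added to make the example runnable.

```python
from typing import Dict


def pick_best_pass(test_errors: Dict[str, float]) -> str:
    """Return the model directory name with the lowest test error."""
    if not test_errors:
        raise ValueError("no test results to compare")
    return min(test_errors, key=lambda name: test_errors[name])


test_errors = {
    "pass-00005": 0.1700,  # hypothetical placeholder, not a measured value
    "pass-00006": 0.1650,  # hypothetical placeholder, not a measured value
    "pass-00007": 0.1594,  # test error reported in the log above
}
best = pick_best_pass(test_errors)
print(best)
```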