Learning the Deep Learning Framework PaddlePaddle (Part 2)

This article works through PaddlePaddle's sentiment analysis example.
The process breaks down into a few steps:
Step 1: Download the IMDB dataset and copy the data
Run the get_imdb.sh script. Downloading with wget inside the virtual machine was far too slow, so I downloaded the archive on my host machine, uploaded it to the VM via FTP, and tweaked the script slightly. IMDb is the well-known Internet Movie Database; the corpus downloaded here is the movie review dataset released by Stanford University.
Step 2: IMDB dataset directory structure and notes

  • test
    • neg
      • 0_2.txt
    • pos
      • 1_10.txt
    • labeledBow.feat
    • urls_neg.txt
    • urls_pos.txt
  • train
    • neg
      • 0_3.txt
    • pos
      • 1_10.txt
    • unsup
      • 2_0.txt
    • labeledBow.feat
    • unsupBow.feat
    • urls_neg.txt
    • urls_pos.txt
    • urls_unsup.txt
  • imdb.vocab
  • imdbEr.txt
  • README
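
As a quick sanity check of the extraction, the short sketch below counts the review files in each split. The ./data/imdb path is an assumption (it matches the data_dir used by preprocess.sh later); adjust it to wherever get_imdb.sh placed the data.

import os

# Assumed extraction path; matches the data_dir used in preprocess.sh below.
base = "./data/imdb"

for split, sub in [("train", "pos"), ("train", "neg"), ("train", "unsup"),
                   ("test", "pos"), ("test", "neg")]:
    d = os.path.join(base, split, sub)
    n = len([name for name in os.listdir(d) if name.endswith(".txt")])
    print("%s/%s: %d reviews" % (split, sub, n))
# Expected counts: 12500 each for pos/neg, 50000 for train/unsup.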

Dataset notes:
1. There are 50,000 rated movie reviews in total: 25,000 for training and 25,000 for testing, making this a typical balanced corpus.
2. imdb.vocab is the vocabulary of the whole corpus; it can be used for tokenization and for looking up a word's id. imdbEr.txt stores the average polarity (expected rating) of each word in the vocabulary.
3. Positive (pos) reviews are those rated 7 or higher out of 10; negative (neg) reviews are those rated 4 or lower.
4. The test and train sets each contain 12,500 positive and 12,500 negative reviews, stored as plain-text files named {file_id}_{rating}.txt; for example, 1_10.txt is review 1 with a rating of 10.
5. The train set also contains an unlabeled subset, unsup, with 50,000 reviews; every review in it has a rating of 0. This subset is intended for unsupervised learning.
6. The *.feat files store the data in the libSVM format, as follows (a small parsing sketch follows this list):
[screenshot: sample lines from labeledBow.feat]
Each line represents one review. The first column (e.g. 10) is the rating; the remaining columns are feature (word) indices with their counts. For example, 0:7 means the first word in the dictionary (imdb.vocab), i.e. "the", appears 7 times.
7. The urls_*.txt files hold the URLs of the reviews; they are not used here.
8. For further details, see the dataset's README and the official PaddlePaddle documentation.
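
Below is a minimal sketch of reading the *.feat format described in point 6. The file paths are assumptions (the same ./data/imdb layout as above), and it simply takes each line to be "<rating> <word_id>:<count> ...".

def load_vocab(vocab_path):
    # imdb.vocab holds one word per line; the line number is the word's id.
    with open(vocab_path) as f:
        return [line.strip() for line in f]

def parse_feat_line(line, vocab):
    parts = line.split()
    rating = int(parts[0])            # first column: the review's rating
    counts = {}
    for item in parts[1:]:            # remaining columns: "<word_id>:<count>"
        word_id, cnt = item.split(":")
        counts[vocab[int(word_id)]] = int(cnt)
    return rating, counts

vocab = load_vocab("./data/imdb/imdb.vocab")
with open("./data/imdb/train/labeledBow.feat") as f:
    rating, counts = parse_feat_line(f.readline(), vocab)
print("rating=%d, count of 'the'=%s" % (rating, counts.get("the")))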

Step 3: Data preprocessing
Only the labeled training and test data are used here. Run the demo/sentiment/preprocess.sh script:

set -e
echo "Start to preprcess..."
data_dir="./data/imdb"
python preprocess.py -i $data_dir
echo "Done."

After it runs, several data files are generated under ./data/pre-imdb:
1. dict.txt: the dictionary built from the training data. It is not the same as imdb.vocab above; the preprocessing does not use imdb.vocab at all.
2. labels.list: just two lines:

neg 0
pos 1

3. test.list and train.list: each contains a single line:
./data/pre-imdb/test_part_000
./data/pre-imdb/train_part_000

4. test_part_000 and train_part_000: the labeled test and training data; the training data has been randomly shuffled.
[screenshots: samples of test_part_000 and train_part_000 data]
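
To get a feel for the preprocessed files, the sketch below loads dict.txt and labels.list and reads the first record of train_part_000. The formats are assumptions based on the files described above: dict.txt is assumed to hold one token per line (possibly followed by a tab and its frequency), with the token's id being its line number; labels.list holds the "label id" pairs shown above; and each line of the part files is assumed to start with the numeric label, followed by the review text.

# All paths and formats here are assumptions based on the files described above.
pre_dir = "./data/pre-imdb"

# dict.txt: one token per line; the token's id is its line number.
word_dict = {}
with open(pre_dir + "/dict.txt") as f:
    for i, line in enumerate(f):
        word_dict[line.strip().split("\t")[0]] = i

# labels.list: "<label> <id>" pairs, i.e. "neg 0" and "pos 1".
label_names = {}
with open(pre_dir + "/labels.list") as f:
    for line in f:
        name, idx = line.split()
        label_names[int(idx)] = name

# train_part_000: assumed to be "<label> <review text>" per line.
with open(pre_dir + "/train_part_000") as f:
    label, text = f.readline().strip().split(None, 1)
    print("label: %s" % label_names[int(label)])
    print("first 10 word ids: %s" % [word_dict.get(w) for w in text.split()[:10]])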

Step 4: Training
Run demo/sentiment/train.sh:
[screenshot: training run, with models saved under model_output]
Since training runs on a single CPU inside a virtual machine, each pass (196 batches of 128 samples, i.e. the 25,000 training reviews) takes quite a long time. After each pass finishes, the model is saved to the model_output directory, as shown in the screenshot above. I ran seven passes in total; the log of the seventh pass follows:

I0923 02:12:09.063531  9787 TrainerInternal.cpp:162]  Batch=10 samples=1280 AvgCost=0.000370172 CurrentCost=0.000370172 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:17:40.628078  9787 TrainerInternal.cpp:162]  Batch=20 samples=2560 AvgCost=0.000381622 CurrentCost=0.000393071 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:23:12.219902  9787 TrainerInternal.cpp:162]  Batch=30 samples=3840 AvgCost=0.000403263 CurrentCost=0.000446546 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:28:43.641633  9787 TrainerInternal.cpp:162]  Batch=40 samples=5120 AvgCost=0.000405342 CurrentCost=0.000411579 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:34:22.490234  9787 TrainerInternal.cpp:162]  Batch=50 samples=6400 AvgCost=0.000373985 CurrentCost=0.000248556 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:39:52.223538  9787 TrainerInternal.cpp:162]  Batch=60 samples=7680 AvgCost=0.000345475 CurrentCost=0.000202924 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:45:28.569145  9787 TrainerInternal.cpp:162]  Batch=70 samples=8960 AvgCost=0.000334409 CurrentCost=0.000268012 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:50:59.457522  9787 TrainerInternal.cpp:162]  Batch=80 samples=10240 AvgCost=0.000321636 CurrentCost=0.000232229 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:56:40.102517  9787 TrainerInternal.cpp:162]  Batch=90 samples=11520 AvgCost=0.00032322 CurrentCost=0.000335891 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 
I0923 03:02:18.914818  9787 TrainerInternal.cpp:204] ___embedding_0__.w0  avg_abs_val=0.0346651   max_val=0.360826    avg_abs_grad=1.78088e-06 max_grad=0.0116757  
I0923 03:02:18.916208  9787 TrainerInternal.cpp:204] ___fc_layer_0__.w0   avg_abs_val=0.127534    max_val=0.645288    avg_abs_grad=6.4037e-05  max_grad=0.0021109  
I0923 03:02:18.916415  9787 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.0258548   max_val=0.0674928   avg_abs_grad=0.00216296  max_grad=0.0398809  
I0923 03:02:18.917516  9787 TrainerInternal.cpp:204] ___lstmemory_0__.w0  avg_abs_val=0.103539    max_val=0.7114      avg_abs_grad=7.87959e-05 max_grad=0.00955594 
I0923 03:02:18.917695  9787 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.0581983   max_val=0.298948    avg_abs_grad=0.00111475  max_grad=0.0390956  
I0923 03:02:18.921948  9787 TrainerInternal.cpp:204] ___fc_layer_1__.w0   avg_abs_val=0.0352199   max_val=0.219293    avg_abs_grad=0.000131355 max_grad=0.00190035 
I0923 03:02:18.923125  9787 TrainerInternal.cpp:204] ___fc_layer_1__.w1   avg_abs_val=0.0652509   max_val=0.532027    avg_abs_grad=3.66868e-05 max_grad=0.00089536 
I0923 03:02:18.923316  9787 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.0186279   max_val=0.111975    avg_abs_grad=0.000686864 max_grad=0.0040742  
I0923 03:02:18.924469  9787 TrainerInternal.cpp:204] ___lstmemory_1__.w0  avg_abs_val=0.0847814   max_val=0.610078    avg_abs_grad=1.47343e-05 max_grad=0.0033167  
I0923 03:02:18.924633  9787 TrainerInternal.cpp:204] ___lstmemory_1__.wbias avg_abs_val=0.0388588   max_val=0.273343    avg_abs_grad=0.000145373 max_grad=0.0213416  
I0923 03:02:18.929400  9787 TrainerInternal.cpp:204] ___fc_layer_2__.w0   avg_abs_val=0.0352939   max_val=0.229884    avg_abs_grad=0.0001623   max_grad=0.00334936 
I0923 03:02:18.930450  9787 TrainerInternal.cpp:204] ___fc_layer_2__.w1   avg_abs_val=0.0491384   max_val=0.341062    avg_abs_grad=3.59661e-05 max_grad=0.00330794 
I0923 03:02:18.930521  9787 TrainerInternal.cpp:204] ___fc_layer_2__.wbias avg_abs_val=0.0135314   max_val=0.153526    avg_abs_grad=0.000433104 max_grad=0.00527319 
I0923 03:02:18.931517  9787 TrainerInternal.cpp:204] ___lstmemory_2__.w0  avg_abs_val=0.109626    max_val=0.713229    avg_abs_grad=3.93541e-05 max_grad=0.00424078 
I0923 03:02:18.931588  9787 TrainerInternal.cpp:204] ___lstmemory_2__.wbias avg_abs_val=0.0814789   max_val=0.345273    avg_abs_grad=0.000283698 max_grad=0.00510029 
I0923 03:02:18.931643  9787 TrainerInternal.cpp:204] ___fc_layer_3__.w0   avg_abs_val=0.0356766   max_val=0.162611    avg_abs_grad=0.00599002  max_grad=0.0319875  
I0923 03:02:18.931684  9787 TrainerInternal.cpp:204] ___fc_layer_3__.w1   avg_abs_val=0.123792    max_val=0.23407     avg_abs_grad=0.00251986  max_grad=0.00756476 
I0923 03:02:18.931721  9787 TrainerInternal.cpp:204] ___fc_layer_3__.wbias avg_abs_val=0.00120091  max_val=0.00120093  avg_abs_grad=0.00225513  max_grad=0.00225543 

I0923 03:02:18.931780  9787 TrainerInternal.cpp:162]  Batch=100 samples=12800 AvgCost=0.000319255 CurrentCost=0.000283566 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:08:00.537493  9787 TrainerInternal.cpp:162]  Batch=110 samples=14080 AvgCost=0.000321684 CurrentCost=0.000345972 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:13:29.078312  9787 TrainerInternal.cpp:162]  Batch=120 samples=15360 AvgCost=0.000326361 CurrentCost=0.000377817 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:18:56.538781  9787 TrainerInternal.cpp:162]  Batch=130 samples=16640 AvgCost=0.000328759 CurrentCost=0.000357535 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:24:32.853353  9787 TrainerInternal.cpp:162]  Batch=140 samples=17920 AvgCost=0.000325527 CurrentCost=0.000283505 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:30:16.785081  9787 TrainerInternal.cpp:162]  Batch=150 samples=19200 AvgCost=0.00032612 CurrentCost=0.000334426 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:35:51.114645  9787 TrainerInternal.cpp:162]  Batch=160 samples=20480 AvgCost=0.000321745 CurrentCost=0.000256123 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:41:19.795605  9787 TrainerInternal.cpp:162]  Batch=170 samples=21760 AvgCost=0.000315861 CurrentCost=0.000221709 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:46:45.738401  9787 TrainerInternal.cpp:162]  Batch=180 samples=23040 AvgCost=0.000322339 CurrentCost=0.000432473 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:52:27.157814  9787 TrainerInternal.cpp:162]  Batch=190 samples=24320 AvgCost=0.000319989 CurrentCost=0.000277687 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 
I0923 03:55:21.358279  9787 TrainerInternal.cpp:179]  Pass=7 Batch=196 samples=25000 AvgCost=0.000321772 Eval: classification_error_evaluator=0 
I0923 04:29:03.403729  9787 Tester.cpp:111]  Test samples=25000 cost=0.677085 Eval: classification_error_evaluator=0.1594 
I0923 04:29:03.541514  9787 GradientMachine.cpp:112] Saving parameters to ./model_output/pass-00007
I0923 04:29:04.631625  9787 Util.cpp:219] copy trainer_config.py to ./model_output/pass-00007

Step 5: Testing and prediction
Run test.sh to select the best model from the training results above.
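
I did not dig into how test.sh makes its choice; as a rough idea of the selection, the sketch below scans a saved training log for the per-pass test lines shown above ("Tester.cpp ... Test samples=... classification_error_evaluator=...") and reports the pass with the lowest test error. The train.log filename is hypothetical; it stands for wherever the output of train.sh was redirected.

import re

# Hypothetical log file, e.g. produced by: ./train.sh > train.log 2>&1
errors = []
with open("train.log") as f:
    for line in f:
        m = re.search(r"Test samples=\d+ cost=[\d.]+ "
                      r"Eval: classification_error_evaluator=([\d.]+)", line)
        if m:
            errors.append(float(m.group(1)))

# Passes are saved as ./model_output/pass-00000, pass-00001, ... in order.
best = min(range(len(errors)), key=lambda i: errors[i])
print("best pass: %d (test error %.4f) -> ./model_output/pass-%05d"
      % (best, errors[best], best))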
Run predict.sh to perform sentiment prediction with the selected model.
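
Conceptually, the input side of prediction turns a raw review into word ids using the dict.txt built in step 3, and the output side maps the model's two class probabilities back to neg/pos via labels.list. The sketch below only illustrates those two ends; the forward pass itself is done by predict.sh, and the predict_probs values here are made up for illustration.

# Input side: raw text -> word ids (same assumed dict.txt format as in the step 3 sketch).
word_dict = {}
with open("./data/pre-imdb/dict.txt") as f:
    for i, line in enumerate(f):
        word_dict[line.strip().split("\t")[0]] = i

review = "this movie is great and the acting is wonderful"
word_ids = [word_dict[w] for w in review.split() if w in word_dict]
print("word ids fed to the model: %s" % word_ids)

# Output side: one probability per label in labels.list (neg 0, pos 1).
labels = ["neg", "pos"]
predict_probs = [0.13, 0.87]          # made-up example output, not from a real run
best = max(range(len(labels)), key=lambda i: predict_probs[i])
print("prediction: %s (p=%.2f)" % (labels[best], predict_probs[best]))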
[screenshot: prediction output]
