Learning the PaddlePaddle Deep Learning Framework (Part 2)

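This post walks through PaddlePaddle's sentiment analysis example.

It breaks down into the following steps:
I. Download the imdb dataset and copy the data
Run the get_imdb.sh script. Downloading with wget inside the virtual machine was too slow, so I downloaded the archive on the host machine and uploaded it to the VM over FTP; the script needs only a small modification for this. IMDb is a well-known online movie database, and the corpus downloaded here is the one provided by Stanford University.
II. Directory structure and description of the imdb dataset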

test
- neg
  - 0_2.txt
  - …
- pos
  - 1_10.txt
  - …
- labeledBow.feat
- urls_neg.txt
- urls_pos.txt
train
- neg
  - 0_3.txt
  - …
- pos
  - 1_10.txt
  - …
- unsup
  - 2_0.txt
  - …
- labeledBow.feat
- unsupBow.feat
- urls_neg.txt
- urls_pos.txt
- urls_unsup.txt
imdb.vocab
imdbEr.txt
README
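
The layout above can be walked with a few lines of Python. This is only a minimal sketch: the function name iter_reviews is made up for illustration, and it assumes the review files follow the {id}_{rating}.txt naming rule that the dataset uses.

```python
import re
from pathlib import Path

# Review files are named "<id>_<rating>.txt", e.g. "1_10.txt".
_NAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")


def iter_reviews(root, split):
    """Yield (label, review_id, rating, path) for the labeled reviews of one split.

    `root` is the extracted dataset folder and `split` is "train" or "test".
    iter_reviews is a hypothetical helper, not part of PaddlePaddle.
    """
    for label in ("neg", "pos"):
        folder = Path(root) / split / label
        for path in sorted(folder.glob("*.txt")):
            match = _NAME_RE.match(path.name)
            if match is None:
                continue  # skip files that do not follow the naming rule
            yield label, int(match.group(1)), int(match.group(2)), path
```

Calling list(iter_reviews(root, "train")) on the extracted dataset should yield one tuple per labeled training review; the unsup folder is deliberately left out because its reviews carry no label.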

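Dataset notes:
1. The corpus contains 50,000 movie reviews with ratings: 25,000 for training (train) and 25,000 for testing (test). It is a typical balanced corpus.
2. imdb.vocab is the vocabulary of the whole corpus; it can be used for tokenization and for looking up a word's id. imdbEr.txt gives the average polarity of each word in the vocabulary (a per-word score, similar in spirit to TF-IDF-style weights).
3. Positive (pos) reviews have a rating of 7 or higher (out of 10); negative (neg) reviews have a rating of 4 or lower.
4. The test and train sets each contain 12,500 positive and 12,500 negative reviews. All of them are plain text files named {file id}_{rating}.txt; for example, 1_10.txt is the review with id 1 and a rating of 10.
5. The train set also contains an unlabeled subset, unsup, with 50,000 reviews. Every review in it has a rating of 0, and it is intended for unsupervised learning.
6. The *.feat files store the data in the format that libSVM expects:

Each line holds the features of one review. The first column, such as 10, is the rating; the remaining columns are features (words) and their counts. For example, 0:7 means that the first word in the vocabulary (imdb.vocab), which is "the", appears 7 times.
7. The urls_*.txt files hold the URLs of the reviews; they are not used here.
8. For details, see the dataset's README file and the official PaddlePaddle documentation.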
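
The rating thresholds and the .feat line format described above can be expressed as a short sketch. Both function names are hypothetical, and the parser assumes every feature column has the form index:count, as in the 0:7 example.

```python
def label_from_rating(rating):
    """Map a rating to the dataset's labels: 7-10 is "pos", 1-4 is "neg".

    Returns None for anything else, including the 0 used for unlabeled reviews.
    """
    if rating >= 7:
        return "pos"
    if 1 <= rating <= 4:
        return "neg"
    return None


def parse_feat_line(line):
    """Parse "rating index:count index:count ..." into (rating, {index: count})."""
    fields = line.split()
    rating = int(fields[0])
    counts = {}
    for field in fields[1:]:
        index, count = field.split(":")
        counts[int(index)] = int(count)
    return rating, counts
```

With this sketch, parse_feat_line("10 0:7") gives a rating of 10 and a count of 7 for index 0, and index 0 corresponds to the first line of imdb.vocab.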

III. Data preprocessing
Only the labeled training and test data are used here. Run the demo/sentiment/preprocess.sh script:

set -e
echo "Start to preprocess..."
data_dir="./data/imdb"
python preprocess.py -i $data_dir
echo "Done."

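After it runs, several data files are generated:

1. dict.txt: a dictionary built from the training data. It is not the same as imdb.vocab from the imdb dataset above; tokenization here does not use imdb.vocab at all.
2. labels.list: contains just two lines.

neg 0
pos 1

3. test.list and train.list: each contains a single line, respectively:
./data/pre-imdb/test_part_000 ./data/pre-imdb/train_part_000
4. test_part_000 and train_part_000: the labeled test and training data. The training data is randomly shuffled.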

[Figure: test_part_000 data]

[Figure: train_part_000 data]
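
To make the generated files more concrete, here is a rough sketch of the three ideas involved: building a dictionary from the training text, reading labels.list, and shuffling the training samples. It is not the code in preprocess.py; the function names, the whitespace tokenization, the frequency ordering and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def build_dict(texts):
    """Return the words of `texts`, most frequent first (ties in alphabetical order).

    Assumption: lowercase whitespace tokenization; the real script may differ.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered]


def read_labels(lines):
    """Parse labels.list lines such as "neg 0" into {"neg": 0}."""
    labels = {}
    for line in lines:
        if line.strip():
            name, index = line.split()
            labels[name] = int(index)
    return labels


def shuffled(samples, seed=0):
    """Return a shuffled copy of `samples`; the fixed seed keeps runs repeatable."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    return samples
```

Shuffling matters here because the raw files are grouped by label (all neg, then all pos); feeding them in that order would give the trainer long runs of a single class.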

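IV. Training
Run demo/sentiment/train.sh:

Training ran on a single CPU inside a virtual machine, so each pass (about 200 batches; the log below shows 196 batches of 128 samples) took a long time, roughly five and a half minutes per 10 batches. After each pass finishes, the model files are saved to the model_output folder. I ran seven passes in total; the log of the seventh pass follows: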

I0923 02:12:09.063531  9787 TrainerInternal.cpp:162]  Batch=10 samples=1280 AvgCost=0.000370172 CurrentCost=0.000370172 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:17:40.628078  9787 TrainerInternal.cpp:162]  Batch=20 samples=2560 AvgCost=0.000381622 CurrentCost=0.000393071 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:23:12.219902  9787 TrainerInternal.cpp:162]  Batch=30 samples=3840 AvgCost=0.000403263 CurrentCost=0.000446546 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:28:43.641633  9787 TrainerInternal.cpp:162]  Batch=40 samples=5120 AvgCost=0.000405342 CurrentCost=0.000411579 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:34:22.490234  9787 TrainerInternal.cpp:162]  Batch=50 samples=6400 AvgCost=0.000373985 CurrentCost=0.000248556 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:39:52.223538  9787 TrainerInternal.cpp:162]  Batch=60 samples=7680 AvgCost=0.000345475 CurrentCost=0.000202924 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:45:28.569145  9787 TrainerInternal.cpp:162]  Batch=70 samples=8960 AvgCost=0.000334409 CurrentCost=0.000268012 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:50:59.457522  9787 TrainerInternal.cpp:162]  Batch=80 samples=10240 AvgCost=0.000321636 CurrentCost=0.000232229 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 02:56:40.102517  9787 TrainerInternal.cpp:162]  Batch=90 samples=11520 AvgCost=0.00032322 CurrentCost=0.000335891 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 
I0923 03:02:18.914818  9787 TrainerInternal.cpp:204] ___embedding_0__.w0  avg_abs_val=0.0346651   max_val=0.360826    avg_abs_grad=1.78088e-06 max_grad=0.0116757  
I0923 03:02:18.916208  9787 TrainerInternal.cpp:204] ___fc_layer_0__.w0   avg_abs_val=0.127534    max_val=0.645288    avg_abs_grad=6.4037e-05  max_grad=0.0021109  
I0923 03:02:18.916415  9787 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.0258548   max_val=0.0674928   avg_abs_grad=0.00216296  max_grad=0.0398809  
I0923 03:02:18.917516  9787 TrainerInternal.cpp:204] ___lstmemory_0__.w0  avg_abs_val=0.103539    max_val=0.7114      avg_abs_grad=7.87959e-05 max_grad=0.00955594 
I0923 03:02:18.917695  9787 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.0581983   max_val=0.298948    avg_abs_grad=0.00111475  max_grad=0.0390956  
I0923 03:02:18.921948  9787 TrainerInternal.cpp:204] ___fc_layer_1__.w0   avg_abs_val=0.0352199   max_val=0.219293    avg_abs_grad=0.000131355 max_grad=0.00190035 
I0923 03:02:18.923125  9787 TrainerInternal.cpp:204] ___fc_layer_1__.w1   avg_abs_val=0.0652509   max_val=0.532027    avg_abs_grad=3.66868e-05 max_grad=0.00089536 
I0923 03:02:18.923316  9787 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.0186279   max_val=0.111975    avg_abs_grad=0.000686864 max_grad=0.0040742  
I0923 03:02:18.924469  9787 TrainerInternal.cpp:204] ___lstmemory_1__.w0  avg_abs_val=0.0847814   max_val=0.610078    avg_abs_grad=1.47343e-05 max_grad=0.0033167  
I0923 03:02:18.924633  9787 TrainerInternal.cpp:204] ___lstmemory_1__.wbias avg_abs_val=0.0388588   max_val=0.273343    avg_abs_grad=0.000145373 max_grad=0.0213416  
I0923 03:02:18.929400  9787 TrainerInternal.cpp:204] ___fc_layer_2__.w0   avg_abs_val=0.0352939   max_val=0.229884    avg_abs_grad=0.0001623   max_grad=0.00334936 
I0923 03:02:18.930450  9787 TrainerInternal.cpp:204] ___fc_layer_2__.w1   avg_abs_val=0.0491384   max_val=0.341062    avg_abs_grad=3.59661e-05 max_grad=0.00330794 
I0923 03:02:18.930521  9787 TrainerInternal.cpp:204] ___fc_layer_2__.wbias avg_abs_val=0.0135314   max_val=0.153526    avg_abs_grad=0.000433104 max_grad=0.00527319 
I0923 03:02:18.931517  9787 TrainerInternal.cpp:204] ___lstmemory_2__.w0  avg_abs_val=0.109626    max_val=0.713229    avg_abs_grad=3.93541e-05 max_grad=0.00424078 
I0923 03:02:18.931588  9787 TrainerInternal.cpp:204] ___lstmemory_2__.wbias avg_abs_val=0.0814789   max_val=0.345273    avg_abs_grad=0.000283698 max_grad=0.00510029 
I0923 03:02:18.931643  9787 TrainerInternal.cpp:204] ___fc_layer_3__.w0   avg_abs_val=0.0356766   max_val=0.162611    avg_abs_grad=0.00599002  max_grad=0.0319875  
I0923 03:02:18.931684  9787 TrainerInternal.cpp:204] ___fc_layer_3__.w1   avg_abs_val=0.123792    max_val=0.23407     avg_abs_grad=0.00251986  max_grad=0.00756476 
I0923 03:02:18.931721  9787 TrainerInternal.cpp:204] ___fc_layer_3__.wbias avg_abs_val=0.00120091  max_val=0.00120093  avg_abs_grad=0.00225513  max_grad=0.00225543 

I0923 03:02:18.931780  9787 TrainerInternal.cpp:162]  Batch=100 samples=12800 AvgCost=0.000319255 CurrentCost=0.000283566 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:08:00.537493  9787 TrainerInternal.cpp:162]  Batch=110 samples=14080 AvgCost=0.000321684 CurrentCost=0.000345972 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:13:29.078312  9787 TrainerInternal.cpp:162]  Batch=120 samples=15360 AvgCost=0.000326361 CurrentCost=0.000377817 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:18:56.538781  9787 TrainerInternal.cpp:162]  Batch=130 samples=16640 AvgCost=0.000328759 CurrentCost=0.000357535 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:24:32.853353  9787 TrainerInternal.cpp:162]  Batch=140 samples=17920 AvgCost=0.000325527 CurrentCost=0.000283505 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:30:16.785081  9787 TrainerInternal.cpp:162]  Batch=150 samples=19200 AvgCost=0.00032612 CurrentCost=0.000334426 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:35:51.114645  9787 TrainerInternal.cpp:162]  Batch=160 samples=20480 AvgCost=0.000321745 CurrentCost=0.000256123 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:41:19.795605  9787 TrainerInternal.cpp:162]  Batch=170 samples=21760 AvgCost=0.000315861 CurrentCost=0.000221709 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:46:45.738401  9787 TrainerInternal.cpp:162]  Batch=180 samples=23040 AvgCost=0.000322339 CurrentCost=0.000432473 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 

I0923 03:52:27.157814  9787 TrainerInternal.cpp:162]  Batch=190 samples=24320 AvgCost=0.000319989 CurrentCost=0.000277687 Eval: classification_error_evaluator=0  CurrentEval: classification_error_evaluator=0 
I0923 03:55:21.358279  9787 TrainerInternal.cpp:179]  Pass=7 Batch=196 samples=25000 AvgCost=0.000321772 Eval: classification_error_evaluator=0 
I0923 04:29:03.403729  9787 Tester.cpp:111]  Test samples=25000 cost=0.677085 Eval: classification_error_evaluator=0.1594 
I0923 04:29:03.541514  9787 GradientMachine.cpp:112] Saving parameters to ./model_output/pass-00007
I0923 04:29:04.631625  9787 Util.cpp:219] copy trainer_config.py to ./model_output/pass-00007
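
The figures in this log can be pulled out with a small parser. The sketch below assumes the log lines keep exactly the layout shown above; the regular expressions and function names are my own and are not part of PaddlePaddle. It also checks the batch count: 25000 samples at 128 per batch gives 196 batches, which matches the Batch=196 line.

```python
import math
import re

# Patterns written against the log lines shown above (an assumption about the format).
BATCH_RE = re.compile(r"Batch=(\d+) samples=(\d+) AvgCost=([\d.eE+-]+)")
TEST_RE = re.compile(
    r"Test samples=(\d+) cost=([\d.eE+-]+) "
    r"Eval: classification_error_evaluator=([\d.eE+-]+)"
)


def parse_batch(line):
    """Return (batch, samples, avg_cost) from a training line, or None."""
    match = BATCH_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def parse_test(line):
    """Return (samples, cost, error) from a Tester line, or None."""
    match = TEST_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def batches_per_pass(num_samples, batch_size):
    """Number of batches needed to cover all samples once."""
    return math.ceil(num_samples / batch_size)
```

One thing the parsed numbers make visible: the training error is 0 while the test error is 0.1594 with a test cost of 0.677085, a gap that may point to overfitting on this run.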

V. Testing and prediction
Run test.sh to pick the best model from the training results above.
Run predict.sh to perform sentiment prediction with that best model.
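
Picking the best model amounts to comparing the test error of each saved pass. The sketch below shows only that idea; it is not how test.sh is implemented, the function names are hypothetical, and the only grounded details are the pass-00007 folder name and the 0.1594 error from the log above.

```python
def pass_dir(pass_id, root="./model_output"):
    """Build a folder name in the style seen in the log, e.g. pass-00007."""
    return f"{root}/pass-{pass_id:05d}"


def best_pass(test_errors):
    """Return the key with the lowest test classification error.

    `test_errors` maps a pass folder to its error rate on the test set.
    """
    if not test_errors:
        raise ValueError("no passes to compare")
    return min(test_errors, key=test_errors.get)
```

For example, given a dictionary that maps each pass folder to the error reported in its Tester log line, best_pass returns the folder to hand to the prediction step.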
