Paper Reproduction | pointer-generator

Paper code link: https://github.com/becxer/pointer-generator/

I. Data (CNN, Daily Mail)
Data processing (code link): https://github.com/becxer/cnn-dailymail/

Goal: process the dataset into binary form.

1. Download the data
You will need a VPN/proxy to get around the firewall; download the two stories archives, one for CNN and one for Daily Mail.

Some files contain examples whose articles are missing; the newer code removes these.

2. Download Stanford CoreNLP (the latest version at the time of writing is 3.8.0, but in my testing it did not work; you must use version 3.7.0)

Environment: Linux

We need Stanford CoreNLP to tokenize the data.
Add the following line to your .bashrc (vim .bashrc):

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

Replace /path/to/ with the path to wherever you saved stanford-corenlp-full-2016-10-31.

To verify:
Run the following command:

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see the following output:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
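The same check can be run from Python, which is also roughly how the processing script drives the tokenizer later on (as an external java process). Below is a minimal sketch of my own, not part of the repo; it assumes the CLASSPATH above has already been exported in the current shell:

import subprocess

# Tokenize a sample sentence with the Stanford PTBTokenizer, mirroring the
# echo | java test above. Requires the CoreNLP 3.7.0 jar on CLASSPATH.
text = "Please tokenize this text."
proc = subprocess.run(
    ["java", "edu.stanford.nlp.process.PTBTokenizer"],
    input=text.encode("utf-8"),
    stdout=subprocess.PIPE,
    check=True,
)
print(proc.stdout.decode("utf-8").split())
# Expected: ['Please', 'tokenize', 'this', 'text', '.']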

3. Process into .bin and vocab files

Run:

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

Replace /path/to/cnn/stories with the path where you saved the cnn/stories files, and likewise for dailymail.

This script does the following:
1. It creates two new folders, cnn_stories_tokenized and dm_stories_tokenized, containing the tokenized versions of cnn/stories and dailymail/stories. This may take a while. You may see several "Untokenizable:" warnings from the Stanford Tokenizer; these appear to be related to Unicode characters.
2. For each of all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized data is lowercased and written into the binary files train.bin, val.bin and test.bin, which are placed in the newly created finished_files folder. This also takes a while.
3. Additionally, a vocab file is built from the training data and placed in finished_files.
4. Finally, train.bin, val.bin and test.bin are split into chunks of 1000 examples each. The chunk files are saved in finished_files/chunked, e.g. train_000.bin, train_001.bin, ..., train_287.bin. You can feed the model either the single files or the chunked files.
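As a sanity check, you can read one of the chunk files back. This is a minimal sketch, assuming the length-prefixed tf.Example format used by the pointer-generator data pipeline (an 8-byte length followed by a serialized Example with 'article' and 'abstract' byte features); the chunk path is just one of the files listed above:

import struct
from tensorflow.core.example import example_pb2

def read_bin(path):
    # Yield (article, abstract) pairs from one finished_files .bin chunk.
    with open(path, "rb") as f:
        while True:
            len_bytes = f.read(8)
            if not len_bytes:
                break  # end of file
            str_len = struct.unpack("q", len_bytes)[0]
            example = example_pb2.Example.FromString(f.read(str_len))
            article = example.features.feature["article"].bytes_list.value[0]
            abstract = example.features.feature["abstract"].bytes_list.value[0]
            yield article.decode("utf-8"), abstract.decode("utf-8")

first_article, first_abstract = next(read_bin("finished_files/chunked/train_000.bin"))
print(first_abstract)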

Output from the run:

Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable:   (U+202F, decimal: 8239)
Untokenizable: ️ (U+FE0F, decimal: 65039)
Untokenizable: ‬ (U+202C, decimal: 8236)
Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable:   (U+202F, decimal: 8239)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable:  (U+F06E, decimal: 61550)
Untokenizable: ‬ (U+202C, decimal: 8236)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
PTBTokenizer tokenized 80044550 tokens at 864874.49 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing ../cnn/stories to cnn_stories_tokenized.

Preparing to tokenize ../dailymail/stories to dm_stories_tokenized... (same as above; intermediate output omitted)
......
PTBTokenizer tokenized 203071165 tokens at 916186.85 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing ../dailymail/stories to dm_stories_tokenized.


Making bin file for URLs listed in url_lists/all_val.txt...
Writing story 0 of 13368; 0.00 percent done
Writing story 1000 of 13368; 7.48 percent done
Writing story 2000 of 13368; 14.96 percent done
Writing story 3000 of 13368; 22.44 percent done
Writing story 4000 of 13368; 29.92 percent done
Writing story 5000 of 13368; 37.40 percent done
Writing story 6000 of 13368; 44.88 percent done
Writing story 7000 of 13368; 52.36 percent done
Writing story 8000 of 13368; 59.84 percent done
Writing story 9000 of 13368; 67.32 percent done
Writing story 10000 of 13368; 74.81 percent done
Writing story 11000 of 13368; 82.29 percent done
Writing story 12000 of 13368; 89.77 percent done
Writing story 13000 of 13368; 97.25 percent done
Finished writing file finished_files/val.bin

Making bin file for URLs listed in url_lists/all_train.txt... (same as the previous two; intermediate output omitted)
......
Writing story 287000 of 287227; 99.92 percent done
Finished writing file finished_files/train.bin

Writing vocab file...
Finished writing vocab file
Splitting train data into chunks...
Splitting val data into chunks...
Splitting test data into chunks...
Saved chunked data in finished_files/chunked
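The vocab file produced above can be inspected in the same way. A small sketch, assuming the file is named finished_files/vocab and holds one "word count" pair per line with the most frequent words first (my reading of the format; worth checking against your own output):

# Load the vocabulary written to finished_files/.
vocab_counts = {}
with open("finished_files/vocab", encoding="utf-8") as f:
    for line in f:
        pieces = line.split()
        if len(pieces) != 2:
            continue  # skip any malformed lines
        word, count = pieces
        vocab_counts[word] = int(count)
print(len(vocab_counts), "words")
print(list(vocab_counts.items())[:10])  # the ten most frequent words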

Notes:

In each article, sentences are separated by spaces (i.e. 'sentence1 sentence2 sentence3 ...'), and each sentence is itself already tokenized (PTB tokenization).
