Paper Reproduction | pointer-generator

Paper code link: https://github.com/becxer/pointer-generator/

I. Data (CNN, Daily Mail)
Data processing (code link): https://github.com/becxer/cnn-dailymail/

Goal: process the dataset into binary form.

1. Download the data
You will need a VPN/proxy to get around the firewall; download the two stories archives, one for CNN and one for Daily Mail.

Some files contain examples whose articles are missing; the newer code removes these.

2. Download Stanford CoreNLP (the latest version at the time of writing is 3.8.0, but in my testing it did not work; you must use version 3.7.0)

Environment: Linux

We need Stanford CoreNLP to tokenize the data.
Add the following line to your .bashrc (vim .bashrc):

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

Replace /path/to/ with the path to wherever you saved stanford-corenlp-full-2016-10-31.

To verify:
Run the following command:

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see the following output:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
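The same check can be run from Python, which is also roughly how the processing script drives the tokenizer later on (as an external java process). Below is a minimal sketch of my own, not part of the repo; it assumes the CLASSPATH above has already been exported in the current shell:

import subprocess

# Tokenize a sample sentence with the Stanford PTBTokenizer, mirroring the
# echo | java test above. Requires the CoreNLP 3.7.0 jar on CLASSPATH.
text = "Please tokenize this text."
proc = subprocess.run(
    ["java", "edu.stanford.nlp.process.PTBTokenizer"],
    input=text.encode("utf-8"),
    stdout=subprocess.PIPE,
    check=True,
)
print(proc.stdout.decode("utf-8").split())
# Expected: ['Please', 'tokenize', 'this', 'text', '.']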

3. Process into .bin and vocab files

Run:

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

Replace /path/to/cnn/stories with the path where you saved the cnn/stories files, and likewise for dailymail.

This script does the following:
1. It creates two new folders, cnn_stories_tokenized and dm_stories_tokenized, containing the tokenized versions of cnn/stories and dailymail/stories. This may take a while. You may see several "Untokenizable:" warnings from the Stanford Tokenizer; these appear to be related to Unicode characters.
2. For each of all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized data is lowercased and written into the binary files train.bin, val.bin and test.bin, which are placed in the newly created finished_files folder. This also takes a while.
3. Additionally, a vocab file is built from the training data and placed in finished_files.
4. Finally, train.bin, val.bin and test.bin are split into chunks of 1000 examples each. The chunk files are saved in finished_files/chunked, e.g. train_000.bin, train_001.bin, ..., train_287.bin. You can feed the model either the single files or the chunked files.
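As a sanity check, you can read one of the chunk files back. This is a minimal sketch, assuming the length-prefixed tf.Example format used by the pointer-generator data pipeline (an 8-byte length followed by a serialized Example with 'article' and 'abstract' byte features); the chunk path is just one of the files listed above:

import struct
from tensorflow.core.example import example_pb2

def read_bin(path):
    # Yield (article, abstract) pairs from one finished_files .bin chunk.
    with open(path, "rb") as f:
        while True:
            len_bytes = f.read(8)
            if not len_bytes:
                break  # end of file
            str_len = struct.unpack("q", len_bytes)[0]
            example = example_pb2.Example.FromString(f.read(str_len))
            article = example.features.feature["article"].bytes_list.value[0]
            abstract = example.features.feature["abstract"].bytes_list.value[0]
            yield article.decode("utf-8"), abstract.decode("utf-8")

first_article, first_abstract = next(read_bin("finished_files/chunked/train_000.bin"))
print(first_abstract)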

Output from the run:

Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable:   (U+202F, decimal: 8239)
Untokenizable: ️ (U+FE0F, decimal: 65039)
Untokenizable: ‬ (U+202C, decimal: 8236)
Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable:   (U+202F, decimal: 8239)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable:  (U+F06E, decimal: 61550)
Untokenizable: ‬ (U+202C, decimal: 8236)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
PTBTokenizer tokenized 80044550 tokens at 864874.49 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing ../cnn/stories to cnn_stories_tokenized.

Preparing to tokenize ../dailymail/stories to dm_stories_tokenized... (same as above; intermediate output omitted)
......
PTBTokenizer tokenized 203071165 tokens at 916186.85 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing ../dailymail/stories to dm_stories_tokenized.


Making bin file for URLs listed in url_lists/all_val.txt...
Writing story 0 of 13368; 0.00 percent done
Writing story 1000 of 13368; 7.48 percent done
Writing story 2000 of 13368; 14.96 percent done
Writing story 3000 of 13368; 22.44 percent done
Writing story 4000 of 13368; 29.92 percent done
Writing story 5000 of 13368; 37.40 percent done
Writing story 6000 of 13368; 44.88 percent done
Writing story 7000 of 13368; 52.36 percent done
Writing story 8000 of 13368; 59.84 percent done
Writing story 9000 of 13368; 67.32 percent done
Writing story 10000 of 13368; 74.81 percent done
Writing story 11000 of 13368; 82.29 percent done
Writing story 12000 of 13368; 89.77 percent done
Writing story 13000 of 13368; 97.25 percent done
Finished writing file finished_files/val.bin

Making bin file for URLs listed in url_lists/all_train.txt... (same as the previous two; intermediate output omitted)
......
Writing story 287000 of 287227; 99.92 percent done
Finished writing file finished_files/train.bin

Writing vocab file...
Finished writing vocab file
Splitting train data into chunks...
Splitting val data into chunks...
Splitting test data into chunks...
Saved chunked data in finished_files/chunked
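The vocab file produced above can be inspected in the same way. A small sketch, assuming the file is named finished_files/vocab and holds one "word count" pair per line with the most frequent words first (my reading of the format; worth checking against your own output):

# Load the vocabulary written to finished_files/.
vocab_counts = {}
with open("finished_files/vocab", encoding="utf-8") as f:
    for line in f:
        pieces = line.split()
        if len(pieces) != 2:
            continue  # skip any malformed lines
        word, count = pieces
        vocab_counts[word] = int(count)
print(len(vocab_counts), "words")
print(list(vocab_counts.items())[:10])  # the ten most frequent words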

Notes:

In each article, sentences are separated by spaces (i.e. 'sentence1 sentence2 sentence3 ...'), and each sentence is itself already tokenized (PTB tokenization).
