
Paper: Learning Neural Templates for Text Generation
GitHub code with my own annotations added

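First, a quick word on what this model is for.

Introduction

Encoder-decoder text generation models have become the mainstream approach in NLG, but they have drawbacks: (1) they are not interpretable, and (2) it is hard to control which talking points or content get selected.

The paper mainly improves the decoder. It uses a hidden semi-Markov model (HSMM) as the decoder; this model can learn templates, and those templates are both controllable and interpretable.

The model automatically handles talking-point selection and ordering, text template generation, and template slot filling, and finally produces a complete sentence.

The code's functionality is described below in four parts.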

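1. Data and data preparation

1.1 The open-source E2E dataset

E2E is one of the largest open-source datasets in the restaurant domain. It is commonly used for NLG. The task is essentially to turn a set of words into a sentence.

mr (textual meaning representation): the "words", i.e. attribute names and attribute values
ref: the human-readable reference sentence to be generated

An example mr: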

name[The Vaults],
eatType[pub],
priceRange[more than £30],
customer rating[5 out of 5],
near[Café Adriatic]

The corresponding reference sentences:

Near Café Adriatic is a five star rated, high priced pub called The Vaults.
The Vaults is a 5 stars pub with middle prices in Café Adriatic.
The Vaults Pub is close to Café Adriatic, it is five star rated and it has high prices
The Vaults is near Café Adriatic, it's a pub that ranges more than 30 and customers rate it 5 out of 5.
The Vaults is a five star, expensive public house situated close to Café Adriatic
There is an expensive, five-star pub called The Vaults located near Café Adriatic.
The Vaults is a local pub with a 5 star rating and prices starting at £30. You can find it near Café Adriatic.
The Vaults is a pub with menu items more than £30 and a customer rating of 5 out of 5. The Vaults is located near Café Adriatic.
The Vaults with a amazing 5 out of 5 customer rating, is a pub near the Café Adriatic.  Menu price are more than £30 per item.
Rated 5 star by diners, The Vaults offers Pub fair near Café Adriatic.
The Vaults in  Café Adriatic is a great 5 stars pub with middle prices.
The Vaults costs more than 30 pounds and has a 5 out of 5 rating.  It is a pub located near Café Adriatic.
The Vaults is a pub that costs more than 30 pounds and has a 5 out of 5 rating.  It is located near Café Adriatic.
The pub Café Adriatic ranges more than 30 and is rated 5 out of 5 its near The Vaults.
The Vaults is an expensive, five-star pub located near Café Adriatic.
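
To make the mr format concrete, here is a minimal sketch of how an mr string of the form key[value], key[value] could be split into attribute names and values. The function name parse_mr is my own and is not part of the repository; the regular expression assumes that values never contain square brackets.

```python
import re

def parse_mr(mr):
    """Parse 'key[value], key[value]' into a dict (hypothetical helper)."""
    # Each match is (attribute name, attribute value); names may contain spaces.
    pairs = re.findall(r"([^\[\],]+)\[([^\]]*)\]", mr)
    return {key.strip(): value.strip() for key, value in pairs}

mr = ("name[The Vaults], eatType[pub], priceRange[more than £30], "
      "customer rating[5 out of 5], near[Café Adriatic]")
fields = parse_mr(mr)
print(fields["near"])
```

A dict is enough here because, as noted below, each sample carries only one value per attribute.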

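Notes

1. The sentences are all factual descriptions (for example, that a restaurant is suitable for family meals); they contain no subjective or descriptive language.
2. The same sample may take different values along some dimension. For price, for instance, some people find it reasonable and others find it expensive.
3. Each sample has only one value per dimension. A restaurant may serve several cuisines, but each sample has a single food value (Japanese, Italian, Chinese, and so on).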

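1.2 The data used

source and target

src_train: the data (attributes and attribute values); 42061 entries in total, 4862 after deduplication
train: the sequences (generated sentences); 42061 entries in total, 40785 after deduplication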
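
For reference, counts like the ones above can be obtained by comparing the number of lines with the number of distinct lines. This is only a sketch of that idea; count_total_and_unique is a name I made up, and it assumes one sample per line with blank lines ignored.

```python
def count_total_and_unique(lines):
    """Return (total, unique) over non-empty lines (hypothetical helper)."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return len(cleaned), len(set(cleaned))

# Toy input; for the real data you would pass the lines of a file such as
# src_train.txt, e.g. count_total_and_unique(open(path, encoding="utf-8")).
sample = ["name[A], eatType[pub]", "name[B], eatType[pub]", "name[A], eatType[pub]", ""]
total, unique = count_total_and_unique(sample)
print(total, unique)
```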

Fields in source

['customerrating', 'name', 'area', 'food', 'near', 'priceRange', 'eatType']

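1.3 Data preparation

The data path data/sub_path currently contains src_train.txt, tgt_train.txt, train.txt, src_test.txt, tgt_test.txt, test.txt, src_valid.txt, tgt_valid.txt, and valid.txt.

The src_*.txt files hold the structured data; the tgt_*.txt files hold the readable text.

Model training, however, needs train.txt and valid.txt. How are those obtained?

Run: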

python data/make_e2e_labedata.py train > train.txt
python data/make_e2e_labedata.py valid > valid.txt

This step marks, in the generated text, the slot values that are exactly identical to values in the source data.
An example:

Source data:

name: The Vaults
eatType: pub
priceRange: more than £ 30
customerrating: 5 out of 5
near: Café Adriatic

Generated text:

The Vaults pub near Café Adriatic has a 5 star rating . Prices start at £ 30 .

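Only the values of the name, eatType, and near attributes appear in the generated text, so we mark them with their token positions and attribute ids. All punctuation marks share the same id.

The result for this example is: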

[(0, 2, idx('name')), (2, 3, idx('eatType')), (4, 6, idx('near')), (11, 12, idx('unknow')), (17, 18, idx('unknow'))]

This helps the model learn how template positions relate to segment types.
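
The labeling idea can be sketched in a few lines of Python. This is my own simplified illustration, not the code in make_e2e_labedata.py: label_spans is a hypothetical name, attribute names stand in for idx(...), matching is exact on whitespace tokens, and only ASCII punctuation is treated as punctuation.

```python
import string

def label_spans(fields, tokens, punct_label="unknow"):
    """Return (start, end, label) spans with end exclusive (hypothetical helper)."""
    # Try longer values first so a multi-token value wins over a shorter one.
    candidates = sorted(
        ((value.split(), key) for key, value in fields.items()),
        key=lambda item: -len(item[0]),
    )
    spans = []
    i = 0
    while i < len(tokens):
        for value, key in candidates:
            if tokens[i:i + len(value)] == value:
                spans.append((i, i + len(value), key))
                i += len(value)
                break
        else:
            # No attribute value starts here; punctuation gets the shared label.
            if all(ch in string.punctuation for ch in tokens[i]):
                spans.append((i, i + 1, punct_label))
            i += 1
    return spans

fields = {
    "name": "The Vaults",
    "eatType": "pub",
    "priceRange": "more than £ 30",
    "customerrating": "5 out of 5",
    "near": "Café Adriatic",
}
tokens = ("The Vaults pub near Café Adriatic has a 5 star rating . "
          "Prices start at £ 30 .").split()
spans = label_spans(fields, tokens)
print(spans)
```

On the example above this reproduces the five spans listed earlier, with attribute names in place of the ids.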

2. Training

python chsmm.py \
    -data data/labee2e/ \
    -emb_size 300 \
    -hid_size 300 \
    -layers 1 \
    -K 55 \
    -L 4 \
    -log_interval 200 \
    -thresh 9 \
    -emb_drop \
    -bsz 15 \
    -max_seqlen 55 \
    -lr 0.5 \
    -sep_attn \
    -max_pool \
    -unif_lenps \
    -one_rnn \
    -Kmul 5 \
    -mlpinp \
    -onmt_decay \
    -cuda \
    -seed 1818 \
    -save models/chsmm-e2e-300-55-5.pt

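The main steps are:

Shuffle the samples
make_combo_targs: packs the words and the copy information into a single Tensor
make_masks: the generated text may contain new words, which are handled later by masking. Two operations are available here: dropping the word outright, or averaging
get_uniq_fields: pads the Fields of each batch to the maximum length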
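
The "shuffle, then pad each batch to its maximum length" part of these steps can be sketched as follows. This is a plain-Python illustration under my own assumptions, not the repository's get_uniq_fields: pad_batch, make_batches, and the pad id 0 are all made up for the example.

```python
import random

def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

def make_batches(samples, batch_size, seed=1818):
    """Shuffle a copy of the samples, then cut it into padded batches."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [
        pad_batch(shuffled[i:i + batch_size])
        for i in range(0, len(shuffled), batch_size)
    ]

padded = pad_batch([[1, 2, 3], [4], [5, 6]])
batches = make_batches([[1], [2, 3], [4, 5, 6], [7]], batch_size=2)
print(padded)
```

Padding per batch rather than to a global maximum keeps the tensors as small as each batch allows.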

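3. Template extraction

This step generates the templates. Continuing with the example above, it produces templates that contain slots.

The extracted templates are saved under segs/. They are generated as follows:

Using the non-autoregressive method: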

python chsmm.py -data data/sub_path/ -emb_size 300 -hid_size 300 -layers 1 -K 55 -L 4 -log_interval 200 -thresh 9 -emb_drop -bsz 16 -max_seqlen 55 -lr 0.5  -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 5 -mlpinp -onmt_decay -cuda -load models/e2e-55-5.pt -label_train | tee segs/seg-e2e-300-55-5.txt
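
My understanding is that this step finds, for each training sentence, the best segmentation into labeled segments under the trained HSMM. The sketch below shows that kind of search (a semi-Markov Viterbi pass) on a toy scoring function. It is not the repository's implementation: segment_viterbi, seg_score, trans_score, and the toy scores are all assumptions of mine, and n_states and max_len only play the roles I take -K and -L to have.

```python
def segment_viterbi(n_tokens, n_states, max_len, seg_score, trans_score):
    """Best split of tokens[0:n_tokens] into labeled segments (needs n_tokens >= 1).

    seg_score(start, end, state): score of tokens[start:end] under `state`.
    trans_score(prev, state): transition score; prev is None for the first segment.
    Returns (best_score, [(start, end, state), ...]).
    """
    neg_inf = float("-inf")
    # best[t][k]: best score for tokens[0:t] when the last segment has state k.
    best = [[neg_inf] * n_states for _ in range(n_tokens + 1)]
    back = [[None] * n_states for _ in range(n_tokens + 1)]
    for end in range(1, n_tokens + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            for k in range(n_states):
                emit = seg_score(start, end, k)
                if start == 0:
                    options = [(trans_score(None, k) + emit, None)]
                else:
                    options = [
                        (best[start][j] + trans_score(j, k) + emit, j)
                        for j in range(n_states)
                        if best[start][j] > neg_inf
                    ]
                for cand, prev in options:
                    if cand > best[end][k]:
                        best[end][k] = cand
                        back[end][k] = (start, prev)
    # Follow back-pointers from the best final state.
    k = max(range(n_states), key=lambda s: best[n_tokens][s])
    score, segments, end = best[n_tokens][k], [], n_tokens
    while end > 0:
        start, prev = back[end][k]
        segments.append((start, end, k))
        end, k = start, prev
    segments.reverse()
    return score, segments

# Toy example: state 0 likes "The Vaults", state 1 likes "is a", state 2 likes "pub".
tokens = ["The", "Vaults", "is", "a", "pub"]
liked = {(("The", "Vaults"), 0), (("is", "a"), 1), (("pub",), 2)}

def toy_seg_score(start, end, state):
    span = tuple(tokens[start:end])
    return float(end - start) if (span, state) in liked else -float(end - start)

score, segments = segment_viterbi(len(tokens), 3, 4, toy_seg_score, lambda p, s: 0.0)
print(score, segments)
```

In the real model the segment and transition scores come from the trained network; only the search structure carries over.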

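4. Text generation

With the model and the templates in hand, text can be generated. (This is essentially a process of template selection and slot filling.)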
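
As rough intuition only, slot filling can be pictured as substituting attribute values into a chosen template. The real model fills segments with a neural decoder, so the snippet below is a deliberately naive stand-in; fill_template and the template representation are hypothetical.

```python
def fill_template(template, fields):
    """Fill a template made of literal strings and ('slot', field_name) items."""
    pieces = []
    for item in template:
        if isinstance(item, tuple):
            # A slot: look up the attribute value in the structured record.
            pieces.append(fields[item[1]])
        else:
            pieces.append(item)
    return " ".join(pieces)

template = [("slot", "name"), "is a", ("slot", "eatType"), "near", ("slot", "near"), "."]
fields = {"name": "The Vaults", "eatType": "pub", "near": "Café Adriatic"}
sentence = fill_template(template, fields)
print(sentence)
```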

Generation with the autoregressive model:

python chsmm.py -data data/sub_path/ -emb_size 300 -hid_size 300 -layers 1 -dropout 0.3 -K 60 -L 4 -log_interval 100 -thresh 9 -lr 0.5 -sep_attn -unif_lenps -emb_drop -mlpinp -onmt_decay -one_rnn -max_pool -gen_from_fi data/labee2e/src_uniq_valid.txt -load models/e2e-60-1-far.pt -tagged_fi segs/seg-e2e-60-1-far.txt -beamsz 5 -ntemplates 100 -gen_wts '1,1' -cuda -min_gen_tokes 0 > gens/gen-e2e-60-1-far.txt

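-gen_from_fi is followed by the file of structured data
-tagged_fi is followed by the templates we extracted
-load is the model we trained
The generated text is saved under gens/.

Note the format of the generated output: <text>|||<template segments>. In practice only the text needs to be kept.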
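
Keeping only the text part is a one-line split on the ||| separator. A minimal sketch follows; strip_template is my own name, and the segment string in the example is invented just to have something after the separator.

```python
def strip_template(line, sep="|||"):
    """Keep the text before the separator and drop the template segments."""
    text, _, _segments = line.partition(sep)
    return text.strip()

# The part after "|||" is a made-up placeholder, not real model output.
example = "The Vaults is a pub near Café Adriatic .|||placeholder-segments"
clean = strip_template(example)
print(clean)
```

Applied to every line of a file under gens/, this leaves one generated sentence per line.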

NeuralTemplateGen: a walkthrough of the code