Paper: https://arxiv.org/abs/1706.03762
Code implementation: https://github.com/Kyubyong/transformer
These notes follow the structure of the original paper.
#1.Model Architecture
1.1. Encoder & Decoder stacks
Stack = 6 identical layers each for encoder and decoder
Sublayers = multi-head attention + position-wise feed-forward network
Embedding dimension d_model = 512
Key things to learn from the code: 1. self-attention 2. positional encoding 3. masking
Encoder input = [batch_size, seq_length, 512]
Encoder output = [batch_size, seq_length, 512] (serves as keys and values in the decoder's encoder-decoder attention)
1.2. Attention
Number of heads h = 8
a. Decoder layer = 3 sublayers:
self-attention + encoder-decoder attention + feed-forward network
Encoder-decoder attention: keys and values come from the encoder output; queries come from the previous decoder layer.
b. Encoder self-attention: keys, values, and queries all come from the same place (the output of the previous encoder layer).
c. Decoder self-attention: during training, a mask ensures the decoder at each time step can only see the words to its left.
In the code, focus mainly on the self-attention implementation.
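As a warm-up before reading the repo's modules.py, the core of self-attention can be sketched in plain NumPy (a single head, no batching or masking; function names here are illustrative, not the repo's exact identifiers):

```python
# Minimal sketch of scaled dot-product attention (single head).
# q, k, v: [seq_length, d_k]; the real implementation batches this
# and splits d_model=512 across 8 heads of d_k=64 each.
import numpy as np

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                   # [T_q, T_k] similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ v                                # weighted sum of values

q = np.random.randn(5, 64)
k = np.random.randn(5, 64)
v = np.random.randn(5, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (5, 64)
```

The division by sqrt(d_k) is the paper's scaling that keeps the dot products from pushing the softmax into regions with tiny gradients.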
1.3 Position-wise Feed-Forward Net
The feed-forward network applies two linear transformations (with a ReLU in between) to the embedding at each position independently.
As the paper describes, this is equivalent to two convolutions with kernel size 1.
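A minimal NumPy sketch of this FFN, using the paper's sizes (d_model = 512, inner dimension d_ff = 2048); the weights here are random placeholders:

```python
# Position-wise FFN: two linear maps with a ReLU in between, applied
# identically at every position. FFN(x) = max(0, x W1 + b1) W2 + b2.
import numpy as np

d_model, d_ff, T = 512, 2048, 10
w1 = np.random.randn(d_model, d_ff) * 0.01
b1 = np.zeros(d_ff)
w2 = np.random.randn(d_ff, d_model) * 0.01
b2 = np.zeros(d_model)

def ffn(x):                               # x: [T, d_model]
    hidden = np.maximum(0, x @ w1 + b1)   # ReLU, [T, d_ff]
    return hidden @ w2 + b2               # project back to [T, d_model]

x = np.random.randn(T, d_model)
print(ffn(x).shape)  # (10, 512)
```

Because the same weights are reused at every position, this really is a 1x1 convolution over the sequence dimension.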
1.4 Embedding & Softmax
The embedding layers share their weight matrix with the pre-softmax linear transformation, and in the embedding layers the weights are additionally multiplied by sqrt(d_model).
For example: with embedding dimension 512, the scale factor is sqrt(512) ≈ 22.6.
1.5 Positional Encoding
Positional encodings are precomputed values that are simply added to the embedding at the corresponding position.
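The paper's sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), can be precomputed once; a NumPy sketch:

```python
# Precompute the sinusoidal positional-encoding table [max_len, d_model].
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                # [max_len, 1]
    i = np.arange(d_model // 2)[None, :]             # [1, d_model/2]
    angles = pos / np.power(10000, 2 * i / d_model)  # [max_len, d_model/2]
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
# The table is simply added to the (scaled) token embeddings.
```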
#2.Why self-attention
Per-layer time complexity, with seq_length = N, representation dimension = d, CNN kernel width = k:
self-attention = O(N^2 * d)
RNN = O(N * d^2)
CNN = O(k * N * d^2)
1. Self-attention only has a computational advantage when N < d, i.e. when seq_length is relatively small.
2. A CNN kernel distinguishes near from far positions; in self-attention the path length between any two words is 1.
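Plugging in concrete numbers illustrates point 1: for a typical sentence length (N = 50, d = 512, so N < d), self-attention is the cheapest per layer:

```python
# Worked example of the per-layer complexity comparison above.
N, d, k = 50, 512, 3

self_attn = N**2 * d     # 1,280,000 operations
rnn = N * d**2           # 13,107,200 operations
cnn = k * N * d**2       # 39,321,600 operations
print(self_attn, rnn, cnn)
# Self-attention wins here because N < d; the break-even point is N = d.
```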
#3.Training
1. Weight updates = Adam optimizer with a learning rate that warms up and then decays
2. Regularization = dropout + label smoothing
Dropout is also applied to the embedding layer (the sum of embeddings and positional encodings)
Label-smoothing hyperparameter = 0.1
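The paper's learning-rate schedule is lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5); a minimal sketch:

```python
# Transformer learning-rate schedule: linear warmup for the first
# warmup_steps, then decay proportional to step^-0.5.
def noam_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises during warmup and peaks exactly at step == warmup_steps.
print(noam_lr(1), noam_lr(4000), noam_lr(40000))
```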
#4.Results
1.mutli-head不宜過多
2.更復雜的模型一般表現更好,dropout很關鍵
3.自學習的position encoding沒有明顯幫助
#References
1. seq2seq https://github.com/NLP-LOVE/ML-NLP/tree/master/NLP/16.5%20seq2seq
Each decoder step's input = the encoder's output vectors + the decoder's output from the previous time step
2. Transformer http://jalammar.github.io/illustrated-transformer/
3. Official API https://github.com/tensorflow/tensor2tensor
4. Implementation walkthrough referenced in these notes https://www.cnblogs.com/zhouxiaosong/p/11032431.html
#Code references
1. self-attention
See the code: https://github.com/Kyubyong/transformer/blob/fb023bb097e08d53baf25b46a9da490beba51a21/modules.py#L153
Notes:
1. Input: [N, T_q, d_model]; output: [N, T_q, d_model]
2. multi-head
3. Decoder self-attention uses both a padding_mask and a sequence_mask
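A rough sketch of what those two masks do (function names here are descriptive, not the repo's exact identifiers): the padding mask blanks out `<pad>` key positions, and the sequence (causal) mask hides future positions in decoder self-attention.

```python
import numpy as np

def padding_mask(key_ids, pad_id=0):
    # [N, T_k] -> [N, 1, T_k]; 1 where the key is a real token, 0 at padding
    return (key_ids != pad_id).astype(np.float32)[:, None, :]

def sequence_mask(t):
    # [t, t] lower-triangular matrix: position i may attend only to j <= i
    return np.tril(np.ones((t, t), dtype=np.float32))

ids = np.array([[5, 7, 0, 0]])   # one sequence with two padding tokens
print(padding_mask(ids))          # [[[1. 1. 0. 0.]]]
print(sequence_mask(3))
# Masked positions typically receive a large negative value added to the
# attention scores before the softmax, so their weights become ~0.
```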
2. encoder-decoder attention
query: previous decoder sublayer output [N, T_q, d_model]
key = value: encoder output [N, T_k, d_model]
See the code:
https://github.com/Kyubyong/transformer/blob/fb023bb097e08d53baf25b46a9da490beba51a21/model.py#L111
Notes:
1. The padding mask here is the key_mask (it masks padded source positions)
3. Others
3.1 LR warmup implementation
3.2 Add & Norm (residual connection + layer normalization) implementation
3.3 label smoothing implementation
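As an example for 3.3, a minimal sketch of label smoothing with the paper's epsilon = 0.1 (not the repo's exact code): each one-hot target is softened so the true class gets 1 - eps plus a small uniform share, and the remaining mass is spread over the other classes.

```python
import numpy as np

def label_smoothing(one_hot, eps=0.1):
    # Soften one-hot targets: true class -> (1 - eps) + eps/V, others -> eps/V
    vocab = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / vocab

targets = np.eye(4)[[2, 0]]   # two one-hot targets over a vocabulary of 4
print(label_smoothing(targets))
# True class -> 0.925, others -> 0.025; each row still sums to 1.
```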
Next step: study the BERT paper and its create_mask implementation; the padding_mask and sequence_mask implementations in the code above do not feel general enough.