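Hands-on: movie sentiment classification with Self-Attention and a Transformer Encoder
Working through the code to deepen our understanding of how Self-Attention and the Transformer model are implemented:
- Data preprocessing and analysis, using torchtext for the preprocessing
- Training a Self-Attention model
- Training a sentiment classifier based on Transformer Encoder code
- The Transformer paper and its code implementation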
Importing libraries
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from torchtext import data
import random
SEED = 1234
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('device = ',device)
device = cpu
Data preprocessing
Data analysis
A quick look at how the data is distributed
train = pd.read_csv('data/senti.train.tsv',sep='\t',header=None,names=['data','label'])
val = pd.read_csv('data/senti.dev.tsv',sep='\t',names=['data','label'])
test = pd.read_csv('data/senti.test.tsv',sep= '\t',names=['data','label'])
train.head()
| | data | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |
Check whether any field is empty: the data looks clean, with no NaN values in any split
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67349 entries, 0 to 67348
Data columns (total 2 columns):
data 67349 non-null object
label 67349 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.0+ MB
val.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 872 entries, 0 to 871
Data columns (total 2 columns):
data 872 non-null object
label 872 non-null int64
dtypes: int64(1), object(1)
memory usage: 13.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1821 entries, 0 to 1820
Data columns (total 2 columns):
data 1821 non-null object
label 1821 non-null int64
dtypes: int64(1), object(1)
memory usage: 28.5+ KB
Now let's look at the overall size of each split
print('Training set size: ',train.shape[0])
print('Validation set size: ',val.shape[0])
print('Test set size: ',test.shape[0])
Training set size: 67349
Validation set size: 872
Test set size: 1821
Label distribution, to check whether the classes are balanced: the two labels are reasonably balanced in every split
train['label'].value_counts()
1 37569
0 29780
Name: label, dtype: int64
val['label'].value_counts()
1 444
0 428
Name: label, dtype: int64
test['label'].value_counts()
0 912
1 909
Name: label, dtype: int64
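To put a number on "reasonably balanced", the counts printed above can be turned into class fractions and a majority-class baseline. This is only a small sketch: the counts are copied by hand from the output above, and the helper names `positive_fraction` and `majority_baseline` are made up for illustration.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    "train": {1: 37569, 0: 29780},
    "val": {1: 444, 0: 428},
    "test": {1: 909, 0: 912},
}

def positive_fraction(counts):
    """Fraction of examples labelled 1 (positive)."""
    return counts[1] / (counts[0] + counts[1])

def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())

for split, counts in label_counts.items():
    print(f"{split}: positive = {positive_fraction(counts):.4f}, "
          f"majority baseline = {majority_baseline(counts):.4f}")
```

Any model trained below should beat the majority baseline on the validation and test splits by a clear margin to be considered useful.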
Loading the data
- The dataset is defined with TabularDataset, which currently supports csv, tsv, and json files; its splits method loads the train, validation, and test sets in one call
- See the examples provided by torchtext: https://torchtext.readthedocs.io/en/latest/examples.html
- Field defines the type of the text and label columns
- Reference code for using torchtext
- https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb
- https://github.com/keitakurita/practical-torchtext/blob/master/Lesson%201%20intro%20to%20torchtext%20with%20text%20classification.ipynb
Declaring the Fields
# Declare the Fields
from torchtext.data import Field
tokenize = lambda x: x.split()  # tokenizer for the text field (for Chinese text, jieba could be used instead)
TEXT = Field(sequential=True, tokenize=tokenize, batch_first=True, include_lengths=True)
# batch_first=True puts batch_size in the first dimension; by default max_seq_len comes first
# include_lengths=True makes each text batch also carry the actual (unpadded) lengths
LABEL = Field(sequential=False, use_vocab=False, dtype=torch.float)
Creating our Dataset
# Create our Dataset
from torchtext.data import TabularDataset
train, val, test = TabularDataset.splits(
    path="data",  # the root directory where the data lies
    train='senti.train.tsv',
    validation="senti.dev.tsv",
    test="senti.test.tsv",
    format='tsv',
    fields=[("text", TEXT), ("label", LABEL)])
# Build the vocabulary from the TEXT field
# MAX_VOCAB_SIZE = 14000
TEXT.build_vocab(train)
LABEL.build_vocab(train)
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")
Unique tokens in TEXT vocabulary: 16284
Unique tokens in LABEL vocabulary: 3
# Next, let's look at the format of the data
print('TabularDataset example:')
example = val.examples[0]
print('text = ',example.text)
print('label = ',example.label)
print('*' * 60)
print("Let's look at the vocabulary:")
print('word -> index mapping: ',list(TEXT.vocab.stoi.items())[:5])
print('LABEL vocabulary:',dict(LABEL.vocab.stoi))  # built from the raw label strings; not used for training because LABEL has use_vocab=False
print('Most frequent tokens (top K):\n',TEXT.vocab.freqs.most_common(10))
TabularDataset example:
text = ['It', "'s", 'a', 'charming', 'and', 'often', 'affecting', 'journey', '.']
label = 1
************************************************************
Let's look at the vocabulary:
word -> index mapping: [('<unk>', 0), ('<pad>', 1), (',', 2), ('the', 3), ('and', 4)]
LABEL vocabulary: {'<unk>': 0, '1': 1, '0': 2}
Most frequent tokens (top K):
[(',', 25980), ('the', 24648), ('and', 19871), ('a', 19622), ('of', 17886), ('.', 12673), ('to', 12483), ("'s", 8764), ('is', 8638), ('that', 7689)]
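Creating the dataset Iterators
- For training we use a special Iterator, BucketIterator, to batch the data
- The network expects all sequences in a batch to have the same length, so shorter sequences are padded
For example: [ [3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1] ] -> [ [3, 15, 2, 7, 0], [4, 1, 0, 0, 0], [5, 5, 6, 8, 1] ]
A mask then marks which positions hold real words and which positions are padding
- By default BucketIterator yields text as [max_seq_length, batch_size]; because the Field was declared with batch_first=True, we get [batch_size, max_seq_length] instead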
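The padding example above can be reproduced with plain PyTorch, independently of torchtext. This is a minimal sketch: it uses 0 as the padding index to match the example (the vocabulary in this article assigns index 1 to `<pad>`), and the variable names are chosen for illustration only.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # padding index of the example above; the TEXT vocabulary in this article uses 1

sequences = [
    torch.tensor([3, 15, 2, 7]),
    torch.tensor([4, 1]),
    torch.tensor([5, 5, 6, 8, 1]),
]

# Pad every sequence to the length of the longest one: [batch_size, max_seq_len]
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD)

# 1.0 where there is a real token, 0.0 where there is padding
mask = (padded != PAD).float()

# The true lengths can be recovered from the mask
lengths = mask.sum(dim=1).long()

print(padded)
print(mask)
print(lengths)
```

The `mask` tensor plays the same role as the one computed from `TEXT.vocab.stoi['<pad>']` in the rest of the article.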
## Create the dataset Iterators
from torchtext.data import Iterator, BucketIterator
BATCH_SIZE = 64
PAD_IDX = TEXT.vocab.stoi['<pad>']
train_iter, val_iter, test_iter = BucketIterator.splits(
    (train, val, test),  # we pass in the datasets we want the iterator to draw data from
    batch_size=BATCH_SIZE,  # or batch_sizes=(xx, xx, xx)
    device=device,  # if you want to use the GPU, specify the GPU number here
    sort_key=lambda x: len(x.text),  # the BucketIterator needs to be told what function it should use to group the data.
    sort_within_batch=True,
    repeat=False  # we pass repeat=False because we want to wrap this Iterator layer.
)
Let's look at one batch
val_data = next(iter(val_iter))
val_data
[torchtext.data.batch.Batch of size 64]
[.text]:('[torch.LongTensor of size 64x7]', '[torch.LongTensor of size 64]')
[.label]:[torch.FloatTensor of size 64]
inputs,lengths = val_data.text
targets = val_data.label
mask = 1 - (inputs == TEXT.vocab.stoi['<pad>']).float()
print("inputs: ",inputs.shape)
print("lengths: ",lengths.shape)
print("target: ",targets.shape)
print("pad_idx: ", TEXT.vocab.stoi['<pad>'])
print("mask = ",mask.shape)
inputs: torch.Size([64, 7])
lengths: torch.Size([64])
target: torch.Size([64])
pad_idx: 1
mask = torch.Size([64, 7])
print('train_iter: ')
for batch in train_iter:
    print(batch)
    break
print('val_iter: ')
for batch in val_iter:
    print(batch)
    break
print('test_iter: ')
for batch in test_iter:
    print(batch)
    break
train_iter:
[torchtext.data.batch.Batch of size 64]
[.text]:('[torch.LongTensor of size 64x14]', '[torch.LongTensor of size 64]')
[.label]:[torch.FloatTensor of size 64]
val_iter:
[torchtext.data.batch.Batch of size 64]
[.text]:('[torch.LongTensor of size 64x7]', '[torch.LongTensor of size 64]')
[.label]:[torch.FloatTensor of size 64]
test_iter:
[torchtext.data.batch.Batch of size 64]
[.text]:('[torch.LongTensor of size 64x6]', '[torch.LongTensor of size 64]')
[.label]:[torch.FloatTensor of size 64]
The output above shows the structure of the text and label fields in each batch.
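Self-Attention model
- We define a sentence model based on self-attention.
Overall idea of the model (the code below actually uses the dot-product scoring scheme of the Transformer):
The weight of word t is the sum of the dot products between its embedding and the embeddings of all other words, normalized with a softmax.
Sum of the dot products between the current word and all other words: $s_t = \sum_{j \neq t} emb(x_t)^\top emb(x_j)$
Score after softmax normalization: $\alpha_t = \exp(s_t) / \sum_{j} \exp(s_j)$
Here x_t is the t-th word of sentence x, and emb denotes the word embedding function.
Vector representation of the sentence, the weighted sum of the word vectors: $h_{self} = \sum_t \alpha_t \, emb(x_t)$
- The probability that the sentence has positive sentiment is: $\sigma(w^\top h_{self} + b)$
- A residual connection can be added to the model by adding the average of the input word vectors: $h_{avg} + h_{self}$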
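The verbal description above (sum of dot products with all other words, softmax, weighted sum) can be sketched directly for a single unpadded sentence. This is an illustration of the description only, not the model trained below, which uses scaled dot-product attention; the function name `simple_self_attention` and the toy embeddings are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def simple_self_attention(emb):
    """emb: [seq_len, dim] word embeddings of one sentence (no padding)."""
    dots = emb @ emb.t()                             # pairwise dot products, [seq_len, seq_len]
    dots = dots - torch.diag(torch.diag(dots))       # drop each word's dot product with itself
    scores = dots.sum(dim=1)                         # sum over all other words, [seq_len]
    alpha = F.softmax(scores, dim=0)                 # attention weights, they sum to 1
    h_self = (alpha.unsqueeze(1) * emb).sum(dim=0)   # weighted sum of the word vectors, [dim]
    return h_self, alpha

# Toy sentence of three 2-dimensional "embeddings"
emb = torch.tensor([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
h_self, alpha = simple_self_attention(emb)
print(alpha)   # the third word overlaps with both others, so it gets the largest weight
print(h_self)
```

In this toy case the per-word scores are 1, 1, and 2, so the weights are the softmax of those three numbers.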
Model definition
import math
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class SelfAttentionModel(nn.Module):
    def __init__(self,vocab_size,embedding_dim,p_drop,output_size,padding_idx,residual_conn=False):
        super(SelfAttentionModel,self).__init__()
        self.residual_conn = residual_conn
        self.drop = nn.Dropout(p_drop)
        self.embeddings = nn.Embedding(vocab_size,embedding_dim,padding_idx=padding_idx)
        self.linear = nn.Linear(embedding_dim,output_size)
        self.init_weights()
    # Added because the model then converges quickly to a fairly good solution (you can also try running without it)
    # Based on the official tutorial: https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
    def init_weights(self):
        initrange = 0.1
        # note: this also overwrites the padding_idx row, so the padding embedding is no longer zero
        self.embeddings.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)
    def forward(self,inputs,mask):
        # inputs: [batch_size,seq_len]
        # mask: [batch_size,seq_len]
        # query, key, value: (batch_size, seq_len, embedding_dim), each with its own dropout mask
        query = self.drop(self.embeddings(inputs))
        key = self.drop(self.embeddings(inputs))
        value = self.drop(self.embeddings(inputs))
        h_self,_ = self.attention(query,key,value,mask=mask)
        if self.residual_conn:
            # average of the input word vectors
            mask = mask.unsqueeze(2)  # [batch_size,seq_len,1]
            query = query * mask  # [batch_size,seq_len,embedding_dim], padded positions are set to 0
            h_avg = query.sum(1) / (mask.sum(1) + 1e-5)  # average vector of the sentence
            h_self = h_avg + h_self
        return self.linear(h_self).squeeze()
    def attention(self,query, key, value, mask=None, dropout=None):
        """
        Compute Scaled Dot Product Attention
        Reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html
        Implements the model following the self-attention formula
        """
        d_k = query.size(-1)
        # the score follows the Transformer implementation and is scaled by math.sqrt(d_k)
        scores = torch.matmul(query, key.transpose(-2, -1))/math.sqrt(d_k)  # [batch_size,seq_len,seq_len]
        if mask is not None:
            # [batch_size,seq_len,1]: broadcasts along the key axis, so the whole row of a padded query position is filled
            mask = mask.unsqueeze(2)
            scores = scores.masked_fill(mask == 0, -1e9)
        # scores after softmax normalization
        p_attn = F.softmax(scores, dim = -1)
        # weighted sum over the values, then sum over all query positions
        h_self = torch.matmul(p_attn, value).sum(1)  # [batch_size,embedding_dim]
        return h_self,p_attn  # sentence vector, normalized attention scores
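In the `attention` method above the mask is expanded to `[batch_size, seq_len, 1]`, which fills the rows of padded query positions but still lets real words attend to padded keys. A common alternative is to mask along the key axis and to drop padded rows before pooling. The sketch below shows that variant as a standalone function; it is not the code that produced the results reported in this article, and the name `masked_attention_pool` is an assumption made for the example.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention_pool(query, key, value, mask):
    """
    query, key, value: [batch_size, seq_len, dim]
    mask: [batch_size, seq_len], 1.0 for real tokens and 0.0 for padding
    Returns the pooled sentence vector [batch_size, dim] and the attention weights.
    """
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # Mask along the key axis: [batch_size, 1, seq_len] broadcasts over all queries,
    # so no query can put weight on a padded key.
    scores = scores.masked_fill(mask.unsqueeze(1) == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    context = torch.matmul(p_attn, value)            # [batch_size, seq_len, dim]
    # Zero out the rows that belong to padded query positions before pooling.
    context = context * mask.unsqueeze(2)
    return context.sum(dim=1), p_attn

torch.manual_seed(0)
x = torch.randn(1, 3, 4)                                  # a sentence with 3 real tokens
x_padded = torch.cat([x, torch.randn(1, 2, 4)], dim=1)    # the same sentence plus 2 padded slots
mask_full = torch.ones(1, 3)
mask_padded = torch.tensor([[1.0, 1.0, 1.0, 0.0, 0.0]])

h_plain, _ = masked_attention_pool(x, x, x, mask_full)
h_padded, attn = masked_attention_pool(x_padded, x_padded, x_padded, mask_padded)
print(torch.allclose(h_plain, h_padded, atol=1e-5))
```

With this masking the sentence vector does not depend on how much padding the batch happens to contain, which is the property the check at the end verifies.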
#
vocab_size = len(TEXT.vocab)
embedding_dim = 200
p_drop = 0.5
output_size = 1
padding_idx = TEXT.vocab.stoi['<pad>']
model = SelfAttentionModel(vocab_size,embedding_dim,p_drop,output_size,padding_idx)
model = model.to(device)
#
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss : the sigmoid and the binary cross entropy
print('train_iter: ')
for batch in train_iter:
    print(batch)
    print('*'*60)
    inputs,lengths = batch.text
    targets = batch.label  # [batch_size]
    mask = 1 - (inputs==TEXT.vocab.stoi['<pad>']).float()
    print("inputs:" ,inputs.shape)  # [batch_size, max_seq_len]
    print("targets:",targets.shape)  # [batch_size]
    print("mask:",mask.shape)  # [batch_size, max_seq_len]
    preds = model.forward(inputs,mask)
    print(preds[0])
    break
train_iter:
[torchtext.data.batch.Batch of size 64]
[.text]:('[torch.LongTensor of size 64x1]', '[torch.LongTensor of size 64]')
[.label]:[torch.FloatTensor of size 64]
************************************************************
inputs: torch.Size([64, 1])
targets: torch.Size([64])
mask: torch.Size([64, 1])
tensor(-0.0674, grad_fn=<SelectBackward>)
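Defining the training functions
- Train the model directly on the attention output
- Compute the attention output, add the average of the query vectors (residual connection), then train the model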
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc
def train(model,train_iter,criterion,optimizer):
    epoch_acc = 0.
    epoch_loss = 0.
    model.train()
    for batch in train_iter:
        #
        inputs,lengths = batch.text
        targets = batch.label  # [batch_size]
        mask = 1 - (inputs==TEXT.vocab.stoi['<pad>']).float()
        preds = model(inputs,mask)
        #
        loss = criterion(preds,targets)  # BCEWithLogitsLoss returns the mean loss of this batch
        acc = binary_accuracy(preds, targets)  # mean accuracy of this batch
        epoch_acc += acc.item()  # accuracy of the current batch
        epoch_loss += loss.item()  # loss of the current batch
        #
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return epoch_acc / len(train_iter),epoch_loss / len(train_iter)  # average over all batches = mean acc and loss
def evaluate(model,data_iter,criterion):
    epoch_acc = 0.
    epoch_loss = 0.
    model.eval()
    with torch.no_grad():
        for batch in data_iter:
            #
            inputs,lengths = batch.text
            targets = batch.label  # [batch_size]
            mask = 1 - (inputs==TEXT.vocab.stoi['<pad>']).float()
            preds = model(inputs,mask)
            #
            loss = criterion(preds,targets)
            acc = binary_accuracy(preds, targets)
            epoch_acc += acc.item()
            epoch_loss += loss.item()
    return epoch_acc / len(data_iter),epoch_loss / len(data_iter)
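The training functions rely on two facts: `nn.BCEWithLogitsLoss` applies the sigmoid internally, and accuracy is obtained by rounding the sigmoid of the logits. A small check on hand-picked values makes both explicit; the logits and targets below are made up for illustration.

```python
import torch
import torch.nn as nn

# Raw model outputs (logits) and binary targets, chosen by hand for the example
logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

# Fused version: sigmoid + binary cross entropy in one numerically stable step
fused = nn.BCEWithLogitsLoss()(logits, targets)

# Manual version: apply the sigmoid first, then plain binary cross entropy
manual = nn.BCELoss()(torch.sigmoid(logits), targets)

# Accuracy as computed in binary_accuracy: round the probabilities to 0 or 1
accuracy = (torch.round(torch.sigmoid(logits)) == targets).float().mean()

print(fused.item(), manual.item(), accuracy.item())
```

Because the loss already contains the sigmoid, the models in this article return raw logits, and the sigmoid is applied explicitly only when computing accuracy or a probability at prediction time.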
Training the model
vocab_size = len(TEXT.vocab)
embedding_dim = 200
p_drop = 0.5
output_size = 1
padding_idx = TEXT.vocab.stoi['<pad>']
model = SelfAttentionModel(vocab_size,embedding_dim,p_drop,output_size,padding_idx)
model = model.to(device)
#
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss : the sigmoid and the binary cross entropy
#
N_EPOCHS = 5
best_valid_loss = float('inf')
best_valid_acc = float('-inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_acc,train_loss = train(model,train_iter,criterion,optimizer)
    val_acc,val_loss = evaluate(model,val_iter,criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if val_acc > best_valid_acc:
        print('val acc creasing->')
        best_valid_acc = val_acc
        torch.save(model.state_dict(), 'self_attention-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {val_loss:.3f} | Val. Acc: {val_acc*100:.2f}%')
model.load_state_dict(torch.load('self_attention-model.pt'))
test_acc,test_loss = evaluate(model,test_iter,criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
val acc creasing->
Epoch: 01 | Epoch Time: 0m 52s
Train Loss: 0.384 | Train Acc: 83.46%
Val. Loss: 0.563 | Val. Acc: 80.09%
val acc creasing->
Epoch: 02 | Epoch Time: 1m 0s
Train Loss: 0.227 | Train Acc: 91.40%
Val. Loss: 0.682 | Val. Acc: 80.18%
Epoch: 03 | Epoch Time: 1m 1s
Train Loss: 0.190 | Train Acc: 92.84%
Val. Loss: 0.738 | Val. Acc: 79.87%
val acc creasing->
Epoch: 04 | Epoch Time: 1m 2s
Train Loss: 0.170 | Train Acc: 93.62%
Val. Loss: 0.799 | Val. Acc: 81.18%
val acc creasing->
Epoch: 05 | Epoch Time: 1m 3s
Train Loss: 0.157 | Train Acc: 94.27%
Val. Loss: 0.846 | Val. Acc: 81.41%
Test Loss: 0.754 | Test Acc: 80.42%
Retraining the model with residual_conn=True
vocab_size = len(TEXT.vocab)
embedding_dim = 200
p_drop = 0.5
output_size = 1
padding_idx = TEXT.vocab.stoi['<pad>']
model = SelfAttentionModel(vocab_size,embedding_dim,p_drop,output_size,padding_idx,residual_conn=True)
model = model.to(device)
#
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss : the sigmoid and the binary cross entropy
#
N_EPOCHS = 5
best_valid_loss = float('inf')
best_valid_acc = float('-inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_acc,train_loss = train(model,train_iter,criterion,optimizer)
    val_acc,val_loss = evaluate(model,val_iter,criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if val_acc > best_valid_acc:
        print('val acc creasing->')
        best_valid_acc = val_acc
        torch.save(model.state_dict(), 'self_attention-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {val_loss:.3f} | Val. Acc: {val_acc*100:.2f}%')
model.load_state_dict(torch.load('self_attention-model.pt'))
test_acc,test_loss = evaluate(model,test_iter,criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
val acc creasing->
Epoch: 01 | Epoch Time: 1m 0s
Train Loss: 0.363 | Train Acc: 84.31%
Val. Loss: 0.529 | Val. Acc: 80.31%
Epoch: 02 | Epoch Time: 1m 12s
Train Loss: 0.210 | Train Acc: 91.88%
Val. Loss: 0.632 | Val. Acc: 80.16%
Epoch: 03 | Epoch Time: 1m 13s
Train Loss: 0.177 | Train Acc: 93.31%
Val. Loss: 0.664 | Val. Acc: 80.29%
val acc creasing->
Epoch: 04 | Epoch Time: 1m 8s
Train Loss: 0.158 | Train Acc: 94.10%
Val. Loss: 0.757 | Val. Acc: 80.51%
Epoch: 05 | Epoch Time: 1m 18s
Train Loss: 0.146 | Train Acc: 94.53%
Val. Loss: 0.812 | Val. Acc: 79.44%
Test Loss: 0.714 | Test Acc: 79.56%
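The results did not improve: with the residual connection the test accuracy is 79.56%, against 80.42% without it.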
Online prediction
tokenizer = lambda x: x.split()
def predict_sentiment(model, text):
    model.eval()
    # tokens missing from the vocabulary fall back to PAD_IDX, so the mask below treats them as padding
    indexed = torch.LongTensor([TEXT.vocab.stoi.get(t, PAD_IDX) for t in tokenizer(text)]).to(device)
    indexed = indexed.unsqueeze(0)  # [batch_size,seq_len]
    mask = 1 - (indexed == TEXT.vocab.stoi['<pad>']).float()
    with torch.no_grad():
        pred = torch.sigmoid(model(indexed, mask))  # sigmoid(wx + b), the final predicted probability
    return pred.item()
predict_sentiment(model,"hide new secretions from the parental units")
0.006493980064988136
predict_sentiment(model,"Uneasy mishmash of styles and genres")
0.010009794495999813
predict_sentiment(model,'Director Rob Marshall went out gunning to make a great one .')
0.9782317280769348
predict_sentiment(model,'A well-made and often lovely depiction of the mysteries of friendship .')
0.9999963045120239
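`predict_sentiment` maps out-of-vocabulary tokens to `PAD_IDX`, which means they are masked out like padding. A frequently used alternative is to map them to the `<unk>` index so that they still occupy a position in the sentence. The sketch below contrasts the two choices on a toy vocabulary; `toy_stoi` and the helper `encode` are hypothetical and stand in for `TEXT.vocab.stoi`.

```python
def encode(text, stoi, fallback_idx):
    """Map whitespace-separated tokens to indices, using fallback_idx for unseen tokens."""
    return [stoi.get(token, fallback_idx) for token in text.split()]

# Hypothetical toy vocabulary with the same special tokens as TEXT.vocab
toy_stoi = {"<unk>": 0, "<pad>": 1, "a": 2, "great": 3, "movie": 4}
PAD = toy_stoi["<pad>"]

sentence = "a great zzz movie"   # "zzz" is not in the vocabulary

ids_unk = encode(sentence, toy_stoi, toy_stoi["<unk>"])
ids_pad = encode(sentence, toy_stoi, PAD)

# Mask as used in the article: 1 for real tokens, 0 for padding
mask_unk = [int(i != PAD) for i in ids_unk]
mask_pad = [int(i != PAD) for i in ids_pad]

print(ids_unk, mask_unk)
print(ids_pad, mask_pad)
```

With the `<unk>` fallback the unseen word stays visible to the model; with the `<pad>` fallback it silently disappears from the sentence.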
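Designing our own attention function and training the model
To improve the sentiment classifier we added an attention mechanism. Next we design our own attention function; typical directions are:
- Study the difference between dot product and cosine similarity in the attention mechanism (the dot-product version is implemented in the previous section)
- Use linear transformations to distinguish key, query, and value
- Use multiple attention heads
- Use positional encodings to add word position information
- Further ideas...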
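For the first direction in the list, the practical difference between dot-product and cosine scores is that the dot product grows with the norm of the embeddings, while cosine similarity only depends on their direction. A minimal sketch on toy vectors, with function names chosen for illustration:

```python
import torch
import torch.nn.functional as F

def dot_product_scores(x):
    """x: [seq_len, dim] -> pairwise dot products [seq_len, seq_len]."""
    return x @ x.transpose(-2, -1)

def cosine_scores(x):
    """x: [seq_len, dim] -> pairwise cosine similarities [seq_len, seq_len]."""
    x_unit = F.normalize(x, dim=-1)      # scale every vector to unit length
    return x_unit @ x_unit.transpose(-2, -1)

x = torch.tensor([[1.0, 0.0],
                  [2.0, 2.0],
                  [0.0, 3.0]])

print(dot_product_scores(x))   # long vectors get large scores
print(cosine_scores(x))        # values in [-1, 1], diagonal equal to 1

# Scaling all embeddings by 3 multiplies dot-product scores by 9
# but leaves cosine scores unchanged.
print(torch.allclose(dot_product_scores(3 * x), 9 * dot_product_scores(x)))
print(torch.allclose(cosine_scores(3 * x), cosine_scores(x)))
```

Since cosine scores are bounded, a softmax over them is much flatter than over raw dot products, so a cosine-based attention usually needs a temperature or learned scale to produce peaked weights.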
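The following code can serve as a reference.
The Transformer model below follows The Annotated Transformer (http://nlp.seas.harvard.edu/2018/04/03/attention.html), the same reference cited in the attention code above.
Transformer model architecture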
Embeddings and Softmax
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
class InputEmbeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(InputEmbeddings, self).__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.d_model = d_model
    def forward(self, x):
        return self.embed(x) * math.sqrt(self.d_model)
Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (cite).
In this work, we use sine and cosine functions of different frequencies:
$$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
$$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.
import torch
from torch.autograd import Variable
class PositionalEncoding(nn.Module):
    '''
    Implement the PE function.
    '''
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        # Slightly modified to run on CPU, see https://blog.csdn.net/brandday/article/details/100518612
        position = torch.arange(0., max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0., d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)],
                         requires_grad=False)
        return self.dropout(x)
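The class above computes the frequencies "in log space" through `div_term`. The following pure-Python sketch computes the encoding of a single position straight from the sine and cosine definition and checks that the log-space expression gives the same frequencies. The function names are chosen for illustration, and `i` here is the even dimension index (the `2i` of the paper's notation), matching `torch.arange(0., d_model, 2)`.

```python
import math

def positional_encoding(pos, d_model):
    """Encoding of one position: sin on even dimensions, cos on odd dimensions."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000.0 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe

def div_term_log_space(i, d_model):
    """Same frequency as 1 / 10000^(i / d_model), computed as in PositionalEncoding."""
    return math.exp(i * -(math.log(10000.0) / d_model))

print(positional_encoding(0, 4))   # position 0: sin(0) = 0 and cos(0) = 1 alternate
print(positional_encoding(1, 4))

for i in range(0, 20, 2):
    direct = 1.0 / (10000.0 ** (i / 20))
    assert math.isclose(div_term_log_space(i, 20), direct, rel_tol=1e-9)
```

Low dimension indices therefore oscillate quickly with position and high indices slowly, which is the geometric progression of wavelengths described in the text.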
Below the positional encoding will add in a sine wave based on position. The frequency and offset of the wave is different for each dimension.
torch.zeros(1, 100, 20).shape
torch.Size([1, 100, 20])
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)# d_model = 20,dropout=0
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
g=plt.legend(["dim %d"%p for p in [4,5,6,7]])
Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
We call our particular attention “Scaled Dot-Product Attention”. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Multi-head attention
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
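$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{\text{model}}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.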
import copy
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)
        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
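Steps 1 and 3 of `MultiHeadedAttention.forward` split the model dimension into heads with `view` and `transpose`, and later merge the heads back. The shape bookkeeping can be checked in isolation; the sizes below are arbitrary small values picked for the example.

```python
import torch

batch_size, seq_len, h, d_k = 2, 5, 4, 3
d_model = h * d_k

# A tensor with distinct entries so that any reordering would be visible
x = torch.arange(batch_size * seq_len * d_model, dtype=torch.float32)
x = x.view(batch_size, seq_len, d_model)

# Split: [batch, seq_len, d_model] -> [batch, h, seq_len, d_k]
heads = x.view(batch_size, -1, h, d_k).transpose(1, 2)

# Merge ("concat"): [batch, h, seq_len, d_k] -> [batch, seq_len, d_model]
merged = heads.transpose(1, 2).contiguous().view(batch_size, -1, h * d_k)

print(heads.shape)             # each head sees a d_k-sized slice of every token
print(torch.equal(merged, x))  # splitting and merging is lossless
```

Head `j` of a token is simply the slice `[j * d_k : (j + 1) * d_k]` of that token's `d_model`-dimensional vector, which is why the total cost stays close to single-head attention.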
import math
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class MyTransformerModel(nn.Module):
    def __init__(self,vocab_size,d_model,p_drop,h,output_size):
        super(MyTransformerModel,self).__init__()
        self.drop = nn.Dropout(p_drop)
        self.embeddings = InputEmbeddings(d_model,vocab_size)
        self.position = PositionalEncoding(d_model, p_drop)
        self.attn = MultiHeadedAttention(h, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.linear = nn.Linear(d_model, output_size)
        self.init_weights()
    # Added because the model then converges quickly to a fairly good solution (you can also try running without it)
    # Based on the official tutorial: https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
    def init_weights(self):
        initrange = 0.1
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)
    def forward(self,inputs,mask):
        # 1. embedding
        # (batch_size, seq_len, d_model)
        embeded = self.embeddings(inputs)
        # 2. positional encoding
        # (batch_size, seq_len, d_model)
        embeded = self.position(embeded)
        # (batch_size,seq_len,1)
        mask = mask.unsqueeze(2)
        # 3. multi-head attention, followed by a residual connection and layer normalization
        # (batch_size, seq_len, d_model)
        inp_attn = self.attn(embeded,embeded,embeded,mask)
        inp_attn = self.norm(inp_attn + embeded)
        # 4. masked average over the sequence, then the linear classifier
        # (batch_size, seq_len, d_model)
        inp_attn = inp_attn * mask
        # (batch_size,d_model)
        h_avg = inp_attn.sum(1)/(mask.sum(1) + 1e-5)
        return self.linear(h_avg).squeeze()
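The last step of `forward` is a masked mean: padded positions are zeroed and the sum is divided by the number of real tokens. A standalone sketch with hand-picked numbers shows that a padded position has no influence on the result; the function name `masked_mean` is an assumption made for the example.

```python
import torch

def masked_mean(hidden, mask, eps=1e-5):
    """
    hidden: [batch_size, seq_len, dim]
    mask:   [batch_size, seq_len], 1.0 for real tokens and 0.0 for padding
    """
    mask = mask.unsqueeze(2)                  # [batch_size, seq_len, 1]
    summed = (hidden * mask).sum(dim=1)       # padded positions contribute zero
    return summed / (mask.sum(dim=1) + eps)   # divide by the number of real tokens

# One sentence with two real tokens and one padded slot holding large values
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1.0, 1.0, 0.0]])

print(masked_mean(hidden, mask))   # close to the mean of the first two rows
```

The small `eps` only guards against division by zero for an all-padding row; it shifts the result by a negligible amount otherwise.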
vocab_size = len(TEXT.vocab)
print('vocab_size : ',vocab_size)
d_model = 512
p_drop = 0.5
h=2
output_size=1
model = MyTransformerModel(vocab_size,d_model,p_drop,h,output_size)
model = model.to(device)
#
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss : the sigmoid and the binary cross entropy
print('train_iter: ')
for batch in train_iter:
    print(batch)
    print('*'*60)
    inputs,lengths = batch.text
    targets = batch.label  # [batch_size]
    mask = 1 - (inputs==TEXT.vocab.stoi['<pad>']).float()
    print("inputs:" ,inputs.shape)  # [batch_size, max_seq_len]
    print("targets:",targets.shape)  # [batch_size]
    print("mask:",mask.shape)  # [batch_size, max_seq_len]
    preds = model.forward(inputs,mask)
    break
vocab_size : 16284
train_iter:
[torchtext.data.batch.Batch of size 64]
[.text]:('[torch.LongTensor of size 64x9]', '[torch.LongTensor of size 64]')
[.label]:[torch.FloatTensor of size 64]
************************************************************
inputs: torch.Size([64, 9])
targets: torch.Size([64])
mask: torch.Size([64, 9])
Training the model
vocab_size = len(TEXT.vocab)
print('vocab_size : ',vocab_size)
d_model = 512
p_drop = 0.5
h=4
output_size=1
model = MyTransformerModel(vocab_size,d_model,p_drop,h,output_size)
model = model.to(device)
#
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss() # BCEWithLogitsLoss : the sigmoid and the binary cross entropy
#
N_EPOCHS = 5
best_valid_acc = float('-inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_acc,train_loss = train(model,train_iter,criterion,optimizer)
    val_acc,val_loss = evaluate(model,val_iter,criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if val_acc > best_valid_acc:
        print('val acc creasing->')
        best_valid_acc = val_acc
        torch.save(model.state_dict(), 'mytransformer-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {val_loss:.3f} | Val. Acc: {val_acc*100:.2f}%')
model.load_state_dict(torch.load('mytransformer-model.pt'))
test_acc,test_loss = evaluate(model,test_iter,criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
vocab_size : 16284
val acc creasing->
Epoch: 01 | Epoch Time: 2m 41s
Train Loss: 0.613 | Train Acc: 68.19%
Val. Loss: 0.515 | Val. Acc: 75.54%
val acc creasing->
Epoch: 02 | Epoch Time: 3m 9s
Train Loss: 0.446 | Train Acc: 81.84%
Val. Loss: 0.452 | Val. Acc: 79.29%
val acc creasing->
Epoch: 03 | Epoch Time: 3m 24s
Train Loss: 0.360 | Train Acc: 86.49%
Val. Loss: 0.444 | Val. Acc: 79.64%
val acc creasing->
Epoch: 04 | Epoch Time: 3m 22s
Train Loss: 0.315 | Train Acc: 88.59%
Val. Loss: 0.430 | Val. Acc: 81.50%
val acc creasing->
Epoch: 05 | Epoch Time: 3m 24s
Train Loss: 0.280 | Train Acc: 89.83%
Val. Loss: 0.422 | Val. Acc: 81.90%
Test Loss: 0.420 | Test Acc: 81.02%
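Validation loss and validation accuracy are both still improving after 5 epochs, so training for more epochs might give better results; we do not continue the experiment here.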
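Whether more epochs are likely to help can be read off the validation losses logged above. The sketch below applies a simple early-stopping rule on validation loss to the two logged runs. Note that the training loops in this article select the checkpoint by validation accuracy instead; the function `best_epoch` and the `patience` parameter are hypothetical and only illustrate the idea.

```python
def best_epoch(val_losses, patience=2):
    """
    Return (epoch with the lowest validation loss so far, epoch at which training stops).
    Training stops after `patience` consecutive epochs without improvement.
    Epochs are 1-based.
    """
    best_loss = float("inf")
    best_idx = 0
    bad_epochs = 0
    for idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_idx, bad_epochs = loss, idx, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_idx, idx
    return best_idx, len(val_losses)

# Validation losses copied from the training logs above
self_attention_val = [0.563, 0.682, 0.738, 0.799, 0.846]
transformer_val = [0.515, 0.452, 0.444, 0.430, 0.422]

print(best_epoch(self_attention_val))  # loss rises after epoch 1: a sign of overfitting
print(best_epoch(transformer_val))     # loss is still falling at epoch 5
```

By this criterion the self-attention model would stop early, while the Transformer run never triggers the rule within 5 epochs, which is consistent with the remark that longer training may still help.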