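Preface
Seq2Seq models handle sequence-to-sequence problems in NLP. They are a common Encoder-Decoder architecture built on RNNs, and they remove one limitation of a plain RNN: the input and output no longer have to be the same length. For the architecture itself, see Seq2Seq詳解 (a detailed Seq2Seq walkthrough), or read the original paper, Sequence to Sequence Learning with Neural Networks. This article focuses on how to implement a Seq2Seq model in PyTorch.
Preparing the dataset
The dataset used here is deliberately tiny, because the goal is only hands-on practice with a Seq2Seq model, in order to better understand how NLP models are built and trained.
First, build the dictionary
Create an alphabet, which is really a dictionary of the form character: index. char_dic maps a character to its index, and char_list maps an index back to its character later on.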
char_list = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']
char_dic = {n:i for i,n in enumerate(char_list)}
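As a quick check of how the two lookups relate, the sketch below encodes a word into indices with char_dic and turns the indices back into text with char_list. The helper names encode and decode are illustrative only and do not appear in the original code.

```python
# Alphabet: 'S' = start symbol, 'E' = end symbol, 'P' = padding, then a-z
char_list = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']
char_dic = {n: i for i, n in enumerate(char_list)}  # character -> index

def encode(word):
    """Hypothetical helper: map each character to its index."""
    return [char_dic[c] for c in word]

def decode(indices):
    """Hypothetical helper: map each index back to its character."""
    return ''.join(char_list[i] for i in indices)

ids = encode('man')
print(ids)          # [15, 3, 16]
print(decode(ids))  # man
```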
Create the dataset by hand
seq_data = [['man', 'women'], ['black', 'white'], ['king', 'queen'], ['girl', 'boy'], ['up', 'down'], ['high', 'low']]
The dataset contains only 6 word pairs. A larger, more suitable dataset would likely give better training results.
word embedding
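This article uses one-hot encoding. The first word of each pair is the encoder input (input). The second word, prefixed with the start symbol S, is fed to the decoder (the code calls this output), and the second word followed by the end symbol E is the target used to compute the loss.
Note that, taking input as an example, all the input vectors in the dataset are eventually stacked into one large tensor, so every word must be encoded with the same number of time steps. The same holds for output and target. The words in the dataset do not all have the same length, so we set a maximum word length seq_len and pad every word with the uppercase letter P up to that length.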
def make_batch(seq_data):
    input_batch, output_batch, target_batch = [], [], []
    for seq in seq_data:
        # Pad both words with 'P' up to seq_len (without modifying seq_data)
        src, tgt = [w + 'P' * (seq_len - len(w)) for w in seq]
        input = [char_dic[n] for n in src]
        output = [char_dic[n] for n in ('S' + tgt)]  # decoder input starts with 'S'
        target = [char_dic[n] for n in (tgt + 'E')]  # loss target ends with 'E'
        # np.eye(n_class)[indices] selects one one-hot row per index
        input_batch.append(np.eye(n_class)[input])
        output_batch.append(np.eye(n_class)[output])
        target_batch.append(target)
    return (torch.tensor(np.array(input_batch), dtype=torch.float32),
            torch.tensor(np.array(output_batch), dtype=torch.float32),
            torch.tensor(target_batch, dtype=torch.long))
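The three tensors have the following shapes, where batchsize is the size of the training set, seq_len is the maximum word length, and n_classes is the size of the alphabet: input_batch is (batchsize, seq_len, n_classes), output_batch is (batchsize, seq_len + 1, n_classes) because of the leading S, and target_batch is (batchsize, seq_len + 1) because it holds class indices rather than one-hot vectors.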
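To make the padding and one-hot steps concrete, here is a small NumPy-only sketch that encodes a single word pair and prints the resulting shapes. The helper encode_pair is illustrative and not part of the original code; seq_len = 8 is the value used in the complete code at the end of this article.

```python
import numpy as np

char_list = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']
char_dic = {n: i for i, n in enumerate(char_list)}
n_class = len(char_list)  # 29
seq_len = 8               # maximum word length

def encode_pair(src, tgt):
    """Illustrative helper: pad one word pair with 'P' and encode it."""
    src = src + 'P' * (seq_len - len(src))
    tgt = tgt + 'P' * (seq_len - len(tgt))
    enc_in = np.eye(n_class)[[char_dic[c] for c in src]]        # one-hot rows
    dec_in = np.eye(n_class)[[char_dic[c] for c in 'S' + tgt]]  # one-hot rows
    target = np.array([char_dic[c] for c in tgt + 'E'])         # class indices
    return enc_in, dec_in, target

enc_in, dec_in, target = encode_pair('up', 'down')
print(enc_in.shape, dec_in.shape, target.shape)  # (8, 29) (9, 29) (9,)
print(enc_in[0].argmax())                        # 23, the index of 'u'
```

Stacking such arrays for every pair in the dataset adds the leading batch dimension.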
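Building the model
A Seq2Seq model has an encoder and a decoder. The encoder turns the input from all time steps into a single vector C, the context (semantic) vector, which carries information about the entire input word. The decoder decodes the C produced by the encoder into the output sequence.
The encoder takes the input tensor input and a pre-generated all-zeros hidden state hidden;
The decoder takes the context vector C produced by the encoder as its initial hidden state, together with the decoder input output that corresponds to the encoder's input.
Before input is fed to the encoder, its first and second dimensions have to be swapped. By default nn.RNN expects input of shape (seq_len, batchsize, n_classes), whereas the tensors generated earlier have shape (batchsize, seq_len, n_classes), so the dimensions need to be transposed. For the input and output shapes of an RNN, see this article.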
class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)
        self.decoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)
        self.fc = nn.Linear(n_hidden, n_class)

    def forward(self, enc_input, enc_hidden, dec_input):
        enc_input = enc_input.transpose(0, 1)  # swap the first and second dimensions
        dec_input = dec_input.transpose(0, 1)
        _, h_states = self.encoder(enc_input, enc_hidden)  # final hidden state = context vector C
        outputs, _ = self.decoder(dec_input, h_states)     # C initializes the decoder
        outputs = self.fc(outputs)                         # project to n_class scores per step
        return outputs
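The shape bookkeeping in forward can be checked in isolation. The sketch below is only a shape check on all-zero dummy tensors, using the sizes from this article (29 classes, hidden size 128, 6 word pairs, 8 encoder steps and 9 decoder steps); the variable names are illustrative.

```python
import torch
import torch.nn as nn

n_class, n_hidden = 29, 128        # values used in this article
batch, src_len, tgt_len = 6, 8, 9  # 6 pairs, seq_len, seq_len + 1

encoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)
decoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)

# Batch-first dummy data, transposed to the (steps, batch, features) layout
enc_input = torch.zeros(batch, src_len, n_class).transpose(0, 1)  # (8, 6, 29)
dec_input = torch.zeros(batch, tgt_len, n_class).transpose(0, 1)  # (9, 6, 29)
h0 = torch.zeros(1, batch, n_hidden)                               # (1, 6, 128)

_, context = encoder(enc_input, h0)       # final hidden state of the encoder
outputs, _ = decoder(dec_input, context)  # one hidden vector per decoder step

print(context.shape)  # torch.Size([1, 6, 128])
print(outputs.shape)  # torch.Size([9, 6, 128])
```

The encoder and decoder can run for different numbers of steps because only the fixed-size hidden state passes between them.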
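Training the model
Before training, define the loss function and the optimizer, with the learning rate set to 0.001. Note also that an initial hidden state hidden has to be created before each forward pass and passed to the encoder RNN. Its shape is (1, batchsize, n_hidden).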
model = Seq2Seq()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(5001):
    hidden = torch.zeros(1, batch_size, n_hidden)  # initial encoder hidden state
    optimizer.zero_grad()
    outputs = model(input_batch, hidden, output_batch)
    outputs = outputs.transpose(0, 1)  # back to (batch, seq_len + 1, n_class)
    loss = 0
    for i in range(batch_size):
        # Per-word loss: scores (seq_len + 1, n_class) against targets (seq_len + 1,)
        loss += criterion(outputs[i], target_batch[i])
    if (epoch % 500) == 0:
        print('epoch:{},loss:{}'.format(epoch, loss.item()))
    loss.backward()
    optimizer.step()
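The per-word loop over the batch can also be written as a single call. The sketch below uses random stand-in tensors rather than real model output, and checks that flattening the batch and time dimensions, then rescaling by the batch size, gives the same value as the loop. This holds here because every padded word has the same number of time steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, steps, n_class = 6, 9, 29
outputs = torch.randn(batch, steps, n_class)         # stand-in for model scores
targets = torch.randint(0, n_class, (batch, steps))  # stand-in for target_batch

criterion = nn.CrossEntropyLoss()  # averages over the rows it is given

# Per-word loop, as in the training code above
loop_loss = sum(criterion(outputs[i], targets[i]) for i in range(batch))

# Single call over all (batch * steps) rows, rescaled by the batch size
flat_loss = criterion(outputs.reshape(-1, n_class), targets.reshape(-1)) * batch

print(torch.allclose(loop_loss, flat_loss))  # True
```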
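Testing the model
Once the model is trained, it can be tested. As before, the input has to be converted to tensors, and the generated output has to be converted back to characters.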
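def translated(word):
    # The decoder input is a placeholder made of 'P', since the target word is unknown
    input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]])
    hidden = torch.zeros(1, 1, n_hidden)
    with torch.no_grad():
        outputs = model(input_batch, hidden, output_batch)  # (seq_len + 1, 1, n_class)
    predict = outputs.argmax(dim=2).squeeze(1)  # most likely class at each time step
    decode = [char_list[i] for i in predict.tolist()]
    # Cut at the first padding symbol, if the model produced one
    end = decode.index('P') if 'P' in decode else len(decode)
    result = ''.join(decode[:end])
    print(result)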
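The conversion from scores back to characters can be tried without a trained model. The sketch below feeds hand-made one-hot scores (not real model output) through the same argmax-and-truncate logic. The helper decode_scores is illustrative and not part of the original code.

```python
import numpy as np

char_list = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']
char_dic = {n: i for i, n in enumerate(char_list)}
n_class = len(char_list)

def decode_scores(scores):
    """Illustrative helper: scores has shape (steps, 1, n_class)."""
    predict = scores.argmax(axis=2).reshape(-1)  # best class per time step
    chars = [char_list[i] for i in predict]
    # Keep everything before the first padding symbol, if there is one
    end = chars.index('P') if 'P' in chars else len(chars)
    return ''.join(chars[:end])

# Hand-made scores whose argmax spells 'boyPPPPPE'
fake = np.eye(n_class)[[char_dic[c] for c in 'boyPPPPPE']].reshape(9, 1, n_class)
print(decode_scores(fake))  # boy
```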
Complete code
import numpy as np
import torch
import torch.nn as nn

# Alphabet: 'S' = start symbol, 'E' = end symbol, 'P' = padding, then a-z
char_list = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']
char_dic = {n: i for i, n in enumerate(char_list)}
seq_data = [['man', 'women'], ['black', 'white'], ['king', 'queen'], ['girl', 'boy'], ['up', 'down'], ['high', 'low']]
seq_len = 8    # maximum word length
n_hidden = 128
n_class = len(char_list)
batch_size = len(seq_data)

def make_batch(seq_data):
    input_batch, output_batch, target_batch = [], [], []
    for seq in seq_data:
        # Pad both words with 'P' up to seq_len (without modifying seq_data)
        src, tgt = [w + 'P' * (seq_len - len(w)) for w in seq]
        input = [char_dic[n] for n in src]
        output = [char_dic[n] for n in ('S' + tgt)]  # decoder input starts with 'S'
        target = [char_dic[n] for n in (tgt + 'E')]  # loss target ends with 'E'
        # np.eye(n_class)[indices] selects one one-hot row per index
        input_batch.append(np.eye(n_class)[input])
        output_batch.append(np.eye(n_class)[output])
        target_batch.append(target)
    return (torch.tensor(np.array(input_batch), dtype=torch.float32),
            torch.tensor(np.array(output_batch), dtype=torch.float32),
            torch.tensor(target_batch, dtype=torch.long))

input_batch, output_batch, target_batch = make_batch(seq_data)

class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()
        self.encoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)
        self.decoder = nn.RNN(input_size=n_class, hidden_size=n_hidden)
        self.fc = nn.Linear(n_hidden, n_class)

    def forward(self, enc_input, enc_hidden, dec_input):
        enc_input = enc_input.transpose(0, 1)  # (batch, steps, n_class) -> (steps, batch, n_class)
        dec_input = dec_input.transpose(0, 1)
        _, h_states = self.encoder(enc_input, enc_hidden)  # final hidden state = context vector C
        outputs, _ = self.decoder(dec_input, h_states)     # C initializes the decoder
        outputs = self.fc(outputs)                         # project to n_class scores per step
        return outputs

model = Seq2Seq()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5001):
    hidden = torch.zeros(1, batch_size, n_hidden)  # initial encoder hidden state
    optimizer.zero_grad()
    outputs = model(input_batch, hidden, output_batch)
    outputs = outputs.transpose(0, 1)  # back to (batch, seq_len + 1, n_class)
    loss = 0
    for i in range(batch_size):
        loss += criterion(outputs[i], target_batch[i])
    if (epoch % 500) == 0:
        print('epoch:{},loss:{}'.format(epoch, loss.item()))
    loss.backward()
    optimizer.step()

def translated(word):
    # The decoder input is a placeholder made of 'P', since the target word is unknown
    input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]])
    hidden = torch.zeros(1, 1, n_hidden)
    with torch.no_grad():
        outputs = model(input_batch, hidden, output_batch)  # (seq_len + 1, 1, n_class)
    predict = outputs.argmax(dim=2).squeeze(1)  # most likely class at each time step
    decode = [char_list[i] for i in predict.tolist()]
    # Cut at the first padding symbol, if the model produced one
    end = decode.index('P') if 'P' in decode else len(decode)
    result = ''.join(decode[:end])
    print(result)

# Try the trained model on a few words from the training set
for word in ['man', 'black', 'king']:
    translated(word)