Dialogue Systems: Core Research Notes
Table of Contents
I. Introduction to the TensorFlow Seq2Seq API and Source Code Analysis
Reprinted from: https://blog.csdn.net/liuchonge/article/details/78856692
After the TensorFlow version upgrade, the old tf.nn.seq2seq code was moved to tf.contrib.legacy_seq2seq. This part of the API will likely be abandoned as well, since a new and more flexible API has been developed under tf.contrib.seq2seq. For now, however, the code and reference implementations found online still mostly use legacy_seq2seq, so we first analyze the functionality and source code of that module. This post covers the functions below; all of their definitions can be found in the file python/ops/seq2seq.py.
First, a look at what this file consists of. It mainly contains the following functions:
By call relationship and functionality, they can be grouped into the following structure:
model_with_buckets
    seq2seq functions:
        basic_rnn_seq2seq
            rnn_decoder
        tied_rnn_seq2seq
        embedding_tied_rnn_seq2seq
        embedding_rnn_seq2seq
            embedding_rnn_decoder
        embedding_attention_seq2seq
            embedding_attention_decoder
                attention_decoder
                    attention
        one2many_rnn_seq2seq
    loss functions:
        sequence_loss_by_example
        sequence_loss
Following the call hierarchy, the functions are introduced as follows:
1. The model_with_buckets() function
The top-level function, model_with_buckets(), is defined as:
def model_with_buckets(encoder_inputs,
                       decoder_inputs,
                       targets,
                       weights,
                       buckets,
                       seq2seq,
                       softmax_loss_function=None,
                       per_example_loss=False,
                       name=None):
  if len(encoder_inputs) < buckets[-1][0]:
    raise ValueError("Length of encoder_inputs (%d) must be at least that of la"
                     "st bucket (%d)." % (len(encoder_inputs), buckets[-1][0]))
  if len(targets) < buckets[-1][1]:
    raise ValueError("Length of targets (%d) must be at least that of last "
                     "bucket (%d)." % (len(targets), buckets[-1][1]))
  if len(weights) < buckets[-1][1]:
    raise ValueError("Length of weights (%d) must be at least that of last "
                     "bucket (%d)." % (len(weights), buckets[-1][1]))

  all_inputs = encoder_inputs + decoder_inputs + targets + weights
  # Store the loss and outputs of each bucket.
  losses = []
  outputs = []
  with ops.name_scope(name, "model_with_buckets", all_inputs):
    # Build one model per bucket, each on its own slice of the data.
    for j, bucket in enumerate(buckets):
      # All buckets share (reuse) the same parameters.
      with variable_scope.variable_scope(
          variable_scope.get_variable_scope(), reuse=True if j > 0 else None):
        # Call seq2seq to decode and obtain the outputs. Note that
        # encoder_inputs and decoder_inputs are pre-defined placeholder lists
        # whose length equals the maximum sequence length (i.e. the largest
        # bucket). With the example below, they are lists of length 20 and 30.
        # When building the model for a bucket, only take the placeholders
        # that bucket needs; e.g. for bucket (5, 10), take the first 5
        # encoder and first 10 decoder placeholders.
        bucket_outputs, _ = seq2seq(encoder_inputs[:bucket[0]],
                                    decoder_inputs[:bucket[1]])
        outputs.append(bucket_outputs)
        # If per_example_loss is set, call sequence_loss_by_example: the entry
        # appended to losses is a list of batch_size per-sample loss values.
        if per_example_loss:
          losses.append(
              sequence_loss_by_example(
                  outputs[-1],
                  targets[:bucket[1]],
                  weights[:bucket[1]],
                  softmax_loss_function=softmax_loss_function))
        # Otherwise call sequence_loss, which sums the above; the entry
        # appended to losses is a single scalar.
        else:
          losses.append(
              sequence_loss(
                  outputs[-1],
                  targets[:bucket[1]],
                  weights[:bucket[1]],
                  softmax_loss_function=softmax_loss_function))

  return outputs, losses
1.1 Parameters
encoder_inputs: the encoder inputs, a list of tensors; each item in the list is one token fed to the encoder.
decoder_inputs: the decoder inputs, a list of tensors; each item in the list is one token fed to the decoder.
targets: the target values, int32; they differ from decoder_inputs only by one <EOS> symbol.
weights: mask flags over the target sequence; weight = 0 on padding positions and weight = 1 otherwise.
buckets: the defined bucket values, a list such as [(5, 10), (10, 20), (20, 30), ...].
seq2seq: the seq2seq model to build; any of embedding_attention_seq2seq, embedding_rnn_seq2seq, basic_rnn_seq2seq, etc. introduced later can be used (see the wiring sketch after this list).
softmax_loss_function: the loss function, with signature (labels, logits); defaults to sparse_softmax_cross_entropy_with_logits.
per_example_loss: if True, sequence_loss_by_example is called and a list is returned whose elements are the loss values of individual samples; if False, sequence_loss is called and a single summed loss value is returned for the whole batch. See the analysis below.
name: optional name for this operation, defaults to "model_with_buckets".
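To see how these parameters fit together, here is a hedged usage sketch in TF 1.x style (not from the original post; the bucket list, vocabulary sizes, and cell size are illustrative assumptions):

import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq import (
    model_with_buckets, embedding_attention_seq2seq)

buckets = [(5, 10), (10, 20), (20, 30)]
# Placeholder lists are as long as the largest bucket.
encoder_inputs = [tf.placeholder(tf.int32, [None], name="enc%d" % i)
                  for i in range(20)]
decoder_inputs = [tf.placeholder(tf.int32, [None], name="dec%d" % i)
                  for i in range(30)]
targets = [tf.placeholder(tf.int32, [None], name="tgt%d" % i)
           for i in range(30)]   # decoder inputs shifted by one token
weights = [tf.placeholder(tf.float32, [None], name="w%d" % i)
           for i in range(30)]   # 0.0 on padding, 1.0 elsewhere
cell = tf.nn.rnn_cell.GRUCell(128)

def seq2seq_f(enc, dec):
    # Any of the seq2seq constructors analyzed below fits here.
    return embedding_attention_seq2seq(
        enc, dec, cell, num_encoder_symbols=10000,
        num_decoder_symbols=10000, embedding_size=128)

outputs, losses = model_with_buckets(
    encoder_inputs, decoder_inputs, targets, weights, buckets, seq2seq_f)
# len(outputs) == len(losses) == len(buckets); each losses[j] is a scalar here.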
1.2 Internal implementation
The purpose of bucketing is to reduce computation and speed up the model.
This code is fairly old; some places still use functions like static_rnn(). Once dynamic_rnn was introduced in newer versions of TF, none of this is really necessary anymore.
Analysis: the idea is simple. The input lengths are divided into intervals so that each sample only needs to be padded to the length of its bucket, rather than to the global maximum length.
Example: take buckets = [(5, 10), (10, 20), (20, 30), ...], where the first number of each bucket is the padded source length and the second number is the padded target length.
E.g. for '我愛你' -> 'I love you', the pair would be assigned to the first bucket; '我愛你' is then padded to a length-5 sequence and 'I love you' to a length-10 sequence.
In effect, each bucket is one parameter configuration of the model: one model is built per bucket, training picks the model whose lengths match the sequence, and all these models share their parameters. This part is easiest to understand by comparison with today's dynamic_rnn: dynamic_rnn pads each batch to the longest sample within that batch, whereas bucketing clusters the data by length in a preprocessing step. A minimal sketch of the bucketing idea follows.
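A plain-Python sketch of the bucket assignment and padding described above (PAD_ID and the token ids are illustrative):

buckets = [(5, 10), (10, 20), (20, 30)]
PAD_ID = 0

def assign_and_pad(source_ids, target_ids):
    # Pick the smallest bucket that fits, then pad to that bucket's lengths
    # instead of the global maximum.
    for source_len, target_len in buckets:
        if len(source_ids) <= source_len and len(target_ids) <= target_len:
            src = source_ids + [PAD_ID] * (source_len - len(source_ids))
            tgt = target_ids + [PAD_ID] * (target_len - len(target_ids))
            return src, tgt
    raise ValueError("sequence longer than the largest bucket")

# '我愛你' -> 'I love you': 3 source tokens and 3 target tokens,
# so the pair falls into the first bucket (5, 10).
src, tgt = assign_and_pad([4, 8, 15], [16, 23, 42])
print(len(src), len(tgt))  # 5 10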
2. The embedding_attention_seq2seq() function
tf.nn.seq2seq.embedding_attention_seq2seq
This function performs the decoding by calling a decoder function internally. As the name suggests, it implements both embedding and attention, where the attention follows the definition in the paper "Neural Machine Translation by Jointly Learning to Align and Translate":
# T stands for time_steps, the sequence length.
def embedding_attention_seq2seq(encoder_inputs,  # [T, batch_size]
                                decoder_inputs,  # [T, batch_size]
                                cell,
                                num_encoder_symbols,
                                num_decoder_symbols,
                                embedding_size,
                                num_heads=1,             # number of attention heads
                                output_projection=None,  # decoder output projection
                                feed_previous=False,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
  with variable_scope.variable_scope(
      scope or "embedding_attention_seq2seq", dtype=dtype) as scope:
    dtype = scope.dtype
    # Encoder. First deepcopy the cell: the seq2seq model consists of two
    # identical networks whose parameters are NOT shared, so the encoder and
    # the decoder need two distinct RNN cells.
    encoder_cell = copy.deepcopy(cell)
    # Embed the encoder inputs by simply wrapping the RNN cell in an
    # EmbeddingWrapper.
    encoder_cell = core_rnn_cell.EmbeddingWrapper(
        encoder_cell,
        embedding_classes=num_encoder_symbols,
        embedding_size=embedding_size)
    # The RNN model is still built with the static_rnn function here.
    encoder_outputs, encoder_state = rnn.static_rnn(
        encoder_cell, encoder_inputs, dtype=dtype)

    # First calculate a concatenation of encoder outputs to put attention on.
    # Convert the encoder outputs from a list into a tensor of shape
    # [batch_size, encoder_input_length, output_size]; the converted tensor
    # then serves as the input to the attention mechanism.
    top_states = [
        array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs
    ]
    attention_states = array_ops.concat(top_states, 1)

    # Decoder.
    output_size = None
    # Map the decoder output to num_decoder_symbols dimensions by simply
    # wrapping the RNN cell in an OutputProjectionWrapper.
    if output_projection is None:
      cell = core_rnn_cell.OutputProjectionWrapper(cell, num_decoder_symbols)
      output_size = num_decoder_symbols

    # If feed_previous is a Python bool, call embedding_attention_decoder
    # directly to decode.
    if isinstance(feed_previous, bool):
      return embedding_attention_decoder(
          decoder_inputs,
          encoder_state,
          attention_states,
          cell,
          num_decoder_symbols,
          embedding_size,
          num_heads=num_heads,
          output_size=output_size,
          output_projection=output_projection,
          feed_previous=feed_previous,
          initial_state_attention=initial_state_attention)

    # If feed_previous is a Tensor, we construct 2 graphs and use cond.
    def decoder(feed_previous_bool):
      # This function is called twice: the first call does not use reuse,
      # the second one does, i.e. decoder(True), then decoder(False).
      reuse = None if feed_previous_bool else True
      with variable_scope.variable_scope(
          variable_scope.get_variable_scope(), reuse=reuse):
        outputs, state = embedding_attention_decoder(
            decoder_inputs,
            encoder_state,
            attention_states,
            cell,
            num_decoder_symbols,
            embedding_size,
            num_heads=num_heads,
            output_size=output_size,
            output_projection=output_projection,
            feed_previous=feed_previous_bool,
            update_embedding_for_previous=False,
            initial_state_attention=initial_state_attention)
        state_list = [state]
        if nest.is_sequence(state):
          state_list = nest.flatten(state)
        return outputs + state_list

    outputs_and_state = control_flow_ops.cond(feed_previous,
                                              lambda: decoder(True),
                                              lambda: decoder(False))
    outputs_len = len(decoder_inputs)  # Outputs length same as decoder inputs.
    state_list = outputs_and_state[outputs_len:]
    state = state_list[0]
    if nest.is_sequence(encoder_state):
      state = nest.pack_sequence_as(
          structure=encoder_state, flat_sequence=state_list)
    return outputs_and_state[:outputs_len], state
2.1 Parameters
encoder_inputs: the encoder inputs, a list of int32 id tensors.
decoder_inputs: the decoder inputs, a list of int32 id tensors.
cell: an RNNCell; any of the common RNNCell definitions can be used.
num_encoder_symbols: the source vocab_size, used to define the encoder embedding matrix.
num_decoder_symbols: the target vocab_size, used to define the decoder embedding matrix.
embedding_size: the dimensionality of the embedding vectors.
num_heads: the number of attention heads, i.e. how many different attention weightings are used; more parameters are spent to obtain several attention vectors.
output_projection: the output projection layer. To map outputs back to num_decoder_symbols tokens, an extra projection layer with parameters W and b is needed, where W: [output_size, num_decoder_symbols] and b: [num_decoder_symbols]. If output_projection is the default None (training mode), the cell is wrapped in an OutputProjectionWrapper, which turns [batch_size, output_size] into [batch_size, num_decoder_symbols]. If output_projection is not None, the cell output stays [batch_size, output_size]. (The two cells differ, which directly affects the subsequent embedding_rnn_decoder decoding step and the definition of loop_function.) A sketch of constructing output_projection follows this list.
feed_previous: whether to feed the previous step's output as the next step's input. Usually set to True at test time; then only the first decoder input (the "GO" symbol) matters, and every later decoder input depends on the previous output.
initial_state_attention: defaults to False, meaning the initial attention is zero; if True, attention starts from the initial state and the attention states.
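As promised in the output_projection item above, a minimal sketch of building output_projection by hand (the variable names are illustrative, not from the source); this is the typical setup when one wants the raw [batch_size, output_size] decoder outputs, e.g. for sampled softmax:

import tensorflow as tf

output_size, num_decoder_symbols = 128, 10000   # assumed sizes
w = tf.get_variable("proj_w", [output_size, num_decoder_symbols])
b = tf.get_variable("proj_b", [num_decoder_symbols])
output_projection = (w, b)
# The decoder then emits [batch_size, output_size] tensors; full logits can
# be recovered outside the cell as tf.matmul(output, w) + b when needed.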
2.2 Internal implementation
The code above performs the embedded encoder stage and ends up with the hidden-layer vector encoder_outputs for every time step; the per-step outputs are then reshaped and concatenated into a [batch_size, encoder_input_length, output_size] tensor, which makes it convenient to compute the context vector c_i at each decoding step.
In the decoder stage, the RNNCell is first wrapped in an OutputProjectionWrapper for the output-layer mapping (projecting the output to the desired dimensionality), and then embedding_attention_decoder is called directly to decode. But when feed_previous is not a bool variable but a tensor, the inner function decoder is executed instead, as sketched below.
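A toy sketch (not from the source) of that two-graph pattern: both branches are built at graph-construction time, share their variables through reuse, and the boolean tensor selects a branch at run time:

import tensorflow as tf

def decoder_branch(feed_previous_bool):
    # The first call builds the variables (reuse=None); the second reuses them,
    # mirroring: reuse = None if feed_previous_bool else True.
    reuse = None if feed_previous_bool else True
    with tf.variable_scope(tf.get_variable_scope(), reuse=reuse):
        w = tf.get_variable("w", [])   # shared by both branches
        return w * (2.0 if feed_previous_bool else 1.0)

feed_previous = tf.placeholder(tf.bool, [])
out = tf.cond(feed_previous,
              lambda: decoder_branch(True),
              lambda: decoder_branch(False))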
2.3 Outputs
A (outputs, state) tuple pair:
- outputs is a list of 2D tensors, each of shape [batch_size, output_size] (i.e. [batch_size, num_decoder_symbols] when output_projection is None);
- state is the decoder cell's state at the last time step, of shape [batch_size, cell.state_size].
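For contrast with the training wiring sketched in section 1.1, an inference-time call (reusing the illustrative placeholder and cell names from that sketch, in a fresh graph) flips feed_previous:

# Inference mode: the decoder feeds its own previous output back in, so only
# decoder_inputs[0] (the 'GO' symbol) is actually consumed.
outputs, state = embedding_attention_seq2seq(
    encoder_inputs[:5], decoder_inputs[:10], cell,
    num_encoder_symbols=10000, num_decoder_symbols=10000,
    embedding_size=128, feed_previous=True)
# outputs: 10 tensors of shape [batch_size, 10000]; state: final decoder state.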
3. The embedding_attention_decoder() function
The embedding_attention_seq2seq function above calls this function directly when decoding.
Its definition:
def embedding_attention_decoder(decoder_inputs,
                                initial_state,
                                attention_states,
                                cell,
                                num_symbols,
                                embedding_size,
                                num_heads=1,
                                output_size=None,
                                output_projection=None,
                                feed_previous=False,
                                update_embedding_for_previous=True,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
  if output_size is None:
    output_size = cell.output_size
  if output_projection is not None:
    proj_biases = ops.convert_to_tensor(output_projection[1], dtype=dtype)
    proj_biases.get_shape().assert_is_compatible_with([num_symbols])

  with variable_scope.variable_scope(
      scope or "embedding_attention_decoder", dtype=dtype) as scope:
    # The decoder-stage embedding.
    embedding = variable_scope.get_variable("embedding",
                                            [num_symbols, embedding_size])
    # Apply output_projection to the previous cell's output, then embed the
    # result as the current cell's input; only used when feed_previous is set.
    loop_function = _extract_argmax_and_embed(
        embedding, output_projection,
        update_embedding_for_previous) if feed_previous else None
    # Embed decoder_inputs to obtain the input word vectors (used as the
    # inputs when feed_previous is not set).
    emb_inp = [
        embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs
    ]
    return attention_decoder(
        emb_inp,
        initial_state,
        attention_states,
        cell,
        output_size=output_size,
        num_heads=num_heads,
        loop_function=loop_function,
        initial_state_attention=initial_state_attention)
3.1 Parameters
decoder_inputs: here each input is a token id, with shape a list of [batch_size]. In other words, no embedding is required on the caller's side: just feed the tokens' indices (ids) in the vocab, and the id-to-embedding conversion happens internally.
num_symbols: the decoder-stage vocab_size.
embedding_size: the dimensionality each token is embedded into.
output_projection: if output_projection is the default None (training mode), the cell is wrapped in an OutputProjectionWrapper, turning the output [batch_size, output_size] into [batch_size, num_symbols]. If output_projection is not None, the cell output stays [batch_size, output_size].
update_embedding_for_previous: has no effect when the previous output is not fed as the current input (feed_previous=False); it only matters when feed_previous is True. If False, backprop only updates the embedding vector of the 'GO' token and leaves the other embeddings unchanged.
initial_state: 2D tensor [batch_size x cell.state_size], the RNN's initial state.
attention_states: 3D tensor [batch_size x attn_length x attn_size], the encoder-stage hidden vectors computed above.
3.2 Implementation
The first step creates the embedding used for decoding.
The second step creates a loop function, loop_function, which maps the previous step's output into the vocabulary space and emits a word embedding as the next step's input. A simplified sketch of this loop function follows.
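A simplified sketch of the loop_function produced by _extract_argmax_and_embed (paraphrased from the source, not verbatim):

import tensorflow as tf

def make_loop_function(embedding, output_projection=None,
                       update_embedding=True):
    def loop_function(prev, _):
        if output_projection is not None:
            # Project [batch_size, output_size] back to vocabulary logits.
            prev = tf.matmul(prev, output_projection[0]) + output_projection[1]
        prev_symbol = tf.argmax(prev, 1)   # greedy choice of the next token id
        emb_prev = tf.nn.embedding_lookup(embedding, prev_symbol)
        if not update_embedding:
            # Block gradients so only the 'GO' embedding gets updated.
            emb_prev = tf.stop_gradient(emb_prev)
        return emb_prev
    return loop_function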
4. The attention_decoder() function
tf.nn.attention_decoder
The paper involves three formulas:

$$u_i^t = v^T \tanh(W_1 h_i + W_2 d_t)$$
$$a_i^t = \mathrm{softmax}(u_i^t)$$
$$d_t' = \sum_{i=1}^{T_A} a_i^t h_i$$

The encoder outputs the hidden states $(h_1, \dots, h_{T_A})$, and the decoder has hidden states $(d_1, \dots, d_{T_B})$; $v$, $W_1$ and $W_2$ are parameters the model has to learn. "Attention" means that at every decoding time step a weighted sum of the encoder hidden states is computed, paying different degrees of attention to different pieces of information; the crux is therefore obtaining the weight of each hidden state. The attention mechanism in the source code is the most common variant and can be split into three steps: (1) compute a score between the current hidden state ($d_t$) and each attended hidden state ($h_i$); (2) normalize the scores into probabilities with softmax; (3) use them as weighting coefficients to sum the hidden states into a single information vector $d_t'$. How this vector is used afterwards varies by task.
In the formulas above, $a_i^t$ is the weighting coefficient applied to $h_i$ at decoding time step $t$.
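To make the three steps concrete, a minimal NumPy sketch for a single head and a single decoding step (all shapes are illustrative assumptions):

import numpy as np

def attention_step(h, d_t, v, W1, W2):
    # h: [T, attn_size] encoder hidden states; d_t: [state_size] decoder state.
    u = np.tanh(h @ W1 + d_t @ W2) @ v   # (1) relevance scores, shape [T]
    a = np.exp(u - u.max())
    a /= a.sum()                         # (2) softmax normalization
    return a @ h                         # (3) weighted sum: the context vector

T, attn_size, state_size, vec_size = 7, 16, 16, 16
rng = np.random.default_rng(0)
h, d_t = rng.normal(size=(T, attn_size)), rng.normal(size=state_size)
W1 = rng.normal(size=(attn_size, vec_size))
W2 = rng.normal(size=(state_size, vec_size))
v = rng.normal(size=vec_size)
print(attention_step(h, d_t, v, W1, W2).shape)  # (16,)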
4.1 Code
def attention_decoder(decoder_inputs,    # T * [batch_size, input_size]
                      initial_state,     # [batch_size, cell.state_size]
                      attention_states,  # [batch_size, attn_length, attn_size]
                      cell,
                      output_size=None,
                      num_heads=1,
                      loop_function=None,
                      dtype=None,
                      scope=None,
                      initial_state_attention=False):
  if not decoder_inputs:
    raise ValueError("Must provide at least 1 input to attention decoder.")
  if num_heads < 1:
    raise ValueError("With less than 1 heads, use a non-attention decoder.")
  if attention_states.get_shape()[2].value is None:
    raise ValueError("Shape[2] of attention_states must be known: %s" %
                     attention_states.get_shape())
  if output_size is None:
    output_size = cell.output_size

  with variable_scope.variable_scope(
      scope or "attention_decoder", dtype=dtype) as scope:
    dtype = scope.dtype

    batch_size = array_ops.shape(decoder_inputs[0])[0]  # Needed for reshaping.
    attn_length = attention_states.get_shape()[1].value
    if attn_length is None:
      attn_length = array_ops.shape(attention_states)[1]
    attn_size = attention_states.get_shape()[2].value

    # To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
    # For the 1x1 convolution, reshape attention_states into a 4-D tensor
    # [batch_size, num_steps, 1, attention_size]; the fourth dimension is
    # attention_size.
    hidden = array_ops.reshape(attention_states,
                               [-1, attn_length, 1, attn_size])
    # Hold the per-head variables: hidden_features stores W * h_j and v stores
    # v; each attention head has its own parameters.
    hidden_features = []
    v = []
    # ---- Next, compute v * tanh(W * h_j + U * z_i) as the relevance score.
    attention_vec_size = attn_size  # Size of query vectors for attention.
    # Compute W * h_j for every element of the hidden states.
    for a in xrange(num_heads):
      # The kernel size is 1x1, the input channel count is attn_size, and
      # there are attention_vec_size filters.
      k = variable_scope.get_variable("AttnW_%d" % a,
                                      [1, 1, attn_size, attention_vec_size])
      # The convolution result has shape
      # [batch_size, num_steps, 1, attention_vec_size].
      hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
      v.append(
          variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))

    state = initial_state  # The hidden state of the decoding RNN.

    def attention(query):
      """Put attention masks on hidden using hidden_features and query."""
      ds = []  # Results of attention reads will be stored here.
      # If the query is a tuple, flatten it and concatenate into a 2-D tensor.
      if nest.is_sequence(query):  # If the query is a tuple, flatten it.
        query_list = nest.flatten(query)
        for q in query_list:  # Check that ndims == 2 if specified.
          ndims = q.get_shape().ndims
          if ndims:
            assert ndims == 2
        query = array_ops.concat(query_list, 1)
      for a in xrange(num_heads):
        with variable_scope.variable_scope("Attention_%d" % a):
          # Compute U * z_i and reshape it to
          # [batch_size, 1, 1, attention_vec_size].
          y = Linear(query, attention_vec_size, True)(query)
          y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
          # Attention mask is a softmax of v^T * tanh(...).
          # Compute v * tanh(W * h_j + U * z_i).
          # hidden_features[a] + y has shape
          # [batch_size, num_steps, 1, attention_vec_size]; multiplying by the
          # vector v ([attention_vec_size]) keeps that shape. Then reduce_sum
          # over dimensions 2 and 3 yields a [batch_size, num_steps] tensor:
          # the score corresponding to each hidden vector.
          s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                                  [2, 3])
          # Normalize with the softmax function.
          a = nn_ops.softmax(s)
          # Now calculate the attention-weighted vector d:
          # the weighted sum over all hidden vectors.
          d = math_ops.reduce_sum(
              array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
          ds.append(array_ops.reshape(d, [-1, attn_size]))
      return ds

    outputs = []
    prev = None
    batch_attn_size = array_ops.stack([batch_size, attn_size])
    attns = [
        array_ops.zeros(
            batch_attn_size, dtype=dtype) for _ in xrange(num_heads)
    ]
    for a in attns:  # Ensure the second shape of attention vectors is set.
      a.set_shape([None, attn_size])
    # If initial_state_attention is set, call attention directly on the
    # (all-zero) initial state.
    if initial_state_attention:
      attns = attention(initial_state)
    # Then iterate over all decoder_inputs and decode them one by one.
    for i, inp in enumerate(decoder_inputs):
      if i > 0:
        # For i > 0, reuse the decoding RNN's parameters.
        variable_scope.get_variable_scope().reuse_variables()
      # If loop_function is set, we use it instead of decoder_inputs.
      # To feed the previous step's output as this step's input, call
      # loop_function to overwrite inp.
      if loop_function is not None and prev is not None:
        with variable_scope.variable_scope("loop_function", reuse=True):
          inp = loop_function(prev, i)
      # Merge input and previous attentions into one vector of the right size.
      input_size = inp.get_shape().with_rank(2)[1]
      if input_size.value is None:
        raise ValueError("Could not infer input size from input: %s" % inp.name)
      # The input is inp concatenated with attns, fed to the RNN cell.
      inputs = [inp] + attns
      x = Linear(inputs, input_size, True)(inputs)
      # Run the RNN.
      cell_output, state = cell(x, state)
      # Run the attention mechanism: compute the next attention vector.
      if i == 0 and initial_state_attention:
        with variable_scope.variable_scope(
            variable_scope.get_variable_scope(), reuse=True):
          attns = attention(state)
      else:
        attns = attention(state)

      with variable_scope.variable_scope("AttnOutputProjection"):
        inputs = [cell_output] + attns
        output = Linear(inputs, output_size, True)(inputs)
      if loop_function is not None:
        prev = output
      outputs.append(output)

  return outputs, state
Regarding the num_heads parameter: as we know, attention is a weighted sum of information, and one attention head corresponds to one way of weighting. This parameter sets how many attention heads perform the weighted sum, so formula (3) extends to one context vector per head, $d_t'^{(k)} = \sum_i a_i^{t,k} h_i$ for $k = 1, \dots, \text{num\_heads}$ (the per-head results are collected in the list ds in the code).
The term $W_1 h_j$ is implemented with a convolution; the returned tensor has shape [batch_size, attn_length, 1, attention_vec_size]:
# To calculate W1 * h_t we use a 1-by-1 convolution.
hidden = array_ops.reshape(
    attention_states, [-1, attn_length, 1, attn_size])
hidden_features = []
v = []
attention_vec_size = attn_size  # Size of query vectors for attention.
for a in xrange(num_heads):
  k = variable_scope.get_variable("AttnW_%d" % a,
                                  [1, 1, attn_size, attention_vec_size])
  hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
  v.append(
      variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
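Why a 1-by-1 convolution works here: with a kernel of shape [1, 1, attn_size, attention_vec_size], conv2d applies the same [attn_size, attention_vec_size] matrix at every time step, i.e. it computes $W_1 h_j$ for all $j$ in one op. A NumPy equivalence sketch (illustrative shapes):

import numpy as np

batch, T, attn_size, vec_size = 2, 7, 16, 16
hidden = np.random.randn(batch, T, 1, attn_size)
W1 = np.random.randn(attn_size, vec_size)
# The 1x1 convolution is exactly a per-time-step matrix multiply:
out = (hidden.reshape(batch, T, attn_size) @ W1).reshape(batch, T, 1, vec_size)
print(out.shape)  # (2, 7, 1, 16)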
The term $W_2 d_t$ is implemented with the linear mapping function linear shown below:
for a in xrange(num_heads):
  with variable_scope.variable_scope("Attention_%d" % a):
    # query corresponds to the current hidden state d_t.
    y = linear(query, attention_vec_size, True)
    y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
    # Compute u_t.
    s = math_ops.reduce_sum(
        v[a] * math_ops.tanh(hidden_features[a] + y), [2, 3])
    a = nn_ops.softmax(s)
    # Compute the attention-weighted vector d.
    d = math_ops.reduce_sum(
        array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden,
        [1, 2])
    ds.append(array_ops.reshape(d, [-1, attn_size]))