1. Introduction

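The basic structure of Seq2Seq is an encoder-decoder, and the goal of the model is to generate a complete sentence. This model once brought a substantial improvement to Google Translate. Machine translation is used below as the running example to describe the model in detail.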

Note: following this material requires some background in LSTM deep learning models.

2. Seq2Seq

The Seq2Seq framework relies on an encoder-decoder. The encoder encodes the input sequence, and the decoder generates the target sequence.

2.1 Encoder

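Suppose the encoder receives the input "how are you". Each word is mapped to a $d$-dimensional word vector $w\in \mathbb{R}^{d}$, so in this example the input becomes $[w_{0},w_{1},w_{2}]\in \mathbb{R}^{d\times 3}$. Passing these vectors through the LSTM yields a hidden state for each word, $[e_{0},e_{1},e_{2}]$, together with a vector $e$ that represents the whole sentence. Here $e = e_{2}$, the last hidden state.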

2.2 Decoder

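We now have a vector that represents the sentence. We feed this vector into another LSTM cell, with the special token $w_{sos}$ as the start token, to generate the target sequence.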

At time step 0:

$h_{0}=\mathrm{LSTM}(e,w_{sos}) \qquad (1)$

$s_{0} = g(h_{0}) \qquad (2)$

$p_{0} = \mathrm{softmax}(s_{0}) \qquad (3)$

$i_{0} = \mathrm{argmax}(p_{0}) \qquad (4)$

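$e$: the sentence vector output by the encoder.

$w_{sos}$: a special token marking the start position, used as the input at the current time step.

$h_{0}$: the hidden state at the current time step, $h_{0}\in \mathbb{R}^{h}$, where $h$ is the dimension of the hidden layer.

$s_{0}$: the score of every word in the vocabulary, $s_{0}\in \mathbb{R}^{v}$, where $v$ is the vocabulary size.

$g$: a function (in practice just a weight matrix $W$ and a bias $b$), $\mathbb{R}^{h} \to \mathbb{R}^{v}$.

$p_{0}$: the probability distribution over the vocabulary obtained by normalizing $s_{0}$ with $\mathrm{softmax}$, $p_{0}\in \mathbb{R}^{v}$.

$i_{0}$: the index of the most probable word in $p_{0}$, an integer.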
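Equations (2) to (4) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `project`, `softmax` and `argmax`, and the values of `h0`, `W` and `b`, are made up for the example (here $h=2$ and $v=3$), and the LSTM step of equation (1) is taken as already computed.

```python
import math

def project(h, W, b):
    """Affine map g: R^h -> R^v, computed as W h + b."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def softmax(scores):
    """Normalize scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical decoder state h_0 and hypothetical parameters of g.
h0 = [1.0, -1.0]
W = [[0.5, 0.0], [0.0, 0.5], [1.0, -1.0]]
b = [0.0, 0.0, 0.1]

s0 = project(h0, W, b)  # equation (2): one score per vocabulary word
p0 = softmax(s0)        # equation (3): probabilities over the vocabulary
i0 = argmax(p0)         # equation (4): index of the predicted word
print(s0, p0, i0)
```

Because softmax is monotonic, taking the argmax of `s0` directly would give the same index. The probabilities matter mainly for training and for beam search.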

At time step 1:

$h_{1}=\mathrm{LSTM}(h_{0},w_{i_{0}}) \qquad (5)$

$s_{1} = g(h_{1}) \qquad (6)$

$p_{1} = \mathrm{softmax}(s_{1}) \qquad (7)$

$i_{1} = \mathrm{argmax}(p_{1}) \qquad (8)$

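The difference from time step 0 lies in the LSTM inputs:

$e\rightarrow h_{0}$: the hidden-state input changes from $e$ to the hidden state of the previous time step.

$w_{sos}\rightarrow w_{i_{0}}$: the input word becomes the word predicted at the previous time step.

Decoding continues in this way and stops only when the special end-of-sequence token is predicted.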

In effect, the method above performs the following transformation:

$\mathbb{P}[y_{t+1}|y_{1},\cdots ,y_{t},x_{0},\cdots ,x_{n}] \mapsto \mathbb{P}[y_{t+1}|y_{t},h_{t},e]$
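The decoding loop of this section can be sketched as follows. This is a minimal sketch, not a real model: `toy_step` is a hypothetical stand-in for the LSTM, $g$ and softmax combined, the vocabulary is invented for the example, and the `max_len` cap is an added safety assumption so that the loop always terminates.

```python
def greedy_decode(step, e, sos_id, eos_id, max_len=20):
    """Feed the argmax token back in until eos_id is predicted or max_len is hit."""
    state, token = e, sos_id
    output = []
    for _ in range(max_len):
        state, probs = step(state, token)  # one decoder time step
        token = max(range(len(probs)), key=lambda i: probs[i])
        if token == eos_id:
            break
        output.append(token)
    return output

# Hypothetical vocabulary, used only for this illustration.
VOCAB = ["<sos>", "comment", "vas", "tu", "<eos>"]

def toy_step(state, token_id):
    """Toy step: puts most of the probability mass on the next vocabulary entry."""
    next_id = min(token_id + 1, len(VOCAB) - 1)
    probs = [0.05] * len(VOCAB)
    probs[next_id] = 0.8
    return state + 1, probs  # the toy state just counts steps

ids = greedy_decode(toy_step, e=0, sos_id=0, eos_id=4)
words = [VOCAB[i] for i in ids]
print(words)
```

The state returned by one call is passed to the next, and the predicted index selects the next input word, which mirrors the substitutions $e\rightarrow h_{0}$ and $w_{sos}\rightarrow w_{i_{0}}$.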

3. Seq2Seq with Attention

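In general, adding an attention mechanism to seq2seq improves the model's capability. During decoding the model can attend to specific parts of the encoder sequence instead of relying only on the vector $e$ that represents the whole sentence.

With attention, the encoder is unchanged and the decoder changes accordingly.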

3.1 Decoder

$h_{t} = \mathrm{LSTM}(h_{t-1},[w_{i_{t-1}},c_{t}]) \qquad (9)$

$s_{t}=g(h_{t}) \qquad (10)$

$p_{t}=\mathrm{softmax}(s_{t}) \qquad (11)$

$i_{t}=\mathrm{argmax}(p_{t}) \qquad (12)$

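$h_{t-1}$: the hidden state from the previous time step, used as input, $h_{t-1}\in \mathbb{R}^{h}$.

$h_{t}$: the hidden state at the current time step, which is the LSTM output at step $t$ and the hidden-state input at step $t+1$, $h_{t}\in \mathbb{R}^{h}$.

$w_{i_{t-1}}$: the word vector of the word predicted at the previous time step, $w_{i_{t-1}}\in\mathbb{R}^{d}$, where $d$ is the word-vector dimension.

$c_{t}$: the context vector, a weighted sum of the encoder outputs, $c_{t}\in \mathbb{R}^{h}$, where $h$ is the dimension of the LSTM hidden layer.

$g$, $s_{t}$, $p_{t}$ and $i_{t}$ were explained in section 2.2 and are exactly the same here. The following shows how $c_{t}$ is obtained.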

$\alpha_{t'} = f(h_{t},e_{t'})\in \mathbb{R} \qquad \text{for all } t' \qquad (13)$

$\bar{\alpha} = \mathrm{softmax}(\alpha) \qquad (14)$

$c_{t} = \sum_{t'=0}^{T}\bar{\alpha_{t'}}e_{t'} \qquad (15)$

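$e_{t'}$: the encoder hidden state at time step $t'$.

$h_{t}$: the hidden state fed into the decoder at the current time step. Because $c_{t}$ is itself an input to equation (9), this has to be a state that is already available before step $t$ is computed, that is, the previous decoder hidden state.

$\alpha_{t'}$: the attention score that the current decoder time step assigns to encoder time step $t'$.

$\alpha =[\alpha_0,\alpha_1,\cdots ,\alpha_T]$: the vector of scores over all encoder time steps.

$\bar{\alpha}$: the values of $\alpha$ after softmax normalization, $\bar{\alpha} =[\bar{\alpha_{0}},\bar{\alpha_{1}},\cdots ,\bar{\alpha_{T}}]$.

$c_{t}$: the weighted sum of the encoder outputs at decoder time step $t$.

For the function $f$ there are several common choices. The three below are not exhaustive; use whichever operation works best.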

$f(h_{t},e_{t'})=\begin{cases} h_{t}^{T}e_{t'} & \text{dot} \\ h_{t}^{T}We_{t'} & \text{general} \\ v^{T}\tanh(W[h_{t},e_{t'}]) & \text{concat} \end{cases} \qquad (16)$
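To make equations (13) to (16) concrete, here is a small sketch in plain Python that covers the dot and general score functions. The function names, the `query` vector (standing for the decoder hidden state passed to $f$) and the encoder states are all hypothetical values chosen for illustration. The concat variant is omitted to keep the sketch short.

```python
import math

def dot(a, b):
    """The 'dot' score from equation (16)."""
    return sum(x * y for x, y in zip(a, b))

def general_score(W):
    """Build the 'general' score h^T W e for a given matrix W."""
    def f(h, e):
        We = [dot(row, e) for row in W]
        return dot(h, We)
    return f

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attention(query, encoder_states, score=dot):
    """Return the normalized weights and the context vector c_t."""
    alphas = [score(query, e) for e in encoder_states]  # equation (13)
    weights = softmax(alphas)                           # equation (14)
    dim = len(encoder_states[0])
    context = [sum(w * e[j] for w, e in zip(weights, encoder_states))
               for j in range(dim)]                     # equation (15)
    return weights, context

# Hypothetical encoder states e_0, e_1, e_2 and decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, context = attention(query, encoder_states)
print(weights, context)
```

With these numbers the raw scores are 2, 0 and 2, so the first and third encoder states receive equal weight and the second receives the least. Passing `score=general_score(W)` swaps in the second form of $f$ without changing the rest.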

4. Train

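Recall the example: the goal is to translate "how are you" into "comment vas tu".

If, during training, the decoder used the word predicted at time step $t-1$ as the input word at time step $t$, a wrong prediction at any step would likely derail the rest of the sequence. Errors accumulate, the model is not trained on the correct input distribution, and training becomes slow or may fail to make progress at all. To speed up training, one trick (commonly known as teacher forcing) is to feed the ground-truth token sequence $[\langle sos\rangle,comment,vas,tu]$ as input and to predict the next token at each position, $[comment,vas,tu,\langle eos\rangle]$.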
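The shifted input and target sequences can be built as in the sketch below. The function name `teacher_forcing_pairs` and the literal token strings are assumptions made for this illustration only.

```python
def teacher_forcing_pairs(target_tokens, sos="<sos>", eos="<eos>"):
    """Build decoder inputs and per-position prediction targets."""
    decoder_inputs = [sos] + list(target_tokens)   # what the decoder reads
    decoder_targets = list(target_tokens) + [eos]  # what it should predict
    return decoder_inputs, decoder_targets

inputs, targets = teacher_forcing_pairs(["comment", "vas", "tu"])
for x, y in zip(inputs, targets):
    print(f"input {x!r} -> predict {y!r}")
```

Both lists have the same length, so position $t$ of the input is always paired with the token that should follow it, regardless of what the model predicted at earlier steps.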

At each time step $t$ the decoder outputs a probability distribution over the vocabulary, $p_{t}\in \mathbb{R}^{v}$, where $v$ is the vocabulary size. For a given target sequence $[y_{1},y_{2},\cdots ,y_{n}]$ we can compute the probability of the whole sentence:

$\mathbb{P}(y_{1},\cdots ,y_{n})=\prod_{t=1}^{n}p_{t}[y_{t}] \qquad (17)$

Here $p_{t}[y_{t}]$ is the probability that the decoder assigns, at time step $t$, to the $t$-th target word $y_{t}$. We want to maximize this probability over the target sequence, which is equivalent to minimizing

$-\log\mathbb{P}(y_{1},\cdots ,y_{n}) \qquad (18)$

We define expression (18) as the loss function:

$-\log\mathbb{P}(y_{1},\cdots ,y_{n}) =-\log\prod_{t=1}^{n}p_{t}[y_{t}] \qquad (19)$

$-\log\mathbb{P}(y_{1},\cdots ,y_{n}) =-\sum_{t=1}^{n}\log p_{t}[y_{t}] \qquad (20)$

In the concrete example, the goal is to minimize:

$-\log p_{1}[comment]-\log p_{2}[vas]-\log p_{3}[tu]-\log p_{4}[\langle eos\rangle] \qquad (21)$

This loss function is exactly the cross-entropy loss.
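As a numerical illustration of the loss in equations (20) and (21), the sketch below sums the negative log-probabilities of the target words. The per-step distributions in `step_probs` are invented numbers, not model outputs, and the four-word vocabulary is an assumption made to keep the example small.

```python
import math

def sequence_nll(step_probs, target_ids):
    """Negative log-likelihood of the target sequence, as in equation (20)."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, target_ids))

# Hypothetical vocabulary and target sentence.
VOCAB = ["comment", "vas", "tu", "<eos>"]
target_ids = [VOCAB.index(w) for w in ["comment", "vas", "tu", "<eos>"]]

# Hypothetical decoder outputs p_1 .. p_4, each a distribution over VOCAB.
step_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.10, 0.50, 0.30],
    [0.05, 0.05, 0.10, 0.80],
]

loss = sequence_nll(step_probs, target_ids)
sentence_prob = math.exp(-loss)  # recovers the product in equation (17)
print(loss, sentence_prob)
```

Exponentiating the negated loss gives back the sentence probability, here 0.7 × 0.6 × 0.5 × 0.8, which shows that minimizing the sum of negative logs and maximizing the product are the same objective.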

5. Decoding

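This section describes the decoding process, not the decoder.

5.1 Theory

The simplest way to decode is greedy: the most probable word predicted at one step is passed as input to the next step. With this approach, however, a single mistake at one step can throw off the entire decoded sequence. To reduce this risk as much as possible (it cannot currently be eliminated), beam search is used: instead of keeping only the highest score at the current time step, we keep the $k$ highest-scoring hypotheses.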

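The set of decoding hypotheses over time steps $[1,t]$ is $H_{t}:=\{(w_{1}^{1},\cdots ,w_{t}^{1}),\cdots ,(w_{1}^{k},\cdots ,w_{t}^{k})\}$, $k$ sequences in total. The subscript denotes the time step and the superscript denotes which of the top-$k$ hypotheses the word belongs to.

How, then, is the candidate set $C_{t+1}$ at time $t+1$ obtained from $H_{t}$?

$C_{t+1}:=\bigcup_{i=1}^{k}\{(w_{1}^{i},\cdots ,w_{t}^{i},1),\cdots ,(w_{1}^{i},\cdots ,w_{t}^{i},v)\}$. This candidate set contains $k\times v$ sequences, from which the $k$ highest-scoring ones are selected as $H_{t+1}$.

Note: the words here are chosen from the vocabulary, which contains $v$ words. This is an important point, because it differs from the pointer network covered in the next article.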
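One step of this procedure, going from $H_{t}$ to $C_{t+1}$ and then to $H_{t+1}$, might look like the sketch below. It assumes that a hypothesis is scored by the sum of the log-probabilities of its words, which is a common choice but an assumption here. The names `expand`, `prune` and `toy_next_log_probs`, and the probability table, are hypothetical.

```python
import math

def expand(hypotheses, next_log_probs):
    """Build C_{t+1}: extend each hypothesis with every vocabulary word."""
    candidates = []
    for tokens, score in hypotheses:
        for word_id, lp in enumerate(next_log_probs(tokens)):
            candidates.append((tokens + (word_id,), score + lp))
    return candidates

def prune(candidates, k):
    """Build H_{t+1}: keep the k highest-scoring candidates."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

def toy_next_log_probs(tokens):
    """Hypothetical model: next-word distribution depends on the last word only."""
    table = {
        0: [0.1, 0.6, 0.3],
        1: [0.2, 0.2, 0.6],
        2: [0.3, 0.5, 0.2],
    }
    return [math.log(p) for p in table[tokens[-1]]]

# k = 2 hypotheses over a vocabulary of v = 3 word ids, with assumed scores.
H_t = [((0, 1), math.log(0.6)), ((0, 2), math.log(0.3))]

C_next = expand(H_t, toy_next_log_probs)  # k * v = 6 candidates
H_next = prune(C_next, k=2)
print(len(C_next), [tokens for tokens, _ in H_next])
```

Each hypothesis is stored as a tuple of word ids together with its running score, so extending it never modifies the parent hypothesis that other candidates still share.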

5.2 Example

Assume $k=2$ and $H_{2}:=\{(comment,vas),(comment,tu)\}$, and assume the decoder has only three words to choose from, $[comment,vas,tu]$, so $v=3$.

At $t=2$ there are therefore 2 outputs, $[vas,tu]$. At $t=3$ we treat the model as having $batch\_size=2$ and feed $[vas,tu]$ into it. The resulting $output_{3}$ has shape $(batch\_size,vob\_size)=(2,3)$, 6 entries in total, which form the candidate set:

$C_{3}=\{(comment,vas,comment),(comment,vas,vas),(comment,vas,tu)\}\cup \{(comment,tu,comment),(comment,tu,vas),(comment,tu,tu)\}$

From $output_{3}[0]$, pick the 2 words with the highest $\log(p)$ at $t=3$, and likewise pick the 2 words with the highest $\log(p)$ from $output_{3}[1]$. Together they form $\bar{C_{3}}$:

$\bar{C_{3}}=\{(comment,vas,comment),(comment,vas,tu)\}\cup \{(comment,tu,comment),(comment,tu,vas)\}$

Then, from $\bar{C_{3}}$, pick the $k=2$ sentences with the highest overall $score$, which gives $H_{3}$:

$H_{3}:=\{(comment,vas,tu),(comment,tu,vas)\}$

The $score$ is computed as follows.

The goal is to pick the $k=2$ highest-scoring sentences from $\bar{C_{3}}$. $score_{i}$ denotes the score of the $i$-th sentence in $\bar{C_{3}}$, with $i\in \{1,2,3,4\}$.