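This week I was reading about recurrent neural networks and came across a blog post with an extremely detailed derivation, so I am recording the key points here.

I strongly recommend working through the derivation by hand once. It takes some time, but it makes the line of reasoning much clearer.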

Long Short-Term Memory Networks

Recall the formula from the BPTT algorithm for propagating the error term backward through time:

\begin{aligned} δ_{k}^{T} &= δ_{t}^{T} \prod_{i = k}^{t - 1} \mathrm{diag}[f^{'}(\mathrm{net}_{i})] W \end{aligned}

Using the properties of norms, we obtain an upper bound on the norm of $δ_{k}^{T}$:

\begin{aligned} ‖δ_{k}^{T}‖ &⩽ ‖δ_{t}^{T}‖ \prod_{i = k}^{t - 1} ‖\mathrm{diag}[f^{'}(\mathrm{net}_{i})]‖ ‖W‖ \\ &⩽ ‖δ_{t}^{T}‖ (β_{f} β_{W})^{t - k} \end{aligned}

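As we can see, when the error term $δ$ is passed from time t back to time k, the upper bound on its norm is an exponential function of $β_{f} β_{W}$, where $β_{f}$ and $β_{W}$ are upper bounds on the norms of the diagonal matrix $\mathrm{diag}[f^{'}(\mathrm{net}_{i})]$ and of the matrix W, respectively. Clearly, unless $β_{f} β_{W}$ is close to 1, the bound becomes extreme when t-k is large: if the product is greater than 1 the gradient can explode, and if it is less than 1 the gradient vanishes.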
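To get a feel for how quickly the bound changes, here is a minimal sketch that evaluates the factor $(β_{f} β_{W})^{t - k}$ for a few made-up values. The function name `bound_factor` and the sample values are illustrative assumptions, not part of the original derivation.

```python
def bound_factor(beta_f: float, beta_w: float, steps: int) -> float:
    """Return (beta_f * beta_w) ** steps, the factor multiplying ||delta_t|| in the bound."""
    return (beta_f * beta_w) ** steps


for product in (0.9, 1.0, 1.1):
    # beta_f is fixed at 1.0 so that beta_f * beta_w equals `product` (illustrative choice).
    factors = [bound_factor(1.0, product, steps) for steps in (1, 10, 100)]
    print(product, factors)
```

Even a product only slightly away from 1 drives the factor toward zero or toward very large values once the number of steps reaches the hundreds.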

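To mitigate the exploding and vanishing gradient problems of RNNs, the long short-term memory network (Long Short-Term Memory, LSTM) was introduced. The hidden layer of a vanilla RNN has only one state h, which is very sensitive to short-term input. If we add another state c and let it store long-term information, we can address the vanilla RNN's inability to handle long-range dependencies.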

The newly added state c is called the cell state. [Figure: the LSTM unrolled along the time dimension]

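In the unrolled view, at time t the LSTM has three inputs: the current network input $x_{t}$, the LSTM output from the previous time step $h_{t - 1}$, and the cell state from the previous time step $c_{t - 1}$. It has two outputs: the current LSTM output $h_{t}$ and the current cell state $c_{t}$. Here $x, h, c$ are all vectors.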

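The key to the LSTM is how it controls the long-term state c. The idea is to use three control switches:

The first switch controls whether to keep preserving the long-term state c (forget gate);

The second switch controls whether to write the current state into the long-term state c (input gate);

The third switch controls whether to use the long-term state c as the current LSTM output (output gate).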

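Next, we describe in detail how the output h and the cell state c are computed.

Forward Computation of the LSTM

In the algorithm, each switch is implemented as a gate. A gate is simply a fully connected layer: its input is a vector, and its output is a real-valued vector with entries between 0 and 1. Let W be the gate's weight matrix and b its bias. The gate can then be written as: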

g (x) = σ (W x + b)

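To use a gate, we multiply its output vector elementwise with the vector we want to control. When the gate output is 0, any vector multiplied by it becomes the zero vector, so nothing passes through; when the output is 1, any vector multiplied by it is unchanged, so everything passes through. In the formula above, $σ$ is the sigmoid function, whose range is (0,1), so a gate is always partially open: never fully closed and never fully open.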
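As a minimal sketch of the gate described above, the following NumPy snippet computes $g(x) = σ(W x + b)$ and applies it elementwise to a vector. The shapes, the random weights, and the vector `v` are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


def gate(W, b, x):
    """g(x) = sigmoid(W x + b); every entry lies strictly between 0 and 1."""
    return sigmoid(W @ x + b)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))          # hypothetical shapes: 3 gate units, 4 inputs
b = np.zeros(3)
x = rng.normal(size=4)
v = np.array([10.0, -20.0, 30.0])    # the vector being controlled (made up)

g = gate(W, b, x)
gated = g * v                        # elementwise product scales each entry of v
print(g, gated)
```

Because every gate value is strictly inside (0, 1), each entry of `gated` is smaller in magnitude than the corresponding entry of `v`.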

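The LSTM uses two gates to control the contents of the cell state c. One is the forget gate, which determines how much of the previous cell state $c_{t - 1}$ is retained in the current cell state $c_{t}$; the other is the input gate, which determines how much of the current network input $x_{t}$ is saved into the cell state $c_{t}$. The LSTM uses the output gate to control how much of the cell state $c_{t}$ is passed to the current LSTM output $h_{t}$.

1. Forget gate: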

f_{t} = σ(W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f}) \qquad \text{(Eq. 1)}

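In the formula above, $W_{f}$ is the weight matrix of the forget gate, $[h_{t - 1}, x_{t}]$ denotes the concatenation of the two vectors into one longer vector, $b_{f}$ is the bias of the forget gate, and $σ$ is the sigmoid function. If the input has dimension $d_{x}$, the hidden layer has dimension $d_{h}$, and the cell state has dimension $d_{c}$ (usually $d_{c} = d_{h}$), then the forget gate's weight matrix $W_{f}$ has dimension $d_{c} \times (d_{h} + d_{x})$.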

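In fact, the weight matrix $W_{f}$ is the concatenation of two matrices: $W_{f h}$, which corresponds to the input $h_{t - 1}$ and has dimension $d_{c} \times d_{h}$, and $W_{f x}$, which corresponds to the input $x_{t}$ and has dimension $d_{c} \times d_{x}$. So $W_{f}$ can be written as: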

\begin{aligned} \begin{bmatrix} W_{f} \end{bmatrix} \begin{bmatrix} h_{t - 1} \\ x_{t} \end{bmatrix} &= \begin{bmatrix} W_{f h} & W_{f x} \end{bmatrix} \begin{bmatrix} h_{t - 1} \\ x_{t} \end{bmatrix} \\ &= W_{f h} h_{t - 1} + W_{f x} x_{t} \end{aligned}
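The block decomposition can be checked numerically. This is a small sketch with made-up dimensions and random values; the variable names mirror the symbols in the text, and the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, d_c = 3, 2, 3                    # illustrative sizes, with d_c = d_h
W_fh = rng.normal(size=(d_c, d_h))         # block acting on h_{t-1}
W_fx = rng.normal(size=(d_c, d_x))         # block acting on x_t
h_prev = rng.normal(size=d_h)
x_t = rng.normal(size=d_x)

W_f = np.hstack([W_fh, W_fx])              # shape d_c x (d_h + d_x)
concat = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]

lhs = W_f @ concat
rhs = W_fh @ h_prev + W_fx @ x_t
print(np.allclose(lhs, rhs))
```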

[Figure: computation of the forget gate]

2. Input gate:

i_{t} = σ(W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i}) \qquad \text{(Eq. 2)}

In the formula above, $W_{i}$ is the weight matrix of the input gate and $b_{i}$ is its bias.

[Figure: computation of the input gate]

Next, we compute the candidate cell state ${\tilde{c}}_{t}$, which describes the current input. It is computed from the previous output and the current input:

{\tilde{c}}_{t} = \tanh(W_{c} \cdot [h_{t - 1}, x_{t}] + b_{c}) \qquad \text{(Eq. 3)}

[Figure: computation of ${\tilde{c}}_{t}$]

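Now we compute the cell state at the current time step, $c_{t}$. We multiply the previous cell state $c_{t - 1}$ elementwise by the forget gate $f_{t}$, multiply the candidate cell state ${\tilde{c}}_{t}$ elementwise by the input gate $i_{t}$, and add the two products: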

c_{t} = f_{t} \circ c_{t - 1} + i_{t} \circ {\tilde{c}}_{t} \qquad \text{(Eq. 4)}

The symbol $\circ$ denotes elementwise multiplication. [Figure: computation of $c_{t}$]

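In this way, the LSTM combines its current memory ${\tilde{c}}_{t}$ with its long-term memory $c_{t - 1}$ to form the new cell state $c_{t}$. Thanks to the forget gate, it can preserve information from long ago; thanks to the input gate, it can keep currently irrelevant content out of memory.

3. Output gate: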

o_{t} = σ(W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o}) \qquad \text{(Eq. 5)}

[Figure: computation of the output gate]

The final output of the LSTM is determined jointly by the output gate and the cell state:

h_{t} = o_{t} \circ \tanh(c_{t}) \qquad \text{(Eq. 6)}

[Figure: computation of the final LSTM output]

Eqs. 1 through 6 are all of the formulas for the LSTM forward computation.
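Putting the six forward formulas together, one time step might be sketched in NumPy as follows. The `params` dictionary, its key names, the helper `init_params`, and the random initialization are assumptions made for illustration; only the arithmetic inside `lstm_forward_step` comes from the formulas above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward_step(x_t, h_prev, c_prev, params):
    """One forward step; `params` is a hypothetical dict holding W_* and b_*."""
    hx = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ hx + params["b_f"])        # Eq. 1, forget gate
    i_t = sigmoid(params["W_i"] @ hx + params["b_i"])        # Eq. 2, input gate
    c_tilde = np.tanh(params["W_c"] @ hx + params["b_c"])    # Eq. 3, candidate state
    c_t = f_t * c_prev + i_t * c_tilde                       # Eq. 4, new cell state
    o_t = sigmoid(params["W_o"] @ hx + params["b_o"])        # Eq. 5, output gate
    h_t = o_t * np.tanh(c_t)                                 # Eq. 6, output
    return h_t, c_t


def init_params(d_x, d_h, seed=0):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params["W_" + name] = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
        params["b_" + name] = np.zeros(d_h)
    return params


d_x, d_h = 4, 3
params = init_params(d_x, d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in np.random.default_rng(1).normal(size=(5, d_x)):   # a made-up 5-step sequence
    h, c = lstm_forward_step(x_t, h, c, params)
print(h, c)
```

Since $h_{t}$ is a sigmoid value times a tanh value, every entry of `h` stays strictly inside (-1, 1), while `c` is not bounded in the same way.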

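Training the LSTM

Training is more involved than the forward computation. The derivation is given below.

Framework of the LSTM Training Algorithm

The LSTM is still trained with backpropagation, which consists of three main steps: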

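Forward pass: compute the output value of each neuron. For the LSTM, these are the five vectors $f_{t}, i_{t}, c_{t}, o_{t}, h_{t}$;
Backward pass: compute the error term $δ$ of each neuron. As with the RNN, LSTM error terms propagate in two directions: backward through time, computing the error term at each time step starting from the current time t, and down to the previous layer;
Compute the gradient of each weight from the corresponding error terms.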

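Notes on Formulas and Notation

In the derivation that follows, the gate activation function is the sigmoid and the output activation function is tanh. The functions and their derivatives are: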

\begin{aligned} σ(z) &= y = \frac{1}{1 + e^{- z}} \\ σ^{'}(z) &= y (1 - y) \\ \tanh(z) &= y = \frac{e^{z} - e^{- z}}{e^{z} + e^{- z}} \\ \tanh^{'}(z) &= 1 - y^{2} \end{aligned}

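As the formulas show, the derivatives of sigmoid and tanh are both functions of the original function's value, so once the function value has been computed, the derivative follows at almost no extra cost.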
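The property that both derivatives can be computed from the function value alone can be checked against a central finite difference. This is a small sketch; the helper names and the sample grid `z` are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad_from_output(y):
    """Derivative of sigmoid expressed through its output y = sigmoid(z)."""
    return y * (1.0 - y)


def tanh_grad_from_output(y):
    """Derivative of tanh expressed through its output y = tanh(z)."""
    return 1.0 - y ** 2


z = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
# Central finite differences as an independent numerical reference.
num_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

print(np.allclose(sigmoid_grad_from_output(sigmoid(z)), num_sig, atol=1e-6))
print(np.allclose(tanh_grad_from_output(np.tanh(z)), num_tanh, atol=1e-6))
```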

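The LSTM has 8 groups of parameters to learn. The two parts of each weight matrix use different formulas in backpropagation, so they are listed separately:

The forget gate's weight matrix $W_{f}$ and bias $b_{f}$; $W_{f}$ is split into the two matrices $W_{f h}$ and $W_{f x}$
The input gate's weight matrix $W_{i}$ and bias $b_{i}$; $W_{i}$ is split into the two matrices $W_{i h}$ and $W_{i x}$
The output gate's weight matrix $W_{o}$ and bias $b_{o}$; $W_{o}$ is split into the two matrices $W_{o h}$ and $W_{o x}$
The cell-state weight matrix $W_{c}$ and bias $b_{c}$; $W_{c}$ is split into the two matrices $W_{c h}$ and $W_{c x}$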

The elementwise multiplication symbol $\circ$. When $\circ$ acts on two vectors, the operation is:

a \circ b = \begin{bmatrix} a_{1} \\ a_{2} \\ a_{3} \\ \vdots \\ a_{n} \end{bmatrix} \circ \begin{bmatrix} b_{1} \\ b_{2} \\ b_{3} \\ \vdots \\ b_{n} \end{bmatrix} = \begin{bmatrix} a_{1} b_{1} \\ a_{2} b_{2} \\ a_{3} b_{3} \\ \vdots \\ a_{n} b_{n} \end{bmatrix}

When $\circ$ acts on a vector and a matrix, the operation is:

\begin{aligned} a \circ X &= \begin{bmatrix} a_{1} \\ a_{2} \\ a_{3} \\ \vdots \\ a_{n} \end{bmatrix} \circ \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1 n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2 n} \\ x_{31} & x_{32} & x_{33} & \cdots & x_{3 n} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{n 1} & x_{n 2} & x_{n 3} & \cdots & x_{n n} \end{bmatrix} \\ &= \begin{bmatrix} a_{1} x_{11} & a_{1} x_{12} & a_{1} x_{13} & \cdots & a_{1} x_{1 n} \\ a_{2} x_{21} & a_{2} x_{22} & a_{2} x_{23} & \cdots & a_{2} x_{2 n} \\ a_{3} x_{31} & a_{3} x_{32} & a_{3} x_{33} & \cdots & a_{3} x_{3 n} \\ \vdots & \vdots & \vdots & & \vdots \\ a_{n} x_{n 1} & a_{n} x_{n 2} & a_{n} x_{n 3} & \cdots & a_{n} x_{n n} \end{bmatrix} \end{aligned}

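When $\circ$ acts on two matrices, the elements at corresponding positions are multiplied. Elementwise multiplication can simplify matrix and vector operations in some cases.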

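For example, when a matrix is multiplied on the left by a diagonal matrix, the result equals the vector formed by the diagonal multiplied elementwise with that matrix: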

\mathrm{diag}[a] X = a \circ X

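When a row vector multiplies a diagonal matrix from the left, the result equals that row vector multiplied elementwise by the vector formed by the matrix's diagonal (with the product read as a row vector):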

a^{T} \mathrm{diag}[b] = a \circ b
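Both identities are easy to confirm numerically. In this sketch the arrays are random and 1-D NumPy arrays stand in for both row and column vectors, which is an assumption of the illustration rather than part of the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=4)
b = rng.normal(size=4)
X = rng.normal(size=(4, 4))

left = np.diag(a) @ X            # diag[a] X
right = a[:, None] * X           # a ∘ X: row i of X is scaled by a_i

row = a @ np.diag(b)             # a^T diag[b]
elementwise = a * b              # a ∘ b

print(np.allclose(left, right), np.allclose(row, elementwise))
```

In practice this is why the derivation can replace products with diagonal Jacobians by cheap elementwise products.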

At time t, the output of the LSTM is $h_{t}$. We define the error term at time t, $δ_{t}$, as:

δ_{t} \overset{\mathrm{def}}{=} \frac{\partial E}{\partial h_{t}}

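Note that here the error term is defined as the derivative of the loss function with respect to the output value, not with respect to the weighted input $\mathrm{net}_{t}^{l}$. The LSTM has four weighted inputs, corresponding to $f_{t}, i_{t}, {\tilde{c}}_{t}, o_{t}$, and we want to pass a single error term to the previous layer rather than four. We still need to define these four weighted inputs and their corresponding error terms: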

\begin{aligned} \mathrm{net}_{f, t} &= W_{f} [h_{t - 1}, x_{t}] + b_{f} \\ &= W_{f h} h_{t - 1} + W_{f x} x_{t} + b_{f} \\ \mathrm{net}_{i, t} &= W_{i} [h_{t - 1}, x_{t}] + b_{i} \\ &= W_{i h} h_{t - 1} + W_{i x} x_{t} + b_{i} \\ \mathrm{net}_{\tilde{c}, t} &= W_{c} [h_{t - 1}, x_{t}] + b_{c} \\ &= W_{c h} h_{t - 1} + W_{c x} x_{t} + b_{c} \\ \mathrm{net}_{o, t} &= W_{o} [h_{t - 1}, x_{t}] + b_{o} \\ &= W_{o h} h_{t - 1} + W_{o x} x_{t} + b_{o} \\ δ_{f, t} &\overset{\mathrm{def}}{=} \frac{\partial E}{\partial \mathrm{net}_{f, t}} \\ δ_{i, t} &\overset{\mathrm{def}}{=} \frac{\partial E}{\partial \mathrm{net}_{i, t}} \\ δ_{\tilde{c}, t} &\overset{\mathrm{def}}{=} \frac{\partial E}{\partial \mathrm{net}_{\tilde{c}, t}} \\ δ_{o, t} &\overset{\mathrm{def}}{=} \frac{\partial E}{\partial \mathrm{net}_{o, t}} \end{aligned}

Backpropagating the Error Term Through Time

Propagating the error term backward through time means computing the error term at time t-1, $δ_{t - 1}$.

\begin{aligned} δ_{t - 1}^{T} &= \frac{\partial E}{\partial h_{t - 1}} \\ &= \frac{\partial E}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t - 1}} \\ &= δ_{t}^{T} \frac{\partial h_{t}}{\partial h_{t - 1}} \end{aligned}

Here $\frac{\partial h_{t}}{\partial h_{t - 1}}$ is a Jacobian matrix. To compute it, we need the formulas for $h_{t}$, namely Eq. 6 and Eq. 4 above:

\begin{aligned} h_{t} &= o_{t} \circ \tanh(c_{t}) \qquad \text{(Eq. 6)} \\ c_{t} &= f_{t} \circ c_{t - 1} + i_{t} \circ {\tilde{c}}_{t} \qquad \text{(Eq. 4)} \end{aligned}

Clearly, $o_{t}, f_{t}, i_{t}, {\tilde{c}}_{t}$ are all functions of $h_{t - 1}$, so the total derivative formula gives:

\begin{aligned} δ_{t}^{T} \frac{\partial h_{t}}{\partial h_{t - 1}} &= δ_{t}^{T} \frac{\partial h_{t}}{\partial o_{t}} \frac{\partial o_{t}}{\partial \mathrm{net}_{o, t}} \frac{\partial \mathrm{net}_{o, t}}{\partial h_{t - 1}} + δ_{t}^{T} \frac{\partial h_{t}}{\partial c_{t}} \frac{\partial c_{t}}{\partial f_{t}} \frac{\partial f_{t}}{\partial \mathrm{net}_{f, t}} \frac{\partial \mathrm{net}_{f, t}}{\partial h_{t - 1}} \\ &\quad + δ_{t}^{T} \frac{\partial h_{t}}{\partial c_{t}} \frac{\partial c_{t}}{\partial i_{t}} \frac{\partial i_{t}}{\partial \mathrm{net}_{i, t}} \frac{\partial \mathrm{net}_{i, t}}{\partial h_{t - 1}} + δ_{t}^{T} \frac{\partial h_{t}}{\partial c_{t}} \frac{\partial c_{t}}{\partial {\tilde{c}}_{t}} \frac{\partial {\tilde{c}}_{t}}{\partial \mathrm{net}_{\tilde{c}, t}} \frac{\partial \mathrm{net}_{\tilde{c}, t}}{\partial h_{t - 1}} \\ &= δ_{o, t}^{T} \frac{\partial \mathrm{net}_{o, t}}{\partial h_{t - 1}} + δ_{f, t}^{T} \frac{\partial \mathrm{net}_{f, t}}{\partial h_{t - 1}} + δ_{i, t}^{T} \frac{\partial \mathrm{net}_{i, t}}{\partial h_{t - 1}} + δ_{\tilde{c}, t}^{T} \frac{\partial \mathrm{net}_{\tilde{c}, t}}{\partial h_{t - 1}} \qquad \text{(Eq. 7)} \end{aligned}

Next, we need to compute each partial derivative in Eq. 7. From Eq. 6 we obtain:

\begin{aligned} \frac{\partial h_{t}}{\partial o_{t}} &= \mathrm{diag}[\tanh(c_{t})] \\ \frac{\partial h_{t}}{\partial c_{t}} &= \mathrm{diag}[o_{t} \circ (1 - \tanh(c_{t})^{2})] \end{aligned}

From Eq. 4 we obtain:

\begin{aligned} \frac{\partial c_{t}}{\partial f_{t}} &= \mathrm{diag}[c_{t - 1}] \\ \frac{\partial c_{t}}{\partial i_{t}} &= \mathrm{diag}[{\tilde{c}}_{t}] \\ \frac{\partial c_{t}}{\partial {\tilde{c}}_{t}} &= \mathrm{diag}[i_{t}] \end{aligned}

Because:

\begin{aligned} o_{t} &= σ(\mathrm{net}_{o, t}) \\ \mathrm{net}_{o, t} &= W_{o h} h_{t - 1} + W_{o x} x_{t} + b_{o} \\ f_{t} &= σ(\mathrm{net}_{f, t}) \\ \mathrm{net}_{f, t} &= W_{f h} h_{t - 1} + W_{f x} x_{t} + b_{f} \\ i_{t} &= σ(\mathrm{net}_{i, t}) \\ \mathrm{net}_{i, t} &= W_{i h} h_{t - 1} + W_{i x} x_{t} + b_{i} \\ {\tilde{c}}_{t} &= \tanh(\mathrm{net}_{\tilde{c}, t}) \\ \mathrm{net}_{\tilde{c}, t} &= W_{c h} h_{t - 1} + W_{c x} x_{t} + b_{c} \end{aligned}

we obtain:

\begin{aligned} \frac{\partial o_{t}}{\partial \mathrm{net}_{o, t}} &= \mathrm{diag}[o_{t} \circ (1 - o_{t})] \\ \frac{\partial \mathrm{net}_{o, t}}{\partial h_{t - 1}} &= W_{o h} \\ \frac{\partial f_{t}}{\partial \mathrm{net}_{f, t}} &= \mathrm{diag}[f_{t} \circ (1 - f_{t})] \\ \frac{\partial \mathrm{net}_{f, t}}{\partial h_{t - 1}} &= W_{f h} \\ \frac{\partial i_{t}}{\partial \mathrm{net}_{i, t}} &= \mathrm{diag}[i_{t} \circ (1 - i_{t})] \\ \frac{\partial \mathrm{net}_{i, t}}{\partial h_{t - 1}} &= W_{i h} \\ \frac{\partial {\tilde{c}}_{t}}{\partial \mathrm{net}_{\tilde{c}, t}} &= \mathrm{diag}[1 - {\tilde{c}}_{t}^{2}] \\ \frac{\partial \mathrm{net}_{\tilde{c}, t}}{\partial h_{t - 1}} &= W_{c h} \end{aligned}

Substituting these partial derivatives into Eq. 7 gives:

\begin{aligned} δ_{t - 1}^{T} &= δ_{o, t}^{T} \frac{\partial \mathrm{net}_{o, t}}{\partial h_{t - 1}} + δ_{f, t}^{T} \frac{\partial \mathrm{net}_{f, t}}{\partial h_{t - 1}} + δ_{i, t}^{T} \frac{\partial \mathrm{net}_{i, t}}{\partial h_{t - 1}} + δ_{\tilde{c}, t}^{T} \frac{\partial \mathrm{net}_{\tilde{c}, t}}{\partial h_{t - 1}} \\ &= δ_{o, t}^{T} W_{o h} + δ_{f, t}^{T} W_{f h} + δ_{i, t}^{T} W_{i h} + δ_{\tilde{c}, t}^{T} W_{c h} \qquad \text{(Eq. 8)} \end{aligned}

From the definitions of $δ_{o, t}, δ_{f, t}, δ_{i, t}, δ_{\tilde{c}, t}$, we have:

\begin{aligned} δ_{o, t}^{T} &= δ_{t}^{T} \circ \tanh(c_{t}) \circ o_{t} \circ (1 - o_{t}) \qquad \text{(Eq. 9)} \\ δ_{f, t}^{T} &= δ_{t}^{T} \circ o_{t} \circ (1 - \tanh(c_{t})^{2}) \circ c_{t - 1} \circ f_{t} \circ (1 - f_{t}) \qquad \text{(Eq. 10)} \\ δ_{i, t}^{T} &= δ_{t}^{T} \circ o_{t} \circ (1 - \tanh(c_{t})^{2}) \circ {\tilde{c}}_{t} \circ i_{t} \circ (1 - i_{t}) \qquad \text{(Eq. 11)} \\ δ_{\tilde{c}, t}^{T} &= δ_{t}^{T} \circ o_{t} \circ (1 - \tanh(c_{t})^{2}) \circ i_{t} \circ (1 - {\tilde{c}}_{t}^{2}) \qquad \text{(Eq. 12)} \end{aligned}

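Eqs. 8 through 12 are the formulas for propagating the error backward by one time step. With them, we can write the formula for passing the error term back to any time k. The product is shorthand for applying Eq. 8 once per time step from t back to k, with the gate error terms recomputed at each step:

δ_{k}^{T} = \prod_{j = k}^{t - 1} (δ_{o, j}^{T} W_{o h} + δ_{f, j}^{T} W_{f h} + δ_{i, j}^{T} W_{i h} + δ_{\tilde{c}, j}^{T} W_{c h}) \qquad \text{(Eq. 13)}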
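The one-step formulas (Eqs. 8 to 12) can be sanity-checked against a numerical gradient. The sketch below assumes a toy loss $E = w^{T} h_{t}$ (so that $δ_{t} = w$) and holds $c_{t - 1}$ fixed, as the derivation above does; the parameter dictionary `p`, the cache layout, and all function names are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(h_prev, x_t, c_prev, p):
    """Forward step (Eqs. 1-6); returns h_t and the values needed for the backward step."""
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(p["W_f"] @ hx + p["b_f"])
    i = sigmoid(p["W_i"] @ hx + p["b_i"])
    g = np.tanh(p["W_c"] @ hx + p["b_c"])        # candidate state c~_t
    o = sigmoid(p["W_o"] @ hx + p["b_o"])
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, dict(f=f, i=i, g=g, o=o, c=c, c_prev=c_prev)


def backward_step(delta_t, cache, p, d_h):
    """Gate error terms (Eqs. 9-12) and delta_{t-1} (Eq. 8)."""
    f, i, g, o, c, c_prev = (cache[k] for k in ("f", "i", "g", "o", "c", "c_prev"))
    tanh_c = np.tanh(c)
    d_c = delta_t * o * (1.0 - tanh_c ** 2)      # factor shared by Eqs. 10-12
    d_o = delta_t * tanh_c * o * (1.0 - o)       # Eq. 9
    d_f = d_c * c_prev * f * (1.0 - f)           # Eq. 10
    d_i = d_c * g * i * (1.0 - i)                # Eq. 11
    d_g = d_c * i * (1.0 - g ** 2)               # Eq. 12
    # Eq. 8: the first d_h columns of each W_* are the W_*h blocks acting on h_{t-1}.
    delta_prev = (d_o @ p["W_o"][:, :d_h] + d_f @ p["W_f"][:, :d_h]
                  + d_i @ p["W_i"][:, :d_h] + d_g @ p["W_c"][:, :d_h])
    return delta_prev, dict(o=d_o, f=d_f, i=d_i, g=d_g)


rng = np.random.default_rng(0)
d_h, d_x = 3, 2
p = {}
for name in ("f", "i", "c", "o"):
    p["W_" + name] = rng.normal(scale=0.5, size=(d_h, d_h + d_x))
    p["b_" + name] = rng.normal(scale=0.1, size=d_h)
h_prev, c_prev, w = rng.normal(size=d_h), rng.normal(size=d_h), rng.normal(size=d_h)
x_t = rng.normal(size=d_x)

h_t, cache = forward(h_prev, x_t, c_prev, p)
delta_prev, gate_deltas = backward_step(w, cache, p, d_h)

# Numerical gradient of E = w . h_t with respect to h_{t-1}, with c_{t-1} held fixed.
eps = 1e-6
numeric = np.zeros(d_h)
for j in range(d_h):
    step = np.zeros(d_h)
    step[j] = eps
    e_plus = w @ forward(h_prev + step, x_t, c_prev, p)[0]
    e_minus = w @ forward(h_prev - step, x_t, c_prev, p)[0]
    numeric[j] = (e_plus - e_minus) / (2 * eps)

print(np.allclose(delta_prev, numeric, atol=1e-6))
```

Repeating `backward_step` from time t down to time k, feeding each `delta_prev` back in as the next `delta_t`, is one way to read Eq. 13.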

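Passing the Error Term to the Previous Layer

Suppose the current layer is layer $l$. We define the error term of layer $l - 1$ as the derivative of the error function with respect to the weighted input of layer $l - 1$: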

δ_{t}^{l - 1} \overset{\mathrm{def}}{=} \frac{\partial E}{\partial \mathrm{net}_{t}^{l - 1}}

The input $x_{t}$ of the current LSTM layer is computed by:

x_{t}^{l} = f^{l - 1}(\mathrm{net}_{t}^{l - 1})

In the formula above, $f^{l - 1}$ denotes the activation function of layer $l - 1$.

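Because $\mathrm{net}_{f, t}^{l}, \mathrm{net}_{i, t}^{l}, \mathrm{net}_{\tilde{c}, t}^{l}, \mathrm{net}_{o, t}^{l}$ are all functions of $x_{t}$, and $x_{t}$ is in turn a function of $\mathrm{net}_{t}^{l - 1}$, computing the derivative of $E$ with respect to $\mathrm{net}_{t}^{l - 1}$ requires the total derivative formula: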

\begin{aligned} \frac{\partial E}{\partial \mathrm{net}_{t}^{l - 1}} &= \frac{\partial E}{\partial \mathrm{net}_{f, t}^{l}} \frac{\partial \mathrm{net}_{f, t}^{l}}{\partial x_{t}^{l}} \frac{\partial x_{t}^{l}}{\partial \mathrm{net}_{t}^{l - 1}} + \frac{\partial E}{\partial \mathrm{net}_{i, t}^{l}} \frac{\partial \mathrm{net}_{i, t}^{l}}{\partial x_{t}^{l}} \frac{\partial x_{t}^{l}}{\partial \mathrm{net}_{t}^{l - 1}} \\ &\quad + \frac{\partial E}{\partial \mathrm{net}_{\tilde{c}, t}^{l}} \frac{\partial \mathrm{net}_{\tilde{c}, t}^{l}}{\partial x_{t}^{l}} \frac{\partial x_{t}^{l}}{\partial \mathrm{net}_{t}^{l - 1}} + \frac{\partial E}{\partial \mathrm{net}_{o, t}^{l}} \frac{\partial \mathrm{net}_{o, t}^{l}}{\partial x_{t}^{l}} \frac{\partial x_{t}^{l}}{\partial \mathrm{net}_{t}^{l - 1}} \\ &= δ_{f, t}^{T} W_{f x} \circ f^{'}(\mathrm{net}_{t}^{l - 1}) + δ_{i, t}^{T} W_{i x} \circ f^{'}(\mathrm{net}_{t}^{l - 1}) + δ_{\tilde{c}, t}^{T} W_{c x} \circ f^{'}(\mathrm{net}_{t}^{l - 1}) + δ_{o, t}^{T} W_{o x} \circ f^{'}(\mathrm{net}_{t}^{l - 1}) \\ &= (δ_{f, t}^{T} W_{f x} + δ_{i, t}^{T} W_{i x} + δ_{\tilde{c}, t}^{T} W_{c x} + δ_{o, t}^{T} W_{o x}) \circ f^{'}(\mathrm{net}_{t}^{l - 1}) \qquad \text{(Eq. 14)} \end{aligned}

Eq. 14 is the formula for passing the error to the previous layer.
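As a rough sketch of Eq. 14, the function below sums the four gate contributions and multiplies elementwise by the derivative of the lower layer's activation. The dictionary keys, the function name, and the tiny hand-picked numbers are assumptions chosen so that the result can be verified by hand.

```python
import numpy as np


def delta_to_lower_layer(deltas, W_x, f_prime_net):
    """Eq. 14: (sum over gates of delta_g^T W_gx) multiplied elementwise by f'(net^{l-1})."""
    total = sum(deltas[g] @ W_x[g] for g in ("f", "i", "c", "o"))
    return total * f_prime_net


d_c, d_x = 2, 3                                                 # illustrative sizes
deltas = {g: np.ones(d_c) for g in ("f", "i", "c", "o")}        # made-up gate error terms
W_x = {g: np.ones((d_c, d_x)) for g in ("f", "i", "c", "o")}    # made-up W_gx blocks
f_prime_net = np.array([1.0, 0.5, 0.0])                         # made-up f'(net^{l-1})

delta_lower = delta_to_lower_layer(deltas, W_x, f_prime_net)
print(delta_lower)
```

Each gate contributes a row vector of twos here, so the sum is a vector of eights before the elementwise scaling by `f_prime_net`.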

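Computing the Weight Gradients

For the weights $W_{f h}, W_{i h}, W_{c h}, W_{o h}$, we know that each gradient is the sum of the gradients at every time step. We first compute their gradients at time t, and then their final gradients.

We have already obtained the error terms $δ_{o, t}, δ_{f, t}, δ_{i, t}, δ_{\tilde{c}, t}$, so the gradients with respect to $W_{o h}, W_{f h}, W_{i h}, W_{c h}$ at time t follow easily: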

\begin{aligned} \frac{\partial E}{\partial W_{o h, t}} &= \frac{\partial E}{\partial \mathrm{net}_{o, t}} \frac{\partial \mathrm{net}_{o, t}}{\partial W_{o h, t}} \\ &= δ_{o, t} h_{t - 1}^{T} \\ \frac{\partial E}{\partial W_{f h, t}} &= \frac{\partial E}{\partial \mathrm{net}_{f, t}} \frac{\partial \mathrm{net}_{f, t}}{\partial W_{f h, t}} \\ &= δ_{f, t} h_{t - 1}^{T} \\ \frac{\partial E}{\partial W_{i h, t}} &= \frac{\partial E}{\partial \mathrm{net}_{i, t}} \frac{\partial \mathrm{net}_{i, t}}{\partial W_{i h, t}} \\ &= δ_{i, t} h_{t - 1}^{T} \\ \frac{\partial E}{\partial W_{c h, t}} &= \frac{\partial E}{\partial \mathrm{net}_{\tilde{c}, t}} \frac{\partial \mathrm{net}_{\tilde{c}, t}}{\partial W_{c h, t}} \\ &= δ_{\tilde{c}, t} h_{t - 1}^{T} \end{aligned}

Summing the gradients over all time steps gives the final gradients:

\begin{aligned} \frac{\partial E}{\partial W_{o h}} &= \sum_{j = 1}^{t} δ_{o, j} h_{j - 1}^{T} \\ \frac{\partial E}{\partial W_{f h}} &= \sum_{j = 1}^{t} δ_{f, j} h_{j - 1}^{T} \\ \frac{\partial E}{\partial W_{i h}} &= \sum_{j = 1}^{t} δ_{i, j} h_{j - 1}^{T} \\ \frac{\partial E}{\partial W_{c h}} &= \sum_{j = 1}^{t} δ_{\tilde{c}, j} h_{j - 1}^{T} \end{aligned}

For the gradients of the biases $b_{f}, b_{i}, b_{c}, b_{o}$, we first compute the bias gradient at each time step:

\begin{aligned} \frac{\partial E}{\partial b_{o, t}} &= \frac{\partial E}{\partial \mathrm{net}_{o, t}} \frac{\partial \mathrm{net}_{o, t}}{\partial b_{o, t}} \\ &= δ_{o, t} \\ \frac{\partial E}{\partial b_{f, t}} &= \frac{\partial E}{\partial \mathrm{net}_{f, t}} \frac{\partial \mathrm{net}_{f, t}}{\partial b_{f, t}} \\ &= δ_{f, t} \\ \frac{\partial E}{\partial b_{i, t}} &= \frac{\partial E}{\partial \mathrm{net}_{i, t}} \frac{\partial \mathrm{net}_{i, t}}{\partial b_{i, t}} \\ &= δ_{i, t} \\ \frac{\partial E}{\partial b_{c, t}} &= \frac{\partial E}{\partial \mathrm{net}_{\tilde{c}, t}} \frac{\partial \mathrm{net}_{\tilde{c}, t}}{\partial b_{c, t}} \\ &= δ_{\tilde{c}, t} \end{aligned}

Summing the bias gradients over all time steps:

\begin{aligned} \frac{\partial E}{\partial b_{o}} &= \sum_{j = 1}^{t} δ_{o, j} \\ \frac{\partial E}{\partial b_{i}} &= \sum_{j = 1}^{t} δ_{i, j} \\ \frac{\partial E}{\partial b_{f}} &= \sum_{j = 1}^{t} δ_{f, j} \\ \frac{\partial E}{\partial b_{c}} &= \sum_{j = 1}^{t} δ_{\tilde{c}, j} \end{aligned}

For the gradients of the weights $W_{f x}, W_{i x}, W_{c x}, W_{o x}$, we only need to compute them directly from the corresponding error terms:

\begin{aligned} \frac{\partial E}{\partial W_{o x}} &= \frac{\partial E}{\partial \mathrm{net}_{o, t}} \frac{\partial \mathrm{net}_{o, t}}{\partial W_{o x}} \\ &= δ_{o, t} x_{t}^{T} \\ \frac{\partial E}{\partial W_{f x}} &= \frac{\partial E}{\partial \mathrm{net}_{f, t}} \frac{\partial \mathrm{net}_{f, t}}{\partial W_{f x}} \\ &= δ_{f, t} x_{t}^{T} \\ \frac{\partial E}{\partial W_{i x}} &= \frac{\partial E}{\partial \mathrm{net}_{i, t}} \frac{\partial \mathrm{net}_{i, t}}{\partial W_{i x}} \\ &= δ_{i, t} x_{t}^{T} \\ \frac{\partial E}{\partial W_{c x}} &= \frac{\partial E}{\partial \mathrm{net}_{\tilde{c}, t}} \frac{\partial \mathrm{net}_{\tilde{c}, t}}{\partial W_{c x}} \\ &= δ_{\tilde{c}, t} x_{t}^{T} \end{aligned}

These are all of the formulas for the LSTM training algorithm.
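For a single gate, the gradient formulas above reduce to outer products and sums. The sketch below follows them literally: the $W_{\cdot h}$ and bias gradients are summed over time, while the $W_{\cdot x}$ gradient is taken at the final step t only, as written in the text. The function name and the hand-picked numbers are assumptions.

```python
import numpy as np


def accumulate_gate_gradients(deltas, hs_prev, xs):
    """deltas[j], hs_prev[j], xs[j] hold delta_{g,j}, h_{j-1}, x_j for one gate g."""
    grad_W_h = sum(np.outer(d, h) for d, h in zip(deltas, hs_prev))  # sum_j delta_j h_{j-1}^T
    grad_b = sum(deltas)                                             # sum_j delta_j
    grad_W_x = np.outer(deltas[-1], xs[-1])                          # delta_t x_t^T at step t
    return grad_W_h, grad_b, grad_W_x


# Two made-up time steps, small enough to verify by hand.
deltas = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
hs_prev = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
xs = [np.array([1.0]), np.array([2.0])]

grad_W_h, grad_b, grad_W_x = accumulate_gate_gradients(deltas, hs_prev, xs)
print(grad_W_h)
print(grad_b)
print(grad_W_x)
```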

GRU

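What we described above is a standard LSTM. In fact, the LSTM has many variants, and the GRU (Gated Recurrent Unit) is one of the most successful. It simplifies the LSTM considerably while retaining comparable performance.

The GRU makes two major changes to the LSTM: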

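It replaces the input, forget, and output gates with two gates: the update gate $z_{t}$ and the reset gate $r_{t}$.
It merges the cell state and the output into a single state: $h$.

The forward computation of the GRU is: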

\begin{aligned} z_{t} &= σ(W_{z} \cdot [h_{t - 1}, x_{t}]) \\ r_{t} &= σ(W_{r} \cdot [h_{t - 1}, x_{t}]) \\ {\tilde{h}}_{t} &= \tanh(W \cdot [r_{t} \circ h_{t - 1}, x_{t}]) \\ h_{t} &= (1 - z_{t}) \circ h_{t - 1} + z_{t} \circ {\tilde{h}}_{t} \end{aligned}

[Figure: schematic of the GRU]
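The four GRU formulas translate almost line by line into NumPy. This is a minimal sketch without bias terms, matching the formulas as written; the function signature, the shapes, and the all-zero weights in the example are assumptions made so the output is easy to verify.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_forward_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step; each weight matrix has shape d_h x (d_h + d_x)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                     # update gate
    r_t = sigmoid(W_r @ hx)                                     # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # new state h_t


d_h, d_x = 3, 2
zeros = np.zeros((d_h, d_h + d_x))        # all-zero weights: both gates equal 0.5
h_prev = np.array([1.0, -2.0, 4.0])
h_t = gru_forward_step(np.ones(d_x), h_prev, zeros, zeros, zeros)
print(h_t)
```

With zero weights the candidate state is zero and the update gate is 0.5, so the new state is exactly half of the previous one.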
