Parts of this article are quoted from https://zybuluo.com/hanbingtao/note/541458

1. Why RNN

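Recurrent neural networks

An RNN can be used to model language. A language model does the following: given the first part of a sentence, it predicts the most likely next word.

In theory, an RNN can look back (or ahead) over arbitrarily many words.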

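2. RNN structure

2.1 The most basic structure:

$x_{t-1},x_t,x_{t+1}$ are consecutive words of an input sentence, $o_{t-1},o_t,o_{t+1}$ are the output probabilities for the corresponding words, and s is the hidden-layer neuron (the hidden state).

$U,V,W$ are weight matrices; f and g are activation functions.
$\begin{aligned} \mathrm{o}_t&=g(Vs_t)\\ s_t&=f(Ux_t+Ws_{t-1}) \end{aligned}$
After the network receives the input $x_t$ at time t, the hidden layer takes the value $s_t$ and the output is $o_t$. The key point is that $s_t$ depends not only on $x_t$ but also on the previous hidden state $s_{t-1}$, and through it on $x_{t-1}$ and all earlier inputs.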

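Unrolled, this becomes:
$\begin{aligned} \mathrm{o}_t&=g(Vs_t)\\ &=g(Vf(Ux_t+Ws_{t-1}))\\ &=g(Vf(Ux_t+Wf(Ux_{t-1}+Ws_{t-2})))\\ &=g(Vf(Ux_t+Wf(Ux_{t-1}+Wf(Ux_{t-2}+\dots)))) \end{aligned}$
W is the same at every layer (time step), and so is U.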
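The recurrence above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that the text does not fix at this point: f is taken to be tanh and g softmax (the choices used in the training section), each $x_t$ is a one-hot vector given by its word index, and the function name `rnn_forward` and the toy sizes are made up for the example.

```python
import numpy as np

def rnn_forward(x_indices, U, V, W):
    """Forward pass of the basic RNN, assuming f = tanh and g = softmax.

    x_indices: list of word indices (each x_t is treated as a one-hot vector).
    Returns o with shape (T, vocab) and s with shape (T + 1, hidden).
    """
    T = len(x_indices)
    hidden_dim = U.shape[0]
    s = np.zeros((T + 1, hidden_dim))  # the extra last row stays zero and acts as s_{-1}
    o = np.zeros((T, V.shape[0]))
    for t in range(T):
        # U @ one_hot(x_t) is simply column x_t of U; the same U, W, V are reused at every step
        s[t] = np.tanh(U[:, x_indices[t]] + W @ s[t - 1])
        q = V @ s[t]
        e = np.exp(q - q.max())
        o[t] = e / e.sum()
    return o, s

rng = np.random.default_rng(0)
vocab, hidden = 5, 3  # hypothetical toy sizes
U = rng.normal(scale=0.5, size=(hidden, vocab))
V = rng.normal(scale=0.5, size=(vocab, hidden))
W = rng.normal(scale=0.5, size=(hidden, hidden))

o, s = rnn_forward([0, 3, 1, 4], U, V, W)
# Changing only the first word changes the last output, because s_t carries the history.
o_changed, _ = rnn_forward([2, 3, 1, 4], U, V, W)
```

Indexing `s[t - 1]` at `t = 0` reads the last row of `s`, which is never written and so serves as the zero initial state.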

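The backpropagation walkthrough later in this article uses this basic structure.

(2.2 Adding bidirectional recurrence)

-> bidirectional recurrent neural network

The difference is that the output $o_t$ depends not only on the forward neuron (position $A_t$) but also on the neuron computed in the reverse direction (position $A_t^{'}$).
$\begin{aligned} \mathrm{o}_t&=g(Vs_t+V^{'}s_t^{'})\\ s_t&=f(Ux_t+Ws_{t-1})\\ s_t^{'}&=f(U^{'}x_t+W^{'}s_{t+1}^{'}) \end{aligned}$
Here $s_t$ and $s_t^{'}$ are the values at positions $A_t$ and $A_t^{'}$. The forward and backward directions do not share weights.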
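A bidirectional forward pass can be sketched as two independent recurrences, one running left to right and one right to left, whose states are combined at the output. This is a minimal illustration, assuming tanh and softmax activations and one-hot inputs; the names `birnn_forward`, `U_b`, `W_b`, and `V_b` (for the primed matrices) are hypothetical.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def birnn_forward(x_indices, U, W, V, U_b, W_b, V_b):
    """Outputs of a bidirectional RNN; *_b are the backward-direction weights."""
    T, hidden = len(x_indices), U.shape[0]
    s_f = np.zeros((T + 1, hidden))  # s_f[-1] (row T) is the zero state before t = 0
    s_b = np.zeros((T + 1, hidden))  # s_b[T] is the zero state after t = T - 1
    for t in range(T):               # forward direction uses s_{t-1}
        s_f[t] = np.tanh(U[:, x_indices[t]] + W @ s_f[t - 1])
    for t in reversed(range(T)):     # backward direction uses s'_{t+1}
        s_b[t] = np.tanh(U_b[:, x_indices[t]] + W_b @ s_b[t + 1])
    return np.array([softmax(V @ s_f[t] + V_b @ s_b[t]) for t in range(T)])

rng = np.random.default_rng(1)
vocab, hidden = 5, 3  # hypothetical toy sizes
params = [rng.normal(scale=0.5, size=shape) for shape in
          [(hidden, vocab), (hidden, hidden), (vocab, hidden)] * 2]

o_bi = birnn_forward([0, 3, 1, 4], *params)
# Unlike the one-directional network, the first output now also reacts to the last word.
o_bi_changed = birnn_forward([0, 3, 1, 2], *params)
```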

(2.3 Adding multiple layers)

(that is, the yellow part grows from 1 layer of neurons to 3 layers) -> deep recurrent neural network

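3. Training

Backpropagation through time (BPTT)

We run backpropagation on the most basic structure, the one described in 2.1.

3.0 Setup

The whole network has three parameters, $V,W,U$. The derivations for $W$ and $U$ are very similar, so we mainly derive $V$ and $W$ and only comment briefly on U.

This section draws on Recurrent Neural Networks Tutorial, Part 3 and a PDF.

The PDF uses Einstein summation, which is simple: the summation sign is omitted, as follows
$\frac{\partial E_t}{\partial V_{ij}}=\sum_m \frac{\partial E_t}{\partial O_{t_m}} \frac{\partial O_{t_m}}{\partial V_{ij}}= \frac{\partial E_t}{\partial O_{t_m}} \frac{\partial O_{t_m}}{\partial V_{ij}}$
Here m is a dummy index that appears twice in the product, so the sum over m is implied and its symbol can be dropped. That is Einstein summation.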
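NumPy's `np.einsum` uses the same convention, which makes for a quick sanity check: an index that appears in the inputs but not in the output is summed over. The arrays below are random placeholders standing in for the two partial derivatives, not values from a real network.

```python
import numpy as np

rng = np.random.default_rng(0)
dE_dO = rng.normal(size=4)          # stands in for dE_t/dO_{t_m}, indexed [m]
dO_dV = rng.normal(size=(4, 4, 3))  # stands in for dO_{t_m}/dV_{ij}, indexed [m, i, j]

# Explicit sum over the dummy index m
explicit = sum(dE_dO[m] * dO_dV[m] for m in range(4))

# Einstein summation: m is repeated and absent from the output, so it is summed implicitly
implicit = np.einsum("m,mij->ij", dE_dO, dO_dV)
```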

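The derivations below do not use Einstein summation, to keep them easier to follow, although the notation is indeed more concise.
Dimensions of each variable:

$\begin{aligned} V&:m\times n\\ x_t&:m\times 1\\ s_t&:n\times 1\\ U&:n\times m\\ W&:n\times n\\ y&:m\times 1\quad\text{true label}\\ \hat{y}&:m\times 1\quad\text{predicted probabilities} \end{aligned}$

The error is:
$E=\sum_t E_t$
We differentiate each error term separately and then add the results.
The sequence length is $T$, and t runs from 0 to $T-1$.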
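To make the dimensions and the summed error concrete, here is a small sketch. It assumes the per-step error $E_t$ is the cross-entropy between the one-hot label and a softmax output, with a tanh hidden layer, as in the derivation of section 3.1; the sizes and word indices are arbitrary.

```python
import numpy as np

m, n, T = 6, 4, 3  # vocabulary size, hidden size, sequence length (toy values)
rng = np.random.default_rng(1)
U = rng.normal(scale=0.1, size=(n, m))
V = rng.normal(scale=0.1, size=(m, n))
W = rng.normal(scale=0.1, size=(n, n))
x_idx = [2, 0, 5]  # input word indices
y_idx = [0, 5, 1]  # target word indices

s_prev = np.zeros(n)
E = 0.0
for t in range(T):  # t runs from 0 to T - 1
    x_t = np.eye(m)[x_idx[t]]  # one-hot, m entries
    y_t = np.eye(m)[y_idx[t]]  # one-hot, m entries
    s_t = np.tanh(U @ x_t + W @ s_prev)  # n entries
    q_t = V @ s_t                        # m entries
    y_hat = np.exp(q_t - q_t.max())
    y_hat /= y_hat.sum()
    E += -np.sum(y_t * np.log(y_hat))    # E = sum_t E_t
    s_prev = s_t
```

With weights this small the predictions are close to uniform, so E lands near $T\log m$.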

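3.1 Derivative with respect to V

Equations:
$\begin{aligned} E_t&=-\sum_k (y_{t_k}\log \hat{y_{t_k}})\\ \hat{y_t}&=\mathrm{softmax}(q_t)\\ q_t&=Vs_t \end{aligned}$
Differentiate with respect to $V_{ij}$:
$\frac{\partial E_t}{\partial V_{ij}}=\sum_k \sum_l (\frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial V_{ij}}) \quad (*)$
First factor of (*):
$\frac{\partial E_t}{\partial \hat{y_{t_k}}}=-y_{t_k}*\frac{1}{\hat{y_{t_k}}}$
Second factor of (*), the derivative of softmax:
$\frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}}=\begin{cases} \hat{y_{t_k}}(1-\hat{y_{t_k}}) & k=l\\ -\hat{y_{t_k}}\hat{y_{t_l}} & k\neq l \end{cases}$
Combining the first two factors:
$\begin{aligned} \sum_k \frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}} &= -\frac{y_{t_l}}{\hat{y_{t_l}}}\hat{y_{t_l}}(1-\hat{y_{t_l}})+\sum_{k\neq l}(-\frac{y_{t_k}}{\hat{y_{t_k}}})(-\hat{y_{t_k}}\hat{y_{t_l}})\\ &=-y_{t_l}+y_{t_l}\hat{y_{t_l}}+\sum_{k\neq l}y_{t_k}\hat{y_{t_l}}\\ &=-y_{t_l}+\hat{y_{t_l}}\sum_k y_{t_k}\\ &=\hat{y_{t_l}}-y_{t_l} \quad (**) \end{aligned}$
The last step uses $\sum_k y_{t_k}=1$, which holds because $y_t$ is one-hot.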
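The combined result, that the gradient of the cross-entropy with respect to the softmax input $q_t$ is $\hat{y_t}-y_t$, is easy to confirm numerically. The sketch below compares it against central finite differences on a random $q$; the vector size and the label position are arbitrary.

```python
import numpy as np

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

def cross_entropy(q, y):
    """E_t = -sum_k y_k log(softmax(q))_k"""
    return -np.sum(y * np.log(softmax(q)))

rng = np.random.default_rng(2)
q = rng.normal(size=5)
y = np.eye(5)[3]  # one-hot label

analytic = softmax(q) - y  # y_hat - y

eps = 1e-6
numeric = np.array([
    (cross_entropy(q + eps * np.eye(5)[l], y)
     - cross_entropy(q - eps * np.eye(5)[l], y)) / (2 * eps)
    for l in range(5)
])
```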
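Third factor of (*):
$\begin{aligned} \frac{\partial q_{t_l}}{\partial V_{ij}}&=\frac{\partial}{\partial V_{ij}}\sum_m V_{lm}s_{t_m}\\ &=\delta_{li}s_{t_j} \quad (***) \end{aligned}$
where $\delta_{li}$ is the Kronecker delta (1 if $l=i$, otherwise 0).
Combining (**), the merged first two factors $\hat{y_{t_l}}-y_{t_l}$, with (***):
$\begin{aligned} \frac{\partial E_t}{\partial V_{ij}}&=\sum_l (\hat{y_{t_l}}-y_{t_l})\delta_{li}s_{t_j}\\ &=(\hat{y_{t_i}}-y_{t_i})s_{t_j} \end{aligned}$
Therefore:
$\frac{\partial E_t}{\partial V}=(\hat{y_{t}}-y_t)\otimes s_t$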
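As a check on the outer-product formula, the sketch below compares $(\hat{y_t}-y_t)\otimes s_t$ with a finite-difference gradient of $E_t$ with respect to every entry of $V$. It assumes the softmax output and cross-entropy error used in this section; `step_loss` is a hypothetical helper name and all values are random.

```python
import numpy as np

def step_loss(V, s_t, y_t):
    """E_t = -sum_k y_k log(softmax(V s_t))_k for a single time step."""
    q = V @ s_t
    e = np.exp(q - q.max())
    return -np.sum(y_t * np.log(e / e.sum()))

rng = np.random.default_rng(3)
m, n = 5, 3
V = rng.normal(size=(m, n))
s_t = np.tanh(rng.normal(size=n))
y_t = np.eye(m)[1]

q = V @ s_t
e = np.exp(q - q.max())
y_hat = e / e.sum()
analytic = np.outer(y_hat - y_t, s_t)  # (y_hat - y) outer s_t, shape m x n

eps = 1e-6
numeric = np.zeros_like(V)
for i, j in np.ndindex(*V.shape):
    V_plus, V_minus = V.copy(), V.copy()
    V_plus[i, j] += eps
    V_minus[i, j] -= eps
    numeric[i, j] = (step_loss(V_plus, s_t, y_t) - step_loss(V_minus, s_t, y_t)) / (2 * eps)
```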

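3.2 Derivative with respect to W

Equations:
$\begin{aligned} E_t&=-\sum_k (y_{t_k}\log \hat{y_{t_k}})\\ \hat{y_t}&=\mathrm{softmax}(q_t)\\ q_t&=Vs_t\\ s_t&=\tanh(Ux_t+Ws_{t-1}) \end{aligned}$
As with $V_{ij}$, differentiate with respect to $W_{ij}$:
$\frac{\partial E_t}{\partial W_{ij}}=\sum_k \sum_l \sum_m(\frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial s_{t_m}} \frac{\partial s_{t_m}}{\partial W_{ij}} ) \quad (*)$

First two factors of (*), exactly as in 3.1:
$\sum_k \frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}}=\hat{y_{t_l}}-y_{t_l} \quad (**)$
Third factor of (*):
$\begin{aligned} \frac{\partial q_{t_l}}{\partial s_{t_m}}&=\frac{\partial}{\partial s_{t_m}}\sum_b V_{lb}s_{t_b}\\ &=V_{lm} \quad (***) \end{aligned}$
Fourth factor of (*): ($s_{t_m}$ depends on $s_0$ through $s_{t-1}$, because $s_t=\tanh(Ux_t+Ws_{t-1})$)
$\frac{\partial s_{t_m}}{\partial W_{ij}}=\sum_{r=0}^t \sum_n \frac{\partial s_{t_m}}{\partial s_{r_n}} \frac{\partial s_{r_n}}{\partial W_{ij}} \quad (****)$
On the right-hand side, $\frac{\partial s_{r_n}}{\partial W_{ij}}$ is the immediate derivative, taken with $s_{r-1}$ held fixed.
So (*) can be written as:
$\frac{\partial E_t}{\partial W_{ij}}=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m[ V_{lm} \sum_{r=0}^t \sum_n (\frac{\partial s_{t_m}}{\partial s_{r_n}} \frac{\partial s_{r_n}}{\partial W_{ij}})]\}$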

3.2.0 Code:

The derivation above leads to the following backpropagation code:

where:
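$\begin{aligned} o&:\hat{y_t}\\ s&:s_t\\ \mathrm{delta\_o}&:\hat{y_t}-y_t\\ \mathrm{dLdU},\ \mathrm{dLdV},\ \mathrm{dLdW}&:\frac{\partial E}{\partial U},\ \frac{\partial E}{\partial V},\ \frac{\partial E}{\partial W} \end{aligned}$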

import numpy as np

def bptt(self, x, y):
    # x and y are sequences of word indices (inputs and targets)
    T = len(y)
    # Perform forward propagation.
    # o[t] is y_hat_t; s[t] is s_t, and s[-1] is the zero initial state
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    # delta_o[t] = y_hat_t - y_t (note: this overwrites o in place)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]: # t:(T-1)->0
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation: dL/dz
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]: # bptt_step:t->...
            # Add to gradients at each previous step
            dLdW += np.outer(delta_t, s[bptt_step-1])
            # x is one-hot, so only column x[bptt_step] of dLdU changes
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step dL/dz at t-1
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]
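A common way to gain confidence in BPTT code is a numerical gradient check. The sketch below is a standalone version, not the class the method above belongs to: `forward`, `total_loss`, `bptt_full`, and `numerical_grad` are hypothetical helper names, the backward pass is untruncated (every step back to 0), and the network is assumed to use tanh, softmax, and cross-entropy as in the derivation.

```python
import numpy as np

def forward(x, U, V, W):
    T = len(x)
    s = np.zeros((T + 1, U.shape[0]))  # s[-1] (last row) stays zero: the initial state
    o = np.zeros((T, V.shape[0]))
    for t in range(T):
        s[t] = np.tanh(U[:, x[t]] + W @ s[t - 1])
        q = V @ s[t]
        e = np.exp(q - q.max())
        o[t] = e / e.sum()
    return o, s

def total_loss(x, y, U, V, W):
    o, _ = forward(x, U, V, W)
    return -np.sum(np.log(o[np.arange(len(y)), y]))  # E = sum_t E_t

def bptt_full(x, y, U, V, W):
    """Untruncated BPTT with the same update order as the method above."""
    T = len(y)
    o, s = forward(x, U, V, W)
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    delta_o = o.copy()
    delta_o[np.arange(T), y] -= 1.0  # y_hat_t - y_t
    for t in reversed(range(T)):
        dV += np.outer(delta_o[t], s[t])
        delta_t = V.T @ delta_o[t] * (1 - s[t] ** 2)
        for step in range(t, -1, -1):
            dW += np.outer(delta_t, s[step - 1])
            dU[:, x[step]] += delta_t
            delta_t = W.T @ delta_t * (1 - s[step - 1] ** 2)
    return dU, dV, dW

def numerical_grad(param, loss_fn, eps=1e-5):
    """Central differences; perturbs param in place and restores each entry."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(*param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = loss_fn()
        param[idx] = old - eps
        minus = loss_fn()
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(5)
vocab, hidden = 6, 4  # toy sizes
U = rng.normal(scale=0.5, size=(hidden, vocab))
V = rng.normal(scale=0.5, size=(vocab, hidden))
W = rng.normal(scale=0.5, size=(hidden, hidden))
x, y = [0, 2, 5, 1], [2, 5, 1, 3]

dU, dV, dW = bptt_full(x, y, U, V, W)
loss_fn = lambda: total_loss(x, y, U, V, W)
max_err = max(np.abs(g - numerical_grad(p, loss_fn)).max()
              for g, p in [(dU, U), (dV, V), (dW, W)])
```

If the analytic and numerical gradients disagree by much more than the finite-difference error, the backward pass has a bug. With truncation enabled the two are expected to differ, since truncated BPTT deliberately drops the older terms.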

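3.2.1 Explaining delta_t

In the code, dLdW += np.outer(delta_t, s[bptt_step-1]) implements equation (****), the sum over r in the fourth factor $\frac{\partial s_{t_m}}{\partial W_{ij}}$. The first term ($r=t$) and the later terms are handled separately, one per pass of the inner loop.

The details follow.

First term of (****) ($r=t$):

$\begin{aligned} \sum_n \frac{\partial s_{t_m}}{\partial s_{t_n}} \frac{\partial s_{t_n}}{\partial W_{ij}}&=\sum_n \delta_{mn}\frac{\partial s_{t_n}}{\partial W_{ij}}\\ &=(1-s_{t_m}^2)\frac{\partial}{\partial W_{ij}}\sum_b W_{mb}s_{{t-1}_b}\\ &=(1-s_{t_m}^2)\delta_{mi}s_{{t-1}_j} \end{aligned}$

Combining it with (**) and (***), the first term of $\frac{\partial E_t}{\partial W_{ij}}$ is:
$\begin{aligned} [\frac{\partial E_t}{\partial W_{ij}}]_1&=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m[V_{lm}(1-s_{t_m}^2)\delta_{mi}s_{{t-1}_j}]\}\\ &=\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}(1-s_{t_i}^2)s_{{t-1}_j} \end{aligned}$
Here $\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$ is the inner product of the $i$-th column of V with $(\hat{y_{t}}-y_{t})$ (the code computes it as the transpose of V times delta_o).

delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2)) implements $(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$, and np.outer(delta_t, s[bptt_step-1]) supplies the remaining factor $s_{{t-1}_j}$.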
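The correspondence between the index form of the first term and the two vectorized lines can be checked directly. In this sketch all arrays are random placeholders: `delta_o` stands in for $\hat{y_t}-y_t$, and `s_t`, `s_prev` for $s_t$ and $s_{t-1}$.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 3
V = rng.normal(size=(m, n))
delta_o = rng.normal(size=m)          # placeholder for y_hat_t - y_t
s_t = np.tanh(rng.normal(size=n))     # placeholder for s_t
s_prev = np.tanh(rng.normal(size=n))  # placeholder for s_{t-1}

# Index form: sum_l (delta_o[l] * V[l, i]) * (1 - s_t[i]^2) * s_prev[j]
first_term = np.zeros((n, n))
for i in range(n):
    col_i = sum(delta_o[l] * V[l, i] for l in range(m))  # i-th column of V dotted with delta_o
    for j in range(n):
        first_term[i, j] = col_i * (1 - s_t[i] ** 2) * s_prev[j]

# Vectorized form, as in the bptt code
delta_t = V.T.dot(delta_o) * (1 - s_t ** 2)
vectorized = np.outer(delta_t, s_prev)
```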

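Second term of (****) ($r=t-1$):

First we need:
$\begin{aligned} \frac{\partial s_{t_m}}{\partial s_{{t-1}_n}}&=\frac{\partial}{\partial s_{{t-1}_n}}\tanh(\sum_a U_{ma}x_{t_a}+\sum_b W_{mb}s_{{t-1}_b})\\ &=(1-s_{t_m}^2)W_{mn} \end{aligned}$
Then the second term is:
$\begin{aligned} \sum_n \frac{\partial s_{t_m}}{\partial s_{{t-1}_n}} \frac{\partial s_{{t-1}_n}}{\partial W_{ij}}&=\sum_n (1-s_{t_m}^2)W_{mn}(1-s_{{t-1}_n}^2)\delta_{ni}s_{{t-2}_j}\\ &=(1-s_{t_m}^2)W_{mi}(1-s_{{t-1}_i}^2)s_{{t-2}_j} \end{aligned}$
Following the same steps as for the first term and combining with (**) and (***), the second term of $\frac{\partial E_t}{\partial W_{ij}}$ is:
$[\frac{\partial E_t}{\partial W_{ij}}]_2=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m [V_{lm} (1-s_{t_m}^2)W_{mi} ]\}(1-s_{{t-1}_i}^2)s_{{t-2}_j}$

The factor $s_{{t-2}_j}$ is supplied by the code dLdW += np.outer(delta_t, s[bptt_step-1]), executed on the second pass of the inner loop, where bptt_step is t-1.
Next we explain why the rest is implemented by delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2), executed at the end of the first pass, where bptt_step is t.

2.1 It is easy to see that $(1-s_{{t-1}_i}^2)$ corresponds to the code (1 - s[bptt_step-1] ** 2).

2.2 Why, then, can $\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m [V_{lm} (1-s_{t_m}^2)W_{mi} ]\}$ be obtained by multiplying the previous delta_t by W?

Look at the i-th element of the first delta_t: $(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$

The k-th element of self.W.T.dot(delta_t) is the dot product of the k-th column of W with delta, that is
$\begin{aligned} \sum_{d=1}^n (W_{dk}\,\mathrm{delta}_d)&=\sum_{d=1}^n W_{dk}(1-s_{t_d}^2)\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{ld}\}\\ &=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_{d=1}^n [V_{ld}(1-s_{t_d}^2)W_{dk}]\} \end{aligned}$
which is the expression in 2.2 with d and k renamed to m and i.

Third term of (****) ($r=t-2$):
$[\frac{\partial E_t}{\partial W_{ij}}]_3=\sum_l \{(\hat{y_{t_l}}-y_{t_l})\sum_m [V_{lm}(1-s_{t_m}^2)\sum_n W_{mn}(1-s_{{t-1}_n}^2)W_{ni}]\}(1-s_{{t-2}_i}^2)s_{{t-3}_j}$
It can likewise be obtained by multiplying the previous step's delta by W; the proof is similar.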
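The recursion can be verified numerically for the second term: writing out the full index expression with `np.einsum` should give the same matrix as applying the delta update once and taking the outer product with $s_{t-2}$. All arrays below are random placeholders, and the names `s_tm1` and `s_tm2` (for $s_{t-1}$ and $s_{t-2}$) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 5, 3
V = rng.normal(size=(m, n))
W = rng.normal(size=(n, n))
delta_o = rng.normal(size=m)         # placeholder for y_hat_t - y_t
s_t = np.tanh(rng.normal(size=n))    # s_t
s_tm1 = np.tanh(rng.normal(size=n))  # s_{t-1}
s_tm2 = np.tanh(rng.normal(size=n))  # s_{t-2}

# Index form of the second term:
# sum_l delta_o[l] sum_m V[l,m] (1 - s_t[m]^2) W[m,i] * (1 - s_tm1[i]^2) * s_tm2[j]
second_term = np.einsum("l,lm,m,mi,i,j->ij",
                        delta_o, V, 1 - s_t ** 2, W, 1 - s_tm1 ** 2, s_tm2)

# Same quantity via the recursion in the bptt code
delta_t = V.T.dot(delta_o) * (1 - s_t ** 2)       # first pass (bptt_step = t)
delta_t = W.T.dot(delta_t) * (1 - s_tm1 ** 2)     # update at the end of the first pass
recursive = np.outer(delta_t, s_tm2)              # second pass (bptt_step = t - 1)
```

Each further step back in time adds one more multiplication by the transpose of W and one more tanh-derivative factor, which is why the inner loop only needs to carry `delta_t` forward.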

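3.3 Derivative with respect to U

This is very similar to W.

Equations:
$\begin{aligned} E_t&=-\sum_k (y_{t_k}\log \hat{y_{t_k}})\\ \hat{y_t}&=\mathrm{softmax}(q_t)\\ q_t&=Vs_t\\ s_t&=\tanh(Ux_t+Ws_{t-1}) \end{aligned}$
As with $V_{ij}$ and $W_{ij}$, differentiate with respect to $U_{ij}$:
$\frac{\partial E_t}{\partial U_{ij}}=\sum_k \sum_l \sum_m(\frac{\partial E_t}{\partial \hat{y_{t_k}}} \frac{\partial \hat{y_{t_k}}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial s_{t_m}} \frac{\partial s_{t_m}}{\partial U_{ij}} ) \quad (*)$
We only need to look at the fourth factor:
$\frac{\partial s_{t_m}}{\partial U_{ij}}=\sum_{r=0}^t \sum_n \frac{\partial s_{t_m}}{\partial s_{r_n}} \frac{\partial s_{r_n}}{\partial U_{ij}}$
whose first term ($r=t$) is
$(1-s_{t_m}^2)\frac{\partial}{\partial U_{ij}}\sum_a U_{ma}x_{t_a}=(1-s_{t_m}^2)\delta_{mi}x_{t_j}$
This is essentially the same as the first term of $\frac{\partial s_{t_m}}{\partial W_{ij}}$, except that the final factor is $x_{t_j}$ instead of $s_{{t-1}_j}$.

So the first term of $\frac{\partial E_t}{\partial U_{ij}}$ is:
$[\frac{\partial E_t}{\partial U_{ij}}]_1=\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}(1-s_{t_i}^2)x_{t_j}$
The later terms follow the same recursion as for W, with $x_{r_j}$ as the final factor at step r.
delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2)) implements $(1-s_{t_i}^2) *\sum_l \{(\hat{y_{t_l}}-y_{t_l}) V_{li}\}$.

dLdU[:,x[bptt_step]] += delta_t implements the factor $x_{t_j}$. Because $x_t$ is one-hot, its entries are only 0 or 1, and x[bptt_step] is the index of the nonzero entry, so it is enough to add delta_t to that one column of dLdU.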
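The column update is just the outer product of delta_t with a one-hot vector, written without the wasted multiplications by zero. A small sketch with random placeholder values, where `word` is a hypothetical word index playing the role of x[bptt_step]:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 6  # hidden size, vocabulary size (toy values)
U = rng.normal(size=(n, m))
delta_t = rng.normal(size=n)

word = 4              # index of the current word
x_t = np.zeros(m)
x_t[word] = 1.0       # the same word as a one-hot vector

# Forward pass: multiplying by a one-hot vector selects one column of U
selected = U[:, word]
multiplied = U @ x_t

# Backward pass: the outer product with a one-hot vector touches only that column
dLdU_outer = np.outer(delta_t, x_t)
dLdU_column = np.zeros((n, m))
dLdU_column[:, word] += delta_t
```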
