Backpropagation: the BP algorithm

Backpropagation passes information backward through the network, from the output side toward the input side, computing the gradient of the loss with respect to each parameter. It is an error-driven algorithm: guided by the direction given by the gradient of the loss function, it keeps stepping forward, continually reducing the error until a local optimum is reached.

This article walks through the iterative computation of the BP algorithm in detail. It does not prove the underlying formulas and theorems; the goal is to show how the algorithm runs, that is, the "how to do it" question.

Before running

First, a quick list of ANN-related concepts: input layer / input neurons, hidden layer / hidden neurons, output layer / output neurons, weights, biases, activation function, and hyperparameters (settings such as the number of layers and the learning rate, which are chosen by hand rather than learned during training).

Next, the sample data and the network structure used in this worked example.

Sample data:

X1 X2 Y
0.5 2.5 1
2 1 0
3.14 2.11 0

Network structure: a 2-2-1 network with two inputs X1 and X2, two hidden neurons, and one output neuron. Weights w1 and w2 connect X1 and X2 to the first hidden neuron, w3 and w4 connect them to the second hidden neuron, and w5 and w6 connect the two hidden outputs to the output neuron. The bias b1 is shared by the two hidden neurons, and b2 is the output neuron's bias.

Initial parameters:

Weights: w1=w2=w3=w4=w5=w6=0.5  Biases: b1=b2=0.5

Activation function: S(x)=\frac{1}{1+e^{-x}} (the sigmoid function)

Learning rate: \eta =0.5

Number of iterations: 100000

 

Normalizing the sample data:

(X1,X2)/10=([0.05,0.2,0.314],[0.25,0.1,0.211])
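In code, this normalization step looks like the following (a small sketch; the array names are my own):

import numpy as np

# raw samples: columns X1 and X2, with targets Y
X1_raw = np.array([0.5, 2.0, 3.14])
X2_raw = np.array([2.5, 1.0, 2.11])
Y = np.array([1, 0, 0])

# normalization used in this example: divide the inputs by 10
X1 = X1_raw / 10   # [0.05, 0.2, 0.314]
X2 = X2_raw / 10   # [0.25, 0.1, 0.211]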

Forward propagation

net1=w1*X1+w2*X2+b1=\bigl(\begin{smallmatrix} 0.5*0.05\\ 0.5*0.2\\ 0.5*0.314 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5*0.25\\ 0.5*0.1\\ 0.5*0.211 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5\\ 0.5\\ 0.5 \end{smallmatrix}\bigr)=\bigl(\begin{smallmatrix} 0.65\\ 0.65\\ 0.7625 \end{smallmatrix}\bigr)

net2=w3*X1+w4*X2+b1=\bigl(\begin{smallmatrix} 0.5*0.05\\ 0.5*0.2\\ 0.5*0.314 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5*0.25\\ 0.5*0.1\\ 0.5*0.211 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5\\ 0.5\\ 0.5 \end{smallmatrix}\bigr)=\bigl(\begin{smallmatrix} 0.65\\ 0.65\\ 0.7625 \end{smallmatrix}\bigr)

 out1=sigmoid(net1)=\bigl(\begin{smallmatrix} 0.65701046\\ 0.65701046\\ 0.68189626\end{smallmatrix}\bigr)

out2=sigmoid(net2)=\bigl(\begin{smallmatrix} 0.65701046\\ 0.65701046\\ 0.68189626\end{smallmatrix}\bigr)

\begin{align*} r1&=w5*out1+w6*out2+b2 \\ \qquad&=\bigl(\begin{smallmatrix} 0.5*0.65701046\\ 0.5*0.65701046\\ 0.5*0.68189626 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5*0.65701046\\ 0.5*0.65701046\\ 0.5*0.68189626 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5\\ 0.5\\ 0.5 \end{smallmatrix}\bigr)=\bigl(\begin{smallmatrix} 1.15701046\\1.15701046\\ 1.18189626 \end{smallmatrix}\bigr) \end{align*}

 

 t1=sigmoid(r1)=\bigl(\begin{smallmatrix} 0.76078908\\ 0.76078908\\ 0.76528859\end{smallmatrix}\bigr)

This completes the forward propagation through the network.
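As a quick check, the forward pass above can be reproduced with a few lines of NumPy (a minimal sketch whose variable names mirror the notation used here):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# normalized inputs and initial parameters of the example
X1 = np.array([0.05, 0.2, 0.314])
X2 = np.array([0.25, 0.1, 0.211])
w1 = w2 = w3 = w4 = w5 = w6 = 0.5
b1 = b2 = 0.5

net1 = w1 * X1 + w2 * X2 + b1       # [0.65, 0.65, 0.7625]
net2 = w3 * X1 + w4 * X2 + b1       # identical to net1 here
out1, out2 = sigmoid(net1), sigmoid(net2)
r1 = w5 * out1 + w6 * out2 + b2
t1 = sigmoid(r1)
print(t1)                           # approximately [0.7608, 0.7608, 0.7653]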

Backward pass: computing the partial derivatives

Theoretical basis of the algorithm

From the forward pass we obtain an output, and clearly there is an error between this actual output and the expected output. For a relatively fixed input we can therefore construct a function C with C(w,b)=(f(w,b)-Y)^{2}, where f denotes the forward-propagation computation of the network. Our goal can then be stated as: for the input X, find w and b such that C(w,b) reaches a (local) minimum.

So how do we find the minimum of C(w,b)? We need two things: the gradient and the chain rule.
In multivariable calculus, taking the partial derivative of a function with respect to each of its parameters and writing those partial derivatives as a vector gives the gradient. Geometrically, the gradient points in the direction in which the function increases fastest, so moving in the opposite direction makes the function value decrease quickly. Therefore, after each forward pass produces an output, we can take the partial derivatives of C(w,b) and adjust the variables w and b in the direction opposite to the gradient.
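As a tiny illustration of this idea (my own example, not part of the original): minimizing C(w) = (w - 3)^2 by repeatedly stepping against the gradient C'(w) = 2(w - 3) drives w toward 3.

w, eta = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)   # gradient of C(w) = (w - 3)**2
    w = w - eta * grad   # step in the direction opposite to the gradient
print(w)                 # close to 3, where C attains its minimum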
So how do we compute the partial derivatives of C(w,b) with respect to w and b? For that we need the chain rule:

\frac{\partial e}{\partial a}=\frac{\partial e}{\partial c}\cdot \frac{\partial c}{\partial a} \qquad \& \qquad \frac{\partial e}{\partial b}=\frac{\partial e}{\partial c}\cdot \frac{\partial c}{\partial b} +\frac{\partial e}{\partial d}\cdot \frac{\partial d}{\partial b}

(In the formula above, both c and d depend on b, so when differentiating e with respect to b we must differentiate through c and through d separately and then add the two contributions.)
Using the chain rule, we can compute the gradient of C(w,b) with respect to w and b.
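For instance (a small illustration of the second rule, not part of the original derivation): let e=c\cdot d with c=a+b and d=b^{2}. Then

\frac{\partial e}{\partial b}=\frac{\partial e}{\partial c}\cdot \frac{\partial c}{\partial b}+\frac{\partial e}{\partial d}\cdot \frac{\partial d}{\partial b}=d\cdot 1+c\cdot 2b=3b^{2}+2ab

which agrees with differentiating e=ab^{2}+b^{3} directly.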

If the network is complex and contains many w's and b's, computing the partial derivatives of C(w,b) with respect to all of them directly would be very expensive. The BP algorithm instead starts from the output layer and works backward, reusing the results computed at each layer, which greatly reduces the amount of computation. Concretely, BP first computes the partial derivatives for the output-layer w and b. It then moves one layer back to the hidden layer: using the hidden-to-output computation, the partial results already obtained for the output layer, and the chain rule, it obtains the partial derivatives for the hidden-layer w and b. If there is yet another layer behind, its derivatives in turn reuse the results of the layer in front of it. In this way each layer's error derivatives are reused by the layer behind it, so BP proceeds layer by layer backward and saves a great deal of differentiation work.

A brief derivation of the formulas

(Note: even a simple network involves many parameters, so the derivation below does not rigorously use sub- and superscripts to identify which parameter each symbol refers to. The derivation is not the focus of this article, and while full indexing would be more rigorous, it would also be much harder to read, so it is omitted here.)

Output layer:

Define the error function as E=\frac{1}{2}(Y-t1)^{2} (the factor \frac{1}{2} simply cancels the 2 when differentiating). By the chain rule, \frac{\partial E}{\partial w}=\frac{\partial E}{\partial t1}\cdot \frac{\partial t1}{\partial r1} \cdot \frac{\partial r1}{\partial w} \qquad \& \qquad \frac{\partial E}{\partial b}=\frac{\partial E}{\partial t1}\cdot \frac{\partial t1}{\partial r1} \cdot \frac{\partial r1}{\partial b}

Therefore:

\frac{\partial E}{\partial t1}=-(Y-t1)\quad \& \quad \frac{\partial t1}{\partial r1}=t1(1-t1)\quad \& \quad \frac{\partial r1}{\partial w}=out1\quad \& \quad \frac{\partial r1}{\partial b}=1
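Multiplying these factors gives the output-layer gradients in closed form (spelled out here; they are exactly the expressions evaluated numerically in the worked computation below):

\frac{\partial E}{\partial w5}=-(Y-t1)\cdot t1(1-t1)\cdot out1 \qquad \& \qquad \frac{\partial E}{\partial b2}=-(Y-t1)\cdot t1(1-t1)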

Hidden layer:

(Note: if a hidden-layer node connects to several nodes on its right, then the errors at all of those connected nodes are affected by this hidden node, and every affected error term must be differentiated through it, i.e.:

If E1, E2, ..., E(i) are all affected by w, then:

\begin{align*} \frac{\partial E}{\partial w}&=\frac{\partial E1}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w} +\frac{\partial E2}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}\\ \qquad&=(\frac{\partial E1}{\partial out}+\frac{\partial E2}{\partial out}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}\\ \qquad&=(\frac{\partial E1}{\partial t1}\cdot \frac{\partial t1}{\partial r1}\cdot \frac{\partial r1}{\partial out} +\cdot \cdot \cdot +\frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot \frac{\partial r(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w} \end{align*}
Similarly:

\begin{align*} \frac{\partial E}{\partial b}&=\frac{\partial E1}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b} +\frac{\partial E2}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b}\\ \qquad&=(\frac{\partial E1}{\partial out}+\frac{\partial E2}{\partial out}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b}\\ \qquad&=(\frac{\partial E1}{\partial t1}\cdot \frac{\partial t1}{\partial r1}\cdot \frac{\partial r1}{\partial out} +\cdot \cdot \cdot +\frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot \frac{\partial r(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b} \end{align*}

Listing these equations side by side shows that the error's partial derivatives with respect to the parameters have the same form in the output layer and in the hidden layer, which gives the following observation:

\frac{\partial t1}{\partial r1} and \frac{\partial out}{\partial net} are both the derivative of the activation function, each evaluated at its own output.
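Since the activation here is the sigmoid, its derivative can be written in terms of its own output, which is why the factors t1(1-t1) and out(1-out) appear throughout:

S^{'}(x)=S(x)\left ( 1-S(x) \right )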

It follows that:

Output layer:

∂E/∂w = -(expected output - actual output) * (derivative of the activation function) * out

∂E/∂b = -(expected output - actual output) * (derivative of the activation function)

Hidden layer:

From the derivation above:

\frac{\partial E}{\partial w}=\Bigl(\sum_{i} \frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot w_{i}\Bigr)\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}

\frac{\partial E}{\partial b}=\Bigl(\sum_{i} \frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot w_{i}\Bigr)\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b}

where \frac{\partial net}{\partial b}=1 and w_{i} is the weight connecting this hidden node to output node i.

So define δ = (-(expected output - actual output)) * (derivative of the activation function), i.e. \delta =\frac{\partial E}{\partial t}\cdot \frac{\partial t}{\partial r}. Then:

Output layer, partial derivatives of the error with respect to the parameters:

∂E/∂w = (the hidden-layer output that w multiplies) * δ

∂E/∂b = δ

Hidden layer, partial derivatives of the error with respect to the parameters:

∂E/∂w = (the previous-layer output that w multiplies, here the input-layer value) * (weighted sum of the δ of every node in the layer to the right) * (derivative of the activation function)

∂E/∂b = (weighted sum of the δ of every node in the layer to the right) * (derivative of the activation function)

With these partial-derivative formulas, and the gradient direction discussed earlier, the update rules for the parameters w and b are:

w^{'}=w-\eta \frac{\partial E}{\partial w} \quad \& \quad b^{'}=b-\eta \frac{\partial E}{\partial b}
where η is the learning rate, which sets the step size of the gradient-descent updates during backpropagation.
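These formulas can be summarized in a few helper functions (a sketch with names of my own choosing, assuming sigmoid activations as in this example):

def sigmoid_deriv_from_output(y):
    # derivative of the sigmoid expressed through its output y = S(x)
    return y * (1 - y)

def output_delta(target, actual):
    # delta = -(expected output - actual output) * activation derivative
    return -(target - actual) * sigmoid_deriv_from_output(actual)

def output_grads(delta, hidden_out):
    # dE/dw = (hidden output feeding this weight) * delta,  dE/db = delta
    return hidden_out * delta, delta

def hidden_grads(deltas, weights, hidden_out, layer_input):
    # weighted sum of the downstream deltas, times the local activation derivative
    back = sum(d * w for d, w in zip(deltas, weights)) * sigmoid_deriv_from_output(hidden_out)
    # dE/dw = (input feeding this weight) * back,  dE/db = back
    return layer_input * back, back

def gradient_step(param, grad, eta=0.5):
    # w' = w - eta * dE/dw
    return param - eta * grad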

Running the algorithm

Next, let us work through the backward pass once:

 \begin{align*} \frac{\partial E}{\partial w5}&=(-(Y-t1)*t1*(1-t1)*out1) \\ \qquad&=\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)*0.65701046\\ -(0-0.76078908)*0.76078908*(1-0.76078908)*0.65701046\\ -(0-0.76528859)*0.76528859*(1-0.76528859)*0.68189626 \end{pmatrix}\\ \qquad&=\begin{pmatrix} -0.02860214\\ 0.09096657\\ 0.09373526 \end{pmatrix} \end{align*}
w5^{'}=\begin{pmatrix} 0.5-0.5*(-0.02860214)\\ 0.5-0.5*0.09096657\\ 0.5-0.5*0.09373526 \end{pmatrix}=\begin{pmatrix} 0.51430107\\ 0.45451671\\ 0.45313237\end{pmatrix}
\begin{align*} \frac{\partial E}{\partial b2}&=(-(Y-t1)*t1*(1-t1)) \\ \qquad&=\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)\\ -(0-0.76078908)*0.76078908*(1-0.76078908)\\ -(0-0.76528859)*0.76528859*(1-0.76528859) \end{pmatrix}\\ \qquad&=\begin{pmatrix} -0.04353377\\ 0.13845529\\ 0.13746264 \end{pmatrix} \end{align*}
b2^{'}=\begin{pmatrix} 0.5-0.5*(-0.04353377)\\ 0.5-0.5*0.13845529\\ 0.5-0.5*0.13746264\end{pmatrix}=\begin{pmatrix} 0.52176689\\ 0.43077236\\ 0.43126868\end{pmatrix}
Similarly:

w6^{'}=\begin{pmatrix} 0.5-0.5*(-0.02860214)\\ 0.5-0.5*0.09096657\\ 0.5-0.5*0.09373526 \end{pmatrix}=\begin{pmatrix} 0.51430107\\ 0.45451671\\ 0.45313237\end{pmatrix}

Continuing backward to the hidden layer:

\begin{align*} \frac{\partial E}{\partial w1}&=(X1*(-(Y-t1)*t1*(1-t1)*w5)*out1*(1-out1)) \\ \qquad&=\begin{pmatrix}0.05*(-(1-0.76078908)*0.76078908*(1-0.76078908)*0.5)*0.65701046*(1-0.65701046)\\0.2*(-(0-0.76078908)*0.76078908*(1-0.76078908)*0.5)*0.65701046*(1-0.65701046)\\ 0.314*(-(0-0.76528859)*0.76528859*(1-0.76528859)*0.5)*0.68189626*(1-0.68189626) \end{pmatrix}\\ \qquad&=\begin{pmatrix} -0.00024526\\ 0.00312006\\ 0.00468135 \end{pmatrix} \end{align*}

w1^{'}=\begin{pmatrix} 0.5-0.5*(-0.00024526)\\ 0.5-0.5*0.00312006\\ 0.5-0.5*0.00468135 \end{pmatrix}=\begin{pmatrix} 0.50012263\\ 0.49843997\\ 0.49765932\end{pmatrix}

\begin{align*} \frac{\partial E}{\partial b1}&=(-(Y-t1)*t1*(1-t1)*w5)*out1*(1-out1) \\ \qquad&+(-(Y-t1)*t1*(1-t1)*w6)*out2*(1-out2) \\ \qquad&=\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\ -(0-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\-(0-0.76528859)*0.76528859*(1-0.76528859)*0.5*0.68189626*(1-0.68189626) \end{pmatrix}\\ \qquad&+\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\ -(0-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\-(0-0.76528859)*0.76528859*(1-0.76528859)*0.5*0.68189626*(1-0.68189626) \end{pmatrix}\\\qquad&=\begin{pmatrix} -0.00981024\\ 0.03120058\\ 0.02981754\end{pmatrix} \end{align*}
b1^{'}=\begin{pmatrix} 0.5-0.5*(-0.00981024)\\ 0.5-0.5*0.03120058\\ 0.5-0.5*0.02981754 \end{pmatrix}=\begin{pmatrix} 0.50490512\\ 0.48439971\\ 0.48509123\end{pmatrix}

Similarly, one can compute:

w2^{'}=\begin{pmatrix} 0.50061314\\ 0.49921999\\ 0.49842712\end{pmatrix} \quad \& \quad w3^{'}=\begin{pmatrix} 0.50012263\\ 0.49843997\\ 0.49765932\end{pmatrix} \quad \& \quad w4^{'}=\begin{pmatrix} 0.50061314\\ 0.49921999\\ 0.49842712\end{pmatrix} \quad

This completes one full iteration of the BP algorithm.
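This single update step can be checked numerically (a minimal NumPy sketch that repeats the forward pass and prints the updated w5, b2, w1 and b1 for comparison with the vectors above):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X1 = np.array([0.05, 0.2, 0.314])
X2 = np.array([0.25, 0.1, 0.211])
Y = np.array([1, 0, 0])
w1 = w2 = w3 = w4 = w5 = w6 = 0.5
b1 = b2 = 0.5
eta = 0.5

# forward pass
out1 = sigmoid(w1 * X1 + w2 * X2 + b1)
out2 = sigmoid(w3 * X1 + w4 * X2 + b1)
t1 = sigmoid(w5 * out1 + w6 * out2 + b2)

# output layer: delta and updates
delta = -(Y - t1) * t1 * (1 - t1)
w5_new = w5 - eta * delta * out1      # approximately [0.5143, 0.4545, 0.4531]
b2_new = b2 - eta * delta             # approximately [0.5218, 0.4308, 0.4313]

# hidden layer: the bias b1 is shared by both hidden neurons
d_h1 = delta * w5 * out1 * (1 - out1)
d_h2 = delta * w6 * out2 * (1 - out2)
w1_new = w1 - eta * d_h1 * X1         # approximately [0.5001, 0.4984, 0.4977]
b1_new = b1 - eta * (d_h1 + d_h2)     # approximately [0.5049, 0.4844, 0.4851]
print(w5_new, b2_new, w1_new, b1_new)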

Next, using the adjusted w' and b', repeat the forward-propagation and backward-propagation steps above until the error is within an acceptable range.

Running my code for 100000 iterations, the final result is:

b1=[1.4829065 , 1.36039658, 1.312224]

b2=[2.67796477, -3.03037886, -2.94488393]

w1=[0.52457266, 0.58603966, 0.62751917]

w2=[0.62286331, 0.54301983, 0.58568963]

w3=[0.52457266, 0.58603966, 0.62751917]

w4=[0.62286331, 0.54301983, 0.58568963]

w5=[ 2.12102269, -1.94604351, -1.96760757]

w6=[ 2.12102269, -1.94604351, -1.96760757]

Actual output = [0.99806371 0.00196405 0.00195205]
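As a sanity check (a sketch, not part of the original post), plugging the reported final parameters back into the forward pass reproduces this output. Like the training code below, it keeps one parameter value per sample, so all quantities are length-3 arrays:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X1 = np.array([0.05, 0.2, 0.314])
X2 = np.array([0.25, 0.1, 0.211])

b1 = np.array([1.4829065, 1.36039658, 1.312224])
b2 = np.array([2.67796477, -3.03037886, -2.94488393])
w1 = w3 = np.array([0.52457266, 0.58603966, 0.62751917])
w2 = w4 = np.array([0.62286331, 0.54301983, 0.58568963])
w5 = w6 = np.array([2.12102269, -1.94604351, -1.96760757])

out1 = sigmoid(w1 * X1 + w2 * X2 + b1)
out2 = sigmoid(w3 * X1 + w4 * X2 + b1)
t1 = sigmoid(w5 * out1 + w6 * out2 + b2)
print(t1)   # approximately [0.998, 0.0020, 0.0020]

The full training script follows.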

import numpy as np


# "pd" 偏導
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoidDerivationx(y):
    # derivative of the sigmoid, written in terms of its output y = sigmoid(x)
    return y * (1 - y)


if __name__ == "__main__":
    bias = [0.5, 0.5]  # [b1, b2]
    weight = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # [w1, w2, w3, w4, w5, w6]
    # normalized inputs
    X1 = np.array([0.05, 0.2, 0.314])
    X2 = np.array([0.25, 0.1, 0.211])
    # expected (target) outputs
    target1 = np.array([1, 0, 0])
    alpha = 0.5  # learning rate
    numIter = 100000  # number of iterations
    for i in range(numIter):
        # forward pass
        net1 = X1 * weight[1 - 1] + X2 * weight[2 - 1] + bias[0]
        net2 = X1 * weight[3 - 1] + X2 * weight[4 - 1] + bias[0]
        out1 = sigmoid(net1)
        out2 = sigmoid(net2)
        r1 = out1 * weight[5 - 1] + out2 * weight[6 - 1] + bias[1]
        t1 = sigmoid(r1)

        print(str(i) + ", error : " + str(target1 - t1))
        if i == numIter - 1:
            print("last result : " + str(t1))
        # backward pass
        # gradients of the output-layer weights w5 and w6
        pdEt1 = - (target1 - t1)
        pdt1r1 = sigmoidDerivationx(t1)
        pdr1W5 = out1
        pdEW5 = pdEt1 * pdt1r1 * pdr1W5
        pdr1W6 = out2
        pdEW6 = pdEt1 * pdt1r1 * pdr1W6

        # gradient of b2
        pdEB2 = pdEt1 * pdt1r1

        # gradients of the hidden-layer weights w1-w4
        pdEt1 = - (target1 - t1)
        pdt1r1 = sigmoidDerivationx(t1)
        pdr1out1 = weight[5 - 1]
        pdEout1 = pdEt1 * pdt1r1 * pdr1out1
        pdout1net1 = sigmoidDerivationx(out1)
        pdnet1W1 = X1
        pdnet1W2 = X2
        pdEW1 = pdEout1 * pdout1net1 * pdnet1W1
        pdEW2 = pdEout1 * pdout1net1 * pdnet1W2
        pdr1out2 = weight[6 - 1]
        pdout2net2 = sigmoidDerivationx(out2)
        pdnet2W3 = X1
        pdnet2W4 = X2
        pdEout2 = pdEt1 * pdt1r1 * pdr1out2
        pdEW3 = pdEout2 * pdout2net2 * pdnet2W3
        pdEW4 = pdEout2 * pdout2net2 * pdnet2W4

        # gradient of b1
        pdEB1 = pdEout1 * pdout1net1 + pdEout2 * pdout2net2

        # parameter updates (weights and biases)
        weight[1 - 1] = weight[1 - 1] - alpha * pdEW1
        weight[2 - 1] = weight[2 - 1] - alpha * pdEW2
        weight[3 - 1] = weight[3 - 1] - alpha * pdEW3
        weight[4 - 1] = weight[4 - 1] - alpha * pdEW4
        weight[5 - 1] = weight[5 - 1] - alpha * pdEW5
        weight[6 - 1] = weight[6 - 1] - alpha * pdEW6

        bias[1 - 1] = bias[1 - 1] - alpha * pdEB1
        bias[2 - 1] = bias[2 - 1] - alpha * pdEB2

