Deep Learning from Scratch (4): Error Backpropagation

This post is part of my reading notes on 《深度學習入門 基於Python的理論與實現》 (Deep Learning from Scratch: theory and implementation in Python).
The code and figures are adapted from the book.

Computational Graphs

Solving a Problem with a Computational Graph

  • Problem 1: Taro buys 2 apples at 100 yen each at the supermarket, and the consumption tax is 10%. Calculate the amount he pays.
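
The book draws this as a computational graph with two multiplication nodes (apple price × count, then × tax). Written out, the forward pass is simply

$$100 \times 2 = 200, \qquad 200 \times 1.1 = 220 \ \text{yen}$$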

Local Computation

  • The defining feature of a computational graph is that the final result can be obtained by propagating "local computations": each node only performs the computation that directly concerns it and does not need to care about anything else in the graph.

Backpropagation


Backpropagation at an Addition Node

An addition node passes the upstream derivative on to both of its inputs unchanged: for $z = x + y$ we have $\partial z/\partial x = \partial z/\partial y = 1$, so the incoming derivative is simply forwarded along both branches.

Backpropagation at a Multiplication Node

A multiplication node multiplies the upstream derivative by the value of the *other* forward-pass input (the inputs are "swapped"): for $z = xy$ we have $\partial z/\partial x = y$ and $\partial z/\partial y = x$.

  • The backward pass of multiplication needs the input signal values from the forward pass. Therefore, an implementation of the multiplication node has to store the forward-pass input signals, as the sketch below does.
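
A minimal sketch of these two nodes (the interface mirrors the book's MulLayer/AddLayer, but this particular listing is my own reconstruction):

class MulLayer:
    def __init__(self):
        self.x = None  # forward inputs are stored for the backward pass
        self.y = None

    def forward(self, x, y):
        self.x = x
        self.y = y
        return x * y

    def backward(self, dout):
        dx = dout * self.y  # derivative w.r.t. x uses the "swapped" input y
        dy = dout * self.x
        return dx, dy

class AddLayer:
    def forward(self, x, y):
        return x + y        # nothing needs to be stored

    def backward(self, dout):
        return dout, dout   # the upstream derivative passes through unchanged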

The Chain Rule and Computational Graphs

The figures in this part of the book show that backpropagation on a computational graph is nothing but the chain rule: the derivative of a composite function is the product of the local derivatives of the nodes along the path, and the backward pass accumulates exactly that product edge by edge.
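
As a compact restatement of the book's running example: with $z = t^2$ and $t = x + y$,

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial t}\,\frac{\partial t}{\partial x} = 2t \cdot 1 = 2(x + y)$$

which is exactly the value that arrives at the $x$ input when the graph is traversed backwards from $z$.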

Backpropagation Through the Computational Graph

  • The biggest reason to use computational graphs is that derivatives can be computed efficiently via backpropagation.

  • Example: find "the derivative of the amount paid with respect to the price of an apple" in Problem 1.
    As Figure 5-5 in the book shows, backpropagation is drawn with arrows (thick lines) pointing in the direction opposite to the forward pass. It propagates "local derivatives", whose values are written below the arrows. From this result, the derivative of the amount paid with respect to the apple price is 2.2; the snippet below traces the same numbers.
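
Using the MulLayer sketched above (my own reconstruction, in the spirit of the book's buy-apple example), the forward and backward passes reproduce the 2.2:

apple, apple_num, tax = 100, 2, 1.1

mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# forward pass
apple_price = mul_apple_layer.forward(apple, apple_num)      # 200
price = mul_tax_layer.forward(apple_price, tax)              # ~220

# backward pass, starting from d(price)/d(price) = 1
dprice = 1
dapple_price, dtax = mul_tax_layer.backward(dprice)          # 1.1, 200
dapple, dapple_num = mul_apple_layer.backward(dapple_price)  # 2.2, 110

print(price, dapple, dapple_num, dtax)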

Implementing Activation Function Layers

ReLU Layer

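ReLU passes positive inputs through unchanged and blocks non-positive ones, and its backward pass does the same to the upstream derivative:

$$y = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases} \qquad\qquad \frac{\partial y}{\partial x} = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases}$$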

import numpy as np

class Relu:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        # remember which elements are non-positive; they are zeroed here
        # and their gradient is zeroed again in backward()
        self.mask = (x <= 0)
        out = x.copy()          # don't modify the caller's array in place
        out[self.mask] = 0
        return out

    def backward(self, dout):
        dout = dout.copy()
        dout[self.mask] = 0
        return dout

if __name__ == "__main__":
    layer = Relu()
    x = np.random.randn(3, 3)
    print('x:', x, sep='\n')

    out = layer.forward(x)
    print('out:', out, sep='\n')

    dout = layer.backward(np.ones_like(x))
    print('dout:', dout, sep='\n')

Code output:

x:
[[ 0.08621289  1.20328454  1.81030439]
 [-1.31113673 -0.11453987  0.88408891]
 [ 0.14068574 -0.479992   -1.73015689]]
out:
[[0.08621289 1.20328454 1.81030439]
 [0.         0.         0.88408891]
 [0.14068574 0.         0.        ]]
dout:
[[1. 1. 1.]
 [0. 0. 1.]
 [1. 0. 0.]]

Sigmoid Layer


  • The derivative of the sigmoid can be expressed with its own output as $y(1-y)$, so the backward pass of the Sigmoid layer can be computed from the forward-pass output alone.
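
Concretely, with $y = \dfrac{1}{1 + e^{-x}}$,

$$\frac{\partial y}{\partial x} = \frac{e^{-x}}{(1 + e^{-x})^2} = y(1 - y), \qquad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\, y(1 - y)$$
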
class Sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        self.out = out.copy()
        return out

    def backward(self, dout):
        return dout * self.out * (1 - self.out)
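
A quick check of the layer (my own snippet, assuming the Sigmoid class above): the analytic backward pass matches a central-difference estimate.

import numpy as np

sig = Sigmoid()
x = np.array([[-1.0, 0.0, 2.0]])
y = sig.forward(x)

dx = sig.backward(np.ones_like(x))              # analytic: y * (1 - y)

h = 1e-4                                        # numerical central difference
num = (1 / (1 + np.exp(-(x + h))) - 1 / (1 + np.exp(-(x - h)))) / (2 * h)
print(np.allclose(dx, num, atol=1e-6))          # True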

Implementing the Affine/Softmax Layers

Affine Layer

Derivation, using the batch version of the Affine layer:

  • Let the batch size be $N$, the number of neurons in the previous layer be $a$, and the number of neurons in this layer be $b$.
    $X$ is the input, $W$ this layer's weight matrix, $B$ this layer's bias, and $Y = X \cdot W + B$.
    $x_{ij}$ denotes the $j$-th input of the $i$-th sample,
    $w_{ij}$ denotes the weight connecting the $i$-th neuron of the previous layer to the $j$-th neuron of this layer,
    $y_{ij}$ denotes the $j$-th output of the $i$-th sample, and
    $b_{i}$ denotes the bias of the $i$-th neuron.

$$\begin{aligned} \frac{\partial L}{\partial w_{ij}} &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot \frac{\partial y_{kj}}{\partial w_{ij}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{km} w_{mj} + b_{j}\right)}{\partial w_{ij}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot x_{ki} \\ \therefore\ \frac{\partial L}{\partial W} &= X^T \cdot \frac{\partial L}{\partial Y} \end{aligned}$$

$$\begin{aligned} \frac{\partial L}{\partial x_{ij}} &= \sum_{k=1}^{b} \frac{\partial L}{\partial y_{ik}} \cdot \frac{\partial y_{ik}}{\partial x_{ij}} \\ &= \sum_{k=1}^{b} \frac{\partial L}{\partial y_{ik}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{im} w_{mk} + b_{k}\right)}{\partial x_{ij}} \\ &= \sum_{k=1}^{b} \frac{\partial L}{\partial y_{ik}} \cdot w_{jk} \\ \therefore\ \frac{\partial L}{\partial X} &= \frac{\partial L}{\partial Y} \cdot W^T \end{aligned}$$

$$\begin{aligned} \frac{\partial L}{\partial b_{i}} &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \cdot \frac{\partial y_{ki}}{\partial b_{i}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{km} w_{mi} + b_{i}\right)}{\partial b_{i}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \\ \therefore\ \frac{\partial L}{\partial B} &= \text{the sum of } \frac{\partial L}{\partial Y} \text{ over axis } 0 \end{aligned}$$

class Affine:
    def __init__(self, w, b):
        self.w = w
        self.b = b
        self.x = None
        self.original_x_shape = None
        self.dw = None
        self.db = None

    def forward(self, x):
        # reshape tensor inputs (e.g. image batches) into a 2-D matrix
        # so the fully connected computation is a single dot product
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)

        self.x = x
        return np.dot(self.x, self.w) + self.b

    def backward(self, dout):
        self.dw = np.dot(self.x.T, dout)                            # dL/dW = X^T . dL/dY
        self.db = dout if dout.ndim == 1 else np.sum(dout, axis=0)  # dL/dB = dL/dY summed over axis 0
        dx = np.dot(dout, self.w.T)                                 # dL/dX = dL/dY . W^T
        return dx.reshape(*self.original_x_shape)                   # restore the original input shape (for tensor inputs)
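
A quick shape check for the layer (my own snippet, assuming the Affine class above): a batch of 4 samples with 6 features mapped to 3 outputs.

import numpy as np

np.random.seed(0)
affine = Affine(np.random.randn(6, 3), np.zeros(3))

x = np.random.randn(4, 6)                   # batch of 4 samples, 6 features each
y = affine.forward(x)                       # shape (4, 3)

dx = affine.backward(np.ones_like(y))       # upstream derivative of all ones
print(y.shape, dx.shape)                    # (4, 3) (4, 6)
print(affine.dw.shape, affine.db.shape)     # (6, 3) (3,)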

Softmax-with-Loss (Cross-Entropy Loss) Layer

Forward Pass


Backward Pass


  • Wherever a value branches out to several nodes in the forward pass, the corresponding backward values are summed during backpropagation.
    Taking the "/" node in the upper-right corner of the book's figure (the node that computes $\frac{1}{S}$) as an example:
    $$\frac{\partial L}{\partial S} = \frac{\partial L}{\partial y_1} \cdot \frac{\partial y_1}{\partial \frac{1}{S}} \cdot \frac{\partial \frac{1}{S}}{\partial S} + \frac{\partial L}{\partial y_2} \cdot \frac{\partial y_2}{\partial \frac{1}{S}} \cdot \frac{\partial \frac{1}{S}}{\partial S} + \frac{\partial L}{\partial y_3} \cdot \frac{\partial y_3}{\partial \frac{1}{S}} \cdot \frac{\partial \frac{1}{S}}{\partial S}$$
    The same result can also be derived without a computational graph.
    To keep the notation compact, first introduce the Kronecker delta:
    $$\delta_{pk} = \begin{cases} 1 & (p = k) \\ 0 & (p \neq k) \end{cases}$$
    Now the derivation itself. Let $S = \sum_m e^{a_m}$, so that $y_p = \frac{e^{a_p}}{S}$ and, for the cross-entropy error, $\frac{\partial L}{\partial y_p} = -\frac{t_p}{y_p}$. Then
    $$\frac{\partial L}{\partial a_k} = \sum_p \frac{\partial L}{\partial y_p} \frac{\partial y_p}{\partial a_k} = \sum_p -\frac{t_p}{y_p} \cdot \frac{S e^{a_p}\delta_{pk} - e^{a_p}\frac{\partial S}{\partial a_k}}{S^2}$$
    $$\because\ \frac{\partial S}{\partial a_k} = \frac{\partial \sum_m e^{a_m}}{\partial a_k} = e^{a_k}$$
    $$\begin{aligned} \therefore\ \frac{\partial L}{\partial a_k} &= \sum_p -\frac{t_p}{e^{a_p}} \cdot \frac{S e^{a_p}\delta_{pk} - e^{a_p} e^{a_k}}{S} \\ &= \sum_p -\frac{t_p}{e^{a_p}} \left(e^{a_p}\delta_{pk} - \frac{e^{a_p} e^{a_k}}{S}\right) \\ &= \sum_p \left(-t_p \delta_{pk} + \frac{t_p e^{a_k}}{S}\right) \\ &= -t_k + y_k \sum_p t_p \\ &= -t_k + y_k \qquad (\text{since } \textstyle\sum_p t_p = 1) \end{aligned}$$

  • When the cross-entropy error is used as the loss function of the softmax function, backpropagation yields the "beautiful" result $(y_1 - t_1,\ y_2 - t_2,\ y_3 - t_3)$. This is no accident: the cross-entropy error was designed precisely so that this happens. For the same reason, regression problems use the identity function in the output layer together with the sum-of-squares error as the loss: with that pairing, backpropagation again yields $(y_1 - t_1,\ y_2 - t_2,\ y_3 - t_3)$.
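
The SoftmaxWithLoss class below calls softmax and cross_entropy_error, which the book defines in earlier chapters. For reference, here is a minimal self-contained version of the two helpers (my own sketch; the book's versions also handle 1-D inputs):

import numpy as np

def softmax(x):
    # subtract the row-wise maximum for numerical stability before exponentiating
    x = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def cross_entropy_error(y, t):
    # mean cross-entropy over the batch; accepts one-hot or label-index targets
    batch_size = y.shape[0]
    if t.size == y.size:                 # one-hot targets -> convert to indices
        t = t.argmax(axis=1)
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size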

class SoftmaxWithLoss:
    def __init__(self):
        self.y = None  # softmax output
        self.t = None  # teacher labels

    def forward(self, x, t):
        self.t = t.copy()
        self.y = softmax(x)
        return cross_entropy_error(self.y, t)

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        if self.t.size == self.y.size:  # teacher labels are one-hot vectors
            dx = self.y - self.t
        else:                           # teacher labels are class indices
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1

        # The book divides by batch_size here so that what is propagated
        # backwards is the per-sample error; my understanding is that this is
        # tied to how the preceding fully connected layers compute their gradients.
        return dx / batch_size

As for that final division by batch_size, my understanding is that it is tied to how the gradients of the preceding fully connected layers are computed:

  • $$\frac{\partial L}{\partial w_{ij}} = \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot \frac{\partial y_{kj}}{\partial w_{ij}} = \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{km} w_{mj} + b_{j}\right)}{\partial w_{ij}} = \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot x_{ki}$$

  • $$\frac{\partial L}{\partial b_{i}} = \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \cdot \frac{\partial y_{ki}}{\partial b_{i}} = \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{km} w_{mi} + b_{i}\right)}{\partial b_{i}} = \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}}$$

As the sums over $k$ above show, the larger the batch_size (i.e. the larger $N$), the larger the partial derivative of the loss with respect to each individual weight or bias becomes, which is not what we want. Therefore $\frac{\partial L}{\partial Y}$ has to be divided by $N$.

  • It follows that the derivative $\frac{\partial L}{\partial Y}$ produced by the loss layer should be divided by batch_size, as in the quick check sketched below.
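
A quick sanity check (my own snippet, independent of the book's code): duplicating a single sample $n$ times should not change its weight gradient, and that only holds when the softmax/cross-entropy gradient is divided by the batch size.

import numpy as np

np.random.seed(0)
x = np.random.randn(1, 4)                 # one sample, 4 features
w = np.random.randn(4, 3)
t = np.array([[0, 1, 0]])                 # one-hot target

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

for n in (1, 5):
    xb, tb = np.repeat(x, n, axis=0), np.repeat(t, n, axis=0)
    y = softmax(xb @ w)
    dw_mean = xb.T @ ((y - tb) / n)       # divided by batch size: identical for every n
    dw_sum = xb.T @ (y - tb)              # not divided: grows linearly with n
    print(n, dw_mean[0, 0], dw_sum[0, 0])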

Gradient Check

  • The strength of numerical differentiation is that it is simple to implement, so it is rarely implemented incorrectly, whereas the implementation of error backpropagation is complex and error-prone. It is therefore common practice to compare the result of numerical differentiation against the result of backpropagation to confirm that the backpropagation implementation is correct.
  • One way to compare the two gradients is through the ratio of 2-norms, $\dfrac{\lVert g_{\text{numerical}} - g_{\text{backprop}} \rVert_2}{\lVert g_{\text{numerical}} \rVert_2 + \lVert g_{\text{backprop}} \rVert_2}$; the denominator scales the ratio so that it comes out neither too large nor too small.
  • Doesn't work with dropout (dropout randomly drops neurons, which makes the loss L hard to evaluate consistently).
  • Run at random initialization, and perhaps again after some training (there is a small chance that w and b only pass the gradient check while they are close to 0, and that the backprop gradients go wrong once training has moved them away from 0, so it can be worth re-running the check after the network has trained for a while).
  • When the discrepancy is large, compare the numerically computed gradient and the backprop gradient entry by entry to find which parameter's gradient is being computed incorrectly. The gradient_check routine below implements both criteria.

import numpy as np

def gradient_check(net, x_batch, t_batch):
    grad_numerical = net.numerical_gradient(x_batch, t_batch)
    grad_backprop = net.gradient(x_batch, t_batch)

    for key in grad_numerical.keys():
        print(key, ':')

        # diff1: the book's criterion, the mean absolute difference
        diff1 = np.mean(np.abs(grad_numerical[key] - grad_backprop[key]))
        print('diff1:', diff1)

        # diff2: relative norm-ratio criterion
        #   diff2 ~ 1e-7 -> correct
        #   diff2 > 1e-5 -> please check again
        #   diff2 > 1e-3 -> probably wrong
        diff2 = np.linalg.norm(grad_numerical[key] - grad_backprop[key], 2) \
            / (np.linalg.norm(grad_numerical[key], 2) + np.linalg.norm(grad_backprop[key], 2))
        print('diff2:', diff2)
  • diff1 follows the comparison method used in the book (mean absolute difference).
  • diff2 uses the norm-ratio criterion; the decision thresholds are noted in the comments above.
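
The numerical_gradient helper imported in the next section (from func.gradient in this repository layout) is presumably the book's central-difference routine. A minimal sketch of it, assuming x is a NumPy array that may be multi-dimensional:

import numpy as np

def numerical_gradient(f, x):
    # central difference: df/dx_i ~ (f(x + h*e_i) - f(x - h*e_i)) / (2h)
    h = 1e-4
    grad = np.zeros_like(x)

    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        tmp = x[idx]

        x[idx] = tmp + h
        fxh1 = f(x)

        x[idx] = tmp - h
        fxh2 = f(x)

        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp               # restore the original value
        it.iternext()

    return grad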

Re-implementing the Two-Layer Network by Assembling the Layers

import sys
file_path = __file__.replace('\\', '/')
dir_path = file_path[: file_path.rfind('/')] # path of the directory containing this file
pardir_path = dir_path[: dir_path.rfind('/')]
sys.path.append(pardir_path) # add the parent of the current directory to the Python module search path

import numpy as np
from func.gradient import numerical_gradient, gradient_check
from layer.activation import Relu, Affine, SoftmaxWithLoss, Sigmoid
import matplotlib.pyplot as plt
from collections import OrderedDict

class TwoLayerNet:
    """
    2 Fully Connected layers
    softmax with cross entropy error
    """
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        self.params = {}
        self.params['w1'] = np.random.randn(input_size, hidden_size) * weight_init_std
        self.params['b1'] = np.zeros(hidden_size)
        self.params['w2'] = np.random.randn(hidden_size, output_size) * weight_init_std
        self.params['b2'] = np.zeros(output_size)

        self.layers = OrderedDict()
        self.layers['affine1'] = Affine(self.params['w1'], self.params['b1'])
        self.layers['relu1'] = Relu()
        self.layers['affine2'] = Affine(self.params['w2'], self.params['b2'])
        
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = y.argmax(axis=1)

        if t.ndim != 1:
            t = t.argmax(axis=1)

        accuracy = np.sum(y == t) / x.shape[0]
        return accuracy

    def numerical_gradient(self, x, t):
        loss = lambda w: self.loss(x, t)

        grads = {}
        grads['w1'] = numerical_gradient(loss, self.params['w1'])
        grads['b1'] = numerical_gradient(loss, self.params['b1'])
        grads['w2'] = numerical_gradient(loss, self.params['w2'])
        grads['b2'] = numerical_gradient(loss, self.params['b2'])

        return grads

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)
        for layer_name in reversed(self.layers):
            dout = self.layers[layer_name].backward(dout)

        grads = {}
        grads['w1'] = self.layers['affine1'].dw
        grads['b1'] = self.layers['affine1'].db
        grads['w2'] = self.layers['affine2'].dw
        grads['b2'] = self.layers['affine2'].db

        return grads


if __name__ == '__main__':
    from dataset.mnist import load_mnist
    import pickle
    import os

    (x_train, t_train),  (x_test, t_test) = load_mnist(normalize=True, flatten=True, one_hot_label=True)
    
    # hyper parameters
    lr = 0.1
    batch_size = 100
    iters_num = 10000

    # setting
    train_flag = 0 # 1: train the network, 0: load and evaluate
    pretrain_flag = 0 # 1: load the parameters saved by the previous training run
    gradcheck_flag = 1 # 1: run a gradient check on the (trained) network
    
    pkl_file_name = dir_path + '/two_layer_net.pkl'
    train_size = x_train.shape[0]
    train_loss_list = []
    train_acc_list = []
    test_acc_list = []
    best_acc = 0

    iter_per_epoch = max(int(train_size / batch_size), 1)

    net = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

    if (pretrain_flag == 1 or train_flag == 0) and os.path.exists(pkl_file_name):
        with open(pkl_file_name, 'rb') as f:
            net = pickle.load(f)
            print('params loaded!')

    if train_flag == 1:
        print('start training!')
        for i in range(iters_num):
            # sample a mini-batch
            batch_mask = np.random.choice(train_size, batch_size)
            x_batch = x_train[batch_mask]
            t_batch = t_train[batch_mask]

            # compute the gradients (backprop; the numerical version is kept for reference)
            # grads_numerical = net.numerical_gradient(x_batch, t_batch)
            grads = net.gradient(x_batch, t_batch)

            # update the parameters (plain SGD)
            for key in ('w1', 'b1', 'w2', 'b2'):
                net.params[key] -= lr * grads[key]
            
            train_loss_list.append(net.loss(x_batch, t_batch))

            # record the learning progress once per epoch
            if i % iter_per_epoch == 0:
                train_acc_list.append(net.accuracy(x_train, t_train))
                test_acc_list.append(net.accuracy(x_test, t_test))
                print("train acc, test acc | ", train_acc_list[-1], ", ", test_acc_list[-1])

                if test_acc_list[-1] > best_acc:
                    best_acc = test_acc_list[-1]
                    with open(pkl_file_name, 'wb') as f:
                        pickle.dump(net, f)
                        print('net params saved!')

        # plot the accuracy curves
        fig, axis = plt.subplots(1, 1)

        x = np.arange(len(train_acc_list))
        axis.plot(x, train_acc_list, 'r', label='train acc')
        axis.plot(x, test_acc_list, 'g--', label='test acc')
        
        axis.set_xlabel("epochs")
        axis.set_ylabel("accuracy")
        axis.set_ylim(0, 1.0)
        axis.legend(loc='best')
        plt.show()
    else:
        if gradcheck_flag == 1:
            gradient_check(net, x_train[:3], t_train[:3])
        print(net.accuracy(x_train[:], t_train[:]))

First run a gradient check: set gradcheck_flag=1 and train_flag=0.
The code outputs the following:

w1 :
diff1: 2.6115099710177576e-11
diff2: 5.083025533280819e-08
b1 :
diff1: 2.2425837345792317e-10
diff2: 5.04159103684074e-08
w2 :
diff1: 1.3242105311984443e-10
diff2: 4.724560085528282e-08
b2 :
diff1: 2.712055815786953e-10
diff2: 5.095936065001308e-08

It looks like the gradients computed by backpropagation are correct.

Now let's move on to actually training the network: set train_flag=1.

Code output:

start training!
train acc, test acc |  0.10571666666666667 ,  0.1042
net params saved!
train acc, test acc |  0.9058833333333334 ,  0.9077
net params saved!
train acc, test acc |  0.9251666666666667 ,  0.9275
net params saved!
train acc, test acc |  0.9367166666666666 ,  0.9353
net params saved!
train acc, test acc |  0.9477166666666667 ,  0.9447
net params saved!
train acc, test acc |  0.9528833333333333 ,  0.9509
net params saved!
train acc, test acc |  0.9583 ,  0.9555
net params saved!
train acc, test acc |  0.9610333333333333 ,  0.9578
net params saved!
train acc, test acc |  0.9665 ,  0.9623
net params saved!
train acc, test acc |  0.9686333333333333 ,  0.9647
net params saved!
train acc, test acc |  0.9702333333333333 ,  0.9659
net params saved!
train acc, test acc |  0.9720833333333333 ,  0.9679
net params saved!
train acc, test acc |  0.9742166666666666 ,  0.9671
train acc, test acc |  0.9735666666666667 ,  0.9682
net params saved!
train acc, test acc |  0.9770833333333333 ,  0.9709
net params saved!
train acc, test acc |  0.9751 ,  0.9678
train acc, test acc |  0.9789166666666667 ,  0.9717
net params saved!

(Figure omitted: the train/test accuracy curves per epoch produced by the plotting code above.)
