Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Assignment 2

Assignment overview:

In this assignment you will practice writing backpropagation code and training neural networks and convolutional neural networks.

Hyperparameter Tuning

Which hyperparameters does a deep network have?

[Figure: candidate hyperparameters: the learning rate α; the momentum parameter β; Adam's β1, β2 and ε; the number of layers; the number of hidden units; learning rate decay; the mini-batch size]
Of these, the learning rate is the most important hyperparameter to tune.

How do you choose which values to try?

In an earlier generation of machine learning algorithms, if you had two hyperparameters, call them hyperparameter 1 and hyperparameter 2, the common practice was to sample points in a grid and systematically explore those values. In practice the grid can be 5×5, or larger or smaller, but for this example you would try all 25 points and pick whichever setting works best. This method is practical when the number of hyperparameters is relatively small.
In deep learning, what we usually do, and what I recommend, is to choose the points at random: sample the same number of points, say 25, and try out the hyperparameters at those randomly chosen settings. The reason is that for the problem you are trying to solve, it is hard to know in advance which hyperparameter matters most, and as you saw earlier, some hyperparameters really do matter more than others.
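
As a minimal sketch (not part of the assignment; the ranges here are made up for illustration), the two sampling schemes look like this in numpy:

import numpy as np

np.random.seed(0)

# Grid search: 25 points, but only 5 distinct values per hyperparameter
grid = [(lr, reg) for lr in np.logspace(-4, 0, 5)
                  for reg in np.logspace(-4, 0, 5)]

# Random search: 25 points, 25 distinct values per hyperparameter
rand = [(10 ** np.random.uniform(-4, 0), 10 ** np.random.uniform(-4, 0))
        for _ in range(25)]

If only one of the two hyperparameters turns out to matter, random search has effectively tried 25 values of it, while the grid has only tried 5.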

Picking an appropriate scale for hyperparameters

Suppose you are choosing the number of hidden units, and you think the right value lies somewhere between 50 and 100. In that case, picture the number line from 50 to 100 and sample points on it at random; that is a very intuitive way to search this particular hyperparameter. Or, if you are choosing the number of layers, call it L, you might believe anywhere from 2 to 4 layers is reasonable, so sampling uniformly over 2, 3 and 4 makes sense; here you could even use a grid search, since 2, 3 and 4 are all defensible values. These are examples where sampling uniformly at random over the range you are considering works fine, but it does not work for every hyperparameter; a quick sketch of this kind of sampling follows.
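
A hypothetical sketch of uniform sampling for ranges like these:

import numpy as np

hidden_units = np.random.randint(50, 101)  # uniform integer in [50, 100]
num_layers = np.random.randint(2, 5)       # uniform integer from {2, 3, 4}
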
Consider this example: suppose you are searching for the learning rate α and you suspect its value could be as small as 0.0001 or as large as 1. If you draw the number line from 0.0001 to 1 and sample uniformly along it, about 90% of the values will fall between 0.1 and 1. In other words, you would spend 90% of your search resources between 0.1 and 1 and only 10% between 0.0001 and 0.1, which does not seem right.
Instead, it is more sensible to search this kind of hyperparameter on a log scale: rather than a linear axis, mark off 0.0001, 0.001, 0.01, 0.1 and 1 and sample uniformly at random on the log axis. That way, as much search effort goes to the interval from 0.0001 to 0.001 as to the interval from 0.001 to 0.01, and so on.
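
In numpy, one way to implement this log-scale sampling (a sketch, using the range from the example above):

import numpy as np

r = -4 * np.random.rand()  # r is uniform in [-4, 0]
alpha = 10 ** r            # alpha is log-uniform in [1e-4, 1]
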
Finally, one more tricky case is choosing β, the parameter used to compute exponentially weighted averages. Suppose you believe the right value is somewhere between 0.9 and 0.999; that is the range you want to search. Remember that when computing exponentially weighted averages, β = 0.9 is roughly like averaging over the last 10 values, similar to averaging the temperature over 10 days, while β = 0.999 is like averaging over the last 1,000 values. So here, too, you should not sample on a linear scale; instead, search 1 - β on a log scale.
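
A sketch of that idea in numpy: sample r uniformly and set 1 - β = 10^r, so the search spends as much effort between 0.9 and 0.99 as between 0.99 and 0.999:

import numpy as np

r = np.random.uniform(-3, -1)  # r is uniform in [-3, -1]
beta = 1 - 10 ** r             # beta is in [0.9, 0.999], denser near 1
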

How to organize your hyperparameter search

I have seen roughly two major schools of thought, two important but different ways people typically go about this.
One is babysitting a single model. Typically you have a huge dataset but limited computational resources, not many CPUs and GPUs, so you can basically afford to train only one model, or a small handful, at a time. In that case you can gradually improve the model even as it trains. For example, on day 0 you initialize the parameters at random and start training, and over the first day you watch the learning curve, perhaps the cost function J or the error on a dev set, gradually come down. At the end of that day you might say: it is learning pretty well, let me nudge the learning rate up a bit and see what happens, and maybe it then does better; that is your performance on day two. Two days later you say: it is still doing well, maybe I can adjust the momentum term or the learning rate decay now, and so on into day three. Every day you watch the model and keep adjusting its parameters. Maybe one day you find the learning rate was too large, so you roll back to the previous day's model. Essentially you spend your time babysitting this one model even as it trains over many days or weeks. That is one approach: people nurse a single model, watch its performance, and patiently tune the learning rate, usually because they do not have enough compute to try many models at once.
The other approach is to train many models in parallel. You pick one setting of the hyperparameters and just let the model run on its own for a day or more, and you get a learning curve; it might be the cost J, the training error, or the dev-set error, but it is some metric you track. At the same time you start a second model with a different hyperparameter setting, which produces a different learning curve, maybe one that looks better (say, the purple curve in the lecture's figure). Meanwhile you start a third model, and so on, training many models in parallel (the orange curves are other models). This way you try many hyperparameter settings and at the end simply pick whichever worked best, as in the sketch below.
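
A sketch of this second approach using the Solver API from this assignment (the grid of learning rates and regularization strengths here is made up; FullyConnectedNet, Solver and small_data are defined in the assignment code further below):

best_val_acc, best_model = -1, None
for lr in [1e-4, 1e-3]:
    for reg in [1e-2, 1e-1]:
        model = FullyConnectedNet([100, 100], reg=reg)
        solver = Solver(model, small_data,
                        num_epochs=5, batch_size=64,
                        update_rule='adam',
                        optim_config={'learning_rate': lr},
                        verbose=False)
        solver.train()
        # Keep the configuration with the best validation accuracy
        val_acc = max(solver.val_acc_history)
        if val_acc > best_val_acc:
            best_val_acc, best_model = val_acc, model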

Code

best_model = None
################################################################################
# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might   #
# find batch/layer normalization and dropout useful. Store your best model in  #
# the best_model variable.                                                     #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# Train on the full CIFAR-10 data loaded into `data`; `small_data` is the
# 1,000-example subset used in the batchnorm experiments further below.
model = FullyConnectedNet([100, 100, 100, 100], dropout=0.5, normalization=None,
                          weight_scale=5e-2, reg=1e-1)
solver = Solver(model, data,
                num_epochs=10, batch_size=64,
                update_rule='adam',
                optim_config={'learning_rate': 1e-4},
                verbose=True)
solver.train()
best_model = model

Regularization

Deep learning models can suffer from overfitting, that is, high variance. There are two remedies: regularization, and getting more data. More data is a very reliable fix, but you cannot always obtain enough training data, or the cost of acquiring it may be high, whereas regularization usually helps avoid overfitting and reduce your network's error. So if you suspect your neural network is overfitting the data, a high-variance problem, regularization is probably the first thing to try. Let's look at how it works.
L2 regularization is the most common type of regularization. You may also have heard of L1 regularization; with L1 regularization, the weight vector w ends up sparse, meaning it contains many zeros.
Why does regularization help prevent overfitting?
[Figure: networks ranging from high bias (left) to "just right" to high variance (right)]
The intuition is that if the regularization parameter λ is set large enough, the weight matrices W are pushed toward values close to zero; intuitively, the weights of many hidden units are set to (nearly) zero, which essentially removes much of their influence. If that happened, the greatly simplified network would behave like a much smaller one, almost like a logistic regression unit stacked many layers deep, and it would move from the overfitting regime toward the high-bias state on the left of the figure.
But there is an intermediate value of λ that yields a "just right" state in between.
So the intuition is that as λ grows large enough, W approaches zero. In practice that is not literally what happens; rather, we are reducing, not eliminating, the influence of many hidden units. All the hidden units are still there, but each has a smaller effect, so the network behaves more simply and seems less prone to overfitting. And indeed, when you implement regularization in code, you will actually see the variance decrease.
To summarize: if the regularization parameter λ is very large, the weights W become small, so z is also relatively small (ignoring the effect of b here). When z stays within a small range, an activation function like tanh is roughly linear, so the whole network computes something close to a linear function, a simple function rather than a highly complex, highly nonlinear one, and it will not overfit.
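
As a minimal numpy sketch (a generic data loss and its gradient are assumed; this is not the assignment's exact code), L2 regularization adds 0.5 * reg * ||W||^2 to the loss, and its gradient adds reg * W, which is what shrinks the weights toward zero:

import numpy as np

def l2_regularized(data_loss, dW, W, reg):
    # Penalize large weights: J = data_loss + (reg/2) * ||W||^2
    loss = data_loss + 0.5 * reg * np.sum(W * W)
    # The penalty's gradient shrinks W a little on every update
    dW = dW + reg * W
    return loss, dW
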
Dropout regularization
[Figure: a fully connected network before dropout, and the thinned network after dropout randomly removes units]
How it works: suppose you are training a network like the one in the figure above and it is overfitting; that is what dropout addresses. Make a copy of the network; dropout then goes through each layer and sets a probability of eliminating each node. Suppose each node in each layer is kept or removed by a coin flip, so both probabilities are 0.5. Having set these probabilities, we eliminate some nodes and delete all the edges into and out of each removed node, ending up with a smaller, thinned-out network, which we then train with backprop.
Note: dropout is not applied at test time.

Code

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.

    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.

    NOTE 2: Keep in mind that p is the probability of **keeping** a neuron
    output; this is the opposite of some sources, where p refers to the
    probability of dropping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Inverted dropout: keep each unit with probability p and rescale by
        # 1/p at training time, so no extra scaling is needed at test time.
        # The mask must have the same shape as x so that units are dropped
        # independently for each example in the batch.
        mask = (np.random.rand(*x.shape) < p) / p
        out = x * mask

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        out = x

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                            END OF YOUR CODE                         #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # The mask already includes the 1/p rescaling from the forward pass
        dx = dout * mask

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx

Batch Normalization
There is some debate in the deep learning literature about whether to normalize the values before the activation function or after applying it. In practice, normalizing the pre-activation values is done far more often; that is the default choice. Here is how Batch Norm is applied.
Given the pre-activation values z(1), ..., z(m) in a mini-batch:

    μ = (1/m) · Σ_i z(i)
    σ² = (1/m) · Σ_i (z(i) - μ)²
    z_norm(i) = (z(i) - μ) / sqrt(σ² + ε)
    z~(i) = γ · z_norm(i) + β

where γ and β are learnable parameters that control the scale and mean of the normalized values.
With Batch Norm you can train deeper networks, and your learning algorithm runs much faster.

Code

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        #######################################################################
        # TODO: Implement the training-time forward pass for batch norm.      #
        # Use minibatch statistics to compute the mean and variance, use      #
        # these statistics to normalize the incoming data, and scale and      #
        # shift the normalized data using gamma and beta.                     #
        #                                                                     #
        # You should store the output in the variable out. Any intermediates  #
        # that you need for the backward pass should be stored in the cache   #
        # variable.                                                           #
        #                                                                     #
        # You should also use your computed sample mean and variance together #
        # with the momentum variable to update the running mean and running   #
        # variance, storing your result in the running_mean and running_var   #
        # variables.                                                          #
        #                                                                     #
        # Note that though you should be keeping track of the running         #
        # variance, you should normalize the data based on the standard       #
        # deviation (square root of variance) instead!                        # 
        # Referencing the original paper (https://arxiv.org/abs/1502.03167)   #
        # might prove to be helpful.                                          #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Mini-batch statistics over the batch axis; shapes are (D,)
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Normalize, then scale and shift with the learnable gamma and beta
        xnorm = (x - mean) / np.sqrt(var + eps)
        out = gamma * xnorm + beta
        # Exponentially decaying running averages, used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
        cache = (x, gamma, mean, var, eps)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        # Normalize with the running statistics accumulated during training
        xnorm = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * xnorm + beta

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache


def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    # Referencing the original paper (https://arxiv.org/abs/1502.03167)       #
    # might prove to be helpful.                                              #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, gamma, mean, var, eps = cache
    std = np.sqrt(var + eps)
    xnorm = (x - mean) / std
    N, D = x.shape

    # gamma and beta were broadcast over the batch in the forward pass, so
    # their gradients sum over the batch axis; shapes are (D,)
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(xnorm * dout, axis=0)

    # Backprop through out = gamma * xnorm + beta and the normalization.
    # mean and var are both functions of x, so dx has a direct term plus
    # correction terms from the mean and variance paths.
    dy = dout * gamma
    dx1 = dy / std
    dx2 = -np.sum(dy / (std * N), axis=0)
    dx3 = -np.sum(dy * xnorm / std**2, axis=0) * \
          ((x - mean) / N - np.sum(x - mean, axis=0) / N**2)
    dx = dx1 + dx2 + dx3
    
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta


def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalizaton backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass. 
    See the jupyter notebook for more hints.
     
    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    #                                                                         #
    # After computing the gradient with respect to the centered inputs, you   #
    # should be able to compute gradients with respect to the inputs in a     #
    # single statement; our implementation fits on a single 80-character line.#
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x, gamma, mean, var, eps = cache
    std = np.sqrt(var + eps)
    x_norm = (x - mean) / std
    N, D = x.shape

    # out = x_norm * gamma + beta; gamma and beta were broadcast in the
    # forward pass, so their gradients are summed over the batch axis
    dbeta = np.sum(dout, axis=0)            # (D,)
    dgamma = np.sum(dout * x_norm, axis=0)  # (D,)
    dx_norm = dout * gamma                  # (N, D)

    # dvar follows directly from differentiating (var + eps)**(-1/2)
    dvar = np.sum(dx_norm * -0.5 * (x - mean) / (var + eps)**1.5, axis=0)
    # dmean has two parts: the direct path through x_norm, and the path
    # through var, which is itself a function of the mean
    dmean1 = np.sum(-dx_norm / std, axis=0)
    dmean2 = dvar * -2.0 / N * np.sum(x - mean, axis=0)
    dmean = dmean1 + dmean2
    # dx has three parts, since mean and var are both functions of x
    dx1 = dx_norm / std
    dx2 = dmean / N
    dx3 = dvar * 2.0 / N * (x - mean)
    dx = dx1 + dx2 + dx3

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta

Optimization Algorithms

Mini-batch gradient descent
Suppose your training set has 5,000,000 examples and each mini-batch has 1,000 examples; then you have 5,000 mini-batches, because 5,000 × 1,000 = 5,000,000. Running mini-batch gradient descent on the training set means running for t = 1, ..., 5000, since there are 5,000 groups of 1,000 examples each, and inside the for loop you essentially perform one step of gradient descent. Imagine a training set of 1,000 examples, and suppose you are already familiar with processing them all at once: you use vectorization to process those 1,000 examples essentially simultaneously.
Benefit: with batch gradient descent, a single pass over the training set lets you take only one gradient descent step; with mini-batch gradient descent, a single pass over the training set lets you take 5,000 gradient descent steps. Of course, you normally want to make multiple passes over the training set, so you wrap the for loop in another outer loop, and you keep iterating over the training set until you converge to a suitable accuracy.
One of the variables you need to choose is the mini-batch size; m is the size of the training set. At one extreme, if the mini-batch size equals m, you recover batch gradient descent. At the other extreme, if the mini-batch size is 1, you get a different algorithm called stochastic gradient descent, where every example is its own mini-batch.
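
A self-contained sketch of the loop structure (a toy linear-regression problem with made-up sizes; the point is one gradient step per mini-batch, inside an outer loop of passes over the data):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((5000, 10))
y_train = X_train @ rng.standard_normal(10)
w = np.zeros(10)
learning_rate, batch_size = 0.01, 1000

for epoch in range(3):                                  # passes over the data
    for t in range(X_train.shape[0] // batch_size):     # one step per mini-batch
        Xb = X_train[t * batch_size:(t + 1) * batch_size]
        yb = y_train[t * batch_size:(t + 1) * batch_size]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient on this batch only
        w -= learning_rate * grad                       # one gradient descent step
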
Gradient descent with Momentum
[Figure: gradient descent oscillating up and down across the contours of the cost function while making slow progress toward the minimum]
On the vertical axis you want learning to be slower, because you do not want those oscillations, while on the horizontal axis you want faster learning: you want to move quickly from left to right, toward the minimum (the red dot). That is what gradient descent with momentum gives you.
Concretely, the algorithm is: on iteration t, compute dW and db on the current mini-batch, then

    v_dW = β · v_dW + (1 - β) · dW
    v_db = β · v_db + (1 - β) · db
    W := W - α · v_dW,   b := b - α · v_db
So you have two hyperparameters: the learning rate α, and the parameter β, which controls the exponentially weighted average. The most common value for β is 0.9.
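
A sketch of one momentum step in the course's notation; note the (1 - β) factor here, which the CS231n sgd_momentum implementation below folds into the learning rate instead:

import numpy as np

def momentum_step(W, dW, v, alpha=0.01, beta=0.9):
    # v is an exponentially weighted average of past gradients
    v = beta * v + (1 - beta) * dW
    W = W - alpha * v
    return W, v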
RMSprop
RMSprop, which stands for root mean square prop, can also speed up gradient descent. Let's see how it works.
Recall our earlier example: when you run gradient descent, you make progress along the horizontal axis but oscillate widely along the vertical axis. You would like to slow down learning in the b (vertical) direction while speeding up, or at least not slowing down, learning in the horizontal direction, and RMSprop does exactly that.
On iteration t, compute dW and db on the current mini-batch, then (squares taken element-wise; the decay rate is written β2 to distinguish it from the momentum parameter)

    S_dW = β2 · S_dW + (1 - β2) · dW²
    S_db = β2 · S_db + (1 - β2) · db²
    W := W - α · dW / (sqrt(S_dW) + ε),   b := b - α · db / (sqrt(S_db) + ε)
Adam optimization algorithm
The Adam optimization algorithm is essentially Momentum and RMSprop combined. Let's see how to use it.
Initialize v_dW = 0, S_dW = 0, v_db = 0, S_db = 0. On iteration t, compute dW and db on the current mini-batch, then:

    v_dW = β1 · v_dW + (1 - β1) · dW          S_dW = β2 · S_dW + (1 - β2) · dW²
    v_dW_corrected = v_dW / (1 - β1^t)        S_dW_corrected = S_dW / (1 - β2^t)
    W := W - α · v_dW_corrected / (sqrt(S_dW_corrected) + ε)

and likewise for b.
Hyperparameter choices:
- α: needs to be tuned
- β1: 0.9 is the common default
- β2: 0.999 is the common default (the value the Adam paper recommends)
- ε: 10^-8; the exact value matters little

Code

import numpy as np

"""
This file implements various first-order update rules that are commonly used
for training neural networks. Each update rule accepts current weights and the
gradient of the loss with respect to those weights and produces the next set of
weights. Each update rule has the same interface:

def update(w, dw, config=None):

Inputs:
  - w: A numpy array giving the current weights.
  - dw: A numpy array of the same shape as w giving the gradient of the
    loss with respect to w.
  - config: A dictionary containing hyperparameter values such as learning
    rate, momentum, etc. If the update rule requires caching values over many
    iterations, then config will also hold these cached values.

Returns:
  - next_w: The next point after the update.
  - config: The config dictionary to be passed to the next iteration of the
    update rule.

NOTE: For most update rules, the default learning rate will probably not
perform well; however the default values of the other hyperparameters should
work well for a variety of different problems.

For efficiency, update rules may perform in-place updates, mutating w and
setting next_w equal to w.
"""


def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)

    w -= config['learning_rate'] * dw
    return w, config


def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the momentum update formula. Store the updated value in #
    # the next_w variable. You should also use and update the velocity v.     #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Accumulate an exponentially decaying velocity, then step along it
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    config['velocity'] = v

    return next_w, config



def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of w #
    # in the next_w variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # Decaying average of squared gradients, then a per-parameter step
    cache = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw * dw
    config['cache'] = cache
    next_w = w - config['learning_rate'] * dw / (np.sqrt(cache) + config['epsilon'])

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config


def adam(w, dw, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))
    config.setdefault('v', np.zeros_like(w))
    config.setdefault('t', 0)

    next_w = None
    ###########################################################################
    # TODO: Implement the Adam update formula, storing the next value of w in #
    # the next_w variable. Don't forget to update the m, v, and t variables   #
    # stored in config.                                                       #
    #                                                                         #
    # NOTE: In order to match the reference output, please modify t _before_  #
    # using it in any calculations.                                           #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    # First and second moment estimates of the gradient
    m = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
    v = config['beta2'] * config['v'] + (1 - config['beta2']) * dw * dw
    config['m'] = m
    config['v'] = v
    # Increment t before using it, then apply bias correction
    t = config['t'] + 1
    config['t'] = t
    mt = m / (1 - config['beta1']**t)
    vt = v / (1 - config['beta2']**t)
    next_w = w - config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon'])

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config

Assignment code

import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.classifiers.fc_net import *
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array
from cs231n.solver import Solver

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def print_mean_std(x,axis=0):
    print('  means: ', x.mean(axis=axis))
    print('  stds:  ', x.std(axis=axis))
    print() 
# Load the dataset
data = get_CIFAR10_data()
for k, v in data.items():
  print('%s: ' % k, v.shape)
# Check the means and stds before and after batch normalization
# Simulate the forward pass for a two-layer network
np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)

print('Before batch normalization:')
print_mean_std(a,axis=0)

gamma = np.ones((D3,))
beta = np.zeros((D3,))
# Means should be close to zero and stds close to one
print('After batch normalization (gamma=1, beta=0)')
a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=0)

gamma = np.asarray([1.0, 2.0, 3.0])
beta = np.asarray([11.0, 12.0, 13.0])
# Now means should be close to beta and stds close to gamma
print('After batch normalization (gamma=', gamma, ', beta=', beta, ')')
a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=0)

np.random.seed(231)
N, D1, D2, D3 = 200, 50, 60, 3
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)

bn_param = {'mode': 'train'}
gamma = np.ones(D3)
beta = np.zeros(D3)

for t in range(50):
  X = np.random.randn(N, D1)
  a = np.maximum(0, X.dot(W1)).dot(W2)
  batchnorm_forward(a, gamma, beta, bn_param)

bn_param['mode'] = 'test'
X = np.random.randn(N, D1)
a = np.maximum(0, X.dot(W1)).dot(W2)
a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)

# Means should be close to zero and stds close to one, but will be
# noisier than training-time forward passes.
print('After batch normalization (test-time):')
print_mean_std(a_norm,axis=0)
# Gradient-check batchnorm
np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

bn_param = {'mode': 'train'}
fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]
fg = lambda a: batchnorm_forward(x, a, beta, bn_param)[0]
fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)
db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)

_, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)
#You should expect to see relative errors between 1e-13 and 1e-8
print('dx error: ', rel_error(dx_num, dx))
print('dgamma error: ', rel_error(da_num, dgamma))
print('dbeta error: ', rel_error(db_num, dbeta))
np.random.seed(231)
N, D = 100, 500
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

bn_param = {'mode': 'train'}
out, cache = batchnorm_forward(x, gamma, beta, bn_param)

t1 = time.time()
dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)
t2 = time.time()
dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)
t3 = time.time()

print('dx difference: ', rel_error(dx1, dx2))
print('dgamma difference: ', rel_error(dgamma1, dgamma2))
print('dbeta difference: ', rel_error(dbeta1, dbeta2))
print('speedup: %.2fx' % ((t2 - t1) / (t3 - t2)))
np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

# You should expect losses between 1e-4~1e-10 for W, 
# losses between 1e-08~1e-10 for b,
# and losses between 1e-08~1e-09 for beta and gammas.
for reg in [0, 3.14]:
  print('Running check with reg = ', reg)
  model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                            reg=reg, weight_scale=5e-2, dtype=np.float64,
                            normalization='batchnorm')

  loss, grads = model.loss(X, y)
  print('Initial loss: ', loss)

  for name in sorted(grads):
    f = lambda _: model.loss(X, y)[0]
    grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
    print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))
  if reg == 0: print()
np.random.seed(231)
# Train a deep network with batchnorm
hidden_dims = [100, 100, 100, 100, 100]

num_train = 1000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

weight_scale = 2e-2
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

print('Solver with batch norm:')
bn_solver = Solver(bn_model, small_data,
                num_epochs=10, batch_size=50,
                update_rule='adam',
                optim_config={
                  'learning_rate': 1e-3,
                },
                verbose=True,print_every=20)
bn_solver.train()

print('\nSolver without batch norm:')
solver = Solver(model, small_data,
                num_epochs=10, batch_size=50,
                update_rule='adam',
                optim_config={
                  'learning_rate': 1e-3,
                },
                verbose=True, print_every=20)
solver.train()
def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):
    """utility function for plotting training history"""
    plt.title(title)
    plt.xlabel(label)
    bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]
    bl_plot = plot_fn(baseline)
    num_bn = len(bn_plots)
    for i in range(num_bn):
        label='with_norm'
        if labels is not None:
            label += str(labels[i])
        plt.plot(bn_plots[i], bn_marker, label=label)
    label='baseline'
    if labels is not None:
        label += str(labels[0])
    plt.plot(bl_plot, bl_marker, label=label)
    plt.legend(loc='lower center', ncol=num_bn+1) 

    
plt.subplot(3, 1, 1)
plot_training_history('Training loss','Iteration', solver, [bn_solver], \
                      lambda x: x.loss_history, bl_marker='o', bn_marker='o')
plt.subplot(3, 1, 2)
plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \
                      lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')
plt.subplot(3, 1, 3)
plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \
                      lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')

plt.gcf().set_size_inches(15, 15)
plt.show()
# Study the interaction between batch normalization and weight initialization
np.random.seed(231)
# Try training a very deep net with batchnorm
hidden_dims = [50, 50, 50, 50, 50, 50, 50]
num_train = 1000
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

bn_solvers_ws = {}
solvers_ws = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):
  print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))
  bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
  model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)

  bn_solver = Solver(bn_model, small_data,
                  num_epochs=10, batch_size=50,
                  update_rule='adam',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  verbose=False, print_every=200)
  bn_solver.train()
  bn_solvers_ws[weight_scale] = bn_solver

  solver = Solver(model, small_data,
                  num_epochs=10, batch_size=50,
                  update_rule='adam',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  verbose=False, print_every=200)
  solver.train()
  solvers_ws[weight_scale] = solver
# Plot results of weight scale experiment
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []

for ws in weight_scales:
  best_train_accs.append(max(solvers_ws[ws].train_acc_history))
  bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))
  
  best_val_accs.append(max(solvers_ws[ws].val_acc_history))
  bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))
  
  final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))
  bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))
  
plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')

plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend()

plt.subplot(3, 1, 3)
plt.title('Final training loss vs weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend()
plt.gca().set_ylim(1.0, 3.5)

plt.gcf().set_size_inches(15, 15)
plt.show()
# Now we run a small experiment to study the interaction between batch
# normalization and batch size.
def run_batchsize_experiments(normalization_mode):
    np.random.seed(231)
    # Try training a very deep net with batchnorm
    hidden_dims = [100, 100, 100, 100, 100]
    num_train = 1000
    small_data = {
      'X_train': data['X_train'][:num_train],
      'y_train': data['y_train'][:num_train],
      'X_val': data['X_val'],
      'y_val': data['y_val'],
    }
    n_epochs=10
    weight_scale = 2e-2
    batch_sizes = [5,10,50]
    lr = 10**(-3.5)
    solver_bsize = batch_sizes[0]

    print('No normalization: batch size = ',solver_bsize)
    model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)
    solver = Solver(model, small_data,
                    num_epochs=n_epochs, batch_size=solver_bsize,
                    update_rule='adam',
                    optim_config={
                      'learning_rate': lr,
                    },
                    verbose=False)
    solver.train()
    
    bn_solvers = []
    for i in range(len(batch_sizes)):
        b_size=batch_sizes[i]
        print('Normalization: batch size = ',b_size)
        bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)
        bn_solver = Solver(bn_model, small_data,
                        num_epochs=n_epochs, batch_size=b_size,
                        update_rule='adam',
                        optim_config={
                          'learning_rate': lr,
                        },
                        verbose=False)
        bn_solver.train()
        bn_solvers.append(bn_solver)
        
    return bn_solvers, solver, batch_sizes

batch_sizes = [5,10,50]
bn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')
# Plot the results
plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
                      lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \
                      lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)

plt.gcf().set_size_inches(15, 10)
plt.show()
# Batch normalization has proved effective at making networks easier to
# train, but its dependence on the batch size makes it less useful in
# complex networks, where hardware constraints cap the input batch size.
#
# Several alternatives have been proposed to mitigate this problem; one
# such technique is layer normalization [2]. Instead of normalizing over
# the batch, we normalize over the features: with layer normalization,
# each feature vector corresponding to a single data point is normalized
# based on all of the entries within that feature vector.
np.random.seed(231)
N, D1, D2, D3 =4, 50, 60, 3
X = np.random.randn(N, D1)
W1 = np.random.randn(D1, D2)
W2 = np.random.randn(D2, D3)
a = np.maximum(0, X.dot(W1)).dot(W2)

print('Before layer normalization:')
print_mean_std(a,axis=1)

gamma = np.ones(D3)
beta = np.zeros(D3)
# Means should be close to zero and stds close to one
print('After layer normalization (gamma=1, beta=0)')
a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=1)

gamma = np.asarray([3.0,3.0,3.0])
beta = np.asarray([5.0,5.0,5.0])
# Now means should be close to beta and stds close to gamma
print('After layer normalization (gamma=', gamma, ', beta=', beta, ')')
a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})
print_mean_std(a_norm,axis=1)

# Gradient check layernorm backward pass
np.random.seed(231)
N, D = 4, 5
x = 5 * np.random.randn(N, D) + 12
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)

ln_param = {}
fx = lambda x: layernorm_forward(x, gamma, beta, ln_param)[0]
fg = lambda a: layernorm_forward(x, a, beta, ln_param)[0]
fb = lambda b: layernorm_forward(x, gamma, b, ln_param)[0]

dx_num = eval_numerical_gradient_array(fx, x, dout)
da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)
db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)

_, cache = layernorm_forward(x, gamma, beta, ln_param)
dx, dgamma, dbeta = layernorm_backward(dout, cache)

#You should expect to see relative errors between 1e-12 and 1e-8
print('dx error: ', rel_error(dx_num, dx))
print('dgamma error: ', rel_error(da_num, dgamma))
print('dbeta error: ', rel_error(db_num, dbeta))
ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')

plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \
                      lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \
                      lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)

plt.gcf().set_size_inches(15, 10)
plt.show()

References

1.http://www.ai-start.com/dl2017/html/lesson2-week2.html
