【Deep Learning Basics 4】Optimization and Hyperparameter Tuning for Deep Neural Networks (1)

Please credit the source when reposting. Thank you.

This post is organized from Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization. As a record of deep-network optimization, the three weeks of material are summarized as key points and illustrated with examples.

Q1. Data split (how should the dataset be divided?)

Conventionally, the data is divided into three parts: a training set (for training), a dev/validation set (for cross-validation, i.e. choosing the best model), and a test set (for an unbiased estimate of the final model's performance).

In the small-data era, typical splits were 70% (train) / 30% (test), or 60% (train) / 20% (dev) / 20% (test). With big data the proportions change: with a few hundred thousand examples, something like 95% (train) / 2.5% (dev) / 2.5% (test) is reasonable, and with a million examples or more, 98% (train) / 1% (dev) / 1% (test) is already plenty.

Note that the dev and test sets should come from the same distribution, so that dev-set results carry over to the test set.
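As a minimal sketch of such a split (assuming the examples are stored row-wise in a NumPy array X with labels y; the 98/1/1 fractions are just one of the ratios mentioned above, and the function name is made up for illustration):

import numpy as np

def split_dataset(X, y, train_frac=0.98, dev_frac=0.01, seed=0):
    """Randomly split (X, y) into train/dev/test sets; the remainder goes to the test set."""
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    n_train, n_dev = int(m * train_frac), int(m * dev_frac)
    train_idx = idx[:n_train]
    dev_idx = idx[n_train:n_train + n_dev]
    test_idx = idx[n_train + n_dev:]
    return (X[train_idx], y[train_idx]), (X[dev_idx], y[dev_idx]), (X[test_idx], y[test_idx])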

 

Q2. Bias and variance

high bias: the model performs poorly because it is too simple; it is underfitting.

high variance: the model fits the training data well, but it is too complex and relies too heavily on that data; it is overfitting.

Typical symptoms and remedies (a rough diagnostic sketch follows this list):

  • train acc low, dev acc low (high bias) ------- (1) use a bigger network; (2) train longer; (3) try a different architecture
  • train acc high, dev acc low (high variance) ------- (1) get more data; (2) add regularization; (3) try a different architecture
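A rough diagnostic helper based on the two cases above. This is only a sketch: the accuracy values, the target accuracy (standing in for human-level / desired performance), and the 5% gap threshold are illustrative assumptions.

def diagnose(train_acc, dev_acc, target_acc=0.95):
    """Rough bias/variance diagnosis from train and dev accuracy (illustrative thresholds)."""
    if train_acc < target_acc:
        print("High bias: try a bigger network, train longer, or change the architecture.")
    if train_acc - dev_acc > 0.05:
        print("High variance: get more data, add regularization, or change the architecture.")

diagnose(train_acc=0.99, dev_acc=0.90)   # prints the high-variance advice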

 

Q3. Regularization

1. Definition: add a regularization term to the cost function to constrain the model's complexity.

L2 regularization: \frac{\lambda}{2m}\left \|w \right \|^2_2=\frac{\lambda}{2m}\sum_{j=1}^{n_x}w_j^2=\frac{\lambda}{2m}w^Tw

L1 regularization: \frac{\lambda}{2m}\left \|w \right \|_1=\frac{\lambda}{2m}\sum_{j=1}^{n_x}\left |w_j \right |

where \lambda is called the regularization parameter.

For a network, J(w^{[1]},b^{[1]},\cdots ,w^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}\left \| w^{[l]} \right \|^2_F, where \left \| \cdot \right \|^2_F is the squared Frobenius norm (the per-example loss L is already the cross-entropy, so the data term is not negated).

Deriving further, the gradient and the update become:

dW^{[l]}=({from\, backprop})+ \frac{\lambda}{m}W^{[l]}

W^{[l]}=W^{[l]}-\alpha dW^{[l]}=(1-\frac{\alpha\lambda}{m})W^{[l]}-\alpha(from \, backprop)

Since 1-\frac{\alpha\lambda}{m}<1, each update shrinks the weights slightly; this is why L2 regularization is also called weight decay.

Intuition: when \lambda is large, \left \| w \right \| must be as small as possible (close to 0) to keep the loss small. This suppresses the influence of many neurons (small weights), making the network effectively simpler and less prone to overfitting.
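A quick numerical check of the weight-decay view above; this is only a sketch with made-up shapes, where dW_backprop stands for the gradient of the unregularized cost:

import numpy as np

np.random.seed(0)
m, alpha, lambd = 64, 0.1, 0.7
W = np.random.randn(3, 4)
dW_backprop = np.random.randn(3, 4)          # gradient of the unregularized cost w.r.t. W

# Update with the L2 term folded into the gradient
W_reg = W - alpha * (dW_backprop + (lambd / m) * W)

# Equivalent "weight decay" form: shrink W first, then take the usual step
W_decay = (1 - alpha * lambd / m) * W - alpha * dW_backprop

print(np.allclose(W_reg, W_decay))           # True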

Adding the regularization term to the cost function:

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    
    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (1.0/m) * (lambd/2.0)*(np.sum(np.square(W1))+np.sum(np.square(W2))+np.sum(np.square(W3)))
    ### END CODE HERE ###
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost

Adding the regularization term in back-propagation:

# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + 1.0 * lambd/m* W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + 1.0 * lambd/m* W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + 1.0 * lambd/m* W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

 

2. Other regularization methods:

(a) Dropout

Intuition: with dropout, any input feature may be randomly zeroed out, so the network cannot rely too heavily on any single feature and will not assign it an excessively large weight. Propagated through the layers, dropout therefore has a weight-shrinking effect similar to the regularization above.

In practice, use little or no dropout (keep_prob = 1) for layers with few neurons, and a smaller keep_prob for layers with many neurons. Note that dropout makes the cost J no longer well defined, so the training curve is hard to plot; a common approach is to first turn dropout off, verify that the cost decreases monotonically, and then turn it back on.

Implementation:

Unlike L2 regularization, dropout is already applied in the forward pass: neurons are randomly kept or shut down. Remember to divide A by keep_prob (inverted dropout) so that the expected value of the activations is preserved.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """
    
    np.random.seed(1)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    D1 = np.random.rand(A1.shape[0],A1.shape[1])      # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = (D1 < keep_prob)                             # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
    A1 = A1/ keep_prob                                # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])     # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = (D2 < keep_prob)                             # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                      # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

In the backward pass, shut down the same neurons as in the forward pass and rescale by keep_prob:

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = 1.0* dA2 /  keep_prob      # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1          # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = 1.0* dA1 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

Results: (comparison plots omitted)

In practice, choose the regularization method that best suits the data.

 

(b) Data augmentation

(c) Early stopping: stop training before the dev error starts to rise; the drawback is that the cost may not have been fully minimized. A minimal sketch follows.
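A minimal early-stopping sketch. The helpers train_one_epoch, evaluate_dev_cost, get_weights and set_weights are hypothetical placeholders, not part of the course code:

def train_with_early_stopping(model, patience=5, max_epochs=200):
    """Stop once the dev cost has not improved for `patience` consecutive epochs."""
    best_cost, best_weights, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                       # hypothetical helper
        dev_cost = evaluate_dev_cost(model)          # hypothetical helper
        if dev_cost < best_cost:
            best_cost, best_weights, wait = dev_cost, get_weights(model), 0
        else:
            wait += 1
            if wait >= patience:                     # dev cost stopped improving
                break
    set_weights(model, best_weights)                 # roll back to the best dev-cost weights
    return model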

 

Q4. Normalizing the inputs

Normalize the features of the dataset:

Mean: \mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}

Variance (element-wise, of the mean-centered data): \sigma^2=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)^2

Zero-center: x:=x-\mu

Scale: x:=x/\sigma^2 (this is the course's notation; dividing by the standard deviation \sigma is the more common convention).

Intuitively (contour figure omitted), normalization makes the cost surface more symmetric, so gradient descent takes a more direct path and training speeds up.
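A sketch of the normalization above, following the course convention that examples are stored in the columns of X; the training-set statistics must be reused on the dev/test sets (the function name is made up for illustration):

import numpy as np

def normalize_inputs(X_train, X_test):
    """Normalize features; X_* has shape (n_features, m), examples in columns."""
    mu = np.mean(X_train, axis=1, keepdims=True)            # per-feature mean
    X_train = X_train - mu
    sigma2 = np.mean(X_train ** 2, axis=1, keepdims=True)   # per-feature variance of the centered data
    X_train = X_train / sigma2
    X_test = (X_test - mu) / sigma2                         # reuse the training statistics
    return X_train, X_test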

 

Q5. Vanishing and exploding gradients

1. What are vanishing and exploding gradients?

Example (illustration omitted): in a deep network, the gradient that reaches an early layer is a product of one factor per layer, each of roughly the form w^{[l]}\sigma'(z^{[l]}).

First, the derivative of the sigmoid function is at most 1/4.

If the weights are small (\left | w \right |<1), the gradients of the earlier layers shrink with every additional factor in the product, so those layers learn more and more slowly; this is the vanishing-gradient phenomenon. Conversely, when \left | w \right |\gg 1, the gradients of the earlier layers grow rapidly, giving exploding gradients.

In other words, when the gradients grow or shrink exponentially with depth, we speak of exploding or vanishing gradients, respectively.
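A toy numerical illustration of this exponential effect (plain multiplication, not an actual network):

depth = 50
print(0.9 ** depth)   # ~5.2e-3  -- a per-layer factor slightly below 1 makes the signal vanish
print(1.1 ** depth)   # ~117.4   -- a per-layer factor slightly above 1 makes it explode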

 

2. Mitigating vanishing and exploding gradients with careful initialization

Xavier initialization: set Var(w_i)= \frac{1}{n}, where n is the number of units feeding into the layer. The core idea is to keep the variance of a layer's outputs comparable to that of its inputs, so the activations neither collapse toward 0 nor blow up.

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(1. / n)   # n: number of units in the previous layer

He initialization: for ReLU networks, the key idea is that roughly half of the neurons are active and the other half output 0, so to keep the variance unchanged, the Xavier variance is doubled (equivalently, n is halved):

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(2. / n)

When the activation is tanh, use Var(w_i)= \frac{1}{n} (i.e. Xavier initialization).

A complete He initialization:

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers
     
    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l],layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l],1))
        ### END CODE HERE ###
        
    return parameters
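For example, with the layer sizes from the earlier dropout network (2 inputs, hidden layers of 20 and 3 units, 1 output):

parameters = initialize_parameters_he([2, 20, 3, 1])
print(parameters["W1"].shape)   # (20, 2)
print(parameters["b1"].shape)   # (20, 1)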

Q6. Gradient checking

The goal is to verify that the analytically computed gradient is correct, i.e. that d\theta_{approx}\approx d\theta. Concretely, check:

\frac{\left \| d\theta_{approx}-d\theta \right \|_2}{\left \| d\theta_{approx}\right \|_2+\left \| d\theta \right \|_2}<\epsilon

where

d\theta_{approx}=\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}

1-D (scalar \theta) case:

def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implement the backward propagation presented in Figure 1.
    
    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                    # Step 1
    thetaminus = theta - epsilon                   # Step 2
    J_plus = forward_propagation(x,thetaplus)      # Step 3
    J_minus = forward_propagation(x,thetaminus)     # Step 4
    gradapprox = (J_plus - J_minus)/(2.0*epsilon)                             # Step 5
    ### END CODE HERE ###
    
    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###
    
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                           # Step 2'
    difference = numerator / denominator                              # Step 3'
    ### END CODE HERE ###
    
    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")
    
    return difference

N-dimensional case:

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters. 
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                     # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                               # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary( thetaplus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                     # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
        J_minus[i], _ =  forward_propagation_n(X, Y, vector_to_dictionary( thetaminus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = ( J_plus[i] -  J_minus[i] )/ (2.0 * epsilon)  
        ### END CODE HERE ###
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator =  np.linalg.norm(grad - gradapprox)                                          # Step 1'
    denominator =  np.linalg.norm(grad )+ np.linalg.norm(gradapprox)                                         # Step 2'
    difference = numerator / denominator                                         # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference

Note: gradient checking should only be used for debugging (it is too slow to run during training), and it does not work together with dropout; turn dropout off (keep_prob = 1) while checking gradients.

 
