【Deep Learning Basics 4】Optimization and Hyperparameter Tuning for Deep Neural Networks (1)

Please credit the source when reposting. Thank you.

This post is organized from Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization. As a record of deep-network optimization, the three weeks of material are summarized as key points and illustrated with examples.

Q1. Data split (how should the dataset be divided?)

Conventionally, the data is divided into three parts: a training set (for training), a dev/validation set (for cross-validation, i.e. choosing the best model), and a test set (for an unbiased estimate of the final model's performance).

In the small-data era, typical splits were 70% (train) / 30% (test), or 60% (train) / 20% (dev) / 20% (test). With big data the proportions change: with a few hundred thousand examples, something like 95% (train) / 2.5% (dev) / 2.5% (test) is reasonable, and with a million examples or more, 98% (train) / 1% (dev) / 1% (test) is already plenty.

Note that the dev and test sets should come from the same distribution, so that dev-set results carry over to the test set.
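As a minimal sketch of such a split (assuming the examples are stored row-wise in a NumPy array X with labels y; the 98/1/1 fractions are just one of the ratios mentioned above, and the function name is made up for illustration):

import numpy as np

def split_dataset(X, y, train_frac=0.98, dev_frac=0.01, seed=0):
    """Randomly split (X, y) into train/dev/test sets; the remainder goes to the test set."""
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    n_train, n_dev = int(m * train_frac), int(m * dev_frac)
    train_idx = idx[:n_train]
    dev_idx = idx[n_train:n_train + n_dev]
    test_idx = idx[n_train + n_dev:]
    return (X[train_idx], y[train_idx]), (X[dev_idx], y[dev_idx]), (X[test_idx], y[test_idx])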

 

Q2. Bias and variance

high bias: the model performs poorly because it is too simple; it is underfitting.

high variance: the model fits the training data well, but it is too complex and relies too heavily on that data; it is overfitting.

Typical symptoms and remedies (a rough diagnostic sketch follows this list):

  • train acc low, dev acc low (high bias) ------- (1) use a bigger network; (2) train longer; (3) try a different architecture
  • train acc high, dev acc low (high variance) ------- (1) get more data; (2) add regularization; (3) try a different architecture
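A rough diagnostic helper based on the two cases above. This is only a sketch: the accuracy values, the target accuracy (standing in for human-level / desired performance), and the 5% gap threshold are illustrative assumptions.

def diagnose(train_acc, dev_acc, target_acc=0.95):
    """Rough bias/variance diagnosis from train and dev accuracy (illustrative thresholds)."""
    if train_acc < target_acc:
        print("High bias: try a bigger network, train longer, or change the architecture.")
    if train_acc - dev_acc > 0.05:
        print("High variance: get more data, add regularization, or change the architecture.")

diagnose(train_acc=0.99, dev_acc=0.90)   # prints the high-variance advice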

 

Q3. Regularization

1. Definition: add a regularization term to the cost function to constrain the model's complexity.

L2 regularization: \frac{\lambda}{2m}\left \|w \right \|^2_2=\frac{\lambda}{2m}\sum_{j=1}^{n_x}w_j^2=\frac{\lambda}{2m}w^Tw

L1 regularization: \frac{\lambda}{2m}\left \|w \right \|_1=\frac{\lambda}{2m}\sum_{j=1}^{n_x}\left |w_j \right |

where \lambda is called the regularization parameter.

For a network, J(w^{[1]},b^{[1]},\cdots ,w^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}\left \| w^{[l]} \right \|^2_F, where \left \| \cdot \right \|^2_F is the squared Frobenius norm (the per-example loss L is already the cross-entropy, so the data term is not negated).

Deriving further, the gradient and the update become:

dW^{[l]}=({from\, backprop})+ \frac{\lambda}{m}W^{[l]}

W^{[l]}=W^{[l]}-\alpha dW^{[l]}=(1-\frac{\alpha\lambda}{m})W^{[l]}-\alpha(from \, backprop)

Since 1-\frac{\alpha\lambda}{m}<1, each update shrinks the weights slightly; this is why L2 regularization is also called weight decay.

Intuition: when \lambda is large, \left \| w \right \| must be as small as possible (close to 0) to keep the loss small. This suppresses the influence of many neurons (small weights), making the network effectively simpler and less prone to overfitting.
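A quick numerical check of the weight-decay view above; this is only a sketch with made-up shapes, where dW_backprop stands for the gradient of the unregularized cost:

import numpy as np

np.random.seed(0)
m, alpha, lambd = 64, 0.1, 0.7
W = np.random.randn(3, 4)
dW_backprop = np.random.randn(3, 4)          # gradient of the unregularized cost w.r.t. W

# Update with the L2 term folded into the gradient
W_reg = W - alpha * (dW_backprop + (lambd / m) * W)

# Equivalent "weight decay" form: shrink W first, then take the usual step
W_decay = (1 - alpha * lambd / m) * W - alpha * dW_backprop

print(np.allclose(W_reg, W_decay))           # True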

Adding the regularization term to the cost function:

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    
    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (1.0/m) * (lambd/2.0)*(np.sum(np.square(W1))+np.sum(np.square(W2))+np.sum(np.square(W3)))
    ### END CODE HERE ###
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost

Adding the regularization term in back-propagation:

# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + 1.0 * lambd/m* W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + 1.0 * lambd/m* W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + 1.0 * lambd/m* W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

 

2. Other regularization methods:

(a) Dropout

Intuition: with dropout, any input feature may be randomly zeroed out, so the network cannot rely too heavily on any single feature and will not assign it an excessively large weight. Propagated through the layers, dropout therefore has a weight-shrinking effect similar to the regularization above.

In practice, use little or no dropout (keep_prob = 1) for layers with few neurons, and a smaller keep_prob for layers with many neurons. Note that dropout makes the cost J no longer well defined, so the training curve is hard to plot; a common approach is to first turn dropout off, verify that the cost decreases monotonically, and then turn it back on.

Implementation:

Unlike L2 regularization, dropout is already applied in the forward pass: neurons are randomly kept or shut down. Remember to divide A by keep_prob (inverted dropout) so that the expected value of the activations is preserved.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """
    
    np.random.seed(1)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    D1 = np.random.rand(A1.shape[0],A1.shape[1])      # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = (D1 < keep_prob)                             # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
    A1 = A1/ keep_prob                                # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])     # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = (D2 < keep_prob)                             # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                      # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

In the backward pass, shut down the same neurons as in the forward pass and rescale by keep_prob:

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = 1.0* dA2 /  keep_prob      # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1          # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = 1.0* dA1 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

Results: (comparison plots omitted)

In practice, choose the regularization method that best suits the data.

 

(b) Data augmentation

(c) Early stopping: stop training before the dev error starts to rise; the drawback is that the cost may not have been fully minimized. A minimal sketch follows.
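A minimal early-stopping sketch. The helpers train_one_epoch, evaluate_dev_cost, get_weights and set_weights are hypothetical placeholders, not part of the course code:

def train_with_early_stopping(model, patience=5, max_epochs=200):
    """Stop once the dev cost has not improved for `patience` consecutive epochs."""
    best_cost, best_weights, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                       # hypothetical helper
        dev_cost = evaluate_dev_cost(model)          # hypothetical helper
        if dev_cost < best_cost:
            best_cost, best_weights, wait = dev_cost, get_weights(model), 0
        else:
            wait += 1
            if wait >= patience:                     # dev cost stopped improving
                break
    set_weights(model, best_weights)                 # roll back to the best dev-cost weights
    return model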

 

Q4. Normalizing the inputs

Normalize the features of the dataset:

Mean: \mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}

Variance (element-wise, of the mean-centered data): \sigma^2=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)^2

Zero-center: x:=x-\mu

Scale: x:=x/\sigma^2 (this is the course's notation; dividing by the standard deviation \sigma is the more common convention).

Intuitively (contour figure omitted), normalization makes the cost surface more symmetric, so gradient descent takes a more direct path and training speeds up.
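A sketch of the normalization above, following the course convention that examples are stored in the columns of X; the training-set statistics must be reused on the dev/test sets (the function name is made up for illustration):

import numpy as np

def normalize_inputs(X_train, X_test):
    """Normalize features; X_* has shape (n_features, m), examples in columns."""
    mu = np.mean(X_train, axis=1, keepdims=True)            # per-feature mean
    X_train = X_train - mu
    sigma2 = np.mean(X_train ** 2, axis=1, keepdims=True)   # per-feature variance of the centered data
    X_train = X_train / sigma2
    X_test = (X_test - mu) / sigma2                         # reuse the training statistics
    return X_train, X_test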

 

Q5. Vanishing and exploding gradients

1. What are vanishing and exploding gradients?

Example (illustration omitted): in a deep network, the gradient that reaches an early layer is a product of one factor per layer, each of roughly the form w^{[l]}\sigma'(z^{[l]}).

First, the derivative of the sigmoid function is at most 1/4.

If the weights are small (\left | w \right |<1), the gradients of the earlier layers shrink with every additional factor in the product, so those layers learn more and more slowly; this is the vanishing-gradient phenomenon. Conversely, when \left | w \right |\gg 1, the gradients of the earlier layers grow rapidly, giving exploding gradients.

In other words, when the gradients grow or shrink exponentially with depth, we speak of exploding or vanishing gradients, respectively.
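A toy numerical illustration of this exponential effect (plain multiplication, not an actual network):

depth = 50
print(0.9 ** depth)   # ~5.2e-3  -- a per-layer factor slightly below 1 makes the signal vanish
print(1.1 ** depth)   # ~117.4   -- a per-layer factor slightly above 1 makes it explode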

 

2. Mitigating vanishing and exploding gradients with careful initialization

Xavier initialization: set Var(w_i)= \frac{1}{n}, where n is the number of units feeding into the layer. The core idea is to keep the variance of a layer's outputs comparable to that of its inputs, so the activations neither collapse toward 0 nor blow up.

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(1. / n)   # n: number of units in the previous layer

He initialization: for ReLU networks, the key idea is that roughly half of the neurons are active and the other half output 0, so to keep the variance unchanged, the Xavier variance is doubled (equivalently, n is halved):

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(2. / n)

When the activation is tanh, use Var(w_i)= \frac{1}{n} (i.e. Xavier initialization).

A complete He initialization:

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers
     
    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l],layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l],1))
        ### END CODE HERE ###
        
    return parameters
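For example, with the layer sizes from the earlier dropout network (2 inputs, hidden layers of 20 and 3 units, 1 output):

parameters = initialize_parameters_he([2, 20, 3, 1])
print(parameters["W1"].shape)   # (20, 2)
print(parameters["b1"].shape)   # (20, 1)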

Q6. Gradient checking

The goal is to verify that the analytically computed gradient is correct, i.e. that d\theta_{approx}\approx d\theta. Concretely, check:

\frac{\left \| d\theta_{approx}-d\theta \right \|_2}{\left \| d\theta_{approx}\right \|_2+\left \| d\theta \right \|_2}<\epsilon

where

d\theta_{approx}=\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}

1-D (scalar \theta) case:

def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implement the backward propagation presented in Figure 1.
    
    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                    # Step 1
    thetaminus = theta - epsilon                   # Step 2
    J_plus = forward_propagation(x,thetaplus)      # Step 3
    J_minus = forward_propagation(x,thetaminus)     # Step 4
    gradapprox = (J_plus - J_minus)/(2.0*epsilon)                             # Step 5
    ### END CODE HERE ###
    
    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###
    
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                           # Step 2'
    difference = numerator / denominator                              # Step 3'
    ### END CODE HERE ###
    
    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")
    
    return difference

N-dimensional case:

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters. 
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                     # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                               # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary( thetaplus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                     # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
        J_minus[i], _ =  forward_propagation_n(X, Y, vector_to_dictionary( thetaminus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = ( J_plus[i] -  J_minus[i] )/ (2.0 * epsilon)  
        ### END CODE HERE ###
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator =  np.linalg.norm(grad - gradapprox)                                          # Step 1'
    denominator =  np.linalg.norm(grad )+ np.linalg.norm(gradapprox)                                         # Step 2'
    difference = numerator / denominator                                         # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference

Note: gradient checking should only be used for debugging (it is too slow to run during training), and it does not work together with dropout; turn dropout off (keep_prob = 1) while checking gradients.

 
