[Deep Learning Basics 4] Optimization and Hyperparameter Tuning of Deep Neural Networks (1)

     Please credit the source when reposting. Thanks.

     This post is organized from Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization. As a record of how deep networks are optimized, the three weeks of material are summarized as key points and illustrated with examples.

Q1. Data setup (how should the dataset be split?)

     Conventionally, the data is split into three sets: a training set (for training), a dev/validation set (for cross-validation, used to select the best model), and a test set (for an unbiased estimate of the final model's performance).

     In the small-data era, typical splits were 70% (train) / 30% (test), or 60% (train) / 20% (dev) / 20% (test). With large datasets the proportions change: with roughly one million examples, 98% (train) / 1% (dev) / 1% (test) is common, and with far more data the dev and test fractions can shrink even further (e.g. 99.5% / 0.25% / 0.25%).

     Note that the dev set and the test set should come from the same distribution, so that dev-set performance is a reliable guide to test performance.
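A minimal sketch of such a split (assuming the examples are stored as columns of a NumPy array X with labels Y, matching the shape convention used in the code later in this post; the helper name split_dataset is mine):

import numpy as np

def split_dataset(X, Y, train_frac=0.98, dev_frac=0.01, seed=0):
    """Shuffle and split examples (columns of X, shape (n_x, m)) into train/dev/test sets."""
    m = X.shape[1]
    idx = np.random.default_rng(seed).permutation(m)
    n_train, n_dev = int(m * train_frac), int(m * dev_frac)
    tr, dv, te = idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]
    return (X[:, tr], Y[:, tr]), (X[:, dv], Y[:, dv]), (X[:, te], Y[:, te])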

 

Q2. Bias and variance

high bias: the model performs poorly even on the training data; it is too simple and is underfitting.

high variance: the model fits the training data well, but it is too complex and relies too heavily on the training data; it is overfitting and does not generalize to the dev set.

Corresponding symptoms and remedies (see the diagnostic sketch after the list below):

  • train acc low, dev acc low (high bias) ------- (1) use a bigger network; (2) train longer; (3) try a different network architecture
  • train acc high, dev acc low (high variance) ------- (1) get more data; (2) add regularization; (3) try a different network architecture
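The table above can be turned into a rough diagnostic helper. This is only an illustrative sketch; the accuracy thresholds are arbitrary assumptions, not values from the course:

def diagnose(train_acc, dev_acc, target_acc=0.95, gap=0.05):
    """Crude bias/variance diagnosis from train/dev accuracy (thresholds are illustrative)."""
    if train_acc < target_acc:
        print("High bias: try a bigger network, longer training, or a different architecture.")
    if train_acc - dev_acc > gap:
        print("High variance: try more data, regularization, or a different architecture.")

diagnose(train_acc=0.99, dev_acc=0.90)   # prints the high-variance suggestion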

 

Q3. Regularization

1. Definition: add a regularization term to the cost function to constrain the complexity of the model.

L2 regularization: \frac{\lambda}{2m}\left \|w \right \|^2_2=\frac{\lambda}{2m}\sum_{j=1}^{n_x}(w_j)^2=\frac{\lambda}{2m}w^Tw

L1 regularization: \frac{\lambda}{2m}\left \|w \right \|_1=\frac{\lambda}{2m}\sum_{j=1}^{n_x}\left |w_j \right |

Here \lambda is called the regularization parameter.

For a network:  J(W^{[1]},b^{[1]},\cdots ,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}\left \| W^{[l]} \right \|^2_F   where \left \| \cdot \right \|^2_F is the squared Frobenius norm (the sum of the squares of all entries of the matrix).

Taking the derivation one step further, the gradient and the update rule become:

dW^{[l]}=(\text{from backprop})+\frac{\lambda}{m}W^{[l]}

W^{[l]}:=W^{[l]}-\alpha\, dW^{[l]}=\left(1-\frac{\alpha\lambda}{m}\right)W^{[l]}-\alpha\,(\text{from backprop})

Since 1-\frac{\alpha\lambda}{m}<1, every update multiplies the weights by a factor slightly less than 1; this is why L2 regularization is also called weight decay.

Intuition: when \lambda is large, \left \| w \right \| must be kept small (pushed toward 0) for the cost to stay small. This suppresses the influence of many hidden units (their weights shrink), so the network behaves like a simpler one and is less likely to overfit. A tiny numerical check of the decay factor follows.
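The values of alpha, lambd and m below are arbitrary, and the backprop term is set to zero to isolate the shrinkage effect:

import numpy as np

alpha, lambd, m = 0.1, 0.7, 1000
decay = 1 - alpha * lambd / m            # the (1 - alpha*lambda/m) factor, slightly below 1
W = np.array([[1.5, -2.0], [0.3, 0.8]])
grad_from_backprop = np.zeros_like(W)    # ignore the data-dependent gradient term here
W_new = decay * W - alpha * grad_from_backprop
print(decay)            # 0.99993
print(W_new / W)        # every weight shrank by the same factor 0.99993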

Adding the regularization term to the cost function:

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    
    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (1.0/m) * (lambd/2.0)*(np.sum(np.square(W1))+np.sum(np.square(W2))+np.sum(np.square(W3)))
    ### END CODE HERE ###
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost

Adding the regularization term in backpropagation:

# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + 1.0 * lambd/m* W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + 1.0 * lambd/m* W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + 1.0 * lambd/m* W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

 

2. Other regularization methods:

(a) dropout (randomly deactivating units)

Intuition: with dropout, any input feature (or hidden unit) may be randomly zeroed out, so the network cannot rely too heavily on any single input and will not assign it a large weight. Propagated through training, dropout therefore produces a weight-shrinking effect similar to the regularization described above.

In practice, use little or no dropout (keep_prob = 1) for layers with few units, and a smaller keep_prob for layers with many units. However, dropout makes the cost J no longer well defined, so it is hard to plot; the usual practice is to first turn dropout off, confirm that the cost decreases, and then turn it back on.

Implementation:

Unlike L2 regularization, dropout is already applied in the forward pass: each neuron is randomly kept or shut down. Remember to divide A by keep_prob (inverted dropout) so that the expected value of the activations stays the same.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """
    
    np.random.seed(1)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    D1 = np.random.rand(A1.shape[0],A1.shape[1])      # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = (D1 < keep_prob)                             # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
    A1 = A1/ keep_prob                                # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0],A2.shape[1])      # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = (D2 < keep_prob)                             # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                      # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

In the backward pass, shut down the same neurons as in the forward pass and rescale to the corresponding expectation:

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = 1.0* dA2 /  keep_prob      # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1          # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = 1.0* dA1 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients


In practice, choose the regularization method that best suits your data.

 

(b) data augmentation: enlarge the training set with transformed copies of existing examples (e.g. flipped or cropped images).
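A minimal sketch for image data (this assumes images are stored with shape (m, H, W, C); the horizontal flip is just one example of a label-preserving transform):

import numpy as np

def augment_with_flips(images, labels):
    """Double the training set by appending horizontally flipped copies of every image."""
    flipped = images[:, :, ::-1, :]                    # reverse the width axis
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))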

(c) early stopping: stop training before the dev error starts to rise. The drawback is that the cost J may not be fully minimized, so the result can be suboptimal.
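A sketch of the early-stopping loop; train_one_epoch and evaluate_dev_cost are hypothetical callbacks standing in for a real training step and dev-set evaluation:

def train_with_early_stopping(train_one_epoch, evaluate_dev_cost, max_epochs=1000, patience=10):
    """Stop once the dev cost has not improved for `patience` consecutive epochs."""
    best_cost, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        dev_cost = evaluate_dev_cost()
        if dev_cost < best_cost:
            best_cost, best_epoch = dev_cost, epoch    # in practice, also snapshot the parameters here
        elif epoch - best_epoch >= patience:
            print("Stopping early at epoch", epoch)
            break
    return best_epoch, best_cost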

 

Q4. Normalizing the inputs

Normalize each input feature of the dataset:

Mean: \mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}

Zero-center: x:=x-\mu

Variance (element-wise, computed after zero-centering): \sigma^2=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)})^2

Scale: x:=x/\sigma

Intuitively, normalization makes the contours of the cost function more symmetric (closer to circles than to elongated ellipses), so gradient descent can take larger steps and training converges faster. A minimal sketch follows.
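This sketch keeps features as rows and examples as columns, as in the code above; note that the mean and standard deviation computed on the training set should be reused to normalize the dev and test sets:

import numpy as np

def normalize_inputs(X_train, X_test):
    """Zero-center and scale each feature using training-set statistics only."""
    mu = np.mean(X_train, axis=1, keepdims=True)     # per-feature mean, shape (n_x, 1)
    sigma = np.std(X_train, axis=1, keepdims=True)   # per-feature standard deviation
    sigma[sigma == 0] = 1.0                          # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma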

 

Q5. Vanishing and exploding gradients

1. What are vanishing and exploding gradients?

Example: in a very deep network, the gradient that reaches an early layer is (roughly) a product of one factor per later layer, of the form W^{[k]}g'(z^{[k]}).

For the sigmoid activation, the derivative is at most 1/4. If the weights are small, \left | w \right |<1, each factor in the product is less than 1, so the gradients of the early layers shrink with depth and those layers learn very slowly; this is the vanishing gradient problem. Conversely, when the weights are large, \left | w \right |\gg 1, the factors compound and the gradients of the early layers grow very quickly, causing exploding gradients.

In short, gradients that decrease or increase exponentially with the depth of the network are called vanishing and exploding gradients, respectively. The toy example below makes the effect concrete.
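A toy numerical illustration of this exponential effect (a 50-layer "network" whose layers simply rescale the signal by a constant factor; the factors 0.5 and 1.5 are arbitrary choices):

import numpy as np

depth, n = 50, 4
for scale in (0.5, 1.5):
    W = scale * np.eye(n)                  # every layer multiplies the signal by `scale`
    grad = np.ones((n, 1))
    for _ in range(depth):
        grad = W.T @ grad
    print(scale, np.linalg.norm(grad))     # 0.5 -> ~1.8e-15 (vanishes), 1.5 -> ~1.3e9 (explodes)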

 

2. Using weight initialization to mitigate vanishing and exploding gradients

Xavier initialization: set Var(w_i)= \frac{1}{n}, where n is the number of units in the previous layer (n^{[l-1]}). The core idea is to keep the variance of each layer's output the same as that of its input, which prevents the activations from all collapsing toward 0.

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(1./n)

He initialization: for ReLU networks, the key idea is that roughly half of the units are active at any time while the other half output 0, so to keep the variance unchanged the Xavier variance is multiplied by 2 (Var(w_i)= \frac{2}{n}).

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(2./n)

When the activation is tanh, Var(w_i)= \frac{1}{n} (i.e. Xavier initialization) is the usual choice.

Below is the complete He initialization:

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers
     
    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l],layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l],1))
        ### END CODE HERE ###
        
    return parameters
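A quick usage check (assuming NumPy is imported as np, as in the earlier snippets; the layer sizes match the shapes listed in the dropout example above):

parameters = initialize_parameters_he([2, 20, 3, 1])
print(parameters["W1"].shape, parameters["b1"].shape)   # (20, 2) (20, 1)
print(np.std(parameters["W1"]))                         # roughly sqrt(2/2) = 1.0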

Q6. Gradient checking

    The core question is whether the computed gradient is correct, i.e. whether d\theta_{approx}\approx d\theta. The concrete criterion is:

\frac{\left \| d\theta_{approx}-d\theta \right \|_2}{\left \| d\theta_{approx}\right \|_2+\left \| d\theta \right \|_2}<\epsilon

where

d\theta_{approx}=\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}

One-dimensional (scalar) case:

def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implement the backward propagation presented in Figure 1.
    
    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                    # Step 1
    thetaminus = theta - epsilon                   # Step 2
    J_plus = forward_propagation(x,thetaplus)      # Step 3
    J_minus = forward_propagation(x,thetaminus)     # Step 4
    gradapprox = (J_plus - J_minus)/(2.0*epsilon)                             # Step 5
    ### END CODE HERE ###
    
    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###
    
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                           # Step 2'
    difference = numerator / denominator                              # Step 3'
    ### END CODE HERE ###
    
    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")
    
    return difference
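To try the function, a toy cost J(θ) = θx and its analytic gradient can serve as stand-ins for forward_propagation and backward_propagation (these two helpers are illustrative assumptions, not code from the original post):

def forward_propagation(x, theta):
    """Toy cost J(theta) = theta * x."""
    return theta * x

def backward_propagation(x, theta):
    """Analytic gradient dJ/dtheta = x for the toy cost above."""
    return x

difference = gradient_check(x=2.0, theta=4.0)   # prints "The gradient is correct!"
print(difference)                               # a value far below 1e-7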

n-dimensional case:

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters. 
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                     # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                               # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary( thetaplus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                     # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
        J_minus[i], _ =  forward_propagation_n(X, Y, vector_to_dictionary( thetaminus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = ( J_plus[i] -  J_minus[i] )/ (2.0 * epsilon)  
        ### END CODE HERE ###
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator =  np.linalg.norm(grad - gradapprox)                                          # Step 1'
    denominator =  np.linalg.norm(grad )+ np.linalg.norm(gradapprox)                                         # Step 2'
    difference = numerator / denominator                                         # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference

Note: gradient checking is only used for debugging (it is far too slow to run during training), and it should not be used together with dropout.

 
