[Deep Learning Basics 4] Optimization and Hyperparameter Tuning of Deep Neural Networks (1)

     Please credit the source when reposting. Thanks.

     This post is organized from Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization. As a record of how deep networks are optimized, the three weeks of material are summarized as key points and illustrated with examples.

Q1. Data setup (how should the dataset be split?)

     Conventionally, the data is split into three sets: a training set (for training), a dev/validation set (for cross-validation, used to select the best model), and a test set (for an unbiased estimate of the final model's performance).

     In the small-data era, typical splits were 70% (train) / 30% (test), or 60% (train) / 20% (dev) / 20% (test). With large datasets the proportions change: with roughly one million examples, 98% (train) / 1% (dev) / 1% (test) is common, and with far more data the dev and test fractions can shrink even further (e.g. 99.5% / 0.25% / 0.25%).

     Note that the dev set and the test set should come from the same distribution, so that dev-set performance is a reliable guide to test performance.
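A minimal sketch of such a split (assuming the examples are stored as columns of a NumPy array X with labels Y, matching the shape convention used in the code later in this post; the helper name split_dataset is mine):

import numpy as np

def split_dataset(X, Y, train_frac=0.98, dev_frac=0.01, seed=0):
    """Shuffle and split examples (columns of X, shape (n_x, m)) into train/dev/test sets."""
    m = X.shape[1]
    idx = np.random.default_rng(seed).permutation(m)
    n_train, n_dev = int(m * train_frac), int(m * dev_frac)
    tr, dv, te = idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]
    return (X[:, tr], Y[:, tr]), (X[:, dv], Y[:, dv]), (X[:, te], Y[:, te])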

 

Q2. Bias and variance

high bias: the model performs poorly even on the training data; it is too simple and is underfitting.

high variance: the model fits the training data well, but it is too complex and relies too heavily on the training data; it is overfitting and does not generalize to the dev set.

Corresponding symptoms and remedies (see the diagnostic sketch after the list below):

  • train acc low, dev acc low (high bias) ------- (1) use a bigger network; (2) train longer; (3) try a different network architecture
  • train acc high, dev acc low (high variance) ------- (1) get more data; (2) add regularization; (3) try a different network architecture
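The table above can be turned into a rough diagnostic helper. This is only an illustrative sketch; the accuracy thresholds are arbitrary assumptions, not values from the course:

def diagnose(train_acc, dev_acc, target_acc=0.95, gap=0.05):
    """Crude bias/variance diagnosis from train/dev accuracy (thresholds are illustrative)."""
    if train_acc < target_acc:
        print("High bias: try a bigger network, longer training, or a different architecture.")
    if train_acc - dev_acc > gap:
        print("High variance: try more data, regularization, or a different architecture.")

diagnose(train_acc=0.99, dev_acc=0.90)   # prints the high-variance suggestion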

 

Q3. Regularization

1. Definition: add a regularization term to the cost function to constrain the complexity of the model.

L2 regularization: \frac{\lambda}{2m}\left \|w \right \|^2_2=\frac{\lambda}{2m}\sum_{j=1}^{n_x}(w_j)^2=\frac{\lambda}{2m}w^Tw

L1 regularization: \frac{\lambda}{2m}\left \|w \right \|_1=\frac{\lambda}{2m}\sum_{j=1}^{n_x}\left |w_j \right |

Here \lambda is called the regularization parameter.

For a network:  J(W^{[1]},b^{[1]},\cdots ,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^{m}L(\widehat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}\left \| W^{[l]} \right \|^2_F   where \left \| \cdot \right \|^2_F is the squared Frobenius norm (the sum of the squares of all entries of the matrix).

Taking the derivation one step further, the gradient and the update rule become:

dW^{[l]}=(\text{from backprop})+\frac{\lambda}{m}W^{[l]}

W^{[l]}:=W^{[l]}-\alpha\, dW^{[l]}=\left(1-\frac{\alpha\lambda}{m}\right)W^{[l]}-\alpha\,(\text{from backprop})

Since 1-\frac{\alpha\lambda}{m}<1, every update multiplies the weights by a factor slightly less than 1; this is why L2 regularization is also called weight decay.

Intuition: when \lambda is large, \left \| w \right \| must be kept small (pushed toward 0) for the cost to stay small. This suppresses the influence of many hidden units (their weights shrink), so the network behaves like a simpler one and is less likely to overfit. A tiny numerical check of the decay factor follows.
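The values of alpha, lambd and m below are arbitrary, and the backprop term is set to zero to isolate the shrinkage effect:

import numpy as np

alpha, lambd, m = 0.1, 0.7, 1000
decay = 1 - alpha * lambd / m            # the (1 - alpha*lambda/m) factor, slightly below 1
W = np.array([[1.5, -2.0], [0.3, 0.8]])
grad_from_backprop = np.zeros_like(W)    # ignore the data-dependent gradient term here
W_new = decay * W - alpha * grad_from_backprop
print(decay)            # 0.99993
print(W_new / W)        # every weight shrank by the same factor 0.99993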

Adding the regularization term to the cost function:

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.
    
    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model
    
    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (1.0/m) * (lambd/2.0)*(np.sum(np.square(W1))+np.sum(np.square(W2))+np.sum(np.square(W3)))
    ### END CODE HERE ###
    
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost

Adding the regularization term in backpropagation:

# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.
    
    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + 1.0 * lambd/m* W3
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + 1.0 * lambd/m* W2
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + 1.0 * lambd/m* W1
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

 

2. Other regularization methods:

(a) dropout (randomly deactivating units)

Intuition: with dropout, any input feature (or hidden unit) may be randomly zeroed out, so the network cannot rely too heavily on any single input and will not assign it a large weight. Propagated through training, dropout therefore produces a weight-shrinking effect similar to the regularization described above.

In practice, use little or no dropout (keep_prob = 1) for layers with few units, and a smaller keep_prob for layers with many units. However, dropout makes the cost J no longer well defined, so it is hard to plot; the usual practice is to first turn dropout off, confirm that the cost decreases, and then turn it back on.

Implementation:

Unlike L2 regularization, dropout is already applied in the forward pass: each neuron is randomly kept or shut down. Remember to divide A by keep_prob (inverted dropout) so that the expected value of the activations stays the same.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """
    
    np.random.seed(1)
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)         # Steps 1-4 below correspond to the Steps 1-4 described above. 
    D1 = np.random.rand(A1.shape[0],A1.shape[1])      # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = (D1 < keep_prob)                             # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
    A1 = A1/ keep_prob                                # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0],A2.shape[1])      # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = (D2 < keep_prob)                             # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                      # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache

In the backward pass, shut down the same neurons as in the forward pass and rescale to the corresponding expectation:

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.
    
    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = 1.0* dA2 /  keep_prob      # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1          # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = 1.0* dA1 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients


In practice, choose the regularization method that best suits your data.

 

(b) data augmentation: enlarge the training set with transformed copies of existing examples (e.g. flipped or cropped images).
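A minimal sketch for image data (this assumes images are stored with shape (m, H, W, C); the horizontal flip is just one example of a label-preserving transform):

import numpy as np

def augment_with_flips(images, labels):
    """Double the training set by appending horizontally flipped copies of every image."""
    flipped = images[:, :, ::-1, :]                    # reverse the width axis
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))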

(c) early stopping: stop training before the dev error starts to rise. The drawback is that the cost J may not be fully minimized, so the result can be suboptimal.
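A sketch of the early-stopping loop; train_one_epoch and evaluate_dev_cost are hypothetical callbacks standing in for a real training step and dev-set evaluation:

def train_with_early_stopping(train_one_epoch, evaluate_dev_cost, max_epochs=1000, patience=10):
    """Stop once the dev cost has not improved for `patience` consecutive epochs."""
    best_cost, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        dev_cost = evaluate_dev_cost()
        if dev_cost < best_cost:
            best_cost, best_epoch = dev_cost, epoch    # in practice, also snapshot the parameters here
        elif epoch - best_epoch >= patience:
            print("Stopping early at epoch", epoch)
            break
    return best_epoch, best_cost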

 

Q4. Normalizing the inputs

Normalize each input feature of the dataset:

Mean: \mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}

Zero-center: x:=x-\mu

Variance (element-wise, computed after zero-centering): \sigma^2=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)})^2

Scale: x:=x/\sigma

Intuitively, normalization makes the contours of the cost function more symmetric (closer to circles than to elongated ellipses), so gradient descent can take larger steps and training converges faster. A minimal sketch follows.
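This sketch keeps features as rows and examples as columns, as in the code above; note that the mean and standard deviation computed on the training set should be reused to normalize the dev and test sets:

import numpy as np

def normalize_inputs(X_train, X_test):
    """Zero-center and scale each feature using training-set statistics only."""
    mu = np.mean(X_train, axis=1, keepdims=True)     # per-feature mean, shape (n_x, 1)
    sigma = np.std(X_train, axis=1, keepdims=True)   # per-feature standard deviation
    sigma[sigma == 0] = 1.0                          # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma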

 

Q5. Vanishing and exploding gradients

1. What are vanishing and exploding gradients?

Example: in a very deep network, the gradient that reaches an early layer is (roughly) a product of one factor per later layer, of the form W^{[k]}g'(z^{[k]}).

For the sigmoid activation, the derivative is at most 1/4. If the weights are small, \left | w \right |<1, each factor in the product is less than 1, so the gradients of the early layers shrink with depth and those layers learn very slowly; this is the vanishing gradient problem. Conversely, when the weights are large, \left | w \right |\gg 1, the factors compound and the gradients of the early layers grow very quickly, causing exploding gradients.

In short, gradients that decrease or increase exponentially with the depth of the network are called vanishing and exploding gradients, respectively. The toy example below makes the effect concrete.
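A toy numerical illustration of this exponential effect (a 50-layer "network" whose layers simply rescale the signal by a constant factor; the factors 0.5 and 1.5 are arbitrary choices):

import numpy as np

depth, n = 50, 4
for scale in (0.5, 1.5):
    W = scale * np.eye(n)                  # every layer multiplies the signal by `scale`
    grad = np.ones((n, 1))
    for _ in range(depth):
        grad = W.T @ grad
    print(scale, np.linalg.norm(grad))     # 0.5 -> ~1.8e-15 (vanishes), 1.5 -> ~1.3e9 (explodes)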

 

2. Using weight initialization to mitigate vanishing and exploding gradients

Xavier initialization: set Var(w_i)= \frac{1}{n}, where n is the number of units in the previous layer (n^{[l-1]}). The core idea is to keep the variance of each layer's output the same as that of its input, which prevents the activations from all collapsing toward 0.

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(1./n)

He initialization: for ReLU networks, the key idea is that roughly half of the units are active at any time while the other half output 0, so to keep the variance unchanged the Xavier variance is multiplied by 2 (Var(w_i)= \frac{2}{n}).

WL = np.random.randn(WL.shape[0], WL.shape[1]) * np.sqrt(2./n)

When the activation is tanh, Var(w_i)= \frac{1}{n} (i.e. Xavier initialization) is the usual choice.

Below is the complete He initialization:

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers
     
    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l],layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l],1))
        ### END CODE HERE ###
        
    return parameters
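A quick usage check (assuming NumPy is imported as np, as in the earlier snippets; the layer sizes match the shapes listed in the dropout example above):

parameters = initialize_parameters_he([2, 20, 3, 1])
print(parameters["W1"].shape, parameters["b1"].shape)   # (20, 2) (20, 1)
print(np.std(parameters["W1"]))                         # roughly sqrt(2/2) = 1.0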

Q6. Gradient checking

    The core question is whether the computed gradient is correct, i.e. whether d\theta_{approx}\approx d\theta. The concrete criterion is:

\frac{\left \| d\theta_{approx}-d\theta \right \|_2}{\left \| d\theta_{approx}\right \|_2+\left \| d\theta \right \|_2}<\epsilon

where

d\theta_{approx}=\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}

One-dimensional (scalar) case:

def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implement the backward propagation presented in Figure 1.
    
    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                    # Step 1
    thetaminus = theta - epsilon                   # Step 2
    J_plus = forward_propagation(x,thetaplus)      # Step 3
    J_minus = forward_propagation(x,thetaminus)     # Step 4
    gradapprox = (J_plus - J_minus)/(2.0*epsilon)                             # Step 5
    ### END CODE HERE ###
    
    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###
    
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                               # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)                           # Step 2'
    difference = numerator / denominator                              # Step 3'
    ### END CODE HERE ###
    
    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")
    
    return difference
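To try the function, a toy cost J(θ) = θx and its analytic gradient can serve as stand-ins for forward_propagation and backward_propagation (these two helpers are illustrative assumptions, not code from the original post):

def forward_propagation(x, theta):
    """Toy cost J(theta) = theta * x."""
    return theta * x

def backward_propagation(x, theta):
    """Analytic gradient dJ/dtheta = x for the toy cost above."""
    return x

difference = gradient_check(x=2.0, theta=4.0)   # prints "The gradient is correct!"
print(difference)                               # a value far below 1e-7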

n-dimensional case:

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n
    
    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters. 
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)
    
    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # Compute gradapprox
    for i in range(num_parameters):
        
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function you have to outputs two parameters but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                     # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                               # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary( thetaplus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                     # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                               # Step 2        
        J_minus[i], _ =  forward_propagation_n(X, Y, vector_to_dictionary( thetaminus))                                 # Step 3
        ### END CODE HERE ###
        
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = ( J_plus[i] -  J_minus[i] )/ (2.0 * epsilon)  
        ### END CODE HERE ###
    
    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator =  np.linalg.norm(grad - gradapprox)                                          # Step 1'
    denominator =  np.linalg.norm(grad )+ np.linalg.norm(gradapprox)                                         # Step 2'
    difference = numerator / denominator                                         # Step 3'
    ### END CODE HERE ###

    if difference > 2e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
    
    return difference

Note: gradient checking is only used for debugging (it is far too slow to run during training), and it should not be used together with dropout.

 
