Gradient Checking

        Sometimes, after implementing backward propagation, we have no way of knowing whether our implementation is actually correct (this post is only for networks you write by hand; if you simply use an existing framework, you don't need to worry about this). So we usually run gradient checking to verify the backprop implementation. Gradient checking simply means applying the definition of the derivative ourselves to compute the gradients of the W's and b's numerically, then comparing them with the gradients produced by backprop; if the difference is within a very small range, we can consider our backprop implementation correct.
        First, let's recall the definition of the derivative:

$$f'(x_0) = \lim_{\Delta x \to 0} \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x}$$

When applying this to gradient checking for a neural network, we cannot actually take Δx → 0; we can only use a fairly small number. So, to make the result more accurate, we rewrite the formula above slightly as a two-sided difference:

$$f'(x_0) \approx \frac{f(x_0 + \varepsilon) - f(x_0 - \varepsilon)}{2\varepsilon}$$

Setting ε = 1e-7 is usually sufficient.
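As a quick sanity check of the two-sided formula itself (this snippet is not from the post's code; the function f and the point x are just illustrative), here is a minimal sketch comparing the one-sided and two-sided approximations on f(x) = x³:

import numpy as np

def f(x):
    return x ** 3          # true derivative: 3 * x**2

x, eps = 2.0, 1e-7
true_grad = 3 * x ** 2                               # 12.0
one_sided = (f(x + eps) - f(x)) / eps                # truncation error ~ O(eps)
two_sided = (f(x + eps) - f(x - eps)) / (2 * eps)    # truncation error ~ O(eps**2)

print(abs(one_sided - true_grad))   # noticeably larger error
print(abs(two_sided - true_grad))   # much closer to zero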
For a neural network, the gradient-checking steps are usually:

  1. Reshape and concatenate W[1], b[1], ..., W[L], b[L] into a single vector θ.
  2. Likewise, reshape and concatenate dW[1], db[1], ..., dW[L], db[L] into a vector dθ.
  3. Apply the two-sided definition of the derivative to every component i of θ:
     $$d\theta_{approx}[i] = \frac{J(\theta_1, \ldots, \theta_i + \varepsilon, \ldots) - J(\theta_1, \ldots, \theta_i - \varepsilon, \ldots)}{2\varepsilon}$$
  4. Compare whether dθ_approx and dθ are approximately equal, mainly by computing the normalized Euclidean distance between the two vectors:
     $$difference = \frac{\lVert d\theta_{approx} - d\theta \rVert_2}{\lVert d\theta_{approx} \rVert_2 + \lVert d\theta \rVert_2}$$

The threshold is usually set on the order of 1e-7 (the code below uses 2e-7). If difference is below the threshold, the backprop implementation is considered correct.

Now let's look at a code implementation of these steps.

1. Convert W[1], b[1], ..., W[L], b[L] into the vector θ

import numpy as np

# Convert the parameter dictionary into a single column vector
def dictionary_to_vector(parameters):
    """
    Roll all our parameters dictionary into a single vector satisfying our specific required shape.
    """
    # Note: this relies on the parameters dict being stored in the order W1, b1, ..., WL, bL
    count = 0
    for key in parameters:
        # flatten the parameter matrix into a column vector
        new_vector = np.reshape(parameters[key], (-1, 1))
        if count == 0:
            # first parameter: start a new vector
            theta = new_vector
        else:
            # append to the vector built so far
            theta = np.concatenate((theta, new_vector), axis=0)
        count = count + 1

    return theta
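As a quick illustration of what dictionary_to_vector produces (the tiny shapes below are made up purely for illustration, not the post's network):

import numpy as np

params = {
    "W1": np.array([[1., 2.], [3., 4.]]),   # shape (2, 2)
    "b1": np.array([[5.], [6.]]),            # shape (2, 1)
}
theta = dictionary_to_vector(params)
print(theta.shape)      # (6, 1)
print(theta.ravel())    # -> [1. 2. 3. 4. 5. 6.]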

2. Convert dW[1], db[1], ..., dW[L], db[L] into the vector dθ
Note: be careful here. The gradients dictionary returned by backprop is stored in the order {dWL, dbL, ..., dW2, db2, dW1, db1}. Since we later compute a Euclidean distance between corresponding elements, the order must be converted to [dW1, db1, ..., dWL, dbL]. I stepped into this trap and spent a long time tracking down the bug.

# Convert the gradients dictionary into a single column vector
def gradients_to_vector(gradients):
    """
    Roll all our gradients dictionary into a single vector satisfying our specific required shape.
    """
    # The gradients are stored in the order {dWL, dbL, ..., dW2, db2, dW1, db1},
    # so rebuild the key list as [dW1, db1, ..., dWL, dbL] so that the elements line up
    # with dictionary_to_vector when we later compute the Euclidean distance.
    L = len(gradients) // 2
    keys = []
    for l in range(L):
        keys.append("dW" + str(l + 1))
        keys.append("db" + str(l + 1))
    count = 0
    for key in keys:
        # flatten the gradient matrix into a column vector
        new_vector = np.reshape(gradients[key], (-1, 1))
        if count == 0:
            # first gradient: start a new vector
            theta = new_vector
        else:
            # append to the vector built so far
            theta = np.concatenate((theta, new_vector), axis=0)
        count = count + 1

    return theta
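To make the ordering issue concrete, here is a tiny sketch (with made-up 1×1 "matrices", not the real network's gradients) showing that gradients_to_vector reorders a backprop-style {dW2, db2, dW1, db1} dictionary into [dW1, db1, dW2, db2]:

import numpy as np

# hypothetical gradients in the order backprop produces them
grads = {
    "dW2": np.array([[4.0]]), "db2": np.array([[3.0]]),
    "dW1": np.array([[2.0]]), "db1": np.array([[1.0]]),
}
print(gradients_to_vector(grads).ravel())   # -> [2. 1. 4. 3.], i.e. [dW1, db1, dW2, db2]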

The code for steps 3 and 4 is as follows:

def gradient_check(parameters, gradients, X, Y, layer_dims, epsilon=1e-7):
    """
    Checks whether the gradients computed by backward propagation match the
    numerical gradient of the cost computed by forward propagation.

    Arguments:
    parameters -- python dictionary containing the parameters "W1", "b1", ..., "WL", "bL"
    gradients -- output of backward propagation, contains gradients of the cost with respect to the parameters
    X -- input data, of shape (input size, number of examples)
    Y -- true "label"
    layer_dims -- the layer dimensions of the network
    epsilon -- tiny shift of the parameters used to compute the two-sided approximated gradient

    Returns:
    difference -- normalized difference between the approximated gradient and the backward propagation gradient
    """

    parameters_vector = dictionary_to_vector(parameters)  # parameter values as a single vector
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_vector.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox: nudge each parameter up and down by epsilon,
    # re-run forward propagation and take the two-sided difference of the cost
    for i in range(num_parameters):
        thetaplus = np.copy(parameters_vector)
        thetaplus[i] = thetaplus[i] + epsilon
        AL, _ = forward_propagation(X, vector_to_dictionary(thetaplus, layer_dims))
        J_plus[i] = compute_cost(AL, Y)

        thetaminus = np.copy(parameters_vector)
        thetaminus[i] = thetaminus[i] - epsilon
        AL, _ = forward_propagation(X, vector_to_dictionary(thetaminus, layer_dims))
        J_minus[i] = compute_cost(AL, Y)

        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)

    # Normalized Euclidean distance between the two gradient vectors
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator

    if difference > 2e-7:
        print(
            "\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print(
            "\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference

Here, every time we recompute the cost via forward propagation, the vector has to be converted back into the parameter matrices. The implementation is as follows:

# Convert a vector back into the parameter dictionary
def vector_to_dictionary(theta, layer_dims):
    """
    Unroll all our parameters dictionary from a single vector satisfying our specific required shape.
    """
    parameters = {}
    L = len(layer_dims)  # the number of layers in the network
    start = 0
    end = 0
    for l in range(1, L):
        # W[l] occupies the next layer_dims[l] * layer_dims[l-1] entries of theta
        end += layer_dims[l] * layer_dims[l - 1]
        parameters["W" + str(l)] = theta[start:end].reshape((layer_dims[l], layer_dims[l - 1]))
        start = end
        # b[l] occupies the next layer_dims[l] entries
        end += layer_dims[l] * 1
        parameters["b" + str(l)] = theta[start:end].reshape((layer_dims[l], 1))
        start = end
    return parameters
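A quick way to convince yourself that these helpers are consistent with each other (the 2-layer layer_dims and random parameters below are just an illustration, not the post's actual network) is a round-trip check:

import numpy as np

layer_dims = [3, 4, 1]   # hypothetical: 3 inputs, one hidden layer of 4, one output unit
params = {
    "W1": np.random.randn(4, 3), "b1": np.zeros((4, 1)),
    "W2": np.random.randn(1, 4), "b2": np.zeros((1, 1)),
}
theta = dictionary_to_vector(params)
restored = vector_to_dictionary(theta, layer_dims)
assert all(np.allclose(params[k], restored[k]) for k in params)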


Let's again use the breast_cancer dataset that ships with sklearn to test whether our backprop implementation is correct, and at the same time whether the gradient-checking code itself works. The test result is:

Your backward propagation works perfectly fine! difference = 5.649104934345307e-11

As we can see, the gap between the gradients from our backprop and the gradients obtained from the definition of the derivative is on the order of 1e-11, so our backprop implementation is correct.
The complete code has been put on GitHub: gradient_checking.py
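For reference, a driver along these lines would look roughly like the sketch below. The helpers initialize_parameters, forward_propagation, backward_propagation and compute_cost are assumed to come from the author's own implementation in the linked repository; their exact names and signatures are assumptions here, as is the illustrative architecture.

from sklearn.datasets import load_breast_cancer

# hypothetical driver; the four helpers below are not defined in this post
data = load_breast_cancer()
X = data.data.T                      # shape (n_features, n_examples)
Y = data.target.reshape(1, -1)       # shape (1, n_examples)

layer_dims = [X.shape[0], 5, 1]      # illustrative network architecture
parameters = initialize_parameters(layer_dims)
AL, caches = forward_propagation(X, parameters)
gradients = backward_propagation(AL, Y, caches)
difference = gradient_check(parameters, gradients, X, Y, layer_dims)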

CS231n also has material on gradient checking. Its definition of the relative error differs slightly from the one Ng uses, but that is not a big deal; it is just a different definition, and you only need to adjust the threshold accordingly. See: CS231n Convolutional Neural Networks for Visual Recognition



References
- Andrew Ng, Coursera, "Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization"


