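On the road to deep learning, I am working through each technique from the ground up. I am a DL beginner and keep a running record of what I read; comments and discussion are welcome.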

Reference: this post follows Andrew Ng's deep learning course on Coursera, an excellent course that I recommend.

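This post assumes you already have a rough grasp of basic deep learning concepts. If you need simpler examples, see Andrew Ng's introductory course: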

http://study.163.com/courses-search?keyword=%E5%90%B4%E6%81%A9%E8%BE%BE#/?ot=5

Please credit the source when reposting; beyond that, do as you like.

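I. Introduction

The previous post introduced some basics of neural networks, but not enough to actually build anything. How do we train a neural network? How exactly is it computed? Can hidden layers be added, and how many are appropriate? This post works through these questions.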

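II. Forward computation in a neural network

We already have a first picture of the network structure from the previous post. Because the layers are fully connected, computing each product one at a time means that training a deep network takes millions of individual operations. Instead we can vectorize: stack all the parameters into matrices and compute with matrix operations. The network from the previous post is reproduced in the figure above.

In that figure, the computation at each hidden-layer neuron has two parts: computing z and computing a.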

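Pay attention to the shapes of the parameter matrices between layers:

Between the input layer and the hidden layer

w[1].shape = (4, 3): 4 is the number of hidden-layer neurons, 3 is the number of input-layer neurons;

b[1].shape = (4, 1): 4 is the number of hidden-layer neurons. Don't worry about the 1: NumPy's broadcasting copies b to a compatible shape for the matrix addition;

Between the hidden layer and the output layer

w[2].shape = (1, 4): 1 is the number of output-layer neurons, 4 is the number of hidden-layer neurons;

b[2].shape = (1, 1): the first 1 is the number of output-layer neurons, and the second 1 can be expanded by broadcasting.

From this we can read off the rule for w. For any two adjacent layers, treat the earlier one as the input and the later one as the output. The weight matrix between them has shape (n_out, n_in) and the bias has shape (n_out, 1), where n_in and n_out are the numbers of neurons in the two layers.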
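
The shape rule can be checked with a few lines of NumPy. This is a minimal sketch, assuming the 3-4-1 layout described above; the random values, the seed and the batch of five examples are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n_in, n_hidden, n_out = 3, 4, 1  # layer sizes from the text

W1 = rng.standard_normal((n_hidden, n_in))   # (n_out, n_in) rule -> (4, 3)
b1 = np.zeros((n_hidden, 1))                 # (4, 1)
W2 = rng.standard_normal((n_out, n_hidden))  # (1, 4)
b2 = np.zeros((n_out, 1))                    # (1, 1)

x = rng.standard_normal((n_in, 1))           # one example as a column vector
X = rng.standard_normal((n_in, 5))           # five examples, one per column

z_single = W1 @ x + b1   # (4, 1) + (4, 1): the shapes already match
Z_batch = W1 @ X + b1    # (4, 5) + (4, 1): b1 is broadcast across the 5 columns

print(z_single.shape, Z_batch.shape)
```

The same b1 works for one example or for many, which is why the bias only needs shape (n_out, 1).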

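Now let's compute the output in vectorized form:

In the corresponding figure, with matrices we only need to implement the four formulas on the right to obtain a[2], which is our output yhat.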
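
Written out in the notation of this post, the four formulas have the following form. This is a sketch: g^{[1]} and g^{[2]} stand for the activation functions of the two layers and are left generic here as an assumption.

```latex
\begin{aligned}
z^{[1]} &= W^{[1]} x + b^{[1]} \\
a^{[1]} &= g^{[1]}\bigl(z^{[1]}\bigr) \\
z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \\
\hat{y} = a^{[2]} &= g^{[2]}\bigl(z^{[2]}\bigr)
\end{aligned}
```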

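III. Vectorizing the neural network

Vectorizing the parameters simplifies the computation for a single training example. With m training examples, the computation for each example is identical, so following the earlier approach we could process the m examples with a for loop.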

for i in range(m):
    pass  # run the single-example computation on example i

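But an explicit for loop like this is slow and resource-hungry in Python. We can vectorize again and compute all m examples in one pass, which speeds up the computation.

Here is how the vectorization works:

Above, [l] denotes the layer index and (i) denotes the example index; for now assume b = 0.

All m training examples go through the same process, so we can stack them into a matrix X of shape (xn, m), where xn is the number of features per example and m is the number of training examples.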
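
To make the equivalence concrete, here is a small sketch comparing the loop with the vectorized computation for the first layer. The sizes, the random data and the use of sigmoid as the activation are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 6               # illustrative sizes
W1 = rng.standard_normal((n_h, n_x))
b1 = np.zeros((n_h, 1))             # b = 0, as assumed in the text
X = rng.standard_normal((n_x, m))   # column i is example x^(i)

# Loop version: one column at a time.
A1_loop = np.zeros((n_h, m))
for i in range(m):
    x_i = X[:, i:i + 1]                       # keep it a (n_x, 1) column
    A1_loop[:, i:i + 1] = sigmoid(W1 @ x_i + b1)

# Vectorized version: all m columns in one matrix product.
A1_vec = sigmoid(W1 @ X + b1)

print(np.allclose(A1_loop, A1_vec))
```

Column i of the vectorized result is exactly what the loop computes for example i, so nothing changes except the speed.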

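IV. The backpropagation algorithm

After implementing the forward computation, we can evaluate how well the network performs by computing the loss function and the cost function. We can then start backpropagation (backward prop) to update the parameters so that the model's predictions get closer to the values we want. Gradient descent is one method for optimizing w and b.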

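A simple view of gradient descent

Let's first use a simple example to explain what gradient descent is:

We start with a simple neural network (a single neuron, really) whose loss function is:

We turn the formula above into a computation graph:

Now we want to optimize w1, w2 and b so as to minimize L(a, y). To do so we take the partial derivative of L with respect to each of them and use those derivatives to update w1, w2 and b. Because L(a, y) is a convex function, repeated small updates move us step by step toward the global optimum.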

The computation goes as follows:

First we compute da and dz:

Then we compute the derivatives with respect to w1, w2 and b:
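
For reference, here is what these derivatives look like under the usual logistic-regression setup: z = w1 x1 + w2 x2 + b, a = σ(z), and the cross-entropy loss. That setup is an assumption in this sketch. The names da, dz, dw1, dw2 and db are shorthand for the partial derivative of L with respect to each quantity.

```latex
\begin{aligned}
L(a, y) &= -\bigl(y \log a + (1 - y)\log(1 - a)\bigr) \\
da &= \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} \\
dz &= da \cdot a(1 - a) = a - y \\
dw_1 &= x_1\, dz, \qquad dw_2 = x_2\, dz, \qquad db = dz
\end{aligned}
```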

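Next we update the parameters by gradient descent:

Here α is the learning rate, which can be thought of as the step size: how far we move toward the optimum at each update. If the learning rate is too large, we may oscillate around the optimum and never reach it. If it is too small, we may need a great many iterations to get there. Choosing a suitable learning rate is therefore important.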

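Next, the cost function over m examples:

With m examples, the partial derivatives of the cost with respect to w and b can be written as the average of the per-example partial derivatives:

Then, just as with a single example, we update w1, w2 and b and move on to the next training iteration: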
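
The averaged gradients and the update can be sketched in NumPy as follows. This assumes a single sigmoid neuron with the cross-entropy cost; the toy dataset, the learning rate of 0.5 and the helper name cost_and_grads are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, Y):
    """Cross-entropy cost and its averaged gradients for one sigmoid neuron.

    w: (n_x, 1), b: float, X: (n_x, m), Y: (1, m).
    """
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                 # (1, m) predictions
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                               # per-example dz
    dw = (X @ dZ.T) / m                      # average of x^(i) * dz^(i)
    db = float(np.sum(dZ)) / m               # average of dz^(i)
    return cost, dw, db

# Toy data (made up): the label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

w, b, alpha = np.zeros((2, 1)), 0.0, 0.5
costs = []
for _ in range(100):
    cost, dw, db = cost_and_grads(w, b, X, Y)
    costs.append(cost)
    w = w - alpha * dw      # w := w - alpha * dw
    b = b - alpha * db      # b := b - alpha * db

print(costs[0], costs[-1])
```

Starting from zeros is fine here because there is only one neuron; the first cost is log(2), and it falls as the updates proceed.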

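Andrew Ng's course gives two animations that illustrate how the learning rate affects gradient descent:

When the learning rate is small or well chosen, we get a process like the following, and the model eventually reaches the optimum.

When the learning rate is set too high, the step size forces us to oscillate around the final result. This wastes time and may even prevent us from reaching the optimum, as shown here: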
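
The same behavior can be reproduced on a toy problem. The sketch below is not from the course: it minimizes f(w) = w**2, whose gradient is 2*w, with three illustrative learning rates.

```python
def gradient_descent(alpha, w0=1.0, steps=20):
    """Minimize f(w) = w**2 (gradient 2*w) and return every iterate."""
    ws = [w0]
    for _ in range(steps):
        ws.append(ws[-1] - alpha * 2 * ws[-1])
    return ws

small = gradient_descent(alpha=0.1)   # each step multiplies w by 0.8
edge = gradient_descent(alpha=1.0)    # each step multiplies w by -1
large = gradient_descent(alpha=1.1)   # each step multiplies w by about -1.2

print(abs(small[-1]), abs(edge[-1]), abs(large[-1]))
```

With alpha = 0.1 the iterate shrinks toward the minimum at 0, with alpha = 1.0 it bounces between 1 and -1 forever, and with alpha = 1.1 each bounce lands farther away than the last.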

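Gradient descent for a shallow neural network

Now let's return to the first example in this post:

We continue explaining gradient descent with this network. First, the derivative formulas for a single example:

The figure above gives the derivatives directly. I include a hand derivation of dz[2]; you can derive the other results the same way:

(My handwriting is rough, so bear with it, or work it out by hand yourself...) The computation is not hard overall; the core idea is the chain rule, which I trust everyone can follow.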
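
In typeset form, the chain-rule steps for dz[2] can be sketched as follows, assuming a sigmoid output unit and the cross-entropy loss used in the code section of this post:

```latex
\begin{aligned}
dz^{[2]} &= \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \\
&= \left(-\frac{y}{a^{[2]}} + \frac{1 - y}{1 - a^{[2]}}\right) a^{[2]}\bigl(1 - a^{[2]}\bigr) \\
&= -y\bigl(1 - a^{[2]}\bigr) + (1 - y)\,a^{[2]} \\
&= a^{[2]} - y
\end{aligned}
```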

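Next, the vectorized derivatives:

The difference from the single-example case is that dw and db are averages of the per-example partial derivatives. With them we can update the parameters and complete one iteration.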
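
Written out, the vectorized derivatives take the following form. The symbol ⊙ denotes the element-wise product and g^{[1]} is the hidden-layer activation; with tanh, as in the backward_propagation code later in this post, g^{[1]'}(Z^{[1]}) = 1 - (A^{[1]})^2.

```latex
\begin{aligned}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \frac{1}{m}\, dZ^{[2]} A^{[1]T} \\
db^{[2]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)} \\
dZ^{[1]} &= W^{[2]T} dZ^{[2]} \odot g^{[1]\prime}\bigl(Z^{[1]}\bigr) \\
dW^{[1]} &= \frac{1}{m}\, dZ^{[1]} X^{T} \\
db^{[1]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)}
\end{aligned}
```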

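In summary

Backpropagation is defined relative to forward propagation. When training a neural network, forward propagation computes the current model's predictions. From the resulting cost function we compute the partial derivative with respect to each parameter (this backward pass is backpropagation) and then update the parameters by gradient descent, so that the model predicts more accurately.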

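V. Neural network code and filling in the gaps

This is effectively my first original article, and since the explanation is split across two posts, I think it is worth tying all the pieces together with code.

Let's walk through building a neural network on a simple binary classification problem:

0. Dataset

As the figure above shows, in this example we need a simple neural network to partition the regions in the picture. The horizontal and vertical axes are the features x1 and x2. Each point's color is its label y: blue is 1, red is 0. The goal is to predict a point's color (y) from its coordinates (x1, x2).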

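Neural network model

We use the neural network model shown in the figure above, with tanh as the activation function in the hidden layer and sigmoid as the activation function in the output layer.

For each training example, the formulas are:

The final cost function is: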
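
Written to agree with the forward_propagation and compute_cost code later in this post, the per-example formulas and the cost are:

```latex
\begin{aligned}
z^{[1](i)} &= W^{[1]} x^{(i)} + b^{[1]} \\
a^{[1](i)} &= \tanh\bigl(z^{[1](i)}\bigr) \\
z^{[2](i)} &= W^{[2]} a^{[1](i)} + b^{[2]} \\
\hat{y}^{(i)} = a^{[2](i)} &= \sigma\bigl(z^{[2](i)}\bigr) \\
J &= -\frac{1}{m} \sum_{i=1}^{m} \Bigl( y^{(i)} \log a^{[2](i)} + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - a^{[2](i)}\bigr) \Bigr)
\end{aligned}
```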

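Here is the procedure for building the neural network:

1. Define the network structure (number of input units, number of hidden units, and so on)
2. Initialize the model's parameters
3. Loop (for a given number of iterations):

   - Implement forward propagation
   - Compute the cost
   - Implement backpropagation to obtain the gradients
   - Update the parameters (gradient descent)

Finally, merge steps 1-3 into a single model. Once the model has learned suitable parameters (training is complete), it can make predictions on new data.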

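1. Defining the network structure

X is the input, of shape (number of features, number of examples);

Y is the output, of shape (output size, number of examples). Every x has a corresponding y, so the number of examples is the same;

n_x is the size of the input layer;

n_h is the size of the hidden layer, fixed at 4 here;

n_y is the size of the output layer.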

def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)

    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    n_x = X.shape[0]  # size of input layer
    n_h = 4           # size of hidden layer (fixed for this example)
    n_y = Y.shape[0]  # size of output layer
    return (n_x, n_h, n_y)

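2. Initializing the model parameters

We initialize the weights w with np.random.randn(a, b) * 0.01;

and the biases b with np.zeros((a, b)).

w must not be initialized to zero. If it were, every hidden unit would compute exactly the same function of the input and receive exactly the same gradient, so the units would stay identical no matter how long we train and the extra units would be wasted. Drawing w from randn() breaks this symmetry, and scaling it down to small values keeps tanh and sigmoid away from their flat regions, where gradients are tiny.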
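
The effect of the initialization can be seen by comparing gradients for three choices of starting weights. This is a minimal sketch assuming the tanh/sigmoid network used in this post; the data, labels and the helper name weight_grads are made up for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_grads(W1, b1, W2, b2, X, Y):
    """One forward and backward pass of a tanh/sigmoid 2-layer net; returns dW1, dW2."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 50))                  # made-up inputs
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)     # made-up labels
b1, b2 = np.zeros((4, 1)), np.zeros((1, 1))

# All-zero weights: every weight gradient is zero, so W1 and W2 never move.
dW1_zero, dW2_zero = weight_grads(np.zeros((4, 2)), b1, np.zeros((1, 4)), b2, X, Y)

# Identical non-zero weights: gradients exist, but every hidden unit gets the same one.
dW1_same, _ = weight_grads(np.full((4, 2), 0.5), b1, np.full((1, 4), 0.5), b2, X, Y)

# Small random weights: the rows of dW1 differ, so the hidden units can specialize.
W1_rand = rng.standard_normal((4, 2)) * 0.01
W2_rand = rng.standard_normal((1, 4)) * 0.01
dW1_rand, _ = weight_grads(W1_rand, b1, W2_rand, b2, X, Y)

print(np.allclose(dW1_zero, 0),
      np.allclose(dW1_same, dW1_same[0]),
      np.allclose(dW1_rand, dW1_rand[0]))
```

Each row of dW1 is the gradient for one hidden unit, so identical rows mean the units can never become different from one another.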

import numpy as np


def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    params -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """

    np.random.seed(2) # fixed seed so the random initialization is reproducible

    # Small random weights break symmetry; the biases can start at zero.
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

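3. Implementing the loop

First we implement forward propagation (forward prop).

We read the initialized parameters w and b from the parameters dict. During forward propagation we store z and a in a cache so that backpropagation can reuse them.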
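
The forward_propagation function below calls a sigmoid helper that is not defined in this post. A minimal sketch, assuming the standard logistic function is what is meant, would be:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z)) for a scalar or NumPy array."""
    return 1.0 / (1.0 + np.exp(-z))
```

It maps any real input into the interval (0, 1), which is why the output A2 can be read as a probability.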

def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)

    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Implement Forward Propagation to calculate A2 (probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2 
    A2 = sigmoid(Z2)

    assert(A2.shape == (1, X.shape[1]))

    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}

    return A2, cache

Next we compute the cost function

The cost can be computed from A2 and Y alone.

def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost J = -(1/m) * sum(Y*log(A2) + (1-Y)*log(1-A2))

    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- the cross-entropy cost, as a scalar
    """

    m = Y.shape[1] # number of examples

    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply(np.log(1-A2), (1-Y))
    cost = -(1.0/m)*np.sum(logprobs)

    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. 
                                # E.g., turns [[17]] into 17 
    assert(isinstance(cost, float))

    return cost

Next we compute backpropagation (backward prop)

We store the derivatives in a dict (grads).

The formulas are:

The code is:

def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.

    Arguments:
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]

    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters["W1"]
    W2 = parameters["W2"]

    # Retrieve also A1 and A2 from dictionary "cache".
    A1 = cache["A1"]
    A2 = cache["A2"]

    # Backward propagation: calculate dW1, db1, dW2, db2. 
    dZ2 = A2 - Y
    dW2 = 1.0/m*np.dot(dZ2, A1.T)
    db2 = 1.0/m*np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2)*(1-np.power(A1, 2))
    dW1 = 1.0/m*np.dot(dZ1, X.T)
    db1 = 1.0/m*np.sum(dZ1, axis=1, keepdims=True)

    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}

    return grads
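
A common way to gain confidence in backward formulas like these is a numerical gradient check. The sketch below is self-contained and is an addition to this walkthrough: it re-implements the same forward and backward formulas in compact form (the helper names cost_fn, analytic_grads and numeric_grads are hypothetical) and compares them with centered finite differences on made-up data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_fn(p, X, Y):
    """Forward pass plus cross-entropy cost, same formulas as in this post."""
    A1 = np.tanh(p["W1"] @ X + p["b1"])
    A2 = sigmoid(p["W2"] @ A1 + p["b2"])
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def analytic_grads(p, X, Y):
    """Same backward formulas as backward_propagation, recomputing the cache inline."""
    m = X.shape[1]
    A1 = np.tanh(p["W1"] @ X + p["b1"])
    A2 = sigmoid(p["W2"] @ A1 + p["b2"])
    dZ2 = A2 - Y
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)
    return {"W1": dZ1 @ X.T / m, "b1": dZ1.sum(axis=1, keepdims=True) / m,
            "W2": dZ2 @ A1.T / m, "b2": dZ2.sum(axis=1, keepdims=True) / m}

def numeric_grads(p, X, Y, eps=1e-6):
    """Centered finite differences: (J(theta + eps) - J(theta - eps)) / (2 * eps)."""
    grads = {}
    for name, value in p.items():
        g = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + eps
            plus = cost_fn(p, X, Y)
            value[idx] = old - eps
            minus = cost_fn(p, X, Y)
            value[idx] = old                  # restore the parameter
            g[idx] = (plus - minus) / (2 * eps)
        grads[name] = g
    return grads

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10))              # made-up data
Y = (rng.random((1, 10)) > 0.5).astype(float)
params = {"W1": rng.standard_normal((4, 2)), "b1": np.zeros((4, 1)),
          "W2": rng.standard_normal((1, 4)), "b2": np.zeros((1, 1))}

ana, num = analytic_grads(params, X, Y), numeric_grads(params, X, Y)
max_diff = max(np.max(np.abs(ana[k] - num[k])) for k in params)
print(max_diff)
```

If the analytic and numerical gradients agree to many decimal places, the derivative formulas are very likely implemented correctly.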

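Next we update the parameters, which completes one iteration of the loop:

With the learning rate set to 1.2, we read the parameters and gradients from the parameters and grads dicts, apply the update, and store the updated parameters back in parameters.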

def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule given above

    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients 

    Returns:
    parameters -- python dictionary containing your updated parameters 
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]

    # Update rule for each parameter
    W1 = W1 - learning_rate*dW1
    b1 = b1 - learning_rate*db1
    W2 = W2 - learning_rate*dW2
    b2 = b2 - learning_rate*db2

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

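4. Putting the model together

Next we combine the steps above into one model, which is our neural network model.

We set the number of training iterations (num_iterations) to 10000 and print the cost every 1000 iterations.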

def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]

    # Initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
    parameters = initialize_parameters(n_x, n_h, n_y)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)

        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)

        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)

        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads, learning_rate = 1.2)


        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    return parameters

In the output above, the cost is lower at every 1000-iteration checkpoint, which shows that gradient descent is working!

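5. Prediction function

Finally, we close this post with a prediction function.

We feed the test data into the model to get the prediction A2. If A2 > 0.5, the model assigns more than a 50% probability to the point being blue; otherwise the point is predicted to be red.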

def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X

    Arguments:
    parameters -- python dictionary containing your parameters 
    X -- input data of size (n_x, m)

    Returns
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """

    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    A2, cache = forward_propagation(X, parameters)
    predictions = (A2 > 0.5)

    return predictions
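
One way to summarize such predictions is the fraction that match the labels. The helper below is a hypothetical addition, and the two arrays are made-up stand-ins for the output of predict and for the true labels.

```python
import numpy as np

def accuracy(predictions, Y):
    """Fraction of examples whose predicted class equals the 0/1 label."""
    return float(np.mean(predictions.astype(int) == Y.astype(int)))

# Made-up values standing in for the output of predict() and the true labels.
predictions = np.array([[True, False, True, True]])
Y = np.array([[1, 0, 0, 1]])

print(accuracy(predictions, Y))
```

Here three of the four made-up predictions agree with the labels.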

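In the end, the model partitions the original dataset as shown in the following picture:

In summary:

This post should give you an understanding of how an MLP, or neural network, is put together. Forward propagation computes a prediction; backpropagation differentiates backward through the network, and gradient descent uses those derivatives to optimize the model parameters so that the model predicts the sample values more accurately.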

Neural Network Basics: Gradient Descent and the BP Algorithm