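Explanations of how gradient descent works are easy to find online. These are just my own study notes, kept so I can look things up and review them later, so they are not especially detailed and mostly record the parts that are useful to me.

1. What is a gradient?

It can be understood simply as the derivative of a multivariable function: the vector of partial derivatives taken with respect to each variable separately.

Gradient is another word for rate of change or slope. If you need to review the concept, see Khan Academy's explanation.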
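
As a small illustration of "one partial derivative per variable", the sketch below approximates a gradient with central differences. The function `f` and the helper `numerical_gradient` are made up for this note and are not part of any library.

```python
import numpy as np

def f(v):
    """Hypothetical example function f(x, y) = x^2 + 3y."""
    x, y = v
    return x**2 + 3 * y

def numerical_gradient(func, v, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h  # perturb only the i-th variable
        grad[i] = (func(v + step) - func(v - step)) / (2 * h)
    return grad

grad = numerical_gradient(f, [2.0, 1.0])
print(grad)  # close to the analytic gradient (2x, 3) evaluated at x = 2
```

Each component is obtained by changing one variable while holding the others fixed, which is exactly the definition of a partial derivative.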

Gradient: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient

Ways to avoid getting stuck in local minima:

https://ruder.io/optimizing-gradient-descent/index.html#momentum

Khan Academy multivariable calculus: https://www.khanacademy.org/math/multivariable-calculus

Vectors: https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/vectors/v/vector-introduction-linear-algebra

Matrices: https://www.khanacademy.org/math/precalculus-2018/precalc-matrices

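2. Error function

First, gradient descent requires the error function to be continuous and differentiable.

Here we use the mean squared error (MSE), where m is the number of samples, $y^{\mu}$ is the label of sample $\mu$ and $\hat{y}^{\mu}$ is its prediction:

$E = \frac{1}{2m}\sum_{\mu=1}^{m}(y^{\mu}-\hat{y}^{\mu})^2$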
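
A minimal sketch of this error function in NumPy, including the factor of one half. The name `mse` and the sample values are assumptions made for illustration.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error with the 1/(2m) factor used in these notes."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    m = y.size  # number of samples
    return np.sum((y - y_hat) ** 2) / (2 * m)

# Hypothetical labels and predictions
error = mse([1.0, 0.0, 1.0], [0.9, 0.2, 0.6])
print(error)  # (0.01 + 0.04 + 0.16) / 6, about 0.035
```

The factor of one half is only a convenience: it cancels the 2 that appears when the square is differentiated.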

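3. The basic process of gradient descent:

Think of the error function as a mountain. Our goal is to find the minimum of this function, that is, the bottom of the mountain.

The fastest way down is to find the steepest direction at the current position and take a step downhill along it. The gradient points in the direction in which the function increases fastest, so we step in the opposite direction.

Repeating this process and recomputing the gradient at each new position eventually brings us to a local minimum, much like walking down a mountain.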
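
The loop described above can be sketched on a simple bowl-shaped function. The function, starting point, learning rate and step count below are all hypothetical choices for illustration.

```python
import numpy as np

def error(w):
    """Hypothetical error surface with its minimum at (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    """Analytic gradient of the error surface above."""
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])  # starting position on the "mountain"
learnrate = 0.1

for step in range(200):
    # Step against the gradient, i.e. in the downhill direction
    w = w - learnrate * gradient(w)

print(w, error(w))  # w ends up very close to (3, -1)
```

This surface has a single minimum, so the loop finds it. On a surface with several valleys the same loop would stop in whichever local minimum lies downhill from the starting point.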

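4. Implementing the basic functions

Sigmoid activation function

$\sigma(x) = \frac{1}{1+e^{-x}}$

Output (prediction) formula

$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$

Error function (log-loss)

$Error(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y})$

Functions that update the weights, where $\alpha$ is the learning rate

$w_i^{'} \longleftarrow w_i + \alpha (y - \hat{y}) x_i$

$b^{'} \longleftarrow b + \alpha (y - \hat{y})$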
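
To make the update rule concrete, here is one worked step on a single point. The starting weights, the point, its label and the learning rate are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(y, y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Hypothetical starting values
w = np.array([0.5, -0.5])
b = 0.0
x = np.array([1.0, 2.0])
y = 1.0
alpha = 0.1

# Prediction and error before the update
y_hat = sigmoid(np.dot(w, x) + b)
error_before = log_loss(y, y_hat)

# One application of the update formulas
w_new = w + alpha * (y - y_hat) * x
b_new = b + alpha * (y - y_hat)

# Error on the same point after the update
error_after = log_loss(y, sigmoid(np.dot(w_new, x) + b_new))
print(error_before, error_after)
```

Because the label is 1 and the prediction starts below 0.5, the step moves the weights so that the prediction rises and the error on this point drops.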

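5. Deriving the gradient formula

The first thing to note is that the sigmoid function has a very nice derivative, namely

$\sigma'(x) = \sigma(x) (1-\sigma(x))$

The reason is that we can compute it with the quotient rule:

$\sigma'(x) = \frac{\partial}{\partial x} \frac{1}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) (1-\sigma(x))$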
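
The identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ can also be checked numerically. The sketch below compares it with a central-difference approximation; the helper name `sigmoid_prime`, the grid of test points and the step size are assumptions made for this check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative written in terms of the sigmoid itself."""
    s = sigmoid(x)
    return s * (1 - s)

xs = np.linspace(-5, 5, 11)  # a few test points
h = 1e-5                     # finite-difference step

# Central-difference approximation of the derivative
finite_diff = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)

# Largest disagreement between the two ways of computing the derivative
max_gap = np.max(np.abs(finite_diff - sigmoid_prime(xs)))
print(max_gap)
```

The gap should be tiny, which is what makes the sigmoid convenient: its derivative comes for free once the output is known.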

Now suppose we have m sample points, labelled $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$

The error formula is: $E = -\frac{1}{m} \sum_{i=1}^m \left( y^{(i)} \ln(\hat{y}^{(i)}) + (1-y^{(i)}) \ln (1-\hat{y}^{(i)}) \right)$

The prediction is: $\hat{y}^{(i)} = \sigma(Wx^{(i)} + b)$

Our goal is to compute the gradient of E (its partial derivatives) at a single sample point x, where x has n features, i.e. $x = (x_1, \ldots, x_n),$

$\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$
To do this, we first need to compute $\frac{\partial}{\partial w_j} \hat{y}$

because the prediction is the first ingredient of the formula above, with $\hat{y} = \sigma(Wx+b)$

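Therefore:

$\frac{\partial}{\partial w_j} \hat{y} = \frac{\partial}{\partial w_j} \sigma(Wx+b) = \sigma(Wx+b)(1-\sigma(Wx+b)) \cdot \frac{\partial}{\partial w_j}(Wx+b) = \hat{y}(1-\hat{y}) \cdot \frac{\partial}{\partial w_j}(w_1 x_1 + \cdots + w_j x_j + \cdots + w_n x_n + b) = \hat{y}(1-\hat{y}) \cdot x_j$

The last equality holds because the only term in the sum that is not constant with respect to $w_j$ is $w_j x_j$, whose derivative is clearly $x_j$.

Now we can compute $\frac {\partial} {\partial w_j} E$ for a single point:

$\frac{\partial}{\partial w_j} E = \frac{\partial}{\partial w_j}\left[-y \log(\hat{y}) - (1-y)\log(1-\hat{y})\right] = -y \cdot \frac{1}{\hat{y}} \cdot \frac{\partial}{\partial w_j}\hat{y} - (1-y) \cdot \frac{1}{1-\hat{y}} \cdot \frac{\partial}{\partial w_j}(1-\hat{y}) = -y(1-\hat{y})x_j + (1-y)\hat{y}x_j = -(y-\hat{y})x_j$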

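A similar calculation shows that, for a single sample point, the partial derivative of E with respect to b is:

$\frac {\partial} {\partial b} E=-(y -\hat{y})$

This tells us something very important. For a point with coordinates $(x_1, \ldots, x_n)$, label y and prediction $\hat{y}$, the gradient of the error function at that point is

$\left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right)$

In summary

$\nabla E(W,b) = -(y - \hat{y}) (x_1, \ldots, x_n, 1)$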
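
One way to gain confidence in this result is a numerical gradient check: perturb each parameter slightly and compare the change in the single-point error with the closed-form gradient. The weights, bias, point and label below are hypothetical values used only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(w, b, x, y):
    """Single-point error E for weights w and bias b."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def analytic_gradient(w, b, x, y):
    """Closed form: -(y - y_hat) * (x_1, ..., x_n, 1)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y - y_hat) * np.append(x, 1.0)

# Hypothetical parameters and sample point
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([0.2, 0.7])
y = 1.0

# Central differences over (w_1, w_2, b)
h = 1e-6
params = np.append(w, b)
numeric = np.zeros_like(params)
for i in range(params.size):
    step = np.zeros_like(params)
    step[i] = h
    plus, minus = params + step, params - step
    numeric[i] = (log_loss(plus[:-1], plus[-1], x, y)
                  - log_loss(minus[:-1], minus[-1], x, y)) / (2 * h)

analytic = analytic_gradient(w, b, x, y)
print(numeric, analytic)  # the two vectors should agree closely
```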

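If you think about it, this is quite remarkable. The gradient is just a scalar multiplied by the coordinates of the point! And what is the scalar? It is the difference between the label and the prediction. This means that if the label and the prediction are close (the point is classified correctly), the gradient is small, and if they are far apart (the point is misclassified), the gradient is large.

Remember: a small gradient means we adjust the weights only slightly, while a large gradient means we adjust them a lot.

If this sounds like the perceptron algorithm, that is no coincidence!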

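6. Overview of the gradient descent weight-update algorithm:

1. Set the weight step to zero: $\Delta w_i = 0$

2. For each record in the training data:

a. Make a forward pass through the network and compute the output $\hat y = f(\sum_i w_i x_i)$

b. Compute the error term of the output unit, $\delta = (y - \hat y) * f'(\sum_i w_i x_i)$

c. Update the weight step: $\Delta w_i = \Delta w_i + \delta x_i$

3. Once every record has been processed, update the weights: $w_i = w_i + \eta \Delta w_i / m$, where $\eta$ is the learning rate and m is the number of records. Averaging the weight steps dampens any large variations in the training data.

4. Repeat for e epochs.

You can also update the weights after each record, instead of averaging the steps after all the records have been processed.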
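
The procedure above can be sketched as follows, taking f to be the sigmoid and applying the averaged weight update once per epoch, after every record has contributed to $\Delta w_i$. The toy data set, the zero initial weights and the hyperparameters are assumptions made for illustration, and the bias is folded in as a constant third feature.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def batch_gradient_descent(features, targets, epochs=1000, learnrate=0.5):
    n_records, n_features = features.shape
    weights = np.zeros(n_features)  # assumption: start from zero weights
    losses = []
    for _ in range(epochs):
        del_w = np.zeros(n_features)            # reset the weight step
        for x, y in zip(features, targets):     # visit every record
            h = np.dot(x, weights)
            y_hat = sigmoid(h)                       # forward pass
            delta = (y - y_hat) * sigmoid_prime(h)   # error term
            del_w += delta * x                       # accumulate the step
        weights += learnrate * del_w / n_records     # averaged update
        out = sigmoid(features @ weights)
        losses.append(np.mean((targets - out) ** 2) / 2)
    return weights, losses

# Hypothetical toy data: logical OR, with a constant 1 acting as the bias input
features = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
targets = np.array([0, 1, 1, 1], dtype=float)

weights, losses = batch_gradient_descent(features, targets)
print(losses[0], losses[-1])  # the error should go down over the epochs
```

Moving the weight update inside the inner loop, and dropping the division by the number of records, gives the per-record variant mentioned above.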

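7. Gradient descent example

As an example, we classify a set of points in the plane into two classes, using gradient descent to fit the separating line.

7.1 Reading and plotting the data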

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#Some helper functions for plotting and drawing lines

def plot_points(X, y):
    admitted = X[np.argwhere(y==1)]
    rejected = X[np.argwhere(y==0)]
    plt.scatter([s[0][0] for s in rejected], [s[0][1] for s in rejected], s = 25, color = 'blue', edgecolor = 'k')
    plt.scatter([s[0][0] for s in admitted], [s[0][1] for s in admitted], s = 25, color = 'red', edgecolor = 'k')

def display(m, b, color='g--'):
    plt.xlim(-0.05,1.05)
    plt.ylim(-0.05,1.05)
    x = np.arange(-10, 10, 0.1)
    plt.plot(x, m*x+b, color)

data = pd.read_csv('data.csv', header=None)
X = np.array(data[[0,1]])
y = np.array(data[2])
plot_points(X,y)
plt.show()

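The plot above clearly shows two classes of points that a straight line can separate, so a linear model (here, logistic regression) is enough to classify them.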

# Implement the following functions

# Activation (sigmoid) function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Output (prediction) formula: the perceptron output
def output_formula(features, weights, bias):
    return sigmoid(np.dot(features, weights) + bias)

# Error (log-loss) formula
def error_formula(y, output):
    return - y*np.log(output) - (1 - y) * np.log(1-output)

# Gradient descent step: update the weights using the gradient
def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    d_error = (y - output)
    weights += learnrate * d_error * x
    bias += learnrate * d_error
    return weights, bias

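7.2 Training function

This function runs the gradient descent algorithm over all the data for a number of epochs. It also plots the data, along with some of the boundary lines found while the algorithm runs.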

np.random.seed(44)

epochs = 100
learnrate = 0.01

def train(features, targets, epochs, learnrate, graph_lines=False):
    
    errors = []
    n_records, n_features = features.shape
    last_loss = None
    # Initialize the weights randomly
    weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
    bias = 0
    for e in range(epochs):
        # Update the weights once per record
        for x, y in zip(features, targets):
            weights, bias = update_weights(x, y, weights, bias, learnrate)
        
        # Printing out the log-loss error on the training set
        out = output_formula(features, weights, bias)
        loss = np.mean(error_formula(targets, out))
        errors.append(loss)
        if e % (epochs / 10) == 0:
            print("\n========== Epoch", e,"==========")
            if last_loss and last_loss < loss:
                print("Train loss: ", loss, "  WARNING - Loss Increasing")
            else:
                print("Train loss: ", loss)
            last_loss = loss
            predictions = out > 0.5
            accuracy = np.mean(predictions == targets)
            print("Accuracy: ", accuracy)
        if graph_lines and e % (epochs / 100) == 0:
            display(-weights[0]/weights[1], -bias/weights[1])
            

    # Plotting the solution boundary
    plt.title("Solution boundary")
    display(-weights[0]/weights[1], -bias/weights[1], 'black')

    # Plotting the data
    plot_points(features, targets)
    plt.show()

    # Plotting the error
    plt.title("Error Plot")
    plt.xlabel('Number of epochs')
    plt.ylabel('Error')
    plt.plot(errors)
    plt.show()
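
The calls `display(-weights[0]/weights[1], -bias/weights[1])` in `train` draw the decision boundary, the line where $w_1 x_1 + w_2 x_2 + b = 0$, rewritten as $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$. The sketch below isolates that conversion. The helper name `boundary_line` and the example weights are hypothetical.

```python
import numpy as np

def boundary_line(weights, bias):
    """Return (slope, intercept) of w1*x1 + w2*x2 + b = 0, solved for x2."""
    w1, w2 = weights
    if w2 == 0:
        # A vertical boundary cannot be written as x2 = slope * x1 + intercept
        raise ValueError("w2 is zero: the boundary is a vertical line")
    return -w1 / w2, -bias / w2

# Hypothetical trained parameters
example_weights = np.array([2.0, -4.0])
example_bias = 1.0
slope, intercept = boundary_line(example_weights, example_bias)

# Any point on the line should give a pre-activation score of zero
x1 = 0.6
x2 = slope * x1 + intercept
score = example_weights[0] * x1 + example_weights[1] * x2 + example_bias
print(slope, intercept, score)
```

A score of zero corresponds to a sigmoid output of 0.5, which is why this line separates the points predicted as one class from those predicted as the other.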

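7.3 Training the algorithm

When we run the function, we get the following:

10 updates showing the current training loss and accuracy.
A plot of the data with some of the boundary lines found during training. The final one is black. Note how the lines get closer and closer to the best fit as we go through more epochs.
A plot of the error function. Note how it decreases as we go through more epochs.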

train(X, y, epochs, learnrate, True)

========== Epoch 0 ==========
Train loss:  0.7135845195381634
Accuracy:  0.4

========== Epoch 10 ==========
Train loss:  0.6225835210454962
Accuracy:  0.59

========== Epoch 20 ==========
Train loss:  0.5548744083669508
Accuracy:  0.74

========== Epoch 30 ==========
Train loss:  0.501606141872473
Accuracy:  0.84

========== Epoch 40 ==========
Train loss:  0.4593334641861401
Accuracy:  0.86

========== Epoch 50 ==========
Train loss:  0.42525543433469976
Accuracy:  0.93

========== Epoch 60 ==========
Train loss:  0.3973461571671399
Accuracy:  0.93

========== Epoch 70 ==========
Train loss:  0.3741469765239074
Accuracy:  0.93

========== Epoch 80 ==========
Train loss:  0.35459973368161973
Accuracy:  0.94

========== Epoch 90 ==========
Train loss:  0.3379273658879921
Accuracy:  0.94
