Implementing Gradient Descent (in Python)

Original article: link

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(['ggplot'])

When you first get into machine learning, the first basic algorithm you learn is Gradient Descent; it is fair to say that gradient descent is the backbone of machine learning algorithms. In this post I try to explain the basic principles of gradient descent using Python. Once you have mastered gradient descent, many problems become easier to understand, and it also helps when studying other algorithms.

If you want to implement gradient descent yourself, you need to load the basic Python packages: numpy and matplotlib.

First, we create some linear data with added noise.

# Generate some noisy linear data
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

Next, we visualize the data with matplotlib.

# Visualize the data
plt.plot(X, y, 'b.')
plt.xlabel("$x$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])

Clearly, $y$ and $x$ have a nice linear relationship. This data is very simple and has only one independent variable, $x$.

We can express it as a simple linear relationship:

$$y = b + mx$$

and then solve for $b$ and $m$.

There is nothing wrong with this analytical, equation-solving approach, but machine learning involves matrix computations, so we will use the matrix (vector) form instead.
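
In matrix form, the least-squares solution is given by the normal equation, which is exactly what the code further below computes with np.linalg.inv:

$$\hat{\theta} = \left(X^T X\right)^{-1} X^T y$$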

We replace $y$ with $h_\theta(x)$ (the hypothesis), $b$ with $\theta_0$, and $m$ with $\theta_1$, which gives the following expression:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

Note: in this example the true values are $\theta_0 = 4$ and $\theta_1 = 3$.

The analytical (closed-form) way to solve for $\theta_0$ and $\theta_1$ is shown in the code below:

X_b = np.c_[np.ones((100, 1)), X]  # add a bias unit: a column of 1s, one for each sample in X
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
theta_best
array([[3.86687149],
       [3.12408839]])

Notice that these values are close to the true $\theta_0$ and $\theta_1$; because I added noise to the data, they are not exact. Using these parameters we can now make predictions, for example at $x = 0$ and $x = 2$:

X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)
y_predict
array([[ 3.86687149],
       [10.11504826]])
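
To see the fit, we can also plot the predicted line over the noisy data (a quick sketch reusing X, y, X_new, and y_predict from above):

# Plot the fitted line (red) over the noisy data (blue dots)
plt.plot(X_new, y_predict, 'r-', label='prediction')
plt.plot(X, y, 'b.')
plt.xlabel("$x$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.legend()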

Gradient Descent

Cost Function & Gradients

The formulas for computing the cost function and the gradients are shown below.

Note: this cost function is for linear regression; other algorithms use different cost functions, and the gradients must be derived from the corresponding cost function.

Cost

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Gradient

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
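
For completeness, this expression comes from applying the chain rule to the cost above, using $\partial h_\theta(x^{(i)})/\partial \theta_j = x_j^{(i)}$ for a linear hypothesis; the factor of 2 from the square cancels the $\frac{1}{2}$:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{2m}\sum_{i=1}^{m} 2\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$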

Gradients

$$\theta_0 := \theta_0 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\theta_1 := \theta_1 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_1^{(i)}$$
$$\theta_2 := \theta_2 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_2^{(i)}$$
$$\theta_j := \theta_j - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
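
All of these per-parameter updates can be written as a single vectorized step, which is what the implementation below uses ($X$ here includes the bias column):

$$\theta := \theta - \frac{\alpha}{m}\, X^T\left(X\theta - y\right)$$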

def cal_cost(theta, X, y):
    '''
    Calculates the cost for given X and y. The following shows an example of a single-dimensional X.
    theta = Vector of thetas 
    X     = Row of X's np.zeros((2,j))
    y     = Actual y's np.zeros((2,1))
    
    where:
        j is the no of features
    '''
    
    m = len(y)
    
    predictions = X.dot(theta)
    cost = 1/(2*m) * np.sum(np.square(predictions - y))
    
    return cost
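
As a quick sanity check (a minimal usage sketch, reusing X_b, y, and theta_best from the closed-form solution above), we can evaluate the cost at theta_best; the exact value depends on the random noise in the data:

# Cost at the closed-form solution; theta_best minimizes the squared error
cal_cost(theta_best, X_b, y)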
def gradient_descent(X, y, theta, learning_rate = 0.01, iterations = 100):
    '''
    X    = Matrix of X with added bias units
    y    = Vector of Y
    theta=Vector of thetas np.random.randn(j,1)
    learning_rate 
    iterations = no of iterations
    
    Returns the final theta vector and array of cost history over no of iterations    
    '''
    
    m = len(y)
    
    cost_history = np.zeros(iterations)
    theta_history = np.zeros((iterations, 2))
    for i in range(iterations):
        prediction = np.dot(X, theta)
        
        theta = theta - (1/m) * learning_rate * (X.T.dot((prediction - y)))
        theta_history[i, :] = theta.T
        cost_history[i] = cal_cost(theta, X, y)
        
    return theta, cost_history, theta_history
# Start with 1000 iterations and a learning rate of 0.01; initialize theta from a Gaussian distribution
lr =0.01
n_iter = 1000
theta = np.random.randn(2, 1)
X_b = np.c_[np.ones((len(X), 1)), X]
theta, cost_history, theta_history = gradient_descent(X_b, y, theta, lr, n_iter)

print('Theta0:          {:0.3f},\nTheta1:          {:0.3f}'.format(theta[0][0],theta[1][0]))
print('Final cost/MSE:  {:0.3f}'.format(cost_history[-1]))
Theta0:          3.867,
Theta1:          3.124
Final cost/MSE:  0.546
# Plot the cost over the iterations
fig, ax = plt.subplots(figsize=(12,8))

ax.set_ylabel('J(Theta)')
ax.set_xlabel('Iterations')
ax.plot(range(1000), cost_history, 'b.')

The cost levels off after roughly 150 iterations, so let's zoom in on the first 200 iterations and look at the curve.

fig, ax = plt.subplots(figsize=(10,8))
ax.plot(range(200), cost_history[:200], 'b.')
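
To put a rough number on where the curve flattens, here is a minimal sketch (reusing cost_history from the run above; the tolerance is an arbitrary, illustrative choice):

# First iteration where the cost improves by less than a small tolerance
tol = 1e-3  # arbitrary threshold for "no longer improving"
improvements = np.abs(np.diff(cost_history))
print(np.argmax(improvements < tol))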

It is worth noting that the cost drops quickly at first, and then the reduction per iteration diminishes. We can try different combinations of learning rate and iteration count and see what effect each combination has.

Let's build a function that shows these effects and also illustrates how gradient descent actually works.

def plot_GD(n_iter, lr, ax, ax1=None):
    '''
    n_iter = no of iterations
    lr = Learning Rate
    ax = Axis to plot the Gradient Descent
    ax1 = Axis to plot cost_history vs Iterations plot
    '''
    
    ax.plot(X, y, 'b.')
    theta = np.random.randn(2, 1)
    
    tr = 0.1
    cost_history = np.zeros(n_iter)
    for i in range(n_iter):
        pred_prev = X_b.dot(theta)
        theta, h, _ = gradient_descent(X_b, y, theta, lr, 1)
        pred = X_b.dot(theta)
        
        cost_history[i] = h[0]
        
        if ((i % 25 == 0)):
            ax.plot(X, pred, 'r-', alpha=tr)
            if tr < 0.8:
                tr += 0.2
    
    if ax1 is not None:
        ax1.plot(range(n_iter), cost_history, 'b.')
# Plot different combinations of iterations and learning rates
fig = plt.figure(figsize=(30,25), dpi=200)
fig.subplots_adjust(hspace=0.4, wspace=0.4)

it_lr = [(2000, 0.001), (500, 0.01), (200, 0.05), (100, 0.1)]
count = 0
for n_iter, lr in it_lr:
    count += 1
    
    ax = fig.add_subplot(4, 2, count)
    count += 1
   
    ax1 = fig.add_subplot(4, 2, count)
    
    ax.set_title("lr:{}" .format(lr))
    ax1.set_title("Iterations:{}" .format(n_iter))
    plot_GD(n_iter, lr, ax, ax1)

Looking at these plots, we can see that with a smaller learning rate it takes a long time to converge to a solution, whereas a larger learning rate converges much faster.
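
To make this concrete, here is a minimal sketch (reusing X_b, y, and gradient_descent from above) that compares the cost reached after the same 500 iterations for several learning rates; the values are chosen purely for illustration:

# Compare the final cost after a fixed budget of iterations
for lr_try in (0.001, 0.01, 0.05, 0.1):
    theta_try = np.random.randn(2, 1)
    _, ch, _ = gradient_descent(X_b, y, theta_try, learning_rate=lr_try, iterations=500)
    print('lr = {}: final cost = {:0.3f}'.format(lr_try, ch[-1]))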

_, ax = plt.subplots(figsize=(14, 10))
plot_GD(100, 0.1, ax)

Stochastic Gradient Descent

Stochastic gradient descent works on the same principle as batch gradient descent; the difference is that, instead of using all $m$ samples to compute the gradient, it picks a single sample $j$ at each step. The corresponding update rule is:

$$\theta_i := \theta_i - \alpha\left(h_\theta(x_0^{(j)}, x_1^{(j)}, \ldots, x_n^{(j)}) - y^{(j)}\right)x_i^{(j)}$$

def stochastic_gradient_descent(X, y, theta, learning_rate=0.01, iterations=10):
    '''
    X    = Matrix of X with added bias units
    y    = Vector of Y
    theta=Vector of thetas np.random.randn(j,1)
    learning_rate 
    iterations = no of iterations
    
    Returns the final theta vector and array of cost history over no of iterations
    '''
    
    m = len(y)
    cost_history = np.zeros(iterations)
    
    for it in range(iterations):
        cost = 0.0
        for i in range(m):
            rand_ind = np.random.randint(0, m)
            X_i = X[rand_ind, :].reshape(1, X.shape[1])
            y_i = y[rand_ind, :].reshape(1, 1)
            prediction = np.dot(X_i, theta)
            
            theta -= (1/m) * learning_rate * (X_i.T.dot((prediction - y_i)))
            cost += cal_cost(theta, X_i, y_i)
        cost_history[it] = cost
        
    return theta, cost_history
lr = 0.5
n_iter = 50
theta = np.random.randn(2,1)
X_b = np.c_[np.ones((len(X),1)), X]
theta, cost_history = stochastic_gradient_descent(X_b, y, theta, lr, n_iter)

print('Theta0:          {:0.3f},\nTheta1:          {:0.3f}' .format(theta[0][0],theta[1][0]))
print('Final cost/MSE:  {:0.3f}' .format(cost_history[-1]))
Theta0:          3.762,
Theta1:          3.159
Final cost/MSE:  46.964
fig, ax = plt.subplots(figsize=(10,8))

ax.set_ylabel(r'$J(\Theta)$', rotation=0)
ax.set_xlabel('$Iterations$')

ax.plot(range(n_iter), cost_history, 'b.')

Mini-batch Gradient Descent

Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: for $m$ samples, each update uses a batch of $x$ samples, where $1 < x < m$. A common choice is $x = 10$, and the value of $x$ can of course be tuned to the data. The corresponding update rule is:

$$\theta_i := \theta_i - \alpha\sum_{j=t}^{t+x-1}\left(h_\theta(x_0^{(j)}, x_1^{(j)}, \ldots, x_n^{(j)}) - y^{(j)}\right)x_i^{(j)}$$

def minibatch_gradient_descent(X, y, theta, learning_rate=0.01, iterations=10, batch_size=20):
    '''
    X    = Matrix of X without added bias units
    y    = Vector of Y
    theta=Vector of thetas np.random.randn(j,1)
    learning_rate 
    iterations = no of iterations
    
    Returns the final theta vector and array of cost history over no of iterations
    '''
    
    m = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(m / batch_size)
    
    for it in range(iterations):
        cost = 0.0
        indices = np.random.permutation(m)
        X = X[indices]
        y = y[indices]
        for i in range(0, m, batch_size):
            X_i = X[i: i+batch_size]
            y_i = y[i: i+batch_size]
            
            X_i = np.c_[np.ones(len(X_i)), X_i]
            prediction = np.dot(X_i, theta)
            
            theta -= (1/m) * learning_rate * (X_i.T.dot((prediction - y_i)))
            cost += cal_cost(theta, X_i, y_i)
        cost_history[it] = cost
    
    return theta, cost_history
lr = 0.1
n_iter = 200
theta = np.random.randn(2, 1)
theta, cost_history = minibatch_gradient_descent(X, y, theta, lr, n_iter)

print('Theta0:          {:0.3f},\nTheta1:          {:0.3f}' .format(theta[0][0], theta[1][0]))
print('Final cost/MSE:  {:0.3f}' .format(cost_history[-1]))
Theta0:          3.842,
Theta1:          3.146
Final cost/MSE:  2.726
fig, ax = plt.subplots(figsize=(10,8))

ax.set_ylabel(r'$J(\Theta)$', rotation=0)
ax.set_xlabel('$Iterations$')

ax.plot(range(n_iter), cost_history, 'b.')
