[Machine Learning Illustrated with Examples] Linear Regression

Linear regression is to machine learning what Hello World is to programming languages, and what MNIST is to deep learning.

First, let us define the mathematical notation used throughout this article:

| Notation | Meaning |
| --- | --- |
| $M$ | Number of parameters in $\mathrm w$ |
| $N$ | Number of instances |
| $\mathrm X=\{\mathrm x_1,\mathrm x_2,\cdots,\mathrm x_N\}^{\mathrm T}$ | $N\times M$ matrix of training data |
| $D$ | Number of features |
| $\mathrm y=\{y_1,y_2,\cdots,y_N\}^{\mathrm T}$ | Set of targets |
| $y_i$ | Target of instance $i$ |
| $\mathrm x_i=\{x_i^{(1)},x_i^{(2)},\cdots,x_i^{(D)}\}^{\mathrm T}$ | Set of features of instance $i$ |
| $x_i^{(j)}$ | Feature $j$ of instance $i$ |
| $\mathrm w=\{\omega_1,\omega_2,\cdots,\omega_M\}^{\mathrm T}$ | Weights of input $\mathrm x$ |
| $\omega_i$ | Weight of feature $i$ |
| $\phi=\{\phi_1,\phi_2,\cdots,\phi_M\}^{\mathrm T}$ | Set of basis functions |
| $\phi_i(\mathrm x)$ | Function of the features |


Model Description

In linear regression, the target is assumed to depend linearly on the parameters $\mathrm w=\{\omega_n\}$; we construct a loss function $E$ and solve for the parameters that minimize it. In other words, linear regression tries to learn the following function:
$$\hat y=\omega_0+\sum\limits_{j=1}^{M}\omega_j\phi_j(\mathrm x)=\omega_0+\mathrm w^{\mathrm T}\phi(\mathrm x)\tag{1}$$
Equation (1) is the general form of the linear regression model, which is not very intuitive at first sight. Its common special cases are as follows:

  • When $D=1,\ \phi_j(x)=x^j$, Eq. (1) can be written as:
    $$\hat y=\omega_0+\omega_1x+\omega_2x^2+\cdots+\omega_Mx^M\tag{2}$$
    In this case, linear regression becomes polynomial regression.

  • When $D=M,\ \phi_j(\mathrm x)=x^{(j)}$, Eq. (1) can be written as:
    $$\hat y=\omega_0+\omega_1x^{(1)}+\omega_2x^{(2)}+\cdots+\omega_Mx^{(M)}\tag{3}$$
    In this case, linear regression becomes what we usually mean by linear regression: a linear equation in several variables. With only one feature ($M=1$), we obtain the familiar straight-line equation from school algebra,

$$\hat y=\omega_1x+\omega_0\tag{4}$$
To keep this article easy to follow, the simulations below use this single-variable linear equation to illustrate linear regression, unless stated otherwise.
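
As a concrete illustration (not part of the original article's code), the following minimal Python sketch evaluates Eq. (1) with monomial basis functions $\phi_j(x)=x^j$, i.e. the polynomial-regression case of Eq. (2); the weights are arbitrary example values:

    import numpy as np

    # Minimal sketch of Eq. (1): y_hat = omega_0 + w^T phi(x),
    # here with monomial basis functions phi_j(x) = x**j (Eq. (2)).
    def predict(x, omega_0, w):
        phi = np.array([x ** j for j in range(1, len(w) + 1)])  # phi_1(x), ..., phi_M(x)
        return omega_0 + w @ phi

    # Arbitrary example weights: y_hat = 1 + 0.5*x + 0.25*x^2
    print(predict(2.0, omega_0=1.0, w=np.array([0.5, 0.25])))   # prints 3.0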


Cost Function

The goal of linear regression is to make the error between the prediction $\hat y$ and the true value $y$ as small as possible. This error can be measured with different distance metrics; here we use the sum of squares. The cost function can then be written as
$$E=\sum\limits_{i=1}^N{(\hat y_i-y_i)^2}=\sum\limits_{i=1}^N{(\omega_0+\mathrm w^{\mathrm T}\phi(\mathrm x_i)-y_i)^2}=\sum\limits_{i=1}^{N}{\Big[\omega_0+\sum\limits_{j=1}^{M}{\omega_j\phi_j(\mathrm{x}_i)}-y_i\Big]^2}\tag{5}$$
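
A one-line sketch of Eq. (5), assuming the predictions $\hat y_i$ have already been computed and both $\hat{\mathrm y}$ and $\mathrm y$ are NumPy arrays (the values below are made up, only to show the call):

    import numpy as np

    # Eq. (5): the cost is the sum of squared residuals between predictions and targets.
    def cost(y_hat, y):
        return np.sum((y_hat - y) ** 2)

    print(cost(np.array([1.0, 2.1, 2.9]), np.array([0.8, 2.0, 3.1])))  # 0.09
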
Below we illustrate the cost function in two dimensions ($M=1$) and three dimensions ($M=2$). Here we assume $\phi_i(\mathrm x)=x^{(i)}$ and that $\omega_0,\omega_1,\omega_2$ are known, so Eq. (1) becomes, respectively:
$$\hat y=\omega_0+\omega_1x^{(1)}\tag{6}$$

$$\hat y=\omega_0+\omega_1x^{(1)}+\omega_2x^{(2)}\tag{7}$$

From Eqs. (6) and (7) we obtain the straight line in Figure 1 and the plane in Figure 2:

Figure 1
Figure 2

The red points in Figures 1 and 2 are the true values $y$ corresponding to $\mathrm x$, and the red segments are the errors.


An Example

Figures 1 and 2 show the errors between the true values $y$ and the predictions $\hat y$ for given parameters $\omega_0,\omega_1,\omega_2$. Different parameters yield different errors, and the goal of linear regression is to find the set of parameters that makes the error smallest. Figures 3 and 4 below illustrate this:

Suppose the training set contains three data points $(x, y)$: $(1, 0.8)$, $(2, 2)$, $(3, 3.1)$. We use simple one-variable linear regression, i.e. Eq. (6); the goal is then to find a line $\hat y=\omega_1x+\omega_0$ that passes as close as possible to these three points.

Figure 3
Figure 4

Figure 3 plots the line $\hat y=\omega_1x+\omega_0$ for $\omega_0=0$ and $\omega_1=0.5\sim1.5$. Figure 4 shows how the value of the cost function changes as $\omega_1$ varies. From Figures 3 and 4 we can see that the cost is smallest when $\omega_1=1$.
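
The effect shown in Figures 3 and 4 can be checked numerically with a short sketch (it sweeps $\omega_1$ over the same grid and uses the sum of squared errors; the appendix code for Figure 4 uses the square root of this quantity, which has the same minimizer):

    import numpy as np

    # Sweep omega_1 (omega_0 fixed to 0) over 0.5..1.5 on the three training points
    # and report the value with the smallest sum of squared errors.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([0.8, 2.0, 3.1])
    omega_1_grid = np.linspace(0.5, 1.5, 41)
    errors = [np.sum((w1 * x - y) ** 2) for w1 in omega_1_grid]
    print(omega_1_grid[int(np.argmin(errors))])   # 1.0, consistent with Figure 4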


Normal Equation and Gradient Descent

Linear regression essentially amounts to solving the following optimization problem:
$$\min\limits_{\mathrm w}\quad E=\sum\limits_{i=1}^{N}{\Big[\omega_0+\sum\limits_{j=1}^{M}{\omega_j\phi_j(\mathrm{x}_i)}-y_i\Big]^2}\tag{8}$$
Let $\bar{\mathrm w}=\{\omega_0,\mathrm w\}$, $\bar{\phi}=\{\phi_0,\phi\}$, $\phi_0(\mathrm x)=1$, and write problem (8) in matrix-vector form:
$$\min\limits_{\mathrm {\bar w}}\quad E=[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]^{\mathrm T}[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]\tag{9}$$
In Eq. (9), $\bar{\phi}(\mathrm X)$ is an $N\times (M+1)$ matrix:
$$\bar\phi(\mathrm X)=\begin{pmatrix} \phi_0(\mathrm x_1) & \phi_1(\mathrm x_1) & \cdots & \phi_M(\mathrm x_1)\\ \phi_0(\mathrm x_2) & \phi_1(\mathrm x_2) & \cdots & \phi_M(\mathrm x_2)\\ \vdots & \vdots & \ddots &\vdots \\ \phi_0(\mathrm x_N) & \phi_1(\mathrm x_N) & \cdots & \phi_M(\mathrm x_N) \end{pmatrix}\tag{10}$$
By computing the Hessian matrix of the objective in (8), one can verify that this is a convex optimization problem. The problem then becomes straightforward and can be handed to off-the-shelf solvers such as CVX, CPLEX, or MATLAB. These solvers generally rely on gradient-based methods (explained later). Of course, we can also exploit convexity to obtain a closed-form solution.
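
For the identity basis $\phi_j(\mathrm x)=x^{(j)}$ of Eq. (3), the matrix in Eq. (10) is simply the data matrix with a leading column of ones; a small sketch (with made-up numbers) is:

    import numpy as np

    # Design matrix of Eq. (10) for phi_0(x) = 1 and phi_j(x) = x^(j):
    # prepend a column of ones to the N x M data matrix, giving an N x (M+1) matrix.
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5]])      # hypothetical data: N = 3 samples, M = 2 features
    Phi_bar = np.hstack([np.ones((X.shape[0], 1)), X])
    print(Phi_bar.shape)            # (3, 3)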


Normal Equation Method

Since the error function (8) is convex, the point where its derivative equals zero is the optimum. We therefore differentiate $E$ with respect to $\mathrm{\bar w}$ as follows:
$$\frac{\partial{E}}{\partial{\mathrm{\bar w}}}=\bar\phi^{\mathrm T}(\mathrm X)[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]=0\tag{11}$$

$$\bar\phi^{\mathrm T}(\mathrm X)\bar\phi(\mathrm X)\mathrm{\bar w}=\bar\phi^{\mathrm T}(\mathrm X)\mathrm y\;\rightarrow\;\mathrm{\bar w}=[\bar\phi^{\mathrm T}(\mathrm X)\bar\phi(\mathrm X)]^{-1}\bar\phi^{\mathrm T}(\mathrm X)\mathrm y\tag{12}$$

Equation (12) shows that, given the training data $\mathrm X$, we can compute the optimal $\mathrm{\bar w}$ directly. Note that this requires inverting a matrix, which is computationally expensive and therefore unsuitable when the training set is large. In that case we can use gradient descent instead. Note: in an earlier article I explained, from a linear-algebra point of view, how Eq. (12) is obtained; for details, see 【線性代數】最小二乘與投影矩陣.
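
In practice Eq. (12) is usually evaluated without forming the explicit inverse; the sketch below uses np.linalg.lstsq, which minimizes $\|\bar\phi(\mathrm X)\bar{\mathrm w}-\mathrm y\|^2$ directly and is numerically more robust (the appendix code keeps the explicit inverse to stay close to Eq. (12)). The data here are synthetic:

    import numpy as np

    # Solve the least-squares problem of Eq. (9)/(12) with np.linalg.lstsq
    # instead of inv(Phi^T Phi) @ Phi^T y.
    rng = np.random.default_rng(0)
    Phi_bar = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])   # synthetic design matrix
    y = Phi_bar @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=50)

    w_bar, *_ = np.linalg.lstsq(Phi_bar, y, rcond=None)
    print(w_bar)    # close to [1, 2, -0.5]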


Gradient Descent Method

Gradient descent finds the global optimum for convex problems and a local optimum for non-convex problems. The idea of the algorithm is illustrated in Figures 5 and 6:

Figure 5
Figure 6
  • In the left plot (Figure 5), the gradient is $\frac{d\hat y}{dx}=x-2$. When $x<2$ the gradient is negative, so $x$ should move to the right to decrease the function value (the negative gradient direction); when $x>2$ the gradient is positive, so $x$ should move to the left (again the negative gradient direction).
  • In the right plot (Figure 6), the function is not convex, so gradient descent ends up in a local optimum (assuming the initial value $x=0$). With the initial value $x=7$ we reach the global optimum. The initial value therefore has a large influence on gradient descent; we can reduce the risk of getting stuck in a local optimum by choosing the initial value randomly.

Using the gradient expression obtained in (11), each iteration of gradient descent proceeds as follows:
$$\bar{\mathrm w}^{t+1}=\bar{\mathrm w}^{t}-\eta\frac{\partial{E}}{\partial{\mathrm{\bar w}}}=\bar{\mathrm w}^{t}-\eta\bar\phi^{\mathrm T}(\mathrm X)[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]\tag{13}$$
Expanding the matrix product in Eq. (13) gives
$$\omega_j^{t+1}=\omega_j^t-\eta\sum\limits_{i=1}^{N}{\Big[\omega_0+\sum\limits_{m=1}^{M}{\omega_m\phi_m(\mathrm{x}_i)}-y_i\Big]\phi_j(\mathrm x_i)}\tag{14}$$
Equation (13), or equivalently (14), is the standard gradient descent method, where $\eta$ is the step size of each iteration; a short code sketch follows the remarks below.

  • η\eta較小時,迭代較慢,當時可以保證收斂到最優解(凸函數的情況下);η\eta較大時,函數值下降較快,但容易發生震盪。
  • 每次迭代時,需要使用所有的樣本點xi,i=1,2,,N\mathrm x_i,i=1,2,\cdots,N。當數據樣本點非常大時,開銷十分大。
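
A compact sketch of the batch update of Eq. (13), with illustrative values for the step size and iteration count (the full experiment is in the appendix):

    import numpy as np

    # Batch gradient descent, Eq. (13): every iteration uses all N rows of Phi_bar.
    def gradient_descent(Phi_bar, y, eta, num_iter):
        w_bar = np.zeros(Phi_bar.shape[1])
        for _ in range(num_iter):
            grad = Phi_bar.T @ (Phi_bar @ w_bar - y)   # gradient of Eq. (11) (up to a factor of 2)
            w_bar = w_bar - eta * grad                 # update of Eq. (13)
        return w_bar

    # The three points of the earlier example, with a bias column:
    Phi_bar = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0.8, 2.0, 3.1])
    print(gradient_descent(Phi_bar, y, eta=0.05, num_iter=5000))   # close to the least-squares fit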

To avoid using all samples in every iteration, stochastic gradient descent was proposed; its update rule is:
$$\omega_j^{t+1}=\omega_j^t-\eta\Big[\omega_0+\sum\limits_{m=1}^{M}{\omega_m\phi_m(\mathrm{x}_i)}-y_i\Big]\phi_j(\mathrm x_i)\tag{15}$$
Stochastic gradient descent is also known as sequential gradient descent. It suits real-time settings, i.e. scenarios where the whole data set $\mathrm x_i$ is not available at once but predictions still have to be made. Compared with gradient descent (14), the stochastic variant updates using only the current sample, so its behaviour is more random; this randomness may allow it to escape local optima that would trap standard gradient descent.
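
A sketch of the stochastic update of Eq. (15): each step uses a single, randomly chosen sample. The step size, number of passes, and random seed are illustrative choices only:

    import numpy as np

    # Stochastic gradient descent, Eq. (15): one randomly chosen sample per update.
    def sgd(Phi_bar, y, eta=0.01, num_epochs=2000, seed=0):
        rng = np.random.default_rng(seed)
        w_bar = np.zeros(Phi_bar.shape[1])
        for _ in range(num_epochs):
            for i in rng.permutation(len(y)):           # visit the samples in random order
                err = Phi_bar[i] @ w_bar - y[i]         # residual of sample i
                w_bar = w_bar - eta * err * Phi_bar[i]  # single-sample update
        return w_bar

    Phi_bar = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0.8, 2.0, 3.1])
    print(sgd(Phi_bar, y))   # also approaches the least-squares fit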


Algorithm Implementation

Here we use the Boston house-price data set from sklearn, which has 13 features and 506 samples. For simplicity, we only take the first 2 features as input ($M=D=2$, $\hat y=\omega_0+\omega_1 x^{(1)}+\omega_2 x^{(2)}$), use the first 500 samples for training and the last 6 samples for prediction. The implementation covers both the normal equation method and gradient descent. Moreover, since $x^{(1)}$ and $x^{(2)}$ have very different value ranges, we also consider feature scaling. We therefore implement the four combinations [features unscaled (scaled) + normal equation (gradient descent)].
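
As an independent sanity check (not one of the four implementations above), scikit-learn's own LinearRegression can be fitted to the same two features. Note that load_boston was removed in scikit-learn 1.2, so this assumes an older version:

    from sklearn.datasets import load_boston             # requires scikit-learn < 1.2
    from sklearn.linear_model import LinearRegression

    data, target = load_boston(return_X_y=True)
    X_train, y_train = data[:500, :2], target[:500]       # first 500 samples, first 2 features
    X_test, y_test = data[500:, :2], target[500:]         # last 6 samples

    reg = LinearRegression().fit(X_train, y_train)        # same least-squares problem as Eq. (9)
    print(reg.intercept_, reg.coef_)                      # omega_0, (omega_1, omega_2)
    print(reg.predict(X_test))                            # predictions for the 6 held-out samples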


Results

Figure 7 shows the results of the four algorithms described above:

Figure 7

In Figure 7, E_train is the training error, i.e. the error between the true and predicted values of the first 500 samples, and E_test is the prediction error, i.e. the error between the true and predicted values of the last 6 samples. Since the error function is convex in the parameters w, we always reach the optimum, i.e. the minimum training error, so the training errors of the four methods are identical.


Feature Scaling and Gradient Descent

Figure 7 attains the minimal error value because the objective $E$ is a convex function of the parameters $\omega_1$ and $\omega_2$. For convenience, we write out $E$ for this concrete example:
$$E(\omega_0,\omega_1,\omega_2)=\sum\limits_{i=1}^{500}(\omega_0+\omega_1 x_i^{(1)}+\omega_2 x_i^{(2)}-y_i)^2\tag{16}$$
In Eq. (16), $\omega_0$ does not depend on any particular sample; its value does not change the shape of $E$ but merely shifts it, so we set $\omega_0=0$ here. With the Boston house-price data, i.e. with $x_i^{(1)},x_i^{(2)},y_i$ given, we can then draw the contour plot of Eq. (16), shown in Figure 8.

Figure 8
Figure 9
  • Figure 8 shows that $E$ changes quickly as $\omega_2$ changes (the contour lines are sparser along the $\omega_2$ direction). This is because the coefficient of $\omega_2$ is $x^{(2)}$, which takes much larger values than $x^{(1)}$. This situation is very unfriendly to gradient descent: it easily overshoots the optimum, so the step size has to be very small, which in turn makes convergence slow. In this example the step size can be at most $\eta=5\times10^{-6}$, and roughly 30000 iterations are needed to converge to the optimum, as shown in Figure 10.
  • Feature scaling is a way to cure the slow convergence of gradient descent in this situation. The scaling formulas are very simple and are not repeated here; we directly use the preprocessing.StandardScaler() function from the sklearn library to scale the samples (a small sketch of the standardization formula is given after Figures 10 and 11). After scaling $x^{(1)}$ and $x^{(2)}$, we can draw the corresponding contour plot (Figure 9) and convergence plot (Figure 11) in the same way. After feature scaling, the contour lines in Figure 9 are roughly equally spaced in the $\omega_1$ and $\omega_2$ directions, and in Figure 11 the step size can be set much larger ($\eta=10^{-3}$), so convergence becomes very fast: only about 8 iterations are needed to reach the optimum.
Figure 10
Figure 11
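
For reference, the standardization used above is simply $z^{(j)} = (x^{(j)} - \mu^{(j)})/\sigma^{(j)}$ per feature; a small sketch with made-up rows shows that StandardScaler computes exactly this:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Standardization: subtract the per-feature mean and divide by the per-feature std.
    X = np.array([[0.02, 18.0],
                  [0.03,  0.0],
                  [0.06, 12.5]])                    # hypothetical rows with very different ranges
    scaler = StandardScaler().fit(X)
    print(scaler.transform(X))                      # scaled features
    print((X - X.mean(axis=0)) / X.std(axis=0))     # the same values, computed by hand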

Appendix

The Python source code for Figures 1–11 is given below:

  • Figures 1–2:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                'weight': 'normal',
                'size': 16,
                }
        return font
    
    
    # 2-d case
    omega_0 = 0
    omega_1 = 1
    data_train = [[0.5, 0.2], [1, 0.8], [1.5, 1.2], [2, 2], [2.5, 2.8], [3, 3.1], [3.5, 3.8]]
    x_train = [d[0] for d in data_train]
    y_train = [d[1] for d in data_train]
    
    x = np.linspace(0, 4, 30).reshape(30, 1)
    y = omega_1 * x + omega_0
    
    x_test = x_train
    y_test = y_train
    y_hat = omega_1 * x_test
    
    plt.figure()
    plt.plot(x, y, 'k-')
    for i in range(len(x_test)):
        plt.stem([x_test[i], ], [y_test[i], ], linefmt='rx', bottom=y_hat[i], basefmt='ko', markerfmt='C3o')
    # Set the labels
    font = LabelFormat(plt)
    plt.xlabel('$x$', font)
    plt.ylabel('$\hat y$', font)
    plt.title('$M=1,\omega_0=0,\omega_1=1$')
    plt.xlim(0, 4)
    plt.ylim(0, 4)
    plt.grid()
    
    plt.show()
    
    # 3-d case
    omega_0 = 2
    omega_1 = 0.25
    omega_2 = 0.5
    
    
    x1 = np.linspace(0, 4, 30).reshape(30, 1)
    x2 = np.linspace(0, 4, 30).reshape(30, 1)
    
    X1, X2 = np.meshgrid(x1, x2)
    y_hat = omega_0 + omega_1 * X1 + omega_2 * X2
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    
    x1_test=np.array([1,2,3])
    x2_test=np.array([1,2,3])
    X1_test, X2_test = np.meshgrid(x1_test, x2_test)
    
    y_test = omega_0 + omega_1 * X1_test + omega_2 * X2_test+8*np.random.rand(3,3)-4
    
    ax.plot_surface(X1, X2, y_hat, cmap='rainbow')
    
    for i in range(len(x1_test)):
        for j in range(len(x2_test)):
            y_predict= omega_0 + omega_1 * x1_test[i] + omega_2 * x2_test[j]
            ax.plot([x1_test[i],x1_test[i]],[x2_test[j],x2_test[j]],[y_test[i][j],y_predict],'r-o')
    
    # Set the labels
    font = LabelFormat(plt)
    ax.set_xlabel('$x^{(1)}$', font)
    ax.set_ylabel('$x^{(2)}$', font)
    ax.set_zlabel('$\hat y$', font)
    ax.set_xlim(0, 4)
    ax.set_ylim(0, 4)
    ax.set_zlim(0, 8)
    ax.set_xticks([0,1,2,3,4])
    ax.set_yticks([0,1,2,3,4])
    ax.set_title('$M=2,\omega_0=2,\omega_1=0.25,\omega_2=0.5$')
    
    # Customize the view angle so it's easier to see that the scatter points lie
    ax.view_init(elev=5., azim=-25)
    plt.show()
    
    
  • Figures 3–4:

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    import matplotlib as mpl
    import math
    
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                 'weight': 'normal',
                 'size': 16,
                 }
        return font
    
    
    # Plot the training points: different
    def PlotTrainPoint(X):
        for i in range(0, len(X)):
            plt.plot(X[i][0], X[i][1], 'rs', markersize=6, markerfacecolor="r")
    
    
    # Loss function--Square Error function
    def LossFunction(Y, predictedY):
        lengthY = len(Y)
        error = 0
        for i in range(lengthY):
            error += pow(Y[i] - predictedY[i], 2)
    
        return math.sqrt(error)
    
    
    trainData = [[1, 0.8], [2, 2], [3, 3.1]]
    
    # Predicted function: y = \omega_1*x + \omega_0. Here \omega_0 is assumed to be 0 for simplicity
    
    x = np.linspace(0, 4, 30).reshape(30, 1)
    omega_1 = np.linspace(0.5, 1.5, 41).reshape(41, 1)
    omega_0 = 0
    y_hat = []
    #Get the value of x and y in the trainData
    x_train = [d[0] for d in trainData]
    y_train = [d[1] for d in trainData]
    error_all = []
    
    # Plot the figure to show the function: y=\omega_1*x+\omega_0
    for i in range(len(omega_1)):
        y_hat.append(omega_1[i] * x)
        if omega_1[i]==0.5:
            plt.plot(x, y_hat[i],  color='cyan', alpha=1)
        elif omega_1[i]==1:
            plt.plot(x, y_hat[i], color='blue', alpha=1)
        elif omega_1[i]==1.5:
            plt.plot(x, y_hat[i], color='orange', alpha=1)
        else:
            plt.plot(x, y_hat[i], color='black', alpha=0.3)
        # Compute the errors for each omega_1
        error_all.append(LossFunction(y_train, omega_1[i].T*x_train+omega_0))
    
    # Set the axis
    font=LabelFormat(plt)
    PlotTrainPoint(trainData)
    # Label the critical points
    plt.annotate('$\omega_1=1.5$', xy=(2.5, 2.5*1.5), xycoords='data',
                 xytext=(-35, 35), textcoords='offset points', color='orange', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='orange'))
    plt.annotate('$\omega_1=1$', xy=(2.5, 2.5*1), xycoords='data',
                 xytext=(-45, 95), textcoords='offset points', color='b', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='b'))
    plt.annotate('$\omega_1=0.5$', xy=(2.5, 2.5*0.5), xycoords='data',
                 xytext=(-75, 155), textcoords='offset points', color='cyan', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='cyan'))
    
    plt.annotate('$\omega_1=0.5\sim 1.5$', xy=(1, 2.2), xycoords='data',
                 xytext=(8, -125), textcoords='offset points', color='k', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='k'))
    plt.xlabel('$x$',font)
    plt.ylabel('$\hat y$',font)
    plt.xlim([0,3.2])
    plt.ylim([0,4.5])
    plt.show()
    
    # Show the error when omega_1 changes
    plt.figure()
    font=LabelFormat(plt)
    plt.plot(omega_1,error_all, 'k-s')
    error_min=min(error_all)
    index_min=error_all.index(error_min)
    print(index_min)
    # plot the error at the given three point
    plt.plot(omega_1[index_min],error_min,'bs')
    plt.plot(omega_1[0],error_all[0],'cyan',marker='s')
    plt.plot(omega_1[-1],error_all[-1],'orange',marker='s')
    
    plt.xlabel('$\omega_1$', font)
    plt.ylabel('Value of loss function', font)
    plt.show()
    
  • Figures 5–6:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                'weight': 'normal',
                'size': 16,
                }
        return font
    
    x = np.linspace(0, 4, 30).reshape(30, 1)
    y=(x-2)**2/2
    
    plt.figure()
    plt.plot(x,y,'k-')
    plt.plot(3.5,1.5**2/2,'ro')
    plt.annotate('$\\frac{dE}{dx}$', xy=(3.5, 1.5**2/2), xycoords='data',
                 xytext=(-60, -125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="<-",
                 connectionstyle="arc,rad=90", color='r'))
    
    plt.plot(0.5,1.5**2/2,'ro')
    plt.annotate('$\\frac{dE}{dx}$', xy=(0.5, 1.5**2/2), xycoords='data',
                 xytext=(48, -125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="<-",
                 connectionstyle="arc,rad=90", color='r'))
    
    plt.annotate('$\hat y=\\frac{1}{2}(x-2)^2$', xy=(0.25, 1.75**2/2), xycoords='data',
                 xytext=(108, 0), textcoords='offset points',color='k', fontsize=14, arrowprops=dict(arrowstyle="<-",
                 connectionstyle="arc,rad=90", color='w'))
    # Set the labels
    font = LabelFormat(plt)
    plt.xlabel('$x$', font)
    plt.ylabel('$\hat y$', font)
    plt.show()
    
    # To plot figure 6
    x1 = np.linspace(0, 5/4.0*np.pi, 50).reshape(50, 1)
    y1=np.cos(x1)
    
    x2 = np.linspace(5/4.0*np.pi, 8, 50).reshape(50, 1)
    y2=0.5*np.cos(2*x2+1*np.pi)-0.71
    
    plt.figure()
    plt.plot(x1,y1,'k-')
    plt.plot(x2,y2,'k-')
    plt.plot(np.pi,-1,'ro')
    plt.annotate('Local optimal', xy=(np.pi, -1), xycoords='data',
                 xytext=(-48, 125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='r'))
    
    plt.plot(np.pi*2,-1.21,'ro')
    plt.annotate('Global optimal', xy=(2*np.pi, -1.21), xycoords='data',
                 xytext=(-48, 125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='r'))
    
    # Set the labels
    font = LabelFormat(plt)
    plt.xlabel('$x$', font)
    plt.ylabel('$\hat y$', font)
    plt.show()
    
    
  • Figures 7–11:

    # -*- coding: utf-8 -*-
    # @Time : 2020/4/7 11:28
    # @Author : tengweitw
    
    import numpy as np
    from sklearn.datasets import load_boston
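    # Note: load_boston was removed in scikit-learn 1.2, so this script requires
    # an older scikit-learn version (< 1.2).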
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn import preprocessing
    
    
    def Linear_regression_normal_equation(train_data, train_target, test_data, test_target):
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([np.size(train_data, 0), 1])
        # X is a 500*(1+2)-dim matrix
        X = np.concatenate((temp, train_data), axis=1)
    
        # Normal equation
        w_bar = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, train_target))
    
        # Training Error
        y_predict_train = np.matmul(X, w_bar)
        E_train = np.linalg.norm(y_predict_train - train_target)/len(y_predict_train)
    
        # Predicting
        x0 = np.ones((np.size(test_data, 0), 1))
        test_data1 = np.concatenate((x0, test_data), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test - test_target)/len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    
    def Linear_regression_normal_equation_scale(train_data, train_target, test_data, test_target):
        # Data processing: scaling
        # For training data
        ss = preprocessing.StandardScaler()
        ss.partial_fit(train_data)
        train_data_scale = ss.fit_transform(train_data)
        # For testing data
        ss.partial_fit(test_data)
        test_data_scale = ss.fit_transform(test_data)
    
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([np.size(train_data_scale, 0), 1])
        # X is a 500*(1+2)-dim matrix
        X = np.concatenate((temp, train_data_scale), axis=1)
    
        # Normal equation
        w_bar = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, train_target))
    
        # Training Error
        y_predict_train = np.matmul(X, w_bar)
        E_train = np.linalg.norm(y_predict_train - train_target) / len(y_predict_train)
    
        # Predicting
        x0 = np.ones((np.size(test_data_scale, 0), 1))
        test_data1 = np.concatenate((x0, test_data_scale), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test - test_target) / len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    
    def Linear_regression_gradient_descend(train_data, train_target, test_data, test_target):
        # learning rate
        eta = 5e-6
        M = np.size(train_data, 1)
        N = np.size(train_data, 0)
        w_bar = np.zeros((M + 1, 1))
    
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([N, 1])
        # X is a N*(1+M)-dim matrix
        X = np.concatenate((temp, train_data), axis=1)
        train_target = np.mat(train_target).T
    
        iter = 0
        num_iter = 5000
        E_train = np.zeros((num_iter, 1))
    
        while iter < num_iter:
            temp = np.matmul(X, w_bar) - train_target
            w_bar = w_bar - eta * np.matmul(X.T, temp)
            # Predicting training data
            y_predict_train = np.matmul(X, w_bar)
            # Training Error
            E_train[iter]=np.linalg.norm(y_predict_train - train_target)/len(y_predict_train)
            iter += 1
    
        # Predicting
        x0 = np.ones((np.size(test_data, 0), 1))
        test_data1 = np.concatenate((x0, test_data), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test.ravel()- test_target)/len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    def Linear_regression_gradient_descend_scale(train_data, train_target, test_data, test_target):
        # Data processing: scaling
        # For training data
        ss = preprocessing.StandardScaler()
        ss.partial_fit(train_data)
        train_data_scale = ss.fit_transform(train_data)
        # For testing data
        ss.partial_fit(test_data)
        test_data_scale = ss.fit_transform(test_data)
    
        # learning rate
        eta = 1e-3
        M = np.size(train_data_scale, 1)
        N = np.size(train_data_scale, 0)
        w_bar = np.zeros((M + 1, 1))
    
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([N, 1])
        # X is a N*(1+M)-dim matrix
        X = np.concatenate((temp, train_data_scale), axis=1)
        train_target = np.mat(train_target).T
    
        iter = 0
        num_iter = 10
        E_train = np.zeros((num_iter, 1))
    
        while iter < num_iter:
            temp = np.matmul(X, w_bar) - train_target
            w_bar = w_bar - eta * np.matmul(X.T, temp)
            # Predicting training data
            y_predict_train = np.matmul(X, w_bar)
            # Training Error
            E_train[iter]=np.linalg.norm(y_predict_train - train_target)/len(y_predict_train)
            iter += 1
        # Predicting
        x0 = np.ones((np.size(test_data_scale, 0), 1))
        test_data1 = np.concatenate((x0, test_data_scale), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test.ravel()- test_target)/len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                'weight': 'normal',
                'size': 16,
                }
        return font
    
    def Plot_error_vs_omega(train_data,train_target):
        # ---------Show the contour of E with respect to omegas---------------------
        x1 = train_data[:, 0]
        x2 = train_data[:, 1]
        omega_1 = np.linspace(-30, 30, 30)
        omega_2 = np.linspace(-30, 30, 30)
    
        Y_hat = np.zeros((len(omega_1),len( omega_2)))
        for i in range(len(omega_1)):
            for j in range(len(omega_2)):
                for k in range(len(train_data)):
                    temp=train_target[k] - (omega_1[i] * x1[k] + omega_2[j] * x2[k])
                    Y_hat[i][j] = Y_hat[i][j] + np.square(temp)
    
        fig = plt.figure()
    
        plt.contour(omega_2,omega_1,Y_hat,20)
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('$\omega_1$', font)
        plt.ylabel('$\omega_2$', font)
    
        plt.show()
    
    def Plot_error_vs_omega_scale(train_data, train_target):
        # ---------Show the contour of E with respect to omegas---------------------
        # Data processing: scaling
        # For training data
        ss = preprocessing.StandardScaler()
        ss.partial_fit(train_data)
        train_data_scale = ss.fit_transform(train_data)
    
        x1 = train_data_scale[:, 0]
        x2 = train_data_scale[:, 1]
        omega_1 = np.linspace(-30, 30, 30)
        omega_2 = np.linspace(-30, 30, 30)
    
        Y_hat = np.zeros((len(omega_1), len(omega_2)))
        for i in range(len(omega_1)):
            for j in range(len(omega_2)):
                for k in range(len(train_data_scale)):
                    temp = train_target[k] - (omega_1[i] * x1[k] + omega_2[j] * x2[k])
                    Y_hat[i][j] = Y_hat[i][j] + np.square(temp)
    
        fig = plt.figure()
    
        plt.contour(omega_2, omega_1, Y_hat, 20)
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('$\omega_1$', font)
        plt.ylabel('$\omega_2$', font)
    
        plt.show()
    
    
    if __name__ == '__main__':
    
        # load house price of Boston
        data, target = load_boston(return_X_y=True)
        # The number of selected features
        M = 2
        # The first 500 data for training
        train_data = data[0:500, 0:0 + M]
        train_target = target[0:500]
        train_target.reshape(len(train_data), 1)
    
        # ------------------------------
        # The last 6 data for testing
        test_data = data[500:, 0:0 + M]
        test_target = target[500:]
    
        # To show the contour of error function E with respect to omega
        # We can see that it's a convex function, not easy for gradient descend
        Plot_error_vs_omega(train_data, train_target)
        Plot_error_vs_omega_scale(train_data, train_target)
    
        #---------------------------------#
        y_predict_normal_equation, E_train,E_test = Linear_regression_normal_equation(train_data, train_target, test_data,
                                                                             test_target)
        print("Linear Regression Using Normal Equation: E_train=%f, E_test=%f" % (E_train,E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_normal_equation[i]))
    
    
        # ---------------------------------#
        y_predict_normal_equation_scale, E_train,E_test = Linear_regression_normal_equation_scale(train_data, train_target,
                                                                                         test_data, test_target)
        print("Linear Regression Using Normal Equation with scaling: E_train=%f, E_test=%f" % (E_train,E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_normal_equation_scale[i]))
    
    
        # ---------------------------------#
        y_predict_gradient_descent, E_train,E_test = Linear_regression_gradient_descend(train_data, train_target, test_data,
                                                                               test_target)
        print("Linear Regression Using Gradient Descend: E_train=%f, E_test=%f" % (E_train[-1],E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_gradient_descent[i]))
    
        plt.figure()
        plt.plot(E_train,'r-')
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('Iteration', font)
        plt.ylabel('Average error: $E/N$', font)
        plt.show()
    
    
        # ---------------------------------#
        y_predict_gradient_descent_scale, E_train,E_test = Linear_regression_gradient_descend_scale(train_data, train_target,
                                                                                           test_data, test_target)
        print("Linear Regression Using Gradient Descend with scaling: E_train=%f, E_test=%f" % (E_train[-1],E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_gradient_descent_scale[i]))
        plt.figure()
        plt.plot(E_train,'r-')
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('Iteration', font)
        plt.ylabel('Average error: $E/N$', font)
    
        plt.show()
    
    