[Machine Learning Illustrated with Examples] Linear Regression

Linear regression is to machine learning what Hello World is to programming languages, and what MNIST is to deep learning.

First, let us define the mathematical notation used throughout this article:

| Notation | Meaning |
| --- | --- |
| $M$ | Number of parameters in $\mathrm w$ |
| $N$ | Number of instances |
| $\mathrm X=\{\mathrm x_1,\mathrm x_2,\cdots,\mathrm x_N\}^{\mathrm T}$ | $N\times M$ matrix of training data |
| $D$ | Number of features |
| $\mathrm y=\{y_1,y_2,\cdots,y_N\}^{\mathrm T}$ | Set of targets |
| $y_i$ | Target of instance $i$ |
| $\mathrm x_i=\{x_i^{(1)},x_i^{(2)},\cdots,x_i^{(D)}\}^{\mathrm T}$ | Set of features of instance $i$ |
| $x_i^{(j)}$ | Feature $j$ of instance $i$ |
| $\mathrm w=\{\omega_1,\omega_2,\cdots,\omega_M\}^{\mathrm T}$ | Weights of input $\mathrm x$ |
| $\omega_i$ | Weight of feature $i$ |
| $\phi=\{\phi_1,\phi_2,\cdots,\phi_M\}^{\mathrm T}$ | Set of basis functions |
| $\phi_i(\mathrm x)$ | Function of the features |


Model Description

In linear regression, the target is assumed to depend linearly on the parameters $\mathrm w=\{\omega_n\}$; we construct a loss function $E$ and solve for the parameters that minimize it. In other words, linear regression tries to learn the following function:
$$\hat y=\omega_0+\sum\limits_{j=1}^{M}\omega_j\phi_j(\mathrm x)=\omega_0+\mathrm w^{\mathrm T}\phi(\mathrm x)\tag{1}$$
Equation (1) is the general form of the linear regression model, which is not very intuitive at first sight. Its common special cases are as follows:

  • When $D=1,\ \phi_j(x)=x^j$, Eq. (1) can be written as:
    $$\hat y=\omega_0+\omega_1x+\omega_2x^2+\cdots+\omega_Mx^M\tag{2}$$
    In this case, linear regression becomes polynomial regression.

  • When $D=M,\ \phi_j(\mathrm x)=x^{(j)}$, Eq. (1) can be written as:
    $$\hat y=\omega_0+\omega_1x^{(1)}+\omega_2x^{(2)}+\cdots+\omega_Mx^{(M)}\tag{3}$$
    In this case, linear regression becomes what we usually mean by linear regression: a linear equation in several variables. With only one feature ($M=1$), we obtain the familiar straight-line equation from school algebra,

$$\hat y=\omega_1x+\omega_0\tag{4}$$
To keep this article easy to follow, the simulations below use this single-variable linear equation to illustrate linear regression, unless stated otherwise.
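
As a concrete illustration (not part of the original article's code), the following minimal Python sketch evaluates Eq. (1) with monomial basis functions $\phi_j(x)=x^j$, i.e. the polynomial-regression case of Eq. (2); the weights are arbitrary example values:

    import numpy as np

    # Minimal sketch of Eq. (1): y_hat = omega_0 + w^T phi(x),
    # here with monomial basis functions phi_j(x) = x**j (Eq. (2)).
    def predict(x, omega_0, w):
        phi = np.array([x ** j for j in range(1, len(w) + 1)])  # phi_1(x), ..., phi_M(x)
        return omega_0 + w @ phi

    # Arbitrary example weights: y_hat = 1 + 0.5*x + 0.25*x^2
    print(predict(2.0, omega_0=1.0, w=np.array([0.5, 0.25])))   # prints 3.0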


Cost Function

The goal of linear regression is to make the error between the prediction $\hat y$ and the true value $y$ as small as possible. This error can be measured with different distance metrics; here we use the sum of squares. The cost function can then be written as
$$E=\sum\limits_{i=1}^N{(\hat y_i-y_i)^2}=\sum\limits_{i=1}^N{(\omega_0+\mathrm w^{\mathrm T}\phi(\mathrm x_i)-y_i)^2}=\sum\limits_{i=1}^{N}{\Big[\omega_0+\sum\limits_{j=1}^{M}{\omega_j\phi_j(\mathrm{x}_i)}-y_i\Big]^2}\tag{5}$$
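
A one-line sketch of Eq. (5), assuming the predictions $\hat y_i$ have already been computed and both $\hat{\mathrm y}$ and $\mathrm y$ are NumPy arrays (the values below are made up, only to show the call):

    import numpy as np

    # Eq. (5): the cost is the sum of squared residuals between predictions and targets.
    def cost(y_hat, y):
        return np.sum((y_hat - y) ** 2)

    print(cost(np.array([1.0, 2.1, 2.9]), np.array([0.8, 2.0, 3.1])))  # 0.09
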
Below we illustrate the cost function in two dimensions ($M=1$) and three dimensions ($M=2$). Here we assume $\phi_i(\mathrm x)=x^{(i)}$ and that $\omega_0,\omega_1,\omega_2$ are known, so Eq. (1) becomes, respectively:
$$\hat y=\omega_0+\omega_1x^{(1)}\tag{6}$$

$$\hat y=\omega_0+\omega_1x^{(1)}+\omega_2x^{(2)}\tag{7}$$

From Eqs. (6) and (7) we obtain the straight line in Figure 1 and the plane in Figure 2:

Figure 1
Figure 2

The red points in Figures 1 and 2 are the true values $y$ corresponding to $\mathrm x$, and the red segments are the errors.


An Example

Figures 1 and 2 show the errors between the true values $y$ and the predictions $\hat y$ for given parameters $\omega_0,\omega_1,\omega_2$. Different parameters yield different errors, and the goal of linear regression is to find the set of parameters that makes the error smallest. Figures 3 and 4 below illustrate this:

Suppose the training set contains three data points $(x, y)$: $(1, 0.8)$, $(2, 2)$, $(3, 3.1)$. We use simple one-variable linear regression, i.e. Eq. (6); the goal is then to find a line $\hat y=\omega_1x+\omega_0$ that passes as close as possible to these three points.

Figure 3
Figure 4

Figure 3 plots the line $\hat y=\omega_1x+\omega_0$ for $\omega_0=0$ and $\omega_1=0.5\sim1.5$. Figure 4 shows how the value of the cost function changes as $\omega_1$ varies. From Figures 3 and 4 we can see that the cost is smallest when $\omega_1=1$.
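
The effect shown in Figures 3 and 4 can be checked numerically with a short sketch (it sweeps $\omega_1$ over the same grid and uses the sum of squared errors; the appendix code for Figure 4 uses the square root of this quantity, which has the same minimizer):

    import numpy as np

    # Sweep omega_1 (omega_0 fixed to 0) over 0.5..1.5 on the three training points
    # and report the value with the smallest sum of squared errors.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([0.8, 2.0, 3.1])
    omega_1_grid = np.linspace(0.5, 1.5, 41)
    errors = [np.sum((w1 * x - y) ** 2) for w1 in omega_1_grid]
    print(omega_1_grid[int(np.argmin(errors))])   # 1.0, consistent with Figure 4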


Normal Equation and Gradient Descent

Linear regression essentially amounts to solving the following optimization problem:
$$\min\limits_{\mathrm w}\quad E=\sum\limits_{i=1}^{N}{\Big[\omega_0+\sum\limits_{j=1}^{M}{\omega_j\phi_j(\mathrm{x}_i)}-y_i\Big]^2}\tag{8}$$
Let $\bar{\mathrm w}=\{\omega_0,\mathrm w\}$, $\bar{\phi}=\{\phi_0,\phi\}$, $\phi_0(\mathrm x)=1$, and write problem (8) in matrix-vector form:
$$\min\limits_{\mathrm {\bar w}}\quad E=[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]^{\mathrm T}[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]\tag{9}$$
In Eq. (9), $\bar{\phi}(\mathrm X)$ is an $N\times (M+1)$ matrix:
$$\bar\phi(\mathrm X)=\begin{pmatrix} \phi_0(\mathrm x_1) & \phi_1(\mathrm x_1) & \cdots & \phi_M(\mathrm x_1)\\ \phi_0(\mathrm x_2) & \phi_1(\mathrm x_2) & \cdots & \phi_M(\mathrm x_2)\\ \vdots & \vdots & \ddots &\vdots \\ \phi_0(\mathrm x_N) & \phi_1(\mathrm x_N) & \cdots & \phi_M(\mathrm x_N) \end{pmatrix}\tag{10}$$
By computing the Hessian matrix of the objective in (8), one can verify that this is a convex optimization problem. The problem then becomes straightforward and can be handed to off-the-shelf solvers such as CVX, CPLEX, or MATLAB. These solvers generally rely on gradient-based methods (explained later). Of course, we can also exploit convexity to obtain a closed-form solution.
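
For the identity basis $\phi_j(\mathrm x)=x^{(j)}$ of Eq. (3), the matrix in Eq. (10) is simply the data matrix with a leading column of ones; a small sketch (with made-up numbers) is:

    import numpy as np

    # Design matrix of Eq. (10) for phi_0(x) = 1 and phi_j(x) = x^(j):
    # prepend a column of ones to the N x M data matrix, giving an N x (M+1) matrix.
    X = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5]])      # hypothetical data: N = 3 samples, M = 2 features
    Phi_bar = np.hstack([np.ones((X.shape[0], 1)), X])
    print(Phi_bar.shape)            # (3, 3)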


Normal Equation Method

Since the error function (8) is convex, the point where its derivative equals zero is the optimum. We therefore differentiate $E$ with respect to $\mathrm{\bar w}$ as follows:
$$\frac{\partial{E}}{\partial{\mathrm{\bar w}}}=\bar\phi^{\mathrm T}(\mathrm X)[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]=0\tag{11}$$

$$\bar\phi^{\mathrm T}(\mathrm X)\bar\phi(\mathrm X)\mathrm{\bar w}=\bar\phi^{\mathrm T}(\mathrm X)\mathrm y\;\rightarrow\;\mathrm{\bar w}=[\bar\phi^{\mathrm T}(\mathrm X)\bar\phi(\mathrm X)]^{-1}\bar\phi^{\mathrm T}(\mathrm X)\mathrm y\tag{12}$$

Equation (12) shows that, given the training data $\mathrm X$, we can compute the optimal $\mathrm{\bar w}$ directly. Note that this requires inverting a matrix, which is computationally expensive and therefore unsuitable when the training set is large. In that case we can use gradient descent instead. Note: in an earlier article I explained, from a linear-algebra point of view, how Eq. (12) is obtained; for details, see 【線性代數】最小二乘與投影矩陣.
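
In practice Eq. (12) is usually evaluated without forming the explicit inverse; the sketch below uses np.linalg.lstsq, which minimizes $\|\bar\phi(\mathrm X)\bar{\mathrm w}-\mathrm y\|^2$ directly and is numerically more robust (the appendix code keeps the explicit inverse to stay close to Eq. (12)). The data here are synthetic:

    import numpy as np

    # Solve the least-squares problem of Eq. (9)/(12) with np.linalg.lstsq
    # instead of inv(Phi^T Phi) @ Phi^T y.
    rng = np.random.default_rng(0)
    Phi_bar = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])   # synthetic design matrix
    y = Phi_bar @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=50)

    w_bar, *_ = np.linalg.lstsq(Phi_bar, y, rcond=None)
    print(w_bar)    # close to [1, 2, -0.5]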


Gradient Descent Method

Gradient descent finds the global optimum for convex problems and a local optimum for non-convex problems. The idea of the algorithm is illustrated in Figures 5 and 6:

Figure 5
Figure 6
  • In the left plot (Figure 5), the gradient is $\frac{d\hat y}{dx}=x-2$. When $x<2$ the gradient is negative, so $x$ should move to the right to decrease the function value (the negative gradient direction); when $x>2$ the gradient is positive, so $x$ should move to the left (again the negative gradient direction).
  • In the right plot (Figure 6), the function is not convex, so gradient descent ends up in a local optimum (assuming the initial value $x=0$). With the initial value $x=7$ we reach the global optimum. The initial value therefore has a large influence on gradient descent; we can reduce the risk of getting stuck in a local optimum by choosing the initial value randomly.

Using the gradient expression obtained in (11), each iteration of gradient descent proceeds as follows:
$$\bar{\mathrm w}^{t+1}=\bar{\mathrm w}^{t}-\eta\frac{\partial{E}}{\partial{\mathrm{\bar w}}}=\bar{\mathrm w}^{t}-\eta\bar\phi^{\mathrm T}(\mathrm X)[\bar\phi(\mathrm X)\mathrm{\bar w}-\mathrm y]\tag{13}$$
Expanding the matrix product in Eq. (13) gives
$$\omega_j^{t+1}=\omega_j^t-\eta\sum\limits_{i=1}^{N}{\Big[\omega_0+\sum\limits_{m=1}^{M}{\omega_m\phi_m(\mathrm{x}_i)}-y_i\Big]\phi_j(\mathrm x_i)}\tag{14}$$
Equation (13), or equivalently (14), is the standard gradient descent method, where $\eta$ is the step size of each iteration; a short code sketch follows the remarks below.

  • η\eta較小時,迭代較慢,當時可以保證收斂到最優解(凸函數的情況下);η\eta較大時,函數值下降較快,但容易發生震盪。
  • 每次迭代時,需要使用所有的樣本點xi,i=1,2,,N\mathrm x_i,i=1,2,\cdots,N。當數據樣本點非常大時,開銷十分大。
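
A compact sketch of the batch update of Eq. (13), with illustrative values for the step size and iteration count (the full experiment is in the appendix):

    import numpy as np

    # Batch gradient descent, Eq. (13): every iteration uses all N rows of Phi_bar.
    def gradient_descent(Phi_bar, y, eta, num_iter):
        w_bar = np.zeros(Phi_bar.shape[1])
        for _ in range(num_iter):
            grad = Phi_bar.T @ (Phi_bar @ w_bar - y)   # gradient of Eq. (11) (up to a factor of 2)
            w_bar = w_bar - eta * grad                 # update of Eq. (13)
        return w_bar

    # The three points of the earlier example, with a bias column:
    Phi_bar = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0.8, 2.0, 3.1])
    print(gradient_descent(Phi_bar, y, eta=0.05, num_iter=5000))   # close to the least-squares fit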

To avoid using all samples in every iteration, stochastic gradient descent was proposed; its update rule is:
$$\omega_j^{t+1}=\omega_j^t-\eta\Big[\omega_0+\sum\limits_{m=1}^{M}{\omega_m\phi_m(\mathrm{x}_i)}-y_i\Big]\phi_j(\mathrm x_i)\tag{15}$$
Stochastic gradient descent is also known as sequential gradient descent. It suits real-time settings, i.e. scenarios where the whole data set $\mathrm x_i$ is not available at once but predictions still have to be made. Compared with gradient descent (14), the stochastic variant updates using only the current sample, so its behaviour is more random; this randomness may allow it to escape local optima that would trap standard gradient descent.
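
A sketch of the stochastic update of Eq. (15): each step uses a single, randomly chosen sample. The step size, number of passes, and random seed are illustrative choices only:

    import numpy as np

    # Stochastic gradient descent, Eq. (15): one randomly chosen sample per update.
    def sgd(Phi_bar, y, eta=0.01, num_epochs=2000, seed=0):
        rng = np.random.default_rng(seed)
        w_bar = np.zeros(Phi_bar.shape[1])
        for _ in range(num_epochs):
            for i in rng.permutation(len(y)):           # visit the samples in random order
                err = Phi_bar[i] @ w_bar - y[i]         # residual of sample i
                w_bar = w_bar - eta * err * Phi_bar[i]  # single-sample update
        return w_bar

    Phi_bar = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0.8, 2.0, 3.1])
    print(sgd(Phi_bar, y))   # also approaches the least-squares fit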


Algorithm Implementation

Here we use the Boston house-price data set from sklearn, which has 13 features and 506 samples. For simplicity, we only take the first 2 features as input ($M=D=2$, $\hat y=\omega_0+\omega_1 x^{(1)}+\omega_2 x^{(2)}$), use the first 500 samples for training and the last 6 samples for prediction. The implementation covers both the normal equation method and gradient descent. Moreover, since $x^{(1)}$ and $x^{(2)}$ have very different value ranges, we also consider feature scaling. We therefore implement the four combinations [features unscaled (scaled) + normal equation (gradient descent)].
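
As an independent sanity check (not one of the four implementations above), scikit-learn's own LinearRegression can be fitted to the same two features. Note that load_boston was removed in scikit-learn 1.2, so this assumes an older version:

    from sklearn.datasets import load_boston             # requires scikit-learn < 1.2
    from sklearn.linear_model import LinearRegression

    data, target = load_boston(return_X_y=True)
    X_train, y_train = data[:500, :2], target[:500]       # first 500 samples, first 2 features
    X_test, y_test = data[500:, :2], target[500:]         # last 6 samples

    reg = LinearRegression().fit(X_train, y_train)        # same least-squares problem as Eq. (9)
    print(reg.intercept_, reg.coef_)                      # omega_0, (omega_1, omega_2)
    print(reg.predict(X_test))                            # predictions for the 6 held-out samples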


Results

Figure 7 shows the results of the four algorithms described above:

Figure 7

In Figure 7, E_train is the training error, i.e. the error between the true and predicted values of the first 500 samples, and E_test is the prediction error, i.e. the error between the true and predicted values of the last 6 samples. Since the error function is convex in the parameters w, we always reach the optimum, i.e. the minimum training error, so the training errors of the four methods are identical.


Feature Scaling and Gradient Descent

Figure 7 attains the minimal error value because the objective $E$ is a convex function of the parameters $\omega_1$ and $\omega_2$. For convenience, we write out $E$ for this concrete example:
$$E(\omega_0,\omega_1,\omega_2)=\sum\limits_{i=1}^{500}(\omega_0+\omega_1 x_i^{(1)}+\omega_2 x_i^{(2)}-y_i)^2\tag{16}$$
In Eq. (16), $\omega_0$ does not depend on any particular sample; its value does not change the shape of $E$ but merely shifts it, so we set $\omega_0=0$ here. With the Boston house-price data, i.e. with $x_i^{(1)},x_i^{(2)},y_i$ given, we can then draw the contour plot of Eq. (16), shown in Figure 8.

Figure 8
Figure 9
  • Figure 8 shows that $E$ changes quickly as $\omega_2$ changes (the contour lines are sparser along the $\omega_2$ direction). This is because the coefficient of $\omega_2$ is $x^{(2)}$, which takes much larger values than $x^{(1)}$. This situation is very unfriendly to gradient descent: it easily overshoots the optimum, so the step size has to be very small, which in turn makes convergence slow. In this example the step size can be at most $\eta=5\times10^{-6}$, and roughly 30000 iterations are needed to converge to the optimum, as shown in Figure 10.
  • Feature scaling is a way to cure the slow convergence of gradient descent in this situation. The scaling formulas are very simple and are not repeated here; we directly use the preprocessing.StandardScaler() function from the sklearn library to scale the samples (a small sketch of the standardization formula is given after Figures 10 and 11). After scaling $x^{(1)}$ and $x^{(2)}$, we can draw the corresponding contour plot (Figure 9) and convergence plot (Figure 11) in the same way. After feature scaling, the contour lines in Figure 9 are roughly equally spaced in the $\omega_1$ and $\omega_2$ directions, and in Figure 11 the step size can be set much larger ($\eta=10^{-3}$), so convergence becomes very fast: only about 8 iterations are needed to reach the optimum.
Figure 10
Figure 11
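
For reference, the standardization used above is simply $z^{(j)} = (x^{(j)} - \mu^{(j)})/\sigma^{(j)}$ per feature; a small sketch with made-up rows shows that StandardScaler computes exactly this:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Standardization: subtract the per-feature mean and divide by the per-feature std.
    X = np.array([[0.02, 18.0],
                  [0.03,  0.0],
                  [0.06, 12.5]])                    # hypothetical rows with very different ranges
    scaler = StandardScaler().fit(X)
    print(scaler.transform(X))                      # scaled features
    print((X - X.mean(axis=0)) / X.std(axis=0))     # the same values, computed by hand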

Appendix

The Python source code for Figures 1–11 is given below:

  • Figures 1–2:

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                'weight': 'normal',
                'size': 16,
                }
        return font
    
    
    # 2-d case
    omega_0 = 0
    omega_1 = 1
    data_train = [[0.5, 0.2], [1, 0.8], [1.5, 1.2], [2, 2], [2.5, 2.8], [3, 3.1], [3.5, 3.8]]
    x_train = [d[0] for d in data_train]
    y_train = [d[1] for d in data_train]
    
    x = np.linspace(0, 4, 30).reshape(30, 1)
    y = omega_1 * x + omega_0
    
    x_test = x_train
    y_test = y_train
    y_hat = omega_1 * x_test
    
    plt.figure()
    plt.plot(x, y, 'k-')
    for i in range(len(x_test)):
        plt.stem([x_test[i], ], [y_test[i], ], linefmt='rx', bottom=y_hat[i], basefmt='ko', markerfmt='C3o')
    # Set the labels
    font = LabelFormat(plt)
    plt.xlabel('$x$', font)
    plt.ylabel('$\hat y$', font)
    plt.title('$M=1,\omega_0=0,\omega_1=1$')
    plt.xlim(0, 4)
    plt.ylim(0, 4)
    plt.grid()
    
    plt.show()
    
    # 3-d case
    omega_0 = 2
    omega_1 = 0.25
    omega_2 = 0.5
    
    
    x1 = np.linspace(0, 4, 30).reshape(30, 1)
    x2 = np.linspace(0, 4, 30).reshape(30, 1)
    
    X1, X2 = np.meshgrid(x1, x2)
    y_hat = omega_0 + omega_1 * X1 + omega_2 * X2
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    
    x1_test=np.array([1,2,3])
    x2_test=np.array([1,2,3])
    X1_test, X2_test = np.meshgrid(x1_test, x2_test)
    
    y_test = omega_0 + omega_1 * X1_test + omega_2 * X2_test+8*np.random.rand(3,3)-4
    
    ax.plot_surface(X1, X2, y_hat, cmap='rainbow')
    
    for i in range(len(x1_test)):
        for j in range(len(x2_test)):
            y_predict= omega_0 + omega_1 * x1_test[i] + omega_2 * x2_test[j]
            ax.plot([x1_test[i],x1_test[i]],[x2_test[j],x2_test[j]],[y_test[i][j],y_predict],'r-o')
    
    # Set the labels
    font = LabelFormat(plt)
    ax.set_xlabel('$x^{(1)}$', font)
    ax.set_ylabel('$x^{(2)}$', font)
    ax.set_zlabel('$\hat y$', font)
    ax.set_xlim(0, 4)
    ax.set_ylim(0, 4)
    ax.set_zlim(0, 8)
    ax.set_xticks([0,1,2,3,4])
    ax.set_yticks([0,1,2,3,4])
    ax.set_title('$M=2,\omega_0=2,\omega_1=0.25,\omega_2=0.5$')
    
    # Customize the view angle so it's easier to see that the scatter points lie
    ax.view_init(elev=5., azim=-25)
    plt.show()
    
    
  • Figures 3–4:

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    import matplotlib as mpl
    import math
    
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                 'weight': 'normal',
                 'size': 16,
                 }
        return font
    
    
    # Plot the training points: different
    def PlotTrainPoint(X):
        for i in range(0, len(X)):
            plt.plot(X[i][0], X[i][1], 'rs', markersize=6, markerfacecolor="r")
    
    
    # Loss function--Square Error function
    def LossFunction(Y, predictedY):
        lengthY = len(Y)
        error = 0
        for i in range(lengthY):
            error += pow(Y[i] - predictedY[i], 2)
    
        return math.sqrt(error)
    
    
    trainData = [[1, 0.8], [2, 2], [3, 3.1]]
    
    # Predicted function: y = \omega_1*x + \omega_0. Here \omega_0 is assumed to be 0 for simplicity
    
    x = np.linspace(0, 4, 30).reshape(30, 1)
    omega_1 = np.linspace(0.5, 1.5, 41).reshape(41, 1)
    omega_0 = 0
    y_hat = []
    #Get the value of x and y in the trainData
    x_train = [d[0] for d in trainData]
    y_train = [d[1] for d in trainData]
    error_all = []
    
    # Plot the figure to show the function: y=\omega_1*x+\omega_0
    for i in range(len(omega_1)):
        y_hat.append(omega_1[i] * x)
        if omega_1[i]==0.5:
            plt.plot(x, y_hat[i],  color='cyan', alpha=1)
        elif omega_1[i]==1:
            plt.plot(x, y_hat[i], color='blue', alpha=1)
        elif omega_1[i]==1.5:
            plt.plot(x, y_hat[i], color='orange', alpha=1)
        else:
            plt.plot(x, y_hat[i], color='black', alpha=0.3)
        # Compute the errors for each omega_1
        error_all.append(LossFunction(y_train, omega_1[i].T*x_train+omega_0))
    
    # Set the axis
    font=LabelFormat(plt)
    PlotTrainPoint(trainData)
    # Label the critical points
    plt.annotate('$\omega_1=1.5$', xy=(2.5, 2.5*1.5), xycoords='data',
                 xytext=(-35, 35), textcoords='offset points', color='orange', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='orange'))
    plt.annotate('$\omega_1=1$', xy=(2.5, 2.5*1), xycoords='data',
                 xytext=(-45, 95), textcoords='offset points', color='b', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='b'))
    plt.annotate('$\omega_1=0.5$', xy=(2.5, 2.5*0.5), xycoords='data',
                 xytext=(-75, 155), textcoords='offset points', color='cyan', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='cyan'))
    
    plt.annotate('$\omega_1=0.5\sim 1.5$', xy=(1, 2.2), xycoords='data',
                 xytext=(8, -125), textcoords='offset points', color='k', fontsize=12, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='k'))
    plt.xlabel('$x$',font)
    plt.ylabel('$\hat y$',font)
    plt.xlim([0,3.2])
    plt.ylim([0,4.5])
    plt.show()
    
    # Show the error when omega_1 changes
    plt.figure()
    font=LabelFormat(plt)
    plt.plot(omega_1,error_all, 'k-s')
    error_min=min(error_all)
    index_min=error_all.index(error_min)
    print(index_min)
    # plot the error at the given three point
    plt.plot(omega_1[index_min],error_min,'bs')
    plt.plot(omega_1[0],error_all[0],'cyan',marker='s')
    plt.plot(omega_1[-1],error_all[-1],'orange',marker='s')
    
    plt.xlabel('$\omega_1$', font)
    plt.ylabel('Value of loss function', font)
    plt.show()
    
  • Figures 5–6:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                'weight': 'normal',
                'size': 16,
                }
        return font
    
    x = np.linspace(0, 4, 30).reshape(30, 1)
    y=(x-2)**2/2
    
    plt.figure()
    plt.plot(x,y,'k-')
    plt.plot(3.5,1.5**2/2,'ro')
    plt.annotate('$\\frac{dE}{dx}$', xy=(3.5, 1.5**2/2), xycoords='data',
                 xytext=(-60, -125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="<-",
                 connectionstyle="arc,rad=90", color='r'))
    
    plt.plot(0.5,1.5**2/2,'ro')
    plt.annotate('$\\frac{dE}{dx}$', xy=(0.5, 1.5**2/2), xycoords='data',
                 xytext=(48, -125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="<-",
                 connectionstyle="arc,rad=90", color='r'))
    
    plt.annotate('$\hat y=\\frac{1}{2}(x-2)^2$', xy=(0.25, 1.75**2/2), xycoords='data',
                 xytext=(108, 0), textcoords='offset points',color='k', fontsize=14, arrowprops=dict(arrowstyle="<-",
                 connectionstyle="arc,rad=90", color='w'))
    # Set the labels
    font = LabelFormat(plt)
    plt.xlabel('$x$', font)
    plt.ylabel('$\hat y$', font)
    plt.show()
    
    # To plot figure 6
    x1 = np.linspace(0, 5/4.0*np.pi, 50).reshape(50, 1)
    y1=np.cos(x1)
    
    x2 = np.linspace(5/4.0*np.pi, 8, 50).reshape(50, 1)
    y2=0.5*np.cos(2*x2+1*np.pi)-0.71
    
    plt.figure()
    plt.plot(x1,y1,'k-')
    plt.plot(x2,y2,'k-')
    plt.plot(np.pi,-1,'ro')
    plt.annotate('Local optimal', xy=(np.pi, -1), xycoords='data',
                 xytext=(-48, 125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='r'))
    
    plt.plot(np.pi*2,-1.21,'ro')
    plt.annotate('Global optimal', xy=(2*np.pi, -1.21), xycoords='data',
                 xytext=(-48, 125), textcoords='offset points',color='r', fontsize=14, arrowprops=dict(arrowstyle="->",
                 connectionstyle="arc,rad=90", color='r'))
    
    # Set the labels
    font = LabelFormat(plt)
    plt.xlabel('$x$', font)
    plt.ylabel('$\hat y$', font)
    plt.show()
    
    
  • Figures 7–11:

    # -*- coding: utf-8 -*-
    # @Time : 2020/4/7 11:28
    # @Author : tengweitw
    
    import numpy as np
    from sklearn.datasets import load_boston
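    # Note: load_boston was removed in scikit-learn 1.2, so this script requires
    # an older scikit-learn version (< 1.2).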
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn import preprocessing
    
    
    def Linear_regression_normal_equation(train_data, train_target, test_data, test_target):
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([np.size(train_data, 0), 1])
        # X is a 500*(1+2)-dim matrix
        X = np.concatenate((temp, train_data), axis=1)
    
        # Normal equation
        w_bar = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, train_target))
    
        # Training Error
        y_predict_train = np.matmul(X, w_bar)
        E_train = np.linalg.norm(y_predict_train - train_target)/len(y_predict_train)
    
        # Predicting
        x0 = np.ones((np.size(test_data, 0), 1))
        test_data1 = np.concatenate((x0, test_data), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test - test_target)/len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    
    def Linear_regression_normal_equation_scale(train_data, train_target, test_data, test_target):
        # Data processing: scaling
        # For training data
        ss = preprocessing.StandardScaler()
        ss.partial_fit(train_data)
        train_data_scale = ss.fit_transform(train_data)
        # For testing data
        ss.partial_fit(test_data)
        test_data_scale = ss.fit_transform(test_data)
    
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([np.size(train_data_scale, 0), 1])
        # X is a 500*(1+2)-dim matrix
        X = np.concatenate((temp, train_data_scale), axis=1)
    
        # Normal equation
        w_bar = np.matmul(np.linalg.inv(np.matmul(X.T, X)), np.matmul(X.T, train_target))
    
        # Training Error
        y_predict_train = np.matmul(X, w_bar)
        E_train = np.linalg.norm(y_predict_train - train_target) / len(y_predict_train)
    
        # Predicting
        x0 = np.ones((np.size(test_data_scale, 0), 1))
        test_data1 = np.concatenate((x0, test_data_scale), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test - test_target) / len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    
    def Linear_regression_gradient_descend(train_data, train_target, test_data, test_target):
        # learning rate
        eta = 5e-6
        M = np.size(train_data, 1)
        N = np.size(train_data, 0)
        w_bar = np.zeros((M + 1, 1))
    
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([N, 1])
        # X is a N*(1+M)-dim matrix
        X = np.concatenate((temp, train_data), axis=1)
        train_target = np.mat(train_target).T
    
        iter = 0
        num_iter = 5000
        E_train = np.zeros((num_iter, 1))
    
        while iter < num_iter:
            temp = np.matmul(X, w_bar) - train_target
            w_bar = w_bar - eta * np.matmul(X.T, temp)
            # Predicting training data
            y_predict_train = np.matmul(X, w_bar)
            # Training Error
            E_train[iter]=np.linalg.norm(y_predict_train - train_target)/len(y_predict_train)
            iter += 1
    
        # Predicting
        x0 = np.ones((np.size(test_data, 0), 1))
        test_data1 = np.concatenate((x0, test_data), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test.ravel()- test_target)/len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    def Linear_regression_gradient_descend_scale(train_data, train_target, test_data, test_target):
        # Data processing: scaling
        # For training data
        ss = preprocessing.StandardScaler()
        ss.partial_fit(train_data)
        train_data_scale = ss.fit_transform(train_data)
        # For testing data
        ss.partial_fit(test_data)
        test_data_scale = ss.fit_transform(test_data)
    
        # learning rate
        eta = 1e-3
        M = np.size(train_data_scale, 1)
        N = np.size(train_data_scale, 0)
        w_bar = np.zeros((M + 1, 1))
    
        # the 1st column is 1 i.e., x_0=1
        temp = np.ones([N, 1])
        # X is a N*(1+M)-dim matrix
        X = np.concatenate((temp, train_data_scale), axis=1)
        train_target = np.mat(train_target).T
    
        iter = 0
        num_iter = 10
        E_train = np.zeros((num_iter, 1))
    
        while iter < num_iter:
            temp = np.matmul(X, w_bar) - train_target
            w_bar = w_bar - eta * np.matmul(X.T, temp)
            # Predicting training data
            y_predict_train = np.matmul(X, w_bar)
            # Training Error
            E_train[iter]=np.linalg.norm(y_predict_train - train_target)/len(y_predict_train)
            iter += 1
        # Predicting
        x0 = np.ones((np.size(test_data_scale, 0), 1))
        test_data1 = np.concatenate((x0, test_data_scale), axis=1)
        y_predict_test = np.matmul(test_data1, w_bar)
    
        # Prediction Error
        E_test = np.linalg.norm(y_predict_test.ravel()- test_target)/len(y_predict_test)
    
        return y_predict_test, E_train, E_test
    
    
    # Set the format of labels
    def LabelFormat(plt):
        ax = plt.gca()
        plt.tick_params(labelsize=14)
        labels = ax.get_xticklabels() + ax.get_yticklabels()
        [label.set_fontname('Times New Roman') for label in labels]
        font = {'family': 'Times New Roman',
                'weight': 'normal',
                'size': 16,
                }
        return font
    
    def Plot_error_vs_omega(train_data,train_target):
        # ---------Show the contour of E with respect to omegas---------------------
        x1 = train_data[:, 0]
        x2 = train_data[:, 1]
        omega_1 = np.linspace(-30, 30, 30)
        omega_2 = np.linspace(-30, 30, 30)
    
        Y_hat = np.zeros((len(omega_1),len( omega_2)))
        for i in range(len(omega_1)):
            for j in range(len(omega_2)):
                for k in range(len(train_data)):
                    temp=train_target[k] - (omega_1[i] * x1[k] + omega_2[j] * x2[k])
                    Y_hat[i][j] = Y_hat[i][j] + np.square(temp)
    
        fig = plt.figure()
    
        plt.contour(omega_2,omega_1,Y_hat,20)
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('$\omega_1$', font)
        plt.ylabel('$\omega_2$', font)
    
        plt.show()
    
    def Plot_error_vs_omega_scale(train_data, train_target):
        # ---------Show the contour of E with respect to omegas---------------------
        # Data processing: scaling
        # For training data
        ss = preprocessing.StandardScaler()
        ss.partial_fit(train_data)
        train_data_scale = ss.fit_transform(train_data)
    
        x1 = train_data_scale[:, 0]
        x2 = train_data_scale[:, 1]
        omega_1 = np.linspace(-30, 30, 30)
        omega_2 = np.linspace(-30, 30, 30)
    
        Y_hat = np.zeros((len(omega_1), len(omega_2)))
        for i in range(len(omega_1)):
            for j in range(len(omega_2)):
                for k in range(len(train_data_scale)):
                    temp = train_target[k] - (omega_1[i] * x1[k] + omega_2[j] * x2[k])
                    Y_hat[i][j] = Y_hat[i][j] + np.square(temp)
    
        fig = plt.figure()
    
        plt.contour(omega_2, omega_1, Y_hat, 20)
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('$\omega_1$', font)
        plt.ylabel('$\omega_2$', font)
    
        plt.show()
    
    
    if __name__ == '__main__':
    
        # load house price of Boston
        data, target = load_boston(return_X_y=True)
        # The number of selected features
        M = 2
        # The first 500 data for training
        train_data = data[0:500, 0:0 + M]
        train_target = target[0:500]
        train_target.reshape(len(train_data), 1)
    
        # ------------------------------
        # The last 6 data for testing
        test_data = data[500:, 0:0 + M]
        test_target = target[500:]
    
        # To show the contour of error function E with respect to omega
        # We can see that it's a convex function, not easy for gradient descend
        Plot_error_vs_omega(train_data, train_target)
        Plot_error_vs_omega_scale(train_data, train_target)
    
        #---------------------------------#
        y_predict_normal_equation, E_train,E_test = Linear_regression_normal_equation(train_data, train_target, test_data,
                                                                             test_target)
        print("Linear Regression Using Normal Equation: E_train=%f, E_test=%f" % (E_train,E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_normal_equation[i]))
    
    
        # ---------------------------------#
        y_predict_normal_equation_scale, E_train,E_test = Linear_regression_normal_equation_scale(train_data, train_target,
                                                                                         test_data, test_target)
        print("Linear Regression Using Normal Equation with scaling: E_train=%f, E_test=%f" % (E_train,E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_normal_equation_scale[i]))
    
    
        # ---------------------------------#
        y_predict_gradient_descent, E_train,E_test = Linear_regression_gradient_descend(train_data, train_target, test_data,
                                                                               test_target)
        print("Linear Regression Using Gradient Descend: E_train=%f, E_test=%f" % (E_train[-1],E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_gradient_descent[i]))
    
        plt.figure()
        plt.plot(E_train,'r-')
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('Iteration', font)
        plt.ylabel('Average error: $E/N$', font)
        plt.show()
    
    
        # ---------------------------------#
        y_predict_gradient_descent_scale, E_train,E_test = Linear_regression_gradient_descend_scale(train_data, train_target,
                                                                                           test_data, test_target)
        print("Linear Regression Using Gradient Descend with scaling: E_train=%f, E_test=%f" % (E_train[-1],E_test))
        for i in range(len(test_data)):
            print("True value: %f    Predicted value: %f" % (test_target[i], y_predict_gradient_descent_scale[i]))
        plt.figure()
        plt.plot(E_train,'r-')
        # Set the labels
        font = LabelFormat(plt)
        plt.xlabel('Iteration', font)
        plt.ylabel('Average error: $E/N$', font)
    
        plt.show()
    
    