機器學習算法的Python實現 (1)：logistics迴歸與線性判別分析（LDA）

原創

Will-Lin

2020-02-20 16:21

本文爲筆者在學習周志華老師的機器學習教材後，寫的課後習題的的編程題。之前放在答案的博文中，現在重新進行整理，將需要實現代碼的部分單獨拿出來，慢慢積累。希望能寫一個機器學習算法實現的系列。

本文主要包括：

1、logistics迴歸

2、線性判別分析（LDA）

使用的python庫：

numpy
matplotlib
pandas

使用的數據集：機器學習教材上的西瓜數據集3.0α

Idx	density	ratio_sugar	label
1	0.697	0.46	1
2	0.774	0.376	1
3	0.634	0.264	1
4	0.608	0.318	1
5	0.556	0.215	1
6	0.403	0.237	1
7	0.481	0.149	1
8	0.437	0.211	1
9	0.666	0.091	0
10	0.243	0.0267	0
11	0.245	0.057	0
12	0.343	0.099	0
13	0.639	0.161	0
14	0.657	0.198	0
15	0.36	0.37	0
16	0.593	0.042	0
17	0.719	0.103	0

logistic迴歸：

參考《機器學習實戰》的內容。本題分別寫了梯度上升方法以及隨機梯度上升方法。對書本上的程序做了一點點改動

# -*- coding: cp936 -*-  
from numpy import *  
import pandas as pd  
import matplotlib.pyplot as plt  
  
#讀入csv文件數據  
df=pd.read_csv('watermelon_3a.csv')  
m,n=shape(dataMat)  
df['norm']=ones((m,1))  
dataMat=array(df[['norm','density','ratio_sugar']].values[:,:])  
labelMat=mat(df['label'].values[:]).transpose()  
  
#sigmoid函數  
def sigmoid(inX):  
    return 1.0/(1+exp(-inX))  
  
#梯度上升算法  
def gradAscent(dataMat,labelMat):  
    m,n=shape(df.values)  
    alpha=0.1  
    maxCycles=500  
    weights=array(ones((n,1)))  
  
    for k in range(maxCycles):   
        a=dot(dataMat,weights)  
        h=sigmoid(a)  
        error=(labelMat-h)  
        weights=weights+alpha*dot(dataMat.transpose(),error)  
    return weights  
  
#隨機梯度上升  
def randomgradAscent(dataMat,label,numIter=50):  
    m,n=shape(dataMat)  
    weights=ones(n)  
    for j in range(numIter):  
        dataIndex=range(m)  
        for i in range(m):  
            alpha=40/(1.0+j+i)+0.2  
  
            randIndex_Index=int(random.uniform(0,len(dataIndex)))  
            randIndex=dataIndex[randIndex_Index]  
            h=sigmoid(sum(dot(dataMat[randIndex],weights)))  
            error=(label[randIndex]-h)  
            weights=weights+alpha*error[0,0]*(dataMat[randIndex].transpose())  
            del(dataIndex[randIndex_Index])  
    return weights  
  
#畫圖  
def plotBestFit(weights):  
    m=shape(dataMat)[0]  
    xcord1=[]  
    ycord1=[]  
    xcord2=[]  
    ycord2=[]  
    for i in range(m):  
        if labelMat[i]==1:  
            xcord1.append(dataMat[i,1])  
            ycord1.append(dataMat[i,2])  
        else:  
            xcord2.append(dataMat[i,1])  
            ycord2.append(dataMat[i,2])  
    plt.figure(1)  
    ax=plt.subplot(111)  
    ax.scatter(xcord1,ycord1,s=30,c='red',marker='s')  
    ax.scatter(xcord2,ycord2,s=30,c='green')  
    x=arange(0.2,0.8,0.1)  
    y=array((-weights[0]-weights[1]*x)/weights[2])  
    print shape(x)  
    print shape(y)  
    plt.sca(ax)  
    plt.plot(x,y)      #ramdomgradAscent  
    #plt.plot(x,y[0])   #gradAscent  
    plt.xlabel('density')  
    plt.ylabel('ratio_sugar')  
    #plt.title('gradAscent logistic regression')  
    plt.title('ramdom gradAscent logistic regression')  
    plt.show()  
  
#weights=gradAscent(dataMat,labelMat)  
weights=randomgradAscent(dataMat,labelMat)  
plotBestFit(weights)

梯度上升法得到的結果如下：

隨機梯度上升法得到的結果如下：

可以看出，兩種方法的效果基本差不多。但是隨機梯度上升方法所需要的迭代次數要少很多

線性判別分析

LDA的編程主要參考書上P62的3.39 以及P61的3.33這兩個式子。由於用公式可以直接算出，因此比較簡單

公式如下：

代碼如下：

# -*- coding: cp936 -*-  
from numpy import *  
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt  
  
df=pd.read_csv('watermelon_3a.csv')  
  
def calulate_w():  
    df1=df[df.label==1]  
    df2=df[df.label==0]  
    X1=df1.values[:,1:3]  
    X0=df2.values[:,1:3]  
    mean1=array([mean(X1[:,0]),mean(X1[:,1])])  
    mean0=array([mean(X0[:,0]),mean(X0[:,1])])  
    m1=shape(X1)[0]  
    sw=zeros(shape=(2,2))  
    for i in range(m1):  
        xsmean=mat(X1[i,:]-mean1)  
        sw+=xsmean.transpose()*xsmean  
    m0=shape(X0)[0]  
    for i in range(m0):  
        xsmean=mat(X0[i,:]-mean0)  
        sw+=xsmean.transpose()*xsmean  
    w=(mean0-mean1)*(mat(sw).I)  
    return w  
  
def plot(w):  
    dataMat=array(df[['density','ratio_sugar']].values[:,:])  
    labelMat=mat(df['label'].values[:]).transpose()  
    m=shape(dataMat)[0]  
    xcord1=[]  
    ycord1=[]  
    xcord2=[]  
    ycord2=[]  
    for i in range(m):  
        if labelMat[i]==1:  
            xcord1.append(dataMat[i,0])  
            ycord1.append(dataMat[i,1])  
        else:  
            xcord2.append(dataMat[i,0])  
            ycord2.append(dataMat[i,1])  
    plt.figure(1)  
    ax=plt.subplot(111)  
    ax.scatter(xcord1,ycord1,s=30,c='red',marker='s')  
    ax.scatter(xcord2,ycord2,s=30,c='green')  
    x=arange(-0.2,0.8,0.1)  
    y=array((-w[0,0]*x)/w[0,1])  
    print shape(x)  
    print shape(y)  
    plt.sca(ax)  
    #plt.plot(x,y)      #ramdomgradAscent  
    plt.plot(x,y)   #gradAscent  
    plt.xlabel('density')  
    plt.ylabel('ratio_sugar')  
    plt.title('LDA')  
    plt.show()  
  
w=calulate_w()  
plot(w)

結果如下：

對應的w值爲：

[ -6.62487509e-04, -9.36728168e-01]

由於數據分佈的關係，所以LDA的效果不太明顯。所以我改了幾個label=0的樣例的數值，重新運行程序得到結果如下：

效果比較明顯，對應的w值爲：

[-0.60311161, -0.67601433]

Will-Lin

發佈了33 篇原創文章 · 獲贊 225 · 訪問量 50萬+

私信關注

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

機器學習算法的Python實現 (1)：logistics迴歸與線性判別分析（LDA）

MySQL 核心模塊揭祕 | 18 期 | 鎖在內存里長什麼樣*

使用perf工具生成火焰圖

HttpSecurity 是如何組裝過濾器鏈的

數說海南——近6年海南各市縣人口簡單看

長序列中Transformers的高級注意力機制總結

大齡程序員思考

響應式界面控件DevExtreme * 更強的數據分析和可視化功能

XGBoost-安裝(Windows/VS2015)

機器學習算法的Python實現 (1)：logistics迴歸與線性判別分析（LDA）

機器學習算法的Python實現 (3)：決策樹剪枝處理

行爲識別筆記：Stacked Fisher Vector基本原理

XGBoost-Python完全調參指南-介紹篇

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結

機器學習算法的Python實現 (1)：logistics迴歸 與 線性判別分析（LDA）

機器學習算法的Python實現 (1)：logistics迴歸與線性判別分析（LDA）