Machine Learning Algorithm Notes 1 (Python)

1. Machine Learning Algorithms

1.1 K-NN

(1) K-NN (k-nearest neighbors): K-NN is an instance-based learning method. Its classification does not depend on an internal model; instead it refers directly to the labeled training data. k-NN simply memorizes all training samples and compares each new sample against them, so it is a non-generalizing method.

  • KNeighborsClassifier: the user specifies k, the number of nearest neighbors. When the data are noisy, a larger k helps, at the cost of a less distinct classification boundary.
  • RadiusNeighborsClassifier: a fixed radius is specified around each training data point; it works better when the data are not uniformly sampled (see the sketch below).
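
As a point of comparison, here is a minimal sketch of RadiusNeighborsClassifier on the iris data used in the demo below; the radius of 1.0 and the query point are illustrative choices, not tuned settings:

from sklearn.datasets import load_iris
from sklearn.neighbors import RadiusNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # first two features, as in the k-NN demo below

# every training point within radius 1.0 of a query point gets a vote;
# a query point with no neighbor inside the radius raises an error
# unless outlier_label is set
rnc = RadiusNeighborsClassifier(radius=1.0)
rnc.fit(X, y)
print(rnc.predict([[5.0, 3.5]]))  # predict the class of one new sample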

(2) Iris classification example code

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 10 21:24:33 2019

@author: Larry
"""
# k-NN classification demo on the iris dataset
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def knnDemo(X,y,n):
    #creates the classifier and fits it to the data
    res = 0.05  # step size of the grid used to draw the decision surface
    k1 = knn(n_neighbors = n, p = 2, metric = 'minkowski')
    k1.fit(X, y)  # fit the classifier; p=2 with the Minkowski metric is Euclidean distance
    
    #sets up the grid
    x1_min, x1_max = X[:,0].min() - 1,X[:,0].max() + 1
    x2_min, x2_max = X[:,1].min() - 1,X[:,1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, res), np.arange(x2_min, x2_max, res))  # coordinate matrices covering the whole feature range
    
    #makes the prediction
    Z = k1.predict(np.array([xx1.ravel(), xx2.ravel()]).T)  # flatten the grid into (x, y) pairs and predict each point
    Z = Z.reshape(xx1.shape)  # reshape the predictions back to the grid shape for plotting
    
    #creates the color map
    cmap_light = ListedColormap(['#FFAAAA','#AAFFAA','#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000','#00FF00','#0000FF'])
    
    #plots the decision surface
    plt.contourf(xx1,xx2,Z,alpha = 0.4,cmap = cmap_light)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())

    #plots the samples, colored by class label
    plt.scatter(X[:, 0], X[:, 1], c = y, cmap = cmap_bold, edgecolor = 'k')
    
    plt.show()


iris = datasets.load_iris()
X1 = iris.data[:, 0:3:2]  # alternative feature pair: sepal length and petal length
X2 = iris.data[:, 0:2]    # sepal length and sepal width
X3 = iris.data[:, 1:3]    # alternative feature pair: sepal width and petal length
y = iris.target
knnDemo(X2, y, 15)

(3) Notes on some of the functions used
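
A small sketch of what np.meshgrid, ravel and reshape do in the demo above (the axis ranges here are illustrative, much smaller than in the demo):

import numpy as np

# meshgrid turns two 1-D axes into 2-D coordinate matrices
xx1, xx2 = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
print(xx1.shape)        # (2, 3): one row per y value, one column per x value

# ravel flattens each matrix; stacking and transposing gives (x, y) pairs
pairs = np.array([xx1.ravel(), xx2.ravel()]).T
print(pairs.shape)      # (6, 2): every grid point becomes a sample row

# predictions made on the flattened pairs are reshaped back to the grid
Z = np.zeros(len(pairs)).reshape(xx1.shape)
print(Z.shape)          # (2, 3): same shape as xx1, ready for contourf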

1.2 Solving Regression Problems with Scikit-learn

(1) The LinearRegression() object

# linear regression example
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
clf.coef_     # array of estimated coefficients for the linear model
array([0.5, 0.5])

(2)linear_model.Ridge()

  • Ridge regression can handle multicollinearity, and it is also useful when the number of input variables greatly exceeds the number of samples.
  • The linear_model.Ridge() object applies L2 regularization: it penalizes the weight vector, which keeps the average weights smaller, reduces sensitivity to extreme values, and makes the model more stable.
  • linear_model.Ridge() adds a regularization parameter alpha; a small positive value improves the stability of the model. It can be a float or an array (one value per target variable).
# useful when the features are correlated
from sklearn.linear_model import Ridge
import numpy as np

def ridgeReg(alpha):
    n_samples, n_features = 10, 5
    y = np.random.randn(n_samples)
    X = np.random.randn(n_samples, n_features)
    clf = Ridge(alpha=alpha)  # use the alpha passed in instead of a hard-coded value
    res = clf.fit(X, y)
    return res

res = ridgeReg(0.001)
print(res.coef_)
print(res.intercept_)   # intercept (independent term) of the linear model

(3) Dimensionality reduction algorithms in scikit-learn

  • Dimensionality reduction cuts down the number of input or feature variables, which also reduces overfitting and improves the model's ability to generalize.
  • The main task is to identify redundant or irrelevant features. There are two approaches: feature selection and feature extraction. Selection finds a subset of the existing features; extraction combines correlated variables to create new feature variables.
  • The most commonly used feature extraction algorithm is PCA.
  • PCA applies an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables.
  • PCA requires the features to be scaled and mean-normalized, i.e. each feature should have zero mean and a comparable value range (a plain-PCA sketch with scaling follows, then a KernelPCA example).
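A minimal sketch of plain PCA with the scaling step mentioned above, on the iris data; the dataset and n_components=2 are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data

# zero-mean, unit-variance scaling before PCA, as noted above
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)             # keep the two leading principal components
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)                    # the 4 original features reduced to 2 components
print(pca.explained_variance_ratio_)  # fraction of variance each component explains

The example that follows uses KernelPCA instead, which can separate data that plain PCA cannot (here, two concentric circles).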
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 12 10:20:22 2019

@author: Larry
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles
np.random.seed(0)
X,y = make_circles(n_samples = 400,factor = 0.3,noise = 0.05)
kpca = KernelPCA(kernel = 'rbf',gamma=10)
X_kpca = kpca.fit_transform(X)
plt.figure()
plt.subplot(2,2,1,aspect = 'equal')
plt.title("Original space")
reds = y == 0
blues = y == 1
plt.plot(X[reds, 0], X[reds, 1], "ro")
plt.plot(X[blues, 0], X[blues, 1], "bo")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.subplot(2, 2, 3, aspect = 'equal')
plt.plot(X_kpca[reds, 0], X_kpca[reds, 1], "ro")
plt.plot(X_kpca[blues, 0], X_kpca[blues, 1], "bo")
plt.title("Projection by KPCA")
plt.xlabel(r"1st principal component in space induced by $\phi$")
plt.ylabel("2nd component")
plt.subplots_adjust(0.02,0.01,0.98,0.94,0.04,0.35)
plt.show()

(4) Cross-validation

# -*- coding: utf-8 -*-
"""
Created on Tue Mar 12 11:03:31 2019

@author: Larry
"""

from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from sklearn import model_selection
iris = datasets.load_iris()
# hold out 40% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# 5-fold cross-validation on the training set
scores = model_selection.cross_val_score(clf, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

(5) Decision trees (DT)

# -*- coding: utf-8 -*-
"""
Created on Wed Mar 13 14:36:44 2019

@author: Larry
"""

from sklearn import tree
names = ['size', 'scale', 'fruit', 'butt']   # feature names used for the graphviz export
labels = [1, 1, 1, 1, 1, 0, 0, 0]            # class label for each training sample
p1 = [2, 1, 0, 1]
p2 = [1, 1, 0, 1]
p3 = [1, 1, 0, 0]
p4 = [1, 1, 0, 0]
n1 = [0, 0, 0, 0]
n2 = [1, 0, 0, 0]
n3 = [0, 0, 1, 0]
n4 = [1, 1, 0, 0]
data = [p1, p2, p3, p4, n1, n2, n3, n4]

def pred(test, data=data):
    dtre = tree.DecisionTreeClassifier()
    dtre = dtre.fit(data, labels)
    print(dtre.predict([test]))
    # export the fitted tree to a graphviz .dot file
    with open('data/treeDemo.dot', 'w') as f:
        f = tree.export_graphviz(dtre, out_file=f, feature_names=names)

pred([1, 1, 0, 1])
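
The exported .dot file can be rendered into an image with Graphviz, e.g. dot -Tpng data/treeDemo.dot -o data/treeDemo.png (assuming Graphviz is installed and the data/ directory exists).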

1.3 Ensemble Learning

1.3.1 Bagging Methods

  • Also called bootstrap aggregating.
  • The most common form of bagging draws samples with replacement.
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 14 16:29:50 2019

@author: Larry
"""

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

# bag 50 decision trees, each trained on 50% of the samples and 50% of the features
bcls = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5, max_features=0.5, n_estimators=50)
X, y = datasets.make_blobs(n_samples=8000, centers=2, random_state=0, cluster_std=4)
bcls.fit(X, y)
print(bcls.score(X, y))  # accuracy on the training data

(1) The sklearn.ensemble module provides two decision-tree-based algorithms: random forests and extremely randomized trees (extra-trees). The following example compares them (together with logistic regression and naive Bayes) inside a soft-voting ensemble:

# -*- coding: utf-8 -*-
"""
Created on Thu Mar 14 17:03:35 2019

@author: Larry
"""

from sklearn import model_selection
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import datasets

def vclas(w1,w2,w3,w4):
    X,y=datasets.make_classification(n_features=10,n_informative=4,n_samples=500,n_clusters_per_class=5)
    Xtrain,Xtest,ytrain,ytest=model_selection.train_test_split(X,y,test_size=0.4)
    
    clf1=LogisticRegression(random_state=123)
    clf2=GaussianNB()
    clf3=RandomForestClassifier(n_estimators=10,bootstrap=True,random_state=123)
    clf4=ExtraTreesClassifier(n_estimators=10,bootstrap=True,random_state=123)
    
    
    clfes=[clf1,clf2,clf3,clf4]
    
    
    # soft voting averages the predicted class probabilities, weighted by w1..w4
    eclf=VotingClassifier(estimators=[('lr',clf1),('gnb',clf2),('rf',clf3),('et',clf4)],voting='soft',weights=[w1,w2,w3,w4])

    # fit every base classifier and the ensemble
    [c.fit(Xtrain,ytrain) for c in (clf1,clf2,clf3,clf4,eclf)]
    
    
    
    N=5
    ind = np.arange(N)
    width = 0.3
    fig,ax=plt.subplots()
    
    
    
    plt.rcParams['font.sans-serif'] = ['SimHei']  # use a CJK-capable font (only needed if labels contain Chinese text)
    plt.rcParams['axes.unicode_minus'] = False
    
    for i, clf in enumerate(clfes):
        print(clf, i)
        # black bars: training accuracy; grey bars: test accuracy
        p1 = ax.bar(i, clfes[i].score(Xtrain, ytrain), width=width, color='black')
        p2 = ax.bar(i + width, clfes[i].score(Xtest, ytest), width=width, color='grey')
    
    # bars for the voting ensemble itself
    ax.bar(len(clfes)+width,eclf.score(Xtrain,ytrain),width=width,color='black')
    ax.bar(len(clfes)+width*2,eclf.score(Xtest,ytest),width=width,color='grey')
    plt.axvline(3.8,color='k',linestyle='dashed')  # separates the base classifiers from the ensemble
    ax.set_xticks(ind+width)
    ax.set_xticklabels(['LogisticRegression',
                       'GaussianNB',
                       'RandomForestClassifier',
                       'ExtraTrees',
                       'VotingClassifier'],rotation=40,ha='right')
    #ExtraTrees
    
    plt.title('Train and test score for different classifiers')
    plt.legend([p1[0], p2[0]], ['train', 'test'], loc='lower left')
#    plt.show()
    plt.savefig("data/temp.png", dpi=500, bbox_inches='tight')  # high dpi and tight bbox keep the saved figure sharp and uncropped

vclas(1,3,5,4)

1.3.2 Boosting Methods

(1) AdaBoost (Adaptive Boosting): uses decision tree classifiers as base learners and can build decision boundaries for data that are not linearly separable (a sketch follows after the list below).
(2) Gradient Boosting

  • Handles mixed data types well
  • Strong predictive power
  • Its sequential architecture does not lend itself to parallelization, so it does not scale well to very large data sets.
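
A minimal sketch of both boosting classifiers on synthetic data; the hyperparameters (n_estimators, learning_rate) are illustrative defaults rather than tuned values:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# AdaBoost: by default each base learner is a shallow decision tree (a stump)
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))

# Gradient boosting: trees are fitted sequentially to the residual errors
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gbc.fit(X_train, y_train)
print("Gradient boosting test accuracy:", gbc.score(X_test, y_test))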

2. Evaluation Metrics

(1) The ROC curve (receiver operating characteristic) plots the true positive rate against the false positive rate at different thresholds, where TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
(2) In signal detection theory, ROC plots have long been used to describe the trade-off between a classifier's hit rate and its false alarm rate.
(3) For multi-class problems, one ROC curve can be drawn per class by treating that class as positive and all the others as negative (one-vs-rest).

# -*- coding: utf-8 -*-
"""
Created on Thu Mar 14 09:45:27 2019

@author: Larry
"""
import matplotlib.pyplot as plt
from sklearn import svm,datasets
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

X,y=datasets.make_classification(n_samples=100,n_classes=3,n_features=5,n_informative=3,n_redundant=0,random_state=42)
#binarize the output
y=label_binarize(y,classes=[0,1,2])
n_classes=y.shape[1]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.5)
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)  # per-class decision scores used to sweep the thresholds
plt.figure()
# one ROC curve per class: that class is treated as positive, all others as negative
for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label='class %d ROC (AUC = %0.2f)' % (i, roc_auc))
plt.plot([0,1],[0,1],'k--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="best")
plt.show()
