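Introduction

Can a machine tell flower species apart from photographs? From a machine learning point of view this is a classification problem: the machine learns from data on different species and then classifies unlabeled test images. In this section we again start from scikit-learn to understand the basic principles of classification, with plenty of hands-on practice.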

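The Iris dataset

The Iris flower dataset is a classic multivariate dataset introduced by Sir Ronald Fisher in 1936 as an example for discriminant analysis. It contains 50 samples from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor), and each sample has four features: the length and width of the sepals and the length and width of the petals, all in centimeters. Fisher used the dataset to develop a linear discriminant model for telling the species apart. Since then, the dataset has become a standard test case for classification techniques in machine learning.

The classification problem we want to solve is this: given a new iris flower, can we predict its species from the four measurements above?

We use labeled data to design a rule and then apply that rule to other samples to make predictions. This is the basic supervised learning problem (here, classification).

Because the Iris dataset is small in both sample count and dimensionality, it is easy to visualize and manipulate.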

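Data visualization

scikit-learn ships with a few classic datasets, such as iris and digits for classification and boston house prices for regression (note that the load_boston loader was removed in scikit-learn 1.2). The data can be loaded as follows: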

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

The returned object is dictionary-like: the data are stored in the .data member and the output labels in the .target member.
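As a quick check of that structure, the following minimal sketch prints the shape of the data and the names that come with it. The variable names `features` and `labels` are chosen here for illustration only.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object is dictionary-like: fields can be read by key or by attribute.
features = iris["data"]        # feature matrix, also available as iris.data
labels = iris.target           # integer class labels 0, 1, 2

print(features.shape)          # (number of samples, number of features)
print(iris.feature_names)      # names of the four measurements
print(iris.target_names)       # species name for each label value
```

For Iris this reports 150 samples with 4 features each, which matches the 3 x 50 samples described earlier.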

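Scatter plots of any two dimensions

Any two dimensions can be plotted against each other as follows. The first call, for example, plots the first dimension (sepal length) against the second (sepal width):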

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()
irisFeatures = iris["data"]
irisFeaturesName = iris["feature_names"]
irisLabels = iris["target"]

def scatter_plot(dim1, dim2):
    # zip() takes any number of sequences and yields tuples of their elements
    # We plot each class on its own to get differently colored markers
    for t, marker, color in zip(range(3), ">ox", "rgb"):
        plt.scatter(irisFeatures[irisLabels == t, dim1],
                    irisFeatures[irisLabels == t, dim2],
                    marker=marker, c=color)
    dim_meaning = {0: 'sepal length', 1: 'sepal width', 2: 'petal length', 3: 'petal width'}
    plt.xlabel(dim_meaning.get(dim1))
    plt.ylabel(dim_meaning.get(dim2))

# one subplot per pair of features, laid out in a 2x3 grid
plt.subplot(231)
scatter_plot(0, 1)
plt.subplot(232)
scatter_plot(0, 2)
plt.subplot(233)
scatter_plot(0, 3)
plt.subplot(234)
scatter_plot(1, 2)
plt.subplot(235)
scatter_plot(1, 3)
plt.subplot(236)
scatter_plot(2, 3)
plt.show()

The result is a figure with six scatter plots, one for each pair of features.

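Building a classification model

Classifying with a threshold on a single dimension

If our goal is to tell the three species apart, we can start with some hypotheses. For example, petal length appears to separate Iris Setosa from the other two species. A short piece of code shows where the boundary for this feature lies: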

petalLength = irisFeatures[:, 2]  # select the third column, since the feature matrix is 150x4
isSetosa = (irisLabels == 0)  # label 0 means Iris Setosa
maxSetosaPlength = petalLength[isSetosa].max()
minNonSetosaPlength = petalLength[~isSetosa].min()

print('Maximum of setosa:{0} '.format(maxSetosaPlength))
print('Minimum of others:{0} '.format(minNonSetosaPlength))

'''
Output:
Maximum of setosa:1.9 
Minimum of others:3.0 
'''

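Based on this result we can build a simple classification model: if the petal length is less than 2, the flower is an Iris Setosa; otherwise it is one of the other two species.

The structure of this model is very simple: it is determined by a threshold on one dimension of the data, and we find the best threshold for that dimension by experiment.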
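The rule can be written as a one-line function. The sketch below is only an illustration: the function name `predict_setosa` and its `threshold` parameter are made up here, and the cut-off of 2 cm is the value suggested by the experiment above.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]          # third column: petal length in cm
is_setosa = (iris.target == 0)          # label 0 means Iris Setosa

def predict_setosa(petal_length_cm, threshold=2.0):
    """Return True where the petal length falls below the threshold."""
    return np.asarray(petal_length_cm) < threshold

predicted = predict_setosa(petal_length)
accuracy = (predicted == is_setosa).mean()
print("Accuracy of the petal-length rule:", accuracy)
```

Because the longest Setosa petal (1.9) is shorter than the shortest petal of the other species (3.0), this rule makes no mistakes on the dataset.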

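The example above separates Iris Setosa from the other two species easily. However, we cannot immediately see the best threshold for separating Iris Virginica from Iris Versicolor, and in fact no threshold on a single dimension separates these two classes perfectly.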

Finding the threshold by comparing accuracy

First we select the non-Setosa flowers.

irisFeatures = irisFeatures[~isSetosa]
labels = irisLabels[~isSetosa]
isVirginica = (labels == 2)    #label 2 means iris virginica

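Here we rely heavily on NumPy array operations. isSetosa is a Boolean array, and we use it to select the non-Setosa flowers. Finally, we construct a new Boolean array, isVirginica. Next, we loop over every feature dimension and every candidate threshold to see which combination gives the best accuracy.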
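For readers new to Boolean indexing, here is a tiny sketch of how such a mask selects rows. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

values = np.array([1.4, 4.7, 1.3, 5.1])
mask = np.array([True, False, True, False])

selected = values[mask]      # keeps the entries where mask is True
rejected = values[~mask]     # ~ flips every Boolean, so this keeps the rest
print(selected, rejected)
```

The mask must have the same length as the array it indexes, which is why a mask built for all 150 flowers cannot be applied to an array that has already been filtered.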

# search the threshold between virginica and versicolor
# irisFeatures, labels and isVirginica are the non-Setosa arrays selected above;
# filtering with ~isSetosa a second time would fail because the lengths no longer match
bestAccuracy = -1.0
for fi in range(irisFeatures.shape[1]):
    # every observed value of this feature is a candidate threshold
    thresh = irisFeatures[:, fi].copy()
    thresh.sort()
    for t in thresh:
        pred = (irisFeatures[:, fi] > t)
        acc = (pred == isVirginica).mean()
        if acc > bestAccuracy:
            bestAccuracy = acc
            bestFeatureIndex = fi
            bestThreshold = t
print('Best Accuracy:\t\t', bestAccuracy)
print('Best Feature Index:\t', bestFeatureIndex)
print('Best Threshold:\t\t', bestThreshold)
'''
Final result:
Best Accuracy:		0.94
Best Feature Index:	3
Best Threshold:		1.6
'''

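For each dimension we first sort the values, then take each value in turn as a candidate threshold, compare the resulting Boolean predictions with the Boolean array of true labels, and average the matches to get the accuracy. After all the loops finish we have the best threshold and the dimension it belongs to. The best model turns out to use the fourth dimension, petal width, and this threshold gives us the decision boundary.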
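The accuracy computation for a single candidate threshold can be checked by hand on a toy example. The six values and labels below are invented for illustration, and `threshold_accuracy` is a hypothetical helper name, not part of the original code.

```python
import numpy as np

# Six made-up feature values and their true Boolean labels
feature = np.array([1.0, 1.3, 1.5, 1.8, 2.1, 2.3])
is_positive = np.array([False, False, True, False, True, True])

def threshold_accuracy(feature, target, t):
    """Fraction of samples for which (feature > t) agrees with the target."""
    pred = feature > t
    return (pred == target).mean()

# Try every observed value as a threshold and keep the first best one
best_t, best_acc = None, -1.0
for t in np.sort(feature):
    acc = threshold_accuracy(feature, is_positive, t)
    if acc > best_acc:
        best_t, best_acc = t, acc
print(best_t, best_acc)
```

On this toy data the thresholds 1.3 and 1.8 both classify five of the six samples correctly. Because the comparison is a strict `>`, the search keeps the first of them, 1.3.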

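Evaluating the model: cross-validation

Above we obtained a simple model that reaches 94% accuracy on the training data, but its parameters may be over-optimized for that particular data (overfitting).

What we really want is to evaluate how well the model generalizes to new data. That calls for a stricter evaluation in which the test data are not the training data, so we hold out part of the data and use it for cross-validation.

This gives us both a training error and a test error. With a complex model, the training accuracy may be 100% while the test performance is only slightly better than random guessing.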
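A minimal sketch of such a hold-out evaluation follows. It assumes scikit-learn's `train_test_split` for the split; the helper `fit_threshold`, the 30-sample test set and `random_state=0` are illustrative choices, not part of the original article.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
mask = iris.target != 0                    # drop Iris Setosa
X = iris.data[mask]
y = (iris.target[mask] == 2)               # True for Iris Virginica

# Hold out 30 of the 100 samples; stratify keeps both classes balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=0, stratify=y)

def fit_threshold(X, y):
    """Search every (feature, threshold) pair for the best training accuracy."""
    best = (-1.0, None, None)
    for fi in range(X.shape[1]):
        for t in np.unique(X[:, fi]):
            acc = ((X[:, fi] > t) == y).mean()
            if acc > best[0]:
                best = (acc, fi, t)
    return best

train_acc, fi, t = fit_threshold(X_train, y_train)
test_acc = ((X_test[:, fi] > t) == y_test).mean()   # accuracy on unseen data
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The two numbers generally differ, and the gap between them is a rough indication of how much the model was tuned to the training samples.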

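Cross-validation

In many practical applications data are scarce. To choose a better model we can use cross-validation. The basic idea is to reuse the data: split the given data, combine the pieces into training and test sets, and on that basis repeatedly train, test and select models.

S-fold cross-validation

The most widely used variant is S-fold cross-validation. First, randomly split the data into S disjoint subsets of equal size. Then train the model on S-1 of the subsets and test it on the remaining one. Repeat this for each of the S possible choices of test subset. Finally, select the model with the smallest average test error over the S evaluations.

For example, splitting the dataset into 5 parts gives 5-fold cross-validation. We then build one model per fold, each time holding out 20% of the data for testing.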
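The splitting step of 5-fold cross-validation can be sketched with scikit-learn's `KFold`. The sample count of 100 and `random_state=0` are assumptions made for this illustration; only the index bookkeeping is shown, not the model training.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 100
indices = np.arange(n_samples)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []   # (training size, test size) for each fold
seen = []         # every index that was used for testing
for train_idx, test_idx in kfold.split(indices):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen.extend(test_idx.tolist())
print(fold_sizes)
```

Each fold trains on 80 samples and tests on the remaining 20, and across the five folds every sample is used for testing exactly once.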

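Leave-one-out cross-validation

Leave-one-out cross-validation is the special case of S-fold cross-validation in which S equals the number of samples in the dataset. We take one sample out of the training data, fit the model on the rest, and check whether that model classifies the held-out sample correctly.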

def learn_model(features, labels):
    # exhaustive search for the (feature, threshold) pair with the best training accuracy
    bestAccuracy = -1.0
    for fi in range(features.shape[1]):
        thresh = features[:, fi].copy()
        thresh.sort()
        for t in thresh:
            pred = (features[:, fi] > t)
            acc = (pred == labels).mean()
            if acc > bestAccuracy:
                bestAccuracy = acc
                bestFeatureIndex = fi
                bestThreshold = t
    # uncomment to inspect the model learned in each round
    # print('Best Accuracy:\t\t', bestAccuracy)
    # print('Best Feature Index:\t', bestFeatureIndex)
    # print('Best Threshold:\t\t', bestThreshold)
    return {'dim': bestFeatureIndex, 'thresh': bestThreshold, 'accuracy': bestAccuracy}

def apply_model(features, labels, model):
    # labels is not used for prediction; it is kept so the call below stays unchanged
    prediction = (features[:, model['dim']] > model['thresh'])
    return prediction

#-----------cross validation-------------
# irisFeatures and isVirginica are the non-Setosa arrays from the previous section
error = 0.0
for ei in range(len(irisFeatures)):
    # select all but the one at position 'ei':
    training = np.ones(len(irisFeatures), bool)
    training[ei] = False
    testing = ~training
    model = learn_model(irisFeatures[training], isVirginica[training])
    predictions = apply_model(irisFeatures[testing],
                              isVirginica[testing], model)
    error += np.sum(predictions != isVirginica[testing])
# fraction of held-out samples that were misclassified
print('Leave-one-out error rate: {0}'.format(error / len(irisFeatures)))

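The program above trains a series of models, each with one sample held out, so that every sample is used for testing exactly once. The resulting estimate reflects the model's ability to generalize.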
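The same procedure can also be delegated to scikit-learn. The sketch below uses `LeaveOneOut` with `cross_val_score`, and swaps in a depth-1 decision tree (a decision stump) as a stand-in for the hand-written threshold model. This is an assumption made for illustration: the stump also splits on a single feature threshold, but it chooses the split by impurity rather than by the accuracy search used above, so its results need not match exactly.

```python
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
mask = iris.target != 0                    # drop Iris Setosa
X = iris.data[mask]
y = (iris.target[mask] == 2)               # True for Iris Virginica

# A depth-1 tree makes one threshold decision on one feature
stump = DecisionTreeClassifier(max_depth=1, random_state=0)

# One score per held-out sample: 1.0 if classified correctly, 0.0 otherwise
scores = cross_val_score(stump, X, y, cv=LeaveOneOut())
print("held-out predictions:", len(scores))
print("leave-one-out accuracy:", scores.mean())
```

With 100 samples this fits 100 models, which is affordable here but becomes expensive on larger datasets.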

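Summary

When splitting the dataset as above, we need to keep the classes balanced. If all the data in a subset come from a single class, the result is not representative. In this section we trained a simple model, and the cross-validation procedure gave us an estimate of how well it generalizes.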

References

Wiki:Iris flower data set

Building Machine Learning Systems with Python

When reposting, please credit the author, Jason Ding, and the source.

GitHub page (http://jasonding1354.github.io/)

CSDN blog (http://blog.csdn.net/jasonding1354)

Jianshu page (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)

Source: JasonDing's blog
