
A Gentle Introduction to Machine Learning with Python and Scikit-learn

An introduction to machine learning with Python and scikit-learn
Guillermo Moncecchi, Diego Garat, Raúl Garreta

This article demonstrates the basic usage of scikit-learn's machine learning methods, covering classification, regression, and clustering. The classification and clustering examples use the Iris dataset introduced by Sir Ronald Fisher in 1936, and the regression example uses the Boston housing dataset.

Setting up the environment

In [161]:

%pylab inline
Populating the interactive namespace from numpy and matplotlib
Import scikit-learn, numpy, scipy and pyplot
In [162]:
import numpy as np
import scipy as sp
import matplotlib  # needed for matplotlib.__version__ below
import matplotlib.pyplot as plt
import sklearn
import IPython
import platform

print ('Python version:', platform.python_version())
print ('IPython version:', IPython.__version__)
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sklearn.__version__)
print ('matplotlib version:', matplotlib.__version__)

Python version: 3.3.5
IPython version: 3.2.0
numpy version: 1.9.2
scikit-learn version: 0.16.1
matplotlib version: 1.4.3

Datasets

scikit-learn comes bundled with several well-known datasets, such as the Iris dataset, which contains 150 instances of three classes; each instance records the sepal length and width and the petal length and width.
In [163]:

from sklearn import datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

The dataset is a 150×4 array; for every instance we have a corresponding target class.
In [164]:

print (X_iris.shape, y_iris.shape)
print ('Feature names:{0}'.format(iris.feature_names))
print ('Target classes:{0}'.format(iris.target_names))
print ('First instance features:{0}'.format(X_iris[0]))
(150, 4) (150,)
Feature names:['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target classes:['setosa' 'versicolor' 'virginica']
First instance features:[ 5.1  3.5  1.4  0.2]

Now let us plot the data, first the sepal measurements and then the petal measurements.
In [212]:

plt.figure('sepal')
colormarkers = [['red', 's'], ['greenyellow', 'o'], ['blue', 'x']]
for i in range(len(colormarkers)):
    px = X_iris[:, 0][y_iris == i]
    py = X_iris[:, 1][y_iris == i]
    plt.scatter(px, py, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Iris Dataset: Sepal width vs sepal length')
plt.legend(iris.target_names)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.figure('petal')

for i in range(len(colormarkers)):
    px = X_iris[:, 2][y_iris == i]
    py = X_iris[:, 3][y_iris == i]
    plt.scatter(px, py, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Iris Dataset: petal width vs petal length')
plt.legend(iris.target_names)
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.show()

[Figure: Iris dataset, sepal width vs sepal length]
[Figure: Iris dataset, petal width vs petal length]

Supervised Learning: Classification

In 1936, Ronald Fisher introduced the Iris dataset and used it to train a linear classification model, which builds a linear combination of the features, that is, a straight line.
Our task is to predict the class of an iris flower given its sepal and petal lengths and widths. To begin with, we will try to use only two features: the sepal length and width.
A typical classification workflow includes the following steps:
(1) select the features;
(2) build a model from the available data;
(3) evaluate the model on test data.
Therefore, before building the model we should split the data into a training set and a testing set: the training set is used to build the model, and the testing set to evaluate its performance.
Separate training and testing sets

The first thing we will do is split the dataset: 75% for training and the remaining 25% for testing. We will also standardize the data: for each feature, compute its mean, subtract it from the feature value, and divide the result by the standard deviation. After scaling, each feature will have zero mean.
In [213]:
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing

# Create dataset with only the first two attributes
X, y = X_iris[:, [0, 1]], y_iris
# Test set will be the 25% taken randomly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Standardize the features
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Check that, after scaling, the mean is 0 and the standard deviation is 1 (this should be exact in the training set, but only approximate in the testing set, because we used the training set mean and standard deviation):
In [214]:
print ('Training set mean:{:.2f} and standard deviation:{:.2f}'.format(np.average(X_train),np.std(X_train)))
print ('Testing set mean:{:.2f} and standard deviation:{:.2f}'.format(np.average(X_test),np.std(X_test)))
Training set mean:0.00 and standard deviation:1.00
Testing set mean:0.13 and standard deviation:0.71
Display the training data, after scaling.
In [215]:
colormarkers = [['red', 's'], ['greenyellow', 'o'], ['blue', 'x']]
plt.figure('Training Data')
for i in range(len(colormarkers)):
    xs = X_train[:, 0][y_train == i]
    ys = X_train[:, 1][y_train == i]
    plt.scatter(xs, ys, c=colormarkers[i][0], marker=colormarkers[i][1])

plt.title('Training instances, after scaling')
plt.legend(iris.target_names)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()

[Figure: training instances after scaling]
A linear, binary classifier

First, let us transform the problem into a binary classification task: we will only distinguish whether a flower is of the setosa class or not, simply by relabeling all non-setosa flowers with the same class.
In [169]:

import copy
y_train_setosa = copy.copy(y_train)
# Every 1 and 2 class in the training set will become just 1
y_train_setosa[y_train_setosa > 0] = 1
y_test_setosa = copy.copy(y_test)
y_test_setosa[y_test_setosa > 0] = 1

print ('New training target classes:\n{0}'.format(y_train_setosa))
New training target classes:
[1 0 1 1 1 0 0 1 0 1 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0
 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 1 1 0 1 0 1 1
 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1
 0]

Our current problem is a binary classification task that a straight line can separate.
Linear classification models have been studied for many years, and there are many different methods for building the separating hyperplane. Here we will use the SGDClassifier method to build a linear model, which also includes regularization. As its name suggests, SGDClassifier uses Stochastic Gradient Descent, a very effective method for finding the local minimum of a function.
Gradient descent was introduced by Louis Cauchy in 1847 to solve systems of linear equations. The idea is based on the fact that a multivariate function decreases fastest in the direction of its negative gradient, so if we want to reach its minimum we can keep moving in that direction.
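To make the idea concrete, here is a minimal sketch of plain (non-stochastic) gradient descent on a toy one-dimensional function; the function f(w) = (w - 3)^2, the learning rate, and the helper name toy_gradient_descent are made-up illustrations, not what SGDClassifier uses internally:

# Toy illustration of gradient descent (not SGDClassifier's internal loss).
# We minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
def toy_gradient_descent(grad, w0=0.0, learning_rate=0.1, n_steps=100):
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)  # step in the direction of the negative gradient
    return w

print (toy_gradient_descent(lambda w: 2 * (w - 3)))  # approaches 3, the minimizer of f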
Every classification method in scikit-learn follows the same pattern: we instantiate a classifier with its configurable parameters. In this case we use linear_model.SGDClassifier and tell scikit-learn to use a logistic (log) loss function.
In [170]:

from sklearn import linear_model
clf = linear_model.SGDClassifier(loss='log', random_state=42)
print (clf)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False)

Note that the classifier has several parameters. scikit-learn usually provides sensible default values, but keep in mind that the defaults do not necessarily yield a good model.
We then call the fit method to train the classifier.
In [171]:

clf.fit(X_train, y_train_setosa)


Out[171]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False)

We can print the coefficients of the trained classifier.
In [172]:

print (clf.coef_,clf.intercept_)
[[ 30.97129662 -17.82969037]] [ 17.34844577]

We can also plot the decision boundary.
In [173]:

x_min, x_max = X_train[:, 0].min() - .5, X_train[:, 0].max() + .5
y_min, y_max = X_train[:, 1].min() - .5, X_train[:, 1].max() + .5
xs = np.arange(x_min, x_max, 0.5)
fig, axes = plt.subplots()
axes.set_aspect('equal')
axes.set_title('Setosa classification')
axes.set_xlabel('Sepal length')
axes.set_ylabel('Sepal width')
axes.set_xlim(x_min, x_max)
axes.set_ylim(y_min, y_max)
plt.sca(axes)
# Plot setosa (class 0) versus the rest (class 1) of the binary problem
plt.scatter(X_train[:, 0][y_train_setosa == 0], X_train[:, 1][y_train_setosa == 0], c='red', marker='s')
plt.scatter(X_train[:, 0][y_train_setosa == 1], X_train[:, 1][y_train_setosa == 1], c='black', marker='x')
ys = (-clf.intercept_[0] - xs * clf.coef_[0, 0]) / clf.coef_[0, 1]
plt.plot(xs, ys, hold=True)
plt.show()
plt.show()

[Figure: setosa decision boundary]
The blue line is our decision boundary: whenever 30.97 × sepal_length − 17.82 × sepal_width + 17.34 (computed on the scaled features) is less than zero, the classifier predicts iris setosa (class 0).
Prediction

When we get a new flower, we only need its sepal length and width and then call the predict method. Prediction works the same way for all our machine learning models: we simply call predict, regardless of the classification method or how the model was built.
In [174]:

print ('If the flower has 4.7 sepal length and 3.1 sepal width, it is a {}'.format(
        iris.target_names[clf.predict(scaler.transform([[4.7, 3.1]]))]))

If the flower has 4.7 sepal length and 3.1 sepal width, it is a ['setosa']
Note that we first scaled the new instance, then applied the predict method, and used the result to look up the iris target names array.

Back to the original three-class problem

Now, do the training using the three original classes. Using scikit-learn this is simple: we do exactly the same procedure, using the original three target classes:
In [175]:

clf2 = linear_model.SGDClassifier(loss='log', random_state=33)
clf2.fit(X_train, y_train)
print (len(clf2.coef_))

For the three-class problem, scikit-learn decomposes it into three one-versus-rest binary sub-problems. For example, class 0 can be linearly separated from the rest, whereas classes 1 and 2 are mixed together.
In [176]:

x_min, x_max = X_train[:, 0].min() - .5, X_train[:, 0].max() + .5
y_min, y_max = X_train[:, 1].min() - .5, X_train[:, 1].max() + .5
xs = np.arange(x_min, x_max, 0.5)
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 6)
for i in [0, 1, 2]:
    axes[i].set_aspect('equal')
    axes[i].set_title('Class ' + iris.target_names[i] + ' versus the rest')
    axes[i].set_xlabel('Sepal length')
    axes[i].set_ylabel('Sepal width')
    axes[i].set_xlim(x_min, x_max)
    axes[i].set_ylim(y_min, y_max)
    plt.sca(axes[i])
    ys = (-clf2.intercept_[i] - xs * clf2.coef_[i, 0]) / clf2.coef_[i, 1]
    plt.plot(xs, ys, hold=True)
    for j in [0, 1, 2]:
        px = X_train[:, 0][y_train == j]
        py = X_train[:, 1][y_train == j]
        color = colormarkers[j][0] if j == i else 'black'
        marker = 'o' if j == i else 'x'
        plt.scatter(px, py, c=color, marker=marker)

plt.show()

[Figure: one-versus-rest decision boundaries for the three classes]

Let us evaluate on the previous instance to find the three-class prediction. Scikit-learn tries the three classifiers.
In [177]:

scaler.transform([[4.7,3.1]])
print(clf2.decision_function(scaler.transform([[4.7,3.1]])))
clf2.predict(scaler.transform([[4.7,3.1]]))
[[ 15.45793755  -1.60852842 -37.65225636]]

array([0])

The classifier provides a score for each class (the value of its decision function). In our example the first classifier decides that the flower is a setosa, so it cannot be either of the other two classes. If more than one class had a positive score, we would choose the class with the largest score, since its instance lies farthest from the decision boundary.
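As a quick sanity check (a sketch that relies on the one-versus-rest behaviour just described and reuses clf2 and scaler from the cells above), the predicted class should simply be the one with the largest decision score:

import numpy as np

scores = clf2.decision_function(scaler.transform([[4.7, 3.1]]))
# With one-versus-rest, predict() returns the class whose binary classifier gives the highest score.
print (np.argmax(scores, axis=1))  # expected: [0], the same class returned by clf2.predict(...)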

Evaluating the classifier

To assess how well a classification algorithm performs, we need an evaluation measure. The most common one is accuracy: given a classifier and a set of labeled instances, it measures the proportion of instances that the classifier predicts correctly.
In [178]:

from sklearn import metrics
y_train_pred = clf2.predict(X_train)
print ('Accuracy on the training set:{:.2f}'.format(metrics.accuracy_score(y_train, y_train_pred)))
Accuracy on the training set:0.83

This means that our classifier correctly predicts 83% of the instances in the training set. But this is actually a bad idea. The problem with evaluating on the training set is that you have built your model using this data, and it is possible that your model adjusts very well to it but performs poorly on previously unseen data (which is its ultimate purpose). This phenomenon is called overfitting, and you will see it again and again while you read this book. If you measure on your training data, you will never detect overfitting. So, never ever measure on your training data.
Remember we separated a portion of the training set? Now it is time to use it: since it was not used for training, we expect it to give us an idea of how well our classifier performs on previously unseen data.
In [179]:

y_pred = clf2.predict(X_test)
print ('Accuracy on the testing set:{:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))
Accuracy on the testing set:0.68

Generally, accuracy on the test set is lower than on the training set, since the model was fit on the training data.
One problem with using accuracy is that it does not reflect how well the model performs on each target class. For instance, we know that our classifier works very well on setosa flowers, but it may well fail when separating the other two classes. Measuring per-class performance is useful when deciding how to improve the results, whether by changing the training method or the features.
One effective way to evaluate a multi-class classifier is the confusion matrix: the value in row i, column j is the number of instances of class i that were predicted as class j. Given the true and predicted classes, we can easily print the confusion matrix.
In [180]:

print (metrics.confusion_matrix(y_test, y_pred))
[[ 8  0  0]
 [ 0  3  8]
 [ 0  4 15]]

Looking at the matrix, the value 8 in row 2, column 3 tells us that 8 flowers of class 1 were predicted as class 2. Our classifier predicts class 0 flowers very well, but it performs poorly on classes 1 and 2. The confusion matrix gives us a lot of useful information about the kinds of errors the classifier makes.
Accuracy on the test set is a good performance measure when the number of instances of each class is similar, i.e., when we have a uniform distribution of classes. However, consider that 99 percent of your instances belong to just one class (you have a skewed class distribution): a classifier that always predicts this majority class will have excellent accuracy, despite being an extremely naive method (and it will surely fail on the "difficult" 1% of cases).
Within scikit-learn, there are several evaluation functions; we will show three popular ones: precision, recall, and F1-score (or f-measure).
In [181]:

print (metrics.classification_report(y_test, y_pred, target_names=iris.target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.43      0.27      0.33        11
  virginica       0.65      0.79      0.71        19

avg / total       0.66      0.68      0.66        38

· Precision computes the proportion of instances predicted as positive that were correctly classified (it measures how right our classifier is when it says that an instance is positive).
· Recall counts the proportion of positive instances that were correctly identified (it measures how right our classifier is when faced with a positive instance).
· F1-score is the harmonic mean of precision and recall, and tries to combine both into a single number.
These values can also be computed by hand from the confusion matrix, as sketched below.
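A minimal sketch of that hand computation (rows of cm are true classes, columns are predicted classes; it reuses y_test and y_pred from the cells above):

import numpy as np
from sklearn import metrics

cm = metrics.confusion_matrix(y_test, y_pred).astype(float)
tp = np.diag(cm)                  # correctly classified instances of each class
precision = tp / cm.sum(axis=0)   # divide by everything predicted as that class
recall = tp / cm.sum(axis=1)      # divide by all true instances of that class
f1 = 2 * precision * recall / (precision + recall)
print (precision, recall, f1)     # should agree with the classification_report above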
Using the four flower attributes

Let us repeat the whole process using all four attributes and check whether the results improve.
In [182]:

# Test set will be the 25% taken randomly
X_train4, X_test4, y_train4, y_test4 = train_test_split(X_iris, y_iris, test_size=0.25, random_state=33)

# Standardize the features
scaler = preprocessing.StandardScaler().fit(X_train4)
X_train4 = scaler.transform(X_train4)
X_test4 = scaler.transform(X_test4)

# Build the classifier
clf3 = linear_model.SGDClassifier(loss='log', random_state=33)
clf3.fit(X_train4, y_train4)

# Evaluate the classifier on the evaluation set
y_pred4 = clf3.predict(X_test4)
print (metrics.classification_report(y_test4, y_pred4, target_names=iris.target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.78      0.64      0.70        11
  virginica       0.81      0.89      0.85        19

avg / total       0.84      0.84      0.84        38

Unsupervised Learning: Clustering

Sometimes we only have an unlabeled dataset and need to discover its hidden structure or patterns; with no given target classes and no ground truth to evaluate against, we call this kind of machine learning task unsupervised. Clustering methods partition the dataset into subsets (clusters), so that elements within the same subset are similar to each other and different from elements in the other subsets.
K-means is probably the most popular clustering algorithm because it is simple, easy to implement, and performs well on many different tasks. It partitions the instances into a given number of disjoint groups, called clusters.
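Before turning to scikit-learn's implementation, here is a minimal toy sketch of the k-means idea itself (toy_kmeans is a hypothetical helper written only for illustration, not library code): repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its assigned points.

import numpy as np

def toy_kmeans(X, k=3, n_iter=10, seed=33):
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # pick k random points as initial centroids
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

toy_labels, toy_centroids = toy_kmeans(X_train4[:, 0:2])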

In [183]:

from sklearn import cluster
clf_sepal = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
clf_sepal.fit(X_train4[:, 0:2])
Out[183]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=33, tol=0.0001,
    verbose=0)
We can show the label assigned for each instance (note that this label is a cluster name, it has nothing to do with our original target classes... actually, when you are doing clustering you have no target class!).
In [184]:
print (clf_sepal.labels_)
[1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 2 1 2 1 0 0 1 1 0 0 2 0 1 2 2 1 1 0 0 2 1 0
 1 1 2 1 0 2 0 1 0 2 2 0 2 1 0 0 1 0 0 0 2 1 0 1 0 1 0 1 2 1 1 1 0 1 0 2 1
 0 0 0 0 2 2 0 1 1 2 1 0 0 1 1 1 0 1 1 0 2 1 2 1 2 0 2 0 0 0 1 1 2 1 1 1 2
 0]

Using NumPy's indexing capabilities, we can display the actual target classes for each cluster, just to compare the built clusters with our flower type classes…
In [185]:

print (y_train4[clf_sepal.labels_==0])
[0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]

In [186]:

print (y_train4[clf_sepal.labels_==1])
[1 1 1 1 1 1 2 1 0 2 1 2 2 1 1 2 2 1 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
 2 1 2 1 1 2 1]
In [187]:
print (y_train4[clf_sepal.labels_==2])
[2 2 1 2 2 2 2 1 1 2 2 1 2 2 1 1 2 2 2 2 2 2 1 2 2]
As usual, it is a good idea to display our instances and the clusters they belong to, to get a first approximation of how well our algorithm is behaving on our data:
In [188]:
colormarkers = [['red', 's'], ['greenyellow', 'o'], ['blue', 'x']]
step = .01
margin = .1
sl_min, sl_max = X_train4[:, 0].min() - margin, X_train4[:, 0].max() + margin
sw_min, sw_max = X_train4[:, 1].min() - margin, X_train4[:, 1].max() + margin
sl, sw = np.meshgrid(
    np.arange(sl_min, sl_max, step),
    np.arange(sw_min, sw_max, step)
    )
Zs = clf_sepal.predict(np.c_[sl.ravel(), sw.ravel()]).reshape(sl.shape)
centroids_s = clf_sepal.cluster_centers_
Display the data points and the calculated regions
In [189]:
plt.figure(1)
plt.clf()
plt.imshow(Zs, interpolation='nearest', extent=(sl.min(), sl.max(), sw.min(), sw.max()), cmap=plt.cm.Pastel1, aspect='auto', origin='lower')
for j in [0, 1, 2]:
    px = X_train4[:, 0][y_train4 == j]
    py = X_train4[:, 1][y_train4 == j]
    plt.scatter(px, py, c=colormarkers[j][0], marker=colormarkers[j][1])
plt.scatter(centroids_s[:, 0], centroids_s[:, 1], marker='*', linewidths=3, color='black', zorder=10)
plt.title('K-means clustering on the Iris dataset using Sepal dimensions\nCentroids are marked with stars')
plt.xlim(sl_min, sl_max)
plt.ylim(sw_min, sw_max)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()

**Repeat the experiment, using petal dimensions**
In [190]:
clf_petal = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
clf_petal.fit(X_train4[:, 2:4])
Out[190]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=33, tol=0.0001,
    verbose=0)
In [191]:
print (y_train4[clf_petal.labels_==0])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0]
In [192]:
print (y_train4[clf_petal.labels_==1])
[1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 1]
In [193]:
print (y_train4[clf_petal.labels_==2])
[2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2]

Plot the clusters.

In [196]:
colormarkers = [['red', 's'], ['greenyellow', 'o'], ['blue', 'x']]
step = .01
margin = .1
sl_min, sl_max = X_train4[:, 2].min() - margin, X_train4[:, 2].max() + margin
sw_min, sw_max = X_train4[:, 3].min() - margin, X_train4[:, 3].max() + margin
sl, sw = np.meshgrid(
    np.arange(sl_min, sl_max, step),
    np.arange(sw_min, sw_max, step),
    )
Zs = clf_petal.predict(np.c_[sl.ravel(), sw.ravel()]).reshape(sl.shape)
centroids_s = clf_petal.cluster_centers_
plt.figure(1)
plt.clf()
plt.imshow(Zs, interpolation='nearest', extent=(sl.min(), sl.max(), sw.min(), sw.max()), cmap=plt.cm.Pastel1, aspect='auto', origin='lower')
for j in [0, 1, 2]:
    px = X_train4[:, 2][y_train4 == j]
    py = X_train4[:, 3][y_train4 == j]
    plt.scatter(px, py, c=colormarkers[j][0], marker=colormarkers[j][1])
plt.scatter(centroids_s[:, 0], centroids_s[:, 1], marker='*', linewidths=3, color='black', zorder=10)
plt.title('K-means clustering on the Iris dataset using Petal dimensions\nCentroids are marked with stars')
plt.xlim(sl_min, sl_max)
plt.ylim(sw_min, sw_max)
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.show()

[Figure: K-means clustering on the Iris dataset using sepal dimensions]
[Figure: K-means clustering on the Iris dataset using petal dimensions]
Now calculate the clusters, using all four attributes.
In [197]:

clf = cluster.KMeans(init='k-means++', n_clusters=3, random_state=33)
clf.fit(X_train4)
Out[197]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=33, tol=0.0001,
    verbose=0)

In [198]:

print (y_train[clf.labels_==0])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0]

In [199]:

print (y_train[clf.labels_==1])
[1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1]

In [200]:

print (y_train[clf.labels_==2])
[2 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2]

Measure precision & recall in the testing set, using all attributes, and using only petal measures
In [201]:

y_pred = clf.predict(X_test4)
print (metrics.classification_report(y_test, y_pred, target_names=['setosa','versicolor','virginica']))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.64      0.64      0.64        11
  virginica       0.79      0.79      0.79        19

avg / total       0.79      0.79      0.79        38

In [202]:

y_pred_petal = clf_petal.predict(X_test4[:, 2:4])
print (metrics.classification_report(y_test, y_pred_petal, target_names=['setosa','versicolor','virginica']))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.85      1.00      0.92        11
  virginica       1.00      0.89      0.94        19

avg / total       0.96      0.95      0.95        38

Wait, every performance measure is better using just two attributes. Is it possible that fewer features give better results? Although at first glance this seems contradictory, we will see in future notebooks that selecting the right subset of features, a process called feature selection, can actually improve the performance of our algorithms.

Supervised Learning: Regression

In every example so far, the values we wanted to predict came from a discrete set: for classification, the set of target classes; for clustering, the set of clusters computed from the training data. If instead we want to predict a value taken from a continuous range of real numbers, we have a regression problem.
We will use the classic Boston house prices dataset.

from sklearn.datasets import load_boston
boston = load_boston()
print ('Boston dataset shape:{}'.format(boston.data.shape))
Boston dataset shape:(506, 13)
In [204]:
print (boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Create training and testing sets, and scale values, as usual
In [206]:
X_train_boston = boston.data
y_train_boston = boston.target
X_train_boston = preprocessing.StandardScaler().fit_transform(X_train_boston)
y_train_boston = preprocessing.StandardScaler().fit_transform(y_train_boston)

Create a method to train and evaluate a model. This time, for evaluation, we will use cross-validation.
Cross-validation involves the following steps:
(1) partition the dataset into k different subsets (folds);
(2) build k different models, each trained on k−1 of the subsets and tested on the remaining one;
(3) compute the performance of each of the k models and use the average as the final result.
In [207]:

def train_and_evaluate(clf, X_train, y_train, folds):
    clf.fit(X_train, y_train)
    print ('Score on training set: {:.2f}'.format(clf.score(X_train, y_train)))
    # create a k-fold cross validation iterator of k=5 folds
    cv = sklearn.cross_validation.KFold(X_train.shape[0], folds, shuffle=True, random_state=33)
    scores = sklearn.cross_validation.cross_val_score(clf, X_train, y_train, cv=cv)
    print ('Average score using {}-fold crossvalidation:{:.2f}'.format(folds,np.mean(scores)))

The main advantage of cross-validation is that it reduces the variance of the evaluation measure. With a single training and testing split, the final result depends on how those two sets happened to be constructed. In machine learning we generally assume that training and test data are similarly distributed; if they are not, the results we obtain will be inaccurate. Cross-validation lowers this risk: since we average over k models built on k different data splits, we obtain a lower-variance, more trustworthy estimate.
scikit-learn has a linear model called linear_model.SGDRegressor, which uses stochastic gradient descent to minimize the squared loss.

In [208]:

from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=33)
train_and_evaluate(clf_sgd, X_train_boston, y_train_boston,5)
Score on training set: 0.73
Average score using 5-fold crossvalidation:0.70

For classification we used accuracy to measure model performance; in regression we predict real values. In scikit-learn the default score method for regressors is the coefficient of determination, R², which measures how much of the variation in the target variable is explained by the model. R² ranges from 0 to 1, reaching its maximum value of 1 when the model perfectly predicts all target values.
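As an illustration (a hand computation, not the library internals), R² can be obtained from the residual and total sums of squares; on the training data it should agree with clf_sgd.score:

import numpy as np

y_hat = clf_sgd.predict(X_train_boston)
ss_res = np.sum((y_train_boston - y_hat) ** 2)                    # residual sum of squares
ss_tot = np.sum((y_train_boston - np.mean(y_train_boston)) ** 2)  # total sum of squares
print ('{:.2f}'.format(1 - ss_res / ss_tot))  # should match clf_sgd.score(X_train_boston, y_train_boston)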
In [209]:

print(clf_sgd.coef_)
[-0.06777406  0.06767528 -0.04290825  0.08828856 -0.11797833  0.3394894
 -0.01969258 -0.23195707  0.09594823 -0.05271866 -0.19913907  0.10355794
 -0.36852382]

In the linear model above we set the penalty parameter to None (clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=33)). The penalty (regularization) term is introduced to avoid overfitting, by penalizing hyperplanes whose coefficients become too large. This parameter is usually 'l2' (the default) or 'l1'; below we use 'l2'.

In [210]:

clf_sgd1 = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', random_state=33)
train_and_evaluate(clf_sgd1, X_train_boston, y_train_boston, folds=5)
Score on training set: 0.73
Average score using 5-fold crossvalidation:0.70

Summary

To close this introduction, we summarize the main steps commonly followed to apply a supervised learning model in scikit-learn:
(1) Dataset. Choose the features you want to learn from and arrange them as a two-dimensional array, where each row is a learning instance and each column is a feature. Each feature is represented by a real number, but the raw data may not look like this; in real-world settings this preprocessing can take considerable time.
(2) Create an estimator instance (a classifier or a regressor). In scikit-learn, an estimator is built around fit and predict; it takes the model's parameters, which you can set by hand or tune with the available tools.
(3) Split the data into training and testing sets.
(4) Build the model with fit(X, y), where X is the training data and y the corresponding target classes.
(5) Evaluate the estimator on the testing set with predict(T), where T is the testing set.
(6) Compare the predictions with the true target classes.
These steps are only an overview: scikit-learn offers many more methods for other machine learning tasks (such as dimensionality reduction, clustering, and semi-supervised learning), as well as many data transformation utilities.
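To make the summary concrete, here is a minimal end-to-end sketch that follows these steps on the Iris data loaded earlier (the split ratio, the variable names, and random_state are arbitrary choices for illustration):

from sklearn.cross_validation import train_test_split
from sklearn import preprocessing, linear_model, metrics

# (1)-(3): a 2-D feature array, an estimator instance, and a train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.25, random_state=33)
scaler = preprocessing.StandardScaler().fit(X_tr)
clf_summary = linear_model.SGDClassifier(loss='log', random_state=33)

# (4): build the model on the training data
clf_summary.fit(scaler.transform(X_tr), y_tr)

# (5)-(6): predict on the testing set and compare with the true target classes
y_hat = clf_summary.predict(scaler.transform(X_te))
print (metrics.accuracy_score(y_te, y_hat))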
