Machine Learning Model Evaluation and Hyperparameter Tuning Explained


Author: Li Zuxian, Shenzhen University; member of the Datawhale university group

Machine learning problems fall into two basic categories: regression and classification. Previous articles have introduced many of the basic models; see the Datawhale machine learning collection.

But once a model has been built, how do we evaluate how good it is, and how do we improve it? That is the topic of this post: machine learning model evaluation and hyperparameter tuning. It covers:

  • Simplifying the workflow with pipelines

  • Evaluating model performance with k-fold cross-validation

  • Debugging algorithms with learning and validation curves

  • Tuning hyperparameters with grid search

  • Comparing different performance metrics

I. Simplifying the Workflow with Pipelines

Many machine learning tasks require a sequence of basic operations before modeling. For example, before fitting a logistic regression we may first standardize the data, then reduce its dimensionality with PCA, and only then fit the model and make predictions. Is there a way to chain these operations into a single workflow? See the code below:

1. Load the basic libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")
import warnings
warnings.filterwarnings("ignore")

2. Load the data and do basic preprocessing

# Load the data
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)
# Basic preprocessing
from sklearn.preprocessing import LabelEncoder


X = df.iloc[:,2:].values
y = df.iloc[:,1].values
le = LabelEncoder()    # encode the string labels 'M'/'B' as 1/0
y = le.fit_transform(y)
le.transform(['M','B'])
# Split the data 80:20
from sklearn.model_selection import train_test_split


X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)

3. Wrap all the steps in a single pipeline to form one workflow: standardization + PCA + logistic regression

There are two ways to do this:

Method 1: make_pipeline

# Wrap all the steps in a single pipeline to form one workflow:
## standardization + PCA + logistic regression


### Method 1: make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


pipe_lr1 = make_pipeline(StandardScaler(),PCA(n_components=2),LogisticRegression(random_state=1))
pipe_lr1.fit(X_train,y_train)
y_pred1 = pipe_lr1.predict(X_test)
print("Test Accuracy: %.3f"% pipe_lr1.score(X_test,y_test))
Test Accuracy: 0.956

Method 2: Pipeline

# Wrap all the steps in a single pipeline to form one workflow:
## standardization + PCA + logistic regression


### Method 2: Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


pipe_lr2 = Pipeline([('std',StandardScaler()),('pca',PCA(n_components=2)),('lr',LogisticRegression(random_state=1))])
pipe_lr2.fit(X_train,y_train)
y_pred2 = pipe_lr2.predict(X_test)
print("Test Accuracy: %.3f"% pipe_lr2.score(X_test,y_test))
Test Accuracy: 0.956
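
As a side note (not in the original article), a fitted Pipeline exposes its steps through named_steps, so intermediate results can be inspected. A minimal sketch, assuming the pipe_lr2 fitted above:

# Inspect a fitted pipeline step: how much variance do the two PCA components keep?
# The step names 'std', 'pca', 'lr' are the ones given when pipe_lr2 was built.
print(pipe_lr2.named_steps['pca'].explained_variance_ratio_)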

II. Evaluating Model Performance with k-Fold Cross-Validation

In k-fold cross-validation the training set is split into k folds; each fold serves once as validation data while the model is trained on the remaining k-1 folds, and the k scores are averaged.

Evaluation method 1: k-fold cross-validation

# Evaluation method 1: k-fold cross-validation


from sklearn.model_selection import cross_val_score


scores1 = cross_val_score(estimator=pipe_lr1,X = X_train,y = y_train,cv=10,n_jobs=1)
print("CV accuracy scores:%s" % scores1)
print("CV accuracy:%.3f +/-%.3f"%(np.mean(scores1),np.std(scores1)))

Evaluation method 2: stratified k-fold cross-validation

# Evaluation method 2: stratified k-fold cross-validation


from sklearn.model_selection import StratifiedKFold


kfold = StratifiedKFold(n_splits=10,shuffle=True,random_state=1).split(X_train,y_train)
scores2 = []
for k,(train,test) in enumerate(kfold):
    pipe_lr1.fit(X_train[train],y_train[train])
    score = pipe_lr1.score(X_train[test],y_train[test])
    scores2.append(score)
    print('Fold:%2d,Class dist.:%s,Acc:%.3f'%(k+1,np.bincount(y_train[train]),score))
print('\nCV accuracy :%.3f +/-%.3f'%(np.mean(scores2),np.std(scores2)))
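
A side note (not in the original article): for classifiers, passing an integer cv to cross_val_score already uses stratified folds under the hood, so the manual loop above is mainly useful when you want per-fold details such as the class counts. A minimal equivalent sketch:

# cross_val_score with an explicit StratifiedKFold gives the same kind of estimate
scores2b = cross_val_score(pipe_lr1,X_train,y_train,cv=StratifiedKFold(n_splits=10),n_jobs=1)
print('CV accuracy :%.3f +/-%.3f'%(np.mean(scores2b),np.std(scores2b)))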

III. Debugging Algorithms with Learning and Validation Curves

A model that is too complex, i.e. one with too many degrees of freedom or parameters, risks overfitting (high variance); a model that is too simple risks underfitting (high bias).

Below we use learning and validation curves to identify and address these variance and bias problems. On a learning curve, a large gap between training and validation accuracy signals high variance, while two curves converging at a low accuracy signal high bias:

1. Diagnosing bias and variance with a learning curve

# Diagnose bias and variance with a learning curve
from sklearn.model_selection import learning_curve


pipe_lr3 = make_pipeline(StandardScaler(),LogisticRegression(random_state=1,penalty='l2'))
train_sizes,train_scores,test_scores = learning_curve(estimator=pipe_lr3,X=X_train,y=y_train,train_sizes=np.linspace(0.1,1,10),cv=10,n_jobs=1)
train_mean = np.mean(train_scores,axis=1)
train_std = np.std(train_scores,axis=1)
test_mean = np.mean(test_scores,axis=1)
test_std = np.std(test_scores,axis=1)
plt.plot(train_sizes,train_mean,color='blue',marker='o',markersize=5,label='training accuracy')
plt.fill_between(train_sizes,train_mean+train_std,train_mean-train_std,alpha=0.15,color='blue')
plt.plot(train_sizes,test_mean,color='red',marker='s',markersize=5,label='validation accuracy')
plt.fill_between(train_sizes,test_mean+test_std,test_mean-test_std,alpha=0.15,color='red')
plt.xlabel("Number of training samples")
plt.ylabel("Accuracy")
plt.legend(loc='lower right')
plt.ylim([0.8,1.02])
plt.show()

2. Addressing underfitting and overfitting with a validation curve

# Address underfitting and overfitting with a validation curve
from sklearn.model_selection import validation_curve


pipe_lr3 = make_pipeline(StandardScaler(),LogisticRegression(random_state=1,penalty='l2'))
param_range = [0.001,0.01,0.1,1.0,10.0,100.0]
train_scores,test_scores = validation_curve(estimator=pipe_lr3,X=X_train,y=y_train,param_name='logisticregression__C',param_range=param_range,cv=10,n_jobs=1)
train_mean = np.mean(train_scores,axis=1)
train_std = np.std(train_scores,axis=1)
test_mean = np.mean(test_scores,axis=1)
test_std = np.std(test_scores,axis=1)
plt.plot(param_range,train_mean,color='blue',marker='o',markersize=5,label='training accuracy')
plt.fill_between(param_range,train_mean+train_std,train_mean-train_std,alpha=0.15,color='blue')
plt.plot(param_range,test_mean,color='red',marker='s',markersize=5,label='validation accuracy')
plt.fill_between(param_range,test_mean+test_std,test_mean-test_std,alpha=0.15,color='red')
plt.xscale('log')
plt.xlabel("Parameter C")
plt.ylabel("Accuracy")
plt.legend(loc='lower right')
plt.ylim([0.8,1.02])
plt.show()

IV. Hyperparameter Tuning with Grid Search

If only a single hyperparameter needs tuning, adjusting it manually with a validation curve works well. But as the number of hyperparameters grows, can the search be automated? Note how the running times of the approaches below compare.

(Note the distinction between parameters and hyperparameters: parameters can be optimized by the learning algorithm itself, such as the coefficients of a logistic regression; hyperparameters cannot, such as the regularization strength.)
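
An illustrative sketch of that distinction (not from the original article): C is fixed before training, while the coefficients come out of fit().

# C is a hyperparameter, chosen before training
lr = LogisticRegression(C=1.0)
lr.fit(X_train,y_train)
print(lr.coef_)    # the coefficients are parameters, learned from the data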

Method 1: grid search with GridSearchCV()

# Method 1: grid search with GridSearchCV()
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import time


start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},{'svc__C':param_range,'svc__gamma':param_range,'svc__kernel':['rbf']}]
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring='accuracy',cv=10,n_jobs=-1)
gs = gs.fit(X_train,y_train)
end_time = time.time()
print("網格搜索經歷時間:%.3f S" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)

Method 2: randomized search with RandomizedSearchCV()

# Method 2: randomized search with RandomizedSearchCV()
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import time


start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},{'svc__C':param_range,'svc__gamma':param_range,'svc__kernel':['rbf']}]
# param_grid = [{'svc__C':param_range,'svc__kernel':['linear','rbf'],'svc__gamma':param_range}]
gs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid,scoring='accuracy',cv=10,n_jobs=-1)
gs = gs.fit(X_train,y_train)
end_time = time.time()
print("隨機網格搜索經歷時間:%.3f S" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)

Method 3: nested cross-validation

In nested cross-validation, an inner loop (here GridSearchCV with cv=2) selects the hyperparameters, while an outer loop (here cross_val_score with cv=5) estimates the generalization error of the whole tuning procedure.

# Method 3: nested cross-validation
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import time


start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},{'svc__C':param_range,'svc__gamma':param_range,'svc__kernel':['rbf']}]
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,scoring='accuracy',cv=2,n_jobs=-1)
scores = cross_val_score(gs,X_train,y_train,scoring='accuracy',cv=5)
end_time = time.time()
print("嵌套交叉驗證:%.3f S" % float(end_time-start_time))
print('CV accuracy :%.3f +/-%.3f'%(np.mean(scores),np.std(scores)))
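
The nested CV score above estimates how well the whole tuning procedure generalizes; it does not itself produce a final model. One possible follow-up (not in the original code) is to refit the search on the full training set, assuming the gs defined above:

# Refit the grid search on the full training set to obtain a final model
gs = gs.fit(X_train,y_train)
clf = gs.best_estimator_
print('Test accuracy: %.3f' % clf.score(X_test,y_test))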

V. Comparing Different Performance Metrics

Accuracy is not always the only metric worth considering, because different kinds of prediction errors can carry very different costs. For example, when screening for tumors, predicting that patient A, who actually has a tumor, is healthy is far more costly than predicting that a healthy patient A has one: the former delays treatment, while the latter only triggers further examination. So we need a broader set of metrics:

1. Plotting the confusion matrix

# Plot the confusion matrix
from sklearn.metrics import confusion_matrix


pipe_svc.fit(X_train,y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test,y_pred=y_pred)
fig,ax = plt.subplots(figsize=(2.5,2.5))
ax.matshow(confmat, cmap=plt.cm.Blues,alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j,y=i,s=confmat[i,j],va='center',ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()
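
To make the asymmetric-cost argument above concrete, here is a minimal sketch with made-up cost values (the 1:10 weighting is purely illustrative), assuming the confmat computed above:

# A false negative (missed tumor) is weighted far more heavily than a false positive
tn, fp, fn, tp = confmat.ravel()    # sklearn lays out the 2x2 matrix as [[tn, fp], [fn, tp]]
total_cost = 1*fp + 10*fn           # hypothetical costs: FP = 1, FN = 10
print('Total misclassification cost:', total_cost)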

2. Computing the various metrics

# Compute the various metrics
from sklearn.metrics import precision_score,recall_score,f1_score


print('Precision:%.3f'%precision_score(y_true=y_test,y_pred=y_pred))
print('recall_score:%.3f'%recall_score(y_true=y_test,y_pred=y_pred))
print('f1_score:%.3f'%f1_score(y_true=y_test,y_pred=y_pred))

3. Combining different metrics with grid search

# Combine a custom metric with GridSearchCV
from sklearn.metrics import make_scorer,f1_score
scorer = make_scorer(f1_score,pos_label=0)
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
gs = gs.fit(X_train,y_train)
print(gs.best_score_)
print(gs.best_params_)

4. Plotting the ROC curve

# Plot the ROC curve
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import make_scorer,f1_score
scorer = make_scorer(f1_score,pos_label=0)
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
y_pred = gs.fit(X_train,y_train).decision_function(X_test)
#y_pred = gs.predict(X_test)
fpr,tpr,threshold = roc_curve(y_test, y_pred)  # compute the false positive rate and true positive rate
roc_auc = auc(fpr,tpr)  # compute the AUC
lw = 2
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x-axis, TPR on the y-axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
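
As a quick cross-check (not in the original code), the same AUC can be computed in one call directly from the decision-function scores:

# One-call AUC from the decision scores
from sklearn.metrics import roc_auc_score
print('AUC: %.3f' % roc_auc_score(y_test, y_pred))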
