Machine Learning in Action (4): Model Validation and Selection

Model selection and evaluation live mainly in the sklearn.model_selection module. This post only gives an overview and the usage of the most common functions; for more detail, see sklearn.model_selection: Model Selection.

Overview

Splitter Classes

model_selection.KFold([n_splits, shuffle, …]) K-Folds cross-validator
model_selection.GroupKFold([n_splits]) K-fold iterator variant with non-overlapping groups.
model_selection.StratifiedKFold([n_splits, …]) Stratified K-Folds cross-validator
model_selection.LeaveOneGroupOut() Leave One Group Out cross-validator
model_selection.LeavePGroupsOut(n_groups) Leave P Group(s) Out cross-validator
model_selection.LeaveOneOut() Leave-One-Out cross-validator
model_selection.LeavePOut(p) Leave-P-Out cross-validator
model_selection.ShuffleSplit([n_splits, …]) Random permutation cross-validator
model_selection.GroupShuffleSplit([…]) Shuffle-Group(s)-Out cross-validation iterator
model_selection.StratifiedShuffleSplit([…]) Stratified ShuffleSplit cross-validator
model_selection.PredefinedSplit(test_fold) Predefined split cross-validator
model_selection.TimeSeriesSplit([n_splits]) Time Series cross-validator
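
All of these splitter classes share the same interface: construct the splitter, then iterate over the (train indices, test indices) pairs produced by its split method. A minimal sketch with KFold on made-up toy data (the other classes are used the same way):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # each iteration yields the index arrays of one train/test split
    print("train:", train_idx, "test:", test_idx)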

Splitter Functions

model_selection.train_test_split(*arrays, …)

Split arrays or matrices into random train and test subsets.

model_selection.check_cv([cv, y, classifier]) Input checker utility for building a cross-validator

Hyper-parameter optimizers

model_selection.GridSearchCV(estimator, …) Exhaustive search over specified parameter values for an estimator.
model_selection.RandomizedSearchCV(…[, …]) Randomized search on hyper parameters.
model_selection.ParameterGrid(param_grid) Grid of parameters with a discrete number of values for each.
model_selection.ParameterSampler(…[, …]) Generator on parameters sampled from given distributions.
model_selection.fit_grid_point(X, y, …[, …]) Run fit on one set of parameters.

Model validation

model_selection.cross_val_score(estimator, X) Evaluate a score for the model by cross-validation
model_selection.cross_val_predict(estimator, X) Generate cross-validated estimates for each input data point
model_selection.permutation_test_score(…) Evaluate the significance of a cross-validated score with permutations
model_selection.learning_curve(estimator, X, y) Learning curve.
model_selection.validation_curve(estimator, …) Validation curve.
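
cross_val_score is covered in detail in section 2 below. Of the others, cross_val_predict deserves a quick sketch here: instead of per-fold scores it returns, for every sample, the prediction made by the model that did not see that sample during training. A minimal sketch, assuming the same Boston data used throughout this post:

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X, y = load_boston(return_X_y=True)
# y_pred[i] comes from the fold in which sample i was held out
y_pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=10)
print(y_pred.shape)  # (506,)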

1. The Splitting Function

Function prototype:
sklearn.model_selection.train_test_split(*arrays, **options)

Purpose:
Randomly split arrays or matrices into train and test subsets. The return value is a list whose length is twice that of arrays (each input is split into a train part and a test part, hence the doubling). If the input is sparse, the output will be of type scipy.sparse.csr_matrix; otherwise the output type matches the input.

Parameters:
*arrays : indexable sequences; allowed inputs are lists, ndarrays, scipy-sparse matrices, or pandas DataFrames.
test_size : float, int, or None (default None)

If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
If int, it represents the absolute number of test samples.
If None, the value is inferred from train_size; if train_size is also None, test_size is set to 0.25.

train_size : float, int, or None (default None)

If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split.
If int, it represents the absolute number of train samples.
If None, the value is inferred from test_size.

random_state : int or RandomState; the pseudo-random number generator used for the random sampling.
stratify : array-like or None (default None)
If not None, the data is split in a stratified fashion, using this as the class labels (see the short sketch below).
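
Since none of the examples below exercise stratify, here is a minimal sketch of it on made-up labels: with an imbalanced y, stratification keeps the class ratio identical in both subsets.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # 10 toy samples, 2 features
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 80%/20% class balance

# stratify=y preserves the 80/20 label ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))  # [4 1] [4 1]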

Example 1:

import numpy as np
from sklearn.model_selection import train_test_split

# 8 samples, 4 features each
value = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4],
         [5, 5, 5, 5], [6, 6, 6, 6], [7, 7, 7, 7], [8, 8, 8, 8]]
dataset = np.array(value)

print(dataset)

# one array in, so two arrays out: 80% train, 20% test
X_train, X_test = train_test_split(dataset, test_size=0.2)
print(X_train)
print(X_test)

(output figure omitted)

Example 2:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

boston = load_boston()
dataSet = boston.data
labels = boston.target

print(dataSet.shape)
print(labels.shape)

# two arrays in, so four arrays out:
# [train data, test data, train labels, test labels]
splits = train_test_split(dataSet, labels, test_size=0.3)
print("elements in splits:\n", len(splits))
print("\n")

print("dataSet split into:")
print(splits[0].shape)
print(splits[1].shape)

print("labels split into:")
print(splits[2].shape)
print(splits[3].shape)

Result:
(output figure omitted)
From the output you can clearly see the shapes of the pieces after the split.

Of course, you can also pass in a DataFrame directly, which is very convenient.

import pandas as pd
from sklearn.model_selection import train_test_split

frame = pd.DataFrame(
    data={"X": [1, 2, 3, 4, 5, 6], "y": [0, 1, 0, 1, 1, 1]}
)
print("frame\n", frame)

# the split preserves the DataFrame type
X_train, X_test = train_test_split(frame, test_size=0.2)
print("X_train:\n", X_train)
print("X_test:\n", X_test)

Result:
(output figure omitted)

2. Model Scoring

Ⅰ.sklearn.model_selection.cross_val_score

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

Score a model by cross-validation. The return value is an array of shape (len(list(cv)),), one score per split.
Parameters:
estimator : an object implementing "fit", used to fit the data; in practice, the classifier or regressor being evaluated.
X : array, the data to fit.
y : array-like, optional, default None; the labels corresponding to the data.
groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into train/test set.
scoring : string or callable, optional, default None.
cv : int, cross-validation generator, or an iterable, optional. Determines the cross-validation splitting strategy. Possible inputs are (for the third form, see the sketch after this list):

None: use the default 3-fold cross-validation.
An integer: the number of folds.
An object to be used as a cross-validation generator.

n_jobs : int, optional. The number of CPUs used for the computation; -1 means use all CPUs.
verbose : integer, optional
The verbosity level.
fit_params : dict, optional
Parameters to pass to the fit method of the estimator.
pre_dispatch : int, or string, optional
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
An int, giving the exact number of total jobs that are spawned.
A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
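
As a sketch of the third form of cv, any splitter object from the overview table can be passed directly, for example to shuffle before a 10-fold split (using the same Boston data as in the examples below):

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_boston(return_X_y=True)
# a splitter object gives full control over the splitting strategy
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         scoring='neg_mean_squared_error', cv=cv)
print(scores.shape)  # (10,)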

This function is a very common tool for model selection. Here we use the built-in Boston housing dataset and the Ridge regression model to give a simple example of its use.

Example 1:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = load_boston().data
y = load_boston().target

ridge = Ridge(alpha=1.0)
# 'neg_mean_squared_error' returns negated MSE, so flip the sign back
test_score = -1 * cross_val_score(estimator=ridge, X=X, y=y,
                                  scoring='neg_mean_squared_error', cv=10)
print(test_score)

Here ridge regression with alpha=1.0 is evaluated with 10-fold cross-validation, so the call returns a 10-element array, where each element is the loss when the corresponding fold of the original dataset is used as the validation set.
Result:
(output figure omitted)

In practice, we take the mean of these losses as the final loss over the whole dataset.
Here is another example, this time looking at how the choice of the ridge regression parameter affects the result.

Example 2:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

X = load_boston().data
y = load_boston().target

# try 50 alphas spaced logarithmically between 1e-3 and 1e2
alphas = np.logspace(start=-3, stop=2, num=50)
score = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    # RMSE of each of the 10 folds
    test_score = np.sqrt(-1 * cross_val_score(estimator=ridge, X=X, y=y,
                                              scoring='neg_mean_squared_error', cv=10))
    # mean RMSE across folds as the CV error for this alpha
    score.append(np.mean(test_score))
    print(test_score)

print(score)
plt.plot(alphas, score)
plt.title("Alpha vs CV Error")
plt.show()

(output figure omitted)

To dig a little deeper, we can bring in a random forest and compare it against ridge regression.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

X = load_boston().data
y = load_boston().target

# ridge regression: sweep the regularization strength
alphas = np.logspace(start=-3, stop=2, num=50)
score_Ridge = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    test_score = np.sqrt(-1 * cross_val_score(estimator=ridge, X=X, y=y,
                                              scoring='neg_mean_squared_error', cv=10))
    score_Ridge.append(np.mean(test_score))

# random forest: sweep the fraction of features considered at each split
max_features = [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
score_RF = []
for max_feature in max_features:
    RF = RandomForestRegressor(n_estimators=200, max_features=max_feature)
    test_score = np.sqrt(-1 * cross_val_score(estimator=RF, X=X, y=y,
                                              scoring='neg_mean_squared_error', cv=10))
    score_RF.append(np.mean(test_score))

ax1 = plt.subplot(2, 1, 1)
ax2 = plt.subplot(2, 1, 2)

ax1.plot(alphas, score_Ridge, label="Alpha vs CV Error")
ax1.set_xlabel("Alpha")
ax1.set_ylabel("CV Error")

ax2.plot(max_features, score_RF, label="Max_Features vs CV Error")
ax2.set_xlabel("Max_Features")
ax2.set_ylabel("CV Error")

plt.show()

Result:
(output figure omitted)
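
As a closing note, the manual loop over alphas above is exactly the kind of search that GridSearchCV, listed among the hyper-parameter optimizers in the overview, automates. A minimal sketch on the same data:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_boston(return_X_y=True)

param_grid = {'alpha': np.logspace(-3, 2, 50)}
search = GridSearchCV(Ridge(), param_grid,
                      scoring='neg_mean_squared_error', cv=10)
search.fit(X, y)
# the alpha with the best mean CV score, and that (negated MSE) score
print(search.best_params_, search.best_score_)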
