1. What is Stacking?
Stacking, put simply, takes several base models, typically runs each through K-fold cross-validation to produce predictions, then combines each model's predictions into new features and trains a new model on those features.
A stacking model is inherently a layered structure. For simplicity, we only analyze two-level stacking here. Suppose we have three base models M1, M2, M3.
Base model M1 is trained on the training set train, then used to predict the labels of train and test, giving P1 and T1 respectively (and likewise P2/T2 for M2 and P3/T3 for M3).
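The shapes involved can be sketched with plain NumPy (the values below are random placeholders, not real model output; in practice P1–P3 come from K-fold predictions, as in the code further down):

```python
import numpy as np

# Hypothetical out-of-fold predictions from three base models M1, M2, M3
rng = np.random.RandomState(0)
P1, P2, P3 = (rng.rand(100, 1) for _ in range(3))  # train-side predictions
T1, T2, T3 = (rng.rand(50, 1) for _ in range(3))   # test-side predictions

# The level-2 training set stacks P1..P3 column-wise; its test set stacks T1..T3
train_level2 = np.hstack([P1, P2, P3])  # shape (100, 3)
test_level2 = np.hstack([T1, T2, T3])   # shape (50, 3)
```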
2. What are the benefits of Stacking?
In data-science competitions, people generally predict with a single model, or compare several models and pick the best one; the cross-validation we do there is mostly a weighted average over several models. With a single model, we usually apply K-fold cross-validation to reduce the risk of overfitting and improve accuracy.
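As a minimal sketch of K-fold cross-validation on its own (using sklearn's `cross_val_score` helper, which the code below does not use, on the same digits data):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
# 5-fold CV: each fold is held out once, giving 5 accuracy estimates
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=5)
print(scores.mean())
```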
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
import numpy as np
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
import pandas as pd
# Load the dataset and split it into training and test data
data = load_digits()
data_D = preprocessing.StandardScaler().fit_transform(data.data)
data_L = data.target
data_train, data_test, label_train, label_test = train_test_split(data_D,data_L,random_state=1,test_size=0.7)
def SelectModel(modelname):
    if modelname == "SVM":
        from sklearn.svm import SVC
        model = SVC(kernel='rbf', C=16, gamma=0.125, probability=True)
    elif modelname == "GBDT":
        from sklearn.ensemble import GradientBoostingClassifier
        model = GradientBoostingClassifier()
    elif modelname == "RF":
        from sklearn.ensemble import RandomForestClassifier
        model = RandomForestClassifier()
    elif modelname == "XGBOOST":
        from xgboost import XGBClassifier
        model = XGBClassifier()
    elif modelname == "KNN":
        from sklearn.neighbors import KNeighborsClassifier
        model = KNeighborsClassifier()
    else:
        raise ValueError("unknown model name: %s" % modelname)
    return model
def get_oof(clf, n_folds, X_train, y_train, X_test):
    """Return out-of-fold probabilities for the train set and the
    fold-averaged probabilities for the test set."""
    ntrain = X_train.shape[0]
    ntest = X_test.shape[0]
    classnum = len(np.unique(y_train))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    oof_train = np.zeros((ntrain, classnum))
    oof_test = np.zeros((ntest, classnum))
    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        kf_X_train = X_train[train_index]  # fold training data
        kf_y_train = y_train[train_index]  # fold training labels
        kf_X_test = X_train[test_index]    # fold hold-out (validation) set
        clf.fit(kf_X_train, kf_y_train)
        oof_train[test_index] = clf.predict_proba(kf_X_test)
        oof_test += clf.predict_proba(X_test)
    oof_test = oof_test / float(n_folds)
    return oof_train, oof_test
# Baseline: using a single classifier on its own
clf_second = RandomForestClassifier()
clf_second.fit(data_train, label_train)
pred = clf_second.predict(data_test)
accuracy = metrics.accuracy_score(label_test, pred) * 100
print(accuracy)
# 91.0969793323
# Using the stacking approach
# Level 1: the out-of-fold predictions become the level-2 training features
modelist = ['SVM', 'GBDT', 'RF', 'KNN']
newfeature_list = []
newtestdata_list = []
for modelname in modelist:
    clf_first = SelectModel(modelname)
    oof_train_, oof_test_ = get_oof(clf=clf_first, n_folds=10,
                                    X_train=data_train, y_train=label_train,
                                    X_test=data_test)
    newfeature_list.append(oof_train_)
    newtestdata_list.append(oof_test_)
# Combine the per-model predictions column-wise into the new feature matrices
newfeature = np.concatenate(newfeature_list, axis=1)
newtestdata = np.concatenate(newtestdata_list, axis=1)
# Level 2: train on the output of level 1
clf_second1 = RandomForestClassifier()
clf_second1.fit(newfeature, label_train)
pred = clf_second1.predict(newtestdata)
accuracy = metrics.accuracy_score(label_test, pred) * 100
print(accuracy)
# 96.4228934817
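As a sanity check, the train-side half of get_oof (the out-of-fold probability matrix) matches what sklearn's `cross_val_predict` produces directly; a quick sketch on the smaller iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
# Every row is predicted by a model that never saw it during training,
# mirroring oof_train in get_oof
oof = cross_val_predict(RandomForestClassifier(random_state=1), X, y,
                        cv=5, method='predict_proba')
print(oof.shape)  # (150, 3): one probability column per class
```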
- This is only two-level stacking, a basic stacking setup; three, four, or more levels can be built the same way.
- Because the level-2 input is a transformed feature space (the level-1 classifiers' predictions become the new features), the test set must go through the same transformation: the level-2 classifier was trained on a different feature set, so what it learned cannot be applied to the raw test set. Remember that the original data was randomly split into training and test sets at the very start!
- The K-fold approach applies the cross-validation idea, so no data leaks (the test set is never used for training; only hold-out folds from the training set are), which is why these are called out-of-fold predictions.
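For comparison, scikit-learn (0.22+) ships this whole pattern as `StackingClassifier`; below is a sketch with a subset of the base models above (GBDT omitted to keep it quick, and default SVC hyper-parameters rather than the tuned C/gamma from the code). `cv=5` builds the out-of-fold level-2 features internally, just as get_oof does, and `stack_method='predict_proba'` feeds class probabilities to the final estimator:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

stack = StackingClassifier(
    estimators=[('svm', SVC(probability=True)),
                ('rf', RandomForestClassifier(random_state=1)),
                ('knn', KNeighborsClassifier())],
    final_estimator=RandomForestClassifier(random_state=1),
    cv=5, stack_method='predict_proba')
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(acc)
```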