Stacking and Blending, with Collected Notes on Using the heamy Library

References

1. heamy official documentation
2. heamy GitHub repository
3. Blog post introducing the core modules of the heamy library
4. heamy usage examples


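I. Stacking versus Blending

Core difference: stacking is built on cross-validation, blending on a hold-out split.

Many people use "stack" and "blend" interchangeably, so there is no need to get hung up on the names. What matters is:
[1] Stacking uses k-fold cross-validation to produce the meta-model's features: each base model contributes one meta-feature, its out-of-fold predictions, as input to the second-level model.
[2] Blending uses a hold-out split: for example, the base models are trained on 80% of the training data, and their predictions on the remaining 20% become the meta-model's features.
(In stacking, the out-of-fold predictions cover the entire training set, so the second-level model is trained on the same rows the base models were fitted on. In blending, the second-level model sees only the predictions for the held-out 20%, which keeps the data used to fit the base models separate from the data used to fit the second-level model. Because stacking does not separate the two, it carries a greater risk of data leakage and overfitting; blending pays for the separation by training the second-level model on less data.)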
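The difference in row coverage can be checked without training any model. The sketch below is only an illustration using scikit-learn's splitters on a hypothetical set of 100 row ids; the names `row_ids`, `stack_rows`, `fit_rows`, and `blend_rows` are made up for this example and do not come from heamy.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical training set of 100 rows, identified by index only.
n_rows = 100
row_ids = np.arange(n_rows)

# Stacking: every row lands in exactly one validation fold, so every row
# receives one out-of-fold prediction per base model.
kfold = KFold(n_splits=5, shuffle=True, random_state=111)
stack_rows = np.concatenate([valid_idx for _, valid_idx in kfold.split(row_ids)])

# Blending: base models are fitted on 80% of the rows, and only the
# held-out 20% receive predictions that feed the second-level model.
fit_rows, blend_rows = train_test_split(row_ids, test_size=0.2, random_state=111)

print(len(np.unique(stack_rows)))                 # 100
print(len(blend_rows))                            # 20
print(len(np.intersect1d(fit_rows, blend_rows)))  # 0
```

With stacking the second-level model gets a meta-feature for all 100 rows; with blending it gets one for only 20 rows, none of which were used to fit the base models.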


II. Using the heamy Library

1. Stacking example

from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

Load the dataset
# load_boston and LinearRegression's normalize parameter (used below) were both
# removed in scikit-learn 1.2, so this example needs an older scikit-learn release
from sklearn.datasets import load_boston
data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)

Create the dataset
dataset = Dataset(X_train,y_train,X_test)

Create the RF and LR models
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50},name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression, parameters={'normalize': True},name='lr')

Stack the two models
# Returns new dataset with out-of-fold predictions
pipeline = ModelsPipeline(model_rf,model_lr)
stack_ds = pipeline.stack(k=10,seed=111)

Train a linear regression on the stacked data as the second-level model
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
results = stacker.predict()

Validate with 10-fold cross-validation
results10 = stacker.validate(k=10,scorer=mean_absolute_error)
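To make the two levels visible, the same idea can be written out with plain scikit-learn. This is a sketch of the general technique, not a description of heamy's internals. Two assumptions are mine: the data is a synthetic stand-in from `make_regression` rather than the Boston data, and the test-set meta-features come from refitting each base model on all training rows, which is one common choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Synthetic stand-in data (an assumption; not the Boston data used above).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)

base_models = [
    RandomForestRegressor(n_estimators=50, random_state=111),
    LinearRegression(),
]
folds = KFold(n_splits=10, shuffle=True, random_state=111)

# Level 1, training side: one column of out-of-fold predictions per base model.
meta_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=folds) for m in base_models]
)
# Level 1, test side: refit each base model on all training rows, then predict.
meta_test = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models]
)

# Level 2: the meta-model learns how to combine the base-model predictions.
meta_model = LinearRegression().fit(meta_train, y_train)
stacked_pred = meta_model.predict(meta_test)

print(meta_train.shape, meta_test.shape)
print(mean_absolute_error(y_test, stacked_pred))
```

Each base model yields exactly one meta-feature column, so two base models give the second-level model a two-column input.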

2. Blending example

from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

Load the dataset
# load_boston and LinearRegression's normalize parameter (used below) were both
# removed in scikit-learn 1.2, so this example needs an older scikit-learn release
from sklearn.datasets import load_boston
data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)

Create the dataset
dataset = Dataset(X_train,y_train,X_test)

Create the RF and LR models
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50},name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression, parameters={'normalize': True},name='lr')

Blend the two models
# Returns a new dataset built from predictions on the held-out part (proportion=0.2) of the training data
pipeline = ModelsPipeline(model_rf,model_lr)
stack_ds = pipeline.blend(proportion=0.2,seed=111)

Train a linear regression on the blended data as the second-level model
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
results = stacker.predict()

Validate with 10-fold cross-validation
results10 = stacker.validate(k=10,scorer=mean_absolute_error)
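For comparison, here is a hand-written blending sketch in plain scikit-learn. It illustrates the hold-out technique in general and makes no claim about how heamy implements `blend`. The synthetic data from `make_regression` and the names `X_fit`, `X_hold`, `meta_hold`, and `meta_test` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Boston data used above).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)

# Hold out 20% of the training rows; the base models never see them while fitting.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, random_state=111
)

base_models = [
    RandomForestRegressor(n_estimators=50, random_state=111),
    LinearRegression(),
]
for model in base_models:
    model.fit(X_fit, y_fit)

# Meta-features: base-model predictions on the hold-out rows and on the test rows.
meta_hold = np.column_stack([m.predict(X_hold) for m in base_models])
meta_test = np.column_stack([m.predict(X_test) for m in base_models])

# The second-level model is trained only on the hold-out predictions.
meta_model = LinearRegression().fit(meta_hold, y_hold)
blended_pred = meta_model.predict(meta_test)

print(meta_hold.shape, meta_test.shape)
print(mean_absolute_error(y_test, blended_pred))
```

Compared with stacking, the second-level model here is fitted on far fewer rows, but none of those rows were used to fit the base models.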

3. Weighted average

from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
# load_boston and LinearRegression's normalize parameter (used below) were both
# removed in scikit-learn 1.2, so this example needs an older scikit-learn release
from sklearn.datasets import load_boston

data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)

Create the dataset
dataset = Dataset(X_train,y_train,X_test)

model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 151},name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression, parameters={'normalize': True},name='lr')
model_knn = Regressor(dataset=dataset, estimator=KNeighborsRegressor, parameters={'n_neighbors': 15},name='knn')

pipeline = ModelsPipeline(model_rf,model_lr,model_knn)

# Search for the weights that minimise MAE, then apply them to the base-model predictions
weights = pipeline.find_weights(mean_absolute_error)
result = pipeline.weight(weights)
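One way such weights could be searched for is a constrained minimisation of the chosen metric. The sketch below uses `scipy.optimize.minimize` on made-up predictions and should be read as an illustration of the idea under my own assumptions, not as heamy's implementation. The arrays `y_true` and `preds` and the helper `blended_mae` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(111)
y_true = rng.normal(size=200)
# Three hypothetical base models: the true target plus noise of increasing size.
preds = np.stack([y_true + rng.normal(scale=s, size=200) for s in (0.2, 0.5, 1.0)])

def blended_mae(weights):
    """MAE of the weighted combination of the base-model predictions."""
    return mean_absolute_error(y_true, weights @ preds)

# Start from equal weights; keep each weight in [0, 1] and the sum equal to 1.
start = np.full(len(preds), 1.0 / len(preds))
result = minimize(
    blended_mae,
    start,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * len(preds),
    constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
)
weights = result.x

print(weights)
print(blended_mae(start), blended_mae(weights))
```

In practice the weights should be fitted on predictions for rows the base models did not train on; otherwise the search rewards whichever model overfits most.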

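4. Simple average or custom aggregation

Simple average
# `pipeline` is the ModelsPipeline built in example 3
# get predictions for test
result = pipeline.mean().execute()
# or validate
_ = pipeline.mean().validate(mean_absolute_error,10)

Custom aggregation
import numpy as np
# take the element-wise maximum over the base models' predictions
result = pipeline.apply(lambda x: np.max(x,axis=0)).execute()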
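What these aggregations compute is easy to see on a small array. The sketch assumes the function receives the base models' predictions stacked along axis 0, as the `axis=0` in the call above suggests. The `preds` values are hypothetical.

```python
import numpy as np

# Hypothetical test-set predictions: one row per base model, one column per sample.
preds = np.array([
    [1.0, 4.0, 2.0],
    [2.0, 3.0, 2.5],
    [3.0, 5.0, 1.5],
])

mean_pred = np.mean(preds, axis=0)  # per-sample average over models
max_pred = np.max(preds, axis=0)    # per-sample maximum over models

print(mean_pred)  # [2. 4. 2.]
print(max_pred)   # [3.  5.  2.5]
```

Any function that reduces axis 0 in this way, such as a median or a trimmed mean, could serve as a custom aggregation.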

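III. Implementation Details of the heamy Library

For the implementation details, see the blog post on heamy's core modules listed in the references; overall it is clear and intuitive.

Methods in estimator.py (note that they all return datasets)
Methods in pipeline.py (note that they all return datasets)
The methods in these two files produce the inputs to the second-level model. As for how the base models' predictions are then combined: blending and stacking typically use LR as the second-level model, while the other approaches use a weighted average, a simple mean, or the maximum.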
