Model Ensemble Methods: Stacked Generalization

Original

范涛

2020-07-04 22:34

Stacked generalization has two stages:

1. Level-0 generalizers

The Level-0 generalizers stage produces the input data for the Level-1 generalizers stage.

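Suppose we have K simple classification models. Deciding how to ensemble their results amounts to deciding how much weight each model gets. One approach is to split the training set by some ratio, train the K models on one part, and use them to predict the remaining part. Taking the K predictions as input variables and the true values of that remaining part as the target, we fit a linear regression model, whose coefficients are the weights of the K models. There is a problem, though: linear regression works best when its features are not linearly correlated, yet the predictions from the previous step are bound to be correlated, since every model is estimating the same target.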
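
The hold-out procedure described above could be sketched roughly as follows. This is only an illustration on synthetic data: the three "models" are simulated as noisy copies of the true label (an assumption made to keep the example short), and `np.linalg.lstsq` stands in for the linear regression step. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hold-out part of the training set: 200 binary labels.
y_holdout = rng.integers(0, 2, size=200).astype(float)

# Simulated predictions of K = 3 base models on the hold-out part.
# Each is the true label plus noise; less noise means a better model.
noise_levels = [0.2, 0.4, 0.8]
P = np.column_stack([
    y_holdout + rng.normal(0.0, s, size=y_holdout.shape)
    for s in noise_levels
])  # shape (200, K): one column per base model

# Least-squares fit of y ~ P @ w gives one weight per base model.
w, *_ = np.linalg.lstsq(P, y_holdout, rcond=None)
print("weights:", w)

# All columns of P approximate the same target, so they are correlated
# with one another, which is the problem noted in the text.
corr = np.corrcoef(P, rowvar=False)
print("correlation between model predictions:")
print(corr)
```

In this setup the least noisy model should receive the largest weight, while the off-diagonal correlations stay well above zero.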

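So we make an improvement using cross-validation: the predictions for the training samples in each fold come from a model trained on the remaining folds. This yields a prediction for every sample in the training set, preferably expressed as class probabilities. With K models, each sample can then be viewed as a K-dimensional vector, and these vectors form the training input for the Level-1 generalizers. For the original test samples, the K classifiers are trained on all of the original training data and used to predict, giving K-dimensional vectors that serve as the test input for the Level-1 generalizers.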
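
A minimal sketch of this cross-validated construction, assuming a recent scikit-learn and a synthetic binary problem. The three base classifiers and the 5-fold split are arbitrary choices for illustration, not the ones used in the reference code of this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# K = 3 arbitrary level-0 classifiers (an assumption for this sketch).
base_models = [
    LogisticRegression(max_iter=1000),
    GaussianNB(),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

cv = StratifiedKFold(n_splits=5)

level1_train_cols = []
level1_test_cols = []
for model in base_models:
    # Out-of-fold probability of class 1 for every training sample:
    # each sample is predicted by a model that never saw it during fitting.
    oof = cross_val_predict(model, X_train, y_train, cv=cv,
                            method="predict_proba")[:, 1]
    level1_train_cols.append(oof)

    # For the test samples, refit on the full training set and predict.
    model.fit(X_train, y_train)
    level1_test_cols.append(model.predict_proba(X_test)[:, 1])

# One K-dimensional row per sample.
level1_train = np.column_stack(level1_train_cols)  # shape (300, 3)
level1_test = np.column_stack(level1_test_cols)    # shape (100, 3)
print(level1_train.shape, level1_test.shape)
```

`level1_train` and `level1_test` play the role of the Level-1 training and test inputs described in the text.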

2. Level-1 generalizers

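A regression model is trained on the Level-1 training input. Some papers suggest using multi-response linear regression (MLR) here, with the coefficients constrained to be non-negative. The model trained in this stage is then used to predict the Level-1 test input, which gives the final result.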
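
One way to obtain non-negative combination weights is non-negative least squares. The sketch below uses `scipy.optimize.nnls` on simulated Level-1 inputs; the simulated probabilities, the helper `simulate_probs`, and the 0.5 threshold are assumptions for illustration, and this is not the full MLR procedure from the literature.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Hypothetical labels for the level-1 training and test sets.
y_train = rng.integers(0, 2, size=300).astype(float)
y_test = rng.integers(0, 2, size=100).astype(float)


def simulate_probs(y, noise_levels=(0.2, 0.4, 0.8)):
    """Simulate class-1 probabilities from K = 3 base models as noisy,
    clipped versions of the labels (illustration only)."""
    cols = [np.clip(y + rng.normal(0.0, s, size=y.shape), 0.0, 1.0)
            for s in noise_levels]
    return np.column_stack(cols)


level1_train = simulate_probs(y_train)
level1_test = simulate_probs(y_test)

# Non-negative least squares:
# minimize ||level1_train @ w - y_train|| subject to w >= 0.
weights, residual_norm = nnls(level1_train, y_train)
print("non-negative weights:", weights)

# Final prediction: weighted combination of the base-model probabilities,
# thresholded at 0.5 to obtain a class label.
scores = level1_test @ weights
final_pred = (scores >= 0.5).astype(int)
accuracy = np.mean(final_pred == y_test)
print("test accuracy:", accuracy)
```

Because the weights cannot go negative, each one can be read directly as the contribution of its base model to the final score.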

For reference, here is how a few simple classification algorithms express their predicted probabilities:

Reference code follows (parts of it are adapted from Kaggle contributors):

'''
Created on 2014/02/28

@author: dylanfan

'''

import numpy as np

from sklearn import linear_model, model_selection

# Local modules from the author's project (not included in this post).
from feature_generator import get_dataset
import com_stat


class Stacked_Generalization(object):
    """
    Implement stacking to combine several models.
    The base (stage 0) models can be either combined through
    simple averaging (fastest), or combined using a stage 1 generalizer
    (requires computing CV predictions on the train set).

    See http://ijcai.org/Past%20Proceedings/IJCAI-97-VOL2/PDF/011.pdf:
    "Stacked generalization: when does it work?", Ting and Witten, 1997

    For speed and convenience, both fitting and prediction are done
    in the same method fit_predict; this is done in order to enable
    one to compute metrics on the predictions after training each model without
    having to wait for all the models to be trained.

    Options:
    ------------------------------
    - models: a list of (model, dataset) tuples that represent stage 0 models
    - generalizer: an Estimator object. Must implement fit and predict.
      Defaults to logistic regression without an intercept.
    - stack: if True, combine with the stage 1 generalizer; otherwise
      return the simple average of the stage 0 predictions
    """

    def __init__(self, models, generalizer=None, stack=False):
        self.models = models
        self.stack = stack
        # Alternative generalizer tried in the original code:
        #   linear_model.RidgeCV(alphas=np.linspace(0, 200), cv=100)
        if generalizer is None:
            generalizer = linear_model.LogisticRegression(fit_intercept=False)
        self.generalizer = generalizer

    def _fit(self, model, X, y):
        """Fit with the sample weights from com_stat when the model accepts them."""
        sample_weight = com_stat.stat_sample_weight(y)
        try:
            model.fit(X, y, sample_weight=sample_weight)
        except (TypeError, ValueError):
            print('sample weight parameter is not legal...')
            model.fit(X, y)

    def _combine_preds(self, X_train, X_cv, y, stack=False):
        """
        X_train: stage 0 cross-validated predictions on the train set.
        X_cv: stage 0 predictions on the prediction set.
        """
        mean_preds = np.mean(X_cv, axis=1)
        stack_preds = None
        if stack:
            self.generalizer.fit(X_train, y)
            stack_preds = self.generalizer.predict(X_cv)
        return mean_preds, stack_preds

    def _get_model_preds(self, model, X_train, X_predict, y_train,
                         is_cross_preds=True):
        """
        Return the model predictions on the prediction set.

        If is_cross_preds is true, fit the model once per fold and return
        the median of the per-fold predictions; otherwise fit once on the
        whole train set.
        """
        if not is_cross_preds:
            self._fit(model, X_train, y_train)
            return model.predict(X_predict)

        kfold = model_selection.StratifiedKFold(n_splits=10)
        fold_preds = []
        for stage0, _ in kfold.split(X_train, y_train):
            self._fit(model, X_train[stage0], y_train[stage0])
            fold_preds.append(model.predict(X_predict))
        # Median over folds, one value per sample of the prediction set.
        return np.median(np.array(fold_preds), axis=0)

    def _get_model_cv_preds(self, model, X_train, y_train):
        """
        Return out-of-fold predictions for every sample of the train set,
        in the original sample order.
        """
        kfold = model_selection.StratifiedKFold(n_splits=4)
        stack_preds = []
        indexes_cv = []
        print('cv stage preds...')
        for stage0, stack in kfold.split(X_train, y_train):
            self._fit(model, X_train[stage0], y_train[stage0])
            stack_preds.extend(model.predict(X_train[stack]))
            indexes_cv.extend(stack)
        print('end stage..')
        # Undo the fold ordering so that row i matches sample i.
        return np.array(stack_preds)[np.argsort(indexes_cv)]

    def fit_predict(self, y, train=None, predict=None):
        y_train = y[train] if train is not None else y
        if train is not None and predict is None:
            train_set = set(train)
            predict = [i for i in range(len(y)) if i not in train_set]

        stage0_train = []
        stage0_predict = []
        for model, feature_set in self.models:
            X_train, X_predict = get_dataset(feature_set, train, predict)
            stage0_predict.append(
                self._get_model_preds(model, X_train, X_predict, y_train))

            # if stacking, compute cross-validated predictions on the train set
            if self.stack:
                stage0_train.append(
                    self._get_model_cv_preds(model, X_train, y_train))

        mean_preds, stack_preds = self._combine_preds(
            np.array(stage0_train).T, np.array(stage0_predict).T,
            y_train, stack=self.stack)

        return stack_preds if self.stack else mean_preds
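
For comparison, recent scikit-learn releases ship a built-in `StackingClassifier` that follows the same two-level scheme. The sketch below runs on synthetic data, with arbitrarily chosen base models and logistic regression as the level-1 model. It is not the author's code and does not use the project-local `get_dataset` or `com_stat` helpers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

stack = StackingClassifier(
    # Level-0 models (arbitrary choices for this sketch).
    estimators=[
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    # Level-1 model, trained on cross-validated level-0 probabilities.
    final_estimator=LogisticRegression(),
    cv=4,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
test_pred = stack.predict(X_test)
print("test accuracy:", stack.score(X_test, y_test))
```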

References:

1. "Stacked generalization: when does it work?", Ting and Witten, 1997
