LightGBM Usage

1. Using categorical_feature (categorical features)
One improvement of LightGBM over XGBoost is its handling of categorical features: they no longer need to be converted to one-hot form; see here for details.

When using the Python API (see the official documentation):
1.1 Features X can be stored in a pd.DataFrame, one feature per column, with the categorical columns set via X[cat_cols] = X[cat_cols].astype('category'); the model then recognizes them automatically in fit.
1.2 Alternatively, pass the categorical_feature parameter to the model's fit method to name the categorical columns (a sketch of both approaches follows the documentation excerpt below).
1.3 Categorical feature values should be non-negative integers, ideally consecutive integers starting from 0 (0, 1, 2, ...); negative values are treated as missing.

Below is the official documentation's description of the categorical_feature parameter of fit:
categorical_feature (list of strings or int, or 'auto', optional (default='auto')) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.
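
A minimal sketch of both approaches (the toy data and names below are mine, not from the post):

import pandas as pd
from lightgbm import LGBMRegressor

# Toy data; X, y, and cat_cols are placeholders.
X = pd.DataFrame({
    'num_feat': [0.1, 0.5, 0.3, 0.9, 0.2, 0.7],
    'cat_feat': [0, 1, 2, 1, 0, 2],  # consecutive integer codes starting from 0
})
y = [1.0, 2.0, 3.0, 2.5, 1.5, 3.5]
cat_cols = ['cat_feat']

# 1.1: mark the columns as pandas 'category' dtype; with the default
# categorical_feature='auto', fit() detects them automatically.
X[cat_cols] = X[cat_cols].astype('category')
LGBMRegressor(n_estimators=5).fit(X, y)

# 1.2: or name the categorical columns explicitly in fit(); with a
# DataFrame, the column names serve as feature names.
LGBMRegressor(n_estimators=5).fit(X, y, categorical_feature=cat_cols)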

I previously ran into a problem when using sklearn's GridSearchCV: the cat_cols columns in x_train and x_test had already been set to the category dtype, but after concat the dtype reverted to int:

x_search = pd.concat([x_train, x_test], axis=0)
y_search = np.concatenate([train_y, test_y], axis=0)
gsearch.fit(x_search, y_search)  # search for the best parameters; gsearch is a sklearn GridSearchCV instance

So the dtype needs to be set again after the concat:

x_search = pd.concat([x_train, x_test], axis=0)
x_search[cat_cols] = x_search[cat_cols].astype('category')
y_search = np.concatenate([train_y, test_y], axis=0)
gsearch.fit(x_search, y_search)  # search for the best parameters
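
To confirm the dtype stuck after the concat, a quick check (my addition, not in the original):

assert (x_search[cat_cols].dtypes == 'category').all()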

2. Using init_score
init_score is the estimator's initial score; in regression tasks it can help the model converge faster.
To use it, simply set the init_score parameter in the fit method; at predict time, this init_score must be added back to the model's output.
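
The baselines y_train_base_avg1 and y_val_base_avg1 used below are not defined in this excerpt; a minimal sketch of one plausible choice (a constant baseline equal to the training-target mean, which is my assumption):

import numpy as np
# Assumption: constant per-sample baseline = mean of the training targets.
y_train_base_avg1 = np.full(len(train_y), np.mean(train_y))
y_val_base_avg1 = np.full(len(val_y), np.mean(train_y))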


model.fit(
#       pd.concat([x_train, x_val], axis=0),   # alternative: train on train+val combined
#       np.concatenate([train_y, val_y], axis=0),
        x_train,
        train_y,
        init_score=y_train_base_avg1,

        eval_metric=['mape'],
        eval_set=[(x_val, val_y)],
        early_stopping_rounds=20,
        eval_init_score=[y_val_base_avg1],

        verbose=True
        )
...
y_train_pre = model.predict(x_train) + y_train_base_avg1  # add the init_score back

The above covers regression; what about classification? And how should init_score be used together with GridSearchCV, given that it must be added back manually at predict time?
See here and here (init_score is the raw score, before any transformation).
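
For binary classification this suggests the baseline must be given in raw (log-odds) space, and at predict time it is added to the raw score before applying the sigmoid manually. A minimal sketch under that reading (clf and the baseline construction are my assumptions, not from the post):

import numpy as np
from lightgbm import LGBMClassifier

# Assumption: baseline in raw space = log-odds of the positive-class rate.
p_prior = np.mean(train_y)
init_raw = np.log(p_prior / (1.0 - p_prior))

clf = LGBMClassifier()
clf.fit(x_train, train_y, init_score=np.full(len(train_y), init_raw))

# The fitted model's output does not include the init_score, so add it
# back to the raw score and apply the sigmoid by hand.
raw = clf.predict(x_val, raw_score=True)
proba = 1.0 / (1.0 + np.exp(-(raw + init_raw)))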




▲▲▲ Use ParameterGrid instead of GridSearchCV; more custom operations can be inserted in between:
1. early_stopping can be used, which GridSearchCV cannot do
2. LightGBM's init_score is supported

...
from sklearn.model_selection import ParameterGrid
from lightgbm import LGBMRegressor
...
parameters = {
        'objective': ['regression', 'regression_l1'],
        'max_depth': [2,3,4,5],
        'num_leaves': [20,25,30,35],
        'n_estimators': [20,25,30,35,40,50],
        'min_child_samples': [15,20,25,30],
#        'subsample_freq': [0,2,5],
#        'subsample': [0.7,0.8,0.9,1],
#        'colsample_bytree': [0.8,0.9,1]
        }
...
default_params = model.get_params()  # get the parameters predefined earlier
best_score = np.inf
best_params = None
best_idx = 0
param_grid = list(ParameterGrid(parameters))  # list of all parameter combinations
for idx,param_ in enumerate(param_grid):
    param = default_params.copy()
    param.update(param_)
    model = LGBMRegressor(**param)
    
    model.fit(
            x_train, 
            train_y,
            init_score=y_train_base_avg1,
            
            eval_metric=['mape'], 
            eval_set=[(x_val, val_y)],
            early_stopping_rounds=20,
            eval_init_score=[y_val_base_avg1],
            
            verbose=False
            )
    score_ = model.best_score_['valid_0']['mape']  # current model's best score on the val set
    print('for %d/%d, score: %.6f, best so far at %d/%d' % (idx+1, len(param_grid), score_, best_idx, len(param_grid)))
    if score_<best_score:
        best_params = param
        best_score = score_
        best_idx = idx+1
        print('find best score: {}, \nbest params: {}'.format(best_score, best_params))
print('\nbest score: {}, \nbest params: {}\n'.format(best_score, best_params))

model = LGBMRegressor(**best_params)
model.fit(
        x_train, 
        train_y,
        init_score=y_train_base_avg1,
        
        eval_metric=['mape'], 
        eval_set=[(x_val, val_y)],
        early_stopping_rounds=20,
        eval_init_score=[y_val_base_avg1],
        
        verbose=True
        )
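
As in section 2, predictions from the refit model need the baseline added back:

y_val_pre = model.predict(x_val) + y_val_base_avg1  # add the init_score back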




 
