Quick Start with XGB: First Competition (Jinnan Digital Manufacturing Algorithm Challenge)

1. Background: process workflow and field meanings

2. Feature engineering approach

3. Baseline

4. Model stacking, using BayesianRidge regression

5. Getting started with competitions and following along (part 1)
Getting started with competitions and following along (part 2)
Advanced data mining: top Kaggle competition code sharing

6. Experience from Kaggle, JData, and Tianchi competitions: this one post is enough

Initial improvement points:

1. Remove yield-rate outliers

2. The timestamps in the dataset also contain outliers

A11 - A12 = 1:00 (occasionally anomalous)

3. Convert the timestamps into time intervals (durations)

4. Bin the yield rate, then for each feature build the mean yield corresponding to each category value (see the sketch after this list)
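
A minimal sketch of these improvement points, assuming train/test DataFrames are already loaded, the timestamp columns hold 'HH:MM' strings, and the yield target sits in a column named 'target'; the outlier threshold and the example feature columns ('A5', 'B14') are illustrative assumptions, not the competition's exact recipe.

import numpy as np
import pandas as pd

def time_to_minutes(t):
    # Parse an 'HH:MM' string into minutes since midnight; keep NaN as NaN.
    if pd.isnull(t):
        return np.nan
    h, m = str(t).split(':')[:2]
    return int(h) * 60 + int(m)

# Point 1: drop rows whose yield is implausible (threshold is illustrative).
train = train[(train['target'] > 0.85) & (train['target'] <= 1.0)]

# Points 2-3: turn a timestamp pair such as A11/A12 into a duration in minutes,
# wrapping past midnight so a span like 23:30 -> 00:30 does not go negative.
train['A11_A12_span'] = (
    train['A12'].apply(time_to_minutes) - train['A11'].apply(time_to_minutes)
) % (24 * 60)

# Point 4: bin the yield, then for each categorical feature record how often
# each yield bin occurs per category value (the mean of the bin indicators).
train['yield_bin'] = pd.cut(train['target'], bins=5, labels=False)
for col in ['A5', 'B14']:  # illustrative feature columns
    bin_rates = pd.crosstab(train[col], train['yield_bin'], normalize='index')
    bin_rates.columns = ['{}_bin{}_rate'.format(col, b) for b in bin_rates.columns]
    train = train.join(bin_rates, on=col)
    test = test.join(bin_rates, on=col)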

To be optimized:

1. Combining results from multiple models

2. Mean features

3. Further optimization of the time features

General template: template code for mainstream machine-learning models + experience notes [xgb, lgb, Keras, LR]

----- lgb default parameters -----

param = {'num_leaves': 30,
         'min_data_in_leaf': 30, 
         'objective':'regression',
         'max_depth': -1,
         'learning_rate': 0.01,
         "min_child_samples": 30,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'mse',
         "lambda_l1": 0.1,
         "verbosity": -1}


# 6-fold cross-validation
#[2095]	training's l2: 0.000183012	valid_1's l2: 0.000238902
#CV score: 0.00031

#	training's l2: 0.000179558	valid_1's l2: 0.000251254
#CV score: 0.00031 

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

folds = KFold(n_splits=10, shuffle=True, random_state=2018)

oof_lgb = np.zeros(len(train))          # out-of-fold predictions on the training set
predictions_lgb = np.zeros(len(test))   # fold-averaged predictions on the test set

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_+1))
    trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])

    num_round = 10000
    clf = lgb.train(param, 
                    trn_data, 
                    num_round, 
                    valid_sets = [trn_data, val_data], 
                    verbose_eval = 200, 
                    early_stopping_rounds = 200)
    oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)
    
    predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)))





----- xgb parameters -----

import xgboost as xgb

xgb_params = {'eta': 0.005, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}

folds = KFold(n_splits=15, shuffle=True, random_state=2018)
oof_xgb = np.zeros(len(train))
predictions_xgb = np.zeros(len(test))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_+1))
    trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])

    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    clf = xgb.train(dtrain=trn_data, num_boost_round=20000, evals=watchlist, early_stopping_rounds=200, verbose_eval=100, params=xgb_params)
    oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
    predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits
    
print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, target)))

Stack the lgb and xgb results:

from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import BayesianRidge

train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()
test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)
oof_stack = np.zeros(train_stack.shape[0])
predictions = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack,target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values
    
    clf_3 = BayesianRidge()
    clf_3.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf_3.predict(val_data)
    predictions += clf_3.predict(test_stack) / 10  # 10 folds total: 5 splits * 2 repeats
    
mean_squared_error(target.values, oof_stack)

data['B11'] = 1.0
Submission result: offline 0.00011862717410224899, online

data['B11'] filled by padding with B10 plus one hour
Submission result: offline 0.00011745711197741756
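
A minimal sketch of the B10-plus-one-hour fill above, assuming data is the working DataFrame and B10/B11 hold 'HH:MM' strings; the helper name and the exact parsing are assumptions.

import pandas as pd

def add_one_hour(t):
    # Shift an 'HH:MM' string forward by one hour, wrapping past midnight.
    if pd.isnull(t):
        return t
    h, m = str(t).split(':')[:2]
    return '{:02d}:{}'.format((int(h) + 1) % 24, m)

# Fill missing B11 values with B10 shifted forward by one hour
mask = data['B11'].isnull()
data.loc[mask, 'B11'] = data.loc[mask, 'B10'].apply(add_one_hour)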

xgb parameter tuning

Parameter tuning helps model accuracy to a degree, but the gains are limited. The biggest improvements still come from data cleaning, feature selection, feature combination, model ensembling, and similar steps.

Parameter explanations
Tuning guide: grid search (a sketch follows below)

Tuning guide 2

https://blog.csdn.net/sb19931201/article/details/52577592
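
To go with the grid-search tuning guide linked above, here is a minimal, hypothetical sketch using scikit-learn's GridSearchCV around an XGBoost regressor; the parameter grid, scoring metric, and fold count are illustrative assumptions rather than the settings used in this competition.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Tune a couple of parameters at a time while holding the rest fixed,
# then move on to the next group once the best values are found.
param_grid = {
    'max_depth': [6, 8, 10],
    'subsample': [0.7, 0.8, 0.9],
}

base_model = xgb.XGBRegressor(learning_rate=0.01, n_estimators=1000,
                              colsample_bytree=0.8, n_jobs=4)

grid = GridSearchCV(base_model, param_grid,
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)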
