5. Getting started with competitions and following along (1)
Getting started with competitions and following along (2)
Advanced data mining: top code from Kaggle competitions
6. Experience from Kaggle, JData, and Tianchi competitions: this post is all you need.
Initial improvement points:
1. Remove yield outliers
2. The dataset's timestamps also contain outliers
A11 - A12 = 1:00 (occasionally anomalous)
3. Convert the timestamps into time intervals
4. Bin the yield, then construct, for each feature, the mean yield of each category (a rough sketch follows this list)
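A rough pandas sketch of points 3 and 4 follows; the column names ('A11', 'A12', a 'yield' target) and the cat_cols list are illustrative assumptions, not the competition's exact schema:
import pandas as pd

# Sketch of point 3: turn two 'HH:MM' timestamps into an interval length.
def to_minutes(s):
    """Parse an 'HH:MM' string into minutes since midnight; NaN on failure."""
    try:
        h, m = map(int, str(s).split(':'))
        return h * 60 + m
    except ValueError:
        return float('nan')

# modulo 24h handles intervals that cross midnight
train['A11_A12_span'] = (train['A12'].map(to_minutes)
                         - train['A11'].map(to_minutes)) % (24 * 60)

# Sketch of point 4: bin the yield, then map each category of a feature
# to the mean binned yield of that category.
train['yield_bin'] = pd.cut(train['yield'], bins=10, labels=False)
cat_cols = ['A11_A12_span']  # in practice, loop over every categorical feature
for col in cat_cols:
    mean_map = train.groupby(col)['yield_bin'].mean()
    train[col + '_yield_bin_mean'] = train[col].map(mean_map)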
To optimize:
1. Combining results from multiple models
2. Mean features
3. Further optimizing the time features
General template: boilerplate code and experience for mainstream ML models [xgb, lgb, Keras, LR]
----- lgb default parameters -----
param = {'num_leaves': 30,
         'min_data_in_leaf': 30,      # alias of min_child_samples below
         'objective': 'regression',
         'max_depth': -1,
         'learning_rate': 0.01,
         'min_child_samples': 30,
         'boosting': 'gbdt',
         'feature_fraction': 0.9,     # column subsampling
         'bagging_freq': 1,
         'bagging_fraction': 0.9,     # row subsampling
         'bagging_seed': 11,
         'metric': 'mse',
         'lambda_l1': 0.1,
         'verbosity': -1}
# 10-fold cross-validation
#[2095] training's l2: 0.000183012 valid_1's l2: 0.000238902
#CV score: 0.00031
# training's l2: 0.000179558 valid_1's l2: 0.000251254
#CV score: 0.00031
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

folds = KFold(n_splits=10, shuffle=True, random_state=2018)
oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
    val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    # out-of-fold predictions (used later for stacking)
    oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)
    # average the test predictions across folds
    predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof_lgb, target)))
----- xgb -----
import xgboost as xgb

# 'reg:linear' is the squared-error objective; newer XGBoost versions call it 'reg:squarederror'
xgb_params = {'eta': 0.005, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}

folds = KFold(n_splits=15, shuffle=True, random_state=2018)
oof_xgb = np.zeros(len(train))
predictions_xgb = np.zeros(len(test))
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])

    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    clf = xgb.train(dtrain=trn_data, num_boost_round=20000, evals=watchlist,
                    early_stopping_rounds=200, verbose_eval=100, params=xgb_params)
    oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
    predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits

print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, target)))
Stack the lgb and xgb results:
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import BayesianRidge

# use the out-of-fold predictions as features for a second-level model
train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()
test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()

folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)
oof_stack = np.zeros(train_stack.shape[0])
predictions = np.zeros(test_stack.shape[0])
for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, target)):
    print("fold {}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    clf_3 = BayesianRidge()
    clf_3.fit(trn_data, trn_y)

    oof_stack[val_idx] = clf_3.predict(val_data)
    predictions += clf_3.predict(test_stack) / 10  # 5 splits x 2 repeats = 10 fits

print(mean_squared_error(target.values, oof_stack))
data['B11'] = 1.0
Submission result: offline 0.00011862717410224899, online:
data['B11'] filled instead by padding from B10 plus one hour (a sketch follows below)
Submission result: offline 0.00011745711197741756
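A minimal sketch of that fill, assuming B10/B11 store 'HH:MM' strings (the helper is illustrative, not the original code):
import pandas as pd

# Hypothetical fill: where B11 is missing, take B10 shifted forward one hour.
def plus_one_hour(s):
    t = pd.to_datetime(str(s), format='%H:%M') + pd.Timedelta(hours=1)
    return t.strftime('%H:%M')

mask = data['B11'].isnull()
data.loc[mask, 'B11'] = data.loc[mask, 'B10'].map(plus_one_hour)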
The xgb tuning process
Tuning does help model accuracy to a degree, but the gains are limited. The biggest improvements still come from data cleaning, feature selection, feature fusion, and model fusion!
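As a hedged illustration of what such a tuning pass can look like, here is a generic scikit-learn grid search over a few of the XGBoost parameters above; the grid values are assumptions, not the ranges actually searched:
from sklearn.model_selection import GridSearchCV, KFold
from xgboost import XGBRegressor

# Illustrative grid around the parameters used above (values are assumptions).
param_grid = {
    'max_depth': [6, 8, 10],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}
search = GridSearchCV(
    XGBRegressor(learning_rate=0.005, n_estimators=2000, n_jobs=4),
    param_grid,
    scoring='neg_mean_squared_error',
    cv=KFold(n_splits=5, shuffle=True, random_state=2018),
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)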