XGBoost is the heavy-duty tool built by Tianqi Chen. I spent ages trying and failing to install it on my Mac, working through all kinds of tutorials, until I found a one-line install via another heavy hitter, Anaconda. Lovely.
Once installed, it works out of the box. XGBoost is an upgraded GBDT: stronger performance, and it can train in parallel. For a couple of years it basically dominated Kaggle, crushing the other algorithms.
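The one-line install mentioned above is presumably something like the following (the conda-forge channel and package name are my assumption; the exact channel may differ):

```shell
# With Anaconda already installed, conda pulls in xgboost together with
# its compiled OpenMP dependency in one step:
conda install -y -c conda-forge xgboost
```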
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around the duplicate-OpenMP-runtime crash on macOS

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from xgboost import plot_importance
column_names = ['uin', 'gender', 'age', 'play_cnt', 'share_cnt', 'influence_pv', 'ds1', 'ds2', 'ds3', 'label']
data = pd.read_csv('lr_feature.csv', usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], names=column_names)
print(data.head(10))
# split into training and test sets (features: gender..influence_pv; ds1-ds3 are not used here)
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:6]], data[column_names[9]],
                                                    test_size=0.25, random_state=3)
model = XGBClassifier(learning_rate=0.01,
                      n_estimators=10,     # number of trees: build xgboost from 10 trees
                      max_depth=3,         # depth of each tree
                      min_child_weight=1,  # minimum sum of instance weight in a leaf
                      gamma=0.,            # penalty coefficient on the number of leaf nodes
                      subsample=1,         # build each tree on all samples
                      colsample_bytree=1,  # build each tree on all features
                                           # (the original "colsample_btree" is a typo
                                           # that xgboost would silently ignore)
                      scale_pos_weight=1,  # handles class imbalance
                      random_state=27)     # random seed
                      # the original also passed the misspelled "slient" (meant: silent=0),
                      # a flag that has been removed in recent xgboost versions
model.fit(X_train, y_train)
# predict on the test set
y_pred = model.predict(X_test)
print("Accuracy: %.4g" % metrics.accuracy_score(y_test, y_pred))
print("F1_score: %.4g" % metrics.f1_score(y_test, y_pred))
print("Recall: %.4g" % metrics.recall_score(y_test, y_pred))
y_train_proba = model.predict_proba(X_train)[:, 1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, y_train_proba))
y_proba = model.predict_proba(X_test)[:, 1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, y_proba))
Output:
uin gender age play_cnt share_cnt influence_pv ds1 ds2 ds3 label
0 1889812 2 67 2 1 0 0 2 2 0.0
1 1966339 2 69 747 92 194 15 15 30 1.0
2 1982539 2 66 1165 104 40 12 12 24 1.0
3 2131170 3 78 53 146 117 9 3 12 1.0
4 4471700 3 81 2 0 0 1 3 4 0.0
5 4921331 3 79 1634 176 178 15 15 30 1.0
6 5441180 3 68 0 4 0 0 4 4 0.0
7 6144422 2 79 109 23 25 10 14 24 1.0
8 6807020 3 72 418 54 90 11 11 22 1.0
9 7015648 3 76 144 9 15 11 7 18 1.0
Accuracy: 0.9668
F1_score: 0.97
Recall: 0.9693
AUC Score (Train): 0.989206
AUC Score (Test): 0.988982
Since we use only a handful of features, the tree depth is set to just 3 and the tree count to 10; the other parameters are essentially the defaults. When there are many features, tuning matters much more, and a well-chosen set of parameters can easily beat time and effort spent on feature engineering. See references 1 and 2 for tuning details; there are also plenty of blog posts on the topic.
References:
- https://zhuanlan.zhihu.com/p/52501965
- https://zhuanlan.zhihu.com/p/68864414
- https://blog.csdn.net/sinat_20177327/article/details/81090324
- https://blog.csdn.net/han_xiaoyang/article/details/52665396