If you have watched others use XGBoost, you may have noticed that its Python package exposes two interfaces: the native API and a wrapper designed to be compatible with sklearn. This post briefly summarizes the differences between the two.
For reference, see the XGBoost Chinese documentation and the XGBoost Python documentation.
1. Testing both versions on the same dataset
1.1 Preparing the dataset
Here we use sklearn's make_hastie_10_2 generator to create a binary-classification dataset.
from sklearn.model_selection import train_test_split
from pandas import DataFrame
from sklearn import metrics
from sklearn.datasets import make_hastie_10_2
from xgboost.sklearn import XGBClassifier
import xgboost as xgb
import pandas as pd
# Prepare the data. y is originally in {-1, 1}; the native XGBoost interface requires labels in {0, 1}, so map -1 to 0.
X, y = make_hastie_10_2(random_state=0)
X = DataFrame(X)
y = DataFrame(y)
y.columns = ["label"]
label={-1:0,1:1}
y.label=y.label.map(label)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  # split the dataset
1.2 Training on the dataset with identical parameters in both versions
# native XGBoost interface
params={
'eta': 0.3,
'max_depth':3,
'min_child_weight':1,
'gamma':0.3,
'subsample':0.8,
'colsample_bytree':0.8,
'booster':'gbtree',
'objective': 'binary:logistic',
'nthread':12,
'scale_pos_weight': 1,
'lambda':1,
'seed':27,
'silent':0 ,
'eval_metric': 'auc'
}
d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)
d_test = xgb.DMatrix(X_test)
watchlist = [(d_train, 'train'), (d_valid, 'valid')]
# sklearn interface
clf = XGBClassifier(
n_estimators=30,  # 30 trees
learning_rate =0.3,
max_depth=3,
min_child_weight=1,
gamma=0.3,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread=12,
scale_pos_weight=1,
reg_lambda=1,
seed=27)
watchlist2 = [(X_train,y_train),(X_test,y_test)]
print("Training with the native XGBoost API:")
model_bst = xgb.train(params, d_train, 30, watchlist, early_stopping_rounds=500, verbose_eval=10)
print("Training with the XGBoost sklearn API:")
model_sklearn = clf.fit(X_train, y_train, eval_set=watchlist2, eval_metric='auc', verbose=10, early_stopping_rounds=500)
y_bst = model_bst.predict(d_test)
y_sklearn = clf.predict_proba(X_test)[:,1]
Training with the native XGBoost API:
[0] train-auc:0.608992 valid-auc:0.579947
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.
Will train until valid-auc hasn't improved in 500 rounds.
[10] train-auc:0.940251 valid-auc:0.920879
[20] train-auc:0.973669 valid-auc:0.959898
[29] train-auc:0.983232 valid-auc:0.970292
Training with the XGBoost sklearn API:
[0] validation_0-auc:0.608992 validation_1-auc:0.579947
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.
Will train until validation_1-auc hasn't improved in 500 rounds.
[10] validation_0-auc:0.940251 validation_1-auc:0.920879
[20] validation_0-auc:0.973669 validation_1-auc:0.959898
[29] validation_0-auc:0.983232 validation_1-auc:0.970292
1.3 Printing the evaluation results
print("XGBoost native API AUC Score : %f" % metrics.roc_auc_score(y_test, y_bst))
print("XGBoost sklearn API AUC Score : %f" % metrics.roc_auc_score(y_test, y_sklearn))
# Convert the predicted probabilities into 0/1 labels
y_bst = pd.DataFrame(y_bst).apply(lambda row: 1 if row[0] >= 0.5 else 0, axis=1)
y_sklearn = pd.DataFrame(y_sklearn).apply(lambda row: 1 if row[0] >= 0.5 else 0, axis=1)
print("XGBoost native API Accuracy : %f" % metrics.accuracy_score(y_test, y_bst))
print("XGBoost sklearn API Accuracy : %f" % metrics.accuracy_score(y_test, y_sklearn))
XGBoost native API AUC Score : 0.970292
XGBoost sklearn API AUC Score : 0.970292
XGBoost native API Accuracy : 0.897917
XGBoost sklearn API Accuracy : 0.897917
2. Differences between the two versions
Difference | Native API | sklearn wrapper |
---|---|---|
Must convert data to DMatrix first | Yes | No |
Learning rate (step size) | eta | learning_rate |
Number of boosting rounds (number of trees) | num_boost_round argument of xgb.train() | n_estimators |
Training call | xgb.train() | clf.fit() |
Form of the watchlist | watchlist = [(d_train, 'train'), (d_valid, 'valid')] | watchlist2 = [(X_train, y_train), (X_test, y_test)] |
L2 regularization weight | lambda | reg_lambda |
L1 regularization weight | alpha | reg_alpha |
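The renames in this table can be captured in a small helper. The names SKLEARN_TO_NATIVE and to_native_params below are hypothetical, our own invention; the mappings themselves are just the ones listed above:

```python
# Hypothetical helper: translate sklearn-wrapper parameter names into
# native-API names, following the table above.
SKLEARN_TO_NATIVE = {
    "learning_rate": "eta",
    "reg_lambda": "lambda",   # L2 regularization weight
    "reg_alpha": "alpha",     # L1 regularization weight
}

def to_native_params(sk_params):
    """Return (native_params, num_boost_round).

    n_estimators has no entry in the native params dict: it becomes the
    separate num_boost_round argument of xgb.train().
    """
    sk_params = dict(sk_params)  # avoid mutating the caller's dict
    num_boost_round = sk_params.pop("n_estimators", 10)
    native = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in sk_params.items()}
    return native, num_boost_round

native, rounds = to_native_params(
    {"n_estimators": 30, "learning_rate": 0.3, "reg_lambda": 1, "max_depth": 3})
print(native, rounds)  # {'eta': 0.3, 'lambda': 1, 'max_depth': 3} 30
```

Parameters with the same name in both interfaces (max_depth, subsample, etc.) pass through unchanged.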