Introduction and related links:
This is based on a video I watched on Bilibili (the B站 xgboost course). After covering the theory behind xgboost, it works through a classification task on the Kaggle Otto Group Product Classification dataset.
1. First, import the modules
from xgboost import XGBClassifier
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss  # negative log-likelihood loss
from matplotlib import pyplot
import seaborn as sns
%matplotlib inline
2. Read the data with pandas' pd.read_csv(). (I have also run into JSON data before, which is read with pd.read_json() instead.)
path='D:\\data\\kaggle\\otto-group-product-classification-challenge'
train=pd.read_csv(path+'\\train.csv')
test=pd.read_csv(path+'\\test.csv')
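A quick look at what was loaded helps catch path or parsing mistakes early. A minimal sketch I've added (the Otto training file contains an id column, 93 anonymized feature columns feat_1..feat_93, and a target column):
print(train.shape, test.shape)    # train should have 95 columns: id + 93 features + target
print(train.columns[:5].tolist()) # ['id', 'feat_1', 'feat_2', 'feat_3', 'feat_4']
train.head()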
3. Before tackling a problem, it is usually worth looking at how the features are distributed. This dataset has a fixed set of variables and uniform, anonymized features (you cannot tell their real meaning by eye), so little feature engineering is needed and most of the effort goes into parameter tuning.
First, look at the distribution of target to check whether the classes are balanced.
sns.countplot(train.target)
pyplot.xlabel('target')
pyplot.ylabel('number of occurrences')
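To put numbers on what the plot shows, value_counts gives the exact counts and proportions per class (a quick sketch I've added):
print(train['target'].value_counts().sort_index())               # absolute counts per class
print(train['target'].value_counts(normalize=True).sort_index()) # class proportions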
4. The classes are not very balanced; the counts differ greatly between classes, so cross-validation should sample each class proportionally. That is why StratifiedKFold is used here: each fold samples every class in proportion, so the class ratios in the training and validation splits match those of the original dataset.
y_train = train['target']
y_train = y_train.map(lambda s: s[6:])        # strip the 'Class_' prefix: 'Class_2' -> '2'
y_train = y_train.map(lambda s: int(s)-1)     # convert to 0-based integer labels 0..8
train = train.drop(['id', 'target'], axis=1)  # axis=1: drop the id and target columns; target is stored separately as the label, id is not a feature
x_train = np.array(train)
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=3)
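As a quick sanity check that stratification behaves as described, you can compare the overall class proportions with those inside one fold (a small sketch I've added, not part of the original tutorial):
y_arr = np.asarray(y_train)
train_idx, val_idx = next(kfold.split(x_train, y_arr))
print(np.round(np.bincount(y_arr) / len(y_arr), 4))            # overall class proportions
print(np.round(np.bincount(y_arr[val_idx]) / len(val_idx), 4)) # proportions within one validation fold: nearly identical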
5. Start with default parameters. The learning rate here is 0.1, which is fairly large; the goal is to observe the rough range for the number of weak learners. (With default parameters, check whether the model is overfitting or underfitting.)
def modelfit(alg, x_train, y_train, useTrainCV=True, cv_folds=None, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgb_param['num_class'] = 9
        xgtrain = xgb.DMatrix(x_train, label=y_train)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          folds=cv_folds, metrics='mlogloss', early_stopping_rounds=early_stopping_rounds)
        n_estimators = cvresult.shape[0]  # the round at which early stopping fired
        alg.set_params(n_estimators=n_estimators)
        print(cvresult)
        cvresult.to_csv(path+'\\my_preds_4_1.csv', index_label='n_estimators')
        # plot
        test_means = cvresult['test-mlogloss-mean']  # mean
        test_stds = cvresult['test-mlogloss-std']    # standard deviation
        train_means = cvresult['train-mlogloss-mean']
        train_stds = cvresult['train-mlogloss-std']
        x_axis = range(0, n_estimators)
        pyplot.errorbar(x_axis, test_means, yerr=test_stds, label='test')
        pyplot.errorbar(x_axis, train_means, yerr=train_stds, label='train')
        pyplot.title('XGBoost n_estimators vs Log Loss')
        pyplot.xlabel('n_estimators')
        pyplot.ylabel('Log Loss')
        pyplot.savefig(path+'\\n_estimators.png')
    # train the model on the data
    alg.fit(x_train, y_train, eval_metric='mlogloss')
    # predict on the training set
    train_predprob = alg.predict_proba(x_train)
    logloss = log_loss(y_train, train_predprob)
    # report the training log loss
    print("logloss of train:")
    print(logloss)
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',  # multi-class problem
    seed=3)
modelfit(xgb1, x_train, y_train, cv_folds=kfold)
This step takes a long time: early stopping fires at 742 weak learners. For the next round of tuning, we therefore fix the number of weak learners and tune each tree's depth and leaf-node weight.
6. Tune the tree parameters: max_depth & min_child_weight. The coarse search uses a step of 2; the next step searches around the coarse optimum with the step reduced to 1 for fine tuning.
max_depth=range(3,10,2)
min_child_weight=range(1,6,2)
param_test2_1=dict(max_depth=max_depth,min_child_weight=min_child_weight)
param_test2_1
xgb2_1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=742,  # the optimal number of weak learners found above
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',  # multi-class problem
    seed=3)
qsearch2_1 = GridSearchCV(xgb2_1, param_grid=param_test2_1, scoring='neg_log_loss',
                          n_jobs=-1, cv=kfold, return_train_score=True)  # train scores are used in the plots below
qsearch2_1.fit(x_train, y_train)
qsearch2_1.cv_results_, qsearch2_1.best_params_, qsearch2_1.best_score_  # grid_scores_ was removed from scikit-learn; use cv_results_
7. Visualize the relationship between each parameter and the loss.
# visualization
print("best: {} using {}".format(qsearch2_1.best_score_, qsearch2_1.best_params_))
test_means = qsearch2_1.cv_results_['mean_test_score']
test_stds = qsearch2_1.cv_results_['std_test_score']
train_means = qsearch2_1.cv_results_['mean_train_score']
train_stds = qsearch2_1.cv_results_['std_train_score']
pd.DataFrame(qsearch2_1.cv_results_).to_csv(path+'\\my_preds_maxdepth_min_child_weight.csv')
test_scores = np.array(test_means).reshape(len(max_depth), len(min_child_weight))
train_scores = np.array(train_means).reshape(len(max_depth), len(min_child_weight))
for i, value in enumerate(max_depth):
    pyplot.plot(min_child_weight, -test_scores[i], label='test_max_depth={}'.format(value))  # one curve per max_depth value
pyplot.legend()
pyplot.xlabel('min_child_weight')  # the x axis is min_child_weight; each curve corresponds to one max_depth
pyplot.ylabel('log_loss')
pyplot.savefig('max_depth_vs_min_child_weight_1.png')
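An alternative view that can be easier to read than overlaid curves is a heatmap of the score grid. A short sketch I've added, using the seaborn import from step 1 (rows and columns follow the reshape above):
# flip the negated scores back to log loss; rows are max_depth values, columns are min_child_weight values
sns.heatmap(-test_scores, annot=True, fmt='.4f',
            xticklabels=list(min_child_weight), yticklabels=list(max_depth))
pyplot.xlabel('min_child_weight')
pyplot.ylabel('max_depth')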
8. Fine-tune, with the step set to 1.
max_depth=[6,7,8]
min_child_weight=[4,5,6]
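The fine search then reruns the same grid search over these narrower ranges. A sketch, reusing xgb2_1 and kfold from above (the names param_test2_2 and qsearch2_2 are mine, following the document's naming pattern):
param_test2_2 = dict(max_depth=max_depth, min_child_weight=min_child_weight)
qsearch2_2 = GridSearchCV(xgb2_1, param_grid=param_test2_2, scoring='neg_log_loss',
                          n_jobs=-1, cv=kfold, return_train_score=True)
qsearch2_2.fit(x_train, y_train)
print(qsearch2_2.best_params_, qsearch2_2.best_score_)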
The steps above yield fairly good values for max_depth and min_child_weight.
After fixing max_depth=6 and min_child_weight=4, re-tune n_estimators.
Then tune the gamma parameter.
Then tune the regularization parameters.
Finally, lower the learning rate and re-tune the number of trees. (A condensed sketch of these remaining steps follows below.)
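The remaining steps follow the same coarse-to-fine pattern. Below is a sketch under the same setup; the grids, value ranges, and the names xgb3/xgb4/gsearch3/gsearch4 are my illustrative choices, not taken from the video:
# tune gamma with the tree-structure optimum fixed
xgb3 = XGBClassifier(learning_rate=0.1, n_estimators=742, max_depth=6, min_child_weight=4,
                     gamma=0, subsample=0.3, colsample_bytree=0.8, colsample_bylevel=0.7,
                     objective='multi:softprob', seed=3)
gsearch3 = GridSearchCV(xgb3, param_grid=dict(gamma=[0, 0.1, 0.2, 0.3, 0.4]),
                        scoring='neg_log_loss', n_jobs=-1, cv=kfold)
gsearch3.fit(x_train, y_train)
print(gsearch3.best_params_, gsearch3.best_score_)

# tune the L1/L2 regularization terms
gsearch4 = GridSearchCV(xgb3, param_grid=dict(reg_alpha=[0, 0.01, 0.1, 1], reg_lambda=[0.1, 1, 10]),
                        scoring='neg_log_loss', n_jobs=-1, cv=kfold)
gsearch4.fit(x_train, y_train)
print(gsearch4.best_params_, gsearch4.best_score_)

# finally, lower the learning rate and let modelfit/xgb.cv pick the matching number of trees
xgb4 = XGBClassifier(learning_rate=0.01, n_estimators=5000, max_depth=6, min_child_weight=4,
                     gamma=gsearch3.best_params_['gamma'], subsample=0.3, colsample_bytree=0.8,
                     colsample_bylevel=0.7, objective='multi:softprob', seed=3)
modelfit(xgb4, x_train, y_train, cv_folds=kfold)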
Watch the video linked at the beginning; it explains everything in detail!