Kaggle: simple code for Otto Group Product Classification

Introduction and related links:

I watched an XGBoost course video on Bilibili (B站). After covering the theory behind XGBoost, the course works through a classification task on Kaggle's Otto Group Product Classification dataset.

1. First, import the modules

from xgboost import XGBClassifier
import xgboost as xgb

import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import log_loss #negative log-likelihood loss

from matplotlib import pyplot
import seaborn as sns
%matplotlib inline

2. Read the data with pandas' pd.read_csv(). I have also run into JSON data before, which can be read with pd.read_json() (a small sketch follows the snippet below).

path='D:\\data\\kaggle\\otto-group-product-classification-challenge'
train=pd.read_csv(path+'\\train.csv')
test=pd.read_csv(path+'\\test.csv')
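As an aside, a minimal sketch of the pd.read_json() case mentioned above; the file name here is purely hypothetical and only illustrates the API:

df_json=pd.read_json(path+'\\some_records.json')#hypothetical file; pd.read_json returns a DataFrame just like pd.read_csv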

3. It is usually worth looking at the feature distributions before working on a problem. In this dataset the variables are fixed, the features are uniform in type, and they have been anonymized (you cannot tell their real-world meaning by eye), so there is little feature engineering to do; the main effort goes into parameter tuning.

First, look at the distribution of the target to check whether the classes are balanced.

sns.countplot(train.target)
pyplot.xlabel('target')
pyplot.ylabel('number of occurrences')

4. The classes are not well balanced: the counts differ a lot between classes, so cross-validation should sample each class proportionally. That is why StratifiedKFold is used here: each fold samples every class in proportion, so the class ratios in the training and validation sets match those of the original dataset.

y_train=train['target']
y_train=y_train.map(lambda s:s[6:])#strip the 'Class_' prefix, e.g. 'Class_1' -> '1'
y_train=y_train.map(lambda s:int(s)-1)#shift to 0-based integer labels 0..8

train=train.drop(['id','target'],axis=1)#drop the id and target columns (axis=1); target is stored separately as the label, and id is not used as a feature
x_train=np.array(train)

kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=3)

5. Default parameters, with the learning rate at 0.1, which is fairly large; observe the rough range for the number of weak learners. (Run with default parameters to see whether the model overfits or underfits.)

def modelfit(alg,x_train,y_train,useTrainCV=True,cv_folds=None,early_stopping_rounds=50):
    if useTrainCV:
        xgb_param=alg.get_xgb_params()
        xgb_param['num_class']=9
        
        xgtrain=xgb.DMatrix(x_train,label=y_train)
        cvresult=xgb.cv(xgb_param,xgtrain,num_boost_round=alg.get_params()['n_estimators'],folds=cv_folds,
                       metrics='mlogloss',early_stopping_rounds=early_stopping_rounds)
        
        n_estimators=cvresult.shape[0]
        alg.set_params(n_estimators=n_estimators)
        
        print(cvresult)
        
        cvresult.to_csv(path+'\\my_preds_4_1.csv',index_label='n_estimators')
        
        #plot
        test_means=cvresult['test-mlogloss-mean']#mean over the CV folds
        test_stds=cvresult['test-mlogloss-std']#standard deviation over the CV folds
        
        train_means=cvresult['train-mlogloss-mean']
        train_stds=cvresult['train-mlogloss-std']
        
        x_axis=range(0,n_estimators)
        pyplot.errorbar(x_axis,test_means,yerr=test_stds,label='test')
        pyplot.errorbar(x_axis,train_means,yerr=train_stds,label='train')
        pyplot.title('XGBoost n_estimators vs Log Loss')
        pyplot.xlabel('n_estimators')
        pyplot.ylabel('Log Loss')
        pyplot.savefig(path+'\\n_estimators.png')
        
    #train the model on the data
    alg.fit(x_train,y_train,eval_metric='mlogloss')
    #predict on the training set
    train_predprob=alg.predict_proba(x_train)
    logloss=log_loss(y_train,train_predprob)
    
    #print
    print("logloss of train:")
    print(logloss)


xgbl=XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',#multi-class classification with class probabilities
    seed=3)

modelfit(xgbl,x_train,y_train,cv_folds=kfold)
        

This step takes quite a long time: early stopping kicked in at 742 weak learners. For the next tuning step we therefore fix the number of weak learners and tune the depth of each tree and the leaf node weights.

6. Tune the tree parameters: max_depth & min_child_weight. The coarse search uses a step size of 2; the next step is to search around the best coarse values with the step size reduced to 1 for fine tuning.

max_depth=range(3,10,2)
min_child_weight=range(1,6,2)
param_test2_1=dict(max_depth=max_depth,min_child_weight=min_child_weight)
param_test2_1

xgb2_1=XGBClassifier(
    learning_rate=0.1,
    n_estimators=742, #use the optimal number of weak learners found above
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',#multi-class classification
    seed=3)

qsearch2_1=GridSearchCV(xgb2_1,param_grid=param_test2_1,scoring='neg_log_loss',n_jobs=-1,cv=kfold,return_train_score=True)#return_train_score is needed because the train scores are plotted below
qsearch2_1.fit(x_train,y_train)
qsearch2_1.cv_results_,qsearch2_1.best_params_,qsearch2_1.best_score_#grid_scores_ was removed from scikit-learn; cv_results_ is its replacement

7. Visualize, to see how each parameter relates to the loss.

#visualization
print("best:{} using {}".format(qsearch2_1.best_score_,qsearch2_1.best_params_))
test_means=qsearch2_1.cv_results_['mean_test_score']
test_stds=qsearch2_1.cv_results_['std_test_score']
train_means=qsearch2_1.cv_results_['mean_train_score']
train_stds=qsearch2_1.cv_results_['std_train_score']

pd.DataFrame(qsearch2_1.cv_results_).to_csv(path+'\\my_preds_maxdepth_min_child_weight')

test_scores=np.array(test_means).reshape(len(max_depth),len(min_child_weight))
train_scores=np.array(train_means).reshape(len(max_depth),len(min_child_weight))
for i,value in enumerate(max_depth):
    pyplot.plot(min_child_weight,-test_scores[i],label='test_max_depth={}'.format(value))#one curve per max_depth; negate neg_log_loss to get log loss
    

pyplot.legend()
pyplot.xlabel('min_child_weight')
pyplot.ylabel('log_loss')
pyplot.savefig('max_depth_vs_min_child_weight_1.png')

8. Fine-tune, with the step size reduced to 1; a sketch of the search follows the parameter lists below.

max_depth=[6,7,8]
min_child_weight=[4,5,6]
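A sketch of the fine search, mirroring the coarse grid above (param_test2_2 and qsearch2_2 are just names used for this sketch; the estimator settings reuse xgb2_1):

param_test2_2=dict(max_depth=max_depth,min_child_weight=min_child_weight)

qsearch2_2=GridSearchCV(xgb2_1,param_grid=param_test2_2,scoring='neg_log_loss',n_jobs=-1,cv=kfold)
qsearch2_2.fit(x_train,y_train)
qsearch2_2.best_params_,qsearch2_2.best_score_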

 

The steps above give reasonably good values for max_depth and min_child_weight.

After setting max_depth=6 and min_child_weight=4, tune n_estimators again.
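A sketch of what this re-check could look like, re-running the modelfit helper from step 5 with the tuned tree parameters (xgb3 is an illustrative name; the learning rate stays at 0.1 for now):

xgb3=XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,#upper bound; early stopping in xgb.cv picks the actual number
    max_depth=6,
    min_child_weight=4,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',
    seed=3)

modelfit(xgb3,x_train,y_train,cv_folds=kfold)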

Tuning the gamma parameter.
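A sketch of the gamma search with GridSearchCV; the candidate values and variable names are illustrative, and the other parameters are assumed to be fixed at the values found above:

param_test3=dict(gamma=[0,0.1,0.2,0.3,0.4])#illustrative candidate values

xgb2_1.set_params(max_depth=6,min_child_weight=4)#fix the tree parameters found above
gsearch3=GridSearchCV(xgb2_1,param_grid=param_test3,scoring='neg_log_loss',n_jobs=-1,cv=kfold)
gsearch3.fit(x_train,y_train)
gsearch3.best_params_,gsearch3.best_score_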

Tuning the regularization parameters.
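A sketch of the regularization search over reg_alpha (L1) and reg_lambda (L2); again, the grids and variable names are illustrative:

param_test4=dict(reg_alpha=[0,0.01,0.1,1],reg_lambda=[0.5,1,2])#illustrative grids

gsearch4=GridSearchCV(xgb2_1,param_grid=param_test4,scoring='neg_log_loss',n_jobs=-1,cv=kfold)
gsearch4.fit(x_train,y_train)
gsearch4.best_params_,gsearch4.best_score_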

Lower the learning rate and adjust the number of trees.
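A sketch of this final step, dropping the learning rate and letting xgb.cv re-pick the number of trees via the modelfit helper. learning_rate=0.01 and n_estimators=5000 are placeholder values; the other parameters should be whatever the earlier searches selected:

xgb_final=XGBClassifier(
    learning_rate=0.01,#smaller learning rate
    n_estimators=5000,#large upper bound; early stopping picks the actual number
    max_depth=6,
    min_child_weight=4,
    gamma=0,
    subsample=0.3,
    colsample_bytree=0.8,
    colsample_bylevel=0.7,
    objective='multi:softprob',
    seed=3)

modelfit(xgb_final,x_train,y_train,cv_folds=kfold)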

For the full details, watch the video linked at the beginning; it explains everything thoroughly!
