DM12---XGBoost Learning

Basic Resources

Paper:
https://arxiv.org/abs/1603.02754
Theory blog posts:
《機器學習(四)— 從gbdt到xgboost》
https://www.cnblogs.com/mfryf/p/5946815.html
《GBDT&GBRT與XGBoost》
http://blog.csdn.net/u011826404/article/details/76427732
《XGBoost原理解析》
http://blog.csdn.net/dreamyx/article/details/70194018
《xgboost導讀和實戰》
https://wenku.baidu.com/view/44778c9c312b3169a551a460.html
Code:
https://github.com/dmlc/xgboost
Python lib download:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost
Official documentation:
http://xgboost.readthedocs.io/en/latest/python/python_intro.html
API:
http://xgboost.readthedocs.io/en/latest/python/python_api.html

The xgboost Python Package

The following looks at xgboost from the Python side. The Python package provides the following modules (a minimal usage sketch follows the list):
Core Data Structure
—-xgboost.DMatrix
—-xgboost.Booster
Learning API
—-xgboost.train
—-xgboost.cv
Scikit-Learn API [Scikit-Learn Wrapper interface for XGBoost]
—-xgboost.XGBRegressor
—-xgboost.XGBClassifier
Plotting API
—-xgboost.plot_importance
—-xgboost.plot_tree
—-xgboost.to_graphviz
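
As a quick orientation, here is a minimal usage sketch (on the built-in iris data, the same dataset used in the full example later) that touches each group of APIs above:

import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Core Data Structure: wrap numpy arrays in a DMatrix
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)

# Learning API: xgboost.train returns a Booster
booster = xgb.train({'objective': 'multi:softmax', 'num_class': 3}, dtrain, num_boost_round=10)
print(booster.predict(dtest)[:5])

# Scikit-Learn API: the same model behind a fit/predict interface
clf = xgb.XGBClassifier(n_estimators=10).fit(X_tr, y_tr)
print(clf.predict(X_te)[:5])

# Plotting API (needs matplotlib; plot_tree/to_graphviz also need graphviz)
# xgb.plot_importance(booster)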

XGBoost Parameters

The official parameter documentation is here:
http://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

General Parameters
1.booster [default=gbtree]: selects the base learner; gbtree: tree-based models, gblinear: linear models.
2.silent [default=0]: setting it to 1 suppresses running messages; it is usually best to leave it at 0.
3.nthread [default to maximum number of threads available if not set]: number of CPU threads. Two further parameters do not need to be set by the user:
● num_pbuffer [set automatically by xgboost]: size of the prediction buffer, normally set to the number of training instances; the buffer is used to save the prediction results of the last boosting step.
● num_feature [set automatically by xgboost]: feature dimension used in boosting, set to the maximum dimension of the features.
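
For reference, a tiny sketch of where these general parameters go: they are simply entries in the params dict passed to xgboost.train (num_pbuffer and num_feature are set automatically and never appear there); the nthread value here is illustrative only:

general_params = {
    'booster': 'gbtree',  # base learner: gbtree or gblinear
    'silent': 1,          # 1 suppresses running messages
    'nthread': 4,         # number of CPU threads
}
# bst = xgb.train(general_params, dtrain, num_boost_round=10)  # dtrain: an xgb.DMatrix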

Booster Parameters
1.eta [default=0.3]: shrinkage parameter; when leaf node weights are updated they are multiplied by this coefficient, avoiding overly large steps. The larger the value, the more likely training fails to converge. Setting the learning rate eta to a small value lets the later boosting rounds learn more carefully.
2.min_child_weight [default=1]: the minimum sum of the second-order gradients (the Hessian h) required in each leaf. For 0-1 classification on imbalanced data, if h is around 0.01, a min_child_weight of 1 means a leaf must contain at least about 100 samples (see the short sketch after this parameter list). This parameter strongly affects the result: it bounds the leaf's Hessian sum from below, and the smaller the value, the easier it is to overfit.
3.max_depth [default=6]: maximum depth of each tree; the deeper the tree, the easier it is to overfit.
4.max_leaf_nodes: maximum number of leaf nodes; its effect partly overlaps with max_depth.
5.gamma [default=0]: controls post-pruning; a split is kept only if its loss reduction exceeds gamma, so larger values make the algorithm more conservative.
6.max_delta_step [default=0]: takes effect in the update step; 0 means no constraint, while a positive value makes the update more conservative, preventing overly large update steps and smoothing the updates.

7.subsample [default=1]: random subsampling of the training instances; lower values make the algorithm more conservative and help prevent overfitting, but a value that is too small can cause underfitting.
8.colsample_bytree [default=1]: column subsampling of the features used to grow each tree; typically set to 0.5-1.
9.lambda [default=1]: L2 regularization term on the weights, controlling model complexity; the larger the value, the less the model tends to overfit.
10.alpha [default=0]: L1 regularization term on the weights, controlling model complexity; the larger the value, the less the model tends to overfit.
11.scale_pos_weight [default=1]: a value greater than 0 helps the model converge faster when the class distribution is imbalanced.
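
To make the min_child_weight rule of thumb above concrete: for the logistic loss, each instance contributes h = p(1 - p) to a leaf's Hessian sum, so on heavily imbalanced data where the predicted probability p sits near 0.01, h ≈ 0.01 and min_child_weight = 1 translates to roughly 100 instances per leaf. A small sketch of that arithmetic:

# Hessian of the logistic loss for one instance: h = p * (1 - p)
p = 0.01                            # predicted probability on heavily imbalanced data
h = p * (1 - p)                     # ≈ 0.0099 per instance
min_child_weight = 1.0
print(round(min_child_weight / h))  # ≈ 101 instances needed to satisfy the constraint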

Learning Task Parameters
1.objective [default=reg:linear]: defines the loss function to be minimized. Common values:
binary:logistic – logistic regression for binary classification, returns the predicted probability (not the class)
multi:softmax – multiclass classification using the softmax objective, returns the predicted class (not probabilities); you also need to set the additional num_class parameter defining the number of unique classes
multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class
2.eval_metric [default according to objective]: the metric to be used for validation data. The default is rmse for regression and error for classification. Typical values:
rmse – root mean square error
mae – mean absolute error
logloss – negative log-likelihood
error – binary classification error rate (0.5 threshold)
merror – multiclass classification error rate
mlogloss – multiclass logloss
auc – area under the curve
3.seed [default=0]: the random number seed; can be used for generating reproducible results and also for parameter tuning.
Note: the Python sklearn-style parameter names differ from the native ones: eta –> learning_rate, lambda –> reg_lambda, alpha –> reg_alpha.
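
A small sketch of that naming difference: the native learning API reads eta / lambda / alpha from the params dict, while the scikit-learn wrapper exposes the same knobs as learning_rate / reg_lambda / reg_alpha keyword arguments (the values below are illustrative only):

import xgboost as xgb

# Native learning API: original parameter names
native_params = {'eta': 0.1, 'lambda': 1.0, 'alpha': 0.0,
                 'objective': 'multi:softmax', 'num_class': 3}
# bst = xgb.train(native_params, dtrain, num_boost_round=100)  # dtrain: an xgb.DMatrix

# Scikit-learn wrapper: sklearn-style names for the same parameters
clf = xgb.XGBClassifier(learning_rate=0.1, reg_lambda=1.0, reg_alpha=0.0,
                        n_estimators=100)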

Hands-on Example

# coding=utf-8

import xgboost as xgb
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, shuffle=True)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

svc = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
svc_pre = svc.predict(X_test)
# score = svc.score(X_test, y_test)
# print('svm-score:', score)

# Build xgboost DMatrix objects
xgb_train = xgb.DMatrix(X_train, label=y_train)
xgb_test = xgb.DMatrix(X_test, label=y_test)

# Parameter dictionary for xgb.train
params = {
    # 1. General Parameters
    'booster': 'gbtree',  # default=gbtree; base learner: gbtree (tree-based models) or gblinear (linear models)
    'silent': 1,  # default=0; 1 suppresses running messages, 0 prints them
    'nthread': 7,  # [default to maximum number of threads available if not set] number of CPU threads
    # 2. Tree Booster parameters
    'eta': 0.007,  # [default=0.3, alias: learning_rate] acts like a learning rate
    'gamma': 0.1,  # [default=0, alias: min_split_loss] controls pruning; the larger, the more conservative; typically 0.1-0.2; range: [0,∞]
    'max_depth': 12,  # [default=6] depth of each tree; the deeper, the easier to overfit
    'min_child_weight': 3,  # [default=1] minimum sum of the Hessian h in each leaf; for 0-1 classification on imbalanced data,
    # if h is around 0.01, min_child_weight=1 means a leaf must contain at least about 100 samples.
    # This parameter strongly affects the result: it bounds the leaf's second-order gradient sum from below,
    # and the smaller the value, the easier it is to overfit.
    'max_delta_step': 1,  # [default=0]
    'subsample': 0.9,  # [default=1] random subsampling of training instances; range: (0,1]
    'colsample_bytree': 0.9,  # [default=1] column subsampling when growing each tree
    # 'colsample_bylevel': 0.8,  # [default=1]
    'lambda': 1.5,  # [default=1, alias: reg_lambda] L2 regularization on weights; the larger, the less the model overfits
    'alpha': 0,  # [default=0, alias: reg_alpha] L1 regularization on weights; the larger, the less the model overfits
    'tree_method': 'auto',  # [default='auto'] one of {'auto', 'exact', 'approx', 'hist', 'gpu_exact', 'gpu_hist'}
    # 'auto': use a heuristic to choose the faster method. For small to medium datasets, exact greedy is used;
    # for very large datasets, the approximate algorithm is chosen. Because the old behavior was to always use
    # exact greedy on a single machine, the user gets a message when the approximate algorithm is chosen.
    # 'exact': exact greedy algorithm.
    # 'approx': approximate greedy algorithm using sketching and histograms.
    # 'hist': fast histogram-optimized approximate greedy algorithm, with performance improvements such as bin caching.
    # 'gpu_exact': GPU implementation of the exact algorithm.
    # 'gpu_hist': GPU implementation of the hist algorithm.
    'sketch_eps': 0.03,  # [default=0.03] only used by the approximate greedy algorithm; range: (0, 1)
    'scale_pos_weight': 1,  # [default=1] a value greater than 0 helps convergence when classes are imbalanced
    'updater': 'grow_colmaker,prune',  # [default='grow_colmaker,prune'] comma-separated list defining the tree updaters to run
    # The following updater plugins exist:
    # 'grow_colmaker': non-distributed column-based construction of trees.
    # 'distcol': distributed tree construction with column-based data splitting mode.
    # 'grow_histmaker': distributed tree construction with row-based data splitting based on a global proposal of histogram counting.
    # 'grow_local_histmaker': based on local histogram counting.
    # 'grow_skmaker': uses the approximate sketching algorithm.
    # 'sync': synchronizes trees in all distributed nodes.
    # 'refresh': refreshes the tree's statistics and/or leaf values based on the current data; note that no random subsampling of data rows is performed.
    # 'prune': prunes the splits where loss < min_split_loss (or gamma).
    #
    # In a distributed setting, the implicit updater sequence would be adjusted as follows:
    # 'grow_histmaker,prune' when dsplit='row' (or default) and prob_buffer_row == 1 (or default), or when the data has multiple sparse pages
    # 'grow_histmaker,refresh,prune' when dsplit='row' and prob_buffer_row < 1
    # 'distcol' when dsplit='col'

    # 3. Learning task parameters
    'objective': 'multi:softmax',  # multiclass classification
    'num_class': 3,  # number of classes, used together with multi:softmax
    'seed': 1000,  # random seed for reproducible results
    'eval_metric': 'mlogloss'  # [default according to objective]
}
plst = list(params.items())
num_rounds = 5000  # number of boosting rounds
watchlist = [(xgb_train, 'train'), (xgb_test, 'val')]

# Train the model and save it
# When the number of rounds is large, early_stopping_rounds stops training if the evaluation metric
# on the validation set has not improved within the given number of rounds
bst = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)
bst.save_model('xgb.model')  # save the trained model
print("best best_ntree_limit", bst.best_ntree_limit)
bst_pre = bst.predict(xgb_test, ntree_limit=bst.best_ntree_limit)

xgb_sc = zero_one_loss(y_test, bst_pre)
svm_sc = zero_one_loss(y_test, svc_pre)
rs = np.c_[y_test, bst_pre, svc_pre]
print(rs)
print('xgb_sc:', 1 - xgb_sc, 'svm_sc:', 1 - svm_sc)
xgb.plot_importance(bst)
xgb.plot_tree(bst, num_trees=2)
plt.show()

# Write out the data to submit
# np.savetxt('submission.csv', np.c_[range(1, len(X_test) + 1), preds], delimiter=',', header='ImageId,Label',
#            comments='', fmt='%d')
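
Since the script saves the booster with bst.save_model('xgb.model'), here is a minimal sketch of loading it back later for prediction; it assumes X_test from the script above is still in scope:

# Reload the saved booster and predict again
bst_loaded = xgb.Booster()
bst_loaded.load_model('xgb.model')
reload_pre = bst_loaded.predict(xgb.DMatrix(X_test))
print(reload_pre[:5])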

Run Results

xgb_sc: 0.9777777777777777 svm_sc: 0.9777777777777777

[Figure: feature importance plot and tree plot produced by xgb.plot_importance and xgb.plot_tree]

References

xgboost入門與實戰(原理篇) http://blog.csdn.net/sb19931201/article/details/52557382

xgboost入門與實戰(實戰調參篇)
http://blog.csdn.net/sb19931201/article/details/52577592

DM07-Ensemble組合技術
http://blog.csdn.net/ld326/article/details/79367190

happyprince, http://blog.csdn.net/ld326/article/details/79543529
