How xgboost works

This article mainly combines two articles, with minor modifications. Original sources:
1. http://blog.csdn.net/a819825294/article/details/51206410
2. https://zhuanlan.zhihu.com/p/28672955

1.xgboost vs gbdt

Any discussion of xgboost has to start with gbdt. Both are boosting methods (see Figure 1). For background on gbdt, see my earlier article on the topic.

Figure 1

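Setting aside differences in engineering implementation and in the problems each one targets, the biggest difference between xgboost and gbdt is how the objective function is defined.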

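Note: in the figure, the red arrow points to l, the loss function; the red box marks the regularization term, which includes L1 and L2; the red circle marks the constant term. xgboost approximates the objective with a Taylor expansion kept to three terms (that is, up to second order), so the final objective depends only on the first and second derivatives of the loss at each data point.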
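For reference, the objective described in this note is usually written as below. This is a sketch in the notation of the xgboost paper; the symbols (l for the loss, Ω for the regularizer, f_t for the tree added at round t, g_i and h_i for the derivatives) are assumed here rather than taken from the figure.

```latex
\begin{aligned}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant} \\
&\approx \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t) + \text{constant}, \\
g_i &= \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}.
\end{aligned}
```

The term l(y_i, ŷ_i^(t-1)) does not depend on f_t, so it can be folded into the constant. What remains is a sum of g_i f_t(x_i) + ½ h_i f_t(x_i)² plus Ω(f_t), which is why only g_i and h_i matter.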

2.Principles

The objective function given above can be simplified further.

(1) Defining tree complexity

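We refine the definition of f by splitting a tree into a structure part q and a leaf-weight part w. The figure below gives a concrete example. The structure function q maps an input to the index of a leaf, and w gives the score of the leaf at each index.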

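The complexity is defined from the number of leaves in a tree and the squared L2 norm of the scores output at the leaves. This is not the only possible definition, but trees learned under it generally perform well. The figure below also gives an example of computing the complexity.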
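As a small illustration of this definition, the sketch below computes the complexity Ω(f) = γT + ½λ Σ w_j² for a tree given only its leaf weights. The function name `tree_complexity` and the three leaf weights are hypothetical values chosen for the example.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum_j w_j^2, where T is the number of leaves."""
    num_leaves = len(leaf_weights)
    l2_norm_sq = sum(w * w for w in leaf_weights)
    return gamma * num_leaves + 0.5 * lam * l2_norm_sq


# Hypothetical tree with three leaves (weights chosen for illustration only).
weights = [2.0, 0.1, -1.0]
omega = tree_complexity(weights, gamma=1.0, lam=1.0)
print(omega)  # 3 * gamma + 0.5 * lam * (4 + 0.01 + 1)
```

Raising gamma penalizes every extra leaf, while raising lam penalizes large leaf scores.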

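Note: the boxed coefficients control how much weight this part carries in the final model formula. They correspond to the model parameters lambda and gamma.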

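Under this new definition we can rewrite the objective function as follows, where I_j is the set of samples that fall on leaf j, g is the first derivative, and h is the second derivative.

This objective contains T mutually independent single-variable quadratic functions. We can define G_j and H_j as the sums of g and h over the samples on leaf j.

The final formula then simplifies to a sum over leaves.

Setting the derivative with respect to w_j to zero gives the optimal leaf weight.

Substituting the optimal solution back in gives: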
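The steps above can be written out as follows. This is a reconstruction in the standard notation of the xgboost paper (q for the structure function, w_j for leaf weights, T for the number of leaves), not a copy of the original figures.

```latex
\begin{aligned}
\mathrm{Obj}^{(t)} &\simeq \sum_{i=1}^{n}\Big[g_i\, w_{q(x_i)} + \tfrac12 h_i\, w_{q(x_i)}^2\Big] + \gamma T + \tfrac12 \lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T}\Big[\Big(\sum_{i\in I_j} g_i\Big) w_j + \tfrac12\Big(\sum_{i\in I_j} h_i + \lambda\Big) w_j^2\Big] + \gamma T,
\qquad I_j = \{\, i \mid q(x_i) = j \,\} \\
G_j &= \sum_{i\in I_j} g_i, \qquad H_j = \sum_{i\in I_j} h_i \\
\mathrm{Obj}^{(t)} &= \sum_{j=1}^{T}\Big[G_j w_j + \tfrac12 (H_j+\lambda)\, w_j^2\Big] + \gamma T \\
\frac{\partial\, \mathrm{Obj}^{(t)}}{\partial w_j} &= G_j + (H_j+\lambda)\, w_j = 0
\;\Longrightarrow\; w_j^{*} = -\frac{G_j}{H_j+\lambda} \\
\mathrm{Obj}^{*} &= -\frac12 \sum_{j=1}^{T} \frac{G_j^2}{H_j+\lambda} + \gamma T
\end{aligned}
```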

(2) Scoring function: a worked example

Obj represents the most we can reduce the objective by once a tree structure is fixed. We call it the structure score.
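A minimal sketch of the structure score computation, assuming the standard formulas w_j* = -G_j / (H_j + λ) and Obj* = -½ Σ G_j² / (H_j + λ) + γT. The function name `structure_score` and the gradient values are hypothetical; the data imitates a squared-error loss, where g = ŷ - y and h = 1.

```python
from collections import defaultdict


def structure_score(grads, hessians, leaf_index, lam, gamma):
    """Return (optimal leaf weights, structure score) for a fixed tree structure.

    leaf_index[i] is q(x_i), the leaf that sample i falls into.
    """
    G = defaultdict(float)  # sum of first derivatives per leaf
    H = defaultdict(float)  # sum of second derivatives per leaf
    for g, h, j in zip(grads, hessians, leaf_index):
        G[j] += g
        H[j] += h
    weights = {j: -G[j] / (H[j] + lam) for j in G}
    obj = -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in G) + gamma * len(G)
    return weights, obj


# Hypothetical gradients for five samples spread over three leaves.
grads = [-2.0, -1.0, 1.0, 3.0, -1.0]
hessians = [1.0, 1.0, 1.0, 1.0, 1.0]
leaf_index = [0, 0, 1, 1, 2]
weights, obj = structure_score(grads, hessians, leaf_index, lam=1.0, gamma=0.0)
print(weights, obj)
```

A lower (more negative) score means the structure can reduce the objective more, so candidate structures can be compared by this number alone.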

(3) Splitting nodes

The paper gives two methods for splitting nodes.

(1) Greedy algorithm:

Each step tries to add one split to an existing leaf.
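The gain of a candidate split is usually written as below. This is the standard form from the xgboost paper, with G_L, H_L and G_R, H_R assumed to denote the sums of g and h over the samples sent to the left and right child.

```latex
\mathrm{Gain} = \frac12\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
```

The three fractions are the scores of the left child, the right child, and the unsplit leaf; γ is the cost of the one extra leaf the split introduces.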

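For each expansion we still have to enumerate every possible split. How can this be done efficiently? Suppose we want to enumerate all conditions of the form x < a. For a particular split point a, we need the sum of the derivatives on the left of a and the sum on the right.

It turns out that a single left-to-right scan over the sorted values yields the gradient sums GL and GR for every a. We then apply the formula above to score each candidate split.

Looking at this objective, a second point worth noting is that introducing a split does not necessarily improve things, because every new leaf carries a penalty. Optimizing this objective therefore corresponds to pruning the tree: when the gain from a split is smaller than a threshold, the split can be pruned away. Once the objective is derived formally, strategies such as score computation and pruning emerge naturally instead of being applied as heuristics.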

The algorithm from the paper is as follows.
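The left-to-right scan can be sketched in a few lines for a single feature. This is an illustrative sketch in the spirit of the exact greedy algorithm, not the paper's pseudocode; the function name `best_split` and the sample data are hypothetical, and the gain is assumed to be ½[G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ.

```python
def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan one feature left to right.

    Returns (best_gain, threshold) for a split of the form x < threshold,
    or (0.0, None) when no split has positive gain.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)

    def score(G_, H_):
        return G_ * G_ / (H_ + lam)

    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += g[i]  # running sums for everything to the left of the split
        HL += h[i]
        if x[i] == x[nxt]:
            continue  # cannot split between two equal feature values
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x[i] + x[nxt])
    return best_gain, best_threshold


# Hypothetical data: the gradients change sign between x = 2 and x = 3.
x = [1.0, 2.0, 3.0, 4.0]
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]
gain, threshold = best_split(x, g, h, lam=1.0, gamma=0.0)
print(gain, threshold)
```

With a large enough gamma no split has positive gain and the function returns `(0.0, None)`, which is the pruning behaviour described above.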

(2) Approximate algorithm:

Mainly intended for data that is too large for the exact computation to be carried out directly.
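The idea can be sketched as follows: propose a small set of candidate thresholds from the quantiles of a feature, accumulate g and h per bucket in one pass, and evaluate the gain only at the candidates. This is a simplified sketch under stated assumptions: the paper proposes candidates with a weighted quantile sketch that uses h_i as weights, whereas this example uses plain unweighted ranks. The function names and data are hypothetical.

```python
from bisect import bisect_right


def candidate_thresholds(x, num_buckets):
    """Pick up to num_buckets - 1 thresholds at evenly spaced ranks (unweighted quantiles)."""
    xs = sorted(x)
    n = len(xs)
    cands = []
    for k in range(1, num_buckets):
        t = xs[(k * n) // num_buckets]
        # Skip the minimum (empty left side) and repeated values.
        if t > xs[0] and (not cands or t > cands[-1]):
            cands.append(t)
    return cands


def approx_best_split(x, g, h, num_buckets=4, lam=1.0, gamma=0.0):
    """Return (best_gain, threshold) for splits x < threshold over the candidates only."""
    cands = candidate_thresholds(x, num_buckets)
    bucket_G = [0.0] * (len(cands) + 1)
    bucket_H = [0.0] * (len(cands) + 1)
    for xi, gi, hi in zip(x, g, h):
        b = bisect_right(cands, xi)  # number of candidates <= xi
        bucket_G[b] += gi
        bucket_H[b] += hi
    G, H = sum(bucket_G), sum(bucket_H)

    def score(G_, H_):
        return G_ * G_ / (H_ + lam)

    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    for b, t in enumerate(cands):
        GL += bucket_G[b]  # buckets 0..b hold exactly the samples with x < t
        HL += bucket_H[b]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (score(GL, HL) + score(GR, HR) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return best_gain, best_threshold


# Hypothetical data: eight samples, gradients change sign between x = 4 and x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8]
g = [-1.0] * 4 + [1.0] * 4
h = [1.0] * 8
approx_gain, approx_threshold = approx_best_split(x, g, h, num_buckets=4)
print(approx_gain, approx_threshold)
```

The scan now runs over a handful of buckets instead of every distinct value, at the cost of possibly missing the exact best threshold.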

3.Custom loss functions (specifying grad and hess)

(1) Loss function

(2) Deriving grad and hess
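As an illustration of such a derivation, assume the loss is the binary logistic loss applied to a raw score ŷ (this choice of loss is an assumption made for the example; the text does not name one). With p = σ(ŷ):

```latex
\begin{aligned}
p &= \sigma(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}, \qquad
l(y, \hat{y}) = -\big[\, y \ln p + (1-y) \ln (1-p) \,\big] \\
\text{grad} &= \frac{\partial l}{\partial \hat{y}} = p - y \\
\text{hess} &= \frac{\partial^2 l}{\partial \hat{y}^2} = p\,(1-p)
\end{aligned}
```

Both results follow from dp/dŷ = p(1 - p).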

(3) Official example: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
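A minimal sketch of a grad/hess computation, assuming the binary logistic loss on raw scores (grad = p - y, hess = p(1 - p)). It uses only the standard library so it runs on its own; the function names are hypothetical. To use it with xgboost, one would typically wrap it in a function that takes `(preds, dtrain)`, reads the labels with `dtrain.get_label()`, returns the two arrays, and pass that function as the `obj` argument of `xgb.train`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(z, y):
    """Binary logistic loss for a raw score z and a label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def logistic_grad_hess(margins, labels):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    grad, hess = [], []
    for z, y in zip(margins, labels):
        p = sigmoid(z)
        grad.append(p - y)
        hess.append(p * (1.0 - p))
    return grad, hess


grad, hess = logistic_grad_hess([0.0, 2.0], [1, 0])
print(grad, hess)
```

A quick way to check any hand-derived grad is to compare it with a finite difference of the loss, which is what the derivation in (2) should agree with.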

4.Basic usage of xgboost in Python and R

Task: binary classification with imbalanced classes (scale_pos_weight can mitigate the imbalance to some extent).

[python]
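A minimal training sketch for this task, under stated assumptions: the helper names `suggested_scale_pos_weight` and `train_binary`, and the parameter values, are illustrative choices rather than the original author's code. Setting scale_pos_weight to the ratio of negative to positive examples is a common starting point. xgboost is imported inside the training function so the helper can be used without it.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting value."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive examples in labels")
    return neg / pos


def train_binary(x_train, y_train, num_boost_round=100):
    """Train a binary classifier; x_train and y_train are array-like."""
    import xgboost as xgb  # lazy import: only needed when training

    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "eta": 0.1,          # illustrative value
        "max_depth": 6,      # illustrative value
        "scale_pos_weight": suggested_scale_pos_weight(y_train),
    }
    dtrain = xgb.DMatrix(x_train, label=y_train)
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)


print(suggested_scale_pos_weight([0, 0, 0, 1]))
```

The returned booster's `predict` method, called on a `DMatrix`, gives probabilities for the positive class under `binary:logistic`.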

[R]

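5.Important xgboost parameters

5.1 xgboost parameters fall into three groups:

General Parameters: control the overall behaviour
Booster Parameters: control the properties of the individual learner
Learning Task Parameters: control how the optimization is performed

5.2 Parameter details

(1) General Parameters: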

booster [default=gbtree]

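Selects the type of model used at each iteration. There are two options:

gbtree: tree-based models
gblinear: linear models

silent [default=0]

Set to 1 to suppress running messages.
Set to 0 to print them.

nthread [default to maximum number of threads available if not set]

Sets how many cores are used for parallel execution.
To run on all available cores, leave this parameter at its default.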

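(2) Booster Parameters

There are two kinds of booster, linear and tree-based. The tree booster is the one generally used.

eta [default=0.3]

Analogous to the learning rate in GBM.
Makes the model more robust by shrinking the weights at each step.
Typical values: 0.01-0.2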

min_child_weight [default=1]

Defines the minimum sum of weights of all observations required in a child.
Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
Too high values can lead to under-fitting; hence, it should be tuned using CV.

max_depth [default=6]

The maximum depth of a tree, same as GBM.
Used to control over-fitting, as higher depth will allow the model to learn relations very specific to a particular sample.
Should be tuned using CV.
Typical values: 3-10

max_leaf_nodes

The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of 'n' would produce a maximum of 2^n leaves.
Like max_depth, it controls over-fitting by limiting the size of the tree.
If this is defined, GBM will ignore max_depth: the size implied by the leaf limit takes precedence.

gamma [default=0]

A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split. With gamma set to 0, a node is split whenever the split reduces the loss at all.
Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.

max_delta_step [default=0]

The maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. (I have not fully understood this parameter yet.)
Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.

subsample [default=1]

Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree. For example, 0.8 means each tree is built from a random 80% of the samples.
Lower values make the algorithm more conservative and prevent over-fitting, but too small values might lead to under-fitting because too little data is sampled.
Typical values: 0.5-1

colsample_bytree [default=1]

Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree. For example, 0.8 means each tree is built from a random 80% of the columns.
Typical values: 0.5-1

colsample_bylevel [default=1]

Denotes the subsample ratio of columns for each split, in each level.
I don’t use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you wish.

lambda [default=1]

L2 regularization term on weights (analogous to Ridge regression).
This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce over-fitting.
L2 regularization keeps the model from leaning too heavily on any single feature, so that as many features as possible contribute to the model.

alpha [default=0]

L1 regularization term on weights (analogous to Lasso regression).
Can be used in case of very high dimensionality so that the algorithm runs faster.
L1 regularization encourages sparsity, which helps speed up computation.

scale_pos_weight [default=1]

A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.

(3) Learning Task Parameters

These parameters are used to define the optimization objective and the metric to be calculated at each step.

objective [default=reg:linear]

This defines the loss function to be minimized. The most commonly used values are:

binary:logistic – logistic regression for binary classification; returns the predicted probability (not the class)
multi:softmax – multiclass classification using the softmax objective; returns the predicted class (not probabilities). You also need to set the additional num_class parameter, which gives the number of unique classes.
multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class.
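The relationship between the two multiclass objectives can be illustrated with a short sketch: taking the most probable class from each row of per-class probabilities gives a predicted class, which is the kind of output multi:softmax returns. The sketch assumes the probabilities are arranged as one row per data point and one column per class; the function name and the numbers are hypothetical.

```python
def softprob_to_class(prob_rows):
    """Return the index of the most probable class for each row of probabilities."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


# Hypothetical per-class probabilities for three data points and three classes.
probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
classes = softprob_to_class(probs)
print(classes)
```

Keeping the probabilities is useful when a metric such as mlogloss is needed or when a custom decision threshold will be applied later.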

eval_metric [ default according to objective ]

The metric to be used for validation data.
The default values are rmse for regression and error for classification.
Typical values are:

rmse – root mean square error
mae – mean absolute error
logloss – negative log-likelihood
error – binary classification error rate (0.5 threshold)
merror – multiclass classification error rate
mlogloss – multiclass logloss
auc – area under the curve
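To make a few of these metrics concrete, the sketch below computes rmse, logloss, and error by hand from predictions and labels. These are textbook definitions written for illustration, not xgboost's internal implementation; the function names and the sample values are hypothetical.

```python
import math


def rmse(preds, targets):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


def logloss(probs, labels, eps=1e-15):
    """Negative log-likelihood, averaged over samples; probs are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def error_rate(probs, labels, threshold=0.5):
    """Fraction of wrong predictions when probabilities above the threshold count as positive."""
    wrong = sum(1 for p, y in zip(probs, labels) if (1 if p > threshold else 0) != y)
    return wrong / len(labels)


# Hypothetical predicted probabilities and true labels.
probs = [0.9, 0.4, 0.7, 0.2]
labels = [1, 1, 0, 0]
print(error_rate(probs, labels), logloss(probs, labels), rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Note that error depends only on which side of the threshold a prediction falls, while logloss also penalizes confident wrong probabilities.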

seed [default=0]

The random number seed.
Can be used for generating reproducible results and also for parameter tuning.
Fixing the seed keeps the random sampling steps identical from run to run, so results can be reproduced.

6.Tuning xgboost parameters

Because xgboost has so many parameters, three approaches are introduced here.

(1) GridSearch
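The idea behind grid search can be sketched without any library: try every combination of the listed values and keep the best-scoring one. The function `grid_search` and the scoring function `toy_score` are hypothetical stand-ins; in practice the score would come from cross-validation, for example via scikit-learn's `GridSearchCV` together with xgboost's scikit-learn wrapper `XGBClassifier`.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score).

    evaluate(params) must return a score where higher is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Toy stand-in for a cross-validated score; peaks at max_depth=5, eta=0.1."""
    return -abs(params["max_depth"] - 5) - abs(params["eta"] - 0.1)


grid = {"max_depth": [3, 5, 7], "eta": [0.05, 0.1, 0.2]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of combinations grows multiplicatively with each added parameter, which is why grids are usually kept small or tuned a few parameters at a time.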

(2) Hyperopt

(3) A hands-on English-language article on parameter tuning, well worth studying.

7.Tip

(1) Training on data that contains missing values
# assumes: import numpy as np; import xgboost as xgb
dtrain = xgb.DMatrix(x_train, label=y_train, missing=np.nan)

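8.References

(1) xgboost導讀和實戰 (xgboost: Guided Reading and Practice)
(2) xgboost
(3) 自定義目標函數 (Custom objective functions)
(4) 機器學習算法中GBDT和XGBOOST的區別有哪些？ (What are the differences between GBDT and XGBOOST among machine learning algorithms?)
(5) https://www.kaggle.com/anokas/sparse-xgboost-starter-2-26857/code/code
(6) XGBoost: Reliable Large-scale Tree Boosting System
(7) XGBoost: A Scalable Tree Boosting System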

夕陽下江堤上的男孩
