Welcome to my technical blog, SnailDove. This post contains many formulas and CSDN renders math poorly, so please read it at the original link: 一起入門xgboost

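Preface

Before deep learning took off, ensemble learning (including boosting methods such as GBDT and XGBoost) was the weapon of choice in Kaggle and similar competitions, so it is essential machine learning knowledge. If you are not familiar with boosting trees or GBDT, it is best to read my other post first: 《統計學習方法》第8章提升方法之AdaBoost\BoostingTree\GBDT. Tianqi Chen's XGBoost (eXtreme Gradient Boosting) and Microsoft's LightGBM are both ingenious implementations of the GBDT algorithm and are formidable in competitions. The algorithm is not only mathematics; it also involves system design and engineering optimization. The following passage is quoted from Tianqi Chen's XGBoost paper: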

Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount [1].

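Contents

XGBoost

Quick overview

Gradient Boosting (How do we Learn)

XGBoost system design

What XGBoost does cleverly in implementing GBDT

The optimization perspective

Parameter tuning

References

Main text

XGBoost

Quick overview

This section is essentially a translation of Tianqi Chen's slides: the slides on the official website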

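Outline of the slides

• Review of key concepts of supervised learning
• Regression trees and ensembles (What are we Learning)
• Gradient boosting (How do we Learn)
• Summary

Review of key concepts of supervised learning

Concepts

Symbol	Meaning
$R^d$	the $d$-dimensional feature space
$x_i \in R^d$	the $i$-th sample
$w_j$	the weight of the $j$-th feature
$\hat{y}_i$	the prediction for $x_i$
$y_i$	the label of the $i$-th training sample
$\Theta$	the set of feature weights, $\Theta=\lbrace w_j \mid j=1,\cdots,d \rbrace$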

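Model

Essentially all of the related models build on the following linear form:
$\hat y_i = \sum_{j = 0}^{d} w_j x_{ij}$
Here $x_{i0}=1$, which introduces a bias, in other words a constant term. Several models follow from this expression:

Linear model: the final score is $\hat{y}_i$ itself.
Logistic model: the final score is the sigmoid $\frac{1}{1+e^{-\hat{y}_i}}$; a threshold then turns it into a positive or negative label.
Most other models also apply some transformation to $\hat{y}_i$ to obtain the final score.

Parameters

The parameters are $\Theta$, which is exactly what we need to learn from training.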

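Objective function for training

The general training objective is:
$Obj(\Theta)=L(\Theta)+\Omega(\Theta)$
Here $L(\Theta)$ is the training loss, which measures how well the model fits the training set, and $\Omega(\Theta)$ is the regularization term, which measures the complexity of the model.

The training loss can be written as $L = \sum_{i = 1}^n l(y_i, \hat y_i)$; common choices are the square loss and the logistic loss.

Square loss: $l(y_i,\hat y_i) = (y_i - \hat y_i)^2$
Logistic loss: $l(y_i, \hat y_i) = y_i \ln(1 + e^{- \hat y_i}) + (1 - y_i)\ln(1 + e^{\hat y_i})$

In Andrew Ng's words, the regularization term is there to avoid overfitting. Why does it have this effect? Because it reflects model complexity, that is, the complexity of our hypothesis, and by Occam's razor the simpler hypothesis is preferable. This term is what keeps complexity in check.

L2 norm: $\Omega(w)=\lambda||w||_2^2$
L1 norm (lasso): $\Omega(w)=\lambda||w||_1$

Common objectives include ridge regression, logistic regression and the lasso:

Ridge regression: the most common one, built from a linear model, the square loss and L2 regularization: $\sum\limits^n_{i=1}(y_i-w^Tx_i)^2+\lambda||w||_2^2$
Logistic regression: also common, mainly used for binary classification (for example "love it or not"), built from a linear model, the logistic loss and L2 regularization: $\sum\limits^n_{i=1} [y_i\ln(1+e^{-w^Tx_i})+(1-y_i)\ln(1+e^{w^Tx_i})]+\lambda||w||_2^2$
Lasso: less common, built from a linear model, the square loss and L1 regularization: $\sum\limits_{i = 1}^n (y_i - w^T x_i)^2 + \lambda \vert \vert w \vert \vert _1$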
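To make the "loss plus regularization" structure concrete, here is a minimal pure-Python sketch that evaluates the three objectives on toy data. The function names, the toy data and the weights are illustrative assumptions, and the L2 penalty is taken as the squared norm $\lambda||w||_2^2$.

```python
import math


def predict(w, x):
    """Linear score w^T x; x is assumed to already contain the constant x_0 = 1."""
    return sum(wj * xj for wj, xj in zip(w, x))


def square_loss(y, y_hat):
    return (y - y_hat) ** 2


def logistic_loss(y, y_hat):
    # y is 0 or 1; y_hat is the raw score before the sigmoid
    return y * math.log(1 + math.exp(-y_hat)) + (1 - y) * math.log(1 + math.exp(y_hat))


def l2_penalty(w, lam):
    return lam * sum(wj ** 2 for wj in w)


def l1_penalty(w, lam):
    return lam * sum(abs(wj) for wj in w)


def objective(w, X, y, loss, penalty, lam):
    """Obj(Theta) = L(Theta) + Omega(Theta)."""
    training_loss = sum(loss(yi, predict(w, xi)) for xi, yi in zip(X, y))
    return training_loss + penalty(w, lam)


# Toy data (hypothetical): first column is the constant feature
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
y = [1, 0, 1]
w = [0.1, 0.8]

ridge = objective(w, X, y, square_loss, l2_penalty, lam=0.1)
lasso = objective(w, X, y, square_loss, l1_penalty, lam=0.1)
logit = objective(w, X, y, logistic_loss, l2_penalty, lam=0.1)
```

Swapping the `loss` and `penalty` arguments is all it takes to move between the three models, which is the point of writing the objective in this general form.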

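Our goal is to minimize $Obj(\Theta)$, which means keeping both $L(\Theta)$ and $\Omega(\Theta)$ small; training a model is a search for the balance point between bias and variance. Bias is controlled by $L(\Theta)$ and variance by $\Omega(\Theta)$. An underfit model has a large $L(\Theta)$ because it is too simple to match the training set, while an overfit model has a small $L(\Theta)$ but a large $\Omega(\Theta)$: it generalizes poorly, or in other words is unstable.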

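Regression trees and ensembles (What are we Learning)

Regression Tree (CART)

A regression tree, also known as a classification and regression tree, is as I see it a binary decision tree whose leaves carry weights. It has two properties:

The decision rules are the same as in a decision tree.
Every leaf holds a weight, which some people call a score.

The figure below shows an example regression tree:

Ensemble of regression trees

Regression

The little boy falls into the leftmost leaf of the first tree and the leftmost leaf of the second tree, so his score is the sum of the weights of those two leaves; the others are scored the same way.

Trees have four advantages:

Widely used, for example in GBM and random forests. (PS: according to Tianqi Chen's statistics, more than half of the winning competition solutions use some variant of tree ensembles.)
Insensitive to the scale of the inputs, so the inputs do not need to be normalized.
Able to learn higher-order interactions between features.
Easy to scale.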

Model and parameters

Suppose we have $K$ trees; then
$\hat y_i = \sum_{k = 1}^K f_k(x_i),\ \ f_k \in \mathcal{F}$
Here $\mathcal{F}$ is the function space containing all regression trees, and $f_k(x_i)$ is the weight of the leaf that the $i$-th sample falls into in the $k$-th tree. The parameters we need to learn are the structure of each tree and the weight of each leaf, or simply the functions $f_k$. To match the general framework of the previous section we can set
$\Theta = \lbrace f_1,f_2,f_3, \cdots ,f_K \rbrace$

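Learning a tree on a single variable

Define an objective and optimize it.
For example:
- Consider a regression tree whose only input is time (t)
- I want to predict whether I like romantic music at time t

The split points of the piecewise function are the internal nodes of the regression tree, and the height of each piece is the weight of a leaf. This makes it easy to see which tree structures correspond to underfit and overfit curves. Following the previous section, $\Omega(f)$ is the model complexity, which here corresponds to how fragmented the piecewise function is, and $L(f)$ measures how well the curve matches the training set.

Learning a step function

Second panel: too many split points, so $\Omega(f)$, the model complexity, is high. Third panel: wrong split points, so $L(f)$, the loss, is high. Fourth panel: a good balance between model complexity and loss.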

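To summarize

Model: suppose we have $K$ trees; the model is $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),\ f_k\in \mathcal{F}$

Objective: $Obj =\underbrace{\sum_{i=1}^{n}l(y_i, \hat{y}_i)}_{\text{training loss}} +\underbrace{\sum_{k=1}^{K}\Omega(f_k)}_{\text{complexity of the trees}}$

Possible ways to define the complexity of a tree

Number of nodes in the tree, or its depth
L2 norm of the leaf weights
… (more details later)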

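Objective vs heuristic

When people talk about decision trees, the discussion is usually heuristic:

Split by information gain
Prune the tree
Maximum depth
Smooth the leaf values

Most heuristics map well onto the objective:

Information gain -> training loss
Pruning -> regularization defined by the number of nodes
Max depth -> constraint on the function space
Smoothing leaf values -> L2 regularization on leaf weights

Regression trees are not just for regression

A regression tree ensemble defines how you produce the prediction score; it can be used for
- classification, regression, ranking …
- …
What it does depends on how you define the objective function.
So far we have seen
- Square loss $l(y_i, \hat{y}_i)=(y_i-\hat{y}_i)^2$, which gives the common gradient boosted machine
- Logistic loss $l(y_i, \hat{y}_i)=y_i\ln(1+e^{-\hat{y}_i}) + (1-y_i)\ln(1+e^{\hat{y}_i})$, which gives LogitBoost.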

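Gradient Boosting (How do we Learn)

So how do we learn?

Objective: $Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),\ f_k \in \mathcal{F}$
We cannot use methods such as SGD (stochastic gradient descent) to find $f$, because the $f_k$ are trees rather than plain numerical vectors.
Solution: Additive Training (boosting)
- Start from a constant prediction and add one new function in each round

The idea is simple: add trees one at a time until there are $K$ of them. The process can be written as:
$\begin{aligned} \hat y_i^{(0)} &= 0 \\ \hat y_i^{(1)} &= f_1(x_i) = \hat y_i^{(0)} + f_1(x_i) \\ \hat y_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat y_i^{(1)} + f_2(x_i) \\ &\cdots \\ \hat y_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat y_i^{(t-1)} + f_t(x_i) \end{aligned}$

Additive training

How do we decide which $f$ to add to the model?
- Optimize the objective
The prediction at round $t$ is $\hat y_i^{(t)} = \hat y_i^{(t - 1)} + f_t(x_i)$; the last term is what we need to decide in round $t$.

$\begin{aligned} Obj^{(t)} &= \sum_{i=1}^{n} l(y_i, \hat y_i^{(t)}) + \sum_{k=1}^{t} \Omega(f_k) \\ &= \sum_{i=1}^{n} l\left(y_i, \hat y_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + constant \end{aligned}$
Consider the square loss:
$\begin{aligned} Obj^{(t)} &= \sum_{i=1}^{n} \left(y_i - (\hat y_i^{(t-1)} + f_t(x_i))\right)^2 + \Omega(f_t) + constant \\ &= \sum_{i=1}^{n} \left[2(\hat y_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2\right] + \Omega(f_t) + constant \end{aligned}$
$(\hat{y}^{(t-1)}_i-y_i)$ is usually called the residual from the previous round.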

Taylor expansion of the loss function

Taylor's formula gives
$f(x + \Delta x) \approx f(x) +f^{\prime}(x) \Delta x + \frac 1 2 f^{\prime \prime}(x) \Delta x^2$
Treat $l(y_i, \hat y_i^{(t)})$ as $f(x+\Delta x)$ above, with $x = \hat y_i^{(t-1)}$, so that $l(y_i, \hat y_i^{(t-1)})$ is $f(x)$ and $f_t(x_i)$ is $\Delta x$. Let $g_i$ stand for $f'(x)$, that is $g_i = {\partial}_{\hat y_i^{(t - 1)}} \ l(y_i, \hat y_i^{(t - 1)})$, and let $h_i$ stand for $f''(x)$, that is $h_i = {\partial}_{\hat y_i^{(t - 1)}}^2 \ l(y_i, \hat y_i^{(t - 1)})$. The objective then becomes:
$Obj^{(t)} \approx \sum_{i = 1}^n \left[l(y_i, \hat y_i^{(t - 1)}) + g_i f_t(x_i) + \frac 1 2 h_i f_t^2 (x_i)\right] + \Omega (f_t) + constant$
You can check that the Taylor expansion of the square loss agrees with the earlier result. Clearly the terms $[\sum_{i = 1}^n l(y_i, \hat y_i^{(t - 1)}) + constant]$ do not affect where the optimum of this objective lies, so we get the new optimization target
$Obj^{(t)} \approx \sum_{i = 1}^n [g_i f_t(x_i) + \frac 1 2 h_i f_t^2 (x_i)] + \Omega (f_t)$
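As a sanity check on $g_i$ and $h_i$, the following sketch computes them by hand for the square loss and the logistic loss and evaluates the second-order approximation. The function names are illustrative assumptions, not part of any library; $\hat y$ is the raw score before the sigmoid.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def square_loss(y, y_hat):
    return (y - y_hat) ** 2


def logistic_loss(y, y_hat):
    return y * math.log(1 + math.exp(-y_hat)) + (1 - y) * math.log(1 + math.exp(y_hat))


def grad_hess_square(y, y_hat):
    # l = (y - y_hat)^2  ->  g = 2 (y_hat - y),  h = 2
    return 2.0 * (y_hat - y), 2.0


def grad_hess_logistic(y, y_hat):
    # with p = sigmoid(y_hat):  g = p - y,  h = p (1 - p)
    p = sigmoid(y_hat)
    return p - y, p * (1.0 - p)


def taylor_approx(loss, grad_hess, y, y_prev, step):
    """l(y, y_prev + step) ~= l(y, y_prev) + g * step + 0.5 * h * step^2."""
    g, h = grad_hess(y, y_prev)
    return loss(y, y_prev) + g * step + 0.5 * h * step ** 2


# The square loss is itself quadratic, so its second-order expansion is exact
exact = square_loss(1.0, 0.3 + 0.2)
approx = taylor_approx(square_loss, grad_hess_square, 1.0, 0.3, 0.2)
```

For the logistic loss the expansion is only an approximation, but it is accurate for the small steps that a single tree contributes.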

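Why go to all this trouble?

Refine the definition of tree

The previous section discussed the intuitive meaning of $f_t(x)$; now we formalize it. Let $w \in R^T$ be the vector of leaf weights and $q:R^d \rightarrow \lbrace 1,2, \cdots ,T \rbrace$ the structure of the tree, so that $q(x)$ is the index of the leaf that sample $x$ falls into and $f_t(x) = w_{q(x)}$. The figure below illustrates this.

The training-loss part is now fully defined. How should we define the complexity of the model?

Define Complexity of a Tree

Tree depth? Minimum leaf weight? Number of leaves? Smoothness of the leaf weights? Many choices can describe the complexity of the model. For convenience we use the number of leaves and the smoothness of the leaf weights, which gives:
$\Omega(f_t) = \gamma T + \frac 1 2 \lambda \sum_{j = 1}^T w_j^2$
Note: the first term is the number of leaves multiplied by a penalty coefficient $\gamma$; the second term uses the squared L2 norm of the leaf weights to measure their smoothness.

The figure below shows an example of computing the complexity:

Revisit the Objectives

One more definition: let $I_j$ be the set of samples in leaf $j$, that is, the $j$-th circle in the figure above.
$I_j = \lbrace i|q(x_i) = j \rbrace$
Finally, regroup the objective by leaf and drop the constant terms:
$\begin{aligned} Obj^{(t)} &\approx \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac 1 2 h_i f_t^2(x_i)\right] + \Omega(f_t) \\ &= \sum_{i=1}^{n} \left[g_i w_{q(x_i)} + \frac 1 2 h_i w_{q(x_i)}^2\right] + \gamma T + \frac 1 2 \lambda \sum_{j=1}^{T} w_j^2 \\ &= \sum_{j=1}^{T} \left[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac 1 2 \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\right] + \gamma T \end{aligned}$

This is a sum of $T$ independent quadratic functions.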

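The Structure Score

The minimum of a single-variable quadratic function, as learned in middle school, applies directly here:
$\mathop{\arg\min}_x\{Gx+\frac{1}{2}Hx^2\} = -\frac{G}{H}, \quad \min_x\{Gx+ \frac 1 2 Hx^2\} = - \frac 1 2 \frac {G^2} H, \quad H \gt 0$
Let $G_j = \sum_{i \in I_j } g_i,\ H_j = \sum_{i \in I_j}h_i$; then
$\begin{aligned} Obj^{(t)} &= \sum_{j=1}^{T} \left[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac 1 2 \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\right] + \gamma T \\ &= \sum_{j=1}^{T} \left[G_j w_j + \frac 1 2 (H_j + \lambda) w_j^2\right] + \gamma T \end{aligned}$
So if the structure of the tree, $q(x)$, is fixed, then
$\begin{aligned} w_j^* &= - \frac{G_j}{H_j + \lambda} \\ Obj &= - \frac 1 2 \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \end{aligned}$
Example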
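A small numeric sketch of the structure score, assuming a fixed tree with three leaves. The gradient values, the leaf assignment `leaf_of` (playing the role of $q(x_i)$, zero-based) and the function names are all hypothetical.

```python
def leaf_stats(g, h, leaf_of, num_leaves):
    """Sum g_i and h_i over the samples in each leaf: G_j and H_j."""
    G = [0.0] * num_leaves
    H = [0.0] * num_leaves
    for gi, hi, j in zip(g, h, leaf_of):
        G[j] += gi
        H[j] += hi
    return G, H


def optimal_weights(G, H, lam):
    """w_j^* = -G_j / (H_j + lambda)."""
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]


def structure_score(G, H, lam, gamma):
    """Obj = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T."""
    return -0.5 * sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * len(G)


def objective(w, G, H, lam, gamma):
    """sum_j [G_j w_j + 1/2 (H_j + lambda) w_j^2] + gamma * T for arbitrary weights."""
    return sum(Gj * wj + 0.5 * (Hj + lam) * wj * wj
               for Gj, Hj, wj in zip(G, H, w)) + gamma * len(G)


# Hypothetical statistics for five samples in three leaves
g = [0.8, -1.2, 0.4, -0.6, 1.0]
h = [2.0, 2.0, 2.0, 2.0, 2.0]
leaf_of = [0, 0, 1, 2, 2]

G, H = leaf_stats(g, h, leaf_of, num_leaves=3)
w_star = optimal_weights(G, H, lam=1.0)
score = structure_score(G, H, lam=1.0, gamma=0.1)
```

Plugging `w_star` back into `objective` reproduces `score`, and any other choice of weights gives a larger value, which is what makes the structure score a fair way to compare tree structures.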

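Searching Algorithm for Single Tree

Once the tree structure is known, we can compute the best score for that structure. But how do we determine the structure?

Enumerate the possible tree structures q
Compute the structure score of q with the scoring formula:

$Obj = -\frac{1}{2} \sum\limits_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$
Find the best tree structure and use the optimal leaf weights:

$w^*_j=-\frac{G_j}{H_j+\lambda}$
But there can be infinitely many possible tree structures.

Greedy Learning of the Tree

Start from a tree of depth 0
For each leaf of the tree, try to add a split. The change in the objective after adding the split is
$\begin{aligned} Obj_{split} &= -\frac 1 2 \left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda}\right] + \gamma T_{split} \\ Obj_{nosplit} &= -\frac 1 2 \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} + \gamma T_{nosplit} \\ Gain &= Obj_{nosplit} - Obj_{split} \\ &= \frac 1 2 \left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma (T_{split} - T_{nosplit}) \end{aligned}$
A split adds exactly one leaf, so $T_{split} - T_{nosplit} = 1$. The gain is often written without the constant factor $\frac 1 2$, which amounts to rescaling $\gamma$.
Remaining question: how do we find the best split?

Efficient Finding of the Best Split

What is the gain of a split with rule $x_j<a$? Say $x_j$ is age.
All we need are the sums of $g$ and $h$ on each side of the split shown in the figure above, from which we compute
$Gain = \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} - \gamma$
A left-to-right linear scan over the instances sorted by one feature is enough to decide the best split along that feature.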

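An Algorithm for Split Finding

For each node, enumerate over all features
- For each feature, sort the instances (samples) by feature value
- Use a linear scan to decide the best split along that feature
- Take the best split solution across all features
Time complexity of growing a tree of depth $K$
- $O(K\ d\ n\log n)$: each level needs $O(n\ \log n)$ time to sort, for each of the $d$ features, and this is done for $K$ levels. (Addendum: finding the best split on a sorted feature takes another $O(n)$, so the total is really $O(d\ K\ (n\log n +n))$.)
- This can be further optimized (for example with approximation or by caching the sorted features)
- Can scale to very large datasets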
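The sort-then-scan procedure above can be sketched in a few lines of pure Python. This is an illustrative sketch, not XGBoost's implementation; it uses the gain expression as written in the text (without a factor $\frac 1 2$), and the function names and toy data are assumptions.

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of a split, using the expression from the text."""
    def score(G, H):
        return G * G / (H + lam)
    return score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR) - gamma


def best_split_on_feature(values, g, h, lam=1.0, gamma=0.0):
    """Return (gain, threshold) of the best split 'x < threshold' on one feature.

    threshold is None when no split has positive gain.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # O(n log n) sort
    G_total, H_total = sum(g), sum(h)
    GL = HL = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):                            # O(n) linear scan
        i, nxt = order[pos], order[pos + 1]
        GL += g[i]
        HL += h[i]
        if values[i] == values[nxt]:
            continue  # cannot split between two equal values
        gain = split_gain(GL, HL, G_total - GL, H_total - HL, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (values[i] + values[nxt])
    return best_gain, best_threshold


def best_split(X, g, h, lam=1.0, gamma=0.0):
    """Enumerate all features; X is a list of rows. Returns (gain, feature, threshold)."""
    best = (0.0, None, None)
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        gain, threshold = best_split_on_feature(column, g, h, lam, gamma)
        if gain > best[0]:
            best = (gain, j, threshold)
    return best


# Toy data: feature 0 separates the positive and negative gradients, feature 1 does not
X = [[1, 7], [2, 3], [3, 9], [10, 4], [11, 8], [12, 2]]
g = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
h = [1.0] * 6
gain, feature, threshold = best_split(X, g, h)
```

Because only running sums of $g$ and $h$ are needed, each candidate threshold costs $O(1)$ once the feature is sorted.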

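What about categorical variables?

Some tree learning algorithms handle categorical variables and continuous variables separately
- xgboost can simply use the scoring formula derived above to score splits based on categorical variables
In fact it is not necessary to handle categorical variables separately
- We can encode a categorical variable as a numerical vector with one-hot encoding, allocating a vector whose length is the number of categories.
  $z_j=\begin{cases}1, & \text{if } x \text{ is in category } j \\ 0, & \text{otherwise}\end{cases}$
- With many categories this vector is sparse, and the xgboost learning algorithm is designed to handle sparse data well.
Addendum: splitting a node requires sorting by the value of a feature, so an unordered categorical variable has to be one-hot encoded. Otherwise, suppose a feature takes the three values 1, 2 and 3: the comparison only considers the splits {1} vs {2, 3} and {1, 2} vs {3}, or no split, and a split that separates {1, 3} from {2} is never considered. Separately, $Gain$ does not depend on the range of the feature values: it is computed only from the gradient statistics of the samples on each side of a split, so features do not need to be normalized.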
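A tiny sketch of the three-category example: after one-hot encoding, a threshold split on the column for category 2 separates {1, 3} from {2}, which no threshold on the raw value can do. The helper name `one_hot` is an assumption for illustration.

```python
def one_hot(category, categories):
    """z_j = 1 if the sample is in category j, otherwise 0."""
    return [1 if category == c else 0 for c in categories]


categories = [1, 2, 3]
encoded = {c: one_hot(c, categories) for c in categories}

# Split on the one-hot column of category 2 with the rule "z < 0.5"
column = categories.index(2)
left = [c for c in categories if encoded[c][column] < 0.5]
right = [c for c in categories if encoded[c][column] >= 0.5]
```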

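Pruning and Regularization

Recall the gain formula:
- $Gain=\underbrace{\frac{G^2_L}{H_L+\lambda} + \frac{G^2_R}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}}_{\text{training loss reduction}} - \underbrace{\gamma}_{\text{regularization}}$
- When the training loss reduction is smaller than the regularization, the gain of the split becomes negative.
- This is a trade-off between simplicity and predictiveness.
Pre-stopping
- Stop splitting if the best split has negative gain.
- But a split may benefit future splits.
Post-Pruning
- Grow a tree to maximum depth, then recursively prune all the leaf splits with negative gain.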

Recap: Boosted Tree Algorithm

Add a new tree in each round
At the beginning of each round, compute $g_i=\partial_{\hat{y}_i^{(t-1)}}l(y_i,\hat{y}_i^{(t-1)}),\ h_i=\partial^2_{\hat{y}_i^{(t-1)}}l(y_i, \hat{y}_i^{(t-1)})$
Use these statistics (the first- and second-order gradients collected at every candidate split) to greedily grow a tree $f_t(x)$:
$Obj = -\frac{1}{2}\sum\limits_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$
Add $f_t(x)$ to the model: $\hat{y}_i^{(t)}=\hat{y}_i^{(t-1)} + f_t(x_i)$
- Usually, instead we do $\hat{y}_i^{(t)}=\hat{y}_i^{(t-1)} + \epsilon f_t(x_i)$
- $\epsilon$ is called the step-size or shrinkage, usually set to around 0.1
- This means we do not do full optimization in each step and reserve chance for future rounds, which helps prevent overfitting.
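The whole loop can be sketched end to end with depth-1 trees (stumps) on a single feature and the square loss. This is a teaching sketch under those assumptions, not how XGBoost is implemented; `fit_stump`, `predict_stump` and `boost` are hypothetical names.

```python
def fit_stump(x, g, h, lam):
    """Greedily pick the best single split; returns (threshold, w_left, w_right)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    best_gain, best_stump = 0.0, None
    GL = HL = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += g[i]
        HL += h[i]
        if x[i] == x[nxt]:
            continue
        GR, HR = G - GL, H - HL
        gain = GL * GL / (HL + lam) + GR * GR / (HR + lam) - G * G / (H + lam)
        if gain > best_gain:
            best_gain = gain
            # optimal leaf weights w* = -G / (H + lambda) on each side
            best_stump = (0.5 * (x[i] + x[nxt]), -GL / (HL + lam), -GR / (HR + lam))
    if best_stump is None:
        w = -G / (H + lam)          # no useful split: a single leaf
        return (float("inf"), w, w)
    return best_stump


def predict_stump(stump, xi):
    threshold, w_left, w_right = stump
    return w_left if xi < threshold else w_right


def boost(x, y, rounds=50, eta=0.1, lam=1.0):
    """Additive training with shrinkage eta, square loss l = (y - y_hat)^2."""
    y_hat = [0.0] * len(x)
    trees = []
    for _ in range(rounds):
        g = [2.0 * (p - t) for p, t in zip(y_hat, y)]   # first-order gradients
        h = [2.0] * len(x)                              # second-order gradients
        stump = fit_stump(x, g, h, lam)
        trees.append(stump)
        y_hat = [p + eta * predict_stump(stump, xi) for p, xi in zip(y_hat, x)]
    return trees, y_hat


x = [1, 2, 3, 10, 11, 12]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
trees, y_hat = boost(x, y)
mse = sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)
```

With `eta=0.1` each tree only closes part of the remaining gap, so the error shrinks gradually over the rounds instead of being fitted in one step.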

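--------------------------------------------------------------End of the slides----------------------------------------------------------------------

XGBoost system design

This part mainly draws on Tianqi Chen's paper XGBoost: A Scalable Tree Boosting System

Shrinkage and column subsampling

Both are used to prevent overfitting; column subsampling has the same usage and purpose as in random forests. See Section 2.3 of the paper.

Like modern GBDT implementations, xgboost has a shrinkage parameter (the original GBDT did not). It plays a role similar to the learning rate in gradient descent: after each step of tree boosting, the newly added weights are scaled by a factor $\eta$, which reduces the influence of each individual tree and leaves room for later trees to improve the model.
Column (feature) subsampling is commonly used in random forests. According to feedback from XGBoost users, column subsampling prevents overfitting even better than traditional row subsampling (xgboost also provides a row-subsampling parameter), and it also helps the parallel algorithm described later.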

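Approximate Algorithm for split finding

See Section 3.2 of the paper

When the data is too large to fit entirely in memory, the exact greedy algorithm cannot be run efficiently; the same problem arises in the distributed setting. To support effective gradient boosting in both settings, an approximate algorithm is used; its pseudocode is shown in the figure below.

In short: for each feature $k$, use the percentiles of the feature's distribution to propose $l$ candidate split points $S_k = \{s_{k1},s_{k2},\cdots s_{kl}\}$, map the samples into the buckets delimited by those candidates, and accumulate $G$ and $H$ for each bucket. Then search greedily over the candidate set only, in the same way as the Exact Greedy Algorithm.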
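The bucket idea can be sketched as follows. For simplicity the candidates here are plain (unweighted) quantiles of one feature, which is an assumption of this sketch; the function names are hypothetical as well.

```python
import bisect


def propose_candidates(values, num_buckets):
    """Pick candidate thresholds at evenly spaced (unweighted) quantiles."""
    s = sorted(values)
    n = len(s)
    candidates = []
    for q in range(1, num_buckets):
        c = s[(q * n) // num_buckets]
        if not candidates or c > candidates[-1]:
            candidates.append(c)
    return candidates


def bucket_stats(values, g, h, candidates):
    """Accumulate G and H per bucket; bucket b holds candidates[b-1] <= v < candidates[b]."""
    G = [0.0] * (len(candidates) + 1)
    H = [0.0] * (len(candidates) + 1)
    for v, gi, hi in zip(values, g, h):
        b = bisect.bisect_right(candidates, v)
        G[b] += gi
        H[b] += hi
    return G, H


def best_candidate(values, g, h, candidates, lam=1.0):
    """Scan the candidates only; returns (gain, threshold) for the rule 'x < threshold'."""
    G, H = bucket_stats(values, g, h, candidates)
    G_total, H_total = sum(G), sum(H)
    GL = HL = 0.0
    best = (0.0, None)
    for b, c in enumerate(candidates):
        GL += G[b]
        HL += H[b]
        GR, HR = G_total - GL, H_total - HL
        gain = (GL * GL / (HL + lam) + GR * GR / (HR + lam)
                - G_total * G_total / (H_total + lam))
        if gain > best[0]:
            best = (gain, c)
    return best


values = [1, 2, 3, 10, 11, 12]
g = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
h = [1.0] * 6
coarse = propose_candidates(values, num_buckets=2)
result = best_candidate(values, g, h, coarse)
```

The scan now costs one step per candidate rather than one per sample, at the price of only seeing the thresholds that were proposed.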

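Understanding the quantiles of a feature distribution

This figure is from 《 GBDT算法原理與系統設計簡介》 by weapon on Zhihu

The paper gives two variants of the approximate algorithm, which differ in when the candidates are proposed:

Global: bucket the data once before building the tree. This needs a smaller ϵ and therefore more buckets, so each split search runs over more candidates and is slower, but the proposal is done only once and is reused for every split.
Local: re-propose after every split. This can use a larger ϵ and fewer buckets, so each split search runs over fewer candidates and is faster, but the proposal has to be recomputed after every node split. The paper notes that this variant is better suited to deep trees.

On the Higgs example in the paper, the global proposal with ϵ=0.05 performs about as well as the exact algorithm, the local proposal with ϵ=0.3 is only slightly worse than the exact algorithm, and the global proposal with ϵ=0.3 performs very poorly, as shown below:

So at the same ϵ the local proposal is clearly better than the global one, while the global proposal can match it when given enough candidates; both can come close to the exact greedy algorithm.

Last but important: xgboost users are free to choose which variant to use.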

Notably, it is also possible to directly construct approximate histograms of gradient statistics. Our system efficiently supports exact greedy for the single machine setting, as well as approximate algorithm with both local and global proposal methods for all settings. Users can freely choose between the methods according to their needs.

這裏直方圖算法，常用於GPU的內存優化算法，leetcode上也有人總結出來：LeetCode Largest Rectangle in Histogram O(n) 解法詳析， Maximal Rectangle

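Weighted Quantile Sketch

See Section 3.3 of the paper

An important step in the approximate split-finding algorithm is proposing candidate split points. Usually percentiles of a feature are used so that the candidates are spread evenly over the data, as in the feature-quantile example above.

Let $\mathcal{D}_k = \lbrace (x_{1k}, h_1), (x_{2k}, h_2), \cdots, (x_{nk}, h_n) \rbrace$ be the set of the $k$-th feature value and the second-order gradient of every sample. We can then define the rank function $r_k:\Bbb R \rightarrow [0,1]$
$r_k(z) = \frac{1}{\sum_{(x,h) \in \mathcal{D}_k}h} \sum_{(x,h) \in \mathcal{D}_k,\ x \lt z} h$
which is the proportion of samples whose feature value is smaller than $z$, with each sample weighted by $h$. The goal is to find candidate split points $\lbrace s_{k1},s_{k2},s_{k3}, \cdots, s_{kl} \rbrace$ such that
$\lvert r_k(s_{k,j}) - r_k(s_{k, j+1}) \rvert \lt \epsilon,\ s_{k1}=\underset{i}{\min}\ x_{ik},\ s_{kl}=\underset{i}{\max}\ x_{ik}$
Here $\epsilon$ bounds the rank difference between neighbouring candidates, so $1/\epsilon$ is roughly the number of buckets. For example $\epsilon=1/3$ gives tertiles, that is $3-1$ interior candidate points. When the values are spread uniformly with equal weights, this amounts to starting from the minimum and stepping by $\epsilon \cdot (\underset{i}{\max}\ x_{ik} - \underset{i}{\min}\ x_{ik})$ each time. The candidate with the highest score is then chosen as the split point. Each data point is weighted by $h_i$ for the following reason:
$\begin{aligned} Obj^{(t)} &\approx \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac 1 2 h_i f_t^2(x_i)\right] + \Omega(f_t) \\ &= \sum_{i=1}^{n} \frac 1 2 h_i \left[f_t(x_i) - \left(-\frac{g_i}{h_i}\right)\right]^2 + \Omega(f_t) - \sum_{i=1}^{n} \frac{g_i^2}{2h_i} \end{aligned}$
Note: the derivation in the paper contains a slip at this point, and an answer on Stack Exchange clarifies it. The expression above is exactly a weighted squared loss with labels $-\frac{g_i}{h_i}$ and weights $h_i$, where $\frac{g_i^2}{2h_i}$ is a constant (it comes from the first- and second-order gradients of the previous round).

This explains where the Weighted Quantile Sketch comes from. Here is an example:

The data is to be cut into 3 parts and the total weight is 1.8, so the first cut is at 0.6 and the second at 1.2. This figure is from 《 GBDT算法原理與系統設計簡介》 by weapon on Zhihu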
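The example (three parts, total weight 1.8, cuts at 0.6 and 1.2) can be reproduced with a short exact computation. The feature values and the individual $h_i$ below are hypothetical, chosen only so that the weights sum to 1.8; the function names are assumptions, and the sketch computes exact weighted quantiles on distinct values rather than the streaming sketch structure from the paper. Only interior candidates are returned; the minimum and maximum would be added as $s_{k1}$ and $s_{kl}$.

```python
from fractions import Fraction


def weighted_rank(pairs, z):
    """r_k(z): share of total weight h carried by samples with x < z."""
    total = sum(h for _, h in pairs)
    return sum(h for x, h in pairs if x < z) / total


def weighted_candidates(pairs, eps):
    """Emit a candidate each time the weighted rank reaches the next multiple of eps.

    Assumes distinct feature values.
    """
    pairs = sorted(pairs)
    total = sum(h for _, h in pairs)
    candidates = []
    target = eps
    below = 0                      # total weight of samples strictly below x
    for x, h in pairs:
        rank = below / total
        if rank >= target:
            candidates.append(x)
            while target <= rank:
                target += eps
        below += h
    return candidates


# Hypothetical (x, h) pairs; exact fractions avoid floating-point drift
tenths = [1, 1, 1, 1, 1, 1, 4, 2, 6]
pairs = [(x, Fraction(t, 10)) for x, t in zip(range(1, 10), tenths)]

total_weight = sum(h for _, h in pairs)           # 9/5, i.e. 1.8
cuts = weighted_candidates(pairs, Fraction(1, 3))
```

Note how the cuts follow the weight rather than the sample count: six light samples fill the first third, while the last third is a single heavy sample.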

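Sparsity-aware Split Finding

See Section 3.4 of the paper

For the feature sparsity caused by missing data, one-hot encoding and so on, the paper proposes a split-finding algorithm that handles sparse features, mainly by learning a default split direction for the samples whose feature value is missing:

Missing values default to the right child: accumulate $G_L$ and $H_L$ over the non-missing samples that go to the left child; the right child gets $G-G_L$ and $H-H_L$, which includes the missing samples.
Missing values default to the left child: accumulate $G_R$ and $H_R$ over the non-missing samples that go to the right child; the left child gets $G-G_R$ and $H-H_R$, which includes the missing samples.

This yields the split value with the largest gain together with the default direction for missing values. The sparsity-aware algorithm from the paper: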
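A minimal sketch of learning the default direction, assuming missing values are marked with `None`. It evaluates both directions in a single pass over the non-missing samples, whereas the algorithm in the paper uses two scans; the function name and data are hypothetical.

```python
def best_split_with_missing(values, g, h, lam=1.0):
    """Return (gain, threshold, default_direction) for one feature.

    Only the non-missing entries are sorted and scanned; the statistics of the
    missing entries are obtained by subtraction from the totals G and H.
    """
    def score(G, H):
        return G * G / (H + lam)

    G, H = sum(g), sum(h)                       # totals include the missing rows
    present = sorted(
        ((v, gi, hi) for v, gi, hi in zip(values, g, h) if v is not None),
        key=lambda t: t[0],
    )
    G_present = sum(t[1] for t in present)
    H_present = sum(t[2] for t in present)
    parent = score(G, H)
    best = (0.0, None, None)
    GL = HL = 0.0                               # non-missing stats left of the threshold
    for pos in range(len(present) - 1):
        v, gi, hi = present[pos]
        GL += gi
        HL += hi
        nxt = present[pos + 1][0]
        if v == nxt:
            continue
        threshold = 0.5 * (v + nxt)
        # Missing values go right: the left child holds non-missing rows only
        gain_right = score(GL, HL) + score(G - GL, H - HL) - parent
        # Missing values go left: the right child holds non-missing rows only
        GR, HR = G_present - GL, H_present - HL
        gain_left = score(G - GR, H - HR) + score(GR, HR) - parent
        for gain, direction in ((gain_right, "right"), (gain_left, "left")):
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best


# The two rows with a missing value have positive gradients, like the small values
values = [1, 2, None, 10, 11, None]
g = [1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
h = [1.0] * 6
gain, threshold, direction = best_split_with_missing(values, g, h)
```

Here the missing rows behave like the rows with small feature values, so sending them left gives the larger gain.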

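With this method each feature is scanned twice instead of once, but only the non-missing entries are visited, so the cost is linear in the number of non-missing entries. On the Allstate-10K dataset it is about 50 times faster than the naive version: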

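Column Block for Parallel Learning

See Section 4.1 of the paper

**CSR vs CSC**

Two of the most common compressed storage formats for sparse matrices are Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC).

CSR consists of the non-zero data block values, the row offsets offsets and the column indices indices. The offsets array has (number of rows + 1) entries. CSR is a compressed form of the dense matrix, and because the position of an element is no longer implied by its place in memory, accessing a single element (i,j) is less direct than in a dense matrix. The lookup works as follows:

    1. From row i, get the start of the range `offsets[i]` and its end `offsets[i+1]-1`, which gives the data of row i, `values[offsets[i]..(offsets[i+1]-1)]`, and its non-zero column indices `indices[offsets[i]..(offsets[i+1]-1)]`
    2. Binary-search for j in that block of column indices; if it is not found return 0, otherwise with k the position found return values[offsets[i]+k]

For a single element, the access cost thus goes from $O(1)$ to $O(\log N)$, where N is the number of non-zero entries in that row. Traversing all non-zero entries of a row, however, needs no lookups in the indices array and is cheaper than in the dense layout, because the many zero entries are never visited.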
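The two-step lookup can be sketched in pure Python as follows. The array names follow the text (`values`, `indices`, `offsets`); the helper functions are hypothetical, and a library such as SciPy would normally be used instead.

```python
from bisect import bisect_left


def dense_to_csr(dense):
    """Build the three CSR arrays from a list of rows."""
    values, indices, offsets = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                indices.append(j)
        offsets.append(len(values))      # one entry per row, plus the leading 0
    return values, indices, offsets


def csr_get(values, indices, offsets, i, j):
    """Element (i, j): binary search for j among the column indices of row i."""
    start, end = offsets[i], offsets[i + 1]
    k = bisect_left(indices, j, start, end)
    if k < end and indices[k] == j:
        return values[k]
    return 0


def csr_row(values, offsets, i):
    """All non-zero values of row i, without touching the zeros."""
    return values[offsets[i]:offsets[i + 1]]


dense = [
    [0, 5, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 0, 0],
]
values, indices, offsets = dense_to_csr(dense)
```

Swapping the roles of rows and columns in `dense_to_csr` gives the CSC layout, where each slice of `values` holds one feature.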

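CSC has the same layout as CSR; only the meaning of the arrays differs: values is still the non-zero data block, offsets holds the column offsets (one per feature id), and indices holds the row indices (the sample ids). XGBoost uses CSC mainly for the global pre-sorting of features: the CSR data is first converted to unsorted CSC, then each feature i is sorted: sort(&values[offsets[i]], &values[offsets[i+1]]), with the row indices permuted along with the values. After this global sort, later node splits can reuse the ordering instead of sorting again.

For matrix storage formats, see: 稀疏矩陣存儲格式總結+存儲效率對比:COO,CSR,DIA,ELL,HYB

Benefits of this storage structure

To be continued.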

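Cache-aware Access

One drawback of the block structure is that gradient statistics are fetched by index, in the order of the sorted feature values. This leads to non-contiguous memory access, which can lower the CPU cache hit rate and slow the algorithm down.

For the exact greedy algorithm the remedy is cache-aware prefetching: each thread is given a contiguous buffer, the gradient statistics are read into that buffer (turning non-contiguous access into contiguous access), and the statistics are then accumulated from it. This helps especially when the number of training samples is large; see the figure below:

For the approximate algorithm the remedy is choosing the block size properly, where the block size is defined as the maximum number of samples in a block. Too large a block lowers the cache hit rate, while too small a block makes parallelization inefficient. Experiments show that $2^{16}$ works well, in which case an element of the CSC indices array (the row indices) mentioned above fits in 16/8 = 2 bytes.

Blocks for Out-of-core Computation

XGBoost's out-of-core computation addresses the problem that reading data from disk takes too long and limits throughput:

Block Compression: the data is compressed block by block with multiple threads, transferred to memory, and then decompressed by an independent thread. The core idea is to trade disk-reading cost for the computation spent on decompression.
Block Sharding, a common design in distributed database systems: the data is sharded onto multiple disks, and a prefetch thread is assigned to each disk to fetch the data into an in-memory buffer. The training thread alternates between the buffers while computation keeps running, which raises the overall disk throughput.

Note: this material belongs to external memory algorithms (External_memory_algorithm)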

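What XGBoost does cleverly in implementing GBDT

This part mainly draws on a Zhihu Q&A, 機器學習算法中 GBDT 和 XGBOOST 的區別有哪些？ - 知乎, combining its summaries with my own reading of the paper.

Traditional GBDT uses CART as the base learner, while xgboost supports several kinds of base learner. With a linear base learner, for example, xgboost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization.

The base learner is set via booster [default=gbtree]; see the official documentation for details
- gbtree: tree-based models
- gblinear: linear models
- DART: Dropouts meet Multiple Additive Regression Trees. Dropout is also widely used in deep learning. Note that in both deep learning and classical machine learning, a model trained with dropout must have dropout disabled at prediction time.
Traditional GBDT uses only first-order derivative information in optimization, whereas xgboost takes a second-order Taylor expansion of the loss function and uses both first- and second-order derivatives, which approximates the loss more accurately. Incidentally, xgboost supports custom loss functions as long as they have first and second derivatives; see the official API documentation.
xgboost adds a regularization term to the objective to control the complexity of the model. The term includes the number of leaves and the sum of squared L2 norms of the scores output by the leaves, which helps prevent overfitting; this is another advantage of xgboost over traditional GBDT. Of the two parts of the regularization, pruning by leaf count exists in other implementations too, while the L2 smoothing of the leaf outputs is new.
Built-in Cross-Validation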

XGBoost allows user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
This is unlike GBM where we have to run a grid-search and only a limited values can be tested.

Continue on Existing Model: a saved model can be trained further later

User can start training an XGBoost model from its last iteration of previous run. This can be of significant advantage in certain specific applications.
GBM implementation of sklearn also has this feature so they are even on this point.

High Flexibility: the loss function can be customized as long as it is twice differentiable

XGBoost allow users to define custom optimization objectives and evaluation criteria.
This adds a whole new dimension to the model and there is no limit to what we can do.

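xgboost supports parallelism. Note that, unlike random forests, the unit of parallelism is not the tree: like other boosting methods (such as GBDT), xgboost can only start an iteration after the previous one has finished (the objective of iteration t contains the predictions of the previous t-1 iterations). xgboost parallelizes at the feature level. The most time-consuming step in learning a decision tree is sorting the feature values (to determine the best split point). xgboost sorts the data once before training and stores it in a block structure that is reused in every later iteration, which greatly reduces computation. The block structure also makes parallelism possible: when splitting a node, the gain of every feature has to be computed and the feature with the largest gain is chosen, so the gain computations for the different features can run in multiple threads.

Overall there is a lot to learn here, especially the distributed concurrency optimizations and resource scheduling. This goes beyond the mathematical model into system design and runtime performance optimization. My own understanding of this part is still shallow; going further requires reading other papers and the XGBoost source code. Some of the optimizations are not original to the author but were learned from the papers cited in the appendix and integrated into XGBoost, which truly brings together a great body of work; the author is, after all, a graduate of the ACM class at Shanghai Jiao Tong University. An interview with him: https://cosx.org/2015/06/interview-of-tianqi/

The optimization perspective

Ma Lin's (馬琳) answer is excellent; it really showed me how the same mountain looks like a ridge from one side and a peak from another.

Easy-to-use xgboost

xgboost has matured steadily and is now very easy to use. The figure below is from the official website

hello world

From the official website; for more complex demos, see the demo directory on GitHub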

Python

import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
preds = bst.predict(dtest)

Output when run in a Jupyter notebook

import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')

[18:22:42] 6513x127 matrix with 143286 entries loaded from demo/data/agaricus.txt.train
[18:22:42] 1611x127 matrix with 35442 entries loaded from demo/data/agaricus.txt.test

# specify parameters via map
param = {'max_depth':3, 'eta':1, 'silent': 0, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)

[18:22:42] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[18:22:42] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3

# make prediction
preds = bst.predict(dtest)
print(preds)
print(bst.eval(dtest))

[0.10828121 0.85500014 0.10828121 ... 0.95467216 0.04156424 0.95467216]
[0]	eval-error:0.000000

param = {'booster': 'dart',
         'max_depth': 4, 
         'eta': 0.001,
         'objective': 'binary:logistic', 
         'silent': 0,
         'sample_type': 'uniform',
         'normalize_type': 'tree',
         'rate_drop': 0.5,
         'skip_drop': 0.0}
# Command Line Parameters: number of boosting rounds
num_round = 2
bst = xgb.train(param, dtrain, num_round)

[18:22:42] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 0 pruned nodes, max_depth=4
[18:22:42] C:\Users\Administrator\Desktop\xgboost\src\gbm\gbtree.cc:494: drop 0 trees, weight = 1
[18:22:42] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 0 pruned nodes, max_depth=4
[18:22:42] C:\Users\Administrator\Desktop\xgboost\src\gbm\gbtree.cc:494: drop 1 trees, weight = 0.999001

# make prediction
preds = bst.predict(dtest, ntree_limit=num_round)
print(preds)
print(bst.eval(dtest))

[0.4990105 0.5009742 0.4990105 ... 0.5009742 0.4990054 0.5009742]
[0]	eval-error:0.007449

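Parameters in detail

From the official website. Understanding the parameters requires understanding the formulas and theory above; this part is mainly a translation of the official documentation.

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and task parameters. Command line parameters form a fourth group that only applies to the CLI version.

General parameters: relate to which booster we are using to do boosting, commonly a tree or a linear model.
Booster parameters: depend on which booster you have chosen.
Learning task parameters: decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.
Command line parameters: relate to the behavior of the command line interface (CLI) version of XGBoost.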

Note

Parameters in R package

In R-package, you can use . (dot) to replace underscore in the parameters, for example, you can use max.depth to indicate max_depth. The underscore parameters are also valid in R.

General parameters

booster [default=gbtree] selects the base booster

Which booster to use. Can be gbtree, gblinear or dart; gbtree and dart use tree based models while gblinear uses linear functions.
silent [default=0]: set to 1 to suppress running messages; 0 is recommended.
nthread [default to maximum number of threads available if not set]: number of threads
disable_default_eval_metric [default=0]
Flag to disable default metric. Set to >0 to disable.
num_pbuffer [set automatically by XGBoost, no need to be set by user]
Size of prediction buffer, normally set to number of training instances. The buffers are used to save the prediction results of last boosting step.
num_feature [set automatically by XGBoost, no need to be set by user]
Feature dimension used in boosting, set to maximum dimension of the feature

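Booster parameters

Parameters for Tree Booster

eta [default=0.3], range $[0, 1]$

The shrinkage parameter: leaf weights are multiplied by this factor when a tree is added, to avoid overly large steps. The larger the value, the more likely training fails to converge. Setting the learning rate eta smaller makes the later learning more careful.
gamma [default=0, alias: min_split_loss], range $[0, \infty]$

Same as min_split_loss (an alias is another name for the same thing, just as the Linux alias command gives a command a new name that still runs the same program). It is the minimum loss reduction required to split a leaf further, and so controls pruning: the larger gamma is, the more conservative the algorithm.
max_depth [default=6], range $[0, \infty]$

Maximum depth of each tree; the deeper the tree, the easier it is to overfit.
min_child_weight [default=1], range: $[0, \infty]$

The default is 1. It is the minimum sum of the second derivative (hessian) of the loss required in each leaf. For 0-1 classification with imbalanced classes, suppose h is around 0.01; then min_child_weight = 1 means a leaf must contain at least about 100 samples. This parameter strongly affects the result: the smaller it is, the easier it is to overfit.
max_delta_step [default=0], range: $[0, \infty]$

Acts in the update step: 0 means no constraint, and a positive value makes the update step more conservative by capping how large an update can be, so that updates are smoother.
subsample [default=1], range: $[0, 1]$

Subsample ratio of the training instances. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small cause underfitting. Setting it to 0.5 means that 50% of the training data is randomly sampled before growing each tree.
colsample_bytree [default=1], range: $[0, 1]$

Subsample ratio of the features when constructing each tree (the documentation says columns because samples are stored in rows, so each column is a feature).
colsample_bylevel is the subsample ratio of features for each split, in each level.
lambda [default=1, alias: reg_lambda]

L2 regularization term on weights; the larger the value, the less prone the model is to overfitting.
alpha [default=0, alias: reg_alpha]

L1 regularization term on weights; the larger the value, the less prone the model is to overfitting.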
tree_method string [default=auto]
- The tree construction algorithm used in XGBoost. See description in the reference paper.
- Distributed and external memory version only support tree_method=approx.
- Choices: auto, exact, approx, hist, gpu_exact, gpu_hist
  - auto: Use heuristic to choose the fastest method.
    - For small to medium dataset, exact greedy (exact) will be used. This is the exact node-splitting search described earlier.
    - For very large dataset, approximate algorithm (approx) will be chosen.
    - Because old behavior is always use exact greedy in single machine, user will get a message when approximate algorithm is chosen to notify this choice.
  - exact: Exact greedy algorithm.
  - approx: Approximate greedy algorithm using quantile sketch and gradient histogram.
  - hist: Fast histogram optimized approximate greedy algorithm. It uses some performance improvements such as bins caching. (The bins are the buckets that feature values are divided into by the histogram algorithm; see Tianqi Chen's paper and its references.)
  - gpu_exact: GPU implementation of exact algorithm.
  - gpu_hist: GPU implementation of hist algorithm.

sketch_eps [default=0.03], range: (0, 1). Short for sketch epsilon, the $\epsilon$ parameter of the quantile sketch

- Only used for `tree_method=approx`.
- This roughly translates into `O(1 / sketch_eps)` number of bins. Compared to directly select number of bins, this comes with theoretical guarantee with sketch accuracy. See the Weighted Quantile Sketch in Section 3.3 of Tianqi Chen's paper.
- **Usually user does not have to tune this**. But consider setting to a lower number for more accurate enumeration of split candidates.

scale_pos_weight [default=1] scaling factor for the weight of positive labels
- Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). (Example: with 2000 instances labeled -1 and 8000 labeled +1, the value is 2000/8000 = 0.25, so during training each positive instance carries only 25% of the weight of a negative instance.)
  
  See Parameters Tuning for more discussion. Also, see Higgs Kaggle competition demo for examples: R, py1, py2, py3.
updater [default=grow_colmaker,prune] a comma separated string that defines the tree builders and pruners; they are modular, so naming them is enough.
- A comma separated string defining the sequence of tree updaters to run, providing a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, it could be also set explicitly by a user. The following updater plugins exist:
  - grow_colmaker: non-distributed column-based construction of trees. (xgboost has a non-distributed, single-machine strategy and a distributed one: the single-machine version searches split points with the exact greedy algorithm, while the distributed version uses the approximate histogram algorithm.)
  - distcol: distributed tree construction with column-based data splitting mode.
  - grow_histmaker: distributed tree construction with row-based data splitting based on global proposal of histogram counting.
  - grow_local_histmaker: based on local histogram counting (local meaning the current node rather than the whole tree).
  - grow_skmaker: uses the approximate sketching algorithm.
  - sync: synchronizes trees in all distributed nodes.
  - refresh: refreshes tree’s statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
  - prune: prunes the splits where loss < min_split_loss (or $\gamma$ ).
- In a distributed setting, the implicit updater sequence value would be adjusted to grow_histmaker,prune.
refresh_leaf [default=1]
- This is a parameter of the refresh updater plugin. When this flag is 1, tree leafs as well as tree nodes’ stats are updated. When it is 0, only node stats are updated.
process_type [default=default]
- A type of boosting process to run.
- Choices:default,update
  - default: The normal boosting process which creates new trees.
  - update: Starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updater plugins is run for that tree, and a modified tree is added to the new model. The new model would have either the same or smaller number of trees, depending on the number of boosting iteratons performed. Currently, the following built-in updater plugins could be meaningfully used with this process type: refresh, prune. With process_type=update, one cannot use updater plugins that create new trees.
grow_policy [default=depthwise] how the tree grows: by depth or by highest loss change
- Controls a way new nodes are added to the tree.
- Currently supported only if tree_method is set to hist.
- Choices:depthwise, lossguide
  - depthwise: split at nodes closest to the root.
  - lossguide: split at nodes with highest loss change.
max_leaves [default=0] maximum number of leaves; only relevant when `grow_policy=lossguide`
- Maximum number of nodes to be added. Only relevant when grow_policy=lossguide is set.
max_bin, [default=256] maximum number of bins
- Only used if tree_method is set to hist.
- Maximum number of discrete bins to bucket continuous features.
- Increasing this number improves the optimality of splits at the cost of higher computation time.
predictor , [default=cpu_predictor] selects the predictor algorithm
- The type of predictor algorithm to use. Provides the same results but allows the use of GPU or CPU.
  - cpu_predictor: Multicore CPU prediction algorithm.
  - gpu_predictor: Prediction using GPU. Default when tree_method is gpu_exact or gpu_hist.

Additional parameters for Dart Booster (`booster=dart`)

Note: when predicting on a test set, dropout must be turned off through the ntree_limit parameter

Using predict() with DART booster

If the booster object is DART type, predict() will perform dropouts, i.e. only some of the trees will be evaluated. This will produce incorrect results if data is not the training data. To obtain correct results on test sets, set ntree_limit to a nonzero value, e.g.
preds = bst.predict(dtest, ntree_limit=num_round)

sample_type [default=uniform] type of sampling algorithm
- Type of sampling algorithm.
  - uniform: dropped trees are selected uniformly. Every tree has the same chance of being dropped, that is, left out of the current training round.
  - weighted: dropped trees are selected in proportion to weight.
normalize_type [default=tree] type of normalization algorithm; this is where DART differs somewhat from dropout in deep learning.
- Type of normalization algorithm.
  - tree: new trees have the same weight of each of dropped trees.
    - Weight of new trees are 1 / (k + learning_rate).
    - Dropped trees are scaled by a factor of k / (k + learning_rate).
  - forest: new trees have the same weight of sum of dropped trees (forest).
    - Weight of new trees are 1 / (1 + learning_rate).
    - Dropped trees are scaled by a factor of 1 / (1 + learning_rate).
rate_drop [default=0.0], range: [0.0, 1.0] the dropout rate, with the same meaning as in deep learning
- Dropout rate (a fraction of previous trees to drop during the dropout).
one_drop [default=0] whether at least one tree is always dropped
- When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).
skip_drop [default=0.0], range: [0.0, 1.0] probability of skipping dropout in a boosting iteration
- Probability of skipping the dropout procedure during a boosting iteration.
  - If a dropout is skipped, new trees are added in the same manner as gbtree.
  - Note that non-zero skip_drop has higher priority than rate_drop or one_drop.
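The two normalize_type rules quoted above reduce to a couple of lines of arithmetic. The sketch below simply evaluates those documented formulas, with k the number of dropped trees; the function name `dart_weights` is a hypothetical helper, not an XGBoost API.

```python
def dart_weights(k, learning_rate, normalize_type="tree"):
    """Return (weight of the new tree, scale factor applied to the dropped trees)."""
    if normalize_type == "tree":
        return 1.0 / (k + learning_rate), k / (k + learning_rate)
    if normalize_type == "forest":
        return 1.0 / (1.0 + learning_rate), 1.0 / (1.0 + learning_rate)
    raise ValueError("normalize_type must be 'tree' or 'forest'")


new_w_tree, scale_tree = dart_weights(k=2, learning_rate=0.5, normalize_type="tree")
new_w_forest, scale_forest = dart_weights(k=2, learning_rate=0.5, normalize_type="forest")
```

With `tree` normalization the new tree is weighted like one of the k dropped trees, whereas with `forest` it is weighted like all of them together, so its weight does not depend on k.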

學習任務的參數 Learning Task Parameters

Specify the learning task and the corresponding learning objective. The objective options are below:

objective [default=reg:linear] defines the loss function to be minimized
- reg:linear: linear regression
- reg:logistic: logistic regression
- binary:logistic: logistic regression for binary classification, output probability
- binary:logitraw: logistic regression for binary classification, output score before logistic transformation
- binary:hinge: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
- gpu:reg:linear, gpu:reg:logistic, gpu:binary:logistic, gpu:binary:logitraw: versions of the corresponding objective functions evaluated on the GPU; note that like the GPU histogram algorithm, they can only be used when the entire training session uses the same dataset
- count:poisson
  - poisson regression for count data, output mean of poisson distribution
  - max_delta_step is set to 0.7 by default in poisson regression (used to safeguard optimization)
- survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR). This is the proportional hazards model, or Cox model for short; I do not fully understand this part.
- multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
- multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
- rank:pairwise: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
- rank:ndcg: Use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
- rank:map: Use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
- reg:gamma: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
- reg:tweedie: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
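
Several of the objectives above differ only in how the raw boosted score (the margin) is turned into the reported prediction. The sketch below illustrates those transformations in plain Python. It is an illustration under assumptions, not XGBoost code: the helper names `transform_margin` and `reshape_softprob` are hypothetical, the zero threshold for binary:hinge and the row-major layout of the multi:softprob vector are assumed, and exp(margin) for the log-link objectives follows from the definition of a log link.

```python
import math

def sigmoid(margin):
    """Logistic transformation of a raw score."""
    return 1.0 / (1.0 + math.exp(-margin))

def transform_margin(objective, margin):
    """Map a raw margin to the value the given objective reports (illustrative)."""
    if objective == "binary:logitraw":
        return margin                      # score before the logistic transformation
    if objective in ("reg:logistic", "binary:logistic"):
        return sigmoid(margin)             # probability in (0, 1)
    if objective == "binary:hinge":
        return 1.0 if margin > 0 else 0.0  # hard 0/1 label (zero threshold assumed)
    if objective in ("count:poisson", "reg:gamma", "reg:tweedie"):
        return math.exp(margin)            # log-link: mean = exp(margin)
    if objective == "survival:cox":
        return math.exp(margin)            # hazard ratio HR = exp(marginal_prediction)
    raise ValueError(f"objective not covered by this sketch: {objective}")

def reshape_softprob(flat, num_class):
    """Reshape a flat ndata * nclass vector into ndata rows of nclass probabilities."""
    return [flat[i:i + num_class] for i in range(0, len(flat), num_class)]

probability = transform_margin("binary:logistic", 0.0)
raw_score = transform_margin("binary:logitraw", 2.0)
hard_label = transform_margin("binary:hinge", -1.0)
matrix = reshape_softprob([0.1, 0.9, 0.8, 0.2], num_class=2)
```

A margin of 0 corresponds to a probability of 0.5 under binary:logistic and to a hazard ratio of 1 under survival:cox, which is why the same tree ensemble can serve different tasks by swapping the objective.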
base_score [default=0.5]
- The initial prediction score of all instances, global bias
- For sufficient number of iterations, changing this value will not have too much effect.
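
Why base_score matters little after enough iterations can be seen in a deliberately simplified setting. The sketch below assumes squared error, no regularization, and a "tree" that is a single leaf fitting the mean residual, so each round moves the prediction a fraction eta of the way towards the label mean. The function `boost_constant` is hypothetical and only illustrates the shrinking influence of the starting point.

```python
def boost_constant(labels, base_score, eta=0.3, num_round=50):
    """Boost a single-leaf model under squared error, starting from base_score."""
    prediction = base_score
    target_mean = sum(labels) / len(labels)
    for _ in range(num_round):
        # Mean residual of the current prediction; a single leaf fits exactly this.
        residual_mean = target_mean - prediction
        # Shrink the step by the learning rate eta, as boosting does.
        prediction += eta * residual_mean
    return prediction

labels = [3.0, 5.0, 10.0]
# After many rounds the two starting points end up at nearly the same prediction.
p_default = boost_constant(labels, base_score=0.5)
p_far = boost_constant(labels, base_score=100.0)
# After only two rounds the starting point still dominates.
p_default_short = boost_constant(labels, base_score=0.5, num_round=2)
p_far_short = boost_constant(labels, base_score=100.0, num_round=2)
```

In this setting the gap to the label mean shrinks by a factor of (1 - eta) per round, so a poor base_score mainly costs extra rounds when few iterations are used.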
eval_metric [default according to objective]: the evaluation metric for validation data
- Evaluation metrics for validation data, a default metric will be assigned according to objective (rmse for regression, and error for classification, mean average precision for ranking)
- Users can add multiple evaluation metrics. Python users: remember to pass the metrics in as a list of parameter pairs instead of a map, so that a later eval_metric won't override a previous one
- The choices are listed below:
  - rmse: root mean square error
  - mae: mean absolute error
  - logloss: negative log-likelihood
  - error: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
  - error@t: a binary classification threshold other than 0.5 can be specified by providing a numerical value through 't'.
  - merror: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
  - mlogloss: Multiclass logloss (multiclass negative log-likelihood).
  - auc: Area under the curve
  - aucpr: Area under the PR (precision-recall) curve
  - ndcg: Normalized Discounted Cumulative Gain
  - map: Mean Average Precision, the mean over all queries (topics) of each query's average precision
  - ndcg@n, map@n: ‘n’ can be assigned as an integer to cut off the top positions in the lists for evaluation.
  - ndcg-, map-, ndcg@n-, map@n-: In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding “-” in the evaluation metric XGBoost will evaluate these score as 0 to be consistent under some conditions.
  - poisson-nloglik: negative log-likelihood for Poisson regression
  - gamma-nloglik: negative log-likelihood for gamma regression
  - cox-nloglik: negative partial log-likelihood for Cox proportional hazards regression
  - gamma-deviance: residual deviance for gamma regression
  - tweedie-nloglik: negative log-likelihood for Tweedie regression (at a specified value of the tweedie_variance_power parameter)
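
Two points from the list above are easier to see in code: why a map loses repeated eval_metric entries while a list of pairs keeps them, and how the simplest metrics are computed. The sketch below uses plain Python only; the function names are hypothetical and the formulas follow the definitions given above rather than XGBoost's source.

```python
import math

# A map keeps one value per key, so a second "eval_metric" replaces the first.
params_as_map = {"objective": "binary:logistic", "eval_metric": "logloss"}
params_as_map["eval_metric"] = "error@0.7"

# A list of (name, value) pairs keeps every metric in order.
params_as_pairs = [
    ("objective", "binary:logistic"),
    ("eval_metric", "logloss"),
    ("eval_metric", "error@0.7"),
]
kept_metrics = [value for name, value in params_as_pairs if name == "eval_metric"]

def rmse(labels, preds):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def mae(labels, preds):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(labels, preds)) / len(labels)

def logloss(labels, probs, eps=1e-15):
    """Negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def error_at(labels, probs, threshold=0.5):
    """#(wrong cases)/#(all cases); a probability above the threshold counts as positive."""
    wrong = sum(1 for y, p in zip(labels, probs) if (1 if p > threshold else 0) != y)
    return wrong / len(labels)

labels = [1, 0, 1, 0]
probs = [0.9, 0.6, 0.8, 0.2]
error_default = error_at(labels, probs)         # threshold 0.5
error_strict = error_at(labels, probs, 0.7)     # corresponds to error@0.7
```

On this toy data the second instance (probability 0.6, label 0) is misclassified at the default threshold but correct at 0.7, which shows how error@t can change the reported error without changing the model.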
seed [default=0]: the random number seed
- Random number seed. Setting it makes results that depend on randomness reproducible, which also helps when tuning parameters.

Command Line Parameters

The following parameters are only used in the console version of XGBoost

num_round
- The number of rounds for boosting
data
- The path of training data
test:data
- The path of test data to do prediction
save_period [default=0]
- The period to save the model. Setting save_period=10 means that for every 10 rounds XGBoost will save the model. Setting it to 0 means not saving any model during the training.
task [default=train] options:train,pred,eval,dump
- train: training using data
- pred: making prediction for test:data
- eval: for evaluating statistics specified by eval[name]=filename
- dump: dumps the learned model into text format
model_in [default=NULL]
- Path to input model, needed for the pred, eval, and dump tasks. If it is specified in training, XGBoost will continue training from the input model.
model_out [default=NULL]
- Path to output model after training finishes. If not specified, XGBoost will output files with such names as 0003.model where 0003 is number of boosting rounds.
model_dir [default=models/]
- The output directory of the saved models during training
fmap
- Feature map, used for dumping model
dump_format [default=text] options:text, json
- Format of model dump file
name_dump [default=dump.txt]
- Name of model dump file
name_pred [default=pred.txt]
- Name of prediction file, used in pred mode
pred_margin [default=0]
- Predict margin instead of transformed probability
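
The console version is usually driven by a configuration file plus optional name=value overrides on the command line. The sketch below only builds the configuration text and the argument list in Python and does not run anything. The `name = value` file syntax, the executable name `xgboost`, and all file names (`train.txt`, `test.txt`, `final.model`, `train.conf`) are assumptions for illustration, as are the helper names `render_config` and `command_line`.

```python
def render_config(params):
    """Render (name, value) pairs as 'name = value' lines, one per parameter."""
    return "\n".join(f"{name} = {value}" for name, value in params)

# Parameters for a training run; all paths are hypothetical.
train_params = [
    ("task", "train"),
    ("objective", "binary:logistic"),
    ("num_round", 10),
    ("save_period", 0),               # do not save intermediate models
    ("data", '"train.txt"'),
    ("test:data", '"test.txt"'),
    ("model_out", '"final.model"'),
]
config_text = render_config(train_params)

def command_line(config_path, overrides):
    """Build the argument list: executable, config file, then name=value overrides."""
    return ["xgboost", config_path] + [f"{name}={value}" for name, value in overrides]

# Reuse the same config for prediction by overriding the task and the input model.
predict_cmd = command_line(
    "train.conf",
    [("task", "pred"), ("model_in", "final.model"), ("name_pred", "pred.txt")],
)
```

Keeping the shared settings in one file and overriding only task and model_in mirrors how the train and pred tasks described above relate to each other.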

Parameter Tuning

Parameter tuning mainly follows Complete Guide to Parameter Tuning in XGBoost (with codes in Python); I will explain it in detail when I have time.

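References

Tianqi Chen's paper: XGBoost: A Scalable Tree Boosting System
Tianqi Chen's talk video: XGBoost A Scalable Tree Boosting System, June 02, 2016, together with the talk slides and the official website slides
XGBoost official website
A talk by one of the XGBoost contributors
What are the differences between GBDT and XGBoost among machine learning algorithms? - Zhihu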
