GBDT Principles and Sklearn Source Code Analysis: Classification

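Abstract:

Following the previous article on GBDT for regression, this article covers GBDT for classification. As you will see, the GBDT procedure is essentially identical for regression and classification. The main differences are the initialization and the leaf-node values, both of which follow from the choice of loss function.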

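Main text:

The basic principles of gradient boosting (GB) were covered in the previous article, so let's get straight to the point.
Below is the GBDT algorithm for classification, with logloss as the loss function:
$L(y_{i}, F_{m}(x_{i})) = -\{y_{i} \log p_{i} + (1 - y_{i}) \log(1 - p_{i})\}$,
where $p_{i} = \frac{1}{1 + e^{-F_{m}(x_{i})}}$.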

Here is a brief derivation of the commonly used simplified form of logloss:
$L(y_{i}, F_{m}(x_{i})) = -\{y_{i} \log p_{i} + (1 - y_{i}) \log(1 - p_{i})\}$
(Set the leading minus sign aside for now.)
Substituting $p_{i}$ => $y_{i} \log\left(\frac{1}{1 + e^{-F_{m}(x_{i})}}\right) + (1 - y_{i}) \log\left(\frac{e^{-F_{m}(x_{i})}}{1 + e^{-F_{m}(x_{i})}}\right)$
=> $-y_{i} \log(1 + e^{-F_{m}(x_{i})}) + (1 - y_{i}) \{\log(e^{-F_{m}(x_{i})}) - \log(1 + e^{-F_{m}(x_{i})})\}$
=> $-y_{i} \log(1 + e^{-F_{m}(x_{i})}) + \log(e^{-F_{m}(x_{i})}) - \log(1 + e^{-F_{m}(x_{i})}) - y_{i} \log(e^{-F_{m}(x_{i})}) + y_{i} \log(1 + e^{-F_{m}(x_{i})})$
=> $y_{i} F_{m}(x_{i}) - F_{m}(x_{i}) - \log(1 + e^{-F_{m}(x_{i})})$ (the two $y_{i} \log(1 + e^{-F_{m}(x_{i})})$ terms cancel, and $\log(e^{-F_{m}(x_{i})}) = -F_{m}(x_{i})$)
=> $y_{i} F_{m}(x_{i}) - \log(1 + e^{F_{m}(x_{i})})$ (since $F_{m}(x_{i}) + \log(1 + e^{-F_{m}(x_{i})}) = \log(e^{F_{m}(x_{i})} + 1)$)
Restoring the minus sign gives:
$L(y_{i}, F_{m}(x_{i})) = -\{y_{i} \log p_{i} + (1 - y_{i}) \log(1 - p_{i})\} = -\{y_{i} F_{m}(x_{i}) - \log(1 + e^{F_{m}(x_{i})})\}$
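The simplification can be sanity-checked numerically. The sketch below is only an illustration: the function names `logloss_from_prob` and `logloss_from_score` are made up for this check and do not come from sklearn. It evaluates both forms of the loss for a few arbitrary scores and prints them side by side.

```python
import math

def logloss_from_prob(y, f):
    """Log loss written in terms of the probability p = 1 / (1 + e^-f)."""
    p = 1.0 / (1.0 + math.exp(-f))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def logloss_from_score(y, f):
    """Simplified form derived above: -(y * f - log(1 + e^f))."""
    return -(y * f - math.log1p(math.exp(f)))

# Arbitrary test scores; any moderate value of F should agree.
for y in (0, 1):
    for f in (-2.0, -0.4054, 0.0, 1.5):
        a = logloss_from_prob(y, f)
        b = logloss_from_score(y, f)
        print(f"y={y} F={f:+.4f}  prob form={a:.6f}  score form={b:.6f}")
```

The two columns should agree up to floating-point error, which is what the derivation claims.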

Algorithm 3: BinomialDeviance_TreeBoost
____________________________________
$F_{0}(x) = \log\left(\frac{\sum_{i=1}^{N} y_{i}}{\sum_{i=1}^{N} (1 - y_{i})}\right)$
For $m = 1$ to $M$ do:
  $\tilde{y}_{i} = -\left[\frac{\partial L(y_{i}, F(x_{i}))}{\partial F(x_{i})}\right]_{F(x) = F_{m-1}(x)} = y_{i} - \frac{1}{1 + e^{-F_{m-1}(x_{i})}}$
  $\{R_{jm}\}_{1}^{J} = J\text{-terminal node tree}(\{\tilde{y}_{i}, x_{i}\}_{1}^{N})$
  $\gamma_{jm} = \frac{\sum_{x_{i} \in R_{jm}} \tilde{y}_{i}}{\sum_{x_{i} \in R_{jm}} (y_{i} - \tilde{y}_{i})(1 - y_{i} + \tilde{y}_{i})}$
  $F_{m}(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$

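Algorithm 3 is the GBDT procedure for classification when logloss is chosen as the loss function.
As you can see, it is the same as for regression; there is no special processing step.
(In fact, in the sklearn source code the regression model is defined as GradientBoostingRegressor() and the classification model as GradientBoostingClassifier(), but the two are kept separate only for user convenience. Both inherit from BaseGradientBoosting(), which is where the steps of Algorithm 3 are carried out; GradientBoostingRegressor() and GradientBoostingClassifier() merely handle some learner parameter configuration.)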
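To make the flow of Algorithm 3 concrete, here is a minimal pure-Python sketch. It is not sklearn's implementation. It assumes a single numeric feature, depth-1 trees (stumps) chosen by squared error, a shrinkage factor, and no guard against a zero denominator. The names `fit_stump`, `fit_gbdt` and `predict_score` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(x, target):
    """Depth-1 regression tree on one feature: best threshold by squared error."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_threshold, best_sse = None, float("inf")
    for threshold in sorted(set(x))[:-1]:   # keep the right side non-empty
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold

def fit_gbdt(x, y, n_trees=2, learning_rate=0.1):
    """Algorithm 3 with stumps. Returns (F_0, learning_rate, trees)."""
    pos = sum(y)
    f0 = math.log(pos / (len(y) - pos))                  # initialization
    scores = [f0] * len(y)
    trees = []
    for _ in range(n_trees):
        prob = [sigmoid(s) for s in scores]
        residual = [yi - pi for yi, pi in zip(y, prob)]  # negative gradient
        threshold = fit_stump(x, residual)               # leaf regions

        def leaf_value(indices):
            numerator = sum(residual[i] for i in indices)
            denominator = sum(prob[i] * (1.0 - prob[i]) for i in indices)
            return numerator / denominator

        left = [i for i, xi in enumerate(x) if xi <= threshold]
        right = [i for i, xi in enumerate(x) if xi > threshold]
        gamma_left, gamma_right = leaf_value(left), leaf_value(right)
        trees.append((threshold, gamma_left, gamma_right))
        # shrinkage: add only learning_rate times the leaf value
        scores = [s + learning_rate * (gamma_left if xi <= threshold else gamma_right)
                  for s, xi in zip(scores, x)]
    return f0, learning_rate, trees

def predict_score(model, xi):
    """Raw score F_M(x): initial value plus the shrunken leaf value of every tree."""
    f0, learning_rate, trees = model
    return f0 + sum(learning_rate * (gl if xi <= t else gr) for t, gl, gr in trees)

# The ten-sample toy data set used in the worked example of this article.
x = list(range(1, 11))
y = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
model = fit_gbdt(x, y, n_trees=2, learning_rate=0.1)
for xi in (1, 9):
    score = predict_score(model, xi)
    print(xi, round(score, 6), round(sigmoid(score), 6))
```

With two trees and a learning rate of 0.1, the printed scores can be compared against the $F_{2}(x_{i})$ values derived by hand later in the article.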

Worked example

Below, a simple data set is again used to walk through the GBDT procedure.

$x_{i}$	1	2	3	4	5	6	7	8	9	10
$y_{i}$	0	0	0	1	1	0	0	0	1	1

Parameter settings:
1. Loss function: logloss
2. Split criterion: MSE
3. Tree depth: 1
4. Learning rate: 0.1

The first step of Algorithm 3 is the initialization:
$F_{0}(x) = \log\left(\frac{\sum_{i=1}^{N} y_{i}}{\sum_{i=1}^{N} (1 - y_{i})}\right) = \log\left(\frac{4}{6}\right) = -0.4054$
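As a quick check of the initialization, the log-odds can be computed directly from the labels. This is only a sketch with plain Python lists; the variable names are illustrative.

```python
import math

y = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]

pos = sum(y)              # number of positive samples
neg = len(y) - pos        # number of negative samples
f0 = math.log(pos / neg)  # log-odds of the positive class
print(pos, neg, f"{f0:.8f}")
```

Every sample starts from this same constant score before any tree is fitted.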

Fitting the first tree ($m = 1$)
Compute the negative gradient:
$\tilde{y}_{i} = -\left[\frac{\partial L(y_{i}, F(x_{i}))}{\partial F(x_{i})}\right]_{F(x) = F_{m-1}(x)} = y_{i} - \frac{1}{1 + e^{-F_{m-1}(x_{i})}} = y_{i} - \frac{1}{1 + e^{-F_{0}(x_{i})}}$

For example, for the first sample ($i = 1$):
$\tilde{y}_{1} = 0 - \frac{1}{1 + e^{0.4054}} = -0.400$
The remaining samples are computed in the same way, giving the following table:

$x_{i}$	1	2	3	4	5	6	7	8	9	10
${\tilde{y}}_{i}$	-0.4	-0.4	-0.4	0.6	0.6	-0.4	-0.4	-0.4	0.6	0.6
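The residual table can be reproduced in a few lines. The sketch below assumes the initial score is the log-odds of the labels and uses a hand-written `sigmoid` helper instead of a library function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
f0 = math.log(sum(y) / (len(y) - sum(y)))   # F_0, identical for every sample

# negative gradient of logloss: y_i - p_i
residuals = [yi - sigmoid(f0) for yi in y]
print([round(r, 4) for r in residuals])
```

Because $F_{0}$ is a constant, every negative sample gets the same residual and every positive sample gets the same residual.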

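Next, we fit a tree with $\tilde{y}_{i}$ as the target.
The tree-fitting procedure was described in detail in the previous article, so it is not repeated here.

After fitting, the tree yields the following leaf regions:
$R_{11}$: $x_{i} \le 8$; $R_{21}$: $x_{i} > 8$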
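For readers who want to confirm the split, the following sketch brute-forces every candidate threshold of a depth-1 tree and keeps the one with the smallest total squared error. It is an illustration under the MSE criterion only; `sse` and `best_stump_split` are hypothetical helper names, and the residuals are the rounded values from the table.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_stump_split(x, target):
    """Return (threshold, total_sse) of the best split of the form x <= threshold."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # the largest x would leave the right side empty
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

x = list(range(1, 11))
residuals = [-0.4, -0.4, -0.4, 0.6, 0.6, -0.4, -0.4, -0.4, 0.6, 0.6]
threshold, total = best_stump_split(x, residuals)
print(threshold, round(total, 4))
```

The search selects the threshold 8, matching the regions $x_{i} \le 8$ and $x_{i} > 8$.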
Next we compute the leaf values $\gamma_{jm}$
using the formula $\gamma_{jm} = \frac{\sum_{x_{i} \in R_{jm}} \tilde{y}_{i}}{\sum_{x_{i} \in R_{jm}} (y_{i} - \tilde{y}_{i})(1 - y_{i} + \tilde{y}_{i})}$
For region $R_{11}$:
$\sum_{x_{i} \in R_{11}} \tilde{y}_{i} = \tilde{y}_{1} + \tilde{y}_{2} + \cdots + \tilde{y}_{8} = 6 \times (-0.4) + 2 \times 0.6 = -1.2$
$\sum_{x_{i} \in R_{11}} (y_{i} - \tilde{y}_{i})(1 - y_{i} + \tilde{y}_{i}) = \sum_{i=1}^{8} (y_{i} - \tilde{y}_{i})(1 - y_{i} + \tilde{y}_{i}) = 8 \times (0.4 \times 0.6) = 1.92$

For region $R_{21}$:
$\sum_{x_{i} \in R_{21}} \tilde{y}_{i} = \tilde{y}_{9} + \tilde{y}_{10} = 1.2$
$\sum_{x_{i} \in R_{21}} (y_{i} - \tilde{y}_{i})(1 - y_{i} + \tilde{y}_{i}) = (y_{9} - \tilde{y}_{9})(1 - y_{9} + \tilde{y}_{9}) + (y_{10} - \tilde{y}_{10})(1 - y_{10} + \tilde{y}_{10}) = 2 \times (0.4 \times 0.6) = 0.48$

So the two leaf values are:
$\gamma_{11} = \frac{-1.2}{1.92} = -0.625$, $\gamma_{21} = \frac{1.2}{0.48} = 2.5$
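The leaf-value formula translates almost literally into code. This sketch uses the rounded residuals from the table and a hypothetical helper `leaf_value`; it does not guard against a zero denominator.

```python
def leaf_value(y_leaf, residual_leaf):
    """gamma = sum(residual) / sum((y - residual) * (1 - y + residual))."""
    numerator = sum(residual_leaf)
    denominator = sum((yi - ri) * (1 - yi + ri)
                      for yi, ri in zip(y_leaf, residual_leaf))
    return numerator / denominator

y = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
residuals = [-0.4, -0.4, -0.4, 0.6, 0.6, -0.4, -0.4, -0.4, 0.6, 0.6]

gamma_11 = leaf_value(y[:8], residuals[:8])   # region x <= 8
gamma_21 = leaf_value(y[8:], residuals[8:])   # region x > 8
print(round(gamma_11, 4), round(gamma_21, 4))
```

Note that $y_{i} - \tilde{y}_{i}$ is just the current probability $p_{i}$, so the denominator is a sum of $p_{i}(1 - p_{i})$ terms.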

Finally, $F_{1}(x)$ is updated via $F_{m}(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$. Note that shrinkage is applied here as well, i.e. the tree's contribution is multiplied by a learning rate $\eta$:
$F_{m}(x) = F_{m-1}(x) + \eta \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$.

Taking $x_{1}$ as an example:
$F_{1}(x_{1}) = F_{0}(x_{1}) + 0.1 \times (-0.625) = -0.4054 - 0.0625 = -0.4679$
The values for all samples are listed in the table below for reference:

$x_{i}$	1	2	3	4	5	6	7	8	9	10
$F_{1} (x_{i})$	-0.46796511	-0.46796511	-0.46796511	-0.46796511	-0.46796511	-0.46796511	-0.46796511	-0.46796511	-0.15546511	-0.15546511
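The update with shrinkage is a one-liner once the leaf values are known. The sketch below hard-codes the two leaf values and the split at 8 from the first tree; the dictionary keys are illustrative names.

```python
import math

learning_rate = 0.1
f0 = math.log(4 / 6)                       # initial score F_0
gamma = {"left": -0.625, "right": 2.5}     # leaf values of the first tree

x = list(range(1, 11))
# F_1(x_i) = F_0 + learning_rate * gamma of the leaf that x_i falls into
f1 = [f0 + learning_rate * (gamma["left"] if xi <= 8 else gamma["right"]) for xi in x]
print([f"{v:.8f}" for v in f1])
```

Samples 1 to 8 share one value and samples 9 and 10 share the other, because a depth-1 tree has only two leaves.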

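At this point the first tree has been trained. Once again, the training process is essentially the same as for regression.

Now a brief look at fitting the second tree ($m = 2$).

Compute the negative gradient.
For example, for $x_{1}$:
$\tilde{y}_{1} = y_{1} - \frac{1}{1 + e^{-F_{1}(x_{1})}} = 0 - 0.38509 = -0.38509$
The others follow in the same way, giving the table below: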

$x_{i}$	1	2	3	4	5	6	7	8	9	10
${\tilde{y}}_{i}$	-0.38509799	-0.38509799	-0.38509799	0.61490201	0.61490201	-0.38509799	-0.38509799	-0.38509799	0.53878818	0.53878818

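Then, as before, a regression tree is fitted to the new $\tilde{y}_{i}$, after which the leaf regions and leaf values are computed.

Prediction

With only 2 trees, prediction works as follows: a sample starts from $F_{0}(x)$, is routed to one leaf in each tree, and the leaf values, each scaled by the learning rate, are added up to give $F_{2}(x)$.

Unlike regression, classification must convert the final accumulated result $F_{m}(x)$ into a probability ($F_{m}(x)$ can be thought of as a score). Specifically:
With logloss as the loss function, $p_{i} = \frac{1}{1 + e^{-F_{m}(x_{i})}}$.
With exponential loss as the loss function, $p_{i} = \frac{1}{1 + e^{-2 F_{m}(x_{i})}}$.
Here $p_{i}$ denotes the probability of the positive class.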

To be more concrete, for the example above, once the second tree has been fitted, $F_{2}(x)$ is given by the following table:

$x_{i}$	1	2	3	4	5	6	7	8	9	10
$F_{2} (x_{i})$	-0.52501722	-0.52501722	-0.52501722	-0.52501722	-0.52501722	-0.52501722	-0.52501722	-0.52501722	0.06135501	0.06135501

The corresponding probabilities are:

$x_{i}$	1	2	3	4	5	6	7	8	9	10
$p_{i}$	0.37167979	0.37167979	0.37167979	0.37167979	0.37167979	0.37167979	0.37167979	0.37167979	0.51533394	0.51533394

(The probabilities in the table are for the positive class, i.e. the probability that $y_{i} = 1$.)
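The score-to-probability step can be sketched as below. The `loss` argument and its string values are hypothetical and only switch between the two formulas given earlier; the worked example uses logloss, so the default is what applies to the table.

```python
import math

def score_to_probability(score, loss="logloss"):
    """Map a raw score F(x) to P(y = 1).

    `loss` is a hypothetical switch for this sketch: "logloss" uses
    1 / (1 + e^-F), "exponential" uses 1 / (1 + e^-2F).
    """
    factor = 2.0 if loss == "exponential" else 1.0
    return 1.0 / (1.0 + math.exp(-factor * score))

# F_2(x_i) copied from the table above (samples 1-8, then 9-10)
f2 = [-0.52501722] * 8 + [0.06135501] * 2

probabilities = [score_to_probability(f) for f in f2]
print([round(p, 4) for p in probabilities])
```

The printed values should agree with the probability table up to rounding. After only two trees with a learning rate of 0.1, the probabilities have moved only slightly away from the prior of 0.4.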

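A brief look at the Sklearn source code

A note up front: more material may be added to this source-code analysis when time permits; for now, let's take a quick look at the core code for GBDT classification.

When logloss is the loss function, the corresponding setting in sklearn is loss='deviance'.
Computing the negative gradient, initialization, updating the leaf nodes, and converting scores to probabilities are all handled in a class called BinomialDeviance().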

class BinomialDeviance(ClassificationLossFunction):
    """Binomial deviance loss function for binary classification.

    Binary classification is a special case; here, we only need to
    fit one tree instead of ``n_classes`` trees.
    """
    def __init__(self, n_classes):
        if n_classes != 2:
            raise ValueError("{0:s} requires 2 classes.".format(
                self.__class__.__name__))
        # we only need to fit one tree for binary clf.
        super(BinomialDeviance, self).__init__(1)

    def init_estimator(self):
        return LogOddsEstimator()

    def __call__(self, y, pred, sample_weight=None):
        """Compute the deviance (= 2 * negative log-likelihood). """
        # logaddexp(0, v) == log(1.0 + exp(v))
        pred = pred.ravel()
        if sample_weight is None:
            return -2.0 * np.mean((y * pred) - np.logaddexp(0.0, pred))
        else:
            return (-2.0 / sample_weight.sum() *
                    np.sum(sample_weight * ((y * pred) - np.logaddexp(0.0, pred))))

    def negative_gradient(self, y, pred, **kargs):
        """Compute the residual (= negative gradient). """
        return y - expit(pred.ravel())

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        """Make a single Newton-Raphson step.

        our node estimate is given by:

            sum(w * (y - prob)) / sum(w * prob * (1 - prob))

        we take advantage that: y - prob = residual
        """
        terminal_region = np.where(terminal_regions == leaf)[0]
        residual = residual.take(terminal_region, axis=0)
        y = y.take(terminal_region, axis=0)
        sample_weight = sample_weight.take(terminal_region, axis=0)

        numerator = np.sum(sample_weight * residual)
        denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual))
        # prevents overflow and division by zero
        if abs(denominator) < 1e-150:
            tree.value[leaf, 0, 0] = 0.0
        else:
            tree.value[leaf, 0, 0] = numerator / denominator

    def _score_to_proba(self, score):
        proba = np.ones((score.shape[0], 2), dtype=np.float64)
        proba[:, 1] = expit(score.ravel())
        proba[:, 0] -= proba[:, 1]
        return proba

    def _score_to_decision(self, score):
        proba = self._score_to_proba(score)
        return np.argmax(proba, axis=1)

The method below computes the negative gradient. Note that the function expit is $\frac{1}{1 + e^{-x}}$.
In the code, y_pred or pred stands for $F_{m-1}(x)$.

    def negative_gradient(self, y, pred, **kargs):
        """Compute the residual (= negative gradient). """
        return y - expit(pred.ravel())

Updating the leaf nodes: the key is computing numerator and denominator.
Also, residual in the code is the negative gradient.

    def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
                                residual, pred, sample_weight):
        """Make a single Newton-Raphson step.

        our node estimate is given by:

            sum(w * (y - prob)) / sum(w * prob * (1 - prob))

        we take advantage that: y - prob = residual
        """
        terminal_region = np.where(terminal_regions == leaf)[0]
        residual = residual.take(terminal_region, axis=0)
        y = y.take(terminal_region, axis=0)
        sample_weight = sample_weight.take(terminal_region, axis=0)

        numerator = np.sum(sample_weight * residual)
        denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual))
        # prevents overflow and division by zero
        if abs(denominator) < 1e-150:
            tree.value[leaf, 0, 0] = 0.0
        else:
            tree.value[leaf, 0, 0] = numerator / denominator
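The docstring calls this a single Newton-Raphson step. A short derivation, sketched here under the logloss definition used earlier in this article, shows where the numerator and denominator come from. For a leaf $R_{jm}$ the goal is the $\gamma$ that minimizes $\sum_{x_{i} \in R_{jm}} L(y_{i}, F_{m-1}(x_{i}) + \gamma)$, which has no closed form, so one Newton step is taken from $\gamma = 0$:

```latex
\begin{aligned}
g &= \sum_{x_i \in R_{jm}} \left.\frac{\partial L(y_i, F)}{\partial F}\right|_{F = F_{m-1}(x_i)}
   = \sum_{x_i \in R_{jm}} (p_i - y_i)
   = -\sum_{x_i \in R_{jm}} \tilde{y}_i \\
h &= \sum_{x_i \in R_{jm}} \left.\frac{\partial^2 L(y_i, F)}{\partial F^2}\right|_{F = F_{m-1}(x_i)}
   = \sum_{x_i \in R_{jm}} p_i (1 - p_i) \\
\gamma_{jm} &= -\frac{g}{h}
   = \frac{\sum_{x_i \in R_{jm}} \tilde{y}_i}{\sum_{x_i \in R_{jm}} p_i (1 - p_i)},
   \qquad p_i = y_i - \tilde{y}_i,\quad 1 - p_i = 1 - y_i + \tilde{y}_i
\end{aligned}
```

Substituting $p_{i} = y_{i} - \tilde{y}_{i}$ gives exactly the expression `(y - residual) * (1 - y + residual)` used for the denominator in the code.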

The initialization class:

class LogOddsEstimator(object):
    """An estimator predicting the log odds ratio."""
    scale = 1.0

    def fit(self, X, y, sample_weight=None):
        # pre-cond: pos, neg are encoded as 1, 0
        if sample_weight is None:
            pos = np.sum(y)
            neg = y.shape[0] - pos
        else:
            pos = np.sum(sample_weight * y)
            neg = np.sum(sample_weight * (1 - y))

        if neg == 0 or pos == 0:
            raise ValueError('y contains non binary labels.')
        self.prior = self.scale * np.log(pos / neg)

    def predict(self, X):
        check_is_fitted(self, 'prior')

        y = np.empty((X.shape[0], 1), dtype=np.float64)
        y.fill(self.prior)
        return y

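Within it, the method below performs the initialization. Note the factor self.scale: sklearn provides two loss functions for classification, logloss and exponential loss, and their initializations differ only in this coefficient, which is 1.0 for the former and 0.5 for the latter.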

    def fit(self, X, y, sample_weight=None):
        # pre-cond: pos, neg are encoded as 1, 0
        if sample_weight is None:
            pos = np.sum(y)
            neg = y.shape[0] - pos
        else:
            pos = np.sum(sample_weight * y)
            neg = np.sum(sample_weight * (1 - y))

        if neg == 0 or pos == 0:
            raise ValueError('y contains non binary labels.')
        self.prior = self.scale * np.log(pos / neg)
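One way to see why the two scale factors differ is to minimize each loss over a constant score numerically. The sketch below is only an illustration: it assumes the exponential loss has the form $e^{-(2y - 1)F}$ (an assumption, since the article does not write this loss out), uses a brute-force grid search, and all function names are hypothetical.

```python
import math

y = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
pos, neg = sum(y), len(y) - sum(y)

def mean_logloss(f):
    # -(y*f - log(1 + e^f)), averaged over the samples
    return sum(-(yi * f - math.log1p(math.exp(f))) for yi in y) / len(y)

def mean_exploss(f):
    # assumed form of the exponential loss: exp(-(2y - 1) * f)
    return sum(math.exp(-(2 * yi - 1) * f) for yi in y) / len(y)

def argmin_on_grid(loss, lo=-2.0, hi=2.0, steps=40001):
    """Brute-force minimizer of a 1-D function on an evenly spaced grid."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=loss)

best_log = argmin_on_grid(mean_logloss)
best_exp = argmin_on_grid(mean_exploss)
print(round(best_log, 3), round(1.0 * math.log(pos / neg), 3))
print(round(best_exp, 3), round(0.5 * math.log(pos / neg), 3))
```

Under these assumptions, the grid minimizer of each loss lands next to `scale * log(pos / neg)` with the scale of 1.0 for logloss and 0.5 for exponential loss.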

Finally, the conversion to probabilities. One detail: the positive-class probability is stored in the 2nd column (counting from 1).

    def _score_to_proba(self, score):
        proba = np.ones((score.shape[0], 2), dtype=np.float64)
        proba[:, 1] = expit(score.ravel())
        proba[:, 0] -= proba[:, 1]
        return proba
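The column layout and the class decision can be mimicked without NumPy. This is a simplified stand-in, not the library code: `score_to_proba` and `score_to_decision` are plain functions written for this illustration, and ties are resolved in favor of class 0.

```python
import math

def score_to_proba(scores):
    """Return one row [P(y=0), P(y=1)] per raw score F(x)."""
    rows = []
    for s in scores:
        p1 = 1.0 / (1.0 + math.exp(-s))
        rows.append([1.0 - p1, p1])   # positive class in the 2nd column
    return rows

def score_to_decision(scores):
    """Pick the class whose column holds the larger probability."""
    return [0 if row[0] >= row[1] else 1 for row in score_to_proba(scores)]

# Two scores taken from the worked example: a sample with x <= 8 and one with x > 8
scores = [-0.52501722, 0.06135501]
print(score_to_proba(scores))
print(score_to_decision(scores))
```

A negative score maps to class 0 and a positive score to class 1, which is equivalent to thresholding the positive-class probability at 0.5.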

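Summary

This concludes the explanation of GBDT for both regression and classification. What may be lacking is depth in the source-code part; time has been short recently, so a deeper analysis of the code will be added here later.

Multiclass classification also needs a separate discussion; see the dedicated article for details.

References

http://docplayer.net/21448572-Generalized-boosted-models-a-guide-to-the-gbm-package.html (derivation results for various loss functions)
http://xueshu.baidu.com/s?wd=paperuri%3A%28ab7165108163edc94b30781e51819e0c%29&filter=sc_long_sign&sc_ks_para=q%3DGreedy%20function%20approximation%3A%20A%20gradient%20boosting%20machine.&sc_us=13783016239361402484&tn=SE_baiduxueshu_c1gjeupa&ie=utf-8 (the well-known paper this article mainly draws on: Greedy function approximation: a gradient boosting machine)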
