Scikit-learn GBDT Library: Summary and Practice

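The previous article summarized the principles of the traditional GBDT algorithm, so readers should now have a working understanding of how GBDT operates. This article looks at how to use the GBDT implementation in Scikit-learn.
We first give an overview of the GBDT classes in Scikit-learn, then go through the commonly used parameters of the Boosting framework and of the base learner (the CART regression tree), and finally use an example dataset to walk through the GBDT learning process, so that the principle can be understood more intuitively.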

1) Overview of the Scikit-learn GBDT classes

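As mentioned in "Gradient Boosting Decision Tree (GBDT): A Detailed Summary of the Algorithm", GBDT can handle both classification and regression tasks. In Scikit-learn, the classifier is GradientBoostingClassifier and the regressor is GradientBoostingRegressor. Their parameters are largely the same; the main difference is the set of values accepted by the loss function parameter loss (the regressor also has an extra alpha parameter, described later). The official API signatures are listed below. The parameters of both classes fall into two groups: Boosting framework parameters, and the important parameters of the CART regression tree. We discuss the library in that order and point out where the parameters of GradientBoostingClassifier and GradientBoostingRegressor differ.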

GradientBoostingClassifier official API (signature as published around Scikit-learn 0.22; later releases rename some loss values and remove presort and min_impurity_split):
class sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='deprecated', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)

GradientBoostingRegressor official API (same version):
class sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='deprecated', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
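
As a quick orientation, the following minimal sketch fits both classes with the same Boosting and tree parameters. The synthetic datasets and the variable names (X_clf, shared, and so on) are assumptions made for illustration only; loss is left at its default so the snippet does not depend on version-specific loss names.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X_clf, y_clf = make_classification(n_samples=300, n_features=8, random_state=0)
X_reg, y_reg = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)

# Both estimators accept the same Boosting framework and tree parameters.
shared = dict(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
clf = GradientBoostingClassifier(**shared).fit(Xc_train, yc_train)
reg = GradientBoostingRegressor(**shared).fit(Xr_train, yr_train)

clf_accuracy = clf.score(Xc_test, yc_test)  # mean accuracy on the test split
reg_r2 = reg.score(Xr_test, yr_test)        # R^2 on the test split
print("accuracy:", clf_accuracy, "R^2:", reg_r2)
```

The only difference in how the two objects are used here is the meaning of score: accuracy for the classifier, R^2 for the regressor.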

2) Boosting framework parameters

We start with the Boosting framework parameters.

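n_estimators: number of CART regression trees, default 100
n_estimators is the number of CART regression trees, or equivalently the maximum number of Boosting iterations. In general, a larger n_estimators fits the training data better but also makes the model more prone to overfitting, so it has to be balanced against learning_rate, subsample, and the pruning parameters of the CART regression trees.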
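learning_rate: shrinkage coefficient $\nu$, also called the step size, default 0.1
This parameter controls overfitting. Without shrinkage, the model is updated as:
$f_m(x)=f_{m-1}(x)+g_m(x)$
To guard against overfitting, sklearn adds a shrinkage coefficient $\nu \in (0,1]$, and the update becomes:
$f_m(x)=f_{m-1}(x)+\nu g_m(x)$
To reach the same fit on the training set, a smaller $\nu$ requires more n_estimators and therefore a longer training time. Because learning_rate and n_estimators jointly determine how well the model fits, the two should be tuned together.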
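
The shrunken update can be traced by hand. The sketch below is not sklearn's implementation; it is a simplified loop for the squared loss, with a hypothetical helper boost_with_shrinkage and made-up data, that applies $f_m = f_{m-1} + \nu g_m$ directly and records the training MSE after each round.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up noisy quadratic data, for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 1) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.randn(200)


def boost_with_shrinkage(X, y, nu, n_rounds):
    """Hypothetical helper: training MSE after each round of f_m = f_{m-1} + nu * g_m."""
    f = np.full(len(y), y.mean())  # f_0: constant initial prediction
    history = []
    for _ in range(n_rounds):
        residual = y - f  # what the next tree g_m has to explain
        g = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
        f = f + nu * g.predict(X)  # shrunken update
        history.append(np.mean((y - f) ** 2))
    return history


mse_fast = boost_with_shrinkage(X_demo, y_demo, nu=1.0, n_rounds=20)
mse_slow = boost_with_shrinkage(X_demo, y_demo, nu=0.1, n_rounds=20)
print("nu=1.0: MSE after 3 rounds %.4f, after 20 rounds %.4f" % (mse_fast[2], mse_fast[-1]))
print("nu=0.1: MSE after 3 rounds %.4f, after 20 rounds %.4f" % (mse_slow[2], mse_slow[-1]))
```

With the smaller $\nu$ each tree removes only a fraction of the residual, so more rounds are needed to reach a comparable training error. Note that this only measures training error and says nothing about generalization.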
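subsample: subsampling ratio, default 1.0
This parameter also controls overfitting. subsample is the fraction of training samples used to fit each base learner; the samples are drawn without replacement. With subsample=1 all samples are used, i.e. there is no subsampling. With subsample<1 only part of the samples is used to fit each tree. Setting subsample<1 reduces variance and helps prevent overfitting, but it increases the bias of the fit, which can be compensated by a larger n_estimators. For the same reason, the value should generally not be set too low.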
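loss: loss function. GradientBoostingClassifier and GradientBoostingRegressor use different loss functions.
- Loss function of GradientBoostingClassifier, default "deviance"
  For classification, GradientBoostingClassifier offers two choices: the log-likelihood loss "deviance" and the exponential loss "exponential". With "deviance", the algorithm is the classification GBDT procedure described in "Gradient Boosting Decision Tree (GBDT): A Detailed Summary of the Algorithm". With "exponential", GBDT reduces to the Adaboost algorithm. In general, keep the default "deviance".
- Loss function of GradientBoostingRegressor, default "ls"
  For regression, GradientBoostingRegressor offers the squared error "ls", the absolute error "lad", the Huber loss "huber", and the quantile loss "quantile". The default is the squared error "ls".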
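  Squared error loss:
  $L(y,f(x))=(y-f(x))^2$
  Absolute loss:
  $L(y,f(x))=|y-f(x)|$
  The Huber loss is a compromise between the squared error and the absolute loss: points far from the center (outliers) are penalized with the absolute loss, and points near the center with the squared error. The threshold $\theta$ between the two regimes is usually chosen as a quantile of the absolute residuals. The loss is:
  $L(y,f(x))=\begin{cases} \frac{1}{2}(y-f(x))^2 & |y-f(x)|\leq \theta \\ \theta\left(|y-f(x)|- \frac{\theta}{2}\right) & |y-f(x)|> \theta \end{cases}$
  The quantile loss is the loss function of quantile regression:
  $L(y,f(x))=\sum_{y \geq f(x)}\theta|y-f(x)|+\sum_{y< f(x)}(1-\theta)|y-f(x)|$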
  
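  The Huber and quantile losses are more robust and reduce the influence of outliers. When the data contains little noise, the default squared error "ls" usually works well. When the data is noisy, the noise-resistant "huber" loss is recommended. When quantile predictions are needed (for example, to build prediction intervals), use "quantile".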
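init: initial estimator, default None
This parameter specifies the estimator used to compute the initial predictions. With the default None, the initial prediction is computed from the training samples; with "zero", the initial raw predictions are set to 0; if we have prior knowledge about the data, or have already fitted a model to it, that estimator can be passed in instead. In general, keep the default None.
alpha: quantile value, default 0.9
A parameter specific to GradientBoostingRegressor. It is tied to $\theta$ in the Huber loss "huber" and the quantile loss "quantile": for "huber" it is the quantile of the absolute residuals that determines the threshold $\theta$, and for "quantile" it is the target quantile $\theta$ itself. It only needs to be specified when one of these two losses is used. In general, keep the default. If the data contains many noisy points, the quantile can be lowered somewhat.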
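
To make the four regression losses concrete, here is a small NumPy sketch of the per-sample formulas given above. The function names and the toy arrays are assumptions for illustration; in particular, theta is passed in directly as the threshold (Huber) or the quantile (quantile loss), whereas sklearn derives the Huber threshold from alpha.

```python
import numpy as np


def squared_loss(y, f):
    return (y - f) ** 2


def absolute_loss(y, f):
    return np.abs(y - f)


def huber_loss(y, f, theta):
    r = np.abs(y - f)
    # quadratic near the center, linear for residuals larger than theta
    return np.where(r <= theta, 0.5 * r ** 2, theta * (r - 0.5 * theta))


def quantile_loss(y, f, theta):
    diff = y - f
    # under-predictions (y >= f) cost theta, over-predictions cost (1 - theta)
    return np.where(diff >= 0, theta * diff, (theta - 1) * diff)


y_toy = np.array([0.0, 0.0, 0.0])
f_toy = np.array([0.5, -0.5, 10.0])  # the last prediction is far off

print("squared :", squared_loss(y_toy, f_toy))          # 0.25, 0.25, 100.0
print("absolute:", absolute_loss(y_toy, f_toy))         # 0.5, 0.5, 10.0
print("huber   :", huber_loss(y_toy, f_toy, 1.0))       # 0.125, 0.125, 9.5
print("quantile:", quantile_loss(y_toy, f_toy, 0.9))    # 0.05, 0.45, 1.0
```

The large residual dominates the squared loss (100.0) but contributes only linearly to the absolute and Huber losses, which is why the latter are less sensitive to outliers.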

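3) CART regression tree parameters

Because GBDT uses CART regression trees as base learners, its tree parameters are essentially the same as those of DecisionTreeRegressor. Below we go through a selection of the pruning-related parameters; for a more detailed treatment, see "Scikit-learn Decision Tree Library: Summary and Simple Practice".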

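max_features: maximum number of features considered for a split, default None
A parameter that limits overfitting. max_features limits how many features are considered when searching for the best split; features beyond that number are not considered for that split. With None, all features are considered. Otherwise a specified number is used, for example the square root of the number of features ("sqrt") or its base-2 logarithm ("log2"). Restricting the number of features may cause the model to underfit.
max_depth: maximum depth of a tree, default 3
A pruning parameter that limits overfitting: branches beyond the given depth are cut off. With a small dataset or few features, the default is usually fine. With many samples and many features, the value can be increased.
min_samples_leaf: minimum number of samples in a leaf, default 1
A pruning parameter that limits overfitting. A split is only kept if each resulting child contains at least this many samples. In practice, keep the default 1 for small datasets; for large datasets (around 100,000 samples) the parameter should be set, starting from 5 as a reference value.
min_samples_split: minimum number of samples required to split a node, default 2
A pruning parameter that limits overfitting. A node with fewer samples than this value is not split any further. In practice, keep the default for small datasets; for large datasets (around 100,000 samples), start tuning from 5 as a reference value.
max_leaf_nodes: maximum number of leaf nodes, default None
This parameter limits the number of leaf nodes, which helps prevent overfitting. The default None means no limit. With a limit, the algorithm builds the best tree it can within that number of leaves. With few features this parameter can be ignored; with many features, a limit can be useful.
ccp_alpha: cost-complexity pruning coefficient, default 0
A pruning parameter that limits overfitting. It corresponds to $\alpha$ in decision tree theory: the subtree with the largest cost complexity that is still smaller than ccp_alpha is chosen. With ccp_alpha=0 the tree is not pruned; the larger ccp_alpha is, the more nodes are pruned. The parameter was added in Scikit-learn 0.22.
criterion: measure of split quality, default "friedman_mse"
The criterion used to choose the best split. Accepted values are "friedman_mse", "mse" and "mae". friedman_mse is the mean squared error with Friedman's improvement score, mse is the ordinary mean squared error, and mae is the mean absolute error. In general, keep the default friedman_mse.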
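
The effect of the tree parameters can be checked on the fitted trees themselves. The following sketch uses made-up data and illustrative parameter values; it fits a GradientBoostingRegressor and then inspects every tree in estimators_ to confirm that max_depth and min_samples_leaf are respected.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Made-up data with three features, for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.rand(500, 3)
y_demo = X_demo[:, 0] ** 2 + np.sin(3 * X_demo[:, 1]) + 0.1 * rng.randn(500)

gbr = GradientBoostingRegressor(
    n_estimators=50,
    max_depth=4,           # no tree may be deeper than 4
    min_samples_leaf=5,    # every leaf keeps at least 5 training samples
    min_samples_split=10,  # nodes with fewer than 10 samples are not split
    max_features="sqrt",   # features considered per split
    random_state=0,
).fit(X_demo, y_demo)

# For regression, estimators_ has shape (n_estimators, 1).
trees = [stage[0] for stage in gbr.estimators_]
depths = [t.get_depth() for t in trees]
# Leaves are the nodes without a left child (children_left == -1).
min_leaf = min(
    t.tree_.n_node_samples[t.tree_.children_left == -1].min() for t in trees
)

print("deepest tree:", max(depths))
print("smallest leaf:", min_leaf)
```

Inspecting the fitted trees this way is a convenient sanity check when tuning several pruning parameters at once, since their effects can overlap.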

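4) Example

Like Adaboost, GBDT works by adding predictors to the ensemble one at a time, each one correcting its predecessors. Unlike Adaboost, however, it does not re-weight the instances at every iteration; instead, each new predictor is fitted to the residuals of the previous ones.
Below we generate a noisy quadratic training set. Using a regression example with decision trees as base learners, we first follow the GBDT fitting process step by step, then look at how several important GBDT parameters affect the model, and finally introduce early stopping for GBDT.
First, import the required libraries.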

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import matplotlib as mpl
import matplotlib.pyplot as plt

Generate the noisy quadratic dataset.

np.random.seed(42)
X = np.random.rand(100,1)-0.5
y = 3*X[:,0]**2 + 0.05*np.random.randn(100)

plt.figure(figsize=(10,8), facecolor='white')
plt.plot(X,y,'b.')
plt.xlabel('X',fontsize = 12)
plt.ylabel('y',fontsize = 12)
plt.show()

Train three decision tree models, each one fitted to the residuals left by the previous one:

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

Visualize how the ensemble is built up.

def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
    plt.axis(axes)

plt.figure(figsize=(11,11))

plt.subplot(321)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)

plt.subplot(322)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)

plt.subplot(323)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1)$", fontsize=16)

plt.subplot(324)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1)$")
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.subplot(325)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.5, 0.5], label="$h_3(x_1)$", style="g-", data_style="k+")
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)

plt.subplot(326)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)

plt.show()

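In the first row, the ensemble contains a single tree, so the ensemble prediction (right) is identical to the prediction of the first tree (left). In the second row, a new tree is fitted to the residuals of the first tree (left); on the right, the ensemble prediction equals the sum of the predictions of the first two trees. Likewise, in the third row a third tree is fitted to the residuals left after the second tree. The fit of the ensemble improves with each added tree.
At $x=0.4$, the printed ensemble prediction is close to the true (noise-free) value.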

X_new = np.array([[0.4]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_true = 3*X_new**2
print('y_true:',y_true,'\n y_pred',y_pred)

Next we look at how the parameters of GradientBoostingRegressor affect the model.

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X, y)

gbrt1 = GradientBoostingRegressor(max_depth=2, n_estimators=30, learning_rate=1.0, random_state=42)
gbrt1.fit(X, y)

gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)

gbrt_slow1 = GradientBoostingRegressor(max_depth=2, n_estimators=30, learning_rate=0.1, random_state=42)
gbrt_slow1.fit(X, y)

plt.figure(figsize=(11,8))

plt.subplot(221)
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("learning_rate={}, n_estimators={}".format(gbrt.learning_rate, gbrt.n_estimators), fontsize=14)

plt.subplot(222)
plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("learning_rate={}, n_estimators={}".format(gbrt_slow.learning_rate, gbrt_slow.n_estimators), fontsize=14)

plt.subplot(223)
plot_predictions([gbrt1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("learning_rate={}, n_estimators={}".format(gbrt1.learning_rate, gbrt1.n_estimators), fontsize=14)

plt.subplot(224)
plot_predictions([gbrt_slow1], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("learning_rate={}, n_estimators={}".format(gbrt_slow1.learning_rate, gbrt_slow1.n_estimators), fontsize=14)

plt.show()

The plots show that a smaller learning_rate needs more n_estimators.
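
The same observation can be put into numbers with staged_predict, which yields the ensemble prediction after each added tree. The sketch below regenerates the dataset with the same seed so that it is self-contained; the names X_demo, y_demo and train_mse are assumptions for illustration, and the figures it prints are training errors only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Same noisy quadratic data as in the example (regenerated to be self-contained).
np.random.seed(42)
X_demo = np.random.rand(100, 1) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * np.random.randn(100)

train_mse = {}
for lr in (1.0, 0.1):
    model = GradientBoostingRegressor(
        max_depth=2, n_estimators=30, learning_rate=lr, random_state=42
    )
    model.fit(X_demo, y_demo)
    # One prediction per stage: after 1, 2, ..., n_estimators trees.
    train_mse[lr] = [
        mean_squared_error(y_demo, y_stage)
        for y_stage in model.staged_predict(X_demo)
    ]

for lr, curve in train_mse.items():
    print("learning_rate=%.1f  training MSE after 3 trees: %.4f, after 30 trees: %.4f"
          % (lr, curve[2], curve[29]))
```

Fitting one model with the largest n_estimators and reading the intermediate stages avoids refitting a separate model for every ensemble size.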
Next, let us look at how the parameters subsample and n_estimators affect the model.

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=10,subsample=0.2, random_state=42)
gbrt.fit(X, y)

gbrt1 = GradientBoostingRegressor(max_depth=2, n_estimators=100,subsample=0.2, random_state=42)
gbrt1.fit(X, y)

gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=10,subsample=0.9, random_state=42)
gbrt_slow.fit(X, y)

gbrt_slow1 = GradientBoostingRegressor(max_depth=2, n_estimators=100,subsample=0.9, random_state=42)
gbrt_slow1.fit(X, y)

plt.figure(figsize=(11,8))

plt.subplot(221)
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("subsample={}, n_estimators={}".format(gbrt.subsample, gbrt.n_estimators), fontsize=14)

plt.subplot(222)
plot_predictions([gbrt_slow], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("subsample={}, n_estimators={}".format(gbrt_slow.subsample, gbrt_slow.n_estimators), fontsize=14)

plt.subplot(223)
plot_predictions([gbrt1], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble predictions")
plt.title("subsample={}, n_estimators={}".format(gbrt1.subsample, gbrt1.n_estimators), fontsize=14)

plt.subplot(224)
plot_predictions([gbrt_slow1], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("subsample={}, n_estimators={}".format(gbrt_slow1.subsample, gbrt_slow1.n_estimators), fontsize=14)

plt.show()

The plots show that subsample affects the model less than learning_rate does. With a smaller subsample, the fitted curve is somewhat smoother.
Next we introduce early stopping for GBDT.

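# Hold out a validation set; the loop below needs X_train / X_val.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, learning_rate=0.1, warm_start=True, random_state=42)

min_val_error = float("inf")
bst_n_estimators = 0
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)  # warm_start=True keeps the existing trees and adds one
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        bst_n_estimators = n_estimators  # best ensemble size seen so far
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping: no improvement for 5 consecutive rounds

print("Trees fitted before stopping:", gbrt.n_estimators)
print("Best number of trees:", bst_n_estimators)
print("Minimum validation MSE:", min_val_error)

# Refit with the best size rather than the size at which the loop stopped.
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)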
plt.figure(figsize=(14, 5))

plt.subplot(122)
plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("Best model (%d trees)" % bst_n_estimators, fontsize=14)

plt.show()
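
The constructor parameters validation_fraction, n_iter_no_change and tol listed in the API signature provide a built-in form of early stopping. The sketch below is one way to use them, with illustrative values chosen here rather than recommended settings; it regenerates the dataset so that it is self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Same noisy quadratic data as in the example (regenerated to be self-contained).
np.random.seed(42)
X_demo = np.random.rand(100, 1) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * np.random.randn(100)

gbrt_auto = GradientBoostingRegressor(
    max_depth=2,
    learning_rate=0.1,
    n_estimators=120,          # upper bound on the number of trees
    validation_fraction=0.25,  # share of the training data held out internally
    n_iter_no_change=5,        # stop after 5 rounds without sufficient improvement
    tol=1e-4,                  # minimum improvement that counts
    random_state=42,
)
gbrt_auto.fit(X_demo, y_demo)

n_trees_used = gbrt_auto.n_estimators_  # number of trees actually fitted
print("Trees fitted:", n_trees_used)
```

Compared with the manual warm_start loop, this leaves the validation split to the estimator, at the cost of less control over which samples are held out.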