A Complete Practical Summary of GBDT Classification (Part 2)

Part 2: sklearn Classification Examples

Example 1: Feature transformations with ensembles of trees

import numpy as np
np.random.seed(10)
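# seed() sets the integer that initializes the random number generator.
# 1. With the same seed() value, the same sequence of random numbers is generated on every run.
# 2. If no seed is set, the generator is initialized from a system source (such as the clock), so the numbers differ between runs.
# 3. A seed() call fixes only the sequence that follows it; to reproduce the same numbers later in the program, call seed() again.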

import matplotlib.pyplot as plt
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

startTime = time.time()
print('Step 1.Preparing data...')
n_estimator = 10         # number of trees in each ensemble
X, y = make_classification(n_samples=80000)   # generate 80,000 synthetic samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)   # half of the samples for training, half for testing
# It is important to train the ensemble of trees on a different subset
# of the training data than the logistic regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)
print('Step 2.RT+LR...')
# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)
# Logistic regression
rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)   # build the RT + LR pipeline
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)
print('Step 3.RF+LR...')
# Supervised transformation based on random forests
# Random forest
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]    # RF+LR
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

print('Step 4.GBT+LR...')
# Gradient boosted trees classifier
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()    # one-hot encoder for the leaf indices
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)    # train the GBT model
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
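fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)   # points of the ROC curve: x-axis is the false positive rate, y-axis is the true positive rate
# The ROC curve shows the trade-off between the true positive rate and the false positive rate as the decision threshold varies.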

print('Step 5.GBT...')
# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]    # predicted probabilities from the boosted trees alone
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)  # fpr and tpr needed to draw the ROC curve
print('Step 6.RF...')
# Random forest prediction
# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)
print('Step 7.Plotting...')
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
print('Step 8.Zoomed-in plot...')
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()
print('---Training Completed.Took %f s.' % (time.time() - startTime))

The output is as follows:


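The figure compares the ROC curves of the five methods. A ROC curve shows the trade-off between the true positive rate and the false positive rate as the decision threshold varies. In the figure, GBT+LR has the highest true positive rate curve, and the remaining curves lie progressively lower.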
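The "trees + LR" variants compared above feed one-hot encoded leaf indices into the logistic regression. The following minimal sketch shows what that intermediate representation looks like on a small synthetic dataset. The names `X_demo`, `y_demo`, `gbt`, `leaves`, and `encoded` are illustrative and not part of the original script, and `handle_unknown='ignore'` is an added assumption so that unseen leaves would not raise an error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Small synthetic problem (illustrative sizes, not those of the article)
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=3, max_depth=2, random_state=0)
gbt.fit(X_demo, y_demo)

# apply() returns the index of the leaf each sample reaches in each tree.
# For binary classification the result has shape (n_samples, n_estimators, 1),
# so the last axis is dropped.
leaves = gbt.apply(X_demo)[:, :, 0]

# One-hot encoding turns every (tree, leaf) pair into one binary column.
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(leaves)

print(leaves.shape)    # one leaf index per sample and tree
print(encoded.shape)   # one column per distinct (tree, leaf) pair
```

Each encoded row has exactly one active column per tree, so the logistic regression effectively learns one weight per leaf.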

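By the properties of ROC curves, if one learner's curve is completely enclosed by another learner's curve, the latter can be said to outperform the former. If the two curves cross, compare the areas under them (AUC).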

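Judged by enclosure, the GBT+LR curve encloses the other four almost everywhere, but the curves cross near the left edge of the plot, so enclosure alone does not settle every comparison. Judged by area under the curve, GBT+LR has the largest area, followed by GBT, RF+LR, RF, and RT+LR, so the ranking from best to worst is: GBT+LR > GBT > RF+LR > RF > RT+LR.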
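The area comparison can be made numeric rather than read off the plot. A minimal sketch, using a tiny hand-made label/score pair (`y_true` and `y_score` are illustrative values, not data from the script above):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy labels and scores, chosen only for illustration
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Area under the ROC curve from the curve points ...
fpr, tpr, _ = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# ... or directly from labels and scores
area_direct = roc_auc_score(y_true, y_score)

print(area, area_direct)   # both are 0.75 for this toy input
```

In the script above, the same call applied to each pair, for example `auc(fpr_grd_lm, tpr_grd_lm)`, would give one number per method to rank them by.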

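Example 2: Gradient Boosting Out-of-Bag estimates
Term: out of bag (the improvement in loss is measured on the examples not included in the bootstrap sample, the so-called out-of-bag examples).
OOB estimation is a useful heuristic for estimating the optimal number of boosting iterations. It plays a role similar to cross-validation but can be computed on the fly, without repeatedly refitting the model. OOB estimates are only available for stochastic gradient boosting (subsample < 1.0, meaning each tree is fit on a random fraction of the training set). The OOB estimator is a pessimistic estimate of the true test loss, but it remains a fairly good approximation for a small number of trees.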
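As a minimal sketch of the idea, the snippet below fits a stochastic gradient boosting model on synthetic data and reads the OOB-based estimate of the best iteration from `oob_improvement_`. The dataset, the parameter values, and the names `X_demo`, `y_demo`, `clf_demo`, `oob_loss`, and `best_iter` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=500, random_state=0)

# subsample < 1.0 turns on stochastic gradient boosting,
# which is what makes OOB estimates available
clf_demo = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      subsample=0.5, random_state=0)
clf_demo.fit(X_demo, y_demo)

# oob_improvement_[i] is the loss improvement on the out-of-bag samples
# at stage i; the negated cumulative sum behaves like a loss curve
oob_loss = -np.cumsum(clf_demo.oob_improvement_)

# iteration (1-based) at which the OOB loss curve is lowest
best_iter = int(np.argmin(oob_loss)) + 1
print(best_iter)
```

No extra model fits are needed here, which is the practical advantage over cross-validation described above.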

# Author: Peter Prettenhofer <[email protected]>
#
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split


# Generate data (adapted from G. Ridgeway's gbm example)
n_samples = 1000
random_state = np.random.RandomState(13)
x1 = random_state.uniform(size=n_samples)   # 1-D array of 1000 values drawn uniformly from [0, 1)
x2 = random_state.uniform(size=n_samples)
x3 = random_state.randint(0, 4, size=n_samples)  # 1-D array of 1000 integers drawn uniformly from {0, 1, 2, 3} (the upper bound 4 is excluded)

p = 1 / (1.0 + np.exp(-(np.sin(3 * x1) - 4 * x2 + x3)))
y = random_state.binomial(1, p, size=n_samples)   # one Bernoulli trial per sample, with success probability p[i]

X = np.c_[x1, x2, x3]  # stack the three 1-D arrays as columns, giving shape (1000, 3); the arrays must have equal length

X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=9)

# Fit classifier with out-of-bag estimates
params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)

clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Accuracy: {:.4f}".format(acc))  # mean accuracy on the test set

n_estimators = params['n_estimators']
x = np.arange(n_estimators) + 1


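def heldout_score(clf, X_test, y_test):
    """compute deviance scores on ``X_test`` and ``y_test``. """
    score = np.zeros((n_estimators,), dtype=np.float64)
    for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
        # clf.loss_ is not available in recent scikit-learn releases, so the
        # binomial deviance is computed directly from the raw scores
        raw = np.ravel(y_pred)
        score[i] = -2.0 * np.mean(y_test * raw - np.logaddexp(0.0, raw))
    return score  # deviance after each boosting stage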


def cv_estimate(n_splits=3):
    cv = KFold(n_splits=n_splits)
    cv_clf = ensemble.GradientBoostingClassifier(**params)
    val_scores = np.zeros((n_estimators,), dtype=np.float64)
    for train, test in cv.split(X_train, y_train):
        cv_clf.fit(X_train[train], y_train[train])
        val_scores += heldout_score(cv_clf, X_train[test], y_train[test])
    val_scores /= n_splits
    return val_scores  # loss averaged over the folds


# Estimate best n_estimator using cross-validation
cv_score = cv_estimate(3)

# Compute best n_estimator for test data
test_score = heldout_score(clf, X_test, y_test)  # loss on the test set at each stage

# negative cumulative sum of oob improvements
cumsum = -np.cumsum(clf.oob_improvement_)  # cumulative loss change measured on the out-of-bag samples (those left out of each stage's subsample)

# min loss according to OOB
oob_best_iter = x[np.argmin(cumsum)]

# min loss according to test (normalize such that first loss is 0)
test_score -= test_score[0]  # shift the curve so that the loss at the first stage is 0
test_best_iter = x[np.argmin(test_score)]

# min loss according to cv (normalize such that first loss is 0)
cv_score -= cv_score[0]
cv_best_iter = x[np.argmin(cv_score)]

# color brew for the three curves
oob_color = list(map(lambda x: x / 256.0, (190, 174, 212)))
test_color = list(map(lambda x: x / 256.0, (127, 201, 127)))
cv_color = list(map(lambda x: x / 256.0, (253, 192, 134)))

# plot curves and vertical lines for best iterations
plt.plot(x, cumsum, label='OOB loss', color=oob_color)
plt.plot(x, test_score, label='Test loss', color=test_color)
plt.plot(x, cv_score, label='CV loss', color=cv_color)
plt.axvline(x=oob_best_iter, color=oob_color)  # vertical line at position x, drawn in the given color
plt.axvline(x=test_best_iter, color=test_color)
plt.axvline(x=cv_best_iter, color=cv_color)

# add three vertical lines to xticks
xticks = plt.xticks()  # plt.xticks() returns the x-axis tick locations and labels
xticks_pos = np.array(xticks[0].tolist() +
                      [oob_best_iter, cv_best_iter, test_best_iter])  # add the three tick positions
xticks_label = np.array(list(map(lambda t: int(t), xticks[0])) +
                        ['OOB', 'CV', 'Test'])  # add the three tick labels
ind = np.argsort(xticks_pos)  # indices that sort the tick positions in ascending order
xticks_pos = xticks_pos[ind]  # tick positions in ascending order
xticks_label = xticks_label[ind]  # labels in the same order
plt.xticks(xticks_pos, xticks_label)  # set tick positions and labels

plt.legend(loc='upper right')  # legend position
plt.ylabel('normalized loss')
plt.xlabel('number of iterations')

plt.show()

The result of the run is as follows:


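The x-axis is the number of iterations and the y-axis is the normalized loss. The legend entries are the OOB loss, the test-set loss, and the cross-validation loss.

The figure contains six lines. The three vertical lines mark the iteration at which each of the three estimates reaches its minimum normalized loss, each drawn in the color of its curve. The three curves show how each loss estimate changes as the number of iterations grows: the loss first decreases and then increases, which shows that more iterations are not always better.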
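Since more iterations are not always better, one option is to let the model stop adding trees once a held-out score stops improving. The sketch below uses the `n_iter_no_change`, `validation_fraction`, and `tol` parameters of `GradientBoostingClassifier`. The dataset and the specific values are illustrative assumptions, and this is a different stopping rule from the OOB, CV, and test estimates plotted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)

# Hold out 20% of the training data internally and stop when the
# validation score has not improved by at least tol for 10 iterations
es_clf = GradientBoostingClassifier(n_estimators=1200, learning_rate=0.1,
                                    subsample=0.5, validation_fraction=0.2,
                                    n_iter_no_change=10, tol=1e-4,
                                    random_state=0)
es_clf.fit(X_demo, y_demo)

# n_estimators_ is the number of trees actually fitted
print(es_clf.n_estimators_)
```

The upper bound `n_estimators=1200` is kept so the model can still use many trees if the validation score keeps improving.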

Example 3: Gradient Boosting regularization
Illustration of the effect of different regularization strategies for Gradient Boosting. The example is taken from Hastie et al 2009 [1].
The loss function used is binomial deviance. Regularization via shrinkage (learning_rate < 1.0) improves performance considerably. In combination with shrinkage, stochastic gradient boosting (subsample < 1.0) can produce more accurate models by reducing the variance via bagging. Subsampling without shrinkage usually does poorly. Another strategy to reduce the variance is by subsampling the features analogous to the random splits in Random Forests (via the max_features parameter).
[1] T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning Ed. 2”, Springer, 2009.
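Regularization
  • Shrinkage: the learning rate
In the standard shrinkage update F_m(x) = F_{m-1}(x) + ν·h_m(x), ν is the learning rate. In general, a smaller learning rate approximates the target better and is less prone to overfitting, but it needs more iterations. Empirically, a value around 0.1 is a common choice.
  • Subsampling the training set
Friedman proposed that, at each iteration, the base learner be fit on a subsample of the training set drawn at random without replacement, which can improve the final accuracy of the algorithm.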
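The two strategies above can be seen together in a hand-written toy booster. This is a minimal sketch for squared loss on a synthetic regression problem, not scikit-learn's implementation. The function `boost_sketch`, its parameters, and the toy data are all hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost_sketch(X, y, n_rounds=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Toy gradient boosting for squared loss (illustrative only)."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    n_sub = max(1, int(subsample * n))
    pred = np.full(n, y.mean())              # F_0: constant prediction
    for _ in range(n_rounds):
        residual = y - pred                  # negative gradient of 0.5 * (y - F)^2
        # subsample of the training set drawn at random without replacement
        idx = rng.choice(n, size=n_sub, replace=False)
        tree = DecisionTreeRegressor(max_depth=2, random_state=0)
        tree.fit(X[idx], residual[idx])
        # shrinkage: only a fraction of the tree's output is added
        pred += learning_rate * tree.predict(X)
    return pred


rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.normal(size=300)

pred_toy = boost_sketch(X_toy, y_toy)
mse_start = np.mean((y_toy - y_toy.mean()) ** 2)   # error of the constant model
mse_end = np.mean((y_toy - pred_toy) ** 2)         # error after boosting
print(mse_start, mse_end)
```

Setting `learning_rate=1.0` and `subsample=1.0` in this sketch corresponds to the "No shrinkage" setting in the example that follows.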


# Author: Peter Prettenhofer <[email protected]>
#
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn import datasets


X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
X = X.astype(np.float32)

# map labels from {-1, 1} to {0, 1}
labels, y = np.unique(y, return_inverse=True)

X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

original_params = {'n_estimators': 1000, 'max_leaf_nodes': 4, 'max_depth': None, 'random_state': 2,
                   'min_samples_split': 5}

plt.figure()

for label, color, setting in [('No shrinkage', 'orange',
                               {'learning_rate': 1.0, 'subsample': 1.0}),
                              ('learning_rate=0.1', 'turquoise',
                               {'learning_rate': 0.1, 'subsample': 1.0}),
                              ('subsample=0.5', 'blue',
                               {'learning_rate': 1.0, 'subsample': 0.5}),
                              ('learning_rate=0.1, subsample=0.5', 'gray',
                               {'learning_rate': 0.1, 'subsample': 0.5}),
                              ('learning_rate=0.1, max_features=2', 'magenta',
                               {'learning_rate': 0.1, 'max_features': 2})]:
    params = dict(original_params)
    params.update(setting)

    clf = ensemble.GradientBoostingClassifier(**params)
    clf.fit(X_train, y_train)

    # compute test set deviance
    test_deviance = np.zeros((params['n_estimators'],), dtype=np.float64)

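    for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
        # clf.loss_ is not available in recent scikit-learn releases, so the
        # binomial deviance is computed directly from the raw scores;
        # this assumes that y_test[i] in {0, 1}
        raw = np.ravel(y_pred)
        test_deviance[i] = -2.0 * np.mean(y_test * raw - np.logaddexp(0.0, raw))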

    plt.plot((np.arange(test_deviance.shape[0]) + 1)[::5], test_deviance[::5],
            '-', color=color, label=label)

plt.legend(loc='upper left')
plt.xlabel('Boosting Iterations')
plt.ylabel('Test Set Deviance')

plt.show()

The result is shown below:


The corresponding training times are:


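The x-axis is the number of boosting iterations and the y-axis is the test set deviance. The figure shows the following:

(1) Without shrinkage, the orange curve drops fastest, so a good deviance is reached within few iterations. As the iterations continue, however, its deviance rises somewhat and then levels off.

(2) Next is the blue curve, with subsample=0.5 (a value below 1.0 results in stochastic gradient boosting). It also drops quickly, but it levels off at a higher value than the orange curve.

(3) The remaining three curves follow a similar trend. The best of them is the gray curve (learning_rate=0.1, subsample=0.5): it drops fastest among the three and its deviance stays low throughout.

(4) Across the five settings, the training times rank as: turquoise > gray > blue > orange > magenta.

The conclusion: to get good performance in little time and with few iterations, choose the orange "No shrinkage" setting. If time and iteration count do not matter and only the best performance counts, choose the setting of the gray curve, learning_rate=0.1, subsample=0.5.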
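Training time depends on the machine and the library version, so it is worth measuring locally. A rough sketch of how the five settings could be timed follows. It uses a reduced `n_estimators=100` and only 2000 samples to keep the run short, so absolute numbers will not match the full example. The names `base_params`, `settings`, and `fit_seconds` are illustrative.

```python
import time
from sklearn import datasets, ensemble

X_t, y_t = datasets.make_hastie_10_2(n_samples=2000, random_state=1)

# Same structure as original_params above, but with fewer trees (assumption)
base_params = {'n_estimators': 100, 'max_leaf_nodes': 4, 'max_depth': None,
               'random_state': 2, 'min_samples_split': 5}

settings = {
    'No shrinkage': {'learning_rate': 1.0, 'subsample': 1.0},
    'learning_rate=0.1': {'learning_rate': 0.1, 'subsample': 1.0},
    'subsample=0.5': {'learning_rate': 1.0, 'subsample': 0.5},
    'learning_rate=0.1, subsample=0.5': {'learning_rate': 0.1, 'subsample': 0.5},
    'learning_rate=0.1, max_features=2': {'learning_rate': 0.1, 'max_features': 2},
}

fit_seconds = {}
for label, setting in settings.items():
    model = ensemble.GradientBoostingClassifier(**base_params, **setting)
    start = time.perf_counter()
    model.fit(X_t, y_t)
    fit_seconds[label] = time.perf_counter() - start

# Print the settings from slowest to fastest
for label, seconds in sorted(fit_seconds.items(), key=lambda kv: kv[1],
                             reverse=True):
    print('{:40s} {:.3f} s'.format(label, seconds))
```

A single timing per setting is noisy, so repeating the fit a few times and averaging gives a more reliable ranking.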

That's all............
