Spark 2.0 Machine Learning Series, Part 5: GBDT (Gradient Boosted Decision Trees), How GBDT Differs from Random Forests, Parameter Tuning, and a Look at the Scikit-learn Code

  GBDT (Gradient Boosted Decision Trees) is another tree-ensemble algorithm implemented in Spark MLlib (the other being Random Forests). Its building block is again the decision tree, and it is frequently compared with Random Forests.
  I have written two introductory posts on decision trees and random forests that can serve as background:
  Random Forests: introduction, key parameter analysis, and the Spark 2.0 implementation
http://blog.csdn.net/qq_34531825/article/details/52352737
  Decision tree variants and their differences, with Spark 2.0 MLlib and Scikit code analysis
http://blog.csdn.net/qq_34531825/article/details/52330942

Concepts

Other names for GBDT

  GBDT stands for Gradient Boosted Decision Tree.
    The algorithm also goes by several other names, such as MART (Multiple Additive Regression Trees), GBRT (Gradient Boosted Regression Trees), and TreeNet; they all refer to the same thing (see the Wikipedia entry on Gradient Boosting). The inventor is Friedman.
Anyone studying GBDT seriously should read Friedman's paper "Greedy Function Approximation: A Gradient Boosting Machine", which presents the argument and the derivations much more systematically.

What is the gradient boosting algorithm?

  GB (Gradient Boosting) is the gradient boosting algorithm.
  GB is really an algorithmic framework: you can plug an existing classification or regression algorithm into it and obtain a very powerful algorithm.
  Many different base learners can be placed inside the GB framework.
  GB runs M iterations in total, each producing one model, and we want each iteration's model to reduce the loss on the training set. How do we keep making the loss smaller? We use gradient descent: at every iteration we move in the direction of the negative gradient of the loss function, so the loss keeps decreasing and the model becomes increasingly accurate.
  [Figure: one iteration of the gradient boosting framework]
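Since the algorithm figure did not survive, here is the standard per-iteration update written out, following the formulation in Friedman's paper cited above (the notation $r_{im}$, $h_m$, $\rho_m$ is mine, not taken from the missing figure). At iteration $m = 1, \dots, M$, with current model $F_{m-1}$:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad i = 1, \dots, N,$$

fit a base learner $h_m(x)$ to the pairs $\{(x_i, r_{im})\}$, choose the step by line search,

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(x_i) + \rho\, h_m(x_i)\big),$$

and update $F_m(x) = F_{m-1}(x) + \rho_m h_m(x)$.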

Gradient descent comes up constantly in machine learning; a picture makes it easy to grasp:
[Figure: gradient descent on the cost surface J(θ)]
Caption: the parameters θ are adjusted along the negative gradient direction, which moves the cost function J(θ) lower, as shown in the figure; the algorithm stops when θ can no longer descend. The black line is the trajectory of the decreasing cost (error); it always follows the gradient direction, which is also the direction of fastest descent.
Image source:
http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html
See the original post for more detail.
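To make the picture concrete, here is a minimal numeric sketch of gradient descent on a least-squares cost J(θ); the data, starting point, and step size are made up purely for illustration.

#Minimal gradient descent sketch: minimize J(theta) = (1/2m) * sum((X.theta - y)^2)
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  #toy design matrix (bias column + one feature)
y = np.array([2.0, 2.5, 3.5])                        #toy targets
theta = np.zeros(2)                                  #arbitrary starting point
alpha = 0.1                                          #learning rate (step size)

for step in range(500):
    grad = X.T @ (X @ theta - y) / len(y)            #gradient of J at the current theta
    theta = theta - alpha * grad                     #move against the gradient
print(theta)                                         #approaches the least-squares solution

Too large an alpha makes the updates overshoot and diverge; too small an alpha only slows convergence, which mirrors the step-size discussion later in this post.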

How the original Boosting algorithm differs from Gradient Boosting

  Although both are boosting algorithms, the original Boosting algorithm and Gradient Boosting differ in a fundamental way.
  The original Boost algorithm assigns a weight to every sample at the start; initially all samples are equally important. The model obtained at each training step gets some points right and some wrong, so at the end of each step we increase the weights of the misclassified points and decrease the weights of the correctly classified ones. Points that keep being misclassified thus receive "serious attention", i.e., very high weights. After N iterations (specified by the user) we obtain N simple classifiers (base learners), which we then combine, for example by weighting them or letting them vote, into a final model.

Gradient Boosting differs from traditional Boosting in that each round of computation aims to reduce the residual left by the previous round, and to do so it builds a new model in the gradient direction along which the residual decreases. In other words, each new model in Gradient Boosting is built so that the residual of the earlier models shrinks along the gradient direction, which is quite different from traditional Boosting's reweighting of correctly and incorrectly classified samples.
  Plugging decision trees into the GB framework gives GBDT.
  
  

The two versions of GBDT

Reference: http://blog.csdn.net/kunlong0909/article/details/17587101

(1) The residual version
  The residual is simply the difference between the true value and the predicted value. During learning we first fit one regression tree, compute the residuals "true value minus prediction", treat those residuals as the new learning target and fit the next regression tree, and so on, until the residual falls below some threshold close to 0 or the number of regression trees reaches a limit. The core idea is to reduce the loss function by fitting the residual in every round.
  In short, the first tree is fitted normally, and every subsequent tree's decisions are driven entirely by the residuals.
Here is a simple example first:
If the figure is unclear, see:
http://blog.csdn.net/w28971023/article/details/8240756
[Figure: simple GBDT example in which each tree fits the previous trees' residual]
  You can see that the input to the second tree is the residual between the first tree's prediction and the actual value. This reveals some important properties of GBDT that will guide parameter setting (tuning) when we program it in Spark later (more on this below).
  GBDT reduces the error gradually through iteration, with each tree predicting the residual of the one before it. This is completely different from a Random Forest, which predicts with many trees in parallel, so tree structure (such as MaxDepth), running time, predictions, and generalization all differ from a Random Forest. (We will compare them in detail when we get to the Spark code.)
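As a sanity check of the residual version, here is a minimal scikit-learn sketch on made-up data; the data, tree depth, and number of rounds are arbitrary, and this is not the Spark implementation.

#Residual-version sketch: each new tree fits what the current ensemble still gets wrong
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   #toy regression target

F = np.zeros_like(y)                     #current ensemble prediction, starts at 0
trees = []
for m in range(50):
    residual = y - F                     #"true value - prediction"
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    F += tree.predict(X)                 #the new tree's correction is added on
print(np.mean((y - F) ** 2))             #training error keeps shrinking as trees are added

The final model is the sum of all the trees' outputs, exactly the "multiple additive regression trees" picture described above.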

  Algorithm:
  [Figure: residual-version GBDT algorithm (pseudocode image)]
(2) The gradient version
  Whereas the residual version describes GBDT as a residual-iteration tree in which each regression tree learns the residual of the preceding N-1 trees, the gradient version describes GBDT as a gradient-iteration tree, solved with gradient descent, in which each regression tree learns the gradient-descent value of the preceding N-1 trees. What the two have in common: both iterate over regression trees, both sum the outputs of all trees as the final result (Multiple Additive Regression Trees), and each tree learns whatever the preceding N-1 trees still got wrong; in overall flow and in inputs and outputs they are identical.
  The difference lies in whether each iteration uses the gradient as the solution method. The former does not use the gradient but the residual: the residual is the globally optimal value, while the gradient is a locally optimal direction times a step size. That is, the former tries to make the result as good as possible at every step, while the latter only tries to make it a bit better at every step.
  Pros and cons: the former looks more principled at first glance. If an exactly optimal direction exists, why settle for estimating a locally optimal one? The answer is flexibility. The biggest problem with the residual version is that, because it relies on residuals, the cost function is essentially fixed to the mean squared error that reflects residuals, so it struggles with anything other than pure regression. The gradient version solves with gradient descent, so any differentiable cost function can be used.
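A one-line check of the relationship between the two versions: with the squared-error loss the negative gradient is exactly the residual, so the residual version is the gradient version specialized to L2 loss:

$$L(y, F) = \tfrac{1}{2}\,(y - F)^2 \;\;\Longrightarrow\;\; -\frac{\partial L(y, F)}{\partial F} = y - F.$$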
  The algorithm is as follows; see also http://blog.csdn.net/starzhou/article/details/51648219
  These algorithms all come from Friedman's paper; to study the method in depth it is best to read the original and work through the derivations yourself.
  [Figure: gradient-version GBDT algorithm (pseudocode image)]
 
 

Forward stagewise algorithm

  GBDT can be seen as a forward stagewise algorithm.
  More generally, the forward stagewise algorithm can be written in two forms: the first is the step-by-step model update, and the second is the additive model:
[Figure: forward stagewise update and the resulting additive model]
  Intuitively: walk forward one step at a time, gradually approaching the target. How fast you walk can be controlled by one more parameter, the learning rate (see the regularization section below).
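Since the formula image is missing, here are the two standard forms as they appear in textbook treatments of forward stagewise additive modeling (the notation is the usual one, not taken from the lost figure):

$$f_m(x) = f_{m-1}(x) + \beta_m\, b(x;\gamma_m), \qquad (\beta_m, \gamma_m) = \arg\min_{\beta,\gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\big),$$

$$f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m\, b(x;\gamma_m).$$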
  

Regularization (learning rate)

Shrinkage
Friedman (2001) proposed a simple regularization strategy that scales the contribution of each weak learner by a factor ν:
[Formula image; the shrinkage update is $F_m(x) = F_{m-1}(x) + \nu\, \gamma_m h_m(x)$]
The parameter ν is also called the learning rate because it scales the step length of the gradient descent procedure; it can be set via the learning_rate parameter.

  How did the learning rate end up under regularization? Intuitively: approaching the target in many small steps is less prone to overfitting than rushing there in a few big strides.

GBDT in Spark 2.0

Advantages of GBDT

  Like Random Forests, GBDT inherits several advantages of decision trees:
  (1) it can handle both categorical and continuous features;
  (2) it does not require feature scaling or other standardization;
  (3) it can capture interactions between features.
  Note that GBDT in Spark cannot yet handle multiclass problems; it only supports binary classification and regression. (Spark's Random Forest does support multiclass classification.)

  Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, do not require feature scaling, and are able to capture non-linearities and feature interactions.

  spark.mllib supports GBTs for binary classification and for regression, using both continuous and categorical features. spark.mllib implements GBTs using the existing decision tree implementation. Please see the decision tree guide for more information on trees.

  Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or Random Forests.

GBDT vs. Random Forests in practice

  Although GBDT and Random Forests are both decision-tree ensembles, their training processes are very different.
  GBDT trains one tree at a time, each after the previous one (serially), so compared with a Random Forest, which builds many trees in parallel, it needs longer training time.
  In GBDT one generally chooses shallower trees (smaller depth) than in a Random Forest (where the trees need little pruning), which reduces the computation per tree.
  Random Forests are less prone to overfitting, and the more trees the forest contains, the less it seems to overfit. In statistical terms, adding more trees reduces the variance of the predictions (repeated predictions become more stable). GBDT is just the opposite: the more trees it contains (i.e., the more iterations), the more it tends to overfit; in statistical terms, increasing the number of GBDT iterations reduces the bias (the gap between the predictions and the training labels). (Bias and variance are different concepts; see the figure further below.)
  Random Forest parameters are relatively easier to tune, because as the number of trees grows the predictive performance generally improves monotonically. GBDT is different: performance first improves as trees are added, but past a certain point it starts to degrade as more trees are added. A small scikit-learn sketch right below illustrates this tuning difference.
  In short, both algorithms are very effective; which to choose depends on the actual data.
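As a rough illustration of that tuning difference, here is a small scikit-learn sketch (not the Spark comparison itself) that sweeps the number of trees for both ensembles on the same toy dataset used in the scikit section later in this post; the concrete numbers will depend on the data.

#Sketch: test accuracy vs. number of trees for Random Forest and GBT on a toy dataset
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

for n in [10, 50, 100, 500, 1000]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    gbt = GradientBoostingClassifier(n_estimators=n, learning_rate=1.0,
                                     max_depth=1, random_state=0).fit(X_train, y_train)
    print(n, "RF:", rf.score(X_test, y_test), "GBT:", gbt.score(X_test, y_test))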

   Gradient-Boosted Trees vs. Random Forests
  Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:

  GBTs train one tree at a time, so they can take longer to train than random forests. Random Forests can train multiple trees in parallel.

  On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.

  Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)

  Random Forests can be easier to tune since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).

  In short, both algorithms can be effective, and the choice should be based on the particular dataset.

The difference between bias and variance:
  Bias describes the gap between the expected value of the predictions (estimates) and the true value; the larger the bias, the further from the true data, as in the second row of the figure below.
  Variance describes the spread of the predictions, i.e., their dispersion around their expected value; the larger the variance, the more scattered the distribution, as in the right column of the figure below.
[Figure: bias vs. variance illustrated with bullseye diagrams]

Key parameters

  Three key parameters deserve careful analysis: loss, numIterations, and learningRate. They can be set as shown below.

//Define the GBTClassifier; note that in Spark the output (prediction) columns have default names, so they need not be set explicitly
GBTClassifier gbtClassifier=new GBTClassifier()
                            .setLabelCol("indexedLabel")//input label column
                            .setFeaturesCol("indexedFeatures")//input feature vector column
                            .setMaxIter(MaxIter)//maximum number of iterations
                            .setImpurity("entropy")//or "gini"
                            .setMaxDepth(3)//depth of each decision tree
                            .setStepSize(0.3)//must lie in (0, 1]
                            .setSeed(1234); //optional random seed

loss (type of loss function)

  Spark currently implements the following three loss types; note that each applies to only one kind of problem, either regression or classification.
  Classification can only use Log Loss; regression can use squared error or absolute error, also known as the L2 loss and the L1 loss respectively. Absolute error (the L1 loss) is more robust than the L2 loss on data containing outliers.
[Table: supported losses in Spark — Log Loss (classification), Squared Error / L2 (regression), Absolute Error / L1 (regression)]
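Spark's loss is set on the estimator, but the L1-vs-L2 robustness claim is easy to sketch with scikit-learn's GradientBoostingRegressor on made-up data with outliers; the loss names below are scikit-learn's ("squared_error"/"absolute_error" in recent releases, "ls"/"lad" in older ones), not Spark's.

#Sketch: L2 (squared error) vs. L1 (absolute error) boosting on data with outliers
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0])
y[::30] += 8.0                                   #inject a few large outliers

clean = np.sin(X[:, 0])                          #outlier-free target, for comparison only
for loss in ["squared_error", "absolute_error"]: #use "ls"/"lad" on older scikit-learn
    model = GradientBoostingRegressor(loss=loss, n_estimators=100,
                                      max_depth=2, random_state=0).fit(X, y)
    print(loss, np.mean((model.predict(X) - clean) ** 2))

The L1 fit is typically pulled around less by the injected outliers, which is the robustness property mentioned above.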

numIterations (number of iterations)

  This is the number of GBDT iterations; each iteration produces one tree, so numIterations is also the number of trees in the model. Increasing numIterations raises prediction accuracy on the training data (note: on the training data), but it also increases training time. Choosing this parameter so as to avoid overfitting really requires validation: split the data into a training set and a validation set.
  As the number of iterations grows, the prediction error on the validation set first decreases and then, past a certain point, increases again, so the accuracy-vs-iterations curve lets you pick the most suitable numIterations.
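Spark's runWithValidation (quoted below) automates this; the same idea can be sketched in scikit-learn with staged_predict, which yields the prediction after each boosting iteration, so the validation curve and the best number of trees fall out directly. The dataset and parameters here are only for illustration.

#Sketch: choose the number of trees from a validation curve (scikit-learn analogue of numIterations tuning)
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_val = X[:2000], X[2000:4000]          #simple train/validation split
y_train, y_val = y[:2000], y[2000:4000]

clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                 max_depth=1, random_state=0).fit(X_train, y_train)
val_acc = [np.mean(pred == y_val) for pred in clf.staged_predict(X_val)]
best_n = int(np.argmax(val_acc)) + 1             #iteration with the best validation accuracy
print(best_n, val_acc[best_n - 1])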

learningRate (learning rate)

  This parameter usually does not need tuning. If the algorithm behaves very unstably on a particular dataset, reduce the learning rate and try again; this generally improves stability. A small learning rate (step size) will certainly increase training time.

  (1) loss: See the section above for information on losses and their applicability to tasks (classification vs. regression). Different losses can give significantly different results, depending on the dataset.
  (2) numIterations: This sets the number of trees in the ensemble. Each iteration produces one tree. Increasing this number makes the model more expressive, improving training data accuracy. However, test-time accuracy may suffer if this is too large.
  (3) learningRate: This parameter should not need to be tuned. If the algorithm behavior seems unstable, decreasing this value may improve stability.

Validation while training
  Gradient boosting can overfit when trained with more trees. In order to prevent overfitting, it is useful to validate while training. The method runWithValidation has been provided to make use of this option. It takes a pair of RDD’s as arguments, the first one being the training dataset and the second being the validation dataset.
  The training is stopped when the improvement in the validation error is not more than a certain tolerance (supplied by the validationTol argument in BoostingStrategy). In practice, the validation error decreases initially and later increases. There might be cases in which the validation error does not change monotonically, and the user is advised to set a large enough negative tolerance and examine the validation curve using evaluateEachIteration (which gives the error or loss per iteration) to tune the number of iterations.

Complete code

The Spark 2.0 DataFrame/Pipeline-based code needs some preprocessing steps; see my other post for a detailed walkthrough:
Decision tree variants and their differences, with Spark 2.0 MLlib and Scikit code analysis
http://blog.csdn.net/qq_34531825/article/details/52330942

//Complete Spark 2.0 GBDT code
package my.spark.ml.practice.classification;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.GBTClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class myGDBT {

    public static void main(String[] args) {
        SparkSession spark=SparkSession
                .builder()
                .appName("CoFilter")
                .master("local[4]")
                .config("spark.sql.warehouse.dir",
                        "file///:G:/Projects/Java/Spark/spark-warehouse" )
                .getOrCreate();         

        String path="C:/Users/user/Desktop/ml_dataset/classify/horseColicTraining2libsvm.txt";
        String path2="C:/Users/user/Desktop/ml_dataset/classify/horseColicTest2libsvm.txt";
        //suppress logging
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR);//WARN
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF);   

        Dataset<Row> training=spark.read().format("libsvm").load(path);
        Dataset<Row> test=spark.read().format("libsvm").load(path2);        

        StringIndexerModel indexerModel=new StringIndexer()
                .setInputCol("label")
                .setOutputCol("indexedLabel")
                .fit(training);
        VectorIndexerModel vectorIndexerModel=new VectorIndexer()
                .setInputCol("features")
                .setOutputCol("indexedFeatures")
                .fit(training);
        IndexToString converter=new IndexToString()
                .setInputCol("prediction")
                .setOutputCol("convertedPrediction")
                .setLabels(indexerModel.labels());
        //Sweep MaxIter, stepSize (learning rate), and maxDepth; both impurity types are also tested
       for (int MaxIter = 30; MaxIter < 40; MaxIter+=10)
          for (int maxDepth = 2; maxDepth < 3; maxDepth+=1)
              for (int impurityType = 1; impurityType <2; impurityType+=1)
                 for (int stepSize = 1; stepSize < 10; stepSize += 1) {
                    long begin = System.currentTimeMillis();//training start time
                    String impurityType_=null;//impurity type selection
                    if (impurityType==1) {
                        impurityType_="gini";
                    }
                    else  {
                        impurityType_="entropy";
                    }
                    double stepSize_=0.1*stepSize;
                    GBTClassifier gbtClassifier=new GBTClassifier()
                            .setLabelCol("indexedLabel")
                            .setFeaturesCol("indexedFeatures")
                            .setMaxIter(MaxIter)
                            .setImpurity(impurityType_)//.setImpurity("entropy")
                            .setMaxDepth(maxDepth)
                            .setStepSize(stepSize_)//must lie in (0, 1]
                            .setSeed(1234);                     

                    PipelineModel pipeline=new Pipeline().setStages
                            (new PipelineStage[]
                                    {indexerModel,vectorIndexerModel,gbtClassifier,converter})
                            .fit(training);     
                    long end=System.currentTimeMillis();        

                    //always evaluate on the test dataset
                    Dataset<Row> predictDataFrame=pipeline.transform(test);     

                    double accuracy=new MulticlassClassificationEvaluator()
                            .setLabelCol("indexedLabel")
                            .setPredictionCol("prediction")
                            .setMetricName("accuracy").evaluate(predictDataFrame);          
                    String str_accuracy=String.format(" accuracy = %.4f ", accuracy);
                    String str_time=String.format(" training time = %d ", (end-begin));
                    String str_maxIter=String.format(" maxIter = %d ", MaxIter);
                    String str_maxDepth=String.format(" maxDepth = %d ", maxDepth);
                    String str_stepSize=String.format(" stepSize = %.2f ", stepSize_);
                    String str_impurityType_=" impurityType = "+impurityType_;
                    System.out.println(str_maxIter+str_maxDepth+str_impurityType_+
                            str_stepSize+str_accuracy+str_time);

                }//Params Cycle         
    }   
}

/*The parameter analysis below applies only to this small dataset; other data can behave very differently. This is merely a very simple test.*/
/**Effect of the number of iterations: as it grows, test accuracy improves at first, while training time grows roughly linearly.
maxIter = 1  maxDepth = 1  impurityType = entropy stepSize = 0.10  accuracy = 0.7313  training time = 1753 
 maxIter = 11  maxDepth = 1  impurityType = entropy stepSize = 0.10  accuracy = 0.7463  training time = 2820 
 maxIter = 21  maxDepth = 1  impurityType = entropy stepSize = 0.10  accuracy = 0.7612  training time = 5043 
 maxIter = 31  maxDepth = 1  impurityType = entropy stepSize = 0.10  accuracy = 0.7761  training time = 7217 
 maxIter = 41  maxDepth = 1  impurityType = entropy stepSize = 0.10  accuracy = 0.7761  training time = 9932 
 maxIter = 51  maxDepth = 1  impurityType = entropy stepSize = 0.10  accuracy = 0.7761  training time = 12337 
 maxIter = 61  maxDepth = 1  impurityType = entropy stepSize = 0.10  accuracy = 0.7761  training time = 15091 
 */
/**Effect of maxDepth: prediction accuracy peaks at maxDepth = 2 and then starts to fall, which confirms that the trees in GBDT should be kept shallow.
Training time grows with maxDepth, though not linearly:
[Figures: accuracy vs. maxDepth and training time vs. maxDepth]
*/

/**Comparing the two impurity measures: with this data and these parameters there is no difference.
maxIter = 30 maxDepth = 2 impurityType = gini stepSize = 0.10 accuracy = 0.7910 training time = 10522
maxIter = 30 maxDepth = 2 impurityType = entropy stepSize = 0.10 accuracy = 0.7910 training time = 8824
*/

Learning rate (step size): the learning rate also affects accuracy; setting it too large lowers the prediction accuracy.
[Figure: accuracy vs. step size (learning rate)]

Learning more about GBDT in Scikit-learn

  The scikit-learn machine learning library generally offers richer documentation and examples, so let's continue studying there.
  There the algorithm is called Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT). It is the same thing: the trees inside GBDT are in general regression trees (not classification trees). The algorithm is widely used in search ranking.

   Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.

Implementing it in scikit-learn is even simpler:

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

#load a demo dataset
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

#set the parameters and train the classifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
max_depth=1, random_state=0).fit(X_train, y_train)

#evaluate on the test set
clf.score(X_test, y_test)
Out[7]: 0.91300000000000003

n_estimators is the number of weak learners; it corresponds to the maximum number of iterations maxIter in Spark 2.0 (i.e., the number of decision trees, since the weak learner here is a decision tree).
learning_rate corresponds to stepSize in Spark 2.0.
Note that n_estimators and learning_rate interact: a smaller learning rate requires more weak learners to maintain a constant training error.
Empirical evidence suggests that a smaller learning rate gives better prediction accuracy on the test set.
[HTF2009] recommends setting the learning rate to a small constant (e.g., at most 0.1) and choosing n_estimators by early stopping; [R2007] discusses the interaction between learning_rate and n_estimators in more detail.

[HTF2009] T. Hastie, R. Tibshirani and J. Friedman, "The Elements of Statistical Learning", 2nd ed., Springer, 2009.
[R2007] G. Ridgeway, "Generalized Boosted Models: A guide to the gbm package", 2007.
I have not yet had time to read these two references; I hope to study them when time allows.
The parameter learning_rate strongly interacts with the parameter n_estimators, the number of weak learners to fit. Smaller values of learning_rate require larger numbers of weak learners to maintain a constant training error. Empirical evidence suggests that small values of learning_rate favor better test error. [HTF2009] recommend to set the learning rate to a small constant (e.g. learning_rate <= 0.1) and choose n_estimators by early stopping. For a more detailed discussion of the interaction between learning_rate and n_estimators see [R2007].

A similar loop makes it easy to run all sorts of parameter tests:

#GBDT parameter-test code in Python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]

y_train, y_test = y[:2000], y[2000:]

'''
n_estimators_ =[10,100,300,500,1000]
learning_rate_=[0.05,0.10,0.2,0.5,1.0]
for i in range(5):
    for j in range(5): 
        clf = GradientBoostingClassifier(n_estimators=n_estimators_[i],\
        learning_rate=learning_rate_[j],\
        max_depth=1,random_state=0).fit(X_train, y_train)

        print ("n_estimators = "+str(n_estimators_[i])\
        +"  learning_rate = "+str(learning_rate_[j])+ \
        "  score = "+str(clf.score(X_test, y_test)))
'''
n_estimators_ =[10,100,300,500,1000,2000,5000]
learning_rate_=[0.05]
for i in range(7):
    for j in range(1): 
        clf = GradientBoostingClassifier(n_estimators=n_estimators_[i],\
        learning_rate=learning_rate_[j],\
        max_depth=1,random_state=0).fit(X_train, y_train)

        print ("n_estimators = "+str(n_estimators_[i])\
        +"  learning_rate = "+str(learning_rate_[j])+ \
        "  score = "+str(clf.score(X_test, y_test)))

Set a very small learning rate of 0.05 and gradually increase the number of weak learners.
You can see that with such a small learning rate it indeed takes many weak learners to reach a good result, but the test accuracy keeps improving.

[Figure: test score vs. n_estimators at learning_rate = 0.05]

With a large learning rate, far fewer n_estimators are needed to reach a similar result. (Considering model stability, however, a very large learning rate is still not recommended.)

n_estimators = 10 learning_rate = 0.5 score = 0.6889
n_estimators = 100 learning_rate = 0.5 score = 0.8987
n_estimators = 300 learning_rate = 0.5 score = 0.9291
n_estimators = 500 learning_rate = 0.5 score = 0.9378
n_estimators = 1000 learning_rate = 0.5 score = 0.9444
n_estimators = 2000 learning_rate = 0.5 score = 0.9475
n_estimators = 5000 learning_rate = 0.5 score = 0.9469

What happens with an enormous number of trees? (Even on this toy dataset, training takes a long time.)
We can see that the test accuracy eventually converges to a value (beyond roughly 2000-5000 trees).

n_estimators = 100 learning_rate = 0.1 score = 0.8189
n_estimators = 500 learning_rate = 0.1 score = 0.8975
n_estimators = 1000 learning_rate = 0.1 score = 0.9203
n_estimators = 5000 learning_rate = 0.1 score = 0.9428
n_estimators = 10000 learning_rate = 0.1 score = 0.9463
n_estimators = 20000 learning_rate = 0.1 score = 0.9465
n_estimators = 50000 learning_rate = 0.1 score = 0.9457

References:

(1) Spark documentation
http://spark.apache.org/docs/latest/mllib-ensembles.html
(2) Mathematics in Machine Learning (1): regression and gradient descent
http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html
(3) GBDT (MART) iterative decision tree tutorial | introduction
http://blog.csdn.net/w28971023/article/details/8240756
  
  
