Scikit-learn Random Forest Library: Summary and Hyperparameter Tuning Practice

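In the previous post we discussed how the random forest algorithm works and summarized its strengths and weaknesses. We know that a random forest, built within the bagging framework, combines many CART trees grown on randomly chosen features, and that it is a very powerful algorithm. In this post we look at how to use the random forest classes in Scikit-learn. Following the usual routine, we first give an overview of the random forest classes, then explain the commonly used parameters, and finally use a Kaggle dataset to give a full demonstration of random forest tuning.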

1) Overview of the random forest classes

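Random forests can be used for both classification and regression. In Scikit-learn, classification is handled by the RandomForestClassifier class and regression by the RandomForestRegressor class. Their full signatures, as of version 0.22, are as follows: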

sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

sklearn.ensemble.RandomForestRegressor(n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)

As you can see, RandomForestClassifier and RandomForestRegressor share the vast majority of their parameters. Apart from the default value of criterion, the difference is that RandomForestClassifier has one extra parameter for handling class imbalance: class_weight.

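For both RandomForestClassifier and RandomForestRegressor, the parameters fall into two groups. The first group contains the random forest framework parameters, such as n_estimators and oob_score; the second contains the CART tree parameters, such as max_depth and criterion. We go through them group by group below.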
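To check the comparison of the two signatures against whatever Scikit-learn version is installed, one option is to compare the parameter names reported by get_params(). This is only a small sketch; the variable names are ours, and the exact parameter list depends on the installed version.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Parameter names exposed by each estimator in the installed version.
clf_params = set(RandomForestClassifier().get_params())
reg_params = set(RandomForestRegressor().get_params())

only_in_classifier = sorted(clf_params - reg_params)
only_in_regressor = sorted(reg_params - clf_params)
shared = sorted(clf_params & reg_params)

print("classifier only:", only_in_classifier)
print("regressor only:", only_in_regressor)
print("number of shared parameters:", len(shared))
```

The shared names cover both the framework parameters and the CART tree parameters discussed in the next two sections.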

2) Random forest framework parameters

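n_estimators: the number of trees, default 100 (the default was changed to 100 in version 0.22);
This parameter mainly serves to reduce the variance of the overall model. As n_estimators grows, the model's variance falls and its accuracy improves, until beyond a certain value there is no further significant change. In practice you do not need the optimal n_estimators (the larger it is, the higher the time cost); pick a moderate value that suits your machine's computing power and spend the compute on tuning the other hyperparameters.
oob_score: whether to use Out of Bag evaluation, default False;
oob_score is the Out of Bag evaluation we mentioned in the post on random forest principles. The Out of Bag score reflects the model's generalization ability. Setting oob_score=True plays a role similar to evaluating the model with cross-validation, because each sample is scored only by the trees that did not see it during training. In practice, set it to True. The out-of-bag score is available through the oob_score_ attribute.
bootstrap: whether to sample with replacement, default True;
Sampling with replacement increases the diversity of the training sets seen by the individual trees. In practice, keep the default.
max_samples: the maximum number of samples used to train each tree, default None;
This parameter only takes effect when bootstrap=True and specifies how many samples to draw from the training set to train each sub-model. It was added in version 0.22.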
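The framework parameters above can be tried together in a few lines. The sketch below uses synthetic data from make_classification rather than the dataset used later in this post, and the values n_estimators=100 and max_samples=0.8 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees
    bootstrap=True,     # sample with replacement (required for OOB scoring)
    oob_score=True,     # score each sample with trees that never saw it
    max_samples=0.8,    # each tree draws 80% of the training rows
    random_state=0,
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print("OOB accuracy:", round(oob_accuracy, 3))
print("held-out accuracy:", round(test_accuracy, 3))
```

If the two printed numbers are close, the out-of-bag score is acting as a reasonable stand-in for a held-out estimate on this data.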

3) CART tree parameters

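The remaining parameters are CART tree parameters, with essentially the same meaning as the decision tree parameters we discussed earlier. Below we look at the commonly used ones; for the rest, see the post "Scikit-learn Decision Tree Library: Summary and Simple Practice".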

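criterion: how impurity is measured;
Classification trees and regression trees use different loss functions, so impurity is measured differently as well. RandomForestClassifier defaults to gini and also accepts entropy. RandomForestRegressor defaults to mean squared error, mse, and also accepts mean absolute error, mae (newer Scikit-learn releases name these squared_error and absolute_error). In the vast majority of cases the two options make no significant difference, so in practice prefer the default.
max_features: the maximum number of features considered when growing a tree, default auto;
This parameter limits overfitting. max_features caps the number of features considered at each split: a random subset of that size is drawn at every split, and the remaining features are ignored for that split. Reducing the number of candidate features increases the diversity of the base learners, although it also carries a risk of underfitting. With the default auto, the classifier considers $\sqrt {features}$ features per split (newer releases spell this value sqrt). In practice, you can increase the value from the default to check whether the model is underfitting.
max_depth: the maximum depth of a tree, default None;
This is a pruning parameter that limits overfitting: branches beyond the specified depth are not grown. With the default None, a tree grows freely until it meets a stopping condition. The deeper the tree, the lower the model's bias and the higher its variance.
min_samples_split: the minimum number of samples required to split an internal node, default 2;
This is a pruning parameter that limits overfitting. A node with fewer samples than this value is not split any further and becomes a leaf. The larger min_samples_split is, the more branches are cut off, the simpler the tree, and the higher the bias and the lower the variance.
min_samples_leaf: the minimum number of samples in a leaf node, default 1;
This is a pruning parameter that limits overfitting. A split is kept only if it leaves at least this many samples in each child; otherwise it is discarded. The larger min_samples_leaf is, the more branches are cut off, the simpler the tree, and the higher the bias and the lower the variance.
max_leaf_nodes: the maximum number of leaf nodes, default None;
This is a pruning parameter that limits overfitting. With the default None there is no limit on the number of leaves. If a limit is set, the algorithm builds the best tree it can within that number of leaves. The larger max_leaf_nodes is, the more complex the tree, the lower the bias and the higher the variance.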
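To see how the pruning parameters above change the shape of the trees, we can measure the depth and leaf count of the fitted trees. This is a sketch on synthetic data; forest_shape is a helper name invented here, and the parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)

def forest_shape(**tree_params):
    """Return (deepest tree depth, largest leaf count) over the forest's trees."""
    forest = RandomForestClassifier(n_estimators=20, random_state=0,
                                    **tree_params)
    forest.fit(X, y)
    depth = max(tree.get_depth() for tree in forest.estimators_)
    leaves = max(tree.get_n_leaves() for tree in forest.estimators_)
    return depth, leaves

unconstrained = forest_shape()
shallow = forest_shape(max_depth=3)
wide_splits = forest_shape(min_samples_split=50)
big_leaves = forest_shape(min_samples_leaf=20)
few_leaves = forest_shape(max_leaf_nodes=8)

for name, shape in [("default", unconstrained), ("max_depth=3", shallow),
                    ("min_samples_split=50", wide_splits),
                    ("min_samples_leaf=20", big_leaves),
                    ("max_leaf_nodes=8", few_leaves)]:
    print(f"{name:>22}: depth={shape[0]}, leaves={shape[1]}")
```

Each constraint produces smaller trees than the default, which is the "simpler tree, higher bias, lower variance" direction described above.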

4) Practical tips for using the random forest library

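Use grid search with cross-validation to choose the best hyperparameters.
The usual tuning order for random forest parameters is: n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf.
max_features adjusts the tree structure coarsely, so its search space can be wider; min_samples_split and min_samples_leaf adjust it at a finer granularity, so their search spaces can be finer.
Use the random forest's feature_importances_ attribute to inspect feature importance.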
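The tips above can be combined in one short run. The sketch below searches two parameters jointly on synthetic data, which is a variation on the one-parameter-at-a-time order listed above; the grid values are arbitrary assumptions chosen to keep the run fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

# A coarse grid for max_features and a finer one for min_samples_leaf.
param_grid = {"max_features": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 3))

# Importances of the refitted best model; they sum to 1.
importances = search.best_estimator_.feature_importances_
print("feature importances:", importances.round(3))
```

A joint grid costs more fits (here 3 x 3 combinations times 3 folds) but catches interactions between parameters that a sequential search can miss.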

5) Tuning practice

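Below we use the Give Me Some Credit data from a Kaggle competition and demonstrate random forest tuning with grid search, which also gives a more intuitive feel for how each hyperparameter affects the model's bias and variance.
The code and data have been uploaded to my GitHub, so you can download them and run everything yourself. Below we do some simple processing of the data and then walk through the tuning process.
First, import the required Python packages.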

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

import matplotlib.pyplot as plt
import matplotlib as mpl

import warnings
warnings.filterwarnings("ignore")

Read in the data and look at its basic information.

# Read the data
path = './cs-training.csv'
data = pd.read_csv(path)
data.head()

data.shape

# Basic information about the dataset
data.info()

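Look at the distribution of the target variable and at the missing values.

# Class imbalance
data['SeriousDlqin2yrs'].value_counts()

# Proportion of missing values per column
data.isnull().sum()/data.shape[0]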

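The positive and negative samples are extremely imbalanced, so we use class_weight='balanced' to increase the weight of the positive samples.
The proportion of missing values is not high, so we fill them with the mean and the median.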

# Fill missing values with the mean and the median
data['MonthlyIncome'] = data['MonthlyIncome'].fillna(data['MonthlyIncome'].mean())
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].median())
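For the class_weight='balanced' setting mentioned above, the Scikit-learn documentation gives the weight of each class as n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula by hand and compares it with compute_class_weight. The 93/7 label split is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 93 negatives and 7 positives.
y = np.array([0] * 93 + [1] * 7)
classes = np.unique(y)

# Formula documented for class_weight='balanced'.
manual = len(y) / (len(classes) * np.bincount(y))

# The same weights computed by Scikit-learn's helper.
from_sklearn = compute_class_weight(class_weight="balanced",
                                    classes=classes, y=y)

print("manual weights: ", manual.round(4))
print("sklearn weights:", from_sklearn.round(4))
```

The rarer class receives the larger weight, so errors on positive samples count more when the trees are grown.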

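To check how well the out-of-bag evaluation works, we split the dataset into a training set and a validation set. We then run an initial model with the default parameters.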

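# Features: every column from the third onward; target: SeriousDlqin2yrs
x = data.iloc[:, 2:]
y = data['SeriousDlqin2yrs']
x.head()

# Split into a training set and a validation set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)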
print(x_train.shape)
print(y_test.shape)

# Baseline model: default hyperparameters, balanced class weights, OOB scoring on
base_rf = RandomForestClassifier(oob_score=True, random_state=42, class_weight='balanced')
base_rf.fit(x_train, y_train)

y_train_hat = base_rf.predict(x_train)
y_hat = base_rf.predict(x_test)

# Note: the AUC values here are computed from hard class labels, which is what
# the figures quoted in the text are based on.
print('train accuracy_score:', metrics.accuracy_score(y_train, y_train_hat))
print('test accuracy_score:', metrics.accuracy_score(y_test, y_hat))
print('train auc_score:', metrics.roc_auc_score(y_train, y_train_hat))
print('test auc_score:', metrics.roc_auc_score(y_test, y_hat))
print('Out of Bag:', base_rf.oob_score_)

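The results show a test-set accuracy of 0.932 and an out-of-bag score of 0.927. The two are close, which confirms that out-of-bag data can be used for model evaluation. On the other hand, the training-set AUC is 0.9280 against a test-set AUC of 0.5671, so the model is severely overfitting.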
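One caveat about the AUC figures: the code passes the output of predict(), that is hard 0/1 labels, to roc_auc_score, whereas GridSearchCV with scoring='roc_auc' ranks samples by a continuous score. The sketch below, on synthetic imbalanced data rather than the dataset used here, shows the two ways of computing AUC side by side; all names and settings in it are our own assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 7% positives), used only for illustration.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.93, 0.07], flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=0).fit(X_train, y_train)

# AUC from hard labels: only one operating point of the ROC curve is used.
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))

# AUC from predicted probabilities of the positive class: the usual definition.
auc_from_scores = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("AUC from predict():      ", round(auc_from_labels, 3))
print("AUC from predict_proba():", round(auc_from_scores, 3))
```

On imbalanced data the label-based value is typically much lower, so the two kinds of AUC should not be compared with each other directly.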
Next we tune the hyperparameters with grid search and cross-validation. We start with the framework parameter n_estimators, searching from 1 to 201 in steps of 10.

param = range(1,211,10)
parameters ={'n_estimators':param}
gs_rf = GridSearchCV(estimator=RandomForestClassifier(oob_score = True, random_state=42,class_weight='balanced'), 
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)


print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

# Plot the mean and the standard deviation of the cross-validated score
mean_list = gs_rf.cv_results_['mean_test_score']
std_list = gs_rf.cv_results_['std_test_score']

plt.figure(figsize = (23,8),facecolor='w')

plt.subplot(1, 3, 1)
plt.plot(param, mean_list, 'ro-', markeredgecolor='k', lw=2)
plt.ylabel('Mean CV AUC',fontsize=12)
plt.title('AUC_n_estimators')
plt.grid(True, ls=':', color='#606060')

plt.subplot(1, 3, 2)
plt.plot(param, std_list, 'bo-', markeredgecolor='k', lw=2)
plt.ylabel('STD',fontsize=12)
plt.title('std_n_estimators')
plt.grid(True, ls=':', color='#606060')

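The vertical axis of the left plot is the mean cross-validated score (ROC AUC, since scoring='roc_auc'), the vertical axis of the right plot is the standard deviation of that score across folds, and the horizontal axis is the hyperparameter search space. The plots show that as the number of sub-models grows, the standard deviation shrinks and generalization improves, so the overall score rises. Once the number of sub-models passes 120, the score no longer improves significantly. Given the computing power of a laptop, we choose n_estimators=120 as the final value.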
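A cheaper way to watch the effect of n_estimators, different from the grid search used in this post, is to grow a single forest step by step with warm_start=True and record the out-of-bag score after each step. The sketch below does this on synthetic data; the step size and range are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=800, n_features=10, n_informative=5,
                           random_state=0)

# warm_start=True keeps the trees already built and only adds new ones
# each time fit() is called with a larger n_estimators.
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                random_state=0)

oob_by_size = {}
for n_trees in range(20, 121, 20):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    oob_by_size[n_trees] = forest.oob_score_

for n_trees, score in oob_by_size.items():
    print(f"n_estimators={n_trees:3d}  OOB accuracy={score:.3f}")
```

Because trees are reused between steps, the whole curve costs about as much as fitting the largest forest once. Note that the OOB score here is accuracy, not AUC.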
Next we tune the CART tree parameters, starting with criterion, and visualize the tuning results.

param = ['gini','entropy']
parameters ={'criterion':param}

gs_rf = GridSearchCV(estimator=RandomForestClassifier(n_estimators = 120,oob_score = True, random_state=42,class_weight='balanced'), 
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)

print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

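The results show that the model performs almost the same with the Gini index and with entropy. As mentioned in the decision tree theory post, the Gini index avoids the logarithm that entropy requires, so it is somewhat faster to compute and makes a good default. Here, though, entropy does slightly better than Gini. A likely reason is that the data is imbalanced and entropy tends to produce slightly more balanced trees. In the end we choose criterion='entropy'.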
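For reference, the two impurity measures compared here can be computed by hand for a node's class proportions. The functions below are a small illustration written for this post (using a base-2 logarithm for entropy), not Scikit-learn's internal implementation, and the example proportions are made up.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_k * log2(p_k)), skipping classes with p_k = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

for proportions in ([0.5, 0.5], [0.93, 0.07], [1.0, 0.0]):
    print(proportions,
          "gini =", round(gini(proportions), 4),
          "entropy =", round(entropy(proportions), 4))
```

Both measures are zero for a pure node and largest for a 50/50 node; only the entropy needs a logarithm, which is the speed difference mentioned above.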
Next we tune max_features. Its default is $\sqrt {features}=\sqrt{10}\approx 3.2$, since x keeps 10 feature columns. We search from 1 to 9 in steps of 1 and visualize the tuning results.

param = range(1,10,1)
parameters ={'max_features':param}

gs_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators = 120,criterion = 'entropy',oob_score = True,class_weight='balanced', random_state=42), 
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)

print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

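The results show that the model underfits when max_features=1. At max_features=2 the score improves; as max_features increases further, the trees become more correlated and less diverse, and the score drops. In the end we choose max_features=2.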
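Besides plotting, the search results can be read as a table built from cv_results_. The sketch below runs a small max_features search on synthetic data and sorts the candidates by rank; the data and the reduced n_estimators are assumptions made to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={"max_features": list(range(1, 10))},
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

# One row per candidate value, best-ranked first.
columns = ["param_max_features", "mean_test_score",
           "std_test_score", "rank_test_score"]
summary = (pd.DataFrame(search.cv_results_)[columns]
           .sort_values("rank_test_score"))
print(summary.to_string(index=False))
```

Reading mean_test_score next to std_test_score helps judge whether the gap between two candidates is larger than the fold-to-fold noise.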
Next we tune max_depth, which adjusts the tree structure coarsely. We search from 1 to 101 in steps of 10.

param = np.arange(1,110,10)
parameters ={'max_depth':param}

gs_rf = GridSearchCV(estimator=RandomForestClassifier(
    n_estimators = 120,criterion = 'entropy',max_features=2 ,oob_score = True,class_weight='balanced', random_state=42),
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)

print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

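The results show that as the trees get deeper the model moves from underfitting to overfitting, and the score first rises and then falls. Next we narrow the search range.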

param = np.arange(1,22,2)
parameters ={'max_depth':param}

gs_rf = GridSearchCV(estimator=RandomForestClassifier(
    n_estimators = 120,criterion = 'entropy',max_features=2 ,oob_score = True,class_weight='balanced', random_state=42),
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)

print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

In the end we choose max_depth=7.
Next we tune min_samples_split, which adjusts the tree structure at a finer granularity. We search from 2 to 20 in steps of 2.

param = range(2,22,2)
parameters ={'min_samples_split':param}

gs_rf = GridSearchCV(estimator=RandomForestClassifier(
    n_estimators = 120,criterion = 'entropy',max_features =2 ,max_depth = 7,oob_score = True,class_weight='balanced', random_state=42), 
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)

print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

We tune min_samples_leaf in the same way, again searching from 2 to 20 in steps of 2.

param = range(2,22,2)
parameters ={'min_samples_leaf':param}

gs_rf = GridSearchCV(estimator=RandomForestClassifier(n_estimators = 120,criterion = 'entropy',max_features = 2,
     max_depth =7,min_samples_split=18,oob_score = True,class_weight='balanced', random_state=42), 
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)

print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

In the end we choose min_samples_split=18 and min_samples_leaf=16.

Finally we tune max_leaf_nodes, which adjusts the tree structure coarsely. We search from 20 to 230 in steps of 30.

param = range(20,250,30)
parameters ={'max_leaf_nodes':param}

gs_rf = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators = 120,criterion = 'entropy',max_features =2 ,max_depth =7 ,min_samples_split=18 ,
                                     min_samples_leaf=16 ,oob_score = True,class_weight='balanced', random_state=42), 
                     param_grid=parameters, n_jobs=-1, cv=5, scoring='roc_auc' )
gs_rf.fit(x_train,y_train)

print('best_params:', gs_rf.best_params_, gs_rf.best_score_)

We choose max_leaf_nodes=80. Now we combine all of the best parameters:

# Final model with all tuned hyperparameters
rf = RandomForestClassifier(n_estimators=120, criterion='entropy', max_features=2, max_depth=7,
                            min_samples_split=18, min_samples_leaf=16, max_leaf_nodes=80,
                            oob_score=True, class_weight='balanced', random_state=42)
rf.fit(x_train, y_train)

y_train_hat = rf.predict(x_train)
y_hat = rf.predict(x_test)

print('train accuracy_score:', metrics.accuracy_score(y_train, y_train_hat))
print('test accuracy_score:', metrics.accuracy_score(y_test, y_hat))
print('train auc_score:', metrics.roc_auc_score(y_train, y_train_hat))
print('test auc_score:', metrics.roc_auc_score(y_test, y_hat))
print('Out of Bag:', rf.oob_score_)

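The evaluation metrics for rf show that the model no longer overfits, with a test-set AUC of 0.778. Compared with the untuned base_rf model, accuracy is lower but the test-set AUC is higher, so the model generalizes better.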

Next we use the random forest's feature_importances_ attribute to look at the importance of each feature.

# Feature importance, sorted from most to least important
important_features = pd.DataFrame({'feature':x_train.columns,'importance':rf.feature_importances_})
important_features.sort_values(by = 'importance',ascending = False,inplace =True)
important_features['cum_importance'] = np.cumsum(important_features['importance'])
# sel_features = important_features[important_features['cum_importance']<0.95].feature
# sel_features
important_features
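The commented-out lines hint at selecting features by cumulative importance. A possible way to do that is sketched below on synthetic data: keep the top-ranked features until their importances add up to at least 95%. The 0.95 threshold comes from the commented code; the data, the column names f0..f9 and the other variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names f0..f9.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
columns = [f"f{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

ranked = (pd.DataFrame({"feature": columns,
                        "importance": forest.feature_importances_})
          .sort_values("importance", ascending=False)
          .reset_index(drop=True))
ranked["cum_importance"] = ranked["importance"].cumsum()

# Index of the first row whose cumulative importance reaches 0.95,
# plus one, gives the number of features to keep.
n_keep = int(np.searchsorted(ranked["cum_importance"].to_numpy(), 0.95)) + 1
n_keep = min(n_keep, len(ranked))
selected = ranked["feature"].iloc[:n_keep].tolist()

print(ranked.round(4).to_string(index=False))
print("selected features:", selected)
```

Impurity-based importances can favor features with many distinct values, so a selection like this is best re-checked on the validation set.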

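That is how we tune a random forest with grid search. You can also try other tuning methods, such as random search or Bayesian optimization. Tuning model parameters often relies on experience, for example for tree depth and the number of leaf nodes; when there are many parameters to tune, good experience cuts down the number of attempts. It is also worth stressing that building a reliable validation set is very important.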
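As an example of the random search mentioned above, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying every grid point. The sketch below uses synthetic data; the sampling ranges and n_iter=10 are arbitrary assumptions, not values recommended by this post.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# randint(low, high) samples integers from low up to high - 1.
param_distributions = {
    "max_depth": randint(2, 15),
    "min_samples_split": randint(2, 30),
    "min_samples_leaf": randint(1, 20),
    "max_features": randint(1, 6),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled combinations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 3))
```

All four tree parameters are searched at once here, so the cost is set by n_iter rather than by the product of the grid sizes.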

(You are welcome to discuss in the comments. Reposting is also welcome; please credit the source.)
