[Hands-on Machine Learning, from Getting Started to Giving Up] A Complete Guide to sklearn Decision Trees and Their Tuning

Build a decision tree with sklearn and tune its parameters.

The tree module in the sklearn package implements CART trees, but it currently does not accept categorical (discrete) features as input directly.
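Because of this, any categorical columns have to be encoded as numbers before they reach the model. Below is a minimal sketch (with made-up column names, not the dataset used in this article) that one-hot encodes a categorical column with pandas before fitting a tree:

# Minimal sketch with hypothetical data: one-hot encode categorical columns first
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

raw = pd.DataFrame({
    "age": [25, 38, 47, 52],
    "workclass": ["Private", "Self-emp", "Private", "Gov"],  # categorical column
    "label": [0, 0, 1, 1],
})
X_demo = pd.get_dummies(raw[["age", "workclass"]])  # workclass -> 0/1 dummy columns
y_demo = raw["label"]
clf_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
print(X_demo.columns.tolist())  # the expanded dummy column names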

from sklearn import tree
from sklearn.model_selection import train_test_split
import graphviz
from sklearn import metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Training without any tuning

# Load the data that was already preprocessed in the previous SVM notebook
X=pd.read_csv("american_salary_feture.csv")
y=pd.read_csv("american_salary_label.csv",header=None)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=24)
clf=tree.DecisionTreeClassifier(random_state=24)
clf = clf.fit(X_train, y_train)
clf.score(X_test,y_test)
0.8135364205871515
print("test:",metrics.f1_score(clf.predict(X_test),y_test))
print("train",metrics.f1_score(clf.predict(X_train),y_train))
test: 0.6277587052476704
train 0.9999142440614013

Without any tuning this is clearly an overfit model: the performance on the training set is far better than on the test set, which is a well-known weakness of decision trees.

Below we apply a few techniques to make this decision tree behave more sensibly.

Parameter descriptions from the official sklearn documentation

  • criterion : string, optional (default=”gini”)

    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

Choose "gini" or "entropy": whether split quality is measured with the Gini impurity or with the information gain (entropy).

  • splitter : string, optional (default=”best”)

    The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

The strategy used to pick the split at each node: "best" chooses the best split, "random" chooses the best of a set of random candidate splits.

  • max_depth : int or None, optional (default=None)

    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

The depth of the tree. If not set, the tree keeps splitting until every leaf contains only one class or until every leaf holds fewer than min_samples_split samples.

  • min_samples_split : int, float, optional (default=2)

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number.

    If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

    Changed in version 0.18: Added float values for fractions.

If an integer, this is the minimum number of samples an internal node must contain before it may be split; if a float, the minimum is ceil(min_samples_split * n_samples). (A short sketch after this parameter list illustrates the int vs. float forms.)

  • min_samples_leaf : int, float, optional (default=1)

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number.

    If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

    Changed in version 0.18: Added float values for fractions.

If an integer, this is the minimum number of samples each leaf node must contain; if a float, the minimum is ceil(min_samples_leaf * n_samples). A candidate split that would leave fewer samples than this in either child is not allowed.

  • min_weight_fraction_leaf : float, optional (default=0.)

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

The minimum fraction of the total sample weight that a leaf node must hold; if sample weights are not provided, every sample carries equal weight.

  • max_features : int, float, string or None, optional (default=None)

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split.

    If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    If “auto”, then max_features=sqrt(n_features).

    If “sqrt”, then max_features=sqrt(n_features).

    If “log2”, then max_features=log2(n_features).

    If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

The maximum number of features to consider when looking for the best split. If an integer, exactly that many features are considered at each split; if a float, int(max_features * n_features) features are considered; "auto" and "sqrt" use the square root of the number of features; "log2" uses log2 of the number of features; None uses all features. Note that the search does not stop until at least one valid split of the node samples is found, even if that means effectively inspecting more than max_features features.

  • random_state : int, RandomState instance or None, optional (default=None)

    If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

The random seed (or generator) used for shuffling; by default the RandomState instance of np.random is used.

  • max_leaf_nodes : int or None, optional (default=None)

    Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

The maximum number of leaf nodes. Leaves are grown best-first, ranked by impurity reduction; if this parameter is not passed, the number of leaves is unlimited.

  • min_impurity_decrease : float, optional (default=0.)

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

    The weighted impurity decrease equation is the following:

    N_t / N * (impurity - N_t_R / N_t * right_impurity
    - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

    New in version 0.19.

Minimum impurity decrease for a split: a node is split only when the split reduces the (weighted) impurity by at least this value, computed with the formula above.

  • min_impurity_split : float, (default=1e-7)

    Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

    Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split will change from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use min_impurity_decrease instead.

Early-stopping threshold for tree growth: a node is split only if its impurity is above this value, otherwise it becomes a leaf. (Deprecated in favor of min_impurity_decrease.)

  • class_weight : dict, list of dicts, “balanced” or None, default=None

    Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

    Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

    The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

    For multi-output, the weights of each column of y will be multiplied.

    Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

Class weights, one weight per class, in the form shown above. For multi-output problems the weights of the columns of y are multiplied together, and if sample_weight is passed to the fit method, the class weights are multiplied with those sample weights.

  • presort : bool, optional (default=False)

    Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.

Presorting can be used to speed up training; note that with the default settings on a large dataset presorting actually slows training down, while on small datasets or with a restricted depth it can speed things up.
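To make the int vs. float forms above more concrete, here is a small sketch (not from the original notebook) that builds two classifiers, one with absolute counts and one with fractions:

# Sketch: the same pre-pruning limits expressed as counts and as fractions
from sklearn.tree import DecisionTreeClassifier

clf_counts = DecisionTreeClassifier(
    min_samples_split=20,   # a node needs at least 20 samples before it may be split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    max_features=10,        # inspect at most 10 features per split
)
clf_fracs = DecisionTreeClassifier(
    min_samples_split=0.01,   # ceil(0.01 * n_samples) samples required to split
    min_samples_leaf=0.005,   # ceil(0.005 * n_samples) samples per leaf
    max_features=0.5,         # int(0.5 * n_features) features per split
)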

Decision tree tuning tips from the official sklearn documentation

The official documentation offers a few tuning tips; the original text is quoted below, each followed by a short note.

  • Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.

Balance the number of features against the number of samples: a tree grown from few samples in a high-dimensional space overfits very easily.

  • Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.

Reduce the dimensionality beforehand (PCA, ICA, or feature selection) to give the tree a better chance of finding discriminative features.

  • Understanding the decision tree structure will help in gaining more insights about how the decision tree makes predictions, which is important for understanding the important features in the data.

Understanding the structure of the tree helps in understanding which features in the data are important.

  • Visualise your tree as you are training by using the export function. Use max_depth=3 as an initial tree depth to get a feel for how the tree is fitting to your data, and then increase the depth.

Visualize the tree while training; start with a small depth (max_depth=3) to see whether the tree fits the data at all, and then increase the depth gradually.

  • Remember that the number of samples required to populate the tree doubles for each additional level the tree grows to. Use max_depth to control the size of the tree to prevent overfitting.

Every extra level of depth roughly doubles the number of samples needed to populate the tree, so control the depth with max_depth to avoid overfitting.

  • Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float number can be used as percentage in these two parameters. While min_samples_split can create arbitrarily small leaves, min_samples_leaf guarantees that each leaf has a minimum size, avoiding low-variance, over-fit leaf nodes in regression problems. For classification with few classes, min_samples_leaf=1 is often the best choice.

Use min_samples_split and min_samples_leaf to make sure every decision in the tree is supported by several samples; leaves with very few samples usually indicate overfitting, while very large values keep the tree from learning the data. Start with min_samples_leaf=5; for classification problems with few classes, min_samples_leaf=1 is often the best choice.

  • Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.

Balance the dataset before training so the tree is not dominated by the majority classes: sample the same number of examples from each class or, preferably, normalize the sum of the sample weights of each class to the same value. Weight-based pre-pruning criteria such as min_weight_fraction_leaf are then less biased toward the dominant classes than weight-unaware criteria like min_samples_leaf. (A small sketch after this list shows one way to build such weights.)

  • If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensure that leaf nodes contain at least a fraction of the overall sum of the sample weights.

If the samples are weighted, use a weight-aware pre-pruning criterion such as min_weight_fraction_leaf, which guarantees that every leaf holds at least a given fraction of the total sample weight.

  • All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset will be made.

Internally all decision trees work on np.float32 arrays; if the training data is not already in that format, a copy of the dataset is made. (To save memory, convert the data type yourself beforehand.)

  • If the input matrix X is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples.

For samples with many zero values (sparse data), convert the input to a sparse csc_matrix before calling fit and to a csr_matrix before calling predict; this can save a lot of training time.
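One convenient way to follow the class-balancing tip above is sklearn's compute_sample_weight helper, sketched here on a tiny made-up label vector; the resulting weights can then be passed to fit via sample_weight:

# Sketch: give every class the same total weight instead of resampling
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced toy labels
w = compute_sample_weight(class_weight="balanced", y=y_demo)
print(w)  # minority-class samples receive proportionally larger weights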

Below we tune the model following these official recommendations.

Building a three-level tree

clf=tree.DecisionTreeClassifier(max_depth =3,random_state=24)
clf.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=24, splitter='best')
# Visualization
feature_name = X_train.columns
class_name =['0','1']
dot_data = tree.export_graphviz(clf, out_file=None, 
                     feature_names=feature_name,  
                     class_names=class_name,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

[Figure: the depth-3 decision tree rendered with graphviz]

# Score
clf.score(X_test,y_test)
0.8400687876182287
print("test:",metrics.f1_score(clf.predict(X_test),y_test))
print("train",metrics.f1_score(clf.predict(X_train),y_train))
test: 0.6134204275534441
train 0.6132760560499131

After drastically reducing the tree depth the overfitting problem is gone, although the model may now be underfitting.

Balancing the samples

y_train[0].value_counts()
0    18589
1     5831
Name: 0, dtype: int64
# Build sample weights
sample_weight = np.zeros(y_train.shape[0])
sample_weight[y_train[0]==0]=1
sample_weight[y_train[0]==1]=18589/5831
clf=tree.DecisionTreeClassifier(max_depth =3, random_state=24)
clf = clf.fit(X_train, y_train,sample_weight=sample_weight)
print(clf.score(X_test,y_test))
0.8161159562707284
print("test:",metrics.f1_score(clf.predict(X_test),y_test))
print("train",metrics.f1_score(clf.predict(X_train),y_train))
test: 0.6509675915131732
train 0.6576849489795917

After balancing the samples the F1 score improves considerably; accuracy drops a little, but this model should be better suited for prediction.
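An alternative to building the weights by hand is to let the classifier reweight the classes itself with class_weight="balanced"; since the per-class weight ratio ends up the same as above, this sketch (not part of the original run) should give an essentially equivalent model:

# Sketch: equivalent balancing via the class_weight parameter
clf_bal = tree.DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=24)
clf_bal.fit(X_train, y_train)
print("test:", metrics.f1_score(clf_bal.predict(X_test), y_test))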

Converting to a sparse matrix to speed up training

Following the official advice: since our training data contains many dummy variables, i.e. a lot of zeros, compressing it into a sparse matrix might speed up training. Let's check.

from scipy import sparse
sX_train = sparse.csc_matrix(X_train)

Without compression

%%time
clf=tree.DecisionTreeClassifier(max_depth=30,random_state=24)
clf = clf.fit(X_train, y_train)
CPU times: user 314 ms, sys: 4.98 ms, total: 319 ms
Wall time: 319 ms

With compression

%%time
clf=tree.DecisionTreeClassifier(max_depth=30,random_state=24)
clf = clf.fit(sX_train, y_train)
CPU times: user 913 ms, sys: 3.73 ms, total: 917 ms
Wall time: 916 ms

The sparse version actually turns out to be slower, probably because the dataset does not contain enough zeros, so we drop the compression.
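Whether the sparse format pays off depends on how many entries really are zero; a quick check (a sketch, not in the original notebook) is to measure the fraction of zeros in the training matrix:

# Sketch: how sparse is the training data really?
zero_fraction = (X_train.values == 0).mean()
print("fraction of zero entries:", zero_fraction)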

Tuning the tree depth

We have a bit over 20,000 samples, roughly between 2^14 and 2^15, so a tree depth of about 10 or less is reasonable; deeper trees overfit very easily.

depths = range(3,50)
score=np.zeros(50)
f1_score_train=np.zeros(50)
f1_score_test=np.zeros(50)
for depth in depths:
    clf=tree.DecisionTreeClassifier(max_depth =depth, random_state=24)
    clf.fit(X_train,y_train,sample_weight=sample_weight)
    score[depth]=clf.score(X_test,y_test)
    f1_score_test[depth] = metrics.f1_score(clf.predict(X_test), y_test)
    f1_score_train[depth] = metrics.f1_score(clf.predict(X_train), y_train)
plt.figure(figsize=(10,6))
sns.set(style="whitegrid")
data = pd.DataFrame({"score":score,"f1_score_train": f1_score_train,"f1_score_test": f1_score_test})
sns.lineplot(data=data)
plt.xlabel("tree_depth")
plt.ylabel("score")
plt.title("scores varies with tree depths")
Text(0.5, 1.0, 'scores varies with tree depths')

[Figure: accuracy and F1 scores versus tree depth]

clf=tree.DecisionTreeClassifier(max_depth =8, random_state=24)
clf.fit(X_train,y_train,sample_weight=sample_weight)
print(clf.score(X_test,y_test))
print("test:",metrics.f1_score(clf.predict(X_test),y_test))
print("train",metrics.f1_score(clf.predict(X_train),y_train))
0.8000245670065103
test: 0.6781336496638987
train 0.6962163638847151

As the depth increases, the gap between the training-set and test-set performance grows wider and wider. Weighing the three curves together, max_depth=8 looks like a reasonable choice.
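The single train/test split above can be noisy, so a cross-validated search is a reasonable sanity check of the depth choice. The sketch below (not part of the original notebook) uses GridSearchCV with F1 scoring; passing sample_weight this way assumes the installed sklearn version slices fit parameters per fold, which recent versions do:

# Sketch: cross-validated confirmation of the max_depth choice
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": list(range(3, 16))}
grid = GridSearchCV(tree.DecisionTreeClassifier(random_state=24),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train.values.ravel(), sample_weight=sample_weight)
print(grid.best_params_, grid.best_score_)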

Tuning min_weight_fraction_leaf

Since our samples carry weights, min_samples_split and min_samples_leaf are not suitable tuning knobs here; if the classes were reasonably balanced, those two parameters could be tuned instead.

min_fractions = np.linspace(0,0.02,100,endpoint=True)
score_frac=np.zeros(100)
f1_score_test_frac=np.zeros(100)
f1_score_train_frac=np.zeros(100)
i=0
for min_fraction in min_fractions:
    clf=tree.DecisionTreeClassifier(max_depth =8, min_weight_fraction_leaf=min_fraction,random_state=24)
    clf.fit(X_train,y_train,sample_weight=sample_weight)
    score_frac[i]=clf.score(X_test,y_test)
    f1_score_test_frac[i] = metrics.f1_score(clf.predict(X_test), y_test)
    f1_score_train_frac[i] = metrics.f1_score(clf.predict(X_train), y_train)
    i=i+1
plt.figure(figsize=(10,6))
sns.set(style="whitegrid")
data = pd.DataFrame({"score":score_frac,"f1_score_train": f1_score_train_frac,"f1_score_test": f1_score_test_frac},
                   index=min_fractions)
sns.lineplot(data=data)
plt.xlabel("min_weight_fraction_leaf")
plt.ylabel("score")
plt.title("scores varies with min_weight_fraction_leaf")
Text(0.5, 1.0, 'scores varies with min_weight_fraction_leaf')

[Figure: scores versus min_weight_fraction_leaf]

Setting this parameter to 0 turns out to be best, which is also the default, so it can simply be left out.

PCA dimensionality reduction

To speed up training and let the tree pack more information into fewer levels, PCA is a recommended step.

from sklearn.decomposition import PCA
score_pca=np.zeros(30)
f1_score_test_pca=np.zeros(30)
f1_score_train_pca = np.zeros(30)
j=0
for i in range(1,31):
    pca=PCA(n_components=i,random_state=24)
    pca.fit(X_train)
    X_train_pca=pca.transform(X_train)
    X_test_pca=pca.transform(X_test)
    clf=tree.DecisionTreeClassifier(max_depth=8,random_state=24)
    clf.fit(X_train_pca,y_train,sample_weight=sample_weight)
    score_pca[j] = clf.score(X_test_pca,y_test)
    f1_score_test_pca[j]=metrics.f1_score(clf.predict(X_test_pca),y_test)
    f1_score_train_pca[j]=metrics.f1_score(clf.predict(X_train_pca),y_train)
    j=j+1
plt.figure(figsize=(10,6))
sns.set(style="whitegrid")
data = pd.DataFrame({"score":score_pca,"f1_score_train": f1_score_train_pca,"f1_score_test": f1_score_test_pca},
                   index=range(1,31))
sns.lineplot(data=data)
plt.xlabel("n_components")
plt.ylabel("score")
plt.title("scores varies with n_components")
Text(0.5, 1.0, 'scores varies with n_components')

[Figure: scores versus the number of PCA components]

A value of n_components around 8 looks about right.

pca=PCA(n_components=8,random_state=24)
pca.fit(X_train)
X_train_pca=pca.transform(X_train)
X_test_pca=pca.transform(X_test)
clf=tree.DecisionTreeClassifier(max_depth=8,random_state=24)
clf.fit(X_train_pca,y_train,sample_weight=sample_weight)
print(clf.score(X_test_pca,y_test))
print(metrics.f1_score(clf.predict(X_test_pca),y_test))
print(metrics.f1_score(clf.predict(X_train_pca),y_train))
0.8038324530156
0.6737487231869255
0.6905610284356879

With no obvious drop in performance, the feature dimensionality is reduced to 8, which makes both training and prediction noticeably faster. The downside is that interpretability suffers; and when the features show little collinearity, PCA cannot remove many dimensions and the transformation is not worth doing.
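When PCA is kept, it is easy to forget to apply the same transform to new data; wrapping the two steps in a Pipeline avoids that. The sketch below (reusing the variables defined above, not part of the original run) also prints how much variance the 8 components retain:

# Sketch: PCA + tree in one Pipeline, plus the retained variance ratio
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("pca", PCA(n_components=8, random_state=24)),
    ("tree", tree.DecisionTreeClassifier(max_depth=8, random_state=24)),
])
pipe.fit(X_train, y_train.values.ravel(), tree__sample_weight=sample_weight)
print(pipe.score(X_test, y_test))
print(pipe.named_steps["pca"].explained_variance_ratio_.sum())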

Visualizing the tree

clf=tree.DecisionTreeClassifier(max_depth =8, random_state=24)
clf.fit(X_train,y_train,sample_weight=sample_weight)
print(clf.score(X_test,y_test))
print("test:",metrics.f1_score(clf.predict(X_test),y_test))
print("train",metrics.f1_score(clf.predict(X_train),y_train))

0.8000245670065103
test: 0.6781336496638987
train 0.6962163638847151
# Visualization
# Clean up characters in the feature names that graphviz/dot may not handle well
feature_name = np.array(X_train.columns)
for i in range(len(feature_name)):
    feature_name[i]=feature_name[i].replace("native-country","native_country")
    feature_name[i]=feature_name[i].replace("&","_")

class_name =['0','1']
dot_data = tree.export_graphviz(clf, out_file=None, 
                     feature_names=feature_name,  
                     class_names=class_name,  
                     filled=True, rounded=True,  
                     special_characters=True)  

graph = graphviz.Source(dot_data)  
graph 

[Figure: the depth-8 decision tree rendered with graphviz]

print("test:",metrics.classification_report(clf.predict(X_test),y_test))
test:               precision    recall  f1-score   support

           0       0.78      0.94      0.85      5093
           1       0.85      0.56      0.68      3048

    accuracy                           0.80      8141
   macro avg       0.82      0.75      0.77      8141
weighted avg       0.81      0.80      0.79      8141
cm = metrics.confusion_matrix(y_test, clf.predict(X_test))
plt.matshow(cm,cmap=plt.cm.Greens)
plt.grid(False)
plt.colorbar()
for x in range(len(cm)):
    for y in range(len(cm)):
        # matshow draws cm[x, y] at plot coordinates (column=y, row=x)
        plt.annotate(cm[x,y],xy=(y,x),horizontalalignment='center',verticalalignment='center')
plt.ylabel('True label')  # axis label

plt.xlabel('Predicted label')  # axis label
Text(0.5, 0, 'Predicted label')

[Figure: confusion matrix on the test set]

Summary

In the end we obtained a fairly balanced model with reasonably good classification performance, using three techniques: sample weighting, depth control, and PCA.
Finally we rendered the decision tree; marital status, working hours, and education level stand out as the most important factors influencing income, which matches common sense, and the prediction accuracy is around 80%.
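As a quick programmatic cross-check of the visual impression from the tree plot, the fitted classifier's feature_importances_ can be listed directly (a sketch using the depth-8 tree fitted above):

# Sketch: top features according to the fitted depth-8 tree
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))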
