Supervised Classification: Decision Tree

1. Understanding

The decision tree classification algorithm builds a decision tree on the training set. Each internal node of the tree is the attribute on which the data is split, so the key of the algorithm is how to choose the splitting attribute. Once the decision tree has been built, it can be used to classify the data in the test set.

We need a measure, an Attribute Selection Measure (ASM), to evaluate which attribute produces the better split. Common measures are Information Gain, Gain Ratio and the Gini Index. These measures ultimately derive from Shannon's Information Theory.


Advantages

  • Decision trees implicitly perform variable screening or feature selection.
  • Decision trees require relatively little effort from users for data preparation (no need to standardise/normalise the data).
  • Nonlinear relationships between parameters do not affect tree performance.
  • Can handle both numerical and categorical data, and can also handle multi-output problems. (For numeric attributes, breakpoints must be determined by a discretisation procedure, e.g. place breakpoints where the class changes, or enforce a minimum number of instances in the majority class per interval.)
  • Can be used for feature engineering, for example predicting missing values; also suitable for variable selection.

Disadvantages

  • Decision-tree learners can create over-complex trees that do not generalize well to unseen data. This is called overfitting.
  • Sensitive to noisy data.
  • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which can be lowered by methods like bagging and boosting.
  • Greedy algorithms cannot guarantee the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset before fitting the decision tree.

2. Methods

2.1 Attribute Selection Measures

2.1.1 Information Gain

To compute Information Gain, we need to know how to measure the information contained in the data; here we use the (Shannon) entropy.


entropy = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)
info([n_1, n_2, \ldots]) = entropy\left(\left[\frac{n_1}{N}, \frac{n_2}{N}, \ldots\right]\right), \quad N = n_1 + n_2 + \ldots
Information\ Gain = info([\text{before split}]) - info([\text{after split}])

Example:
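A minimal Python sketch of the computation (the class counts and the three-way split below are hypothetical, chosen only to illustrate the formulas):

# Entropy and information gain for a hypothetical split of 14 examples
# (9 of one class, 5 of the other) into three branches.
from math import log2

def entropy(probs):
    # Shannon entropy of a probability distribution (terms with p = 0 are skipped)
    return -sum(p * log2(p) for p in probs if p > 0)

def info(counts):
    # info([n1, n2, ...]) = entropy of the class proportions
    total = sum(counts)
    return entropy([n / total for n in counts])

before = [9, 5]                      # class counts before the split
branches = [[2, 3], [4, 0], [3, 2]]  # class counts in each branch after the split

N = sum(before)
info_before = info(before)
info_after = sum(sum(b) / N * info(b) for b in branches)

print("info before split:", round(info_before, 3))                # ~0.940
print("info after split :", round(info_after, 3))                 # ~0.694
print("information gain :", round(info_before - info_after, 3))   # ~0.247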

In practice, we want the entropy after the split to be as small as possible. Since the entropy before the split is fixed, this is equivalent to maximising the difference between them, and that difference is exactly the Information Gain; we therefore choose the attribute that yields the largest information gain. This is a greedy algorithm: it recursively splits the remaining nodes, each time choosing the locally best split.

Information gain computes the difference between the entropy before the split and the weighted average entropy after splitting the dataset on the given attribute's values.

2.1.2 Gain Ratio

When a split on an attribute produces many branches, Information Gain becomes biased. For example, when splitting on an ID attribute, every branch contains exactly one data point, so info([after split]) is 0 and the Information Gain is maximal, even though the split is useless.

In other words, information gain prefers attributes with a large number of distinct values. For instance, an attribute with a unique identifier such as customer_ID has zero info(D) because every partition is pure; this maximizes the information gain but creates a useless partitioning.

Gain ratio takes the number and size of branches into account when choosing an attribute.

Gain\ Ratio = \frac{\text{information gain}}{\text{information value of attribute}}

Example:
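A minimal sketch using the same hypothetical three-way split as above; the "information value" of the attribute (also called split information) is the entropy of the branch-size proportions themselves:

# Gain ratio for the hypothetical split above: 14 examples, information gain
# ~0.247, three branches of sizes 5, 4 and 5.
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

gain = 0.247                    # information gain of the split (from the previous sketch)
split_info = info([5, 4, 5])    # entropy of the branch-size proportions
print("split info:", round(split_info, 3))          # ~1.577
print("gain ratio:", round(gain / split_info, 3))   # ~0.157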

2.1.3 Gini Index

The Gini Index measures the probability that a randomly selected sample in a subset is misclassified. K is the number of classes and P_i is the proportion of samples belonging to class i. Like entropy, the Gini index measures the uncertainty of a random variable:

  • the larger G is, the higher the uncertainty of the data;
  • the smaller G is, the lower the uncertainty of the data;
  • G = 0 means all samples in the set belong to the same class.

G = \sum_{i=1}^{K} P_i(1-P_i) = 1 - \sum_{i=1}^{K} P_i^2, \quad \sum_{i=1}^{K} P_i = 1

For a dataset D, if we split D into two parts D_1 and D_2 according to whether feature A takes the value a, then the Gini index of D conditioned on feature A is:
G(D, A) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)

The attribute whose split gives the smallest Gini index is chosen as the splitting attribute.

Example:
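A minimal sketch with hypothetical class counts: a pure partition has Gini 0, a 50/50 partition has Gini 0.5, and the Gini index of the split is their size-weighted average.

# Gini index of a hypothetical binary split of 14 examples on some attribute A.
def gini(counts):
    # Gini index of a node given its class counts
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

D1 = [4, 0]   # class counts in the first partition  -> pure, Gini = 0
D2 = [5, 5]   # class counts in the second partition -> Gini = 0.5
N = sum(D1) + sum(D2)
gini_D_A = sum(D1) / N * gini(D1) + sum(D2) / N * gini(D2)
print("Gini(D, A) =", round(gini_D_A, 3))   # ~0.357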

2.2 When to stop splitting

In general we may have many features, which can make the decision tree very large; splitting indefinitely also leads to overfitting. We therefore need to know when to stop splitting the tree (a short scikit-learn sketch follows the list below).

  • Set a minimum number of training inputs to use on each leaf. For example, we can require at least 10 passengers to reach a decision (died or survived), and ignore any leaf with fewer than 10 passengers.
  • Set a maximum depth of the decision tree. Maximum depth refers to the length of the longest path from the root to a leaf.
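A minimal scikit-learn sketch of these two stopping rules (the thresholds are illustrative, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# Only allow splits that leave at least 10 samples in each leaf, and never
# grow the tree deeper than 4 levels.
clf = DecisionTreeClassifier(min_samples_leaf=10, max_depth=4)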

2.3 Pruning

Pruning involves removing branches that use features of low importance. This reduces the complexity of the tree and thus increases its predictive power by reducing overfitting.

2.3.1 Post-pruning:

pruning the tree after it has finished growing (a scikit-learn sketch follows the list below).

  • Minimum error. The tree is pruned back to the point where the cross-validated error is a minimum. Cross-validation is the process of building a tree with most of the data and then using the remaining part of the data to test the accuracy of the decision tree.
  • Smallest tree. The tree is pruned back slightly further than the minimum error. Technically, the pruning creates a decision tree with cross-validation error within 1 standard error of the minimum error. The smaller tree is more intelligible at the cost of a small increase in error.
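scikit-learn offers a related post-pruning strategy, minimal cost-complexity pruning, exposed through cost_complexity_pruning_path and the ccp_alpha parameter. A minimal sketch on synthetic data (not the diabetes dataset used below):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data just to illustrate the API.
X, y = make_classification(n_samples=200, random_state=1)

# Grow an unpruned tree, then ask for the candidate pruning strengths
# (effective alphas) of minimal cost-complexity pruning.
full = DecisionTreeClassifier(random_state=1).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Refit with a non-zero ccp_alpha: larger alphas prune more aggressively.
# In practice ccp_alpha is chosen by cross-validation.
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full.get_n_leaves(), "->", pruned.get_n_leaves())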

2.3.2 Pre-pruning/early stopping

stopping the growth of the tree before it has completely classified the training set (for example, using the stopping criteria from Section 2.2).

3. Code

Dataset: the Pima Indians Diabetes dataset from Kaggle.

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation


# Loading data
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()

# Feature Selection
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

# Splitting data
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test


# Build Decision Tree Model
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)


# Evaluating Model
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# Visualizing Decision trees
# pip install graphviz
# pip install pydotplus
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image  
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes.png')
Image(graph.create_png())
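An alternative sketch of the visualisation that avoids graphviz and pydotplus entirely, using scikit-learn's built-in plot_tree (it only needs matplotlib and reuses clf and feature_cols from above):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted tree directly with matplotlib and save it to a PNG file.
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'],
          filled=True, rounded=True)
plt.savefig('diabetes_plot_tree.png')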

Optimizing decision tree performance (a small tuning sketch follows the parameter list below):
clf = DecisionTreeClassifier(criterion = "entropy", max_depth = 3)

  • criterion: optional (default=”gini”) or Attribute Selection Measure: This parameter allows us to choose the attribute selection measure. Supported criteria are ”gini” for the Gini index and ”entropy” for the information gain.

  • splitter : string, optional (default=”best”) or Split Strategy: This parameter allows us to choose the split strategy. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

  • max_depth: int or None, optional (default=None) or Maximum Depth of a Tree: The maximum depth of the tree. If None, then nodes are expanded until all leaves contain fewer than min_samples_split samples. A higher maximum depth can lead to overfitting, while a lower value can lead to underfitting.

  • max_leaf_nodes: Limit the number of leaf nodes.

  • min_samples_leaf: Restrict the minimum number of samples in a leaf.
    The minimum sample size in terminal nodes can be fixed at, for example, 30, 100, 300 or 5% of the total.
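A minimal sketch of tuning a few of these parameters with cross-validation, reusing X_train and y_train from the code above (GridSearchCV is one possible approach; the grid values are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search a small grid of attribute selection measures and size constraints,
# scored by 5-fold cross-validation on the training set.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 10, 30],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))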

References

  1. Decision Trees in Machine Learning
  2. Decision Tree Classification in Python