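Decision Trees

[Keywords] tree, information gain

Pros and cons of decision trees

Pros: low computational complexity, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features. Decision trees can be used for both classification and regression.

Cons: prone to overfitting.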

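How decision trees work

If you have never come across decision trees before, there is no need to worry: the concept is very simple. Even without prior knowledge, you can understand how one works from a simple diagram.

The idea behind decision tree classification is a lot like matchmaking. Imagine a mother who wants to introduce her daughter to a potential boyfriend, which leads to the following conversation: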

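  Daughter: How old is he?

  Mother: 26.

  Daughter: Is he good-looking?

  Mother: Quite handsome.

  Daughter: Is his income high?

  Mother: Not very high, about average.

  Daughter: Is he a civil servant?

  Mother: Yes, he works at the tax bureau.

  Daughter: All right, I'll go meet him.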

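The girl's decision process is a typical classification tree. She uses age, looks, income, and civil-servant status to sort men into two classes: meet or don't meet. Suppose her requirements are: under 30, average or better looks, and either a high income or a civil-service job with a medium or higher income. Her decision logic can then be drawn as a tree that tests age first, then looks, then income, and finally whether he is a civil servant.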
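The same decision logic can be written as nested conditions, which is all a trained decision tree really is. This is only a sketch: the function name `should_meet` and the string encodings `'low'`, `'medium'`, `'high'` are assumptions made for illustration, not part of the original example.

```python
def should_meet(age, looks, income, is_civil_servant):
    """Return True if the girl agrees to meet the man.

    looks and income are assumed to be 'low', 'medium', or 'high'.
    """
    if age >= 30:                          # must be under 30
        return False
    if looks not in ("medium", "high"):    # average looks or better
        return False
    if income == "high":                   # high income is enough on its own
        return True
    if income == "medium":                 # medium income needs a civil-service job
        return is_civil_servant
    return False


# The man from the conversation: 26, handsome, medium income, civil servant.
print(should_meet(26, "high", "medium", True))
```

Each `if` corresponds to one internal node of the tree, and each `return` to a leaf.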

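The ID3 algorithm

The guiding principle for splitting a dataset is to make disordered data more ordered.

There are many ways to split a dataset, each with its own strengths and weaknesses. One way to organize messy data is to measure information using information theory, the branch of science concerned with quantifying information. We can use it to quantify the information content of the data before splitting.

The change in information before and after splitting the dataset is called information gain. Once we know how to compute it, we can calculate the information gain obtained by splitting the dataset on each feature; the feature with the highest information gain is the best choice.

The measure of information in a set is called Shannon entropy, or simply entropy, named after Claude Shannon, the father of information theory.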

Entropy

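Entropy is defined as the expected value of information. Before making that precise, we need the definition of information itself. If the items to be classified can fall into several classes, the information of the symbol x_i is defined as:

$$l(x_i) = -\log_2 p(x_i)$$

where p(x_i) is the probability of choosing that class.

To compute the entropy, we take the expected value of the information over all possible classes, given by:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where n is the number of classes.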
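As a quick check on the definition, here is a minimal sketch that computes entropy in bits from a list of class probabilities. The function name `entropy` is chosen for illustration, and terms with zero probability are skipped on the usual convention that they contribute nothing.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    # Summing the negated terms (instead of negating the sum) avoids a "-0.0" result.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))                   # two equally likely classes
print(entropy([1.0]))                        # a single class, nothing left to learn
print(round(entropy([9 / 14, 5 / 14]), 3))   # a 9-to-5 class split
```

Entropy is largest when the classes are equally likely and falls to zero when every item belongs to the same class.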

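In a decision tree, let D be the partition of the training tuples by class. The entropy of D is then:

$$\mathrm{info}(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where n is the number of classes and p_i is the probability that the i-th class occurs among the training tuples, estimated as the number of tuples in that class divided by the total number of training tuples. In practical terms, the entropy is the average amount of information needed to identify the class label of a tuple in D.

Now suppose we split the training tuples D on attribute A into subsets D_1, ..., D_v, one per value of A. The expected information of the split is:

$$\mathrm{info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{info}(D_j)$$

The information gain is the difference between the two:

$$\mathrm{gain}(A) = \mathrm{info}(D) - \mathrm{info}_A(D)$$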
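The definition of information gain translates almost directly into code. The sketch below is one possible implementation working on plain Python lists; the names `entropy` and `information_gain` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weight each subset's entropy by the fraction of tuples it contains.
    expected = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - expected


labels = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], labels))  # feature separates the classes
print(information_gain(["a", "b", "a", "b"], labels))  # feature tells us nothing
```

A feature that separates the classes perfectly gains the full entropy of the labels, while a feature unrelated to the labels gains nothing.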

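Simple decision tree code example

Information gain exercise

Predict whether to go out and play based on the weather (outlook: sunny/overcast/rainy, temperature, humidity, wind)

Entropy of the play decision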

import numpy as np

h_play = -((9. / 14) * np.log2(9. / 14) + (5. / 14) * np.log2(5. / 14))
round(h_play, 3)

0.94

Information gain of outlook

h_sunny = -((2. / 5) * np.log2(2. / 5) + (3. / 5) * np.log2(3. / 5)) * (5. / 14)
h_overcast = -((1.) * np.log2(1.)) * (4. / 14)
h_rain = -((3. / 5) * np.log2(3. / 5) + (2. / 5) * np.log2(2. / 5)) * (5. / 14)
h_outlook = h_sunny + h_overcast + h_rain
round(h_outlook, 3)

0.694

r_outlook = round(h_play - h_outlook, 3)
r_outlook

0.247

Information gain of temperature

h_hot = -((2. / 4) * np.log2(2. / 4) + (2. / 4) * np.log2(2. / 4)) * (4. / 14)
h_mild = -((4. / 6) * np.log2(4. / 6) + (2. / 6) * np.log2(2. / 6)) * (6. / 14)
h_cool = -((3. / 4) * np.log2(3. / 4) + (1. / 4) * np.log2(1. / 4)) * (4. / 14)
h_temp = h_hot + h_mild + h_cool
round(h_temp, 3)

0.911

r_temp = round(h_play - h_temp, 3)
r_temp

0.029

Information gain of humidity

h_high = -((3. / 7) * np.log2(3. / 7) + (4. / 7) * np.log2(4. / 7)) * (7. / 14)
h_norm = -((6. / 7) * np.log2(6. / 7) + (1. / 7) * np.log2(1. / 7)) * (7. / 14)
h_hum = h_high + h_norm
round(h_hum, 3)

0.788

r_hum = round(h_play - h_hum, 3)
r_hum

0.152

Information gain of wind

h_wtrue = -((3. / 6) * np.log2(3. / 6) + (3. / 6) * np.log2(3. / 6)) * (6. / 14)
h_wfalse = -((6. / 8) * np.log2(6. / 8) + (2. / 8) * np.log2(2. / 8)) * (8. / 14)
h_wind = h_wtrue + h_wfalse
round(h_wind, 3)

0.892

r_wind = round(h_play - h_wind, 3)
r_wind

0.048

Information gain ranking: r_outlook(0.247) > r_hum(0.152) > r_wind(0.048) > r_temp(0.029), so outlook gives the best first split.
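The four hand calculations above can be reproduced in a single loop. This is a minimal sketch: the helper `entropy_from_counts` and the `counts` dictionary are illustrative names, and the (yes, no) counts are copied from the fractions used in the expressions above.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits from raw class counts, skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


# (yes, no) counts for each value of each feature
counts = {
    "outlook": {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "humidity": {"high": (3, 4), "normal": (6, 1)},
    "windy": {"true": (3, 3), "false": (6, 2)},
}

h_play = entropy_from_counts((9, 5))  # 9 yes, 5 no overall

gains = {}
for feature, by_value in counts.items():
    n = sum(sum(c) for c in by_value.values())
    # Expected entropy after the split: subset entropies weighted by subset size.
    h_split = sum(sum(c) / n * entropy_from_counts(c) for c in by_value.values())
    gains[feature] = h_play - h_split

for feature, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(feature, round(gain, 3))
```

Working from counts keeps the weights and the per-subset entropies in one place, which makes arithmetic slips less likely than in the hand-written expressions.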

Information gain of the other features, given a sunny outlook. First, the entropy of the sunny subset:

h_sunny_sunny =  -((2. / 5) * np.log2(2. / 5) + (3. / 5) * np.log2(3. / 5))
round(h_sunny_sunny, 3)

0.971

Information gain of temperature, given a sunny outlook

h_hot_sunny = -((2. / 2) * np.log2(2. / 2)) * (2. / 5)
h_mild_sunny = -((1. / 2) * np.log2(1. / 2) + (1. / 2) * np.log2(1. / 2)) * (2. / 5)
h_cool_sunny = -((1. / 1) * np.log2(1. / 1)) * (1. / 5)
h_temp_sunny = h_hot_sunny + h_mild_sunny + h_cool_sunny
round(h_temp_sunny, 3)

0.4

r_temp_sunny = h_sunny_sunny - h_temp_sunny
round(r_temp_sunny, 3)

0.571

Information gain of humidity, given a sunny outlook

h_high_sunny = -((3. / 3) * np.log2(3. / 3)) * (3. / 5)
h_norm_sunny = -((2. / 2) * np.log2(2. / 2)) * (2. / 5)
h_hum_sunny = h_high_sunny + h_norm_sunny
round(h_hum_sunny, 3)

-0.0

r_hum_sunny = h_sunny_sunny - h_hum_sunny
round(r_hum_sunny, 3)

0.971

Information gain of wind, given a sunny outlook

h_wtrue_sunny = -((1. / 2) * np.log2(1. / 2) + (1. / 2) * np.log2(1. / 2)) * (2. / 5)
h_wfalse_sunny = -((1. / 3) * np.log2(1. / 3) + (2. / 3) * np.log2(2. / 3)) * (3. / 5)
h_wind_sunny = h_wtrue_sunny + h_wfalse_sunny
round(h_wind_sunny, 3)

0.951

r_wind_sunny = h_sunny_sunny - h_wind_sunny
round(r_wind_sunny, 3)

0.02

Information gain ranking of the other features on sunny days:

r_hum_sunny(0.971) > r_temp_sunny(0.571) > r_wind_sunny(0.02)
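Repeating this "pick the best feature, then recurse into each branch" step until every subset is pure is the core of ID3. The sketch below is a simplified version with no pruning or tie-breaking rules. The nominal levels (hot/mild/cool, high/normal) are an assumed discretization that matches the counts used in this exercise; the CSV loaded later in the post stores temperature and humidity as numbers instead. All function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder


def id3(rows, features, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # pure subset: make a leaf
        return labels[0]
    if not features:               # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {best: branches}


COLUMNS = ["outlook", "temperature", "humidity", "windy", "play"]
RAW = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
rows = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]

tree = id3(rows, COLUMNS[:-1], "play")
print(tree)
```

The resulting nested dictionary splits on outlook at the root and on humidity under the sunny branch, in line with the two rankings computed by hand.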

Code example

Predict whether to go out and play based on the weather

Load the data

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
import pandas as pd


df = pd.read_csv('./dtree.csv')

df

	Outlook	Temperature	Humidity	Windy	Play
0	sunny	85	85	False	no
1	sunny	80	90	True	no
2	overcast	83	86	False	yes
3	rainy	70	96	False	yes
4	rainy	68	80	False	yes
5	rainy	65	70	True	no
6	overcast	64	65	True	yes
7	sunny	72	95	False	no
8	sunny	69	70	False	yes
9	rainy	75	80	False	yes
10	sunny	75	70	True	yes
11	overcast	72	90	True	yes
12	overcast	81	75	False	yes
13	rainy	71	91	True	no

df.columns

Index(['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play'], dtype='object')

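# Convert features and labels to lists of dicts, one dict per row.
# 'records' is the full orient name; the abbreviation 'record' is rejected by newer pandas.
data = df.loc[:, ['Outlook', 'Temperature', 'Humidity', 'Windy']].to_dict(orient = 'records')
target = df.loc[:, ['Play']].to_dict(orient = 'records')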

Vectorize the training data

dt_vect_data = DictVectorizer(sparse = False)

vect_data = dt_vect_data.fit_transform(data)

vect_data

array([[85.,  0.,  0.,  1., 85.,  0.],
       [90.,  0.,  0.,  1., 80.,  1.],
       [86.,  1.,  0.,  0., 83.,  0.],
       [96.,  0.,  1.,  0., 70.,  0.],
       [80.,  0.,  1.,  0., 68.,  0.],
       [70.,  0.,  1.,  0., 65.,  1.],
       [65.,  1.,  0.,  0., 64.,  1.],
       [95.,  0.,  0.,  1., 72.,  0.],
       [70.,  0.,  0.,  1., 69.,  0.],
       [80.,  0.,  1.,  0., 75.,  0.],
       [70.,  0.,  0.,  1., 75.,  1.],
       [90.,  1.,  0.,  0., 72.,  1.],
       [75.,  1.,  0.,  0., 81.,  0.],
       [91.,  0.,  1.,  0., 71.,  1.]])

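After vectorization, Outlook is split into three one-hot features. The columns are, in order: Humidity, Outlook=overcast, Outlook=rainy, Outlook=sunny, Temperature, Windy.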
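To see which column is which without guessing, the fitted vectorizer can be asked for its feature names. The sketch below uses three rows taken from the table above and the `feature_names_` attribute of `DictVectorizer`; the variable names are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {'Outlook': 'sunny', 'Temperature': 85, 'Humidity': 85, 'Windy': False},
    {'Outlook': 'overcast', 'Temperature': 83, 'Humidity': 86, 'Windy': False},
    {'Outlook': 'rainy', 'Temperature': 70, 'Humidity': 96, 'Windy': False},
]

vectorizer = DictVectorizer(sparse=False)
matrix = vectorizer.fit_transform(records)

# Print each feature name next to its column of values.
for name, column in zip(vectorizer.feature_names_, matrix.T):
    print(name, column)
```

String values become one `name=value` column each, while numeric and boolean values keep a single column.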

Vectorize the labels

dt_vect_target = DictVectorizer(sparse = False)

vect_target = dt_vect_target.fit_transform(target)

dt_tree = DecisionTreeClassifier()
dt_tree.fit(vect_data, vect_target)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
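The output above shows `criterion='gini'`, so by default scikit-learn splits on Gini impurity rather than the entropy used in the hand calculations. A possible variant, sketched below, passes `criterion='entropy'` and feeds the labels as a plain list instead of vectorizing them. The rows are copied from the table above; `random_state=0` and the variable names are choices made for this example.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Outlook, Temperature, Humidity, Windy, Play
ROWS = [
    ('sunny', 85, 85, False, 'no'),
    ('sunny', 80, 90, True, 'no'),
    ('overcast', 83, 86, False, 'yes'),
    ('rainy', 70, 96, False, 'yes'),
    ('rainy', 68, 80, False, 'yes'),
    ('rainy', 65, 70, True, 'no'),
    ('overcast', 64, 65, True, 'yes'),
    ('sunny', 72, 95, False, 'no'),
    ('sunny', 69, 70, False, 'yes'),
    ('rainy', 75, 80, False, 'yes'),
    ('sunny', 75, 70, True, 'yes'),
    ('overcast', 72, 90, True, 'yes'),
    ('overcast', 81, 75, False, 'yes'),
    ('rainy', 71, 91, True, 'no'),
]
FEATURES = ('Outlook', 'Temperature', 'Humidity', 'Windy')

records = [dict(zip(FEATURES, row[:4])) for row in ROWS]
labels = [row[4] for row in ROWS]   # 1-D string labels, no vectorizer needed

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X, labels)

print(clf.classes_)
print(clf.score(X, labels))   # accuracy on the training rows themselves
```

The training accuracy is perfect only because an unpruned tree can memorize all 14 rows, which is exactly the overfitting risk listed among the cons at the start.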

print('Feature importances', dt_tree.feature_importances_)

Feature importances [0.28       0.22222222 0.         0.         0.49777778 0.        ]

Vectorize the test data

new_data = {'Outlook': 'sunny', 'Temperature': 60, 'Humidity': 90, 'Windy': True}
vect_new_data = dt_vect_data.transform(new_data)
vect_new_data

array([[90.,  0.,  0.,  1., 60.,  1.]])

Predict

dt_vect_target.inverse_transform(dt_tree.predict(vect_new_data))

[{'Play=yes': 1.0}]

彭黎明
