Classification Trees and Regression Trees
Classification trees are used for classification problems. When choosing a split point, a classification decision tree uses information entropy, information gain, information gain ratio, or the Gini index as its criterion.
Classification tree analysis is when the predicted outcome is the class to which the data belongs.
A regression decision tree handles data whose output is continuous. When choosing a split point, a regression tree tries to make the error of the two resulting branches as small as possible.
Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
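The split criteria named above are all impurity measures. As a quick, minimal sketch (not part of the original text), here is how information entropy and the Gini index can be computed for a set of class labels:

```python
import numpy as np

def entropy(labels):
    # information entropy: -sum(p * log2(p)) over class frequencies
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini index: 1 - sum(p^2) over class frequencies
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: maximally mixed
print(gini(np.array([0, 0, 1, 1])))     # 0.5
```

A classification tree prefers the split whose children have the lowest weighted impurity; information gain is simply the parent's entropy minus this weighted child entropy.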
Regression Tree
Principle
The most intuitive way to understand a decision tree: take the input feature space (\(R^n\)) and partition it, so that every cell of the partition belongs to a single class or corresponds to a single predicted output value. The algorithm then has to solve two problems: 1. How to choose the decision boundaries (split points)? 2. How to keep the number of comparisons as small as possible (the shape of the tree)?
As shown in the figure above, each non-leaf node corresponds to a split on some feature.
Least-Squares Regression Tree Generation Algorithm
Q1: How is the split point chosen? Iterate over all \(n\) features; each feature has \(s_i\) candidate values. After trying every feature and every possible split of each feature, pick the feature and split value that minimize the loss function.
Q2: What does a leaf node output? The average of all the target values in that region.
Form of the node loss function
\[
\min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)} Loss\left(y_{i}, c_{1}\right)+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)} Loss\left(y_{i}, c_{2}\right)\right]
\]
The node has two branches: \(c_1\) is the mean of the left child and \(c_2\) is the mean of the right child. In other words, every split minimizes the sum of the errors of the two resulting branches. The final result is a piecewise function.
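One step worth making explicit: under squared loss the inner minimizations have a closed-form solution, because \(\sum (y_i-c)^2\) is minimized by the mean:

\[
\frac{\partial}{\partial c} \sum_{x_{i} \in R}\left(y_{i}-c\right)^{2}=-2 \sum_{x_{i} \in R}\left(y_{i}-c\right)=0 \quad \Longrightarrow \quad \hat{c}=\frac{1}{|R|} \sum_{x_{i} \in R} y_{i}
\]

This is exactly why the leaf output in Q2 above is the region average.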
The CART Algorithm
Input: training data set
Output: regression tree \(f(x)\)
1. Choose the optimal feature \(j\) and split point \(s\):
\[ \min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-c_{2}\right)^{2}\right] \]
2. For the selected pair \((j, s)\), partition the region and determine the predicted value of each sub-region.
3. Apply steps 1 and 2 recursively to the two sub-regions until the stopping condition is met.
4. Return the generated tree.
Note on split-point selection: sort the feature values first, then evaluate binary splits between adjacent values.
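As a small illustration of that note (a hypothetical helper, not part of the class below), candidate cut points can be generated by sorting the unique values of a feature column and taking the midpoints of neighbors:

```python
import numpy as np

def candidate_splits(feature_column):
    """Sort unique values and return midpoints of adjacent values
    as candidate cut points for a binary split."""
    uniques = np.unique(feature_column)        # sorted, de-duplicated
    return (uniques[:-1] + uniques[1:]) / 2.0  # midpoints

print(candidate_splits(np.array([1, 2, 3, 4])))  # [1.5 2.5 3.5]
```

The implementation below instead uses the unique values themselves (`uniques[1:]`) as cut points; with the `<` / `>=` comparison this produces exactly the same partitions.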
Python Code
The Node class
Attributes: left and right children, the score (predicted value), the index of the splitting feature, and the split point.
```python
import numpy as np  # used throughout the implementation below


class Node(object):
    def __init__(self, score=None):
        # score holds the predicted value when this node is a leaf
        self.score = score
        self.left = None
        self.right = None
        self.feature = None
        self.split = None
```
The RegressionTree class
Constructor
```python
class RegressionTree(object):
    def __init__(self):
        self.root = Node()
        self.height = 0
```
Computing the MSE for a given feature and split point
```python
    def _get_split_mse(self, X, y, idx, feature, split):
        '''
        X: training inputs
        y: training targets
        idx: indices of the samples that belong to this branch
        feature: feature index
        split: candidate split value
        '''
        idx = np.array(idx)
        mask = X[idx, feature] < split
        split_y1 = y[idx[mask]]    # left branch: feature value < split
        split_y2 = y[idx[~mask]]   # right branch: feature value >= split
        split_avg = [np.mean(split_y1), np.mean(split_y2)]
        # sum of squared errors of the two branches
        split_mse = np.sum((split_y1 - split_avg[0]) ** 2) + \
            np.sum((split_y2 - split_avg[1]) ** 2)
        return split_mse, split, split_avg
```
Finding the best split point of a given feature
Iterate over all distinct values in the feature column and pick the one with the smallest MSE as the best split point. If the feature column contains no distinct values (all values are equal), return None.
```python
    def _choose_split_point(self, X, y, idx, feature):
        feature_x = X[idx, feature]
        uniques = np.unique(feature_x)
        if len(uniques) == 1:
            return None
        # try every distinct value except the smallest as a cut point
        mse, split, split_avg = min(
            (self._get_split_mse(X, y, idx, feature, split)
             for split in uniques[1:]), key=lambda x: x[0])
        return mse, feature, split, split_avg
```
Choosing the feature
Iterate over all features, compute the MSE of each feature's best split, and pick the feature and cut point with the smallest MSE. Note that if all values of a feature are identical, `_choose_split_point` returns None.
```python
    def _choose_feature(self, X, y, idx):
        m = X.shape[1]
        # best split of every feature, skipping constant features
        split_rets = [x for x in map(lambda j: self._choose_split_point(
            X, y, idx, j), range(m)) if x is not None]
        if split_rets == []:
            return None
        _, feature, split, split_avg = min(
            split_rets, key=lambda x: x[0])
        # partition the sample indices into the two branches
        idx_split = [[], []]
        while idx:
            i = idx.pop()
            xi = X[i][feature]
            if xi < split:
                idx_split[0].append(i)
            else:
                idx_split[1].append(i)
        return feature, split, split_avg, idx_split
```
Formatting the rule that leads to a leaf node for printing
```python
    def _expr2literal(self, expr):
        feature, op, split = expr
        op = ">=" if op == 1 else "<"
        return "Feature%d %s %.4f" % (feature, op, split)
```
Traversing the tree after it has been built to collect the decision rules
```python
    def _get_rules(self):
        # breadth-first traversal; exprs accumulates the conditions
        # on the path from the root to the current node
        que = [[self.root, []]]
        self.rules = []
        while que:
            nd, exprs = que.pop(0)
            if not (nd.left or nd.right):
                literals = list(map(self._expr2literal, exprs))
                self.rules.append([literals, nd.score])
            if nd.left:
                rule_left = exprs + [[nd.feature, -1, nd.split]]
                que.append([nd.left, rule_left])
            if nd.right:
                rule_right = exprs + [[nd.feature, 1, nd.split]]
                que.append([nd.right, rule_right])
```
Building the binary tree is the training process. Stopping conditions:
- limit the tree depth
- require a minimum number of samples before a node may be split
- require at least one feature whose values are not all identical
```python
    def fit(self, X, y, max_depth=5, min_samples_split=2):
        self.root = Node()
        que = [[0, self.root, list(range(len(y)))]]
        while que:
            depth, nd, idx = que.pop(0)
            # stop expanding once the maximum depth is reached
            if depth == max_depth:
                break
            # skip nodes that are too small or already pure
            if len(idx) < min_samples_split or \
                    len(set(map(lambda i: y[i, 0], idx))) == 1:
                continue
            feature_rets = self._choose_feature(X, y, idx)
            if feature_rets is None:
                continue
            nd.feature, nd.split, split_avg, idx_split = feature_rets
            nd.left = Node(split_avg[0])
            nd.right = Node(split_avg[1])
            que.append([depth + 1, nd.left, idx_split[0]])
            que.append([depth + 1, nd.right, idx_split[1]])
        self.height = depth
        self._get_rules()
```
Printing the leaf-node rules
```python
    def print_rules(self):
        for i, rule in enumerate(self.rules):
            literals, score = rule
            print("Rule %d: " % i, ' | '.join(
                literals) + ' => y_hat %.4f' % score)
```
Predicting a single sample
```python
    def _predict(self, row):
        nd = self.root
        # walk down the tree until a leaf is reached
        while nd.left and nd.right:
            if row[nd.feature] < nd.split:
                nd = nd.left
            else:
                nd = nd.right
        return nd.score
```
Predicting multiple samples

```python
    def predict(self, X):
        # predict each sample independently
        return [self._predict(Xi) for Xi in X]
```
```python
def main():
    print("Testing the accuracy of RegressionTree...")
    X_train = np.array([[1], [2], [3], [4], [5],
                        [6], [7], [8], [9], [10]])
    y_train = np.array([[5.56], [5.7], [5.91], [6.4], [6.8],
                        [7.05], [8.9], [8.7], [9.0], [9.05]])
    reg = RegressionTree()
    reg.fit(X=X_train, y=y_train, max_depth=3)
    reg.print_rules()


main()
```
A Simple Example
Training data
x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---|---|---|---|---|---|---|---|---|---|---
y | 5.56 | 5.7 | 5.91 | 6.4 | 6.8 | 7.05 | 8.9 | 8.7 | 9 | 9.05
As the table shows, there is only one feature, \(x\).
Choose the optimal feature \(j\) and split point \(s\):
s | 1.5 | 2.5 | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 8.5 | 9.5
---|---|---|---|---|---|---|---|---|---
\(c_1\) | 5.56 | 5.63 | 5.72 | 5.89 | 6.07 | 6.24 | 6.62 | 6.88 | 7.11
\(c_2\) | 7.5 | 7.73 | 7.99 | 8.25 | 8.54 | 8.91 | 8.92 | 9.03 | 9.05
loss | 15.72 | 12.07 | 8.36 | 5.78 | 3.91 | 1.93 | 8.01 | 11.73 | 15.74

The loss is smallest at split point \(s=6.5\), where \(l(s=6.5)=1.93\). This produces two branches: \(R_1=\{1,2,3,4,5,6\}\) with \(c_1=6.24\) and \(R_2=\{7,8,9,10\}\) with \(c_2=8.91\).
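The table can be reproduced with a few lines of numpy (a standalone sketch; the values match after rounding to two decimals):

```python
import numpy as np

x = np.arange(1, 11)
y = np.array([5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9.0, 9.05])

for s in np.arange(1.5, 10.0, 1.0):   # candidate split points
    y1, y2 = y[x < s], y[x >= s]      # left / right branches
    c1, c2 = y1.mean(), y2.mean()     # branch averages
    loss = ((y1 - c1) ** 2).sum() + ((y2 - c2) ** 2).sum()
    print("s=%.1f  c1=%.2f  c2=%.2f  loss=%.2f" % (s, c1, c2, loss))
```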
a) Split \(R_1\) further
x | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
y | 5.56 | 5.7 | 5.91 | 6.4 | 6.8 | 7.05

s | 1.5 | 2.5 | 3.5 | 4.5 | 5.5
---|---|---|---|---|---
\(c_1\) | 5.56 | 5.63 | 5.72 | 5.89 | 6.07
\(c_2\) | 6.37 | 6.54 | 6.75 | 6.93 | 7.05
loss | 1.3087 | 0.754 | 0.2771 | 0.4368 | 1.0644

The loss is smallest at \(s=3.5\), where \(l(s=3.5)=0.2771\) (assume the stopping condition is now satisfied). This produces \(R_1=\{1,2,3\}\) with \(c_1=5.72\) and \(R_2=\{4,5,6\}\) with \(c_2=6.75\).
b) Split \(R_2\) further
x | 7 | 8 | 9 | 10
---|---|---|---|---
y | 8.9 | 8.7 | 9 | 9.05

s | 7.5 | 8.5 | 9.5
---|---|---|---
\(c_1\) | 8.9 | 8.8 | 8.87
\(c_2\) | 8.92 | 9.03 | 9.05
loss | 0.0717 | 0.0213 | 0.0467

The loss is smallest at \(s=8.5\), where \(l(s=8.5)=0.0213\) (assume the stopping condition is now satisfied). This produces \(R_1=\{7,8\}\) with \(c_1=8.8\) and \(R_2=\{9,10\}\) with \(c_2=9.03\).
The resulting function
\[
f(x)=\left\{
\begin{aligned}
5.72 && x<3.5\\
6.75 && 3.5 \leq x<6.5\\
8.8 && 6.5 \leq x<8.5\\
9.03 && 8.5 \leq x\\
\end{aligned}
\right.
\]
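Written out as code, the fitted tree is just a chain of comparisons; this hand-coded version of \(f(x)\) is equivalent to what `_predict` does on the learned tree:

```python
def f(x):
    # piecewise-constant regression function learned above
    if x < 3.5:
        return 5.72
    elif x < 6.5:
        return 6.75
    elif x < 8.5:
        return 8.8
    else:
        return 9.03

print([f(v) for v in [1, 4, 7, 9]])  # [5.72, 6.75, 8.8, 9.03]
```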
Python Library
class sklearn.tree.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9.0, 9.05])

# Fit regression models of increasing depth
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=3)
regr_3 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

# Predict on the training points
X_test = np.copy(X)
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=3", linewidth=2)
plt.plot(X_test, y_3, color="r", label="max_depth=4", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
```
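Each extra level of depth adds more constant segments: a depth-2 tree has at most four leaves, a depth-3 tree at most eight, and on these ten points the depth-4 tree can give almost every sample its own leaf. The plot therefore also illustrates how unconstrained depth slides into overfitting.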