Classification Trees and Regression Trees
Classification trees are used for classification problems. When choosing a split point, a classification decision tree uses information entropy, information gain, information gain ratio, or the Gini index as its criterion.
Classification tree analysis is when the predicted outcome is the class to which the data belongs.
A regression decision tree handles data whose output is continuous. When choosing a split point, a regression tree tries to make the error of the two resulting branches as small as possible.
Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
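The split criteria named above are all impurity measures. As a quick, minimal sketch (not part of the original text), here is how information entropy and the Gini index can be computed for a set of class labels:

```python
import numpy as np

def entropy(labels):
    # information entropy: -sum(p * log2(p)) over class frequencies
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini index: 1 - sum(p^2) over class frequencies
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(entropy(np.array([0, 0, 1, 1])))  # 1.0: maximally mixed
print(gini(np.array([0, 0, 1, 1])))     # 0.5
```

A classification tree prefers the split whose children have the lowest weighted impurity; information gain is simply the parent's entropy minus this weighted child entropy.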
Regression Tree
Principle
The most intuitive way to understand a decision tree: take the input feature space (\(R^n\)) and partition it, so that every cell of the partition belongs to a single class or corresponds to a single predicted output value. The algorithm then has to solve two problems: 1. How to choose the decision boundaries (split points)? 2. How to keep the number of comparisons as small as possible (the shape of the tree)?
As shown in the figure above, each non-leaf node corresponds to a split on some feature.
Least-Squares Regression Tree Generation Algorithm
Q1: How is the split point chosen? Iterate over all \(n\) features; each feature has \(s_i\) candidate values. After trying every feature and every possible split of each feature, pick the feature and split value that minimize the loss function.
Q2: What does a leaf node output? The average of all the target values in that region.
Form of the node loss function
\[
\min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)} Loss\left(y_{i}, c_{1}\right)+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)} Loss\left(y_{i}, c_{2}\right)\right]
\]
The node has two branches: \(c_1\) is the mean of the left child and \(c_2\) is the mean of the right child. In other words, every split minimizes the sum of the errors of the two resulting branches. The final result is a piecewise function.
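One step worth making explicit: under squared loss the inner minimizations have a closed-form solution, because \(\sum (y_i-c)^2\) is minimized by the mean:

\[
\frac{\partial}{\partial c} \sum_{x_{i} \in R}\left(y_{i}-c\right)^{2}=-2 \sum_{x_{i} \in R}\left(y_{i}-c\right)=0 \quad \Longrightarrow \quad \hat{c}=\frac{1}{|R|} \sum_{x_{i} \in R} y_{i}
\]

This is exactly why the leaf output in Q2 above is the region average.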
The CART Algorithm
Input: training data set
Output: regression tree \(f(x)\)
1. Choose the optimal feature \(j\) and split point \(s\):
\[ \min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-c_{2}\right)^{2}\right] \]
2. For the selected pair \((j, s)\), partition the region and determine the predicted value of each sub-region.
3. Apply steps 1 and 2 recursively to the two sub-regions until the stopping condition is met.
4. Return the generated tree.
Note on split-point selection: sort the feature values first, then evaluate binary splits between adjacent values.
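As a small illustration of that note (a hypothetical helper, not part of the class below), candidate cut points can be generated by sorting the unique values of a feature column and taking the midpoints of neighbors:

```python
import numpy as np

def candidate_splits(feature_column):
    """Sort unique values and return midpoints of adjacent values
    as candidate cut points for a binary split."""
    uniques = np.unique(feature_column)        # sorted, de-duplicated
    return (uniques[:-1] + uniques[1:]) / 2.0  # midpoints

print(candidate_splits(np.array([1, 2, 3, 4])))  # [1.5 2.5 3.5]
```

The implementation below instead uses the unique values themselves (`uniques[1:]`) as cut points; with the `<` / `>=` comparison this produces exactly the same partitions.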
Python Code
The Node class
Attributes: left and right children, the score (predicted value), the index of the splitting feature, and the split point.
```python
import numpy as np  # used throughout the implementation below


class Node(object):
    def __init__(self, score=None):
        # score holds the predicted value when this node is a leaf
        self.score = score
        self.left = None
        self.right = None
        self.feature = None
        self.split = None
```
The RegressionTree class
Constructor
```python
class RegressionTree(object):
    def __init__(self):
        self.root = Node()
        self.height = 0
```
Computing the MSE for a given feature and split point
```python
    def _get_split_mse(self, X, y, idx, feature, split):
        '''
        X: training inputs
        y: training targets
        idx: indices of the samples that belong to this branch
        feature: feature index
        split: candidate split value
        '''
        idx = np.array(idx)
        mask = X[idx, feature] < split
        split_y1 = y[idx[mask]]    # left branch: feature value < split
        split_y2 = y[idx[~mask]]   # right branch: feature value >= split
        split_avg = [np.mean(split_y1), np.mean(split_y2)]
        # sum of squared errors of the two branches
        split_mse = np.sum((split_y1 - split_avg[0]) ** 2) + \
            np.sum((split_y2 - split_avg[1]) ** 2)
        return split_mse, split, split_avg
```
Finding the best split point of a given feature
Iterate over all distinct values in the feature column and pick the one with the smallest MSE as the best split point. If the feature column contains no distinct values (all values are equal), return None.
```python
    def _choose_split_point(self, X, y, idx, feature):
        feature_x = X[idx, feature]
        uniques = np.unique(feature_x)
        if len(uniques) == 1:
            return None
        # try every distinct value except the smallest as a cut point
        mse, split, split_avg = min(
            (self._get_split_mse(X, y, idx, feature, split)
             for split in uniques[1:]), key=lambda x: x[0])
        return mse, feature, split, split_avg
```
Choosing the feature
Iterate over all features, compute the MSE of each feature's best split, and pick the feature and cut point with the smallest MSE. Note that if all values of a feature are identical, `_choose_split_point` returns None.
```python
    def _choose_feature(self, X, y, idx):
        m = X.shape[1]
        # best split of every feature, skipping constant features
        split_rets = [x for x in map(lambda j: self._choose_split_point(
            X, y, idx, j), range(m)) if x is not None]
        if split_rets == []:
            return None
        _, feature, split, split_avg = min(
            split_rets, key=lambda x: x[0])
        # partition the sample indices into the two branches
        idx_split = [[], []]
        while idx:
            i = idx.pop()
            xi = X[i][feature]
            if xi < split:
                idx_split[0].append(i)
            else:
                idx_split[1].append(i)
        return feature, split, split_avg, idx_split
```
Formatting the rule that leads to a leaf node for printing
```python
    def _expr2literal(self, expr):
        feature, op, split = expr
        op = ">=" if op == 1 else "<"
        return "Feature%d %s %.4f" % (feature, op, split)
```
Traversing the tree after it has been built to collect the decision rules
```python
    def _get_rules(self):
        # breadth-first traversal; exprs accumulates the conditions
        # on the path from the root to the current node
        que = [[self.root, []]]
        self.rules = []
        while que:
            nd, exprs = que.pop(0)
            if not (nd.left or nd.right):
                literals = list(map(self._expr2literal, exprs))
                self.rules.append([literals, nd.score])
            if nd.left:
                rule_left = exprs + [[nd.feature, -1, nd.split]]
                que.append([nd.left, rule_left])
            if nd.right:
                rule_right = exprs + [[nd.feature, 1, nd.split]]
                que.append([nd.right, rule_right])
```
Building the binary tree is the training process. Stopping conditions:
- limit the tree depth
- require a minimum number of samples before a node may be split
- require at least one feature whose values are not all identical
```python
    def fit(self, X, y, max_depth=5, min_samples_split=2):
        self.root = Node()
        que = [[0, self.root, list(range(len(y)))]]
        while que:
            depth, nd, idx = que.pop(0)
            # stop expanding once the maximum depth is reached
            if depth == max_depth:
                break
            # skip nodes that are too small or already pure
            if len(idx) < min_samples_split or \
                    len(set(map(lambda i: y[i, 0], idx))) == 1:
                continue
            feature_rets = self._choose_feature(X, y, idx)
            if feature_rets is None:
                continue
            nd.feature, nd.split, split_avg, idx_split = feature_rets
            nd.left = Node(split_avg[0])
            nd.right = Node(split_avg[1])
            que.append([depth + 1, nd.left, idx_split[0]])
            que.append([depth + 1, nd.right, idx_split[1]])
        self.height = depth
        self._get_rules()
```
Printing the leaf-node rules
```python
    def print_rules(self):
        for i, rule in enumerate(self.rules):
            literals, score = rule
            print("Rule %d: " % i, ' | '.join(
                literals) + ' => y_hat %.4f' % score)
```
Predicting a single sample
```python
    def _predict(self, row):
        nd = self.root
        # walk down the tree until a leaf is reached
        while nd.left and nd.right:
            if row[nd.feature] < nd.split:
                nd = nd.left
            else:
                nd = nd.right
        return nd.score
```
Predicting multiple samples

```python
    def predict(self, X):
        # predict each sample independently
        return [self._predict(Xi) for Xi in X]
```
```python
def main():
    print("Testing the accuracy of RegressionTree...")
    X_train = np.array([[1], [2], [3], [4], [5],
                        [6], [7], [8], [9], [10]])
    y_train = np.array([[5.56], [5.7], [5.91], [6.4], [6.8],
                        [7.05], [8.9], [8.7], [9.0], [9.05]])
    reg = RegressionTree()
    reg.fit(X=X_train, y=y_train, max_depth=3)
    reg.print_rules()


main()
```
A Simple Example
Training data
x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---|---|---|---|---|---|---|---|---|---|---
y | 5.56 | 5.7 | 5.91 | 6.4 | 6.8 | 7.05 | 8.9 | 8.7 | 9 | 9.05
As the table shows, there is only one feature, \(x\).
Choose the optimal feature \(j\) and split point \(s\):
s | 1.5 | 2.5 | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 8.5 | 9.5
---|---|---|---|---|---|---|---|---|---
\(c_1\) | 5.56 | 5.63 | 5.72 | 5.89 | 6.07 | 6.24 | 6.62 | 6.88 | 7.11
\(c_2\) | 7.5 | 7.73 | 7.99 | 8.25 | 8.54 | 8.91 | 8.92 | 9.03 | 9.05
loss | 15.72 | 12.07 | 8.36 | 5.78 | 3.91 | 1.93 | 8.01 | 11.73 | 15.74

The loss is smallest at split point \(s=6.5\), where \(l(s=6.5)=1.93\). This produces two branches: \(R_1=\{1,2,3,4,5,6\}\) with \(c_1=6.24\) and \(R_2=\{7,8,9,10\}\) with \(c_2=8.91\).
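The table can be reproduced with a few lines of numpy (a standalone sketch; the values match after rounding to two decimals):

```python
import numpy as np

x = np.arange(1, 11)
y = np.array([5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9.0, 9.05])

for s in np.arange(1.5, 10.0, 1.0):   # candidate split points
    y1, y2 = y[x < s], y[x >= s]      # left / right branches
    c1, c2 = y1.mean(), y2.mean()     # branch averages
    loss = ((y1 - c1) ** 2).sum() + ((y2 - c2) ** 2).sum()
    print("s=%.1f  c1=%.2f  c2=%.2f  loss=%.2f" % (s, c1, c2, loss))
```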
a) Split \(R_1\) further
x | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
y | 5.56 | 5.7 | 5.91 | 6.4 | 6.8 | 7.05

s | 1.5 | 2.5 | 3.5 | 4.5 | 5.5
---|---|---|---|---|---
\(c_1\) | 5.56 | 5.63 | 5.72 | 5.89 | 6.07
\(c_2\) | 6.37 | 6.54 | 6.75 | 6.93 | 7.05
loss | 1.3087 | 0.754 | 0.2771 | 0.4368 | 1.0644

The loss is smallest at \(s=3.5\), where \(l(s=3.5)=0.2771\) (assume the stopping condition is now satisfied). This produces \(R_1=\{1,2,3\}\) with \(c_1=5.72\) and \(R_2=\{4,5,6\}\) with \(c_2=6.75\).
b) Split \(R_2\) further
x | 7 | 8 | 9 | 10
---|---|---|---|---
y | 8.9 | 8.7 | 9 | 9.05

s | 7.5 | 8.5 | 9.5
---|---|---|---
\(c_1\) | 8.9 | 8.8 | 8.87
\(c_2\) | 8.92 | 9.03 | 9.05
loss | 0.0717 | 0.0213 | 0.0467

The loss is smallest at \(s=8.5\), where \(l(s=8.5)=0.0213\) (assume the stopping condition is now satisfied). This produces \(R_1=\{7,8\}\) with \(c_1=8.8\) and \(R_2=\{9,10\}\) with \(c_2=9.03\).
The resulting function
\[
f(x)=\left\{
\begin{aligned}
5.72 && x<3.5\\
6.75 && 3.5 \leq x<6.5\\
8.8 && 6.5 \leq x<8.5\\
9.03 && 8.5 \leq x\\
\end{aligned}
\right.
\]
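Written out as code, the fitted tree is just a chain of comparisons; this hand-coded version of \(f(x)\) is equivalent to what `_predict` does on the learned tree:

```python
def f(x):
    # piecewise-constant regression function learned above
    if x < 3.5:
        return 5.72
    elif x < 6.5:
        return 6.75
    elif x < 8.5:
        return 8.8
    else:
        return 9.03

print([f(v) for v in [1, 4, 7, 9]])  # [5.72, 6.75, 8.8, 9.03]
```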
Python Library
class sklearn.tree.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9.0, 9.05])

# Fit regression models of increasing depth
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=3)
regr_3 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

# Predict on the training points
X_test = np.copy(X)
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=3", linewidth=2)
plt.plot(X_test, y_3, color="r", label="max_depth=4", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
```
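Each extra level of depth adds more constant segments: a depth-2 tree has at most four leaves, a depth-3 tree at most eight, and on these ten points the depth-4 tree can give almost every sample its own leaf. The plot therefore also illustrates how unconstrained depth slides into overfitting.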