Regression Tree

Classification Trees and Regression Trees

Classification trees are used for classification problems. When choosing a split point, a classification decision tree uses information entropy, information gain, information gain ratio, or the Gini index as the splitting criterion.
Classification tree analysis is when the predicted outcome is the class to which the data belongs.

Regression decision trees handle data whose output is continuous. When choosing a split point, a regression decision tree tries to make the error of the two resulting branches as small as possible.

Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).

Regression Tree

English name: Regression Tree

Principle

The most intuitive way to understand a decision tree is this: take the input feature space (\(R^n\)) and partition it; each partition is assigned to a single class or, for regression, to a single predicted output value. The algorithm therefore has to answer two questions: 1. How to choose the decision boundaries (split points)? 2. How to keep the number of comparisons as small as possible (i.e. the shape of the tree)?

As shown in the figure above, each non-leaf node corresponds to a split on a particular feature.

Least-Squares Regression Tree Generation

Q1: How is the split point chosen? Iterate over all \(n\) features; each feature has \(s_i\) candidate values. After trying every feature and every possible split of each feature, choose the feature and split value that minimize the loss function.

Q2: What does a leaf node output? The mean of all the target values in that region is used as the output.

The loss function at a node has the form
\[ \min _{j, s}\left[\min _{c_{1}} Loss(y_i,c_1)+\min _{c_{2}} Loss(y_i,c_2)\right] \]
A node has two branches: \(c_1\) is the mean of the left branch and \(c_2\) is the mean of the right branch. In other words, every split is chosen so that the sum of the errors of the two resulting branches is minimized. The final learned function is a piecewise-constant function.
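A minimal sketch of this criterion (using NumPy; the array y below is the training data from the worked example later in this post): for a given candidate split, each branch predicts its mean, and the loss is the total squared error of the two branches.

import numpy as np

def split_loss(y_left, y_right):
    # Each branch predicts its mean; the loss is the sum of squared errors of both branches
    c1, c2 = y_left.mean(), y_right.mean()
    return np.sum((y_left - c1) ** 2) + np.sum((y_right - c2) ** 2)

y = np.array([5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9.0, 9.05])
print(split_loss(y[:6], y[6:]))  # splitting x at 6.5 gives a loss of about 1.93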

CART Algorithm

Input: training data set

Output: regression tree \(f(x)\)

  1. Choose the optimal feature \(j\) and split point \(s\)
    \[ \min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-c_{2}\right)^{2}\right] \]

  2. Use the selected \((j, s)\) to partition the region, and set each sub-region's predicted value (the mean of its targets)

  3. Recursively apply steps 1 and 2 to the two sub-regions until the stopping condition is satisfied

  4. Return the generated tree

    Note on choosing split points: sort the feature values first, then evaluate the binary splits between adjacent values (a brute-force sketch of this search is given below).
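A minimal brute-force sketch of step 1, under the assumption that X is a 2-D NumPy array of inputs and y is a 1-D NumPy array of targets; the candidate splits for each feature are taken as the midpoints between adjacent sorted values:

import numpy as np

def best_split(X, y):
    # Search every feature j and every candidate split s for the pair that
    # minimizes the total squared error of the two resulting branches
    best = None  # (loss, feature, split)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])                 # unique values are already sorted
        for s in (values[:-1] + values[1:]) / 2:    # midpoints between adjacent values
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            loss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best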

Python Code

Node class

Attributes: left and right child nodes, score (the node's predicted value), feature index, and split point

import numpy as np  # numpy is used by the RegressionTree methods below


class Node(object):
    def __init__(self, score=None):
        # constructor: score is the predicted value stored at this node
        self.score = score
        self.left = None
        self.right = None
        self.feature = None
        self.split = None

RegressionTree class

Constructor

class RegressionTree(object):
    def __init__(self):
        self.root = Node()
        self.height = 0

Given a feature and a split point, compute and return the squared-error loss (MSE) of the split

def _get_split_mse(self, X, y, idx, feature, split):
    '''
    X: training inputs
    y: training targets
    idx: indices of the samples belonging to this branch
    feature: feature index
    split: split point
    '''
    # Split the branch samples by comparing the feature value to the split point
    mask = X[idx, feature] < split
    split_y1 = y[idx][mask]
    split_y2 = y[idx][~mask]

    # Each side predicts its mean; the loss is the total squared error of both sides
    split_avg = [np.mean(split_y1), np.mean(split_y2)]
    split_mse = np.sum((split_y1 - split_avg[0]) ** 2) + np.sum((split_y2 - split_avg[1]) ** 2)
    return split_mse, split, split_avg

Find the best split point for a given feature

Iterate over all the distinct values of the feature column and take the one with the smallest MSE as the best split point. If the feature has only one unique value, return None.

def _choose_split_point(self, X, y, idx, feature):
    # Candidate split points are the unique feature values (skip the smallest so
    # that both sides of every candidate split are non-empty)
    feature_x = X[idx, feature]
    uniques = np.unique(feature_x)
    if len(uniques) == 1:
        return None

    mse, split, split_avg = min(
        (self._get_split_mse(X, y, idx, feature, split)
         for split in uniques[1:]), key=lambda x: x[0])
    return mse, feature, split, split_avg

Choose the feature
Iterate over all features, compute the MSE of each feature's best split, and pick the feature and split point with the smallest MSE. If every feature has only identical values, return None.

def _choose_feature(self, X, y, idx):
    # Try every feature; features with a single unique value return None and are dropped
    m = len(X[0])
    split_rets = [x for x in map(lambda f: self._choose_split_point(
        X, y, idx, f), range(m)) if x is not None]

    if split_rets == []:
        return None
    # Keep the feature whose best split gives the smallest MSE
    _, feature, split, split_avg = min(
        split_rets, key=lambda x: x[0])

    # Partition the sample indices into the left (< split) and right (>= split) branches
    idx_split = [[], []]
    while idx:
        i = idx.pop()
        xi = X[i][feature]
        if xi < split:
            idx_split[0].append(i)
        else:
            idx_split[1].append(i)
    return feature, split, split_avg, idx_split

Helper for printing leaf-node rules: convert a split expression into a readable string

def _expr2literal(self, expr):
    # expr = [feature, op, split]; op == 1 means ">=", otherwise "<"
    feature, op, split = expr
    op = ">=" if op == 1 else "<"
    return "Feature%d %s %.4f" % (feature, op, split)

After the binary tree has been built, traverse it to collect the rules

def _get_rules(self):
    # Breadth-first traversal; each queue entry carries a node together with the
    # list of split expressions on the path from the root to that node
    que = [[self.root, []]]
    self.rules = []

    while que:
        nd, exprs = que.pop(0)
        # Leaf node: record its full rule (the path of expressions) and its score
        if not (nd.left or nd.right):
            literals = list(map(self._expr2literal, exprs))
            self.rules.append([literals, nd.score])

        if nd.left:
            rule_left = exprs + [[nd.feature, -1, nd.split]]
            que.append([nd.left, rule_left])

        if nd.right:
            rule_right = exprs + [[nd.feature, 1, nd.split]]
            que.append([nd.right, rule_right])

Building the binary tree is the training process. Stopping criteria:

  1. Limit the maximum depth
  2. Require a minimum number of samples at a node before splitting
  3. At least one feature must have non-identical values
def fit(self, X, y, max_depth=5, min_samples_split=2):
    self.root = Node()
    que = [[0, self.root, list(range(len(y)))]]

    while que:
        depth, nd, idx = que.pop(0)

        # Stop growing once the maximum depth is reached
        if depth == max_depth:
            break

        # Skip nodes with too few samples or with identical target values
        if len(idx) < min_samples_split or len(set(map(lambda i: y[i, 0], idx))) == 1:
            continue

        feature_rets = self._choose_feature(X, y, idx)
        if feature_rets is None:
            continue

        # Split this node and enqueue its two children
        nd.feature, nd.split, split_avg, idx_split = feature_rets
        nd.left = Node(split_avg[0])
        nd.right = Node(split_avg[1])
        que.append([depth + 1, nd.left, idx_split[0]])
        que.append([depth + 1, nd.right, idx_split[1]])

    self.height = depth
    self._get_rules()

Print the leaf-node rules

def print_rules(self):
    for i, rule in enumerate(self.rules):
        literals, score = rule
        print("Rule %d: " % i, ' | '.join(
            literals) + ' => split_hat %.4f' % score)
 

Predict a single sample

def _predict(self, row):
    # Walk down the tree until a leaf is reached, then return its score
    nd = self.root
    while nd.left and nd.right:
        if row[nd.feature] < nd.split:
            nd = nd.left
        else:
            nd = nd.right
    return nd.score


# Predict multiple samples
def predict(self, X):
    return [self._predict(Xi) for Xi in X]
  
 
def main():
    print("Testing the accuracy of RegressionTree...")
    X_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
    y_train = np.array([[5.56], [5.7], [5.91], [6.4], [6.8], [7.05],
                        [8.9], [8.7], [9], [9.05]])
    reg = RegressionTree()
    print(reg)
    reg.fit(X=X_train, y=y_train, max_depth=3)
    reg.print_rules()


main()

A Simple Example

Training data

x 1 2 3 4 5 6 7 8 9 10
y 5.56 5.7 5.91 6.4 6.8 7.05 8.9 8.7 9 9.05

According to the table above, there is only one feature, \(x\).

  1. Choose the optimal feature \(j\) and split point \(s\)

    split point (s) 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
    \(c_1\) 5.56 5.63 5.72 5.89 6.07 6.24 6.62 6.88 7.11
    \(c_2\) 7.5 7.73 7.99 8.25 8.54 8.91 8.92 9.03 9.05
    loss 15.72 12.07 8.36 5.78 3.91 1.93 8.01 11.73 15.74

    The minimum loss \(l(s=6.5)=1.93\) is attained at the split point \(s=6.5\). This produces two branches: \(R_1=\{1,2,3,4,5,6\}\) with \(c_1=6.24\), and \(R_2=\{7,8,9,10\}\) with \(c_2=8.91\). (A short NumPy snippet that reproduces these loss tables follows this example.)

    a) Continue splitting \(R_1\)

      x 1 2 3 4 5 6
      y 5.56 5.7 5.91 6.4 6.8 7.05
      split point (s) 1.5 2.5 3.5 4.5 5.5
      \(c_1\) 5.56 5.63 5.72 5.89 6.07
      \(c_2\) 6.37 6.54 6.75 6.93 7.05
      loss 1.3087 0.754 0.2771 0.4368 1.0644

      The minimum loss \(l(s=3.5)=0.2771\) is attained at the split point \(s=3.5\) (assume the stopping condition is now satisfied). This gives two branches: \(R_1=\{1,2,3\}\) with \(c_1=5.72\), and \(R_2=\{4,5,6\}\) with \(c_2=6.75\).

      b) Continue splitting \(R_2\)

      x 7 8 9 10
      y 8.9 8.7 9 9.05
      split point (s) 7.5 8.5 9.5
      \(c_1\) 8.9 8.8 8.87
      \(c_2\) 8.92 9.03 9.05
      loss 0.0717 0.0213 0.0467

      The minimum loss \(l(s=8.5)=0.0213\) is attained at the split point \(s=8.5\) (assume the stopping condition is now satisfied). This gives two branches: \(R_1=\{7,8\}\) with \(c_1=8.8\), and \(R_2=\{9,10\}\) with \(c_2=9.03\).

    2. The resulting function expression
      \[ \begin{equation} f(x)=\left\{ \begin{aligned} 5.72 & & x<3.5\\ 6.75 & & 3.5 \le x<6.5\\ 8.8 & & 6.5 \le x<8.5\\ 9.03 & & x \ge 8.5\\ \end{aligned} \right. \end{equation} \]
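The loss rows in the tables above can be reproduced with a few lines of NumPy (a small sketch on the same training data; the candidate splits are the midpoints 1.5, 2.5, ..., 9.5):

import numpy as np

x = np.arange(1, 11)
y = np.array([5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9.0, 9.05])

for s in np.arange(1.5, 10, 1.0):
    y1, y2 = y[x < s], y[x >= s]
    c1, c2 = y1.mean(), y2.mean()
    loss = np.sum((y1 - c1) ** 2) + np.sum((y2 - c2) ** 2)
    print("s=%.1f  c1=%.2f  c2=%.2f  loss=%.2f" % (s, c1, c2, loss))
# The minimum is at s = 6.5 (loss of about 1.93), matching the first table above.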

Python Library

class sklearn.tree.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 13 19:59:53 2019

@author: 23230
"""

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

X=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
y=np.array([[5.56],[5.7],[5.91],[6.4],[6.8],[7.05],[8.9],[8.7],[9],[9.05]])

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=3)
regr_3 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

X_test = np.copy(X)
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)
 
# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black",c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=3", linewidth=2)
plt.plot(X_test, y_3, color="r", label="max_depth=4", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
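To compare with the hand-written print_rules output above, the splits learned by scikit-learn can also be printed (this assumes a scikit-learn version that provides sklearn.tree.export_text, available since 0.21):

from sklearn.tree import export_text

# Print the split thresholds and leaf values of the depth-3 tree
print(export_text(regr_2, feature_names=["x"]))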
