1 Basic Procedure

Decision trees are a common class of machine learning methods.

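In general, a decision tree contains one root node, a number of internal nodes, and a number of leaf nodes. Leaf nodes correspond to decision outcomes, and every other node corresponds to an attribute test. The samples in each node are partitioned among its child nodes according to the outcome of the attribute test; the root node contains the full sample set. The path from the root to each leaf corresponds to a sequence of decision tests.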

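The goal of decision tree learning is to produce a tree with strong generalization ability, that is, one that handles unseen examples well. Its basic procedure follows the simple and intuitive "divide-and-conquer" strategy.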

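Decision tree generation is a recursive process. In the basic algorithm, three situations cause the recursion to return:
(1) All samples in the current node belong to the same class, so no split is needed.
(2) The current attribute set is empty, or all samples take identical values on all attributes, so no split is possible. In this case the current node is marked as a leaf and its class is set to the majority class of the samples it contains.
(3) The sample set of the current node is empty, so no split is possible. In this case the current node is also marked as a leaf, but its class is set to the majority class of its parent node.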
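The recursion and its three return cases can be sketched as follows. This is a minimal illustration, not a reference implementation: the nested-dict tree layout, the function names, and the `choose_attribute` callback are all assumptions made for the example, and the toy data is hypothetical.

```python
from collections import Counter

def majority_class(labels):
    """Most common class label among the given labels."""
    return Counter(labels).most_common(1)[0][0]

def tree_generate(samples, labels, attributes, values, choose_attribute):
    """Build a decision tree as nested dicts; leaves are plain class labels.

    samples:          list of dicts, attribute name -> value
    values:           dict, attribute name -> every value the attribute can take
    choose_attribute: callable(samples, labels, attributes) -> attribute to split on
    """
    # Case (1): every sample has the same class, nothing to split
    if len(set(labels)) == 1:
        return labels[0]
    # Case (2): no attributes left, or all samples agree on every attribute
    if not attributes or all(
        s[a] == samples[0][a] for s in samples for a in attributes
    ):
        return majority_class(labels)

    best = choose_attribute(samples, labels, attributes)
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "children": {}}
    for v in values[best]:
        idx = [i for i, s in enumerate(samples) if s[best] == v]
        if not idx:
            # Case (3): empty branch, fall back to the parent's majority class
            node["children"][v] = majority_class(labels)
        else:
            node["children"][v] = tree_generate(
                [samples[i] for i in idx], [labels[i] for i in idx],
                remaining, values, choose_attribute)
    return node

# Hypothetical toy data; the placeholder chooser simply takes the first attribute
samples = [{"speed_limit": "yes"}, {"speed_limit": "yes"}, {"speed_limit": "no"}]
labels = ["slow", "slow", "fast"]
values = {"speed_limit": ["yes", "no", "unknown"]}
tree = tree_generate(samples, labels, ["speed_limit"], values,
                     lambda s, l, attrs: attrs[0])
print(tree)
```

In practice `choose_attribute` would be replaced by a purity-based criterion such as information gain; the "unknown" branch above has no samples and therefore exercises case (3).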

Building a decision tree

Decision tree algorithm flow

Decision tree parameters

min_samples_split: the minimum number of samples a node must contain before it may be split further.
When it is set very small, the tree may overfit.

Decision tree accuracy

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()



########################## DECISION TREE #################################


### your code goes here--now create 2 decision tree classifiers,
### one with min_samples_split=2 and one with min_samples_split=50
### compute the accuracies on the testing data and store
### the accuracy numbers to acc_min_samples_split_2 and
### acc_min_samples_split_50, respectively

from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=2)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

from sklearn.metrics import accuracy_score
acc_min_samples_split_2 = accuracy_score(pred, labels_test)
print(acc_min_samples_split_2)


############min_samples_split=50#####################
clf50 = tree.DecisionTreeClassifier(min_samples_split=50)
clf50.fit(features_train, labels_train)
pred50 = clf50.predict(features_test)

acc_min_samples_split_50 = accuracy_score(pred50, labels_test)
print(acc_min_samples_split_50)

def submitAccuracies():
  return {"acc_min_samples_split_2":round(acc_min_samples_split_2,3),
          "acc_min_samples_split_50":round(acc_min_samples_split_50,3)}

0.908
0.912
{"message": "{'acc_min_samples_split_50': 0.912, 'acc_min_samples_split_2': 0.908}"}

With min_samples_split=2, the accuracy is 90.8%.
With min_samples_split=50, the accuracy is 91.2%.

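Entropy

Entropy largely controls where a decision tree splits the data.
It is a measure of the impurity in a set of samples.

Computing entropy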

import math
ss = -0.5 * math.log(0.5, 2) - 0.5 * math.log(0.5, 2)
print(ss)  # 1.0

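An entropy of 1.0 is the least pure state.
With two class labels, the largest entropy we can get is 1, so this is the most impure sample possible.
It occurs when the samples are split evenly between the two classes.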
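The same calculation can be generalized to any list of class labels. The sketch below is illustrative only; the function name `entropy` and the example labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    # Sum -p * log2(p) over the proportion p of each class
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["slow", "slow", "fast", "fast"]))  # 1.0, evenly split two-class sample
print(entropy(["slow", "slow", "slow", "fast"]))  # lower: the sample is purer
```

A sample containing a single class has entropy 0, the purest possible state.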

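2 Split Selection

The key to decision tree learning is how to choose the optimal splitting attribute. In general, as splitting proceeds, we want the samples in each branch node to belong to the same class as far as possible; that is, the "purity" of the nodes should keep increasing.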

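2.1 Information Gain

"Information entropy" is the most commonly used measure of the purity of a sample set. If the proportion of class-$k$ samples in the set $D$ is $p_k$ ($k = 1, 2, \ldots, |\mathcal{Y}|$), the information entropy of $D$ is

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k .$$

The smaller $\mathrm{Ent}(D)$ is, the higher the purity of $D$.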

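Information gain is defined as the entropy of the parent node minus the weighted average of the entropies of the child nodes produced by splitting it.

A decision tree algorithm chooses splits that maximize information gain.

Suppose a discrete attribute $a$ has $V$ possible values $\{a^1, \ldots, a^V\}$, and let $D^v$ be the subset of $D$ whose samples take the value $a^v$ on $a$. Branch nodes contain different numbers of samples, and a branch with more samples should have more influence, so each branch is weighted by $|D^v|/|D|$. The "information gain" obtained by splitting the sample set $D$ on attribute $a$ is then

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v),$$

where $\mathrm{Ent}(\cdot)$ denotes information entropy.

In general, the larger the information gain, the greater the "purity improvement" obtained by splitting on attribute $a$.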

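Computing information gain (grade)

The right node contains only one class (fast), so its entropy is 0.
For the left node, P_slow = 2/3
For the left node, P_fast = 1/3
The entropy of the left node is -2/3 * log(2/3, 2) - 1/3 * log(1/3, 2) = 0.9183

The parent node has entropy 1.0 and the left node holds 3/4 of the samples, so the information gain from splitting on grade is 1.0 - 3/4 * 0.9183 - 1/4 * 0 = 0.3113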

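Computing information gain (bumpy vs. smooth terrain)

Splitting on terrain bumpiness yields an information gain of 0, so this attribute tells us nothing about the label and should not be chosen to split the samples when building the tree.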

Computing information gain (speed limit)

Splitting on the speed limit yields an information gain of 1.0, the maximum possible here.
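The three gains can be checked with a few lines of Python. The four-row table below is an assumption: the original data is not shown in the text, so the rows were chosen only to be consistent with the probabilities and gains quoted above. The attribute and function names are likewise hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted average entropy of the children."""
    parent = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - children

# ASSUMPTION: hypothetical samples, reconstructed to match the figures in the text
rows = [
    {"grade": "steep", "bumpiness": "bumpy",  "speed_limit": "yes", "speed": "slow"},
    {"grade": "steep", "bumpiness": "smooth", "speed_limit": "yes", "speed": "slow"},
    {"grade": "flat",  "bumpiness": "bumpy",  "speed_limit": "no",  "speed": "fast"},
    {"grade": "steep", "bumpiness": "smooth", "speed_limit": "no",  "speed": "fast"},
]
for a in ("grade", "bumpiness", "speed_limit"):
    print(a, round(information_gain(rows, a, "speed"), 4))
```

Under these assumed rows, speed limit gives the largest gain and bumpiness gives none, which matches the ranking described above.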

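2.2 Gain Ratio

In fact, the information gain criterion is biased toward attributes with many possible values. To reduce the harm this bias may cause, the well-known C4.5 decision tree algorithm does not use information gain directly but selects the optimal splitting attribute by "gain ratio".
The gain ratio is defined as

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)},$$

where

$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

is called the "intrinsic value" of attribute $a$; here $D^v$ is the subset of $D$ that takes the $v$-th of the $V$ possible values of $a$.
The gain ratio criterion is in turn biased toward attributes with few possible values. C4.5 therefore does not simply pick the candidate with the highest gain ratio; it uses a heuristic: first find the candidate attributes whose information gain is above average, then choose the one with the highest gain ratio among them.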
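A rough sketch of the gain ratio and the two-step heuristic follows. It is not C4.5 itself: the function names and the toy columns are assumptions, and "above average" is implemented as "at least the average, within a small tolerance" so that the candidate set can never be empty.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_and_ratio(column, labels):
    """Return (information gain, gain ratio) for one discrete attribute column."""
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(
        len(g) / len(labels) * entropy(g) for g in groups.values())
    iv = entropy(column)  # intrinsic value: entropy of the attribute's own values
    return gain, (gain / iv if iv > 0 else 0.0)

def c45_choose(columns, labels):
    """Among attributes with roughly above-average gain, pick the best gain ratio."""
    stats = {name: gain_and_ratio(col, labels) for name, col in columns.items()}
    avg_gain = sum(g for g, _ in stats.values()) / len(stats)
    candidates = [n for n, (g, _) in stats.items() if g >= avg_gain - 1e-12]
    return max(candidates, key=lambda n: stats[n][1])

# Hypothetical data: "sample_id" is unique per row, so its raw gain is maximal
labels = ["slow", "slow", "fast", "fast"]
columns = {
    "sample_id":   [1, 2, 3, 4],
    "speed_limit": ["yes", "yes", "no", "no"],
    "bumpiness":   ["bumpy", "smooth", "bumpy", "smooth"],
}
print(gain_and_ratio(columns["sample_id"], labels))    # gain 1.0, ratio 0.5
print(gain_and_ratio(columns["speed_limit"], labels))  # gain 1.0, ratio 1.0
print(c45_choose(columns, labels))                     # speed_limit
```

The ID-like column ties with speed limit on information gain but is penalized by its larger intrinsic value, which is the bias correction the gain ratio is meant to provide.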

2.3 Gini Index

The CART decision tree [Breiman et al., 1984] uses the "Gini index" to select the splitting attribute.
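For reference, the Gini value of a sample set is one minus the sum of squared class proportions, and the Gini index of an attribute is the size-weighted Gini value of the subsets it produces; the attribute with the smallest index is preferred. The sketch below uses this multi-way definition for illustration. The function names and data are assumptions, and CART as usually implemented restricts itself to binary splits.

```python
from collections import Counter

def gini(labels):
    """Gini value: chance that two randomly drawn samples have different classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(column, labels):
    """Size-weighted Gini value of the subsets produced by splitting on a column."""
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())

# Hypothetical data, for illustration only
labels = ["slow", "slow", "fast", "fast"]
print(gini(labels))                                      # 0.5
print(gini_index(["yes", "yes", "no", "no"], labels))    # 0.0, a perfect split
print(gini_index(["steep", "steep", "flat", "steep"], labels))
```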

3 Pruning

Pruning is the main technique decision tree learning algorithms use to combat "overfitting".
The basic pruning strategies are "pre-pruning" and "post-pruning".

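3.1 Pre-pruning

Pre-pruning evaluates each node before it is split during tree generation: if splitting the current node would not improve the tree's generalization performance, the split is abandoned and the node is marked as a leaf.

Pros and cons of pre-pruning:

Pre-pruning leaves many branches of the tree unexpanded, which lowers the risk of overfitting and also markedly reduces both training and testing time. On the other hand, although a given split may fail to improve generalization, or may even lower it temporarily, later splits built on it could improve performance significantly. Because pre-pruning is greedy and forbids such branches from expanding, it exposes the tree to a risk of underfitting.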
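One way to make the pre-pruning test concrete is to compare validation accuracy with and without the candidate split. The sketch below assumes a held-out validation set is used as the estimate of generalization performance; the function names and toy data are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Most common class label."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def should_split(train, valid, attribute):
    """Pre-pruning test: split only if validation accuracy strictly improves.

    train, valid: lists of (sample_dict, label) pairs that reach this node.
    """
    truth = [y for _, y in valid]
    # Option A: keep the node as a leaf predicting the training majority class
    leaf_label = majority([y for _, y in train])
    acc_leaf = accuracy([leaf_label] * len(valid), truth)
    # Option B: split once; each child predicts its own training majority class
    child_label = {}
    for v in {x[attribute] for x, _ in train}:
        child_label[v] = majority([y for x, y in train if x[attribute] == v])
    preds = [child_label.get(x[attribute], leaf_label) for x, _ in valid]
    return accuracy(preds, truth) > acc_leaf

# Hypothetical data
train = [({"speed_limit": "yes", "noise": "a"}, "slow"),
         ({"speed_limit": "yes", "noise": "b"}, "slow"),
         ({"speed_limit": "yes", "noise": "a"}, "slow"),
         ({"speed_limit": "no",  "noise": "a"}, "fast")]
valid = [({"speed_limit": "yes", "noise": "a"}, "slow"),
         ({"speed_limit": "no",  "noise": "b"}, "fast")]
print(should_split(train, valid, "speed_limit"))  # True: the split helps
print(should_split(train, valid, "noise"))        # False: the split is pruned
```

Because the check looks only one level ahead, it shows the greedy behaviour described above: a split whose benefit appears only after further splits would be rejected.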

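3.2 Post-pruning

Post-pruning first grows a complete decision tree from the training set and then examines the non-leaf nodes bottom-up: if replacing the subtree rooted at a node with a leaf improves generalization performance, the subtree is replaced with that leaf.

Pros and cons of post-pruning:

A post-pruned tree usually retains more branches than a pre-pruned one. In general it has little risk of underfitting, and its generalization performance is often better than that of a pre-pruned tree. However, post-pruning runs only after the full tree has been grown and must examine every non-leaf node in the tree one by one, bottom-up, so its training time is much greater than that of unpruned or pre-pruned trees.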
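The bottom-up procedure can be sketched on a small nested-dict tree. The assumptions are: each internal node stores the majority class of its training samples under a hypothetical "majority" key, leaves are plain labels, and generalization is estimated on a validation set. Only the validation samples that reach a node are used to judge it, since no other predictions change when that subtree is replaced.

```python
def predict(tree, x):
    """Follow attribute tests until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["attribute"]], tree["majority"])
    return tree

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def post_prune(tree, valid):
    """Replace a subtree with a leaf when that improves validation accuracy."""
    if not isinstance(tree, dict):
        return tree
    a = tree["attribute"]
    # Bottom-up: prune each child first, using the validation samples routed to it
    for v, child in list(tree["children"].items()):
        subset = [(x, y) for x, y in valid if x[a] == v]
        tree["children"][v] = post_prune(child, subset)
    if not valid:
        return tree  # no validation evidence at this node; keep the subtree
    leaf = tree["majority"]
    return leaf if accuracy(leaf, valid) > accuracy(tree, valid) else tree

# Hypothetical tree whose "noise" subtree has overfit the training data
tree = {"attribute": "speed_limit", "majority": "slow", "children": {
    "yes": "slow",
    "no": {"attribute": "noise", "majority": "fast",
           "children": {"a": "slow", "b": "fast"}}}}
valid = [({"speed_limit": "no", "noise": "a"}, "fast"),
         ({"speed_limit": "no", "noise": "b"}, "fast"),
         ({"speed_limit": "yes", "noise": "a"}, "slow")]
pruned = post_prune(tree, valid)
print(pruned)
```

Here the "noise" subtree is collapsed into the leaf "fast", while the root split is kept because replacing it would lower validation accuracy.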

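4 Continuous and Missing Values

4.1 Handling Continuous Values

Continuous attributes can be handled by discretization. The simplest strategy is bi-partition, which is the mechanism used in the C4.5 decision tree.
For a continuous attribute $a$ that takes $n$ distinct values on $D$, sorted in ascending order as $a^1, a^2, \ldots, a^n$, we can examine the candidate split-point set containing $n-1$ elements

$$T_a = \left\{ \frac{a^i + a^{i+1}}{2} \;\middle|\; 1 \le i \le n-1 \right\},$$

that is, the midpoint of each pair of adjacent values.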
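A small sketch of bi-partition follows: it builds the candidate midpoints between adjacent distinct values and keeps the threshold with the highest information gain. The function names and numbers are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for one attribute."""
    distinct = sorted(set(values))
    # n distinct values give n - 1 candidate midpoints
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    base = entropy(labels)
    best = (None, -1.0)
    for t in candidates:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - weighted > best[1]:
            best = (t, base - weighted)
    return best

# Hypothetical continuous attribute with four distinct values
threshold, gain = best_threshold([1, 2, 3, 4], ["slow", "slow", "fast", "fast"])
print(threshold, gain)  # 2.5 1.0
```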

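4.2 Handling Missing Values

Real-world tasks often contain incomplete samples in which some attribute values are missing. Two questions then need to be answered: (1) how do we select the splitting attribute when attribute values are missing? (2) given a splitting attribute, how do we assign a sample whose value on that attribute is missing?

For question (1), the quality of attribute a can be judged using only the subset of training samples that have no missing value on a, and the splitting attribute is selected on that basis. For question (2), if the value of sample x on splitting attribute a is known, x goes to the child node matching that value; if it is unknown, x is placed into every child node at once, with a different probability (weight) for each.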
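The answer to question (2) can be sketched with weighted samples. The text only says that a sample with a missing value enters each child with a different probability. The concrete weighting used below, each value's share among the samples whose value is known, is an assumption based on the usual C4.5-style treatment, and the names and data are hypothetical.

```python
def split_with_missing(samples, attribute):
    """Partition weighted samples on an attribute; None marks a missing value.

    samples: list of (sample_dict, weight) pairs.
    Returns a dict mapping attribute value -> list of (sample_dict, weight).
    """
    known = [(x, w) for x, w in samples if x[attribute] is not None]
    total = sum(w for _, w in known)
    children = {}
    for x, w in known:
        children.setdefault(x[attribute], []).append((x, w))
    # Weighted share of each value among samples whose value is known
    ratio = {v: sum(w for _, w in group) / total for v, group in children.items()}
    # A sample with a missing value enters every child with a reduced weight
    for x, w in samples:
        if x[attribute] is None:
            for v in ratio:
                children[v].append((x, w * ratio[v]))
    return children

# Hypothetical data: three known values and one missing value, all with weight 1
samples = [({"speed_limit": "yes"}, 1.0), ({"speed_limit": "yes"}, 1.0),
           ({"speed_limit": "no"}, 1.0), ({"speed_limit": None}, 1.0)]
for value, members in split_with_missing(samples, "speed_limit").items():
    print(value, round(sum(w for _, w in members), 3))
```

The total weight is preserved: the sample with the missing value contributes 2/3 of its weight to the "yes" branch and 1/3 to the "no" branch.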

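5 Multivariate Decision Trees

The classification boundary formed by a decision tree has a distinctive property: it is axis-parallel, that is, it consists of segments each parallel to a coordinate axis. A "multivariate decision tree" is a decision tree that can realize "oblique splits" or even more complex ones.
Unlike a traditional "univariate decision tree", the learning process for a multivariate decision tree does not look for one optimal splitting attribute at each non-leaf node; instead, it tries to build a suitable linear classifier there.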
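The difference between the two kinds of node test fits in a few lines. This sketch only contrasts the form of the tests; the function names, weights, and thresholds are made up, and learning the weights of the linear classifier is not shown.

```python
def axis_parallel_test(x, index, threshold):
    """Univariate node: compare a single feature with a threshold."""
    return x[index] <= threshold

def oblique_test(x, weights, threshold):
    """Multivariate node: compare a weighted sum of all features with a threshold."""
    return sum(w * xi for w, xi in zip(weights, x)) <= threshold

# Hypothetical 2-D points; the oblique boundary is the line x0 + x1 = 1
print(axis_parallel_test((0.2, 0.3), 0, 0.5))     # True
print(oblique_test((0.2, 0.3), (1.0, 1.0), 1.0))  # True: below the line
print(oblique_test((0.9, 0.9), (1.0, 1.0), 1.0))  # False: above the line
```

A single oblique test can express a diagonal boundary that an axis-parallel tree would have to approximate with a staircase of many splits.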

6 Tree Code Implementation (k-d tree)

#include <cmath>
#include <vector>

// Structure to represent node of kd tree
struct Node
{
	std::vector<float> point;
	int id;
	Node* left;
	Node* right;

	Node(std::vector<float> arr, int setId)
	:	point(arr), id(setId), left(NULL), right(NULL)
	{}
};

struct KdTree
{
	Node* root;

	KdTree()
	: root(NULL)
	{}

	void insertHelper(Node** node, unsigned int depth, std::vector<float> point, int id)
	{
		if (*node == NULL)
		{
			*node = new Node(point, id);
		}
		else
		{
			// split axis alternates between x (0) and y (1), matching the 2-D search below
			unsigned int cd = depth % 2;
			if (point[cd] < ((*node)->point[cd]))
				insertHelper(&((*node)->left), depth + 1, point, id);
			else
				insertHelper(&((*node)->right), depth + 1, point, id);
		}

	}

	void insert(std::vector<float> point, int id)
	{
		// Insert a new point into the tree: a new node is created and placed
		// at the correct position, starting from the root
		insertHelper(&root, 0, point, id);

	}

	void searchHelper(std::vector<float> target, Node* node, int depth, float distanceTol, std::vector<int>& ids)
	{
		if (node != NULL)
		{
			if ((node->point[0] >= (target[0] - distanceTol) && node->point[0] <= (target[0] + distanceTol) &&
				node->point[1] >= (target[1] - distanceTol) && node->point[1] <= (target[1] + distanceTol)))
			{
				float distance = sqrt((node->point[0] - target[0]) * (node->point[0] - target[0]) +
					(node->point[1] - target[1]) * (node->point[1] - target[1]));
				if (distance <= distanceTol)
					ids.push_back(node->id);
			}


			// check across boundary: the search box may overlap either side of the split
			if ((target[depth % 2] - distanceTol) < node->point[depth % 2])
				searchHelper(target, node->left, depth + 1, distanceTol, ids);
			// points equal to the split value are inserted on the right, hence >=
			if ((target[depth % 2] + distanceTol) >= node->point[depth % 2])
				searchHelper(target, node->right, depth + 1, distanceTol, ids);
		}

	}

	// return a list of point ids in the tree that are within distance of target
	std::vector<int> search(std::vector<float> target, float distanceTol)
	{
		std::vector<int> ids;
		searchHelper(target, root, 0, distanceTol, ids);
		return ids;
	}
	

};

7 Coding a Decision Tree with sklearn

classifyDT.py

def classify(features_train, labels_train):

    ### your code goes here--should return a trained decision tree classifier

    from sklearn import tree

    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(features_train, labels_train)

    return clf

main.py

#!/usr/bin/python

""" lecture and example code for decision tree unit """

import sys
from class_vis import prettyPicture, output_image
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
from classifyDT import classify

features_train, labels_train, features_test, labels_test = makeTerrainData()



### the classify() function in classifyDT is where the magic
### happens--fill in this function in the file 'classifyDT.py'!
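clf = classify(features_train, labels_train)

### accuracy on the test set
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, clf.predict(features_test))
print(acc)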

#### grader code, do not modify below this line

prettyPicture(clf, features_test, labels_test)
output_image("test.png", "png", open("test.png", "rb").read())

The accuracy is 0.908.

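8 Strengths and Weaknesses of Decision Trees

Strengths:
1. Very easy to use, and the tree can be drawn as a diagram that makes the data easy to interpret
2. Larger classifiers can be built from decision trees through ensemble methods

Weaknesses:
Prone to overfitting, especially on data with many features, where a complex tree will overfit.
Measuring the accuracy of the tree is therefore very important, and you need to stop its growth at the appropriate time.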

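9 Bias and Variance

A high-bias machine learning algorithm essentially ignores the training data and has almost no capacity to learn from it. This is called bias, and it corresponds to underfitting.

A high-variance algorithm is extremely sensitive to the data and can only reproduce what it has already seen. Its problem is that it responds poorly to situations it has not seen before, because it lacks the appropriate bias to generalize to new cases. This corresponds to overfitting.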

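Decision tree mini-project: identifying the author

A few years ago, J.K. Rowling (of Harry Potter fame) tried something interesting. She wrote a book called The Cuckoo's Calling under the pseudonym Robert Galbraith. The book received some good reviews but attracted little attention until an anonymous tipster on Twitter said it was written by J.K. Rowling. The London Sunday Times enlisted two experts to compare The Cuckoo's Calling with Rowling's The Casual Vacancy and with books by several other authors. Their analysis pointed strongly to Rowling as the author; the Times asked the publisher directly whether this was true, the publisher confirmed it, and the book became popular overnight.

We will do something similar in this project. We have a set of emails, half written by one person and half by another person at the same company. Our goal is to classify the emails by author based only on the text of the email. We start with Naive Bayes in this mini-project and extend to other algorithms in later projects.

We will start by giving you a list of strings. Each string is the text of an email that has undergone some basic preprocessing; we then provide the code that splits the dataset into training and testing sets.

Get a decision tree up and running as a classifier, setting min_samples_split=40. It may take a while to train.

1) What is the accuracy?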

import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()




#########################################################
### your code goes here ###
from sklearn import tree

clf = tree.DecisionTreeClassifier(min_samples_split=40)

t0 = time()
clf.fit(features_train, labels_train)
print("training time:", round(time()-t0, 3), "s")
t1 = time()

pred = clf.predict(features_test)
print("testing time:", round(time()-t1, 3), "s")

from sklearn.metrics import accuracy_score

acc = accuracy_score(labels_test, pred)
print(acc)

training time: 64.302 s
testing time: 0.222 s
0.975540386803

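2) Number of features

You learned in the SVM mini-project that parameter tuning can significantly speed up the training time of a machine learning algorithm. In general, parameters tune the complexity of the algorithm, and more complex algorithms usually run more slowly.

Another way to control the complexity of an algorithm is via the number of features used in training/testing. The more features the algorithm has available, the more potential there is for a complex fit. We will explore this in detail in the "Feature Selection" lesson, but you can get a sneak preview now.

How many features are in your data?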

print(len(features_train[0]))

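There are 3785 features in total.

3) Changing the number of features

All other things being equal, a larger number of features makes the decision tree more complex.

When you use only 1% of the available features (i.e. percentile = 1), the accuracy of the decision tree is 0.9664.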
