Implementing Machine Learning Algorithms in Python (2): ID3 Decision Tree

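The data in this post come from the decision tree chapter of Zhou Zhihua's book Machine Learning (機器學習). The post can also serve as an answer to Exercise 3 at the end of that chapter.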

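The code is based on the book Machine Learning in Action (機器學習實戰), with a fair number of modifications so that it works on datasets containing both discrete and continuous features.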

The Python libraries used in this post are:

numpy
pandas
math
operator
matplotlib

The data used in this post are as follows:

Idx	色澤	根蒂	敲聲	紋理	臍部	觸感	密度	含糖率	label
1	青綠	蜷縮	濁響	清晰	凹陷	硬滑	0.697	0.46	1
2	烏黑	蜷縮	沉悶	清晰	凹陷	硬滑	0.774	0.376	1
3	烏黑	蜷縮	濁響	清晰	凹陷	硬滑	0.634	0.264	1
4	青綠	蜷縮	沉悶	清晰	凹陷	硬滑	0.608	0.318	1
5	淺白	蜷縮	濁響	清晰	凹陷	硬滑	0.556	0.215	1
6	青綠	稍蜷	濁響	清晰	稍凹	軟粘	0.403	0.237	1
7	烏黑	稍蜷	濁響	稍糊	稍凹	軟粘	0.481	0.149	1
8	烏黑	稍蜷	濁響	清晰	稍凹	硬滑	0.437	0.211	1
9	烏黑	稍蜷	沉悶	稍糊	稍凹	硬滑	0.666	0.091	0
10	青綠	硬挺	清脆	清晰	平坦	軟粘	0.243	0.267	0
11	淺白	硬挺	清脆	模糊	平坦	硬滑	0.245	0.057	0
12	淺白	蜷縮	濁響	模糊	平坦	軟粘	0.343	0.099	0
13	青綠	稍蜷	濁響	稍糊	凹陷	硬滑	0.639	0.161	0
14	淺白	稍蜷	沉悶	稍糊	凹陷	硬滑	0.657	0.198	0
15	烏黑	稍蜷	濁響	清晰	稍凹	軟粘	0.36	0.37	0
16	淺白	蜷縮	濁響	模糊	平坦	硬滑	0.593	0.042	0
17	青綠	蜷縮	沉悶	稍糊	稍凹	硬滑	0.719	0.103	0

Because I could not get matplotlib to render Chinese text, I replaced all the Chinese strings with English ones, as follows:

Idx	color	root	knocks	texture	navel	touch	density	sugar_ratio	label
1	dark_green	curl_up	little_heavily	distinct	sinking	hard_smooth	0.697	0.46	1
2	black	curl_up	heavily	distinct	sinking	hard_smooth	0.774	0.376	1
3	black	curl_up	little_heavily	distinct	sinking	hard_smooth	0.634	0.264	1
4	dark_green	curl_up	heavily	distinct	sinking	hard_smooth	0.608	0.318	1
5	light_white	curl_up	little_heavily	distinct	sinking	hard_smooth	0.556	0.215	1
6	dark_green	little_curl_up	little_heavily	distinct	little_sinking	soft_stick	0.403	0.237	1
7	black	little_curl_up	little_heavily	little_blur	little_sinking	soft_stick	0.481	0.149	1
8	black	little_curl_up	little_heavily	distinct	little_sinking	hard_smooth	0.437	0.211	1
9	black	little_curl_up	heavily	little_blur	little_sinking	hard_smooth	0.666	0.091	0
10	dark_green	stiff	clear	distinct	even	soft_stick	0.243	0.267	0
11	light_white	stiff	clear	blur	even	hard_smooth	0.245	0.057	0
12	light_white	curl_up	little_heavily	blur	even	soft_stick	0.343	0.099	0
13	dark_green	little_curl_up	little_heavily	little_blur	sinking	hard_smooth	0.639	0.161	0
14	light_white	little_curl_up	heavily	little_blur	sinking	hard_smooth	0.657	0.198	0
15	black	little_curl_up	little_heavily	distinct	little_sinking	soft_stick	0.36	0.37	0
16	light_white	curl_up	little_heavily	blur	even	hard_smooth	0.593	0.042	0
17	dark_green	curl_up	heavily	little_blur	little_sinking	hard_smooth	0.719	0.103	0

The meaning of each string can be found by comparing the two tables above.
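The correspondence between the two tables can also be written down as a lookup table. The following is only a sketch: the dictionary name ZH_TO_EN and the helper translate_row are made up for illustration and are not part of the original code; the pairs themselves are read off the two tables row by row.

```python
# Chinese-to-English lookup, read off the two tables above.
# ZH_TO_EN and translate_row are hypothetical names used only for this sketch.
ZH_TO_EN = {
    # column headers
    '色澤': 'color', '根蒂': 'root', '敲聲': 'knocks', '紋理': 'texture',
    '臍部': 'navel', '觸感': 'touch', '密度': 'density', '含糖率': 'sugar_ratio',
    # color
    '青綠': 'dark_green', '烏黑': 'black', '淺白': 'light_white',
    # root
    '蜷縮': 'curl_up', '稍蜷': 'little_curl_up', '硬挺': 'stiff',
    # knocks
    '濁響': 'little_heavily', '沉悶': 'heavily', '清脆': 'clear',
    # texture
    '清晰': 'distinct', '稍糊': 'little_blur', '模糊': 'blur',
    # navel
    '凹陷': 'sinking', '稍凹': 'little_sinking', '平坦': 'even',
    # touch
    '硬滑': 'hard_smooth', '軟粘': 'soft_stick',
}


def translate_row(row):
    """Translate every Chinese string in a row; numbers pass through unchanged."""
    return [ZH_TO_EN.get(cell, cell) for cell in row]


# Sample 1 from the first table.
row1 = ['青綠', '蜷縮', '濁響', '清晰', '凹陷', '硬滑', 0.697, 0.46, 1]
print(translate_row(row1))
```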

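The tree-building code follows Chapter 3 of Machine Learning in Action. The code in that chapter handles only discrete features, so the program below modifies it to work on datasets that contain both discrete and continuous features.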
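The key change for continuous features is to turn each one into a binary test of the form "value <= threshold". A minimal standalone sketch of that idea is shown here, assuming candidate thresholds are the midpoints between adjacent sorted values. The names entropy and best_threshold are illustrative and do not appear in the original code, and unlike the original this sketch removes duplicate values before forming midpoints.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, information gain) of the best split 'value <= threshold'."""
    ordered = sorted(set(values))
    # Candidate thresholds: midpoints between adjacent distinct values.
    candidates = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    base = entropy(labels)
    best = (None, 0.0)
    for t in candidates:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weighted entropy of the two subsets after the split.
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - cond
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy data: the two classes are perfectly separated at 2.5.
print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))  # (2.5, 1.0)

# The density column and labels of the 17 samples from the table above.
density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.36, 0.593, 0.719]
label = [1] * 8 + [0] * 9
print(best_threshold(density, label))
```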

The tree-building code is as follows:

# -*- coding: utf-8 -*-

from numpy import *
import numpy as np
import pandas as pd
from math import log
import operator



# Compute the Shannon entropy of a dataset (the class label is the last column)
def calcShannonEnt(dataSet):
    numEntries=len(dataSet)
    labelCounts={}
    # Count how many samples fall into each class
    for featVec in dataSet:
        currentLabel=featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1
    shannonEnt=0.0
    # Shannon entropy with base-2 logarithm
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt-=prob*log(prob,2)
    return shannonEnt


# Split on a discrete feature: return all samples whose feature `axis` equals value,
# with that feature column removed
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec=featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Split on a continuous feature. direction selects which side to return:
# 0 returns the samples with feature > value, anything else returns those with feature <= value
def splitContinuousDataSet(dataSet,axis,value,direction):
    retDataSet=[]
    for featVec in dataSet:
        if direction==0:
            if featVec[axis]>value:
                reducedFeatVec=featVec[:axis]
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        else:
            if featVec[axis]<=value:
                reducedFeatVec=featVec[:axis]
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
    return retDataSet

# Choose the best feature to split on (the one with the highest information gain)
def chooseBestFeatureToSplit(dataSet,labels):
    numFeatures=len(dataSet[0])-1
    baseEntropy=calcShannonEnt(dataSet)
    bestInfoGain=0.0
    bestFeature=-1
    bestSplitDict={}
    for i in range(numFeatures):
        featList=[example[i] for example in dataSet]
        # Continuous (numeric) feature
        if isinstance(featList[0],(int,float)):
            # Generate n-1 candidate split points: midpoints of adjacent sorted values
            sortfeatList=sorted(featList)
            splitList=[]
            for j in range(len(sortfeatList)-1):
                splitList.append((sortfeatList[j]+sortfeatList[j+1])/2.0)

            bestSplitEntropy=10000
            slen=len(splitList)
            # Compute the weighted entropy after splitting at the j-th candidate point
            # and remember the best split point
            for j in range(slen):
                value=splitList[j]
                newEntropy=0.0
                subDataSet0=splitContinuousDataSet(dataSet,i,value,0)
                subDataSet1=splitContinuousDataSet(dataSet,i,value,1)
                prob0=len(subDataSet0)/float(len(dataSet))
                newEntropy+=prob0*calcShannonEnt(subDataSet0)
                prob1=len(subDataSet1)/float(len(dataSet))
                newEntropy+=prob1*calcShannonEnt(subDataSet1)
                if newEntropy<bestSplitEntropy:
                    bestSplitEntropy=newEntropy
                    bestSplit=j
            # Record the best split point of this feature in a dict
            bestSplitDict[labels[i]]=splitList[bestSplit]
            infoGain=baseEntropy-bestSplitEntropy
        # Discrete feature
        else:
            uniqueVals=set(featList)
            newEntropy=0.0
            # Accumulate the weighted entropy over every value of this feature
            for value in uniqueVals:
                subDataSet=splitDataSet(dataSet,i,value)
                prob=len(subDataSet)/float(len(dataSet))
                newEntropy+=prob*calcShannonEnt(subDataSet)
            infoGain=baseEntropy-newEntropy
        if infoGain>bestInfoGain:
            bestInfoGain=infoGain
            bestFeature=i
    # If the best feature at this node is continuous, binarize it in place around the
    # recorded split point: 1 if value <= bestSplitValue, otherwise 0.
    # The bestFeature!=-1 guard keeps the class-label column from being binarized
    # when no feature yields a positive gain.
    if bestFeature!=-1 and isinstance(dataSet[0][bestFeature],(int,float)):
        bestSplitValue=bestSplitDict[labels[bestFeature]]
        labels[bestFeature]=labels[bestFeature]+'<='+str(bestSplitValue)
        for i in range(len(dataSet)):
            if dataSet[i][bestFeature]<=bestSplitValue:
                dataSet[i][bestFeature]=1
            else:
                dataSet[i][bestFeature]=0
    return bestFeature

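# If all features have been used up but the samples at a node still disagree,
# decide the class by majority vote
def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1
    # Return the class with the highest count (max(classCount) alone would
    # return the largest key, not the most frequent class)
    return max(classCount,key=classCount.get)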

# Main routine: build the decision tree recursively.
# data_full and labels_full hold the complete training set, so that every value
# of a discrete feature gets a branch even if it is absent from the current subset.
def createTree(dataSet,labels,data_full,labels_full):
    classList=[example[-1] for example in dataSet]
    # All samples belong to the same class: return that class as a leaf
    if classList.count(classList[0])==len(classList):
        return classList[0]
    # No features left: return the majority class
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat=chooseBestFeatureToSplit(dataSet,labels)
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}}
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    # For a discrete feature, collect all of its values in the full training set
    if isinstance(dataSet[0][bestFeat],str):
        currentlabel=labels_full.index(labels[bestFeat])
        featValuesFull=[example[currentlabel] for example in data_full]
        uniqueValsFull=set(featValuesFull)
    del(labels[bestFeat])
    # Build one subtree for each value of bestFeat present in the current subset
    for value in uniqueVals:
        subLabels=labels[:]
        if isinstance(dataSet[0][bestFeat],str):
            uniqueValsFull.remove(value)
        myTree[bestFeatLabel][value]=createTree(
            splitDataSet(dataSet,bestFeat,value),subLabels,data_full,labels_full)
    # Values that never occur in the current subset get a leaf with the majority class
    if isinstance(dataSet[0][bestFeat],str):
        for value in uniqueValsFull:
            myTree[bestFeatLabel][value]=majorityCnt(classList)
    return myTree
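As a sanity check on the entropy and information-gain calculations that drive the tree, here is a small standalone sketch that recomputes two numbers by hand-coded formulas: the entropy of the full 17-sample dataset and the information gain of the texture feature. The function names entropy and info_gain are chosen for this sketch only; the data are the texture and label columns of the English table above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Information gain of a discrete feature with respect to the labels."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# texture and label columns for samples 1-17
texture = ['distinct', 'distinct', 'distinct', 'distinct', 'distinct', 'distinct',
           'little_blur', 'distinct', 'little_blur', 'distinct', 'blur', 'blur',
           'little_blur', 'little_blur', 'distinct', 'blur', 'little_blur']
labels = [1] * 8 + [0] * 9

root_entropy = entropy(labels)
texture_gain = info_gain(texture, labels)
print(round(root_entropy, 3), round(texture_gain, 3))  # 0.998 0.381
```

Texture has a higher gain than the best density split computed the same way, which is consistent with texture appearing at the root of the tree.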

It is called with the following statements:

df=pd.read_csv('watermelon_4_3.csv')
data=df.values[:,1:].tolist()
data_full=data[:]
labels=df.columns.values[1:-1].tolist()
labels_full=labels[:]
myTree=createTree(data,labels,data_full,labels_full)

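This gives the following result (the trailing L on the leaf values is Python 2's long-integer suffix; Python 3 has no such suffix):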

>>> myTree
{'texture': {'distinct': {'density<=0.3815': {0: 1L, 1: 0L}}, 'little_blur': {'touch': {'hard_smooth': 0L, 'soft_stick': 1L}}, 'blur': 0L}}
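The nested dictionary can be used directly for prediction. The sketch below is one possible way to walk it; the function classify is not part of the original post, and it assumes that a node label containing '<=' always has the form name<=threshold with branch 1 meaning "condition true" and branch 0 meaning "condition false", as produced by the binarization step above.

```python
def classify(tree, sample):
    """Walk a nested-dict decision tree and return the predicted label.

    `sample` maps feature names to values. Hypothetical helper, written
    for the tree format printed above.
    """
    while isinstance(tree, dict):
        node_label = next(iter(tree))      # the single key of this node
        branches = tree[node_label]
        if '<=' in node_label:
            # Binarized continuous feature, e.g. 'density<=0.3815'
            name, threshold = node_label.split('<=')
            key = 1 if sample[name] <= float(threshold) else 0
        else:
            key = sample[node_label]
        tree = branches[key]
    return tree


# The tree printed above, with the Python 2 'L' suffixes dropped.
my_tree = {'texture': {'distinct': {'density<=0.3815': {0: 1, 1: 0}},
                       'little_blur': {'touch': {'hard_smooth': 0, 'soft_stick': 1}},
                       'blur': 0}}

# Sample 1 and sample 15 from the table.
print(classify(my_tree, {'texture': 'distinct', 'density': 0.697}))  # 1
print(classify(my_tree, {'texture': 'distinct', 'density': 0.36}))   # 0
```

A sample whose feature value has no branch in the tree raises a KeyError here, which is exactly the situation the update at the end of this post addresses.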

The plotting code is as follows:

import matplotlib.pyplot as plt
decisionNode=dict(boxstyle="sawtooth",fc="0.8")
leafNode=dict(boxstyle="round4",fc="0.8")
arrow_args=dict(arrowstyle="<-")


# Count the leaf nodes of the tree
def getNumLeafs(myTree):
    numLeafs=0
    # list(...) is needed because dict.keys() is not indexable in Python 3
    firstStr=list(myTree.keys())[0]
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs+=getNumLeafs(secondDict[key])
        else: numLeafs+=1
    return numLeafs

# Compute the maximum depth of the tree
def getTreeDepth(myTree):
    maxDepth=0
    firstStr=list(myTree.keys())[0]
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth=1+getTreeDepth(secondDict[key])
        else: thisDepth=1
        if thisDepth>maxDepth:
            maxDepth=thisDepth
    return maxDepth

# Draw a node, with an arrow from its parent
def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    createPlot.ax1.annotate(str(nodeTxt),xy=parentPt,xycoords='axes fraction',\
    xytext=centerPt,textcoords='axes fraction',va="center", ha="center",\
    bbox=nodeType,arrowprops=arrow_args)

# Draw the text on an arrow (the branch value)
def plotMidText(cntrPt,parentPt,txtString):
    lens=len(txtString)
    xMid=(parentPt[0]+cntrPt[0])/2.0-lens*0.002
    yMid=(parentPt[1]+cntrPt[1])/2.0
    createPlot.ax1.text(xMid,yMid,txtString)

# Draw the (sub)tree rooted at myTree recursively
def plotTree(myTree,parentPt,nodeTxt):
    numLeafs=getNumLeafs(myTree)
    firstStr=list(myTree.keys())[0]
    cntrPt=(plotTree.x0ff+(1.0+float(numLeafs))/2.0/plotTree.totalW,plotTree.y0ff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict=myTree[firstStr]
    # Move down one level before drawing the children
    plotTree.y0ff=plotTree.y0ff-1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.x0ff=plotTree.x0ff+1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.x0ff,plotTree.y0ff),cntrPt,leafNode)
            plotMidText((plotTree.x0ff,plotTree.y0ff),cntrPt,str(key))
    # Move back up after the children have been drawn
    plotTree.y0ff=plotTree.y0ff+1.0/plotTree.totalD

def createPlot(inTree):
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    axprops=dict(xticks=[],yticks=[])
    createPlot.ax1=plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW=float(getNumLeafs(inTree))
    plotTree.totalD=float(getTreeDepth(inTree))
    plotTree.x0ff=-0.5/plotTree.totalW
    plotTree.y0ff=1.0
    plotTree(inTree,(0.5,1.0),'')
    plt.show()

It is called like this:

createPlot(myTree)

The tree-building code and the plotting code above can be kept in separate files and imported, or placed together in a single .py file.
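The layout of the plot depends only on two numbers, the leaf count and the depth of the tree. As a quick check that does not need a plot window, the following standalone sketch computes both for the tree printed earlier. The function names count_leaves and tree_depth are chosen for this sketch; they follow the same conventions as getNumLeafs and getTreeDepth (a node whose children are all leaves has depth 1).

```python
def count_leaves(tree):
    """Number of leaves in a nested-dict tree; a non-dict value is a leaf."""
    if not isinstance(tree, dict):
        return 1
    branches = next(iter(tree.values()))   # the branch dict under the node label
    return sum(count_leaves(child) for child in branches.values())


def tree_depth(tree):
    """Number of decision levels on the longest root-to-leaf path."""
    if not isinstance(tree, dict):
        return 0
    branches = next(iter(tree.values()))
    return 1 + max(tree_depth(child) for child in branches.values())


# The tree printed earlier, with the Python 2 'L' suffixes dropped.
my_tree = {'texture': {'distinct': {'density<=0.3815': {0: 1, 1: 0}},
                       'little_blur': {'touch': {'hard_smooth': 0, 'soft_stick': 1}},
                       'blur': 0}}

print(count_leaves(my_tree), tree_depth(my_tree))  # 5 2
```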

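The resulting decision tree is shown in the figure below:

It matches the figure on page 85 of the Machine Learning textbook.

If you find any errors in the text or the code, please point them out; I would be very grateful.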

Update:

2016.4.3: the createTree function that builds the decision tree was updated (the code above is already the updated version).

The original code was:

# Main routine: build the decision tree recursively
def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    if classList.count(classList[0])==len(classList):
        return classList[0]
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat=chooseBestFeatureToSplit(dataSet,labels)
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    # Build one subtree for each value of bestFeat
    for value in uniqueVals:
        subLabels=labels[:]
        myTree[bestFeatLabel][value]=createTree(splitDataSet\
         (dataSet,bestFeat,value),subLabels)
    return myTree

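For example, color takes three values (dark_green, black, light_white) and texture takes three (distinct, little_blur, blur). Once the data have been split on other features, a subset may no longer contain every value of a remaining feature: among the samples with texture distinct and root little_curl_up (samples 6, 8 and 15), none has color light_white. The old code builds no light_white branch at such a node, so the resulting tree may be unable to classify new data (for example, a melon with texture: distinct; root: little_curl_up; color: light_white). The recursion therefore needs to carry the full training set along, so that a complete tree is produced. A branch for a value that is absent from the current subset is labeled with the majority class of the current subset (majority vote).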
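To see concretely how a value can go missing, the following sketch takes the color, root and texture columns of the 17 samples from the English table and compares the values present in a subset with those present in the full data. The helper values_missing is hypothetical and only serves this illustration.

```python
# (color, root, texture) for samples 1-17, copied from the English table.
samples = [
    ('dark_green', 'curl_up', 'distinct'),
    ('black', 'curl_up', 'distinct'),
    ('black', 'curl_up', 'distinct'),
    ('dark_green', 'curl_up', 'distinct'),
    ('light_white', 'curl_up', 'distinct'),
    ('dark_green', 'little_curl_up', 'distinct'),
    ('black', 'little_curl_up', 'little_blur'),
    ('black', 'little_curl_up', 'distinct'),
    ('black', 'little_curl_up', 'little_blur'),
    ('dark_green', 'stiff', 'distinct'),
    ('light_white', 'stiff', 'blur'),
    ('light_white', 'curl_up', 'blur'),
    ('dark_green', 'little_curl_up', 'little_blur'),
    ('light_white', 'little_curl_up', 'little_blur'),
    ('black', 'little_curl_up', 'distinct'),
    ('light_white', 'curl_up', 'blur'),
    ('dark_green', 'curl_up', 'little_blur'),
]
COLOR, ROOT, TEXTURE = 0, 1, 2


def values_missing(rows, subset, column):
    """Values of `column` that occur in `rows` but not in `subset`."""
    full = {row[column] for row in rows}
    seen = {row[column] for row in subset}
    return full - seen


# Samples that reach the node texture=distinct, root=little_curl_up.
subset = [r for r in samples
          if r[TEXTURE] == 'distinct' and r[ROOT] == 'little_curl_up']
print(values_missing(samples, subset, COLOR))  # {'light_white'}
```

The updated createTree handles exactly this set difference: every value it finds there becomes a leaf labeled with the majority class at that node.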

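For example, with Table 4.2 from the book (the table above with the density and sugar_ratio columns removed), the old code produces the following figure: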