算法簡介

決策數(Decision Tree)在機器學習中也是比較常見的一種算法，屬於監督學習中的一種。其中ID3算法是以信息熵和信息增益作爲衡量標準的分類算法。

信息熵(Entropy)

熵的概念主要是指信息的混亂程度，變量的不確定性越大，熵的值也就越大，熵的公式可以表示爲：

信息增益(Information gain) - ID3算法

信息增益指的是根據特徵劃分數據前後熵的變化，可以用下面的公式表示：

根據不同特徵分類後熵的變化不同，信息增益也不同，信息增益越大，區分樣本的能力越強，越具有代表性。這是一種自頂向下的貪心策略，即在ID3中根據“最大信息增益”原則選擇特徵。

ID3採用信息增益來選擇特徵，存在一個缺點，它一般會優先選擇有較多屬性值的特徵，因爲屬性值多的特徵會有相對較大的信息增益。(這是因爲：信息增益反映的給定一個條件以後不確定性減少的程度,必然是分得越細的數據集確定性更高,也就是條件熵越小,信息增益越大)。

信息增益率(gain ratio) - C4.5算法

爲了避免ID3的不足，C4.5中是用信息增益率(gain ratio)來作爲選擇分支的準則。對於有較多屬性值的特徵，信息增益率的分母Split information（S,A），我們稱之爲分裂信息，會稀釋掉它對特徵選擇的影響。分裂信息（公式1）和信息增益率（公式2）的計算如下所示。

源數據

收入	身高	長相	體型	是否見面
一般	高	醜	胖	否
高	一般	帥	瘦	是
一般	一般	一般	一般	否
高	高	醜	一般	是
一般	高	帥	胖	是

這是一位單身女性根據對方的一些基本條件，判斷是否去約會的數據，此處展示前五行。我們要通過這位女士歷史的數據建立決策樹模型，使得儘量給這位女性推送她比較願意約會的異性信息。

代碼實現 - ID3算法

from math import log
import operator
import numpy as np
import pandas as pd
from pandas import DataFrame,Series

# 計算數據的熵(entropy)-原始熵
def dataentropy(data, feat):  
    lendata=len(data)  # 數據條數
    labelCounts={} # 數據中不同類別的條數
    for featVec in data:
        category=featVec[-1] # 每行數據的最後一個字（葉子節點）
        if category not in labelCounts.keys():
            labelCounts[category]=0 
        labelCounts[category]+=1  # 統計有多少個類以及每個類的數量
    entropy=0
    for key in labelCounts:
        prob=float(labelCounts[key])/lendata # 計算單個類的熵值
        entropy-=prob*log(prob,2) # 累加每個類的熵值
    return entropy


# 處理後導入數據數據
def Importdata(datafile): 
    dataa = pd.read_excel(datafile)#datafile是excel文件，所以用read_excel,如果是csv文件則用read_csv
    #將文本中不可直接使用的文本變量替換成數字
    productDict={'高':1,'一般':2,'低':3, '帥':1,  '醜':3,  '胖':3,  '瘦':1,  '是':1,  '否':0}
    dataa['income'] = dataa['收入'].map(productDict)#將每一列中的數據按照字典規定的轉化成數字
    dataa['hight'] = dataa['身高'].map(productDict)
    dataa['look'] = dataa['長相'].map(productDict)
    dataa['shape'] = dataa['體型'].map(productDict)
    dataa['is_meet'] = dataa['是否見面'].map(productDict)
	
    data = dataa.iloc[:,5:].values.tolist()#取量化後的幾列，去掉文本列
    b = dataa.iloc[0:0,5:-1]
    labels = b.columns.values.tolist()#將標題中的值存入列表中
    return data,labels

# 按某個特徵value分類後的數據
def splitData(data,i,value): 
    splitData=[]
    for featVec in data:
        if featVec[i]==value:
            rfv =featVec[:i]
            rfv.extend(featVec[i+1:])
            splitData.append(rfv)
    return splitData
	
	
# 選擇最優的分類特徵
def BestSplit(data):  
    numFea = len(data[0])-1#計算一共有多少個特徵，因爲最後一列一般是分類結果，所以需要-1
    baseEnt = dataentropy(data,-1)   # 定義初始的熵,用於對比分類後信息增益的變化
    bestInfo = 0
    bestFeat = -1
    for i in range(numFea):
        featList = [rowdata[i] for rowdata in data]
        uniqueVals = set(featList)
        newEnt = 0
        for value in uniqueVals:
            subData = splitData(data,i,value)#獲取按照特徵value分類後的數據
            prob =len(subData)/float(len(data))
            newEnt +=prob*dataentropy(subData,i)  # 按特徵分類後計算得到的熵
        info = baseEnt - newEnt  # 原始熵與按特徵分類後的熵的差值，即信息增益
        if (info>bestInfo):   # 若按某特徵劃分後，若infoGain大於bestInf，則infoGain對應的特徵分類區分樣本的能力更強，更具有代表性。 
            bestInfo=info #將infoGain賦值給bestInf，如果出現比infoGain更大的信息增益，說明還有更好地特徵分類
            bestFeat = i #將最大的信息增益對應的特徵下標賦給bestFea，返回最佳分類特徵
    return bestFeat 

#按分類後類別數量排序，取數量較大的
def majorityCnt(classList):    
    c_count={}
    for i in classList:
        if i not in c_count.keys():
            c_count[i]=0
        c_count[i]+=1
    ClassCount = sorted(c_count.items(),key=operator.itemgetter(1),reverse=True)#按照統計量降序排序
    return ClassCount[0][0]#reverse=True表示降序，因此取[0][0]，即最大值

#建樹
def createTree(data,labels):
    classList = [rowdata[-1] for rowdata in data]  # 取每一行的最後一列，分類結果（1/0）
    if classList.count(classList[0])==len(classList):
        return classList[0]
    if len(data[0])==1:
        return majorityCnt(classList)
    bestFeat = BestSplit(data) #根據信息增益選擇最優特徵
    bestLab = labels[bestFeat]
    myTree = {bestLab:{}} #分類結果以字典形式保存
    del(labels[bestFeat])
    featValues = [rowdata[bestFeat] for rowdata in data]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestLab][value] = createTree(splitData(data,bestFeat,value),subLabels)
    return myTree


if __name__=='__main__':
    datafile = u'E:\\pythondata\\tree.xlsx'#文件所在位置，u爲防止路徑中有中文名稱
    data, labels=Importdata(datafile)  # 導入數據
    print(createTree(data, labels))  # 輸出決策樹模型結果

運行結果：

{'hight': {1: {'look': {1: {'income': {1: {'shape': {1: 1, 2: 1}}, 2: 1, 3: {'shape': {1: 1, 2: 0}}}}, 2: 1, 3: {'income': {1: 1, 2: 0}}}}, 2: {'income': {1: 1, 2: {'look': {1: 1, 2: 0}}, 3: 0}}, 3: {'look': {1: {'shape': {3: 0, 1: 1}}, 2: 0, 3: 0}}}}

對應的決策樹：

代碼實現 - C4.5算法

C4.5算法和ID3算法邏輯很相似，只是ID3算法是用信息增益來選擇特徵，而C4.5算法是用的信息增益率，因此對代碼的影響也只有BestSplit(data)函數的定義部分，只需要加一個信息增益率的計算即可，BestSplit(data)函數定義代碼更改後如下：

# 選擇最優的分類特徵
def BestSplit(data):  
    numFea = len(data[0])-1#計算一共有多少個特徵，因爲最後一列一般是分類結果，所以需要-1
    baseEnt = dataentropy(data,-1)   # 定義初始的熵,用於對比分類後信息增益的變化
    bestGainRate = 0
    bestFeat = -1
    for i in range(numFea):
        featList = [rowdata[i] for rowdata in data]
        uniqueVals = set(featList)
        newEnt = 0
        for value in uniqueVals:
            subData = splitData(data,i,value)#獲取按照特徵value分類後的數據
            prob =len(subData)/float(len(data))
            newEnt +=prob*dataentropy(subData,i)  # 按特徵分類後計算得到的熵
        info = baseEnt - newEnt  # 原始熵與按特徵分類後的熵的差值，即信息增益
        splitonfo = dataentropy(subData,i) #分裂信息
        if splitonfo == 0:#若特徵值相同（eg:長相這一特徵的值都是帥），即splitonfo和info均爲0，則跳過該特徵
            continue
        GainRate = info/splitonfo #計算信息增益率
        if (GainRate>bestGainRate):   # 若按某特徵劃分後，若infoGain大於bestInf，則infoGain對應的特徵分類區分樣本的能力更強，更具有代表性。 
            bestGainRate=GainRate #將infoGain賦值給bestInf，如果出現比infoGain更大的信息增益，說明還有更好地特徵分類
            bestFeat = i #將最大的信息增益對應的特徵下標賦給bestFea，返回最佳分類特徵
    return bestFeat

運行結果：

{'hight': {1: {'look': {1: {'income': {1: {'shape': {0: 0, 1: 1}}, 2: 1, 3: {'shape': {0: 0, 1: 1}}}}, 2: 1, 3: {'shape': {0: 0, 1: 1}}}}, 2: {'shape': {0: 0, 1: 1}}, 3: {'shape': {1: 0, 3: {'look': {0: 0, 1: 1}}}}}}

畫決策樹代碼-treePlotter

決策樹可以代碼實現的，不需要按照運行結果一點一點手動畫圖。

import treePlotter
treePlotter.createPlot(myTree)

其中treePlotter模塊是如下一段代碼，可以保存爲.py文件，放在Python/Lib/site-package目錄下，然後用的時候import 【文件名】就可以了。

treePlotter模塊代碼：

#繪製決策樹
import matplotlib.pyplot as plt

# 定義文本框和箭頭格式，boxstyle用於指定邊框類型，color表示填充色
decisionNode = dict(boxstyle="round4", color='#ccccff')  #定義判斷結點爲圓角長方形，填充淺藍色
leafNode = dict(boxstyle="circle", color='#66ff99')  #定義葉結點爲圓形，填充綠色
arrow_args = dict(arrowstyle="<-", color='ffcc00')  #定義箭頭及顏色


#繪製帶箭頭的註釋
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)


#計算葉結點數
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


#計算樹的層數
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


#在父子結點間填充文本信息
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)


def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = myTree.keys()[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)  #在父子結點間填充文本信息
    plotNode(firstStr, cntrPt, parentPt, decisionNode)  #繪製帶箭頭的註釋
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD


def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW;
    plotTree.yOff = 1.0;
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

Python機器學習 - 決策樹 - （ID3算法、C4.5算法） - 代碼實現

算法簡介

信息熵(Entropy)

信息增益(Information gain) - ID3算法

信息增益率(gain ratio) - C4.5算法

源數據

代碼實現 - ID3算法

代碼實現 - C4.5算法

畫決策樹代碼-treePlotter

杭州的 IT 崩盤了麼？

開源高性能結構化日誌模塊NanoLog

Python 潮流週刊#55：分享 9 個高質量的技術類信息源！

Azure Virtual Network (22) 多訂閱使用Azure DNS解析問題 Windows Azure Platform 系列文章目錄

【簡寫Mybatis-02】註冊機的實現以及SqlSession處理

手繪二維碼

.NET藉助虛擬網卡實現一個簡單異地組網工具

數據建模 - 因子分析法

Python - 下載附件 - 解壓附件

機器學習與建模中 - 怎麼克服過擬合問題？

機器學習與建模中 - 怎麼判斷模型的好壞？

機器學習 - 聚類、分類、迴歸的區別

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結