Decision Trees in Practice: Predicting Contact Lens Type

1. Notes

I did this in Jupyter and then exported it to Markdown. The command to export an .ipynb file to Markdown is:

jupyter nbconvert --to markdown xxx.ipynb

2. Problem

2.1 Overview

Use a decision tree to predict the type of contact lenses a patient needs. Split the lenses.data dataset into a training set and a test set. Using the ID3 algorithm, with information gain as the attribute-selection measure, build the decision tree top-down in divide-and-conquer fashion from the class-labeled training tuples. Then classify the test set with the tree and compute the accuracy and misclassification rate to evaluate the classifier's performance.
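For reference, ID3 measures the entropy of a partition $D$ and the information gain of an attribute $A$ as

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

where $p_i$ is the proportion of tuples in $D$ belonging to class $i$ and $D_1, \dots, D_v$ are the partitions induced by the values of $A$; at each node ID3 splits on the attribute with the largest gain.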

2.2 Workflow

  • Collect the data:
    the lenses.txt file, 24 rows of data

young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses
  • Prepare the data: parse the tab-delimited lines.
  • Analyze the data: quickly inspect the parsed rows to make sure they were read correctly, and use the createPlot() function to draw the final tree diagram.
  • Train the algorithm: use the createTree() function.
  • Test the algorithm: write a test function to verify that the decision tree classifies the given instances correctly.
  • Use the algorithm: store the tree's data structure so it does not have to be rebuilt the next time (see the pickle sketch below).
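For the storage step, here is a minimal sketch using Python's pickle module (also mentioned in the summary below); the storeTree/grabTree names and the filename are illustrative, not part of the assignment:

import pickle

# Serialize the tree (a nested dict) to disk so it need not be rebuilt
def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

# Load a previously stored tree
def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

# Usage (hypothetical filename):
# storeTree(tree, 'lensesTree.pkl')
# tree = grabTree('lensesTree.pkl')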

3. Practice

(The original post showed screenshots of the notebook run here; the complete code and output follow in Section 4.)

4. Source Code

4.1 work01.ipynb

import numpy as np
from math import log
import operator as op

with open("lenses.txt") as data:
    dataSet = [inst.strip().split("\t") for inst in data.readlines()]

dataSet
[['young', 'myope', 'no', 'reduced', 'no lenses'],
 ['young', 'myope', 'no', 'normal', 'soft'],
 ['young', 'myope', 'yes', 'reduced', 'no lenses'],
 ['young', 'myope', 'yes', 'normal', 'hard'],
 ['young', 'hyper', 'no', 'reduced', 'no lenses'],
 ['young', 'hyper', 'no', 'normal', 'soft'],
 ['young', 'hyper', 'yes', 'reduced', 'no lenses'],
 ['young', 'hyper', 'yes', 'normal', 'hard'],
 ['pre', 'myope', 'no', 'reduced', 'no lenses'],
 ['pre', 'myope', 'no', 'normal', 'soft'],
 ['pre', 'myope', 'yes', 'reduced', 'no lenses'],
 ['pre', 'myope', 'yes', 'normal', 'hard'],
 ['pre', 'hyper', 'no', 'reduced', 'no lenses'],
 ['pre', 'hyper', 'no', 'normal', 'soft'],
 ['pre', 'hyper', 'yes', 'reduced', 'no lenses'],
 ['pre', 'hyper', 'yes', 'normal', 'no lenses'],
 ['presbyopic', 'myope', 'no', 'reduced', 'no lenses'],
 ['presbyopic', 'myope', 'no', 'normal', 'no lenses'],
 ['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'],
 ['presbyopic', 'myope', 'yes', 'normal', 'hard'],
 ['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'],
 ['presbyopic', 'hyper', 'no', 'normal', 'soft'],
 ['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'],
 ['presbyopic', 'hyper', 'yes', 'normal', 'no lenses']]
np.shape(dataSet)
(24, 5)
labels = ["age", "prescript", "astigmatic", "tearRate"]
# Compute the Shannon entropy of a dataset; higher entropy means more disorder
def calcShannonEnt(dataSet):
    '''
    :param dataSet: the dataset
    :return: Shannon entropy
    '''
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    rowNum = len(dataSet)
    for key in labelCounts:
        prob = float(labelCounts[key]) / rowNum
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
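As a quick check, the full 24-row dataset has class counts of 15 'no lenses', 5 'soft' and 4 'hard', so its entropy works out to roughly 1.3261:

calcShannonEnt(dataSet)   # ≈ 1.3261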

# Return the subset of the dataset matching the given value on the given feature,
# with that feature column removed
def splitDataSet(dataSet, axis, value):
    '''
    :param dataSet: the dataset
    :param axis: index of the feature to split on
    :param value: the feature value to match
    :return: the matching subset, with the feature column removed
    '''
    retDataSet = []
    for featVec in dataSet:
        if (featVec[axis] == value):
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
# Example: split on feature 0 (age) with value 'young'
splitDataSet(dataSet, 0, 'young')
[['myope', 'no', 'reduced', 'no lenses'],
 ['myope', 'no', 'normal', 'soft'],
 ['myope', 'yes', 'reduced', 'no lenses'],
 ['myope', 'yes', 'normal', 'hard'],
 ['hyper', 'no', 'reduced', 'no lenses'],
 ['hyper', 'no', 'normal', 'soft'],
 ['hyper', 'yes', 'reduced', 'no lenses'],
 ['hyper', 'yes', 'normal', 'hard']]
# Choose the best feature (highest information gain) to split the dataset on
def chooseBestFeatureToSplit(dataSet):
    '''
    :param dataSet: the dataset
    :return: bestFeature, the index of the best feature
    '''
    # numFeatures = len(dataSet[0]) - 1  # number of features (minus one: the last column is the class label)
    numFeatures = np.shape(dataSet)[1] - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

chooseBestFeatureToSplit(dataSet)
3
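The best feature for the first split is index 3, i.e. tearRate, which becomes the root of the tree.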
# Return the most frequent class
def majorityCnt(classList):
    '''
    :param classList: list of class labels
    :return: the name of the most frequent class
    '''
    classCount = {}   # counts how often each class label occurs in classList
    for vote in classList: 
        if (vote not in classCount.keys()):
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=op.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

classList01 = [example[-1] for example in dataSet]
classList01
['no lenses',
 'soft',
 'no lenses',
 'hard',
 'no lenses',
 'soft',
 'no lenses',
 'hard',
 'no lenses',
 'soft',
 'no lenses',
 'hard',
 'no lenses',
 'soft',
 'no lenses',
 'no lenses',
 'no lenses',
 'no lenses',
 'no lenses',
 'hard',
 'no lenses',
 'soft',
 'no lenses',
 'no lenses']
majorityCnt(classList01)
'no lenses'
# Build the decision tree
def createTree(dataSet, labels):
    '''
    :param dataSet: the dataset, as a list of lists
    :param labels: the feature labels, as a list
    :return: myTree, a tree-like structure stored as nested dicts
    '''
    # Copy labels so later operations do not modify the caller's list
    labels_cp = labels.copy()

    # Collect the class labels from the last column of the dataset
    classList = [example[-1] for example in dataSet]

    # Stopping condition 1: all tuples in the partition belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]

    # Stopping condition 2: no attributes remain to split on (only the class
    # label column is left), so fall back to majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)

    # Choose the best feature to split on
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels_cp[bestFeat]

    myTree = {bestFeatLabel: {}}

    del labels_cp[bestFeat]  # remove the splitting attribute

    # Collect the unique values of the best feature
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)

    # Recurse: grow one branch per value of the chosen feature
    for value in uniqueVals:
        subLabels = labels_cp[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

# Build the decision tree from the full dataset
tree = createTree(dataSet, labels)
tree
{'tearRate': {'reduced': 'no lenses',
  'normal': {'astigmatic': {'no': {'age': {'presbyopic': {'prescript': {'hyper': 'soft',
        'myope': 'no lenses'}},
      'young': 'soft',
      'pre': 'soft'}},
    'yes': {'prescript': {'hyper': {'age': {'presbyopic': 'no lenses',
        'young': 'hard',
        'pre': 'no lenses'}},
      'myope': 'hard'}}}}}}
labels
['age', 'prescript', 'astigmatic', 'tearRate']
# Classify a sample with the decision tree
def classify(inputTree, featLabels, testVec):
    '''
    :param inputTree: dict, the decision tree
    :param featLabels: list of feature labels
    :param testVec: list of the test sample's feature values
    :return: classLabel, the predicted class of the sample
    '''
    firstStr = list(inputTree.keys())[0]
    # Without the list() conversion this raises
    # "TypeError: 'dict_keys' object does not support indexing" in Python 3:
    # dict.keys() now returns a dict_keys view, which is iterable but not
    # indexable, so it must be converted to a list explicitly.
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = None  # stays None if the feature value is not in the tree
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

classify(tree, labels, ['young','myope','yes','normal'])
'hard'
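A quick standalone illustration of that Python 3 behaviour:

d = {'tearRate': {'reduced': 'no lenses'}}
list(d.keys())[0]    # 'tearRate' — works in Python 2 and 3
# d.keys()[0]        # TypeError in Python 3: dict_keys does not support indexing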

Classifier evaluation

Accuracy:

Accuracy = (TP + TN) / ALL = 0.8

Error rate:

Error_rate = 1 - Accuracy = 0.2

Treating the "no lenses" class as the class of interest:

Precision:

Precision = correctly classified positive tuples / predicted positive tuples = 0.75

Recall:

Recall = TP / P = correctly classified positive tuples / actual positive tuples = 1
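A sketch of how these figures can be computed from (actual, predicted) pairs; the evaluate helper is illustrative, not part of the original notebook. With 5 test tuples, the numbers above are consistent with TP=3, FP=1, FN=0, TN=1:

# Illustrative helper: accuracy, precision and recall for one positive class
def evaluate(pairs, positive='no lenses'):
    tp = sum(1 for actual, pred in pairs if actual == positive and pred == positive)
    fp = sum(1 for actual, pred in pairs if actual != positive and pred == positive)
    fn = sum(1 for actual, pred in pairs if actual == positive and pred != positive)
    correct = sum(1 for actual, pred in pairs if actual == pred)
    accuracy = correct / float(len(pairs))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# (actual, predicted) pairs consistent with the figures above: TP=3, FP=1, TN=1
pairs = [('no lenses', 'no lenses')] * 3 + [('soft', 'no lenses'), ('hard', 'hard')]
evaluate(pairs)   # (0.8, 0.75, 1.0)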

# Plot the decision tree with matplotlib annotations
import matplotlib.pyplot as plt
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")
# Count the leaf nodes of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    for i in myTree.keys():
        firstStr = i
        break
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

# Get the depth of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    for i in myTree.keys():
        firstStr = i
        break
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction', xytext=centerPt,
                            textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    for i in myTree.keys():
        firstStr = i
        break
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks shown, for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

createPlot(tree)

(figure: the decision tree rendered by createPlot)

# Evaluate the model

# The dataset is small, so the results are prone to bias

test_set = dataSet[18:24]   # last 6 rows as the test set
train_set = dataSet[:18]    # first 18 rows as the training set

print('Training set:\n', train_set)


Training set:
 [['young', 'myope', 'no', 'reduced', 'no lenses'], ['young', 'myope', 'no', 'normal', 'soft'], ['young', 'myope', 'yes', 'reduced', 'no lenses'], ['young', 'myope', 'yes', 'normal', 'hard'], ['young', 'hyper', 'no', 'reduced', 'no lenses'], ['young', 'hyper', 'no', 'normal', 'soft'], ['young', 'hyper', 'yes', 'reduced', 'no lenses'], ['young', 'hyper', 'yes', 'normal', 'hard'], ['pre', 'myope', 'no', 'reduced', 'no lenses'], ['pre', 'myope', 'no', 'normal', 'soft'], ['pre', 'myope', 'yes', 'reduced', 'no lenses'], ['pre', 'myope', 'yes', 'normal', 'hard'], ['pre', 'hyper', 'no', 'reduced', 'no lenses'], ['pre', 'hyper', 'no', 'normal', 'soft'], ['pre', 'hyper', 'yes', 'reduced', 'no lenses'], ['pre', 'hyper', 'yes', 'normal', 'no lenses'], ['presbyopic', 'myope', 'no', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'no', 'normal', 'no lenses']]
# Note: as in the original run, the tree here is grown from the test subset
# itself, so the accuracy measured below is optimistic
lenses_tree = createTree(test_set, labels)
print('Decision tree:\n', lenses_tree)
Decision tree:
 {'tearRate': {'reduced': 'no lenses', 'normal': {'prescript': {'hyper': {'astigmatic': {'no': 'soft', 'yes': 'no lenses'}}, 'myope': 'hard'}}}}
# Test the classification
print('Test set:\n', test_set)
Test set:
 [['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'myope', 'yes', 'normal', 'hard'], ['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'no', 'normal', 'soft'], ['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'], ['presbyopic', 'hyper', 'yes', 'normal', 'no lenses']]
# Test the classifier
def test_tree(D_set):
    trueclass = 0
    for row in range(5):   # only the first 5 test tuples are checked
        if classify(lenses_tree, labels,
                    [D_set[row][0], D_set[row][1], D_set[row][2], D_set[row][3]]) == D_set[row][4]:
            trueclass += 1
        print(D_set[row], classify(lenses_tree, labels,
                                   [D_set[row][0], D_set[row][1], D_set[row][2], D_set[row][3]]))
        # (the original read lenses[row][3] here, an undefined name)
    correct_rate = trueclass / 5.0  # fraction classified correctly
    return correct_rate

score = test_tree(test_set)
print('Accuracy:\n', score)
print('Error rate:\n', 1 - score)
['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'] no lenses
['presbyopic', 'myope', 'yes', 'normal', 'hard'] hard
['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'] no lenses
['presbyopic', 'hyper', 'no', 'normal', 'soft'] soft
['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'] no lenses
Accuracy:
 1.0
Error rate:
 0.0

4.2 work01.py

import numpy as np
import operator as op
from math import log


# Compute the Shannon entropy of a dataset; higher entropy means more disorder
def calcShannonEnt(dataSet):
    '''
    :param dataSet: the dataset
    :return: Shannon entropy
    '''
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if (currentLabel not in labelCounts.keys()):
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    rowNum = len(dataSet)
    for key in labelCounts:
        prob = float(labelCounts[key]) / rowNum
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


# Return the subset of the dataset matching the given value on the given feature,
# with that feature column removed
def splitDataSet(dataSet, axis, value):
    '''
    :param dataSet: the dataset
    :param axis: index of the feature to split on
    :param value: the feature value to match
    :return: the matching subset, with the feature column removed
    '''
    retDataSet = []
    for featVec in dataSet:
        if (featVec[axis] == value):
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


# Choose the best feature (highest information gain) to split the dataset on
def chooseBestFeatureToSplit(dataSet):
    '''
    :param dataSet: the dataset
    :return: bestFeature, the index of the best feature
    '''
    # numFeatures = len(dataSet[0]) - 1  # number of features (minus one: the last column is the class label)
    numFeatures = np.shape(dataSet)[1] - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


# Return the most frequent class
def majorityCnt(classList):
    '''
    :param classList: list of class labels
    :return: the name of the most frequent class
    '''
    classCount = {}
    for vote in classList:
        if (vote not in classCount.keys()):
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=op.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Build the decision tree
def createTree(dataSet, labels):
    '''
    :param dataSet: the dataset, as a list of lists
    :param labels: the feature labels, as a list
    :return: myTree, a tree-like structure stored as nested dicts
    '''
    # Copy labels so the caller's list is not mutated
    # (the original deleted entries from labels itself, which broke the
    # later classify(tree, labels, ...) call in __main__)
    labels_cp = labels.copy()
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels_cp[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels_cp[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels_cp[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree


# Classify a sample with the decision tree
def classify(inputTree, featLabels, testVec):
    '''
    :param inputTree: dict, the decision tree
    :param featLabels: list of feature labels
    :param testVec: list of the test sample's feature values
    :return: classLabel, the predicted class of the sample
    '''
    for i in inputTree.keys():
        firstStr = i
        break
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel


if __name__ == '__main__':

    with open("lenses.txt") as data:
        dataSet = [inst.strip().split("\t") for inst in data.readlines()]
    print(dataSet)
    print(np.shape(dataSet))
    labels = ["age", "prescript", "astigmatic", "tearRate"]
    tree = createTree(dataSet, labels)
    print(tree)
    print(classify(tree, labels, ['young', 'myope', 'yes', 'normal']))

    import matplotlib.pyplot as plt
    decisionNode = dict(boxstyle="sawtooth", fc="0.8")
    leafNode = dict(boxstyle="round4", fc="0.8")
    arrow_args = dict(arrowstyle="<-")

    # Count the leaf nodes of the tree
    def getNumLeafs(myTree):
        numLeafs = 0
        for i in myTree.keys():
            firstStr = i
            break
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                numLeafs += getNumLeafs(secondDict[key])
            else:
                numLeafs += 1
        return numLeafs

    # Get the depth of the tree
    def getTreeDepth(myTree):
        maxDepth = 0
        for i in myTree.keys():
            firstStr = i
            break
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                thisDepth = 1 + getTreeDepth(secondDict[key])
            else:
                thisDepth = 1
            if thisDepth > maxDepth: maxDepth = thisDepth
        return maxDepth


    def plotNode(nodeTxt, centerPt, parentPt, nodeType):
        createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction', xytext=centerPt,
                                textcoords='axes fraction',
                                va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)


    def plotMidText(cntrPt, parentPt, txtString):
        xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
        yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
        createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)


    def plotTree(myTree, parentPt, nodeTxt):
        numLeafs = getNumLeafs(myTree)
        depth = getTreeDepth(myTree)
        for i in myTree.keys():
            firstStr = i
            break
        cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
        plotMidText(cntrPt, parentPt, nodeTxt)
        plotNode(firstStr, cntrPt, parentPt, decisionNode)
        secondDict = myTree[firstStr]
        plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                plotTree(secondDict[key], cntrPt, str(key))
            else:
                plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
                plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
                plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
        plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD


    def createPlot(inTree):
        fig = plt.figure(1, facecolor='white')
        fig.clf()
        axprops = dict(xticks=[], yticks=[])
        createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
        # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks shown, for demo purposes
        plotTree.totalW = float(getNumLeafs(inTree))
        plotTree.totalD = float(getTreeDepth(inTree))
        plotTree.xOff = -0.5 / plotTree.totalW
        plotTree.yOff = 1.0
        plotTree(inTree, (0.5, 1.0), '')
        plt.show()


    createPlot(tree)

5. Summary

A decision tree classifier is like a flowchart with terminal blocks, where each terminal block represents a classification result.

When we start processing a dataset, we first measure its inconsistency, i.e. its entropy, then look for the optimal way to split the dataset, and repeat until all the data in each partition belongs to the same class. The ID3 algorithm can be used to partition nominal-valued datasets. When building a decision tree, we usually convert the dataset into the tree recursively, and we generally do not construct a new data structure for it; instead, we store the tree's node information in Python's built-in dictionary type.

Using Matplotlib's annotation feature, we can turn the stored tree structure into an easily understood diagram.

Python's pickle module can be used to persist the structure of a decision tree. The contact-lens example shows that a decision tree may create too many partitions of the dataset and therefore overfit it. We can eliminate overfitting by pruning the tree, merging adjacent leaf nodes that yield little information gain.
