Machine Learning in Action — Building a Decision Tree, with an Example

Constructing a Decision Tree with the ID3 Algorithm


I won't repeat the theory of decision trees here; this post is just my own study notes on Machine Learning in Action.

Have a look at the decision tree below first; it should help with understanding decision trees.


3.1.1 Information Gain

First, two formulas need to be understood:
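They correspond directly to what calcShannonEnt() below computes: the information of a symbol x_i, and the Shannon entropy of the whole data set (p(x_i) is the fraction of records belonging to class x_i, and the sum runs over all n classes):

l(x_i) = -log2 p(x_i)

H = -sum_{i=1..n} p(x_i) * log2 p(x_i)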


Create a file named trees.py and add the following code to it:

from math import log

def calcShannonEnt(dataSet):  # compute the Shannon entropy of a given data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:               # count how often each class label appears
        currentLabel = featVec[-1]        # the class label is the last column
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # accumulate -p * log2(p)
    return shannonEnt
Create the input data set:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no'],
               ]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
At the Python prompt, enter the following commands:
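Something like this, assuming trees.py is importable from the working directory:

>>> import trees
>>> myDat, labels = trees.createDataSet()
>>> trees.calcShannonEnt(myDat)
0.9709505944546686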


The value we get, 0.9709..., is the entropy; the higher the entropy, the more mixed up the data are.

3.1.2 Splitting the Data Set

def splitDataSet(dataSet, axis, value):  # split the data set on the given feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]           # everything before the chosen feature...
            reduceFeatVec.extend(featVec[axis+1:])   # ...plus everything after it
            # extend() takes a list and appends each of its elements to the existing list
            retDataSet.append(reduceFeatVec)
    return retDataSet
Now let's test this function.

Enter at the Python prompt:
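For example (just a sketch, reusing the myDat created above):

>>> myDat, labels = trees.createDataSet()
>>> trees.splitDataSet(myDat, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> trees.splitDataSet(myDat, 0, 0)
[[1, 'no'], [1, 'no']]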


Next we iterate over the whole data set, repeatedly computing the Shannon entropy and calling splitDataSet(), to find the best feature to split on.

Again, add the following code to trees.py:

# choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1          # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)             # the unique values this feature takes
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # information gain = reduction in entropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Test the code:
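Roughly like this, continuing the session above:

>>> trees.chooseBestFeatureToSplit(myDat)
0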


The best feature to split on is feature 0 (the first column).

3.1.3 Recursively Building the Tree

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1                  # count each class label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]              # the most frequent class label
This function takes a list of class names, builds a dictionary whose keys are the unique values in classList and whose values store how often each class label occurs, then uses operator to sort the dictionary by value and returns the class name that occurs most often.
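For instance (a hypothetical call):

>>> trees.majorityCnt(['yes', 'no', 'yes', 'no', 'no'])
'no'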

def createTree(dataSet, labels):  # build the tree
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                    # stop when all classes are identical
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)          # no features left: return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]                  # copy labels so recursion does not mangle them
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Test it:
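Something like this (note that createTree() deletes entries from the labels list it is given, so I pass in a copy):

>>> myDat, labels = trees.createDataSet()
>>> myTree = trees.createTree(myDat, labels[:])
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}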


The decision tree is now built, but as a nested dictionary it is a bit hard to read, so we use Matplotlib annotations to draw the tree as a picture.

That was covered in my previous blog post.

3.3 Testing and Storing the Classifier

3.3.1 Testing the algorithm: classifying with the decision tree

def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]       # the feature this node splits on
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)     # translate the label into an index into testVec
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)  # descend into the subtree
            else:
                classLabel = secondDict[key]   # reached a leaf: this is the class
    return classLabel
Test the code:
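Roughly like this (again passing a copy of labels to createTree(), so that labels is still intact when classify() needs it):

>>> myDat, labels = trees.createDataSet()
>>> myTree = trees.createTree(myDat, labels[:])
>>> trees.classify(myTree, labels, [1, 0])
'no'
>>> trees.classify(myTree, labels, [1, 1])
'yes'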



3.3.2 Using the algorithm: storing the decision tree

def storeTree(inputTree, filename):  # store the decision tree on disk with the pickle module
    import pickle
    fw = open(filename, 'wb')        # pickle needs a binary file
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):              # read the tree back from disk
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)
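For example (the file name here is just an arbitrary choice):

>>> trees.storeTree(myTree, 'classifierStorage.txt')
>>> trees.grabTree('classifierStorage.txt')
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}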


3.4 Example: Using a Decision Tree to Predict Contact Lens Type

The data are as follows:

young myope no reduced no lenses
young myope no normal soft
young myope yes reduced no lenses
young myope yes normal hard
young hyper no reduced no lenses
young hyper no normal soft
young hyper yes reduced no lenses
young hyper yes normal hard
pre myope no reduced no lenses
pre myope no normal soft
pre myope yes reduced no lenses
pre myope yes normal hard
pre hyper no reduced no lenses
pre hyper no normal soft
pre hyper yes reduced no lenses
pre hyper yes normal no lenses
presbyopic myope no reduced no lenses
presbyopic myope no normal no lenses
presbyopic myope yes reduced no lenses
presbyopic myope yes normal hard
presbyopic hyper no reduced no lenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced no lenses
presbyopic hyper yes normal no lenses

Enter the following commands at the Python prompt:
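A sketch, assuming the rows above are saved tab-separated in a file named lenses.txt, with age/prescript/astigmatic/tearRate as the feature names:

>>> fr = open('lenses.txt')
>>> lenses = [inst.strip().split('\t') for inst in fr.readlines()]
>>> lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
>>> lensesTree = trees.createTree(lenses, lensesLabels)
>>> lensesTree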



The result is as follows:

