Machine Learning in Action (《机器学习实战》) — Building a Decision Tree, with a Worked Example

Constructing a Decision Tree with the ID3 Algorithm


The theory behind decision trees is not repeated here; this post is mainly my own study notes on Machine Learning in Action (《机器学习实战》).

First take a look at the decision tree below; it should help in understanding how a decision tree works.


3.1.1 Information Gain

First, two formulas need to be understood:
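Written out, these are the information of a single class and the Shannon entropy of the whole data set, which is exactly what the calcShannonEnt() function below computes:

l(x_i) = -log2(p(x_i))

H = -Σ p(x_i) · log2(p(x_i)), summed over the n classes

where p(x_i) is the proportion of samples belonging to class x_i.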


Create a file named trees.py and add the following code to it:

from math import log

def calcShannonEnt(dataSet):  # compute the Shannon entropy of the given data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:                  # count how often each class label appears
        currentLabel = featVec[-1]           # the class label is the last column
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)    # H = -sum(p * log2(p))
    return shannonEnt
Next, create the input data set:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']    # feature names for the two columns
    return dataSet, labels
At the Python prompt, enter the following commands:
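For example (the book's code is Python 2; the entropy value below corresponds to the five-sample data set defined above):

>>> import trees
>>> myDat, labels = trees.createDataSet()
>>> trees.calcShannonEnt(myDat)
0.9709505944546686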


The value of roughly 0.97 is the entropy; the higher the entropy, the more mixed the data.

3.1.2 Splitting the Data Set

def splitDataSet(dataSet, axis, value):  # split the data set on the given feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]          # everything before the chosen feature
            reduceFeatVec.extend(featVec[axis+1:])  # everything after it
            # extend() takes a list and appends each of its elements to the existing list
            retDataSet.append(reduceFeatVec)
    return retDataSet
Now let's test this function.

At the Python prompt, enter:
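Something like the following (reload() assumes trees has already been imported, as in the previous step):

>>> reload(trees)
>>> myDat, labels = trees.createDataSet()
>>> trees.splitDataSet(myDat, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> trees.splitDataSet(myDat, 0, 0)
[[1, 'no'], [1, 'no']]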


Next we iterate over the whole data set, computing the Shannon entropy of every split produced by splitDataSet(), in order to find the best feature to split on.

Still in trees.py, add the following code:

def chooseBestFeatureToSplit(dataSet):  # choose the best feature to split the data set on
    numFeatures = len(dataSet[0]) - 1        # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)           # the distinct values of feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain of splitting on feature i
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Test the code:
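For example (again assuming the five-sample data set above):

>>> reload(trees)
>>> myDat, labels = trees.createDataSet()
>>> trees.chooseBestFeatureToSplit(myDat)
0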


The best feature to split on is feature 0 (the first feature).

3.1.3 Recursively Building the Decision Tree

import operator

def majorityCnt(classList):  # return the class label that appears most often in classList
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
This function takes a list of class names, builds a dictionary whose keys are the unique values in classList and whose values record how often each class label occurs, then sorts the dictionary by count using the operator module and returns the class name that occurs most frequently.
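A quick sanity check at the prompt (my own illustrative call, not from the book):

>>> trees.majorityCnt(['yes', 'no', 'no'])
'no'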

def createTree(dataSet, labels):  # recursively build the decision tree
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # stop when all samples share the same class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)        # no features left: return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]                # copy the labels so recursion does not alter them
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Test it:
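The session should look roughly like this (the nested dictionary is the tree; output assumes the five-sample data set and the 'no surfacing' label defined above):

>>> reload(trees)
>>> myDat, labels = trees.createDataSet()
>>> myTree = trees.createTree(myDat, labels)
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}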


The decision tree has been built, but the nested dictionary is a bit hard to read; we need Matplotlib annotations to draw the tree as a figure.

That is covered in my previous post.

3.3 Testing and Storing the Classifier

3.3.1 Testing the Algorithm: Using the Decision Tree to Classify

def classify(inputTree, featLabels, testVec):  # classify testVec with a built decision tree
    firstStr = inputTree.keys()[0]             # the feature tested at the root (Python 2: keys() is a list)
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)     # map the feature name back to its column index
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)  # descend into the subtree
            else:
                classLabel = secondDict[key]   # reached a leaf
    return classLabel
Test the code:
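For example (labels[:] is passed to createTree so that the original labels list is left intact for classify, since createTree deletes entries from the list it is given):

>>> reload(trees)
>>> myDat, labels = trees.createDataSet()
>>> myTree = trees.createTree(myDat, labels[:])
>>> trees.classify(myTree, labels, [1, 0])
'no'
>>> trees.classify(myTree, labels, [1, 1])
'yes'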



3.3.2 Using the Algorithm: Storing the Decision Tree

def storeTree(inputTree, filename):  # serialize the decision tree to disk with the pickle module
    import pickle
    fw = open(filename, 'w')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):              # load a previously stored decision tree
    import pickle
    fr = open(filename)
    return pickle.load(fr)
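Typical usage looks like this (the file name is just an example):

>>> trees.storeTree(myTree, 'classifierStorage.txt')
>>> trees.grabTree('classifierStorage.txt')
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}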


3.4 Example: Using a Decision Tree to Predict Contact Lens Type

The data are as follows (each line: age, prescription type, astigmatic or not, tear production rate, and the recommended lens class):

young myope no reduced no lenses
young myope no normal soft
young myope yes reduced no lenses
young myope yes normal hard
young hyper no reduced no lenses
young hyper no normal soft
young hyper yes reduced no lenses
young hyper yes normal hard
pre myope no reduced no lenses
pre myope no normal soft
pre myope yes reduced no lenses
pre myope yes normal hard
pre hyper no reduced no lenses
pre hyper no normal soft
pre hyper yes reduced no lenses
pre hyper yes normal no lenses
presbyopic myope no reduced no lenses
presbyopic myope no normal no lenses
presbyopic myope yes reduced no lenses
presbyopic myope yes normal hard
presbyopic hyper no reduced no lenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced no lenses
presbyopic hyper yes normal no lenses

At the Python prompt, enter the following commands:
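Assuming the data above are saved in a tab-separated file named lenses.txt (the file name and the feature-label list below are my own choices), the session looks roughly like this:

>>> fr = open('lenses.txt')
>>> lenses = [inst.strip().split('\t') for inst in fr.readlines()]
>>> lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
>>> lensesTree = trees.createTree(lenses, lensesLabels)
>>> lensesTree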



The resulting tree is as follows:
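The nested dictionary should come out roughly as below: the root split is on tearRate, since a reduced tear rate always leads to 'no lenses' (the exact key order in the printed dictionary may differ):

{'tearRate': {'reduced': 'no lenses',
              'normal': {'astigmatic': {'no': {'age': {'young': 'soft',
                                                       'pre': 'soft',
                                                       'presbyopic': {'prescript': {'myope': 'no lenses',
                                                                                    'hyper': 'soft'}}}},
                                        'yes': {'prescript': {'myope': 'hard',
                                                              'hyper': {'age': {'young': 'hard',
                                                                                'pre': 'no lenses',
                                                                                'presbyopic': 'no lenses'}}}}}}}}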

