Building a Decision Tree with the ID3 Algorithm
I won't repeat the theory of decision trees here; this post is mainly my own study notes on Machine Learning in Action (《机器学习实战》).
First, look at the decision tree below; I hope it helps you understand decision trees.
3.1.1 Information Gain
First, you need to know two formulas:
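The two formulas appeared as images in the original post; judging from the code that follows, they are presumably the Shannon entropy and the information gain, which in standard notation read:

```latex
% Shannon entropy of data set D with n classes, p(x_i) the proportion of class i
H(D) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)

% Information gain of splitting D on feature A, where D_v is the subset with A = v
g(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

ID3 picks, at each node, the feature A with the largest information gain g(D, A).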
Create a file named trees.py and add the following code to it:
from math import log

def calcShannonEnt(dataSet):
    # Compute the Shannon entropy of the given data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
Next, the data set:
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
Enter the following commands at the Python prompt:
The 0.970... you get is the entropy; the higher the entropy, the more mixed the data.
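The commands were shown as a screenshot in the original post; a standalone sketch of that session (with the two functions from trees.py repeated so it runs on its own) looks like this:

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def createDataSet():
    dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
               [0, 1, 'no'], [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))   # ≈ 0.9710 (2 'yes' and 3 'no' out of 5)
```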
3.1.2 Splitting the Data Set
def splitDataSet(dataSet, axis, value):
    # Split the data set on the given feature (axis) and feature value
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]
            # extend() takes a list and appends each of its elements to the original list
            reduceFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet
Now let's test the function.
Enter the following at the Python prompt:
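Again the session was a screenshot; a standalone sketch, splitting the sample data on feature 0:

```python
def splitDataSet(dataSet, axis, value):
    # Keep records whose feature `axis` equals `value`, dropping that feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]
            reduceFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]
```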
Next we iterate over the whole data set, calling calcShannonEnt() and splitDataSet() in a loop to find the best feature to split on.
Add the following code, again in trees.py:
# Choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1   # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Test the code:
The best feature to split on is feature 0.
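The test itself was a screenshot; a standalone sketch (with the helper functions repeated so it runs on its own):

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    # Records with feature `axis` == `value`, with that feature removed
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain, bestFeature = infoGain, i
    return bestFeature

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(chooseBestFeatureToSplit(myDat))   # 0
```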
3.1.3 Recursively Building the Tree
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
This function takes a list of class names and builds a dictionary whose keys are the unique values in classList; the dictionary stores the frequency of each class label. It then uses operator to sort the dictionary by value and returns the class name that occurs most often.
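A quick illustration of what majorityCnt() returns for a small label list:

```python
import operator

def majorityCnt(classList):
    # Count how often each class label occurs, then return the most frequent one
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

print(majorityCnt(['yes', 'no', 'yes']))   # yes
```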
def createTree(dataSet, labels):
    # Recursively build the decision tree
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]   # stop when all examples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)   # no features left: return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Test it:
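The interactive test was a screenshot in the original post; here is a standalone sketch, with the trees.py functions repeated (in a slightly compressed form) so it runs on its own:

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(ex[i] for ex in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return max(classCount, key=classCount.get)

def createTree(dataSet, labels):
    classList = [ex[-1] for ex in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                    # all examples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)          # no features left
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    for value in set(ex[bestFeat] for ex in dataSet):
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), labels[:])
    return myTree

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(createTree(myDat, ['no surfacing', 'flippers']))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```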
The decision tree is now built, but it is a bit hard to read in this form; we can use Matplotlib annotations to draw the tree as a figure, which I covered in my previous post.
3.3 Testing and Storing the Classifier
3.3.1 Testing the Algorithm: Classifying with the Decision Tree
def classify(inputTree, featLabels, testVec):
    # Walk the tree until a leaf (a non-dict value) is reached
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
Test the code:
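The test was a screenshot; a standalone sketch, using the tree built earlier from the sample data:

```python
def classify(inputTree, featLabels, testVec):
    # Walk the tree: at each node, follow the branch matching the test vector
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(myTree, labels, [1, 0]))   # no
print(classify(myTree, labels, [1, 1]))   # yes
```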
3.3.2 Using the Algorithm: Storing the Decision Tree
def storeTree(inputTree, filename):
    # Persist the decision tree with the pickle module
    import pickle
    fw = open(filename, 'wb')   # pickle needs binary mode
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)
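A round-trip sketch: store the sample tree and read it back (using a temp-directory path so the snippet runs anywhere):

```python
import os
import pickle
import tempfile

def storeTree(inputTree, filename):
    # Serialize the tree to disk with pickle (binary mode required)
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    # Deserialize a previously stored tree
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.gettempdir(), 'classifierStorage.txt')
storeTree(myTree, path)
print(grabTree(path) == myTree)   # True
```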
3.4 Example: Predicting Contact Lens Type with a Decision Tree
The data (columns: age, prescript, astigmatic, tearRate, class) are as follows:
young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses
Enter the following commands at the Python prompt:
The result is as follows:
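The commands and the resulting tree were screenshots in the original post. A standalone sketch follows; the lenses records are inlined as a Python list instead of being read from a lenses.txt file, and the trees.py functions are repeated (in compressed form) so the snippet runs on its own:

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(ex[i] for ex in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return max(classCount, key=classCount.get)

def createTree(dataSet, labels):
    classList = [ex[-1] for ex in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    for value in set(ex[bestFeat] for ex in dataSet):
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), labels[:])
    return myTree

lenses = [
    ['young', 'myope', 'no', 'reduced', 'no lenses'],
    ['young', 'myope', 'no', 'normal', 'soft'],
    ['young', 'myope', 'yes', 'reduced', 'no lenses'],
    ['young', 'myope', 'yes', 'normal', 'hard'],
    ['young', 'hyper', 'no', 'reduced', 'no lenses'],
    ['young', 'hyper', 'no', 'normal', 'soft'],
    ['young', 'hyper', 'yes', 'reduced', 'no lenses'],
    ['young', 'hyper', 'yes', 'normal', 'hard'],
    ['pre', 'myope', 'no', 'reduced', 'no lenses'],
    ['pre', 'myope', 'no', 'normal', 'soft'],
    ['pre', 'myope', 'yes', 'reduced', 'no lenses'],
    ['pre', 'myope', 'yes', 'normal', 'hard'],
    ['pre', 'hyper', 'no', 'reduced', 'no lenses'],
    ['pre', 'hyper', 'no', 'normal', 'soft'],
    ['pre', 'hyper', 'yes', 'reduced', 'no lenses'],
    ['pre', 'hyper', 'yes', 'normal', 'no lenses'],
    ['presbyopic', 'myope', 'no', 'reduced', 'no lenses'],
    ['presbyopic', 'myope', 'no', 'normal', 'no lenses'],
    ['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'],
    ['presbyopic', 'myope', 'yes', 'normal', 'hard'],
    ['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'],
    ['presbyopic', 'hyper', 'no', 'normal', 'soft'],
    ['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'],
    ['presbyopic', 'hyper', 'yes', 'normal', 'no lenses'],
]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = createTree(lenses, lensesLabels)
print(lensesTree)   # the root split is on tearRate; 'reduced' leads straight to 'no lenses'
```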