Building a Decision Tree with the ID3 Algorithm
I won't repeat the theory of decision trees here; this post is mainly my own study notes on Machine Learning in Action (《机器学习实战》).
First, look at the decision tree below; I hope it helps you understand decision trees.
3.1.1 Information Gain
First, you need to know two formulas:
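The two formulas appeared as images in the original post; judging from the code that follows, they are presumably the Shannon entropy and the information gain, which in standard notation read:

```latex
% Shannon entropy of data set D with n classes, p(x_i) the proportion of class i
H(D) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)

% Information gain of splitting D on feature A, where D_v is the subset with A = v
g(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

ID3 picks, at each node, the feature A with the largest information gain g(D, A).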
Create a file named trees.py and add the following code to it:
from math import log

def calcShannonEnt(dataSet):
    # Compute the Shannon entropy of the given data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
Next, the data set:
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
Enter the following commands at the Python prompt:
The 0.970... you get is the entropy; the higher the entropy, the more mixed the data.
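The commands were shown as a screenshot in the original post; a standalone sketch of that session (with the two functions from trees.py repeated so it runs on its own) looks like this:

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def createDataSet():
    dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
               [0, 1, 'no'], [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))   # ≈ 0.9710 (2 'yes' and 3 'no' out of 5)
```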
3.1.2 Splitting the Data Set
def splitDataSet(dataSet, axis, value):
    # Split the data set on the given feature (axis) and feature value
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]
            # extend() takes a list and appends each of its elements to the original list
            reduceFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet
Now let's test the function.
Enter the following at the Python prompt:
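Again the session was a screenshot; a standalone sketch, splitting the sample data on feature 0:

```python
def splitDataSet(dataSet, axis, value):
    # Keep records whose feature `axis` equals `value`, dropping that feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]
            reduceFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]
```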
Next we iterate over the whole data set, calling calcShannonEnt() and splitDataSet() in a loop to find the best feature to split on.
Add the following code, again in trees.py:
# Choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1   # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Test the code:
The best feature to split on is feature 0.
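The test itself was a screenshot; a standalone sketch (with the helper functions repeated so it runs on its own):

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    # Records with feature `axis` == `value`, with that feature removed
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain, bestFeature = infoGain, i
    return bestFeature

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(chooseBestFeatureToSplit(myDat))   # 0
```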
3.1.3 Recursively Building the Tree
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
This function takes a list of class names and builds a dictionary whose keys are the unique values in classList; the dictionary stores the frequency of each class label. It then uses operator to sort the dictionary by value and returns the class name that occurs most often.
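A quick illustration of what majorityCnt() returns for a small label list:

```python
import operator

def majorityCnt(classList):
    # Count how often each class label occurs, then return the most frequent one
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

print(majorityCnt(['yes', 'no', 'yes']))   # yes
```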
def createTree(dataSet, labels):
    # Recursively build the decision tree
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]   # stop when all examples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)   # no features left: return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Test it:
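The interactive test was a screenshot in the original post; here is a standalone sketch, with the trees.py functions repeated (in a slightly compressed form) so it runs on its own:

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(ex[i] for ex in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return max(classCount, key=classCount.get)

def createTree(dataSet, labels):
    classList = [ex[-1] for ex in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                    # all examples share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)          # no features left
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    for value in set(ex[bestFeat] for ex in dataSet):
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), labels[:])
    return myTree

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(createTree(myDat, ['no surfacing', 'flippers']))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```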
The decision tree is now built, but it is a bit hard to read in this form; we can use Matplotlib annotations to draw the tree as a figure, which I covered in my previous post.
3.3 Testing and Storing the Classifier
3.3.1 Testing the Algorithm: Classifying with the Decision Tree
def classify(inputTree, featLabels, testVec):
    # Walk the tree until a leaf (a non-dict value) is reached
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
Test the code:
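The test was a screenshot; a standalone sketch, using the tree built earlier from the sample data:

```python
def classify(inputTree, featLabels, testVec):
    # Walk the tree: at each node, follow the branch matching the test vector
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(myTree, labels, [1, 0]))   # no
print(classify(myTree, labels, [1, 1]))   # yes
```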
3.3.2 Using the Algorithm: Storing the Decision Tree
def storeTree(inputTree, filename):
    # Persist the decision tree with the pickle module
    import pickle
    fw = open(filename, 'wb')   # pickle needs binary mode
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)
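A round-trip sketch: store the sample tree and read it back (using a temp-directory path so the snippet runs anywhere):

```python
import os
import pickle
import tempfile

def storeTree(inputTree, filename):
    # Serialize the tree to disk with pickle (binary mode required)
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    # Deserialize a previously stored tree
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.gettempdir(), 'classifierStorage.txt')
storeTree(myTree, path)
print(grabTree(path) == myTree)   # True
```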
3.4 Example: Predicting Contact Lens Type with a Decision Tree
The data (columns: age, prescript, astigmatic, tearRate, class) are as follows:
young	myope	no	reduced	no lenses
young	myope	no	normal	soft
young	myope	yes	reduced	no lenses
young	myope	yes	normal	hard
young	hyper	no	reduced	no lenses
young	hyper	no	normal	soft
young	hyper	yes	reduced	no lenses
young	hyper	yes	normal	hard
pre	myope	no	reduced	no lenses
pre	myope	no	normal	soft
pre	myope	yes	reduced	no lenses
pre	myope	yes	normal	hard
pre	hyper	no	reduced	no lenses
pre	hyper	no	normal	soft
pre	hyper	yes	reduced	no lenses
pre	hyper	yes	normal	no lenses
presbyopic	myope	no	reduced	no lenses
presbyopic	myope	no	normal	no lenses
presbyopic	myope	yes	reduced	no lenses
presbyopic	myope	yes	normal	hard
presbyopic	hyper	no	reduced	no lenses
presbyopic	hyper	no	normal	soft
presbyopic	hyper	yes	reduced	no lenses
presbyopic	hyper	yes	normal	no lenses
Enter the following commands at the Python prompt:
The result is as follows:
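The commands and the resulting tree were screenshots in the original post. A standalone sketch follows; the lenses records are inlined as a Python list instead of being read from a lenses.txt file, and the trees.py functions are repeated (in compressed form) so the snippet runs on its own:

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(numFeatures):
        newEntropy = 0.0
        for value in set(ex[i] for ex in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return max(classCount, key=classCount.get)

def createTree(dataSet, labels):
    classList = [ex[-1] for ex in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    for value in set(ex[bestFeat] for ex in dataSet):
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), labels[:])
    return myTree

lenses = [
    ['young', 'myope', 'no', 'reduced', 'no lenses'],
    ['young', 'myope', 'no', 'normal', 'soft'],
    ['young', 'myope', 'yes', 'reduced', 'no lenses'],
    ['young', 'myope', 'yes', 'normal', 'hard'],
    ['young', 'hyper', 'no', 'reduced', 'no lenses'],
    ['young', 'hyper', 'no', 'normal', 'soft'],
    ['young', 'hyper', 'yes', 'reduced', 'no lenses'],
    ['young', 'hyper', 'yes', 'normal', 'hard'],
    ['pre', 'myope', 'no', 'reduced', 'no lenses'],
    ['pre', 'myope', 'no', 'normal', 'soft'],
    ['pre', 'myope', 'yes', 'reduced', 'no lenses'],
    ['pre', 'myope', 'yes', 'normal', 'hard'],
    ['pre', 'hyper', 'no', 'reduced', 'no lenses'],
    ['pre', 'hyper', 'no', 'normal', 'soft'],
    ['pre', 'hyper', 'yes', 'reduced', 'no lenses'],
    ['pre', 'hyper', 'yes', 'normal', 'no lenses'],
    ['presbyopic', 'myope', 'no', 'reduced', 'no lenses'],
    ['presbyopic', 'myope', 'no', 'normal', 'no lenses'],
    ['presbyopic', 'myope', 'yes', 'reduced', 'no lenses'],
    ['presbyopic', 'myope', 'yes', 'normal', 'hard'],
    ['presbyopic', 'hyper', 'no', 'reduced', 'no lenses'],
    ['presbyopic', 'hyper', 'no', 'normal', 'soft'],
    ['presbyopic', 'hyper', 'yes', 'reduced', 'no lenses'],
    ['presbyopic', 'hyper', 'yes', 'normal', 'no lenses'],
]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = createTree(lenses, lensesLabels)
print(lensesTree)   # the root split is on tearRate; 'reduced' leads straight to 'no lenses'
```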