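Building the decision tree: CreateTree (a recursive function)
Input: the data set and the list of feature names
Output: the decision tree, stored as a dictionary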
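a. Check whether a stopping condition is met:
If no attributes remain to split on, return the majority class of the current samples (majority vote).
If all current samples belong to the same class, return that class.
b. Find the best feature to split the current data on.
c. Store the best feature in the dictionary as a key.
d. Remove the best feature from the current attribute set.
e. For each value feat of the best feature, recursively call CreateTree (arguments: the subset of the data whose best-feature value is feat, and the attribute set with the best feature removed).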
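Step b picks the feature with the largest information gain. The following is a minimal sketch of that computation on a made-up four-row data set. The helper names entropy and info_gain and the toy rows are assumptions for illustration; they use collections.Counter rather than the hand-written counting in the full listing.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, feat_index):
    """Information gain of splitting rows (class label in the last column) on one feature."""
    base = entropy([row[-1] for row in rows])
    groups = {}
    for row in rows:
        # Collect the class labels of each subset that shares a feature value
        groups.setdefault(row[feat_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical toy data: [feature 0, feature 1, class label]
toy = [
    ['young', 'yes', 'soft'],
    ['young', 'no', 'none'],
    ['old', 'yes', 'soft'],
    ['old', 'no', 'none'],
]
gains = [info_gain(toy, i) for i in range(2)]
best = max(range(2), key=lambda i: gains[i])  # index of the feature with the largest gain
```

Here feature 0 says nothing about the class (gain 0), while feature 1 separates the classes completely (gain 1 bit), so feature 1 is chosen.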
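Code notes:
1. The tree is stored as a dictionary; the value under each feature key is itself a dictionary that maps feature values to subtrees or class labels.
2. The tree can be serialized and saved with pickle.
3. ID3 works on nominal (categorical) data; the functions pass data around as Python lists (the data set is a list of lists).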
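To make notes 1 and 2 concrete, here is a small sketch of what such a nested dictionary can look like and how it survives a pickle round trip. The tree below is a hand-written, hypothetical example, not the actual output of the listing.

```python
import pickle

# Hypothetical tree: feature name -> {feature value -> subtree or class label}
tree = {
    'tearRate': {
        'reduced': 'no lenses',
        'normal': {'astigmatic': {'yes': 'hard', 'no': 'soft'}},
    }
}

blob = pickle.dumps(tree)      # serialize the tree to bytes
restored = pickle.loads(blob)  # rebuild an equal dictionary from those bytes
```

Writing to a file works the same way with pickle.dump and pickle.load, provided the file is opened in binary mode.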
Code:
# -*- coding: utf-8 -*-
from math import log  # used by calcuEntropy
import operator
import pickle  # used to save and load the decision tree
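# Load the data: one sample per line, tab-separated, class label in the last column
def loadData(fileName):
    dataSet = []
    with open(fileName) as fr:
        for featVector in fr:
            featVector = featVector.strip()
            if featVector:  # skip blank lines
                dataSet.append(featVector.split('\t'))
    return dataSet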
def calcuEntropy(myData):  # compute the Shannon entropy of the class labels
    numSample = len(myData)
    myClassCount = {}
    for featVector in myData:
        theKey = featVector[-1]  # the class label is the last column
        if theKey not in myClassCount:
            myClassCount[theKey] = 0
        myClassCount[theKey] += 1
    myEntropy = 0.0
    for theKey in myClassCount:
        Px = float(myClassCount[theKey]) / numSample
        myEntropy -= Px * log(Px, 2)  # log comes from the math module
    return myEntropy
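# Split the data set: return the samples whose feature equals value, with that feature column removed
def splitDataSet(dataX, FeatureNumber, value):  # inputs: data set, feature index i, value of that feature
    retMat = []
    for featVect in dataX:
        if featVect[FeatureNumber] == value:
            # Slicing copies, so the original sample is left unchanged
            x1 = featVect[:FeatureNumber]
            x2 = featVect[FeatureNumber+1:]
            x1.extend(x2)
            retMat.append(x1)
    return retMat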
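# Choose the best feature: compute the information gain of splitting on each feature;
# the feature with the largest gain is the best one. The returned number i is the index of that feature.
def GetBestFeature(dataM):
    BestFeat = -1            # index of the best feature
    LargestInformGain = -1   # largest information gain seen so far
    theEntropy = calcuEntropy(dataM)  # entropy before splitting
    FeatNumber = len(dataM[0]) - 1    # the last column is the class label
    for i in range(FeatNumber):
        FeatList = [example[i] for example in dataM]  # values of feature i over all samples
        FeatUnique = set(FeatList)                    # distinct values of feature i
        NewEntropy = 0.0
        for j in FeatUnique:
            retMat = splitDataSet(dataM, i, j)  # samples whose feature i equals j
            Prob = len(retMat) / float(len(dataM))
            NewEntropy += Prob * calcuEntropy(retMat)  # note: subset probability x subset entropy
        informGain = theEntropy - NewEntropy
        if informGain > LargestInformGain:
            LargestInformGain = informGain
            BestFeat = i
    return BestFeat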
def RoleOfVote(dataM):  # voting rule: the majority class wins
    labels = [example[-1] for example in dataM]
    labelsCount = {}
    for i in labels:
        if i not in labelsCount:
            labelsCount[i] = 0
        labelsCount[i] += 1
    # items() replaces the Python 2-only iteritems()
    theSort = sorted(labelsCount.items(), key=operator.itemgetter(1), reverse=True)
    return theSort[0][0]  # return the most frequent class label
# Build the decision tree. First check the stopping conditions:
# 1. all class labels are the same; 2. no attributes remain
def CreateTree(dataSet, label):
    allLabels = [example[-1] for example in dataSet]
    if allLabels.count(allLabels[0]) == len(allLabels):
        return allLabels[0]
    if len(dataSet[0]) == 1:  # no attributes left to split on: fall back to majority vote
        return RoleOfVote(dataSet)
    featNumber = GetBestFeature(dataSet)  # index of the best feature to split on
    bestFeature = label[featNumber]
    myTree = {bestFeature: {}}
    del label[featNumber]  # note: this modifies the list passed in by the caller
    featValue = [example[featNumber] for example in dataSet]
    uniqueFeatValues = set(featValue)
    for i in uniqueFeatValues:
        subLabels = label[:]  # copy, so sibling branches do not share one list
        myTree[bestFeature][i] = CreateTree(splitDataSet(dataSet, featNumber, i), subLabels)
    return myTree
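# Classify an input sample
def SampleClassify(inputTree, featLabels, testVec):  # inputs: tree, attribute labels, test vector
    firstStr = next(iter(inputTree))        # root key of the tree (the first attribute split on)
    secondDict = inputTree[firstStr]        # the value under each feature key is also a dictionary
    featIndex = featLabels.index(firstStr)  # index K of this feature in the attribute labels
    key = testVec[featIndex]                # value of feature K in the test vector
    valueOfFeat = secondDict[key]           # raises KeyError for a value not seen in training
    if isinstance(valueOfFeat, dict):       # a dictionary means there is a subtree to descend into
        classLabel = SampleClassify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel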
# Save the generated tree
def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:  # pickle needs binary mode
        pickle.dump(inputTree, fw)
# Load a saved tree
def loadTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
if __name__ == "__main__":
    dataSet = loadData('lenses.txt')
    labels = ['age', 'prescript', 'astigmatic', 'tearRate']
    # Python passes the list object itself, and CreateTree deletes entries from it,
    # so pass a copy to keep labels unchanged
    slabels = labels[:]
    myTree = CreateTree(dataSet, slabels)
    print(labels)
    # storeTree(myTree, "xu")  # save the decision tree
    test = dataSet[0]
    print("Predicted label: %s" % SampleClassify(myTree, labels, test[:-1]))
    print("True label: %s" % test[-1])
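Notes:
- When choosing an attribute, ID3 picks the one with the largest information gain.
- To split on a continuous attribute, ID3 requires the values to be discretized (binned into intervals) first.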
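As an illustration of the second note, the sketch below bins a continuous attribute into named intervals before it would be handed to ID3. The function name discretize, the cut points, and the interval names are all assumptions chosen for the example; real cut points would have to come from the data or from domain knowledge.

```python
from bisect import bisect_right

def discretize(value, edges, names):
    """Map a continuous value to the name of the interval it falls in.

    edges must be sorted; names needs one more entry than edges.
    """
    return names[bisect_right(edges, value)]

edges = [30, 50]                   # hypothetical cut points
names = ['young', 'middle', 'old']  # value < 30, 30 <= value < 50, value >= 50
ages = [18, 30, 42, 67]
binned = [discretize(a, edges, names) for a in ages]
# binned == ['young', 'middle', 'middle', 'old']
```

After this step the attribute is nominal again, so it can be treated like any other column of the data set.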
Reference: https://blog.csdn.net/qq_32933503/article/details/78256029