Table of Contents
1. A complete sample
The designed decision tree:
If a machine-learning algorithm is run on the corresponding data, will it reproduce this designed tree?
Number of branches per node:
A, B, C, D, E = (3, 4, 3, 2, 2)
Total number of possible samples: 3 × 4 × 3 × 2 × 2 = 144
Once a path is uniquely determined, the features not on that path may take any value, so the number of samples covered by exactly one path can be counted.
Effective sample size: 112
The remaining 144 − 112 = 32 samples do not belong to any path.
All samples are listed below: dataprofull.csv
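As a sanity check on the counts above, the full cross-product of feature values can be enumerated; the value sets below are read off the dataset (A ∈ {0,1,2}, B ∈ {5..8}, C ∈ {9,10,11}, D ∈ {12,13}, E ∈ {3,4}):

```python
from itertools import product

# Value sets observed in dataprofull.csv; branch counts are (3, 4, 3, 2, 2)
A = [0, 1, 2]
B = [5, 6, 7, 8]
C = [9, 10, 11]
D = [12, 13]
E = [3, 4]

# Every possible combination of feature values
all_combos = list(product(A, B, C, D, E))
print(len(all_combos))  # 3 * 4 * 3 * 2 * 2 = 144
```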
Feature A,Feature B,Feature C,Feature D,Feature E,Result RES
A,B,C,D,E,RES
1,5,10,12,3,no
1,8,10,12,4,yes
0,8,9,13,4,yes
2,6,11,12,3,yes
0,6,9,13,3,no
1,5,10,13,3,no
2,8,9,12,3,yes
2,8,9,13,4,no
1,6,11,13,4,no
1,7,9,12,3,yes
1,7,11,12,4,no
1,6,9,12,4,yes
2,8,10,13,3,no
2,5,10,12,3,yes
0,8,10,12,4,yes
2,8,10,12,3,yes
1,5,9,12,3,yes
1,8,10,13,3,yes
1,5,11,12,3,no
2,7,11,12,4,yes
1,8,11,12,4,no
0,5,10,12,4,yes
0,6,10,12,4,yes
2,8,11,12,4,yes
0,7,10,13,3,yes
0,7,9,12,4,yes
2,7,10,12,4,yes
0,7,9,13,4,yes
1,6,11,13,3,no
2,6,10,12,4,yes
1,8,10,12,3,yes
0,8,10,13,4,yes
2,6,9,13,4,yes
0,7,9,12,3,yes
2,6,9,12,3,yes
1,6,11,12,3,no
2,6,11,13,4,yes
2,6,9,12,4,yes
2,8,9,13,3,no
1,5,9,13,3,yes
0,5,10,13,4,yes
2,6,9,13,3,yes
0,6,10,13,3,no
2,5,10,12,4,yes
1,7,11,13,4,no
0,7,11,13,4,yes
1,7,9,13,4,yes
1,5,10,13,4,no
0,5,11,13,4,yes
1,8,9,13,3,yes
1,5,11,13,3,no
0,5,9,13,4,yes
0,6,11,12,4,yes
2,7,11,12,3,yes
1,8,9,12,3,yes
0,7,10,12,3,yes
0,7,10,13,4,yes
0,6,10,12,3,no
1,5,11,13,4,no
1,6,9,13,3,yes
2,8,10,12,4,yes
2,5,9,12,3,yes
1,5,10,12,4,no
0,6,9,13,4,yes
0,7,10,12,4,yes
1,7,11,13,3,no
0,8,9,12,4,yes
1,8,10,13,4,yes
2,5,9,12,4,yes
0,7,11,12,4,yes
0,7,11,13,3,yes
0,8,11,13,4,yes
2,8,11,13,3,no
1,5,11,12,4,no
1,8,11,13,4,no
2,8,9,12,4,yes
2,7,10,12,3,yes
1,7,9,13,3,yes
0,6,9,12,3,no
1,7,9,12,4,yes
1,6,9,12,3,yes
2,6,11,13,3,yes
2,7,9,12,3,yes
0,7,9,13,3,yes
1,8,11,13,3,no
1,6,11,12,4,no
1,8,9,12,4,yes
2,5,11,12,4,yes
0,6,11,13,3,no
1,6,9,13,4,yes
2,8,11,12,3,yes
0,6,9,12,4,yes
2,6,10,13,3,yes
0,8,11,12,4,yes
1,7,11,12,3,no
0,6,11,13,4,yes
2,8,11,13,4,no
1,5,9,12,4,yes
1,8,11,12,3,no
2,7,9,12,4,yes
0,5,9,12,4,yes
0,7,11,12,3,yes
2,6,10,13,4,yes
2,5,11,12,3,yes
0,6,11,12,3,no
2,6,10,12,3,yes
1,5,9,13,4,yes
0,6,10,13,4,yes
1,8,9,13,4,yes
2,8,10,13,4,no
0,5,11,12,4,yes
2,6,11,12,4,yes
The ID3 algorithm I wrote myself, and the decision tree it generated:
Is there a problem? Verification will tell:
Take any record from the samples, for example the last row, "2,6,11,12,4,yes". The values correspond, in order, to features A, B, C, D, E and RES, i.e. A=2, B=6, C=11, D=12, E=4, RES='yes'. Walking the tree: C=11, then A=2, then D=12 reaches the decision node 'yes', which agrees with RES='yes' in the record, so the generated tree is correct for this record. Other records can be traced the same way, and all of them yield the matching result from the tree. As for B=6 and E=4, they are redundant information for this particular path and could take any value.
Comparing with the originally designed tree, "2,6,11,12,4,yes" belongs to decision path d10 = (A=2, D=12, 'yes'). Here one can see both the agreement and the difference between the internal (designed) rules and the external (learned) rules.
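The counting rule from section 1 can be illustrated on path d10: it fixes A=2 and D=12, so B, C and E range freely and the path covers 4 × 3 × 2 = 24 of the 144 combinations. A minimal sketch (the value sets are read off the dataset):

```python
from itertools import product

# Path d10 = (A=2, D=12, 'yes') fixes two features; the other three are free
B = [5, 6, 7, 8]
C = [9, 10, 11]
E = [3, 4]

# All samples covered by d10: A and D pinned, B/C/E enumerated
covered = [(2, b, c, 12, e) for b, c, e in product(B, C, E)]
print(len(covered))  # 4 * 3 * 2 = 24
```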
The decision code generated from the decision tree:
def Decision(**d):
    if d['C']=='9':
        if d['B']=='8':
            if d['A']=='1':
                return 'yes'
            elif d['A']=='0':
                return 'yes'
            elif d['A']=='2':
                if d['D']=='13':
                    return 'no'
                elif d['D']=='12':
                    return 'yes'
        elif d['B']=='6':
            if d['A']=='1':
                return 'yes'
            elif d['A']=='0':
                if d['E']=='3':
                    return 'no'
                elif d['E']=='4':
                    return 'yes'
            elif d['A']=='2':
                return 'yes'
        elif d['B']=='5':
            return 'yes'
        elif d['B']=='7':
            return 'yes'
    elif d['C']=='11':
        if d['A']=='1':
            return 'no'
        elif d['A']=='0':
            if d['E']=='3':
                if d['B']=='6':
                    return 'no'
                elif d['B']=='7':
                    return 'yes'
            elif d['E']=='4':
                return 'yes'
        elif d['A']=='2':
            if d['D']=='13':
                if d['B']=='8':
                    return 'no'
                elif d['B']=='6':
                    return 'yes'
            elif d['D']=='12':
                return 'yes'
    elif d['C']=='10':
        if d['B']=='8':
            if d['A']=='1':
                return 'yes'
            elif d['A']=='0':
                return 'yes'
            elif d['A']=='2':
                if d['D']=='13':
                    return 'no'
                elif d['D']=='12':
                    return 'yes'
        elif d['B']=='6':
            if d['E']=='3':
                if d['A']=='0':
                    return 'no'
                elif d['A']=='2':
                    return 'yes'
            elif d['E']=='4':
                return 'yes'
        elif d['B']=='5':
            if d['A']=='1':
                return 'no'
            elif d['A']=='0':
                return 'yes'
            elif d['A']=='2':
                return 'yes'
        elif d['B']=='7':
            return 'yes'
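The long if/elif chain above can also be driven by a data structure instead of hard-coded branches. Below is a minimal sketch (not the author's code): the tree is a nested dict and one walker function handles every path. Only the C=11, A=2 branch is encoded, as an illustration; the full tree would be built the same way.

```python
# Internal nodes are {feature: {value: subtree}}; leaves are strings.
# Only the C=11, A=2 subtree from the text is encoded here.
tree = {
    'C': {
        '11': {
            'A': {
                '2': {
                    'D': {
                        '12': 'yes',
                        '13': {'B': {'8': 'no', '6': 'yes'}},
                    }
                }
            }
        }
    }
}

def decide(tree, record):
    """Walk the nested dict until a leaf (a plain string) is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))          # the feature tested at this node
        node = node[feature][record[feature]]
    return node

# The record "2,6,11,12,4" from the verification above: C=11, A=2, D=12 -> yes
print(decide(tree, {'C': '11', 'A': '2', 'D': '12'}))  # yes
```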
2. sklearn
Below, sklearn is used to generate a decision tree from dataprofull.csv (I have only just started with sklearn and cannot use it fluently yet, so I adapted code generously shared by others):
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split  # cross_validation was removed; use model_selection

def main():
    data = []
    label = []
    with open("dataprofull.csv") as ifile:
        ifile.readline()  # skip the two header lines
        ifile.readline()
        for line in ifile:
            tmp = line.strip('\n').split(',')
            tmp1 = []
            for i in range(len(tmp)):
                if i == len(tmp) - 1:
                    # encode the label column: yes -> 1, no -> 0
                    if tmp[i] == 'yes':
                        tmp[i] = 1
                    else:
                        tmp[i] = 0
                tmp1.append(float(tmp[i]))
            data.append(tmp1[:-1])  # features only; keeping the label here would leak the target
            label.append(tmp1[-1])
    x = np.array(data)
    y = np.array(label)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=6)
    clf.fit(x_train, y_train)
    with open("tree.dot", 'w') as f:
        tree.export_graphviz(clf, out_file=f, filled=True)

if __name__ == '__main__':
    main()
The tree generated by sklearn is shown below:
For now I cannot fully interpret this tree; I plan to improve on this after further study.
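One way to make the sklearn tree easier to read than the Graphviz image is `export_text`, which prints it as indented if/else rules. A minimal self-contained sketch: since dataprofull.csv is not bundled here, it fits a tree on six rows copied from the dataset above (labels encoded no=0, yes=1):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Six rows taken from dataprofull.csv (columns are features A..E)
X = np.array([
    [1, 5, 10, 12, 3],
    [1, 8, 10, 12, 4],
    [0, 8,  9, 13, 4],
    [2, 6, 11, 12, 3],
    [0, 6,  9, 13, 3],
    [1, 5, 10, 13, 3],
])
y = np.array([0, 1, 1, 1, 0, 0])  # no=0, yes=1, matching the RES column

clf = DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0)
clf.fit(X, y)

# Render the fitted tree as indented textual rules
print(export_text(clf, feature_names=['A', 'B', 'C', 'D', 'E']))
```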