New Learning, New Confusion: sklearn, a Verification Journey

Contents

1. A complete sample set

2. sklearn


1. A complete sample set

The designed decision tree:

If a machine learning algorithm is run on the corresponding data, will it produce this designed tree?

Number of branches at each node:

A, B, C, D, E = (3, 4, 3, 2, 2)

Total number of possible samples: 3 × 4 × 3 × 2 × 2 = 144

Once a path is uniquely determined, the features not on that path can take arbitrary values, so the number of samples uniquely covered by each path can be counted:

Hence the effective sample size is 112.

The remaining 32 samples (144 − 112) do not belong to any path.

All samples are listed below (dataprofull.csv); a quick count check follows the listing:

Feature A,Feature B,Feature C,Feature D,Feature E,Result RES
A,B,C,D,E,RES
1,5,10,12,3,no
1,8,10,12,4,yes
0,8,9,13,4,yes
2,6,11,12,3,yes
0,6,9,13,3,no
1,5,10,13,3,no
2,8,9,12,3,yes
2,8,9,13,4,no
1,6,11,13,4,no
1,7,9,12,3,yes
1,7,11,12,4,no
1,6,9,12,4,yes
2,8,10,13,3,no
2,5,10,12,3,yes
0,8,10,12,4,yes
2,8,10,12,3,yes
1,5,9,12,3,yes
1,8,10,13,3,yes
1,5,11,12,3,no
2,7,11,12,4,yes
1,8,11,12,4,no
0,5,10,12,4,yes
0,6,10,12,4,yes
2,8,11,12,4,yes
0,7,10,13,3,yes
0,7,9,12,4,yes
2,7,10,12,4,yes
0,7,9,13,4,yes
1,6,11,13,3,no
2,6,10,12,4,yes
1,8,10,12,3,yes
0,8,10,13,4,yes
2,6,9,13,4,yes
0,7,9,12,3,yes
2,6,9,12,3,yes
1,6,11,12,3,no
2,6,11,13,4,yes
2,6,9,12,4,yes
2,8,9,13,3,no
1,5,9,13,3,yes
0,5,10,13,4,yes
2,6,9,13,3,yes
0,6,10,13,3,no
2,5,10,12,4,yes
1,7,11,13,4,no
0,7,11,13,4,yes
1,7,9,13,4,yes
1,5,10,13,4,no
0,5,11,13,4,yes
1,8,9,13,3,yes
1,5,11,13,3,no
0,5,9,13,4,yes
0,6,11,12,4,yes
2,7,11,12,3,yes
1,8,9,12,3,yes
0,7,10,12,3,yes
0,7,10,13,4,yes
0,6,10,12,3,no
1,5,11,13,4,no
1,6,9,13,3,yes
2,8,10,12,4,yes
2,5,9,12,3,yes
1,5,10,12,4,no
0,6,9,13,4,yes
0,7,10,12,4,yes
1,7,11,13,3,no
0,8,9,12,4,yes
1,8,10,13,4,yes
2,5,9,12,4,yes
0,7,11,12,4,yes
0,7,11,13,3,yes
0,8,11,13,4,yes
2,8,11,13,3,no
1,5,11,12,4,no
1,8,11,13,4,no
2,8,9,12,4,yes
2,7,10,12,3,yes
1,7,9,13,3,yes
0,6,9,12,3,no
1,7,9,12,4,yes
1,6,9,12,3,yes
2,6,11,13,3,yes
2,7,9,12,3,yes
0,7,9,13,3,yes
1,8,11,13,3,no
1,6,11,12,4,no
1,8,9,12,4,yes
2,5,11,12,4,yes
0,6,11,13,3,no
1,6,9,13,4,yes
2,8,11,12,3,yes
0,6,9,12,4,yes
2,6,10,13,3,yes
0,8,11,12,4,yes
1,7,11,12,3,no
0,6,11,13,4,yes
2,8,11,13,4,no
1,5,9,12,4,yes
1,8,11,12,3,no
2,7,9,12,4,yes
0,5,9,12,4,yes
0,7,11,12,3,yes
2,6,10,13,4,yes
2,5,11,12,3,yes
0,6,11,12,3,no
2,6,10,12,3,yes
1,5,9,13,4,yes
0,6,10,13,4,yes
1,8,9,13,4,yes
2,8,10,13,4,no
0,5,11,12,4,yes
2,6,11,12,4,yes
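As a quick sanity check on the counts above, the rows of dataprofull.csv can be counted; a minimal sketch, assuming the file sits in the working directory and has the two header lines shown:

# Count the data rows of dataprofull.csv and compare with the totals above.
with open("dataprofull.csv") as f:
    f.readline()                          # skip the descriptive header line
    f.readline()                          # skip the "A,B,C,D,E,RES" header line
    rows = [line for line in f if line.strip()]
print(len(rows))                          # 112 effective samples
print(3 * 4 * 3 * 2 * 2)                  # 144 possible samples
print(3 * 4 * 3 * 2 * 2 - len(rows))      # 32 samples covered by no path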

My own implementation of the ID3 algorithm and the decision tree it generated:

Is there anything wrong with it? A quick verification will tell:

Take any record from the samples, for example the last row, "2,6,11,12,4,yes". The order corresponds to features A, B, C, D, E, RES, i.e. A=2, B=6, C=11, D=12, E=4, RES='yes'. Walking the tree, C=11, A=2, D=12 reaches the decision node 'yes', which agrees with RES='yes' in the record, so the generated decision tree is correct for this case. The other records can be traversed in the same way, and each one yields the corresponding result from the tree. As for B=6 and E=4, they are redundant information for this path in the tree and could take any value.

Looking back at the originally designed decision tree, "2,6,11,12,4,yes" belongs to decision path d10 = (A=2, D=12, 'yes'). Here one can see both the unity and the difference between the internal and the external rules.

The decision code generated from the decision tree:

def Decision(**d):
    """Decision function hand-transcribed from the ID3 tree generated above;
    feature values are expected as strings (e.g. d['C'] == '9')."""
    if d['C'] == '9':
        if d['B'] == '8':
            if d['A'] == '1':
                return 'yes'
            elif d['A'] == '0':
                return 'yes'
            elif d['A'] == '2':
                if d['D'] == '13':
                    return 'no'
                elif d['D'] == '12':
                    return 'yes'
        elif d['B'] == '6':
            if d['A'] == '1':
                return 'yes'
            elif d['A'] == '0':
                if d['E'] == '3':
                    return 'no'
                elif d['E'] == '4':
                    return 'yes'
            elif d['A'] == '2':
                return 'yes'
        elif d['B'] == '5':
            return 'yes'
        elif d['B'] == '7':
            return 'yes'
    elif d['C'] == '11':
        if d['A'] == '1':
            return 'no'
        elif d['A'] == '0':
            if d['E'] == '3':
                if d['B'] == '6':
                    return 'no'
                elif d['B'] == '7':
                    return 'yes'
            elif d['E'] == '4':
                return 'yes'
        elif d['A'] == '2':
            if d['D'] == '13':
                if d['B'] == '8':
                    return 'no'
                elif d['B'] == '6':
                    return 'yes'
            elif d['D'] == '12':
                return 'yes'
    elif d['C'] == '10':
        if d['B'] == '8':
            if d['A'] == '1':
                return 'yes'
            elif d['A'] == '0':
                return 'yes'
            elif d['A'] == '2':
                if d['D'] == '13':
                    return 'no'
                elif d['D'] == '12':
                    return 'yes'
        elif d['B'] == '6':
            if d['E'] == '3':
                if d['A'] == '0':
                    return 'no'
                elif d['A'] == '2':
                    return 'yes'
            elif d['E'] == '4':
                return 'yes'
        elif d['B'] == '5':
            if d['A'] == '1':
                return 'no'
            elif d['A'] == '0':
                return 'yes'
            elif d['A'] == '2':
                return 'yes'
        elif d['B'] == '7':
            return 'yes'
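
As a check, the record "2,6,11,12,4,yes" discussed above can be fed to this function; note that the feature values have to be passed as strings, matching the comparisons in the code:

# Last record of the listing, with feature values given as strings.
record = {'A': '2', 'B': '6', 'C': '11', 'D': '12', 'E': '4'}
print(Decision(**record))   # prints 'yes', matching RES in the record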

2. sklearn

Below, sklearn is used to generate a decision tree from dataprofull.csv (I have only just started with sklearn and cannot use it flexibly yet, so I had to copy a kind person's code and patch it up):

import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in newer versions
from sklearn.metrics import classification_report

def main():
    data = []
    label = []
    with open("dataprofull.csv") as ifile:
        ifile.readline()                 # skip the descriptive header line
        ifile.readline()                 # skip the "A,B,C,D,E,RES" header line
        for line in ifile:
            tmp = line.strip('\n').split(',')
            tmp1 = []
            for i in range(len(tmp)):
                if i == len(tmp) - 1:    # encode the RES column: yes -> 1, no -> 0
                    if tmp[i] == 'yes':
                        tmp[i] = 1
                    else:
                        tmp[i] = 0
                tmp1.append(float(tmp[i]))
            data.append(tmp1[:-1])       # features A..E only; keep the RES label out of the feature matrix
            label.append(tmp1[-1])       # RES as the target

    x = np.array(data)
    y = np.array(label)

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=6)
    clf.fit(x_train, y_train)

    with open("tree.dot", 'w') as f:
        tree.export_graphviz(clf, out_file=f, filled=True)   # write the tree in Graphviz .dot format

if __name__ == '__main__':
    main()
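
Since classification_report is imported above but not yet used, the held-out split could also be scored; a minimal sketch, meant to go at the end of main() right after clf.fit(...):

    # Inside main(), after clf.fit(x_train, y_train):
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))

The tree.dot file written by the script can be rendered with Graphviz, for example: dot -Tpng tree.dot -o tree.png.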

The graphical result generated by sklearn is as follows:

For now, I cannot make sense of this tree. It awaits further study and improvement.