

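On the essence of decision trees, I think the best explanation is the one in Statistical Learning Methods (《統計學習方法》, Li Hang), pp. 56-58.
1. The most basic decision tree
Figure 1 is the most important part, so make sure you understand it. (The figure is taken from Machine Learning by Zhou Zhihua, p. 74. The annotations are mine, and the nodes they mention are shown in Figure 3.)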


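For a training set with no conflicting data (that is, no samples whose feature vectors are identical but whose labels differ), there always exists a decision tree with zero training error (Machine Learning, Zhou Zhihua, p. 93). This is why decision trees overfit so easily, and pruning is one way to reduce that overfitting.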
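To make the zero-training-error claim concrete, here is a minimal sketch in plain Python (no scikit-learn). It is not the algorithm from either book: `build_tree` and `predict` are hypothetical helpers, the toy data is made up, and the split feature is simply taken in order rather than chosen by information gain. The tree keeps splitting until every leaf is pure, which is always possible when no two identical feature vectors carry different labels.

```python
from collections import Counter

def build_tree(rows, labels, features):
    """Grow an unpruned tree on discrete features until every leaf is pure."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf: nothing left to split
    if not features:
        # Only reachable with conflicting data (same features, different labels)
        return Counter(labels).most_common(1)[0][0]
    f, remaining = features[0], features[1:]  # naive choice: next unused feature
    children = {}
    for value in set(row[f] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[f] == value]
        children[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     remaining)
    return (f, children)  # internal node: (feature index, value -> subtree)

def predict(tree, row):
    """Follow the branches until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        f, children = tree
        tree = children[row[f]]
    return tree

# Hypothetical toy data: four distinct feature vectors with XOR-style labels
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["lose", "win", "win", "lose"]

tree = build_tree(rows, labels, features=[0, 1])
hits = sum(predict(tree, r) == y for r, y in zip(rows, labels))
train_accuracy = hits / float(len(rows))
print(train_accuracy)
```

The tree memorises all four samples, which is exactly the behaviour that pruning is meant to rein in.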


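(Figures 4, 5, 6, 7 and 8 all come from Machine Learning by Zhou Zhihua.)
If each node tests several features at once, the tree can produce decision boundaries that are not parallel to the axes. For example, training the tree in Figure 7 on the data in Figure 4 gives the boundary shown in Figure 8, using only two splits. The boundary at each node can be determined with linear discriminant analysis. As an aside, linear models give rise to many kinds of classifiers. I do not find Chapter 3 of Zhou Zhihua's Machine Learning very systematic on this, so for that part I recommend Andrew Ng's machine learning course on NetEase Open Courses.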
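A small sketch of why testing several features in one node helps. The points and the weight vector `w = (1, -1)` are made up for illustration, and the weights are picked by hand here rather than fitted with linear discriminant analysis. The labels depend on whether x1 > x2, so the true boundary is a diagonal line.

```python
# Hypothetical 2-D points labelled by whether x1 > x2 (a diagonal boundary)
points = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
labels = [x1 > x2 for x1, x2 in points]

def accuracy(predictions):
    """Fraction of predictions that match the labels."""
    hits = sum(p == y for p, y in zip(predictions, labels))
    return hits / float(len(labels))

def best_axis_parallel_stump():
    """Best accuracy of any single one-feature test (x[f] > t or x[f] <= t)."""
    best = 0.0
    for f in (0, 1):
        for t in sorted(set(p[f] for p in points)):
            acc = accuracy([p[f] > t for p in points])
            best = max(best, acc, 1.0 - acc)  # 1 - acc covers the flipped test
    return best

# One oblique test on both features at once: w . x > 0, i.e. x1 - x2 > 0
w = (1, -1)
oblique_acc = accuracy([w[0] * x1 + w[1] * x2 > 0 for x1, x2 in points])
stump_acc = best_axis_parallel_stump()
print(oblique_acc > stump_acc)
```

A single oblique test separates this data perfectly, while no single axis-parallel test does, so an ordinary tree would need a staircase of splits to approximate the same line.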

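For code that implements a decision tree from scratch, see Machine Learning in Action (《機器學習實戰》). This post instead walks through getting one running with scikit-learn, because the key to machine learning is not chasing fancy models but choosing features and tuning parameters! Of course the underlying implementation is still worth learning: not every problem has a ready-made model that fits, and there are plenty of times when we have to build the algorithm ourselves. I recommend studying algorithm design and analysis.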

Verified to work. Environment: PyCharm, Python 2.7.

#1: Choosing good features matters more for getting good results than choosing the right algorithm!

#2: After updating to scikit-learn 0.18, note:
#   `RandomizedPCA` is replaced by `PCA` with `svd_solver='randomized'`.
#   All references to `cross_validation` become `model_selection`.

#3: Convert strings to ints for efficiency, then convert the ints to one-hot codes with OneHotEncoder so the model does not treat them as continuous values.

#4: GridSearchCV & best_estimator_: search the parameter space by cross-validation to find the best parameter combination, then print that combination.

#5: A pattern for updating a feature row by row:
#for index, row in dataset.iterrows():
#   row["feature"] = last[]
#   last[] = row["data"]

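(1). Import the data and set up the dataset frame
The way of obtaining the dataset described in this part of Learning Data Mining with Python (《Python數據挖掘入門與實踐》) no longer works, so I put together my own dataset: the NBA 2013-2014 game dataset.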

import pandas as pd
import numpy as np

dataset = pd.read_csv("dicision trees sample.csv", parse_dates=["Date"])  # load the original dataset, then tidy the column titles
dataset.columns = ["Date", "Start (ET)", "Visitor Team","VisitorPts", "Home Team", "HomePts", "Score Type","OT?", "Notes"]
dataset["HomeWin"] = 0
dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0
dataset["HomeTeamRanksHigher"] = 0
dataset["HomeTeamWonLast"] = 0
# The columns added above are the features used to train the model, all initialized to 0.
# HomeTeamRanksHigher stores whether the home team's rank is higher than the visitor's.
# Each value is an int, and together with its title a whole column can be pictured as a dict entry:
# init: {"HomeLastWin": [0, 0, 0, 0, ...]}
# init: {"VisitorLastWin": [0, 0, 0, 0, ...]}

(2). Highlight no. 1: creating features

# preparation for computing the feature values
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]  # bool: True when the home team outscored the visitor
y_true = dataset["HomeWin"].values  # convert to a NumPy array so scikit-learn can use it
standings = pd.read_csv("dicision trees expanded standings.csv", skiprows=[0])  # second dataset: team standings, for HomeTeamRanksHigher
from collections import defaultdict
won_last = defaultdict(int)  # team -> result of its previous game, for HomeLastWin and VisitorLastWin
last_match_winner = defaultdict(int)  # sorted team pair -> winner of their previous meeting, for HomeTeamWonLast
# While traversing the rows of the dataset, these dicts store the outcome for the two teams in the current row,
# so that it serves as the "last result" when the same team (or the same pairing) shows up again.
# Those stored values are then used as the features of the later rows.
# e.g. at some point: won_last == {"Miami Heat": 1, "Oklahoma City": 0, ...}
for index, row in dataset.iterrows():
    home_team = row["Home Team"]  # name of the home team in the current row
    visitor_team = row["Visitor Team"]  # name of the visiting team in the current row
    teams = tuple(sorted([home_team, visitor_team]))  # alphabetical order, so the key is the same whoever plays at home
    if home_team == "New Orleans Pelicans":  # the standings file lists this team under its old name
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]  # rank of the home team
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]  # rank of the visiting team
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)  # 1 when the home team's Rk value is the larger one
    row["HomeLastWin"] = won_last[home_team]  # did the home team win its previous game?
    row["VisitorLastWin"] = won_last[visitor_team]  # did the visiting team win its previous game?
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    dataset.loc[index] = row  # write the updated row back (.loc replaces the deprecated .ix)
    won_last[home_team] = int(row["HomeWin"])  # remembered for this team's next game
    won_last[visitor_team] = int(not row["HomeWin"])  # the visitor won exactly when the home team lost
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner  # remembered for the next meeting of these two teams
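The order of operations in that loop is the important part: each row first reads its feature from the dict, and only afterwards is the dict updated with the row's own result, so a feature never sees the outcome it is supposed to help predict. A stripped-down sketch of the same pattern in plain Python, using a made-up four-game schedule (team names and results are hypothetical):

```python
from collections import defaultdict

# Hypothetical mini-schedule in playing order: (home, visitor, home_win)
games = [
    ("A", "B", True),
    ("B", "C", False),
    ("A", "C", True),
    ("C", "B", True),
]

won_last = defaultdict(int)  # teams not seen yet default to 0 ("did not win")
features = []
for home, visitor, home_win in games:
    # 1) read the features from state built from EARLIER games only
    features.append((won_last[home], won_last[visitor]))
    # 2) only then update the state with this game's outcome
    won_last[home] = int(home_win)
    won_last[visitor] = int(not home_win)

print(features)
```

Swapping steps 1 and 2 would leak each game's result into its own features and inflate the accuracy.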

(3). The effect of using different features

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
estimator = DecisionTreeClassifier(random_state=14)  # scikit-learn's decision tree classifier is the estimator

X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values  # feature set 1: previous results only
X_homehigher = dataset[["HomeLastWin", "VisitorLastWin","HomeTeamRanksHigher"]].values  # feature set 2: plus the ranking comparison
X_lastwinner = dataset[["HomeLastWin", "VisitorLastWin","HomeTeamRanksHigher", "HomeTeamWonLast"]].values  # feature set 3: plus the previous meeting

scores_p = cross_val_score(estimator, X_previouswins, y_true,scoring='accuracy')  # fit and score with cross-validation
scores_h = cross_val_score(estimator, X_homehigher, y_true,scoring='accuracy')
scores_l = cross_val_score(estimator, X_lastwinner, y_true,scoring='accuracy')

print("when features are HomeLastWin&VisitorLastWin ,the accuracy is: {0:.1f}%".format(np.mean(scores_p) * 100))
print("when features are HomeLastWin&VisitorLastWin&HomeTeamRanksHigher ,the accuracy is: {0:.1f}%".format(np.mean(scores_h) * 100))
print("when features are HomeLastWin&VisitorLastWin&HomeTeamRanksHigher&HomeTeamWonLast ,the accuracy is: {0:.1f}%".format(np.mean(scores_l) * 100))

(4). One-hot encoding

from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
encoding.fit(dataset["Home Team"].values)  # learn the team name -> int mapping
home_teams = encoding.transform(dataset["Home Team"].values)  # convert strings to ints for efficiency
visitor_teams = encoding.transform(dataset["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()  # convert ints to one-hot codes so they are not treated as continuous values
X_teams_expanded = onehot.fit_transform(X_teams).toarray()
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams_expanded, y_true, scoring='accuracy')
print("onehotencoder accuracy: {0:.1f}%".format(np.mean(scores) * 100))
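What the two encoders do is easier to see on a tiny example. The sketch below mimics them in plain Python on a made-up list of team names. It is not scikit-learn's implementation, only the same idea: first map each name to an integer, then give each integer its own 0/1 column.

```python
# Hypothetical column of team names
teams = ["Heat", "Bulls", "Lakers", "Bulls"]

# Label encoding: sorted unique names -> 0..n-1
classes = sorted(set(teams))
to_int = dict((name, i) for i, name in enumerate(classes))
encoded = [to_int[name] for name in teams]

# One-hot encoding: one column per class, a single 1 in each row
one_hot = [[1 if i == code else 0 for i in range(len(classes))]
           for code in encoded]

print(encoded)
print(one_hot)
```

With plain integers a tree can split on "team code <= 1", which groups teams by an alphabetical accident. With one-hot columns every split asks about exactly one team.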

(5). Random forest (many trees deciding together)

from sklearn.ensemble import RandomForestClassifier #random forests
estimator = RandomForestClassifier(random_state=14)
scores = cross_val_score(estimator, X_teams, y_true, scoring='accuracy')
print("random forests,the accuracy is: {0:.1f}%".format(np.mean(scores) * 100))

X_all = np.hstack([X_lastwinner, X_teams])#train data using all features
scores = cross_val_score(estimator, X_all, y_true, scoring='accuracy')
print("random forests using all features accuracy: {0:.1f}%".format(np.mean(scores) * 100))
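As a rough picture of "many trees deciding together", here is a majority vote in plain Python. The votes are made up, `majority_vote` is a hypothetical helper, and this is a simplification: as far as I know, scikit-learn's RandomForestClassifier averages the trees' predicted probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one game (True = home win)
tree_votes = [True, False, True, True, False]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```

Because each tree is trained on a different random sample of rows and features, their individual mistakes tend to differ, and the combined answer is usually more stable than any single tree.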

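(6). Highlight no. 2: parameter tuning
This is where you can see how much tuning matters. GridSearchCV lets Python search a given parameter space for the best parameters on its own, and best_estimator_ prints the result.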

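from sklearn.model_selection import GridSearchCV
parameter_space = {
    "max_features": [2, 4, 'auto'],  # for classifiers 'auto' is the same as 'sqrt'; newer scikit-learn releases dropped 'auto', so use 'sqrt' there
    "n_estimators": [100,],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
grid = GridSearchCV(estimator, parameter_space)
grid.fit(X_all, y_true)  # cross-validate every parameter combination on all features
print("random forests using all features and change parameters accuracy: {0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_)  # output the estimator with the best parameters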
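To get a feel for how much work the grid search does, the sketch below enumerates the same parameter space with itertools.product. It only lists the candidate combinations and does not train anything. `candidates` and `names` are names made up for this illustration.

```python
from itertools import product

parameter_space = {
    "max_features": [2, 4, "auto"],
    "n_estimators": [100],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}

# Every combination of one value per parameter is one candidate model
names = sorted(parameter_space)
candidates = [dict(zip(names, values))
              for values in product(*(parameter_space[n] for n in names))]

print(len(candidates))  # 3 * 1 * 2 * 3 combinations
```

Each candidate is fitted once per cross-validation fold, so the training cost is the number of candidates times the number of folds, which is why the space should stay small.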