Python and Machine Learning (2): Decision Trees Have Only One Name!

ID3, C4.5, C4.5Rule, CART, and the derived random forests...
Faced with all these decision-tree algorithms, what we should do is absorb each algorithm's strengths and weaknesses and blend them organically when solving the problem at hand, not quibble over whether some variant ought to be called C4.5 or CART. What matters is the decision tree's basic approach to solving problems; everything else is a technique of extension and refinement. Hence: decision tree, one name.
What this post gives is the framework for growing a decision tree (what has to be done). The details are covered thoroughly in the books; there is no need to pad this post with them, as if copying more meant understanding more, and no single book or post could cover them all anyway, so I won't repeat them.

On the essence of decision trees, I personally think the best account is the one in 《統計學習方法》 (Statistical Learning Methods, Li Hang), pp. 56-58.
1. The basic decision tree
Figure 1 is the heart of the matter; make sure you understand it. (Figure taken from Machine Learning, Zhou Zhihua, p. 74; the annotations are mine, and the nodes they mention appear in Figure 3.)
[Figure 1]
[Figure 2]
[Figure 3]
The key is line 8 of Figure 1: choosing the best attribute. Candidate measures include information gain, gain ratio, and Gini impurity; they just apply different formulas to the dataset, so Google them yourself. The details are all in the search engine; I have merely written down the keywords to type into it. Note that different selection measures perform differently on different datasets; none is absolutely better than the others. A small sketch of two of these measures follows.
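To make the measures concrete, here is a minimal sketch of my own (not from any of the books cited): entropy, the basis of information gain, and Gini impurity. Here labels is any array of class labels at a node, and parts is a hypothetical list of label subsets produced by a candidate split.

import numpy as np

def entropy(labels):
    # Shannon entropy of the label distribution
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity, the criterion CART uses
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, parts):
    # entropy of the parent minus the weighted entropy of the split subsets
    n = float(len(labels))
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in parts)

print(information_gain([1, 1, 1, 0, 0, 0], [[1, 1, 1], [0, 0, 0]]))  # a pure split recovers the full 1.0 bit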

2. Classification trees -> regression trees
When an attribute is continuous, such as age, choosing a threshold (say 35) splits the continuous attribute into two classes, so it effectively becomes a discrete attribute and the corresponding discrete-attribute algorithms apply. A minimal sketch follows.
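A minimal sketch of this idea, assuming a hypothetical age column:

import pandas as pd

df = pd.DataFrame({"age": [22, 41, 35, 58, 29]})
df["age>=35"] = (df["age"] >= 35).astype(int)  # the continuous attribute becomes a 0/1 discrete one
print(df)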

3. Pruning (don't chase an ever-deeper fit to the training set, even though we could...)
For a training set without conflicting data (i.e., identical feature vectors with different labels), there must exist a decision tree with zero training error (Machine Learning, Zhou Zhihua, p. 93). Decision trees therefore overfit easily, and pruning is one way to mitigate overfitting.
Pre-pruning: whenever a node is about to be created, if the tree generalizes better without the node than with it, drop the node.
Post-pruning: build the complete tree, then traverse its nodes bottom-up; if the tree generalizes better without a given node than with it, remove that node. (A sketch follows.)
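A hedged sketch of pre-pruning with scikit-learn (on the iris toy data rather than this post's NBA data): hyperparameters such as max_depth and min_samples_leaf refuse to grow nodes past a limit, which is exactly the "don't create this node" idea above. Newer scikit-learn versions (0.22+) also offer cost-complexity post-pruning via ccp_alpha.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=14)  # grown until the leaves are pure
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=14)  # pre-pruned
print(np.mean(cross_val_score(full, X, y)), np.mean(cross_val_score(pruned, X, y)))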

4. One way to look at classification problems
With S samples, F features, and y classes, a classification problem can be viewed as placing the S samples into an F-dimensional space; our goal is to find the separating hypersurface with the best generalization (it may consist of several pieces, and need not be flat) that keeps the y classes apart.

5. Multivariate decision trees
Each node of a decision tree splits on a single feature, i.e., one axis of the F-dimensional space, finding one or more points on that axis to divide the training set. If the data in Figure 4 are trained into the tree in Figure 5, the resulting classification boundary, shown in Figure 6, is axis-parallel everywhere, which makes it quite complex.
(Figures 4-8 are all from Machine Learning, Zhou Zhihua.)
[Figure 4]
[Figure 5]
[Figure 6]
If each node instead splits on several features at once, arbitrary non-axis-parallel boundaries can be produced: for the data in Figure 4, the tree in Figure 7 yields the boundary in Figure 8 using only two splits. The split at each node can be found with linear discriminant analysis (a sketch follows Figure 8). As an aside, linear models can spawn many kinds of classifiers; I don't find Chapter 3 of Zhou Zhihua's Machine Learning very systematic here, and recommend Andrew Ng's machine learning course (the NetEase open course) for this part instead.
[Figure 7]
[Figure 8]
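Here is a minimal sketch of my own, on synthetic data, of what one multivariate split looks like: LDA learns a linear combination of the features, and a single threshold on that projection is one oblique, non-axis-parallel split.

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=14)
lda = LinearDiscriminantAnalysis().fit(X, y)
z = lda.decision_function(X)  # signed distance to the learned line
pred = (z > 0).astype(int)  # one threshold on the projection = one oblique split
print((pred == y).mean())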

6. Python time
For a from-scratch implementation of decision trees, see Machine Learning in Action. This post covers getting one to run with scikit-learn, because the key in machine learning is not chasing fancy models!! It is choosing features and tuning parameters!! Of course the underlying implementations are still worth learning; not every problem has a suitable ready-made model, and there are plenty of times we have to build the algorithm ourselves. I recommend studying "algorithm design and analysis".

The code comes from Learning Data Mining with Python, with some reorganization and small changes on my part. The code is someone else's; the lessons are my own. I learned the following five points from it. As for the code below, what I most want to highlight are my comments.
Verified working; environment: PyCharm, Python 2.7.

#1: choosing good features is key to getting good outcomes, more so than choosing the right algorithm!!!

#2: after updating to scikit-learn 0.18, notice:
#Changed `RandomizedPCA` to `PCA` with `svd_solver='randomized'`.
#Changed all references from `cross_validation` to `model_selection`.

#3: encoding strings as ints to increase efficiency, then one-hot encoding the ints to remove the spurious ordering/continuity

#4: GridSearchCV & best_estimator_: searching a parameter space by cross-validation to find the best parameter group & printing that group

#5. a pattern for updating features row by row:
#for index, row in dataset.iterrows():
#    row["feature"] = last[key]
#    last[key] = row["data"]

(1). Import the data and set up the dataset skeleton
The way of obtaining this dataset described in Learning Data Mining with Python no longer works; here is the dataset I assembled myself: the NBA 2013-2014 game dataset.

import pandas as pd
import numpy as np

dataset = pd.read_csv("dicision trees sample.csv", parse_dates=["Date"])#import the original dataset and tidy the column titles
dataset.columns = ["Date", "Start (ET)", "Visitor Team","VisitorPts", "Home Team", "HomePts", "Score Type","OT?", "Notes"]
dataset["HomeWin"] = 0
dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0
dataset["HomeTeamRanksHigher"] = 0
dataset["HomeTeamWonLast"] = 0
#add new columns, initialized to all zeros, that will hold the features used to train the model,
#e.g. HomeTeamRanksHigher stores whether the home team is ranked higher than the visitor.
#each column is a series of ints keyed by its title, e.g.:
#init : {"HomeLastWin": [0, 0, 0, 0, ...]}
#init : {"VisitorLastWin": [0, 0, 0, 0, ...]}

(2). Main event, part one: creating features

#some preparation for computing the features' values
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]#the feature "Homewin", bool, is true when VisitorPts<HomePts.
y_true = dataset["HomeWin"].values#change format in order to dispose via scikit-learn
standings = pd.read_csv("dicision trees expanded standings.csv", skiprows=[0])#another dataset which is the rank of teams, for the feature HomeTeamRanksHigher
from collections import defaultdict
won_last = defaultdict(int)# for the features HomeLastWin and VisitorLastWin
last_match_winner = defaultdict(int)#for the features HomeTeamWonLast
#while traversing the rows of dataset, these dicts store each team's / pairing's most recent result,
#to serve as the "last time" value when the same team or pairing appears again,
#and those features are then used to judge new data
#at some point: won_last : {"Miami Heat": 1, "Oklahoma City": 0, ......}
for index, row in dataset.iterrows():
    home_team = row["Home Team"]#get the home team's name of the current row
    visitor_team = row["Visitor Team"]#get the visitor team's name of the current row
    teams = tuple(sorted([home_team, visitor_team]))# sort these teams in alphabetical order
    if home_team == "New Orleans Pelicans":  # the standings file uses the team's old name
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]  #the rank of the home team in current row
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]  # the rank of the visitor in current row
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)# the feature HomeTeamRanksHigher get value in current row
    row["HomeLastWin"] = won_last[home_team]# the feature HomeLastWin get value in current row
    row["VisitorLastWin"] = won_last[visitor_team]# the feature VisitorLastWin get value in current row
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    dataset.loc[index] = row# write the updated row back (.loc replaces the old .ix, which is removed in modern pandas)
    won_last[home_team] = row["HomeWin"]#recorded as the "last result" for the next time this team appears
    won_last[visitor_team] = not row["HomeWin"]#likewise for the visitor
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner

(3). The effect of different features
Note: more features is not always better.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
estimator = DecisionTreeClassifier(random_state=14) #use DecisionTreeClassifier imported from scikit-learn as the estimator

X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values #training samples: two features
X_homehigher = dataset[["HomeLastWin", "VisitorLastWin","HomeTeamRanksHigher"]].values # plus HomeTeamRanksHigher
X_lastwinner = dataset[["HomeLastWin", "VisitorLastWin","HomeTeamRanksHigher", "HomeTeamWonLast"]].values # plus HomeTeamWonLast

scores_p = cross_val_score(estimator, X_previouswins, y_true,scoring='accuracy') # fit & predict by cross validation
scores_h = cross_val_score(estimator, X_homehigher, y_true,scoring='accuracy') #fit & predict by cross validation
scores_l = cross_val_score(estimator, X_lastwinner, y_true,scoring='accuracy')

print("when features are HomeLastWin&VisitorLastWin ,the accuracy is: {0:.1f}%".format(np.mean(scores_p) * 100))
print("when features are HomeLastWin&VisitorLastWin&HomeTeamRanksHigher ,the accuracy is: {0:.1f}%".format(np.mean(scores_h) * 100))
print("when features are HomeLastWin&VisitorLastWin&HomeTeamRanksHigher&HomeTeamWonLast ,the accuracy is: {0:.1f}%".format(np.mean(scores_l) * 100))

(4). One-hot encoding
Turning the many distinct strings into integers makes processing faster; one-hot encoding the integers then stops the model from reading a spurious ordering into them.

from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
encoding.fit(dataset["Home Team"].values)
home_teams = encoding.transform(dataset["Home Team"].values)# change strings to int to increase efficiency
visitor_teams = encoding.transform(dataset["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder() # change int to onehotencoder to remove continuity
X_teams_expanded = onehot.fit_transform(X_teams).todense()
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams_expanded, y_true, scoring='accuracy')# note: use the newly created clf, not the earlier estimator
print("onehotencoder accuracy: {0:.1f}%".format(np.mean(scores) * 100))

(5). Random forests (many trees deciding together)

from sklearn.ensemble import RandomForestClassifier #random forests
estimator = RandomForestClassifier(random_state=14)
scores = cross_val_score(estimator, X_teams, y_true, scoring='accuracy')
print("random forests,the accuracy is: {0:.1f}%".format(np.mean(scores) * 100))

X_all = np.hstack([X_lastwinner, X_teams])#train data using all features
scores = cross_val_score(estimator, X_all, y_true, scoring='accuracy')
print("random forests using all features accuracy: {0:.1f}%".format(np.mean(scores) * 100))

(6). Main event, part two: parameter tuning
Here you can appreciate the magic of tuning. GridSearchCV lets Python search the given parameter space for the best parameter group by itself, and best_estimator_ prints it out.

parameter_space = {
    "max_features": [2, 4, 'auto'],
    "n_estimators": [100,],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator, parameter_space)
grid.fit(X_all, y_true)
print("random forests using all features and change parameters accuracy: {0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_) #output the best parameters
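A small addition of my own: GridSearchCV also exposes just the winning parameter combination via best_params_, which is often easier to read than the whole estimator.

print(grid.best_params_)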