Random Forests & Extremely Randomized Trees (ensemble learning > Bagging)

Python code for random forests and extremely randomized trees:

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

X,y = make_blobs(n_samples=10000,n_features=10,centers=100,random_state=0)
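# make_blobs generates Gaussian blobs intended for clustering; centers is the number of cluster centers (here also the number of classes)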
clf = DecisionTreeClassifier(max_depth=None,min_samples_split=2,random_state=0)
scores = cross_val_score(clf,X,y,cv=5)
print("DecisionTree:%f" %scores.mean())

clf2 = RandomForestClassifier(n_estimators=10,max_depth=None,min_samples_split=2,random_state=0)
scores2 =cross_val_score(clf2,X,y,cv=5)
print("RandomForestClassifier:%f" %scores2.mean())

clf3 = ExtraTreesClassifier(n_estimators=10,max_depth=None,min_samples_split=2,random_state=0)
scores3 = cross_val_score(clf3,X,y,cv=5)
print("ExtraTreesClassifier:%f" %scores3.mean())

Out:

DecisionTree:0.982300
RandomForestClassifier:0.999700
ExtraTreesClassifier:1.000000

Q: How can cross_val_score be used to judge how good a model is?
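
One possible way to approach this question, sketched under assumptions: cross_val_score returns one score per fold, so models can be compared on the mean of those scores, with the standard deviation as a rough indicator of how stable each estimate is. For classifiers the default score is accuracy. The smaller dataset and the `models` and `results` names below are illustrative choices, not part of the original experiment.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: a smaller dataset than the one above, so the sketch runs quickly
X, y = make_blobs(n_samples=1000, n_features=10, centers=10, random_state=0)

# Hypothetical container of candidate models to compare
models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=10, random_state=0),
}

results = {}
for name, model in models.items():
    # One accuracy value per fold (5 folds here)
    fold_scores = cross_val_score(model, X, y, cv=5)
    results[name] = (fold_scores.mean(), fold_scores.std())
    print("%s: %.4f +/- %.4f" % (name, fold_scores.mean(), fold_scores.std()))
```

A higher mean suggests better generalization on this data; when two means differ by less than the spread across folds, the difference is probably not meaningful.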

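Python code for feature importances with forests of trees

This example shows how a forest of trees can be used to evaluate the importance of features on an artificial classification task. The red bars are the feature importances of the forest, and the error bars show their variability across trees.
As expected, the plot shows that 3 features are informative, while the remaining ones are not.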

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X,y = make_classification(n_samples=100,
                          n_features=10,
                          n_informative=3,
                          n_redundant=0,
                          n_repeated=0,
                          n_classes=2,
                          random_state=0,
                          shuffle=False)
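'''
 n_informative: number of informative features; n_redundant: number of redundant features, generated as random linear combinations of the informative features;
 n_repeated: number of duplicated features, drawn randomly from the informative and redundant features;
Returns:
X: array of shape [n_samples, n_features]
The generated samples.
y: array of shape [n_samples]
The integer class label of each sample.
'''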
# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,random_state=0)
forest.fit(X,y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],axis=0)
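# Standard deviation of each feature's importance across the trees (axis=0)
# np.std divides by the total count N by default, whereas pandas divides by N-1
indices = np.argsort(importances)[::-1]# argsort orders ascending; [::-1] reverses that into descending order
# Slices are [start:stop:step]; [::-1] has step -1, so it walks the array from the last element to the first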

# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" %(f+1,indices[f],importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]),importances[indices],color='r',yerr=std[indices],align='center')
# yerr draws error bars along the y axis
plt.xticks(range(X.shape[1]),indices)
plt.xlim([-1,X.shape[1]])
plt.show()

Out:

Feature ranking:
1. feature 1 (0.295902)
2. feature 2 (0.208351)
3. feature 0 (0.177632)
4. feature 3 (0.047121)
5. feature 6 (0.046303)
6. feature 8 (0.046013)
7. feature 7 (0.045575)
8. feature 4 (0.044614)
9. feature 9 (0.044577)
10. feature 5 (0.043912)
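
The ranking can also be checked programmatically. Because the data was generated with shuffle=False, make_classification places the informative features in the first n_informative columns, so the three highest-ranked features should be columns 0, 1 and 2. The sketch below reuses the settings of the example; the `top` variable is only an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

n_informative = 3
# Same settings as the example; shuffle=False keeps informative features in columns 0..2
X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=n_informative, n_redundant=0,
                           n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)

forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Indices of the n_informative highest importances
top = np.argsort(importances)[::-1][:n_informative]
print("Top features:", sorted(top.tolist()))
# Importances are normalized, so they add up to 1
print("Sum of importances:", importances.sum())
```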

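Figure: bar chart titled "Feature importances", with red bars for each feature and error bars for the standard deviation across trees.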
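
The two code comments about np.std and the [::-1] slice can be verified on toy arrays. This is a minimal sketch with made-up values; ddof=1 is the NumPy option that reproduces the N-1 divisor pandas uses by default.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
population_std = np.std(values)       # divides by N (ddof=0, NumPy's default)
sample_std = np.std(values, ddof=1)   # divides by N - 1

scores = np.array([0.2, 0.5, 0.1])
ascending = np.argsort(scores)        # indices from smallest to largest score
descending = ascending[::-1]          # the same indices, walked backwards

print(population_std, sample_std)
print(ascending, descending)
```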

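Python code for pixel importances with a parallel forest of trees:

This example shows how a forest of trees can be used to evaluate the importance of pixels in an image classification task (faces). The hotter a pixel appears in the plot, the more important it is.
The code below also illustrates how the construction of the trees and the computation of the predictions can be parallelized across multiple jobs.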

from time import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier
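# Number of cores to use to perform parallel fitting of the forest model
'''
The sklearn.ensemble module also supports parallel construction of the trees and parallel
computation of the predictions through the n_jobs parameter.
With n_jobs = k, the computation is split into k jobs and run on k cores of the machine.
With n_jobs = -1, all cores of the machine are used.
Note that because of inter-process communication overhead, the speedup is not linear
(that is, using k jobs will not be k times as fast).
Significant speedup can still be achieved when building a large number of trees,
or when building a single tree takes a fair amount of time (for example, on large datasets).
'''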
n_jobs = 1
# Load the Olivetti faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape(len(data.images),-1)
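y = data.target
mask = y<5  # Limit the data to 5 classes
# mask is a boolean array that is True where the label is below 5;
# indexing with it keeps only those samples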
X = X[mask]
y = y[mask]

# Build a forest and compute the pixel importances
print("Fitting ExtraTreesClassifier on faces data with %d cores..." %n_jobs)
t0 = time()
forest = ExtraTreesClassifier(n_estimators=1000,max_features=128,n_jobs=n_jobs,random_state=0)
forest.fit(X,y)
print("done in %0.3fs" %(time()-t0))
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

# Plot pixel importances
plt.matshow(importances,cmap=plt.cm.hot)
# matshow displays the 2-D importance array as an image
plt.title("Pixel importances with forests of trees")
plt.show()

Out:

Fitting ExtraTreesClassifier on faces data with 1 cores...
done in 0.975s

Figure: heat map titled "Pixel importances with forests of trees", drawn with the hot colormap.
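
To see what n_jobs does in practice, the fit can be timed with one job and with all cores. The sketch below is only illustrative: it uses synthetic data from make_classification instead of the faces dataset (to avoid a download), and `fit_forest` is a hypothetical helper. The measured times depend on the machine, so none are shown here.

```python
from time import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Assumption: synthetic stand-in for the image data
X, y = make_classification(n_samples=500, n_features=64,
                           n_informative=8, random_state=0)

def fit_forest(n_jobs):
    """Fit a forest with the given number of jobs and return it with the elapsed time."""
    forest = ExtraTreesClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    t0 = time()
    forest.fit(X, y)
    return forest, time() - t0

serial_forest, serial_time = fit_forest(1)       # one core
parallel_forest, parallel_time = fit_forest(-1)  # all available cores

print("n_jobs=1:  %0.3fs" % serial_time)
print("n_jobs=-1: %0.3fs" % parallel_time)

# With a fixed random_state, the number of jobs should not change the fitted model
same = np.allclose(serial_forest.feature_importances_,
                   parallel_forest.feature_importances_)
print("Same importances:", same)
```

On a small problem like this one the parallel run may not be faster, since the overhead of dispatching jobs can outweigh the work per tree.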
