Supervised Learning In-Depth: Random Forests

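Previously we saw a powerful discriminative classifier, **Support Vector Machines**.
Here we'll take a look at motivating another powerful algorithm: a non-parametric algorithm called "Random Forests".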

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# the 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6
plt.style.use('seaborn-v0_8')

Motivating Random Forests: Decision Trees

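Random forests are an example of an "ensemble learner" built on decision trees.
For this reason we'll start by discussing decision trees themselves.
Decision trees are a very intuitive way to classify or label objects: you simply ask a series of questions designed to zero in on the classification.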

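The binary splitting makes this procedure very efficient.
As always, though, the trick is to ask the right questions.
This is where the algorithmic process comes in: in training a decision tree classifier, the algorithm looks at the features and decides which questions (or "splits") contain the most information.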
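
To make "contains the most information" concrete, here is a minimal sketch of how one split could be scored. It assumes Gini impurity as the criterion (the default for scikit-learn's `DecisionTreeClassifier`) and uses a brute-force search over thresholds. The helper names `gini` and `best_split` and the toy arrays are hypothetical, introduced only for illustration. This is not how scikit-learn implements the search internally.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the fraction of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Return (threshold, impurity decrease) for the best split 'feature <= t'."""
    parent = gini(labels)
    values = np.unique(feature)
    # candidate thresholds: midpoints between consecutive distinct values
    thresholds = (values[:-1] + values[1:]) / 2
    best_t, best_gain = None, 0.0
    for t in thresholds:
        left = labels[feature <= t]
        right = labels[feature > t]
        # impurity of the two children, weighted by their sizes
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# toy data: one feature, two classes separated between x = 2.5 and x = 3.5
x_toy = np.array([1.0, 2.0, 2.5, 3.5, 4.0, 5.0])
y_toy = np.array([0, 0, 0, 1, 1, 1])

threshold, gain = best_split(x_toy, y_toy)
print(threshold, gain)  # prints 3.0 0.5
```

The question "is x <= 3.0?" separates the two classes perfectly, so it removes all of the parent node's impurity (0.5). A tree-building algorithm repeats this kind of search over every feature at every node.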

Creating a Decision Tree

Here's an example of a decision tree classifier in scikit-learn. We'll start by defining some two-dimensional labeled data:

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

We have some convenience functions in the repository that help with visualization:

from fig_code import visualize_tree, plot_tree_interactive
plot_tree_interactive(X, y);

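Notice that at each increase in depth, every node is split in two **except** those nodes which contain only a single class.
The result is a very fast "non-parametric" classification, and it can be extremely useful in practice.
Question: Do you see any problems with this?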
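
Without the interactive widget, the effect of depth can also be inspected numerically. The sketch below refits a tree at several depths on the same kind of blob data and prints the number of leaves and the training accuracy. The variable names `X_demo`, `y_demo`, and `leaf_counts` are chosen here for illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)

leaf_counts = []
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_demo, y_demo)
    leaf_counts.append(tree.get_n_leaves())
    # a binary tree of depth d has at most 2**d leaves
    print(f"depth={depth}  leaves={tree.get_n_leaves()}  "
          f"training accuracy={tree.score(X_demo, y_demo):.3f}")
```

Because pure nodes are not split further, the leaf count usually stays below the `2**depth` upper bound.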

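Decision Trees and Overfitting

One issue with decision trees is that it is very easy to create trees which "overfit" the data. That is, they are flexible enough that they can learn the structure of the noise in the data rather than the signal! For example, take a look at two trees built on two subsets of this dataset: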

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

plt.figure()
visualize_tree(clf, X[:200], y[:200], boundaries=False)
plt.figure()
visualize_tree(clf, X[-200:], y[-200:], boundaries=False)

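The details of the classifications are completely different! That is an indication of "overfitting": when you predict the value for a new point, the result is more reflective of the noise in the model than of the signal.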
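
The difference between the two trees can also be put into a number. The following sketch does not rely on the `fig_code` helpers: it fits two unrestricted trees on the same two subsets and measures how often they disagree on a regular grid of points. The grid resolution and the names `tree_a`, `tree_b`, and `disagreement` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)

# two fully grown trees, each trained on a different 200-point subset
tree_a = DecisionTreeClassifier(random_state=0).fit(X_demo[:200], y_demo[:200])
tree_b = DecisionTreeClassifier(random_state=0).fit(X_demo[-200:], y_demo[-200:])

# evaluate both trees on a regular grid covering the data
xx, yy = np.meshgrid(
    np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 200),
    np.linspace(X_demo[:, 1].min(), X_demo[:, 1].max(), 200))
grid = np.c_[xx.ravel(), yy.ravel()]

disagreement = np.mean(tree_a.predict(grid) != tree_b.predict(grid))
print(f"fraction of grid points where the trees disagree: {disagreement:.3f}")
```

Each tree reproduces its own training subset perfectly, so any disagreement comes from how the trees carve up the space between and around the training points.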

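Ensembles of Estimators: Random Forests

One possible way to address overfitting is to use an "ensemble method": a meta-estimator which essentially averages the results of many individual estimators that each overfit the data. Somewhat surprisingly, the resulting estimates are much more robust and accurate than the individual estimates which make them up!

One of the most common ensemble methods is the "Random Forest", in which the ensemble is made up of many decision trees which are in some way perturbed.

There are volumes of theory and precedent about how to randomize these trees, but as an example, let's imagine an ensemble of estimators fit on subsets of the data. We can get an idea of what these might look like as follows: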

def fit_randomized_tree(random_state=0):
    X, y = make_blobs(n_samples=300, centers=4,
                      random_state=0, cluster_std=2.0)
    clf = DecisionTreeClassifier(max_depth=15)
    
    rng = np.random.RandomState(random_state)
    i = np.arange(len(y))
    rng.shuffle(i)
    visualize_tree(clf, X[i[:250]], y[i[:250]], boundaries=False,
                   xlim=(X[:, 0].min(), X[:, 0].max()),
                   ylim=(X[:, 1].min(), X[:, 1].max()))
    
from ipywidgets import interact
interact(fit_randomized_tree, random_state=(0, 100));

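See how the details of the model change as a function of the sample, while the larger characteristics remain the same!

The random forest classifier will do something similar to this, but use a combined version of all these trees to arrive at a final answer: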

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=0)
visualize_tree(clf, X, y, boundaries=False);

By averaging over 100 randomly perturbed models, we end up with an overall model which is a much better fit to our data!
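
The idea of averaging many overfit trees can be sketched by hand. The following is a minimal illustration of bootstrap aggregation (bagging) with majority voting. It is not scikit-learn's implementation: `RandomForestClassifier` also considers a random subset of features at each split and averages predicted class probabilities rather than counting hard votes. The names `X_demo`, `votes`, and `n_trees` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=2.0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.RandomState(0)
n_trees = 100
votes = np.zeros((len(X_te), 4), dtype=int)  # one column per class label 0..3

for _ in range(n_trees):
    # bootstrap sample: draw len(X_tr) training rows with replacement
    idx = rng.randint(0, len(X_tr), len(X_tr))
    tree = DecisionTreeClassifier(random_state=int(rng.randint(2**31 - 1)))
    tree.fit(X_tr[idx], y_tr[idx])
    pred = tree.predict(X_te)
    # each tree casts one vote per test point
    votes[np.arange(len(X_te)), pred] += 1

ensemble_pred = votes.argmax(axis=1)  # majority vote
single_pred = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

print("single tree accuracy: ", np.mean(single_pred == y_te))
print("bagged trees accuracy:", np.mean(ensemble_pred == y_te))
```

Comparing the two printed accuracies on the held-out points gives a rough sense of how much the averaging helps on this particular dataset.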

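(Note: above we randomized the model through sub-sampling... Random Forests use more sophisticated means of randomization, which you can read about in the [scikit-learn documentation](http://scikit-learn.org/stable/modules/ensemble.html#forest).)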

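Quick Example: Moving to Regression

Above we were considering random forests within the context of classification.

Random forests can also be made to work in the case of regression (that is, continuous rather than categorical variables). The estimator to use for this is ``sklearn.ensemble.RandomForestRegressor``.

Let's quickly demonstrate how this can be used: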

from sklearn.ensemble import RandomForestRegressor

x = 10 * np.random.rand(100)

def model(x, sigma=0.3):
    fast_oscillation = np.sin(5 * x)
    slow_oscillation = np.sin(0.5 * x)
    noise = sigma * np.random.randn(len(x))

    return slow_oscillation + fast_oscillation + noise

y = model(x)
plt.errorbar(x, y, 0.3, fmt='o');

xfit = np.linspace(0, 10, 1000)
yfit = RandomForestRegressor(100).fit(x[:, None], y).predict(xfit[:, None])
ytrue = model(xfit, 0)

plt.errorbar(x, y, 0.3, fmt='o')
plt.plot(xfit, yfit, '-r');
plt.plot(xfit, ytrue, '-k', alpha=0.5);
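
The plot compares the fit to the noise-free curve by eye. The sketch below puts a number on the same comparison by computing the root-mean-square error against the noise-free signal, for a forest and for a single regression tree. It regenerates the data with a fixed seed so the result is reproducible. The helper names `true_signal` and `rmse` and the seed are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x_demo = 10 * rng.rand(100)

def true_signal(x):
    # slow plus fast oscillation, as in the model above, without noise
    return np.sin(0.5 * x) + np.sin(5 * x)

y_demo = true_signal(x_demo) + 0.3 * rng.randn(len(x_demo))

x_grid = np.linspace(0, 10, 1000)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_demo[:, None], y_demo)
tree = DecisionTreeRegressor(random_state=0).fit(x_demo[:, None], y_demo)

def rmse(pred):
    # error measured against the noise-free signal on the dense grid
    return np.sqrt(np.mean((pred - true_signal(x_grid)) ** 2))

forest_rmse = rmse(forest.predict(x_grid[:, None]))
tree_rmse = rmse(tree.predict(x_grid[:, None]))
print(f"forest RMSE: {forest_rmse:.3f}   single tree RMSE: {tree_rmse:.3f}")
```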

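Example: Random Forest for Classifying Digits

We previously saw the hand-written digits data. Let's use that here to test the efficacy of the SVM and Random Forest classifiers.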

from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
>>dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

X = digits.data
y = digits.target
print(X.shape)
print(y.shape)
>>(1797, 64)
(1797,)

# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

from sklearn.model_selection import train_test_split
from sklearn import metrics

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=11)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)

metrics.accuracy_score(ytest, ypred)
>>0.8222222222222222

For good measure, plot the confusion matrix:

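# confusion_matrix(y_true, y_pred): rows are true labels, columns are predictions,
# which matches the axis labels set below
plt.imshow(metrics.confusion_matrix(ytest, ypred),
           interpolation='nearest', cmap=plt.cm.binary)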
plt.grid(False)
plt.colorbar()
plt.xlabel("predicted label")
plt.ylabel("true label");
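
The code above fits only a single decision tree, although this section is about random forests. As a hedged sketch of the comparison, the same train/test split can be used to fit a `RandomForestClassifier` next to the tree. The choice of `n_estimators=100` and the variable names are assumptions for illustration, not tuned settings.

```python
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits_demo = load_digits()
Xtr, Xte, ytr, yte = train_test_split(digits_demo.data, digits_demo.target,
                                      random_state=0)

tree = DecisionTreeClassifier(max_depth=11, random_state=0).fit(Xtr, ytr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

tree_acc = metrics.accuracy_score(yte, tree.predict(Xte))
forest_acc = metrics.accuracy_score(yte, forest.predict(Xte))
print(f"decision tree accuracy: {tree_acc:.3f}")
print(f"random forest accuracy: {forest_acc:.3f}")

# rows are true labels, columns are predicted labels
cm = metrics.confusion_matrix(yte, forest.predict(Xte))
print(cm)
```

The off-diagonal entries of `cm` show which digits the forest still confuses with one another.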

