文章目錄

1. 通過網格搜索完善模型

在本文中，我們將爲決策樹模型擬合一些樣本數據。這個初始模型會過擬合。然後，我們將使用網格搜索爲這個模型找到更好的參數，以減少過擬合。

首先，導入所需要的庫：

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1.1 數據導入

首先定義一個函數用於讀取 csv 數據並進行可視化：

def load_pts(csv_name):
    data = np.asarray(pd.read_csv(csv_name, header=None))
    X = data[:,0:2]
    y = data[:,2]

    plt.scatter(X[np.argwhere(y==0).flatten(),0], X[np.argwhere(y==0).flatten(),1],s = 50, color = 'blue', edgecolor = 'k')
    plt.scatter(X[np.argwhere(y==1).flatten(),0], X[np.argwhere(y==1).flatten(),1],s = 50, color = 'red', edgecolor = 'k')
    
    plt.xlim(-2.05,2.05)
    plt.ylim(-2.05,2.05)
    plt.grid(False)
    plt.tick_params(
        axis='x',
        which='both',
        bottom='off',
        top='off')

    return X,y

X, y = load_pts('Data/data.csv')
plt.show()

1.2 拆分數據爲訓練集和測試集

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, make_scorer

#Fixing a random seed
import random
random.seed(42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

1.3 擬合決策樹模型

from sklearn.tree import DecisionTreeClassifier

# Define the model (with default hyperparameters)
clf = DecisionTreeClassifier(random_state=42)

# Fit the model
clf.fit(X_train, y_train)

# Make predictions
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)

現在我們來可視化模型，並測試 f1_score，首先定義可視化函數：

def plot_model(X, y, clf):
    
    # 繪製兩類點的散點圖
    plt.scatter(X[np.argwhere(y==0).flatten(),0],X[np.argwhere(y==0).flatten(),1],s = 50, color = 'blue', edgecolor = 'k')
    plt.scatter(X[np.argwhere(y==1).flatten(),0],X[np.argwhere(y==1).flatten(),1],s = 50, color = 'red', edgecolor = 'k')

    # 圖形設置
    plt.xlim(-2.05,2.05)
    plt.ylim(-2.05,2.05)
    plt.grid(False)
    plt.tick_params(
        axis='x',
        which='both',
        bottom='off',
        top='off')

    # 利用 np.meshgrid(r,r) 生成一個平面對於的橫縱座標
    r = np.linspace(-2.1,2.1,300)
    s,t = np.meshgrid(r,r)
    
    # 將座標轉換爲與決策樹的訓練集相同格式
    s = np.reshape(s,(np.size(s),1))
    t = np.reshape(t,(np.size(t),1))
    h = np.concatenate((s,t),1)

    # 對平面上的每一個點進行預測類別
    z = clf.predict(h)

    # 將橫縱座標及對應類別轉換爲矩陣形式
    s = s.reshape((np.size(r),np.size(r)))
    t = t.reshape((np.size(r),np.size(r)))
    z = z.reshape((np.size(r),np.size(r)))

    # 利用 plt.contourf 繪製不同等高面
    plt.contourf(s,t,z,colors = ['blue','red'],alpha = 0.2,levels = range(-1,2))
    
    # 繪製等高面邊緣
    if len(np.unique(z)) > 1:
        plt.contour(s,t,z,colors = 'k', linewidths = 2)
    plt.show()

plot_model(X, y, clf)
print('The Training F1 Score is', f1_score(train_predictions, y_train))
print('The Testing F1 Score is', f1_score(test_predictions, y_test))

The Training F1 Score is 1.0
The Testing F1 Score is 0.7000000000000001

訓練集得分爲 1 ，而測試集得分爲 0.7，可以看出當前模型有些過擬合，下面我們通過網絡搜索來優化參數。

1.4 使用網絡搜索完善模型

現在，我們將執行以下步驟：

1.首先，定義一些參數來執行網格搜索：max_depth, min_samples_leaf, 和 min_samples_split。

2.使用f1_score，爲模型製作記分器。

3.使用參數和記分器，在分類器上執行網格搜索。

4.將數據擬合到新的分類器中。

5.繪製模型並找到 f1_score。

6.如果模型不太好，則更改參數的範圍並再次擬合。

from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier(random_state=42)

# 生成參數列表
parameters = {'max_depth':[2,4,6,8,10],'min_samples_leaf':[2,4,6,8,10], 'min_samples_split':[2,4,6,8,10]}

# 定義計分器
scorer = make_scorer(f1_score)

# 生成網絡搜索器
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# 擬合網絡搜索器
grid_fit = grid_obj.fit(X_train, y_train)

# 獲得最佳決策樹模型
best_clf = grid_fit.best_estimator_

# 對最佳模型進行擬合
best_clf.fit(X_train, y_train)

# 對測試集和訓練集進行預測
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# 計算測試集得分和訓練集得分
print('The training F1 Score is', f1_score(best_train_predictions, y_train))
print('The testing F1 Score is', f1_score(best_test_predictions, y_test))

# 模型可視化
plot_model(X, y, best_clf)

# 查看最佳模型的參數設置
best_clf

The training F1 Score is 0.8148148148148148
The testing F1 Score is 0.8

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=42, splitter='best')

由此可以看出，最佳參數爲：

max_depth=4

min_samples_leaf=2

min_samples_split=2

且相對於第一個圖，邊界更爲簡單，這意味着它不太可能過擬合。

1.5 交叉驗證可視化

首先看一下不同參數下的信息：

results = pd.DataFrame(grid_obj.cv_results_)
results.T

	0	1	2	3	4	5	6	7	8	9	...	115	116	117	118	119	120	121	122	123	124
mean_fit_time	0.000536919	0.000609636	0.00067091	0.0005006	0.000532627	0.000538429	0.00162276	0.000725031	0.000346661	0.000960668	...	0.000691652	0.000363668	0.00054733	0.000414769	0.000365416	0.000314713	0.000483354	0.000389099	0.000378688	0.000585318
std_fit_time	7.36079e-05	0.000217965	0.000120917	7.64067e-05	0.000118579	0.000201879	0.00125017	0.000257437	2.95338e-05	0.000841452	...	0.00027542	3.20732e-05	0.000166427	5.31612e-05	2.22742e-05	5.03509e-06	0.000156771	0.000113168	6.09452e-05	0.000168713
mean_score_time	0.00124542	0.00209157	0.0011754	0.00118478	0.00127451	0.00132982	0.00173569	0.00158167	0.000804345	0.00165256	...	0.00107495	0.000776132	0.0010496	0.00107972	0.000799974	0.000889381	0.00097998	0.000957966	0.00082167	0.00108504
std_score_time	0.000460223	0.00131765	0.000217313	0.000175357	0.000274129	0.000684221	0.00012585	0.000531796	2.83336e-05	0.00103978	...	0.000327278	1.61637e-05	0.000181963	0.000226213	3.18651e-05	0.00015253	0.000282182	0.00014771	5.40954e-05	0.00015636
param_max_depth	2	2	2	2	2	2	2	2	2	2	...	10	10	10	10	10	10	10	10	10	10
param_min_samples_leaf	2	2	2	2	2	4	4	4	4	4	...	8	8	8	8	8	10	10	10	10	10
param_min_samples_split	2	4	6	8	10	2	4	6	8	10	...	2	4	6	8	10	2	4	6	8	10
params	{'max_depth': 2, 'min_samples_leaf': 2, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 2, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 2, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 2, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 2, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 4, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 4, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 4, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 4, 'min_s...	{'max_depth': 2, 'min_samples_leaf': 4, 'min_s...	...	{'max_depth': 10, 'min_samples_leaf': 8, 'min_...	{'max_depth': 10, 'min_samples_leaf': 8, 'min_...	{'max_depth': 10, 'min_samples_leaf': 8, 'min_...	{'max_depth': 10, 'min_samples_leaf': 8, 'min_...	{'max_depth': 10, 'min_samples_leaf': 8, 'min_...	{'max_depth': 10, 'min_samples_leaf': 10, 'min...	{'max_depth': 10, 'min_samples_leaf': 10, 'min...	{'max_depth': 10, 'min_samples_leaf': 10, 'min...	{'max_depth': 10, 'min_samples_leaf': 10, 'min...	{'max_depth': 10, 'min_samples_leaf': 10, 'min...
split0_test_score	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	...	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857	0.642857
split1_test_score	0.764706	0.764706	0.764706	0.764706	0.764706	0.764706	0.764706	0.764706	0.764706	0.764706	...	0.5	0.5	0.5	0.5	0.5	0.5	0.5	0.5	0.5	0.5
split2_test_score	0.709677	0.709677	0.709677	0.709677	0.709677	0.709677	0.709677	0.709677	0.709677	0.709677	...	0.714286	0.714286	0.714286	0.714286	0.714286	0.666667	0.666667	0.666667	0.666667	0.666667
mean_test_score	0.705698	0.705698	0.705698	0.705698	0.705698	0.705698	0.705698	0.705698	0.705698	0.705698	...	0.617857	0.617857	0.617857	0.617857	0.617857	0.602381	0.602381	0.602381	0.602381	0.602381
std_test_score	0.0501306	0.0501306	0.0501306	0.0501306	0.0501306	0.0501306	0.0501306	0.0501306	0.0501306	0.0501306	...	0.0889995	0.0889995	0.0889995	0.0889995	0.0889995	0.0737135	0.0737135	0.0737135	0.0737135	0.0737135
rank_test_score	14	14	14	14	14	14	14	14	14	14	...	42	42	42	42	42	62	62	62	62	62

14 rows × 125 columns

接着我們來看一下在不同的最大深度（max_depth）下，每片葉子的最小樣本數（min_samples_leaf）和每次分裂的最小樣本數（min_samples_split）對決策樹模型的泛化性能的影響。

首先定義一個函數來繪製不同最大深度下的熱力圖（需安裝 mglearn）：

def hotmap(max_depth, results):
    fliter = results[results['param_max_depth']==max_depth]
    scores = np.array(fliter['mean_test_score']).reshape(5, 5)
    mglearn.tools.heatmap(scores, xlabel='min_samples_split', xticklabels=parameters['min_samples_split'],
                      ylabel='min_samples_leaf', yticklabels=parameters['min_samples_leaf'], cmap="viridis")

繪製到子圖中：

import matplotlib.pyplot as plt 
plt.figure(figsize=(20, 20))
plt
for i in [1,2,3,4,5]:
    plt.subplot(1,5,i, title='max_depth={}'.format(2*i))
    hotmap(2*i, results)

從圖中可以看出，每次分裂的最小樣本數（min_samples_split）對模型幾乎沒有影響，而隨着最大深度（max_depth）的增加，模型得分逐漸降低。

1.5 總結

通過使用網格搜索，我們將 F1 分數從 0.7 提高到 0.8（同時我們失去了一些訓練分數，但這沒問題）。另外，如果你看繪製的圖，第二個模型的邊界更爲簡單，這意味着它不太可能過擬合。

監督學習 | 決策樹之網絡搜索

文章目錄

1. 通過網格搜索完善模型

1.1 數據導入

1.2 拆分數據爲訓練集和測試集

1.3 擬合決策樹模型

1.4 使用網絡搜索完善模型

1.5 交叉驗證可視化

1.5 總結

機器學習 | 目錄（持續更新）

無監督學習 | GMM 高斯混合聚類原理及Sklearn實現

無監督學習 | KMeans與KMeans++原理

無監督學習 | DBSCAN 原理及Sklearn實現

SQLite | SQLite 與 Pandas 比較篇之一

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結