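Here are my notes on grid search, put together from a recent round of review.
The code and data can be downloaded from GitHub: "GridSearch練習"
Grid search takes every value of each hyperparameter (or a few representative values), lays out all the combinations in a table, trains a model on the training set for each combination, and then selects the best-performing model.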
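The idea of "list every combination, evaluate each, keep the best" can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how scikit-learn implements it: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "train a model and score it on validation data".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product() yields one tuple of values per cell of the grid
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical scoring function (assumption): it simply peaks at
# max_depth=4, min_samples_leaf=2 instead of training a real model.
def toy_score(params):
    return -(params["max_depth"] - 4) ** 2 - (params["min_samples_leaf"] - 2) ** 2


grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 2, 4]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The grid here has 4 x 3 = 12 cells, so the scoring function is called 12 times. The cost grows multiplicatively with each added hyperparameter, which is why only a few representative values per parameter are usually listed.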
Above is a scatter plot of the dataset. The code that produces it is as follows:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('data/grid.csv',header=None)
X = data[[0,1]]
y = data[2]
#print(y)
X_blue = data[data[2]== 0]
X_red = data[data[2]== 1]
plt.scatter(X_blue[0],X_blue[1],c='blue',edgecolor='k',s=50)
plt.scatter(X_red[0],X_red[1],c='red',edgecolor='k',s=50)
plt.xlim(-2.05,2.05)
plt.ylim(-2.05,2.05)
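Train a decision tree on the data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Assumption: the post does not show the split that defines X_train/X_test,
# so the 80/20 split below is illustrative.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# All hyperparameters are left at their defaults (max_depth=None)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train,y_train)
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)
print(clf.get_params())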
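The custom function that visualizes the classification is as follows:
def plot_model(X, y, clf):
    # Scatter the two classes (uses the X and y arguments rather than globals)
    X_blue = X[y == 0]
    X_red = X[y == 1]
    plt.scatter(X_blue[0],X_blue[1],c='blue',edgecolor='k',s=50)
    plt.scatter(X_red[0],X_red[1],c='red',edgecolor='k',s=50)
    plt.xlim(-2.05,2.05)
    plt.ylim(-2.05,2.05)
    plt.grid(False)
    # Hide the x-axis ticks (recent matplotlib expects booleans, not 'off')
    plt.tick_params(
        axis='x',
        which='both',
        bottom=False,
        top=False)
    # Build a 300x300 grid over the plot area and predict the class of each point
    r = np.linspace(-2.1,2.1,300)
    s,t = np.meshgrid(r,r)
    s = np.reshape(s,(np.size(s),1))
    t = np.reshape(t,(np.size(t),1))
    h = np.concatenate((s,t),1)
    z = clf.predict(h)
    s = s.reshape((np.size(r),np.size(r)))
    t = t.reshape((np.size(r),np.size(r)))
    z = z.reshape((np.size(r),np.size(r)))
    # Shade the predicted regions and draw the decision boundary
    plt.contourf(s,t,z,colors = ['blue','red'],alpha = 0.2,levels = range(-1,2))
    if len(np.unique(z)) > 1:
        plt.contour(s,t,z,colors = 'k', linewidths = 2)
    plt.show()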
Visualizing the classification of the dataset:
plot_model(X, y, clf)
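Judging from the decision boundary above, the model is overfitting. No parameters were set when training, and with the hyperparameter max_depth=None the tree keeps splitting until it reaches the bottom-most leaf nodes (every leaf is pure or cannot be split any further), which is why it overfits.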
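The effect of leaving max_depth unset can be checked directly by comparing an unrestricted tree with a depth-limited one. The sketch below is a minimal illustration on synthetic data from `make_moons`, which is an assumption standing in for grid.csv; the depth limit of 3 is also just an example value.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the post's data (assumption, not grid.csv)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Unrestricted tree: grows until every leaf is pure
deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_tr, y_tr)
# Depth-limited tree: stops after at most 3 levels of splits
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, model in [("max_depth=None", deep), ("max_depth=3", shallow)]:
    print(name,
          "| depth:", model.get_depth(),
          "| train acc:", round(model.score(X_tr, y_tr), 3),
          "| test acc:", round(model.score(X_te, y_te), 3))
```

A training accuracy of 1.0 on noisy data is the usual sign that the unrestricted tree has memorized the training points rather than learned the underlying boundary.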
Visualizing the model complexity curve
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import validation_curve
def ModelComplexity(X, y):
    """ Calculates the performance of the model as model complexity increases.
    The training and validation scores are then plotted. """
    # Create 10 cross-validation sets for training and testing
    # (in sklearn.model_selection the first argument is n_splits, not the sample count)
    cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 42)
    # Vary the max_depth parameter from 1 to 10
    max_depth = np.arange(1,11)
    scorer = make_scorer(f1_score)
    # Calculate the training and testing scores
    train_scores, test_scores = validation_curve(DecisionTreeClassifier(), X, y,
        param_name = "max_depth", param_range = max_depth, cv = cv, scoring = scorer)
    # Find the mean and standard deviation for smoothing
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    # Plot the validation curve
    plt.figure(figsize=(7, 5))
    plt.title('Decision Tree Classifier Complexity Performance')
    plt.plot(max_depth, train_mean, 'o-', color = 'r', label = 'Training Score')
    plt.plot(max_depth, test_mean, 'o-', color = 'g', label = 'Validation Score')
    plt.fill_between(max_depth, train_mean - train_std,
        train_mean + train_std, alpha = 0.15, color = 'r')
    plt.fill_between(max_depth, test_mean - test_std,
        test_mean + test_std, alpha = 0.15, color = 'g')
    # Visual aesthetics
    plt.legend(loc = 'lower right')
    plt.xlabel('Maximum Depth')
    plt.ylabel('Score')
    plt.ylim([-0.05,1.05])
    plt.show()
ModelComplexity(X, y)
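The complexity curve above shows that the training and validation scores are closest at max_depth=4. To the right of that point the validation score trends downward: the training score is still very high, but the validation score has dropped, which means the model no longer fits held-out data well. In other words, it is overfitting.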
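Reading the best depth off the plot can also be done numerically. The sketch below shows one common way: take the depth with the highest mean validation score and report the train/validation gap at that depth. The helper name `pick_depth` is hypothetical, and the score arrays are made-up numbers for illustration only (they have the same shape as the output of `validation_curve`: one row per depth, one column per split); they are not results from grid.csv.

```python
import numpy as np


def pick_depth(depths, train_scores, val_scores):
    """Return (best depth, train/validation gap) by mean validation score."""
    train_mean = np.mean(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    best = int(np.argmax(val_mean))  # the first maximum wins on ties
    return int(depths[best]), float(train_mean[best] - val_mean[best])


# Made-up scores (assumption): rows are depths 1..6, columns are CV splits
depths = np.array([1, 2, 3, 4, 5, 6])
train_scores = np.array([[0.60, 0.62], [0.70, 0.72], [0.80, 0.80],
                         [0.86, 0.88], [0.94, 0.96], [1.00, 1.00]])
val_scores = np.array([[0.58, 0.60], [0.68, 0.70], [0.76, 0.78],
                       [0.84, 0.84], [0.80, 0.82], [0.74, 0.76]])

best_depth, gap = pick_depth(depths, train_scores, val_scores)
print("best depth:", best_depth, "| train/validation gap:", round(gap, 2))
```

In these illustrative numbers the training score keeps rising with depth while the validation score peaks and then falls, which is the same pattern the curve in this post shows.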
Next, grid search is used to find the best parameters. In this example the search covers three parameters: max_depth, min_samples_leaf and min_samples_split.
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV
clf = DecisionTreeClassifier(random_state=42)
scorer = make_scorer(f1_score)
# 5 x 5 x 5 = 125 parameter combinations are evaluated with cross-validation
parameters = {'max_depth':[2,4,6,8,10],'min_samples_leaf':[2,4,6,8,10], 'min_samples_split':[2,4,6,8,10]}
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
grid_obj.fit(X_train,y_train)
best_clf = grid_obj.best_estimator_
print(grid_obj.best_params_)
# With the default refit=True, best_estimator_ is already refit on the training
# set, so this extra fit is redundant but harmless
best_clf.fit(X_train,y_train)
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)
# f1_score expects (y_true, y_pred)
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))
plot_model(X, y, best_clf)
Above is the decision boundary produced by the best model found through grid search. The figure makes it easy to see that the separation is much better.
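Beyond the single best model, it can be useful to see how the other parameter combinations scored. After fitting, `GridSearchCV` exposes a `cv_results_` dictionary that loads directly into a pandas DataFrame. The sketch below is a minimal illustration on synthetic `make_moons` data with a smaller grid; both are assumptions chosen to keep the example self-contained, not the post's data or grid.

```python
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the post's data (assumption, not grid.csv)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=42)

# Smaller illustrative grid: 3 x 3 = 9 candidates
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [2, 4, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

# One row per candidate; rank_test_score == 1 marks the best mean score
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]].head(3)

print(len(results), "candidates were evaluated")
print(top.to_string(index=False))
print("best:", search.best_params_)
```

Looking at the top few rows shows whether the winner is clearly ahead or whether several combinations score almost the same, in which case the simpler model is often the safer choice.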
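Now let's look at diagrams of the decision tree's branches. Figure 1 is the tree before optimization (max_depth=None); Figure 2 is the best model found by grid search.
Figure 1: before optimization
Figure 2: the best model found by grid search
The code for these diagrams is in the program; please read through it yourself.
Finally, here is a comparison of the models before and after grid search. (The code for plotting the learning curves is in the source on GitHub; please download it and take a look: "網格搜索練習")
I was short on time, so this write-up is fairly rough. Feedback is very welcome, and I will keep improving it.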