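Task 5: Model Parameter Optimization

Model Parameter Optimization

Tune each model with grid search (using five-fold cross-validation during tuning), then evaluate the tuned models.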

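Grid Search

Grid search could just as well be called exhaustive search: it combines the candidate values of every parameter and picks the combination that performs best.
For example, take a model with two parameters, where parameter a has 3 candidate values and parameter b has 4. Listing every possibility gives a 3*4 table in which each cell is one grid point. The search loops over the cells one by one, which is where the name "grid search" comes from.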
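The 3*4 table described above can be enumerated in a few lines. This is only a minimal sketch using the standard library; the parameter names `a` and `b` and their candidate values are hypothetical placeholders, not values from any model in this post.

```python
from itertools import product

# Hypothetical candidate values: 3 options for a, 4 options for b.
param_grid = {"a": [1, 2, 3], "b": [10, 20, 30, 40]}


def expand_grid(grid):
    """Return every combination of the candidate values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


combinations = expand_grid(param_grid)
print(len(combinations))  # 3 * 4 = 12 cells in the grid
print(combinations[0])    # {'a': 1, 'b': 10}
```

A grid search then trains and scores the model once per cell (times the number of cross-validation folds) and keeps the best-scoring cell.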

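Five-Fold Cross-Validation

Five-fold cross-validation has nothing to do with a 50% discount (which is what "五折" also means in Chinese). It splits the training data into five parts of roughly equal size. Each part takes one turn as the validation set while the other four parts are used for training, so the model is trained and scored five times, and the five scores are averaged. The goal is a more reliable and stable estimate of model performance, one that does not depend on a single lucky or unlucky split.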
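To make the rotation of the validation fold concrete, here is a small pure-Python sketch. The helper names `k_fold_indices` and `cross_val_mean` are made up for this illustration, and the folds are contiguous and unshuffled for readability; scikit-learn's own splitters (which `GridSearchCV` uses when given `cv=5`) may stratify or shuffle instead.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists; each fold is validation exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


def cross_val_mean(score_fn, n_samples, k=5):
    """Average score_fn(train, validation) over the k folds."""
    scores = [score_fn(train, validation)
              for train, validation in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print("validation:", validation, "training size:", len(train))
```

In real use, `score_fn` would train a model on the `train` indices and return its score on the `validation` indices.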

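Tuning Results by Model

Logistic Regression Model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X_train and y_train are assumed to come from the earlier tasks (not shown here).
# The l1 penalty needs a solver that supports it (liblinear or saga);
# the default lbfgs solver in recent scikit-learn versions does not.
clf = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
model = clf.fit(X_train, y_train)  # fit() returns clf itself
# Parameter grid: C is the regularization hyperparameter of logistic regression and
# takes each value in the list; likewise penalty is tried with both l1 and l2
params = {'C': [0.01, 0.1, 0.5, 1], 'penalty': ['l1', 'l2']}
# Tune the model with grid search; candidates are scored by roc_auc
grid_search = GridSearchCV(model, param_grid=params, scoring='roc_auc', cv=5)
grid_search.fit(X_train, y_train)
print("Best parameter combination:", grid_search.best_params_)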

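Results:
Scores of the tuned model
Best parameter combination: {'penalty': 'l1', 'C': 0.1}
accuracy:
Training set: 0.7959
Test set: 0.8059
precision:
Training set: 0.7158
Test set: 0.7436
recall:
Training set: 0.3179
Test set: 0.3286
F1-score:
Training set: 0.4402
Test set: 0.4558
AUC:
Training set: 0.8011
Test set: 0.8013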
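The post reports five metrics on the training and test sets but does not show how they were computed. The following is one possible way to do it, not the author's actual evaluation code. The data is a synthetic stand-in generated with `make_classification`, and the helper `evaluate` is a hypothetical name, so the printed numbers will not match the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); the post's real dataset is not shown.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

params = {"C": [0.01, 0.1, 0.5, 1], "penalty": ["l1", "l2"]}
grid_search = GridSearchCV(LogisticRegression(solver="liblinear"),
                           param_grid=params, scoring="roc_auc", cv=5)
grid_search.fit(X_train, y_train)
# With the default refit=True, this is the best candidate retrained on all of X_train.
best_model = grid_search.best_estimator_


def evaluate(model, X, y):
    """Return the five metrics used in this post for one data split."""
    predicted = model.predict(X)
    positive_proba = model.predict_proba(X)[:, 1]  # AUC needs scores, not labels
    return {
        "accuracy": accuracy_score(y, predicted),
        "precision": precision_score(y, predicted),
        "recall": recall_score(y, predicted),
        "F1-score": f1_score(y, predicted),
        "AUC": roc_auc_score(y, positive_proba),
    }


train_scores = evaluate(best_model, X_train, y_train)
test_scores = evaluate(best_model, X_test, y_test)
for name in train_scores:
    print(f"{name}: training set {train_scores[name]:.4f}, "
          f"test set {test_scores[name]:.4f}")
```

The same `evaluate` helper would work for the other classifiers in this post, provided they expose `predict_proba` (for `SVC` that requires `probability=True`).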

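SVM Model

Code:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Use an SVM model; probability=True enables predict_proba at extra training cost
clf = SVC(C=0.01, kernel='linear', probability=True)
model = clf.fit(X_train, y_train)  # fit() returns clf itself
params = {'C': [0.01, 0.1, 0.5, 1, 10], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
# Tune the model with grid search; candidates are scored by roc_auc
grid_search = GridSearchCV(model, param_grid=params, scoring='roc_auc', cv=5)
grid_search.fit(X_train, y_train)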

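Note: when the kernel list in params includes 'precomputed', the SVM grid search fails with the error: X should be a square kernel matrix. The reason is that with kernel='precomputed' the model does not expect an ordinary data matrix (samples by features) but a matrix of pairwise similarities between samples (samples by samples).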
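To illustrate what kernel='precomputed' expects, here is a minimal sketch on synthetic data (an assumption; the post's data is not shown). It builds a linear-kernel Gram matrix by hand, so it should behave like kernel='linear' while showing the required shapes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), split by hand into 40 training and 20 test rows.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = X[:40], X[40:], y[:40], y[40:]

# Linear kernel computed by hand: entry (i, j) is the dot product of samples i and j.
gram_train = X_train @ X_train.T  # square: training samples vs. training samples
gram_test = X_test @ X_train.T    # test samples (rows) vs. training samples (columns)

clf = SVC(kernel="precomputed", C=0.01)
clf.fit(gram_train, y_train)
predictions = clf.predict(gram_test)
print(gram_train.shape, gram_test.shape, predictions.shape)  # (40, 40) (20, 40) (20,)
```

Because a plain grid search passes the same feature matrix to every candidate, 'precomputed' cannot simply be listed next to 'linear' or 'rbf' in params.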

Results:
Scores of the tuned model
Best parameter combination: {'C': 0.01, 'kernel': 'linear'}
accuracy:
Training set: 0.7860
Test set: 0.7912
precision:
Training set: 0.7783
Test set: 0.8022
recall:
Training set: 0.2131
Test set: 0.2068
F1-score:
Training set: 0.3346
Test set: 0.3288
AUC:
Training set: 0.8032
Test set: 0.8127

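Decision Tree Model

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Decision tree
clf = DecisionTreeClassifier(max_depth=5)
model = clf.fit(X_train, y_train.values)  # fit() returns clf itself
# A grid over many parameters takes a very long time to run, so for now
# only one parameter of the decision tree is tuned
params = {'max_depth': [5, 10, 50]}
# Tune the model with grid search; candidates are scored by roc_auc
grid_search = GridSearchCV(model, param_grid=params, scoring='roc_auc', cv=5)
grid_search.fit(X_train, y_train)
print('\nScores of the tuned model')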

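Results:
Scores of the tuned model
Best parameter combination: {'max_depth': 5}
accuracy:
Training set: 0.8058
Test set: 0.7624
precision:
Training set: 0.7310
Test set: 0.5455
recall:
Training set: 0.3655
Test set: 0.2380
F1-score:
Training set: 0.4873
Test set: 0.3314
AUC:
Training set: 0.8005
Test set: 0.7328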
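The cost of a grid search grows multiplicatively with every parameter added, which is why the decision tree grid was kept to a single parameter. The sketch below counts the model fits for the grid used here and for a larger grid. The larger grid is purely hypothetical (those candidate values are not from this post), and `count_fits` is a made-up helper name.

```python
from sklearn.model_selection import ParameterGrid

cv_folds = 5
small_grid = {"max_depth": [5, 10, 50]}
# Hypothetical larger grid; these extra candidate values are assumptions.
large_grid = {
    "max_depth": [5, 10, 50],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}


def count_fits(grid, folds):
    """Number of model fits during the search itself (excluding the final refit)."""
    return len(ParameterGrid(grid)) * folds


print(count_fits(small_grid, cv_folds))  # 3 * 5 = 15
print(count_fits(large_grid, cv_folds))  # 3 * 3 * 3 * 2 * 5 = 270
```

If a larger grid is needed, two options worth trying are passing `n_jobs=-1` to `GridSearchCV` to run fits in parallel, or switching to `RandomizedSearchCV`, which samples a fixed number of combinations instead of trying them all.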

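Random Forest

Code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(oob_score=True, n_estimators=10)
model = clf.fit(X_train, y_train.values)  # fit() returns clf itself
params = {'oob_score': [True, False], 'n_estimators': [10, 50, 200, 500]}
# Tune the model with grid search; candidates are scored by roc_auc
grid_search = GridSearchCV(model, param_grid=params, scoring='roc_auc', cv=5)
grid_search.fit(X_train, y_train)
print('\nScores of the tuned model')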

Results:
Scores of the tuned model
Best parameter combination: {'n_estimators': 500, 'oob_score': False}
accuracy:
Training set: 1.0000
Test set: 0.8003
precision:
Training set: 1.0000
Test set: 0.7698
recall:
Training set: 1.0000
Test set: 0.2748
F1-score:
Training set: 1.0000
Test set: 0.4050
AUC:
Training set: 1.0000
Test set: 0.7792
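One thing to keep in mind when reading the random forest grid: `oob_score` does not change how the trees are built. It only asks the forest to compute an extra out-of-bag accuracy estimate after fitting. The sketch below illustrates this on synthetic data (an assumption; the post's data is not shown), with `random_state` fixed so that both forests are built from the same random draws.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

with_oob = RandomForestClassifier(
    n_estimators=50, oob_score=True, random_state=0).fit(X, y)
without_oob = RandomForestClassifier(
    n_estimators=50, oob_score=False, random_state=0).fit(X, y)

# Out-of-bag estimate: each sample is scored only by trees that never saw it.
print("out-of-bag accuracy:", round(with_oob.oob_score_, 4))
# Same seed, same trees: the flag does not alter the predictions.
same_predictions = bool((with_oob.predict(X) == without_oob.predict(X)).all())
print("identical predictions:", same_predictions)
```

Under this reading, the choice between `oob_score=True` and `False` in the grid mostly reflects run-to-run randomness rather than a real difference between models. The out-of-bag estimate is also a useful sanity check next to a perfect training score like the 1.0000 values above, which together with the lower test scores suggests the forest is overfitting the training set.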

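Summary

Next step: compare these results against the models with their pre-tuning parameters to see how much the tuning actually helped.
Note: the XGBClassifier model did not run; it raised an error that is still being worked on…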
