XGBoost Hands-On Introduction: Using XGBoost with sklearn

"""
In this section
* Using xgboost together with sklearn
* Using a validation set to choose the best model
"""
from xgboost import XGBClassifier
# Loader for data in LibSVM format
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from matplotlib import pyplot
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

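# Load the data
"""
* scikit-learn supports several data formats, including the LibSVM format
* XGBoost provides a wrapper class so that its models can be treated like any other classifier or regressor in the scikit-learn framework
* The classification model in XGBoost is XGBClassifier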
"""
work_path = '../data/'
X_train, y_train = load_svmlight_file(work_path + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(work_path + 'agaricus.txt.test')

# print(X_train.shape)
# print(X_test.shape)
# (6513, 126)
# (1611, 126)
# print(X_train)

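# Train the model
num_round = 2   # number of boosting rounds (number of trees)
# Build the classifier; model parameters are passed to the constructor
bst = XGBClassifier(max_depth=2,
                    learning_rate=1,
                    n_estimators=num_round,
                    objective='binary:logistic')
# Plays the same role as the train function in the native xgboost API
bst.fit(X_train, y_train)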

# Check the model's accuracy on the training data
train_preds = bst.predict(X_train)
# predict() returns class labels, so round() leaves the values unchanged
train_predictions = [round(value) for value in train_preds]
train_accuracy = accuracy_score(y_train, train_predictions)
print("Train Accuracy: %.2f%%" % (train_accuracy * 100.0))       # 97.77%

# Evaluate on the test data
test_preds = bst.predict(X_test)
test_predictions = [round(value) for value in test_preds]
test_accuracy = accuracy_score(y_test, test_predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))     # 97.83%

"""
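Validation set
* So far we have checked model performance on both the train set and the test set
* In a real scenario the test data is unknown (its labels are not available), so how do we evaluate the model? -> use a validation set
* Validation set: hold out part of the train data so that it takes no part in training and is used only for validation; train the model on the remaining train data,
  then evaluate the trained model on the validation set. Performance on the validation set can be taken as an estimate of performance on unseen data,
  which lets us choose the best-performing model
* The code above trains on the entire training set; the code below splits that training set, then trains and validates the model
* In short: sklearn provides the train_test_split function for splitting the training data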
"""
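"""
To make the hold-out idea concrete, here is a minimal standard-library sketch of
splitting sample indices into a training part and a validation part.
holdout_split is a hypothetical helper written for this note; it only illustrates
the idea and is not how train_test_split is implemented.
"""
```python
import math
import random


def holdout_split(n_samples, test_size, seed):
    """Shuffle the indices 0..n_samples-1 and split them into (train_idx, validate_idx)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_validate = math.ceil(n_samples * test_size)  # size of the held-out part
    return indices[n_validate:], indices[:n_validate]


# 6513 is the number of training rows reported earlier in this script
train_idx, validate_idx = holdout_split(6513, 0.33, seed=7)
print(len(train_idx), len(validate_idx))
```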

# Build the data: training set + validation set
seed = 7
# Hold out roughly 1/3 of the training data as validation data
test_size = 0.33
# Split X_train into X_train_part and X_validate
# Split y_train into y_train_part and y_validate
X_train_part, X_validate, y_train_part, y_validate = train_test_split(X_train,
                                                                      y_train,
                                                                      test_size=test_size,
                                                                      random_state=seed)

bst_part = XGBClassifier(max_depth=2,
                         learning_rate=1,
                         n_estimators=num_round,
                         objective='binary:logistic')
bst_part.fit(X_train_part, y_train_part)

# Performance on the validation set
validate_preds = bst_part.predict(X_validate)
validate_predictions = [round(value) for value in validate_preds]
validate_accuracy = accuracy_score(y_validate, validate_predictions)
print("Validate Accuracy: %.2f%%" % (validate_accuracy * 100.0))     # 97.63%

"""
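Learning curves
* How the model's predictive performance changes as a training parameter varies, e.g. the number of samples or the number of boosting rounds
* Output and plot the model's evaluation data
* eval_set: the evaluation sets, given as a list of (X, y) pairs, so several sets can be evaluated at once
* eval_metric: the evaluation metrics
    * error: error rate
    * logloss: log loss; the better the model, the smaller the value
        This one is harder to grasp and worth digging into:
        https://www.cnblogs.com/klchang/p/9217551.html
        https://zhuanlan.zhihu.com/p/52100927
        https://cloud.tencent.com/developer/article/1165263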
"""
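"""
The two metrics can be sketched by hand for binary labels. The code below is a
plain-Python illustration, not XGBoost's implementation: error_rate and log_loss
are helper names chosen for this note, and the probabilities are made-up values.
It shows why logloss is more informative than error: both sets of predictions
classify every sample correctly, but the more confident ones get a lower log loss.
"""
```python
import math


def error_rate(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    wrong = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) != (y == 1))
    return wrong / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]     # made-up predicted probabilities
hesitant = [0.6, 0.4, 0.55, 0.45]    # same decisions, less confidence

print(error_rate(labels, confident), log_loss(labels, confident))
print(error_rate(labels, hesitant), log_loss(labels, hesitant))
```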

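num_round = 100     # number of boosting rounds, used for plotting the learning curves
# Recent XGBoost releases take eval_metric in the constructor;
# passing it to fit() was deprecated and is no longer accepted in 2.0+.
bst = XGBClassifier(max_depth=2,
                    learning_rate=1,
                    n_estimators=num_round,
                    objective='binary:logistic',
                    eval_metric=["error", "logloss"])
eval_set = [(X_train_part, y_train_part), (X_validate, y_validate)]
bst.fit(X_train_part,
        y_train_part,
        eval_set=eval_set,
        verbose=True)   # log the evaluation metrics for every round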

# Show the learning curves
results = bst.evals_result()    # available because evaluation sets were supplied during training

epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# Log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Validation')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
# pyplot.show()

# Error rate
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Validation')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
# pyplot.show()
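
"""
Besides plotting, the evaluation history can be read programmatically. The sketch
below uses a hypothetical dictionary with the same nesting as bst.evals_result()
(evaluation set -> metric -> list of per-round values); the numbers are invented
for illustration. It finds the round with the lowest validation log loss and the
final gap between validation and training loss.
"""
```python
# Hypothetical history, shaped like the dictionary returned by evals_result()
history = {
    "validation_0": {"logloss": [0.40, 0.25, 0.15, 0.10, 0.07]},   # training part
    "validation_1": {"logloss": [0.42, 0.28, 0.20, 0.21, 0.23]},   # validation part
}


def best_round(history, eval_name, metric):
    """Index of the round with the smallest value of `metric` on `eval_name`."""
    values = history[eval_name][metric]
    return min(range(len(values)), key=values.__getitem__)


val_best = best_round(history, "validation_1", "logloss")
# A growing gap between validation and training loss is a typical overfitting signal
gap = history["validation_1"]["logloss"][-1] - history["validation_0"]["logloss"][-1]
print("best validation round:", val_best)
print("final validation-train gap: %.2f" % gap)
```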


"""
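Overfitting
* One of the trickiest obstacles in machine learning and an extremely important concept; guard against it the way you guard against fire and theft
* overfitting:
    * Definition: making a hypothesis overly strict in order to obtain one that is consistent with the training data is called overfitting.
    * How to recognize it: a hypothesis fits the training data better than other hypotheses, but does not fit data outside the training set well;
            in that case the hypothesis is considered to be overfitted. The main causes are noise in the training data or too little training data.
    * Personal understanding: during training we chase accuracy on the training data too hard, so the model's rules become "tailor-made" and do not generalize;
                the model may then do very well on the Train data but poorly on Test or other data. That is overfitting.
    * Remedies:
        1. Increase the amount of data
        2. Regularization: reduce the influence of "extreme" data points on the whole
    * Further reading:
        https://baike.baidu.com/item/%E8%BF%87%E6%8B%9F%E5%90%88
        https://www.jiqizhixin.com/articles/2019-02-13-13
        https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-02-A-overfitting/
        Along the way I found a great resource: "莫煩Python" on bilibili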
"""

"""
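Early stopping
* A way to prevent overfitting when training complex models
* Monitor the model's performance on the validation set and stop training once it has not improved for a fixed number of rounds
* Parameter: early_stopping_rounds=10 means training stops if performance does not improve over the next 10 rounds
bst.fit(X_train_part, y_train_part, early_stopping_rounds=10, eval_metric="error",
    eval_set=eval_set, verbose=True) prints the per-round results
  (XGBoost 2.0+ expects early_stopping_rounds and eval_metric in the XGBClassifier constructor instead of fit())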
"""
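
"""
The stopping rule itself can be sketched in a few lines of plain Python. This is
a simplified illustration, not XGBoost's implementation: early_stopping_round is
a hypothetical helper, `patience` plays the role of early_stopping_rounds, and the
error values are invented.
"""
```python
def early_stopping_round(val_errors, patience):
    """Return (stop_round, best_round) for a list of per-round validation errors.

    Training stops at the first round where the best error has not improved
    for `patience` consecutive rounds; otherwise it runs to the last round.
    """
    best_error = float("inf")
    best_round = -1
    for i, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_round = err, i
        elif i - best_round >= patience:
            return i, best_round
    return len(val_errors) - 1, best_round


errors = [0.30, 0.20, 0.15, 0.16, 0.15, 0.17, 0.18, 0.14]   # invented values
print(early_stopping_round(errors, patience=3))
print(early_stopping_round(errors, patience=10))
```

With a patience of 3 the loop stops before it ever sees the lower error in the last round, which is the trade-off of a small early_stopping_rounds value.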

"""
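Cross-validation
* A validation set built with train_test_split
    Pros: fast
    Cons: less data is left for training, so the performance estimate is less precise
* Cross-validation example:
    Take 3-fold cross-validation: split the train data into 3 equal parts,
    1. train on parts 1 and 2, validate on part 3
    2. train on parts 1 and 3, validate on part 2
    3. train on parts 2 and 3, validate on part 1
    Characteristics: better estimates, but slower
* If the classes are imbalanced or there are many classes, use StratifiedKFold, which keeps each class's proportion the same in every fold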
"""
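
"""
The 3-fold example above can be sketched with plain index lists. kfold_indices is
a hypothetical helper written for this note (no shuffling, contiguous folds); it
illustrates the idea and is not sklearn's KFold.
"""
```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each contiguous fold is used for validation once."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1          # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(9, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```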

kfold = KFold(n_splits=10, random_state=7, shuffle=True)
results = cross_val_score(bst, X_train, y_train, cv=kfold)
# print("CV Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
# print(results)

"""
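Parameter tuning (GridSearchCV)
* Based on the cross-validation results we can choose the model with the best parameters
* Given the ranges (grid) of the parameters to tune, evaluate the model for each parameter combination and report the best model and its parameters
* Full signature (from an older scikit-learn release; fit_params and iid have since been removed):
  GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1,
                    iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs',
                    error_score='raise', return_train_score=True)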
"""
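
"""
Conceptually, a grid search tries every combination in the grid and keeps the one
with the best score. The sketch below shows only that loop: grid_search and
toy_score are hypothetical names, and toy_score is an invented stand-in for a
cross-validated accuracy, not a real model evaluation.
"""
```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Invented score that peaks at n_estimators=30, max_depth=2."""
    return 1.0 - abs(params["n_estimators"] - 30) / 100 - abs(params["max_depth"] - 2) / 10


best_params, best_score = grid_search(
    {"n_estimators": range(10, 51, 10), "max_depth": [1, 2, 3]}, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the grid sizes (15 here), and GridSearchCV multiplies that by the number of folds, which is why wide grids get slow quickly.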
param_test = {
 'n_estimators': range(1, 51, 1)
}
clf = GridSearchCV(estimator=bst, param_grid=param_test, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)
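# Inspect the search: clf.param_grid, clf.cv_results_, clf.best_params_, clf.best_score_
# (grid_scores_ from older scikit-learn releases was replaced by cv_results_)
# Evaluate on the test set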
preds = clf.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy of gridsearchcv: %.2f%%" % (test_accuracy * 100.0))

"""
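Model evaluation summary
* k-fold cross-validation is usually the gold standard for evaluating machine learning models (k=3, 5, 10)
* When there are many classes or the classes are imbalanced, use stratified cross-validation
* When the training set is very large, so that a train/test split introduces little bias in the performance estimate, or when training is slow, use a train/test split
* For a given problem, look for a technique that is fast and still gives a reasonable performance estimate
* When in doubt, use 10-fold cross-validation for regression and stratified 10-fold cross-validation for classification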
"""
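
"""
The stratified variant mentioned in the summary can be sketched as follows.
stratified_fold_ids is a hypothetical helper that deals the samples of each class
round-robin across the folds, so every fold keeps the overall class ratio. It is
an illustration of the idea, not sklearn's StratifiedKFold.
"""
```python
from collections import defaultdict


def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold id so that every class is spread evenly across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_splits   # deal round-robin within the class
    return fold_of


labels = [0] * 8 + [1] * 4          # imbalanced toy labels: 8 negatives, 4 positives
fold_of = stratified_fold_ids(labels, n_splits=4)
for fold in range(4):
    members = [labels[i] for i, f in enumerate(fold_of) if f == fold]
    print(fold, members)
```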