early-stopping在xgboost的過擬合使用

原創

2019-01-02 17:13

https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/

baseline，劃分訓練集測試集

# monitor training performance
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model no training data
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

結果如下：
[89]   validation_0-error:0.204724
[90]   validation_0-error:0.208661
[91]   validation_0-error:0.208661
[92]   validation_0-error:0.208661
[93]   validation_0-error:0.208661
[94]   validation_0-error:0.208661
[95]   validation_0-error:0.212598
[96]   validation_0-error:0.204724
[97]   validation_0-error:0.212598
[98]   validation_0-error:0.216535
[99]   validation_0-error:0.220472 -----這是train的
Accuracy: 77.95% ----這是test的

2.可視化

# plot learning curve
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model no training data
model = XGBClassifier()
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# retrieve performance metrics
results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# plot log loss
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()
# plot classification error
fig, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()

重點：

1.eval_set把train和test都放進去了

所以result=model。evals_result有兩個對象

長這樣：{
'validation_0': {'error': [0.259843, 0.26378, 0.26378, ...]},
'validation_1': {'error': [0.22179, 0.202335, 0.196498, ...]}
}

所以results['validation_0]['error']代表train

2.可以再model.fit(eval_metric=["error", "logloss"])指定兩個衡量指標，那麼，對應字典也可以指定

results['validation_0']['logloss']

3.verbose=False關閉顯示迭代結果

簡單看一下，可以看出來在20-40之間test就不怎麼下降了

三採用earlystopping

# early stopping
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model no training data
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

early_stopping_rounds=10,代表在10個迭代內結果沒什麼改進就停止

結果：

...
[35]   validation_0-logloss:0.487962
[36]   validation_0-logloss:0.488218
[37]   validation_0-logloss:0.489582
[38]   validation_0-logloss:0.489334
[39]   validation_0-logloss:0.490969
[40]   validation_0-logloss:0.48978
[41]   validation_0-logloss:0.490704
[42]   validation_0-logloss:0.492369
Stopping. Best iteration:
[32]   validation_0-logloss:0.487297

建議：It is generally a good idea to select the early_stopping_rounds as a reasonable function of the total number of training epochs (10% in this case) or attempt to correspond to the period of inflection points as might be observed on plots of learning curves.

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

early-stopping在xgboost的過擬合使用

985 碩士程序員，空窗 4 個月沒有 Offer！

營銷系統黑名單優化：位圖的應用解析

我真的從測試轉成了開發......

nginx添加相應配置，通過瀏覽器訪問或curl時返回客戶端對應公網IP

賽博鬥地主——使用大語言模型扮演Agent智能體玩牌類遊戲。

python內置函數——sorted

[oeasy]python020在遊戲中體驗數值自由_勇闖地下城_終端文字遊戲

爲何我建議你學會抄代碼

一文搞懂 Spring 循環依賴

抖音面試：說說延遲任務的調度算法？

多個left join的疑問

異常檢測實戰

時間序列流程

python非參數檢驗

從組合中估計概率

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結