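What this post covers:
1. The difference between KFold and StratifiedKFold;
2. The LightGBM code workflow. The internals of LightGBM are not discussed.
3. Bayesian optimization of hyperparameters. This was covered in an earlier post, linked here:
"Bayesian Optimization: this one post is all you need, from the algorithm to a Python implementation"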
K-Fold vs StratifiedKFold
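I won't go into why K-Fold is used in the first place. If anyone is unsure, leave a comment (though I doubt that will bait any comments, haha).
The "Stratified" in StratifiedKFold means stratification, as in social strata. Suppose the data is split 4:1 into a training set and a validation set: with stratification, both parts keep the same proportion of each label as the original data. An example: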
import pandas as pd
from sklearn.model_selection import StratifiedKFold, KFold

x_train = pd.DataFrame({'x': [0,1,2,3,4,5,6,7,8,9], 'y': [0,1,0,1,0,1,0,1,0,1]})
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Each item is a (train_indices, validation_indices) pair
for i in kf.split(x_train.x, x_train.y):
    print(i)
Output (shown as a screenshot in the original post):
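I built a tiny dataset of ten samples with two label classes (odd and even, though that detail doesn't matter). The splitter divides the ten samples into five training/validation pairs, which means five models will be trained. The original ten samples contain five odd and five even values, so the class ratio is 1:1.
StratifiedKFold therefore guarantees that the class proportions in each validation set match those of the original data. In all five folds, both the validation set and the training set have a 1:1 ratio between the two classes.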
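To make the idea concrete, here is a minimal pure-Python sketch of stratification. It is not scikit-learn's implementation; `stratified_folds` is a hypothetical helper written only for illustration. It groups sample indices by class, shuffles each group, and deals them out to the folds round-robin, so every fold receives roughly the same share of each class.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed=0):
    """Assign every sample index to one validation fold, class by class.

    Hypothetical helper for illustration only, not sklearn's algorithm.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's samples round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]

labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
folds = stratified_folds(labels, n_splits=5)
for fold in folds:
    print(fold, dict(Counter(labels[i] for i in fold)))
```

With five samples per class and five folds, each validation fold ends up with exactly one sample of each class. When a class size is not divisible by the number of folds, this simple version gives the extra samples to the earlier folds.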
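KFold, by contrast, splits at random and makes no attempt to keep the class proportions equal. Here is the same ten-sample example with KFold (identical code, except that StratifiedKFold becomes KFold):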
import pandas as pd
from sklearn.model_selection import StratifiedKFold, KFold

x_train = pd.DataFrame({'x': [0,1,2,3,4,5,6,7,8,9], 'y': [0,1,0,1,0,1,0,1,0,1]})
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# KFold ignores the labels; y is passed only to keep the call identical
for i in kf.split(x_train.x, x_train.y):
    print(i)
Output (shown as a screenshot in the original post):
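As the output shows, the first validation set contains two even samples and the fourth contains two odd samples. The split is simply random.
Note that "two even" and "two odd" refer to the classes of the samples in the validation set, not to the two arrays in the output. The arrays are index values into the original data. For example, the first validation set is [2, 8], which means the samples at index 2 and index 8 were assigned to the validation set, not the samples whose feature value is 2 or 8.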
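Since the printed arrays are indices, it can be easier to compare the two splitters by counting the classes in each validation fold directly. The sketch below does that on the same ten-sample data; `val_class_counts` is a helper name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

x = np.arange(10)
y = np.array([0, 1] * 5)  # same alternating labels as the example above

def val_class_counts(splitter):
    """Return [count of class 0, count of class 1] for each validation fold."""
    return [np.bincount(y[val_idx], minlength=2).tolist()
            for _, val_idx in splitter.split(x, y)]

strat_counts = val_class_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
plain_counts = val_class_counts(
    KFold(n_splits=5, shuffle=True, random_state=0))

print("StratifiedKFold:", strat_counts)
print("KFold:          ", plain_counts)
```

Every StratifiedKFold entry is `[1, 1]`, while the KFold entries can be any mix that adds up to two samples per fold.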
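Of course, if the data is for a regression task rather than classification, how would StratifiedKFold be used? It can't be, because there are no class labels to stratify on, so just use KFold. For regression tasks, also keep any time ordering in the data in mind.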
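One way to respect time order is scikit-learn's `TimeSeriesSplit`. This is offered as an assumption about what "keeping time ordering in mind" could look like in practice, not something this post prescribes. The sketch assumes the rows are already sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight observations, assumed to be sorted from oldest to newest
X = np.arange(8).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in splits:
    # Every training index comes before every validation index
    print("train:", train_idx, "val:", val_idx)
```

Unlike a shuffled KFold, each validation block here lies strictly after its training block, so the model is never validated on data older than what it was trained on.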
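Python code
First load the data. This uses the breast cancer dataset bundled with sklearn: a little over 500 samples, 30 features per sample, and a binary classification target.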
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold,KFold
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris,load_breast_cancer
cancer = load_breast_cancer()
X_train,x_test,Y_train,y_test = train_test_split(cancer.data,cancer.target,test_size=0.2,random_state=3)
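The data is first split into a test set and a training set. The training set will later be split again into training and validation parts.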
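To see the two levels of splitting in numbers, the sketch below repeats the same split and then prints the sizes that a 5-fold StratifiedKFold produces on the training portion. It is only a size check, with the same `random_state` values as the rest of this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split

cancer = load_breast_cancer()
X_train, x_test, Y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=3)
print("all:", cancer.data.shape, "train:", X_train.shape, "test:", x_test.shape)

# Second level: the training portion is split again into train + validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(val_idx))
              for train_idx, val_idx in kf.split(X_train, Y_train)]
print("(train, validation) sizes per fold:", fold_sizes)
```

The test set is never touched by the fold loop. Each of the five models trains on about four fifths of `X_train` and is validated on the remaining fifth.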
Next, build a LightGBM model with an arbitrary set of hyperparameters:
from sklearn import metrics

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# kf = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = np.zeros(len(x_test))  # running average of test-set probabilities
for fold, (train_index, val_index) in enumerate(kf.split(X_train, Y_train)):
    x_train, x_val = X_train[train_index], X_train[val_index]
    y_train, y_val = Y_train[train_index], Y_train[val_index]
    train_set = lgb.Dataset(x_train, y_train)
    val_set = lgb.Dataset(x_val, y_val, reference=train_set)
    params = {
        'boosting_type': 'gbdt',
        'metric': ['binary_logloss', 'auc'],  # a list keeps the metric order fixed (a set does not)
        'objective': 'binary',  # regression, binary, multiclass
        # 'num_class': 2,
        'seed': 666,
        'num_leaves': 20,
        'learning_rate': 0.1,
        'max_depth': 10,
        'lambda_l1': 1,
        'lambda_l2': 1,
        'bagging_fraction': 0.9,
        'bagging_freq': 1,
        'colsample_bytree': 0.9,
        'verbose': -1,
    }
    # The number of trees is set by num_boost_round; n_estimators is an alias
    # for it, so it is not repeated in params. Early stopping and logging are
    # passed as callbacks, which LightGBM 4.x requires in place of the older
    # early_stopping_rounds / verbose_eval arguments.
    model = lgb.train(params, train_set, num_boost_round=5000,
                      valid_sets=[val_set],
                      callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])
    # Each fold's model contributes 1/n_splits of the averaged probability
    y_pred += model.predict(x_test, num_iteration=model.best_iteration) / kf.n_splits
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)
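The 0.9385 here is the accuracy on the test set. Next comes Bayesian hyperparameter tuning. The notions of validation set and test set may look a bit muddled below. That is because in a competition there is a score to submit, and the data behind it is the real test set, not one carved out of the training data. Here the objective function scores each candidate directly on x_test. No worries, just follow the code: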
def cv_lgm(num_leaves, max_depth, lambda_l1, lambda_l2, bagging_fraction, bagging_freq, colsample_bytree):
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    # kf = KFold(n_splits=5, shuffle=True, random_state=0)
    y_pred = np.zeros(len(x_test))
    for fold, (train_index, val_index) in enumerate(kf.split(X_train, Y_train)):
        x_train, x_val = X_train[train_index], X_train[val_index]
        y_train, y_val = Y_train[train_index], Y_train[val_index]
        train_set = lgb.Dataset(x_train, y_train)
        val_set = lgb.Dataset(x_val, y_val, reference=train_set)
        params = {
            'boosting_type': 'gbdt',
            'metric': ['binary_logloss', 'auc'],
            'objective': 'binary',  # regression, binary, multiclass
            # 'num_class': 2,
            'seed': 666,
            # The optimizer proposes floats, so integer parameters are cast
            'num_leaves': int(num_leaves),  # was 20 in the fixed-parameter run
            'learning_rate': 0.1,
            'max_depth': int(max_depth),
            'lambda_l1': lambda_l1,
            'lambda_l2': lambda_l2,
            'bagging_fraction': bagging_fraction,
            'bagging_freq': int(bagging_freq),
            'colsample_bytree': colsample_bytree,
        }
        # Callbacks replace early_stopping_rounds / verbose_eval in LightGBM 4.x
        model = lgb.train(params, train_set, num_boost_round=5000,
                          valid_sets=[val_set],
                          callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])
        y_pred += model.predict(x_test, num_iteration=model.best_iteration) / kf.n_splits
    y_pred = [1 if y > 0.5 else 0 for y in y_pred]
    # The value returned here is what the optimizer maximizes
    accuracy = metrics.accuracy_score(y_test, y_pred)
    return accuracy
from bayes_opt import BayesianOptimization

rf_bo = BayesianOptimization(
    cv_lgm,
    # Search bounds (lower, upper) for each argument of cv_lgm
    {'num_leaves': (10, 50),
     'max_depth': (5, 40),
     'lambda_l1': (0.1, 3),
     'lambda_l2': (0.1, 3),
     'bagging_fraction': (0.6, 1),
     'bagging_freq': (1, 5),
     'colsample_bytree': (0.6, 1),
     }
)
rf_bo.maximize()
The best hyperparameters found are: {'target': 0.9473684210526315,
'params': {'bagging_fraction': 0.9726062500743026,
'bagging_freq': 3.3964048376327938,
'colsample_bytree': 0.9350215025768822,
'lambda_l1': 0.5219868645323071,
'lambda_l2': 0.1006192480920821,
'max_depth': 32.69376913856991,
'num_leaves': 36.29712182920533}}
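Then plug these parameters back into LightGBM and retrain, and that's it. The best parameters can be retrieved with
rf_bo.max
Note that max is accessed without parentheses: in bayes_opt versions that return a result shaped like the one above, it is a property, not a method. The integer parameters (num_leaves, max_depth, bagging_freq) come back as floats and need the same int() cast that cv_lgm applies.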
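As a minimal sketch of that last step, the snippet below turns the result shown above into a parameter dictionary for LightGBM. The helper `to_lgb_params` and the `INT_PARAMS` tuple are names introduced here for illustration; the fixed settings are the ones `cv_lgm` hard-codes.

```python
# Result copied from the optimizer output shown in this post
best = {
    'target': 0.9473684210526315,
    'params': {
        'bagging_fraction': 0.9726062500743026,
        'bagging_freq': 3.3964048376327938,
        'colsample_bytree': 0.9350215025768822,
        'lambda_l1': 0.5219868645323071,
        'lambda_l2': 0.1006192480920821,
        'max_depth': 32.69376913856991,
        'num_leaves': 36.29712182920533,
    },
}

# Parameters that cv_lgm casts with int() before passing them to LightGBM
INT_PARAMS = ('num_leaves', 'max_depth', 'bagging_freq')

def to_lgb_params(bo_params, fixed=None):
    """Merge fixed settings with tuned ones, truncating integer parameters."""
    params = dict(fixed or {})
    for name, value in bo_params.items():
        params[name] = int(value) if name in INT_PARAMS else value
    return params

final_params = to_lgb_params(
    best['params'],
    fixed={'boosting_type': 'gbdt', 'objective': 'binary',
           'learning_rate': 0.1, 'seed': 666},
)
print(final_params)
```

Using `int()` here rather than rounding matters: it reproduces exactly the values the objective function trained with, so the retrained model matches the configuration that earned the reported score.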
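Still, as I said in the last sentence of that Bayesian optimization post, Bayesian hyperparameter tuning is, in a sense, also a kind of blind guessing. Most people doing competitions use it anyway, haha.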