1. Introduction
I have been meaning to tidy up the code I wrote a while back, organize the key points for future reference, and share it for discussion. I hope this write-up helps beginners move forward quickly, and I welcome any corrections so we can learn from each other. I will cover common methods for data-mining competitions in three parts: LightGBM, XGBoost, and an MLP implemented in Keras, showing how each handles binary classification, multiclass classification, and regression, with complete open-source Python code. This article covers the three tasks implemented with LightGBM. If you just want the source code, the link is at the end of the article.
2. Data Loading
The data here is a subset of features taken from the PPDai (拍拍貸) competition, with rows sampled at random into a training set and a test set. Categorical features such as gender and cell_province are simply re-encoded with a label encoder. The original data's label ranges from 0 to 32, i.e. 33 classes. For the binary task, rows whose label is 32 are mapped to 1 and all others to 0; for the regression task, the class values themselves are used as the prediction target. The code follows; gc.collect() is called to release memory that is no longer needed.
```python
import gc

import pandas as pd
from sklearn.preprocessing import LabelEncoder

## label-encode the categorical features
test_data['label'] = -1
data = pd.concat([train_data, test_data])
cate_feature = ['gender', 'cell_province', 'id_province', 'id_city', 'rate', 'term']
for item in cate_feature:
    data[item] = LabelEncoder().fit_transform(data[item])
train = data[data['label'] != -1]
test = data[data['label'] == -1]
## clean up the memory
del data, train_data, test_data
gc.collect()
## get the training features
del_feature = ['auditing_date', 'due_date', 'label']
features = [i for i in train.columns if i not in del_feature]
## convert the label to two classes
train['label'] = train['label'].apply(lambda x: 1 if x == 32 else 0)
train_x = train[features]
train_y = train['label'].values
test = test[features]
```
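As a side note, LightGBM can also handle categorical features natively instead of treating the label-encoded integers as ordinal values. A minimal sketch (not part of the original pipeline), passing the column names via `categorical_feature`:

```python
# Hedged sketch: let LightGBM treat the encoded columns as categorical
import lightgbm as lgb

trn_data = lgb.Dataset(train_x, label=train_y,
                       categorical_feature=cate_feature)
```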
3. Binary Classification Task
```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

num_round = 5000  # upper bound on boosting rounds; not given in the original, early stopping picks the best

params = {'num_leaves': 60,  # has a large impact on the result: bigger tends to be better, but too big overfits
          'min_data_in_leaf': 30,
          'objective': 'binary',  # objective function
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,  # fraction of features sampled for each tree
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,  # L1 regularization
          # 'lambda_l2': 0.001,  # L2 regularization
          "verbosity": -1,
          "nthread": -1,  # number of threads; -1 uses them all, and more threads run faster
          'metric': {'binary_logloss', 'auc'},  # evaluation metric(s)
          "random_state": 2019,  # random seed, keeps results reproducible across runs
          # 'device': 'gpu'  # if the GPU build of LightGBM is installed, this speeds up training
          }

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], ))
test_pred_prob = np.zeros((test.shape[0], ))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

## turn the averaged probabilities into hard 0/1 predictions
threshold = 0.5
result = (test_pred_prob > threshold).astype(int)
```
In the parameters above, the objective is `binary` and the metrics are `{'binary_logloss', 'auc'}`; adjust these as needed, and note that one or several evaluation metrics can be specified at once. `num_leaves` has a large impact on the final result, and setting it too high causes overfitting.
For model training, 5-fold cross-validation is used. Two splitters are common: StratifiedKFold and KFold. The key difference is that StratifiedKFold splits with stratified sampling, ensuring that the class proportions in each training and validation fold match those of the original dataset; in practice, it is worth testing both on your specific data.
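If you want the stratified variant, the only change is the splitter and passing the labels to `split()`; a minimal sketch:

```python
from sklearn.model_selection import StratifiedKFold

# StratifiedKFold needs the labels so each fold keeps the class ratios
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x, train_y)):
    ...  # same training code as above
```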
Finally, `feature_importance_df` collects the per-fold feature importances, which makes it easy to analyze which features matter.
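For example, a short follow-up sketch (not in the original code) that averages the importances across folds and ranks the features:

```python
# average the per-fold importances and list the top features
mean_importance = (feature_importance_df
                   .groupby("Feature")["importance"]
                   .mean()
                   .sort_values(ascending=False))
print(mean_importance.head(20))
```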
4. Multiclass Classification Task
```python
## for the multiclass task, train_y must hold the original 0-32 class labels,
## not the binarized label from section 3, e.g. train_y = train['label'].values
params = {'num_leaves': 60,
          'min_data_in_leaf': 30,
          'objective': 'multiclass',
          'num_class': 33,
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 15,
          'metric': 'multi_logloss',
          "random_state": 2019,
          # 'device': 'gpu'
          }

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], 33))
test_pred_prob = np.zeros((test.shape[0], 33))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

## the predicted class is the one with the highest averaged probability
result = np.argmax(test_pred_prob, axis=1)
```
The main differences from the binary task are the objective and evaluation functions, changed to `'multiclass'` and `'multi_logloss'` respectively; for a multiclass task you must also specify the number of classes via `'num_class'`.
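A hedged sketch (not in the original) for sanity-checking the out-of-fold predictions, assuming `train_y` holds the original 0-32 labels:

```python
from sklearn.metrics import accuracy_score, log_loss

# hard labels from the out-of-fold class probabilities
oof_label = np.argmax(prob_oof, axis=1)
print('oof accuracy: %.4f' % accuracy_score(train_y, oof_label))
print('oof multi_logloss: %.4f' % log_loss(train_y, prob_oof))
```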
5. Regression Task
```python
params = {'num_leaves': 38,
          'min_data_in_leaf': 50,
          'objective': 'regression',
          'max_depth': -1,
          'learning_rate': 0.02,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.7,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 4,
          'metric': 'mae',
          "random_state": 2019,
          # 'device': 'gpu'
          }


def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / (y_true))) * 100


def smape_func(preds, dtrain):
    # custom metric in lgb.train's feval format: (name, value, is_higher_better)
    label = dtrain.get_label()  # get_label() already returns a numpy array
    epsilon = 0.1
    summ = np.maximum(0.5 + epsilon, np.abs(label) + np.abs(preds) + epsilon)
    smape = np.mean(np.abs(label - preds) / summ) * 2
    return 'smape', float(smape), False
```
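Note that `smape_func` is defined but never wired into the training calls; to actually report it during training, pass it to `lgb.train` via `feval`. A sketch, reusing the names from the cross-validation loop below:

```python
# hedged sketch: report the custom SMAPE metric during training via feval
clf = lgb.train(params,
                trn_data,
                num_round,
                valid_sets=[trn_data, val_data],
                feval=smape_func,
                verbose_eval=200,
                early_stopping_rounds=200)
```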
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof = np.zeros(train_x.shape[0])
predictions = np.zeros(test.shape[0])

train_y = np.log1p(train_y)  # log-smooth the target

feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

## the out-of-fold scores below are computed in log space
print('mse %.6f' % mean_squared_error(train_y, oof))
print('mae %.6f' % mean_absolute_error(train_y, oof))

result = np.expm1(predictions)  # invert the log1p transform back to the original scale
```
In the regression task, a log transform (`log1p`) is applied to the target values; when the target spans a wide range, this log smoothing often gives a clear improvement.
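The transform pair is `np.log1p` on the way in and `np.expm1` on the way out; a tiny illustration:

```python
# log1p compresses a wide-ranging target; expm1 is its exact inverse
y = np.array([1.0, 10.0, 1000.0])
y_log = np.log1p(y)
assert np.allclose(np.expm1(y_log), y)
```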
Appendix:
The tables below explain the meaning and usage of the important parameters.
Control Parameters | Meaning | Usage
---|---|---
`max_depth` | maximum depth of a tree | when the model overfits, consider lowering `max_depth` first
`min_data_in_leaf` | minimum number of records a leaf may hold | default 20; raise it when overfitting
`feature_fraction` | e.g. 0.8 means 80% of the features are randomly selected in each iteration to build the tree | used when `boosting` is random forest
`bagging_fraction` | fraction of the data used in each iteration | speeds up training and reduces overfitting
`early_stopping_round` | training stops if a metric on the validation data has not improved in the last `early_stopping_round` rounds | speeds up experiments and avoids excessive iterations
`lambda` | specifies regularization | typically 0~1
`min_gain_to_split` | minimum gain required to make a split | controls the number of useful splits in a tree
`max_cat_group` | finds split points on group boundaries | for features with many categories, where finding split points easily overfits
Core Parameters | Meaning | Usage
---|---|---
`Task` | purpose of the data | `train` or `predict`
`application` | purpose of the model | `regression` for regression, `binary` for binary classification, `multiclass` for multiclass classification
`boosting` | algorithm to use | `gbdt`; `rf`: random forest; `dart`: Dropouts meet Multiple Additive Regression Trees; `goss`: Gradient-based One-Side Sampling
`num_boost_round` | number of boosting iterations | usually 100+
`learning_rate` | how much each tree's output contributes to the running estimate | commonly 0.1, 0.001, 0.003…
`num_leaves` | maximum number of leaves per tree | default 31
`device` | `cpu` or `gpu` |
`metric` | `mae`: mean absolute error; `mse`: mean squared error; `binary_logloss`: loss for binary classification; `multi_logloss`: loss for multiclass classification |
IO Parameters | Meaning
---|---
`max_bin` | maximum number of bins feature values will be bucketed into
`categorical_feature` | if `categorical_feature = 0,1,2`, then columns 0, 1, and 2 are treated as categorical variables
`ignore_column` | like `categorical_feature`, except the specified columns are ignored entirely rather than treated as categorical
`save_binary` | when true, the dataset is saved as a binary file, which speeds up reading the data next time
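For instance, a hedged sketch of `save_binary` (the file name is a placeholder): persist the constructed Dataset once, then load the binary file directly on later runs:

```python
# save the Dataset in LightGBM's binary format, then reload it quickly
lgb.Dataset(train_x, label=train_y).save_binary('train.bin')
trn_data = lgb.Dataset('train.bin')
```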
Parameter Tuning
Parameter | Meaning
---|---
`num_leaves` | should be <= 2^(max_depth); exceeding this leads to overfitting
`min_data_in_leaf` | setting it larger avoids growing overly deep trees but may cause underfitting; for large datasets, use hundreds or thousands
`max_depth` | also limits the depth of the tree
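The `num_leaves` rule of thumb above translates directly into code; a trivial sketch with placeholder values:

```python
# keep num_leaves at or below 2**max_depth to limit overfitting
max_depth = 7
num_leaves = min(60, 2 ** max_depth)
```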
The table below lists the parameters worth adjusting for each of three goals: faster speed, better accuracy, and controlling over-fitting.
Faster Speed | Better Accuracy | Over-fitting
---|---|---
set `max_bin` smaller | use a larger `max_bin` | use a smaller `max_bin`
 | use a larger `num_leaves` | use a smaller `num_leaves`
use `feature_fraction` for sub-sampling | | use `feature_fraction`
use `bagging_fraction` and `bagging_freq` | | set `bagging_fraction` and `bagging_freq`
 | use more training data | use more training data
use `save_binary` to speed up data loading | use categorical features directly | use `min_data_in_leaf` and `min_sum_hessian_in_leaf`
use parallel learning | use `dart` | use `lambda_l1`, `lambda_l2`, `min_gain_to_split` for regularization
 | use a larger `num_iterations` with a smaller `learning_rate` | use `max_depth` to limit tree depth
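To close, a hypothetical parameter set that leans on the over-fitting column of the table above; every value here is a placeholder to tune, not a recommendation:

```python
params_regularized = {
    'objective': 'binary',
    'max_depth': 6,               # explicitly limit tree depth
    'num_leaves': 40,             # smaller num_leaves (<= 2**max_depth)
    'min_data_in_leaf': 100,      # larger leaves resist noise
    'feature_fraction': 0.8,      # feature sub-sampling
    'bagging_fraction': 0.8,      # row sub-sampling
    'bagging_freq': 1,
    'lambda_l1': 0.1,             # L1 regularization
    'lambda_l2': 0.1,             # L2 regularization
    'min_gain_to_split': 0.01,    # require a minimum gain per split
}
```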