1. Introduction
I have been meaning to tidy up the code I wrote a while back, organize the key points for future reference, and share it for discussion. I hope this write-up helps beginners move forward quickly, and I welcome any corrections so we can learn from each other. I will cover common methods for data-mining competitions in three parts: LightGBM, XGBoost, and an MLP implemented in Keras, showing how each handles binary classification, multiclass classification, and regression, with complete open-source Python code. This article covers the three tasks implemented with LightGBM. If you just want the source code, the link is at the end of the article.
2. Data Loading
The data here is a subset of features taken from the PPDai (拍拍貸) competition, with rows sampled at random into a training set and a test set. Categorical features such as gender and cell_province are simply re-encoded with a label encoder. The original data's label ranges from 0 to 32, i.e. 33 classes. For the binary task, rows whose label is 32 are mapped to 1 and all others to 0; for the regression task, the class values themselves are used as the prediction target. The code follows; gc.collect() is called to release memory that is no longer needed.
```python
import gc

import pandas as pd
from sklearn.preprocessing import LabelEncoder

## label-encode the categorical features
test_data['label'] = -1
data = pd.concat([train_data, test_data])
cate_feature = ['gender', 'cell_province', 'id_province', 'id_city', 'rate', 'term']
for item in cate_feature:
    data[item] = LabelEncoder().fit_transform(data[item])
train = data[data['label'] != -1]
test = data[data['label'] == -1]
## clean up the memory
del data, train_data, test_data
gc.collect()
## get the training features
del_feature = ['auditing_date', 'due_date', 'label']
features = [i for i in train.columns if i not in del_feature]
## convert the label to two classes
train['label'] = train['label'].apply(lambda x: 1 if x == 32 else 0)
train_x = train[features]
train_y = train['label'].values
test = test[features]
```
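As a side note, LightGBM can also handle categorical features natively instead of treating the label-encoded integers as ordinal values. A minimal sketch (not part of the original pipeline), passing the column names via `categorical_feature`:

```python
# Hedged sketch: let LightGBM treat the encoded columns as categorical
import lightgbm as lgb

trn_data = lgb.Dataset(train_x, label=train_y,
                       categorical_feature=cate_feature)
```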
3. Binary Classification Task
```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

num_round = 5000  # upper bound on boosting rounds; not given in the original, early stopping picks the best

params = {'num_leaves': 60,  # has a large impact on the result: bigger tends to be better, but too big overfits
          'min_data_in_leaf': 30,
          'objective': 'binary',  # objective function
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,  # fraction of features sampled for each tree
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,  # L1 regularization
          # 'lambda_l2': 0.001,  # L2 regularization
          "verbosity": -1,
          "nthread": -1,  # number of threads; -1 uses them all, and more threads run faster
          'metric': {'binary_logloss', 'auc'},  # evaluation metric(s)
          "random_state": 2019,  # random seed, keeps results reproducible across runs
          # 'device': 'gpu'  # if the GPU build of LightGBM is installed, this speeds up training
          }

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], ))
test_pred_prob = np.zeros((test.shape[0], ))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

## turn the averaged probabilities into hard 0/1 predictions
threshold = 0.5
result = (test_pred_prob > threshold).astype(int)
```
In the parameters above, the objective is `binary` and the metrics are `{'binary_logloss', 'auc'}`; adjust these as needed, and note that one or several evaluation metrics can be specified at once. `num_leaves` has a large impact on the final result, and setting it too high causes overfitting.
For model training, 5-fold cross-validation is used. Two splitters are common: StratifiedKFold and KFold. The key difference is that StratifiedKFold splits with stratified sampling, ensuring that the class proportions in each training and validation fold match those of the original dataset; in practice, it is worth testing both on your specific data.
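If you want the stratified variant, the only change is the splitter and passing the labels to `split()`; a minimal sketch:

```python
from sklearn.model_selection import StratifiedKFold

# StratifiedKFold needs the labels so each fold keeps the class ratios
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x, train_y)):
    ...  # same training code as above
```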
Finally, `feature_importance_df` collects the per-fold feature importances, which makes it easy to analyze which features matter.
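For example, a short follow-up sketch (not in the original code) that averages the importances across folds and ranks the features:

```python
# average the per-fold importances and list the top features
mean_importance = (feature_importance_df
                   .groupby("Feature")["importance"]
                   .mean()
                   .sort_values(ascending=False))
print(mean_importance.head(20))
```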
4. Multiclass Classification Task
```python
## for the multiclass task, train_y must hold the original 0-32 class labels,
## not the binarized label from section 3, e.g. train_y = train['label'].values
params = {'num_leaves': 60,
          'min_data_in_leaf': 30,
          'objective': 'multiclass',
          'num_class': 33,
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 15,
          'metric': 'multi_logloss',
          "random_state": 2019,
          # 'device': 'gpu'
          }

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], 33))
test_pred_prob = np.zeros((test.shape[0], 33))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

## the predicted class is the one with the highest averaged probability
result = np.argmax(test_pred_prob, axis=1)
```
The main differences from the binary task are the objective and evaluation functions, changed to `'multiclass'` and `'multi_logloss'` respectively; for a multiclass task you must also specify the number of classes via `'num_class'`.
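A hedged sketch (not in the original) for sanity-checking the out-of-fold predictions, assuming `train_y` holds the original 0-32 labels:

```python
from sklearn.metrics import accuracy_score, log_loss

# hard labels from the out-of-fold class probabilities
oof_label = np.argmax(prob_oof, axis=1)
print('oof accuracy: %.4f' % accuracy_score(train_y, oof_label))
print('oof multi_logloss: %.4f' % log_loss(train_y, prob_oof))
```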
5. Regression Task
```python
params = {'num_leaves': 38,
          'min_data_in_leaf': 50,
          'objective': 'regression',
          'max_depth': -1,
          'learning_rate': 0.02,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.7,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 4,
          'metric': 'mae',
          "random_state": 2019,
          # 'device': 'gpu'
          }


def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / (y_true))) * 100


def smape_func(preds, dtrain):
    # custom metric in lgb.train's feval format: (name, value, is_higher_better)
    label = dtrain.get_label()  # get_label() already returns a numpy array
    epsilon = 0.1
    summ = np.maximum(0.5 + epsilon, np.abs(label) + np.abs(preds) + epsilon)
    smape = np.mean(np.abs(label - preds) / summ) * 2
    return 'smape', float(smape), False
```
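Note that `smape_func` is defined but never wired into the training calls; to actually report it during training, pass it to `lgb.train` via `feval`. A sketch, reusing the names from the cross-validation loop below:

```python
# hedged sketch: report the custom SMAPE metric during training via feval
clf = lgb.train(params,
                trn_data,
                num_round,
                valid_sets=[trn_data, val_data],
                feval=smape_func,
                verbose_eval=200,
                early_stopping_rounds=200)
```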
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof = np.zeros(train_x.shape[0])
predictions = np.zeros(test.shape[0])

train_y = np.log1p(train_y)  # log-smooth the target

feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

## the out-of-fold scores below are computed in log space
print('mse %.6f' % mean_squared_error(train_y, oof))
print('mae %.6f' % mean_absolute_error(train_y, oof))

result = np.expm1(predictions)  # invert the log1p transform back to the original scale
```
In the regression task, a log transform (`log1p`) is applied to the target values; when the target spans a wide range, this log smoothing often gives a clear improvement.
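The transform pair is `np.log1p` on the way in and `np.expm1` on the way out; a tiny illustration:

```python
# log1p compresses a wide-ranging target; expm1 is its exact inverse
y = np.array([1.0, 10.0, 1000.0])
y_log = np.log1p(y)
assert np.allclose(np.expm1(y_log), y)
```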
Appendix:
The tables below explain the meaning and usage of the important parameters.
Control Parameters | Meaning | Usage
---|---|---
`max_depth` | maximum depth of a tree | when the model overfits, consider lowering `max_depth` first
`min_data_in_leaf` | minimum number of records a leaf may hold | default 20; raise it when overfitting
`feature_fraction` | e.g. 0.8 means 80% of the features are randomly selected in each iteration to build the tree | used when `boosting` is random forest
`bagging_fraction` | fraction of the data used in each iteration | speeds up training and reduces overfitting
`early_stopping_round` | training stops if a metric on the validation data has not improved in the last `early_stopping_round` rounds | speeds up experiments and avoids excessive iterations
`lambda` | specifies regularization | typically 0~1
`min_gain_to_split` | minimum gain required to make a split | controls the number of useful splits in a tree
`max_cat_group` | finds split points on group boundaries | for features with many categories, where finding split points easily overfits
Core Parameters | Meaning | Usage
---|---|---
`Task` | purpose of the data | `train` or `predict`
`application` | purpose of the model | `regression` for regression, `binary` for binary classification, `multiclass` for multiclass classification
`boosting` | algorithm to use | `gbdt`; `rf`: random forest; `dart`: Dropouts meet Multiple Additive Regression Trees; `goss`: Gradient-based One-Side Sampling
`num_boost_round` | number of boosting iterations | usually 100+
`learning_rate` | how much each tree's output contributes to the running estimate | commonly 0.1, 0.001, 0.003…
`num_leaves` | maximum number of leaves per tree | default 31
`device` | `cpu` or `gpu` |
`metric` | `mae`: mean absolute error; `mse`: mean squared error; `binary_logloss`: loss for binary classification; `multi_logloss`: loss for multiclass classification |
IO Parameters | Meaning
---|---
`max_bin` | maximum number of bins feature values will be bucketed into
`categorical_feature` | if `categorical_feature = 0,1,2`, then columns 0, 1, and 2 are treated as categorical variables
`ignore_column` | like `categorical_feature`, except the specified columns are ignored entirely rather than treated as categorical
`save_binary` | when true, the dataset is saved as a binary file, which speeds up reading the data next time
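For instance, a hedged sketch of `save_binary` (the file name is a placeholder): persist the constructed Dataset once, then load the binary file directly on later runs:

```python
# save the Dataset in LightGBM's binary format, then reload it quickly
lgb.Dataset(train_x, label=train_y).save_binary('train.bin')
trn_data = lgb.Dataset('train.bin')
```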
Parameter Tuning
Parameter | Meaning
---|---
`num_leaves` | should be <= 2^(max_depth); exceeding this leads to overfitting
`min_data_in_leaf` | setting it larger avoids growing overly deep trees but may cause underfitting; for large datasets, use hundreds or thousands
`max_depth` | also limits the depth of the tree
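The `num_leaves` rule of thumb above translates directly into code; a trivial sketch with placeholder values:

```python
# keep num_leaves at or below 2**max_depth to limit overfitting
max_depth = 7
num_leaves = min(60, 2 ** max_depth)
```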
The table below lists the parameters worth adjusting for each of three goals: faster speed, better accuracy, and controlling over-fitting.
Faster Speed | Better Accuracy | Over-fitting
---|---|---
set `max_bin` smaller | use a larger `max_bin` | use a smaller `max_bin`
 | use a larger `num_leaves` | use a smaller `num_leaves`
use `feature_fraction` for sub-sampling | | use `feature_fraction`
use `bagging_fraction` and `bagging_freq` | | set `bagging_fraction` and `bagging_freq`
 | use more training data | use more training data
use `save_binary` to speed up data loading | use categorical features directly | use `min_data_in_leaf` and `min_sum_hessian_in_leaf`
use parallel learning | use `dart` | use `lambda_l1`, `lambda_l2`, `min_gain_to_split` for regularization
 | use a larger `num_iterations` with a smaller `learning_rate` | use `max_depth` to limit tree depth
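To close, a hypothetical parameter set that leans on the over-fitting column of the table above; every value here is a placeholder to tune, not a recommendation:

```python
params_regularized = {
    'objective': 'binary',
    'max_depth': 6,               # explicitly limit tree depth
    'num_leaves': 40,             # smaller num_leaves (<= 2**max_depth)
    'min_data_in_leaf': 100,      # larger leaves resist noise
    'feature_fraction': 0.8,      # feature sub-sampling
    'bagging_fraction': 0.8,      # row sub-sampling
    'bagging_freq': 1,
    'lambda_l1': 0.1,             # L1 regularization
    'lambda_l2': 0.1,             # L2 regularization
    'min_gain_to_split': 0.01,    # require a minimum gain per split
}
```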