LightGBM for Binary Classification, Multiclass Classification, and Regression (with Python Source Code)


1. Introduction

For a while I have wanted to organize the code I wrote some time ago, review the underlying ideas, and make it easy to look up later and to discuss with others. I hope this write-up helps beginners get up to speed quickly, and I welcome any corrections so we can learn from each other. I will cover three families of methods commonly used in data mining competitions: LightGBM, XGBoost, and an MLP implemented with Keras, showing how each handles binary classification, multiclass classification, and regression, with complete open-source Python code. This article focuses on the three tasks implemented with LightGBM. Source code: https://github.com/QLMX/data_mining_models

2. Data Loading

The data is a subset of features taken from the PPDai (拍拍貸) competition: 5,000 training samples and 3,000 test samples were selected at random. Categorical features such as gender and cell_province are simply re-encoded with a label encoder. The original labels range from 0 to 32, i.e. 33 classes. For the binary classification task, samples whose label is 32 are mapped to 1 and all other labels to 0; for the regression task, these class values are used directly as the prediction target. The code is shown below; gc is used to free memory that is no longer needed.

```python
# imports used throughout this post
import gc
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

## re-encode the categorical features (label encoding)
test_data['label'] = -1
data = pd.concat([train_data, test_data])
cate_feature = ['gender', 'cell_province', 'id_province', 'id_city', 'rate', 'term']
for item in cate_feature:
    data[item] = LabelEncoder().fit_transform(data[item])

train = data[data['label'] != -1]
test = data[data['label'] == -1]

## Clean up the memory
del data, train_data, test_data
gc.collect()

## get train feature
del_feature = ['auditing_date', 'due_date', 'label']
features = [i for i in train.columns if i not in del_feature]

## Convert the label to two categories
train['label'] = train['label'].apply(lambda x: 1 if x == 32 else 0)
train_x = train[features]
train_y = train['label'].values
test = test[features]
```
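
The snippet above assumes train_data and test_data already exist as pandas DataFrames; a minimal loading sketch could look like the following (the file names are hypothetical, not from the original repository):

```python
# hypothetical file names; the original repository ships its own data files
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
```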

3. Binary Classification

```python
params = {'num_leaves': 60,              # strongly affects the final score; larger is usually better, but too large overfits
          'min_data_in_leaf': 30,
          'objective': 'binary',         # objective function
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,       # fraction of features sampled for each tree
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,              # L1 regularization
          # 'lambda_l2': 0.001,          # L2 regularization
          "verbosity": -1,
          "nthread": -1,                 # number of threads; -1 uses all available threads
          'metric': {'binary_logloss', 'auc'},  # evaluation metrics
          "random_state": 2019,          # random seed, keeps results reproducible across runs
          # 'device': 'gpu'              # speeds up training if the GPU build of LightGBM is installed
          }

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], ))
test_pred_prob = np.zeros((test.shape[0], ))

## train and predict
## num_round (the number of boosting rounds) is assumed to be defined earlier in the full script
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

## turn the averaged probabilities into 0/1 predictions
threshold = 0.5
result = (test_pred_prob > threshold).astype(int)
```

In the parameters above, the objective is binary and the evaluation metrics are {'binary_logloss', 'auc'}. You can adjust the metrics as needed and specify one or several of them. num_leaves has a large impact on the final result; if it is set too high, the model will overfit.
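
If the built-in metrics are not enough, lgb.train also accepts a custom evaluation function through its feval argument. Below is a minimal sketch (my own addition, not part of the original post) that evaluates F1 at a 0.5 threshold; the function name and threshold are arbitrary choices:

```python
from sklearn.metrics import f1_score

def f1_eval(preds, dtrain):
    """Custom eval: F1 score at a fixed 0.5 threshold."""
    labels = dtrain.get_label()
    return 'f1', f1_score(labels, preds > 0.5), True  # True: higher is better

# passed alongside the usual arguments, e.g.:
# clf = lgb.train(params, trn_data, num_round,
#                 valid_sets=[trn_data, val_data], feval=f1_eval)
```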

For model training, 5-fold cross-validation is used. Two splitters are commonly used for this: StratifiedKFold and KFold. The main difference is that StratifiedKFold performs stratified sampling, so the class proportions in each training and validation split match those of the original dataset. In practice it is worth testing both on your specific data.
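
As a minimal sketch of what swapping in StratifiedKFold looks like (same loop body as above; only the splitter and the split call change):

```python
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
# StratifiedKFold needs the labels so it can keep class proportions equal in every fold
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x, train_y)):
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y[val_idx])
    # ... same training code as above ...
```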

Finally, feature_importance_df stores the per-fold feature importances of the model, which makes it easy to analyze which features matter most.
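
For example, one way to summarize it (my own sketch, not from the original code) is to average the importance of each feature across folds and sort:

```python
mean_importance = (feature_importance_df
                   .groupby("Feature")["importance"]
                   .mean()
                   .sort_values(ascending=False))
print(mean_importance.head(20))  # top-20 most important features
```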

4. Multiclass Classification

```python
params = {'num_leaves': 60,
          'min_data_in_leaf': 30,
          'objective': 'multiclass',
          'num_class': 33,
          'max_depth': -1,
          'learning_rate': 0.03,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.8,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 15,
          'metric': 'multi_logloss',
          "random_state": 2019,
          # 'device': 'gpu'
          }

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
prob_oof = np.zeros((train_x.shape[0], 33))
test_pred_prob = np.zeros((test.shape[0], 33))

## train and predict
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=20,
                    early_stopping_rounds=60)
    prob_oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    test_pred_prob += clf.predict(test[features], num_iteration=clf.best_iteration) / folds.n_splits

result = np.argmax(test_pred_prob, axis=1)
```

The main difference from the binary case is that the objective and the evaluation metric are changed to 'multiclass' and 'multi_logloss' respectively, and for a multiclass task the number of classes must also be specified via 'num_class'.
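
As a small follow-up sketch (my own addition, not in the original post), the out-of-fold probabilities can be turned into class predictions and scored, e.g. with accuracy:

```python
from sklearn.metrics import accuracy_score

oof_labels = np.argmax(prob_oof, axis=1)  # pick the most probable class for each sample
print('oof accuracy %.4f' % accuracy_score(train_y, oof_labels))
```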

5. Regression

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

params = {'num_leaves': 38,
          'min_data_in_leaf': 50,
          'objective': 'regression',
          'max_depth': -1,
          'learning_rate': 0.02,
          "min_sum_hessian_in_leaf": 6,
          "boosting": "gbdt",
          "feature_fraction": 0.9,
          "bagging_freq": 1,
          "bagging_fraction": 0.7,
          "bagging_seed": 11,
          "lambda_l1": 0.1,
          "verbosity": -1,
          "nthread": 4,
          'metric': 'mae',
          "random_state": 2019,
          # 'device': 'gpu'
          }

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / (y_true))) * 100

def smape_func(preds, dtrain):
    """Custom SMAPE metric, usable as feval in lgb.train."""
    label = dtrain.get_label()
    epsilon = 0.1
    summ = np.maximum(0.5 + epsilon, np.abs(label) + np.abs(preds) + epsilon)
    smape = np.mean(np.abs(label - preds) / summ) * 2
    return 'smape', float(smape), False

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof = np.zeros(train_x.shape[0])
predictions = np.zeros(test.shape[0])

train_y = np.log1p(train_y)  # data smoothing: compress the target range
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    print("fold {}".format(fold_ + 1))
    trn_data = lgb.Dataset(train_x.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_x.iloc[val_idx], label=train_y.iloc[val_idx])

    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=200,
                    early_stopping_rounds=200)
    oof[val_idx] = clf.predict(train_x.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits

print('mse %.6f' % mean_squared_error(train_y, oof))
print('mae %.6f' % mean_absolute_error(train_y, oof))

result = np.expm1(predictions)  # invert the log1p transform applied to the target
# result = predictions          # use this instead if no log1p smoothing was applied
```

In the regression task, a log1p transform is applied to the target values. When the target spans a very wide range, this kind of log smoothing often gives a noticeable improvement; remember to apply expm1 to the predictions to map them back to the original scale.
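
A quick illustration of the round trip (my own example values, not from the original data):

```python
import numpy as np

y = np.array([1.0, 10.0, 1000.0, 100000.0])  # target spanning several orders of magnitude
y_log = np.log1p(y)       # ~[0.69, 2.40, 6.91, 11.51]: a much narrower range for the model to fit
y_back = np.expm1(y_log)  # recovers the original values (up to floating point error)
```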
