Building a Simple Scorecard with LightGBM

  LightGBM stands for "light" gradient boosting machine (GBM); compared with XGBoost it trains faster and uses less memory. I plan to write a separate post reviewing the optimizations lgb makes on top of xgb. This article focuses on how to build a scorecard with lgb and does not go into formula derivations. Since the basics of lgb overlap heavily with xgb, here is a link to an introductory tutorial.
A short introduction to using LightGBM: https://mathpretty.com/10649.html
  This article is based on my notes from Mei Zihang's practical course on financial risk control.

import pandas as pd
import numpy as np
import math
import time
import lightgbm as lgb
from sklearn.metrics import roc_curve
from sklearn import metrics

data = pd.read_csv('Acard.txt')

# Out-of-time split: hold out November 2018 as the final evaluation set
df_train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
val = data[data.obs_mth == '2018-11-30'].reset_index().copy()

lst = ['person_info','finance_info','credit_info','act_info','td_score','jxl_score','mj_score','rh_score']


  The variables are all numeric, so no preprocessing is needed. Because lgb is built from CART regression trees, the features it splits on must be numeric; categorical features are usually converted with one-hot or label encoding (LightGBM also offers native categorical handling via its categorical_feature parameter), as the sketch below illustrates.
  Next comes the out-of-time validation set: the November 2018 data is held out to evaluate how the model performs on later data.
  There are 8 modeling variables in total. Those ending in info are behavioral outputs from an unsupervised system; those ending in score are paid external credit-bureau scores.
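
  For illustration, a minimal sketch of both encodings on a hypothetical categorical column (the city field below is made up and is not part of Acard.txt):

from sklearn.preprocessing import LabelEncoder

# Toy frame; 'city' is a hypothetical categorical column
toy = pd.DataFrame({'city': ['bj', 'sh', 'bj', 'gz']})
toy['city_le'] = LabelEncoder().fit_transform(toy['city'])  # label encoding
toy_onehot = pd.get_dummies(toy['city'], prefix='city')     # one-hot encoding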

# Sort by observation month (descending) and give each row a percentile rank
df_train = df_train.sort_values(by='obs_mth', ascending=False)
df_train['rank'] = np.arange(1, len(df_train) + 1) / len(df_train)

# Bucket the percentile rank into 5 equal-sized folds labelled 1-5
pct_lst = []
for x in df_train['rank']:
    if x <= 0.2:
        x = 1
    elif x <= 0.4:
        x = 2
    elif x <= 0.6:
        x = 3
    elif x <= 0.8:
        x = 4
    else:
        x = 5
    pct_lst.append(x)
df_train['rank'] = pct_lst
df_train.head()

  The sample is split evenly into 5 folds and labelled accordingly so that cross-validation can be run during training (a quick sanity check follows below). One caveat: in machine learning, data is usually split into training, validation, and test sets, but in credit risk modeling the terms "validation set" and "test set" are used the other way around. The out-of-time validation set is really the test set, while the "test set" used in the cross-validation above is really the validation set. In short, the validation set is for tuning the model; the test set measures how well it generalizes.
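
  A quick sanity check that the five folds really are equal-sized (a sketch; df_train as built above):

# Each fold label should cover roughly 20% of the training rows
print(df_train['rank'].value_counts(normalize=True).sort_index())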

def LGB_test(train_x, train_y, test_x, test_y):
    from multiprocessing import cpu_count
    clf = lgb.LGBMClassifier(
        boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,
        max_depth=2, n_estimators=800, objective='binary',
        subsample=0.7, colsample_bytree=0.7, subsample_freq=1,
        learning_rate=0.05, min_child_weight=50,
        random_state=None, n_jobs=cpu_count() - 1,
    )
    # early_stopping_rounds works in lightgbm < 4.0; newer versions use
    # callbacks=[lgb.early_stopping(100)] instead
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (test_x, test_y)],
            eval_metric='auc', early_stopping_rounds=100)
    print(clf.n_features_)

    return clf, clf.best_score_['valid_1']['auc']

  This defines the lightgbm training function using the sklearn-style interface. As with xgb, lgb can be trained through two APIs (the native API and the sklearn wrapper), and many parameters are shared between them; a sketch of the native-API equivalent follows the list below. A few of the parameters:

'num_leaves': the number of leaves in one tree; default 31.
'reg_alpha': L1 regularization, default 0; in LightGBM's native API this parameter is called lambda_l1.
'objective': the loss function. The default is regression, i.e. mean-squared-error loss for regression problems; binary, used here, means a binary-classification task with log loss as the objective.
'min_child_weight': the minimum sum of instance weights (hessian) required in a child node; raising it restricts tree growth and helps control overfitting.
'subsample_freq': the bagging frequency, default 0 (bagging disabled); a value of k performs bagging every k iterations, so the 1 used here resamples at every iteration.
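
  For comparison, here is a minimal sketch of the same model through the native API, reusing the train_x/train_y/test_x/test_y arguments from LGB_test. The native parameter names are the documented aliases of the sklearn ones, and the call assumes a lightgbm version older than 4.0 that still accepts early_stopping_rounds:

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'max_depth': 2,
    'learning_rate': 0.05,
    'lambda_l1': 0.0,                # reg_alpha in the sklearn interface
    'lambda_l2': 1,                  # reg_lambda
    'bagging_fraction': 0.7,         # subsample
    'feature_fraction': 0.7,         # colsample_bytree
    'bagging_freq': 1,               # subsample_freq
    'min_sum_hessian_in_leaf': 50,   # min_child_weight
}
dtrain = lgb.Dataset(train_x, label=train_y)
dvalid = lgb.Dataset(test_x, label=test_y, reference=dtrain)
bst = lgb.train(params, dtrain, num_boost_round=800,
                valid_sets=[dtrain, dvalid], early_stopping_rounds=100)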

  n_features_ is the number of features in the model, here 8. best_score_ holds the model's best evaluation results as a nested dict keyed by eval-set name; LGB_test returns the AUC on the validation set (valid_1).
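
  To make the structure of best_score_ concrete, here is a placeholder illustration (the numbers below are made up, not results from this data):

# best_score_ is a dict of dicts keyed by eval-set name, e.g.:
best_score_example = {
    'training': {'auc': 0.85, 'binary_logloss': 0.07},
    'valid_1': {'auc': 0.79, 'binary_logloss': 0.09},
}
print(best_score_example['valid_1']['auc'])  # -> 0.79, what LGB_test returns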

feature_lst = {}
ks_train_lst = []
ks_test_lst = []
for rk in set(df_train['rank']):
    # Fold rk is held out for validation; the other four folds train the model
    ttest = df_train[df_train['rank'] == rk]
    ttrain = df_train[df_train['rank'] != rk]

    train = ttrain[lst]
    train_y = ttrain.bad_ind

    test = ttest[lst]
    test_y = ttest.bad_ind

    start = time.time()
    model, auc = LGB_test(train, train_y, test, test_y)
    end = time.time()

    # Store this fold's feature importances
    feature = pd.DataFrame(
                {'name': model.booster_.feature_name(),
                 'importance': model.feature_importances_
              }).sort_values(by=['importance'], ascending=False)

    # Compute KS and AUC on the training and validation folds
    y_pred_train_lgb = model.predict_proba(train)[:, 1]
    y_pred_test_lgb = model.predict_proba(test)[:, 1]

    train_fpr_lgb, train_tpr_lgb, _ = roc_curve(train_y, y_pred_train_lgb)
    test_fpr_lgb, test_tpr_lgb, _ = roc_curve(test_y, y_pred_test_lgb)

    train_ks = abs(train_fpr_lgb - train_tpr_lgb).max()
    test_ks = abs(test_fpr_lgb - test_tpr_lgb).max()

    train_auc = metrics.auc(train_fpr_lgb, train_tpr_lgb)
    test_auc = metrics.auc(test_fpr_lgb, test_tpr_lgb)

    ks_train_lst.append(train_ks)
    ks_test_lst.append(test_ks)

    # Keep only features with importance of at least 20 in this fold
    feature_lst[str(rk)] = feature[feature.importance >= 20].name

train_ks = np.mean(ks_train_lst)
test_ks = np.mean(ks_test_lst)

# Intersect the important features across all five folds
fn_lst = list(set(feature_lst['1']) & set(feature_lst['2'])
              & set(feature_lst['3']) & set(feature_lst['4']) & set(feature_lst['5']))

print('train_ks: ', train_ks)
print('test_ks: ', test_ks)

print('fn_lst: ', fn_lst)

  The KS reported here is the mean KS over the cross-validation folds. Features with an importance of at least 20 are collected in each fold, and taking the intersection across all five folds leaves the 4 most consistently important features. The model's booster_.feature_name() returns the feature names, and feature_importances_ holds their importance scores.
  Next, refit with these 4 variables and check the model's performance on the out-of-time validation (test) set.

lst = ['person_info','finance_info','credit_info','act_info']

# Refit on the full training window with the 4 selected features
train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
evl = data[data.obs_mth == '2018-11-30'].reset_index().copy()

x = train[lst]
y = train['bad_ind']

evl_x = evl[lst]
evl_y = evl['bad_ind']

model, auc = LGB_test(x, y, evl_x, evl_y)

y_pred = model.predict_proba(x)[:,1]
fpr_lgb_train,tpr_lgb_train,_ = roc_curve(y,y_pred)
train_ks = abs(fpr_lgb_train - tpr_lgb_train).max()
print('train_ks : ',train_ks)

y_pred = model.predict_proba(evl_x)[:,1]
fpr_lgb,tpr_lgb,_ = roc_curve(evl_y,y_pred)
evl_ks = abs(fpr_lgb - tpr_lgb).max()
print('evl_ks : ',evl_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lgb_train,tpr_lgb_train,label = 'train LGB')
plt.plot(fpr_lgb,tpr_lgb,label = 'evl LGB')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc = 'best')
plt.show()

  Finally, map the predicted probabilities to scores and generate a score report.

def score(xbeta):
    # Log-odds mapping: (1 - xbeta) / xbeta is P(good) / P(bad)
    score = 1000 - 500 * (math.log2((1 - xbeta) / xbeta))
    return score

evl['xbeta'] = model.predict_proba(evl_x)[:, 1]
evl['score'] = evl.apply(lambda x: score(x.xbeta), axis=1)
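
  Industry scorecards often parameterize this mapping with a base score and PDO (points to double the odds) instead of fixed constants. A minimal sketch, where base_score=650, base_odds=1/15 and pdo=50 are illustrative assumptions rather than values from the course:

def score_pdo(p_bad, base_score=650, base_odds=1/15, pdo=50):
    # score = A - B * ln(odds_bad); at base_odds the score equals base_score,
    # and each doubling of the odds of bad lowers the score by pdo points
    b = pdo / math.log(2)
    a = base_score + b * math.log(base_odds)
    return a - b * math.log(p_bad / (1 - p_bad))
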
# Generate the score report: sort by score, split into 20 bins, and track
# cumulative good/bad counts to compute KS per bin
bins = 20
Y_predict = evl['score']
Y = evl_y
nrows = Y.shape[0]
lis = [(Y_predict[i], Y[i]) for i in range(nrows)]
ks_lis = sorted(lis, key=lambda x: x[0], reverse=True)
bin_num = int(nrows/bins+1)
bad = sum([1 for (p, y) in ks_lis if y > 0.5])    # total bads
good = sum([1 for (p, y) in ks_lis if y <= 0.5])  # total goods
bad_cnt, good_cnt = 0, 0
KS = []
BAD = []
GOOD = []
BAD_CNT = []
GOOD_CNT = []
BAD_PCTG = []
BADRATE = []
dct_report = {}
for j in range(bins):
    ds = ks_lis[j*bin_num: min((j+1)*bin_num, nrows)]
    bad1 = sum([1 for (p, y) in ds if y > 0.5])
    good1 = sum([1 for (p, y) in ds if y <= 0.5])
    bad_cnt += bad1
    good_cnt += good1
    bad_pctg = round(bad_cnt/bad, 3)           # cumulative share of all bads
    badrate = round(bad1/(bad1+good1), 3)      # bad rate within this bin
    ks = round(math.fabs((bad_cnt / bad) - (good_cnt / good)), 3)
    KS.append(ks)
    BAD.append(bad1)
    GOOD.append(good1)
    BAD_CNT.append(bad_cnt)
    GOOD_CNT.append(good_cnt)
    BAD_PCTG.append(bad_pctg)
    BADRATE.append(badrate)
dct_report['KS'] = KS
dct_report['BAD'] = BAD
dct_report['GOOD'] = GOOD
dct_report['BAD_CNT'] = BAD_CNT
dct_report['GOOD_CNT'] = GOOD_CNT
dct_report['BAD_PCTG'] = BAD_PCTG
dct_report['BADRATE'] = BADRATE
val_report = pd.DataFrame(dct_report)
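
  As a quick consistency check (val_report as built above), the cumulative KS in the report peaks at the model's overall KS on the out-of-time set, so it should roughly match the evl_ks printed earlier, up to binning and rounding:

# The report's maximum cumulative KS approximates evl_ks from above
print('report KS: ', val_report['KS'].max())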

[Author]: Labryant
[Original WeChat official account]: 風控獵人 (Risk Control Hunter)
[About]: Strategy analyst at a startup, always pushing to improve. Nothing is settled yet; you and I can both be dark horses.
[Reposting]: Please credit the source when reposting. Thanks!
