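Variable binning is a key step in the scorecard modeling workflow, arguably its core. Well-chosen bins remove the effect of a variable's scale, dampen the influence of outliers and other noisy data, and reduce the risk of overfitting. Binning also makes the model interpretable in business terms.
The rest of this post implements the binning step of scorecard development.
First, split the variables into numerical and categorical ones. The points to watch when binning each type are: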
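- Categorical variables
  - With no more than 5 categories, no binning is needed;
  - With more than 5 categories, there are two options. First, if there are many categories, apply bad_rate encoding and treat the result as a numerical variable. Second, if there are not too many, reduce the cardinality by merging categories until no more than 5 remain.
- Numerical variables
  - Both unsupervised and supervised binning are available. Unsupervised methods include equal-frequency binning, equal-width binning, and clustering-based binning. Supervised methods include chi-square binning, optimal binning, and so on.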
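To make the difference between the two most common unsupervised methods concrete, here is a minimal sketch using pandas. The toy `income` series is hypothetical and is not part of the dataset used in this post; `pd.cut` and `pd.qcut` are swapped in for illustration and are not the binning method used later on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a right-skewed, income-like variable.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000), name="income")

# Equal-width binning: 4 intervals of identical length over the observed range.
equal_width = pd.cut(income, bins=4)

# Equal-frequency binning: 4 intervals holding roughly the same number of samples.
equal_freq = pd.qcut(income, q=4)

width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)
print(width_counts)
print(freq_counts)
```

On skewed data, equal-width bins tend to put most samples in the first intervals, while equal-frequency bins keep the bin sizes balanced at the cost of uneven interval lengths.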
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
num_features = ['int_rate_clean', 'emp_length_clean', 'annual_inc', 'dti', 'delinq_2yrs', 'earliest_cr_to_app',
                'inq_last_6mths',
                'mths_since_last_record_clean', 'mths_since_last_delinq_clean', 'open_acc', 'pub_rec', 'total_acc',
                'limit_income', 'earliest_cr_to_app']  # note: 'earliest_cr_to_app' is listed twice
cat_features = ['home_ownership', 'verification_status', 'desc_clean', 'purpose', 'zip_code', 'addr_state']
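In total there are 14 numerical variables and 6 categorical variables. Two of the categorical variables, 'zip_code' and 'addr_state', have many categories, so they are bad_rate-encoded and moved to the numerical variables. The other 4 categorical variables are binned on their own.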
def binning_cate(df,col_list,target):
    """
    df: dataset
    col_list: list of variables to bin
    target: name of the target column
    return:
    bin_df: list holding the binning result of each variable
    iv_value: list holding the IV of each variable
    """
    total = df[target].count()
    bad = df[target].sum()
    good = total-bad
    all_odds = good*1.0/bad  # overall good/bad odds
    bin_df =[]
    iv_value=[]
    for col in col_list:
        # each category is its own bin
        d1 = df.groupby([col],as_index=True)
        d2 = pd.DataFrame()
        d2['min_bin'] = d1[col].min()
        d2['max_bin'] = d1[col].max()
        d2['total'] = d1[target].count()
        d2['totalrate'] = d2['total']/total
        d2['bad'] = d1[target].sum()
        d2['badrate'] = d2['bad']/d2['total']
        d2['good'] = d2['total'] - d2['bad']
        d2['goodrate'] = d2['good']/d2['total']
        d2['badattr'] = d2['bad']/bad  # share of all bad samples falling in this bin
        d2['goodattr'] = (d2['total']-d2['bad'])/good  # share of all good samples falling in this bin
        d2['odds'] = d2['good']/d2['bad']
        # odds of the bin relative to the overall odds, tagged G (better) or B (worse)
        GB_list=[]
        for i in d2.odds:
            if i>=all_odds:
                GB_index = str(round((i/all_odds)*100,0))+str('G')
            else:
                GB_index = str(round((all_odds/i)*100,0))+str('B')
            GB_list.append(GB_index)
        d2['GB_index'] = GB_list
        d2['woe'] = np.log(d2['badattr']/d2['goodattr'])
        d2['bin_iv'] = (d2['badattr']-d2['goodattr'])*d2['woe']
        d2['IV'] = d2['bin_iv'].sum()
        iv = d2['bin_iv'].sum().round(3)
        print('Variable: {}'.format(col))
        print('IV: {}'.format(iv))
        print('\t')
        bin_df.append(d2)
        iv_value.append(iv)
    return bin_df,iv_value
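Note: if one bin of a categorical variable contains only good samples or only bad samples, the WOE of that bin becomes -inf or inf and the IV of the variable becomes inf. In that case, reduce the variable's cardinality or re-bin it.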
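The following sketch reproduces that situation on a tiny, hypothetical sample and shows one possible fix, merging the pure category into another one. The helper `woe_table` and the column names are made up for this illustration; it only reuses the WOE and IV formulas from `binning_cate`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample: category "C" contains no bad samples (y == 1 means bad).
toy = pd.DataFrame({
    "grade": ["A"] * 4 + ["B"] * 4 + ["C"] * 2,
    "y":     [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

def woe_table(df, col, target):
    """Per-category WOE and IV contribution, same formulas as binning_cate."""
    g = df.groupby(col)[target].agg(total="count", bad="sum")
    g["good"] = g["total"] - g["bad"]
    g["badattr"] = g["bad"] / g["bad"].sum()
    g["goodattr"] = g["good"] / g["good"].sum()
    with np.errstate(divide="ignore"):  # log(0) is expected for a pure bin
        g["woe"] = np.log(g["badattr"] / g["goodattr"])
    g["bin_iv"] = (g["badattr"] - g["goodattr"]) * g["woe"]
    return g

before = woe_table(toy, "grade", "y")
# For "C": badattr = 0, so woe = log(0) = -inf and bin_iv = (0 - goodattr) * (-inf) = +inf.

# One possible fix (an assumption, not the author's rule): merge "C" into "B".
toy["grade_merged"] = toy["grade"].replace({"C": "B"})
after = woe_table(toy, "grade_merged", "y")
print(before[["bad", "good", "woe", "bin_iv"]])
print(after[["bad", "good", "woe", "bin_iv"]])
```

Which category to merge into is a business decision; merging categories with similar bad rates is a common choice.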
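Next, look at the details of each bin.
As a rule of thumb, a variable with an IV above 0.01 can be used in the model. The IV should not be too high, though: a very high IV means the variable is so predictive on its own that it is better pulled out as a standalone policy rule. Scorecard variables are best kept weak. Likewise, the absolute WOE of a bin should preferably not exceed 1: a value beyond 1 means the bin is dominated by one class (roughly 65% or more of its samples are good, or bad), so the bin could serve as a rule by itself.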
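For reference, the WOE and IV computed by `binning_cate` can be written out as below. The symbols are notation introduced here, not names from the code: $B_i$ and $G_i$ are the bad and good counts in bin $i$, and $B$ and $G$ are the total bad and good counts (so $B_i/B$ is `badattr` and $G_i/G$ is `goodattr`).

```latex
\mathrm{WOE}_i = \ln\frac{B_i / B}{G_i / G},
\qquad
\mathrm{IV} = \sum_i \left(\frac{B_i}{B} - \frac{G_i}{G}\right)\mathrm{WOE}_i
```

With this sign convention, $\mathrm{WOE}_i > 1$ is equivalent to $\frac{B_i/B}{G_i/G} > e \approx 2.72$: the bin captures a share of the bad samples more than 2.72 times its share of the good samples. Symmetrically, $\mathrm{WOE}_i < -1$ means the bin is that much richer in good samples.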
Next, visualize the binning results with bar charts.
# WOE visualization
def plot_woe(bin_df,hspace=0.4,wspace=0.4,plt_size=None,plt_num=None,x=None,y=None):
    """
    bin_df: list holding the binning result of each variable
    hspace: spacing between subplots (y direction)
    wspace: spacing between subplots (x direction)
    plt_size: figure size
    plt_num: number of subplots
    x: number of rows in the subplot grid
    y: number of columns in the subplot grid
    return: WOE trend plot for each variable
    """
    plt.figure(figsize=plt_size)
    plt.subplots_adjust(hspace=hspace,wspace=wspace)
    for i,df in zip(range(1,plt_num+1,1),bin_df):
        col_name = df.index.name
        df = df.reset_index()
        plt.subplot(x,y,i)
        plt.title(col_name)
        sns.barplot(data=df,x=col_name,y='woe')
        plt.xlabel('')
        plt.xticks(rotation=30)
    return plt.show()
plot_woe(bin_df_cat,hspace=0.4,wspace=0.4,plt_size=(15,8),plt_num=4,x=2,y=2)
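Next, apply bad_rate encoding to zip_code and addr_state. This maps each category of a variable to the bad-sample rate of that category, which turns the categorical variable into a numerical one.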
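The idea fits in a few lines of pandas. This is a minimal sketch on hypothetical toy data, not the implementation used in this post; it relies on the target being coded 0/1, so the mean of the target within a category equals its bad rate.

```python
import pandas as pd

# Hypothetical toy data; y == 1 marks a bad sample.
toy = pd.DataFrame({
    "addr_state": ["CA", "CA", "NY", "NY", "NY", "TX"],
    "y":          [1,    0,    0,    0,    1,    1],
})

# Bad rate per category = mean of the 0/1 target within the category.
bad_rate = toy.groupby("addr_state")["y"].mean()

# Replace each category with its bad rate.
toy["addr_state_br_encoding"] = toy["addr_state"].map(bad_rate)
print(toy)
```

A category that is absent from `bad_rate` maps to NaN, so when the same mapping is applied to new data, unseen categories need a fallback value.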
def BadRateEncoding(df, col, target):
    '''
    :param df: dataframe containing feature and target
    :param col: the feature that needs to be encoded with bad rate, usually categorical type
    :param target: good/bad indicator
    :return: the assigned bad rate to encode the categorical feature
    '''
    regroup = BinBadRate(df, col, target, grantRateIndicator=0)[1]
    br_dict = regroup[[col,'bad_rate']].set_index([col]).to_dict(orient='index')
    # flatten {category: {'bad_rate': rate}} into {category: rate}
    for k, v in br_dict.items():
        br_dict[k] = v['bad_rate']
    badRateEncoding = df[col].map(lambda x: br_dict[x])
    return {'encoding':badRateEncoding, 'bad_rate':br_dict}
def BinBadRate(df, col, target, grantRateIndicator=0):
    '''
    :param df: dataset for which the bad rate is computed
    :param col: feature for which the bad rate is computed
    :param target: good/bad label
    :param grantRateIndicator: 1 to also return the overall bad rate, 0 otherwise
    :return: bad rate of each bin, plus the overall bad rate (when grantRateIndicator==1)
    '''
    total = df.groupby([col])[target].count()
    total = pd.DataFrame({'total': total})
    bad = df.groupby([col])[target].sum()
    bad = pd.DataFrame({'bad': bad})
    regroup = total.merge(bad, left_index=True, right_index=True, how='left') # bad count and total count per bin
    regroup.reset_index(level=0, inplace=True)
    regroup['bad_rate'] = regroup.apply(lambda x: x.bad * 1.0 / x.total, axis=1) # add a bad-rate column
    dicts = dict(zip(regroup[col],regroup['bad_rate'])) # dict mapping each bin to its bad rate
    if grantRateIndicator==0:
        return (dicts, regroup)
    N = sum(regroup['total'])
    B = sum(regroup['bad'])
    overallRate = B * 1.0 / N
    return (dicts, regroup, overallRate)
# bad_rate-encode zip_code and addr_state
br_encoding_dict = {}
more_value_features=['zip_code','addr_state']
for col in more_value_features:
    br_encoding = BadRateEncoding(trainData, col, 'y')
    trainData[col + '_br_encoding'] = br_encoding['encoding']
    br_encoding_dict[col] = br_encoding['bad_rate']
    num_features.append(col + '_br_encoding')
bad_rate encoding produces two new columns. They are added to the numerical variables and go through chi-square binning together with them.
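Chi-square binning starts from fine-grained bins and repeatedly merges the adjacent pair with the smallest chi-square statistic, that is, the pair whose bad rates differ least. The sketch below shows one such selection step on hypothetical counts. Note that it uses the textbook ChiMerge statistic (expected counts from the pair's pooled bad rate, summed over both good and bad samples), which is not the same as the `cal_chi2` helper in this post: that helper takes its expected counts from the overall bad rate and sums over bad samples only. The name `pair_chi2` is made up for this example.

```python
import pandas as pd

# Hypothetical per-bin counts, already sorted by bin value.
bins = pd.DataFrame({
    "total": [100, 100, 100, 100],
    "bad":   [10, 12, 30, 32],
})

def pair_chi2(pair):
    """Textbook chi-square for two adjacent bins (assumes the pair is not pure)."""
    total = pair["total"]
    bad = pair["bad"]
    good = total - bad
    pooled_bad_rate = bad.sum() / total.sum()
    exp_bad = total * pooled_bad_rate          # expected bad count per bin
    exp_good = total * (1 - pooled_bad_rate)   # expected good count per bin
    chi = (bad - exp_bad) ** 2 / exp_bad + (good - exp_good) ** 2 / exp_good
    return float(chi.sum())

# Chi-square of every adjacent pair: (0,1), (1,2), (2,3).
chi = [pair_chi2(bins.iloc[i:i + 2]) for i in range(len(bins) - 1)]

# The pair with the smallest value is the next one to merge.
merge_at = chi.index(min(chi))
print(chi, merge_at)
```

Here bins 0 and 1 (bad rates 10% and 12%) and bins 2 and 3 (30% and 32%) are both near-identical pairs, while bins 1 and 2 differ sharply, so the middle pair has by far the largest statistic and is merged last.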
# Binning numerical variables
# First, use chi-square binning to get the cut-off points of each variable
def split_data(df,col,split_num):
    """
    df: original dataset
    col: variable to bin
    split_num: number of groups to split the sorted values into
    """
    df2 = df.copy()
    count = df2.shape[0] # total number of samples
    n = math.floor(count/split_num) # number of samples per group after an even split
    split_index = [i*n for i in range(1,split_num)] # positions of the split points
    values = sorted(list(df2[col])) # values of the variable in ascending order
    split_value = [values[i] for i in split_index] # values at the split points
    split_value = sorted(list(set(split_value))) # deduplicate and sort the split values
    return split_value
def assign_group(x,split_bin):
    """
    x: value of the variable
    split_bin: list of split points returned by split_data
    """
    n = len(split_bin)
    if x<=min(split_bin):
        return min(split_bin) # x at or below the smallest split point maps to the smallest split point
    elif x>max(split_bin): # x above the largest split point maps to a large sentinel value
        return 10e10
    else:
        for i in range(n-1):
            if split_bin[i]<x<=split_bin[i+1]: # x between two split points maps to the larger of the two
                return split_bin[i+1]
def bin_bad_rate(df,col,target,grantRateIndicator=0):
    """
    df: original dataset
    col: original variable, or the column it was mapped to
    target: name of the target column
    grantRateIndicator: whether to also return the overall default rate
    """
    total = df.groupby([col])[target].count()
    bad = df.groupby([col])[target].sum()
    total_df = pd.DataFrame({'total':total})
    bad_df = pd.DataFrame({'bad':bad})
    regroup = pd.merge(total_df,bad_df,left_index=True,right_index=True,how='left')
    regroup = regroup.reset_index()
    regroup['bad_rate'] = regroup['bad']/regroup['total'] # default rate of each group of col
    dict_bad = dict(zip(regroup[col],regroup['bad_rate'])) # same information as a dict
    if grantRateIndicator==0:
        return (dict_bad,regroup)
    total_all= df.shape[0]
    bad_all = df[target].sum()
    all_bad_rate = bad_all/total_all # overall default rate
    return (dict_bad,regroup,all_bad_rate)
def cal_chi2(df,all_bad_rate):
    """
    df: regroup returned by bin_bad_rate
    all_bad_rate: overall default rate returned by bin_bad_rate
    """
    df2 = df.copy()
    df2['expected'] = df2['total']*all_bad_rate # expected number of bad samples in each group
    combined = zip(df2['expected'],df2['bad']) # pairs of (expected, actual) bad counts per group
    chi = [(i[0]-i[1])**2/i[0] for i in combined] # chi-square term of each group (bad samples only)
    chi2 = sum(chi) # total chi-square value
    return chi2
def assign_bin(x,cutoffpoints):
    """
    x: value of the variable
    cutoffpoints: cut-off points of the bins
    """
    bin_num = len(cutoffpoints)+1 # number of bins
    if x<=cutoffpoints[0]: # x at or below the smallest cut-off point maps to Bin 0
        return 'Bin 0'
    elif x>cutoffpoints[-1]: # x above the largest cut-off point maps to Bin (bin_num-1)
        return 'Bin {}'.format(bin_num-1)
    else:
        for i in range(0,bin_num-1):
            if cutoffpoints[i]<x<=cutoffpoints[i+1]: # x between two cut-off points maps to Bin (i+1)
                return 'Bin {}'.format(i+1)
def ChiMerge(df,col,target,max_bin=5,min_binpct=0):
    col_unique = sorted(list(set(df[col]))) # sorted unique values of the variable
    n = len(col_unique) # number of unique values
    df2 = df.copy()
    if n>100: # with more than 100 unique values, coarsen them via split_data and assign_group
        split_col = split_data(df2,col,100) # this caps the number of distinct values at about 100
        df2['col_map'] = df2[col].map(lambda x:assign_group(x,split_col))
    else:
        df2['col_map'] = df2[col] # 100 unique values or fewer: no mapping needed
    # tuple of (dict_bad, regroup, all_bad_rate)
    (dict_bad,regroup,all_bad_rate) = bin_bad_rate(df2,'col_map',target,grantRateIndicator=1)
    col_map_unique = sorted(list(set(df2['col_map']))) # sorted unique values after mapping
    group_interval = [[i] for i in col_map_unique] # start with one single-value interval per unique value
    while (len(group_interval)>max_bin): # keep merging while there are more than max_bin intervals
        chi_list=[]
        for i in range(len(group_interval)-1):
            temp_group = group_interval[i]+group_interval[i+1] # candidate merged interval as a list, e.g. [1,3]
            chi_df = regroup[regroup['col_map'].isin(temp_group)]
            chi_value = cal_chi2(chi_df,all_bad_rate) # chi-square value of each adjacent pair
            chi_list.append(chi_value)
        best_combined = chi_list.index(min(chi_list)) # index of the smallest chi-square value
        # merge the adjacent pair with the smallest chi-square value
        group_interval[best_combined] = group_interval[best_combined]+group_interval[best_combined+1]
        # delete the right-hand interval of the merged pair
        del group_interval[best_combined+1]
    # sort the values inside each merged interval
    group_interval = [sorted(i) for i in group_interval]
    # the cut-off point of an interval is its largest value
    cutoffpoints = [max(i) for i in group_interval[:-1]]
    # check whether any bin contains only good samples or only bad samples
    df2['col_map_bin'] = df2['col_map'].apply(lambda x:assign_bin(x,cutoffpoints)) # map col_map to its bin
    # default rate of each bin
    (dict_bad,regroup) = bin_bad_rate(df2,'col_map_bin',target)
    # smallest and largest default rate
    [min_bad_rate,max_bad_rate] = [min(dict_bad.values()),max(dict_bad.values())]
    # a minimum of 0 means some bin has only good samples; a maximum of 1 means some bin has only bad samples
    while min_bad_rate==0 or max_bad_rate==1:
        bad01_index = regroup[regroup['bad_rate'].isin([0,1])].col_map_bin.tolist() # bins with default rate 0 or 1
        bad01_bin = bad01_index[0]
        if bad01_bin==max(regroup.col_map_bin):
            cutoffpoints = cutoffpoints[:-1] # bad01_bin is the last bin: drop the largest cut-off point
        elif bad01_bin==min(regroup.col_map_bin):
            cutoffpoints = cutoffpoints[1:] # bad01_bin is the first bin: drop the smallest cut-off point
        else:
            bad01_bin_index = list(regroup.col_map_bin).index(bad01_bin) # position of bad01_bin
            prev_bin = list(regroup.col_map_bin)[bad01_bin_index-1] # bin before bad01_bin
            df3 = df2[df2.col_map_bin.isin([prev_bin,bad01_bin])]
            (dict_bad,regroup1) = bin_bad_rate(df3,'col_map_bin',target)
            chi1 = cal_chi2(regroup1,all_bad_rate) # chi-square of the previous bin and bad01_bin
            later_bin = list(regroup.col_map_bin)[bad01_bin_index+1] # bin after bad01_bin
            df4 = df2[df2.col_map_bin.isin([later_bin,bad01_bin])]
            (dict_bad,regroup2) = bin_bad_rate(df4,'col_map_bin',target)
            chi2 = cal_chi2(regroup2,all_bad_rate) # chi-square of the next bin and bad01_bin
            if chi1<chi2: # chi1<chi2: merge with the previous bin by dropping its cut-off point
                del cutoffpoints[bad01_bin_index-1]
            else: # chi1>=chi2: merge with the next bin by dropping bad01_bin's own cut-off point
                del cutoffpoints[bad01_bin_index]
        # re-map col_map to bins and recompute the extreme default rates;
        # the loop stops once no bin has a default rate of 0 or 1
        df2['col_map_bin'] = df2['col_map'].apply(lambda x:assign_bin(x,cutoffpoints))
        (dict_bad,regroup) = bin_bad_rate(df2,'col_map_bin',target)
        [min_bad_rate,max_bad_rate] = [min(dict_bad.values()),max(dict_bad.values())]
    # check the smallest share of samples held by a bin
    if min_binpct>0:
        group_values = df2['col_map'].apply(lambda x:assign_bin(x,cutoffpoints))
        df2['col_map_bin'] = group_values # map col_map to its bin
        # share of samples in each bin, ordered by bin label so positions line up with cutoffpoints
        group_df = group_values.value_counts(normalize=True).sort_index().to_frame('bin_pct')
        min_pct = group_df.bin_pct.min() # smallest bin share
        while min_pct<min_binpct and len(cutoffpoints)>2: # loop while the smallest share is below min_binpct and more than 2 cut-off points remain
            # same logic as the check for bins with only good or only bad samples
            min_pct_index = group_df[group_df.bin_pct==min_pct].index.tolist()
            min_pct_bin = min_pct_index[0]
            if min_pct_bin == max(group_df.index):
                cutoffpoints=cutoffpoints[:-1]
            elif min_pct_bin == min(group_df.index):
                cutoffpoints=cutoffpoints[1:]
            else:
                minpct_bin_index = list(group_df.index).index(min_pct_bin)
                prev_pct_bin = list(group_df.index)[minpct_bin_index-1]
                df5 = df2[df2['col_map_bin'].isin([min_pct_bin,prev_pct_bin])]
                (dict_bad,regroup3) = bin_bad_rate(df5,'col_map_bin',target)
                chi3 = cal_chi2(regroup3,all_bad_rate)
                later_pct_bin = list(group_df.index)[minpct_bin_index+1]
                df6 = df2[df2['col_map_bin'].isin([min_pct_bin,later_pct_bin])]
                (dict_bad,regroup4) = bin_bad_rate(df6,'col_map_bin',target)
                chi4 = cal_chi2(regroup4,all_bad_rate)
                if chi3<chi4:
                    del cutoffpoints[minpct_bin_index-1]
                else:
                    del cutoffpoints[minpct_bin_index]
            # recompute the bin assignment and bin shares after dropping a cut-off point
            group_values = df2['col_map'].apply(lambda x:assign_bin(x,cutoffpoints))
            df2['col_map_bin'] = group_values
            group_df = group_values.value_counts(normalize=True).sort_index().to_frame('bin_pct')
            min_pct = group_df.bin_pct.min()
    return cutoffpoints
# Binning numerical variables (chi-square binning)
def binning_num(df,target,col_list,max_bin=5,min_binpct=0):
    """
    df: dataset
    target: name of the target column
    col_list: list of variables to bin
    max_bin: maximum number of bins
    min_binpct: minimum share of all samples that a bin must hold
    return:
    bin_df: list holding the binning result of each variable
    iv_value: list holding the IV of each variable
    """
    total = df[target].count()
    bad = df[target].sum()
    good = total-bad
    all_odds = good/bad  # overall good/bad odds
    inf = float('inf')
    ninf = float('-inf')
    bin_df=[]
    iv_value=[]
    for col in col_list:
        # cut-off points from chi-square binning, padded with -inf and inf
        cut = ChiMerge(df,col,target,max_bin=max_bin,min_binpct=min_binpct)
        cut.insert(0,ninf)
        cut.append(inf)
        bucket = pd.cut(df[col],cut)
        d1 = df.groupby(bucket)
        d2 = pd.DataFrame()
        d2['min_bin'] = d1[col].min()
        d2['max_bin'] = d1[col].max()
        d2['total'] = d1[target].count()
        d2['totalrate'] = d2['total']/total
        d2['bad'] = d1[target].sum()
        d2['badrate'] = d2['bad']/d2['total']
        d2['good'] = d2['total'] - d2['bad']
        d2['goodrate'] = d2['good']/d2['total']
        d2['badattr'] = d2['bad']/bad  # share of all bad samples falling in this bin
        d2['goodattr'] = (d2['total']-d2['bad'])/good  # share of all good samples falling in this bin
        d2['odds'] = d2['good']/d2['bad']
        # odds of the bin relative to the overall odds, tagged G (better) or B (worse)
        GB_list=[]
        for i in d2.odds:
            if i>=all_odds:
                GB_index = str(round((i/all_odds)*100,0))+str('G')
            else:
                GB_index = str(round((all_odds/i)*100,0))+str('B')
            GB_list.append(GB_index)
        d2['GB_index'] = GB_list
        d2['woe'] = np.log(d2['badattr']/d2['goodattr'])
        d2['bin_iv'] = (d2['badattr']-d2['goodattr'])*d2['woe']
        d2['IV'] = d2['bin_iv'].sum()
        iv = d2['bin_iv'].sum().round(3)
        print('Variable: {}'.format(col))
        print('IV: {}'.format(iv))
        print('\t')
        bin_df.append(d2)
        iv_value.append(iv)
    return bin_df,iv_value
Next, look at the WOE plots for the numerical variables.
# WOE visualization (point plot, to show the trend across bins)
def plot_woe(bin_df,hspace=0.4,wspace=0.4,plt_size=None,plt_num=None,x=None,y=None):
    """
    bin_df: list holding the binning result of each variable
    hspace: spacing between subplots (y direction)
    wspace: spacing between subplots (x direction)
    plt_size: figure size
    plt_num: number of subplots
    x: number of rows in the subplot grid
    y: number of columns in the subplot grid
    return: WOE trend plot for each variable
    """
    plt.figure(figsize=plt_size)
    plt.subplots_adjust(hspace=hspace,wspace=wspace)
    for i,df in zip(range(1,plt_num+1,1),bin_df):
        col_name = df.index.name
        df = df.reset_index()
        plt.subplot(x,y,i)
        plt.title(col_name)
        sns.pointplot(data=df,x=col_name,y='woe')
        plt.xlabel('')
        plt.xticks(rotation=30)
    return plt.show()
plot_woe(bin_df_num,hspace=0.6,wspace=0.4,plt_size=(15,15),plt_num=16,x=4,y=4)
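A scorecard has to be interpretable, so the WOE should ideally be monotonic across the bins of a variable. For example, int_rate_clean is split into 4 bins whose WOE rises monotonically, so the scores it maps to rise monotonically as well, which makes the business logic of the scorecard easy to explain. If the WOE of a variable is not monotonic but the pattern can be explained in business terms, a U-shaped curve is acceptable too. A zigzag curve with several turns is not.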
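Reading the shape off a plot works for a handful of variables. For a longer list, the same check can be automated. The helper below is a sketch under that assumption. `woe_trend` and its labels are hypothetical names, and it expects the WOE values in bin order (for example the `woe` column of one binning result).

```python
import numpy as np
import pandas as pd

def woe_trend(woe):
    """Classify a sequence of per-bin WOE values (given in bin order)."""
    s = pd.Series(list(woe), dtype=float)
    if s.is_monotonic_increasing:
        return "increasing"
    if s.is_monotonic_decreasing:
        return "decreasing"
    # Count direction changes between consecutive non-zero steps.
    steps = s.diff().dropna()
    signs = np.sign(steps[steps != 0])
    turns = int((signs.diff().dropna() != 0).sum())
    return "u_shaped" if turns == 1 else "multiple_turns"

print(woe_trend([-0.5, -0.1, 0.2, 0.6]))   # monotonic
print(woe_trend([0.4, -0.2, 0.3]))         # one turn
print(woe_trend([0.1, 0.5, 0.2, 0.6]))     # zigzag
```

A single turn covers both U-shaped and inverted-U curves; whether such a shape is acceptable still depends on the business explanation.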
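Summary: variable binning comes down to observing how the values of each feature relate to the bad-sample rate. There are many binning methods, and the right one has to be chosen with the business logic in mind.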
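[Author]: Labryant
[Original WeChat official account]: 風控獵人
[About]: Strategy analyst at a startup, motivated and working hard to improve. Nothing is settled yet, and any of us could turn out to be the dark horse.
[Reposting]: Please credit the source when reposting. Thank you!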