Feature Tips: Getting to Grips with WOE and IV
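I first came across these two terms while building risk-control models, when our instructor taught us to use IV for variable selection. IV (Information Value) measures how strong a variable's predictive power is, and IV is in turn computed from WOE (Weight of Evidence). Let's set the theory aside for now and start with the conclusion.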
| IV range | Predictive power |
|---|---|
| < 0.02 | None 😯 |
| 0.02 to 0.10 | Weak 👎 |
| 0.10 to 0.30 | Medium 😊 |
| > 0.30 | Strong 👍 |
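For reference, thresholds like these are usually quoted for IV defined through WOE as written below. This is the common textbook form rather than something spelled out in the original post. The symbols $G_i$, $B_i$, $g_i$ and $b_i$ are notation introduced here, and the sign convention (good over bad) is chosen to match the `iv_count` function that follows.

```latex
g_i = \frac{G_i}{\sum_j G_j}, \qquad
b_i = \frac{B_i}{\sum_j B_j}, \qquad
\mathrm{WOE}_i = \ln\frac{g_i}{b_i}, \qquad
\mathrm{IV} = \sum_i \left(g_i - b_i\right)\mathrm{WOE}_i
```

Here $G_i$ and $B_i$ are the numbers of good and bad samples in bin $i$. Every term of the sum is non-negative, because $g_i - b_i$ and $\ln(g_i / b_i)$ always share the same sign. Note that `iv_count` calls `math.log(x, 2)`, so its result is $1/\ln 2 \approx 1.44$ times the natural-log value; keep that in mind when reading its output against the table.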
    import math

    def iv_count(data_bad, data_good):
        '''Compute the IV of one binned variable from its bad and good samples.'''
        value_list = set(data_bad.unique()) | set(data_good.unique())
        iv = 0
        len_bad = len(data_bad)
        len_good = len(data_good)
        for value in value_list:
            # If a bin has no samples in one class, count it as one sample
            # so the log ratio never becomes infinitely large or small
            if sum(data_bad == value) == 0:
                bad_rate = 1 / len_bad
            else:
                bad_rate = sum(data_bad == value) / len_bad
            if sum(data_good == value) == 0:
                good_rate = 1 / len_good
            else:
                good_rate = sum(data_good == value) / len_good
            # Log base 2 is used here, as in the original code
            iv += (good_rate - bad_rate) * math.log(good_rate / bad_rate, 2)
            print(value, iv)  # running IV total after each bin
        return iv
So how do we use it? Let's go step by step:
Step 1: Load the data
To get the test dataset, reply 'age' in the author's official-account backend.
    import pandas as pd

    data = pd.read_csv('./data/age.csv')

    # Define the required parameters
    feature = data.loc[:, ['age']]
    labels = data['target']
    keep_cols = ['age']
    cut_bin_dict = {'age': [0, 18, 25, 30, 40, 50, 100]}
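The `age.csv` file is not bundled with this post. If you only want to run the steps, a synthetic stand-in with the same two columns (`age`, `target`) can be generated as sketched below. Everything in it is made up for illustration: the age distribution, the 5% missing rate and the link between age and default are arbitrary assumptions, so the resulting IV will not reproduce the author's numbers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for age.csv (NOT the author's data; all numbers are arbitrary).
rng = np.random.default_rng(0)
n = 1000

# Ages between 18 and 79, stored as float so that missing values can be NaN.
age = rng.integers(18, 80, size=n).astype(float)
age[rng.random(n) < 0.05] = np.nan  # roughly 5% missing

# Assumed relation: the probability of being "bad" falls slowly with age,
# and missing ages get a fixed probability of 0.3.
p_bad = np.where(np.isnan(age), 0.3, np.clip(0.5 - 0.006 * age, 0.05, 0.5))
target = (rng.random(n) < p_bad).astype(int)  # 1 = bad, 0 = good

data = pd.DataFrame({'age': age, 'target': target})
print(data.head())
```

With `data` built this way, the `feature`, `labels`, `keep_cols` and `cut_bin_dict` definitions from Step 1 can be used unchanged.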
Step 2: Bin by the specified thresholds
Bin the data with the same binning logic we used earlier in Excel:
    cut_bin = cut_bin_dict['age']
    # Bin by the thresholds, replace missing values with 'Blank',
    # and split into bad and good samples
    data_bad = pd.cut(feature['age'], cut_bin, right=False).cat.add_categories(['Blank']).fillna('Blank')[labels == 1]
    data_good = pd.cut(feature['age'], cut_bin, right=False).cat.add_categories(['Blank']).fillna('Blank')[labels == 0]
    value_list = set(data_bad.unique()) | set(data_good.unique())
    value_list
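To see what `right=False` and the `'Blank'` category do in the binning step, here is a small sketch on a handful of made-up ages (the values are chosen only to hit the bin edges). With `right=False` each bin is closed on the left and open on the right, so 18 falls in `[18, 25)` rather than `[0, 18)`. Values outside the outermost edges, such as 100 here, come back from `pd.cut` as missing and therefore also end up in `'Blank'`.

```python
import numpy as np
import pandas as pd

# Made-up ages covering bin edges, a missing value and an out-of-range value.
ages = pd.Series([17, 18, 24, 25, 40, np.nan, 100])
cut_bin = [0, 18, 25, 30, 40, 50, 100]

binned = (
    pd.cut(ages, cut_bin, right=False)   # left-closed, right-open bins
    .cat.add_categories(['Blank'])       # allow 'Blank' as a valid category
    .fillna('Blank')                     # missing or out-of-range -> 'Blank'
)

# Show each age next to the bin it landed in.
for a, b in zip(ages, binned):
    print(a, '->', b)
```

If ages of 100 or above should be binned rather than treated as blank, the last threshold would need to be raised; that is a choice about the data, not something the original post specifies.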
Step 3: Call the function to compute IV
    iv_series = pd.Series(dtype=float)  # holds one IV per variable
    iv_series['age'] = iv_count(data_bad, data_good)
    iv_series
The steps above can be wrapped into a single function:

    import math
    import pandas as pd

    def get_iv_series(feature, labels, keep_cols=None, cut_bin_dict=None):
        '''
        Compute the IV of each variable.
        ------------------------------------------------------------
        Parameters:
        feature: feature columns of the dataset (DataFrame)
        labels: target column of the dataset (1 = bad, 0 = good)
        keep_cols: list of variables to compute IV for
        cut_bin_dict: dict of binning thresholds for numeric variables, in the form
            {'col1': [value1, value2, ...], 'col2': [value1, value2, ...], ...}
        ------------------------------------------------------------
        Returns:
        iv_series: IV of each variable
        '''
        def iv_count(data_bad, data_good):
            '''Compute the IV of one binned variable.'''
            value_list = set(data_bad.unique()) | set(data_good.unique())
            iv = 0
            len_bad = len(data_bad)
            len_good = len(data_good)
            for value in value_list:
                # If a bin has no samples in one class, count it as one sample
                # so the log ratio never becomes infinitely large or small
                if sum(data_bad == value) == 0:
                    bad_rate = 1 / len_bad
                else:
                    bad_rate = sum(data_bad == value) / len_bad
                if sum(data_good == value) == 0:
                    good_rate = 1 / len_good
                else:
                    good_rate = sum(data_good == value) / len_good
                iv += (good_rate - bad_rate) * math.log(good_rate / bad_rate, 2)
            return iv

        if keep_cols is None:
            keep_cols = sorted(list(feature.columns))
        col_types = feature[keep_cols].dtypes
        categorical_feature = list(col_types[col_types == 'object'].index)
        numerical_feature = list(col_types[col_types != 'object'].index)
        iv_series = pd.Series(dtype=float)

        # Compute IV for each numeric variable
        for col in numerical_feature:
            cut_bin = cut_bin_dict[col]
            # Bin by the thresholds, replace missing values with 'Blank',
            # and split into bad and good samples
            data_bad = pd.cut(feature[col], cut_bin, right=False).cat.add_categories(['Blank']).fillna('Blank')[labels == 1]
            data_good = pd.cut(feature[col], cut_bin, right=False).cat.add_categories(['Blank']).fillna('Blank')[labels == 0]
            iv_series[col] = iv_count(data_bad, data_good)

        # Compute IV for each categorical variable
        for col in categorical_feature:
            # Replace missing values with 'Blank' and split into bad and good samples
            data_bad = feature[col].fillna('Blank')[labels == 1]
            data_good = feature[col].fillna('Blank')[labels == 0]
            iv_series[col] = iv_count(data_bad, data_good)

        return iv_series
Example call:
    iv_series = get_iv_series(feature, labels, keep_cols, cut_bin_dict=cut_bin_dict)
    iv_series
    # age    0.434409
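A single IV number hides which bins drive it. As a rough sketch, the per-bin WOE and IV contributions can be laid out in a table like the one below. `woe_table` is a hypothetical helper written for this illustration, not part of the original post. It assumes labels of 1 for bad and 0 for good, keeps log base 2 and reuses the "treat an empty cell as one sample" substitution from `iv_count`. The toy data is made up so that the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

def woe_table(bins, labels, log_base=2.0):
    """Per-bin good/bad counts, WOE and IV contribution (hypothetical helper)."""
    df = pd.DataFrame({'bin': bins, 'bad': (labels == 1).astype(int)})
    table = df.groupby('bin', observed=True)['bad'].agg(bad='sum', total='count')
    table['good'] = table['total'] - table['bad']
    n_bad = table['bad'].sum()
    n_good = table['good'].sum()
    # Same substitution as iv_count: an empty cell is counted as one sample.
    bad_rate = table['bad'].clip(lower=1) / n_bad
    good_rate = table['good'].clip(lower=1) / n_good
    table['woe'] = np.log(good_rate / bad_rate) / np.log(log_base)
    table['iv_part'] = (good_rate - bad_rate) * table['woe']
    return table

# Toy data: 8 good and 8 bad samples spread over three bins.
bins = pd.Series(['A'] * 6 + ['B'] * 6 + ['C'] * 4)
labels = pd.Series([0, 0, 0, 0, 1, 1] + [0, 0, 1, 1, 1, 1] + [0, 0, 1, 1])

table = woe_table(bins, labels)
print(table)
print('IV =', table['iv_part'].sum())
```

In this toy case bin A holds half of the good samples and a quarter of the bad ones, so its WOE is log2(0.5 / 0.25) = 1 and its contribution is (0.5 - 0.25) * 1 = 0.25. Bin B mirrors it with WOE -1 and the same contribution, and bin C contributes nothing, giving a total IV of 0.5.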