A Thorough Look at WOE and IV

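Feature Tips: A Thorough Look at WOE and IV

I first came across these two terms while building risk-control models, when our instructor taught us to use IV for variable selection. IV (Information Value) is a metric that measures how strong a variable's predictive power is, and IV is in turn computed from WOE. Let's set the theory aside for now and start with the conclusion.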

IV range      Predictive power
< 0.02        None 😯
0.02 ~ 0.10   Weak 👎
0.10 ~ 0.30   Medium 😊
> 0.30        Strong 👍
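As a quick illustration, the rule of thumb in the table can be written as a small lookup. The function name `iv_strength` and the handling of values that sit exactly on a boundary are assumptions made for this sketch; the table itself does not say which side 0.02, 0.10 and 0.30 belong to.

```python
def iv_strength(iv):
    """Map an IV value to the rule-of-thumb label from the table."""
    # Assumption: each boundary value belongs to the stronger category
    if iv < 0.02:
        return 'none'
    if iv < 0.10:
        return 'weak'
    if iv < 0.30:
        return 'medium'
    return 'strong'


print(iv_strength(0.05))  # weak
```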

import math

def iv_count(data_bad, data_good):
    '''Compute the IV of one binned variable from its bad and good samples.'''
    value_list = set(data_bad.unique()) | set(data_good.unique())
    iv = 0
    len_bad = len(data_bad)
    len_good = len(data_good)
    for value in value_list:
        # If a bin has no bad (or no good) samples, count it as 1
        # so the ratio never becomes 0 or infinite
        n_bad = (data_bad == value).sum()
        n_good = (data_good == value).sum()
        bad_rate = max(n_bad, 1) / len_bad
        good_rate = max(n_good, 1) / len_good
        # Base-2 log, as in the original code; with the natural log that WOE
        # is usually defined with, the result is smaller by a factor of ln 2
        iv += (good_rate - bad_rate) * math.log(good_rate / bad_rate, 2)
        print(value, iv)  # running IV total after each bin
    return iv
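For reference, the quantity that the loop in iv_count accumulates can be written out as follows. The symbols are notation chosen for this note, not taken from the original post: G_i and B_i are the numbers of good and bad samples in bin i, and G and B are the total numbers of good and bad samples. The logarithm is base 2 because that is what the code passes to math.log.

```latex
\mathrm{WOE}_i = \log_2\!\left(\frac{G_i / G}{B_i / B}\right),
\qquad
\mathrm{IV} = \sum_i \left(\frac{G_i}{G} - \frac{B_i}{B}\right)\mathrm{WOE}_i
```

When G_i or B_i is zero, the code substitutes a count of 1 for it, so every term stays finite.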

 

So how do we use it? Let's go step by step:

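Step 1: Load the data

The test dataset can be obtained by replying 'age' in the account backend.

import pandas as pd

data = pd.read_csv('./data/age.csv')

# Define the required parameters
feature = data.loc[:,['age']]
labels = data['target']
keep_cols = ['age']
cut_bin_dict = {'age':[0,18,25,30,40,50,100]}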
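If age.csv is not at hand, a made-up stand-in with the same two columns is enough to run the remaining steps. Everything in this sketch is hypothetical: the sample size, the age range, the share of missing values and the default rates are invented, so the IV it produces will not match the figure reported for the real file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
n = 1000

# Hypothetical ages between 18 and 69, with roughly 5% missing
age = rng.integers(18, 70, size=n).astype(float)
age[rng.random(n) < 0.05] = np.nan

# Made-up rule: applicants under 30 default more often (1 = bad, 0 = good)
p_bad = np.where(age < 30, 0.30, 0.10)
target = (rng.random(n) < p_bad).astype(int)

data = pd.DataFrame({'age': age, 'target': target})
feature = data.loc[:, ['age']]
labels = data['target']
```

This replaces only the read_csv line; keep_cols and cut_bin_dict stay as defined above.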

 

Step 2: Bin by the given thresholds

Bin the data with the same binning logic we used in Excel earlier:

cut_bin = cut_bin_dict['age']
# Bin by the thresholds and replace missing values with 'Blank'
age_binned = pd.cut(feature['age'], cut_bin, right=False).cat.add_categories(['Blank']).fillna('Blank')
# Split into bad (label 1) and good (label 0) samples
data_bad = age_binned[labels == 1]
data_good = age_binned[labels == 0]

value_list = set(data_bad.unique()) | set(data_good.unique())
value_list
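To see what the binning call does, here is a small sketch on a handful of hand-picked ages (the values are invented for illustration). With right=False each bin includes its left edge and excludes its right edge, so 18 lands in [18, 25) rather than [0, 18). One consequence worth noting: an age of exactly 100 falls outside the last bin [50, 100), becomes missing, and ends up in 'Blank' together with the real missing values.

```python
import numpy as np
import pandas as pd

# Hand-picked ages chosen to hit the bin edges (illustrative values only)
ages = pd.Series([17, 18, 24.5, 30, 65, 100, np.nan])
bins = [0, 18, 25, 30, 40, 50, 100]

# Left-closed bins: [0, 18), [18, 25), ..., [50, 100)
binned = pd.cut(ages, bins, right=False)
# Register 'Blank' as a valid category before using it to fill missing values
binned = binned.cat.add_categories(['Blank']).fillna('Blank')

print(binned.tolist())
```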

Step 3: Call the function to compute IV

iv_series = pd.Series(dtype=float)  # holds one IV value per variable
iv_series['age'] = iv_count(data_bad, data_good)
iv_series

import math
import pandas as pd

def get_iv_series(feature, labels, keep_cols=None, cut_bin_dict=None):
    '''
    Compute the IV of each variable. Inputs and outputs of get_iv_series:
    ------------------------------------------------------------
    Parameters:
        feature: feature space of the dataset (DataFrame)
        labels: output space of the dataset (1 = bad, 0 = good)
        keep_cols: list of variables whose IV should be computed
        cut_bin_dict: dict of bin thresholds for numerical variables, in the form {'col1':[value1,value2,...], 'col2':[value1,value2,...], ...}
    ------------------------------------------------------------
    Returns:
        iv_series: the IV of each variable
    '''
    def iv_count(data_bad, data_good):
        '''Compute the IV of one binned variable.'''
        value_list = set(data_bad.unique()) | set(data_good.unique())
        iv = 0
        len_bad = len(data_bad)
        len_good = len(data_good)
        for value in value_list:
            # If a bin has no bad (or no good) samples, count it as 1
            # so the ratio never becomes 0 or infinite
            n_bad = (data_bad == value).sum()
            n_good = (data_good == value).sum()
            bad_rate = max(n_bad, 1) / len_bad
            good_rate = max(n_good, 1) / len_good
            iv += (good_rate - bad_rate) * math.log(good_rate / bad_rate, 2)
        return iv

    if keep_cols is None:
        keep_cols = sorted(list(feature.columns))
    col_types = feature[keep_cols].dtypes
    categorical_feature = list(col_types[col_types == 'object'].index)
    numerical_feature = list(col_types[col_types != 'object'].index)

    # Explicit dtype avoids the empty-Series dtype warning in pandas
    iv_series = pd.Series(dtype=float)

    # Compute IV for each numerical variable
    for col in numerical_feature:
        cut_bin = cut_bin_dict[col]
        # Bin by the thresholds and replace missing values with 'Blank'
        binned = pd.cut(feature[col], cut_bin, right=False).cat.add_categories(['Blank']).fillna('Blank')
        # Split into bad and good samples
        data_bad = binned[labels == 1]
        data_good = binned[labels == 0]
        iv_series[col] = iv_count(data_bad, data_good)
    # Compute IV for each categorical variable
    for col in categorical_feature:
        # Replace missing values with 'Blank', then split into bad and good samples
        data_bad = feature[col].fillna('Blank')[labels == 1]
        data_good = feature[col].fillna('Blank')[labels == 0]
        iv_series[col] = iv_count(data_bad, data_good)

    return iv_series

Usage demo:

iv_series = get_iv_series(feature, labels, keep_cols, cut_bin_dict=cut_bin_dict)
iv_series
# age    0.434409
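get_iv_series returns only the total IV per variable. To inspect the WOE of each bin as well, something like the following sketch could be used. The helper name `woe_table` and the tiny demo data are made up for illustration; the helper reuses the conventions of the code above (label 1 = bad, base-2 log, a zero count treated as 1) rather than any library API.

```python
import numpy as np
import pandas as pd

def woe_table(binned, labels):
    """Per-bin good/bad counts, WOE and IV contribution (illustrative helper)."""
    # Cast bins to str so groupby only returns bins that actually occur
    df = pd.DataFrame({'bin': binned.astype(str),
                       'bad': (labels == 1).astype(int)})
    table = df.groupby('bin')['bad'].agg(bad='sum', total='count')
    table['good'] = table['total'] - table['bad']
    # Same zero-count rule as iv_count: an empty cell is counted as 1
    bad_rate = table['bad'].clip(lower=1) / table['bad'].sum()
    good_rate = table['good'].clip(lower=1) / table['good'].sum()
    table['woe'] = np.log2(good_rate / bad_rate)
    table['iv_part'] = (good_rate - bad_rate) * table['woe']
    return table[['good', 'bad', 'woe', 'iv_part']]


# Made-up demo: bin 'a' is mostly good, bin 'b' is mostly bad
demo_bins = pd.Series(['a', 'a', 'a', 'b', 'b', 'b', 'b'])
demo_labels = pd.Series([1, 0, 0, 1, 1, 1, 0])
table = woe_table(demo_bins, demo_labels)
print(table)
print('IV =', table['iv_part'].sum())
```

A positive WOE marks a bin where good samples are over-represented, a negative WOE one where bad samples are; the iv_part column sums to the variable's IV.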
