Risk Control Feature Engineering: Study Notes

Overall business modeling workflow:

1. Abstract the business problem into a classification or regression task

2. Define the label to obtain y

3. Select appropriate samples and join in all available information as the source of features

4. Feature engineering + model training + model evaluation and tuning (these steps may iterate with one another)

5. Produce the model report

6. Deploy and monitor

 

What is a feature?

In the context of machine learning, a feature is an individual property, or a set of properties, used to explain the phenomenon being modeled. When such properties are converted into some measurable form, they are called features.

For example, suppose you have a list of students containing each student's name, hours studied, IQ, and total score on previous exams. Now a new student arrives: you know his/her hours studied and IQ, but the exam score is missing, and you need to estimate the score he/she is likely to obtain.

Here you would use IQ and study_hours to build a predictive model that estimates the missing score, so IQ and study_hours become the features of that model.
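As a minimal sketch of this toy setup (the student data below is invented purely for illustration), a linear regression on the two features could estimate the missing score:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical student records; 'score' is the value we want to predict.
students = pd.DataFrame({
    'study_hours': [10, 4, 8, 2, 6],
    'IQ':          [110, 95, 120, 100, 105],
    'score':       [88, 60, 92, 55, 75],
})

model = LinearRegression()
model.fit(students[['study_hours', 'IQ']], students['score'])

# Estimate the score of a new student whose exam score is missing.
new_student = pd.DataFrame({'study_hours': [7], 'IQ': [115]})
print(model.predict(new_student))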
 

What feature engineering may include:

1. Basic feature construction

2. Data preprocessing

3. Feature derivation

4. Feature transformation

5. Feature selection

This is one complete feature-engineering workflow, but not the only one; the steps may well swap order.

I. Basic Feature Construction

""" 預覽數據 """

import pandas as pd
import numpy as np

df_train = pd.read_csv('train.csv')
df_train.head(3)

"""查看數據基本情況"""

df_train.shape
df_train.info()
df_train.describe()

 

"""可以畫3D圖對數據進行可視化,例子下面所示"""

from pyecharts import Bar3D

bar3d = Bar3D("2018年申請人數分佈", width=1200, height=600)
x_axis = [
    "12a", "1a", "2a", "3a", "4a", "5a", "6a", "7a", "8a", "9a", "10a", "11a",
    "12p", "1p", "2p", "3p", "4p", "5p", "6p", "7p", "8p", "9p", "10p", "11p"
]
y_axis = [
    "Saturday", "Friday", "Thursday", "Wednesday", "Tuesday", "Monday", "Sunday"
]
data = [
    [0, 0, 5], [0, 1, 1], [0, 2, 0], [0, 3, 0], [0, 4, 0], [0, 5, 0],
    [0, 6, 0], [0, 7, 0], [0, 8, 0], [0, 9, 0], [0, 10, 0], [0, 11, 2],
    [0, 12, 4], [0, 13, 1], [0, 14, 1], [0, 15, 3], [0, 16, 4], [0, 17, 6],
    [0, 18, 4], [0, 19, 4], [0, 20, 3], [0, 21, 3], [0, 22, 2], [0, 23, 5],
    [1, 0, 7], [1, 1, 0], [1, 2, 0], [1, 3, 0], [1, 4, 0], [1, 5, 0],
    [1, 6, 0], [1, 7, 0], [1, 8, 0], [1, 9, 0], [1, 10, 5], [1, 11, 2],
    [1, 12, 2], [1, 13, 6], [1, 14, 9], [1, 15, 11], [1, 16, 6], [1, 17, 7],
    [1, 18, 8], [1, 19, 12], [1, 20, 5], [1, 21, 5], [1, 22, 7], [1, 23, 2],
    [2, 0, 1], [2, 1, 1], [2, 2, 0], [2, 3, 0], [2, 4, 0], [2, 5, 0],
    [2, 6, 0], [2, 7, 0], [2, 8, 0], [2, 9, 0], [2, 10, 3], [2, 11, 2],
    [2, 12, 1], [2, 13, 9], [2, 14, 8], [2, 15, 10], [2, 16, 6], [2, 17, 5],
    [2, 18, 5], [2, 19, 5], [2, 20, 7], [2, 21, 4], [2, 22, 2], [2, 23, 4],
    [3, 0, 7], [3, 1, 3], [3, 2, 0], [3, 3, 0], [3, 4, 0], [3, 5, 0],
    [3, 6, 0], [3, 7, 0], [3, 8, 1], [3, 9, 0], [3, 10, 5], [3, 11, 4],
    [3, 12, 7], [3, 13, 14], [3, 14, 13], [3, 15, 12], [3, 16, 9], [3, 17, 5],
    [3, 18, 5], [3, 19, 10], [3, 20, 6], [3, 21, 4], [3, 22, 4], [3, 23, 1],
    [4, 0, 1], [4, 1, 3], [4, 2, 0], [4, 3, 0], [4, 4, 0], [4, 5, 1],
    [4, 6, 0], [4, 7, 0], [4, 8, 0], [4, 9, 2], [4, 10, 4], [4, 11, 4],
    [4, 12, 2], [4, 13, 4], [4, 14, 4], [4, 15, 14], [4, 16, 12], [4, 17, 1],
    [4, 18, 8], [4, 19, 5], [4, 20, 3], [4, 21, 7], [4, 22, 3], [4, 23, 0],
    [5, 0, 2], [5, 1, 1], [5, 2, 0], [5, 3, 3], [5, 4, 0], [5, 5, 0],
    [5, 6, 0], [5, 7, 0], [5, 8, 2], [5, 9, 0], [5, 10, 4], [5, 11, 1],
    [5, 12, 5], [5, 13, 10], [5, 14, 5], [5, 15, 7], [5, 16, 11], [5, 17, 6],
    [5, 18, 0], [5, 19, 5], [5, 20, 3], [5, 21, 4], [5, 22, 2], [5, 23, 0],
    [6, 0, 1], [6, 1, 0], [6, 2, 0], [6, 3, 0], [6, 4, 0], [6, 5, 0],
    [6, 6, 0], [6, 7, 0], [6, 8, 0], [6, 9, 0], [6, 10, 1], [6, 11, 0],
    [6, 12, 2], [6, 13, 1], [6, 14, 3], [6, 15, 4], [6, 16, 0], [6, 17, 0],
    [6, 18, 0], [6, 19, 0], [6, 20, 1], [6, 21, 2], [6, 22, 2], [6, 23, 6]
]
range_color = ['#313695', '#4575b4', '#74add1', '#abd9e9', '#e0f3f8', '#ffffbf',
               '#fee090', '#fdae61', '#f46d43', '#d73027', '#a50026']
bar3d.add(
    "",
    x_axis,
    y_axis,
    [[d[1], d[0], d[2]] for d in data],
    is_visualmap=True,
    visual_range=[0, 20],
    visual_range_color=range_color,
    grid3d_width=200,
    grid3d_depth=80,
    is_grid3d_rotate=True,  # auto-rotate
    grid3d_rotate_speed=180,  # rotation speed
)
bar3d

II. Data Preprocessing

Missing values - two main tools: 1. pandas fillna  2. sklearn imputers

"""均值填充"""

df_train['Age'].fillna(value=df_train['Age'].mean()).sample(5)


""" 另一種均值填充的方式 """

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
age = imp.fit_transform(df_train[['Age']].values).copy()
df_train.loc[:,'Age'] = df_train['Age'].fillna(value=df_train['Age'].mean()).copy()
df_train.head(5)


Numeric data - scaling

"""取對數等變換"""

import numpy as np
log_age = df_train['Age'].apply(lambda x:np.log(x))
df_train.loc[:,'log_age'] = log_age

df_train.head(5)

""" 幅度縮放,最大最小值縮放到[0,1]區間內 """

from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
fare_trans = mm_scaler.fit_transform(df_train[['Fare']])

""" 幅度縮放,將每一列的數據標準化爲正態分佈 """

from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
fare_std_trans = std_scaler.fit_transform(df_train[['Fare']])

""" 中位數或者四分位數去中心化數據,對異常值不敏感 """

from sklearn.preprocessing import robust_scale
fare_robust_trans = robust_scale(df_train[['Fare','Age']])

""" 將同一行數據規範化,前面的同一變爲1以內也可以達到這樣的效果 """

from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
fare_normal_trans = normalizer.fit_transform(df_train[['Age','Fare']])
fare_normal_trans

Statistical features

""" 最大最小值 """

max_age = df_train['Age'].max()
min_age = df_train["Age"].min()

""" 分位數,極值處理,我們最粗暴的方法就是將前後1%的值替換成前後兩個端點的值 """

age_quarter_01 = df_train['Age'].quantile(0.01)
print(age_quarter_01)
age_quarter_99 = df_train['Age'].quantile(0.99)
print(age_quarter_99)
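The lines above only compute the two quantiles; a short sketch of actually applying the 1%/99% replacement (winsorization) with pandas' clip:

""" Cap values below the 1st / above the 99th percentile at those endpoints """

df_train.loc[:, 'Age_capped'] = df_train['Age'].clip(lower=age_quarter_01, upper=age_quarter_99)
df_train[['Age', 'Age_capped']].describe()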

""" 四則運算 """

df_train.loc[:,'family_size'] = df_train['SibSp']+df_train['Parch']+1
df_train.head(2)

df_train.loc[:,'tmp'] = df_train['Age']*df_train['Pclass'] + 4*df_train['family_size']
df_train.head(2)


""" 多項式特徵 """

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
df_train[['SibSp','Parch']].head()

poly_fea = poly.fit_transform(df_train[['SibSp','Parch']])
pd.DataFrame(poly_fea, columns=poly.get_feature_names_out()).head()  # poly.get_feature_names() on sklearn < 1.0

""" 等距切分 """

df_train.loc[:, 'fare_cut'] = pd.cut(df_train['Fare'], 20)
df_train.head(2)

""" 等頻切分 """

df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 10)
df_train.head(2)

""" badrate 曲線 """

df_train = df_train.sort_values('Fare')

alist = list(set(df_train['fare_qcut']))
badrate = {}
for x in alist:
    
    a = df_train[df_train.fare_qcut == x]
    
    bad = a[a.label == 1]['label'].count()
    good = a[a.label == 0]['label'].count()
    
    badrate[x] = bad/(bad+good)
    
f = zip(badrate.keys(),badrate.values())
f = sorted(f,key = lambda x : x[1],reverse = True )
badrate = pd.DataFrame(f)
badrate.columns = pd.Series(['cut','badrate'])
badrate = badrate.sort_values('cut')
print(badrate.head())
badrate.plot('cut','badrate')


""" 一般採取等頻分箱,很少等距分箱,等距分箱可能造成樣本非常不均勻 """

""" 一般分5-6箱,保證badrate曲線從非嚴格遞增轉化爲嚴格遞增曲線 """

""" OneHot encoding/獨熱向量編碼 """

""" 一般像男、女這種二分類categories類型的數據採取獨熱向量編碼, 轉化爲0、1  主要用到 pd.get_dummies """

embarked_oht = pd.get_dummies(df_train[['Embarked']])
embarked_oht.head(2)


fare_qcut_oht = pd.get_dummies(df_train[['fare_qcut']])
fare_qcut_oht.head(2)


Datetime data - date handling

car_sales = pd.read_csv('car_data.csv')
car_sales.head(2)

car_sales.loc[:,'date'] = pd.to_datetime(car_sales['date_t'])
car_sales.head(2)

""" 取出關鍵時間信息  """

""" 月份 """

car_sales.loc[:,'month'] = car_sales['date'].dt.month
car_sales.head()

""" 幾號 """

car_sales.loc[:,'dom'] = car_sales['date'].dt.day

""" 一年當中第幾天 """

car_sales.loc[:,'doy'] = car_sales['date'].dt.dayofyear

""" 星期幾 """

car_sales.loc[:,'dow'] = car_sales['date'].dt.dayofweek

car_sales.head(2)

Text data

from pyecharts import WordCloud

name = [
 'bupt', 'finance', 'Taotao', 'hands-on', 'good-looking',
 'machine learning', 'deep learning', 'anomaly detection', 'knowledge graph', 'social network', 'graph algorithms',
 'transfer learning', 'imbalanced learning', 'dengdeng', 'data mining', 'haha',
 'ensemble algorithms', 'model fusion', 'python', 'smart']
value = [
 10000, 6181, 4386, 4055, 2467, 2244, 1898, 1484, 1112,
 965, 847, 582, 555, 550, 462, 366, 360, 282, 273, 265]
wordcloud = WordCloud(width=800, height=600)
wordcloud.add("", name, value, word_size_range=[30, 80])

""" 詞袋模型 """

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is a very good class',
    'students are very very very good',
    'This is the third sentence',
    'Is this the last doc',
    'PS teacher Mei is very very handsome'
]

X = vectorizer.fit_transform(corpus)
X.toarray()  # term-count matrix for each document

vec = CountVectorizer(ngram_range=(1,3))
X_ngram = vec.fit_transform(corpus)
X_ngram.toarray()

""" TF-IDF """

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
tfidf_X = tfidf_vec.fit_transform(corpus)
tfidf_vec.get_feature_names_out()  # get_feature_names() on sklearn < 1.0
tfidf_X.toarray()

Feature combinations

""" 根據條件去判斷獲取組合特徵  """

df_train.loc[:,'alone'] = (df_train['SibSp']==0)&(df_train['Parch']==0)
df_train.head(3)

""" 詞雲圖可以直觀的反應哪些詞作用權重比較大 """

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

corpus = [
    'This is a very good class',
    'students are very very very good',
    'This is the third sentence',
    'Is this the last doc',
    'teacher Mei is very very handsome'
]

X = vectorizer.fit_transform(corpus)

# Column-wise total count of each vocabulary term across the corpus
# (equivalent to the original manual double loop over X.toarray()).
value = X.toarray().sum(axis=0).tolist()

from pyecharts import WordCloud

wordcloud = WordCloud(width=800,height=500)
# Feed the vocabulary and each term's total count into the word cloud.
wordcloud.add('', list(vectorizer.get_feature_names_out()), value, word_size_range=[20, 100])  # get_feature_names() on sklearn < 1.0
wordcloud

III. Feature Derivation

data = pd.read_excel('textdata.xlsx')
data.head()

""" ft 和 gt 表示兩個變量名 1-12 表示對應12個月中每個月的相應數值 """

""" 基於時間序列進行特徵衍生 """

""" 最近p個月,inv>0的月份數 inv表示傳入的變量名 """

def Num(data,inv,p):

    df=data.loc[:,inv+'1':inv+str(p)]
    auto_value=np.where(df>0,1,0).sum(axis=1)

    return data,inv+'_num'+str(p),auto_value

data_new = data.copy()

for p in range(1,12):
    for inv in ['ft','gt']:
        data_new,columns_name,values=Num(data_new,inv,p)
        data_new[columns_name]=values

# -*- coding:utf-8 -*-

'''

    @Author : wangtao
    @Time : 19/9/3 18:28
    @desc : build time-series derived features

'''

import numpy as np
import pandas as pd

class time_series_feature(object):

    def __init__(self):
        pass

    def Num(self,data,inv,p):

        """
        :param data:
        :param inv:
        :param p:
        :return: number of months with inv > 0 in the last p months
        """
        df = data.loc[:,inv+'1':inv+str(p)]
        auto_value = np.where(df > 0,1,0).sum(axis=1)

        return inv+'_num'+str(p),auto_value

    def Nmz(self,data,inv,p):

        """
        :param data:
        :param inv:
        :param p:
        :return: number of months with inv = 0 in the last p months
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.where(df == 0, 1, 0).sum(axis=1)

        return inv + '_nmz' + str(p), auto_value

    def Evr(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: 0/1 flag for whether inv > 0 in at least one of the last p months
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        arr = np.where(df > 0, 1, 0).sum(axis=1)
        auto_value = np.where(arr, 1, 0)

        return inv + '_evr' + str(p), auto_value

    def Avg(self,data,inv, p):

        """
        :param p:
        :return: mean of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmean(df, axis=1)

        return inv + '_avg' + str(p), auto_value

    def Tot(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: sum of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nansum(df, axis=1)

        return inv + '_tot' + str(p), auto_value

    def Tot2T(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: sum of inv over months 2..p+1; useful for gauging the variable's fluctuation
        """

        df = data.loc[:, inv + '2':inv + str(p + 1)]
        auto_value = df.sum(1)

        return inv + '_tot2t' + str(p), auto_value

    def Max(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: max of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmax(df, axis=1)

        return inv + '_max' + str(p), auto_value

    def Min(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: min of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmin(df, axis=1)

        return inv + '_min' + str(p), auto_value

    def Msg(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: months since the most recent month with inv > 0, within the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        df_value = np.where(df > 0, 1, 0)
        auto_value = []
        for i in range(len(df_value)):
            row_value = df_value[i, :]
            if row_value.max() <= 0:
                indexs = 0
                auto_value.append(indexs)
            else:
                indexs = 1
                for j in row_value:
                    if j > 0:
                        break
                    indexs += 1
                auto_value.append(indexs)

        return inv + '_msg' + str(p), auto_value

    def Msz(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: months since the most recent month with inv = 0, within the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        df_value = np.where(df == 0, 1, 0)
        auto_value = []
        for i in range(len(df_value)):
            row_value = df_value[i, :]
            if row_value.max() <= 0:
                indexs = 0
                auto_value.append(indexs)
            else:
                indexs = 1
                for j in row_value:
                    if j > 0:
                        break
                    indexs += 1
                auto_value.append(indexs)

        return inv + '_msz' + str(p), auto_value

    def Cav(self,data,inv, p):

        """
        :param p:
        :return: current month's inv / mean of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] / np.nanmean(df, axis=1)

        return inv + '_cav' + str(p), auto_value

    def Cmn(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: current month's inv / min of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] / np.nanmin(df, axis=1)

        return inv + '_cmn' + str(p), auto_value

    def Mai(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: largest month-over-month increase of inv in the last p months

        """

        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []

        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            auto_value.append(np.nanmax(value_lst))

        return inv + '_mai' + str(p), auto_value

    def Mad(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: largest month-over-month decrease of inv in the last p months

        """

        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k + 1] - df_value[k]
                value_lst.append(minus)
            auto_value.append(np.nanmax(value_lst))

        return inv + '_mad' + str(p), auto_value

    def Std(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: standard deviation of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanstd(df, axis=1)  # standard deviation, matching the docstring

        return inv + '_std' + str(p), auto_value

    def Cva(self,data,inv, p):

        """
        :param p:
        :return: coefficient of variation (std / mean) of inv over the last p months

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanstd(df, axis=1) / np.nanmean(df, axis=1)  # CV = std / mean

        return inv + '_cva' + str(p), auto_value

    def Cmm(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: (current month's inv) - (mean of inv over the last p months)

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] - np.nanmean(df, axis=1)

        return inv + '_cmm' + str(p), auto_value

    def Cnm(self,data,inv, p):

        """
        :param p:
        :return: (current month's inv) - (min of inv over the last p months)
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] - np.nanmin(df, axis=1)

        return inv + '_cnm' + str(p), auto_value

    def Cxm(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: (current month's inv) - (max of inv over the last p months)

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = df[inv + '1'] - np.nanmax(df, axis=1)

        return inv + '_cxm' + str(p), auto_value

    def Cxp(self,data,inv, p):

        """
        :param p:
        :return: ((current month's inv) - (min of inv over the last p months)) / that min

        """

        df = data.loc[:, inv + '1':inv + str(p)]
        temp = np.nanmin(df, axis=1)
        auto_value = (df[inv + '1'] - temp) / temp

        return inv + '_cxp' + str(p), auto_value

    def Ran(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: range (max - min) of inv over the last p months
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = np.nanmax(df, axis=1) - np.nanmin(df, axis=1)

        return inv + '_ran' + str(p), auto_value

    def Nci(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: number of month-over-month increases of inv in the last min(time on book, p) months
        """

        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            value_ng = np.where(np.array(value_lst) > 0, 1, 0).sum()
            auto_value.append(value_ng)

        return inv + '_nci' + str(p), auto_value

    def Ncd(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: number of month-over-month decreases of inv in the last min(time on book, p) months
        """

        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            value_ng = np.where(np.array(value_lst) < 0, 1, 0).sum()
            auto_value.append(value_ng)

        return inv + '_ncd' + str(p), auto_value

    def Ncn(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: number of adjacent month pairs with equal inv in the last min(time on book, p) months
        """

        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            value_lst = []
            for k in range(len(df_value) - 1):
                minus = df_value[k] - df_value[k + 1]
                value_lst.append(minus)
            value_ng = np.where(np.array(value_lst) == 0, 1, 0).sum()
            auto_value.append(value_ng)

        return inv + '_ncn' + str(p), auto_value

    def Bup(self,data,inv, p):

        """
        :param p:
        :return:
        desc: flag = 1 if, over the last min(time on book, p) months, inv[i] > inv[i+1] for every month i (strictly increasing toward the current month) and inv > 0; else flag = 0

        """
        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            index = 0
            for k in range(len(df_value) - 1):
                # stop as soon as the series is not strictly increasing
                # (inv1 is the current month, so increasing means inv[k] > inv[k + 1])
                if df_value[k] <= df_value[k + 1]:
                    break
                index += 1
            # flag only when all p - 1 adjacent pairs increased
            if index == p - 1:
                value = 1
            else:
                value = 0
            auto_value.append(value)

        return inv + '_bup' + str(p), auto_value

    def Pdn(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return:
        desc: flag = 1 if, over the last min(time on book, p) months, inv[i] < inv[i+1] for every month i (strictly decreasing toward the current month) and inv > 0; else flag = 0

        """

        arr = np.array(data.loc[:, inv + '1':inv + str(p)])
        auto_value = []
        for i in range(len(arr)):
            df_value = arr[i, :]
            index = 0
            for k in range(len(df_value) - 1):
                # stop as soon as the series is not strictly decreasing
                if df_value[k] >= df_value[k + 1]:
                    break
                index += 1
            # flag only when all p - 1 adjacent pairs decreased
            if index == p - 1:
                value = 1
            else:
                value = 0
            auto_value.append(value)

        return inv + '_pdn' + str(p), auto_value

    def Trm(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: trimmed mean of inv over the last min(time on book, p) months (one max and one min removed)
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = []
        for i in range(len(df)):
            trm_mean = list(df.loc[i, :])
            trm_mean.remove(np.nanmax(trm_mean))
            trm_mean.remove(np.nanmin(trm_mean))
            temp = np.nanmean(trm_mean)
            auto_value.append(temp)

        return inv + '_trm' + str(p), auto_value

    def Cmx(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: (current month's inv - max of inv over the last p months) / that max
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = (df[inv + '1'] - np.nanmax(df, axis=1)) / np.nanmax(df, axis=1)

        return inv + '_cmx' + str(p), auto_value

    def Cmp(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: (current month's inv - mean of inv over the last p months) / that mean
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = (df[inv + '1'] - np.nanmean(df, axis=1)) / np.nanmean(df, axis=1)

        return inv + '_cmp' + str(p), auto_value

    def Cnp(self,data,inv, p):

        """
        :param p:
        :return: (current month's inv - min of inv over the last p months) / that min
        """

        df = data.loc[:, inv + '1':inv + str(p)]
        auto_value = (df[inv + '1'] - np.nanmin(df, axis=1)) / np.nanmin(df, axis=1)

        return inv + '_cnp' + str(p), auto_value

    def Msx(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: months between now and the month with the max value, within the last min(time on book, p) months
        """

        df = data.loc[:, inv + '1':inv + str(p)].copy()  # copy, so the helper column below does not touch the caller's frame
        df['_max'] = np.nanmax(df, axis=1)
        for i in range(1, p + 1):
            df[inv + str(i)] = list(df[inv + str(i)] == df['_max'])
        del df['_max']
        df_value = np.where(df == True, 1, 0)
        auto_value = []
        for i in range(len(df_value)):
            row_value = df_value[i, :]
            indexs = 1
            for j in row_value:
                if j == 1:
                    break
                indexs += 1
            auto_value.append(indexs)

        return inv + '_msx' + str(p), auto_value

    def Rpp(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: mean of inv over the last p months / mean of inv over months (p, 2p]
        """

        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmean(df1, axis=1)
        df2 = data.loc[:, inv + str(p + 1):inv + str(2 * p)]
        value2 = np.nanmean(df2, axis=1)
        auto_value = value1 / value2

        return inv + '_rpp' + str(p), auto_value

    def Dpp(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: mean of inv over the last p months - mean of inv over months (p, 2p]

        """

        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmean(df1, axis=1)
        df2 = data.loc[:, inv + str(p + 1):inv + str(2 * p)]
        value2 = np.nanmean(df2, axis=1)
        auto_value = value1 - value2

        return inv + '_dpp' + str(p), auto_value

    def Mpp(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: max of inv over the last p months / max of inv over months (p, 2p]
        """

        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmax(df1, axis=1)
        df2 = data.loc[:, inv + str(p + 1):inv + str(2 * p)]
        value2 = np.nanmax(df2, axis=1)
        auto_value = value1 / value2

        return inv + '_mpp' + str(p), auto_value

    def Npp(self,data,inv, p):

        """
        :param data:
        :param inv:
        :param p:
        :return: min of inv over the last p months / min of inv over months (p, 2p]

        """

        df1 = data.loc[:, inv + '1':inv + str(p)]
        value1 = np.nanmin(df1, axis=1)
        df2 = data.loc[:, inv + str(p + 1):inv + str(2 * p)]
        value2 = np.nanmin(df2, axis=1)
        auto_value = value1 / value2

        return inv + '_npp' + str(p), auto_value


    def auto_var(self,data_new,inv,p):

        """
        :param data:
        :param inv:
        :param p:
        :return: data_new with all of the two-parameter derivation functions applied in batch

        """
        try:
            columns_name, values = self.Num(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Nmz(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Evr(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Avg(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Tot(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Tot2T(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Max(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Min(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Msg(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Msz(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cav(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cmn(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Std(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cva(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cmm(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cnm(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cxm(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cxp(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Ran(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Nci(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Ncd(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Ncn(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Pdn(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cmx(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cmp(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Cnp(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Msx(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Trm(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Bup(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Mai(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Mad(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Rpp(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Dpp(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Mpp(data_new,inv, p)
            data_new[columns_name] = values

            columns_name, values = self.Npp(data_new,inv, p)
            data_new[columns_name] = values

        except Exception:
            # some derivations need up to 2*p monthly columns; combinations
            # without enough history are skipped silently
            pass

        return data_new


if __name__ == "__main__":
    
    file_dir = ""
    file_name = "textdata.xlsx"
    data_ = pd.read_excel(file_dir + file_name)
    
    auto_var2 = time_series_feature()
    
    for p in range(1,12):
        for inv in ['ft','gt']:
            data_ = auto_var2.auto_var(data_,inv,p)

IV. Feature Selection

 

Three commonly used families of feature-selection methods:

1. Filter

Removing features with low variance

Univariate feature selection

2. Wrapper

Recursive Feature Elimination

3. Embedded

Feature selection using SelectFromModel

Feature selection as part of a pipeline


Once preprocessing is done, we need to select meaningful features to feed into the machine-learning algorithms and models for training.

Generally speaking, features are selected from two angles:

1. Whether the feature has spread

If a feature does not vary, e.g. its variance is close to 0, the samples are essentially identical on this feature, and it is of no use for telling samples apart.

2. Correlation between the feature and the target

This one is fairly obvious: features highly correlated with the target should be preferred. Besides removing low-variance features, selection can also proceed from correlation.

By the form the selection takes, feature-selection methods fall into three types:

Filter: score each feature by spread or correlation, then select features by a threshold or by a target number of features to keep.

Wrapper: driven by an objective function (usually predictive performance), select or drop several features at a time.

Embedded: first train a machine-learning model to obtain a weight coefficient for each feature, then select features from the largest coefficient down. Similar to Filter, except that feature quality is determined through training.

Feature selection serves two main purposes:

reducing the number of features (dimensionality reduction), which improves generalization and reduces overfitting;

improving our understanding of the features and their values.

Given a dataset, a single feature-selection method can rarely achieve both purposes at once.
 

Filter

1) Removing features with low variance

Suppose a feature only takes the values 0 and 1, and in 95% of all input samples its value is 1; then the feature is of little use. If 100% of samples take the value 1, the feature carries no information at all. The method only applies to discrete features; continuous features must be discretized first.

In practice, features where more than 95% of samples share one value are rare, so the method is simple but of limited use on its own. It works well as a preprocessing step for feature selection:

first drop the features that barely vary, then apply one of the feature-selection methods introduced below for further selection.
 

from sklearn.feature_selection import VarianceThreshold


X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)

2) Univariate feature selection

Univariate feature selection computes a statistical measure for each variable separately, judges which variables matter by that measure, and discards the unimportant ones.

For classification problems (discrete y), options include:

chi-squared test
f_classif
mutual_info_classif
mutual information


For regression problems (continuous y), options include:

Pearson correlation
f_regression
mutual_info_regression
maximal information coefficient

This approach is simple, easy to run, and easy to understand; it usually works well for understanding the data (though it is not necessarily effective for feature optimization or improving generalization).

SelectKBest removes all but the k highest-scoring features (keep top k).

SelectPercentile removes all but a user-specified top percentage of features (keep top k%).

Per-feature univariate statistical tests: false positive rate SelectFpr, false discovery rate SelectFdr, or family-wise error rate SelectFwe.

GenericUnivariateSelect takes the univariate selection strategy as a parameter, so the strategy itself can be tuned with hyperparameter search, as sketched below.
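A minimal sketch of the last two selectors on the iris data (all parameter values below are illustrative, not recommendations):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, GenericUnivariateSelect, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Keep the top 50% of features by ANOVA F-score.
X_pct = SelectPercentile(f_classif, percentile=50).fit_transform(X, y)
print(X_pct.shape)

# Same idea, but the strategy ('percentile', 'k_best', 'fpr', ...) is itself a
# parameter, so it can be included in a hyperparameter search.
X_gus = GenericUnivariateSelect(f_classif, mode='percentile', param=50).fit_transform(X, y)
print(X_gus.shape)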

Notice:
The methods based on F-test estimate the degree of linear dependency between two random variables.
On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
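A small sketch contrasting the two kinds of scores on the same data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif, mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target

F, pval = f_classif(X, y)                        # linear dependency (ANOVA F-test)
mi = mutual_info_classif(X, y, random_state=0)   # arbitrary statistical dependency
print(F.round(1), pval)
print(mi.round(3))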

Chi-squared (chi2) test

The classic chi-squared test measures the dependence between a categorical independent variable and a categorical dependent variable.

For example, we can run a chi2 test on the samples to select the best two features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)

Pearson Correlation

The Pearson correlation coefficient is the simplest way to understand the relationship between a feature and the response variable.

It measures linear correlation between variables; the result lies in [-1, 1], where -1 is a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation.

import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)

""" pearsonr(x, y)的輸入爲特徵矩陣和目標向量,能夠同時計算 相關係數 和p-value. """

print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))

""" 比較了變量在加入噪音之前和之後的差異。當噪音比較小的時候,相關性很強,p-value很低 """
""" 使用Pearson相關係數主要是爲了看特徵之間的相關性,而不是和因變量之間的。 """

Wrapper

Recursive Feature Elimination

Recursive feature elimination trains a base model over multiple rounds; after each round, the features with the smallest weight coefficients are removed and the next round trains on the reduced feature set.

For predictive models that assign weights to features (e.g. the coefficients of a linear model), RFE selects features by recursively shrinking the feature set: the model is first trained on all features and each feature receives a weight; the features with the smallest absolute weights are then dropped. This repeats until the remaining number of features reaches the requested count.

RFECV runs RFE under cross-validation to choose the best number of features. Note that a set of d features has 2^d - 1 non-empty subsets, so exhaustive search is infeasible; RFECV instead scores the nested subsets produced by the elimination path with an external learner (an SVM, for instance), computing the validation error for each, and keeps the feature count with the smallest error.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

rf = RandomForestClassifier()
iris=load_iris()
X,y=iris.data,iris.target
rfe = RFE(estimator=rf, n_features_to_select=3)
X_rfe = rfe.fit_transform(X,y)
X_rfe.shape
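The cross-validated variant chooses the number of features by itself; a sketch with illustrative parameters:

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# 5-fold CV over the nested feature subsets produced by the elimination path.
rfecv = RFECV(estimator=RandomForestClassifier(random_state=0), cv=5, scoring='accuracy')
X_rfecv = rfecv.fit_transform(X, y)
print(rfecv.n_features_, X_rfecv.shape)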

Embedded

Feature selection using SelectFromModel

L1-based feature selection

Linear models penalized with the L1 norm yield sparse solutions: most feature coefficients are 0.

When the goal is to reduce dimensionality for use with another classifier, feature_selection.SelectFromModel can keep the features with nonzero coefficients.

In particular, sparse estimators commonly used for this purpose are linear_model.Lasso (regression), and linear_model.LogisticRegression and svm.LinearSVC (classification).
 

from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X,y)
model = SelectFromModel(lsvc, prefit=True)
X_embed = model.transform(X)
X_embed.shape

First, let's review the problems our models run into in production.

1. The model performs poorly: most likely the data has problems.

2. Good on the training set, poor on the out-of-time test (the test sample is typically about 1/10 of the training data): the test distribution differs from the training distribution, which means some of the selected features fluctuate too much; inspect and analyze the unstable features.

3. Good on the out-of-time test too, but poor after going live: the offline and online variable logic diverged; the offline features may contain future information.

4. Fine after launch, but the score distribution starts sliding after a few weeks: the model is not holding up; one or two variables probably perform poorly across time windows.

5. Stable for a month or two, then the score distribution drops abruptly: likely an external factor, such as an operations-team campaign or a government policy change.

6. No obvious problem, but the model decays gradually month by month: typically slow drift in the customer population; monitor each variable's stability (e.g. PSI) over time.

Next, consider what the business requires of a variable.

    A variable must contribute to the model, i.e. it must help separate the customer groups
    
    Logistic regression requires the variables to be linearly independent of one another
    
    A logistic-regression scorecard also prefers variables with a monotonic trend 
    
    (this is partly a business requirement; from the model's point of view, a monotonic variable is not necessarily better than one with a turning point)
    
    The customer population's distribution on each variable must be stable; distribution shift is unavoidable, but it must not fluctuate too much
    
From the methods above we therefore pick the ones that best fit this scenario. Linear independence, for instance, can be checked with the variance inflation factor (VIF):

from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np

data = [[1,2,3,4,5],
        [2,4,6,8,9],
        [1,1,1,1,1],
       [2,4,6,4,7]]
X = np.array(data).T

variance_inflation_factor(X, 0)  # VIF of the first variable against all the others

3) Monotonicity

- bivar chart

""" 等頻切分 """
df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 10)
df_train.head()
df_train = df_train.sort_values('Fare')
alist = list(set(df_train['fare_qcut']))
badrate = {}
for x in alist:
    
    a = df_train[df_train.fare_qcut == x]
    
    bad = a[a.label == 1]['label'].count()
    good = a[a.label == 0]['label'].count()
    
    badrate[x] = bad/(bad+good)
f = zip(badrate.keys(),badrate.values())
f = sorted(f,key = lambda x : x[1],reverse = True )
badrate = pd.DataFrame(f)
badrate.columns = pd.Series(['cut','badrate'])
badrate = badrate.sort_values('cut')
print(badrate)
badrate.plot('cut','badrate')

import math

def var_PSI(dev_data, val_data):
    """PSI between development and validation bin counts."""
    dev_cnt, val_cnt = sum(dev_data), sum(val_data)
    if dev_cnt * val_cnt == 0:
        return None
    PSI = 0
    for i in range(len(dev_data)):
        # a small epsilon keeps the log finite when a bin is empty
        dev_ratio = dev_data[i] / dev_cnt + 1e-10
        val_ratio = val_data[i] / val_cnt + 1e-10
        psi = (dev_ratio - val_ratio) * math.log(dev_ratio / val_ratio)
        PSI += psi
    return PSI

Note that the number of bins affects a variable's PSI value.

PSI is not only computed for model scores; it works for individual variables in exactly the same way: bin the variable in each time window and compute PSI on the bin shares, i.e. PSI = sum_i (dev_i% - val_i%) * ln(dev_i% / val_i%), which is what the function above implements.
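A usage sketch of var_PSI on the equal-frequency bins built earlier (splitting df_train into halves is only a stand-in for two real observation windows):

""" Pretend the two halves of df_train are two time windows and compare the fare_qcut distribution """

half = len(df_train) // 2
dev_cnt = df_train.iloc[:half]['fare_qcut'].value_counts().sort_index()
val_cnt = df_train.iloc[half:]['fare_qcut'].value_counts().reindex(dev_cnt.index).fillna(0)
print(var_PSI(list(dev_cnt), list(val_cnt)))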

import pandas as pd
from sklearn.metrics import roc_auc_score,roc_curve,auc
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import numpy as np
import random
import math

data = pd.read_csv(file_dir + 'data.txt')
data.head()

""" 看一下月份分佈,我們用最後一個月做爲跨時間驗證集合  """
data.obs_mth.unique()

train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
val = data[data.obs_mth == '2018-11-30'].reset_index().copy()

feature_lst = ['person_info','finance_info','credit_info','act_info','td_score','jxl_score','mj_score','rh_score']

x = train[feature_lst]
y = train['bad_ind']

val_x =  val[feature_lst]
val_y = val['bad_ind']

lr_model = LogisticRegression(C=0.1)
lr_model.fit(x,y)

y_pred = lr_model.predict_proba(x)[:,1]
fpr_lr_train,tpr_lr_train,_ = roc_curve(y,y_pred)
train_ks = abs(fpr_lr_train - tpr_lr_train).max()
print('train_ks : ',train_ks)

y_pred = lr_model.predict_proba(val_x)[:,1]
fpr_lr,tpr_lr,_ = roc_curve(val_y,y_pred)
val_ks = abs(fpr_lr - tpr_lr).max()
print('val_ks : ',val_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lr_train,tpr_lr_train,label = 'train LR')
plt.plot(fpr_lr,tpr_lr,label = 'evl LR')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc = 'best')
plt.show()
""" 做特徵篩選 """

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = np.array(x)

for i in range(X.shape[1]):
    print(variance_inflation_factor(X,i))
import lightgbm as lgb
from sklearn.model_selection import train_test_split

train_x,test_x,train_y,test_y = train_test_split(x,y,random_state=0,test_size=0.2)

def lgb_test(train_x, train_y, test_x, test_y):
    
    clf =lgb.LGBMClassifier(boosting_type = 'gbdt',
                           objective = 'binary',
                           metric = 'auc',
                           learning_rate = 0.1,
                           n_estimators = 24,
                           max_depth = 5,
                           num_leaves = 20,
                           max_bin = 45,
                           min_data_in_leaf = 6,
                           bagging_fraction = 0.6,
                           bagging_freq = 0,
                           feature_fraction = 0.8,
                           )
    
    clf.fit(train_x,train_y,eval_set = [(train_x,train_y),(test_x,test_y)],eval_metric = 'auc')
    
    return clf, clf.best_score_['valid_1']['auc']

lgb_model , lgb_auc  = lgb_test(train_x,train_y,test_x,test_y)

feature_importance = pd.DataFrame({'name':lgb_model.booster_.feature_name(),
                                   'importance':lgb_model.feature_importances_}).sort_values(by=['importance'],ascending=False)
feature_importance


feature_lst = ['person_info','finance_info','credit_info','act_info']
x = train[feature_lst]
y = train['bad_ind']

val_x =  val[feature_lst]
val_y = val['bad_ind']

lr_model = LogisticRegression(C=0.1,class_weight='balanced')
lr_model.fit(x,y)
y_pred = lr_model.predict_proba(x)[:,1]
fpr_lr_train,tpr_lr_train,_ = roc_curve(y,y_pred)
train_ks = abs(fpr_lr_train - tpr_lr_train).max()
print('train_ks : ',train_ks)

y_pred = lr_model.predict_proba(val_x)[:,1]
fpr_lr,tpr_lr,_ = roc_curve(val_y,y_pred)
val_ks = abs(fpr_lr - tpr_lr).max()
print('val_ks : ',val_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lr_train,tpr_lr_train,label = 'train LR')
plt.plot(fpr_lr,tpr_lr,label = 'evl LR')
plt.plot([0,1],[0,1],'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc = 'best')
plt.show()

# Coefficients

print('Feature list: ', feature_lst)
print('Coefficients: ', lr_model.coef_)
print('Intercept: ', lr_model.intercept_)


"""報告"""
model = lr_model
row_num, col_num = 0, 0
bins = 20
Y_predict = [s[1] for s in model.predict_proba(val_x)]
Y = val_y
nrows = Y.shape[0]
lis = [(Y_predict[i], Y[i]) for i in range(nrows)]
ks_lis = sorted(lis, key=lambda x: x[0], reverse=True)
bin_num = int(nrows/bins+1)
bad = sum([1 for (p, y) in ks_lis if y > 0.5])
good = sum([1 for (p, y) in ks_lis if y <= 0.5])
bad_cnt, good_cnt = 0, 0

KS = []
BAD = []
GOOD = []
BAD_CNT = []
GOOD_CNT = []
BAD_PCTG = []
BADRATE = []
dct_report = {}

for j in range(bins):
    ds = ks_lis[j*bin_num: min((j+1)*bin_num, nrows)]
    bad1 = sum([1 for (p, y) in ds if y > 0.5])
    good1 = sum([1 for (p, y) in ds if y <= 0.5])
    bad_cnt += bad1
    good_cnt += good1
    bad_pctg = round(bad_cnt/sum(val_y),3)
    badrate = round(bad1/(bad1+good1),3)
    ks = round(math.fabs((bad_cnt / bad) - (good_cnt / good)),3)
    KS.append(ks)
    BAD.append(bad1)
    GOOD.append(good1)
    BAD_CNT.append(bad_cnt)
    GOOD_CNT.append(good_cnt)
    BAD_PCTG.append(bad_pctg)
    BADRATE.append(badrate)
# assemble the report once, after the loop
dct_report['KS'] = KS
dct_report['BAD'] = BAD
dct_report['GOOD'] = GOOD
dct_report['BAD_CNT'] = BAD_CNT
dct_report['GOOD_CNT'] = GOOD_CNT
dct_report['BAD_PCTG'] = BAD_PCTG
dct_report['BADRATE'] = BADRATE
val_report = pd.DataFrame(dct_report)
val_report


""" 映射分數 """
#['person_info','finance_info','credit_info','act_info']

def score(person_info, finance_info, credit_info, act_info):
    # Coefficients and intercept hard-coded from the lr_model fitted above.
    xbeta = (person_info * 3.49460978 + finance_info * 11.40051582
             + credit_info * 2.45541981 + act_info * (-1.68676079) + 0.34484897)
    score = 650 - 34 * xbeta / math.log(2)
    return score
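This is the standard scorecard scaling score = A - B * ln(odds): xbeta is the log-odds output of the logistic regression, the base score A is 650, and B = 34 / ln(2), so the score drops by 34 points each time the odds of being bad double (PDO = 34).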

val['score'] = val.apply(lambda x : score(x.person_info,x.finance_info,x.credit_info,x.act_info) ,axis=1)

fpr_lr,tpr_lr,_ = roc_curve(val_y,val['score'])
val_ks = abs(fpr_lr - tpr_lr).max()

print('val_ks : ',val_ks)

# Rating bands for the score
def level(score):
    level = 0
    if score <= 600:
        level = "D"
    elif score <= 640 and score > 600 : 
        level = "C"
    elif score <= 680 and score > 640:
        level = "B"
    elif  score > 680 :
        level = "A"
    return level

val['level'] = val.score.map(lambda x : level(x) )

val.level.groupby(val.level).count()/len(val)


""" 畫圖展示區間分佈情況 """
import seaborn as sns

sns.histplot(val.score, kde=True)  # sns.distplot() on seaborn < 0.11

val = val.sort_values('score',ascending=True).reset_index(drop=True)
df2=val.bad_ind.groupby(val['level']).sum()
df3=val.bad_ind.groupby(val['level']).count()
print(df2/df3)    
