The second round of the 2020 厦门银行 (Xiamen Bank) data competition ended more than two months ago. Compared with the 2019 edition, I personally think this one was better: it left contestants a lot of room to maneuver, and there is plenty of useful feature-engineering knowledge to be learned from it. I have not yet seen the top teams open-source their solutions, so consider this post a first attempt to get things started. The code shared here comes from the team 厦门夕阳红旅游团, which finished 5th on the A leaderboard and 8th on the B leaderboard. Let's go straight into their code and ideas. (The code is fairly long; you can read the text first and copy the code afterwards if you need it.)
Data Loading
Competition link: https://js.dclab.run/v2/cmptDetail.html?id=439
First, load the test set with a straightforward, brute-force approach:
```python
import pandas as pd
import os
import lightgbm as lgb
import collections
import numpy as np

base_dir = os.getcwd()

def gettest():
    x_test = os.path.join(base_dir, 'x_test')
    cust_avli_Q1 = os.path.join(x_test, 'cust_avli_Q1.csv')
    cust_info_q1 = os.path.join(x_test, 'cust_info_q1.csv')
    aum_test = os.path.join(x_test, 'aum_test')
    aum_m1 = os.path.join(aum_test, 'aum_m1.csv')
    aum_m2 = os.path.join(aum_test, 'aum_m2.csv')
    aum_m3 = os.path.join(aum_test, 'aum_m3.csv')
    behavior_test = os.path.join(x_test, 'behavior_test')
    behavior_m1 = os.path.join(behavior_test, 'behavior_m1.csv')
    behavior_m2 = os.path.join(behavior_test, 'behavior_m2.csv')
    behavior_m3 = os.path.join(behavior_test, 'behavior_m3.csv')
    big_event_test = os.path.join(x_test, 'big_event_test')
    big_event_Q1 = os.path.join(big_event_test, 'big_event_Q1.csv')
    cunkuan_test = os.path.join(x_test, 'cunkuan_test')
    cunkuan_m1 = os.path.join(cunkuan_test, 'cunkuan_m1.csv')
    cunkuan_m2 = os.path.join(cunkuan_test, 'cunkuan_m2.csv')
    cunkuan_m3 = os.path.join(cunkuan_test, 'cunkuan_m3.csv')
    data1 = pd.read_csv(cust_info_q1)
    data2 = pd.read_csv(cust_avli_Q1)
    data = pd.merge(data1, data2, on='cust_no', how='inner')
    list_csv = [aum_m1, aum_m2, aum_m3,
                behavior_m1, behavior_m2, behavior_m3,
                big_event_Q1,
                cunkuan_m1, cunkuan_m2, cunkuan_m3]
    for sir in list_csv:
        tem = pd.read_csv(sir)
        data = pd.merge(data, tem, on='cust_no', how='left')
    return data

test_data = gettest()
```
Load the training set in the same brute-force way:
```python
def gettrain():
    x_train = os.path.join(base_dir, 'x_train')
    y_train = os.path.join(base_dir, 'y_train_3')
    cust_avli_Q3 = os.path.join(x_train, 'cust_avli_Q3.csv')
    cust_info_q3 = os.path.join(x_train, 'cust_info_q3.csv')
    y_Q3_3 = os.path.join(y_train, 'y_Q3_3.csv')

    cust_avli_Q4 = os.path.join(x_train, 'cust_avli_Q4.csv')
    cust_info_q4 = os.path.join(x_train, 'cust_info_q4.csv')
    y_Q4_3 = os.path.join(y_train, 'y_Q4_3.csv')

    aum_train = os.path.join(x_train, 'aum_train')
    aum_m7 = os.path.join(aum_train, 'aum_m7.csv')
    aum_m8 = os.path.join(aum_train, 'aum_m8.csv')
    aum_m9 = os.path.join(aum_train, 'aum_m9.csv')
    aum_m10 = os.path.join(aum_train, 'aum_m10.csv')
    aum_m11 = os.path.join(aum_train, 'aum_m11.csv')
    aum_m12 = os.path.join(aum_train, 'aum_m12.csv')

    behavior_train = os.path.join(x_train, 'behavior_train')
    behavior_m7 = os.path.join(behavior_train, 'behavior_m7.csv')
    behavior_m8 = os.path.join(behavior_train, 'behavior_m8.csv')
    behavior_m9 = os.path.join(behavior_train, 'behavior_m9.csv')
    behavior_m10 = os.path.join(behavior_train, 'behavior_m10.csv')
    behavior_m11 = os.path.join(behavior_train, 'behavior_m11.csv')
    behavior_m12 = os.path.join(behavior_train, 'behavior_m12.csv')

    big_event_train = os.path.join(x_train, 'big_event_train')
    big_event_Q3 = os.path.join(big_event_train, 'big_event_Q3.csv')
    big_event_Q4 = os.path.join(big_event_train, 'big_event_Q4.csv')

    cunkuan_train = os.path.join(x_train, 'cunkuan_train')
    cunkuan_m7 = os.path.join(cunkuan_train, 'cunkuan_m7.csv')
    cunkuan_m8 = os.path.join(cunkuan_train, 'cunkuan_m8.csv')
    cunkuan_m9 = os.path.join(cunkuan_train, 'cunkuan_m9.csv')
    cunkuan_m10 = os.path.join(cunkuan_train, 'cunkuan_m10.csv')
    cunkuan_m11 = os.path.join(cunkuan_train, 'cunkuan_m11.csv')
    cunkuan_m12 = os.path.join(cunkuan_train, 'cunkuan_m12.csv')

    Q3List = [aum_m7, aum_m8, aum_m9, behavior_m7, behavior_m8, behavior_m9,
              big_event_Q3, cunkuan_m7, cunkuan_m8, cunkuan_m9]
    Q4List = [aum_m10, aum_m11, aum_m12, behavior_m10, behavior_m11, behavior_m12,
              big_event_Q4, cunkuan_m10, cunkuan_m11, cunkuan_m12]

    data3 = pd.read_csv(cust_info_q3)
    data3_val = pd.read_csv(cust_avli_Q3)
    for sir in Q3List:
        tem = pd.read_csv(sir)
        data3 = pd.merge(data3, tem, on='cust_no', how='left')

    y_3 = pd.read_csv(y_Q3_3)
    data3 = pd.merge(data3, y_3, on='cust_no', how='left')

    data4 = pd.read_csv(cust_info_q4)
    data4_val = pd.read_csv(cust_avli_Q4)
    for sir in Q4List:
        tem = pd.read_csv(sir)
        data4 = pd.merge(data4, tem, on='cust_no', how='left')

    y_4 = pd.read_csv(y_Q4_3)
    data4 = pd.merge(data4, y_4, on='cust_no', how='left')
    return data3, data3_val, data4, data4_val

train_data3, val_3, train_data4, val_4 = gettrain()
```
Now for the first key takeaway of this post:
This is the most critical step for pushing the online score from around 0.490 to 0.495 and beyond. The code above reads in every user from the third and fourth quarters. Note that Q3 contains 493,441 users in total and Q4 contains 543,823. These users have no labels, only base features, and ordinary feature engineering can hardly make use of them; it almost looks as if the organisers wasted their effort by giving us so much user information. So can these users be used in the model at all? The answer is yes. This is one of the most important things the team discovered and the key to reaching 0.495; we call it making use of invalid users. The figure below shows the number of users of each category in each quarter.
As the figure shows, 6,641 users who are invalid in Q3 become valid users in Q4, and 5,569 users who are invalid in Q4 are valid users in Q1 (the test quarter). When building cross-quarter features, the previous quarter's record does not need to belong to a valid user, because the previous quarter's label is never used. The 6,641 Q3 invalid users and the 5,569 Q4 invalid users are therefore used for the cross-quarter features of Section 5, which noticeably increases the coverage of those features and, in turn, the overall accuracy of the model.
Code for making use of the invalid users:
```python
all_cust_3 = set(train_data3["cust_no"])
all_cust_4 = set(train_data4["cust_no"])
valid_cust3 = set(val_3["cust_no"])
valid_cust4 = set(val_4["cust_no"])
valid_test_cust = set(test_data["cust_no"])

def is_need3(x):
    # keep a Q3 record if the customer is valid in Q3 (has a label)
    # or valid in Q4 (usable for cross-quarter features)
    if x in valid_cust3 or x in valid_cust4:
        return 1
    else:
        return 0

train_data3["is_need"] = train_data3["cust_no"].apply(is_need3)

def is_need4(x):
    # keep a Q4 record if the customer is valid in Q4 or appears in the test set
    if x in valid_cust4 or x in valid_test_cust:
        return 1
    else:
        return 0

train_data4["is_need"] = train_data4["cust_no"].apply(is_need4)
train_data4 = train_data4[train_data4["is_need"] == 1]
train_data3 = train_data3[train_data3["is_need"] == 1]
train_data3 = train_data3.drop(["is_need"], axis=1)
train_data4 = train_data4.drop(["is_need"], axis=1)
```
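As a quick sanity check, the counts quoted above can be reproduced with simple set arithmetic on the ID sets built in the snippet just shown. This is my own sketch, not part of the team's code, and the exact numbers naturally depend on the competition data.

```python
# Sketch: reproduce the "invalid user" counts described above with set operations.
invalid_q3 = all_cust_3 - valid_cust3        # Q3 users without a valid-user flag
invalid_q4 = all_cust_4 - valid_cust4        # Q4 users without a valid-user flag

print(len(invalid_q3 & valid_cust4))         # Q3-invalid users that are valid in Q4 (about 6,641 per the post)
print(len(invalid_q4 & valid_test_cust))     # Q4-invalid users that are valid in Q1 2020 (about 5,569 per the post)
```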
That is everything about data loading, together with its code.
Feature Engineering
Next comes the team's feature engineering, presented in the following parts.
1: Date features
The date features are the time of the most recent transaction in each quarter and the timestamps of the user's major events. A user's type can be judged from when they last used their account: generally, the closer the last transaction is to the end of the quarter, the more likely the user is an upgrading user, meaning an active one; conversely, the more likely the user is a churning one. Based on this, the team mainly computed, for each date field, the number of days to the end of the quarter and the number of days since the start of the quarter, and then applied some binning. The concrete operations are shown in the figure below.
Generally speaking, the closer a user's most recent major event is to the end of the quarter, the less likely that user is to churn in that quarter; the further it is from the quarter end, the more likely the user is a churning user. Similarly, the difference from the start of the quarter can be used as a complementary feature.
```python
import time
import math
import datetime
from functools import partial

def getbetweenday(x, mon):
    # days from the event date to the end of the quarter
    if pd.isna(x):
        return
    x = str(x)
    compare_time = time.strptime(x, "%Y-%m-%d")
    if mon == 1:
        now_time = time.strptime("2020-3-30", "%Y-%m-%d")
    elif mon == 3:
        now_time = time.strptime("2019-9-30", "%Y-%m-%d")
    elif mon == 4:
        now_time = time.strptime("2019-12-31", "%Y-%m-%d")
    date1 = datetime.datetime(compare_time[0], compare_time[1], compare_time[2])
    date2 = datetime.datetime(now_time[0], now_time[1], now_time[2])
    res = (date2 - date1).days
    if res < 0:
        return
    else:
        return res

def getbetweenfirstday(x, mon):
    # days from the start of the quarter to the event date
    if pd.isna(x):
        return
    x = str(x)
    compare_time = time.strptime(x, "%Y-%m-%d")
    if mon == 1:
        now_time = time.strptime("2020-1-1", "%Y-%m-%d")
    elif mon == 3:
        now_time = time.strptime("2019-7-1", "%Y-%m-%d")
    elif mon == 4:
        now_time = time.strptime("2019-10-1", "%Y-%m-%d")
    date1 = datetime.datetime(compare_time[0], compare_time[1], compare_time[2])
    date2 = datetime.datetime(now_time[0], now_time[1], now_time[2])
    res = (date1 - date2).days
    if res < 0:
        return
    else:
        return res

fuc1_getbetweenday = partial(getbetweenday, mon=1)
fuc3_getbetweenday = partial(getbetweenday, mon=3)
fuc4_getbetweenday = partial(getbetweenday, mon=4)

fuc1_getbetweenfirstday = partial(getbetweenfirstday, mon=1)
fuc3_getbetweenfirstday = partial(getbetweenfirstday, mon=3)
fuc4_getbetweenfirstday = partial(getbetweenfirstday, mon=4)

for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 18]:
    tem = "E" + str(i) + "FromNow"
    test_data[tem] = test_data["E" + str(i)].apply(fuc1_getbetweenday)
    train_data3[tem] = train_data3["E" + str(i)].apply(fuc3_getbetweenday)   # quarter end 2019-09-30
    train_data4[tem] = train_data4["E" + str(i)].apply(fuc4_getbetweenday)   # quarter end 2019-12-31
    tem = "E" + str(i) + "FromFirstDay"
    test_data[tem] = test_data["E" + str(i)].apply(fuc1_getbetweenfirstday)
    train_data3[tem] = train_data3["E" + str(i)].apply(fuc3_getbetweenfirstday)
    train_data4[tem] = train_data4["E" + str(i)].apply(fuc4_getbetweenfirstday)

f1 = ["E" + str(i) + "FromNow" for i in range(1, 19) if i != 15 and i != 17 and i != 11]
test_data["E19" + "FromNow"] = test_data[f1].min(axis=1)
train_data3["E19" + "FromNow"] = train_data3[f1].min(axis=1)
train_data4["E19" + "FromNow"] = train_data4[f1].min(axis=1)
f1 = ["E" + str(i) + "FromFirstDay" for i in range(1, 19) if i != 15 and i != 17 and i != 11]
test_data["E19" + "FromFirstDay"] = test_data[f1].min(axis=1)
train_data3["E19" + "FromFirstDay"] = train_data3[f1].min(axis=1)
train_data4["E19" + "FromFirstDay"] = train_data4[f1].min(axis=1)

f1 = ["E" + str(i) + "FromNow" for i in [1, 2, 3, 4, 5]]
test_data["E20" + "FromNow"] = test_data[f1].min(axis=1)
train_data3["E20" + "FromNow"] = train_data3[f1].min(axis=1)
train_data4["E20" + "FromNow"] = train_data4[f1].min(axis=1)
f1 = ["E" + str(i) + "FromFirstDay" for i in [1, 2, 3, 4, 5]]
test_data["E20" + "FromFirstDay"] = test_data[f1].min(axis=1)
train_data3["E20" + "FromFirstDay"] = train_data3[f1].min(axis=1)
train_data4["E20" + "FromFirstDay"] = train_data4[f1].min(axis=1)

f1 = ["E" + str(i) + "FromNow" for i in [10, 12, 13, 14]]
test_data["E23" + "FromNow"] = test_data[f1].min(axis=1)
train_data3["E23" + "FromNow"] = train_data3[f1].min(axis=1)
train_data4["E23" + "FromNow"] = train_data4[f1].min(axis=1)
f1 = ["E" + str(i) + "FromFirstDay" for i in [10, 12, 13, 14]]
test_data["E23" + "FromFirstDay"] = test_data[f1].min(axis=1)
train_data3["E23" + "FromFirstDay"] = train_data3[f1].min(axis=1)
train_data4["E23" + "FromFirstDay"] = train_data4[f1].min(axis=1)

f1 = ["E" + str(i) + "FromNow" for i in [16, 18]]
test_data["E24" + "FromNow"] = test_data[f1].min(axis=1)
train_data3["E24" + "FromNow"] = train_data3[f1].min(axis=1)
train_data4["E24" + "FromNow"] = train_data4[f1].min(axis=1)
f1 = ["E" + str(i) + "FromFirstDay" for i in [16, 18]]
test_data["E24" + "FromFirstDay"] = test_data[f1].min(axis=1)
train_data3["E24" + "FromFirstDay"] = train_data3[f1].min(axis=1)
train_data4["E24" + "FromFirstDay"] = train_data4[f1].min(axis=1)

test_data["E26" + "FromNow"] = test_data["E16" + "FromNow"] - test_data["E18" + "FromNow"]
train_data3["E26" + "FromNow"] = train_data3["E16" + "FromNow"] - train_data3["E18" + "FromNow"]
train_data4["E26" + "FromNow"] = train_data4["E16" + "FromNow"] - train_data4["E18" + "FromNow"]
test_data["E26" + "FromFirstDay"] = test_data["E16" + "FromFirstDay"] - test_data["E18" + "FromFirstDay"]
train_data3["E26" + "FromFirstDay"] = train_data3["E16" + "FromFirstDay"] - train_data3["E18" + "FromFirstDay"]
train_data4["E26" + "FromFirstDay"] = train_data4["E16" + "FromFirstDay"] - train_data4["E18" + "FromFirstDay"]

test_data["E25" + "FromNow"] = test_data["E10" + "FromNow"] - test_data["E3" + "FromNow"]
train_data3["E25" + "FromNow"] = train_data3["E10" + "FromNow"] - train_data3["E3" + "FromNow"]
train_data4["E25" + "FromNow"] = train_data4["E10" + "FromNow"] - train_data4["E3" + "FromNow"]
test_data["E25" + "FromFirstDay"] = test_data["E10" + "FromFirstDay"] - test_data["E3" + "FromFirstDay"]
train_data3["E25" + "FromFirstDay"] = train_data3["E10" + "FromFirstDay"] - train_data3["E3" + "FromFirstDay"]
train_data4["E25" + "FromFirstDay"] = train_data4["E10" + "FromFirstDay"] - train_data4["E3" + "FromFirstDay"]

# binary "happened within the last N days" flags (note: str(j), not str(i))
for j in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 18, 19, 20, 23]:
    for day in [7, 15, 30, 90, 365]:
        tem = "E" + str(j) + "Less" + str(day) + "day"
        test_data[tem] = (test_data["E" + str(j) + "FromNow"] < day).astype(int)
        train_data3[tem] = (train_data3["E" + str(j) + "FromNow"] < day).astype(int)
        train_data4[tem] = (train_data4["E" + str(j) + "FromNow"] < day).astype(int)
for j in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 18, 19, 20, 23]:
    for day in [30, 45, 60]:
        tem = "E" + str(j) + "Less" + str(day) + "day"
        test_data[tem] = (test_data["E" + str(j) + "FromFirstDay"] < day).astype(int)
        train_data3[tem] = (train_data3["E" + str(j) + "FromFirstDay"] < day).astype(int)
        train_data4[tem] = (train_data4["E" + str(j) + "FromFirstDay"] < day).astype(int)

lastFea_E = ["E15", "E17", "E16FromNow", "E18FromNow", "E19FromNow", "E20FromNow", "E23FromNow"]
```
Besides the date fields among the major-event features, there is also the B6 feature, which is handled in much the same way:
```python
def StringToTime(x):
    # convert "YYYY-mm-dd HH:MM:SS" strings to a unix timestamp, 0 for missing values
    if pd.isna(x) == False:
        timeArray = time.strptime(x, "%Y-%m-%d %H:%M:%S")
        timeStamp = int(time.mktime(timeArray))
        return timeStamp
    else:
        return 0

test_data["B6"] = test_data["B6"].apply(StringToTime)      # maximum value is 2020-3-31
train_data3["B6"] = train_data3["B6"].apply(StringToTime)  # maximum value is 2019-09-30 23:59:00
train_data4["B6"] = train_data4["B6"].apply(StringToTime)  # maximum value is 2019-12-31 23:34:00

def data_from_last_Of_month(x, mon):
    # days between B6 and the end of the corresponding quarter
    x = int(x)
    if x == 0:
        return
    else:
        nowtime = 0
        if mon == 1:
            nowtime = 1585670399   # 2020-03-31 23:59:59 (Beijing time)
        elif mon == 3:
            nowtime = 1569859199   # 2019-09-30 23:59:59 (Beijing time)
        elif mon == 4:
            nowtime = 1577807999   # 2019-12-31 23:59:59 (Beijing time)
        if nowtime - x < 0:
            return
        else:
            return (nowtime - x) / 3600 / 24

fuction3 = partial(data_from_last_Of_month, mon=3)
fuction1 = partial(data_from_last_Of_month, mon=1)
fuction4 = partial(data_from_last_Of_month, mon=4)
test_data["B6_2"] = test_data["B6"].apply(fuction1)
train_data3["B6_2"] = train_data3["B6"].apply(fuction3)
train_data4["B6_2"] = train_data4["B6"].apply(fuction4)

test_data["B6_3"] = test_data["B7"] / test_data["B6_2"]
train_data3["B6_3"] = train_data3["B7"] / train_data3["B6_2"]
train_data4["B6_3"] = train_data4["B7"] / train_data4["B6_2"]

def below_theday(df, day):
    df["B6_2Less" + str(day)] = (df["B6_2"] < day).astype(int)
    return df

for te in [1, 7, 15, 30, 60, 90, 180, 365]:
    test_data = below_theday(test_data, te)
    train_data3 = below_theday(train_data3, te)
```
2: Label feature
The label feature is the user's label from the previous quarter, used directly as a feature. The previous quarter's label can be -1, 0 or 1; the remaining users have no previous-quarter label, and I further split them into two groups, those who were invalid users in the previous quarter and those with no record at all in the previous quarter. The final feature therefore has five categories: -1, 0 and 1 are the previous quarter's labels, 2 means the user was invalid in the previous quarter, and 3 means the user has no record in the previous quarter.
```python
import collections

all_cust_3 = set(train_data3["cust_no"])
all_cust_4 = set(train_data4["cust_no"])
valid_cust3 = set(val_3["cust_no"])
valid_cust4 = set(val_4["cust_no"])

train_data3["last_label"] = 2
train_data4["last_label"] = pd.merge(train_data4[["cust_no"]], train_data3[["cust_no", "label"]],
                                     on="cust_no", how="left")["label"]
train_data4["last_label"] = train_data4["last_label"].fillna(2)
test_data["last_label"] = pd.merge(test_data[["cust_no"]], train_data4[["cust_no", "label"]],
                                   on="cust_no", how="left")["label"]
test_data["last_label"] = test_data["last_label"].fillna(2)

def get_last_isvalid_3(x):
    if x not in valid_cust3:
        if x in all_cust_3:
            return 2
        else:
            return 3

def get_last_isvalid_4(x):
    if x not in valid_cust4:
        if x in all_cust_4:
            return 2
        else:
            return 3

train_data4.loc[train_data4["last_label"] == 2, "last_label"] = \
    train_data4.loc[train_data4["last_label"] == 2, "cust_no"].apply(get_last_isvalid_3)
test_data.loc[test_data["last_label"] == 2, "last_label"] = \
    test_data.loc[test_data["last_label"] == 2, "cust_no"].apply(get_last_isvalid_4)

train_data3["jidu"] = 3
train_data4["jidu"] = 4
test_data["jidu"] = 1
train_data = pd.concat([train_data3, train_data4]).reset_index(drop=True)
train_data.index = range(len(train_data))
test_data['label'] = 2
all_data = pd.concat([train_data, test_data]).reset_index(drop=True)
```
3: Behavior table features
The numerical behaviour features in this table are, per month, the number of mobile-banking logins, the number of transfers in, the number of transfers out, the amount transferred in and the amount transferred out, plus one quarterly feature, the number of account changes within the quarter. As with the balance features, the first step is to cross the different features with each other, as shown in the figure below, where grey marks the raw features and orange the crossed ones. The transaction "CTR", for instance, is computed as the number of transactions divided by the number of logins. Crossing features in this way digs out deeper aspects of user behaviour.
```python
for i in range(1, 6):
    for j in ["", "_x", "_y"]:
        tem = "B" + str(i) + j
        all_data[tem].fillna(all_data[tem].mean(), inplace=True)

for i in ["", "_x", "_y"]:
    all_data["B8" + i] = all_data["B2" + i] + all_data["B4" + i]              # number of transactions
    all_data["B9" + i] = all_data["B3" + i] - all_data["B5" + i]              # transfer-in minus transfer-out
    all_data["B10" + i] = all_data["B3" + i] / (all_data["B2" + i] + 1.001)   # average amount per transfer-in
    all_data["B11" + i] = all_data["B5" + i] / (all_data["B4" + i] + 1.001)   # average amount per transfer-out
    all_data["B12" + i] = all_data["B3" + i] + all_data["B5" + i]             # total inflow plus outflow
    all_data["B13" + i] = all_data["B8" + i] / (all_data["B1" + i] + 1.001)   # transactions per login, analogous to CTR in recommendation

def behavior_m(all_data, tem):
    all_data[tem + "sub" + tem + "_x"] = all_data[tem] - all_data[tem + "_x"]
    all_data[tem + "sub" + tem + "_y"] = all_data[tem] - all_data[tem + "_y"]
    all_data[tem + "_x" + "sub" + tem + "_y"] = all_data[tem + "_x"] - all_data[tem + "_y"]
    all_data[tem + "div" + tem + "_x"] = all_data[tem] / (all_data[tem + "_x"] + 1.0001)
    all_data[tem + "div" + tem + "_y"] = all_data[tem] / (all_data[tem + "_y"] + 1.0001)
    all_data[tem + "_x" + "div" + tem + "_y"] = all_data[tem + "_x"] / (all_data[tem + "_y"] + 1.0001)
    all_data["all2_" + tem] = all_data[tem] + all_data[tem + "_x"] / 2 + all_data[tem + "_y"] / 3
    all_data["all3_" + tem] = all_data[tem] + all_data[tem + "_x"] + all_data[tem + "_y"]
    all_data["mid_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].median(axis=1)
    all_data["min_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].min(axis=1)
    all_data["max_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].max(axis=1)
    return all_data

for i in range(1, 14):
    if i == 6 or i == 7:
        continue
    tem = "B" + str(i)
    all_data = behavior_m(all_data, tem)

all_data["B2sumDivB4sum"] = all_data["all3_" + "B2"] - all_data["all3_" + "B4"]
all_data["B2sumAddB4sum"] = all_data["all3_" + "B2"] + all_data["all3_" + "B4"]
all_data["B2sumAddB4sum_divB7"] = (all_data["B2sumAddB4sum"] - all_data["B7"]).astype(int)

lastFae_B = ["B" + str(i) for i in range(1, 14) if i != 6]
lastFae_tempB = [j + "_B" + str(i) for i in range(1, 14) if i != 6 and i != 7
                 for j in ["max", "min", "all3", "mid", "all2"]]
lastFae_B = lastFae_B + lastFae_tempB
lastFae_B.append("B6_2")
lastFae_B.append("B2sumDivB4sum")
lastFae_B.append("B2sumAddB4sum")
lastFae_B.append("B2sumAddB4sum_divB7")
```
4: Balance and deposit features
Deposit and balance information is key to judging a user's type. The deposit features are the ones shown in the figure below; aggregating them produces several new features. On the left are the raw features, on the right the crossed features and how they are computed. The main idea is to combine the different balances: summing all of them can be seen as the user's total month-end balance, summing everything except loans can be seen as the account balance, and there is also a debt-ratio feature.
```python
for i in range(1, 9):
    for j in ["", "_x", "_y"]:
        tem = "X" + str(i) + j
        all_data[tem].fillna(all_data[tem].mean(), inplace=True)

all_data["X9"] = all_data["X1"] + all_data["X2"] + all_data["X3"] + all_data["X4"] + all_data["X5"] \
                 + all_data["X6"] - all_data["X7"] + all_data["X8"]
all_data["X9_x"] = all_data["X1_x"] + all_data["X2_x"] + all_data["X3_x"] + all_data["X4_x"] + all_data["X5_x"] \
                   + all_data["X6_x"] - all_data["X7_x"] + all_data["X8_x"]
all_data["X9_y"] = all_data["X1_y"] + all_data["X2_y"] + all_data["X3_y"] + all_data["X4_y"] + all_data["X5_y"] \
                   + all_data["X6_y"] - all_data["X7_y"] + all_data["X8_y"]

all_data["X10"] = all_data["X1"] + all_data["X2"] + all_data["X3"] + all_data["X4"] + all_data["X5"] \
                  + all_data["X6"] + all_data["X8"]
all_data["X10_x"] = all_data["X1_x"] + all_data["X2_x"] + all_data["X3_x"] + all_data["X4_x"] + all_data["X5_x"] \
                    + all_data["X6_x"] + all_data["X8_x"]
all_data["X10_y"] = all_data["X1_y"] + all_data["X2_y"] + all_data["X3_y"] + all_data["X4_y"] + all_data["X5_y"] \
                    + all_data["X6_y"] + all_data["X8_y"]

all_data["X11"] = all_data["X3"] + all_data["X4"] + all_data["X5"]
all_data["X11_x"] = all_data["X3_x"] + all_data["X4_x"] + all_data["X5_x"]
all_data["X11_y"] = all_data["X3_y"] + all_data["X4_y"] + all_data["X5_y"]

all_data["X12"] = all_data["X7"] / (all_data["X10"] + 1)
all_data["X12_x"] = all_data["X7_x"] / all_data["X10_x"]
all_data["X12_y"] = all_data["X7_y"] / all_data["X10_y"]

all_data["X13"] = all_data["X9"] - all_data["C1"]
all_data["X13_x"] = all_data["X9_x"] - all_data["C1_x"]
all_data["X13_y"] = all_data["X9_y"] - all_data["C1_y"]

fea_x = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"]

def amm_m(all_data, fe, fe2):
    tem = fe2 + fe
    all_data[tem + "sub" + tem + "_x"] = all_data[tem] - all_data[tem + "_x"]
    all_data[tem + "sub" + tem + "_y"] = all_data[tem] - all_data[tem + "_y"]
    all_data[tem + "_x" + "sub" + tem + "_y"] = all_data[tem + "_x"] - all_data[tem + "_y"]

    all_data[tem + "div" + tem + "_x"] = all_data[tem] / (all_data[tem + "_x"] + 1.0001)
    all_data[tem + "div" + tem + "_y"] = all_data[tem] / (all_data[tem + "_y"] + 1.0001)
    all_data[tem + "_x" + "div" + tem + "_y"] = all_data[tem + "_x"] / (all_data[tem + "_y"] + 1.0001)

    all_data["all2_" + tem] = all_data[tem] + all_data[tem + "_x"] / 2 + all_data[tem + "_y"] / 3
    all_data["all3_" + tem] = all_data[tem] + all_data[tem + "_x"] + all_data[tem + "_y"]

    all_data["max_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].max(axis=1)
    all_data["min_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].min(axis=1)
    all_data["mid_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].median(axis=1)
    return all_data

for fe in fea_x:
    all_data = amm_m(all_data, fe, "X")

for i in range(1, 3):
    for j in ["", "_x", "_y"]:
        tem = "C" + str(i) + j
        all_data[tem].fillna(all_data[tem].mean(), inplace=True)

all_data["C3"] = all_data["C1"] / (all_data["C2"] + 1)
all_data["C3_x"] = all_data["C1_x"] / (all_data["C2_x"] + 1)
all_data["C3_y"] = all_data["C1_y"] / (all_data["C2_y"] + 1)
for fe in ["1", "2", "3"]:
    all_data = amm_m(all_data, fe, "C")

lastFea_x = ["X" + str(i) for i in range(1, 14)]
lastFea_x1 = [j + "_X" + str(i) for i in range(1, 14) for j in ["max", "min", "all3", "mid", "all2"]]
lastFea_c = ["C" + str(i) for i in range(1, 4)]
lastFea_c1 = [j + "_C" + str(i) for i in range(1, 4) for j in ["max", "min", "all3", "mid", "all2"]]
last_fea_XC = lastFea_x + lastFea_x1 + lastFea_c + lastFea_c1
```
5: Cross-quarter features
The cross-quarter difference features are the single most important features for getting a high score. We cross a user's features across two quarters to form new ones: given a user's Q3 and Q4 values of some feature, say a balance feature F1, we can add the two quarters, subtract Q3 from Q4, or divide Q4 by Q3. The resulting features are treated as new Q4 features and, together with the original Q4 features, form the training features. The concrete operations are shown in the figure below.
Note that this is not done for the basic user-profile features, because those generally change very little. The idea behind this kind of feature engineering is that a user's type is a dynamic process: the bank judges a user's type from their behaviour over the past several months or even quarters. For example, if a user's balance or activity is very low this quarter, they would normally be judged a churning user, but if their balance or activity was high last quarter, the bank may weigh the two against each other.
```python
x_train_3 = all_data[all_data["jidu"] == 3]
x_train_4 = all_data[all_data["jidu"] == 4]
x_text_1 = all_data[all_data["jidu"] == 1]

for tem in last_fea_XC + lastFae_B + lastFea_E:
    fea1 = tem
    x_train_3[fea1 + "diff"] = 0
    temp = pd.merge(x_train_4[["cust_no", fea1]], x_train_3[["cust_no", fea1]], on="cust_no", how="left")
    x_train_4[fea1 + "diff"] = list(temp[fea1 + "_x"] - temp[fea1 + "_y"])

    if tem not in ["E16FromNow", "E18FromNow", "B6_2", "E19FromNow", "E20FromNow", "E23FromNow"]:
        x_train_4[fea1 + "add"] = list(temp[fea1 + "_x"] + temp[fea1 + "_y"])
        x_train_4[fea1 + "div"] = list(temp[fea1 + "_x"] / (temp[fea1 + "_y"] + 1.0001))

    temp = pd.merge(x_text_1[["cust_no", fea1]], x_train_4[["cust_no", fea1]], on="cust_no", how="left")
    x_text_1[fea1 + "diff"] = list(temp[fea1 + "_x"] - temp[fea1 + "_y"])

    if tem not in ["E16FromNow", "E18FromNow", "B6_2", "E19FromNow", "E20FromNow", "E23FromNow"]:
        x_text_1[fea1 + "add"] = list(temp[fea1 + "_x"] + temp[fea1 + "_y"])
        x_text_1[fea1 + "div"] = list(temp[fea1 + "_x"] / (temp[fea1 + "_y"] + 1.0001))

all_data = pd.concat([x_train_3, x_train_4, x_text_1]).reset_index(drop=True)
```
6: Basic user-profile features
The basic user-profile information is fairly simple and falls into two groups: numerical features such as age and income, and categorical features such as customer tier. As shown in the figure below, the numerical features are binned, and since the categorical features do not have many levels the team one-hot encoded them; one-hot encoding lets the model learn feature combinations better. If you use LightGBM, you can alternatively pass them as categorical features.
```python
def getcust_1(x):
    # cust_no is a hexadecimal string; convert it to its decimal representation
    return str(int(x, 16))

all_data["cust_no_10"] = all_data["cust_no"].apply(getcust_1)
all_data["cust_no_1"] = all_data["cust_no_10"].apply(lambda x: x[:3] == "300").astype(int)
all_data["cust_no_2"] = all_data["cust_no_10"].apply(lambda x: int(x[4:]))

all_data['I5'] = all_data['I5'].fillna('不便分类的其他从业人员')
all_data['I13'] = all_data['I13'].fillna('未说明的婚姻状况')
all_data['I14'] = all_data['I14'].fillna('其他')

def get_age(df, col):
    # binarise age (I2) against a set of thresholds
    df[col + "_18"] = (df.I2 > 18).astype(int)
    df[col + "_25"] = (df.I2 > 25).astype(int)
    df[col + "_30"] = (df.I2 > 30).astype(int)
    df[col + "_40"] = (df.I2 > 40).astype(int)
    df[col + "_50"] = (df.I2 > 50).astype(int)
    df[col + "_60"] = (df.I2 > 60).astype(int)
    df[col + "_70"] = (df.I2 > 70).astype(int)
    df[col + "_80"] = (df.I2 > 80).astype(int)
    return df

all_data['I1'] = (all_data.I1 == "男").astype(int)   # gender to 0/1
all_data = get_age(all_data, 'I2')                   # age thresholds
# all_data['I12'] = (all_data.I12 == "个人").astype(int)   # convert to 0/1

def get_salary(df, col):
    # binarise household income (I11) against a set of thresholds
    df[col + "_0"] = (df.I11 > 0).astype(int)
    df[col + "_1"] = (df.I11 > 16559.999).astype(int)
    df[col + "_2"] = (df.I11 > 273000.0).astype(int)
    df[col + "_3"] = (df.I11 > 600000.0).astype(int)
    df[col + "_4"] = (df.I11 > 960000.0).astype(int)
    df[col + "_5"] = (df.I11 > 1581288).astype(int)
    return df

all_data = get_salary(all_data, "I11")               # household-income thresholds

for col in ["I3", "I5", "I10", "I13", "I14", "last_label"]:
    all_data = pd.merge(all_data.drop(col, axis=1), pd.get_dummies(all_data[col], col),
                        left_index=True, right_index=True)
all_data = all_data.drop(["I7", "I9", "E11", "I2", "I11", "I8"], axis=1)   # drop columns that are entirely empty
```
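The alternative mentioned above, passing the raw categories to LightGBM instead of one-hot encoding them, could look roughly like the sketch below. This is my own illustration rather than part of the team's pipeline; it assumes the get_dummies step is skipped and the original string columns are kept.

```python
# Sketch: declare the raw categorical columns as LightGBM categorical features
# (an alternative to the one-hot encoding above, not the team's actual pipeline).
cat_cols = ["I3", "I5", "I10", "I13", "I14", "last_label"]
for c in cat_cols:
    all_data[c] = all_data[c].astype("category")

# Later, when building the training Dataset, the categorical columns would be
# kept among the model features and declared explicitly, e.g.:
# trn_data = lgb.Dataset(train_x[col], train_y, categorical_feature=cat_cols)
```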
LightGBM Model
The team's final score came from blending a single-seed XGBoost model with a single-seed LightGBM model. The LightGBM model scored higher online, so that is the modelling code being open-sourced here:
```python
test_data = all_data[all_data["label"] == 2]
test_data = test_data.drop(["label", "jidu"], axis=1)
train_data = all_data[all_data["jidu"] == 4]
train_data = pd.merge(train_data, val_4, on='cust_no', how='inner')
fea = train_data.columns.values.tolist()
fea.remove('label')
fea.remove("jidu")
y_train = train_data.label
x_train = train_data[fea]
obj_fea = []
for index, value in enumerate(x_train.columns):
    if (train_data[value]).dtype == "object":
        obj_fea.append(value)
y_train = y_train + 1   # multiclass labels must be 0/1/2
col = [i for i in x_train.columns if i not in obj_fea]

params = {
    'bagging_freq': 1,
    'bagging_fraction': 0.8,
    'bagging_seed': 20201101,
    'boost': 'gbdt',
    'feature_fraction': 0.8,
    'feature_fraction_seed': 20201101,
    'learning_rate': 0.05,
    'max_depth': 8,
    'metric': 'multi_logloss',
    'min_data_in_leaf': 20,
    'num_leaves': 32,
    'num_threads': 6,
    'objective': 'multiclass',
    'num_class': 3,
    'lambda_l1': 0.5,
    'lambda_l2': 1.2,
    'verbosity': 1,
    # 'max_bin': 64,
    'device': 'cpu',
}

from sklearn.model_selection import KFold
import datetime

print(datetime.datetime.now())
folds = KFold(n_splits=10, shuffle=True, random_state=1996)
predictions2 = np.zeros([test_data.shape[0], 3])
preds2 = np.zeros([x_train.shape[0], 3])
val_label = []
for fold_, (train_index, test_index) in enumerate(folds.split(x_train, y_train)):
    print("fold {}".format(fold_))
    train_x, test_x = x_train.iloc[train_index], x_train.iloc[test_index]
    train_y, test_y = y_train.iloc[train_index], y_train.iloc[test_index]

    trn_data = lgb.Dataset(train_x[col], train_y)
    val_data = lgb.Dataset(test_x[col], test_y)
    num_round = 50000
    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=100, early_stopping_rounds=500,
                    )
    val_train = clf.predict(test_x[col], num_iteration=clf.best_iteration)
    preds2[test_index] = val_train
    val_pred = clf.predict(test_data[col], num_iteration=clf.best_iteration)
    predictions2 += val_pred / 10   # average over the 10 folds

print(datetime.datetime.now())
```
Finally, the proportions of 0, 1 and -1 in the predictions need to be tuned against the offline results so as to maximise the kappa score, and the tuned proportions are then applied to the online predictions. There are quite a few open-source kappa-search implementations on Kaggle, so we will not reproduce one here.
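For completeness, here is a rough sketch of what such a search could look like. It is my own illustration, not from the original post, and it assumes plain Cohen's kappa from scikit-learn as the offline metric; it reuses the out-of-fold predictions preds2 and the test predictions predictions2 from the code above, and the weight grid as well as the submission column names are only illustrative.

```python
# Sketch: grid-search per-class weights on the out-of-fold probabilities to
# maximise offline kappa, then apply the same weights to the test predictions.
from itertools import product
from sklearn.metrics import cohen_kappa_score

best_kappa, best_w = -1.0, (1.0, 1.0, 1.0)
grid = np.arange(0.8, 1.25, 0.05)            # illustrative search range
for w in product(grid, repeat=3):
    oof_label = (preds2 * np.array(w)).argmax(axis=1)
    k = cohen_kappa_score(y_train, oof_label)
    if k > best_kappa:
        best_kappa, best_w = k, w

# apply the best weights to the test predictions and map 0/1/2 back to -1/0/1
test_label = (predictions2 * np.array(best_w)).argmax(axis=1) - 1
submission = pd.DataFrame({"cust_no": test_data["cust_no"], "label": test_label})
```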
Source: https://mp.weixin.qq.com/s?__biz=MzAxOTU5NTU4MQ==&mid=2247484853&idx=1&sn=6cf580f6d548e9f63c86ab405c6ff502&chksm=9bc5ede7acb264f1fa7e02372f5e92dac17392c8335fe664297d63a8064aa21bd762bccc6875&mpshare=1&scene=23&srcid=0222ALobZ0BPWMdquqpbEVrz&sharer_sharetime=1613965675967&sharer_shareid=628acadc38c0d97ed34eb70efc08eaf5#rd
Summary
According to the author of the code, there are a few more lessons from this competition worth sharing:
1: Training only on Q4 works better than training on Q3 and Q4 together, probably because the Q3 rows cannot obtain the cross-quarter feature information.
2: Treating the task as regression and then thresholding the predictions works worse than direct three-class classification.
3: The cross-quarter features and the use of invalid users are the most important reasons the model does well; the kappa threshold search is the most direct reason for a good online score.