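The semi-final of the 2020 Xiamen Bank competition ended more than two months ago. Compared with the 2019 competition, I personally think this one was better: it left contestants a lot of room to work, and there is more useful feature-engineering knowledge to learn from it. I have not yet seen an open-source release from the top-ranked teams, so we will get the ball rolling. The code shared here comes from the team "廈門夕陽紅旅遊團", which placed fifth on leaderboard A and eighth on leaderboard B. Let's go straight to their code and ideas. (The code is long; you can read the text first and copy the code only if you need it.)
Data loading
Competition link: https://js.dclab.run/v2/cmptDetail.html?id=439
Read the test set with a simple, brute-force approach: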
import collections
import os

import lightgbm as lgb
import numpy as np
import pandas as pd

base_dir = os.getcwd()


def gettest():
    """Load the Q1 test set: valid users joined with every monthly and quarterly table."""
    x_test = os.path.join(base_dir, 'x_test')
    cust_avli_Q1 = os.path.join(x_test, 'cust_avli_Q1.csv')
    cust_info_q1 = os.path.join(x_test, 'cust_info_q1.csv')
    aum_test = os.path.join(x_test, 'aum_test')
    aum_m1 = os.path.join(aum_test, 'aum_m1.csv')
    aum_m2 = os.path.join(aum_test, 'aum_m2.csv')
    aum_m3 = os.path.join(aum_test, 'aum_m3.csv')
    behavior_test = os.path.join(x_test, 'behavior_test')
    behavior_m1 = os.path.join(behavior_test, 'behavior_m1.csv')
    behavior_m2 = os.path.join(behavior_test, 'behavior_m2.csv')
    behavior_m3 = os.path.join(behavior_test, 'behavior_m3.csv')
    big_event_test = os.path.join(x_test, 'big_event_test')
    big_event_Q1 = os.path.join(big_event_test, 'big_event_Q1.csv')
    cunkuan_test = os.path.join(x_test, 'cunkuan_test')
    cunkuan_m1 = os.path.join(cunkuan_test, 'cunkuan_m1.csv')
    cunkuan_m2 = os.path.join(cunkuan_test, 'cunkuan_m2.csv')
    cunkuan_m3 = os.path.join(cunkuan_test, 'cunkuan_m3.csv')
    data1 = pd.read_csv(cust_info_q1)
    data2 = pd.read_csv(cust_avli_Q1)
    # Keep only users present in both the info table and the valid-user list
    data = pd.merge(data1, data2, on='cust_no', how='inner')
    list_csv = [aum_m1, aum_m2, aum_m3, behavior_m1, behavior_m2, behavior_m3, big_event_Q1,
                cunkuan_m1, cunkuan_m2, cunkuan_m3]
    for sir in list_csv:
        tem = pd.read_csv(sir)
        data = pd.merge(data, tem, on='cust_no', how='left')
    return data


test_data = gettest()
Read the training set, again the brute-force way:
def gettrain():
    """Load Q3 and Q4 training data (all users) plus the valid-user lists of both quarters."""
    x_train = os.path.join(base_dir, 'x_train')
    y_train = os.path.join(base_dir, 'y_train_3')
    cust_avli_Q3 = os.path.join(x_train, 'cust_avli_Q3.csv')
    cust_info_q3 = os.path.join(x_train, 'cust_info_q3.csv')
    y_Q3_3 = os.path.join(y_train, 'y_Q3_3.csv')

    cust_avli_Q4 = os.path.join(x_train, 'cust_avli_Q4.csv')
    cust_info_q4 = os.path.join(x_train, 'cust_info_q4.csv')
    y_Q4_3 = os.path.join(y_train, 'y_Q4_3.csv')

    aum_train = os.path.join(x_train, 'aum_train')
    aum_m7 = os.path.join(aum_train, 'aum_m7.csv')
    aum_m8 = os.path.join(aum_train, 'aum_m8.csv')
    aum_m9 = os.path.join(aum_train, 'aum_m9.csv')
    aum_m10 = os.path.join(aum_train, 'aum_m10.csv')
    aum_m11 = os.path.join(aum_train, 'aum_m11.csv')
    aum_m12 = os.path.join(aum_train, 'aum_m12.csv')

    behavior_train = os.path.join(x_train, 'behavior_train')
    behavior_m7 = os.path.join(behavior_train, 'behavior_m7.csv')
    behavior_m8 = os.path.join(behavior_train, 'behavior_m8.csv')
    behavior_m9 = os.path.join(behavior_train, 'behavior_m9.csv')
    behavior_m10 = os.path.join(behavior_train, 'behavior_m10.csv')
    behavior_m11 = os.path.join(behavior_train, 'behavior_m11.csv')
    behavior_m12 = os.path.join(behavior_train, 'behavior_m12.csv')

    big_event_train = os.path.join(x_train, 'big_event_train')
    big_event_Q3 = os.path.join(big_event_train, 'big_event_Q3.csv')
    big_event_Q4 = os.path.join(big_event_train, 'big_event_Q4.csv')

    cunkuan_train = os.path.join(x_train, 'cunkuan_train')
    cunkuan_m7 = os.path.join(cunkuan_train, 'cunkuan_m7.csv')
    cunkuan_m8 = os.path.join(cunkuan_train, 'cunkuan_m8.csv')
    cunkuan_m9 = os.path.join(cunkuan_train, 'cunkuan_m9.csv')
    cunkuan_m10 = os.path.join(cunkuan_train, 'cunkuan_m10.csv')
    cunkuan_m11 = os.path.join(cunkuan_train, 'cunkuan_m11.csv')
    cunkuan_m12 = os.path.join(cunkuan_train, 'cunkuan_m12.csv')

    Q3List = [aum_m7, aum_m8, aum_m9, behavior_m7, behavior_m8, behavior_m9, big_event_Q3,
              cunkuan_m7, cunkuan_m8, cunkuan_m9]
    Q4List = [aum_m10, aum_m11, aum_m12, behavior_m10, behavior_m11, behavior_m12, big_event_Q4,
              cunkuan_m10, cunkuan_m11, cunkuan_m12]

    # Q3: every user in the info table (valid or not), with features and labels left-joined
    data3 = pd.read_csv(cust_info_q3)
    data3_val = pd.read_csv(cust_avli_Q3)
    for sir in Q3List:
        tem = pd.read_csv(sir)
        data3 = pd.merge(data3, tem, on='cust_no', how='left')

    y_3 = pd.read_csv(y_Q3_3)
    data3 = pd.merge(data3, y_3, on='cust_no', how='left')

    # Q4: same procedure
    data4 = pd.read_csv(cust_info_q4)
    data4_val = pd.read_csv(cust_avli_Q4)
    for sir in Q4List:
        tem = pd.read_csv(sir)
        data4 = pd.merge(data4, tem, on='cust_no', how='left')

    y_4 = pd.read_csv(y_Q4_3)
    data4 = pd.merge(data4, y_4, on='cust_no', how='left')
    return data3, data3_val, data4, data4_val


train_data3, val_3, train_data4, val_4 = gettrain()
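One consequence of this loading pattern is easy to miss and matters for all the later feature code: the monthly files share column names, so repeated merges on `cust_no` rename them. The sketch below uses made-up toy frames (the values and the single column `X1` are assumptions for illustration) to show that the first month ends up as `X1_x`, the second as `X1_y`, and the third keeps the plain name `X1`.

```python
import pandas as pd

# Toy stand-ins for the info table and three monthly aum files (values are made up)
base = pd.DataFrame({"cust_no": ["a", "b"]})
m1 = pd.DataFrame({"cust_no": ["a", "b"], "X1": [10, 20]})
m2 = pd.DataFrame({"cust_no": ["a", "b"], "X1": [11, 21]})
m3 = pd.DataFrame({"cust_no": ["a", "b"], "X1": [12, 22]})

data = base
for month in (m1, m2, m3):
    # Same left-merge pattern as the loading functions
    data = pd.merge(data, month, on="cust_no", how="left")

# The second merge has a name clash, so pandas adds the default _x/_y suffixes;
# the third merge has no clash any more, so the last month keeps the bare name.
print(list(data.columns))
```

This is why the feature code that follows iterates over the suffixes `""`, `"_x"` and `"_y"`: the unsuffixed column is the last month of the quarter.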
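Here is the first key insight of this article:
This is the most critical step for moving the online score from 0.490 to 0.495 and above. The code above reads the information of all users in the third and fourth quarters. Note that the third quarter contains 493441 users in total and the fourth quarter 543823. Many of these users (the "invalid" ones) have no label, only basic features, and ordinary feature engineering can hardly use any of their information. It looks as if providing so many users was a waste on the organizers' part. Can these users be used in the model? The answer is yes. This is an important finding of the team and the key to reaching 0.495; we call it the exploitation of invalid users. The figure below shows the number of users in each category for each quarter.
As the figure shows, 6641 users who are invalid in the third quarter are valid in the fourth quarter, and 5569 users who are invalid in the fourth quarter are valid in the first quarter. Cross-quarter features do not require the user to have been valid in the previous quarter, because the previous quarter's label is not used. The 6641 invalid users from the third quarter and the 5569 invalid users from the fourth quarter are therefore used to build the cross-quarter features of section 5. This increases the coverage of the cross-quarter features and thus the accuracy of the overall model.
Code for exploiting invalid users: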
all_cust_3 = set(train_data3["cust_no"])
all_cust_4 = set(train_data4["cust_no"])
valid_cust3 = set(val_3["cust_no"])
valid_cust4 = set(val_4["cust_no"])
valid_test_cust = set(test_data["cust_no"])


def is_need3(x):
    # Keep a Q3 user who is valid in Q3 or in the following quarter (Q4)
    if x in valid_cust3 or x in valid_cust4:
        return 1
    else:
        return 0  # the published code had a bare `0` here and so returned None


train_data3["is_need"] = train_data3["cust_no"].apply(is_need3)


def is_need4(x):
    # Keep a Q4 user who is valid in Q4 or in the following quarter (the Q1 test set)
    if x in valid_cust4 or x in valid_test_cust:
        return 1
    else:
        return 0


train_data4["is_need"] = train_data4["cust_no"].apply(is_need4)
train_data4 = train_data4[train_data4["is_need"] == 1]
train_data3 = train_data3[train_data3["is_need"] == 1]
# Reset the index after filtering so that later index-aligned assignments line up
train_data3 = train_data3.drop(["is_need"], axis=1).reset_index(drop=True)
train_data4 = train_data4.drop(["is_need"], axis=1).reset_index(drop=True)
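To make the selection rule concrete, here is a minimal sketch on toy data. The customer ids, the column `X1` and the set contents are invented for illustration; only the rule itself (keep a row if the user is valid in this quarter or in the next one) comes from the article.

```python
import pandas as pd

# Toy data: every customer seen in Q3, and the customers flagged valid in Q3 / Q4
q3_all = pd.DataFrame({"cust_no": ["a", "b", "c", "d"], "X1": [1.0, 2.0, 3.0, 4.0]})
valid_q3 = {"a"}
valid_q4 = {"a", "b"}  # "b" is invalid in Q3 but valid in Q4

# Keep a Q3 row if the customer is valid in Q3 OR valid in the following quarter
keep = q3_all["cust_no"].isin(valid_q3 | valid_q4)
q3_kept = q3_all[keep].reset_index(drop=True)

# Customers kept only because they become valid next quarter
rescued = sorted((set(q3_all["cust_no"]) - valid_q3) & valid_q4)
print(q3_kept["cust_no"].tolist(), rescued)
```

In the real data the `rescued` group corresponds to the 6641 (Q3) and 5569 (Q4) users mentioned above: they never contribute a label, only previous-quarter feature values.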
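That is all of the data loading and its code.
Feature engineering
Next comes the team's feature engineering, presented in the following parts.
1: Date features
The date features are the time of the most recent transaction in each quarter and the times at which a user's major events occurred. A user's type can be inferred from when the user was active. In general, the closer the transaction time is to the end of the quarter, the more active the user and the more likely the user is a "rising" user; conversely, the further away it is, the more likely the user is a churned user. Based on this, our team mainly computes the number of days from each date to the end of the quarter and from the start of the quarter, and then bins the results. The procedure is shown in the figure below:
In general, the closer a user's major event is to the end of the quarter, the less likely the user is to churn in that quarter. Conversely, the further the event is from the end of the quarter, the more likely the user is to churn. The difference from the start of the quarter can likewise be used as a supplementary feature.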
import time
import math
import datetime
from functools import partial

# Reference dates per quarter id (1 = 2020 Q1 test set, 3 = 2019 Q3, 4 = 2019 Q4).
# NOTE: the published code uses 2020-3-30 for Q1 although that quarter ends on 2020-3-31; kept as published.
QUARTER_END = {1: "2020-3-30", 3: "2019-9-30", 4: "2019-12-31"}
QUARTER_START = {1: "2020-1-1", 3: "2019-7-1", 4: "2019-10-1"}


def _parse(day):
    return datetime.datetime.strptime(day, "%Y-%m-%d")


def getbetweenday(x, mon):
    """Days from date x to the end of quarter `mon` (None if missing or after the quarter end)."""
    if pd.isna(x):
        return None
    res = (_parse(QUARTER_END[mon]) - _parse(str(x))).days
    return res if res >= 0 else None


def getbetweenfirstday(x, mon):
    """Days from the first day of quarter `mon` to date x (None if missing or before the quarter start)."""
    if pd.isna(x):
        return None
    res = (_parse(str(x)) - _parse(QUARTER_START[mon])).days
    return res if res >= 0 else None


# (frame, quarter id) pairs; every step below is applied to all three frames
frames = [(test_data, 1), (train_data3, 3), (train_data4, 4)]
suffixes = ["FromNow", "FromFirstDay"]
date_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 18]

for i in date_ids:
    for df, mon in frames:
        df["E%dFromNow" % i] = df["E%d" % i].apply(partial(getbetweenday, mon=mon))
        df["E%dFromFirstDay" % i] = df["E%d" % i].apply(partial(getbetweenfirstday, mon=mon))

# Row-wise minimum over groups of events:
# E19 = all dated events, E20 = E1..E5, E23 = E10/E12/E13/E14, E24 = E16/E18
groups = {19: date_ids, 20: [1, 2, 3, 4, 5], 23: [10, 12, 13, 14], 24: [16, 18]}
# Differences between pairs of events: E26 = E16 - E18, E25 = E10 - E3
pairs = {26: (16, 18), 25: (10, 3)}
for df, _ in frames:
    for new_id, members in groups.items():
        for suffix in suffixes:
            cols = ["E%d%s" % (i, suffix) for i in members]
            df["E%d%s" % (new_id, suffix)] = df[cols].min(axis=1)
    for new_id, (a, b) in pairs.items():
        for suffix in suffixes:
            df["E%d%s" % (new_id, suffix)] = df["E%d%s" % (a, suffix)] - df["E%d%s" % (b, suffix)]

# Binning. Two fixes relative to the published code:
#  * it indexed with the stale loop variable `i` instead of `j`, so every bin was built from E18;
#  * both loops wrote to the same column name, so the 30-day column was overwritten.
#    The quarter-start bins therefore get their own (assumed) name "E<j>FirstLess<day>day".
bin_ids = date_ids + [19, 20, 23]
for j in bin_ids:
    for day in [7, 15, 30, 90, 365]:
        for df, _ in frames:
            df["E%dLess%dday" % (j, day)] = (df["E%dFromNow" % j] < day).astype(int)
for j in bin_ids:
    for day in [30, 45, 60]:
        for df, _ in frames:
            df["E%dFirstLess%dday" % (j, day)] = (df["E%dFromFirstDay" % j] < day).astype(int)

lastFea_E = ["E15", "E17", "E16FromNow", "E18FromNow", "E19FromNow", "E20FromNow", "E23FromNow"]
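For readers who want to check the arithmetic on a single date, the following standalone sketch computes both distances and the recency bins without pandas. The function names, the `QUARTER_BOUNDS` table and the example date are assumptions made for illustration; the windows match the ones used in the binning step.

```python
from datetime import date

# (first day, last day) of each quarter; the keys mirror the quarter ids 3 and 4
QUARTER_BOUNDS = {
    3: (date(2019, 7, 1), date(2019, 9, 30)),
    4: (date(2019, 10, 1), date(2019, 12, 31)),
}


def quarter_distances(event_day, quarter):
    """Return (days until quarter end, days since quarter start) for one event date.

    A component is None when the event lies outside that side of the quarter,
    mirroring the negative-difference checks in the feature code.
    """
    first, last = QUARTER_BOUNDS[quarter]
    to_end = (last - event_day).days
    from_start = (event_day - first).days
    return (to_end if to_end >= 0 else None,
            from_start if from_start >= 0 else None)


def recency_flags(days_to_end, windows=(7, 15, 30, 90, 365)):
    """Binary bins: 1 if the event happened fewer than `w` days before quarter end."""
    return {w: int(days_to_end is not None and days_to_end < w) for w in windows}


to_end, from_start = quarter_distances(date(2019, 9, 20), quarter=3)
flags = recency_flags(to_end)
print(to_end, from_start, flags)
```

An event on 20 September is 10 days from the end of Q3, so it falls into every window except the 7-day one.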
Besides the date information in the major-event features, there is also the B6 feature, which is handled in a very similar way:
import time
from functools import partial


def StringToTime(x):
    """Convert a 'YYYY-mm-dd HH:MM:SS' string to a Unix timestamp (0 if missing)."""
    if not pd.isna(x):
        timeArray = time.strptime(x, "%Y-%m-%d %H:%M:%S")
        # time.mktime uses the local time zone; the constants below correspond to UTC+8
        timeStamp = int(time.mktime(timeArray))
        return timeStamp
    else:
        return 0


test_data["B6"] = test_data["B6"].apply(StringToTime)      # latest value: 2020-3-31
train_data3["B6"] = train_data3["B6"].apply(StringToTime)  # latest value: 2019-09-30 23:59:00
train_data4["B6"] = train_data4["B6"].apply(StringToTime)  # latest value: 2019-12-31 23:34:00


def data_from_last_Of_month(x, mon):
    """Days between timestamp x and the end of quarter `mon` (None if missing or after it)."""
    x = int(x)
    if x == 0:
        return None
    # 23:59:59 (UTC+8) on the last day of 2020 Q1, 2019 Q3 and 2019 Q4
    nowtime = {1: 1585670399, 3: 1569859199, 4: 1577807999}.get(mon, 0)
    if nowtime - x < 0:
        return None
    return (nowtime - x) / 3600 / 24


fuction3 = partial(data_from_last_Of_month, mon=3)
fuction1 = partial(data_from_last_Of_month, mon=1)
fuction4 = partial(data_from_last_Of_month, mon=4)
test_data["B6_2"] = test_data["B6"].apply(fuction1)
train_data3["B6_2"] = train_data3["B6"].apply(fuction3)
train_data4["B6_2"] = train_data4["B6"].apply(fuction4)

# B7 divided by the number of days since the B6 timestamp
test_data["B6_3"] = test_data["B7"] / test_data["B6_2"]
train_data3["B6_3"] = train_data3["B7"] / train_data3["B6_2"]
train_data4["B6_3"] = train_data4["B7"] / train_data4["B6_2"]


def below_theday(df, day):
    df["B6_2Less" + str(day)] = (df["B6_2"] < day).astype(int)
    return df


for te in [1, 7, 15, 30, 60, 90, 180, 365]:
    test_data = below_theday(test_data, te)
    train_data3 = below_theday(train_data3, te)
    # Missing in the published code; added so that Q4 gets the same columns
    train_data4 = below_theday(train_data4, te)
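2: Label features
The label feature uses the user's label from the previous quarter as a feature. Previous-quarter labels take the values -1, 0 and 1. The remaining users have no previous-quarter label, and I further split them into users who were invalid last quarter and users with no record at all last quarter. The final feature therefore has five classes: -1, 0 and 1 are the previous-quarter label, 2 means the user was an invalid user last quarter, and 3 means there is no record of the user last quarter.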
all_cust_3 = set(train_data3["cust_no"])
all_cust_4 = set(train_data4["cust_no"])
valid_cust3 = set(val_3["cust_no"])
valid_cust4 = set(val_4["cust_no"])

# Q3 has no earlier quarter in the data, so every Q3 row gets the placeholder 2
train_data3["last_label"] = 2
# .values avoids index misalignment between the merge result (fresh 0..n-1 index)
# and a frame whose index may have gaps after filtering
train_data4["last_label"] = pd.merge(train_data4[["cust_no"]], train_data3[["cust_no", "label"]],
                                     on="cust_no", how="left")["label"].values
train_data4["last_label"] = train_data4["last_label"].fillna(2)
test_data["last_label"] = pd.merge(test_data[["cust_no"]], train_data4[["cust_no", "label"]],
                                   on="cust_no", how="left")["label"].values
test_data["last_label"] = test_data["last_label"].fillna(2)


def get_last_isvalid_3(x):
    # 2: present in Q3 but not a valid user; 3: no Q3 record at all.
    # A user who is valid in Q3 yet has no label falls through and returns None, as published.
    if x not in valid_cust3:
        if x in all_cust_3:
            return 2
        else:
            return 3


def get_last_isvalid_4(x):
    # Same rule for the test set, looking back at Q4
    if x not in valid_cust4:
        if x in all_cust_4:
            return 2
        else:
            return 3


train_data4.loc[train_data4["last_label"] == 2, "last_label"] = \
    train_data4.loc[train_data4["last_label"] == 2, "cust_no"].apply(get_last_isvalid_3)
test_data.loc[test_data["last_label"] == 2, "last_label"] = \
    test_data.loc[test_data["last_label"] == 2, "cust_no"].apply(get_last_isvalid_4)

# "jidu" is the quarter id: 3 and 4 for training, 1 for the test set
train_data3["jidu"] = 3
train_data4["jidu"] = 4
test_data["jidu"] = 1
train_data = pd.concat([train_data3, train_data4]).reset_index(drop=True)
test_data['label'] = 2  # marker value that identifies test rows later on
all_data = pd.concat([train_data, test_data]).reset_index(drop=True)
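The five-class encoding can be summarized in a few lines of plain Python. This is a simplified sketch, not the team's code: the function name and the toy dictionaries are invented, and it assumes every valid user of the previous quarter has a label.

```python
def last_label_feature(cust_no, prev_labels, prev_all):
    """Encode the previous-quarter status of one customer.

    prev_labels: dict cust_no -> label (-1, 0 or 1) for labeled users last quarter
    prev_all:    set of every cust_no that appears in last quarter's data
    """
    if cust_no in prev_labels:
        return prev_labels[cust_no]  # -1 / 0 / 1: the previous label itself
    if cust_no in prev_all:
        return 2                     # seen last quarter, but as an invalid user
    return 3                         # no record last quarter


prev_labels = {"a": 1, "b": -1}
prev_all = {"a", "b", "c"}
encoded = [last_label_feature(c, prev_labels, prev_all) for c in ["a", "b", "c", "z"]]
print(encoded)
```

Separating classes 2 and 3 lets the model distinguish a dormant existing customer from a customer who is new to the bank's records.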
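3: Behavior table features
The numeric behavior features in this table are the monthly mobile-banking login count, transfer-in count, transfer-out count, transfer-in amount and transfer-out amount, plus one quarterly feature: the number of account changes within the quarter. As with the balance features, we first cross the different features, as shown in the figure below, where gray marks original features and orange marks crossed features. The transaction CTR is computed as the number of transactions divided by the number of logins. Feature crossing is used to mine user behavior at a deeper level.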
# Fill missing monthly behavior values with the column mean
# (assignment instead of chained inplace fillna, which newer pandas versions no longer support reliably)
for i in range(1, 6):
    for j in ["", "_x", "_y"]:
        tem = "B" + str(i) + j
        all_data[tem] = all_data[tem].fillna(all_data[tem].mean())
for i in ["", "_x", "_y"]:
    all_data["B8" + i] = all_data["B2" + i] + all_data["B4" + i]  # number of transactions
    all_data["B9" + i] = all_data["B3" + i] - all_data["B5" + i]  # transfer-in minus transfer-out
    all_data["B10" + i] = all_data["B3" + i] / (all_data["B2" + i] + 1.001)  # average amount per transfer-in
    all_data["B11" + i] = all_data["B5" + i] / (all_data["B4" + i] + 1.001)  # average amount per transfer-out
    all_data["B12" + i] = all_data["B3" + i] + all_data["B5" + i]  # total inflow plus outflow
    # transactions divided by logins, comparable to CTR in recommendation
    all_data["B13" + i] = all_data["B8" + i] / (all_data["B1" + i] + 1.001)


def behavior_m(all_data, tem):
    """Cross the three monthly values of feature `tem` (columns tem, tem_x, tem_y)."""
    all_data[tem + "sub" + tem + "_x"] = all_data[tem] - all_data[tem + "_x"]
    all_data[tem + "sub" + tem + "_y"] = all_data[tem] - all_data[tem + "_y"]
    all_data[tem + "_x" + "sub" + tem + "_y"] = all_data[tem + "_x"] - all_data[tem + "_y"]
    all_data[tem + "div" + tem + "_x"] = all_data[tem] / (all_data[tem + "_x"] + 1.0001)
    all_data[tem + "div" + tem + "_y"] = all_data[tem] / (all_data[tem + "_y"] + 1.0001)
    all_data[tem + "_x" + "div" + tem + "_y"] = all_data[tem + "_x"] / (all_data[tem + "_y"] + 1.0001)
    all_data["all2_" + tem] = all_data[tem] + all_data[tem + "_x"] / 2 + all_data[tem + "_y"] / 3  # weighted sum
    all_data["all3_" + tem] = all_data[tem] + all_data[tem + "_x"] + all_data[tem + "_y"]  # plain sum
    all_data["mid_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].median(axis=1)
    all_data["min_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].min(axis=1)
    all_data["max_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].max(axis=1)
    return all_data


for i in range(1, 14):
    if i == 6 or i == 7:  # B6 and B7 are quarterly, so they have no monthly versions
        continue
    tem = "B" + str(i)
    all_data = behavior_m(all_data, tem)

# NOTE: despite "Div" in the names, the first and third features below are differences, as published
all_data["B2sumDivB4sum"] = all_data["all3_" + "B2"] - all_data["all3_" + "B4"]
all_data["B2sumAddB4sum"] = all_data["all3_" + "B2"] + all_data["all3_" + "B4"]
# astype(int) raises if B7 contains missing values
all_data["B2sumAddB4sum_divB7"] = (all_data["B2sumAddB4sum"] - all_data["B7"]).astype(int)

lastFae_B = ["B" + str(i) for i in range(1, 14) if i != 6]
lastFae_tempB = [j + "_B" + str(i) for i in range(1, 14) if i != 6 and i != 7 for j in
                 ["max", "min", "all3", "mid", "all2"]]  # ,"std"
lastFae_B = lastFae_B + lastFae_tempB
lastFae_B.append("B6_2")
lastFae_B.append("B2sumDivB4sum")
lastFae_B.append("B2sumAddB4sum")
lastFae_B.append("B2sumAddB4sum_divB7")
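The ratio features all add a small constant to the denominator. A tiny sketch shows why; the helper names are invented for illustration, and the offset 1.001 is the one used for the B10, B11 and B13 features.

```python
def smoothed_ratio(numerator, denominator, offset=1.001):
    """Ratio with an offset in the denominator, so a zero count never divides by zero."""
    return numerator / (denominator + offset)


def transaction_ctr(n_transfer_in, n_transfer_out, n_logins):
    """Transactions per login: (transfer-in count + transfer-out count) / logins."""
    return smoothed_ratio(n_transfer_in + n_transfer_out, n_logins)


active = transaction_ctr(3, 2, 10)   # logs in often, transacts sometimes
dormant = transaction_ctr(0, 0, 0)   # no logins at all: still well defined
print(active, dormant)
```

With non-negative counts the denominator is always at least 1.001, so the feature stays finite even for users who never logged in during the month.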
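4: Account balance and deposit information
Deposit and balance information are key features for determining user type. The main deposit fields are shown in the figure below. Aggregating across them produces several new features: the original features are on the left, the crossed features and how they are computed on the right. The main idea is to combine the different balances. The sum over all of them can be seen as the user's overall month-end balance, the sum excluding loans as the user's account balance, and there is also a debt-ratio feature.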
# Fill missing monthly balances with the column mean
for i in range(1, 9):
    for j in ["", "_x", "_y"]:
        tem = "X" + str(i) + j
        all_data[tem] = all_data[tem].fillna(all_data[tem].mean())


def xs(i, s):
    """Column X<i> for the month suffix s ('', '_x' or '_y')."""
    return all_data["X" + str(i) + s]


suffixes = ["", "_x", "_y"]
# X9: all balances combined, with X7 subtracted
for s in suffixes:
    all_data["X9" + s] = xs(1, s) + xs(2, s) + xs(3, s) + xs(4, s) + xs(5, s) + xs(6, s) - xs(7, s) + xs(8, s)
# X10: all balances except X7
for s in suffixes:
    all_data["X10" + s] = xs(1, s) + xs(2, s) + xs(3, s) + xs(4, s) + xs(5, s) + xs(6, s) + xs(8, s)
# X11: X3 + X4 + X5
for s in suffixes:
    all_data["X11" + s] = xs(3, s) + xs(4, s) + xs(5, s)
# X12: X7 relative to X10 (the debt ratio). The published code added 1 to the denominator
# only for the unsuffixed month; it is applied to all three months here for consistency.
for s in suffixes:
    all_data["X12" + s] = xs(7, s) / (xs(10, s) + 1)
# X13: X9 minus the deposit feature C1
for s in suffixes:
    all_data["X13" + s] = xs(9, s) - all_data["C1" + s]

fea_x = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"]


def amm_m(all_data, fe, fe2):
    """Cross the three monthly values of feature fe2+fe (for example 'X' + '1')."""
    tem = fe2 + fe
    all_data[tem + "sub" + tem + "_x"] = all_data[tem] - all_data[tem + "_x"]
    all_data[tem + "sub" + tem + "_y"] = all_data[tem] - all_data[tem + "_y"]
    all_data[tem + "_x" + "sub" + tem + "_y"] = all_data[tem + "_x"] - all_data[tem + "_y"]

    all_data[tem + "div" + tem + "_x"] = all_data[tem] / (all_data[tem + "_x"] + 1.0001)
    all_data[tem + "div" + tem + "_y"] = all_data[tem] / (all_data[tem + "_y"] + 1.0001)
    all_data[tem + "_x" + "div" + tem + "_y"] = all_data[tem + "_x"] / (all_data[tem + "_y"] + 1.0001)

    all_data["all2_" + tem] = all_data[tem] + all_data[tem + "_x"] / 2 + all_data[tem + "_y"] / 3
    all_data["all3_" + tem] = all_data[tem] + all_data[tem + "_x"] + all_data[tem + "_y"]

    all_data["max_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].max(axis=1)
    all_data["min_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].min(axis=1)
    all_data["mid_" + tem] = all_data[[tem, tem + "_x", tem + "_y"]].median(axis=1)
    return all_data


for fe in fea_x:
    all_data = amm_m(all_data, fe, "X")

# Deposit (cunkuan) features: fill with the mean, then C3 = C1 / (C2 + 1)
for i in range(1, 3):
    for j in ["", "_x", "_y"]:
        tem = "C" + str(i) + j
        all_data[tem] = all_data[tem].fillna(all_data[tem].mean())
all_data["C3"] = all_data["C1"] / (all_data["C2"] + 1)
all_data["C3_x"] = all_data["C1_x"] / (all_data["C2_x"] + 1)
all_data["C3_y"] = all_data["C1_y"] / (all_data["C2_y"] + 1)
for fe in ["1", "2", "3"]:
    all_data = amm_m(all_data, fe, "C")
lastFea_x = ["X" + str(i) for i in range(1, 14)]
lastFea_x1 = [j + "_X" + str(i) for i in range(1, 14) for j in ["max", "min", "all3", "mid", "all2"]]  # ,"std"
lastFea_c = ["C" + str(i) for i in range(1, 4)]
lastFea_c1 = [j + "_C" + str(i) for i in range(1, 4) for j in ["max", "min", "all3", "mid", "all2"]]  # ,"std"
last_fea_XC = lastFea_x + lastFea_x1 + lastFea_c + lastFea_c1
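As a worked example of the month-to-month crossing, here is what a handful of the aggregates look like for two toy customers. The balances are made up; the column naming follows the merge order of the loading code, where `X1_x` is the first month, `X1_y` the second and `X1` the last month of the quarter.

```python
import pandas as pd

# Toy balances for two customers over the three months of a quarter
df = pd.DataFrame({"X1_x": [100.0, 50.0], "X1_y": [80.0, 50.0], "X1": [20.0, 90.0]})

months = ["X1", "X1_x", "X1_y"]
df["X1subX1_x"] = df["X1"] - df["X1_x"]   # change from the first to the last month
df["all3_X1"] = df[months].sum(axis=1)    # quarter total
df["max_X1"] = df[months].max(axis=1)
df["min_X1"] = df[months].min(axis=1)
df["mid_X1"] = df[months].median(axis=1)
print(df)
```

The first customer's balance drops from 100 to 20 over the quarter (`X1subX1_x` = -80), while the second rises from 50 to 90, which is the kind of trend the crossed features are meant to expose.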
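5: Cross-quarter features
Cross-quarter difference information is the most important feature for reaching a high score. New features are formed by crossing a user's features from two quarters. As shown in the figure below, given a user's features in the third and fourth quarters, for example a balance feature F1, we cross it between quarters: add the two quarters' values, or subtract or divide the fourth-quarter value by the third-quarter value. The resulting features are treated as new fourth-quarter features and are combined with the original fourth-quarter features to form the training set. The concrete procedure is shown in the figure below.
Note that this is not applied to basic user features, because they generally change very little. The idea behind this feature engineering is that user type is dynamic: the bank judges a user's type from behavior over the past months or even quarters. For example, if a user's balance or activity is very low this quarter, the user would normally be judged as churned, but if the balance or activity was high last quarter, the bank may weigh the two against each other.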
# .copy() so that the column assignments below do not write into views of all_data
x_train_3 = all_data[all_data["jidu"] == 3].copy()
x_train_4 = all_data[all_data["jidu"] == 4].copy()
x_test_1 = all_data[all_data["jidu"] == 1].copy()

# Date-distance features only get the difference, not the sum or ratio
diff_only = ["E16FromNow", "E18FromNow", "B6_2", "E19FromNow", "E20FromNow", "E23FromNow"]

for tem in last_fea_XC + lastFae_B + lastFea_E:
    fea1 = tem
    # Q3 has no previous quarter, so its difference is set to 0
    x_train_3[fea1 + "diff"] = 0
    # Q4 versus Q3: after the merge, <fea>_x is the Q4 value and <fea>_y the Q3 value
    temp = pd.merge(x_train_4[["cust_no", fea1]], x_train_3[["cust_no", fea1]], on="cust_no", how="left")
    # list(...) drops the merge index so values are assigned by position
    x_train_4[fea1 + "diff"] = list(temp[fea1 + "_x"] - temp[fea1 + "_y"])

    if tem not in diff_only:
        x_train_4[fea1 + "add"] = list(temp[fea1 + "_x"] + temp[fea1 + "_y"])
        x_train_4[fea1 + "div"] = list(temp[fea1 + "_x"] / (temp[fea1 + "_y"] + 1.0001))

    # Q1 (test) versus Q4
    temp = pd.merge(x_test_1[["cust_no", fea1]], x_train_4[["cust_no", fea1]], on="cust_no", how="left")
    x_test_1[fea1 + "diff"] = list(temp[fea1 + "_x"] - temp[fea1 + "_y"])

    if tem not in diff_only:
        x_test_1[fea1 + "add"] = list(temp[fea1 + "_x"] + temp[fea1 + "_y"])
        x_test_1[fea1 + "div"] = list(temp[fea1 + "_x"] / (temp[fea1 + "_y"] + 1.0001))
all_data = pd.concat([x_train_3, x_train_4, x_test_1]).reset_index(drop=True)
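A toy example makes the cross-quarter step easier to follow. The two small frames and their values are invented; the merge pattern, the `_x`/`_y` naming and the 1.0001 offset follow the code above.

```python
import pandas as pd

# One feature for two quarters; customer "c" has no Q3 record
q3 = pd.DataFrame({"cust_no": ["a", "b"], "X1": [100.0, 40.0]})
q4 = pd.DataFrame({"cust_no": ["a", "c"], "X1": [30.0, 70.0]})

# Left-join Q3 onto Q4: X1_x is the Q4 value, X1_y the Q3 value
temp = pd.merge(q4[["cust_no", "X1"]], q3[["cust_no", "X1"]], on="cust_no", how="left")

# Assign by position so the result does not depend on q4's index
q4["X1diff"] = (temp["X1_x"] - temp["X1_y"]).to_numpy()
q4["X1add"] = (temp["X1_x"] + temp["X1_y"]).to_numpy()
q4["X1div"] = (temp["X1_x"] / (temp["X1_y"] + 1.0001)).to_numpy()
print(q4)
```

Customer "a" shows a drop of 70 between quarters, while customer "c" gets missing values because there is nothing to compare against. Keeping the rescued invalid users in the previous quarter reduces exactly this kind of gap.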
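6: Basic user information
Basic user information is fairly simple and has two kinds of features: numeric features such as age and income, and categorical features such as customer grade. As shown in the figure below, numeric features are binned. The categorical features have few distinct values, so our team one-hot encoded them, which helps the model learn combinations between features. With LGB, they can instead be converted to categorical features.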
def getcust_1(x):
    # cust_no is a hexadecimal string; convert it to its decimal representation
    return str(int(x, 16))


all_data["cust_no_10"] = all_data["cust_no"].apply(getcust_1)
all_data["cust_no_1"] = all_data["cust_no_10"].apply(lambda x: x[:3] == "300").astype(int)
all_data["cust_no_2"] = all_data["cust_no_10"].apply(lambda x: int(x[4:]))

# Missing categorical values are filled with the existing "other/unspecified" categories
all_data['I5'] = all_data['I5'].fillna('不便分類的其他從業人員')
all_data['I13'] = all_data['I13'].fillna('未說明的婚姻狀況')
all_data['I14'] = all_data['I14'].fillna('其他')


def get_age(df, col):
    """Threshold flags on age (I2)."""
    for cut in [18, 25, 30, 40, 50, 60, 70, 80]:
        df[col + "_" + str(cut)] = (df.I2 > cut).astype(int)
    return df


all_data['I1'] = (all_data.I1 == "男").astype(int)  # gender to 0/1
all_data = get_age(all_data, 'I2')  # age thresholds
# all_data['I12']=(all_data.I12=="個人").astype(int)  # convert to 0/1


def get_salary(df, col):
    """Threshold flags on household income (I11)."""
    for k, cut in enumerate([0, 16559.999, 273000.0, 600000.0, 960000.0, 1581288]):
        df[col + "_" + str(k)] = (df.I11 > cut).astype(int)
    return df


# Household income thresholds. The published code called get_age here,
# although its own comment names get_salary.
all_data = get_salary(all_data, "I11")

# One-hot encode the categorical columns (dtype=int keeps the dummies numeric)
for col in ["I3", "I5", "I10", "I13", "I14", "last_label"]:
    all_data = pd.merge(all_data.drop(col, axis=1), pd.get_dummies(all_data[col], prefix=col, dtype=int),
                        left_index=True, right_index=True)
# Drop columns that are entirely empty, plus the raw columns that were binned  # ,"jidu" ,"B6_1""B6",
all_data = all_data.drop(["I7", "I9", "E11", "I2", "I11", "I8"], axis=1)
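The two encodings can be seen side by side on a toy frame. The column names `age` and `grade`, the values and the reduced set of cut points are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 30, 65], "grade": ["A", "B", "A"]})

# Cumulative threshold flags: one 0/1 column per cut point (strictly greater than)
for cut in (18, 25, 30, 60):
    df["age_%d" % cut] = (df["age"] > cut).astype(int)

# One-hot encode the low-cardinality categorical column
dummies = pd.get_dummies(df["grade"], prefix="grade", dtype=int)
df = pd.concat([df.drop(columns="grade"), dummies], axis=1)
print(df)
```

Note that the flags are cumulative rather than exclusive bins: a 65-year-old sets every flag, and a 30-year-old does not set `age_30` because the comparison is strict.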
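LGB model
The team's final result was a blend of a single-seed xgb model and a single-seed lgb model. lgb scored higher online, so the lgb modeling approach is the one open-sourced. The code is as follows: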
import datetime

from sklearn.model_selection import KFold

test_data = all_data[all_data["label"] == 2]
# The published code discarded the result of drop(); assign it back
test_data = test_data.drop(["label", "jidu"], axis=1)
# Train on Q4 only, restricted to valid Q4 users
train_data = all_data[all_data["jidu"] == 4]
train_data = pd.merge(train_data, val_4, on='cust_no', how='inner')
fea = train_data.columns.values.tolist()
fea.remove('label')
fea.remove("jidu")
y_train = train_data.label
x_train = train_data[fea]
# Object-typed (string) columns are excluded from the model input
obj_fea = [value for value in x_train.columns if train_data[value].dtype == "object"]
y_train = y_train + 1  # multiclass labels must be 0/1/2 instead of -1/0/1
col = [i for i in x_train.columns if i not in obj_fea]

params = {
    'bagging_freq': 1,
    'bagging_fraction': 0.8,
    'bagging_seed': 20201101,
    'boost': 'gbdt',
    'feature_fraction': 0.8,
    'feature_fraction_seed': 20201101,
    'learning_rate': 0.05,
    'max_depth': 8,
    'metric': 'multi_logloss',
    'min_data_in_leaf': 20,
    'num_leaves': 32,
    'num_threads': 6,
    'objective': 'multiclass',
    'num_class': 3,
    'lambda_l1': 0.5,
    'lambda_l2': 1.2,
    'verbosity': 1,
    # 'max_bin': 64,
    'device': 'cpu',
}

print(datetime.datetime.now())
folds = KFold(n_splits=10, shuffle=True, random_state=1996)
predictions2 = np.zeros([test_data.shape[0], 3])  # averaged test-set probabilities
preds2 = np.zeros([x_train.shape[0], 3])          # out-of-fold probabilities
for fold_, (train_index, test_index) in enumerate(folds.split(x_train, y_train)):
    print("Fold {}".format(fold_))
    train_x, test_x = x_train.iloc[train_index], x_train.iloc[test_index]
    train_y, test_y = y_train.iloc[train_index], y_train.iloc[test_index]

    trn_data = lgb.Dataset(train_x[col], train_y)
    val_data = lgb.Dataset(test_x[col], test_y)
    num_round = 50000
    clf = lgb.train(params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    # LightGBM >= 4.0 takes these as callbacks; the published code used
                    # verbose_eval=100, early_stopping_rounds=500
                    callbacks=[lgb.early_stopping(500), lgb.log_evaluation(100)],
                    )
    val_train = clf.predict(test_x[col], num_iteration=clf.best_iteration)
    preds2[test_index] = val_train
    val_pred = clf.predict(test_data[col], num_iteration=clf.best_iteration)
    predictions2 += val_pred / folds.n_splits

print(datetime.datetime.now())
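Finally, the proportions of the classes 0, 1 and -1 need to be adjusted according to the offline results to obtain the best kappa value, and the same adjustment is then applied to the online predictions. Kaggle has plenty of open-source examples of kappa search, so we will not show ours here.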
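Since the search itself is not shown, here is a rough illustration of one common way to do it; it is not the team's actual method. The idea is to rescale the predicted class probabilities with per-class weights and keep the weights that maximize kappa on the out-of-fold predictions. All names below (`cohen_kappa`, `predict_with_weights`, `search_weights`), the grid and the toy probabilities are assumptions, plain unweighted Cohen's kappa is assumed as the metric, and labels are the shifted classes 0/1/2 used for training.

```python
from itertools import product


def cohen_kappa(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted Cohen's kappa for two equal-length label lists."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


def predict_with_weights(probs, weights):
    """Argmax over class probabilities after multiplying each class by its weight."""
    return [max(range(len(weights)), key=lambda c: row[c] * weights[c]) for row in probs]


def search_weights(probs, y_true, grid=(0.5, 1.0, 1.5, 2.0)):
    """Grid-search weights for classes 0 and 2, with class 1 fixed at 1.0."""
    best_w = (1.0, 1.0, 1.0)
    best_k = cohen_kappa(y_true, predict_with_weights(probs, best_w))
    for w0, w2 in product(grid, grid):
        w = (w0, 1.0, w2)
        k = cohen_kappa(y_true, predict_with_weights(probs, w))
        if k > best_k:
            best_w, best_k = w, k
    return best_w, best_k


# Toy out-of-fold probabilities and true labels (made up)
oof_probs = [[0.4, 0.5, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1],
             [0.3, 0.6, 0.1], [0.1, 0.5, 0.4], [0.1, 0.3, 0.6]]
oof_true = [0, 0, 1, 1, 2, 2]

baseline = cohen_kappa(oof_true, predict_with_weights(oof_probs, (1.0, 1.0, 1.0)))
weights, tuned = search_weights(oof_probs, oof_true)
print(baseline, weights, tuned)
```

In practice the weights would be fitted on the out-of-fold array (`preds2` in the training code) and then applied unchanged to the averaged test probabilities. A finer grid or an optimizer can replace the coarse grid used here.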
Reposted from: https://mp.weixin.qq.com/s?__biz=MzAxOTU5NTU4MQ==&mid=2247484853&idx=1&sn=6cf580f6d548e9f63c86ab405c6ff502&chksm=9bc5ede7acb264f1fa7e02372f5e92dac17392c8335fe664297d63a8064aa21bd762bccc6875&mpshare=1&scene=23&srcid=0222ALobZ0BPWMdquqpbEVrz&sharer_sharetime=1613965675967&sharer_shareid=628acadc38c0d97ed34eb70efc08eaf5#rd
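Summary
According to the author of the code, there are a few more lessons from this competition worth sharing.
1: Training on the fourth quarter alone works better than training on the third and fourth quarters together, possibly because cross-quarter features cannot be built for the third quarter when it is used as training data
2: Predicting with a regression model and then applying thresholds works worse than three-class prediction
3: Cross-quarter features and the exploitation of invalid users are the most important reasons for the model's good results; the kappa search is the most direct reason for the good online result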