Competition page: https://www.kesci.com/home/competition/5be92233954d6e001063649a
Another fairly casual entry; the final result was 39/205. Nothing to brag about: the competition is scored by AUC, so even without any modeling, randomly predicting half the users as going and the other half as not going already scores about 0.5.
Problem introduction
Task description
Using historical data for a subset of Guiyang's permanent residents from 2017 (training set), plus their June/July 2018 data (test set), contestants must predict the probability that each resident travels to the Qiandongnan prefecture (within-province tourism) in August 2018.
The task breaks down as:
Training: build a predictive model from the training set, i.e. users' June/July 2017 history together with whether they traveled to Qiandongnan in August 2017.
Output: apply the model to the test set (users' June/July 2018 history) to predict the probability that each user travels to Qiandongnan in August 2018, then submit on Kesci for an AUC score.
Data description
The training set (training_set, ~2.3 GB) contains three folders, 201708n, 201708q and weather_data_2017, holding users' June/July 2017 history and the 2017 weather history.
201708n and 201708q each contain 7 txt files. Users in 201708n did not visit the Qiandongnan target area in August 2017; users in 201708q did.
Besides the fields listed below, every training table ends with a "label" field: "0" marks a negative sample (the user did not visit the Qiandongnan target area in August 2017), "1" a positive sample (the user did).
User identity table (201708n1.txt, 201708q1.txt)
User handset table (201708n2.txt, 201708q2.txt)
User roaming behavior table (201708n3.txt, 201708q3.txt)
User out-of-province roaming table (201708n4.txt, 201708q4.txt)
User location table (201708n6.txt, 201708q6.txt)
User APP usage table (201708n7.txt, 201708q7.txt)
The weather_data_2017 folder contains two txt files — "weather_reported_2017" with the observed June/July 2017 weather and "weather_forecast_2017" with the forecast June/July 2017 weather — plus a weather-phenomenon code workbook, 天氣現象編碼表.xlsx.
2017 observed weather table (weather_reported_2017.txt)
2017 forecast weather table (weather_forecast_2017.txt)
The test set (testing_set, ~1 GB) contains two folders, 201808 and weather_data_2018.
201808 contains 7 txt files, named 2018_1.txt, 2018_2.txt, …, 2018_7.txt, with fields matching the training set.
weather_data_2018 contains two txt files: "weather_reported_2018" (observed June/July 2018 weather) and "weather_forecast_2018" (forecast June/July 2018 weather), again with fields matching the training set.
Notes:
The 7 tables in each folder can be joined on the virtual ID, but not every virtual ID joins across tables; contestants decide how to handle and use them.
Virtual IDs are formatted differently across tables; contestants must normalize them, and the submitted virtual IDs must be strings.
With many tables, mixed feature dimensions and a variety of applicable methods, the data may contain anomalies and missing values; contestants handle these themselves.
Contestants are encouraged to try different approaches, including frontier methods such as transfer learning.
The data has been anonymized and therefore differs somewhat from the real records, but not in a way that affects the task.
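As a concrete illustration of the ID-normalization point (my own sketch, not from the competition kit; the tables and formats are made up), IDs read as integers in one table and as zero-padded strings in another can be normalized to plain strings before joining:

```python
import pandas as pd

# Two hypothetical tables whose virtual IDs differ in format.
a = pd.DataFrame({"id": [42, 7]})            # numeric IDs
b = pd.DataFrame({"id": ["0042", "0007"]})   # zero-padded string IDs

# Normalize both sides to plain strings before merging.
a["id"] = a["id"].astype(str)
b["id"] = b["id"].astype(str).str.lstrip("0")

merged = a.merge(b, on="id")  # now joins cleanly on both rows
```

The same `astype(str)` step also satisfies the rule that submitted virtual IDs be strings.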
Evaluation
1. Preliminary-round scoring
Models are judged by AUC: the area under the ROC (Receiver Operating Characteristic) curve, with False Positive Rate on the x-axis and True Positive Rate on the y-axis.
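As a concrete illustration (not part of the competition kit), AUC can be computed with scikit-learn's roc_auc_score. Note that it expects scores or probabilities, not hard class labels:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC = probability that a random positive outranks a random negative.
# Here 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc_score(y_true, y_score))  # 0.75
```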
2. Leaderboard mechanics
The leaderboard uses a Private/Public split: the Private board is computed on a fixed fraction of the submitted predictions and the Public board on the rest.
Each team gets 5 submissions and evaluations per day; the Public board updates in real time, sorted high to low, and a newer same-day submission overwrites the earlier one.
Because of generalization effects, the submission that tops the Public board need not top the Private board, so each team picks two of its valid submissions — balancing score against expected generalization — for Private evaluation.
The Private board is revealed after the competition ends, and final official scores and rankings are based on it.
Code
%load_ext klab-autotime
import pandas as pd
import numpy as np
time: 311 ms
def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that holds its range."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
time: 3.85 ms
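For comparison (my own aside, not from the original notebook), pandas can do the same per-column downcasting with pd.to_numeric; unlike the function above it stops at float32, which sidesteps float16 overflow — the likely reason q1.describe() below reports inf for the consume mean and std (a float16 sum tops out at 65504):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# downcast='integer'/'float' picks the smallest safe dtype for the values.
df["a"] = pd.to_numeric(df["a"], downcast="integer")  # int64 -> int8
df["b"] = pd.to_numeric(df["b"], downcast="float")    # float64 -> float32

print(df.dtypes)
```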
1 Feature engineering
Positive samples
q1
Sum the consumption amounts of the two months
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
Mem. usage decreased to 0.16 Mb (53.1% reduction)
time: 39.2 ms
q1.describe()

           year_month            id       consume    label
count    11200.000000  1.120000e+04  1.086500e+04  11200.0
mean    201706.500000  5.416583e+15           inf      1.0
std          0.500022  2.642827e+15           inf      0.0
min     201706.000000  1.448104e+12  4.998779e-02      1.0
25%     201706.000000  3.117220e+15  4.068750e+01      1.0
50%     201706.500000  5.456254e+15  9.837500e+01      1.0
75%     201707.000000  7.702940e+15  1.785000e+02      1.0
max     201707.000000  9.997949e+15  1.324000e+03      1.0
time: 37.3 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month 11200 non-null int32
id 11200 non-null int64
consume 10865 non-null float16
label 11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.91 ms
q1.consume.min()
0.05
time: 2.64 ms
q1 = q1.fillna(98.0)  # impute missing consume with roughly its median (98.4 per describe())
time: 2.75 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month 11200 non-null int32
id 11200 non-null int64
consume 11200 non-null float16
label 11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.71 ms
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 709 ms
q2
Feature 1: indicators for the top-9 handset brands ever used + "other" (10 in total)
Feature 2: number of distinct brands used
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 2.46 s
q2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289203 entries, 0 to 289202
Data columns (total 6 columns):
id 289203 non-null int64
brand 197376 non-null object
type 197380 non-null object
first_use_time 289203 non-null int64
recent_use_time 289203 non-null int64
label 289203 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 11.3+ MB
time: 62.6 ms
q2.type = q2.type.fillna('其它')
time: 18.4 ms
brand_series = pd.Series({'蘋果': 'iphone', '華爲': 'huawei', '歐珀': 'oppo', '維沃': 'vivo', '三星': 'san', '小米': 'mi', '金立': 'jinli', '魅族': 'mei', '樂視': 'le', '四季恆美': 'siji'})
q2.brand = q2.brand.map(brand_series)
time: 42.4 ms
q2.brand = q2.brand.fillna('其它')
time: 17.4 ms
q2.head()

                 id   brand       type  first_use_time  recent_use_time  label
0  1752398069509000    其它        其它   20161209134530   20161209190636      1
1  1752398069509000  huawei   PLK-AL10  20170609223138   20170609224345      1
2  1752398069509000      le  LETV X501  20160924102711   20160924112425      1
3  1752398069509000   jinli  金立 GN800  20150331210255   20150630131232      1
4  1752398069509000   jinli  GIONEE M5  20170508191216   20170605192347      1
time: 18.7 ms
q2['brand_type'] = q2['brand'] + q2['type']
time: 109 ms
q2.head()

                 id   brand       type  first_use_time  recent_use_time  label       brand_type
0  1752398069509000    其它        其它   20161209134530   20161209190636      1          其它其它
1  1752398069509000  huawei   PLK-AL10  20170609223138   20170609224345      1   huaweiPLK-AL10
2  1752398069509000      le  LETV X501  20160924102711   20160924112425      1      leLETV X501
3  1752398069509000   jinli  金立 GN800  20150331210255   20150630131232      1  jinli金立 GN800
4  1752398069509000   jinli  GIONEE M5  20170508191216   20170605192347      1   jinliGIONEE M5
time: 9.75 ms
groupbybrand_type = q2['brand_type'].value_counts()
time: 51.8 ms
groupbybrand_type.head(10)
其它其它 91823
iphoneA1586 14898
iphoneA1524 10330
iphoneA1700 9246
iphoneA1699 8277
iphoneIPHONE6S(A1633) 6271
oppoOPPO R9M 4725
iphoneA1530 4640
oppoOPPO R9TM 2978
vivoVIVO X7 2516
Name: brand_type, dtype: int64
time: 3.44 ms
q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
q2_groupbyid.head()

                 id  phone_nums
0  8707678197418467         422
1  9196501153454276         409
2  3900535090108175         389
3  4104535378288025         352
4  1106540188374027         350
time: 90 ms
q2_groupbyid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5600 entries, 0 to 5599
Data columns (total 2 columns):
id 5600 non-null int64
phone_nums 5600 non-null int64
dtypes: int64(2)
memory usage: 87.6 KB
time: 5.91 ms
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
q2_brand_one_hot.head()

                 id  brand_huawei  brand_iphone  brand_jinli  brand_le  brand_mei  brand_mi  brand_oppo  brand_san  brand_siji  brand_vivo  brand_其它
0  1752398069509000             0             0            0         0          0         0           0          0           0           0          1
1  1752398069509000             1             0            0         0          0         0           0          0           0           0          0
2  1752398069509000             0             0            0         1          0         0           0          0           0           0          0
3  1752398069509000             0             0            1         0          0         0           0          0           0           0          0
8  1752398069509000             0             0            0         0          0         0           0          1           0           0          0
time: 48.9 ms
# Per-id max over the dummy columns: 1 iff the user ever used that brand.
q2_one_hot = q2_brand_one_hot.groupby(['id']).max()
q2_one_hot.head()

                brand_huawei  brand_iphone  brand_jinli  brand_le  brand_mei  brand_mi  brand_oppo  brand_san  brand_siji  brand_vivo  brand_其它
id
1448103998000              1             1            0         1          1         0           1          1           0           0          1
17398718813730             1             1            1         1          1         1           1          1           0           1          1
61132623486000             1             0            0         0          0         0           0          0           0           0          1
68156596675520             0             1            1         1          0         0           0          0           0           0          1
76819334576430             1             1            1         0          1         1           1          1           0           1          1
time: 6.57 s
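The get_dummies-then-max pattern above can be seen on a toy frame (illustrative data only, not from the competition): duplicated rows per user collapse into "ever used this brand" flags.

```python
import pandas as pd

# Toy records: user 1 appeared with two brands, user 2 with one.
records = pd.DataFrame({
    "id": [1, 1, 2],
    "brand": ["huawei", "iphone", "huawei"],
})

# One dummy column per brand value...
dummies = pd.get_dummies(records, columns=["brand"])

# ...then a per-id max: a cell is 1/True iff that user ever used that brand.
flags = dummies.groupby("id").max()
print(flags)
```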
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 3 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
dtypes: float16(1), int64(2)
memory usage: 142.2 KB
time: 11.6 ms
pos_set = pos_set.merge(q2_one_hot, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 14 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
dtypes: float16(1), int64(2), uint8(11)
memory usage: 202.3 KB
time: 98.6 ms
q3
1. Sum the two months' contact-circle sizes
2. Sum the two months' left-the-province flags (yes: 1, no: 0)
3. Sum the two months' left-the-country flags (yes: 1, no: 0)
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
Mem. usage decreased to 0.18 Mb (64.6% reduction)
time: 85.8 ms
q3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month 11200 non-null int32
id 11200 non-null int64
call_nums 11200 non-null int16
is_trans_provincial 11200 non-null int8
is_transnational 11200 non-null int8
label 11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 186.0 KB
time: 7.49 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 17 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 224.2 KB
time: 1.95 s
q4
1. Number of out-of-province roaming records over the two months
2. One-hot of all provinces, or top-10 provinces + "other"
3. Number of distinct provinces roamed to over the two months
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
q4.info()
Mem. usage decreased to 0.15 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month 7289 non-null int32
id 7289 non-null int64
province 7218 non-null object
label 7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 18.4 ms
q4.head()

   year_month                id  province  label
0      201707  6062475264825100      廣東      1
1      201707  5627768389537500      北京      1
2      201707  2000900444179600      山西      1
3      201707  5304502776817600      四川      1
4      201707  5304502776817600      四川      1
time: 7.16 ms
q4_groupbyid = q4.groupby(['province']).size()
time: 61.3 ms
q4_groupbyid.sort_values()
province
寧夏 15
吉林 20
內蒙古 22
黑龍江 27
青海 35
天津 39
遼寧 44
西藏 69
山西 70
甘肅 73
新疆 74
安徽 86
海南 100
陝西 114
山東 121
福建 150
河北 168
江蘇 182
湖北 208
上海 215
河南 237
北京 247
江西 364
重慶 428
浙江 483
雲南 530
廣西 536
四川 793
廣東 835
湖南 933
dtype: int64
time: 4.04 ms
q4.province = q4.province.fillna('湖南')  # impute missing province with the most frequent value (湖南, 933 records)
q4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month 7289 non-null int32
id 7289 non-null int64
province 7289 non-null object
label 7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 8.09 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 18 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
province_out_cnt 1942 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 268.0 KB
time: 19.6 ms
pos_set = pos_set.fillna(0)
pos_set['label'] = 1
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
province_out_cnt 5600 non-null float64
label 5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 12.7 ms
q6: skipped for now
q7
1. Total data traffic used
2. Number of distinct APPs used
3. Whether certain (travel-related) APPs were used
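The q7 features listed above were never implemented in this notebook. A sketch of how they could be derived is shown below; the table layout (columns id, app_name, traffic) and the travel-app names are my own assumptions, not the real 201708q7.txt schema:

```python
import pandas as pd

# Hypothetical stand-in for the APP usage table: one row per (user, app) record.
q7 = pd.DataFrame({
    "id": [1, 1, 2],
    "app_name": ["ctrip", "wechat", "wechat"],
    "traffic": [120.0, 300.0, 50.0],
})

travel_apps = {"ctrip", "qunar", "mafengwo"}  # assumed travel-related app names

q7_feats = q7.groupby("id").agg(
    total_traffic=("traffic", "sum"),   # 1. total traffic used
    app_cnt=("app_name", "nunique"),    # 2. number of distinct apps
)
# 3. any travel-related app used (per-id max of a boolean flag)
q7_feats["uses_travel_app"] = (
    q7.assign(t=q7["app_name"].isin(travel_apps)).groupby("id")["t"].max().astype(int)
)
print(q7_feats)
```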
1.1 Positive samples
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
q1 = q1.fillna(98.0)
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
q2.type = q2.type.fillna('其它')
brand_series = pd.Series({'蘋果': 'iphone', '華爲': 'huawei', '歐珀': 'oppo', '維沃': 'vivo', '三星': 'san', '小米': 'mi', '金立': 'jinli', '魅族': 'mei', '樂視': 'le', '四季恆美': 'siji'})
q2.brand = q2.brand.map(brand_series)
q2.brand = q2.brand.fillna('其它')
q2['brand_type'] = q2['brand'] + q2['type']
q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
q2_one_hot = q2_brand_one_hot.groupby(['id']).max()
q2_one_hot.head()
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
pos_set = pos_set.merge(q2_one_hot, on=['id'])
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
q4.province = q4.province.fillna('湖南')
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set = pos_set.fillna(0)
pos_set['label'] = 1
pos_set.info()
Mem. usage decreased to 0.16 Mb (53.1% reduction)
Mem. usage decreased to 11.31 Mb (14.6% reduction)
Mem. usage decreased to 0.18 Mb (64.6% reduction)
Mem. usage decreased to 0.15 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id 5600 non-null int64
consume 5600 non-null float16
phone_nums 5600 non-null int64
brand_huawei 5600 non-null uint8
brand_iphone 5600 non-null uint8
brand_jinli 5600 non-null uint8
brand_le 5600 non-null uint8
brand_mei 5600 non-null uint8
brand_mi 5600 non-null uint8
brand_oppo 5600 non-null uint8
brand_san 5600 non-null uint8
brand_siji 5600 non-null uint8
brand_vivo 5600 non-null uint8
brand_其它 5600 non-null uint8
call_nums 5600 non-null int16
is_trans_provincial 5600 non-null int8
is_transnational 5600 non-null int8
province_out_cnt 5600 non-null float64
label 5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 10.1 s
1.2 Negative samples
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
n1.columns = ['year_month', 'id', 'consume', 'label']
n1 = n1.fillna(98.0)
n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
n2.type = n2.type.fillna('其它')
brand_series = pd.Series({'蘋果': 'iphone', '華爲': 'huawei', '歐珀': 'oppo', '維沃': 'vivo', '三星': 'san', '小米': 'mi', '金立': 'jinli', '魅族': 'mei', '樂視': 'le', '四季恆美': 'siji'})
n2.brand = n2.brand.map(brand_series)
n2.brand = n2.brand.fillna('其它')
n2['brand_type'] = n2['brand'] + n2['type']
n2_brand_type = n2[['id', 'brand_type']]
n2_brand_type = n2_brand_type.drop_duplicates()
n2_groupbyid = n2_brand_type['id'].value_counts()
n2_groupbyid = n2_groupbyid.reset_index()
n2_groupbyid.columns = ['id', 'phone_nums']
n2_brand = n2[['id', 'brand']]
n2_brand = n2_brand.drop_duplicates()
n2_brand_one_hot = pd.get_dummies(n2_brand)
n2_one_hot = n2_brand_one_hot.groupby(['id']).max()
neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])
neg_set = neg_set.merge(n2_one_hot, on=['id'])
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
n4.columns = ['year_month', 'id', 'province', 'label']
n4.province = n4.province.fillna('湖南')
n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()
n4_groupbyid = n4_groupbyid.reset_index()
n4_groupbyid.columns = ['id', 'province_out_cnt']
neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])
neg_set = neg_set.fillna(0)
neg_set['label'] = 0
neg_set.info()
Mem. usage decreased to 2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
Mem. usage decreased to 3.03 Mb (64.6% reduction)
Mem. usage decreased to 0.73 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 93375 entries, 0 to 93374
Data columns (total 19 columns):
id 93375 non-null int64
consume 93375 non-null float16
phone_nums 93375 non-null int64
brand_huawei 93375 non-null uint8
brand_iphone 93375 non-null uint8
brand_jinli 93375 non-null uint8
brand_le 93375 non-null uint8
brand_mei 93375 non-null uint8
brand_mi 93375 non-null uint8
brand_oppo 93375 non-null uint8
brand_san 93375 non-null uint8
brand_siji 93375 non-null uint8
brand_vivo 93375 non-null uint8
brand_其它 93375 non-null uint8
call_nums 93375 non-null int16
is_trans_provincial 93375 non-null int8
is_transnational 93375 non-null int8
province_out_cnt 93375 non-null float64
label 93375 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.1 MB
time: 2min 48s
train_set = pos_set.append(neg_set)
train_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98975 entries, 0 to 93374
Data columns (total 19 columns):
id 98975 non-null int64
consume 98975 non-null float16
phone_nums 98975 non-null int64
brand_huawei 98975 non-null uint8
brand_iphone 98975 non-null uint8
brand_jinli 98975 non-null uint8
brand_le 98975 non-null uint8
brand_mei 98975 non-null uint8
brand_mi 98975 non-null uint8
brand_oppo 98975 non-null uint8
brand_san 98975 non-null uint8
brand_siji 98975 non-null uint8
brand_vivo 98975 non-null uint8
brand_其它 98975 non-null uint8
call_nums 98975 non-null int16
is_trans_provincial 98975 non-null int8
is_transnational 98975 non-null int8
province_out_cnt 98975 non-null float64
label 98975 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.4 MB
time: 62.5 ms
2 Modeling
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y = train_set['label'].values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'auc'},
    'num_leaves': 100,
    'reg_alpha': 0,
    'reg_lambda': 0.01,
    'max_depth': 6,
    'n_estimators': 100,
    'subsample': 0.9,
    'colsample_bytree': 0.85,
    'subsample_freq': 1,
    'min_child_samples': 25,
    'learning_rate': 0.1,
    'random_state': 2019
}
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=2000,
                valid_sets=lgb_eval,
                verbose_eval=250,
                early_stopping_rounds=50)
y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)
print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))
y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration)
print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))
Training until validation scores don't improve for 50 rounds.
Early stopping, best iteration is:
[18] valid_0's auc: 0.786865
AUC: 0.7981
Test AUC: 0.7869
time: 772 ms
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from collections import Counter

X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y = train_set['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
c = Counter(y_train)
'''
params = {'booster': 'gbtree',
          'objective': 'binary:logistic',
          'eval_metric': 'auc',
          'max_depth': 4,
          'lambda': 10,
          'subsample': 0.75,
          'colsample_bytree': 0.75,
          'min_child_weight': 2,
          'eta': 0.025,
          'seed': 0,
          'nthread': 8,
          'silent': 1}
'''
clf = XGBClassifier(max_depth=5, eval_metric='auc', min_child_weight=6, scale_pos_weight=c[0] / 16 / c[1],
                    nthread=12, num_boost_round=1000, seed=2019)
print('fit start...')
clf.fit(X_train, y_train)
print('fit finish')
'''
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score:{}\ntest score:{}'.format(train_score, test_score))
'''
# NOTE: predict() returns hard 0/1 labels; AUC should be computed on
# predict_proba(...)[:, 1], which is why the scores below hover near 0.5.
y_pred = clf.predict(X)
from sklearn import metrics
print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))
y_pred = clf.predict(X_test)
print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))
fit start...
fit finish
AUC: 0.5134
Test AUC: 0.5082
time: 3.11 s
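The near-0.5 AUC above is largely an artifact of scoring hard labels rather than probabilities. A minimal sketch of the difference, on synthetic data (scikit-learn's GradientBoostingClassifier stands in for XGBClassifier to keep the sketch dependency-light):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

auc_on_labels = roc_auc_score(y, clf.predict(X))              # hard 0/1 labels
auc_on_scores = roc_auc_score(y, clf.predict_proba(X)[:, 1])  # ranking scores
print(auc_on_labels, auc_on_scores)
```

Scoring the probabilities uses the model's full ranking, so the second number is the one comparable to the LightGBM AUCs above.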
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import GridSearchCV
from collections import Counter

X_train = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_train = train_set['label'].values
c = Counter(y_train)
parameters = {
    'max_depth': [5, 10, 15],
    'learning_rate': [0.01, 0.02, 0.05],
    'n_estimators': [500, 1000, 2000],
    'min_child_weight': [0, 2, 5],
    'max_delta_step': [0, 0.2, 0.6],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.5, 0.6, 0.7],
    'reg_alpha': [0, 0.25, 0.5],
    'reg_lambda': [0.2, 0.4, 0.6],
    'scale_pos_weight': [0.8, 8, 14]
}
xlf = xgb.XGBClassifier(max_depth=10,
                        learning_rate=0.01,
                        n_estimators=2000,
                        silent=True,
                        objective='binary:logistic',
                        nthread=12,
                        gamma=0,
                        min_child_weight=1,
                        max_delta_step=0,
                        subsample=0.85,
                        colsample_bytree=0.7,
                        colsample_bylevel=1,
                        reg_alpha=0,
                        reg_lambda=1,
                        scale_pos_weight=1,
                        seed=2019,
                        missing=None)
# Score with ROC AUC, the competition metric, rather than accuracy.
gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='roc_auc', cv=3)
gsearch.fit(X_train, y_train)
print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
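The grid above has 3^10 ≈ 59,000 combinations, i.e. roughly 177,000 fits with 3-fold CV, which is rarely tractable. A lighter alternative (my own suggestion, shown on synthetic data with a scikit-learn model so it runs without xgboost) is RandomizedSearchCV, which samples a fixed budget of candidates from the same space:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Illustrative distributions; a real search would mirror `parameters` above.
param_dist = {
    "max_depth": randint(3, 11),
    "learning_rate": uniform(0.01, 0.09),
    "subsample": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,            # fixed budget instead of the full grid
    scoring="roc_auc",   # match the competition metric
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_score_)
```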
3 Prediction
3.1 Test set
t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))
t1.columns = ['year_month', 'id', 'consume']
t1 = t1.fillna(81.0)
t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
t2 = t2.fillna('其它')
brand_series = pd.Series({'蘋果': 'iphone', '華爲': 'huawei', '歐珀': 'oppo', '維沃': 'vivo', '三星': 'san', '小米': 'mi', '金立': 'jinli', '魅族': 'mei', '樂視': 'le', '四季恆美': 'siji'})
t2.brand = t2.brand.map(brand_series)
t2.brand = t2.brand.fillna('其它')
t2['brand_type'] = t2['brand'] + t2['type']
t2_brand_type = t2[['id', 'brand_type']]
t2_brand_type = t2_brand_type.drop_duplicates()
t2_groupbyid = t2_brand_type['id'].value_counts()
t2_groupbyid = t2_groupbyid.reset_index()
t2_groupbyid.columns = ['id', 'phone_nums']
t2_brand = t2[['id', 'brand']]
t2_brand = t2_brand.drop_duplicates()
t2_brand_one_hot = pd.get_dummies(t2_brand)
t2_one_hot = t2_brand_one_hot.groupby(['id']).max()
test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])
test_set = test_set.merge(t2_one_hot, on=['id'])
t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
test_set = test_set.merge(t3_groupbyid_call, on=['id'])
test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])
test_set = test_set.merge(t3_groupbyid_trans, on=['id'])
t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))
t4.columns = ['year_month', 'id', 'province']
t4 = t4.fillna('湖南')
t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()
t4_groupbyid = t4_groupbyid.reset_index()
t4_groupbyid.columns = ['id', 'province_out_cnt']
test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])
test_set = test_set.fillna(0)
test_set.info()
Mem. usage decreased to 1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
Mem. usage decreased to 1.53 Mb (60.0% reduction)
Mem. usage decreased to 0.85 Mb (16.7% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 18 columns):
id 48668 non-null int64
consume 48668 non-null float16
phone_nums 48668 non-null int64
brand_huawei 48668 non-null uint8
brand_iphone 48668 non-null uint8
brand_jinli 48668 non-null uint8
brand_le 48668 non-null uint8
brand_mei 48668 non-null uint8
brand_mi 48668 non-null uint8
brand_oppo 48668 non-null uint8
brand_san 48668 non-null uint8
brand_siji 48668 non-null uint8
brand_vivo 48668 non-null uint8
brand_其它 48668 non-null uint8
call_nums 48668 non-null int16
is_trans_provincial 48668 non-null int8
is_transnational 48668 non-null int8
province_out_cnt 48668 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 2.3 MB
time: 1min 39s
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = gbm.predict(X_test, num_iteration=gbm.best_iteration)
submit = test_set[['id']]
submit['pred'] = y_predict
time: 108 ms
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
"""
type(y_predict)
numpy.ndarray
time: 2.3 ms
y_predict[:5]
array([0.10280227, 0.08214867, 0.06905468, 0.07655945, 0.11238844])
time: 2.9 ms
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = clf.predict_proba(X_test)[:, 1]
submit_xgb = test_set[['id']]
submit_xgb['pred'] = y_predict
time: 208 ms
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
"""
4 Submitting the results
tt1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
tt1.columns = ['year_month', 'id', 'consume']
time: 41.6 ms
xgb_t1_id = tt1[['id']].drop_duplicates()
time: 13 ms
xgb_t1_id.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id 50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.46 ms
t1_id = tt1[['id']].drop_duplicates()
time: 12.5 ms
t1_id.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id 50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.67 ms
submit_xgb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id 48668 non-null int64
pred 48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 950.5 KB
time: 7.8 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id 48668 non-null int64
pred 48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.33 ms
tt_xgb = t1_id.merge(submit_xgb, on=['id'], how='left')
time: 17.6 ms
tt_xgb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 980.5 KB
time: 8.14 ms
tt = t1_id.merge(submit, on=['id'], how='left')
time: 19.3 ms
tt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.06 ms
xgboost
submit_xgb = tt_xgb.fillna(0.0)
time: 1.92 ms
lightgbm
submit_gbm = tt.fillna(0.0)
time: 1.96 ms
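The padding pattern used here can be sketched on toy data: a left merge against the full id list keeps every id, leaving NaN where the model produced no prediction, and `fillna(0.0)` supplies a default score for those users. The toy ids are illustrative.

```python
import pandas as pd

all_ids = pd.DataFrame({'id': [1, 2, 3]})          # every id that must be submitted
preds = pd.DataFrame({'id': [1, 3], 'pred': [0.2, 0.7]})  # model covered only some ids
out = all_ids.merge(preds, on=['id'], how='left').fillna(0.0)
print(out)
```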
1. Model fusion by summing the two predictions: score 0.4558
2. All predictions set to 1.0 (or all to 0.0): score 0.5
3. Predictions above a threshold set to 1.0, the rest to 0.0 (roughly 2,800 users expected to go): xgb at 0.26 scored 0.50153, gbm at 0.17 scored 0.50554
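A sanity check on note 3: AUC scores the ranking of predictions, so collapsing probabilities to 0/1 discards the ordering inside each group and generally hurts the score. A toy illustration with scikit-learn (labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]                        # perfectly ranked
binary = [1.0 if s >= 0.5 else 0.0 for s in scores]  # all collapse to 0.0
print(roc_auc_score(y_true, scores))  # full ranking preserved
print(roc_auc_score(y_true, binary))  # ties carry no ranking information
```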
submit_xgb.describe()
                 id          pred
count  5.020000e+04  50200.000000
mean   5.449990e+15      0.092590
std    2.628886e+15      0.088487
min    5.959412e+11      0.000000
25%    3.177008e+15      0.034837
50%    5.441108e+15      0.063993
75%    7.726328e+15      0.125547
max    9.999920e+15      0.754152
time: 22.4 ms
submit_xgb[submit_xgb['pred'] >= 0.26].describe()
                 id         pred
count  2.818000e+03  2818.000000
mean   5.523494e+15     0.350387
std    2.632627e+15     0.083545
min    7.736480e+13     0.260060
25%    3.193231e+15     0.287803
50%    5.528103e+15     0.324941
75%    7.801996e+15     0.386373
max    9.999505e+15     0.754152
time: 16.7 ms
xgb_yes = submit_xgb[submit_xgb['pred'] >= 0.26]
xgb_yes['pred'] = 1.0
xgb_yes.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
                 id    pred
count  2.818000e+03  2818.0
mean   5.523494e+15     1.0
std    2.632627e+15     0.0
min    7.736480e+13     1.0
25%    3.193231e+15     1.0
50%    5.528103e+15     1.0
75%    7.801996e+15     1.0
max    9.999505e+15     1.0
time: 347 ms
xgb_no = submit_xgb[submit_xgb['pred'] < 0.26]
xgb_no['pred'] = 0.0
xgb_no.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
                 id     pred
count  4.738200e+04  47382.0
mean   5.445619e+15      0.0
std    2.628626e+15      0.0
min    5.959412e+11      0.0
25%    3.175890e+15      0.0
50%    5.435288e+15      0.0
75%    7.722863e+15      0.0
max    9.999920e+15      0.0
time: 380 ms
submit = xgb_yes.append(xgb_no)
time: 2.29 ms
submit.describe()
                 id          pred
count  5.020000e+04  50200.000000
mean   5.449990e+15      0.056135
std    2.628886e+15      0.230185
min    5.959412e+11      0.000000
25%    3.177008e+15      0.000000
50%    5.441108e+15      0.000000
75%    7.726328e+15      0.000000
max    9.999920e+15      1.000000
time: 19.6 ms
submit_xgb[submit_xgb['pred'] >= 0.2].describe()
                 id         pred
count  5.547000e+03  5547.000000
mean   5.508672e+15     0.289829
std    2.641133e+15     0.086438
min    5.399382e+12     0.200014
25%    3.195841e+15     0.225862
50%    5.489831e+15     0.261552
75%    7.813588e+15     0.326278
max    9.999505e+15     0.754152
time: 18.5 ms
5600 / 98975 * 50200
2840.3132104066685
time: 2.17 ms
submit_gbm[submit_gbm['pred'] >= 0.23].describe()
                 id         pred
count  2.539000e+03  2539.000000
mean   5.482621e+15     0.298836
std    2.625965e+15     0.062903
min    7.736480e+13     0.230013
25%    3.200866e+15     0.253366
50%    5.471503e+15     0.279145
75%    7.742764e+15     0.326900
max    9.999505e+15     0.632138
time: 19 ms
submit_gbm[submit_gbm['pred'] >= 0.22].describe()
                 id         pred
count  2.859000e+03  2859.000000
mean   5.493943e+15     0.290563
std    2.630246e+15     0.063701
min    7.736480e+13     0.220121
25%    3.195841e+15     0.244933
50%    5.501943e+15     0.270700
75%    7.743865e+15     0.321506
max    9.999505e+15     0.632138
time: 19.6 ms
gbm_yes = submit_gbm[submit_gbm['pred'] >= 0.23]
gbm_yes['pred'] = 1.0
gbm_yes.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
                 id    pred
count  2.539000e+03  2539.0
mean   5.482621e+15     1.0
std    2.625965e+15     0.0
min    7.736480e+13     1.0
25%    3.200866e+15     1.0
50%    5.471503e+15     1.0
75%    7.742764e+15     1.0
max    9.999505e+15     1.0
time: 82.2 ms
gbm_no = submit_gbm[submit_gbm['pred'] < 0.23]
gbm_no['pred'] = 0.0
gbm_no.describe()
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
                 id     pred
count  4.766100e+04  47661.0
mean   5.448252e+15      0.0
std    2.629058e+15      0.0
min    5.959412e+11      0.0
25%    3.175232e+15      0.0
50%    5.439911e+15      0.0
75%    7.725629e+15      0.0
max    9.999920e+15      0.0
time: 58.7 ms
submit = gbm_yes.append(gbm_no)
time: 4.19 ms
submit.describe()
                 id          pred
count  5.020000e+04  50200.000000
mean   5.449990e+15      0.018745
std    2.628886e+15      0.135625
min    5.959412e+11      0.000000
25%    3.177008e+15      0.000000
50%    5.441108e+15      0.000000
75%    7.726328e+15      0.000000
max    9.999920e+15      1.000000
time: 20.4 ms
submit_gbm.describe()
                 id          pred
count  5.020000e+04  50200.000000
mean   5.449990e+15      0.085097
std    2.628886e+15      0.071304
min    5.959412e+11      0.000000
25%    3.177008e+15      0.036845
50%    5.441108e+15      0.062206
75%    7.726328e+15      0.113462
max    9.999920e+15      0.632138
time: 20.8 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 91 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 9.36 ms
submit = submit_xgb.append(submit_gbm)
submit = submit.groupby(by='id').sum().reset_index()
submit.describe()
                 id          pred
count  5.020000e+04  50200.000000
mean   5.449990e+15      0.169012
std    2.628886e+15      0.139313
min    5.959412e+11      0.000000
25%    3.177008e+15      0.076237
50%    5.441108e+15      0.125893
75%    7.726328e+15      0.222622
max    9.999920e+15      1.124561
time: 41.7 ms
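Note that the summed blend above exceeds 1 (max 1.124561). Since AUC is invariant under any strictly increasing transform, dividing the sum by 2 (i.e. averaging) scores exactly the same while keeping the blend a valid probability. A sketch with illustrative arrays:

```python
import numpy as np

xgb_pred = np.array([0.9, 0.2, 0.6])
gbm_pred = np.array([0.8, 0.3, 0.7])
blend = (xgb_pred + gbm_pred) / 2  # same ranking as the sum, but stays in [0, 1]
print(blend)
```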
submit.head()
                  id  pred
4   9297165066591558   1.0
14  8168181097053542   1.0
18  6473515505643555   1.0
25  4641233171005560   1.0
29  6759757036024682   1.0
time: 6.16 ms
submit_xgb[submit_xgb['id'] == 595941207920]
                id      pred
8048  595941207920  0.185561
time: 7.07 ms
submit_gbm[submit_gbm['id'] == 595941207920]
                id      pred
8048  595941207920  0.114782
time: 6.33 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
id 50200 non-null int64
pred 50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8 ms
All predictions set to 1
t1_id['pred'] = 1.0
submit = t1_id.copy()
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 2 columns):
id 50200 non-null int64
pred 50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.79 ms
submit.head()
                 id  pred
0  6401824160010748   1.0
1  6506134548135499   1.0
2  5996920884619954   1.0
3  1187209424543713   1.0
4  9297165066591558   1.0
time: 13.1 ms
submit.columns = ['ID', 'Pred']
submit['ID'] = submit['ID'].astype(str)
time: 36.7 ms
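The contest requires the virtual ID as a string, and pandas' default `to_csv` also writes the index as an extra unnamed column. A minimal sketch of a clean submission file (using `io.StringIO` in place of a real file; the toy ids are illustrative):

```python
import io
import pandas as pd

submit = pd.DataFrame({'ID': [9297165066591558, 595941207920], 'Pred': [0.9, 0.1]})
submit['ID'] = submit['ID'].astype(str)  # keep 16-digit ids exact, as strings
buf = io.StringIO()
submit.to_csv(buf, index=False)          # index=False drops the index column
print(buf.getvalue())
```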
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
ID 50200 non-null object
Pred 50200 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.1+ MB
time: 10.1 ms
submit.to_csv('../submit.csv')
time: 126 ms
!wget -O kesci_submit https://www.heywhale.com/kesci_submit && chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
--2019-07-31 08:15:56-- https://www.heywhale.com/kesci_submit
Resolving www.heywhale.com (www.heywhale.com)... 106.15.25.147
Connecting to www.heywhale.com (www.heywhale.com)|106.15.25.147|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6528405 (6.2M) [application/octet-stream]
Saving to: ‘kesci_submit’
kesci_submit 100%[===================>] 6.23M 12.1MB/s in 0.5s
2019-07-31 08:15:57 (12.1 MB/s) - ‘kesci_submit’ saved [6528405/6528405]
time: 1.83 s
!https_proxy="http://klab-external-proxy" ./kesci_submit -file ../submit.csv -token 578549794d544bff
Kesci Submit Tool 3.0
> Token verified
> Submitting file ../submit.csv (1312.26 KiB)
> File uploaded
> Submission complete
time: 1.7 s
!./kesci_submit -token 578549794d544bff -file ../submit.csv
Kesci Submit Tool
Result File: ../submit.csv (1.28 MiB)
Uploading: 7%====================
Submit Failed.
Server Response:
400 - {"message":"The submission tool version is outdated; please download the new tool from the competition submission page"}
time: 1 s
!ls ../
input pred.csv work
time: 665 ms
!wget -nv -O kesci_submit https://www.heywhale.com/kesci_submit && chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
2019-07-02 08:08:23 URL:https://www.heywhale.com/kesci_submit [7842088/7842088] -> "kesci_submit" [1]
time: 1.47 s
0 Exploring the data
0.1 Training data
0.1.1 Positive samples
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
Mem. usage decreased to 0.16 Mb (53.1% reduction)
time: 23 ms
q1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.21 ms
q1 = q1.dropna(axis=0)
time: 6.72 ms
q1.head()
   year_month                id    consume  label
2      201706  8160829951314300   82.75000      1
3      201707  8160829951314300   37.68750      1
4      201706  1508075698521400   68.00000      1
5      201707  1508075698521400   49.59375      1
6      201706  1686251204809800  200.75000      1
time: 6.82 ms
q1.describe()
          year_month            id       consume    label
count   10865.000000  1.086500e+04  1.086500e+04  10865.0
mean   201706.499678  5.417732e+15           inf      1.0
std         0.500023  2.635784e+15           inf      0.0
min    201706.000000  1.448104e+12  4.998779e-02      1.0
25%    201706.000000  3.118365e+15  4.068750e+01      1.0
50%    201706.000000  5.456594e+15  9.837500e+01      1.0
75%    201707.000000  7.687339e+15  1.785000e+02      1.0
max    201707.000000  9.997949e+15  1.324000e+03      1.0
time: 37.1 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 2 to 11199
Data columns (total 4 columns):
year_month 10865 non-null int32
id 10865 non-null int64
consume 10865 non-null float16
label 10865 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 244.0 KB
time: 6.9 ms
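The `inf` entries in `q1.describe()` come from `reduce_mem_usage` downcasting `consume` to float16, whose maximum representable value is 65504: aggregating many such values overflows. Upcasting before aggregation avoids this; a small demonstration (the values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([60000.0, 60000.0], dtype='float16')
print(s.sum())                    # exceeds the float16 range -> inf
print(s.astype('float32').sum())  # upcast before aggregating -> 120000.0
```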
%matplotlib inline
q1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1c0659b70>
time: 11.3 s
q1[q1.consume == 1323.74]
      year_month                id  consume  label
4867      201707  5510977603357000   1324.0      1
time: 11.1 ms
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 291 ms
q2 = q2.dropna(axis=0)
time: 77.7 ms
q2.head()
                  0   1          2               3               4  5
1  1752398069509000  華爲   PLK-AL10  20170609223138  20170609224345  1
2  1752398069509000  樂視  LETV X501  20160924102711  20160924112425  1
3  1752398069509000  金立   金立 GN800  20150331210255  20150630131232  1
4  1752398069509000  金立  GIONEE M5  20170508191216  20170605192347  1
5  1752398069509000  華爲   PLK-AL10  20160618182839  20170731235959  1
time: 8.16 ms
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.15 ms
q2.head()
                 id brand       type  first_use_time  recent_use_time  label
1  1752398069509000    華爲   PLK-AL10  20170609223138   20170609224345      1
2  1752398069509000    樂視  LETV X501  20160924102711   20160924112425      1
3  1752398069509000    金立   金立 GN800  20150331210255   20150630131232      1
4  1752398069509000    金立  GIONEE M5  20170508191216   20170605192347      1
5  1752398069509000    華爲   PLK-AL10  20160618182839   20170731235959      1
time: 8.58 ms
q2.describe()
                 id  first_use_time  recent_use_time     label
count  1.973760e+05    1.973760e+05     1.973760e+05  197376.0
mean   5.436228e+15    2.015597e+13     2.015684e+13       1.0
std    2.642924e+15    2.685010e+11     2.685124e+11       0.0
min    1.448104e+12   -1.000000e+00    -1.000000e+00       1.0
25%    3.227267e+15    2.015122e+13     2.016013e+13       1.0
50%    5.353833e+15    2.016052e+13     2.016060e+13       1.0
75%    7.764521e+15    2.016102e+13     2.016112e+13       1.0
max    9.997949e+15    2.017073e+13     2.017073e+13       1.0
time: 64.7 ms
q2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 197376 entries, 1 to 289201
Data columns (total 6 columns):
id 197376 non-null int64
brand 197376 non-null object
type 197376 non-null object
first_use_time 197376 non-null int64
recent_use_time 197376 non-null int64
label 197376 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 9.2+ MB
time: 41.7 ms
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
Mem. usage decreased to 0.18 Mb (64.6% reduction)
time: 18.4 ms
q3 = q3.dropna(axis=0)
time: 6.41 ms
q3.head()
        0                 1    2  3  4  5
0  201707  6062475264825100   88  1  0  1
1  201707  8160829951314300   27  0  0  1
2  201707  1508075698521400   19  0  0  1
3  201707  1686251204809800  207  0  0  1
4  201707  5627768389537500  133  1  0  1
time: 7.62 ms
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.16 ms
q3.head()
   year_month                id  call_nums  is_trans_provincial  is_transnational  label
0      201707  6062475264825100         88                    1                 0      1
1      201707  8160829951314300         27                    0                 0      1
2      201707  1508075698521400         19                    0                 0      1
3      201707  1686251204809800        207                    0                 0      1
4      201707  5627768389537500        133                    1                 0      1
time: 7.37 ms
q3.describe()
          year_month            id     call_nums  is_trans_provincial  is_transnational    label
count   11200.000000  1.120000e+04  11200.000000         11200.000000      11200.000000  11200.0
mean   201706.500000  5.416583e+15     70.562232             0.235446          0.014464      1.0
std         0.500022  2.642827e+15     61.820144             0.424296          0.119400      0.0
min    201706.000000  1.448104e+12     -1.000000             0.000000          0.000000      1.0
25%    201706.000000  3.117220e+15     25.000000             0.000000          0.000000      1.0
50%    201706.500000  5.456254e+15     54.000000             0.000000          0.000000      1.0
75%    201707.000000  7.702940e+15     99.250000             0.000000          0.000000      1.0
max    201707.000000  9.997949e+15    727.000000             1.000000          1.000000      1.0
time: 79.6 ms
q3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month 11200 non-null int32
id 11200 non-null int64
call_nums 11200 non-null int16
is_trans_provincial 11200 non-null int8
is_transnational 11200 non-null int8
label 11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 273.4 KB
time: 7.47 ms
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4 = q4.dropna(axis=0)
q4.columns = ['year_month', 'id', 'province', 'label']
time: 935 µs
q4.head()
   year_month                id province  label
0      201707  6062475264825100       廣東      1
1      201707  5627768389537500       北京      1
2      201707  2000900444179600       山西      1
3      201707  5304502776817600       四川      1
4      201707  5304502776817600       四川      1
time: 6.84 ms
q4.describe()
          year_month            id   label
count    7218.000000  7.218000e+03  7218.0
mean   201706.538515  5.341915e+15     1.0
std         0.498549  2.631231e+15     0.0
min    201706.000000  1.739872e+13     1.0
25%    201706.000000  3.037311e+15     1.0
50%    201707.000000  5.367106e+15     1.0
75%    201707.000000  7.545199e+15     1.0
max    201707.000000  9.987407e+15     1.0
time: 22.2 ms
q4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7218 entries, 0 to 7288
Data columns (total 4 columns):
year_month 7218 non-null int32
id 7218 non-null int64
province 7218 non-null object
label 7218 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 204.4+ KB
time: 6.74 ms
!ls /home/kesci/input/gzlt/train_set/201708q/
201708q1.txt 201708q3.txt 201708q6.txt
201708q2.txt 201708q4.txt 201708q7.txt
time: 667 ms
q6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q6.txt', sep='\t', header=None))
Mem. usage decreased to 62.58 Mb (52.1% reduction)
time: 3.9 s
q6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 868 µs
q6.head()
         date  hour                id  user_longitude  user_latitude  label
0  2017-07-18   8.0  9239265006758100      106.467545       26.58625      1
1  2017-07-10   0.0  3859201812337600      106.708213       26.57854      1
2  2017-07-16  18.0  3859201812337600      106.545690       26.56724      1
3  2017-07-17   8.0  3859201812337600      106.545690       26.56724      1
4  2017-07-27  16.0  3859201812337600      106.545690       26.56724      1
time: 16.7 ms
q6.describe()
               hour            id  user_longitude  user_latitude      label
count  2.852871e+06  2.852871e+06    2.851527e+06   2.851527e+06  2852871.0
mean   1.141897e+01  5.415213e+15    1.068143e+02   2.659968e+01        1.0
std    6.632995e+00  2.634349e+15    5.580043e-01   2.852525e-01        0.0
min    0.000000e+00  1.448104e+12    1.036700e+02   2.470664e+01        1.0
25%    6.000000e+00  3.135488e+15    1.066656e+02   2.654610e+01        1.0
50%    1.200000e+01  5.442594e+15    1.067027e+02   2.658143e+01        1.0
75%    1.800000e+01  7.687963e+15    1.067373e+02   2.662629e+01        1.0
max    2.200000e+01  9.997949e+15    1.095277e+02   2.909348e+01        1.0
time: 775 ms
q6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2852871 entries, 0 to 2852870
Data columns (total 6 columns):
date object
hour float64
id int64
user_longitude float64
user_latitude float64
label int64
dtypes: float64(3), int64(2), object(1)
memory usage: 130.6+ MB
time: 3.24 ms
q7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q7.txt', sep='\t', header=None))
Mem. usage decreased to 3.80 Mb (42.5% reduction)
time: 137 ms
q7 = q7.dropna(axis=0)
time: 35.4 ms
q7.columns = ['year_month', 'id', 'app', 'flow', 'label']
time: 1.54 ms
q7.head()
   year_month                id     app       flow  label
0      201707  6610350034824100  騰訊手機管家   0.010002      1
1      201707  6997210664840100  喜馬拉雅FM  27.390625      1
2      201707  3198621664927300    網易新聞   0.029999      1
3      201707  9987406611703100  喜馬拉雅FM   0.000000      1
4      201707  1785540174324200     天氣通   0.020004      1
time: 8.14 ms
q7.describe()
          year_month            id           flow     label
count  173117.000000  1.731170e+05  173117.000000  173117.0
mean   201706.539699  5.403100e+15            NaN       1.0
std         0.498423  2.667026e+15            NaN       0.0
min    201706.000000  1.448104e+12       0.000000       1.0
25%    201706.000000  3.056260e+15       0.010002       1.0
50%    201707.000000  5.429056e+15       0.080017       1.0
75%    201707.000000  7.730223e+15       1.599609       1.0
max    201707.000000  9.997949e+15    7828.000000       1.0
time: 70.4 ms
q7.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 173117 entries, 0 to 173116
Data columns (total 5 columns):
year_month 173117 non-null int32
id 173117 non-null int64
app 173117 non-null object
flow 173117 non-null float16
label 173117 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1), object(1)
memory usage: 5.1+ MB
time: 29.8 ms
q1
Sum the two months' consumption per user
q1.head()
   year_month                id    consume  label
2      201706  8160829951314300   82.75000      1
3      201707  8160829951314300   37.68750      1
4      201706  1508075698521400   68.00000      1
5      201707  1508075698521400   49.59375      1
6      201706  1686251204809800  200.75000      1
time: 7.05 ms
q1 = q1[['id', 'consume']]
time: 2.91 ms
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 747 ms
len(q1)
10865
time: 8.1 ms
q1[q1['id'] == 1448103998000]
                 id   consume
3532  1448103998000  18.09375
3533  1448103998000  44.28125
time: 8.84 ms
q1_groupbyid[:10]
                   consume
id
1448103998000     62.37500
17398718813730   460.75000
61132623486000    12.28125
68156596675520   903.50000
76819334576430   282.25000
78745100940550   531.00000
110229638660000  253.00000
122134826301000  138.75000
132923269304000   26.81250
138204830829320  387.50000
time: 5.8 ms
q2
Feature 1: top-9 phone brands plus "other", 10 columns in total
Feature 2: number of distinct brands used
q2 = q2[['id', 'brand']]
time: 4.86 ms
q2.head(10)
                  id brand
1   1752398069509000    華爲
2   1752398069509000    樂視
3   1752398069509000    金立
4   1752398069509000    金立
5   1752398069509000    華爲
6   1752398069509000    華爲
7   1752398069509000    金立
8   1752398069509000    三星
9   4799656026499908    三星
10  4799656026499908    華爲
time: 6.36 ms
groupbybrand = q2['brand'].value_counts()
time: 18.7 ms
len(groupbybrand)
750
time: 2.09 ms
%matplotlib inline
groupbybrand.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1c00ea7b8>
time: 454 ms
groupbybrand[ : 10 ]
蘋果 62347
華爲 22266
歐珀 20516
維沃 17158
三星 13435
小米 10632
金立 9922
魅族 9708
樂視 5609
四季恆美 2163
Name: brand, dtype: int64
time: 3.52 ms
q2 = q2.drop_duplicates()
groupbyid = q2['id'].value_counts()
time: 19.6 ms
len(groupbyid)
5597
time: 2.23 ms
%matplotlib inline
groupbyid.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1bb56e048>
time: 294 ms
groupbyid[ : 10 ]
4104535378288025 115
8707678197418467 108
3900535090108175 104
3986280749497468 93
9196501153454276 88
5510977603357000 84
8569492566715454 78
1106540188374027 71
4091371962011072 71
4874962666674313 71
Name: id, dtype: int64
time: 3.27 ms
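The distinct-brand count above (dedupe the (id, brand) pairs, then `value_counts` on id) can also be written as a single `groupby(...).nunique()`. Both routes are sketched here on toy data (ids and brand names are illustrative):

```python
import pandas as pd

pairs = pd.DataFrame({'id': [1, 1, 1, 2], 'brand': ['huawei', 'huawei', 'mi', 'mi']})
via_dedupe = pairs.drop_duplicates()['id'].value_counts()  # the notebook's route
via_nunique = pairs.groupby('id')['brand'].nunique()       # equivalent one-liner
print(via_dedupe.sort_index().tolist(), via_nunique.tolist())
```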
q1[q1['id'] == 4104535378288025]
       year_month                id  consume  label
10576      201706  4104535378288025  208.000      1
10577      201707  4104535378288025  205.125      1
time: 7.63 ms
type(groupbyid)
pandas.core.series.Series
time: 2.14 ms
type(groupbyid.to_frame())
pandas.core.frame.DataFrame
time: 3.13 ms
q2_groupbyid = groupbyid.reset_index()
time: 2.34 ms
q2_groupbyid.columns = ['id', 'phone_nums']
time: 1.19 ms
q2_groupbyid.head()
                 id  phone_nums
0  4104535378288025         115
1  8707678197418467         108
2  3900535090108175         104
3  3986280749497468          93
4  9196501153454276          88
time: 6.12 ms
type(q1_groupbyid)
pandas.core.frame.DataFrame
time: 2.15 ms
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
time: 6.42 ms
pos_set.head()
               id    consume  phone_nums
0   1448103998000   62.37500           6
1  17398718813730  460.75000          23
2  61132623486000   12.28125           1
3  68156596675520  903.50000           4
4  76819334576430  282.25000          21
time: 7.11 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 3 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
dtypes: float16(1), int64(2)
memory usage: 139.0 KB
time: 6.27 ms
q3
1. Sum the two months' contact-circle size
2. Sum the two months' out-of-province flags (yes: 1, no: 0)
3. Sum the two months' out-of-country flags (yes: 1, no: 0)
q3.head()
   year_month                id  call_nums  is_trans_provincial  is_transnational  label
0      201707  6062475264825100         88                    1                 0      1
1      201707  8160829951314300         27                    0                 0      1
2      201707  1508075698521400         19                    0                 0      1
3      201707  1686251204809800        207                    0                 0      1
4      201707  5627768389537500        133                    1                 0      1
time: 7.69 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
time: 1.95 s
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
time: 5.14 ms
pos_set.head()
               id    consume  phone_nums  call_nums
0   1448103998000   62.37500           6         21
1  17398718813730  460.75000          23        217
2  61132623486000   12.28125           1         61
3  68156596675520  903.50000           4        353
4  76819334576430  282.25000          21        431
time: 7.94 ms
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
time: 9.61 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 6 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
call_nums 5473 non-null int16
is_trans_provincial 5473 non-null int8
is_transnational 5473 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2)
memory usage: 160.3 KB
time: 7.3 ms
q4
1. Number of roam-out records in the two months
2. One-hot all provinces, or the top-10 provinces plus "other"
3. Number of distinct provinces roamed to in the two months
q4.head(10)
   year_month                id province  label
0      201707  6062475264825100       廣東      1
1      201707  5627768389537500       北京      1
2      201707  2000900444179600       山西      1
3      201707  5304502776817600       四川      1
4      201707  5304502776817600       四川      1
5      201707  5304502776817600       四川      1
6      201707  5304502776817600       重慶      1
7      201707  8594396491246200       廣西      1
8      201707  8594396491246200       廣西      1
9      201707  8594396491246200       廣西      1
time: 8.78 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).agg({'province': pd.Series.unique})
q4_groupbyid.head()
                         province
id
17398718813730                 重慶
61132623486000   [福建, 河南, 江蘇, 安徽]
68156596675520           [遼寧, 廣東]
132923269304000                江西
138204830829320                浙江
time: 322 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid.head()
id
17398718813730 1
61132623486000 8
68156596675520 3
132923269304000 1
138204830829320 2
dtype: int64
time: 6.52 ms
q4[q4['id'] == 61132623486000]
      year_month              id province  label
461       201707  61132623486000       福建      1
462       201707  61132623486000       福建      1
463       201707  61132623486000       福建      1
4363      201706  61132623486000       河南      1
4364      201706  61132623486000       江蘇      1
4365      201706  61132623486000       安徽      1
4366      201706  61132623486000       安徽      1
4367      201706  61132623486000       江蘇      1
time: 8.26 ms
type(q4_groupbyid.reset_index())
pandas.core.frame.DataFrame
time: 4.03 ms
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
time: 2.73 ms
q4_groupbyid.head()
                id  province_out_cnt
0   17398718813730                 1
1   61132623486000                 8
2   68156596675520                 3
3  132923269304000                 1
4  138204830829320                 2
time: 5.73 ms
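Note that `groupby(...).size()` counts roam-out records (feature 1 in the list above), not distinct provinces (feature 3); `nunique` would give the latter. A toy contrast (the ids and provinces are illustrative):

```python
import pandas as pd

q4_toy = pd.DataFrame({'id': [1, 1, 1, 2],
                       'province': ['福建', '福建', '安徽', '四川']})
records = q4_toy.groupby('id').size()                   # all roam-out records
provinces = q4_toy.groupby('id')['province'].nunique()  # distinct provinces
print(records.tolist(), provinces.tolist())
```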
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.head()
               id    consume  phone_nums  call_nums  is_trans_provincial  is_transnational  province_out_cnt
0   1448103998000   62.37500           6         21                    0                 0               NaN
1  17398718813730  460.75000          23        217                    1                 0               1.0
2  61132623486000   12.28125           1         61                    2                 0               8.0
3  68156596675520  903.50000           4        353                    2                 0               3.0
4  76819334576430  282.25000          21        431                    0                 0               NaN
time: 14.6 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
call_nums 5473 non-null int16
is_trans_provincial 5473 non-null int8
is_transnational 5473 non-null int8
province_out_cnt 1913 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 7.53 ms
pos_set = pos_set.fillna(0)
time: 2.46 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id 5473 non-null int64
consume 5473 non-null float16
phone_nums 5473 non-null int64
call_nums 5473 non-null int16
is_trans_provincial 5473 non-null int8
is_transnational 5473 non-null int8
province_out_cnt 5473 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 8.02 ms
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1913 entries, 0 to 1912
Data columns (total 7 columns):
id 1913 non-null int64
consume 1913 non-null float16
phone_nums 1913 non-null int64
call_nums 1913 non-null int16
is_trans_provincial 1913 non-null int8
is_transnational 1913 non-null int8
province_out_cnt 1913 non-null int64
dtypes: float16(1), int16(1), int64(3), int8(2)
memory usage: 71.0 KB
time: 6.67 ms
q6: skipped for now
q7
1. Total data traffic
2. Number of distinct apps used
3. Whether certain (travel-related) apps were used
q7.head()
   year_month                id     app       flow  label
0      201707  6610350034824100  騰訊手機管家   0.010002      1
1      201707  6997210664840100  喜馬拉雅FM  27.390625      1
2      201707  3198621664927300    網易新聞   0.029999      1
3      201707  9987406611703100  喜馬拉雅FM   0.000000      1
4      201707  1785540174324200     天氣通   0.020004      1
time: 7.94 ms
q7_groupbyapp = q7.groupby(['app']).agg({'flow': pd.Series.sum})
time: 135 ms
len(q7_groupbyapp)
762
time: 2.04 ms
q7_groupbyapp.sort_values(by='flow', ascending=False)
               flow
app
網易雲音樂        inf
愛奇藝視頻        inf
微信            inf
新浪微博         inf
QQ音樂          inf
今日頭條         inf
QQ          57856.0
手機百度      53408.0
陌陌          43488.0
iTunes      35392.0
騰訊新聞      25952.0
快手          24256.0
手機淘寶      18400.0
UC瀏覽器      16608.0
酷狗音樂      15360.0
高德地圖      14984.0
酷我音樂      13488.0
新浪新聞      13432.0
唯品會        11504.0
騰訊視頻      10760.0
優酷視頻      10736.0
汽車之家       9984.0
百度地圖       9816.0
美團           9400.0
網易新聞       8648.0
AppStore     7776.0
中國聯通手機營業廳  6736.0
百度貼吧       6104.0
鳳凰新聞       5504.0
蝦米音樂       5020.0
...             ...
百才招聘網        0.0
碰碰             0.0
禾文阿思看圖購    0.0
科學作息時間表    0.0
章魚輸入法        0.0
米折             0.0
約會吧           0.0
網易微博         0.0
表情大全         0.0
歡樂互娛         0.0
博客大巴         0.0
查快遞           0.0
郵儲銀行         0.0
號簿助手         0.0
司機邦           0.0
壁紙多多         0.0
天天聊           0.0
天翼閱讀         0.0
安全管家         0.0
安卓遊戲盒子      0.0
安軟市場         0.0
車網互聯         0.0
宜搜搜索         0.0
工程師爸爸       0.0
彩票控           0.0
貝瓦兒歌         0.0
搜狗壁紙         0.0
智遠一戶通       0.0
誠品快拍         0.0
07073手遊中心    0.0

762 rows × 1 columns
time: 12.4 ms
pos_set.describe()

                 id      consume   phone_nums    call_nums  is_trans_provincial  is_transnational  province_out_cnt
count  5.473000e+03  5473.000000  5473.000000  5473.000000          5473.000000       5473.000000       5473.000000
mean   5.417038e+15          inf     8.228942   141.201900             0.474511          0.029600          1.300018
std    2.637784e+15          inf     8.551830   121.262826             0.706162          0.187904          3.110401
min    1.448104e+12     0.099976     1.000000    -2.000000             0.000000          0.000000          0.000000
25%    3.113785e+15    82.000000     3.000000    52.000000             0.000000          0.000000          0.000000
50%    5.457364e+15   198.250000     6.000000   108.000000             0.000000          0.000000          0.000000
75%    7.688781e+15   355.250000    10.000000   198.000000             1.000000          0.000000          1.000000
max    9.997949e+15  2392.000000   115.000000  1035.000000             2.000000          2.000000         42.000000
time: 126 ms
pos_set['label'] = 1
pos_set.head()

               id    consume  phone_nums  call_nums  is_trans_provincial  is_transnational  province_out_cnt  label
0   1448103998000   62.37500           6         21                    0                 0               NaN      1
1  17398718813730  460.75000          23        217                    1                 0               1.0      1
2  61132623486000   12.28125           1         61                    2                 0               8.0      1
3  68156596675520  903.50000           4        353                    2                 0               3.0      1
4  76819334576430  282.25000          21        431                    0                 0               NaN      1
time: 10.5 ms
pos_set.fillna(0)
pos_set.head()

               id    consume  phone_nums  call_nums  is_trans_provincial  is_transnational  province_out_cnt  label
0   1448103998000   62.37500           6         21                    0                 0               NaN      1
1  17398718813730  460.75000          23        217                    1                 0               1.0      1
2  61132623486000   12.28125           1         61                    2                 0               8.0      1
3  68156596675520  903.50000           4        353                    2                 0               3.0      1
4  76819334576430  282.25000          21        431                    0                 0               NaN      1
time: 23.5 ms
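Note that `head()` still shows NaN in `province_out_cnt`: `fillna(0)` returns a new DataFrame and leaves `pos_set` untouched unless the result is assigned back. A minimal sketch of the pitfall and the fix:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'province_out_cnt': [np.nan, 1.0]})

df.fillna(0)  # returns a filled copy; df itself still contains NaN
print(df['province_out_cnt'].isna().sum())  # 1

df = df.fillna(0)  # assign the result back (or pass inplace=True)
print(df['province_out_cnt'].isna().sum())  # 0
```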
0.1.2 Negative samples
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
n1.columns = ['year_month', 'id', 'consume', 'label']
n1 = n1.dropna(axis=0)
n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
n2 = n2.dropna(axis=0)
n2 = n2[['id', 'brand']]
n2 = n2.drop_duplicates()
n2_groupbyid = n2['id'].value_counts()
n2_groupbyid = n2_groupbyid.reset_index()
n2_groupbyid.columns = ['id', 'phone_nums']
neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])
neg_set.head()
Mem. usage decreased to 2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
              id     consume  phone_nums
0  1009387204000  225.000000           4
1  1167316303000    1.199219           4
2  1883071709000  213.500000           8
3  3393143830010  517.500000           6
4  4568973162000   18.078125           3
time: 10.8 s
neg_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 76515 entries, 0 to 76514
Data columns (total 3 columns):
id 76515 non-null int64
consume 76515 non-null float16
phone_nums 76515 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.9 MB
time: 11.1 ms
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
n4.columns = ['year_month', 'id', 'province', 'label']
n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()
n4_groupbyid = n4_groupbyid.reset_index()
n4_groupbyid.columns = ['id', 'province_out_cnt']
neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])
neg_set = neg_set.fillna(0)
neg_set.head()
Mem. usage decreased to 3.03 Mb (64.6% reduction)
Mem. usage decreased to 0.73 Mb (34.4% reduction)
              id     consume  phone_nums  call_nums  is_trans_provincial  is_transnational  province_out_cnt
0  1009387204000  225.000000           4         19                    0                 0               0.0
1  1167316303000    1.199219           4          6                    0                 0               0.0
2  1883071709000  213.500000           8         40                    0                 0               0.0
3  3393143830010  517.500000           6        205                    1                 0               2.0
4  4568973162000   18.078125           3         17                    0                 0               0.0
time: 32.5 s
neg_set['label'] = 0
time: 1.83 ms
neg_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 76515 entries, 0 to 76514
Data columns (total 8 columns):
id 76515 non-null int64
consume 76515 non-null float16
phone_nums 76515 non-null int64
call_nums 76515 non-null int16
is_trans_provincial 76515 non-null int8
is_transnational 76515 non-null int8
province_out_cnt 76515 non-null float64
label 76515 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2)
memory usage: 3.4 MB
time: 18.9 ms
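With `label` set to 1 on `pos_set` and 0 on `neg_set`, the two frames can be stacked into one training table. A minimal sketch with toy stand-ins for the two frames (not the notebook's actual next cell):

```python
import pandas as pd

# Toy stand-ins for pos_set / neg_set sharing the same feature columns.
pos_set = pd.DataFrame({'id': [1], 'consume': [62.4], 'label': [1]})
neg_set = pd.DataFrame({'id': [2], 'consume': [225.0], 'label': [0]})

# Stack positives and negatives into a single training table.
train = pd.concat([pos_set, neg_set], ignore_index=True, sort=False)
print(train.shape)  # (2, 3)
```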
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
Mem. usage decreased to 2.67 Mb (53.1% reduction)
time: 484 ms
n1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.28 ms
n1.head()

   year_month                id  consume  label
0      201707  8570518832906100     9.00      0
1      201707  2182640938718700    10.00      0
2      201707   783614344429000     8.38      0
3      201707  2007036960106400   100.00      0
4      201707  9482847959399300   226.05      0
time: 7.22 ms
n1.describe()

          year_month            id        consume     label
count  186800.000000  1.868000e+05  150750.000000  186800.0
mean   201706.500000  5.464219e+15      63.580028       0.0
std         0.500001  2.633848e+15      84.063600       0.0
min    201706.000000  1.009387e+12     -70.660000       0.0
25%    201706.000000  3.192389e+15      12.930000       0.0
50%    201706.500000  5.486486e+15      34.000000       0.0
75%    201707.000000  7.744140e+15      82.500000       0.0
max    201707.000000  9.999717e+15    3979.940000       0.0
time: 52.5 ms
n1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 4 columns):
year_month 186800 non-null int64
id 186800 non-null int64
consume 150750 non-null float64
label 186800 non-null int64
dtypes: float64(1), int64(3)
memory usage: 5.7 MB
time: 21.7 ms
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
Mem. usage decreased to 51.13 Mb (14.6% reduction)
time: 7.76 s
n2.head()

                  0     1         2               3               4  5
0  5227696575283900  蘋果     A1699  20150331210636  20150701063017  0
1  6279759720262000   NaN       NaN  20160725112240  20170731235959  0
2  6279759720262000   NaN       NaN  20161205220417  20161205220417  0
3  6279759720262000  三星  SM-A9000  20161128231001  20161128231001  0
4  6279759720262000   NaN       NaN  20161220102623  20170306173713  0
time: 8.15 ms
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.2 ms
n2.head()

                 id brand      type  first_use_time  recent_use_time  label
0  5227696575283900  蘋果     A1699  20150331210636   20150701063017      0
1  6279759720262000   NaN       NaN  20160725112240   20170731235959      0
2  6279759720262000   NaN       NaN  20161205220417   20161205220417      0
3  6279759720262000  三星  SM-A9000  20161128231001   20161128231001      0
4  6279759720262000   NaN       NaN  20161220102623   20170306173713      0
time: 8.3 ms
n2.describe()

                 id  first_use_time  recent_use_time      label
count  1.307608e+06    1.307608e+06     1.307608e+06  1307608.0
mean   5.460966e+15    1.999810e+13     1.999992e+13        0.0
std    2.619222e+15    1.801007e+12     1.801171e+12        0.0
min    1.009387e+12   -1.000000e+00    -1.000000e+00        0.0
25%    3.196695e+15    2.015112e+13     2.016022e+13        0.0
50%    5.477102e+15    2.016071e+13     2.016101e+13        0.0
75%    7.728047e+15    2.016123e+13     2.017023e+13        0.0
max    9.999717e+15    2.017073e+13     2.017073e+13        0.0
time: 252 ms
n2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1307608 entries, 0 to 1307607
Data columns (total 6 columns):
id 1307608 non-null int64
brand 894190 non-null object
type 894205 non-null object
first_use_time 1307608 non-null int64
recent_use_time 1307608 non-null int64
label 1307608 non-null int64
dtypes: int64(4), object(2)
memory usage: 59.9+ MB
time: 251 ms
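`first_use_time` / `recent_use_time` are 14-digit integers of the form yyyymmddHHMMSS, with -1 as a missing-value marker (visible as the minimum in `describe()`). A sketch of turning them into real timestamps, assuming that layout:

```python
import pandas as pd

ts = pd.Series([20150331210636, -1])

# Treat the -1 sentinel as missing, then parse yyyymmddHHMMSS strings.
parsed = pd.to_datetime(ts.where(ts > 0).astype('Int64').astype(str),
                        format='%Y%m%d%H%M%S', errors='coerce')
print(parsed[0])  # 2015-03-31 21:06:36
```

From the parsed timestamps, features such as handset age (`recent_use_time - first_use_time`) become straightforward to compute.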
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
Mem. usage decreased to 3.03 Mb (64.6% reduction)
time: 584 ms
n3.head()

        0                 1   2  3  4  5
0  201707  4295277677437000  36  1  0  0
1  201707  9121335969062000  37  0  0  0
2  201707  9438277095447300  -1  0  0  0
3  201707  6749854876532500  20  0  0  0
4  201707  1545361809381400  26  0  0  0
time: 7.82 ms
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.13 ms
n3.head()

   year_month                id  call_nums  is_trans_provincial  is_transnational  label
0      201707  4295277677437000         36                    1                 0      0
1      201707  9121335969062000         37                    0                 0      0
2      201707  9438277095447300         -1                    0                 0      0
3      201707  6749854876532500         20                    0                 0      0
4      201707  1545361809381400         26                    0                 0      0
time: 7.49 ms
n3.describe()

          year_month            id      call_nums  is_trans_provincial  is_transnational     label
count  186800.000000  1.868000e+05  186800.000000        186800.000000     186800.000000  186800.0
mean   201706.500000  5.464219e+15      32.674797             0.093292          0.005054       0.0
std         0.500001  2.633848e+15      46.054929             0.290842          0.070909       0.0
min    201706.000000  1.009387e+12      -1.000000             0.000000          0.000000       0.0
25%    201706.000000  3.192389e+15       4.000000             0.000000          0.000000       0.0
50%    201706.500000  5.486486e+15      19.000000             0.000000          0.000000       0.0
75%    201707.000000  7.744140e+15      43.000000             0.000000          0.000000       0.0
max    201707.000000  9.999717e+15    1807.000000             1.000000          1.000000       0.0
time: 75.7 ms
n3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 6 columns):
year_month 186800 non-null int64
id 186800 non-null int64
call_nums 186800 non-null int64
is_trans_provincial 186800 non-null int64
is_transnational 186800 non-null int64
label 186800 non-null int64
dtypes: int64(6)
memory usage: 8.6 MB
time: 26.6 ms
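`call_nums` uses -1 as a missing-value marker (its minimum above is -1), so summing it per user quietly subtracts one for each missing month. A sketch of cleaning the sentinel before aggregating:

```python
import pandas as pd

call = pd.Series([36, -1, 20])

# Map the -1 sentinel to 0 (or NaN) before summing per user,
# otherwise missing months subtract from the total.
cleaned = call.replace(-1, 0)
print(cleaned.sum())  # 56
```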
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
Mem. usage decreased to 0.73 Mb (34.4% reduction)
time: 88.8 ms
n4.columns = ['year_month', 'id', 'province', 'label']
time: 1.15 ms
n4.head()

   year_month                id province  label
0      201707  4295277677437000     重慶      0
1      201707  5560109665240300     廣西      0
2      201707  5560109665240300     廣東      0
3      201707  5560109665240300     廣東      0
4      201707  5705601521649600     重慶      0
time: 7.14 ms
n4.describe()

          year_month            id    label
count   36499.000000  3.649900e+04  36499.0
mean   201706.539193  5.471019e+15      0.0
std         0.498468  2.639006e+15      0.0
min    201706.000000  3.393144e+12      0.0
25%    201706.000000  3.203830e+15      0.0
50%    201707.000000  5.468480e+15      0.0
75%    201707.000000  7.753756e+15      0.0
max    201707.000000  9.999305e+15      0.0
time: 24.4 ms
n4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36499 entries, 0 to 36498
Data columns (total 4 columns):
year_month 36499 non-null int64
id 36499 non-null int64
province 36099 non-null object
label 36499 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.1+ MB
time: 9.97 ms
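The `province_out_cnt` feature built earlier with `groupby(...).size()` counts roaming *records*, including repeat visits to the same province. Counting *distinct* provinces is a slightly different signal; a sketch of both on hypothetical rows:

```python
import pandas as pd

n4 = pd.DataFrame({'id': [1, 1, 1], 'province': ['廣西', '廣東', '廣東']})

# size() counts roaming records; nunique() counts distinct provinces visited.
records = n4.groupby('id')['province'].size()
distinct = n4.groupby('id')['province'].nunique()
print(records[1], distinct[1])  # 3 2
```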
!ls /home/kesci/input/gzlt/train_set/201708n/
201708n1.txt 201708n3.txt 201708n6.txt
201708n2.txt 201708n4.txt 201708n7.txt
time: 669 ms
n6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n6.txt', sep='\t', header=None))
Mem. usage decreased to 798.26 Mb (52.1% reduction)
time: 2min 59s
n6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 1.51 ms
n6.head()

         date  hour                id  user_longitude  user_latitude  label
0  2017-07-02  10.0  7748777616409800      106.680816      26.563650      0
1  2017-07-10   0.0  7748777616409800      106.719520      26.576370      0
2  2017-07-31  14.0  7748777616409800      106.683060      26.654663      0
3  2017-07-01   0.0  6633710902197900      106.697440      26.613930      0
4  2017-07-08  14.0  6633710902197900      106.715700      26.609710      0
time: 9.14 ms
q6.describe()

               hour            id  user_longitude  user_latitude      label
count  2.852871e+06  2.852871e+06    2.851527e+06   2.851527e+06  2852871.0
mean   1.141897e+01  5.415213e+15    1.068143e+02   2.659968e+01        1.0
std    6.632995e+00  2.634349e+15    5.580043e-01   2.852525e-01        0.0
min    0.000000e+00  1.448104e+12    1.036700e+02   2.470664e+01        1.0
25%    6.000000e+00  3.135488e+15    1.066656e+02   2.654610e+01        1.0
50%    1.200000e+01  5.442594e+15    1.067027e+02   2.658143e+01        1.0
75%    1.800000e+01  7.687963e+15    1.067373e+02   2.662629e+01        1.0
max    2.200000e+01  9.997949e+15    1.095277e+02   2.909348e+01        1.0
time: 979 ms
n6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36393070 entries, 0 to 36393069
Data columns (total 6 columns):
date object
hour float64
id int64
user_longitude float64
user_latitude float64
label int64
dtypes: float64(3), int64(2), object(1)
memory usage: 1.6+ GB
time: 3.76 ms
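The location table gives raw longitude/latitude points, so a natural feature is whether a user was ever observed inside the target region. A sketch with a purely hypothetical bounding box (the real Qiandongnan boundary would need actual coordinates or a polygon test):

```python
import pandas as pd

pts = pd.DataFrame({'user_longitude': [106.68, 108.50],
                    'user_latitude': [26.56, 26.70]})

# Hypothetical bounding box standing in for the target region;
# replace with real coordinates or a proper polygon check.
LON_MIN, LON_MAX, LAT_MIN, LAT_MAX = 107.0, 109.5, 25.5, 27.5

in_box = (pts.user_longitude.between(LON_MIN, LON_MAX) &
          pts.user_latitude.between(LAT_MIN, LAT_MAX))
print(in_box.tolist())  # [False, True]
```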
n7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n7.txt', sep='\t', header=None))
Mem. usage decreased to 17.98 Mb (31.2% reduction)
time: 3.14 s
n7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.44 ms
n7.head()

   year_month                id         app  flow
0      201707  4011022166491000       米聊  0.01
1      201707  8544172893207700   百度地圖  2.07
2      201707  9856572220983403  搜狗輸入法  0.00
3      201707  6441300393946200  愛奇藝視頻  0.00
4      201707  8751918977379700  開心消消樂  0.03
time: 7.51 ms
time: 2.94 ms
n7['label'] = 0
n7.head()

   year_month                id         app  flow  label
0      201707  4011022166491000       米聊  0.01      0
1      201707  8544172893207700   百度地圖  2.07      0
2      201707  9856572220983403  搜狗輸入法  0.00      0
3      201707  6441300393946200  愛奇藝視頻  0.00      0
4      201707  8751918977379700  開心消消樂  0.03      0
time: 8.46 ms
n7.describe()

          year_month            id           flow     label
count  856961.000000  8.569610e+05  856961.000000  856961.0
mean   201706.535881  5.432556e+15       9.942533       0.0
std         0.498711  2.643712e+15      68.096944       0.0
min    201706.000000  1.009387e+12       0.000000       0.0
25%    201706.000000  3.134290e+15       0.000000       0.0
50%    201707.000000  5.440495e+15       0.060000       0.0
75%    201707.000000  7.727765e+15       1.130000       0.0
max    201707.000000  9.999717e+15   10986.150000       0.0
time: 170 ms
n7.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856961 entries, 0 to 856960
Data columns (total 5 columns):
year_month 856961 non-null int64
id 856961 non-null int64
app 856961 non-null object
flow 856961 non-null float64
label 856961 non-null int64
dtypes: float64(1), int64(3), object(1)
memory usage: 32.7+ MB
time: 116 ms
0.1.3 Weather data
!ls /home/kesci/input/gzlt/train_set/weather_data_2017/
weather_forecast_2017.txt weather_reported_2017.txt 天氣現象編碼.xlsx
time: 669 ms
weather_reported = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_reported_2017.txt', sep='\t')
time: 6.15 ms
weather_reported.head()

  Station_Name  VACODE  Year  Month  Day  TEM_Avg  TEM_Max  TEM_Min  PRE_Time_2020             WEP_Record
0         麻江  522635  2017      6    1    23.00     24.5     20.9            0.6          ( 01 60 ) 60 .
1         三穗  522624  2017      6    1    21.13     25.6     19.4            9.0  ( 01 10 80 ) 80 60 .
2         鎮遠  522625  2017      6    1    22.68     26.5     21.3            8.9             ( 60 ) 60 .
3         雷山  522634  2017      6    1    23.80     26.1     20.4            5.1             ( 10 ) 60 .
4         劍河  522629  2017      6    1    23.53     27.1     22.0            6.8  ( 01 10 80 ) 80 10 .
time: 12.2 ms
time: 1.25 ms
weather_reported.describe()

       Station_Name  VACODE  Year  Month   Day  TEM_Avg  TEM_Max  TEM_Min  PRE_Time_2020   WEP_Record
count          1404    1404  1404   1404  1404     1404     1404     1404           1404         1404
unique           24      25     2      3    32      448      214      109            330          305
top            貴陽  520000  2017      7     4    22.83     30.5     20.5            0.0  ( 01 ) 01 .
freq             61     360  1403    713    46       10       18       35            625          197
time: 49.9 ms
weather_reported.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1404 entries, 0 to 1403
Data columns (total 10 columns):
Station_Name 1404 non-null object
VACODE 1404 non-null object
Year 1404 non-null object
Month 1404 non-null object
Day 1404 non-null object
TEM_Avg 1404 non-null object
TEM_Max 1404 non-null object
TEM_Min 1404 non-null object
PRE_Time_2020 1404 non-null object
WEP_Record 1404 non-null object
dtypes: object(10)
memory usage: 109.8+ KB
time: 6.32 ms
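`WEP_Record` values such as `( 01 10 80 ) 80 60 .` are strings of weather-phenomenon codes (decoded via 天氣現象編碼表.xlsx). A sketch of pulling the codes out with a regex, assuming the parenthesised group and the trailing part are two separate code lists (that layout is my reading of the samples, not documented behaviour):

```python
import re

rec = '( 01 10 80 ) 80 60 .'

# Codes inside the parentheses vs after them; splitting on ')' is an
# assumption about the record layout.
night_part, day_part = rec.split(')')
night = re.findall(r'\d+', night_part)
day = re.findall(r'\d+', day_part)
print(night, day)  # ['01', '10', '80'] ['80', '60']
```

The extracted codes could then be mapped to categories (rain, fog, etc.) using the code table and aggregated per day.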
weather_forecast = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_forecast_2017.txt', sep='\t')
time: 10.8 ms
weather_forecast.head()

  Station_Name  VACODE  Year  Mon  Day  TEM_Max_24h  TEM_Min_24h  WEP_24h  TEM_Max_48h  TEM_Min_48h  ...  TEM_Max_120h  TEM_Min_120h  WEP_120h  TEM_Max_144h  TEM_Min_144h  WEP_144h  TEM_Max_168h  TEM_Min_168h,WEP_168h  Unnamed: 24  Unnamed: 25
0         白雲  520113  2017    6    1         25.0         17.0     (2)1         24.0         19.0  ...          (4)2          25.0      15.0          (2)1          27.0      15.0          (1)0                   26.0         16.0         (1)0
1         岑鞏  522626  2017    6    1         31.3         19.4     (1)1         31.0         22.0  ...          (4)1          32.0      19.4          (1)1          32.0      22.8          (1)1                   32.0         21.0         (1)1
2         從江  522633  2017    6    1         33.4         22.0     (1)1         30.0         23.0  ...          (4)3          34.0      22.0          (1)1          34.0      23.8          (1)1                   34.0         22.0         (1)1
3         丹寨  522636  2017    6    1         27.5         18.0     (1)1         24.5         20.0  ...          (4)1          28.5      18.0          (1)1          28.5      21.0          (1)1                   28.5         20.0         (1)1
4         貴陽  520103  2017    6    1         26.0         18.0     (2)1         25.0         20.0  ...          (4)2          26.0      16.0          (2)1          28.0      16.0          (1)0                   27.0         17.0         (1)0

5 rows × 26 columns
time: 86.4 ms
weather_forecast.describe()

              VACODE    Year          Mon          Day  TEM_Max_24h  TEM_Min_24h  TEM_Max_48h  TEM_Min_48h  TEM_Max_72h  TEM_Min_72h,WEP_72h  TEM_Min_96h      WEP_96h  TEM_Min_120h     WEP_120h  TEM_Min_144h     WEP_144h  TEM_Min_168h,WEP_168h  Unnamed: 24
count    1464.000000  1464.0  1464.000000  1464.000000  1464.000000  1464.000000  1464.000000  1464.000000  1464.000000          1464.000000  1464.000000  1464.000000   1464.000000  1464.000000   1464.000000  1464.000000            1464.000000  1464.000000
mean   521792.583333  2017.0     6.508197    15.754098    28.374658    20.721585    28.375820    20.872814    28.283811            21.112432    28.539481    21.408128     28.702254    21.454713     29.142623    21.485656              29.131626    21.589003
std      1180.891163     0.0     0.500104     8.809966     4.300391     2.290850     4.379771     2.232788     4.329132             2.204980     4.154188     5.203525      4.167441     5.238257      4.124026     2.180222               4.033227     2.391945
min    520103.000000  2017.0     6.000000     1.000000    17.300000    13.800000    17.300000    13.600000    17.000000            10.000000    19.000000    14.300000     19.000000    15.000000     18.000000    15.000000              18.000000     2.000000
25%    520122.750000  2017.0     6.000000     8.000000    25.000000    19.000000    25.000000    19.400000    25.000000            19.600000    25.500000    19.700000     26.000000    19.700000     26.000000    20.000000              26.500000    20.000000
50%    522624.500000  2017.0     7.000000    16.000000    28.500000    21.000000    28.500000    21.000000    28.000000            21.000000    28.500000    21.500000     28.500000    21.500000     29.000000    22.000000              29.000000    22.000000
75%    522630.250000  2017.0     7.000000    23.000000    31.800000    22.500000    31.600000    22.500000    31.500000            23.000000    31.500000    23.000000     32.000000    23.000000     32.000000    23.000000              32.000000    23.500000
max    522636.000000  2017.0     7.000000    31.000000    39.000000    25.700000    39.500000    25.500000    38.000000            25.800000    39.000000   200.000000     39.000000   202.000000     38.800000    25.800000              37.500000    26.000000
time: 121 ms
weather_forecast.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1464 entries, 0 to 1463
Data columns (total 26 columns):
Station_Name 1464 non-null object
VACODE 1464 non-null int64
Year 1464 non-null int64
Mon 1464 non-null int64
Day 1464 non-null int64
TEM_Max_24h 1464 non-null float64
TEM_Min_24h 1464 non-null float64
WEP_24h 1464 non-null object
TEM_Max_48h 1464 non-null float64
TEM_Min_48h 1464 non-null float64
WEP_48h 1464 non-null object
TEM_Max_72h 1464 non-null float64
TEM_Min_72h,WEP_72h 1464 non-null float64
TEM_Max_96h 1464 non-null object
TEM_Min_96h 1464 non-null float64
WEP_96h 1464 non-null float64
TEM_Max_120h 1464 non-null object
TEM_Min_120h 1464 non-null float64
WEP_120h 1464 non-null float64
TEM_Max_144h 1464 non-null object
TEM_Min_144h 1464 non-null float64
WEP_144h 1464 non-null float64
TEM_Max_168h 1464 non-null object
TEM_Min_168h,WEP_168h 1464 non-null float64
Unnamed: 24 1464 non-null float64
Unnamed: 25 1464 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.5+ KB
time: 9.2 ms
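The `info()` output above shows fused headers like `TEM_Min_72h,WEP_72h` plus trailing `Unnamed: 24` / `Unnamed: 25` columns: some header fields in the tab-separated file contain a comma, so two names collapse into one and later columns shift. A sketch of repairing just the names (in the real file the shifted *values* in the `Unnamed` columns would need reassigning too):

```python
import pandas as pd

# Toy frame reproducing a fused header field like 'TEM_Min_72h,WEP_72h'.
df = pd.DataFrame({'TEM_Max_72h': [28.0], 'TEM_Min_72h,WEP_72h': [21.0]})

# Keep only the first name from each fused, comma-joined header.
df.columns = [c.split(',')[0] for c in df.columns]
print(list(df.columns))  # ['TEM_Max_72h', 'TEM_Min_72h']
```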
0.2 Test data
0.2.1 Test set
t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))
t1.columns = ['year_month', 'id', 'consume']
t1 = t1.dropna(axis=0)
t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})
t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
t2 = t2.dropna(axis=0)
t2 = t2[['id', 'brand']]
t2 = t2.drop_duplicates()
t2_groupbyid = t2['id'].value_counts()
t2_groupbyid = t2_groupbyid.reset_index()
t2_groupbyid.columns = ['id', 'phone_nums']
test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])
test_set.head()
Mem. usage decreased to 1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
              id  consume  phone_nums
0   595941207920  220.000          10
1   901845022650  662.000           6
2  1868765858840  143.375           4
3  5058794512580  200.000           7
4  5399381591230  192.000          29
time: 7.86 s
test_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43977 entries, 0 to 43976
Data columns (total 3 columns):
id 43977 non-null int64
consume 43977 non-null float16
phone_nums 43977 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.1 MB
time: 9.02 ms
t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
test_set = test_set.merge(t3_groupbyid_call, on=['id'])
test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])
test_set = test_set.merge(t3_groupbyid_trans, on=['id'])
t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))
t4.columns = ['year_month', 'id', 'province']
t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()
t4_groupbyid = t4_groupbyid.reset_index()
t4_groupbyid.columns = ['id', 'province_out_cnt']
test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])
test_set = test_set.fillna(0)
test_set.head()
Mem. usage decreased to 1.53 Mb (60.0% reduction)
Mem. usage decreased to 0.85 Mb (16.7% reduction)
              id  consume  phone_nums  call_nums  is_trans_provincial  is_transnational  province_out_cnt
0   595941207920  220.000          10         68                    1                 0               1.0
1   901845022650  662.000           6        278                    0                 0               0.0
2  1868765858840  143.375           4        107                    2                 0               3.0
3  5058794512580  200.000           7        128                    0                 0               0.0
4  5399381591230  192.000          29         61                    0                 0               0.0
time: 17.4 s
!ls /home/kesci/input/gzlt/test_set/
201808 weather_data_2018
time: 704 ms
!ls /home/kesci/input/gzlt/test_set/201808
2018_1.txt 2018_2.txt 2018_3.txt 2018_4.txt 2018_6.txt 2018_7.txt
time: 702 ms
t1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
time: 527 ms
t1.columns = ['year_month', 'id', 'consume']
time: 1.27 ms
t1.head()

   year_month                id  consume
0      201807  6401824160010748   618.40
1      201807  6506134548135499      NaN
2      201807  5996920884619954    22.05
3      201806  1187209424543713     7.20
4      201807  9297165066591558   124.00
time: 99.9 ms
t1.describe()

          year_month            id       consume
count  100402.000000  1.004020e+05  86787.000000
mean   201806.500000  5.449905e+15    103.357399
std         0.500002  2.628916e+15    311.428596
min    201806.000000  5.959412e+11      0.010000
25%    201806.000000  3.176902e+15     36.500000
50%    201806.500000  5.440931e+15     81.000000
75%    201807.000000  7.726318e+15    132.125000
max    201807.000000  9.999920e+15  61465.900000
time: 50.6 ms
t1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100402 entries, 0 to 100401
Data columns (total 3 columns):
year_month 100402 non-null int64
id 100402 non-null int64
consume 86787 non-null float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
time: 12.6 ms
%matplotlib inline
t1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.
<matplotlib.axes._subplots.AxesSubplot at 0x7fbd4cd3c978>
time: 17 s
t1[t1.consume == 61465.9]

       year_month                id  consume
11962      201807  4827806860301307  61465.9
time: 7.15 ms
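That single 61465.9 bill sits far above the 75th percentile (132.1) and will dominate any scale-sensitive model. One common option is a `log1p` transform, which keeps the ordering but compresses the extreme; a sketch on toy values from the column:

```python
import numpy as np
import pandas as pd

consume = pd.Series([22.05, 124.0, 61465.9])

# log1p keeps ordering but pulls the extreme bill back toward the bulk.
logged = np.log1p(consume)
print(logged.round(2).tolist())  # [3.14, 4.83, 11.03]
```

Clipping at a high quantile (e.g. `consume.clip(upper=consume.quantile(0.99))`) is an alternative when interpretability of the raw scale matters.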
t2 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None)
time: 11.8 s
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
time: 1.18 ms
t2.head()

                 id brand  type  first_use_time  recent_use_time
0  3179771753483280  魅族  M575  20180601151052   20180601151054
1  4185007692177509   NaN   NaN  20171021182915   20171021183000
2  4972845789896505   NaN   NaN  20180624003647   20180624003656
3  4207293827582218   NaN   NaN  20171224165902   20180306175444
4  2628020151876580   NaN   NaN  20170820111053   20171207020159
time: 7.95 ms
t2.describe()

                 id  first_use_time  recent_use_time
count  1.586024e+06    1.586024e+06     1.586024e+06
mean   5.410516e+15    2.017033e+13     2.017156e+13
std    2.618994e+15    6.902153e+09     6.865591e+09
min    5.959412e+11    2.016032e+13     2.016033e+13
25%    3.140763e+15    2.016122e+13     2.017021e+13
50%    5.389338e+15    2.017063e+13     2.017080e+13
75%    7.660413e+15    2.017122e+13     2.018013e+13
max    9.999920e+15    2.018073e+13     2.018073e+13
time: 353 ms
t2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586024 entries, 0 to 1586023
Data columns (total 5 columns):
id 1586024 non-null int64
brand 1098244 non-null object
type 1098250 non-null object
first_use_time 1586024 non-null int64
recent_use_time 1586024 non-null int64
dtypes: int64(3), object(2)
memory usage: 60.5+ MB
time: 291 ms
t3 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None)
time: 451 ms
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
time: 1.14 ms
t3.head()

   year_month                id  call_nums  is_trans_provincial  is_transnational
0      201806  3690814703003361         49                    0                 0
1      201807  4315823592069831         -1                    0                 0
2      201806  5199170013029443         -1                    0                 0
3      201806  1387658205895203         35                    0                 0
4      201807  3280240784164442         -1                    0                 0
time: 7.12 ms
t3.describe()

          year_month            id      call_nums  is_trans_provincial  is_transnational
count  100400.000000  1.004000e+05  100400.000000        100400.000000     100400.000000
mean   201806.500000  5.449990e+15      51.642102             0.206116          0.012809
std         0.500002  2.628873e+15      90.705957             0.404516          0.112449
min    201806.000000  5.959412e+11      -1.000000             0.000000          0.000000
25%    201806.000000  3.177008e+15       6.000000             0.000000          0.000000
50%    201806.500000  5.441108e+15      31.000000             0.000000          0.000000
75%    201807.000000  7.726328e+15      71.000000             0.000000          0.000000
max    201807.000000  9.999920e+15    6537.000000             1.000000          1.000000
time: 46.4 ms
t3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100400 entries, 0 to 100399
Data columns (total 5 columns):
year_month 100400 non-null int64
id 100400 non-null int64
call_nums 100400 non-null int64
is_trans_provincial 100400 non-null int64
is_transnational 100400 non-null int64
dtypes: int64(5)
memory usage: 3.8 MB
time: 15.1 ms
t4 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None)
time: 240 ms
t4.columns = ['year_month', 'id', 'province']
time: 1.2 ms
t4.head()

   year_month                id province
0      201807  8445647072009305     廣東
1      201806  9414872397547413     浙江
2      201806  2272887111818372     廣東
3      201807   224368910874770     湖北
4      201807  6081677258986878      NaN
time: 6.81 ms
t4.describe()

          year_month            id
count   44543.000000  4.454300e+04
mean   201806.530319  5.448788e+15
std         0.499086  2.640390e+15
min    201806.000000  5.959412e+11
25%    201806.000000  3.118911e+15
50%    201807.000000  5.430117e+15
75%    201807.000000  7.751481e+15
max    201807.000000  9.999505e+15
time: 20.3 ms
t4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44543 entries, 0 to 44542
Data columns (total 3 columns):
year_month 44543 non-null int64
id 44543 non-null int64
province 44119 non-null object
dtypes: int64(2), object(1)
memory usage: 1.0+ MB
time: 9.73 ms
t6 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_6.txt', sep='\t', header=None)
time: 2min 7s
t6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude']
time: 1.22 ms
t6.head()

         date  hour                id  user_longitude  user_latitude
0  2018-06-10    20  1929821481825935      106.289902      26.837687
1  2018-07-14    18  5450093661688579      106.641975      26.627846
2  2018-07-16     2  4617571498633816      106.230420      27.466980
3  2018-06-15    22  2826359445811398      106.693610      26.591110
4  2018-06-22    10  3526202744290054      107.032570      27.715830
time: 8.4 ms
t6.describe()

               hour            id  user_longitude  user_latitude
count  1.655899e+07  1.655899e+07    1.655081e+07   1.655081e+07
mean   1.144987e+01  5.461505e+15    1.066642e+02   2.662386e+01
std    6.742805e+00  2.629564e+15    4.626476e-01   3.195807e-01
min    0.000000e+00  5.959412e+11    1.036700e+02   2.469706e+01
25%    6.000000e+00  3.191837e+15    1.066328e+02   2.655164e+01
50%    1.200000e+01  5.475087e+15    1.066902e+02   2.658444e+01
75%    1.800000e+01  7.732384e+15    1.067199e+02   2.663778e+01
max    2.200000e+01  9.999920e+15    1.095534e+02   2.916468e+01
time: 6.3 s
t6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16558993 entries, 0 to 16558992
Data columns (total 5 columns):
date object
hour int64
id int64
user_longitude float64
user_latitude float64
dtypes: float64(2), int64(2), object(1)
memory usage: 631.7+ MB
time: 3.04 ms
t7 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_7.txt', sep='\t', header=None)
time: 8.75 s
t7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.18 ms
t7.head()

   year_month                id           app        flow
0      201806  9813651010156104  OPPO軟件商店    14545.00
1      201806  2338567014163500      騰訊新聞        0.19
2      201807  1133512913801798    訊飛輸入法        0.01
3      201807  7739596338372898      手機百度     1615.00
4      201807  5724269192271018      百度貼吧  1301953.00
time: 15.6 ms
t7.describe()

         year_month            id          flow
count  1.493733e+06  1.493733e+06  1.492434e+06
mean   2.018065e+05  5.468351e+15  8.991198e+07
std    4.999895e-01  2.628382e+15  8.503798e+08
min    2.018060e+05  5.959412e+11  0.000000e+00
25%    2.018060e+05  3.196619e+15  6.519000e+03
50%    2.018070e+05  5.477012e+15  2.883350e+05
75%    2.018070e+05  7.737568e+15  7.842132e+06
max    2.018070e+05  9.999920e+15  3.341152e+11
time: 226 ms
t7.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1493733 entries, 0 to 1493732
Data columns (total 4 columns):
year_month 1493733 non-null int64
id 1493733 non-null int64
app 1457137 non-null object
flow 1492434 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 45.6+ MB
time: 178 ms
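The test-set `flow` is on a wildly different scale from 2017 (max 10986.15 in n7 vs 3.34e11 in t7), which suggests a unit change between the yearly extracts (perhaps MB vs bytes, though that is an assumption). A scale-free transform such as a within-file percentile rank sidesteps the unknown conversion; a sketch on toy values:

```python
import pandas as pd

train_flow = pd.Series([0.06, 1.13, 10986.15])       # 2017-scale values
test_flow = pd.Series([6519.0, 288335.0, 3.34e11])   # 2018-scale values

# Rank-normalising within each file makes the two years comparable
# without knowing the actual unit change.
print(train_flow.rank(pct=True).tolist())
print(test_flow.rank(pct=True).tolist())
```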
0.2.2 Weather data
!ls /home/kesci/input/gzlt/test_set/weather_data_2018/
weather_forecast_2018.txt weather_reported_2018.txt
time: 830 ms
weather_reported_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_reported_2018.txt', sep='\t')
time: 8.57 ms
weather_reported_2018.head()

  Station_Name  VACODE  Year  Month  Day  TEM_Avg  TEM_Max  TEM_Min  PRE_Time_2020                WEP_Record
0         鎮遠  522625  2018      6    1     19.0     21.0     17.8            0.1  ( 60 01 ) 01 60 10 .
1         丹寨  522636  2018      6    1     17.0     19.9     15.3            4.3          ( 60 80 ) 80 .
2         三穗  522624  2018      6    1     17.8     19.2     17.0            0.6       ( 80 10 ) 60 10 .
3         臺江  522630  2018      6    1     18.8     21.1     17.5            1.4  ( 60 01 ) 01 60 10 .
4         劍河  522629  2018      6    1     19.2     21.6     17.9            2.1          ( 60 ) 60 10 .
time: 12.6 ms
weather_reported_2018.describe()

              VACODE    Year        Month          Day        TEM_Avg        TEM_Max        TEM_Min  PRE_Time_2020
count    1403.000000  1403.0  1403.000000  1403.000000    1403.000000    1403.000000    1403.000000    1403.000000
mean   521862.934426  2018.0     6.508197    15.754098     737.393799     742.297577     734.011119       4.922594
std      1155.972144     0.0     0.500111     8.810097   26696.850268   26696.719415   26696.940604      15.090986
min    520103.000000  2018.0     6.000000     1.000000      15.100000      16.200000      11.800000       0.000000
25%    520122.000000  2018.0     6.000000     8.000000      22.900000      27.300000      20.000000       0.000000
50%    522625.000000  2018.0     7.000000    16.000000      25.100000      30.100000      21.600000       0.000000
75%    522631.000000  2018.0     7.000000    23.000000      26.900000      32.550000      23.050000       2.100000
max    522636.000000  2018.0     7.000000    31.000000  999999.000000  999999.000000  999999.000000     281.700000
time: 118 ms
weather_reported_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1403 entries, 0 to 1402
Data columns (total 10 columns):
Station_Name 1403 non-null object
VACODE 1403 non-null int64
Year 1403 non-null int64
Month 1403 non-null int64
Day 1403 non-null int64
TEM_Avg 1403 non-null float64
TEM_Max 1403 non-null float64
TEM_Min 1403 non-null float64
PRE_Time_2020 1403 non-null float64
WEP_Record 1403 non-null object
dtypes: float64(4), int64(4), object(2)
memory usage: 109.7+ KB
time: 6.7 ms
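The describe() output above shows a max of 999999.0 in all three TEM_* columns, which is clearly a missing-value sentinel: it inflates the means (TEM_Avg mean ≈ 737°C) and standard deviations. A minimal cleanup sketch on a hypothetical toy frame (the real file is loaded as above):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking weather_reported_2018; the real data uses
# 999999.0 as a missing-value sentinel in the temperature columns.
df = pd.DataFrame({
    "TEM_Avg": [19.0, 999999.0, 25.1],
    "TEM_Max": [21.0, 30.1, 999999.0],
})

# Replace the sentinel with NaN so describe()/mean() ignore it.
clean = df.replace(999999.0, np.nan)

print(clean["TEM_Avg"].max())          # 25.1
print(int(clean.isna().sum().sum()))   # 2
```

After this replacement, aggregates reflect the actual temperature range (min 15.1 to roughly mid-30s) instead of the sentinel.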
weather_forecast_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_forecast_2018.txt', sep='\t')
time: 12 ms
weather_forecast_2018.head()

  Station_Name  VACODE  Year  Mon  Day  TEM_Max_24h  TEM_Min_24h  WEP_24h  TEM_Max_48h  TEM_Min_48h  ...  TEM_Max_120h  TEM_Min_120h  WEP_120h  TEM_Max_144h  TEM_Min_144h  WEP_144h  TEM_Max_168h  TEM_Min_168h,WEP_168h  Unnamed: 24  Unnamed: 25
0         白雲  520113  2018    6    1         20.2         14.8     (3)2         23.2         15.8  ...          (2)1          27.5      13.5          (1)1          26.0      14.0          (2)1                   24.0         16.0         (1)1
1         岑鞏  522626  2018    6    1         25.5         17.5     (2)2         28.5         20.2  ...          (2)0          31.0      17.0          (0)0          31.0      18.5          (0)1                   31.0         21.5         (1)1
2         從江  522633  2018    6    1         27.3         19.0     (7)2         29.5         22.0  ...         (21)0          33.5      19.6          (0)0          33.5      20.2          (0)1                   31.5         23.0         (1)1
3         丹寨  522636  2018    6    1         23.0         15.5     (2)2         26.0         19.2  ...          (2)0          28.0      16.2          (0)0          28.0      17.2          (0)1                   27.0         19.5         (1)1
4         貴陽  520103  2018    6    1         20.9         14.9     (3)2         24.0         16.4  ...          (2)1          28.0      14.0          (1)1          26.0      14.0          (2)1                   24.0         16.0         (1)1

5 rows × 26 columns
time: 54.2 ms
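The forecast weather fields come as strings such as '(3)2' or '(21)0'. Assuming the '(x)y' pattern packs two weather-phenomenon codes together (an interpretation to verify against 天氣現象編碼表.xlsx; the function name is mine), a small parser sketch:

```python
import re

def split_wep(code):
    """Split a forecast weather string like '(3)2' into its two
    numeric codes. The '(x)y' reading is an assumption based on the
    values seen in WEP_24h etc.; returns (None, None) on mismatch."""
    m = re.fullmatch(r"\((\d+)\)(\d+)", code.strip())
    if m is None:
        return None, None
    return int(m.group(1)), int(m.group(2))

print(split_wep("(3)2"))    # (3, 2)
print(split_wep("(21)0"))   # (21, 0)
```

Applied column-wise (e.g. with `Series.map`), this turns each WEP_*h column into two numeric features that can be joined against the weather-code table.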
weather_forecast_2018.describe()

              VACODE    Year          Mon          Day  TEM_Max_24h  TEM_Min_24h  TEM_Max_48h  TEM_Min_48h  TEM_Max_72h  TEM_Min_72h,WEP_72h  TEM_Min_96h      WEP_96h  TEM_Min_120h     WEP_120h  TEM_Min_144h     WEP_144h  TEM_Min_168h,WEP_168h  Unnamed: 24
count    1463.000000  1463.0  1463.000000  1463.000000  1463.000000  1463.000000  1463.000000  1463.000000  1463.000000          1463.000000  1463.000000  1463.000000   1463.000000  1463.000000   1463.000000  1463.000000            1463.000000  1463.000000
mean   521793.738209  2018.0     6.508544    15.759398    29.724607    21.244703    29.724470    21.385236    29.694463            21.655434    29.924949    21.886945     29.891183    22.010936     30.027341    22.055229              30.192960    21.985373
std      1180.467638     0.0     0.500098     8.810643     3.470128     2.536103     3.232737     2.385237     3.167789             2.270505     3.130886     2.131020      3.191721     2.066640      3.199460     2.092155               3.167676     2.227871
min    520103.000000  2018.0     6.000000     1.000000    17.800000    10.800000    18.000000    12.000000    16.500000            12.500000    16.500000    14.000000     14.500000    13.000000     17.000000    13.200000              16.000000    15.000000
25%    520123.000000  2018.0     6.000000     8.000000    27.500000    20.000000    27.500000    20.000000    27.500000            20.200000    28.000000    20.500000     27.500000    21.000000     28.000000    21.000000              28.000000    20.850000
50%    522625.000000  2018.0     7.000000    16.000000    30.000000    22.000000    29.900000    22.000000    29.500000            22.000000    30.000000    22.000000     30.000000    22.200000     30.000000    22.100000              30.000000    22.200000
75%    522630.500000  2018.0     7.000000    23.000000    32.350000    23.000000    32.000000    23.000000    32.300000            23.300000    32.500000    23.500000     32.500000    23.500000     32.500000    23.700000              32.600000    24.000000
max    522636.000000  2018.0     7.000000    31.000000    37.500000    27.000000    37.000000    25.900000    36.500000            26.000000    36.500000    26.000000     36.500000    26.200000     37.000000    26.000000              37.000000    30.000000
time: 74 ms
weather_forecast_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1463 entries, 0 to 1462
Data columns (total 26 columns):
Station_Name 1463 non-null object
VACODE 1463 non-null int64
Year 1463 non-null int64
Mon 1463 non-null int64
Day 1463 non-null int64
TEM_Max_24h 1463 non-null float64
TEM_Min_24h 1463 non-null float64
WEP_24h 1463 non-null object
TEM_Max_48h 1463 non-null float64
TEM_Min_48h 1463 non-null float64
WEP_48h 1463 non-null object
TEM_Max_72h 1463 non-null float64
TEM_Min_72h,WEP_72h 1463 non-null float64
TEM_Max_96h 1463 non-null object
TEM_Min_96h 1463 non-null float64
WEP_96h 1463 non-null float64
TEM_Max_120h 1463 non-null object
TEM_Min_120h 1463 non-null float64
WEP_120h 1463 non-null float64
TEM_Max_144h 1463 non-null object
TEM_Min_144h 1463 non-null float64
WEP_144h 1463 non-null float64
TEM_Max_168h 1463 non-null object
TEM_Min_168h,WEP_168h 1463 non-null float64
Unnamed: 24 1463 non-null float64
Unnamed: 25 1463 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.2+ KB
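The info() output above shows the header of weather_forecast_2018.txt is broken: two names contain a comma instead of a tab ("TEM_Min_72h,WEP_72h" and "TEM_Min_168h,WEP_168h"), so the data is shifted relative to the headers and two trailing Unnamed columns appear. One possible fix is to skip the broken header row and supply names explicitly; the schema below (5 id/date fields plus TEM_Max/TEM_Min/WEP for each of 7 forecast horizons) is an assumption reconstructed from the garbled header:

```python
import pandas as pd

# Assumed true schema: 5 id/date columns, then TEM_Max/TEM_Min/WEP
# for each forecast horizon from 24h to 168h (26 columns total).
names = ["Station_Name", "VACODE", "Year", "Mon", "Day"] + [
    f"{field}_{h}h"
    for h in (24, 48, 72, 96, 120, 144, 168)
    for field in ("TEM_Max", "TEM_Min", "WEP")
]
assert len(names) == 26

# Skip the broken header row and use our names instead
# (same path as in the notebook; adjust locally):
# weather_forecast_2018 = pd.read_csv(
#     '/home/kesci/input/gzlt/test_set/weather_data_2018/weather_forecast_2018.txt',
#     sep='\t', skiprows=1, names=names)

print(names[:8])  # id/date fields plus the 24h forecast triple
```

With aligned names, every TEM_* column should come out numeric and every WEP_* column as the '(x)y' strings, with no Unnamed leftovers.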
time: 11 ms
!jupyter nbconvert --to markdown "“聯創黔線”杯大數據應用創新大賽.ipynb"