"联创黔线" Cup Big Data Application Innovation Competition

Competition page: https://www.kesci.com/home/competition/5be92233954d6e001063649a

Another half-hearted entry; final ranking 39/205. A bit embarrassing to admit, because this competition uses AUC to judge models: even without building any model, predicting half the users as going and the other half as not going already scores around 0.5.

Problem Introduction

Problem Description
Contestants must predict the probability that permanent residents of Guiyang travel to Qiandongnan Prefecture (within-province tourism) in August 2018, based on historical data for a subset of 2017 resident users (training set) and data from June and July 2018 (test set).

The tasks of this competition are:

Training: use the provided training set, i.e. users' June-July 2017 historical data together with whether they traveled to Qiandongnan in August 2017, to build a predictive model
Output: use the provided test set, i.e. users' June-July 2018 historical data, to predict with the trained model the probability that each user travels to Qiandongnan in August 2018. Submit on Kesci for evaluation to obtain an AUC score
Data Description
The training set (training_set) is about 2.3 GB and contains three folders, 201708n, 201708q and weather_data_2017, which record the June-July 2017 user history and the 2017 weather history respectively.

The 201708n and 201708q folders each contain 7 txt files; none of the users in 201708n visited the target area of Qiandongnan in August 2017, while every user in 201708q did
Besides the fields listed below, the training tables end with a "label" field: "0" marks a negative sample, i.e. the user did not visit the Qiandongnan target area in August 2017; "1" marks a positive sample, i.e. the user did
User identity attribute table (201708n1.txt, 201708q1.txt)
User handset information table (201708n2.txt, 201708q2.txt)
User roaming behavior table (201708n3.txt, 201708q3.txt)
User out-of-province roaming table (201708n4.txt, 201708q4.txt)
User location table (201708n6.txt, 201708q6.txt)
User app usage table (201708n7.txt, 201708q7.txt)
The weather_data_2017 folder contains two txt files: "weather_reported_2017" records the observed weather for June-July 2017 and "weather_forecast_2017" records the forecast weather for the same period; there is also a "天氣現象編碼表.xlsx" (weather-phenomenon code table) file.
2017 observed weather table (weather_reported_2017.txt)
2017 forecast weather table (weather_forecast_2017.txt)
The test set (testing_set) is about 1 GB and contains two folders, 201808 and weather_data_2018

The 201808 folder contains 7 txt files named 2018_1.txt, 2018_2.txt, …, 2018_7.txt, with fields matching the training set
The weather_data_2018 folder contains two txt files: "weather_reported_2018" records the observed weather for June-July 2018 and "weather_forecast_2018" records the forecast weather, with fields matching the training set.
Notes:

The 7 tables in each folder can be joined on the virtual ID, but not every virtual ID can be joined; contestants decide themselves how to handle and use them
Virtual-ID formats differ across tables; contestants must normalize them and make sure submitted virtual IDs are strings
Given the number of tables, the varying information dimensions and the many possible approaches, the data may contain anomalies and missing values; contestants must handle such issues themselves
Contestants are welcome to try different approaches, including frontier methods such as transfer learning
The competition data has been anonymized and differs somewhat from the real information, but this does not affect solving the problem
Evaluation
1. Preliminary-round scoring

The competition uses AUC to judge model quality. AUC is the area under the ROC (Receiver Operating Characteristic) curve, which plots False Positive Rate on the horizontal axis against True Positive Rate on the vertical axis.

2. Evaluation notes

The leaderboard uses a Private/Public split: the Private board is scored on a fixed fraction of the data in the submitted result file, and the Public board on the remaining data.

Each team gets 5 submissions and evaluations per day; the Public leaderboard updates in real time, sorted from high to low, and a team's newer submission on the same day replaces its previous one.
Because of differences in generalization, the submission with the highest Public score is not necessarily the highest on the Private board, so each team picks two result files from its valid submissions, balancing generalization against score, to be evaluated on the Private board
The Private leaderboard is revealed after the competition ends; the final valid scores and rankings are based on the Private board.

Code

# show per-cell execution time
%load_ext klab-autotime
import pandas as pd
import numpy as np
time: 311 ms
# Reduce memory usage by downcasting each numeric column
# to the smallest dtype that holds its value range
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
time: 3.85 ms
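One caveat with this aggressive downcasting: float16 holds values only up to about 65504, so later aggregations over a float16 column can overflow to inf, which is exactly what shows up as inf in the describe() output of the consume column further down. A minimal, self-contained illustration (toy numbers, not the competition data):

```python
import numpy as np

# Each value fits comfortably in float16, but the running total does not:
arr16 = np.full(2000, 1000.0, dtype=np.float16)   # true sum is 2,000,000
arr64 = arr16.astype(np.float64)

print(arr16.sum())  # inf  (partial sums exceed the float16 max of ~65504)
print(arr64.sum())  # 2000000.0
```

Keeping monetary columns in float32 (or summing with an explicit dtype) avoids this.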

1 Feature Engineering

Positive samples

q1
Sum each user's consumption over the two months

q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
Mem. usage decreased to  0.16 Mb (53.1% reduction)
time: 39.2 ms
q1.describe()
year_month id consume label
count 11200.000000 1.120000e+04 1.086500e+04 11200.0
mean 201706.500000 5.416583e+15 inf 1.0
std 0.500022 2.642827e+15 inf 0.0
min 201706.000000 1.448104e+12 4.998779e-02 1.0
25% 201706.000000 3.117220e+15 4.068750e+01 1.0
50% 201706.500000 5.456254e+15 9.837500e+01 1.0
75% 201707.000000 7.702940e+15 1.785000e+02 1.0
max 201707.000000 9.997949e+15 1.324000e+03 1.0
time: 37.3 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month    11200 non-null int32
id            11200 non-null int64
consume       10865 non-null float16
label         11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.91 ms
q1.consume.min()
0.05



time: 2.64 ms
q1 = q1.fillna(98.0)
time: 2.75 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 4 columns):
year_month    11200 non-null int32
id            11200 non-null int64
consume       11200 non-null float16
label         11200 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 164.1 KB
time: 6.71 ms
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 709 ms
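The pattern above (two monthly rows per id, collapsed by a grouped sum) can be sketched on a toy frame with made-up ids and amounts:

```python
import pandas as pd

q1_toy = pd.DataFrame({
    'id':      [1, 1, 2, 2],              # one row per user per month
    'consume': [40.0, 60.0, 10.0, 30.0],
})
# Same aggregation as above; the string alias 'sum' is equivalent to pd.Series.sum
total = q1_toy.groupby('id').agg({'consume': 'sum'})
print(total.loc[1, 'consume'])  # 100.0
```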

q2
Feature 1: the top-9 handset brands used plus "other", 10 categories in total
Feature 2: the number of distinct brands used

q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 2.46 s
q2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289203 entries, 0 to 289202
Data columns (total 6 columns):
id                 289203 non-null int64
brand              197376 non-null object
type               197380 non-null object
first_use_time     289203 non-null int64
recent_use_time    289203 non-null int64
label              289203 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 11.3+ MB
time: 62.6 ms
q2.type = q2.type.fillna('其它')
time: 18.4 ms
brand_series = pd.Series({'蘋果' : 'iphone', '華爲' : "huawei", '歐珀' : 'oppo', '維沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '樂視' : 'le', '四季恆美' : 'siji'})

q2.brand = q2.brand.map(brand_series)
time: 42.4 ms
q2.brand = q2.brand.fillna('其它')
time: 17.4 ms
q2.head()
id brand type first_use_time recent_use_time label
0 1752398069509000 其它 其它 20161209134530 20161209190636 1
1 1752398069509000 huawei PLK-AL10 20170609223138 20170609224345 1
2 1752398069509000 le LETV X501 20160924102711 20160924112425 1
3 1752398069509000 jinli 金立 GN800 20150331210255 20150630131232 1
4 1752398069509000 jinli GIONEE M5 20170508191216 20170605192347 1
time: 18.7 ms
q2['brand_type'] = q2['brand'] + q2['type']
time: 109 ms
q2.head()
id brand type first_use_time recent_use_time label brand_type
0 1752398069509000 其它 其它 20161209134530 20161209190636 1 其它其它
1 1752398069509000 huawei PLK-AL10 20170609223138 20170609224345 1 huaweiPLK-AL10
2 1752398069509000 le LETV X501 20160924102711 20160924112425 1 leLETV X501
3 1752398069509000 jinli 金立 GN800 20150331210255 20150630131232 1 jinli金立 GN800
4 1752398069509000 jinli GIONEE M5 20170508191216 20170605192347 1 jinliGIONEE M5
time: 9.75 ms
groupbybrand_type = q2['brand_type'].value_counts()
time: 51.8 ms
groupbybrand_type.head(10)
其它其它                     91823
iphoneA1586              14898
iphoneA1524              10330
iphoneA1700               9246
iphoneA1699               8277
iphoneIPHONE6S(A1633)     6271
oppoOPPO R9M              4725
iphoneA1530               4640
oppoOPPO R9TM             2978
vivoVIVO X7               2516
Name: brand_type, dtype: int64



time: 3.44 ms
q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
q2_groupbyid.head()
id phone_nums
0 8707678197418467 422
1 9196501153454276 409
2 3900535090108175 389
3 4104535378288025 352
4 1106540188374027 350
time: 90 ms
q2_groupbyid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5600 entries, 0 to 5599
Data columns (total 2 columns):
id            5600 non-null int64
phone_nums    5600 non-null int64
dtypes: int64(2)
memory usage: 87.6 KB
time: 5.91 ms
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
q2_brand_one_hot.head()
id brand_huawei brand_iphone brand_jinli brand_le brand_mei brand_mi brand_oppo brand_san brand_siji brand_vivo brand_其它
0 1752398069509000 0 0 0 0 0 0 0 0 0 0 1
1 1752398069509000 1 0 0 0 0 0 0 0 0 0 0
2 1752398069509000 0 0 0 1 0 0 0 0 0 0 0
3 1752398069509000 0 0 1 0 0 0 0 0 0 0 0
8 1752398069509000 0 0 0 0 0 0 0 1 0 0 0
time: 48.9 ms
q2_one_hot = q2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max, 
                                                   'brand_iphone': pd.Series.max,
                                                   'brand_jinli': pd.Series.max, 
                                                   'brand_le': pd.Series.max,
                                                   'brand_mei': pd.Series.max, 
                                                   'brand_mi': pd.Series.max,
                                                   'brand_oppo': pd.Series.max, 
                                                   'brand_san': pd.Series.max,
                                                   'brand_siji': pd.Series.max, 
                                                   'brand_vivo': pd.Series.max,
                                                   'brand_其它': pd.Series.max
})
q2_one_hot.head()
brand_huawei brand_iphone brand_jinli brand_le brand_mei brand_mi brand_oppo brand_san brand_siji brand_vivo brand_其它
id
1448103998000 1 1 0 1 1 0 1 1 0 0 1
17398718813730 1 1 1 1 1 1 1 1 0 1 1
61132623486000 1 0 0 0 0 0 0 0 0 0 1
68156596675520 0 1 1 1 0 0 0 0 0 0 1
76819334576430 1 1 1 0 1 1 1 1 0 1 1
time: 6.57 s
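The drop_duplicates → get_dummies → groupby-max chain collapses row-level brand records into a per-user "has ever used this brand" flag; a small sketch with hypothetical users:

```python
import pandas as pd

brands = pd.DataFrame({
    'id':    [1, 1, 2],
    'brand': ['huawei', 'iphone', 'huawei'],
})
one_hot = pd.get_dummies(brands, columns=['brand'])
# max() per id: 1 if the user ever appeared with that brand, else 0
flags = one_hot.groupby('id').max()
```

groupby(...).max() here is equivalent to the long agg dict of pd.Series.max used above, without listing every dummy column by hand.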
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 3 columns):
id            5600 non-null int64
consume       5600 non-null float16
phone_nums    5600 non-null int64
dtypes: float16(1), int64(2)
memory usage: 142.2 KB
time: 11.6 ms
pos_set = pos_set.merge(q2_one_hot, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 14 columns):
id              5600 non-null int64
consume         5600 non-null float16
phone_nums      5600 non-null int64
brand_huawei    5600 non-null uint8
brand_iphone    5600 non-null uint8
brand_jinli     5600 non-null uint8
brand_le        5600 non-null uint8
brand_mei       5600 non-null uint8
brand_mi        5600 non-null uint8
brand_oppo      5600 non-null uint8
brand_san       5600 non-null uint8
brand_siji      5600 non-null uint8
brand_vivo      5600 non-null uint8
brand_其它        5600 non-null uint8
dtypes: float16(1), int64(2), uint8(11)
memory usage: 202.3 KB
time: 98.6 ms

q3
1. Sum the contact-circle size over the two months
2. Sum the out-of-province flags over the two months (yes: 1, no: 0)
3. Sum the out-of-country flags over the two months (yes: 1, no: 0)

q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
Mem. usage decreased to  0.18 Mb (64.6% reduction)
time: 85.8 ms
q3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month             11200 non-null int32
id                     11200 non-null int64
call_nums              11200 non-null int16
is_trans_provincial    11200 non-null int8
is_transnational       11200 non-null int8
label                  11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 186.0 KB
time: 7.49 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})

pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 17 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 224.2 KB
time: 1.95 s
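The three separate groupby calls above can also be written as a single aggregation over all three columns; a toy sketch with made-up values:

```python
import pandas as pd

q3_toy = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'call_nums': [10, 20, 5, 5],
    'is_trans_provincial': [1, 0, 0, 0],   # monthly yes/no flags
    'is_transnational': [0, 0, 1, 1],
})
# One pass instead of three; summed monthly flags end up in {0, 1, 2}
agg = q3_toy.groupby('id').agg({'call_nums': 'sum',
                                'is_trans_provincial': 'sum',
                                'is_transnational': 'sum'})
```

This also replaces the three merges into pos_set with a single one.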

q4
1. Number of out-of-province roaming records in the two months
2. One-hot all provinces, or top-10 provinces plus "other"
3. Number of distinct provinces roamed to in the two months

q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
q4.info()
Mem. usage decreased to  0.15 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7289 non-null int32
id            7289 non-null int64
province      7218 non-null object
label         7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 18.4 ms
q4.head()
year_month id province label
0 201707 6062475264825100 廣東 1
1 201707 5627768389537500 北京 1
2 201707 2000900444179600 山西 1
3 201707 5304502776817600 四川 1
4 201707 5304502776817600 四川 1
time: 7.16 ms
q4_groupbyid = q4.groupby(['province']).size()
time: 61.3 ms
q4_groupbyid.sort_values()
province
寧夏      15
吉林      20
內蒙古     22
黑龍江     27
青海      35
天津      39
遼寧      44
西藏      69
山西      70
甘肅      73
新疆      74
安徽      86
海南     100
陝西     114
山東     121
福建     150
河北     168
江蘇     182
湖北     208
上海     215
河南     237
北京     247
江西     364
重慶     428
浙江     483
雲南     530
廣西     536
四川     793
廣東     835
湖南     933
dtype: int64



time: 4.04 ms
q4.province = q4.province.fillna('湖南')
q4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7289 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7289 non-null int32
id            7289 non-null int64
province      7289 non-null object
label         7289 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 149.6+ KB
time: 8.09 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']

pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 18 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       1942 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 268.0 KB
time: 19.6 ms
pos_set = pos_set.fillna(0)
pos_set['label'] = 1
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       5600 non-null float64
label                  5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 12.7 ms
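The how='left' merge keeps users with no roaming rows (their province_out_cnt becomes NaN), and fillna(0) then encodes "never roamed out" as zero; sketched with hypothetical ids:

```python
import pandas as pd

users = pd.DataFrame({'id': [1, 2, 3]})
roam  = pd.DataFrame({'id': [1, 3], 'province_out_cnt': [2, 5]})
# Left merge keeps id 2 with NaN, then fillna(0) marks it as zero trips
merged = users.merge(roam, how='left', on='id').fillna(0)
```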

q6: skipped for now
q7
1. Total data traffic used
2. Number of distinct apps used
3. Whether certain (travel-related) apps were used

1.1 Positive samples

q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
q1.columns = ['year_month', 'id', 'consume', 'label']
q1 = q1.fillna(98.0)
q1 = q1[['id', 'consume']]
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})

q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
q2.type = q2.type.fillna('其它')
brand_series = pd.Series({'蘋果' : 'iphone', '華爲' : "huawei", '歐珀' : 'oppo', '維沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '樂視' : 'le', '四季恆美' : 'siji'})
q2.brand = q2.brand.map(brand_series)
q2.brand = q2.brand.fillna('其它')
q2['brand_type'] = q2['brand'] + q2['type']
q2_brand_type = q2[['id', 'brand_type']]
q2_brand_type = q2_brand_type.drop_duplicates()
q2_groupbyid = q2_brand_type['id'].value_counts()
q2_groupbyid = q2_groupbyid.reset_index()
q2_groupbyid.columns = ['id', 'phone_nums']
q2_brand = q2[['id', 'brand']]
q2_brand = q2_brand.drop_duplicates()
q2_brand_one_hot = pd.get_dummies(q2_brand)
q2_one_hot = q2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max,
                                                   'brand_iphone': pd.Series.max,
                                                   'brand_jinli': pd.Series.max,
                                                   'brand_le': pd.Series.max,
                                                   'brand_mei': pd.Series.max,
                                                   'brand_mi': pd.Series.max,
                                                   'brand_oppo': pd.Series.max,
                                                   'brand_san': pd.Series.max,
                                                   'brand_siji': pd.Series.max,
                                                   'brand_vivo': pd.Series.max,
                                                   'brand_其它': pd.Series.max
})
q2_one_hot.head()
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
pos_set = pos_set.merge(q2_one_hot, on=['id'])

q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])

q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4.columns = ['year_month', 'id', 'province', 'label']
q4.province = q4.province.fillna('湖南')
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']

pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set = pos_set.fillna(0)
pos_set['label'] = 1
pos_set.info()
Mem. usage decreased to  0.16 Mb (53.1% reduction)
Mem. usage decreased to 11.31 Mb (14.6% reduction)
Mem. usage decreased to  0.18 Mb (64.6% reduction)
Mem. usage decreased to  0.15 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5600 entries, 0 to 5599
Data columns (total 19 columns):
id                     5600 non-null int64
consume                5600 non-null float16
phone_nums             5600 non-null int64
brand_huawei           5600 non-null uint8
brand_iphone           5600 non-null uint8
brand_jinli            5600 non-null uint8
brand_le               5600 non-null uint8
brand_mei              5600 non-null uint8
brand_mi               5600 non-null uint8
brand_oppo             5600 non-null uint8
brand_san              5600 non-null uint8
brand_siji             5600 non-null uint8
brand_vivo             5600 non-null uint8
brand_其它               5600 non-null uint8
call_nums              5600 non-null int16
is_trans_provincial    5600 non-null int8
is_transnational       5600 non-null int8
province_out_cnt       5600 non-null float64
label                  5600 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 311.7 KB
time: 10.1 s

1.2 Negative samples

n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
n1.columns = ['year_month', 'id', 'consume', 'label']
n1 = n1.fillna(98.0)
n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})

n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
n2.type = n2.type.fillna('其它')
brand_series = pd.Series({'蘋果' : 'iphone', '華爲' : "huawei", '歐珀' : 'oppo', '維沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '樂視' : 'le', '四季恆美' : 'siji'})
n2.brand = n2.brand.map(brand_series)
n2.brand = n2.brand.fillna('其它')
n2['brand_type'] = n2['brand'] + n2['type']
n2_brand_type = n2[['id', 'brand_type']]
n2_brand_type = n2_brand_type.drop_duplicates()
n2_groupbyid = n2_brand_type['id'].value_counts()
n2_groupbyid = n2_groupbyid.reset_index()
n2_groupbyid.columns = ['id', 'phone_nums']
n2_brand = n2[['id', 'brand']]
n2_brand = n2_brand.drop_duplicates()
n2_brand_one_hot = pd.get_dummies(n2_brand)
n2_one_hot = n2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max,
                                                   'brand_iphone': pd.Series.max,
                                                   'brand_jinli': pd.Series.max,
                                                   'brand_le': pd.Series.max,
                                                   'brand_mei': pd.Series.max,
                                                   'brand_mi': pd.Series.max,
                                                   'brand_oppo': pd.Series.max,
                                                   'brand_san': pd.Series.max,
                                                   'brand_siji': pd.Series.max,
                                                   'brand_vivo': pd.Series.max,
                                                   'brand_其它': pd.Series.max
})

neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])
neg_set = neg_set.merge(n2_one_hot, on=['id'])

n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])

n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
n4.columns = ['year_month', 'id', 'province', 'label']
n4.province = n4.province.fillna('湖南')
n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()
n4_groupbyid = n4_groupbyid.reset_index()
n4_groupbyid.columns = ['id', 'province_out_cnt']
neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])
neg_set = neg_set.fillna(0)

neg_set['label'] = 0
neg_set.info()
Mem. usage decreased to  2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
Mem. usage decreased to  3.03 Mb (64.6% reduction)
Mem. usage decreased to  0.73 Mb (34.4% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 93375 entries, 0 to 93374
Data columns (total 19 columns):
id                     93375 non-null int64
consume                93375 non-null float16
phone_nums             93375 non-null int64
brand_huawei           93375 non-null uint8
brand_iphone           93375 non-null uint8
brand_jinli            93375 non-null uint8
brand_le               93375 non-null uint8
brand_mei              93375 non-null uint8
brand_mi               93375 non-null uint8
brand_oppo             93375 non-null uint8
brand_san              93375 non-null uint8
brand_siji             93375 non-null uint8
brand_vivo             93375 non-null uint8
brand_其它               93375 non-null uint8
call_nums              93375 non-null int16
is_trans_provincial    93375 non-null int8
is_transnational       93375 non-null int8
province_out_cnt       93375 non-null float64
label                  93375 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.1 MB
time: 2min 48s
train_set = pos_set.append(neg_set)
train_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98975 entries, 0 to 93374
Data columns (total 19 columns):
id                     98975 non-null int64
consume                98975 non-null float16
phone_nums             98975 non-null int64
brand_huawei           98975 non-null uint8
brand_iphone           98975 non-null uint8
brand_jinli            98975 non-null uint8
brand_le               98975 non-null uint8
brand_mei              98975 non-null uint8
brand_mi               98975 non-null uint8
brand_oppo             98975 non-null uint8
brand_san              98975 non-null uint8
brand_siji             98975 non-null uint8
brand_vivo             98975 non-null uint8
brand_其它               98975 non-null uint8
call_nums              98975 non-null int16
is_trans_provincial    98975 non-null int8
is_transnational       98975 non-null int8
province_out_cnt       98975 non-null float64
label                  98975 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2), uint8(11)
memory usage: 5.4 MB
time: 62.5 ms
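Note that DataFrame.append, used above to stack the positive and negative sets, was deprecated in pandas 1.4 and removed in 2.0; pd.concat does the same job (toy frames, not the real pos_set/neg_set):

```python
import pandas as pd

pos = pd.DataFrame({'id': [1], 'label': [1]})
neg = pd.DataFrame({'id': [2], 'label': [0]})
# Equivalent of pos.append(neg), with a clean 0..n-1 index
train = pd.concat([pos, neg], ignore_index=True)
```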

2 Modeling

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y = train_set['label'].values

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference = lgb_train)
params = {
        'boosting_type':'gbdt',  # type of booster
        'objective':'binary',
        'metric':{'auc'},
        'num_leaves':100,
        'reg_alpha':0,
        'reg_lambda':0.01,
        'max_depth':6,
        'n_estimators':100,
        'subsample':0.9,
        'colsample_bytree':0.85,
        'subsample_freq':1,
        'min_child_samples':25,
        'learning_rate':0.1,
        'random_state':2019
        #'feature_fraction':0.9,  # sample 90% of the features before training each tree
        #'bagging_fraction':0.8,  # row-wise analogue of feature_fraction; speeds up training, curbs overfitting
        #'bagging_freq':5,
        #'verbose':0
}
gbm = lgb.train(params,
                lgb_train,
                num_boost_round = 2000,  # max number of boosting iterations (previously 4000)
                valid_sets = lgb_eval,
                verbose_eval=250,
                early_stopping_rounds=50)

y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)
print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))

y_pred = gbm.predict(x_test, num_iteration=gbm.best_iteration)
print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))
Training until validation scores don't improve for 50 rounds.
Early stopping, best iteration is:
[18]	valid_0's auc: 0.786865
AUC: 0.7981
Test AUC: 0.7869
time: 772 ms
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from collections import Counter

X = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y = train_set['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

c = Counter(y_train)
'''
params={'booster':'gbtree',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth':4,
    'lambda':10,
    'subsample':0.75,
    'colsample_bytree':0.75,
    'min_child_weight':2,
    'eta': 0.025,
    'seed':0,
    'nthread':8,
     'silent':1}
'''
clf = XGBClassifier(max_depth=5, eval_metric='auc', min_child_weight=6, scale_pos_weight=c[0] / 16 / c[1],
                    nthread=12, num_boost_round=1000, seed=2019
                    )

print('fit start...')
clf.fit(X_train, y_train)
print('fit finish')

'''
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print('train score:{}\ntest score:{}'.format(train_score, test_score))
'''

y_pred=clf.predict(X)
from sklearn import metrics
print('AUC: %.4f' % metrics.roc_auc_score(y, y_pred))

y_pred=clf.predict(X_test)
print('Test AUC: %.4f' % metrics.roc_auc_score(y_test, y_pred))
fit start...
fit finish
AUC: 0.5134
Test AUC: 0.5082
time: 3.11 s
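The near-0.5 AUC here is mostly a scoring artifact rather than a model problem: clf.predict returns hard 0/1 labels, while roc_auc_score expects ranking scores, i.e. clf.predict_proba(X)[:, 1]. A tiny illustration with hypothetical scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 0, 1]
proba  = [0.2, 0.3, 0.6, 0.9]            # continuous scores keep the ranking
hard   = [int(p >= 0.5) for p in proba]  # thresholding throws the ranking away

print(roc_auc_score(y_true, proba))  # 0.75
print(roc_auc_score(y_true, hard))   # 0.5
```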
import xgboost as xgb

import pandas as pd

from sklearn.model_selection import GridSearchCV

from collections import Counter





X_train = train_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values

y_train = train_set['label'].values

c = Counter(y_train)



# n = c[0] / c[1]  # 13.98

# nn = c[0] / 16 / c[1] # 0.8738

# print(n, nn)



parameters = {

    'max_depth': [5, 10, 15],

    'learning_rate': [0.01, 0.02, 0.05],

    'n_estimators': [500, 1000, 2000],

    'min_child_weight': [0, 2, 5],

    'max_delta_step': [0, 0.2, 0.6],

    'subsample': [0.6, 0.7, 0.8],

    'colsample_bytree': [0.5, 0.6, 0.7],

    'reg_alpha': [0, 0.25, 0.5],

    'reg_lambda': [0.2, 0.4, 0.6],

    'scale_pos_weight': [0.8, 8, 14]



}



xlf = xgb.XGBClassifier(max_depth=10,

                        learning_rate=0.01,

                        n_estimators=2000,

                        silent=True,

                        objective='binary:logistic',

                        nthread=12,

                        gamma=0,

                        min_child_weight=1,

                        max_delta_step=0,

                        subsample=0.85,

                        colsample_bytree=0.7,

                        colsample_bylevel=1,

                        reg_alpha=0,

                        reg_lambda=1,

                        scale_pos_weight=1,

                        seed=2019,

                        missing=None)



# GridSearchCV wraps the estimator, so gsearch.fit below runs the whole search — no separate xlf.fit is needed

gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='roc_auc', cv=3)  # score with AUC to match the competition metric

gsearch.fit(X_train, y_train)



print("Best score: %0.3f" % gsearch.best_score_)

print("Best parameters set:")

best_parameters = gsearch.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):

    print("\t%s: %r" % (param_name, best_parameters[param_name]))
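With cv=3 the exhaustive grid above means roughly 177,000 fits. A randomized search over the same space is a common, much cheaper alternative. A sketch using scikit-learn's `RandomizedSearchCV` — `GradientBoostingClassifier` stands in for xgboost here so the snippet is self-contained; with xgboost installed, `xlf` drops in directly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# stand-in imbalanced data; in the notebook this would be X_train, y_train
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.93, 0.07], random_state=2019)

param_dist = {
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.02, 0.05],
    'n_estimators': [100, 200],
    'subsample': [0.6, 0.7, 0.8],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=2019),
    param_distributions=param_dist,
    n_iter=5,            # sample 5 of the 54 combinations instead of trying all
    scoring='roc_auc',   # optimize the competition metric directly
    cv=3,
    random_state=2019,
)
search.fit(X, y)
print('best CV AUC: %.4f' % search.best_score_)
print('best params:', search.best_params_)
```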


3 Prediction

3.1 Test set

t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))

t1.columns = ['year_month', 'id', 'consume']

t1 = t1.fillna(81.0)

# t1 = t1.dropna(axis=0)

t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})



t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))

t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']

t2 = t2.fillna('其它')

# t2 = t2.dropna(axis=0)

brand_series = pd.Series({'蘋果' : 'iphone', '華爲' : "huawei", '歐珀' : 'oppo', '維沃' : 'vivo', '三星' : 'san', '小米' : 'mi', '金立' : 'jinli', '魅族' : 'mei', '樂視' : 'le', '四季恆美' : 'siji'})

t2.brand = t2.brand.map(brand_series)

t2.brand = t2.brand.fillna('其它')

t2['brand_type'] = t2['brand'] + t2['type']

t2_brand_type = t2[['id', 'brand_type']]

t2_brand_type = t2_brand_type.drop_duplicates()

t2_groupbyid = t2_brand_type['id'].value_counts()

t2_groupbyid = t2_groupbyid.reset_index()

t2_groupbyid.columns = ['id', 'phone_nums']

t2_brand = t2[['id', 'brand']]

t2_brand = t2_brand.drop_duplicates()

t2_brand_one_hot = pd.get_dummies(t2_brand)

t2_one_hot = t2_brand_one_hot.groupby(['id']).agg({'brand_huawei': pd.Series.max, 

                                                   'brand_iphone': pd.Series.max,

                                                   'brand_jinli': pd.Series.max, 

                                                   'brand_le': pd.Series.max,

                                                   'brand_mei': pd.Series.max, 

                                                   'brand_mi': pd.Series.max,

                                                   'brand_oppo': pd.Series.max, 

                                                   'brand_san': pd.Series.max,

                                                   'brand_siji': pd.Series.max, 

                                                   'brand_vivo': pd.Series.max,

                                                   'brand_其它': pd.Series.max

})



test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])

test_set = test_set.merge(t2_one_hot, on=['id'])



t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))

t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']



t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})

t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})

t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})

test_set = test_set.merge(t3_groupbyid_call, on=['id'])

test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])

test_set = test_set.merge(t3_groupbyid_trans, on=['id'])



t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))

t4.columns = ['year_month', 'id', 'province']

t4 = t4.fillna('湖南')

# t4 = t4.dropna(axis=0)

t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()

t4_groupbyid = t4_groupbyid.reset_index()

t4_groupbyid.columns = ['id', 'province_out_cnt']

test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])



test_set = test_set.fillna(0)

test_set.info()
Mem. usage decreased to  1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
Mem. usage decreased to  1.53 Mb (60.0% reduction)
Mem. usage decreased to  0.85 Mb (16.7% reduction)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 18 columns):
id                     48668 non-null int64
consume                48668 non-null float16
phone_nums             48668 non-null int64
brand_huawei           48668 non-null uint8
brand_iphone           48668 non-null uint8
brand_jinli            48668 non-null uint8
brand_le               48668 non-null uint8
brand_mei              48668 non-null uint8
brand_mi               48668 non-null uint8
brand_oppo             48668 non-null uint8
brand_san              48668 non-null uint8
brand_siji             48668 non-null uint8
brand_vivo             48668 non-null uint8
brand_其它               48668 non-null uint8
call_nums              48668 non-null int16
is_trans_provincial    48668 non-null int8
is_transnational       48668 non-null int8
province_out_cnt       48668 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2), uint8(11)
memory usage: 2.3 MB
time: 1min 39s
# lightgbm
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = gbm.predict(X_test, num_iteration=gbm.best_iteration)
submit = test_set[['id']]
submit['pred'] = y_predict
time: 108 ms


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
type(y_predict)
numpy.ndarray



time: 2.3 ms
y_predict[:5]
array([0.10280227, 0.08214867, 0.06905468, 0.07655945, 0.11238844])



time: 2.9 ms
# xgboost
X_test = test_set[['consume', 'phone_nums', 'call_nums', 'is_trans_provincial', 'is_transnational', 'province_out_cnt']].values
y_predict = clf.predict_proba(X_test)[:, 1]
submit_xgb = test_set[['id']]
submit_xgb['pred'] = y_predict
time: 208 ms


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
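The SettingWithCopyWarning above is raised because `test_set[['id']]` may return a view of the original frame, making the column assignment ambiguous. Taking an explicit copy silences it — a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'x': [0.1, 0.2, 0.3]})
preds = np.array([0.5, 0.6, 0.7])

# ambiguous: the slice may be a view of df
#   submit = df[['id']]; submit['pred'] = preds   -> SettingWithCopyWarning

# explicit copy: the assignment is unambiguous and no warning is raised
submit = df[['id']].copy()
submit['pred'] = preds
print(submit)
```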

4 Submitting results

tt1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
tt1.columns = ['year_month', 'id', 'consume']
time: 41.6 ms
xgb_t1_id = tt1[['id']].drop_duplicates()
time: 13 ms
xgb_t1_id.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id    50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.46 ms
t1_id = tt1[['id']].drop_duplicates()
time: 12.5 ms
t1_id.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 1 columns):
id    50200 non-null int64
dtypes: int64(1)
memory usage: 784.4 KB
time: 5.67 ms
submit_xgb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id      48668 non-null int64
pred    48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 950.5 KB
time: 7.8 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48668 entries, 0 to 48667
Data columns (total 2 columns):
id      48668 non-null int64
pred    48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.33 ms
tt_xgb = t1_id.merge(submit_xgb, on=['id'], how='left')
time: 17.6 ms
tt_xgb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    48668 non-null float32
dtypes: float32(1), int64(1)
memory usage: 980.5 KB
time: 8.14 ms
tt = t1_id.merge(submit, on=['id'], how='left')
time: 19.3 ms
tt.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    48668 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.06 ms

xgboost

# fill 0: 0.469; dropna: 0.50005; add features: 0.46; add features + dropna: 0.4549
# fill 1: 0.436; add features + dropna: 0.419
# fill mean (0.088458): 0.43048757
submit_xgb = tt_xgb.fillna(0.0)
time: 1.92 ms

lightgbm

# fill 0, add features: 0.4491 / 0.4539; add features + dropna: 0.4512
submit_gbm = tt.fillna(0.0)
time: 1.96 ms
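The comments above record how differently the unmatched IDs score when filled with 0, 1, or the mean: each constant parks those users at a different point in the overall ranking, and AUC reacts accordingly. One option (tried above as "fill mean") is to fill with the mean of the predictions we do have, which places unknown users mid-ranking rather than at either extreme. The mechanics, with toy numbers:

```python
import pandas as pd

tt = pd.DataFrame({'id': ['a', 'b', 'c', 'd'],
                   'pred': [0.1, 0.7, None, None]})  # c, d had no features to predict from

fill = tt['pred'].mean()             # mean of the available predictions (≈0.4 here)
tt['pred'] = tt['pred'].fillna(fill)
print(tt)
```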

1. Model blending by summing the two predictions: score 0.4558
2. All predictions set to 1.0 (or 0.0): score 0.5
3. Binarize at a cutoff (above → 1.0, below → 0.0); roughly 2,800 users should go: xgb cutoff 0.26 scores 0.50153, gbm cutoff 0.17 scores 0.50554
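Point 3 hints at why thresholding rarely helps AUC: AUC measures ranking, and binarizing collapses most of the ranking into two tied groups, where ties earn only half credit. A small deterministic check with hand-picked numbers:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
hard   = [1.0 if s >= 0.5 else 0.0 for s in scores]

auc_scores = roc_auc_score(y_true, scores)  # 8 of 9 pos/neg pairs ranked correctly -> 8/9
auc_hard = roc_auc_score(y_true, hard)      # ties between the 0-groups get half credit -> 7.5/9
print('%.4f %.4f' % (auc_scores, auc_hard))
```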

submit_xgb.describe()
id pred
count 5.020000e+04 50200.000000
mean 5.449990e+15 0.092590
std 2.628886e+15 0.088487
min 5.959412e+11 0.000000
25% 3.177008e+15 0.034837
50% 5.441108e+15 0.063993
75% 7.726328e+15 0.125547
max 9.999920e+15 0.754152
time: 22.4 ms
submit_xgb[submit_xgb['pred']>=0.26].describe()
id pred
count 2.818000e+03 2818.000000
mean 5.523494e+15 0.350387
std 2.632627e+15 0.083545
min 7.736480e+13 0.260060
25% 3.193231e+15 0.287803
50% 5.528103e+15 0.324941
75% 7.801996e+15 0.386373
max 9.999505e+15 0.754152
time: 16.7 ms
xgb_yes = submit_xgb[submit_xgb['pred']>=0.26] 
xgb_yes['pred'] = 1.0
xgb_yes.describe()
id pred
count 2.818000e+03 2818.0
mean 5.523494e+15 1.0
std 2.632627e+15 0.0
min 7.736480e+13 1.0
25% 3.193231e+15 1.0
50% 5.528103e+15 1.0
75% 7.801996e+15 1.0
max 9.999505e+15 1.0
time: 347 ms
xgb_no = submit_xgb[submit_xgb['pred']<0.26] 
xgb_no['pred'] = 0.0
xgb_no.describe()
id pred
count 4.738200e+04 47382.0
mean 5.445619e+15 0.0
std 2.628626e+15 0.0
min 5.959412e+11 0.0
25% 3.175890e+15 0.0
50% 5.435288e+15 0.0
75% 7.722863e+15 0.0
max 9.999920e+15 0.0
time: 380 ms
submit = xgb_yes.append(xgb_no)
time: 2.29 ms
submit.describe()
id pred
count 5.020000e+04 50200.000000
mean 5.449990e+15 0.056135
std 2.628886e+15 0.230185
min 5.959412e+11 0.000000
25% 3.177008e+15 0.000000
50% 5.441108e+15 0.000000
75% 7.726328e+15 0.000000
max 9.999920e+15 1.000000
time: 19.6 ms
submit_xgb[submit_xgb['pred']>=0.2].describe()
id pred
count 5.547000e+03 5547.000000
mean 5.508672e+15 0.289829
std 2.641133e+15 0.086438
min 5.399382e+12 0.200014
25% 3.195841e+15 0.225862
50% 5.489831e+15 0.261552
75% 7.813588e+15 0.326278
max 9.999505e+15 0.754152
time: 18.5 ms
5600/98975*50200
2840.3132104066685



time: 2.17 ms
submit_gbm[submit_gbm['pred']>=0.23].describe()
id pred
count 2.539000e+03 2539.000000
mean 5.482621e+15 0.298836
std 2.625965e+15 0.062903
min 7.736480e+13 0.230013
25% 3.200866e+15 0.253366
50% 5.471503e+15 0.279145
75% 7.742764e+15 0.326900
max 9.999505e+15 0.632138
time: 19 ms
submit_gbm[submit_gbm['pred']>=0.22].describe()
id pred
count 2.859000e+03 2859.000000
mean 5.493943e+15 0.290563
std 2.630246e+15 0.063701
min 7.736480e+13 0.220121
25% 3.195841e+15 0.244933
50% 5.501943e+15 0.270700
75% 7.743865e+15 0.321506
max 9.999505e+15 0.632138
time: 19.6 ms
gbm_yes = submit_gbm[submit_gbm['pred']>=0.23] 
gbm_yes['pred'] = 1.0
gbm_yes.describe()
id pred
count 2.539000e+03 2539.0
mean 5.482621e+15 1.0
std 2.625965e+15 0.0
min 7.736480e+13 1.0
25% 3.200866e+15 1.0
50% 5.471503e+15 1.0
75% 7.742764e+15 1.0
max 9.999505e+15 1.0
time: 82.2 ms
gbm_no = submit_gbm[submit_gbm['pred']<0.23] 
gbm_no['pred'] = 0.0
gbm_no.describe()
id pred
count 4.766100e+04 47661.0
mean 5.448252e+15 0.0
std 2.629058e+15 0.0
min 5.959412e+11 0.0
25% 3.175232e+15 0.0
50% 5.439911e+15 0.0
75% 7.725629e+15 0.0
max 9.999920e+15 0.0
time: 58.7 ms
submit = gbm_yes.append(gbm_no)
time: 4.19 ms
submit.describe()
id pred
count 5.020000e+04 50200.000000
mean 5.449990e+15 0.018745
std 2.628886e+15 0.135625
min 5.959412e+11 0.000000
25% 3.177008e+15 0.000000
50% 5.441108e+15 0.000000
75% 7.726328e+15 0.000000
max 9.999920e+15 1.000000
time: 20.4 ms
submit_gbm.describe()
id pred
count 5.020000e+04 50200.000000
mean 5.449990e+15 0.085097
std 2.628886e+15 0.071304
min 5.959412e+11 0.000000
25% 3.177008e+15 0.036845
50% 5.441108e+15 0.062206
75% 7.726328e+15 0.113462
max 9.999920e+15 0.632138
time: 20.8 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 91 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 9.36 ms
submit = submit_xgb.append(submit_gbm)
submit = submit.groupby(by='id').sum().reset_index()
submit.describe()
id pred
count 5.020000e+04 50200.000000
mean 5.449990e+15 0.169012
std 2.628886e+15 0.139313
min 5.959412e+11 0.000000
25% 3.177008e+15 0.076237
50% 5.441108e+15 0.125893
75% 7.726328e+15 0.222622
max 9.999920e+15 1.124561
time: 41.7 ms
submit.head()
id pred
4 9297165066591558 1.0
14 8168181097053542 1.0
18 6473515505643555 1.0
25 4641233171005560 1.0
29 6759757036024682 1.0
time: 6.16 ms
submit_xgb[submit_xgb['id']==595941207920]
id pred
8048 595941207920 0.185561
time: 7.07 ms
submit_gbm[submit_gbm['id']==595941207920]
id pred
8048 595941207920 0.114782
time: 6.33 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8 ms

All predictions set to 1

t1_id['pred'] = 1.0

submit = t1_id.copy()
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 0 to 99852
Data columns (total 2 columns):
id      50200 non-null int64
pred    50200 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 MB
time: 8.79 ms
submit.head()
id pred
0 6401824160010748 1.0
1 6506134548135499 1.0
2 5996920884619954 1.0
3 1187209424543713 1.0
4 9297165066591558 1.0
time: 13.1 ms

submit.columns = ['ID', 'Pred']
submit['ID'] = submit['ID'].astype(str)
time: 36.7 ms
submit.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50200 entries, 14 to 50199
Data columns (total 2 columns):
ID      50200 non-null object
Pred    50200 non-null float64
dtypes: float64(1), object(1)
memory usage: 1.1+ MB
time: 10.1 ms
submit.to_csv('../submit.csv')  # note: without index=False the row index is written as an extra column
time: 126 ms
!wget -O kesci_submit https://www.heywhale.com/kesci_submit&&chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
--2019-07-31 08:15:56--  https://www.heywhale.com/kesci_submit
Resolving www.heywhale.com (www.heywhale.com)... 106.15.25.147
Connecting to www.heywhale.com (www.heywhale.com)|106.15.25.147|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6528405 (6.2M) [application/octet-stream]
Saving to: ‘kesci_submit’

kesci_submit        100%[===================>]   6.23M  12.1MB/s    in 0.5s    

2019-07-31 08:15:57 (12.1 MB/s) - ‘kesci_submit’ saved [6528405/6528405]

time: 1.83 s
!https_proxy="http://klab-external-proxy" ./kesci_submit -file ../submit.csv -token 578549794d544bff
Kesci Submit Tool 3.0

> 已驗證Token
> 提交文件 ../submit.csv (1312.26 KiB)
> 文件已上傳        
> 提交完成
time: 1.7 s

!./kesci_submit -token 578549794d544bff -file ../submit.csv
Kesci Submit Tool
Result File: ../submit.csv (1.28 MiB)
Uploading: 7%====================
Submit Failed.
Serevr Response:
 400 - {"message":"當前提交工具版本過舊,請參考比賽提交頁面信息下載新的提交工具"}

time: 1 s
!ls ../
input  pred.csv  work
time: 665 ms
!wget -nv -O kesci_submit https://www.heywhale.com/kesci_submit&&chmod +x kesci_submit
wget: /opt/conda/lib/libcrypto.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
wget: /opt/conda/lib/libssl.so.1.0.0: no version information available (required by wget)
2019-07-02 08:08:23 URL:https://www.heywhale.com/kesci_submit [7842088/7842088] -> "kesci_submit" [1]
time: 1.47 s

0 Data exploration

0.1 Training data

0.1.1 Positive samples
q1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q1.txt', sep='\t', header=None))
Mem. usage decreased to  0.16 Mb (53.1% reduction)
time: 23 ms
q1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.21 ms
q1 = q1.dropna(axis=0)
time: 6.72 ms
q1.head()
year_month id consume label
2 201706 8160829951314300 82.75000 1
3 201707 8160829951314300 37.68750 1
4 201706 1508075698521400 68.00000 1
5 201707 1508075698521400 49.59375 1
6 201706 1686251204809800 200.75000 1
time: 6.82 ms
q1.describe()  # consume was downcast to float16 by reduce_mem_usage, so its mean/std overflow to inf below
year_month id consume label
count 10865.000000 1.086500e+04 1.086500e+04 10865.0
mean 201706.499678 5.417732e+15 inf 1.0
std 0.500023 2.635784e+15 inf 0.0
min 201706.000000 1.448104e+12 4.998779e-02 1.0
25% 201706.000000 3.118365e+15 4.068750e+01 1.0
50% 201706.000000 5.456594e+15 9.837500e+01 1.0
75% 201707.000000 7.687339e+15 1.785000e+02 1.0
max 201707.000000 9.997949e+15 1.324000e+03 1.0
time: 37.1 ms
q1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 2 to 11199
Data columns (total 4 columns):
year_month    10865 non-null int32
id            10865 non-null int64
consume       10865 non-null float16
label         10865 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1)
memory usage: 244.0 KB
time: 6.9 ms
%matplotlib inline

# plotted in index order (rows are date-ordered)

q1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.





<matplotlib.axes._subplots.AxesSubplot at 0x7fd1c0659b70>
time: 11.3 s
q1[q1.consume == 1323.74]
year_month id consume label
4867 201707 5510977603357000 1324.0 1
time: 11.1 ms
q2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q2.txt', sep='\t', header=None))
Mem. usage decreased to 11.31 Mb (14.6% reduction)
time: 291 ms
q2 = q2.dropna(axis=0)
time: 77.7 ms
q2.head()
0 1 2 3 4 5
1 1752398069509000 華爲 PLK-AL10 20170609223138 20170609224345 1
2 1752398069509000 樂視 LETV X501 20160924102711 20160924112425 1
3 1752398069509000 金立 金立 GN800 20150331210255 20150630131232 1
4 1752398069509000 金立 GIONEE M5 20170508191216 20170605192347 1
5 1752398069509000 華爲 PLK-AL10 20160618182839 20170731235959 1
time: 8.16 ms
q2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.15 ms
q2.head()
id brand type first_use_time recent_use_time label
1 1752398069509000 華爲 PLK-AL10 20170609223138 20170609224345 1
2 1752398069509000 樂視 LETV X501 20160924102711 20160924112425 1
3 1752398069509000 金立 金立 GN800 20150331210255 20150630131232 1
4 1752398069509000 金立 GIONEE M5 20170508191216 20170605192347 1
5 1752398069509000 華爲 PLK-AL10 20160618182839 20170731235959 1
time: 8.58 ms
q2.describe()
id first_use_time recent_use_time label
count 1.973760e+05 1.973760e+05 1.973760e+05 197376.0
mean 5.436228e+15 2.015597e+13 2.015684e+13 1.0
std 2.642924e+15 2.685010e+11 2.685124e+11 0.0
min 1.448104e+12 -1.000000e+00 -1.000000e+00 1.0
25% 3.227267e+15 2.015122e+13 2.016013e+13 1.0
50% 5.353833e+15 2.016052e+13 2.016060e+13 1.0
75% 7.764521e+15 2.016102e+13 2.016112e+13 1.0
max 9.997949e+15 2.017073e+13 2.017073e+13 1.0
time: 64.7 ms
q2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 197376 entries, 1 to 289201
Data columns (total 6 columns):
id                 197376 non-null int64
brand              197376 non-null object
type               197376 non-null object
first_use_time     197376 non-null int64
recent_use_time    197376 non-null int64
label              197376 non-null int8
dtypes: int64(3), int8(1), object(2)
memory usage: 9.2+ MB
time: 41.7 ms
q3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q3.txt', sep='\t', header=None))
Mem. usage decreased to  0.18 Mb (64.6% reduction)
time: 18.4 ms
q3 = q3.dropna(axis=0)
time: 6.41 ms
q3.head()
0 1 2 3 4 5
0 201707 6062475264825100 88 1 0 1
1 201707 8160829951314300 27 0 0 1
2 201707 1508075698521400 19 0 0 1
3 201707 1686251204809800 207 0 0 1
4 201707 5627768389537500 133 1 0 1
time: 7.62 ms
q3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.16 ms
q3.head()
year_month id call_nums is_trans_provincial is_transnational label
0 201707 6062475264825100 88 1 0 1
1 201707 8160829951314300 27 0 0 1
2 201707 1508075698521400 19 0 0 1
3 201707 1686251204809800 207 0 0 1
4 201707 5627768389537500 133 1 0 1
time: 7.37 ms
q3.describe()
year_month id call_nums is_trans_provincial is_transnational label
count 11200.000000 1.120000e+04 11200.000000 11200.000000 11200.000000 11200.0
mean 201706.500000 5.416583e+15 70.562232 0.235446 0.014464 1.0
std 0.500022 2.642827e+15 61.820144 0.424296 0.119400 0.0
min 201706.000000 1.448104e+12 -1.000000 0.000000 0.000000 1.0
25% 201706.000000 3.117220e+15 25.000000 0.000000 0.000000 1.0
50% 201706.500000 5.456254e+15 54.000000 0.000000 0.000000 1.0
75% 201707.000000 7.702940e+15 99.250000 0.000000 0.000000 1.0
max 201707.000000 9.997949e+15 727.000000 1.000000 1.000000 1.0
time: 79.6 ms
q3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11200 entries, 0 to 11199
Data columns (total 6 columns):
year_month             11200 non-null int32
id                     11200 non-null int64
call_nums              11200 non-null int16
is_trans_provincial    11200 non-null int8
is_transnational       11200 non-null int8
label                  11200 non-null int8
dtypes: int16(1), int32(1), int64(1), int8(3)
memory usage: 273.4 KB
time: 7.47 ms
q4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q4.txt', sep='\t', header=None))
q4 = q4.dropna(axis=0)
q4.columns = ['year_month', 'id', 'province', 'label']
time: 935 µs
q4.head()
year_month id province label
0 201707 6062475264825100 廣東 1
1 201707 5627768389537500 北京 1
2 201707 2000900444179600 山西 1
3 201707 5304502776817600 四川 1
4 201707 5304502776817600 四川 1
time: 6.84 ms
q4.describe()
year_month id label
count 7218.000000 7.218000e+03 7218.0
mean 201706.538515 5.341915e+15 1.0
std 0.498549 2.631231e+15 0.0
min 201706.000000 1.739872e+13 1.0
25% 201706.000000 3.037311e+15 1.0
50% 201707.000000 5.367106e+15 1.0
75% 201707.000000 7.545199e+15 1.0
max 201707.000000 9.987407e+15 1.0
time: 22.2 ms
q4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7218 entries, 0 to 7288
Data columns (total 4 columns):
year_month    7218 non-null int32
id            7218 non-null int64
province      7218 non-null object
label         7218 non-null int8
dtypes: int32(1), int64(1), int8(1), object(1)
memory usage: 204.4+ KB
time: 6.74 ms
!ls /home/kesci/input/gzlt/train_set/201708q/
201708q1.txt  201708q3.txt  201708q6.txt
201708q2.txt  201708q4.txt  201708q7.txt
time: 667 ms
q6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q6.txt', sep='\t', header=None))
Mem. usage decreased to 62.58 Mb (52.1% reduction)
time: 3.9 s
q6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 868 µs
q6.head()
date hour id user_longitude user_latitude label
0 2017-07-18 8.0 9239265006758100 106.467545 26.58625 1
1 2017-07-10 0.0 3859201812337600 106.708213 26.57854 1
2 2017-07-16 18.0 3859201812337600 106.545690 26.56724 1
3 2017-07-17 8.0 3859201812337600 106.545690 26.56724 1
4 2017-07-27 16.0 3859201812337600 106.545690 26.56724 1
time: 16.7 ms
q6.describe()
hour id user_longitude user_latitude label
count 2.852871e+06 2.852871e+06 2.851527e+06 2.851527e+06 2852871.0
mean 1.141897e+01 5.415213e+15 1.068143e+02 2.659968e+01 1.0
std 6.632995e+00 2.634349e+15 5.580043e-01 2.852525e-01 0.0
min 0.000000e+00 1.448104e+12 1.036700e+02 2.470664e+01 1.0
25% 6.000000e+00 3.135488e+15 1.066656e+02 2.654610e+01 1.0
50% 1.200000e+01 5.442594e+15 1.067027e+02 2.658143e+01 1.0
75% 1.800000e+01 7.687963e+15 1.067373e+02 2.662629e+01 1.0
max 2.200000e+01 9.997949e+15 1.095277e+02 2.909348e+01 1.0
time: 775 ms
q6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2852871 entries, 0 to 2852870
Data columns (total 6 columns):
date              object
hour              float64
id                int64
user_longitude    float64
user_latitude     float64
label             int64
dtypes: float64(3), int64(2), object(1)
memory usage: 130.6+ MB
time: 3.24 ms
q7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708q/201708q7.txt', sep='\t', header=None))
Mem. usage decreased to  3.80 Mb (42.5% reduction)
time: 137 ms
q7 = q7.dropna(axis=0)
time: 35.4 ms
q7.columns = ['year_month', 'id', 'app', 'flow', 'label']
time: 1.54 ms
q7.head()
year_month id app flow label
0 201707 6610350034824100 騰訊手機管家 0.010002 1
1 201707 6997210664840100 喜馬拉雅FM 27.390625 1
2 201707 3198621664927300 網易新聞 0.029999 1
3 201707 9987406611703100 喜馬拉雅FM 0.000000 1
4 201707 1785540174324200 天氣通 0.020004 1
time: 8.14 ms
q7.describe()
year_month id flow label
count 173117.000000 1.731170e+05 173117.000000 173117.0
mean 201706.539699 5.403100e+15 NaN 1.0
std 0.498423 2.667026e+15 NaN 0.0
min 201706.000000 1.448104e+12 0.000000 1.0
25% 201706.000000 3.056260e+15 0.010002 1.0
50% 201707.000000 5.429056e+15 0.080017 1.0
75% 201707.000000 7.730223e+15 1.599609 1.0
max 201707.000000 9.997949e+15 7828.000000 1.0
time: 70.4 ms
q7.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 173117 entries, 0 to 173116
Data columns (total 5 columns):
year_month    173117 non-null int32
id            173117 non-null int64
app           173117 non-null object
flow          173117 non-null float16
label         173117 non-null int8
dtypes: float16(1), int32(1), int64(1), int8(1), object(1)
memory usage: 5.1+ MB
time: 29.8 ms

q1
Sum the two months' consumption per user

q1.head()
year_month id consume label
2 201706 8160829951314300 82.75000 1
3 201707 8160829951314300 37.68750 1
4 201706 1508075698521400 68.00000 1
5 201707 1508075698521400 49.59375 1
6 201706 1686251204809800 200.75000 1
time: 7.05 ms
q1 = q1[['id', 'consume']]
time: 2.91 ms
q1_groupbyid = q1.groupby(['id']).agg({'consume': pd.Series.sum})
time: 747 ms
len(q1)
10865



time: 8.1 ms
q1[q1['id']==1448103998000]
id consume
3532 1448103998000 18.09375
3533 1448103998000 44.28125
time: 8.84 ms
q1_groupbyid[:10]
consume
id
1448103998000 62.37500
17398718813730 460.75000
61132623486000 12.28125
68156596675520 903.50000
76819334576430 282.25000
78745100940550 531.00000
110229638660000 253.00000
122134826301000 138.75000
132923269304000 26.81250
138204830829320 387.50000
time: 5.8 ms

q2
特徵1 使用過的top9+其它手機品牌 共10個
特徵2 使用的不同品牌數量
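Feature 1 can be generalized: instead of a hard-coded brand map, take the top-k brands by frequency and bucket everything else into "other". A sketch with toy data (top-2 here, top-9 in the notebook):

```python
import pandas as pd

q2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                   'brand': ['蘋果', '華爲', '蘋果', '某小廠', '華爲']})

top_k = q2['brand'].value_counts().nlargest(2).index   # the k most frequent brands
q2['brand'] = q2['brand'].where(q2['brand'].isin(top_k), '其它')  # everything else -> "other"
print(q2['brand'].tolist())
```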

q2 = q2[['id', 'brand']]
time: 4.86 ms
q2.head(10)
id brand
1 1752398069509000 華爲
2 1752398069509000 樂視
3 1752398069509000 金立
4 1752398069509000 金立
5 1752398069509000 華爲
6 1752398069509000 華爲
7 1752398069509000 金立
8 1752398069509000 三星
9 4799656026499908 三星
10 4799656026499908 華爲
time: 6.36 ms
groupbybrand = q2['brand'].value_counts()
time: 18.7 ms
len(groupbybrand)
750



time: 2.09 ms
%matplotlib inline

groupbybrand.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1c00ea7b8>
time: 454 ms
groupbybrand[:10]
蘋果      62347
華爲      22266
歐珀      20516
維沃      17158
三星      13435
小米      10632
金立       9922
魅族       9708
樂視       5609
四季恆美     2163
Name: brand, dtype: int64



time: 3.52 ms
q2 = q2.drop_duplicates()
groupbyid = q2['id'].value_counts()
time: 19.6 ms
len(groupbyid)
5597



time: 2.23 ms
%matplotlib inline

groupbyid.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1bb56e048>
time: 294 ms
groupbyid[:10]
4104535378288025    115
8707678197418467    108
3900535090108175    104
3986280749497468     93
9196501153454276     88
5510977603357000     84
8569492566715454     78
1106540188374027     71
4091371962011072     71
4874962666674313     71
Name: id, dtype: int64



time: 3.27 ms
q1[q1['id']==4104535378288025]
year_month id consume label
10576 201706 4104535378288025 208.000 1
10577 201707 4104535378288025 205.125 1
time: 7.63 ms
# q2[q2['id']==4104535378288025]
time: 364 µs
type(groupbyid)
pandas.core.series.Series



time: 2.14 ms
type(groupbyid.to_frame())
pandas.core.frame.DataFrame



time: 3.13 ms
q2_groupbyid = groupbyid.reset_index()
time: 2.34 ms
q2_groupbyid.columns = ['id', 'phone_nums']
time: 1.19 ms
q2_groupbyid.head()
id phone_nums
0 4104535378288025 115
1 8707678197418467 108
2 3900535090108175 104
3 3986280749497468 93
4 9196501153454276 88
time: 6.12 ms
type(q1_groupbyid)
pandas.core.frame.DataFrame



time: 2.15 ms
pos_set = q1_groupbyid.merge(q2_groupbyid, on=['id'])
time: 6.42 ms
pos_set.head()
id consume phone_nums
0 1448103998000 62.37500 6
1 17398718813730 460.75000 23
2 61132623486000 12.28125 1
3 68156596675520 903.50000 4
4 76819334576430 282.25000 21
time: 7.11 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 3 columns):
id            5473 non-null int64
consume       5473 non-null float16
phone_nums    5473 non-null int64
dtypes: float16(1), int64(2)
memory usage: 139.0 KB
time: 6.27 ms

q3
1.將兩月聯絡圈規模求和
2.將兩月出省求和 是:1 否:0
3.將兩月出國求和 是:1 否:0
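The three separate groupbys used for q3 can also be folded into a single pass over the data. A minimal sketch with toy rows:

```python
import pandas as pd

q3 = pd.DataFrame({'id': [1, 1, 2, 2],
                   'call_nums': [10, 20, 5, 7],
                   'is_trans_provincial': [1, 0, 0, 0],
                   'is_transnational': [0, 0, 1, 0]})

# one groupby, three aggregations
agg = q3.groupby('id').agg({'call_nums': 'sum',
                            'is_trans_provincial': 'sum',
                            'is_transnational': 'sum'}).reset_index()
print(agg)
```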

q3.head()
year_month id call_nums is_trans_provincial is_transnational label
0 201707 6062475264825100 88 1 0 1
1 201707 8160829951314300 27 0 0 1
2 201707 1508075698521400 19 0 0 1
3 201707 1686251204809800 207 0 0 1
4 201707 5627768389537500 133 1 0 1
time: 7.69 ms
q3_groupbyid_call = q3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
q3_groupbyid_provincial = q3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
q3_groupbyid_trans = q3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
time: 1.95 s
pos_set = pos_set.merge(q3_groupbyid_call, on=['id'])
time: 5.14 ms
pos_set.head()
id consume phone_nums call_nums
0 1448103998000 62.37500 6 21
1 17398718813730 460.75000 23 217
2 61132623486000 12.28125 1 61
3 68156596675520 903.50000 4 353
4 76819334576430 282.25000 21 431
time: 7.94 ms
pos_set = pos_set.merge(q3_groupbyid_provincial, on=['id'])
pos_set = pos_set.merge(q3_groupbyid_trans, on=['id'])
time: 9.61 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 6 columns):
id                     5473 non-null int64
consume                5473 non-null float16
phone_nums             5473 non-null int64
call_nums              5473 non-null int16
is_trans_provincial    5473 non-null int8
is_transnational       5473 non-null int8
dtypes: float16(1), int16(1), int64(2), int8(2)
memory usage: 160.3 KB
time: 7.3 ms

q4
1. Number of roam-out-of-province records over the two months
2. One-hot encode all provinces, or top-10 provinces plus an "other" bucket
3. Number of distinct provinces roamed to over the two months

q4.head(10)
year_month id province label
0 201707 6062475264825100 廣東 1
1 201707 5627768389537500 北京 1
2 201707 2000900444179600 山西 1
3 201707 5304502776817600 四川 1
4 201707 5304502776817600 四川 1
5 201707 5304502776817600 四川 1
6 201707 5304502776817600 重慶 1
7 201707 8594396491246200 廣西 1
8 201707 8594396491246200 廣西 1
9 201707 8594396491246200 廣西 1
time: 8.78 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).agg({'province': pd.Series.unique})
q4_groupbyid.head()
province
id
17398718813730 重慶
61132623486000 [福建, 河南, 江蘇, 安徽]
68156596675520 [遼寧, 廣東]
132923269304000 江西
138204830829320 浙江
time: 322 ms
q4_groupbyid = q4[['id', 'province']].groupby(['id']).size()
q4_groupbyid.head()
id
17398718813730     1
61132623486000     8
68156596675520     3
132923269304000    1
138204830829320    2
dtype: int64



time: 6.52 ms
q4[q4['id']==61132623486000]
year_month id province label
461 201707 61132623486000 福建 1
462 201707 61132623486000 福建 1
463 201707 61132623486000 福建 1
4363 201706 61132623486000 河南 1
4364 201706 61132623486000 江蘇 1
4365 201706 61132623486000 安徽 1
4366 201706 61132623486000 安徽 1
4367 201706 61132623486000 江蘇 1
time: 8.26 ms
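Note that `.size()` counts roam-out records (feature 1 of the q4 plan), not distinct provinces (feature 3): id 61132623486000 above has 8 records but only 4 distinct provinces. A sketch on synthetic q4-style data showing both, using `nunique` for the distinct count:

```python
import pandas as pd

# Synthetic stand-in for q4: one row per roam-out record, duplicates included
q4_demo = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'province': ['福建', '福建', '河南', '廣東', '廣東'],
})

# .size() counts records; .nunique() counts distinct provinces
roam_cnt = q4_demo.groupby('id').size().rename('province_out_cnt')
distinct_cnt = q4_demo.groupby('id')['province'].nunique().rename('province_distinct_cnt')

features = pd.concat([roam_cnt, distinct_cnt], axis=1).reset_index()
print(features)
```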
type(q4_groupbyid.reset_index())
pandas.core.frame.DataFrame



time: 4.03 ms
q4_groupbyid = q4_groupbyid.reset_index()
q4_groupbyid.columns = ['id', 'province_out_cnt']
time: 2.73 ms
q4_groupbyid.head()
id province_out_cnt
0 17398718813730 1
1 61132623486000 8
2 68156596675520 3
3 132923269304000 1
4 138204830829320 2
time: 5.73 ms
pos_set = pos_set.merge(q4_groupbyid, how='left', on=['id'])
pos_set.head()
id consume phone_nums call_nums is_trans_provincial is_transnational province_out_cnt
0 1448103998000 62.37500 6 21 0 0 NaN
1 17398718813730 460.75000 23 217 1 0 1.0
2 61132623486000 12.28125 1 61 2 0 8.0
3 68156596675520 903.50000 4 353 2 0 3.0
4 76819334576430 282.25000 21 431 0 0 NaN
time: 14.6 ms
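The planned "top-10 provinces + other" one-hot feature isn't built in these cells; a minimal sketch on synthetic q4-style records (counts chosen distinct so the top-N pick is unambiguous) could look like:

```python
import pandas as pd

# Synthetic q4-style roam-out records
q4_demo = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3],
    'province': ['四川', '四川', '廣東', '四川', '廣東', '西藏'],
})

TOP_N = 2  # would be 10 on the real data
top = q4_demo['province'].value_counts().nlargest(TOP_N).index
# Provinces outside the top N collapse into an "other" bucket
q4_demo['bucket'] = q4_demo['province'].where(q4_demo['province'].isin(top), '其它')

# One row per id, one 0/1 column per bucket
onehot = (pd.get_dummies(q4_demo.set_index('id')['bucket'])
            .groupby(level=0).max().astype(int).reset_index())
print(onehot)
```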
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id                     5473 non-null int64
consume                5473 non-null float16
phone_nums             5473 non-null int64
call_nums              5473 non-null int16
is_trans_provincial    5473 non-null int8
is_transnational       5473 non-null int8
province_out_cnt       1913 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 7.53 ms
pos_set = pos_set.fillna(0)
time: 2.46 ms
pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5473 entries, 0 to 5472
Data columns (total 7 columns):
id                     5473 non-null int64
consume                5473 non-null float16
phone_nums             5473 non-null int64
call_nums              5473 non-null int16
is_trans_provincial    5473 non-null int8
is_transnational       5473 non-null int8
province_out_cnt       5473 non-null float64
dtypes: float16(1), float64(1), int16(1), int64(2), int8(2)
memory usage: 203.1 KB
time: 8.02 ms
# With an inner merge instead of how='left', only the 1913 ids present in q4 would survive:
# pos_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1913 entries, 0 to 1912
Data columns (total 7 columns):
id                     1913 non-null int64
consume                1913 non-null float16
phone_nums             1913 non-null int64
call_nums              1913 non-null int16
is_trans_provincial    1913 non-null int8
is_transnational       1913 non-null int8
province_out_cnt       1913 non-null int64
dtypes: float16(1), int16(1), int64(3), int8(2)
memory usage: 71.0 KB
time: 6.67 ms

q6: skipped for now
q7
1. Total traffic used
2. Number of distinct apps used
3. Whether specific (travel-related) apps were used
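The three q7 features planned above aren't all implemented in the cells that follow (only a groupby on app is shown); a sketch on synthetic q7-style records, with a hypothetical travel-app list (the real one would be hand-picked from the 762 apps):

```python
import pandas as pd

# Synthetic q7-style usage records (id, app, flow)
q7_demo = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2],
    'app':  ['微信', '攜程旅行', '微信', 'QQ', '優酷視頻'],
    'flow': [10.0, 2.5, 3.0, 1.0, 4.0],
})

# Hypothetical travel-related app list
TRAVEL_APPS = {'攜程旅行', '去哪兒旅行', '高德地圖', '百度地圖'}

feats = q7_demo.groupby('id').agg(
    total_flow=('flow', 'sum'),    # 1. total traffic used
    app_cnt=('app', 'nunique'),    # 2. number of distinct apps
)
# 3. used at least one travel-related app
feats['uses_travel_app'] = (q7_demo['app'].isin(TRAVEL_APPS)
                            .groupby(q7_demo['id']).max().astype(int))
print(feats.reset_index())
```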

q7.head()
year_month id app flow label
0 201707 6610350034824100 騰訊手機管家 0.010002 1
1 201707 6997210664840100 喜馬拉雅FM 27.390625 1
2 201707 3198621664927300 網易新聞 0.029999 1
3 201707 9987406611703100 喜馬拉雅FM 0.000000 1
4 201707 1785540174324200 天氣通 0.020004 1
time: 7.94 ms
q7_groupbyapp = q7.groupby(['app']).agg({'flow': pd.Series.sum})
time: 135 ms
len(q7_groupbyapp)
762



time: 2.04 ms
q7_groupbyapp.sort_values(by='flow', ascending=False)
flow
app
網易雲音樂 inf
愛奇藝視頻 inf
微信 inf
新浪微博 inf
QQ音樂 inf
今日頭條 inf
QQ 57856.0
手機百度 53408.0
陌陌 43488.0
iTunes 35392.0
騰訊新聞 25952.0
快手 24256.0
手機淘寶 18400.0
UC瀏覽器 16608.0
酷狗音樂 15360.0
高德地圖 14984.0
酷我音樂 13488.0
新浪新聞 13432.0
唯品會 11504.0
騰訊視頻 10760.0
優酷視頻 10736.0
汽車之家 9984.0
百度地圖 9816.0
美團 9400.0
網易新聞 8648.0
AppStore 7776.0
中國聯通手機營業廳 6736.0
百度貼吧 6104.0
鳳凰新聞 5504.0
蝦米音樂 5020.0
... ...
百才招聘網 0.0
碰碰 0.0
禾文阿思看圖購 0.0
科學作息時間表 0.0
章魚輸入法 0.0
米折 0.0
約會吧 0.0
網易微博 0.0
表情大全 0.0
歡樂互娛 0.0
博客大巴 0.0
查快遞 0.0
郵儲銀行 0.0
號簿助手 0.0
司機邦 0.0
壁紙多多 0.0
天天聊 0.0
天翼閱讀 0.0
安全管家 0.0
安卓遊戲盒子 0.0
安軟市場 0.0
車網互聯 0.0
宜搜搜索 0.0
工程師爸爸 0.0
彩票控 0.0
貝瓦兒歌 0.0
搜狗壁紙 0.0
智遠一戶通 0.0
誠品快拍 0.0
07073手遊中心 0.0

762 rows × 1 columns

time: 12.4 ms
pos_set.describe()
id consume phone_nums call_nums is_trans_provincial is_transnational province_out_cnt
count 5.473000e+03 5473.000000 5473.000000 5473.000000 5473.000000 5473.000000 5473.000000
mean 5.417038e+15 inf 8.228942 141.201900 0.474511 0.029600 1.300018
std 2.637784e+15 inf 8.551830 121.262826 0.706162 0.187904 3.110401
min 1.448104e+12 0.099976 1.000000 -2.000000 0.000000 0.000000 0.000000
25% 3.113785e+15 82.000000 3.000000 52.000000 0.000000 0.000000 0.000000
50% 5.457364e+15 198.250000 6.000000 108.000000 0.000000 0.000000 0.000000
75% 7.688781e+15 355.250000 10.000000 198.000000 1.000000 0.000000 1.000000
max 9.997949e+15 2392.000000 115.000000 1035.000000 2.000000 2.000000 42.000000
time: 126 ms
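The `inf` values in `describe()` for consume, like the `inf` flow sums in the q7 ranking above, are a float16 overflow artifact from the memory-reduction step: float16 tops out around 65504, so aggregating many downcast values overflows. Casting up before aggregating avoids it:

```python
import numpy as np

# float16 cannot represent values above ~65504, so sums overflow to inf
a = np.array([60000.0, 60000.0], dtype=np.float16)
print(a.sum())                     # overflows in float16
print(a.astype(np.float32).sum())  # cast up before aggregating
print(np.finfo(np.float16).max)    # the float16 ceiling
```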
pos_set['label'] = 1
pos_set.head()
id consume phone_nums call_nums is_trans_provincial is_transnational province_out_cnt label
0 1448103998000 62.37500 6 21 0 0 NaN 1
1 17398718813730 460.75000 23 217 1 0 1.0 1
2 61132623486000 12.28125 1 61 2 0 8.0 1
3 68156596675520 903.50000 4 353 2 0 3.0 1
4 76819334576430 282.25000 21 431 0 0 NaN 1
time: 10.5 ms
pos_set = pos_set.fillna(0)  # fillna returns a copy by default; assign it back
pos_set.head()
id consume phone_nums call_nums is_trans_provincial is_transnational province_out_cnt label
0 1448103998000 62.37500 6 21 0 0 0.0 1
1 17398718813730 460.75000 23 217 1 0 1.0 1
2 61132623486000 12.28125 1 61 2 0 8.0 1
3 68156596675520 903.50000 4 353 2 0 3.0 1
4 76819334576430 282.25000 21 431 0 0 0.0 1
time: 23.5 ms
0.1.2 Negative samples
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
n1.columns = ['year_month', 'id', 'consume', 'label']
n1 = n1.dropna(axis=0)
n1_groupbyid = n1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})

n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
n2 = n2.dropna(axis=0)
n2 = n2[['id', 'brand']]
n2 = n2.drop_duplicates()
n2_groupbyid = n2['id'].value_counts()
n2_groupbyid = n2_groupbyid.reset_index()
n2_groupbyid.columns = ['id', 'phone_nums']

neg_set = n1_groupbyid.merge(n2_groupbyid, on=['id'])
neg_set.head()
Mem. usage decreased to  2.67 Mb (53.1% reduction)
Mem. usage decreased to 51.13 Mb (14.6% reduction)
id consume phone_nums
0 1009387204000 225.000000 4
1 1167316303000 1.199219 4
2 1883071709000 213.500000 8
3 3393143830010 517.500000 6
4 4568973162000 18.078125 3
time: 10.8 s
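`reduce_mem_usage` is defined earlier in the notebook and not shown in this section; a minimal sketch of the usual Kaggle-style helper (an assumption about its shape, not the notebook's exact code, which evidently also goes down to float16 and prints the before/after memory usage):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage_sketch(df):
    """Downcast numeric columns to the smallest dtype that fits their values."""
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.integer):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif np.issubdtype(df[col].dtype, np.floating):
            # downcast='float' stops at float32, which avoids the float16
            # overflow that shows up as inf in describe()
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df

demo = reduce_mem_usage_sketch(pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]}))
print(demo.dtypes)
```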
neg_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 76515 entries, 0 to 76514
Data columns (total 3 columns):
id            76515 non-null int64
consume       76515 non-null float16
phone_nums    76515 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.9 MB
time: 11.1 ms
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']

n3_groupbyid_call = n3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
n3_groupbyid_provincial = n3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
n3_groupbyid_trans = n3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
neg_set = neg_set.merge(n3_groupbyid_call, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_provincial, on=['id'])
neg_set = neg_set.merge(n3_groupbyid_trans, on=['id'])

n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
n4.columns = ['year_month', 'id', 'province', 'label']

n4_groupbyid = n4[['id', 'province']].groupby(['id']).size()
n4_groupbyid = n4_groupbyid.reset_index()
n4_groupbyid.columns = ['id', 'province_out_cnt']
neg_set = neg_set.merge(n4_groupbyid, how='left', on=['id'])
neg_set = neg_set.fillna(0)
neg_set.head()
Mem. usage decreased to  3.03 Mb (64.6% reduction)
Mem. usage decreased to  0.73 Mb (34.4% reduction)
id consume phone_nums call_nums is_trans_provincial is_transnational province_out_cnt
0 1009387204000 225.000000 4 19 0 0 0.0
1 1167316303000 1.199219 4 6 0 0 0.0
2 1883071709000 213.500000 8 40 0 0 0.0
3 3393143830010 517.500000 6 205 1 0 2.0
4 4568973162000 18.078125 3 17 0 0 0.0
time: 32.5 s
neg_set['label'] = 0
time: 1.83 ms
neg_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 76515 entries, 0 to 76514
Data columns (total 8 columns):
id                     76515 non-null int64
consume                76515 non-null float16
phone_nums             76515 non-null int64
call_nums              76515 non-null int16
is_trans_provincial    76515 non-null int8
is_transnational       76515 non-null int8
province_out_cnt       76515 non-null float64
label                  76515 non-null int64
dtypes: float16(1), float64(1), int16(1), int64(3), int8(2)
memory usage: 3.4 MB
time: 18.9 ms
n1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n1.txt', sep='\t', header=None))
Mem. usage decreased to  2.67 Mb (53.1% reduction)
time: 484 ms
n1.columns = ['year_month', 'id', 'consume', 'label']
time: 1.28 ms
n1.head()
year_month id consume label
0 201707 8570518832906100 9.00 0
1 201707 2182640938718700 10.00 0
2 201707 783614344429000 8.38 0
3 201707 2007036960106400 100.00 0
4 201707 9482847959399300 226.05 0
time: 7.22 ms
n1.describe()
year_month id consume label
count 186800.000000 1.868000e+05 150750.000000 186800.0
mean 201706.500000 5.464219e+15 63.580028 0.0
std 0.500001 2.633848e+15 84.063600 0.0
min 201706.000000 1.009387e+12 -70.660000 0.0
25% 201706.000000 3.192389e+15 12.930000 0.0
50% 201706.500000 5.486486e+15 34.000000 0.0
75% 201707.000000 7.744140e+15 82.500000 0.0
max 201707.000000 9.999717e+15 3979.940000 0.0
time: 52.5 ms
n1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 4 columns):
year_month    186800 non-null int64
id            186800 non-null int64
consume       150750 non-null float64
label         186800 non-null int64
dtypes: float64(1), int64(3)
memory usage: 5.7 MB
time: 21.7 ms
n2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n2.txt', sep='\t', header=None))
Mem. usage decreased to 51.13 Mb (14.6% reduction)
time: 7.76 s
n2.head()
0 1 2 3 4 5
0 5227696575283900 蘋果 A1699 20150331210636 20150701063017 0
1 6279759720262000 NaN NaN 20160725112240 20170731235959 0
2 6279759720262000 NaN NaN 20161205220417 20161205220417 0
3 6279759720262000 三星 SM-A9000 20161128231001 20161128231001 0
4 6279759720262000 NaN NaN 20161220102623 20170306173713 0
time: 8.15 ms
n2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time', 'label']
time: 1.2 ms
n2.head()
id brand type first_use_time recent_use_time label
0 5227696575283900 蘋果 A1699 20150331210636 20150701063017 0
1 6279759720262000 NaN NaN 20160725112240 20170731235959 0
2 6279759720262000 NaN NaN 20161205220417 20161205220417 0
3 6279759720262000 三星 SM-A9000 20161128231001 20161128231001 0
4 6279759720262000 NaN NaN 20161220102623 20170306173713 0
time: 8.3 ms
n2.describe()
id first_use_time recent_use_time label
count 1.307608e+06 1.307608e+06 1.307608e+06 1307608.0
mean 5.460966e+15 1.999810e+13 1.999992e+13 0.0
std 2.619222e+15 1.801007e+12 1.801171e+12 0.0
min 1.009387e+12 -1.000000e+00 -1.000000e+00 0.0
25% 3.196695e+15 2.015112e+13 2.016022e+13 0.0
50% 5.477102e+15 2.016071e+13 2.016101e+13 0.0
75% 7.728047e+15 2.016123e+13 2.017023e+13 0.0
max 9.999717e+15 2.017073e+13 2.017073e+13 0.0
time: 252 ms
n2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1307608 entries, 0 to 1307607
Data columns (total 6 columns):
id                 1307608 non-null int64
brand              894190 non-null object
type               894205 non-null object
first_use_time     1307608 non-null int64
recent_use_time    1307608 non-null int64
label              1307608 non-null int64
dtypes: int64(4), object(2)
memory usage: 59.9+ MB
time: 251 ms
n3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n3.txt', sep='\t', header=None))
Mem. usage decreased to  3.03 Mb (64.6% reduction)
time: 584 ms
n3.head()
0 1 2 3 4 5
0 201707 4295277677437000 36 1 0 0
1 201707 9121335969062000 37 0 0 0
2 201707 9438277095447300 -1 0 0 0
3 201707 6749854876532500 20 0 0 0
4 201707 1545361809381400 26 0 0 0
time: 7.82 ms
n3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational', 'label']
time: 1.13 ms
n3.head()
year_month id call_nums is_trans_provincial is_transnational label
0 201707 4295277677437000 36 1 0 0
1 201707 9121335969062000 37 0 0 0
2 201707 9438277095447300 -1 0 0 0
3 201707 6749854876532500 20 0 0 0
4 201707 1545361809381400 26 0 0 0
time: 7.49 ms
n3.describe()
year_month id call_nums is_trans_provincial is_transnational label
count 186800.000000 1.868000e+05 186800.000000 186800.000000 186800.000000 186800.0
mean 201706.500000 5.464219e+15 32.674797 0.093292 0.005054 0.0
std 0.500001 2.633848e+15 46.054929 0.290842 0.070909 0.0
min 201706.000000 1.009387e+12 -1.000000 0.000000 0.000000 0.0
25% 201706.000000 3.192389e+15 4.000000 0.000000 0.000000 0.0
50% 201706.500000 5.486486e+15 19.000000 0.000000 0.000000 0.0
75% 201707.000000 7.744140e+15 43.000000 0.000000 0.000000 0.0
max 201707.000000 9.999717e+15 1807.000000 1.000000 1.000000 0.0
time: 75.7 ms
n3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186800 entries, 0 to 186799
Data columns (total 6 columns):
year_month             186800 non-null int64
id                     186800 non-null int64
call_nums              186800 non-null int64
is_trans_provincial    186800 non-null int64
is_transnational       186800 non-null int64
label                  186800 non-null int64
dtypes: int64(6)
memory usage: 8.6 MB
time: 26.6 ms
n4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n4.txt', sep='\t', header=None))
Mem. usage decreased to  0.73 Mb (34.4% reduction)
time: 88.8 ms
n4.columns = ['year_month', 'id', 'province', 'label']
time: 1.15 ms
n4.head()
year_month id province label
0 201707 4295277677437000 重慶 0
1 201707 5560109665240300 廣西 0
2 201707 5560109665240300 廣東 0
3 201707 5560109665240300 廣東 0
4 201707 5705601521649600 重慶 0
time: 7.14 ms
n4.describe()
year_month id label
count 36499.000000 3.649900e+04 36499.0
mean 201706.539193 5.471019e+15 0.0
std 0.498468 2.639006e+15 0.0
min 201706.000000 3.393144e+12 0.0
25% 201706.000000 3.203830e+15 0.0
50% 201707.000000 5.468480e+15 0.0
75% 201707.000000 7.753756e+15 0.0
max 201707.000000 9.999305e+15 0.0
time: 24.4 ms
n4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36499 entries, 0 to 36498
Data columns (total 4 columns):
year_month    36499 non-null int64
id            36499 non-null int64
province      36099 non-null object
label         36499 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.1+ MB
time: 9.97 ms
!ls /home/kesci/input/gzlt/train_set/201708n/
201708n1.txt  201708n3.txt  201708n6.txt
201708n2.txt  201708n4.txt  201708n7.txt
time: 669 ms
n6 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n6.txt', sep='\t', header=None))
Mem. usage decreased to 798.26 Mb (52.1% reduction)
time: 2min 59s
n6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude', 'label']
time: 1.51 ms
n6.head()
date hour id user_longitude user_latitude label
0 2017-07-02 10.0 7748777616409800 106.680816 26.563650 0
1 2017-07-10 0.0 7748777616409800 106.719520 26.576370 0
2 2017-07-31 14.0 7748777616409800 106.683060 26.654663 0
3 2017-07-01 0.0 6633710902197900 106.697440 26.613930 0
4 2017-07-08 14.0 6633710902197900 106.715700 26.609710 0
time: 9.14 ms
q6.describe()  # NB: this cell inspects q6 (the positive-sample location table); n6.describe() was presumably intended
hour id user_longitude user_latitude label
count 2.852871e+06 2.852871e+06 2.851527e+06 2.851527e+06 2852871.0
mean 1.141897e+01 5.415213e+15 1.068143e+02 2.659968e+01 1.0
std 6.632995e+00 2.634349e+15 5.580043e-01 2.852525e-01 0.0
min 0.000000e+00 1.448104e+12 1.036700e+02 2.470664e+01 1.0
25% 6.000000e+00 3.135488e+15 1.066656e+02 2.654610e+01 1.0
50% 1.200000e+01 5.442594e+15 1.067027e+02 2.658143e+01 1.0
75% 1.800000e+01 7.687963e+15 1.067373e+02 2.662629e+01 1.0
max 2.200000e+01 9.997949e+15 1.095277e+02 2.909348e+01 1.0
time: 979 ms
n6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36393070 entries, 0 to 36393069
Data columns (total 6 columns):
date              object
hour              float64
id                int64
user_longitude    float64
user_latitude     float64
label             int64
dtypes: float64(3), int64(2), object(1)
memory usage: 1.6+ GB
time: 3.76 ms
n7 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/train_set/201708n/201708n7.txt', sep='\t', header=None))
Mem. usage decreased to 17.98 Mb (31.2% reduction)
time: 3.14 s
n7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.44 ms
n7.head()
year_month id app flow
0 201707 4011022166491000 米聊 0.01
1 201707 8544172893207700 百度地圖 2.07
2 201707 9856572220983403 搜狗輸入法 0.00
3 201707 6441300393946200 愛奇藝視頻 0.00
4 201707 8751918977379700 開心消消樂 0.03
time: 7.51 ms
n7['label'] = 0
time: 2.94 ms
n7.head()
year_month id app flow label
0 201707 4011022166491000 米聊 0.01 0
1 201707 8544172893207700 百度地圖 2.07 0
2 201707 9856572220983403 搜狗輸入法 0.00 0
3 201707 6441300393946200 愛奇藝視頻 0.00 0
4 201707 8751918977379700 開心消消樂 0.03 0
time: 8.46 ms
n7.describe()
year_month id flow label
count 856961.000000 8.569610e+05 856961.000000 856961.0
mean 201706.535881 5.432556e+15 9.942533 0.0
std 0.498711 2.643712e+15 68.096944 0.0
min 201706.000000 1.009387e+12 0.000000 0.0
25% 201706.000000 3.134290e+15 0.000000 0.0
50% 201707.000000 5.440495e+15 0.060000 0.0
75% 201707.000000 7.727765e+15 1.130000 0.0
max 201707.000000 9.999717e+15 10986.150000 0.0
time: 170 ms
n7.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856961 entries, 0 to 856960
Data columns (total 5 columns):
year_month    856961 non-null int64
id            856961 non-null int64
app           856961 non-null object
flow          856961 non-null float64
label         856961 non-null int64
dtypes: float64(1), int64(3), object(1)
memory usage: 32.7+ MB
time: 116 ms
0.1.3 Weather data
!ls /home/kesci/input/gzlt/train_set/weather_data_2017/
weather_forecast_2017.txt  weather_reported_2017.txt  天氣現象編碼.xlsx
time: 669 ms
weather_reported = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_reported_2017.txt', sep='\t')
time: 6.15 ms
weather_reported.head()
Station_Name VACODE Year Month Day TEM_Avg TEM_Max TEM_Min PRE_Time_2020 WEP_Record
0 麻江 522635 2017 6 1 23.00 24.5 20.9 0.6 ( 01 60 ) 60 .
1 三穗 522624 2017 6 1 21.13 25.6 19.4 9.0 ( 01 10 80 ) 80 60 .
2 鎮遠 522625 2017 6 1 22.68 26.5 21.3 8.9 ( 60 ) 60 .
3 雷山 522634 2017 6 1 23.80 26.1 20.4 5.1 ( 10 ) 60 .
4 劍河 522629 2017 6 1 23.53 27.1 22.0 6.8 ( 01 10 80 ) 80 10 .
time: 12.2 ms
# weather_reported.columns = ['Station_Name', 'VACODE', 'Year', 'Month', 'Day', 'TEM_Avg', 'TEM_Max', 'TEM_Min', 'PRE_Time_2020', 'WEP_Record']
time: 1.25 ms
weather_reported.describe()
Station_Name VACODE Year Month Day TEM_Avg TEM_Max TEM_Min PRE_Time_2020 WEP_Record
count 1404 1404 1404 1404 1404 1404 1404 1404 1404 1404
unique 24 25 2 3 32 448 214 109 330 305
top 貴陽 520000 2017 7 4 22.83 30.5 20.5 0.0 ( 01 ) 01 .
freq 61 360 1403 713 46 10 18 35 625 197
time: 49.9 ms
weather_reported.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1404 entries, 0 to 1403
Data columns (total 10 columns):
Station_Name     1404 non-null object
VACODE           1404 non-null object
Year             1404 non-null object
Month            1404 non-null object
Day              1404 non-null object
TEM_Avg          1404 non-null object
TEM_Max          1404 non-null object
TEM_Min          1404 non-null object
PRE_Time_2020    1404 non-null object
WEP_Record       1404 non-null object
dtypes: object(10)
memory usage: 109.8+ KB
time: 6.32 ms
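Every column of weather_reported came in as dtype object, and `describe()` shows Year with two unique values at count 1403/1404, which suggests one malformed row (likely a stray repeated header) forcing the whole table to strings. Coercing to numeric and dropping the bad row fixes this; a sketch on synthetic rows:

```python
import pandas as pd

# Synthetic stand-in: numeric weather fields arrive as strings, with a stray
# repeated header row mixed in
raw = pd.DataFrame({
    'TEM_Avg': ['23.00', '21.13', 'TEM_Avg'],
    'TEM_Max': ['24.5', '25.6', 'TEM_Max'],
})

# errors='coerce' turns unparseable cells into NaN, which dropna removes
clean = raw.apply(pd.to_numeric, errors='coerce').dropna()
print(clean.dtypes)
```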
weather_forecast = pd.read_csv('/home/kesci/input/gzlt/train_set/weather_data_2017/weather_forecast_2017.txt', sep='\t')
time: 10.8 ms
weather_forecast.head()
Station_Name VACODE Year Mon Day TEM_Max_24h TEM_Min_24h WEP_24h TEM_Max_48h TEM_Min_48h ... TEM_Max_120h TEM_Min_120h WEP_120h TEM_Max_144h TEM_Min_144h WEP_144h TEM_Max_168h TEM_Min_168h,WEP_168h Unnamed: 24 Unnamed: 25
0 白雲 520113 2017 6 1 25.0 17.0 (2)1 24.0 19.0 ... (4)2 25.0 15.0 (2)1 27.0 15.0 (1)0 26.0 16.0 (1)0
1 岑鞏 522626 2017 6 1 31.3 19.4 (1)1 31.0 22.0 ... (4)1 32.0 19.4 (1)1 32.0 22.8 (1)1 32.0 21.0 (1)1
2 從江 522633 2017 6 1 33.4 22.0 (1)1 30.0 23.0 ... (4)3 34.0 22.0 (1)1 34.0 23.8 (1)1 34.0 22.0 (1)1
3 丹寨 522636 2017 6 1 27.5 18.0 (1)1 24.5 20.0 ... (4)1 28.5 18.0 (1)1 28.5 21.0 (1)1 28.5 20.0 (1)1
4 貴陽 520103 2017 6 1 26.0 18.0 (2)1 25.0 20.0 ... (4)2 26.0 16.0 (2)1 28.0 16.0 (1)0 27.0 17.0 (1)0

5 rows × 26 columns

time: 86.4 ms
weather_forecast.describe()
VACODE Year Mon Day TEM_Max_24h TEM_Min_24h TEM_Max_48h TEM_Min_48h TEM_Max_72h TEM_Min_72h,WEP_72h TEM_Min_96h WEP_96h TEM_Min_120h WEP_120h TEM_Min_144h WEP_144h TEM_Min_168h,WEP_168h Unnamed: 24
count 1464.000000 1464.0 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000 1464.000000
mean 521792.583333 2017.0 6.508197 15.754098 28.374658 20.721585 28.375820 20.872814 28.283811 21.112432 28.539481 21.408128 28.702254 21.454713 29.142623 21.485656 29.131626 21.589003
std 1180.891163 0.0 0.500104 8.809966 4.300391 2.290850 4.379771 2.232788 4.329132 2.204980 4.154188 5.203525 4.167441 5.238257 4.124026 2.180222 4.033227 2.391945
min 520103.000000 2017.0 6.000000 1.000000 17.300000 13.800000 17.300000 13.600000 17.000000 10.000000 19.000000 14.300000 19.000000 15.000000 18.000000 15.000000 18.000000 2.000000
25% 520122.750000 2017.0 6.000000 8.000000 25.000000 19.000000 25.000000 19.400000 25.000000 19.600000 25.500000 19.700000 26.000000 19.700000 26.000000 20.000000 26.500000 20.000000
50% 522624.500000 2017.0 7.000000 16.000000 28.500000 21.000000 28.500000 21.000000 28.000000 21.000000 28.500000 21.500000 28.500000 21.500000 29.000000 22.000000 29.000000 22.000000
75% 522630.250000 2017.0 7.000000 23.000000 31.800000 22.500000 31.600000 22.500000 31.500000 23.000000 31.500000 23.000000 32.000000 23.000000 32.000000 23.000000 32.000000 23.500000
max 522636.000000 2017.0 7.000000 31.000000 39.000000 25.700000 39.500000 25.500000 38.000000 25.800000 39.000000 200.000000 39.000000 202.000000 38.800000 25.800000 37.500000 26.000000
time: 121 ms
weather_forecast.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1464 entries, 0 to 1463
Data columns (total 26 columns):
Station_Name             1464 non-null object
VACODE                   1464 non-null int64
Year                     1464 non-null int64
Mon                      1464 non-null int64
Day                      1464 non-null int64
TEM_Max_24h              1464 non-null float64
TEM_Min_24h              1464 non-null float64
WEP_24h                  1464 non-null object
TEM_Max_48h              1464 non-null float64
TEM_Min_48h              1464 non-null float64
WEP_48h                  1464 non-null object
TEM_Max_72h              1464 non-null float64
TEM_Min_72h,WEP_72h      1464 non-null float64
TEM_Max_96h              1464 non-null object
TEM_Min_96h              1464 non-null float64
WEP_96h                  1464 non-null float64
TEM_Max_120h             1464 non-null object
TEM_Min_120h             1464 non-null float64
WEP_120h                 1464 non-null float64
TEM_Max_144h             1464 non-null object
TEM_Min_144h             1464 non-null float64
WEP_144h                 1464 non-null float64
TEM_Max_168h             1464 non-null object
TEM_Min_168h,WEP_168h    1464 non-null float64
Unnamed: 24              1464 non-null float64
Unnamed: 25              1464 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.5+ KB
time: 9.2 ms
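The fused names in `weather_forecast.info()` ('TEM_Min_72h,WEP_72h', 'TEM_Min_168h,WEP_168h') plus the trailing `Unnamed: 24`/`Unnamed: 25` columns suggest the header row is malformed, shifting data under the wrong names. A sketch of the diagnosis and one possible fix (skip the header and supply names explicitly), on a tiny synthetic TSV:

```python
import io
import pandas as pd

# Synthetic TSV whose header fuses two names into one ('B,C'), mirroring the
# fused TEM/WEP column names seen above
raw = "A\tB,C\nx\t1\t(1)1\n"

bad = pd.read_csv(io.StringIO(raw), sep='\t')
print(list(bad.columns))  # one name short, so the data shifts

# Fix: skip the malformed header row and supply the full name list explicitly
good = pd.read_csv(io.StringIO(raw), sep='\t', skiprows=1, names=['A', 'B', 'C'])
print(good)
```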

0.2 Test data

0.2.1 Test set
t1 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None))
t1.columns = ['year_month', 'id', 'consume']
t1 = t1.dropna(axis=0)
t1_groupbyid = t1[['id', 'consume']].groupby(['id']).agg({'consume': pd.Series.sum})

t2 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None))
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
t2 = t2.dropna(axis=0)
t2 = t2[['id', 'brand']]
t2 = t2.drop_duplicates()
t2_groupbyid = t2['id'].value_counts()
t2_groupbyid = t2_groupbyid.reset_index()
t2_groupbyid.columns = ['id', 'phone_nums']

test_set = t1_groupbyid.merge(t2_groupbyid, on=['id'])
test_set.head()
Mem. usage decreased to  1.34 Mb (41.7% reduction)
Mem. usage decreased to 60.50 Mb (0.0% reduction)
id consume phone_nums
0 595941207920 220.000 10
1 901845022650 662.000 6
2 1868765858840 143.375 4
3 5058794512580 200.000 7
4 5399381591230 192.000 29
time: 7.86 s
test_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43977 entries, 0 to 43976
Data columns (total 3 columns):
id            43977 non-null int64
consume       43977 non-null float16
phone_nums    43977 non-null int64
dtypes: float16(1), int64(2)
memory usage: 1.1 MB
time: 9.02 ms
t3 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None))
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']

t3_groupbyid_call = t3[['id', 'call_nums']].groupby(['id']).agg({'call_nums': pd.Series.sum})
t3_groupbyid_provincial = t3[['id', 'is_trans_provincial']].groupby(['id']).agg({'is_trans_provincial': pd.Series.sum})
t3_groupbyid_trans = t3[['id', 'is_transnational']].groupby(['id']).agg({'is_transnational': pd.Series.sum})
test_set = test_set.merge(t3_groupbyid_call, on=['id'])
test_set = test_set.merge(t3_groupbyid_provincial, on=['id'])
test_set = test_set.merge(t3_groupbyid_trans, on=['id'])

t4 = reduce_mem_usage(pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None))
t4.columns = ['year_month', 'id', 'province']

t4_groupbyid = t4[['id', 'province']].groupby(['id']).size()
t4_groupbyid = t4_groupbyid.reset_index()
t4_groupbyid.columns = ['id', 'province_out_cnt']
test_set = test_set.merge(t4_groupbyid, how='left', on=['id'])
test_set = test_set.fillna(0)
test_set.head()
Mem. usage decreased to  1.53 Mb (60.0% reduction)
Mem. usage decreased to  0.85 Mb (16.7% reduction)
id consume phone_nums call_nums is_trans_provincial is_transnational province_out_cnt
0 595941207920 220.000 10 68 1 0 1.0
1 901845022650 662.000 6 278 0 0 0.0
2 1868765858840 143.375 4 107 2 0 3.0
3 5058794512580 200.000 7 128 0 0 0.0
4 5399381591230 192.000 29 61 0 0 0.0
time: 17.4 s
!ls /home/kesci/input/gzlt/test_set/
201808	weather_data_2018
time: 704 ms
!ls /home/kesci/input/gzlt/test_set/201808
2018_1.txt  2018_2.txt	2018_3.txt  2018_4.txt	2018_6.txt  2018_7.txt
time: 702 ms
t1 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_1.txt', sep='\t', header=None)
time: 527 ms
t1.columns = ['year_month', 'id', 'consume']
time: 1.27 ms
t1.head()
year_month id consume
0 201807 6401824160010748 618.40
1 201807 6506134548135499 NaN
2 201807 5996920884619954 22.05
3 201806 1187209424543713 7.20
4 201807 9297165066591558 124.00
time: 99.9 ms
t1.describe()
year_month id consume
count 100402.000000 1.004020e+05 86787.000000
mean 201806.500000 5.449905e+15 103.357399
std 0.500002 2.628916e+15 311.428596
min 201806.000000 5.959412e+11 0.010000
25% 201806.000000 3.176902e+15 36.500000
50% 201806.500000 5.440931e+15 81.000000
75% 201807.000000 7.726318e+15 132.125000
max 201807.000000 9.999920e+15 61465.900000
time: 50.6 ms
t1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100402 entries, 0 to 100401
Data columns (total 3 columns):
year_month    100402 non-null int64
id            100402 non-null int64
consume       86787 non-null float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
time: 12.6 ms
%matplotlib inline

# sorted by index (date)

t1.consume.plot()
Matplotlib is building the font cache using fc-list. This may take a moment.





<matplotlib.axes._subplots.AxesSubplot at 0x7fbd4cd3c978>
time: 17 s
t1[t1.consume == 61465.9]
year_month id consume
11962 201807 4827806860301307 61465.9
time: 7.15 ms
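That single 61465.9 consume record sits far above the 75th percentile (132.1) and would dominate any scaling. One common treatment, sketched on synthetic values, is to winsorize at a high quantile:

```python
import pandas as pd

# Synthetic consume values with one extreme record, like the 61465.9 above
consume = pd.Series([36.5, 81.0, 132.0, 618.4, 61465.9])

# Clip at the 99th percentile so a single record doesn't dominate the feature
cap = consume.quantile(0.99)
clipped = consume.clip(upper=cap)
print(cap, clipped.max())
```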
t2 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_2.txt', sep='\t', header=None)
time: 11.8 s
t2.columns = ['id', 'brand', 'type', 'first_use_time', 'recent_use_time']
time: 1.18 ms
t2.head()
id brand type first_use_time recent_use_time
0 3179771753483280 魅族 M575 20180601151052 20180601151054
1 4185007692177509 NaN NaN 20171021182915 20171021183000
2 4972845789896505 NaN NaN 20180624003647 20180624003656
3 4207293827582218 NaN NaN 20171224165902 20180306175444
4 2628020151876580 NaN NaN 20170820111053 20171207020159
time: 7.95 ms
t2.describe()
id first_use_time recent_use_time
count 1.586024e+06 1.586024e+06 1.586024e+06
mean 5.410516e+15 2.017033e+13 2.017156e+13
std 2.618994e+15 6.902153e+09 6.865591e+09
min 5.959412e+11 2.016032e+13 2.016033e+13
25% 3.140763e+15 2.016122e+13 2.017021e+13
50% 5.389338e+15 2.017063e+13 2.017080e+13
75% 7.660413e+15 2.017122e+13 2.018013e+13
max 9.999920e+15 2.018073e+13 2.018073e+13
time: 353 ms
t2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586024 entries, 0 to 1586023
Data columns (total 5 columns):
id                 1586024 non-null int64
brand              1098244 non-null object
type               1098250 non-null object
first_use_time     1586024 non-null int64
recent_use_time    1586024 non-null int64
dtypes: int64(3), object(2)
memory usage: 60.5+ MB
time: 291 ms
t3 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_3.txt', sep='\t', header=None)
time: 451 ms
t3.columns = ['year_month', 'id', 'call_nums', 'is_trans_provincial', 'is_transnational']
time: 1.14 ms
t3.head()
year_month id call_nums is_trans_provincial is_transnational
0 201806 3690814703003361 49 0 0
1 201807 4315823592069831 -1 0 0
2 201806 5199170013029443 -1 0 0
3 201806 1387658205895203 35 0 0
4 201807 3280240784164442 -1 0 0
time: 7.12 ms
t3.describe()
year_month id call_nums is_trans_provincial is_transnational
count 100400.000000 1.004000e+05 100400.000000 100400.000000 100400.000000
mean 201806.500000 5.449990e+15 51.642102 0.206116 0.012809
std 0.500002 2.628873e+15 90.705957 0.404516 0.112449
min 201806.000000 5.959412e+11 -1.000000 0.000000 0.000000
25% 201806.000000 3.177008e+15 6.000000 0.000000 0.000000
50% 201806.500000 5.441108e+15 31.000000 0.000000 0.000000
75% 201807.000000 7.726328e+15 71.000000 0.000000 0.000000
max 201807.000000 9.999920e+15 6537.000000 1.000000 1.000000
time: 46.4 ms
t3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100400 entries, 0 to 100399
Data columns (total 5 columns):
year_month             100400 non-null int64
id                     100400 non-null int64
call_nums              100400 non-null int64
is_trans_provincial    100400 non-null int64
is_transnational       100400 non-null int64
dtypes: int64(5)
memory usage: 3.8 MB
time: 15.1 ms
t4 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_4.txt', sep='\t', header=None)
time: 240 ms
t4.columns = ['year_month', 'id', 'province']
time: 1.2 ms
t4.head()
year_month id province
0 201807 8445647072009305 廣東
1 201806 9414872397547413 浙江
2 201806 2272887111818372 廣東
3 201807 224368910874770 湖北
4 201807 6081677258986878 NaN
time: 6.81 ms
t4.describe()
year_month id
count 44543.000000 4.454300e+04
mean 201806.530319 5.448788e+15
std 0.499086 2.640390e+15
min 201806.000000 5.959412e+11
25% 201806.000000 3.118911e+15
50% 201807.000000 5.430117e+15
75% 201807.000000 7.751481e+15
max 201807.000000 9.999505e+15
time: 20.3 ms
t4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44543 entries, 0 to 44542
Data columns (total 3 columns):
year_month    44543 non-null int64
id            44543 non-null int64
province      44119 non-null object
dtypes: int64(2), object(1)
memory usage: 1.0+ MB
time: 9.73 ms
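Since t4 only lists users who roamed out of the province (with some NaN provinces), simple per-user counts are a natural way to fold it into a single-row-per-user feature table. A sketch on toy data (`t4_demo` and the feature names are illustrative):

```python
import pandas as pd

# Toy version of t4: one row per (month, user, roamed-to province).
t4_demo = pd.DataFrame({
    'year_month': [201806, 201806, 201807, 201807],
    'id': [1, 2, 1, 3],
    'province': ['廣東', '浙江', '廣東', None],
})

# Per-user roaming features: number of roaming records and distinct provinces.
roam_feats = t4_demo.groupby('id').agg(
    roam_records=('year_month', 'size'),
    distinct_provinces=('province', 'nunique'),  # nunique ignores NaN
)
```

Users absent from t4 simply get zeros after a left join onto the main ID list.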
t6 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_6.txt', sep='\t', header=None)
time: 2min 7s
t6.columns = ['date', 'hour', 'id', 'user_longitude', 'user_latitude']
time: 1.22 ms
t6.head()
date hour id user_longitude user_latitude
0 2018-06-10 20 1929821481825935 106.289902 26.837687
1 2018-07-14 18 5450093661688579 106.641975 26.627846
2 2018-07-16 2 4617571498633816 106.230420 27.466980
3 2018-06-15 22 2826359445811398 106.693610 26.591110
4 2018-06-22 10 3526202744290054 107.032570 27.715830
time: 8.4 ms
t6.describe()
hour id user_longitude user_latitude
count 1.655899e+07 1.655899e+07 1.655081e+07 1.655081e+07
mean 1.144987e+01 5.461505e+15 1.066642e+02 2.662386e+01
std 6.742805e+00 2.629564e+15 4.626476e-01 3.195807e-01
min 0.000000e+00 5.959412e+11 1.036700e+02 2.469706e+01
25% 6.000000e+00 3.191837e+15 1.066328e+02 2.655164e+01
50% 1.200000e+01 5.475087e+15 1.066902e+02 2.658444e+01
75% 1.800000e+01 7.732384e+15 1.067199e+02 2.663778e+01
max 2.200000e+01 9.999920e+15 1.095534e+02 2.916468e+01
time: 6.3 s
t6.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16558993 entries, 0 to 16558992
Data columns (total 5 columns):
date              object
hour              int64
id                int64
user_longitude    float64
user_latitude     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 631.7+ MB
time: 3.04 ms
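With 16.5M pings, t6 is the bulkiest table; one compact feature is whether a user's coordinates ever fall inside the target region. The bounding box below is purely illustrative (rough coordinates I chose for Qiandongnan, NOT the competition's official target-region definition), shown on a toy frame:

```python
import pandas as pd

# Toy version of t6 location pings.
t6_demo = pd.DataFrame({
    'date': ['2018-06-10', '2018-07-14'],
    'hour': [20, 18],
    'id': [1, 2],
    'user_longitude': [106.29, 108.10],
    'user_latitude': [26.84, 26.60],
})

# Illustrative (NOT official) box roughly covering Qiandongnan; the real
# target area should come from the competition's own definition.
LON_MIN, LON_MAX = 107.2, 109.6
LAT_MIN, LAT_MAX = 25.3, 27.5

t6_demo['in_target_box'] = (
    t6_demo['user_longitude'].between(LON_MIN, LON_MAX)
    & t6_demo['user_latitude'].between(LAT_MIN, LAT_MAX)
)
visited = t6_demo.groupby('id')['in_target_box'].max()
```

On the full table, downcasting `hour` to `int8` and the coordinates to `float32` before this step would also roughly halve the 631 MB footprint.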
t7 = pd.read_csv('/home/kesci/input/gzlt/test_set/201808/2018_7.txt', sep='\t', header=None)
time: 8.75 s
t7.columns = ['year_month', 'id', 'app', 'flow']
time: 1.18 ms
t7.head()
year_month id app flow
0 201806 9813651010156104 OPPO軟件商店 14545.00
1 201806 2338567014163500 騰訊新聞 0.19
2 201807 1133512913801798 訊飛輸入法 0.01
3 201807 7739596338372898 手機百度 1615.00
4 201807 5724269192271018 百度貼吧 1301953.00
time: 15.6 ms
t7.describe()
year_month id flow
count 1.493733e+06 1.493733e+06 1.492434e+06
mean 2.018065e+05 5.468351e+15 8.991198e+07
std 4.999895e-01 2.628382e+15 8.503798e+08
min 2.018060e+05 5.959412e+11 0.000000e+00
25% 2.018060e+05 3.196619e+15 6.519000e+03
50% 2.018070e+05 5.477012e+15 2.883350e+05
75% 2.018070e+05 7.737568e+15 7.842132e+06
max 2.018070e+05 9.999920e+15 3.341152e+11
time: 226 ms
t7.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1493733 entries, 0 to 1493732
Data columns (total 4 columns):
year_month    1493733 non-null int64
id            1493733 non-null int64
app           1457137 non-null object
flow          1492434 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 45.6+ MB
time: 178 ms
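t7 is in long format (one row per user, month, and app), so a pivot to one row per user with per-app flow totals is the usual way to turn it into model features. A sketch on a toy frame (`t7_demo` is illustrative):

```python
import pandas as pd

# Toy version of t7 app-usage rows.
t7_demo = pd.DataFrame({
    'year_month': [201806, 201806, 201807],
    'id': [1, 1, 2],
    'app': ['騰訊新聞', '手機百度', '騰訊新聞'],
    'flow': [0.19, 1615.0, 10.0],
})

# One row per user, one column per app, summed flow as values.
app_flow = t7_demo.pivot_table(index='id', columns='app',
                               values='flow', aggfunc='sum', fill_value=0)
```

On the real table it is worth restricting to the most common apps first, since a full pivot over every app name would be extremely wide and sparse.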
0.2.2 Weather data
!ls /home/kesci/input/gzlt/test_set/weather_data_2018/
weather_forecast_2018.txt  weather_reported_2018.txt
time: 830 ms
weather_reported_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_reported_2018.txt', sep='\t')
time: 8.57 ms
weather_reported_2018.head()
Station_Name VACODE Year Month Day TEM_Avg TEM_Max TEM_Min PRE_Time_2020 WEP_Record
0 鎮遠 522625 2018 6 1 19.0 21.0 17.8 0.1 ( 60 01 ) 01 60 10 .
1 丹寨 522636 2018 6 1 17.0 19.9 15.3 4.3 ( 60 80 ) 80 .
2 三穗 522624 2018 6 1 17.8 19.2 17.0 0.6 ( 80 10 ) 60 10 .
3 臺江 522630 2018 6 1 18.8 21.1 17.5 1.4 ( 60 01 ) 01 60 10 .
4 劍河 522629 2018 6 1 19.2 21.6 17.9 2.1 ( 60 ) 60 10 .
time: 12.6 ms
weather_reported_2018.describe()
VACODE Year Month Day TEM_Avg TEM_Max TEM_Min PRE_Time_2020
count 1403.000000 1403.0 1403.000000 1403.000000 1403.000000 1403.000000 1403.000000 1403.000000
mean 521862.934426 2018.0 6.508197 15.754098 737.393799 742.297577 734.011119 4.922594
std 1155.972144 0.0 0.500111 8.810097 26696.850268 26696.719415 26696.940604 15.090986
min 520103.000000 2018.0 6.000000 1.000000 15.100000 16.200000 11.800000 0.000000
25% 520122.000000 2018.0 6.000000 8.000000 22.900000 27.300000 20.000000 0.000000
50% 522625.000000 2018.0 7.000000 16.000000 25.100000 30.100000 21.600000 0.000000
75% 522631.000000 2018.0 7.000000 23.000000 26.900000 32.550000 23.050000 2.100000
max 522636.000000 2018.0 7.000000 31.000000 999999.000000 999999.000000 999999.000000 281.700000
time: 118 ms
weather_reported_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1403 entries, 0 to 1402
Data columns (total 10 columns):
Station_Name     1403 non-null object
 VACODE          1403 non-null int64
 Year            1403 non-null int64
 Month           1403 non-null int64
 Day             1403 non-null int64
 TEM_Avg         1403 non-null float64
 TEM_Max         1403 non-null float64
 TEM_Min         1403 non-null float64
PRE_Time_2020    1403 non-null float64
WEP_Record       1403 non-null object
dtypes: float64(4), int64(4), object(2)
memory usage: 109.7+ KB
time: 6.7 ms
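Two quirks are visible above: the `info()` output shows several column names with leading spaces (` VACODE`, ` TEM_Avg`, …), and `describe()` reports a max of 999999.0 for all three temperature fields, which looks like a missing-value sentinel. A hedged cleanup sketch on toy rows (the sentinel interpretation is my assumption):

```python
import numpy as np
import pandas as pd

# Toy rows reproducing both quirks: padded column names and 999999.0
# apparently used as a missing-value sentinel in the temperature fields.
wr_demo = pd.DataFrame({
    'Station_Name': ['鎮遠', '丹寨'],
    ' VACODE': [522625, 522636],
    ' TEM_Avg': [19.0, 999999.0],
    ' TEM_Max': [21.0, 999999.0],
    ' TEM_Min': [17.8, 999999.0],
})

wr_demo.columns = wr_demo.columns.str.strip()        # drop padding in headers
temp_cols = ['TEM_Avg', 'TEM_Max', 'TEM_Min']
wr_demo[temp_cols] = wr_demo[temp_cols].replace(999999.0, np.nan)
```

Without this step, any mean or max over the temperatures is meaningless, as the inflated `describe()` means above already show.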
weather_forecast_2018 = pd.read_csv('/home/kesci/input/gzlt/test_set/weather_data_2018/weather_forecast_2018.txt', sep='\t')
time: 12 ms
weather_forecast_2018.head()
Station_Name VACODE Year Mon Day TEM_Max_24h TEM_Min_24h WEP_24h TEM_Max_48h TEM_Min_48h ... TEM_Max_120h TEM_Min_120h WEP_120h TEM_Max_144h TEM_Min_144h WEP_144h TEM_Max_168h TEM_Min_168h,WEP_168h Unnamed: 24 Unnamed: 25
0 白雲 520113 2018 6 1 20.2 14.8 (3)2 23.2 15.8 ... (2)1 27.5 13.5 (1)1 26.0 14.0 (2)1 24.0 16.0 (1)1
1 岑鞏 522626 2018 6 1 25.5 17.5 (2)2 28.5 20.2 ... (2)0 31.0 17.0 (0)0 31.0 18.5 (0)1 31.0 21.5 (1)1
2 從江 522633 2018 6 1 27.3 19.0 (7)2 29.5 22.0 ... (21)0 33.5 19.6 (0)0 33.5 20.2 (0)1 31.5 23.0 (1)1
3 丹寨 522636 2018 6 1 23.0 15.5 (2)2 26.0 19.2 ... (2)0 28.0 16.2 (0)0 28.0 17.2 (0)1 27.0 19.5 (1)1
4 貴陽 520103 2018 6 1 20.9 14.9 (3)2 24.0 16.4 ... (2)1 28.0 14.0 (1)1 26.0 14.0 (2)1 24.0 16.0 (1)1

5 rows × 26 columns

time: 54.2 ms
weather_forecast_2018.describe()
VACODE Year Mon Day TEM_Max_24h TEM_Min_24h TEM_Max_48h TEM_Min_48h TEM_Max_72h TEM_Min_72h,WEP_72h TEM_Min_96h WEP_96h TEM_Min_120h WEP_120h TEM_Min_144h WEP_144h TEM_Min_168h,WEP_168h Unnamed: 24
count 1463.000000 1463.0 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000 1463.000000
mean 521793.738209 2018.0 6.508544 15.759398 29.724607 21.244703 29.724470 21.385236 29.694463 21.655434 29.924949 21.886945 29.891183 22.010936 30.027341 22.055229 30.192960 21.985373
std 1180.467638 0.0 0.500098 8.810643 3.470128 2.536103 3.232737 2.385237 3.167789 2.270505 3.130886 2.131020 3.191721 2.066640 3.199460 2.092155 3.167676 2.227871
min 520103.000000 2018.0 6.000000 1.000000 17.800000 10.800000 18.000000 12.000000 16.500000 12.500000 16.500000 14.000000 14.500000 13.000000 17.000000 13.200000 16.000000 15.000000
25% 520123.000000 2018.0 6.000000 8.000000 27.500000 20.000000 27.500000 20.000000 27.500000 20.200000 28.000000 20.500000 27.500000 21.000000 28.000000 21.000000 28.000000 20.850000
50% 522625.000000 2018.0 7.000000 16.000000 30.000000 22.000000 29.900000 22.000000 29.500000 22.000000 30.000000 22.000000 30.000000 22.200000 30.000000 22.100000 30.000000 22.200000
75% 522630.500000 2018.0 7.000000 23.000000 32.350000 23.000000 32.000000 23.000000 32.300000 23.300000 32.500000 23.500000 32.500000 23.500000 32.500000 23.700000 32.600000 24.000000
max 522636.000000 2018.0 7.000000 31.000000 37.500000 27.000000 37.000000 25.900000 36.500000 26.000000 36.500000 26.000000 36.500000 26.200000 37.000000 26.000000 37.000000 30.000000
time: 74 ms
weather_forecast_2018.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1463 entries, 0 to 1462
Data columns (total 26 columns):
Station_Name             1463 non-null object
VACODE                   1463 non-null int64
Year                     1463 non-null int64
Mon                      1463 non-null int64
Day                      1463 non-null int64
TEM_Max_24h              1463 non-null float64
TEM_Min_24h              1463 non-null float64
WEP_24h                  1463 non-null object
TEM_Max_48h              1463 non-null float64
TEM_Min_48h              1463 non-null float64
WEP_48h                  1463 non-null object
TEM_Max_72h              1463 non-null float64
TEM_Min_72h,WEP_72h      1463 non-null float64
TEM_Max_96h              1463 non-null object
TEM_Min_96h              1463 non-null float64
WEP_96h                  1463 non-null float64
TEM_Max_120h             1463 non-null object
TEM_Min_120h             1463 non-null float64
WEP_120h                 1463 non-null float64
TEM_Max_144h             1463 non-null object
TEM_Min_144h             1463 non-null float64
WEP_144h                 1463 non-null float64
TEM_Max_168h             1463 non-null object
TEM_Min_168h,WEP_168h    1463 non-null float64
Unnamed: 24              1463 non-null float64
Unnamed: 25              1463 non-null object
dtypes: float64(14), int64(4), object(8)
memory usage: 297.2+ KB
time: 11 ms
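The `info()` output above shows the forecast header was parsed badly: names like `TEM_Min_72h,WEP_72h` are comma-fused into one column, later `TEM_Max_*` columns carry object dtype, and two `Unnamed` columns appear, which suggests the header row has fewer tab-separated fields than the data rows. One hedged repair is to skip the broken header and supply the intended names explicitly; a sketch on a tiny in-memory file (the toy content and the four-name list are illustrative, not the full 27-column schema):

```python
import io
import pandas as pd

# Toy file reproducing the symptom: the header row has fewer tab-separated
# fields than the data rows because two names were comma-fused.
raw = (
    "Station_Name\tTEM_Max_72h\tTEM_Min_72h,WEP_72h\n"
    "白雲\t29.5\t22.0\t(2)1\n"
)

# Skip the malformed header and assign the intended names ourselves.
names = ['Station_Name', 'TEM_Max_72h', 'TEM_Min_72h', 'WEP_72h']
fc_demo = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                      skiprows=1, names=names)
```

The same pattern applied to `weather_forecast_2018.txt` (with the full corrected name list) would realign every temperature column to numeric dtype and remove the `Unnamed` columns.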
!jupyter nbconvert --to markdown "“聯創黔線”杯大數據應用創新大賽.ipynb"