Preface
I happened to see someone post this competition in a group chat, and only after looking it up did I learn it was the third JDATA algorithm competition hosted by JD.com. I had never heard of it, which left me feeling a bit left behind by the times 😢. After taking part in a few Alibaba Tianchi competitions in 2016 I stopped following this space, partly because work got busy and partly because I got lazy. I hear Tencent also runs an advertising algorithm competition every year, and there is always Kaggle, so interested readers can give those a try. This time I only dabbled; now that I work full-time I have far less time than I did as a student, and today's students are really strong 😆. Although my result was nothing special (89/1401), I think the overall workflow is worth recording and sharing. Keep in mind that feature engineering determines the upper bound; everything else merely approaches it. Finally, I look forward to write-ups from the top teams~
Competition Overview
Competition: https://jdata.jd.com/html/detail.html?id=8
Data: https://download.csdn.net/download/dr_guo/11207507
Background
JD Retail adheres to the philosophy of "value creation built on trust and centered on the customer", serving more than 300 million active users with the most suitable products and services at the right time and place across different consumption scenarios and devices. Its third-party platform now has over 210,000 contracted merchants covering every product category. To keep the merchant ecosystem prosperous, diverse, and orderly, and to fully satisfy consumers' one-stop shopping needs, user purchase behavior must be analyzed and predicted more precisely. This competition therefore provides data about users, merchants, and products, including content and review information for shops and products as well as users' rich interactions with them. Teams must use data mining and machine learning techniques to build a model that predicts users' purchases in the relevant categories of each shop, outputting user-shop-category matches that supply high-quality target audiences for precision marketing. Beyond the competition itself, the organizers hope teams will uncover the latent meaning behind the data and deliver win-win intelligent solutions for the merchants and users of the e-commerce platform.
01 / Scoring
The submitted results file contains purchase-intent predictions for all users. Each user's prediction has two parts:
(1) Whether the user buys in a given category between 2018-04-16 and 2018-04-22. The submission contains only the user-category pairs predicted to place an order (pairs predicted not to order must not appear). Duplicate "user-category" pairs in the submission are de-duplicated during evaluation; a correct prediction is scored with label=1, an incorrect one with label=0.
(2) If the user does buy in the category, you must also predict which shop under that category they buy from; a correct shop prediction is scored with pred=1, an incorrect one with pred=0.
The submission is scored as: score = 0.4 * F1_1 + 0.6 * F1_2
The F1 score here has the usual definition: F1 = 2 * Precise * Recall / (Precise + Recall)
where Precise is precision and Recall is recall; F1_1 is the F1 computed over label = 1/0, and F1_2 is the F1 computed over pred = 1/0.
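To make the scoring concrete, here is a minimal plain-Python sketch of the metric as I read the rules above. The function `final_score` and its inputs are hypothetical names of my own; the official evaluation may differ in details such as de-duplication handling.

```python
def f1(precision, recall):
    # Standard F1: harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def final_score(pred, truth):
    """pred/truth are sets of (user_id, cate, shop_id) triples."""
    # F1_1: user-category level (did the user buy in this category?)
    pred_uc = {(u, c) for u, c, s in pred}
    truth_uc = {(u, c) for u, c, s in truth}
    hit_uc = pred_uc & truth_uc
    f1_1 = f1(len(hit_uc) / len(pred_uc), len(hit_uc) / len(truth_uc))
    # F1_2: user-category-shop level (was the shop also right?)
    hit_ucs = set(pred) & set(truth)
    f1_2 = f1(len(hit_ucs) / len(pred), len(hit_ucs) / len(truth))
    return 0.4 * f1_1 + 0.6 * f1_2

pred = {(1, 10, 100), (2, 20, 200)}   # two predicted triples
truth = {(1, 10, 100)}                # one actual purchase
print(round(final_score(pred, truth), 4))  # -> 0.6667 (P=0.5, R=1.0 at both levels)
```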
02 / Data
2.1 Training data
Behavior, comment, and profile data from 2018-02-01 to 2018-04-15 for the users in user set U, over a subset of the products in product set S.
2.2 Prediction target
Predict which categories and shops each user in U buys from between 2018-04-16 and 2018-04-22; a user buys from a shop under a given category at most once.
2.3 Data tables
1) Action data (jdata_action)
2) Comment data (jdata_comment)
3) Product data (jdata_product)
4) Shop data (jdata_shop)
5) User data (jdata_user)
03 / Task Description and Submission Requirements
3.1 Task
For every user appearing in the training set, the model must predict that user's purchase intent toward a shop under a target category within the next 7 days.
3.2 Submission requirements
The submitted CSV file must satisfy:
- UTF-8 encoding without BOM;
- The first row is the header: user_id,cate,shop_id (comma-separated), where user_id is the user ID from the user table (jdata_user), cate is the category of sku_id in the product table (jdata_product), and shop_id is the shop ID from the shop table (jdata_shop);
- No duplicate rows, otherwise the submission is invalid;
- Users predicted to have no purchase intent must not appear in the file.
An example submission is shown below (screenshot omitted):
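As an illustration of the required format, the snippet below writes a conforming file; the rows are made-up predictions. Opening the file with plain `utf-8` (not `utf-8-sig`) gives the required BOM-free encoding.

```python
import csv

# Hypothetical predictions: (user_id, cate, shop_id) triples
rows = [(100001, 30, 2000), (100002, 71, 4156)]

with open('submission.csv', 'w', encoding='utf-8', newline='') as f:  # utf-8 = no BOM
    writer = csv.writer(f)
    writer.writerow(['user_id', 'cate', 'shop_id'])  # required header
    writer.writerows(sorted(set(rows)))              # de-duplicated rows

print(open('submission.csv', encoding='utf-8').readline().strip())  # user_id,cate,shop_id
```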
Workflow Summary
This is the workflow as I have distilled it myself; corrections and discussion are welcome.
1. Exploring the Data
Omitted.
2. Data Cleaning
Omitted.
3. Building the Datasets (Feature Engineering)
At first, the features I chose on my own couldn't even get past 0.03. Then I saw the baseline open-sourced by Cookly 洪鵬飛; building on his code I removed many features and added a few of my own. My score may not even beat his baseline, but I had no choice: I tested my laptop and it can handle at most about 300 feature dimensions. On that note, a complaint for JD: couldn't you provide a development environment the way Alibaba does? Many of my ideas were blocked purely by hardware.
The code follows. Note that I build three datasets: a training set, a test set, and a prediction set. "Prediction set" may be unfamiliar, since that set is usually called the test set, but I find the usual naming confusing, so I call the set used purely to produce the final submission the prediction set. My test set is also slightly unusual: a test set should be completely independent of the training set, is used to estimate model performance, and should not be consulted too often. In theory, performance on the test set should be close to performance on the prediction set. The training set can further be split into training and validation sets; in practice people often have no true test set and instead split the training data into train/test, or into train/validation/test. It sounds confusing, and my usage may not be canonical, but remember one thing: the test set is only for assessing performance, never for tuning the model.
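To make the three window layouts explicit, here is a small sketch of the date ranges used in the code below (the `window` helper is only illustrative):

```python
from datetime import date, timedelta

def window(start, days):
    # Inclusive date range [start, start + days - 1]
    return (start, start + timedelta(days=days - 1))

# Training set: 7-day candidate window, labels from the following 7 days
train_cand = window(date(2018, 3, 29), 7)   # 03-29 .. 04-04
train_label = window(date(2018, 4, 5), 7)   # 04-05 .. 04-11
# Test set: the same layout shifted forward by 4 days
test_cand = window(date(2018, 4, 2), 7)     # 04-02 .. 04-08
test_label = window(date(2018, 4, 9), 7)    # 04-09 .. 04-15
# Prediction set: the last observed 7 days; its "label" week 04-16 .. 04-22 is what we submit
pre_cand = window(date(2018, 4, 9), 7)      # 04-09 .. 04-15

print(train_cand[1], test_label[1], pre_cand[1])
```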
import pandas as pd
import os
import pickle
import datetime
import re
import numpy as np
jdata_action_file_dir = "../../jdata/jdata_action.csv"
jdata_comment_file_dir = "../../jdata/jdata_comment.csv"
jdata_product_file_dir = "../../jdata/jdata_product.csv"
jdata_shop_file_dir = "../../jdata/jdata_shop.csv"
jdata_user_file_dir = "../../jdata/jdata_user.csv"
# Reduce memory usage by downcasting numeric columns
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
# Action data
jdata_action = reduce_mem_usage(pd.read_csv(jdata_action_file_dir))
jdata_action.drop_duplicates(inplace=True)
del jdata_action['module_id']
# Depending on the B-round data, decide whether to drop add-to-cart records (round A only has cart records after the 8th)
jdata_action = jdata_action[~jdata_action['type'].isin([5])]
# Comment data
# jdata_comment = reduce_mem_usage(pd.read_csv(jdata_comment_file_dir))
# Product data
jdata_product = reduce_mem_usage(pd.read_csv(jdata_product_file_dir))
# Shop data
jdata_shop = reduce_mem_usage(pd.read_csv(jdata_shop_file_dir))
# User data
jdata_user = reduce_mem_usage(pd.read_csv(jdata_user_file_dir))
Mem. usage decreased to 745.30 Mb (47.5% reduction)
Mem. usage decreased to 5.72 Mb (57.5% reduction)
Mem. usage decreased to 0.24 Mb (57.1% reduction)
Mem. usage decreased to 38.35 Mb (65.3% reduction)
def go_split(s, symbol='-: '):
    # Build the regex character class
    symbol = "[" + symbol + "]+"
    # Split the string in one pass
    result = re.split(symbol, s)
    # Drop empty strings
    return [x for x in result if x]

def get_hour(start, end):
    d = datetime.datetime(*[int(float(i)) for i in go_split(start)]) \
        - datetime.datetime(*[int(float(i)) for i in go_split(end)])
    n = int(d.days * 24 + d.seconds / 60 / 60)
    return n

def get_first_hour_gap(x):
    return get_hour(end_day, min(x))

def get_last_hour_gap(x):
    return get_hour(end_day, max(x))

def get_act_days(x):
    return len(set([i[:10] for i in x]))
Building the training set: features from 1 day, a 7-day user-category-shop candidate window, and a 7-day label window (predicting 7 days)
- '2018-03-29'-'2018-04-04'
def get_train_set(end_day):
    # Merge actions with product info
    jdata_data = jdata_action.merge(jdata_product, on=['sku_id'])
    # Candidate set: 7 days
    # '2018-03-29'-'2018-04-04'
    train_set = jdata_data[(jdata_data['action_time'] >= '2018-03-29 00:00:00')
                           & (jdata_data['action_time'] <= '2018-04-04 23:59:59')][
        ['user_id', 'cate', 'shop_id']].drop_duplicates()
    # Label window: 7 days
    # '2018-04-05'-'2018-04-11'
    train_buy = jdata_data[(jdata_data['action_time'] >= '2018-04-05 00:00:00')
                           & (jdata_data['action_time'] <= '2018-04-11 23:59:59')
                           & (jdata_data['type'] == 2)][['user_id', 'cate', 'shop_id']].drop_duplicates()
    train_buy['label'] = 1
    train_set = train_set.merge(train_buy, on=['user_id', 'cate', 'shop_id'], how='left').fillna(0)
    print('Labels ready!')
    # Feature window: 2018-04-04, 1 day
    start_day = '2018-04-04 00:00:00'
    for gb_c in [['user_id'],                      # user
                 ['cate'],                         # category
                 ['shop_id'],                      # shop
                 ['user_id', 'cate'],              # user-category
                 ['user_id', 'shop_id'],           # user-shop
                 ['cate', 'shop_id'],              # category-shop
                 ['user_id', 'cate', 'shop_id']]:  # user-category-shop
        print(gb_c)
        action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                 & (jdata_data['action_time'] <= end_day)]
        # Aggregation functions
        features_dict = {
            'sku_id': [np.size, lambda x: len(set(x))],
            'type': lambda x: len(set(x)),
            'brand': lambda x: len(set(x)),
            'shop_id': lambda x: len(set(x)),
            'cate': lambda x: len(set(x)),
            'action_time': [
                get_first_hour_gap,  # first_hour_gap
                get_last_hour_gap,   # last_hour_gap
                get_act_days         # act_days
            ]
        }
        features_columns = [c + '_' + '_'.join(gb_c)
                            for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
        f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
        # print(f_temp.columns)
        f_temp.columns = gb_c + features_columns
        # print(f_temp.columns)
        train_set = train_set.merge(f_temp, on=gb_c, how='left')
        for type_ in [1, 2, 3, 4, 5]:  # 1: browse  2: order  3: follow  4: comment  5: add to cart
            action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                     & (jdata_data['action_time'] <= end_day)
                                     & (jdata_data['type'] == type_)]
            features_dict = {
                'sku_id': [np.size, lambda x: len(set(x))],
                'type': lambda x: len(set(x)),
                'brand': lambda x: len(set(x)),
                'shop_id': lambda x: len(set(x)),
                'cate': lambda x: len(set(x)),
                'action_time': [
                    get_first_hour_gap,  # first_hour_gap
                    get_last_hour_gap,   # last_hour_gap
                    get_act_days         # act_days
                ]
            }
            features_columns = [c + '_' + '_'.join(gb_c) + '_type_' + str(type_)
                                for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
            f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
            if len(f_temp) == 0:
                continue
            f_temp.columns = gb_c + features_columns
            train_set = train_set.merge(f_temp, on=gb_c, how='left')
        # Buy/browse, buy/follow and buy/comment ratios; add-to-cart features matter a lot, pending the B-round data
        train_set['buybro_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(1)]
        train_set['buyfocus_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(3)]
        train_set['buycom_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(4)]
        # train_set['buycart_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(5)]
    # User features
    uid_info_col = ['user_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county']
    train_set = train_set.merge(jdata_user[uid_info_col], on=['user_id'], how='left')
    print('User features ready!')
    # Shop features
    shop_info_col = ['shop_id', 'fans_num', 'vip_num', 'shop_score']
    train_set = train_set.merge(jdata_shop[shop_info_col], on=['shop_id'], how='left')
    print('Shop features ready!')
    return train_set
end_day = '2018-04-04 23:59:59'
train_set = get_train_set(end_day)
train_set.to_hdf('datasets/train_set.h5', key='train_set', mode='w')
print(train_set.shape) # (1560852, 350)
# print(list(train_set.columns))
del train_set
Labels ready!
['user_id']
['cate']
['shop_id']
['user_id', 'cate']
['user_id', 'shop_id']
['cate', 'shop_id']
['user_id', 'cate', 'shop_id']
User features ready!
Shop features ready!
(1560852, 350)
from collections import Counter
train_set = pd.read_hdf('datasets/train_set.h5', key='train_set')
y_train = train_set['label'].values
c = Counter(y_train)
del train_set, y_train
print(c)
Counter({0.0: 1546311, 1.0: 14541})
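These counts make the class imbalance explicit, roughly 106 negatives per positive, which is why `scale_pos_weight` and a decision threshold come into play later:

```python
from collections import Counter

# The training-set label counts printed above
c = Counter({0.0: 1546311, 1.0: 14541})

ratio = c[0.0] / c[1.0]
print(round(ratio, 2))  # ~106.34 negatives per positive
print(round(c[1.0] / (c[0.0] + c[1.0]) * 100, 2), '% positive')  # under 1% positives
```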
Building the test set: features from 1 day, a 7-day user-category-shop candidate window, and a 7-day label window (predicting 7 days)
- '2018-04-02'-'2018-04-08'
- Because browse data for 03-27 and 03-28 is badly missing, the training window runs from 03-29 to 04-04 and overlaps the test window by three days; in theory the training and test sets should be completely disjoint.
def get_test_set(end_day):
    # Merge actions with product info
    jdata_data = jdata_action.merge(jdata_product, on=['sku_id'])
    # Candidate set: 7 days
    # '2018-04-02'-'2018-04-08'
    test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02 00:00:00')
                          & (jdata_data['action_time'] <= '2018-04-08 23:59:59')][
        ['user_id', 'cate', 'shop_id']].drop_duplicates()
    # Label window: 7 days
    # '2018-04-09'-'2018-04-15'
    test_buy = jdata_data[(jdata_data['action_time'] >= '2018-04-09 00:00:00')
                          & (jdata_data['action_time'] <= '2018-04-15 23:59:59')
                          & (jdata_data['type'] == 2)][['user_id', 'cate', 'shop_id']].drop_duplicates()
    test_buy['label'] = 1
    test_set = test_set.merge(test_buy, on=['user_id', 'cate', 'shop_id'], how='left').fillna(0)
    print('Labels ready!')
    # Feature window: 2018-04-08, 1 day
    start_day = '2018-04-08 00:00:00'
    for gb_c in [['user_id'],                      # user
                 ['cate'],                         # category
                 ['shop_id'],                      # shop
                 ['user_id', 'cate'],              # user-category
                 ['user_id', 'shop_id'],           # user-shop
                 ['cate', 'shop_id'],              # category-shop
                 ['user_id', 'cate', 'shop_id']]:  # user-category-shop
        print(gb_c)
        action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                 & (jdata_data['action_time'] <= end_day)]
        # Aggregation functions
        features_dict = {
            'sku_id': [np.size, lambda x: len(set(x))],
            'type': lambda x: len(set(x)),
            'brand': lambda x: len(set(x)),
            'shop_id': lambda x: len(set(x)),
            'cate': lambda x: len(set(x)),
            'action_time': [
                get_first_hour_gap,  # first_hour_gap
                get_last_hour_gap,   # last_hour_gap
                get_act_days         # act_days
            ]
        }
        features_columns = [c + '_' + '_'.join(gb_c)
                            for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
        f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
        # print(f_temp.columns)
        f_temp.columns = gb_c + features_columns
        # print(f_temp.columns)
        test_set = test_set.merge(f_temp, on=gb_c, how='left')
        for type_ in [1, 2, 3, 4, 5]:  # 1: browse  2: order  3: follow  4: comment  5: add to cart
            action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                     & (jdata_data['action_time'] <= end_day)
                                     & (jdata_data['type'] == type_)]
            features_dict = {
                'sku_id': [np.size, lambda x: len(set(x))],
                'type': lambda x: len(set(x)),
                'brand': lambda x: len(set(x)),
                'shop_id': lambda x: len(set(x)),
                'cate': lambda x: len(set(x)),
                'action_time': [
                    get_first_hour_gap,  # first_hour_gap
                    get_last_hour_gap,   # last_hour_gap
                    get_act_days         # act_days
                ]
            }
            features_columns = [c + '_' + '_'.join(gb_c) + '_type_' + str(type_)
                                for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
            f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
            if len(f_temp) == 0:
                continue
            f_temp.columns = gb_c + features_columns
            test_set = test_set.merge(f_temp, on=gb_c, how='left')
        # Buy/browse, buy/follow and buy/comment ratios; add-to-cart features matter a lot, pending the B-round data
        test_set['buybro_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(1)]
        test_set['buyfocus_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(3)]
        test_set['buycom_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(4)]
        # test_set['buycart_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(5)]
    # User features
    uid_info_col = ['user_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county']
    test_set = test_set.merge(jdata_user[uid_info_col], on=['user_id'], how='left')
    print('User features ready!')
    # Shop features
    shop_info_col = ['shop_id', 'fans_num', 'vip_num', 'shop_score']
    test_set = test_set.merge(jdata_shop[shop_info_col], on=['shop_id'], how='left')
    print('Shop features ready!')
    return test_set
end_day = '2018-04-08 23:59:59'
test_set = get_test_set(end_day)
test_set.to_hdf('datasets/test_set.h5', key='test_set', mode='w')
print(test_set.shape) # (1560848, 350)
del test_set
Labels ready!
['user_id']
['cate']
['shop_id']
['user_id', 'cate']
['user_id', 'shop_id']
['cate', 'shop_id']
['user_id', 'cate', 'shop_id']
User features ready!
Shop features ready!
(1560848, 350)
from collections import Counter
test_set = pd.read_hdf('datasets/test_set.h5', key='test_set')
y_train = test_set['label'].values
c = Counter(y_train)
del test_set, y_train
print(c)
Counter({0.0: 1545471, 1.0: 15377})
Building the prediction set: features from 1 day and a 7-day user-category-shop candidate window
- '2018-04-09'-'2018-04-15'
def get_pre_set(end_day):
    # Merge actions with product info
    jdata_data = jdata_action.merge(jdata_product, on=['sku_id'])
    # Prediction candidate set: 7 days
    # '2018-04-09'-'2018-04-15'
    pre_set = jdata_data[(jdata_data['action_time'] >= '2018-04-09 00:00:00')
                         & (jdata_data['action_time'] <= '2018-04-15 23:59:59')][
        ['user_id', 'cate', 'shop_id']].drop_duplicates()
    # Feature window: 2018-04-15, 1 day
    start_day = '2018-04-15 00:00:00'
    for gb_c in [['user_id'],                      # user
                 ['cate'],                         # category
                 ['shop_id'],                      # shop
                 ['user_id', 'cate'],              # user-category
                 ['user_id', 'shop_id'],           # user-shop
                 ['cate', 'shop_id'],              # category-shop
                 ['user_id', 'cate', 'shop_id']]:  # user-category-shop
        print(gb_c)
        action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                 & (jdata_data['action_time'] <= end_day)]
        # Aggregation functions
        features_dict = {
            'sku_id': [np.size, lambda x: len(set(x))],
            'type': lambda x: len(set(x)),
            'brand': lambda x: len(set(x)),
            'shop_id': lambda x: len(set(x)),
            'cate': lambda x: len(set(x)),
            'action_time': [
                get_first_hour_gap,  # first_hour_gap
                get_last_hour_gap,   # last_hour_gap
                get_act_days         # act_days
            ]
        }
        features_columns = [c + '_' + '_'.join(gb_c)
                            for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
        f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
        # print(f_temp.columns)
        f_temp.columns = gb_c + features_columns
        # print(f_temp.columns)
        pre_set = pre_set.merge(f_temp, on=gb_c, how='left')
        for type_ in [1, 2, 3, 4, 5]:  # 1: browse  2: order  3: follow  4: comment  5: add to cart
            action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                     & (jdata_data['action_time'] <= end_day)
                                     & (jdata_data['type'] == type_)]
            features_dict = {
                'sku_id': [np.size, lambda x: len(set(x))],
                'type': lambda x: len(set(x)),
                'brand': lambda x: len(set(x)),
                'shop_id': lambda x: len(set(x)),
                'cate': lambda x: len(set(x)),
                'action_time': [
                    get_first_hour_gap,  # first_hour_gap
                    get_last_hour_gap,   # last_hour_gap
                    get_act_days         # act_days
                ]
            }
            features_columns = [c + '_' + '_'.join(gb_c) + '_type_' + str(type_)
                                for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
            f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
            if len(f_temp) == 0:
                continue
            f_temp.columns = gb_c + features_columns
            pre_set = pre_set.merge(f_temp, on=gb_c, how='left')
        # Buy/browse, buy/follow and buy/comment ratios; add-to-cart features matter a lot, pending the B-round data
        pre_set['buybro_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(1)]
        pre_set['buyfocus_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(3)]
        pre_set['buycom_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(4)]
        # pre_set['buycart_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)] / pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(5)]
    # User features
    uid_info_col = ['user_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county']
    pre_set = pre_set.merge(jdata_user[uid_info_col], on=['user_id'], how='left')
    print('User features ready!')
    # Shop features
    shop_info_col = ['shop_id', 'fans_num', 'vip_num', 'shop_score']
    pre_set = pre_set.merge(jdata_shop[shop_info_col], on=['shop_id'], how='left')
    print('Shop features ready!')
    return pre_set
end_day = '2018-04-15 23:59:59'
pre_set = get_pre_set(end_day)
pre_set.to_hdf('datasets/pre_set.h5', key='pre_set', mode='w')
print(pre_set.shape)
print(list(pre_set.columns))
del pre_set
['user_id']
['cate']
['shop_id']
['user_id', 'cate']
['user_id', 'shop_id']
['cate', 'shop_id']
['user_id', 'cate', 'shop_id']
User features ready!
Shop features ready!
(1569270, 349)
['user_id', 'cate', 'shop_id', 'sku_cnt_user_id', 'sku_nq_user_id', 'type_nq_user_id', 'brand_nq_user_id', 'shop_nq_user_id', 'cate_nq_user_id', 'first_hour_gap_user_id', 'last_hour_gap_user_id', 'act_days_user_id', 'sku_cnt_user_id_type_1', 'sku_nq_user_id_type_1', 'type_nq_user_id_type_1', 'brand_nq_user_id_type_1', 'shop_nq_user_id_type_1', 'cate_nq_user_id_type_1', 'first_hour_gap_user_id_type_1', 'last_hour_gap_user_id_type_1', 'act_days_user_id_type_1', 'sku_cnt_user_id_type_2', 'sku_nq_user_id_type_2', 'type_nq_user_id_type_2', 'brand_nq_user_id_type_2', 'shop_nq_user_id_type_2', 'cate_nq_user_id_type_2', 'first_hour_gap_user_id_type_2', 'last_hour_gap_user_id_type_2', 'act_days_user_id_type_2', 'sku_cnt_user_id_type_3', 'sku_nq_user_id_type_3', 'type_nq_user_id_type_3', 'brand_nq_user_id_type_3', 'shop_nq_user_id_type_3', 'cate_nq_user_id_type_3', 'first_hour_gap_user_id_type_3', 'last_hour_gap_user_id_type_3', 'act_days_user_id_type_3', 'sku_cnt_user_id_type_4', 'sku_nq_user_id_type_4', 'type_nq_user_id_type_4', 'brand_nq_user_id_type_4', 'shop_nq_user_id_type_4', 'cate_nq_user_id_type_4', 'first_hour_gap_user_id_type_4', 'last_hour_gap_user_id_type_4', 'act_days_user_id_type_4', 'buybro_ratio_user_id', 'buyfocus_ratio_user_id', 'buycom_ratio_user_id', 'sku_cnt_cate', 'sku_nq_cate', 'type_nq_cate', 'brand_nq_cate', 'shop_nq_cate', 'cate_nq_cate', 'first_hour_gap_cate', 'last_hour_gap_cate', 'act_days_cate', 'sku_cnt_cate_type_1', 'sku_nq_cate_type_1', 'type_nq_cate_type_1', 'brand_nq_cate_type_1', 'shop_nq_cate_type_1', 'cate_nq_cate_type_1', 'first_hour_gap_cate_type_1', 'last_hour_gap_cate_type_1', 'act_days_cate_type_1', 'sku_cnt_cate_type_2', 'sku_nq_cate_type_2', 'type_nq_cate_type_2', 'brand_nq_cate_type_2', 'shop_nq_cate_type_2', 'cate_nq_cate_type_2', 'first_hour_gap_cate_type_2', 'last_hour_gap_cate_type_2', 'act_days_cate_type_2', 'sku_cnt_cate_type_3', 'sku_nq_cate_type_3', 'type_nq_cate_type_3', 'brand_nq_cate_type_3', 'shop_nq_cate_type_3', 
'cate_nq_cate_type_3', 'first_hour_gap_cate_type_3', 'last_hour_gap_cate_type_3', 'act_days_cate_type_3', 'sku_cnt_cate_type_4', 'sku_nq_cate_type_4', 'type_nq_cate_type_4', 'brand_nq_cate_type_4', 'shop_nq_cate_type_4', 'cate_nq_cate_type_4', 'first_hour_gap_cate_type_4', 'last_hour_gap_cate_type_4', 'act_days_cate_type_4', 'buybro_ratio_cate', 'buyfocus_ratio_cate', 'buycom_ratio_cate', 'sku_cnt_shop_id', 'sku_nq_shop_id', 'type_nq_shop_id', 'brand_nq_shop_id', 'shop_nq_shop_id', 'cate_nq_shop_id', 'first_hour_gap_shop_id', 'last_hour_gap_shop_id', 'act_days_shop_id', 'sku_cnt_shop_id_type_1', 'sku_nq_shop_id_type_1', 'type_nq_shop_id_type_1', 'brand_nq_shop_id_type_1', 'shop_nq_shop_id_type_1', 'cate_nq_shop_id_type_1', 'first_hour_gap_shop_id_type_1', 'last_hour_gap_shop_id_type_1', 'act_days_shop_id_type_1', 'sku_cnt_shop_id_type_2', 'sku_nq_shop_id_type_2', 'type_nq_shop_id_type_2', 'brand_nq_shop_id_type_2', 'shop_nq_shop_id_type_2', 'cate_nq_shop_id_type_2', 'first_hour_gap_shop_id_type_2', 'last_hour_gap_shop_id_type_2', 'act_days_shop_id_type_2', 'sku_cnt_shop_id_type_3', 'sku_nq_shop_id_type_3', 'type_nq_shop_id_type_3', 'brand_nq_shop_id_type_3', 'shop_nq_shop_id_type_3', 'cate_nq_shop_id_type_3', 'first_hour_gap_shop_id_type_3', 'last_hour_gap_shop_id_type_3', 'act_days_shop_id_type_3', 'sku_cnt_shop_id_type_4', 'sku_nq_shop_id_type_4', 'type_nq_shop_id_type_4', 'brand_nq_shop_id_type_4', 'shop_nq_shop_id_type_4', 'cate_nq_shop_id_type_4', 'first_hour_gap_shop_id_type_4', 'last_hour_gap_shop_id_type_4', 'act_days_shop_id_type_4', 'buybro_ratio_shop_id', 'buyfocus_ratio_shop_id', 'buycom_ratio_shop_id', 'sku_cnt_user_id_cate', 'sku_nq_user_id_cate', 'type_nq_user_id_cate', 'brand_nq_user_id_cate', 'shop_nq_user_id_cate', 'cate_nq_user_id_cate', 'first_hour_gap_user_id_cate', 'last_hour_gap_user_id_cate', 'act_days_user_id_cate', 'sku_cnt_user_id_cate_type_1', 'sku_nq_user_id_cate_type_1', 'type_nq_user_id_cate_type_1', 'brand_nq_user_id_cate_type_1', 
'shop_nq_user_id_cate_type_1', 'cate_nq_user_id_cate_type_1', 'first_hour_gap_user_id_cate_type_1', 'last_hour_gap_user_id_cate_type_1', 'act_days_user_id_cate_type_1', 'sku_cnt_user_id_cate_type_2', 'sku_nq_user_id_cate_type_2', 'type_nq_user_id_cate_type_2', 'brand_nq_user_id_cate_type_2', 'shop_nq_user_id_cate_type_2', 'cate_nq_user_id_cate_type_2', 'first_hour_gap_user_id_cate_type_2', 'last_hour_gap_user_id_cate_type_2', 'act_days_user_id_cate_type_2', 'sku_cnt_user_id_cate_type_3', 'sku_nq_user_id_cate_type_3', 'type_nq_user_id_cate_type_3', 'brand_nq_user_id_cate_type_3', 'shop_nq_user_id_cate_type_3', 'cate_nq_user_id_cate_type_3', 'first_hour_gap_user_id_cate_type_3', 'last_hour_gap_user_id_cate_type_3', 'act_days_user_id_cate_type_3', 'sku_cnt_user_id_cate_type_4', 'sku_nq_user_id_cate_type_4', 'type_nq_user_id_cate_type_4', 'brand_nq_user_id_cate_type_4', 'shop_nq_user_id_cate_type_4', 'cate_nq_user_id_cate_type_4', 'first_hour_gap_user_id_cate_type_4', 'last_hour_gap_user_id_cate_type_4', 'act_days_user_id_cate_type_4', 'buybro_ratio_user_id_cate', 'buyfocus_ratio_user_id_cate', 'buycom_ratio_user_id_cate', 'sku_cnt_user_id_shop_id', 'sku_nq_user_id_shop_id', 'type_nq_user_id_shop_id', 'brand_nq_user_id_shop_id', 'shop_nq_user_id_shop_id', 'cate_nq_user_id_shop_id', 'first_hour_gap_user_id_shop_id', 'last_hour_gap_user_id_shop_id', 'act_days_user_id_shop_id', 'sku_cnt_user_id_shop_id_type_1', 'sku_nq_user_id_shop_id_type_1', 'type_nq_user_id_shop_id_type_1', 'brand_nq_user_id_shop_id_type_1', 'shop_nq_user_id_shop_id_type_1', 'cate_nq_user_id_shop_id_type_1', 'first_hour_gap_user_id_shop_id_type_1', 'last_hour_gap_user_id_shop_id_type_1', 'act_days_user_id_shop_id_type_1', 'sku_cnt_user_id_shop_id_type_2', 'sku_nq_user_id_shop_id_type_2', 'type_nq_user_id_shop_id_type_2', 'brand_nq_user_id_shop_id_type_2', 'shop_nq_user_id_shop_id_type_2', 'cate_nq_user_id_shop_id_type_2', 'first_hour_gap_user_id_shop_id_type_2', 'last_hour_gap_user_id_shop_id_type_2', 
'act_days_user_id_shop_id_type_2', 'sku_cnt_user_id_shop_id_type_3', 'sku_nq_user_id_shop_id_type_3', 'type_nq_user_id_shop_id_type_3', 'brand_nq_user_id_shop_id_type_3', 'shop_nq_user_id_shop_id_type_3', 'cate_nq_user_id_shop_id_type_3', 'first_hour_gap_user_id_shop_id_type_3', 'last_hour_gap_user_id_shop_id_type_3', 'act_days_user_id_shop_id_type_3', 'sku_cnt_user_id_shop_id_type_4', 'sku_nq_user_id_shop_id_type_4', 'type_nq_user_id_shop_id_type_4', 'brand_nq_user_id_shop_id_type_4', 'shop_nq_user_id_shop_id_type_4', 'cate_nq_user_id_shop_id_type_4', 'first_hour_gap_user_id_shop_id_type_4', 'last_hour_gap_user_id_shop_id_type_4', 'act_days_user_id_shop_id_type_4', 'buybro_ratio_user_id_shop_id', 'buyfocus_ratio_user_id_shop_id', 'buycom_ratio_user_id_shop_id', 'sku_cnt_cate_shop_id', 'sku_nq_cate_shop_id', 'type_nq_cate_shop_id', 'brand_nq_cate_shop_id', 'shop_nq_cate_shop_id', 'cate_nq_cate_shop_id', 'first_hour_gap_cate_shop_id', 'last_hour_gap_cate_shop_id', 'act_days_cate_shop_id', 'sku_cnt_cate_shop_id_type_1', 'sku_nq_cate_shop_id_type_1', 'type_nq_cate_shop_id_type_1', 'brand_nq_cate_shop_id_type_1', 'shop_nq_cate_shop_id_type_1', 'cate_nq_cate_shop_id_type_1', 'first_hour_gap_cate_shop_id_type_1', 'last_hour_gap_cate_shop_id_type_1', 'act_days_cate_shop_id_type_1', 'sku_cnt_cate_shop_id_type_2', 'sku_nq_cate_shop_id_type_2', 'type_nq_cate_shop_id_type_2', 'brand_nq_cate_shop_id_type_2', 'shop_nq_cate_shop_id_type_2', 'cate_nq_cate_shop_id_type_2', 'first_hour_gap_cate_shop_id_type_2', 'last_hour_gap_cate_shop_id_type_2', 'act_days_cate_shop_id_type_2', 'sku_cnt_cate_shop_id_type_3', 'sku_nq_cate_shop_id_type_3', 'type_nq_cate_shop_id_type_3', 'brand_nq_cate_shop_id_type_3', 'shop_nq_cate_shop_id_type_3', 'cate_nq_cate_shop_id_type_3', 'first_hour_gap_cate_shop_id_type_3', 'last_hour_gap_cate_shop_id_type_3', 'act_days_cate_shop_id_type_3', 'sku_cnt_cate_shop_id_type_4', 'sku_nq_cate_shop_id_type_4', 'type_nq_cate_shop_id_type_4', 
'brand_nq_cate_shop_id_type_4', 'shop_nq_cate_shop_id_type_4', 'cate_nq_cate_shop_id_type_4', 'first_hour_gap_cate_shop_id_type_4', 'last_hour_gap_cate_shop_id_type_4', 'act_days_cate_shop_id_type_4', 'buybro_ratio_cate_shop_id', 'buyfocus_ratio_cate_shop_id', 'buycom_ratio_cate_shop_id', 'sku_cnt_user_id_cate_shop_id', 'sku_nq_user_id_cate_shop_id', 'type_nq_user_id_cate_shop_id', 'brand_nq_user_id_cate_shop_id', 'shop_nq_user_id_cate_shop_id', 'cate_nq_user_id_cate_shop_id', 'first_hour_gap_user_id_cate_shop_id', 'last_hour_gap_user_id_cate_shop_id', 'act_days_user_id_cate_shop_id', 'sku_cnt_user_id_cate_shop_id_type_1', 'sku_nq_user_id_cate_shop_id_type_1', 'type_nq_user_id_cate_shop_id_type_1', 'brand_nq_user_id_cate_shop_id_type_1', 'shop_nq_user_id_cate_shop_id_type_1', 'cate_nq_user_id_cate_shop_id_type_1', 'first_hour_gap_user_id_cate_shop_id_type_1', 'last_hour_gap_user_id_cate_shop_id_type_1', 'act_days_user_id_cate_shop_id_type_1', 'sku_cnt_user_id_cate_shop_id_type_2', 'sku_nq_user_id_cate_shop_id_type_2', 'type_nq_user_id_cate_shop_id_type_2', 'brand_nq_user_id_cate_shop_id_type_2', 'shop_nq_user_id_cate_shop_id_type_2', 'cate_nq_user_id_cate_shop_id_type_2', 'first_hour_gap_user_id_cate_shop_id_type_2', 'last_hour_gap_user_id_cate_shop_id_type_2', 'act_days_user_id_cate_shop_id_type_2', 'sku_cnt_user_id_cate_shop_id_type_3', 'sku_nq_user_id_cate_shop_id_type_3', 'type_nq_user_id_cate_shop_id_type_3', 'brand_nq_user_id_cate_shop_id_type_3', 'shop_nq_user_id_cate_shop_id_type_3', 'cate_nq_user_id_cate_shop_id_type_3', 'first_hour_gap_user_id_cate_shop_id_type_3', 'last_hour_gap_user_id_cate_shop_id_type_3', 'act_days_user_id_cate_shop_id_type_3', 'sku_cnt_user_id_cate_shop_id_type_4', 'sku_nq_user_id_cate_shop_id_type_4', 'type_nq_user_id_cate_shop_id_type_4', 'brand_nq_user_id_cate_shop_id_type_4', 'shop_nq_user_id_cate_shop_id_type_4', 'cate_nq_user_id_cate_shop_id_type_4', 'first_hour_gap_user_id_cate_shop_id_type_4', 
'last_hour_gap_user_id_cate_shop_id_type_4', 'act_days_user_id_cate_shop_id_type_4', 'buybro_ratio_user_id_cate_shop_id', 'buyfocus_ratio_user_id_cate_shop_id', 'buycom_ratio_user_id_cate_shop_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county', 'fans_num', 'vip_num', 'shop_score']
4. Feature Selection
Feature selection uses a random forest; the code below prints every feature ranked by importance.
from model.feat_columns import *
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.preprocessing import Imputer
"""
Feature selection with a random forest
"""
train_set = pd.read_hdf('../datasets/test_set.h5', key='test_set')
# train_set = pd.read_hdf('../../datasets/train_set.h5', key='train_set')
X = train_set[feat_columns].values
print(X.shape)  # 7day-1 (1560848, 349) (1560848, 328) 7day-7 (1560852, 349)
y = train_set['label'].values
# Split off a subset in case the full data is too much to handle
seed = 3
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
clf = RandomForestClassifier(random_state=seed)
# Fill in missing values
X_train = Imputer().fit_transform(X_train)
print('fit...')
clf.fit(X_train, y_train)
print('done')
importance = clf.feature_importances_
indices = np.argsort(importance)[::-1]
features = train_set[feat_columns].columns
l = []
for i in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (i + 1, 30, features[indices[i]], importance[indices[i]]))
    l.append(features[indices[i]])
print(l)
5. Model Selection
First work out what kind of problem you are dealing with: classification, clustering, regression, and so on. Once the category is clear, compare the candidate algorithms on a validation set and pick the one that performs best. Since lightgbm and xgboost are currently the most widely used and best-performing options in competitions, I skipped formal model selection this time and simply used those two. The documentation for the sklearn.model_selection package is worth consulting.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
6. Parameter Tuning
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from collections import Counter
from model.feat_columns import *

train_set = pd.read_hdf('../../datasets/train_set.h5', key='train_set')
X_train = train_set[feat_columns].values
y_train = train_set['label'].values
c = Counter(y_train)
# n = c[0] / 16 / c[1]  # 8
n = c[0] / c[1]  # 129.56
print(n)
parameters = {
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
    'n_estimators': [500, 1000, 2000, 3000, 5000],
    'min_child_weight': [0, 2, 5, 10, 20],
    'max_delta_step': [0, 0.2, 0.6, 1, 2],
    'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
    'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
    'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1],
    'scale_pos_weight': [0.2, 0.4, 0.6, 0.8, 1, 8, n]
}
xlf = xgb.XGBClassifier(max_depth=10,
                        learning_rate=0.01,
                        n_estimators=2000,
                        silent=True,
                        objective='binary:logistic',
                        nthread=12,
                        gamma=0,
                        min_child_weight=1,
                        max_delta_step=0,
                        subsample=0.85,
                        colsample_bytree=0.7,
                        colsample_bylevel=1,
                        reg_alpha=0,
                        reg_lambda=1,
                        scale_pos_weight=1,
                        seed=1440,
                        missing=None)
gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='accuracy', cv=3)
gsearch.fit(X_train, y_train)
print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
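One caveat worth flagging: the grid above is far too large to search exhaustively, since nine parameters with five candidates each plus one with seven multiply out to millions of combinations, and cv=3 triples the number of model fits. In practice you would tune a couple of parameters at a time, or switch to RandomizedSearchCV; and with labels this imbalanced, a scoring choice such as 'f1' or 'roc_auc' is arguably better than 'accuracy'. A quick count:

```python
# Candidate counts per parameter in the grid defined above
grid_sizes = {
    'max_depth': 5, 'learning_rate': 5, 'n_estimators': 5,
    'min_child_weight': 5, 'max_delta_step': 5, 'subsample': 5,
    'colsample_bytree': 5, 'reg_alpha': 5, 'reg_lambda': 5,
    'scale_pos_weight': 7,
}
n_combos = 1
for k in grid_sizes:
    n_combos *= grid_sizes[k]
print(n_combos)      # 13671875 parameter combinations
print(n_combos * 3)  # 41015625 model fits with cv=3
```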
7. Model Training and Evaluation
With features, model, and parameters chosen, train and evaluate the model, and finally check its performance on the earlier test set (not the hold-out split of the training set used here).
import pickle
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from model.feat_columns import *

threshold = 0.3
train_set = pd.read_hdf('../../datasets/train_set.h5', key='train_set')
'''
# Optional: downsample negatives to a balanced set
pos_sample = train_set[train_set['label'] == 1]
n = len(pos_sample)
print(n)
neg_sample = train_set[train_set['label'] == 0].sample(n=n, random_state=1)
del train_set
train_set = pos_sample.append(neg_sample)
'''
X = train_set[feat_columns].values
print(X.shape)  # 7day (1560852, 349)
y = train_set['label'].values
c = Counter(y)  # Counter({0.0: 1545471, 1.0: 15377})
print(c)
train_metrics = train_set[['user_id', 'cate', 'shop_id', 'label']].copy()
del train_set
# Split data into train and test sets
seed = 3
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
c = Counter(y_train)
print(c)  # 7day Counter({0.0: 1035454, 1.0: 10314})
# c[0] / 16 / c[1] 8 | c[0] / c[1] 129.56
clf = XGBClassifier(max_depth=5, min_child_weight=6, scale_pos_weight=c[0] / 16 / c[1], n_estimators=100, nthread=12,
                    seed=0, subsample=0.5)
eval_set = [(X_test, y_test)]
clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# Make predictions on the held-out split
y_pred = clf.predict(X_test)
predictions = [round(value) for value in y_pred]
# Evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
y_train_predict = clf.predict_proba(X)[:, 1]
# y_train_predict = clf.predict(X)
# train_metrics['pre_label'] = y_train_predict
train_metrics['pred_prob'] = y_train_predict
# pred = train_metrics[train_metrics['pre_label'] == 1]
pred = train_metrics[train_metrics['pred_prob'] > threshold]
truth = train_metrics[train_metrics['label'] == 1]
print('X train pred num is:', len(pred))
print("Training-set score:")
get_final_score(pred, truth)
del train_metrics
pickle.dump(clf, open('../user_model/baseline.pkl', 'wb'))
# clf = pickle.load(open('../user_model/baseline.pkl', 'rb'))
8. Model Ensembling
Ensembling means training several models and combining their outputs by some rule; common approaches include weighted averaging, voting, and stacking (learning-based combination). An ensemble usually beats any single model and is a standard way to squeeze out a final score boost. Here I simply took a weighted sum of the probabilities predicted by lightgbm and xgboost and submitted the top-N results.
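A minimal sketch of that blending step with made-up probabilities; the 0.5 weight and top-N value are hypothetical and would in practice be tuned on the test set:

```python
# Hypothetical predicted probabilities from two models for three candidates
candidates = [(1, 10, 100), (1, 20, 200), (2, 30, 300)]
prob_lgb = [0.82, 0.15, 0.40]
prob_xgb = [0.78, 0.25, 0.55]

w = 0.5  # blend weight for the first model
blended = [w * a + (1 - w) * b for a, b in zip(prob_lgb, prob_xgb)]

# Keep the top-N candidates by blended probability for submission
top_n = 2
ranked = sorted(zip(blended, candidates), reverse=True)[:top_n]
submission = [cand for score, cand in ranked]
print(submission)  # [(1, 10, 100), (2, 30, 300)]
```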
References: