2019 JDATA: Predicting User Purchases of Shops Within a Category (A Summary of the General Machine Learning Workflow)

Preface

        I happened to see someone post this competition in a group chat, and only after looking it up did I learn it was the third JDATA algorithm competition held by JD; I had never heard of it, which gave me the feeling of having been left behind by the times 😢. I have not followed this area since taking part in a few Alibaba Tianchi competitions in 2016, partly because work keeps me busy and partly because I got lazy, alas. I hear Tencent also runs an advertising algorithm competition every year; interested readers can give it a try, and there is always Kaggle and the like. I only dabbled in this competition; now that I am working I no longer have the time I had as a student, and today's students are just too strong 😆. Although my result was nothing special (89/1401), I still think the workflow is worth recording and sharing. Keep in mind that feature engineering sets the ceiling; all the other steps merely approach it. Finally, I look forward to write-ups from the top teams~

Competition Overview

Problem: https://jdata.jd.com/html/detail.html?id=8
Data: https://download.csdn.net/download/dr_guo/11207507
Background

JD Retail follows the business philosophy of "value creation based on trust and centered on the customer," serving its 300+ million active users with the most suitable products and services, at the right time and in the right place, across different consumption scenarios and devices. JD Retail's third-party platform now has more than 210,000 contracted merchants covering every product category. To keep the merchant ecosystem prosperous, diverse and orderly, and to fully satisfy consumers' one-stop shopping needs, user purchase behavior must be analyzed and predicted more precisely. To that end, this competition provides data about users, merchants and products, including content and comment information for merchants and products as well as users' rich interactions with them. Teams must use data mining and machine learning to build a model that predicts users' purchases of the relevant categories from merchants, outputting matched user-shop-category results to supply high-quality target audiences for precision marketing. The organizers also hope teams will dig into the latent meaning behind the data and deliver win-win intelligent solutions for the merchants and users of the e-commerce ecosystem.

01 / Scoring

The submission file contains purchase-intent predictions for all users. Each user's prediction covers two things:
(1) Whether the user orders from a category between 2018-04-16 and 2018-04-22. The submission should contain only the user-category pairs predicted as ordering (pairs predicted as not ordering must not appear). During evaluation, duplicate "user-category" pairs in the submission are de-duplicated; a correct prediction is scored with label=1, an incorrect one with label=0.
(2) If the user is predicted to order from a category, you must also predict which shop under that category they order from; a correct shop prediction is scored with pred=1, an incorrect one with pred=0.
The submission file is scored by the formula: score = 0.4 * F1_1 + 0.6 * F1_2
where each F1 value is defined as F1 = 2 * Precise * Recall / (Precise + Recall),
with Precise the precision and Recall the recall; F1_1 is the F1 computed over the label=1/0 outcomes and F1_2 the F1 computed over the pred=1/0 outcomes.
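To make the metric concrete, here is a minimal sketch of the score as I understand it (pred and truth are assumed to be pandas DataFrames of predicted and actual user_id/cate/shop_id triples; the official evaluator may differ in details):

def f1(p, r):
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    return 2 * p * r / (p + r) if p + r else 0.0

def competition_score(pred, truth):
    # F1_1: correctness of the (user, cate) pairs
    pred_uc = set(map(tuple, pred[['user_id', 'cate']].drop_duplicates().values))
    true_uc = set(map(tuple, truth[['user_id', 'cate']].drop_duplicates().values))
    hit = len(pred_uc & true_uc)
    f1_1 = f1(hit / len(pred_uc), hit / len(true_uc))
    # F1_2: correctness of the full (user, cate, shop) triples
    pred_ucs = set(map(tuple, pred[['user_id', 'cate', 'shop_id']].drop_duplicates().values))
    true_ucs = set(map(tuple, truth[['user_id', 'cate', 'shop_id']].drop_duplicates().values))
    hit = len(pred_ucs & true_ucs)
    f1_2 = f1(hit / len(pred_ucs), hit / len(true_ucs))
    return 0.4 * f1_1 + 0.6 * f1_2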

02 / Competition Data

2.1 Training data
Behavior, comment and user data from 2018-02-01 to 2018-04-15 for the users in user set U, covering part of the products in product set S.
2.2 Prediction target
Predict which categories and shops the users in U buy from between 2018-04-16 and 2018-04-22; a user buys from a shop under a given category at most once.
2.3 Data tables

1) Action data (jdata_action)
2) Comment data (jdata_comment)
3) Product data (jdata_product)
4) Shop data (jdata_shop)
5) User data (jdata_user)

03 / Task Description and Submission Requirements

3.1 Task description
For every user appearing in the training set, the model must predict that user's purchase intent for a shop under a target category within the following 7 days.

3.2 Submission requirements
The submitted CSV file must satisfy the following:

  1. UTF-8 encoding without BOM;
  2. The first row is the header, i.e.: user_id,cate,shop_id (comma-separated),
    where user_id is the user ID from the user table (jdata_user); cate is the category of the corresponding product sku_id in the product table (jdata_product); shop_id is the shop ID from the shop table (jdata_shop);
  3. The result must contain no duplicate rows, otherwise it is invalid;
    users predicted to have no purchase intent must not appear in the CSV.
    A minimal sketch of writing such a file is shown below.
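For example (a sketch only; result is assumed to be a DataFrame holding the predicted triples):

# `result` is assumed to be a DataFrame with columns user_id, cate, shop_id
result = result.drop_duplicates(subset=['user_id', 'cate', 'shop_id'])  # requirement 3: no duplicate rows
result.to_csv('submission.csv', index=False, encoding='utf-8')  # pandas writes UTF-8 without a BOM by default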

Workflow Summary

This is the workflow I summarized for myself; if anything is wrong, corrections and discussion are welcome.

1. Inspect and analyze the data

2. Clean the data

3. Build the datasets (feature engineering)

At first the features I chose could not even get past 0.03. Then I saw the baseline open-sourced by Cookly 洪鵬飛; building on his code, I deleted many of his features and added a few of my own. My score may not even beat his baseline, but there was no way around it: my laptop could not cope. I tested it and it tops out at roughly 300 feature dimensions. Which brings me to a complaint about JD: please provide a development environment the way Alibaba does; many of my ideas were blocked by hardware.
The code follows. Note that I build three datasets: a training set, a test set, and a prediction set. "Prediction set" may be unfamiliar, since that set is usually just called the test set, but I find that naming confusing, so I gave the set used purely to produce the final submission its own name. My test set is also slightly unusual: a test set should be completely independent of the training set, is used to estimate how the model will perform, and should not be consulted too often. In theory, the model's performance on the test set should be close to its performance on the prediction set. The training set can additionally be split into training and validation parts, but in practice there is often no true test set; instead the training data is split into train/test, or into train/validation/test. It sounds messy, and I may not have it exactly right, but remember one thing: the test set is only for estimating performance, never for tuning or optimizing the model. It will sink in with practice.
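For reference, the time windows used by the three dataset builders below (dates taken straight from the code; the prediction set's "label window" of 2018-04-16 to 2018-04-22 is what the submission targets):

# (candidate window, feature window, label window) for each dataset
WINDOWS = {
    'train': (('2018-03-29', '2018-04-04'), ('2018-04-04', '2018-04-04'), ('2018-04-05', '2018-04-11')),
    'test':  (('2018-04-02', '2018-04-08'), ('2018-04-08', '2018-04-08'), ('2018-04-09', '2018-04-15')),
    'pred':  (('2018-04-09', '2018-04-15'), ('2018-04-15', '2018-04-15'), ('2018-04-16', '2018-04-22')),
}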

import pandas as pd
import os
import pickle
import datetime
import re
import numpy as np
jdata_action_file_dir = "../../jdata/jdata_action.csv"
jdata_comment_file_dir = "../../jdata/jdata_comment.csv"
jdata_product_file_dir = "../../jdata/jdata_product.csv"
jdata_shop_file_dir = "../../jdata/jdata_shop.csv"
jdata_user_file_dir = "../../jdata/jdata_user.csv"
# Reduce memory usage
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
# Action (behavior) data
jdata_action = reduce_mem_usage(pd.read_csv(jdata_action_file_dir))
jdata_action.drop_duplicates(inplace=True)
del jdata_action['module_id']

# Whether to drop add-to-cart records depends on the B-leaderboard data
# (the A-leaderboard data only has cart records from the 8th onward)
jdata_action = jdata_action[~jdata_action['type'].isin([5])]

# Comment data
# jdata_comment = reduce_mem_usage(pd.read_csv(jdata_comment_file_dir))

# Product data
jdata_product = reduce_mem_usage(pd.read_csv(jdata_product_file_dir))

# Shop data
jdata_shop = reduce_mem_usage(pd.read_csv(jdata_shop_file_dir))

# User data
jdata_user = reduce_mem_usage(pd.read_csv(jdata_user_file_dir))
Mem. usage decreased to 745.30 Mb (47.5% reduction)
Mem. usage decreased to  5.72 Mb (57.5% reduction)
Mem. usage decreased to  0.24 Mb (57.1% reduction)
Mem. usage decreased to 38.35 Mb (65.3% reduction)
def go_split(s, symbol='-: '):
    # Build the regex character class
    symbol = "[" + symbol + "]+"
    # Split the string in one pass
    result = re.split(symbol, s)
    # Drop empty strings
    return [x for x in result if x]
def get_hour(start, end):
    # Hours elapsed from `end` to `start` (both 'YYYY-MM-DD HH:MM:SS' strings)
    d = datetime.datetime(*[int(float(i)) for i in go_split(start)]) - datetime.datetime(*[int(float(i)) for i in go_split(end)])
    n = int(d.days*24 + d.seconds/60/60)
    return n
def get_first_hour_gap(x):
    # Hours between the earliest action and the global `end_day`
    return get_hour(end_day, min(x))
def get_last_hour_gap(x):
    # Hours between the latest action and the global `end_day`
    return get_hour(end_day, max(x))
def get_act_days(x):
    # Number of distinct active days (dates) among the action timestamps
    return len(set([i[:10] for i in x]))
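A quick sanity check of these helpers (my own example, with end_day set as below):

end_day = '2018-04-04 23:59:59'
print(get_hour(end_day, '2018-04-04 00:00:00'))  # 23
print(get_act_days(['2018-04-03 10:00:00', '2018-04-04 09:00:00']))  # 2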

Build the training set: 1-day feature window, 7-day user-category-shop candidate window, 7-day label window (matching the 7-day prediction horizon)

  • '2018-03-29'-'2018-04-04'
def get_train_set(end_day):
    
    # Merge data
    jdata_data = jdata_action.merge(jdata_product, on=['sku_id'])
    
    # Candidate set: 7 days
    # '2018-03-29'-'2018-04-04'
    train_set = jdata_data[(jdata_data['action_time'] >= '2018-03-29 00:00:00')
                           & (jdata_data['action_time'] <= '2018-04-04 23:59:59')][
        ['user_id', 'cate', 'shop_id']].drop_duplicates()
    
    # Label window: 7 days
    # '2018-04-05'-'2018-04-11'
    train_buy = jdata_data[(jdata_data['action_time'] >= '2018-04-05 00:00:00')
                           & (jdata_data['action_time'] <= '2018-04-11 23:59:59')
                           & (jdata_data['type'] == 2)][['user_id', 'cate', 'shop_id']].drop_duplicates()
    train_buy['label'] = 1
    train_set = train_set.merge(train_buy, on=['user_id', 'cate', 'shop_id'], how='left').fillna(0)
    print('Labels ready!')
    

    # Extract features over 1 day: 2018-04-04
    start_day = '2018-04-04 00:00:00'
    for gb_c in [['user_id'],  # user
                 ['cate'],  # category
                 ['shop_id'],  # shop
                 ['user_id', 'cate'],  # user-category
                 ['user_id', 'shop_id'],  # user-shop
                 ['cate', 'shop_id'],  # category-shop
                 ['user_id', 'cate', 'shop_id']]:  # user-category-shop
        print(gb_c)
        
        action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                 & (jdata_data['action_time'] <= end_day)]

        # Feature aggregation functions
        features_dict = {
            'sku_id': [np.size, lambda x: len(set(x))],
            'type': lambda x: len(set(x)),
            'brand': lambda x: len(set(x)),
            'shop_id': lambda x: len(set(x)),
            'cate': lambda x: len(set(x)),
            'action_time': [
                get_first_hour_gap,  # first_hour_gap
                get_last_hour_gap,  # last_hour_gap
                get_act_days  # act_days
            ]

        } 
        features_columns = [c +'_' + '_'.join(gb_c)
                            for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
        f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
        # print(f_temp.columns)
        f_temp.columns = gb_c + features_columns
        # print(f_temp.columns)
        train_set = train_set.merge(f_temp, on=gb_c, how='left')

        for type_ in [1, 2, 3, 4, 5]:  # 1: browse 2: order 3: follow 4: comment 5: add-to-cart
            action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                     & (jdata_data['action_time'] <= end_day)
                                     & (jdata_data['type'] == type_)]
            features_dict = {
                'sku_id': [np.size, lambda x: len(set(x))],
                'type': lambda x: len(set(x)),
                'brand': lambda x: len(set(x)),
                'shop_id': lambda x: len(set(x)),
                'cate': lambda x: len(set(x)),
                'action_time': [
                    get_first_hour_gap,  # first_hour_gap
                    get_last_hour_gap,  # last_hour_gap
                    get_act_days  # act_days
                ]
            }
            features_columns = [c +'_' + '_'.join(gb_c) + '_type_' + str(type_)
                                for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
            f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
            if len(f_temp) == 0:
                continue
            f_temp.columns = gb_c + features_columns
            train_set = train_set.merge(f_temp, on=gb_c, how='left')
        # Buy/browse, buy/follow and buy/comment ratios; add-to-cart features are very important, depending on the B-leaderboard data
        train_set['buybro_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(1)]
        train_set['buyfocus_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(3)]
        train_set['buycom_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(4)]
        # train_set['buycart_ratio_' + '_'.join(gb_c)] = train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/train_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(5)]

    # User features
    uid_info_col = ['user_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county']
    train_set = train_set.merge(jdata_user[uid_info_col], on=['user_id'], how='left')
    print('User features ready!')

    # Shop features
    shop_info_col = ['shop_id', 'fans_num', 'vip_num', 'shop_score']
    train_set = train_set.merge(jdata_shop[shop_info_col], on=['shop_id'], how='left')
    print('Shop features ready!')

    return train_set
end_day = '2018-04-04 23:59:59'
train_set = get_train_set(end_day)
train_set.to_hdf('datasets/train_set.h5', key='train_set', mode='w')
print(train_set.shape)  # (1560852, 350)
# print(list(train_set.columns))
del train_set
Labels ready!
['user_id']
['cate']
['shop_id']
['user_id', 'cate']
['user_id', 'shop_id']
['cate', 'shop_id']
['user_id', 'cate', 'shop_id']
User features ready!
Shop features ready!
(1560852, 350)
from collections import Counter
train_set = pd.read_hdf('datasets/train_set.h5', key='train_set')
y_train = train_set['label'].values
c = Counter(y_train)
del train_set, y_train
print(c) 
Counter({0.0: 1546311, 1.0: 14541})

Build the test set: 1-day feature window, 7-day user-category-shop candidate window, 7-day label window (matching the 7-day prediction horizon)

  • '2018-04-02'-'2018-04-08'
  • Because browse data for 03-27 and 03-28 is severely missing, the training window runs from 03-29 to 04-04 and therefore overlaps the test window by three days; in theory the training and test sets should be completely unrelated.
def get_test_set(end_day):
    
    # Merge data
    jdata_data = jdata_action.merge(jdata_product, on=['sku_id'])
    
    # Candidate set: 7 days
    # '2018-04-02'-'2018-04-08'
    test_set = jdata_data[(jdata_data['action_time'] >= '2018-04-02 00:00:00')
                           & (jdata_data['action_time'] <= '2018-04-08 23:59:59')][
        ['user_id', 'cate', 'shop_id']].drop_duplicates()
    
    # Label window: 7 days
    # '2018-04-09'-'2018-04-15'
    test_buy = jdata_data[(jdata_data['action_time'] >= '2018-04-09 00:00:00')
                           & (jdata_data['action_time'] <= '2018-04-15 23:59:59')
                           & (jdata_data['type'] == 2)][['user_id', 'cate', 'shop_id']].drop_duplicates()
    test_buy['label'] = 1
    test_set = test_set.merge(test_buy, on=['user_id', 'cate', 'shop_id'], how='left').fillna(0)
    print('Labels ready!')
    
    # Extract features over 1 day: 2018-04-08
    start_day = '2018-04-08 00:00:00'
    for gb_c in [['user_id'],  # user
                 ['cate'],  # category
                 ['shop_id'],  # shop
                 ['user_id', 'cate'],  # user-category
                 ['user_id', 'shop_id'],  # user-shop
                 ['cate', 'shop_id'],  # category-shop
                 ['user_id', 'cate', 'shop_id']]:  # user-category-shop
        print(gb_c)
        
        action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                 & (jdata_data['action_time'] <= end_day)]

        # Feature aggregation functions
        features_dict = {
            'sku_id': [np.size, lambda x: len(set(x))],
            'type': lambda x: len(set(x)),
            'brand': lambda x: len(set(x)),
            'shop_id': lambda x: len(set(x)),
            'cate': lambda x: len(set(x)),
            'action_time': [
                get_first_hour_gap,  # first_hour_gap
                get_last_hour_gap,  # last_hour_gap
                get_act_days  # act_days
            ]

        } 
        features_columns = [c +'_' + '_'.join(gb_c)
                            for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
        f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
        # print(f_temp.columns)
        f_temp.columns = gb_c + features_columns
        # print(f_temp.columns)
        test_set = test_set.merge(f_temp, on=gb_c, how='left')

        for type_ in [1, 2, 3, 4, 5]:  # 1: browse 2: order 3: follow 4: comment 5: add-to-cart
            action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                     & (jdata_data['action_time'] <= end_day)
                                     & (jdata_data['type'] == type_)]
            features_dict = {
                'sku_id': [np.size, lambda x: len(set(x))],
                'type': lambda x: len(set(x)),
                'brand': lambda x: len(set(x)),
                'shop_id': lambda x: len(set(x)),
                'cate': lambda x: len(set(x)),
                'action_time': [
                    get_first_hour_gap,  # first_hour_gap
                    get_last_hour_gap,  # last_hour_gap
                    get_act_days  # act_days
                ]
            }
            features_columns = [c +'_' + '_'.join(gb_c) + '_type_' + str(type_)
                                for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
            f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
            if len(f_temp) == 0:
                continue
            f_temp.columns = gb_c + features_columns
            test_set = test_set.merge(f_temp, on=gb_c, how='left')
        # Buy/browse, buy/follow and buy/comment ratios; add-to-cart features are very important, depending on the B-leaderboard data
        test_set['buybro_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(1)]
        test_set['buyfocus_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(3)]
        test_set['buycom_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(4)]
        # test_set['buycart_ratio_' + '_'.join(gb_c)] = test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/test_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(5)]

    # User features
    uid_info_col = ['user_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county']
    test_set = test_set.merge(jdata_user[uid_info_col], on=['user_id'], how='left')
    print('User features ready!')

    # Shop features
    shop_info_col = ['shop_id', 'fans_num', 'vip_num', 'shop_score']
    test_set = test_set.merge(jdata_shop[shop_info_col], on=['shop_id'], how='left')
    print('Shop features ready!')

    return test_set
end_day = '2018-04-08 23:59:59'
test_set = get_test_set(end_day)
test_set.to_hdf('datasets/test_set.h5', key='test_set', mode='w')
print(test_set.shape)  # (1560848, 350)
del test_set
Labels ready!
['user_id']
['cate']
['shop_id']
['user_id', 'cate']
['user_id', 'shop_id']
['cate', 'shop_id']
['user_id', 'cate', 'shop_id']
User features ready!
Shop features ready!
(1560848, 350)
from collections import Counter
test_set = pd.read_hdf('datasets/test_set.h5', key='test_set')
y_train = test_set['label'].values
c = Counter(y_train)
del test_set, y_train
print(c) 
Counter({0.0: 1545471, 1.0: 15377})

Build the prediction set: 1-day feature window, 7-day user-category-shop candidate window

  • '2018-04-09'-'2018-04-15'
def get_pre_set(end_day):
    
    # Merge data
    jdata_data = jdata_action.merge(jdata_product, on=['sku_id'])
    
    # Prediction candidate set: 7 days
    # '2018-04-09'-'2018-04-15' 
    pre_set = jdata_data[(jdata_data['action_time'] >= '2018-04-09 00:00:00')
                           & (jdata_data['action_time'] <= '2018-04-15 23:59:59')][
        ['user_id', 'cate', 'shop_id']].drop_duplicates()
    
    # Extract features over 1 day: 2018-04-15
    start_day = '2018-04-15 00:00:00'
    for gb_c in [['user_id'],  # user
                 ['cate'],  # category
                 ['shop_id'],  # shop
                 ['user_id', 'cate'],  # user-category
                 ['user_id', 'shop_id'],  # user-shop
                 ['cate', 'shop_id'],  # category-shop
                 ['user_id', 'cate', 'shop_id']]:  # user-category-shop
        print(gb_c)
        
        action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                 & (jdata_data['action_time'] <= end_day)]

        # Feature aggregation functions
        features_dict = {
            'sku_id': [np.size, lambda x: len(set(x))],
            'type': lambda x: len(set(x)),
            'brand': lambda x: len(set(x)),
            'shop_id': lambda x: len(set(x)),
            'cate': lambda x: len(set(x)),
            'action_time': [
                get_first_hour_gap,  # first_hour_gap
                get_last_hour_gap,  # last_hour_gap
                get_act_days  # act_days
            ]

        } 
        features_columns = [c +'_' + '_'.join(gb_c)
                            for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
        f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
        # print(f_temp.columns)
        f_temp.columns = gb_c + features_columns
        # print(f_temp.columns)
        pre_set = pre_set.merge(f_temp, on=gb_c, how='left')

        for type_ in [1, 2, 3, 4, 5]:  # 1: browse 2: order 3: follow 4: comment 5: add-to-cart
            action_temp = jdata_data[(jdata_data['action_time'] >= start_day)
                                     & (jdata_data['action_time'] <= end_day)
                                     & (jdata_data['type'] == type_)]
            features_dict = {
                'sku_id': [np.size, lambda x: len(set(x))],
                'type': lambda x: len(set(x)),
                'brand': lambda x: len(set(x)),
                'shop_id': lambda x: len(set(x)),
                'cate': lambda x: len(set(x)),
                'action_time': [
                    get_first_hour_gap,  # first_hour_gap
                    get_last_hour_gap,  # last_hour_gap
                    get_act_days  # act_days
                ]
            }
            features_columns = [c +'_' + '_'.join(gb_c) + '_type_' + str(type_)
                                for c in ['sku_cnt', 'sku_nq', 'type_nq', 'brand_nq', 'shop_nq', 'cate_nq', 'first_hour_gap', 'last_hour_gap', 'act_days']]
            f_temp = action_temp.groupby(gb_c).agg(features_dict).reset_index()
            if len(f_temp) == 0:
                continue
            f_temp.columns = gb_c + features_columns
            pre_set = pre_set.merge(f_temp, on=gb_c, how='left')
        # Buy/browse, buy/follow and buy/comment ratios; add-to-cart features are very important, depending on the B-leaderboard data
        pre_set['buybro_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(1)]
        pre_set['buyfocus_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(3)]
        pre_set['buycom_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(4)]
        # pre_set['buycart_ratio_' + '_'.join(gb_c)] = pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(2)]/pre_set['sku_cnt_' + '_'.join(gb_c) + '_type_' + str(5)]

    # User features
    uid_info_col = ['user_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county']
    pre_set = pre_set.merge(jdata_user[uid_info_col], on=['user_id'], how='left')
    print('User features ready!')

    # Shop features
    shop_info_col = ['shop_id', 'fans_num', 'vip_num', 'shop_score']
    pre_set = pre_set.merge(jdata_shop[shop_info_col], on=['shop_id'], how='left')
    print('Shop features ready!')

    return pre_set
end_day = '2018-04-15 23:59:59'
pre_set = get_pre_set(end_day)
pre_set.to_hdf('datasets/pre_set.h5', key='pre_set', mode='w')
print(pre_set.shape)  
print(list(pre_set.columns))
del pre_set
['user_id']
['cate']
['shop_id']
['user_id', 'cate']
['user_id', 'shop_id']
['cate', 'shop_id']
['user_id', 'cate', 'shop_id']
User features ready!
Shop features ready!
(1569270, 349)
['user_id', 'cate', 'shop_id', 'sku_cnt_user_id', 'sku_nq_user_id', 'type_nq_user_id', 'brand_nq_user_id', 'shop_nq_user_id', 'cate_nq_user_id', 'first_hour_gap_user_id', 'last_hour_gap_user_id', 'act_days_user_id', 'sku_cnt_user_id_type_1', 'sku_nq_user_id_type_1', 'type_nq_user_id_type_1', 'brand_nq_user_id_type_1', 'shop_nq_user_id_type_1', 'cate_nq_user_id_type_1', 'first_hour_gap_user_id_type_1', 'last_hour_gap_user_id_type_1', 'act_days_user_id_type_1', 'sku_cnt_user_id_type_2', 'sku_nq_user_id_type_2', 'type_nq_user_id_type_2', 'brand_nq_user_id_type_2', 'shop_nq_user_id_type_2', 'cate_nq_user_id_type_2', 'first_hour_gap_user_id_type_2', 'last_hour_gap_user_id_type_2', 'act_days_user_id_type_2', 'sku_cnt_user_id_type_3', 'sku_nq_user_id_type_3', 'type_nq_user_id_type_3', 'brand_nq_user_id_type_3', 'shop_nq_user_id_type_3', 'cate_nq_user_id_type_3', 'first_hour_gap_user_id_type_3', 'last_hour_gap_user_id_type_3', 'act_days_user_id_type_3', 'sku_cnt_user_id_type_4', 'sku_nq_user_id_type_4', 'type_nq_user_id_type_4', 'brand_nq_user_id_type_4', 'shop_nq_user_id_type_4', 'cate_nq_user_id_type_4', 'first_hour_gap_user_id_type_4', 'last_hour_gap_user_id_type_4', 'act_days_user_id_type_4', 'buybro_ratio_user_id', 'buyfocus_ratio_user_id', 'buycom_ratio_user_id', 'sku_cnt_cate', 'sku_nq_cate', 'type_nq_cate', 'brand_nq_cate', 'shop_nq_cate', 'cate_nq_cate', 'first_hour_gap_cate', 'last_hour_gap_cate', 'act_days_cate', 'sku_cnt_cate_type_1', 'sku_nq_cate_type_1', 'type_nq_cate_type_1', 'brand_nq_cate_type_1', 'shop_nq_cate_type_1', 'cate_nq_cate_type_1', 'first_hour_gap_cate_type_1', 'last_hour_gap_cate_type_1', 'act_days_cate_type_1', 'sku_cnt_cate_type_2', 'sku_nq_cate_type_2', 'type_nq_cate_type_2', 'brand_nq_cate_type_2', 'shop_nq_cate_type_2', 'cate_nq_cate_type_2', 'first_hour_gap_cate_type_2', 'last_hour_gap_cate_type_2', 'act_days_cate_type_2', 'sku_cnt_cate_type_3', 'sku_nq_cate_type_3', 'type_nq_cate_type_3', 'brand_nq_cate_type_3', 'shop_nq_cate_type_3', 'cate_nq_cate_type_3', 'first_hour_gap_cate_type_3', 'last_hour_gap_cate_type_3', 'act_days_cate_type_3', 'sku_cnt_cate_type_4', 'sku_nq_cate_type_4', 'type_nq_cate_type_4', 'brand_nq_cate_type_4', 'shop_nq_cate_type_4', 'cate_nq_cate_type_4', 'first_hour_gap_cate_type_4', 'last_hour_gap_cate_type_4', 'act_days_cate_type_4', 'buybro_ratio_cate', 'buyfocus_ratio_cate', 'buycom_ratio_cate', 'sku_cnt_shop_id', 'sku_nq_shop_id', 'type_nq_shop_id', 'brand_nq_shop_id', 'shop_nq_shop_id', 'cate_nq_shop_id', 'first_hour_gap_shop_id', 'last_hour_gap_shop_id', 'act_days_shop_id', 'sku_cnt_shop_id_type_1', 'sku_nq_shop_id_type_1', 'type_nq_shop_id_type_1', 'brand_nq_shop_id_type_1', 'shop_nq_shop_id_type_1', 'cate_nq_shop_id_type_1', 'first_hour_gap_shop_id_type_1', 'last_hour_gap_shop_id_type_1', 'act_days_shop_id_type_1', 'sku_cnt_shop_id_type_2', 'sku_nq_shop_id_type_2', 'type_nq_shop_id_type_2', 'brand_nq_shop_id_type_2', 'shop_nq_shop_id_type_2', 'cate_nq_shop_id_type_2', 'first_hour_gap_shop_id_type_2', 'last_hour_gap_shop_id_type_2', 'act_days_shop_id_type_2', 'sku_cnt_shop_id_type_3', 'sku_nq_shop_id_type_3', 'type_nq_shop_id_type_3', 'brand_nq_shop_id_type_3', 'shop_nq_shop_id_type_3', 'cate_nq_shop_id_type_3', 'first_hour_gap_shop_id_type_3', 'last_hour_gap_shop_id_type_3', 'act_days_shop_id_type_3', 'sku_cnt_shop_id_type_4', 'sku_nq_shop_id_type_4', 'type_nq_shop_id_type_4', 'brand_nq_shop_id_type_4', 'shop_nq_shop_id_type_4', 'cate_nq_shop_id_type_4', 'first_hour_gap_shop_id_type_4', 'last_hour_gap_shop_id_type_4', 
'act_days_shop_id_type_4', 'buybro_ratio_shop_id', 'buyfocus_ratio_shop_id', 'buycom_ratio_shop_id', 'sku_cnt_user_id_cate', 'sku_nq_user_id_cate', 'type_nq_user_id_cate', 'brand_nq_user_id_cate', 'shop_nq_user_id_cate', 'cate_nq_user_id_cate', 'first_hour_gap_user_id_cate', 'last_hour_gap_user_id_cate', 'act_days_user_id_cate', 'sku_cnt_user_id_cate_type_1', 'sku_nq_user_id_cate_type_1', 'type_nq_user_id_cate_type_1', 'brand_nq_user_id_cate_type_1', 'shop_nq_user_id_cate_type_1', 'cate_nq_user_id_cate_type_1', 'first_hour_gap_user_id_cate_type_1', 'last_hour_gap_user_id_cate_type_1', 'act_days_user_id_cate_type_1', 'sku_cnt_user_id_cate_type_2', 'sku_nq_user_id_cate_type_2', 'type_nq_user_id_cate_type_2', 'brand_nq_user_id_cate_type_2', 'shop_nq_user_id_cate_type_2', 'cate_nq_user_id_cate_type_2', 'first_hour_gap_user_id_cate_type_2', 'last_hour_gap_user_id_cate_type_2', 'act_days_user_id_cate_type_2', 'sku_cnt_user_id_cate_type_3', 'sku_nq_user_id_cate_type_3', 'type_nq_user_id_cate_type_3', 'brand_nq_user_id_cate_type_3', 'shop_nq_user_id_cate_type_3', 'cate_nq_user_id_cate_type_3', 'first_hour_gap_user_id_cate_type_3', 'last_hour_gap_user_id_cate_type_3', 'act_days_user_id_cate_type_3', 'sku_cnt_user_id_cate_type_4', 'sku_nq_user_id_cate_type_4', 'type_nq_user_id_cate_type_4', 'brand_nq_user_id_cate_type_4', 'shop_nq_user_id_cate_type_4', 'cate_nq_user_id_cate_type_4', 'first_hour_gap_user_id_cate_type_4', 'last_hour_gap_user_id_cate_type_4', 'act_days_user_id_cate_type_4', 'buybro_ratio_user_id_cate', 'buyfocus_ratio_user_id_cate', 'buycom_ratio_user_id_cate', 'sku_cnt_user_id_shop_id', 'sku_nq_user_id_shop_id', 'type_nq_user_id_shop_id', 'brand_nq_user_id_shop_id', 'shop_nq_user_id_shop_id', 'cate_nq_user_id_shop_id', 'first_hour_gap_user_id_shop_id', 'last_hour_gap_user_id_shop_id', 'act_days_user_id_shop_id', 'sku_cnt_user_id_shop_id_type_1', 'sku_nq_user_id_shop_id_type_1', 'type_nq_user_id_shop_id_type_1', 'brand_nq_user_id_shop_id_type_1', 'shop_nq_user_id_shop_id_type_1', 'cate_nq_user_id_shop_id_type_1', 'first_hour_gap_user_id_shop_id_type_1', 'last_hour_gap_user_id_shop_id_type_1', 'act_days_user_id_shop_id_type_1', 'sku_cnt_user_id_shop_id_type_2', 'sku_nq_user_id_shop_id_type_2', 'type_nq_user_id_shop_id_type_2', 'brand_nq_user_id_shop_id_type_2', 'shop_nq_user_id_shop_id_type_2', 'cate_nq_user_id_shop_id_type_2', 'first_hour_gap_user_id_shop_id_type_2', 'last_hour_gap_user_id_shop_id_type_2', 'act_days_user_id_shop_id_type_2', 'sku_cnt_user_id_shop_id_type_3', 'sku_nq_user_id_shop_id_type_3', 'type_nq_user_id_shop_id_type_3', 'brand_nq_user_id_shop_id_type_3', 'shop_nq_user_id_shop_id_type_3', 'cate_nq_user_id_shop_id_type_3', 'first_hour_gap_user_id_shop_id_type_3', 'last_hour_gap_user_id_shop_id_type_3', 'act_days_user_id_shop_id_type_3', 'sku_cnt_user_id_shop_id_type_4', 'sku_nq_user_id_shop_id_type_4', 'type_nq_user_id_shop_id_type_4', 'brand_nq_user_id_shop_id_type_4', 'shop_nq_user_id_shop_id_type_4', 'cate_nq_user_id_shop_id_type_4', 'first_hour_gap_user_id_shop_id_type_4', 'last_hour_gap_user_id_shop_id_type_4', 'act_days_user_id_shop_id_type_4', 'buybro_ratio_user_id_shop_id', 'buyfocus_ratio_user_id_shop_id', 'buycom_ratio_user_id_shop_id', 'sku_cnt_cate_shop_id', 'sku_nq_cate_shop_id', 'type_nq_cate_shop_id', 'brand_nq_cate_shop_id', 'shop_nq_cate_shop_id', 'cate_nq_cate_shop_id', 'first_hour_gap_cate_shop_id', 'last_hour_gap_cate_shop_id', 'act_days_cate_shop_id', 'sku_cnt_cate_shop_id_type_1', 'sku_nq_cate_shop_id_type_1', 'type_nq_cate_shop_id_type_1', 
'brand_nq_cate_shop_id_type_1', 'shop_nq_cate_shop_id_type_1', 'cate_nq_cate_shop_id_type_1', 'first_hour_gap_cate_shop_id_type_1', 'last_hour_gap_cate_shop_id_type_1', 'act_days_cate_shop_id_type_1', 'sku_cnt_cate_shop_id_type_2', 'sku_nq_cate_shop_id_type_2', 'type_nq_cate_shop_id_type_2', 'brand_nq_cate_shop_id_type_2', 'shop_nq_cate_shop_id_type_2', 'cate_nq_cate_shop_id_type_2', 'first_hour_gap_cate_shop_id_type_2', 'last_hour_gap_cate_shop_id_type_2', 'act_days_cate_shop_id_type_2', 'sku_cnt_cate_shop_id_type_3', 'sku_nq_cate_shop_id_type_3', 'type_nq_cate_shop_id_type_3', 'brand_nq_cate_shop_id_type_3', 'shop_nq_cate_shop_id_type_3', 'cate_nq_cate_shop_id_type_3', 'first_hour_gap_cate_shop_id_type_3', 'last_hour_gap_cate_shop_id_type_3', 'act_days_cate_shop_id_type_3', 'sku_cnt_cate_shop_id_type_4', 'sku_nq_cate_shop_id_type_4', 'type_nq_cate_shop_id_type_4', 'brand_nq_cate_shop_id_type_4', 'shop_nq_cate_shop_id_type_4', 'cate_nq_cate_shop_id_type_4', 'first_hour_gap_cate_shop_id_type_4', 'last_hour_gap_cate_shop_id_type_4', 'act_days_cate_shop_id_type_4', 'buybro_ratio_cate_shop_id', 'buyfocus_ratio_cate_shop_id', 'buycom_ratio_cate_shop_id', 'sku_cnt_user_id_cate_shop_id', 'sku_nq_user_id_cate_shop_id', 'type_nq_user_id_cate_shop_id', 'brand_nq_user_id_cate_shop_id', 'shop_nq_user_id_cate_shop_id', 'cate_nq_user_id_cate_shop_id', 'first_hour_gap_user_id_cate_shop_id', 'last_hour_gap_user_id_cate_shop_id', 'act_days_user_id_cate_shop_id', 'sku_cnt_user_id_cate_shop_id_type_1', 'sku_nq_user_id_cate_shop_id_type_1', 'type_nq_user_id_cate_shop_id_type_1', 'brand_nq_user_id_cate_shop_id_type_1', 'shop_nq_user_id_cate_shop_id_type_1', 'cate_nq_user_id_cate_shop_id_type_1', 'first_hour_gap_user_id_cate_shop_id_type_1', 'last_hour_gap_user_id_cate_shop_id_type_1', 'act_days_user_id_cate_shop_id_type_1', 'sku_cnt_user_id_cate_shop_id_type_2', 'sku_nq_user_id_cate_shop_id_type_2', 'type_nq_user_id_cate_shop_id_type_2', 'brand_nq_user_id_cate_shop_id_type_2', 'shop_nq_user_id_cate_shop_id_type_2', 'cate_nq_user_id_cate_shop_id_type_2', 'first_hour_gap_user_id_cate_shop_id_type_2', 'last_hour_gap_user_id_cate_shop_id_type_2', 'act_days_user_id_cate_shop_id_type_2', 'sku_cnt_user_id_cate_shop_id_type_3', 'sku_nq_user_id_cate_shop_id_type_3', 'type_nq_user_id_cate_shop_id_type_3', 'brand_nq_user_id_cate_shop_id_type_3', 'shop_nq_user_id_cate_shop_id_type_3', 'cate_nq_user_id_cate_shop_id_type_3', 'first_hour_gap_user_id_cate_shop_id_type_3', 'last_hour_gap_user_id_cate_shop_id_type_3', 'act_days_user_id_cate_shop_id_type_3', 'sku_cnt_user_id_cate_shop_id_type_4', 'sku_nq_user_id_cate_shop_id_type_4', 'type_nq_user_id_cate_shop_id_type_4', 'brand_nq_user_id_cate_shop_id_type_4', 'shop_nq_user_id_cate_shop_id_type_4', 'cate_nq_user_id_cate_shop_id_type_4', 'first_hour_gap_user_id_cate_shop_id_type_4', 'last_hour_gap_user_id_cate_shop_id_type_4', 'act_days_user_id_cate_shop_id_type_4', 'buybro_ratio_user_id_cate_shop_id', 'buyfocus_ratio_user_id_cate_shop_id', 'buycom_ratio_user_id_cate_shop_id', 'age', 'sex', 'user_lv_cd', 'city_level', 'province', 'city', 'county', 'fans_num', 'vip_num', 'shop_score']

4. Feature selection

Feature selection uses a random forest; the code below prints the features ranked by importance.

from model.feat_columns import *  # assumed to provide pandas (pd) and the feat_columns list
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.preprocessing import Imputer  # sklearn < 0.22; newer versions use sklearn.impute.SimpleImputer

"""
Feature selection with a random forest
"""
train_set = pd.read_hdf('../datasets/test_set.h5', key='test_set')
# train_set = pd.read_hdf('../../datasets/train_set.h5', key='train_set')
X = train_set[feat_columns].values
print(X.shape)  # 7day-1 (1560848, 349) (1560848, 328)  7day-7 (1560852, 349)

y = train_set['label'].values

# If the full data is too much for the machine, work on a split
seed = 3
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

clf = RandomForestClassifier(random_state=seed)
# Fill in missing values
X_train = Imputer().fit_transform(X_train)
print('fit...')
clf.fit(X_train, y_train)
print('done')
importance = clf.feature_importances_
indices = np.argsort(importance)[::-1]
features = train_set[feat_columns].columns
l = []
for i in range(X_train.shape[1]):
    print(("%2d) %-*s %f" % (i + 1, 30, features[indices[i]], importance[indices[i]])))
    l.append(features[indices[i]])
print(l)
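The ranked list l can then be cut down to whatever the hardware tolerates, for example (my own usage; the cap of 300 is the laptop limit mentioned earlier):

# Keep only the top-k features by importance; ~300 dimensions was my laptop's limit
top_k = 300
feat_columns_selected = l[:top_k]
print(len(feat_columns_selected))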

5. Model selection

First decide which kind of problem you are facing: classification, clustering, regression, and so on. Once the type is clear, compare how the algorithms of that family perform on a validation set and pick a suitable model. Because lightgbm and xgboost are currently the most widely used and best-performing options in competitions, I skipped formal model selection this time and simply used those two. The sklearn.model_selection package documentation covers the relevant utilities; an illustrative comparison sketch follows the imports below.

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
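For instance (an illustrative sketch, not the original code; X_train and y_train are assumed to be a manageable subsample of the training set):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

# Score a few candidate classifiers with 3-fold stratified cross-validation
for name, model in [('logreg', LogisticRegression(max_iter=1000)),
                    ('rf', RandomForestClassifier(n_estimators=100)),
                    ('lgb', lgb.LGBMClassifier())]:
    scores = cross_val_score(model, X_train, y_train, scoring='f1',
                             cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=3))
    print(name, scores.mean())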

6. Parameter selection

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from collections import Counter
from model.feat_columns import *  # assumed to provide pandas (pd) and feat_columns


train_set = pd.read_hdf('../../datasets/train_set.h5', key='train_set')

X_train = train_set[feat_columns].values
y_train = train_set['label'].values
c = Counter(y_train)
# n = c[0] / 16 / c[1]  # 8
n = c[0] / c[1]  # 129.56

print(n)

# Note: this full Cartesian grid is far too large to search exhaustively;
# see the staged-tuning sketch after this block.
parameters = {
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
    'n_estimators': [500, 1000, 2000, 3000, 5000],
    'min_child_weight': [0, 2, 5, 10, 20],
    'max_delta_step': [0, 0.2, 0.6, 1, 2],
    'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
    'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
    'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1],
    'scale_pos_weight': [0.2, 0.4, 0.6, 0.8, 1, 8, n]

}

xlf = xgb.XGBClassifier(max_depth=10,
                        learning_rate=0.01,
                        n_estimators=2000,
                        silent=True,
                        objective='binary:logistic',
                        nthread=12,
                        gamma=0,
                        min_child_weight=1,
                        max_delta_step=0,
                        subsample=0.85,
                        colsample_bytree=0.7,
                        colsample_bylevel=1,
                        reg_alpha=0,
                        reg_lambda=1,
                        scale_pos_weight=1,
                        seed=1440,
                        missing=None)

gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='accuracy', cv=3)  # with labels this imbalanced, 'f1' would be a more informative scorer
gsearch.fit(X_train, y_train)

print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

7. Model training and evaluation

With the features, model and parameters settled, train and evaluate the model, then finally check its performance on the test set built earlier (not the held-out split carved from the training set here).

import pickle
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from model.feat_columns import *  # assumed to provide pandas (pd), feat_columns and get_final_score

threshold = 0.3

train_set = pd.read_hdf('../../datasets/train_set.h5', key='train_set')

# Optional negative downsampling to a 1:1 ratio (kept commented out):
'''
pos_sample = train_set[train_set['label'] == 1]
n = len(pos_sample)
print(n)
neg_sample = train_set[train_set['label'] == 0].sample(n=n, random_state=1)
del train_set
train_set = pos_sample.append(neg_sample)
'''

X = train_set[feat_columns].values
print(X.shape)  # 7day  (1560852, 349)

y = train_set['label'].values
c = Counter(y)  # Counter({0.0: 1545471, 1.0: 15377})
print(c)

train_metrics = train_set[['user_id', 'cate', 'shop_id', 'label']]
del train_set

# split data into train and test sets
seed = 3
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

c = Counter(y_train)
print(c)  # 7day Counter({0.0: 1035454, 1.0: 10314})

# c[0] / 16 / c[1]  8 | c[0] / c[1]  129.56
clf = XGBClassifier(max_depth=5, min_child_weight=6, scale_pos_weight=c[0] / 16 / c[1], n_estimators=100, nthread=12,
                    seed=0, subsample=0.5)
eval_set = [(X_test, y_test)]

clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = clf.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

y_train_predict = clf.predict_proba(X)[:, 1]
# y_train_predict = clf.predict(X)
# train_metrics['pre_label'] = y_train_predict
train_metrics['pred_prob'] = y_train_predict

# pred = train_metrics[train_metrics['pre_label'] == 1]
pred = train_metrics[train_metrics['pred_prob'] > threshold]
truth = train_metrics[train_metrics['label'] == 1]
print('X train pred num is:', len(pred))

print("訓練集分數:")
get_final_score(pred, truth)
del train_metrics
pickle.dump(clf, open('../user_model/baseline.pkl', 'wb'))

# clf = pickle.load(open('../user_model/baseline.pkl', 'rb'))

8. Model fusion

Model fusion means training several models and combining their outputs by some rule; common approaches are weighted averaging, voting, and learning-based stacking, all easy to look up. A fused result is usually better than any single model, so fusion is a standard final step for squeezing out extra score. This time I simply took a weighted sum of the probabilities predicted by lightgbm and xgboost and submitted the top-N results, roughly as sketched below.
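A minimal blending sketch (the 0.5/0.5 weights, the top_n value, and the pred_lgb/pred_xgb names are illustrative; both are assumed to be DataFrames holding the same candidate rows, with each model's probability in a pred_prob column):

# Weighted average of the two models' probabilities, then keep the top-N rows
blend = pred_lgb[['user_id', 'cate', 'shop_id']].copy()
blend['prob'] = 0.5 * pred_lgb['pred_prob'].values + 0.5 * pred_xgb['pred_prob'].values
top_n = 160000  # illustrative; choose N by validating against the test window
submission = blend.sort_values('prob', ascending=False).head(top_n)
submission[['user_id', 'cate', 'shop_id']].to_csv('submission.csv', index=False)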

