Item Recommendation: CIKM 2019 EComm AI, Efficient User-Interest Retrieval for Super-Large-Scale Recommendation

Problem Description

This was my first Tianchi competition, and the task is straightforward: https://tianchi.aliyun.com/competition/entrance/231721/tab/158

The training set contains three files: a user behavior log, a user profile table, and an item profile table, described below.

user_behavior.csv is the user behavior file; it has 4 comma-separated columns with the following meanings:

Column  Description
User ID  positive integer identifying a specific user
Item ID  positive integer identifying a specific item
Behavior type  enumerated string, one of ('pv', 'buy', 'cart', 'fav')
Timestamp  integer in [0, 1382400), the offset in seconds between the behavior and 0:00:00 of a certain Friday

user.csv is the user profile file; it has 4 comma-separated columns with the following meanings:

Column  Description
User ID  positive integer identifying a specific user
Gender  integer indicating the user's gender: 0 male, 1 female, 2 unknown
Age  positive integer, the user's age
Purchasing power  positive integer in [1, 9], the user's purchasing-power tier

item.csv is the item profile table; it has 4 comma-separated columns with the following meanings:

Column  Description
Item ID  positive integer identifying a specific item
Category ID  positive integer, the category the item belongs to
Shop ID  positive integer, the shop the item belongs to
Brand ID  integer, the item's brand; -1 means unknown

Like the training set, the test set contains the same three files. For every user in the test set, contestants must predict the top 50 items that user is likely to be interested in. Specifically, under the definitions above, a user's true "future interest" means any of the four behaviors ('pv', 'buy', 'cart', 'fav') occurring within one day after timestamp 1382400. The candidate item pool for interest prediction is the union of the item catalogs (the item.csv files) of the training and test sets.

Dataset download: https://pan.baidu.com/s/16rZgHtoG8aoJK3T9OMT3Zg (password: j77w)
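For orientation, here is a minimal sketch of loading the three files with pandas using the column semantics above; it assumes the CSVs are header-less (as the Spark code later in this post also assumes) and sit in the working directory:

import pandas as pd

# header-less CSVs; the column names follow the tables above
user_behavior = pd.read_csv('user_behavior.csv', header=None,
                            names=['user_id', 'item_id', 'behavior_type', 'time'])
users = pd.read_csv('user.csv', header=None,
                    names=['user_id', 'gender', 'age', 'buy_cap'])
items = pd.read_csv('item.csv', header=None,
                    names=['item_id', 'category_id', 'shop_id', 'brand_id'])
print(user_behavior['behavior_type'].value_counts())  # distribution of pv / buy / cart / fav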

Business-Rule Filtering Approach

My approach was as follows: if the behavior data shows that a user has purchased an item and the repurchase rate of that item within the user population is high, mark it as an item of interest; if the repurchase rate is 0, do not recommend it. For items the user has not purchased, use the weighted sum of the other actions as the recommendation score. On top of that, within the user's demographic group (age, gender, purchasing power), recommend items ranked by the weighted action sum in descending order.

The idea is simple, but the data files are large. Doing the analysis in Python directly with pandas.DataFrame was extremely slow (it basically could not finish). I had hit the same problem in an earlier Tianchi offline contest, and the Spark cluster I set up back then came in handy here. The workflow: first put the CSV files into Hadoop, then read them in pyspark and save them in Parquet format; from that point on they can be used as pyspark.sql.dataframe.DataFrame objects, and the idea above translates into DataFrame operations. The code below is run through pyspark:


import pyspark.sql.functions as F
import numpy as np
import pandas as pd

## csv -> parquet
user_behaviors=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/user_behavior.csv')
user_behaviors=user_behaviors.withColumnRenamed('_c0','user_id').withColumnRenamed('_c1','item_id').withColumnRenamed('_c2','behavior_type').withColumnRenamed('_c3','time')
user_behaviors.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/user_behaviors.parquet')

users=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/user.csv')
users=users.withColumnRenamed('_c0','user_id').withColumnRenamed('_c1','gender').withColumnRenamed('_c2','age').withColumnRenamed('_c3','buy_cap')
users.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/users.parquet')

items=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/item.csv')
items=items.withColumnRenamed('_c0','item_id').withColumnRenamed('_c1','category_id').withColumnRenamed('_c2','shop_id').withColumnRenamed('_c3','brand_id')
items.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/items.parquet')

## raw data read back from parquet
users_test=spark.read.parquet('/item_recommend1_testB/users.parquet')
items_test=spark.read.parquet('/item_recommend1_testB/items.parquet')
user_behaviors_test=spark.read.parquet('/item_recommend1_testB/user_behaviors.parquet')
users=spark.read.parquet('/item_recommend1/users.parquet')
items=spark.read.parquet('/item_recommend1/items.parquet')
user_behaviors=spark.read.parquet('/item_recommend1/user_behaviors.parquet')

## combined (train + test) data
items_total=items.union(items_test).distinct()
users_total=users.union(users_test).distinct()
user_behaviors_total=user_behaviors.union(user_behaviors_test).distinct()
items_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/items.parquet')
users_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/users.parquet')
user_behaviors_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors.parquet')

## user_behaviors_allaction
user_behaviors=spark.read.parquet('/item_recommend1_totalB/user_behaviors.parquet')
user_behaviors_allaction=user_behaviors.withColumn('behavior_value',F.when(user_behaviors['behavior_type']=='pv',1).when(user_behaviors['behavior_type']=='fav',2).when(user_behaviors['behavior_type']=='cart',3).when(user_behaviors['behavior_type']=='buy',4))
user_behaviors_allaction.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors_allaction.parquet')
user_behaviors_allaction=spark.read.parquet('/item_recommend1_totalB/user_behaviors_allaction.parquet')
## combined user / item tables
users=spark.read.parquet('/item_recommend1_totalB/users.parquet')
items=spark.read.parquet('/item_recommend1_totalB/items.parquet')


## all days; behavior_value_new weights each action by recency: the later the behavior, the higher the weight
full_user_behaviors=user_behaviors_allaction.join(users,on='user_id').join(items,on='item_id')
full_user_behaviors=full_user_behaviors.select(['*',(full_user_behaviors.behavior_value/F.ceil(16-full_user_behaviors.time/86400)).alias('behavior_value_new')])
full_user_behaviors.write.format("parquet").mode("overwrite").save('/item_recommend1_totalB/full_user_behaviors.parquet')
full_user_behaviors=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors.parquet')

## group by 'user_id', 'item_id'
full_user_behaviors_user_item=full_user_behaviors.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'}) #,'behavior_type':'count_distinct'
full_user_behaviors_user_item=full_user_behaviors_user_item.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
full_user_behaviors_user_item=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')

full_user_behaviors_user_item_user=users.join(full_user_behaviors_user_item,on='user_id')
full_user_behaviors_user_item_user_age_item=full_user_behaviors_user_item_user.groupBy(['age','gender','buy_cap','item_id']).agg({'behavior_value_sum':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_user_item_user_age_item=full_user_behaviors_user_item_user_age_item.withColumnRenamed('sum(behavior_value_sum)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_user_age_item.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item.parquet')
full_user_behaviors_user_item_user_age_item=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item.parquet')



## recent-window aggregates (the `_3` suffix is left over from an earlier 3-day window; with START_TIME = 86400*8 this keeps the last 8 of the 16 days)
START_TIME=86400*8
full_user_behaviors_3=full_user_behaviors.filter('time>'+str(START_TIME))
full_user_behaviors_3.write.format("parquet").mode("overwrite").save('/item_recommend1_totalB/full_user_behaviors_3.parquet')
full_user_behaviors_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_3.parquet')

## group by 'user_id', 'item_id'
#count here is the number of behavior records for the (user, item) pair
full_user_behaviors_user_item_3=full_user_behaviors_3.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'}) #,'behavior_type':'count_distinct'
full_user_behaviors_user_item_3=full_user_behaviors_user_item_3.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_3.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_3.parquet')
full_user_behaviors_user_item_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_3.parquet')

full_user_behaviors_user_item_user_3=users.join(full_user_behaviors_user_item_3,on='user_id')
full_user_behaviors_user_item_user_age_item_3=full_user_behaviors_user_item_user_3.groupBy(['age','gender','buy_cap','item_id']).agg({'behavior_value_sum':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_user_item_user_age_item_3=full_user_behaviors_user_item_user_age_item_3.withColumnRenamed('sum(behavior_value_sum)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_user_age_item_3.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item_3.parquet')
full_user_behaviors_user_item_user_age_item_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item_3.parquet')
## _3

#.filter('count>1')
dup_buyed_items=full_user_behaviors.filter('behavior_value==4').groupBy(['user_id','item_id']).count().groupBy('item_id').agg({'count':'avg'}).withColumnRenamed('avg(count)','count')
dup_buyed_items.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
dup_buyed_items=dup_buyed_items.filter('count>1.25')
buyed_items=full_user_behaviors.join(users_test,how='left_semi',on='user_id')
full_user_behaviors_buy=buyed_items.filter('behavior_value==4')
full_user_behaviors_buy_dup=full_user_behaviors_buy.select(['user_id','item_id']).distinct().join(dup_buyed_items,how='inner',on='item_id')
#items bought more than once on average per buying user
# full_user_behaviors_buy_dup_count=full_user_behaviors_buy_dup.groupBy('user_id').count()
# full_user_behaviors_buy_dup_count.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)

#actions in the recent window
full_user_behaviors_3_test=full_user_behaviors_3.join(users_test,how='left_semi',on='user_id')
#drop items the user has already purchased
full_user_behaviors_3_notbuy=full_user_behaviors_3_test.join(full_user_behaviors_buy,how='left_anti',on=['user_id','item_id'])
full_user_behaviors_3_notbuy_group=full_user_behaviors_3_notbuy.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_3_notbuy_group=full_user_behaviors_3_notbuy_group.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_3_notbuy_group.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
full_user_behaviors_3_notbuy_group.approxQuantile('behavior_value_sum',np.linspace(0,1,50).tolist(),0.01)

#among non-purchased items, keep those whose weighted action sum clears the threshold (16 below)
recommended_notbuy=full_user_behaviors_3_notbuy_group.filter('behavior_value_sum>16')
full_user_behaviors_buy_dup.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_buy.parquet')
full_user_behaviors_buy_dup=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_buy.parquet')
recommended_notbuy.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/recommended_notbuy.parquet')
recommended_notbuy=spark.read.parquet('/item_recommend1_totalB/recommended_notbuy.parquet')
recommended=recommended_notbuy.select(['user_id','item_id','count','behavior_value_sum']).union(full_user_behaviors_buy_dup.selectExpr(['user_id','item_id','count','10000 as behavior_value_sum']))
# recommended.groupBy('user_id').count().approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
recommended1=recommended.selectExpr(['user_id','item_id','count','behavior_value_sum'])


# full_user_behaviors_user_item_user_age_item_3v=full_user_behaviors_user_item_user_age_item_3.selectExpr(['age','gender','buy_cap','item_id','behavior_value_sum/count'])
# full_user_behaviors_user_item_user_age_item_3v.approxQuantile('(behavior_value_sum / count)',np.linspace(0,1,50).tolist(),0.01)
# full_user_behaviors_user_item_user_age_itemP=full_user_behaviors_user_item_user_age_item_3.toPandas()
# full_user_behaviors_user_item_user_age_itemP['ac']=full_user_behaviors_user_item_user_age_itemP['behavior_value_sum']/full_user_behaviors_user_item_user_age_itemP['count']

#
# full_user_behaviors_user_item.filter('behavior_value_sum>1').stat.approxQuantile('behavior_value_sum',np.linspace(0,1,50).tolist(),0.01)
#
# full_user_behaviors_user_item.filter('behavior_value_sum>1').stat.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)

@F.pandas_udf("age int,gender int,buy_cap int,item_id int,count int,behavior_value_sum double", F.PandasUDFType.GROUPED_MAP)
def trim(df):
    return df.nlargest(50,'behavior_value_sum')


recommend_items_age=full_user_behaviors_user_item_user_age_item_3.select(['age','gender','buy_cap','item_id','count', 'behavior_value_sum']).groupby(['age','gender','buy_cap']).apply(trim)
recommend_items_user=users_test.join(recommend_items_age,on=['age','gender','buy_cap']).select(['user_id','item_id','count','behavior_value_sum'])
recommend_items_user.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/recommend_items_user.parquet')
recommend_items_user=spark.read.parquet('/item_recommend1_totalB/recommend_items_user.parquet')
recommend_items_user_all=recommend_items_user.join(recommended1,how='left_anti',on=['user_id','item_id']).union(recommended1)
recommend_items_user_df=recommend_items_user_all.toPandas()


def gen_itemids(r):
    if 'user_id' not in r.columns:
        return
    user_id=r['user_id'].iloc[0]
    l = [user_id]
    r=r.sort_values(by='behavior_value_sum',ascending=False)
    l.extend(list(r['item_id'])[:50])
    return l

recommend_items_user_series=recommend_items_user_df.groupby('user_id').apply(gen_itemids)
notmatched_users=users_test.select('user_id').subtract(recommend_items_user.select('user_id').distinct()).collect()
for a in notmatched_users:
    recommend_items_user_series[a.user_id]=[a.user_id]

need_more_recommends=recommend_items_user_series[recommend_items_user_series.apply(len)<51]
#if a user still has fewer than 50 recommendations, backfill from the all-days aggregates
for a in need_more_recommends:
    print(a[0])
    user=users_test.filter(' user_id='+str(a[0])).collect()[0]
    j = 0
    while len(a) < 51 and j < 4:
        if j == 0:
            pre_condition = ' age=%d and gender=%d and buy_cap=%d ' % (user.age, user.gender, user.buy_cap)
        elif j == 1:
            pre_condition = ' age between %d and %d and gender=%d and buy_cap=%d ' % (
            user.age - 3, user.age + 3, user.gender, user.buy_cap)
        elif j == 2:
            pre_condition = ' age =%d and gender=%d and buy_cap between %d and %d ' % (
                user.age, user.gender, user.buy_cap - 2, user.buy_cap + 2)
        else:
            pre_condition = ' age between %d and %d and gender=%d and buy_cap between %d and %d ' % (
                user.age - 3, user.age + 3, user.gender, user.buy_cap - 2, user.buy_cap + 2)
        condition = pre_condition
        if len(a) > 1:
            condition += (' and item_id not in (%s)' % ','.join([str(i) for i in a[1:]]) )
        print(condition)
        recommend_items = full_user_behaviors_user_item_user_age_item.filter(condition).orderBy(F.desc('behavior_value_sum')).limit(51-len(a)).collect()
        for i in recommend_items:
            if i.item_id not in a[1:]:
                a.append(i.item_id)
        if len(a) >= 51:
            break
        j=j+1
    recommend_items_user_series[a[0]]=a

# need_more_recommends=recommend_items_user_series[recommend_items_user_series.apply(len)<51]
# for a in need_more_recommends:
#     user=users_test.filter(' user_id='+str(a[0])).collect()[0]
#     j = 1
#     while len(a) < 51 and j < 4:
#         if j == 0:
#             pre_condition = ' age=%d and gender=%d and buy_cap=%d ' % (user.age, user.gender, user.buy_cap)
#         elif j == 1:
#             pre_condition = ' age between %d and %d and gender=%d and buy_cap=%d ' % (
#             user.age - 3, user.age + 3, user.gender, user.buy_cap)
#         elif j == 2:
#             pre_condition = ' age =%d and gender=%d and buy_cap between %d and %d ' % (
#                 user.age, user.gender, user.buy_cap - 2, user.buy_cap + 2)
#         else:
#             pre_condition = ' age between %d and %d and gender=%d and buy_cap between %d and %d ' % (
#                 user.age - 3, user.age + 3, user.gender, user.buy_cap - 2, user.buy_cap + 2)
#         condition = pre_condition
#         if len(a) > 1:
#             condition += (' and item_id not in (%s)' % ','.join([str(i) for i in a[1:]]) )
#         print(condition)
#         recommend_items = full_user_behaviors_user_item_user_age_item_3.filter(condition).orderBy(
#             F.desc('count'),F.desc('behavior_value_sum')).limit(51-len(a)).collect()
#         for i in recommend_items:
#             if i.item_id not in a[1:]:
#                 a.append(i.item_id)
#         if len(a) >= 51:
#             break
#         j=j+1
#     recommend_items_user_series[a[0]]=a





df=pd.DataFrame(list(recommend_items_user_series.values),dtype=int)
df.to_csv('/Users/zhangyugu/Downloads/testb_result_081416.csv',float_format='%d',header=False,index=False)

This was the preliminary round, and this logic was enough to reach the semifinal, where submitting code was required. That got me thinking: all of these filtering rules were hand-crafted by me and are not necessarily the best recommendation strategy. How could machine learning be used to learn the optimal one?

Machine Learning Approach

The ALS approach

The data available are user attributes (age, gender, purchasing power), item attributes (category, brand, shop), and the users' behavior history; the target is the user's interest in an item. How the interest score is defined is up to us; the simplest choice is the weighted sum of the user's historical actions on the item. The attributes above serve as features; the harder part is extracting the features hidden in the behavior history. Here I used Spark's ALS matrix factorization to obtain the user and item latent features implied by the behavior history:


from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

hadoop_preffix="hdfs://master:8020/item_recommend1.db"
sc.setCheckpointDir(hadoop_preffix+'/item_recommend1_als_sc')

user_behaviors=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
users_test=spark.read.parquet('/item_recommend1_testB/users.parquet')
user_behaviors_test=user_behaviors.join(users_test,how='left_semi',on='user_id')
users=spark.read.parquet('/item_recommend1/users.parquet')
user_behaviors_test_full=user_behaviors.join(users_test,how='inner',on='user_id')
user_behaviors_test=user_behaviors_test.select(['user_id','item_id',"behavior_value_sum"])

import pyspark.mllib.recommendation as rd
user_behaviors_test_rdd=user_behaviors_test.rdd
user_behaviors_test_rddRating=user_behaviors_test.rdd.map(lambda r:rd.Rating(r.user_id,r.item_id,r.behavior_value_sum))
user_behaviors_test_rddRating.checkpoint()
user_behaviors_test_rddRating.cache()
model=rd.ALS.trainImplicit(user_behaviors_test_rddRating,8,50,0.01)
userFeatures=model.userFeatures()
def feature_to_row(a):
    l=list(a[1])
    l.insert(0,a[0])
    return l


userFeaturesRowed=userFeatures.map(feature_to_row)
productFeatures=model.productFeatures()
productFeaturesRowed=productFeatures.map(feature_to_row)
userFeaturesDf=sqlContext.createDataFrame(userFeaturesRowed,['user_id','feature_0','feature_1','feature_2','feature_3','feature_4','feature_5','feature_6','feature_7'])
itemFeaturesDf=sqlContext.createDataFrame(productFeaturesRowed,['item_id','item_feature_0','item_feature_1','item_feature_2','item_feature_3','item_feature_4','item_feature_5','item_feature_6','item_feature_7'])

userFeaturesDf.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/userFeaturesDf.parquet')
itemFeaturesDf.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/itemFeaturesDf.parquet')
user_behaviors_test_fullFeatured=user_behaviors_test_full.join(userFeaturesDf,how='inner',on='user_id').join(itemFeaturesDf,how='inner',on='item_id')
user_behaviors_test_fullFeatured.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors_test_fullFeatured.parquet')

With these features in hand, the next step is to predict a user's interest in a given item from the features. You can use an ordinary regression algorithm such as logistic regression, or a more powerful function approximator such as a neural network.
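As a rough sketch of the "ordinary regression" option: since the interest score (behavior_value_sum) here is continuous, the example below fits a plain linear regression with Spark ML rather than logistic regression. It assumes the user_behaviors_test_fullFeatured parquet written above and an active spark session; the feature column list is only illustrative.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

df = spark.read.parquet('/item_recommend1_totalB/user_behaviors_test_fullFeatured.parquet')

# user profile columns plus the ALS user / item factors computed above
feature_cols = (['age', 'gender', 'buy_cap']
                + ['feature_%d' % i for i in range(8)]
                + ['item_feature_%d' % i for i in range(8)])
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
train_df = assembler.transform(df).select('features', 'behavior_value_sum')

lr = LinearRegression(featuresCol='features', labelCol='behavior_value_sum', regParam=0.01)
lr_model = lr.fit(train_df)
print(lr_model.summary.rootMeanSquaredError)  # training RMSE as a quick sanity check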

The DeepFM approach

How can these features be combined with a neural network? That is when I came across the DeepFM model; see the write-up DeepFM模型理論和實踐 (DeepFM: theory and practice) for an introduction. The code below trains with this model:

import numpy as np
import tensorflow as tf
import pandas as pd
from StratifiedKFoldByXColumnsYBin import StratifiedKFoldByXColumnsYBin
import sys
sys.path.append('../')
from DeepFM import DeepFM

# load the data
data_dir = '/Users/zhangyugu/Downloads/'
dfTrain=pd.read_csv(data_dir+'user_behaviors_test_full.csv')
# dfTrain['behavior_value_sum'].describe()
cols=['user_id','item_id','brand_id','shop_id','category_id','gender','age','buy_cap']
X_train=dfTrain[cols]
Y_train=dfTrain['behavior_value_sum'].values
X_test=X_train

# map categorical features to one-hot index positions
feat_dim=0
feat_dict = {}
numeric_cols=['buy_cap','age']
for col in cols:
    us=dfTrain[col].unique()
    if col in numeric_cols:
        feat_dict[col]=feat_dim # a numeric field occupies a single one-hot slot
        feat_dim+=1
    else:
        feat_dict[col] = dict(zip(us, range(feat_dim, len(us) + feat_dim)))
        feat_dim+=len(us)

dfTrain_i=X_train.copy()
dfTrain_v=X_train.copy()
for col in cols:
    if col not in numeric_cols:
        dfTrain_i[col] = dfTrain_i[col].map(feat_dict[col])
        dfTrain_v[col] = 1.
    else:
        dfTrain_i[col] = feat_dict[col]

dfTrain_i=dfTrain_i.values.tolist()
dfTrain_v=dfTrain_v.values.tolist()


# evaluation metric (named gini_norm to match the DeepFM example code, but it actually computes MSE)
def gini_norm(actual,predict):
    return ((np.array(actual)-np.array(predict))**2).sum()/len(actual)

dfm_params={
    'use_fm':True,
    'use_deep':True,
    'embedding_size':24,
    'dropout_fm':[1.0,1.0],
    'deep_layers':[48,16],
    'dropout_deep':[0.5,0.5,0.5],
    'deep_layers_activation':tf.nn.relu,
    'epoch':30,
    'batch_size':1024,
    'learning_rate':0.001,
    'optimizer_type':'adam',
    'batch_norm':1,
    'batch_norm_decay':0.995,
    'l2_reg':0.01,
    'verbose':True,
    'eval_metric':gini_norm,
    'random_seed':2017,
    'greater_is_better':False,
    'loss_type':'mse'
}

dfm_params["feature_size"] = feat_dim #特徵數
dfm_params["field_size"] = len(cols) #字段數

folds= list(StratifiedKFoldByXColumnsYBin(columns=['gender','age','buy_cap'],n_splits=3,shuffle=True,random_state=2017).split(X_train,Y_train))
y_train_meta = np.zeros((dfTrain.shape[0],1),dtype=float) # out-of-fold predictions on the training data
y_test_meta = np.zeros((dfTrain.shape[0],1),dtype=float) # predictions on the "test" data; out of laziness the test data here is just the training data again, to be fixed
gini_results_cv=np.zeros(len(folds),dtype=float) # per-fold CV metric
gini_results_epoch_train=np.zeros((len(folds),dfm_params['epoch']),dtype=float) # training metric per fold and epoch (folds x epochs)
gini_results_epoch_valid=np.zeros((len(folds),dfm_params['epoch']),dtype=float)

# for i in range(len(folds)):
#     train_idx,valid_idx=train_test_split(range(len(dfTrain)),random_state=2017,train_size=2.0/3.0)
for i, (train_idx, valid_idx) in enumerate(folds):
    # slice feature_index, feature_value and label for this fold
    _get = lambda x, l: [x[i] for i in l]
    Xi_train_,Xv_train_,y_train_ = _get(dfTrain_i,train_idx), _get(dfTrain_v,train_idx),_get(Y_train,train_idx)
    Xi_valid_, Xv_valid_, y_valid_ = _get(dfTrain_i, valid_idx), _get(dfTrain_v, valid_idx), _get(Y_train, valid_idx)

    dfm=DeepFM(**dfm_params)
    dfm.fit(Xi_train_,Xv_train_,y_train_,Xi_valid_,Xv_valid_,y_valid_)

    y_train_meta[valid_idx,0]=dfm.predict(Xi_valid_,Xv_valid_)
    y_test_meta[:,0]+=dfm.predict(dfTrain_i,dfTrain_v) # accumulate; averaged over the folds below

    gini_results_cv[i]=gini_norm(y_valid_,y_train_meta[valid_idx])
    gini_results_epoch_train[i]=dfm.train_result
    gini_results_epoch_valid[i]=dfm.valid_result

filename = data_dir+"DeepFm_Mean%.5f_Std%.5f.csv" % (gini_results_cv.mean(), gini_results_cv.std())
# average the test predictions over the folds
y_test_meta/=float(len(folds))
pd.DataFrame({"user_id": X_train['user_id'],"item_id":X_train['item_id'], "target": y_test_meta.flatten()}).to_csv(
        filename, index=False, float_format="%.5f")

# error
print("DeepFm: %.5f (%.5f)" % (gini_results_cv.mean(),gini_results_cv.std()))
import matplotlib.pyplot as plt
def _plot_fig(train_results, valid_results, filename):
    colors = ["red", "blue", "green"]
    xs = np.arange(1, train_results.shape[1]+1)
    plt.figure()
    legends = []
    for i in range(train_results.shape[0]):
        plt.plot(xs, train_results[i], color=colors[i], linestyle="solid", marker="o")
        plt.plot(xs, valid_results[i], color=colors[i], linestyle="dashed", marker="o")
        legends.append("train-%d"%(i+1))
        legends.append("valid-%d"%(i+1))
    plt.xlabel("Epoch")
    plt.ylabel("Normalized Gini")
    plt.legend(legends)
    plt.savefig(filename)
    plt.close()
_plot_fig(gini_results_epoch_train, gini_results_epoch_valid,data_dir+'DeepFm_ItemRecommend1.png')

Running this on my own machine was very slow; to get a GPU I tried Google Colab, which I warmly recommend: a convenient environment for running machine-learning jobs on a remote GPU.

When training repeatedly, the data has to be split into training and validation sets, and the sampling should be balanced, i.e. samples across the different feature values as well as the different target values should all be covered. scikit-learn's StratifiedKFold only stratifies on the target y, so I implemented my own splitter, StratifiedKFoldByXColumnsYBin:
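Before the refactored training script below, a quick toy check of what the splitter is meant to guarantee: every (gender, buy_cap) combination shows up in every validation fold. This assumes the class listed at the end of this post is importable and runs in the same (older scikit-learn) environment it was written for.

import numpy as np
import pandas as pd
from StratifiedKFoldByXColumnsYBin import StratifiedKFoldByXColumnsYBin

# toy data: 120 samples spread over 2 genders x 3 purchasing-power tiers
rng = np.random.RandomState(0)
X = pd.DataFrame({'gender': rng.randint(0, 2, 120),
                  'buy_cap': rng.randint(1, 4, 120),
                  'age': rng.randint(18, 60, 120)})
y = rng.rand(120)

splitter = StratifiedKFoldByXColumnsYBin(columns=['gender', 'buy_cap'], n_splits=3,
                                         shuffle=True, random_state=2017)
for fold, (train_idx, valid_idx) in enumerate(splitter.split(X, y)):
    # each validation fold should contain every (gender, buy_cap) combination
    print(fold, X.iloc[valid_idx].groupby(['gender', 'buy_cap']).size().to_dict())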

import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
from StratifiedKFoldByXColumnsYBin import StratifiedKFoldByXColumnsYBin
import sys
sys.path.append('../')
from DeepFM import DeepFM


def data2index_values(dfTrain, feat_dict, cols, numeric_cols):
    dfTrain_i = dfTrain.copy()
    dfTrain_v = dfTrain.copy()
    for col in cols:
        if col not in numeric_cols:
            dfTrain_i[col] = dfTrain_i[col].map(feat_dict[col])
            dfTrain_v[col] = 1.
        else:
            dfTrain_i[col] = feat_dict[col]

    dfTrain_i = dfTrain_i.values.tolist()
    dfTrain_v = dfTrain_v.values.tolist()
    return dfTrain_i, dfTrain_v


# map categorical features to one-hot index positions
def data2sparse_matrix(X_train, X_test, numeric_cols):
    feat_dim = 0
    feat_dict = {}
    cols = X_train.columns
    for col in cols:
        us = X_train[col].unique()
        if col in numeric_cols:
            feat_dict[col] = feat_dim  # a numeric field occupies a single one-hot slot
            feat_dim += 1
        else:
            feat_dict[col] = dict(zip(us, range(feat_dim, len(us) + feat_dim)))
            feat_dim += len(us)

    xTrain_i, xTrain_v = data2index_values(X_train, feat_dict, cols, numeric_cols)
    xTest_i, xTest_v = data2index_values(X_test, feat_dict, cols, numeric_cols)
    return xTrain_i, xTrain_v, xTest_i, xTest_v, feat_dim


def train_and_predict(xTrain_i, xTrain_v, xTest_i, xTest_v, feat_dim):
    EPOCHES = 30
    folds = list(StratifiedKFoldByXColumnsYBin(columns=['gender', 'age', 'buy_cap'], n_splits=3, shuffle=True,
                                               random_state=2017).split(X_train, Y_train))
    y_train_meta = np.zeros((len(xTrain_v), 1), dtype=float)  # out-of-fold predictions on the training data
    y_test_meta = np.zeros((len(xTest_v), 1), dtype=float)  # predictions on the test data
    loss_metrics_test = np.zeros(len(folds), dtype=float)  # per-fold validation error
    epoch_loss_metrics_train = np.zeros((len(folds), EPOCHES), dtype=float)  # training error per fold and epoch (folds x epochs)
    epoch_loss_metrics_valid = np.zeros((len(folds), EPOCHES), dtype=float)

    for i, (train_idx, valid_idx) in enumerate(folds):
        dfm_params = {
            'use_fm': True,
            'use_deep': True,
            'embedding_size': 24,
            'dropout_fm': [1.0, 1.0],
            'deep_layers': [48, 16],
            'dropout_deep': [0.5, 0.5, 0.5],
            'deep_layers_activation': tf.nn.relu,
            'epoch': 30,
            'batch_size': 1024,
            'learning_rate': 0.001,
            'optimizer_type': 'adam',
            'batch_norm': 1,
            'batch_norm_decay': 0.995,
            'l2_reg': 0.01,
            'verbose': True,
            'eval_metric': lambda actual, predict: ((np.array(actual) - np.array(predict)) ** 2).sum() / len(actual),
        # evaluation metric: mean squared error
            'random_seed': 2017,
            'greater_is_better': False,
            'loss_type': 'mse',
            "feature_size": feat_dim,  # 特徵數
            "field_size": len(xTrain_v[0])  # 字段數
        }
        dfm = DeepFM(**dfm_params)

        # slice feature_index, feature_value and label for this fold
        _get = lambda x, l: [x[i] for i in l]
        Xi_train_, Xv_train_, y_train_ = _get(xTrain_i, train_idx), _get(xTrain_v, train_idx), _get(Y_train,
                                                                                                      train_idx)
        Xi_valid_, Xv_valid_, y_valid_ = _get(xTrain_i, valid_idx), _get(xTrain_v, valid_idx), _get(Y_train,
                                                                                                      valid_idx)
        dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)

        y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
        y_test_meta[:, 0] += dfm.predict(xTest_i, xTest_v)

        loss_metrics_test[i] = dfm_params['eval_metric'](y_valid_, y_train_meta[valid_idx])
        epoch_loss_metrics_train[i] = dfm.train_result  # a vector of length `epoch`
        epoch_loss_metrics_valid[i] = dfm.valid_result
    # average the test predictions over the folds
    y_test_meta /= float(len(folds))
    return loss_metrics_test, y_test_meta, epoch_loss_metrics_train, epoch_loss_metrics_valid


def _plot_fig(train_results, valid_results, filename):
    colors = ["red", "blue", "green"]
    xs = np.arange(1, train_results.shape[1]+1)
    plt.figure()
    legends = []
    for i in range(train_results.shape[0]):
        plt.plot(xs, train_results[i], color=colors[i], linestyle="solid", marker="o")
        plt.plot(xs, valid_results[i], color=colors[i], linestyle="dashed", marker="o")
        legends.append("train-%d" % (i+1))
        legends.append("valid-%d" % (i+1))
    plt.xlabel("Epoch")
    plt.ylabel("Normalized Gini")
    plt.legend(legends)
    plt.savefig(filename)
    plt.close()

# load the data
data_dir = '/Users/zhangyugu/Downloads/'
dfTrain = pd.read_csv(data_dir + 'user_behaviors_test_full.csv')
X_train = dfTrain[['user_id', 'item_id', 'brand_id', 'shop_id', 'category_id', 'gender', 'age', 'buy_cap']]
X_test = X_train[:100]
Y_train = dfTrain['behavior_value_sum'].values
xTrain_i,xTrain_v,xTest_i,xTest_v,feat_dim = data2sparse_matrix(X_train, X_test,['buy_cap', 'age'])

loss_metrics_test, y_test_meta, epoch_loss_metrics_train, epoch_loss_metrics_valid = train_and_predict(xTrain_i,xTrain_v,xTest_i,xTest_v,feat_dim)

pd.DataFrame({"user_id": X_train['user_id'],"item_id":X_train['item_id'], "target": y_test_meta.flatten()})\
    .to_csv(data_dir + "DeepFm_Mean%.5f_Std%.5f.csv" % (loss_metrics_test.mean(), loss_metrics_test.std())
    , index=False, float_format="%.5f")

# error
print("DeepFm: %.5f (%.5f)" % (loss_metrics_test.mean(),loss_metrics_test.std()))
_plot_fig(epoch_loss_metrics_train, epoch_loss_metrics_valid,data_dir+'DeepFm_ItemRecommend1.png')

One problem remains. To predict the items a user is most likely to buy or browse tomorrow, the candidate item pool is huge, so how do we filter it by feature values? Going back to basics: a user buys an item because of a set of interest factors, and the item has scores on those same factors; in the model this means that, over the same set of features, the user and the item each have their own value vector. So how can we quickly filter out the items we need directly on the item feature vectors? Alibaba's TDM (tree-based deep match) is a good solution. But in a concrete implementation, within the item ranking structure associated with each user, how is each item's score computed? Is it possible to build that score and the tree structure during model training itself?
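For contrast, the naive baseline that TDM tries to avoid is a full scan: score every candidate item for the user and keep the top 50. A minimal sketch with NumPy, assuming the ALS user/item factors computed earlier (userFeaturesDf / itemFeaturesDf) have been collected into local arrays aligned with an id array:

import numpy as np

def top50_by_dot(user_vec, item_factors, item_ids, k=50):
    # brute force: score all candidate items, then keep the k best
    scores = item_factors @ user_vec              # one inner product per item
    top = np.argpartition(-scores, k)[:k]         # unordered top-k indices
    top = top[np.argsort(-scores[top])]           # order the k winners by score
    return item_ids[top], scores[top]

# item_ids and item_factors (shape (n_items, 8)) would come from itemFeaturesDf collected locally, e.g.
# recommended_ids, _ = top50_by_dot(user_vector, item_factors, item_ids)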

TDM first builds the tree hierarchy from the items' own category structure, with the items as the bottom-level leaves; each item is then grouped by hierarchical clustering on its feature value vector, and every group gets a score (used for ranking), which is in turn fed back into training as a feature to be optimised. That score is realised with an attention-like mechanism; the concrete implementation is something I still need to study.
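As a rough illustration of just the tree-initialisation step described here (recursive 2-means over the item feature vectors, with items at the leaves), not the full TDM training or the attention-style node scoring:

import numpy as np
from sklearn.cluster import KMeans

def build_item_tree(item_ids, item_vectors, depth=0):
    """Recursively bisect items by 2-means on their feature vectors; leaves hold item ids."""
    if len(item_ids) <= 1:
        return {'items': list(item_ids), 'depth': depth}
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(item_vectors)
    left, right = labels == 0, labels == 1
    if left.sum() == 0 or right.sum() == 0:
        # all vectors identical, cannot split further
        return {'items': list(item_ids), 'depth': depth}
    return {'depth': depth,
            'children': [build_item_tree(item_ids[left], item_vectors[left], depth + 1),
                         build_item_tree(item_ids[right], item_vectors[right], depth + 1)]}

# item_ids: np.array of ids, item_vectors: np.array of shape (n_items, 8) from the ALS factors
# tree = build_item_tree(item_ids, item_vectors)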

Finally, the implementation of StratifiedKFoldByXColumnsYBin:

import warnings
import numpy as np
import pandas as pd
from sklearn.model_selection._split import _BaseKFold, NSPLIT_WARNING, KFold


class StratifiedKFoldByXColumnsYBin(_BaseKFold):
    def __init__(self, columns, y_bins=-1, n_splits ='warn', shuffle=False, random_state=None):
        if n_splits == 'warn':
            warnings.warn(NSPLIT_WARNING,FutureWarning)
            n_splits=3
        self.columns=columns
        self.y_bins=y_bins
        super(StratifiedKFoldByXColumnsYBin, self).__init__(n_splits,shuffle,random_state)

    def _make_test_folds(self, X, y, groups):
        rng = self.random_state
        n_samples = X.shape[0]
        X_for_stratify=X[self.columns].values
        if self.y_bins>0:
            y_bin_array=pd.cut(y,bins=self.y_bins).codes.reshape(-1,1)
            X_for_stratify=np.concatenate((X_for_stratify,y_bin_array),axis=1)
        unique_X,X_inversed = np.unique(X_for_stratify,return_inverse=True,axis=0)
        X_counts = np.bincount(X_inversed)
        # min_groups=np.min(X_counts)
        if np.all(self.n_splits > X_counts):
            raise ValueError("n_splits=%d cannot be greater than the number of members in each class." % self.n_splits)

        # pre-assign each sample to a test fold index using individual KFold
        # splitting strategies for each class
        # NOTE: Passing the data corresponding to ith class say X[y==class_i]
        # will break when the data is not 100% stratifiable for all classies.
        # so we pass np.zeros(max(c,n_splits)) as data to the KFold
        per_cls_cvs = [KFold(self.n_splits,shuffle=self.shuffle,random_state=rng).split(np.zeros(max(count,self.n_splits)))
            for count in X_counts]

        test_folds = np.zeros(n_samples,dtype=int)
        for test_fold_indices,per_cls_splits in enumerate(zip(*per_cls_cvs)):
            for cls, (_, test_split) in zip(unique_X,per_cls_splits):
                cls_indexs=(X_for_stratify == cls).all(axis=1)
                cls_test_folds = test_folds[cls_indexs] # rows belonging to the current class cls; there may be fewer of them than the padded KFold size
                test_split = test_split[test_split<len(cls_test_folds)] # trim split indices that fall outside the group
                # cls_test_folds holds the fold assignment for this class: fill it via the split indices,
                # then write it back into test_folds at the matching positions
                cls_test_folds[test_split]=test_fold_indices # assign the fold index
                test_folds[cls_indexs] = cls_test_folds

        return test_folds

    def _iter_test_masks(self, X, y=None, groups=None):
        test_folds = self._make_test_folds(X,y,groups)
        for i in range(self.n_splits):
            yield test_folds ==i

    def split(self, X, y=None, groups=None):
        return super(StratifiedKFoldByXColumnsYBin,self).split(X,y,groups)

 
