Wide&Deep: Principles and Practice

Original article

2020-07-01 04:20

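Background

Depending on the data they use, recommendation algorithms can be divided into behavior-based recommendation, content-based recommendation, and so on. Mainstream recommender algorithms fall into three families: collaborative filtering recommendation, content-based recommendation, and hybrid recommendation. In practice, a hybrid system usually combines several strategies, such as UserCF, ItemCF, popularity-based, recency-based, reading-history-based, and user-interest-based recommendation.
Common ranking approaches include GBRT+LR (also known as GBDT+LR), Wide&Deep, DeepFM, and the YouTube recommendation model (listed roughly in order of development). The YouTube approach currently draws the most attention, but few people have managed to apply it in practice with good results, whereas Wide&Deep and DeepFM are reported to be more mature in production. This article therefore focuses on applying Wide&Deep.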

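Principles

The Wide&Deep recommendation algorithm comes from the paper "Wide & Deep Learning for Recommender Systems",
which proposes the W&D model to balance the memorization ability of a wide model against the generalization ability of a deep model. In essence it is LR + DNN.
Memorization applies cross-product transformations to the raw features as a non-linear transformation; the input is a high-dimensional sparse vector. A large number of cross-product features lets the model "memorize" feature interactions, which is efficient and interpretable, but generalizing this way requires more feature engineering.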
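
The cross-product transformation described above can be sketched in plain Python. This is only an illustration, not code from the paper or from the TensorFlow repository; the helpers `one_hot` and `cross_product` and the feature values are hypothetical.

```python
def one_hot(feature, value):
    """Name of the sparse binary feature that is active for this value."""
    return f"{feature}={value}"


def cross_product(active, parts):
    """Return 1 only if every component feature is active, else 0.

    The crossed feature behaves like a logical AND: it fires only when
    all of its component features co-occur in the same sample.
    """
    return int(all(p in active for p in parts))


# Sparse representation of one sample: the set of active binary features.
sample = {
    one_hot("education", "Masters"),
    one_hot("occupation", "Tech-support"),
}

hit = cross_product(sample, [one_hot("education", "Masters"),
                             one_hot("occupation", "Tech-support")])
miss = cross_product(sample, [one_hot("education", "Bachelors"),
                              one_hot("occupation", "Tech-support")])
print(hit, miss)
```

Each crossed feature gets its own weight in the linear model, which is why the wide part can memorize specific combinations but cannot say anything about combinations it never saw.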

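Generalization needs only a small amount of feature engineering. A deep neural network uses embeddings to turn sparse features into low-dimensional dense inputs, so it generalizes better to feature combinations that never appeared in the training samples. However, when the user-item interaction matrix is sparse and high-rank, the network tends to over-generalize and recommend items with low relevance.
Reference: https://blog.csdn.net/zhangbaoanhadoop/article/details/81608947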
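
To make the "LR + DNN" idea concrete, here is a minimal sketch of a joint forward pass in plain Python, assuming one embedded categorical feature and a single hidden layer. The function `wide_deep_predict`, its parameters, and all weights are hypothetical toy values chosen for illustration, not the configuration of any real model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(v, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def wide_deep_predict(wide_x, wide_w, item_id, embeddings,
                      hidden_w, hidden_b, out_w, bias):
    # Wide part: linear model over sparse (possibly crossed) binary features.
    wide_logit = sum(w * x for w, x in zip(wide_w, wide_x))
    # Deep part: embedding lookup followed by one ReLU hidden layer.
    hidden = [max(0.0, h)
              for h in dense(embeddings[item_id], hidden_w, hidden_b)]
    deep_logit = sum(w * h for w, h in zip(out_w, hidden))
    # Joint model: both logits are summed before a single sigmoid.
    return sigmoid(wide_logit + deep_logit + bias)


p = wide_deep_predict(
    wide_x=[1, 0], wide_w=[0.5, -0.3],
    item_id="item_a", embeddings={"item_a": [1.0, -1.0]},
    hidden_w=[[1.0, 0.0], [0.0, 1.0]], hidden_b=[0.0, 0.0],
    out_w=[-0.5, 2.0], bias=0.0,
)
print(p)
```

Because the two logits are added before the sigmoid, a single loss drives both parts at once; the wide weights only need to correct the cases the deep part gets wrong.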

Practice

Reference: https://github.com/tensorflow/models/tree/master/official/r1/wide_deep
I have to say, this Google project is really nice to work with.

1. Dataset preparation
Census Income Data Set

python census_dataset.py

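The data is downloaded to /tmp/census_data; use --data_dir to set a different path.

Feature processing:
Categorical features are handled in one of two ways:
All distinct values are known and there are only a few of them: tf.feature_column.categorical_column_with_vocabulary_list
The distinct values are not all known, or there are very many of them: tf.feature_column.categorical_column_with_hash_bucket
Raw continuous features: tf.feature_column.numeric_column
Continuous features discretized into buckets by boundary values: tf.feature_column.bucketized_column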

import tensorflow as tf

# Continuous features
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

# Categorical features with a known, small vocabulary
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
        'Other-relative'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])

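# Categorical feature mapped to hash buckets.
# _HASH_BUCKET_SIZE is not defined in this excerpt; 1000 is an assumed value.
_HASH_BUCKET_SIZE = 1000
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=_HASH_BUCKET_SIZE)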

# Feature transformations: discretize age into buckets
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
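
What the bucketized and hash-bucket columns do to a raw value can be sketched without TensorFlow. The helpers `bucketize` and `hash_bucket` below are hypothetical, and MD5 is only a stand-in: TensorFlow uses its own fingerprint function, so the actual bucket ids will differ.

```python
import bisect
import hashlib

AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucketize(value, boundaries):
    """Index of the bucket that contains value.

    Buckets include their left boundary and exclude the right one, so
    n boundaries produce n + 1 buckets.
    """
    return bisect.bisect_right(boundaries, value)


def hash_bucket(value, num_buckets):
    """Map a string to a bucket id in [0, num_buckets).

    Assumption: MD5 stands in for TensorFlow's internal hash function.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


print(bucketize(17, AGE_BOUNDARIES))   # below the first boundary
print(bucketize(18, AGE_BOUNDARIES))   # falls in [18, 25)
print(bucketize(46, AGE_BOUNDARIES))   # falls in [45, 50)
print(hash_bucket("Tech-support", 1000))
```

Hashing needs no vocabulary up front, at the cost of occasional collisions where two different values share a bucket; a larger bucket count makes collisions less likely.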

2. Training

python census_main.py

The model is saved to /tmp/census_model; use --model_dir to set a different path.

3. Visualization

tensorboard --logdir=/tmp/census_model

4. Prediction

python census_main.py --export_dir /tmp/wide_deep_saved_model

With --export_dir set, the training run exports the model in TensorFlow SavedModel format.

Running prediction on Linux:

saved_model_cli run --dir /tmp/wide_deep_saved_model/${TIMESTAMP}/ \
--tag_set serve --signature_def="predict" \
--input_examples='examples=[{"age":[46.], "education_num":[10.], "capital_gain":[7688.], "capital_loss":[0.], "hours_per_week":[38.]}, {"age":[24.], "education_num":[13.], "capital_gain":[0.], "capital_loss":[0.], "hours_per_week":[50.]}]'

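Running prediction on Windows:
Because Windows treats single quotes as part of the input, swap the quotes: use double quotes where the Linux command uses single quotes, and single quotes where it uses double quotes.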

saved_model_cli run --dir "/My Directory/" --tag_set serve --signature_def="predict" --input_examples="examples=[{'age':[46.], 'education_num':[10.], 'capital_gain':[7688.], 'capital_loss':[0.], 'hours_per_week':[38.]}, {'age':[24.], 'education_num':[13.], 'capital_gain':[0.], 'capital_loss':[0.], 'hours_per_week':[50.]}]"
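
The quoting difference between the two commands can be generated rather than typed by hand. The sketch below assumes the CLI parses the value as a Python literal, so extra whitespace is harmless; the helpers `linux_arg` and `windows_arg` are hypothetical, and the Windows variant assumes no string value contains an apostrophe.

```python
import json
import shlex

# Hypothetical rows using two of the numeric fields from the commands above.
examples = [
    {"age": [46.0], "education_num": [10.0]},
    {"age": [24.0], "education_num": [13.0]},
]


def linux_arg(rows):
    """Double quotes inside, single quotes outside (POSIX shells)."""
    return shlex.quote("examples=" + json.dumps(rows))


def windows_arg(rows):
    """Single quotes inside, double quotes outside (Windows command prompt)."""
    return '"examples=' + repr(rows) + '"'


print("--input_examples=" + linux_arg(examples))
print("--input_examples=" + windows_arg(examples))
```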
