日萌社
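Artificial Intelligence (AI): Keras, PyTorch, MXNet, TensorFlow, PaddlePaddle deep learning in practice (updated from time to time)
3.2 Pan-Entertainment Feature Engineering and Model Code Construction
Learning objectives
- Objectives
- Describe the feature engineering process of the pan-entertainment recommender system
- Application
- Build the Wide&Deep model for the pan-entertainment recommender system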
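3.2.1 Feature engineering
3.2.1.1 Defining positive and negative samples
- Positive and negative samples are defined from the model's ultimate prediction goal: getting users to produce more interactions.
- Positive sample: if user A interacts with post B, the vector formed by concatenating all of A's features with all of B's features is the sample's feature vector, and the label is 1.
- Negative sample: if user A performs a negative action on post B (report / not interested), or does nothing at all with a post B that was recommended to them, the concatenated features of A and B form the sample's feature vector, and the label is 0.
- Because the model is ultimately evaluated by accuracy (ACC), the ratio of positive to negative samples should be kept at roughly 1:1. In practice users rarely report a post or mark it as not interested, so to obtain enough negative samples, every recommendation that was shown to a user but produced no interaction is also treated as a negative sample.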
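The labeling rule above can be sketched in a few lines. This is only an illustration: the action names are borrowed from the relationship types used in the Cypher queries of section 3.2.2, the tuple layout of a sample is hypothetical, and balancing by downsampling the larger class is an assumption rather than the method the course prescribes.

```python
import random

POSITIVE_ACTIONS = {"share", "comment", "like"}
NEGATIVE_ACTIONS = {"report", "unlike"}

def label_for(action):
    """Return 1 for an interaction, 0 for a negative action or for no action (None)."""
    if action in POSITIVE_ACTIONS:
        return 1
    if action in NEGATIVE_ACTIONS or action is None:
        return 0
    raise ValueError(f"unknown action: {action!r}")

def balance(samples, seed=0):
    """Downsample the larger class so positives and negatives are 1:1.

    Each sample is a tuple whose last element is the 0/1 label.
    """
    positives = [s for s in samples if s[-1] == 1]
    negatives = [s for s in samples if s[-1] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    return rng.sample(positives, n) + rng.sample(negatives, n)

# Hypothetical exposure log: (user, post, action); None means "shown, no interaction".
events = [("u1", "p1", "like"), ("u1", "p2", None), ("u2", "p1", "report"),
          ("u2", "p3", None), ("u3", "p4", "share")]
samples = [(user, post, label_for(action)) for user, post, action in events]
balanced = balance(samples)
```

With accuracy as the metric, a heavily skewed class ratio would let a model score well by always predicting the majority class, which is why the ratio matters here.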
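3.2.1.2 Obtaining the initial training set
- Take the 2 million most recently generated samples. The total is chosen from the model's parameter count, balancing training time against model quality. Some experiments suggest that, when the dataset is noisy, a sample count of about 10 times the number of model parameters yields a model with both good fitting ability and good generalization.
- Format of the training data: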
android,23,4,32,43,32,2,54,1502378738,2,4,6,33,421,22,43,12,0,0,0
ios,47,4,16,28,32,43,56,1502408488,22,33,12,2,1,23,4,47,0,0,1
android,22,4,16,7,0,0,7,1502362845,2,34,62,2,32,221,32,4,0,0,0
android,23,21,122,223,33,23,42,1502367552,77,11,2,3,87,1,20,6,2,0,0
android,76,432,876,23,2,23,56,1502430914,2,32,1,23,54,66,33,212,199,0,1
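- The columns are, in order: device OS type; the user's forward count, comment count, like count, and number of published posts; the post's liked count, forwarded count, and commented count; the post's publish timestamp; the IDs of the first through fifth stars the user follows; the IDs of the first through fifth stars the post involves; and the target label.
- The parameter count of the wide-deep model is computed in two parts:
- Wide side: the dimensionality of the sparse features after expansion. The pan-entertainment features come to about 50,000 dimensions, including raw sparse features, hash-bucketed continuous features, and crossed features.
- Deep side: about 1,000 feature dimensions are fed in, for roughly 110,000 parameters in total.
- The model therefore has about 160,000 parameters in total, so the sample count should preferably stay above 1.6 million. Since a test set is also needed, each run uses the latest 2 million records.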
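The arithmetic behind these figures can be checked with a short calculation. It is a rough sketch under stated assumptions: the deep side is taken to be a fully connected network with the hidden sizes [100, 70, 50, 25] mentioned in section 3.2.3.3 and a single output unit, fed by exactly 1,000 inputs, and the wide side is taken to be exactly 50,000 dimensions with one weight each.

```python
def dense_params(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed sizes, rounded from the estimates in the text.
wide_params = 50_000 + 1                                 # one weight per sparse dimension, plus a bias
deep_params = dense_params([1_000, 100, 70, 50, 25, 1])  # 1000 inputs -> 4 hidden layers -> 1 logit
total_params = wide_params + deep_params
min_samples = 10 * total_params                          # the "10x parameters" rule of thumb

print(wide_params, deep_params, total_params, min_samples)
```

Under these assumptions the deep side has 112,021 parameters and the whole model about 162,000, which is consistent with the rounded 110,000 and 160,000 quoted above and with a training set of at least 1.6 million samples.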
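3.2.2 Obtaining training samples
- Goal: read the data in the database and build samples from it
- The 20 raw columns (19 features + 1 label): device OS type; the user's forward count, comment count, like count, and number of published posts; the post's liked count, forwarded count, and commented count; the post's publish timestamp; the IDs of the first through fifth stars the user follows; the IDs of the first through fifth stars the post involves; and the target label.
The training samples are read from neo4j, and they are built in two parts. Positive samples are the dual profiles (user profile plus post profile) of pairs where the user interacted with the post. Negative samples are the dual profiles of pairs where the post was recommended and shown but received no interaction. The features are fetched with Cypher statements.
1. Cypher statements that read and combine user and post features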
# neo4j.v1 is the import path of the 1.x Python driver;
# newer driver versions use `from neo4j import GraphDatabase` instead.
from neo4j.v1 import GraphDatabase
import numpy as np
import pandas as pd

NEO4J_CONFIG = {
    "uri": "bolt://192.168.19.137:7687",
    "auth": ("neo4j", "itcast"),
    "encrypted": False
}
_driver = GraphDatabase.driver(**NEO4J_CONFIG)


# Select users and posts linked by an interaction (share/comment/like),
# merge their features, and label the pair 1
def get_positive_sample():
    cypher = "match(a:SuperfansUser)-[r:share|comment|like]-(b:SuperfansPost) return [a.like_posts_num, a.forward_posts_num, a.comment_posts_num,a.publish_posts_num,a.follow_stars_list,b.hot_score,b.commented_num,b.forwarded_num,b.liked_num,b.related_stars_list,b.publish_time] limit 200"
    train_data_with_labels = get_train_data(cypher, '1')
    return train_data_with_labels


# Select users and posts linked by a negative relationship (report/unlike),
# merge their features, and label the pair 0
def get_negative_sample():
    cypher = "match(a:SuperfansUser)-[r:report|unlike]-(b:SuperfansPost) return [a.like_posts_num, a.forward_posts_num, a.comment_posts_num,a.publish_posts_num,a.follow_stars_list,b.hot_score,b.commented_num,b.forwarded_num,b.liked_num,b.related_stars_list,b.publish_time] limit 200"
    train_data_with_labels = get_train_data(cypher, '0')
    return train_data_with_labels
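2. After fetching the results, format and assemble the dataset: get_train_data(cypher, '1')
# Example of one record returned by the query:
# [92, 47, 4, 1618, ['218960', '187579', '210958', '219148', '3116'], 549, 5, 2, 533, ['1'], 1516180431]
def _extended_length(b, index, length=5):
    """
    Flatten the list-valued features of a sample into fixed-width columns.
    :param b: the sample (a list of features, some of which are lists)
    :param index: positions in b that hold list-valued features
    :param length: number of columns each list is padded or truncated to
    :return: the sample with every list removed and its values appended at the end
    """
    b = list(b)
    for i in index:
        values = list(b[i] or [])[:length]
        # Always append exactly `length` values so every row has the same width
        # (the original only appended when the list was shorter than 5).
        b.extend(values + [0] * (length - len(values)))
    # Remove the original lists, highest position first so the
    # remaining positions do not shift.
    for i in sorted(index, reverse=True):
        b.pop(i)
    return b


def get_train_data(cypher, label):
    # Build labeled data from the neo4j relationships
    # Positive samples: dual profiles of pairs where the user interacted
    # Negative samples: dual profiles of pairs recommended and shown without interaction
    with _driver.session() as session:
        record = session.run(cypher)
        sample = list(map(lambda x: x[0], record))
    # Positions of follow_stars_list and related_stars_list in each record
    index_list = [4, 9]
    # Feature processing step 1: flatten the list features, then append the label
    train_data = list(map(lambda x: _extended_length(x, index_list) + [str(label)], sample))
    return train_data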
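To make the flattening step concrete, here is a standalone sketch applied to the example record shown above. The helper name `flatten_record` is hypothetical and it does not touch neo4j; it only shows how the two list features at positions 4 and 9 become ten trailing columns, so that every row ends up with 19 features plus the label.

```python
def flatten_record(record, list_positions, width=5, pad=0):
    """Move list-valued features to the end, padded or truncated to `width` columns."""
    flat = [v for i, v in enumerate(record) if i not in list_positions]
    for i in list_positions:
        values = list(record[i])[:width]
        flat.extend(values + [pad] * (width - len(values)))
    return flat

record = [92, 47, 4, 1618, ['218960', '187579', '210958', '219148', '3116'],
          549, 5, 2, 533, ['1'], 1516180431]
row = flatten_record(record, [4, 9]) + ['1']   # append the label for a positive sample
print(len(row))
print(row)
```

The post in this record involves only one star, so its last four star columns are filled with the padding value 0.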
Finally, save the result to train_data.csv in the current local directory
if __name__ == "__main__":
    p_train_data = get_positive_sample()
    n_train_data = get_negative_sample()
    # Check how many positive and negative samples were fetched
    print(len(p_train_data))
    print(len(n_train_data))
    train_data = p_train_data + n_train_data
    pd.DataFrame(train_data).to_csv("./train_data.csv", header=False, index=False)
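Section 3.2.1.2 notes that part of the data has to be held out for testing. A minimal way to do that with the standard library is sketched below; the 20% evaluation fraction, the fixed seed, and the output file names in the usage comment are assumptions, not values given by the course.

```python
import csv
import random

def split_rows(rows, eval_fraction=0.2, seed=42):
    """Shuffle the rows and split them into (train, eval) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = int(len(rows) * eval_fraction)
    return rows[n_eval:], rows[:n_eval]

def read_csv(path):
    """Read a header-less CSV file into a list of rows (lists of strings)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))

def write_csv(path, rows):
    """Write rows to a header-less CSV file."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Possible usage once train_data.csv exists (file names are hypothetical):
# train_rows, eval_rows = split_rows(read_csv("./train_data.csv"))
# write_csv("./train.csv", train_rows)
# write_csv("./eval.csv", eval_rows)
```

Shuffling before the split matters here because the exported file holds all positive samples first and all negative samples after them.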
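3.2.3 Model construction
- Analysis:
- model.py contains all of the feature engineering that takes the source data files to the input format the model requires;
- its two key functions, build_estimator and input_fn, are called by the key functions in task.py;
- Requirement: build_estimator must return a TensorFlow classifier; see the source of tf.estimator.DNNLinearCombinedClassifier() for details
- Goal: build the input pipeline and the feature processing for the pan-entertainment WDL model, ready for the subsequent training
- Steps:
- 1. Build the model input function
- 2. Process features with tf.feature_column
- 3. Build the DNNLinearCombinedClassifier model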
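3.2.3.1 Model input function
- Use tf.data.TextLineDataset(filenames) to parse the CSV file that holds our training data
1. Call the CSV-reading API, which returns a dataset
import tensorflow as tf  # the code in this section uses the TensorFlow 1.x API

def input_fn(filenames,
             num_epochs=None,
             shuffle=True,
             skip_header_lines=0,
             batch_size=200):
    # Read the files line by line, skip any header lines, and parse each line
    dataset = tf.data.TextLineDataset(filenames).skip(skip_header_lines).map(
        _decode_csv)
    # (input_fn continues in step 3 below)
2. Implement _decode_csv, which parses the features and the target value from each line
All CSV columns, and the default values used when parsing them.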
CSV_COLUMNS = [
    'like_posts_num', 'forward_posts_num', 'comment_posts_num', 'publish_posts_num', 'hot_score',
    'commented_num', 'forwarded_num', 'liked_num', 'publish_time', 'follow_star_1',
    'follow_star_2', 'follow_star_3', 'follow_star_4', 'follow_star_5', 'related_star_1',
    'related_star_2', 'related_star_3', 'related_star_4', 'related_star_5', 'islike'
]
# One default per column; the default's type fixes the parsed type.
# The first column is like_posts_num, an integer, so its default is [0]
# (the original used [''], which would parse it as a string).
# The label column 'islike' is parsed as a string.
CSV_COLUMN_DEFAULTS = [[0], [0], [0], [0], [0],
                       [0], [0], [0], [0], [0],
                       [0], [0], [0], [0], [0],
                       [0], [0], [0], [0], ['']]
LABEL_COLUMN = 'islike'
LABELS = ['1', '0']
Parse with tf.decode_csv
def _decode_csv(line):
    # Add a trailing dimension: ['123', '321'] ---> [['123'], ['321']]
    row_columns = tf.expand_dims(line, -1)
    # Parse each field into the type given by its default value
    columns = tf.decode_csv(row_columns, record_defaults=CSV_COLUMN_DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    # Remove unused columns
    for col in UNUSED_COLUMNS:
        features.pop(col)
    return features
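The function filters the features: UNUSED_COLUMNS names the columns to drop, while both the input features and the target label are kept.
UNUSED_COLUMNS = set(CSV_COLUMNS) - {col.name for col in INPUT_COLUMNS} - {LABEL_COLUMN}
With the column lists shown here, UNUSED_COLUMNS contains only 'hot_score', the one CSV column that is neither in INPUT_COLUMNS nor the label. The label 'islike' is deliberately kept, because input_fn pops it from the features later ('device_system' does not appear in CSV_COLUMNS, so it is not part of the set).
# Vocabulary of star IDs, which fixes the size of the star-ID mapping
STAR_ID_LIST = list(range(0, 500))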
INPUT_COLUMNS = [
    tf.feature_column.numeric_column('like_posts_num'),
    tf.feature_column.numeric_column('forward_posts_num'),
    tf.feature_column.numeric_column('comment_posts_num'),
    tf.feature_column.numeric_column('publish_posts_num'),
    tf.feature_column.numeric_column('commented_num'),
    tf.feature_column.numeric_column('forwarded_num'),
    tf.feature_column.numeric_column('liked_num'),
    tf.feature_column.numeric_column('publish_time'),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'follow_star_1', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'follow_star_2', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'follow_star_3', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'follow_star_4', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'follow_star_5', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'related_star_1', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'related_star_2', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'related_star_3', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'related_star_4', STAR_ID_LIST),
    tf.feature_column.categorical_column_with_vocabulary_list(
        'related_star_5', STAR_ID_LIST)
]
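As a check on which columns survive the filter, the set arithmetic and the per-line parsing can be mirrored in plain Python without TensorFlow. This is a sketch only: `INPUT_COLUMN_NAMES` stands in for the names of the feature columns in INPUT_COLUMNS, `decode_line` is a hypothetical simplified counterpart of _decode_csv, and the sample line is made up to match the column layout.

```python
NUMERIC = ['like_posts_num', 'forward_posts_num', 'comment_posts_num', 'publish_posts_num',
           'commented_num', 'forwarded_num', 'liked_num', 'publish_time']
STARS = [f'{prefix}_star_{i}' for prefix in ('follow', 'related') for i in range(1, 6)]

# Same order as CSV_COLUMNS in the text: hot_score sits after publish_posts_num.
CSV_COLUMNS = NUMERIC[:4] + ['hot_score'] + NUMERIC[4:] + STARS + ['islike']
INPUT_COLUMN_NAMES = NUMERIC + STARS          # names of the 18 columns in INPUT_COLUMNS
LABEL_COLUMN = 'islike'

UNUSED_COLUMNS = set(CSV_COLUMNS) - set(INPUT_COLUMN_NAMES) - {LABEL_COLUMN}

def decode_line(line):
    """Split one CSV line into a {column: value} dict and drop the unused columns."""
    values = line.split(',')
    # Every column except the label is an integer; the label stays a string.
    typed = [int(v) for v in values[:-1]] + [values[-1]]
    features = dict(zip(CSV_COLUMNS, typed))
    for col in UNUSED_COLUMNS:
        features.pop(col)
    return features

line = "92,47,4,1618,549,5,2,533,1516180431,218960,187579,210958,219148,3116,1,0,0,0,0,1"
features = decode_line(line)
print(UNUSED_COLUMNS)
print(len(features))
```

The parsed dict holds the 18 input features plus the label, which the input function separates from the features afterwards.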
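3. Set the number of epochs and the batch size on the dataset, shuffle it, and split off the target value, turning the label strings into 0/1 targets
Use the dataset's batch and repeat methods (this is the rest of the input_fn body):
    if shuffle:
        dataset = dataset.shuffle(buffer_size=batch_size * 10)
    # make_one_shot_iterator is a TensorFlow 1.x API
    iterator = dataset.repeat(num_epochs).batch(
        batch_size).make_one_shot_iterator()
    features = iterator.get_next()
    # Separate the label from the features and map it to an integer class
    return features, parse_label_column(features.pop(LABEL_COLUMN))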
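The order of operations in this pipeline (shuffle, then repeat, then batch) can be imitated with plain Python generators. This is an illustrative sketch, not TensorFlow's implementation: `shuffled` mimics a buffer-based shuffle, `batches` is a hypothetical helper, and `num_epochs` must be an integer here, whereas `None` in the real pipeline repeats indefinitely.

```python
import itertools
import random

def shuffled(items, buffer_size, rng):
    """Buffer-based shuffle: emit a random buffered item each time a new one arrives."""
    buf = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            i = rng.randrange(buffer_size)
            yield buf[i]
            buf[i] = item
    rng.shuffle(buf)          # drain what is left in random order
    yield from buf

def batches(rows, num_epochs, batch_size, seed=0):
    """Shuffle within each epoch, repeat for num_epochs, then cut into batches."""
    rng = random.Random(seed)

    def stream():
        for _ in range(num_epochs):
            yield from shuffled(rows, batch_size * 10, rng)

    it = stream()
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

out = list(batches(list(range(10)), num_epochs=2, batch_size=4))
```

Because batching comes after repeating, a batch may contain rows from the end of one epoch and the start of the next, as the third batch does in this small example.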
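Finally, convert the target value. The lookup table maps each label string to its position in LABELS, so with LABELS = ['1', '0'] the string '1' becomes class 0 and '0' becomes class 1; reorder the list to ['0', '1'] if the class IDs should equal the label values.
def parse_label_column(label_string_tensor):
    # tf.contrib is only available in TensorFlow 1.x
    table = tf.contrib.lookup.index_table_from_tensor(tf.constant(LABELS))
    return table.lookup(label_string_tensor)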
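What the lookup table does to the label strings can be reproduced with a dictionary. This is a plain-Python stand-in with a hypothetical helper name; the value -1 for unknown strings matches the default of index_table_from_tensor when no out-of-vocabulary buckets are configured.

```python
LABELS = ['1', '0']

# Each label string maps to its position in LABELS.
LABEL_TO_INDEX = {label: index for index, label in enumerate(LABELS)}

def parse_labels(label_strings):
    """Map label strings to class indices; unknown strings become -1."""
    return [LABEL_TO_INDEX.get(s, -1) for s in label_strings]

print(parse_labels(['1', '0', '1']))
```

With this ordering of LABELS, the positive label '1' is class 0, so predicted probabilities for class 1 refer to the negative label.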
3.2.3.2 Feature processing with tf.feature_column
# Unpack in the same order as INPUT_COLUMNS, which holds 18 columns.
# hot_score is not one of them, so it must not appear in this tuple.
(like_posts_num, forward_posts_num, comment_posts_num, publish_posts_num,
 commented_num, forwarded_num, liked_num, publish_time, follow_star_1,
 follow_star_2, follow_star_3, follow_star_4, follow_star_5, related_star_1,
 related_star_2, related_star_3, related_star_4, related_star_5) = INPUT_COLUMNS
- Numeric features:
- Categorical features are converted to numeric values
- ['like_posts_num', 'forward_posts_num', 'comment_posts_num', 'publish_posts_num', 'hot_score','commented_num', 'forwarded_num', 'liked_num', 'publish_time'] (hot_score is read from the CSV but is not in INPUT_COLUMNS, so it is dropped before training)
- Categorical features:
- ['device_system','follow_star_1', 'follow_star_2', 'follow_star_3', 'follow_star_4', 'follow_star_5', 'related_star_1','related_star_2', 'related_star_3', 'related_star_4', 'related_star_5'] (device_system is part of the design, but neither the CSV exported in 3.2.2 nor INPUT_COLUMNS includes it)
Columns for the wide side
- device_system,follow_star_1,follow_star_2,follow_star_3,follow_star_4,follow_star_5,related_star_1,related_star_2,related_star_3,related_star_4,related_star_5
- Pairwise crossed features between [follow_star_1,follow_star_2] and [related_star_1,related_star_2, related_star_3, related_star_4, related_star_5]
- [follow_star_1, related_star_1, follow_star_2]
- [follow_star_2, related_star_1, related_star_2]
- [follow_star_3, related_star_1, related_star_2]
- [follow_star_1, related_star_2, related_star_1]
wide_columns = [
    # Pairwise crosses: follow_star_1 and follow_star_2 with each related star
    tf.feature_column.crossed_column([follow_star_1, related_star_1],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_1, related_star_2],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_1, related_star_3],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_1, related_star_4],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_1, related_star_5],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_2, related_star_1],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_2, related_star_2],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_2, related_star_3],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_2, related_star_4],
                                     hash_bucket_size=int(1e3)),
    tf.feature_column.crossed_column([follow_star_2, related_star_5],
                                     hash_bucket_size=int(1e3)),
    # Three-way crosses, hashed into a larger bucket space
    tf.feature_column.crossed_column([follow_star_1, related_star_1, follow_star_2],
                                     hash_bucket_size=int(1e4)),
    tf.feature_column.crossed_column([follow_star_2, related_star_1, related_star_2],
                                     hash_bucket_size=int(1e4)),
    tf.feature_column.crossed_column([follow_star_3, related_star_1, related_star_2],
                                     hash_bucket_size=int(1e4)),
    tf.feature_column.crossed_column([follow_star_1, related_star_2, related_star_1],
                                     hash_bucket_size=int(1e4)),
    # device_system,  # not defined by INPUT_COLUMNS above; define a categorical
    #                 # column for it (and export it to the CSV) before enabling
    follow_star_1,
    follow_star_2,
    follow_star_3,
    follow_star_4,
    follow_star_5,
    related_star_1,
    related_star_2,
    related_star_3,
    related_star_4,
    related_star_5
]
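A crossed column turns a combination of categorical values into one sparse ID by hashing the combination into a fixed number of buckets. The sketch below illustrates that idea in plain Python; it uses MD5 purely as a stand-in hash, so the bucket numbers will not match the ones TensorFlow computes. It also adds up the dimensionality of the wide side for the columns listed above, leaving out device_system, whose vocabulary size is not given.

```python
import hashlib

def cross_bucket(values, hash_bucket_size):
    """Hash a combination of categorical values into [0, hash_bucket_size)."""
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % hash_bucket_size

# A user who follows star 12, looking at a post about star 7:
bucket = cross_bucket([12, 7], hash_bucket_size=1000)

# Wide-side dimensionality for the columns listed above:
pairwise = 10 * 1_000      # ten two-way crosses with 1e3 buckets each
three_way = 4 * 10_000     # four three-way crosses with 1e4 buckets each
vocab = 10 * 500           # ten star-ID columns with a 500-entry vocabulary
wide_dims = pairwise + three_way + vocab
print(bucket, wide_dims)
```

The total of 55,000 dimensions is in line with the rough figure of 50,000 given for the wide side in section 3.2.1.2. Hashing keeps the dimensionality fixed at the price of occasional collisions between different combinations.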
Columns for the deep side
- Categorical features wrapped in indicator_column, plus the numeric features
# The deep side acts as the feature-mining part and takes all features, sparse and dense.
# Because the hidden layers operate on the magnitude of their inputs, categorical features
# must be one-hot encoded before they can be used as input.
deep_columns = [
    tf.feature_column.indicator_column(follow_star_1),
    tf.feature_column.indicator_column(follow_star_2),
    tf.feature_column.indicator_column(follow_star_3),
    tf.feature_column.indicator_column(follow_star_4),
    tf.feature_column.indicator_column(follow_star_5),
    tf.feature_column.indicator_column(related_star_1),
    tf.feature_column.indicator_column(related_star_2),
    tf.feature_column.indicator_column(related_star_3),
    tf.feature_column.indicator_column(related_star_4),
    tf.feature_column.indicator_column(related_star_5),
    like_posts_num,
    forward_posts_num,
    comment_posts_num,
    publish_posts_num,
    commented_num,
    forwarded_num,
    liked_num,
    publish_time
]
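The effect of an indicator column on a single categorical value can be sketched as a one-hot encoder. The helper below is a plain-Python illustration with a hypothetical name; it assumes that a value outside the vocabulary produces an all-zero vector, which is how a vocabulary column without out-of-vocabulary buckets behaves once wrapped in an indicator column.

```python
STAR_ID_LIST = list(range(0, 500))
STAR_INDEX = {star_id: position for position, star_id in enumerate(STAR_ID_LIST)}

def one_hot(value):
    """Encode one star ID as a 500-wide vector; unknown IDs give all zeros."""
    vector = [0] * len(STAR_ID_LIST)
    position = STAR_INDEX.get(value)
    if position is not None:
        vector[position] = 1
    return vector

known = one_hot(3)
unknown = one_hot(218960)   # an ID outside range(0, 500)

# Input width of the deep side for the columns listed above:
deep_width = 10 * len(STAR_ID_LIST) + 8   # ten one-hot star columns + eight numeric columns
print(sum(known), sum(unknown), deep_width)
```

For the columns as listed, the deep input is 5,008 wide, which is larger than the rough 1,000 dimensions used for the parameter estimate in section 3.2.1.2. Star IDs outside the 500-entry vocabulary, such as the ones in the example record of section 3.2.2, would carry no signal under this encoding.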
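3.2.3.3 Building the Wide&Deep model
- The function takes a run configuration, the wide and deep feature columns, and the number of units and layers of the DNN
- [100, 70, 50, 25] is only an initial value; it is tuned automatically later, during training
def build_estimator(config, embedding_size=8, hidden_units=None):
    """
    Build the Wide&Deep classifier.
    :param config: a tf.estimator.RunConfig
    :param embedding_size: kept in the signature; not used by the columns above
    :param hidden_units: sizes of the DNN hidden layers, defaults to [100, 70, 50, 25]
    :return: a tf.estimator.DNNLinearCombinedClassifier
    """
    # Feature processing: wide_columns and deep_columns are defined in 3.2.3.2
    # Model construction: use the tuned hidden_units when given, otherwise the
    # initial value (the original prepended embedding_size as an extra first
    # layer and ignored hidden_units)
    return tf.estimator.DNNLinearCombinedClassifier(
        config=config,
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=hidden_units or [100, 70, 50, 25])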
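To show how the two sides are combined at prediction time, here is a tiny forward pass in plain Python. It is a conceptual sketch rather than the estimator's code: the wide side sums the weights of the active sparse IDs, the deep side runs a small ReLU network, the two logits are added, and a sigmoid turns the sum into a probability. All weights, sizes, and function names are made up for illustration.

```python
import math

def dense(x, weights, biases):
    """Fully connected layer; weights[j] holds the input weights of output unit j."""
    return [sum(xi * wi for xi, wi in zip(x, unit)) + b
            for unit, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def wide_deep_probability(active_ids, wide_weights, wide_bias, dense_input, layers):
    """Combine a linear (wide) logit and a DNN (deep) logit into one probability."""
    wide_logit = wide_bias + sum(wide_weights.get(i, 0.0) for i in active_ids)
    hidden = dense_input
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    out_weights, out_biases = layers[-1]
    deep_logit = dense(hidden, out_weights, out_biases)[0]
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit)))

# Toy network: 2 dense inputs -> 3 hidden units -> 1 logit, all weights zero.
layers = [([[0.0, 0.0]] * 3, [0.0] * 3), ([[0.0, 0.0, 0.0]], [0.0])]
p_zero = wide_deep_probability([3], {}, 0.0, [1.0, 2.0], layers)
p_wide = wide_deep_probability([3], {3: 1.0}, 0.0, [1.0, 2.0], layers)
```

With every weight at zero the prediction is 0.5; giving the active sparse ID a wide weight of 1.0 moves it to sigmoid(1.0), which shows the wide side contributing directly to the final logit.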
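3.2.4 Summary
- Feature engineering for pan-entertainment recommendation and export of the samples
- Construction of the pan-entertainment model
- Building the input data function
- Specifying the features
- Building build_estimator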