In this paper, the authors study how to handle contextual data effectively in neural recommender systems. They first analyze the traditional approach of including context as ordinary input features, show that it is inefficient at capturing feature crosses, and design an RNN recommender system accordingly. The notes first describe the RNN-based recommender system in use at YouTube, then cover “Latent Cross,” an easy-to-use technique to incorporate contextual data in the RNN by embedding the context feature first and then performing an element-wise product of the context embedding with the model’s hidden states.
The goal is to learn as well as possible from user actions, e.g., clicks, purchases, watches, and ratings.
Some important contextual data: request and watch time, the type of device, and the page on the website or mobile app.
2 Problem Description
In the Netflix Prize setting, e≡(i,j,R): user i gave movie j a rating of R. At YouTube, e≡(i,j,t,d): user i watched video j at time t on device type d.
Recommender systems can be viewed as trying to predict one value of the event given the others: for a tuple e=(i,j,R), use (i,j) to predict R.
Notation:
e: tuple of k values describing an observed event
eℓ: element ℓ in the tuple
E: set of all observed events
ui, vj: trainable embeddings of user i and item j
Xi: all events for user i
Xi,t: all events for user i before time t
e(τ): event at step τ in a particular sequence
⟨⋅⟩: k-way inner product
∗: element-wise product
f(⋅): an arbitrary neural network
From a machine learning perspective, we can split the tuple e into features x and label y, such that x=(i,j) and y=R.
To test whether a first-order DNN can model low-rank relations well, the authors generate synthetic low-rank data.
Generate random vectors ui of length r: ui ∼ N(0, (1/r^(1/2m)) I), where r is the rank of the data and m is the number of features (modes).
When m=3, each example can be written as (i,j,t,⟨ui,uj,ut⟩). The three embeddings are concatenated as the input, passed through a hidden layer with ReLU activation and then a final linear layer. The loss is MSE (mean squared error), optimization uses Adagrad, and the fit is evaluated with the Pearson correlation.
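The data-generation step above can be sketched as follows. The function and parameter names are illustrative assumptions, and 1/r^(1/2m) is read here as the per-coordinate variance of the Gaussian, which is itself an assumption about the notation:

```python
import numpy as np

def make_low_rank_data(n_ids=100, r=2, m=3, n_samples=1000, seed=0):
    """Generate rank-r synthetic data (i, j, t, <u_i, u_j, u_t>).

    Names and sizes are illustrative; 1/r^(1/2m) is interpreted as the
    variance of each embedding coordinate."""
    rng = np.random.default_rng(seed)
    std = (1.0 / r ** (1.0 / (2 * m))) ** 0.5
    # One length-r embedding per value of each of the m features.
    U = rng.normal(0.0, std, size=(m, n_ids, r))
    idx = rng.integers(0, n_ids, size=(n_samples, m))  # sampled (i, j, t) triples
    vecs = np.stack([U[f, idx[:, f]] for f in range(m)])  # (m, n_samples, r)
    # Label: the m-way inner product, i.e. sum over r of the product across modes.
    y = vecs.prod(axis=0).sum(axis=1)
    # Input: the m embeddings concatenated, as fed to the first-order DNN.
    x = np.concatenate(list(vecs), axis=1)  # (n_samples, m * r)
    return x, y
```

A first-order DNN trained on (x, y) pairs produced this way can then be scored by the Pearson correlation between its predictions and y.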
1. As the hidden layer grows, the model fits the training data better.
2. When the rank goes from 1 to 2, the number of hidden nodes must roughly double to reach the same accuracy.
3. Considering that collaborative filtering models will often discover rank-200 relations, this intuitively suggests that real-world models would require very wide layers for even a single two-way relation to be learned.
Result: adding more ReLU layers improves the fit but is inefficient, which motivates turning to RNN models.
4 YOUTUBE’S RECURRENT RECOMMENDER
RNNs are a notable baseline because they are already second-order neural networks, significantly more complex than the first-order models explored above, and are at the cutting edge of dynamic recommender systems.
4.1 Formal Description
The input to the model is the set of events for user i: Xi = {e=(i,j,ψ(j),t) ∈ E ∣ e0=i}. Let Xi,t denote all watches by user i before time t: Xi,t = {e=(i,j,ψ(j),t) ∈ E ∣ e0=i ∧ e3<t} ⊂ Xi. The model predicts Pr(j ∣ i,t,Xi,t), the probability of the video j that user i will watch at a given time t, based on all watches before t.
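As a small illustration of Xi,t, here is a plain-Python filter over an event log; the Event structure and names are hypothetical simplifications, not YouTube's actual data model:

```python
from typing import NamedTuple

class Event(NamedTuple):
    """One observed watch event e=(i, j, t); a hypothetical, simplified tuple."""
    user: int    # e0 = i
    video: int   # e1 = j
    time: float  # watch time t

def history_before(events, user_i, t):
    """X_{i,t}: all of user i's watches strictly before time t."""
    return [e for e in events if e.user == user_i and e.time < t]

events = [Event(1, 10, 1.0), Event(1, 11, 2.0), Event(2, 12, 1.5), Event(1, 13, 3.0)]
hist = history_before(events, user_i=1, t=2.5)  # user 1's first two watches
```

This sequence of prior watches is what the RNN consumes when predicting the next video for user i at time t.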
Taking time as an example, perform an element-wise product in the middle of the network: h0(τ) = (1 + wt) ∗ h0(τ), where the context embedding wt is initialized from a 0-mean Gaussian so that the multiplier (1 + wt) starts close to 1. This has two benefits:
1. It can be interpreted as the context providing a mask or attention mechanism over the hidden state.
2. It enables learning low-rank relations between the input (the previous watch) and the time.
The same operation can be applied to later hidden states as well, e.g., h1(τ) = (1 + wt) ∗ h1(τ).
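A minimal NumPy sketch of this latent-cross step; the function and variable names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def latent_cross(hidden, context_emb):
    """Latent Cross: gate a hidden state element-wise with a context embedding.
    h <- (1 + w) * h, so a zero context embedding leaves h unchanged."""
    return (1.0 + context_emb) * hidden

rng = np.random.default_rng(0)
d = 8                                # hidden-state width (illustrative)
h0 = rng.normal(size=d)              # hidden state h0(tau) from the RNN input layer
w_t = rng.normal(0.0, 0.1, size=d)   # time-context embedding, 0-mean Gaussian init
h0_crossed = latent_cross(h0, w_t)   # (1 + w_t) * h0(tau)
```

Because w_t starts near zero, the cross initially behaves like the identity, and training only moves it away from that when the context carries useful signal.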