Recommendation Algorithms | "Latent Cross: Making Use of Context in Recurrent Recommender Systems"


1 Introduction

In this paper, the authors study how to effectively use contextual data in neural recommender systems. They first analyze the traditional approach of feeding context in as flat input features and show that it is inefficient at capturing feature crosses; this analysis informs the design of their RNN recommender system. In the authors' words: "We first describe our RNN-based recommender system in use at YouTube. Next, we offer 'Latent Cross,' an easy-to-use technique to incorporate contextual data in the RNN by embedding the context feature first and then performing an element-wise product of the context embedding with the model's hidden states."

The learning objective: to best learn from users' actions, e.g., clicks, purchases, watches, and ratings…

Some important contextual data: request and watch time, the type of device, and the page on the website or mobile app.

2 Problem Description

Netflix Prize setting: $e \equiv (i, j, R)$, user $i$ gave movie $j$ a rating of $R$. With richer context, $e \equiv (i, j, t, d)$: user $i$ watched video $j$ at time $t$ on device type $d$.

Recommender systems can be viewed as trying to predict one value of the event given the others: for a tuple $e = (i, j, R)$, use $(i, j)$ to predict $R$.

| Symbol | Description |
| --- | --- |
| $e$ | Tuple of $k$ values describing an observed event |
| $e_\ell$ | Element $\ell$ in the tuple |
| $\mathcal{E}$ | Set of all observed events |
| $u_i, v_j$ | Trainable embeddings of user $i$ and item $j$ |
| $X_i$ | All events for user $i$ |
| $X_{i,t}$ | All events for user $i$ before time $t$ |
| $e^{(\tau)}$ | Event at step $\tau$ in a particular sequence |
| $\langle \cdot \rangle$ | $k$-way inner product |
| $*$ | Element-wise product |
| $f(\cdot)$ | An arbitrary neural network |

From a machine learning perspective, we can split the tuple $e$ into features $x$ and label $y$, such that $x = (i, j)$ and $y = R$.

Matrix factorization: $u_i \cdot v_j$
Tensor factorization: $\sum_{r} u_{i, r} v_{j, r} w_{t, r}$
Written as a $k$-way inner product: $\left\langle u_{i}, v_{j}, w_{t}\right\rangle = \sum_{r} u_{i, r} v_{j, r} w_{t, r}$
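For concreteness, a minimal NumPy sketch of these scoring functions (the embeddings below are random placeholders, not trained values):

```python
import numpy as np

r = 4                      # rank / embedding dimension
u_i = np.random.randn(r)   # user embedding
v_j = np.random.randn(r)   # item embedding
w_t = np.random.randn(r)   # time (context) embedding

# 3-way inner product: <u_i, v_j, w_t> = sum_r u_{i,r} * v_{j,r} * w_{t,r}
k_way = np.sum(u_i * v_j * w_t)

# Matrix factorization is the 2-way special case: u_i . v_j
two_way = np.dot(u_i, v_j)
print(k_way, two_way)
```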

3 Modeling Preliminaries

3.1 Limitations of First-Order DNNs

The output of each layer is $h_{\tau}=g\left(W_{\tau} h_{\tau-1}+b_{\tau}\right)$. This can be viewed as a first-order transformation of $h_{\tau - 1}$: it only takes weighted sums of the elements of $h_{\tau - 1}$, with no multiplication between elements.
Matrix factorization, by contrast, captures low-rank relations between inputs of different types (user, item, time, etc.) precisely through such multiplicative interactions.

3.2 Modeling Low-Rank Relations

To test whether a first-order DNN can model low-rank relations well, the authors generate synthetic low-rank data.
For each of the $m$ features, length-$r$ random vectors are drawn as $u_{i} \sim \mathcal{N}\left(0, \frac{1}{r^{1/(2m)}} \mathbf{I}\right)$, where $r$ is the rank of the data and $m$ is the number of features.
For $m=3$, each example is $\left(i, j, t,\left\langle u_{i}, u_{j}, u_{t}\right\rangle\right)$. The three embeddings are concatenated as input to a hidden layer with ReLU activation, followed by a final linear layer; training minimizes mean squared error (MSE) with Adagrad, and accuracy is measured by the Pearson correlation between predictions and labels.
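A sketch of this synthetic experiment in PyTorch; the sample count, hidden width, learning rate, and step count are illustrative guesses rather than the paper's exact settings, and $1/r^{1/(2m)}$ is read here as the standard deviation:

```python
import numpy as np
import torch

m, r, n = 3, 2, 100                       # features, rank, ids per feature
scale = 1.0 / r ** (1.0 / (2 * m))        # keeps labels near unit variance
emb = [np.random.normal(0, scale, size=(n, r)) for _ in range(m)]

idx = np.random.randint(0, n, size=(5000, m))            # (i, j, t) triples
label = np.prod([emb[f][idx[:, f]] for f in range(m)], axis=0).sum(axis=1)

# Input: the m embeddings concatenated, as in the setup above.
x = torch.tensor(np.concatenate([emb[f][idx[:, f]] for f in range(m)], axis=1),
                 dtype=torch.float32)
y = torch.tensor(label, dtype=torch.float32)

model = torch.nn.Sequential(
    torch.nn.Linear(m * r, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adagrad(model.parameters(), lr=0.1)

for step in range(500):                   # train with MSE, as described
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x).squeeze(-1), y)
    loss.backward()
    opt.step()

# Evaluate with the Pearson correlation between predictions and labels.
with torch.no_grad():
    pred = model(x).squeeze(-1)
print(np.corrcoef(pred.numpy(), y.numpy())[0, 1])
```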
Findings:
1. As the hidden layer gets wider, the model fits the training data better.
2. When the rank goes from 1 to 2, the number of hidden nodes must roughly double to reach the same accuracy.
3. Considering that collaborative filtering models will often discover rank-200 relations, this intuitively suggests that real-world models would require very wide layers to learn even a single two-way relation.
Conclusion: adding and widening ReLU layers improves the fit, but inefficiently; this motivates the RNN-based models considered next.

4 YouTube's Recurrent Recommender

RNNs are a notable baseline model because they are already second-order neural networks, significantly more complex than the first-order models explored above, and are at the cutting edge of dynamic recommender systems.

4.1 Formal Description

The input to the model is the set of events for user $i$: $X_{i}=\left\{e=(i, j, \psi(j), t) \in \mathcal{E} \mid e_{0}=i\right\}$. We use $X_{i,t}$ to denote all of user $i$'s watches before time $t$: $X_{i, t}=\left\{e \in X_i \mid e_{3}<t\right\} \subset X_{i}$. The model predicts $\operatorname{Pr}\left(j \mid i, t, X_{i, t}\right)$: the probability that user $i$ will watch video $j$ at a given time $t$, based on all watches before $t$.

An event: user $i$ at time $t$ watches video $j$, uploaded by $\psi(j)$, with context feature $w_t$. The model takes user $i$'s watch history before time $t$, $X_{i,t}$, as input. We use $e^{(\tau)}$ for the $\tau$-th event in the sequence, $x^{(\tau)}$ for the network input derived from $e^{(\tau)}$ (a concatenation of embeddings), and $y^{(\tau)}$ for the label to predict. If the current event is $e^{(\tau)}=(i, j, \psi(j), t)$ and the next event is $e^{(\tau+1)}=\left(i, j^{\prime}, \psi\left(j^{\prime}\right), t^{\prime}\right)$, then the input $x^{(\tau)}=\left[v_{j} ; u_{\psi(j)} ; w_{t}\right]$ is used to predict the label $y^{(\tau+1)}=j^{\prime}$, where $v_{j}$ is the video embedding, $u_{\psi(j)}$ is the uploader embedding, and $w_{t}$ is the context embedding. When predicting $y^{(\tau+1)}$, the video of $e^{(\tau+1)}$ cannot be used as input (it is the label), but that event's context features can, denoted $c^{(\tau+1)}=\left[w_{t^{\prime}}\right]$.

4.2 Structure of the Baseline RNN Model

The RNN models the sequence of actions as follows:
1. For each event $e^{(\tau)}$, the corresponding input $x^{(\tau)}$ first passes through a feedforward layer: $h_{0}^{(\tau)}=f_{i}\left(x^{(\tau)}\right)$.
2. This is fed into a recurrent cell (e.g., LSTM or GRU): $h_{1}^{(\tau)}, z^{(\tau)}=f_{r}\left(h_{0}^{(\tau)}, z^{(\tau-1)}\right)$, where $z^{(\tau)}$ is the recurrent state.
3. Finally, $f_{o}\left(h_{1}^{(\tau-1)}, c^{(\tau)}\right)$ is used to predict $y^{(\tau)}$.
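Below is a minimal PyTorch sketch of this three-step baseline. The layer sizes, the ReLU in $f_i$, and the plain linear scoring head are illustrative assumptions, not the production configuration; context post-fusion is deferred to 4.3.

```python
import torch

class BaselineRNNRecommender(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, num_videos):
        super().__init__()
        self.f_i = torch.nn.Linear(input_dim, hidden_dim)    # h0 = f_i(x)
        self.f_r = torch.nn.GRUCell(hidden_dim, hidden_dim)  # recurrent cell
        self.f_o = torch.nn.Linear(hidden_dim, num_videos)   # scoring head

    def forward(self, x_seq):
        # x_seq: (seq_len, batch, input_dim); x^{(tau)} = [v_j; u_psi(j); w_t]
        z = torch.zeros(x_seq.size(1), self.f_r.hidden_size)
        logits = []
        for x in x_seq:                   # one event e^{(tau)} at a time
            h0 = torch.relu(self.f_i(x))  # step 1: h0^{(tau)}
            z = self.f_r(h0, z)           # step 2: GRU output doubles as h1, z
            logits.append(self.f_o(z))    # step 3: score the next watch
        return torch.stack(logits)

model = BaselineRNNRecommender(input_dim=96, hidden_dim=128, num_videos=1000)
scores = model(torch.randn(10, 4, 96))    # toy sequence of 10 events
```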

4.3 Context Features

1. TimeDelta: $\Delta t^{(\tau)}=\log \left(t^{(\tau+1)}-t^{(\tau)}\right)$, the log of the time elapsed until the next event.
2. Software client: properties such as a video's length influence which device a user watches it on.
3. Page: a user browsing from the site's home page is likely more interested in new content, while one arriving from a specific video page is likely interested in a particular topic.
4. Pre- and post-fusion: context features, denoted $c^{(\tau)}$ above, can enter the network by pre-fusion (fed in at the bottom of the network as part of the input) or by post-fusion (combined with the RNN's output). Concretely, $c^{(\tau-1)}$ is used as a pre-fusion feature to influence the RNN state, while $c^{(\tau)}$ is used as a post-fusion feature directly in predicting $y^{(\tau)}$; a sketch of one such step follows.
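A sketch of a single step with pre- and post-fusion via concatenation; the 16-dim context and concatenation-based fusion are illustrative assumptions in the same spirit as the baseline sketch above.

```python
import torch

# Illustrative layer shapes: 96-dim input, 16-dim context, 128-dim hidden.
f_i = torch.nn.Linear(96 + 16, 128)
f_r = torch.nn.GRUCell(128, 128)
f_o = torch.nn.Linear(128 + 16, 1000)

def step(x_prev, c_prev, c_cur, z):
    # Pre-fusion: concatenate c^{(tau-1)} into the input, so it can
    # influence the recurrent state.
    h0 = torch.relu(f_i(torch.cat([x_prev, c_prev], dim=-1)))
    z = f_r(h0, z)
    # Post-fusion: concatenate c^{(tau)} with the RNN output and use it
    # directly when predicting y^{(tau)}.
    return f_o(torch.cat([z, c_cur], dim=-1)), z

logits, z = step(torch.randn(4, 96), torch.randn(4, 16),
                 torch.randn(4, 16), torch.zeros(4, 128))
```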

5 Context Modeling with the Latent Cross

As shown above, simply concatenating context features into the input is inefficient for learning feature crosses, so this section develops an alternative.

5.1 Single Feature

Taking time as an example, perform an element-wise product in the middle of the network: $h_{0}^{(\tau)}=\left(1+w_{t}\right) * h_{0}^{(\tau)}$, where $w$ is initialized from a 0-mean Gaussian so that the multiplier $(1 + w_t)$ starts near the identity. This has two benefits:
1. It can be interpreted as the context providing a mask or attention mechanism over the hidden state.
2. It enables low-rank relations between the input (the previous watch) and the time.
The same operation can be applied after the recurrent cell: $h_{1}^{(\tau)}=\left(1+w_{t}\right) * h_{1}^{(\tau)}$. A minimal sketch follows.
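A sketch of the single-feature latent cross, assuming a learned embedding table for the context feature (the 0.01 initialization scale is an illustrative choice of the 0-mean Gaussian):

```python
import torch

hidden_dim, num_contexts = 128, 32
# Context embedding table; small 0-mean init keeps (1 + w_t) near identity.
context_emb = torch.nn.Embedding(num_contexts, hidden_dim)
torch.nn.init.normal_(context_emb.weight, mean=0.0, std=0.01)

def latent_cross(h, context_id):
    w_t = context_emb(context_id)  # embed the context feature
    return (1 + w_t) * h           # element-wise product with the hidden state

h0 = torch.randn(4, hidden_dim)    # e.g. h_0^{(tau)} for a batch of 4
h0 = latent_cross(h0, torch.tensor([3, 1, 0, 7]))
```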

5.2 Using Multiple Features

In practice there are usually several contextual features. Taking device and time as an example: $h^{(\tau)}=\left(1+w_{t}+w_{d}\right) * h^{(\tau)}$. This form:
1. again acts as a mask/attention mechanism over the hidden state;
2. captures 2-way relations between the hidden state and each context feature;
3. is easy to train thanks to its additive structure, whereas multiplicative forms such as $w_{t} * w_{d} * h^{(\tau)}$, or learning $f\left(\left[w_{t} ; w_{d}\right]\right)$ with a feedforward layer, are harder to train. A sketch of the multi-feature form follows.
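The multi-feature form only changes the gate to a sum of context embeddings; a sketch, with hypothetical hour-of-day and device-type vocabularies:

```python
import torch

hidden_dim = 128
time_emb = torch.nn.Embedding(24, hidden_dim)   # e.g. hour-of-day buckets
device_emb = torch.nn.Embedding(5, hidden_dim)  # e.g. device types
for emb in (time_emb, device_emb):
    torch.nn.init.normal_(emb.weight, std=0.01)

def multi_latent_cross(h, t_id, d_id):
    # (1 + w_t + w_d) * h: contexts are summed, then gate the hidden state.
    return (1 + time_emb(t_id) + device_emb(d_id)) * h

h = multi_latent_cross(torch.randn(4, hidden_dim),
                       torch.tensor([0, 3, 9, 15]), torch.tensor([1, 1, 0, 2]))
```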
