Recommendation Algorithms 03: The Factorization Machines (FM) Paper

Abstract—In this paper, we introduce Factorization Machines (FM) which are a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models. Like SVMs, FMs are a general predictor working with any real valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly. So unlike nonlinear SVMs, a transformation in the dual form is not necessary and the model parameters can be estimated directly without the need of any support vector in the solution. We show the relationship to SVMs and the advantages of FMs for parameter estimation in sparse settings.

           On the other hand there are many different factorization models like matrix factorization, parallel factor analysis or specialized models like SVD++, PITF or FPMC. The drawback of these models is that they are not applicable for general prediction tasks but work only with special input data. Furthermore their model equations and optimization algorithms are derived individually for each task. We show that FMs can mimic these models just by specifying the input data (i.e. the feature vectors). This makes FMs easily applicable even for users without expert knowledge in factorization models.
           Index Terms—factorization machine; sparse data; tensor factorization; support vector machine

I. INTRODUCTION

        Support Vector Machines are one of the most popular predictors in machine learning and data mining. Nevertheless in settings like collaborative filtering, SVMs play no important role and the best models are either direct applications of standard matrix/tensor factorization models like PARAFAC [1] or specialized models using factorized parameters [2], [3], [4]. In this paper, we show that the only reason why standard SVM predictors are not successful in these tasks is that they cannot learn reliable parameters (‘hyperplanes’) in complex (non-linear) kernel spaces under very sparse data. On the other hand, the drawback of tensor factorization models and even more for specialized factorization models is that (1) they are not applicable to standard prediction data (e.g. a real valued feature vector in R^n) and (2) that specialized models are usually derived individually for a specific task requiring effort in modelling and design of a learning algorithm.

        In this paper, we introduce a new predictor, the Factorization Machine (FM), that is a general predictor like SVMs but is also able to estimate reliable parameters under very high sparsity. The factorization machine models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parametrization instead of a dense parametrization like in SVMs. We show that the model equation of FMs can be computed in linear time and that it depends only on a linear number of parameters. This allows direct optimization and storage of model parameters without the need of storing any training data (e.g. support vectors) for prediction. In contrast to this, non-linear SVMs are usually optimized in the dual form and computing a prediction (the model equation) depends on parts of the training data (the support vectors). We also show that FMs subsume many of the most successful approaches for the task of collaborative filtering including biased MF, SVD++ [2], PITF [3] and FPMC [4].

In total, the advantages of our proposed FM are:
1) FMs allow parameter estimation under very sparse data where SVMs fail.
2) FMs have linear complexity, can be optimized in the primal and do not rely on support vectors like SVMs. We show that FMs scale to large datasets like Netflix with 100 millions of training instances.
3) FMs are a general predictor that can work with any real valued feature vector. In contrast to this, other state-of-the-art factorization models work only on very restricted input data. We will show that just by defining the feature vectors of the input data, FMs can mimic state-of-the-art models like biased MF, SVD++, PITF or FPMC.

In short, the three key strengths of FM: linear time complexity O(kn), reliable parameter estimation under highly sparse data, and a general predictor that works with any real-valued feature vector.

II. PREDICTION UNDER SPARSITY

        The most common prediction task is to estimate a function y: R^n→T from a real valued feature vector x∈R^n to a target domain T (e.g. T = R for regression or T = {+, −} for classification). In supervised settings, it is assumed that there is a training dataset D = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . .} of examples for the target function y given. We also investigate the ranking task where the function y with target T = R can be used to score feature vectors x and sort them according to their score. Scoring functions can be learned with pairwise training data [5], where a feature tuple (x^{(A)}, x^{(B)}) ∈ D means that x^{(A)} should be ranked higher than x^{(B)}. As the pairwise ranking relation is antisymmetric, it is sufficient to use only positive training instances.

        In this paper, we deal with problems where x is highly sparse, i.e. almost all of the elements xi of a vector x are zero. Let m(x) be the number of non-zero elements in the feature vector x and m̄_D be the average number of non-zero elements m(x) of all vectors x ∈ D. Huge sparsity (m̄_D ≪ n) appears in many real-world data like feature vectors of event transactions (e.g. purchases in recommender systems) or text analysis (e.g. bag of word approach). One reason for huge sparsity is that the underlying problem deals with large categorical variable domains.

Example 1 Assume we have the transaction data of a movie review system. The system records which user u ∈ U rates a movie (item) i ∈ I at a certain time t ∈ R with a rating r ∈ {1, 2, 3, 4, 5}. Let the users U and items I be:
                                U = {Alice (A), Bob (B), Charlie (C), . . .}
                                 I = {Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), . . .}
Let the observed data S be:
                               S = {(A, TI, 2010-1, 5),(A, NH, 2010-2, 3),(A, SW, 2010-4, 1), (B, SW, 2009-5, 4),(B, ST, 2009-8, 5), (C, TI, 2009-9, 1),(C, SW, 2009-12, 5)}
An example for a prediction task using this data is to estimate a function ŷ that predicts the rating behaviour of a user for an item at a certain point in time.

Figure 1 shows one example of how feature vectors can be created from S for this task. Here, first there are |U| binary indicator variables (blue) that represent the active user of a transaction – there is always exactly one active user in each transaction (u, i, t, r) ∈ S, e.g. user Alice in the first one (x^{(1)}_A = 1). The next |I| binary indicator variables (red) hold the active item – again there is always exactly one active item (e.g. x^{(1)}_{TI} = 1). The feature vectors in figure 1 also contain indicator variables (yellow) for all the other movies the user has ever rated. For each user, the variables are normalized such that they sum up to 1. E.g. Alice has rated Titanic, Notting Hill and Star Wars. Additionally the example contains a variable (green) holding the time in months starting from January, 2009. And finally the vector contains information of the last movie (brown) the user has rated before (s)he rated the active one – e.g. for x^{(2)}, Alice rated Titanic before she rated Notting Hill. In section V, we show how factorization machines using such feature vectors as input data are related to specialized state-of-the-art factorization models.
We will use this example data throughout the paper for illustration. However please note that FMs are general predictors like SVMs and thus are applicable to any real valued feature vectors and are not restricted to recommender systems.
        Figure 1 shows the feature vectors and targets constructed from the observed data S; e.g. in the first observation, Alice's rating for Titanic is 5. Each feature vector consists of five parts:

  1. Blue block: the active user, with dimension \left | U \right |. The component of the user who gives the rating is 1, all others are 0. E.g. in the first observation, x^{(1)}_A = 1 indicates that the active user is Alice.
  2. Red block: the active (rated) movie, with dimension \left | I \right |. The component of the rated movie is 1, all others are 0. E.g. in the first observation, x^{(1)}_{TI} = 1 indicates that the rated movie is Titanic.
  3. Yellow block: all movies the active user has ever rated, with dimension \left | I \right |. Each movie rated by the user gets the value \frac{1}{n_I} (where n_I is the number of movies that user has rated), all others are 0. E.g. Alice has rated TI, NH and SW, so in this block x^{(1)}_{TI} = x^{(1)}_{NH} = x^{(1)}_{SW} = \frac{1}{3}.
  4. Green block: the time of the rating, with dimension 1, counted in months starting from January 2009; e.g. May 2009 is encoded as 5.
  5. Brown block: the movie the user rated immediately before the active one, with dimension \left | I \right |.
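The following Python sketch shows one way such feature vectors could be assembled from the transactions of Example 1. The index layout, month encoding and helper names are illustrative assumptions of this write-up, not code from the paper.

import numpy as np

users = ["A", "B", "C"]
items = ["TI", "NH", "SW", "ST"]
u_idx = {u: i for i, u in enumerate(users)}
i_idx = {m: i for i, m in enumerate(items)}
n = len(users) + 3 * len(items) + 1        # |U| + |I| + |I| + 1 + |I|

def encode(user, item, month, rated, last):
    """Build one feature vector x: active user, active item, all movies
    the user has rated (normalized), time in months, last rated movie."""
    x = np.zeros(n)
    x[u_idx[user]] = 1.0                                   # blue block
    x[len(users) + i_idx[item]] = 1.0                      # red block
    for m in rated:                                        # yellow block
        x[len(users) + len(items) + i_idx[m]] = 1.0 / len(rated)
    x[len(users) + 2 * len(items)] = month                 # green block
    if last is not None:                                   # brown block
        x[len(users) + 2 * len(items) + 1 + i_idx[last]] = 1.0
    return x

# e.g. Alice's second transaction (A, NH, 2010-2, 3): she has rated
# TI, NH and SW overall, and rated TI just before NH; 2010-2 -> month 14.
x2, y2 = encode("A", "NH", 14, ["TI", "NH", "SW"], "TI"), 3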

III. FACTORIZATION MACHINES (FM)
        This section introduces the FM model. We discuss the model equation in detail and briefly show how FMs can be applied to several prediction tasks.

1. The FM model
1.1 Model equation
The model equation of a 2nd-order FM is:

               \hat{y}(x) = w_0 + \sum_{i=1}^{n}w_ix_i + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left \langle \mathbf{v}_i, \mathbf{v}_j \right \rangle x_ix_j

where w_0 \in R, \mathbf{w} \in R^n (an n-dimensional vector) and \mathbf{V} \in R^{n \times k} (an n \times k matrix); k is a hyperparameter that defines the dimensionality of the factorization.
The dot product of two factor vectors is \left \langle \mathbf{v}_i, \mathbf{v}_j \right \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}.
A 2nd-order FM thus captures all single variables as well as all pairwise interactions between variables:
w_0 is the global bias;
w_i models the strength of the i-th variable x_i;
w_{i,j} := \left \langle \mathbf{v}_i, \mathbf{v}_j \right \rangle models the interaction between x_i and x_j, using a factorized parameterization instead of an independent real-valued weight per pair.
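As a minimal sketch (my own illustration, not code from the paper), the model equation above can be implemented directly; this naive version costs O(kn^2) because it enumerates every pair (i, j):

import numpy as np

def fm_predict_naive(x, w0, w, V):
    """x: (n,) feature vector, w0: scalar bias, w: (n,) linear weights,
    V: (n, k) factor matrix; returns the 2nd-order FM prediction."""
    n = x.shape[0]
    y = w0 + np.dot(w, x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            y += np.dot(V[i], V[j]) * x[i] * x[j]   # <v_i, v_j> x_i x_j
    return y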

1.2 Expressiveness
For any positive definite matrix \mathbf{W} \in R^{n \times n} there exists a matrix \mathbf{V} \in R^{n \times k} such that \mathbf{W} = \mathbf{V}\mathbf{V}^T, provided k is chosen large enough. In theory, k should therefore be sufficiently large; in highly sparse settings, however, there is usually not enough data to estimate complex interaction matrices, so k is typically chosen small. Restricting k, and thus the expressiveness of the FM, leads to better generalization.
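A small numerical illustration of this statement, assuming NumPy: for a positive definite W one can obtain a V with W = VV^T (here with k = n, via the Cholesky factor). This only demonstrates the claim for one example matrix, it is not a proof.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
W = A @ A.T + 5.0 * np.eye(5)      # a positive definite 5x5 matrix
V = np.linalg.cholesky(W)          # lower-triangular V with W = V V^T
print(np.allclose(V @ V.T, W))     # True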

1.3 Parameter estimation under sparsity
In sparse settings there is usually not enough data to estimate the interactions between variables directly and independently. FMs can nevertheless estimate them, because the factorized parameterization breaks the independence of the interaction parameters. For example, the observed data S contains no record in which Alice rated Star Trek; a model that estimates the interaction between x_A and x_{ST} directly would therefore end up with w_{A,ST} = 0. With the factorized interaction \left \langle \mathbf{v}_A, \mathbf{v}_{ST} \right \rangle, however, the interaction can still be estimated, because \mathbf{v}_A is learned from Alice's interactions with other movies and \mathbf{v}_{ST} from other users' interactions with Star Trek.

1.4 Computation
As written in 1.1, evaluating the model equation costs O(kn^2), because all pairwise interactions have to be computed. The pairwise term can, however, be reformulated so that the whole equation is computable in linear time O(kn):

               \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left \langle \mathbf{v}_i, \mathbf{v}_j \right \rangle x_ix_j = \frac{1}{2}\sum_{f=1}^{k}\left ( \left ( \sum_{i=1}^{n}v_{i,f}x_i \right )^2 - \sum_{i=1}^{n}v_{i,f}^2x_i^2 \right )

Under sparsity, the sums only have to run over the non-zero elements of x, so the cost is effectively O(k\,m(x)).
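A sketch of the linear-time form (my own illustration); it should agree with the naive double loop up to floating-point error:

import numpy as np

def fm_predict_linear(x, w0, w, V):
    """O(kn) evaluation of the FM equation via the reformulated pair term."""
    s  = V.T @ x                    # (k,)  sum_i v_{i,f} x_i
    s2 = (V ** 2).T @ (x ** 2)      # (k,)  sum_i v_{i,f}^2 x_i^2
    return w0 + np.dot(w, x) + 0.5 * np.sum(s * s - s2)

# quick consistency check against the naive pairwise sum
rng = np.random.default_rng(1)
n, k = 16, 4
x, w0, w, V = rng.normal(size=n), 0.1, rng.normal(size=n), rng.normal(size=(n, k))
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
print(np.isclose(naive, fm_predict_linear(x, w0, w, V)))   # True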

2. Applying FM to prediction tasks

  • Regression: \hat{y}(x) is used directly as the predictor, and the optimization criterion is a loss such as the minimal least squares error on D.
  • Binary classification: the sign of \hat{y}(x) is used, and the parameters are optimized with hinge loss or logit loss.
  • Ranking: the vectors x are ordered by the score \hat{y}(x), and optimization is done over pairs of instance vectors with a pairwise classification loss.

In all of these cases, regularization terms (typically L2) are added to the loss to prevent overfitting; intuitively, regularization keeps the parameter magnitudes small, so that no subset of parameters becomes disproportionately large or small. (A sketch of a regularized objective follows below.)
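As an illustration of such a regularized objective for the regression case (the function name and the regularization strengths lam_w, lam_v are my own illustrative choices, not values from the paper):

import numpy as np

def fm_regression_loss(X, y, w0, w, V, lam_w=0.01, lam_v=0.01):
    """Mean squared error of the FM predictions plus L2 penalties.
    X: (m, n) design matrix, y: (m,) targets, V: (n, k) factor matrix."""
    S = X @ V                                                  # (m, k) sums
    preds = w0 + X @ w + 0.5 * np.sum(S ** 2 - (X ** 2) @ (V ** 2), axis=1)
    mse = np.mean((preds - y) ** 2)
    return mse + lam_w * np.sum(w ** 2) + lam_v * np.sum(V ** 2)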

3. Learning algorithm
The model parameters (w_0, \mathbf{w} and \mathbf{V}) can be learned efficiently with stochastic gradient descent (SGD) for the losses above, since the gradient of the FM model equation with respect to each parameter can be computed in constant time once the sums \sum_{j} v_{j,f} x_j are precomputed.
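A minimal SGD sketch for the squared-error case, following the gradients of the model equation (\partial\hat{y}/\partial w_0 = 1, \partial\hat{y}/\partial w_i = x_i, \partial\hat{y}/\partial v_{i,f} = x_i \sum_j v_{j,f}x_j - v_{i,f}x_i^2); the learning rate and the omission of regularization are simplifying assumptions of this write-up:

import numpy as np

def fm_sgd_epoch(X, y, w0, w, V, lr=0.01):
    """One pass of stochastic gradient descent for FM regression."""
    for x, target in zip(X, y):
        s = V.T @ x                                       # (k,) reused sums
        y_hat = w0 + w @ x + 0.5 * np.sum(s * s - (V ** 2).T @ (x ** 2))
        err = y_hat - target                              # d loss / d y_hat
        w0 -= lr * err                                    # gradient 1
        w  -= lr * err * x                                # gradient x_i
        V  -= lr * err * (np.outer(x, s) - V * (x ** 2)[:, None])
    return w0, w, V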
 
