Matrix Factorization for Recommender Systems: Spark ALS

1. Recommender Systems and Spark

If you work on recommender systems, you almost certainly use Spark. Its uses are broad: it works well for analyzing effectiveness metrics, it is the go-to choice for building features and offline training sets, Spark Streaming is a common solution for real-time data, and the mllib and ml packages implement many standard algorithms, making Spark the most widely used framework for distributed algorithms on large datasets. Fluency with Spark is therefore a basic skill for recommender-system work.

2. The ALS Algorithm

In Spark mllib/ml, the recommendation package contains exactly one algorithm: ALS. Anyone who has worked on recommender systems has probably used it at some point, so let's summarize how it works.

In the earlier article 推薦系統中的矩陣分解詳解 (a detailed look at matrix factorization in recommender systems), we noted that the goal is to factorize the original user-item rating matrix into two low-rank matrices, with the loss function
$$\min_{q^*, p^*} \sum_{(u, i)} \left( r_{ui} - q_i^T p_u \right)^2$$

Once we have the loss function, the next step is choosing an optimization algorithm to minimize it.
On large datasets, two families of optimizers are commonly used: gradient descent (GD) methods and alternating least squares (ALS). We are already familiar with gradient descent, so what is ALS? Or, put differently, how does ALS differ from ordinary least squares?
The loss function above differs from a typical one in that it involves more than one set of variables: the item vectors $q_i$ and the user vectors $p_u$. ALS therefore optimizes by alternating:
fix the current $q^{(t)}$ and solve for $p^{(t+1)}$, then fix $p^{(t+1)}$ and solve for $q^{(t+1)}$, repeating until convergence.
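
To make the alternating step concrete, here is the standard closed-form solution for one user vector with all item vectors held fixed (a routine least-squares derivation; the $\lambda$ term is the regularizer Spark adds, and setting $\lambda = 0$ recovers the loss above):

$$p_u = \Big( \sum_{i \in I_u} q_i q_i^T + \lambda I \Big)^{-1} \sum_{i \in I_u} r_{ui}\, q_i$$

where $I_u$ is the set of items rated by user $u$. The update for each $q_i$ is symmetric, with the roles of users and items swapped.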

Compared with GD-family algorithms, ALS has two main advantages:
1. Within each half-step, the solve for each $p_u$ (and likewise each $q_i$) is independent of all the others, so they can be computed in parallel to speed up training.
2. In typical recommendation scenarios the number of user-item combinations is enormous; tens of millions of users with hundreds of thousands or even millions of items is common. Iterating over examples one by one with GD or SGD is very slow at that scale. Negative sampling and similar tricks help, but not by much overall. ALS, on the other hand, can use matrix tricks to get around this computational inefficiency, as sketched below.
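
To illustrate point 1, here is a minimal, self-contained sketch of one ALS half-step (assuming the Breeze linear-algebra library; this is not Spark's actual blocked implementation). With the item factors held fixed, each user vector is its own small least-squares solve, which is exactly why the step parallelizes:

import breeze.linalg.{DenseMatrix, DenseVector}

object AlsHalfStepSketch {
  /**
   * One ALS half-step: item factors fixed, solve every user vector.
   * ratingsByUser(u) is the list of (itemIndex, rating) pairs for user u.
   * Each user's solve touches only that user's own ratings, so the map
   * below could run in parallel (Spark groups users into blocks for this).
   */
  def solveUsers(
      itemFactors: DenseMatrix[Double],              // numItems x rank, held fixed
      ratingsByUser: Map[Int, Seq[(Int, Double)]],
      lambda: Double): Map[Int, DenseVector[Double]] = {
    val rank = itemFactors.cols
    ratingsByUser.map { case (u, rs) =>
      // Q_u: the factor rows of the items this user rated.
      val qu = DenseMatrix.tabulate(rs.length, rank)((r, c) => itemFactors(rs(r)._1, c))
      val ru = DenseVector(rs.map(_._2).toArray)
      // Normal equations: (Q_u^T Q_u + lambda * I) p_u = Q_u^T r_u
      val a = qu.t * qu + DenseMatrix.eye[Double](rank) * lambda
      u -> (a \ (qu.t * ru))
    }
  }
}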

Each ALS iteration decreases the error, and the error is bounded below, so ALS is guaranteed to converge. However, because the problem is non-convex, ALS is not guaranteed to converge to the global optimum. In practice, ALS is not very sensitive to the initial point, and whether it reaches the global optimum has little practical impact. (Reference 1)

3. ALS in Spark

Now let's look at ALS in Spark. Both the mllib and ml packages implement ALS; the APIs differ somewhat, but the underlying ideas are the same, so we will analyze the mllib version, using Spark 2.3.
First, the Rating class:

/**
 * A more compact class to represent a rating than Tuple3[Int, Int, Double].
 */
@Since("0.8.0")
case class Rating @Since("0.8.0") (
    @Since("0.8.0") user: Int,
    @Since("0.8.0") product: Int,
    @Since("0.8.0") rating: Double)

This class represents one record of the input training set, with three fields: user, product (item), and rating, denoting the user, the item, and the score respectively.
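
As a usage sketch (the file path and the comma-separated format here are hypothetical, and sc is assumed to be an existing SparkContext), a training set is typically built by mapping raw log lines into an RDD of Rating:

import org.apache.spark.mllib.recommendation.Rating

// Hypothetical input: lines of "userId,itemId,score".
val ratings = sc.textFile("/path/to/ratings.csv").map { line =>
  val Array(user, item, score) = line.split(",")
  Rating(user.toInt, item.toInt, score.toDouble)
}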

Next, the ALS class:

/**
 * Alternating Least Squares matrix factorization.
 *
 * ALS attempts to estimate the ratings matrix `R` as the product of two lower-rank matrices,
 * `X` and `Y`, i.e. `X * Yt = R`. Typically these approximations are called 'factor' matrices.
 * The general approach is iterative. During each iteration, one of the factor matrices is held
 * constant, while the other is solved for using least squares. The newly-solved factor matrix is
 * then held constant while solving for the other factor matrix.
 *
 * This is a blocked implementation of the ALS factorization algorithm that groups the two sets
 * of factors (referred to as "users" and "products") into blocks and reduces communication by only
 * sending one copy of each user vector to each product block on each iteration, and only for the
 * product blocks that need that user's feature vector. This is achieved by precomputing some
 * information about the ratings matrix to determine the "out-links" of each user (which blocks of
 * products it will contribute to) and "in-link" information for each product (which of the feature
 * vectors it receives from each user block it will depend on). This allows us to send only an
 * array of feature vectors between each user block and product block, and have the product block
 * find the users' ratings and update the products based on these messages.
 *
 * For implicit preference data, the algorithm used is based on
 * "Collaborative Filtering for Implicit Feedback Datasets", available at
 * <a href="http://dx.doi.org/10.1109/ICDM.2008.22">here</a>, adapted for the blocked approach
 * used here.
 *
 * Essentially instead of finding the low-rank approximations to the rating matrix `R`,
 * this finds the approximations for a preference matrix `P` where the elements of `P` are 1 if
 * r &gt; 0 and 0 if r &lt;= 0. The ratings then act as 'confidence' values related to strength of
 * indicated user
 * preferences rather than explicit ratings given to items.
 */
@Since("0.8.0")
class ALS private (
    private var numUserBlocks: Int,
    private var numProductBlocks: Int,
    private var rank: Int,
    private var iterations: Int,
    private var lambda: Double,
    private var implicitPrefs: Boolean,
    private var alpha: Double,
    private var seed: Long = System.nanoTime()
  ) extends Serializable with Logging {
...

When reading well-known open-source projects, the comments are often the best and most important source of information; understanding them goes a long way toward understanding the code.

ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices.

The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.

This comment concisely captures the essence of ALS:
1. ALS factorizes the rating matrix R into two low-rank matrices X and Y, with $X Y^T = R$.
2. These low-rank approximations are called 'factor' matrices.
3. The basic approach is iterative. In each iteration, one factor matrix is held constant while the other is solved for with least squares; the newly solved factor is then held constant in turn while the other is solved for.

After reading it, you already have the core idea of ALS; the comment really is excellent.
The long middle section of the comment covers how Spark parallelizes the computation (the blocked implementation), which we won't discuss in this article.

For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at http://dx.doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.

Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r > 0 and 0 if r <= 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.

This part of the comment is also excellent and important: it introduces the implicit-feedback case.

Implicit feedback is explained in detail in the earlier article 推薦系統中的矩陣分解詳解, so we won't repeat the full discussion here; see that article's section on implicit feedback. The key construction is shown below for quick reference.
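
Following the quoted comment (with $d_{ui}$ denoting the raw observed interaction, the r in the quote, to match the notation used later in this article), the cited paper builds a binary preference and a confidence weight, and the confidence multiplies each squared error term in the loss:

$$p_{ui} = \begin{cases} 1, & d_{ui} > 0 \\ 0, & d_{ui} \le 0 \end{cases} \qquad c_{ui} = 1 + \alpha d_{ui}$$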

With the comments understood, the ALS constructor is easy to read:

class ALS private (
    private var numUserBlocks: Int,
    private var numProductBlocks: Int,
    private var rank: Int,
    private var iterations: Int,
    private var lambda: Double,
    private var implicitPrefs: Boolean,
    private var alpha: Double,
    private var seed: Long = System.nanoTime()
  ) extends Serializable with Logging {
...

numUserBlocks and numProductBlocks are parallelism parameters for Spark's computation; rank is the desired dimension of the latent vectors; iterations is the number of iterations of the algorithm; lambda is the regularization parameter; implicitPrefs marks whether the dataset is implicit; and alpha is the hyperparameter in the implicit-feedback confidence $c_{ui} = 1 + \alpha d_{ui}$.
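
Putting the pieces together, a minimal training run might look like the following (the rank, iteration count, lambda, and alpha values are illustrative placeholders, and ratings is the RDD[Rating] built earlier; ALS.train handles the explicit case and ALS.trainImplicit the implicit one):

import org.apache.spark.mllib.recommendation.ALS

// Explicit feedback: rank 10, 10 iterations, lambda = 0.01.
val model = ALS.train(ratings, 10, 10, 0.01)

// Implicit feedback: same shape, plus the confidence hyperparameter alpha.
val implicitModel = ALS.trainImplicit(ratings, 10, 10, 0.01, 40.0)

// The returned MatrixFactorizationModel scores user-item pairs and
// produces top-N recommendations for a user.
val score = model.predict(1, 42)
val top10 = model.recommendProducts(1, 10)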

References

1. https://blog.csdn.net/u011239443/article/details/51752904
