Matrix Factorization for Recommender Systems Explained: Spark ALS

1. Recommender Systems and Spark

Anyone working on recommender systems will sooner or later use Spark. Spark's uses are remarkably broad: it is well suited to analyzing effectiveness data, it is the natural choice for building features and offline training sets, Spark Streaming is a common solution for real-time data, and the mllib and ml packages implement many common algorithms, making Spark the most widely used framework for distributed algorithms on large datasets. Fluency with Spark is therefore a basic skill for recommender system work.

2. The ALS Algorithm

In Spark mllib/ml, the recommendation package contains exactly one algorithm: ALS. Anyone who has done recommendation work has probably used ALS at one point or another, so let's take stock of it below.

In the article "Matrix Factorization for Recommender Systems Explained", we noted that our ultimate goal is to factorize the original user-item rating matrix into two low-rank matrices, with the loss function

$$\min\limits_{q^*, p^*} \sum\limits_{(u, i)} (r_{ui} - q_i^T p_u)^2$$

Once we have the loss function, what remains is to pick an optimization algorithm to solve it.
On large datasets, two families of optimizers are common: gradient descent (GD) and alternating least squares (ALS). Gradient descent we already know well, so what is ALS? Put differently, how does ALS differ from ordinary least squares?
The loss function above differs from the usual kind in that it has more than one set of unknowns: an item vector $q_i$ and a user vector $p_u$. The ALS optimization scheme can thus be summarized as:
fix the $q_i$ and solve for the new $p_u$; then fix that $p_u$ and solve for the new $q_i$; alternate until convergence.
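To make the least-squares half concrete: fixing all the item vectors and setting the gradient with respect to $p_u$ to zero yields a closed-form update (a standard derivation; the $\lambda$ term corresponds to the L2 regularization controlled by Spark's lambda parameter discussed below, and vanishes for the unregularized loss above):

$$p_u = \Big( \sum_{i \in I_u} q_i q_i^T + \lambda I \Big)^{-1} \sum_{i \in I_u} r_{ui}\, q_i$$

where $I_u$ is the set of items rated by user $u$; the update for $q_i$ with the user vectors held fixed is symmetric.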

Compared with the GD family, ALS has two main advantages:
1. The solve for each $q_i$ and each $p_u$ is independent of the others, so they can run in parallel for speed (see the sketch after this list).
2. In typical recommendation scenarios the number of user-item combinations is enormous; tens of millions of users against hundreds of thousands or even millions of items is common. Iterating over pairs one by one with GD or SGD is very slow in such a setting. Tricks like negative sampling help, but not dramatically. ALS, by contrast, can exploit matrix identities to work around the inefficiency.
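As an illustration of point 1, here is a minimal single-machine sketch of the per-user normal-equation solve (this is not Spark's actual implementation; solveUserFactor is a hypothetical helper, and Breeze, which Spark itself depends on, is assumed to be available). Each call touches only one user's data, so the solves can be dispatched independently:

import breeze.linalg.{DenseMatrix, DenseVector}

// Solve (Q Q^T + lambda * I) p_u = Q r_u for one user's factor vector.
// itemFactors: k x n matrix whose columns are the factors of the n items
//              this user rated
// ratings:     the user's n observed ratings, in the same column order
def solveUserFactor(itemFactors: DenseMatrix[Double],
                    ratings: DenseVector[Double],
                    lambda: Double): DenseVector[Double] = {
  val k = itemFactors.rows
  val a = itemFactors * itemFactors.t + DenseMatrix.eye[Double](k) * lambda
  val b = itemFactors * ratings
  a \ b  // Breeze solves the k x k linear system
}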

Every ALS iteration lowers the error, and the error is bounded below, so ALS is guaranteed to converge. Because the problem is non-convex, however, ALS does not necessarily converge to the global optimum. In practice ALS is not very sensitive to the initial point, and whether the global optimum is reached makes little practical difference. (Reference 1)

3. ALS in Spark

Now let's look at ALS in Spark. Both the mllib and ml packages implement it; the APIs differ slightly, but the underlying idea is the same, so we will use the mllib ALS as our example, on Spark 2.3.
First, the Rating class:

/**
 * A more compact class to represent a rating than Tuple3[Int, Int, Double].
 */
@Since("0.8.0")
case class Rating @Since("0.8.0") (
    @Since("0.8.0") user: Int,
    @Since("0.8.0") product: Int,
    @Since("0.8.0") rating: Double)

This class represents our training examples. It has three fields in total: user, product (item), and rating, denoting the user, the item, and the score respectively.
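For example, a ratings file can be parsed into an RDD[Rating] like this (a minimal sketch: sc is the usual SparkContext, and the file name ratings.csv with its comma-separated user,item,score layout is an assumption):

import org.apache.spark.mllib.recommendation.Rating

// Assumed input: one "user,item,score" triple per line, e.g. "42,1007,3.5"
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, item, score) = line.split(",")
  Rating(user.toInt, item.toInt, score.toDouble)
}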

Next, the ALS class:

/**
 * Alternating Least Squares matrix factorization.
 *
 * ALS attempts to estimate the ratings matrix `R` as the product of two lower-rank matrices,
 * `X` and `Y`, i.e. `X * Yt = R`. Typically these approximations are called 'factor' matrices.
 * The general approach is iterative. During each iteration, one of the factor matrices is held
 * constant, while the other is solved for using least squares. The newly-solved factor matrix is
 * then held constant while solving for the other factor matrix.
 *
 * This is a blocked implementation of the ALS factorization algorithm that groups the two sets
 * of factors (referred to as "users" and "products") into blocks and reduces communication by only
 * sending one copy of each user vector to each product block on each iteration, and only for the
 * product blocks that need that user's feature vector. This is achieved by precomputing some
 * information about the ratings matrix to determine the "out-links" of each user (which blocks of
 * products it will contribute to) and "in-link" information for each product (which of the feature
 * vectors it receives from each user block it will depend on). This allows us to send only an
 * array of feature vectors between each user block and product block, and have the product block
 * find the users' ratings and update the products based on these messages.
 *
 * For implicit preference data, the algorithm used is based on
 * "Collaborative Filtering for Implicit Feedback Datasets", available at
 * <a href="http://dx.doi.org/10.1109/ICDM.2008.22">here</a>, adapted for the blocked approach
 * used here.
 *
 * Essentially instead of finding the low-rank approximations to the rating matrix `R`,
 * this finds the approximations for a preference matrix `P` where the elements of `P` are 1 if
 * r &gt; 0 and 0 if r &lt;= 0. The ratings then act as 'confidence' values related to strength of
 * indicated user
 * preferences rather than explicit ratings given to items.
 */
@Since("0.8.0")
class ALS private (
    private var numUserBlocks: Int,
    private var numProductBlocks: Int,
    private var rank: Int,
    private var iterations: Int,
    private var lambda: Double,
    private var implicitPrefs: Boolean,
    private var alpha: Double,
    private var seed: Long = System.nanoTime()
  ) extends Serializable with Logging {
...

When reading a well-known open-source project, the comments are usually excellent and important material; understanding them goes a long way toward understanding the code.

ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices.

The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This comment conveys the essence of ALS succinctly:
1. ALS factorizes the rating matrix R into two low-rank matrices X and Y, with $X Y^T = R$.
2. These low-rank approximations are called 'factor' matrices.
3. The basic approach is iterative. In each iteration, one factor matrix is held constant while the other is solved for by least squares; the newly solved factor is then held constant in turn while the other one is solved for.

After reading that, don't you basically know how ALS works? Isn't the comment excellent?
The long middle section of the comment covers how Spark parallelizes the computation; we will leave that out of this article.

For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at <a href="http://dx.doi.org/10.1109/ICDM.2008.22">here</a>, adapted for the blocked approach used here.

Essentially instead of finding the low-rank approximations to the rating matrix `R`, this finds the approximations for a preference matrix `P` where the elements of `P` are 1 if r > 0 and 0 if r <= 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.

This part of the comment is also excellent and important: it brings up implicit feedback.

Implicit feedback is explained in detail in "Matrix Factorization for Recommender Systems Explained", so we won't repeat the discussion here; see the implicit-feedback section of that article.
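In formulas, and following the referenced paper, the implicit-feedback variant fits a binary preference matrix weighted by a confidence term (written here with the same $c_{ui} = 1 + \alpha d_{ui}$ notation used in the constructor discussion below, where $d_{ui}$ is the observed interaction strength):

$$P_{ui} = \begin{cases} 1 & d_{ui} > 0 \\ 0 & d_{ui} \le 0 \end{cases}, \qquad \min\limits_{q^*, p^*} \sum\limits_{(u, i)} c_{ui}\, (P_{ui} - q_i^T p_u)^2$$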

With the comment understood, the ALS constructor is easy to read:

class ALS private (
    private var numUserBlocks: Int,
    private var numProductBlocks: Int,
    private var rank: Int,
    private var iterations: Int,
    private var lambda: Double,
    private var implicitPrefs: Boolean,
    private var alpha: Double,
    private var seed: Long = System.nanoTime()
  ) extends Serializable with Logging {
...

numUserBlocks and numProductBlocks are parallelism parameters for Spark; rank is the dimension of the latent vectors we want; iterations is the number of algorithm iterations; lambda is the regularization parameter; implicitPrefs indicates whether the dataset is implicit; and alpha is the hyperparameter in the implicit-feedback confidence $c_{ui} = 1 + \alpha d_{ui}$.
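Putting the pieces together, training a model on the ratings RDD built earlier looks like this (a minimal sketch; the rank, iteration count, lambda, and alpha values are illustrative, not tuned recommendations):

import org.apache.spark.mllib.recommendation.ALS

// Explicit ratings: rank-10 factors, 10 iterations, regularization 0.01
val model = ALS.train(ratings, 10, 10, 0.01)

// Implicit feedback: same, plus the confidence hyperparameter alpha
val implicitModel = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0)

// Predict the score user 42 would give item 1007
val score = model.predict(42, 1007)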

References

1. https://blog.csdn.net/u011239443/article/details/51752904
