An Analysis of the Collaborative Filtering Algorithm in Spark

Original

qq_23617681

2020-02-20 15:29

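MLlib is Spark's machine learning library.

Its collaborative filtering algorithm is called ALS, short for alternating least squares.

Below we go through the idea behind the algorithm and then the code that runs it.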

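The idea behind the algorithm:

1. Users, items, and ratings together form a rating matrix. This matrix is usually sparse: no user rates every item, so many entries are unknown.

2. Our goal is to fill in this matrix, so that we can predict a user's rating for a given item and then make recommendations.

3. Working with the original matrix directly is extremely expensive, and also unnecessary. Instead we compute a low-rank approximation of it (the product of a user-factor matrix and an item-factor matrix), which captures, at a coarse level, the associations between users and items, i.e. their similarity.

This similarity is the basis on which the recommender system is built.

4. The computation uses least squares.

Because the objective has many variables, we fix one group of variables, optimize the other group, then swap the roles and repeat. Hence the name alternating least squares.

The stopping criterion is: once the sum of squared differences in the objective falls below a preset threshold, the optimum is considered found. This optimum is therefore not necessarily a global one. (The Spark example below does not pass a threshold; the caller supplies a fixed iteration count, numIterations.)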
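
The alternating scheme described above can be sketched in plain Java, without Spark. This is only an illustration under simplifying assumptions: the rank is fixed at 1 (one factor per user and per item), so each least-squares solve reduces to a single division. The class name Rank1AlsSketch, its method names, the use of NaN for unknown ratings, and the tolerance parameter are all hypothetical and are not part of MLlib.

```java
import java.util.Arrays;

public class Rank1AlsSketch {

    /** Re-solves every user factor by regularized least squares; item factors stay fixed. */
    static void solveUsers(double[][] r, double[] users, double[] items, double lambda) {
        for (int u = 0; u < users.length; u++) {
            double num = 0.0;
            double den = lambda; // assumes lambda > 0, which keeps the denominator non-zero
            for (int i = 0; i < items.length; i++) {
                if (!Double.isNaN(r[u][i])) { // NaN marks an unknown rating (sketch convention)
                    num += r[u][i] * items[i];
                    den += items[i] * items[i];
                }
            }
            users[u] = num / den;
        }
    }

    /** Re-solves every item factor by regularized least squares; user factors stay fixed. */
    static void solveItems(double[][] r, double[] users, double[] items, double lambda) {
        for (int i = 0; i < items.length; i++) {
            double num = 0.0;
            double den = lambda;
            for (int u = 0; u < users.length; u++) {
                if (!Double.isNaN(r[u][i])) {
                    num += r[u][i] * users[u];
                    den += users[u] * users[u];
                }
            }
            items[i] = num / den;
        }
    }

    /** Sum of squared differences between known ratings and their predictions. */
    static double squaredError(double[][] r, double[] users, double[] items) {
        double sum = 0.0;
        for (int u = 0; u < users.length; u++) {
            for (int i = 0; i < items.length; i++) {
                if (!Double.isNaN(r[u][i])) {
                    double diff = r[u][i] - users[u] * items[i];
                    sum += diff * diff;
                }
            }
        }
        return sum;
    }

    /** Alternates the two solves; returns {userFactors, itemFactors}. */
    static double[][] fit(double[][] r, int maxIterations, double lambda, double tolerance) {
        double[] users = new double[r.length];
        double[] items = new double[r[0].length];
        Arrays.fill(items, 1.0); // deterministic starting point for the sketch
        for (int it = 0; it < maxIterations; it++) {
            solveUsers(r, users, items, lambda);
            solveItems(r, users, items, lambda);
            if (squaredError(r, users, items) < tolerance) {
                break; // stop early once the squared error is below the preset value
            }
        }
        return new double[][] {users, items};
    }

    public static void main(String[] args) {
        double[][] r = { {1, 2, 3}, {2, 4, Double.NaN} }; // user 1 has not rated item 2
        double[][] f = fit(r, 50, 0.01, 0.0);
        System.out.println("Predicted rating for user 1, item 2: " + f[0][1] * f[1][2]);
    }
}
```

Each half-step is an ordinary least-squares problem because one side is held constant, which is why the overall procedure can settle on a local rather than a global optimum. A higher rank would replace each division with the solution of a small linear system.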

The Java source code is as follows:

package sparkTest;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// $example on$

import scala.Tuple2;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import org.apache.spark.SparkConf;
// $example off$

public class JavaRecommendationExample {
  public static void main(String args[]) {
    // $example on$
    SparkConf conf = new SparkConf().setAppName("Java Collaborative Filtering Example").setMaster("local");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Load and parse the data
    String path = "input/test.data";
    JavaRDD<String> data = jsc.textFile(path);
    JavaRDD<Rating> ratings = data.map(new Function<String, Rating>() {	// a Rating holds user, item (product), and rating
        public Rating call(String s) {
          String[] sarray = s.split(",");
          return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), Double.parseDouble(sarray[2]));
        }
      }
    );

    // Build the recommendation model using ALS
    int rank = 10;
    int numIterations = 10;
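    // MatrixFactorizationModel: the matrix factorization model
    // toRDD: converts a JavaRDD to an RDD
    /* Parameters of train():
     * RDD<Rating>: the original rating matrix
     * rank: number of latent factors in the model
     * iterations: number of iterations
     * lambda: regularization parameter, guards against overfitting
     */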
    MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, 0.01);

    // Evaluate the model on rating data
    /* About this map():
     * rating.user()
     * rating.product()
     * returns a new JavaRDD of (user, product) pairs
     */
    JavaRDD<Tuple2<Object, Object>> userProducts = ratings.map(new Function<Rating, Tuple2<Object, Object>>() {
        public Tuple2<Object, Object> call(Rating r) {
          return new Tuple2<Object, Object>(r.user(), r.product());
        }
      }
    );
    
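    // Predict
    /* model.predict(): predicts the rating for each (user, product) pair
     * argument: the RDD converted from the JavaRDD
     * returns RDD<Rating>
     */
    /* RDD<Rating>.toJavaRDD() converts back to a JavaRDD, which makes the following map convenient
     */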
    JavaPairRDD<Tuple2<Integer, Integer>, Double> predictions = JavaPairRDD.fromJavaRDD(
      model.predict(JavaRDD.toRDD(userProducts)).toJavaRDD().map(new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
          public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r){
            return new Tuple2<Tuple2<Integer, Integer>, Double>(
              new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
          }
        }
      ));
    
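    /* JavaPairRDD.fromJavaRDD(): converts a JavaRDD of key-value tuples into a JavaPairRDD
     * JavaPairRDD.join(JavaPairRDD other): joins this RDD with other on equal keys
     * In other words: pairs up the actual rating and the predicted rating for the same user and product
     * JavaPairRDD.values(): returns the values of the key-value pairs
     */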
    JavaRDD<Tuple2<Double, Double>> ratesAndPreds =
      JavaPairRDD.fromJavaRDD(ratings.map(new Function<Rating, Tuple2<Tuple2<Integer, Integer>, Double>>() {
          public Tuple2<Tuple2<Integer, Integer>, Double> call(Rating r){
            return new Tuple2<Tuple2<Integer, Integer>, Double>(
              new Tuple2<Integer, Integer>(r.user(), r.product()), r.rating());
          }
        }
      )).join(predictions).values();
    
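    /* JavaDoubleRDD.fromRDD(): wraps an RDD as a JavaDoubleRDD
     * Function(): computes the squared error between the predicted and the actual rating
     * rdd(): converts the JavaRDD to an RDD
     * mean(): computes the mean of all elements in the JavaDoubleRDD
     */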
    double MSE = JavaDoubleRDD.fromRDD(ratesAndPreds.map(new Function<Tuple2<Double, Double>, Object>() {
        public Object call(Tuple2<Double, Double> pair) {
          Double err = pair._1() - pair._2();
          return err * err;
        }
      }
    ).rdd()).mean();
    System.out.println("Mean Squared Error = " + MSE);	// print the mean squared error
    
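    /* The recommender is not actually finished yet:
     * the mean squared error alone does not tell us what to recommend to a user.
     * We still need to pick, from the filled-in matrix (with predicted ratings), the products suited to a given user, or the users suited to a given product.
     * The corresponding functions are: model.recommendProducts(user, num);
     * model.recommendUsers(product, num);
     */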
    
    // Save and load model
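    // save() stores the model at the given location; once it is loaded again later, the recommendation functions above can be called directly to produce recommendations.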
    model.save(jsc.sc(), "target/tmp/myCollaborativeFilter");
    MatrixFactorizationModel sameModel = MatrixFactorizationModel.load(jsc.sc(), "target/tmp/myCollaborativeFilter");
    // $example off$

    jsc.stop();  // shut down the Spark context once the job is done
  }
}
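
The comments in the listing point out that the mean squared error alone does not produce recommendations, and name model.recommendProducts(user, num) for that step. The idea behind such a top-N lookup can be sketched without Spark: score every product by the dot product of its factor vector with the user's factor vector, then keep the highest scores. The class TopNSketch, its method names, and the array-based factor layout are hypothetical and only illustrate the principle; this is not MLlib's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class TopNSketch {

    /** Predicted rating: dot product of a user factor vector and a product factor vector. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += a[k] * b[k];
        }
        return sum;
    }

    /** Returns the indices of the num highest-scoring products for one user, best first. */
    static List<Integer> recommendProducts(double[] userFactors, double[][] productFactors, int num) {
        List<Integer> ids = new ArrayList<>();
        for (int p = 0; p < productFactors.length; p++) {
            ids.add(p);
        }
        // Sort product indices by predicted rating, highest first.
        ids.sort((a, b) -> Double.compare(
            dot(userFactors, productFactors[b]),
            dot(userFactors, productFactors[a])));
        return new ArrayList<>(ids.subList(0, Math.min(num, ids.size())));
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.0};                                   // made-up factors
        double[][] products = { {0.5, 9.0}, {2.0, 0.0}, {1.0, 1.0} }; // made-up factors
        System.out.println("Top 2 products: " + recommendProducts(user, products, 2));
    }
}
```

Recommending users for a product works the same way with the roles of the two factor sets swapped.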

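Note:

There are several models to choose from: MatrixFactorizationModel (matrix factorization), FactorizationModel (factorization), and LinearRegressionModel (linear regression) all support rating prediction.

See reference 3.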

References:

1. http://www.tuicool.com/articles/fANvieZ

2. http://www.mamicode.com/info-detail-865258.html

3. http://blog.jobbole.com/86959/

4. http://blog.sina.com.cn/s/blog_5773f1390102w2zr.html (a worked movie recommender example, with code)
