Part 4 - Recommendation System - Offline Recommendation
- This module uses the model obtained in Section 4 to generate offline recommendations: the 5 movies each user is most likely to enjoy.
A few notes:
1. The work splits into two modules: one generates recommendations for a single, randomly chosen user; the other generates recommendations for all users and persists the results.
2. The recommendation results can be saved to MySQL, HBase, or Hive (three variants).
3. Any userId used here must exist in the model, so it is best taken directly from trainingData.
1. Generating recommendations for a single user
Module 1 coding
Step 1: In the existing project, create a config package with an AppConf trait
To avoid duplicated code, we extract a shared trait:
package com.csylh.recommend.config

import java.util.Properties

import org.apache.spark.sql.SparkSession

/**
  * Description: Trait that subsequent ETL (and other) jobs mix in for shared configuration
  *
  * @Author: 留歌36
  * @Date: 2019-07-17 16:53
  */
trait AppConf {
  val localMasterURL = "local[2]"
  val clusterMasterURL = "spark://hadoop001:7077"

  // Program against SparkSession
  val spark = SparkSession.builder()
    // .master(localMasterURL)
    .enableHiveSupport() // enable Hive access; hive-site.xml etc. must be on Spark's conf path
    .getOrCreate()

  val sc = spark.sparkContext

  // The number of RDD partitions is best set to an integer multiple of the CPU cores
  // allocated to the application; with 4 cores / 8 GB, 8 is a reasonable choice.
  // Question 1: why an integer multiple of the core count?
  // Question 2: under data skew, the partitions holding the most data dominate the runtime,
  // so consider during preprocessing whether skew is likely.
  val minPartitions = 8

  // In production, always tune spark.sql.shuffle.partitions (the number of shuffle partitions);
  // the default is 200.
  val shuffleMinPartitions = "8"
  spark.sqlContext.setConf("spark.sql.shuffle.partitions", shuffleMinPartitions)

  // JDBC connection
  val jdbcURL = "jdbc:mysql://hadoop001:3306/recommend?useUnicode=true&characterEncoding=UTF-8&useSSL=false"
  val alsTable = "recommend.alsTab"
  val recResultTable = "recommend.similarTab"
  val top5Table = "recommend.top5Result"
  val userTable = "recommend.user"
  val ratingTable = "recommend.rating"

  val mysqlusername = "root"
  val mysqlpassword = "123456"
  val prop = new Properties
  prop.put("driver", "com.mysql.jdbc.Driver")
  prop.put("user", mysqlusername)
  prop.put("password", mysqlpassword)
}
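The JDBC portion of the trait can be sanity-checked without a live database. A minimal plain-Scala sketch; the hostname and credentials are the tutorial's own placeholder values:

```scala
import java.util.Properties

// Same shape as in AppConf: the connection URL plus the Properties bag
// that DataFrameWriter.jdbc(url, table, prop) expects.
val jdbcURL = "jdbc:mysql://hadoop001:3306/recommend?useUnicode=true&characterEncoding=UTF-8&useSSL=false"

val prop = new Properties
prop.put("driver", "com.mysql.jdbc.Driver")
prop.put("user", "root")
prop.put("password", "123456")
```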
Step 2: Create the Recommender object
package com.csylh.recommend.ml

import com.csylh.recommend.config.AppConf
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

/**
  * Description: Generate recommendations for a single user
  *
  * @Author: 留歌36
  * @Date: 2019-07-18 10:04
  */
object Recommender extends AppConf {
  def main(args: Array[String]): Unit = {
    // Any userId taken from trainingData is guaranteed to exist in the model.
    val users = spark.sql("select distinct(userId) from trainingData order by userId asc")

    // Pick an arbitrary user
    val index = 36
    val uid = users.take(index).last.getInt(0)

    val modelpath = "/tmp/BestModel/0.8521581387523667"
    val model = MatrixFactorizationModel.load(sc, modelpath)
    val rec = model.recommendProducts(uid, 5)
    val recmoviesid = rec.map(_.product)

    println("Top 5 movies recommended for user " + uid + ":")

    /**
      * Variant 1: for comprehension
      */
    for (i <- recmoviesid) {
      val moviename = spark.sql(s"select title from movies where movieId=$i").first().getString(0)
      println(moviename)
    }

    //    /**
    //      * Variant 2: foreach
    //      */
    //    recmoviesid.foreach(x => {
    //      val moviename = spark.sql(s"select title from movies where movieId=$x").first().getString(0)
    //      println(moviename)
    //    })
    //
    //    /**
    //      * Variant 3: map (foreach is the better fit for side effects; map builds an unused result)
    //      */
    //    recmoviesid.map(x => {
    //      val moviename = spark.sql(s"select title from movies where movieId=$x").first().getString(0)
    //      println(moviename)
    //    })
  }
}
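Conceptually, `model.recommendProducts(uid, 5)` scores candidate movies for the user and keeps the five with the highest predicted rating. The selection step can be sketched with plain collections; the `Rating` class and the scores below are hypothetical stand-ins for MLlib's:

```scala
// Hypothetical stand-in for org.apache.spark.mllib.recommendation.Rating
case class Rating(user: Int, product: Int, rating: Double)

// Made-up predicted scores for user 53 over a handful of movies
val scored = Seq(
  Rating(53, 101, 3.1), Rating(53, 102, 4.7), Rating(53, 103, 2.2),
  Rating(53, 104, 4.9), Rating(53, 105, 3.8), Rating(53, 106, 4.1),
  Rating(53, 107, 1.5)
)

// Keep the five highest-scoring products, as recommendProducts(uid, 5) does
val top5 = scored.sortBy(-_.rating).take(5)
val recmoviesid = top5.map(_.product)
```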
Step 3: Package the project and upload it to the server
mvn clean package -Dmaven.test.skip=true
Step 4: Write the shell launch script
[root@hadoop001 ml]# vim recommender.sh
export HADOOP_CONF_DIR=/root/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop
$SPARK_HOME/bin/spark-submit \
--class com.csylh.recommend.ml.Recommender \
--master spark://hadoop001:7077 \
--name Recommender \
--driver-memory 10g \
--executor-memory 5g \
/root/data/ml/movie-recommend-1.0.jar
Step 5: Run sh recommender.sh
[root@hadoop001 ml]# sh recommender.sh
19/10/20 21:39:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/10/20 21:40:07 WARN MatrixFactorizationModel: User factor does not have a partitioner. Prediction on individual records could be slow.
19/10/20 21:40:07 WARN MatrixFactorizationModel: User factor is not cached. Prediction could be slow.
19/10/20 21:40:07 WARN MatrixFactorizationModel: Product factor does not have a partitioner. Prediction on individual records could be slow.
19/10/20 21:40:07 WARN MatrixFactorizationModel: Product factor is not cached. Prediction could be slow.
Top 5 movies recommended for user 53:
8 Murders a Day (2011)
49 Pulses (2017)
Styx - Caught In The Act (2007)
The Change
Earth's Natural Wonders (2016)
[root@hadoop001 ml]#
Note: val uid = users.take(index).last.getInt(0) // index=36 yields uid=53
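The indexing in that line is easy to misread: `users.take(index)` returns the first `index` rows, and `.last` then picks row number `index` (1-based). The same pattern with an in-memory stand-in for the sorted userId column (the ids are made up):

```scala
// userIds stands in for the sorted distinct userId column of trainingData
val userIds = (1 to 100).map(_ * 2) // hypothetical ids: 2, 4, 6, ...

val index = 36
// take(index) keeps the first `index` elements; .last is element number `index`
val uid = userIds.take(index).last
```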
2. Generating recommendations for all users
Module 2 coding
Step 1: Create RecommendForAllUsers in the ml package
package com.csylh.recommend.ml

import com.csylh.recommend.config.AppConf
import com.csylh.recommend.entity.Result
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation._
import org.apache.spark.sql.{SaveMode, SparkSession}

/**
  * Description: Generate recommendations for all users
  *
  * @Author: 留歌36
  * @Date: 2019-07-18 10:42
  */
object RecommendForAllUsers extends AppConf {
  def main(args: Array[String]): Unit = {
    val users = spark.sql("select distinct(userId) from trainingData order by userId asc")

    // Collect all user ids to the driver as an iterator
    val allusers = users.rdd.map(_.getInt(0)).toLocalIterator

    // Approach 1: works, but inefficient -- users are processed one at a time
    val modelpath = "/tmp/BestModel/0.8521581387523667"
    val model = MatrixFactorizationModel.load(sc, modelpath)
    while (allusers.hasNext) {
      val rec = model.recommendProducts(allusers.next(), 5) // yields Array[Rating]
      writeRecResultToMysql(rec, spark, sc)
      // writeRecResultToSparkSQL(rec): write to Spark SQL (DataFrame) + Hive, same as the ETL step.
      // writeRecResultToHbase(rec, sqlContext, sc)
    }

    // Approach 2: not viable here -- it materializes recommendations for the whole
    // factor matrix at once, which consumes too much memory
    // val recResult = model.recommendProductsForUsers(5)

    def writeRecResultToMysql(uid: Array[Rating], spark: SparkSession, sc: SparkContext) {
      val uidString = uid.map(x => x.user.toString() + ","
        + x.product.toString() + "," + x.rating.toString())

      import spark.implicits._
      val uidDFArray = sc.parallelize(uidString)
      val uidDF = uidDFArray.map(_.split(",")).map(x => Result(x(0).trim().toInt, x(1).trim.toInt, x(2).trim().toDouble)).toDF

      // Write to MySQL; the connection settings live in AppConf
      uidDF.write.mode(SaveMode.Append).jdbc(jdbcURL, alsTable, prop)
    }

    //    // Write the recommendation results to Phoenix + HBase via an RDD; not recommended.
    //    val hbaseConnectionString = "localhost"
    //    val userTupleRDD = users.rdd.map { x => Tuple3(x.getInt(0), x.getInt(1), x.getDouble(2)) }
    //    // zkUrl must match the ZooKeeper url configured for HBase; use localhost in local mode
    //    userTupleRDD.saveToPhoenix("NGINXLOG_P", Seq("USERID", "MOVIEID", "RATING"), zkUrl = Some(hbaseConnectionString))
    //
    //    // Write the recommendation results to Phoenix + HBase via a DataFrame; not recommended.
    //    def writeRecResultToHbase(uid: Array[Rating], sqlContext: SQLContext, sc: SparkContext) {
    //      val uidString = uid.map(x => x.user.toString() + "|"
    //        + x.product.toString() + "|" + x.rating.toString())
    //      import sqlContext.implicits._
    //      // note: split takes a regex, so the pipe must be escaped as "\\|"
    //      val uidDF = sc.parallelize(uidString).map(_.split("\\|")).map(x => Result(x(0).trim().toInt, x(1).trim.toInt, x(2).trim().toDouble)).toDF
    //      // zkUrl must match the ZooKeeper url configured for HBase
    //      uidDF.save("org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> "phoenix_rec", "zkUrl" -> "hadoop001:2181"))
    //    }
  }
}
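The writeRecResultToMysql helper round-trips each Rating through a comma-separated string before building the DataFrame. That encode/decode step can be checked without Spark; `Rating` and `Result` here are local stand-ins for the MLlib class and the project's entity class:

```scala
// Local stand-ins for org.apache.spark.mllib.recommendation.Rating
// and com.csylh.recommend.entity.Result
case class Rating(user: Int, product: Int, rating: Double)
case class Result(userId: Int, movieId: Int, rating: Double)

val rec = Array(Rating(53, 104, 4.9), Rating(53, 102, 4.7))

// Same encoding as in the method: "user,product,rating"
val uidString = rec.map(x => x.user.toString + "," + x.product.toString + "," + x.rating.toString)

// Same decoding: split on "," and rebuild a typed row
val rows = uidString.map(_.split(",")).map(x => Result(x(0).trim.toInt, x(1).trim.toInt, x(2).trim.toDouble))
```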
Step 2: Package the project and upload it to the server
mvn clean package -Dmaven.test.skip=true
Step 3: Write the shell launch script
[root@hadoop001 ml]# vim RecommendForAllUsers.sh
export HADOOP_CONF_DIR=/root/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop
$SPARK_HOME/bin/spark-submit \
--class com.csylh.recommend.ml.RecommendForAllUsers \
--master spark://hadoop001:7077 \
--name RecommendForAllUsers \
--driver-memory 10g \
--executor-memory 5g \
--packages "mysql:mysql-connector-java:5.1.38" \
/root/data/ml/movie-recommend-1.0.jar
Step 4: Run sh RecommendForAllUsers.sh
Questions are welcome in the comments~~
More articles in the series "Spark-based movie recommendation system": https://blog.csdn.net/liuge36/column/info/29285