Article Recall: ALS-Based Collaborative Filtering

Full source code for this project: https://github.com/angeliababy/ALS_col

Project blog post: https://blog.csdn.net/qq_29153321/article/details/104007318

Principle

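ALS (alternating least squares) belongs to the User-Item CF family, also called hybrid CF, because it takes both the User side and the Item side into account.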

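The relationship between users and items can be abstracted as a triple <User, Item, Rating>, where Rating is the score a user gives an item and expresses how much the user likes that item.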

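No user can rate every item, so the rating matrix R built from these triples (one row per user, one column per item) is bound to be sparse.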

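Given this sparsity, we can assume that users and items are related through a number of latent dimensions (for example the user's age, gender, and education level, or the item's appearance and price), and we only need to project the matrix R onto those dimensions.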

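We do not need to define these dimensions explicitly; it is enough to assume that they exist. Their number, k, is generally much smaller than n and m (the dimensions of R), which is what achieves the dimensionality reduction. Typical values of k range from 20 to 200.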
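For reference, the factorization described above is usually written as follows. This is the standard textbook formulation of ALS rather than something stated in the original post; the symbols X, Y, λ, and Ω (the set of observed ratings), and the convention that R has m rows (users) and n columns (items), are assumptions of this sketch.

```latex
R_{m \times n} \approx X_{m \times k} \, Y_{n \times k}^{\top}, \qquad k \ll \min(m, n)

\min_{X, Y} \sum_{(u, i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

Here x_u and y_i are the rows of X and Y. With Y held fixed, the problem becomes an ordinary regularized least-squares fit for each x_u, and vice versa, which is why the two matrices are updated alternately.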

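The factor matrices X and Y obtained this way let us compute a user's score for items they have not yet rated. They can also be used to compare the similarity between different Users (or Items).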
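To make the idea concrete, here is a minimal NumPy sketch of alternating least squares on a tiny rating matrix. It is a toy illustration, not the Spark implementation used later in this post: the function names `als_factorize` and `cosine_similarity`, the 4×4 matrix, and the choice of k=2 and lam=0.1 are all assumptions made for the example, and a rating of 0 is taken to mean "not rated".

```python
import numpy as np

def als_factorize(R, observed, k=2, lam=0.1, iters=20, seed=0):
    """Toy ALS: factor R (users x items) into X (users x k) and Y (items x k)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = rng.normal(scale=0.1, size=(m, k))
    Y = rng.normal(scale=0.1, size=(n, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Fix Y, then solve a small ridge regression for each user's factor row.
        for u in range(m):
            Yu = Y[observed[u]]
            X[u] = np.linalg.solve(Yu.T @ Yu + reg, Yu.T @ R[u, observed[u]])
        # Fix X, then do the same for each item's factor row.
        for i in range(n):
            Xi = X[observed[:, i]]
            Y[i] = np.linalg.solve(Xi.T @ Xi + reg, Xi.T @ R[observed[:, i], i])
    return X, Y

def cosine_similarity(a, b):
    """Cosine of the angle between two factor vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative ratings only; 0 marks "not rated" in this toy example.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
observed = R > 0
X, Y = als_factorize(R, observed)
predicted = X @ Y.T  # also contains scores for the unrated cells
rmse_observed = float(np.sqrt(np.mean((R - predicted)[observed] ** 2)))
sim_users_0_1 = cosine_similarity(X[0], X[1])
```

`predicted` fills in the cells the users never rated, and comparing rows of X (or of Y) with `cosine_similarity` is one simple way to measure how alike two users (or two items) are.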

Pros and cons:
First of all, collaborative filtering is not a global recommendation method.

Practice

Data preparation
The input is user-article score data; a publicly available movie rating dataset can be used instead.

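Scoring scheme for user-article interactions:
1. Not a valid read (reading time < 2 seconds): score 0
2. Read: 2 points × proportion of the article read
3. Like: 3 points
4. Comment: 4 points
5. Favorite or share: 5 points
6. Dislike: -1 point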
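The scoring rules above can be sketched as a small function. The post does not say how the rules combine when several apply to the same read, so this sketch assumes that an invalid read always scores 0, that a dislike overrides everything else, and that otherwise the highest-valued action wins. It also assumes `readed_percent` is an integer from 0 to 100. The function name `score_read` is hypothetical and is not taken from the project code.

```python
def score_read(read_seconds, readed_percent, is_praise=False, is_review=False,
               is_collect=False, is_try_share=False, is_tread=False):
    """Score one read event. The rule precedence is an assumption of this sketch."""
    if read_seconds < 2:
        return 0.0   # rule 1: not a valid read
    if is_tread:
        return -1.0  # rule 6: dislike (assumed to override positive signals)
    if is_collect or is_try_share:
        return 5.0   # rule 5: favorite or share
    if is_review:
        return 4.0   # rule 4: comment
    if is_praise:
        return 3.0   # rule 3: like
    # rule 2: plain read, 2 points times the fraction of the article read
    return round(2.0 * readed_percent / 100.0, 2)

plain_read = score_read(read_seconds=10, readed_percent=52)
liked_read = score_read(read_seconds=10, readed_percent=30, is_praise=True)
```

If the project instead sums the signals or uses a different precedence, only the order and combination of the branches would need to change.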

Build the training data according to the table schema below; only the fields actually used in the code need to be populated.

CREATE TABLE `news_read` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `user_id` int(11) NOT NULL DEFAULT '0',
  `member_id` int(11) NOT NULL DEFAULT '0',
  `channel_id` int(11) DEFAULT NULL COMMENT 'Channel ID',
  `news_id` int(11) DEFAULT NULL COMMENT 'Article ID',
  `entry_datetime` datetime DEFAULT NULL COMMENT 'Entry time',
  `leave_datetime` datetime DEFAULT NULL COMMENT 'Leave time',
  `readed_percent` int(11) DEFAULT NULL COMMENT 'Reading progress',
  `is_try_share` bit(1) DEFAULT b'0' COMMENT 'Whether a share was attempted',
  `review_count` int(11) DEFAULT '0' COMMENT 'Number of comments',
  `is_review` bit(1) DEFAULT b'0' COMMENT 'Whether commented',
  `is_collect` bit(1) DEFAULT NULL COMMENT 'Whether favorited',
  `is_praise` bit(1) DEFAULT NULL COMMENT 'Whether liked',
  `is_tread` bit(1) DEFAULT b'0' COMMENT 'Whether disliked',
  `lang` int(11) DEFAULT '-1' COMMENT '1: Chinese, 2: Khmer, 3: English',
  `platform` int(11) DEFAULT '-1' COMMENT '1: Android, 2: Apple',
  `batch_no` varchar(64) NOT NULL DEFAULT '' COMMENT 'Batch number',
  `source` tinyint(4) NOT NULL DEFAULT '99' COMMENT 'Article read entry point (1: history, 2: favorites, 3: channel (deprecated), 4: related recommendations, 5: search, 6: user feed, 7: news module, 8: video module, 9: short-video module, 99: other)',
  `news_type` tinyint(4) NOT NULL DEFAULT '-1' COMMENT 'Article type (1: text and images, 2: photo gallery, 3: video, 4: novel, 6: ad promotion, 8: short video)',
  `add_datetime` datetime DEFAULT NULL COMMENT 'Time added',
  `app_version` varchar(128) DEFAULT '' COMMENT 'App version',
  `ip` varchar(255) DEFAULT '' COMMENT 'Client IP',
  `device_id` varchar(255) DEFAULT '' COMMENT 'Device ID',
  `country` varchar(255) DEFAULT NULL,
  `province` varchar(255) DEFAULT NULL,
  `city` varchar(255) DEFAULT NULL,
  `uuid` varchar(128) DEFAULT '' COMMENT 'Globally unique client ID',
  `log_id` varchar(64) NOT NULL DEFAULT '' COMMENT 'Log ID',
  `relation_news_ids` varchar(255) NOT NULL DEFAULT '' COMMENT 'Articles exposed under further reading',
  PRIMARY KEY (`id`),
  KEY `news_id` (`news_id`,`channel_id`,`member_id`) USING BTREE,
  KEY `channel_id` (`channel_id`,`news_id`,`member_id`),
  KEY `add_datetime` (`add_datetime`),
  KEY `channelId_memberId_time` (`channel_id`,`member_id`,`add_datetime`),
  KEY `member_id` (`member_id`,`add_datetime`,`channel_id`) USING BTREE,
  KEY `user_id` (`user_id`,`add_datetime`,`channel_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=53485290 DEFAULT CHARSET=utf8mb4 COMMENT='Article reading behavior table'

# Keep only the columns the scoring job needs
news_read = news_read.select(
    "user_id", "member_id", "news_id", "readed_percent",
    "is_try_share", "is_collect", "is_review", "is_praise",
    "entry_datetime", "leave_datetime", "news_type", "add_datetime", "id",
)

Modify the data-loading part of the program to match your data source.
Then adjust the paths and run the program:

spark-submit --master local --jars /home/spider/code/reconmmend_test/Trunk/news_recommed_reckon_java/action_reckon/searchword/lib/mysql-connector-java-5.1.38.jar --conf spark.pyspark.python=/usr/local/bin/python3 --conf spark.pyspark.driver.python=/usr/local/bin/python3  offline/user_news_score_t.py
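The command above ships the MySQL Connector/J 5.1.38 jar, which suggests the job reads `news_read` over JDBC. A minimal sketch of what the data-loading part might look like is shown below. The helper names `mysql_jdbc_options` and `load_table`, and every connection detail (host, database, user, password), are placeholders invented for this example, not values from the project.

```python
def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Build the option dict for Spark's JDBC data source (MySQL Connector/J 5.1.x)."""
    return {
        "url": "jdbc:mysql://{}:{}/{}".format(host, port, database),
        "driver": "com.mysql.jdbc.Driver",  # driver class name used by Connector/J 5.1
        "dbtable": table,
        "user": user,
        "password": password,
    }

def load_table(spark, options):
    """Read one table into a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()

# Placeholder connection details, not taken from the project.
options = mysql_jdbc_options("localhost", "news", "news_read", "reader", "secret")
# Inside the Spark job: news_read = load_table(spark, options)
```

Keeping the options in one dict makes it easy to point the same job at a different database, or to swap the loader for a CSV reader when working from an exported file.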

ALS evaluation
The previous step generates the news_score data. A sample:
user_id,member_id,item,score,log_time
51778098,898036,8644098,1.04,2020-01-06 10:39:21

1. Training data

from pyspark.mllib.recommendation import Rating, ALS
# 1. Training data: (user index, item, rating)
ratings = user_data_join.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
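`Rating` takes integer user and product ids, and the comment above speaks of a "user index", which suggests raw user ids are first mapped to consecutive integers. The sketch below shows one way such a mapping could be built. `build_index` is a hypothetical helper, not the project's `get_data`, and the second user id in the sample rows is made up for illustration.

```python
def build_index(raw_ids, index=None):
    """Map each distinct raw id to a consecutive integer, reusing an existing mapping."""
    index = dict(index or {})
    for raw in raw_ids:
        if raw not in index:
            index[raw] = len(index)  # next unused integer
    return index

# (user_id, item, score) rows; only the first user id appears in the post.
rows = [
    ("51778098", 8644098, 1.04),
    ("51778099", 8644098, 3.0),
    ("51778098", 8644100, 5.0),
]
user_index = build_index(r[0] for r in rows)
triples = [(user_index[u], int(item), float(score)) for u, item, score in rows]
```

Passing the existing mapping back in through `index` keeps the indices stable when new data arrives, which matters if the same mapping is reused for the training and test sets.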

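2. Training

# rank=50 latent factors, 10 iterations, regularization lambda_=0.01
model = ALS.train(ratings, rank=50, iterations=10, lambda_=0.01)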

3. Recommending items to a single user

# Top 10 recommendations for the user with index 3
user = 3
topKRecs = model.recommendProducts(user, 10)
for i in topKRecs:
    print(i)
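Conceptually, a top-K recommendation from a factorization model scores every item by the dot product of the user's factor vector with the item's factor vector and keeps the highest scores. The NumPy sketch below illustrates that idea with made-up factors; `top_k_items` and all the numbers are assumptions for the example, not Spark's internals.

```python
import numpy as np

def top_k_items(user_factors, item_factors, item_ids, k):
    """Return the k (item_id, score) pairs with the highest dot-product scores."""
    scores = item_factors @ user_factors  # one score per item
    order = np.argsort(-scores)[:k]       # indices of the largest scores first
    return [(item_ids[j], float(scores[j])) for j in order]

# Illustrative factors with k=2 latent dimensions.
item_ids = [101, 102, 103]
item_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]])
user_factors = np.array([2.0, 0.5])
recs = top_k_items(user_factors, item_factors, item_ids, k=2)
```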

4. Test data
Seven days of data are used for training and one day for testing.

# Process the data into (user index, article, score)
user_data_join, user_label = get_data(datas, user_label)
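A time-based split like the one described (seven days for training, the following day for testing) can be sketched in plain Python. The row layout follows the news_score sample shown earlier; `split_by_time`, the chosen test date, and the second sample row are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def split_by_time(rows, test_start, train_days=7, test_days=1):
    """Split rows of (user_id, member_id, item, score, log_time) by log_time."""
    train_start = test_start - timedelta(days=train_days)
    test_end = test_start + timedelta(days=test_days)
    train, test = [], []
    for row in rows:
        t = datetime.strptime(row[4], "%Y-%m-%d %H:%M:%S")
        if train_start <= t < test_start:
            train.append(row)   # falls inside the training window
        elif test_start <= t < test_end:
            test.append(row)    # falls inside the test day
    return train, test

rows = [
    (51778098, 898036, 8644098, 1.04, "2020-01-06 10:39:21"),
    (51778098, 898036, 8644099, 3.00, "2020-01-08 09:00:00"),  # illustrative row
]
train, test = split_by_time(rows, test_start=datetime(2020, 1, 8))
```

Splitting by time rather than at random keeps the evaluation closer to production, where the model always predicts behavior that happens after the data it was trained on.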

5. Prediction

# (user, product) pairs whose ratings we want to predict
userProducts = ratings.map(lambda rating: (rating.user, rating.product))
print('Actually rated items:', userProducts.take(5))
# print(model.predictAll(userProducts).collect()[0])
predictions = model.predictAll(userProducts)

6. Evaluation

# Key each prediction by (user, product) so it can be joined with the actual rating
predictions = model.predictAll(userProducts).map(lambda rating: ((rating.user, rating.product), rating.rating))
print('Predicted ratings:', predictions.take(5))

# Join on (user, product): each value is (actual rating, predicted rating)
ratingsAndPredictions = ratings.map(lambda rating: ((rating.user, rating.product), rating.rating)).join(predictions)
print('Actual and predicted ratings joined:', ratingsAndPredictions.take(5))
# Actual and predicted ratings joined: [((1730, 8904080), (1.52, 1.5183552691178965)), ((2634, 8903648), (1.08, 1.109481704984579)), ((412, 8824284), (1.94, 1.9368518512651538)), ((759, 8841175), (2.0, 1.9916580018716354)),((846, 8825136), (5.0, 4.987908502485232))]

from pyspark.mllib.evaluation import RegressionMetrics
# Example joined record: ((196, 242), (3.0, 3.089619902353484)), i.e. ((user, product), (actual, predicted))
# RegressionMetrics expects (prediction, observation) pairs, so swap the joined values
predictedAndTrue = ratingsAndPredictions.map(lambda x: (x[1][1], x[1][0]))
print(predictedAndTrue.take(5))
regressionMetrics = RegressionMetrics(predictedAndTrue)
print("Mean squared error = %f" % regressionMetrics.meanSquaredError)
print("Root mean squared error = %f" % regressionMetrics.rootMeanSquaredError)
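The join-then-score step can be checked by hand with a few lines of plain Python. The sketch below uses three of the (actual, predicted) pairs from the sample output above, plus one made-up pair that has no prediction, to show that an inner join silently drops it. `join_and_score` is a hypothetical helper, not part of PySpark.

```python
import math

def join_and_score(actual, predicted):
    """actual, predicted: dicts keyed by (user, product). Returns (mse, rmse, n_pairs)."""
    # Inner join: keep only keys that have both an actual and a predicted rating.
    pairs = [(predicted[key], actual[key]) for key in actual if key in predicted]
    mse = sum((p - a) ** 2 for p, a in pairs) / len(pairs)
    return mse, math.sqrt(mse), len(pairs)

actual = {
    (1730, 8904080): 1.52,
    (2634, 8903648): 1.08,
    (412, 8824284): 1.94,
    (999, 1): 2.0,  # illustrative pair with no prediction
}
predicted = {
    (1730, 8904080): 1.5183552691178965,
    (2634, 8903648): 1.109481704984579,
    (412, 8824284): 1.9368518512651538,
}
mse, rmse, n_pairs = join_and_score(actual, predicted)
```

When evaluating on a separate test day it is worth comparing the number of joined pairs with the number of test ratings, since users or items that never appeared in training may receive no prediction and therefore do not contribute to the error.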

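7. Evaluation results
1) Fit on the training data (seven days of training data; the predicted pairs also come from the training set):
Mean squared error = 0.000245
Root mean squared error = 0.015641
2) Accuracy on the test data (seven days for training, one other day for testing):
Mean squared error = 1.663446
Root mean squared error = 1.289746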
The test error is far larger than the training error, which suggests that the model fits the training ratings much more closely than it generalizes to the held-out day.

Reference blog posts:
https://blog.csdn.net/buptdavid/article/details/78970906
https://www.zybuluo.com/xtccc/note/200979
