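One of Mahout's distinguishing features is its set of recommendation algorithms, which covers several common approaches. Let's walk through them.
Recommendation algorithms that work from user behavior data are generally called collaborative filtering algorithms. Collaborative filtering includes neighborhood-based methods, latent factor models, and random walks on graphs. Neighborhood-based methods are currently the most widely used, and they come in two main forms: user-based collaborative filtering and item-based collaborative filtering. The following points of basic advice on recommenders are taken from the Mahout website.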
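- Don't start with a distributed, Hadoop-based recommender unless you have to. Start with a non-distributed recommender, which is simpler and more flexible.
- As a rule of thumb, a system with up to 100M user-item associations fits on a modern server with 4 GB of memory and can run as a real-time recommender.
- Beyond that scale, consider a distributed system. Many applications, however, do not really have 100M associations to process. Much of the data can be reduced: noisy and old data can usually be pruned without significantly affecting the results.
- Consider whether users and items are genuinely associated and whether you have real preference data. If you have user ratings, consider GenericItemBasedRecommender with the PearsonCorrelationSimilarity similarity metric. If not, consider GenericBooleanPrefItemBasedRecommender with LogLikelihoodSimilarity.
- If you want content-based item-item similarity, you need to implement your own ItemSimilarity.
- For CSV files, use FileDataModel. For data stored in a database, use MySQLJDBCDataModel (or the PostgreSQL counterpart, etc.) together with ReloadFromJDBCDataModel.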
1. User-based collaborative filtering
1.1. Prepare the data (dataset.csv)
3,13,4.0
3,14,3.0
3,15,3.5
3,16,4.5
3,17,4.0
3,18,5.0
4,10,5.0
4,11,5.0
4,12,5.0
4,13,0.0
4,14,2.0
4,15,3.0
4,16,1.0
4,17,4.0
4,18,1.0
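Each line of dataset.csv has the form userID,itemID,preference value, which is the comma-separated layout that FileDataModel reads. As a rough illustration only (this is not Mahout's parser, and the class name PrefParser and the method parse are made up for this sketch), the lines can be loaded into a nested map like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefParser {
    // Parses "userID,itemID,value" lines into userID -> (itemID -> value).
    // Hypothetical helper for illustration; Mahout's FileDataModel does its own parsing.
    static Map<Long, Map<Long, Float>> parse(String[] lines) {
        Map<Long, Map<Long, Float>> prefs = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split(",");
            long userId = Long.parseLong(parts[0]);
            long itemId = Long.parseLong(parts[1]);
            float value = Float.parseFloat(parts[2]);
            prefs.computeIfAbsent(userId, k -> new LinkedHashMap<>()).put(itemId, value);
        }
        return prefs;
    }

    public static void main(String[] args) {
        String[] sample = {"3,13,4.0", "3,14,3.0", "4,13,0.0"};
        Map<Long, Map<Long, Float>> prefs = parse(sample);
        System.out.println(prefs.get(3L).get(13L)); // user 3's value for item 13
        System.out.println(prefs.get(4L).size());   // number of items rated by user 4
    }
}
```

Keying the outer map by user keeps all of one user's preferences together, which is the shape a user-based recommender needs when it compares two users.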
1.2. Build the recommender model
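// Classes come from the org.apache.mahout.cf.taste packages.
DataModel model = new FileDataModel(new File("dataset.csv")); // load the CSV from section 1.1 (path is an assumption; adjust as needed)
UserSimilarity similarity = new PearsonCorrelationSimilarity(model); // user-user similarity measure; Pearson correlation here
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model); // neighbors are users whose similarity is at least 0.1
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity); // build the recommender
List<RecommendedItem> recommendations = recommender.recommend(2, 3); // recommend 3 items for user 2
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation); // print each recommended item with its estimated value
}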
1.3. Run it
INFO - Reading file info...
INFO - Read lines: 32
INFO - Processed 4 users
DEBUG - Recommending items for user ID '2'
DEBUG - Recommendations are: [RecommendedItem[item:12, value:4.8328104], RecommendedItem[item:13, value:4.6656213], RecommendedItem[item:14, value:4.331242]]
RecommendedItem[item:12, value:4.8328104]
RecommendedItem[item:13, value:4.6656213]
RecommendedItem[item:14, value:4.331242]
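The three steps in section 1.2 (a similarity measure, a threshold neighborhood, and a recommender that estimates preferences) can be sketched in plain Java. This is a simplified illustration, not Mahout's implementation: UserCfSketch, user, pearson, and estimate are hypothetical names, and the estimate is assumed to be a similarity-weighted average of the neighbors' ratings.

```java
import java.util.HashMap;
import java.util.Map;

public class UserCfSketch {
    // Builds an itemID -> rating map for one user (hypothetical helper).
    static Map<Long, Double> user(long[] items, double[] values) {
        Map<Long, Double> prefs = new HashMap<>();
        for (int i = 0; i < items.length; i++) {
            prefs.put(items[i], values[i]);
        }
        return prefs;
    }

    // Pearson correlation over the items both users have rated.
    // Returns NaN when fewer than two items overlap or a variance is zero.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAB = 0, sumA2 = 0, sumB2 = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAB += x * y;
            sumA2 += x * x;
            sumB2 += y * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumA2 - sumA * sumA / n;
        double varB = sumB2 - sumB * sumB / n;
        if (varA <= 0 || varB <= 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Similarity-weighted average of the neighbors' ratings for one item
    // (assumed estimation rule for this sketch).
    static double estimate(double[] similarities, double[] ratings) {
        double weighted = 0, total = 0;
        for (int i = 0; i < similarities.length; i++) {
            weighted += similarities[i] * ratings[i];
            total += similarities[i];
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        long[] items = {13, 14, 15, 16, 17, 18};
        Map<Long, Double> user3 = user(items, new double[] {4.0, 3.0, 3.5, 4.5, 4.0, 5.0});
        Map<Long, Double> user4 = user(items, new double[] {0.0, 2.0, 3.0, 1.0, 4.0, 1.0});
        double sim = pearson(user3, user4);
        System.out.println("similarity(3, 4) = " + sim);
        System.out.println("neighbor at threshold 0.1? " + (sim >= 0.1));
        // Two made-up neighbors with similarities 0.5 and 1.0 who rated an item 4.0 and 1.0.
        System.out.println(estimate(new double[] {0.5, 1.0}, new double[] {4.0, 1.0}));
    }
}
```

For users 3 and 4 from dataset.csv, the correlation over their six shared items (13 to 18) comes out at about -0.38, so with a threshold of 0.1 neither would count as a neighbor of the other.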
1.4. Running on tens of millions of records
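import java.io.RandomAccessFile;

// Generates random "userId,itemId,value" records for the load test.
// The enclosing method header is missing from the original listing; main is assumed here.
public static void main(String[] args) {
    int batchSize = 10000;
    int recordsCnt = 30000000; // 30M records
    String fileName = "D:/tmp/recommandtestdata2.csv";
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < recordsCnt; i++) {
        sb.append(getRandInt(1000000, 1)); // userId, drawn from about 1M IDs
        sb.append(",");
        sb.append(getRandInt(1000, 1)); // itemId
        sb.append(",");
        sb.append(getRandInt(5, 1)); // value
        sb.append("\n");
        if (i > 0 && (i % batchSize == 0)) {
            System.out.println(i); // progress
            write2File(sb.toString(), fileName); // append this batch to the file
            sb.setLength(0); // start a new batch
        }
    }
    // Records appended after the last flush are never written, which is why the
    // log reports 29990001 lines. Call write2File once more here to keep them.
}

// Maps Math.random() onto the integers min .. max - 1; in practice max itself is not produced.
public static int getRandInt(int max, int min) {
    return (int) (Math.random() * (max - min) + min);
}

// Appends str to the file at path, encoded as UTF-8.
public static void write2File(String str, String path) {
    try (RandomAccessFile myFileStream = new RandomAccessFile(path, "rw")) {
        myFileStream.seek(myFileStream.length()); // move to the end so the write appends
        myFileStream.write(str.getBytes("UTF-8"));
    } catch (Exception e) {
        e.printStackTrace();
    }
}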
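The generator only flushes its buffer when i is a positive multiple of batchSize, so it is worth checking how many records actually reach the file. The sketch below replays just that flush condition with counters instead of file writes; the class name FlushCount and the method recordsWritten are hypothetical.

```java
public class FlushCount {
    // Replays only the flush condition of the generator loop and counts
    // how many records would have been written to the file.
    static long recordsWritten(int recordsCnt, int batchSize) {
        long written = 0;
        long pending = 0;
        for (int i = 0; i < recordsCnt; i++) {
            pending++; // one record appended to the buffer
            if (i > 0 && (i % batchSize == 0)) {
                written += pending; // buffer flushed to the file
                pending = 0;
            }
        }
        return written; // anything still pending is never written
    }

    public static void main(String[] args) {
        System.out.println(recordsWritten(30000000, 10000)); // prints 29990001
    }
}
```

The first flush carries 10001 records (i = 0 to 10000) and the last 9999 records are dropped, which gives 29990001 rather than 30000000.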
at org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$1.apply(GenericUserPreferenceArray.java:251)
INFO - Reading file info...
INFO - Processed 1000000 lines
INFO - Processed 2000000 lines
... (omitted here)
INFO - Processed 28000000 lines
INFO - Processed 29000000 lines
INFO - Read lines: 29990001
INFO - Processed 10000 users
INFO - Processed 20000 users
INFO - Processed 980000 users
INFO - Processed 990000 users
INFO - Processed 999999 users
DEBUG - Recommending items for user ID '2'
DEBUG - Recommendations are: [RecommendedItem[item:458, value:2.5922961], RecommendedItem[item:842, value:2.5879922], RecommendedItem[item:802, value:2.5861814]]
RecommendedItem[item:458, value:2.5922961]
RecommendedItem[item:842, value:2.5879922]
RecommendedItem[item:802, value:2.5861814]
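Estimated values around 2.59 are roughly what purely random ratings would suggest. getRandInt(5, 1) truncates Math.random() * 4 + 1, so in practice it yields 1 to 4 with equal probability and the average rating is 2.5; the top recommendations presumably sit slightly above that because the recommender picks the highest estimates. The sketch below checks the range and mean by sweeping an evenly spaced grid in place of Math.random(); RandomRatingRange, scale, and meanOverGrid are hypothetical names.

```java
public class RandomRatingRange {
    // Same formula as getRandInt, with the random draw r passed in
    // so the result is deterministic.
    static int scale(double r, int max, int min) {
        return (int) (r * (max - min) + min);
    }

    // Average of scale(r, max, min) for r = 0/steps, 1/steps, ..., (steps-1)/steps.
    static double meanOverGrid(int max, int min, int steps) {
        long sum = 0;
        for (int k = 0; k < steps; k++) {
            sum += scale((double) k / steps, max, min);
        }
        return (double) sum / steps;
    }

    public static void main(String[] args) {
        System.out.println(scale(0.0, 5, 1));         // smallest rating: 1
        System.out.println(scale(0.999, 5, 1));       // largest rating: 4
        System.out.println(meanOverGrid(5, 1, 1000)); // prints 2.5
    }
}
```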