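Background
- A new system has very little data in its early days, so it is hard to build a recommender directly on that data. This is the cold-start problem that some algorithms suffer from. Early-stage recommendations are therefore usually based on popularity (hotness) or on operations-driven rules.
- Our example here is search hot-word recommendation, similar to the hot-search list shown on the right-hand side of Baidu.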
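Prerequisites
- The data used here comes from Sogou Labs: the user query log (SogouQ), 2008 version
Data download link: http://www.sogou.com/labs/resource/q.php
For your own testing you can download the mini version (sample data, 376KB), available in tar.gz and zip formats
Sample data: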
00:00:00 2982199073774412 [360安全衛士] 8 3 download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
00:00:00 07594220010824798 [***] 1 1 news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
00:00:00 5228056822071097 [***] 14 5 www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
00:00:00 6140463203615646 [繩藝] 62 36 www.jd-cd.com/jd_opus/xx/200607/706.html
00:00:00 8561366108033201 [***] 3 2 www.big38.net/
00:00:00 23908140386148713 [莫衷一是的意思] 1 2 www.chinabaike.com/article/81/82/110/2007/2007020724490.html
00:00:00 1797943298449139 [星夢緣全集在線觀看] 8 5 www.6wei.net/dianshiju/???\xa1\xe9|???do=index
00:00:00 00717725924582846 [閃字吧] 1 2 www.shanziba.com/
00:00:00 41416219018952116 [***] 2 6 bbs.gouzai.cn/thread-698736.html
00:00:00 9975666857142764 [電腦創業] 2 2 ks.cn.yahoo.com/question/1307120203719.html
- The technology used is Spark with its Scala API. To keep the learning and setup cost low, we run Spark in local mode.
- Official Spark documentation: http://spark.apache.org/docs/latest/
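The sample lines above all share one layout, which can be made explicit with a small parser. The following is a minimal sketch in plain Scala (no Spark required). The record type `QueryLog`, the object `SogouLineParser`, and all field names are hypothetical labels introduced for illustration. The reading of the two numbers as "rank of the clicked URL" and "click sequence number" is an assumption based on the usual description of the SogouQ format, so check it against the download page.

```scala
// Hypothetical record type; field names are this sketch's own labels.
final case class QueryLog(
    time: String,    // access time, e.g. "00:00:00"
    userId: String,  // kept as a string because IDs may start with 0
    query: String,   // search keyword without the surrounding brackets
    rank: Int,       // assumed: rank of the clicked URL in the result list
    clickOrder: Int, // assumed: sequence number of the user's click
    url: String      // clicked URL
)

object SogouLineParser {
  // Fields may be separated by tabs or spaces; the query is whatever sits
  // between the square brackets, so it may itself contain spaces.
  private val LinePattern =
    """^(\S+)\s+(\S+)\s+\[(.*)\]\s+(\d+)\s+(\d+)\s+(.*)$""".r

  // Returns None for lines that do not match the expected layout.
  def parse(line: String): Option[QueryLog] = line match {
    case LinePattern(time, userId, query, rank, order, url) =>
      Some(QueryLog(time, userId, query, rank.toInt, order.toInt, url))
    case _ => None
  }
}
```

Returning an `Option` lets malformed lines be dropped with `flatMap` instead of failing the whole job.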
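Data cleaning
- Cleaning mainly means converting the data into the format we want. Here it is enough to work line by line and lightly process the keyword, that is, strip the square brackets around it.
- Project git repository: https://github.com/liurui-rolin/recommend
- The project structure is as follows
- The cleaning code is as follows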
package youling.studio.recommend.hotbase

import org.apache.spark.sql.SparkSession

/**
 * @author liurui
 * @date 2019/8/11 9:32 PM
 */
object HotBase {
  def main(args: Array[String]): Unit = {
    println("start...")
    val logFile = "data/SogouQ.sample"
    // Create the SparkSession in local mode with 2 worker threads
    val spark = SparkSession.builder.master("local[2]").appName("Hot base app").getOrCreate()
    // Read the log as a Dataset[String], one element per line
    val logData = spark.read.textFile(logFile)
    // Brings the String encoder into scope, which Dataset.map needs
    import spark.implicits._
    // Simple cleaning: remove every "[" and "]" from the line
    val etlData = logData.map(_.replace("[", "").replace("]", ""))
    // Show the first 10 cleaned lines
    etlData.limit(10).collect().foreach(println(_))
    spark.stop()
    println("end...")
  }
}
- The output is as follows
start...
HotBase.scala:19, took 0.636522 s
00:00:00 2982199073774412 360安全衛士 8 3 download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
00:00:00 07594220010824798 哄搶救災物資 1 1 news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
00:00:00 5228056822071097 *** 14 5 www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
00:00:00 6140463203615646 *** 62 36 www.jd-cd.com/jd_opus/xx/200607/706.html
00:00:00 8561366108033201 *** 3 2 www.big38.net/
00:00:00 23908140386148713 莫衷一是的意思 1 2 www.chinabaike.com/article/81/82/110/2007/2007020724490.html
00:00:00 1797943298449139 星夢緣全集在線觀看 8 5 www.6wei.net/dianshiju/????\xa1\xe9|????do=index
00:00:00 00717725924582846 閃字吧 1 2 www.shanziba.com/
00:00:00 41416219018952116 *** 2 6 bbs.gouzai.cn/thread-698736.html
00:00:00 9975666857142764 電腦創業 2 2 ks.cn.yahoo.com/question/1307120203719.html
end...
Process finished with exit code 0
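One caveat of the cleaning step above: `replace` removes every `[` and `]` on the line, including any that appear inside the URL. If that matters for your data, the brackets can be stripped from the keyword field only. The following is a minimal sketch in plain Scala, assuming the fields are tab-separated (as the `split("\t")` call in the hot-word step also assumes). The names `FieldAwareClean` and `cleanLine` are hypothetical and not part of the project.

```scala
object FieldAwareClean {
  // Strip one leading "[" and one trailing "]" from the third
  // tab-separated field only. Lines with fewer than three fields
  // are returned unchanged.
  def cleanLine(line: String): String = {
    val fields = line.split("\t", -1) // -1 keeps trailing empty fields
    if (fields.length < 3) line
    else
      fields
        .updated(2, fields(2).stripPrefix("[").stripSuffix("]"))
        .mkString("\t")
  }
}
```

In the Spark job this could replace the lambda passed to `logData.map`, for example `logData.map(line => FieldAwareClean.cleanLine(line))`.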
Computing hot-word recommendations
- Building on the cleaning class above, we compute the hot words directly
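package youling.studio.recommend.hotbase

import org.apache.spark.sql.SparkSession

/**
 * @author liurui
 * @date 2019/8/11 9:32 PM
 */
object HotBase {
  def main(args: Array[String]): Unit = {
    println("start...")
    val logFile = "data/SogouQ.sample"
    // Create the SparkSession in local mode with 2 worker threads
    val spark = SparkSession.builder.master("local[2]").appName("Hot base app").getOrCreate()
    // Read the log as a Dataset[String], one element per line
    val logData = spark.read.textFile(logFile)
    // Brings the String and tuple encoders into scope, which Dataset.map needs
    import spark.implicits._
    // Simple cleaning; cached because the result is used by two actions below
    val etlData = logData.map(_.replace("[", "").replace("]", "")).cache()
    // Show the first 10 cleaned lines
    etlData.limit(10).collect().foreach(println(_))
    // Hot-word computation: count each keyword (third tab-separated field),
    // swap to (count, word), then sort by count in descending order
    val hotWords = etlData
      .map(line => (line.split("\t")(2), 1))
      .rdd
      .reduceByKey(_ + _)
      .map { case (word, count) => (count, word) }
      .sortByKey(ascending = false, numPartitions = 1)
    // Print the 100 most frequent keywords
    hotWords.take(100).foreach(println(_))
    spark.stop()
    println("end...")
  }
}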
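- The result set is simply printed here. To make testing easier, you can write it to a file system, a database, or similar instead.
Viewing the results
- The results are as follows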
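The hot-word step boils down to "count per keyword, sort by count in descending order, keep the top k". To make that logic easy to test without starting Spark, and to show one way of persisting the result as suggested above, here is a minimal sketch in plain Scala. The names `HotWords`, `topK`, and `writeTsv` are hypothetical. The input is assumed to be already cleaned, tab-separated lines, and breaking ties alphabetically is this sketch's own choice (the Spark version leaves the order of equal counts unspecified).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object HotWords {
  // Count each keyword (third tab-separated field) and return the k most
  // frequent as (count, word) pairs, highest count first. Lines with fewer
  // than three fields are skipped.
  def topK(lines: Seq[String], k: Int): Seq[(Int, String)] =
    lines
      .map(_.split("\t", -1))
      .collect { case fields if fields.length > 2 => fields(2) }
      .groupBy(identity)
      .toSeq // convert before mapping so equal counts are not merged as map keys
      .map { case (word, occurrences) => (occurrences.size, word) }
      .sortBy { case (count, word) => (-count, word) }
      .take(k)

  // Write the pairs as "count<TAB>word" lines in UTF-8.
  def writeTsv(rows: Seq[(Int, String)], out: Path): Unit = {
    val text = rows.map { case (count, word) => s"$count\t$word" }.mkString("\n")
    Files.write(out, text.getBytes(StandardCharsets.UTF_8))
    ()
  }
}
```

For the sample file the Spark result is small enough that `hotWords.take(100)` could be passed straight to `writeTsv`. For large results, a distributed writer is the more usual choice.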
start...
...
19/08/11 23:43:48 INFO DAGScheduler: ResultStage 3 (take at HotBase.scala:27) finished in 0.102 s
19/08/11 23:43:48 INFO DAGScheduler: Job 1 finished: take at HotBase.scala:27, took 0.992908 s
... (output omitted here because it contains sensitive terms; run the job yourself to see it)
(8,免費電影)
19/08/11 23:43:48 INFO SparkUI: Stopped Spark web UI at http://192.168.1.100:4040
19/08/11 23:43:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/08/11 23:43:48 INFO MemoryStore: MemoryStore cleared
19/08/11 23:43:48 INFO BlockManager: BlockManager stopped
19/08/11 23:43:48 INFO BlockManagerMaster: BlockManagerMaster stopped
19/08/11 23:43:48 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/08/11 23:43:48 INFO SparkContext: Successfully stopped SparkContext
19/08/11 23:43:48 INFO ShutdownHookManager: Shutdown hook called
19/08/11 23:43:48 INFO ShutdownHookManager: Deleting directory /private/var/folders/ns/2vftqg2n1f76m5hhj_mvstm00000gn/T/spark-0becc81f-50b4-45ba-81fd-36674d526ac4
end...
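- Analysis: it looks like back in 2008 everyone was paying attention to ** (masked here because it involves sensitive terms; run the job yourself to see it), along with some gay-related topics. Haha, so some of these recommendations are reasonable and others are enough to make you spit blood…
Continue to the next part~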