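1. The recommendations produced by org.apache.spark.ml.recommendation.ALS come back sorted by rating, but without an explicit rank number. To study how recommendation success relates to rank position, you have to add the rank yourself with Row_Number, as follows: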
val recDF=spark.sqlContext.read.load(savePathMl)
.selectExpr("id","explode(recommendations) as rec").selectExpr("id as uId","rec.itemId","rec.rating as rec_rating")
recDF.createOrReplaceTempView("recommend")
spark.sql("select uId,itemId,rec_rating,Row_Number() OVER (partition by uId order by rec_rating desc) as rank from recommend ")
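If rows with equal scores should share a rank, consider rank() or dense_rank() in place of row_number(). The differences are:
rank() gives tied rows the same rank and then skips the positions the tie occupies, e.g. 1 2 2 4 5
dense_rank() behaves like rank(), but leaves no gap after a tie, e.g. 1 2 2 3 4
row_number() ignores ties: rows with equal scores still receive consecutive numbers, e.g. 1 2 3 4 5 (the order among tied rows is arbitrary unless the ORDER BY includes a tie-breaker)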
Reference: https://blog.csdn.net/zz_xiaohuli_zz/article/details/87472176
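To make the three numbering schemes concrete, here is a minimal plain-Scala sketch (no Spark required) that reproduces the sequences above for a list of scores. The object name `Ranking` and its method names are hypothetical, chosen only for this illustration. It assumes the scores are already sorted in descending order, as they would be inside the window.

```scala
object Ranking {
  // row_number(): consecutive positions, ties are ignored.
  def rowNumber(scores: Seq[Double]): Seq[Int] =
    scores.indices.map(_ + 1)

  // rank(): 1 + number of rows with a strictly higher score,
  // so tied rows share a rank and the following positions are skipped.
  def rank(scores: Seq[Double]): Seq[Int] =
    scores.map(s => scores.count(_ > s) + 1)

  // dense_rank(): 1 + number of distinct strictly higher scores,
  // so tied rows share a rank and no positions are skipped.
  def denseRank(scores: Seq[Double]): Seq[Int] = {
    val distinctScores = scores.distinct
    scores.map(s => distinctScores.count(_ > s) + 1)
  }

  def main(args: Array[String]): Unit = {
    val scores = Seq(9.0, 8.0, 8.0, 7.0, 6.0) // sorted descending, one tie
    println(rowNumber(scores).mkString(" "))
    println(rank(scores).mkString(" "))
    println(denseRank(scores).mkString(" "))
  }
}
```

With the sample scores, the second and third rows are tied, which is the only place where the three functions disagree.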
2. Expanding a comma-separated collection field (info) of a DataFrame into one row per element:
val genresArr = (s:String) => s.split(",")
spark.udf.register("getGenresArr",genresArr(_:String))
select uId,explode(getGenresArr(info)) as info from table
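The same expansion can probably be done without registering a UDF, using the built-in `split` and `explode` functions from `org.apache.spark.sql.functions`. The sketch below assumes a local SparkSession and a small made-up DataFrame with the columns `uId` and `info`; the object name `ExplodeInfo` and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, split}

object ExplodeInfo {
  // Split the comma-separated info column and emit one row per element.
  def explodeInfo(df: DataFrame): DataFrame =
    df.select(col("uId"), explode(split(col("info"), ",")).as("info"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-info")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq((1, "rock,pop"), (2, "jazz")).toDF("uId", "info")
    explodeInfo(df).show()

    spark.stop()
  }
}
```

Note that `split` takes a regular expression as its pattern, so a separator such as `|` or `.` would need escaping; a plain comma does not.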
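3. The song feature collection returned by collect_list is an array column. Spark hands array columns to a Scala UDF as a Seq, so the UDF parameter must be declared as Seq (declaring it as Array raises an error):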
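import breeze.linalg.DenseMatrix
// Averages the 10-dimensional feature vectors of all songs by one artist.
def getArtistFeature(songsFeature: Seq[Seq[Float]]): Array[Float] = {
  // 1 x 10 accumulator, initialised to zero
  var artFeature: DenseMatrix[Float] = DenseMatrix(Array(0F,0F,0F,0F,0F, 0F,0F,0F,0F,0F))
  songsFeature.foreach(songFeature => artFeature += DenseMatrix(songFeature))
  // divide the element-wise sum by the number of songs
  (artFeature * (1F / songsFeature.length.toFloat)).toArray
}
spark.udf.register("getArtistFeature", getArtistFeature(_: Seq[Seq[Float]]))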
val artDF=spark.sql(
"""
|select artistId as id,getArtistFeature(collect_list(features)) as features
|from musicInfo t1
|group by artistId
""".stripMargin)
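For comparison, the averaging step can also be sketched in plain Scala without Breeze and without hard-coding the vector length to 10. This is only an illustrative alternative, not the method used above; the object name `ArtistFeature` and the function name `meanFeature` are hypothetical. It keeps the `Seq[Seq[Float]]` parameter type, so it could be registered as a UDF in the same way.

```scala
object ArtistFeature {
  // Element-wise mean of a non-empty collection of equal-length feature vectors.
  def meanFeature(songsFeature: Seq[Seq[Float]]): Array[Float] = {
    require(songsFeature.nonEmpty, "need at least one feature vector")
    val dim = songsFeature.head.length
    val sum = new Array[Float](dim)
    for (songFeature <- songsFeature) {
      require(songFeature.length == dim, "all feature vectors must have the same length")
      var i = 0
      while (i < dim) {
        sum(i) += songFeature(i)
        i += 1
      }
    }
    // Divide each accumulated component by the number of vectors.
    sum.map(_ / songsFeature.length)
  }

  def main(args: Array[String]): Unit = {
    val mean = meanFeature(Seq(Seq(1F, 2F), Seq(3F, 4F)))
    println(mean.mkString(", "))
  }
}
```

Since collect_list inside a group by never yields an empty array for an existing group, the non-empty check mainly guards against direct calls outside Spark.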