Spark SQL notes: explode / Row_Number / collect_list

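1. The recommendations produced by org.apache.spark.ml.recommendation.ALS are already sorted, but they carry no rank number. To study how recommendation success relates to rank position, you have to add Row_Number yourself, as follows: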

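// Load the saved ALS recommendations and flatten them to one row per (user, item) pair.
val recDF = spark.read.load(savePathMl)
  .selectExpr("id", "explode(recommendations) as rec")
  .selectExpr("id as uId", "rec.itemId", "rec.rating as rec_rating")
recDF.createOrReplaceTempView("recommend")
// Number each user's recommendations by descending predicted rating.
val rankedDF = spark.sql("select uId, itemId, rec_rating, Row_Number() OVER (partition by uId order by rec_rating desc) as rank from recommend")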

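If rows with equal scores should share a rank, consider rank() or dense_rank() instead of row_number(). The differences are:

rank() over (...): rows with equal scores share a rank, and the positions occupied by the tie are skipped afterwards, e.g. 1 2 2 4 5

dense_rank() is similar to rank(), but no positions are skipped after a tie, e.g. 1 2 2 3 4

row_number() ignores ties: rows with equal scores still receive consecutive numbers (their relative order is not guaranteed), e.g. 1 2 3 4 5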
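
The three numbering schemes above can be reproduced outside Spark. The following is a minimal plain-Scala sketch, not Spark code: the object name RankingDemo and the sample scores are made up for illustration, and the input is assumed to be sorted in descending order already, as it would be inside the window.

```scala
// Plain-Scala model of the three SQL ranking functions (no Spark required).
object RankingDemo {
  // row_number(): consecutive numbers, ties are ignored.
  def rowNumber(scores: Seq[Double]): Seq[Int] =
    scores.indices.map(_ + 1)

  // rank(): tied scores share the position of the first tied row,
  // so the positions taken up by the tie are skipped afterwards.
  def rank(scores: Seq[Double]): Seq[Int] =
    scores.map(s => scores.indexOf(s) + 1)

  // dense_rank(): tied scores share a rank and no positions are skipped.
  def denseRank(scores: Seq[Double]): Seq[Int] = {
    val distinctScores = scores.distinct
    scores.map(s => distinctScores.indexOf(s) + 1)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical ratings, sorted descending, with one tie.
    val scores = Seq(9.5, 9.0, 9.0, 8.0, 7.5)
    println(rowNumber(scores).mkString(" ")) // 1 2 3 4 5
    println(rank(scores).mkString(" "))      // 1 2 2 4 5
    println(denseRank(scores).mkString(" ")) // 1 2 2 3 4
  }
}
```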

Reference: https://blog.csdn.net/zz_xiaohuli_zz/article/details/87472176

 

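2. Expand a comma-separated DataFrame field, info, into one row per element:
// UDF that splits a comma-separated string into an array.
val genresArr = (s:String) => s.split(",")
spark.udf.register("getGenresArr",genresArr(_:String))
// "table" stands for the name of your own registered view.
spark.sql("select uId,explode(getGenresArr(info)) as info from table")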
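
To make the effect of the split-and-explode query concrete, here is a plain-Scala sketch of what happens to each row. It is only a model of the semantics, not Spark code: the InfoRow case class, the explodeInfo helper, and the sample values are hypothetical. In Spark itself, the built-in SQL function split(info, ',') could probably stand in for the custom UDF.

```scala
// Plain-Scala model of explode over a comma-separated field (no Spark required).
object ExplodeDemo {
  // Hypothetical row shape: a user id plus a comma-separated info field.
  final case class InfoRow(uId: Int, info: String)

  // Mirrors explode(getGenresArr(info)): one output row per split element,
  // each carrying the uId of the row it came from.
  def explodeInfo(rows: Seq[InfoRow]): Seq[InfoRow] =
    rows.flatMap(r => r.info.split(",").toSeq.map(part => InfoRow(r.uId, part)))

  def main(args: Array[String]): Unit = {
    val rows = Seq(InfoRow(1, "rock,pop"), InfoRow(2, "jazz"))
    // Prints three rows: (1, rock), (1, pop), (2, jazz).
    explodeInfo(rows).foreach(println)
  }
}
```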

 

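3. collect_list over the song feature vectors yields an Array column, but the UDF parameter must be declared as Seq (declaring it as Array raises an error):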

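import breeze.linalg.DenseMatrix

// The parameter must be Seq[Seq[Float]]: Spark hands array columns to Scala UDFs as Seq, not Array.
def getArtistFeature(songsFeature:Seq[Seq[Float]]): Array[Float] ={
    // Accumulator: a 1 x 10 matrix of zeros (feature vectors are assumed to have length 10).
    var artFeature: DenseMatrix[Float] =DenseMatrix(Array(0F,0F,0F,0F,0F, 0F,0F,0F,0F,0F))
    songsFeature.foreach(songFeature => artFeature += DenseMatrix(songFeature) )
    // Divide the sum by the number of songs to get the mean feature vector.
    ( artFeature * (1F/songsFeature.length) ).toArray
}
spark.udf.register("getArtistFeature",getArtistFeature(_:Seq[Seq[Float]]))
// Collect each artist's song features and average them with the UDF.
val artDF=spark.sql(
  """
    |select artistId as id,getArtistFeature(collect_list(features)) as features
    |from musicInfo t1
    |group by artistId
  """.stripMargin)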

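The averaging step in getArtistFeature can also be written without Breeze and without a hard-coded vector length of 10. The following is a sketch of that alternative, not the method used above: the names FeatureMeanDemo and meanFeature are hypothetical, and it assumes every vector in a group has the same length. It keeps the Seq[Seq[Float]] parameter type, so it could presumably be registered as a UDF in the same way.

```scala
// Element-wise mean of feature vectors using only the Scala standard library.
object FeatureMeanDemo {
  // Assumes all inner vectors have equal length; transpose throws otherwise.
  def meanFeature(songsFeature: Seq[Seq[Float]]): Array[Float] = {
    require(songsFeature.nonEmpty, "need at least one feature vector")
    val n = songsFeature.length
    // transpose groups the i-th component of every vector together,
    // so each inner sequence can be summed and divided by the vector count.
    songsFeature.transpose.map(component => component.sum / n).toArray
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical two-dimensional song features.
    val features = Seq(Seq(1F, 2F), Seq(3F, 6F))
    println(meanFeature(features).mkString(", ")) // 2.0, 4.0
  }
}
```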