Introduction to the TF-IDF Algorithm
TF-IDF (Term Frequency–Inverse Document Frequency) is a weighting technique widely used in information retrieval and text mining. Its core idea: if a word or phrase occurs frequently in one document (high TF) but rarely in other documents, it has strong discriminative power between categories and is well suited for classification.
TF-IDF is simply TF × IDF. TF (Term Frequency) is the frequency with which a term occurs in a document:

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$, and the denominator is the total number of terms in $d_j$. IDF (Inverse Document Frequency) is the logarithm of the ratio between the total number of documents and the number of documents containing term $t_i$:

$$IDF_i = \log \frac{N}{\sum_{j} I(t_i \in d_j)}$$

where $N$ is the total number of documents and $I(t_i \in d_j)$ is 1 if document $d_j$ contains term $t_i$ and 0 otherwise. This definition has a problem: if a term occurs in no document at all, the denominator of the IDF formula is 0, so the IDF needs smoothing. The smoothed IDF is:

$$IDF_i = \log \frac{N}{1 + \sum_{j} I(t_i \in d_j)}$$

The final TF-IDF value of term $t_i$ in document $d_j$ is then:

$$TFIDF_{i,j} = TF_{i,j} \times IDF_i$$
From the computation above: the more frequently a term occurs within a document, and the less common it is across the corpus, the higher its TF-IDF value.
For example, suppose a corpus contains $N = 100$ papers, of which 20 contain the term 推薦系統 (recommender system), and the first paper $d_1$ contains 200 technical terms in total, 15 of them being 推薦系統. The TF-IDF value of 推薦系統 in $d_1$ (without smoothing) is then:

$$TFIDF = \frac{15}{200} \times \log \frac{100}{20} = 0.075 \times \log 5 \approx 0.12$$

(using the natural logarithm).
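The same arithmetic as a runnable sanity check (using the natural logarithm, which is also what Spark's IDF uses; the log base only rescales all scores uniformly):

// worked example: N = 100 documents, 20 contain the term,
// the first paper has 200 terms, 15 of which are the term
val tf = 15.0 / 200 // 0.075
val idf = math.log(100.0 / 20) // ~1.609 (natural log)
println(f"TF-IDF = ${tf * idf}%.4f") // ~0.1207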
Computing Text Similarity with TF-IDF
Note that Spark 2.x does not support the Cartesian product (cross join) of DataFrames by default; it has to be enabled when creating the SparkSession.
Create the Spark session and set the log level:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.{SparseVector => SV} // the post's SV alias; assumed to be ml.linalg.SparseVector
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Row, SparkSession}

// Spark 2.x does not support Cartesian joins by default;
// enable them via spark.sql.crossJoin.enabled=true
val spark = SparkSession
.builder()
.appName("docSimCalWithTFIDF")
.config("spark.sql.crossJoin.enabled","true")
.master("local[10]")
.enableHiveSupport()
.getOrCreate()
Logger.getRootLogger.setLevel(Level.WARN)
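To verify that the flag actually took effect, the session's runtime configuration can be queried:

println(spark.conf.get("spark.sql.crossJoin.enabled")) // should print "true"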
Following the official example, build a DataFrame of three English sentences and tokenize them (Spark has many Chinese word-segmentation options, e.g. jieba, HanLP, ansj, and FudanNLP, which are not covered here; a RegexTokenizer alternative is sketched after the output below):
val sentenceData = spark.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer()
.setInputCol("sentence")
.setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
wordsData.show(10)
The output is:
+-----+--------------------+--------------------+
|label| sentence| words|
+-----+--------------------+--------------------+
| 0|Hi I heard about ...|[hi, i, heard, ab...|
| 1|I wish Java could...|[i, wish, java, c...|
| 2|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+
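The built-in Tokenizer only lowercases and splits on whitespace. If punctuation or custom delimiters matter, Spark ML's RegexTokenizer is a drop-in alternative; a minimal sketch (the \W+ pattern is an arbitrary choice for illustration):

import org.apache.spark.ml.feature.RegexTokenizer

// split on one or more non-word characters instead of plain whitespace
val regexTokenizer = new RegexTokenizer()
.setInputCol("sentence")
.setOutputCol("words")
.setPattern("\\W+")
val regexWordsData = regexTokenizer.transform(sentenceData)
regexWordsData.show(10)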
Compute the TF-IDF vectors with Spark ML's built-in HashingTF and IDF:
// setNumFeatures(5) sets the number of hash buckets to 5; tune it to the vocabulary size.
// The larger the value, the less likely two different words hash into the same bucket,
// so the features are more accurate, at the cost of more memory
// (an exact-vocabulary alternative, CountVectorizer, is sketched after the output below)
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("rawFeatures")
.setNumFeatures(5)
val featurizedData = hashingTF
.transform(wordsData)
featurizedData.show(10)
val idf = new IDF()
.setInputCol("rawFeatures")
.setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.show(10)
rescaledData.select("label", "features").show()
The output is:
+-----+--------------------+--------------------+--------------------+
|label| sentence| words| rawFeatures|
+-----+--------------------+--------------------+--------------------+
| 0|Hi I heard about ...|[hi, i, heard, ab...|(5,[0,2,4],[2.0,2...|
| 1|I wish Java could...|[i, wish, java, c...|(5,[0,2,3,4],[1.0...|
| 2|Logistic regressi...|[logistic, regres...|(5,[0,1,3,4],[1.0...|
+-----+--------------------+--------------------+--------------------+
+-----+--------------------+--------------------+--------------------+--------------------+
|label| sentence| words| rawFeatures| features|
+-----+--------------------+--------------------+--------------------+--------------------+
| 0|Hi I heard about ...|[hi, i, heard, ab...|(5,[0,2,4],[2.0,2...|(5,[0,2,4],[0.0,0...|
| 1|I wish Java could...|[i, wish, java, c...|(5,[0,2,3,4],[1.0...|(5,[0,2,3,4],[0.0...|
| 2|Logistic regressi...|[logistic, regres...|(5,[0,1,3,4],[1.0...|(5,[0,1,3,4],[0.0...|
+-----+--------------------+--------------------+--------------------+--------------------+
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0|(5,[0,2,4],[0.0,0...|
| 1|(5,[0,2,3,4],[0.0...|
| 2|(5,[0,1,3,4],[0.0...|
+-----+--------------------+
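With only 5 hash buckets and far more than 5 distinct words, collisions are guaranteed in this toy example. If an exact vocabulary is preferred over the hashing trick, Spark ML's CountVectorizer can replace HashingTF; a minimal sketch (variable names are illustrative):

import org.apache.spark.ml.feature.CountVectorizer

// builds an explicit vocabulary instead of hashing, so there are no
// collisions; the vocabulary itself must fit in memory
val cvModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("rawFeatures")
.fit(wordsData)
val exactFeaturized = cvModel.transform(wordsData)
exactFeaturized.show(10)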
Here (size,[indices],[values]) is a compressed representation of a sparse vector: for example, (3,[0,1],[0.1,0.3]) denotes a vector of length 3 whose first and second positions hold 0.1 and 0.3, and whose third position is 0. The 0.0 entries in the features column are expected: Spark computes IDF as log((numDocs + 1) / (docFreq + 1)), so with numDocs = 3, any hash bucket that occurs in all three documents gets IDF = log(4/4) = 0.
The sparse vectors need to be converted to dense vectors for the computation that follows. The conversion can be done directly on the DataFrame, or by first converting the DataFrame to an RDD.
Converting on the DataFrame with a custom UDF:
import spark.implicits._
// parse the features column and convert it to DenseVector format
val sparseVectorToDenseVector = udf {
features: SV => features.toDense
}
// named denseDF so it does not clash with the df defined further below
val denseDF = rescaledData
.select($"label".alias("label1"), sparseVectorToDenseVector($"features").alias("features1"))
.withColumn("tag", lit(1))
denseDF.show(10)
The output is:
+------+--------------------+---+
|label1| features1|tag|
+------+--------------------+---+
| 0|[0.0,0.0,0.575364...| 1|
| 1|[0.0,0.0,0.575364...| 1|
| 2|[0.0,0.6931471805...| 1|
+------+--------------------+---+
Alternatively, convert the DataFrame to an RDD first and do the conversion there:
val selectedRDD = rescaledData.select("label", "features").rdd
.map(l => (l.get(0).toString, l.getAs[SV](1).toDense))
selectedRDD.take(10).foreach(println)
The output is:
(0,[0.0,0.0,0.5753641449035617,0.0,0.0])
(1,[0.0,0.0,0.5753641449035617,0.28768207245178085,0.0])
(2,[0.0,0.6931471805599453,0.0,0.5753641449035617,0.0])
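From this RDD the pairwise similarities could equally be computed with plain RDD operations instead of a DataFrame join. A sketch using cartesian, relying on the calTwoDocSim function defined further below (like the join, this is quadratic in the number of documents):

// all ordered pairs of distinct documents and their cosine similarity
val pairSims = selectedRDD.cartesian(selectedRDD)
.filter { case ((id1, _), (id2, _)) => id1 != id2 }
.map { case ((id1, v1), (id2, v2)) => (id1, id2, calTwoDocSim(v1.toSparse, v2.toSparse)) }
pairSims.take(10).foreach(println)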
The conversion can also be done on the fly during the similarity computation, as follows:
// define the similarity-computation UDF
import spark.implicits._
val df1 = rescaledData
.select($"label".alias("id1"), $"features".alias("f1"))
.withColumn("tag",lit(1))
val df2 = rescaledData
.select($"label".alias("id2"), $"features".alias("f2"))
.withColumn("tag",lit(1))
val simTwoDoc = udf{
(f1: SV, f2: SV) => calTwoDocSim(f1,f2)
}
val df = df1.join(df2, Seq("tag"), "inner")
.where("id1 != id2")
.withColumn("simscore",simTwoDoc(col("f1"), col("f2")))
.where("simscore > 0.0")
.select("id1","id2","simscore")
df.printSchema()
df.show(20)
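Note that the inner join filtered by id1 != id2 keeps both directions of every pair ((0,1) and (1,0), as the output below shows). Since cosine similarity is symmetric, half of the work can be skipped by keeping each unordered pair once; a variant (dfUnique is an illustrative name):

val dfUnique = df1.join(df2, Seq("tag"), "inner")
.where("id1 < id2") // cosine similarity is symmetric, keep each pair once
.withColumn("simscore", simTwoDoc(col("f1"), col("f2")))
.where("simscore > 0.0")
.select("id1", "id2", "simscore")
dfUnique.show(20)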
The calTwoDocSim function is implemented as follows:
import breeze.linalg.{SparseVector, norm}

/**
* @Author: GaoYangtuan
* @Description: compute the cosine similarity between two document vectors
* @Thinkgamer: author of 《推薦系統開發實戰》, maintainer of the "搜索與推薦Wiki" public account, algorithm engineer
* @Param: [f1, f2]
* @return: double
**/
def calTwoDocSim(f1: SV, f2: SV): Double = {
// wrap the Spark ML sparse vectors as Breeze sparse vectors, then
// cosine similarity = dot(v1, v2) / (||v1|| * ||v2||)
val breeze1 = new SparseVector(f1.indices, f1.values, f1.size)
val breeze2 = new SparseVector(f2.indices, f2.values, f2.size)
breeze1.dot(breeze2) / (norm(breeze1) * norm(breeze2))
}
The printed schema and result are:
root
|-- id1: integer (nullable = false)
|-- id2: integer (nullable = false)
|-- simscore: double (nullable = false)
+---+---+------------------+
|id1|id2| simscore|
+---+---+------------------+
| 0| 1|0.8944271909999159|
| 1| 0|0.8944271909999159|
| 1| 2|0.2856369296406274|
| 2| 1|0.2856369296406274|
+---+---+------------------+
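As a quick sanity check of calTwoDocSim outside Spark, it can be called on two hand-made sparse vectors (values made up for illustration):

import org.apache.spark.ml.linalg.Vectors

// two 3-dimensional vectors sharing one non-zero coordinate
val a = Vectors.sparse(3, Array(0, 2), Array(1.0, 1.0)).asInstanceOf[SV]
val b = Vectors.sparse(3, Array(0), Array(1.0)).asInstanceOf[SV]
println(calTwoDocSim(a, b)) // 1 / sqrt(2) ≈ 0.7071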
Finally, sort the neighbours, keep the top entries, and save the result:
// for each document, sort its neighbours by similarity in descending order,
// keep the top 100, and serialize them as "id:score" strings
val sortAndSlice = udf { simids: Seq[Row] =>
simids.map{
case Row(id2: Int, simscore: Double) => (id2,simscore)
}
.sortBy(_._2)
.reverse
.slice(0,100)
.map(e => e._1 + ":" + e._2.formatted("%.3f"))
.mkString(",")
}
val result = df
.groupBy($"id1")
.agg(collect_list(struct($"id2", $"simscore")).as("simids"))
.withColumn("simids", sortAndSlice(sort_array($"simids", asc = false)))
result.show(10)
result.coalesce(1).write.format("parquet").mode("overwrite").save("data/tfidf")
The printed result is:
+---+---------------+
|id1| simids|
+---+---------------+
| 1|0:0.894,2:0.286|
| 2| 1:0.286|
| 0| 1:0.894|
+---+---------------+
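The saved result can be read back with a standard Parquet read for verification:

val saved = spark.read.parquet("data/tfidf")
saved.show(10)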