A Brief Look at TF-IDF in Spark

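I keep forgetting what I have used or learned, looking it up, and forgetting it again, so I might as well write it down here, where it can also help others.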

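When implementing machine learning algorithms with Spark on a corpus or dataset of Chinese text, the text first has to be converted into a format such as Vector or LabeledPoint. TF-IDF is the tool for that conversion.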

What Is TF-IDF

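TF (Term Frequency): how often a word or phrase occurs in a given document; put plainly, the word frequency. Its formula is:

$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$

Here the numerator $n_{i,j}$ is the number of times word $t_i$ occurs in document $d_j$, and the denominator is the total number of occurrences of all words in that document.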
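As a quick check of the formula, take a hypothetical document of five words in which the word of interest occurs twice (the counts are made up for illustration):

$TF_{i,j} = \frac{2}{5} = 0.4$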

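IDF (Inverse Document Frequency): across all documents, the fewer documents contain a given word (the word whose TF-IDF value you want), the larger its IDF, which means the word discriminates well between categories. The IDF of a word is obtained by dividing the total number of documents by the number of documents that contain the word, then taking the base-10 logarithm of the quotient:

$IDF_{i} = \lg\frac{|D|}{|\left \{ j : t_{i} \in d_{j} \right \}|}$

Here $|D|$ is the total number of documents in the corpus, and $|\{ j : t_{i} \in d_{j} \}|$ is the number of documents that contain word $t_i$.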
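For instance, in a corpus of six documents where the word appears in two of them (these counts match the sample data used later in this post, assuming the segmenter keeps “祖國” as a single token):

$IDF_{i} = \lg\frac{6}{2} = \lg 3 \approx 0.477$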

Then take the product of TF and IDF:

$TFIDF_{i,j} = TF_{i,j} \times IDF_{i}$
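The three formulas above can be sketched in plain Scala without Spark. This is only an illustration of the textbook definitions, not how Spark computes the values: the object name `TfIdfSketch`, its method names, and the hand-written token lists are all assumptions made for this example.

```scala
object TfIdfSketch {

  /** TF: occurrences of `term` in `doc` divided by the total number of tokens in `doc`. */
  def tf(term: String, doc: Seq[String]): Double =
    if (doc.isEmpty) 0.0 else doc.count(_ == term).toDouble / doc.size

  /** IDF: base-10 log of (number of documents / number of documents containing `term`).
    * Returns 0.0 when no document contains the term, to avoid dividing by zero. */
  def idf(term: String, corpus: Seq[Seq[String]]): Double = {
    val docsWithTerm = corpus.count(_.contains(term))
    if (docsWithTerm == 0) 0.0 else math.log10(corpus.size.toDouble / docsWithTerm)
  }

  /** TF-IDF: the product of the two values above. */
  def tfIdf(term: String, doc: Seq[String], corpus: Seq[Seq[String]]): Double =
    tf(term, doc) * idf(term, corpus)

  def main(args: Array[String]): Unit = {
    // Hypothetical, hand-segmented documents (not real ANSJ output).
    val corpus = Seq(
      Seq("祖國"),
      Seq("中華人民共和國", "萬歲"),
      Seq("祖國", "萬歲"),
      Seq("中華", "文明", "萬歲"),
      Seq("小學生"),
      Seq("中華人民共和國", "雄起"))
    println(tfIdf("祖國", corpus(2), corpus))
  }
}
```

Keeping `tf`, `idf`, and `tfIdf` as separate functions makes it easy to swap in a different logarithm or a smoothed IDF later.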

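In practice, though, not every word or phrase needs a TF-IDF value. Words such as “的”, “我們”, and “在” are particles, prepositions, or short phrases that are treated as stop words, and they have to be filtered out with a dedicated tool. The Chinese word segmentation tool I use is ANSJ; it basically meets my needs and is fairly accurate. ANSJ can be downloaded from: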

https://oss.sonatype.org/content/repositories/releases/org/ansj/ansj_seg/

https://oss.sonatype.org/content/repositories/releases/org/nlpcn/nlp-lang/

Code for Chinese word segmentation with ANSJ:

import java.io.InputStream
import java.util

import org.ansj.domain.Result
import org.ansj.recognition.impl.StopRecognition
import org.ansj.splitWord.analysis.ToAnalysis
import org.ansj.util.MyStaticValue
import org.apache.spark.{SparkConf, SparkContext}
import org.nlpcn.commons.lang.tire.domain.{Forest, Value}
import org.nlpcn.commons.lang.tire.library.Library
import org.nlpcn.commons.lang.util.IOUtil

class ChineseSegment extends Serializable {

  @transient private val sparkConf: SparkConf = new SparkConf().setAppName("chinese segment")
  @transient private val sparkContext: SparkContext = SparkContext.getOrCreate(sparkConf)

  private val stopLibRecog = new StopLibraryRecognition
  private val stopLib: util.ArrayList[String] = stopLibRecog.stopLibraryFromHDFS(sparkContext)
  private val selfStopRecognition: StopRecognition = stopLibRecog.stopRecognitionFilter(stopLib)

  private val dicUserLibrary = new DicUserLibrary
  @transient private val aListDicLibrary: util.ArrayList[Value] = dicUserLibrary.getUserLibraryList(sparkContext)
  @transient private val dirLibraryForest: Forest = Library.makeForest(aListDicLibrary)

  /** Chinese word segmentation followed by stop-word recognition */
  def cNSeg(comment : String) : String = {

    val result: Result = ToAnalysis.parse(comment,dirLibraryForest).recognition(selfStopRecognition)
    result.toStringWithOutNature(" ")
  }


}


/** Stop-word dictionary recognition.
  * Format: word, then an optional stop-word type, separated by a Tab
  * For example:
  * #
  * v nature
  * .*了 regex
  *
  * */

class StopLibraryRecognition extends Serializable {

  def stopRecognitionFilter(arrayList: util.ArrayList[String]): StopRecognition ={

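    MyStaticValue.isQuantifierRecognition = true // merge numerals with their measure words

    val stopRecognition = new StopRecognition

    // Filter by part of speech: prepositions (p), interjections (e), conjunctions (c),
    // pronouns (r), particles (u), strings (x), onomatopoeia (o)
    stopRecognition.insertStopNatures("p", "e", "c", "r", "u", "x", "o")

    stopRecognition.insertStopNatures("w")  // drop punctuation

    // Drop tokens that start with a Chinese numeral and are at most three characters long;
    // longer tokens are kept
    stopRecognition.insertStopRegexes("^一.{0,2}","^二.{0,2}","^三.{0,2}","^四.{0,2}","^五.{0,2}",
      "^六.{0,2}","^七.{0,2}","^八.{0,2}","^九.{0,2}","^十.{0,2}")

    stopRecognition.insertStopNatures("null") // drop tokens with a null nature

    stopRecognition.insertStopRegexes(".{0,1}")  // drop empty and single-character tokens

    stopRecognition.insertStopRegexes("^[a-zA-Z]{1,}")  // drop tokens made up only of English letters

    stopRecognition.insertStopWords(arrayList)  // add the custom stop words

    stopRecognition.insertStopRegexes("^[0-9]+") // drop tokens made up only of digits

    // Drop tokens made up entirely of characters other than Chinese characters, English letters, and digits
    stopRecognition.insertStopRegexes("[^a-zA-Z0-9\\u4e00-\\u9fa5]+")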

    stopRecognition
  }


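  /** Stop-word file format, one word per line:
  導演
  上映
  終於
  加載
  中國*/
  def stopLibraryFromHDFS(sparkContext: SparkContext): util.ArrayList[String] ={
    /** Read the data in stop.dic (approach 2):
      * when running on a cluster, put the stop-word data on HDFS so that every node in the cluster can read it */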
    val stopLib: Array[String] = sparkContext.textFile("hdfs://zysdmaster000:8020/data/library/stop.dic").collect()
    val arrayList: util.ArrayList[String] = new util.ArrayList[String]()
    for (i<- 0 until stopLib.length)arrayList.add(stopLib(i))

    arrayList

  }
}


/** User-defined dictionary.
  * Format: word, part of speech, frequency
  * The word, part of speech, and frequency are separated by Tabs
  * For example:
  * 足球王者        define  1513
  * 媽媽咪呀2       define  1514
  * 黃金兄弟        define  1515
  * 江湖兒女        define  1516
  * 一生有你        define  1517
  *
  * */
class DicUserLibrary extends Serializable {

  def getUserLibraryList(sparkContext: SparkContext): util.ArrayList[Value] = {
    /** Read the data in userLibrary.dic (approach 2):
      * when running on a cluster, put the user library data on HDFS so that every node in the cluster can read it */
    val va: Array[String] = sparkContext.textFile("hdfs://zysdmaster000:8020/data/library/userLibrary.dic").collect()
    val arrayList: util.ArrayList[Value] = new util.ArrayList[Value]()
    for (i <- 0 until va.length)arrayList.add(new Value(va(i)))
    arrayList
  }
}

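The segmentation result is as follows:

Once segmentation succeeds, the TF value of each word has to be computed. The HashingTF class is used here, and the resulting TF values are as follows:

Take the row with label 1 as an example: 262144 is the number of hash buckets, 198759 is the bucket index that “祖國” hashes to, and 1.0 is the number of times “祖國” occurs.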
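The bucket count and bucket index come from the hashing trick: each word is hashed and the hash is reduced modulo the number of buckets (262144 is 2^18, the HashingTF default). A minimal sketch of the idea follows. It uses `String.hashCode` purely for illustration, whereas Spark's HashingTF uses MurmurHash3, so the indices produced here will not match Spark's (for example, it will not yield 198759 for “祖國”). The object and method names are hypothetical.

```scala
object HashingSketch {

  /** Number of buckets; 2^18 = 262144 mirrors the HashingTF default. */
  val numFeatures: Int = 1 << 18

  /** Map a term to a bucket index in [0, n).
    * Assumption: String.hashCode stands in for Spark's MurmurHash3. */
  def bucket(term: String, n: Int = numFeatures): Int =
    java.lang.Math.floorMod(term.hashCode, n)

  /** Raw term counts keyed by bucket index, i.e. a sparse TF vector.
    * Terms that collide in the same bucket have their counts added together. */
  def termCounts(tokens: Seq[String], n: Int = numFeatures): Map[Int, Double] =
    tokens.groupBy(bucket(_, n)).map { case (idx, ts) => idx -> ts.size.toDouble }

  def main(args: Array[String]): Unit = {
    // Hypothetical token list for one document.
    println(termCounts(Seq("祖國", "萬歲", "祖國")))
  }
}
```

Because no vocabulary has to be built first, this approach scales well, at the cost of occasional collisions between different words.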

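The TF-IDF values are obtained from the TF values with the IDF and IDFModel classes, and the result is as follows:

Take the row with label 1 as an example again: 262144 is the number of hash buckets, 198759 is the bucket index that “祖國” hashes to, and 0.8472978603872037 is the TF-IDF value computed for “祖國”. The code of the whole program: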

import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, Tokenizer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
object TestML {

  def data(sparkContext: SparkContext): DataFrame ={

    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._

    val chineseSegment = new ChineseSegment

    val data = sparkContext.parallelize(Seq(
      (1,"我愛我的祖國"),
      (2,"中華人民共和國萬歲"),
      (3,"祖國萬歲"),
      (4,"中華文明萬歲"),
      (5,"我是小學生"),
      (6,"中華人民共和國正在雄起"))
    ).map{x =>
      val str = chineseSegment.cNSeg(x._2)
      (x._1,str)
    }.toDF("id","context")

    data

  }

  def computeTFIDF(dataFrame: DataFrame): Unit ={

    // Split the space-separated segmentation result into an array of words
    val tokenizer: Tokenizer = new Tokenizer().setInputCol("context").setOutputCol("words")
    val wordData: DataFrame = tokenizer.transform(dataFrame)

    // Compute the TF values of the segmented words
    val hashingTF: HashingTF = new HashingTF().setInputCol("words").setOutputCol("tfvalues")
    val tfDFrame: DataFrame = hashingTF.transform(wordData)
//    tfDFrame.select("id","words","tfvalues").collect().foreach(println)

    // Fit the IDF model on the TF values and rescale them to TF-IDF
    val idf: IDF = new IDF().setInputCol("tfvalues").setOutputCol("tfidfValues")
    val idfModel: IDFModel = idf.fit(tfDFrame)
    val tfidfDFrame: DataFrame = idfModel.transform(tfDFrame)

    // Print the TF-IDF vector of each comment; collect() brings the rows to the driver first
    tfidfDFrame.select("id","words", "tfidfValues").collect().foreach(println)
  }

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("testml")
    val sparkContext = new SparkContext(sparkConf)
    val dataFrame = data(sparkContext)
    computeTFIDF(dataFrame)
  }
}
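The value 0.8472978603872037 reported earlier does not come from the base-10 formula. Spark's IDF uses a smoothed natural logarithm, $\ln\frac{|D|+1}{DF+1}$, where $DF$ is the number of documents containing the word, and HashingTF emits raw counts rather than normalized frequencies. Assuming “祖國” survives segmentation only in documents 1 and 3 (an assumption about the ANSJ output, which is not shown here), the value can be reproduced by hand:

$IDF = \ln\frac{6+1}{2+1} = \ln\frac{7}{3} \approx 0.8473$

$TFIDF = 1.0 \times 0.8473 \approx 0.8473$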

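The above uses the Spark MLlib framework to compute TF-IDF values for Chinese words. A TF-IDF value effectively expresses how important a word (or phrase) is to some of the documents in a document collection or corpus, but TF-IDF is not a cure-all for measuring the importance of a word in a document.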

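For example:

Let x be the number of documents that contain the word. The larger x is, the smaller the IDF and therefore the smaller the TF-IDF value, so the less the word represents the document; and vice versa.

But when the word occurs in only one or a few documents and in none of the others, it can only represent the documents that contain it, not the documents that do not. TF-IDF therefore does not account for how the frequency of a feature word differs across classes of documents (or between documents). The main problems with TF-IDF are: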

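1. It ignores how a feature word is distributed across classes.

A feature word that occurs often in one class and rarely in the others differs greatly between classes, yet TF-IDF cannot tell whether a feature word is evenly distributed across the classes;

2. It ignores how a feature word is distributed among the documents within a single class.

Within the dataset of one class, a feature word that is evenly distributed over the documents reflects the characteristics of that class well. If it occurs in only a few of the documents and in none of the others, it cannot represent the class even when its TF-IDF value is large.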

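So when feature selection or dimensionality reduction is needed for text features, methods such as the chi-square test or information gain are a better choice.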
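To illustrate how the chi-square statistic takes the class distribution into account, here is a minimal sketch of the standard 2x2 form for one word and one class. The object name, the parameter names, and the example counts are assumptions for illustration; this is not code from the program above.

```scala
object ChiSquareSketch {

  /** Chi-square statistic for one word and one class from a 2x2 contingency table:
    *   a = documents in the class that contain the word
    *   b = documents outside the class that contain the word
    *   c = documents in the class that do not contain the word
    *   d = documents outside the class that do not contain the word
    * Returns 0.0 when a row or column of the table is empty. */
  def chiSquare(a: Long, b: Long, c: Long, d: Long): Double = {
    val n = (a + b + c + d).toDouble
    val denominator = (a + c).toDouble * (b + d) * (a + b) * (c + d)
    if (denominator == 0.0) 0.0
    else n * math.pow(a.toDouble * d - b.toDouble * c, 2) / denominator
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical counts: the word appears in 8 of 10 in-class documents
    // and in 1 of 10 out-of-class documents.
    println(chiSquare(8, 1, 2, 9))
  }
}
```

A word spread evenly over both classes scores 0, while a word concentrated in one class scores high, which is exactly the between-class information that TF-IDF leaves out.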
