Spark MLlib CountVectorizer Source Code Analysis

CountVectorizer and CountVectorizerModel aim to convert text documents into vectors of token counts. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model generates a sparse representation of each document over the vocabulary, which can then be passed to other algorithms such as LDA. During fitting, CountVectorizer selects terms ordered by their frequency across the corpus, from highest to lowest; the maximum vocabulary size is controlled by the vocabSize parameter. An optional parameter, minDF, also affects fitting: it specifies the minimum number (or fraction) of distinct documents a term must appear in to be included in the vocabulary.
1. vocabSize: maximum size of the vocabulary

/**
   * Max size of the vocabulary. Defaults to math.pow(2, 18).
   * Term selection: word counts are computed over the corpus, then the top
   * vocabSize terms are kept in the vocabulary.
   * Default: 2^18^
   * @group param
   */
  val vocabSize: IntParam =
    new IntParam(this, "vocabSize", "max size of the vocabulary", ParamValidators.gt(0))
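A minimal usage sketch (the value is chosen only for illustration): with a small vocabSize, fit keeps just the globally most frequent terms.

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(1000) // keep at most the 1000 most frequent terms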

2. minDF: minimum document frequency

 /**
   * DF is the number of documents a term appears in; minDF is the lower bound on DF
   * for a term to be included in the vocabulary.
   * If the value is an integer >= 1, it is an absolute document count; if it is a
   * double in [0, 1), it is a fraction of the documents.
   * Default: 1.0
   * @group param
   */
   val minDF: DoubleParam = new DoubleParam(this, "minDF", "Specifies the minimum number of" +
    " different documents a term must appear in to be included in the vocabulary." +
    " If this is an integer >= 1, this specifies the number of documents the term must" +
    " appear in; if this is a double in [0,1), then this specifies the fraction of documents.",
    ParamValidators.gtEq(0.0))
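A hedged sketch of the two interpretations (thresholds chosen only for illustration):

// absolute document count: keep terms that appear in at least 2 documents
val cvByCount = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinDF(2)

// fraction of documents: keep terms that appear in at least 50% of the documents
val cvByFraction = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinDF(0.5)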

3. minTF: minimum term frequency within a document
/**
   * TF is a term's frequency within a document: the number of times the term appears
   * divided by the document's total token count.
   * Default: 1.0
   * @group param
   */
 val minTF: DoubleParam = new DoubleParam(this, "minTF", "Filter to ignore rare words in" +
    " a document. For each document, terms with frequency/count less than the given threshold are" +
    " ignored. If this is an integer >= 1, then this specifies a count (of times the term must" +
    " appear in the document); if this is a double in [0,1), then this specifies a fraction (out" +
    " of the document's token count). Note that the parameter is only used in transform of" +
    " CountVectorizerModel and does not affect fitting.", ParamValidators.gtEq(0.0))

4. binary

/**
   * Binary toggle for the output values. Useful for discrete probabilistic models
   * that model binary events (occurrence vs. non-occurrence) rather than integer counts.
   * @group param
   */
  val binary: BooleanParam =
    new BooleanParam(this, "binary", "If True, all non zero counts are set to 1.")
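A short sketch (column names assumed): with binary enabled, the output vector records only presence/absence, which suits models such as Bernoulli naive Bayes.

val binaryModel = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true) // all non-zero counts become 1.0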

5. A look at fit: building the vocabulary

@Since("2.0.0")
  override def fit(dataset: Dataset[_]): CountVectorizerModel = {
    transformSchema(dataset.schema, logging = true)
    val vocSize = $(vocabSize)
    val input = dataset.select($(inputCol)).rdd.map(_.getAs[Seq[String]](0))
    // If minDF >= 1.0, use it directly as an absolute document count;
    // otherwise scale it by the total number of documents: minDF * input.count()
    val minDf = if ($(minDF) >= 1.0) {
      $(minDF)
    } else {
      $(minDF) * input.cache().count()
    }
    // input is an RDD[Seq[String]]: compute a word count per document, flatMap the
    // per-document maps into (word, (count, 1)) pairs, then reduceByKey to aggregate
    // each word's global count and document frequency (DF), and filter by minDf
    val wordCounts: RDD[(String, Long)] = input.flatMap { case (tokens) =>
      val wc = new OpenHashMap[String, Long]
      tokens.foreach { w =>
        wc.changeValue(w, 1L, _ + 1L)
      }
      wc.map { case (word, count) => (word, (count, 1)) }
    }.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
      (wc1 + wc2, df1 + df2)
    }.filter { case (word, (wc, df)) =>
      df >= minDf
    }.map { case (word, (count, dfCount)) =>
      (word, count)
    }.cache()
    val fullVocabSize = wordCounts.count()

    // Use the top operator to take a fixed number of the most frequent terms into the vocabulary
    val vocab = wordCounts
      .top(math.min(fullVocabSize, vocSize).toInt)(Ordering.by(_._2))
      .map(_._1)

    require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
    copyValues(new CountVectorizerModel(uid, vocab).setParent(this))
  }
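To make the (count, DF) aggregation above concrete, here is a minimal local sketch of the same logic using plain Scala collections instead of RDDs (a toy corpus, not part of the Spark source):

val docs = Seq(Seq("a", "b", "a"), Seq("b", "c"))
val wordCounts = docs
  .flatMap { tokens =>
    // per-document word count; each word contributes a document frequency of 1
    tokens.groupBy(identity).map { case (w, ws) => (w, (ws.size.toLong, 1)) }
  }
  .groupBy(_._1)
  .map { case (w, pairs) =>
    // sum the global counts and the document frequencies, as reduceByKey does
    val (counts, dfs) = pairs.map(_._2).unzip
    (w, (counts.sum, dfs.sum))
  }
// wordCounts: Map(a -> (2,1), b -> (2,2), c -> (1,1))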

Tips: overall the implementation is simple and easy to follow. The main thing to watch is the vocabulary size: if new data is much larger than the training sample, many of its terms will not appear in the model's vocabulary and will simply be ignored during transform, distorting the resulting feature vectors.

6. A look at transform: generating the vectors

override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema, logging = true)
    if (broadcastDict.isEmpty) {
      val dict = vocabulary.zipWithIndex.toMap
      broadcastDict = Some(dataset.sparkSession.sparkContext.broadcast(dict))
    }
    val dictBr = broadcastDict.get
    val minTf = $(minTF)
    val vectorizer = udf { (document: Seq[String]) =>
      // term counts for this document, keyed by vocabulary index
      val termCounts = new OpenHashMap[Int, Double]
      // total number of tokens in this document
      var tokenCount = 0L
      document.foreach { term =>
        dictBr.value.get(term) match {
          case Some(index) => termCounts.changeValue(index, 1.0, _ + 1.0)
          case None => // ignore terms not in the vocabulary
        }
        tokenCount += 1
      }
      // effective minTF threshold: an absolute count if minTF >= 1.0,
      // otherwise a fraction of this document's token count
      val effectiveMinTF = if (minTf >= 1.0) minTf else tokenCount * minTf
      val effectiveCounts = if ($(binary)) {
        // in binary mode all kept counts are set to 1.0
        termCounts.filter(_._2 >= effectiveMinTF).map(p => (p._1, 1.0)).toSeq
      } else {
        termCounts.filter(_._2 >= effectiveMinTF).toSeq
      }
      // build the sparse output vector
      Vectors.sparse(dictBr.value.size, effectiveCounts)
    }
    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
  }
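The final Vectors.sparse call builds the output; a tiny standalone sketch of that representation (the numbers are made up):

import org.apache.spark.ml.linalg.Vectors

// vocabulary size 3; terms at indices 0 and 2 appear 2 and 1 times respectively
val v = Vectors.sparse(3, Seq((0, 2.0), (2, 1.0)))
// v: (3,[0,2],[2.0,1.0])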

7. Example:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)
// alternatively, define a CountVectorizerModel with an a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")
cvModel.transform(df).show(false)
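To map the indices of the resulting sparse vectors back to terms, the model's vocabulary array can be used; a hedged sketch based on the cvModel and df defined above:

import org.apache.spark.ml.linalg.Vector

cvModel.transform(df).select("features").collect().foreach { row =>
  val vec = row.getAs[Vector](0).toSparse
  // pair each vocabulary term with its count in this document
  val termsWithCounts = vec.indices.map(i => cvModel.vocabulary(i)).zip(vec.values)
  println(termsWithCounts.mkString(", "))
}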

CountVectorizer is usually combined with IDF to obtain TF-IDF weights describing the distribution of terms. Its output can also be fed to LDA for topic modeling, or it can be applied after NGram to vectorize n-grams.
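A hedged sketch of a CountVectorizer + IDF pipeline producing TF-IDF features (column names are illustrative, reusing the df from the example above):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("tfidf")

// fit both stages and produce TF-IDF vectors
val tfidfModel = new Pipeline().setStages(Array(cv, idf)).fit(df)
tfidfModel.transform(df).select("tfidf").show(false)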
