CountVectorizer and CountVectorizerModel aim to convert a collection of documents into vectors of token counts. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model yields sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.

During fitting, CountVectorizer selects terms ordered by their frequency across the corpus, from highest to lowest; the maximum vocabulary size is controlled by the vocabSize parameter. An optional parameter, minDF, also affects fitting: it specifies the minimum number (or fraction) of distinct documents in which a term must appear to be included in the vocabulary.
1. vocabSize — the maximum size of the vocabulary
/**
 * Max size of the vocabulary. Defaults to math.pow(2, 18).
 * Term selection: run a word count over the corpus, then take the top vocabSize terms.
 * Default: 2^18^
 * @group param
 */
val vocabSize: IntParam =
  new IntParam(this, "vocabSize", "max size of the vocabulary", ParamValidators.gt(0))
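The selection rule described in the comment (word count first, then take the top vocabSize terms) can be sketched outside Spark. This is an illustrative pure-Python re-implementation with my own function names, not the actual Spark code:

```python
def top_vocab(word_counts, vocab_size):
    """Keep only the vocab_size most frequent terms, mirroring wordCounts.top(...)."""
    ranked = sorted(word_counts.items(), key=lambda kv: -kv[1])
    return [term for term, _ in ranked[:vocab_size]]

counts = {"the": 100, "spark": 40, "ml": 25, "rare": 1}
print(top_vocab(counts, 3))  # ['the', 'spark', 'ml']
```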
2. minDF — minimum document frequency (DF is a plain document count, not an inverse frequency)
/**
 * DF is the number of documents in the corpus that contain a term; minDF is the lower
 * bound on DF for a term to be included in the vocabulary.
 * If an Int is passed, it is the absolute number of documents the term must appear in;
 * a double in [0, 1) is interpreted as a fraction of documents.
 * Default: 1.0
 * @group param
 */
val minDF: DoubleParam = new DoubleParam(this, "minDF", "Specifies the minimum number of" +
  " different documents a term must appear in to be included in the vocabulary." +
  " If this is an integer >= 1, this specifies the number of documents the term must" +
  " appear in; if this is a double in [0,1), then this specifies the fraction of documents.",
  ParamValidators.gtEq(0.0))
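The dual Int/fraction semantics can be mirrored in a small pure-Python sketch (illustrative only; the names are mine, not Spark's):

```python
def filter_by_df(doc_freqs, min_df, num_docs):
    """minDF >= 1 is an absolute document count; a value in [0, 1) is a fraction of docs."""
    threshold = min_df if min_df >= 1.0 else min_df * num_docs
    return {t: df for t, df in doc_freqs.items() if df >= threshold}

dfs = {"a": 8, "b": 2, "c": 1}
print(filter_by_df(dfs, 2.0, 10))  # {'a': 8, 'b': 2}  (absolute count)
print(filter_by_df(dfs, 0.2, 10))  # {'a': 8, 'b': 2}  (fraction: threshold = 0.2 * 10 = 2)
```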
3. minTF — minimum term frequency within a document
/**
 * TF is the frequency of a term within a single document: the term's count divided by
 * the document's total token count.
 * Default: 1.0
 * @group param
 */
val minTF: DoubleParam = new DoubleParam(this, "minTF", "Filter to ignore rare words in" +
  " a document. For each document, terms with frequency/count less than the given threshold are" +
  " ignored. If this is an integer >= 1, then this specifies a count (of times the term must" +
  " appear in the document); if this is a double in [0,1), then this specifies a fraction (out" +
  " of the document's token count). Note that the parameter is only used in transform of" +
  " CountVectorizerModel and does not affect fitting.", ParamValidators.gtEq(0.0))
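The per-document threshold — and the fact that it only applies at transform time, not during fitting — can be sketched like this (a pure-Python illustration, not the Spark code):

```python
def apply_min_tf(term_counts, token_count, min_tf):
    """Drop terms below the effective threshold; min_tf in [0, 1) is a fraction of tokens."""
    threshold = min_tf if min_tf >= 1.0 else token_count * min_tf
    return {t: c for t, c in term_counts.items() if c >= threshold}

counts = {"a": 3.0, "b": 1.0}        # counts from a 4-token document
print(apply_min_tf(counts, 4, 0.5))  # {'a': 3.0}  (threshold = 0.5 * 4 = 2.0)
print(apply_min_tf(counts, 4, 1.0))  # both terms survive
```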
4. binary — binarize counts
/**
 * Useful for discrete probabilistic models that model binary events (a term occurs or
 * it does not) rather than integer counts.
 * @group param
 */
val binary: BooleanParam =
  new BooleanParam(this, "binary", "If True, all non zero counts are set to 1.")
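The effect on the counts can be shown with a one-line sketch (illustrative Python, not the Spark code):

```python
def binarize(term_counts):
    """With binary = true, every non-zero count becomes 1.0 (presence/absence)."""
    return {t: 1.0 for t, c in term_counts.items() if c > 0}

print(binarize({"a": 3.0, "b": 1.0, "c": 0.0}))  # {'a': 1.0, 'b': 1.0}
```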
5. The fit method — building ("packing") the vocabulary
@Since("2.0.0")
override def fit(dataset: Dataset[_]): CountVectorizerModel = {
  transformSchema(dataset.schema, logging = true)
  val vocSize = $(vocabSize)
  val input = dataset.select($(inputCol)).rdd.map(_.getAs[Seq[String]](0))
  // If minDF >= 1 it is used as an absolute count; otherwise use minDF * input.count()
  val minDf = if ($(minDF) >= 1.0) {
    $(minDF)
  } else {
    $(minDF) * input.cache().count()
  }
  // input: RDD[Seq[String]]. Compute a word count per document, flatten with flatMap
  // into (word, (count, 1)) pairs, then reduceByKey to accumulate each word's total
  // count and document frequency (DF)
  val wordCounts: RDD[(String, Long)] = input.flatMap { case (tokens) =>
    val wc = new OpenHashMap[String, Long]
    tokens.foreach { w =>
      wc.changeValue(w, 1L, _ + 1L)
    }
    wc.map { case (word, count) => (word, (count, 1)) }
  }.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
    (wc1 + wc2, df1 + df2)
  }.filter { case (word, (wc, df)) =>
    df >= minDf
  }.map { case (word, (count, dfCount)) =>
    (word, count)
  }.cache()
  val fullVocabSize = wordCounts.count()
  // Use the top operator to take a fixed number of terms into the vocabulary
  val vocab = wordCounts
    .top(math.min(fullVocabSize, vocSize).toInt)(Ordering.by(_._2))
    .map(_._1)
  require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
  copyValues(new CountVectorizerModel(uid, vocab).setParent(this))
}
Tips: the overall implementation is fairly straightforward. The main thing to watch is the vocabulary size: if your new data is much larger than the training sample, many terms will be absent from the model's vocabulary, so those features are silently ignored at transform time and the resulting vectors misrepresent the data.
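The control flow above (per-document word counts, merged into corpus-wide (count, DF) pairs, filtered by minDF, then truncated to vocabSize) can be mirrored without Spark. A pure-Python sketch, with my own function names:

```python
def fit(corpus, vocab_size, min_df):
    stats = {}                                  # term -> (total_count, doc_freq)
    for doc in corpus:
        wc = {}                                 # per-document word count
        for term in doc:
            wc[term] = wc.get(term, 0) + 1
        for term, count in wc.items():          # merge, like flatMap + reduceByKey
            total, df = stats.get(term, (0, 0))
            stats[term] = (total + count, df + 1)
    # minDF >= 1 is an absolute count, otherwise a fraction of the corpus size
    threshold = min_df if min_df >= 1.0 else min_df * len(corpus)
    kept = {t: total for t, (total, df) in stats.items() if df >= threshold}
    ranked = sorted(kept.items(), key=lambda kv: -kv[1])   # like wordCounts.top(...)
    return [t for t, _ in ranked[:vocab_size]]

corpus = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
print(fit(corpus, vocab_size=3, min_df=2.0))   # ['a', 'b', 'c']
```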
6. The transform method — generating the vectors
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  if (broadcastDict.isEmpty) {
    val dict = vocabulary.zipWithIndex.toMap
    broadcastDict = Some(dataset.sparkSession.sparkContext.broadcast(dict))
  }
  val dictBr = broadcastDict.get
  val minTf = $(minTF)
  val vectorizer = udf { (document: Seq[String]) =>
    // Term frequencies for this document
    val termCounts = new OpenHashMap[Int, Double]
    // Total number of tokens in this document
    var tokenCount = 0L
    document.foreach { term =>
      dictBr.value.get(term) match {
        case Some(index) => termCounts.changeValue(index, 1.0, _ + 1.0)
        case None => // ignore terms not in the vocabulary
      }
      tokenCount += 1
    }
    // Compute the effective minTF threshold
    val effectiveMinTF = if (minTf >= 1.0) minTf else tokenCount * minTf
    val effectiveCounts = if ($(binary)) {
      // With binary = true the values are only 0 and 1
      termCounts.filter(_._2 >= effectiveMinTF).map(p => (p._1, 1.0)).toSeq
    } else {
      termCounts.filter(_._2 >= effectiveMinTF).toSeq
    }
    // Build the sparse vector
    Vectors.sparse(dictBr.value.size, effectiveCounts)
  }
  dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
}
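The UDF's behavior can likewise be mirrored in a pure-Python sketch that returns the sparse vector as (size, [(index, value), ...]); an illustration, not the Spark code:

```python
def transform(doc, vocab, min_tf=1.0, binary=False):
    index = {t: i for i, t in enumerate(vocab)}
    counts, tokens = {}, 0
    for term in doc:
        if term in index:                        # terms not in the vocabulary are ignored
            counts[index[term]] = counts.get(index[term], 0.0) + 1.0
        tokens += 1
    threshold = min_tf if min_tf >= 1.0 else tokens * min_tf
    entries = [(i, 1.0 if binary else c)
               for i, c in sorted(counts.items()) if c >= threshold]
    return (len(vocab), entries)

print(transform(["a", "b", "b", "c", "a"], ["a", "b", "c"]))
# (3, [(0, 2.0), (1, 2.0), (2, 1.0)])
```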
7. Example:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// Fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// Alternatively, define a CountVectorizerModel with an a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
CountVectorizer is usually combined with IDF to capture the distribution of terms across the corpus. The count vectors it produces can also be fed to LDA for topic modeling, and the same counting approach can be applied to n-grams.
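As a sketch of the CountVectorizer + IDF pairing: given the count vectors, IDF down-weights terms that occur in many documents. The sketch below uses the smoothed formula idf = log((m + 1) / (df + 1)) from Spark's TF-IDF documentation, but it is a pure-Python illustration, not Spark's implementation:

```python
import math

def idf_weights(count_vectors):
    """One IDF weight per vocabulary index, computed from dense count vectors."""
    m = len(count_vectors)                       # number of documents
    size = len(count_vectors[0])
    dfs = [sum(1 for v in count_vectors if v[i] > 0) for i in range(size)]
    return [math.log((m + 1) / (df + 1)) for df in dfs]

def tf_idf(vec, idf):
    """Scale a count vector by the per-term IDF weights."""
    return [c * w for c, w in zip(vec, idf)]

vectors = [[1.0, 1.0, 0.0], [2.0, 2.0, 1.0]]
idf = idf_weights(vectors)
print([round(w, 3) for w in idf])   # the last term is rarer, so it gets a higher weight
```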