Reference: https://www.jianshu.com/p/c7e2771eccaa
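The TfidfVectorizer() function
TfidfVectorizer() is based on the tf-idf algorithm. The algorithm has two parts, tf and idf; multiplying the two gives the tf-idf weight.
tf (term frequency) counts how many times a given word occurs in a given training text.
idf (inverse document frequency) is a weighting coefficient that adjusts the term frequency: the more documents a word appears in, the larger the denominator of the idf ratio, and the smaller the idf becomes, approaching 0.
tf-idf = tf * idf.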
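For reference, one common textbook formulation that matches the description above is sketched below. It is an assumption, not necessarily the formula the referenced post uses; the symbols n_{t,d} (occurrences of word t in text d), N (number of texts) and df(t) (number of texts containing t) are introduced here for illustration only. Many texts additionally divide tf by the length of the text.

```latex
\mathrm{tf}(t, d) = n_{t,d}
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t) + 1}
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

With its default settings, scikit-learn uses a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and then scales each row to unit Euclidean length, so its numbers differ from the textbook ones.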
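from sklearn.feature_extraction.text import TfidfVectorizer
texts = ["orange banana apple grape", "banana apple apple", "grape", "orange apple"]
cv = TfidfVectorizer()
cv_fit = cv.fit_transform(texts)  # sparse matrix: one row per text, one column per word
print(cv.vocabulary_)             # mapping from each word to its column index
print(cv_fit)                     # sparse view: (row, column) followed by the tf-idf value
print(cv_fit.toarray())           # dense tf-idf matrix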
TfidfVectorizer combines CountVectorizer and TfidfTransformer and produces the tf-idf values directly.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(tfidf_vec.get_feature_names_out().tolist())
print(tfidf_vec.vocabulary_)
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
print(tfidf_matrix.toarray())
[[0.         0.43877674 0.54197657 0.43877674 0.         0.         0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574 0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.         0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.         0.35872874 0.         0.43877674]]
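The first row of the matrix above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed idf, L2 row normalisation). The helper name tfidf_row is hypothetical and exists only for this illustration.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Tokenise roughly like the default analyzer: lowercase, tokens of 2+ word characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency, then the smoothed idf: ln((1 + N) / (1 + df)) + 1
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Hypothetical helper: raw count * idf, scaled to unit Euclidean length."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


first_row = tfidf_row(docs[0])
for term, weight in zip(vocab, first_row):
    print(f"{term:10s} {weight:.8f}")
```

Note that 'the', which occurs in every document, still gets a non-zero weight, because the smoothed idf never drops below 1.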
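The CountVectorizer() function
CountVectorizer uses fit_transform to convert the words in a set of texts into a term-frequency (count) matrix.
fit_transform: converts the words into a term-frequency matrix
get_feature_names(): lists all the keywords found in the texts (removed in scikit-learn 1.2; use get_feature_names_out() in newer versions). Every word is listed, including 'this', which occurs in 3 of the 4 documents: a textbook idf of log(N / (df + 1)) would be 0 for it, but an idf of 0 does not remove a word from the vocabulary, and scikit-learn's default smoothed idf, ln((1 + N) / (1 + df)) + 1, is never 0.
vocabulary_: shows all the keywords and their column positions
toarray(): shows the resulting term-frequency matrix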
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(vectorizer.vocabulary_)
print(count.toarray())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0 1 1 1 0 0 1 0 1]
[0 1 0 1 0 2 1 0 1]
[1 0 0 0 1 0 1 1 0]
[0 1 1 1 0 0 1 0 1]]
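To make clear what the count matrix above contains, here is a rough plain-Python sketch of the same bookkeeping. It assumes the default behaviour of CountVectorizer (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); it is an illustration, not the library's actual implementation.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Lowercase each text and keep tokens of two or more word characters.
token_pattern = re.compile(r"\b\w\w+\b")
docs = [token_pattern.findall(text.lower()) for text in corpus]

# Sorted vocabulary: word -> column index (like vocabulary_).
all_terms = sorted({tok for doc in docs for tok in doc})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# One row per text, one column per word, each cell a raw count.
count_matrix = []
for doc in docs:
    counts = Counter(doc)  # missing words count as 0
    count_matrix.append([counts[term] for term in all_terms])

print(vocabulary)
for row in count_matrix:
    print(row)
```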
The TfidfTransformer() function
TfidfTransformer computes the tf-idf weight of each word from the count matrix produced by CountVectorizer.
transformer = TfidfTransformer()
tfidf_matrix = transformer.fit_transform(count)
print(tfidf_matrix.toarray())
[[0.         0.43877674 0.54197657 0.43877674 0.         0.         0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574 0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.         0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.         0.35872874 0.         0.43877674]]
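The matrix above is the same one TfidfVectorizer printed earlier. Rather than comparing the numbers by eye, the equivalence can be checked in code. This is a small sketch assuming default parameters for all three classes; the variable names direct, two_step and same are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# One step: raw text -> tf-idf matrix.
direct = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw text -> count matrix -> tf-idf matrix.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Compare the dense forms element by element, within floating-point tolerance.
same = np.allclose(direct.toarray(), two_step.toarray())
print(same)
```

The two-step form is mainly useful when the raw counts are needed for something else as well; otherwise TfidfVectorizer alone is shorter.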