Getting Started with Natural Language Processing
1. BOW (bag-of-words) model: represents a piece of text or a document as an unordered collection of words, treating each word's occurrence as independent of the others. A document is represented as a binary vector (1 if the word appears, 0 if it does not).
e.g.:
Doc1: practice makes perfect perfect.
Doc2: nobody is perfect.
With Doc1 and Doc2 as the corpus, the vocabulary is (practice, makes, perfect, nobody, is).
Doc1 as a BOW vector: [1, 1, 1, 0, 0]
Doc2 as a BOW vector: [0, 0, 1, 1, 1]
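The binary representation above can be sketched in a few lines of plain Python (the vocabulary order is fixed by hand here to match the example; the naive lowercase/whitespace tokenizer is an illustrative assumption):

```python
# Binary bag-of-words: 1 if the word appears in the document, 0 otherwise.
vocab = ['practice', 'makes', 'perfect', 'nobody', 'is']

def bow_binary(doc, vocab):
    # Strip the trailing period and split on whitespace (naive tokenizer).
    words = set(doc.lower().rstrip('.').split())
    return [1 if w in words else 0 for w in vocab]

print(bow_binary('practice makes perfect perfect.', vocab))  # [1, 1, 1, 0, 0]
print(bow_binary('nobody is perfect.', vocab))               # [0, 0, 1, 1, 1]
```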
2. CountVectorizer model: the same unordered-word representation with each word's occurrence treated as independent, but each entry is the number of times the word occurs in the document rather than a binary indicator.
e.g.:
Doc1: practice makes perfect perfect.
Doc2: nobody is perfect.
With Doc1 and Doc2 as the corpus, the vocabulary is (practice, makes, perfect, nobody, is).
Doc1 as a count vector: [1, 1, 2, 0, 0]
Doc2 as a count vector: [0, 0, 1, 1, 1]
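The count-based version only changes one step relative to the binary sketch: keep the frequency instead of a 0/1 flag. A minimal sketch with the same hand-fixed vocabulary:

```python
from collections import Counter

# Count-based bag-of-words: each entry is the word's frequency in the document.
vocab = ['practice', 'makes', 'perfect', 'nobody', 'is']

def bow_counts(doc, vocab):
    counts = Counter(doc.lower().rstrip('.').split())
    return [counts[w] for w in vocab]

print(bow_counts('practice makes perfect perfect.', vocab))  # [1, 1, 2, 0, 0]
print(bow_counts('nobody is perfect.', vocab))               # [0, 0, 1, 1, 1]
```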
CountVectorizer is provided by sklearn (from sklearn.feature_extraction.text import CountVectorizer):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

'''stop-word list'''
stop_list = 'is a the of'.split()
'''declare the CountVectorizer model'''
cnt = CountVectorizer(min_df=1, ngram_range=(1, 2), stop_words=stop_list)
'''corpus'''
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
'''convert the text into features'''
X = cnt.fit_transform(corpus)
print(type(X))  # a scipy CSR sparse matrix; the exact class path (e.g. scipy.sparse.csr.csr_matrix) varies by scipy version
print(X)
'''get the feature-name list (get_feature_names() was removed in sklearn 1.2; use get_feature_names_out())'''
print(cnt.get_feature_names_out())
'''
output:
['and' 'and third' 'document' 'first' 'first document' 'one' 'second'
 'second document' 'second second' 'third' 'third one' 'this' 'this first' 'this second']
'''
'''get the learned vocabulary with each term's column index'''
print(cnt.vocabulary_)
'''
output:
{'this': 11, 'first': 3, 'document': 2, 'this first': 12, 'first document': 4, 'second': 6,
'this second': 13, 'second second': 8, 'second document': 7, 'and': 0, 'third': 9, 'one': 5,
'and third': 1, 'third one': 10}
'''
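To inspect the counts as a dense document-term matrix, the sparse result can be converted with toarray(): rows are documents, and columns follow the indices in vocabulary_. A self-contained sketch repeating the setup above:

```python
from sklearn.feature_extraction.text import CountVectorizer

stop_list = 'is a the of'.split()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
cnt = CountVectorizer(min_df=1, ngram_range=(1, 2), stop_words=stop_list)
X = cnt.fit_transform(corpus)

# Dense view: one row per document, one column per feature index in vocabulary_.
print(X.toarray())
# e.g. the second row has a 2 in the 'second' column, since 'second' occurs twice there.
print(X.toarray()[1][cnt.vocabulary_['second']])  # 2
```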
3. TF-IDF model (sklearn provides TF-IDF feature extraction: from sklearn.feature_extraction.text import TfidfVectorizer)
Step 1: compute the term frequency (TF)
TF = (number of occurrences of the word in the document) / (total number of words in the document)
Step 2: compute the inverse document frequency (IDF)
IDF = log(total number of documents in the corpus / number of documents containing the word)
Step 3: compute TF-IDF
TF-IDF = TF * IDF
Implementation code is on git: https://github.com/frostjsy/my_study/blob/master/nlp/feature_extract/tf_idf.py
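The three steps above can be sketched directly in plain Python, using the formulas exactly as written (no smoothing). The whitespace tokenizer and the tf_idf helper name are illustrative assumptions, not taken from the linked implementation:

```python
import math

def tf_idf(docs):
    """TF-IDF per the plain formulas: TF = count/len(doc), IDF = log(N/df)."""
    tokenized = [d.lower().rstrip('.').split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted(set(w for d in tokenized for w in d))
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for d in tokenized if w in d) for w in vocab}
    scores = []
    for d in tokenized:
        scores.append([
            (d.count(w) / len(d)) * math.log(n_docs / df[w])
            for w in vocab
        ])
    return vocab, scores

vocab, scores = tf_idf(['practice makes perfect perfect.', 'nobody is perfect.'])
# 'perfect' appears in both documents, so IDF = log(2/2) = 0 and its score is 0 everywhere.
print(scores[0][vocab.index('perfect')])   # 0.0
# 'practice' appears in one of two documents: TF = 1/4, IDF = log(2).
print(scores[0][vocab.index('practice')])
```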
TfidfVectorizer is provided by sklearn (from sklearn.feature_extraction.text import TfidfVectorizer):
'''declare the TfidfVectorizer model; usage is similar to CountVectorizer'''
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_list)
x = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(tfidf.vocabulary_)
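Note that TfidfVectorizer's scores will not exactly match the plain formula in Step 2: by default sklearn uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. A self-contained check of the smoothed IDF on the same corpus and stop list:

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

stop_list = 'is a the of'.split()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_list)
tfidf.fit(corpus)

# sklearn's default (smooth_idf=True): idf = ln((1 + n_docs) / (1 + df)) + 1
n_docs = len(corpus)
idx = tfidf.vocabulary_['this']                  # 'this' appears in 3 of the 4 documents
expected = math.log((1 + n_docs) / (1 + 3)) + 1
print(tfidf.idf_[idx], expected)                 # the two values agree
```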