Getting Started with Natural Language Processing
1. BOW (bag-of-words) model: represents a piece of text or a document as an unordered collection of words, treating each word's occurrence as independent of the others. A document is represented as a binary vector (1 if the word appears, 0 if it does not).
e.g.:
Doc1: practice makes perfect perfect.
Doc2: nobody is perfect.
With Doc1 and Doc2 as the corpus, the vocabulary is (practice, makes, perfect, nobody, is).
Doc1 as a BOW vector: [1, 1, 1, 0, 0]
Doc2 as a BOW vector: [0, 0, 1, 1, 1]
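The binary representation above can be sketched in a few lines of plain Python (the vocabulary order is fixed by hand here to match the example; the naive lowercase/whitespace tokenizer is an illustrative assumption):

```python
# Binary bag-of-words: 1 if the word appears in the document, 0 otherwise.
vocab = ['practice', 'makes', 'perfect', 'nobody', 'is']

def bow_binary(doc, vocab):
    # Strip the trailing period and split on whitespace (naive tokenizer).
    words = set(doc.lower().rstrip('.').split())
    return [1 if w in words else 0 for w in vocab]

print(bow_binary('practice makes perfect perfect.', vocab))  # [1, 1, 1, 0, 0]
print(bow_binary('nobody is perfect.', vocab))               # [0, 0, 1, 1, 1]
```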
2. CountVectorizer model: the same unordered-word representation with each word's occurrence treated as independent, but each entry is the number of times the word occurs in the document rather than a binary indicator.
e.g.:
Doc1: practice makes perfect perfect.
Doc2: nobody is perfect.
With Doc1 and Doc2 as the corpus, the vocabulary is (practice, makes, perfect, nobody, is).
Doc1 as a count vector: [1, 1, 2, 0, 0]
Doc2 as a count vector: [0, 0, 1, 1, 1]
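The count-based version only changes one step relative to the binary sketch: keep the frequency instead of a 0/1 flag. A minimal sketch with the same hand-fixed vocabulary:

```python
from collections import Counter

# Count-based bag-of-words: each entry is the word's frequency in the document.
vocab = ['practice', 'makes', 'perfect', 'nobody', 'is']

def bow_counts(doc, vocab):
    counts = Counter(doc.lower().rstrip('.').split())
    return [counts[w] for w in vocab]

print(bow_counts('practice makes perfect perfect.', vocab))  # [1, 1, 2, 0, 0]
print(bow_counts('nobody is perfect.', vocab))               # [0, 0, 1, 1, 1]
```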
CountVectorizer is provided by sklearn (from sklearn.feature_extraction.text import CountVectorizer):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

'''stop-word list'''
stop_list = 'is a the of'.split()
'''declare the CountVectorizer model'''
cnt = CountVectorizer(min_df=1, ngram_range=(1, 2), stop_words=stop_list)
'''corpus'''
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
'''convert the text into features'''
X = cnt.fit_transform(corpus)
print(type(X))  # a scipy CSR sparse matrix; the exact class path (e.g. scipy.sparse.csr.csr_matrix) varies by scipy version
print(X)
'''get the feature-name list (get_feature_names() was removed in sklearn 1.2; use get_feature_names_out())'''
print(cnt.get_feature_names_out())
'''
output:
['and' 'and third' 'document' 'first' 'first document' 'one' 'second'
 'second document' 'second second' 'third' 'third one' 'this' 'this first' 'this second']
'''
'''get the learned vocabulary with each term's column index'''
print(cnt.vocabulary_)
'''
output:
{'this': 11, 'first': 3, 'document': 2, 'this first': 12, 'first document': 4, 'second': 6,
'this second': 13, 'second second': 8, 'second document': 7, 'and': 0, 'third': 9, 'one': 5,
'and third': 1, 'third one': 10}
'''
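To inspect the counts as a dense document-term matrix, the sparse result can be converted with toarray(): rows are documents, and columns follow the indices in vocabulary_. A self-contained sketch repeating the setup above:

```python
from sklearn.feature_extraction.text import CountVectorizer

stop_list = 'is a the of'.split()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
cnt = CountVectorizer(min_df=1, ngram_range=(1, 2), stop_words=stop_list)
X = cnt.fit_transform(corpus)

# Dense view: one row per document, one column per feature index in vocabulary_.
print(X.toarray())
# e.g. the second row has a 2 in the 'second' column, since 'second' occurs twice there.
print(X.toarray()[1][cnt.vocabulary_['second']])  # 2
```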
3. TF-IDF model (sklearn provides TF-IDF feature extraction: from sklearn.feature_extraction.text import TfidfVectorizer)
Step 1: compute the term frequency (TF)
TF = (number of occurrences of the word in the document) / (total number of words in the document)
Step 2: compute the inverse document frequency (IDF)
IDF = log(total number of documents in the corpus / number of documents containing the word)
Step 3: compute TF-IDF
TF-IDF = TF * IDF
Implementation code is on git: https://github.com/frostjsy/my_study/blob/master/nlp/feature_extract/tf_idf.py
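The three steps above can be sketched directly in plain Python, using the formulas exactly as written (no smoothing). The whitespace tokenizer and the tf_idf helper name are illustrative assumptions, not taken from the linked implementation:

```python
import math

def tf_idf(docs):
    """TF-IDF per the plain formulas: TF = count/len(doc), IDF = log(N/df)."""
    tokenized = [d.lower().rstrip('.').split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted(set(w for d in tokenized for w in d))
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for d in tokenized if w in d) for w in vocab}
    scores = []
    for d in tokenized:
        scores.append([
            (d.count(w) / len(d)) * math.log(n_docs / df[w])
            for w in vocab
        ])
    return vocab, scores

vocab, scores = tf_idf(['practice makes perfect perfect.', 'nobody is perfect.'])
# 'perfect' appears in both documents, so IDF = log(2/2) = 0 and its score is 0 everywhere.
print(scores[0][vocab.index('perfect')])   # 0.0
# 'practice' appears in one of two documents: TF = 1/4, IDF = log(2).
print(scores[0][vocab.index('practice')])
```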
TfidfVectorizer is provided by sklearn (from sklearn.feature_extraction.text import TfidfVectorizer):
'''declare the TfidfVectorizer model; usage is similar to CountVectorizer'''
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_list)
x = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(tfidf.vocabulary_)
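Note that TfidfVectorizer's scores will not exactly match the plain formula in Step 2: by default sklearn uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. A self-contained check of the smoothed IDF on the same corpus and stop list:

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

stop_list = 'is a the of'.split()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_list)
tfidf.fit(corpus)

# sklearn's default (smooth_idf=True): idf = ln((1 + n_docs) / (1 + df)) + 1
n_docs = len(corpus)
idx = tfidf.vocabulary_['this']                  # 'this' appears in 3 of the 4 documents
expected = math.log((1 + n_docs) / (1 + 3)) + 1
print(tfidf.idf_[idx], expected)                 # the two values agree
```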