1. Introduction to TF-IDF
1.1 The TF-IDF Concept
TF-IDF (term frequency-inverse document frequency) is a weighting technique widely used in information retrieval and data mining. It evaluates how important a word is to one document within a collection or corpus. A word's importance grows in proportion to the number of times it appears in the document, but is offset by how frequently it appears across the whole corpus.
The core idea: if a word or phrase appears frequently in one article (a high TF) but rarely in other articles, it is considered to have strong discriminating power between categories and is well suited for classification; such words can serve as the keywords mentioned above. (The preceding summary is adapted from Baidu Baike.)
TF (term frequency) is the frequency with which a term appears in a document; the larger this value, the more important the term is within that document. It is computed as:
TF = (number of occurrences of the term in the document) / (total number of terms in the document)
IDF (inverse document frequency): within a document collection, the fewer documents a term appears in, the better that term distinguishes its documents from the rest. It is computed as:
IDF = log((total number of documents / number of documents containing the term) + 0.01)
TF-IDF is simply the product of these two values: TF-IDF = TF * IDF.
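As a quick check of the formulas above, here is a minimal worked example; the three-document toy corpus is invented purely for illustration:

```python
import math

def tf(term, doc_tokens):
    # TF = (count of term in the document) / (total tokens in the document)
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # IDF = log(total docs / docs containing the term + 0.01), per the formula above
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df + 0.01)

corpus = [["oil", "massage", "oil"], ["car", "oil"], ["love", "you"]]
doc = corpus[0]
score = tf("massage", doc) * idf("massage", corpus)
print(round(score, 3))  # → 0.367
```

Here "massage" occurs once among the first document's three tokens (TF = 1/3) and appears in one of three documents (IDF = log(3/1 + 0.01) ≈ 1.102), giving TF-IDF ≈ 0.367.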
1.2 Applications of TF-IDF
TF-IDF is mainly used to measure the similarity of two articles or short texts. First, tokenize both texts and merge all the terms into a single vocabulary set; then count each text's term frequencies over that set, producing one term-frequency vector per text; finally, compute the cosine similarity of the two vectors (Euclidean distance can also be used). The larger the cosine similarity, the more similar the texts.
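The procedure above can be sketched as follows; the two term-frequency vectors are made up for illustration, standing in for two texts counted over a merged four-word vocabulary:

```python
import math

def cosine_similarity(v1, v2):
    # cos(theta) = (v1 . v2) / (|v1| * |v2|); closer to 1 means more similar
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Hypothetical term-frequency vectors over the merged vocabulary of two short texts
vec_a = [1, 1, 0, 2]
vec_b = [1, 0, 1, 1]
print(round(cosine_similarity(vec_a, vec_b), 3))  # → 0.707
```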
1.3 Strengths and Weaknesses of TF-IDF
TF-IDF is very easy to understand and to implement, but its simple structure does not truly reflect each word's importance: experience tells us that words at the beginning and end of a document usually express its main theme, yet TF-IDF ignores where a term appears and how it is distributed within the document.
2. Implementing TF-IDF
2.1 With Scikit-Learn
# -*- coding:utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer


def get_tfidf(corpus):
    # Keep lowercase word tokens of at least three letters, drop English stop
    # words, and ignore terms that appear in more than 85% of documents
    vectorizer = TfidfVectorizer(encoding='utf-8', lowercase=True, stop_words='english',
                                 token_pattern='(?u)[A-Za-z][A-Za-z]+[A-Za-z]', ngram_range=(1, 1),
                                 analyzer='word', max_df=0.85, min_df=1, max_features=150)
    vector = vectorizer.fit_transform(corpus).toarray()
    print(vector)


if __name__ == '__main__':
    corpus = ['Where I can buy good oil for massage?',
              'I advise you to use car oil any type forever.',
              'I`m searching oil for body massage.',
              'I love you!']
    get_tfidf(corpus)
Output:
[[ 0.     0.     0.574  0.     0.     0.574  0.     0.453  0.366  0.     0.     0.   ]
 [ 0.430  0.     0.     0.430  0.430  0.     0.     0.     0.274  0.     0.430  0.430]
 [ 0.     0.574  0.     0.     0.     0.     0.     0.453  0.366  0.574  0.     0.   ]
 [ 0.     0.     0.     0.     0.     0.     1.     0.     0.     0.     0.     0.   ]]
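Tying this back to the similarity application in section 1.2, the TF-IDF rows produced by the vectorizer can be compared directly with cosine similarity. A minimal sketch, assuming scikit-learn is installed; `cosine_similarity` comes from `sklearn.metrics.pairwise`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['Where I can buy good oil for massage?',
          'I advise you to use car oil any type forever.',
          'I`m searching oil for body massage.',
          'I love you!']

tfidf = TfidfVectorizer(stop_words='english').fit_transform(corpus)
# Compare document 0 and document 2, which share the terms "oil" and "massage"
sim = cosine_similarity(tfidf[0], tfidf[2])
print(round(float(sim[0, 0]), 3))
```

The result lies strictly between 0 and 1: the two documents overlap in some terms but not all.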
2.2 Implementing It by Hand
# -*- coding:utf-8 -*-
import math
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer


def nltk_process(corpus):
    # Tokenize each line, drop English stop words, then lemmatize and stem
    tokenizer = WordPunctTokenizer()
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    tokens = []
    for line in corpus:
        words = tokenizer.tokenize(line)
        no_stopwords = [w.lower() for w in words if w.lower() not in stopwords.words('english')]
        lemmas = [lemmatizer.lemmatize(w) for w in no_stopwords]
        stems = [stemmer.stem(w) for w in lemmas]
        tokens.append(stems)
    return tokens


def get_doc_count(word, corpus):
    # Number of documents that contain the word
    return sum(1 for text in corpus if word in text)


def get_tf(word, text):
    return text.count(word) / len(text)


def get_idf(word, corpus):
    # Smoothed IDF: add 1 to the document count to avoid division by zero
    return math.log(len(corpus) / (1 + get_doc_count(word, corpus)))


def get_tfidf(corpus):
    corpus = nltk_process(corpus)
    for text in corpus:
        print("Top TF-IDF words:")
        scores = {word: get_tf(word, text) * get_idf(word, corpus) for word in text}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words[:2]:
            print("\tWord: {}, TF-IDF: {}".format(word, round(score, 3)))


if __name__ == '__main__':
    corpus = ['Where I can buy good oil for massage?',
              'I advise you to use car oil any type forever.',
              'I`m searching oil for body massage.',
              'I love you!']
    get_tfidf(corpus)
Output:
Top TF-IDF words:
	Word: good, TF-IDF: 0.139
	Word: buy, TF-IDF: 0.139
Top TF-IDF words:
	Word: use, TF-IDF: 0.099
	Word: car, TF-IDF: 0.099
Top TF-IDF words:
	Word: `, TF-IDF: 0.116
	Word: search, TF-IDF: 0.116
Top TF-IDF words:
	Word: !, TF-IDF: 0.347
	Word: love, TF-IDF: 0.347