Using TF-IDF
Our goal is either to extract the keywords of an article or, given a set of keywords, to find the articles in a corpus that best match them. The two tasks boil down to essentially the same problem. Here we solve it with a method that is simple yet effective: TF-IDF. This post takes the second formulation: given a set of keywords, find the articles in the corpus most relevant to them.
TF, Term Frequency, is the number of times a word occurs in a document. The larger the TF value, the more often the word appears in that document. But raw counts alone are clearly not enough to decide whether a word is a keyword. For example [1], function words such as 的 ("of") and 是 ("is") occur in huge numbers in Chinese articles, yet they are not the keywords we want; such words are called stop words. To deal with them, IDF is introduced.
IDF, Inverse Document Frequency, measures how discriminative a word is: the larger a word's IDF, the more informative the word. This post skips the derivation of the formulas; readers who want it can refer to the cited article by 阮一峯 (Ruan Yifeng).
The main goal of this post is to implement a demo.
With TF (a count) and IDF (a weight) in hand, multiplying the two gives a reasonably sound measure of a word's importance: TF-IDF.
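For reference, the exact variant the code below computes is the following, where N is the number of documents, |d| is the length of document d, and df(w) is the number of documents containing word w:

    tf(w, d)     = count(w, d) / |d|
    idf(w)       = log( N / (df(w) + 1) )
    tf-idf(w, d) = tf(w, d) * idf(w)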
import numpy as np
import math

file_dir = 'input/tf_idf_data.txt'  # the data file is given at the end of the post
docid2content = {}  # int -> list of words
word2id = {}        # str -> int
id2word = {}        # int -> str
word_id = 0
with open(file_dir, 'r') as f:
    doc_id = 0
    for line in f.readlines():
        seg = line.strip('\n').split(' ')
        docid2content[doc_id] = seg
        doc_id += 1
        for word in seg:
            # build the vocabulary
            if word not in word2id:
                word2id[word] = word_id
                id2word[word_id] = word
                word_id += 1
n_doc = len(docid2content)
n_word = len(word2id)
print('Number of documents = %d' % n_doc)
print('Vocabulary size = %d' % n_word)
Number of documents = 148
Vocabulary size = 20035
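Note that the loading code assumes the corpus is already segmented: one document per line, with words separated by spaces. If you start from raw Chinese text, you need a segmentation step first. A minimal sketch using the third-party jieba library (an assumption; the preprocessing behind tf_idf_data.txt is not shown in this post):

import jieba  # pip install jieba

def segment_line(line):
    # Split a raw Chinese sentence into a list of words.
    return jieba.lcut(line.strip('\n'))

print(segment_line('我們的目標是提取文章的關鍵詞'))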
# V = vocabulary size, M = number of documents
# Count term frequencies (TF)
word_tf_VM = np.zeros(shape=[n_word, n_doc])
for doc_id in range(n_doc):
    for word in docid2content[doc_id]:
        word_tf_VM[word2id[word]][doc_id] += 1.0 / len(docid2content[doc_id])  # normalize by document length
print('==========> Term frequency preview')
for i in range(5):
    print(word_tf_VM[i])
==========> Term frequency preview
[ 0.01611279  0.          0.          ...  0.          0.          0.00239808]
(four more rows omitted; each printed row is the mostly-zero, length-148 TF vector of one vocabulary word)
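The nested loop fills a dense (20035, 148) matrix, which is fine at this scale. An equivalent, slightly tidier way to build the same matrix is to count each document once with collections.Counter; a sketch reusing the variables defined above:

from collections import Counter

word_tf_VM = np.zeros(shape=[n_word, n_doc])
for doc_id, words in docid2content.items():
    # One pass per document: count occurrences, then normalize by length.
    for word, c in Counter(words).items():
        word_tf_VM[word2id[word]][doc_id] = c / len(words)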
# Inverse Document Frequency (IDF)
word_idf_V = np.zeros(shape=[n_word])
for i in range(n_word):
    word = id2word[i]
    for doc_id in range(n_doc):
        if word in docid2content[doc_id]:
            word_idf_V[i] += 1  # document frequency df(w)
for i in range(n_word):
    word_idf_V[i] = math.log(n_doc / (word_idf_V[i] + 1))
print('==========> IDF preview')
for i in range(5):
    print(word_idf_V[i])
==========> IDF preview
1.73911573574
2.22462355152
2.16399892971
3.8985999851
3.61091791264
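Two remarks on the IDF loop above. First, the +1 in the denominator is a smoothing term; the classic definition is log(N / df(w)), and with the +1 a word that appears in every document actually gets a slightly negative IDF. Second, testing "word in docid2content[doc_id]" scans a Python list, which gets slow on a large corpus. A vectorized equivalent, as a sketch built on the variables already defined:

# Same df counts and IDF values as the loops above, but with set
# membership (O(1) lookups) and a single vectorized log.
doc_sets = [set(words) for words in docid2content.values()]
df = np.array([sum(1 for s in doc_sets if id2word[i] in s) for i in range(n_word)])
word_idf_V = np.log(n_doc / (df + 1.0))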
# Given an input keyword, use the TF-IDF values to return the IDs of the top-3 most relevant documents
input_word = [2, 5, 10, 34, 100]
for word_id in input_word:
    word = id2word[word_id]
    tf_idf = list()  # elements are (doc_id, tf_idf)
    for doc_id in range(n_doc):
        tf_idf.append((doc_id, word_tf_VM[word_id][doc_id] * word_idf_V[word_id]))
    sort_tf_idf = sorted(tf_idf, key=lambda x: x[1], reverse=True)
    print(word, '==>', sort_tf_idf[0], sort_tf_idf[1], sort_tf_idf[2])
你們好 ==> (106, 0.0058250307663738872) (30, 0.0045130321787443146) (52, 0.0034679470027370175)
二週目 ==> (0, 0.062589924690941171) (20, 0.0063331879234507513) (101, 0.0056082711158862925)
微信 ==> (127, 0.0083196841455246261) (126, 0.0069068013846811339) (109, 0.006832832962890706)
彈幕 ==> (23, 0.0017124917032905211) (0, 0.0012503086454034534) (3, 0.0010561444578422166)
快樂 ==> (125, 0.009259887219141132) (121, 0.0070725122854857457) (88, 0.0037244328690671309)
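The loop above ranks documents for one keyword at a time. For the goal stated at the top, a set of keywords, the per-word scores can simply be summed per document. A minimal sketch reusing the matrices above (the query list is a made-up example; any words in the vocabulary work):

# Score every document against a multi-word query by summing the
# TF-IDF contributions of each query word, then take the top 3.
query = ['微信', '快樂']
scores = np.zeros(n_doc)
for w in query:
    wid = word2id[w]
    scores += word_tf_VM[wid] * word_idf_V[wid]
top3 = np.argsort(scores)[::-1][:3]
print('query ==>', [(int(d), float(scores[d])) for d in top3])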