TF-IDF: Extracting Article Keywords and Searching Articles

Using TF-IDF

Our goal is either to extract the keywords of an article, or, given a set of keywords, to find the articles in a corpus that best match them. The two goals boil down to essentially the same problem, and here we solve it with a simple but very effective method: TF-IDF. In this post we take the second formulation: given a set of keywords, find the articles in the corpus most relevant to them.

TF (Term Frequency) is the number of times a term occurs in an article. The larger the TF value, the more often the term appears in that article. But raw frequency alone is clearly not enough to decide whether a term is a keyword. For example [1], words such as 的 ("of") and 是 ("is") occur very frequently in Chinese text, yet they are not the keywords we want; such words are called stop words. IDF was introduced to address exactly this problem.

IDF (Inverse Document Frequency) measures how discriminative a term is: the higher a term's IDF value, the more informative the term. This post skips the formulas; readers who want them can refer to Ruan Yifeng's article cited below.

The main goal of this post is to implement a working demo.

With TF (frequency) and IDF (weight) in hand, multiplying the two gives a reasonably good measure of a term's importance: TF-IDF.
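Before walking through the full demo, the idea can be sketched on a tiny hypothetical corpus (the three token lists below are made up for illustration):

```python
import math

# Toy corpus: three "documents", each a list of tokens (hypothetical data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "fast"],
]

def tf(word, doc):
    # Term frequency, normalized by document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency, with +1 smoothing in the denominator
    # (the same smoothing the demo below uses).
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (df + 1))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "the" appears in every document, so its IDF drags its score down;
# "fast" appears in only one document, so it scores higher there.
print(tf_idf("the", docs[0], docs))
print(tf_idf("fast", docs[2], docs))
```

Note how the ubiquitous "the" ends up with a lower TF-IDF than the rarer "fast", which is exactly the stop-word suppression we wanted.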

import numpy as np
import math

file_dir = 'input/tf_idf_data.txt'  # data set linked at the end of the post
docid2content = {}  # doc_id (int) -> list of tokens
word2id = {}  # word (str) -> id (int)
id2word = {}  # id (int) -> word (str)
word_id = 0
with open(file_dir, 'r') as f:
    doc_id = 0
    for line in f:
        seg = line.strip('\n').split(' ')
        docid2content[doc_id] = seg
        doc_id += 1
        for word in seg:
            # build the vocabulary
            if word not in word2id:
                word2id[word] = word_id
                id2word[word_id] = word
                word_id += 1

n_doc = len(docid2content)
n_word = len(word2id)
print('Document length = %d' % n_doc)
print('Unique word number = %d' % n_word)
Document length = 148
Unique word number = 20035
# V = vocabulary size, M = number of documents

# Term frequency (TF) counts
word_tf_VM = np.zeros(shape=[n_word, n_doc])
for doc_id in range(n_doc):
    for word in docid2content[doc_id]:
        word_tf_VM[word2id[word]][doc_id] += 1.0 / len(docid2content[doc_id])  # normalize by document length

print('==========> Term frequency preview')
for i in range(5):
    print(word_tf_VM[i])
==========> Term frequency preview
[ 0.01611279  0.          0.          ...,  0.          0.          0.00239808]
[ 0.00302115  0.          0.          ...,  0.          0.          0.        ]
[ 0.00050352  0.          0.          ...,  0.          0.          0.        ]
[ 0.00050352  0.          0.          ...,  0.          0.          0.        ]
[ 0.00050352  0.          0.          ...,  0.          0.          0.        ]
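As the preview shows, the TF matrix is mostly zeros. A sparser alternative (a hypothetical refactor, not part of the original demo) is to keep one normalized frequency dict per document built with `collections.Counter`:

```python
from collections import Counter

# Hypothetical alternative to the dense V x M matrix: a per-document
# dict of normalized term frequencies, storing only nonzero entries.
doc = ["apple", "banana", "apple", "cherry"]  # made-up token list
counts = Counter(doc)
tf_map = {w: c / len(doc) for w, c in counts.items()}
print(tf_map["apple"])  # 0.5
```

For a vocabulary of 20,035 words and 148 documents this avoids allocating roughly three million mostly-zero floats.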
# Inverse document frequency (IDF)
word_idf_V = np.zeros(shape=[n_word])
for i in range(n_word):
    word = id2word[i]
    for doc_id in range(n_doc):
        if word in docid2content[doc_id]:
            word_idf_V[i] += 1

for i in range(n_word):
    word_idf_V[i] = math.log(n_doc / (word_idf_V[i] + 1))  # +1 smooths the denominator

print('==========> Inverse document frequency preview')
for i in range(5):
    print(word_idf_V[i])
==========> Inverse document frequency preview
1.73911573574
2.22462355152
2.16399892971
3.8985999851
3.61091791264
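The document-frequency pass above scans each document's full token list for every vocabulary word, which is slow for large corpora. A minimal sketch of a faster variant (hypothetical data, same formula) converts each document to a set once so membership tests are O(1):

```python
import math

# Faster document-frequency pass: one set per document,
# then O(1) membership tests per (word, document) pair.
docs = [["a", "b", "a"], ["b", "c"], ["a"]]  # made-up corpus
doc_sets = [set(d) for d in docs]
vocab = sorted({w for d in docs for w in d})

df = {w: sum(1 for s in doc_sets if w in s) for w in vocab}
idf = {w: math.log(len(docs) / (df[w] + 1)) for w in vocab}
print(df["a"], idf["a"])
```

The result is identical to the nested-loop version; only the membership test changes.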
# Given keyword IDs, use TF-IDF to return the top 3 matching document IDs
input_word = [2, 5, 10, 34, 100]
for word_id in input_word:
    word = id2word[word_id]
    tf_idf = []  # elements: (doc_id, tf_idf)
    for doc_id in range(n_doc):
        tf_idf.append((doc_id, word_tf_VM[word_id][doc_id] * word_idf_V[word_id]))
    sort_tf_idf = sorted(tf_idf, key=lambda x: x[1], reverse=True)
    print(word, '==>', sort_tf_idf[0], sort_tf_idf[1], sort_tf_idf[2])
你們好 ==> (106, 0.0058250307663738872) (30, 0.0045130321787443146) (52, 0.0034679470027370175)
二週目 ==> (0, 0.062589924690941171) (20, 0.0063331879234507513) (101, 0.0056082711158862925)
微信 ==> (127, 0.0083196841455246261) (126, 0.0069068013846811339) (109, 0.006832832962890706)
彈幕 ==> (23, 0.0017124917032905211) (0, 0.0012503086454034534) (3, 0.0010561444578422166)
快樂 ==> (125, 0.009259887219141132) (121, 0.0070725122854857457) (88, 0.0037244328690671309)
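The demo ranks documents for one keyword at a time. To query with a *set* of keywords, as the introduction describes, a natural extension (sketched here on a made-up corpus) is to sum each query word's TF-IDF score per document and rank by the total:

```python
import math

# Sketch of multi-keyword retrieval: sum per-word TF-IDF scores
# for each document, then rank documents by the total.
docs = [
    ["machine", "learning", "model"],
    ["deep", "learning", "network"],
    ["cooking", "recipe", "pasta"],
]  # hypothetical corpus

def score(word, doc):
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    return tf * math.log(len(docs) / (df + 1))

query = ["learning", "network"]
totals = [(i, sum(score(w, d) for w in query)) for i, d in enumerate(docs)]
top = sorted(totals, key=lambda x: x[1], reverse=True)
print(top[0][0])  # index of the best-matching document
```

Summing scores treats the query as a bag of words; document 1 wins here because it is the only one containing the rarer term "network".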

References & Recommendations

  1. Ruan Yifeng's blog post
  2. Test data