[Python] Vector Space Model: A TF-IDF Worked Example (set.union())

Original

Vivid-victory

2020-06-13 01:44

I. Some Theory

Vector space model (VSM)

TF-IDF (term frequency–inverse document frequency)
TF is the term frequency and IDF is the inverse document frequency.
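As a rough sketch of how the two factors combine, the weight of a term in a document is commonly taken as its term frequency multiplied by its IDF. The function name `tfidf_weight` and its parameters are illustrative assumptions, and the base-10 logarithm is chosen to match the worked example in this post; other TF-IDF variants use different logarithms or smoothing.

```python
import math

def tfidf_weight(tf, df, d):
    """Return tf * log10(d / df).

    tf: number of times the term occurs in the document
    df: number of documents that contain the term
    d:  total number of documents in the collection
    """
    if df <= 0 or d <= 0:
        raise ValueError("df and d must be positive")
    return tf * math.log10(d / df)

# A term that occurs twice in a document and appears in 1 of 3 documents
print(round(tfidf_weight(2, 1, 3), 3))  # 0.954
```

A term that appears in every document gets a weight of 0 under this formula, no matter how often it occurs.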

For the rest of the theory, look up these keywords and explore on your own.

II. A TF-IDF Worked Example

1. Problem
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
Using TF-IDF vectorization, find how similar the query Q is to each of the documents D1, D2, and D3.

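2. Analysis
This collection contains three documents, so d = 3. Here lg is the base-10 logarithm and dfi is the number of documents that contain term i.
Terms found in one document (damaged, fire, Delivery, silver): lg(d/dfi) = lg(3/1) = 0.477
Terms found in two documents (Shipment, gold, arrived, truck): lg(d/dfi) = lg(3/2) = 0.176
Terms found in all three documents (of, in, a): lg(d/dfi) = lg(3/3) = 0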
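The three values above can be checked with a few lines of Python. This is a minimal sketch, separate from the code in the next section: the names `docs`, `df`, and `idf` are chosen here for illustration, and splitting on whitespace is assumed to be enough tokenization for this small example.

```python
import math
from collections import Counter

docs = [
    "Shipment of gold damaged in a fire",
    "Delivery of silver arrived in a silver truck",
    "Shipment of gold arrived in a truck",
]
d = len(docs)  # total number of documents

# Document frequency: count each term once per document it appears in
df = Counter()
for doc in docs:
    df.update(set(doc.split()))

# Inverse document frequency with a base-10 logarithm
idf = {term: math.log10(d / n) for term, n in df.items()}

# Print terms grouped by document frequency
for term in sorted(idf, key=lambda t: (df[t], t)):
    print(f"{term:10s} df={df[term]} idf={idf[term]:.3f}")
```

Using `set(doc.split())` matters for D2, where "silver" occurs twice but should count only once toward its document frequency.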

3. Code
Here is the complete code:

import pandas as pd
import math

# 1. Define the documents, tokenize them, and merge the tokens into a de-duplicated vocabulary
D1 = 'Shipment of gold damaged in a fire'
D2 = 'Delivery of silver arrived in a silver truck'
D3 = 'Shipment of gold arrived in a truck'
split1 = D1.split(' ')
split2 = D2.split(' ')
split3 = D3.split(' ')
wordSet = set(split1).union(split2, split3)  # build the vocabulary; the set removes duplicates

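# 2. Count how many times term tj occurs in document Di, i.e. the term frequency.
def computeTF(wordSet, split):
    tf = dict.fromkeys(wordSet, 0)
    for word in split:
        if word in tf:  # skip words outside the vocabulary (e.g. unseen query terms)
            tf[word] += 1
    return tf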
tf1 = computeTF(wordSet,split1)
tf2 = computeTF(wordSet,split2)
tf3 = computeTF(wordSet,split3)
print('tf1:\n',tf1)

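# 3. Compute the inverse document frequency (IDF)
def computeIDF(tfList):
    idfDict = dict.fromkeys(tfList[0], 0)  # keys are words, initial value is 0
    N = len(tfList)  # total number of documents
    for tf in tfList:  # iterate over the TF dict of each document
        for word, count in tf.items():  # iterate over every word in the vocabulary
            if count > 0:  # the word occurs in the current document
                idfDict[word] += 1  # increment df, the number of documents containing the term
    for word, Ni in idfDict.items():  # replace each df with the corresponding idf
        idfDict[word] = math.log10(N / Ni)  # N and Ni are never 0: every vocabulary word comes from some document
    return idfDict  # dict mapping each word to its IDF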
idfs = computeIDF([tf1, tf2, tf3])
print('idfs:\n',idfs)

# 4. Compute TF-IDF (term frequency–inverse document frequency)
def computeTFIDF(tf, idfs):  # tf: term frequencies, idfs: inverse document frequencies
    tfidf = {}
    for word, tfval in tf.items():
        tfidf[word] = tfval * idfs[word]
    return tfidf
tfidf1 = computeTFIDF(tf1, idfs)
tfidf2 = computeTFIDF(tf2, idfs)
tfidf3 = computeTFIDF(tf3, idfs)
tfidf = pd.DataFrame([tfidf1, tfidf2, tfidf3])
print(tfidf)

# 5. Find the documents most similar to the query Q
q = 'gold silver truck'  # the query Q
split_q = q.split(' ')  # tokenize
tf_q = computeTF(wordSet, split_q)  # term frequencies of Q
tfidf_q = computeTFIDF(tf_q, idfs)  # TF-IDF of Q (builds the query vector)
ans = pd.DataFrame([tfidf1, tfidf2, tfidf3, tfidf_q])
print(ans)

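# 6. Compute the similarity between Q and each document Di (defined here simply as the inner product of the two vectors)
print('Similarity SC(Q, D1) between Q and D1:', (ans.loc[0,:]*ans.loc[3,:]).sum())
print('Similarity SC(Q, D2) between Q and D2:', (ans.loc[1,:]*ans.loc[3,:]).sum())
print('Similarity SC(Q, D3) between Q and D3:', (ans.loc[2,:]*ans.loc[3,:]).sum())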
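
For reference, the same pipeline can be written more compactly with `collections.Counter`, which also makes it easy to rank the documents by score. This is a sketch under the same assumptions as the code above (whitespace tokenization, base-10 IDF, inner product as the similarity). The helper names `tfidf_vector` and `inner_product` are introduced here for illustration and are not part of the original code.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"

tokenized = {name: text.split() for name, text in docs.items()}
d = len(tokenized)

# Document frequency, then base-10 IDF for every vocabulary term
df = Counter()
for tokens in tokenized.values():
    df.update(set(tokens))
idf = {term: math.log10(d / n) for term, n in df.items()}

def tfidf_vector(tokens):
    """Sparse TF-IDF vector as a dict; terms outside the vocabulary are skipped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)

q_vec = tfidf_vector(query.split())
scores = {
    name: inner_product(q_vec, tfidf_vector(tokens))
    for name, tokens in tokenized.items()
}

# Rank documents from most to least similar
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"SC(Q, {name}) = {score:.3f}")
```

With these inputs the ranking is D2, D3, D1, with scores of roughly 0.486, 0.062, and 0.031. D2 wins because "silver" is both rare in the collection and repeated in that document.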