Computing Text Similarity with TF-IDF in Python

Original post

2020-06-25 23:12

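This post mainly follows the answers to the StackOverflow question at https://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarit
The main tool is sklearn's TfidfTransformer (the answers below use the closely related TfidfVectorizer).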

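cosine_similarity computes the dot product of L2-normalized vectors. If x and y are row vectors, their cosine similarity k is:

$$k(x, y) = \frac{x y^\top}{\|x\| \, \|y\|}$$

linear_kernel is a special case of the polynomial kernel (degree 1, no constant term). If x and y are column vectors, their linear kernel is:

$$k(x, y) = x^\top y$$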
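As a quick check of these two definitions, the sketch below compares a hand-computed cosine similarity with sklearn's cosine_similarity, and with linear_kernel applied to rows that were L2-normalized first. The vectors are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

# Two small row vectors (made-up values, for illustration only).
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

# Cosine similarity by hand: dot product divided by the product of the norms.
x, y = X[0], X[1]
manual = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# cosine_similarity normalizes the rows internally.
cos = cosine_similarity(X)

# linear_kernel is a plain dot product, so normalize the rows first.
X_unit = normalize(X, norm="l2")
lin = linear_kernel(X_unit)

print(manual, cos[0, 1], lin[0, 1])
```

All three printed values should agree, which is the reason a linear kernel can stand in for cosine similarity once the rows have unit length.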

fit, fit_transform, and transform

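fit is the fitting step used in training: it learns a single, consistent transformation rule (the model) from the data.
transform applies that learned rule to data. For example, test data is transformed with the same model that was fitted on the training data, producing feature vectors.
fit_transform combines the two: fit to the data, then transform it. If fit_transform was used in the training phase, only transform is needed in the test phase.

In other words, call fit_transform(train_data) during training
and transform(test_data) during testing.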
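The train/test split of fit_transform and transform might look like the following minimal sketch. The two toy corpora and the variable names are assumptions made up for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, made up for this illustration.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the dog sat on the log"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training data, then transform it.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary and IDF weights; nothing is re-fitted here.
X_test = vectorizer.transform(test_docs)

# Both matrices share the same columns because they share one fitted model.
print(X_train.shape, X_test.shape)

# A word seen only in the test data is not in the vocabulary and is ignored.
print("log" in vectorizer.vocabulary_)
```

Calling fit_transform on the test data instead would build a different vocabulary, so the train and test vectors would no longer be comparable.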

Answer 1

If you want to extract count features, apply TF-IDF normalization, and apply row-wise Euclidean normalization, a single operation does all three:
TfidfVectorizer

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()

>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 1787553 stored elements in Compressed Sparse Row format>
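To make the "single operation" claim concrete, here is a small sketch comparing TfidfVectorizer with the two-step route of CountVectorizer followed by TfidfTransformer, both with default settings. The three-document corpus is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus, made up for this illustration.
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

# Two steps: raw term counts, then TF-IDF weighting with L2 row normalization.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer combines both.
one_step = TfidfVectorizer().fit_transform(docs)

# Both routes give the same matrix when default settings are used on both sides.
print(np.allclose(two_step.toarray(), one_step.toarray()))

# Each row has unit Euclidean length because the default is norm="l2".
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(row_norms)
```

The unit row norms are what the rest of this answer relies on when it replaces cosine similarity with a plain dot product.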

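To find the similarity between one document (for example, the first one in the dataset) and all the others, just compute the dot product of the first vector with every other vector, because the tfidf vectors are already row-normalized.
Cosine similarity ignores the magnitude (norm) of the vectors. Row-normalized vectors have a norm of 1, so the linear kernel is enough to compute the similarity values.
The scipy sparse matrix API gives the first vector by slicing: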

>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 89 stored elements in Compressed Sparse Row format>

scikit-learn already provides pairwise metrics that work on both dense and sparse matrix representations. Here we need the dot product, also known as the linear kernel:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

A side note: linear_kernel takes inputs of shape (N, T) and (M, T) and returns a matrix of shape (N, M).
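The shape rule can be checked with a short sketch. The sizes N, M, T and the random inputs are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

rng = np.random.default_rng(0)
N, M, T = 2, 5, 4          # arbitrary sizes for illustration
A = rng.random((N, T))     # N vectors with T features each
B = rng.random((M, T))     # M vectors with T features each

# K[i, j] is the dot product of A[i] and B[j].
K = linear_kernel(A, B)
print(K.shape)

# Equivalent to a matrix product with the second argument transposed.
print(np.allclose(K, A @ B.T))
```

With a single query row (N = 1) the result has shape (1, M), which is why the snippet above calls .flatten() to get a flat array of M scores.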

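So, to find the most related documents, slice the result of argsort. Note that the slice [:-5:-1] used here returns the top 4 entries, not 5; use [:-6:-1] to get five: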

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])

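The first result is a sanity check: it is the query document itself, with a similarity of 1.
In this example linear_kernel is equivalent to cosine similarity, because sklearn.feature_extraction.text.TfidfVectorizer already produces normalized vectors, so cosine_similarity and linear_kernel return the same values.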
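The argsort slicing used for the ranking is easy to get off by one, so here is a minimal sketch of a top-k selection on a small array. The scores and the value of k are made up for illustration.

```python
import numpy as np

# Made-up similarity scores for six documents; index 0 is the query itself.
scores = np.array([1.0, 0.10, 0.55, 0.05, 0.33, 0.28])
k = 3

# argsort sorts ascending, so reverse it and keep the first k entries.
top_k = scores.argsort()[::-1][:k]
print(top_k)
print(scores[top_k])

# A slice of the form [:-5:-1] keeps only four entries, not five.
print(len(scores.argsort()[:-5:-1]))
```

Writing the reversal and the cut as two separate slices, `[::-1][:k]`, keeps the number of results equal to k without any negative-index arithmetic.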

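Answer 2

This one computes the result by hand,
looping over the test_data and train_data feature vectors and computing the cosine similarity between them.
First, a simple lambda function expresses the cosine computation: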

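import numpy as np
import numpy.linalg as LA

# cosine similarity of two 1-D vectors, rounded to 3 decimals
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

After that, a for loop is all that is needed:

# trainVectorizerArray and testVectorizerArray are assumed to be dense 2-D arrays of document vectors
for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)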
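For a version of the manual loop that runs on its own, the sketch below builds the dense arrays from two tiny made-up corpora. The names train_vectors, test_vectors, and cosine are illustrative assumptions, not names from the original answer.

```python
import numpy as np
from numpy import linalg as LA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, made up for this illustration.
train_docs = ["the sky is blue", "the sun is bright"]
test_docs = ["the sun in the sky is bright"]

# Dense arrays are needed here because the loop works on plain 1-D vectors.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_docs).toarray()
test_vectors = vectorizer.transform(test_docs).toarray()

def cosine(a, b):
    """Cosine similarity of two 1-D vectors, rounded to 3 decimals."""
    return round(float(np.inner(a, b) / (LA.norm(a) * LA.norm(b))), 3)

# Compare every training vector with every test vector.
results = []
for train_vec in train_vectors:
    for test_vec in test_vectors:
        results.append(cosine(train_vec, test_vec))
print(results)
```

This sketch does not guard against all-zero vectors, which would cause a division by zero; a document sharing no words with the training vocabulary would need a separate check. For large corpora the nested loop over dense arrays is also much slower than the pairwise functions used in the other answers.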

Answer 3

Same as Answer 1, but calling cosine_similarity directly:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
# tfidf_matrix_train is assumed to be the TF-IDF matrix of the training documents
# compare the first row of tfidf_matrix_train with every row
print("cosine scores ==> ", cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))
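One reason to prefer cosine_similarity over linear_kernel is that it stays correct when the rows are not unit length. The sketch below illustrates this with raw counts from CountVectorizer on a made-up three-document corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy data: the second document is the first one repeated twice.
docs = ["apple banana", "apple banana apple banana", "cherry"]

# Raw counts: the rows are NOT unit length.
counts = CountVectorizer().fit_transform(docs)

# Compare the first document with every document.
cos = cosine_similarity(counts[0:1], counts).flatten()
dot = linear_kernel(counts[0:1], counts).flatten()

print(cos)  # direction only: the repeated document scores the same as the query
print(dot)  # plain dot products grow with document length
```

On unnormalized vectors the two functions disagree, so linear_kernel is a safe shortcut only when the vectorizer has already applied L2 normalization, as TfidfVectorizer does by default.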
