TF-IDF理论和实践

TF-IDF是一种用于资讯检索与资讯探勘的常用加权技术。TF-IDF是一种统计方法,用以评估一字词对于一个文件集或一个语料库中的其中一份文件的重要程度。字词的重要性随著它在文件中出现的次数成正比增加,但同时会随著它在语料库中出现的频率成反比下降。TF-IDF加权的各种形式常被搜寻引擎应用,作为文件与用户查询之间相关程度的度量或评级。除了TF-IDF以外,因特网上的搜寻引擎还会使用基于连结分析的评级方法,以确定文件在搜寻结果中出现的顺序。
 

在文本挖掘中,要对文本库分词,而分词后需要对个每个分词计算它的权重,而这个权重可以使用TF-IDF计算。

 

  • 术语介绍:

 

  • TF(term frequency) 分词频率:

TF就是分词出现的频率:该分词在该文档中出现的频率,算法是:(该分词在该文档出现的次数)/ (该文档分词的总数),这个值越大表示这个词越重要,即权重就越大。

 

例如:一篇文档分词后,总共有1000个分词,而分词”You”出现的次数是50次,则TF值是: tf =50/1000=0.02

 

  • IDF(inverse document frequency)逆向文件频率:

IDF逆向文件频率,一个文档库中,一个分词出现在的文档数越少越能和其它文档区别开来。算法是:log((总文档数/出现该分词的文档数)+0.01) ;(注加上0.01是为了防止log计算返回值为0)。

 

例如:一个文档库中总共有50篇文档,2篇文档中出现过“You”分词,则idf是: Idf = log(1000/50 + 0.01) = log(20.01)=1.3012470886362 TF-IDF结合计算就是 tfidf,比如上面的“You”分词例子中: TF-IDF = tf idf = (50/1000)* log(1000/50 + 0.01)= 0.05*1.3012470886362=0.06506235443181

 

  • TF-IDF的基本思想

词语的重要性与它在文件中出现的次数成正比,但同时会随着它在语料库中出现的频率成反比下降。 但无论如何,统计每个单词在文档中出现的次数是必要的操作。所以说,TF-IDF也是一种基于 bag-of-word 的方法。

 

  • TF-IDF的算法原理

预处理过程中,我们已经把停词都过滤掉了。如果只考虑剩下的有实际意义的词,前我们已经讲过,显然词频(TF,Term Frequency)较高的词之于一篇文章来说可能是更为重要的词(也就是潜在的关键词)。但这样又会遇到了另一个问题,我们可能发现在上面例子中,madefortv、california、includ 都出现了2次(madefortv其实是原文中的made-for-TV,因为我们所选分词法的缘故,它被当做是一个词来看待),但这显然并不意味着“作为关键词,它们的重要性是等同的”。

因为”includ”是很常见的词(注意 includ 是 include 的词干)。相比之下,california 可能并不那么常见。如果这两个词在一篇文章的出现次数一样多,我们有理由认为,california 重要程度要大于 include ,也就是说,在关键词排序上面,california 应该排在 include 的前面。

于是,我们需要一个重要性权值调整参数,来衡量一个词是不是常见词。如果某个词比较少见,但是它在某篇文章中多次出现,那么它很可能就反映了这篇文章的特性,它就更有可能揭示这篇文字的话题所在。这个权重调整参数就是“逆文档频率”(IDF,Inverse Document Frequency),它的大小与一个词的常见程度成反比。

知道了 TF 和 IDF 以后,将这两个值相乘,就得到了一个词的TF-IDF值。某个词对文章的重要性越高,它的TF-IDF值就越大。

 

 

  • 代码实现如下:

import nltk
import math
import string
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem.porter import *
from sklearn.feature_extraction.text import TfidfVectorizer

text1 = "Python is a 2000 made-for-TV horror movie directed by Richard \
Clabaugh. The film features several cult favorite actors, including William \
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy, \
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the \
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean \
Whalen. The film concerns a genetically engineered snake, a python, that \
escapes and unleashes itself on a small town. It includes the classic final\
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles, \
 California and Malibu, California. Python was followed by two sequels: Python \
 II (2002) and Boa vs. Python (2004), both also made-for-TV films."

text2 = "Python, from the Greek word (πύθων/πύθωνας), is a genus of \
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are \
recognised.[2] A member of this genus, P. reticulatus, is among the longest \
snakes known."

text3 = "The Colt Python is a .357 Magnum caliber revolver formerly \
manufactured by Colt's Manufacturing Company of Hartford, Connecticut. \
It is sometimes referred to as a \"Combat Magnum\".[1] It was first introduced \
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued \
Colt Python targeted the premium revolver market segment. Some firearm \
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy \
Thompson, Renee Smeets and Martin Dougherty have described the Python as the \
finest production revolver ever made."

def get_tokens(text):
    lowers = text.lower()
    #remove the punctuation using the character deletion step of translate
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    no_punctuation = lowers.translate(remove_punctuation_map)
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens

#剔除其中的标点符号

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed
#去除“词袋”中的“停词”

tokens = get_tokens(text1)
filtered = [w for w in tokens if not w in stopwords.words('english')]
stemmer = PorterStemmer()
stemmed = stem_tokens(filtered, stemmer)

#用Stemming 方法对同类词归一统计

tokens = get_tokens(text1)
count = Counter(tokens)
#用于统计每个单词出现个数
print (count.most_common(10))

#测试上述分词结果

 

 

 

 

 

参考:

  1. https://blog.csdn.net/roger_royer/article/details/78790050 
  2. https://blog.csdn.net/baimafujinji/article/details/ 51476117
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章