



  • Terminology:


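  • TF (term frequency):

TF is the frequency with which a token occurs in a given document. It is computed as (number of times the token occurs in the document) / (total number of tokens in the document). The larger this value, the more important the token is to the document, that is, the greater its weight.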


For example, if a document contains 1000 tokens after tokenization and the token "You" occurs 50 times, then the TF value is: tf = 50/1000 = 0.05
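The TF formula above can be sketched in a few lines of Python. The function name `term_frequency` and the synthetic token list are assumptions made for illustration; they are not part of the original code.

```python
from collections import Counter


def term_frequency(term, tokens):
    """Return (occurrences of term) / (total number of tokens)."""
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return Counter(tokens)[term] / len(tokens)


# Hypothetical document: "you" occurs 50 times among 1000 tokens.
tokens = ["you"] * 50 + ["other"] * 950
print(term_frequency("you", tokens))  # 0.05
```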


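  • IDF (inverse document frequency):

IDF measures how well a token distinguishes one document from the others: within a corpus, the fewer documents a token occurs in, the more distinctive it is. It is computed as log((total number of documents / number of documents containing the token) + 0.01). (Note: the 0.01 is added so that the result is not 0 when a token occurs in every document, since the ratio would then be 1 and log(1) = 0.)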


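For example, if a corpus contains 1000 documents in total and the token "You" occurs in 50 of them, then the IDF, using the base-10 logarithm, is: idf = log(1000/50 + 0.01) = log(20.01) = 1.3012470886362

TF-IDF combines the two values by multiplying them. For the "You" example above: TF-IDF = tf * idf = (50/1000) * log(1000/50 + 0.01) = 0.05 * 1.3012470886362 = 0.06506235443181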
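The arithmetic above can be checked with a short sketch. It assumes a base-10 logarithm, which is the base that reproduces the value 1.3012 in the example; the function name `inverse_document_frequency` is hypothetical, and the function assumes the token occurs in at least one document.

```python
import math


def inverse_document_frequency(total_docs, docs_with_term):
    """IDF as defined above: log10(total_docs / docs_with_term + 0.01).

    Assumes docs_with_term >= 1; a token that occurs in no document
    would cause a division by zero.
    """
    return math.log10(total_docs / docs_with_term + 0.01)


tf = 50 / 1000                               # term frequency of "You"
idf = inverse_document_frequency(1000, 50)   # log10(20.01)
tf_idf = tf * idf
print(round(idf, 4), round(tf_idf, 4))       # 1.3012 0.0651
```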


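  • The basic idea of TF-IDF

A word's importance increases in proportion to the number of times it occurs in a document, but decreases with how frequently it occurs across the corpus. Either way, counting how many times each word occurs in a document is a necessary step, so TF-IDF is also a bag-of-words method.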
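To illustrate what "bag-of-words" means here, the sketch below counts words while discarding their order. The helper `bag_of_words` and its whitespace tokenization are simplifying assumptions for illustration only.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order (whitespace tokenization)."""
    return Counter(text.lower().split())


a = bag_of_words("the snake ate the mouse")
b = bag_of_words("the mouse ate the snake")
print(a == b)    # True: the two sentences have identical word counts
print(a["the"])  # 2
```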


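  • How the TF-IDF algorithm works

During preprocessing we already filtered out the stop words. If we consider only the remaining, meaningful words, then, as discussed earlier, words with a higher term frequency (TF) are likely to be the more important words in an article, that is, its potential keywords. This raises another problem, however. In the example, madefortv, california, and includ each occur 2 times (madefortv is made-for-TV in the original text; the tokenization method we chose treats it as a single word), but this clearly does not mean that they are equally important as keywords.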

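This is because "includ" is a very common word (note that includ is the stem of include). By contrast, california is probably much less common. If the two words occur equally often in an article, we have good reason to consider california more important than include; in other words, california should rank ahead of include in a keyword ranking.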

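We therefore need a weighting parameter that measures whether a word is common. If a word is relatively rare but occurs many times in a particular article, it probably reflects a distinctive feature of that article and is more likely to reveal its topic. This weighting parameter is the "inverse document frequency" (IDF), whose value is inversely related to how common a word is.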

Once TF and IDF are known, multiplying the two gives a word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value.
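The california versus includ argument above can be sketched on a toy corpus. Everything in this block is an assumption for illustration: the function name `tf_idf_scores`, the three hand-written token lists, and the use of the log10(N/df + 0.01) variant of IDF described earlier. The scored document is assumed to be part of the corpus, so every term occurs in at least one document.

```python
import math
from collections import Counter


def tf_idf_scores(doc_tokens, corpus):
    """Score every term of doc_tokens against corpus (a list of token lists)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, n in counts.items():
        # Number of documents in the corpus that contain the term.
        docs_with_term = sum(1 for doc in corpus if term in doc)
        idf = math.log10(len(corpus) / docs_with_term + 0.01)
        scores[term] = (n / total) * idf
    return scores


# Hypothetical stemmed documents; "includ" occurs in all three.
corpus = [
    ["california", "includ", "film", "california", "includ"],
    ["includ", "snake", "genus"],
    ["includ", "revolv", "colt"],
]
scores = tf_idf_scores(corpus[0], corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['california', 'film', 'includ']
```

Although california and includ both occur twice in the first document, includ occurs in every document, so its IDF is close to zero and it drops to the bottom of the ranking.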



  • The code is implemented as follows:

import nltk
import math
import string
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

text1 = "Python is a 2000 made-for-TV horror movie directed by Richard \
Clabaugh. The film features several cult favorite actors, including William \
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy, \
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the \
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean \
Whalen. The film concerns a genetically engineered snake, a python, that \
escapes and unleashes itself on a small town. It includes the classic final \
girl scenario evident in films like Friday the 13th. It was filmed in Los Angeles, \
 California and Malibu, California. Python was followed by two sequels: Python \
 II (2002) and Boa vs. Python (2004), both also made-for-TV films."

text2 = "Python, from the Greek word (πύθων/πύθωνας), is a genus of \
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are \
recognised.[2] A member of this genus, P. reticulatus, is among the longest \
snakes known."

text3 = "The Colt Python is a .357 Magnum caliber revolver formerly \
manufactured by Colt's Manufacturing Company of Hartford, Connecticut. \
It is sometimes referred to as a \"Combat Magnum\".[1] It was first introduced \
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued \
Colt Python targeted the premium revolver market segment. Some firearm \
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy \
Thompson, Renee Smeets and Martin Dougherty have described the Python as the \
finest production revolver ever made."

def get_tokens(text):
    lowers = text.lower()
    #remove the punctuation using the character deletion step of translate
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    no_punctuation = lowers.translate(remove_punctuation_map)
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens


def stem_tokens(tokens, stemmer):
    # Reduce every token to its stem using the given stemmer
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

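tokens = get_tokens(text1)
# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w not in stop_words]
stemmer = PorterStemmer()
stemmed = stem_tokens(filtered, stemmer)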

# Use stemming so that variants of the same word are counted together

count = Counter(stemmed)
print(count.most_common(10))







