Classify Text With NLTK

Classification is the task of choosing the correct class label for a given input.

A classifier is called supervised if it is built based on training corpora containing the correct label for each input.

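Here is an example showing how to train a classifier with NLTK and use it for classification.

A simple classification task: given a name, decide its gender, that is, classify it as either male or female.

Training comes first, and training needs a corpus: examples of names that are already labeled with the correct class.

NLTK provides a names corpus for this.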

>>> from nltk.corpus import names

>>> names.words('male.txt')    # list of male names

>>> names.words('female.txt')  # list of female names

With a training corpus in hand, the next step is feature extraction.

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features.

Here we simply assume that the gender of a name is related to its last letter, so the last letter becomes the single feature of each instance.

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}

So we define the feature extractor above and use it to build the training set and the test set.

>>> import nltk
>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)  # the names are sorted alphabetically; shuffle them so that training works well

>>> featuresets = [(gender_features(n), g) for (n, g) in names]

>>> train_set, test_set = featuresets[500:], featuresets[:500]  # first 500 items for testing, the rest for training
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)  # train a naive Bayes classifier on the training set

>>> classifier.classify(gender_features('Trinity'))  # once trained, the classifier can label new names
'female'

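Now evaluate the classifier on the test set.

>>> print(nltk.classify.accuracy(classifier, test_set))  # nltk.classify.accuracy scores the classifier against the test set
0.758

With only the last letter as a feature, the accuracy is about 75% (the exact figure varies with the random shuffle), which clearly leaves a lot of room for improvement.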
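As a rough illustration of what the accuracy figure above measures, the sketch below computes the fraction of test items whose predicted label matches the correct label. It does not use NLTK: `last_letter_rule`, `accuracy`, and the four toy names are made up for illustration and only stand in for a trained classifier and a real test set.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

def last_letter_rule(features):
    # Hypothetical stand-in for a trained classifier: a fixed hand-written rule.
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

def accuracy(classify, labeled_featuresets):
    # Fraction of items for which the predicted label equals the correct label.
    correct = sum(1 for feats, label in labeled_featuresets
                  if classify(feats) == label)
    return correct / len(labeled_featuresets)

# Toy test set (assumed data, not taken from the names corpus).
toy_names = [('Anna', 'female'), ('Mark', 'male'),
             ('Emily', 'female'), ('Steve', 'male')]
toy_test_set = [(gender_features(n), g) for n, g in toy_names]

toy_accuracy = accuracy(last_letter_rule, toy_test_set)
print(toy_accuracy)  # 'Steve' ends in 'e' and is mislabeled, so 3 of 4 are correct
```

Keeping the test items out of the training data matters here: a score measured on names the classifier has already seen would overstate how well it generalizes.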

>>> classifier.show_most_informative_features(5)  # a handy call: it lists the features that best separate the labels
Most Informative Features
    last_letter = 'a'     female : male   = 38.3 : 1.0
    last_letter = 'k'       male : female = 31.4 : 1.0
    last_letter = 'f'       male : female = 15.3 : 1.0
    last_letter = 'p'       male : female = 10.6 : 1.0
    last_letter = 'w'       male : female = 10.6 : 1.0
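The ratios in this listing compare how likely a feature value is under one label versus the other. The following sketch shows that idea on a few made-up (last letter, label) pairs using plain relative frequencies. It is not NLTK's implementation: NLTK smooths its probability estimates, so its numbers would differ, and `toy_data` and `likelihood` are names invented for this example.

```python
from collections import Counter

# Assumed toy observations: (last letter, gender label).
toy_data = [
    ('a', 'female'), ('a', 'female'), ('a', 'female'), ('e', 'female'),
    ('a', 'male'), ('k', 'male'), ('k', 'male'), ('n', 'male'),
]

label_counts = Counter(label for _, label in toy_data)
pair_counts = Counter(toy_data)

def likelihood(letter, label):
    # Unsmoothed estimate of P(last_letter = letter | label).
    return pair_counts[(letter, label)] / label_counts[label]

# How much more often does 'a' end a female name than a male name?
ratio_a = likelihood('a', 'female') / likelihood('a', 'male')
print("last_letter = 'a'     female : male = %.1f : 1.0" % ratio_a)
```

Without smoothing, a letter that never occurs with one label would give a division by zero, which is one reason a real implementation does not use raw frequencies.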

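When the corpus is large, a single list holding the features of every instance can take up a lot of memory. For that case NLTK provides apply_features, which returns an object that acts like a list but computes each feature set only when it is accessed.

>>> from nltk.classify import apply_features
>>> train_set = apply_features(gender_features, names[500:])
>>> test_set = apply_features(gender_features, names[:500])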
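To make the idea of computing features on demand concrete, here is a minimal sketch of a list-like wrapper that stores only the labeled names and builds each feature set when it is indexed. `LazyFeatureSets` is a hypothetical class written for this illustration; it is not how `apply_features` is implemented and supports far less.

```python
class LazyFeatureSets:
    """List-like view that computes (features, label) pairs on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens  # only the raw (token, label) pairs are stored

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        # Integer indices only in this sketch; slices are not handled.
        token, label = self._tokens[index]
        return (self._func(token), label)

def gender_features(word):
    return {'last_letter': word[-1]}

# Assumed toy data standing in for the shuffled names list.
toy_names = [('Anna', 'female'), ('Mark', 'male'), ('Emily', 'female')]
lazy_set = LazyFeatureSets(gender_features, toy_names)

first_item = lazy_set[0]    # features are computed here, not at construction
all_items = list(lazy_set)  # iteration works through __getitem__
print(first_item)
```

The trade-off is time for memory: every access recomputes the features, so this pays off when the feature sets are large and each one is read only a few times.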

How well a classifier performs depends heavily on the features chosen for the training set; sensible features lead to better classification results.

More features are not always better, however.

If you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.

That is why feature extraction is an important research topic in classification.

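For example, if the features in the example above are extended to the last character and the last two characters, as two separate features, the test accuracy of the classifier improves somewhat.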

>>> def gender_features(word):
...     return {'suffix1': word[-1:],
...             'suffix2': word[-2:]}

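But if the features are extended to the first letter, the last letter, and a count of every character, the result is overfitting: test accuracy is actually worse than when only the last letter was considered.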

>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         features["count(%s)" % letter] = name.lower().count(letter)
...         features["has(%s)" % letter] = (letter in name.lower())
...     return features
>>> gender_features2('John')
{'count(j)': 1, 'has(d)': False, 'count(b)': 0, ...}

>>> featuresets = [(gender_features2(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.748

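This simple example covers the whole process of classification with NLTK. What remains is that different classification tasks call for different features, and naive Bayes is not the only classifier available; the classifier can also be chosen to suit the task.

For text classification, for instance, the features of a document can be whether it contains each of a set of feature words.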

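>>> from nltk.corpus import movie_reviews
>>> all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
>>> word_features = [w for (w, _) in all_words.most_common(2000)]  # take the most frequent words as features, although this way of choosing them is not very sound
>>> def document_features(document):
...     document_words = set(document)  # set membership tests are faster than list lookups
...     features = {}
...     for word in word_features:
...         features['contains(%s)' % word] = (word in document_words)
...     return features
>>> print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{'contains(waste)': False, 'contains(lot)': False, ...}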
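The same "contains word" features can be tried out without downloading the movie review corpus. The sketch below uses `collections.Counter` from the standard library in place of `nltk.FreqDist`, and two made-up one-line reviews in place of real documents; `toy_reviews` and the cutoff of 3 words are assumptions chosen to keep the example small.

```python
from collections import Counter

# Assumed toy documents, each already split into words.
toy_reviews = [
    "a great film with a great cast".split(),
    "a dull film and a waste of time".split(),
]

# Count every word across all documents and keep the 3 most frequent.
word_counts = Counter(w.lower() for doc in toy_reviews for w in doc)
word_features = [w for w, _ in word_counts.most_common(3)]

def document_features(document):
    # One boolean feature per feature word: is it present in the document?
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

second_review_features = document_features(toy_reviews[1])
print(second_review_features)
```

The example also shows why picking words purely by frequency is a weak choice: the most frequent word here is "a", which appears in both reviews and says nothing about either.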

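POS tagging can also be treated as a classification problem.

For example, we can guess a word's part of speech from its suffix. Here the features are whether the word ends with each of a set of common suffixes.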

>>> def pos_features(word):
...     features = {}
...     for suffix in common_suffixes:
...         features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
...     return features
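The function above relies on a `common_suffixes` list that is not defined in these notes. One plausible way to build it, sketched here under that assumption, is to count the last one, two, and three characters of every word in some word list and keep the most frequent endings. The `toy_words` list and the cutoff of 6 are made up for illustration; a real run would count suffixes over a large corpus.

```python
from collections import Counter

# Assumed toy vocabulary standing in for the words of a real corpus.
toy_words = ['running', 'jumped', 'walking', 'talked',
             'cats', 'dogs', 'faster', 'singing']

# Count the final 1, 2 and 3 characters of every word.
suffix_counts = Counter()
for word in toy_words:
    word = word.lower()
    for n in (1, 2, 3):
        suffix_counts[word[-n:]] += 1

common_suffixes = [s for s, _ in suffix_counts.most_common(6)]

def pos_features(word):
    # Same feature extractor as above, now with common_suffixes defined.
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

jumping_features = pos_features('Jumping')
print(common_suffixes)
```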

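These features are rather simple, of course. As an improvement, combine the suffix with context, namely the previous word and its tag, which gives a more complete picture. Three suffix lengths are considered because the suffixes that indicate part of speech are generally at most three characters long: s, er, ing.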

def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]  # history holds the tags of the earlier words in the sentence
    return features
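The `history` argument only makes sense inside a loop that tags a sentence from left to right, appending each predicted tag before moving on to the next word. The sketch below shows such a loop. `toy_classify` is a hypothetical hand-written rule standing in for a trained classifier, and `tag_sentence` is a helper name invented for this example.

```python
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

def toy_classify(features):
    # Hypothetical stand-in for a trained classifier's classify() method.
    if features["prev-word"].lower() == "the":
        return "NN"
    if features["suffix(2)"] == "ed":
        return "VBD"
    if features["prev-tag"] == "<START>":
        return "DT"
    return "NN"

def tag_sentence(sentence, classify):
    history = []  # tags predicted so far, one per word already processed
    for i in range(len(sentence)):
        features = pos_features(sentence, i, history)
        history.append(classify(features))
    return list(zip(sentence, history))

tagged = tag_sentence(["The", "dog", "barked"], toy_classify)
print(tagged)
```

Because each decision feeds into the next one, an early mistake can propagate along the sentence; that is the price of using the previous tag as a feature.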

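As for classifiers, besides naive Bayes NLTK also offers decision tree and maximum entropy classifiers, which are not covered in detail here.

For large-scale data processing, classifiers written in pure Python are relatively inefficient, so classifiers implemented in a fast language such as C are needed. NLTK also supports packages that provide such classifiers; see the NLTK web page.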
