POS Tagging

POS tagging goes by several names: part-of-speech tagging, tagging with word classes, or lexical categories. The terminology varies, but it all means the same thing: labeling each word with its grammatical category.

NLTK's off-the-shelf tools make it easy to POS-tag a text:

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

The API documentation describes this interface as follows:

Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.

I checked the source: pos_tag loads the standard Treebank POS tagger, which uses the Penn Treebank tagset:

 1. CC    Coordinating conjunction
 2. CD    Cardinal number
 3. DT    Determiner
 4. EX    Existential there
 5. FW    Foreign word
 6. IN    Preposition or subordinating conjunction
 7. JJ    Adjective
 8. JJR   Adjective, comparative
 9. JJS   Adjective, superlative
10. LS    List item marker
11. MD    Modal
12. NN    Noun, singular or mass
13. NNS   Noun, plural
14. NNP   Proper noun, singular
15. NNPS  Proper noun, plural
16. PDT   Predeterminer
17. POS   Possessive ending
18. PRP   Personal pronoun
19. PRP$  Possessive pronoun
20. RB    Adverb
21. RBR   Adverb, comparative
22. RBS   Adverb, superlative
23. RP    Particle
24. SYM   Symbol
25. TO    to
26. UH    Interjection
27. VB    Verb, base form
28. VBD   Verb, past tense
29. VBG   Verb, gerund or present participle
30. VBN   Verb, past participle
31. VBP   Verb, non-3rd person singular present
32. VBZ   Verb, 3rd person singular present
33. WDT   Wh-determiner
34. WP    Wh-pronoun
35. WP$   Possessive wh-pronoun
36. WRB   Wh-adverb

 

With the tag abbreviations above, the tagger's output is now easy to interpret.

Some of the corpora bundled with NLTK come with POS annotations, which makes them usable as training data. Every tagged corpus has a tagged_words() method:

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]

 

Automatic Tagging

Now for the various automatic tagging methods. Because a word's tag depends on its context, taggers operate on whole sentences rather than individual words. If we tagged word by word straight through the text, the last word of one sentence would influence the tag of the first word of the next, which makes no sense; working sentence by sentence keeps contextual effects from crossing sentence boundaries.

We will use the Brown corpus as our example:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

This gives us the tagged sentences and the untagged sentences separately: the tagged ones serve as training data for the tagging algorithms, the untagged ones as input for testing them.

 

The Default Tagger

The simplest possible tagger assigns the same tag to each token.

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

This tagger really is that simple: it assigns every token the single tag you give it. A seemingly useless tagger, but it earns its keep as a backoff.
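The "useful as a backoff" point can be sketched without NLTK: a default tagger always has an answer, so a smarter tagger can return None for unknown words and defer to it. (A plain-Python sketch of the idea; the function names here are made up for illustration, this is not NLTK's implementation.)

```python
# A default tagger always answers with the same tag.
def default_tagger(token, tag='NN'):
    return tag

def with_backoff(primary, backoff):
    # Try the primary tagger first; fall back when it returns None.
    def tagger(token):
        result = primary(token)
        return result if result is not None else backoff(token)
    return tagger

# A toy primary tagger that only knows one word:
knows_the = lambda tok: 'AT' if tok == 'the' else None

tagger = with_backoff(knows_the, default_tagger)
print([(w, tagger(w)) for w in ['the', 'dog', 'barks']])
# [('the', 'AT'), ('dog', 'NN'), ('barks', 'NN')]
```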

 

The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns.

>>> patterns = [
... (r'.*ing$', 'VBG'),               # gerunds
... (r'.*ed$', 'VBD'),                # simple past
... (r'.*es$', 'VBZ'),                # 3rd singular present
... (r'.*ould$', 'MD'),               # modals
... (r'.*\'s$', 'NN$'),               # possessive nouns
... (r'.*s$', 'NNS'),                 # plural nouns
... (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
... (r'.*', 'NN')                     # nouns (default)
... ]

>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]

This tagger is a step up: you define regular-expression rules, and a token matching a rule gets the corresponding tag; anything else falls through to the catch-all default at the end of the list. Note that the patterns are tried in order and the first match wins.
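The same first-match-wins logic can be reproduced in a few lines of plain Python with the `re` module (a sketch of the idea, not NLTK's implementation):

```python
import re

# Ordered (pattern, tag) rules; the first pattern that matches wins,
# and the catch-all '.*' at the end plays the role of the default.
patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*s$', 'NNS'),     # plural nouns
    (r'.*', 'NN'),        # default
]

def regexp_tag(tokens):
    # For each token, take the tag of the first matching pattern.
    return [(tok, next(tag for pat, tag in patterns if re.match(pat, tok)))
            for tok in tokens]

print(regexp_tag(['considering', 'received', 'reports', 'jury']))
# [('considering', 'VBG'), ('received', 'VBD'), ('reports', 'NNS'), ('jury', 'NN')]
```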

 

The Lookup Tagger

A lot of high-frequency words do not have the NN tag. Let’s find the hundred most frequent words and store their most likely tag.

This method starts to have practical value: from a training corpus, count which tag each of the most common words is most likely to carry, and use that table for tagging.

>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = [w for (w, _) in fd.most_common(100)]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)

This code takes the 100 most frequent words in the corpus, finds the tag each of them occurs with most often, and builds the likely_tags dictionary,

which is then passed to UnigramTagger as its model.

A UnigramTagger is a 1-gram tagger: a simple tagger that looks at each word in isolation, ignoring the surrounding context.

The biggest problem with this approach: we only specified tags for the top 100 words. What about everything else?

This is where the default tagger from earlier comes in:

baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))

Now unknown words fall back to the default tagger, which partially solves the problem.

The accuracy of this method depends entirely on the size of the model. With only the top 100 words it will be modest, but it keeps improving as you include more words.
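The lookup-plus-backoff mechanism boils down to a dictionary of most-frequent tags with a fallback. A hand-rolled sketch (toy data standing in for the Brown corpus; not NLTK's implementation):

```python
from collections import Counter, defaultdict

# Toy tagged corpus: (word, tag) pairs standing in for brown.tagged_words().
tagged_words = [('the', 'AT'), ('dog', 'NN'), ('the', 'AT'),
                ('runs', 'VBZ'), ('dog', 'NN'), ('dog', 'VB')]

# For each word, count how often each tag occurs...
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word][tag] += 1

# ...and keep only the most frequent tag, like cfd[word].max().
likely_tags = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def lookup_tag(tokens, default='NN'):
    # Unknown words fall back to the default tag.
    return [(t, likely_tags.get(t, default)) for t in tokens]

print(lookup_tag(['the', 'dog', 'cat']))
# [('the', 'AT'), ('dog', 'NN'), ('cat', 'NN')]
```

'dog' appears twice as NN and once as VB, so the table stores NN; 'cat' was never seen, so it gets the default.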

 

N-Gram Tagging

Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

The lookup tagger above is a UnigramTagger under the hood. Here is the more general way to use one:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) #Training
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]

As shown here, you can train a UnigramTagger on a tagged corpus.

 

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.

An n-gram tagger takes context into account: it uses the tags of the preceding n-1 words when tagging the current word.

Take the bigram tagger, the n=2 special case, as an example:

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents = brown_tagged_sents[:size]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])

There is a catch: if a word appears in a context that never occurred in the training data, the bigram tagger cannot tag it, even if the word itself was seen during training. Once again, backoff solves the problem:

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
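The t0/t1/t2 chain can be sketched in plain Python: try the bigram context first, then the word alone, then the default (toy data; a sketch of the idea, not NLTK's implementation):

```python
from collections import Counter, defaultdict

# Toy training data standing in for train_sents.
train = [[('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
         [('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]

uni_counts, bi_counts = defaultdict(Counter), defaultdict(Counter)
for sent in train:
    prev = None                          # no tag before the first word
    for word, tag in sent:
        uni_counts[word][tag] += 1       # word alone
        bi_counts[(prev, word)][tag] += 1  # (previous tag, word) context
        prev = tag

unigram = {w: c.most_common(1)[0][0] for w, c in uni_counts.items()}
bigram = {k: c.most_common(1)[0][0] for k, c in bi_counts.items()}

def tag_sent(tokens, default='NN'):
    out, prev = [], None
    for tok in tokens:
        # bigram context -> unigram -> default, mirroring t2 -> t1 -> t0
        tag = bigram.get((prev, tok)) or unigram.get(tok) or default
        out.append((tok, tag))
        prev = tag
    return out

print(tag_sent(['the', 'dog', 'sleeps', 'loudly']))
# [('the', 'AT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('loudly', 'NN')]
```

'sleeps' was never seen after 'dog', but the context (NN, 'sleeps') was seen in the second training sentence, so the bigram table handles it; 'loudly' is unknown everywhere and falls through to the default.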

 

Transformation-Based Tagging

The problems with n-gram taggers: the model takes a lot of space, and when considering context they only look at the tags of the preceding words, never the words themselves.

The tagger introduced next addresses both problems nicely: it stores rules instead of a model, which saves a great deal of space, and its rules are not limited to tags; they can also refer to the words themselves.

 

Brill tagging is a kind of transformation-based learning, named after its inventor. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes.

 

The following example shows how Brill tagging works:

(1) replace NN with VB when the previous word is TO;

(2) replace TO with IN when the next tag is NNS.

Phrase    to  increase  grants  to  states  for  vocational  rehabilitation
Unigram   TO  NN        NNS     TO  NNS     IN   JJ          NN
Rule 1        VB
Rule 2                          IN
Output    TO  VB        NNS     IN  NNS     IN   JJ          NN

Step one: run a unigram tagger over all the words. Many of these guesses will be wrong.

Step two: apply the rules to correct the words mis-tagged in step one, yielding a much more accurate tagging.
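Applying one such transformation rule is straightforward. A plain-Python sketch of rule (1) from the example ("replace NN with VB when the previous word is tagged TO"):

```python
# Apply a Brill-style rule: replace from_tag with to_tag whenever the
# previous word's tag is prev_tag.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

# The unigram tagger's first guess for the start of the example phrase:
initial = [('to', 'TO'), ('increase', 'NN'), ('grants', 'NNS')]
print(apply_rule(initial, 'NN', 'VB', 'TO'))
# [('to', 'TO'), ('increase', 'VB'), ('grants', 'NNS')]
```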

 

So where do these rules come from? They are generated automatically during the training phase.

During its training phase, the tagger guesses values for T1, T2, and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies.

In other words, during training the tagger first creates thousands of candidate rules. These can be generated by simple statistics, so some of them will be inaccurate. Each rule is then used to fix mistakes and the result is compared against the correct tags: the number of tags it corrects, minus the number of correct tags it breaks, is the rule's score. High-scoring rules are kept; low-scoring rules are discarded. Here are some example rules:

NN -> VB if the tag of the preceding word is 'TO'
NN -> VBD if the tag of the following word is 'DT'
NN -> VBD if the tag of the preceding word is 'NNS'
NN -> NNP if the tag of words i-2...i-1 is '-NONE-'
NN -> NNP if the tag of the following word is 'NNP'
NN -> NNP if the text of words i-2...i-1 is 'like'
NN -> VBN if the text of the following word is '*-1'
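The net-benefit scoring described above can be sketched as corrections gained minus correct tags broken (a plain-Python sketch with toy data; the helper names are made up for illustration):

```python
# Score a candidate rule: +1 for each wrong tag it fixes,
# -1 for each correct tag it breaks.
def score_rule(guessed, gold, rule):
    fixed = rule(guessed)
    score = 0
    for (w, before), (_, after), (_, truth) in zip(guessed, fixed, gold):
        if before != after:
            if after == truth:
                score += 1       # corrected a mistake
            elif before == truth:
                score -= 1       # broke a correct tag
    return score

guessed = [('to', 'TO'), ('increase', 'NN'), ('fish', 'NN')]
gold    = [('to', 'TO'), ('increase', 'VB'), ('fish', 'NN')]

# Candidate rule "NN -> VB after TO": fixes 'increase', leaves 'fish' alone.
rule = lambda tagged: [
    (w, 'VB') if t == 'NN' and i > 0 and tagged[i - 1][1] == 'TO' else (w, t)
    for i, (w, t) in enumerate(tagged)]
print(score_rule(guessed, gold, rule))
# 1
```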

 

 

