Information Extraction
The figure below shows the architecture of a simple information extraction system.
First, a sentence segmenter splits the raw text of the document into sentences, and a tokenizer further subdivides each sentence into words. Next, each sentence is tagged with part-of-speech tags. In the following step, named entity recognition, we search for mentions of entities in each sentence. Finally, relation recognition searches the text for likely relations between those entities.
For the first three steps, we can define a single function (note the added import and return, which the snippet needs to be runnable):
>>> import nltk
>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]
...     sentences = [nltk.pos_tag(sent) for sent in sentences]
...     return sentences
Chunking
Chunking with Regular Expressions
A chunk grammar is written as a raw string in the following format, where each brace-delimited line inside a named block is one rule:
r"""
CHUNK_NAME: {<tag pattern>...<tag pattern>}
            {...}
"""
A simple noun phrase chunker:
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and noun
{<NNP>+} # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
The braces contain the chunking rules; a grammar may have one or more. When there is more than one rule, RegexpParser applies them in turn, updating the chunk structure each time, until every rule has been applied.
nltk.RegexpParser(grammar) builds a chunk parser from the rules, and cp.parse() runs that parser on the target sentence. The result is a tree, which we can display with print.
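The sequential rule application can be sketched in plain Python. This is a simplified model for illustration, not NLTK's actual implementation: the tags are joined into a string of angle-bracketed tokens, each rule is a regular expression over that string, and rules fire one after another, wrapping every match in chunk braces.

```python
import re

# POS tags of the Rapunzel sentence, joined into one string
tags = "<NNP><VBD><RP><PP$><JJ><JJ><NN>"

rules = [
    r"(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>",  # determiner/possessive, adjectives, noun
    r"(?:<NNP>)+",                     # sequences of proper nouns
]

chunked = tags
for rule in rules:
    # wrap each match of this rule in chunk braces
    chunked = re.sub("(" + rule + ")", r"{\1}", chunked)

print(chunked)  # {<NNP>}<VBD><RP>{<PP$><JJ><JJ><NN>}
```

The braced spans correspond to the two NP chunks in the parse tree above; a real chunker must additionally avoid re-chunking material that an earlier rule has already bracketed.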
Another example:
>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
...
(CHUNK seems/VBZ to/TO overtake/VB)
(CHUNK want/VB to/TO buy/VB)
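How a tag pattern like <V.*><TO><V.*> finds these verb-to-verb sequences can be illustrated with the same string-matching idea (again an assumption for illustration, not NLTK's internal code): join the POS tags into one string and search it with the re module.

```python
import re

# one of the matched Brown sentences, abbreviated
tagged = [("wanted", "VBD"), ("to", "TO"), ("wait", "VB"), (".", ".")]
tag_string = "".join("<%s>" % tag for _, tag in tagged)  # "<VBD><TO><VB><.>"

# Inside a tag, the wildcard must not cross the closing ">", so <V.*>
# behaves like "any tag beginning with V" ([^>]* stands in for .* here).
pattern = re.compile(r"<V[^>]*><TO><V[^>]*>")
match = pattern.search(tag_string)
print(match.group())  # <VBD><TO><VB>
```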
Chinking
Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. Chink patterns have the form }pattern{. In the following example, barked/VBD at/IN is a chink:
[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]
Chinking is the process of removing a sequence of tokens from a chunk. There are three cases:
1. If the matching sequence of tokens spans an entire chunk, the whole chunk is removed.
2. If the sequence of tokens appears in the middle of the chunk, those tokens are removed, leaving two chunks where there was only one before.
3. If the sequence is at the beginning or end of the chunk, those tokens are removed, leaving a smaller chunk.
The table below shows all three cases:

| | Entire chunk | Middle of a chunk | End of a chunk |
|---|---|---|---|
| Input | [a/DT little/JJ dog/NN] | [a/DT little/JJ dog/NN] | [a/DT little/JJ dog/NN] |
| Operation | Chink "DT JJ NN" | Chink "JJ" | Chink "NN" |
| Pattern | }DT JJ NN{ | }JJ{ | }NN{ |
| Output | a/DT little/JJ dog/NN | [a/DT] little/JJ [dog/NN] | [a/DT little/JJ] dog/NN |
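The three cases can be sketched in a few lines of plain Python (a simplified model, not NLTK's implementation): represent a chunk as a list of POS tags, excise the first occurrence of the chink pattern, and keep whatever remains.

```python
def chink(chunk, pattern):
    """Remove the first occurrence of `pattern` from `chunk` and return
    the list of non-empty sub-chunks that remain."""
    n, m = len(chunk), len(pattern)
    for i in range(n - m + 1):
        if chunk[i:i + m] == pattern:
            left, right = chunk[:i], chunk[i + m:]
            return [part for part in (left, right) if part]
    return [chunk]  # no match: the chunk is left intact

print(chink(["DT", "JJ", "NN"], ["DT", "JJ", "NN"]))  # []  (entire chunk removed)
print(chink(["DT", "JJ", "NN"], ["JJ"]))  # [['DT'], ['NN']]  (split in two)
print(chink(["DT", "JJ", "NN"], ["NN"]))  # [['DT', 'JJ']]    (smaller chunk)
```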
Example:
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(sentence))
(S
(NP the/DT little/JJ yellow/JJ dog/NN)
barked/VBD
at/IN
(NP the/DT cat/NN))
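The same two-step strategy, chunk everything and then chink out what does not belong, can be sketched over a tag string in plain Python (an illustration under my simplified model, not NLTK's RegexpParser):

```python
import re

# POS tags of "the little yellow dog barked at the cat"
tags = "<DT><JJ><JJ><NN><VBD><IN><DT><NN>"

chunked = "{" + tags + "}"                                # {<.*>+} : chunk everything
chunked = re.sub(r"((?:<VBD>|<IN>)+)", r"}\1{", chunked)  # }<VBD|IN>+{ : chink

print(chunked)  # {<DT><JJ><JJ><NN>}<VBD><IN>{<DT><NN>}
# (A real chunker would also discard any empty "{}" left at chunk edges.)
```

The two remaining braced spans match the two NP chunks in the parse tree above.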