A Short Summary of Using OpenNLP

robinliu2010

2020-06-28 15:49

http://danielmclaren.com/2007/05/11/getting-started-with-opennlp-natural-language-processing

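I have only just started working with NLP, and recently tried out the open-source toolkit OpenNLP. Its components include a sentence detector, a part-of-speech (POS) tagger, and a treebank parser. This post summarizes what I have learned from using OpenNLP so far.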

What can OpenNLP do?

Let's use the following passage as an example to see what OpenNLP can actually do: This isn't the greatest example sentence in the world because I've seen better. Neither is this one. This one's not bad, though.

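Sentence Detector
Put simply, the sentence detector splits text into sentences. This is harder than it sounds, because not every sentence ends with a period, and conversational text in particular can be tricky. Fortunately, OpenNLP provides a module for finding sentence boundaries. Sentence detection comes before every other operation, because the other components process only one sentence at a time.
The Sentence Detector returns a String array. For our example, the first element of that array is: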
This isn't the greatest example sentence in the world because I've seen better.
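To make the input and output of this step concrete, here is a minimal sketch of sentence splitting. It does not use OpenNLP: OpenNLP's detector is model-based and needs a trained model file, so as a stand-in this sketch uses the JDK's rule-based java.text.BreakIterator. The class name SentenceSplitSketch and the method split are my own illustrative names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in for a sentence detector, built only on the JDK.
class SentenceSplitSketch {

    /** Splits text into trimmed sentences using the JDK's rule-based BreakIterator. */
    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        // Walk from one sentence boundary to the next until the iterator is exhausted.
        for (int end = boundaries.next(); end != BreakIterator.DONE; end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "This isn't the greatest example sentence in the world because I've seen better. "
                + "Neither is this one. This one's not bad, though.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

A rule-based splitter like this is fine for clean prose, but it is exactly the kind of approach that struggles with the harder cases mentioned above, which is why a trained detector is attractive.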
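Tokenizer
Both the POS tagger and the treebank parser need the sentence broken into tokens. Usually one word is one token, but some words are split into two. For example, "don't" becomes the two tokens "do" and "n't". Here is the first sentence tokenized: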
This is n't the greatest example sentence in the world because I 've seen better .
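The splitting rule described above can be sketched with a few lines of plain Java. This is a toy, rule-based tokenizer of my own (the class ToyTokenizerSketch and its clitic list are assumptions made for illustration), not OpenNLP's tokenizer, and it only handles trailing punctuation and a handful of English contractions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy tokenizer: whitespace split, then peel off punctuation and contractions.
class ToyTokenizerSketch {

    // A word followed by one English clitic, e.g. "is" + "n't" or "I" + "'ve".
    private static final Pattern CLITIC =
            Pattern.compile("(.+?)(n't|'s|'ve|'re|'ll|'d|'m)", Pattern.CASE_INSENSITIVE);

    private static final String TRAILING_PUNCTUATION = ".,!?;:";

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Detach a single trailing punctuation mark so it becomes its own token.
            String trailing = null;
            char last = word.charAt(word.length() - 1);
            if (word.length() > 1 && TRAILING_PUNCTUATION.indexOf(last) >= 0) {
                trailing = String.valueOf(last);
                word = word.substring(0, word.length() - 1);
            }
            // Split contractions into stem and clitic; keep other words whole.
            Matcher m = CLITIC.matcher(word);
            if (m.matches()) {
                tokens.add(m.group(1));
                tokens.add(m.group(2));
            } else {
                tokens.add(word);
            }
            if (trailing != null) {
                tokens.add(trailing);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence =
                "This isn't the greatest example sentence in the world because I've seen better.";
        System.out.println(String.join(" ", tokenize(sentence)));
    }
}
```

On the first example sentence this reproduces the token sequence shown above, but it is easy to find inputs (abbreviations, quotes, hyphenated words) where such simple rules fall short.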
POS Tagger
The POS tagger assigns each token a part-of-speech tag (verb, adverb, personal pronoun, and so on). Here is the tagging result:
This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN sentence/NN in/IN the/DT world/NN because/IN I/PRP 've/VBP seen/VBN better/RB ./.
You can refer to this article to understand the POS tags.
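The token/TAG notation above is easy to read back into a program. The following sketch only parses that text format; it does not perform any tagging itself. TaggedTextSketch, TaggedToken and tokensWithTag are hypothetical names I chose for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Reads "token/TAG token/TAG ..." text into (token, tag) pairs.
class TaggedTextSketch {

    static final class TaggedToken {
        final String token;
        final String tag;

        TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }
    }

    static List<TaggedToken> parse(String tagged) {
        List<TaggedToken> result = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            // Split on the LAST slash so that an item such as "./." still parses.
            int slash = item.lastIndexOf('/');
            if (slash < 0) {
                continue; // not in token/TAG form, skip it
            }
            result.add(new TaggedToken(item.substring(0, slash), item.substring(slash + 1)));
        }
        return result;
    }

    /** Returns the tokens that carry exactly the given tag, in sentence order. */
    static List<String> tokensWithTag(List<TaggedToken> tokens, String tag) {
        List<String> matches = new ArrayList<>();
        for (TaggedToken t : tokens) {
            if (t.tag.equals(tag)) {
                matches.add(t.token);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN sentence/NN "
                + "in/IN the/DT world/NN because/IN I/PRP 've/VBP seen/VBN better/RB ./.";
        // Print the singular common nouns (tag NN) of the example sentence.
        System.out.println(tokensWithTag(parse(tagged), "NN"));
    }
}
```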
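Treebank Chunker
The chunker groups the tagged tokens of a sentence into chunks, so that noun phrases and verb phrases are marked correctly. For our example we get the following chunks: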
[NP This/DT ] [VP is/VBZ ] n't/RB [NP the/DT greatest/JJS example/NN sentence/NN ] [PP in/IN ] [NP the/DT world/NN ] [SBAR because/IN ] [NP I/PRP ] [VP 've/VBP seen/VBN ] [ADVP better/RB ] ./.
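To show how the bracketed chunk notation can be consumed, here is a small sketch that reads it back into labelled chunks and pulls out the phrases of one type. Again this only parses the printed format and is not OpenNLP code; ChunkTextSketch, Chunk and phrases are illustrative names, and tokens outside any bracket are given the label "O" by my own convention.

```java
import java.util.ArrayList;
import java.util.List;

// Reads "[NP the/DT world/NN ] ..." text into labelled chunks.
class ChunkTextSketch {

    static final class Chunk {
        final String label;
        final List<String> words = new ArrayList<>();

        Chunk(String label) {
            this.label = label;
        }

        String text() {
            return String.join(" ", words);
        }
    }

    static List<Chunk> parse(String chunked) {
        List<Chunk> chunks = new ArrayList<>();
        Chunk open = null; // the chunk currently being filled, if any
        for (String item : chunked.trim().split("\\s+")) {
            if (item.startsWith("[") && item.length() > 1) {
                open = new Chunk(item.substring(1)); // "[NP" opens an NP chunk
                chunks.add(open);
            } else if (item.equals("]")) {
                open = null;                         // "]" closes the current chunk
            } else {
                int slash = item.lastIndexOf('/');
                String word = slash < 0 ? item : item.substring(0, slash);
                if (open != null) {
                    open.words.add(word);
                } else {
                    Chunk outside = new Chunk("O");  // token outside any chunk
                    outside.words.add(word);
                    chunks.add(outside);
                }
            }
        }
        return chunks;
    }

    /** Returns the text of every chunk with the given label, in sentence order. */
    static List<String> phrases(List<Chunk> chunks, String label) {
        List<String> result = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.label.equals(label)) {
                result.add(c.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String chunked = "[NP This/DT ] [VP is/VBZ ] n't/RB [NP the/DT greatest/JJS example/NN "
                + "sentence/NN ] [PP in/IN ] [NP the/DT world/NN ] [SBAR because/IN ] [NP I/PRP ] "
                + "[VP 've/VBP seen/VBN ] [ADVP better/RB ] ./.";
        System.out.println(phrases(parse(chunked), "NP"));
    }
}
```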
Treebank Parser
Builds the syntactic parse tree of the sentence.
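To give a feel for what a parse tree looks like as a data structure, here is a minimal sketch of a tree node that prints itself in the usual bracketed treebank style. The tree for "in the world" is built by hand from the tags and chunks shown earlier; it is not parser output, and ParseTreeSketch and Node are hypothetical names of my own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A tiny constituency-tree node with bracketed printing.
class ParseTreeSketch {

    static final class Node {
        final String label;        // phrase label or POS tag
        final String word;         // non-null only for leaves
        final List<Node> children;

        private Node(String label, String word, List<Node> children) {
            this.label = label;
            this.word = word;
            this.children = children;
        }

        static Node leaf(String tag, String word) {
            return new Node(tag, word, new ArrayList<Node>());
        }

        static Node phrase(String label, Node... children) {
            return new Node(label, null, Arrays.asList(children));
        }

        /** Renders the subtree as nested brackets, e.g. (NP (DT the) (NN world)). */
        String toBracketed() {
            if (word != null) {
                return "(" + label + " " + word + ")";
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) {
                sb.append(' ').append(child.toBracketed());
            }
            return sb.append(')').toString();
        }

        /** Collects the words at the leaves, left to right. */
        List<String> words() {
            List<String> out = new ArrayList<>();
            collect(out);
            return out;
        }

        private void collect(List<String> out) {
            if (word != null) {
                out.add(word);
            }
            for (Node child : children) {
                child.collect(out);
            }
        }
    }

    public static void main(String[] args) {
        // Hand-built tree for the phrase "in the world" (an assumption, not parser output).
        Node tree = Node.phrase("PP",
                Node.leaf("IN", "in"),
                Node.phrase("NP", Node.leaf("DT", "the"), Node.leaf("NN", "world")));
        System.out.println(tree.toBracketed());
    }
}
```

Unlike the flat chunks of the previous step, a tree lets phrases nest: here the noun phrase "the world" sits inside the prepositional phrase.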

http://www.numb3r3.com/opennlp-tutorial
