Open-source NLP tools: NLTK

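NLTK is a powerful third-party Python library that makes many natural language processing (NLP) tasks easy, including tokenization, part-of-speech (POS) tagging, named entity recognition (NER), and syntactic parsing.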

1. Tokenization with NLTK

Functions used:
nltk.sent_tokenize(text)  # split a text into sentences
nltk.word_tokenize(sent)  # split a sentence into word tokens

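import nltk
# sent_tokenize needs the Punkt data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead)
text = 'PythonTip.com is a very good website. We can learn a lot from it.'
# Split the text into a list of sentences
sens = nltk.sent_tokenize(text)
print(sens)
# ['PythonTip.com is a very good website.', 'We can learn a lot from it.']
# Tokenize each sentence. NLTK's word tokenizer works at sentence level,
# so split into sentences first and then tokenize sentence by sentence;
# otherwise the results will be poor.
words = []
for sent in sens:
    words.append(nltk.word_tokenize(sent))
print(words)
# [['PythonTip.com', 'is', 'a', 'very', 'good', 'website', '.'],
# ['We', 'can', 'learn', 'a', 'lot', 'from', 'it', '.']]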

2. Part-of-speech tagging with NLTK

Functions used:
nltk.pos_tag(tokens)  # tokens is the tokenized form of one sentence; tagging is also done per sentence

tags = []
# POS tagging uses the tokenization result from the previous step
for tokens in words:
    tags.append(nltk.pos_tag(tokens))
print(tags)
# [[('PythonTip.com', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('very', 'RB'), ('good', 'JJ'),
# ('website', 'NN'), ('.', '.')], [('We', 'PRP'), ('can', 'MD'), ('learn', 'VB'), ('a', 'DT'),
# ('lot', 'NN'), ('from', 'IN'), ('it', 'PRP'), ('.', '.')]]
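
The tagged result is a list of sentences, each a list of (word, tag) pairs, so it can be processed with ordinary Python. The sketch below copies the tagged output shown above as a literal, so it runs without NLTK; the variable names tag_counts and nouns are only illustrative.

```python
from collections import Counter

# Tagged output copied from the example above (one list per sentence).
tags = [
    [('PythonTip.com', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('very', 'RB'),
     ('good', 'JJ'), ('website', 'NN'), ('.', '.')],
    [('We', 'PRP'), ('can', 'MD'), ('learn', 'VB'), ('a', 'DT'),
     ('lot', 'NN'), ('from', 'IN'), ('it', 'PRP'), ('.', '.')],
]

# Count how often each tag occurs across all sentences.
tag_counts = Counter(tag for sent in tags for word, tag in sent)

# Collect every word whose tag starts with 'NN' (the noun tags).
nouns = [word for sent in tags for word, tag in sent if tag.startswith('NN')]

print(tag_counts['DT'])
# 2
print(nouns)
# ['PythonTip.com', 'website', 'lot']
```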

3. Named entity recognition (NER) with NLTK

Functions used:
nltk.ne_chunk(tags)  # tags is the POS-tagged form of one sentence; chunking is also done per sentence

text = "Xi is the chairman of China in the year 2013."
# Tokenize
tokens = nltk.word_tokenize(text)
# POS tagging
tags = nltk.pos_tag(tokens)
print(tags)
# [('Xi', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('chairman', 'NN'),
# ('of', 'IN'), ('China', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('year', 'NN'),
# ('2013', 'CD'), ('.', '.')]
# NER uses the POS tagging result
ners = nltk.ne_chunk(tags)
print('%s --- %s' % (ners, ners.label()))
"""
(S
  (GPE Xi/NN)
  is/VBZ
  the/DT
  chairman/NN
  of/IN
  (GPE China/NNP)
  in/IN
  the/DT
  year/NN
  2013/CD
  ./.) --- S
"""

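In the example above there are two named entities. One is Xi, which should be a person (PER) but was wrongly labeled GPE; the other is China, which was correctly labeled GPE.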
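
To pull the entities out of the chunk tree, one option is to walk its top-level children and keep the ones that are subtrees. The sketch below rebuilds the tree printed above by hand with nltk.tree.Tree so that it runs without the tagger or chunker models; the helper extract_entities is a hypothetical name, not part of NLTK.

```python
from nltk.tree import Tree

# Hand-built copy of the chunk tree printed above (no model download needed).
ners = Tree('S', [
    Tree('GPE', [('Xi', 'NN')]),
    ('is', 'VBZ'), ('the', 'DT'), ('chairman', 'NN'), ('of', 'IN'),
    Tree('GPE', [('China', 'NNP')]),
    ('in', 'IN'), ('the', 'DT'), ('year', 'NN'), ('2013', 'CD'), ('.', '.'),
])

def extract_entities(chunk_tree):
    """Return (label, text) pairs for each entity subtree (hypothetical helper)."""
    entities = []
    for child in chunk_tree:
        # Plain tokens are (word, tag) tuples; entities are nested Tree nodes.
        if isinstance(child, Tree):
            text = ' '.join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities

entities = extract_entities(ners)
print(entities)
# [('GPE', 'Xi'), ('GPE', 'China')]
```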

4. Syntactic parsing

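NLTK does not have a good parser of its own, so the Stanford Parser is recommended instead.
NLTK does, however, have a very good tree class, which is implemented as a list.
You can use the Stanford Parser's output to build a Python parse tree.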
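
As a minimal sketch of that last step, assume the parser output is available as a Penn Treebank style bracketed string. The string below is hand-written for illustration and is not real Stanford Parser output. Tree.fromstring turns it into a tree that can also be indexed like a list:

```python
from nltk.tree import Tree

# Hand-written bracketed parse, standing in for a parser's output (assumption).
parse_str = "(S (NP (PRP We)) (VP (MD can) (VP (VB learn) (NP (DT a) (NN lot)))))"

tree = Tree.fromstring(parse_str)

print(tree.label())
# S
print(tree.leaves())
# ['We', 'can', 'learn', 'a', 'lot']

# Tree subclasses list, so len() and indexing work on its children.
print(isinstance(tree, list), len(tree))
# True 2
subject = tree[0]      # the NP subtree
modal = tree[1][0]     # the MD subtree under VP
print(subject.label(), modal[0])
# NP can
```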
