NLTK Usage Summary

Original article

2020-06-28 03:55

1. LookupError: Resource not found.

For example, running the following code raises this error:

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I am a good boy')

Solution 1:

import nltk
nltk.download('punkt')

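However, the download may fail with the error "An existing connection was forcibly closed by the remote host". In that case another approach is needed.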

Solution 2:

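Manually download the full set of NLTK data from https://download.csdn.net/download/herosunly/12399290 and extract it into one of the directories shown in the figure above; the LookupError message also lists the directories that NLTK searches.
Then run the following code: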

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)

If it still fails, run the following code:

import nltk
nltk.download('punkt')

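Note 1: The download has to be repeated because the manually downloaded data was built for an NLTK version that does not match the latest version installed by pip install.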

Note 2: On Linux, it is best to configure the data path first and then place the downloaded NLTK data package in that directory.

import nltk
nltk.data.path.append("/home/library/nltk_data")
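
When the data is unpacked by hand, a quick look at the directory layout can confirm that the folder appended to nltk.data.path really holds the tokenizer data. The sketch below uses only the standard library. The helper names has_punkt and usable_data_dirs are hypothetical, and it assumes the usual layout in which the Punkt package sits under tokenizers/punkt (or tokenizers/punkt_tab in newer NLTK releases).

```python
from pathlib import Path


def has_punkt(data_dir):
    """Return True if data_dir contains an extracted Punkt tokenizer package."""
    tokenizers = Path(data_dir) / "tokenizers"
    # Assumption: older NLTK releases look for "punkt", newer ones for "punkt_tab".
    return any((tokenizers / name).is_dir() for name in ("punkt", "punkt_tab"))


def usable_data_dirs(candidates):
    """Keep only the candidate directories that hold the Punkt data."""
    return [str(d) for d in candidates if has_punkt(d)]


if __name__ == "__main__":
    # The path from the note above; replace it with your own directory.
    print(usable_data_dirs(["/home/library/nltk_data"]))
```

Each directory it returns can then be passed to nltk.data.path.append(...).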

2. Tokenization and stop words

Tokenization

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)

Stop words

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
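
The stop word set is usually applied as a filter over the token list. A minimal sketch of that step follows; remove_stop_words is a hypothetical helper, and the small hand-written set only stands in for set(stopwords.words("english")) so that the example runs without the corpus installed.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


# Assumption: a tiny stand-in for set(stopwords.words("english")).
stop_words = {"i", "a", "am", "the"}

# The token list produced by word_tokenize('I love a good boy').
tokens = ['I', 'love', 'a', 'good', 'boy']

filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

Lowercasing before the lookup matters because the NLTK stop word lists are lowercase, while tokens keep their original case.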

3. POS tagging and lemmatization

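Lemmatization is similar to stemming. The difference is that stemming often produces strings that are not real words, whereas lemmatization always returns a real word, so only lemmatization is covered here. Lemmatization in turn depends on the part of speech, so it needs the result of POS tagging.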

POS tagging

import nltk
text = nltk.word_tokenize('what does the fox say')
print(text)
print(nltk.pos_tag(text))
 
Output:
['what', 'does', 'the', 'fox', 'say']
The second print outputs a list of tuples; in each tuple the first element is the word and the second is its POS tag.
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

Tag	Meaning	Examples
ADJ	adjective	new, good, high, special, big
ADV	adverb	really, already, still, early, now
CNJ	conjunction	and, or, but, if, while
DET	determiner	the, a, some, most, every
EX	existential	there, there's
FW	foreign word	dolce, ersatz, esprit, quo, maitre
MOD	modal verb	will, can, would, may, must
N	noun	year, home, costs, time
NP	proper noun	Alison, Africa, April, Washington
NUM	number	twenty-four, fourth, 1991, 14:24
PRO	pronoun	he, their, her, its, my, I, us
P	preposition	on, of, at, with, by, into, under
TO	the word to	to
UH	interjection	ah, bang, ha, whee, hmpf, oops
V	verb	is, has, get, do, make, see, run
VD	past tense	said, took, told, made, asked
VG	present participle	making, going, playing, working
VN	past participle	given, taken, begun, sung
WH	wh determiner	who, which, when, what, where

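The tags can also be listed with nltk.help.upenn_tagset(); see also https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/. Note that the table above contains errors!!! For instance, its tag names differ from the Penn Treebank tags (WDT, VBZ, DT, NNS, VBP) that nltk.pos_tag returned in the output above.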

Lemmatization

# { Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"
# }

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
'''
Output:
play
playing
playing
playing
'''
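
Since nltk.pos_tag returns Penn Treebank tags while lemmatize expects one of the one-letter constants listed above, some mapping between the two is needed. The sketch below shows one common convention; penn_to_wordnet is a hypothetical helper, not part of NLTK, and falling back to "n" for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS code used by lemmatize()."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"  # assumption: treat everything else as a noun


# The tagged output shown in the POS tagging section.
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each (word, pos) pair can then be passed on as lemmatizer.lemmatize(word, pos=pos).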

4. Sentence splitting

Because word2vec learns its word vectors from one sentence at a time, an article has to be split into sentences first.

from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
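
Word2vec training code typically expects a list of sentences, each already split into tokens, so sentence splitting and word tokenization are usually chained. The sketch below shows that shape only; to_corpus and naive_sentences are hypothetical helpers, and the naive splitters are stand-ins so that the example runs without the Punkt data. In real use, sent_tokenize and word_tokenize would be passed in instead.

```python
import re


def to_corpus(text, split_sentences, split_words):
    """Return one token list per sentence."""
    return [split_words(sentence) for sentence in split_sentences(text)]


def naive_sentences(text):
    """Stand-in splitter: break after '.', '?' or '!' followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text.strip())


corpus = to_corpus(
    "The sky is pinkish-blue. You shouldn't eat cardboard",
    naive_sentences,  # stand-in for nltk.tokenize.sent_tokenize
    str.split,        # stand-in for nltk.tokenize.word_tokenize
)
print(corpus)
```

The naive splitter would also break after an abbreviation such as "Mr.", which is one reason to prefer sent_tokenize on real text.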
