1. LookupError: Resource not found.
For example, running the following code raises this error:
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I am a good boy')
- Solution 1:
import nltk
nltk.download('punkt')
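However, the download may fail with an "An existing connection was forcibly closed by the remote host" error, in which case another approach is needed.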
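- Solution 2:
- Manually download the full set of NLTK datasets and extract them into one of the search directories listed in the LookupError message: https://download.csdn.net/download/herosunly/12399290
- Run the following code: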
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)
- If it still fails, run the following code:
import nltk
nltk.download('punkt')
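Note 1: The download has to be repeated because the manually downloaded datasets were built for an older NLTK version that does not match the latest version installed by pip install.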
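The "check first, download only if missing" logic from the two solutions above can be wrapped in one helper. What follows is a minimal sketch, not part of NLTK: `ensure_resource` is a hypothetical name, and the `find` and `download` parameters are assumptions added so the logic can be exercised offline. When they are omitted, the helper falls back to `nltk.data.find` and `nltk.download`. The resource path `tokenizers/punkt` assumes the punkt package discussed above; if the LookupError message names a different resource on your NLTK version, use the name it prints.

```python
def ensure_resource(resource_path, package_id, find=None, download=None):
    """Return True if the resource is available, downloading it if needed.

    Hypothetical helper. `find` and `download` default to nltk.data.find
    and nltk.download; they are injectable so the logic can run offline.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the data is missing
        return True
    except LookupError:
        # Fall back to downloading; report whether the download succeeded.
        return bool(download(package_id))
```

A typical call would be `ensure_resource("tokenizers/punkt", "punkt")`. If it returns False, for instance because the connection was closed, the manual download in Solution 2 is still required.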
Note 2: On Linux, it is best to configure the data path first, as shown below, and then place the downloaded NLTK data package in that directory.
import nltk
nltk.data.path.append("/home/library/nltk_data")
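Appending a path that does not exist fails silently, and the LookupError only shows up later. The sketch below guards against that. It assumes only that `nltk.data.path` is a plain Python list, which is why the search path is passed in as an argument; `register_data_dir` is a hypothetical helper name, not an NLTK function.

```python
import os


def register_data_dir(data_dir, search_path):
    """Append data_dir to search_path if it exists and is not listed yet."""
    data_dir = os.path.abspath(os.path.expanduser(data_dir))
    if not os.path.isdir(data_dir):
        # Fail early instead of getting a LookupError at tokenization time.
        raise FileNotFoundError(f"NLTK data directory not found: {data_dir}")
    if data_dir not in search_path:
        search_path.append(data_dir)
    return search_path
```

With NLTK installed, the call would look like `register_data_dir("/home/library/nltk_data", nltk.data.path)`. Setting the `NLTK_DATA` environment variable before starting Python is an alternative that avoids changing the code.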
2. Tokenization and stop words
- Tokenization
from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize('I love a good boy')
print(tokenized_word)
- Stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
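The snippet above only builds the stop-word set. Filtering a token list with it could look like the sketch below. `remove_stop_words` is a hypothetical helper, and `demo_stop_words` is a tiny made-up set used so the example runs without downloaded corpora; in practice it would be replaced by the `stop_words` set built above. Tokens are lowercased before the lookup because the NLTK English stop-word list is lowercase.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# Hypothetical miniature stop-word set, for illustration only.
# In practice: stop_words = set(stopwords.words("english"))
demo_stop_words = {"i", "a", "the", "is"}

tokens = ["I", "love", "a", "good", "boy"]
print(remove_stop_words(tokens, demo_stop_words))  # ['love', 'good', 'boy']
```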
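3. Part-of-speech tagging and lemmatization
Lemmatization is similar to stemming. The difference is that stemming often produces strings that are not real words, whereas lemmatization returns a real dictionary word, so only lemmatization is covered here. Because the lemma depends on the part of speech, we first need the output of part-of-speech tagging.
- Part-of-speech tagging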
import nltk
text = nltk.word_tokenize('what does the fox say')
print(text)
print(nltk.pos_tag(text))
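Output:
['what', 'does', 'the', 'fox', 'say']
The second line of output is a list of tuples; the first element of each tuple is the word and the second is its part-of-speech tag.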
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
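Because each entry is a `(word, tag)` tuple, the result is easy to filter. The sketch below reuses the output shown above as literal data, so it runs without NLTK; `words_with_tag_prefix` is a hypothetical helper name. It relies on the Penn Treebank convention that verb tags start with `VB` and noun tags with `NN`.

```python
# Tagged output copied from the example above.
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'),
          ('fox', 'NNS'), ('say', 'VBP')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


print(words_with_tag_prefix(tagged, "VB"))  # ['does', 'say']
print(words_with_tag_prefix(tagged, "NN"))  # ['fox']
```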
Tag | Meaning | Examples |
---|---|---|
ADJ | adjective | new,good,high,special,big |
ADV | adverb | really,already,still,early,now |
CNJ | conjunction | and,or,but,if,while |
DET | determiner | the,a,some,most,every |
EX | existential | there,there’s |
FW | foreign word | dolce,ersatz,esprit,quo,maitre |
MOD | modal verb | will,can,would,may,must |
N | noun | year,home,costs,time |
NP | proper noun | Alison,Africa,April,Washington |
NUM | number | twenty-four,fourth,1991,14:24 |
PRO | pronoun | he,their,her,its,my,I,us |
P | preposition | on,of,at,with,by,into,under |
TO | the word to | to |
UH | interjection | ah,bang,ha,whee,hmpf,oops |
V | verb | is,has,get,do,make,see,run |
VD | past tense | said,took,told,made,asked |
VG | present participle | making,going,playing,working |
VN | past participle | given,taken,begun,sung |
WH | wh determiner | who,which,when,what,where |
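The tag list can also be viewed with nltk.help.upenn_tagset(). See also https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/ (Caution: the table above contains errors! Note also that its tags are a simplified tagset and do not match the Penn Treebank tags, such as WDT and VBZ, that nltk.pos_tag returns above.)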
- Lemmatization
# { Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"
# }
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
'''
Output:
play
playing
playing
playing
'''
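The tagger returns Penn Treebank tags such as `VBZ`, while the lemmatizer expects one of the single-letter constants listed above, so the two have to be bridged. The mapping below is a minimal sketch under that assumption; `penn_to_wordnet` is a hypothetical helper, not an NLTK function, and it falls back to noun because that is also the lemmatizer's default `pos`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech constant."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag


print(penn_to_wordnet("VBZ"))  # v
print(penn_to_wordnet("NNS"))  # n
```

With the NLTK resources installed, the pieces would combine as `[lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in nltk.pos_tag(tokens)]`.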
4. Sentence splitting
Because word2vec essentially learns word vectors sentence by sentence, we first need to split the article into sentences.
from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
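For word2vec training, each sentence usually has to be split further into words, giving a list of token lists. The sketch below shows that shape. `to_training_sentences` is a hypothetical helper, and the two tokenizers are passed in as arguments so the example runs without downloaded resources; the naive period-based splitter is a stand-in for illustration only and is not how `sent_tokenize` works.

```python
def to_training_sentences(text, split_sentences, split_words):
    """Return one list of tokens per sentence."""
    return [split_words(sentence) for sentence in split_sentences(text)]


def naive_sentences(text):
    """Stand-in splitter for the demo: cut on periods, drop empty pieces."""
    return [part.strip() for part in text.split(".") if part.strip()]


demo = "The sky is blue. You should not eat cardboard"
print(to_training_sentences(demo, naive_sentences, str.split))
# [['The', 'sky', 'is', 'blue'], ['You', 'should', 'not', 'eat', 'cardboard']]
```

With NLTK available, the call would be `to_training_sentences(text, sent_tokenize, word_tokenize)`, where `word_tokenize` is imported as in section 2.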