A Summary of How to Use NLTK

NLTK (Natural Language Toolkit) is a Python-based suite of tools for natural language processing.

1. NLTK Installation and Feature Overview

(1) Installing NLTK

First, open a terminal and install nltk:

pip install nltk

Then open a Python shell and enter the following to download the NLTK data packages (corpora, models, and so on):

import nltk
nltk.download()
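
Calling nltk.download() with no arguments opens an interactive downloader. As a possible alternative, the sketch below checks which data packages are already available locally and fetches only the missing ones by identifier. The helper names (is_installed, download_missing) and the selection of packages in NEEDED are illustrative assumptions, not part of NLTK; nltk.data.find and nltk.download are the library calls.

```python
import nltk

def is_installed(resource_path):
    """Return True if the NLTK data resource is already available locally."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False

# Hypothetical selection: resource path -> package identifier for nltk.download
NEEDED = {
    "corpora/brown": "brown",
    "corpora/stopwords": "stopwords",
    "corpora/wordnet": "wordnet",
}

def download_missing(needed=NEEDED):
    """Download only the packages that are not installed yet (needs network access)."""
    for path, package in needed.items():
        if not is_installed(path):
            nltk.download(package)

# download_missing()   # uncomment to fetch the missing packages
```

Checking first keeps repeated runs of a script from contacting the download server every time.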

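(2) Language processing tasks with their corresponding NLTK modules and functionality

(3) Corpora bundled with NLTK (corpus)

The nltk.corpus package provides several ready-made corpora, some of them annotated. See the table below.

Corpus	Description
gutenberg	A small selection of texts from Project Gutenberg, mostly classic literature
webtext	Informal web text, such as forum posts, personal advertisements, and reviews
nps_chat	Over ten thousand posts from instant-messaging chat sessions
brown	A million-word English corpus, categorized by genre
reuters	Reuters corpus: over ten thousand news documents, more than one million words, classified into 90 topics and split into a training set and a test set
inaugural	A corpus of speeches: a few dozen texts, all US presidential inaugural addresses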

from nltk.corpus import brown
print(brown.categories())   # print the categories of the Brown corpus
print(len(brown.sents()))   # print the number of sentences in the Brown corpus
print(len(brown.words()))   # print the number of words in the Brown corpus

'''
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 
'science_fiction']
57340
1161192
'''

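2. Word Frequency Counting with NLTK (Frequency)

NLTK's FreqDist class records how many times each word occurs and can render those counts as a table or a plot. Its structure is simple: it is implemented as a subclass of Python's collections.Counter, a dictionary that maps each word to its count.

Method	Purpose
B()	Returns the number of distinct words (the length of the dictionary)
plot(title, cumulative=False)	Plots the frequency distribution; if cumulative is True, plots the cumulative frequency distribution
tabulate()	Prints the frequency distribution as a table
most_common()	Returns the most frequent words with their counts
hapaxes()	Returns the words that occur only once

Word frequency counting can be implemented as follows: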

import nltk
tokens=[ 'my','dog','has','flea','problems','help','please',
         'maybe','not','take','him','to','dog','park','stupid',
         'my','dalmation','is','so','cute','I','love','him'  ]
# Count word frequencies
freq = nltk.FreqDist(tokens)

# Print each word with its count
for key, val in freq.items():
    print(f"{key}:{val}")

# Take the 5 most common words
standard_freq = freq.most_common(5)
print(standard_freq)

# Plot the frequencies of the 20 most common words (requires matplotlib)
freq.plot(20, cumulative=False)
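
The remaining methods from the table can be tried the same way. The following is a minimal sketch on a shortened token list (the list and variable names are chosen here for illustration); it also uses N() and freq(), two further FreqDist methods that return the total token count and a word's relative frequency.

```python
from nltk.probability import FreqDist

tokens = ['my', 'dog', 'has', 'flea', 'problems', 'my', 'dog', 'is', 'cute']
freq = FreqDist(tokens)

total = freq.N()                 # total number of tokens counted
distinct = freq.B()              # number of distinct words
once = sorted(freq.hapaxes())    # words that occur exactly once
dog_share = freq.freq('dog')     # relative frequency of 'dog'

print(total, distinct)           # 9 7
print(once)                      # ['cute', 'flea', 'has', 'is', 'problems']
print(round(dog_share, 3))       # 0.222
```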

3. Removing Stop Words with NLTK (stopwords)

from nltk.corpus import stopwords
tokens=[ 'my','dog','has','flea','problems','help','please',
         'maybe','not','take','him','to','dog','park','stupid',
         'my','dalmation','is','so','cute','I','love','him'  ]

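# Build a set once so that each membership test is fast
stwords = set(stopwords.words('english'))
# Keep only the tokens that are not stop words (the comparison is case-sensitive)
clean_tokens = [token for token in tokens if token not in stwords]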

print(clean_tokens)
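
Because the English stop-word list is all lowercase, a capitalized token such as 'I' passes through the filter above. One way to handle that is to compare in lowercase, as in the sketch below. The function name remove_stopwords is an assumption made for this example, and the short demo_stop_words list is only a stand-in for stopwords.words('english') so that the snippet runs without the downloaded corpus.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]

# Stand-in for stopwords.words('english'); the real list is much longer
demo_stop_words = ['i', 'my', 'is', 'so', 'to', 'him', 'has', 'not']

tokens = ['my', 'dog', 'has', 'flea', 'problems', 'I', 'love', 'him']
print(remove_stopwords(tokens, demo_stop_words))
# ['dog', 'flea', 'problems', 'love']
```

The original tokens are returned unchanged, so their capitalization is preserved for later steps.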

4. Sentence and Word Tokenization with NLTK (tokenize)

(1) Sentence tokenization

from nltk.tokenize import sent_tokenize
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The result is:

['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

(2) Word tokenization

from nltk.tokenize import word_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))

The result is:

['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']
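
word_tokenize keeps punctuation marks as separate tokens. When only the words are wanted, one option is NLTK's RegexpTokenizer, which returns every match of a regular expression. The sketch below uses the pattern \w+ (runs of letters, digits, and underscores); the variable names are chosen for this example. Note the trade-off: the abbreviation "Mr." loses its period.

```python
from nltk.tokenize import RegexpTokenizer

mytext = "Hello Mr. Adam, how are you? I hope everything is going well."

# Every maximal run of word characters becomes one token; punctuation is dropped
word_only = RegexpTokenizer(r"\w+")
tokens = word_only.tokenize(mytext)
print(tokens)
# ['Hello', 'Mr', 'Adam', 'how', 'are', 'you', 'I', 'hope', 'everything', 'is', 'going', 'well']
```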

(3) Tokenizing non-English text

from nltk.tokenize import sent_tokenize
mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext,"french"))

The result is:

['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]

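5. Stemming with NLTK (Stemming)

Stemming removes the affixes from a word and returns its stem (for example, the stem of working is work). Search engines use this technique when indexing pages, so that people who search with different forms of the same word all get the same pages related to that stem.

There are many stemming algorithms, but the most commonly used is the Porter stemming algorithm. NLTK has a PorterStemmer class that implements it.

(1) PorterStemmer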

from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('working'))
# Output: work

(2) LancasterStemmer

from nltk.stem import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('working'))
# Output: work
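
The stemmers agree on working, but they follow different rule sets and can differ on other words. A small side-by-side comparison makes that easy to inspect. In the sketch below, the stemmers dictionary and the stem_all helper are names invented for this example; the three classes are the NLTK stemmers covered in this section.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmers = {
    'porter': PorterStemmer(),
    'lancaster': LancasterStemmer(),
    'snowball': SnowballStemmer('english'),
}

def stem_all(word):
    """Return the stem each stemmer produces for word, keyed by stemmer name."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}

for word in ['working', 'running', 'increases', 'generously']:
    print(word, stem_all(word))
```

Printing the results for a handful of words from your own data is a quick way to decide which stemmer is aggressive enough, or too aggressive, for a given task.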

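(3) SnowballStemmer: stemming non-English words

Besides English, the SnowballStemmer class supports 13 other languages in the NLTK version used here (newer releases may list more; the porter entry is the original Porter algorithm). The supported languages are: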

from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
# Output:
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

Use the stem() method of the SnowballStemmer class to stem non-English words:

from nltk.stem import SnowballStemmer
french_stemmer = SnowballStemmer('french')
print(french_stemmer.stem("French word"))
# Output: french word  ("French word" is only a placeholder; pass a real French word here)
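
To see the stemmer work on actual non-English input, the sketch below runs one word each in French, German, and Spanish. The sample words are chosen for this example and are not from the original text, and the stems are printed rather than listed here because they depend on each language's Snowball rules.

```python
from nltk.stem import SnowballStemmer

# Illustrative sample words, one per language
samples = {
    'french': 'mangeaient',
    'german': 'häuser',
    'spanish': 'corriendo',
}

stems = {}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)   # language must be in SnowballStemmer.languages
    stems[language] = stemmer.stem(word)
    print(language, word, '->', stems[language])
```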

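6. Lemmatization with NLTK (Lemmatization)

(1) Lemmatization is similar to stemming. The difference is that stemming often produces words that do not exist, whereas the result of lemmatization is a real word.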

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))

# Output: increase

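(2) The result depends on the part of speech. Sometimes, when you lemmatize a word such as playing, the result is still playing. This is because the lemmatizer treats the word as a noun by default. If you want the verb lemma, specify the part of speech as follows.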

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))

# Output: play

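(3) In practice, lemmatization also gives a good level of text compression, reducing the text to roughly 50% to 60% of its original size. The part of speech can be a verb (v), noun (n), adjective (a), or adverb (r):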

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
'''
Output:
play
playing
playing
playing
'''

7. Part-of-Speech Tagging with NLTK (POS Tag)

(1) POS tagging labels each word in a sentence as a noun, adjective, verb, and so on.

import nltk
text=nltk.word_tokenize('what does the fox say')
print(text)
print(nltk.pos_tag(text))

'''
Output:
['what', 'does', 'the', 'fox', 'say']

The output is a list of tuples; in each tuple the first element is the word and the second is its POS tag
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
'''
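
The tags returned by pos_tag follow the Penn Treebank convention (NN, VBZ, JJ, and so on), while the lemmatizer in section 6 expects the single letters n, v, a, or r. A small mapping bridges the two. The sketch below is one possible mapping; the function name penn_to_wordnet is an assumption made for this example, and falling back to noun mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the pos letter used by WordNetLemmatizer."""
    if tag.startswith('JJ'):
        return 'a'   # adjective
    if tag.startswith('VB'):
        return 'v'   # verb
    if tag.startswith('RB'):
        return 'r'   # adverb
    return 'n'       # everything else is treated as a noun

# Tagged output copied from the example above
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# [('what', 'n'), ('does', 'v'), ('the', 'n'), ('fox', 'n'), ('say', 'v')]
```

Each pair can then be passed on as lemmatizer.lemmatize(word, pos=letter), so that verbs such as playing are reduced to play without setting pos by hand.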

(2) Simplified POS tag set (Part of Speech)

Tag	Meaning	Examples
ADJ	adjective	new, good, high, special, big
ADV	adverb	really, already, still, early, now
CNJ	conjunction	and, or, but, if, while
DET	determiner	the, a, some, most, every
EX	existential	there, there's
FW	foreign word	dolce, ersatz, esprit, quo, maitre
MOD	modal verb	will, can, would, may, must
N	noun	year, home, costs, time
NP	proper noun	Alison, Africa, April, Washington
NUM	number	twenty-four, fourth, 1991, 14:24
PRO	pronoun	he, their, her, its, my, I, us
P	preposition	on, of, at, with, by, into, under
TO	the word to	to
UH	interjection	ah, bang, ha, whee, hmpf, oops
V	verb	is, has, get, do, make, see, run
VD	past tense	said, took, told, made, asked
VG	present participle	making, going, playing, working
VN	past participle	given, taken, begun, sung
WH	wh determiner	who, which, when, what, where

Meanings of the NLTK POS tag codes

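8. WordNet in NLTK

WordNet is a lexical database built for natural language processing. It groups words into sets of synonyms (synsets), each with a short definition.

(1) WordNet can provide the definition and example sentences for a given word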

from nltk.corpus import wordnet
syn = wordnet.synsets("pain")  # get the synsets for "pain"
print(syn[0].definition())
print(syn[0].examples())

'''
Output:
a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']
'''

(2) Using WordNet to get synonyms

from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

'''
Output:
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']
'''
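
The list above contains computer twice, because the word belongs to more than one synset, and multi-word names use underscores. A short post-processing step tidies this up. The sketch below works directly on the output shown above, so it runs without the WordNet data; the variable names are chosen for this example.

```python
# Output of the synonym loop above, copied as plain data
synonyms = ['computer', 'computing_machine', 'computing_device', 'data_processor',
            'electronic_computer', 'information_processing_system', 'calculator',
            'reckoner', 'figurer', 'estimator', 'computer']

# dict.fromkeys removes duplicates while keeping first-seen order
unique_synonyms = list(dict.fromkeys(name.replace('_', ' ') for name in synonyms))
print(len(synonyms), len(unique_synonyms))   # 11 10
print(unique_synonyms)
```

Using a plain set would also remove the duplicate, but it would lose the order in which WordNet returned the names.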

(3) Using WordNet to get antonyms

from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():   # only lemmas that have at least one antonym
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

'''
Output:
['large', 'big', 'big']
'''
