0415 Study Notes (Basic Text Processing with NLTK)

Original

2020-04-18 21:10

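First, do not name your own Python file nltk.py. It shadows the real package and the import fails with: ModuleNotFoundError: No module named 'nltk.book'; 'nltk' is not a package
Calling nltk.download() failed for me with the message "An existing connection was forcibly closed by the remote host".
Workaround: download an nltk_data package from elsewhere online, unzip it (important), and put it in the root of a drive (e.g. C:\ or D:\). Putting it in the directory I was given still raised errors such as Resource gutenberg not found.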
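Since most of the "Resource ... not found" errors in these notes came down to the folder layout, here is a minimal stdlib-only sketch that checks whether an unpacked nltk_data folder contains the sub-folders the examples rely on. The `EXPECTED` list and the `missing_resources` helper are assumptions made for illustration (based on the classic nltk_data layout); they are not part of NLTK, and NLTK itself may be able to read some resources straight from a zip.

```python
import os
import tempfile

# Sub-folders the examples in these notes rely on (an assumption based on the
# classic nltk_data layout; newer NLTK releases may expect different names).
EXPECTED = [
    os.path.join("corpora", "brown"),
    os.path.join("corpora", "stopwords"),
    os.path.join("corpora", "wordnet"),
    os.path.join("tokenizers", "punkt"),
    os.path.join("taggers", "averaged_perceptron_tagger"),
]

def missing_resources(data_root, expected=EXPECTED):
    """Return the expected sub-folders that do not exist under data_root.

    Only unpacked directories are checked, so a resource that is still a
    .zip file is reported as missing by this helper.
    """
    return [rel for rel in expected
            if not os.path.isdir(os.path.join(data_root, rel))]

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "corpora", "brown"))
    # Simulate a resource that was downloaded but never unzipped.
    os.makedirs(os.path.join(root, "tokenizers"))
    open(os.path.join(root, "tokenizers", "punkt.zip"), "w").close()
    demo_missing = missing_resources(root)
    print(demo_missing)
```

If the folder lives somewhere NLTK does not search by default, appending its path to the `nltk.data.path` list before loading a resource is the usual way to register it.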

Example 1: Reading NLTK corpus information

from nltk.corpus import brown
print(brown.categories())
print(len(brown.sents()))
print(len(brown.words()))

Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
57340
1161192

corpus: a collection of texts
brown: the Brown University corpus

Text processing pipeline:
Preprocess → Tokenize → Make Features (turn text into numbers) → Machine Learning (ML)
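To make the first three stages concrete, here is a minimal stdlib-only sketch. The function names, the two sample sentences, and the bag-of-words counting are my own assumptions for illustration; real projects would typically use NLTK's tokenizers and a proper feature extractor.

```python
import re
from collections import Counter

def preprocess(text):
    """Preprocess: lowercase and replace anything that is not a letter, digit or space."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def tokenize(text):
    """Tokenize: split on whitespace."""
    return text.split()

def make_features(tokens, vocabulary):
    """Make features: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["The fox says hello!", "Hello, hello world."]
token_lists = [tokenize(preprocess(d)) for d in docs]
# The vocabulary fixes the meaning of each position in the feature vector.
vocabulary = sorted({tok for toks in token_lists for tok in toks})
vectors = [make_features(toks, vocabulary) for toks in token_lists]

print(vocabulary)  # ['fox', 'hello', 'says', 'the', 'world']
print(vectors)     # [[1, 1, 1, 1, 0], [0, 2, 0, 0, 1]]
```

The resulting count vectors are the kind of numeric input the final ML stage would consume.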

Example 2: Tokenizing English text

import nltk
sentence="hello,world"
tokens=nltk.word_tokenize(sentence)
print(tokens)

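This raised another "Resource not found" error. Checking from a terminal showed that the required file sat inside a zip archive in the data folder; after unzipping it to the correct location the code ran. The file in question:
punkt\PY3\english.pickle

Output:
['hello', ',', 'world']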

Example 3
Chinese word segmentation with jieba

import jieba
seg_list = jieba.cut("我來到北京清華大學", cut_all=True)
print("Full Mode:", "/ ".join(seg_list))  # full mode
seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("Default Mode:", "/ ".join(seg_list))  # precise mode
seg_list = jieba.cut("他來到了⽹易杭研大廈")  # precise mode is the default
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("⼩明碩士畢業於中國科學院計算所，後在⽇本京都大學深造")
# search engine mode
print(", ".join(seg_list))

Output:
Full Mode: 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學
Default Mode: 我/ 來到/ 北京/ 清華大學
他, 來到, 了, ⽹, 易, 杭研, 大廈
⼩, 明, 碩士, 畢業, 於, 中國, 科學, 學院, 科學院, 中國科學院, 計算, 計算所, ，, 後, 在, ⽇, 本, 京都, 大學, 京都大學, 深造

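Note: ⽹, ⼩ and ⽇ in the input strings above appear to be Kangxi radical code points rather than the ordinary characters, which would explain why each one is split off as a single-character token in the output.

Full mode: every possible word in the sentence (the pieces can overlap)
Precise mode: the words the sentence actually splits into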
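The difference between the two modes can be sketched with a toy dictionary. This is not jieba's algorithm: jieba chooses its segmentation differently (its README describes a word-graph search driven by word frequencies), so the greedy forward maximum matching used for the "precise" side is a stand-in, and the seven-word `DICTIONARY` is an assumption made for illustration.

```python
# Toy dictionary (an assumption for illustration; jieba ships a far larger one).
DICTIONARY = {"我", "來到", "北京", "清華", "清華大學", "華大", "大學"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def full_mode(text):
    """List every dictionary word found at any position (pieces may overlap)."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            if text[start:end] in DICTIONARY:
                found.append(text[start:end])
    return found

def forward_max_match(text):
    """Greedy segmentation: at each position take the longest dictionary word."""
    pieces, start = [], 0
    while start < len(text):
        for end in range(min(len(text), start + MAX_WORD_LEN), start, -1):
            # Fall back to a single character when no dictionary word fits.
            if text[start:end] in DICTIONARY or end == start + 1:
                pieces.append(text[start:end])
                start = end
                break
    return pieces

sentence = "我來到北京清華大學"
full = full_mode(sentence)
precise = forward_max_match(sentence)
print("/ ".join(full))     # 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學
print("/ ".join(precise))  # 我/ 來到/ 北京/ 清華大學
```

With this dictionary the toy output happens to match the jieba output shown for the same sentence; on harder sentences a greedy matcher and jieba can disagree.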

Example 4
Detecting emoticons and tokenizing

import re
emoticons_str = r"""
    (?:
        [:=;] # eyes
        [oO\-]? # optional nose
        [D\)\]\(\]/\\OpP] # mouth
    )"""
# Order matters: the more specific patterns must come before the catch-alls.
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words containing - or '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
def tokenize(s):
    return tokens_re.findall(s)
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(preprocess(tweet))
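Two details of the tokenizer above are easy to miss: the alternatives joined with `|` are tried left to right, so the catch-all patterns have to come last, and under `re.VERBOSE` a bare `#` starts a comment, which is why the hashtag pattern writes `\#`. The sketch below uses a reduced, hypothetical subset of the pattern list to show what happens when the catch-all is moved to the front.

```python
import re

# Most specific patterns first, single-character fallback last
# (a reduced, hypothetical subset of the pattern list above).
specific_first = [
    r"(?:@\w+)",   # @-mentions
    r"(?:\#\w+)",  # hashtags; '#' is escaped because of re.VERBOSE
    r"(?:\w+)",    # plain words
    r"(?:\S)",     # any other non-space character
]
# Same patterns with the catch-all moved to the front.
fallback_first = [specific_first[-1]] + specific_first[:-1]

def build(patterns):
    """Join the alternatives into one regex with a single capturing group."""
    return re.compile("(" + "|".join(patterns) + ")", re.VERBOSE | re.IGNORECASE)

text = "RT @angelababy #168cm"
good = build(specific_first).findall(text)
bad = build(fallback_first).findall(text)
print(good)     # ['RT', '@angelababy', '#168cm']
print(bad[:4])  # ['R', 'T', '@', 'a']
```

With the catch-all first, every non-space character matches on its own and the later alternatives are never reached.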

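Word forms
Inflection: walk, walked, walking
Derivation: nation, national, nationalize

Normalizing word forms
Stemming (strip affixes to get a stem):
walking → walk
walked → walk
Lemmatization (map a word to its dictionary form):
went → go
are → be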
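The contrast can be sketched in a few lines: stemming is rule-based suffix stripping, while lemmatization needs vocabulary knowledge. Both helpers below are toys written for illustration (the suffix list is not the Porter algorithm and the lookup table is not WordNet).

```python
SUFFIXES = ("ing", "ed", "s")                      # toy rule set, not the Porter algorithm
LEMMAS = {"went": "go", "are": "be", "is": "be"}   # toy lookup table, not WordNet

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMAS.get(word, word)

stems = [toy_stem(w) for w in ("walking", "walked", "went")]
lemmas = [toy_lemmatize(w) for w in ("went", "are", "walking")]
print(stems)   # ['walk', 'walk', 'went']
print(lemmas)  # ['go', 'be', 'walking']
```

The stemmer cannot handle the irregular form "went", and the lookup-only lemmatizer cannot handle any word missing from its table, which is why the two techniques are often combined with a real dictionary.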

Example 5
Stemming and lemmatization with NLTK

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('maximum'))
print(porter_stemmer.stem('presumably'))
print(porter_stemmer.stem('multiply'))
print(porter_stemmer.stem('provision'))
print("-------------------------")
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('maximum'))
print(snowball_stemmer.stem('presumably'))
print("-------------------------")
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('maximum'))
print(lancaster_stemmer.stem('presumably'))
print("-------------------------")
# A stemmer only strips suffixes, so the irregular form 'went' is not mapped to 'go'
print(porter_stemmer.stem('went'))
print(porter_stemmer.stem('wenting'))
print("-------------------------")
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('dogs'))
print(wordnet_lemmatizer.lemmatize('churches'))
print(wordnet_lemmatizer.lemmatize('aardwolves'))
print(wordnet_lemmatizer.lemmatize('abaci'))
print(wordnet_lemmatizer.lemmatize('hardrock'))
print("-------------------------")
# No pos argument: the lemmatizer treats the word as a noun
print(wordnet_lemmatizer.lemmatize('are'))
print(wordnet_lemmatizer.lemmatize('is'))
# pos='v': lemmatize as a verb
print(wordnet_lemmatizer.lemmatize('are', pos='v'))
print(wordnet_lemmatizer.lemmatize('is', pos='v'))

Output:
maximum
presum
multipli
provis
-------------------------
maximum
presum
-------------------------
maxim
presum
-------------------------
went
went
-------------------------
dog
church
aardwolf
abacus
hardrock
-------------------------
# Without a POS tag the default is noun (NN)
are
is
be
be

Checking POS tags

import nltk
text=nltk.word_tokenize('what does the fox say')
print(nltk.pos_tag(text))

Output:
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
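The tags returned by `nltk.pos_tag` are Penn Treebank tags, whereas the `pos` argument of `lemmatize` in Example 5 takes a single WordNet letter such as 'v'. A small mapping is therefore needed to connect the two. The helper below is a hypothetical sketch, not an NLTK function, and it deliberately falls back to noun for every tag it does not recognise.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("VB"):
        return "v"   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"   # adverbs: RB, RBR, RBS
    return "n"       # everything else falls back to noun

# Tagged tokens copied from the pos_tag output in these notes.
tagged = [("what", "WDT"), ("does", "VBZ"), ("the", "DT"), ("fox", "NNS"), ("say", "VBP")]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
# [('what', 'n'), ('does', 'v'), ('the', 'n'), ('fox', 'n'), ('say', 'v')]
```

Each pair could then be passed on as `wordnet_lemmatizer.lemmatize(word, pos=letter)`.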

Example 6
Removing stopwords with NLTK

import nltk
from nltk.corpus import stopwords
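sentence = "what does the fox say"
word_list = nltk.word_tokenize(sentence)
# Build the stopword set once instead of re-reading the list for every token
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in word_list if word not in stop_words]
print(filtered_words)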

Output:
['fox', 'say']
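One thing to watch out for: NLTK's English stopword list is lowercase, so a capitalised token such as "What" slips through a plain membership test. A minimal sketch of a case-insensitive filter follows; the tiny `STOP_WORDS` set and the `remove_stopwords` helper are assumptions for illustration, standing in for the real list.

```python
# Tiny hand-written stopword set (an assumption for illustration only).
STOP_WORDS = {"what", "does", "the", "a", "is"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword; keep original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["What", "does", "The", "fox", "say"]
case_sensitive = [tok for tok in tokens if tok not in STOP_WORDS]
case_insensitive = remove_stopwords(tokens)
print(case_sensitive)    # ['What', 'The', 'fox', 'say']
print(case_insensitive)  # ['fox', 'say']
```

Comparing on the lowercased form while returning the original token keeps the casing available for later steps.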
