NLTK: Natural Language Processing in Python (Part 1)

Environment:

1. Install nltk: pip install nltk. Note: on Windows, if you are told the msgpack dependency is missing, run pip install msgpack.

2. Downloading nltk_data

In interactive mode:

import nltk

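nltk.download()  # on Windows: nltk.download_shell()

Enter d to open the downloader.

Enter all to start downloading.

Once the download finishes, run from nltk import * in interactive mode to check that the installation succeeded.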
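Downloading all of nltk_data takes a while and a lot of disk space. As a minimal sketch, assuming only the Punkt sentence models are needed for the examples in this post, a script can check for a resource and download it only when it is missing. The helper name ensure_resource and its default arguments are hypothetical and chosen for illustration; nltk.data.find and nltk.download are the actual library calls.

```python
def ensure_resource(path="tokenizers/punkt", package="punkt"):
    """Download an nltk_data package only if it is not installed yet.

    `path` and `package` are illustrative defaults (an assumption);
    newer NLTK releases may expect a different package name,
    for example "punkt_tab".
    """
    import nltk  # imported lazily so the helper can be defined without NLTK

    try:
        nltk.data.find(path)           # raises LookupError when missing
        return True
    except LookupError:
        return nltk.download(package)  # returns True on success
```

Calling ensure_resource() once at the top of a script avoids the interactive downloader entirely.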

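The nltk.tokenize module defines classes for a variety of tokenizers.
Nearly every tokenizer class has a matching ready-made tokenizing function.
The developers have already imported these tools in nltk's __init__ file, so they can also be used directly from the top-level nltk package.

Ⅰ. Splitting text into sentences

1. The sent_tokenize function splits text into individual sentences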

from nltk.tokenize import sent_tokenize

text = "To the world you may be just one person. To the person you may be the whole world. "

# sent_tokenize splits the text into sentences based on punctuation
result = sent_tokenize(text)
print(result)
# ['To the world you may be just one person.', 'To the person you may be the whole world.']
print(len(result))
# 2

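2. Sentence splitting for other languages

nltk works on English by default. To split text in another language, load that language's model and create a tokenizer from it.

# Split French text into sentences

import nltk

# Create a tokenizer for French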
tokenizer_french = nltk.data.load('tokenizers/punkt/french.pickle')

text = "Les rencontres dans la vie sont comme le vent. Certaines vous effleurent juste la peau, d’autres vous renversent."

# Use the tokenizer's tokenize method to split the text
result = tokenizer_french.tokenize(text)
print(result)
# ['Les rencontres dans la vie sont comme le vent.', 'Certaines vous effleurent juste la peau, d’autres vous renversent.']
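Loading the model file by path is one route. sent_tokenize also accepts a language argument and looks up the matching Punkt model itself. The sketch below assumes NLTK and its Punkt data are installed; the wrapper name split_sentences is hypothetical and only there to show the call.

```python
def split_sentences(text, language="french"):
    """Split `text` into sentences using the Punkt model for `language`."""
    # Imported here so the function can be defined even before NLTK is set up.
    from nltk.tokenize import sent_tokenize

    return sent_tokenize(text, language=language)
```

Passing the French text from the example above to split_sentences should give the same kind of two-sentence result, without handling the model path by hand.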

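Ⅱ. Splitting sentences into words

1. The TreebankWordTokenizer class

TreebankWordTokenizer follows the conventions of the Penn Treebank corpus and tokenizes by splitting off contractions.

from nltk.tokenize import TreebankWordTokenizer

sentence = "There's a difference between love and like. If you like a flower you will pick it, but if you."

# Instantiate TreebankWordTokenizer
tokenizer_twt = TreebankWordTokenizer()
# Tokenize the text; only the final period is split off, so 'like.' stays one token
print(tokenizer_twt.tokenize(sentence))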
# ['There', "'s", 'a', 'difference', 'between', 'love', 'and', 'like.', 'If', 'you', 'like', 'a', 'flower', 'you', 'will', 'pick', 'it', ',', 'but', 'if', 'you', '.']

2. The word_tokenize function

from nltk import word_tokenize

sentence = "There's a difference between love and like. If you like a flower you will pick it, but if you."
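# nltk.word_tokenize splits a sentence into individual words; internally it uses a TreebankWordTokenizer-style object
# By default it first splits the text into sentences and then tokenizes each one; pass preserve_line=True to skip the sentence-splitting step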
print(word_tokenize(sentence))
# ['There', "'s", 'a', 'difference', 'between', 'love', 'and', 'like', '.', 'If', 'you', 'like', 'a', 'flower', 'you', 'will', 'pick', 'it', ',', 'but', 'if', 'you', '.']

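3. WordPunctTokenizer

WordPunctTokenizer tokenizes by splitting off punctuation, so each run of punctuation becomes a token of its own (usually the form of tokenization we want).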

from nltk.tokenize import WordPunctTokenizer

sentence = "There's a difference between love and like. If you like a flower you will pick it, but if you."

# Instantiate WordPunctTokenizer
tokenizer_wpt = WordPunctTokenizer()
# Split the sentence into individual words and punctuation tokens
print(tokenizer_wpt.tokenize(sentence))
# ['There', "'", 's', 'a', 'difference', 'between', 'love', 'and', 'like', '.', 'If', 'you', 'like', 'a', 'flower', 'you', 'will', 'pick', 'it', ',', 'but', 'if', 'you', '.']
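To see why the apostrophe ends up as its own token, it helps to look at the regular expression behind WordPunctTokenizer: it matches either a run of word characters or a run of characters that are neither word characters nor whitespace. The sketch below reproduces that behaviour with only the standard re module; the function name word_punct_tokenize is hypothetical and not part of NLTK.

```python
import re

# The pattern WordPunctTokenizer is built on: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Regex-only sketch of WordPunctTokenizer's behaviour."""
    return WORD_PUNCT_PATTERN.findall(text)


sentence = "There's a difference between love and like."
tokens = word_punct_tokenize(sentence)
print(tokens)
# ['There', "'", 's', 'a', 'difference', 'between', 'love', 'and', 'like', '.']
```

Note that consecutive punctuation marks stay together as one token, so "Wait..." becomes 'Wait' followed by '...'.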
