Chapter 1: Accessing Corpora

Original

我爱玩泥巴

2018-08-24 04:51

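1. Accessing text corpora

The predefined texts can be loaded with import nltk.book (from nltk.book import * brings text1 through text9 into the current namespace).

The files of Project Gutenberg can be listed with nltk.corpus.gutenberg.fileids()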

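from nltk.corpus import gutenberg

file_ids = gutenberg.fileids()  # identifiers of every file in the corpus

print(file_ids)

emma_words = gutenberg.words('austen-emma.txt')  # the words (tokens) of this text

emma_raw = gutenberg.raw('austen-emma.txt')  # the raw text as a single string

emma_sents = gutenberg.sents('austen-emma.txt')  # the sentences, each one a list of words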
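As a rough sketch of how raw() and words() can be combined, the snippet below prints two simple figures for each Gutenberg file: characters per token and lexical diversity. The helper name text_stats is made up for this illustration and is not part of NLTK, and the loop assumes the corpus data is already installed (for example via nltk.download('gutenberg')).

```python
from nltk.corpus import gutenberg


def text_stats(raw_text, words):
    """Return (characters per token, lexical diversity) for one text.

    text_stats is a hypothetical helper for this sketch, not an NLTK function.
    """
    num_words = len(words)
    if num_words == 0:
        return 0.0, 0.0
    # Characters per token counts whitespace too, so it overstates word length.
    chars_per_token = len(raw_text) / num_words
    # Lexical diversity: distinct lowercase forms divided by total tokens.
    vocab = {w.lower() for w in words}
    return chars_per_token, len(vocab) / num_words


def main():
    try:
        file_ids = gutenberg.fileids()
    except LookupError:
        # Raised when the corpus data has not been downloaded yet.
        print("gutenberg corpus not found; try nltk.download('gutenberg')")
        return
    for file_id in file_ids:
        chars, diversity = text_stats(gutenberg.raw(file_id), gutenberg.words(file_id))
        print(f"{file_id}: {chars:.2f} chars/token, diversity {diversity:.3f}")


if __name__ == "__main__":
    main()
```

Because the tokens returned by words() include punctuation marks, both figures are only rough indicators and are best used for comparing files with each other.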

Basic corpus functions defined in NLTK (nltk.corpus.reader)

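fileids() the files of the corpus
fileids([categories]) the files of the corpus that belong to these categories
categories() the categories of the corpus
categories([fileids]) the categories of the corpus that these files belong to
raw() the raw content of the corpus
raw(fileids=[f1,f2,f3]) the raw content of the specified files
raw(categories=[c1,c2]) the raw content of the specified categories
words() the words of the whole corpus
words(fileids=[f1,f2,f3]) the words of the specified files
words(categories=[c1,c2]) the words of the specified categories
sents() the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) the sentences of the specified files
sents(categories=[c1,c2]) the sentences of the specified categories
abspath(fileid) the location of the given file on disk
encoding(fileid) the encoding of the file (if known)
open(fileid) open a stream for reading the given corpus file
root the path to the root of the locally installed corpus (an attribute in NLTK 3, accessed without parentheses)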
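The functions listed above can be tried without downloading anything. The following is a minimal sketch that builds a tiny categorized corpus in a temporary directory and queries it; the file names, the two categories, and the helper names build_demo_corpus and summarize are all invented for this example. It also assumes that splitting sentences on line breaks with LineTokenizer is acceptable, which avoids the need for the punkt sentence model.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Made-up demo files: the directory name doubles as the category.
DEMO_FILES = {
    "news/a.txt": "Markets rose today.\nTraders were pleased.\n",
    "news/b.txt": "Rain is expected tomorrow.\n",
    "fiction/c.txt": "Emma smiled.\nShe said nothing.\n",
}


def build_demo_corpus(root):
    """Write the demo files below root (hypothetical helper)."""
    for rel_path, text in DEMO_FILES.items():
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf8", newline="\n") as handle:
            handle.write(text)


def summarize(reader):
    """Call the basic corpus functions and collect the results in a dict."""
    stream = reader.open("news/b.txt")
    first_line = stream.readline()
    stream.close()
    return {
        "fileids": reader.fileids(),
        "news_files": reader.fileids(categories=["news"]),
        "categories": reader.categories(),
        "cats_of_c": reader.categories(fileids=["fiction/c.txt"]),
        "raw_b": reader.raw(fileids=["news/b.txt"]),
        "words_b": list(reader.words(fileids=["news/b.txt"])),
        "fiction_sents": [list(s) for s in reader.sents(categories=["fiction"])],
        "encoding_a": reader.encoding("news/a.txt"),
        "abspath_a": str(reader.abspath("news/a.txt")),
        "root": str(reader.root),  # root is an attribute, not a method
        "first_line_b": first_line,
    }


with tempfile.TemporaryDirectory() as root_dir:
    build_demo_corpus(root_dir)
    reader = CategorizedPlaintextCorpusReader(
        root_dir,
        r".*\.txt",                      # which files belong to the corpus
        cat_pattern=r"(\w+)/.*",         # category = leading directory name
        sent_tokenizer=LineTokenizer(),  # one sentence per line (assumption)
        encoding="utf8",
    )
    summary = summarize(reader)

for key, value in summary.items():
    print(f"{key}: {value!r}")
```

The built-in corpora such as gutenberg expose the same functions, so the calls in summarize carry over once the corresponding data is installed; categories() is only available on corpora that define categories.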

gutenberg (Project Gutenberg)

webtext (web and chat text)

reuters (Reuters Corpus)

inaugural (Inaugural Address Corpus)
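Each of the corpora listed above is exposed as an attribute of nltk.corpus, but its data has to be downloaded separately. The sketch below checks which of the four are installed locally and counts their files; count_files is a hypothetical helper name, and the check relies on NLTK raising LookupError when the data for a corpus cannot be found.

```python
import nltk.corpus

CORPUS_NAMES = ["gutenberg", "webtext", "reuters", "inaugural"]


def count_files(name):
    """Return the number of files in the named corpus, or None if its data is missing.

    count_files is a hypothetical helper for this sketch, not an NLTK function.
    """
    reader = getattr(nltk.corpus, name)
    try:
        return len(reader.fileids())
    except LookupError:
        # The corpus loader is lazy: the lookup fails only when the data is first used.
        return None


counts = {name: count_files(name) for name in CORPUS_NAMES}

for name, n_files in counts.items():
    if n_files is None:
        print(f"{name}: not installed (try nltk.download('{name}'))")
    else:
        print(f"{name}: {n_files} files")
```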
