Battling the Chatbot (Part 2): Corpora and Lexical Resources

Original

2018-09-03 03:27

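Modern natural language processing is largely statistical, and statistics needs a lot of samples, so corpora and lexical resources are indispensable.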

1. NLTK Corpora

NLTK ships with many corpora, for example the Gutenberg corpus:

nltk.corpus.gutenberg.fileids()

nltk.corpus.gutenberg: the reader for this corpus
nltk.corpus.gutenberg.raw('chesterton-brown.txt'): returns the raw text of chesterton-brown.txt
nltk.corpus.gutenberg.words('chesterton-brown.txt'): returns the list of words in chesterton-brown.txt
nltk.corpus.gutenberg.sents('chesterton-brown.txt'): returns the list of sentences in chesterton-brown.txt

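Similar corpora include:

from nltk.corpus import webtext: web text corpus, with text from the web and from chats
from nltk.corpus import brown: Brown corpus, 500 texts from different sources, categorized by genre
from nltk.corpus import reuters: Reuters corpus, more than 10,000 news documents
from nltk.corpus import inaugural: Inaugural Address corpus, 55 presidential inaugural addresses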

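1.1 General Structure of a Corpus

Corpora are usually organized in one of these ways:
- Isolated (a collection of unrelated texts)
- Categorized (texts grouped by category, with no overlap between categories)
- Overlapping (a text may belong to several categories)
- Temporal (language use changes over time)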
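
The difference between the categorized and overlapping structures can be sketched with plain dictionaries. This is only a toy model: the file IDs, the category names, and the helpers `is_overlapping` and `files_in` are made up for illustration and are not part of NLTK. In NLTK itself, the Brown corpus is an example of the categorized kind and the Reuters corpus of the overlapping kind.

```python
# Toy metadata: each hypothetical file ID maps to the set of categories it belongs to.
categorized = {"ca01": {"news"}, "cb01": {"editorial"}, "cn01": {"adventure"}}
overlapping = {"doc1": {"grain", "trade"}, "doc2": {"trade"}, "doc3": {"gold"}}


def is_overlapping(file_categories):
    """Return True if any file belongs to more than one category."""
    return any(len(cats) > 1 for cats in file_categories.values())


def files_in(file_categories, category):
    """Return the sorted file IDs that belong to the given category."""
    return sorted(f for f, cats in file_categories.items() if category in cats)


print(is_overlapping(categorized))     # every file has exactly one category
print(is_overlapping(overlapping))     # doc1 sits in two categories
print(files_in(overlapping, "trade"))  # files sharing the 'trade' category
```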

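1.2 Common Corpus Interface

fileids(): returns the files in the corpus
categories(): returns the categories in the corpus
raw(): returns the raw content of the corpus
words(): returns the words in the corpus
sents(): returns the sentences in the corpus
abspath(): returns the location of a given file on disk
open(): opens a stream for reading a given corpus file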

1.3 Loading Your Own Corpus

Collect your own corpus (plain text files) under some directory (for example /tmp), then run:

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/tmp'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()

This lists the files in your own corpus. You can also call methods such as wordlists.sents('a.txt') and wordlists.words('a.txt') to get its sentences and words.
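
To try this without touching /tmp, here is a minimal sketch that writes two small files into a temporary directory and reads them back. The file names and contents are invented for the example, and the pattern `r".*\.txt"` is my own choice to pick up only .txt files. It sticks to fileids(), words() and raw(), because sents() also needs NLTK's sentence tokenizer data to be downloaded.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a throwaway corpus directory with two hypothetical files.
corpus_root = tempfile.mkdtemp()
samples = {
    "a.txt": "Hello corpus world.",
    "b.txt": "A second tiny document.",
}
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as f:
        f.write(text)

# Match only .txt files instead of everything under the root.
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = reader.fileids()            # file names relative to corpus_root
words_a = list(reader.words("a.txt"))  # tokens, punctuation split off
raw_b = reader.raw("b.txt")            # untouched file content

print(file_ids)
print(words_a)
print(raw_b)
```

A narrower pattern than '.*' is usually safer when the directory also holds files that are not part of the corpus.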

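1.4 Conditional Frequency Distributions

In natural language processing, a conditional frequency distribution is the frequency distribution of an event under a given condition.

For example, to get the frequency of each word within each category of the Brown corpus: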

import nltk
from nltk.corpus import brown

# List comprehension: genre iterates over every category in the Brown corpus,
# and word iterates over the words in that category,
# so each (genre, word) item is a category/word pair
genre_word = [(genre, word)
              for genre in brown.categories()
              for word in brown.words(categories=genre)]

# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(genre_word)
# Plot the chosen conditions and samples (requires matplotlib)
cfd.plot(conditions=['news', 'adventure'], samples=['stock', 'sunbonnet'])
# Tabulate the chosen conditions and samples
cfd.tabulate(conditions=['news', 'adventure'], samples=['stock', 'sunbonnet'])
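
To see what such a distribution holds without downloading the Brown corpus, the same counting can be sketched in plain Python. This is not how NLTK implements it; the (condition, sample) pairs below are toy data, and `tabulate` here is a hypothetical helper that returns rows rather than printing a table.

```python
from collections import Counter, defaultdict

# Toy (condition, sample) pairs standing in for (genre, word) pairs.
pairs = [
    ("news", "stock"), ("news", "market"), ("news", "stock"),
    ("adventure", "sunbonnet"), ("adventure", "trail"), ("news", "trail"),
]

# One Counter per condition: counts[condition][sample] is a frequency.
counts = defaultdict(Counter)
for condition, sample in pairs:
    counts[condition][sample] += 1


def tabulate(counts, conditions, samples):
    """Return one row of sample counts per requested condition."""
    return [[counts[condition][s] for s in samples] for condition in conditions]


table = tabulate(counts, ["news", "adventure"], ["stock", "sunbonnet"])
print(table)
```

Each row corresponds to one condition and each column to one sample, which is the same layout that cfd.tabulate() prints.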

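We can also use a conditional frequency distribution over bigrams to generate text: starting from a seed word, repeatedly pick the next word with the highest conditional frequency.

The bigrams() function does the first step directly: it produces the sequence of adjacent word pairs.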

import nltk


# Print num words: starting from word, keep following the most likely next word in cfdist
def generate_model(cfdist, word, num=10):
    for i in range(num):
        print(word)
        word = cfdist[word].max()

# Load the corpus
text = nltk.corpus.genesis.words('english-kjv.txt')
# Generate the bigrams
bigrams = nltk.bigrams(text)
# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(bigrams)

# Generate a word sequence starting with 'the'
generate_model(cfd, 'the')
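
The same idea can be tried on a toy sentence in plain Python, which also makes one limitation visible: always taking the most frequent successor is deterministic, so the output can fall into a loop. The sentence and the helper names `build_bigram_counts` and `generate` are made up for this sketch, and the stop-on-dead-end check is my addition.

```python
from collections import Counter, defaultdict


def build_bigram_counts(words):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        counts[first][second] += 1
    return counts


def generate(counts, word, num=6):
    """Collect up to num words, always following the most frequent successor."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = counts.get(word)
        if not followers:  # the word never appears before another word
            break
        word = followers.most_common(1)[0][0]
    return out


toy_words = "the cat sat on the mat and the cat sat on the rug".split()
counts = build_bigram_counts(toy_words)
generated = generate(counts, "the", num=6)
print(" ".join(generated))  # prints "the cat sat on the cat"
```

After four words the sequence is back at "the" and repeats. Sampling the next word in proportion to its count, instead of always taking the maximum, is one common way to avoid this.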

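Other Lexical Resources

Some resources are simply collections of words or phrases together with some associated information. These are called lexical resources.

Word list corpus: nltk.corpus.words.words(), a list of English words, which can be used to spot unusual or misspelled words
Stopword corpus: nltk.corpus.stopwords.words('english'), used to identify very frequent words that carry little meaning
Pronouncing dictionary: nltk.corpus.cmudict.dict(), gives the pronunciation of each English word
Comparative wordlist: nltk.corpus.swadesh, about 200 common words aligned across several languages, which can serve as a basis for simple word-level translation
Synonym sets: WordNet, a semantically oriented English dictionary made up of synonym sets that are organized into a network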
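
How the first two kinds of resource are typically used can be sketched with tiny stand-in lists. In practice the sets would come from nltk.corpus.stopwords and nltk.corpus.words; here `toy_stopwords`, `toy_vocabulary` and both helper functions are hypothetical, chosen so that the example runs without any downloaded data.

```python
# Tiny stand-ins for a stopword list and an English word list.
toy_stopwords = {"the", "a", "of", "and", "is", "on"}
toy_vocabulary = {"the", "cat", "is", "on", "mat", "a", "sat"}


def content_words(words, stopwords):
    """Drop the frequent function words and keep the rest."""
    return [w for w in words if w.lower() not in stopwords]


def unknown_words(words, vocabulary):
    """Return alphabetic words that are missing from the word list."""
    return sorted({w.lower() for w in words if w.isalpha()} - vocabulary)


tokens = ["The", "cat", "is", "on", "the", "matt"]
content = content_words(tokens, toy_stopwords)
unknown = unknown_words(tokens, toy_vocabulary)

print(content)  # words left after stopword removal
print(unknown)  # candidates for misspellings or rare words
```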

Reference source: http://www.shareditor.com/
