Study Notes CB001: the NLTK Library, Corpora, Word Probabilities, Bigrams, and Lexicons

Original

2018-09-04 08:40

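Chatbot technology is mainly natural language processing (NLP). It covers language analysis and understanding, language generation, machine learning, human-machine dialogue, information retrieval, information transmission and storage, text classification, automatic summarization, mathematical methods, language resources, and system evaluation.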

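Installing NLTK: run pip install nltk , then start python. To download the book data, run import nltk and nltk.download(), select book, and click Download. Once the download finishes, load the texts with from nltk.book import * . Typing a text name (text1, text2, and so on) prints that text's title.

Searching: text1.concordance("former") shows every occurrence of a word in context, and text1.similar("ship") lists words used in similar contexts. text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) plots where each word occurs across the text; you can press Ctrl+Z to exit. After that, trying other functions requires restarting python and reloading the texts.

Word statistics: len(text1) is the total number of tokens, set(text1) the set of distinct tokens, len(set(text4)) the number of distinct tokens, and text4.count("is") the number of times a word occurs. FreqDist(text1) builds the frequency distribution of the text, which can be listed from most to least frequent. fdist1 = FreqDist(text1); fdist1.plot(50, cumulative=True) draws a cumulative frequency plot of the 50 most frequent tokens, fdist1.hapaxes() returns the words that occur only once, and text4.collocations() prints frequent bigrams (collocations).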
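
The word statistics above can be tried without downloading any corpus. The following is a minimal sketch using only the Python standard library; the token list is a made-up stand-in for text1, and collections.Counter plays the role of FreqDist (in NLTK 3, FreqDist is itself a Counter subclass).

```python
from collections import Counter

# A tiny hand-made token list standing in for text1 (an assumption for illustration).
tokens = "the ship and the whale and the sea".split()

total_tokens = len(tokens)          # like len(text1)
vocabulary = set(tokens)            # like set(text1)
distinct_words = len(vocabulary)    # like len(set(text1))
the_count = tokens.count("the")     # like text1.count("the")

freq = Counter(tokens)              # plays the role of FreqDist(text1)
top_two = freq.most_common(2)       # most frequent tokens first
# Tokens seen exactly once, comparable to fdist1.hapaxes()
hapaxes = sorted(w for w, c in freq.items() if c == 1)

print(total_tokens, distinct_words, the_count)
print(top_two)
print(hapaxes)
```

With the real book data, the same calls on text1 return much larger numbers, but the relationship between tokens, distinct tokens, and hapaxes is the same.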

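Key problems in NLP: word-sense understanding, automatic language generation, machine translation, and human-machine dialogue (the Turing test; a commonly cited pass criterion is fooling 30% of the judges after five minutes of conversation). Rule-based approaches start entirely from grammar and syntax and analyze text according to linguistic rules. Statistical approaches collect large amounts of corpus data and learn to understand language statistically; they have benefited from advances in hardware (GPUs), big data, and deep learning.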

NLTK corpora. Gutenberg: after import nltk , nltk.corpus.gutenberg.fileids() lists the file identifiers of the Gutenberg corpus, and nltk.corpus.gutenberg is its corpus reader. nltk.corpus.gutenberg.raw('chesterton-brown.txt') returns the raw text of a file, nltk.corpus.gutenberg.words('chesterton-brown.txt') its list of words, and nltk.corpus.gutenberg.sents('chesterton-brown.txt') its list of sentences. Web text corpus (web and chat text): from nltk.corpus import webtext . Brown corpus (text from 500 sources, categorized by genre): from nltk.corpus import brown . Reuters corpus (more than 10,000 news documents): from nltk.corpus import reuters . Inaugural Address corpus (55 US presidential inaugural addresses): from nltk.corpus import inaugural .

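Corpus organization: isolated (a collection of unrelated texts), categorized (organized by category, with no overlap), overlapping (a text may belong to several categories), and temporal (language use changes over time).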

Common corpus reader interface: files fileids(), categories categories(), raw content raw(), words words(), sentences sents(), location of a file on disk abspath(), file stream open().

Loading your own corpus: from nltk.corpus import PlaintextCorpusReader , corpus_root = '/Users/libinggen/Documents/workspace/Python/robot/txt' , wordlists = PlaintextCorpusReader(corpus_root, '.*') , wordlists.fileids() .
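
To make the reader interface less abstract, here is a rough standard-library sketch of what a plain-text corpus reader does. MiniCorpusReader is a hypothetical class written for this note, not part of NLTK, and its tokenization is far cruder than NLTK's; the temporary directory stands in for corpus_root.

```python
import os
import re
import tempfile


class MiniCorpusReader:
    """Hypothetical, simplified stand-in for a plain-text corpus reader."""

    def __init__(self, root, pattern=r".*\.txt"):
        self.root = root
        self.pattern = re.compile(pattern)

    def fileids(self):
        # Names of the files under root whose whole name matches the pattern.
        return sorted(name for name in os.listdir(self.root)
                      if self.pattern.fullmatch(name))

    def raw(self, fileid):
        # Full text of one file, decoded as UTF-8.
        with open(os.path.join(self.root, fileid), encoding="utf-8") as f:
            return f.read()

    def words(self, fileid):
        # Crude tokenization: runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", self.raw(fileid))


# Build a throwaway corpus directory so the sketch runs anywhere.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "a.txt"), "w", encoding="utf-8") as f:
    f.write("Hello, corpus world.")

reader = MiniCorpusReader(corpus_root)
print(reader.fileids())
print(reader.words("a.txt"))
```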

Converting a file from GBK to UTF-8: iconv -f GBK -t UTF-8 安娜·卡列尼娜.txt > 安娜·卡列尼娜utf8.txt .
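
If iconv is not available, the same conversion can be done in Python. This is a minimal sketch; convert_gbk_to_utf8 and the sample file names are hypothetical, and the whole file is read into memory, which is fine for a single novel but not for very large files.

```python
import os
import tempfile


def convert_gbk_to_utf8(src_path, dst_path):
    """Re-encode a GBK text file as UTF-8 (hypothetical helper)."""
    with open(src_path, encoding="gbk") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# Create a small GBK-encoded sample file so the sketch is self-contained.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "sample_gbk.txt")
dst_path = os.path.join(workdir, "sample_utf8.txt")
with open(src_path, "wb") as f:
    f.write("安娜".encode("gbk"))

convert_gbk_to_utf8(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted = f.read()

# Byte lengths differ because the two encodings use different widths per character.
print(os.path.getsize(src_path), len(converted))
```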

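A conditional distribution is the probability distribution of an event under a given condition. A conditional frequency distribution is the frequency distribution of events under specified conditions.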
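
As a small illustration of these two definitions, the sketch below builds a conditional frequency distribution with the standard library: one Counter per condition. The (condition, event) pairs are invented for the example, and conditional_probability is a hypothetical helper that turns counts into relative frequencies.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, event) pairs standing in for (genre, word) pairs.
pairs = [
    ("news", "stock"), ("news", "prize"), ("news", "stock"),
    ("adventure", "woods"), ("adventure", "stock"),
]

# One Counter per condition: this is the conditional frequency distribution.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1


def conditional_probability(cfd, condition, event):
    """Relative frequency of event among all events seen under condition."""
    total = sum(cfd[condition].values())
    return cfd[condition][event] / total if total else 0.0


print(cfd["news"]["stock"])
print(conditional_probability(cfd, "news", "stock"))
print(conditional_probability(cfd, "adventure", "stock"))
```

The same word ("stock") gets a different estimate under each condition, which is the point of conditioning on the category.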

Print the frequency of each word conditioned on each category (genre) of the Brown corpus:

# coding:utf-8

import nltk
from nltk.corpus import brown

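# List comprehension: genre iterates over all categories in the Brown corpus,
# word over the words in that category.
# Each (genre, word) tuple pairs a category with a word.
genre_word = [(genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre)
        ]

# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(genre_word)

# Tabulate or plot for the specified conditions and samples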
# cfd.tabulate(conditions=['news','adventure'], samples=[u'stock', u'sunbonnet', u'Elevated', u'narcotic', u'four', u'woods', u'railing', u'Until', u'aggression', u'marching', u'looking', u'eligible', u'electricity', u'$25-a-plate', u'consulate', u'Casey', u'all-county', u'Belgians', u'Western', u'1959-60', u'Duhagon', u'sinking', u'1,119', u'co-operation', u'Famed', u'regional', u'Charitable', u'appropriation', u'yellow', u'uncertain', u'Heights', u'bringing', u'prize', u'Loen', u'Publique', u'wooden', u'Loeb', u'963', u'specialties', u'Sands', u'succession', u'Paul', u'Phyfe'])

cfd.plot(conditions=['news','adventure'], samples=[u'stock', u'sunbonnet', u'Elevated', u'narcotic', u'four', u'woods', u'railing', u'Until', u'aggression', u'marching', u'looking', u'eligible', u'electricity', u'$25-a-plate', u'consulate', u'Casey', u'all-county', u'Belgians', u'Western', u'1959-60', u'Duhagon', u'sinking', u'1,119', u'co-operation', u'Famed', u'regional', u'Charitable', u'appropriation', u'yellow', u'uncertain', u'Heights', u'bringing', u'prize', u'Loen', u'Publique', u'wooden', u'Loeb', u'963', u'specialties', u'Sands', u'succession', u'Paul', u'Phyfe'])

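Using a conditional frequency distribution over bigrams, generate random-looking text by repeatedly choosing the most likely next word: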

# coding:utf-8

import nltk

# Loop num times (10 by default): print the current word,
# then move on to its most likely successor in cfdist
def generate_model(cfdist, word, num=10):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

# Load the corpus
text = nltk.corpus.genesis.words('english-kjv.txt')

# Build the bigrams
bigrams = nltk.bigrams(text)

# Build the conditional frequency distribution
cfd = nltk.ConditionalFreqDist(bigrams)

# Generate a word sequence starting from 'the'
generate_model(cfd, 'the')
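
The same idea can be sketched without NLTK or the Genesis corpus. The version below is an illustration under stated assumptions: build_bigram_cfd and generate are hypothetical names, the token list is a toy text, and generate returns a list instead of printing.

```python
from collections import Counter, defaultdict


def build_bigram_cfd(tokens):
    """Map each word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        cfd[first][second] += 1
    return cfd


def generate(cfd, word, num=10):
    """Follow the most frequent successor up to num times."""
    output = []
    for _ in range(num):
        output.append(word)
        if not cfd[word]:  # no known successor: stop early
            break
        word = cfd[word].most_common(1)[0][0]
    return output


# Hypothetical toy text; the code above uses the Genesis corpus instead.
tokens = "in the beginning god created the heaven and the earth and the heaven".split()
cfd = build_bigram_cfd(tokens)
print(" ".join(generate(cfd, "the", num=6)))
```

Because the most likely successor is always chosen, the output is deterministic and quickly falls into a loop. Sampling the next word in proportion to its count would be one way to get genuinely random text.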

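Lexical resources (collections of words or phrases):
Word list corpus: English words, useful for spotting unusual or misspelled words, nltk.corpus.words.words .
Stopword corpus: the most frequent words that carry little meaning, nltk.corpus.stopwords.words .
Pronouncing dictionary: pronunciations of English words, nltk.corpus.cmudict.dict .
Comparative word lists: about 200 core words aligned across several languages, a basis for translation, nltk.corpus.swadesh .
Synsets: WordNet, a semantically oriented English dictionary organized as a network of synonym sets.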
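
To show how the word list and stopword resources are typically used, here is a small sketch. The two sets are invented miniature lists; in practice they would come from nltk.corpus.stopwords.words('english') and nltk.corpus.words.words(). The helper names content_words and unusual_words are hypothetical.

```python
# Hypothetical miniature word lists standing in for the NLTK corpora.
stopwords = {"the", "is", "a", "of", "on"}
vocabulary = {"cat", "mat", "sits", "the", "on", "a"}


def content_words(tokens, stopwords):
    """Keep only the tokens that are not stopwords."""
    return [t for t in tokens if t.lower() not in stopwords]


def unusual_words(tokens, vocabulary):
    """Alphabetic tokens missing from the vocabulary (possible misspellings)."""
    return sorted({t.lower() for t in tokens if t.isalpha()} - vocabulary)


tokens = "The cat sits on teh mat".split()
print(content_words(tokens, stopwords))
print(unusual_words(tokens, vocabulary))
```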

References:

http://www.shareditor.com/blogshow/?blogId=63

http://www.shareditor.com/blogshow?blogId=64

http://www.shareditor.com/blogshow?blogId=65

