A Simple Bayesian Spell Checker

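I have been studying machine learning lately, so I am trying my hand at a blog post about the Bayesian approach. Hopefully it helps fellow beginners orz

The theory is covered in far more detail elsewhere online, so I will skip it here (0.0) and go straight to the code.

First, import the packages (re is used to strip special characters from the corpus, collections to count words)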

# Import the re and collections packages
import re, collections

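Strip special characters from the corpus and count the words

# Lowercase the text and keep only runs of a-z; digits, punctuation and other characters are dropped
def words(text): return re.findall('[a-z]+', text.lower())
# Count how many times each word occurs.
# Every count starts at 1, so a word never seen in the corpus still gets a small non-zero count
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# big.txt is the corpus file and must be in the working directory
NWORDS = train(words(open('big.txt').read()))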
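To see what `train` returns without needing `big.txt`, here is a minimal sketch run on a short inline string. The sample text is made up for illustration; the two functions are the same as above. Note that because counts start at 1, a word seen twice is stored as 3.

```python
import re
import collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Made-up sample text (an assumption, not part of the real corpus).
sample = "The cat sat. The CAT ran!"
model = train(words(sample))

print(words(sample))      # ['the', 'cat', 'sat', 'the', 'cat', 'ran']
print(model['the'])       # 3  (seen twice, plus the starting count of 1)
print(model['sat'])       # 2  (seen once, plus the starting count of 1)
print('dog' in model)     # False: a membership test does not create the key
```

One thing to keep in mind: reading a missing key with `model['dog']` returns 1 and also stores the key, whereas `'dog' in model` and `model.get('dog', 1)` leave the dictionary unchanged.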

Define the alphabet, used to replace or insert a letter in the input word

alphabet = 'abcdefghijklmnopqrstuvwxyz'

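Define the edit-distance-1 function. It returns the set of all strings one edit away from the input word, covering four kinds of typo: an extra letter was typed, two adjacent letters were swapped, one letter was mistyped, or one letter was left out.

def edits1(word):
    n = len(word)
    return set(
    [word[0:i]+word[i+1:] for i in range(n)]+  # deletions, for when an extra letter was typed (word -> ord wrd wod wor)
    [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)]+  # transpositions of two adjacent letters (word -> owrd wrod wodr)
    [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet]+  # replacements of one letter, ~ being any letter (word -> ~ord w~rd wo~d wor~)
    [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]  # insertions of one letter at any position, for when a letter was left out
    )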
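To get a feel for how many candidates one edit generates, the sketch below splits the four kinds of edit into separate lists and counts them. `edits1_parts` is a hypothetical helper written for this illustration only; it builds the same four lists as `edits1` but returns them separately instead of merging them into one set.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1_parts(word):
    """Hypothetical helper: the four candidate lists of edits1, kept separate."""
    n = len(word)
    deletes = [word[:i] + word[i+1:] for i in range(n)]
    transposes = [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)]
    replaces = [word[:i] + c + word[i+1:] for i in range(n) for c in alphabet]
    inserts = [word[:i] + c + word[i:] for i in range(n + 1) for c in alphabet]
    return deletes, transposes, replaces, inserts

deletes, transposes, replaces, inserts = edits1_parts('word')
print(deletes)        # ['ord', 'wrd', 'wod', 'wor']
print(transposes)     # ['owrd', 'wrod', 'wodr']
print(len(replaces))  # 104  (4 positions x 26 letters)
print(len(inserts))   # 130  (5 positions x 26 letters)

# Merging into a set removes duplicates, e.g. 'word' itself appears
# once per position in replaces (a letter replaced by itself).
unique = set(deletes + transposes + replaces + inserts)
print(len(unique))    # 234
```

For a word of length n the lists hold n + (n-1) + 26n + 26(n+1) = 54n + 25 strings before duplicates are removed, so the candidate set grows linearly with word length.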

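Define the edit-distance-2 function, plus a helper that keeps only real words

# Keep only the words that are 'real', i.e. that appear in the corpus
def known(words): return set(w for w in words if w in NWORDS)
# Real words among all strings two edits away (edits1 applied to every result of edits1)
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)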
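The difference between the two filters is easier to see on a tiny vocabulary. In this sketch `NWORDS` is replaced by a hand-written dictionary with made-up counts (an assumption for the example, not the real corpus); the functions are the same as in the post.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Toy vocabulary with made-up counts, standing in for the real NWORDS.
NWORDS = {'word': 4, 'world': 3, 'sword': 2, 'spelling': 5}

def edits1(word):
    n = len(word)
    return set(
        [word[:i] + word[i+1:] for i in range(n)] +
        [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)] +
        [word[:i] + c + word[i+1:] for i in range(n) for c in alphabet] +
        [word[:i] + c + word[i:] for i in range(n + 1) for c in alphabet]
    )

def known(words):
    return set(w for w in words if w in NWORDS)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

one_away = known(edits1('wrd'))
two_away = known_edits2('wrd')
print(sorted(one_away))   # ['word']
print(sorted(two_away))   # ['sword', 'word', 'world']
```

'word' shows up in the distance-2 result as well, because replacing a letter with itself counts as an edit, so everything reachable in one edit is also reachable in two. 'spelling' is too far from 'wrd' to appear in either set.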

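Define the correction function

# Main spell-check function. Candidates are taken in priority order:
# the word itself if it is real > real words at edit distance 1 > real words at edit distance 2 > the unknown word unchanged
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    # Use .get here: NWORDS[w] on a defaultdict would store an unknown word in the model as a side effect
    return max(candidates, key=lambda w: NWORDS.get(w, 1))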
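The priority order can be checked end to end without the corpus. The sketch below wires the same functions to a three-word dictionary; the words and counts are invented for the example, and a plain dict with `.get` stands in for the trained model.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Made-up counts (an assumption): 'of' is the most frequent word overall.
NWORDS = {'one': 5, 'open': 2, 'of': 10}

def edits1(word):
    n = len(word)
    return set(
        [word[:i] + word[i+1:] for i in range(n)] +
        [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)] +
        [word[:i] + c + word[i+1:] for i in range(n) for c in alphabet] +
        [word[:i] + c + word[i:] for i in range(n + 1) for c in alphabet]
    )

def known(words):
    return set(w for w in words if w in NWORDS)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS.get(w, 1))

print(correct('open'))    # 'open'  : already a real word, returned as-is
print(correct('ope'))     # 'one'   : 'one' and 'open' are one edit away, 'one' is more frequent
print(correct('zzzzz'))   # 'zzzzz' : nothing within two edits, returned unchanged
```

Note that 'of' is two edits from 'ope' and has the highest count, yet it is never considered: the `or` chain stops at the first non-empty set, so frequency only breaks ties within one distance level.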

Finally, just call it

correct('ope')

Output: 'one'

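(As an aside: it does not look to me like this code uses Bayes' rule explicitly. In the end it just returns the candidate that appears most often in the corpus.)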
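On that question, one way to read the code is through the usual Bayesian formulation of spelling correction. This framing is not written anywhere in the code, so treat it as an interpretive sketch: w is the typed word and c is a candidate correction.

```latex
\hat{c} \;=\; \arg\max_{c} P(c \mid w)
        \;=\; \arg\max_{c} \frac{P(w \mid c)\, P(c)}{P(w)}
        \;=\; \arg\max_{c} P(w \mid c)\, P(c)
```

P(w) is the same for every candidate, so it drops out. Under this reading, the word counts in NWORDS play the role of the prior P(c), which is why the most frequent candidate wins. The likelihood P(w | c) is never computed as a number; the `or` chain in `correct` stands in for it by assuming that any candidate at a smaller edit distance is more likely than any candidate at a larger one.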
