Applying Bayes' Theorem to Spell Checking

Original article

2018-08-22 04:39

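Bayes' theorem

Conditional probability

A conditional probability is usually written P(A|B): the probability that event A occurs given that event B has occurred.
Joint probability

The probability that two events occur together, written P(A,B). When A and B are independent, P(A,B)=P(A)P(B).
In general, the joint probability is P(A,B)=P(A)P(B|A).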
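The product rule above can be checked numerically. The following is a minimal sketch using a made-up joint distribution over two events; the numbers and the helper names `p_a` and `p_b_given_a` are assumptions chosen for illustration, not part of the original article.

```python
from fractions import Fraction

# Hypothetical joint distribution (illustrative numbers only).
# Keys are (A happened?, B happened?); the values sum to 1.
joint = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(3, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def p_a(joint):
    """Marginal P(A): sum over every outcome in which A happened."""
    return sum(p for (a, _), p in joint.items() if a)

def p_b(joint):
    """Marginal P(B): sum over every outcome in which B happened."""
    return sum(p for (_, b), p in joint.items() if b)

def p_b_given_a(joint):
    """Conditional P(B|A) = P(A,B) / P(A)."""
    return joint[(True, True)] / p_a(joint)

# Product rule: P(A,B) = P(A) * P(B|A) holds for any joint distribution.
assert p_a(joint) * p_b_given_a(joint) == joint[(True, True)]

# A and B are NOT independent in this table, so P(A,B) != P(A) * P(B).
assert p_a(joint) * p_b(joint) != joint[(True, True)]
```

Exact fractions are used instead of floats so that the equalities can be tested without rounding error.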
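Bayes' theorem
A joint probability does not depend on the order in which its events are written, so:
P(A,B)=P(B,A)

We also have:
P(A,B)=P(A)P(B|A)

P(B,A)=P(B)P(A|B)

Therefore:
P(A)P(B|A)=P(B)P(A|B)

That is:
P(A|B) = P(A)P(B|A) / P(B)
P(A) : the prior probability
P(A|B) : the posterior probability
P(B|A) : the likelihood
P(B) : the normalizing constant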
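As a worked instance of the formula, here is a small sketch that plugs numbers into Bayes' theorem. The function name `posterior` and all three input values are assumptions made up for this example.

```python
from fractions import Fraction

def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior * likelihood / evidence

# Illustrative values (assumed, not taken from any data set).
p_a = Fraction(2, 5)          # prior P(A)
p_b_given_a = Fraction(1, 4)  # likelihood P(B|A)
p_b = Fraction(3, 10)         # normalizing constant P(B)

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(p_a_given_b)  # 1/3
```

Observing B lowers the probability of A from the prior 2/5 to the posterior 1/3 in this example, because B is less likely when A holds (1/4) than it is overall (3/10).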
- Applying Bayes' theorem to spell checking
  First, look at the following code (written by a Google engineer):

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
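To get a feel for how many strings one round of edits produces, the sketch below counts them per edit type. It repeats the list comprehensions from `edits1` inside a hypothetical helper named `edit_counts` (not part of the original code) so that it runs on its own.

```python
import string

def edit_counts(word):
    """Count the one-edit strings per edit type, before and after de-duplication."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    raw = deletes + transposes + replaces + inserts
    return {
        "deletes": len(deletes),        # n
        "transposes": len(transposes),  # n - 1
        "replaces": len(replaces),      # 26 * n
        "inserts": len(inserts),        # 26 * (n + 1)
        "raw_total": len(raw),          # 54 * n + 25
        "distinct": len(set(raw)),      # smaller, since some edits coincide
    }

counts = edit_counts("lates")
print(counts["raw_total"])  # 295
```

For a word of length n the raw total is 54n + 25, and applying the edits a second time multiplies that again. This is presumably why `edits2` is written as a generator and why `known` filters the results against the dictionary right away.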

Running it gives:

correction('lates')
Out[1]: 'later'

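Q: When you mistype a word (say, lates), many correct words could have been intended (late, latest, lattes, ...). How does the program guess which one you most likely meant, and why does it return later?
A: Let w be the mistyped input (here, lates) and c a candidate correct word. Among all candidates, we want the c that is most probable given that w was typed. As a conditional probability:

$\arg\max_{c} P(c \mid w)$
By Bayes' theorem:
$P(c \mid w) = \frac{P(c)\,P(w \mid c)}{P(w)}$
The c that maximizes P(c|w) is the word you most likely intended. Since P(w) is the same for every candidate c, it is enough to maximize P(c)P(w|c).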
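The selection rule can be sketched in a few lines. P(w) does not depend on c, so the sketch ranks candidates by P(c)P(w|c) only. The word counts below are invented for illustration (they are not the real counts from big.txt), and the names `prior`, `likelihood` and `best_correction` are hypothetical. The constant error model is an assumption that mirrors what the code above effectively does: among known words at the same edit distance it chooses purely by the word frequency P(c).

```python
# Hypothetical corpus counts for four words one edit away from "lates".
# These numbers are made up; a real corpus would give different values.
counts = {"later": 120, "late": 90, "latest": 40, "lattes": 1}
total = sum(counts.values())

def prior(c):
    """P(c): how often the candidate word appears in the corpus."""
    return counts[c] / total

def likelihood(w, c):
    """P(w|c): assumed constant, i.e. every candidate at the same
    edit distance is treated as equally likely to be mistyped as w."""
    return 1.0

def best_correction(w, candidates):
    """argmax over c of P(c) * P(w|c); P(w) is dropped because it is
    the same for every candidate and cannot change the ranking."""
    return max(candidates, key=lambda c: prior(c) * likelihood(w, c))

print(best_correction("lates", list(counts)))  # later
```

Under these assumed counts "later" wins simply because it is the most frequent candidate. A more realistic error model would make P(w|c) depend on which edit turns c into w.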
