How to Write a Spelling Corrector (Spell Checker) in Python

Original post

2019-09-15 16:44

Source article: http://norvig.com/spell-correct.html

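One week in 2007, two friends (Dean and Bill) independently told me they were amazed at Google's spelling correction. Type in a search like [speling] and Google instantly comes back with results for: spelling. I had assumed that Dean and Bill, being highly accomplished engineers and mathematicians, would have good intuitions about how this process works. But they didn't, and come to think of it, why should they know about something so far outside their specialty?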

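I figured they, and others, could benefit from an explanation. The full details of an industrial-strength spelling corrector are quite complex (the source article links to some further reading on this). But I figured that, in the course of a transcontinental plane ride, I could write and explain a toy spelling corrector that achieves 80% or 90% accuracy at a processing speed of at least 10 words per second, in about half a page of code.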

And here it is (or see spell.py):

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
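To see how the pieces fit together, here is a minimal sketch that runs the same functions against a tiny inline corpus. SAMPLE_TEXT is a hypothetical stand-in for big.txt (an assumption made so that the example runs without any external file), so the probabilities are not meaningful; the point is only the order in which candidates() tries things: the word itself if known, then known words one edit away, then two edits away, and finally the unchanged input.

```python
import re
from collections import Counter

# Hypothetical stand-in for big.txt, so the sketch runs without the real corpus.
SAMPLE_TEXT = ("the spelling of a word matters; correct spelling "
               "helps the reader and the writer")

def words(text):
    return re.findall(r'\w+', text.lower())

WORDS = Counter(words(SAMPLE_TEXT))
N = sum(WORDS.values())

def P(word):
    # Relative frequency of `word` in the sample corpus.
    return WORDS[word] / N

def known(ws):
    return set(w for w in ws if w in WORDS)

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    # First non-empty set wins: exact match, then 1 edit, then 2 edits, then the input.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    return max(candidates(word), key=P)

print(correction('speling'))   # spelling  (one insertion away)
print(correction('teh'))       # the       (one transposition away)
print(correction('xyzzyq'))    # xyzzyq    (nothing known within two edits)
```

With the real big.txt the same calls would draw on a much larger vocabulary, so the answers for ambiguous inputs may differ from what this toy corpus gives.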

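Note: the edits1() function is written very tersely. The source article continues with a good deal of further analysis, which I have not translated here.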
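Since edits1() packs four list comprehensions into a few lines, it may help to look at the intermediate lists for a very short word. The sketch below is not part of the source article's code; edit_breakdown is a hypothetical helper name introduced here purely for illustration, and it returns each kind of edit separately instead of merging them into one set.

```python
def edit_breakdown(word):
    """Return the four kinds of single edits separately (illustrative helper)."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    # Every way to cut the word into a left part L and a right part R.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'splits': splits,
        # Drop the first character of R.
        'deletes': [L + R[1:] for L, R in splits if R],
        # Swap the first two characters of R.
        'transposes': [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1],
        # Replace the first character of R with each letter.
        'replaces': [L + c + R[1:] for L, R in splits if R for c in letters],
        # Insert each letter between L and R.
        'inserts': [L + c + R for L, R in splits for c in letters],
    }

parts = edit_breakdown('ab')
print(parts['splits'])         # [('', 'ab'), ('a', 'b'), ('ab', '')]
print(parts['deletes'])        # ['b', 'a']
print(parts['transposes'])     # ['ba']
print(len(parts['replaces']))  # 52
print(len(parts['inserts']))   # 78
```

For a word of length n this gives n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions before duplicates are removed, which is why edits1() wraps the combined list in set().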
