NLP for Beginners - Text Preprocessing

I. Spell correction

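1. A misspelled word looks similar to the correct word, which is why it is easy to mistype. Whether safari is really likely to be mistyped as saferi would need statistical data to confirm. To simplify the problem, we assume that the closer two spellings are, the more likely the mistake, and we measure closeness with edit distance. Candidate words must be within an edit distance of 2 of the typed word; here saferi and safari are at edit distance 1. The definition of edit distance is covered in more detail further on.

2. Many correct words may qualify. Setting semantics aside, the most likely one is the word that is used most often, so the second criterion is word frequency.

Below is a machine-learning approach to spell checking based on Bayes' theorem. The main idea is the two points above: enumerate all possible correct spellings, then use edit distance and word frequency to pick the most probable one as the correction.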
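To make "edit distance" concrete, here is a minimal sketch of one common definition, the optimal-string-alignment distance, which counts single-letter inserts, deletes, replaces, and swaps of adjacent letters. The function name edit_distance is my own choice for illustration and is not part of the spell checker code in this post.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal-string-alignment distance between a and b.

    Allowed edits, each costing 1: insert a letter, delete a letter,
    replace a letter, swap two adjacent letters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i letters of a and the first j letters of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i letters
    for j in range(cols):
        d[0][j] = j          # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep if equal)
            )
            # swap of two adjacent letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("saferi", "safari"))  # one replaced letter
```

With this definition, saferi and safari differ by a single replaced letter, which matches the distance of 1 quoted above.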

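How it works:

Let w be the misspelled word the user typed and c the correctly spelled word the user intended. Then:

P(c | w): the probability that the user intended c, given that w was typed.

P(w | c): the probability that the user mistypes c as w; it depends on the edit distance.

P(c): the probability that the intended word is c. It can be taken as the usage frequency of c and has to be estimated from data.

By Bayes' theorem: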

P(c | w) = P(w | c) * P(c) / P(w)

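Because w is fixed within a single correction, P(w) is a constant and can be ignored. Comparing P(c | w) across candidates is therefore the same as comparing P(w | c) * P(c).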
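The argument that P(w) can be dropped is easy to check numerically. The sketch below uses made-up values for P(w | c) and P(c) for the typo "saferi"; none of the numbers come from real data, and the names likelihood, prior, and best_correction are hypothetical.

```python
# Hypothetical values for the typo w = "saferi" (illustration only, not measured).
likelihood = {"safari": 0.01, "safer": 0.002, "safe": 0.0001}   # P(w | c)
prior = {"safari": 0.00002, "safer": 0.00005, "safe": 0.0004}   # P(c)


def best_correction(likelihood, prior, p_w=1.0):
    """Return the candidate c that maximizes P(w | c) * P(c) / P(w)."""
    return max(likelihood, key=lambda c: likelihood[c] * prior[c] / p_w)


# Dividing every score by the same positive P(w) does not change the winner.
print(best_correction(likelihood, prior))            # unnormalized scores
print(best_correction(likelihood, prior, p_w=1e-6))  # same ranking
```

Both calls pick the same candidate, because every score is divided by the same positive constant.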

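1) P(c)

P(c) is replaced by usage frequency: we count how often each word occurs in a sufficiently large text corpus (the dictionary). The frequencies can also be normalized to smaller values so they are easier to compare.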
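As a small illustration of estimating P(c) from counts, the sketch below uses a one-sentence stand-in corpus instead of a real text file. The sample text and the helper name p are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny stand-in corpus; a real run would read a large text file instead.
text = "the cat sat on the mat the end"

counts = Counter(re.findall(r"\w+", text.lower()))
total = sum(counts.values())


def p(word):
    """Relative frequency of word in the corpus (0 for unseen words)."""
    return counts[word] / total


print(p("the"), p("cat"), p("dog"))
```

Counter returns 0 for words it has never seen, so unseen words get a frequency of 0 without any special handling.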

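2) P(w | c)

P(w | c) is replaced by a constant that depends only on the edit distance (written lambda * editDist). The code below realizes this as a strict priority: a known word at distance 0 beats any word at distance 1, which in turn beats any word at distance 2.

Only editDist = 1 and editDist = 2 are considered.

editDist1: the strings at edit distance 1 are generated as follows:

(1) splits: a helper step that splits word at every position into a front half and a back half. For example, 'abc' is split into [('', 'abc'), ('a', 'bc'), ('ab', 'c'), ('abc', '')].

(2) deletes: all strings formed by deleting one letter of word in turn. For example, the deletes of 'abc' are ['bc', 'ac', 'ab'].

(3) transposes: all strings formed by swapping two adjacent letters of word in turn. For example, the transposes of 'abc' are ['bac', 'acb'].

(4) replaces: all strings formed by replacing each letter of word in turn with each of the 26 letters. The code does not skip the original letter, so word itself appears among the results. For example, the replaces of 'abc' are ['abc', 'bbc', 'cbc', ..., 'abx', 'aby', 'abz'], 78 strings in total (26 × 3).

(5) inserts: all strings formed by inserting one letter at each position of word, including before the first letter and after the last. For example, the inserts of 'abc' are ['aabc', 'babc', 'cabc', ..., 'abcx', 'abcy', 'abcz'], 104 strings in total (26 × 4).

editDist2 applies the same edits again to every string in the editDist1 set, which yields all strings at edit distance 2, whether or not they are real words. Strings that do not occur in the dictionary have a count of 0 in the code below and are filtered out by known().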
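The counts quoted above for 'abc' can be checked with a short sketch. It builds the four edit lists separately so that each size is visible; the helper name edit_lists is hypothetical and only used here.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edit_lists(word):
    """Return the four distance-1 edit lists for word, keyed by edit type."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        "deletes": [L + R[1:] for L, R in splits if R],
        "transposes": [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1],
        "replaces": [L + c + R[1:] for L, R in splits if R for c in LETTERS],
        "inserts": [L + c + R for L, R in splits for c in LETTERS],
    }


lists = edit_lists("abc")
for name, items in lists.items():
    print(name, len(items))
```

The lists contain a few repeated strings (for example 'abc' shows up three times in replaces), which is why the spell checker collects them into a set.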

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

The corpus file big.txt used above can be downloaded from: http://norvig.com/big.txt
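To try the approach without downloading a corpus, here is a self-contained sketch that repeats the same logic on a one-line toy corpus. CORPUS and COUNTS are stand-ins I introduced for the example; with so little data the results only illustrate how candidates are prioritized, not real-world accuracy.

```python
import re
from collections import Counter

# Toy stand-in for big.txt (assumption for this example only).
CORPUS = "a safari is safer than a swim and a safari is fun"
COUNTS = Counter(re.findall(r"\w+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """The subset of words that occur in the toy corpus."""
    return {w for w in words if w in COUNTS}


def candidates(word):
    """Known words at distance 0, else distance 1, else distance 2, else word itself."""
    two_away = (e2 for e1 in edits1(word) for e2 in edits1(e1))
    return known([word]) or known(edits1(word)) or known(two_away) or [word]


def correction(word):
    """Most frequent candidate according to the toy corpus."""
    return max(candidates(word), key=lambda w: COUNTS[w])


print(correction("saferi"))  # both "safari" and "safer" are one edit away
print(correction("swin"))
```

For "saferi", both "safari" and "safer" are one edit away, and the more frequent of the two in the toy corpus is chosen. A word that is already known is returned unchanged, and a string with no known neighbor within two edits is also returned as typed.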

There is also a ready-made Python library:

pip install pyenchant

For usage details, see:

http://pythonhosted.org/pyenchant/tutorial.html
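A minimal sketch of how pyenchant is typically called is shown below. It assumes that pyenchant, the underlying Enchant library, and an "en_US" dictionary are installed; the wrapper name enchant_suggestions is hypothetical, and the suggestions returned depend on the installed dictionary.

```python
def enchant_suggestions(word, lang="en_US"):
    """Return (is_correct, suggestions) for word using pyenchant.

    Assumes pyenchant and a dictionary for lang are installed.
    """
    import enchant  # imported lazily so this file loads even without pyenchant

    dictionary = enchant.Dict(lang)
    return dictionary.check(word), dictionary.suggest(word)


# Example call (requires pyenchant and an en_US dictionary):
# ok, suggestions = enchant_suggestions("saferi")
```

Dict.check reports whether the word is in the dictionary, and Dict.suggest returns a list of proposed corrections.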

II. Filtering words (stop words)

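import jieba

# Word segmentation
def stripdata(Test):
    # cut_all=True selects jieba's full mode, which returns every dictionary word found in the text.
    # jieba's HMM-based new-word discovery applies only to the default (accurate) mode.
    seg_list = jieba.cut(Test, cut_all=True)  # segment the text

    # Join the tokens, then remove stop words
    line = "/".join(seg_list)
    word = stripword(line)
    #print(line)
    # Print the keywords
    print("\nKeywords:\n" + word)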

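# Stop-word filtering
def stripword(seg):
    # Open the file that the keywords are written to
    keyword = open('key_word.txt', 'w+', encoding='utf-8')
    print("Removing stop words:\n")
    wordlist = []

    # Load the stop-word list (one word per line); read-only access is enough
    stop = open('stopword.txt', 'r', encoding='utf-8')
    stopword = stop.read().split("\n")

    # Walk through the segmented tokens
    for key in seg.split('/'):
        #print(key)
        # Strip once so the checks and the stored value use the same string
        key = key.strip()
        # Drop stop words, single-character tokens, and duplicates
        if key not in stopword and len(key) > 1 and key not in wordlist:
            wordlist.append(key)
            print(key)
            keyword.write(key + "\n")

    # Stop-word removal finished
    stop.close()
    keyword.close()
    return '/'.join(wordlist)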

def creat():
    # Read the raw text; read-only access is enough
    Rawdata = open('raw.txt', 'r', encoding='utf-8')
    text = Rawdata.read()
    # Run segmentation and stop-word removal
    stripdata(text)

    # Done
    Rawdata.close()
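The filtering rule itself (drop stop words, single-character tokens, and duplicates) can be tried without jieba or any files. The sketch below works on an already segmented token list; the function name remove_stop_words and the sample tokens are assumptions made for the example.

```python
def remove_stop_words(tokens, stop_words):
    """Keep tokens that are not stop words, are longer than one character,
    and have not been seen before. Order of first appearance is preserved."""
    seen = set()
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) > 1 and token not in stop_words and token not in seen:
            seen.add(token)
            kept.append(token)
    return kept


# Sample tokens as a segmenter might produce them (made up for illustration).
tokens = ["我們", "喜歡", "自然", "語言", "處理", "的", "自然", "語言"]
stop_words = {"我們", "的"}

print("/".join(remove_stop_words(tokens, stop_words)))
```

Using a set for the stop words and for the tokens already seen keeps each membership check fast, which matters once the stop-word list or the text grows large.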
