Implementing a Trigram Language Model and Input-Method Word Suggestion in Python

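A language model is estimated from a large number of training examples and assigns a probability to any given sentence. The techniques involved include statistical language modeling and smoothing.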

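Let S denote a meaningful sentence made up of the words w₁, w₂, w₃, ..., w_n in a specific order. By the chain rule of conditional probability:

P(S) = P(w₁, w₂, w₃, ..., w_n)

= P(w₁) P(w₂|w₁) P(w₃|w₁,w₂) … P(w_n|w₁,w₂,…,w_{n-1})

P(w₁): the probability that the first word w₁ occurs

P(w₂|w₁): the probability that the second word w₂ occurs, given the first word w₁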

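Under the Markov assumption, the probability of any word depends only on a limited number of words immediately before it: one word, or more generally i words.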

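For example, take the sentence “我愛打籃球” (“I love playing basketball”), segmented as 我 / 愛 / 打 / 籃球. Its probability is:

Bigram model: P(我) * P(愛|我) * P(打|愛) * P(籃球|打)

Trigram model: P(我,愛) * P(打|我,愛) * P(籃球|愛,打)

where P(我,愛) = P(我) * P(愛|我)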
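
As a rough illustration of the bigram formula above, the sketch below estimates the probabilities from counts on a tiny made-up corpus. The corpus, the function names `bigram_prob` and `sentence_prob`, and the add-k smoothing parameter `k` are all assumptions for this example and are not part of the original code.

```python
from collections import Counter

# Toy, pre-segmented corpus (invented for illustration only).
corpus = [
    ["我", "愛", "打", "籃球"],
    ["我", "愛", "打", "排球"],
    ["他", "愛", "打", "籃球"],
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

total_words = sum(unigrams.values())
vocab_size = len(unigrams)


def bigram_prob(prev, word, k=0.0):
    """P(word | prev). k=0 is the plain count ratio; k=1 is add-one smoothing."""
    denominator = unigrams[prev] + k * vocab_size
    if denominator == 0:
        return 0.0
    return (bigrams[(prev, word)] + k) / denominator


def sentence_prob(words, k=0.0):
    """Bigram model: P(w1) times the product of P(w_i | w_{i-1})."""
    prob = unigrams[words[0]] / total_words
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word, k)
    return prob


print(sentence_prob(["我", "愛", "打", "籃球"]))
print(bigram_prob("打", "足球"), bigram_prob("打", "足球", k=1))
```

Without smoothing, a word pair that never occurred in the corpus gets probability 0 and wipes out the whole product; with k=1 it gets a small non-zero value instead.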

• Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w₁,w₂,w₃,w₄,w₅…w_n)

• Related task: probability of an upcoming word:

P(w₅|w₁,w₂,w₃,w₄)

• A model that computes either of these:

P(W) or P(w_n|w₁,w₂…w_{n-1}) is called a language model.

• Better: the grammar. But language model or LM is standard.

Using the language model and the collected counts, we build a set of candidates and use it to suggest the next word.
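
The idea can be sketched in a few lines: keep, for each pair of preceding words, a counter of the words that followed it, and return the most frequent followers. The counts below are invented, and `trigram_counts` and `suggest` are hypothetical names used only for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical trigram counts: (w1, w2) -> Counter of the words that follow.
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in [
    ("我", "愛", "打"), ("我", "愛", "打"), ("我", "愛", "吃"),
    ("愛", "打", "籃球"), ("愛", "打", "籃球"), ("愛", "打", "排球"),
]:
    trigram_counts[(w1, w2)][w3] += 1


def suggest(context, k=5):
    """Return up to k (word, probability) pairs given the last two context words."""
    followers = trigram_counts.get(tuple(context[-2:]), Counter())
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]


print(suggest(["我", "愛", "打"]))
```

An unseen context simply yields an empty list here; the program later in this post falls back to the most frequent single words in that case.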

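Steps in practice:

1. Obtain the corpus data

2. Segment the text into words and count them

3. Write the counts to a text file in a fixed format

4. Read the text file back in

5. Segment the input into words

6. Look up the matching entries in the file

7. Sort the matches

8. Output the top 5 results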

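The main issues are:

1. A large corpus covering many domains is needed.

2. Running time and storage: the design is a trade-off between time and space.

3. Part-of-speech information could also be taken into account.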
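
To illustrate the time/space trade-off in point 2: scanning every line of the counts file for each query costs time proportional to the file size, whereas building an in-memory index once costs memory but makes each lookup a single dictionary access. The sketch below assumes lines in a "w1 w2 w3 count" layout; the sample lines and the name `build_index` are made up for this example.

```python
from collections import defaultdict

# Sample lines in an assumed "w1 w2 w3 count" layout (toy data).
lines = ["我 愛 打 2", "我 愛 吃 1", "愛 打 籃球 3", "愛 打 2"]


def build_index(lines):
    """Map (w1, w2) to a list of (w3, count), sorted by count, highest first."""
    index = defaultdict(list)
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue  # ignore lines that are not trigram entries
        w1, w2, w3, count = parts
        index[(w1, w2)].append((w3, int(count)))
    for followers in index.values():
        followers.sort(key=lambda item: item[1], reverse=True)
    return dict(index)


index = build_index(lines)
top = index.get(("我", "愛"), [])[:5]
print(top)
```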

Code:

Get every file under a folder:

import os

# Recursively collect the paths of all files under baseDir into fileList
def readAllFiles(baseDir, fileList):
    for name in os.listdir(baseDir):
        path = os.path.join(baseDir, name)
        if os.path.isdir(path):
            readAllFiles(path, fileList)
        else:
            fileList.append(path)

Read a file:

# Return the whole file as a string (assumes UTF-8 text)
def readFile(url):
    with open(url, 'r', encoding='utf-8') as fp:
        return fp.read()

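The core routine counts word frequencies and also strips the part-of-speech tags from the corpus files:

# Count word frequencies and n-grams (three words when N=3)
def frequency(sentence, dicts, N=2, replace=True, split_string='  '):
    # Strip the part-of-speech tags (the slash and the Latin letters after it)
    if replace:
        sentence = sentence.replace('\n', '').replace('/', '')
        for alp in alp_table:
            sentence = sentence.replace(alp, '')
    lists = sentence.split(split_string)
    preword_list = []
    total_word = 0
    # lists holds the tokens of the current text
    for word in lists:
        # Punctuation marks the end of a sentence: count the pending words, then reset
        if word in symbol_table:
            temp = dicts
            for preword in preword_list:
                temp = temp[preword]
            if preword_list != []:
                temp[' '] += 1
            preword_list = []
            continue

        # Remove meaningless markup inside a token, e.g. the & and # in 胖子&#
        for ch in forbidden_table:
            if ch in word:
                word = word.replace(ch, '')
        # Skip tokens that are empty after cleaning
        if not word:
            continue

        # Single-word frequency, kept in the global word_dict
        if word not in word_dict:
            word_dict[word] = 1
        else:
            word_dict[word] += 1

        total_word += 1
        # A full window of N words: count it, then slide the window by one word
        if len(preword_list) == N:
            temp = dicts
            for preword in preword_list:
                temp = temp[preword]
            temp[' '] += 1
            preword_list.pop(0)

        # Walk (and create) the nested-dict path for the window plus the new word;
        # the ' ' key of each node stores the count of the path ending there
        temp = dicts
        for preword in preword_list:
            if preword not in temp:
                temp[preword] = {' ': 0}
            temp = temp[preword]
        if word not in temp:
            temp[word] = {' ': 0}
        preword_list.append(word)
    return total_word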
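
The nested-dictionary layout is easier to see on a small input. The sketch below is a simplified stand-in, not the function above: it ignores sentence boundaries and tag stripping and only shows how each n-gram becomes a path of nested dicts whose `' '` key holds the count. The name `count_ngrams` and the sample tokens are assumptions for this example.

```python
def count_ngrams(tokens, n=3):
    """Count every n-gram of tokens in a nested dict; the ' ' key holds the count."""
    trie = {}
    for i in range(len(tokens) - n + 1):
        node = trie
        for word in tokens[i:i + n]:
            # Create the child node on first sight, then descend into it
            node = node.setdefault(word, {" ": 0})
        node[" "] += 1
    return trie


trie = count_ngrams(["我", "愛", "打", "籃球", "我", "愛", "打", "排球"])
print(trie["我"]["愛"]["打"][" "])
print(trie["愛"]["打"])
```

Intermediate nodes such as `trie["我"]` keep a count of 0 in this sketch, because only complete three-word paths are incremented.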

Convert the nested dictionary to a formatted string, and convert a dictionary to a list that is easy to sort:

def dict2str(dicts):
    result = ""
    for dic in dicts:
        # Skip a count key at the top level
        if dic == ' ':
            continue
        temp = dicts[dic]
        for dic2 in temp:
            if dic2 == ' ':
                # One-word entry: "w1 count"
                result += dic + ' ' + str(temp[dic2]) + '\n'
            else:
                temp2 = temp[dic2]
                for dic3 in temp2:
                    if dic3 == ' ':
                        # Two-word entry: "w1 w2 count"
                        result += dic + ' ' + dic2 + ' ' + str(temp2[dic3]) + '\n'
                    else:
                        # Three-word entry: "w1 w2 w3 count"
                        result += dic + ' ' + dic2 + ' ' + dic3 + ' ' + str(temp2[dic3][' ']) + '\n'
    return result

def dict2list(dic: dict):
    ''' Convert a dictionary to a list of (key, value) tuples '''
    return list(dic.items())
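
For a quick view of the resulting text format, here is a compact recursive variant written for this example. It is an assumption-laden sketch rather than a drop-in replacement: unlike `dict2str`, it handles any depth and skips entries whose count is 0. The name `flatten` and the sample dictionary are hypothetical.

```python
def flatten(trie, prefix=()):
    """Yield 'w1 w2 ... count' lines for every path with a non-zero count."""
    for word, child in trie.items():
        if word == " ":
            # child is the count stored at this node
            if child and prefix:
                yield " ".join(prefix) + " " + str(child)
        else:
            yield from flatten(child, prefix + (word,))


sample = {"我": {" ": 0, "愛": {" ": 1, "打": {" ": 2}}}}
lines = list(flatten(sample))
print("\n".join(lines))
```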

Main script:

import os
import string

# Latin letters and the underscore, used to strip part-of-speech tags
alp_table = list(string.ascii_letters) + ['_']
symbol_table = ['，', '。', '！', '？', '；', '、', '/', '（', '）', '《', '》', '——', '\'', '\"']+alp_table
num_table = list('1234567890')

# Characters that must not appear inside a word
forbidden_table = ['&','#','，', '。', '！', '？', '；', '、', '/', '（', '）', '《', '》', '——', '\'', '“', '”', '【', '】']+symbol_table+alp_table+num_table
word_dict = {}
if __name__ == '__main__':
    dicts = {}
    files_list = []
    # Collect every file in the corpus directory
    readAllFiles('./corpus2/corpus', files_list)
    # Count words and n-grams in each file (assumes UTF-8 corpus files)
    for file in files_list:
        with open(file, 'r', encoding='utf-8') as fp:
            frequency(fp.read(), dicts, N=3)

    # Save the n-gram counts; the prediction script loads them from result3.txt
    with open('result3.txt', 'w', encoding='utf-8') as fp:
        fp.write(dict2str(dicts))

    # Save the single-word frequencies, most frequent first
    with open('dict.txt', 'w', encoding='utf-8') as fp:
        for word, count in sorted(word_dict.items(), key=lambda x: x[1], reverse=True):
            fp.write(word + ' ' + str(count) + '\n')

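Once preprocessing is finished, the second step is to query the results.

The core function, the predictor, reads the counts file, collects the matching entries, and sorts them: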

import jieba

# Load the n-gram counts produced by the preprocessing step
with open('result3.txt', 'r', encoding='utf-8') as fp:
    dataset = fp.read().split('\n')

# Segment the input sentence, then predict the next word from the counts
def Predict2(sentence):
    word_list = [w for w in jieba.cut(sentence, HMM=False) if w.strip()]
    if not word_list:
        return []
    result_dict = {}
    if len(word_list) == 1:
        # One word of context: sum the counts of every word that follows it
        for data in dataset:
            words = data.split(' ')
            if len(words) < 3:
                continue
            if word_list[0] == words[0]:
                # The count is the last field of a 3-field or a 4-field line
                count = int(words[2]) if len(words) == 3 else int(words[3])
                result_dict[words[1]] = result_dict.get(words[1], 0) + count
    else:
        # Two or more words: match the last two against the three-word lines
        for data in dataset:
            words = data.split(' ')
            if len(words) != 4:
                continue
            if word_list[-2] == words[0] and word_list[-1] == words[1]:
                result_dict[words[2]] = int(words[3])
    # Sort the candidates by count, highest first
    result_list = list(result_dict.items())
    result_list.sort(key=lambda item: item[1], reverse=True)
    return result_list

Finally, type a sentence; it is segmented with jieba and the final suggestions are printed:

import datetime
import string

import jieba

alp_table = list(string.ascii_letters) + ['_']
symbol_table = ['，', '。', '！', '？', '；', '、', '/', '（', '）', '《', '》', '——', '\'', '\"', ';']+alp_table
num_table = list('1234567890')

# Characters that must not appear inside a suggested word
forbidden_table = ['&','#','，', '。', '！', '？', '；', '、', '/', '（', '）', '《', '》', '——', '\'', '“', '”', '【', '】']+symbol_table+alp_table+num_table

if __name__ == '__main__':
    print('Enter q to quit')
    while True:
        s = input('Enter a sentence: ')
        if s == 'q':
            break
        word_list = list(jieba.cut(s, HMM=False))
        start = datetime.datetime.now()
        # Keep up to five predictions that contain no forbidden characters
        lst = []
        for word in (Predict2(s) or []):
            if any(ch in word[0] for ch in forbidden_table):
                continue
            lst.append(word)
            if len(lst) == 5:
                break

        # Fewer than five: pad with the most frequent words from dict.txt
        if len(lst) < 5:
            seen = {item[0] for item in lst}
            with open('dict.txt', 'r', encoding='utf-8') as fp:
                for line in fp:
                    templst = line.split()
                    if len(templst) != 2:
                        continue
                    word = templst[0]
                    # Skip words already suggested or already typed
                    if word in seen or word in word_list:
                        continue
                    if any(ch in word for ch in forbidden_table):
                        continue
                    lst.append(templst)
                    seen.add(word)
                    if len(lst) == 5:
                        break
        result = 'Suggestions: '
        for item in lst:
            result += item[0] + str(item[1]) + '  '
        print(result + '\n')
        end = datetime.datetime.now()
        print(end - start)

[Figure: sample output of the program]

[Figure: format of the saved files]

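That completes the word-suggestion task. Remember to install jieba: it segments the input for you, so you do not have to type the words already separated.

Readers who need the corpus can leave a comment.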
