Word Segmentation Study (3): n-gram Segmentation Based on an n-gram Language Model

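Maximum-probability segmentation assumes that the probability of each word is independent. For some words, however, the correct split depends heavily on the preceding word. This is especially pronounced in Chinese word segmentation; in English, the "tositdown" example from the previous article is a case in point.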

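A 2-gram (bigram) model handles this. Take the probability of the segmentation "ab cde f" as an example, where <S> marks the start of the sequence.

With a 1-gram model: P(ab cde f) = P(ab) * P(cde) * P(f)

With a 2-gram model: P(ab cde f) = P(ab|<S>) * P(cde|ab) * P(f|cde)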
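To make the difference between the two formulas concrete, here is a minimal sketch that scores the same segmentation both ways. The counts are hypothetical toy values chosen only so the arithmetic is easy to follow, and the helper names `unigram_logprob` and `bigram_logprob` are made up for this illustration. Log probabilities are summed rather than multiplying raw probabilities, which avoids underflow on long sequences.

```python
import math

# Hypothetical toy counts, chosen only for illustration.
UNIGRAM_COUNTS = {"<S>": 100, "ab": 40, "cde": 30, "f": 30}
BIGRAM_COUNTS = {("<S>", "ab"): 50, ("ab", "cde"): 20, ("cde", "f"): 15}
TOTAL = 100  # hypothetical total word count (ab + cde + f)


def unigram_logprob(words):
    """log P(w1) + log P(w2) + ... under the independence assumption."""
    return sum(math.log(UNIGRAM_COUNTS[w] / TOTAL) for w in words)


def bigram_logprob(words):
    """log P(w1|<S>) + log P(w2|w1) + ..., with P(w|prev) = count(prev w) / count(prev)."""
    total = 0.0
    prev = "<S>"
    for w in words:
        total += math.log(BIGRAM_COUNTS[(prev, w)] / UNIGRAM_COUNTS[prev])
        prev = w
    return total


words = ["ab", "cde", "f"]
# 1-gram: 0.4 * 0.3 * 0.3; 2-gram: 0.5 * 0.5 * 0.5
print(math.exp(unigram_logprob(words)), math.exp(bigram_logprob(words)))
```

This sketch has no handling for unseen words or unseen word pairs; it only mirrors the two formulas.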

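The basic method is much the same as maximum-probability segmentation. The difference is that when scoring a candidate segment, we also need the position of the predecessor node's own predecessor. That tells us which word precedes the segment, so the transition probability can be computed. The figure below illustrates this:

To determine the state of the current node 4, take the largest of the candidate cumulative probabilities. This fixes both the predecessor node and the cumulative probability of the current node.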
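The per-node step can be sketched on a toy dictionary as follows. This is a compact illustration under stated assumptions, not the full program: the counts in `UNI` and `BI`, the limit `MAX_WORD_LEN`, and the function names are all hypothetical. Each node stores its best predecessor and best cumulative log probability, and the previous word is recovered from the predecessor's own predecessor.

```python
import math

# Hypothetical toy counts for illustration only.
UNI = {"<S>": 10.0, "to": 4.0, "sit": 2.0, "down": 2.0, "tos": 1.0, "it": 3.0}
BI = {"<S> to": 4.0, "to sit": 2.0, "sit down": 2.0}
TOTAL = sum(v for k, v in UNI.items() if k != "<S>")
MAX_WORD_LEN = 4


def word_logprob(word):
    if word in UNI:
        return math.log(UNI[word] / TOTAL)
    # Unseen word: the penalty grows with the word length.
    return math.log(10.0 / (TOTAL * 10 ** len(word)))


def trans_logprob(prev, word):
    key = prev + " " + word
    if key in BI and prev in UNI:
        return math.log(BI[key] / UNI[prev])
    return word_logprob(word)  # unseen pair: back off to the 1-gram estimate


def segment(text):
    # states[i] = (best predecessor node, best cumulative log probability) of node i
    states = [(-1, 0.0)]
    for node in range(1, len(text) + 1):
        candidates = []
        for length in range(1, min(node, MAX_WORD_LEN) + 1):
            pre = node - length
            word = text[pre:node]
            # The previous word ends at pre and starts at pre's own predecessor.
            prev_word = "<S>" if pre == 0 else text[states[pre][0]:pre]
            score = states[pre][1] + trans_logprob(prev_word, word)
            candidates.append((pre, score))
        states.append(max(candidates, key=lambda c: c[1]))
    # Walk back from the last node to recover the best path.
    path = [len(text)]
    while states[path[-1]][0] != -1:
        path.append(states[path[-1]][0])
    path.reverse()
    return [text[a:b] for a, b in zip(path, path[1:])]


print(segment("tositdown"))
```

With these toy counts, "tos" is the best segment ending at node 3 on its own, but the transition "to sit" wins at node 5, which is the effect the 2-gram model is meant to capture.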

Here is the code (Python):


#!/usr/bin/env python
#coding=utf-8
#############################################################
#function: max probability segmentation (2-gram)
# a dynamic programming method
#
#input: dict file
#output: segmented words, divided by delimiter " "
#author: [email protected]
##############################################################
import math

#global parameter
DELIMITER = " " #delimiter placed between words after segmentation

class DNASegment:
    def __init__(self):
        self.word1_dict = {} #log probabilities, 1-gram
        self.word1_dict_count = {} #raw counts, 1-gram
        self.word1_dict_count["<S>"] = 8310575403 #count of the start marker <S>
        self.word2_dict = {} #log probabilities, 2-gram
        self.word2_dict_count = {} #raw counts, 2-gram
        self.gmax_word_length = 0
        self.all_freq = 0 #total count of all words, 1-gram

    #estimate the probability of an unseen word,
    #following the method in "Beautiful Data"
    def get_unkonw_word_prob(self, word):
        return math.log(10./(self.all_freq*10**len(word)))

    #get the log probability of a segment
    def get_word_prob(self, word):
        if word in self.word1_dict: #the dictionary contains this word
            prob = self.word1_dict[word]
        else:
            prob = self.get_unkonw_word_prob(word)
        return prob

    #get the transition log probability between two words
    def get_word_trans_prob(self, first_word, second_word):
        trans_word = first_word + " " + second_word
        #also check first_word, so the division below cannot raise KeyError
        if trans_word in self.word2_dict_count and first_word in self.word1_dict_count:
            trans_prob = math.log(
                float(self.word2_dict_count[trans_word])/self.word1_dict_count[first_word])
        else:
            #unseen pair: back off to the 1-gram probability
            trans_prob = self.get_word_prob(second_word)
        return trans_prob
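
    #find the best predecessor node of node
    #by enumerating every candidate segment that ends at node
    def get_best_pre_node(self, sequence, node, node_state_list):
        #if node is smaller than the maximum word length, cap the segment length at node
        max_seg_length = min([node, self.gmax_word_length])
        pre_node_list = [] #candidate predecessor nodes
        #collect all candidate segments and their cumulative log probabilities
        for segment_length in range(1,max_seg_length+1):
            segment_start_node = node-segment_length
            segment = sequence[segment_start_node:node] #the candidate segment
            pre_node = segment_start_node #taking this segment makes its start the predecessor
            if pre_node == 0:
                #the segment starts at the beginning of the sequence,
                #so use the transition probability from <S> to this word
                segment_prob = \
                    self.get_word_trans_prob("<S>", segment)
            else: #otherwise score it with the 2-gram probability
                #the word that ends at pre_node on its best path
                pre_pre_node = node_state_list[pre_node]["pre_node"]
                pre_pre_word = sequence[pre_pre_node:pre_node]
                segment_prob = \
                    self.get_word_trans_prob(pre_pre_word, segment)
            #cumulative log probability at the predecessor node
            pre_node_prob_sum = node_state_list[pre_node]["prob_sum"]
            #one candidate cumulative value for the current node
            candidate_prob_sum = pre_node_prob_sum + segment_prob
            pre_node_list.append((pre_node, candidate_prob_sum))
        #pick the candidate with the largest cumulative value
        (best_pre_node, best_prob_sum) = \
            max(pre_node_list,key=lambda d:d[1])
        return (best_pre_node, best_prob_sum)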

    #maximum-probability segmentation
    def mp_seg(self, sequence):
        sequence = sequence.strip()
        #initialization
        node_state_list = [] #best predecessor of each node; the index is the position
        #the initial node, i.e. node 0
        ini_state = {}
        ini_state["pre_node"] = -1 #previous node
        ini_state["prob_sum"] = 0 #cumulative log probability so far
        node_state_list.append( ini_state )
        #the probability of the string is the 2-gram probability
        #P(a b c) = P(a|<S>)P(b|a)P(c|b)
        #step 1, find the best predecessor node by node
        for node in range(1,len(sequence) + 1):
            #find the best predecessor and record the largest cumulative value
            (best_pre_node, best_prob_sum) = \
                self.get_best_pre_node(sequence, node, node_state_list)
            #append to the state list
            cur_node = {}
            cur_node["pre_node"] = best_pre_node
            cur_node["prob_sum"] = best_prob_sum
            node_state_list.append(cur_node)
        #step 2, recover the best path, from back to front
        best_path = []
        node = len(sequence) #the last node
        best_path.append(node)
        while True:
            pre_node = node_state_list[node]["pre_node"]
            if pre_node == -1:
                break
            node = pre_node
            best_path.append(node)
        best_path.reverse()
        #step 3, build the segmentation
        word_list = []
        for i in range(len(best_path)-1):
            left = best_path[i]
            right = best_path[i + 1]
            word = sequence[left:right]
            word_list.append(word)
        seg_sequence = DELIMITER.join(word_list)
        return seg_sequence

    #load the dictionaries; each line is a word, a tab, then its count
    def initial_dict(self, gram1_file, gram2_file):
        #read the 1-gram file
        with open(gram1_file, "r") as dict_file:
            for line in dict_file:
                fields = line.strip().split('\t')
                self.word1_dict_count[fields[0]] = float(fields[1])
        self.gmax_word_length = 20
        #total 1-gram count; this fixed value is used
        #instead of the sum of the loaded counts
        self.all_freq = 1024908267229.0
        #compute the 1-gram log probabilities
        for key in self.word1_dict_count:
            self.word1_dict[key] = math.log(self.word1_dict_count[key]/self.all_freq)
        #read the 2-gram file and compute the transition log probabilities
        with open(gram2_file, "r") as dict_file:
            for line in dict_file:
                fields = line.strip().split('\t')
                key = fields[0]
                value = float(fields[1])
                words = key.split(" ")
                first_word = words[0]
                second_word = words[1]
                self.word2_dict_count[key] = value
                if first_word in self.word1_dict_count:
                    self.word2_dict[key] = \
                        math.log(value/self.word1_dict_count[first_word]) #natural log
                else:
                    #first word unseen: back off to the 1-gram estimate
                    self.word2_dict[key] = self.get_word_prob(second_word)

#test
if __name__=='__main__':
    myseg = DNASegment()
    myseg.initial_dict("count_1w.txt","count_2w.txt")
    sequence = "itisatest"
    seg_sequence = myseg.mp_seg(sequence)
    print("original sequence: " + sequence)
    print("segment result: " + seg_sequence)
    sequence = "tositdown"
    seg_sequence = myseg.mp_seg(sequence)
    print("original sequence: " + sequence)
    print("segment result: " + seg_sequence)

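The output shows that "itisatest" is still segmented as "it is a test", while "tositdown", which was segmented incorrectly before, is now correctly split into "to sit down".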
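The unseen-word estimate in `get_unkonw_word_prob` plays a part in results like these, so it is worth looking at on its own. The sketch below isolates the formula log(10 / (N * 10^len(word))); the standalone function name `unknown_word_logprob` is made up for this illustration, and N is the total count hard-coded in the code above. Each extra character divides the probability by 10, so long unknown strings are penalized heavily and the segmenter prefers splitting them into known words.

```python
import math


def unknown_word_logprob(word, total_count):
    """Log probability assigned to a word missing from the 1-gram dictionary."""
    return math.log(10.0 / (total_count * 10 ** len(word)))


N = 1024908267229.0  # the total count hard-coded in the code above
for candidate in ["x", "xy", "xyz"]:
    # Each additional character lowers the log probability by log(10).
    print(candidate, unknown_word_logprob(candidate, N))
```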

The code and dictionaries are available in the attachment: http://pan.baidu.com/s/1bnw197L

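This kind of segmentation still has clear limitations. Letting a word depend on the previous one or several words removes some ambiguity, but an n-gram model is still a Markov model, whose basic premise is the memoryless property: the state of later nodes does not affect earlier ones. In other words, once the earlier segmentation is fixed, it does not change no matter which words follow, and that clearly does not hold in practice. Some algorithms, such as CRF, do take the following words into account; see the relevant literature for details. These algorithms are generally hard to write in a few dozen lines of Python, so in practice a dedicated package is used, for example CRF++: http://crfpp.googlecode.com/svn/trunk/doc/index.html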
