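jieba Chinese Word Segmentation Source Code Analysis (Part 3)

1. Prefix dictionary

In this version (0.37) the author stores the lexicon (the contents of dict.txt) in a prefix dictionary and drops the trie used by earlier versions. The Python trie was built from dict objects nested inside other dict objects; the nesting ran deep and consumed a great deal of memory (see the link for details). Below is the content of @gumblex's commit: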

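For the get_DAG() function, the Trie data structure uses too much memory, especially in Python. Experiments show that building a prefix set solves the problem.
The set stores each word together with its prefixes, e.g. set(['數', '數據', '數據結', '數據結構']). Words are found by scanning the sentence forward character by character: as long as the current fragment is in the prefix set the scan continues, and it stops when the fragment is no longer in the set or the end of the sentence is reached. This adds roughly 40% more entries than the original lexicon.
This version passes all tests and produces the same segmentation results as the original. Test setup: a 5.7 MB novel, the default dictionary, 64-bit Ubuntu, Python 2.7.6.
Trie: first load 2.8 s, cached load 1.1 s; memory 277.4 MB, average throughput 724 kB/s
Prefix dictionary: first load 2.1 s, cached load 0.4 s; memory 99.0 MB, average throughput 781 kB/s
This approach solves the poor space efficiency of a Trie in pure Python.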
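
To make the trade-off concrete, here is a minimal sketch (Python 3, not code from jieba) that builds both structures for the same two words. The helper names build_trie, build_prefix_set and count_dicts are hypothetical, and the nested-dict layout is only an assumption about how such a trie is usually written in pure Python.

```python
# Sketch only: a nested-dict trie versus a flat prefix set.
# All function names here are hypothetical, not part of jieba.

def build_trie(words):
    """Nested-dict trie: every node is its own dict object."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True  # end-of-word marker
    return root


def build_prefix_set(words):
    """Flat set holding every word together with all of its prefixes."""
    prefixes = set()
    for word in words:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return prefixes


def count_dicts(node):
    """Count the dict objects that make up the trie (the marker is skipped)."""
    return 1 + sum(count_dicts(child)
                   for key, child in node.items() if key != '')


words = ['數據', '數據結構']
trie = build_trie(words)
prefix_set = build_prefix_set(words)

print(count_dicts(trie))   # number of separate dict objects in the trie
print(sorted(prefix_set))  # entries of the single flat set
```

The trie needs one dict object per node, while the prefix set keeps everything in one hash table at the cost of storing extra prefix entries such as '數據結'.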

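jieba 0.37 actually implements this idea as a prefix dictionary (the Tokenizer.FREQ dict in the code). It is a plain Python dict whose keys are the words that appear in dict.txt and whose values are their frequencies; every prefix of every word is added as a key as well, with frequency 0 if the prefix is not itself a word. For example, for the word "北京大學" the result is {u'北': 17860, u'北京': 34488, u'北京大': 0, u'北京大學': 2053}. The code is as follows: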

    def gen_pfdict(self, f_name):
        lfreq = {}  # maps each word (and each prefix) to its frequency
        ltotal = 0  # sum of the frequencies of all dictionary entries
        with open(f_name, 'rb') as f:  # open the dictionary file, dict.txt
            for lineno, line in enumerate(f, 1):  # (line number, line)
                try:
                    line = line.strip().decode('utf-8')  # decode to unicode
                    word, freq = line.split(' ')[:2]  # the word and its frequency
                    freq = int(freq)
                    lfreq[word] = freq
                    ltotal += freq
                    for ch in xrange(len(word)):  # walk every prefix of word
                        wfrag = word[:ch + 1]
                        if wfrag not in lfreq:  # a prefix not seen yet gets frequency 0
                            lfreq[wfrag] = 0
                except ValueError:
                    raise ValueError(
                        'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
        return lfreq, ltotal
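
The same logic can be tried without a dictionary file. The following is a minimal Python 3 sketch, assuming the input is a list of already-decoded lines; build_prefix_dict and sample_lines are hypothetical names. The frequencies are the ones quoted in the example above, and the third field on each line is only an illustrative placeholder.

```python
# Sketch only: a Python 3 variant of gen_pfdict that reads from a list
# of text lines instead of dict.txt. Names and sample data are hypothetical.

def build_prefix_dict(lines):
    lfreq = {}   # word or prefix -> frequency
    ltotal = 0   # sum of the frequencies of all real entries
    for lineno, line in enumerate(lines, 1):
        try:
            word, freq = line.strip().split(' ')[:2]
            freq = int(freq)
        except ValueError:
            raise ValueError(
                'invalid dictionary entry at line %s: %s' % (lineno, line))
        lfreq[word] = freq  # a real entry overwrites an earlier 0 placeholder
        ltotal += freq
        for end in range(1, len(word) + 1):
            wfrag = word[:end]
            if wfrag not in lfreq:
                lfreq[wfrag] = 0  # a prefix that is not (yet) a word
    return lfreq, ltotal


sample_lines = [
    '北京大學 2053 x',
    '北京 34488 x',
    '北 17860 x',
]
freq, total = build_prefix_dict(sample_lines)
print(freq)
print(total)
```

Note that the order of the lines does not matter: '北京' is first inserted as a prefix with frequency 0 and later overwritten with its real frequency, while '北京大' stays at 0 because it never appears as an entry of its own.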

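2. DAG

A DAG (directed acyclic graph) is built for each sentence from the prefix dictionary generated above. It is stored as a dict of the form {key: [i, j, ...], ...}, where key is a start position in the sentence and the list holds every end position i for which sentence[key:i+1] is a word in the prefix dictionary. In other words, the list contains the end positions of all candidate words that start at position key, so the dictionary lookups yield, for each start position, its list of end positions.
For example, the DAG for the sentence "去北京大學玩" is:
{0 : [0], 1 : [1, 2, 4], 2 : [2], 3 : [3, 4], 4 : [4], 5 : [5]}
The entry 0: [0] means that the span 0~0, the single character "去", is an entry in dict.txt. The entry 1: [1, 2, 4] means that the words starting at position 1 end at positions 1, 2 and 4, that is, the spans 1~1, 1~2 and 1~4 ("北", "北京" and "北京大學") are all in the lexicon loaded from dict.txt.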
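
The construction can be reproduced on a toy prefix dictionary. This is a minimal Python 3 sketch; get_dag and toy_freq are hypothetical names, and the frequencies for '大' and '大學' are made-up values chosen only so that the example yields the DAG shown above (the other values are quoted from this article).

```python
# Sketch only: build the DAG of a sentence from a small prefix dictionary.

def get_dag(sentence, freq):
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        # Extend the fragment while it is still a known word or prefix.
        while i < n and frag in freq:
            if freq[frag]:          # frequency > 0 means a real word
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not ends:                # unknown character: a one-character span
            ends.append(k)
        dag[k] = ends
    return dag


toy_freq = {
    '去': 123402,
    '北': 17860, '北京': 34488, '北京大': 0, '北京大學': 2053,
    '大': 1000, '大學': 500,        # made-up frequencies
}
dag = get_dag('去北京大學玩', toy_freq)
print(dag)
```

Characters that are missing from the dictionary, such as '京' in this toy data, still get an entry pointing at themselves, so every position in the sentence has at least one outgoing edge.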

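3. Maximum word-frequency segmentation

From the two sections above we now have the prefix dictionary of the lexicon (dict.txt) and the DAG of the sentence to be segmented. Frequency-based segmentation has to find, among all paths through the DAG, the one with the highest probability score. How is that done?
jieba uses dynamic programming: it walks the sentence from the end to the start and, at each position, keeps the segmentation with the highest frequency score.
The implementation is shown below with detailed comments.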

    # Dynamic programming: compute the most probable segmentation
    def calc(self, sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
        # Work with log probabilities so that products become sums,
        # which avoids floating-point underflow
        logtotal = log(self.total)
        # Walk the sentence from the end to the start
        for idx in xrange(N - 1, -1, -1):
            # The generator expression yields one
            # (log probability, end position of the word) tuple per x in DAG[idx];
            # max() keeps the best one, which is stored as route[idx].
            # route[x + 1][0] is the best log probability of the
            # remaining span [x + 1, N - 1].
            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])

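As the code shows, calc is a bottom-up dynamic program (overlapping subproblems, optimal substructure). It iterates over the character positions idx of the sentence in reverse order, starting from the last character (N-1), and computes the log-probability score of the sub-sentence sentence[idx~N-1] from the DAG and the previously computed entries of route. Using log probabilities here is a nice touch by the author, since it effectively prevents underflow. The best-scoring option is then stored in route as a tuple (log probability, position of the last character of the word).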
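
The recurrence can be traced on toy data. This is a minimal Python 3 sketch, not jieba's code: calc_route is a hypothetical name, the DAG is the one shown in section 2, and the frequencies for '大' and '大學' as well as the total (taken here as the sum of the toy frequencies rather than the real dict.txt total) are assumptions made for illustration.

```python
# Sketch only: the backward dynamic program over a DAG with toy frequencies.
from math import log


def calc_route(sentence, dag, freq, total):
    n = len(sentence)
    route = {n: (0, 0)}             # empty suffix: log probability 0
    logtotal = log(total)
    for idx in range(n - 1, -1, -1):
        # For every candidate word sentence[idx:x + 1], add its log
        # probability to the best score of the rest of the sentence.
        # Unknown words and 0-frequency prefixes count as frequency 1.
        route[idx] = max(
            (log(freq.get(sentence[idx:x + 1]) or 1) - logtotal
             + route[x + 1][0], x)
            for x in dag[idx])
    return route


sentence = '去北京大學玩'
dag = {0: [0], 1: [1, 2, 4], 2: [2], 3: [3, 4], 4: [4], 5: [5]}
toy_freq = {
    '去': 123402,
    '北': 17860, '北京': 34488, '北京大': 0, '北京大學': 2053,
    '大': 1000, '大學': 500,        # made-up frequencies
}
toy_total = sum(toy_freq.values())  # assumption: stands in for self.total

route = calc_route(sentence, dag, toy_freq, toy_total)
for idx in range(len(sentence)):
    print(idx, route[idx])
```

With these toy numbers the best word starting at position 1 ends at position 4, so "北京大學" is kept whole instead of being split into "北京" and "大學". The scores themselves differ from those obtained with the real dictionary.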
Based on the conclusions above, I wrote the following test (Python 2):

#coding:utf8
'''
 Test of the jieba __init__ file
'''
import os
import logging
import marshal
import re
from math import log

_get_abs_path = lambda path: os.path.normpath(os.path.join(os.getcwd(), path))

DEFAULT_DICT = _get_abs_path("../jieba/dict.txt")
re_eng = re.compile('[a-zA-Z0-9]', re.U)

#print DEFAULT_DICT

class Tokenizer(object):
    def __init__(self, dictionary=DEFAULT_DICT):
        self.dictionary = _get_abs_path(dictionary)
        self.FREQ = {}
        self.total = 0
        self.initialized = False
        self.cache_file = None

    def gen_pfdict(self, f_name):
        lfreq = {}  # maps each word (and each prefix) to its frequency
        ltotal = 0  # sum of the frequencies of all dictionary entries
        with open(f_name, 'rb') as f:  # open the dictionary file, dict.txt
            for lineno, line in enumerate(f, 1):  # (line number, line)
                try:
                    line = line.strip().decode('utf-8')  # decode to unicode
                    word, freq = line.split(' ')[:2]  # the word and its frequency
                    freq = int(freq)
                    lfreq[word] = freq
                    ltotal += freq
                    for ch in xrange(len(word)):  # walk every prefix of word
                        wfrag = word[:ch + 1]
                        if wfrag not in lfreq:  # a prefix not seen yet gets frequency 0
                            lfreq[wfrag] = 0
                except ValueError:
                    raise ValueError(
                        'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
        return lfreq, ltotal

    # Look up the frequency of a word in the prefix dictionary
    def gen_word_freq(self, word):
        if word in self.FREQ:
            return self.FREQ[word]
        else:
            return 0

    def check_initialized(self):
        if not self.initialized:
            abs_path = _get_abs_path(self.dictionary)
            if self.cache_file:
                cache_file = self.cache_file
            # otherwise use the default cache file
            elif abs_path:
                cache_file = "jieba.cache"

            load_from_cache_fail = True
            # the cache file exists
            if os.path.isfile(cache_file):

                try:
                    with open(cache_file, 'rb') as cf:
                        self.FREQ, self.total = marshal.load(cf)
                    load_from_cache_fail = False
                except Exception:
                    load_from_cache_fail = True
            if load_from_cache_fail:
                self.FREQ, self.total = self.gen_pfdict(abs_path)
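                # write the prefix dictionary and the total frequency to the cache file
                try:
                    # marshal data is binary, so the file is opened in 'wb' mode
                    with open(cache_file, 'wb') as temp_cache_file:
                        marshal.dump((self.FREQ, self.total), temp_cache_file)
                except Exception:
                    # caching is best-effort: ignore write failures
                    pass
            # mark initialization as done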
            self.initialized = True

    def get_DAG(self, sentence):
        self.check_initialized()
        DAG = {}
        N = len(sentence)
        for k in xrange(N):
            tmplist = []
            i = k
            frag = sentence[k]
            while i < N and frag in self.FREQ:
                if self.FREQ[frag]:
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:
                tmplist.append(k)
            DAG[k] = tmplist
        return DAG

    # Dynamic programming: compute the most probable segmentation
    def calc(self, sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
        # Work with log probabilities so that products become sums,
        # which avoids floating-point underflow
        logtotal = log(self.total)
        # Walk the sentence from the end to the start
        for idx in xrange(N - 1, -1, -1):
            # The generator expression yields one
            # (log probability, end position of the word) tuple per x in DAG[idx];
            # max() keeps the best one, which is stored as route[idx].
            # route[x + 1][0] is the best log probability of the
            # remaining span [x + 1, N - 1].
            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])

    # The DAG is stored as a dict of the form {key: list, ...},
    # where key is the start position of a word


    def cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        N = len(sentence)
        buf = ''
        while x < N:
            y = route[x][1] + 1 
            l_word = sentence[x:y]  # the most probable word starting at position x
            if re_eng.match(l_word) and len(l_word) == 1:  # a single ASCII letter or digit
                buf += l_word
                x = y
            else:
                if buf:
                    yield buf
                    buf = ''
                yield l_word
                x = y
        if buf:
            yield buf
            buf = ''


if __name__ == '__main__':
    s = u'去北京大學玩'
    t = Tokenizer()
    dag = t.get_DAG(s)

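    # Print the prefix-dictionary frequency of each prefix of s
    # (the label below reads: prefix dictionary of "<s>")
    print(u'\"%s\"的前綴字典:' % s)
    for pos in xrange(len(s)):
        print s[:pos+1], t.gen_word_freq(s[:pos+1]) 

    # (the label below reads: DAG of "<s>")
    print(u'\"%s\"的DAG:' % s)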
    for d in dag:
        print d, ':', dag[d]
    route = {}
    t.calc(s, dag, route)
    print 'route:'
    print route

    print('/'.join(t.cut_DAG_NO_HMM(u'去北京大學玩')))

The output is:

“去北京大學玩”的前綴字典:
去 123402
去北 0
去北京 0
去北京大 0
去北京大學 0
去北京大學玩 0
“去北京大學玩”的DAG:
0 : [0]
1 : [1, 2, 4]
2 : [2]
3 : [3, 4]
4 : [4]
5 : [5]
route:
{0: (-26.039894284878688, 0), 1: (-19.851543754900984, 4), 2: (-26.6931716802707, 2), 3: (-17.573864399983357, 4), 4: (-17.709674112779485, 4), 5: (-9.567048044164698, 5), 6: (0, 0)}
去/北京大學/玩
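
The final segmentation can be read directly off the printed route. The following minimal Python 3 sketch, with the hypothetical helper name walk_route, follows the end positions stored in the route values copied from the output above; it leaves out the buffering of single ASCII letters and digits that cut_DAG_NO_HMM performs.

```python
# Sketch only: recover the segmentation from a finished route table.
# The route values are copied from the output printed above.

def walk_route(sentence, route):
    words = []
    x = 0
    while x < len(sentence):
        y = route[x][1] + 1      # one past the end of the best word at x
        words.append(sentence[x:y])
        x = y                    # jump to the start of the next word
    return words


sentence = '去北京大學玩'
route = {
    0: (-26.039894284878688, 0),
    1: (-19.851543754900984, 4),
    2: (-26.6931716802707, 2),
    3: (-17.573864399983357, 4),
    4: (-17.709674112779485, 4),
    5: (-9.567048044164698, 5),
    6: (0, 0),
}
print('/'.join(walk_route(sentence, route)))
```

Only the entries at positions 0, 1 and 5 are visited; the entries for positions 2 to 4 were needed while the scores were being computed but are skipped when the path is read out.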

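The test code is available here.
That concludes the walkthrough of the DAG-based Chinese word segmentation algorithm. The next part covers how out-of-vocabulary words are segmented.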
