Jieba Word Segmentation 1.8.2 Source Code Walkthrough (Part 1)

Original post

2020-06-09 12:22

Overview: Jieba is an open-source Chinese word segmentation tool written in Python. Its package root is laid out as follows:
.
|--analyse
|--finalseg
|--posseg
|--__init__.py
|--__main__.py
|--_compat.py
|--dict.txt
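Here analyse is the folder that analyzes segmentation results; it provides the TF-IDF and TextRank algorithms. finalseg supplies the initial matrices that the Viterbi algorithm needs. posseg contains the part-of-speech tagging code.
Jieba's core code lives in __init__.py in the package root.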

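The official documentation describes the algorithm as follows:
1. Efficient word-graph scanning based on a Trie structure, producing a directed acyclic graph (DAG) of every possible word that the Chinese characters in a sentence can form.
2. Dynamic programming to find the maximum-probability path, that is, the best segmentation according to word frequency.
3. For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm.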

__init__.py:
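Global variables:

DICTIONARY = "dict.txt"  # default dictionary file name
DICT_LOCK = threading.RLock()  # re-entrant thread lock
FREQ = {}  # to be initialized; the "trie" (in practice a flat dict)
total = 0  # running total of all word frequencies
user_word_tag_tab = {}
initialized = False  # whether initialization has already run
pool = None
tmp_dir = None

_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))  # absolute path of the current directory

log_console = logging.StreamHandler(sys.stderr)  # logging setup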

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(log_console)

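Key functions:
1. Function: gen_pfdict(f_name). This is the "efficient word-graph scanning based on a Trie structure" from item 1 of the algorithm description. The code shows, however, that the so-called Trie is just a flat dictionary with no nesting.
Lines 54-57 of the source show that each word is extended one character at a time from its first character, and every such prefix is checked for membership in the Trie (the dictionary); a prefix that is missing is added with frequency 0.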
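The prefix registration described above can be sketched on a tiny in-memory dictionary. This is a minimal illustration written for Python 3, not Jieba's own code: build_prefix_dict and the three sample lines are made up for the example, and only the "word frequency tag" line format is taken from dict.txt.

```python
def build_prefix_dict(lines):
    """Build a flat prefix dictionary from 'word freq [tag]' lines."""
    freq = {}
    total = 0
    for line in lines:
        word, count = line.split(' ')[:2]
        count = int(count)
        freq[word] = count
        total += count
        # Register every proper prefix with frequency 0 unless it is
        # already present, so a scanner can tell "keep extending"
        # (key exists) apart from "dead end" (key missing).
        for end in range(1, len(word)):
            freq.setdefault(word[:end], 0)
    return freq, total


# Hypothetical three-entry dictionary in the dict.txt line format.
toy_lines = ["英語 50 n", "英語單詞 5 n", "單詞 40 n"]
prefix_dict, total = build_prefix_dict(toy_lines)
```

After this runs, "英語單" is present with frequency 0 because it is only a prefix of "英語單詞", while "語" is absent altogether. A flat dict answers the same "is this a prefix?" question a nested trie would, at the cost of storing each prefix as a full string.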

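2. Function: initialize(dictionary=None). This initializes the program; its main job is to load the dictionary. It uses a caching technique (via the tempfile library), which I have not looked into yet (mark).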
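As a rough idea of what "load the dictionary, but cache the parsed result" can look like, here is a generic load-or-rebuild sketch in Python 3. It is an assumption about the pattern, not a copy of initialize(): load_with_cache, count_lines, the cache file name, and the choice of marshal for serialization are all picked for the illustration.

```python
import marshal
import os
import tempfile


def load_with_cache(dict_path, cache_path, build):
    """Return (data, from_cache), rebuilding only when the cache is stale."""
    # Trust the cache only if it is newer than the dictionary file.
    if (os.path.isfile(cache_path)
            and os.path.getmtime(cache_path) > os.path.getmtime(dict_path)):
        with open(cache_path, 'rb') as cache_file:
            return marshal.load(cache_file), True
    data = build(dict_path)
    # Write to a temporary file and rename it into place, so an
    # interrupted run never leaves a half-written cache behind.
    cache_dir = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, 'wb') as tmp_file:
        marshal.dump(data, tmp_file)
    os.replace(tmp_path, cache_path)
    return data, False


def count_lines(path):
    # Hypothetical stand-in for the expensive dictionary parse.
    with open(path, encoding='utf-8') as f:
        return len(f.read().splitlines())


with tempfile.TemporaryDirectory() as work_dir:
    dict_path = os.path.join(work_dir, 'dict.txt')
    cache_path = os.path.join(work_dir, 'dict.cache')
    with open(dict_path, 'w', encoding='utf-8') as f:
        f.write('英語 50 n\n單詞 40 n\n')
    # Backdate the dictionary so the cache is unambiguously newer.
    os.utime(dict_path, (1000000000, 1000000000))
    first = load_with_cache(dict_path, cache_path, count_lines)
    second = load_with_cache(dict_path, cache_path, count_lines)
```

The first call parses the file and writes the cache; the second call finds a cache newer than the dictionary and skips the parse.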

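3. Function: get_DAG(sentence). It turns the input sentence into a directed acyclic graph. Test input: sentence="英語單詞的詞形變化主要是增加前後綴".
The result is: {"0": [0, 1, 3], "1": [1], "2": [2, 3], "3": [3], "4": [4], "5": [5, 6, 8], "6": [6, 7], "7": [7, 8], "8": [8], "9": [9, 10], "10": [10, 11], "11": [11], "12": [12, 13], "13": [13], "14": [14, 15], "15": [15, 16], "16": [16]}
This dictionary is the DAG. Each key is the position of a character, and each value is the list of end positions of the words in FREQ that start at that character. The first character of the sentence is '英', at position (key) 0;
its value [0, 1, 3] means that '英', '英語', and '英語單詞' can all be found in FREQ.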
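To see how the first few entries of that result come about, the scan can be replayed on a hand-made prefix dictionary. This is a Python 3 sketch that takes the dictionary as a parameter instead of reading the global FREQ; get_dag and toy_freq, including its frequencies, are invented for the illustration.

```python
def get_dag(sentence, freq):
    """Map each start index to the end indices of known words."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        frag = sentence[start]
        # Keep extending while the fragment is still a known prefix.
        while end < n and frag in freq:
            if freq[frag]:      # non-zero frequency: a real word ends here
                ends.append(end)
            end += 1
            frag = sentence[start:end + 1]
        if not ends:            # unknown character: it stands alone
            ends.append(start)
        dag[start] = ends
    return dag


# Made-up frequencies; "英語單" is a pure prefix (0) and "語" is absent.
toy_freq = {"英": 10, "英語": 50, "英語單": 0, "英語單詞": 5,
            "單": 8, "單詞": 40, "詞": 20}
dag = get_dag("英語單詞", toy_freq)
```

Here dag comes out as {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}. Position 0 skips end index 2 because "英語單" has frequency 0, and position 1 falls back to [1] because "語" is not in the dictionary at all.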

4. Function: __cut_all(sentence). This is Jieba's full-mode segmentation: it outputs all of the results recorded in the DAG. It uses yield to return the results one at a time.
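A generator that walks a DAG and yields its words might look like the following Python 3 sketch. It is written from the description above rather than copied from the source: the function name cut_all, the explicit dag argument, and the rule that a single character is emitted only when no earlier word already covers it are assumptions of this illustration.

```python
def cut_all(sentence, dag):
    """Yield the words recorded in the DAG, position by position."""
    last_end = -1
    for start in sorted(dag):
        ends = dag[start]
        if len(ends) == 1 and start > last_end:
            # Only one candidate and not covered by an earlier word.
            yield sentence[start:ends[0] + 1]
            last_end = ends[0]
        else:
            for end in ends:
                if end > start:     # multi-character words only
                    yield sentence[start:end + 1]
                    last_end = end


sentence = "英語單詞"
dag = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}
words = list(cut_all(sentence, dag))
```

With this DAG the generator produces "英語", "英語單詞", and "單詞". Because it is a generator, nothing is computed until the caller iterates, which is the point of using yield here.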

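5. Function: __cut_DAG_NO_HMM(sentence). It performs precise-mode segmentation on sentence without the HMM. Precise mode builds on full mode: it computes the probability of each path and picks the most probable one.
The function calc(sentence, DAG, route) does this computation. The expression xrange(N - 1, -1, -1) makes the loop start from the end of the sentence:
route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) -
                  logtotal + route[x + 1][0], x) for x in DAG[idx])
max returns a tuple of (score, end position). The score is log(freq/total) for the candidate word plus the best score already computed for the position right after it, route[x + 1][0]. This is the dynamic programming search for the maximum-probability path. Note that the dynamic programming runs
from the end of the sentence towards the beginning.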
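A small worked run makes the backward pass concrete. The Python 3 sketch below keeps the max expression from calc but passes the dictionary and total in as arguments and returns the route table; best_path, the toy frequencies, and the total of 133 (their sum) are invented for the illustration.

```python
from math import log


def calc(sentence, dag, freq, total):
    """Fill route[idx] = (best log-probability from idx, end of first word)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = log(total)
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (log(freq.get(sentence[idx:end + 1]) or 1) - log_total
             + route[end + 1][0], end)
            for end in dag[idx])
    return route


def best_path(sentence, route):
    """Follow the stored end positions from left to right."""
    words, pos = [], 0
    while pos < len(sentence):
        end = route[pos][1] + 1
        words.append(sentence[pos:end])
        pos = end
    return words


toy_freq = {"英": 10, "英語": 50, "英語單": 0, "英語單詞": 5,
            "單": 8, "單詞": 40, "詞": 20}
dag = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}
route = calc("英語單詞", dag, toy_freq, 133)
words = best_path("英語單詞", route)
```

At position 0 the candidates are "英" followed by the best path from 1, "英語" followed by the best path from 2, and "英語單詞" alone. With these frequencies "英語" + "單詞" scores highest, so words is ["英語", "單詞"]. The `or 1` in the expression gives unknown fragments such as "語" a frequency of 1 instead of taking log(0).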

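6. Function: __cut_DAG(sentence). It performs precise-mode segmentation with the HMM model. The main difference from __cut_DAG_NO_HMM(sentence) is that it can recognize out-of-vocabulary words. The code is
if not FREQ.get(buf):
    recognized = finalseg.cut(buf)
    for t in recognized:
        yield t
As shown, the call goes to finalseg.cut().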
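The snippet above only shows the moment the buffer is handed to finalseg.cut(). A hedged Python 3 sketch of the surrounding idea follows: consecutive single characters on the best path are collected in buf and passed to a recognizer when a longer word interrupts them. cut_with_fallback, the recognize parameter (a stand-in for finalseg.cut), the hand-written route table, and the handling of one-character or already-known buffers are assumptions of this illustration, not quotes from the source.

```python
def cut_with_fallback(sentence, route, freq, recognize):
    """Segment along route, sending runs of single characters to recognize."""
    def flush(buf):
        if len(buf) == 1:
            yield buf
        elif not freq.get(buf):
            # Unknown run of characters: let the recognizer split it.
            for token in recognize(buf):
                yield token
        else:
            for ch in buf:
                yield ch

    buf = ''
    pos = 0
    while pos < len(sentence):
        end = route[pos][1] + 1
        word = sentence[pos:end]
        if len(word) == 1:
            buf += word             # postpone the decision
        else:
            if buf:
                yield from flush(buf)
                buf = ''
            yield word
        pos = end
    if buf:
        yield from flush(buf)


def fake_recognizer(buf):
    # Hypothetical stand-in for finalseg.cut: split after two characters.
    return [buf[:2], buf[2:]]


sentence = "小明學英語"
# Hand-written route: three single characters, then the word "英語".
route = {0: (0.0, 0), 1: (0.0, 1), 2: (0.0, 2),
         3: (0.0, 4), 4: (0.0, 4), 5: (0.0, 0)}
tokens = list(cut_with_fallback(sentence, route, {"英語": 50}, fake_recognizer))
```

The three single characters "小", "明", "學" are buffered as "小明學", which is not in the dictionary, so the recognizer decides how to split it. A real HMM would make that decision from its probability tables; the fixed split here only shows where the hook sits.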

7. Function: cut(sentence, cut_all=False, HMM=True). The source code is commented here. Depending on its arguments, the function calls one of functions 4, 5, or 6 above.

Test code:

#coding=utf-8
#author:zhangyang
#2015-5-27
# Test program for __init__.py in the Jieba package root


from __future__ import absolute_import, unicode_literals
import os
import logging
from math import log
import json

logger = logging.getLogger(__name__)  # used by gen_pfdict to report malformed lines

dirname = os.path.dirname(__file__)
print dirname
cwd=os.getcwd()
print cwd
ww=os.path.join(os.getcwd(), os.path.dirname(__file__))
print ww
FREQ={}
total=0

def gen_pfdict(f_name):
    lfreq = {}
    ltotal = 0
    with open(f_name, 'rb') as f:
        lineno = 0
        for line in f.read().rstrip().decode('utf-8').splitlines():
            lineno += 1
            try:
                word, freq = line.split(' ')[:2]
                freq = int(freq)
                lfreq[word] = freq
                ltotal += freq
                for ch in xrange(len(word)):
                    wfrag = word[:ch + 1]
                    if wfrag not in lfreq:
                        lfreq[wfrag] = 0
            except ValueError as e:
                logger.debug('%s at line %s %s' % (f_name, lineno, line))
                raise e
    return lfreq, ltotal

def get_DAG(sentence):
    global FREQ
    DAG = {}
    N = len(sentence)
    for k in xrange(N):
        tmplist = []
        i = k
        frag = sentence[k]
        while i < N and frag in FREQ:
            if FREQ[frag]:
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not tmplist:
            tmplist.append(k)
        DAG[k] = tmplist
    return DAG

def calc(sentence, DAG, route):
    N = len(sentence)
    route[N] = (0, 0)
    logtotal = log(total)
    for idx in xrange(N - 1, -1, -1):
        route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) -
                          logtotal + route[x + 1][0], x) for x in DAG[idx])
        print json.dumps(route)  # show the route table as it fills in
        #for k,v in route.items():
            #print k,str(v)

def main():
    global FREQ, total
    dictfile = './dict.txt'
    FREQ, total = gen_pfdict(dictfile)
    print "total frequency is " + str(total)
    print "dict length is " + str(len(FREQ))
    #i=0
    #g = lambda m: '\n'.join([ '%s=%d'%(k, v) for k, v in m.items() ])
    sent = "英語單詞很難記憶"
    dag = get_DAG(sent)
    print json.dumps(dag)
    route = {}
    calc(sent, dag, route)
    # Walk the route table from the start to read off the best segmentation.
    N = len(sent)
    x = 0
    while x < N:
        y = route[x][1] + 1
        print y
        lword = sent[x:y]
        print lword
        x = y

if __name__ == '__main__':
    main()
