Source Code Analysis of Jieba (结巴分词) Version 1.8.2, Part 1

Original post

2020-06-09 12:22

Overview: Jieba is an open-source Chinese word segmentation tool written in Python. Its root directory is laid out as follows:
.
|--analyse
|--finalseg
|--posseg
|--__init__.py
|--__main__.py
|--_compat.py
|--dict.txt
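Here, analyse is the folder for analysing segmentation results; it provides the TF-IDF and TextRank algorithms. finalseg provides the initial matrices required by the Viterbi algorithm. posseg contains the part-of-speech tagging code.
The core code of Jieba lives in __init__.py in the root directory.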

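How the official documentation describes the algorithm:
1. Efficient word-graph scanning based on a Trie structure, producing a directed acyclic graph (DAG) of every possible word that the Chinese characters in a sentence can form.
2. Dynamic programming to find the maximum-probability path, which gives the best segmentation based on word frequency.
3. For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm.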

__init__.py:
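Global variables:

DICTIONARY = "dict.txt" # default dictionary name
DICT_LOCK = threading.RLock() # thread lock
FREQ = {} # to be initialized; the "Trie" (in practice a flat prefix dictionary)
total = 0 # counter: running sum of all word frequencies
user_word_tag_tab = {}
initialized = False # whether initialization has already run
pool = None
tmp_dir = None

_curpath = os.path.normpath(os.path.join(os.getcwd(), os.path.dirname(__file__))) # absolute path of the current module directory

log_console = logging.StreamHandler(sys.stderr) # logging setup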

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(log_console)

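Key functions:
1. Function: gen_pfdict(f_name). This corresponds to item 1 of the algorithm description, the efficient word-graph scanning based on a Trie structure. The code shows, however, that the so-called Trie is just a dictionary, with no nesting.
Lines 54-57 of the source show that, for each word, the prefixes are taken one character at a time starting from the first character, and each prefix is checked against the Trie (the dictionary); a prefix that is not there yet is added with frequency 0.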
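
To make the prefix idea concrete, here is a minimal sketch that builds the same kind of flat dictionary from an in-memory word list instead of dict.txt. The function name build_prefix_dict and the toy frequencies are assumptions made for illustration, and the sketch is written for Python 3 rather than the Python 2 of the original source.

```python
def build_prefix_dict(entries):
    """Build a flat {word_or_prefix: freq} dict; pure prefixes get freq 0."""
    freq = {}
    total = 0
    for word, count in entries:
        freq[word] = count
        total += count
        # Register every prefix word[:1], word[:2], ... that is not known yet.
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            if prefix not in freq:
                freq[prefix] = 0
    return freq, total


# Hypothetical toy entries, not taken from dict.txt.
toy_entries = [("英语", 50), ("英语单词", 10), ("单词", 30)]
freq, total = build_prefix_dict(toy_entries)
print(freq)
print(total)
```

Prefixes such as '英' and '英语单' end up in the dictionary with frequency 0, so a scanner can tell "keep extending, a longer word may exist" apart from "no word starts this way" with a single lookup.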

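2. Function: initialize(dictionary=None). This initializes the program; its main job is to load the dictionary. It uses caching (through the tempfile library), which I have not looked into yet (marked as a to-do).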
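
As a rough illustration of what dictionary caching with tempfile can look like, the sketch below writes a parsed dictionary to a cache file and reuses it while the cache is newer than the dictionary. This is a generic pattern under stated assumptions, not a copy of Jieba's initialize(): the names load_with_cache and toy_build, the file names, and the use of marshal for serialization are all choices made for this example (Python 3).

```python
import marshal
import os
import tempfile
import time


def load_with_cache(dict_path, cache_path, build):
    """Return build(dict_path), reusing cache_path while it is still fresh."""
    # The cache counts as fresh only if it is newer than the dictionary file.
    if (os.path.isfile(cache_path)
            and os.path.getmtime(cache_path) > os.path.getmtime(dict_path)):
        with open(cache_path, "rb") as cf:
            return marshal.load(cf)
    data = build(dict_path)
    # Write to a temporary file, then rename, so that a reader never
    # sees a half-written cache.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path))
    with os.fdopen(fd, "wb") as tf:
        marshal.dump(data, tf)
    os.replace(tmp_path, cache_path)
    return data


calls = []  # records how often the slow parsing step really runs


def toy_build(path):
    """Hypothetical stand-in for the slow dictionary parser."""
    calls.append(path)
    freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.split()[:2]
            freq[word] = int(count)
    return freq, sum(freq.values())


work_dir = tempfile.mkdtemp()
dict_path = os.path.join(work_dir, "toy_dict.txt")
with open(dict_path, "w", encoding="utf-8") as f:
    f.write("英语 50\n单词 30\n")
# Backdate the dictionary so the cache written next is clearly newer.
past = time.time() - 3600
os.utime(dict_path, (past, past))

cache_path = os.path.join(work_dir, "toy_dict.cache")
first = load_with_cache(dict_path, cache_path, toy_build)
second = load_with_cache(dict_path, cache_path, toy_build)
print(first == second, len(calls))
```

The second call is served from the cache file, so the parsing step runs only once.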

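3. Function: get_DAG(sentence). It turns the input sentence into a directed acyclic graph. Test with sentence="英语单词的词形变化主要是增加前后缀".
The result is: {"0": [0, 1, 3], "1": [1], "2": [2, 3], "3": [3], "4": [4], "5": [5, 6, 8], "6": [6, 7], "7": [7, 8], "8": [8], "9": [9, 10], "10": [10, 11], "11": [11], "12": [12, 13], "13": [13], "14": [14, 15], "15": [15, 16], "16": [16]}
This dictionary is the DAG: each key is the position of a character in the sentence, and each value is the list of end positions of the words in FREQ that start at that character. The first character of the sentence is '英', at position 0 (the key);
its value [0, 1, 3] means that '英', '英语' and '英语单词' can all be found in FREQ.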
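
The scan can be reproduced on a toy scale. The sketch below follows the same loop as get_DAG but takes the dictionary as a parameter; the name build_dag and the frequencies in toy_freq are made up for illustration (Python 3).

```python
def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words found there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        frag = sentence[start]
        # Keep extending while the fragment is a known word or prefix.
        while end < n and frag in freq:
            if freq[frag]:  # frequency 0 marks a pure prefix, not a word
                ends.append(end)
            end += 1
            frag = sentence[start:end + 1]
        # A character with no dictionary match still forms a one-character word.
        dag[start] = ends or [start]
    return dag


# Hypothetical prefix dictionary; '英语单' is a pure prefix with frequency 0.
toy_freq = {"英": 5, "英语": 50, "英语单": 0, "英语单词": 10,
            "语": 3, "单": 4, "单词": 30, "词": 8}
dag = build_dag("英语单词", toy_freq)
print(dag)  # {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}
```

The output matches the first four entries of the result quoted above: position 0 can end a word at 0, 1 or 3, while index 2 is skipped because '英语单' is only a prefix.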

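4. Function: __cut_all(sentence). This is Jieba's full-mode segmentation: it outputs all the results recorded in the DAG. It uses yield to return the results lazily, one at a time.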
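
A simplified sketch of a full-mode generator over a DAG is given below. It is one plausible way to walk the graph, not a verbatim copy of __cut_all: the name cut_all_sketch is an assumption, and the DAG is the toy one for "英语单词" (Python 3).

```python
def cut_all_sketch(sentence, dag):
    """Yield every multi-character word in the DAG, plus uncovered single characters."""
    last_end = -1
    for start in sorted(dag):
        ends = dag[start]
        if len(ends) == 1 and start > last_end:
            # Only one option here, and no earlier word already covers it.
            yield sentence[start:ends[0] + 1]
            last_end = ends[0]
        else:
            for end in ends:
                if end > start:  # skip the bare single character
                    yield sentence[start:end + 1]
                    last_end = end


toy_dag = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}
words = list(cut_all_sketch("英语单词", toy_dag))
print("/".join(words))  # 英语/英语单词/单词
```

Because the function is a generator, callers can start consuming words before the whole sentence has been processed.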

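5. Function: __cut_DAG_NO_HMM(sentence). It performs precise-mode segmentation of sentence without the HMM. Precise mode builds on full mode: it computes the probability of each path through the DAG and picks the most probable one.
The function calc(sentence, DAG, route) does the probability computation. The expression xrange(N - 1, -1, -1) means that the computation starts from the end of the sentence:
route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) -
                  logtotal + route[x + 1][0], x) for x in DAG[idx])
max returns a tuple (log probability, end position). Each candidate's score is log(freq/total) for the word sentence[idx:x + 1], plus route[x + 1][0], the log probability of the best path that starts right after that word. This is the dynamic-programming search for the maximum-probability path. Note that the dynamic programming runs from back to front.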
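
A small worked example may help here. The sketch below runs the same recurrence on a toy dictionary and then reads the segmentation off the route table; best_route, segment and all the frequencies are assumptions for illustration, not values from dict.txt (Python 3).

```python
from math import log


def best_route(sentence, dag, freq, total):
    """route[i] = (best log probability from i to the end, end index of the first word)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = log(total)
    # Work backwards so that route[end + 1] is known when position idx needs it.
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (log(freq.get(sentence[idx:end + 1]) or 1) - log_total
             + route[end + 1][0], end)
            for end in dag[idx]
        )
    return route


def segment(sentence, route):
    """Follow the stored end indices from the start of the sentence."""
    words, pos = [], 0
    while pos < len(sentence):
        end = route[pos][1] + 1
        words.append(sentence[pos:end])
        pos = end
    return words


# Hypothetical frequencies; they sum to 110.
toy_freq = {"英": 5, "英语": 50, "英语单": 0, "英语单词": 10,
            "语": 3, "单": 4, "单词": 30, "词": 8}
toy_total = sum(toy_freq.values())
toy_dag = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}

route = best_route("英语单词", toy_dag, toy_freq, toy_total)
result = segment("英语单词", route)
print(result)  # ['英语', '单词']
```

With these numbers, '英语' followed by '单词' scores log(50/110) + log(30/110), roughly -2.09, which beats the single word '英语单词' at log(10/110), roughly -2.40, so the two-word split wins.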

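6. Function: __cut_DAG(sentence). It performs precise-mode segmentation with the HMM model. The main difference from __cut_DAG_NO_HMM(sentence) is that it can recognize out-of-vocabulary words. The code is
if not FREQ.get(buf):
    recognized = finalseg.cut(buf)
    for t in recognized:
        yield t
As the snippet shows, the function it calls is finalseg.cut().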
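
The sketch below shows how such a buffer can fit into the route walk: consecutive single characters are collected, and a run that is not in the dictionary is handed to a recognizer. It is a simplified illustration under stated assumptions: cut_with_fallback is a made-up name, fake_hmm is a hypothetical stand-in for finalseg.cut (it does no real HMM decoding), and the route table is filled in by hand (Python 3).

```python
def cut_with_fallback(sentence, route, freq, recognize):
    """Walk the best route; pass runs of single characters to `recognize`."""

    def flush(buf):
        if len(buf) == 1:
            yield buf
        elif not freq.get(buf):
            # Unknown run of characters: let the recognizer split it.
            yield from recognize(buf)
        else:
            yield from buf  # known entry: emit character by character

    buf = ""
    pos = 0
    while pos < len(sentence):
        end = route[pos][1] + 1
        word = sentence[pos:end]
        if end - pos == 1:
            buf += word  # collect consecutive single characters
        else:
            if buf:
                yield from flush(buf)
                buf = ""
            yield word
        pos = end
    if buf:
        yield from flush(buf)


def fake_hmm(buf):
    """Hypothetical recognizer with a hard-coded answer for the demo input."""
    return {"小明学": ["小明", "学"]}.get(buf, list(buf))


# Hand-made route: three single characters, then the dictionary word '英语'.
toy_route = {0: (0.0, 0), 1: (0.0, 1), 2: (0.0, 2), 3: (0.0, 4)}
tokens = list(cut_with_fallback("小明学英语", toy_route, {"英语": 50}, fake_hmm))
print(tokens)  # ['小明', '学', '英语']
```

The dictionary word '英语' passes straight through, while the run '小明学' reaches the recognizer because it is longer than one character and has no dictionary entry.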

7. Function: cut(sentence, cut_all=False, HMM=True). The source code is commented here; depending on its arguments, the function calls one of functions 4, 5 and 6 above.

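Test code:

#coding=utf-8
#author:zhangyang
#2015-5-27
# Test script for __init__.py in the Jieba root directory


from __future__ import absolute_import, unicode_literals
import os
import logging
from math import log
import json

# gen_pfdict() reports malformed dictionary lines through this logger
logger = logging.getLogger(__name__)

# Show how the module directory is resolved (mirrors _curpath in __init__.py)
dirname = os.path.dirname(__file__)
print(dirname)
cwd=os.getcwd()
print(cwd)
ww=os.path.join(os.getcwd(), os.path.dirname(__file__))
print(ww)
FREQ={}
total=0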

def gen_pfdict(f_name):
    # Build the prefix dictionary: every word maps to its frequency, and
    # every prefix that is not itself a word maps to 0.
    lfreq = {}
    ltotal = 0
    with open(f_name, 'rb') as f:
        lineno = 0
        for line in f.read().rstrip().decode('utf-8').splitlines():
            lineno += 1
            try:
                word, freq = line.split(' ')[:2]
                freq = int(freq)
                lfreq[word] = freq
                ltotal += freq
                # register the prefixes word[:1], word[:2], ... of this word
                for ch in range(len(word)):
                    wfrag = word[:ch + 1]
                    if wfrag not in lfreq:
                        lfreq[wfrag] = 0
            except ValueError as e:
                logger.debug('%s at line %s %s' % (f_name, lineno, line))
                raise e
    return lfreq, ltotal

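def get_DAG(sentence):
    # For each start index k, collect the end indices i such that
    # sentence[k:i + 1] is a word with non-zero frequency in FREQ.
    global FREQ
    DAG = {}
    N = len(sentence)
    for k in range(N):
        tmplist = []
        i = k
        frag = sentence[k]
        # keep extending while the fragment is a known word or prefix
        while i < N and frag in FREQ:
            if FREQ[frag]:
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        # a character with no match is kept as a one-character word
        if not tmplist:
            tmplist.append(k)
        DAG[k] = tmplist
    return DAG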

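def calc(sentence, DAG, route):
    # Fill route[idx] = (best log probability from idx to the end, end index
    # of the first word on that path), working from the end of the sentence.
    N = len(sentence)
    route[N] = (0, 0)
    logtotal = log(total)
    for idx in range(N - 1, -1, -1):
        route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) -
                          logtotal + route[x + 1][0], x) for x in DAG[idx])
    # print the completed route table once
    print(json.dumps(route))
    #for k,v in route.items():
        #print(k, str(v))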

def main():
    global FREQ,total
    dictfile='./dict.txt'
    FREQ,total=gen_pfdict(dictfile)
    print("total frequency is "+str(total))
    print("dict length is "+str(len(FREQ)))
    #i=0
    #g = lambda m: '\n'.join([ '%s=%d'%(k, v) for k, v in m.items() ])
    sent="英语单词很难记忆"
    dag=get_DAG(sent)
    print(json.dumps(dag))
    route={}
    calc(sent,dag,route)
    # walk the route from the start, printing each word's end index and the word
    N=len(sent)
    x=0
    while x < N:
        y = route[x][1] + 1
        print(y)
        lword = sent[x:y]
        print(lword)
        x=y

if __name__=='__main__':
    main()
