[HanLP] Forward, Backward, and Bidirectional Longest-Matching Word Segmentation

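Preface

As we know, words in English text are naturally delimited by spaces. In Chinese, only characters, sentences, and paragraphs can be separated by explicit delimiters; words alone have no formal delimiter. English also has the problem of delimiting phrases, but at the word level Chinese is far more complex and difficult than English.

In Chinese information processing, automatic word segmentation receives a great deal of attention. Approaches fall roughly into two groups:

Dictionary- and rule-based
Machine-learning-based

This post mainly covers the first.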

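1. Environment setup

Windows 10
Install pyhanlp: pip install pyhanlp (installation may fail at this step; leave a comment if it does)
The mini core dictionary bundled with HanLP is used as the example
Jupyter Notebook (Python 3)
Java (JDK 1.8)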

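2. Dictionary-based segmentation

Dictionary-based segmentation is the simplest and most common segmentation algorithm; it needs only a dictionary and a set of rules for looking words up in it.

Loading the dictionary

Java implementation:

// Load the dictionary
TreeMap<String, CoreDictionary.Attribute> dictionary = IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");

IOUtil.loadDictionary returns a TreeMap.

Its keys are the words themselves, and its values are CoreDictionary.Attribute objects.

CoreDictionary.Attribute is a structure holding part-of-speech and word-frequency information; these are irrelevant to dictionary-based segmentation and can be ignored for now.

Python implementation: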

from pyhanlp import *

def load_dictionary():
    IOUtil = JClass('com.hankcs.hanlp.corpus.io.IOUtil')  # use JClass to obtain HanLP's IOUtil utility class
    path = HanLP.Config.CoreDictionaryPath.replace('.txt', '.mini.txt')  # dictionary path from HanLP's Config, switched to the mini dictionary
    dic = IOUtil.loadDictionary([path])
    return set(dic.keySet())
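
If pyhanlp or Java is not available, the algorithms in the following sections can still be tried against a small hand-built word set. The sketch below is an assumption for illustration only: toy_dictionary is a hypothetical helper, and its entries are a handful of words taken from the examples in this post, not the contents of HanLP's mini dictionary.

```python
def toy_dictionary():
    """Hypothetical stand-in for load_dictionary(): a tiny hand-built word set."""
    words = [
        '商', '商品', '品', '和', '和服', '服', '服务', '务',
        '研究', '研究生', '生命', '命', '起源',
        '项', '项目', '目的', '的',
        '就读', '北京', '大学', '北京大学',
    ]
    return set(words)

toy_dic = toy_dictionary()
print('商品' in toy_dic)  # True: a listed word
print('品和' in toy_dic)  # False: not a word in this toy set
```

Because the functions below only use the `in` operator on the dictionary, any Python set of strings can be passed in place of the real one.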

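3. Segmentation algorithms

Now that we have a dictionary, all that is left is the lookup rules. The common rules are forward longest matching, backward longest matching, and bidirectional longest matching, all of which build on the full segmentation procedure.

(1) Full segmentation

Python implementation: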

def fully_segment(text, dic):
    word_list = []
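    for i in range(len(text)):                  # i runs over every index of text, from 0 to the last character
        for j in range(i + 1, len(text) + 1):   # j runs over the interval [i + 1, len(text)]
            word = text[i:j]                    # take the substring for the half-open interval [i, j)
            if word in dic:                     # if it is in the dictionary, treat it as a word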
                word_list.append(word)
    return word_list

dic = load_dictionary()

print(fully_segment('商品和服务', dic))

['商', '商品', '品', '和', '和服', '服', '服务', '务']

The program outputs every possible word that is contained in the dictionary.

Java implementation:

/**
* Full-segmentation Chinese word segmentation algorithm
*
* @param text       the text to segment
* @param dictionary the dictionary
* @return list of words
*/
public static List<String> segmentFully(String text, Map<String, CoreDictionary.Attribute> dictionary){
	List<String> wordList = new LinkedList<String>();  // holds the result
	for (int i = 0; i < text.length(); ++i){  // every character as a start position
		for (int j = i + 1; j <= text.length(); ++j){  // every possible end position
			String word = text.substring(i, j);  // extract the substring
			if (dictionary.containsKey(word)){  // if the dictionary contains it
				wordList.add(word);  // add it to the result
			}
		}
	}
	return wordList;  // return the final result
}

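As the result above shows, full segmentation simply produces a list of every word that appears in the dictionary. Clearly, this is not the segmentation we want.

For example: 商品和服务 ("goods and services")
What we want is ['商品', '和', '服务'], not ['商', '商品', '品', '和', '和服', '服', '服务', '务'].

To fix this, the rule needs refining. Since longer words tend to carry richer meaning, we give longer words higher priority.

Concretely, while looking up words of increasing length from a given starting index, the longer word is preferred for output. This rule is called the longest matching algorithm, and depending on the scan direction it comes in three variants:

Forward longest matching: scan from front to back
Backward longest matching: scan from back to front
Bidirectional longest matching: combine the two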
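
The core of the rule is a single step: from one starting index, keep the longest substring found in the dictionary. A minimal sketch of that step follows; longest_word_from is a hypothetical helper name and toy_dic is a hand-built set used only for illustration.

```python
def longest_word_from(text, start, dic):
    """Return the longest dictionary word beginning at text[start].

    Falls back to the single character when nothing longer matches.
    """
    longest = text[start]
    for end in range(start + 1, len(text) + 1):
        candidate = text[start:end]
        if candidate in dic and len(candidate) > len(longest):
            longest = candidate
    return longest

# Hand-built toy dictionary (an assumption, not HanLP's data)
toy_dic = {'商', '商品', '品', '和', '和服', '服', '服务', '务'}
print(longest_word_from('商品和服务', 0, toy_dic))  # 商品
print(longest_word_from('商品和服务', 2, toy_dic))  # 和服
```

The second call already hints at the weakness of a fixed scan direction: starting from '和', the longest match is '和服', which leaves '务' stranded.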

(2) Forward longest matching

Python implementation:

def forward_segment(text, dic):
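    word_list = []  # segmentation result
    i = 0
    while i < len(text):
        longest_word = text[i]  # single character at the current scan position
        for j in range(i + 1, len(text) + 1):  # every possible end position
            word = text[i : j]  # extract the substring
            if word in dic:   # check whether it is in the dictionary
                if len(word) > len(longest_word):  # if so, and it is longer than the current longest, keep it
                    longest_word = word
        word_list.append(longest_word)  # add to the result
        i += len(longest_word)  # jump to the character after the matched word and keep scanning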
    return word_list

print(forward_segment("就读北京大学", dic))
print(forward_segment("研究生命起源", dic))

['就读', '北京大学']

['研究生', '命', '起源']

Java implementation:

/**
 * Forward longest matching Chinese word segmentation algorithm
 * @param text  the text to segment
 * @param dictionary  the dictionary
 * @return  list of words
 */
public static List<String> segmentForwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary){
	List<String> wordList = new LinkedList<String>();  // result
	for(int i = 0; i < text.length(); ) {
		String longestWord = text.substring(i, i+1);  // longest word starting at the current character
		for(int j = i + 1; j <= text.length(); ++j) {
			String word = text.substring(i, j);
			if(dictionary.containsKey(word)) {
				if(word.length() > longestWord.length()) {
					longestWord = word;
				}
			}
		}
		wordList.add(longestWord);  // add to the result once the scan ends
		i += longestWord.length();  // jump to the character after the matched word
	}
	return wordList;
}

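Some sentences produce surprising results: under forward longest matching, "研究生" (graduate student) takes priority over "研究" (research), so "研究生命起源" is split incorrectly.

(3) Backward longest matching

Python implementation: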

def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
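    while i >= 0:  # the scan position serves as the end of the word
        longest_word = text[i]  # single character at the scan position
        for j in range(0, i):  # try every start index j in [0, i - 1]
            word = text[j : i+1]  # extract the substring from j through i
            if word in dic:  # if it is in the dictionary
                if len(word) > len(longest_word):  # and it is longer than the current longest word
                    longest_word = word  # replace it
        word_list.insert(0, longest_word)  # insert at the front, since the scan runs backward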
        i -= len(longest_word)
    return word_list

print(backward_segment("就读北京大学", dic))
print(backward_segment("研究生命起源", dic))
print(backward_segment("项目的研究", dic))

['就读', '北京大学']

['研究', '生命', '起源']

['项', '目的', '研究']

Java implementation:

/**
 * Backward longest matching Chinese word segmentation algorithm
 *
 * @param text       the text to segment
 * @param dictionary the dictionary
 * @return list of words
 */
public static List<String> segmentBackwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary){
	List<String> wordList = new LinkedList<String>();
	
	for(int i = text.length() - 1; i >= 0; ) {
		String longestWord = text.substring(i, i + 1);
		for(int j = 0; j <= i; j++) {
			String word = text.substring(j, i + 1);
			if(dictionary.containsKey(word)) {
				if(word.length() > longestWord.length()) {
					longestWord = word;
				}
			}
		}
		wordList.add(0, longestWord);
		i -= longestWord.length();
	}
	return wordList;
}

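Although "研究生命起源" now gets the correct result, "项目的研究" comes out wrong. Does that mean the problem cannot be solved? We keep changing the rules to handle one problem, only to introduce another. Since both methods have strengths and weaknesses, let's combine them.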
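
The two scans are mirror images of each other, which can be made explicit: a backward scan gives the same result as a forward scan over the reversed text with every dictionary word reversed. The sketch below illustrates that symmetry and is not how the code above is written; forward_longest and backward_via_reversal are hypothetical names, and toy_dic is a hand-built set.

```python
def forward_longest(text, dic):
    """Forward longest matching, same logic as forward_segment above."""
    words, i = [], 0
    while i < len(text):
        longest = text[i]
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in dic and j - i > len(longest):
                longest = text[i:j]
        words.append(longest)
        i += len(longest)
    return words

def backward_via_reversal(text, dic):
    """Backward longest matching expressed as a forward scan on reversed input."""
    reversed_dic = {word[::-1] for word in dic}
    reversed_words = forward_longest(text[::-1], reversed_dic)
    # Undo both reversals: word order and the characters inside each word
    return [word[::-1] for word in reversed(reversed_words)]

# Hand-built toy dictionary (an assumption, not HanLP's data)
toy_dic = {'研究', '研究生', '生命', '命', '起源', '项', '项目', '目的', '的'}
for sentence in ['研究生命起源', '项目的研究']:
    print(forward_longest(sentence, toy_dic), backward_via_reversal(sentence, toy_dic))
```

With this toy dictionary each direction gets one sentence right and the other wrong, matching the outputs shown earlier.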

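(4) Bidirectional longest matching

Run forward and backward longest matching on the same text. If the two results differ in word count, return the one with fewer words.
Otherwise, return the one with fewer single-character words.
If the single-character counts are also equal, prefer the backward longest matching result.

The motivation comes from a linguistic observation: single-character words in Chinese are far fewer than multi-character words.

Python implementation: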

def count_single_char(word_list: list): # count the single-character words
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):  # fewer words wins
        return f
    elif len(f) > len(b):
        return b
    else:
        if count_single_char(f) < count_single_char(b):  # fewer single-character words wins
            return f
        else:
            return b  # on a full tie, return the backward result

print(bidirectional_segment("研究声明的源泉", dic))

['研究', '声明', '的', '源泉']
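
The three selection rules can also be read as one ordered comparison on the pair (word count, single-character count). The following is a minimal sketch of that reading, not the implementation used in this post; pick_segmentation is a hypothetical helper. It relies on Python's built-in min returning the first of several equal minima, so listing the backward result first makes it win a full tie.

```python
def count_single_char(word_list):
    """Count the single-character words in a segmentation."""
    return sum(1 for word in word_list if len(word) == 1)

def pick_segmentation(forward, backward):
    """Choose between two segmentations by (word count, single-character count)."""
    # backward is listed first so that it is returned when both keys are equal
    return min([backward, forward],
               key=lambda words: (len(words), count_single_char(words)))

# Forward has fewer single-character words, so it is chosen
print(pick_segmentation(['当下', '雨天', '地面', '积水'],
                        ['当', '下雨天', '地面', '积水']))
# Full tie on both counts, so the backward result is chosen
print(pick_segmentation(['项目', '的', '研究'],
                        ['项', '目的', '研究']))
```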

Java implementation:

/**
* Count the single-character words in a segmentation result
*
* @param wordList segmentation result
* @return number of single-character words
*/
public static int countSingleChar(List<String> wordList)
{
   int size = 0;
   for (String word : wordList)
   {
       if (word.length() == 1)
           ++size;
   }
   return size;
}

/**
* Bidirectional longest matching Chinese word segmentation algorithm
*
* @param text       the text to segment
* @param dictionary the dictionary
* @return list of words
*/
public static List<String> segmentBidirectional(String text, Map<String, CoreDictionary.Attribute> dictionary)
{
   List<String> forwardLongest = segmentForwardLongest(text, dictionary);
   List<String> backwardLongest = segmentBackwardLongest(text, dictionary);
   if (forwardLongest.size() < backwardLongest.size())
       return forwardLongest;
   else if (forwardLongest.size() > backwardLongest.size())
       return backwardLongest;
   else
   {
       if (countSingleChar(forwardLongest) < countSingleChar(backwardLongest))
           return forwardLongest;
       else
           return backwardLongest;
   }
}

(5) Segmentation results of the three methods

Python implementation:

texts = ['项目的研究','商品和服务','研究生命起源','当下雨天地面积水','结婚的和尚未结婚','欢迎新老师生前来就餐']
for text in texts:
    print("Forward longest matching:")
    print(forward_segment(text, dic))
    print("Backward longest matching:")
    print(backward_segment(text, dic))
    print("Bidirectional longest matching:")
    print(bidirectional_segment(text, dic))
    print("-----------------"*3)

Forward longest matching:
['项目', '的', '研究']
Backward longest matching:
['项', '目的', '研究']
Bidirectional longest matching:
['项', '目的', '研究']
---------------------------------------------------
Forward longest matching:
['商品', '和服', '务']
Backward longest matching:
['商品', '和', '服务']
Bidirectional longest matching:
['商品', '和', '服务']
---------------------------------------------------
Forward longest matching:
['研究生', '命', '起源']
Backward longest matching:
['研究', '生命', '起源']
Bidirectional longest matching:
['研究', '生命', '起源']
---------------------------------------------------
Forward longest matching:
['当下', '雨天', '地面', '积水']
Backward longest matching:
['当', '下雨天', '地面', '积水']
Bidirectional longest matching:
['当下', '雨天', '地面', '积水']
---------------------------------------------------
Forward longest matching:
['结婚', '的', '和尚', '未', '结婚']
Backward longest matching:
['结婚', '的', '和', '尚未', '结婚']
Bidirectional longest matching:
['结婚', '的', '和', '尚未', '结婚']
---------------------------------------------------
Forward longest matching:
['欢迎', '新', '老师', '生前', '来', '就餐']
Backward longest matching:
['欢', '迎新', '老', '师生', '前来', '就餐']
Bidirectional longest matching:
['欢', '迎新', '老', '师生', '前来', '就餐']
---------------------------------------------------

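These results give a glimpse of how fragile rule systems are. Maintaining a rule set is sometimes robbing Peter to pay Paul, and sometimes it does more harm than good: for "项目的研究", the forward scan was right and the bidirectional rule picked the wrong backward result.

(6) Speed evaluation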

Python implementation:

import time
def evaluate_speed(segment, text, dic):
    start_time = time.time()
    for i in range(pressure):  # pressure is the module-level repeat count defined below
        segment(text, dic)
    elapsed_time = time.time() - start_time
    # throughput in units of 10,000 characters per second
    print('%.2f x 10^4 chars/s' % (len(text) * pressure / 10000 / elapsed_time))

text = "江西鄱阳湖干枯，中国最大淡水湖变成大草原"
pressure = 10000
dic = load_dictionary()

evaluate_speed(forward_segment, text, dic)
evaluate_speed(backward_segment, text, dic)
evaluate_speed(bidirectional_segment, text, dic)

74.27 x 10^4 chars/s
68.44 x 10^4 chars/s
33.99 x 10^4 chars/s
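
A single timed loop like the one above is sensitive to whatever else the machine is doing. As a hedged alternative, the same measurement can be sketched with the standard-library timeit module; chars_per_second and split_into_chars are hypothetical names, and the placeholder segmenter exists only so the sketch runs without HanLP.

```python
import timeit

def chars_per_second(segment, text, dic, repeats=1000):
    """Average throughput of segment(text, dic) in characters per second."""
    elapsed = timeit.timeit(lambda: segment(text, dic), number=repeats)
    return len(text) * repeats / elapsed

def split_into_chars(text, dic):
    # Trivial placeholder segmenter, used only to make this sketch runnable
    return list(text)

rate = chars_per_second(split_into_chars, '江西鄱阳湖干枯', set())
print('%.2f x 10^4 chars/s' % (rate / 10000))
```

Passing forward_segment, backward_segment, or bidirectional_segment together with the real dictionary in place of the placeholder would reproduce the comparison above; the absolute numbers will differ from machine to machine.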

Java implementation:

/**
* Evaluate segmentation speed
*
* @param dictionary the dictionary
*/
public static void evaluateSpeed(Map<String, CoreDictionary.Attribute> dictionary)
{
	String text = "江西鄱阳湖干枯，中国最大淡水湖变成大草原";
	long start;
	double costTime;
	final int pressure = 10000;

	System.out.println("Forward longest");
	start = System.currentTimeMillis();
	for (int i = 0; i < pressure; ++i)
	{
		segmentForwardLongest(text, dictionary);
	}
	costTime = (System.currentTimeMillis() - start) / (double) 1000;
	System.out.printf("%.2f x 10^4 chars/s\n", text.length() * pressure / 10000 / costTime);

	System.out.println("Backward longest");
	start = System.currentTimeMillis();
	for (int i = 0; i < pressure; ++i)
	{
		segmentBackwardLongest(text, dictionary);
	}
	costTime = (System.currentTimeMillis() - start) / (double) 1000;
	System.out.printf("%.2f x 10^4 chars/s\n", text.length() * pressure / 10000 / costTime);

	System.out.println("Bidirectional longest");
	start = System.currentTimeMillis();
	for (int i = 0; i < pressure; ++i)
	{
		segmentBidirectional(text, dictionary);
	}
	costTime = (System.currentTimeMillis() - start) / (double) 1000;
	System.out.printf("%.2f x 10^4 chars/s\n", text.length() * pressure / 10000 / costTime);
}

Forward longest
206.19 x 10^4 chars/s
Backward longest
134.23 x 10^4 chars/s
Bidirectional longest
86.21 x 10^4 chars/s

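These are only simulated tests, but to some extent they show that:

(1) Under the same conditions, Python runs slower than Java: in these runs it reaches roughly a third to a half of Java's throughput.
(2) In the Python run, forward and backward matching are about equally fast, and each is roughly twice as fast as bidirectional matching, which has to run both scans.
(3) The Java forward matching is faster than the Java backward matching, possibly because of garbage collection, but both are still faster than their Python counterparts.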
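
Part of the cost in all three scans comes from testing every substring up to the end of the text. One simple mitigation, which is an assumption added here and was not used in the measurements above, is to cap the candidate length at the length of the longest dictionary word. A minimal sketch, with forward_segment_bounded and max_word_len as hypothetical names:

```python
def forward_segment_bounded(text, dic, max_word_len):
    """Forward longest matching that never tests substrings longer than max_word_len."""
    words, i = [], 0
    while i < len(text):
        longest = text[i]
        # The last end index worth trying is i + max_word_len, clipped to the text length
        last_end = min(i + max_word_len, len(text))
        for j in range(i + 1, last_end + 1):
            candidate = text[i:j]
            if candidate in dic and len(candidate) > len(longest):
                longest = candidate
        words.append(longest)
        i += len(longest)
    return words

# Hand-built toy dictionary (an assumption, not HanLP's data)
toy_dic = {'就读', '北京', '大学', '北京大学'}
# Compute the bound once, not on every call
max_word_len = max((len(word) for word in toy_dic), default=1)
print(forward_segment_bounded('就读北京大学', toy_dic, max_word_len))  # ['就读', '北京大学']
```

The result is the same as the unbounded scan because no substring longer than the longest dictionary word can ever match; only the number of lookups changes.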

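4. Summary

With the material above we can already segment sentences, but the algorithms still have room for improvement in efficiency. The next post will use a trie to optimize the segmentation algorithms.
Content based on 《自然语言处理入门》 by 何晗 (He Han).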
