【HanLP】Word Segmentation with Forward, Backward, and Bidirectional Longest Matching

Preface

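English text uses spaces as natural delimiters between words. In Chinese, only characters, sentences, and paragraphs are marked off by explicit delimiters; words have no formal boundary marker. English also has to deal with delimiting phrases, but at the word level Chinese is far more complex and difficult than English.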

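Automatic word segmentation receives a lot of attention in Chinese information processing. Approaches fall roughly into two groups:

Dictionary- and rule-based
Machine-learning-based

This post covers the first.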

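1. Environment setup

Windows 10
Install pyhanlp: pip install pyhanlp (the installation may fail; leave a comment if it does)
The mini core dictionary bundled with HanLP, used as the example dictionary
Jupyter Notebook (Python 3)
Java (JDK 1.8)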

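2. Dictionary-based segmentation

Dictionary-based segmentation is the simplest and most common segmentation algorithm. It needs only a dictionary and a set of rules for looking words up in it.

Loading the dictionary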

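Java implementation:

// Load the dictionary
TreeMap<String, CoreDictionary.Attribute> dictionary = IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");

IOUtil.loadDictionary returns a TreeMap.

Its keys are the words themselves, and its values are CoreDictionary.Attribute objects.

CoreDictionary.Attribute is a structure holding part-of-speech tags and word frequencies. These play no role in dictionary-based segmentation, so we ignore them for now.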

Python implementation:

from pyhanlp import *

def load_dictionary():
    IOUtil = JClass('com.hankcs.hanlp.corpus.io.IOUtil')  # Use JClass to get HanLP's IOUtil utility class
    path = HanLP.Config.CoreDictionaryPath.replace('.txt', '.mini.txt')  # Derive the mini dictionary path from the dictionary path in HanLP's Config
    dic = IOUtil.loadDictionary([path])  # Java map keyed by word
    return set(dic.keySet())  # Keep only the words, as a Python set
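
The segmentation functions later in this post only need a set of words, so a small stand-in dictionary is enough to try them without pyhanlp or a JVM. The sketch below builds such a set from text lines. It assumes each line starts with the word, followed by whitespace-separated attributes; the helper name `load_word_set` and the sample entries are made up for illustration and are not part of HanLP.

```python
def load_word_set(lines):
    """Build a set of words from dictionary lines (hypothetical helper, not a HanLP API).

    Assumption: each non-empty line starts with the word, followed by
    whitespace-separated attributes such as part of speech and frequency.
    """
    words = set()
    for line in lines:
        fields = line.split()
        if fields:                 # skip blank lines
            words.add(fields[0])   # keep only the word; attributes are ignored
    return words


# Made-up sample entries for illustration only (not copied from the real dictionary file)
sample_lines = [
    "商品 n 100",
    "和 c 100",
    "和服 n 10",
    "服務 vn 100",
    "",
]
toy_dic = load_word_set(sample_lines)
print(sorted(toy_dic))
```

A set is used because the algorithms only ask "is this substring a word?", and set membership tests are fast on average.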

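3. Segmentation algorithms

Now that we have a dictionary, all that remains is the rule for looking words up. The common rules are forward longest matching, backward longest matching, and bidirectional longest matching, all of which build on the full segmentation procedure.

(1) Full segmentation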

Python implementation:

def fully_segment(text, dic):
    word_list = []
    for i in range(len(text)):                  # i runs from 0 to the index of the last character
        for j in range(i + 1, len(text) + 1):   # j runs over [i + 1, len(text)]
            word = text[i:j]                    # substring covering the half-open range [i, j)
            if word in dic:                     # if it is in the dictionary, treat it as a word
                word_list.append(word)
    return word_list

dic = load_dictionary()

print(fully_segment('商品和服務', dic))

['商', '商品', '品', '和', '和服', '服', '服務', '務']

The program outputs every substring of the text that appears in the dictionary.

Java implementation:

/**
* Full segmentation algorithm for Chinese word segmentation
*
* @param text       the text to segment
* @param dictionary the dictionary
* @return the list of words
*/
public static List<String> segmentFully(String text, Map<String, CoreDictionary.Attribute> dictionary){
	List<String> wordList = new LinkedList<String>();  // holds the result
	for (int i = 0; i < text.length(); ++i){  // iterate over every start position
		for (int j = i + 1; j <= text.length(); ++j){  // iterate over every end position after it
			String word = text.substring(i, j);  // take the substring
			if (dictionary.containsKey(word)){  // if the dictionary contains it
				wordList.add(word);  // add it to the result
			}
		}
	}
	return wordList;  // return the final segmentation
}

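As the output shows, full segmentation simply returns the list of every substring that appears in the dictionary. This is clearly not the segmentation we want.

For example, for 商品和服務 ("goods and services")
we want ['商品', '和', '服務'], not ['商', '商品', '品', '和', '和服', '服', '服務', '務'].

To fix this, we refine the rule. Longer words tend to carry richer meaning, so we give longer words higher priority.

Concretely, while looking up words from a given index, we prefer to output the longest one. This rule is called the longest matching algorithm, and depending on the scan direction it comes in three variants:

Forward longest matching: scan from front to back
Backward longest matching: scan from back to front
Bidirectional longest matching: a combination of the two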
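
The "longest word from a given index" rule can be isolated into a single helper. The sketch below tries candidates from longest to shortest and stops at the first hit, bounded by the length of the longest dictionary word. This ordering is a variation for illustration, not the listing used in this post; `longest_match_from` and the toy dictionary are hypothetical.

```python
def longest_match_from(text, start, dic, max_len):
    """Return the longest dictionary word starting at `start`,
    falling back to the single character at that position."""
    end_limit = min(len(text), start + max_len)
    for end in range(end_limit, start + 1, -1):  # longest candidate first
        word = text[start:end]
        if word in dic:
            return word          # the first hit is the longest one
    return text[start]           # no multi-character word found


toy_dic = {"商品", "和", "和服", "服務"}       # hypothetical toy dictionary
max_len = max(len(w) for w in toy_dic)    # no candidate can be longer than this

print(longest_match_from("商品和服務", 0, toy_dic, max_len))  # 商品
print(longest_match_from("商品和服務", 2, toy_dic, max_len))  # 和服
```

Scanning from the longest candidate down lets the search stop early, and the `max_len` bound avoids testing substrings that cannot be in the dictionary.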

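(2) Forward longest matching

Python implementation:

def forward_segment(text, dic):
    word_list = []  # segmentation result
    i = 0
    while i < len(text):
        longest_word = text[i]  # the single character at the current scan position
        for j in range(i + 1, len(text) + 1):  # every possible end position
            word = text[i : j]  # take the substring
            if word in dic:   # check whether it is in the dictionary
                if len(word) > len(longest_word):  # if so, and it is longer than the best so far, keep it
                    longest_word = word
        word_list.append(longest_word)  # add it to the result
        i += len(longest_word)  # continue scanning right after the matched word
    return word_list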

print(forward_segment("就讀北京大學", dic))
print(forward_segment("研究生命起源", dic))

['就讀', '北京大學']

['研究生', '命', '起源']

Java implementation:

/**
 * Forward longest matching algorithm for Chinese word segmentation
 * @param text  the text to segment
 * @param dictionary  the dictionary
 * @return  the list of words
 */
public static List<String> segmentForwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary){
	List<String> wordList = new LinkedList<String>();  // result
	for(int i = 0; i < text.length(); ) {
		String longestWord = text.substring(i, i+1);  // longest word starting at the current character
		for(int j = i + 1; j <= text.length(); ++j) {
			String word = text.substring(i, j);
			if(dictionary.containsKey(word)) {
				if(word.length() > longestWord.length()) {
					longestWord = word;
				}
			}
		}
		wordList.add(longestWord);  // add to the result once the scan from i is done
		i += longestWord.length();  // jump to the character after the matched word
	}
	return wordList;
}

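Some sentences give surprising results. Under forward longest matching, "研究生" (graduate student) takes priority over "研究" (research), so "研究生命起源" is split into ['研究生', '命', '起源'] rather than ['研究', '生命', '起源'].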
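
To see where the longer word wins, it helps to record the candidates found at each scan position. The following is a minimal sketch using a made-up toy dictionary; `forward_trace` is a hypothetical helper, not part of HanLP, and it makes the same choices as forward longest matching.

```python
def forward_trace(text, dic):
    """Forward longest matching that also records the candidates seen at each step."""
    steps = []
    i = 0
    while i < len(text):
        # every dictionary word that starts at position i, shortest first
        candidates = [text[i:j] for j in range(i + 1, len(text) + 1) if text[i:j] in dic]
        # pick the longest candidate, or fall back to the single character
        chosen = max(candidates, key=len) if candidates else text[i]
        steps.append((i, candidates, chosen))
        i += len(chosen)
    return steps


toy_dic = {"研究", "研究生", "生命", "命", "起源"}  # hypothetical toy dictionary
for position, candidates, chosen in forward_trace("研究生命起源", toy_dic):
    print(position, candidates, "->", chosen)
```

At position 0 both "研究" and "研究生" are candidates. Choosing the longer one consumes "生", which leaves "命" stranded as a single character.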

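(3) Backward longest matching

Python implementation:

def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:  # the scan position is the end of the word
        longest_word = text[i]  # the single character at the scan position
        for j in range(0, i):  # candidate start positions in [0, i - 1]
            word = text[j : i+1]  # take the substring
            if word in dic:  # if it is in the dictionary
                if len(word) > len(longest_word):  # and it is longer than the best so far
                    longest_word = word  # replace it
        word_list.insert(0, longest_word)  # insert at the front, because we scan backward
        i -= len(longest_word)
    return word_list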

print(backward_segment("就讀北京大學", dic))
print(backward_segment("研究生命起源", dic))
print(backward_segment("項目的研究", dic))

['就讀', '北京大學']

['研究', '生命', '起源']

['項', '目的', '研究']

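Java implementation:

/**
 * Backward longest matching algorithm for Chinese word segmentation
 *
 * @param text       the text to segment
 * @param dictionary the dictionary
 * @return the list of words
 */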
public static List<String> segmentBackwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary){
	List<String> wordList = new LinkedList<String>();
	
	for(int i = text.length() - 1; i >= 0; ) {
		String longestWord = text.substring(i, i + 1);
		for(int j = 0; j <= i; j++) {
			String word = text.substring(j, i + 1);
			if(dictionary.containsKey(word)) {
				if(word.length() > longestWord.length()) {
					longestWord = word;
				}
			}
		}
		wordList.add(0, longestWord);
		i -= longestWord.length();
	}
	return wordList;
}

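Backward matching gets "研究生命起源" right but gets "項目的研究" wrong. This looks like a dead end: each time we change the rule to handle one problem, another one appears. Since each method has its own strengths and weaknesses, let's combine them.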
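
The disagreement between the two scan directions can be reproduced without HanLP. The sketch below uses compact re-implementations of both scans and a made-up toy dictionary, so it only illustrates the effect and is not a substitute for the real dictionary.

```python
def forward_segment(text, dic):
    result, i = [], 0
    while i < len(text):
        longest = text[i]                       # default: single character
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in dic:
                longest = text[i:j]             # j grows, so later hits are longer
        result.append(longest)
        i += len(longest)
    return result


def backward_segment(text, dic):
    result, i = [], len(text)                   # i is the exclusive end of the next word
    while i > 0:
        longest = text[i - 1]                   # default: single character
        for j in range(i - 2, -1, -1):
            if text[j:i] in dic:
                longest = text[j:i]             # j moves left, so later hits are longer
        result.insert(0, longest)
        i -= len(longest)
    return result


# Hypothetical toy dictionary, just large enough to show both failure cases
toy_dic = {"項目", "目的", "的", "研究", "研究生", "生命", "命", "起源"}
for text in ["項目的研究", "研究生命起源"]:
    f, b = forward_segment(text, toy_dic), backward_segment(text, toy_dic)
    print(text, f, b, "agree" if f == b else "disagree")
```

With this toy dictionary the two directions disagree on both sentences: the forward scan is the better one on "項目的研究", and the backward scan is the better one on "研究生命起源".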

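(4) Bidirectional longest matching

Run forward and backward longest matching on the same text. If the two results differ in word count, return the one with fewer words.
Otherwise, return the one with fewer single-character words.
If the single-character counts are also equal, prefer the backward longest matching result.

The motivation is a heuristic from linguistics: in Chinese, single-character words are far less numerous than multi-character words.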
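
The three rules amount to comparing two segmentations by the pair (word count, single-character count), with ties going to the backward result. A minimal sketch of that comparison follows; `pick_bidirectional` is a hypothetical name, and the sample lists are the forward and backward results reported in this post.

```python
def pick_bidirectional(forward, backward):
    """Choose between two segmentations: fewer words first, then fewer
    single-character words, then the backward result on a full tie."""
    def cost(words):
        singles = sum(1 for w in words if len(w) == 1)
        return (len(words), singles)       # tuples compare element by element
    # min() returns the first minimal item, so listing backward first makes it win ties
    return min([backward, forward], key=cost)


# Same word count, forward has fewer single characters -> forward wins
f1 = ['當下', '雨天', '地面', '積水']
b1 = ['當', '下雨天', '地面', '積水']
print(pick_bidirectional(f1, b1))  # ['當下', '雨天', '地面', '積水']

# Same word count and same single-character count -> backward wins
f2 = ['項目', '的', '研究']
b2 = ['項', '目的', '研究']
print(pick_bidirectional(f2, b2))  # ['項', '目的', '研究']
```

The second case shows the weak spot of the tie-break: the forward result is the better segmentation, yet the rule returns the backward one.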

Python implementation:

def count_single_char(word_list: list): # count the single-character words
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):  # fewer words has higher priority
        return f
    elif len(f) > len(b):
        return b
    else:
        if count_single_char(f) < count_single_char(b):  # fewer single characters has higher priority
            return f
        else:
            return b  # on a full tie, return the backward result

print(bidirectional_segment("研究聲明的源泉", dic))

['研究', '聲明', '的', '源泉']

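Java implementation:

/**
* Count the single-character words in a segmentation result
*
* @param wordList the segmentation result
* @return the number of single-character words
*/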
public static int countSingleChar(List<String> wordList)
{
   int size = 0;
   for (String word : wordList)
   {
       if (word.length() == 1)
           ++size;
   }
   return size;
}

/**
* Bidirectional longest matching algorithm for Chinese word segmentation
*
* @param text       the text to segment
* @param dictionary the dictionary
* @return the list of words
*/
public static List<String> segmentBidirectional(String text, Map<String, CoreDictionary.Attribute> dictionary)
{
   List<String> forwardLongest = segmentForwardLongest(text, dictionary);
   List<String> backwardLongest = segmentBackwardLongest(text, dictionary);
   if (forwardLongest.size() < backwardLongest.size())
       return forwardLongest;
   else if (forwardLongest.size() > backwardLongest.size())
       return backwardLongest;
   else
   {
       if (countSingleChar(forwardLongest) < countSingleChar(backwardLongest))
           return forwardLongest;
       else
           return backwardLongest;
   }
}

(5) Results of the three methods

Python implementation:

texts = ['項目的研究','商品和服務','研究生命起源','當下雨天地面積水','結婚的和尚未結婚','歡迎新老師生前來就餐']
for text in texts:
    print("Forward longest matching:")
    print(forward_segment(text, dic))
    print("Backward longest matching:")
    print(backward_segment(text, dic))
    print("Bidirectional longest matching:")
    print(bidirectional_segment(text, dic))
    print("-----------------"*3)

Forward longest matching:
['項目', '的', '研究']
Backward longest matching:
['項', '目的', '研究']
Bidirectional longest matching:
['項', '目的', '研究']
---------------------------------------------------
Forward longest matching:
['商品', '和服', '務']
Backward longest matching:
['商品', '和', '服務']
Bidirectional longest matching:
['商品', '和', '服務']
---------------------------------------------------
Forward longest matching:
['研究生', '命', '起源']
Backward longest matching:
['研究', '生命', '起源']
Bidirectional longest matching:
['研究', '生命', '起源']
---------------------------------------------------
Forward longest matching:
['當下', '雨天', '地面', '積水']
Backward longest matching:
['當', '下雨天', '地面', '積水']
Bidirectional longest matching:
['當下', '雨天', '地面', '積水']
---------------------------------------------------
Forward longest matching:
['結婚', '的', '和尚', '未', '結婚']
Backward longest matching:
['結婚', '的', '和', '尚未', '結婚']
Bidirectional longest matching:
['結婚', '的', '和', '尚未', '結婚']
---------------------------------------------------
Forward longest matching:
['歡迎', '新', '老師', '生前', '來', '就餐']
Backward longest matching:
['歡', '迎新', '老', '師生', '前來', '就餐']
Bidirectional longest matching:
['歡', '迎新', '老', '師生', '前來', '就餐']
---------------------------------------------------

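These results show how fragile a rule-based system is. Maintaining a rule set sometimes means robbing Peter to pay Paul, and sometimes a new rule does more harm than good: for "項目的研究", forward matching alone gives the right answer, but the bidirectional rule returns the wrong backward result.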

(6) Speed evaluation

Python implementation:

import time
def evaluate_speed(segment, text, dic):
    start_time = time.time()
    for i in range(pressure):  # pressure is the module-level repeat count defined below
        segment(text, dic)
    elapsed_time = time.time() - start_time
    # throughput in units of 10,000 characters per second
    print('%.2f x 10^4 chars/s' % (len(text) * pressure / 10000 / elapsed_time))

text = "江西鄱陽湖乾枯，中國最大淡水湖變成大草原"
pressure = 10000
dic = load_dictionary()

evaluate_speed(forward_segment, text, dic)
evaluate_speed(backward_segment, text, dic)
evaluate_speed(bidirectional_segment, text, dic)

74.27 x 10^4 chars/s
68.44 x 10^4 chars/s
33.99 x 10^4 chars/s

Java implementation:

/**
* Evaluate segmentation speed
*
* @param dictionary the dictionary
*/
public static void evaluateSpeed(Map<String, CoreDictionary.Attribute> dictionary)
{
	String text = "江西鄱陽湖乾枯，中國最大淡水湖變成大草原";
	long start;
	double costTime;
	final int pressure = 10000;

	System.out.println("Forward longest");
	start = System.currentTimeMillis();
	for (int i = 0; i < pressure; ++i)
	{
		segmentForwardLongest(text, dictionary);
	}
	costTime = (System.currentTimeMillis() - start) / (double) 1000;
	System.out.printf("%.2f x 10^4 chars/s\n", text.length() * pressure / 10000 / costTime);

	System.out.println("Backward longest");
	start = System.currentTimeMillis();
	for (int i = 0; i < pressure; ++i)
	{
		segmentBackwardLongest(text, dictionary);
	}
	costTime = (System.currentTimeMillis() - start) / (double) 1000;
	System.out.printf("%.2f x 10^4 chars/s\n", text.length() * pressure / 10000 / costTime);

	System.out.println("Bidirectional longest");
	start = System.currentTimeMillis();
	for (int i = 0; i < pressure; ++i)
	{
		segmentBidirectional(text, dictionary);
	}
	costTime = (System.currentTimeMillis() - start) / (double) 1000;
	System.out.printf("%.2f x 10^4 chars/s\n", text.length() * pressure / 10000 / costTime);
}

Forward longest
206.19 x 10^4 chars/s
Backward longest
134.23 x 10^4 chars/s
Bidirectional longest
86.21 x 10^4 chars/s

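The tests above are only a rough simulation, but they suggest that:

(1) Under the same conditions, Python runs slower than Java, at less than half of Java's throughput.
(2) In the Python run, forward and backward matching are about equally fast, and each is roughly twice as fast as bidirectional matching, which has to run both.
(3) In the Java implementation, forward matching is faster than backward matching, possibly because of garbage collection; even so, the backward version is still faster than Python.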
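
A single wall-clock measurement is noisy, so figures like these shift from run to run. One way to steady them, sketched below, is to time several repeats with `time.perf_counter` and keep the best. `chars_per_second` and the stand-in segmenter `split_chars` are hypothetical names used so that the example runs on its own; pass in the real segmentation functions and dictionary to measure them.

```python
import time


def chars_per_second(segment, text, dic, loops=1000, repeats=3):
    """Return the best throughput (characters per second) over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()          # monotonic, high-resolution timer
        for _ in range(loops):
            segment(text, dic)
        best = min(best, time.perf_counter() - start)
    return len(text) * loops / max(best, 1e-9)   # guard against a zero reading


def split_chars(text, dic):
    """Stand-in segmenter for the demo: every character is its own word."""
    return list(text)


rate = chars_per_second(split_chars, "江西鄱陽湖乾枯，中國最大淡水湖變成大草原", set())
print("%.2f x 10^4 chars/s" % (rate / 10000))
```

Taking the best of several repeats reduces the influence of background activity on the machine. The absolute numbers still depend on hardware and interpreter version.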

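4. Summary

With the material above we can now segment sentences, but the algorithms can still be made more efficient. The next post optimizes them with a trie.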
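
As a preview of the idea, here is a rough sketch of forward longest matching over a trie built from nested Python dicts. It is only an illustration under simple assumptions (a made-up toy dictionary and hypothetical function names), not the implementation the next post or HanLP uses.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # descend, creating nodes as needed
        node[None] = True
    return root


def longest_prefix(trie, text, start):
    """Return the longest word in the trie that starts at text[start],
    or the single character at that position if there is none."""
    node, best_end = trie, start + 1
    for end in range(start, len(text)):
        node = node.get(text[end])
        if node is None:          # no dictionary word continues with this character
            break
        if None in node:          # a complete word ends here
            best_end = end + 1
    return text[start:best_end]


def forward_segment_trie(text, trie):
    words, i = [], 0
    while i < len(text):
        word = longest_prefix(trie, text, i)
        words.append(word)
        i += len(word)
    return words


toy_trie = build_trie({"就讀", "北京", "北京大學", "大學"})  # hypothetical toy dictionary
print(forward_segment_trie("就讀北京大學", toy_trie))  # ['就讀', '北京大學']
```

The scan from each position stops as soon as no dictionary word can continue, instead of testing every substring up to the end of the text.
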
Content adapted from 《自然語言處理入門》 (Introduction to Natural Language Processing) by 何晗 (He Han).
