Training unigram and bigram language models with HanLP

Original

zhuo木鸟

2020-05-03 11:17


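A unigram language model simply counts how often each word occurs in a corpus.
A bigram language model counts how often each pair of consecutive words occurs in the corpus.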

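In Chinese language processing, the corpus used to train unigram and bigram models must consist of sentences that have already been segmented into words.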
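To make the two definitions concrete, here is a minimal sketch in plain Python (not HanLP's implementation) that counts unigrams and bigrams over a tiny, already segmented corpus. The toy sentences and the `count_ngrams` helper are made up for illustration.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and bigrams in a list of pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        # zip pairs each word with its successor in the same sentence
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


# Toy corpus: every sentence is already split into words.
corpus = [
    ["商品", "和", "服務"],
    ["商品", "和", "貨幣"],
]
unigrams, bigrams = count_ngrams(corpus)
print(unigrams["商品"])         # 2: frequency of a single word
print(bigrams[("商品", "和")])  # 2: frequency of two consecutive words
```

Without the word boundaries supplied by the segmented corpus there would be nothing to count, which is why an unsegmented text cannot be used directly.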

Training the bigram model

Below we use the MSR dataset provided by Microsoft to train a unigram and bigram model. The code is as follows:

```python
from pyhanlp import *

NatureDictionaryMaker = SafeJClass('com.hankcs.hanlp.corpus.dictionary.NatureDictionaryMaker')
CorpusLoader = SafeJClass('com.hankcs.hanlp.corpus.document.CorpusLoader')

# Path to the dataset; a plain .txt file also works
corpus_path = r'E:\Anaconda3\Lib\site-packages\pyhanlp\static\data\test\icwb2-data\training\msr_training.utf8'
# Path prefix under which the model files are saved
model_path = r'D:\桌面\比賽\msr_model'


def train_bigram(corpus_path, model_path):
    sents = CorpusLoader.convert2SentenceList(corpus_path)  # load the corpus (dataset)
    for sent in sents:
        for word in sent:
            if word.label is None:
                word.setLabel("n")  # give untagged words a default tag
    maker = NatureDictionaryMaker()  # model (dictionary) builder
    maker.compute(sents)  # count unigram and bigram frequencies
    maker.saveTxtTo(model_path)  # save the model files under D:\桌面\比賽\msr_model


train_bigram(corpus_path, model_path)
```
    
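Running the code above (it takes a fair while) produces three files under the model path: `msr_model.txt`, `msr_model.ngram.txt`, and `msr_model.tr.txt`.

![Screenshot of the generated files](https://img-blog.csdnimg.cn/20200502145745388.png)

Opening `msr_model.ngram.txt` (the bigram model) shows the trained model:

![Contents of the bigram model](https://img-blog.csdnimg.cn/20200502145900221.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjE0MTM5MA==,size_16,color_FFFFFF,t_70)

## Segmenting with the bigram model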

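When segmenting with the model above, the target sentence is matched against the words in the model to build a **word lattice**, which in turn corresponds to a word graph: a directed graph in which each node represents a word from the model. Based on the distances between nodes, the Viterbi algorithm or Dijkstra's algorithm then finds the most plausible segmentation.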
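The idea can be sketched in plain Python. This is a simplified illustration, not HanLP's implementation: it assumes add-one smoothing for the edge cost, whereas HanLP uses its own smoothing formula, and the names `build_word_net`, `viterbi_segment`, `BOS` and the toy counts are all invented for this example.

```python
import math

BOS = "<s>"  # sentence-start marker (an assumption of this sketch)


def build_word_net(sentence, vocab):
    """net[i] holds every vocabulary word that starts at character i."""
    net = [[] for _ in sentence]
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            if sentence[i:j] in vocab:
                net[i].append(sentence[i:j])
        if not net[i]:
            net[i].append(sentence[i])  # unknown character: keep the graph connected
    return net


def viterbi_segment(sentence, unigrams, bigrams):
    """Return the lowest-cost path through the word lattice."""
    if not sentence:
        return []
    vocab = set(unigrams) - {BOS}
    vocab_size = len(vocab) + 1  # +1 leaves probability mass for unseen words
    net = build_word_net(sentence, vocab)
    n = len(sentence)

    def cost(prev, word):
        # Add-one smoothed -log P(word | prev): frequent bigrams give short edges.
        p = (bigrams.get((prev, word), 0) + 1) / (unigrams.get(prev, 0) + vocab_size)
        return -math.log(p)

    # best[i][word] = (cheapest cost of reaching `word` at position i, predecessor node)
    best = [{} for _ in range(n)]
    for word in net[0]:
        best[0][word] = (cost(BOS, word), None)
    for i in range(n):
        for word, (so_far, _) in best[i].items():
            j = i + len(word)
            if j >= n:
                continue
            for nxt in net[j]:
                candidate = so_far + cost(word, nxt)
                if nxt not in best[j] or candidate < best[j][nxt][0]:
                    best[j][nxt] = (candidate, (i, word))

    # Pick the cheapest node that ends exactly at the sentence end, then backtrack.
    node = min(
        ((i, w) for i in range(n) for w in best[i] if i + len(w) == n),
        key=lambda k: best[k[0]][k[1]][0],
    )
    path = []
    while node is not None:
        i, word = node
        path.append(word)
        node = best[i][word][1]
    return path[::-1]


# Toy counts, invented for illustration.
unigrams = {BOS: 3, "商品": 2, "和": 2, "服務": 2, "和服": 1, "務": 1}
bigrams = {(BOS, "商品"): 2, ("商品", "和"): 2, ("和", "服務"): 2}
print(viterbi_segment("商品和服務", unigrams, bigrams))  # ['商品', '和', '服務']
```

The lattice also contains the competing path 商品 / 和服 / 務, but its bigrams were never observed, so its edges are longer and the path loses.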

To segment text, we first need to point `HanLP.Config.CoreDictionaryPath` at the model. To segment with the bigram model, `HanLP.Config.BiGramDictionaryPath` must be set as well:
```python
from pyhanlp import *

ViterbiSegment = JClass('com.hankcs.hanlp.seg.Viterbi.ViterbiSegment')
DijkstraSegment = JClass('com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment')
CoreDictionary = LazyLoadingJClass('com.hankcs.hanlp.dictionary.CoreDictionary')
CoreBiGramTableDictionary = SafeJClass('com.hankcs.hanlp.dictionary.CoreBiGramTableDictionary')
Nature = JClass('com.hankcs.hanlp.corpus.tag.Nature')

msr_model = model_path  # path of the MSR model trained above


def load_bigram(model_path):
    HanLP.Config.CoreDictionaryPath = model_path + ".txt"  # load the unigram model
    HanLP.Config.BiGramDictionaryPath = model_path + ".ngram.txt"  # load the bigram model

    # The rest is for compatibility with new tag sets; skip it if you are not interested
    HanLP.Config.CoreDictionaryTransformMatrixDictionaryPath = model_path + ".tr.txt"  # POS transition matrix, not needed for segmentation
    if model_path != msr_model:
        with open(HanLP.Config.CoreDictionaryTransformMatrixDictionaryPath, encoding='utf-8') as src:
            for tag in src.readline().strip().split(',')[1:]:
                Nature.create(tag)  # register each tag listed in the header row


load_bigram(model_path)
```

Depending on your needs, you can build the Chinese segmenter with either the Viterbi algorithm or Dijkstra's algorithm:

```python
segment_1 = ViterbiSegment()  # Viterbi algorithm; the dictionaries were already set in HanLP.Config
segment_1.enableAllNamedEntityRecognize(False)  # disable named entity recognition
segment_1.enableCustomDictionary(False)  # do not attach the user dictionary

segment_2 = DijkstraSegment()  # Dijkstra's algorithm
segment_2.enableAllNamedEntityRecognize(False)
segment_2.enableCustomDictionary(False)
```

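Alternatively, a dictionary path can be passed directly when the segmenter is created. The `seg` method then splits a sentence into words:

```python
# In HanLP 1.x this constructor argument is read as a custom (user) dictionary path
a = ViterbiSegment(model_path + ".ngram.txt")
print(a.seg('我愛你'))
```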

Integrating a user dictionary

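Segmentation with unigram and bigram models has advantages over pure dictionary matching, but it still handles internet neologisms and out-of-vocabulary (OOV) words poorly. Can a dictionary and a language model be combined for segmentation? HanLP supports exactly that.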

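There are two ways to integrate a user dictionary:

1. Low priority: the language model segments the sentence first without consulting the user dictionary, and the result is then merged according to the user dictionary.
2. High priority: the user dictionary is considered first, though how exactly it is applied is left to the language model.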
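The low-priority idea can be illustrated in plain Python: take the segmenter's output and merge adjacent tokens whose concatenation is a user-dictionary word. This is only a sketch of the concept, not HanLP's actual code; the `merge_with_user_dict` helper and the sample token list are hypothetical.

```python
def merge_with_user_dict(tokens, user_dict):
    """Merge runs of adjacent tokens that concatenate to a user-dictionary word."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest run of at least two tokens starting at i first.
        for j in range(len(tokens), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in user_dict:
                merged.append(candidate)
                i = j
                break
        else:
            # No run matched: keep the token as the model produced it.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["社會", "搖擺", "簡稱", "社會", "搖"]  # hypothetical model output
print(merge_with_user_dict(tokens, {"社會搖"}))
# ['社會', '搖擺', '簡稱', '社會搖']
```

Because merging only joins whole tokens, the first 社會 / 搖擺 stays as the model produced it: the dictionary word 社會搖 would have to cut 搖擺 in half, which a low-priority merge never does.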

Both integration modes are used as follows:

```python
from pyhanlp import *

ViterbiSegment = SafeJClass('com.hankcs.hanlp.seg.Viterbi.ViterbiSegment')
# The following is adapted from He Han's book 《自然語言處理入門》 (Introduction to Natural Language Processing)
segment = ViterbiSegment()
sentence = "社會搖擺簡稱社會搖"
segment.enableCustomDictionary(False)
print("Without user dictionary:", segment.seg(sentence))
CustomDictionary.insert("社會搖", "nz 100")  # add a word with its tag and frequency
segment.enableCustomDictionary(True)
print("Low-priority dictionary:", segment.seg(sentence))
segment.enableCustomDictionaryForcing(True)
print("High-priority dictionary:", segment.seg(sentence))
```
