Word Segmentation: The Forward Maximum Matching Method

Original post

2020-06-20 22:26

The full data and code are available on GitHub: zlhcsm

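Background: forward maximum matching
Given a piece of input text, scan it from left to right and greedily cut off the longest word that starts at the current position.
Forward maximum matching is a dictionary-based segmentation method. Its underlying principle is that the coarser a word's granularity (that is, the longer the word), the more precise the meaning it can express.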
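To make the "longest word at the current position" idea concrete, here is a minimal sketch of a single greedy step. The function name `longest_word_at`, the `max_len` parameter, and the toy dictionary are assumptions made for illustration; they do not come from the original code.

```python
def longest_word_at(text, start, dictionary, max_len):
    """Return the longest dictionary word starting at `start`, or one character."""
    # Try the longest candidate first and shorten it one character at a time.
    for length in range(min(max_len, len(text) - start), 1, -1):
        candidate = text[start:start + length]
        if candidate in dictionary:
            return candidate
    # No multi-character dictionary word starts here: fall back to one character.
    return text[start:start + 1]


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(longest_word_at("研究生命起源", 0, toy_dictionary, max_len=3))
```

Both 研究 and 研究生 start at position 0, and the greedy step picks the longer one, 研究生.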

Steps

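1. Starting at the beginning of the string, take a fragment whose length equals the maximum word length. If the remaining sequence is shorter than the maximum word length, take the whole sequence.
2. Check whether the fragment is in the dictionary. If it is, it counts as one segmented word. If it is not, drop one character from the right end and check whether the shorter fragment is in the dictionary. Repeat this until only a single character is left, which is then emitted as a word on its own.
3. The sequence becomes whatever remains after the word cut off in step 2 is removed. Repeat from step 1 until nothing is left.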
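The three steps above can be sketched in a few lines. This is only an illustrative sketch, not the author's implementation: `forward_max_match`, its `max_len` argument, and the toy dictionary are hypothetical names and data chosen for the example, and the dictionary is assumed to be a `set` so that lookups are fast.

```python
def forward_max_match(text, dictionary, max_len):
    """Segment text left to right, taking the longest dictionary word each time."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Step 1: take a window of at most max_len characters.
        end = min(pos + max_len, len(text))
        # Step 2: shrink the window from the right until it is a dictionary
        # word or only one character remains.
        while end - pos > 1 and text[pos:end] not in dictionary:
            end -= 1
        tokens.append(text[pos:end])
        # Step 3: continue with the remainder of the sequence.
        pos = end
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"南京", "南京市", "長江", "長江大橋", "大橋"}
print(forward_max_match("南京市長江大橋", toy_dictionary, max_len=4))
```

With this toy dictionary the first window 南京市長 is not a word, so it shrinks to 南京市. The remaining 長江大橋 matches on the first try.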

Core code

1. Reading the dictionary file

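# Module-level list that holds the dictionary words
words_dic = []


def init():
    """
    Read the dictionary file and load its words into words_dic.
    :return: None
    """
    with open("../dic/dict.txt", "r", encoding="utf8") as dict_input:
        for line in dict_input:
            # File format: word frequency part-of-speech
            words_dic.append(line.split(" ")[0].strip())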
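The loader above appends every word to a list, so each `in` check during segmentation scans the list linearly. As a possible alternative, the sketch below collects the words into a `set` instead. The helper name `load_words` and the sample lines are assumptions for illustration; only the "word frequency part-of-speech" line format is taken from the comment in the original code.

```python
import io


def load_words(lines):
    """Collect the first whitespace-separated field of each non-empty line."""
    words = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            words.add(fields[0])
    return words


# Hypothetical in-memory sample standing in for the dictionary file.
sample = io.StringIO("研究 100 v\n研究生 50 n\n\n生命 80 n\n")
words = load_words(sample)
print(sorted(words))
```

Because `load_words` accepts any iterable of lines, an open file object can be passed in the same way as the in-memory sample.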

2. The segmentation function

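# Segmentation routine for the forward maximum matching algorithm
def cut_words(raw_sentence, word_dic):
    # Length of the longest word in the dictionary (1 if the dictionary is empty)
    max_length = max((len(word) for word in word_dic), default=1)
    sentence = raw_sentence.strip()
    # Number of characters still to be segmented
    word_length = len(sentence)
    # Words that have been cut off so far
    cut_word_list = []
    while word_length > 0:
        max_cut_length = min(max_length, word_length)
        sub_sen = sentence[0:max_cut_length]
        while max_cut_length > 0:
            if sub_sen in word_dic:
                # The fragment is a dictionary word
                cut_word_list.append(sub_sen)
                break
            elif max_cut_length == 1:
                # A single character is emitted even if it is not in the dictionary
                cut_word_list.append(sub_sen)
                break
            else:
                # Drop one character from the right and try again
                max_cut_length = max_cut_length - 1
                sub_sen = sub_sen[0:max_cut_length]
        # Continue with the part of the sentence that has not been segmented yet
        sentence = sentence[max_cut_length:]
        word_length = word_length - max_cut_length
    words = '/'.join(cut_word_list)
    return words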

end
