[Natural Language Processing] HanLP Word Segmentation and Stop-Word Removal Tools

Original

代码拖拉鸡

2020-06-12 17:34

See this GitHub repository for reference.
Besides jieba, you can also segment text with this small HanLP tool, which is just as convenient.

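HanLP's dictionary-based segmentation

1. DoubleArrayTrieSegment

DoubleArrayTrieSegment is a segmenter that wraps longest matching over a double-array trie (DAT). By default it loads the dictionary specified by CoreDictionaryPath in hanlp.properties.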

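from pyhanlp import *

# Do not show part-of-speech tags in the output
HanLP.Config.ShowTermNature = False

# Custom dictionary paths can be passed in as a list, e.g. [dir1, dir2]
segment = DoubleArrayTrieSegment()
# Enable part-of-speech tagging, which activates number and English recognition
segment.enablePartOfSpeechTagging(True)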

pyhanlp installs like any other Python library: pip install pyhanlp
The first import automatically downloads the data files, which can take a long time.
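
To see what "longest matching" means without HanLP, here is a minimal sketch in plain Python. It is not HanLP's implementation: it looks words up in an ordinary set instead of a double-array trie, and the function name longest_match_segment and the toy dictionary are made up for illustration.

```python
def longest_match_segment(text, dictionary, max_len=None):
    """Forward longest matching: at each position take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here, so emit a single character.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy dictionary, only for this demo.
toy_dict = {"中國", "最大", "淡水", "淡水湖", "大草原", "草原"}
print(longest_match_segment("中國最大的淡水湖", toy_dict))
```

With this toy dictionary the sentence splits into 中國 / 最大 / 的 / 淡水湖: at 淡 the longer entry 淡水湖 wins over 淡水, and 的 falls back to a single character because no entry starts there.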

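2. Segmentation results

3. Removing stop words

First load the stop-word dictionary. It is a txt file containing most common stop words and can be downloaded from the source GitHub repository.

Stop-word file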

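# Load the stop-word dictionary (one word per line)
def load_file(filename):
    with open(filename, 'r', encoding="utf-8") as f:
        contents = f.readlines()
    result = []
    for content in contents:
        # Strip the trailing newline and surrounding whitespace
        result.append(content.strip())

    return result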
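
If the stop-word list is large, a set makes the later "is this word a stop word?" check much cheaper than scanning a list. The variant below is only a sketch under that assumption: load_stop_words is a hypothetical name, and the temporary file stands in for the real stopwords.txt.

```python
import os
import tempfile


def load_stop_words(filename):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(filename, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


# Throwaway file standing in for the real stop-word list.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("的\n了\n\n，\n")
    demo_path = tmp.name

stop_words = load_stop_words(demo_path)
os.remove(demo_path)
print(sorted(stop_words))
```

The returned set works with the same "not in" test used for the list version, so it can be passed wherever the list was.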

Then remove the stop words from the segmentation result

# Remove stop words
def remove_stop_words(text, dic):
    result = []
    for k in text:
        # k is a term from the segmenter; k.word is its surface form
        if k.word not in dic:
            result.append(k.word)
    return result
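
The filter only relies on each term having a .word attribute, so it can be tried out without HanLP installed. In the sketch below, Term is a stand-in namedtuple (not HanLP's class), and the sample words and stop words are invented for the demo.

```python
from collections import namedtuple

# Stand-in for the segmenter's term objects: only .word is used here.
Term = namedtuple("Term", ["word"])


def remove_stop_words(terms, stop_words):
    """Keep the surface form of every term whose word is not a stop word."""
    return [t.word for t in terms if t.word not in stop_words]


terms = [Term(w) for w in ["鄱陽湖", "乾枯", "了", "，", "淡水湖"]]
print(remove_stop_words(terms, {"了", "，"}))
```

Here 了 and the comma are dropped and the three content words remain, in their original order.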

Result after removing stop words

Complete code

from pyhanlp import *

# Do not show part-of-speech tags in the output
HanLP.Config.ShowTermNature = False
segment = DoubleArrayTrieSegment()
# Enable part-of-speech tagging, which activates number and English recognition
segment.enablePartOfSpeechTagging(True)

# Load the stop-word dictionary (one word per line)
def load_file(filename):
    with open(filename, 'r', encoding="utf-8") as f:
        contents = f.readlines()
    result = []
    for content in contents:
        result.append(content.strip())

    return result

# Remove stop words
def remove_stop_words(text, dic):
    result = []
    for k in text:
        if k.word not in dic:
            result.append(k.word)
    return result


# Segment a sample sentence, then filter out the stop words
text = segment.seg('江西鄱陽湖乾枯了，中國最大的淡水湖變成了大草原')
print(text)
dic = load_file('stopwords.txt')
result = remove_stop_words(text, dic)
print(result)
