Enhancing Elasticsearch Tokenization with HanLP

Source code for the hanlp-ext plugin: http://git.oschina.net/hualongdata/hanlp-ext or https://github.com/hualongdata/hanlp-ext

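By default, Elasticsearch tokenizes Chinese text one character at a time, which falls well short of what word-based search requires. There is an official SmartCN Chinese analysis plugin, and the IK analysis plugin is also widely used. Here, however, we use HanLP, a natural language processing toolkit, to perform Chinese word segmentation.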

Elasticsearch
Elasticsearch's default tokenization gives very poor results for Chinese.

GET /_analyze?pretty
{
  "text" : ["重慶華龍網海數科技有限公司"]
}

Output:

{
"tokens": [

{
  "token": "重",
  "start_offset": 0,
  "end_offset": 1,
  "type": "<IDEOGRAPHIC>",
  "position": 0
},
{
  "token": "慶",
  "start_offset": 1,
  "end_offset": 2,
  "type": "<IDEOGRAPHIC>",
  "position": 1
},
{
  "token": "華",
  "start_offset": 2,
  "end_offset": 3,
  "type": "<IDEOGRAPHIC>",
  "position": 2
},
{
  "token": "龍",
  "start_offset": 3,
  "end_offset": 4,
  "type": "<IDEOGRAPHIC>",
  "position": 3
},
{
  "token": "網",
  "start_offset": 4,
  "end_offset": 5,
  "type": "<IDEOGRAPHIC>",
  "position": 4
},
{
  "token": "海",
  "start_offset": 5,
  "end_offset": 6,
  "type": "<IDEOGRAPHIC>",
  "position": 5
},
{
  "token": "數",
  "start_offset": 6,
  "end_offset": 7,
  "type": "<IDEOGRAPHIC>",
  "position": 6
},
{
  "token": "科",
  "start_offset": 7,
  "end_offset": 8,
  "type": "<IDEOGRAPHIC>",
  "position": 7
},
{
  "token": "技",
  "start_offset": 8,
  "end_offset": 9,
  "type": "<IDEOGRAPHIC>",
  "position": 8
},
{
  "token": "有",
  "start_offset": 9,
  "end_offset": 10,
  "type": "<IDEOGRAPHIC>",
  "position": 9
},
{
  "token": "限",
  "start_offset": 10,
  "end_offset": 11,
  "type": "<IDEOGRAPHIC>",
  "position": 10
},
{
  "token": "公",
  "start_offset": 11,
  "end_offset": 12,
  "type": "<IDEOGRAPHIC>",
  "position": 11
},
{
  "token": "司",
  "start_offset": 12,
  "end_offset": 13,
  "type": "<IDEOGRAPHIC>",
  "position": 12
}

]
}
As you can see, the default analyzer splits the text into individual characters.
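To make the effect concrete, here is a minimal Python sketch that reproduces the shape of the response above. It is not how Elasticsearch is implemented; it only assumes that every character of the input is a CJK ideograph and that each one becomes its own token with consecutive offsets. The function name `char_tokens` is made up for this illustration.

```python
def char_tokens(text):
    """Return one token per character, mirroring the response shape above.

    Assumption: every character is a CJK ideograph, so the type is
    hard-coded to "<IDEOGRAPHIC>".
    """
    return [
        {
            "token": ch,
            "start_offset": i,
            "end_offset": i + 1,
            "type": "<IDEOGRAPHIC>",
            "position": i,
        }
        for i, ch in enumerate(text)
    ]


tokens = char_tokens("重慶華龍網海數科技有限公司")
print(len(tokens))                      # one token per character
print([t["token"] for t in tokens])
```

Because every token is a single character, a query for a word such as 科技 can only be matched as two independent characters, which is why word-level segmentation matters for search quality.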

elasticsearch-hanlp
HanLP

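HanLP is an excellent natural language processing toolkit implemented in Java. It provides the following features:

Chinese word segmentation
Part-of-speech tagging
Named entity recognition
Keyword extraction
Automatic summarization
Phrase extraction
Pinyin conversion
Simplified/Traditional Chinese conversion
Text recommendation
Dependency parsing
Corpus tools

After installing the elasticsearch-hanlp plugin (installation instructions: https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin), let's look at the tokenization results again.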

GET /_analyze?pretty
{
  "analyzer" : "hanlp",
  "text" : ["重慶華龍網海數科技有限公司"]
}

Output:

{
"tokens": [

{
  "token": "重慶",
  "start_offset": 0,
  "end_offset": 2,
  "type": "ns",
  "position": 0
},
{
  "token": "華龍網",
  "start_offset": 2,
  "end_offset": 5,
  "type": "nr",
  "position": 1
},
{
  "token": "海數",
  "start_offset": 5,
  "end_offset": 7,
  "type": "nr",
  "position": 2
},
{
  "token": "科技",
  "start_offset": 7,
  "end_offset": 9,
  "type": "n",
  "position": 3
},
{
  "token": "有限公司",
  "start_offset": 9,
  "end_offset": 13,
  "type": "nis",
  "position": 4
}

]
}
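The same request can also be issued from a script. The following is a minimal sketch using only the Python standard library. It assumes a cluster reachable at `http://localhost:9200` (not stated in this article) and that the plugin registers an analyzer named `hanlp`, as in the request above. The helper names `build_analyze_request` and `analyze` are made up for this illustration, and the request is sent as POST, which the `_analyze` API accepts as well as GET.

```python
import json
import urllib.request

# Assumption: a local cluster; adjust the URL to your environment.
ES_URL = "http://localhost:9200"


def build_analyze_request(text, analyzer=None, base_url=ES_URL):
    """Build (but do not send) a request for the _analyze API."""
    body = {"text": [text]}
    if analyzer is not None:
        # Omitting the analyzer makes Elasticsearch use its default one.
        body["analyzer"] = analyzer
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer=None, base_url=ES_URL):
    """Send the request and return only the token strings."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

With a cluster running and the plugin installed, calling `analyze("重慶華龍網海數科技有限公司", analyzer="hanlp")` should return the five tokens shown in the output above, while omitting the `analyzer` argument should return the per-character tokens from the first example.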
HanLP offers far more than basic Chinese word segmentation, and many of its other features can also be integrated into Elasticsearch.
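One hint of this is already visible in the response above: the `type` field appears to carry HanLP's part-of-speech tag for each token. That reading is an assumption based on the output; in HanLP's tag set, for example, `ns` denotes a place name. As a minimal sketch of how client code might use these tags, the snippet below filters the tokens by type and checks that their offsets cover the original text without gaps. The helper names `tokens_of_type` and `covers_text` are made up for this illustration.

```python
TEXT = "重慶華龍網海數科技有限公司"

# Tokens copied from the hanlp analyzer output shown above.
hanlp_tokens = [
    {"token": "重慶", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0},
    {"token": "華龍網", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1},
    {"token": "海數", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2},
    {"token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3},
    {"token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4},
]


def tokens_of_type(tokens, wanted_type):
    """Return the token strings whose type equals wanted_type."""
    return [t["token"] for t in tokens if t["type"] == wanted_type]


def covers_text(tokens, text):
    """Check that the token offsets tile the text with no gaps or overlaps."""
    cursor = 0
    for t in tokens:
        if t["start_offset"] != cursor:
            return False
        if text[t["start_offset"]:t["end_offset"]] != t["token"]:
            return False
        cursor = t["end_offset"]
    return cursor == len(text)


print(tokens_of_type(hanlp_tokens, "ns"))
print(covers_text(hanlp_tokens, TEXT))
```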

Source: the blog of 羊八井.
