使用HanLP增強Elasticsearch分詞功能

hanlp-ext 插件源碼地址:http://git.oschina.net/hualongdata/hanlp-exthttps://github.com/hualongdata/hanlp-ext

Elasticsearch 默認對中文分詞是按“字”進行分詞的,這是肯定不能達到我們進行分詞搜索的要求的。官方有一個 SmartCN 中文分詞插件,另外還有一個 IK 分詞插件使用也比較廣。但這裏,我們採用 HanLP 這款 自然語言處理工具 來進行中文分詞。

Elasticsearch
Elasticsearch 的默認分詞效果是慘不忍睹的。

GET /_analyze?pretty
{
  "text" : ["重慶華龍網海數科技有限公司"]
}

輸出:

{
"tokens": [

{
  "token": "重",
  "start_offset": 0,
  "end_offset": 1,
  "type": "<IDEOGRAPHIC>",
  "position": 0
},
{
  "token": "慶",
  "start_offset": 1,
  "end_offset": 2,
  "type": "<IDEOGRAPHIC>",
  "position": 1
},
{
  "token": "華",
  "start_offset": 2,
  "end_offset": 3,
  "type": "<IDEOGRAPHIC>",
  "position": 2
},
{
  "token": "龍",
  "start_offset": 3,
  "end_offset": 4,
  "type": "<IDEOGRAPHIC>",
  "position": 3
},
{
  "token": "網",
  "start_offset": 4,
  "end_offset": 5,
  "type": "<IDEOGRAPHIC>",
  "position": 4
},
{
  "token": "海",
  "start_offset": 5,
  "end_offset": 6,
  "type": "<IDEOGRAPHIC>",
  "position": 5
},
{
  "token": "數",
  "start_offset": 6,
  "end_offset": 7,
  "type": "<IDEOGRAPHIC>",
  "position": 6
},
{
  "token": "科",
  "start_offset": 7,
  "end_offset": 8,
  "type": "<IDEOGRAPHIC>",
  "position": 7
},
{
  "token": "技",
  "start_offset": 8,
  "end_offset": 9,
  "type": "<IDEOGRAPHIC>",
  "position": 8
},
{
  "token": "有",
  "start_offset": 9,
  "end_offset": 10,
  "type": "<IDEOGRAPHIC>",
  "position": 9
},
{
  "token": "限",
  "start_offset": 10,
  "end_offset": 11,
  "type": "<IDEOGRAPHIC>",
  "position": 10
},
{
  "token": "公",
  "start_offset": 11,
  "end_offset": 12,
  "type": "<IDEOGRAPHIC>",
  "position": 11
},
{
  "token": "司",
  "start_offset": 12,
  "end_offset": 13,
  "type": "<IDEOGRAPHIC>",
  "position": 12
}

]
}
可以看到,默認是按字進行分詞的。

elasticsearch-hanlp
HanLP

HanLP 是一款使用 Java 實現的優秀的,具有如下功能:

中文分詞
詞性標註
命名實體識別
關鍵詞提取
自動摘要
短語提取
拼音轉換
簡繁轉換
文本推薦
依存句法分析
語料庫工具
安裝 elasticsearch-hanlp(安裝見:https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin)插件以後,我們再來看看分詞效果。

GET /_analyze?pretty
{
  "analyzer" : "hanlp",
  "text" : ["重慶華龍網海數科技有限公司"]
}

輸出:

{
"tokens": [

{
  "token": "重慶",
  "start_offset": 0,
  "end_offset": 2,
  "type": "ns",
  "position": 0
},
{
  "token": "華龍網",
  "start_offset": 2,
  "end_offset": 5,
  "type": "nr",
  "position": 1
},
{
  "token": "海數",
  "start_offset": 5,
  "end_offset": 7,
  "type": "nr",
  "position": 2
},
{
  "token": "科技",
  "start_offset": 7,
  "end_offset": 9,
  "type": "n",
  "position": 3
},
{
  "token": "有限公司",
  "start_offset": 9,
  "end_offset": 13,
  "type": "nis",
  "position": 4
}

]
}
HanLP 的功能不止簡單的中文分詞,有很多功能都可以集成到 Elasticsearch 中。

文章來源於羊八井的博客

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章