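Preface
When using ElasticSearch for search, the inverted index built from the text is critical. For Chinese passages, correct word segmentation at index time is therefore the top priority. This post shows how to install the IK Chinese analyzer in ElasticSearch (abbreviated as ES below).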
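Main content
Installation steps
Plugin download:
Extract and configure
Create a folder named ik under ES_HOME/plugins/
Extract the contents of the archive into ik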
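The two steps above can also be scripted. The following is a minimal Python sketch, not an official installer: `install_plugin` is a hypothetical helper name, the archive path and ES home directory are placeholders you must supply, and it assumes the zip keeps the plugin files at its top level (if your archive wraps them in a directory, the contents of that directory are what belongs in ik).

```python
import zipfile
from pathlib import Path


def install_plugin(zip_path, es_home, name="ik"):
    """Extract a plugin archive into <es_home>/plugins/<name> and return that path."""
    target = Path(es_home) / "plugins" / name
    # Create plugins/ik if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

A call would look like `install_plugin("/path/to/ik.zip", "/path/to/ES_HOME")`, with both paths replaced by your own.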
Project file structure
Start ES
When ES is started now, the startup log should show that the ik analyzer has been loaded.
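Besides reading the startup log, the plugin list can be checked through the `_cat/plugins` endpoint. The sketch below is a minimal Python example under assumptions: the node address `http://localhost:9200` is a placeholder, the helper names are hypothetical, and the plugin is assumed to be listed under the name `analysis-ik` (check your own output if it differs).

```python
import urllib.request


def ik_installed(cat_plugins_text, plugin_name="analysis-ik"):
    """Return True if any line of `_cat/plugins` output lists the plugin name."""
    for line in cat_plugins_text.splitlines():
        # Each line is whitespace-separated: node name, plugin name, version.
        if plugin_name in line.split():
            return True
    return False


def fetch_cat_plugins(base_url="http://localhost:9200"):
    """Fetch the plain-text plugin list from a running node (assumed address)."""
    with urllib.request.urlopen(base_url + "/_cat/plugins") as resp:
        return resp.read().decode("utf-8")


# Against a running node: ik_installed(fetch_cat_plugins())
```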
Testing the tokenization results
Default tokenization (built-in english analyzer)
POST {{host}}:{{port}}/_analyze
{
"analyzer":"english",
"text":"使用搜索引擎"
}
Tokenization result:
{
"tokens": [
{
"token": "使",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "用",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "搜",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "索",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "引",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "擎",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
}
]
}
ik_smart tokenization
POST {{host}}:{{port}}/_analyze
{
"analyzer":"ik_smart",
"text":"使用搜索引擎"
}
{
"tokens": [
{
"token": "使用",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "搜索引擎",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
}
]
}
ik_max_word tokenization
POST {{host}}:{{port}}/_analyze
{
"analyzer":"ik_max_word",
"text":"使用搜索引擎"
}
{
"tokens": [
{
"token": "使用",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "搜索引擎",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": "搜索",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "索引",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 3
},
{
"token": "引擎",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
}
]
}
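To compare analyzers side by side, the three requests above can be scripted. This is a minimal Python sketch, assuming a node at `http://localhost:9200` (a placeholder); `analyze` and `token_texts` are hypothetical helper names, and the sample data is the ik_smart response shown earlier, trimmed to the fields used here.

```python
import json
import urllib.request


def token_texts(analyze_response):
    """Extract just the token strings from an `_analyze` response body."""
    return [t["token"] for t in analyze_response["tokens"]]


def analyze(text, analyzer, base_url="http://localhost:9200"):
    """POST to `_analyze` on a running node (the address is an assumption)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Trimmed copy of the ik_smart response from this post.
sample = {
    "tokens": [
        {"token": "使用", "start_offset": 0, "end_offset": 2},
        {"token": "搜索引擎", "start_offset": 2, "end_offset": 6},
    ]
}
print(token_texts(sample))  # ['使用', '搜索引擎']
```

Against a running node, `token_texts(analyze("使用搜索引擎", "ik_max_word"))` would return the token list for the other analyzer in the same compact form.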
Testing search with the analyzer
// Create the index
PUT {{host}}:{{port}}/news
// Create the mapping and set the analyzer
POST {{host}}:{{port}}/news/sports/_mapping
{
"properties":{
"content":{
"type":"text",
"analyzer":"ik_max_word",
"index":true
}
}
}
Import the data....
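The import step itself is not shown in this post. One possible way to do it is the `_bulk` API; the Python sketch below only builds the newline-delimited request body for the four sample documents that appear in the listing that follows. The helper name `bulk_body` is hypothetical, and letting ES generate the document IDs is an assumption.

```python
import json


def bulk_body(contents):
    """Build an NDJSON `_bulk` body: one action line plus one source line per document."""
    lines = []
    for content in contents:
        lines.append(json.dumps({"index": {}}))  # empty action: ES assigns the _id
        lines.append(json.dumps({"content": content}, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


docs = [
    "熱火形勢一片大好",
    "火箭98-99不敵凱爾特人,慘遭四連敗",
    "曼城18連勝,英超無人能擋",
    "巴薩3-0擊敗皇馬贏下國家德比,梅西一球一助再獲滿分",
]
body = bulk_body(docs)
```

The resulting body could then be sent with `POST {{host}}:{{port}}/news/sports/_bulk` and the header `Content-Type: application/x-ndjson`.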
Documents in the index
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "news",
"_type": "sports",
"_id": "AWCgyE7pGEKcCwwZuUe6",
"_score": 1,
"_source": {
"content": "熱火形勢一片大好"
}
},
{
"_index": "news",
"_type": "sports",
"_id": "AWCgx7fpGEKcCwwZuUe5",
"_score": 1,
"_source": {
"content": "火箭98-99不敵凱爾特人,慘遭四連敗"
}
},
{
"_index": "news",
"_type": "sports",
"_id": "AWCgyOLYGEKcCwwZuUe7",
"_score": 1,
"_source": {
"content": "曼城18連勝,英超無人能擋"
}
},
{
"_index": "news",
"_type": "sports",
"_id": "AWCgxyxXGEKcCwwZuUe4",
"_score": 1,
"_source": {
"content": "巴薩3-0擊敗皇馬贏下國家德比,梅西一球一助再獲滿分"
}
}
]
}
}
POST {{host}}:{{port}}/news/sports/_search
{
"query":{
"match":{
"content":"火箭隊新聞"
}
}
}
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.6099695,
"hits": [
{
"_index": "news",
"_type": "sports",
"_id": "AWCgx7fpGEKcCwwZuUe5",
"_score": 0.6099695,
"_source": {
"content": "火箭98-99不敵凱爾特人,慘遭四連敗"
}
}
]
}
}
POST {{host}}:{{port}}/news/sports/_search
{
"query":{
"match":{
"content":"火焰"
}
}
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
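The tokenization tests show that the Chinese analyzer splits the text to be searched into meaningful Chinese words rather than into individual characters.
The search tests show that relevant results are kept while irrelevant ones are filtered out, which makes search smarter.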
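Why word-level tokens filter out unrelated results can be illustrated with a toy inverted index. The sketch below is plain Python and not how ES is implemented internally; the document-side token lists are copied from the `_analyze` outputs earlier in this post, while the query `引用` and the way it is tokenized are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def match(index, query_tokens):
    """Return ids of documents sharing at least one token with the query (OR semantics)."""
    hits = set()
    for token in query_tokens:
        hits |= index.get(token, set())
    return hits


# The document "使用搜索引擎" tokenized two ways (token lists from this post).
char_index = build_index({1: ["使", "用", "搜", "索", "引", "擎"]})
word_index = build_index({1: ["使用", "搜索引擎", "搜索", "索引", "引擎"]})

# Hypothetical query "引用": split per character vs. kept as one word.
print(match(char_index, ["引", "用"]))  # {1}: hit on unrelated single characters
print(match(word_index, ["引用"]))      # set(): no shared word, so no hit
```

With per-character tokens, any query that happens to share a character with a document produces a hit; with word tokens, only shared words do.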
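References
The following articles explain tokenization in more depth. Consult them if you want more detail; this post does not go further.
How to install Chinese analyzers in Elasticsearch (IK + pinyin)
Details of ik tokenization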