未安裝分詞插件之前只能使用默認的分詞規則:
1)普通分詞
GET _analyze
{
"text": ["他是一個前端開發工程師"],
"analyzer": "standard"
}
或者:
2)全文分詞
GET _analyze
{
"text": ["他是一個前端開發工程師"],
"analyzer": "keyword"
}
分詞結果:
{
"tokens" : [
{
"token" : "他",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "一",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "個",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "前",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "端",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "開",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "發",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "工",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "程",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 9
},
{
"token" : "師",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 10
}
]
}
使用analysis-ik分詞插件
1、安裝
[elasticsearch@txvm2019 bin]./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip
安裝完成後可以使用list查看是否成功,
[elasticsearch@txvm2019 bin]$ ./elasticsearch-plugin list
analysis-icu
analysis-ik
2、測試:
GET _analyze
{
"text": ["他是一個前端開發工程師"],
"analyzer": "ik_max_word"
}
分詞結果:
{
"tokens" : [
{
"token" : "他",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "一個",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "一",
"start_offset" : 2,
"end_offset" : 3,
"type" : "TYPE_CNUM",
"position" : 3
},
{
"token" : "個",
"start_offset" : 3,
"end_offset" : 4,
"type" : "COUNT",
"position" : 4
},
{
"token" : "前端",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "開發",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "工程師",
"start_offset" : 8,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "工程",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "師",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",
"position" : 9
}
]
}
3、說明:
IK分詞主要有以下兩種類型:
ik_max_word:將文本按最細粒度的組合來拆分;
ik_smart::最粗粒度的拆分;
如果在分詞的時候不添加分詞類別,Elasticsearch對於漢字默認使用standard只是將漢字拆分成一個個的漢字。