Like tying knots to keep records — write it down, think it through, and grow~
Environment: Elasticsearch 7.x
1. A term query against an analyzed text field returns no results
For example, I want to search for 中國 and match the record 我是中國人, but the query below finds nothing:
"query": {
"term": {
"title": "中國"
}
}
Cause: no analyzer was specified when the mapping was created. A text field is analyzed before it is stored in ES, building the inverted index, but if the field only declares "type": "text", the default (standard) analyzer splits Chinese text into individual characters: 我, 是, 中, 國, 人 (i.e., one token per character — see the output in section 2). A term query is not analyzed, so the two-character query 中國 is compared verbatim against those single-character tokens and matches none of them. The fix is to specify a Chinese analyzer in the mapping.
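As an aside, even without changing the mapping, a match query would find the document, because match analyzes the query string with the field's own analyzer: 中國 is split into 中 and 國, both of which exist in the inverted index. A sketch of that request (same index and field as above):

```json
POST diary/_search
{
  "query": {
    "match": {
      "title": "中國"
    }
  }
}
```

This works for full-text search, but the tokens 中 and 國 would also match unrelated documents containing those characters separately, which is why a proper Chinese analyzer is still the better fix.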
# 1 Build the mapping
PUT diary
{
"settings": {
"number_of_shards": "4",
"number_of_replicas": "1"
},
"mappings": {
"properties": {
"title": {
"type": "text",
# specify the analyzer!!! otherwise the text is split into single characters
"analyzer": "ik_max_word"
}
}
}
}
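Note that ik_max_word (and ik_smart, used in section 3) come from the elasticsearch-analysis-ik plugin, not from Elasticsearch itself; the PUT above fails with an unknown-analyzer error if the plugin is missing. A sketch of the install step — the URL and version below are placeholders, and the plugin version must match your ES version exactly:

```shell
# Install the IK analysis plugin on every node, then restart.
# Replace v7.x.x with the release matching your ES version (placeholder URL).
bin/elasticsearch-plugin install \
  https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.x.x/elasticsearch-analysis-ik-7.x.x.zip
```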
# 2 Index a document
POST diary/_doc/111
{
"title": "我是中國人"
}
# 3 term query
POST diary/_search
{
"query": {
"bool": {
"must": {
"term": {
"title": "中國"
}
}
}
}
}
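For completeness: if the goal is exact whole-value matching rather than full-text search, the usual approach is a keyword sub-field queried with term. A sketch (re-creating the index, since an existing field's analyzer cannot be changed in place; `ignore_above` is an optional common default):

```json
PUT diary
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

# term on the keyword sub-field matches only the exact full string
POST diary/_search
{
  "query": {
    "term": { "title.keyword": "我是中國人" }
  }
}
```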
2. How to inspect analysis results
# 2.1 No analyzer specified
POST diary/_analyze
{
"text": "我是中國人"
}
# Analysis output
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "中",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "國",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "人",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
}
]
}
# 2.2 Analyzer specified
POST diary/_analyze
{
"text": "我是中國人",
"analyzer": "ik_max_word"
}
# Output
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "中國人",
"start_offset": 2,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "中國",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "國人",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
}
]
}
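To check which analyzer a mapped field actually uses (rather than naming one explicitly), _analyze also accepts a field parameter, and ES applies that field's configured analyzer:

```json
POST diary/_analyze
{
  "field": "title",
  "text": "我是中國人"
}
```

With the mapping from section 1, this should produce the same tokens as the ik_max_word output above.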
3. Difference between ik_smart and ik_max_word
ik_smart: coarse-grained segmentation; when one token contains another, only the longest token is kept.
POST diary/_analyze
{
"text": "我是中國人",
"analyzer": "ik_smart"
}
# Analysis result
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "中國人",
"start_offset": 2,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
}
]
}
ik_max_word: fine-grained segmentation; all candidate tokens are emitted, even when one contains another.
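A common convention (not from this post, just widely used practice) is to combine the two: ik_max_word at index time for maximum recall, ik_smart at search time so the query string is not exploded into overlapping tokens. A sketch of such a mapping:

```json
PUT diary
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```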