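Preface
This article is based on Elasticsearch 7.3.0.
Overview
edge_ngram and ngram are built into Elasticsearch, and each is available both as a tokenizer and as a token filter. The example in this article uses the token filter form of each.
Example
Steps
- Define two custom analyzers, edge_ngram_analyzer and ngram_analyzer
- Run tokenization tests against each analyzer
Create the test index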
PUT analyzer_test
{
  "settings": {
    "refresh_interval": "1s",
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "keyword",
          "filter": [
            "edge_ngram_filter"
          ]
        },
        "ngram_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "keyword",
          "filter": [
            "ngram_filter"
          ]
        }
      },
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 11
        },
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      }
    }
  }
}
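The index.max_ngram_diff setting in the request above is there because ngram_filter uses min_gram 2 and max_gram 5, a difference of 3, while the default limit is 1. The rule can be sketched in Python as follows. The function name check_ngram_diff and the error text are hypothetical and are not taken from Elasticsearch; the sketch assumes the limit applies only to ngram, not to edge_ngram.

```python
def check_ngram_diff(min_gram: int, max_gram: int, max_ngram_diff: int = 1) -> None:
    """Raise ValueError if max_gram - min_gram exceeds the allowed difference.

    Hypothetical helper that mirrors the documented rule for the ngram
    tokenizer and token filter; the default of 1 is the documented default
    of index.max_ngram_diff.
    """
    diff = max_gram - min_gram
    if diff > max_ngram_diff:
        raise ValueError(
            f"max_gram - min_gram is {diff}, which exceeds max_ngram_diff={max_ngram_diff}"
        )


# ngram_filter from the index above: accepted because the limit was raised to 10.
check_ngram_diff(2, 5, max_ngram_diff=10)

# The same filter under the default limit would be refused.
try:
    check_ngram_diff(2, 5)
except ValueError as err:
    print("rejected:", err)
```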
Test the edge_ngram_analyzer analyzer
POST /analyzer_test/_analyze
{
  "text": "虹橋機場",
  "analyzer": "edge_ngram_analyzer"
}
Response:
{
  "tokens" : [
    {
      "token" : "虹",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
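The four tokens above can be reproduced with a minimal Python sketch of what an edge n-gram filter does to a single token. The function name edge_ngrams is hypothetical and this is not Elasticsearch's implementation. It assumes the keyword tokenizer has passed the whole input through as one token and that lengths are counted in characters.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """Return the prefixes of `token` with lengths from min_gram to max_gram.

    Lengths beyond the end of the token are skipped, which is why
    max_gram=11 still yields only four grams for a four-character input.
    """
    longest = min(max_gram, len(token))
    return [token[:length] for length in range(min_gram, longest + 1)]


# Same parameters as edge_ngram_filter in the index settings.
print(edge_ngrams("虹橋機場", 1, 11))
# → ['虹', '虹橋', '虹橋機', '虹橋機場']
```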
Test the ngram_analyzer analyzer
POST /analyzer_test/_analyze
{
  "text": "虹橋機場",
  "analyzer": "ngram_analyzer"
}
Response:
{
  "tokens" : [
    {
      "token" : "虹橋",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "橋機",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "橋機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
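The six tokens above, in the same order, can be reproduced with a minimal Python sketch of an n-gram filter applied to a single token. The function name ngrams is hypothetical and this is not Elasticsearch's implementation. It assumes the whole input arrives as one token from the keyword tokenizer and that lengths are counted in characters.

```python
from typing import List


def ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """Return every substring of `token` with length min_gram..max_gram.

    Grams are emitted start position by start position, shortest first,
    which matches the order of the tokens in the response above.
    """
    grams = []
    for start in range(len(token)):
        for length in range(min_gram, max_gram + 1):
            if start + length > len(token):
                break  # longer grams from this start would run past the end
            grams.append(token[start:start + length])
    return grams


# Same parameters as ngram_filter in the index settings.
print(ngrams("虹橋機場", 2, 5))
# → ['虹橋', '虹橋機', '虹橋機場', '橋機', '橋機場', '機場']
```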
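Differences
- edge_ngram produces grams anchored at the first character only, one for each length from min_gram to max_gram, so every token is a prefix of the input. It suits prefix matching, for example searching order numbers, phone numbers, or postal codes.
- ngram produces grams starting at every character, again one for each length from min_gram to max_gram, so the tokens cover substrings anywhere in the input. It suits both prefix and infix (middle-of-string) matching.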
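The practical effect of this difference can be sketched in Python by checking which search terms appear among the indexed grams. The helper grams and its edge_only flag are hypothetical. The sketch assumes the search term is compared as a whole against the indexed grams, as it would be if the query text were not itself split into grams.

```python
from typing import Set


def grams(token: str, min_gram: int, max_gram: int, edge_only: bool) -> Set[str]:
    """Collect grams of length min_gram..max_gram from `token`.

    With edge_only=True only grams starting at the first character are kept
    (edge_ngram behaviour); otherwise every start position is used (ngram).
    """
    starts = [0] if edge_only else range(len(token))
    return {
        token[start:start + length]
        for start in starts
        for length in range(min_gram, max_gram + 1)
        if start + length <= len(token)
    }


indexed_edge = grams("虹橋機場", 1, 11, edge_only=True)   # edge_ngram_filter settings
indexed_ngram = grams("虹橋機場", 2, 5, edge_only=False)  # ngram_filter settings

for term in ["虹橋", "機場"]:
    print(term, "edge_ngram:", term in indexed_edge, "ngram:", term in indexed_ngram)
# → 虹橋 edge_ngram: True ngram: True
# → 機場 edge_ngram: False ngram: True
```

The prefix 虹橋 is found in both sets, while 機場, which starts in the middle of the input, is found only among the ngram tokens.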