Elasticsearch: The Difference Between edge_ngram and ngram

Preface

This article is based on Elasticsearch 7.3.0.

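Overview

edge_ngram and ngram are both built into Elasticsearch, and each is available as a tokenizer and as a token filter. The example in this article uses the token filter form.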

Example

Steps

  1. Define two custom analyzers, edge_ngram_analyzer and ngram_analyzer
  2. Run an analysis test with each of them

Create the test index

PUT analyzer_test
{
  "settings": {
    "refresh_interval": "1s",
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "keyword",
          "filter": [
            "edge_ngram_filter"
          ]
        },
        "ngram_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "keyword",
          "filter": [
            "ngram_filter"
          ]
        }
      },
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 11
        },
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      }
    }
  }
}
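
A note on the `max_ngram_diff` setting in the request above: as far as I know, Elasticsearch 7.x rejects an `ngram` tokenizer or filter whose `max_gram - min_gram` exceeds `index.max_ngram_diff`, and that setting defaults to 1. The Python sketch below only mirrors the arithmetic. The function name `ngram_diff_allowed` is hypothetical and is not part of any Elasticsearch client.

```python
# Assumed default of index.max_ngram_diff in Elasticsearch 7.x.
DEFAULT_MAX_NGRAM_DIFF = 1


def ngram_diff_allowed(min_gram: int, max_gram: int,
                       max_ngram_diff: int = DEFAULT_MAX_NGRAM_DIFF) -> bool:
    """Hypothetical helper: is the min_gram/max_gram spread within the limit?"""
    return max_gram - min_gram <= max_ngram_diff


# ngram_filter above uses min_gram=2, max_gram=5 (a spread of 3).
print(ngram_diff_allowed(2, 5))      # spread 3 against the assumed default of 1
print(ngram_diff_allowed(2, 5, 10))  # spread 3 against the value set in this index
```

Under that assumption the first call returns False and the second returns True, which is why the index raises the limit before it defines ngram_filter.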

Test the edge_ngram_analyzer analyzer

POST /analyzer_test/_analyze
{
  "text": "虹橋機場",
  "analyzer": "edge_ngram_analyzer"
}

{
  "tokens" : [
    {
      "token" : "虹",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
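
The token list above can be reproduced with a few lines of Python. This is only a sketch of the observed behaviour, not Elasticsearch's implementation. The function name `edge_ngrams` is hypothetical, and the sketch assumes that the whole input arrives as a single token (as with the `keyword` tokenizer) and that one Python code point counts as one character.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 11) -> List[str]:
    """Hypothetical helper: prefixes of `token` with lengths min_gram..max_gram."""
    # A gram cannot be longer than the token itself, so cap the length.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


print(edge_ngrams("虹橋機場"))
# ['虹', '虹橋', '虹橋機', '虹橋機場']
```

Every gram starts at offset 0, so each emitted term is a prefix of the original text.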

Test the ngram_analyzer analyzer

POST /analyzer_test/_analyze
{
  "text": "虹橋機場",
  "analyzer": "ngram_analyzer"
}

{
  "tokens" : [
    {
      "token" : "虹橋",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "虹橋機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "橋機",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "橋機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "機場",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
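
The same kind of sketch works for the ngram output above. Again, this is an illustration under assumptions rather than Elasticsearch's code: `ngrams` is a hypothetical name, the input is treated as one token, and the defaults mirror the `min_gram` and `max_gram` values of ngram_filter.

```python
from typing import List


def ngrams(token: str, min_gram: int = 2, max_gram: int = 5) -> List[str]:
    """Hypothetical helper: all substrings of `token` with lengths min_gram..max_gram."""
    grams = []
    # Slide the start position over every character of the token.
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(token):
                break  # longer grams from this start would run past the end
            grams.append(token[start:end])
    return grams


print(ngrams("虹橋機場"))
# ['虹橋', '虹橋機', '虹橋機場', '橋機', '橋機場', '機場']
```

Unlike the edge variant, the outer loop moves the start position, which is where the extra tokens such as 橋機 and 機場 come from.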

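Differences

  • edge_ngram starts only from the first character and emits grams of increasing length, from min_gram up to max_gram, so every token is a prefix of the input. It suits prefix-matching scenarios, for example searching order numbers, phone numbers, or postal codes.
  • ngram starts from every character position and emits grams of each length from min_gram to max_gram, so its tokens cover substrings anywhere in the input. It suits both prefix and infix (substring) searches.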
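
To make the prefix versus infix distinction concrete, here is a small Python sketch. It assumes the search text is looked up as a single, unanalyzed term against the indexed grams. The helper names `prefix_terms` and `substring_terms` are hypothetical, and the gram ranges are the ones used by the two filters in this article.

```python
from typing import Set


def prefix_terms(text: str, min_gram: int = 1, max_gram: int = 11) -> Set[str]:
    """Hypothetical helper: terms an edge_ngram-style filter would index."""
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}


def substring_terms(text: str, min_gram: int = 2, max_gram: int = 5) -> Set[str]:
    """Hypothetical helper: terms an ngram-style filter would index."""
    return {
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    }


indexed = "虹橋機場"
for query in ("虹橋", "機場"):
    # A query matches when it equals one of the indexed terms.
    print(query, query in prefix_terms(indexed), query in substring_terms(indexed))
# 虹橋 True True
# 機場 False True
```

The prefix 虹橋 is found by both, while 機場, which sits in the middle and at the end of the text, is found only among the ngram terms.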