Elasticsearch analyzed search: how a match query matches tokens

1. Simulating how the string is stored

localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=全能片(前)---TRW-GDB7891AT剎車片自帶報警線,無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片
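  •     The index is `yigo-redist.1`.
  •     The analyzer used is `default`, as defined in the index `yigo-redist.1`.
  •     The string being analyzed (`text`) is "全能片(前)---TRW-GDB7891AT剎車片自帶報警線,無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片" (a brake-pad product description).

Suppose the response is: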
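The request above can also be assembled in code, which takes care of percent-encoding the Chinese text. The following is a minimal Python sketch, assuming the query-string form of `_analyze` used in this article (newer Elasticsearch releases may require a JSON request body instead). `build_analyze_url` is a hypothetical helper name, not part of any Elasticsearch client.

```python
from urllib.parse import quote, urlencode


def build_analyze_url(host: str, index: str, analyzer: str, text: str) -> str:
    """Build an _analyze URL in the query-string style used in this article.

    Hypothetical helper: it only builds the URL and sends nothing.
    """
    # urlencode percent-encodes non-ASCII text such as the Chinese description.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"http://{host}/{quote(index)}/_analyze?{query}"


url = build_analyze_url("localhost:9200", "yigo-redist.1", "default", "gdb7891")
print(url)
```

Passing the full product description as `text` yields the same URL shape with the Chinese characters percent-encoded.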

{
  "tokens" : [ {
    "token" : "全能",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "片",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "前",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_CHAR",
    "position" : 3
  }, {
    "token" : "trw-gdb7891at",
    "start_offset" : 9,
    "end_offset" : 22,
    "type" : "LETTER",
    "position" : 4
  }, {
    "token" : "剎車片",
    "start_offset" : 22,
    "end_offset" : 25,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "自帶",
    "start_offset" : 25,
    "end_offset" : 27,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "報警",
    "start_offset" : 27,
    "end_offset" : 29,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "線",
    "start_offset" : 29,
    "end_offset" : 30,
    "type" : "CN_CHAR",
    "position" : 8
  }, {
    "token" : "無",
    "start_offset" : 31,
    "end_offset" : 32,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "單獨",
    "start_offset" : 32,
    "end_offset" : 34,
    "type" : "CN_WORD",
    "position" : 10
  }, {
    "token" : "報警",
    "start_offset" : 34,
    "end_offset" : 36,
    "type" : "CN_WORD",
    "position" : 11
  }, {
    "token" : "線",
    "start_offset" : 36,
    "end_offset" : 37,
    "type" : "CN_CHAR",
    "position" : 12
  }, {
    "token" : "號碼",
    "start_offset" : 37,
    "end_offset" : 39,
    "type" : "CN_WORD",
    "position" : 13
  }, {
    "token" : "卡",
    "start_offset" : 40,
    "end_offset" : 41,
    "type" : "CN_CHAR",
    "position" : 14
  }, {
    "token" : "仕",
    "start_offset" : 41,
    "end_offset" : 42,
    "type" : "CN_WORD",
    "position" : 15
  }, {
    "token" : "歐",
    "start_offset" : 42,
    "end_offset" : 43,
    "type" : "CN_WORD",
    "position" : 16
  }, {
    "token" : "卡",
    "start_offset" : 44,
    "end_offset" : 45,
    "type" : "CN_CHAR",
    "position" : 17
  }, {
    "token" : "仕",
    "start_offset" : 45,
    "end_offset" : 46,
    "type" : "CN_WORD",
    "position" : 18
  }, {
    "token" : "歐",
    "start_offset" : 46,
    "end_offset" : 47,
    "type" : "CN_WORD",
    "position" : 19
  }, {
    "token" : "乘用車",
    "start_offset" : 48,
    "end_offset" : 51,
    "type" : "CN_WORD",
    "position" : 20
  }, {
    "token" : "剎車片",
    "start_offset" : 52,
    "end_offset" : 55,
    "type" : "CN_WORD",
    "position" : 21
  } ]
}
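For the comparisons later in this article, only the `token` strings of such a response matter. A small sketch of pulling them out of the parsed JSON follows; `token_strings` is a hypothetical helper, and the sample is a three-token excerpt of the response above rather than the full output.

```python
import json
from typing import List


def token_strings(analyze_response: dict) -> List[str]:
    """Return the token texts of an _analyze response, in order."""
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


# Excerpt of the step 1 response (three of the tokens shown above).
sample = json.loads("""
{
  "tokens": [
    {"token": "全能", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1},
    {"token": "片", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 2},
    {"token": "trw-gdb7891at", "start_offset": 9, "end_offset": 22, "type": "LETTER", "position": 4}
  ]
}
""")

tokens = token_strings(sample)
print(tokens)
```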

2. Querying with the keyword

localhost:9200/yigo-redist.1/_analyze?analyzer=default_search&text=gdb7891

  •    The index is `yigo-redist.1`.
  •    The analyzer used is `default_search`, as defined in the index `yigo-redist.1`.
  •    The string being analyzed (`text`) is "gdb7891".
Response:
{
  "tokens" : [ {
    "token" : "gdb7891",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  } ]
}

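3. Analyzing the keyword with the index-time analyzer

localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=gdb7891

  •      The index is `yigo-redist.1`.
  •      The analyzer used is `default`, as defined in the index `yigo-redist.1` (the same analyzer as in step 1).
  •      The string being analyzed (`text`) is "gdb7891".
Response: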
{
  "tokens" : [ {
    "token" : "gdb7891",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  }, {
    "token" : "",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  }, {
    "token" : "gdb7891",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  }, {
    "token" : "",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "ENGLISH",
    "position" : 2
  }, {
    "token" : "gdb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "ENGLISH",
    "position" : 2
  }, {
    "token" : "gdb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "ENGLISH",
    "position" : 2
  }, {
    "token" : "7891",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "ARABIC",
    "position" : 3
  }, {
    "token" : "7891",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "ARABIC",
    "position" : 3
  }, {
    "token" : "",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "ARABIC",
    "position" : 3
  } ]
}
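The response above contains repeated and empty tokens, so it is easier to read once reduced to its distinct, non-empty tokens. A short sketch of that clean-up follows; `distinct_tokens` is a hypothetical helper, and the input list is copied from the `token` values of the response above in the same order.

```python
from typing import Iterable, List


def distinct_tokens(tokens: Iterable[str]) -> List[str]:
    """Drop empty tokens and duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for token in tokens:
        if token and token not in seen:
            seen.add(token)
            result.append(token)
    return result


# Token values of the step 3 response, in order of appearance.
step3_tokens = ["gdb7891", "", "gdb7891", "", "gdb", "gdb", "7891", "7891", ""]

unique = distinct_tokens(step3_tokens)
print(unique)
```

What remains is the whole keyword plus its letter part and its digit part.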

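Summary

  •     Step 1 shows that the stored string "全能片(前)---TRW-GDB7891AT剎車片自帶報警線,無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片" is split into many tokens, which are then stored in the index.
  •     Step 2 shows that the search analyzer (`default_search`) does not split the keyword "gdb7891": the only token available for lookup is "gdb7891". The tokens from step 1 contain no "gdb7891" token; the closest one is "trw-gdb7891at". An ordinary match query therefore cannot match the document indexed in step 1.
  •     Step 3 shows that with the same analyzer as at index time, "gdb7891" is split into "gdb", "7891", and so on, and either of these two tokens can find the document indexed in step 1. Because the keyword is split, however, the query also returns many more matches: anything matching "gdb", anything matching "7891", and anything matching "gdb7891".
  •     To retrieve the step 1 document while keeping the search analyzer (`default_search`), use a wildcard query with "*gdb7891*", which does match: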
{
      "query": {
            "wildcard" : { "description" : "*gdb7891*" }
      }
}
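The difference between the two lookups can be illustrated with a toy model. This is not how Elasticsearch is implemented, only a sketch of the idea: a match query looks up whole analyzed terms, while a wildcard pattern is tested against each indexed term. The function names are hypothetical, the term set is copied from the step 1 response, and Python's `fnmatchcase` stands in for wildcard matching (it also treats `[...]` specially, which does not matter for this pattern).

```python
from fnmatch import fnmatchcase
from typing import Iterable, Set

# Distinct terms produced in step 1 for the stored description.
INDEXED_TERMS: Set[str] = {
    "全能", "片", "前", "trw-gdb7891at", "剎車片", "自帶", "報警", "線",
    "無", "單獨", "號碼", "卡", "仕", "歐", "乘用車",
}


def match_hits(query_terms: Iterable[str], indexed_terms: Set[str]) -> bool:
    """Toy match query: hit if any analyzed query term equals an indexed term."""
    return any(term in indexed_terms for term in query_terms)


def wildcard_hits(pattern: str, indexed_terms: Set[str]) -> bool:
    """Toy wildcard query: hit if the pattern matches any single indexed term."""
    return any(fnmatchcase(term, pattern) for term in indexed_terms)


# Step 2: default_search leaves the keyword as the single term "gdb7891".
print(match_hits(["gdb7891"], INDEXED_TERMS))      # no indexed term equals it
print(wildcard_hits("*gdb7891*", INDEXED_TERMS))   # "trw-gdb7891at" contains it
```

In this model the exact-term lookup misses and the wildcard pattern finds "trw-gdb7891at". A pattern with a leading `*` has to be tested against every term, which is why such queries tend to be slow on large indices.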


  