Analysis of the match-query matching process in Elasticsearch tokenized search

1. Simulating how the string data is stored

localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=全能片(前)---TRW-GDB7891AT剎車片自帶報警線，無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片

The index is `yigo-redist.1`.
The analyzer (`analyzer`) is `default`, as defined in the index `yigo-redist.1`.
The string being analyzed (`text`) is "全能片(前)---TRW-GDB7891AT剎車片自帶報警線，無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片".
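
As a minimal sketch, the `_analyze` request described above can be assembled in Python. The helper name `build_analyze_url` and the `http://` scheme are assumptions made for illustration; the host, index, and analyzer names are taken from the request in this post. The query-string form shown here matches older Elasticsearch releases; newer releases generally expect the analyzer and text in a JSON request body.

```python
from urllib.parse import urlencode


def build_analyze_url(host, index, analyzer, text):
    """Build the query-string form of an _analyze request (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text, so Chinese input is safe in a URL.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"http://{host}/{index}/_analyze?{query}"


url = build_analyze_url("localhost:9200", "yigo-redist.1", "default", "gdb7891")
print(url)
```

The same helper can be called with `default_search` as the analyzer to reproduce the search-side request.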

Suppose the `_analyze` request returns:

{
  "tokens" : [ {
    "token" : "全能",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "片",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "前",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_CHAR",
    "position" : 3
  }, {
    "token" : "trw-gdb7891at",
    "start_offset" : 9,
    "end_offset" : 22,
    "type" : "LETTER",
    "position" : 4
  }, {
    "token" : "剎車片",
    "start_offset" : 22,
    "end_offset" : 25,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "自帶",
    "start_offset" : 25,
    "end_offset" : 27,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "報警",
    "start_offset" : 27,
    "end_offset" : 29,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "線",
    "start_offset" : 29,
    "end_offset" : 30,
    "type" : "CN_CHAR",
    "position" : 8
  }, {
    "token" : "無",
    "start_offset" : 31,
    "end_offset" : 32,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "單獨",
    "start_offset" : 32,
    "end_offset" : 34,
    "type" : "CN_WORD",
    "position" : 10
  }, {
    "token" : "報警",
    "start_offset" : 34,
    "end_offset" : 36,
    "type" : "CN_WORD",
    "position" : 11
  }, {
    "token" : "線",
    "start_offset" : 36,
    "end_offset" : 37,
    "type" : "CN_CHAR",
    "position" : 12
  }, {
    "token" : "號碼",
    "start_offset" : 37,
    "end_offset" : 39,
    "type" : "CN_WORD",
    "position" : 13
  }, {
    "token" : "卡",
    "start_offset" : 40,
    "end_offset" : 41,
    "type" : "CN_CHAR",
    "position" : 14
  }, {
    "token" : "仕",
    "start_offset" : 41,
    "end_offset" : 42,
    "type" : "CN_WORD",
    "position" : 15
  }, {
    "token" : "歐",
    "start_offset" : 42,
    "end_offset" : 43,
    "type" : "CN_WORD",
    "position" : 16
  }, {
    "token" : "卡",
    "start_offset" : 44,
    "end_offset" : 45,
    "type" : "CN_CHAR",
    "position" : 17
  }, {
    "token" : "仕",
    "start_offset" : 45,
    "end_offset" : 46,
    "type" : "CN_WORD",
    "position" : 18
  }, {
    "token" : "歐",
    "start_offset" : 46,
    "end_offset" : 47,
    "type" : "CN_WORD",
    "position" : 19
  }, {
    "token" : "乘用車",
    "start_offset" : 48,
    "end_offset" : 51,
    "type" : "CN_WORD",
    "position" : 20
  }, {
    "token" : "剎車片",
    "start_offset" : 52,
    "end_offset" : 55,
    "type" : "CN_WORD",
    "position" : 21
  } ]
}
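
To make the index side concrete, the sketch below pulls the token strings out of an `_analyze` response and checks which terms exist. The embedded JSON is an abbreviated copy of the response above (only four tokens, with the offset and type fields omitted), and the function name `indexed_terms` is an assumption for illustration.

```python
import json

# Abbreviated copy of the step 1 response: only a few tokens are kept.
analyze_response = json.loads("""
{
  "tokens": [
    {"token": "全能", "position": 1},
    {"token": "trw-gdb7891at", "position": 4},
    {"token": "剎車片", "position": 5},
    {"token": "乘用車", "position": 20}
  ]
}
""")


def indexed_terms(response):
    """Return the set of distinct token strings in an _analyze response."""
    return {tok["token"] for tok in response["tokens"]}


terms = indexed_terms(analyze_response)
print("trw-gdb7891at" in terms)  # True: the part number is indexed as one term
print("gdb7891" in terms)        # False: no term equals the bare keyword
```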

2. Keyword query

localhost:9200/yigo-redist.1/_analyze?analyzer=default_search&text=gdb7891

The index is `yigo-redist.1`.
The analyzer (`analyzer`) is `default_search`, as defined in the index `yigo-redist.1`.
The string being analyzed (`text`) is "gdb7891".

Result:

{
  "tokens" : [ {
    "token" : "gdb7891",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  } ]
}

3. Analyzing the keyword with the index-time analyzer

localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=gdb7891

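The index is `yigo-redist.1`.
The analyzer (`analyzer`) is `default`, as defined in the index `yigo-redist.1`; this is the same analyzer used for stored data in step 1.
The string being analyzed (`text`) is "gdb7891".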

Result:

{
  "tokens" : [ {
    "token" : "gdb7891",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  }, {
    "token" : "",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  }, {
    "token" : "gdb7891",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "LETTER",
    "position" : 1
  }, {
    "token" : "",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "ENGLISH",
    "position" : 2
  }, {
    "token" : "gdb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "ENGLISH",
    "position" : 2
  }, {
    "token" : "gdb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "ENGLISH",
    "position" : 2
  }, {
    "token" : "7891",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "ARABIC",
    "position" : 3
  }, {
    "token" : "7891",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "ARABIC",
    "position" : 3
  }, {
    "token" : "",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "ARABIC",
    "position" : 3
  } ]
}
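
The response above contains repeated and empty tokens, so it helps to reduce it to the distinct terms a query would actually look up. The sketch below does this; the function name `distinct_terms` is an assumption for illustration, and the token list keeps only the `token` and `position` fields from the response.

```python
def distinct_terms(tokens):
    """Return non-empty token strings in first-seen order, without repeats."""
    seen = []
    for tok in tokens:
        text = tok["token"]
        if text and text not in seen:
            seen.append(text)
    return seen


# Tokens from the step 3 response, in the order they were returned.
step3_tokens = [
    {"token": "gdb7891", "position": 1},
    {"token": "", "position": 1},
    {"token": "gdb7891", "position": 1},
    {"token": "", "position": 2},
    {"token": "gdb", "position": 2},
    {"token": "gdb", "position": 2},
    {"token": "7891", "position": 3},
    {"token": "7891", "position": 3},
    {"token": "", "position": 3},
]

print(distinct_terms(step3_tokens))  # ['gdb7891', 'gdb', '7891']
```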

Summary

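Step 1 shows that the stored text "全能片(前)---TRW-GDB7891AT剎車片自帶報警線，無單獨報警線號碼,卡仕歐,卡仕歐,乘用車,剎車片" is split into many token fragments, which are then stored in the index.
Step 2 shows that the search analyzer (`default_search`) does not split the keyword "gdb7891", so the only fragment available for lookup is "gdb7891". None of the fragments produced in step 1 is "gdb7891"; the closest is "trw-gdb7891at". An ordinary match-query therefore cannot match the data indexed in step 1.
Step 3 shows that with the same analyzer as at index time, "gdb7891" is split into "gdb", "7891", and so on. These fragments can find the data indexed in step 1, provided the index actually holds the same fragments for the stored text. Because the keyword has been split, the query also returns more matches: documents matching "gdb", documents matching "7891", and documents matching "gdb7891".
To retrieve the step 1 data while still using the search analyzer (`default_search`), use a wildcard-query: the pattern "*gdb7891*" matches the indexed term "trw-gdb7891at".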

{
      "query": {
            "wildcard" : { "description" : "*gdb7891*" }
      }
}
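
The difference between the two query types can be simulated in plain Python, without a running cluster. This is only a rough model under stated assumptions: a match-query is treated as an exact lookup of each query term in the set of indexed terms, and the wildcard is approximated with `fnmatch.fnmatchcase`, whose pattern syntax is broader than Elasticsearch's `*` and `?`. The function names are hypothetical, and the term set is the distinct tokens from step 1.

```python
from fnmatch import fnmatchcase

# Distinct terms produced by the index-time analyzer in step 1.
indexed = {
    "全能", "片", "前", "trw-gdb7891at", "剎車片", "自帶", "報警", "線",
    "無", "單獨", "號碼", "卡", "仕", "歐", "乘用車",
}


def match_query(query_terms, terms):
    """Model a match-query: a hit requires a query term equal to an indexed term."""
    return any(q in terms for q in query_terms)


def wildcard_query(pattern, terms):
    """Model a wildcard-query: a hit requires an indexed term matching the pattern."""
    return any(fnmatchcase(term, pattern) for term in terms)


print(match_query(["gdb7891"], indexed))      # False: no exact term "gdb7891"
print(wildcard_query("*gdb7891*", indexed))   # True: matches "trw-gdb7891at"
```

The wildcard succeeds because it is compared against each indexed term as a whole string, so "gdb7891" only needs to appear somewhere inside "trw-gdb7891at".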
