Background
Chinese search very often needs pinyin search, e.g. for people's names, and this plugin is basically unavoidable.
Introduction
Plugin GitHub repo: elasticsearch-analysis-pinyin
The example at the end of its README is quite interesting: after a series of steps, with 劉德華 (pinyin: liu de hua) indexed, even oddball queries like liudh and 劉dh all find it. Why is that? Let's take a closer look.
Config as given in the README
Configure the analyzer (we apply it to our test index hjxtest_pinyin):
PUT /hjxtest_pinyin/
{
"settings" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_first_letter":true,
"keep_separate_first_letter" : true,
"keep_full_pinyin" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true
}
}
}
}
}
The key piece is the custom tokenizer my_pinyin. Its specific settings:
keep_first_letter: true — turns 劉德華 into ldh
keep_separate_first_letter: true — turns 劉德華 into l, d, h
keep_full_pinyin: true — turns 劉德華 into liu, de, hua
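The README also applies a mapping so the field is actually analyzed with this analyzer; the queries later in this post assume a name.pinyin subfield. A minimal sketch of such a mapping (field names chosen to match those queries):
POST /hjxtest_pinyin/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}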
With these settings in place, let's analyze 劉德華:
GET /hjxtest_pinyin/_analyze
{
"text": "劉德華",
"analyzer": "pinyin_analyzer"
}
The result is exactly the 7 tokens described above:
{
"tokens": [
{
"token": "l",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "ldh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "de",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "hua",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
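Next we index some documents. A minimal sketch (the ids and bodies are illustrative; 黃渤, pinyin huang bo, becomes relevant later):
PUT /hjxtest_pinyin/_doc/1
{
  "name": "劉德華"
}
PUT /hjxtest_pinyin/_doc/2
{
  "name": "黃渤"
}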
With the index built, when we search for liudh, the query string is first analyzed with the same analyzer:
GET /hjxtest_pinyin/_analyze
{
"text": "liudh",
"analyzer": "pinyin_analyzer"
}
The analysis result:
{
"tokens": [
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liudh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
As you can see, our mighty tokenizer splits the query into liu + d + h + liudh.
Recall the tokens in the inverted index we built: liu, de, hua, l, d, h, ldh.
At search time, liu, d, and h all hit our document, so naturally the search returns it:
GET /hjxtest_pinyin/_search
{
"query": {"match": {
"name.pinyin": "liudh"
}}
}
But then we notice something interesting: searching liudh also returns 黃渤. What on earth? 😂
A blind guess: at analyze time, 黃渤 is analyzed into huang + bo + h + b + hb, and at search time its h matches the h from liudh.
Let's verify. Analyzing 黃渤 gives:
{
"tokens": [
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "huang",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "hb",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "b",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "bo",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
}
]
}
Exactly as guessed.
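You can also ask Elasticsearch directly why a document matched, via the _explain API. A sketch, assuming 黃渤 was indexed with id 2 as above:
GET /hjxtest_pinyin/_explain/2
{
  "query": {
    "match": {
      "name.pinyin": "liudh"
    }
  }
}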
So what now? That precision is far too low.
Looking again, the query example on GitHub actually uses match_phrase, not match. What's the difference? Per the official docs, match_phrase requires not just that the query and the document share terms, but that the terms appear in the same order (at matching positions).
Concretely for our example: when I search liudh, the document's liu, d, h must also appear in that order, so only 劉德華 can match:
GET /hjxtest_pinyin/_search
{
"query": {"match_phrase": {
"name.pinyin": "liudh"
}}
}
And that improves the precision.
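The same reasoning should also cover the 劉dh query from the introduction: 劉 is analyzed into l and liu at position 0, while d and h land at positions 1 and 2, so the phrase again lines up with 劉德華's tokens. You can check this yourself with _analyze:
GET /hjxtest_pinyin/_analyze
{
  "text": "劉dh",
  "analyzer": "pinyin_analyzer"
}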