Background
Chinese search very often relies on pinyin search, and this plugin is basically unavoidable for it; searching people's names is a typical case.
Introduction
Plugin GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin
The example at the end of the README is pretty interesting: after a series of steps, an index is built for 刘德华, and then all sorts of odd queries such as liudh and 刘dh manage to find it. Why is that? Let's break it down.
Configuration from the official README
Configure the analyzer (the README snippet creates an index named medcl3; the test index queried later in this post, hjxtest_pinyin, was created with the same settings):
PUT /medcl3/
{
"settings" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_first_letter":true,
"keep_separate_first_letter" : true,
"keep_full_pinyin" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true
}
}
}
}
}
The key part is the custom tokenizer my_pinyin of type pinyin. The relevant settings are:
keep_first_letter: true, which turns 刘德华 into ldh
keep_separate_first_letter: true, which turns 刘德华 into l, d, h
keep_full_pinyin: true, which turns 刘德华 into liu, de, hua
With these settings in place, let's run 刘德华 through _analyze:
GET /hjxtest_pinyin/_analyze
{
"text": "刘德华",
"analyzer": "pinyin_analyzer"
}
The result is exactly the 7 tokens described above:
{
"tokens": [
{
"token": "l",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "ldh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "de",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "hua",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
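The original post doesn't show the mapping or the test documents, but the later searches all query name.pinyin, so the index presumably carries a multi-field along these lines. The following is only a sketch modeled on the plugin README: the field layout, document ids, and the assumption of Elasticsearch 7.x are mine, not the author's.

PUT /hjxtest_pinyin/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}

Then index a couple of names as test data (黄渤 will show up again further down):

PUT /hjxtest_pinyin/_doc/1
{ "name": "刘德华" }

PUT /hjxtest_pinyin/_doc/2
{ "name": "黄渤" }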
Now, with the index built, when we search for liudh the query string is first tokenized with the same analyzer:
GET /hjxtest_pinyin/_analyze
{
"text": "liudh",
"analyzer": "pinyin_analyzer"
}
The tokenization result:
{
"tokens": [
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liudh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
As you can see, our mighty tokenizer splits the query into liu + d + h + liudh.
Recall the inverted index we built for 刘德华: liu, de, hua, l, d, h, ldh. At search time liu, d, and h all point to that document, so naturally it comes back:
GET /hjxtest_pinyin/_search
{
"query": {"match": {
"name.pinyin": "liudh"
}}
}
But here's a funny thing: when we search liudh, 黄渤 (Huang Bo) comes back as well. What on earth is going on? 😂
A blind guess: analyzing 黄渤 produces huang + bo + h + b + hb, and at search time its h matches the h from liudh.
Let's verify by running 黄渤 through the same analyzer.
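The request isn't shown in the original post, but it is the same _analyze call as before, just with 黄渤 as the text:

GET /hjxtest_pinyin/_analyze
{
  "text": "黄渤",
  "analyzer": "pinyin_analyzer"
}

The result: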
{
"tokens": [
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "huang",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "hb",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "b",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "bo",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
}
]
}
Exactly as guessed.
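If you want Elasticsearch to spell out exactly which term produced the hit, the _explain API can do that. A sketch, assuming ES 7.x and that 黄渤 was indexed as document id 2 as in the earlier sketch:

GET /hjxtest_pinyin/_explain/2
{
  "query": {
    "match": {
      "name.pinyin": "liudh"
    }
  }
}

The explanation output should show the score coming from the lone term h.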
So what do we do? Precision like this is far too low.
Looking again at the GitHub README, the query example it gives is actually match_phrase, not match.
What's the difference? See the official docs: match_phrase requires not only that the query and the document share terms, but also that the terms appear in the same order, based on the position values recorded in the index (the position fields visible in the _analyze output above).
Concretely, when I search liudh, the document's liu, d, and h must also appear in that order, so only 刘德华 can match:
GET /hjxtest_pinyin/_search
{
"query": {"match_phrase": {
"name.pinyin": "liudh"
}}
}
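The same shape of query covers the README's other odd-looking searches, such as the mixed 刘dh mentioned at the start; it goes through the same pinyin analyzer, so the request is identical apart from the query text:

GET /hjxtest_pinyin/_search
{
"query": {"match_phrase": {
"name.pinyin": "刘dh"
}}
}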
That brings the precision back up.