Background
Chinese search very often relies on pinyin search, and this plugin is basically unavoidable for it; searching people's names is a typical case.
Introduction
Plugin GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin
The example at the end of the README is pretty interesting: after a series of steps, an index is built for 刘德华, and then all sorts of odd queries such as liudh and 刘dh manage to find it. Why is that? Let's break it down.
Configuration from the official README
Configure the analyzer (the README snippet creates an index named medcl3; the test index queried later in this post, hjxtest_pinyin, was created with the same settings):
PUT /medcl3/
{
"settings" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_first_letter":true,
"keep_separate_first_letter" : true,
"keep_full_pinyin" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true
}
}
}
}
}
The key part is the custom tokenizer my_pinyin of type pinyin. The relevant settings are:
keep_first_letter: true, which turns 刘德华 into ldh
keep_separate_first_letter: true, which turns 刘德华 into l, d, h
keep_full_pinyin: true, which turns 刘德华 into liu, de, hua
With these settings in place, let's run 刘德华 through _analyze:
GET /hjxtest_pinyin/_analyze
{
"text": "刘德华",
"analyzer": "pinyin_analyzer"
}
The result is exactly the 7 tokens described above:
{
"tokens": [
{
"token": "l",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "ldh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "de",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "hua",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
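The original post doesn't show the mapping or the test documents, but the later searches all query name.pinyin, so the index presumably carries a multi-field along these lines. The following is only a sketch modeled on the plugin README: the field layout, document ids, and the assumption of Elasticsearch 7.x are mine, not the author's.

PUT /hjxtest_pinyin/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}

Then index a couple of names as test data (黄渤 will show up again further down):

PUT /hjxtest_pinyin/_doc/1
{ "name": "刘德华" }

PUT /hjxtest_pinyin/_doc/2
{ "name": "黄渤" }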
Now, with the index built, when we search for liudh the query string is first tokenized with the same analyzer:
GET /hjxtest_pinyin/_analyze
{
"text": "liudh",
"analyzer": "pinyin_analyzer"
}
The tokenization result:
{
"tokens": [
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liudh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
As you can see, our mighty tokenizer splits the query into liu + d + h + liudh.
Recall the inverted index we built for 刘德华: liu, de, hua, l, d, h, ldh. At search time liu, d, and h all point to that document, so naturally it comes back:
GET /hjxtest_pinyin/_search
{
"query": {"match": {
"name.pinyin": "liudh"
}}
}
But here's a funny thing: when we search liudh, 黄渤 (Huang Bo) comes back as well. What on earth is going on? 😂
A blind guess: analyzing 黄渤 produces huang + bo + h + b + hb, and at search time its h matches the h from liudh.
Let's verify by running 黄渤 through the same analyzer.
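The request isn't shown in the original post, but it is the same _analyze call as before, just with 黄渤 as the text:

GET /hjxtest_pinyin/_analyze
{
  "text": "黄渤",
  "analyzer": "pinyin_analyzer"
}

The result: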
{
"tokens": [
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "huang",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "hb",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "b",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "bo",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
}
]
}
Exactly as guessed.
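If you want Elasticsearch to spell out exactly which term produced the hit, the _explain API can do that. A sketch, assuming ES 7.x and that 黄渤 was indexed as document id 2 as in the earlier sketch:

GET /hjxtest_pinyin/_explain/2
{
  "query": {
    "match": {
      "name.pinyin": "liudh"
    }
  }
}

The explanation output should show the score coming from the lone term h.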
So what do we do? Precision like this is far too low.
Looking again at the GitHub README, the query example it gives is actually match_phrase, not match.
What's the difference? See the official docs: match_phrase requires not only that the query and the document share terms, but also that the terms appear in the same order, based on the position values recorded in the index (the position fields visible in the _analyze output above).
Concretely, when I search liudh, the document's liu, d, and h must also appear in that order, so only 刘德华 can match:
GET /hjxtest_pinyin/_search
{
"query": {"match_phrase": {
"name.pinyin": "liudh"
}}
}
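The same shape of query covers the README's other odd-looking searches, such as the mixed 刘dh mentioned at the start; it goes through the same pinyin analyzer, so the request is identical apart from the query text:

GET /hjxtest_pinyin/_search
{
"query": {"match_phrase": {
"name.pinyin": "刘dh"
}}
}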
That brings the precision back up.