Background
Chinese search very often needs pinyin search, e.g. for people's names, and this plugin is basically unavoidable.
Introduction
Plugin GitHub repo: elasticsearch-analysis-pinyin
The example at the end of its README is quite interesting: after a series of steps, with 劉德華 (pinyin: liu de hua) indexed, even oddball queries like liudh and 劉dh all find it. Why is that? Let's take a closer look.
Config as given in the README
Configure the analyzer (we apply it to our test index hjxtest_pinyin):
PUT /hjxtest_pinyin/
{
"settings" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_first_letter":true,
"keep_separate_first_letter" : true,
"keep_full_pinyin" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true
}
}
}
}
}
The key piece is the custom tokenizer my_pinyin. Its specific settings:
keep_first_letter: true — turns 劉德華 into ldh
keep_separate_first_letter: true — turns 劉德華 into l, d, h
keep_full_pinyin: true — turns 劉德華 into liu, de, hua
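The README also applies a mapping so the field is actually analyzed with this analyzer; the queries later in this post assume a name.pinyin subfield. A minimal sketch of such a mapping (field names chosen to match those queries):
POST /hjxtest_pinyin/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}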
With these settings in place, let's analyze 劉德華:
GET /hjxtest_pinyin/_analyze
{
"text": "劉德華",
"analyzer": "pinyin_analyzer"
}
The result is exactly the 7 tokens described above:
{
"tokens": [
{
"token": "l",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "ldh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "de",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
},
{
"token": "hua",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
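Next we index some documents. A minimal sketch (the ids and bodies are illustrative; 黃渤, pinyin huang bo, becomes relevant later):
PUT /hjxtest_pinyin/_doc/1
{
  "name": "劉德華"
}
PUT /hjxtest_pinyin/_doc/2
{
  "name": "黃渤"
}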
With the index built, when we search for liudh, the query string is first analyzed with the same analyzer:
GET /hjxtest_pinyin/_analyze
{
"text": "liudh",
"analyzer": "pinyin_analyzer"
}
The analysis result:
{
"tokens": [
{
"token": "liu",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "liudh",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "d",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 2
}
]
}
As you can see, our mighty tokenizer splits the query into liu + d + h + liudh.
Recall the tokens in the inverted index we built: liu, de, hua, l, d, h, ldh.
At search time, liu, d, and h all hit our document, so naturally the search returns it:
GET /hjxtest_pinyin/_search
{
"query": {"match": {
"name.pinyin": "liudh"
}}
}
But then we notice something interesting: searching liudh also returns 黃渤. What on earth? 😂
A blind guess: at analyze time, 黃渤 is analyzed into huang + bo + h + b + hb, and at search time its h matches the h from liudh.
Let's verify. Analyzing 黃渤 gives:
{
"tokens": [
{
"token": "h",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "huang",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "hb",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 0
},
{
"token": "b",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
},
{
"token": "bo",
"start_offset": 0,
"end_offset": 0,
"type": "word",
"position": 1
}
]
}
Exactly as guessed.
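You can also ask Elasticsearch directly why a document matched, via the _explain API. A sketch, assuming 黃渤 was indexed with id 2 as above:
GET /hjxtest_pinyin/_explain/2
{
  "query": {
    "match": {
      "name.pinyin": "liudh"
    }
  }
}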
So what now? That precision is far too low.
Looking again, the query example on GitHub actually uses match_phrase, not match. What's the difference? Per the official docs, match_phrase requires not just that the query and the document share terms, but that the terms appear in the same order (at matching positions).
Concretely for our example: when I search liudh, the document's liu, d, h must also appear in that order, so only 劉德華 can match:
GET /hjxtest_pinyin/_search
{
"query": {"match_phrase": {
"name.pinyin": "liudh"
}}
}
And that improves the precision.
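The same reasoning should also cover the 劉dh query from the introduction: 劉 is analyzed into l and liu at position 0, while d and h land at positions 1 and 2, so the phrase again lines up with 劉德華's tokens. You can check this yourself with _analyze:
GET /hjxtest_pinyin/_analyze
{
  "text": "劉dh",
  "analyzer": "pinyin_analyzer"
}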