Source code: GitHub
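Business requirements (background):
- Implement search-engine-style prefix search (prefix queries by Chinese text, by full pinyin, and by pinyin initials).
- Implement full-text search over article summaries with title boosting (titles are weighted higher than body content; results are sorted by relevance and the 20 most relevant articles are listed).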
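I. Search-engine prefix search:
Chinese search:
1. Searching "劉" matches "劉德華", "劉斌", "劉德志"
2. Searching "劉德" matches "劉德華", "劉德志"
Summary: a name matches when the search text is a prefix of it; the result is the subset of names that start with the search text.
Full pinyin search:
1. Searching "li" matches "劉德華", "劉斌", "劉德志"
2. Searching "liud" matches "劉德華", "劉德志"
3. Searching "liudeh" matches "劉德華"
Summary: a name matches when the search text is a prefix of the name's full pinyin.
Pinyin initials search:
1. Searching "w" matches "我是中國人", "我愛我的祖國"
2. Searching "wszg" matches "我是中國人"
Summary: the first pinyin letter of each character is joined into one string, and a name matches when the search text is a prefix of that string.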
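The three matching rules above can be pictured with a small in-memory sketch. It is only an illustration: the class name `PrefixMatchSketch` and the hand-written pinyin table are assumptions made for this example, and a real setup would get the pinyin forms from the pinyin analysis plugin rather than from a hard-coded map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PrefixMatchSketch {

    // Hypothetical hand-written table: name -> {full pinyin, pinyin initials}.
    // Assumption for illustration only; nothing here calls Elasticsearch.
    static final Map<String, String[]> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put("劉德華", new String[] {"liudehua", "ldh"});
        NAMES.put("劉斌", new String[] {"liubin", "lb"});
        NAMES.put("劉德志", new String[] {"liudezhi", "ldz"});
    }

    // Returns every name for which the input is a prefix of the original text,
    // of the full pinyin, or of the pinyin initials.
    static List<String> search(String input) {
        String q = input.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String[]> e : NAMES.entrySet()) {
            String name = e.getKey();
            String fullPinyin = e.getValue()[0];
            String initials = e.getValue()[1];
            if (name.startsWith(q) || fullPinyin.startsWith(q) || initials.startsWith(q)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("劉德")); // Chinese prefix
        System.out.println(search("liud")); // full pinyin prefix
        System.out.println(search("lb"));   // pinyin initials prefix
    }
}
```

Each call scans every name, which is fine for three entries but is exactly the cost the index-based solutions below try to avoid.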
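Solutions:
Option 1: Store the Chinese text, the pinyin initials, and the full pinyin of the field to be searched in three separate index fields, none of them analyzed (the inverted index stores the values as they are). This is the simplest and most direct approach. At query time, run a wildcard or prefix query (similar to SQL LIKE) against each of the three fields and return the matching records. (Pros: simple storage format, and the smallest inverted index. Cons: LIKE-style lookups are expensive; a prefix query costs far more than a term query.)
Option 2: Derive three sub-fields from one field (multi-field) to hold the Chinese text, the pinyin initials, and the full pinyin, and add an edge_ngram token filter to the analyzer so that the prefixes of each value are stored in the inverted index. At query time, search all of the sub-fields. (Pros: compact format; lookups use exact term matching and the input needs no analysis, so queries are efficient. Cons: it trades space for time, although that is acceptable for an index; the derived fields add storage and query complexity; and the relevance scores from the three fields are added together, which can muddle the ranking.)
Option 3: Store the data in a single field whose analyzer uses the keyword tokenizer (no word segmentation). At index time, generate all three kinds of terms for that field (original text, full pinyin, pinyin initials) and split each one further with edge_ngram to meet the requirements above. The input is not analyzed at query time. (Pros: simple storage; a lookup is a single exact term query on one field, so queries are very efficient. Cons: it trades space for time, though less so than option 2, and that is acceptable for index data.)
Elasticsearch offers many other ways to handle this scenario; these three are the more typical ones. Option 3 is the one used here.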
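To make the trade-off behind option 3 concrete, here is a minimal sketch of the idea using a toy in-memory inverted index. It is not how Elasticsearch is implemented; the class `EdgeNgramIndexSketch`, its methods, and the hand-written pinyin forms are all assumptions for illustration. The point is only that the prefixes are generated once at index time, so a query becomes one exact lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class EdgeNgramIndexSketch {

    // Toy inverted index: term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final int minGram;
    private final int maxGram;

    EdgeNgramIndexSketch(int minGram, int maxGram) {
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    // All prefixes of the term whose length lies in [minGram, maxGram].
    // Lengths are counted in Java chars, which is enough for this example.
    static List<String> edgeNgrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, term.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    // Index time: expand every form (original text, full pinyin, initials)
    // into its prefixes and record the document under each prefix.
    void add(int docId, String... forms) {
        for (String form : forms) {
            for (String gram : edgeNgrams(form, minGram, maxGram)) {
                postings.computeIfAbsent(gram, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Query time: a single exact lookup; the input is not expanded or analyzed.
    Set<Integer> lookup(String input) {
        return postings.getOrDefault(input, Collections.emptySet());
    }

    public static void main(String[] args) {
        EdgeNgramIndexSketch index = new EdgeNgramIndexSketch(1, 15);
        index.add(1, "劉德華", "liudehua", "ldh");
        index.add(2, "劉斌", "liubin", "lb");
        index.add(3, "劉德志", "liudezhi", "ldz");
        System.out.println(index.lookup("liud")); // ids of 劉德華 and 劉德志
        System.out.println(index.lookup("lb"));   // id of 劉斌
    }
}
```

The space cost is visible in `add`: one value of length n contributes up to n terms. One consequence of the upper bound is that an input longer than `maxGram` characters finds nothing, because no term that long was ever indexed.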
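Preparation:
- Installing the pinyin analysis plugin and understanding its parameters
- Using Elasticsearch edge_ngram
- Using Elasticsearch multi-fields
- Getting familiar with the various Elasticsearch query types
Code: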
baidu_settings.json:
{
"refresh_interval":"2s",
"number_of_replicas":1,
"number_of_shards":2,
"analysis":{
"filter":{
"autocomplete_filter":{
"type":"edge_ngram",
"min_gram":1,
"max_gram":15
},
"pinyin_first_letter_and_full_pinyin_filter" : {
"type" : "pinyin",
"keep_first_letter" : true,
"keep_full_pinyin" : false,
"keep_joined_full_pinyin": true,
"keep_none_chinese" : false,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true,
"trim_whitespace" : true,
"keep_none_chinese_in_first_letter" : true
},
"full_pinyin_filter" : {
"type" : "pinyin",
"keep_first_letter" : true,
"keep_full_pinyin" : false,
"keep_joined_full_pinyin": true,
"keep_none_chinese" : false,
"keep_original" : true,
"limit_first_letter_length" : 16,
"lowercase" : true,
"trim_whitespace" : true,
"keep_none_chinese_in_first_letter" : true
}
},
"analyzer":{
"full_prefix_analyzer":{
"type":"custom",
"char_filter": [
"html_strip"
],
"tokenizer":"keyword",
"filter":[
"lowercase",
"full_pinyin_filter",
"autocomplete_filter"
]
},
"chinese_analyzer":{
"type":"custom",
"char_filter": [
"html_strip"
],
"tokenizer":"keyword",
"filter":[
"lowercase",
"autocomplete_filter"
]
},
"pinyin_analyzer":{
"type":"custom",
"char_filter": [
"html_strip"
],
"tokenizer":"keyword",
"filter":[
"pinyin_first_letter_and_full_pinyin_filter",
"autocomplete_filter"
]
}
}
}
}
baidu_mapping.json:
{
"baidu_type": {
"properties": {
"full_name": {
"type": "text",
"analyzer": "full_prefix_analyzer"
},
"age": {
"type": "integer"
}
}
}
}
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.client.transport.TransportClient;
import org.junit.Test; // JUnit 4 assumed

// ESConnect, BaseIndex, BulkOperation and BulkInsert are the author's own wrapper classes (not shown here).
public class PrefixTest {
    @Test
    public void testCreateIndex() throws Exception {
        TransportClient client = ESConnect.getInstance().getTransportClient();
        // Create the index with its settings
        BaseIndex.createWithSetting(client, "baidu_index", "esjson/baidu_settings.json");
        // Define the type and its field mapping
        BaseIndex.createMapping(client, "baidu_index", "baidu_type", "esjson/baidu_mapping.json");
    }

    @Test
    public void testBulkInsert() throws Exception {
        TransportClient client = ESConnect.getInstance().getTransportClient();
        List<Object> list = new ArrayList<>();
        list.add(new BulkInsert(12L, "我們都有一個家名字叫中國", 12));
        list.add(new BulkInsert(13L, "兄弟姐妹都很多景色也不錯 ", 13));
        list.add(new BulkInsert(14L, "家裏盤着兩條龍是長江與黃河", 14));
        list.add(new BulkInsert(15L, "還有珠穆朗瑪峯兒是最高山坡", 15));
        list.add(new BulkInsert(16L, "我們都有一個家名字叫中國", 16));
        list.add(new BulkInsert(17L, "兄弟姐妹都很多景色也不錯", 17));
        list.add(new BulkInsert(18L, "看那一條長城萬里在雲中穿梭", 18));
        // Bulk-index the documents and print whether the request succeeded
        boolean flag = BulkOperation.batchInsert(client, "baidu_index", "baidu_type", list);
        System.out.println(flag);
    }
}
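Apologies, the Java code above is wrapped in helper classes. How to create an index from Java is easy to look up online; the point here is not the Java implementation but the idea described above.
Next, check what the analyzer defined above produces: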
http://192.168.20.114:9200/baidu_index/_analyze?text=劉德華AT2016&analyzer=full_prefix_analyzer
{
"tokens": [
{
"token": "劉",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德華",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德華a",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德華at",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德華at2",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德華at20",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德華at201",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "劉德華at2016",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "l",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "li",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "liu",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "liud",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "liude",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "liudeh",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "liudehu",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "liudehua",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "l",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ld",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ldh",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ldha",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ldhat",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ldhat2",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ldhat20",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ldhat201",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
},
{
"token": "ldhat2016",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 0
}
]
}
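With tokens like the ones above in the inverted index, a prefix lookup can be an ordinary term query on `full_name`. The sketch below only builds the JSON body for such a query as a string; the class `PrefixTermQuerySketch`, its method names, and the `size` value are assumptions for illustration, and sending the body to the index's `_search` endpoint is left out. Because a term query does not analyze its input, the sketch lowercases the input itself to line up with the `lowercase` filter used at index time.

```java
import java.util.Locale;

class PrefixTermQuerySketch {

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':
                    sb.append("\\\"");
                    break;
                case '\\':
                    sb.append("\\\\");
                    break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the body of a term query: the user input is trimmed and lowercased,
    // then matched as-is against the edge n-gram terms stored for the field.
    static String termQueryBody(String field, String userInput, int size) {
        String term = userInput.trim().toLowerCase(Locale.ROOT);
        return "{\"query\":{\"term\":{\"" + escapeJson(field) + "\":\""
                + escapeJson(term) + "\"}},\"size\":" + size + "}";
    }

    public static void main(String[] args) {
        // prints {"query":{"term":{"full_name":"liud"}},"size":10}
        System.out.println(termQueryBody("full_name", "LiuD", 10));
    }
}
```

A match query would behave differently here: with only `analyzer` set in the mapping, the input would also go through `full_prefix_analyzer` and be split into its own prefixes, so a term query (or a separate `search_analyzer`) is the closer fit to "do not analyze the input at query time".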
That completes the prefix search setup.
References:
http://blog.csdn.net/napoay/article/details/53907921
https://elasticsearch.cn/question/407
http://blog.csdn.net/xifeijian/article/details/51095762
http://www.cnblogs.com/xing901022/p/5910139.html
http://www.cnblogs.com/clonen/p/6674492.html
https://github.com/medcl/elasticsearch-analysis-pinyin
https://github.com/medcl/elasticsearch-analysis-ik
The full-text search part will be written up later when time allows.