Implementing Baidu-Style Search Engine Features with Elasticsearch (ES 5.5.0)

Source code: GitHub

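Business requirements (background):

Prefix search as in a search engine: prefix queries on Chinese text, on full pinyin, and on pinyin initials.
Full-text search over article abstracts with title boosting: the title is weighted higher than the body, results are sorted by relevance score, and the 20 most relevant articles are listed.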

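I. Search engine prefix search:

Chinese search:
1. Searching "劉" matches "劉德華", "劉斌", and "劉德志".
2. Searching "劉德" matches "劉德華" and "劉德志".
Summary: the query text must be a prefix of a name; the result is the subset of names in the collection that start with it.
Full-pinyin search:
1. Searching "li" matches "劉德華", "劉斌", and "劉德志".
2. Searching "liud" matches "劉德華" and "劉德志".
3. Searching "liudeh" matches "劉德華".
Summary: the query text must be a prefix of a name's full pinyin; the result is the subset of names whose pinyin starts with it.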

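Pinyin-initials search:
1. Searching "w" matches "我是中國人" and "我愛我的祖國".
2. Searching "wszg" matches "我是中國人".
Summary: each entry is reduced to the string of its pinyin first letters; the result is the subset of entries whose initials string starts with the query text.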

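Solutions:

Approach 1: Store the Chinese text, its pinyin initials, and its full pinyin in three separate index fields, none of them analyzed. This is the simplest and most direct option, because the inverted index holds only the values themselves. At search time, run a wildcard or prefix query (the equivalent of a SQL LIKE) against each of the three fields and return the matching records. (Pros: simple storage format; the smallest inverted index. Cons: LIKE-style matching is expensive at query time, since a prefix query costs far more than a term query.)

Approach 2: Derive three sub-fields from one field (multi-fields) to hold the Chinese text, the pinyin initials, and the full pinyin, and have each analyzer apply an edge_ngram token filter before the tokens are written to the inverted index. At search time, query all of the sub-fields. (Pros: compact layout; lookups are exact term matches and the query input does not need to be analyzed, so queries are fast. Cons: it trades space for time, although the overhead is acceptable for an index; the derived fields make storage and retrieval more complex; and because a query across three fields adds their scores together, the relevance ranking becomes harder to interpret.)

Approach 3: Store the value in a single field that is not split into words (keyword), generate all three kinds of terms (Chinese, full pinyin, pinyin initials) for that one field at index time, and split each of them further with edge_ngram to meet the requirements. The query input is not analyzed at search time. (Pros: simple storage; a search is a single term query against one field, so it is very fast. Cons: it also trades space for time, though it uses less space than approach 2, which is acceptable for this data.)

Elasticsearch offers many other ways to handle this scenario. These three are the most typical ones; the rest of this article uses approach 3.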
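
To make the trade-off between approach 1 and approach 3 concrete, here is a minimal sketch in plain Java. It does not use Elasticsearch at all: a TreeSet stands in for a sorted term dictionary and a HashMap stands in for the inverted index, and the class and method names are made up for illustration. The romanized names are the full pinyin of the three example names above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: Java collections stand in for the Lucene term dictionary.
class PrefixStrategySketch {

    // Approach 1: only whole values are indexed, so a prefix query
    // has to walk every sorted term that starts with the prefix.
    static List<String> prefixScan(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: we have left the prefix range
            }
            hits.add(term);
        }
        return hits;
    }

    // Approach 3, index time: every leading substring of length
    // minGram..maxGram (counted in code points) becomes its own term.
    static List<String> edgeNGrams(String value, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int length = value.codePointCount(0, value.length());
        for (int n = minGram; n <= Math.min(maxGram, length); n++) {
            grams.add(value.substring(0, value.offsetByCodePoints(0, n)));
        }
        return grams;
    }

    static Map<String, TreeSet<String>> buildIndex(List<String> values, int minGram, int maxGram) {
        Map<String, TreeSet<String>> index = new HashMap<>();
        for (String value : values) {
            for (String gram : edgeNGrams(value, minGram, maxGram)) {
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(value);
            }
        }
        return index;
    }

    // Approach 3, search time: one exact lookup, no scanning.
    static SortedSet<String> termLookup(Map<String, TreeSet<String>> index, String query) {
        return index.getOrDefault(query, new TreeSet<>());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("liudehua", "liubin", "liudezhi");
        System.out.println(prefixScan(new TreeSet<>(names), "liud"));
        System.out.println(termLookup(buildIndex(names, 1, 15), "liud"));
    }
}
```

Both calls return the same two names. The difference is where the work happens: the scan pays at query time in proportion to the number of matching terms, while the n-gram index pays at index time with extra terms per value. Note also that with a maximum gram length of 15, a query longer than 15 characters has no term to match in this sketch.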

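Prerequisites:

Installing the pinyin analysis plugin and understanding its parameters
Using edge_ngram in Elasticsearch
Using multi-fields in Elasticsearch
Familiarity with the various Elasticsearch query types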

Code:

baidu_settings.json:

{
  "refresh_interval":"2s",
  "number_of_replicas":1,
  "number_of_shards":2,
  "analysis":{
    "filter":{
      "autocomplete_filter":{
        "type":"edge_ngram",
        "min_gram":1,
        "max_gram":15
      },
      "pinyin_first_letter_and_full_pinyin_filter" : {
        "type" : "pinyin",
        "keep_first_letter" : true,
        "keep_full_pinyin" : false,
        "keep_joined_full_pinyin": true,
        "keep_none_chinese" : false,
        "keep_original" : false,
        "limit_first_letter_length" : 16,
        "lowercase" : true,
        "trim_whitespace" : true,
        "keep_none_chinese_in_first_letter" : true
      },
      "full_pinyin_filter" : {
        "type" : "pinyin",
        "keep_first_letter" : true,
        "keep_full_pinyin" : false,
        "keep_joined_full_pinyin": true,
        "keep_none_chinese" : false,
        "keep_original" : true,
        "limit_first_letter_length" : 16,
        "lowercase" : true,
        "trim_whitespace" : true,
        "keep_none_chinese_in_first_letter" : true
      }
    },
    "analyzer":{
      "full_prefix_analyzer":{
        "type":"custom",
        "char_filter": [
          "html_strip"
        ],
        "tokenizer":"keyword",
        "filter":[
          "lowercase",
          "full_pinyin_filter",
          "autocomplete_filter"
        ]
      },
      "chinese_analyzer":{
        "type":"custom",
        "char_filter": [
          "html_strip"
        ],
        "tokenizer":"keyword",
        "filter":[
          "lowercase",
          "autocomplete_filter"
        ]
      },
      "pinyin_analyzer":{
        "type":"custom",
        "char_filter": [
          "html_strip"
        ],
        "tokenizer":"keyword",
        "filter":[
          "pinyin_first_letter_and_full_pinyin_filter",
          "autocomplete_filter"
        ]
      }
    }
  }
}

baidu_mapping.json

{
  "baidu_type": {
    "properties": {
      "full_name": {
        "type":  "text",
        "analyzer": "full_prefix_analyzer"
      },
      "age": {
        "type":  "integer"
      }
    }
  }
}

import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.client.transport.TransportClient;
import org.junit.Test; // JUnit 4 is assumed here

// ESConnect, BaseIndex, BulkInsert and BulkOperation are the author's own wrapper classes (not shown).
public class PrefixTest {

    @Test
    public void testCreateIndex() throws Exception {
        TransportClient client = ESConnect.getInstance().getTransportClient();
        // Create the index with its settings
        BaseIndex.createWithSetting(client, "baidu_index", "esjson/baidu_settings.json");
        // Define the type and its field mappings
        BaseIndex.createMapping(client, "baidu_index", "baidu_type", "esjson/baidu_mapping.json");
    }

    @Test
    public void testBulkInsert() throws Exception {
        TransportClient client = ESConnect.getInstance().getTransportClient();
        List<Object> list = new ArrayList<>();
        list.add(new BulkInsert(12L, "我們都有一個家名字叫中國", 12));
        list.add(new BulkInsert(13L, "兄弟姐妹都很多景色也不錯 ", 13));
        list.add(new BulkInsert(14L, "家裏盤着兩條龍是長江與黃河", 14));
        list.add(new BulkInsert(15L, "還有珠穆朗瑪峯兒是最高山坡", 15));
        list.add(new BulkInsert(16L, "我們都有一個家名字叫中國", 16));
        list.add(new BulkInsert(17L, "兄弟姐妹都很多景色也不錯", 17));
        list.add(new BulkInsert(18L, "看那一條長城萬里在雲中穿梭", 18));
        boolean flag = BulkOperation.batchInsert(client, "baidu_index", "baidu_type", list);
        System.out.println(flag);
    }
}
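
Since the wrapper classes are not shown, the following sketch illustrates what a bulk insert boils down to at the REST level: a newline-delimited body for the `_bulk` endpoint, with one action line and one source line per document. It is an assumption that the three BulkInsert constructor arguments correspond to the document id and to the `full_name` and `age` fields of the mapping; the class and method names here are hypothetical and no request is actually sent.

```java
// Illustrative only: builds the request body, does not talk to a cluster.
class BulkBodySketch {

    // Minimal JSON string escaping: quote, backslash, control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // One document becomes two lines, each terminated by '\n':
    // the action/metadata line, then the document source.
    // Assumption: id -> _id, name -> full_name, age -> age.
    static String bulkLines(String index, String type, long id, String fullName, int age) {
        return "{\"index\":{\"_index\":" + quote(index)
                + ",\"_type\":" + quote(type)
                + ",\"_id\":\"" + id + "\"}}\n"
                + "{\"full_name\":" + quote(fullName) + ",\"age\":" + age + "}\n";
    }

    public static void main(String[] args) {
        StringBuilder body = new StringBuilder();
        body.append(bulkLines("baidu_index", "baidu_type", 12L, "我們都有一個家名字叫中國", 12));
        body.append(bulkLines("baidu_index", "baidu_type", 14L, "家裏盤着兩條龍是長江與黃河", 14));
        System.out.print(body);
    }
}
```

The body must end with a newline after the last source line, which is why each pair is terminated individually rather than joined with a separator.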

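Apologies: the Java code above relies on wrapper classes that are not shown. How to create an index from Java is easy to look up; the point here is not the Java implementation but the approach described above.

Next, check what the custom analyzer produces: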

http://192.168.20.114:9200/baidu_index/_analyze?text=劉德華AT2016&analyzer=full_prefix_analyzer

{
    "tokens": [
        {
            "token": "劉",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德華",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德華a",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德華at",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德華at2",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德華at20",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德華at201",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "劉德華at2016",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "l",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "li",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "liu",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "liud",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "liude",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "liudeh",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "liudehu",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "liudehua",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "l",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ld",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ldh",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ldha",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ldhat",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ldhat2",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ldhat20",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ldhat201",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "ldhat2016",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        }
    ]
}
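
With terms like these in the index, a prefix search reduces to a term query on `full_name`. The sketch below builds such a request body and, optionally, posts it to the `_search` endpoint using only the JDK. It is a hedged illustration rather than the author's code: the class and method names are made up, and the input is lowercased by hand because a term query does not analyze its input while the indexed tokens are lowercase.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Scanner;

// Illustrative only: a term query against the edge n-gram terms.
class TermQuerySketch {

    // The term query bypasses analysis, so normalize the input here
    // the same way the index-time "lowercase" filter does.
    static String termQueryBody(String field, String userInput, int size) {
        String term = userInput.trim().toLowerCase(Locale.ROOT);
        String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"size\":" + size
                + ",\"query\":{\"term\":{\"" + field + "\":\"" + escaped + "\"}}}";
    }

    // POSTs the body to <baseUrl>/<index>/_search and returns the raw response.
    static String search(String baseUrl, String index, String body) throws Exception {
        URL url = new URL(baseUrl + "/" + index + "/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws Exception {
        String body = termQueryBody("full_name", "LiuD", 20);
        System.out.println(body);
        // Pass the cluster address (e.g. http://host:9200) to actually run the search.
        if (args.length > 0) {
            System.out.println(search(args[0], "baidu_index", body));
        }
    }
}
```

A term query is used instead of a match query because the mapping sets only `analyzer`: a match query would run the input through the same edge_ngram chain, so an input such as "liud" would also be searched as "l" and "li" and would match far more names than intended.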

The analyzer produces exactly the prefixes needed: Chinese, full pinyin, and pinyin initials.

References:

http://blog.csdn.net/napoay/article/details/53907921
https://elasticsearch.cn/question/407
http://blog.csdn.net/xifeijian/article/details/51095762
http://www.cnblogs.com/xing901022/p/5910139.html
http://www.cnblogs.com/clonen/p/6674492.html

https://github.com/medcl/elasticsearch-analysis-pinyin

https://github.com/medcl/elasticsearch-analysis-ik

The full-text search part will be written up later when time permits.
