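ElasticSearch Server-Side Development in Practice

Introduction to ES

Indices, Shards, and Replicas

ElasticSearch is an open-source search server built on the Apache Lucene search engine. As a document-oriented search server, its storage model and architecture closely resemble those of NoSQL databases such as MongoDB: document storage, sharding, indices, clustering, and replicas.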

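Indices
Note that an index in ES is not the same concept as an index in a database.
An ES index corresponds to a database in MongoDB or in a relational system, and a mapping type defined when the index is created corresponds to a MongoDB collection (a table in relational terms).
Take MongoDB as an example: there is a database named order containing a collection named info, and order_id is an indexed field of info.
In ES, the index is then order, and info is the mapping type, stored in _type.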
```
// MongoDB data
use order
db.info.find()
{
   "did": 490873,
   "order_id": 3
    ...
}
// ES data
{                                                
 "_index": "order",                         
 "_type": "info",
 "_source": {
   "did": 490873,
   "order_id": 3
   ....
 }
}
```
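Shards
When the data volume exceeds what a single machine can hold, shards provide horizontal scaling: the data is split into smaller units stored on different servers, and each shard handles part of the data. A query runs on every relevant shard, and the partial results are merged before being returned to the caller. The data of a single index is therefore spread across several physical machines.
Replicas
Replicas serve two purposes: disaster recovery and higher query throughput. Each shard can have several replicas, and a replica is simply a copy of its shard, so the same data is stored more than once. Within each group of copies, one is the primary shard, which accepts writes, while replicas can also serve reads. If the primary fails or becomes unavailable for any other reason, one of its replicas is promoted to primary and service continues.

Unless configured otherwise, ES (in the 2.x release used here) creates 5 shards and 1 replica per index.

[Figure: ES cluster architecture showing shards and replicas]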

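REST API

Every ES operation, including create, read, update, and delete, goes through the REST API; so do index management and cluster and node health checks.
A REST call can be thought of as an operation plus a state: the HTTP method names the operation, and the URL that follows gives the ES address, port, and target object.

GET	Retrieve information about an object: cluster information, data stored in ES, index information, and so on
PUT	Create an object
POST	Modify an object; besides index settings, shard settings, and data updates, it can also send commands such as shutdown and restart
DELETE	Delete an object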
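The verb table above maps directly onto HTTP requests. As a rough sketch, the hypothetical helper es_request below (not part of any ES client library) builds one request per verb with Python's standard urllib, without sending anything. The address 127.0.0.1:9200 is the one used in the curl examples of this post, and the _update path is used here only to illustrate POST.

```python
import json
import urllib.request

ES_ADDR = "http://127.0.0.1:9200"  # assumed local node, as in the curl examples


def es_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the ES REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(ES_ADDR + path, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# One request per verb in the table above.
examples = [
    es_request("GET", "/"),                                  # cluster info
    es_request("PUT", "/test/info/1", {"title": "test"}),    # create a document
    es_request("POST", "/test/info/1/_update", {"doc": {"title": "new"}}),  # modify it
    es_request("DELETE", "/test/info/1"),                    # delete it
]

for req in examples:
    print(req.get_method(), req.full_url)
```

Passing any of these objects to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without a cluster.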

Getting basic ES cluster information

curl -XGET 127.0.0.1:9200
{
  "name" : "127.0.0.1",
  "cluster_name" : "127.0.0.1",
  "version" : {
    "number" : "2.3.5",
    "build_hash" : "90f439ff60a3c0f497f91663701e64ccd01edbb4",
    "build_timestamp" : "2016-07-27T10:36:52Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}

Creating a document

# With curl -XPUT, -d supplies the request body. The body is stored as the document whose id is 1, so the id in the URL cannot be omitted.
curl -XPUT 127.0.0.1:9200/test/info/1 -d '{"title":"test"}'
{
    "_index": "test",
    "_type": "info",
    "_id": "1",
    "_version": 4,
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "created": false
}

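Note: ES listens on port 9200 by default. In the response above, "created": false and "_version": 4 show that a document with id 1 already existed and was overwritten.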

ES Development Steps

Settings

Getting the current settings

curl -XGET 127.0.0.1:9200/test/_settings?pretty

Shard, index, and replica settings

The number of shards and replicas is configured in the index field of settings.

# number_of_shards is the shard count, number_of_replicas the replica count
curl -XPUT 127.0.0.1:9200/test -d '
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}'

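When a document is inserted into an index that does not exist yet, ES creates the index automatically. Automatic creation can be turned off in the ES configuration file elasticsearch.yml:

action.auto_create_index: false

An index is created explicitly with PUT. The following creates an index named test.

curl -XPUT 127.0.0.1:9200/test/
# On success the response is
{"acknowledged":true}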

Custom analyzer settings

Analyzers are also configured in settings and are used to analyze string fields. The built-in analyzers are:
standard, simple, whitespace, stop, keyword, pattern, language, snowball

Besides the built-in ones, custom analyzers can be defined under settings. One analyzer = 1 tokenizer + n token filters.

{
    "settings": {
        "analysis": { 
            "analyzer": {
                // Custom analyzer named char_analyzer
                "char_analyzer": {
                    "type": "custom",
                    // Exactly one tokenizer
                    "tokenizer": "char_split", // char_split is a custom tokenizer defined below
                    // Any number of token filters
                    "filter": [
                        "lowercase",        // built-in filter
                        "myFilter"          // custom filter defined below
                    ]
                }
            },
            // Custom tokenizer char_split
            "tokenizer": {
                "char_split": {
                    "type": "nGram",
                    "min_gram": "1",
                    "max_gram": "1",
                    "token_chars": ["letter", "digit", "whitespace", "punctuation", "symbol"]
                }
            },
            // Custom filter myFilter
            "filter":{
                "myFilter":{
                    "type":"kstem"
                }
            }
        }
    }
}
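To make the "1 tokenizer + n filters" pipeline concrete, the Python sketch below imitates what char_analyzer does to a string. It is only an approximation: the real analysis happens inside Lucene, the function names are made up for illustration, and the kstem stemming filter is left out.

```python
def char_split(text):
    """Stand-in for the nGram tokenizer with min_gram = max_gram = 1:
    every character becomes its own token."""
    return list(text)


def lowercase(tokens):
    """Stand-in for the built-in lowercase token filter."""
    return [t.lower() for t in tokens]


def analyze(text, tokenizer, filters):
    """One analyzer = one tokenizer followed by any number of token filters."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("ES 2", char_split, [lowercase]))  # ['e', 's', ' ', '2']
```

Since every character ends up as its own term, a phrase query analyzed the same way can match any substring of the field, which is presumably why this analyzer is paired with match_phrase for partial matching.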

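Mapping

In the JSON body, mappings sits at the same level as settings. ES uses the mapping to define the structure of an index and how its fields are typed. Commonly used data types are long, string, and nested.

Getting the current mapping

curl -XGET 127.0.0.1:9200/test/_mappings?pretty

Simple data types and automatic type inference

long: used for numeric fields and for arrays of numbers. The { "dynamic": "true" } setting controls whether ES infers field types dynamically; when it is true, numeric fields need no explicit mapping because ES infers their type.
string: string type, used for full-text search and partial matching; it can be combined with an analyzer.

Complex data types

An array of inner objects is flattened when it is stored. Take the following array: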

{
    "followers": [
        { "age": 35, "name": "Mary White"},
        { "age": 26, "name": "Alex Jones"},
        { "age": 19, "name": "Lisa Smith"}
    ]
}

Stored result:

{
    "followers.age":    [19, 26, 35],
    "followers.name":   [alex, jones, lisa, smith, mary, white]
}

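The association between {age: 35} and {name: Mary White} is lost, because each multi-valued field becomes an unordered bag of values rather than an ordered array.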
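The effect of flattening can be reproduced in a few lines of Python. This is a simplified sketch: the flatten function is hypothetical, and unlike ES it keeps each name as a whole string instead of analyzing it into terms.

```python
def flatten(field, objects):
    """Collapse an array of inner objects into one value list per sub-field,
    roughly as ES does for the default object type."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


followers = [
    {"age": 35, "name": "Mary White"},
    {"age": 26, "name": "Alex Jones"},
    {"age": 19, "name": "Lisa Smith"},
]
flat = flatten("followers", followers)

# A query for "age 26 AND name Mary White" matches the flattened document,
# although no single follower has both values.
flat_match = 26 in flat["followers.age"] and "Mary White" in flat["followers.name"]

# Checking each inner object on its own, which is what nested does, finds no match.
nested_match = any(f["age"] == 26 and f["name"] == "Mary White" for f in followers)

print(flat_match, nested_match)  # True False
```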
The nested type handles such structures; in the mapping below, the prop field under properties is a multi-valued nested field.
A basic mapping structure looks like this:

{
    "mappings": {
        "person": {
            "dynamic": "false", 
            "properties": {
                "id": { 
                    "type": "long" 
                },
                "name": {
                    "type": "string",
                    "analyzer": "char_analyzer" // analyzer to use
                    // For whole-value (exact) matching, use not_analyzed instead:
                    // "index": "not_analyzed"
                },
                "prop": {
                    // Nested structures use the nested type
                    "type": "nested",
                    "properties": {
                        "propid":   {"type": "long", "index": "not_analyzed"},
                        "propname": {"type": "string", "analyzer": "char_analyzer"}
                    }
                }
            }
        }
    }
}

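DSL

The business modules already wrap the ES API in their own layer: a module that needs ES initializes the wrapper and then calls the corresponding functions. The DSL operations below use the REST API directly.

Create, update, and delete

First, look at the structure of an ES document: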

{
  "_index": "order",
  "_type": "info",
  "_id": "did-490873_id-3",
  "_version": 6,
  "_score": 1,
  "_routing": "490873",
  "_source": {
    "did": 490873,
    "order_id": 3,
    ....
  }
}

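As shown, an ES document carries the following fields:
_index: the index name, comparable to a database name in MongoDB; it also selects the index for other operations, as in $es_addr/_index

_type: the type name, comparable to a collection (table) name in MongoDB

_id: the unique identifier, normally assigned by each module in a format such as did-10000_id-1

_version: a version number maintained by ES that increments on every change to the document

_source: the original JSON body of the document

_routing: the routing value. Because the data of an index is stored across several shards, creating or retrieving a document requires knowing, or specifying, which shard holds it. Every document operation accepts a routing parameter that customizes the document-to-shard mapping. A custom routing value guarantees that all related documents, for example those belonging to the same company, are stored on the same shard. All business modules currently use did as the routing value.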
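The routing rule can be sketched as shard = hash(routing) % number_of_primary_shards. The Python below illustrates it under two assumptions: crc32 stands in for the hash function ES actually uses, and pick_shard is a made-up name. The shard numbers it prints therefore will not match a real cluster, but the property that matters (same did, same shard) holds.

```python
import zlib

NUMBER_OF_SHARDS = 5  # the default mentioned earlier


def pick_shard(routing, number_of_shards=NUMBER_OF_SHARDS):
    """shard = hash(routing) % number_of_primary_shards.
    crc32 is only a stand-in here; ES uses its own hash function."""
    return zlib.crc32(str(routing).encode("utf-8")) % number_of_shards


# Documents that share a did always land on the same shard.
docs = [
    {"_id": "did-490873_id-3", "did": 490873},
    {"_id": "did-490873_id-4", "did": 490873},
    {"_id": "did-10000_id-1", "did": 10000},
]
for doc in docs:
    print(doc["_id"], "-> shard", pick_shard(doc["did"]))
```

Because the shard number depends on the shard count, a search that supplies the same routing value only has to visit one shard, and the number of primary shards cannot be changed without reindexing.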

Search

A search can span several indices and several types at once.

# Search URL format:
curl -X GET '127.0.0.1:9200/index/type/_search'

# No index given: search all indices
curl -X GET '127.0.0.1:9200/_search/'
# Search the order and custm indices
curl -X GET '127.0.0.1:9200/order,custm/_search/'
# Search all types in indices whose names start with g or u
curl -X GET '127.0.0.1:9200/g*,u*/_search'
# Search the info type in the order index
curl -X GET '127.0.0.1:9200/order/info/_search'
# Search the info and setting types in the order index
curl -X GET '127.0.0.1:9200/order/info,setting/_search'
# Search the user and tweet types across all indices
curl -X GET '127.0.0.1:9200/_all/user,tweet/_search'

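Querying is the interface the business calls most often, and also the most complex one. Most of the work in a business module is building a query plan for each kind of search. The general-purpose queries below cover the majority of search scenarios.
The outermost layers are query and bool; inside bool there are four clause types: must, filter, should, and must_not.
The official documentation describes the four as follows:

must: the clause must match, and it contributes to the relevance score
filter: the clause must match, but it runs in filter context, so scoring is skipped and the result can be cached
should: the clause should match; minimum_should_match sets how many should clauses are required
must_not: the clause must not match; it also runs in filter context

So when scoring or relevance is not needed and the query only selects documents, filter is sufficient. A typical query structure looks like this: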

POST _search
{
    "query": {
        "bool" : {
            "must" : {
                "term" : { "user" : "kimchy" }
            },
            "filter": {
                "term" : { "tag" : "tech" }
            },
            "must_not" : {
                "range" : {
                    "age" : { "gte" : 10, "lte" : 20 }
                }
            },
            "should" : [
                { "term" : { "tag" : "wow" } },
                { "term" : { "tag" : "elasticsearch" } }
            ],
            "minimum_should_match" : 1,
            "boost" : 1.0
        }
    }
}
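Since business code has to assemble such bodies for many different searches, a small builder is a natural fit. The following is a minimal sketch, assuming a hypothetical bool_query helper; it only produces the JSON body and leaves sending it to whatever wrapper the module already uses.

```python
import json


def bool_query(must=None, filters=None, should=None, must_not=None,
               minimum_should_match=None):
    """Assemble a bool query, leaving out clause types that are not used."""
    clauses = {"must": must, "filter": filters,
               "should": should, "must_not": must_not}
    body = {name: value for name, value in clauses.items() if value}
    if minimum_should_match is not None:
        body["minimum_should_match"] = minimum_should_match
    return {"query": {"bool": body}}


query = bool_query(
    must=[{"term": {"user": "kimchy"}}],
    filters=[{"term": {"tag": "tech"}}],
    should=[{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
    must_not=[{"range": {"age": {"gte": 10, "lte": 20}}}],
    minimum_should_match=1,
)
print(json.dumps(query, indent=2))
```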

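Below these four clause types sit the finer-grained conditions on the data, such as term/terms, match, and, or, and range.

term and terms

term is the most commonly used query. The query value is not analyzed, so it must match a stored term exactly, including case; for that reason it is mostly used on numeric fields.

terms is the array form of term. It matches a field against a list of values, and a document is returned if it matches any one element of the list.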

{
  // Match documents whose did is 10000 and whose pid contains at least one of 22, 23, 24, 25
  "query": { "bool": { "filter": [
    { "term":  { "did": 10000 } },
    { "terms": { "pid": [22, 23, 24, 25] } }
  ] } }
}
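The difference between the two can be stated as plain set logic. The sketch below models a field as a list of stored values; term_matches and terms_matches are illustrative names, not ES functions.

```python
def term_matches(field_values, wanted):
    """term: the field must contain exactly this value."""
    return wanted in field_values


def terms_matches(field_values, wanted_list):
    """terms: the field must contain at least one of the listed values."""
    return any(value in field_values for value in wanted_list)


doc = {"did": [10000], "pid": [25, 99]}

print(term_matches(doc["did"], 10000))               # True
print(terms_matches(doc["pid"], [22, 23, 24, 25]))   # True: 25 is enough
print(terms_matches([99], [22, 23, 24, 25]))         # False
```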

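match and match_phrase

match and match_phrase are used for string searches; when the field has an analyzer, both analyze the query text into terms.

match returns a document when any of the query terms appears in the field (the default operator is or).

match_phrase returns a document only when all the terms appear in the field, and they appear in the same order as in the query.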
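A toy model makes the ordering requirement easier to see. The sketch assumes a simple lowercase-and-split-on-whitespace analyzer, which is an assumption for illustration only; with the single-character analyzer defined earlier the tokens would be individual characters instead.

```python
def tokens(text):
    """Toy analyzer: lowercase and split on whitespace (an assumption;
    the real tokens depend on the analyzer configured for the field)."""
    return text.lower().split()


def match(field_text, query_text):
    """match with the default operator: any query term may appear."""
    field = set(tokens(field_text))
    return any(term in field for term in tokens(query_text))


def match_phrase(field_text, query_text):
    """match_phrase: all query terms, adjacent and in the same order."""
    field, query = tokens(field_text), tokens(query_text)
    if not query:
        return False
    return any(field[i:i + len(query)] == query
               for i in range(len(field) - len(query) + 1))


text = "quick brown fox"
print(match(text, "fox quick"))           # True
print(match_phrase(text, "fox quick"))    # False: wrong order
print(match_phrase(text, "brown fox"))    # True
```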

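Querying nested data

A query on nested data must specify path, that is, the field whose type is nested; fields inside the nested structure are then referenced with dot notation.

The following JSON is a complete example that combines all of the query types above: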

{
    "query": {
        "bool": {
            "filter": {
                // Conditions under and must all be satisfied
                "and": [{
                        // For numeric fields, use term
                        "term": {
                            "did": 519390
                        }
                    },{
                        // For array fields, use terms
                        "terms": {
                            "follower_pids": [40984,40985] 
                        }
                    }, {
                        // For range searches, use range
                        "range": {
                            "create_time": {
                                "gte": 1488211200000, 
                                "lte": 1488988799999
                            }
                        }
                    }
                ]
            },
            // Satisfying any one condition under should is enough
            "should": [{
                    // match_phrase analyzes the query text; it is used for string search with partial matching
                    "match_phrase": {   
                        "contact_names": "44"
                    }
                }
            ],
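            // Minimum number of should conditions that must match
            "minimum_should_match": 1,
            // Conditions under must are also required; they could go under and as well, but and usually holds the numeric conditions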
            "must": [{       
                    "match_phrase": {
                        "name": "234"
                    }
                }, {
                    // nested matches JSON objects embedded in the document; the mapping must declare the field as nested, and the query must give its path
                    "nested": {
                        "path": "props",
                        "query": {
                            "bool": {
                                "filter": [{
                                        "term": {
                                            "props.propid": 583
                                        }
                                    }, {
                                        "match_phrase": {
                                            "props.propvalue": "44"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }, {
                    "nested": {
                        "path": "props",
                        "query": {
                            "bool": {
                                "filter": [{
                                        "term": {
                                            "props.propid": 585
                                        }
                                    }, {
                                        "range": {
                                            "props.timestamp": {
                                                "gte": 1489507200000,
                                                "lte": 1490111999999
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }, {
                    "nested": {
                        "path": "props",
                        "query": {
                            "bool": {
                                "filter": [{
                                        "term": {
                                            "props.propid": 588
                                        }
                                    }, {
                                        "terms": {
                                            "props.propmultiselect": ["one"]
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ],
            "must_not": [{
                    "terms": {
                        "prop_ids": [584]
                    }
                }
            ]
        }
    },
    "sort": [{
            "props.timestamp": {
                "order": "asc",
                "nested_path": "props",
                "nested_filter": {
                    "term": {
                        "props.propid": 586
                    }
                }
            }
        }
    ],
    "fields": ["custmid", "contid"],
    "from": 0,
    "size": 51
}

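Deep pagination

By default ES paginates with from + size, which is very inefficient for deep pages. With from = 5000 and size = 10, for example, every shard has to match and sort its own top 5010 hits, and ES then returns only the last 10 of the merged result. This is similar to skip + limit in MongoDB. The largest supported value of from + size is index.max_result_window, which defaults to 10,000. For deep pagination ES provides scroll (optionally with the scan search type) to read results page by page.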
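The cost described above is simple arithmetic, sketched here in Python. The function names are hypothetical, and the coordinator figure is an upper bound, since a shard may hold fewer matching documents than from + size.

```python
def deep_page_cost(from_, size, shards):
    """Rough cost model of from + size pagination."""
    per_shard = from_ + size             # hits each shard must collect and sort
    at_coordinator = per_shard * shards  # upper bound on hits merged by the coordinating node
    discarded = at_coordinator - size    # hits thrown away to return one page
    return per_shard, at_coordinator, discarded


def within_window(from_, size, max_result_window=10000):
    """True if the request stays inside index.max_result_window."""
    return from_ + size <= max_result_window


print(deep_page_cost(5000, 10, shards=5))   # (5010, 25050, 25040)
print(within_window(5000, 10))              # True
print(within_window(10000, 10))             # False
```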

First obtain a scroll_id:

curl -XGET '127.0.0.1:9200/product/info/_search?pretty&scroll=2m' -d '{"query":{"match_all":{}}}'

# Response
{
  "_scroll_id": "cXVlcnlBbmRGZXRjaDsxOzg3OTA4NDpTQzRmWWkwQ1Q1bUlwMjc0WmdIX2ZnOzA7",
  "took": 1,
  "timed_out": false,
  "_shards": {
  "total": 1,
  "successful": 1,
  "failed": 0
  },
  "hits":{...}
}

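Subsequent pages are then read by sending this scroll_id, together with a scroll timeout, to the _search/scroll endpoint until a response contains no hits.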
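The read loop can be sketched independently of any HTTP client. In the Python below, scroll_all is a hypothetical helper, and the two callables stand for the initial search and the follow-up scroll requests. It assumes a plain scroll, where the first response already carries hits; with the scan search type the first response is empty and the loop would need to start from the second request.

```python
def scroll_all(first_page, next_page):
    """Yield every hit of a scrolled search.

    first_page() -> response of the initial _search?scroll=... request
    next_page(scroll_id) -> response of a follow-up _search/scroll request
    Both callables are supplied by the caller, so no HTTP client is assumed.
    """
    response = first_page()
    while response["hits"]["hits"]:
        for hit in response["hits"]["hits"]:
            yield hit
        # Always use the most recent scroll id for the next request.
        response = next_page(response["_scroll_id"])


# Stand-in for a cluster: three pages, the last one empty.
pages = [
    {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    {"_scroll_id": "s3", "hits": {"hits": []}},
]
cursor = iter(pages)
hits = list(scroll_all(lambda: next(cursor), lambda scroll_id: next(cursor)))
print([h["_id"] for h in hits])  # ['1', '2', '3']
```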

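Writing an ES import tool in Go

Rebuilding shards and indices, then importing the data

When the index structure changes and the index has to be rebuilt, first clear the data, then recreate the index, and finally import the data into ES again.

curl -XDELETE 127.0.0.1:9200/order
# Or use the data-cleanup script, where order is the index name
es_clean_data.sh 127.0.0.1:9200 order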
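For the re-import step, one common approach is the _bulk endpoint, which takes a newline-delimited body. The post's own tool is written in Go and is not shown, so the following Python is only an illustrative sketch: bulk_body is a made-up helper, and the metadata keys (_index, _type, _id, _routing) are written as the 2.x line used here is assumed to accept them.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build the newline-delimited body of a _bulk request:
    one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        action = {"index": {
            "_index": index,
            "_type": doc_type,
            # _id format and did-based routing follow the conventions above
            "_id": "did-{}_id-{}".format(doc["did"], doc["order_id"]),
            "_routing": str(doc["did"]),
        }}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline


docs = [{"did": 490873, "order_id": 3}, {"did": 490873, "order_id": 4}]
print(bulk_body("order", "info", docs), end="")
```

Sending the returned string as the body of a POST to 127.0.0.1:9200/_bulk would index both documents in one round trip.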

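Common Problems in ES Development

The same query works with curl but returns no data in the head plugin:

Change the request method from GET to POST.

Data beyond the first 10,000 hits cannot be returned when skipping:

This is a built-in ES limit on how far a search can skip. ES paginates with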
```
{ "from": 100, "size": 10 }
```
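which returns 10 hits starting at offset 100. Each index has a setting, index.max_result_window, that defaults to 10000.
If from + size > index.max_result_window, ES rejects the request instead of returning data. The setting can be changed; for example, to raise it to 50000 on the custm index: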
```
curl -XPUT "127.0.0.1:9200/custm/_settings" -d '
{
  "index": {
    "max_result_window": 50000
  }
}'
```
After the change, the settings of the custm index can be inspected with:
```
curl -XGET 127.0.0.1:9200/custm/_settings?pretty 
```
To apply the setting to every existing index, replace the index name with _all:
```
curl -XPUT "127.0.0.1:9200/_all/_settings" -d '
{
  "index": { "max_result_window": 50000 }
}'
```
Indices created later do not inherit this value, so it has to be set on each new index by hand.

ES installation problems

On a fresh deployment, if searches fail, first check that the service has started, then check whether ES responds:

[root@local]# curl -X GET 127.0.0.1:9200
{
 "name" : "xx.xx.xx.xx",
 "cluster_name" : "xx.xx.xx.xx",
 "version" : {
   "number" : "2.3.5",
   "build_hash" : "90f439ff60a3c0f497f91663701e64ccd01edbb4",
   "build_timestamp" : "2016-07-27T10:36:52Z",
   "build_snapshot" : false,
   "lucene_version" : "5.5.0"
 },
 "tagline" : "You Know, for Search"
}

If the result is connection refused, check whether the host ES is bound to matches the host being queried. If ES runs from a configuration file, verify the configuration: /usr/local/elasticsearch/config/elasticsearch.yml

ES Resources

Online resources

https://www.elastic.co/

https://www.gitbook.com/book/looly/elasticsearch-the-definitive-guide-cn

The ES head plugin

Open it directly in a browser: http://127.0.0.1:9200/_plugin/head/

Use the Chrome extension: search for ES Head in the Chrome Web Store and install it
