Elasticsearch (ES) aggregation queries come in three main kinds: bucket aggregations, metrics aggregations, and pipeline aggregations. Each one serves a different kind of business scenario; a brief look at each follows.
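1. Bucket aggregations
Bucket aggregations resemble a GROUP BY query in a relational database: documents are grouped by a given criterion and counted per group. The original post illustrates this with a diagram (credited to 馬士兵教育).
In that diagram, phones are first bucketed by brand and counted; then, inside the Xiaomi bucket, they are bucketed a second time by tier (a nested bucket query) and counted again.
That is roughly the kind of requirement bucket aggregations exist to serve.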
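The two-level bucketing described above can be sketched in plain Python. This is only an analogy for what the aggregation computes, not how Elasticsearch implements it, and the phone data below is made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (brand, tier) pairs standing in for phone documents.
phones = [
    ("Xiaomi", "flagship"),
    ("Xiaomi", "budget"),
    ("Xiaomi", "budget"),
    ("Huawei", "flagship"),
    ("Apple", "flagship"),
]

# First level: one bucket per brand, holding the document count.
by_brand = Counter(brand for brand, _ in phones)

# Second level: inside each brand bucket, bucket again by tier.
by_brand_and_tier = defaultdict(Counter)
for brand, tier in phones:
    by_brand_and_tier[brand][tier] += 1

print(by_brand["Xiaomi"])                     # 3
print(by_brand_and_tier["Xiaomi"]["budget"])  # 2
```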
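2. Metrics aggregations
Metrics aggregations compute values such as Avg (average), Max (maximum), Min (minimum), Sum (total), Cardinality (distinct count), Value Count (number of values), Stats (several statistics at once), and Top Hits (the top matching documents per bucket). The original post again illustrates this with a diagram (credited to 馬士兵教育).
For example, metrics aggregations can compute the highest and lowest score for a given class or subject.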
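3. Pipeline aggregations
Pipeline aggregations run a second round of aggregation on the output of other aggregations. For example, suppose we need to find, among all phone brands sold in a store, the brand whose average price is lowest.
The first step is to compute the average price per brand; the second is to take the minimum of those averages. That second step is where a pipeline aggregation comes in.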
4. Hands-on practice
4.1 Create the index
Open Kibana Dev Tools and run the following request to create the index:
PUT food
{
  "settings": {
    "number_of_shards": 3,      // 3 primary shards
    "number_of_replicas": 1     // one replica per primary shard
  },
  "mappings": {
    "date_detection": false,    // turn off date detection
    "properties": {
      "CreateTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"    // date format accepted on write
      },
      "Desc": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",             // keyword sub-field, backed by doc values
            "ignore_above": 256
          }
        },
        "analyzer": "ik_max_word",         // index time: split the text as finely as possible
        "search_analyzer": "ik_smart"      // search time: split more coarsely, otherwise users may not find what they want
      },
      "Level": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "Name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "Price": {
        "type": "float"
      },
      "Tags": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "Type": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      }
    }
  }
}
Running the request above creates the index.
4.2 Insert data
PUT food/_doc/1
{ "CreateTime":"2022-06-06 11:11:11", "Desc":"青菜 yyds 營養價值很高,很好喫", "Level":"普通蔬菜", "Name":"青菜", "Price":11.11, "Tags":["性價比","營養","綠色蔬菜"], "Type":"蔬菜" }

PUT food/_doc/2
{ "CreateTime":"2022-06-06 13:11:11", "Desc":"大白菜 好喫 便宜 水分多", "Level":"普通蔬菜", "Name":"大白菜", "Price":12.11, "Tags":["便宜","好喫","白色蔬菜"], "Type":"蔬菜" }

PUT food/_doc/3
{ "CreateTime":"2022-06-07 13:11:11", "Desc":"蘆筍來自國外進口的蔬菜,西餐標配", "Level":"中等蔬菜", "Name":"蘆筍", "Price":66.11, "Tags":["有點貴","國外","綠色蔬菜","營養價值高"], "Type":"蔬菜" }

PUT food/_doc/4
{ "CreateTime":"2022-07-07 13:11:11", "Desc":"蘋果 yyds 好喫 便宜 水分多 營養", "Level":"普通水果", "Name":"蘋果", "Price":11.11, "Tags":["性價比","易種植","水果","營養"], "Type":"水果" }

PUT food/_doc/5
{ "CreateTime":"2022-07-09 13:11:11", "Desc":"榴蓮 非常好喫 很貴 喫一個相當於喫一隻老母雞", "Level":"高級水果", "Name":"榴蓮", "Price":100.11, "Tags":["貴","水果","營養"], "Type":"水果" }

PUT food/_doc/6
{ "CreateTime":"2022-07-08 13:11:11", "Desc":"貓砂王榴蓮 榴蓮中的戰鬥機", "Level":"高級水果", "Name":"貓砂王榴蓮", "Price":300.11, "Tags":["超級貴","進口","水果","非常好喫"], "Type":"水果" }
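Running the requests above inserts the sample documents.
4.3 Bucket aggregations
Now count the products under each tag (for example, how many foods are tagged 超級貴) and sort the buckets by document count in ascending order: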
GET food/_search
{
  "size": 0,    // do not return hits (_source)
  "aggs": {
    "tags_aggs": {
      "terms": {
        "field": "Tags.keyword",        // aggregate on the keyword sub-field: keyword fields have doc values, which aggregations rely on
        "size": 20,                     // maximum number of buckets to return
        "order": { "_count": "asc" }    // ascending by each bucket's document count
      }
    }
  }
}
The response:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 6, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "tags_aggs" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "便宜", "doc_count" : 1 }, { "key" : "國外", "doc_count" : 1 }, { "key" : "好喫", "doc_count" : 1 }, { "key" : "易種植", "doc_count" : 1 }, { "key" : "有點貴", "doc_count" : 1 }, { "key" : "白色蔬菜", "doc_count" : 1 }, { "key" : "營養價值高", "doc_count" : 1 }, { "key" : "貴", "doc_count" : 1 }, { "key" : "超級貴", "doc_count" : 1 }, { "key" : "進口", "doc_count" : 1 }, { "key" : "非常好喫", "doc_count" : 1 }, { "key" : "性價比", "doc_count" : 2 }, { "key" : "綠色蔬菜", "doc_count" : 2 }, { "key" : "水果", "doc_count" : 3 }, { "key" : "營養", "doc_count" : 3 } ] } } }
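As a rough cross-check of the bucket counts above, the same grouping can be reproduced in plain Python over the Tags arrays of the six sample documents. This is a sketch of what the terms aggregation computes on an unanalyzed keyword field, not of how Elasticsearch executes it; the tie-breaking rule in the sort is an assumption made for the illustration.

```python
from collections import Counter

# Tags of the six sample documents (IDs 1 to 6).
tags_per_doc = [
    ["性價比", "營養", "綠色蔬菜"],
    ["便宜", "好喫", "白色蔬菜"],
    ["有點貴", "國外", "綠色蔬菜", "營養價值高"],
    ["性價比", "易種植", "水果", "營養"],
    ["貴", "水果", "營養"],
    ["超級貴", "進口", "水果", "非常好喫"],
]

# A keyword field is not analyzed, so each whole tag is one bucket key.
# set() ensures a document is counted at most once per bucket.
doc_counts = Counter(tag for tags in tags_per_doc for tag in set(tags))

# Ascending by count, like "order": {"_count": "asc"}.
# Breaking ties by key is an assumption of this sketch.
buckets = sorted(doc_counts.items(), key=lambda kv: (kv[1], kv[0]))

print(len(buckets))          # 15
print(doc_counts["營養"])    # 3
print(doc_counts["超級貴"])  # 1
```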
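A few points are worth noting here:
(1) ES does not build doc values (the forward, column-oriented index) for text fields, because their content tends to be long. A text field that has a keyword sub-field gets an inverted index and, through the sub-field, doc values as well; the sub-field is subject to the length limit set with ignore_above. In ES, aggregations generally need doc values.
(2) Doc values are not built for text fields because doing so for large text would serve little purpose, and doc values are written to disk, so it would waste resources and space.
(3) To aggregate on the text field itself, enable fielddata by updating the mapping: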
POST food/_mapping
{
  "properties": {
    "Tags": {
      "type": "text",
      "fielddata": true
    }
  }
}
Run the request above, then search directly on the text field:
GET food/_search
{
  "size": 0,    // do not return hits (_source)
  "aggs": {
    "tags_aggs": {
      "terms": {
        "field": "Tags",                // the text field itself, without .keyword this time
        "size": 20,                     // maximum number of buckets to return
        "order": { "_count": "asc" }    // ascending by each bucket's document count
      }
    }
  }
}
The response:
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 6, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "tags_aggs" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 35, "buckets" : [ { "key" : "便", "doc_count" : 1 }, { "key" : "值", "doc_count" : 1 }, { "key" : "口", "doc_count" : 1 }, { "key" : "國", "doc_count" : 1 }, { "key" : "外", "doc_count" : 1 }, { "key" : "宜", "doc_count" : 1 }, { "key" : "常", "doc_count" : 1 }, { "key" : "易", "doc_count" : 1 }, { "key" : "有", "doc_count" : 1 }, { "key" : "植", "doc_count" : 1 }, { "key" : "點", "doc_count" : 1 }, { "key" : "白", "doc_count" : 1 }, { "key" : "種", "doc_count" : 1 }, { "key" : "級", "doc_count" : 1 }, { "key" : "超", "doc_count" : 1 }, { "key" : "進", "doc_count" : 1 }, { "key" : "非", "doc_count" : 1 }, { "key" : "高", "doc_count" : 1 }, { "key" : "喫", "doc_count" : 2 }, { "key" : "好", "doc_count" : 2 } ] } } }
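The result is clearly shaped by the analyzer. Tags has no ik analyzer configured, so the standard analyzer was used, which splits Chinese text into single characters, and the terms aggregation then bucketed on those tokens.
Note that the structure built by fielddata lives in the JVM heap and is best treated as a stopgap. It can easily lead to an out-of-memory error (OOM), so use it with care on large data sets.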
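The per-character buckets can be reproduced with a small Python sketch. The helper `tokenize_like_standard` is a hypothetical stand-in that simply splits a string into characters; it approximates the standard analyzer only for the pure-CJK tags used here.

```python
from collections import Counter

# Tags of the six sample documents (IDs 1 to 6).
tags_per_doc = [
    ["性價比", "營養", "綠色蔬菜"],
    ["便宜", "好喫", "白色蔬菜"],
    ["有點貴", "國外", "綠色蔬菜", "營養價值高"],
    ["性價比", "易種植", "水果", "營養"],
    ["貴", "水果", "營養"],
    ["超級貴", "進口", "水果", "非常好喫"],
]

def tokenize_like_standard(text):
    # Hypothetical stand-in: one token per character, which is how the
    # standard analyzer ends up treating these CJK-only tags.
    return list(text)

# Count, for every token, the number of documents that contain it.
doc_counts = Counter()
for tags in tags_per_doc:
    tokens = {tok for tag in tags for tok in tokenize_like_standard(tag)}
    doc_counts.update(tokens)

print(doc_counts["好"], doc_counts["喫"])  # 2 2
print("超級貴" in doc_counts)              # False
```

The whole tags are gone as bucket keys, which is why aggregating on `Tags.keyword` is usually the better fit for this kind of question.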
4.4 Metrics aggregations
4.4.1 Compute the highest, lowest, average, and total price across all foods:
GET food/_search
{
  "size": 0,    // do not return hits (_source)
  "aggs": {
    "max_price": { "max": { "field": "Price" } },
    "min_price": { "min": { "field": "Price" } },
    "avg_price": { "avg": { "field": "Price" } },
    "sum_price": { "sum": { "field": "Price" } }
  }
}
The response:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 6, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "max_price" : { "value" : 300.1099853515625 }, "min_price" : { "value" : 11.109999656677246 }, "avg_price" : { "value" : 83.44333092371623 }, "sum_price" : { "value" : 500.65998554229736 } } }
There is also a shortcut: the stats aggregation returns these values in a single request:
GET food/_search
{
  "size": 0,
  "aggs": {
    "price_stats": {
      "stats": { "field": "Price" }
    }
  }
}
The response:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 6, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "price_stats" : { "count" : 6, "min" : 11.109999656677246, "max" : 300.1099853515625, "avg" : 83.44333092371623, "sum" : 500.65998554229736 } } }
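The values in the responses (300.1099853515625 instead of 300.11, for instance) follow from Price being mapped as `float`, a 32-bit type. A small Python sketch can show the effect by round-tripping each price through single precision; the helper `as_float32` is a name made up for this illustration.

```python
import struct

def as_float32(x):
    # Pack into 4 bytes and unpack again: this yields the nearest
    # single-precision value, widened back to a Python float (a double).
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Prices of the six sample documents.
prices = [11.11, 12.11, 66.11, 11.11, 100.11, 300.11]
stored = [as_float32(p) for p in prices]

print(as_float32(300.11) == 300.11)       # False
print(max(stored) == 300.1099853515625)   # True
print(min(stored) == 11.109999656677246)  # True
print(sum(stored) / len(stored))          # close to 83.4433
```

If exact decimal amounts matter, a mapping type such as `scaled_float` may be worth considering; that is a suggestion beyond what the original post covers.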
4.4.2 Count the distinct food names
GET food/_search
{
  "size": 0,    // do not return hits (_source)
  "aggs": {
    "name_count_no_equal": {
      "cardinality": { "field": "Name.keyword" }
    }
  }
}
The response:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 6, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "name_count_no_equal" : { "value" : 6 } } }
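Conceptually, the cardinality aggregation answers "how many distinct values are there?". The sketch below computes that exactly with a Python set over the six sample names. Elasticsearch itself uses an approximate algorithm (HyperLogLog++), so on large data sets its answer may deviate slightly from an exact count like this one.

```python
# Names of the six sample documents.
names = ["青菜", "大白菜", "蘆筍", "蘋果", "榴蓮", "貓砂王榴蓮"]

# Exact distinct count; duplicates collapse inside the set.
distinct = len(set(names))
distinct_with_duplicate = len(set(names + ["青菜"]))

print(distinct)                 # 6
print(distinct_with_duplicate)  # 6
```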
4.5 Pipeline aggregations
Now find the food type whose average price is the lowest:
GET food/_search
{
  "size": 0,
  "aggs": {
    "type_bucket": {
      // first, bucket by the Type field
      "terms": { "field": "Type.keyword" },
      // each bucket needs its average, so nest a metrics aggregation inside the bucket aggregation
      "aggs": {
        "price_bucket": {
          "avg": { "field": "Price" }
        }
      }
    },
    // buckets_path points at the per-bucket averages; min_bucket picks the bucket with the lowest one
    "min_bucket": {
      "min_bucket": { "buckets_path": "type_bucket>price_bucket" }
    }
  }
}
The response:
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 6, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "type_bucket" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "水果", "doc_count" : 3, "price_bucket" : { "value" : 137.1099952061971 } }, { "key" : "蔬菜", "doc_count" : 3, "price_bucket" : { "value" : 29.77666664123535 } } ] }, "min_bucket" : { "value" : 29.77666664123535, "keys" : [ "蔬菜" ] } } }
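In the response, buckets holds the per-Type buckets, the inner price_bucket holds each bucket's average price, and min_bucket, through its buckets_path, identifies the Type with the lowest average.
The logic is procedural. First, bucket by Type. To get each bucket's average, run a metrics aggregation on top of the bucketing, which is the nested aggs inside type_bucket. Finally, on top of those results, use buckets_path to find the bucket with the lowest average.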
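The same three steps can be written out in plain Python over the sample data, as a rough analogy for what the query computes. The averages here are taken over the decimal prices, so they differ from the response in the last digits, where 32-bit float values are averaged.

```python
from collections import defaultdict

# (Type, Price) of the six sample documents.
foods = [
    ("蔬菜", 11.11), ("蔬菜", 12.11), ("蔬菜", 66.11),
    ("水果", 11.11), ("水果", 100.11), ("水果", 300.11),
]

# Step 1: bucket by Type (the terms aggregation).
prices_by_type = defaultdict(list)
for food_type, price in foods:
    prices_by_type[food_type].append(price)

# Step 2: average inside each bucket (the nested avg aggregation).
avg_by_type = {t: sum(p) / len(p) for t, p in prices_by_type.items()}

# Step 3: pick the bucket with the lowest average (the min_bucket pipeline).
min_type = min(avg_by_type, key=avg_by_type.get)

print(min_type, round(avg_by_type[min_type], 2))  # 蔬菜 29.78
```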
4.6 A more complex nested aggregation
Now, within each food type, find the level whose average price is the lowest:
GET food/_search
{
  "size": 0,
  "aggs": {
    "type_bucket": {
      // first, bucket by Type
      "terms": { "field": "Type.keyword" },
      "aggs": {
        "level_bucket": {
          // then bucket by Level inside each Type
          "terms": { "field": "Level.keyword" },
          "aggs": {
            // average price of each Level bucket under each Type
            "price_avg": {
              "avg": { "field": "Price" }
            }
          }
        },
        // this picks the Level with the lowest average, so it must be a sibling of level_bucket
        "min_leve_bucket": {
          "min_bucket": { "buckets_path": "level_bucket>price_avg" }
        }
      }
    }
  }
}
The response:
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 6, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "type_bucket" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "水果", "doc_count" : 3, "level_bucket" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "高級水果", "doc_count" : 2, "price_avg" : { "value" : 200.10999298095703 } }, { "key" : "普通水果", "doc_count" : 1, "price_avg" : { "value" : 11.109999656677246 } } ] }, "min_leve_bucket" : { "value" : 11.109999656677246, "keys" : [ "普通水果" ] } }, { "key" : "蔬菜", "doc_count" : 3, "level_bucket" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "普通蔬菜", "doc_count" : 2, "price_avg" : { "value" : 11.609999656677246 } }, { "key" : "中等蔬菜", "doc_count" : 1, "price_avg" : { "value" : 66.11000061035156 } } ] }, "min_leve_bucket" : { "value" : 11.609999656677246, "keys" : [ "普通蔬菜" ] } } ] } } }
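This is again a procedural script, but note that the aggregation carrying buckets_path must sit at the same level as (be a sibling of) the aggregation it reads from.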
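The nesting can be mirrored in plain Python: the "lowest average" step runs once per Type, over that Type's Level buckets, which is why it sits next to level_bucket rather than at the top level. This is an illustrative sketch over the sample data, with averages taken over the decimal prices rather than 32-bit floats.

```python
from collections import defaultdict

# (Type, Level, Price) of the six sample documents.
foods = [
    ("蔬菜", "普通蔬菜", 11.11),
    ("蔬菜", "普通蔬菜", 12.11),
    ("蔬菜", "中等蔬菜", 66.11),
    ("水果", "普通水果", 11.11),
    ("水果", "高級水果", 100.11),
    ("水果", "高級水果", 300.11),
]

# Bucket by Type, then by Level inside each Type.
prices = defaultdict(lambda: defaultdict(list))
for food_type, level, price in foods:
    prices[food_type][level].append(price)

# For each Type: average per Level, then the Level with the lowest average.
cheapest_level = {}
for food_type, levels in prices.items():
    avgs = {lvl: sum(p) / len(p) for lvl, p in levels.items()}
    best = min(avgs, key=avgs.get)
    cheapest_level[food_type] = (best, avgs[best])

for food_type, (level, avg) in cheapest_level.items():
    print(food_type, level, round(avg, 2))
```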