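I. Introduction to Aggregation Analysis
1. What is aggregation analysis in ES?
Aggregation analysis is an important database feature: it performs aggregate computations over the data set returned by a query, such as finding the maximum or minimum of a field (or of a computed expression), or computing a sum or an average. As both a search engine and a data store, ES (Elasticsearch) provides powerful aggregation capabilities as well.
Aggregations that compute a metric such as max, min, sum, or average over a data set are called metric aggregations in ES.
Relational databases, besides aggregate functions, also let you group the queried rows with GROUP BY and then compute metrics per group. In ES, the equivalent of GROUP BY is called bucketing, that is, bucket aggregations.
ES also provides matrix aggregations and pipeline aggregations, although these are still being refined.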
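2. How to write an aggregation query
In the search request body, define aggregations under the aggregations node using the following syntax:
"aggregations" : {
    "<aggregation_name>" : {                            <!-- name of the aggregation -->
        "<aggregation_type>" : {                        <!-- type of the aggregation -->
            <aggregation_body>                          <!-- body: which fields to aggregate on -->
        }
        [,"meta" : { [<meta_data_body>] } ]?            <!-- metadata -->
        [,"aggregations" : { [<sub_aggregation>]+ } ]?  <!-- sub-aggregations nested inside this aggregation -->
    }
    [,"<aggregation_name_2>" : { ... } ]*               <!-- name of a further aggregation -->
}
Note:
aggregations can also be abbreviated as aggs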
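The syntax template above maps directly onto nested dictionaries. As a minimal sketch (the helper name build_agg is hypothetical and not part of any Elasticsearch client), the following Python builds a request body with one named aggregation and an optional sub-aggregation, using the short aggs key:

```python
import json

def build_agg(name, agg_type, body, sub_aggs=None):
    """Return {name: {agg_type: body, "aggs": {...}}} following the template above."""
    node = {agg_type: body}
    if sub_aggs:
        # Sub-aggregations sit next to the aggregation type, not inside its body.
        node["aggs"] = sub_aggs
    return {name: node}

# A terms aggregation on "age" with a nested max aggregation on "balance".
request_body = {
    "size": 0,
    "aggs": build_agg(
        "age_terms", "terms", {"field": "age"},
        sub_aggs=build_agg("max_balance", "max", {"field": "balance"}),
    ),
}
print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted as the body of a POST /bank/_search request.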
3. Where aggregated values come from
The values to aggregate can be taken from a field, or they can be the result of a script.
II. Metric Aggregations
1. max, min, sum, avg
Example 1: find the maximum balance across all customers
POST /bank/_search
{
  "size": 0,
  "aggs": {
    "max_balance": { "max": { "field": "balance" } }
  }
}
Result 1:
{ "took": 2080, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "max_balance": { "value": 49989 } } }
Example 2: find the maximum balance among customers aged 24
POST /bank/_search
{
  "size": 2,
  "query": { "match": { "age": 24 } },
  "sort": [ { "balance": { "order": "desc" } } ],
  "aggs": {
    "max_balance": { "max": { "field": "balance" } }
  }
}
Result 2:
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 42, "max_score": null, "hits": [ { "_index": "bank", "_type": "_doc", "_id": "697", "_score": null, "_source": { "account_number": 697, "balance": 48745, "firstname": "Mallory", "lastname": "Emerson", "age": 24, "gender": "F", "address": "318 Dunne Court", "employer": "Exoplode", "email": "[email protected]", "city": "Montura", "state": "LA" }, "sort": [ 48745 ] }, { "_index": "bank", "_type": "_doc", "_id": "917", "_score": null, "_source": { "account_number": 917, "balance": 47782, "firstname": "Parks", "lastname": "Hurst", "age": 24, "gender": "M", "address": "933 Cozine Avenue", "employer": "Pyramis", "email": "[email protected]", "city": "Lindcove", "state": "GA" }, "sort": [ 47782 ] } ] }, "aggregations": { "max_balance": { "value": 48745 } } }
Example 3: values taken from a script. Compute the average age of all customers, and the average of age plus 10
POST /bank/_search?size=0
{
  "aggs": {
    "avg_age": { "avg": { "script": { "source": "doc.age.value" } } },
    "avg_age10": { "avg": { "script": { "source": "doc.age.value + 10" } } }
  }
}
Result 3:
{ "took": 86, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "avg_age": { "value": 30.171 }, "avg_age10": { "value": 40.171 } } }
Example 4: specify a field, and use _value in the script to refer to the field's value
POST /bank/_search?size=0
{
  "aggs": {
    "sum_balance": {
      "sum": { "field": "balance", "script": { "source": "_value * 1.03" } }
    }
  }
}
Result 4:
{ "took": 165, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "sum_balance": { "value": 26486282.11 } } }
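Example 5: supply a value for documents that lack the field. If missing is not specified, documents without a value for the field are ignored.
POST /bank/_search?size=0
{
  "aggs": {
    "avg_age": { "avg": { "field": "age", "missing": 18 } }
  }
}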
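To make the effect of missing concrete, here is a small plain-Python sketch of the behaviour described in Example 5. It does not call Elasticsearch; the function name avg_with_missing and the sample documents are made up for illustration.

```python
def avg_with_missing(docs, field, missing=None):
    """Average `field` over docs; docs lacking it are skipped unless `missing` is given."""
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)  # substitute the default, like "missing": 18
    return sum(values) / len(values) if values else None

docs = [{"age": 30}, {"age": 40}, {"name": "no age"}]
print(avg_with_missing(docs, "age"))              # only two documents count
print(avg_with_missing(docs, "age", missing=18))  # third document counts as 18
```

Without missing, the average is taken over two documents; with missing set to 18, the third document pulls the average down.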
2. Document count (count)
Example 1: count the documents in the bank index whose age is 24
POST /bank/_doc/_count
{
  "query": { "match": { "age": 24 } }
}
Result 1:
{ "count": 42, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 } }
3. Value count: counts the values of a field (for a single-valued field, the number of documents that have a value)
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_count": { "value_count": { "field": "age" } }
  }
}
Result 1:
{ "took": 2022, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_count": { "value": 1000 } } }
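4. cardinality: count of distinct values
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_count": { "cardinality": { "field": "age" } },
    "state_count": { "cardinality": { "field": "state.keyword" } }
  }
}
Note: for state, use its keyword sub-field (state.keyword), since analyzed text fields cannot be aggregated on by default.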
Result 1:
{ "took": 2074, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "state_count": { "value": 51 }, "age_count": { "value": 21 } } }
5. stats: returns five values: count, max, min, avg, sum
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": { "stats": { "field": "age" } }
  }
}
Result 1:
{ "took": 7, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_stats": { "count": 1000, "min": 20, "max": 40, "avg": 30.171, "sum": 30171 } } }
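6. Extended stats
Extended statistics return four more results than stats: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": { "extended_stats": { "field": "age" } }
  }
}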
Result 1:
{ "took": 7, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_stats": { "count": 1000, "min": 20, "max": 40, "avg": 30.171, "sum": 30171, "sum_of_squares": 946393, "variance": 36.10375899999996, "std_deviation": 6.008640362012022, "std_deviation_bounds": { "upper": 42.18828072402404, "lower": 18.153719275975956 } } } }
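The extra fields in Result 1 follow from count, sum, and sum_of_squares alone. The sketch below recomputes them in plain Python, assuming the population variance formula (mean of squares minus square of the mean), which reproduces the numbers shown above up to floating-point rounding. The function name extended_stats is only illustrative.

```python
import math

def extended_stats(count, total, sum_of_squares, sigma=2):
    """Derive variance, standard deviation and +/- sigma bounds from running sums."""
    avg = total / count
    # Population variance: E[x^2] - (E[x])^2
    variance = sum_of_squares / count - avg * avg
    std = math.sqrt(variance)
    return {
        "avg": avg,
        "variance": variance,
        "std_deviation": std,
        "std_deviation_bounds": {
            "upper": avg + sigma * std,
            "lower": avg - sigma * std,
        },
    }

# Inputs taken from Result 1 above.
stats = extended_stats(count=1000, total=30171, sum_of_squares=946393)
print(stats)
```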
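7. Percentiles: the values found at given percentiles
For the values of the given field (or script), documents are accumulated from the smallest value upward while tracking what share of all matching documents has been covered so far; the aggregation returns the value reached at each requested percentage. By default it returns the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles. For instance, the 50.0 entry in the result below can be read as: 50% of the documents have age <= 31, or conversely, documents with age <= 31 make up 50% of all matching documents. Elasticsearch computes percentiles approximately, which is why a value such as 35.00000000000001 can appear.
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": { "percentiles": { "field": "age" } }
  }
}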
Result 1:
{ "took": 87, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_percents": { "values": { "1.0": 20, "5.0": 21, "25.0": 25, "50.0": 31, "75.0": 35.00000000000001, "95.0": 39, "99.0": 40 } } } }
Example 2: specify which percentiles to return
POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": { "field": "age", "percents": [95, 99, 99.9] }
    }
  }
}
Result 2:
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_percents": { "values": { "95.0": 39, "99.0": 40, "99.9": 40 } } } }
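8. Percentile ranks: the share of documents whose value is less than or equal to a given value
Example 1: compute the share of documents with age at or below 25 and at or below 30, the inverse of item 7
POST /bank/_search?size=0
{
  "aggs": {
    "age_perc_rank": {
      "percentile_ranks": { "field": "age", "values": [ 25, 30 ] }
    }
  }
}
Result 1:
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_perc_rank": { "values": { "25.0": 26.1, "30.0": 49.2 } } } }
Notes on the result: about 26.1% of the documents have age at or below 25, and about 49.2% have age at or below 30. Both figures are approximate.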
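As a rough illustration of what a percentile rank means, the following plain-Python sketch computes it exactly on a small made-up list of ages. Elasticsearch itself uses an approximate algorithm, so its figures on real data can differ slightly from an exact count; the function name percentile_rank is hypothetical.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to `threshold`."""
    if not values:
        return None
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

ages = [20, 22, 25, 25, 30, 31, 35, 40]  # made-up sample data
print(percentile_rank(ages, 25))  # 4 of 8 values are <= 25
print(percentile_rank(ages, 30))  # 5 of 8 values are <= 30
```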
9. Geo Bounds aggregation: computes the bounding box of the geo points in the document set
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html
10. Geo Centroid aggregation: computes the centroid of the geo points
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html
III. Bucket Aggregations
1. Terms Aggregation: group documents by the values (terms) of a field
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": { "terms": { "field": "age" } }
  }
}
Result 1:
{ "took": 2000, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 463, "buckets": [ { "key": 31, "doc_count": 61 }, { "key": 39, "doc_count": 60 }, { "key": 26, "doc_count": 59 }, { "key": 32, "doc_count": 52 }, { "key": 35, "doc_count": 52 }, { "key": 36, "doc_count": 52 }, { "key": 22, "doc_count": 51 }, { "key": 28, "doc_count": 51 }, { "key": 33, "doc_count": 50 }, { "key": 34, "doc_count": 49 } ] } } }
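Notes on the result:
"doc_count_error_upper_bound": 0: the upper bound of the error in the document counts
"sum_other_doc_count": 463: the number of documents belonging to terms that were not returned
By default, the top 10 buckets ordered by document count (descending) are returned:
"buckets": [ { "key": 31, "doc_count": 61 }, { "key": 39, "doc_count": 60 }, ............. ]
There are 61 documents with age 31 and 60 documents with age 39.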
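The relationship between the returned buckets and sum_other_doc_count can be sketched in plain Python. This is an illustrative single-node model only: the function name terms_agg and the sample ages are made up, and ties are broken by ascending key, which is an assumption that happens to match the ordering in the result above. In that result, the ten returned buckets hold 537 documents, and 1000 - 537 = 463.

```python
from collections import Counter

def terms_agg(values, size=10):
    """Top `size` terms by document count, plus the count of everything left out."""
    counts = Counter(values)
    # Sort by doc_count descending, then key ascending for ties.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ordered[:size]
    returned = sum(count for _, count in top)
    return {
        "sum_other_doc_count": len(values) - returned,
        "buckets": [{"key": key, "doc_count": count} for key, count in top],
    }

ages = [31, 31, 31, 39, 39, 26, 22]  # made-up sample data
result = terms_agg(ages, size=2)
print(result)
```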
size specifies how many buckets to return:
Example 2: return 20 buckets
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": { "terms": { "field": "age", "size": 20 } }
  }
}
Example 3: show the error bound on each bucket
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "size": 5,
        "shard_size": 20,
        "show_term_doc_count_error": true
      }
    }
  }
}
Result 3:
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 25, "sum_other_doc_count": 716, "buckets": [ { "key": 31, "doc_count": 61, "doc_count_error_upper_bound": 0 }, { "key": 39, "doc_count": 60, "doc_count_error_upper_bound": 0 }, { "key": 26, "doc_count": 59, "doc_count_error_upper_bound": 0 }, { "key": 32, "doc_count": 52, "doc_count_error_upper_bound": 0 }, { "key": 36, "doc_count": 52, "doc_count_error_upper_bound": 0 } ] } } }
Example 4: shard_size specifies how many buckets each shard returns
The default value of shard_size is:
index with a single shard: = size
multiple shards: = size * 1.5 + 10
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": { "field": "age", "size": 5, "shard_size": 20 }
    }
  }
}
Result 4:
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 25, "sum_other_doc_count": 716, "buckets": [ { "key": 31, "doc_count": 61 }, { "key": 39, "doc_count": 60 }, { "key": 26, "doc_count": 59 }, { "key": 32, "doc_count": 52 }, { "key": 36, "doc_count": 52 } ] } } }
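Why shard_size matters can be seen in a toy model of the merge step. The sketch below is a simplification, not Elasticsearch's actual implementation: each shard reports only its own top shard_size terms, and the coordinating step sums what it receives. The function names and the two-shard data are made up; default_shard_size encodes the defaults stated above.

```python
from collections import Counter

def default_shard_size(size, num_shards):
    # Defaults as stated above: size for a single shard, else size * 1.5 + 10.
    return size if num_shards == 1 else int(size * 1.5 + 10)

def merged_terms(shards, size, shard_size):
    """Merge per-shard top terms and return the overall top `size`."""
    merged = Counter()
    for shard_values in shards:
        # Each shard reports only its own top `shard_size` terms.
        for key, count in Counter(shard_values).most_common(shard_size):
            merged[key] += count
    return merged.most_common(size)

shards = [
    ["a"] * 5 + ["b"] * 4 + ["c"] * 3,  # shard 1: c is only third here
    ["c"] * 5 + ["a"] * 4 + ["b"] * 1,  # shard 2
]
print(merged_terms(shards, size=2, shard_size=2))  # c is undercounted
print(merged_terms(shards, size=2, shard_size=3))  # counts are exact
```

With shard_size=2, shard 1 never reports its three c documents, so the merged count for c is 5 instead of 8. Raising shard_size removes the error at the cost of more data sent per shard.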
order specifies how the buckets are sorted
Example 5: sort by document count
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": { "field": "age", "order": { "_count": "asc" } }
    }
  }
}
Result 5:
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 584, "buckets": [ { "key": 29, "doc_count": 35 }, { "key": 27, "doc_count": 39 }, { "key": 38, "doc_count": 39 }, { "key": 23, "doc_count": 42 }, { "key": 24, "doc_count": 42 }, { "key": 25, "doc_count": 42 }, { "key": 37, "doc_count": 42 }, { "key": 20, "doc_count": 44 }, { "key": 40, "doc_count": 45 }, { "key": 21, "doc_count": 46 } ] } } }
Example 6: sort by bucket key
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": { "field": "age", "order": { "_key": "asc" } }
    }
  }
}
Result 6:
{ "took": 10, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 549, "buckets": [ { "key": 20, "doc_count": 44 }, { "key": 21, "doc_count": 46 }, { "key": 22, "doc_count": 51 }, { "key": 23, "doc_count": 42 }, { "key": 24, "doc_count": 42 }, { "key": 25, "doc_count": 42 }, { "key": 26, "doc_count": 59 }, { "key": 27, "doc_count": 39 }, { "key": 28, "doc_count": 51 }, { "key": 29, "doc_count": 35 } ] } } }
Example 7: sort by a metric computed within each bucket
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": { "field": "age", "order": { "max_balance": "asc" } },
      "aggs": {
        "max_balance": { "max": { "field": "balance" } },
        "min_balance": { "min": { "field": "balance" } }
      }
    }
  }
}
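Ordering buckets by a sub-aggregation amounts to grouping first, computing the metric per group, and then sorting the groups by that metric. A minimal plain-Python sketch of the idea follows; the function name and the sample documents are made up for illustration.

```python
from collections import defaultdict

def terms_ordered_by_max(docs, group_field, metric_field, size=10):
    """Group docs by `group_field`, then sort groups by their max `metric_field` (ascending)."""
    max_per_group = {}
    doc_count = defaultdict(int)
    for doc in docs:
        key = doc[group_field]
        doc_count[key] += 1
        value = doc[metric_field]
        if key not in max_per_group or value > max_per_group[key]:
            max_per_group[key] = value
    ordered = sorted(max_per_group, key=lambda k: max_per_group[k])
    return [
        {"key": k, "doc_count": doc_count[k], "max_balance": {"value": max_per_group[k]}}
        for k in ordered[:size]
    ]

docs = [  # made-up sample data
    {"age": 24, "balance": 48745},
    {"age": 24, "balance": 47782},
    {"age": 31, "balance": 20000},
    {"age": 39, "balance": 49989},
]
buckets = terms_ordered_by_max(docs, "age", "balance")
print(buckets)
```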
Example 8: filter buckets by matching term values against regular expressions
GET /_search
{
  "aggs": {
    "tags": {
      "terms": {
        "field": "tags",
        "include": ".*sport.*",
        "exclude": "water_.*"
      }
    }
  }
}
Example 9: filter buckets with explicit lists of values
GET /_search
{
  "aggs": {
    "JapaneseCars": {
      "terms": { "field": "make", "include": ["mazda", "honda"] }
    },
    "ActiveCarManufacturers": {
      "terms": { "field": "make", "exclude": ["rover", "jensen"] }
    }
  }
}
Example 10: bucket by a value computed by a script
GET /_search
{
  "aggs": {
    "genres": {
      "terms": {
        "script": { "source": "doc['genre'].value", "lang": "painless" }
      }
    }
  }
}
Example 11: handling missing values
GET /_search
{
  "aggs": {
    "tags": {
      "terms": { "field": "tags", "missing": "N/A" }
    }
  }
}
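2. Filter Aggregation: aggregate over the documents that match a filter query
From the documents matched by the query, select those that satisfy the filter and aggregate over them: filter first, then aggregate.
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "filter": { "match": { "gender": "F" } },
      "aggs": {
        "avg_age": { "avg": { "field": "age" } }
      }
    }
  }
}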
Result 1:
{ "took": 163, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count": 493, "avg_age": { "value": 30.3184584178499 } } } }
3. Filters Aggregation: aggregate over several named filter buckets
Example 1:
Prepare the data:
PUT /logs/_doc/_bulk?refresh
{"index":{"_id":1}}
{"body":"warning: page could not be rendered"}
{"index":{"_id":2}}
{"body":"authentication error"}
{"index":{"_id":3}}
{"body":"warning: connection timed out"}
Run the filters aggregation to get one bucket per filter:
GET logs/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "filters": {
        "filters": {
          "errors": { "match": { "body": "error" } },
          "warnings": { "match": { "body": "warning" } }
        }
      }
    }
  }
}
Result of the request above:
{ "took": 18, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 3, "max_score": 0, "hits": [] }, "aggregations": { "messages": { "buckets": { "errors": { "doc_count": 1 }, "warnings": { "doc_count": 2 } } } } }
Example 2: specify a key for the bucket that collects all other documents
GET logs/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "filters": {
        "other_bucket_key": "other_messages",
        "filters": {
          "errors": { "match": { "body": "error" } },
          "warnings": { "match": { "body": "warning" } }
        }
      }
    }
  }
}
Result 2:
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 3, "max_score": 0, "hits": [] }, "aggregations": { "messages": { "buckets": { "errors": { "doc_count": 1 }, "warnings": { "doc_count": 2 }, "other_messages": { "doc_count": 0 } } } } }
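The bucket counts in Result 2 can be reproduced with a plain-Python sketch of the filters logic, using the three log documents indexed above. This is an illustration only: filters_agg and body_contains are hypothetical names, and body_contains is a crude stand-in for a match query (a case-insensitive whole-word test), not how Elasticsearch analyzes text.

```python
def filters_agg(docs, filters, other_bucket_key=None):
    """Count docs per named filter; unmatched docs go to the optional 'other' bucket."""
    buckets = {name: 0 for name in filters}
    other = 0
    for doc in docs:
        matched = False
        for name, predicate in filters.items():
            if predicate(doc):
                buckets[name] += 1  # a document may land in several buckets
                matched = True
        if not matched:
            other += 1
    if other_bucket_key is not None:
        buckets[other_bucket_key] = other
    return buckets

def body_contains(word):
    # Crude stand-in for a match query: case-insensitive whole-word test.
    return lambda doc: word in doc["body"].lower().replace(":", " ").split()

logs = [
    {"body": "warning: page could not be rendered"},
    {"body": "authentication error"},
    {"body": "warning: connection timed out"},
]
result = filters_agg(
    logs,
    {"errors": body_contains("error"), "warnings": body_contains("warning")},
    other_bucket_key="other_messages",
)
print(result)
```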
4. Range Aggregation: bucket documents by value ranges
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "ranges": [
          { "to": 25 },
          { "from": 25, "to": 35 },
          { "from": 35 }
        ]
      },
      "aggs": {
        "bmax": { "max": { "field": "balance" } }
      }
    }
  }
}
Result 1:
{ "took": 7, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_range": { "buckets": [ { "key": "*-25.0", "to": 25, "doc_count": 225, "bmax": { "value": 49587 } }, { "key": "25.0-35.0", "from": 25, "to": 35, "doc_count": 485, "bmax": { "value": 49795 } }, { "key": "35.0-*", "from": 35, "doc_count": 290, "bmax": { "value": 49989 } } ] } } }
Example 2: specify a key for each bucket
POST /bank/_search?size=0
{
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "keyed": true,
        "ranges": [
          { "to": 25, "key": "Ld" },
          { "from": 25, "to": 35, "key": "Md" },
          { "from": 35, "key": "Od" }
        ]
      }
    }
  }
}
Result 2:
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "age_range": { "buckets": { "Ld": { "to": 25, "doc_count": 225 }, "Md": { "from": 25, "to": 35, "doc_count": 485 }, "Od": { "from": 35, "doc_count": 290 } } } } }
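One detail worth spelling out is where a boundary value such as 25 ends up: in a range aggregation, from is inclusive and to is exclusive, so age 25 belongs to the "Md" bucket, not "Ld". A minimal plain-Python sketch with made-up ages (the function name range_agg is illustrative):

```python
def range_agg(values, ranges):
    """Count values per keyed range; 'from' is inclusive, 'to' is exclusive."""
    buckets = {}
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        buckets[r["key"]] = sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
    return buckets

ages = [20, 24, 25, 34, 35, 40]  # made-up sample data
ranges = [
    {"to": 25, "key": "Ld"},
    {"from": 25, "to": 35, "key": "Md"},
    {"from": 35, "key": "Od"},
]
result = range_agg(ages, ranges)
print(result)
```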
5. Date Range Aggregation: bucket documents by date ranges
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ]
      }
    }
  }
}
Result 1 (both buckets are empty here, presumably because the bank documents have no date field):
{ "took": 115, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "range": { "buckets": [ { "key": "*-2017-08-01T00:00:00.000Z", "to": 1501545600000, "to_as_string": "2017-08-01T00:00:00.000Z", "doc_count": 0 }, { "key": "2017-08-01T00:00:00.000Z-*", "from": 1501545600000, "from_as_string": "2017-08-01T00:00:00.000Z", "doc_count": 0 } ] } } }
6. Date Histogram Aggregation: time-based histogram (bar chart) aggregation
This aggregates by day, month, year, and so on. Buckets can use one of the intervals year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), second (1s), or a custom time interval.
Example 1:
POST /bank/_search?size=0
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": { "field": "date", "interval": "month" }
    }
  }
}
Result 1 (no buckets are returned here, presumably because the bank documents have no date field):
{ "took": 9, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "sales_over_time": { "buckets": [] } } }
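Since the result above contains no buckets, a plain-Python sketch on made-up dates may help show what a monthly date histogram produces: each document is assigned to the bucket keyed by the first day of its month. The function name and the key format are illustrative assumptions; this sketch also omits empty months, whereas Elasticsearch may return empty buckets for gaps depending on min_doc_count.

```python
from collections import Counter
from datetime import datetime

def month_histogram(timestamps):
    """Count dates per calendar month, keyed by the first day of the month."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%d")
        counts[dt.strftime("%Y-%m-01")] += 1  # bucket key = first day of the month
    return [{"key_as_string": k, "doc_count": counts[k]} for k in sorted(counts)]

dates = ["2017-08-03", "2017-08-21", "2017-09-01", "2017-11-15"]  # made-up data
histogram = month_histogram(dates)
print(histogram)
```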
7. Missing Aggregation: a bucket of the documents that lack a value for the field
POST /bank/_search?size=0
{
  "aggs": {
    "account_without_a_age": { "missing": { "field": "age" } }
  }
}
8. Geo Distance Aggregation: bucket documents by geographic distance
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html