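Understanding ES terms aggregations

This article covers two kinds of aggregation statistics in ES (ES 7.8.0): cardinality aggregations and terms aggregations. To make them easier to follow, both are explained using sample data from a MySQL table. The table structure is as follows: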

CREATE TABLE `es_agg_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `name` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL COMMENT 'name',
  `label` varchar(128) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT 'label',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=9 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='es agg test example';

The sample data is as follows. The first column is the primary key id, the second is name, and the third is label:

1,apple,iphone12
2,apple,iphone11
3,apple,iphone11
4,huawei,mate30
5,huawei,mate30
6,huawei,mate30
7,huawei,p30
8,huawei,mate20

I. cardinality aggregation

1. How many distinct labels does the es_agg_test table contain?

SQL:

-- SQL; returns 5
select count(distinct label) from es_agg_test;

ES code:

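// ES code (Java High Level REST Client); INDEX and esClient are defined elsewhere
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.CardinalityAggregationBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;

SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.fetchSource(true);
// Approximate distinct count of the "label" field, registered under the aggregation name "labels"
CardinalityAggregationBuilder distinct = AggregationBuilders.cardinality("labels").field("label").precisionThreshold(10000);
sourceBuilder.aggregation(distinct);
SearchRequest searchRequest = buildSearchRequest(INDEX, sourceBuilder);
SearchResponse response = esClient.search(searchRequest, RequestOptions.DEFAULT);

// Wraps the source builder in a search request against the given index
protected SearchRequest buildSearchRequest(String index, SearchSourceBuilder sourceBuilder) {
    SearchRequest request = new SearchRequest(index);
    request.searchType(SearchType.DEFAULT);
    request.source(sourceBuilder);
    return request;
}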

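The cardinality aggregation takes a precision_threshold parameter. In ES 7.8.0 it defaults to 3000 and can be set to at most 40000. In other words, once the es_agg_test table holds more than 40,000 distinct labels, the count that ES returns may be inaccurate.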

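II. terms aggregation

2.1 Counts for every label

Sometimes knowing how many distinct labels exist is not enough; we also want the number of rows (records) for each label.

The sample data contains 5 distinct labels, and here we count the rows (records) for all 5 of them.

Often, though, what is really needed is a top-N statistic, such as the 2 labels with the most rows. In the sample data these are "mate30" and "iphone11".

SQL: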

select label, count(label) from es_agg_test group by label order by count(label) desc;

The output is:

mate30,3
iphone11,2
iphone12,1
p30,1
mate20,1

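Correspondingly, ES can count the rows (records) for each label with a terms aggregation. The terms aggregation takes a size parameter, which in the example above is the total number of distinct labels; that number can be obtained from a cardinality aggregation. Keep in mind, however, the limit imposed by the cardinality aggregation's precision_threshold parameter.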
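To make concrete what the terms aggregation returns, here is a minimal in-memory sketch in plain Java that does not call ES at all. It groups the eight sample labels into (label, count) buckets and keeps the first `size` of them. The class name `LabelBuckets` and the method `topBuckets` are hypothetical names introduced for illustration, and keeping ties in first-seen order is a choice of this sketch, not necessarily how ES orders equal counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LabelBuckets {
    // The label column of the eight sample rows in es_agg_test.
    static final List<String> LABELS = Arrays.asList(
            "iphone12", "iphone11", "iphone11", "mate30",
            "mate30", "mate30", "p30", "mate20");

    // Returns up to `size` (label, count) buckets, largest count first.
    static List<Map.Entry<String, Long>> topBuckets(List<String> labels, int size) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> buckets = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so labels with equal counts keep first-seen order.
        buckets.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return buckets.subList(0, Math.min(size, buckets.size()));
    }

    public static void main(String[] args) {
        // size = 5 is the distinct-label count reported by the cardinality aggregation.
        for (Map.Entry<String, Long> bucket : topBuckets(LABELS, 5)) {
            System.out.println(bucket.getKey() + "," + bucket.getValue());
        }
    }
}
```

With size set to 5 the sketch lists every label, and with size set to 2 it keeps only "mate30" and "iphone11".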

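2.2 Top-N label counts

If only the 2 labels with the most rows are needed, how should size be set? The first instinct is probably to set size to 2. However, ES stores data in a distributed fashion, spread across different shards, so distributed counting introduces error. Suppose an ES index has 2 shards, each holding a different share of the records for each label. If each shard only returns its own top 2, two kinds of errors result:

1. The labels are wrong. The true top 2 labels are "iphone11" and "mate20", but merging the per-shard top 2 yields "iphone11" and "mate30".

2. The counts are wrong. The row count for label "iphone11" should be 510, but the aggregated result is 500.

Because distributed aggregation suffers from these problems, a larger size in an ES terms aggregation gives more accurate results, at a higher performance cost.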

The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).

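The actual requirement is the top 2, but if each shard computes a top 3 instead, the result for the example above matches the "God's-eye view". This is why the terms aggregation has a shard_size parameter. Its default value is computed as: shard_size = (size * 1.5 + 10)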
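The effect of asking each shard for more than the final top N can be sketched with a small simulation in plain Java. The per-shard counts below are hypothetical, chosen only so that the totals match the figures quoted above (a true count of 510 for "iphone11" and a true top 2 of "iphone11" and "mate20"). The class `ShardTopN` and its methods are illustrative names, and the merge logic is a simplified model, not ES's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ShardTopN {
    // Hypothetical counts on the first shard.
    static Map<String, Long> shardA() {
        Map<String, Long> m = new HashMap<>();
        m.put("iphone11", 500L);
        m.put("mate30", 300L);
        m.put("mate20", 290L);
        return m;
    }

    // Hypothetical counts on the second shard.
    static Map<String, Long> shardB() {
        Map<String, Long> m = new HashMap<>();
        m.put("mate20", 200L);
        m.put("p30", 150L);
        m.put("iphone11", 10L);
        m.put("mate30", 5L);
        return m;
    }

    // Keeps the k largest entries of one count map, largest first.
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(k, entries.size()));
    }

    // Each shard reports only its top `shardSize` labels; the coordinator
    // sums what it receives and keeps the top `size`.
    static Map<String, Long> aggregate(List<Map<String, Long>> shards, int size, int shardSize) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shards) {
            for (Map.Entry<String, Long> e : top(shard, shardSize)) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : top(merged, size)) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> shards = Arrays.asList(shardA(), shardB());
        System.out.println(aggregate(shards, 2, 2)); // {iphone11=500, mate30=300}
        System.out.println(aggregate(shards, 2, 3)); // {iphone11=510, mate20=490}
    }
}
```

With a per-shard top 2, the second shard never reports its 10 "iphone11" rows and the first never reports its 290 "mate20" rows, so both the label set and the count come out wrong. With a per-shard top 3, both are reported and the merged top 2 is exact for this data.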

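To compute a top N, ES computes the top (N * 1.5 + 10) on each shard and then merges and sorts the shard results to produce the top N. If, while solving for a top N, the shard_size parameter ends up above 10,000, ES 7.8 reports an error: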

Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting

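The shard_size parameter is bounded by search.max_buckets, a dynamic cluster-level setting in ES (the error message above also calls it a cluster level setting), which defaults to 10000 in ES 7.8.0. See the search.max_buckets setting documentation.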
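As a worked example of the formula, the following plain-Java sketch computes the default shard_size for a given size and finds the largest size that stays within a bucket limit. It assumes the fractional part of size * 1.5 + 10 is truncated, and it follows this article's description that the error appears once shard_size passes search.max_buckets. The class and method names are hypothetical.

```java
final class ShardSizeDefault {
    // Default per-shard bucket count for a requested top-N: size * 1.5 + 10,
    // with the fractional part truncated (an assumption of this sketch).
    static long defaultShardSize(int size) {
        return (long) (size * 1.5 + 10);
    }

    // Largest size whose default shard_size still fits under maxBuckets.
    static int largestSizeWithin(long maxBuckets) {
        int size = 0;
        while (defaultShardSize(size + 1) <= maxBuckets) {
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(defaultShardSize(2));      // 13
        System.out.println(defaultShardSize(6660));   // 10000
        System.out.println(defaultShardSize(6661));   // 10001
        System.out.println(largestSizeWithin(10000)); // 6660
    }
}
```

Under these assumptions a size of 6661 is the first value whose default shard_size reaches 10001, which matches the number in the error message quoted above.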

III. References

1. https://stackoverflow.com/questions/57393548/control-number-of-buckets-created-in-an-aggregation

2. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
