Elasticsearch aggregations, part 1: introducing buckets and metrics

Original

2020-06-25 21:43

Today let's talk about Elasticsearch's aggregation feature.

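When explaining Elasticsearch, people like to compare it with a relational database. For one thing, most coders know relational databases at least a little and are clear on the basic relational model and what SELECT does. For another, setting the internal implementation aside, the two are comparable in the features they expose. Comparing them helps newcomers understand and use ES. So let's see how aggregation compares in the two.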

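Suppose ES has an index that stores the sales records of a car dealership: the sale price, the sale date, the body color, the manufacturer, and so on. The mapping looks like this: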

PUT cars
{
  "mappings": {
    "properties": {
      "price":{
        "type": "long"
      },
      "color":{
        "type": "keyword"
      },
      "make":{
        "type": "keyword"
      },
      "sold":{
        "type": "date"
      }
    }
  }
}

In MySQL, this index might correspond to the following table:

create table cars(price integer,color  varchar(20),make  varchar(50),sold datetime);

Write some data into ES:

POST /cars/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

Write the same data into MySQL:

insert into cars values(10000,'red','honda','2014-10-28'),(20000,'red','honda','2014-11-05'),(30000,'green','ford','2014-05-18'),(15000,'blue','toyota','2014-07-02'),(12000,'green','toyota','2014-08-19'),(20000,'red','honda','2014-11-05'),(80000,'red','bmw','2014-01-01'),(25000,'blue','ford','2014-02-12');

Now suppose we want to run an aggregation that counts how many cars of each color were sold. In MySQL we can write:

mysql> select count(1) from cars group by color;
+----------+
| count(1) |
+----------+
|        4 |
|        2 |
|        2 |
+----------+
3 rows in set (0.00 sec)

So how do we write this in ES? Here is the request, followed by its response:

GET cars/_search
{
  "size": 0, 
  "aggs": {
    "cts_by_color": {
      "terms": {
        "field": "color",
        "size": 10
      }
    }
  }
}

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "cts_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "red",
          "doc_count" : 4
        },
        {
          "key" : "blue",
          "doc_count" : 2
        },
        {
          "key" : "green",
          "doc_count" : 2
        }
      ]
    }
  }
}

ES does this through its query DSL. Let's break this request down step by step:

GET cars/_search  ①
{
  "size": 0,  ②
  "aggs": {  ③
    "cts_by_color": {  ④
      "terms": {  ⑤
        "field": "color",  ⑥
        "size": 10  ⑦
      }
    }
  }
}

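①: The aggregation lives inside the search context, so an aggregation request is a search request.
②: size 0 means the response does not return the matching documents themselves, only the aggregation results. The aggregation here runs over all documents, because the query part is omitted, which is equivalent to writing match_all:

"query": {
  "match_all": {}
}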

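③: aggs is a top-level keyword indicating an aggregation (the longer form aggregations is accepted as well).
④: A user-defined name for this aggregation. It cannot be omitted; the results come back under this name in the response.
⑤: terms is the aggregation type. It suits keyword fields and other field types appropriate for bucket aggregations. To use it on a text field, fielddata must be enabled on that field.
⑥: Buckets are generated dynamically from the color field, one bucket per unique value.
⑦: Return the top 10 buckets.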

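What this request does: over all documents, it dynamically generates buckets from the distinct values of the color field, sorts the buckets by their document count (the metric) in descending order, and returns the top 10.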
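
To make that behaviour concrete, here is a minimal Python sketch that imitates the terms aggregation on the sample data in memory. It is not how Elasticsearch works internally. The function name terms_agg is made up for illustration, and breaking ties between equal counts by key is an assumption chosen so the output lines up with the response shown earlier.

```python
from collections import Counter

# The color field of the eight sample documents, in bulk-request order.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

def terms_agg(values, size=10):
    """Imitate a terms aggregation: one bucket per distinct value."""
    counts = Counter(values)
    # Sort by doc_count descending; ties are broken by key (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only the first `size` buckets, like "size": 10 in the request.
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]

buckets = terms_agg(colors)
for bucket in buckets:
    print(bucket)
```

With only three distinct colors, size 10 returns every bucket; calling terms_agg(colors, size=1) would keep just the red bucket.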

Here is the syntax of an aggregation:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
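
The grammar above can be read as a recursive structure: each named aggregation holds one type with its body, plus optional meta and optional sub-aggregations. As a rough sketch, the following Python builds a request body of that shape. The helper build_agg and the sub-aggregation name avg_price are hypothetical names introduced here for illustration; terms and avg are real Elasticsearch aggregation types.

```python
import json

def build_agg(agg_type, body, meta=None, sub_aggs=None):
    """Return the object that sits under "<aggregation_name>"."""
    node = {agg_type: body}                # "<aggregation_type>": { <aggregation_body> }
    if meta is not None:
        node["meta"] = meta                # optional "meta" block
    if sub_aggs:
        node["aggregations"] = sub_aggs    # optional nested sub-aggregations
    return node

request_body = {
    "size": 0,
    "aggregations": {
        # Same terms aggregation as before, with a hypothetical avg sub-aggregation.
        "cts_by_color": build_agg(
            "terms",
            {"field": "color", "size": 10},
            sub_aggs={"avg_price": build_agg("avg", {"field": "price"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON could be pasted after GET cars/_search; the sketch only builds the body and does not contact a cluster.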

Compared with the MySQL statement, the buckets here correspond to GROUP BY in SQL, and the default document-count metric, which we did not have to write out, corresponds to count(1).
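
The correspondence can be checked with a small script. The sketch below loads the same eight rows into an in-memory SQLite database (SQLite stands in for MySQL here only so the example runs without a server) and runs a GROUP BY that also selects the color and orders the groups the way the ES response is ordered. The alias doc_count is chosen here to mirror the ES field name.

```python
import sqlite3

rows = [
    (10000, "red", "honda", "2014-10-28"),
    (20000, "red", "honda", "2014-11-05"),
    (30000, "green", "ford", "2014-05-18"),
    (15000, "blue", "toyota", "2014-07-02"),
    (12000, "green", "toyota", "2014-08-19"),
    (20000, "red", "honda", "2014-11-05"),
    (80000, "red", "bmw", "2014-01-01"),
    (25000, "blue", "ford", "2014-02-12"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table cars(price integer, color varchar(20), make varchar(50), sold datetime)"
)
conn.executemany("insert into cars values (?, ?, ?, ?)", rows)

# GROUP BY plays the role of the buckets, count(1) the role of doc_count.
result = conn.execute(
    "select color, count(1) as doc_count from cars "
    "group by color order by doc_count desc, color"
).fetchall()
conn.close()

print(result)
```

Each returned tuple pairs a bucket key with its document count, which is the same information as the buckets array in the ES response.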

This brings us to two important concepts in ES: buckets and metrics.

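buckets: a bucket is a set of documents that meet a given criterion. In short, documents are grouped by a condition, and documents in the same group fall into the same bucket. During an aggregation, if the value of the relevant field in a document satisfies a bucket's criterion, the document falls into that bucket. Buckets can be nested to form a hierarchy.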
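
As an illustration of nested buckets, the following sketch groups the sample documents by color first and then by make inside each color. It is plain Python over in-memory tuples, a simplification for illustration and not Elasticsearch's implementation; the variable names are arbitrary.

```python
from collections import defaultdict

# (color, make) of the eight sample documents, in bulk-request order.
cars = [
    ("red", "honda"), ("red", "honda"), ("green", "ford"), ("blue", "toyota"),
    ("green", "toyota"), ("red", "honda"), ("red", "bmw"), ("blue", "ford"),
]

# Outer bucket: color. Inner bucket: make. Value: document count.
nested = defaultdict(lambda: defaultdict(int))
for color, make in cars:
    nested[color][make] += 1

# Convert to plain dicts for readable printing.
nested_buckets = {color: dict(makes) for color, makes in nested.items()}

for color, makes in nested_buckets.items():
    print(color, makes)
```

Every document lands in exactly one inner bucket, so the inner counts for a color add up to that color's outer count (4 for red, 2 each for blue and green).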

metrics: these are the end goal. Once the buckets have grouped the documents, a metric computes a value over each group, for example avg, sum, max, or min.
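
To show what metrics over buckets amount to, here is a small sketch that groups the sample prices by color and computes avg, sum, max, and min per group. It runs in memory and only mimics the idea; the dictionary layout is an assumption made for readability, not the shape of an ES response.

```python
from collections import defaultdict

# (price, color) of the eight sample documents, in bulk-request order.
cars = [
    (10000, "red"), (20000, "red"), (30000, "green"), (15000, "blue"),
    (12000, "green"), (20000, "red"), (80000, "red"), (25000, "blue"),
]

# Step 1: bucket the prices by color.
prices_by_color = defaultdict(list)
for price, color in cars:
    prices_by_color[color].append(price)

# Step 2: compute metrics inside each bucket.
metrics = {
    color: {
        "doc_count": len(prices),
        "avg": sum(prices) / len(prices),
        "sum": sum(prices),
        "max": max(prices),
        "min": min(prices),
    }
    for color, prices in prices_by_color.items()
}

for color, values in metrics.items():
    print(color, values)
```

For the red bucket this gives an average of 32500.0 over four cars, pulled up by the single 80000 sale.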
