ES Study Notes 8: Aggregations

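Aggregations in ES are comparable to GROUP BY in a relational database: documents that share a criterion are grouped together, and each group is then analyzed separately.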

high-level concepts

To understand aggregations (statistics), you first need two key concepts.

Buckets

Collections of documents that meet a criterion.

Metrics

Statistics calculated on the documents in a bucket.

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {  // this is an aggregation request
        "colors" : {  // name of this aggregation (you choose it)
            "terms" : {
              "field" : "color"  // bucket definition: group by color
            }
        }
    }
}

Notice the count search_type. We do not care about the search hits, only the aggregation totals. Elasticsearch executes a search in two phases, query and fetch, and search_type=count omits the fetch phase, which makes the request faster. On Elasticsearch 2.0 and later, set "size": 0 in the request body instead: search_type=count was deprecated in 2.0 and removed in 5.0.

{
...
   "hits": {
      "hits": []  // empty because search_type=count skips the fetch phase
   },
   "aggregations": {
      "colors": {  // the aggregation name you defined
         "buckets": [
            {
               "key": "red",  // the red bucket
               "doc_count": 4  // number of documents in this bucket
            },
            {
               "key": "blue",
               "doc_count": 2
            },
            {
               "key": "green",
               "doc_count": 2
            }
         ]
      }
   }
}
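What the terms bucket computes can be sketched in plain Python, without Elasticsearch. The color list below is hypothetical sample data chosen to match the counts in the response above, and terms_agg is an illustrative helper, not an Elasticsearch API.

```python
from collections import Counter

# Hypothetical sample data: one color value per document.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

def terms_agg(values):
    """Build one bucket per distinct value, with the number of documents in it."""
    counts = Counter(values)
    # Order by doc_count descending, breaking ties by key, as in the response above.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered]

buckets = terms_agg(colors)
for bucket in buckets:
    print(bucket)
```

Each distinct value becomes a bucket key, and doc_count is simply the size of that group.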

adding a metric to the mix

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {  // a nested "aggs" wraps the metric definition
            "avg_price": {  // name of the metric (you choose it)
               "avg": {
                  "field": "price"  // the average price is computed for each color bucket
               }
            }
         }
      }
   }
}
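A metric nested inside a bucket is evaluated once per bucket. A minimal Python sketch of that idea follows, assuming hypothetical (color, price) pairs; avg_by_bucket is an illustrative name only.

```python
from collections import defaultdict

# Hypothetical sample data: (color, price) for each document.
docs = [
    ("red", 10000), ("red", 20000), ("green", 30000), ("blue", 15000),
    ("green", 12000), ("red", 20000), ("red", 80000), ("blue", 25000),
]

def avg_by_bucket(docs):
    """Bucket prices by color, then compute the avg metric inside each bucket."""
    prices = defaultdict(list)
    for color, price in docs:
        prices[color].append(price)
    return {color: sum(p) / len(p) for color, p in prices.items()}

avg_price = avg_by_bucket(docs)
print(avg_price)
```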

buckets inside buckets

Buckets can be nested. This is the equivalent of GROUP BY color, make: bucket by color first, then by make within each color.

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": { 
            "avg_price": {  // note the nesting level: this average is computed over the enclosing colors bucket, not per make
               "avg": {
                  "field": "price"
               }
            },
            "make": { 
                "terms": {
                    "field": "make" 
                }
            }
         }
      }
   }
}

one final modification

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "colors": {
         "terms": {
            "field": "color"
         },
         "aggs": {
            "avg_price": { "avg": { "field": "price" }
            },
            "make" : {
                "terms" : {
                    "field" : "make"
                },
                "aggs" : {  // a second level of metrics, computed over documents grouped by color and then make
                    "min_price" : { "min": { "field": "price"} },  // lowest price
                    "max_price" : { "max": { "field": "price"} }  // highest price
                }
            }
         }
      }
   }
}
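The nesting of buckets and the per-bucket min and max metrics can be pictured as a dictionary of dictionaries. The sketch below assumes hypothetical (color, make, price) tuples, and the helper name is made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (color, make, price) for each document.
docs = [
    ("red", "honda", 10000), ("red", "honda", 20000), ("green", "ford", 30000),
    ("blue", "toyota", 15000), ("green", "toyota", 12000), ("red", "honda", 20000),
    ("red", "bmw", 80000), ("blue", "ford", 25000),
]

def min_max_by_color_and_make(docs):
    """Bucket by color, then by make, then compute min and max in the inner bucket."""
    groups = defaultdict(lambda: defaultdict(list))
    for color, make, price in docs:
        groups[color][make].append(price)
    return {
        color: {
            make: {"min_price": min(prices), "max_price": max(prices)}
            for make, prices in makes.items()
        }
        for color, makes in groups.items()
    }

result = min_max_by_color_and_make(docs)
print(result["red"])
```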

building bar charts

GET /cars/transactions/_search?search_type=count
{
   "aggs":{
      "price":{
         "histogram":{ 
            "field": "price",
            "interval": 20000  // an interval of 20000 gives the buckets [0-19999, 20000-39999, 40000-59999, 60000-79999, ...]
         },
         "aggs":{
            "revenue": {
               "sum": { 
                 "field" : "price"
               }
             }
         }
      }
   }
}

As you can see, our query is built around the price aggregation, which contains a histogram bucket. This bucket requires a numeric field to calculate buckets on, and an interval size. The interval defines how "wide" each bucket is. An interval of 20000 means we will have the ranges

[0-19999,
20000-39999, ...]
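The bucket a value falls into follows from rounding the value down to a multiple of the interval. Here is a small sketch of that rule combined with the sum metric, assuming a hypothetical list of prices; histogram_revenue is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical sample data: one price per document.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

def histogram_revenue(prices, interval):
    """Assign each price to the bucket floor(price / interval) * interval and sum per bucket."""
    revenue = defaultdict(int)
    for price in prices:
        key = (price // interval) * interval  # lower bound of the bucket
        revenue[key] += price
    return dict(sorted(revenue.items()))

revenue = histogram_revenue(prices, 20000)
print(revenue)
```

Only buckets that contain at least one document appear in the result, so 40000 and 60000 are absent for this sample.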

If search is the most popular activity in Elasticsearch, building date histograms must be the second most popular. Why would you want to use a date histogram?

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "month", 
            "format": "yyyy-MM-dd" 
         }
      }
   }
}

returning empty buckets

Yep, that’s right. We are missing a few months! By default, the date_histogram (and histogram too) returns only buckets that have a nonzero document count.

Some months are missing because they contain no data, but more often than not we want them shown even when they are empty.

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "month",
            "format": "yyyy-MM-dd",
            "min_doc_count" : 0,  // return buckets even when they contain no documents
            "extended_bounds" : {  // force the entire year to be returned; by default Elasticsearch returns only buckets between the minimum and maximum values in your data
                "min" : "2014-01-01",
                "max" : "2014-12-31"
            }
         }
      }
   }
}
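The difference between the default output and the output with empty buckets can be sketched as follows. The sale dates are hypothetical, and the fill_empty flag is a made-up stand-in for the combined effect of min_doc_count: 0 and extended_bounds covering one calendar year.

```python
from collections import Counter
from datetime import date

# Hypothetical sample data: the "sold" date of each document.
sold = [
    date(2014, 10, 28), date(2014, 11, 5), date(2014, 5, 18), date(2014, 7, 2),
    date(2014, 8, 19), date(2014, 11, 5), date(2014, 1, 1), date(2014, 2, 12),
]

def monthly_histogram(dates, year, fill_empty):
    """Count documents per month; optionally emit every month of the year, even empty ones."""
    counts = Counter(d.strftime("%Y-%m-01") for d in dates)
    if not fill_empty:
        return dict(sorted(counts.items()))
    months = [date(year, m, 1).strftime("%Y-%m-01") for m in range(1, 13)]
    return {key: counts.get(key, 0) for key in months}

sparse = monthly_histogram(sold, 2014, fill_empty=False)
filled = monthly_histogram(sold, 2014, fill_empty=True)
print(len(sparse), len(filled))
```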

extended example

GET /cars/transactions/_search?search_type=count
{
   "aggs": {
      "sales": {
         "date_histogram": {
            "field": "sold",
            "interval": "quarter", 
            "format": "yyyy-MM-dd",
            "min_doc_count" : 0,
            "extended_bounds" : {
                "min" : "2014-01-01",
                "max" : "2014-12-31"
            }
         },
         "aggs": {
            "per_make_sum": {
               "terms": {
                  "field": "make"
               },
               "aggs": {
                  "sum_price": {
                     "sum": { "field": "price" } 
                  }
               }
            },
            "total_sum": {
               "sum": { "field": "price" } 
            }
         }
      }
   }
}
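The shape of this request (a quarter bucket holding both a per-make breakdown and an overall sum) can be mimicked in Python. The sales tuples are hypothetical sample data, and quarter_start and quarterly_sums are illustrative names only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample data: (sold, make, price) for each document.
sales = [
    (date(2014, 10, 28), "honda", 10000), (date(2014, 11, 5), "honda", 20000),
    (date(2014, 5, 18), "ford", 30000), (date(2014, 7, 2), "toyota", 15000),
    (date(2014, 8, 19), "toyota", 12000), (date(2014, 11, 5), "honda", 20000),
    (date(2014, 1, 1), "bmw", 80000), (date(2014, 2, 12), "ford", 25000),
]

def quarter_start(d):
    """Return the first day of the quarter containing d, as yyyy-MM-dd."""
    first_month = 3 * ((d.month - 1) // 3) + 1
    return date(d.year, first_month, 1).isoformat()

def quarterly_sums(sales):
    """Per quarter: the total price (total_sum) and the price summed per make (per_make_sum)."""
    total = defaultdict(int)
    per_make = defaultdict(lambda: defaultdict(int))
    for sold, make, price in sales:
        key = quarter_start(sold)
        total[key] += price
        per_make[key][make] += price
    return {
        key: {"total_sum": total[key], "per_make_sum": dict(per_make[key])}
        for key in sorted(total)
    }

report = quarterly_sums(sales)
for quarter, sums in report.items():
    print(quarter, sums)
```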

scoping aggregations

GET /cars/transactions/_search  
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color"
            }
        }
    }
}

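query and aggs sit at the same level of the request body, and the aggregation runs over the documents matched by the query.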

global bucket

GET /cars/transactions/_search?search_type=count
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }  // computed over all documents matching ford
        },
        "all": {
            "global" : {},  // the global bucket has no parameters
            "aggs" : {
                "avg_price": {
                    "avg" : { "field" : "price" }  // computed over all documents, not only those matching ford
                }

            }
        }
    }
}
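The two averages in this request differ only in which documents they see. A minimal sketch, assuming hypothetical (make, price) pairs:

```python
# Hypothetical sample data: (make, price) for each document.
docs = [
    ("honda", 10000), ("honda", 20000), ("ford", 30000), ("toyota", 15000),
    ("toyota", 12000), ("honda", 20000), ("bmw", 80000), ("ford", 25000),
]

def avg(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Query scope: only documents matching make == "ford".
single_avg_price = avg([price for make, price in docs if make == "ford"])

# Global scope: every document, regardless of the query.
global_avg_price = avg([price for _, price in docs])

print(single_avg_price, global_avg_price)
```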

filtered query

GET /cars/transactions/_search?search_type=count
{
    "query" : {
        "filtered": {
            "filter": {
                "range": {
                    "price": {
                        "gte": 10000
                    }
                }
            }
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }
        }
    }
}

filter bucket

GET /cars/transactions/_search?search_type=count
{
   "query":{
      "match": {
         "make": "ford"
      }
   },
   "aggs":{
      "recent_sales": {
         "filter": {  // a filter used inside aggs (the filter bucket)
            "range": {
               "sold": {
                  "from": "now-1M"
               }
            }
         },
         "aggs": {
            "average_price":{
               "avg": {
                  "field": "price"  // average price of documents that match both the query and the filter
               }
            }
         }
      }
   }
}

post filter

You may be thinking to yourself, "hmm…is there a way to filter just the search results but not the aggregation?" The answer is to use a post_filter.

A post_filter applies only to the search hits and has no effect on the aggregations.

GET /cars/transactions/_search?search_type=count
{
    "query": {
        "match": {
            "make": "ford"
        }
    },
    "post_filter": {    
        "term" : {
            "color" : "green"
        }
    },
    "aggs" : {
        "all_colors": {
            "terms" : { "field" : "color" }
        }
    }
}

recap

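Key points:

A filter inside a filtered query affects both the search results and the aggregations.

A filter bucket inside aggs affects only the aggregations.

A top-level post_filter affects only the search results.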
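The three scopes can be pictured as a small pipeline: the query selects documents, a filter bucket narrows only what the aggregation sees, and a post_filter narrows only the hits. The sketch below is a toy model with hypothetical data and a made-up search helper, not the Elasticsearch API.

```python
from collections import Counter

# Hypothetical sample data.
docs = [
    {"color": "red", "make": "honda", "price": 10000},
    {"color": "green", "make": "ford", "price": 30000},
    {"color": "blue", "make": "toyota", "price": 15000},
    {"color": "blue", "make": "ford", "price": 25000},
]

def search(docs, query, agg_filter=None, post_filter=None):
    """Return (hits, color counts) under the three filter scopes."""
    matched = [d for d in docs if query(d)]                  # affects hits and aggs
    agg_docs = [d for d in matched if agg_filter is None or agg_filter(d)]  # aggs only
    colors = dict(Counter(d["color"] for d in agg_docs))
    hits = [d for d in matched if post_filter is None or post_filter(d)]    # hits only
    return hits, colors

def is_ford(d):
    return d["make"] == "ford"

# Filter bucket: the hits keep both fords, the aggregation sees only the expensive one.
hits_a, colors_a = search(docs, is_ford, agg_filter=lambda d: d["price"] >= 30000)

# post_filter: the aggregation keeps both fords, the hits keep only the green one.
hits_b, colors_b = search(docs, is_ford, post_filter=lambda d: d["color"] == "green")

print(len(hits_a), colors_a)
print(len(hits_b), colors_b)
```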

sorting multivalue buckets

Buckets in an aggregation result can be sorted. By default they are ordered by doc_count, descending.

intrinsic sorts

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "_count" : "asc"  // sort by doc_count, ascending
              }
            }
        }
    }
}

We introduce an order object into the aggregation, which allows us to sort on one of several values:

_count: Sort by document count. Works with terms, histogram, date_histogram.
_term: Sort by the string value of a term alphabetically. Works only with terms.
_key: Sort by the numeric value of each bucket’s key (conceptually similar to _term). Works only with histogram and date_histogram.

sorting by a metric

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "avg_price" : "asc" 
              }
            },
            "aggs": {
                "avg_price": {
                    "avg": {"field": "price"} 
                }
            }
        }
    }
}
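Ordering buckets by a nested metric amounts to computing the metric per bucket and sorting on it. A short sketch, assuming hypothetical (color, price) pairs:

```python
from collections import defaultdict

# Hypothetical sample data: (color, price) for each document.
docs = [
    ("red", 10000), ("red", 20000), ("green", 30000), ("blue", 15000),
    ("green", 12000), ("red", 20000), ("red", 80000), ("blue", 25000),
]

def buckets_by_avg_price(docs):
    """Bucket by color, compute avg_price per bucket, and sort buckets by it ascending."""
    prices = defaultdict(list)
    for color, price in docs:
        prices[color].append(price)
    buckets = [
        {"key": color, "doc_count": len(p), "avg_price": sum(p) / len(p)}
        for color, p in prices.items()
    ]
    return sorted(buckets, key=lambda b: b["avg_price"])

ordered = buckets_by_avg_price(docs)
print([b["key"] for b in ordered])
```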

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "stats.variance" : "asc" 
              }
            },
            "aggs": {
                "stats": {
                    "extended_stats": {"field": "price"}
                }
            }
        }
    }
}

Referencing a metric by name in order lets you override the sort order with any metric. Some metrics, however, emit multiple values. The extended_stats metric is a good example: it provides half a dozen individual metrics. To sort on one of them, use a dot-path to the value of interest, here stats.variance.
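Sorting on one value of a multivalue metric can be sketched the same way. The example assumes hypothetical (color, price) pairs and assumes that variance means the population variance, which Python exposes as statistics.pvariance.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical sample data: (color, price) for each document.
docs = [
    ("red", 10000), ("red", 20000), ("green", 30000), ("blue", 15000),
    ("green", 12000), ("red", 20000), ("red", 80000), ("blue", 25000),
]

def variance_by_color(docs):
    """Population variance of price inside each color bucket."""
    prices = defaultdict(list)
    for color, price in docs:
        prices[color].append(price)
    return {color: pvariance(p) for color, p in prices.items()}

variance = variance_by_color(docs)
# Equivalent of ordering the buckets by "stats.variance" ascending.
order = sorted(variance, key=variance.get)
print(order)
```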

sorting based on "deep" metrics

finding distinct counts

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color"
            }
        }
    }
}
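Conceptually, cardinality counts the distinct values of a field. The sketch below computes the exact count over hypothetical data; Elasticsearch's cardinality metric is an approximation, so on large, high-cardinality fields its result can differ slightly from an exact count like this one.

```python
# Hypothetical sample data: one color value per document.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

def distinct_count(values):
    """Exact number of distinct values."""
    return len(set(values))

distinct_colors = distinct_count(colors)
print(distinct_colors)
```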