ES Study Notes 8: Aggregations

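Aggregations in ES can be thought of as the GROUP BY of a relational database: documents that share the same value are grouped together, and each group is then analyzed separately.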

high-level concepts

To understand aggregations (analytics), you need the following two key concepts.
Buckets
Collections of documents that meet a criterion.
Metrics
Statistics calculated on the documents in a bucket.

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {                      // this request is an aggregation
        "colors" : {                // name of the aggregation (your choice)
            "terms" : {
                "field" : "color"   // the bucketing criterion: group by color
            }
        }
    }
}
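
Note the count search_type. Because we care only about the aggregation totals and not about the search results, search_type=count is faster: it omits the fetch phase. As covered in the notes on query execution, Elasticsearch runs a search in two phases, query and fetch; a request that needs only the aggregation results can skip the second. (Since Elasticsearch 2.0, search_type=count is deprecated; setting "size": 0 in the request body has the same effect.)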

{
...
    "hits": {
        "hits": []                  // empty because search_type=count skips the fetch phase
    },
    "aggregations": {
        "colors": {                 // the name you gave the aggregation
            "buckets": [
                {
                    "key": "red",       // the bucket for red cars
                    "doc_count": 4      // number of documents in this bucket
                },
                {
                    "key": "blue",
                    "doc_count": 2
                },
                {
                    "key": "green",
                    "doc_count": 2
                }
            ]
        }
    }
}
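
To make the bucket idea concrete, here is a minimal Python sketch that mimics a terms bucket locally. The CARS list is made-up sample data (it does not appear in these notes), chosen so that the counts match the response above, and terms_buckets is a hypothetical helper, not an Elasticsearch API.

```python
from collections import Counter

# Made-up sample documents (an assumption, not part of the original notes).
CARS = [
    {"color": "red", "make": "honda", "price": 10000},
    {"color": "red", "make": "honda", "price": 20000},
    {"color": "green", "make": "ford", "price": 30000},
    {"color": "blue", "make": "toyota", "price": 15000},
    {"color": "green", "make": "toyota", "price": 12000},
    {"color": "red", "make": "honda", "price": 20000},
    {"color": "red", "make": "bmw", "price": 80000},
    {"color": "blue", "make": "ford", "price": 25000},
]


def terms_buckets(docs, field):
    """Group docs by `field`; one bucket per distinct value."""
    counts = Counter(doc[field] for doc in docs)
    # Order by doc_count descending; ties are broken by key here so the
    # output is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


buckets = terms_buckets(CARS, "color")
print(buckets)
```

Each dictionary in the result plays the role of one entry in the "buckets" array of the response.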

adding a metric to the mix

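GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {                       // a nested aggs object wraps the metrics computed inside each bucket
                "avg_price": {              // name of the metric (your choice)
                    "avg": {
                        "field": "price"    // compute the average price of each bucket
                    }
                }
            }
        }
    }
}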
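
The same bucket-then-metric idea can be sketched in plain Python. The sample data is made up for illustration and avg_by is a hypothetical helper; the point is only that the metric is computed once per bucket, not once over all documents.

```python
from collections import defaultdict

# Made-up sample data (an assumption, not part of the original notes).
CARS = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "green", "price": 30000},
    {"color": "blue", "price": 15000},
    {"color": "green", "price": 12000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 80000},
    {"color": "blue", "price": 25000},
]


def avg_by(docs, bucket_field, metric_field):
    """Bucket docs by `bucket_field`, then average `metric_field` per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {key: sum(values) / len(values) for key, values in groups.items()}


avg_price = avg_by(CARS, "color", "price")
print(avg_price)
```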

buckets inside buckets

Buckets can be nested, like GROUP BY color, make: documents are grouped by color first, then by make within each color.

GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {
                "avg_price": {              // a sibling of make, so this average is per color bucket, not per make
                    "avg": {
                        "field": "price"
                    }
                },
                "make": {
                    "terms": {
                        "field": "make"
                    }
                }
            }
        }
    }
}

one final modification

GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {
                "avg_price": { "avg": { "field": "price" }
                },
                "make" : {
                    "terms" : {
                        "field" : "make"
                    },
                    "aggs" : {              // a second level of metrics, computed per (color, make) bucket
                        "min_price" : { "min": { "field": "price"} },   // lowest price
                        "max_price" : { "max": { "field": "price"} }    // highest price
                    }
                }
            }
        }
    }
}
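
As a rough local equivalent of the nested request above, the following sketch groups by color, then by make, and tracks min and max price per (color, make) pair. The data and the color_make_stats helper are illustrative assumptions.

```python
# Made-up sample data (an assumption, not part of the original notes).
CARS = [
    {"color": "red", "make": "honda", "price": 10000},
    {"color": "red", "make": "honda", "price": 20000},
    {"color": "green", "make": "ford", "price": 30000},
    {"color": "blue", "make": "toyota", "price": 15000},
    {"color": "green", "make": "toyota", "price": 12000},
    {"color": "red", "make": "honda", "price": 20000},
    {"color": "red", "make": "bmw", "price": 80000},
    {"color": "blue", "make": "ford", "price": 25000},
]


def color_make_stats(docs):
    """Return {color: {make: {"doc_count", "min_price", "max_price"}}}."""
    result = {}
    for doc in docs:
        makes = result.setdefault(doc["color"], {})   # outer bucket: color
        stats = makes.setdefault(                      # inner bucket: make
            doc["make"],
            {"doc_count": 0, "min_price": doc["price"], "max_price": doc["price"]},
        )
        stats["doc_count"] += 1
        stats["min_price"] = min(stats["min_price"], doc["price"])
        stats["max_price"] = max(stats["max_price"], doc["price"])
    return result


stats = color_make_stats(CARS)
print(stats["red"])
```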

building bar charts

{
    "aggs":{
        "price":{
            "histogram":{
                "field": "price",
                "interval": 20000       // bucket width of 20000, giving the ranges [0-19999, 20000-39999, 40000-59999, 60000-79999, ...]
            },
            "aggs":{
                "revenue": {
                    "sum": {
                        "field" : "price"
                    }
                }
            }
        }
    }
}

As you can see, our query is built around the price aggregation, which contains a histogram bucket. This bucket requires a numeric field to calculate buckets on, and an interval size. The interval defines how "wide" each bucket is. An interval of 20000 means we will have the ranges [0-19999, 20000-39999, ...].
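
The bucketing rule itself is simple arithmetic: each value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key. A minimal sketch, with made-up prices and a hypothetical histogram helper:

```python
from collections import defaultdict

# Made-up prices (an assumption, not part of the original notes).
PRICES = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def histogram(values, interval):
    """Bucket values by interval and sum them per bucket (the "revenue" metric)."""
    counts = defaultdict(int)
    sums = defaultdict(int)
    for value in values:
        key = (value // interval) * interval   # round down to the bucket start
        counts[key] += 1
        sums[key] += value
    # Only non-empty buckets are emitted, ordered by key.
    return [
        {"key": key, "doc_count": counts[key], "revenue": sums[key]}
        for key in sorted(counts)
    ]


buckets = histogram(PRICES, 20000)
print(buckets)
```

With this data the 40000 and 60000 buckets are absent, because no price falls into them.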

If search is the most popular activity in Elasticsearch, building date histograms must be the second most popular. Why would you want to use a date histogram?

GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "sales": {
            "date_histogram": {
                "field": "sold",
                "interval": "month",
                "format": "yyyy-MM-dd"
            }
        }
    }
}

returning empty buckets

Yep, that’s right. We are missing a few months! By default, the date_histogram (and histogram too) returns only buckets that have a nonzero document count.

Some months are missing because they contain no documents, but more often than not we want every month displayed, even the empty ones. Two parameters achieve this: min_doc_count forces empty buckets to be returned, and extended_bounds forces the whole year to be covered, because by default Elasticsearch returns only buckets that lie between the minimum and maximum value in your data.

GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "sales": {
            "date_histogram": {
                "field": "sold",
                "interval": "month",
                "format": "yyyy-MM-dd",
                "min_doc_count" : 0,        // return empty buckets too
                "extended_bounds" : {       // force the entire year to be returned
                    "min" : "2014-01-01",
                    "max" : "2014-12-31"
                }
            }
        }
    }
}
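
The effect of min_doc_count: 0 combined with extended_bounds can be imitated locally: walk every month between the bounds and emit a bucket for each, with a count of zero where there is no data. The sale dates and the monthly_buckets helper below are illustrative assumptions.

```python
from collections import Counter
from datetime import date

# Made-up sale dates (an assumption, not part of the original notes).
SOLD = [
    date(2014, 10, 28), date(2014, 11, 5), date(2014, 5, 18), date(2014, 7, 2),
    date(2014, 8, 19), date(2014, 11, 5), date(2014, 1, 1), date(2014, 2, 12),
]


def monthly_buckets(dates, bounds_min, bounds_max):
    """Count dates per calendar month, including empty months inside the bounds."""
    counts = Counter((d.year, d.month) for d in dates)
    # The range covers both the data and the requested bounds.
    months = list(counts) + [
        (bounds_min.year, bounds_min.month),
        (bounds_max.year, bounds_max.month),
    ]
    (year, month), last = min(months), max(months)
    buckets = []
    while (year, month) <= last:
        buckets.append({
            "key_as_string": date(year, month, 1).isoformat(),
            "doc_count": counts[(year, month)],   # a Counter yields 0 for missing keys
        })
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


buckets = monthly_buckets(SOLD, date(2014, 1, 1), date(2014, 12, 31))
for bucket in buckets:
    print(bucket)
```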

extended example

GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "sales": {
            "date_histogram": {
                "field": "sold",
                "interval": "quarter",
                "format": "yyyy-MM-dd",
                "min_doc_count" : 0,
                "extended_bounds" : {
                    "min" : "2014-01-01",
                    "max" : "2014-12-31"
                }
            },
            "aggs": {
                "per_make_sum": {
                    "terms": {
                        "field": "make"
                    },
                    "aggs": {
                        "sum_price": {
                            "sum": { "field": "price" }
                        }
                    }
                },
                "total_sum": {
                    "sum": { "field": "price" }
                }
            }
        }
    }
}

scoping aggregations

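GET /cars/transactions/_search
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color"
            }
        }
    }
}

query and aggs sit at the same level of the request body; the aggregation operates on the documents matched by the query.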

global bucket

GET /cars/transactions/_search?search_type=count
{
    "query" : {
        "match" : {
            "make" : "ford"
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }       // averaged over the documents matching ford
        },
        "all": {
            "global" : {},                      // the global bucket has no parameters
            "aggs" : {
                "avg_price": {
                    "avg" : { "field" : "price" }   // averaged over all documents, not only those matching ford
                }
            }
        }
    }
}
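
The difference between the two averages can be sketched locally: one is computed over the documents the query matched, the other over every document regardless of the query. The data below is made up for illustration and avg_price is a hypothetical helper.

```python
# Made-up sample data (an assumption, not part of the original notes).
CARS = [
    {"make": "honda", "price": 10000},
    {"make": "honda", "price": 20000},
    {"make": "ford", "price": 30000},
    {"make": "toyota", "price": 15000},
    {"make": "toyota", "price": 12000},
    {"make": "honda", "price": 20000},
    {"make": "bmw", "price": 80000},
    {"make": "ford", "price": 25000},
]


def avg_price(docs):
    """Average the price field over the given documents."""
    return sum(doc["price"] for doc in docs) / len(docs)


# Query scope: only documents that match make == "ford".
ford_docs = [doc for doc in CARS if doc["make"] == "ford"]
single_avg_price = avg_price(ford_docs)

# Global scope: the query is ignored and every document is included.
global_avg_price = avg_price(CARS)

print(single_avg_price, global_avg_price)
```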

filtered query

GET /cars/transactions/_search?search_type=count
{
    "query" : {
        "filtered": {
            "filter": {
                "range": {
                    "price": {
                        "gte": 10000
                    }
                }
            }
        }
    },
    "aggs" : {
        "single_avg_price": {
            "avg" : { "field" : "price" }
        }
    }
}

filter bucket

{
    "query":{
        "match": {
            "make": "ford"
        }
    },
    "aggs":{
        "recent_sales": {
            "filter": {                     // a filter used inside aggs (a filter bucket)
                "range": {
                    "sold": {
                        "from": "now-1M"
                    }
                }
            },
            "aggs": {
                "average_price":{
                    "avg": {
                        "field": "price"    // average price of documents that match both the query and the filter
                    }
                }
            }
        }
    }
}

post filter

You may be thinking to yourself, "hmm…is there a way to filter just the search results but not the aggregation?" The answer is to use a post_filter.

A post_filter applies to the search hits only; the aggregations are computed before it runs and are not affected by it.

GET /cars/transactions/_search
{
    "query": {
        "match": {
            "make": "ford"
        }
    },
    "post_filter": {
        "term" : {
            "color" : "green"
        }
    },
    "aggs" : {
        "all_colors": {
            "terms" : { "field" : "color" }
        }
    }
}

recap

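A filter inside a filtered query affects both the search results and the aggregations.

A filter bucket inside aggs affects only the aggregations.

A top-level post_filter affects only the search results.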
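
The three scopes can be imitated with a small local model. This is only a sketch: the data is made up, and search, is_ford, is_green and is_expensive are hypothetical names, with each filter represented as a plain Python predicate.

```python
# Made-up sample data (an assumption, not part of the original notes).
CARS = [
    {"color": "red", "make": "honda", "price": 10000},
    {"color": "green", "make": "ford", "price": 30000},
    {"color": "blue", "make": "toyota", "price": 15000},
    {"color": "red", "make": "bmw", "price": 80000},
    {"color": "blue", "make": "ford", "price": 25000},
]


def _accept(predicate, doc):
    """A missing predicate accepts every document."""
    return predicate is None or predicate(doc)


def search(docs, query=None, agg_filter=None, post_filter=None):
    """Return hits and the colors seen by the aggregation."""
    matched = [d for d in docs if _accept(query, d)]            # scopes hits and aggs
    agg_docs = [d for d in matched if _accept(agg_filter, d)]   # scopes aggs only
    hits = [d for d in matched if _accept(post_filter, d)]      # scopes hits only
    return {"hits": hits, "colors": sorted({d["color"] for d in agg_docs})}


def is_ford(doc):
    return doc["make"] == "ford"


def is_green(doc):
    return doc["color"] == "green"


def is_expensive(doc):
    return doc["price"] >= 30000


with_post_filter = search(CARS, query=is_ford, post_filter=is_green)
with_filter_bucket = search(CARS, query=is_ford, agg_filter=is_expensive)
print(with_post_filter)
print(with_filter_bucket)
```

With post_filter the hits shrink to the green Ford while the aggregation still sees both Ford colors; with the filter bucket it is the other way around.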

sorting multivalue buckets

The buckets of an aggregation can be sorted. By default they are ordered by doc_count, descending.

intrinsic sorts

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color",
                "order": {
                    "_count" : "asc"    // sort by doc_count, ascending
                }
            }
        }
    }
}

We introduce an order object into the aggregation, which allows us to sort on one of several values:

_count
Sort by document count. Works with terms, histogram, date_histogram.
_term
Sort by the string value of a term alphabetically. Works only with terms.
_key
Sort by the numeric value of each bucket’s key (conceptually similar to _term). Works only with histogram and date_histogram.

sorting by a metric

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color",
                "order": {
                    "avg_price" : "asc"
                }
            },
            "aggs": {
                "avg_price": {
                    "avg": {"field": "price"}
                }
            }
        }
    }
}

This lets you override the sort order with any metric, simply by referencing the name of the metric. Some metrics, however, emit multiple values. The extended_stats metric is a good example: it provides half a dozen individual metrics. To sort on one of them, reference it with a dot path such as stats.variance:

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color",
                "order": {
                    "stats.variance" : "asc"
                }
            },
            "aggs": {
                "stats": {
                    "extended_stats": {"field": "price"}
                }
            }
        }
    }
}
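
Sorting buckets by a metric amounts to computing the metric per bucket and sorting on that value. A minimal local sketch follows; the data is made up, sorted_buckets is a hypothetical helper, and pvariance (population variance) is used on the assumption that it corresponds to the variance reported by extended_stats.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Made-up sample data (an assumption, not part of the original notes).
CARS = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "green", "price": 30000},
    {"color": "blue", "price": 15000},
    {"color": "green", "price": 12000},
    {"color": "red", "price": 20000},
    {"color": "red", "price": 80000},
    {"color": "blue", "price": 25000},
]


def sorted_buckets(docs, field, metric, descending=False):
    """Bucket docs by `field`, apply `metric` to each bucket's prices, sort by it."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc["price"])
    buckets = [
        {"key": key, "doc_count": len(prices), "value": metric(prices)}
        for key, prices in groups.items()
    ]
    return sorted(buckets, key=lambda bucket: bucket["value"], reverse=descending)


by_avg = sorted_buckets(CARS, "color", mean)            # like "avg_price": "asc"
by_variance = sorted_buckets(CARS, "color", pvariance)  # like "stats.variance": "asc"
print([bucket["key"] for bucket in by_avg])
print([bucket["key"] for bucket in by_variance])
```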

sorting based on "deep" metrics

finding distinct counts

GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
                "field" : "color"
            }
        }
    }
}
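
Conceptually, cardinality answers the same question as SELECT COUNT(DISTINCT color). The sketch below computes the exact answer with a Python set over made-up data; note that Elasticsearch's cardinality metric is an approximation (based on the HyperLogLog++ algorithm), so on large data sets its result may differ slightly from an exact count like this one.

```python
# Made-up sample data (an assumption, not part of the original notes).
CARS = [
    {"color": "red", "make": "honda"},
    {"color": "red", "make": "honda"},
    {"color": "green", "make": "ford"},
    {"color": "blue", "make": "toyota"},
    {"color": "green", "make": "toyota"},
    {"color": "red", "make": "honda"},
    {"color": "red", "make": "bmw"},
    {"color": "blue", "make": "ford"},
]


def distinct_count(docs, field):
    """Exact number of distinct values of `field` (hypothetical helper)."""
    return len({doc[field] for doc in docs})


distinct_colors = distinct_count(CARS, "color")
distinct_makes = distinct_count(CARS, "make")
print(distinct_colors, distinct_makes)
```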

