ES Study Notes 10: Data Modeling

handling relationships

Changes to a single document are ACIDic, but transactions involving multiple documents are not. There is no way to roll back the index to its previous state if part of a transaction fails.

application-side joins

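In short, Elasticsearch does not support joins. You can still model simple relations and get the result you want in application code by running two queries: the first finds the related documents, the second uses their IDs.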
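
The two-query approach could be sketched roughly as below. This is only a sketch in Python that builds the request bodies and pulls IDs out of a response; it does not talk to a cluster. The helper names, the fake response, and the field names (name, title, user) are assumptions made for illustration, and the filtered/terms syntax assumes the same older Elasticsearch version as the rest of these notes.

```python
def user_search_body(name):
    """Step 1: search body that finds users by name."""
    return {"query": {"match": {"name": name}}}


def ids_from_hits(response):
    """Collect the _id of every hit in a search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def blogpost_search_body(title, user_ids):
    """Step 2: search body for blog posts written by the users found in step 1."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"title": title}},
                "filter": {"terms": {"user": list(user_ids)}},
            }
        }
    }


# Hypothetical response to step 1, trimmed to the parts used here.
fake_user_response = {"hits": {"hits": [{"_id": "1"}, {"_id": "7"}]}}

user_ids = ids_from_hits(fake_user_response)
second_body = blogpost_search_body("relationships", user_ids)
print(second_body)
```

The cost of this design is one extra round trip, and it only works well when the first query returns a small number of IDs.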

denormalizing your data

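In short, the recommended way to handle relationships between data is a suitable amount of redundancy: copy the fields you need from the related document into each document that refers to it.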
PUT /my_index/user/1
{
  "name":     "John Smith",
  "email":    "[email protected]",
  "dob":      "1970/10/24"
}

PUT /my_index/blogpost/2
{
  "title":    "Relationships",
  "body":     "It's complicated...",
  "user":     {
    "id":       1,
    "name":     "John Smith"
  }
}
The advantage of data denormalization is speed. Each document contains all the information it needs, so no join is required.

field collapsing

In short, the data is folded together: related fields are grouped into blocks using nested JSON objects, for example:
{
  "blob": {
    "title": "..."
  },
  "user": {
    "name": {
      "firstname": "ddd",
      "lastname":  "dddd"
    }
  }
}

denormalization and concurrency

nested objects

PUT /my_index/blogpost/1
{
  "title": "Nest eggs",
  "body":  "Making your money work...",
  "tags":  [ "cash", "shares" ],
  "comments": [
    {
      "name":    "John Smith",
      "comment": "Great article",
      "age":     28,
      "stars":   4,
      "date":    "2014-09-01"
    },
    {
      "name":    "Alice White",
      "comment": "More like this please",
      "age":     31,
      "stars":   5,
      "date":    "2014-10-22"
    }
  ]
}
GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "Alice" }},
        { "match": { "age":  28      }}
      ]
    }
  }
}
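This query matches the document, even though Alice is 31 and it is John who is 28. The reason is that the array of inner objects is flattened into multivalued fields at index time, which destroys the association between the values. The term alice is there and the age 28 is there, but the link between a name and its age is lost. The document is indexed like this: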
{
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ],
  "comments.name":    [ alice, john, smith, white ],
  "comments.comment": [ article, great, like, more, please, this ],
  "comments.age":     [ 28, 31 ],
  "comments.stars":   [ 4, 5 ],
  "comments.date":    [ 2014-09-01, 2014-10-22 ]
}
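
To convince myself of the flattening effect, here is a small Python sketch. It is not how Elasticsearch is implemented; the flatten helper and the lowercase-and-split stand-in for analysis are my own assumptions, and the post is trimmed to the fields that matter.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten inner objects into multivalued fields keyed by dotted path."""
    fields = defaultdict(list)
    for key, value in doc.items():
        name = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                for sub_name, sub_values in flatten(item, name + ".").items():
                    fields[sub_name].extend(sub_values)
            else:
                fields[name].append(item)
    return dict(fields)


post = {
    "title": "Nest eggs",
    "comments": [
        {"name": "John Smith", "age": 28},
        {"name": "Alice White", "age": 31},
    ],
}

flat = flatten(post)
# Crude stand-in for analysis: lowercase the names and split them into terms.
name_terms = {t for n in flat["comments.name"] for t in n.lower().split()}
# Both conditions hold on the flattened document, although no single comment has both.
matches = "alice" in name_terms and 28 in flat["comments.age"]
print(flat["comments.age"], matches)
```

After flattening, nothing records which age belongs to which name, so the two conditions are checked against the whole post rather than against one comment.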

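How to fix it
Set the field's type to nested when defining the mapping. Each nested object is then indexed as its own separate hidden document, followed by the root document: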
{
  "comments.name":    [ john, smith ],
  "comments.comment": [ article, great ],
  "comments.age":     [ 28 ],
  "comments.stars":   [ 4 ],
  "comments.date":    [ 2014-09-01 ]
}
{
  "comments.name":    [ alice, white ],
  "comments.comment": [ like, more, please, this ],
  "comments.age":     [ 31 ],
  "comments.stars":   [ 5 ],
  "comments.date":    [ 2014-10-22 ]
}
{
  "title":            [ eggs, nest ],
  "body":             [ making, money, work, your ],
  "tags":             [ cash, shares ]
}
PUT /my_index
{
  "mappings": {
    "blogpost": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name":    { "type": "string"  },
            "comment": { "type": "string"  },
            "age":     { "type": "short"   },
            "stars":   { "type": "short"   },
            "date":    { "type": "date"    }
          }
        }
      }
    }
  }
}

Because nested objects are indexed as separate hidden documents, we can’t query them directly. Instead, we have to use the nested query or nested filter to access them:

GET /my_index/blogpost/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "eggs" }},
        {
          "nested": {
            "path": "comments",
            "query": {
              "bool": {
                "must": [
                  { "match": { "comments.name": "john" }},
                  { "match": { "comments.age":  28     }}
                ]
              }
            }
          }
        }
      ]
    }
  }
}
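
The difference a nested query makes can be sketched in Python as below. This is a toy model, not the real matching logic: the helper names and the lowercase-and-split stand-in for a match query are assumptions, and the comments are trimmed to name and age.

```python
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]


def name_matches(comment, term):
    """Crude stand-in for a match query on an analyzed name field."""
    return term.lower() in comment["name"].lower().split()


def nested_match(comments, term, age):
    """True only if one single comment satisfies both conditions."""
    return any(name_matches(c, term) and c["age"] == age for c in comments)


john_28 = nested_match(comments, "john", 28)
alice_28 = nested_match(comments, "alice", 28)
print(john_28, alice_28)
```

Because both conditions are evaluated against the same comment, john with 28 matches while alice with 28 does not.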

sorting by nested fields

PUT /my_index/blogpost/2
{
  "title": "Investment secrets",
  "body":  "What they don't tell you ...",
  "tags":  [ "shares", "equities" ],
  "comments": [
    {
      "name":    "Mary Brown",
      "comment": "Lies, lies, lies",
      "age":     42,
      "stars":   1,
      "date":    "2014-10-18"
    },
    {
      "name":    "John Smith",
      "comment": "You're making it up!",
      "age":     28,
      "stars":   2,
      "date":    "2014-10-16"
    }
  ]
}
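
The request below finds blog posts that received comments in October and sorts them by the lowest stars value among those comments. The query clause is a nested filter on comments. The sort clause sorts on comments.stars in ascending order ("order": "asc"), using the minimum value ("mode": "min"). The nested_filter in the sort clause is the same as the nested query in the main query clause.

GET /_search
{
  "query": {
    "nested": {
      "path": "comments",
      "filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  },
  "sort": {
    "comments.stars": {
      "order": "asc",
      "mode":  "min",
      "nested_filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt":  "2014-11-01"
          }
        }
      }
    }
  }
}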
Why do we need to repeat the query conditions in the nested_filter? The reason is that sorting happens after the query has been executed. The query matches blog posts that received comments in October, but it returns blog post documents as the result. If we didn't include the nested_filter clause, we would end up sorting based on any comments that the blog post has ever received, not just those received in October. (This confused me at first. Put differently: the query decides which blog posts are returned, while the nested_filter decides which comments count toward each post's sort value.)
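
A small Python sketch made this clearer to me. It only imitates the behavior with the two blog posts indexed above (reduced to date and stars per comment); the function names are my own and nothing here calls Elasticsearch.

```python
# post ID -> list of (comment date, stars), taken from the two posts above.
posts = {
    1: [("2014-09-01", 4), ("2014-10-22", 5)],
    2: [("2014-10-18", 1), ("2014-10-16", 2)],
}


def in_october(date):
    # ISO dates compare correctly as plain strings.
    return "2014-10-01" <= date < "2014-11-01"


def sort_key(comments, nested_filter=None):
    """Minimum stars over the comments that pass nested_filter (mode: min)."""
    stars = [s for d, s in comments if nested_filter is None or nested_filter(d)]
    return min(stars)


# The query part: keep posts that have at least one October comment.
matching = [pid for pid, cs in posts.items() if any(in_october(d) for d, _ in cs)]

with_filter = {pid: sort_key(posts[pid], in_october) for pid in matching}
without_filter = {pid: sort_key(posts[pid]) for pid in matching}
print(with_filter, without_filter)
```

Post 1 gets a sort value of 5 with the filter but 4 without it, because the 4-star comment from September sneaks in when the sort is not restricted to October.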

nested aggregations

GET /my_index/blogpost/_search?search_type=count
{
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "by_month": {
          "date_histogram": {
            "field":    "comments.date",
            "interval": "month",
            "format":   "yyyy-MM"
          },
          "aggs": {
            "avg_stars": {
              "avg": {
                "field": "comments.stars"
              }
            }
          }
        }
      }
    }
  }
}
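
What this aggregation computes can be imitated in a few lines of Python. This is a sketch over the four comments from the two posts above, not the real aggregation framework; the variable names are my own.

```python
from collections import defaultdict

# (date, stars) for every comment on blog posts 1 and 2 above.
comments = [
    ("2014-09-01", 4), ("2014-10-22", 5),
    ("2014-10-18", 1), ("2014-10-16", 2),
]

# Bucket comments by month, like date_histogram with format yyyy-MM.
by_month = defaultdict(list)
for date, stars in comments:
    by_month[date[:7]].append(stars)

# Average stars per bucket, like the avg_stars sub-aggregation.
avg_stars = {month: sum(s) / len(s) for month, s in sorted(by_month.items())}
print(avg_stars)
```

The buckets are built from comments, not from blog posts, which is why the nested aggregation has to come first.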

In the next request, reverse_nested steps back from the nested objects to the root object. Without reverse_nested, fields of the root object cannot be aggregated on from inside the nested aggregation. Here tags is a field of the root object.

GET /my_index/blogpost/_search?search_type=count
{
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "age_group": {
          "histogram": {
            "field":    "comments.age",
            "interval": 10
          },
          "aggs": {
            "blogposts": {
              "reverse_nested": {},
              "aggs": {
                "tags": {
                  "terms": {
                    "field": "tags"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
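
The same idea as a Python sketch, using the tags and commenter ages from the two posts above. It only mimics the three steps (nested, histogram, reverse_nested plus terms); the data layout and names are assumptions for illustration.

```python
from collections import Counter, defaultdict

posts = {
    1: {"tags": ["cash", "shares"], "ages": [28, 31]},
    2: {"tags": ["shares", "equities"], "ages": [42, 28]},
}

# Nested step: bucket comments by age (interval 10), remembering the parent post.
posts_per_bucket = defaultdict(set)
for post_id, post in posts.items():
    for age in post["ages"]:
        posts_per_bucket[age // 10 * 10].add(post_id)

# reverse_nested step: go back to the root posts, then count their tags.
tags_per_bucket = {
    bucket: Counter(tag for pid in ids for tag in posts[pid]["tags"])
    for bucket, ids in posts_per_bucket.items()
}
print(tags_per_bucket)
```

Each age bucket collects a set of post IDs, so a post is counted once per bucket no matter how many of its comments fall into that bucket.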

parent-child relationship

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch"
      }
    }
  }
}

finding parents by their children

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}
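
Conceptually, has_child returns the parents that have at least one matching child. A toy Python version is below. The branch and employee data are hypothetical (only Alice Smith appears in these notes), and the has_child function is my own sketch, not the Elasticsearch implementation.

```python
# Hypothetical data: branch IDs, and employees that point at their parent branch.
branches = {"london": {"country": "UK"}, "paris": {"country": "France"}}
employees = [
    {"name": "Alice Smith", "dob": "1970-10-24", "parent": "london"},
    {"name": "Bob Brown", "dob": "1987-05-11", "parent": "paris"},
]


def has_child(branches, employees, child_predicate):
    """Return the IDs of branches with at least one employee matching the predicate."""
    matching_parents = {e["parent"] for e in employees if child_predicate(e)}
    return sorted(b for b in branches if b in matching_parents)


# Same condition as the range query above; ISO dates compare correctly as strings.
born_since_1980 = has_child(branches, employees, lambda e: e["dob"] >= "1980-01-01")
print(born_since_1980)
```

The result contains branch IDs only; the employees that caused the match are not returned.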
GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":       "employee",
      "score_mode": "max",
      "query": {
        "match": {
          "name": "Alice Smith"
        }
      }
    }
  }
}

finding children by their parents

GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch",
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}

children aggregation

GET /company/branch/_search?search_type=count
{
  "aggs": {
    "country": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "employees": {
          "children": {
            "type": "employee"
          },
          "aggs": {
            "hobby": {
              "terms": {
                "field": "employee.hobby"
              }
            }
          }
        }
      }
    }
  }
}

grandparents and grandchildren

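The shard routing of the employee document would be decided by the parent ID (london), but the london document was routed to a shard by its own parent ID (uk). It is very likely that the grandchild would end up on a different shard from its parent and grandparent, which would prevent the same-shard parent-child mapping from functioning.

My question was: why? If the grandparent and the child are on the same shard, and the child and the grandchild are on the same shard, shouldn't co-location be transitive? The analysis below shows that the second premise is false: the child and the grandchild are not necessarily on the same shard, so neither are the grandparent and the grandchild.

Routing: shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID, or to the parent ID when a parent is specified.

Grandparent (uk): hash("uk")

Child (london, parent=uk): hash("uk")

Grandchild (employee, parent=london): hash("london"). The grandchild's shard therefore depends on hash("london"), which in general differs from hash("uk"); whether the two land on the same shard depends on the hash function and the number of shards. That is why routing=uk has to be added.

With routing=uk, the grandchild is routed by hash("uk") as well.

All three generations then live on the same shard.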

Instead, we need to add an extra routing parameter, set to the ID of the grandparent, to ensure that all three generations are indexed on the same shard. The indexing request should look like this:


PUT /company/employee/1?parent=london&routing=uk
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}
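
The routing argument can be checked with a short Python sketch. The shard formula is the one from the routing rule in this section; the CRC32 hash is only a stand-in (Elasticsearch uses its own hash function), and the shard count of 5 and the function name shard_for are assumptions.

```python
import zlib

NUM_SHARDS = 5  # assumed number of primary shards


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """shard = hash(routing) % number_of_primary_shards, with CRC32 as a stand-in hash."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


country_shard = shard_for("uk")               # grandparent: routed by its own ID
branch_shard = shard_for("uk")                # child: routed by parent=uk
default_employee_shard = shard_for("london")  # grandchild: routed by parent=london
routed_employee_shard = shard_for("uk")       # grandchild with routing=uk

print(country_shard, branch_shard, default_employee_shard, routed_employee_shard)
```

The first, second, and fourth values are always equal because they hash the same routing value. The third may or may not coincide with them, which is exactly why the explicit routing parameter is needed.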

practical considerations

Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

You can check how much memory is being used by the parent-child cache by consulting the indices-stats API (for a summary at the index level) or the node-stats API (for a summary at the node level):

GET /_nodes/stats/indices/id_cache?human 

multigenerations and concluding thoughts

The ability to join multiple generations (see Grandparents and Grandchildren) sounds attractive until you think of the costs involved:

  • The more joins you have, the worse performance will be.
  • Each generation of parents needs to have their string _id fields stored in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:

  • Use parent-child relationships sparingly (keep the hierarchy simple), and only when there are many more children than parents.
  • Avoid using multiple parent-child joins in a single query.
  • Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.
  • Keep the parent IDs short, so that they require less memory.

Above all: think about the other relationship techniques that we have discussed before reaching for parent-child.

In other words, prefer the first two kinds of relationship wherever possible.

