handling relationships
application-side joins
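In short, Elasticsearch does not support joins. You can still model simple relations and emulate a join in the application by querying twice: first look up the related documents, then use their IDs in a second query.
denormalizing your data
In short, the recommended way to handle relationships between documents is a suitable amount of data redundancy. The advantage of data denormalization is speed: each document contains all the information it needs, so no join is required at query time.
PUT /my_index/user/1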
{
"name": "John Smith",
"email": "[email protected]",
"dob": "1970/10/24"
}
PUT /my_index/blogpost/2
{
"title": "Relationships",
"body": "It's complicated...",
"user": {
"id": 1,
"name": "John Smith"
}
}
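The difference between an application-side join and a denormalized lookup can be sketched in plain Python. This is only an in-memory illustration, not Elasticsearch client code: the dictionaries stand in for the two document types, and the function names are hypothetical.

```python
# In-memory stand-ins for the user and blogpost documents (no server involved).
users = {
    1: {"name": "John Smith", "dob": "1970/10/24"},
}
blogposts = {
    2: {"title": "Relationships", "body": "It's complicated...",
        "user": {"id": 1, "name": "John Smith"}},
}


def posts_by_user_name_join(name):
    """Application-side join: two lookups, glued together in the application."""
    # Lookup 1: find the IDs of users with this name.
    user_ids = {uid for uid, user in users.items() if user["name"] == name}
    # Lookup 2: find blog posts whose user.id is one of those IDs.
    return [post["title"] for post in blogposts.values()
            if post["user"]["id"] in user_ids]


def posts_by_user_name_denormalized(name):
    """Denormalized lookup: the user's name is copied into every blog post."""
    return [post["title"] for post in blogposts.values()
            if post["user"]["name"] == name]


print(posts_by_user_name_join("John Smith"))          # ['Relationships']
print(posts_by_user_name_denormalized("John Smith"))  # ['Relationships']
```

The denormalized version needs a single pass, at the cost of having to update every blog post if the user's name changes.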
field collapsing
In short, field collapsing folds the data into groups, using the JSON structure to partition it.
denormalization and concurrency
nested objects
PUT /my_index/blogpost/1
{
"title": "Nest eggs",
"body": "Making your money work...",
"tags": [ "cash", "shares" ],
"comments": [
{
"name": "John Smith",
"comment": "Great article",
"age": 28,
"stars": 4,
"date": "2014-09-01"
},
{
"name": "Alice White",
"comment": "More like this please",
"age": 31,
"stars": 5,
"date": "2014-10-22"
}
]
}
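The following query matches the document, even though no single comment was written by an Alice aged 28. The reason is that the inner objects are flattened at index time, which destroys the association between the values of each comment: the index contains the term alice and the value 28, but the link between them is lost.
GET /_search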
{
"query": {
"bool": {
"must": [
{ "match": { "name": "Alice" }},
{ "match": { "age": 28 }}
]
}
}
}
Internally, the blog post is indexed as a single flattened document like this:
{
"title": [ eggs, nest ],
"body": [ making, money, work, your ],
"tags": [ cash, shares ],
"comments.name": [ alice, john, smith, white ],
"comments.comment": [ article, great, like, more, please, this ],
"comments.age": [ 28, 31 ],
"comments.stars": [ 4, 5 ],
"comments.date": [ 2014-09-01, 2014-10-22 ]
}
When the comments field is mapped as nested instead, each comment is indexed as a separate hidden document, alongside the root document:
{
"comments.name": [ john, smith ],
"comments.comment": [ article, great ],
"comments.age": [ 28 ],
"comments.stars": [ 4 ],
"comments.date": [ 2014-09-01 ]
}
{
"comments.name": [ alice, white ],
"comments.comment": [ like, more, please, this ],
"comments.age": [ 31 ],
"comments.stars": [ 5 ],
"comments.date": [ 2014-10-22 ]
}
{
"title": [ eggs, nest ],
"body": [ making, money, work, your ],
"tags": [ cash, shares ]
}
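The difference between the flattened and the nested representation can be mimicked in a few lines of Python. This is a simplified sketch: the tokens helper is a crude stand-in for an analyzer, and both matching functions are hypothetical names used only for illustration.

```python
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]


def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches_flattened(comments, name_term, age):
    """Object mapping: all comments are merged into multi-value fields."""
    all_names = {tok for c in comments for tok in tokens(c["name"])}
    all_ages = {c["age"] for c in comments}
    return name_term in all_names and age in all_ages


def matches_nested(comments, name_term, age):
    """Nested mapping: both conditions must hold within the same comment."""
    return any(name_term in tokens(c["name"]) and c["age"] == age
               for c in comments)


print(matches_flattened(comments, "alice", 28))  # True: cross-object match
print(matches_nested(comments, "alice", 28))     # False
print(matches_nested(comments, "john", 28))      # True
```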
PUT /my_index
{
"mappings": {
"blogpost": {
"properties": {
"comments": {
"type": "nested",
"properties": {
"name": { "type": "string" },
"comment": { "type": "string" },
"age": { "type": "short" },
"stars": { "type": "short" },
"date": { "type": "date" }
}
}
}
}
}
}
Because nested objects are indexed as separate hidden documents, we can’t query them directly. Instead, we have to use the nested query or nested filter to access them:
GET /my_index/blogpost/_search
{
"query": {
"bool": {
"must": [
{ "match": { "title": "eggs" }},
{
"nested": {
"path": "comments",
"query": {
"bool": {
"must": [
{ "match": { "comments.name": "john" }},
{ "match": { "comments.age": 28 }}
]
}}}}
]
}}}
sorting by nested fields
PUT /my_index/blogpost/2
{
"title": "Investment secrets",
"body": "What they don't tell you ...",
"tags": [ "shares", "equities" ],
"comments": [
{
"name": "Mary Brown",
"comment": "Lies, lies, lies",
"age": 42,
"stars": 1,
"date": "2014-10-18"
},
{
"name": "John Smith",
"comment": "You're making it up!",
"age": 28,
"stars": 2,
"date": "2014-10-16"
}
]
}
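With the two blog posts above, the effect of limiting a sort to a subset of the nested comments can be sketched in plain Python. This is an in-memory approximation under simplifying assumptions, not how Elasticsearch is implemented: search, in_october and sort_filter are hypothetical names, and only the stars and date fields are kept.

```python
posts = [
    {"title": "Nest eggs", "comments": [
        {"stars": 4, "date": "2014-09-01"},
        {"stars": 5, "date": "2014-10-22"},
    ]},
    {"title": "Investment secrets", "comments": [
        {"stars": 1, "date": "2014-10-18"},
        {"stars": 2, "date": "2014-10-16"},
    ]},
]


def in_october(comment):
    # ISO-formatted dates compare correctly as plain strings.
    return "2014-10-01" <= comment["date"] < "2014-11-01"


def search(posts, comment_filter, sort_filter=None):
    """Return (title, sort value) pairs, ascending by minimum stars."""
    hits = []
    for post in posts:
        # Query phase: keep the post if any of its comments matches.
        if not any(comment_filter(c) for c in post["comments"]):
            continue
        # Sort phase: choose which comments feed the sort value.
        candidates = [c for c in post["comments"]
                      if sort_filter is None or sort_filter(c)]
        # Posts without any candidate comment sort last.
        value = min((c["stars"] for c in candidates), default=float("inf"))
        hits.append((post["title"], value))
    return sorted(hits, key=lambda hit: hit[1])


print(search(posts, in_october, sort_filter=in_october))
# [('Investment secrets', 1), ('Nest eggs', 5)]
print(search(posts, in_october))
# [('Investment secrets', 1), ('Nest eggs', 4)]
```

Without the sort filter, "Nest eggs" is sorted by its September comment (4 stars), even though the query only asked about October.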
The query below finds blog posts that received comments in October and sorts them on comments.stars, in ascending order ("order": "asc"), using the minimum value per blog post ("mode": "min"). The nested_filter in the sort clause is the same as the nested query in the main query clause.
GET /_search
{
  "query": {
    "nested": {
      "path": "comments",
      "filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt": "2014-11-01"
          }
        }
      }
    }
  },
  "sort": {
    "comments.stars": {
      "order": "asc",
      "mode": "min",
      "nested_filter": {
        "range": {
          "comments.date": {
            "gte": "2014-10-01",
            "lt": "2014-11-01"
          }
        }
      }
    }
  }
}
Why do we need to repeat the query conditions in the nested_filter?
The reason is that sorting happens after the query has been executed. The query matches blog posts that received comments in October, but it returns blog post documents as the result. If we didn’t include the nested_filter clause, we would end up sorting based on any comments that the blog post has ever received, not just those received in October.
(This confused me at first. Put differently: the query only decides which blog posts are returned, while nested_filter decides which of each post's comments are used to compute its sort value.)
nested aggregations
GET /my_index/blogpost/_search?search_type=count
{
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"by_month": {
"date_histogram": {
"field": "comments.date",
"interval": "month",
"format": "yyyy-MM"
},
"aggs": {
"avg_stars": {
"avg": {
"field": "comments.stars"
}
}
}
}
}
}
}
}
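What this aggregation computes can be approximated in plain Python over the comments of the two blog posts indexed earlier. This is a rough sketch of the bucketing logic only; the function name is hypothetical and the month key is simply the first seven characters of the ISO date.

```python
from collections import defaultdict

posts = [
    {"title": "Nest eggs", "comments": [
        {"stars": 4, "date": "2014-09-01"},
        {"stars": 5, "date": "2014-10-22"},
    ]},
    {"title": "Investment secrets", "comments": [
        {"stars": 1, "date": "2014-10-18"},
        {"stars": 2, "date": "2014-10-16"},
    ]},
]


def avg_stars_by_month(posts):
    """Bucket every nested comment by month, then average the stars per bucket."""
    stars_by_month = defaultdict(list)
    for post in posts:
        for comment in post["comments"]:
            month = comment["date"][:7]  # "yyyy-MM", like the format option
            stars_by_month[month].append(comment["stars"])
    return {month: sum(stars) / len(stars)
            for month, stars in sorted(stars_by_month.items())}


for month, average in avg_stars_by_month(posts).items():
    print(month, round(average, 2))
```

Note that the buckets are built from comments, not from blog posts: the October bucket contains three comments coming from two different posts.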
In the request below, reverse_nested steps back from the nested comment objects to the root blog post. Without reverse_nested, fields of the root object cannot be aggregated. The tags field used by the terms aggregation is a field of the root object.
GET /my_index/blogpost/_search?search_type=count
{
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "age_group": {
          "histogram": {
            "field": "comments.age",
            "interval": 10
          },
          "aggs": {
            "blogposts": {
              "reverse_nested": {},
              "aggs": {
                "tags": {
                  "terms": {
                    "field": "tags"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
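The idea of bucketing on a nested field and then stepping back to the root document can be sketched as follows. This is an in-memory illustration using the commenter ages and tags of the two blog posts indexed earlier; the data layout and the function name are simplifications chosen for this example.

```python
from collections import Counter, defaultdict

# Simplified view of the two blog posts: root-level tags plus commenter ages.
posts = [
    {"id": 1, "tags": ["cash", "shares"], "comment_ages": [28, 31]},
    {"id": 2, "tags": ["shares", "equities"], "comment_ages": [42, 28]},
]


def tags_by_age_group(posts, interval=10):
    """Histogram over nested ages, then count root-level tags per bucket."""
    # Nested level: assign each comment to an age bucket and remember
    # which root document (blog post) it belongs to.
    posts_by_bucket = defaultdict(set)
    for post in posts:
        for age in post["comment_ages"]:
            bucket = (age // interval) * interval
            posts_by_bucket[bucket].add(post["id"])

    # Root level (the reverse_nested step): each blog post counts once
    # per bucket, and its root-level tags are tallied.
    tags_by_id = {post["id"]: post["tags"] for post in posts}
    result = {}
    for bucket, post_ids in sorted(posts_by_bucket.items()):
        counts = Counter()
        for post_id in post_ids:
            counts.update(tags_by_id[post_id])
        result[bucket] = dict(counts)
    return result


for bucket, tags in tags_by_age_group(posts).items():
    print(bucket, tags)
```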
parent-child relationship
PUT /company
{
"mappings": {
"branch": {},
"employee": {
"_parent": {
"type": "branch"
}
}
}
}
finding parents by their children
GET /company/branch/_search
{
"query": {
"has_child": {
"type": "employee",
"query": {
"range": {
"dob": {
"gte": "1980-01-01"
}
}
}
}
}
}
GET /company/branch/_search
{
"query": {
"has_child": {
"type": "employee",
"score_mode": "max",
"query": {
"match": {
"name": "Alice Smith"
}
}
}
}
}
finding children by their parents
GET /company/employee/_search
{
"query": {
"has_parent": {
"type": "branch",
"query": {
"match": {
"country": "UK"
}
}
}
}
}
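The semantics of has_child and has_parent can be illustrated with a small in-memory sketch. The branch and employee records below are hypothetical sample data (only Alice Smith appears in these notes), and the two functions are illustrative helpers, not Elasticsearch APIs.

```python
# Hypothetical sample data: parents (branches) keyed by ID, children (employees)
# pointing at their parent through a "parent" key.
branches = {
    "london": {"name": "London branch", "country": "UK"},
    "paris": {"name": "Paris branch", "country": "France"},
}
employees = [
    {"name": "Alice Smith", "dob": "1970-10-24", "parent": "london"},
    {"name": "Bob Jones", "dob": "1985-03-12", "parent": "paris"},
]


def has_child(child_predicate):
    """Return the IDs of parents that have at least one matching child."""
    return sorted({e["parent"] for e in employees if child_predicate(e)})


def has_parent(parent_predicate):
    """Return the names of children whose parent matches."""
    return [e["name"] for e in employees
            if parent_predicate(branches[e["parent"]])]


# Branches with at least one employee born on or after 1980-01-01.
print(has_child(lambda e: e["dob"] >= "1980-01-01"))   # ['paris']
# Employees who work in a UK branch.
print(has_parent(lambda b: b["country"] == "UK"))      # ['Alice Smith']
```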
children aggregation
GET /company/branch/_search?search_type=count
{
"aggs": {
"country": {
"terms": {
"field": "country"
},
"aggs": {
"employees": {
"children": {
"type": "employee"
},
"aggs": {
"hobby": {
"terms": {
"field": "employee.hobby"
}
}
}
}
}
}
}
}
grandparents and grandchildren
The shard routing of the employee document would be decided by the parent ID (london), but the london document was routed to a shard by its own parent ID (uk).
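It is very likely that the grandchild would end up on a different shard from its parent and grandparent, which would prevent the same-shard parent-child mapping from functioning.
(Why? I first assumed the relationship would be transitive: the grandparent and the child share a shard, the child and the grandchild share a shard, so all three do. The analysis below shows that the child and the grandchild are not guaranteed to share a shard, so neither are the grandparent and the grandchild.)
routing: shard = hash(routing value) % number of shards, where the routing value defaults to the document ID, or to the parent ID when a parent is specified
grandparent (uk): hash("uk")
child (london, parent=uk): hash("uk")
grandchild (employee, parent=london): hash("london")
The grandchild's shard therefore depends on hash("london"), which is not guaranteed to land on the same shard as hash("uk"); whether it does depends on the hash function and the number of shards. That is why routing=uk has to be added.
grandchild with routing=uk: hash("uk")
All three generations then live on the same shard.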
Instead, we need to add an extra routing parameter, set to the ID of the grandparent, to ensure that all three generations are indexed on the same shard. The indexing request should look like this:
PUT /company/employee/1?parent=london&routing=uk
{
"name": "Alice Smith",
"dob": "1970-10-24",
"hobby": "hiking"
}
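The routing argument can be checked with a small Python sketch. It is only an approximation: zlib.crc32 is a deterministic stand-in for Elasticsearch's own hash function, the shard count of 5 is an arbitrary assumption, and shard_for and shard_of_document are hypothetical helper names.

```python
import zlib

NUMBER_OF_SHARDS = 5  # assumed value, for illustration only


def shard_for(routing_value, number_of_shards=NUMBER_OF_SHARDS):
    """shard = hash(routing value) % number of shards (crc32 as a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


def shard_of_document(doc_id, parent=None, routing=None):
    """Routing value precedence: explicit routing, then parent ID, then own ID."""
    return shard_for(routing or parent or doc_id)


country = shard_of_document("uk")                        # grandparent
branch = shard_of_document("london", parent="uk")        # child
employee_default = shard_of_document("1", parent="london")
employee_routed = shard_of_document("1", parent="london", routing="uk")

# The branch always follows its country, and the explicitly routed employee too.
print(country == branch == employee_routed)              # True
# Without routing, the employee follows hash("london"), which may or may not
# coincide with the shard chosen for "uk".
print(employee_default == shard_for("london"))           # True
```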
practical considerations
Parent-child queries can be 5 to 10 times slower than the equivalent nested query!
You can check how much memory is being used by the parent-child cache by consulting the indices-stats API (for a summary at the index level) or the node-stats API (for a summary at the node level):
GET /_nodes/stats/indices/id_cache?human
multigenerations and concluding thoughts
The ability to join multiple generations (see Grandparents and Grandchildren) sounds attractive until you think of the costs involved:
- The more joins you have, the worse performance will be.
- Each generation of parents needs to have their string _id fields stored in memory, which can consume a lot of RAM.
As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:
- Use parent-child relationships sparingly (keep the hierarchy shallow), and only when there are many more children than parents.
- Avoid using multiple parent-child joins in a single query.
- Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.
- Keep the parent IDs short, so that they require less memory.
Above all: think about the other relationship techniques that we have discussed before reaching for parent-child.
In short: prefer the relationship techniques covered earlier over parent-child whenever possible.