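ES Series: Nested Documents and Parent-Child Documents

Background

MySQL tables often have a one-to-many relationship, for example an orders table and a products table: one order can contain multiple products. The figure below shows their relationship.

Elasticsearch (ES for short) is not particularly good at handling this kind of relationship compared with a relational database, because ES, like most NoSQL databases, uses a flat storage structure. An index is a collection of independent documents, and different indices generally have no relationship to one another.

Still, ES has reached version 7.x and now offers several options that support one-to-many mappings efficiently.

The commonly used options are plain inner objects, nested documents, and parent-child documents. The latter two are the focus of this article.

The e-commerce data I use below ships with Kibana as sample data, so readers can easily try the examples themselves.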

How ES handles one-to-many relationships

Plain inner objects

Kibana's sample e-commerce data uses this approach. Let's look at its mapping.

"kibana_sample_data_ecommerce" : {
    "mappings" : {
      "properties" : {
        "category" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        },
        "currency" : {
          "type" : "keyword"
        },
        "customer_full_name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        // part omitted
       
        "products" : {
          "properties" : {
            "_id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "base_price" : {
              "type" : "half_float"
            },
            "base_unit_price" : {
              "type" : "half_float"
            },
            "category" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            },
            "created_on" : {
              "type" : "date"
            },
            "discount_amount" : {
              "type" : "half_float"
            },
            "discount_percentage" : {
              "type" : "half_float"
            },
            "manufacturer" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            },
            "min_price" : {
              "type" : "half_float"
            },
            "price" : {
              "type" : "half_float"
            },
            "product_id" : {
              "type" : "long"
            },
            "product_name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              },
              "analyzer" : "english"
            },
            "quantity" : {
              "type" : "integer"
            },
            "sku" : {
              "type" : "keyword"
            },
            "tax_amount" : {
              "type" : "half_float"
            },
            "taxful_price" : {
              "type" : "half_float"
            },
            "taxless_price" : {
              "type" : "half_float"
            },
            "unit_discount_amount" : {
              "type" : "half_float"
            }
          }
        },
        "sku" : {
          "type" : "keyword"
        },
        "taxful_total_price" : {
          "type" : "half_float"
        },
        // part omitted

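We can see that the e-commerce order index contains a products field of type object, with its own field properties. This is a containment relationship: one order can hold information about multiple products. Let's run a query and look at the result.

The query: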

POST kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  }
}

The result (I removed some content to make it easier to read):

"hits" : [
      {
        "_index" : "kibana_sample_data_ecommerce",
        "_type" : "_doc",
        "_id" : "VJz1f28BdseAsPClo7bC",
        "_score" : 1.0,
        "_source" : {
          "customer_first_name" : "Eddie",
          "customer_full_name" : "Eddie Underwood",
          "order_date" : "2020-01-27T09:28:48+00:00",
          "order_id" : 584677,
          "products" : [
            {
              "base_price" : 11.99,
              "discount_percentage" : 0,
              "quantity" : 1,
              "sku" : "ZO0549605496",
              "manufacturer" : "Elitelligence",
              "tax_amount" : 0,
              "product_id" : 6283
            },
            {
              "base_price" : 24.99,
              "discount_percentage" : 0,
              "quantity" : 1,
              "sku" : "ZO0299602996",
              "manufacturer" : "Oceanavigations",
              "tax_amount" : 0,
              "product_id" : 19400
            }
          ],
          "taxful_total_price" : 36.98,
          "taxless_total_price" : 36.98,
          "total_quantity" : 2,
          "total_unique_products" : 2,
          "type" : "order",
          "user" : "eddie",
          "geoip" : {
            "region_name" : "Cairo Governorate",
            "continent_name" : "Africa",
            "city_name" : "Cairo"
          }
        }
      },

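We can see that the returned products is actually a list containing two objects, which expresses a one-to-many relationship.

The advantage of this approach is obvious: since all the information lives in one document, ES does not need to join other documents at query time, so queries are very efficient. Does it have any drawbacks?

Of course it does. Using the same example, consider the following query: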

GET kibana_sample_data_ecommerce/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "products.base_price": 24.99 }},
        { "match": { "products.sku":"ZO0549605496"}},
        {"match": { "order_id": "584677"}}
      ]
    }
  }
}

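This search has three conditions: order_id, the product's price, and its sku. In fact no product satisfies the price and sku conditions at once (the product with sku=ZO0549605496 costs 11.99), yet the query returns one document. Why?

It turns out that ES flattens arrays of JSON objects when indexing them. The example above is stored in ES with a structure like this: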

{
  "order_id":            [ 584677 ],
  "products.base_price":    [ 11.99, 24.99... ],
  "products.sku": [ ZO0549605496, ZO0299602996 ],
  ...
}

Clearly, this structure loses the association between each product's price and its sku.
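To make the false match concrete, here is a minimal Python sketch that imitates the flattening described above. It is not how ES is implemented; the helpers flatten and matches_all are hypothetical names I made up for illustration, and the order is trimmed to the fields used in the query.

```python
from collections import defaultdict

def flatten(doc, prefix=""):
    """Collect every leaf value under its dotted field path,
    imitating how an array of plain objects is flattened."""
    fields = defaultdict(list)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields[sub_path].extend(sub_values)
            else:
                fields[path].append(item)
    return dict(fields)

def matches_all(flat_doc, conditions):
    """True if every condition value appears somewhere in its field's value list."""
    return all(value in flat_doc.get(field, []) for field, value in conditions.items())

order = {
    "order_id": 584677,
    "products": [
        {"base_price": 11.99, "sku": "ZO0549605496"},
        {"base_price": 24.99, "sku": "ZO0299602996"},
    ],
}

flat = flatten(order)
conditions = {
    "products.base_price": 24.99,
    "products.sku": "ZO0549605496",
    "order_id": 584677,
}

# Flattened view: each condition is checked against the whole value list.
false_match = matches_all(flat, conditions)

# Object-aware view: both product conditions must hold for the same product.
true_match = any(
    p["base_price"] == 24.99 and p["sku"] == "ZO0549605496"
    for p in order["products"]
)

print(flat)
print("flattened match:", false_match)
print("per-object match:", true_match)
```

The flattened check succeeds because 24.99 and ZO0549605496 both appear somewhere in the order, while the per-object check fails because they belong to different products.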

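If your business scenario is not sensitive to this problem, you can choose this approach, because it is simple and more efficient than the two options below.

Nested documents

Clearly, the object-array approach above does not preserve the boundaries of the inner objects: ES stores the JSON object array as a flat list of key-value pairs. To solve this, ES provides the nested type, which the official documentation describes as follows:

The nested type is a specialised version of the object datatype that allows arrays of objects to be indexed in a way that they can be queried independently of each other.

So nested documents are really a complement to plain inner objects. The e-commerce mapping above is too long, so I will switch to a simpler example that is enough to illustrate the point.

First, define a mapping for the index: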

PUT test_index
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested" 
      }
    }
  }
}

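The user property is of type nested, which marks it as a nested document. The other properties are not defined here; ES's dynamic mapping takes care of them.

Insert two documents: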

PUT test_index/_doc/1
{
  "group" : "root",
  "user" : [
    {
      "name" : "John",
      "age" :  30
    },
    {
      "name" : "Alice",
      "age" :  28
    }
  ]
}

PUT test_index/_doc/2
{
  "group" : "wheel",
  "user" : [
    {
      "name" : "Tom",
      "age" :  33
    },
    {
      "name" : "Jack",
      "age" :  25
    }
  ]
}

The query looks like this:

GET test_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.name": "Alice" }},
            { "match": { "user.age":  28 }} 
          ]
        }
      }
    }
  }
}

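Note that nested queries have their own syntax: you must specify the nested keyword and the path. Here is a more representative example, with conditions on both the parent document and the nested documents.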

GET test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "group": "root"
          }
        },
        {
          "nested": {
            "path": "user",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "user.name": "Alice"
                    }
                  },
                  {
                    "match": {
                      "user.age": 28
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
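If you build such queries from application code, the nesting gets repetitive. Below is a small Python sketch that assembles the same request body as a plain dict. The helper names nested_query and search_body are my own (hypothetical), and the sketch only builds the JSON; sending it to ES is left to whatever client you use.

```python
import json

def nested_query(path, conditions):
    """Build a nested clause whose conditions must all hold
    within a single nested object under `path`."""
    return {
        "nested": {
            "path": path,
            "query": {
                "bool": {
                    "must": [
                        {"match": {f"{path}.{field}": value}}
                        for field, value in conditions.items()
                    ]
                }
            },
        }
    }

def search_body(parent_conditions, path, nested_conditions):
    """Combine conditions on the parent document with a nested clause."""
    must = [{"match": {field: value}} for field, value in parent_conditions.items()]
    must.append(nested_query(path, nested_conditions))
    return {"query": {"bool": {"must": must}}}

body = search_body({"group": "root"}, "user", {"name": "Alice", "age": 28})
print(json.dumps(body, indent=2))
```

The printed body has the same shape as the query above: one match on group, plus a nested clause on the user path.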

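So far nested documents look quite handy: they do not lose object boundaries as the previous approach does, and they are not complicated to use. Do they have drawbacks? Of course. Let's run an experiment first.

First, check the index's current document count:

GET _cat/indices?v

The result: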

green  open   test_index                   FJsEIFf_QZW4Q4SlZBsqJg   1   1          6            0     17.7kb          8.8kb

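You may have noticed that I did not check the document count with

GET test_index/_count

but looked at the index information directly. I plan to cover the difference in a separate article; for now you only need to know that the _cat/indices output shows the real number of underlying documents.

Isn't it odd that the document count is 6 rather than 2? This is because each nested object is stored internally as a separate Lucene document (2 top-level documents plus 4 nested user objects). At query time ES joins them for us, so the result looks like a single document.

As you would expect, under the same conditions this performs worse than plain inner objects. In a real application, decide whether to use it based on your actual situation.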
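The arithmetic behind the 6 can be sketched in a few lines of Python. This is only a counting model under the assumption stated above (one Lucene document per top-level document plus one per nested object); lucene_doc_count is a hypothetical helper and it only looks at top-level nested fields.

```python
def lucene_doc_count(docs, nested_paths):
    """Count one Lucene document per top-level document,
    plus one for every object stored under a nested field."""
    total = 0
    for doc in docs:
        total += 1  # the top-level document itself
        for path in nested_paths:
            value = doc.get(path, [])
            objects = value if isinstance(value, list) else [value]
            total += len(objects)  # each nested object is its own Lucene document
    return total

docs = [
    {"group": "root", "user": [{"name": "John", "age": 30}, {"name": "Alice", "age": 28}]},
    {"group": "wheel", "user": [{"name": "Tom", "age": 33}, {"name": "Jack", "age": 25}]},
]

with_nested = lucene_doc_count(docs, nested_paths=["user"])
without_nested = lucene_doc_count(docs, nested_paths=[])
print(with_nested, without_nested)
```

With user declared as nested the model gives 6, matching the index listing; without any nested field it gives 2, which is what you would see if user were a plain inner object.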

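Parent-child documents

Take the example above again. If I need to update the document's group property, the whole document has to be reindexed. Even though the nested user objects do not need updating, they are reindexed along with the parent document.

In addition, if a table has one-to-many relationships with several tables, that is, a child document can belong to multiple parent documents, nested cannot express it.

Let's look at an example.

First, define the mapping as follows: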

PUT my_index
{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword"
      },
      "my_join_field": { 
        "type": "join",
        "relations": {
          "question": "answer" 
        }
      }
    }
  }
}

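my_join_field is the name of the field that holds the parent-child relationship; you can choose any name. The join type marks it as a parent-child relationship, and relations declares that question is the parent and answer is the child.

Insert two parent documents: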

PUT my_index/_doc/1
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": {
    "name": "question" 
  }
}


PUT my_index/_doc/2
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

"name": "question" indicates that the document being inserted is a parent document.

Then insert two child documents:

PUT my_index/_doc/3?routing=1
{
  "my_id": "3",
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", 
    "parent": "1" 
  }
}

PUT my_index/_doc/4?routing=1
{
  "my_id": "4",
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

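There is more to explain about child documents. First, the document ids tell us that child documents are independent documents (unlike nested). Second, the routing parameter specifies that the routing value is the id of parent document 1, the same id given by the parent keyword in the body.

It is worth stressing that routing is mandatory when indexing a child document, because the child must be stored on the same shard as its parent.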
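Why routing keeps parent and child together can be sketched with a simplified model: the shard is chosen by hashing the routing value (the document id when no routing is given) modulo the number of primary shards. The code below is an illustration only. crc32 is a stand-in for ES's real hash function, the shard count of 5 is an assumption, and shard_for is a hypothetical helper.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed shard count, for illustration only

def shard_for(doc_id, routing=None, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard from the routing value, falling back to the document id.

    crc32 is only a stand-in hash; Elasticsearch uses its own hash function.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards

# Parent document 1 is routed by its own id.
parent_shard = shard_for("1")

# Children 3 and 4 indexed with ?routing=1 hash the same key as the parent.
routed_children = {doc_id: shard_for(doc_id, routing="1") for doc_id in ("3", "4")}

# Without routing, each child would be placed by its own id instead.
unrouted_children = {doc_id: shard_for(doc_id) for doc_id in ("3", "4")}

print("parent 1 ->", parent_shard)
print("children with routing=1 ->", routed_children)
print("children without routing ->", unrouted_children)
```

Whatever the hash function is, children that reuse the parent's id as the routing value always hash the same key as the parent, so they are guaranteed to land on the parent's shard; children placed by their own ids have no such guarantee.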

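The name keyword here marks the document as a child (answer).

my_index now contains four independent documents. Let's see what searching parent-child documents looks like.

Start with an unconditional query: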

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["my_id"]
}

The result (partial):

{
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_routing" : "1",
        "_source" : {
          "my_id" : "3",
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        },

The result includes the my_join_field field, which tells us whether a document is a parent or a child.

Has Child query: returns parent documents

POST my_index/_search
{
  "query": {
    "has_child": {
      "type": "answer",
      "query" : {
                "match": {
                    "text" : "answer"
                }
            }
    }
  }
}

The result (partial):

"hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "my_id" : "1",
          "text" : "This is a question",
          "my_join_field" : {
            "name" : "question"
          }
        }
      }
    ]

Has Parent query: returns the related child documents

POST my_index/_search
{
  "query": {
    "has_parent": {
      "parent_type": "question",
      "query" : {
                "match": {
                    "text" : "question"
                }
            }
    }
  }
}

The result (partial):

 "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "my_id" : "3",
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_routing" : "1",
        "_source" : {
          "my_id" : "4",
          "text" : "This is another answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      }
    ]

Parent Id query: returns child documents

POST my_index/_search
{
  "query": {
    "parent_id": { 
      "type": "answer",
      "id": "1"
    }
  }
}

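The result is essentially the same as above. The difference is that the parent_id query uses relevance scoring by default, whereas has_parent does not score by default.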
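To keep the three queries apart, it can help to think of them as joins over the four documents in my_index. The Python sketch below models only which document ids come back, not scoring or how ES executes the queries. The field is shortened to join, the matching is a naive whole-word check, and all function names are hypothetical.

```python
docs = [
    {"_id": "1", "text": "This is a question", "join": {"name": "question"}},
    {"_id": "2", "text": "This is another question", "join": {"name": "question"}},
    {"_id": "3", "text": "This is an answer", "join": {"name": "answer", "parent": "1"}},
    {"_id": "4", "text": "This is another answer", "join": {"name": "answer", "parent": "1"}},
]

def text_matches(doc, term):
    """Naive stand-in for a match query: whole-word, case-insensitive."""
    return term in doc["text"].lower().split()

def has_child(docs, child_type, term):
    """Return ids of parents that have at least one matching child."""
    parent_ids = {
        d["join"]["parent"]
        for d in docs
        if d["join"]["name"] == child_type and text_matches(d, term)
    }
    return [d["_id"] for d in docs if d["_id"] in parent_ids]

def has_parent(docs, parent_type, term):
    """Return ids of children whose parent matches."""
    parent_ids = {
        d["_id"]
        for d in docs
        if d["join"]["name"] == parent_type and text_matches(d, term)
    }
    return [d["_id"] for d in docs if d["join"].get("parent") in parent_ids]

def parent_id(docs, child_type, pid):
    """Return ids of children of the given type attached to one parent id."""
    return [
        d["_id"]
        for d in docs
        if d["join"]["name"] == child_type and d["join"].get("parent") == pid
    ]

print(has_child(docs, "answer", "answer"))
print(has_parent(docs, "question", "question"))
print(parent_id(docs, "answer", "1"))
```

The three calls mirror the three queries above: has_child yields parent 1, while has_parent and parent_id both yield children 3 and 4.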

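There are a few points to watch when using parent-child documents:

Each index can define only one join field
Parent and child documents must live on the same shard, which means queries and updates need routing
New relations can be added to an existing join field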
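As a sketch of the last point, the snippet below computes an extended relations definition for the join field from the mapping above. The comment relation is a made-up example, add_child_relation is a hypothetical helper, and I am assuming the resulting body would be sent as a mapping update for my_index; check the join field documentation for your version before relying on it.

```python
def add_child_relation(relations, parent, child):
    """Return a copy of a join field's relations with `child` added under `parent`.

    A parent with one child is written as a string, several children as a list.
    """
    updated = dict(relations)
    existing = updated.get(parent, [])
    children = [existing] if isinstance(existing, str) else list(existing)
    if child not in children:
        children.append(child)
    updated[parent] = children[0] if len(children) == 1 else children
    return updated

relations = {"question": "answer"}  # from the mapping defined earlier
new_relations = add_child_relation(relations, "question", "comment")  # "comment" is hypothetical

# Body of the assumed mapping update request.
mapping_update = {
    "properties": {
        "my_join_field": {"type": "join", "relations": new_relations}
    }
}
print(mapping_update)
```

The existing question/answer relation is kept and the new child type is added next to it, so documents already indexed stay valid.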

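In short, nested objects improve query performance through data redundancy and suit read-heavy, write-light scenarios. Parent-child documents resemble relationships in a relational database and suit write-heavy scenarios, because they narrow the scope of each document modification.

Summary

Plain inner objects can model a one-to-many relationship, but the boundaries of the inner objects are lost, so the association between an inner object's properties is lost.
Nested objects solve that problem but have two drawbacks: updating the parent document reindexes everything, and a child document cannot belong to multiple parent documents.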
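Parent-child documents address the update problem, since parents and children are indexed independently, but they suit write-heavy, read-light scenarios. Note that with the join field each child document still references exactly one parent, so the multiple-parent case is not covered either.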

References:

* Elasticsearch official documentation
