Presto query memory optimization: mitigating out-of-memory symptoms

Prerequisites

  • Hive v1 bucketing table: a bucketed table using bucketing version 1 (v2 has not been tested; Presto's support for Hive 3.x is still in progress)

Other connectors for data sources that support bucketing would need to implement Presto-specific methods:
@david: Assuming it’s hashing as in Hive, and two tables bucketed the same way are compatible, then that could in theory be implemented in the Kudu connector.
The connector needs to expose the bucketing and splits to the engine in a specific way.
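
For reference, this is roughly what a version-1 bucketed table looks like when created from the Hive side (a sketch with illustrative table and column names; bucketing_version is the Hive 3 table property that pins the v1 hashing scheme):

-- Hive DDL (illustrative): a bucketing-version-1 table
CREATE TABLE orders_bucketed (
    orderid BIGINT,
    comment STRING
)
CLUSTERED BY (orderid) INTO 13 BUCKETS
STORED AS ORC
TBLPROPERTIES ('bucketing_version' = '1');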


How it works

This optimization relies on Presto's Grouped Execution feature.

Take two tables (orders, orders_item) bucketed on the same column (orderid) with the same number of buckets.
When they are joined on orderid, rows with the same orderid land in the bucket with the same id in both tables, so each pair of buckets can be joined and aggregated independently (compare the partition step in MapReduce).

Memory usage can then be capped by controlling how many buckets are processed in parallel.

Theoretical memory usage: memory usage after optimization = original memory usage / number of buckets per table × number of buckets processed in parallel.
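
For instance, with illustrative numbers: a join that originally needs 10 GB in memory, run against tables split into 20 buckets, should need roughly 10 GB / 20 × 2 = 1 GB when 2 buckets are processed at a time (the same numbers reappear in the usage scenarios below).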


Test environment

  • Ubuntu 14.04
  • PrestoSQL-317
  • Hive connector (Hive 3.1)
  • TPCH connector

Test steps

Use Hive as the default catalog, so the hive. prefix can be omitted.
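
One way to set this up is to start the Presto CLI with the catalog and schema preset (the server address is an assumption; the schema name test matches the plans below):

presto --server localhost:8080 --catalog hive --schema test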

1 Create the tables

-- copy data into Hive
create table orders as select * from tpch.sf1.orders;

-- drop table test_grouped_join1;
CREATE TABLE test_grouped_join1
WITH (bucket_count = 13, bucketed_by = ARRAY['key1']) as
SELECT orderkey key1, comment value1 FROM orders;

-- drop table test_grouped_join2;
CREATE TABLE test_grouped_join2
WITH (bucket_count = 13, bucketed_by = ARRAY['key2']) as
SELECT orderkey key2, comment value2 FROM orders;

-- drop table test_grouped_join3;
CREATE TABLE test_grouped_join3
WITH (bucket_count = 13, bucketed_by = ARRAY['key3']) as
SELECT orderkey key3, comment value3 FROM orders;
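
To confirm that the bucketing properties took effect, inspect the generated table definition (output details vary by version):

-- verify bucket_count and bucketed_by
SHOW CREATE TABLE test_grouped_join1;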

2 Test without Grouped Execution

-- the defaults
set session colocated_join=false;
set session grouped_execution=false;

-- view the distributed execution plan
-- explain analyze
explain (TYPE DISTRIBUTED)
SELECT key1, value1, key2, value2, key3, value3
FROM test_grouped_join1
JOIN test_grouped_join2
ON key1 = key2
JOIN test_grouped_join3
ON key2 = key3

Execution plan output (very long; can be skipped):

Fragment 0 [SINGLE]
    Output layout: [key1, value1, key1, value2, key1, value3]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Output[key1, value1, key2, value2, key3, value3]
    │   Layout: [key1:bigint, value1:varchar(79), key1:bigint, value2:varchar(79), key1:bigint, value3:varchar(79)]
    │   Estimates: {rows: 1500000 (268.28MB), cpu: 1.85G, memory: 204.60MB, network: 447.13MB}
    │   key2 := key1
    │   key3 := key1
    └─ RemoteSource[1]
           Layout: [key1:bigint, value1:varchar(79), value2:varchar(79), value3:varchar(79)]

Fragment 1 [hive:buckets=13, hiveTypes=[bigint]]
    Output layout: [key1, value1, value2, value3]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    InnerJoin[("key1" = "key3")][$hashvalue, $hashvalue_34]
    │   Layout: [key1:bigint, value1:varchar(79), value2:varchar(79), value3:varchar(79)]
    │   Estimates: {rows: 1500000 (242.53MB), cpu: 1.85G, memory: 204.60MB, network: 204.60MB}
    │   Distribution: PARTITIONED
    ├─ InnerJoin[("key1" = "key2")][$hashvalue, $hashvalue_31]
    │  │   Layout: [key1:bigint, value1:varchar(79), $hashvalue:bigint, value2:varchar(79)]
    │  │   Estimates: {rows: 1500000 (178.85MB), cpu: 971.52M, memory: 102.30MB, network: 102.30MB}
    │  │   Distribution: PARTITIONED
    │  ├─ ScanProject[table = hive:test:test_grouped_join1 bucket=13, grouped = false]
    │  │      Layout: [key1:bigint, value1:varchar(79), $hashvalue:bigint]
    │  │      Estimates: {rows: 1500000 (102.30MB), cpu: 89.43M, memory: 0B, network: 0B}/{rows: 1500000 (102.30MB), cpu: 191.73M, memory: 0B, network: 0B}
    │  │      $hashvalue := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key1"), 0))
    │  │      key1 := key1:bigint:0:REGULAR
    │  │      value1 := value1:varchar(79):1:REGULAR
    │  └─ LocalExchange[HASH][$hashvalue_31] ("key2")
    │     │   Layout: [key2:bigint, value2:varchar(79), $hashvalue_31:bigint]
    │     │   Estimates: {rows: 1500000 (102.30MB), cpu: 396.33M, memory: 0B, network: 102.30MB}
    │     └─ RemoteSource[2]
    │            Layout: [key2:bigint, value2:varchar(79), $hashvalue_32:bigint]
    └─ LocalExchange[HASH][$hashvalue_34] ("key3")
       │   Layout: [key3:bigint, value3:varchar(79), $hashvalue_34:bigint]
       │   Estimates: {rows: 1500000 (102.30MB), cpu: 396.33M, memory: 0B, network: 102.30MB}
       └─ RemoteSource[3]
              Layout: [key3:bigint, value3:varchar(79), $hashvalue_35:bigint]

Fragment 2 [hive:buckets=13, hiveTypes=[bigint]]
    Output layout: [key2, value2, $hashvalue_33]
    Output partitioning: hive:buckets=13, hiveTypes=[bigint] [key2]
    Stage Execution Strategy: UNGROUPED_EXECUTION
    ScanProject[table = hive:test:test_grouped_join2 bucket=13, grouped = false]
        Layout: [key2:bigint, value2:varchar(79), $hashvalue_33:bigint]
        Estimates: {rows: 1500000 (102.30MB), cpu: 89.43M, memory: 0B, network: 0B}/{rows: 1500000 (102.30MB), cpu: 191.73M, memory: 0B, network: 0B}
        $hashvalue_33 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key2"), 0))
        key2 := key2:bigint:0:REGULAR
        value2 := value2:varchar(79):1:REGULAR

Fragment 3 [hive:buckets=13, hiveTypes=[bigint]]
    Output layout: [key3, value3, $hashvalue_36]
    Output partitioning: hive:buckets=13, hiveTypes=[bigint] [key3]
    Stage Execution Strategy: UNGROUPED_EXECUTION
    ScanProject[table = hive:test:test_grouped_join3 bucket=13, grouped = false]
        Layout: [key3:bigint, value3:varchar(79), $hashvalue_36:bigint]
        Estimates: {rows: 1500000 (102.30MB), cpu: 89.43M, memory: 0B, network: 0B}/{rows: 1500000 (102.30MB), cpu: 191.73M, memory: 0B, network: 0B}
        $hashvalue_36 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key3"), 0))
        key3 := key3:bigint:0:REGULAR
        value3 := value3:varchar(79):1:REGULAR

3 Test with Grouped Execution

set session colocated_join=true;
set session grouped_execution=true;
-- number of buckets processed in parallel: 0 means process all at once
set session concurrent_lifespans_per_task=1;
-- leave this property at its default; its effect is not explained here
set session dynamic_schedule_for_grouped_execution=false;
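
Before running the query, the session values can be double-checked; SHOW SESSION lists all session properties with their current and default values:

-- verify the four properties above
SHOW SESSION;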

-- view the execution plan
-- explain (TYPE DISTRIBUTED)
explain analyze
SELECT key1, value1, key2, value2, key3, value3
FROM test_grouped_join1
JOIN test_grouped_join2
ON key1 = key2
JOIN test_grouped_join3
ON key2 = key3

Execution plan output (very long; can be skipped):

Fragment 0 [SINGLE]
    Output layout: [key1, value1, key1, value2, key1, value3]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Output[key1, value1, key2, value2, key3, value3]
    │   Layout: [key1:bigint, value1:varchar(79), key1:bigint, value2:varchar(79), key1:bigint, value3:varchar(79)]
    │   Estimates: {rows: 1500000 (268.28MB), cpu: 1.65G, memory: 204.60MB, network: 242.53MB}
    │   key2 := key1
    │   key3 := key1
    └─ RemoteSource[1]
           Layout: [key1:bigint, value1:varchar(79), value2:varchar(79), value3:varchar(79)]

Fragment 1 [hive:buckets=13, hiveTypes=[bigint]]
    Output layout: [key1, value1, value2, value3]
    Output partitioning: SINGLE []
    Stage Execution Strategy: FIXED_LIFESPAN_SCHEDULE_GROUPED_EXECUTION
    InnerJoin[("key1" = "key3")][$hashvalue, $hashvalue_33]
    │   Layout: [key1:bigint, value1:varchar(79), value2:varchar(79), value3:varchar(79)]
    │   Estimates: {rows: 1500000 (242.53MB), cpu: 1.65G, memory: 204.60MB, network: 0B}
    │   Distribution: PARTITIONED
    ├─ InnerJoin[("key1" = "key2")][$hashvalue, $hashvalue_31]
    │  │   Layout: [key1:bigint, value1:varchar(79), $hashvalue:bigint, value2:varchar(79)]
    │  │   Estimates: {rows: 1500000 (178.85MB), cpu: 869.21M, memory: 102.30MB, network: 0B}
    │  │   Distribution: PARTITIONED
    │  ├─ ScanProject[table = hive:test:test_grouped_join1 bucket=13, grouped = true]
    │  │      Layout: [key1:bigint, value1:varchar(79), $hashvalue:bigint]
    │  │      Estimates: {rows: 1500000 (102.30MB), cpu: 89.43M, memory: 0B, network: 0B}/{rows: 1500000 (102.30MB), cpu: 191.73M, memory: 0B, network: 0B}
    │  │      $hashvalue := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key1"), 0))
    │  │      key1 := key1:bigint:0:REGULAR
    │  │      value1 := value1:varchar(79):1:REGULAR
    │  └─ LocalExchange[HASH][$hashvalue_31] ("key2")
    │     │   Layout: [key2:bigint, value2:varchar(79), $hashvalue_31:bigint]
    │     │   Estimates: {rows: 1500000 (102.30MB), cpu: 294.03M, memory: 0B, network: 0B}
    │     └─ ScanProject[table = hive:test:test_grouped_join2 bucket=13, grouped = true]
    │            Layout: [key2:bigint, value2:varchar(79), $hashvalue_32:bigint]
    │            Estimates: {rows: 1500000 (102.30MB), cpu: 89.43M, memory: 0B, network: 0B}/{rows: 1500000 (102.30MB), cpu: 191.73M, memory: 0B, network: 0B}
    │            $hashvalue_32 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key2"), 0))
    │            key2 := key2:bigint:0:REGULAR
    │            value2 := value2:varchar(79):1:REGULAR
    └─ LocalExchange[HASH][$hashvalue_33] ("key3")
       │   Layout: [key3:bigint, value3:varchar(79), $hashvalue_33:bigint]
       │   Estimates: {rows: 1500000 (102.30MB), cpu: 294.03M, memory: 0B, network: 0B}
       └─ ScanProject[table = hive:test:test_grouped_join3 bucket=13, grouped = true]
              Layout: [key3:bigint, value3:varchar(79), $hashvalue_34:bigint]
              Estimates: {rows: 1500000 (102.30MB), cpu: 89.43M, memory: 0B, network: 0B}/{rows: 1500000 (102.30MB), cpu: 191.73M, memory: 0B, network: 0B}
              $hashvalue_34 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("key3"), 0))
              key3 := key3:bigint:0:REGULAR
              value3 := value3:varchar(79):1:REGULAR

Analysis

Each table has 13 buckets (call this t). A single table occupies 102 MB once read into memory, so one bucket occupies 102 MB / 13 ≈ 7.8 MB (call this m).

The test Presto deployment is a single node with -Xmx=1GB, so the maximum memory for a single query (query.max-memory-per-node) is 102 MB (call this a; the default is 0.1 × the max JVM heap).

The maximum number of buckets that can be processed in parallel (call this n) is bounded by a. The SQL above joins 3 tables (with identical data), so each concurrently processed bucket group holds 3 buckets' worth of data:

n = a / (3 × m) = 102 MB / (3 × 7.8 MB) ≈ 4.4

so concurrent_lifespans_per_task must be set below 4.4 to avoid running out of memory.

Verifying against an actual run: with concurrent_lifespans_per_task=5, the query fails:

SQL Error [131079]: Query failed (#20190821_054413_00220_r4jkt): Query exceeded per-node user memory limit of 102.40MB [Allocated: 102.38MB, Delta: 59.11kB, Top Consumers: {HashBuilderOperator=102.38MB}]

Note: these are theoretical values, for reference only (in practice they are affected by factors such as bucketing never being perfectly even).


Usage scenarios

  • Assume the maximum memory for a single query is 1 GB
  • Assume all tables participating in the join occupy 10 GB in total once read into memory

Scenario 1: bucket all tables on the same column into 10 buckets (or more, since in practice extra headroom should be reserved, e.g. 20%); set concurrent_lifespans_per_task=1. See the sketch below.

Scenario 2: bucket all tables on the same column into 20 buckets (or more, since in practice extra headroom should be reserved, e.g. 20%); set concurrent_lifespans_per_task=2.
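
A minimal sketch of Scenario 1, assuming two hypothetical tables big_a and big_b with a join key id and a value column v (all names are illustrative):

-- bucket both tables the same way on the join key
CREATE TABLE big_a_bucketed
WITH (bucket_count = 10, bucketed_by = ARRAY['id']) AS
SELECT * FROM big_a;

CREATE TABLE big_b_bucketed
WITH (bucket_count = 10, bucketed_by = ARRAY['id']) AS
SELECT * FROM big_b;

-- process one bucket group at a time
set session colocated_join=true;
set session grouped_execution=true;
set session concurrent_lifespans_per_task=1;

SELECT a.id, a.v, b.v
FROM big_a_bucketed a
JOIN big_b_bucketed b ON a.id = b.id;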

