Troubleshooting Data Skew in Greenplum

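In a shared-nothing MPP environment, the overall response time of a query is determined by the node whose process takes longest to finish. If the data is skewed, nodes holding more data take longer to complete, so every node should hold roughly the same number of rows and perform roughly the same amount of processing. If one node has far more data to process than the others, the result can be poor performance and out-of-memory conditions.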

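Optimal distribution is critical when joining large tables. To perform a join, matching rows must be located together on the same node. If the tables are not distributed on the same distribution key column, the rows needed from one of the tables are dynamically redistributed to other nodes. In some cases a broadcast motion is performed instead of a redistribution motion: in a broadcast motion, each node sends its rows to all other nodes, whereas in a redistribution motion each node hashes the data and sends each row to the appropriate node according to the hash key.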

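All tables in GPDB are distributed, meaning their data is divided across all segments of the system. If the data is not distributed evenly, query performance can suffer. The following views help diagnose whether a table's data is unevenly distributed.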

gp_skew_coefficients
gp_skew_idle_fractions

gp_skew_coefficients

The gp_toolkit.gp_skew_coefficients view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. The skccoeff column shows the coefficient of variation (CV), which is calculated as the standard deviation divided by the average. It takes into account both the average and variability around the average of a data series. The lower the value, the better. Higher values indicate greater data skew.

This view is accessible to all users, but non-superusers only see the relations they have permission to access.

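Column        Description
skcoid        Object identifier (OID) of the table
skcnamespace  Namespace in which the table is defined
skcrelname    Name of the table
skccoeff      Coefficient of variation, computed as the standard deviation divided by the average. It takes into account both the average and the variability around it. The lower the value, the better; higher values indicate more severe data skew.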
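To make the definition concrete, here is a minimal Python sketch that computes a coefficient of variation from per-segment row counts. The row counts are made-up values and the function name `skew_coefficient` is hypothetical. The text above does not say whether the view uses the population or the sample standard deviation, or whether it scales the result, so the population form without scaling is an assumption.

```python
from statistics import mean, pstdev


def skew_coefficient(rows_per_segment):
    """Coefficient of variation: standard deviation divided by the mean.

    Assumption: population standard deviation, no extra scaling.
    """
    avg = mean(rows_per_segment)
    if avg == 0:
        return 0.0  # no rows anywhere: treat as no skew
    return pstdev(rows_per_segment) / avg


# Hypothetical row counts for a four-segment cluster.
balanced = [1000, 1000, 1000, 1000]
skewed = [4000, 0, 0, 0]

print(skew_coefficient(balanced))  # rows spread evenly
print(skew_coefficient(skewed))    # all rows on one segment
```

The evenly distributed table yields the lowest possible value, and the value grows as more rows pile up on a single segment, which matches "the lower, the better".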

gp_skew_idle_fractions

The gp_toolkit.gp_skew_idle_fractions view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of computational skew. The siffraction column shows the percentage of the system that is idle during a table scan. This is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.

This view is accessible to all users, but non-superusers only see the relations they have permission to access.

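Column        Description
sifoid        Object identifier (OID) of the table
sifnamespace  Namespace in which the table is defined
sifname       Name of the table
siffraction   Percentage of the system that is idle during a table scan, an indicator of uneven data distribution or query processing skew. For example, 0.1 indicates 10% skew, 0.5 indicates 50% skew, and so on. Tables with more than 10% skew should have their distribution policies evaluated.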
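One way to read "percentage of the system that is idle during a table scan" is sketched below in Python. It is a simplified model, not necessarily the exact formula gp_toolkit uses: it assumes every segment scans rows at the same rate, so the scan lasts as long as the fullest segment needs and the other segments sit idle once they finish. The function name `idle_fraction` and the row counts are hypothetical.

```python
def idle_fraction(rows_per_segment):
    """Share of total scan capacity that goes unused.

    Assumption: all segments scan at the same rate, so the scan takes as
    long as the largest segment needs and the rest wait for it.
    """
    largest = max(rows_per_segment)
    if largest == 0:
        return 0.0  # empty table: nothing to scan, nothing idle
    capacity = largest * len(rows_per_segment)
    return 1.0 - sum(rows_per_segment) / capacity


# Hypothetical row counts for a four-segment cluster.
print(idle_fraction([1000, 1000, 1000, 1000]))  # even distribution
print(idle_fraction([2000, 1000, 1000, 0]))     # one segment holds half the rows
```

Under this model the second table would be far above the 10% threshold at which the distribution policy should be reviewed.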

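The two views above only support a static analysis of whether data is skewed. In practice, distribution keys are usually chosen carefully when tables are created, so data skew caused by a poorly designed distribution key is rare. The investigation can then continue step by step.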

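The real culprit behind poor GP performance is more likely a running SQL statement that generates a large amount of data motion, which puts heavy pressure on the system's I/O, network, and CPU. Joins, ORDER BY, GROUP BY, and other OLAP-style SQL may produce skew only briefly, but that is enough to affect other statements and degrade database performance. If a large number of skewed statements hit the database, the effect can be fatal.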

Because skew produced by a process is transient, these anomalies are hard to catch.

The official GP documentation gives an example procedure for analyzing process skew.

1. First determine the OID of the database to investigate. The next step uses it to analyze which database has skew.

=# SELECT oid, datname FROM pg_database;
  oid  |  datname  
-------+-----------
     1 | template1
 12813 | template0
 12816 | postgres
 16384 | qmstst
 64919 | gpperfmon
 78257 | pgbench
 78258 | results
(7 rows)

2. Use gpssh to collect, from every segment host, the size of the pgsql_tmp directory under each segment's data directory.

I am still debugging this script.

[gpadmin@mdw kend]$ gpssh -f ~/hosts -e \
    "du -b /data[1-2]/primary/gpseg*/base/<OID>/pgsql_tmp/*" | \
    grep -v "du -b" | sort | awk -F" " '{ arr[$1] = arr[$1] + $2 ; tot = tot + $2 }; END \
    { for ( i in arr ) print "Segment node" i, arr[i], "bytes (" arr[i]/(1024**3)" GB)"; \
    print "Total", tot, "bytes (" tot/(1024**3)" GB)" }' -

Example output:

Segment node[sdw1] 2443370457 bytes (2.27557 GB)
Segment node[sdw2] 1766575328 bytes (1.64525 GB)
Segment node[sdw3] 1761686551 bytes (1.6407 GB)
Segment node[sdw4] 1780301617 bytes (1.65804 GB)
Segment node[sdw5] 1742543599 bytes (1.62287 GB)
Segment node[sdw6] 1830073754 bytes (1.70439 GB)
Segment node[sdw7] 1767310099 bytes (1.64594 GB)
Segment node[sdw8] 1765105802 bytes (1.64388 GB)
Total 14856967207 bytes (13.8366 GB)

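If disk usage differs significantly between nodes and the difference persists for some time, suspect data skew. A single run of the script does not prove data skew; another sign is that, while the query runs, the GPCC interface shows high I/O, network, and so on for a particular segment.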
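To compare nodes at a glance, the per-host totals can be expressed relative to the mean. The Python sketch below parses the example output from step 2 (embedded as a string) and prints each host's share. The function name `host_ratios` is hypothetical, and the regular expression assumes the exact "Segment node[host] N bytes" layout printed by the awk script above.

```python
import re

# Example output from step 2, copied verbatim.
REPORT = """\
Segment node[sdw1] 2443370457 bytes (2.27557 GB)
Segment node[sdw2] 1766575328 bytes (1.64525 GB)
Segment node[sdw3] 1761686551 bytes (1.6407 GB)
Segment node[sdw4] 1780301617 bytes (1.65804 GB)
Segment node[sdw5] 1742543599 bytes (1.62287 GB)
Segment node[sdw6] 1830073754 bytes (1.70439 GB)
Segment node[sdw7] 1767310099 bytes (1.64594 GB)
Segment node[sdw8] 1765105802 bytes (1.64388 GB)
Total 14856967207 bytes (13.8366 GB)
"""

# Matches only the per-host lines; the "Total" line is skipped.
LINE = re.compile(r"Segment node\[(\w+)\] (\d+) bytes")


def host_ratios(report):
    """Map each host to its spill size divided by the mean across hosts."""
    sizes = {host: int(size) for host, size in LINE.findall(report)}
    avg = sum(sizes.values()) / len(sizes)
    return {host: size / avg for host, size in sizes.items()}


for host, ratio in sorted(host_ratios(REPORT).items(), key=lambda kv: -kv[1]):
    print(f"{host}: {ratio:.2f}x the mean")
```

Running this repeatedly over time, rather than once, fits the advice that only a persistent gap points to skew.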

3. If there is significant and sustained skew, the next task is to identify the offending query.

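The command in the previous step sums up the entire node. This time, find the actual segment directory. You can do this from the master, or by logging in to the specific node identified in the previous step. The following example is run from the master.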

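This example looks specifically for sort files. Not all spill files or skew situations are caused by sort files, so you will need to customize the command: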

$ gpssh -f ~/hosts -e \
    "ls -l /data[1-2]/primary/gpseg*/base/19979/pgsql_tmp/*" \
    | grep -i sort | awk '{sub(/base.*tmp\//, ".../", $10); print $1,$6,$10}' | sort -k2 -n

Here is output from this command:

[sdw1] 288718848
      /data1/primary/gpseg2/.../pgsql_tmp_slice0_sort_17758_0001.0
[sdw1] 291176448
      /data2/primary/gpseg5/.../pgsql_tmp_slice0_sort_17764_0001.0
[sdw8] 924581888
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0010.9
[sdw4] 980582400
      /data1/primary/gpseg18/.../pgsql_tmp_slice10_sort_29425_0001.0
[sdw6] 986447872
      /data2/primary/gpseg35/.../pgsql_tmp_slice10_sort_29602_0001.0...
[sdw5] 999620608
      /data1/primary/gpseg26/.../pgsql_tmp_slice10_sort_28637_0001.0
[sdw2] 999751680
      /data2/primary/gpseg9/.../pgsql_tmp_slice10_sort_3969_0001.0
[sdw3] 1000112128
      /data1/primary/gpseg13/.../pgsql_tmp_slice10_sort_24723_0001.0
[sdw5] 1000898560
      /data2/primary/gpseg28/.../pgsql_tmp_slice10_sort_28641_0001.0...
[sdw8] 1008009216
      /data1/primary/gpseg44/.../pgsql_tmp_slice10_sort_15671_0001.0
[sdw5] 1008566272
      /data1/primary/gpseg24/.../pgsql_tmp_slice10_sort_28633_0001.0
[sdw4] 1009451008
      /data1/primary/gpseg19/.../pgsql_tmp_slice10_sort_29427_0001.0
[sdw7] 1011187712
      /data1/primary/gpseg37/.../pgsql_tmp_slice10_sort_18526_0001.0
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0001.0
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0002.1
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0003.2
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0004.3
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0005.4
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0006.5
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0007.6
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0008.7
[sdw8] 1573741824
      /data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0009.8

Scanning this output reveals that segment gpseg45 on host sdw8 is the culprit, as its sort files are larger than the others in the output.
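Instead of scanning the listing by eye, the sort file sizes can be summed per host and segment. This is a different technique from the one described above, shown only as a sketch: the Python snippet below uses a handful of entries copied from the output, and the function name `spill_by_segment` and the (host, size, path) tuple layout are assumptions made for illustration.

```python
import re
from collections import defaultdict

# A few (host, size in bytes, path) entries copied from the output above.
SORT_FILES = [
    ("sdw1", 288718848, "/data1/primary/gpseg2/.../pgsql_tmp_slice0_sort_17758_0001.0"),
    ("sdw4", 980582400, "/data1/primary/gpseg18/.../pgsql_tmp_slice10_sort_29425_0001.0"),
    ("sdw8", 924581888, "/data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0010.9"),
    ("sdw8", 1573741824, "/data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0001.0"),
    ("sdw8", 1573741824, "/data2/primary/gpseg45/.../pgsql_tmp_slice10_sort_15673_0002.1"),
]


def spill_by_segment(entries):
    """Sum sort file sizes per (host, segment directory)."""
    totals = defaultdict(int)
    for host, size, path in entries:
        match = re.search(r"/(gpseg\d+)/", path)
        if match:  # ignore paths without a gpseg directory component
            totals[(host, match.group(1))] += size
    return dict(totals)


totals = spill_by_segment(SORT_FILES)
worst = max(totals, key=totals.get)
print(worst, totals[worst])
```

Summing also catches a segment that has many moderately sized sort files, which a per-file comparison could miss.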

4. Log in to the offending node with ssh and become root. Use the lsof command to find the PID for the process that owns one of the sort files:

[root@sdw8 ~]# lsof /data2/primary/gpseg45/base/19979/pgsql_tmp/pgsql_tmp_slice10_sort_15673_0002.1
COMMAND  PID    USER    FD   TYPE DEVICE  SIZE        NODE        NAME
postgres 15673  gpadmin 11u  REG  8,48    1073741824  64424546751 /data2/primary/gpseg45/base/19979/pgsql_tmp/pgsql_tmp_slice10_sort_15673_0002.1

The PID, 15673, is also part of the file name, but this may not always be the case.

5. Use the ps command with the PID to identify the database and connection information:

[root@sdw8 ~]# ps -eaf | grep 15673
gpadmin  15673 27471 28 12:05 ?        00:12:59 postgres: port 40003, sbaskin bdw
        172.28.12.250(21813) con699238 seg45 cmd32 slice10 MPPEXEC SELECT
root     29622 29566  0 12:50 pts/16   00:00:00 grep 15673

On the master, check the pg_log log file for the user in the previous command (sbaskin), connection (con699238), and command (cmd32). The line in the log file with these three values should be the line that contains the query, but occasionally, the command number may differ slightly. For example, the ps output may show cmd32, but in the log file it is cmd34. If the query is still running, the last query for the user and connection is the offending query.
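The three values to search for can be pulled out of the ps line with a regular expression. The Python sketch below assumes the process title always has the layout seen in the example (port, user, database, client address, conNNN, segNN, cmdNN); that layout and the function name `parse_ps_line` are assumptions, so adjust the pattern if your output differs.

```python
import re

# The ps line from the example above (wrapped across two lines in the output).
PS_LINE = (
    "gpadmin  15673 27471 28 12:05 ?        00:12:59 postgres: port 40003, sbaskin bdw\n"
    "        172.28.12.250(21813) con699238 seg45 cmd32 slice10 MPPEXEC SELECT"
)

# Assumed layout: "postgres: port P, USER DB CLIENT conN segN cmdN ..."
PATTERN = re.compile(
    r"postgres:\s+port\s+\d+,\s+(?P<user>\S+)\s+(?P<database>\S+)\s+\S+\s+"
    r"(?P<connection>con\d+)\s+(?P<segment>seg\d+)\s+(?P<command>cmd\d+)"
)


def parse_ps_line(line):
    """Return user, database, connection, segment, and command, or None."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


info = parse_ps_line(PS_LINE)
print(info["user"], info["connection"], info["command"])
```

Since the command number in the log may differ slightly from the one ps reports, it is safer to search the log for the user and connection first and treat the command number as a hint.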

References:

1. https://gpdb.docs.pivotal.io/43270/admin_guide/distribution.html
