Spark SQL Optimization Notes

Original post: https://dongkelun.com/2018/12/26/sparkSqlOptimize/

Preface

These are notes on SQL optimization problems I have run into during day-to-day development.

1. Avoid in and not in

Solutions:

Use exists and not exists instead
Use a join instead

not exists example

not in:

select stepId,province_code,polyline from route_step where stepId not in (select stepId from stepIds)

not exists:

select stepId,province_code,polyline from route_step where road!='解析異常' and  not exists (select stepId from stepIds where route_step.stepId = stepIds.stepId)
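
The "use a join instead" option can be sketched with the DataFrame API as a left anti join, which keeps only the left-side rows that have no match on the right. This is a minimal sketch, not the code from the original job: the object name, the local SparkSession, and the sample rows are assumptions for illustration; only the table and column names come from the queries above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AntiJoinSketch {
  // Rows of routeStep whose stepId has no match in stepIds
  // (same result as the not exists query, minus the extra road filter).
  def stepsNotIn(routeStep: DataFrame, stepIds: DataFrame): DataFrame =
    routeStep.join(stepIds, Seq("stepId"), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real tables.
    val routeStep = Seq((1L, "11", "p1"), (2L, "12", "p2"), (3L, "13", "p3"))
      .toDF("stepId", "province_code", "polyline")
    val stepIds = Seq(2L).toDF("stepId")

    stepsNotIn(routeStep, stepIds).show()
    spark.stop()
  }
}
```

A left anti join treats NULL the same way not exists does. not in is different: if the subquery returns even one NULL, not in returns no rows at all, so the two forms are interchangeable only when stepId is never NULL.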

The problem I ran into

The not in query above throws an exception:

18/12/26 11:20:26 WARN TaskSetManager: Stage 3 contains a task of very large size (17358 KB). The maximum recommended task size is 100 KB.
Exception in thread "dispatcher-event-loop-11" java.lang.OutOfMemoryError: Java heap space

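First, one task becomes very large while the total number of tasks stays small (the task count does not match the partition count of the RDD or DataFrame, and I do not yet know why). Then java.lang.OutOfMemoryError is thrown. I tried many fixes; in the end, switching to not exists made the exception go away.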

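Performance

The usual explanation for not in being slow is that it cannot use an index. That reasoning comes from traditional databases. Spark SQL has no indexes, so here the cause is more likely the extra work needed to honor the NULL semantics of not in.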

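Open question: not in is an uncorrelated subquery and not exists is a correlated one. In theory an uncorrelated subquery should be more efficient than a correlated one (see the reference below), yet here the opposite happened. I do not know why.

Reference blog post: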

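2. in can cause data skew

longitudeAndLatitudes and lineIds both have 160 partitions and are balanced (each partition holds roughly the same number of rows), but the following statement causes trouble: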

select * from longitudeAndLatitudes where lineId  in (select lineId from lineIds)

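The result still has 160 partitions, but only two or three of them contain data; the rest are empty. That is data skew, and the job runs very slowly. If you really must use in, repartition the result.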
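
The repartition step can be sketched as follows. This is only an illustration under stated assumptions: the object and function names, the local SparkSession, and the sample columns longitude and latitude are hypothetical; the query text and the two view names come from the statement above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RepartitionAfterIn {
  // Run the in query, then redistribute rows evenly across numPartitions.
  def filterAndRebalance(spark: SparkSession, numPartitions: Int): DataFrame =
    spark
      .sql("select * from longitudeAndLatitudes where lineId in (select lineId from lineIds)")
      .repartition(numPartitions)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repartition-after-in")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data registered under the view names used in the query.
    Seq((1L, 116.4, 39.9), (2L, 121.5, 31.2), (3L, 113.3, 23.1))
      .toDF("lineId", "longitude", "latitude")
      .createOrReplaceTempView("longitudeAndLatitudes")
    Seq(1L, 3L).toDF("lineId").createOrReplaceTempView("lineIds")

    val result = filterAndRebalance(spark, 4)
    println(s"partitions after repartition: ${result.rdd.getNumPartitions}")
    spark.stop()
  }
}
```

repartition triggers an extra shuffle of its own, so it pays off only when the downstream work on the skewed result is expensive.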

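3. Joining a large table with a small table

Strategy: broadcast the small table.
Parameter: spark.sql.autoBroadcastJoinThreshold, default 10485760 (10 MB). When a table or DataFrame is smaller than this value, Spark automatically broadcasts it to every node.
How it works: a join is a shuffle operator. During a shuffle, each node first writes records with the same key to local disk, and other nodes then pull those records over the network from the disk files. A shuffle can therefore involve heavy disk I/O and network transfer, which makes it slow. A broadcast join instead ships the small table to every node up front, so the join itself runs locally and the large table is not shuffled, which improves performance.

Note: a broadcast join is also called a replicated join or a map-side join.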
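
Besides relying on the threshold, Spark also lets you request a broadcast explicitly with the broadcast function from org.apache.spark.sql.functions. This is a different technique from the threshold-based approach in these notes, shown here only as a sketch: the object name, the local SparkSession, and the sample rows are assumptions; the column names are borrowed from the query plans in this section.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  // Mark the small side for broadcast, regardless of the size threshold.
  def joinWithBroadcast(large: DataFrame, small: DataFrame): DataFrame =
    large.join(broadcast(small), Seq("lineId"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: a "large" fact table and a "small" lookup table.
    val large = Seq((1L, "u1", 3), (2L, "u2", 5), (1L, "u3", 1))
      .toDF("lineId", "userId", "freq")
    val small = Seq((1L, "s1,s2"), (2L, "s3")).toDF("lineId", "stepIds")

    val joined = joinWithBroadcast(large, small)
    joined.explain() // the physical plan should mention BroadcastHashJoin
    joined.show()
    spark.stop()
  }
}
```

The hint is useful when Spark's size estimate for the small table is too high for the automatic broadcast to trigger. The broadcast table still has to fit in driver and executor memory.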

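How to do it

When submitting the job, raise the threshold as needed, for example to 100 MB. The right value depends on the memory limits of your environment and the size of the small table.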

--conf spark.sql.autoBroadcastJoinThreshold=104857600

How to check whether a broadcast join was used:
Take df as an example (df is the result of the join).

df.explain

If it is a broadcast join, this prints:

== Physical Plan ==
*(14) Project [lineId#81, stepIds#85, userId#1, freq#2]
+- *(14) BroadcastHashJoin [lineId#81], [lineId#42], Inner, BuildLeft
...

Seeing the keyword BroadcastHashJoin is enough. Otherwise it prints:

== Physical Plan ==
*(17) Project [lineId#42, stepIds#85, freq#2, userId#1]
+- *(17) SortMergeJoin [lineId#42], [lineId#81], Inner
...

Here the keyword to look for is SortMergeJoin.

To check the current threshold:

val threshold =  spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toInt
threshold / 1024 / 1024
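
The threshold can also be changed inside a running session with spark.conf.set, as an alternative to the --conf flag. The following is a sketch under assumptions: the object and helper names are hypothetical, and stripping a trailing "b" is a defensive guess, since some Spark versions appear to report byte-valued settings with a unit suffix.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastThresholdSketch {
  val Key = "spark.sql.autoBroadcastJoinThreshold"

  // Set the broadcast threshold for the current session, in megabytes.
  def setThresholdMb(spark: SparkSession, mb: Long): Unit =
    spark.conf.set(Key, mb * 1024 * 1024)

  // Read the threshold back in megabytes.
  // Assumption: the value is a byte count, optionally followed by "b".
  def thresholdMb(spark: SparkSession): Long =
    spark.conf.get(Key).stripSuffix("b").toLong / 1024 / 1024

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-threshold-sketch")
      .getOrCreate()

    setThresholdMb(spark, 100)
    println(s"threshold in MB: ${thresholdMb(spark)}")
    spark.stop()
  }
}
```

A session-level setting affects only queries planned after the call, so set it before building the join.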

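4. Slow writes to MySQL

Batch-writing a Spark DataFrame to MySQL is very slow. In my case, writing 9 million rows took 5 to 10 hours.
Fix: append the following to the JDBC URL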

&rewriteBatchedStatements=true

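With this parameter the write takes about 10 minutes, which is much faster.

From my own environments: MySQL worked fine without the parameter, while MariaDB needed it. In other words, the behavior differs between MySQL versions.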
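
A write with this parameter might look like the sketch below. It is an illustration only: the object and helper names, the host, database, table, and credentials are all placeholders, and the URL helper exists only because the parameter must be joined with "?" when the URL has no other parameters and with "&" otherwise.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object MysqlBatchWriteSketch {
  // Add rewriteBatchedStatements=true unless the URL already sets it.
  def withBatchRewrite(url: String): String =
    if (url.contains("rewriteBatchedStatements=")) url
    else if (url.contains("?")) s"$url&rewriteBatchedStatements=true"
    else s"$url?rewriteBatchedStatements=true"

  // Append df to the given table over JDBC using the adjusted URL.
  def write(df: DataFrame, url: String, table: String,
            user: String, password: String): Unit = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    df.write.mode(SaveMode.Append).jdbc(withBatchRewrite(url), table, props)
  }
}
```

A call such as MysqlBatchWriteSketch.write(df, "jdbc:mysql://db-host:3306/mydb?useSSL=false", "route_step", "user", "secret") would then write with batched statement rewriting enabled; every value in that call is a placeholder.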

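5. run at ThreadPoolExecutor.java:1149

I had often seen this description in the Spark Web UI without knowing what it meant. While writing up the broadcast join section above, I noticed a pattern: when two tables are joined, the description appears if the join is a BroadcastHashJoin and does not appear if it is a SortMergeJoin.
BroadcastHashJoin uses a thread pool to broadcast asynchronously; for the source, see BroadcastHashJoinExec and BroadcastExchangeExec.
Reference: What are ThreadPoolExecutors jobs in web UI's Spark Jobs?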
