PySpark errors and how to handle them

Original post

2020-02-23 21:47

I. Basic memory settings:

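--driver-memory 40g \       memory for the driver
--executor-memory 40g \     memory for each executor
--num-executors 200 \       number of executors
--executor-cores 4 \        cores per executor (tasks that run concurrently in one executor)
--driver-cores 4 \          cores for the driver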
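
The same flags can also be assembled from a script, for example when a scheduler launches the job. The following is a minimal sketch: the flag names and values are the ones listed above, while the function name `build_spark_submit` and the application file `my_job.py` are hypothetical and used only for illustration.

```python
import shlex


def build_spark_submit(app_path, driver_memory="40g", executor_memory="40g",
                       num_executors=200, executor_cores=4, driver_cores=4):
    """Return a spark-submit command as a list of arguments."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--driver-cores", str(driver_cores),
        app_path,  # hypothetical application script
    ]


cmd = build_spark_submit("my_job.py")
# Join the arguments into one shell-safe string for logging or copy/paste.
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` without shell quoting problems.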

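1. Spark has two kinds of nodes: the Driver (exactly one) and Executors (usually many). Both use the same memory model, and OOM usually occurs on Executors.

2. Each Spark Executor occupies its own Container. The memory limit of a single Container is therefore the memory limit of the Spark Executor; this value is called MonitorMemory below.

3. MonitorMemory = spark.yarn.executor.memoryOverhead + spark.executor.memory. (Since Spark 2.3 the overhead setting is named spark.executor.memoryOverhead.) On our cluster memoryOverhead is fixed at 4G by default, and users can also adjust it through the parameter.
4. spark.executor.memory must be set by the user. On our cluster, we recommend 2~4G of executor.memory per executor core.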
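
The arithmetic in points 3 and 4 can be sketched as below. This is an illustration only: the function names are hypothetical, and the 4096 MB overhead and the 2~4G-per-core range are the values quoted above for this particular cluster, not general Spark defaults.

```python
def monitor_memory_mb(executor_memory_mb, memory_overhead_mb=4096):
    """MonitorMemory = memoryOverhead + executor memory (all in MB)."""
    return executor_memory_mb + memory_overhead_mb


def recommended_executor_memory_mb(executor_cores, gb_per_core=(2, 4)):
    """Return the (low, high) executor memory range in MB for a core count."""
    low_gb, high_gb = gb_per_core
    return executor_cores * low_gb * 1024, executor_cores * high_gb * 1024


low, high = recommended_executor_memory_mb(4)
print(low, high)                 # 8192 16384
print(monitor_memory_mb(high))   # 20480
```

For 4 cores per executor this suggests 8 to 16 GB of executor memory, which becomes a 12 to 20 GB Container once the 4 GB overhead is added.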

II. Common OOM types and how to fix them

1.
Error: reported at startup: java.lang.IllegalArgumentException: Required executor memory (102400+4096 MB) is above the max threshold (28672 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
Cause: the job fails to start at submission time because memoryOverhead + memory > (Max)MonitorMemory.
Fix: reduce the two parameters until their sum fits under the threshold.
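
A quick pre-submission check can catch this before YARN rejects the job. The sketch below uses the numbers from the error message above (102400 MB memory, 4096 MB overhead, 28672 MB threshold); the function names are hypothetical.

```python
def fits_in_container(executor_memory_mb, memory_overhead_mb, max_allocation_mb):
    """True if memory + overhead stays within the cluster's max Container size."""
    return executor_memory_mb + memory_overhead_mb <= max_allocation_mb


def max_executor_memory_mb(max_allocation_mb, memory_overhead_mb):
    """Largest executor memory that still fits once the overhead is reserved."""
    return max_allocation_mb - memory_overhead_mb


print(fits_in_container(102400, 4096, 28672))   # False: 106496 MB > 28672 MB
print(max_executor_memory_mb(28672, 4096))      # 24576
```

With a 4096 MB overhead, the largest executor memory this cluster accepts is 24576 MB (24 GB).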
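2.
Error: reported in the Driver: YarnAllocator: Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Cause: a runtime failure. The physical memory actually used by the Container exceeds MonitorMemory (memoryOverhead + memory).
Fix: increase the number of partitions with --conf spark.sql.shuffle.partitions=XXX, or increase overhead/memory while staying within (Max)MonitorMemory. If the error persists at the maximum, reduce the concurrency of each Executor (cores) and increase the number of Executors.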
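
One way to pick a value for spark.sql.shuffle.partitions is to aim for a fixed amount of shuffle data per partition. The sketch below is a rule of thumb, not something stated in the original post: the 128 MiB target per partition is an assumption, the function name is hypothetical, and 200 is Spark's default for this setting, used here as a floor.

```python
import math

MIB = 1024 * 1024
GIB = 1024 * MIB


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB,
                               minimum=200):
    """Estimate a partition count so each partition holds about the target size."""
    needed = math.ceil(shuffle_bytes / target_partition_bytes)
    # Never go below the floor (Spark's default of 200 partitions).
    return max(needed, minimum)


print(suggest_shuffle_partitions(500 * GIB))   # 4000
print(suggest_shuffle_partitions(1 * GIB))     # 200
```

More partitions mean less data per task, which lowers the peak memory each task needs inside the Container.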
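3.
Error: reported in the Executor log: java.lang.OutOfMemoryError: Java heap space, org.apache.spark.shuffle.FetchFailedException: Java heap space, or Container killed on request. Exit code is 143
Cause: the program needs more memory at runtime than memory provides. This usually happens when the volume of processed or cached data is large, the available memory is insufficient, and memory is allocated faster than GC can reclaim it.
Fix: increase memory; reduce the concurrency of each Executor (cores); remove unnecessary cache operations; avoid broadcasting large datasets; avoid shuffle operators where possible, or optimize the program logic or the underlying data.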
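
The effect of lowering cores can be seen with a rough per-task estimate. This sketch simply divides executor memory by the number of concurrent tasks; it is an approximation that ignores how Spark splits the heap internally, and the function name is hypothetical.

```python
def approx_memory_per_task_mb(executor_memory_mb, executor_cores):
    """Rough heap share per concurrent task: executor memory / cores."""
    return executor_memory_mb // executor_cores


# 40g executor with 4 cores versus the same executor with 2 cores.
print(approx_memory_per_task_mb(40960, 4))   # 10240
print(approx_memory_per_task_mb(40960, 2))   # 20480
```

Halving the cores roughly doubles the memory available to each running task, at the cost of less parallelism per Executor.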
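4.
Error: org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: ...shuffle_xxx.index (No such file or directory)
Cause: an OOM caused the shuffle file fetch to fail. This error is only a symptom; the root cause is one of the OOM cases above, so check the logs carefully for OOM keywords.
Fix: find the OOM keyword and resolve that OOM problem. Alternatively, enable the external shuffle service with --conf spark.shuffle.service.enabled=true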
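
Searching the logs for OOM keywords can be automated. The sketch below scans log lines for the messages quoted in cases 2 and 3 above; the function name and the sample log lines are hypothetical and only serve as test input.

```python
OOM_MARKERS = (
    "java.lang.OutOfMemoryError",
    "Container killed by YARN for exceeding memory limits",
    "Exit code is 143",
)


def find_oom_lines(log_lines):
    """Return (line_number, text) for every line containing an OOM marker."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(marker in line for marker in OOM_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical log excerpt, made up for illustration.
sample = [
    "INFO TaskSetManager: Starting task 0.0 in stage 3.0\n",
    "ERROR Executor: java.lang.OutOfMemoryError: Java heap space\n",
    "WARN TaskSetManager: FetchFailedException: ...shuffle_xxx.index\n",
]
print(find_oom_lines(sample))
```

In the sample only the second line is reported, pointing at the OOM that precedes the missing shuffle file.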
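5.
Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
Cause: when accessing (reading or writing) HDFS while the NameNode (the HDFS management node) is busy and under heavy load, Spark switches to the standby NameNode after a timeout and retries there. The standby node returns this exception, and Spark then reconnects to the active node automatically. The message is therefore only a sign that HDFS is busy, not a Spark runtime error.
Fix: keep waiting, or contact the data mart administrators to ask why HDFS is under heavy load.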
6.
Error: YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
ERROR FileFormatWriter: Aborting job null.
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
Cause: after the driver starts and submits the job, no Executor is launched within 300s (the default), so the application times out and fails.
Fix: 1. Reschedule the job to run at a time when cluster resources are plentiful. 2. Raise the timeout with --conf spark.sql.broadcastTimeout=900 (15 minutes, or another value).
