Analyzing and Resolving Spark HDFS Read/Write and Lost Executor Errors
http://www.aboutyun.com/thread-15842-1-1.html
Guiding questions
1. When writing large-scale data to HDFS, read/write timeouts were reported — how does this article analyze the problem?
2. How are those timeouts resolved?
3. When you run into problems like these, how do you go about solving them?
I. Overview
The previous blog post recorded some problems encountered while using spark-sql. Today I continue with some Spark job errors I hit while using Spark's RDD transformation APIs to develop the company's first-phase tag analysis system (I'll share part of the Scala job logic in a later post). Some of these problems may never show up in jobs with small data or shuffle volumes; the initial input to our whole tag system is roughly 8 TB, which may serve as a reference point. (The deployment mode below is Spark on YARN.)
II. Problems
1. When writing large-scale data to HDFS, HDFS read/write timeouts were reported; the relevant logs follow.
(1) Error log from a specific Executor:
(2) Logs from the individual DataNodes:
Analysis:
These two error messages place the failure within the HDFS read/write path. For read/write timeouts, two parameters are relevant: dfs.client.socket-timeout (default 60s) and dfs.datanode.socket.write.timeout (default 480s). Set these two values in your Spark program according to your actual workload and the problem goes away. An example:
[Scala]
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val dwd_new_pc_list_patch = "/user/hive/warehouse/pc.db/dwd_new_pc_list/2015-01-*/action=play"
val sparkConf = new SparkConf().setAppName("TagSystem_compositeTag")
  .set("spark.kryoserializer.buffer.max.mb", "128").set("spark.rdd.compress", "true")
val sc = new SparkContext(sparkConf)
// Raise the HDFS client read and write timeouts to 180s (values are in milliseconds)
sc.hadoopConfiguration.set("dfs.client.socket-timeout", "180000")
sc.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "180000")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveSqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Fields extracted: (user_id, fo, fo_2, sty, fs)
val source = sc.textFile(dwd_new_pc_list_patch)
  .filter(p => p.trim != "" && p.split("\\|").length >= 105)
  .mapPartitions({ it =>
    for {
      line <- it
    } yield (line.split("\\|")(21),
      line.split("\\|")(9),
      line.split("\\|")(104),
      line.split("\\|")(40),
      line.split("\\|")(7))
  }).persist(StorageLevel.MEMORY_AND_DISK_SER)
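Incidentally, if you prefer not to mutate sc.hadoopConfiguration after the context is created, the same settings can be supplied through SparkConf: Spark copies any property prefixed with spark.hadoop. into the Hadoop Configuration it builds for the job. A minimal sketch (reusing the app name from above purely for illustration):
[Scala]
import org.apache.spark.{SparkConf, SparkContext}

// Properties prefixed with "spark.hadoop." are copied into the job's Hadoop
// Configuration, so HDFS sees the raised timeouts (in ms) from the start.
val conf = new SparkConf()
  .setAppName("TagSystem_compositeTag")
  .set("spark.hadoop.dfs.client.socket-timeout", "180000")
  .set("spark.hadoop.dfs.datanode.socket.write.timeout", "180000")
val sc = new SparkContext(conf)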
2. Lost Executor problems caused by spark.reducer.maxMbInFlight.
This error occurs mainly in the fetch phase of a shuffle. Once an Executor is lost, fault tolerance starts a replacement Executor, but everything held by the lost Executor's BlockManager is gone, so the earlier stages have to be recomputed. The driver and CoarseGrainedExecutorBackend logs mainly report timeouts and file read/write failures; here is a capture of the timeout message:
Solution:
Dealing with the Lost Executor problem took quite a long time; adjusting many other parameters got nowhere. In the end, lowering spark.reducer.maxMbInFlight (or lowering spark.shuffle.copier.threads) resolved it. I later studied the precise semantics of spark.reducer.maxMbInFlight at home. The official configuration docs are somewhat vague: roughly, it is the maximum size of map output that each reduce task fetches simultaneously (default 48 MB). That is not easy to grasp from the wording alone, so I searched the source for the parameter and landed on org.apache.spark.storage.BlockFetcherIterator.BasicBlockFetcherIterator#splitLocalRemoteBlocks:
[Scala]
protected def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
  // Make remote requests at most maxBytesInFlight / 5 in length; the reason to keep them
  // smaller than maxBytesInFlight is to allow multiple, parallel fetches from up to 5
  // nodes, rather than blocking on reading output from one node.
  // I.e. the amount of data each fetch request carries (5 parallel fetches by default).
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
  logInfo("maxBytesInFlight: " + maxBytesInFlight + ", targetRequestSize: " + targetRequestSize)

  // Split local and remote blocks. Remote blocks are further split into FetchRequests of size
  // at most maxBytesInFlight in order to limit the amount of data in flight.
  val remoteRequests = new ArrayBuffer[FetchRequest]
  var totalBlocks = 0
  for ((address, blockInfos) <- blocksByAddress) { // address identifies an executor (its BlockManagerId)
    totalBlocks += blockInfos.size
    if (address == blockManagerId) {
      // Filter out zero-sized blocks
      localBlocksToFetch ++= blockInfos.filter(_._2 != 0).map(_._1)
      _numBlocksToFetch += localBlocksToFetch.size
    } else {
      val iterator = blockInfos.iterator
      var curRequestSize = 0L
      var curBlocks = new ArrayBuffer[(BlockId, Long)]
      while (iterator.hasNext) {
        // blockId is an org.apache.spark.storage.ShuffleBlockId,
        // formatted as "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
        val (blockId, size) = iterator.next()
        // Skip empty blocks
        if (size > 0) {
          curBlocks += ((blockId, size))
          remoteBlocksToFetch += blockId
          _numBlocksToFetch += 1
          curRequestSize += size
        } else if (size < 0) {
          throw new BlockException(blockId, "Negative block size " + size)
        }
        if (curRequestSize >= targetRequestSize) {
          // Add this FetchRequest
          remoteRequests += new FetchRequest(address, curBlocks)
          curBlocks = new ArrayBuffer[(BlockId, Long)]
          logDebug(s"Creating fetch request of $curRequestSize at $address")
          curRequestSize = 0
        }
      }
      // Add in the final request: put any remaining blocks into one last request.
      if (!curBlocks.isEmpty) {
        remoteRequests += new FetchRequest(address, curBlocks)
      }
    }
  }
  logInfo("Getting " + _numBlocksToFetch + " non-empty blocks out of " +
    totalBlocks + " blocks")
  remoteRequests
}
My personal reading of this code: on the shuffle's reduce side, each reduce task fetches from remote Executors with several parallel fetch threads (configurable via spark.shuffle.copier.threads), and each fetch request carries at most spark.reducer.maxMbInFlight (default 48 MB) / 5 of data, so output is pulled from up to 5 nodes in parallel rather than blocking on one. As I understand it, this design (a) keeps the network I/O of any single fetch connection small, and (b) parallelizes the fetches, improving throughput and reducing the total time spent pulling data.
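To make the splitting concrete, here is a small self-contained sketch of the same grouping idea (the block ids and sizes are made up for illustration; this is a toy model, not the real fetcher):
[Scala]
import scala.collection.mutable.ArrayBuffer

object SplitDemo extends App {
  // Toy model of splitLocalRemoteBlocks: blocks from one remote executor
  // are packed into requests of at most maxBytesInFlight / 5 bytes each.
  val maxBytesInFlight = 48L * 1024 * 1024                   // 48 MB default
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L) // ~9.6 MB cap

  val blocks = Seq(("shuffle_0_0_3", 4L << 20), ("shuffle_0_1_3", 7L << 20),
                   ("shuffle_0_2_3", 2L << 20), ("shuffle_0_3_3", 8L << 20))

  val requests = ArrayBuffer[ArrayBuffer[(String, Long)]]()
  var cur = ArrayBuffer[(String, Long)]()
  var curSize = 0L
  for ((id, size) <- blocks if size > 0) {
    cur += ((id, size)); curSize += size
    if (curSize >= targetRequestSize) { // close the request once the cap is hit
      requests += cur
      cur = ArrayBuffer[(String, Long)](); curSize = 0L
    }
  }
  if (cur.nonEmpty) requests += cur     // leftovers go into one final request

  // Prints two requests: (4 MB + 7 MB) and (2 MB + 8 MB); each request
  // is then fetched over one connection as a single unit.
  requests.foreach(r => println(r.map(_._1).mkString(", ")))
}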
Back to my problem: lowering spark.reducer.maxMbInFlight reduces how much data each fetch thread of each reduce task pulls per request (with the default 48 MB, each request is capped at about 9.6 MB; lowering it to 24 MB halves that), which shortens how long each fetch connection stays open. That mitigates the fetch timeouts caused by too many concurrent fetch threads per Executor when there are many reduce tasks, and it also lowers memory usage.
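For reference, the change that resolved it for me amounted to something like the following (the exact numbers here are illustrative, not tuned recommendations):
[Scala]
import org.apache.spark.SparkConf

// Halving maxMbInFlight caps each of the ~5 parallel fetch requests at
// about 4.8 MB instead of 9.6 MB; alternatively (or additionally),
// lowering the copier thread count reduces concurrent fetches.
val conf = new SparkConf()
  .set("spark.reducer.maxMbInFlight", "24")
  .set("spark.shuffle.copier.threads", "4")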
The analysis above is my personal understanding; if you have deeper insights, discussion is welcome.