Problem Description
We recently used a custom Spark accumulator to monitor the volume of data ingested by our big-data platform (for how to define a custom accumulator, see AccumulatorV2.scala). Colleagues at a project site then reported that, when the number of files is large, the Spark driver runs out of memory (OOM), even though the site had configured 80 GB of driver memory.
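For context, here is a minimal sketch of the kind of monitoring accumulator involved; the class name and counting logic are illustrative assumptions, not our production implementation:

import org.apache.spark.util.AccumulatorV2

// Illustrative only: a simple counter of records (or bytes) processed.
class RecordCountAccumulator extends AccumulatorV2[Long, Long] {
  private var _count = 0L

  override def isZero: Boolean = _count == 0L

  override def copy(): RecordCountAccumulator = {
    val acc = new RecordCountAccumulator
    acc._count = _count
    acc
  }

  override def reset(): Unit = _count = 0L

  override def add(v: Long): Unit = _count += v

  override def merge(other: AccumulatorV2[Long, Long]): Unit = _count += other.value

  override def value: Long = _count
}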
Troubleshooting
The heap dump from the site was over 80 GB, so it could not be copied in full to an office machine; instead we analyzed it on a Linux host with the Memory Analyzer Tool (MAT) and generated a leak suspects report (leak_suspects). From the retained heap size and number of objects that MAT reported for TaskInfo, dividing one by the other gives an average of roughly 4.67 MB per TaskInfo object.
Examining the relevant part of the Spark TaskInfo source
class TaskInfo(
    val taskId: Long,
    /**
     * The index of this task within its task set. Not necessarily the same as the ID of the RDD
     * partition that the task is computing.
     */
    val index: Int,
    val attemptNumber: Int,
    val launchTime: Long,
    val executorId: String,
    val host: String,
    val taskLocality: TaskLocality.TaskLocality,
    val speculative: Boolean) {

  var gettingResultTime: Long = 0

  private[this] var _accumulables: Seq[AccumulableInfo] = Nil
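  // Note: _accumulables is the field that grows. As the DAGScheduler code
  // below shows, one AccumulableInfo per named accumulator is prepended to
  // this list every time a task completes.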
  var finishTime: Long = 0

  var failed = false

  var killed = false
}
From the code above, the only field that can account for that much space is the AccumulableInfo list (_accumulables); the remaining fields are small, fixed-size primitives and strings.
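For reference, each element of that list is an AccumulableInfo. Abridged from the Spark 2.x source (field order and visibility may vary slightly across versions), it carries the accumulator's name, the per-task update, and the full accumulated value:

case class AccumulableInfo private[spark] (
    id: Long,
    name: Option[String],
    update: Option[Any], // the per-task partial value
    value: Option[Any],  // the total accumulated value so far
    private[spark] val internal: Boolean,
    private[spark] val countFailedValues: Boolean,
    private[spark] val metadata: Option[String] = None)

For an accumulator whose value is a large collection (for example, per-file statistics), each entry pins a reference to that value, which is consistent with the ~4.67 MB average observed per TaskInfo.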
Examining the DAGScheduler source
private def updateAccumulators(event: CompletionEvent): Unit = {
  val task = event.task
  val stage = stageIdToStage(task.stageId)
  event.accumUpdates.foreach { updates =>
    val id = updates.id
    try {
      // Find the corresponding accumulator on the driver and update it
      val acc: AccumulatorV2[Any, Any] = AccumulatorContext.get(id) match {
        case Some(accum) => accum.asInstanceOf[AccumulatorV2[Any, Any]]
        case None =>
          throw new SparkException(s"attempted to access non-existent accumulator $id")
      }
      acc.merge(updates.asInstanceOf[AccumulatorV2[Any, Any]])
      // To avoid UI cruft, ignore cases where value wasn't updated
      if (acc.name.isDefined && !updates.isZero) {
        stage.latestInfo.accumulables(id) = acc.toInfo(None, Some(acc.value))
        event.taskInfo.setAccumulables(
          acc.toInfo(Some(updates.value), Some(acc.value)) +: event.taskInfo.accumulables)
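        // Note: this prepend runs once per named, non-zero accumulator for every
        // completed task, so each retained TaskInfo ends up with its own
        // AccumulableInfo list referencing the full accumulator value.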
      }
    } catch {
      case NonFatal(e) =>
        // Log the class name to make it easy to find the bad implementation
        val accumClassName = AccumulatorContext.get(id) match {
          case Some(accum) => accum.getClass.getName
          case None => "Unknown class"
        }
        logError(
          s"Failed to update accumulator $id ($accumClassName) for task ${task.partitionId}",
          e)
    }
  }
}
From the source above: when an accumulator is registered via SparkContext's register(acc: AccumulatorV2[_, _], name: String), acc.name.isDefined is true, so every task completion prepends an AccumulableInfo to that task's TaskInfo. With many files, and therefore many tasks, these per-task lists add up until the TaskInfo objects retained on the driver exhaust its memory.
Summary
When registering an accumulator, prefer the following method:

def register(acc: AccumulatorV2[_, _]): Unit

rather than:

def register(acc: AccumulatorV2[_, _], name: String): Unit
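A minimal usage sketch, reusing the illustrative RecordCountAccumulator from above and assuming an existing SparkContext named sc:

val ingestCount = new RecordCountAccumulator

// Anonymous registration: acc.name.isDefined is false in updateAccumulators,
// so nothing is appended to any TaskInfo.
sc.register(ingestCount)

// Named registration (what triggered this OOM): every completed task would
// prepend an AccumulableInfo to its TaskInfo.
// sc.register(ingestCount, "ingestCount")

The trade-off is visibility: an unnamed accumulator does not appear in the Spark web UI, but its value can still be read on the driver via ingestCount.value.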