Problem Description
We recently used a custom Spark accumulator to monitor the volume of data ingested by our big-data platform (for how to define a custom accumulator, see AccumulatorV2.scala). Colleagues at a project site then reported that, when the number of files is large, the Spark driver runs out of memory (OOM), even though the site had configured 80 GB of driver memory.
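For context, here is a minimal sketch of the kind of monitoring accumulator involved; the class name and counting logic are illustrative assumptions, not our production implementation:

import org.apache.spark.util.AccumulatorV2

// Illustrative only: a simple counter of records (or bytes) processed.
class RecordCountAccumulator extends AccumulatorV2[Long, Long] {
  private var _count = 0L

  override def isZero: Boolean = _count == 0L

  override def copy(): RecordCountAccumulator = {
    val acc = new RecordCountAccumulator
    acc._count = _count
    acc
  }

  override def reset(): Unit = _count = 0L

  override def add(v: Long): Unit = _count += v

  override def merge(other: AccumulatorV2[Long, Long]): Unit = _count += other.value

  override def value: Long = _count
}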
Troubleshooting
The heap dump from the site was over 80 GB, so it could not be copied in full to an office machine; instead we analyzed it on a Linux host with the Memory Analyzer Tool (MAT) and generated a leak suspects report (leak_suspects). From the retained heap size and number of objects that MAT reported for TaskInfo, dividing one by the other gives an average of roughly 4.67 MB per TaskInfo object.
Examining the relevant part of the Spark TaskInfo source
class TaskInfo(
    val taskId: Long,
    /**
     * The index of this task within its task set. Not necessarily the same as the ID of the RDD
     * partition that the task is computing.
     */
    val index: Int,
    val attemptNumber: Int,
    val launchTime: Long,
    val executorId: String,
    val host: String,
    val taskLocality: TaskLocality.TaskLocality,
    val speculative: Boolean) {

  var gettingResultTime: Long = 0

  private[this] var _accumulables: Seq[AccumulableInfo] = Nil
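  // Note: _accumulables is the field that grows. As the DAGScheduler code
  // below shows, one AccumulableInfo per named accumulator is prepended to
  // this list every time a task completes.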
  var finishTime: Long = 0

  var failed = false

  var killed = false
}
From the code above, the only field that can account for that much space is the AccumulableInfo list (_accumulables); the remaining fields are small, fixed-size primitives and strings.
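For reference, each element of that list is an AccumulableInfo. Abridged from the Spark 2.x source (field order and visibility may vary slightly across versions), it carries the accumulator's name, the per-task update, and the full accumulated value:

case class AccumulableInfo private[spark] (
    id: Long,
    name: Option[String],
    update: Option[Any], // the per-task partial value
    value: Option[Any],  // the total accumulated value so far
    private[spark] val internal: Boolean,
    private[spark] val countFailedValues: Boolean,
    private[spark] val metadata: Option[String] = None)

For an accumulator whose value is a large collection (for example, per-file statistics), each entry pins a reference to that value, which is consistent with the ~4.67 MB average observed per TaskInfo.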
Examining the DAGScheduler source
private def updateAccumulators(event: CompletionEvent): Unit = {
  val task = event.task
  val stage = stageIdToStage(task.stageId)
  event.accumUpdates.foreach { updates =>
    val id = updates.id
    try {
      // Find the corresponding accumulator on the driver and update it
      val acc: AccumulatorV2[Any, Any] = AccumulatorContext.get(id) match {
        case Some(accum) => accum.asInstanceOf[AccumulatorV2[Any, Any]]
        case None =>
          throw new SparkException(s"attempted to access non-existent accumulator $id")
      }
      acc.merge(updates.asInstanceOf[AccumulatorV2[Any, Any]])
      // To avoid UI cruft, ignore cases where value wasn't updated
      if (acc.name.isDefined && !updates.isZero) {
        stage.latestInfo.accumulables(id) = acc.toInfo(None, Some(acc.value))
        event.taskInfo.setAccumulables(
          acc.toInfo(Some(updates.value), Some(acc.value)) +: event.taskInfo.accumulables)
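        // Note: this prepend runs once per named, non-zero accumulator for every
        // completed task, so each retained TaskInfo ends up with its own
        // AccumulableInfo list referencing the full accumulator value.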
      }
    } catch {
      case NonFatal(e) =>
        // Log the class name to make it easy to find the bad implementation
        val accumClassName = AccumulatorContext.get(id) match {
          case Some(accum) => accum.getClass.getName
          case None => "Unknown class"
        }
        logError(
          s"Failed to update accumulator $id ($accumClassName) for task ${task.partitionId}",
          e)
    }
  }
}
From the source above: when an accumulator is registered via SparkContext's register(acc: AccumulatorV2[_, _], name: String), acc.name.isDefined is true, so every task completion prepends an AccumulableInfo to that task's TaskInfo. With many files, and therefore many tasks, these per-task lists add up until the TaskInfo objects retained on the driver exhaust its memory.
Summary
When registering an accumulator, prefer the following method:

def register(acc: AccumulatorV2[_, _]): Unit

rather than:

def register(acc: AccumulatorV2[_, _], name: String): Unit
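A minimal usage sketch, reusing the illustrative RecordCountAccumulator from above and assuming an existing SparkContext named sc:

val ingestCount = new RecordCountAccumulator

// Anonymous registration: acc.name.isDefined is false in updateAccumulators,
// so nothing is appended to any TaskInfo.
sc.register(ingestCount)

// Named registration (what triggered this OOM): every completed task would
// prepend an AccumulableInfo to its TaskInfo.
// sc.register(ingestCount, "ingestCount")

The trade-off is visibility: an unnamed accumulator does not appear in the Spark web UI, but its value can still be read on the driver via ingestCount.value.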