Improper accumulator usage causing a Spark driver OOM

Problem description

We recently used a custom Spark accumulator to monitor how much data our big-data platform ingests and processes (for how to define a custom accumulator, see AccumulatorV2.scala in the Spark source). Colleagues at a customer site then reported that when the number of input files is very large, the Spark driver hits an OOM, even though the driver at that site is configured with 80 GB of memory.
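
For reference, a minimal sketch of such an accumulator is shown below. This is not our actual implementation (the class name and the record-count semantics are illustrative only); it just shows the shape of an AccumulatorV2 subclass that gets registered on the driver.

import org.apache.spark.util.AccumulatorV2

// Hypothetical example: a simple long counter built on AccumulatorV2.
// The real accumulator tracks ingested data volume per source, but it is
// defined and registered in the same way.
class RecordCountAccumulator extends AccumulatorV2[Long, Long] {
  private var _count: Long = 0L

  override def isZero: Boolean = _count == 0L

  override def copy(): RecordCountAccumulator = {
    val acc = new RecordCountAccumulator
    acc._count = _count
    acc
  }

  override def reset(): Unit = _count = 0L

  override def add(v: Long): Unit = _count += v

  override def merge(other: AccumulatorV2[Long, Long]): Unit = _count += other.value

  override def value: Long = _count
}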

Problem diagnosis

Because the heap dump at the site was over 80 GB, it could not be copied back to an office machine for full analysis. Instead, the Memory Analyzer Tool (MAT) was run on the Linux server to analyze the dump and produce a Leak Suspects report. The MAT result is shown below:
[Figure: MAT Leak Suspects report showing the retained heap size and number of TaskInfo objects]
From the retained heap size and number of objects reported there, one TaskInfo object retains about 4.67 MB of heap on average.

Examining the relevant part of the Spark TaskInfo source (abridged):

class TaskInfo(
    val taskId: Long,
    /**
     * The index of this task within its task set. Not necessarily the same as the ID of the RDD
     * partition that the task is computing.
     */
    val index: Int,
    val attemptNumber: Int,
    val launchTime: Long,
    val executorId: String,
    val host: String,
    val taskLocality: TaskLocality.TaskLocality,
    val speculative: Boolean) {

  var gettingResultTime: Long = 0

  private[this] var _accumulables: Seq[AccumulableInfo] = Nil

  def accumulables: Seq[AccumulableInfo] = _accumulables

  private[spark] def setAccumulables(newAccumulables: Seq[AccumulableInfo]): Unit = {
    _accumulables = newAccumulables
  }

  var finishTime: Long = 0

  var failed = false

  var killed = false
}

From the code above, the field that can dominate this object's footprint is clearly the AccumulableInfo sequence; the remaining fields are fixed-size primitives or short strings.
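
For reference, each element of that sequence is an AccumulableInfo. Abridged from the Spark 2.x source (constructor modifiers and annotations omitted), it looks roughly like this; note that update and value each carry a copy of the accumulator's value, so a large accumulator value gets duplicated into every completed task's TaskInfo:

// Abridged sketch of org.apache.spark.scheduler.AccumulableInfo (Spark 2.x).
case class AccumulableInfo(
    id: Long,
    name: Option[String],
    update: Option[Any],   // the per-task delta reported by the executor
    value: Option[Any],    // a snapshot of the accumulator's total value
    internal: Boolean,
    countFailedValues: Boolean,
    metadata: Option[String] = None)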

Examining the DAGScheduler source (updateAccumulators):

private def updateAccumulators(event: CompletionEvent): Unit = {
  val task = event.task
  val stage = stageIdToStage(task.stageId)

  event.accumUpdates.foreach { updates =>
    val id = updates.id
    try {
      // Find the corresponding accumulator on the driver and update it
      val acc: AccumulatorV2[Any, Any] = AccumulatorContext.get(id) match {
        case Some(accum) => accum.asInstanceOf[AccumulatorV2[Any, Any]]
        case None =>
          throw new SparkException(s"attempted to access non-existent accumulator $id")
      }
      acc.merge(updates.asInstanceOf[AccumulatorV2[Any, Any]])
      // To avoid UI cruft, ignore cases where value wasn't updated
      if (acc.name.isDefined && !updates.isZero) {
        stage.latestInfo.accumulables(id) = acc.toInfo(None, Some(acc.value))
        event.taskInfo.setAccumulables(
          acc.toInfo(Some(updates.value), Some(acc.value)) +: event.taskInfo.accumulables)
      } 
    } catch {
      case NonFatal(e) =>
        // Log the class name to make it easy to find the bad implementation
        val accumClassName = AccumulatorContext.get(id) match {
          case Some(accum) => accum.getClass.getName
          case None => "Unknown class"
        }
        logError(
          s"Failed to update accumulator $id ($accumClassName) for task ${task.partitionId}",
          e)
    }
  }
}

From this source, when an accumulator is registered through SparkContext's register(acc: AccumulatorV2[_, _], name: String), acc.name.isDefined is true, so on every task completion an AccumulableInfo (carrying the task's update and a snapshot of the accumulator's current value) is prepended to that task's TaskInfo.accumulables. With a very large number of tasks, and one AccumulableInfo per named accumulator per task, the TaskInfo objects retained on the driver end up consuming far too much memory.

Summary

When registering an accumulator, prefer the unnamed overload:
    def register(acc: AccumulatorV2[_, _]): Unit
rather than the named one:
    def register(acc: AccumulatorV2[_, _], name: String): Unit
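
A minimal usage sketch of the two overloads (sc and RecordCountAccumulator refer to the hypothetical example above):

val sc: org.apache.spark.SparkContext = ???   // the application's SparkContext

val acc = new RecordCountAccumulator

// Preferred: no name, so the acc.name.isDefined branch in updateAccumulators
// is skipped and nothing is added to each task's TaskInfo.accumulables.
sc.register(acc)

// Avoid when a job runs a huge number of tasks: the name causes every
// completed task to carry an AccumulableInfo snapshot of this accumulator
// in the driver's TaskInfo objects.
// sc.register(acc, "ingestedRecords")

The trade-off is that an unnamed accumulator does not show up in the Spark UI, but its value can still be read on the driver via acc.value, which is all our monitoring needed.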
