Lesson 16: Spark Streaming Source Code Walkthrough: The Internals of Data Cleanup

Topics in this lesson:

Why Spark Streaming cleans up data, and what that cleanup looks like
A walkthrough of the Spark Streaming data-cleanup code

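After so many lessons dissecting Spark Streaming, it becomes increasingly clear that Spark Streaming is just an application built on Spark Core, so mastering Spark Streaming is a real help in learning how to write Spark applications.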

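Spark Streaming differs from an ordinary Spark Core application. A Spark Core application's data lives in an underlying storage system such as HDFS, whereas Spark Streaming runs continuously and keeps computing. Every second of operation produces large numbers of objects such as accumulators and broadcast variables, so objects and metadata have to be cleaned up periodically. Each batch duration keeps triggering jobs, and the RDDs and metadata those jobs leave behind need to be cleaned up afterwards. In client mode the cleanup messages are visible in the printed log, and they also appear in the log files.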

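Spark runs on the JVM. The JVM creates objects and has to reclaim them; if object creation and reclamation (GC) were left unmanaged, the JVM's memory would quickly be exhausted. What we study here is Spark Streaming's own "GC". Spark Streaming's management of RDD data and metadata is comparable to the JVM's management of garbage collection. Data and metadata are produced while operating on DStreams, so understanding how they are reclaimed means studying how DStreams create and release them.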

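Data enters through InputDStream. Input, transformation, and output: the whole life cycle is built on DStream. DStream is responsible for the life cycle of RDDs. RDDs are produced by DStreams, and operations on RDDs are expressed as operations on DStreams, repeated once every batchDuration, so studying operations on RDDs means studying operations on DStreams. Taking the Kafka direct approach as an example, DirectKafkaInputDStream produces a KafkaRDD: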

override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
  val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
  val rdd = KafkaRDD[K, V, U, T, R](
    context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

  // Report the record number and metadata of this batch interval to InputInfoTracker.
  val offsetRanges = currentOffsets.map { case (tp, fo) =>
    val uo = untilOffsets(tp)
    OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }
  val description = offsetRanges.filter { offsetRange =>
    // Don't display empty ranges.
    offsetRange.fromOffset != offsetRange.untilOffset
  }.map { offsetRange =>
    s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
      s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
  }.mkString("\n")
  // Copy offsetRanges to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "offsets" -> offsetRanges.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
  val inputInfo = StreamInputInfo(id, rdd.count, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

  currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
  Some(rdd)
}

As time advances, a DStream periodically produces data and periodically releases it. JobGenerator has a timer:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

JobGenerator also has an EventLoop that keeps receiving message events, which processEvent dispatches:

/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}

Among these handlers are the ones that clear metadata and clear checkpoint data. clearMetadata clears the metadata:

/** Clear DStream metadata for the given `time`. */
private def clearMetadata(time: Time) {
  ssc.graph.clearMetadata(time)

  // If checkpointing is enabled, then checkpoint,
  // else mark batch to be fully processed
  if (shouldCheckpoint) {
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
  } else {
    // If checkpointing is not enabled, then delete metadata information about
    // received blocks (block data not saved in any case). Otherwise, wait for
    // checkpointing of this batch to complete.
    val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
    jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
    jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
    markBatchFullyProcessed(time)
  }
}
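The threshold passed to both cleanup calls above is simply the batch time minus the largest remember duration among the input streams. A minimal sketch of that arithmetic, using plain millisecond values instead of Spark's Time and Duration types (the object and method names here are hypothetical, not Spark APIs):

```scala
object CleanupThresholdSketch {
  /** Largest remember duration (ms) among the input streams; the list must be non-empty. */
  def maxRememberDuration(inputStreamRememberMs: Seq[Long]): Long = {
    require(inputStreamRememberMs.nonEmpty, "at least one input stream is assumed")
    inputStreamRememberMs.max
  }

  /** Threshold time (ms) handed to the block/batch and input-info cleanup calls. */
  def cleanupThreshold(batchTimeMs: Long, inputStreamRememberMs: Seq[Long]): Long =
    batchTimeMs - maxRememberDuration(inputStreamRememberMs)
}
```

For instance, with two input streams remembering 2 s and 6 s, the cleanup for the batch at 10 s uses a threshold of 4 s, so the longest-remembering stream decides how far back information is kept.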

DStreamGraph: it first clears the output DStreams, which in practice are ForEachDStream instances.

def clearMetadata(time: Time) {
  logDebug("Clearing metadata for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearMetadata(time))
  }
  logDebug("Cleared old metadata for time " + time)
}

DStream.clearMetadata: besides releasing the old RDDs, it also clears metadata. If you want RDDs to be kept across batch durations, you can set rememberDuration.

/**
 * Clear metadata that are older than `rememberDuration` of this DStream.
 * This is an internal method that should not be called directly. This default
 * implementation clears the old generated RDDs. Subclasses of DStream may override
 * this to clear their own metadata along with the generated RDDs.
 */
private[streaming] def clearMetadata(time: Time) {
  val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true)
  // rememberDuration is the retention window: select the RDDs that are older than it
  val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))
  logDebug("Clearing references to old RDDs: [" +
    oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]")
  // Remove the old keys from generatedRDDs
  generatedRDDs --= oldRDDs.keys
  if (unpersistData) {
    logDebug("Unpersisting old RDDs: " + oldRDDs.values.map(_.id).mkString(", "))
    oldRDDs.values.foreach { rdd =>
      rdd.unpersist(false)
      // Explicitly remove blocks of BlockRDD
      rdd match {
        case b: BlockRDD[_] =>
          logInfo("Removing blocks of RDD " + b + " of time " + time)
          b.removeBlocks() // remove the RDD's data
        case _ =>
      }
    }
  }
  logDebug("Cleared " + oldRDDs.size + " RDDs that were older than " +
    (time - rememberDuration) + ": " + oldRDDs.keys.mkString(", "))
  // The DStreams this one depends on must be cleaned up as well
  dependencies.foreach(_.clearMetadata(time))
}
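The core of clearMetadata is the filter on generatedRDDs: every entry whose batch time is at or before time - rememberDuration is dropped. The following is a minimal sketch of just that rule, with no Spark dependency. GeneratedRddsSketch, register, and remaining are hypothetical names, batch times are plain milliseconds, and an Int stands in for an RDD.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a DStream's generatedRDDs map: batch time (ms) -> RDD id.
final class GeneratedRddsSketch(rememberDurationMs: Long) {
  private val generated = mutable.HashMap.empty[Long, Int]

  def register(batchTimeMs: Long, rddId: Int): Unit =
    generated(batchTimeMs) = rddId

  /** Drops entries with batch time <= (now - rememberDuration); returns the dropped times, sorted. */
  def clearMetadata(nowMs: Long): Seq[Long] = {
    val old = generated.keys.filter(_ <= nowMs - rememberDurationMs).toSeq.sorted
    generated --= old
    old
  }

  /** Batch times still referenced, sorted. */
  def remaining: Seq[Long] = generated.keys.toSeq.sorted
}
```

With a 1 s batch interval and a 2 s remember duration, calling clearMetadata at time 4000 drops the batches at 1000 and 2000 and keeps 3000 and 4000. Note that the comparison is inclusive, as in the excerpt above.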

In BlockRDD, removeBlocks asks the BlockManagerMaster to remove each block by its blockId. Removing blocks is irreversible.

/**
 * Remove the data blocks that this BlockRDD is made from. NOTE: This is an
 * irreversible operation, as the data in the blocks cannot be recovered back
 * once removed. Use it with caution.
 */
private[spark] def removeBlocks() {
  blockIds.foreach { blockId =>
    sparkContext.env.blockManager.master.removeBlock(blockId)
  }
  _isValid = false
}
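The excerpt shows removeBlocks setting _isValid to false after the blocks are gone. A minimal sketch of why such a flag is useful, assuming it is checked before any later read (that check is not shown in the excerpt, and BlockBackedSketch is a hypothetical class, not Spark's BlockRDD):

```scala
// Hypothetical stand-in for a block-backed RDD; block ids map to in-memory data.
final class BlockBackedSketch(initialBlocks: Map[String, Seq[Int]]) {
  private var blocks: Map[String, Seq[Int]] = initialBlocks
  private var valid: Boolean = true

  def isValid: Boolean = valid

  /** Reads every block in block-id order; refuses once the blocks have been removed. */
  def compute(): Seq[Int] = {
    if (!valid) throw new IllegalStateException("blocks have been removed")
    blocks.toSeq.sortBy(_._1).flatMap(_._2)
  }

  /** Irreversible: drops the data and marks the object unusable. */
  def removeBlocks(): Unit = {
    blocks = Map.empty
    valid = false
  }
}
```

Failing fast on an invalidated object makes the irreversibility explicit: a later read reports an error instead of silently returning empty data.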

Back in JobGenerator's processEvent, look at clearCheckpointData, which clears checkpoint data:

/** Clear DStream checkpoint data for the given `time`. */
private def clearCheckpointData(time: Time) {
  ssc.graph.clearCheckpointData(time)

  // All the checkpoint information about which batches have been processed, etc have
  // been saved to checkpoints, so its safe to delete block metadata and data WAL files
  val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
  jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
  jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
  markBatchFullyProcessed(time)
}

DStreamGraph.clearCheckpointData:

def clearCheckpointData(time: Time) {
  logInfo("Clearing checkpoint data for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearCheckpointData(time))
  }
  logInfo("Cleared checkpoint data for time " + time)
}

DStream.clearCheckpointData: as with clearing metadata, it also clears the checkpoint data of the DStreams this one depends on.

private[streaming] def clearCheckpointData(time: Time) {
  logDebug("Clearing checkpoint data")
  checkpointData.cleanup(time)
  dependencies.foreach(_.clearCheckpointData(time))
  logDebug("Cleared checkpoint data")
}

DStreamCheckpointData.cleanup: deletes the old checkpoint files.

/**
 * Cleanup old checkpoint data. This gets called after a checkpoint of `time` has been
 * written to the checkpoint directory.
 */
def cleanup(time: Time) {
  // Get the time of the oldest checkpointed RDD that was written as part of the
  // checkpoint of `time`
  timeToOldestCheckpointFileTime.remove(time) match {
    case Some(lastCheckpointFileTime) =>
      // Find all the checkpointed RDDs (i.e. files) that are older than `lastCheckpointFileTime`
      // This is because checkpointed RDDs older than this are not going to be needed
      // even after master fails, as the checkpoint data of `time` does not refer to those files
      val filesToDelete = timeToCheckpointFile.filter(_._1 < lastCheckpointFileTime)
      logDebug("Files to delete:\n" + filesToDelete.mkString(","))
      filesToDelete.foreach {
        case (time, file) =>
          try {
            val path = new Path(file)
            if (fileSystem == null) {
              fileSystem = path.getFileSystem(dstream.ssc.sparkContext.hadoopConfiguration)
            }
            fileSystem.delete(path, true)
            timeToCheckpointFile -= time
            logInfo("Deleted checkpoint file '" + file + "' for time " + time)
          } catch {
            case e: Exception =>
              logWarning("Error deleting old checkpoint file '" + file + "' for time " + time, e)
              fileSystem = null
          }
      }
    case None =>
      logDebug("Nothing to delete")
  }
}
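The selection rule in cleanup is easy to miss among the file-system calls: for the checkpoint written at `time`, look up the oldest checkpoint file that checkpoint still refers to, then delete every tracked file strictly older than it. A minimal sketch of that rule follows, with file deletion replaced by removing map entries. CheckpointFilesSketch is a hypothetical class, and its update method is an assumption about how the two maps get populated, which the excerpt above does not show.

```scala
import scala.collection.mutable

// Hypothetical model of the two maps used by cleanup; times are plain milliseconds.
final class CheckpointFilesSketch {
  private val timeToCheckpointFile = mutable.HashMap.empty[Long, String]
  private val timeToOldestCheckpointFileTime = mutable.HashMap.empty[Long, Long]

  /** Assumed bookkeeping: records the files referenced by the checkpoint written at `time`. */
  def update(time: Long, files: Map[Long, String]): Unit = {
    if (files.nonEmpty) {
      timeToCheckpointFile ++= files
      timeToOldestCheckpointFileTime(time) = files.keys.min
    }
  }

  /** Returns the files that would be deleted, ordered by their batch time. */
  def cleanup(time: Long): Seq[String] =
    timeToOldestCheckpointFileTime.remove(time) match {
      case Some(oldest) =>
        // Only files strictly older than the oldest one still referenced are removed.
        val doomed = timeToCheckpointFile.filter(_._1 < oldest).toSeq.sortBy(_._1)
        timeToCheckpointFile --= doomed.map(_._1)
        doomed.map(_._2)
      case None => Seq.empty
    }
}
```

If the checkpoint at 5000 refers to files from 2000 and 4000 while a file from 1000 is still tracked, cleanup(5000) removes only the file from 1000. A second call for the same time finds no entry and deletes nothing, matching the "Nothing to delete" branch.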

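At this point we know how old data is cleaned up and what gets cleaned. But when is the cleanup triggered? When a job is finally submitted, it is handed to a JobHandler for execution: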

private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    try {
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

      // We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to null.
      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run()
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          // When the job completes, post JobCompleted; the event loop's onReceive handles it
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
        }
      } else {
        // JobScheduler has been stopped.
      }
    } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
    }
  }
}

The JobCompleted message is received by the onReceive method of the EventLoop that JobScheduler.start creates:

def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()
  // ... (the rest of start() is omitted from this excerpt)
}

JobScheduler.processEvent:

private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}

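Once every job in the job set has completed, handleJobCompletion calls JobGenerator's onBatchCompletion method, which starts the metadata cleanup: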

private def handleJobCompletion(job: Job, completedTime: Long) {
  val jobSet = jobSets.get(job.time)
  jobSet.handleJobCompletion(job)
  job.setEndTime(completedTime)
  listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
  logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
  if (jobSet.hasCompleted) {
    jobSets.remove(jobSet.time)
    jobGenerator.onBatchCompletion(jobSet.time)
    logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
      jobSet.totalDelay / 1000.0, jobSet.time.toString,
      jobSet.processingDelay / 1000.0
    ))
    listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
  }
  job.result match {
    case Failure(e) =>
      reportError("Error running job " + job, e)
    case _ =>
  }
}

This shows when the cleanup of old data is triggered.
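The whole trigger chain can be replayed as a sequence of events. The sketch below is a simplified model rather than Spark code. It assumes that onBatchCompletion posts a ClearMetadata event, that a DoCheckpoint with clearCheckpointDataLater = true is eventually followed by a ClearCheckpointData event, and that checkpointing is reduced to a single boolean. CleanupTriggerSketch and afterBatchCompleted are hypothetical names.

```scala
import scala.collection.mutable

object CleanupTriggerSketch {
  sealed trait Event
  final case class ClearMetadata(time: Long) extends Event
  final case class DoCheckpoint(time: Long, clearCheckpointDataLater: Boolean) extends Event
  final case class ClearCheckpointData(time: Long) extends Event

  /** Replays the events that follow a completed batch; returns them in processing order. */
  def afterBatchCompleted(time: Long, checkpointingEnabled: Boolean): List[Event] = {
    // Assumption: this first event is what onBatchCompletion posts.
    val queue = mutable.Queue[Event](ClearMetadata(time))
    val processed = mutable.ListBuffer.empty[Event]
    while (queue.nonEmpty) {
      val event = queue.dequeue()
      processed += event
      event match {
        case ClearMetadata(t) if checkpointingEnabled =>
          // Mirrors the shouldCheckpoint branch of JobGenerator.clearMetadata.
          queue.enqueue(DoCheckpoint(t, clearCheckpointDataLater = true))
        case DoCheckpoint(t, true) =>
          // Assumption: stands in for the callback fired once the checkpoint is written.
          queue.enqueue(ClearCheckpointData(t))
        case _ =>
          // Block and batch info cleanup would happen here.
      }
    }
    processed.toList
  }
}
```

Without checkpointing, a completed batch leads to a single ClearMetadata event, which also performs the block cleanup. With checkpointing, the block cleanup is deferred until ClearCheckpointData, after the checkpoint has been saved.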

Notes:

1. DT Big Data Dream Factory (DT大數據夢工廠) WeChat public account: DT_Spark
2. IMF 8 p.m. big data hands-on YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains
