spark-core_20: Source code analysis of MapOutputTrackerMaster, MapOutputTracker, MapOutputTrackerMasterEndpoint, etc.

1. SparkEnv.create() initializes MapOutputTrackerMaster (which records the output information of ShuffleMapTasks)

val mapOutputTracker = if (isDriver) {
  /* MapOutputTrackerMaster belongs to the driver and uses a TimeStampedHashMap to track map
   * output information, so stale entries can be cleaned up.
   * I. What MapOutputTracker does:
   * 1. Records the map output information so that reducers can locate the data they need.
   * 2. Every mapper and reducer has its own unique id (mapId, reduceId).
   * 3. A reducer may depend on the output of many maps; fetching the blocks of each map is the
   *    shuffle, and each shuffle has its own shuffleId.
   */
  new MapOutputTrackerMaster(conf)
} else {
  // Runs inside the executors
  new MapOutputTrackerWorker(conf)
}

// Have to assign trackerActor after initialization as MapOutputTrackerActor
// requires the MapOutputTracker itself
// Assigns a MapOutputTrackerMasterEndpoint to the trackerEndpoint member of MapOutputTracker;
// MapOutputTracker.ENDPOINT_NAME is "MapOutputTracker"
mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
  new MapOutputTrackerMasterEndpoint(
    rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))

2. What the initialization of MapOutputTrackerMaster does

a. Data locality for reduce tasks is enabled by default: spark.shuffle.reduceLocality.enabled = true

b. Locality preferences are only computed when the number of map tasks stays below SHUFFLE_PREF_MAP_THRESHOLD (1000), which keeps the computation cheap

c. A tunable spot: the fraction of the total map output that must sit at one location for it to count as a preferred location, private val REDUCER_PREF_LOCS_FRACTION = 0.2 (a standalone sketch of how a-c combine appears after this list)

d. On the driver: mapStatuses stores, per shuffle, the block manager address of each map output together with the output sizes for each reducer; both mapStatuses and the cached serialized statuses (TimeStampedHashMap[Int, Array[Byte]]) live in timestamp-based hash maps

e. Another tunable spot: a new MetadataCleaner() is created to clean up mapStatuses and the cached serialized statuses, but by default nothing is cleaned unless spark.cleaner.ttl is set
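To make points a-c concrete, here is a standalone sketch (names and structure are mine, not Spark's actual code) of the decision they describe: skip locality computation entirely when the job is too large, otherwise prefer the hosts that already hold at least REDUCER_PREF_LOCS_FRACTION of the map output for a reduce partition.

import scala.collection.immutable.Map

object ReduceLocalitySketch extends App {
  val SHUFFLE_PREF_MAP_THRESHOLD = 1000
  val SHUFFLE_PREF_REDUCE_THRESHOLD = 1000
  val REDUCER_PREF_LOCS_FRACTION = 0.2

  /** sizesByHost: bytes of map output for one reduce partition, grouped by host. */
  def preferredHosts(numMaps: Int, numReduces: Int, sizesByHost: Map[String, Long]): Seq[String] = {
    if (numMaps > SHUFFLE_PREF_MAP_THRESHOLD || numReduces > SHUFFLE_PREF_REDUCE_THRESHOLD) {
      Seq.empty // too many tasks: computing locality by size would be too expensive, so skip it
    } else {
      val total = sizesByHost.values.sum.toDouble
      sizesByHost.collect {
        case (host, bytes) if total > 0 && bytes / total >= REDUCER_PREF_LOCS_FRACTION => host
      }.toSeq
    }
  }

  // host-a holds ~76% of the output and qualifies; host-b (~19%) and host-c do not.
  println(preferredHosts(10, 10, Map("host-a" -> 80L, "host-b" -> 20L, "host-c" -> 5L)))
}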

/**
 * MapOutputTracker for the driver. This uses TimeStampedHashMap to keep track of map
 * output information, which allows old output information to be dropped based on a TTL.
 *
 * MapOutputTrackerMaster belongs to the driver; it uses a TimeStampedHashMap to track map output
 * information so that stale entries can be removed, and it prepares all the map outputs each
 * shuffle needs so they can be served to the reducers quickly.
 *
 * I. What MapOutputTracker (the parent of MapOutputTrackerMaster and MapOutputTrackerWorker) does:
 * 1. Records the map output information so that reducers can locate the data they need.
 * 2. Every mapper and reducer has its own unique id (mapId, reduceId).
 * 3. A reducer may depend on the output of many maps; fetching the blocks of each map is the
 *    shuffle, and each shuffle has its own shuffleId.
 *
 * II. MapOutputTrackerMaster and MapOutputTrackerWorker (which runs in the executors) both
 *     extend MapOutputTracker.
 * 1. MapOutputTrackerMaster records the map outputs of the ShuffleMapTasks of each stage:
 *    a. Before reading shuffle files, the shuffle reader asks MapOutputTrackerMaster where the
 *       data it has to process lives.
 *    b. MapOutputTracker replies with a list of map output locations (address, port, etc.).
 * 2. MapOutputTrackerWorker is only a cache used while running shuffle computations.
 */
private[spark] class MapOutputTrackerMaster(conf: SparkConf)
  extends MapOutputTracker(conf) {

  /** Cache a serialized version of the output statuses for each shuffle to send them out faster */
  private var cacheEpoch = epoch

  /** Whether to compute locality preferences for reduce tasks */
  private val shuffleLocalityEnabled = conf.getBoolean("spark.shuffle.reduceLocality.enabled", true)

  // Number of map and reduce tasks above which we do not assign preferred locations based on map
  // output sizes. We limit the size of jobs for which we assign preferred locations as computing
  // the top locations by size becomes expensive.
  private val SHUFFLE_PREF_MAP_THRESHOLD = 1000
  // NOTE: This should be less than 2000 as we use HighlyCompressedMapStatus beyond that
  private val SHUFFLE_PREF_REDUCE_THRESHOLD = 1000

  // Fraction of total map output that must be at a location for it to be considered as a preferred
  // location for a reduce task. Making this larger will focus on fewer locations where most data
  // can be read locally, but may lead to more delay in scheduling if those locations are busy.
  private val REDUCER_PREF_LOCS_FRACTION = 0.2

  /**
   * Timestamp based HashMap for storing mapStatuses and cached serialized statuses in the driver,
   * so that statuses are dropped only by explicit de-registering or by TTL-based cleaning (if set).
   * Other than these two scenarios, nothing should be dropped from this HashMap.
   *
   * Each value stored in the TimeStampedHashMap carries an insertion timestamp. A MapStatus is the
   * object a ShuffleMapTask returns through the DAGScheduler: it holds the address of the block
   * manager the task ran on and the output size for each reducer, and is passed on to the reduce tasks.
   */
  protected val mapStatuses = new TimeStampedHashMap[Int, Array[MapStatus]]()
  private val cachedSerializedStatuses = new TimeStampedHashMap[Int, Array[Byte]]()

  // For cleaning up TimeStampedHashMaps: periodically removes entries from mapStatuses and
  // cachedSerializedStatuses. Nothing is cleaned unless spark.cleaner.ttl is set, which is a spot worth tuning.
  private val metadataCleaner =
    new MetadataCleaner(MetadataCleanerType.MAP_OUTPUT_TRACKER, this.cleanup, conf)
  // Registers a new shuffle in the map output collection mapStatuses, given the shuffle id and the number of maps
  def registerShuffle(shuffleId: Int, numMaps: Int) {
    if (mapStatuses.put(shuffleId, new Array[MapStatus](numMaps)).isDefined) {
      throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
    }
  }
  // Looks up the Array[MapStatus] for the given shuffle id in the TimeStampedHashMap[Int, Array[MapStatus]]
  // and stores the MapStatus at the index of the given map task
  def registerMapOutput(shuffleId: Int, mapId: Int, status: MapStatus) {
    //mapStatuses:  TimeStampedHashMap[Int, Array[MapStatus]]()
    val array = mapStatuses(shuffleId)
    array.synchronized {
      array(mapId) = status
    }
  }

...
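A hypothetical usage sketch with simplified types (FakeMapStatus stands in for MapStatus; this is not Spark's API) showing how the two register methods above fit together: registerShuffle reserves one slot per map task and rejects a duplicate id, and registerMapOutput fills in a slot once a ShuffleMapTask completes.

import scala.collection.mutable

object RegisterSketch extends App {
  case class FakeMapStatus(host: String, sizes: Array[Long])
  private val mapStatuses = mutable.Map.empty[Int, Array[FakeMapStatus]]

  def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
    // put returns the previous value, so a defined result means the shuffle id was already taken
    if (mapStatuses.put(shuffleId, new Array[FakeMapStatus](numMaps)).isDefined)
      throw new IllegalArgumentException(s"Shuffle ID $shuffleId registered twice")
  }

  def registerMapOutput(shuffleId: Int, mapId: Int, status: FakeMapStatus): Unit = {
    val array = mapStatuses(shuffleId)
    array.synchronized { array(mapId) = status } // one slot per map task
  }

  registerShuffle(shuffleId = 0, numMaps = 2)
  registerMapOutput(0, 0, FakeMapStatus("host-a", Array(10L, 20L)))
  registerMapOutput(0, 1, FakeMapStatus("host-b", Array(5L, 15L)))
  println(mapStatuses(0).map(_.host).mkString(", "))
}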

3. A look at what MapOutputTracker does:

a. It exposes the trackerEndpoint member so that SparkEnv can assign the MapOutputTrackerMasterEndpoint to it during initialization:

mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(MapOutputTracker.ENDPOINT_NAME,
  new MapOutputTrackerMasterEndpoint(
    rpcEnv, mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], conf))

b. The mapStatuses map behaves differently on the driver and on the executors:
1) On the driver it serves as the source of the map outputs recorded from ShuffleMapTasks.
2) On the executors it is only a cache; a miss triggers a fetch of the HashMap data from the driver.

c. epoch is initialized to 0 and incremented every time a fetch fails, so client nodes know to clear their cached map output locations.

d. private val fetching = new HashSet[Int] remembers which map output locations are currently being fetched on an executor (a simplified sketch of this pattern follows below).
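A minimal standalone sketch (simplified, with assumed names; not Spark's actual implementation) of the coordination pattern behind the fetching set in point d: when several tasks on one executor miss the cache for the same shuffleId, only one thread goes to the driver, and the others wait on the set and then re-read the local cache.

import scala.collection.mutable

class StatusCacheSketch[S](fetchFromDriver: Int => Array[S]) {
  private val cache = mutable.Map.empty[Int, Array[S]]
  private val fetching = mutable.HashSet.empty[Int]

  def getStatuses(shuffleId: Int): Array[S] = {
    cache.synchronized { cache.get(shuffleId) } match {
      case Some(statuses) => statuses
      case None =>
        fetching.synchronized {
          // Another thread may already be fetching this shuffleId: wait until it finishes,
          // then check the cache again before deciding to fetch ourselves.
          while (fetching.contains(shuffleId)) fetching.wait()
          cache.synchronized { cache.get(shuffleId) } match {
            case Some(statuses) => return statuses
            case None => fetching += shuffleId // we are the thread that performs the fetch
          }
        }
        try {
          val statuses = fetchFromDriver(shuffleId)
          cache.synchronized { cache(shuffleId) = statuses }
          statuses
        } finally {
          fetching.synchronized { fetching -= shuffleId; fetching.notifyAll() }
        }
    }
  }
}

In the real MapOutputTracker the fetch itself goes through trackerEndpoint by sending GetMapOutputStatuses(shuffleId) and deserializing the reply; the sketch keeps only the thread-coordination part.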

/**
 * Class that keeps track of the location of the map output of a stage. This is abstract because
 * different versions of MapOutputTracker (driver and executor) use different HashMaps to store
 * their metadata.
 *
 * 1. MapOutputTracker tracks the map outputs of each stage. It is abstract because the driver and
 *    the executors use different HashMaps to store the metadata: on the master it records where
 *    the map outputs needed by the ShuffleMapTasks live; on the workers it is only a cache used
 *    while running shuffle computations.
 *
 * 2. MapOutputTrackerMaster and MapOutputTrackerWorker both extend MapOutputTracker.
 *    A summary from the community:
 *    MapOutputTracker is one of the key components created when SparkEnv is initialized, and it
 *    follows a master-slave structure. It tracks where each ShuffleMapTask writes its output.
 *    Before reading shuffle files, the shuffle reader asks MapOutputTrackerMaster where the data
 *    it has to process lives; MapOutputTracker replies with a list of locations (address, port,
 *    etc.), and the shuffle reader then reads the files and continues processing.
 */

private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging {

 
  /** Set to the MapOutputTrackerMasterEndpoint living on the driver.
    * trackerEndpoint holds the MapOutputTrackerMasterEndpoint; it is assigned in SparkEnv.create
    * right after MapOutputTrackerMaster is instantiated:
    * mapOutputTracker.trackerEndpoint = registerOrLookupEndpoint(.., new MapOutputTrackerMasterEndpoint(
    *   ..., mapOutputTracker.asInstanceOf[MapOutputTrackerMaster], ..)) */
  var trackerEndpoint: RpcEndpointRef = _

  /**
   * This HashMap has different behavior for the driver and the executors.
   *
   * On the driver, it serves as the source of map outputs recorded from ShuffleMapTasks.
   * On the executors, it simply serves as a cache, in which a miss triggers a fetch from the
   * driver's corresponding HashMap.
   *
   * Note: because mapStatuses is accessed concurrently, subclasses should make sure it's a
   * thread-safe map.
   */
  protected val mapStatuses: Map[Int, Array[MapStatus]]

 
  /**
   * Incremented every time a fetch fails so that client nodes know to clear
   * their cache of map output locations if this happens.
   */
  protected var epoch: Long = 0
  protected val epochLock = new AnyRef

  /** Remembers which map output locations are currently being fetched on an executor. */
  private val fetching = new HashSet[Int]

...

 

5. MapOutputTrackerMasterEndpoint does little at initialization; it only enforces that mapOutputStatuses.length must not exceed the maximum frame size (128MB), otherwise an error is reported. (In version 2.2 this is no longer implemented this way: the endpoint simply builds a GetMapOutputMessage(shuffleId, RpcCallContext) and hands it to MapOutputTrackerMaster.)

/** RpcEndpoint class for MapOutputTrackerMaster.
  * A subclass of RpcEndpoint, usable from multiple threads; it receives the MapOutputTrackerMaster
  * in its constructor and is created during SparkEnv.create. */
private[spark] class MapOutputTrackerMasterEndpoint(
    override val rpcEnv: RpcEnv, tracker: MapOutputTrackerMaster, conf: SparkConf)
  extends RpcEndpoint with Logging {
  // The configured maximum Akka frame size in bytes; maxFrameSizeBytes returns 128MB here
  val maxAkkaFrameSize = AkkaUtils.maxFrameSizeBytes(conf)

  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case GetMapOutputStatuses(shuffleId: Int) =>
      val hostPort = context.senderAddress.hostPort
      logInfo("Asked to send map output locations for shuffle " + shuffleId + " to " + hostPort)
      val mapOutputStatuses = tracker.getSerializedMapOutputStatuses(shuffleId)
      val serializedSize = mapOutputStatuses.length
      if (serializedSize > maxAkkaFrameSize) {
        val msg = s"Map output statuses were $serializedSize bytes which " +
          s"exceeds spark.akka.frameSize ($maxAkkaFrameSize bytes)."

        /* For SPARK-1244 we'll opt for just logging an error and then sending it to the sender.
         * A bigger refactoring (SPARK-1239) will ultimately remove this entire code path. */
        val exception = new SparkException(msg)
        logError(msg, exception)
        context.sendFailure(exception)
      } else {
        context.reply(mapOutputStatuses)
      }

6. What MetadataCleaner does at initialization; this class is also used by BlockManager.

/**
 * Runs a timer task to periodically clean up metadata (e.g. old files or hashtable entries).
 * When reached from SparkEnv initialization it corresponds to MetadataCleanerType.MAP_OUTPUT_TRACKER.
 */

private[spark] class MetadataCleaner(
    cleanerType: MetadataCleanerType.MetadataCleanerType,
    cleanupFunc: (Long) => Unit,
    conf: SparkConf)
  extends Logging {

  val name = cleanerType.toString
  // On the path taken here, getDelaySeconds looks up spark.cleaner.ttl.MAP_OUTPUT_TRACKER and returns -1
  private val delaySeconds = MetadataCleaner.getDelaySeconds(conf, cleanerType)
  // math.max(10, -1 / 10) ==> 10s
  private val periodSeconds = math.max(10, delaySeconds / 10)
  private val timer = new Timer(name + " cleanup timer", true)

  private val task = new TimerTask {
    override def run() {
      try {
        cleanupFunc(System.currentTimeMillis() - (delaySeconds * 1000))
        logInfo("Ran metadata cleaner for " + name)
      } catch {
        case e: Exception => logError("Error running cleanup task for " + name, e)
      }
    }
  }

  // spark.cleaner.ttl defaults to -1, so spark.cleaner.ttl.MAP_OUTPUT_TRACKER also resolves to -1
  // and delaySeconds is -1: by default the timer never cleans anything.
  if (delaySeconds > 0) {
    logDebug(
      "Starting metadata cleaner for " + name + " with delay of " + delaySeconds + " seconds " +
        "and period of " + periodSeconds + " secs")
    timer.schedule(task, delaySeconds * 1000, periodSeconds * 1000)
  }

  def cancel() {
    timer.cancel()
  }
}

private[spark] object MetadataCleanerType extends Enumeration {

  val MAP_OUTPUT_TRACKER, SPARK_CONTEXT, HTTP_BROADCAST, BLOCK_MANAGER,
    SHUFFLE_BLOCK_MANAGER, BROADCAST_VARS = Value

  type MetadataCleanerType = Value

  // When reached from SparkEnv initialization this is MetadataCleanerType.MAP_OUTPUT_TRACKER,
  // which yields the property name spark.cleaner.ttl.MAP_OUTPUT_TRACKER
  def systemProperty(which: MetadataCleanerType.MetadataCleanerType): String = {
    "spark.cleaner.ttl." + which.toString
  }
}

// TODO: This mutates a Conf to set properties right now, which is kind of ugly when used in the
// initialization of StreamingContext. It's okay for users trying to configure stuff themselves.
private[spark] object MetadataCleaner {
  /** spark.cleaner.ttl:
    * The duration (in seconds) for which Spark remembers any metadata (stages generated, tasks
    * generated, etc.). Periodic cleanups ensure that metadata older than this duration is forgotten.
    * This is useful when running Spark for many hours or days (for example, a 24/7 Spark Streaming
    * application). Note that any RDD persisted for longer than this duration will be cleared as well.
    * The default value is -1.
    */
  def getDelaySeconds(conf: SparkConf): Int = {
    conf.getTimeAsSeconds("spark.cleaner.ttl", "-1").toInt
  }

  def getDelaySeconds(
      conf: SparkConf,
      cleanerType: MetadataCleanerType.MetadataCleanerType): Int = {
    // On the path taken here, this looks up spark.cleaner.ttl.MAP_OUTPUT_TRACKER and returns -1
    conf.get(MetadataCleanerType.systemProperty(cleanerType), getDelaySeconds(conf).toString).toInt
  }

  def setDelaySeconds(
      conf: SparkConf,
      cleanerType: MetadataCleanerType.MetadataCleanerType,
      delay: Int) {
    conf.set(MetadataCleanerType.systemProperty(cleanerType), delay.toString)
  }
}

===> If spark.cleaner.ttl is set, the cleanup method below gets called

/**
 * Clears, at the given cutoff time, the entries of mapStatuses: TimeStampedHashMap[Int, Array[MapStatus]]
 * and cachedSerializedStatuses: TimeStampedHashMap[Int, Array[Byte]].
 * Since spark.cleaner.ttl defaults to -1, spark.cleaner.ttl.MAP_OUTPUT_TRACKER also resolves to -1,
 * so by default the timer never cleans anything.
 */

private def cleanup(cleanupTime: Long) {
  mapStatuses.clearOldValues(cleanupTime)
  cachedSerializedStatuses.clearOldValues(cleanupTime)
}

===> The cleanup itself is simple: iterate over the underlying ConcurrentHashMap and remove every entry whose timestamp is older than the given cutoff time.

 

private[spark] case class TimeStampedValue[V](value: V, timestamp: Long)
/**
 * A custom implementation of scala.collection.mutable.Map that stores an insertion timestamp
 * along with each key-value pair. If requested, the timestamp of a pair can be refreshed on every
 * access. Key-value pairs whose timestamps are older than a given threshold can then be removed
 * with the clearOldValues method.
 */

private[spark] class TimeStampedHashMap[A, B](updateTimeStampOnGet: Boolean = false)
  extends mutable.Map[A, B]() with Logging {
  // The implementation is backed by a ConcurrentHashMap
  private val internalMap = new ConcurrentHashMap[A, TimeStampedValue[B]]()

  def getEntrySet: Set[Entry[A, TimeStampedValue[B]]] = internalMap.entrySet

  override def size: Int = internalMap.size

  override def foreach[U](f: ((A, B)) => U) {
    // The Set[Entry[A, TimeStampedValue[B]]] view of the ConcurrentHashMap[A, TimeStampedValue[B]]
    val it = getEntrySet.iterator
    while (it.hasNext) {
      val entry = it.next()
      val kv = (entry.getKey, entry.getValue.value)
      f(kv)
    }
  }
  ...

  def clearOldValues(threshTime: Long, f: (A, B) => Unit) {
    // The Set[Entry[A, TimeStampedValue[B]]] view of the ConcurrentHashMap[A, TimeStampedValue[B]]
    val it = getEntrySet.iterator
    while (it.hasNext) {
      val entry = it.next()
      // Remove every entry older than threshTime
      if (entry.getValue.timestamp < threshTime) {
        f(entry.getKey, entry.getValue.value)
        logDebug("Removing key " + entry.getKey)
        it.remove() // the iterator's remove() may only be called once per next()
      }
    }
  }

  /** Removes old key-value pairs that have timestamp earlier than `threshTime`. */
  def clearOldValues(threshTime: Long) {
    clearOldValues(threshTime, (_, _) => ())
  }
  ...
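The same remove-while-iterating idea fits in a few lines. The sketch below is a simplified standalone version of the TimeStampedHashMap concept (not the Spark class): every put records the current time, and clearOldValues walks the backing ConcurrentHashMap and drops entries older than the cutoff.

import java.util.concurrent.ConcurrentHashMap

class TimestampedMapSketch[K, V] {
  private case class Stamped(value: V, timestamp: Long)
  private val internal = new ConcurrentHashMap[K, Stamped]()

  def put(k: K, v: V): Unit = internal.put(k, Stamped(v, System.currentTimeMillis()))
  def get(k: K): Option[V] = Option(internal.get(k)).map(_.value)

  def clearOldValues(threshTime: Long): Unit = {
    val it = internal.entrySet().iterator()
    while (it.hasNext) {
      // drop every entry inserted before the cutoff time
      if (it.next().getValue.timestamp < threshTime) it.remove()
    }
  }
}

A typical call is clearOldValues(System.currentTimeMillis() - ttlMillis), which is exactly what the cleanup method above does with ttlMillis = delaySeconds * 1000.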

7. Finally, a look at MapStatus: it picks a different subclass, based on the number of partitions, to store the output of a ShuffleMapTask.

/**
 * Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the
 * task ran on as well as the sizes of outputs for each reducer, for passing on to the reduce tasks.
 *
 * mapStatuses behaves differently on the driver and on the executors:
 * 1) On the driver it records the map outputs of ShuffleMapTasks.
 * 2) On the executors it is only a cache; a miss triggers a fetch of the HashMap data from the driver.
 */

private[spark] sealed trait MapStatus {
  /** Location where this task was run. */
  def location: BlockManagerId

  /**
   * Estimated size for the reduce block, in bytes.
   * If a block is non-empty, then this method MUST return a non-zero size. This invariant is
   * necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
   */
  def getSizeForBlock(reduceId: Int): Long
}
private[spark] object MapStatus {
  /**
   * Spark records shuffle information with different data structures depending on whether the
   * number of partitions is below or above 2000. Above 2000 partitions it switches to
   * HighlyCompressedMapStatus, a more compact (compressed) representation; below 2000 it uses
   * CompressedMapStatus. So if your partition count is just under 2000, it is safe to bump it
   * above 2000 to get the more efficient representation.
   */
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
}
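For completeness, here is a standalone sketch of the single-byte size encoding that CompressedMapStatus relies on. The log base and edge cases mirror my understanding of MapStatus.compressSize, so treat the exact constants as illustrative rather than authoritative: sizes are stored on a logarithmic scale, so each of up to 2000 blocks costs only one byte at the price of some precision.

object SizeCompressionSketch extends App {
  private val LOG_BASE = 1.1

  // Encode a block size into a single byte on a logarithmic scale.
  def compressSize(size: Long): Byte =
    if (size == 0) 0.toByte
    else if (size <= 1L) 1.toByte
    else math.min(255, math.ceil(math.log(size.toDouble) / math.log(LOG_BASE)).toInt).toByte

  // Decode back to an approximate size in bytes.
  def decompressSize(compressed: Byte): Long =
    if (compressed == 0) 0L else math.pow(LOG_BASE, compressed & 0xFF).toLong

  val original = 1234567L
  val b = compressSize(original)
  println(s"$original bytes -> compressed byte $b -> ~${decompressSize(b)} bytes")
}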


/**
 * A [[MapStatus]] implementation that tracks the size of each block. Size for each block is
 * represented using a single byte.
 *
 * @param loc location where the task is being executed.
 * @param compressedSizes size of the blocks, indexed by reduce partition id.
 */

private[spark] class CompressedMapStatus(
    private[this] var loc: BlockManagerId,
    private[this] var compressedSizes: Array[Byte])
  extends MapStatus with Externalizable {

  protected def this() = this(null, null.asInstanceOf[Array[Byte]])  // For deserialization only

  def this(loc: BlockManagerId, uncompressedSizes: Array[Long]) {
    this(loc, uncompressedSizes.map(MapStatus.compressSize))
  }

  override def location: BlockManagerId = loc
  ...
}

/**
 * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
 * plus a bitmap for tracking which blocks are empty.
 *
 * @param loc location where the task is being executed
 * @param numNonEmptyBlocks the number of non-empty blocks
 * @param emptyBlocks a bitmap tracking which blocks are empty
 * @param avgSize average size of the non-empty blocks
 */

private[spark] class HighlyCompressedMapStatus private (
    private[this] var loc: BlockManagerId,
    private[this] var numNonEmptyBlocks: Int,
    private[this] var emptyBlocks: RoaringBitmap,
    private[this] var avgSize: Long)
  extends MapStatus with Externalizable {

