Spark-shuffle

@(spark)[shuffle]

ShuffleHandle/BaseShuffleHandle

An opaque handle to a shuffle, used by a ShuffleManager to pass information about it to tasks
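In the source this handle carries almost no logic; BaseShuffleHandle simply captures the registration parameters so that they can be recovered later on the executors. A minimal self-contained sketch in the spirit of the Spark 1.x source (ShuffleDependency is stubbed out here, so this is a reconstruction rather than a verbatim quote):

```scala
// Stub of ShuffleDependency so the sketch compiles on its own.
class ShuffleDependency[K, V, C](val shuffleId: Int)

// An opaque handle: subclasses may attach whatever the manager needs.
abstract class ShuffleHandle(val shuffleId: Int) extends Serializable

// The base handle just records the registration arguments so that
// getWriter/getReader can recover them on the executors.
class BaseShuffleHandle[K, V, C](
    shuffleId: Int,
    val numMaps: Int,
    val dependency: ShuffleDependency[K, V, C])
  extends ShuffleHandle(shuffleId)
```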

ShuffleMemoryManager

Controls how much memory shuffle operations may use.

/**                                                                                                                                                                     
 * Allocates a pool of memory to task threads for use in shuffle operations. Each disk-spilling                                                                         
 * collection (ExternalAppendOnlyMap or ExternalSorter) used by these tasks can acquire memory                                                                          
 * from this pool and release it as it spills data out. When a task ends, all its memory will be                                                                        
 * released by the Executor.                                                                                                                                            
 *                                                                                                                                                                      
 * This class tries to ensure that each thread gets a reasonable share of memory, instead of some                                                                       
 * thread ramping up to a large amount first and then causing others to spill to disk repeatedly.                                                                       
 * If there are N threads, it ensures that each thread can acquire at least 1 / 2N of the memory                                                                        
 * before it has to spill, and at most 1 / N. Because N varies dynamically, we keep track of the                                                                        
 * set of active threads and redo the calculations of 1 / 2N and 1 / N in waiting threads whenever                                                                      
 * this set changes. This is all done by synchronizing access on "this" to mutate state and using                                                                       
 * wait() and notifyAll() to signal changes.                                                                                                                            
 */         
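The 1/(2N)..1/N policy described above can be sketched with a toy pool. This is a simplified, non-blocking version of our own (names like `FairMemoryPool` are illustrative); the real class additionally uses wait()/notifyAll() so that a request blocks until at least 1/(2N) of the pool can be granted.

```scala
// Simplified sketch of the per-task fairness arithmetic: with N active
// tasks, no task may hold more than 1/N of the pool. The real manager
// also blocks (wait/notifyAll) until 1/(2N) is reachable; this sketch
// only grants whatever is allowed right now.
class FairMemoryPool(val maxMemory: Long) {
  private val taskMemory = scala.collection.mutable.HashMap[Long, Long]()

  // Returns how much of `numBytes` the task may take immediately.
  def tryToAcquire(taskId: Long, numBytes: Long): Long = synchronized {
    taskMemory.getOrElseUpdate(taskId, 0L)
    val n = taskMemory.size
    val curMem = taskMemory(taskId)
    val freeMemory = maxMemory - taskMemory.values.sum
    // Cap each task at 1/N of the pool.
    val maxToGrant = math.min(numBytes, math.max(0L, maxMemory / n - curMem))
    val toGrant = math.min(maxToGrant, freeMemory)
    taskMemory(taskId) = curMem + toGrant
    toGrant
  }

  // On task end the Executor releases everything the task held.
  def releaseMemoryForTask(taskId: Long): Unit = synchronized {
    taskMemory -= taskId
  }
}
```

With a 1000-byte pool, a lone task can take up to the full pool, but as soon as a second task appears its cap drops to 1/2 of the pool, and so on as N grows.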

ShuffleManager

/**                                                                                                                                                                     
 * Pluggable interface for shuffle systems. A ShuffleManager is created in SparkEnv on the driver                                                                       
 * and on each executor, based on the spark.shuffle.manager setting. The driver registers shuffles                                                                      
 * with it, and executors (or tasks running locally in the driver) can ask to read and write data.                                                                      
 *                                                                                                                                                                      
 * NOTE: this will be instantiated by SparkEnv so its constructor can take a SparkConf and                                                                              
 * boolean isDriver as parameters.                                                                                                                                      
 */                                                                                                                                                                     
private[spark] trait ShuffleManager {   

ShuffleWriter

/**                                                                                                                                                                     
 * Obtained inside a map task to write out records to the shuffle system.                                                                                               
 */                                                                                                                                                                     
private[spark] trait ShuffleWriter[K, V] {                                                                                                                              
  /** Write a bunch of records to this task's output */                                                                                                                 
  def write(records: Iterator[_ <: Product2[K, V]]): Unit                                                                                                               

  /** Close this writer, passing along whether the map completed */                                                                                                     
  def stop(success: Boolean): Option[MapStatus]                                                                                                                         
}  

ShuffleReader

/**                                                                                                                                                                     
 * Obtained inside a reduce task to read combined records from the mappers.                                                                                             
 */                                                                                                                                                                     
private[spark] trait ShuffleReader[K, C] {                                                                                                                              
  /** Read the combined key-values for this reduce task */                                                                                                              
  def read(): Iterator[Product2[K, C]]                                                                                                                                  

  /**                                                                                                                                                                   
   * Close this reader.                                                                                                                                                 
   * TODO: Add this back when we make the ShuffleReader a developer API that others can implement                                                                       
   * (at which point this will likely be necessary).                                                                                                                    
   */                                                                                                                                                                   
  // def stop(): Unit                                                                                                                                                   
} 

The three traits above (ShuffleManager, ShuffleWriter and ShuffleReader) form the entire shuffle interface.
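How the three traits fit together can be shown with a toy, purely in-memory manager (everything below is our own stub with trimmed-down trait signatures, not Spark code): the driver registers a shuffle and gets back a handle, each map task asks the manager for a writer, and each reduce task asks for a reader.

```scala
import scala.collection.mutable

// Trimmed-down versions of the traits quoted above (stop() etc. omitted).
class ShuffleHandle(val shuffleId: Int)
trait ShuffleWriter[K, V] { def write(records: Iterator[Product2[K, V]]): Unit }
trait ShuffleReader[K, C] { def read(): Iterator[Product2[K, C]] }

// Toy manager backed by a process-local map; real managers write to
// disk and serve blocks over the network.
class InMemoryShuffleManager {
  private val store = mutable.Map[Int, mutable.Buffer[Product2[Any, Any]]]()

  // Driver side: register the shuffle, get an opaque handle.
  def registerShuffle(shuffleId: Int): ShuffleHandle = {
    store(shuffleId) = mutable.Buffer()
    new ShuffleHandle(shuffleId)
  }

  // Map-task side: obtain a writer for this shuffle.
  def getWriter[K, V](handle: ShuffleHandle): ShuffleWriter[K, V] =
    new ShuffleWriter[K, V] {
      def write(records: Iterator[Product2[K, V]]): Unit =
        store(handle.shuffleId) ++=
          records.asInstanceOf[Iterator[Product2[Any, Any]]]
    }

  // Reduce-task side: obtain a reader and pull the records back.
  def getReader[K, C](handle: ShuffleHandle): ShuffleReader[K, C] =
    new ShuffleReader[K, C] {
      def read(): Iterator[Product2[K, C]] =
        store(handle.shuffleId).iterator.asInstanceOf[Iterator[Product2[K, C]]]
    }
}
```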

FileShuffleBlockManager

/**                                                                                                                                                                     
 * Manages assigning disk-based block writers to shuffle tasks. Each shuffle task gets one file                                                                         
 * per reducer (this set of files is called a ShuffleFileGroup).                                                                                                        
 *                                                                                                                                                                      
 * As an optimization to reduce the number of physical shuffle files produced, multiple shuffle                                                                         
 * blocks are aggregated into the same file. There is one "combined shuffle file" per reducer                                                                           
 * per concurrently executing shuffle task. As soon as a task finishes writing to its shuffle                                                                           
 * files, it releases them for another task.                                                                                                                            
 * Regarding the implementation of this feature, shuffle files are identified by a 3-tuple:                                                                             
 *   - shuffleId: The unique id given to the entire shuffle stage.                                                                                                      
 *   - bucketId: The id of the output partition (i.e., reducer id)                                                                                                      
 *   - fileId: The unique id identifying a group of "combined shuffle files." Only one task at a                                                                        
 *       time owns a particular fileId, and this id is returned to a pool when the task finishes.                                                                       
 * Each shuffle file is then mapped to a FileSegment, which is a 3-tuple (file, offset, length)                                                                         
 * that specifies where in a given file the actual block data is located.                                                                                               
 *                                                                                                                                                                      
 * Shuffle file metadata is stored in a space-efficient manner. Rather than simply mapping                                                                              
 * ShuffleBlockIds directly to FileSegments, each ShuffleFileGroup maintains a list of offsets for                                                                      
 * each block stored in each file. In order to find the location of a shuffle block, we search the                                                                      
 * files within a ShuffleFileGroups associated with the block's reducer.                                                                                                
 */                                                                                                                                                                     
// Note: Changes to the format in this file should be kept in sync with                                                                                                 
// org.apache.spark.network.shuffle.StandaloneShuffleBlockManager#getHashBasedShuffleBlockData().                                                                       
private[spark]                                                                                                                                                          
class FileShuffleBlockManager(conf: SparkConf)                                                                                                                          
  extends ShuffleBlockManager with Logging { 
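The (file, offset, length) bookkeeping described above can be sketched as follows (the names here are illustrative, not the actual Spark internals): each combined file stores only the cumulative offsets of the blocks appended to it, so the i-th block is recovered as a FileSegment by subtracting adjacent offsets instead of storing a full ShuffleBlockId-to-FileSegment map.

```scala
// A segment of a combined shuffle file: which file, where, how long.
case class FileSegment(fileId: Int, offset: Long, length: Long)

// Per-reducer combined file: keeps one cumulative offset per block
// written, which is enough to reconstruct every FileSegment.
class CombinedShuffleFile(val fileId: Int) {
  private val offsets = scala.collection.mutable.ArrayBuffer[Long](0L)

  // Called when a map task finishes appending `numBytes` to this file.
  def recordMapOutput(numBytes: Long): Unit =
    offsets += offsets.last + numBytes

  // The i-th block written to this file, as a segment.
  def segment(i: Int): FileSegment =
    FileSegment(fileId, offsets(i), offsets(i + 1) - offsets(i))
}
```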

IndexShuffleBlockManager

/**                                                                                                                                                                     
 * Create and maintain the shuffle blocks' mapping between logic block and physical file location.                                                                      
 * Data of shuffle blocks from the same map task are stored in a single consolidated data file.                                                                         
 * The offsets of the data blocks in the data file are stored in a separate index file.                                                                                 
 *                                                                                                                                                                      
 * We use the name of the shuffle data's shuffleBlockId with reduce ID set to 0 and add ".data"                                                                         
 * as the filename postfix for data file, and ".index" as the filename postfix for index file.                                                                          
 *                                                                                                                                                                      
 */                                                                                                                                                                     
// Note: Changes to the format in this file should be kept in sync with                                                                                                 
// org.apache.spark.network.shuffle.StandaloneShuffleBlockManager#getSortBasedShuffleBlockData().                                                                       
private[spark]                                                                                                                                                          
class IndexShuffleBlockManager(conf: SparkConf) extends ShuffleBlockManager {   
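The index layout is easy to sketch: for a map task with numPartitions reduce partitions, the index is just numPartitions + 1 cumulative offsets into the consolidated data file. The helpers below are illustrative; the real class writes these longs to the ".index" file next to the ".data" file.

```scala
// Build the index for one map task's data file: cumulative offsets
// of each reduce partition (length numPartitions + 1, starting at 0).
def buildIndex(partitionLengths: Array[Long]): Array[Long] =
  partitionLengths.scanLeft(0L)(_ + _)

// Locate reduce partition `reduceId` inside the data file:
// returns (offset, length).
def locate(index: Array[Long], reduceId: Int): (Long, Long) = {
  val offset = index(reduceId)
  (offset, index(reduceId + 1) - offset)
}
```

Note that empty partitions cost nothing extra: a zero-length partition simply repeats the previous offset.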

Hash

HashShuffleManager only wraps the external interface; the actual logic lives in the corresponding Reader and Writer. Both of these are expensive: without consolidation, hash-based shuffle creates one file per map task per reducer, so the number of shuffle files grows as M × R.

Sort

SortShuffleManager likewise only wraps the external interface; the actual logic lives in the corresponding Writer. Note that its Reader is simply the hash-based shuffle's reader.

An important optimization in the sort path is that the final merge step can in fact be skipped: the output feeds a shuffle anyway, so strict sorting by key is not required.
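A sketch of that idea (names are ours): since downstream only needs records grouped by reduce partition, it is enough to order records by partition id, a far cheaper comparison than a full sort-merge by key.

```scala
// Lay out records for the data file by sorting on partition id only.
// No ordering by key is imposed; records merely end up contiguous
// per reduce partition, which is all a shuffle read needs.
def layout[K, V](records: Seq[(K, V)], numPartitions: Int): Seq[(Int, K, V)] = {
  // Toy hash partitioner standing in for the real Partitioner.
  def partition(k: K): Int = math.abs(k.hashCode) % numPartitions
  records.map { case (k, v) => (partition(k), k, v) }.sortBy(_._1)
}
```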
