Background

A brief analysis of how GraphX models and stores graph data.

Entry point

A good starting point is the edgeListFile function in GraphLoader:

def edgeListFile(
      sc: SparkContext,
      path: String,
      canonicalOrientation: Boolean = false,
      numEdgePartitions: Int = -1,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
    : Graph[Int, Int]

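path can be a local path (a file or a directory) or an HDFS path; under the hood the function calls sc.textFile, which produces a HadoopRDD. numEdgePartitions is the number of edge partitions.
A Graph is stored in two parts, an EdgeRDD and a VertexRDD, and a StorageLevel can be set for each. The default for both is MEMORY_ONLY.
The function takes an edge-list file, that is, a file of vertex-to-vertex pairs such as '1 2' and '4 1', one per line, and turns it into an operable graph according to the requested partition count and storage levels.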
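
To make the input format concrete, here is a minimal sketch of the line handling, written in plain Scala without Spark. The object and method names are hypothetical. Skipping blank lines and '#' comment lines, and swapping the endpoints so that src < dst when canonicalOrientation is set, reflect my understanding of GraphLoader and should be read as assumptions rather than as the library's exact code.

```scala
// Sketch only: mimics the edge-list line handling without Spark.
object EdgeListParsing {
  type VertexId = Long

  // Parse one line into (src, dst); blank lines and '#' comments yield None.
  def parseLine(
      line: String,
      canonicalOrientation: Boolean = false): Option[(VertexId, VertexId)] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else {
      val fields = trimmed.split("\\s+")
      require(fields.length >= 2, s"Invalid line: $line")
      val src = fields(0).toLong
      val dst = fields(1).toLong
      // Assumed behaviour: canonical orientation stores the smaller id as src.
      if (canonicalOrientation && src > dst) Some((dst, src)) else Some((src, dst))
    }
  }

  // Parse a whole file's worth of lines, dropping the skipped ones.
  def parse(
      lines: Seq[String],
      canonicalOrientation: Boolean = false): Seq[(VertexId, VertexId)] =
    lines.flatMap(line => parseLine(line, canonicalOrientation))
}
```

With the two sample pairs from the text, parse(Seq("1 2", "4 1")) yields the edges (1, 2) and (4, 1).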

Workflow

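sc.textFile reads the file and produces the raw RDD.
Each partition (on its compute node) puts its records into a PrimitiveVector, a storage structure in Spark optimized for primitive data.
The records are then taken out of the PrimitiveVector one by one and converted into an EdgePartition, the per-partition implementation behind EdgeRDD. This step produces a column-oriented layout: an array of src vertices, an array of dst vertices, an array of edge attributes, and a pair of forward and reverse mappings between each vertex's local id and its global id.
A count on the EdgeRDD triggers this edge-building job, so that the result is actually persisted.
A RoutingTablePartition is generated from each EdgePartition. It holds the mapping from vertexId to partitionId, and the VertexRDD is built with its help.
The Graph is built from the EdgeRDD and the VertexRDD. The former maintains the edge attributes, the attributes of the vertices at both ends, their global vertex ids, their local ids (the array index within one edge partition), and the forward and reverse mappings used to address the arrays. The latter maintains the map recording which edge partitions each vertex appears in.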
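
The column-oriented layout built in the third step can be sketched as follows. This is a simplified illustration in plain Scala, not the GraphX builder: the names EdgePartitionSketch, Columns and build are hypothetical, standard collections stand in for the specialized ones, and details of the real builder (such as any ordering of edges or the index by source vertex) are not modelled.

```scala
import scala.collection.mutable

object EdgePartitionSketch {
  type VertexId = Long

  // Column-oriented view of one edge partition (hypothetical, simplified).
  final case class Columns(
      localSrcIds: Array[Int],
      localDstIds: Array[Int],
      data: Array[Int],
      global2local: Map[VertexId, Int],
      local2global: Array[VertexId])

  def build(edges: Seq[(VertexId, VertexId, Int)]): Columns = {
    val global2local = mutable.HashMap.empty[VertexId, Int]
    val local2global = mutable.ArrayBuffer.empty[VertexId]

    // Assign the next free local id the first time a global id is seen.
    def localId(vid: VertexId): Int =
      global2local.getOrElseUpdate(vid, {
        local2global += vid
        local2global.length - 1
      })

    val src = new Array[Int](edges.length)
    val dst = new Array[Int](edges.length)
    val data = new Array[Int](edges.length)
    for (((s, d, attr), i) <- edges.zipWithIndex) {
      src(i) = localId(s)
      dst(i) = localId(d)
      data(i) = attr
    }
    Columns(src, dst, data, global2local.toMap, local2global.toArray)
  }
}
```

Storing small local ids in the src and dst columns, and keeping the wide global ids only once in local2global, is what lets per-vertex data live in a dense array indexed by local id.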

The code below shows the internal storage structures quite clearly.

private[graphx]
class EdgePartition[
    @specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED: ClassTag, VD: ClassTag](
    localSrcIds: Array[Int],
    localDstIds: Array[Int],
    data: Array[ED],
    index: GraphXPrimitiveKeyOpenHashMap[VertexId, Int],
    global2local: GraphXPrimitiveKeyOpenHashMap[VertexId, Int],
    local2global: Array[VertexId],
    vertexAttrs: Array[VD],
    activeSet: Option[VertexSet])
  extends Serializable {

/**
 * Stores the locations of edge-partition join sites for each vertex attribute in a particular
 * vertex partition. This provides routing information for shipping vertex attributes to edge
 * partitions.
 */
private[graphx]
class RoutingTablePartition(
    private val routingTable: Array[(Array[VertexId], BitSet, BitSet)]) extends Serializable {

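Details

Partition placement

How are the EdgeRDD partitions split? The data is scanned out of the file by offset through the HadoopRDD, so the edges are split with no graph-aware processing at all. Since the file itself is not arranged in any particular order, which edges end up in which partition is essentially arbitrary.

How are the VertexRDD partitions split? The vertexIdToPartitionId data generated from the EdgeRDD is an RDD[(VertexId, Int)]. It is hash-partitioned into the same number of partitions as the EdgeRDD. So the VertexRDD has as many partitions as the EdgeRDD, and the partitioning rule is a hash of the Long vertex id.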
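
The hash rule and the (vertexId, edgePartitionId) pairs can be illustrated with a small sketch in plain Scala. The names VertexPartitioning, partitionOf and route are hypothetical, and the non-negative modulo of hashCode is the usual hash-partitioning rule, assumed here rather than copied from Spark.

```scala
object VertexPartitioning {
  type VertexId = Long
  type PartitionId = Int

  // Non-negative modulo of the key's hashCode (assumed hash-partitioning rule).
  def partitionOf(vid: VertexId, numPartitions: Int): PartitionId = {
    val raw = vid.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // For every edge partition, emit (vertexId, edgePartitionId) pairs, then
  // group the pairs by the vertex partition that owns each vertex.
  def route(
      edgePartitions: Seq[Seq[(VertexId, VertexId)]])
    : Map[PartitionId, Set[(VertexId, PartitionId)]] = {
    val numPartitions = edgePartitions.length
    val pairs = for {
      (edges, pid) <- edgePartitions.zipWithIndex
      (src, dst) <- edges
      vid <- Seq(src, dst)
    } yield (vid, pid)
    pairs.toSet.groupBy { case (vid, _) => partitionOf(vid, numPartitions) }
  }
}
```

Because the number of vertex partitions is taken from the number of edge partitions, the two RDDs end up with the same partition count, as described above.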

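So the computation, as I picture it, goes like this:

For an operation on a vertex, the vertexId (a Long) is hashed first to locate its vertex partition. Within that partition, if the VertexRDD is stored in memory, it is quick to look up which Edge partitions hold the vertex's edges, and the computation is then dispatched to those Edge partitions.
The first step, hashing the vertex to find the edge partitions, works much like a query against a prebuilt index.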
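
The two-step lookup imagined above can be sketched like this. It is only an illustration of the idea: VertexLookup and edgePartitionsFor are hypothetical names, and the routing table is modelled as one plain Map per vertex partition rather than as a RoutingTablePartition.

```scala
object VertexLookup {
  type VertexId = Long
  type PartitionId = Int

  // Assumed hash rule: non-negative modulo of the Long's hashCode.
  def partitionOf(vid: VertexId, numPartitions: Int): PartitionId =
    ((vid.hashCode % numPartitions) + numPartitions) % numPartitions

  // routing(p) is the hypothetical routing table held by vertex partition p:
  // vertex id -> edge partitions that reference that vertex.
  def edgePartitionsFor(
      vid: VertexId,
      routing: IndexedSeq[Map[VertexId, Set[PartitionId]]]): Set[PartitionId] = {
    // Step 1: hash the vertex id to find its vertex partition.
    val vertexPartition = partitionOf(vid, routing.length)
    // Step 2: in-memory lookup of the edge partitions holding its edges.
    routing(vertexPartition).getOrElse(vid, Set.empty[PartitionId])
  }
}
```

A vertex that appears in no edge simply maps to an empty set, so no edge partition needs to be contacted for it.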

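The diagram in the official GraphX documentation helps in understanding this.

Efficient data structures

Storage and access for primitive types are backed by well-suited data structures. A typical example is the map used in EdgePartition: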

/**
 * A fast hash map implementation for primitive, non-null keys. This hash map supports
 * insertions and updates, but not deletions. This map is about an order of magnitude
 * faster than java.util.HashMap, while using much less space overhead.
 *
 * Under the hood, it uses our OpenHashSet implementation.
 */
private[graphx]
class GraphXPrimitiveKeyOpenHashMap[@specialized(Long, Int) K: ClassTag,
                              @specialized(Long, Int, Double) V: ClassTag](
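
To show what an insert-only open-addressing map looks like, here is a minimal sketch fixed to Long keys and Int values. It is not Spark's implementation: the class name LongIntOpenHashMap is hypothetical, and linear probing and the 0.7 load factor are choices made for the sketch. The point is only that keys and values sit in flat primitive arrays with no per-entry objects.

```scala
// Simplified illustration: Long keys, Int values, linear probing, no deletions.
final class LongIntOpenHashMap(initialCapacity: Int = 16) {
  require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
    "capacity must be a power of two")

  private var keys = new Array[Long](initialCapacity)
  private var values = new Array[Int](initialCapacity)
  private var used = new Array[Boolean](initialCapacity)
  private var count = 0

  def size: Int = count

  // Find the slot holding `key`, or the first free slot on its probe sequence.
  private def slotFor(key: Long, ks: Array[Long], us: Array[Boolean]): Int = {
    val mask = ks.length - 1
    var pos = key.hashCode & mask
    while (us(pos) && ks(pos) != key) pos = (pos + 1) & mask
    pos
  }

  // Insert a new key or overwrite the value of an existing one.
  def update(key: Long, value: Int): Unit = {
    if ((count + 1) * 10 > keys.length * 7) grow() // keep load factor <= 0.7
    val pos = slotFor(key, keys, used)
    if (!used(pos)) { used(pos) = true; keys(pos) = key; count += 1 }
    values(pos) = value
  }

  def get(key: Long): Option[Int] = {
    val pos = slotFor(key, keys, used)
    if (used(pos)) Some(values(pos)) else None
  }

  // Double the capacity and re-insert every entry.
  private def grow(): Unit = {
    val (oldKeys, oldValues, oldUsed) = (keys, values, used)
    keys = new Array[Long](oldKeys.length * 2)
    values = new Array[Int](oldKeys.length * 2)
    used = new Array[Boolean](oldKeys.length * 2)
    for (i <- oldKeys.indices if oldUsed(i)) {
      val pos = slotFor(oldKeys(i), keys, used)
      used(pos) = true; keys(pos) = oldKeys(i); values(pos) = oldValues(i)
    }
  }
}
```

Leaving out deletions keeps the probing logic simple, since a free slot always marks the end of a probe sequence.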

The PrimitiveVector mentioned earlier is another example:

/**
 * An append-only, non-threadsafe, array-backed vector that is optimized for primitive types.
 */
private[spark]
class PrimitiveVector[@specialized(Long, Int, Double) V: ClassTag](initialSize: Int = 64) {
  private var _numElements = 0
  private var _array: Array[V] = _
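
The excerpt above is cut short, so here is a minimal sketch of the same idea, fixed to Long so that it runs without specialization. The class name LongVector and the doubling growth policy are assumptions made for the sketch, not a copy of Spark's code.

```scala
// Append-only, non-threadsafe, array-backed vector of Long values (sketch).
final class LongVector(initialSize: Int = 64) {
  private var numElements = 0
  private var array = new Array[Long](math.max(initialSize, 1))

  def length: Int = numElements

  def apply(index: Int): Long = {
    require(index >= 0 && index < numElements, s"index $index out of range")
    array(index)
  }

  // Append a value, doubling the backing array when it is full.
  def +=(value: Long): Unit = {
    if (numElements == array.length) {
      val bigger = new Array[Long](array.length * 2)
      System.arraycopy(array, 0, bigger, 0, numElements)
      array = bigger
    }
    array(numElements) = value
    numElements += 1
  }

  // Copy out exactly the filled prefix of the backing array.
  def toArray: Array[Long] = java.util.Arrays.copyOf(array, numElements)
}
```

Compared with a generic buffer of boxed values, the elements here stay unboxed in a single primitive array, which is the benefit the original class aims for.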

The end :)
