Efficient Fine-Grained Updates for RDDs: Spark IndexedRDD

1. Background

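RDDs are read-only: a Spark RDD is immutable. To update or delete data in an RDD, you therefore have to traverse the entire RDD and produce a new one.

You may wonder why RDDs were not simply designed to be readable and writable, which would avoid the problem. I had the same question when I first started studying Spark; after some reading, it turned out there are good reasons for making RDDs read-only.

The design guarantees data consistency and avoids the need for locking. An update or delete cannot modify the original data in place; the usual approach has been to copy the original data and apply the modification or deletion to the copy.

Workloads such as streaming aggregation and incremental algorithms update only a small amount of data per iteration but run a very large number of iterations, so paying the full cost of an RDD update on every iteration is expensive.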
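
To make the cost concrete, here is a minimal sketch in plain Scala, with no Spark involved. A local immutable `Vector` stands in for an RDD, and `FullScanUpdate`, `data` and `update` are names made up for this illustration. It only shows the shape of the problem: changing one key still visits every element.

```scala
object FullScanUpdate {
  // Stand-in for an RDD[(Long, Int)]: a local immutable collection (assumption).
  val data: Vector[(Long, Int)] = (1L to 5L).map(k => (k, 0)).toVector

  // Returns the updated copy together with the number of elements visited.
  def update(rows: Vector[(Long, Int)], key: Long, value: Int): (Vector[(Long, Int)], Int) = {
    var visited = 0
    val updated = rows.map { case (k, v) =>
      visited += 1
      if (k == key) (k, value) else (k, v)
    }
    (updated, visited)
  }
}
```

Calling `FullScanUpdate.update(FullScanUpdate.data, 3L, 42)` visits all five elements to change one of them, and `data` itself is left untouched.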

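To address this, Ankur Dave of AMPLab proposed IndexedRDD, which neatly combines immutability with fine-grained updates. IndexedRDD is an RDD-based key-value store that extends RDD[(K, V)] and supports efficient lookups, updates, and deletions.

Further background is in the corresponding issue and in the detailed design document.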

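2. Design

Partition the data by the hash of the key, so that each key is stored in one particular partition.
Within each partition, build an index on the key; an update creates new nodes and reuses the existing ones instead of modifying nodes in place.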
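
The two points above can be sketched in plain Scala. This is an illustration under assumptions, not IndexedRDD's code: the standard library's immutable `TreeMap` stands in for the per-partition index, and `PartitionedIndexSketch`, `partitionOf`, `build` and `put` are hypothetical names.

```scala
import scala.collection.immutable.TreeMap

object PartitionedIndexSketch {
  // One immutable, persistent index per partition (stand-in for PART).
  type Partition = TreeMap[Long, Int]

  val numPartitions = 4

  // Non-negative modulo of the key's hash code selects the partition.
  def partitionOf(key: Long): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Group the pairs by partition and index each group by key.
  def build(pairs: Seq[(Long, Int)]): Vector[Partition] = {
    val grouped = pairs.groupBy { case (k, _) => partitionOf(k) }
    Vector.tabulate(numPartitions) { i =>
      TreeMap.empty[Long, Int] ++ grouped.getOrElse(i, Nil)
    }
  }

  // A point update rebuilds only the index of the owning partition;
  // every other partition is reused as-is in the new version.
  def put(parts: Vector[Partition], key: Long, value: Int): Vector[Partition] = {
    val i = partitionOf(key)
    parts.updated(i, parts(i).updated(key, value))
  }
}
```

After `put`, the old version still holds the old value, and all partitions other than the one owning the key are the very same objects in both versions.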

3. IndexedRDD API

IndexedRDD provides three main operations:

multiget: get the values for a set of keys
multiput: update the values for a set of keys
delete: delete a set of keys and their values

    class IndexedRDD[K: ClassTag, V: ClassTag] extends RDD[(K, V)] {

        /** Gets the values corresponding to the specified keys, if any. */
        def multiget(ks: Array[K]): Map[K, V]

        /**
         * Updates the keys in `kvs` to their corresponding values, running `merge` on old and new values
         * if necessary. Returns a new IndexedRDD that reflects the modification.
         */
        def multiput[U: ClassTag](kvs: Map[K, U], z: (K, U) => V, f: (K, V, U) => V): IndexedRDD[K, V]

        /**
         * Deletes the specified keys. Returns a new IndexedRDD that reflects the deletions.
         */
        def delete(ks: Array[K]): IndexedRDD[K, V]
    }
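
The roles of `z` and `f` in `multiput` are easy to miss in the signature alone. Reading the implementation shown later in this post, `z` builds the value for a key that is not present yet, and `f` merges a new value into an existing one. Below is a minimal sketch of that behaviour on a plain immutable `Map`; the object name `MultiputSemantics` is made up for this example and nothing here touches Spark.

```scala
object MultiputSemantics {
  // Applies every (key, update) pair to `current` and returns a new map.
  def multiput[K, V, U](
      current: Map[K, V],
      kvs: Map[K, U],
      z: (K, U) => V,
      f: (K, V, U) => V): Map[K, V] =
    kvs.foldLeft(current) { case (acc, (k, u)) =>
      val newV = acc.get(k) match {
        case Some(oldV) => f(k, oldV, u) // key exists: merge old and new
        case None       => z(k, u)       // key is new: create the initial value
      }
      acc.updated(k, newV)
    }
}
```

For example, starting from `Map("a" -> 1)` and applying `Map("a" -> 2, "b" -> 5)` with `z = (k: String, u: Int) => u` and `f = (k: String, old: Int, u: Int) => old + u` gives `Map("a" -> 3, "b" -> 5)`, while the starting map is unchanged.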

IndexedRDD also provides a function for constructing an IndexedRDD from an existing RDD:

    object IndexedRDD {
      /**
       * Constructs an updatable IndexedRDD from an RDD of pairs, merging duplicate keys arbitrarily.
       */
      def apply[K: ClassTag : KeySerializer, V: ClassTag] (elems: RDD[(K, V)]): IndexedRDD[K, V]
    }

4. Using IndexedRDD

The following example, taken from the IndexedRDD GitHub page, shows how IndexedRDD is used.

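    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
    // Brings the implicit KeySerializer instances (e.g. for Long keys) into scope.
    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._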

    // Create an RDD of key-value pairs with Long keys.
    val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
    // Construct an IndexedRDD from the pairs, hash-partitioning and indexing
    // the entries.
    val indexed = IndexedRDD(rdd).cache()

    // Perform a point update.
    val indexed2 = indexed.put(1234L, 10873).cache()
    // Perform a point lookup. Note that the original IndexedRDD remains
    // unmodified.
    indexed2.get(1234L) // => Some(10873)
    indexed.get(1234L) // => Some(0)

    // Efficiently join derived IndexedRDD with original.
    val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
    indexed3.collect // => Array((1234L, 10873))

    // Perform insertions and deletions.
    val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
    indexed2.get(-100L) // => None
    indexed4.get(-100L) // => Some(111)
    indexed2.get(999L) // => Some(0)
    indexed4.get(999L) // => None

IndexedRDD has not yet been merged into the Spark source tree, so using it requires adding the following dependency:

    resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
    libraryDependencies += "amplab" % "spark-indexedrdd" % "0.3"

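5. Persistent Adaptive Radix Trees (PART)

Each IndexedRDD partition is stored as a Persistent Adaptive Radix Tree (PART). Radix trees also appear in the Linux kernel, where they are mainly used for memory management. The main characteristics of PART in IndexedRDD are:

An index-based in-memory storage structure
Optimized for the CPU cache (compared with a B-tree)
Supports looking up multiple keys at once (a hash table looks up one key at a time)
Supports fast insertion and deletion
Keeps data sorted, supporting range scans and prefix lookups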

For more details, see the PART paper and the ART Java implementation on GitHub.
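
The "persistent" part is the key idea: an update copies only the nodes on the path from the root to the changed key and shares everything else with the previous version. The sketch below is a deliberately simplified byte-keyed trie that shows this path copying. It is not an adaptive radix tree (there are no adaptive node sizes and no path compression), and `PersistentTrieSketch`, `Node`, `insert`, `search` and `bytes` are names invented for the illustration.

```scala
object PersistentTrieSketch {
  // Immutable node: an optional value plus children keyed by the next byte.
  final case class Node[V](value: Option[V], children: Map[Byte, Node[V]])

  def empty[V]: Node[V] = Node(None, Map.empty)

  // Path copying: only the nodes between the root and the key are rebuilt;
  // all other subtrees are shared with the previous version.
  def insert[V](node: Node[V], key: List[Byte], v: V): Node[V] = key match {
    case Nil => node.copy(value = Some(v))
    case b :: rest =>
      val child = node.children.getOrElse(b, empty[V])
      node.copy(children = node.children.updated(b, insert(child, rest, v)))
  }

  // Follow the key one byte at a time.
  def search[V](node: Node[V], key: List[Byte]): Option[V] = key match {
    case Nil       => node.value
    case b :: rest => node.children.get(b).flatMap(search(_, rest))
  }

  // Helper: turn a string key into its UTF-8 bytes.
  def bytes(s: String): List[Byte] = s.getBytes("UTF-8").toList
}
```

If a tree holds the keys "ab" and "cd" and "ab" is then overwritten, the old root still sees the old value, and the subtree under "c" is the same object in both versions.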

6. Main PART Functions

    public class ArtTree extends ChildPtr implements Serializable {

      // Take a snapshot; in effect this just adds another reference to the root node
      public ArtTree snapshot();

      // Look up the value for a key
      public Object search(final byte[] key);

      // Insert
      public void insert(final byte[] key, Object value) throws UnsupportedOperationException;

      // Delete
      public void delete(final byte[] key);

      // Return an iterator
      public Iterator<Tuple2<byte[], Object>> iterator();

      // Number of elements
      public long size();

      // Destroy the tree
      public int destroy();

      ...
    }

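7. Implementation Analysis

The IndexedRDD implementation is quite compact, at only 800 LOC (lines of code).

KeySerializer.scala: defines how a key is serialized to a byte array and how it is deserialized.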

    trait KeySerializer[K] extends Serializable {
      def toBytes(k: K): Array[Byte]
      def fromBytes(b: Array[Byte]): K
    }

    // Default KeySerializer implementations are provided for Long and String
    class LongSerializer extends KeySerializer[Long]

    class StringSerializer extends KeySerializer[String]
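
The bodies of these serializers are not shown above. As a rough idea of what such a class can look like, here is a hypothetical serializer for `Long` keys that writes 8 big-endian bytes with `java.nio.ByteBuffer`. The trait is repeated so the sketch is self-contained, and `BigEndianLongSerializer` is my own name; the library's `LongSerializer` may well encode keys differently.

```scala
import java.nio.ByteBuffer

object KeySerializerSketch {
  // Same shape as the trait shown above, repeated so this compiles on its own.
  trait KeySerializer[K] extends Serializable {
    def toBytes(k: K): Array[Byte]
    def fromBytes(b: Array[Byte]): K
  }

  // Hypothetical implementation: 8 bytes, big-endian (ByteBuffer's default order).
  class BigEndianLongSerializer extends KeySerializer[Long] {
    def toBytes(k: Long): Array[Byte] =
      ByteBuffer.allocate(8).putLong(k).array()

    def fromBytes(b: Array[Byte]): Long =
      ByteBuffer.wrap(b).getLong()
  }
}
```

Whatever the encoding, `fromBytes(toBytes(k))` has to return `k`, because the tree stores only the byte form of each key.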

IndexedRDDPartition.scala: defines the partition interface.

    private[indexedrdd] abstract class IndexedRDDPartition[K, V] extends Serializable {
      def multiget(ks: Iterator[K]): Iterator[(K, V)]

      def multiput[U](
          kvs: Iterator[(K, U)], z: (K, U) => V, f: (K, V, U) => V): IndexedRDDPartition[K, V] =
        throw new UnsupportedOperationException("modifications not supported")

      def delete(ks: Iterator[K]): IndexedRDDPartition[K, V] =
        throw new UnsupportedOperationException("modifications not supported")

        ...
    }

PARTPartition.scala: the PART-based implementation of a partition. Its most important data structure is map: ArtTree.

    private[indexedrdd] class PARTPartition[K, V]
        (protected val map: ArtTree)
        (override implicit val kTag: ClassTag[K],
         override implicit val vTag: ClassTag[V],
         implicit val kSer: KeySerializer[K])
      extends IndexedRDDPartition[K, V] with Logging {

      override def apply(k: K): V = map.search(kSer.toBytes(k)).asInstanceOf[V]

      override def multiget(ks: Iterator[K]): Iterator[(K, V)] =
        ks.flatMap { k => Option(this(k)).map(v => (k, v)) }

      override def multiput[U](
          kvs: Iterator[(K, U)], z: (K, U) => V, f: (K, V, U) => V): IndexedRDDPartition[K, V] = {
        val newMap = map.snapshot()
        for (ku <- kvs) {
          val kBytes = kSer.toBytes(ku._1)
          val oldV = newMap.search(kBytes).asInstanceOf[V]
          val newV = if (oldV == null) z(ku._1, ku._2) else f(ku._1, oldV, ku._2)
          newMap.insert(kBytes, newV)
        }
        this.withMap[V](newMap)
      }

      override def delete(ks: Iterator[K]): IndexedRDDPartition[K, V] = {
        val newMap = map.snapshot()
        for (k <- ks) {
          newMap.delete(kSer.toBytes(k))
        }
        this.withMap[V](newMap)
      }

      ...
    }

IndexedRDD.scala: built on PARTPartition, the implementation of IndexedRDD itself is very simple:

    class IndexedRDD[K: ClassTag, V: ClassTag](
        private val partitionsRDD: RDD[IndexedRDDPartition[K, V]])
      extends RDD[(K, V)](partitionsRDD.context, List(new OneToOneDependency(partitionsRDD))) {

      def multiget(ks: Array[K]): Map[K, V] = {
        val ksByPartition = ks.groupBy(k => partitioner.get.getPartition(k))
        val partitions = ksByPartition.keys.toSeq
        // TODO: avoid sending all keys to all partitions by creating and zipping an RDD of keys
        val results: Array[Array[(K, V)]] = context.runJob(partitionsRDD,
          (context: TaskContext, partIter: Iterator[IndexedRDDPartition[K, V]]) => {
            if (partIter.hasNext && ksByPartition.contains(context.partitionId)) {
              val part = partIter.next()
              val ksForPartition = ksByPartition.get(context.partitionId).get
              part.multiget(ksForPartition.iterator).toArray
            } else {
              Array.empty
            }
          }, partitions, allowLocal = true)
        results.flatten.toMap
      }

      def multiput[U: ClassTag](kvs: Map[K, U], z: (K, U) => V, f: (K, V, U) => V): IndexedRDD[K, V] = {
        val updates = context.parallelize(kvs.toSeq).partitionBy(partitioner.get)
        zipPartitionsWithOther(updates)(new MultiputZipper(z, f))
      }

      private class MultiputZipper[U](z: (K, U) => V, f: (K, V, U) => V)
          extends OtherZipPartitionsFunction[U, V] with Serializable {
        def apply(thisIter: Iterator[IndexedRDDPartition[K, V]], otherIter: Iterator[(K, U)])
          : Iterator[IndexedRDDPartition[K, V]] = {
          val thisPart = thisIter.next()
          Iterator(thisPart.multiput(otherIter, z, f))
        }
      }

      def delete(ks: Array[K]): IndexedRDD[K, V] = {
        val deletions = context.parallelize(ks.map(k => (k, ()))).partitionBy(partitioner.get)
        zipPartitionsWithOther(deletions)(new DeleteZipper)
      }

      private class DeleteZipper extends OtherZipPartitionsFunction[Unit, V] with Serializable {
        def apply(thisIter: Iterator[IndexedRDDPartition[K, V]], otherIter: Iterator[(K, Unit)])
          : Iterator[IndexedRDDPartition[K, V]] = {
          val thisPart = thisIter.next()
          Iterator(thisPart.delete(otherIter.map(_._1)))
        }
      }

      ...
    }
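
The routing idea in `multiget` is that keys are first grouped by the partition that owns them, so only those partitions need to be consulted. Here is a Spark-free sketch of that idea, with a `Vector` of plain maps standing in for the partitions; `MultigetRoutingSketch` and its members are hypothetical names, and the hash function is an assumption rather than Spark's partitioner.

```scala
object MultigetRoutingSketch {
  val numPartitions = 4

  // Non-negative modulo of the key's hash code (assumed partitioning scheme).
  def getPartition(key: Long): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions

  // Which partitions a lookup for `ks` has to touch.
  def partitionsTouched(ks: Array[Long]): Set[Int] =
    ks.map(getPartition).toSet

  // Look up each key only in the partition that owns it; missing keys are dropped.
  def multiget(parts: Vector[Map[Long, Int]], ks: Array[Long]): Map[Long, Int] = {
    val ksByPartition = ks.groupBy(getPartition)
    val found = for {
      (pid, keys) <- ksByPartition.toSeq
      k <- keys.toSeq
      v <- parts(pid).get(k)
    } yield (k, v)
    found.toMap
  }
}
```

Keys that are absent simply do not appear in the result, matching the "if any" wording in the doc comment of `multiget`.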

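8. Performance

Insert throughput: IndexedRDD has the advantage when the batch size is fairly large.
Lookups are the fastest of the structures compared, while scan speed and memory footprint fall in the middle.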

[End]

白楊
