Closures in Spark

The ClosureCleaner operation on Spark closures

In Scala, functions are first-class citizens: they can be passed as argument values to the corresponding RDD transformations and actions, which then apply them during iteration. Reading the Spark source code, we find that Spark runs sc.clean on every closure function we pass in, as shown below:

def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
    ClosureCleaner.clean(f, checkSerializable)
    f
}

The clean function performs a cleanup pass over the closure. So what exactly is closure cleaning?

What is a closure

Simply put, a closure is similar to an inner class in Java, but there are still some differences:

  • The compiled class files are different
  • A local/anonymous inner class and a closure defined in the same local scope have different visibility into the surrounding local variables
  • A Scala local inner class does not require the local variables it reads to be final/val
  • Scala turns primitive-type variables that would normally live on the stack into reference types, so a Scala inner class can read (and even update) local variables of its enclosing function without the final restriction (see the sketch below)

For details, see: function-closure-cleaner.md
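
The following is a minimal sketch (plain Scala, no Spark needed) of the last point in the list above. The object and method names are made up for illustration: the closure returned by makeCounter captures the local var counter, which the compiler lifts from the stack into a heap-allocated reference, so the closure can keep reading and updating it after the enclosing function has returned, with no final/val requirement.

object ClosureCaptureDemo {
  // Returns a closure that captures the local variable `counter`.
  def makeCounter(): () => Int = {
    var counter = 0     // a "stack" primitive, lifted into a reference (IntRef) by the Scala compiler
    () => {             // the closure captures `counter` itself, not a frozen copy
      counter += 1
      counter
    }
  }

  def main(args: Array[String]): Unit = {
    val next = makeCounter()
    println(next())     // 1
    println(next())     // 2 -- the captured variable outlives its defining scope
  }
}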

Why call ClosureCleaner.clean(f, checkSerializable)

This is a safeguard against serialization failures when a Spark program ships closures to the slave nodes, because the closure function may reference external variables that are not serializable.

For details, see: What does Closure.cleaner (func) mean in Spark

When Scala constructs a closure, it determines which outer variables the closure will use and stores references to them in the closure object. This allows the closure to work properly even when it's called from a different scope than it was created in.

Scala sometimes errs on the side of capturing too many outer variables (see SI-1419). That's harmless in most cases, because the extra captured variables simply don't get used (though this prevents them from getting GC'd). But it poses a problem for Spark, which has to send closures across the network so they can be run on slaves. When a closure contains unnecessary references, it wastes network bandwidth. More importantly, some of the references may point to non-serializable objects, and Spark will fail to serialize the closure.

To work around this bug in Scala, the ClosureCleaner traverses the object at runtime and prunes the unnecessary references. Since it does this at runtime, it can be more accurate than the Scala compiler can. Spark can then safely serialize the cleaned closure.
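
As a concrete illustration of the problem described above, here is a hedged sketch of the classic capture pitfall. The class Multiplier below is hypothetical and deliberately not serializable; inside its method, the expression x * factor really means x * this.factor, so the closure drags the whole Multiplier instance into the serialized task. Copying the field into a local val shrinks the closure's footprint to a single Int. (How much ClosureCleaner can rescue depends on the Scala and Spark versions; the local-copy idiom works regardless.)

import org.apache.spark.rdd.RDD

// Hypothetical helper class for illustration: it is NOT serializable.
class Multiplier(val factor: Int) {

  def scale(rdd: RDD[Int]): RDD[Int] = {
    // Problematic version: `factor` is really `this.factor`, so the closure
    // captures `this` (the whole non-serializable Multiplier) and Spark may
    // fail with "Task not serializable" when shipping the task to executors.
    //   rdd.map(x => x * factor)

    // Safe version: copy the field into a local val; the closure then captures
    // only an Int, which serializes trivially.
    val localFactor = factor
    rdd.map(x => x * localFactor)
  }
}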
