Closures in Spark

The ClosureCleaner operation on Spark closures

In Scala, functions are first-class citizens: they can be passed as argument values to the corresponding RDD transformations and actions, which then apply them during iteration. Reading the Spark source code, we find that Spark runs sc.clean on every closure function we pass in, as shown below:

def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
    ClosureCleaner.clean(f, checkSerializable)
    f
}

The clean function performs a cleanup pass over the closure. So what exactly is closure cleaning?

What is a closure

Simply put, a closure is similar to an inner class in Java, but there are still some differences:

  • The compiled class files are different
  • A local/anonymous inner class and a closure defined in the same local scope have different visibility into the surrounding local variables
  • A Scala local inner class does not require the local variables it reads to be final/val
  • Scala turns primitive-type variables that would normally live on the stack into reference types, so a Scala inner class can read (and even update) local variables of its enclosing function without the final restriction (see the sketch below)

For details, see: function-closure-cleaner.md
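
The following is a minimal sketch (plain Scala, no Spark needed) of the last point in the list above. The object and method names are made up for illustration: the closure returned by makeCounter captures the local var counter, which the compiler lifts from the stack into a heap-allocated reference, so the closure can keep reading and updating it after the enclosing function has returned, with no final/val requirement.

object ClosureCaptureDemo {
  // Returns a closure that captures the local variable `counter`.
  def makeCounter(): () => Int = {
    var counter = 0     // a "stack" primitive, lifted into a reference (IntRef) by the Scala compiler
    () => {             // the closure captures `counter` itself, not a frozen copy
      counter += 1
      counter
    }
  }

  def main(args: Array[String]): Unit = {
    val next = makeCounter()
    println(next())     // 1
    println(next())     // 2 -- the captured variable outlives its defining scope
  }
}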

Why call ClosureCleaner.clean(f, checkSerializable)

This is a safeguard against serialization failures when a Spark program ships closures to the slave nodes, because the closure function may reference external variables that are not serializable.

For details, see: What does Closure.cleaner (func) mean in Spark

When Scala constructs a closure, it determines which outer variables the closure will use and stores references to them in the closure object. This allows the closure to work properly even when it's called from a different scope than it was created in.

Scala sometimes errs on the side of capturing too many outer variables (see SI-1419). That's harmless in most cases, because the extra captured variables simply don't get used (though this prevents them from getting GC'd). But it poses a problem for Spark, which has to send closures across the network so they can be run on slaves. When a closure contains unnecessary references, it wastes network bandwidth. More importantly, some of the references may point to non-serializable objects, and Spark will fail to serialize the closure.

To work around this bug in Scala, the ClosureCleaner traverses the object at runtime and prunes the unnecessary references. Since it does this at runtime, it can be more accurate than the Scala compiler can. Spark can then safely serialize the cleaned closure.
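
As a concrete illustration of the problem described above, here is a hedged sketch of the classic capture pitfall. The class Multiplier below is hypothetical and deliberately not serializable; inside its method, the expression x * factor really means x * this.factor, so the closure drags the whole Multiplier instance into the serialized task. Copying the field into a local val shrinks the closure's footprint to a single Int. (How much ClosureCleaner can rescue depends on the Scala and Spark versions; the local-copy idiom works regardless.)

import org.apache.spark.rdd.RDD

// Hypothetical helper class for illustration: it is NOT serializable.
class Multiplier(val factor: Int) {

  def scale(rdd: RDD[Int]): RDD[Int] = {
    // Problematic version: `factor` is really `this.factor`, so the closure
    // captures `this` (the whole non-serializable Multiplier) and Spark may
    // fail with "Task not serializable" when shipping the task to executors.
    //   rdd.map(x => x * factor)

    // Safe version: copy the field into a local val; the closure then captures
    // only an Int, which serializes trivially.
    val localFactor = factor
    rdd.map(x => x * localFactor)
  }
}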
