Closures in Spark

The ClosureCleaner operation on Spark closures

In Scala, functions are first-class citizens: they can be passed as arguments to RDD transformations and actions, which then apply them element by element. Reading the Spark source, we find that Spark runs sc.clean on every closure we pass in, as shown below:

// RDD.scala (Spark 1.x): every transformation cleans the user closure first
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

// SparkContext.scala: clean delegates to ClosureCleaner and returns the same closure
private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
    ClosureCleaner.clean(f, checkSerializable)
    f
}
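
For example, any closure a user passes to map goes through this cleaning before it is serialized for the executors. A minimal sketch, assuming an already-created SparkContext named sc:

val rdd = sc.parallelize(1 to 10)
// The lambda below is the closure `f`; map calls sc.clean(f) on it
// before the job is submitted, then serializes the cleaned closure.
val doubled = rdd.map(x => x * 2)
doubled.collect()   // Array(2, 4, 6, ..., 20)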

The clean function performs a cleanup pass on the closure. So what exactly is closure cleaning?

What is a closure

Simply put, a closure is similar to an inner class in Java, but there are still differences:

  • The compiled .class files are different.
  • A local or anonymous inner class and a closure defined in the same local scope have different visibility into local variables.
  • A Scala local inner class (closure) can access local variables without requiring them to be final/val.
  • Scala converts primitive local variables, which would otherwise live on the stack, into reference types, so a closure can read (and update) local variables of the enclosing function without the final restriction (see the sketch below).

For details, see function-closure-cleaner.md
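
A minimal sketch of the last point, using nothing beyond plain Scala: a closure captures a mutable local var, which the compiler lifts into a heap-allocated scala.runtime.IntRef, so no final/val is required:

def counter(): () => Int = {
  var count = 0               // a primitive local; the compiler boxes it into a scala.runtime.IntRef
  () => { count += 1; count } // the closure holds a reference to that box, not a stack slot
}

val next = counter()
next()   // 1
next()   // 2 -- the closure keeps mutating the captured variable after counter() has returned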

Why call ClosureCleaner.clean(f, checkSerializable)

This is a safeguard against serialization failures when a Spark program ships a closure to the slave nodes: the closure may reference external variables that are not serializable.
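
A minimal sketch of that failure mode (Helper is a hypothetical class name, not from the Spark API): referencing a field of a non-serializable enclosing object drags the whole object into the closure, while copying the field into a local val keeps the captured state down to a single Int:

import org.apache.spark.rdd.RDD

class Helper {                      // hypothetical class; note it does NOT extend Serializable
  val factor = 10

  def bad(rdd: RDD[Int]): RDD[Int] =
    rdd.map(x => x * factor)        // `factor` is really `this.factor`, so the whole Helper
                                    // instance is captured -> "Task not serializable"

  def good(rdd: RDD[Int]): RDD[Int] = {
    val localFactor = factor        // copy the needed field into a local val first
    rdd.map(x => x * localFactor)   // only the Int is captured; the closure serializes fine
  }
}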

For details, see: What does Closure.cleaner (func) mean in Spark

When Scala constructs a closure, it determines which outer variables the closure will use and stores references to them in the closure object. This allows the closure to work properly even when it's called from a different scope than it was created in.

Scala sometimes errs on the side of capturing too many outer variables (see SI-1419). That's harmless in most cases, because the extra captured variables simply don't get used (though this prevents them from getting GC'd). But it poses a problem for Spark, which has to send closures across the network so they can be run on slaves. When a closure contains unnecessary references, it wastes network bandwidth. More importantly, some of the references may point to non-serializable objects, and Spark will fail to serialize the closure.

To work around this bug in Scala, the ClosureCleaner traverses the object at runtime and prunes the unnecessary references. Since it does this at runtime, it can be more accurate than the Scala compiler can. Spark can then safely serialize the cleaned closure.
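
A sketch of what the cleaning buys us in practice (Driver is a hypothetical class name; the behaviour applies to closures compiled as anonymous classes, i.e. older Scala versions): a lambda that uses nothing from its enclosing, non-serializable class may still carry an $outer reference, and ClosureCleaner nulls that unused reference so the closure can be serialized:

class Driver(sc: org.apache.spark.SparkContext) {   // hypothetical; Driver itself is not Serializable
  def run(): Array[Int] = {
    val data = sc.parallelize(1 to 5)
    // The lambda uses nothing from Driver, but the compiler may still give it an
    // $outer field pointing at `this`. ClosureCleaner sees that $outer is never
    // used, nulls it, and the cleaned closure ships to the executors without error.
    data.map(x => x + 1).collect()
  }
}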
