Spark's Closure Cleaner: ClosureCleaner

The code comments in Spark's closure cleaner use a class to illustrate closures. A slightly modified version of that example follows.

class SomethingNotSerializable {
  def someMethod(): Unit = scope("one") {
    def y = someValue

    scope("two") {
      println(y + 1)
    }
  }

  def someValue = 1

  def scope(name: String)(body: => Unit) = body
}

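In this class, the two calls to scope() create two levels of closures.

Compiling the class produces, in addition to the class file for the outer class, two more class files. These are the function classes that Scala generates for the closures, and they can be recognized by the $anonfun$ in their names. (This is the behavior of Scala 2.11 and earlier; from Scala 2.12 on, closures are compiled to Java 8 lambdas rather than anonymous classes.) Because the body passed to scope("one") itself calls scope("two"), the closures are nested, which is where the second closure class comes from.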

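The bytecode of a closure class file can be inspected with javap -v. Start with the class file of the first-level closure. In the code above, someMethod() defines y(), and the definition of y() involves someValue(), a method of the enclosing class. This is clearly the point where the closure has to reference its enclosing object, so look at the bytecode of y().

The bytecode reads a field of the closure object: a getfield instruction loads the reference held in $outer, and the next instruction invokes someValue() on it through the corresponding Methodref. $outer is the object that encloses the closure, that is, an instance of the class SomethingNotSerializable above.

Now go back to the constructor.

The constructor shows clearly that the enclosing object is passed in as a parameter named $outer (visible in the local variable table) and is stored into the $outer field with a putfield instruction, so that the closure can later reach whatever it needs through the $outer pointer.

Because this class file contains one more level of nested closure, look at its InnerClasses section: the inner classes are listed there, and the nested closure is among them.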

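Back to Spark's ClosureCleaner. When a Spark operator is called, it passes the closure it receives through clean() to clean up what the closure references. That call ends up in ClosureCleaner, and what gets cleaned is precisely the $outer reference described above.

The main logic lives in its clean() function, where the closure to be cleaned is func.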

if (!isClosure(func.getClass) && lambdaFunc.isEmpty) {
  logDebug(s"Expected a closure; got ${func.getClass.getName}")
  return
}

First, isClosure() checks whether func is an instance of a closure class.

private def isClosure(cls: Class[_]): Boolean = {
  cls.getName.contains("$anonfun$")
}

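The logic is simple and matches what was described above: a class is treated as a closure if its name contains $anonfun$.

Next, getOuterClassesAndObjects() collects every reference that func holds to an enclosing object.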

private def getOuterClassesAndObjects(obj: AnyRef): (List[Class[_]], List[AnyRef]) = {
  for (f <- obj.getClass.getDeclaredFields if f.getName == "$outer") {
    f.setAccessible(true)
    val outer = f.get(obj)
    // The outer pointer may be null if we have cleaned this closure before
    if (outer != null) {
      if (isClosure(f.getType)) {
        val recurRet = getOuterClassesAndObjects(outer)
        return (f.getType :: recurRet._1, outer :: recurRet._2)
      } else {
        return (f.getType :: Nil, outer :: Nil) // Stop at the first $outer that is not a closure
      }
    }
  }
  (Nil, Nil)
}

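As described above, this simply iterates over the fields declared in the class and looks for the one named $outer, which is the reference to the enclosing object. If that enclosing object is itself a closure, its own $outer is collected in the same way, and the recursion stops at the outermost class that is not a closure.

Next, return statements are not allowed inside func. The class is scanned at this point, and an exception is thrown as soon as a return is found.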
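The walk along the chain of enclosing objects can be tried in isolation. The following is a minimal sketch, not Spark's code: the classes are written by hand, and the field name "outer" is a hypothetical stand-in for the compiler-generated "$outer".

```scala
object OuterChainSketch {
  // Hand-written stand-ins for compiler-generated closure classes.
  // The field name "outer" is hypothetical; scalac names the real field "$outer".
  class UserObject
  class OuterClosure(val outer: AnyRef)
  class InnerClosure(val outer: AnyRef)

  // Follow the "outer" field from the innermost object outwards. The walk stops
  // at the first object that declares no such field, or whose field is null.
  def outerChain(obj: AnyRef): List[AnyRef] =
    obj.getClass.getDeclaredFields.find(_.getName == "outer") match {
      case Some(field) =>
        field.setAccessible(true)
        val outer = field.get(obj)
        if (outer == null) Nil else outer :: outerChain(outer)
      case None => Nil
    }

  def main(args: Array[String]): Unit = {
    val user = new UserObject
    val inner = new InnerClosure(new OuterClosure(user))
    val chain = outerChain(inner)
    // The chain holds the enclosing closure first and the user object last
    println(chain.length)
    println(chain.last eq user)
  }
}
```

Unlike getOuterClassesAndObjects(), this sketch returns only the objects and does not check class names; it is meant to show the shape of the traversal only.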

getClassReader(func.getClass).accept(new ReturnStatementFinder(), 0)

private class ReturnStatementFinder(targetMethodName: Option[String] = None)
  extends ClassVisitor(ASM6) {
  override def visitMethod(access: Int, name: String, desc: String,
      sig: String, exceptions: Array[String]): MethodVisitor = {

    // $anonfun$ covers Java 8 lambdas
    if (name.contains("apply") || name.contains("$anonfun$")) {
      // A method with suffix "$adapted" will be generated in cases like
      // { _:Int => return; Seq()} but not { _:Int => return; true}
      // closure passed is $anonfun$t$1$adapted while actual code resides in $anonfun$s$1
      // visitor will see only $anonfun$s$1$adapted, so we remove the suffix, see
      // https://github.com/scala/scala-dev/issues/109
      val isTargetMethod = targetMethodName.isEmpty ||
        name == targetMethodName.get || name == targetMethodName.get.stripSuffix("$adapted")

      new MethodVisitor(ASM6) {
        override def visitTypeInsn(op: Int, tp: String) {
          if (op == NEW && tp.contains("scala/runtime/NonLocalReturnControl") && isTargetMethod) {
            throw new ReturnStatementInClosureException
          }
        }
      }
    } else {
      new MethodVisitor(ASM6) {}
    }
  }
}
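The finder above looks for instantiations of scala.runtime.NonLocalReturnControl because that is how Scala 2 compiles a return written inside a closure. The sketch below (object and method names are made up for illustration) shows such a non-local return next to an equivalent that needs none.

```scala
object NonLocalReturnSketch {
  // `return` inside the lambda exits firstNegative itself, not just the lambda.
  // Scala 2 implements this by throwing scala.runtime.NonLocalReturnControl from
  // the lambda and catching it in the enclosing method.
  def firstNegative(xs: Seq[Int]): Option[Int] = {
    xs.foreach { x =>
      if (x < 0) return Some(x)
    }
    None
  }

  // The same search without a return inside the closure
  def firstNegativeSafe(xs: Seq[Int]): Option[Int] = xs.find(_ < 0)

  def main(args: Array[String]): Unit = {
    val data = Seq(3, 1, -2, 5)
    println(firstNegative(data) == firstNegativeSafe(data))
  }
}
```

Locally this works because the enclosing method is still on the stack to catch the control exception. A closure shipped to an executor runs after the enclosing method has returned, so nothing is left to catch it, which is one reason to reject such closures up front.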

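Once the scan has confirmed that there is no return statement and all enclosing objects have been identified, the only thing left to determine is which fields of those enclosing objects are actually referenced. After that, the objects can be cloned and the needed field values copied over.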

if (accessedFields.isEmpty) {
  logDebug(" + populating accessed fields because this is the starting closure")
  // Initialize accessed fields with the outer classes first
  // This step is needed to associate the fields to the correct classes later
  initAccessedFields(accessedFields, outerClasses)

  // Populate accessed fields by visiting all fields and methods accessed by this and
  // all of its inner closures. If transitive cleaning is enabled, this may recursively
  // visits methods that belong to other classes in search of transitively referenced fields.
  for (cls <- func.getClass :: innerClasses) {
    getClassReader(cls).accept(new FieldAccessFinder(accessedFields, cleanTransitively), 0)
  }
}

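Every field of the enclosing classes that is referenced gets recorded in accessedFields, which later steps consult to decide whether a field is in use.

After that, the $outer objects collected so far are instantiated afresh and cloned.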

for ((cls, obj) <- outerPairs) {
  logDebug(s" + cloning the object $obj of class ${cls.getName}")
  // We null out these unused references by cloning each object and then filling in all
  // required fields from the original object. We need the parent here because the Java
  // language specification requires the first constructor parameter of any closure to be
  // its enclosing object.
  val clone = cloneAndSetFields(parent, obj, cls, accessedFields)

  // If transitive cleaning is enabled, we recursively clean any enclosing closure using
  // the already populated accessed fields map of the starting closure
  if (cleanTransitively && isClosure(clone.getClass)) {
    logDebug(s" + cleaning cloned closure $clone recursively (${cls.getName})")
    // No need to check serializable here for the outer closures because we're
    // only interested in the serializability of the starting closure
    clean(clone, checkSerializable = false, cleanTransitively, accessedFields)
  }
  parent = clone
}
if (parent != null) {
  val field = func.getClass.getDeclaredField("$outer")
  field.setAccessible(true)
  // If the starting closure doesn't actually need our enclosing object, then just null it out
  if (accessedFields.contains(func.getClass) &&
    !accessedFields(func.getClass).contains("$outer")) {
    logDebug(s" + the starting closure doesn't actually need $parent, so we null it out")
    field.set(func, null)
  } else {
    // Update this closure's parent pointer to point to our enclosing object,
    // which could either be a cloned closure or the original user object
    field.set(func, parent)
  }
}

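Each $outer object found by the earlier scan is cloned here. The clone is not a full copy: only the fields recorded in accessedFields are copied over from the original, so the closure can still reach what it uses while every other field is left at its default value. Control then returns to func, the closure being cleaned. If the entry for func in accessedFields does not contain $outer, meaning the closure body never uses its enclosing object, the field is set to null, which reduces what has to be shipped over the network and relaxes the serializability requirement. Otherwise the field is set to parent.

That concludes the main flow of ClosureCleaner.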
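The idea of cloning an object while keeping only the accessed fields can be sketched with plain reflection. This is a simplified illustration under stated assumptions, not Spark's cloneAndSetFields: the class Enclosing and the helper cloneWithFields are hypothetical, the new instance is created through a no-argument constructor, and superclass fields are ignored.

```scala
object PartialCloneSketch {
  // Hypothetical enclosing class: the closure is assumed to need only `used`
  class Enclosing {
    var used: Int = 0
    var unused: Array[Byte] = null
  }

  // Create a fresh instance and copy over only the named fields.
  // Assumes the class has an accessible no-argument constructor.
  def cloneWithFields[T <: AnyRef](obj: T, accessed: Set[String]): T = {
    val cls = obj.getClass
    val copy: AnyRef = cls.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    for (name <- accessed) {
      val field = cls.getDeclaredField(name)
      field.setAccessible(true)
      field.set(copy, field.get(obj))
    }
    copy.asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val original = new Enclosing
    original.used = 42
    original.unused = new Array[Byte](1024)

    val cleaned = cloneWithFields(original, Set("used"))
    println(cleaned.used)           // value copied from the original
    println(cleaned.unused == null) // field that was not accessed stays unset
  }
}
```

The copied values are shared by reference with the original object; only the set of populated fields shrinks, which is what keeps unused and possibly non-serializable state out of the serialized closure.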
