Spark Caused by: java.io.NotSerializableException: serialization pitfalls I ran into

I recently needed to create a class instance on the driver side and then use it inside an rdd, but doing so threw Caused by: java.io.NotSerializableException. According to what I found online, this happens when the class does not extend Serializable: a class that does not extend Serializable will not be serialized automatically. So I made my class extend Serializable, ran it again, and still got the serialization exception, which left me completely puzzled. The way I call the class is also not quite the same as what the various online posts describe, and none of their fixes solved my problem. Enough preamble; here is the problem:

The problem

Class ClassB has a field whose value is an instance of ClassA, and a method fun whose main job is to call a method of ClassA inside an rdd. Instantiating ClassB and then calling fun throws a serialization exception, even though both ClassA and ClassB extend Serializable here. The code and the exception are as follows:

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by zhoujiamu on 2019/1/21.
  */

class ClassA extends Serializable{
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends Serializable{
  val classA = new ClassA()

  def fun(): Unit = {
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> "+classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    lazy val sc = new SparkContext(conf)
    
    val classB = new ClassB(sc)
    
    classB.fun()

  }
}

The exception:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.map(RDD.scala:369)
	at com.haizhi.test.ClassB.fun(SerializableTest.scala:18)
	at com.haizhi.test.SerializableTest$.main(SerializableTest.scala:35)
	at com.haizhi.test.SerializableTest.main(SerializableTest.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
	- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@433e536f)
	- field (class: com.haizhi.test.ClassB, name: sc, type: class org.apache.spark.SparkContext)
	- object (class com.haizhi.test.ClassB, com.haizhi.test.ClassB@667e34b1)
	- field (class: com.haizhi.test.ClassB$$anonfun$fun$1, name: $outer, type: class com.haizhi.test.ClassB)
	- object (class com.haizhi.test.ClassB$$anonfun$fun$1, <function1>)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
	... 12 more

I extended Serializable exactly the way the online experts described, yet the problem above was still not fixed. Notice, though, what the serialization stack actually complains about: not ClassA or ClassB, but org.apache.spark.SparkContext, which ClassB holds in its field sc; the map closure captures the enclosing ClassB instance and drags the SparkContext along with it. I will take this opportunity to write down the whole exploration and summarize Spark class serialization here.

Exploration & solutions

Below I go through each cause of serialization exceptions I have hit, with the fix and an example for each.

Serialization pitfall 1 (the one commonly covered online)

A class that does not extend Serializable is instantiated outside the rdd, and that instance is then used inside the rdd, as in the following code:

class ClassA {
  def getClassName: String = this.getClass.getSimpleName
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      
    lazy val sc = new SparkContext(conf)

    val classA = new ClassA()
    val rdd = sc.makeRDD(1 to 5)
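    // classA is captured by the map closure below, but ClassA does not
    // extend Serializable, so Spark cannot serialize the task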
   
    rdd.map(i => "getClassName in main -> " + classA.getClassName + s": $i")
      .collect().foreach(println)
  }
}

How to fix it

Fix 1: make ClassA extend Serializable

class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

Fix 2: instantiate ClassA inside the rdd

rdd.map(i => {
      val classA = new ClassA
      "getClassName in main -> " + classA.getClassName + s": $i"
    }).collect().foreach(println)
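
Constructing a new ClassA for every element can be wasteful. A variant of this fix, not shown in the original post, uses the standard mapPartitions operator to build one instance per partition instead (a minimal sketch):

rdd.mapPartitions(iter => {
      // one ClassA per partition instead of one per element
      val classA = new ClassA
      iter.map(i => "getClassName in main -> " + classA.getClassName + s": $i")
    }).collect().foreach(println)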

Fix 3: turn ClassA into an object (a Scala singleton, which is instantiated automatically) and call its method directly inside the rdd

object ClassA {
  def getClassName: String = this.getClass.getSimpleName
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      
    lazy val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(1 to 5)
   
    rdd.map(i => "getClassName in main -> " + ClassA.getClassName + s": $i")
      .collect().foreach(println)
  }
}

Serialization pitfall 2

Inside the rdd, calling a method of an object that is held as a field of another class throws a serialization exception. Code:

class ClassA {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends Serializable{
  val classA = new ClassA()

  def fun(): Unit = {
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> "+classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    val rdd = sc.makeRDD(1 to 5)
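    // the map closure below captures classB; serializing classB means serializing
    // its fields too, including classA (no Serializable here) and the SparkContext sc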

    rdd.map(i => "getClassName in main -> " + classB.classA.getClassName + s": $i")
      .collect().foreach(println)
  }
}

As shown above, calling a method of the ClassA held as a field of ClassB from inside the rdd throws a serialization exception.

How to fix it

Fix 1: this ClassB is a rather clumsy design; holding ClassA as a field is not great. If the only goal is to call ClassA's methods, ClassB can simply extend ClassA

class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends ClassA with Serializable{
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    val rdd = sc.makeRDD(1 to 5)

    rdd.map(i => "getClassName in main -> " + classB.getClassName + s": $i")
      .collect().foreach(println)
  }
}

Fix 2: outside the rdd, first pull the ClassA out of ClassB into a variable, then use that variable inside the rdd

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    lazy val sc = new SparkContext(conf)

    val classB = new ClassB(sc)

    val a = classB.classA

    val rdd = sc.makeRDD(1 to 5)

    rdd.map(i => "getClassName in main -> " + a.getClassName + s": $i")
      .collect().foreach(println)
  }
}

This is similar to the fix for pitfall 1: the closure now captures only the local variable a, i.e. the ClassA instance itself, rather than the whole classB, so the non-serializable parts of ClassB stay out of the closure. Note that ClassA still has to extend Serializable for this to work, since the instance held in a is serialized with the task.

Serialization pitfall 3

After all that rambling, this is the pit I actually fell into: ClassB has a method fun and a field classA, and fun calls a method of classA inside the rdd.

class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends Serializable{
  val classA = new ClassA()

  def fun(): Unit = {
    val rdd = sc.makeRDD(1 to 5)
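    // classA here really means this.classA, so the closure captures the whole
    // ClassB instance and tries to serialize its SparkContext field sc as well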
    rdd.map(i => "getClassName in ClassB -> "+classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    classB.fun()
  }
}

How to fix it

Fix 1: do not use the field classA inside fun; instead construct a fresh ClassA inside fun

def fun(): Unit = {
    val classA = new ClassA()
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> "+classA.getClassName + s": $i")
      .collect.foreach(println)
  }

This is similar to the fix for pitfall 1. In many cases, though, ClassA is a fairly complete utility class that is called from more than just this one method, so we really do want to keep it as a field of ClassB.

Fix 2: as before, inside fun assign the field to a variable just before the rdd and call through that variable

def fun(): Unit = {
    val a = classA
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> "+a.getClassName + s": $i")
      .collect.foreach(println)
  }

Fix 3: change ClassB into an object (a Scala singleton) and pass the SparkContext into fun instead of holding it as a field

class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

object ClassB extends Serializable{
  val classA = new ClassA()

  def fun(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> "+classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    lazy val sc = new SparkContext(conf)
    val classB = ClassB
    classB.fun(sc)
  }
}
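
For completeness, one more pattern that is often used in exactly this situation, although it is not one I tried above: keep ClassB as a regular class but mark the SparkContext field @transient, so that closure serialization skips it (a sketch, assuming sc is only ever needed on the driver):

class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(@transient val sc: SparkContext) extends Serializable {
  val classA = new ClassA()

  def fun(): Unit = {
    // sc is only used here on the driver; because the field is @transient,
    // it is skipped when the closure drags the ClassB instance along
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> " + classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}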

Summary

Working through the pits above reveals the following rules:
1. When a class instance created outside the rdd is used inside it, the class needs to extend Serializable.
2. In a non-static class (one declared with class), a field holding an object cannot be referenced directly inside the rdd, even if that object itself extends Serializable, because referencing the field captures the whole enclosing instance along with every field it holds (such as a SparkContext). Assign the field to a local variable before the rdd and use that variable inside the rdd instead, as sketched below.
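
To make rule 2 concrete, here is the fun method from pitfall 3 once more, annotated to show what gets captured (a sketch; the failing variant is left commented out):

def fun(): Unit = {
    val rdd = sc.makeRDD(1 to 5)

    // Referring to the field classA is really this.classA, so the closure would
    // capture the whole ClassB instance, including its SparkContext field sc:
    // rdd.map(i => "getClassName in ClassB -> " + classA.getClassName + s": $i")   // Task not serializable

    // Copying the field into a local val means only that val is captured:
    val a = classA
    rdd.map(i => "getClassName in ClassB -> " + a.getClassName + s": $i")
      .collect.foreach(println)
  }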

That wraps up this round of Spark serialization issues; if anything is still unclear, feel free to leave a comment.
