Cannot create encoder for Option of Product type, because Product type is represented as a row


When using Spark SQL, you may run into the following error:

java.lang.UnsupportedOperationException: Cannot create encoder for Option of Product type, because Product type is represented as a row, and the entire row can not be null in Spark SQL like normal databases. You can wrap your type with Tuple1 if you do want top level null Product objects, e.g. instead of creating `Dataset[Option[MyClass]]`, you can do something like `val ds: Dataset[Tuple1[MyClass]] = Seq(Tuple1(MyClass(...)), Tuple1(null)).toDS`
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:52)
	at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
	at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
	at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)

This problem is usually caused by Spark not finding an encoder when converting the data (Rows) into the corresponding Dataset type. You need to provide an implicit encoder for that type, typically by adding:

implicit val registerKryoEncoder = Encoders.kryo[MyClass]
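A minimal sketch of how such a Kryo encoder is typically used (MyClass and the sample data here are hypothetical):

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// a hypothetical class that has no built-in Spark SQL encoder
class MyClass(val id: Int, val name: String) extends Serializable

val spark = SparkSession.builder().appName("kryo-encoder-example").getOrCreate()

// with this implicit in scope, Spark can build a Dataset[MyClass];
// Kryo stores each object as a single opaque binary column
implicit val registerKryoEncoder: Encoder[MyClass] = Encoders.kryo[MyClass]

val ds = spark.createDataset(Seq(new MyClass(1, "a"), new MyClass(2, "b")))

Note that a Kryo encoder serializes the whole object into one binary column, so it is mainly useful for types Spark cannot encode natively; as described below, it did not help in my case.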

 

Background:

When I started writing this Spark job I was still using the Spark 1.x API, so sc.textFile(ads_channel_type_path).map(...) returned an RDD. Then I switched the Spark entry point to SparkSession, as follows:

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

With the same code, still reading a file from HDFS, spark.read.textFile(basePath).map(...) now returns a Dataset, and resubmitting the job produces the error above. The cause is still that no encoder can be found when the data is converted into the corresponding Dataset type, but the Kryo-based workaround above does not help here.
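For comparison, a minimal sketch of the two read paths (basePath is just a placeholder):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

// Spark 1.x style: SparkContext.textFile returns an RDD[String];
// RDD.map does not need an Encoder for its result type
val rddLines: RDD[String] = spark.sparkContext.textFile(basePath)

// Spark 2.x style: DataFrameReader.textFile returns a Dataset[String];
// Dataset.map requires an Encoder for its result type
val dsLines: Dataset[String] = spark.read.textFile(basePath)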

Here is my pseudocode:

// create the SparkSession
val spark = SparkSession
   .builder()
   .appName(appname)
   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .config("spark.kryoserializer.buffer.max","1024m")
   .getOrCreate()


// note: map here returns a Dataset, not an RDD
val rdd = spark.read.textFile(path).map(line => {
  // business logic omitted...
  if (event == "show") {
    // business logic omitted...
    if (currentDay == day) {
      // business logic omitted...
      Option((id, age, name))   // return value
    } else {
      None                      // return value
    }
  } else {
    None                        // return value
  }
})
  .toDF("id", "age", "name")    // to DataFrame
  .cache()
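For reference, the workaround suggested by the error message itself is to wrap the Product type in Tuple1, so that the whole row may be null. A minimal sketch (MyClass is a hypothetical stand-in for the (id, age, name) tuple above); the rest of this post takes a different route instead:

import spark.implicits._

case class MyClass(id: Int, age: Int, name: String)

// Dataset[Option[MyClass]] is not supported, but Dataset[Tuple1[MyClass]] is;
// a Tuple1(null) element stands in for the "missing" case
val ds = Seq(Tuple1(MyClass(1, 20, "a")), Tuple1(null.asInstanceOf[MyClass])).toDS()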

So what should we do? The problem is that when the data (Rows) is converted into the corresponding Dataset type, no matching encoder can be found. See the example in this Stack Overflow answer:

https://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row

The key points are as follows:

There is nothing unexpected here. You're trying to use code which has been written with Spark 1.x and is no longer supported in Spark 2.0:

  • in 1.x DataFrame.map is ((Row) ⇒ T)(ClassTag[T]) ⇒ RDD[T]
  • in 2.x Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]

To be honest it didn't make much sense in 1.x either. Independent of version you can simply use DataFrame API:

import org.apache.spark.sql.functions.{when, lower}

val df = Seq(
  (2012, "Tesla", "S"), (1997, "Ford", "E350"),
  (2015, "Chevy", "Volt")
).toDF("year", "make", "model")

df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))

If you really want to use map you should use statically typed Dataset:

import spark.implicits._

case class Record(year: Int, make: String, model: String)

df.as[Record].map {
  case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
  case rec => rec
}

or at least return an object which will have implicit encoder:

df.map {
  case Row(year: Int, make: String, model: String) => 
    (year, if(make.toLowerCase == "tesla") "S" else make, model)
}

Finally if for some completely crazy reason you really want to map over Dataset[Row] you have to provide required encoder:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
  StructField("year", IntegerType),
  StructField("make", StringType),
  StructField("model", StringType)
))

val encoder = RowEncoder(schema)

df.map {
  case Row(year, make: String, model) if make.toLowerCase == "tesla" => 
    Row(year, "S", model)
  case row => row
} (encoder)

 

Reading the description and solutions above carefully: because a Dataset is strongly typed, when you operate on Rows you must supply an encoder. So in my code I can change the Option results to Row and provide an encoder. Pseudocode:

// create the SparkSession
val spark = SparkSession
   .builder()
   .appName(appname)
   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   .config("spark.kryoserializer.buffer.max","1024m")
   .getOrCreate()


// Since a Dataset is strongly typed, converting Rows into a Dataset requires an
// explicit encoder; converting a DataFrame into a Dataset requires as[Object].
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("age", IntegerType),
    StructField("name", StringType)
))
val encoder = RowEncoder(schema)


// note: map here returns a Dataset, not an RDD
val rdd = spark.read.textFile(path).map(line => {
  // business logic omitted...
  if (event == "show") {
    // business logic omitted...
    if (currentDay == day) {
      // business logic omitted...
      Row(id, age, name)   // return value
    } else {
      Row.empty            // return value
    }
  } else {
    Row.empty              // return value
  }
})(encoder)
  .filter(_ != Row.empty)
  .toDF("id", "age", "name")    // to DataFrame
  .cache()
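Since the pseudocode above omits the business logic, here is a self-contained sketch of the same map-with-RowEncoder idea, assuming a hypothetical comma-separated input of "id,age,name,event" lines; skipped records are emitted as all-null rows and filtered out afterwards:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("row-encoder-sketch").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("age", IntegerType),
  StructField("name", StringType)
))
val encoder = RowEncoder(schema)

val df = spark.read.textFile("hdfs:///tmp/example.txt")   // placeholder path
  .map { line =>
    val fields = line.split(",")
    if (fields.length == 4 && fields(3) == "show")
      Row(fields(0).toInt, fields(1).toInt, fields(2))
    else
      Row(null, null, null)            // placeholder for records to drop
  }(encoder)                           // explicit encoder: Row has no implicit one
  .filter(row => !row.isNullAt(0))     // drop the placeholder rows
  .toDF("id", "age", "name")
  .cache()

Emitting all-null placeholder rows (instead of Row.empty) keeps every row compatible with the three-column schema when it is serialized, which avoids depending on the filter being pushed below the serialization step.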

 

 

 

When you run into a strange exception, think it through and spend more time reading the official documentation; it helps a lot!

Thanks to this blogger for the guidance!!!

Spark SQL study notes:

https://segmentfault.com/a/1190000010039233?utm_source=tag-newest
