Structured Streaming + Kafka consumer fails on a single record that is too large

While reading Kafka data with Structured Streaming, a record larger than the fetch size caused the job to fail with the following error:

20/06/06 11:40:01 org.apache.spark.internal.Logging$class.logError(Logging.scala:70) ERROR TaskSetManager: Task 7 in stage 96.0 failed 4 times; aborting job
20/06/06 11:40:01 org.apache.spark.internal.Logging$class.logError(Logging.scala:91) ERROR MicroBatchExecution: Query sink alarm result to event table [id = f0960793-2c6e-4202-b099-ffd614471716, runId = 28fcb5c3-68a4-49be-814a-a7197336c449] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 96.0 failed 4 times, most recent failure: Lost task 7.3 in stage 96.0 (TID 1295, 10.123.42.47, executor 1): org.apache.kafka.common.errors.RecordTooLargeException: There are some messages at [Partition=Offset]: {intelligent_driving-3=13632613} whose size is larger than the fetch size 504827599 and hence cannot be ever returned. Increase the fetch size on the client (using max.partition.fetch.bytes), or decrease the maximum message size the broker will allow (using message.max.bytes).

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
	

This problem is common enough, and most write-ups online say that setting the "max.partition.fetch.bytes" parameter is all it takes. But after configuring "max.partition.fetch.bytes" in Structured Streaming, the job still failed with exactly the same error.
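
For reference, a sketch of what that first, failing attempt looked like. The broker address is a placeholder, the topic name is taken from the error log above, and the 1 GiB value is just an example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-large-record").getOrCreate()

// Setting the property the way a plain Kafka consumer expects it. Under
// Structured Streaming this option never reaches the consumer (see the
// source excerpt below), so the RecordTooLargeException persists.
val failing = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "intelligent_driving")          // topic from the error log
  .option("max.partition.fetch.bytes", "1073741824")   // no effect: missing "kafka." prefix
  .load()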

Cause: after tracing through the Structured Streaming source code for how the data stream is created, I found that Structured Streaming rewrites the parameter names before handing them to Kafka. The key code (from Spark's KafkaSourceProvider) is:

override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    validateBatchOptions(parameters)
    // normalize option keys to lower case for case-insensitive lookup
    val caseInsensitiveParams = parameters.map { case (k, v) => (k.toLowerCase(Locale.ROOT), v) }
    // only options whose key starts with "kafka." are treated as Kafka
    // consumer properties; the 6-character "kafka." prefix is stripped
    // before they are passed on to the consumer
    val specifiedKafkaParams =
      parameters
        .keySet
        .filter(_.toLowerCase(Locale.ROOT).startsWith("kafka."))
        .map { k => k.drop(6).toString -> parameters(k) }
        .toMap
    // ... rest of the method omitted
  }
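
To make the effect of that filter-and-drop(6) concrete, here is a small self-contained sketch of the same transformation (the option values are made up):

import java.util.Locale

val parameters = Map(
  "subscribe" -> "intelligent_driving",
  "max.partition.fetch.bytes" -> "1073741824",       // no "kafka." prefix: filtered out
  "kafka.max.partition.fetch.bytes" -> "1073741824"  // kept, prefix stripped
)
val specifiedKafkaParams = parameters
  .keySet
  .filter(_.toLowerCase(Locale.ROOT).startsWith("kafka."))
  .map { k => k.drop(6) -> parameters(k) }  // "kafka." is exactly 6 characters
  .toMap
println(specifiedKafkaParams)
// Map(max.partition.fetch.bytes -> 1073741824)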

Structured Streaming expects every Kafka consumer parameter to carry the "kafka." prefix; options without it are simply not forwarded to the consumer. So the parameter name must be configured in code as "kafka.max.partition.fetch.bytes".
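
With that, the fix is a one-line change to the failing read above (same placeholder broker and topic; pick a fetch size larger than your biggest record):

// The "kafka." prefix survives the filter in createRelation and is then
// stripped, so the underlying consumer actually receives
// max.partition.fetch.bytes=1073741824.
val fixed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "intelligent_driving")
  .option("kafka.max.partition.fetch.bytes", "1073741824")
  .load()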
