Notes on Using Spark Streaming + Kafka + ES

emm

I am not a professional developer; these are just a few notes.

Kafka

When a Kafka message is very large, the consumer fails with an error. Setting fetch.message.max.bytes to a sufficiently large value fixes it.

val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers,"fetch.message.max.bytes" -> "10485760" )
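The keys above belong to the old Kafka 0.8 consumer. As a rough sketch for the newer consumer used by the spark-streaming-kafka-0-10 integration, the equivalent settings are bootstrap.servers and max.partition.fetch.bytes. The object name, method name, group id and deserializer choice below are assumptions made for illustration, and the broker-side message size limits have to allow the same size.

```scala
object KafkaParams010 {
  // Builds consumer settings for the Kafka 0.10+ consumer.
  // `groupId` and the String deserializers are illustrative assumptions.
  def build(brokers: String, groupId: String, maxFetchBytes: Int): Map[String, Object] = Map(
    "bootstrap.servers" -> brokers,
    "group.id" -> groupId,
    "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    // Per-partition fetch limit; replaces fetch.message.max.bytes of the old consumer.
    "max.partition.fetch.bytes" -> maxFetchBytes.toString
  )
}
```

For example, KafkaParams010.build("broker1:9092", "demo", 10485760) gives the same 10 MB limit as the snippet above.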

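About partitions
The Kafka direct stream produces as many partitions as the Kafka topic has.
If the computation costs far more than the shuffle caused by repartitioning, use repartition; otherwise it is better to increase the number of jobs running concurrently.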
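A minimal sketch of the repartition option, assuming the stream has already been mapped to plain strings. The object name and `targetPartitions` are hypothetical; a sensible value depends on how many cores the job has.

```scala
import org.apache.spark.streaming.dstream.DStream

object Repartitioning {
  // Spreads each batch over `targetPartitions` partitions.
  // This triggers a shuffle, so it only pays off for heavy per-record work.
  def spreadOut(stream: DStream[String], targetPartitions: Int): DStream[String] =
    stream.repartition(targetPartitions)
}
```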

Spark Streaming

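1. Increasing the number of concurrent jobs

By default Spark Streaming runs only one job at a time, so no matter how many cores you give it, they cannot be fully used if there are not enough jobs.

To run more jobs at once, set the spark.streaming.concurrentJobs parameter: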

spark-submit --conf spark.streaming.concurrentJobs=8 ....

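In practice a job is split into multiple tasks, each CPU core runs one task, and the core is released as soon as its task finishes. So for a stream with 8 partitions, having 32 cores does not mean that exactly 4 jobs can run at once. Check core usage on the Executors page of the Spark Web UI, then raise concurrentJobs or lower the number of cores accordingly.
In my case utilization was low and the streaming job always kept up, so I could reduce the number of cores.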
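The same tuning can also be done in code. This is only a sketch: the app name and the values are made up, and spark.cores.max only applies to standalone (and Mesos coarse-grained) deployments. Also keep in mind that spark.streaming.concurrentJobs is not listed in the official configuration reference, and with a value above 1 batches may finish out of order.

```scala
import org.apache.spark.SparkConf

object StreamingConf {
  // Illustrative values: fewer cores, a few concurrent jobs.
  val conf: SparkConf = new SparkConf()
    .setAppName("streaming-demo")
    .set("spark.streaming.concurrentJobs", "4")
    .set("spark.cores.max", "16")
}
```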

2. GC tuning
Use the CMS garbage collector:

spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC"

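Since switching to this collector, GC time has gone down and memory no longer blows past its limit so easily; everything just runs smoother. I do not know the exact mechanism yet and will fill it in later.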

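3. Caching
Whenever data is reused, remember to cache it; otherwise the whole processing pipeline is executed again from the start.
The cached type needs to be serializable.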
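A small sketch of the reuse case, assuming a stream of plain strings. The object and method names are hypothetical. Without cache(), the second action would recompute the batch from its source.

```scala
import org.apache.spark.streaming.dstream.DStream

object CacheDemo {
  def countErrors(lines: DStream[String]): Unit =
    lines.foreachRDD { rdd =>
      // The RDD is used by two actions below, so keep it in memory.
      rdd.cache()
      val total = rdd.count()
      val errors = rdd.filter(_.contains("ERROR")).count()
      println(s"total=$total errors=$errors")
      // Release the cached blocks once this batch is done.
      rdd.unpersist()
    }
}
```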
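4. Serialization and deserialization
The driver packs and serializes the content of each task and sends it to the executors, so every type referenced inside a task must be serializable.
If a type is not serializable, you get an "object not serializable" error. In that case you need to implement the serialization and deserialization methods yourself; usually only the deserialization method (readObject) is needed.
In Scala, extend Serializable (the old @serializable annotation is gone from current Scala versions); in Java, implement the Serializable interface.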

private def readObject(in: ObjectInputStream): Unit = {
  // Call the default readObject implementation first
  in.defaultReadObject()
  // Re-initialize the members that could not be serialized
  this.init(this.config_map)
}

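Fields that cannot be serialized (MySQL connections, Redis connections and so on) need the transient modifier, which tells serialization to skip them; they are then rebuilt in readObject.

Scala also has the lazy modifier, which defers construction until first use. Combining lazy with transient makes the field rebuild itself the first time it is used after deserialization.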

 @transient lazy val logger:Logger = LogManager.getLogger(this.getClass.getName)
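To see both approaches working without a cluster, here is a self-contained sketch that pushes an object through plain Java serialization. FakeConnection, Worker and RoundTrip are hypothetical names; FakeConnection merely stands in for a real MySQL or Redis connection.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable resource such as a DB connection.
class FakeConnection(val url: String)

class Worker(val url: String) extends java.io.Serializable {
  // Skipped during serialization, rebuilt eagerly in readObject.
  @transient private var conn: FakeConnection = new FakeConnection(url)

  // Skipped during serialization, rebuilt lazily on first access.
  @transient lazy val tag: String = s"worker-for-$url"

  def connUrl: String = conn.url

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()          // restores `url`
    conn = new FakeConnection(url)  // rebuilds what was skipped
  }
}

object RoundTrip {
  // Serializes an object to bytes and reads it back, like shipping it to an executor.
  def copy[T <: java.io.Serializable](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

After RoundTrip.copy(new Worker("jdbc:demo")), both connUrl and tag are usable again on the copy even though neither field was written to the byte stream.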

The following example, a wrapper class around the Kafka producer, can be used as a reference:

import java.util.concurrent.Future
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}
import scala.collection.JavaConverters._

class KafkaSink[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
  /* This is the key idea that allows us to work around running into
     NotSerializableExceptions. */
  lazy val producer = createProducer()
  def send(topic: String, key: K, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, key, value))
  def send(topic: String, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, value))
}

object KafkaSink {
  def apply[K, V](config: Map[String, Object]): KafkaSink[K, V] = {
    val createProducerFunc = () => {
      // KafkaProducer expects a java.util.Map, so convert explicitly.
      val producer = new KafkaProducer[K, V](config.asJava)
      sys.addShutdownHook {
        // Ensure that, on executor JVM shutdown, the Kafka producer sends
        // any buffered messages to Kafka before shutting down.
        producer.close()
      }
      producer
    }
    new KafkaSink(createProducerFunc)
  }
  def apply[K, V](config: java.util.Properties): KafkaSink[K, V] =
    apply[K, V](config.asScala.toMap)
}
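One way to use this class, sketched under the assumption that the KafkaSink above is on the classpath: broadcast the sink once on the driver, then call send on the executors. The object name, `producerConfig` and `outTopic` are hypothetical. Only the factory function is shipped; the actual producer is created lazily on each executor at the first send.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object KafkaSinkUsage {
  // `producerConfig` and `outTopic` are placeholder names for this sketch.
  def forward(ssc: StreamingContext,
              stream: DStream[String],
              producerConfig: Map[String, Object],
              outTopic: String): Unit = {
    // Broadcast the serializable wrapper, not the producer itself.
    val sink = ssc.sparkContext.broadcast(KafkaSink[String, String](producerConfig))
    stream.foreachRDD { rdd =>
      rdd.foreach(msg => sink.value.send(outTopic, msg))
    }
  }
}
```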

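5. Broadcast variables
Variables used inside a task are serialized and shipped with every task. To verify this, override readObject as shown above and print some debug output.

Large or complex content that stays unchanged for a long time can be stored in a broadcast variable instead, which guarantees that each executor holds only one copy.

Things like Redis connections, MySQL connections or rule configurations are good candidates. Since a connection object itself cannot be serialized, it has to go through a serializable lazy wrapper such as the KafkaSink above.

The class wrapped in a broadcast variable also needs to be serializable.

Broadcast variables are read-only.

For more on how to use broadcast variables, see this article: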
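As a minimal illustration of the rule-configuration case, the sketch below broadcasts a small lookup table and reads it inside a map. The object name, the severity table and the log line format (level as the first word) are all made up for the example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BroadcastDemo {
  def tagLevels(sc: SparkContext, lines: RDD[String]): RDD[(String, Int)] = {
    // Shipped to each executor once instead of once per task.
    val severity = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))
    lines.map { line =>
      val level = line.takeWhile(_ != ' ') // first word of the line
      (line, severity.value.getOrElse(level, 0))
    }
  }
}
```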
https://www.jianshu.com/p/3bd18acd2f7f

Using broadcast variables to update configuration:

For details, see this article:
https://www.cnblogs.com/liuliliuli2017/p/6782687.html
A wrapper class written by someone far more experienced than me:

import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.StreamingContext
import scala.reflect.ClassTag

// This wrapper lets us update broadcast variables within DStreams' foreachRDD
// without running into serialization issues
case class BroadcastWrapper[T: ClassTag](
    @transient private val ssc: StreamingContext,
    @transient private val _v: T) {

  @transient private var v = ssc.sparkContext.broadcast(_v)

  def update(newValue: T, blocking: Boolean = false): Unit = {
    // `blocking`: whether to wait until the old broadcast has been removed
    v.unpersist(blocking)
    v = ssc.sparkContext.broadcast(newValue)
  }

  def value: T = v.value

  // Only the Broadcast handle is written; ssc and _v stay on the driver.
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeObject(v)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    v = in.readObject().asInstanceOf[Broadcast[T]]
  }
}
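A rough sketch of how the wrapper might be used for periodic config refresh, assuming the BroadcastWrapper above is on the classpath. The object name, `loadRules` and `refreshMs` are hypothetical; in a real job loadRules would read from wherever the rules are stored.

```scala
import org.apache.spark.streaming.dstream.DStream

object RuleRefresh {
  def countBlocked(stream: DStream[String],
                   rules: BroadcastWrapper[Set[String]],
                   loadRules: () => Set[String],
                   refreshMs: Long): Unit = {
    var lastRefresh = System.currentTimeMillis()
    stream.foreachRDD { rdd =>
      // This part of foreachRDD runs on the driver, once per batch.
      val now = System.currentTimeMillis()
      if (now - lastRefresh >= refreshMs) {
        rules.update(loadRules(), blocking = true)
        lastRefresh = now
      }
      // Local copy so the executor-side closure captures only the wrapper.
      val currentRules = rules
      val blocked = rdd.filter(line => currentRules.value.contains(line)).count()
      println(s"blocked lines in this batch: $blocked")
    }
  }
}
```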

ElasticSearch

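1. Index names must be lowercase. I learned that the hard way.
2. If you only use a single ES account, the global es.nodes and related settings are enough. If you use several ES sources, pass a separate config Map for each: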

import org.elasticsearch.spark._

val esin_setting = Map[String, String](
  "es.nodes" -> "es1",
  "es.port" -> "7001")
val esout_setting = Map[String, String](
  "es.nodes" -> "es2",
  "es.port" -> "7001",
  "es.scroll.size" -> "5000")
// Read from the first cluster, write to the second one
val rdd = sc.esRDD("indexin", query, esin_setting)
rdd.saveToEs("esout/type1", esout_setting)

3. For more advanced settings, see the configuration page in the official ElasticSearch documentation:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
