Spark Streaming can consume Kafka in two modes: Receiver and Direct

I. Spark Streaming + Kafka Receiver mode

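In Receiver mode, Spark Streaming reads from Kafka through a receiver, which permanently occupies one task just to receive data. Received data is stored at the MEMORY_AND_DISK_SER_2 storage level. This mode is rarely used in practice.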

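Once the Spark Streaming application is running, receiver tasks in the Executors receive data from Kafka. The data is persisted, by default at the MEMORY_AND_DISK_SER_2 level, which can be changed. The receiver task stores and replicates the received data, which involves data transfer between nodes. When replication completes, it updates the consumer offset in ZooKeeper and then reports the data locations to the receiver tracker in the Driver. Finally, the Driver dispatches tasks to the nodes according to data locality.
Why this mode is rarely used:
It can lose data.
If the Driver dies after a message has been received and the ZooKeeper offset has been updated, the Executors under that Driver are killed as well, and some of the data held in Executor memory is lost. Because the offset has already advanced, that data is not read again after a restart.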

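How to fix the data loss
Enable the WAL (Write Ahead Log). After the Executor has replicated the data, it also writes a copy to HDFS, and only then updates the consumer offset. With the WAL enabled, the storage level of received data can be lowered to a non-replicated one, for example MEMORY_AND_DISK_SER. Enabling the WAL requires a checkpoint directory to be set.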
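
The settings described above can be sketched roughly as follows, assuming the Kafka 0.8 receiver API (`spark-streaming-kafka`) that the code later in this post also uses. The WAL is switched on through the `spark.streaming.receiver.writeAheadLog.enable` property, a checkpoint directory is set, and the receiver is created with the non-replicated `MEMORY_AND_DISK_SER` level. The ZooKeeper quorum, consumer group id, and checkpoint path are placeholders, not values from this post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverWithWalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]") // at least 2 threads: one is occupied by the receiver
      .setAppName("ReceiverWithWalSketch")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    // The WAL needs a checkpoint directory (placeholder path).
    ssc.checkpoint("/tmp/tmp/receiver_wal_checkpoint")

    val stream = KafkaUtils.createStream(
      ssc,
      "hadoop01:2181,hadoop02:2181,hadoop03:2181", // assumed ZooKeeper quorum
      "wal-sketch-group",                          // assumed consumer group id
      Map("flume-kafka" -> 1),                     // topic -> number of consumer threads
      StorageLevel.MEMORY_AND_DISK_SER)            // no "_2": the WAL already keeps a copy

    // Messages arrive as (key, value) pairs; print the values of each batch.
    stream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```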

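The new problem the WAL introduces
The offset is updated only after the data has been backed up to HDFS. Only then are the data locations reported and tasks dispatched to process the data, so processing latency increases.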

Parallelism in Receiver mode: [the number of partitions of the RDD in the DStream generated for each batch]

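With spark.streaming.blockInterval = 200ms, the data received during a batchInterval is packed into one block every 200 ms, and the blocks generated within one batchInterval make up that batch. For example, with batchInterval = 5s, each 5 s batch contains 25 blocks. A batch corresponds to an RDD and a block to a partition: each block becomes one partition of the RDD.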

How to increase RDD parallelism: for a fixed batchInterval, reduce spark.streaming.blockInterval. Values below 50 ms are not recommended.
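
The arithmetic behind this can be checked with a few lines of plain Scala. This is only a sketch: `blocksPerBatch` is a hypothetical helper, not a Spark API, and it gives the block count for a single receiver assuming data arrives during every block interval.

```scala
object BlockIntervalMath {
  /** Blocks (and therefore RDD partitions) one receiver produces per batch. */
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "blockInterval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    println(blocksPerBatch(5000, 200)) // 25, the example from the text
    println(blocksPerBatch(5000, 100)) // 50
    println(blocksPerBatch(5000, 50))  // 100, at the recommended lower bound
  }
}
```

Halving the block interval doubles the number of partitions per batch, which is why lowering it raises parallelism.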

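Spark Streaming + Kafka Receiver mode, in summary:

1. It can lose data, so it is not commonly used.
2. Enabling the WAL solves the data loss but makes processing latency high.
3. Under the hood, Receiver mode consumes Kafka through the High Level Consumer API, which hides consumer offsets: you cannot obtain the offsets of each batch, nor resume consumption from a specified offset.
4. Receiver mode uses ZooKeeper to maintain consumer offsets.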

II. Spark Streaming + Kafka Direct mode

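Spark Streaming + Kafka Direct mode:
No task is tied up receiving data. When a batch is processed, its data is read directly from Kafka. Parallelism in Direct mode is one-to-one with the number of partitions in the topic being read.

In Direct mode, Spark Streaming treats Kafka as a data store: instead of passively receiving data, it actively pulls it. Consumer offsets are not managed in ZooKeeper; Spark Streaming tracks them itself. By default they are kept in memory, and if a checkpoint directory is set they are also saved in the checkpoint. You can also store them in ZooKeeper yourself.
Because Spark tracks the offsets itself, Direct mode also lets you maintain consumer offsets manually.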
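
To make the manual offset handling concrete, here is a minimal sketch assuming the Kafka 0.8 direct API (`org.apache.spark.streaming.kafka`). `HasOffsetRanges` and `OffsetRange` come from that package. `DirectOffsetsSketch` and `logOffsets` are names made up for this illustration, and where the ranges get stored is left open.

```scala
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object DirectOffsetsSketch {
  /** Prints the Kafka offset range each batch covers, per topic partition. */
  def logOffsets(stream: InputDStream[(String, String)]): Unit = {
    stream.foreachRDD { rdd =>
      // The cast only works on the RDD produced directly by createDirectStream,
      // before any map/flatMap/repartition changes the partitioning.
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} until ${r.untilOffset}")
      }
      // A real job would process rdd here and then save `ranges` to its own store.
    }
  }
}
```

It would be called on the stream returned by `KafkaUtils.createDirectStream`, before `.map(_._2)` is applied.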

How to increase parallelism?

Increase the number of partitions in the topic being read.
Repartition the DStream after reading it.
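
The second option might look like the following sketch. `DStream.repartition` is the standard Spark Streaming method; the wrapper object and the function name `widen` are only there for illustration.

```scala
import org.apache.spark.streaming.dstream.DStream

object RepartitionSketch {
  /** Spreads each batch over `numPartitions` partitions (this adds a shuffle). */
  def widen(lines: DStream[String], numPartitions: Int): DStream[String] =
    lines.repartition(numPartitions)
}
```

Since repartitioning shuffles every batch, it probably pays off only when the per-record processing is heavy enough to outweigh that cost.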

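III. Direct mode compared with Receiver mode

Direct mode simplifies parallelism: the default parallelism is one-to-one with the number of partitions in the Kafka topic being read.
Receiver mode uses ZooKeeper to maintain consumer offsets; in Direct mode Spark maintains them itself.
Receiver mode is implemented on Kafka's High Level Consumer API; Direct mode uses Kafka's Simple Consumer API, which makes manual offset management possible.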

#########

SparkStreaming To HDFS

package bigData.spark.flume_kafka_sparkStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkStreaming_Direct_HDFS {
  def main(args: Array[String]): Unit = {
    val chkDir = "/tmp/tmp/SparkStreaming_Direct_HDFS" // checkpoint directory
    val hdfs = "/tmp/tmp/ssc/Direct_HDFS/a"            // output path prefix for saveAsTextFiles

    // Builds a fresh context; getOrCreate calls this only when chkDir holds no checkpoint.
    def functionToCreateContext(): StreamingContext = {
      val conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreaming_Direct_HDFS")
      val sc = new SparkContext(conf)
      sc.setLogLevel("WARN")
      val ssc = new StreamingContext(sc, Seconds(4))
      ssc.checkpoint(chkDir)

      val topics = Set("flume-kafka")
      val kafkaParams = Map("metadata.broker.list" -> "hadoop01:9092,hadoop02:9092,hadoop03:9092")

      // Direct stream of (key, value) pairs; keep only the message value.
      val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics).map(_._2)
      // Running word count: add this batch's counts (x) to the previous total (y).
      lines.flatMap(_.split(" ")).map((_, 1)).updateStateByKey((x: Seq[Int], y: Option[Int]) => {
        Some(x.sum + y.getOrElse(0))
      }).saveAsTextFiles(hdfs)

      // Only define the pipeline here; the context is started once, in main.
      ssc
    }

    // Recover from the checkpoint if one exists, otherwise build a new context.
    val context = StreamingContext.getOrCreate(chkDir, functionToCreateContext _)

    context.start()
    context.awaitTermination()
    context.stop()
  }
}

The HDFS directory that Spark Streaming writes its results to can then be mapped to a table.

SparkStreaming To Mysql

Flume configuration:

agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1
 
#define sources
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/logs/flume.log
 
#define channels
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
 
#define sink
agent1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.k1.brokerList = hadoop01:9092,hadoop02:9092,hadoop03:9092
agent1.sinks.k1.topic = flume-kafka
agent1.sinks.k1.batchSize = 4
agent1.sinks.k1.requiredAcks = 1
 
#bind sources and sink to channel 
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1

package bigData.spark.flume_kafka_sparkStreaming

import java.sql.{Connection, DriverManager, PreparedStatement}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkStreaming_Direct_Mysql {

  def main(args: Array[String]): Unit = {

    val chkDir = "/tmp/tmp/SparkStreaming_Direct_Mysql" // checkpoint directory

    // Builds a fresh context; getOrCreate calls this only when chkDir holds no checkpoint.
    def functionToCreateContext(): StreamingContext = {
      val conf = new SparkConf().setMaster("local[2]").setAppName("SparkStreaming_Direct_Mysql")
      val sc = new SparkContext(conf)
      sc.setLogLevel("WARN")
      val ssc = new StreamingContext(sc, Seconds(5))

      ssc.checkpoint(chkDir)
      val topics = Set("flume-kafka")
      val kafkaParams = Map("metadata.broker.list" -> "hadoop01:9092,hadoop02:9092,hadoop03:9092")

      // Direct stream of (key, value) pairs; keep only the message value.
      val lines: DStream[String] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics).map(_._2)
      // Running word count: add this batch's counts (x) to the previous total (y).
      lines.flatMap(_.split(" ")).map((_, 1)).updateStateByKey((x: Seq[Int], y: Option[Int]) => {
        Some(x.sum + y.getOrElse(0))
      }).foreachRDD(rdd => {
        // Runs on the executors once per partition, so each partition opens its own connection.
        def func(records: Iterator[(String, Int)]): Unit = {
          var conn: Connection = null
          var stmt: PreparedStatement = null
          try {
            val url = "jdbc:mysql://192.168.88.100:3306/jerry?useSSL=true"
            val user = "root"
            val password = "root"
            conn = DriverManager.getConnection(url, user, password)
            // Prepare the statement once and reuse it for every record in the partition.
            stmt = conn.prepareStatement("insert into aa values (?,?)")
            records.foreach(word => {
              stmt.setString(1, word._1)
              stmt.setInt(2, word._2)
              stmt.executeUpdate()
            })
          } catch {
            case e: Exception => e.printStackTrace()
          } finally {
            if (stmt != null) {
              stmt.close()
            }
            if (conn != null) {
              conn.close()
            }
          }
        }
        rdd.foreachPartition(func)
      })
      // Only define the pipeline here; the context is started once, in main.
      ssc
    }
    // Recover from the checkpoint if one exists, otherwise build a new context.
    val context = StreamingContext.getOrCreate(chkDir, functionToCreateContext _)
    context.start()
    context.awaitTermination()
    context.stop()
  }
}

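After starting Flume so that it appends data to Kafka, start the Spark Streaming job, which appends its results to MySQL. This is only a simple implementation of writing data to MySQL: every batch inserts new rows instead of updating the existing ones, so the program still needs adjusting.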
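
One possible adjustment, sketched here under assumptions the post does not confirm, is MySQL's `INSERT ... ON DUPLICATE KEY UPDATE`. The column names `word` and `cnt` and the unique key on `word` are hypothetical, since the schema of table `aa` is not shown, and `upsert` is a helper invented for this sketch that could replace the body of the per-partition function.

```scala
import java.sql.Connection

object WordCountUpsertSketch {
  // Assumes table aa has columns (word, cnt) with a PRIMARY KEY or UNIQUE index on word.
  val upsertSql: String =
    "insert into aa (word, cnt) values (?, ?) on duplicate key update cnt = ?"

  /** Writes one partition's running totals, overwriting the count of words already stored. */
  def upsert(conn: Connection, records: Iterator[(String, Int)]): Unit = {
    val stmt = conn.prepareStatement(upsertSql)
    try {
      records.foreach { case (word, count) =>
        stmt.setString(1, word)
        stmt.setInt(2, count) // value inserted when the word is new
        stmt.setInt(3, count) // value written when the word already exists
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
    }
  }
}
```

Overwriting rather than adding is deliberate: `updateStateByKey` already emits the running total for each word, so the stored count only has to be replaced.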

The data sent to Kafka is:

hello Jerry

hello Tome
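
To see what the job computes for this input, the update function can be exercised on plain Scala collections without Spark. This is a simplified model rather than Spark's implementation: `applyBatch` is a hypothetical stand-in for `updateStateByKey`, and the sketch assumes the two lines arrive in separate batches (the totals are the same if they share one).

```scala
object RunningWordCountSketch {
  // Same update function as in the job: this batch's counts plus the previous total.
  val update: (Seq[Int], Option[Int]) => Option[Int] =
    (x, y) => Some(x.sum + y.getOrElse(0))

  /** Applies one batch of lines to the running totals, one update call per known key. */
  def applyBatch(state: Map[String, Int], lines: Seq[String]): Map[String, Int] = {
    // Group the (word, 1) pairs of this batch by word.
    val ones: Map[String, Seq[Int]] =
      lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.map(_ => 1) }
    // Keys already in the state are updated too, with an empty Seq if absent from the batch.
    val keys = state.keySet ++ ones.keySet
    keys.flatMap { k =>
      update(ones.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterFirst = applyBatch(Map.empty, Seq("hello Jerry"))
    val afterSecond = applyBatch(afterFirst, Seq("hello Tome"))
    println(afterSecond) // totals: hello = 2, Jerry = 1, Tome = 1
  }
}
```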
