Saving Offsets to a Database
I. Version Differences
Earlier versions of Kafka stored consumer offsets in ZooKeeper; current versions store them in a special internal topic of Kafka's own, __consumer_offsets.
II. Maintenance Approach
Given a topic and a consumer group, first check whether the database holds a consumption record for that group. If it does not, this is the first consumption: fetch each partition's current offset for the topic and save it to the database. If it does, read the per-partition offset columns from the database, wrap them in a Map, and pass that Map to the DStream-creation function to build the direct stream. After each Spark batch completes, update the offset columns in the database, which completes the offset commit.
One possible problem: if the job stays stopped for too long, the offsets stored in the database may no longer exist in Kafka's retained log segments, and an OffsetOutOfRangeException is thrown. To avoid this, every time the stream is created we check whether the stored offsets still exist in Kafka, and automatically correct them if they do not.
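The correction step just described boils down to clamping each stored offset against Kafka's earliest retained offset. As a minimal sketch (the function name and the plain `(topic, partition)` keys are illustrative, not part of the implementation below):

```scala
// Clamp stored offsets against Kafka's earliest retained offsets so that a
// restart after long downtime never requests an out-of-range offset.
// Both maps are keyed by (topic, partition).
def correctOffsets(stored: Map[(String, Int), Long],
                   earliest: Map[(String, Int), Long]): Map[(String, Int), Long] =
  stored.map { case (tp, offset) =>
    // if the stored offset has fallen behind the earliest retained one,
    // advance it; otherwise keep it as-is
    tp -> math.max(offset, earliest.getOrElse(tp, offset))
  }
```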
III. Code Implementation
First we need a method that fetches the current earliest (or latest) offset of each partition of a topic.
1. Getting the offset range
def getTopicOffset(topicName: String, MinOrMax: Int): Map[TopicPartition, Long] = {
  val parser = new OptionParser(false)
  val clientId = "GetOffset"
  val brokerList = brokerListOpt // broker list string, e.g. "host1:9092,host2:9092"
  ToolsUtils.validatePortOrDie(parser, brokerList)
  val metadataTargetBrokers = ClientUtils.parseBrokerList(brokerList)
  val topic = topicName
  val time = MinOrMax
  // fetch the topic's metadata from the brokers
  val topicsMetadata = ClientUtils.fetchTopicMetadata(Set(topic), metadataTargetBrokers, clientId, 1000).topicsMetadata
  if (topicsMetadata.size != 1 || !topicsMetadata.head.topic.equals(topic)) {
    System.err.println(("Error: no valid topic metadata for topic: %s, " + "probably the topic does not exist, run ").format(topic) +
      "kafka-list-topic.sh to verify")
    Exit.exit(1)
  }
  val partitions = topicsMetadata.head.partitionsMetadata.map(_.partitionId)
  val fromOffsets = collection.mutable.HashMap.empty[TopicPartition, Long]
  partitions.foreach { partitionId =>
    val partitionMetadataOpt = topicsMetadata.head.partitionsMetadata.find(_.partitionId == partitionId)
    partitionMetadataOpt match {
      case Some(metadata) =>
        metadata.leader match {
          case Some(leader) =>
            // ask the partition leader for the earliest or latest offset
            val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
            val topicAndPartition = TopicAndPartition(topic, partitionId)
            val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, 1)))
            val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
            // we requested exactly one offset, so take the head
            fromOffsets += (new TopicPartition(topic, partitionId) -> offsets.head)
          case None => System.err.println("Error: partition %d does not have a leader. Skip getting offsets".format(partitionId))
        }
      case None => System.err.println("Error: partition %d does not exist".format(partitionId))
    }
  }
  fromOffsets.toMap
}
Parameter notes:
topicName: String — the name of the topic
MinOrMax: Int — -1 fetches each partition's latest offset, -2 the earliest (matching OffsetRequest.LatestTime and OffsetRequest.EarliestTime)
2. Getting the consumer group's last committed offsets and auto-correcting them
def getLastCommittedOffsets(topicName: Array[String], groups: String): Map[TopicPartition, Long] = {
  val toplen = topicName.size
  if (LOG.isInfoEnabled())
    LOG.info("||--Topic:{}, getLastCommittedOffsets from PGSQL By JINGXI--||", topicName)
  // Read the offsets previously saved in PostgreSQL for these topics.
  // Note the parentheses and the space before "OR": without the parentheses,
  // "AND ... OR ..." would also match rows belonging to other consumer groups.
  var sql_str = "SELECT * FROM spark_offsets_manager WHERE groups = ? AND (topics = ?"
  for (x <- 0 until toplen - 1) {
    sql_str += " OR topics = ?"
  }
  sql_str += ")"
  val conn = getConn()
  // start a transaction
  conn.setAutoCommit(false)
  val fromOffsets = collection.mutable.HashMap.empty[TopicPartition, Long]
  try {
    // First pass: for any topic this group has never consumed, insert one row
    // per partition, seeded with the partition's earliest available offset.
    for (x <- 0 until toplen) {
      val statement = conn.prepareStatement("SELECT * FROM spark_offsets_manager WHERE groups = ? AND topics = ?")
      statement.setString(1, groups)
      statement.setString(2, topicName(x))
      val result = statement.executeQuery()
      if (!result.next()) {
        // iterate over the TopicPartition keys directly rather than splitting
        // their string form, which breaks for topic names containing "-"
        val earliest = getTopicOffset(topicName(x), -2)
        for ((tp, offset) <- earliest) {
          val insert = conn.prepareStatement("INSERT INTO spark_offsets_manager (topics,partitions,lastsaveoffsets,groups) VALUES(?,?,?,?)")
          insert.setString(1, topicName(x))
          insert.setInt(2, tp.partition())
          insert.setLong(3, offset)
          insert.setString(4, groups)
          insert.execute()
          conn.commit()
        }
      }
    }
    // Second pass: read the saved offsets back, and clamp any offset that has
    // fallen behind Kafka's earliest retained offset, so that restarting the
    // job never triggers an OffsetOutOfRangeException.
    val statement = conn.prepareStatement(sql_str)
    statement.setString(1, groups)
    for (x <- 0 until toplen) {
      statement.setString(x + 2, topicName(x))
    }
    val rs = statement.executeQuery()
    while (rs.next) {
      val topic = rs.getString("topics")
      val partition = rs.getString("partitions")
      val lastsaveoffset = rs.getString("lastsaveoffsets")
      val minOffset = getTopicOffset(topic, -2)(new TopicPartition(topic, partition.toInt))
      val lastOffset =
        if (lastsaveoffset == null || lastsaveoffset.toLong < minOffset) {
          // missing or out-of-range offset: correct it in the database
          val update = conn.prepareStatement("UPDATE spark_offsets_manager SET lastsaveoffsets = ? WHERE topics = ? AND partitions = ? AND groups = ?")
          update.setLong(1, minOffset)
          update.setString(2, topic)
          update.setInt(3, partition.toInt)
          update.setString(4, groups)
          update.execute()
          minOffset
        } else lastsaveoffset.toLong
      fromOffsets += (new TopicPartition(topic, partition.toInt) -> lastOffset)
    }
    conn.commit()
  } finally {
    conn.close()
  }
  fromOffsets.toMap
}
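With the corrected offsets in hand, the remaining piece of the approach described earlier — creating the direct stream from them and writing offsets back after every batch — might look like the following. This is a sketch against the spark-streaming-kafka-0-10 API; `ssc`, `topics`, `group`, `kafkaParams`, and the `saveOffsetsToDb` helper are assumed to exist and are not defined in this post:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

// starting offsets come from the database, already corrected
val fromOffsets: Map[TopicPartition, Long] = getLastCommittedOffsets(topics, group)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  // subscribe with explicit starting offsets read from the database
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, fromOffsets))

stream.foreachRDD { rdd =>
  // this batch's offset ranges, one entry per partition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // then persist each range's untilOffset back to spark_offsets_manager
  saveOffsetsToDb(group, offsetRanges) // hypothetical helper
}
```

Committing to the database only after the batch's processing succeeds is what gives this scheme its at-least-once guarantee.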
The code is not fully optimized, but the logic is all there; feel free to improve on it.
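One obvious improvement: PostgreSQL's `INSERT ... ON CONFLICT` (9.5+) can collapse the separate existence check, insert, and update into a single statement, assuming a unique constraint on `(topics, partitions, groups)` — a sketch, not part of the original code:

```scala
// Upsert one partition's offset in a single round trip (PostgreSQL 9.5+).
// Requires: UNIQUE (topics, partitions, groups) on spark_offsets_manager.
val upsert = conn.prepareStatement(
  """INSERT INTO spark_offsets_manager (topics, partitions, lastsaveoffsets, groups)
    |VALUES (?, ?, ?, ?)
    |ON CONFLICT (topics, partitions, groups)
    |DO UPDATE SET lastsaveoffsets = EXCLUDED.lastsaveoffsets""".stripMargin)
upsert.setString(1, topic)
upsert.setInt(2, partition)
upsert.setLong(3, offset)
upsert.setString(4, groups)
upsert.execute()
```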
3. Getting a database connection
def getConn(): Connection = {
  val conn = DatabaseUtils.getConn()
  conn
}
You can roll your own DatabaseUtils. I use PostgreSQL, but pick whatever database fits your needs; if you want to store offsets somewhere other than a database, you will have to implement that logic yourself.
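A minimal sketch of such a DatabaseUtils, assuming the standard PostgreSQL JDBC driver; the URL and credentials below are placeholders, not values from this post:

```scala
import java.sql.{Connection, DriverManager}

// Minimal connection helper; swap in your own URL and credentials,
// or a connection pool such as HikariCP for production use.
object DatabaseUtils {
  private val url = "jdbc:postgresql://localhost:5432/mydb"
  private val user = "postgres"
  private val password = "postgres"

  def getConn(): Connection = {
    Class.forName("org.postgresql.Driver") // load the PostgreSQL JDBC driver
    DriverManager.getConnection(url, user, password)
  }
}
```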
IV. Table Schema
You can extend the table beyond these columns.
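Reconstructed from the INSERT and UPDATE statements above, the table might be created like this; the column types and the primary key are assumptions:

```sql
-- One row per (topic, partition, consumer group).
CREATE TABLE spark_offsets_manager (
    topics          VARCHAR(255) NOT NULL,  -- topic name
    partitions      INT          NOT NULL,  -- partition id
    lastsaveoffsets BIGINT,                 -- last committed offset
    groups          VARCHAR(255) NOT NULL,  -- consumer group id
    PRIMARY KEY (topics, partitions, groups)
);
```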