Saving Offsets to a Database
I. Version Differences
Earlier versions of Kafka stored consumer offsets in ZooKeeper; current versions store them in a special internal topic of Kafka's own, __consumer_offsets.
II. Maintenance Approach
Given a topic and a consumer group, first check whether the database holds a consumption record for that group. If it does not, this is the first consumption: fetch each partition's current offset for the topic and save it to the database. If it does, read the per-partition offset columns from the database, wrap them in a Map, and pass that Map to the DStream-creation function to build the direct stream. After each Spark batch completes, update the offset columns in the database, which completes the offset commit.
One possible problem: if the job stays stopped for too long, the offsets stored in the database may no longer exist in Kafka's retained log segments, and an OffsetOutOfRangeException is thrown. To avoid this, every time the stream is created we check whether the stored offsets still exist in Kafka, and automatically correct them if they do not.
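The correction step just described boils down to clamping each stored offset against Kafka's earliest retained offset. As a minimal sketch (the function name and the plain `(topic, partition)` keys are illustrative, not part of the implementation below):

```scala
// Clamp stored offsets against Kafka's earliest retained offsets so that a
// restart after long downtime never requests an out-of-range offset.
// Both maps are keyed by (topic, partition).
def correctOffsets(stored: Map[(String, Int), Long],
                   earliest: Map[(String, Int), Long]): Map[(String, Int), Long] =
  stored.map { case (tp, offset) =>
    // if the stored offset has fallen behind the earliest retained one,
    // advance it; otherwise keep it as-is
    tp -> math.max(offset, earliest.getOrElse(tp, offset))
  }
```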
III. Code Implementation
First we need a method that fetches the current earliest (or latest) offset of each partition of a topic.
1. Getting the offset range
def getTopicOffset(topicName: String, MinOrMax: Int): Map[TopicPartition, Long] = {
  val parser = new OptionParser(false)
  val clientId = "GetOffset"
  val brokerList = brokerListOpt // broker list string, e.g. "host1:9092,host2:9092"
  ToolsUtils.validatePortOrDie(parser, brokerList)
  val metadataTargetBrokers = ClientUtils.parseBrokerList(brokerList)
  val topic = topicName
  val time = MinOrMax
  // fetch the topic's metadata from the brokers
  val topicsMetadata = ClientUtils.fetchTopicMetadata(Set(topic), metadataTargetBrokers, clientId, 1000).topicsMetadata
  if (topicsMetadata.size != 1 || !topicsMetadata.head.topic.equals(topic)) {
    System.err.println(("Error: no valid topic metadata for topic: %s, " + "probably the topic does not exist, run ").format(topic) +
      "kafka-list-topic.sh to verify")
    Exit.exit(1)
  }
  val partitions = topicsMetadata.head.partitionsMetadata.map(_.partitionId)
  val fromOffsets = collection.mutable.HashMap.empty[TopicPartition, Long]
  partitions.foreach { partitionId =>
    val partitionMetadataOpt = topicsMetadata.head.partitionsMetadata.find(_.partitionId == partitionId)
    partitionMetadataOpt match {
      case Some(metadata) =>
        metadata.leader match {
          case Some(leader) =>
            // ask the partition leader for the earliest or latest offset
            val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
            val topicAndPartition = TopicAndPartition(topic, partitionId)
            val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, 1)))
            val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
            // we requested exactly one offset, so take the head
            fromOffsets += (new TopicPartition(topic, partitionId) -> offsets.head)
          case None => System.err.println("Error: partition %d does not have a leader. Skip getting offsets".format(partitionId))
        }
      case None => System.err.println("Error: partition %d does not exist".format(partitionId))
    }
  }
  fromOffsets.toMap
}
Parameter notes:
topicName: String — the name of the topic
MinOrMax: Int — -1 fetches each partition's latest offset, -2 the earliest (matching OffsetRequest.LatestTime and OffsetRequest.EarliestTime)
2. Getting the consumer group's last committed offsets and auto-correcting them
def getLastCommittedOffsets(topicName: Array[String], groups: String): Map[TopicPartition, Long] = {
  val toplen = topicName.size
  if (LOG.isInfoEnabled())
    LOG.info("||--Topic:{}, getLastCommittedOffsets from PGSQL By JINGXI--||", topicName)
  // Read the offsets previously saved in PostgreSQL for these topics.
  // Note the parentheses and the space before "OR": without the parentheses,
  // "AND ... OR ..." would also match rows belonging to other consumer groups.
  var sql_str = "SELECT * FROM spark_offsets_manager WHERE groups = ? AND (topics = ?"
  for (x <- 0 until toplen - 1) {
    sql_str += " OR topics = ?"
  }
  sql_str += ")"
  val conn = getConn()
  // start a transaction
  conn.setAutoCommit(false)
  val fromOffsets = collection.mutable.HashMap.empty[TopicPartition, Long]
  try {
    // First pass: for any topic this group has never consumed, insert one row
    // per partition, seeded with the partition's earliest available offset.
    for (x <- 0 until toplen) {
      val statement = conn.prepareStatement("SELECT * FROM spark_offsets_manager WHERE groups = ? AND topics = ?")
      statement.setString(1, groups)
      statement.setString(2, topicName(x))
      val result = statement.executeQuery()
      if (!result.next()) {
        // iterate over the TopicPartition keys directly rather than splitting
        // their string form, which breaks for topic names containing "-"
        val earliest = getTopicOffset(topicName(x), -2)
        for ((tp, offset) <- earliest) {
          val insert = conn.prepareStatement("INSERT INTO spark_offsets_manager (topics,partitions,lastsaveoffsets,groups) VALUES(?,?,?,?)")
          insert.setString(1, topicName(x))
          insert.setInt(2, tp.partition())
          insert.setLong(3, offset)
          insert.setString(4, groups)
          insert.execute()
          conn.commit()
        }
      }
    }
    // Second pass: read the saved offsets back, and clamp any offset that has
    // fallen behind Kafka's earliest retained offset, so that restarting the
    // job never triggers an OffsetOutOfRangeException.
    val statement = conn.prepareStatement(sql_str)
    statement.setString(1, groups)
    for (x <- 0 until toplen) {
      statement.setString(x + 2, topicName(x))
    }
    val rs = statement.executeQuery()
    while (rs.next) {
      val topic = rs.getString("topics")
      val partition = rs.getString("partitions")
      val lastsaveoffset = rs.getString("lastsaveoffsets")
      val minOffset = getTopicOffset(topic, -2)(new TopicPartition(topic, partition.toInt))
      val lastOffset =
        if (lastsaveoffset == null || lastsaveoffset.toLong < minOffset) {
          // missing or out-of-range offset: correct it in the database
          val update = conn.prepareStatement("UPDATE spark_offsets_manager SET lastsaveoffsets = ? WHERE topics = ? AND partitions = ? AND groups = ?")
          update.setLong(1, minOffset)
          update.setString(2, topic)
          update.setInt(3, partition.toInt)
          update.setString(4, groups)
          update.execute()
          minOffset
        } else lastsaveoffset.toLong
      fromOffsets += (new TopicPartition(topic, partition.toInt) -> lastOffset)
    }
    conn.commit()
  } finally {
    conn.close()
  }
  fromOffsets.toMap
}
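With the corrected offsets in hand, the remaining piece of the approach described earlier — creating the direct stream from them and writing offsets back after every batch — might look like the following. This is a sketch against the spark-streaming-kafka-0-10 API; `ssc`, `topics`, `group`, `kafkaParams`, and the `saveOffsetsToDb` helper are assumed to exist and are not defined in this post:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

// starting offsets come from the database, already corrected
val fromOffsets: Map[TopicPartition, Long] = getLastCommittedOffsets(topics, group)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  // subscribe with explicit starting offsets read from the database
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, fromOffsets))

stream.foreachRDD { rdd =>
  // this batch's offset ranges, one entry per partition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // then persist each range's untilOffset back to spark_offsets_manager
  saveOffsetsToDb(group, offsetRanges) // hypothetical helper
}
```

Committing to the database only after the batch's processing succeeds is what gives this scheme its at-least-once guarantee.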
The code is not fully optimized, but the logic is all there; feel free to improve on it.
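One obvious improvement: PostgreSQL's `INSERT ... ON CONFLICT` (9.5+) can collapse the separate existence check, insert, and update into a single statement, assuming a unique constraint on `(topics, partitions, groups)` — a sketch, not part of the original code:

```scala
// Upsert one partition's offset in a single round trip (PostgreSQL 9.5+).
// Requires: UNIQUE (topics, partitions, groups) on spark_offsets_manager.
val upsert = conn.prepareStatement(
  """INSERT INTO spark_offsets_manager (topics, partitions, lastsaveoffsets, groups)
    |VALUES (?, ?, ?, ?)
    |ON CONFLICT (topics, partitions, groups)
    |DO UPDATE SET lastsaveoffsets = EXCLUDED.lastsaveoffsets""".stripMargin)
upsert.setString(1, topic)
upsert.setInt(2, partition)
upsert.setLong(3, offset)
upsert.setString(4, groups)
upsert.execute()
```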
3. Getting a database connection
def getConn(): Connection = {
  val conn = DatabaseUtils.getConn()
  conn
}
You can roll your own DatabaseUtils. I use PostgreSQL, but pick whatever database fits your needs; if you want to store offsets somewhere other than a database, you will have to implement that logic yourself.
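A minimal sketch of such a DatabaseUtils, assuming the standard PostgreSQL JDBC driver; the URL and credentials below are placeholders, not values from this post:

```scala
import java.sql.{Connection, DriverManager}

// Minimal connection helper; swap in your own URL and credentials,
// or a connection pool such as HikariCP for production use.
object DatabaseUtils {
  private val url = "jdbc:postgresql://localhost:5432/mydb"
  private val user = "postgres"
  private val password = "postgres"

  def getConn(): Connection = {
    Class.forName("org.postgresql.Driver") // load the PostgreSQL JDBC driver
    DriverManager.getConnection(url, user, password)
  }
}
```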
IV. Table Schema
You can extend the table beyond these columns.
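Reconstructed from the INSERT and UPDATE statements above, the table might be created like this; the column types and the primary key are assumptions:

```sql
-- One row per (topic, partition, consumer group).
CREATE TABLE spark_offsets_manager (
    topics          VARCHAR(255) NOT NULL,  -- topic name
    partitions      INT          NOT NULL,  -- partition id
    lastsaveoffsets BIGINT,                 -- last committed offset
    groups          VARCHAR(255) NOT NULL,  -- consumer group id
    PRIMARY KEY (topics, partitions, groups)
);
```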