Kafka Primer [7]: Can You Walk Through Kafka's Server-Side Fault-Tolerance Architecture?

Hello everyone, this is a blog kept alive for the sake of a dream of continuous learning. This series records my study of and hands-on experience with Kafka, and I hope it helps you. As usual, the table of contents takes a question-and-answer form, rather like a mock interview.

[Overview]

Besides the brokers themselves, a Kafka cluster contains one more role: the Controller.

What is the relationship between this role and the brokers in the cluster? In fact, any broker can take on the Controller role and then carry out the duties that this role requires.

 


1. Why was the Controller role designed?

We can start by thinking about a few problems that any distributed system has to solve:
1. How should the cluster-wide metadata be maintained? If every broker could modify the metadata at will, metadata management would become extremely complicated; if a dedicated component were created just to manage metadata, that component would in turn need its own fault-tolerant deployment.
2. If a broker in the cluster goes offline, how do we keep the service available? In other words, how is failover handled?
3. If a new broker joins the cluster, how should the cluster state be adjusted?
Different distributed systems answer these three questions with different designs; in Kafka, the answer is to elect one of the brokers to act as the Controller and make it responsible for solving them.
In short: the Controller role in Kafka exists mainly for cluster fault tolerance and cluster management.

With that in place, the complete architecture of the cluster now looks like this:

 


2. What are the Controller's responsibilities?

The Controller's responsibilities: in plain words, what does this role actually do?
The most direct way to find out is to read the code. The following is the callback that runs on a broker once it has become the Controller:

private def onControllerFailover() {
  info("Reading controller epoch from ZooKeeper")
  readControllerEpochFromZooKeeper()
  info("Incrementing controller epoch in ZooKeeper")
  incrementControllerEpoch()
  info("Registering handlers")

  // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
  // register the batch of required child-change handlers
  val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler, logDirEventNotificationHandler,
    isrChangeNotificationHandler)
  childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
  // register the handlers for preferred-replica election and partition reassignment events
  val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
  nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)

  info("Deleting log dir event notifications")
  zkClient.deleteLogDirEventNotifications()
  info("Deleting isr change notifications")
  zkClient.deleteIsrChangeNotifications()
  info("Initializing controller context")
  initializeControllerContext()
  info("Fetching topic deletions in progress")
  val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
  info("Initializing topic deletion manager")
  topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)

  // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
  // are started. The is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
  // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
  // partitionStateMachine.startup().
  info("Sending update metadata request")
  sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)

  replicaStateMachine.startup()
  partitionStateMachine.startup()

  info(s"Ready to serve as the new controller with epoch $epoch")
  maybeTriggerPartitionReassignment(controllerContext.partitionsBeingReassigned.keySet)
  topicDeletionManager.tryTopicDeletion()
  val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
  onPreferredReplicaElection(pendingPreferredReplicaElections)
  info("Starting the controller scheduler")
  kafkaScheduler.startup()
  if (config.autoLeaderRebalanceEnable) {
    scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS)
  }

  if (config.tokenAuthEnabled) {
    info("starting the token expiry check scheduler")
    tokenCleanScheduler.startup()
    tokenCleanScheduler.schedule(name = "delete-expired-tokens",
      fun = tokenManager.expireTokens,
      period = config.delegationTokenExpiryCheckIntervalMs,
      unit = TimeUnit.MILLISECONDS)
  }
}

Looking at the handler-registration lines above, you can see exactly what the Controller does: it registers a whole batch of listeners and binds the corresponding handlers. That is the bulk of its job.
To summarize:
1. It manages the cluster: broker management, topic management, ISR-change management, and so on. Broker management includes updating the cluster state after a broker crashes so that the surviving brokers can continue to serve correctly.
2. It propagates metadata: whenever the Controller senses a change in the cluster state, it sends the latest metadata to all surviving brokers so that the whole cluster's metadata stays up to date.


The source above also shows how the Controller performs this management: it relies on ZooKeeper ephemeral nodes plus the listeners registered on them. Once an event fires on a watched node, the corresponding handler logic is executed.
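To make that mechanism concrete, here is a minimal, hypothetical sketch (not Kafka source; the class name and callback are invented) of the pattern, using the plain ZooKeeper client in Scala: register a watcher on /brokers/ids, where every live broker keeps an ephemeral child node, and run a handler whenever the set of children changes, which is exactly when a broker joins or dies.

import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import scala.jdk.CollectionConverters._

// Hypothetical sketch: watch a znode's children and run a handler on every change.
class BrokerChangeWatchSketch(zk: ZooKeeper)(onBrokerChange: Set[Int] => Unit) {
  private val path = "/brokers/ids"       // each live broker registers an ephemeral child here

  def register(): Unit = {
    val watcher = new Watcher {
      override def process(event: WatchedEvent): Unit =
        if (event.getType == Watcher.Event.EventType.NodeChildrenChanged) {
          register()                      // ZK watches are one-shot, so re-arm the watch first
          onBrokerChange(currentIds())    // then hand the latest broker list to the handler
        }
    }
    zk.getChildren(path, watcher)         // setting the watch is a side effect of this read
  }

  private def currentIds(): Set[Int] =
    zk.getChildren(path, false).asScala.map(_.toInt).toSet
}

Kafka's handler layer wraps this same primitive; the BrokerChange event below is what ends up running on the Controller when such a children-changed notification arrives.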
Now let's look at the source to see how the Controller implements failover when a broker goes down. The following is the handler that fires when the set of broker znodes changes:

case object BrokerChange extends ControllerEvent {
  override def state: ControllerState = ControllerState.BrokerChange

  override def process(): Unit = {
    // if this broker is not the active controller, do nothing
    if (!isActive) return
    // fetch the latest broker list from ZooKeeper
    val curBrokers = zkClient.getAllBrokersInCluster.toSet
    val curBrokerIds = curBrokers.map(_.id)
    // the brokers previously cached locally
    val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
    // new - old = newly added brokers
    val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
    // old - new = dead brokers
    val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
    val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
    // update the local cache
    controllerContext.liveBrokers = curBrokers
    val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
    val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
    val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
    info(s"Newly added brokers: ${newBrokerIdsSorted.mkString(",")}, " +
      s"deleted brokers: ${deadBrokerIdsSorted.mkString(",")}, all live brokers: ${liveBrokerIdsSorted.mkString(",")}")
    // for each newly added broker:
    //   1. initialize its network module (a NetworkClient)
    //   2. initialize its request-sending thread (a RequestSendThread)
    // and cache both in the controller channel manager's broker map
    newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
    // release the resources held for the dead brokers and remove them from that map
    deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
    if (newBrokerIds.nonEmpty)
      // this does two things:
      //   1. send an UpdateMetadataRequest to every live broker through its RequestSendThread
      //   2. register a broker-modification listener for each new broker (which, when triggered,
      //      mainly refreshes the local metadata and then sends metadata-update requests)
      onBrokerStartup(newBrokerIdsSorted)
    if (deadBrokerIds.nonEmpty)
      // three things:
      //   1. remove the dead brokers' data from the local cache
      //   2. take the replicas hosted on those brokers offline
      //   3. unregister the broker-modification listeners for those brokers
      onBrokerFailure(deadBrokerIdsSorted)
  }
}
private def onBrokerFailure(deadBrokers: Seq[Int]) {
  info(s"Broker failure callback for ${deadBrokers.mkString(",")}")
  // remove the dead brokers' entries from the local cache
  deadBrokers.foreach(controllerContext.replicasOnOfflineDirs.remove)
  val deadBrokersThatWereShuttingDown =
    deadBrokers.filter(id => controllerContext.shuttingDownBrokerIds.remove(id))
  info(s"Removed $deadBrokersThatWereShuttingDown from list of shutting down brokers.")
  val allReplicasOnDeadBrokers = controllerContext.replicasOnBrokers(deadBrokers.toSet)
  // take the replicas hosted on the dead brokers offline
  onReplicasBecomeOffline(allReplicasOnDeadBrokers)
  // unregister the broker-modification handlers for these brokers
  unregisterBrokerModificationsHandler(deadBrokers)
}
private def onReplicasBecomeOffline(newOfflineReplicas: Set[PartitionAndReplica]): Unit = {
  val (newOfflineReplicasForDeletion, newOfflineReplicasNotForDeletion) =
    newOfflineReplicas.partition(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))

  val partitionsWithoutLeader = controllerContext.partitionLeadershipInfo.filter(partitionAndLeader =>
    !controllerContext.isReplicaOnline(partitionAndLeader._2.leaderAndIsr.leader, partitionAndLeader._1) &&
      !topicDeletionManager.isTopicQueuedUpForDeletion(partitionAndLeader._1.topic)).keySet

  // trigger OfflinePartition state for all partitions whose current leader is one amongst the newOfflineReplicas
  // i.e. mark every partition whose current leader sits on one of the offline replicas as OfflinePartition
  partitionStateMachine.handleStateChanges(partitionsWithoutLeader.toSeq, OfflinePartition)
  // trigger OnlinePartition state changes for offline or new partitions
  partitionStateMachine.triggerOnlinePartitionStateChange()
  // trigger OfflineReplica state change for those newly offline replicas
  replicaStateMachine.handleStateChanges(newOfflineReplicasNotForDeletion.toSeq, OfflineReplica)

  // fail deletion of topics that are affected by the offline replicas
  if (newOfflineReplicasForDeletion.nonEmpty) {
    // it is required to mark the respective replicas in TopicDeletionFailed state since the replica cannot be
    // deleted when its log directory is offline. This will prevent the replica from being in TopicDeletionStarted state indefinitely
    // since topic deletion cannot be retried until at least one replica is in TopicDeletionStarted state
    topicDeletionManager.failReplicaDeletion(newOfflineReplicasForDeletion)
  }

  // If replica failure did not require leader re-election, inform brokers of the offline replica
  // Note that during leader re-election, brokers update their metadata
  if (partitionsWithoutLeader.isEmpty) {
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
  }
}

In short: once the Controller senses a change in the set of broker nodes, it works out which brokers have gone offline and then has follower replicas located on other brokers take over as leaders, so that the partitions keep serving.
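The "promote a follower" step boils down to a small piece of selection logic. The sketch below is a simplified, hypothetical illustration of the idea (invented names, not Kafka source; the real behaviour lives in the partition state machine and its leader-election strategies): when the leader's broker dies, pick the first surviving replica that is still in the ISR as the new leader.

// Hypothetical sketch of offline-partition leader election.
case class PartitionInfo(assignedReplicas: Seq[Int], isr: Seq[Int], leader: Int)

def electNewLeader(partition: PartitionInfo, liveBrokers: Set[Int]): Option[Int] =
  // Prefer an in-sync replica that is still alive: promoting it loses no committed messages.
  // (With unclean.leader.election.enable=true the real controller may fall back to an
  // out-of-sync replica, trading possible data loss for availability.)
  partition.isr.find(liveBrokers.contains)

// Example: the leader on broker 1 dies; broker 2 is alive and still in the ISR, so it takes over.
val demo = PartitionInfo(assignedReplicas = Seq(1, 2, 3), isr = Seq(1, 2), leader = 1)
electNewLeader(demo, liveBrokers = Set(2, 3))   // => Some(2)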

 


3. How is the Controller elected?

Once again, the source code tells the truth, so let's look directly at the Controller initialization code:

/* start kafka controller */
kafkaController = new KafkaController(config, zkClient, time, metrics, brokerInfo, tokenManager, threadNamePrefix)
kafkaController.startup()
/**
 * Invoked when the controller module of a Kafka server is started up. This does not assume that the current broker
 * is the controller. It merely registers the session expiration listener and starts the controller leader
 * elector
  *
  * When a Kafka broker starts up it does not yet know which broker is the controller,
  * so it registers a session-expiration listener on ZooKeeper
  * and then kicks off the controller election.
 */
def startup() = {
  // register a handler that runs when the ZooKeeper session state changes
  // its handler name is controller-state-change-handler
  zkClient.registerStateChangeHandler(new StateChangeHandler {
    override val name: String = StateChangeHandlers.ControllerHandler
    override def afterInitializingSession(): Unit = {
      eventManager.put(RegisterBrokerAndReelect)
    }
    override def beforeInitializingSession(): Unit = {
      val expireEvent = new Expire
      eventManager.clearAndPut(expireEvent)
      expireEvent.waitUntilProcessed()
    }
  })
  // put the Startup event into the ControllerEventManager
  // under the hood this just drops it into a blocking queue
  // the whole thing is a producer-consumer pattern plus a strategy pattern
  eventManager.put(Startup)
  // start the internal processing thread
  eventManager.start()
}
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {
    // check whether the /controller znode exists
    // and register the handler for that node's creation/deletion/data-change events
    // registering the handler effectively sets a watcher on this znode, so once the node changes,
    // the handler logic matching the specific event gets executed
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    // then run the election
    elect()
  }
}
private def elect(): Unit = {
  val timestamp = time.milliseconds
  // read the current controller id from ZooKeeper; returns -1 if it does not exist
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  /*
   * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition,
   * it's possible that the controller has already been elected when we get here. This check will prevent the following
   * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
   */
  // if some broker has already become the controller, stop the election right here
  if (activeControllerId != -1) {
    debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
    return
  }

  try {
    // try to create the ephemeral /controller znode,
    // writing this broker's id and the current timestamp into it
    zkClient.checkedEphemeralCreate(ControllerZNode.path, ControllerZNode.encode(config.brokerId, timestamp))
    // no exception means the creation succeeded, so just log the win
    info(s"${config.brokerId} successfully elected as the controller")
    // record our own brokerId as the active controller id
    activeControllerId = config.brokerId
    // do the things a new controller has to do:
    // bump the controller epoch, register a batch of listeners, and so on
    onControllerFailover()
  } catch {
    case _: NodeExistsException =>
      // the creation failed with a node-already-exists exception, i.e. someone else has
      // written the path, so update our local activeControllerId with the brokerId
      // stored in that znode
      activeControllerId = zkClient.getControllerId.getOrElse(-1)
      // if the controllerId we read back is still -1, the broker that had just been elected
      // has already gone down again, so another round of election will be triggered
      if (activeControllerId != -1)
        debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}")
      else
        warn("A controller has been elected but just resigned, this will result in another round of election")

    case e2: Throwable =>
      error(s"Error while electing or becoming controller on broker ${config.brokerId}", e2)
      triggerControllerMove()
  }
}

Isn't the code above remarkably clear? Kafka is not only well architected; its internal code design and comments are exemplary as well. The one thing that puts some people off is that it is written in Scala,
but with a little knowledge of the syntax you can roughly follow what the code is doing, especially with comments this thorough. Unless you need to do secondary development, a passing familiarity with the syntax is enough.

To summarize: when a broker starts up, it tries to create the ephemeral /controller znode in ZooKeeper, writing its own brokerId and a timestamp into it. If the creation succeeds, it has won the election; if it fails, it reads the brokerId of the broker that has already been elected.
So in practice the first Controller of a cluster is usually simply the broker that started first.
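The startup() comments earlier mentioned that events are simply dropped into a blocking queue and consumed by a single thread: a producer-consumer pattern combined with a strategy pattern, where each event carries its own process() logic. Below is a minimal, hypothetical sketch of that idea; the names are invented and it is far simpler than Kafka's actual ControllerEventManager.

import java.util.concurrent.LinkedBlockingQueue

// Strategy pattern: every event carries its own handling logic.
trait ControllerEventSketch { def process(): Unit }

class EventManagerSketch {
  private val queue = new LinkedBlockingQueue[ControllerEventSketch]()

  // Producer side: ZK callbacks, timers, etc. just enqueue events.
  def put(event: ControllerEventSketch): Unit = queue.put(event)

  // Consumer side: a single daemon thread drains the queue, so events are processed
  // one at a time and handlers never race each other.
  def start(): Unit = {
    val t = new Thread(() => while (true) queue.take().process(), "controller-event-thread-sketch")
    t.setDaemon(true)
    t.start()
  }
}

// Usage: a startup event whose process() would register the /controller watcher and call elect().
object StartupSketch extends ControllerEventSketch {
  override def process(): Unit = println("register /controller change handler, then elect()")
}

val manager = new EventManagerSketch
manager.put(StartupSketch)
manager.start()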

 


4. What happens when the Controller goes down?

Finally, let's think about one more question: only one broker in a Kafka cluster acts as the Controller, so what happens if the broker that crashes is the Controller? Do we need several Controller instances in an active-standby arrangement to make it reliable?
Clearly not. As mentioned at the very beginning, one elegant aspect of the Controller design is that it needs neither a separate deployment nor separate fault-tolerance handling; it simply reuses the Kafka cluster's own fault-tolerance mechanism.
So how does Kafka achieve that? Let's keep reading the relevant source:

case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {
    // check whether the /controller znode exists
    // and register the handler for that node's creation/deletion/data-change events
    // registering the handler effectively sets a watcher on this znode, so once the node changes,
    // the handler logic matching the specific event gets executed
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    // then run the election
    elect()
  }
}
class ControllerChangeHandler(controller: KafkaController, eventManager: ControllerEventManager) extends ZNodeChangeHandler {
  override val path: String = ControllerZNode.path

  override def handleCreation(): Unit = eventManager.put(controller.ControllerChange)
  override def handleDeletion(): Unit = eventManager.put(controller.Reelect)
  override def handleDataChange(): Unit = eventManager.put(controller.ControllerChange)
}
case object Reelect extends ControllerEvent {
  override def state = ControllerState.ControllerChange

  override def process(): Unit = {
    val wasActiveBeforeChange = isActive
    // register the handler on /controller again
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    activeControllerId = zkClient.getControllerId.getOrElse(-1)
    // if this broker used to be the controller but no longer is, clean up its controller state
    if (wasActiveBeforeChange && !isActive) {
      onControllerResignation()
    }
    // run the election
    elect()
  }
}

As the source shows, this, too, is implemented through ZooKeeper.

To summarize:
At startup, if the /controller znode exists, each broker registers a listener on it and binds the corresponding handler.
Once the broker acting as Controller crashes, its ephemeral znode is deleted, which triggers the listeners on all remaining brokers; they all run the election again, that is, they race to create the /controller znode.
Whoever creates it successfully becomes the new Controller; the new Controller then refreshes the cluster state and the cluster keeps serving steadily.
Under normal conditions the whole controller switchover completes within a matter of seconds.
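Putting the pieces of this section together, the failover flow can be condensed into a small loop. The following is a hypothetical sketch (invented class and method names, not Kafka source): every broker races to create the ephemeral /controller node, the losers keep a watch on it, and when the winner's session dies the node disappears, the watch fires, and everyone runs for election again, which is exactly what the Reelect event above triggers.

import org.apache.zookeeper.{CreateMode, KeeperException, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

// Hypothetical sketch of the controller failover loop.
class ControllerElectionSketch(zk: ZooKeeper, brokerId: Int) {
  private val path = "/controller"

  def runForController(): Unit = {
    try {
      // Race to create the ephemeral /controller node; only one broker can win.
      zk.create(path, brokerId.toString.getBytes("UTF-8"),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
      println(s"broker $brokerId is now the controller")   // the real code would run onControllerFailover()
    } catch {
      case _: KeeperException.NodeExistsException =>
        println(s"broker $brokerId lost the election")     // someone else created the node first
    }
    // Win or lose, keep watching /controller: a NodeDeleted event means the controller's session
    // died, its ephemeral node vanished, and a new round of election should start.
    zk.exists(path, new Watcher {
      override def process(event: WatchedEvent): Unit =
        if (event.getType == Watcher.Event.EventType.NodeDeleted) runForController()
    })
  }
}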

 

Additional notes:

1. ZooKeeper also holds a persistent znode, /controller_epoch, which records how many times the Controller has changed; every controller switchover increments this value by 1.
Every request that involves the controller has to carry the controller_epoch value. If the value in a request differs from the one cached on the broker, it means either that the request is stale (its epoch is smaller than the cached controller_epoch) or that the controller has changed (its epoch is larger than the cached one); a request carrying a stale epoch is treated as invalid.
In this way Kafka uses controller_epoch to guarantee that only one controller is valid at a time, and thus to keep the related operations consistent (a small sketch of this check follows after these notes).

2. The Controller design also reduces every broker's dependence on ZooKeeper, because only the Controller needs to register a large number of watchers; the other brokers register very few.
This neatly avoids a design in which every broker registers piles of watchers and leans heavily on ZooKeeper, which would invite problems such as split brain and the herd effect.
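Going back to note 1 above, the epoch check itself is tiny. The following is a hypothetical sketch (invented names, not Kafka source) of what a broker conceptually does with the controller epoch carried by controller-to-broker requests such as LeaderAndIsr and UpdateMetadata.

// Hypothetical sketch of the controller_epoch check on the broker side.
class ControllerEpochGuardSketch {
  @volatile private var cachedControllerEpoch: Int = 0

  // Returns true if the request should be processed, false if it must be rejected as stale.
  def validate(requestControllerEpoch: Int): Boolean = {
    if (requestControllerEpoch < cachedControllerEpoch) {
      false                                            // request from an old (zombie) controller
    } else {
      cachedControllerEpoch = requestControllerEpoch   // a larger epoch means the controller changed
      true
    }
  }
}

// Usage: epoch 3 is accepted and cached; a later request still carrying epoch 2 is rejected.
val guard = new ControllerEpochGuardSketch
guard.validate(3)   // => true
guard.validate(2)   // => false, stale controller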

 
