Reading the Kafka Source: Controller (Part 1)

Source version: 0.10.2.x

Quoting Kafka Controller Internals:

In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of
partitions and replicas and for performing administrative tasks like reassigning partitions.

Each broker in the cluster tries to create an ephemeral node at the "/controller" path in ZooKeeper; the first broker whose create succeeds becomes the Controller.

[Figure: Kafka architecture diagram]

The main responsibilities of the Kafka Controller:

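  • When the cluster starts, KafkaServerStartable starts KafkaServer, and KafkaServer starts KafkaController, which immediately begins the Controller election.
  • After the election, the winning Controller is activated. It registers ZooKeeper watches and listens on the relevant nodes so that a Controller failover can be handled at any time. It then handles state changes of various kinds, including:
      • newly created topics or partition expansion of existing topics, reassignment of partition replicas, and election of partition replica leaders;
      • broker startup and shutdown;
      • management of the replica state machine replicaStateMachine and the partition state machine partitionStateMachine.

Put briefly: one broker in a cluster acts as controller; it monitors the liveness of brokers, elects new leaders on broker failure, and communicates new leaders to brokers.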

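Clearly, the Controller plays a pivotal role in the cluster.

1. Controller startup

KafkaServerStartable manages a single KafkaServer instance and is responsible for starting and shutting down the KafkaServer, as well as for setServerState and awaitShutdown.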

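Startup sequence (text form of the sequence diagram; participants: KafkaServerStartable, KafkaServer, KafkaController, ZKLeaderElector, ZKCheckedEphemeral):

  1. KafkaServerStartable calls startup on KafkaServer.
  2. KafkaServer constructs (new) the KafkaController and calls its startup.
  3. KafkaController startup: (1) registers the SessionExpirationListener; (2) starts the controllerElector.
  4. ZKLeaderElector startup: (1) subscribes to data changes on the "/controller" path, registering the LeaderChangeListener that handles automatic re-election; (2) runs the election by calling elect.
  5. The outcome of the election is returned as amILeader.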

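The ZKLeaderElector in the diagram is the ZookeeperLeaderElector object. It implements leader election generically; the path to elect on is given by its electionPath: String parameter.

2. Controller election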

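When the Controller starts, it initializes and starts a ZookeeperLeaderElector. Through it, a leaderChangeListener is registered on the Controller's electionPath, that is, the "/controller" path, so that whenever the node's data changes the leaderChangeListener is notified and reacts accordingly.
The ZookeeperLeaderElector class is described in the source as: This class handles zookeeper based leader election based on an ephemeral path.
When the current Controller fails, its Controller path disappears automatically, and all live brokers then compete to become the new Controller by creating the Controller path again.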

  // ZookeeperLeaderElector
  def startup {
    inLock(controllerContext.controllerLock) {
      controllerContext.zkUtils.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
      elect
    }
  }
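
The subscribe-then-elect pattern in startup can be illustrated without ZooKeeper. What follows is a minimal in-memory sketch, not Kafka code: `FakeZk`, `Elector`, `subscribeDeletes`, `createEphemeral` and `expireOwner` are hypothetical names standing in for the ephemeral node and for the listener registered through subscribeDataChanges. When the node disappears, every subscribed broker that is still alive runs elect again, and the first one to recreate the node becomes the new controller.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a single ephemeral ZooKeeper node.
final class FakeZk {
  private var data: Option[Int] = None // broker id stored at the path, if any
  private val deleteListeners = mutable.ArrayBuffer.empty[() => Unit]

  def subscribeDeletes(listener: () => Unit): Unit = deleteListeners += listener

  // Returns true only for the first caller while the node is absent.
  def createEphemeral(brokerId: Int): Boolean = synchronized {
    if (data.isEmpty) { data = Some(brokerId); true } else false
  }

  def read: Option[Int] = synchronized(data)

  // Simulates the owner's session expiring: the node vanishes, then listeners fire.
  def expireOwner(): Unit = {
    synchronized { data = None }
    deleteListeners.toList.foreach(listener => listener())
  }
}

// Hypothetical miniature elector: subscribe first, then take part in the election.
final class Elector(zk: FakeZk, val brokerId: Int) {
  @volatile var alive = true

  def startup(): Unit = {
    zk.subscribeDeletes(() => if (alive) elect())
    elect()
  }

  def elect(): Boolean = { zk.createEphemeral(brokerId); amILeader }

  def amILeader: Boolean = zk.read.contains(brokerId)
}

object ReelectionDemo {
  // Returns the controller id before and after the first controller dies.
  def run(): (Int, Int) = {
    val zk = new FakeZk
    val brokers = List(1, 2, 3).map(id => new Elector(zk, id))
    brokers.foreach(_.startup())
    val first = zk.read.get
    brokers.find(_.brokerId == first).foreach(b => b.alive = false) // the controller dies
    zk.expireOwner()
    (first, zk.read.get)
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"controller before failure: $before, after failure: $after")
  }
}
```

In this sketch the listeners fire in registration order, so the outcome is deterministic; with a real ZooKeeper ensemble the winner is whichever live broker's create request is processed first.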

Next, look at elect:

The Controller election works by creating an ephemeral node at the "/controller" path in ZooKeeper and writing the current broker's information into that node:

  {"version":1,"brokerid":1,"timestamp":"1512018424988"}

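Because ZooKeeper is strongly consistent, a given node can be created successfully by only one client, and the broker whose create succeeds is the Controller.
If a Controller has already been elected, a broker that enters the election returns amILeader straight away. This keeps a broker that is already the Controller from falling into an endless loop of node creation.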

def amILeader : Boolean = leaderId == brokerId

  // ZookeeperLeaderElector
  def elect: Boolean = {
    val timestamp = time.milliseconds.toString
    val electString = Json.encode(Map("version" -> 1, "brokerid" -> brokerId, "timestamp" -> timestamp))
   
    leaderId = getControllerID
    /* 
     * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race condition, 
     * it's possible that the controller has already been elected when we get here. This check will prevent the following 
     * createEphemeralPath method from getting into an infinite loop if this broker is already the controller.
     */
    if(leaderId != -1) {
       debug("Broker %d has been elected as leader, so stopping the election process.".format(leaderId))
       return amILeader
    }

    try {
      val zkCheckedEphemeral = new ZKCheckedEphemeral(electionPath,
                                                      electString,
                                                      controllerContext.zkUtils.zkConnection.getZookeeper,
                                                      JaasUtils.isZkSecurityEnabled())
      zkCheckedEphemeral.create()
      info(brokerId + " successfully elected as leader")
      leaderId = brokerId // record the leader broker's id; this broker is now the controller
      onBecomingLeader()
    } catch {
      case _: ZkNodeExistsException =>
        // If someone else has written the path, then
        leaderId = getControllerID 

        if (leaderId != -1)
          debug("Broker %d was elected as leader instead of broker %d".format(leaderId, brokerId))
        else
          warn("A leader has been elected but just resigned, this will result in another round of election")

      case e2: Throwable =>
        error("Error while electing or becoming leader on broker %d".format(brokerId), e2)
        resign()
    }
    amILeader
  }
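
The core guarantee that elect relies on, namely that only one create can succeed, can be tried out in isolation. The sketch below is an assumption-laden illustration rather than the Kafka implementation: a `ConcurrentHashMap` plays the role of ZooKeeper, `putIfAbsent` plays the role of creating the ephemeral node, and its non-null return value plays the role of ZkNodeExistsException. `ElectSketch`, `Broker` and `race` are hypothetical names.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executors, TimeUnit}

object ElectSketch {
  val ElectionPath = "/controller"

  // Hypothetical broker holding only the state that elect needs.
  final class Broker(store: ConcurrentHashMap[String, Integer], val brokerId: Int) {
    @volatile private var leaderId: Int = -1

    private def getControllerId: Int = {
      val current = store.get(ElectionPath)
      if (current == null) -1 else current.intValue
    }

    def amILeader: Boolean = leaderId == brokerId

    def elect(): Boolean = {
      leaderId = getControllerId
      if (leaderId == -1) {
        // putIfAbsent is atomic: it returns null only to the single caller that created the entry.
        val previous = store.putIfAbsent(ElectionPath, Int.box(brokerId))
        leaderId = if (previous == null) brokerId else previous.intValue
      }
      amILeader
    }
  }

  // Lets n brokers call elect at the same moment and returns how many believe they won.
  def race(n: Int): Int = {
    val store = new ConcurrentHashMap[String, Integer]()
    val brokers = (1 to n).map(id => new Broker(store, id))
    val pool = Executors.newFixedThreadPool(n)
    val startSignal = new CountDownLatch(1)
    brokers.foreach { b =>
      pool.execute(new Runnable {
        def run(): Unit = { startSignal.await(); b.elect() }
      })
    }
    startSignal.countDown() // release all brokers at once
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    brokers.count(_.amILeader)
  }

  def main(args: Array[String]): Unit =
    println(s"winners among 8 brokers: ${race(8)}")
}
```

However the threads interleave, exactly one broker sees `putIfAbsent` return null, so the count of winners is always one. The early check of `leaderId` mirrors the shortcut in elect: a broker that already sees a leader never attempts the create.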

Once the node under the election path has been created, elect records the leader id and calls onBecomingLeader(). That function is the third constructor parameter of ZookeeperLeaderElector:

class ZookeeperLeaderElector(controllerContext: ControllerContext,
                             electionPath: String,
                             onBecomingLeader: () => Unit,
                             onResigningAsLeader: () => Unit,
                             brokerId: Int,
                             time: Time)

What the Controller passes in when it is initialized is in fact onControllerFailover:

  private val controllerElector = new ZookeeperLeaderElector(controllerContext, ZkUtils.ControllerPath, 
    onControllerFailover,
    onControllerResignation, config.brokerId, time)

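In other words, the same callback runs whether a broker wins the very first election or takes over after a failure. A broker that becomes the Controller is therefore always prepared for failover, which is what makes the Controller highly available.

3. Controller failover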
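
How a method name ends up as an `() => Unit` argument can be seen in a stripped-down form. The classes below are hypothetical miniatures (`MiniElector`, `MiniController`, `win` and `log` do not exist in Kafka); only the callback parameter names follow the constructor shown above.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical miniature elector that is told what to do through two callbacks.
final class MiniElector(onBecomingLeader: () => Unit, onResigningAsLeader: () => Unit) {
  private var leader = false

  def win(): Unit = { leader = true; onBecomingLeader() }

  def resign(): Unit = if (leader) { leader = false; onResigningAsLeader() }
}

// Hypothetical miniature controller that records which callbacks were invoked.
final class MiniController {
  val log = ListBuffer.empty[String]

  def onControllerFailover(): Unit = log += "failover"
  def onControllerResignation(): Unit = log += "resignation"

  // The method names are passed directly; because the parameters expect () => Unit,
  // the compiler turns each method into a function value.
  val elector = new MiniElector(onControllerFailover, onControllerResignation)
}

object CallbackDemo {
  def main(args: Array[String]): Unit = {
    val controller = new MiniController
    controller.elector.win()
    controller.elector.resign()
    println(controller.log.mkString(", "))
  }
}
```

Passing the behaviour in as functions keeps the elector unaware of what "becoming leader" means for its owner, which is why one elector class can serve any election path.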

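Becoming the controller triggers the first call to onControllerFailover. If the controller node is later deleted, a new election takes place, and elect calls back onControllerFailover to carry out the failover. This function:

  1. registers the listener for controller epoch changes;
  2. increments the controller epoch;
  3. initializes the ControllerContext object, which holds the current topics, the live brokers, and the leaders of all existing partitions;
  4. starts the controller's ChannelManager;
  5. starts the replica state machine and the partition state machine;
  6. may trigger partition reassignment, preferred replica election, and so on.

If the Controller runs into any exception or error along the way, it resigns as the current controller. This ensures that another controller election is triggered and that only one controller is ever serving.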
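
The resign-on-error behaviour can be sketched as a small helper that runs the failover steps in order and gives up leadership as soon as one of them throws. This is a simplified illustration under stated assumptions: `FailoverSketch` and `becomeController` are hypothetical names, the steps are plain functions, and the helper catches `NonFatal` errors, whereas the elect method shown earlier catches `Throwable`.

```scala
import scala.util.control.NonFatal

object FailoverSketch {
  // Runs the named steps in order. If a step throws, resign is called and the
  // failure is reported; otherwise the names of the completed steps are returned.
  def becomeController(
      steps: Seq[(String, () => Unit)],
      resign: () => Unit): Either[String, Seq[String]] = {
    val done = Seq.newBuilder[String]
    try {
      steps.foreach { case (name, step) =>
        step()
        done += name
      }
      Right(done.result())
    } catch {
      case NonFatal(e) =>
        resign() // give up leadership so that another election can take place
        Left(s"resigned: ${e.getMessage}")
    }
  }

  def main(args: Array[String]): Unit = {
    val steps: Seq[(String, () => Unit)] = Seq(
      ("incrementControllerEpoch", () => ()),
      ("initializeControllerContext", () => throw new IllegalStateException("zk unavailable")))
    println(becomeController(steps, () => println("resigning")))
  }
}
```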

  def onControllerFailover() {
    if(isRunning) {
      info("Broker %d starting become controller state transition".format(config.brokerId))
      readControllerEpochFromZookeeper()
      incrementControllerEpoch(zkUtils.zkClient)

      // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
      registerReassignedPartitionsListener()
      registerIsrChangeNotificationListener()
      registerPreferredReplicaElectionListener()
      partitionStateMachine.registerListeners()
      replicaStateMachine.registerListeners()

      // refresh the ControllerContext
      initializeControllerContext()

      // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
      // are started. This is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
      // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
      // partitionStateMachine.startup().
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)

      replicaStateMachine.startup()
      partitionStateMachine.startup()

      // register the partition change listeners for all existing topics on failover
      controllerContext.allTopics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic))
      info("Broker %d is ready to serve as the new controller with epoch %d".format(config.brokerId, epoch))
      maybeTriggerPartitionReassignment()
      maybeTriggerPreferredReplicaElection()
      if (config.autoLeaderRebalanceEnable) {
        info("starting the partition rebalance scheduler")
        autoRebalanceScheduler.startup()
        autoRebalanceScheduler.schedule("partition-rebalance-thread", checkAndTriggerPartitionRebalance,
          5, config.leaderImbalanceCheckIntervalSeconds.toLong, TimeUnit.SECONDS)
      }
      deleteTopicManager.start()
    }
    else
      info("Controller has been shut down, aborting startup/failover")
  }
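
Failover sequence (text form of the sequence diagram; participants: Controller, zkUtils, ControllerContext, PartitionStateMachine, ReplicaStateMachine, and one more whose label is cut off at "ControllerBrok"):

  1. Read the epoch information from "/controller_epoch", if it exists.
  2. conditionalUpdatePersistentPath: set the epoch to epoch + 1.
  3. Update epoch and epochZkVersion.
  4. Register the listener on "$AdminPath/reassign_partitions".
  5. Register the listener on "/isr_change_notification".
  6. Register the listener on "$AdminPath/preferred_replica_election".
  7. Register the TopicChangeListener.
  8. Register the DeleteTopicListener.
  9. Register the BrokerChangeListener.
  10. Update the context (liveBrokers, allTopics, partitionReplica, and so on).
  11. Recreate and start the ControllerChannelManager.
  12. Send the request that updates the live brokers.
  13. Startup: initialize the replica state and set the started flag.
  14. Move all online replicas to the Online state.
  15. Register the PartitionModificationsListener.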
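
The epoch increment performed by readControllerEpochFromZookeeper and incrementControllerEpoch depends on a conditional write: the update succeeds only if the node has not changed since it was read. A minimal in-memory sketch of that idea follows. `EpochSketch`, `VersionedNode`, `conditionalUpdate` and `incrementEpoch` are hypothetical names, and the return types are simplified for illustration; they are not the signatures used in Kafka.

```scala
object EpochSketch {
  // Hypothetical versioned node: every successful write bumps the version by one.
  final class VersionedNode(initial: Int) {
    private var value = initial
    private var version = 0

    def read: (Int, Int) = synchronized((value, version))

    // Succeeds only if nobody has written since expectedVersion was read.
    def conditionalUpdate(newValue: Int, expectedVersion: Int): Option[Int] = synchronized {
      if (version == expectedVersion) {
        value = newValue
        version += 1
        Some(version)
      } else None
    }
  }

  // Returns the new (epoch, epochZkVersion) on success,
  // or None if another controller has already moved the epoch forward.
  def incrementEpoch(node: VersionedNode, epoch: Int, epochZkVersion: Int): Option[(Int, Int)] =
    node.conditionalUpdate(epoch + 1, epochZkVersion).map(newVersion => (epoch + 1, newVersion))

  def main(args: Array[String]): Unit = {
    val node = new VersionedNode(5)
    val (epoch, version) = node.read
    println(incrementEpoch(node, epoch, version)) // first writer
    println(incrementEpoch(node, epoch, version)) // second writer using the stale version
  }
}
```

The second call fails because it still carries the version read before the first write. This is the property that lets a stale controller detect that a newer one has taken over.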

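Once a Controller failover is triggered, the Controller re-initializes the ControllerContext from what it reads out of the ZK nodes and refreshes the cached Controller and ISR leader information. It runs updateLeaderAndIsrCache(), which loads the data under the /brokers/topics/[topic]/partitions/[partition]/state path, together with the leaderIsrAndControllerEpoch information, into the in-memory cache of the ControllerContext. It then sends update-metadata requests to the brokers.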
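
The cache refresh can be pictured as a loop over the known partitions that reads each state node and stores the result in a map. The sketch below is illustrative only: the case classes are simplified assumptions about what such a cache entry might contain, `readState` is a hypothetical stand-in for the ZooKeeper read, and only the path layout is taken from the text above.

```scala
import scala.collection.mutable

object LeaderCacheSketch {
  // Simplified, assumed shapes for the cached data.
  final case class TopicAndPartition(topic: String, partition: Int)
  final case class LeaderAndIsr(leader: Int, leaderEpoch: Int, isr: List[Int])
  final case class LeaderIsrAndControllerEpoch(leaderAndIsr: LeaderAndIsr, controllerEpoch: Int)

  def statePath(tp: TopicAndPartition): String =
    s"/brokers/topics/${tp.topic}/partitions/${tp.partition}/state"

  // Reads the state node of every partition and caches what it finds.
  // Partitions whose state node does not exist are skipped.
  def updateLeaderAndIsrCache(
      partitions: Iterable[TopicAndPartition],
      readState: String => Option[LeaderIsrAndControllerEpoch],
      cache: mutable.Map[TopicAndPartition, LeaderIsrAndControllerEpoch]): Unit =
    for (tp <- partitions; state <- readState(statePath(tp))) cache.put(tp, state)

  def main(args: Array[String]): Unit = {
    val tp0 = TopicAndPartition("orders", 0)
    val tp1 = TopicAndPartition("orders", 1)
    val fakeZk = Map(statePath(tp0) -> LeaderIsrAndControllerEpoch(LeaderAndIsr(1, 0, List(1, 2)), 7))
    val cache = mutable.Map.empty[TopicAndPartition, LeaderIsrAndControllerEpoch]
    updateLeaderAndIsrCache(List(tp0, tp1), fakeZk.get, cache)
    println(cache)
  }
}
```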