Hello everyone. This is a blog kept alive by the dream of continuous learning. This series records my study notes and hands-on experience with Kafka, and I hope it helps you. As before, the table of contents is in question-and-answer form, like a mock interview.
[Overview]
A Kafka cluster contains one more role: the Controller.
How does this role relate to the brokers in the cluster? The answer is that any broker can take on the Controller role and carry out the duties that come with it.
1. Why design a Controller role at all?
Consider a few problems that any distributed system has to face:
1. How should the cluster's metadata be maintained? If every broker could modify metadata at will, metadata management would become very complicated; if a separate component were created just to manage metadata, that component would in turn need its own fault-tolerant deployment.
2. If a broker goes offline, how do we keep the service available? In other words, how is failover handled?
3. If a new broker joins the cluster, how should the cluster state be adjusted?
Different distributed systems answer these three questions with different designs. Kafka's answer is to elect one broker to act as the Controller, which takes responsibility for solving all of the above.
In short: the Controller role in Kafka exists for cluster failover and cluster management.
The complete architecture of the cluster now looks like this (architecture diagram not reproduced here):
2. What are the Controller's responsibilities?
Put plainly, what exactly does the Controller do?
The most direct way to find out is to read the code. The following is the callback that runs on a broker right after it becomes the Controller:
private def onControllerFailover() {
  info("Reading controller epoch from ZooKeeper")
  readControllerEpochFromZooKeeper()
  info("Incrementing controller epoch in ZooKeeper")
  incrementControllerEpoch()
  info("Registering handlers")
  // before reading source of truth from zookeeper, register the listeners to get broker/topic callbacks
  // register all the necessary child-change handlers
  val childChangeHandlers = Seq(brokerChangeHandler, topicChangeHandler, topicDeletionHandler,
    logDirEventNotificationHandler, isrChangeNotificationHandler)
  childChangeHandlers.foreach(zkClient.registerZNodeChildChangeHandler)
  // register the event handlers for preferred-replica election / partition reassignment
  val nodeChangeHandlers = Seq(preferredReplicaElectionHandler, partitionReassignmentHandler)
  nodeChangeHandlers.foreach(zkClient.registerZNodeChangeHandlerAndCheckExistence)
  info("Deleting log dir event notifications")
  zkClient.deleteLogDirEventNotifications()
  info("Deleting isr change notifications")
  zkClient.deleteIsrChangeNotifications()
  info("Initializing controller context")
  initializeControllerContext()
  info("Fetching topic deletions in progress")
  val (topicsToBeDeleted, topicsIneligibleForDeletion) = fetchTopicDeletionsInProgress()
  info("Initializing topic deletion manager")
  topicDeletionManager.init(topicsToBeDeleted, topicsIneligibleForDeletion)
  // We need to send UpdateMetadataRequest after the controller context is initialized and before the state machines
  // are started. This is because brokers need to receive the list of live brokers from UpdateMetadataRequest before
  // they can process the LeaderAndIsrRequests that are generated by replicaStateMachine.startup() and
  // partitionStateMachine.startup().
  info("Sending update metadata request")
  sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
  replicaStateMachine.startup()
  partitionStateMachine.startup()
  info(s"Ready to serve as the new controller with epoch $epoch")
  maybeTriggerPartitionReassignment(controllerContext.partitionsBeingReassigned.keySet)
  topicDeletionManager.tryTopicDeletion()
  val pendingPreferredReplicaElections = fetchPendingPreferredReplicaElections()
  onPreferredReplicaElection(pendingPreferredReplicaElections)
  info("Starting the controller scheduler")
  kafkaScheduler.startup()
  if (config.autoLeaderRebalanceEnable) {
    scheduleAutoLeaderRebalanceTask(delay = 5, unit = TimeUnit.SECONDS)
  }
  if (config.tokenAuthEnabled) {
    info("starting the token expiry check scheduler")
    tokenCleanScheduler.startup()
    tokenCleanScheduler.schedule(name = "delete-expired-tokens",
      fun = tokenManager.expireTokens,
      period = config.delegationTokenExpiryCheckIntervalMs,
      unit = TimeUnit.MILLISECONDS)
  }
}
From the handler-registration lines above, you can see exactly what happens: the Controller registers a whole set of listeners and binds the corresponding handlers. That is the bulk of the work the Controller is responsible for.
To summarize:
1. Cluster management, including broker management, topic management, ISR-change management, and so on. Broker management includes updating the cluster state after a broker crashes so that the surviving brokers can continue serving correctly.
2. Metadata propagation: whenever the Controller detects a change in cluster state, it sends the latest metadata to all surviving brokers, keeping the whole cluster's metadata up to date.
The source above also reveals how the Controller does this management: it relies on ZooKeeper ephemeral nodes plus registered watchers. As soon as a watched node fires an event, the corresponding handler logic runs.
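To make the watch-and-dispatch idea concrete, here is a minimal sketch in plain Scala. All names here (ZNodeEvent, WatcherRegistry, etc.) are invented for illustration and are not Kafka's or ZooKeeper's real classes; the point is just the pattern of binding a handler to a znode path and dispatching when an event fires.

```scala
// Illustrative sketch only: these names are invented, not Kafka's real classes.
sealed trait ZNodeEvent
case object Created extends ZNodeEvent
case object Deleted extends ZNodeEvent
case object DataChanged extends ZNodeEvent

trait ZNodeHandler {
  def path: String
  def handle(event: ZNodeEvent): Unit
}

class WatcherRegistry {
  private var handlers = Map.empty[String, ZNodeHandler]

  // analogous to zkClient.registerZNodeChangeHandler: remember which handler watches which path
  def register(h: ZNodeHandler): Unit = handlers += (h.path -> h)

  // simulate ZooKeeper firing a watch on a path: dispatch to the registered handler, if any
  def fire(path: String, event: ZNodeEvent): Unit =
    handlers.get(path).foreach(_.handle(event))
}
```

Registering a handler for "/controller" and then firing a Deleted event on that path runs the handler, which is the same shape as the real Controller reacting to znode changes.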
Next, let's see in the source how the Controller performs failover when a broker goes down. Here is the handler that fires when the set of broker nodes changes:
case object BrokerChange extends ControllerEvent {
  override def state: ControllerState = ControllerState.BrokerChange

  override def process(): Unit = {
    // if the current broker is not the controller, do nothing
    if (!isActive) return
    // fetch the latest brokers from zk
    val curBrokers = zkClient.getAllBrokersInCluster.toSet
    val curBrokerIds = curBrokers.map(_.id)
    // brokers previously cached locally
    val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
    // new - old = newly added brokers
    val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
    // old - new = dead brokers
    val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
    val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
    // refresh the local cache
    controllerContext.liveBrokers = curBrokers
    val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
    val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
    val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
    info(s"Newly added brokers: ${newBrokerIdsSorted.mkString(",")}, " +
      s"deleted brokers: ${deadBrokerIdsSorted.mkString(",")}, all live brokers: ${liveBrokerIdsSorted.mkString(",")}")
    // for each newly added broker:
    // 1. initialize its network module (networkClient)
    // 2. initialize its request-sending thread (requestThread)
    // then cache this state in the controller's map
    newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
    // release the resources held for the dead brokers and remove them from the map
    deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
    if (newBrokerIds.nonEmpty)
      // does two things:
      // 1. sends an UpdateMetadataRequest to all live brokers via their sendRequestThread
      // 2. registers broker-modification listeners for the new brokers (these listeners
      //    mainly refresh local metadata and then send metadata-update requests)
      onBrokerStartup(newBrokerIdsSorted)
    if (deadBrokerIds.nonEmpty)
      // three things:
      // 1. remove the dead brokers' locally cached data
      // 2. take the replicas on those brokers offline
      // 3. unregister the modification listeners for those brokers
      onBrokerFailure(deadBrokerIdsSorted)
  }
}
private def onBrokerFailure(deadBrokers: Seq[Int]) {
  info(s"Broker failure callback for ${deadBrokers.mkString(",")}")
  // remove locally cached state
  deadBrokers.foreach(controllerContext.replicasOnOfflineDirs.remove)
  val deadBrokersThatWereShuttingDown =
    deadBrokers.filter(id => controllerContext.shuttingDownBrokerIds.remove(id))
  info(s"Removed $deadBrokersThatWereShuttingDown from list of shutting down brokers.")
  val allReplicasOnDeadBrokers = controllerContext.replicasOnBrokers(deadBrokers.toSet)
  // take the replicas on the dead brokers offline
  onReplicasBecomeOffline(allReplicasOnDeadBrokers)
  // unregister the modification handlers for those brokers
  unregisterBrokerModificationsHandler(deadBrokers)
}
private def onReplicasBecomeOffline(newOfflineReplicas: Set[PartitionAndReplica]): Unit = {
  val (newOfflineReplicasForDeletion, newOfflineReplicasNotForDeletion) =
    newOfflineReplicas.partition(p => topicDeletionManager.isTopicQueuedUpForDeletion(p.topic))
  val partitionsWithoutLeader = controllerContext.partitionLeadershipInfo.filter(partitionAndLeader =>
    !controllerContext.isReplicaOnline(partitionAndLeader._2.leaderAndIsr.leader, partitionAndLeader._1) &&
    !topicDeletionManager.isTopicQueuedUpForDeletion(partitionAndLeader._1.topic)).keySet

  // trigger OfflinePartition state for all partitions whose current leader is one amongst the newOfflineReplicas
  partitionStateMachine.handleStateChanges(partitionsWithoutLeader.toSeq, OfflinePartition)
  // trigger OnlinePartition state changes for offline or new partitions
  partitionStateMachine.triggerOnlinePartitionStateChange()
  // trigger OfflineReplica state change for those newly offline replicas
  replicaStateMachine.handleStateChanges(newOfflineReplicasNotForDeletion.toSeq, OfflineReplica)

  // fail deletion of topics that are affected by the offline replicas
  if (newOfflineReplicasForDeletion.nonEmpty) {
    // it is required to mark the respective replicas in TopicDeletionFailed state since the replica cannot be
    // deleted when its log directory is offline. This will prevent the replica from being in TopicDeletionStarted
    // state indefinitely since topic deletion cannot be retried until at least one replica is in
    // TopicDeletionStarted state
    topicDeletionManager.failReplicaDeletion(newOfflineReplicasForDeletion)
  }

  // If replica failure did not require leader re-election, inform brokers of the offline replica
  // Note that during leader re-election, brokers update their metadata
  if (partitionsWithoutLeader.isEmpty) {
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
  }
}
In essence: after sensing the node change, the Controller computes which brokers have gone offline, then promotes follower replicas located on the surviving brokers to leader so that the service keeps running.
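The core of the detection step above is plain set arithmetic: current brokers minus cached brokers gives the newcomers, cached minus current gives the dead ones. A minimal sketch, with broker descriptors simplified to Int ids (the wrapper object name is made up for this example):

```scala
object BrokerDiff {
  // Sketch of the set arithmetic inside BrokerChange (types simplified to Int ids).
  def diff(cached: Set[Int], current: Set[Int]): (Set[Int], Set[Int]) = {
    val added = current -- cached // new - old = brokers that just joined
    val dead  = cached -- current // old - new = brokers that just went offline
    (added, dead)
  }
}
```

For example, if the cached set is {1, 2, 3} and ZooKeeper now reports {2, 3, 4}, then broker 4 just joined and broker 1 just died, which is exactly what the real handler logs and acts upon.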
3. How is the Controller elected?
As before, the source code is the best teacher. Let's look straight at the controller initialization code:
/* start kafka controller */
kafkaController = new KafkaController(config, zkClient, time, metrics, brokerInfo, tokenManager, threadNamePrefix)
kafkaController.startup()
/**
 * Invoked when the controller module of a Kafka server is started up. This does not assume that the current broker
 * is the controller. It merely registers the session expiration listener and starts the controller leader
 * elector
 *
 * When kafka starts up, it does not yet know who the controller is,
 * so it registers a session-expiration listener on zk
 * and kicks off the election.
 */
def startup() = {
  // register a handler that runs when the corresponding event fires;
  // the handler's name is controller-state-change-handler
  zkClient.registerStateChangeHandler(new StateChangeHandler {
    override val name: String = StateChangeHandlers.ControllerHandler
    override def afterInitializingSession(): Unit = {
      eventManager.put(RegisterBrokerAndReelect)
    }
    override def beforeInitializingSession(): Unit = {
      val expireEvent = new Expire
      eventManager.clearAndPut(expireEvent)
      expireEvent.waitUntilProcessed()
    }
  })
  // put the Startup event into the ControllerEventManager;
  // it simply lands in a blocking queue.
  // this is a producer/consumer pattern combined with the strategy pattern
  eventManager.put(Startup)
  // start the internal processing thread
  eventManager.start()
}
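The comments above describe the event manager as a producer/consumer pattern combined with the strategy pattern: events carry their own process() logic and are drained from a blocking queue. Here is a heavily simplified sketch of that shape; the names (Event, MiniEventManager) are hypothetical, and Kafka's real ControllerEventManager is much richer (dedicated thread, state tracking, metrics).

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical, heavily simplified version of the queue-plus-worker pattern
// described in the comments above.
trait Event { def process(): Unit } // strategy: each event knows how to handle itself

class MiniEventManager {
  private val queue = new LinkedBlockingQueue[Event]()

  // producer side: any thread can enqueue an event
  def put(e: Event): Unit = queue.put(e)

  // consumer side: in Kafka this loop runs on a dedicated thread;
  // here we drain synchronously so the behavior is easy to observe
  def drainAll(): Unit = while (!queue.isEmpty) queue.take().process()
}
```

This shape is what lets the Controller serialize all state changes through a single consumer, so handlers never race each other.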
case object Startup extends ControllerEvent {
  def state = ControllerState.ControllerChange

  override def process(): Unit = {
    // check whether the /controller znode exists, and register the handler
    // for its events (creation / deletion / data change).
    // registering the handler really means setting a watcher on the znode:
    // once the node changes, the handler logic for that specific event runs
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    // then run the election
    elect()
  }
}
private def elect(): Unit = {
  val timestamp = time.milliseconds
  // read the current controllerId; -1 if absent
  activeControllerId = zkClient.getControllerId.getOrElse(-1)
  /*
   * We can get here during the initial startup and the handleDeleted ZK callback. Because of the potential race
   * condition, it's possible that the controller has already been elected when we get here. This check will prevent
   * the following createEphemeralPath method from getting into an infinite loop if this broker is already the
   * controller.
   */
  // if some broker has already become the controller, stop here
  if (activeControllerId != -1) {
    debug(s"Broker $activeControllerId has been elected as the controller, so stopping the election process.")
    return
  }

  try {
    // try to create the ephemeral znode in zk,
    // writing our own brokerId and the current timestamp
    zkClient.checkedEphemeralCreate(ControllerZNode.path, ControllerZNode.encode(config.brokerId, timestamp))
    // no exception means the creation succeeded: we won the election
    info(s"${config.brokerId} successfully elected as the controller")
    // record our own brokerId as the current controllerId
    activeControllerId = config.brokerId
    // do the things a new controller must do:
    // bump the controllerEpoch, register the listeners, and so on
    onControllerFailover()
  } catch {
    case _: NodeExistsException =>
      // creation failed because the node already exists, so cache the
      // brokerId stored in that node as the active controller
      // If someone else has written the path, then
      activeControllerId = zkClient.getControllerId.getOrElse(-1)
      // if the controllerId read here is still -1, the freshly elected broker
      // has already died, so a new round of election will start
      if (activeControllerId != -1)
        debug(s"Broker $activeControllerId was elected as controller instead of broker ${config.brokerId}")
      else
        warn("A controller has been elected but just resigned, this will result in another round of election")
    case e2: Throwable =>
      error(s"Error while electing or becoming controller on broker ${config.brokerId}", e2)
      triggerControllerMove()
  }
}
Isn't the code above remarkably clear? Kafka is not only well architected; its internal code design and comments are exemplary. The one thing that scares some people off is that it is written in Scala,
but with a little bit of the syntax you can roughly follow what the code means, especially with comments this thorough. Unless you need to do secondary development, knowing a little syntax is enough.
To summarize: at startup, each broker tries to create the ephemeral /controller znode in ZooKeeper, writing in its own brokerId and a timestamp. If the creation succeeds, that broker has won the election; if it fails, the broker reads the brokerId of the broker that has already been elected.
So, in a fresh cluster, the Controller is usually simply the first broker to start.
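The create-the-ephemeral-node race above can be sketched with an AtomicReference standing in for the /controller znode. All names here (FakeControllerZNode, tryElect, sessionExpired) are invented for illustration; the compareAndSet mirrors the checkedEphemeralCreate attempt, and the fallback read mirrors the NodeExistsException branch.

```scala
import java.util.concurrent.atomic.AtomicReference

// Simulates the /controller ephemeral znode with an AtomicReference; all names here
// are invented for illustration. The first broker to "create the node" wins, and any
// later attempt just reads back the winner's id, mirroring the NodeExistsException path.
class FakeControllerZNode {
  private val holder = new AtomicReference[Option[Int]](None)

  // returns the elected controller id, whether or not it is this broker
  def tryElect(brokerId: Int): Int =
    if (holder.compareAndSet(None, Some(brokerId))) brokerId // creation succeeded
    else holder.get.getOrElse(tryElect(brokerId))            // node exists: read the winner,
                                                             // retrying if it just vanished

  // the ephemeral node disappears when the controller's zk session dies
  def sessionExpired(): Unit = holder.set(None)
}
```

The retry in the else branch corresponds to the real code's "a controller has been elected but just resigned" case, where reading the id yields -1 and another election round begins.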
4. What happens when the Controller crashes?
Finally, consider one more question: only one broker in a Kafka cluster ever becomes the Controller, so what happens if the broker that crashes is the Controller itself? Do we need multiple Controller instances in an active/standby arrangement to be safe?
Clearly not. As noted at the beginning, one elegant aspect of the Controller design is that it needs no separate deployment and no separate failover handling; it simply reuses the Kafka cluster's own fault-tolerance mechanism.
How does Kafka achieve this? Let's keep reading the relevant parts of the source:
case object Startup extends ControllerEvent {
  def state = ControllerState.ControllerChange

  override def process(): Unit = {
    // check whether the /controller znode exists, and register the handler
    // for its events (creation / deletion / data change).
    // registering the handler really means setting a watcher on the znode:
    // once the node changes, the handler logic for that specific event runs
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    // then run the election
    elect()
  }
}
class ControllerChangeHandler(controller: KafkaController, eventManager: ControllerEventManager)
    extends ZNodeChangeHandler {
  override val path: String = ControllerZNode.path

  override def handleCreation(): Unit = eventManager.put(controller.ControllerChange)
  override def handleDeletion(): Unit = eventManager.put(controller.Reelect)
  override def handleDataChange(): Unit = eventManager.put(controller.ControllerChange)
}
case object Reelect extends ControllerEvent {
  override def state = ControllerState.ControllerChange

  override def process(): Unit = {
    val wasActiveBeforeChange = isActive
    // re-register the /controller watcher
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    activeControllerId = zkClient.getControllerId.getOrElse(-1)
    // if this broker was the controller but no longer is, clean up its controller state
    if (wasActiveBeforeChange && !isActive) {
      onControllerResignation()
    }
    // run the election
    elect()
  }
}
As the source shows, this too is implemented through zk.
To summarize:
At startup, if the /controller node already exists, each broker adds a watcher on that node and binds the corresponding handler;
once the broker acting as Controller crashes, its ephemeral node is deleted, which fires the watch on every surviving broker in the cluster, and they all run the election again, racing to create the /controller node;
whoever succeeds becomes the new Controller; the new Controller then refreshes the cluster state, and the cluster continues to serve reliably.
Under normal conditions, the entire controller switchover takes no more than a few seconds.
Additional notes:
1. ZooKeeper also holds a persistent node, /controller_epoch, which records how many times the Controller has changed; every controller switch increments its value by 1.
Every request involved in controller interaction must carry the controller_epoch value. If that value differs from the one cached on the broker, the request is either stale (its epoch is smaller than the cached controller_epoch)
or the controller has switched (its epoch is larger than the cached controller_epoch); in either case, the request is treated as invalid.
In this way, Kafka uses controller_epoch to guarantee the uniqueness of the controller, and thereby the consistency of controller operations.
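The epoch comparison described above boils down to a three-way check. A minimal sketch (the object and case names are illustrative, not Kafka's actual request-handling types):

```scala
object EpochGuard {
  sealed trait Result
  case object Valid extends Result
  case object StaleRequest extends Result    // request epoch smaller than the cached value
  case object ControllerMoved extends Result // request epoch larger: controller has switched

  // Sketch of the controller_epoch validation described above; names are illustrative.
  // Anything other than an exact epoch match marks the request as invalid.
  def check(requestEpoch: Int, cachedEpoch: Int): Result =
    if (requestEpoch == cachedEpoch) Valid
    else if (requestEpoch < cachedEpoch) StaleRequest
    else ControllerMoved
}
```

A broker caching epoch 5 would accept a request carrying epoch 5, reject epoch 4 as stale, and treat epoch 6 as a signal that the controller has moved.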
2. The Controller design also reduces the whole cluster's dependence on zk: only the Controller has to register a large number of watchers, while every other broker registers very few.
This neatly avoids the alternative design in which every broker registers masses of watchers and leans heavily on zk, which invites problems such as split brain and the herd effect.