Kafka Source Code Reading: Controller (Part 2)

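The previous post, Kafka Source Code (Part 1), corresponds to sections 3.2 and 3.3 of Kafka Design Analysis (Part 2). I used Kafka 0.8.2.x for a long time, back when Redis was just becoming popular and Hadoop was on the rise. Four or five years have gone by in a flash, and I am finally old enough to sit down and read the source.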

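I have to say that Kafka's code style is much nicer than Spark's. Spark is huge, after all, while Kafka is comparatively small and elegant.
Probably for performance reasons, and because of how ZooKeeper works, most of Kafka is built on asynchronous, callback-driven events, similar to how epoll handles IO.
Almost every callback in the source has a comment stating when the method is invoked and what it does once triggered. That makes the code friendly to develop, maintain, and read. I wish I had found it sooner, haha.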

Sections 3 and 4 of this post correspond to 3.4 (broker failover) and 3.5 (Partition Leader election) in the earlier article.

1. Origin of BrokerChangeListener

Once a broker is elected controller, it registers the listeners of two state machines in the onBecomingLeader callback (that is, onControllerFailover):

    partitionStateMachine.registerListeners()
    replicaStateMachine.registerListeners()

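When the replica state machine runs registerListeners(), it calls registerBrokerChangeListener(), which registers brokerChangeListener on the "/brokers/ids" path.
The listener fires on state changes of this node itself and whenever child nodes are added or removed.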

  // Located in ReplicaStateMachine.scala
  // register ZK listeners of the replica state machine
  def registerListeners() {
    // register broker change listener
    registerBrokerChangeListener()
  }
  
  private def registerBrokerChangeListener() = {
    zkUtils.zkClient.subscribeChildChanges(ZkUtils.BrokerIdsPath, brokerChangeListener)
  }

The listener is registered through zkClient's subscribeChildChanges method. The callback it triggers (the second argument) is an IZkChildListener:

// See the zkClient documentation
subscribeChildChanges(java.lang.String path, IZkChildListener listener)

ControllerZkChildListener implements the IZkChildListener interface:

// ControllerZkListener.scala
trait ControllerZkChildListener extends IZkChildListener with ControllerZkListener {
  @throws[Exception]
  final def handleChildChange(parentPath: String, currentChildren: java.util.List[String]): Unit = {
    // Due to zkclient's callback order, it's possible for the callback to be triggered after the controller has moved
    if (controller.isActive)
      doHandleChildChange(parentPath, currentChildren.asScala)
  }

  @throws[Exception]
  def doHandleChildChange(parentPath: String, currentChildren: Seq[String]): Unit
}

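When the listener fires and the controller is active, the doHandleChildChange method declared in ControllerZkChildListener is executed.
Because BrokerChangeListener extends ControllerZkChildListener, the code that actually runs is BrokerChangeListener's doHandleChildChange.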
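
To make the guard pattern concrete, here is a minimal sketch of my own, not Kafka code. FakeController, GuardedChildListener, and RecordingListener are hypothetical names, and only the isActive check mirrors the trait above: the final entry point decides whether the subclass hook runs at all.

```scala
// Hypothetical stand-in for KafkaController: only the isActive flag matters here.
final class FakeController(var isActive: Boolean)

// Template method: the final entry point checks isActive,
// subclasses only fill in doHandleChildChange.
trait GuardedChildListener {
  protected def controller: FakeController

  final def handleChildChange(parentPath: String, children: Seq[String]): Unit =
    if (controller.isActive) doHandleChildChange(parentPath, children)

  protected def doHandleChildChange(parentPath: String, children: Seq[String]): Unit
}

// Records every child list that got past the guard.
final class RecordingListener(protected val controller: FakeController)
    extends GuardedChildListener {
  val seen = scala.collection.mutable.ArrayBuffer.empty[Seq[String]]

  protected def doHandleChildChange(parentPath: String, children: Seq[String]): Unit =
    seen += children
}
```

A callback that arrives after the controller has moved elsewhere (isActive == false) is dropped silently, which is the situation the comment in the real trait warns about.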

Kafka handles all newly added brokers and all dead brokers inside this one callback:

  /**
   * Located in ReplicaStateMachine.scala
   * This is the zookeeper listener that triggers all the state transitions for a replica
   */
  class BrokerChangeListener(protected val controller: KafkaController) extends ControllerZkChildListener {

    protected def logName = "BrokerChangeListener"
    // Handles newly added brokers and dead brokers in one pass.
    // The controller runs onBrokerStartup for new brokers and onBrokerFailure for dead ones.
    def doHandleChildChange(parentPath: String, currentBrokerList: Seq[String]) {
      info("Broker change listener fired for path %s with children %s"
          .format(parentPath, currentBrokerList.sorted.mkString(",")))
      inLock(controllerContext.controllerLock) {
        // Runs only after the ReplicaStateMachine has been started (startup)
        if (hasStarted.get) {
          ControllerStats.leaderElectionTimer.time {
            try {
              // Read the current broker info from the node path, i.e. the state after the change
              val curBrokers = currentBrokerList.map(_.toInt).toSet.flatMap(zkUtils.getBrokerInfo)
              val curBrokerIds = curBrokers.map(_.id)
              val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
              // liveOrShuttingDownBrokerIds is a Set, and -- is set difference.
              // https://docs.scala-lang.org/zh-cn/overviews/collections/sets.html
              // Brokers present after the change minus the known ones (live or shutting down) gives the new brokers;
              val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
              // Brokers known before the change minus those present after it gives the dead brokers
              val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
              val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
              controllerContext.liveBrokers = curBrokers
              val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
              val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
              val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
              info("Newly added brokers: %s, deleted brokers: %s, all live brokers: %s"
                .format(newBrokerIdsSorted.mkString(","), deadBrokerIdsSorted.mkString(","), liveBrokerIdsSorted.mkString(",")))
              // Add each new broker to controllerChannelManager, which sets up its connection and send thread
              newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
              // Remove dead brokers from controllerChannelManager: shut down the requestSendThread and clear the in-memory cache
              deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
              if(newBrokerIds.nonEmpty)
                controller.onBrokerStartup(newBrokerIdsSorted)
              if(deadBrokerIds.nonEmpty)
                controller.onBrokerFailure(deadBrokerIdsSorted)
            } catch {
              case e: Throwable => error("Error while handling broker changes", e)
            }
          }
        }
      }
    }
  }

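Each newly added broker is added to controllerChannelManager. The controller keeps a connection to the new broker and sends requests to it from a dedicated thread that it creates and starts.
It then runs controller.onBrokerStartup().
Each dead broker is removed from controllerChannelManager, which shuts down its requestSendThread (the "channel") and clears it from the channel manager's in-memory cache.
It then runs controller.onBrokerFailure().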
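
The core of the listener is two set differences. Here is a minimal standalone sketch of that step; BrokerDiff and diff are hypothetical names of mine, and only the `--` logic comes from the listing above.

```scala
object BrokerDiff {
  // Compares the broker ids read from ZooKeeper after the change (current)
  // with the ids the controller already knew about (known).
  // Returns (new broker ids, dead broker ids), both sorted.
  def diff(current: Set[Int], known: Set[Int]): (Seq[Int], Seq[Int]) = {
    val added = current -- known // present now, unknown before
    val dead = known -- current  // known before, missing now
    (added.toSeq.sorted, dead.toSeq.sorted)
  }
}
```

For example, with known brokers 1, 2, 3 and current brokers 1, 2, 4, `BrokerDiff.diff(Set(1, 2, 4), Set(1, 2, 3))` reports broker 4 as new and broker 3 as dead.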

For reference, ReplicaState has seven states:

sealed trait ReplicaState { def state: Byte }
case object NewReplica extends ReplicaState { val state: Byte = 1 }
case object OnlineReplica extends ReplicaState { val state: Byte = 2 }
case object OfflineReplica extends ReplicaState { val state: Byte = 3 }
case object ReplicaDeletionStarted extends ReplicaState { val state: Byte = 4}
case object ReplicaDeletionSuccessful extends ReplicaState { val state: Byte = 5}
case object ReplicaDeletionIneligible extends ReplicaState { val state: Byte = 6}
case object NonExistentReplica extends ReplicaState { val state: Byte = 7 }

That is the overall picture. The details follow.

2. How the Controller handles new brokers

It mostly comes down to these two lines:

newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
controller.onBrokerStartup(newBrokerIdsSorted)

addBroker is a method of controllerChannelManager:

  def addBroker(broker: Broker) {
    // be careful here. Maybe the startup() API has already started the request send thread
    brokerLock synchronized {
      if(!brokerStateInfo.contains(broker.id)) {
        addNewBroker(broker)
        startRequestSendThread(broker.id)
      }
    }
  }

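addNewBroker(broker) creates a brokerNode, a Selector, and a NetworkClient for the new broker,
then creates a RequestSendThread object and stores the broker's ControllerBrokerStateInfo in brokerStateInfo.
RequestSendThread is a ShutdownableThread.
startRequestSendThread then starts this thread, which sends the queued clientRequest messages and handles the responses. As far as I can tell, the 100 ms in the code is the back-off before a retry rather than a fixed send interval: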

    // key code in startRequestSendThread:
    clientResponse = networkClient.blockingSendAndReceive(clientRequest)(time)
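
The idea of one send thread per broker that drains a queue can be sketched as below. This is my own simplified model, not the real RequestSendThread: SendLoop, its two queues, the 50 ms poll timeout, and the `send` function (standing in for blockingSendAndReceive) are all assumptions made for illustration.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical simplified sender: one thread per broker, draining that broker's queue.
final class SendLoop(send: String => String) extends Thread("send-loop") {
  private val running = new AtomicBoolean(true)
  val requests = new LinkedBlockingQueue[String]()
  val responses = new LinkedBlockingQueue[String]()

  override def run(): Unit =
    while (running.get) {
      // Poll with a timeout so shutdown() is noticed even when the queue is idle.
      val request = requests.poll(50, TimeUnit.MILLISECONDS)
      // Blocking send: the next request is not taken until this one has a response.
      if (request != null) responses.put(send(request))
    }

  // Stops the loop and waits for the thread to exit.
  def shutdown(): Unit = {
    running.set(false)
    join()
  }
}
```

Because the send is blocking, requests to a single broker are processed strictly in the order they were queued, which is what a controller wants when later state changes depend on earlier ones.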

3. How the Controller handles broker failure

The controller runs onBrokerFailure, which is triggered by the replica state machine's BrokerChangeListener and takes the list of failed brokers as input. It does the following four things:

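Marks partitions whose leaders have died as offline.
Triggers the OnlinePartition state change for all offline or new partitions.
Invokes the OfflineReplica state change on the input list of failed brokers.
If no partitions are affected, sends an UpdateMetadataRequest to the live or shutting down brokers.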

  /**
   * This callback is invoked by the replica state machine's broker change listener with the list of failed brokers
   * as input. It does the following -
   * 1. Mark partitions with dead leaders as offline
   * 2. Triggers the OnlinePartition state change for all new/offline partitions
   * 3. (this line looks wrong:) Invokes the OfflineReplica state change on the input list of newly started brokers
   * 4. If no partitions are effected then send UpdateMetadataRequest to live or shutting down brokers
   *
   * Note that we don't need to refresh the leader/isr cache for all topic/partitions at this point.  This is because
   * the partition state machine will refresh our cache for us when performing leader election for all new/offline
   * partitions coming online.
   */
  def onBrokerFailure(deadBrokers: Seq[Int]) {
    info("Broker failure callback for %s".format(deadBrokers.mkString(",")))
    val deadBrokersThatWereShuttingDown =
      deadBrokers.filter(id => controllerContext.shuttingDownBrokerIds.remove(id))
    info("Removed %s from list of shutting down brokers.".format(deadBrokersThatWereShuttingDown))
    val deadBrokersSet = deadBrokers.toSet
    
    // trigger OfflinePartition state for all partitions whose current leader is one amongst the dead brokers
    val partitionsWithoutLeader = controllerContext.partitionLeadershipInfo.filter(partitionAndLeader =>
      deadBrokersSet.contains(partitionAndLeader._2.leaderAndIsr.leader) &&
        !deleteTopicManager.isTopicQueuedUpForDeletion(partitionAndLeader._1.topic)).keySet
    // 1. Move every partition whose current leader is among deadBrokers to the OfflinePartition target state
    partitionStateMachine.handleStateChanges(partitionsWithoutLeader, OfflinePartition)
    
    // trigger OnlinePartition state changes for offline or new partitions
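    // 2. Immediately try to bring them back to the OnlinePartition state:
    // trigger the OnlinePartition state for every partition in NewPartition or OfflinePartition (except those being deleted),
    // using controller.offlinePartitionSelector to re-elect the partition leader.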
    partitionStateMachine.triggerOnlinePartitionStateChange()
    
    // filter out the replicas that belong to topics that are being deleted
    // 3. From the replicas on dead brokers, filter out (filterNot) those of topics queued for deletion, giving activeReplicasOnDeadBrokers
    var allReplicasOnDeadBrokers = controllerContext.replicasOnBrokers(deadBrokersSet)
    val activeReplicasOnDeadBrokers = allReplicasOnDeadBrokers.filterNot(p => deleteTopicManager.isTopicQueuedUpForDeletion(p.topic))
    // 4. Move the active replicas to the OfflineReplica state:
    // a stop-replica command is sent for them so that they stop fetching data from the leader.
    replicaStateMachine.handleStateChanges(activeReplicasOnDeadBrokers, OfflineReplica)
    
    // check if topic deletion state for the dead replicas needs to be updated
    // 5. Pick out the replicas of topics queued for deletion, mark their deletion as failed in deleteTopicManager, and move them to ReplicaDeletionIneligible
    val replicasForTopicsToBeDeleted = allReplicasOnDeadBrokers.filter(p => deleteTopicManager.isTopicQueuedUpForDeletion(p.topic))
    if(replicasForTopicsToBeDeleted.nonEmpty) {
      // it is required to mark the respective replicas in TopicDeletionFailed state since the replica cannot be
      // deleted when the broker is down. This will prevent the replica from being in TopicDeletionStarted state indefinitely
      // since topic deletion cannot be retried until at least one replica is in TopicDeletionStarted state
      deleteTopicManager.failReplicaDeletion(replicasForTopicsToBeDeleted)
    }

    // If broker failure did not require leader re-election, inform brokers of failed broker
    // Note that during leader re-election, brokers update their metadata
    if (partitionsWithoutLeader.isEmpty) {
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
    }
  }
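
The filtering that feeds the state changes (the steps numbered 1 and 3 to 5 in the comments above) can be sketched on plain collections. This is a simplified model under my own assumptions: PartitionId and Replica are hypothetical case classes, and `topicsBeingDeleted` stands in for deleteTopicManager.isTopicQueuedUpForDeletion.

```scala
object BrokerFailureSketch {
  // Hypothetical, simplified model types, not the controller's own classes.
  final case class PartitionId(topic: String, partition: Int)
  final case class Replica(topic: String, partition: Int, brokerId: Int)

  // Step 1: partitions whose leader died, skipping topics queued for deletion.
  def partitionsWithoutLeader(leaders: Map[PartitionId, Int],
                              deadBrokers: Set[Int],
                              topicsBeingDeleted: Set[String]): Set[PartitionId] =
    leaders.collect {
      case (tp, leader) if deadBrokers(leader) && !topicsBeingDeleted(tp.topic) => tp
    }.toSet

  // Steps 3 to 5: split the replicas on dead brokers into
  // (active replicas, replicas of topics being deleted).
  def splitReplicas(replicas: Set[Replica],
                    deadBrokers: Set[Int],
                    topicsBeingDeleted: Set[String]): (Set[Replica], Set[Replica]) = {
    val onDead = replicas.filter(r => deadBrokers(r.brokerId))
    val (toDelete, active) = onDead.partition(r => topicsBeingDeleted(r.topic))
    (active, toDelete)
  }
}
```

In the real method the first set goes to OfflinePartition, the active replicas go to OfflineReplica, and the replicas of topics being deleted are handed to failReplicaDeletion.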

A partition has four possible states:

sealed trait PartitionState { def state: Byte }
case object NewPartition extends PartitionState { val state: Byte = 0 }
case object OnlinePartition extends PartitionState { val state: Byte = 1 }
case object OfflinePartition extends PartitionState { val state: Byte = 2 }
case object NonExistentPartition extends PartitionState { val state: Byte = 3 }

4. Partition Leader election

KafkaController defines four leader selectors, all of which extend PartitionLeaderSelector:

  val offlinePartitionSelector = new OfflinePartitionLeaderSelector(controllerContext, config)
  private val reassignedPartitionLeaderSelector = new ReassignedPartitionLeaderSelector(controllerContext)
  private val preferredReplicaPartitionLeaderSelector = new PreferredReplicaPartitionLeaderSelector(controllerContext)
  private val controlledShutdownPartitionLeaderSelector = new ControlledShutdownLeaderSelector(controllerContext)

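NoOpLeaderSelector was removed in 0.10.2. Each of the four selectors implements its own selectLeader strategy for choosing a partition's leader from the ISR. I will fill in the details when I have time.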
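
As a placeholder until then, here is a minimal sketch of what a selectLeader policy for the offline case could look like. It is not the real OfflinePartitionLeaderSelector: OfflineLeaderSketch, the Option return type, and the `uncleanElectionEnabled` fallback to a live replica outside the ISR are my assumptions for illustration.

```scala
object OfflineLeaderSketch {
  // Returns the new leader, or None when no eligible replica is alive.
  def selectLeader(assignedReplicas: Seq[Int],
                   isr: Seq[Int],
                   liveBrokers: Set[Int],
                   uncleanElectionEnabled: Boolean): Option[Int] = {
    // Replicas of this partition that sit on live brokers, in assignment order.
    val liveAssigned = assignedReplicas.filter(liveBrokers)
    val liveIsr = isr.filter(liveBrokers).toSet
    // Prefer the first live assigned replica that is also in the ISR.
    // Otherwise fall back to any live assigned replica, but only when
    // unclean election is allowed (this may lose data).
    liveAssigned.find(liveIsr)
      .orElse(if (uncleanElectionEnabled) liveAssigned.headOption else None)
  }
}
```

With replicas 1, 2, 3, ISR 2, 3, and broker 1 dead, the sketch picks 2. If the only ISR member is dead and unclean election is off, it returns None, which would leave the partition offline.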
