Kubernetes Source Code Study - Controller - P1: Multi-Instance Leader Election


Preface

In a multi-master Kubernetes deployment, the core components all run in a one-leader, multiple-standby mode. The earlier articles on the scheduler did not analyze its leader election or how the leader and standbys divide the work, so in this article we take the controller as an example and walk through how multiple instances of a component elect a leader and cooperate.

Entry point

Like the scheduler, kube-controller-manager's cmd startup is built on cobra. If you are not familiar with cobra, refer back to the earlier articles; it will not be repeated here. Following the entry point straight to the startup function:

==> cmd/kube-controller-manager/controller-manager.go:38

command := app.NewControllerManagerCommand()

==> cmd/kube-controller-manager/app/controllermanager.go:109

Run(c.Complete(), wait.NeverStop)

==> cmd/kube-controller-manager/app/controllermanager.go:153

func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {}

The entry function is here; the code block below is annotated section by section:

func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
  ...
	// Some code omitted for brevity
  
  // Start kube-controller-manager's HTTP server (secure and/or insecure)
	// Start the controller manager HTTP server
	// unsecuredMux is the handler for these controller *after* authn/authz filters have been applied
	var unsecuredMux *mux.PathRecorderMux
	if c.SecureServing != nil {
		unsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging, checks...)
		handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, &c.Authorization, &c.Authentication)
		// TODO: handle stoppedCh returned by c.SecureServing.Serve
		if _, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
			return err
		}
	}
	if c.InsecureServing != nil {
		unsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging, checks...)
		insecureSuperuserAuthn := server.AuthenticationInfo{Authenticator: &server.InsecureSuperuser{}}
		handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, nil, &insecureSuperuserAuthn)
		if err := c.InsecureServing.Serve(handler, 0, stopCh); err != nil {
			return err
		}
	}
  // The run function that does the controllers' actual work. Note: it is passed as a callback and executed only after this instance wins the leader election.
	run := func(ctx context.Context) {
		rootClientBuilder := controller.SimpleControllerClientBuilder{
			ClientConfig: c.Kubeconfig,
		}
		var clientBuilder controller.ControllerClientBuilder
		if c.ComponentConfig.KubeCloudShared.UseServiceAccountCredentials {
			if len(c.ComponentConfig.SAController.ServiceAccountKeyFile) == 0 {
				// It's possible another controller process is creating the tokens for us.
				// If one isn't, we'll timeout and exit when our client builder is unable to create the tokens.
				klog.Warningf("--use-service-account-credentials was specified without providing a --service-account-private-key-file")
			}
			clientBuilder = controller.SAControllerClientBuilder{
				ClientConfig:         restclient.AnonymousClientConfig(c.Kubeconfig),
				CoreClient:           c.Client.CoreV1(),
				AuthenticationClient: c.Client.AuthenticationV1(),
				Namespace:            "kube-system",
			}
		} else {
			clientBuilder = rootClientBuilder
		}
		controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())
		if err != nil {
			klog.Fatalf("error building controller context: %v", err)
		}
		saTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController
		
		if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil {
			klog.Fatalf("error starting controllers: %v", err)
		}

		controllerContext.InformerFactory.Start(controllerContext.Stop)
		close(controllerContext.InformersStarted)

		select {}
	}

	if !c.ComponentConfig.Generic.LeaderElection.LeaderElect {
		run(context.TODO())
		panic("unreachable")
	}

	id, err := os.Hostname()
	if err != nil {
		return err
	}

	// add a uniquifier so that two processes on the same host don't accidentally both become active
	id = id + "_" + string(uuid.NewUUID())
	rl, err := resourcelock.New(c.ComponentConfig.Generic.LeaderElection.ResourceLock,
		"kube-system",
		"kube-controller-manager",
		c.LeaderElectionClient.CoreV1(),
		c.LeaderElectionClient.CoordinationV1(),
		resourcelock.ResourceLockConfig{
			Identity:      id,
			EventRecorder: c.EventRecorder,
		})
	if err != nil {
		klog.Fatalf("error creating lock: %v", err)
	}
  
	// Leader election starts here
	leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{
		Lock:          rl,
		LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
		RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
		RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
		Callbacks: leaderelection.LeaderCallbacks{
      // Callback: after winning the election, the leader instance starts running the work function run defined above
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				klog.Fatalf("leaderelection lost")
			},
		},
		WatchDog: electionChecker,
		Name:     "kube-controller-manager",
	})
	panic("unreachable")
}

As we can see, an instance enters the actual work flow only after it is elected leader. Let's set the work flow aside for now and look at how the leaderelection process itself works.

Election

Election entry point

==> cmd/kube-controller-manager/app/controllermanager.go:252

leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{...})

func RunOrDie(ctx context.Context, lec LeaderElectionConfig) {
   le, err := NewLeaderElector(lec)
   if err != nil {
      panic(err)
   }
   // Register the leader-election health check (WatchDog) so leader status can be probed over HTTP
   if lec.WatchDog != nil {
      lec.WatchDog.SetLeaderElection(le)
   }
   // Start the election
   le.Run(ctx)
}

==> vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:196

le.Run(ctx)

// Run starts the leader election loop
func (le *LeaderElector) Run(ctx context.Context) {
   defer func() {
      runtime.HandleCrash()
      le.config.Callbacks.OnStoppedLeading()
   }()
   // 1. acquire campaigns for the lease; it returns false only when the context is cancelled, in which case Run returns
   if !le.acquire(ctx) {
      return // ctx signalled done
   }
   ctx, cancel := context.WithCancel(ctx)
   defer cancel()
   // 2. After winning the election, start a goroutine that runs the specially-noted run function above, i.e. the controllers' work loop
   go le.config.Callbacks.OnStartedLeading(ctx)
   // 3. Keep renewing the leader lease
   le.renew(ctx)
}

This function contains several defers and returns, so a quick note on the order in which defer and return execute (a small sketch follows the list):

1. Multiple defers are kept on a stack and run last-in-first-out: the defer registered later runs first.
2. Deferred functions run after the return value has been set, but before the function actually returns to its caller.
3. Once a return statement is reached, defers that appear later in the source were never registered, so they do not run.
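
A minimal, self-contained sketch (not Kubernetes code) that demonstrates these three rules:

package main

import "fmt"

func demo(fail bool) (result string) {
	defer fmt.Println("defer A (registered first, runs last)")
	defer fmt.Println("defer B (registered second, runs first)")

	if fail {
		// The return value is set first, then the two defers above run (B, then A).
		// The defer below this return is never registered, so it never runs.
		return "failed"
	}

	defer fmt.Println("defer C (only registered on the success path)")
	return "succeeded"
}

func main() {
	fmt.Println("result:", demo(true))
	fmt.Println("result:", demo(false))
}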

From this function we can roughly see the election logic:

1. The winner starts executing run(), i.e. the controllers' work function, and continues to expose the leader health-check API.

2. Candidates that fail to acquire the lease do not give up: they stay in the acquire() loop and retry at each RetryPeriod, while the WatchDog keeps serving the leader-election health check.

3. The winner keeps renewing its own leader record at regular intervals.
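
To make this concrete, here is a minimal, hypothetical sketch of how any component could use client-go's leaderelection package with a Lease lock. The component name my-component, the identity suffix, and the 15s/10s/2s durations are illustrative placeholders, not values taken from kube-controller-manager:

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog"
)

func main() {
	// Assumes the process runs inside a cluster; out of cluster you would
	// build the config from a kubeconfig file instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Same trick as kube-controller-manager: hostname plus a suffix as the identity.
	hostname, _ := os.Hostname()
	id := hostname + "_example"

	// A Lease object in kube-system acts as the leader lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "kube-system", Name: "my-component"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is valid without renewal
		RenewDeadline: 10 * time.Second, // the leader must renew within this window
		RetryPeriod:   2 * time.Second,  // how often candidates retry acquiring the lock
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				klog.Info("became leader, starting work")
				<-ctx.Done() // a real component would start its work loop here
			},
			OnStoppedLeading: func() {
				klog.Fatal("lost leadership, exiting")
			},
		},
	})
}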

The campaign function:

vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:212

// acquire loops calling tryAcquireOrRenew and returns true immediately when tryAcquireOrRenew succeeds.
// Returns false if ctx signals done.
// The candidate keeps looping, trying to become leader: it returns true as soon as it acquires the lease; on failure it waits for an interval and tries again

func (le *LeaderElector) acquire(ctx context.Context) bool {
   ctx, cancel := context.WithCancel(ctx)
   defer cancel()
   succeeded := false
   desc := le.config.Lock.Describe()
   klog.Infof("attempting to acquire leader lease  %v...", desc)
   // Enter the loop that tries to acquire leadership; JitterUntil is a helper that runs a function periodically
   wait.JitterUntil(func() {
      // Try to acquire, or renew, the leader lease
      succeeded = le.tryAcquireOrRenew()
      le.maybeReportTransition()
      if !succeeded {
         klog.V(4).Infof("failed to acquire lease %v", desc)
         return
      }
      le.config.Lock.RecordEvent("became leader")
      le.metrics.leaderOn(le.config.Name)
      klog.Infof("successfully acquired lease %v", desc)
     // After winning the election, cancel() breaks out of the periodic loop and the success result is returned
      cancel()
   }, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
   return succeeded
}

The periodic execution function

Let's look at the code of the periodic loop function JitterUntil:

vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:130

func JitterUntil(f func(), period time.Duration, jitterFactor float64, sliding bool, stopCh <-chan struct{}) {
	var t *time.Timer
	var sawTimeout bool

	for {
		select {
		case <-stopCh:
			return
		default:
		}

		jitteredPeriod := period
		if jitterFactor > 0.0 {
			jitteredPeriod = Jitter(period, jitterFactor)
		}
    // sliding controls whether f()'s own execution time counts as part of the period
    // If it does count (sliding == false), the timer is started before f() runs
		if !sliding {
			t = resetOrReuseTimer(t, jitteredPeriod, sawTimeout)
		}

		func() {
			defer runtime.HandleCrash()
			f()
		}()
    
    // If f()'s execution time does not count toward the period (sliding == true), the timer is started only after f() has finished
		if sliding {
			t = resetOrReuseTimer(t, jitteredPeriod, sawTimeout)
		}

		// select cases have no priority, so the stop case may lose the race against the timer; that is why an extra stop check was added at the top of the for loop, to avoid running f() one extra time
		select {
		case <-stopCh:
			return
    // the timer fired: record it and move on to the next iteration
		case <-t.C:
			sawTimeout = true
		}
	}
}

// resetOrReuseTimer avoids allocating a new timer if one is already in use.
// Not safe for multiple threads.
func resetOrReuseTimer(t *time.Timer, d time.Duration, sawTimeout bool) *time.Timer {
	if t == nil {
		return time.NewTimer(d)
	}
  // When reusing a timer that has already fired but whose tick was never consumed, drain t.C first so the stale tick cannot satisfy the next wait
	if !t.Stop() && !sawTimeout {
		<-t.C
	}
  // 定時器重置
	t.Reset(d)
	return t
}

Kubernetes implements this periodic task with the plain standard-library time.Timer. t.C is essentially a channel (<-chan time.Time); the consumer uses select to wait until the timer fires, consumes the tick, and moves on to the next iteration.

A few notes on how select works with channels (a small example follows the list):

Inside a select, execution proceeds as follows:
1. Every case is evaluated.
2. If exactly one case is ready (has data), its block is executed.
3. If several cases are ready at the same time, one of them is chosen at random; there is no priority.
4. If there is a default block and no case is ready, the default block is executed.
5. If the default block is empty and no case is ready, the select simply falls through and execution continues.
6. If no case is ready and there is no default, the select blocks and waits.
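
A tiny standalone Go sketch (not Kubernetes code) illustrating rules 3, 4 and 6:

package main

import (
	"fmt"
	"time"
)

func main() {
	a := make(chan int, 1)
	b := make(chan int, 1)
	a <- 1
	b <- 2

	// Both cases are ready: one is picked at random, no priority (rule 3).
	select {
	case v := <-a:
		fmt.Println("got from a:", v)
	case v := <-b:
		fmt.Println("got from b:", v)
	}

	// Nothing is ready on this empty channel, so default runs immediately (rule 4).
	idle := make(chan struct{})
	select {
	case <-idle:
		fmt.Println("never happens here")
	default:
		fmt.Println("no case ready, default executed")
	}

	// No default: the select blocks until some case becomes ready (rule 6).
	timer := time.NewTimer(10 * time.Millisecond)
	select {
	case <-timer.C:
		fmt.Println("timer fired, select unblocked")
	}
}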

On Go's time.Timer and how to reset it correctly, this article explains it well:

https://tonybai.com/2016/12/21/how-to-use-timer-reset-in-golang-correctly/
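
The core pitfall in one small sketch (again not Kubernetes code): before Reset, stop the timer, and if it has already fired without its tick being consumed, drain the channel first; this is exactly the pattern resetOrReuseTimer implements above:

package main

import (
	"fmt"
	"time"
)

func main() {
	t := time.NewTimer(50 * time.Millisecond)

	// Simulate missing the tick: the timer fires, but nobody reads t.C.
	time.Sleep(100 * time.Millisecond)

	// Stop returns false because the timer already fired; since the tick was
	// not consumed, drain t.C so the stale value cannot satisfy the next wait.
	if !t.Stop() {
		<-t.C
	}
	t.Reset(50 * time.Millisecond)

	start := time.Now()
	<-t.C // without the drain above, this could return immediately with the stale tick
	fmt.Println("waited", time.Since(start))
}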

The acquire/renew leader function

vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:293

// tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
// else it tries to renew the lease if it has already been acquired. Returns true
// on success else returns false.
// This function is called in two places: the initial election and the periodic renewal afterwards.
// If the candidate is not the leader it tries to acquire the lease; if it already is the leader it tries to renew it, and finally refreshes the record.
func (le *LeaderElector) tryAcquireOrRenew() bool {
   now := metav1.Now()
   leaderElectionRecord := rl.LeaderElectionRecord{
      HolderIdentity:       le.config.Lock.Identity(),
      LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
      RenewTime:            now,
      AcquireTime:          now,
   }

   // 1. obtain or create the ElectionRecord 
   // Step 1: get the current leader's election record; if no record exists yet, create one
   // First, fetch the current leader record
   oldLeaderElectionRecord, err := le.config.Lock.Get()
   if err != nil {
      if !errors.IsNotFound(err) {
         klog.Errorf("error retrieving resource lock %v: %v", le.config.Lock.Describe(), err)
         return false
      }
      if err = le.config.Lock.Create(leaderElectionRecord); err != nil {
         klog.Errorf("error initially creating leader election record: %v", err)
         return false
      }
      le.observedRecord = leaderElectionRecord
      le.observedTime = le.clock.Now()
      return true
   }
	 // Step 2: compare the leader in the observed record with the actual current leader
   // 2. Record obtained, check the Identity & Time
   if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {
     // If the leader in the candidate's last observed record differs from the current leader, update the observed record to match the current leader
      le.observedRecord = *oldLeaderElectionRecord
      le.observedTime = le.clock.Now()
   }
   if len(oldLeaderElectionRecord.HolderIdentity) > 0 &&
  		// If the candidate is not the current leader and the current leader's lease has not yet expired, return false: this candidate loses the election for now
      le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
      !le.IsLeader() {
      klog.V(4).Infof("lock is held by %v and has not yet expired", oldLeaderElectionRecord.HolderIdentity)
      return false
   }

   // 3. We're going to try to update. The leaderElectionRecord is set to it's default
   // here. Let's correct it before updating.
   if le.IsLeader() {
     // If the candidate is the current leader itself, keep the previous acquire time (not this round's time) and leave the transition count unchanged
      leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
      leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
   } else {
     // If the candidate is not the leader (meaning the current leader's lease expired without being renewed), the candidate takes over as the new leader and the transition count is incremented
      leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
   }

   // update the lock itself
   // Update the leader record in the lock; returning true means the acquire/renew completed successfully
   if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
      klog.Errorf("Failed to update lock: %v", err)
      return false
   }
   le.observedRecord = leaderElectionRecord
   le.observedTime = le.clock.Now()
   return true
}

This code involves several variables that all carry leader-record information and are easy to confuse, so here is a quick glossary:

LeaderElector  # a candidate; every controller-manager process takes part in the leader election
oldLeaderElectionRecord  # the current leader as recorded in the leader lock before this round
leaderElectionRecord  # this round's leader record, which will eventually be written into the leader lock
observedRecord  # each candidate periodically observes the current leader information and stores it in this field

First, let's see how Step 1 fetches the current leader record:

vendor/k8s.io/client-go/tools/leaderelection/resourcelock/leaselock.go:39

// Get returns the election record from a Lease spec
func (ll *LeaseLock) Get() (*LeaderElectionRecord, error) {
   var err error
  // 1. Fetch the Lease object
   ll.lease, err = ll.Client.Leases(ll.LeaseMeta.Namespace).Get(ll.LeaseMeta.Name, metav1.GetOptions{})
   if err != nil {
      return nil, err
   }
  // 2. Convert lease.Spec into a LeaderElectionRecord and return it
   return LeaseSpecToLeaderElectionRecord(&ll.lease.Spec), nil
}

The Get method for the Lease object is here:

vendor/k8s.io/client-go/kubernetes/typed/coordination/v1/lease.go:66

func (c *leases) Get(name string, options metav1.GetOptions) (result *v1.Lease, err error) {}

The LeaderElectionRecord struct produced by the conversion looks like this:

LeaderElectionRecord{
   HolderIdentity:       holderIdentity,   // identity of the current lease holder
   LeaseDurationSeconds: leaseDurationSeconds,  // how long the lease is valid
   AcquireTime:          metav1.Time{spec.AcquireTime.Time},  // when the holder became leader
   RenewTime:            metav1.Time{spec.RenewTime.Time},  // when the lease was last renewed
   LeaderTransitions:    leaseTransitions,  // how many times leadership has changed hands
}
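
The conversion itself is just a nil-safe, field-by-field copy from the Lease spec; a simplified sketch (not the exact vendored code) of what LeaseSpecToLeaderElectionRecord does:

package example

import (
	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	rl "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// leaseSpecToRecord copies the optional Lease spec fields into a
// LeaderElectionRecord, guarding against nil pointers.
func leaseSpecToRecord(spec *coordinationv1.LeaseSpec) *rl.LeaderElectionRecord {
	r := &rl.LeaderElectionRecord{}
	if spec.HolderIdentity != nil {
		r.HolderIdentity = *spec.HolderIdentity
	}
	if spec.LeaseDurationSeconds != nil {
		r.LeaseDurationSeconds = int(*spec.LeaseDurationSeconds)
	}
	if spec.LeaseTransitions != nil {
		r.LeaderTransitions = int(*spec.LeaseTransitions)
	}
	if spec.AcquireTime != nil {
		r.AcquireTime = metav1.Time{Time: spec.AcquireTime.Time}
	}
	if spec.RenewTime != nil {
		r.RenewTime = metav1.Time{Time: spec.RenewTime.Time}
	}
	return r
}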

The returned LeaderElectionRecord is then compared against the candidate's own state: if the candidate itself holds the lease, it renews it; if not, it checks whether the current leader's lease has expired and updates the leader lock accordingly.

The lease renewal function

vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:234

func (le *LeaderElector) renew(ctx context.Context) {
   ctx, cancel := context.WithCancel(ctx)
   defer cancel()
   wait.Until(func() {
      timeoutCtx, timeoutCancel := context.WithTimeout(ctx, le.config.RenewDeadline)
      defer timeoutCancel()
      // Periodically refresh the leader status: renew the lease on success, give up leadership on failure
      err := wait.PollImmediateUntil(le.config.RetryPeriod, func() (bool, error) {
         done := make(chan bool, 1)
         go func() {
            defer close(done)
            done <- le.tryAcquireOrRenew()
         }()

         select {
         case <-timeoutCtx.Done():
            return false, fmt.Errorf("failed to tryAcquireOrRenew %s", timeoutCtx.Err())
         case result := <-done:
            return result, nil
         }
      }, timeoutCtx.Done())

      le.maybeReportTransition()
      desc := le.config.Lock.Describe()
      if err == nil {
         klog.V(5).Infof("successfully renewed lease %v", desc)
         return
      }
      le.config.Lock.RecordEvent("stopped leading")
      le.metrics.leaderOff(le.config.Name)
      klog.Infof("failed to renew lease %v: %v", desc, err)
      cancel()
   }, le.config.RetryPeriod, ctx.Done())

   // if we hold the lease, give it up
   if le.config.ReleaseOnCancel {
      le.release()
   }
}

tryAcquireOrRenew() and the periodic loop helpers used here work essentially the same way as described above, so they are not repeated.

Summary

Component leader election can be summarized roughly as the following flow:

  • Initially every instance is a LeaderElector. The first one to acquire the lock becomes the leader, i.e. the working instance, and it maintains the shared state (the leader lock) that the other LeaderElectors observe, including its status information and a health-check endpoint.

  • The remaining LeaderElectors act as hot standbys: they keep watching the leader's status and compete for the lock again once the leader's lease expires or the leader fails.

  • While running, the leader keeps renewing its own leader record at regular intervals.

Not only the controller: the other core components' active-standby behavior works in the same way.

Thanks for reading; corrections are welcome.
