P1 - Multi-instance leader election
Preface
In a multi-master Kubernetes deployment, the core components all run in a one-leader, multiple-standby mode. The earlier scheduler articles did not analyze the leader election and working flow, so this article takes the controller as an example and dedicates a post to how component instances coordinate in leader/follower mode.
Entry point
Like the scheduler, the controller's cmd startup is built on cobra. If you are unfamiliar with cobra, see the earlier articles; it is not repeated here. Follow the entry point straight to the startup function:
==> cmd/kube-controller-manager/controller-manager.go:38
command := app.NewControllerManagerCommand()
==> cmd/kube-controller-manager/app/controllermanager.go:109
Run(c.Complete(), wait.NeverStop)
==> cmd/kube-controller-manager/app/controllermanager.go:153
func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {}
The entry function is here; the code block below is annotated section by section:
func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
...
// Some code omitted for brevity
// Start the controller manager HTTP server
// unsecuredMux is the handler for these controller *after* authn/authz filters have been applied
var unsecuredMux *mux.PathRecorderMux
if c.SecureServing != nil {
unsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging, checks...)
handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, &c.Authorization, &c.Authentication)
// TODO: handle stoppedCh returned by c.SecureServing.Serve
if _, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
return err
}
}
if c.InsecureServing != nil {
unsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging, checks...)
insecureSuperuserAuthn := server.AuthenticationInfo{Authenticator: &server.InsecureSuperuser{}}
handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, nil, &insecureSuperuserAuthn)
if err := c.InsecureServing.Serve(handler, 0, stopCh); err != nil {
return err
}
}
// The controller's work function run; note in particular that it is executed as a callback after leader election succeeds
run := func(ctx context.Context) {
rootClientBuilder := controller.SimpleControllerClientBuilder{
ClientConfig: c.Kubeconfig,
}
var clientBuilder controller.ControllerClientBuilder
if c.ComponentConfig.KubeCloudShared.UseServiceAccountCredentials {
if len(c.ComponentConfig.SAController.ServiceAccountKeyFile) == 0 {
// It's possible another controller process is creating the tokens for us.
// If one isn't, we'll timeout and exit when our client builder is unable to create the tokens.
klog.Warningf("--use-service-account-credentials was specified without providing a --service-account-private-key-file")
}
clientBuilder = controller.SAControllerClientBuilder{
ClientConfig: restclient.AnonymousClientConfig(c.Kubeconfig),
CoreClient: c.Client.CoreV1(),
AuthenticationClient: c.Client.AuthenticationV1(),
Namespace: "kube-system",
}
} else {
clientBuilder = rootClientBuilder
}
controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())
if err != nil {
klog.Fatalf("error building controller context: %v", err)
}
saTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController
if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil {
klog.Fatalf("error starting controllers: %v", err)
}
controllerContext.InformerFactory.Start(controllerContext.Stop)
close(controllerContext.InformersStarted)
select {}
}
if !c.ComponentConfig.Generic.LeaderElection.LeaderElect {
run(context.TODO())
panic("unreachable")
}
id, err := os.Hostname()
if err != nil {
return err
}
// add a uniquifier so that two processes on the same host don't accidentally both become active
id = id + "_" + string(uuid.NewUUID())
rl, err := resourcelock.New(c.ComponentConfig.Generic.LeaderElection.ResourceLock,
"kube-system",
"kube-controller-manager",
c.LeaderElectionClient.CoreV1(),
c.LeaderElectionClient.CoordinationV1(),
resourcelock.ResourceLockConfig{
Identity: id,
EventRecorder: c.EventRecorder,
})
if err != nil {
klog.Fatalf("error creating lock: %v", err)
}
// Leader election starts here
leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{
Lock: rl,
LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
RetryPeriod: c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
Callbacks: leaderelection.LeaderCallbacks{
// Callback: once the election is won, the leading instance runs the work function run defined above
OnStartedLeading: run,
OnStoppedLeading: func() {
klog.Fatalf("leaderelection lost")
},
},
WatchDog: electionChecker,
Name: "kube-controller-manager",
})
panic("unreachable")
}
From this we can see that an instance only enters the work flow after being elected leader. Let's skip the concrete work flow for now and look at the leaderelection process.
Election
Election entry point
==> cmd/kube-controller-manager/app/controllermanager.go:252
leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{...})
func RunOrDie(ctx context.Context, lec LeaderElectionConfig) {
le, err := NewLeaderElector(lec)
if err != nil {
panic(err)
}
// Wire up the HTTP endpoint that checks the leader's health
if lec.WatchDog != nil {
lec.WatchDog.SetLeaderElection(le)
}
// Enter the election
le.Run(ctx)
}
==> vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:196
le.Run(ctx)
// Run starts the leader election loop
func (le *LeaderElector) Run(ctx context.Context) {
defer func() {
runtime.HandleCrash()
le.config.Callbacks.OnStoppedLeading()
}()
// 1. acquire runs the campaign; if it returns false (ctx signalled done), exit immediately
if !le.acquire(ctx) {
return // ctx signalled done
}
ctx, cancel := context.WithCancel(ctx)
defer cancel()
// 2. On winning, start a goroutine that runs the run function flagged earlier, i.e. the controller's work loop
go le.config.Callbacks.OnStartedLeading(ctx)
// 3. renew keeps refreshing the leader status
le.renew(ctx)
}
This function contains several defers and returns; a quick note on how they interact:
1. Multiple defers are stored on a stack (LIFO): a defer registered later runs earlier
2. A return statement first sets the return value, then the registered defers run, and only then does control go back to the caller
3. Once a return fires, defer statements below it were never registered, so they will not run
From this function we can roughly see the election logic:
1. The winner starts running run(), the controller's work function, and also serves the leader health-check API
2. The losers stay in the acquire loop, retrying at intervals; meanwhile the watchdog keeps monitoring the leader's health
3. The winner keeps refreshing its own leader status afterwards
The acquire (campaign) function:
vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:212
// acquire loops calling tryAcquireOrRenew and returns true immediately when tryAcquireOrRenew succeeds.
// Returns false if ctx signals done.
// Candidates loop trying to acquire leadership: return true on success; on failure, keep retrying at intervals
func (le *LeaderElector) acquire(ctx context.Context) bool {
ctx, cancel := context.WithCancel(ctx)
defer cancel()
succeeded := false
desc := le.config.Lock.Describe()
klog.Infof("attempting to acquire leader lease %v...", desc)
// Loop applying for leadership; JitterUntil is a timed loop helper
wait.JitterUntil(func() {
// Try to acquire or renew the leader lock
succeeded = le.tryAcquireOrRenew()
le.maybeReportTransition()
if !succeeded {
klog.V(4).Infof("failed to acquire lease %v", desc)
return
}
le.config.Lock.RecordEvent("became leader")
le.metrics.leaderOn(le.config.Name)
klog.Infof("successfully acquired lease %v", desc)
// On success, call cancel() to break out of the timed loop and return the result
cancel()
}, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
return succeeded
}
The timed loop function
Here is the code of the timed loop function JitterUntil:
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:130
func JitterUntil(f func(), period time.Duration, jitterFactor float64, sliding bool, stopCh <-chan struct{}) {
var t *time.Timer
var sawTimeout bool
for {
select {
case <-stopCh:
return
default:
}
jitteredPeriod := period
if jitterFactor > 0.0 {
jitteredPeriod = Jitter(period, jitterFactor)
}
// sliding controls whether f()'s execution time counts toward the interval
// If the interval includes f()'s run time, start the timer before f() runs
if !sliding {
t = resetOrReuseTimer(t, jitteredPeriod, sawTimeout)
}
func() {
defer runtime.HandleCrash()
f()
}()
// If the interval excludes f()'s run time (sliding), start the timer only after f() finishes
if sliding {
t = resetOrReuseTimer(t, jitteredPeriod, sawTimeout)
}
// select cases have no priority here, so the stop case may lose the race to the timer; that is why there is another stop check at the top of the loop, preventing an extra execution of f().
select {
case <-stopCh:
return
// The timer fired
case <-t.C:
sawTimeout = true
}
}
}
// resetOrReuseTimer avoids allocating a new timer if one is already in use.
// Not safe for multiple threads.
func resetOrReuseTimer(t *time.Timer, d time.Duration, sawTimeout bool) *time.Timer {
if t == nil {
return time.NewTimer(d)
}
// If the timer already fired but the tick was never consumed, drain t.C so Reset doesn't leave a stale tick behind for the consumer
if !t.Stop() && !sawTimeout {
<-t.C
}
// Reset the timer
t.Reset(d)
return t
}
Kubernetes builds its timed loop directly on the standard library's time.Timer. t.C is just a channel (of type <-chan time.Time); the consumer uses select to detect that the interval has elapsed, consumes the tick, and enters the next iteration.
A few notes on how select interacts with channels:
Within a select, evaluation proceeds as follows:
1. Every case is checked
2. If one case has data ready, its body runs
3. If multiple cases are ready, one is chosen at random; there is no priority
4. If a default clause exists and no case is ready, the default body runs
5. If the default body is empty and no case is ready, the select simply falls through and execution continues
6. If no case is ready and there is no default, the select blocks and waits
There is a good article on Go's time.Timer here:
https://tonybai.com/2016/12/21/how-to-use-timer-reset-in-golang-correctly/
The acquire/renew leader function
vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:293
// tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
// else it tries to renew the lease if it has already been acquired. Returns true
// on success else returns false.
// This function is called both for the initial election and for the periodic status refresh
// If the candidate is not the leader it tries to win the election; if it already is, it tries to renew the lease; either way it refreshes the record
func (le *LeaderElector) tryAcquireOrRenew() bool {
now := metav1.Now()
leaderElectionRecord := rl.LeaderElectionRecord{
HolderIdentity: le.config.Lock.Identity(),
LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
RenewTime: now,
AcquireTime: now,
}
// 1. obtain or create the ElectionRecord
// Step 1: fetch the current leader's election record, creating one if no leader record exists yet
oldLeaderElectionRecord, err := le.config.Lock.Get()
if err != nil {
if !errors.IsNotFound(err) {
klog.Errorf("error retrieving resource lock %v: %v", le.config.Lock.Describe(), err)
return false
}
if err = le.config.Lock.Create(leaderElectionRecord); err != nil {
klog.Errorf("error initially creating leader election record: %v", err)
return false
}
le.observedRecord = leaderElectionRecord
le.observedTime = le.clock.Now()
return true
}
// Step 2: compare the observed record with the actual current leader
// 2. Record obtained, check the Identity & Time
if !reflect.DeepEqual(le.observedRecord, *oldLeaderElectionRecord) {
// If the leader in this candidate's last observation differs from the current leader, update the observation to match
le.observedRecord = *oldLeaderElectionRecord
le.observedTime = le.clock.Now()
}
if len(oldLeaderElectionRecord.HolderIdentity) > 0 &&
// If the candidate is not the current leader and the leader's lease has not yet expired, return false: the attempt fails
le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
!le.IsLeader() {
klog.V(4).Infof("lock is held by %v and has not yet expired", oldLeaderElectionRecord.HolderIdentity)
return false
}
// 3. We're going to try to update. The leaderElectionRecord is set to its default
// here. Let's correct it before updating.
if le.IsLeader() {
// If the candidate is already the leader, keep the previous acquire time and leave the transition count unchanged, instead of using this attempt's values
leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
} else {
// Otherwise the old leader's lease expired without being renewed, so this candidate becomes the new leader and the transition count is incremented
leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
}
// update the lock itself
// Update the leader lock with the new record; returning true means the election round completed successfully
if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
klog.Errorf("Failed to update lock: %v", err)
return false
}
le.observedRecord = leaderElectionRecord
le.observedTime = le.clock.Now()
return true
}
This code involves several variables that all hold leader-record information and are easy to confuse; here they are pulled out for clarity:
LeaderElector # a candidate; every controller process takes part in the election
oldLeaderElectionRecord # the current leader recorded in the leader lock before this attempt
leaderElectionRecord # this attempt's leader record, which will eventually be written into the new leader lock
observedRecord # each candidate periodically observes the current leader info and caches it in this field
First, how step 1 fetches the current leader record:
vendor/k8s.io/client-go/tools/leaderelection/resourcelock/leaselock.go:39
// Get returns the election record from a Lease spec
func (ll *LeaseLock) Get() (*LeaderElectionRecord, error) {
var err error
// 1. Fetch the Lease object
ll.lease, err = ll.Client.Leases(ll.LeaseMeta.Namespace).Get(ll.LeaseMeta.Name, metav1.GetOptions{})
if err != nil {
return nil, err
}
// 2. Convert lease.spec into a LeaderElectionRecord and return it
return LeaseSpecToLeaderElectionRecord(&ll.lease.Spec), nil
}
The method that fetches the Lease object lives here:
vendor/k8s.io/client-go/kubernetes/typed/coordination/v1/lease.go:66
func (c *leases) Get(name string, options metav1.GetOptions) (result *v1.Lease, err error) {}
The LeaderElectionRecord struct returned by the conversion looks like this:
LeaderElectionRecord{
HolderIdentity: holderIdentity, // identity of the current lease holder
LeaseDurationSeconds: leaseDurationSeconds, // lease duration in seconds
AcquireTime: metav1.Time{spec.AcquireTime.Time}, // when the holder acquired leadership
RenewTime: metav1.Time{spec.RenewTime.Time}, // last renewal time
LeaderTransitions: leaseTransitions, // number of leadership transitions
}
The returned LeaderElectionRecord is then compared against the candidate: if the holder is the candidate itself, the lease is renewed; if not, the code checks whether the leader's lease has expired and updates the leader lock accordingly.
The renew function
vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:234
func (le *LeaderElector) renew(ctx context.Context) {
ctx, cancel := context.WithCancel(ctx)
defer cancel()
wait.Until(func() {
timeoutCtx, timeoutCancel := context.WithTimeout(ctx, le.config.RenewDeadline)
defer timeoutCancel()
// Refresh the leader status at intervals: renew the lease on success, release leadership on failure
err := wait.PollImmediateUntil(le.config.RetryPeriod, func() (bool, error) {
done := make(chan bool, 1)
go func() {
defer close(done)
done <- le.tryAcquireOrRenew()
}()
select {
case <-timeoutCtx.Done():
return false, fmt.Errorf("failed to tryAcquireOrRenew %s", timeoutCtx.Err())
case result := <-done:
return result, nil
}
}, timeoutCtx.Done())
le.maybeReportTransition()
desc := le.config.Lock.Describe()
if err == nil {
klog.V(5).Infof("successfully renewed lease %v", desc)
return
}
le.config.Lock.RecordEvent("stopped leading")
le.metrics.leaderOff(le.config.Name)
klog.Infof("failed to renew lease %v: %v", desc, err)
cancel()
}, le.config.RetryPeriod, ctx.Done())
// if we hold the lease, give it up
if le.config.ReleaseOnCancel {
le.release()
}
}
tryAcquireOrRenew() and the interval loop here work essentially as described above, so they are not repeated.
Summary
Component leader election can be summarized roughly as follows:
- Initially, every instance is a LeaderElector. The first one to win the election becomes the leader, i.e. the working instance. It maintains a piece of state (the leader lock) for all LeaderElectors to probe, covering status information and a health-monitoring endpoint.
- The remaining LeaderElectors enter hot standby, monitoring the leader's status and rejoining the election if the leader fails.
- While running, the leader keeps refreshing its own leader status at intervals.
The same leader/follower working model should apply not only to the controller but to the other components as well.
Thanks for reading; corrections are welcome.