P3 - Controller Categories and the Deployment Controller
Preface
In the first article of the Controller series, we started from the cobra command entry point and worked our way into the multi-instance leader election code, analyzing the leader election flow in detail.
Then, in the second article, text and diagrams briefly described how a controller works together with the informers from the client-go module, laying the groundwork for this article and the ones that follow:
Controller-P2 - Controllers and Informers
This article picks up where the first one left off and continues reading the code.
Controller Categories
Startup
Picking up from part 1, beneath the cobra entry point, the controllers are started here:
cmd/kube-controller-manager/app/controllermanager.go:191
run := func(ctx context.Context) {}
==> cmd/kube-controller-manager/app/controllermanager.go:217
The key piece here is the NewControllerInitializers function.
if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil {
	klog.Fatalf("error starting controllers: %v", err)
}
==> cmd/kube-controller-manager/app/controllermanager.go:343
As you can see, a dedicated controller is initialized for each kind of resource, including the familiar deployment, statefulset, endpoint, pvc and so on; there are more than 30 kinds of controllers in total. This chapter therefore will not walk through every one of them; instead a few common, representative controllers get an in-depth look, and this article starts with the deployment controller.
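For a sense of what this registration looks like, here is an abridged sketch of NewControllerInitializers; the entries shown are real registrations from this file, but the full map holds 30+ of them:
func NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc {
	controllers := map[string]InitFunc{}
	controllers["endpoint"] = startEndpointController
	controllers["daemonset"] = startDaemonSetController
	controllers["statefulset"] = startStatefulSetController
	controllers["replicaset"] = startReplicaSetController
	controllers["deployment"] = startDeploymentController
	// ... roughly 30 more controllers registered the same way
	return controllers
}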
Deployment Controller
Initialization
cmd/kube-controller-manager/app/controllermanager.go:354
controllers["deployment"] = startDeploymentController
==> cmd/kube-controller-manager/app/apps.go:82
func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
		return nil, false, nil
	}
	dc, err := deployment.NewDeploymentController(
		// The deployment controller watches these 3 resources: Deployment/ReplicaSet/Pod;
		// a deployment manages Pods through ReplicaSets.
		// These 3 calls return the informers for the corresponding resources.
		ctx.InformerFactory.Apps().V1().Deployments(),
		ctx.InformerFactory.Apps().V1().ReplicaSets(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.ClientBuilder.ClientOrDie("deployment-controller"),
	)
	if err != nil {
		return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
	}
	// Run the deployment controller
	go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
	return nil, true, nil
}
The first parameter of dc.Run() is the number of workers, which defaults to 5 and is set in pkg/controller/apis/config/v1alpha1/defaults.go:48. The second parameter is the stop channel, a channel of empty structs over which the goroutines receive the signal to stop.
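The defaulting itself is just a zero-value check; paraphrased from that file (a sketch, not a verbatim copy):
if obj.ConcurrentDeploymentSyncs == 0 {
	obj.ConcurrentDeploymentSyncs = 5
}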
==> pkg/controller/deployment/deployment_controller.go:148
// Run begins watching and syncing.
func (dc *DeploymentController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer dc.queue.ShutDown()

	klog.Infof("Starting deployment controller")
	defer klog.Infof("Shutting down deployment controller")

	// Wait until every informer's local cache has finished syncing
	if !controller.WaitForCacheSync("deployment", stopCh, dc.dListerSynced, dc.rsListerSynced, dc.podListerSynced) {
		return
	}
	// Start multiple workers
	for i := 0; i < workers; i++ {
		go wait.Until(dc.worker, time.Second, stopCh)
	}

	<-stopCh
}
The controller.WaitForCacheSync function checks whether each informer's local cache has finished syncing and returns a bool. As described in part 2, the informer keeps a local cache (local storage) to speed up reads and take load off the apiserver, which is why this sync check happens before any work starts.
By default there are 5 workers. Each worker uses wait.Until() to invoke dc.worker in a loop with a 1-second interval, which is where the deployment controller's real work happens. The wait.Until() looping timer is interesting in its own right, so let's unpack it.
The wait.Until looping timer
pkg/controller/deployment/deployment_controller.go:160
==> vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
==> vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:130
func JitterUntil(f func(), period time.Duration, jitterFactor float64, sliding bool, stopCh <-chan struct{}) {
	var t *time.Timer
	var sawTimeout bool

	for {
		select {
		case <-stopCh:
			return
		default:
		}

		jitteredPeriod := period
		if jitterFactor > 0.0 {
			jitteredPeriod = Jitter(period, jitterFactor)
		}

		// sliding controls whether f()'s own run time counts toward the
		// interval: if false, the timer starts before f runs; if true, the
		// timer is only started after f returns.
		if !sliding {
			t = resetOrReuseTimer(t, jitteredPeriod, sawTimeout)
		}

		func() {
			defer runtime.HandleCrash()
			f()
		}()

		if sliding {
			t = resetOrReuseTimer(t, jitteredPeriod, sawTimeout)
		}

		// The branches of a select are chosen with equal priority, which is
		// why the stop signal is checked both at the top of the loop and here.
		select {
		case <-stopCh:
			return
		// Timer.C is a channel inside the Timer struct; the timer sends on it
		// once the deadline is reached. Receiving from t.C here means the
		// timer fired and its channel has been drained, and sawTimeout
		// records that so resetOrReuseTimer does not try to drain it again.
		case <-t.C:
			sawTimeout = true
		}
	}
}
The resetOrReuseTimer function:
// resetOrReuseTimer reuses an existing timer when possible: if the timer was
// stopped but its channel was never drained, drain it first so Reset behaves
// correctly.
func resetOrReuseTimer(t *time.Timer, d time.Duration, sawTimeout bool) *time.Timer {
	if t == nil {
		return time.NewTimer(d)
	}
	if !t.Stop() && !sawTimeout {
		<-t.C
	}
	t.Reset(d)
	return t
}
To sum up, this function is a thin wrapper around the timer package that reuses a single Timer in order to run dc.worker() once per second.
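wait.Until is usable outside the controller as well; here is a minimal runnable sketch of the same pattern the workers use (the printed message and the 3-second lifetime are made up for the demo):
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	stopCh := make(chan struct{})

	// Run the closure once per second until stopCh is closed,
	// just like the deployment controller runs dc.worker.
	go wait.Until(func() {
		fmt.Println("working...", time.Now().Format("15:04:05"))
	}, time.Second, stopCh)

	time.Sleep(3 * time.Second)
	close(stopCh) // deliver the stop signal to the loop
	time.Sleep(100 * time.Millisecond)
}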
The dc.worker function
pkg/controller/deployment/deployment_controller.go:460
==> pkg/controller/deployment/deployment_controller.go:464
func (dc *DeploymentController) processNextWorkItem() bool {
	// Pop an object off the head of the queue
	key, quit := dc.queue.Get()
	if quit {
		return false
	}
	defer dc.queue.Done(key)

	// Process the object
	err := dc.syncHandler(key.(string))
	dc.handleErr(err, key)

	return true
}
The deployment controller's worker function simply calls processNextWorkItem over and over. processNextWorkItem pops a pending object off the work queue (steps 7-8 of the informer diagram in part 2); if it gets one, it runs the follow-up create/update/delete handling, and if the queue has been shut down it returns false so the worker exits.
The implementation of the dc.queue.Get() interface method lives here:
vendor/k8s.io/client-go/util/workqueue/queue.go:140
func (q *Type) Get() (item interface{}, shutdown bool) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	for len(q.queue) == 0 && !q.shuttingDown {
		q.cond.Wait()
	}
	if len(q.queue) == 0 {
		// We must be shutting down.
		return nil, true
	}

	// Pop the head of the queue
	item, q.queue = q.queue[0], q.queue[1:]

	q.metrics.get(item)

	// Insert the item into the processing set
	q.processing.insert(item)
	// Delete it from the dirty set (the dirty set holds items waiting to be processed)
	q.dirty.delete(item)

	return item, false
}
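To see the Get/Done contract in isolation, here is a minimal sketch of the same work-queue pattern outside the controller, against the non-generic workqueue API vendored here (the namespace/name key is made up for the demo):
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	q.Add("default/nginx") // informer event handlers enqueue namespace/name keys like this
	q.ShutDown()           // no more adds; Get reports shutdown once the queue drains

	for {
		key, shutdown := q.Get() // blocks until an item arrives or the queue shuts down
		if shutdown {
			return
		}
		fmt.Println("syncing", key)
		q.Done(key) // remove the item from the processing set
	}
}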
The dc.syncHandler() method referenced there is assigned here:
pkg/controller/deployment/deployment_controller.go:135
dc.syncHandler = dc.syncDeployment
==> pkg/controller/deployment/deployment_controller.go:560
All of the create, update (rolling update), delete and read handling happens inside this function.
func (dc *DeploymentController) syncDeployment(key string) error {
	startTime := time.Now()
	klog.V(4).Infof("Started syncing deployment %q (%v)", key, startTime)
	defer func() {
		klog.V(4).Infof("Finished syncing deployment %q (%v)", key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	deployment, err := dc.dLister.Deployments(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.V(2).Infof("Deployment %v has been deleted", key)
		return nil
	}
	if err != nil {
		return err
	}

	// Deep-copy otherwise we are mutating our cache.
	// TODO: Deep-copy only when needed.
	d := deployment.DeepCopy()

	everything := metav1.LabelSelector{}
	if reflect.DeepEqual(d.Spec.Selector, &everything) {
		// A deployment must carry a non-empty selector
		dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
		if d.Status.ObservedGeneration < d.Generation {
			d.Status.ObservedGeneration = d.Generation
			dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(d)
		}
		return nil
	}

	// Fetch the replicaSets controlled by this deployment
	rsList, err := dc.getReplicaSetsForDeployment(d)
	if err != nil {
		return err
	}
	// Fetch all pods as a map, grouped by replicaSet (the rs is the key).
	// Used to check whether pods of the old version (not yet updated) still
	// exist while the deployment is being recreated.
	podMap, err := dc.getPodMapForDeployment(d, rsList)
	if err != nil {
		return err
	}

	if d.DeletionTimestamp != nil {
		return dc.syncStatusOnly(d, rsList)
	}

	// Check whether the deployment is paused; a paused deployment is handled by the sync method
	if err = dc.checkPausedConditions(d); err != nil {
		return err
	}
	if d.Spec.Paused {
		return dc.sync(d, rsList)
	}

	// Check whether this deployment event is a rollback.
	// Once the underlying rs has been updated to a new revision, the rollback
	// cannot run automatically; only when this deployment shows up in the
	// queue again without the rollback marker can the rs safely be updated.
	// So the check happens here: if the deployment carries a rollback marker,
	// roll the rs back first.
	if getRollbackTo(d) != nil {
		return dc.rollback(d, rsList)
	}

	// Check whether this deployment event is a scaling event; if so, sync the deployment
	scalingEvent, err := dc.isScalingEvent(d, rsList)
	if err != nil {
		return err
	}
	if scalingEvent {
		return dc.sync(d, rsList)
	}

	// Update the deployment according to the strategy type in Deployment.Spec.Strategy:
	// 1. Recreate: kill all pods at once, then recreate them
	// 2. RollingUpdate: update the pods in a rolling fashion
	switch d.Spec.Strategy.Type {
	case apps.RecreateDeploymentStrategyType:
		return dc.rolloutRecreate(d, rsList, podMap)
	case apps.RollingUpdateDeploymentStrategyType:
		return dc.rolloutRolling(d, rsList)
	}
	return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
}
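To recap the dispatch order inside syncDeployment: a deleted deployment only gets its status synced; a paused deployment is handled by dc.sync; a rollback marker routes to dc.rollback; a pure scaling event goes through dc.sync again; and only after all of those checks does an actual rollout (Recreate or RollingUpdate) run.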
Pausing and scaling up/down (and deletion)
The dc.sync method appeared twice above, once for the paused state and once for scaling events, so it is worth a closer look.
pkg/controller/deployment/sync.go:48
func (dc *DeploymentController) sync(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
	// Expanding this call shows that newRS is the rs whose pod-template hash
	// matches the current deployment d's template hash, while oldRSs are the
	// rs of all historical revisions.
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
	if err != nil {
		return err
	}
	// Compare the latest rs against the old ones; if scaling is needed, run the scale method
	if err := dc.scale(d, newRS, oldRSs); err != nil {
		// If we get an error while trying to scale, the deployment will be requeued
		// so we can abort this resync
		return err
	}

	// For a paused deployment that is not mid-rollback, clean up: delete the
	// historical revisions that exceed the configured revision history limit
	if d.Spec.Paused && getRollbackTo(d) == nil {
		if err := dc.cleanupDeployment(oldRSs, d); err != nil {
			return err
		}
	}

	allRSs := append(oldRSs, newRS)
	// Sync the deployment status
	return dc.syncDeploymentStatus(allRSs, newRS, d)
}
Now let's look at the dc.scale() method:
pkg/controller/deployment/sync.go:289
func (dc *DeploymentController) scale(deployment *apps.Deployment, newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet) error {
	// FindActiveOrLatest returns the single active rs if there is exactly
	// one; otherwise it returns the rs with the newest revision.
	if activeOrLatest := deploymentutil.FindActiveOrLatest(newRS, oldRSs); activeOrLatest != nil {
		if *(activeOrLatest.Spec.Replicas) == *(deployment.Spec.Replicas) {
			// The rs already matches the deployment's desired replica count; nothing to do
			return nil
		}
		_, _, err := dc.scaleReplicaSetAndRecordEvent(activeOrLatest, *(deployment.Spec.Replicas), deployment)
		return err
	}

	// If the newest rs has already converged to the deployment's desired
	// state, the old rs must be fully scaled down and deleted.
	if deploymentutil.IsSaturated(deployment, newRS) {
		for _, old := range controller.FilterActiveReplicaSets(oldRSs) {
			if _, _, err := dc.scaleReplicaSetAndRecordEvent(old, 0, deployment); err != nil {
				return err
			}
		}
		return nil
	}

	// During a rolling update, the total number of pods managed by the old
	// and new rs must not exceed the desired count by more than MaxSurge, so
	// the pod counts of the old and new rs necessarily trade off against each
	// other as the rollout progresses.
	if deploymentutil.IsRollingUpdate(deployment) {
		allRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS))
		allRSsReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)

		allowedSize := int32(0)
		if *(deployment.Spec.Replicas) > 0 {
			allowedSize = *(deployment.Spec.Replicas) + deploymentutil.MaxSurge(*deployment)
		}

		// Number of pods that may be added or removed: a positive result
		// means more pods can be created, a negative one means pods must be deleted
		deploymentReplicasToAdd := allowedSize - allRSsReplicas

		var scalingOperation string
		switch {
		case deploymentReplicasToAdd > 0:
			// Scaling up: sort all rs from newest to oldest
			sort.Sort(controller.ReplicaSetsBySizeNewer(allRSs))
			scalingOperation = "up"

		case deploymentReplicasToAdd < 0:
			// Scaling down: sort all rs from oldest to newest
			sort.Sort(controller.ReplicaSetsBySizeOlder(allRSs))
			scalingOperation = "down"
		}

		// Walk every rs and record in a map the pod count it should reach
		// (current count + its share of the scaling delta)
		deploymentReplicasAdded := int32(0)
		nameToSize := make(map[string]int32)
		for i := range allRSs {
			rs := allRSs[i]

			if deploymentReplicasToAdd != 0 {
				// Compute this rs's share of the scaling delta
				proportion := deploymentutil.GetProportion(rs, *deployment, deploymentReplicasToAdd, deploymentReplicasAdded)
				// Target count = current count + share
				nameToSize[rs.Name] = *(rs.Spec.Replicas) + proportion
				deploymentReplicasAdded += proportion
			} else {
				nameToSize[rs.Name] = *(rs.Spec.Replicas)
			}
		}

		// Update all replica sets
		for i := range allRSs {
			rs := allRSs[i]

			// Add/remove any leftovers to the largest replica set.
			// Any delta the rs above did not absorb is handed to the first rs
			// after sorting (the newest or oldest one).
			if i == 0 && deploymentReplicasToAdd != 0 {
				leftover := deploymentReplicasToAdd - deploymentReplicasAdded
				nameToSize[rs.Name] = nameToSize[rs.Name] + leftover
				if nameToSize[rs.Name] < 0 {
					nameToSize[rs.Name] = 0
				}
			}

			// Scale this rs to its target count
			if _, _, err := dc.scaleReplicaSet(rs, nameToSize[rs.Name], deployment, scalingOperation); err != nil {
				// Return as soon as we fail, the deployment is requeued
				return err
			}
		}
	}
	return nil
}
The syncDeploymentStatus function
Once the scaling and pause-state logic is done, the deployment's status must also be brought in line with the latest rs; this function performs that synchronization.
func (dc *DeploymentController) syncDeploymentStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error {
	newStatus := calculateStatus(allRSs, newRS, d)

	if reflect.DeepEqual(d.Status, newStatus) {
		return nil
	}

	newDeployment := d
	newDeployment.Status = newStatus
	_, err := dc.client.AppsV1().Deployments(newDeployment.Namespace).UpdateStatus(newDeployment)
	return err
}
This function updates the contents of the deployment's status field: revision, replica count, available replicas, updated replicas, and so on.
The whole scaling flow touches every rs and is easy to get lost in, but remember that in 99% of cases a deployment has exactly one active rs, the newRS, and most of the operations above target it; with that in mind the flow becomes much easier to follow.
Rolling update
Deployments can be updated either by rolling update or by recreate; the mechanics are similar, one proceeding in batches and the other all at once. Here we look at the rolling update code.
As soon as anything inside the deployment's spec changes, an rs update is triggered: a new revision of the rs is generated, replicas are scaled up on the new rs, and the old rs is scaled down.
pkg/controller/deployment/deployment_controller.go:644
==> pkg/controller/deployment/rolling.go:31
func (dc *DeploymentController) rolloutRolling(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
	// Get the new rs, creating newRS if it does not exist yet
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
	if err != nil {
		return err
	}
	allRSs := append(oldRSs, newRS)

	// Decide whether newRS needs to scale up (have its pods reached the target count?)
	scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, d)
	if err != nil {
		return err
	}
	if scaledUp {
		// Scaled up: update the deployment status
		return dc.syncRolloutStatus(allRSs, newRS, d)
	}

	// Decide whether the oldRSs need to scale down (have all of their pods terminated?)
	scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
	if err != nil {
		return err
	}
	if scaledDown {
		// Scaled down: update the deployment status
		return dc.syncRolloutStatus(allRSs, newRS, d)
	}

	// The deployment has reached the complete state: clean up old rs
	// according to the revision history limit
	if deploymentutil.DeploymentComplete(d, &d.Status) {
		if err := dc.cleanupDeployment(oldRSs, d); err != nil {
			return err
		}
	}

	// Update the deployment status
	return dc.syncRolloutStatus(allRSs, newRS, d)
}
The reconcileNewReplicaSet function:
This function returns a bool reporting whether a scaling step was applied to newRS.
func (dc *DeploymentController) reconcileNewReplicaSet(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
	if *(newRS.Spec.Replicas) == *(deployment.Spec.Replicas) {
		// newRS replicas == deployment replicas: newRS needs no scaling
		return false, nil
	}
	if *(newRS.Spec.Replicas) > *(deployment.Spec.Replicas) {
		// newRS replicas > deployment replicas: newRS must be scaled down
		scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, *(deployment.Spec.Replicas), deployment)
		return scaled, err
	}
	// newRS replicas < deployment replicas: use NewRSNewReplicas to compute
	// how many pod replicas newRS should own right now
	newReplicasCount, err := deploymentutil.NewRSNewReplicas(deployment, allRSs, newRS)
	if err != nil {
		return false, err
	}
	// scaled reports whether a scaling operation was actually performed
	scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, newReplicasCount, deployment)
	return scaled, err
}
The NewRSNewReplicas function:
This function computes how many replicas newRS should have at this moment.
func NewRSNewReplicas(deployment *apps.Deployment, allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet) (int32, error) {
	switch deployment.Spec.Strategy.Type {
	// Rolling update
	case apps.RollingUpdateDeploymentStrategyType:
		// Check if we can scale up.
		maxSurge, err := intstrutil.GetValueFromIntOrPercent(deployment.Spec.Strategy.RollingUpdate.MaxSurge, int(*(deployment.Spec.Replicas)), true)
		if err != nil {
			return 0, err
		}
		// Current count = sum of the pods managed by the rs of every revision
		currentPodCount := GetReplicaCountForReplicaSets(allRSs)
		// Maximum allowed count = desired replicas + maxSurge (an absolute number or a percentage)
		maxTotalPods := *(deployment.Spec.Replicas) + int32(maxSurge)
		// If the current count already reaches the maximum, no further
		// scale-up is possible; return newRS.Spec.Replicas unchanged
		if currentPodCount >= maxTotalPods {
			return *(newRS.Spec.Replicas), nil
		}
		// Otherwise, headroom = maximum - current count
		scaleUpCount := maxTotalPods - currentPodCount
		// But no single rs may manage more replicas than the deployment
		// specifies; only the old and new rs combined may exceed it, up to
		// maxSurge. So take the minimum of the two values.
		scaleUpCount = int32(integer.IntMin(int(scaleUpCount), int(*(deployment.Spec.Replicas)-*(newRS.Spec.Replicas))))
		// Target for newRS = its current replicas + headroom
		return *(newRS.Spec.Replicas) + scaleUpCount, nil
	case apps.RecreateDeploymentStrategyType:
		// Without rolling update, newRS simply gets deployment.Spec.Replicas; no elasticity
		return *(deployment.Spec.Replicas), nil
	default:
		return 0, fmt.Errorf("deployment type %v isn't supported", deployment.Spec.Strategy.Type)
	}
}
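Plugging in the numbers from the big comment example further below (10 desired replicas, maxSurge=3, newRS at 3, one old rs at 8) makes the arithmetic concrete; a self-contained sketch with all values assumed for illustration:
package main

import "fmt"

func main() {
	var (
		desired       int32 = 10 // *(deployment.Spec.Replicas)
		maxSurge      int32 = 3  // already resolved from its int-or-percent form
		newRSReplicas int32 = 3  // *(newRS.Spec.Replicas)
		oldRSReplicas int32 = 8  // sum over the old rs
	)

	currentPodCount := newRSReplicas + oldRSReplicas // 11
	maxTotalPods := desired + maxSurge               // 13

	scaleUpCount := maxTotalPods - currentPodCount // headroom: 2
	if limit := desired - newRSReplicas; limit < scaleUpCount {
		scaleUpCount = limit // newRS alone must never exceed the desired count
	}

	fmt.Println(newRSReplicas + scaleUpCount) // 5 -> newRS is scaled up to 5
}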
The reconcileOldReplicaSets function
This function returns a bool reporting whether a scale-down step was applied to the oldRSs.
func (dc *DeploymentController) reconcileOldReplicaSets(allRSs []*apps.ReplicaSet, oldRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
	oldPodsCount := deploymentutil.GetReplicaCountForReplicaSets(oldRSs)
	if oldPodsCount == 0 {
		// Already fully scaled down; nothing to do
		return false, nil
	}

	// Total number of pods right now (the current count)
	allPodsCount := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
	klog.V(4).Infof("New replica set %s/%s has %d available pods.", newRS.Namespace, newRS.Name, newRS.Status.AvailableReplicas)
	// The maximum number of unavailable replicas the deployment permits
	maxUnavailable := deploymentutil.MaxUnavailable(*deployment)

	// Check if we can scale down. We can scale down in the following 2 cases:
	// * Some old replica sets have unhealthy replicas, we could safely scale down those unhealthy replicas since that won't further
	//  increase unavailability.
	// * New replica set has scaled up and it's replicas becomes ready, then we can scale down old replica sets in a further step.
	//
	// maxScaledDown := allPodsCount - minAvailable - newReplicaSetPodsUnavailable
	// take into account not only maxUnavailable and any surge pods that have been created, but also unavailable pods from
	// the newRS, so that the unavailable pods from the newRS would not make us scale down old replica sets in a further
	// step(that will increase unavailability).
	//
	// Concrete example:
	//
	// * 10 replicas
	// * 2 maxUnavailable (absolute number, not percent)
	// * 3 maxSurge (absolute number, not percent)
	//
	// case 1:
	// * Deployment is updated, newRS is created with 3 replicas, oldRS is scaled down to 8, and newRS is scaled up to 5.
	// * The new replica set pods crashloop and never become available.
	// * allPodsCount is 13. minAvailable is 8. newRSPodsUnavailable is 5.
	// * A node fails and causes one of the oldRS pods to become unavailable. However, 13 - 8 - 5 = 0, so the oldRS won't be scaled down.
	// * The user notices the crashloop and does kubectl rollout undo to rollback.
	// * newRSPodsUnavailable is 1, since we rolled back to the good replica set, so maxScaledDown = 13 - 8 - 1 = 4. 4 of the crashlooping pods will be scaled down.
	// * The total number of pods will then be 9 and the newRS can be scaled up to 10.
	//
	// case 2:
	// Same example, but pushing a new pod template instead of rolling back (aka "roll over"):
	// * The new replica set created must start with 0 replicas because allPodsCount is already at 13.
	// * However, newRSPodsUnavailable would also be 0, so the 2 old replica sets could be scaled down by 5 (13 - 8 - 0), which would then
	// allow the new replica set to be scaled up by 5.

	// "Available" means the readiness probe reports true; if no readiness
	// probe is specified, a pod counts as ready as soon as it is running.
	// Minimum number of replicas that must stay available
	minAvailable := *(deployment.Spec.Replicas) - maxUnavailable
	// Unavailable count on newRS
	newRSUnavailablePodCount := *(newRS.Spec.Replicas) - newRS.Status.AvailableReplicas
	// Maximum that may be scaled down = total - minimum available - newRS's
	// unavailable count (to protect the availability floor, newRS's
	// unavailable pods must not count toward the scale-down budget)
	maxScaledDown := allPodsCount - minAvailable - newRSUnavailablePodCount
	if maxScaledDown <= 0 {
		return false, nil
	}

	// Unhealthy replicas in the old rs must be deleted no matter what, otherwise they would block the rollout
	// and cause timeout. See https://github.com/kubernetes/kubernetes/issues/16737
	oldRSs, cleanupCount, err := dc.cleanupUnhealthyReplicas(oldRSs, deployment, maxScaledDown)
	if err != nil {
		return false, nil
	}
	klog.V(4).Infof("Cleaned up unhealthy replicas from old RSes by %d", cleanupCount)

	// Scale down the old rs; the allowed amount is the smaller of the
	// computed maxScaledDown and the deployment's maxUnavailable budget
	allRSs = append(oldRSs, newRS)
	scaledDownCount, err := dc.scaleDownOldReplicaSetsForRollingUpdate(allRSs, oldRSs, deployment)
	if err != nil {
		return false, nil
	}
	klog.V(4).Infof("Scaled down old RSes of deployment %s by %d", deployment.Name, scaledDownCount)

	totalScaledDown := cleanupCount + scaledDownCount
	// Report whether anything was actually scaled down
	return totalScaledDown > 0, nil
}
The worked examples in the English comment block in the middle are very detailed and worth reading carefully.
The syncRolloutStatus function
This function updates the deployment's status field and the conditions inside it.
func (dc *DeploymentController) syncRolloutStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error {
	newStatus := calculateStatus(allRSs, newRS, d)

	if !util.HasProgressDeadline(d) {
		util.RemoveDeploymentCondition(&newStatus, apps.DeploymentProgressing)
	}

	currentCond := util.GetDeploymentCondition(d.Status, apps.DeploymentProgressing)
	/**
	A deployment counts as complete when several conditions hold:
	1. newStatus.Replicas == newStatus.UpdatedReplicas, i.e. every replica has been updated
	2. the current Progressing condition's reason is NewRSAvailableReason
	**/
	isCompleteDeployment := newStatus.Replicas == newStatus.UpdatedReplicas && currentCond != nil && currentCond.Reason == util.NewRSAvailableReason
	// Only a deployment that has not yet reached the complete state goes through the checks below
	if util.HasProgressDeadline(d) && !isCompleteDeployment {
		switch {
		case util.DeploymentComplete(d, &newStatus):
			// Update the deployment conditions with a message for the new replica set that
			// was successfully deployed. If the condition already exists, we ignore this update.
			msg := fmt.Sprintf("Deployment %q has successfully progressed.", d.Name)
			if newRS != nil {
				msg = fmt.Sprintf("ReplicaSet %q has successfully progressed.", newRS.Name)
			}
			condition := util.NewDeploymentCondition(apps.DeploymentProgressing, v1.ConditionTrue, util.NewRSAvailableReason, msg)
			util.SetDeploymentCondition(&newStatus, *condition)

		case util.DeploymentProgressing(d, &newStatus):
			// If there is any progress made, continue by not checking if the deployment failed. This
			// behavior emulates the rolling updater progressDeadline check.
			msg := fmt.Sprintf("Deployment %q is progressing.", d.Name)
			if newRS != nil {
				msg = fmt.Sprintf("ReplicaSet %q is progressing.", newRS.Name)
			}
			condition := util.NewDeploymentCondition(apps.DeploymentProgressing, v1.ConditionTrue, util.ReplicaSetUpdatedReason, msg)
			// Update the current Progressing condition or add a new one if it doesn't exist.
			// If a Progressing condition with status=true already exists, we should update
			// everything but lastTransitionTime. SetDeploymentCondition already does that but
			// it also is not updating conditions when the reason of the new condition is the
			// same as the old. The Progressing condition is a special case because we want to
			// update with the same reason and change just lastUpdateTime iff we notice any
			// progress. That's why we handle it here.
			if currentCond != nil {
				if currentCond.Status == v1.ConditionTrue {
					condition.LastTransitionTime = currentCond.LastTransitionTime
				}
				util.RemoveDeploymentCondition(&newStatus, apps.DeploymentProgressing)
			}
			util.SetDeploymentCondition(&newStatus, *condition)

		case util.DeploymentTimedOut(d, &newStatus):
			// Update the deployment with a timeout condition. If the condition already exists,
			// we ignore this update.
			msg := fmt.Sprintf("Deployment %q has timed out progressing.", d.Name)
			if newRS != nil {
				msg = fmt.Sprintf("ReplicaSet %q has timed out progressing.", newRS.Name)
			}
			condition := util.NewDeploymentCondition(apps.DeploymentProgressing, v1.ConditionFalse, util.TimedOutReason, msg)
			util.SetDeploymentCondition(&newStatus, *condition)
		}
	}
	// ... (the rest of the function, which persists newStatus via UpdateStatus, is omitted here)
DeploymentCondition comes up again and again in this function; to make it concrete, compare against the conditions of a deployment in a healthy state:
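(An illustrative sample in kubectl describe deployment form; the values are typical, not taken from this article's environment:)
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable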
Summary
During a rolling update, the controller scales newRS up via the reconcileNewReplicaSet function and scales the oldRSs down via the reconcileOldReplicaSets function, within the constraints of maxSurge and maxUnavailable. The timer re-runs this every second, converging and correcting step by step until the desired state is reached and the update completes.
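For instance, with replicas=10, maxSurge=3 and maxUnavailable=2 (the same numbers as the comment block earlier), the two rs together never run more than 13 pods, and the available count never drops below 8, at any point during the rollout.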
Summary
A deployment's rollback, scaling, pause and update operations are all carried out by manipulating its rs. Revision control of the rs and control of the replicas counts are the core of it, and also the hardest part to follow; but as long as you remember that 99% of the time a deployment has a single active rs, that only during an update are there 2, and that only rarely (rapid repeated updates) are there more than 2, the source above becomes far easier to digest.
Also, as all the steps above show, a deployment update barely touches pods directly. The following articles in this chapter will therefore look at how the replicaSet controller manages and interacts with pods.