Kubernetes Source Code Study-Controller-P5-StatefulSet Controller

P5-StatefulSet Controller

Preface

The previous articles in this series started with a first look at the deployment controller:

Controller-P3-Deployment Controller

Strictly speaking, a deployment manages its pods through ReplicaSets, so the next article dug into the replicaSet controller:

Controller-P3-ReplicaSet Controller

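This article looks at the controller behind another of the most commonly used workload resources built on top of pods: the StatefulSet Controller.

Basic characteristics of StatefulSet

Before reading the code, let's review how a StatefulSet (sts) behaves at runtime. Reading the code with that behavior in mind is much smoother.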

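Creation

An sts is ordered: its pod replicas are created serially and in order. Pods are named {sts_name}-{ordinal}; for N replicas the ordinals run from 0 to N-1. Creation starts with the lowest ordinal (the pod named {sts_name}-0) and continues up to the last replica (the pod named {sts_name}-(N-1)).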
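The naming rule can be illustrated with a few lines of Go. This is only a sketch of the convention described above; the helper name `podNames` and the set name `web` are made up for the example and do not come from the Kubernetes source.

```go
package main

import "fmt"

// podNames returns the pod names a StatefulSet called setName would use for
// the given replica count, in creation order (ordinal 0 first).
// Hypothetical helper for illustration only.
func podNames(setName string, replicas int) []string {
	names := make([]string, 0, replicas)
	for ord := 0; ord < replicas; ord++ {
		names = append(names, fmt.Sprintf("%s-%d", setName, ord))
	}
	return names
}

func main() {
	fmt.Println(podNames("web", 3)) // [web-0 web-1 web-2]
}
```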

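Updates

An sts has two update strategies:

  • RollingUpdateStatefulSetStrategyType, the default rolling-update strategy. Pods are updated in reverse ordinal order: the controller deletes and recreates pods starting from the highest ordinal and works down to the lowest. Pods are replaced one at a time, so the pod count stays at the specified replica count: the next pod is deleted only after the previous one has been recreated. A partition parameter can also be set, in which case only pods with an ordinal greater than or equal to partition are updated and pods with a smaller ordinal are left alone. For example, with 5 replicas and partition set to 2, updating the sts leaves pods 0 and 1 untouched while pods 2, 3 and 4 are deleted and recreated; lowering partition to 0 afterwards updates pods 0 and 1 as well. The default partition is 0, meaning every pod is updated. The parameter is rarely needed in everyday use, but decreasing it step by step during a release gives a rolling canary rollout.

  • OnDeleteStatefulSetStrategyType. Under this strategy the controller does not replace pods on its own; manually deleting a pod triggers the creation of a new one.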
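The effect of partition on a rolling update can be sketched as follows. The function `updateOrder` is a hypothetical helper, not part of the controller, and it assumes partition is not negative; it only lists which ordinals a rolling update would touch and in which order.

```go
package main

import "fmt"

// updateOrder returns the ordinals a RollingUpdate would touch, in processing
// order: highest ordinal first, stopping at partition. Pods with an ordinal
// below partition are left on the old revision.
// Hypothetical helper for illustration only.
func updateOrder(replicas, partition int) []int {
	order := []int{}
	for ord := replicas - 1; ord >= partition; ord-- {
		order = append(order, ord)
	}
	return order
}

func main() {
	fmt.Println(updateOrder(5, 2)) // [4 3 2]
	fmt.Println(updateOrder(5, 0)) // [4 3 2 1 0]
}
```

Lowering partition from 2 to 0 in the example corresponds to the canary-style rollout mentioned above: the second call additionally covers pods 1 and 0.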

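Deletion

When deleting an sts you can pass the cascade parameter --cascade=true, which is the default and means the pods managed by the sts are deleted along with it. With --cascade=false, deleting the sts does not affect the running pods, and a recreated sts can still be associated with the earlier pods (in the meantime those pods are orphans).

Relationships

First, how an sts and its pods are linked: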

# sts
[root@008019 ~]# kubectl get sts deptest11dev
NAME           READY   AGE
deptest11dev   2/2     99d

# pod
[root@008019 ~]# kubectl get pods  | grep deptest11dev
deptest11dev-0                                    1/1     Running                 1          99d
deptest11dev-1                                    1/1     Running                 0          3d17h

# edit pod
# The pod's ownerReferences field links it to the sts
ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: deptest11dev
    uid: 28ecf735-2ab4-11ea-afa8-1866daf0f324

# The pod's labels gain a controller-revision-hash label, which links the pod to a ControllerRevision
  labels:
    app: deptest11dev
    controller-revision-hash: deptest11dev-587f8bd845
    statefulset.kubernetes.io/pod-name: deptest11dev-1

Next, how an sts and its ControllerRevisions are linked:

[root@008019 ~]# kubectl get sts deptest11dev
NAME           READY   AGE
deptest11dev   2/2     99d



[root@008019 ~]# kubectl get ControllerRevisions | grep deptest11dev
deptest11dev-587f8bd845                                    statefulset.apps/deptest11dev                                    1          99d

[root@008019 ~]# kubectl get ControllerRevisions deptest11dev-587f8bd845
NAME                      CONTROLLER                      REVISION   AGE
deptest11dev-587f8bd845   statefulset.apps/deptest11dev   1          99d


# The ownerReferences field of the ControllerRevision links it to the sts
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: deptest11dev
    uid: 28ecf735-2ab4-11ea-afa8-1866daf0f324
    

# sts status: the currentRevision and updateRevision fields under status link the sts to its ControllerRevisions
status:
  collisionCount: 0
  currentReplicas: 2
  currentRevision: deptest11dev-587f8bd845
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updateRevision: deptest11dev-587f8bd845
  updatedReplicas: 2

# After sts.spec is changed in a way that recreates pods, the sts starts a rolling update; its status then looks like this:
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: deptest11dev-587f8bd845
  observedGeneration: 4
  readyReplicas: 2
  replicas: 2
  # updateRevision now points at the new revision, i.e. updateRevision is the revision of the most recent update
  updateRevision: deptest11dev-7487498978
  
# status while scaling the sts in or out:
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: deptest11dev-7487498978
  observedGeneration: 5
  readyReplicas: 3
  replicas: 3
  # the revisions do not change
  updateRevision: deptest11dev-7487498978
  updatedReplicas: 3
  

Keep these two-way links between the three resources in mind; they come up again below.
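As a rough illustration of how the ownerReferences link is read, the sketch below finds the owner marked as controller. `OwnerReference` here is a trimmed-down, hypothetical stand-in for the real metav1.OwnerReference (whose Controller field is a *bool), and `controllerOf` only imitates the idea behind metav1.GetControllerOf; the UID value is invented.

```go
package main

import "fmt"

// OwnerReference is a simplified, hypothetical stand-in for
// metav1.OwnerReference.
type OwnerReference struct {
	Kind       string
	Name       string
	UID        string
	Controller bool
}

// controllerOf returns the owner reference flagged as controller, or nil when
// the object has none, which is what makes it an orphan.
func controllerOf(refs []OwnerReference) *OwnerReference {
	for i := range refs {
		if refs[i].Controller {
			return &refs[i]
		}
	}
	return nil
}

func main() {
	podOwners := []OwnerReference{
		{Kind: "StatefulSet", Name: "deptest11dev", UID: "uid-1234", Controller: true},
	}
	if ref := controllerOf(podOwners); ref != nil {
		fmt.Printf("managed by %s/%s\n", ref.Kind, ref.Name)
	}
	fmt.Println(controllerOf(nil) == nil) // true: no controller means orphan
}
```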

StatefulSet Controller

Initialization

cmd/kube-controller-manager/app/apps.go:59

func startStatefulSetController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "statefulsets"}] {
		return nil, false, nil
	}
	go statefulset.NewStatefulSetController(
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Apps().V1().StatefulSets(),
		ctx.InformerFactory.Core().V1().PersistentVolumeClaims(),
		ctx.InformerFactory.Apps().V1().ControllerRevisions(),
		ctx.ClientBuilder.ClientOrDie("statefulset-controller"),
	).Run(1, ctx.Stop)
	return nil, true, nil
}

First, let's see what NewStatefulSetController does:

==> pkg/controller/statefulset/stateful_set.go:81

func NewStatefulSetController(
	// 1. The StatefulSetController watches four resource types: Pod/Sts/PVC/ControllerRevision
	podInformer coreinformers.PodInformer,
	setInformer appsinformers.StatefulSetInformer,
	pvcInformer coreinformers.PersistentVolumeClaimInformer,
	revInformer appsinformers.ControllerRevisionInformer,
	kubeClient clientset.Interface,
) *StatefulSetController {
	eventBroadcaster := record.NewBroadcaster()
	eventBroadcaster.StartLogging(klog.Infof)
	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
	recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "statefulset-controller"})
  
	// 2. NewDefaultStatefulSetControl deserves a closer look
	ssc := &StatefulSetController{
		kubeClient: kubeClient,
		control: NewDefaultStatefulSetControl(
			NewRealStatefulPodControl(
				kubeClient,
				setInformer.Lister(),
				podInformer.Lister(),
				pvcInformer.Lister(),
				recorder),
			NewRealStatefulSetStatusUpdater(kubeClient, setInformer.Lister()),
			history.NewHistory(kubeClient, revInformer.Lister()),
			recorder,
		),
		pvcListerSynced: pvcInformer.Informer().HasSynced,
		queue:           workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "statefulset"),
		podControl:      controller.RealPodControl{KubeClient: kubeClient, Recorder: recorder},

		revListerSynced: revInformer.Informer().HasSynced,
	}
	// Handlers for pod add/update/delete events (each resolves the owning sts and pushes it onto the workqueue)
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		// lookup the statefulset and enqueue
		AddFunc: ssc.addPod,
		// lookup current and old statefulset if labels changed
		UpdateFunc: ssc.updatePod,
		// lookup statefulset accounting for deletion tombstones
		DeleteFunc: ssc.deletePod,
	})
	ssc.podLister = podInformer.Lister()
	ssc.podListerSynced = podInformer.Informer().HasSynced
  
	// Handlers for sts add/update/delete events (push the sts onto the workqueue)
	setInformer.Informer().AddEventHandlerWithResyncPeriod(
		cache.ResourceEventHandlerFuncs{
			AddFunc: ssc.enqueueStatefulSet,
			UpdateFunc: func(old, cur interface{}) {
				oldPS := old.(*apps.StatefulSet)
				curPS := cur.(*apps.StatefulSet)
				if oldPS.Status.Replicas != curPS.Status.Replicas {
					klog.V(4).Infof("Observed updated replica count for StatefulSet: %v, %d->%d", curPS.Name, oldPS.Status.Replicas, curPS.Status.Replicas)
				}
				ssc.enqueueStatefulSet(cur)
			},
			DeleteFunc: ssc.enqueueStatefulSet,
		},
		statefulSetResyncPeriod,
	)
	ssc.setLister = setInformer.Lister()
	ssc.setListerSynced = setInformer.Informer().HasSynced

	// TODO: Watch volumes
	// Return ssc (the StatefulSetController)
	return ssc
}

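Starting with comment 1: the StatefulSetController watches four resource types, Pod/Sts/PVC/ControllerRevision. ControllerRevision is the least familiar of them, so let's find its definition by following the references step by step: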

cmd/kube-controller-manager/app/apps.go:63

==> vendor/k8s.io/client-go/informers/apps/v1/interface.go:28

==>vendor/k8s.io/client-go/informers/apps/v1/controllerrevision.go:38

==> vendor/k8s.io/client-go/listers/apps/v1/controllerrevision.go:29

==> vendor/k8s.io/api/apps/v1/types.go:800

type ControllerRevision struct {
	metav1.TypeMeta `json:",inline"`
	// Standard object's metadata.
	// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

	// Data is the serialized representation of the state.
	Data runtime.RawExtension `json:"data,omitempty" protobuf:"bytes,2,opt,name=data"`

	// Revision indicates the revision of the state represented by Data.
	Revision int64 `json:"revision" protobuf:"varint,3,opt,name=revision"`
}

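According to the comment above this struct, ControllerRevision is used by DaemonSet and StatefulSet for updates and rollbacks. A ControllerRevision stores a snapshot of state data and is immutable once created; the client is responsible for serializing the data on write and deserializing it on read. The Revision (int64) field acts as the version number of the ControllerRevision, and the Data field holds the serialized data.

Aside: it is easy to guess that StatefulSet updates and rollbacks (a rollback being a special kind of update) work by comparing old and new ControllerRevisions.

Now for comment 2, the NewDefaultStatefulSetControl function: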
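The snapshot idea can be sketched in plain Go. Everything below is an assumption made for illustration: `Revision` and `PodTemplate` are simplified stand-ins for the real API types, JSON is used as the serialization format, and equality is decided purely by comparing the stored bytes. The real controller has its own helpers for creating and comparing revisions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Revision is a hypothetical, simplified stand-in for ControllerRevision.
type Revision struct {
	Revision int64  // version number of the snapshot
	Data     []byte // serialized snapshot of the state
}

// PodTemplate is a hypothetical stand-in for the state being versioned.
type PodTemplate struct {
	Image string `json:"image"`
}

// snapshot serializes tmpl into a new Revision with the given number.
func snapshot(tmpl PodTemplate, number int64) (Revision, error) {
	data, err := json.Marshal(tmpl)
	if err != nil {
		return Revision{}, err
	}
	return Revision{Revision: number, Data: data}, nil
}

// sameState reports whether two revisions hold identical snapshots,
// regardless of their revision numbers.
func sameState(a, b Revision) bool {
	return bytes.Equal(a.Data, b.Data)
}

func main() {
	r1, _ := snapshot(PodTemplate{Image: "nginx:1.16"}, 1)
	r2, _ := snapshot(PodTemplate{Image: "nginx:1.17"}, 2)
	r3, _ := snapshot(PodTemplate{Image: "nginx:1.16"}, 3) // rollback to the first template

	fmt.Println(sameState(r1, r2)) // false
	fmt.Println(sameState(r1, r3)) // true
}
```

The third snapshot shows why a rollback can be treated as a special update: its content equals an older revision even though its number is newer.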

pkg/controller/statefulset/stateful_set.go:95

==> pkg/controller/statefulset/stateful_set_control.go:54

func NewDefaultStatefulSetControl(
	podControl StatefulPodControlInterface,
	statusUpdater StatefulSetStatusUpdaterInterface,
	controllerHistory history.Interface,
	recorder record.EventRecorder) StatefulSetControlInterface {
	return &defaultStatefulSetControl{podControl, statusUpdater, controllerHistory, recorder}
}

The defaultStatefulSetControl struct returned by NewDefaultStatefulSetControl implements the sts control logic. It bundles the interfaces used while reconciling an sts:

  1. The interface for managing the sts's pods and PVCs (podControl), with the methods CreateStatefulPod/UpdateStatefulPod/DeleteStatefulPod, implemented by the realStatefulPodControl struct returned by NewRealStatefulPodControl.
  2. The interface for updating the sts status (statusUpdater), with the single method UpdateStatefulSetStatus, implemented by the realStatefulSetStatusUpdater struct returned by NewRealStatefulSetStatusUpdater.
  3. The interface for managing ControllerRevision versions (controllerHistory), with the methods ListControllerRevisions/CreateControllerRevision/DeleteControllerRevision/UpdateControllerRevision/AdoptControllerRevision/ReleaseControllerRevision, implemented by the realHistory struct returned by history.NewHistory.

Next, let's look at the Run function of ssc (the StatefulSetController).

How it works

The *StatefulSetController.Run() function:

func (ssc *StatefulSetController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer ssc.queue.ShutDown()

	klog.Infof("Starting stateful set controller")
	defer klog.Infof("Shutting down statefulset controller")

	if !controller.WaitForCacheSync("stateful set", stopCh, ssc.podListerSynced, ssc.setListerSynced, ssc.pvcListerSynced, ssc.revListerSynced) {
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(ssc.worker, time.Second, stopCh)
	}

	<-stopCh
}
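The worker pattern in Run can be approximated with the standard library alone. The sketch below is not how the controller is implemented: it swaps the rate-limited workqueue for a plain channel, drops the stop channel and the cache-sync wait, and uses the hypothetical name `runWorkers`. It assumes workers is at least 1.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers starts the given number of goroutines, each pulling keys from a
// shared queue and handing them to handle, then waits until all keys are done.
// With more than one worker, handle must be safe for concurrent use.
// Hypothetical stand-in for the workqueue-based loop.
func runWorkers(workers int, keys []string, handle func(key string)) {
	queue := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				handle(key)
			}
		}()
	}
	for _, key := range keys {
		queue <- key
	}
	close(queue)
	wg.Wait()
}

func main() {
	var processed []string
	// One worker, matching the Run(1, ctx.Stop) call shown earlier.
	runWorkers(1, []string{"default/web", "default/db"}, func(key string) {
		processed = append(processed, key)
	})
	fmt.Println(processed) // [default/web default/db]
}
```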

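The wait.Until timer was covered in an earlier article and is not repeated here. The focus is the ssc.worker function, which jumps through several calls: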

pkg/controller/statefulset/stateful_set.go:159

==>pkg/controller/statefulset/stateful_set.go:410

==> pkg/controller/statefulset/stateful_set.go:399

==>pkg/controller/statefulset/stateful_set.go:415

// sync syncs the given statefulset.
func (ssc *StatefulSetController) sync(key string) error {
	startTime := time.Now()
	defer func() {
		klog.V(4).Infof("Finished syncing statefulset %q (%v)", key, time.Since(startTime))
	}()
	// Example key: default/teststs. Split it to get the namespace and the sts name
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	// Fetch the sts object
	set, err := ssc.setLister.StatefulSets(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.Infof("StatefulSet has been deleted %v", key)
		return nil
	}
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("unable to retrieve StatefulSet %v from store: %v", key, err))
		return err
	}
  
	// Convert the sts's label selector into a labels.Selector
	selector, err := metav1.LabelSelectorAsSelector(set.Spec.Selector)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("error converting StatefulSet %v selector: %v", key, err))
		// This is a non-transient error, so don't retry.
		return nil
	}

	// Adopt orphaned revisions
	if err := ssc.adoptOrphanRevisions(set); err != nil {
		return err
	}
	// Get the pods managed by the sts
	pods, err := ssc.getPodsForStatefulSet(set, selector)
	if err != nil {
		return err
	}
	// syncStatefulSet performs the actual sts sync
	return ssc.syncStatefulSet(set, pods)
}
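The key handling at the top of sync can be sketched as below. `splitKey` is a hypothetical re-implementation written for this article; it mirrors the behavior of splitting a "namespace/name" key, where a key without a slash is treated as a name with an empty namespace.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKey splits a queue key of the form "namespace/name" (or just "name").
// Hypothetical helper modeled on the key format used in sync above.
func splitKey(key string) (namespace, name string, err error) {
	parts := strings.Split(key, "/")
	switch len(parts) {
	case 1:
		return "", parts[0], nil
	case 2:
		return parts[0], parts[1], nil
	}
	return "", "", fmt.Errorf("unexpected key format: %q", key)
}

func main() {
	ns, name, err := splitKey("default/teststs")
	fmt.Println(ns, name, err) // default teststs <nil>
}
```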

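Let's go through it step by step.

Adopting orphaned revisions

As shown above, an sts and its revisions reference each other explicitly through fields on both sides. With that in mind, this function is easy to follow.

Orphaned ControllerRevisions most likely appear because the sts was updated repeatedly in the meantime, leaving stale data behind in the gaps between updates.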

pkg/controller/statefulset/stateful_set.go:316

// adoptOrphanRevisions adopts any orphaned ControllerRevisions matched by set's Selector.
func (ssc *StatefulSetController) adoptOrphanRevisions(set *apps.StatefulSet) error {
	// List the ControllerRevisions that belong to this sts
	revisions, err := ssc.control.ListRevisions(set)
	if err != nil {
		return err
	}
	hasOrphans := false
	for i := range revisions {
		// Look up the controller owner of each revision. If there is none, the ControllerRevision is an orphan (unmanaged) and needs to be adopted
		if metav1.GetControllerOf(revisions[i]) == nil {
			hasOrphans = true
			break
		}
	}
  
	// Orphans probably mean the sts was updated repeatedly in the meantime, so fetch the latest sts again
	if hasOrphans {
		fresh, err := ssc.kubeClient.AppsV1().StatefulSets(set.Namespace).Get(set.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// If the cached sts and the fresh sts have different UIDs, the sts was probably deleted and recreated in between; abort this pass and return an error
		if fresh.UID != set.UID {
			return fmt.Errorf("original StatefulSet %v/%v is gone: got uid %v, wanted %v", set.Namespace, set.Name, fresh.UID, set.UID)
		}
		// Adopt the orphaned revisions by adding an ownerReference that points at this sts
		return ssc.control.AdoptOrphanRevisions(set, revisions)
	}
	return nil
}

Getting the pods managed by the sts

pkg/controller/statefulset/stateful_set.go:285

func (ssc *StatefulSetController) getPodsForStatefulSet(set *apps.StatefulSet, selector labels.Selector) ([]*v1.Pod, error) {
	// List all pods to include the pods that don't match the selector anymore but
	// has a ControllerRef pointing to this StatefulSet.
	pods, err := ssc.podLister.Pods(set.Namespace).List(labels.Everything())
	if err != nil {
		return nil, err
	}
	
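	// filter decides whether a pod belongs to the sts. The check is simple: the pod name is split with a regular expression; everything before the last "-" is the parent name and the trailing number is the ordinal. The pod belongs to the sts only if the parent name equals the sts name.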
	filter := func(pod *v1.Pod) bool {
		// Only claim if it matches our StatefulSet name. Otherwise release/ignore.
		return isMemberOf(set, pod)
	}
  

	// As with revisions, orphaned pods must be claimed: pods that match the sts's labels are adopted, and owned pods that no longer match are released.
	canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
		fresh, err := ssc.kubeClient.AppsV1().StatefulSets(set.Namespace).Get(set.Name, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if fresh.UID != set.UID {
			return nil, fmt.Errorf("original StatefulSet %v/%v is gone: got uid %v, wanted %v", set.Namespace, set.Name, fresh.UID, set.UID)
		}
		return fresh, nil
	})

	cm := controller.NewPodControllerRefManager(ssc.podControl, set, selector, controllerKind, canAdoptFunc)
	// Run the filtering
	return cm.ClaimPods(pods, filter)
}
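The name-based membership check described in the comment on filter can be sketched like this. The regular expression and the helper names are written for this example and are only modeled on the behavior described above; they are not copied from the controller.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// podNameRegex captures the parent name and the trailing ordinal of a pod name.
var podNameRegex = regexp.MustCompile(`^(.*)-([0-9]+)$`)

// parentAndOrdinal splits a pod name into its parent name and ordinal.
// It returns ("", -1) when the name does not end in "-<number>".
func parentAndOrdinal(podName string) (string, int) {
	m := podNameRegex.FindStringSubmatch(podName)
	if m == nil {
		return "", -1
	}
	ord, err := strconv.Atoi(m[2])
	if err != nil {
		return "", -1
	}
	return m[1], ord
}

// isMemberOf reports whether podName looks like a pod of the set setName.
func isMemberOf(setName, podName string) bool {
	parent, ord := parentAndOrdinal(podName)
	return ord >= 0 && parent == setName
}

func main() {
	fmt.Println(parentAndOrdinal("deptest11dev-1"))       // deptest11dev 1
	fmt.Println(isMemberOf("deptest11dev", "other-0"))    // false
	fmt.Println(isMemberOf("deptest11dev", "deptest11dev")) // false
}
```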

ClaimPods

pkg/controller/controller_ref_manager.go:171

func (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod, filters ...func(*v1.Pod) bool) ([]*v1.Pod, error) {
	var claimed []*v1.Pod
	var errlist []error
  
  
	match := func(obj metav1.Object) bool {
		pod := obj.(*v1.Pod)
		// Match on labels first; only if the selector matches are the extra filters applied (for an sts, the pod-name check described above)
		if !m.Selector.Matches(labels.Set(pod.Labels)) {
			return false
		}
		for _, filter := range filters {
			if !filter(pod) {
				return false
			}
		}
		return true
	}
	adopt := func(obj metav1.Object) error {
		// Adopt the pod (establish ownership) by patching ownerReferences into pod.metadata
		return m.AdoptPod(obj.(*v1.Pod))
	}
	release := func(obj metav1.Object) error {
		// Release the pod by removing the ownerReferences entry from pod.metadata
		return m.ReleasePod(obj.(*v1.Pod))
	}

	for _, pod := range pods {
		// ClaimObject decides whether a single pod matches and adopts or releases it accordingly
		ok, err := m.ClaimObject(pod, match, adopt, release)
		if err != nil {
			errlist = append(errlist, err)
			continue
		}
		if ok {
			claimed = append(claimed, pod)
		}
	}
	return claimed, utilerrors.NewAggregate(errlist)
}

ClaimObject

pkg/controller/controller_ref_manager.go:66

func (m *BaseControllerRefManager) ClaimObject(obj metav1.Object, match func(metav1.Object) bool, adopt, release func(metav1.Object) error) (bool, error) {
	// 1 Get the controller ownerReference from pod.metadata
	controllerRef := metav1.GetControllerOf(obj)
	// 1-1 If the pod has a controller ownerReference, check that it points at this controller, then check whether the pod matches
	if controllerRef != nil {
		if controllerRef.UID != m.Controller.GetUID() {
			// Owned by someone else. Ignore.
			return false, nil
		}
		// 1-2 Return true if it matches
		if match(obj) {
			return true, nil
		}
		
		if m.Controller.GetDeletionTimestamp() != nil {
			return false, nil
		}
    
		// 1-3 If it does not match, release the pod (remove the ownership) and return false
		if err := release(obj); err != nil {
			// If the pod no longer exists, ignore the error.
			if errors.IsNotFound(err) {
				return false, nil
			}
			return false, err
		}
		// Successfully released.
		return false, nil
	}
  
	// 2 Orphaned pod: decide whether to adopt it
	// 2-1 If the sts is being deleted or the match rules fail, return false
	if m.Controller.GetDeletionTimestamp() != nil || !match(obj) {
		// Ignore if we're being deleted or selector doesn't match.
		return false, nil
	}
	if obj.GetDeletionTimestamp() != nil {
		// Ignore if the object is being deleted
		return false, nil
	}
	// Selector matches. Try to adopt.
	if err := adopt(obj); err != nil {
		// If the pod no longer exists, ignore the error.
		if errors.IsNotFound(err) {
			return false, nil
		}
		// Either someone else claimed it first, or there was a transient error.
		// The controller should requeue and try again if it's still orphaned.
		return false, err
	}
	// Adoption succeeded, return true
	return true, nil
}
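The branches of ClaimObject can be condensed into a small decision function. This is a sketch that only returns the decision and performs no API calls; the `claimInput` struct, the `Action` names and the UID strings are all made up for the example.

```go
package main

import "fmt"

// Action is the outcome of a claim decision. Hypothetical type.
type Action string

const (
	Keep    Action = "keep"    // already owned by us and still matching
	Release Action = "release" // owned by us but no longer matching
	Adopt   Action = "adopt"   // orphan that matches
	Ignore  Action = "ignore"  // not ours, or nothing should be done
)

// claimInput gathers the facts ClaimObject looks at. Hypothetical type.
type claimInput struct {
	ownerUID           string // "" means the object has no controller (orphan)
	controllerUID      string // UID of the controller doing the claiming
	matches            bool   // result of the match function
	controllerDeleting bool   // controller has a deletion timestamp
	objectDeleting     bool   // object has a deletion timestamp
}

// decide mirrors the branch order of ClaimObject shown above.
func decide(in claimInput) Action {
	if in.ownerUID != "" {
		if in.ownerUID != in.controllerUID {
			return Ignore // owned by someone else
		}
		if in.matches {
			return Keep
		}
		if in.controllerDeleting {
			return Ignore
		}
		return Release
	}
	if in.controllerDeleting || !in.matches {
		return Ignore
	}
	if in.objectDeleting {
		return Ignore
	}
	return Adopt
}

func main() {
	fmt.Println(decide(claimInput{ownerUID: "u1", controllerUID: "u1", matches: true})) // keep
	fmt.Println(decide(claimInput{ownerUID: "u1", controllerUID: "u1"}))                // release
	fmt.Println(decide(claimInput{controllerUID: "u1", matches: true}))                 // adopt
	fmt.Println(decide(claimInput{ownerUID: "u2", controllerUID: "u1", matches: true})) // ignore
}
```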

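At this point, all pods that should be managed by the sts (including adopted orphans) have been filtered out, and the real sts sync begins.

syncStatefulSet

With all managed pods found, the sts sync starts: update the sts and its pods. Back to here: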

pkg/controller/statefulset/stateful_set.go:451

==> pkg/controller/statefulset/stateful_set.go:458

==> pkg/controller/statefulset/stateful_set_control.go:75

func (ssc *defaultStatefulSetControl) UpdateStatefulSet(set *apps.StatefulSet, pods []*v1.Pod) error {

	// List all revisions of the sts and sort them
	revisions, err := ssc.ListRevisions(set)
	if err != nil {
		return err
	}
	history.SortControllerRevisions(revisions)

	// Get the current revision and the latest (update) revision
	currentRevision, updateRevision, collisionCount, err := ssc.getStatefulSetRevisions(set, revisions)
	if err != nil {
		return err
	}

	// Core method: operate on the pods
	status, err := ssc.updateStatefulSet(set, currentRevision, updateRevision, collisionCount, pods)
	if err != nil {
		return err
	}

	// Afterwards, record the result in sts.status
	err = ssc.updateStatefulSetStatus(set, status)
	if err != nil {
		return err
	}

	klog.V(4).Infof("StatefulSet %s/%s pod status replicas=%d ready=%d current=%d updated=%d",
		set.Namespace,
		set.Name,
		status.Replicas,
		status.ReadyReplicas,
		status.CurrentReplicas,
		status.UpdatedReplicas)

	klog.V(4).Infof("StatefulSet %s/%s revisions current=%s update=%s",
		set.Namespace,
		set.Name,
		status.CurrentRevision,
		status.UpdateRevision)

	// Maintain the set's revision history
	return ssc.truncateHistory(set, pods, revisions, currentRevision, updateRevision)
}

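The core function here is updateStatefulSet. Let's continue with it.

updateStatefulSet

This function is long, more than 200 lines; the points worth noting are annotated as comments in the code below.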

func (ssc *defaultStatefulSetControl) updateStatefulSet(
	set *apps.StatefulSet,
	currentRevision *apps.ControllerRevision,
	updateRevision *apps.ControllerRevision,
	collisionCount int32,
	pods []*v1.Pod) (*apps.StatefulSetStatus, error) {
  
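	// Build currentSet (the sts at the current revision) and updateSet (the sts
	// at the revision to update to). The intended outcome is:
	//
	// 1. RollingUpdate without a partition: the pods on the current revision shrink to 0 and the number of ready pods in updateSet reaches spec.replicas
	// 2. RollingUpdate with a partition: pods with ordinal >= partition move to updateSet, pods with ordinal < partition stay on currentSet
	// 3. OnDelete: do nothing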
	currentSet, err := ApplyRevision(set, currentRevision)
	if err != nil {
		return nil, err
	}
  
	updateSet, err := ApplyRevision(set, updateRevision)
	if err != nil {
		return nil, err
	}

	// set the generation, and revisions in the returned status
	// Recompute the sts status
	status := apps.StatefulSetStatus{}
	status.ObservedGeneration = set.Generation
	status.CurrentRevision = currentRevision.Name
	status.UpdateRevision = updateRevision.Name
	status.CollisionCount = new(int32)
	*status.CollisionCount = collisionCount

	replicaCount := int(*set.Spec.Replicas)
	// replicas holds the valid pods, i.e. those with 0 <= ordinal < sts.spec.replicas. Every pod in this slice must end up Ready
	replicas := make([]*v1.Pod, replicaCount)
	// condemned holds the invalid pods, i.e. those with ordinal >= sts.spec.replicas. These pods are to be deleted (for example after a scale-in)
	condemned := make([]*v1.Pod, 0, len(pods))
	unhealthy := 0
	firstUnhealthyOrdinal := math.MaxInt32
	var firstUnhealthyPod *v1.Pod

	// First we partition pods into two lists valid replicas and condemned Pods
	for i := range pods {
		status.Replicas++

		// Count status.ReadyReplicas
		if isRunningAndReady(pods[i]) {
			status.ReadyReplicas++
		}

		if isCreated(pods[i]) && !isTerminating(pods[i]) {
			// Use the pod's controller-revision-hash label to tell whether it belongs to currentSet or updateSet, and count each
			if getPodRevision(pods[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(pods[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}
		}

		if ord := getOrdinal(pods[i]); 0 <= ord && ord < replicaCount {
			// Pods with 0 <= ordinal < sts.spec.replicas go into the replicas slice
			replicas[ord] = pods[i]

		} else if ord >= replicaCount {
			// Pods with ordinal >= sts.spec.replicas go into the condemned slice; they are to be deleted
			condemned = append(condemned, pods[i])
		}
	}

 
	// Any empty slot in the replicas slice must be filled with a new pod object.
	// currentSet.replicas, updateSet.replicas and partition decide whether that pod is built from the current revision or the update revision
	for ord := 0; ord < replicaCount; ord++ {
		if replicas[ord] == nil {
			replicas[ord] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name, ord)
		}
	}

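	// Sort the condemned pods by ascending ordinal; the deletion loop further down walks this slice from the end, so the highest ordinal is deleted first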
	sort.Sort(ascendingOrdinal(condemned))

	// Count the unhealthy pods in both slices and remember the one with the lowest ordinal (firstUnhealthyPod); the ordered scale-down below uses it
	for i := range replicas {
		if !isHealthy(replicas[i]) {
			unhealthy++
			if ord := getOrdinal(replicas[i]); ord < firstUnhealthyOrdinal {
				firstUnhealthyOrdinal = ord
				firstUnhealthyPod = replicas[i]
			}
		}
	}

	for i := range condemned {
		if !isHealthy(condemned[i]) {
			unhealthy++
			if ord := getOrdinal(condemned[i]); ord < firstUnhealthyOrdinal {
				firstUnhealthyOrdinal = ord
				firstUnhealthyPod = condemned[i]
			}
		}
	}

	if unhealthy > 0 {
		klog.V(4).Infof("StatefulSet %s/%s has %d unhealthy Pods starting with %s",
			set.Namespace,
			set.Name,
			unhealthy,
			firstUnhealthyPod.Name)
	}

	// If the StatefulSet is being deleted, don't do anything other than updating
	// status.
	if set.DeletionTimestamp != nil {
		return &status, nil
	}

	monotonic := !allowsBurst(set)

	// Walk through the pods in ordinal order, checking and acting on each
	for i := range replicas {
		// Delete and recreate pods in a Failed state
		if isFailed(replicas[i]) {
			ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
				"StatefulSet %s/%s is recreating failed Pod %s",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas--
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas--
			}
			status.Replicas--
			replicas[i] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name,
				i)
		}
		// If the pod has not been created yet (possibly one just filled in above), create it
		if !isCreated(replicas[i]) {
			if err := ssc.podControl.CreateStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			status.Replicas++
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}

			// If burst is not allowed, return right away
			if monotonic {
				return &status, nil
			}
			// pod created, no more work possible for this round
			continue
		}
		// If burst is not allowed, do nothing for a terminating pod; wait until it is gone and handle it in the next round.
		if isTerminating(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// If the pod is still starting (not yet Ready), likewise do nothing, because creation must proceed strictly in order
		if !isRunningAndReady(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		
		// If the pod's identity already matches the sts and its storage satisfies the sts's requirements, the pod is fine; continue
		if identityMatches(set, replicas[i]) && storageMatches(set, replicas[i]) {
			continue
		}
    
		// Otherwise make sure the pod's identity and labels are tied to the sts and that the PVCs it needs are in place
		replica := replicas[i].DeepCopy()
		if err := ssc.podControl.UpdateStatefulPod(updateSet, replica); err != nil {
			return &status, err
		}
	}

  
	// With the valid replicas taken care of, delete the condemned pods in descending ordinal order
	for target := len(condemned) - 1; target >= 0; target-- {
		// A pod that is already terminating is left alone; in ordered mode return and wait for the next round
		if isTerminating(condemned[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate prior to scale down",
				set.Namespace,
				set.Name,
				condemned[target].Name)
			// block if we are in monotonic mode
			if monotonic {
				return &status, nil
			}
			continue
		}
		
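		// If this condemned pod is not Ready, burst is not allowed, and it is not the first unhealthy pod, do nothing. In other words, even unhealthy pods are deleted serially, in descending ordinal order.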
		if !isRunningAndReady(condemned[target]) && monotonic && condemned[target] != firstUnhealthyPod {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready prior to scale down",
				set.Namespace,
				set.Name,
				firstUnhealthyPod.Name)
			return &status, nil
		}
		// Delete this pod and update the status
		klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for scale down",
			set.Namespace,
			set.Name,
			condemned[target].Name)

		if err := ssc.podControl.DeleteStatefulPod(set, condemned[target]); err != nil {
			return &status, err
		}
		if getPodRevision(condemned[target]) == currentRevision.Name {
			status.CurrentReplicas--
		}
		if getPodRevision(condemned[target]) == updateRevision.Name {
			status.UpdatedReplicas--
		}
		if monotonic {
			return &status, nil
		}
	}

	// In OnDelete mode pods are not deleted automatically; a manual pod deletion triggers the update
	if set.Spec.UpdateStrategy.Type == apps.OnDeleteStatefulSetStrategyType {
		return &status, nil
	}

	// After all the filtering and preparation above, check the valid pods in replicas against the update revision
	updateMin := 0
	if set.Spec.UpdateStrategy.RollingUpdate != nil {
		updateMin = int(*set.Spec.UpdateStrategy.RollingUpdate.Partition)
	}
	// Check the pods in descending ordinal order
	for target := len(replicas) - 1; target >= updateMin; target-- {

		// If the pod's revision is not updateRevision, delete it so that it gets recreated
		if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
			klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			err := ssc.podControl.DeleteStatefulPod(set, replicas[target])
			status.CurrentReplicas--
			return &status, err
		}

		// A valid pod that is mid-update and not yet Ready: wait for it
		if !isHealthy(replicas[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			return &status, nil
		}

	}
	return &status, nil
}
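The first step of updateStatefulSet, sorting pods into valid replicas and condemned pods, can be sketched with bare ordinals. `splitByOrdinal` is a hypothetical helper; it tracks only whether a pod exists for each slot, whereas the controller stores the pod objects themselves.

```go
package main

import (
	"fmt"
	"sort"
)

// splitByOrdinal partitions the ordinals of existing pods.
// replicas[ord] is true when a pod exists for that slot; a false entry is a
// gap that would have to be created. condemned lists the ordinals at or above
// replicaCount in ascending order; they would be deleted from the end.
// Hypothetical helper for illustration only.
func splitByOrdinal(ordinals []int, replicaCount int) (replicas []bool, condemned []int) {
	replicas = make([]bool, replicaCount)
	for _, ord := range ordinals {
		if ord >= 0 && ord < replicaCount {
			replicas[ord] = true
		} else if ord >= replicaCount {
			condemned = append(condemned, ord)
		}
	}
	sort.Ints(condemned)
	return replicas, condemned
}

func main() {
	// Pods 0, 2, 3 and 4 exist, and the sts was scaled down to 3 replicas.
	replicas, condemned := splitByOrdinal([]int{4, 0, 3, 2}, 3)
	fmt.Println(replicas, condemned) // [true false true] [3 4]
}
```

In this example pod 1 is missing and would be created, while pods 4 and 3 would be deleted in that order.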

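Summary of updateStatefulSet

  1. In each sync round, at most one pod is operated on (when burst is not allowed, i.e. in the ordered mode)
  2. Pods are divided by comparing their ordinals with sts.spec.replicas: some are valid (kept or recreated), the others are invalid (deleted)
  3. Pods are also divided by revision: some belong to the current (old) set, the others to the update (new) set
  4. During an update, whether pods are being removed or created, the pod count is kept steady and changes happen strictly in ordinal order
  5. In the end, all pods belong to the update revision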
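The points above can be tried out with a toy simulation. It is a heavily simplified sketch, not the controller's algorithm: pods are reduced to revision strings, every existing pod is assumed to be Running and Ready, scale-down and failure handling are left out, partition is assumed to be non-negative, and the names `syncOnce` and `rollout` are invented.

```go
package main

import "fmt"

// syncOnce performs at most one pod operation, imitating one reconcile pass in
// ordered (non-burst) mode. pods[ord] holds the revision of the pod with that
// ordinal, or "" if the pod does not exist. It returns a description of the
// action taken, or "" when nothing is left to do.
func syncOnce(pods []string, currentRev, updateRev string, partition int) string {
	// Phase 1: fill gaps, lowest ordinal first.
	for ord := range pods {
		if pods[ord] == "" {
			rev := updateRev
			if ord < partition {
				rev = currentRev
			}
			pods[ord] = rev
			return fmt.Sprintf("create pod-%d at %s", ord, rev)
		}
	}
	// Phase 2: delete outdated pods, highest ordinal first, down to partition.
	for ord := len(pods) - 1; ord >= partition; ord-- {
		if pods[ord] != updateRev {
			pods[ord] = ""
			return fmt.Sprintf("delete pod-%d", ord)
		}
	}
	return ""
}

// rollout calls syncOnce until it reports no work and returns the action log.
func rollout(pods []string, currentRev, updateRev string, partition int) []string {
	var log []string
	for {
		action := syncOnce(pods, currentRev, updateRev, partition)
		if action == "" {
			return log
		}
		log = append(log, action)
	}
}

func main() {
	pods := []string{"v1", "v1", "v1"}
	for _, action := range rollout(pods, "v1", "v2", 1) {
		fmt.Println(action)
	}
	fmt.Println(pods) // [v1 v2 v2]
}
```

With partition 1 the log reads delete pod-2, create pod-2 at v2, delete pod-1, create pod-1 at v2, and pod-0 stays on v1.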

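Summary

The design of statefulset differs from deployment in many ways, for example:

  • a deployment manages pods through ReplicaSets, whereas an sts manages its pods directly and tracks versions with ControllerRevisions;

  • pod create/update/delete operations are unordered for a deployment, while an sts strictly enforces ordering;

  • an sts has to check that storage matches.

Knowing how an sts manages and operates on its pods before reading the code helps a great deal.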
