P4-ReplicaSet Controller
Preface
The previous article analyzed in detail how the deployment controller works:
Controller-P3-Deployment Controller
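That analysis showed that the deployment controller mostly manages the replica count of the ReplicaSet behind each revision and never manages pods directly. Picking up from there, this chapter analyzes the source code of the ReplicaSet controller.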
ReplicaSet Controller
Initialization
As in the previous article, go straight to the function that initializes the various controllers:
cmd/kube-controller-manager/app/controllermanager.go:343
controllers["replicaset"] = startReplicaSetController
==> cmd/kube-controller-manager/app/apps.go:69
go replicaset.NewReplicaSetController(
// The ReplicaSet controller only watches two kinds of resources: ReplicaSets and Pods.
ctx.InformerFactory.Apps().V1().ReplicaSets(),
ctx.InformerFactory.Core().V1().Pods(),
ctx.ClientBuilder.ClientOrDie("replicaset-controller"),
replicaset.BurstReplicas,
).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop)
Creating the ReplicaSetController
First, look at how NewReplicaSetController builds the controller:
==> pkg/controller/replicaset/replica_set.go:109
func NewReplicaSetController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int) *ReplicaSetController {
eventBroadcaster := record.NewBroadcaster()
eventBroadcaster.StartLogging(klog.Infof)
eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
// See NewBaseController below
return NewBaseController(rsInformer, podInformer, kubeClient, burstReplicas,
apps.SchemeGroupVersion.WithKind("ReplicaSet"),
"replicaset_controller",
"replicaset",
controller.RealPodControl{
KubeClient: kubeClient,
Recorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "replicaset-controller"}),
},
)
}
// NewBaseController is the implementation of NewReplicaSetController with additional injected
// parameters so that it can also serve as the implementation of NewReplicationController.
func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,
gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {
if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
metrics.RegisterMetricAndTrackRateLimiterUsage(metricOwnerName, kubeClient.CoreV1().RESTClient().GetRateLimiter())
}
rsc := &ReplicaSetController{
GroupVersionKind: gvk,
kubeClient: kubeClient,
podControl: podControl,
burstReplicas: burstReplicas,
expectations: controller.NewUIDTrackingControllerExpectations(controller.NewControllerExpectations()),
queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), queueName),
}
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: rsc.enqueueReplicaSet,
UpdateFunc: rsc.updateRS,
DeleteFunc: rsc.enqueueReplicaSet,
})
rsc.rsLister = rsInformer.Lister()
// The informer syncs the watched resources into its local cache; HasSynced reports whether that cache has finished its initial sync
rsc.rsListerSynced = rsInformer.Informer().HasSynced
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: rsc.addPod,
UpdateFunc: rsc.updatePod,
DeleteFunc: rsc.deletePod,
})
rsc.podLister = podInformer.Lister()
// The informer syncs the watched resources into its local cache; HasSynced reports whether that cache has finished its initial sync
rsc.podListerSynced = podInformer.Informer().HasSynced
rsc.syncHandler = rsc.syncReplicaSet
return rsc
}
The part of NewBaseController to focus on is AddEventHandler, which registers add, update, and delete callbacks on each resource's informer, for example addPod, updatePod, and deletePod for pods.
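To make the handler registration concrete, here is a minimal sketch of how add, update, and delete callbacks can funnel into a single key queue. The `eventHandlers` and `keyQueue` types are hypothetical stand-ins written for this note, not the client-go types, and objects are reduced to their `namespace/name` key; the real handlers receive full objects and the pod handlers first resolve the owning rs.

```go
package main

import "fmt"

// eventHandlers is a hypothetical stand-in for cache.ResourceEventHandlerFuncs,
// with objects reduced to their "namespace/name" key.
type eventHandlers struct {
	AddFunc    func(key string)
	UpdateFunc func(oldKey, newKey string)
	DeleteFunc func(key string)
}

// keyQueue is a toy work queue that ignores keys that are already waiting.
type keyQueue struct {
	pending map[string]bool
	order   []string
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already pending.
func (q *keyQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// newHandlers wires all three callbacks to enqueue the affected key.
func newHandlers(q *keyQueue) eventHandlers {
	return eventHandlers{
		AddFunc:    func(key string) { q.Add(key) },
		UpdateFunc: func(_, newKey string) { q.Add(newKey) },
		DeleteFunc: func(key string) { q.Add(key) },
	}
}

func main() {
	q := newKeyQueue()
	h := newHandlers(q)
	h.AddFunc("default/web")
	h.UpdateFunc("default/web", "default/web") // already pending, so not queued twice
	h.DeleteFunc("default/api")
	fmt.Println(q.order)
}
```

Every event only records which key needs attention; the actual reconciliation happens later in the worker, which is why a burst of events for one rs costs a single sync.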
The ReplicaSetController Run method
Next, with the ReplicaSetController object created, look at how it runs, that is, its Run method.
==> pkg/controller/replicaset/replica_set.go:177
// Run begins watching and syncing.
func (rsc *ReplicaSetController) Run(workers int, stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
defer rsc.queue.ShutDown()
controllerName := strings.ToLower(rsc.Kind)
klog.Infof("Starting %v controller", controllerName)
defer klog.Infof("Shutting down %v controller", controllerName)
// Wait until the cache of every informer has finished syncing
if !controller.WaitForCacheSync(rsc.Kind, stopCh, rsc.podListerSynced, rsc.rsListerSynced) {
return
}
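// The worker count defaults to 5, so 5 workers are started. wait.Until reruns rsc.worker one second after it returns, so each worker keeps checking and converging rs state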
for i := 0; i < workers; i++ {
go wait.Until(rsc.worker, time.Second, stopCh)
}
<-stopCh
}
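At this point you can see that ReplicaSetController.Run() closely resembles DeploymentController.Run() from the previous article. From here on, steps that are similar across controllers may be skipped rather than explained in detail every time.
Tracing upward, the default number of workers turns out to be 5, as shown here: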
pkg/controller/apis/config/v1alpha1/defaults.go:219
func SetDefaults_ReplicaSetControllerConfiguration(obj *kubectrlmgrconfigv1alpha1.ReplicaSetControllerConfiguration) {
if obj.ConcurrentRSSyncs == 0 {
obj.ConcurrentRSSyncs = 5
}
}
The wait.Until() function is an interesting one. The previous article analyzed it closely, and it is worth revisiting here:
[The wait.Until loop timer function](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/controller/Kubernetes源碼學習-Controller-P3-Controller分類與Deployment Controller.md#waituntil循環計時器函數)
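As a quick refresher, the loop behaviour can be sketched with the standard library alone. The `until` helper below is a hypothetical simplification written for this note, not the implementation in k8s.io/apimachinery: it calls the function, waits for the period, and repeats until the stop channel is closed.

```go
package main

import (
	"fmt"
	"time"
)

// until is a simplified, hypothetical stand-in for wait.Until: it calls f,
// waits for period, and repeats until stopCh is closed.
func until(f func(), period time.Duration, stopCh <-chan struct{}) {
	for {
		// Check for shutdown before every run.
		select {
		case <-stopCh:
			return
		default:
		}
		f()
		// Wait for the period, but wake up early on shutdown.
		select {
		case <-stopCh:
			return
		case <-time.After(period):
		}
	}
}

// countRuns drives until with a function that closes the stop channel on its
// limit-th call and returns how many times the function ran (limit must be >= 1).
func countRuns(limit int) int {
	stopCh := make(chan struct{})
	runs := 0
	until(func() {
		runs++
		if runs == limit {
			close(stopCh)
		}
	}, time.Millisecond, stopCh)
	return runs
}

func main() {
	fmt.Println(countRuns(3)) // prints 3
}
```

In the controller, the function passed in is rsc.worker, which itself loops until the queue shuts down, so the one-second period only matters when a worker returns.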
Now, on to the main topic: the rsc.worker function.
Work logic
pkg/controller/replicaset/replica_set.go:190
for i := 0; i < workers; i++ {
go wait.Until(rsc.worker, time.Second, stopCh)
}
==> pkg/controller/replicaset/replica_set.go:432
// processNextWorkItem() takes objects off the informer-fed work queue, processes them according to their declared spec, and marks them done.
func (rsc *ReplicaSetController) worker() {
for rsc.processNextWorkItem() {
}
}
==> pkg/controller/replicaset/replica_set.go:437
func (rsc *ReplicaSetController) processNextWorkItem() bool {
// Take the element at the head of the work queue
key, quit := rsc.queue.Get()
if quit {
return false
}
defer rsc.queue.Done(key)
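// syncHandler processes each queued key, and the queue guarantees that a given key is handled by only one goroutine at a time (no concurrent races). "sync" here means applying the object waiting in the work queue to the running environment.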
err := rsc.syncHandler(key.(string))
if err == nil {
rsc.queue.Forget(key)
return true
}
utilruntime.HandleError(fmt.Errorf("Sync %q failed with %v", key, err))
rsc.queue.AddRateLimited(key)
return true
}
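The Forget and AddRateLimited calls above form a retry loop with per-key backoff. The sketch below models only that bookkeeping; `retryTracker` and `process` are hypothetical names, and the 5ms base and 1s cap are illustrative values chosen for the example rather than figures taken from this article.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryTracker is a hypothetical, simplified model of the per-key failure
// bookkeeping behind queue.AddRateLimited and queue.Forget.
type retryTracker struct {
	failures map[string]int
	base     time.Duration
	maxDelay time.Duration
}

func newRetryTracker(base, maxDelay time.Duration) *retryTracker {
	return &retryTracker{failures: map[string]int{}, base: base, maxDelay: maxDelay}
}

// When records one more failure for key and returns the delay before its
// next retry: base doubled once per earlier failure, capped at maxDelay.
func (r *retryTracker) When(key string) time.Duration {
	n := r.failures[key]
	r.failures[key] = n + 1
	delay := r.base
	for i := 0; i < n && delay < r.maxDelay; i++ {
		delay *= 2
	}
	if delay > r.maxDelay {
		delay = r.maxDelay
	}
	return delay
}

// Forget clears the failure history of key.
func (r *retryTracker) Forget(key string) { delete(r.failures, key) }

var errSync = errors.New("sync failed")

func alwaysFail(string) error    { return errSync }
func alwaysSucceed(string) error { return nil }

// process mirrors the shape of processNextWorkItem: a failed sync schedules
// a delayed retry, a successful sync resets the backoff.
func process(r *retryTracker, key string, sync func(string) error) (time.Duration, bool) {
	if err := sync(key); err != nil {
		return r.When(key), true
	}
	r.Forget(key)
	return 0, false
}

func main() {
	r := newRetryTracker(5*time.Millisecond, time.Second)
	for i := 0; i < 3; i++ {
		delay, _ := process(r, "default/web", alwaysFail)
		fmt.Println(delay) // 5ms, then 10ms, then 20ms
	}
	process(r, "default/web", alwaysSucceed)
	delay, _ := process(r, "default/web", alwaysFail)
	fmt.Println(delay) // back to 5ms after the success
}
```

Because the function always returns true after handling a key, one failing rs never stops the worker; it only pushes that key further into the future.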
The key function is syncHandler. Tracing further, this struct field is assigned here:
pkg/controller/replicaset/replica_set.go:163
// NewBaseController is the implementation of NewReplicaSetController with additional injected
// parameters so that it can also serve as the implementation of NewReplicationController.
func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,
gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {
// ... omitted
rsc.syncHandler = rsc.syncReplicaSet
return rsc
}
From there you can find the ReplicaSetController.syncReplicaSet function:
pkg/controller/replicaset/replica_set.go:562
// syncReplicaSet will sync the ReplicaSet with the given key if it has had its expectations fulfilled,
// meaning it did not expect to see any more of its pods created or deleted. This function is not meant to be
// invoked concurrently with the same key.
func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
startTime := time.Now()
defer func() {
klog.V(4).Infof("Finished syncing %v %q (%v)", rsc.Kind, key, time.Since(startTime))
}()
// The key is a string of the form ${NAMESPACE}/${NAME}
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
// Fetch the rs object
rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
if errors.IsNotFound(err) {
klog.V(4).Infof("%v %v has been deleted", rsc.Kind, key)
rsc.expectations.DeleteExpectations(key)
return nil
}
if err != nil {
return err
}
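// Decide whether this rs needs a sync. SatisfiedExpectations uses the expectations mechanism: it returns true when the expectations recorded for the rs are fulfilled or expired, or when none exist.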
rsNeedsSync := rsc.expectations.SatisfiedExpectations(key)
selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
if err != nil {
utilruntime.HandleError(fmt.Errorf("Error converting pod selector to selector: %v", err))
return nil
}
// list all pods to include the pods that don't match the rs`s selector
// anymore but has the stale controller ref.
// TODO: Do the List and Filter in a single pass, or use an index.
// List every pod in the rs namespace. labels.Everything() returns an empty selector, so no label filtering is applied
allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())
if err != nil {
return err
}
// Ignore inactive pods.
// Drop pods that are in an inactive state
filteredPods := controller.FilterActivePods(allPods)
// Use the rs and its selector to claim the pods managed by this rs
filteredPods, err = rsc.claimPods(rs, selector, filteredPods)
if err != nil {
return err
}
var manageReplicasErr error
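// If a sync is needed (expectations satisfied) and the rs is not being deleted, manage its replicas so the rs converges on its declared state
if rsNeedsSync && rs.DeletionTimestamp == nil {
// manageReplicas is the most important function: it creates or deletes the pods that belong to the rs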
manageReplicasErr = rsc.manageReplicas(filteredPods, rs)
}
rs = rs.DeepCopy()
newStatus := calculateStatus(rs, filteredPods, manageReplicasErr)
// Whenever the pods of the rs have changed, the status field of the rs must be updated
updatedRS, err := updateReplicaSetStatus(rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace), rs, newStatus)
if err != nil {
// Multiple things could lead to this update failing. Requeuing the replica set ensures
// Returning an error causes a requeue without forcing a hotloop
return err
}
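// When MinReadySeconds is set, a pod that is already Ready is not counted as Available until MinReadySeconds have elapsed, so the rs status must be refreshed after that delay. enqueueReplicaSetAfter therefore waits MinReadySeconds asynchronously and then pushes the rs back onto the work queue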
if manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 &&
updatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) &&
updatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) {
rsc.enqueueReplicaSetAfter(updatedRS, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second)
}
return manageReplicasErr
}
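The MinReadySeconds condition at the end of syncReplicaSet is easy to misread, so here it is isolated as a pure function. This is a sketch for illustration only: the `rsStatus` struct and `recheckAfter` are hypothetical names that flatten the spec and status fields used in the condition.

```go
package main

import (
	"fmt"
	"time"
)

// rsStatus is a hypothetical flattening of the fields the condition reads:
// Spec.Replicas, Status.ReadyReplicas, Status.AvailableReplicas and
// Spec.MinReadySeconds.
type rsStatus struct {
	Desired         int32
	Ready           int32
	Available       int32
	MinReadySeconds int32
}

// recheckAfter reports whether the rs should be re-enqueued after a delay:
// replica management succeeded, MinReadySeconds is set, every replica is
// Ready, but not every replica is Available yet.
func recheckAfter(s rsStatus, manageErr error) (time.Duration, bool) {
	if manageErr == nil && s.MinReadySeconds > 0 &&
		s.Ready == s.Desired && s.Available != s.Desired {
		return time.Duration(s.MinReadySeconds) * time.Second, true
	}
	return 0, false
}

func main() {
	// All 3 replicas are Ready, only 1 has been Ready for MinReadySeconds.
	delay, ok := recheckAfter(rsStatus{Desired: 3, Ready: 3, Available: 1, MinReadySeconds: 10}, nil)
	fmt.Println(delay, ok) // 10s true

	// Without MinReadySeconds no delayed recheck is needed.
	delay, ok = recheckAfter(rsStatus{Desired: 3, Ready: 3, Available: 3}, nil)
	fmt.Println(delay, ok) // 0s false
}
```

The delayed requeue exists because no pod event fires when a pod crosses the MinReadySeconds threshold, so nothing else would trigger the status refresh.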
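Key takeaway: two functions matter most. SatisfiedExpectations decides whether the sync condition is met, and manageReplicas performs the pod creations and deletions that follow. Let's look at each in turn.
The SatisfiedExpectations function
Before that, you need to understand the Expectations mechanism of the rs controller (rsc for short below). The rsc keeps the expected changes of each rs (for example, 3 replicas expected to be created) in a local cache. Before a sync runs, it checks those expectations, and the sync only proceeds when the condition is met.
Here is the logic of SatisfiedExpectations: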
pkg/controller/controller_utils.go:181
func (r *ControllerExpectations) SatisfiedExpectations(controllerKey string) bool {
// If Expectations exist for this key
if exp, exists, err := r.GetExpectations(controllerKey); exists {
// The Expectations are fulfilled or expired, so a sync is needed
if exp.Fulfilled() {
klog.V(4).Infof("Controller expectations fulfilled %#v", exp)
return true
} else if exp.isExpired() {
klog.V(4).Infof("Controller expectations expired %#v", exp)
return true
} else {
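// Expectations exist but are not yet fulfilled, so no sync is needed. The event handlers create and update the Expectations as pods are added and deleted, which means the rs is still converging, so this round is skipped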
klog.V(4).Infof("Controller still waiting on expectations %#v", exp)
return false
}
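} else if err != nil {
// Fetching the Expectations failed here, or none exist in the else branch below (a newly added object); both cases fall through to return true and are treated as needing a sync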
klog.V(2).Infof("Error encountered while checking expectations %#v, forcing sync", err)
} else {
klog.V(4).Infof("Controller %v either never recorded expectations, or the ttl expired.", controllerKey)
}
return true
}
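The mechanism is easier to follow with a small model. The sketch below is a simplified, single-goroutine approximation written for this note: `tracker` and its methods are hypothetical names, the clock is injected so the example is deterministic, and the 5 minute TTL is an assumption used for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// expectation counts the creations and deletions a controller still expects
// to observe, plus the time at which they were recorded.
type expectation struct {
	add, del  int
	timestamp time.Time
}

// tracker is a hypothetical, non-thread-safe model of the expectations cache.
type tracker struct {
	ttl   time.Duration
	now   func() time.Time
	store map[string]*expectation
}

func newTracker(ttl time.Duration, now func() time.Time) *tracker {
	return &tracker{ttl: ttl, now: now, store: map[string]*expectation{}}
}

// ExpectCreations records that n pod creations are pending for key.
func (t *tracker) ExpectCreations(key string, n int) {
	t.store[key] = &expectation{add: n, timestamp: t.now()}
}

// CreationObserved lowers the pending creation count, as the pod add handler would.
func (t *tracker) CreationObserved(key string) {
	if e, ok := t.store[key]; ok {
		e.add--
	}
}

// Satisfied reports whether a sync may run for key: nothing recorded,
// everything observed, or the record is older than the TTL.
func (t *tracker) Satisfied(key string) bool {
	e, ok := t.store[key]
	if !ok {
		return true
	}
	if e.add <= 0 && e.del <= 0 {
		return true
	}
	return t.now().Sub(e.timestamp) > t.ttl
}

// scenario walks one key through the states described above.
func scenario() []bool {
	clock := time.Unix(0, 0)
	t := newTracker(5*time.Minute, func() time.Time { return clock })
	key := "default/web"
	var out []bool
	out = append(out, t.Satisfied(key)) // no expectations yet: sync
	t.ExpectCreations(key, 2)
	out = append(out, t.Satisfied(key)) // 2 creations pending: skip
	t.CreationObserved(key)
	out = append(out, t.Satisfied(key)) // 1 creation pending: skip
	t.CreationObserved(key)
	out = append(out, t.Satisfied(key)) // fulfilled: sync
	t.ExpectCreations(key, 1)
	clock = clock.Add(6 * time.Minute)
	out = append(out, t.Satisfied(key)) // never observed but expired: sync
	return out
}

func main() {
	fmt.Println(scenario()) // [true false false true true]
}
```

The expiry branch is the safety net: if an event is lost and the counts never reach zero, the rs is still synced once the record is old enough.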
The manageReplicas function
==> pkg/controller/replicaset/replica_set.go:459
func (rsc *ReplicaSetController) manageReplicas(filteredPods []*v1.Pod, rs *apps.ReplicaSet) error {
// Difference between the number of pods the rs currently manages and the number of pods the rs declares
diff := len(filteredPods) - int(*(rs.Spec.Replicas))
rsKey, err := controller.KeyFunc(rs)
if err != nil {
utilruntime.HandleError(fmt.Errorf("Couldn't get key for %v %#v: %v", rsc.Kind, rs, err))
return nil
}
// The rs manages fewer pods than it declares, so more pods must be created
if diff < 0 {
diff *= -1
// Each round creates at most burstReplicas pods
if diff > rsc.burstReplicas {
diff = rsc.burstReplicas
}
// Record the expected creations
rsc.expectations.ExpectCreations(rsKey, diff)
klog.V(2).Infof("Too few replicas for %v %s/%s, need %d, creating %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)
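// slowStartBatch starts pods in batches that grow exponentially. controller.SlowStartInitialBatchSize (default 1) is the size of the first batch, and each successful batch doubles the size of the next.
successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {
// CreatePodsWithControllerRef creates a single pod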
err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, metav1.NewControllerRef(rs, rsc.GroupVersionKind))
if err != nil && errors.IsTimeout(err) {
return nil
}
return err
})
if skippedPods := diff - successfulCreations; skippedPods > 0 {
klog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for %v %v/%v", skippedPods, rsc.Kind, rs.Namespace, rs.Name)
for i := 0; i < skippedPods; i++ {
// Decrement the expected number of creates because the informer won't observe this pod
rsc.expectations.CreationObserved(rsKey)
}
}
return err
// The rs manages more pods than it declares, so pods must be deleted
} else if diff > 0 {
if diff > rsc.burstReplicas {
diff = rsc.burstReplicas
}
klog.V(2).Infof("Too many replicas for %v %s/%s, need %d, deleting %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)
// Pick the pods to delete
podsToDelete := getPodsToDelete(filteredPods, diff)
// Record the expected deletions, keyed by the pods that are about to be deleted
rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))
errCh := make(chan error, diff)
var wg sync.WaitGroup
wg.Add(diff)
// Delete the target pods concurrently
for _, pod := range podsToDelete {
go func(targetPod *v1.Pod) {
defer wg.Done()
if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
// Decrement the expected number of deletes because the informer won't observe this deletion
podKey := controller.PodKey(targetPod)
klog.V(2).Infof("Failed to delete %v, decrementing expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)
rsc.expectations.DeletionObserved(rsKey, podKey)
errCh <- err
}
}(pod)
}
wg.Wait()
select {
case err := <-errCh:
// all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.
if err != nil {
return err
}
default:
}
}
return nil
}
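The arithmetic at the top of manageReplicas can be isolated as a small pure function. This is only a sketch: `plan` is a hypothetical helper, and the burst limit of 500 used in the calls is an assumed value for illustration.

```go
package main

import "fmt"

// plan returns how many pods one sync round would create and delete, given
// the pods currently managed, the declared replica count, and the burst limit.
func plan(current, desired, burst int) (create, remove int) {
	diff := current - desired
	switch {
	case diff < 0:
		// Too few pods: create the shortfall, capped by the burst limit.
		create = -diff
		if create > burst {
			create = burst
		}
	case diff > 0:
		// Too many pods: delete the surplus, capped by the burst limit.
		remove = diff
		if remove > burst {
			remove = burst
		}
	}
	return create, remove
}

func main() {
	create, remove := plan(2, 5, 500)
	fmt.Println(create, remove) // 3 0

	create, remove = plan(0, 1200, 500)
	fmt.Println(create, remove) // 500 0: the remaining 700 wait for later rounds

	create, remove = plan(7, 5, 500)
	fmt.Println(create, remove) // 0 2
}
```

The cap means a very large scale-up is spread over several sync rounds, each of which records its own expectations before touching the API server.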
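This is the function that actually manages the number of pod replicas. Its slowStartBatch helper, which starts pods in batches, is worth a closer look.
Starting pods in batches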
pkg/controller/replicaset/replica_set.go:658
func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {
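// Number of calls still to run
remaining := count
// Running total of successful calls
successes := 0
// batchSize is the number of calls run in each batch, starting at the smaller of initialBatchSize (1) and remaining. After each successful batch it doubles, capped by what remains, so it grows exponentially.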
for batchSize := integer.IntMin(remaining, initialBatchSize); batchSize > 0; batchSize = integer.IntMin(2*batchSize, remaining) {
errCh := make(chan error, batchSize)
var wg sync.WaitGroup
wg.Add(batchSize)
for i := 0; i < batchSize; i++ {
go func() {
defer wg.Done()
if err := fn(); err != nil {
errCh <- err
}
}()
}
wg.Wait()
curSuccesses := batchSize - len(errCh)
successes += curSuccesses
// If any call in a batch fails, leave the loop and skip the remaining batches.
if len(errCh) > 0 {
return successes, <-errCh
}
remaining -= batchSize
}
return successes, nil
}
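A worked example makes the batch growth visible. The sketch below reuses the loop header of slowStartBatch but only records the batch sizes, assuming every call succeeds; `batchSizes` and `minInt` are hypothetical helpers written for this note.

```go
package main

import "fmt"

// minInt returns the smaller of a and b.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// batchSizes returns the sizes of the batches the slow-start loop would run
// for count calls, assuming every call succeeds.
func batchSizes(count, initialBatchSize int) []int {
	var sizes []int
	remaining := count
	for batch := minInt(remaining, initialBatchSize); batch > 0; batch = minInt(2*batch, remaining) {
		sizes = append(sizes, batch)
		remaining -= batch
	}
	return sizes
}

func main() {
	fmt.Println(batchSizes(10, 1)) // [1 2 4 3]
	fmt.Println(batchSizes(1, 1))  // [1]
}
```

For 10 pods the batches are 1, 2, 4, and then the 3 that remain. Starting small means that a systematic failure, such as an exhausted quota, costs one failed call instead of ten.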
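ReplicaSetController workflow summary
To sum up, when a new revision of an rs appears, the rsc works through the following steps:
1. SatisfiedExpectations finds no entry for this rs key in the local expectations cache, so it returns true and a sync is needed
2. manageReplicas manages the pods, creating or deleting them
3. It checks whether there are too many or too few pod replicas: delete when there are too many, create when there are too few
4. Before creating or deleting, it records an expectations entry and sets its add / del counts
5. It creates pods through slowStartBatch, or deletes pods concurrently
6. The expectations are updated
While the number of running pod replicas converges toward the declared count, the expectations cache neatly avoids frequent informer queries and the stale-data problems that could come with them. The mechanism is cleverly designed and runs through the entire rsc workflow, and it is also one of the harder parts to understand.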