Kubernetes Source Code Study - Scheduler - P5 - Pod Priority Preemption Scheduling

P5 - Pod Priority Preemption Scheduling

1. Introduction

In the previous two articles we walked through the scheduling algorithms (predicates/priorities). Since Kubernetes v1.8, a pod can be given a priority (initially under the v1alpha1 API). If a high-priority pod cannot be matched to a node because resources are insufficient, it can instead evict some lower-priority pods, preempting their resources in a best-effort attempt to get itself scheduled. This article looks at the pod preemption logic from the code's point of view.

2. Preemption Scheduling Entry Point

In P1 (the entry-point article) we located the entry point of the scheduling algorithm, and then spent two articles dissecting the algorithms themselves. In this article we return to that entry point and keep going:

pkg/scheduler/scheduler.go:457

func (sched *Scheduler) scheduleOne() {
	... // omitted

	// entry point of the scheduling algorithm
	scheduleResult, err := sched.schedule(pod)

	if err != nil {
		// schedule() may have failed because the pod would not fit on any host, so we try to
		// preempt, with the expectation that the next time the pod is tried for scheduling it
		// will fit due to the preemption. It is also possible that a different pod will schedule
		// into the resources that were preempted, but this is harmless.
		if fitError, ok := err.(*core.FitError); ok {
			if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
				klog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
					" No preemption is performed.")
			} else {
				preemptionStartTime := time.Now()
				sched.preempt(pod, fitError) // entry point of the preemption logic
				metrics.PreemptionAttempts.Inc()
				... // omitted
			}
			metrics.PodScheduleFailures.Inc()
		} else {
			klog.Errorf("error selecting node for pod: %v", err)
			metrics.PodScheduleErrors.Inc()
		}
		return
	}
	... // omitted
}

As the comment says, if the predicate phase fails to find a node that fits and returns a FitError, the scheduler enters the priority-based resource preemption logic, whose entry point is sched.preempt(pod, fitError). Before unpacking that logic, let's first look at what pod priority actually is.

2.1. Defining Pod Priority

Taken literally, pod priority lets the scheduler reserve resource room for high-priority pods: when resources are tight, and the other constraints allow it, a high-priority pod will preempt the resources of lower-priority pods. The feature is enabled by default since v1.11. A pod's priority defaults to 0, and higher values mean higher priority. For the full story, see the official documentation:

Pod Priority and Preemption

Here is an example of using pod priority:

# Example PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

# Example Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
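
At the code level, the admission controller resolves priorityClassName into a number and writes it to pod.Spec.Priority; the scheduler then simply reads that field. Below is a minimal sketch of how the effective priority is resolved (the helper name GetPodPriority matches the scheduler's util package in this version; falling back to 0 is an assumption that matches the default described above):

// Minimal sketch: resolve a pod's effective priority.
// v1 is k8s.io/api/core/v1.
func GetPodPriority(pod *v1.Pod) int32 {
	// The admission controller normally copies the PriorityClass value into
	// pod.Spec.Priority; if it is set, use it.
	if pod.Spec.Priority != nil {
		return *pod.Spec.Priority
	}
	// No priority on the pod spec: fall back to the default priority, 0.
	return 0
}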

Now that we know what pod priority is and how to use it, let's see how it is implemented in the code!

3. The Preemption Scheduling Algorithm

Jumping in from the entry point above:

pkg/scheduler/scheduler.go:469 --> pkg/scheduler/scheduler.go:290

func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
	preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
	if err != nil {
		klog.Errorf("Error getting the updated preemptor pod object: %v", err)
		return "", err
	}
  // Run the registered preemption algorithm to work out the node on which preemption will actually be performed, the pods that have to be evicted from it, and so on
	node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)
	if err != nil {
		klog.Errorf("Error preempting victims to make room for %v/%v.", preemptor.Namespace, preemptor.Name)
		return "", err
	}
	var nodeName = ""
	if node != nil {
		nodeName = node.Name
		// Update the scheduling queue with the nominated pod information. Without
		// this, there would be a race condition between the next scheduling cycle
		// and the time the scheduler receives a Pod Update for the nominated pod.
    // Record the nominated node for the preemptor pod in the scheduling queue, to avoid a race with the next scheduling cycle
		sched.config.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)

		// Make a call to update nominated node name of the pod on the API server.
    // Set NominatedNodeName on the pending pod via the API server: pod.Status.NominatedNodeName = nodeName
		err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)
		if err != nil {
			klog.Errorf("Error in preemption process. Cannot update pod %v/%v annotations: %v", preemptor.Namespace, preemptor.Name, err)
			sched.config.SchedulingQueue.DeleteNominatedPodIfExists(preemptor)
			return "", err
		}

		for _, victim := range victims {
      // Delete each victim pod that needs to be evicted from the node
			if err := sched.config.PodPreemptor.DeletePod(victim); err != nil {
				klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
				return "", err
			}
			sched.config.Recorder.Eventf(victim, v1.EventTypeNormal, "Preempted", "by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
		}
		metrics.PreemptionVictims.Set(float64(len(victims)))
	}
	// Clearing nominated pods should happen outside of "if node != nil". Node could
	// be nil when a pod with nominated node name is eligible to preempt again,
	// but preemption logic does not find any node for it. In that case Preempt()
	// function of generic_scheduler.go returns the pod itself for removal of the annotation.
  // nominatedPodsToClear is handled outside of "if node != nil" on purpose. When a node was found, it holds the
  // lower-priority pods nominated to that node; when no node was found but the preemptor already had a nominated
  // node (i.e. it failed to schedule in the previous cycle), it holds the preemptor itself. In either case their
  // pod.Status.NominatedNodeName is removed.
	for _, p := range nominatedPodsToClear {
		rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
		if rErr != nil {
			klog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}
	return nodeName, err
}

Just like the predicate/priority algorithms, the preemption algorithm ultimately has to pick one node on which the preemption will actually be carried out, so let's see how that computation works. Like schedule(), the default implementation of Preempt() also lives in generic_scheduler.go:

pkg/scheduler/core/generic_scheduler.go:288

The function is split into its important parts below, with the rest omitted, and explained piece by piece:

func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
	// ... 省略
  
  // Before each preemption attempt, check whether the pod already has a nominated node and, if so, whether a lower-priority pod on that node is still terminating. If one is, a previous preemption is considered to still be in progress and we do not run another one. The check is simple; a simplified sketch follows this function.
	if !podEligibleToPreemptOthers(pod, g.nodeInfoSnapshot.NodeInfoMap) {
		klog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
		return nil, nil, nil, nil
	}
  // ... 省略
  
  // potentialNodes: find the nodes on which preemption could possibly help; detailed below
	potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)

  // ... 省略
  
  // pdb, PodDisruptionBudget, a mechanism for guaranteeing a number of available replicas; detailed below
	pdbs, err := g.pdbLister.List(labels.Everything())
	if err != nil {
		return nil, nil, nil, err
	}
  
  // The most important part, the preemption computation itself; detailed below
	nodeToVictims, err := selectNodesForPreemption(pod, g.nodeInfoSnapshot.NodeInfoMap, potentialNodes, g.predicates,
		g.predicateMetaProducer, g.schedulingQueue, pdbs)
	if err != nil {
		return nil, nil, nil, err
	}

	// ... 省略
  
  // candidateNode: pick, from all the candidate nodes, the one on which preemption will actually be carried out
	candidateNode := pickOneNodeForPreemption(nodeToVictims)
	if candidateNode == nil {
		return nil, nil, nil, nil
	}

	// Return the chosen node, the pods to evict from it, and the nominated pods in the scheduling queue with lower priority than the current pod
	nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
	if nodeInfo, ok := g.nodeInfoSnapshot.NodeInfoMap[candidateNode.Name]; ok {
		return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, nil
	}

	return nil, nil, nil, fmt.Errorf(
		"preemption failed: the target node %s has been deleted from scheduler cache",
		candidateNode.Name)
}
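
The podEligibleToPreemptOthers check mentioned above is short. Roughly, it behaves like the following simplified sketch: if the pod already has a nominated node and a lower-priority pod on that node is still terminating, the pod is assumed to be waiting for an earlier preemption to finish and is not allowed to preempt again.

// Simplified sketch of podEligibleToPreemptOthers.
func podEligibleToPreemptOthers(pod *v1.Pod, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo) bool {
	nomNodeName := pod.Status.NominatedNodeName
	if len(nomNodeName) > 0 {
		if nodeInfo, found := nodeNameToInfo[nomNodeName]; found {
			for _, p := range nodeInfo.Pods() {
				if p.DeletionTimestamp != nil && util.GetPodPriority(p) < util.GetPodPriority(pod) {
					// A lower-priority victim on the nominated node is still terminating:
					// the earlier preemption has not finished yet, so do not preempt again.
					return false
				}
			}
		}
	}
	return true
}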

3.1. potentialNodes

Step one is to find all the nodes that could potentially take part in preemption. What counts as potential? It means the reason the node failed the predicates for this pod is not a "fatal" one. A fatal reason is one where, even after evicting a few pods, the node still could not run this pod. Which failures are fatal? Let's look at the code:

pkg/scheduler/core/generic_scheduler.go:306 -> pkg/scheduler/core/generic_scheduler.go:1082

func nodesWherePreemptionMightHelp(nodes []*v1.Node, failedPredicatesMap FailedPredicateMap) []*v1.Node {
	potentialNodes := []*v1.Node{}
	for _, node := range nodes {
		unresolvableReasonExist := false
		failedPredicates, _ := failedPredicatesMap[node.Name]
		// If we assume that scheduler looks at all nodes and populates the failedPredicateMap
		// (which is the case today), the !found case should never happen, but we'd prefer
		// to rely less on such assumptions in the code when checking does not impose
		// significant overhead.
		// Also, we currently assume all failures returned by extender as resolvable.
		for _, failedPredicate := range failedPredicates {
			switch failedPredicate {
			case
        // All of the failedPredicates below are considered fatal: if a node failed with any of these reasons, it is excluded from preemption.
				predicates.ErrNodeSelectorNotMatch,
				predicates.ErrPodAffinityRulesNotMatch,
				predicates.ErrPodNotMatchHostName,
				predicates.ErrTaintsTolerationsNotMatch,
				predicates.ErrNodeLabelPresenceViolated,
				// Node conditions won't change when scheduler simulates removal of preemption victims.
				// So, it is pointless to try nodes that have not been able to host the pod due to node
				// conditions. These include ErrNodeNotReady, ErrNodeUnderPIDPressure, ErrNodeUnderMemoryPressure, ....
				predicates.ErrNodeNotReady,
				predicates.ErrNodeNetworkUnavailable,
				predicates.ErrNodeUnderDiskPressure,
				predicates.ErrNodeUnderPIDPressure,
				predicates.ErrNodeUnderMemoryPressure,
				predicates.ErrNodeUnschedulable,
				predicates.ErrNodeUnknownCondition,
				predicates.ErrVolumeZoneConflict,
				predicates.ErrVolumeNodeConflict,
				predicates.ErrVolumeBindConflict:
				unresolvableReasonExist = true
				break
			}
		}
		if !unresolvableReasonExist {
			klog.V(3).Infof("Node %v is a potential node for preemption.", node.Name)
			potentialNodes = append(potentialNodes, node)
		}
	}
	return potentialNodes
}

3.2. Pod Disruption Budget (PDB)

I have not used this resource type in practice myself. According to the official manual, it is another abstraction Kubernetes provides, mainly to guarantee a number of available replicas in the face of voluntary disruptions. It is not the same as a Deployment's maxUnavailable, which protects availability during rolling updates; a PDB targets voluntary disruptions such as deleting a pod or draining a node. For more details, see the official manual:

Specifying a Disruption Budget for your Application

An example of the resource:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

$ kubectl get poddisruptionbudgets
NAME      MIN-AVAILABLE   ALLOWED-DISRUPTIONS   AGE
zk-pdb    2               1                     7s


$ kubectl get poddisruptionbudgets zk-pdb -o yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: 2017-08-28T02:38:26Z
  generation: 1
  name: zk-pdb
...
status:
  currentHealthy: 3
  desiredHealthy: 3
  disruptedPods: null
  disruptionsAllowed: 1
  expectedPods: 3
  observedGeneration: 1

Why does PDB-related logic show up inside preemption at all? Because the designers treat the eviction of lower-priority pods caused by preemption as a voluntary disruption. With that in mind, let's continue.

3.3. nodeToVictims

selectNodesForPreemption() is an important function: it returns a viable eviction plan for every candidate node.

pkg/scheduler/core/generic_scheduler.go:316 selectNodesForPreemption --> pkg/scheduler/core/generic_scheduler.go:916

func selectNodesForPreemption(pod *v1.Pod,
	nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
	potentialNodes []*v1.Node,
	fitPredicates map[string]predicates.FitPredicate,
	metadataProducer predicates.PredicateMetadataProducer,
	queue internalqueue.SchedulingQueue,
	pdbs []*policy.PodDisruptionBudget,
) (map[*v1.Node]*schedulerapi.Victims, error) {
  // The result is a map keyed by *v1.Node; each value is a struct with two fields: the pods to evict on that node and the number of PDB violations that would cause
	nodeToVictims := map[*v1.Node]*schedulerapi.Victims{}
	var resultLock sync.Mutex

	// We can use the same metadata producer for all nodes.
	meta := metadataProducer(pod, nodeNameToInfo)
	checkNode := func(i int) {
		nodeName := potentialNodes[i].Name
		var metaCopy predicates.PredicateMetadata
		if meta != nil {
			metaCopy = meta.ShallowCopy()
		}
    // selectVictimsOnNode() is the function that does the core computation
		pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], fitPredicates, queue, pdbs)
		if fits {
			resultLock.Lock()
			victims := schedulerapi.Victims{
				Pods:             pods,
				NumPDBViolations: numPDBViolations,
			}
			nodeToVictims[potentialNodes[i]] = &victims
			resultLock.Unlock()
		}
	}
  // The familiar parallel worker pattern
	workqueue.ParallelizeUntil(context.TODO(), 16, len(potentialNodes), checkNode)
	return nodeToVictims, nil
}

The important parts are annotated in the code above. The real work happens in selectVictimsOnNode(): for each node it computes the pods that would have to be evicted and how many PDB violations that would cause, and the results are assembled into the nodeToVictims map returned to the caller. So next, let's see how selectVictimsOnNode() works.
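
For reference, the Victims value stored for each node is a small struct, shown here in simplified form (in the source it lives in the scheduler's api package):

// Simplified shape of the per-node result built by selectNodesForPreemption.
type Victims struct {
	Pods             []*v1.Pod // pods that would have to be evicted from the node
	NumPDBViolations int       // how many of those evictions would violate a PodDisruptionBudget
}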

selectVictimsOnNode

func selectVictimsOnNode(
	pod *v1.Pod,
	meta predicates.PredicateMetadata,
	nodeInfo *schedulernodeinfo.NodeInfo,
	fitPredicates map[string]predicates.FitPredicate,
	queue internalqueue.SchedulingQueue,
	pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
	if nodeInfo == nil {
		return nil, 0, false
	}
  
  // The potential victims (pods): a list sorted by priority, highest priority first, lowest last
	potentialVictims := util.SortableList{CompFunc: util.HigherPriorityPod}
  // Before anything is actually applied to the node, all of the resource accounting here is only an estimate, so it must not touch the real node. The node's info is cloned and every calculation works on the copy; only once a valid result is obtained would anything actually be applied to the node.
	nodeInfoCopy := nodeInfo.Clone()
  
  // On the node copy, recompute the data as if a pod had been removed. For example, if node a runs several pods and pod1, which requests 4Gi of memory, is assumed removed, the node can be assumed to have 4Gi more allocatable memory.
	removePod := func(rp *v1.Pod) {
		nodeInfoCopy.RemovePod(rp)
		if meta != nil {
			meta.RemovePod(rp)
		}
	}
  
  // On the node copy, recompute the data as if a pod had been added.
	addPod := func(ap *v1.Pod) {
		nodeInfoCopy.AddPod(ap)
		if meta != nil {
			meta.AddPod(ap, nodeInfoCopy)
		}
	}

  // Step 1: enumerate every pod on the node whose priority is lower than the pending pod's, add them all to potentialVictims, and recompute the node's resources as if they had all been removed
	podPriority := util.GetPodPriority(pod)
	for _, p := range nodeInfoCopy.Pods() {
		if util.GetPodPriority(p) < podPriority {
			potentialVictims.Items = append(potentialVictims.Items, p)
			removePod(p)
		}
	}
	potentialVictims.Sort()

  // Step 2: check whether the pending pod fits this node, mostly affinity-related checks. podFitsOnNode was covered in the predicate article, so it is not repeated here. Once this check passes, the pending pod's resource requests are accounted for in nodeInfoCopy.
	if fits, _, err := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, queue, false); !fits {
		if err != nil {
			klog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
		}
		return nil, 0, false
	}
	var victims []*v1.Pod
	numViolatingVictim := 0
  
  // Step 3: split the sorted list of lower-priority pods into two sorted lists: those whose eviction would violate a PDB (pdb.Status.PodDisruptionsAllowed <= 0, i.e. no further disruption of those replicas is allowed) and those whose eviction would not.
	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims.Items, pdbs)
  
  // Step 4: the enumeration above assumed all the lower-priority pods were removed, but in practice not all of them may need to go. With the pending pod now accounted for, a greedy pass tries to add the lower-priority pods back in priority order, keeping as many as possible; only the ones that still do not fit become the actual victims. Naturally, pods protected by a PDB are reprieved first.
	reprievePod := func(p *v1.Pod) bool {
		addPod(p)
		fits, _, _ := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, queue, false)
		if !fits {
			removePod(p)
			victims = append(victims, p)
			klog.V(5).Infof("Pod %v/%v is a potential preemption victim on node %v.", p.Namespace, p.Name, nodeInfo.Node().Name)
		}
		return fits
	}
	for _, p := range violatingVictims {
		if !reprievePod(p) {
			numViolatingVictim++
		}
	}
	// Now we try to reprieve non-violating victims.
	for _, p := range nonViolatingVictims {
		reprievePod(p)
	}
  // Step 5: return the result for this node: the list of pods to evict, and how many of those evictions violate a PDB
	return victims, numViolatingVictim, true
}

The function works in five steps: it first enumerates all the lower-priority pods, then greedily reprieves as many of them as possible, and whatever remains is the set of pods that must be evicted, along with the related counts. See the comments in the code for details.
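
The helper filterPodsWithPDBViolation used in step 3 is not shown above. Roughly, it checks each candidate victim against the PDBs in its namespace and splits the list accordingly; here is a simplified sketch (the real function operates on the SortableList items, and the pdbs slice is the one obtained from the pdbLister earlier):

// Simplified sketch of filterPodsWithPDBViolation: split the candidate victims into
// pods whose eviction would violate a PDB and pods whose eviction would not.
func filterPodsWithPDBViolation(pods []*v1.Pod, pdbs []*policy.PodDisruptionBudget) (violatingPods, nonViolatingPods []*v1.Pod) {
	for _, pod := range pods {
		violated := false
		for _, pdb := range pdbs {
			// Only PDBs in the same namespace whose selector matches the pod apply.
			if pdb.Namespace != pod.Namespace {
				continue
			}
			selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
			if err != nil || !selector.Matches(labels.Set(pod.Labels)) {
				continue
			}
			// The pod is covered by this PDB; if no further disruptions are allowed,
			// evicting it would violate the budget.
			if pdb.Status.PodDisruptionsAllowed <= 0 {
				violated = true
				break
			}
		}
		if violated {
			violatingPods = append(violatingPods, pod)
		} else {
			nonViolatingPods = append(nonViolatingPods, pod)
		}
	}
	return violatingPods, nonViolatingPods
}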

3.4. candidateNode

Once every candidate node has produced its own preemption plan as above, one node has to be picked to actually carry out the preemption.

pkg/scheduler/core/generic_scheduler.go:330 pickOneNodeForPreemption() --> pkg/scheduler/core/generic_scheduler.go:809

func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*schedulerapi.Victims) *v1.Node {
	if len(nodesToVictims) == 0 {
		return nil
	}
	minNumPDBViolatingPods := math.MaxInt32
	var minNodes1 []*v1.Node
	lenNodes1 := 0
	for node, victims := range nodesToVictims {
		if len(victims.Pods) == 0 {
		  // There is a small chance that, while scheduling was in progress, some pod on a node terminated on its own so that the node no longer needs any evictions; in that case the pod can simply be scheduled onto that node
			return node
		}
		// Rank nodes by the number of PDB violations: the fewer, the better. If exactly one node has the minimum number of violations, return it; otherwise move on to the next round
		numPDBViolatingPods := victims.NumPDBViolations
		if numPDBViolatingPods < minNumPDBViolatingPods {
			minNumPDBViolatingPods = numPDBViolatingPods
			minNodes1 = nil
			lenNodes1 = 0
		}
		if numPDBViolatingPods == minNumPDBViolatingPods {
			minNodes1 = append(minNodes1, node)
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}

	// Rank the remaining nodes by the priority of their first victim (i.e. the highest-priority pod to be evicted): the lower that priority, the better the node. If one node stands out, return it; otherwise move on to the next round
	minHighestPriority := int32(math.MaxInt32)
	var minNodes2 = make([]*v1.Node, lenNodes1)
	lenNodes2 := 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		victims := nodesToVictims[node]
		// highestPodPriority is the highest priority among the victims on this node.
		highestPodPriority := util.GetPodPriority(victims.Pods[0])
		if highestPodPriority < minHighestPriority {
			minHighestPriority = highestPodPriority
			lenNodes2 = 0
		}
		if highestPodPriority == minHighestPriority {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	if lenNodes2 == 1 {
		return minNodes2[0]
	}

	// Rank by the sum of the priorities of all victims on each node: the smaller the sum, the better the node. If one node stands out, return it; otherwise move on to the next round
	minSumPriorities := int64(math.MaxInt64)
	lenNodes1 = 0
	for i := 0; i < lenNodes2; i++ {
		var sumPriorities int64
		node := minNodes2[i]
		for _, pod := range nodesToVictims[node].Pods {
			// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
			// needed so that a node with a few pods with negative priority is not
			// picked over a node with a smaller number of pods with the same negative
			// priority (and similar scenarios).
			sumPriorities += int64(util.GetPodPriority(pod)) + int64(math.MaxInt32+1)
		}
		if sumPriorities < minSumPriorities {
			minSumPriorities = sumPriorities
			lenNodes1 = 0
		}
		if sumPriorities == minSumPriorities {
			minNodes1[lenNodes1] = node
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}

	// Rank by the number of victims on each node: the fewer, the better the node. If one node stands out, return it; otherwise move on to the next round
	minNumPods := math.MaxInt32
	lenNodes2 = 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		numPods := len(nodesToVictims[node].Pods)
		if numPods < minNumPods {
			minNumPods = numPods
			lenNodes2 = 0
		}
		if numPods == minNumPods {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	// If more than one node is still left after the four rounds above, just take the first of them as the node on which preemption will be carried out
	if lenNodes2 > 0 {
		return minNodes2[0]
	}
	klog.Errorf("Error in logic of node scoring for preemption. We should never reach here!")
	return nil
}

Putting the code and comments together, the function performs a very thorough comparison, ranking the nodes through the following rounds to settle on the single node on which the pending pod's preemption will be carried out:

1. Rank by the number of PDB violations.

2. Rank by the priority of the first victim on each node (i.e. the highest-priority pod to be evicted).

3. Rank by the sum of the priorities of all victims on each node.

4. Rank by the number of victims on each node.

5. If more than one node survives all four rounds, pick the first of them as the final node (see the example below).
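
As a quick hypothetical example: suppose node A would have to evict two pods of priority 0 and node B one pod of priority 100, with no PDB violations on either side. Round 1 is a tie (zero violations each), but round 2 picks node A, because its highest-priority victim (priority 0) is lower than node B's (priority 100), and the later rounds never run.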

4. Summary

The preemption scheduling logic is remarkably careful and well thought out. For example:

1. From the resource-accounting point of view:

  • All calculations run against a clone of the nodeInfo snapshot; everything is a pre-computation until the result is finally applied
  • All lower-priority pods are first enumerated out, guaranteeing the pending pod can claim the resources it needs
  • Once the pending pod fits, as many lower-priority pods as possible are reprieved so they can keep running alongside it

2. From the node-selection point of view:

  • A four-round filter picks the node where the evictions cause the least damage

End of this chapter, thanks for reading!
