P3 - Node Filtering Algorithm
Preface
In the previous article we located the entry point of the scheduler's node filtering algorithm: the Schedule() method at pkg/scheduler/core/generic_scheduler.go:162.
In this article we start from that Schedule() function and walk through the node filtering algorithm; the priority (scoring) algorithm is left for the next article.
Main Text
The core of Schedule()'s filtering is the findNodesThatFit() method, so let's jump straight to it:
pkg/scheduler/core/generic_scheduler.go:184
--> pkg/scheduler/core/generic_scheduler.go:435
The comments below call out the key points; some code is omitted to keep things short:
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
	var filtered []*v1.Node
	failedPredicateMap := FailedPredicateMap{}

	if len(g.predicates) == 0 {
		filtered = nodes
	} else {
		allNodes := int32(g.cache.NodeTree().NumNodes())
		// Number of nodes to examine (see numFeasibleNodesToFind for details):
		// when the cluster has fewer than 100 nodes, all of them are checked;
		// in larger clusters only a configured percentage is checked, and the
		// search continues if no feasible node is found within that percentage.
		numNodesToFind := g.numFeasibleNodesToFind(allNodes)

		... // omitted

		ctx, cancel := context.WithCancel(context.Background())

		// The anonymous function that filters a single node; the core logic
		// lives in the inner podFitsOnNode function.
		checkNode := func(i int) {
			nodeName := g.cache.NodeTree().Next()
			fits, failedPredicates, err := podFitsOnNode(
				pod,
				meta,
				g.nodeInfoSnapshot.NodeInfoMap[nodeName],
				g.predicates,
				g.schedulingQueue,
				g.alwaysCheckAllPredicates,
			)
			if err != nil {
				predicateResultLock.Lock()
				errs[err.Error()]++
				predicateResultLock.Unlock()
				return
			}
			if fits {
				length := atomic.AddInt32(&filteredLen, 1)
				if length > numNodesToFind {
					cancel()
					atomic.AddInt32(&filteredLen, -1)
				} else {
					filtered[length-1] = g.nodeInfoSnapshot.NodeInfoMap[nodeName].Node()
				}
			} else {
				predicateResultLock.Lock()
				failedPredicateMap[nodeName] = failedPredicates
				predicateResultLock.Unlock()
			}
		}

		// Mark this line: the filtering runs concurrently; we will look at how
		// the concurrency is designed in a moment.
		// Stops searching for more nodes once the configured number of feasible nodes
		// are found.
		workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)
	}

	// Scheduler extender logic, e.g. custom external filtering or
	// prioritizing algorithms.
	if len(filtered) > 0 && len(g.extenders) != 0 {
		... // omitted
	}

	// Return the results.
	return filtered, failedPredicateMap, nil
}
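As an aside, to make the sampling behaviour described in the comment above concrete, here is a simplified, hedged sketch of what numFeasibleNodesToFind computes. The constant and the formula are paraphrased, not copied from the source; in particular the real function can also derive an adaptive percentage when none is configured:

// Simplified sketch, not the verbatim source: clusters below 100 nodes are
// always checked in full; larger clusters are sampled at the configured
// percentage, with a floor of 100 nodes.
const minFeasibleNodesToFind = 100

func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}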
At a glance the core of the anonymous function is clearly podFitsOnNode(), yet podFitsOnNode() is not executed directly; it is wrapped in another function. The wrapper's job is to fetch the node to examine via nodeName := g.cache.NodeTree().Next(), hand it to podFitsOnNode(), and then process podFitsOnNode's return values. Once you look at the concurrency machinery just below it, workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode), the benefit of this wrapping becomes clear. Let's look inside the concurrency implementation:
vendor/k8s.io/client-go/util/workqueue/parallelizer.go:38
func ParallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece DoWorkPieceFunc) {
	var stop <-chan struct{}
	if ctx != nil {
		stop = ctx.Done()
	}

	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	if pieces < workers {
		workers = pieces
	}

	wg := sync.WaitGroup{}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer utilruntime.HandleCrash()
			defer wg.Done()
			for piece := range toProcess {
				select {
				case <-stop:
					return
				default:
					doWorkPiece(piece)
				}
			}
		}()
	}
	wg.Wait()
}
Key takeaways:
1. What is chan struct{}? A channel whose element type is struct{}. The empty struct occupies no memory, so such channels are commonly used purely to pass signals between goroutines. See: https://dave.cheney.net/2014/03/25/the-empty-struct
2. ParallelizeUntil takes 4 parameters: the parent context, the max number of workers, the number of pieces (tasks), and the function each task executes. It starts worker goroutines, at most max workers of them, which cooperatively work through the given number of pieces, each piece running the task function. In other words, ParallelizeUntil is only responsible for the degree of concurrency; the object each task operates on must be fetched by the task function itself. That is why the checkNode anonymous function above obtains its work item via nodeName := g.cache.NodeTree().Next(): the g.cache.NodeTree() object must maintain an internal cursor that hands out the object needed by the current task. The concurrency granularity here is one node per piece.
This implementation style cleanly decouples the concurrency machinery from the actual work: as long as the task function manages its own cursor, ParallelizeUntil() can be reused for concurrency control.
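To see that decoupling in action, here is a minimal, self-contained usage sketch. nodeCursor and the node names are hypothetical stand-ins for g.cache.NodeTree(); note that ctx.Done() is exactly the chan struct{} signalling idiom from point 1:

package main

import (
	"context"
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

// nodeCursor plays the role of g.cache.NodeTree(): a shared, goroutine-safe
// cursor that hands out the next work item on demand.
type nodeCursor struct {
	nodes []string
	next  int64
}

func (c *nodeCursor) Next() string {
	i := atomic.AddInt64(&c.next, 1) - 1
	return c.nodes[i]
}

func main() {
	cursor := &nodeCursor{nodes: []string{"node-a", "node-b", "node-c", "node-d"}}
	// Like checkNode, the worker ignores the piece index and pulls its work
	// item from the shared cursor instead.
	check := func(_ int) {
		fmt.Println("checking", cursor.Next())
	}
	// Two workers cooperatively drain four pieces; cancelling the context
	// would close its Done() channel and stop the remaining pieces early.
	workqueue.ParallelizeUntil(context.Background(), 2, len(cursor.nodes), check)
}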
Let's see how checkNode() obtains the node for each worker goroutine:
pkg/scheduler/core/generic_scheduler.go:460 --> pkg/scheduler/internal/cache/node_tree.go:161
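The code at node_tree.go:161 looks roughly like the following (a condensed sketch of NodeTree.Next() with error handling trimmed; field names follow the 1.14 source):

// Next returns the name of the next node. NodeTree iterates over zones
// round-robin, and each zone hands out its nodes through the nodeArray
// cursor shown further below; when every zone is exhausted, all cursors
// are reset and iteration starts over.
func (nt *NodeTree) Next() string {
	nt.mu.Lock()
	defer nt.mu.Unlock()
	if len(nt.zones) == 0 {
		return ""
	}
	numExhaustedZones := 0
	for {
		if nt.zoneIndex >= len(nt.zones) {
			nt.zoneIndex = 0
		}
		zone := nt.zones[nt.zoneIndex]
		nt.zoneIndex++
		nodeName, exhausted := nt.tree[zone].next()
		if exhausted {
			numExhaustedZones++
			if numExhaustedZones >= len(nt.zones) {
				nt.resetExhausted() // all zones exhausted: reset and start over
			}
		} else {
			return nodeName
		}
	}
}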
As you can see, there is a logical layer called zone here, which may look unfamiliar; some googling reveals this rather obscure feature: it is a lightweight way of supporting a federation-like, multi-zone cluster topology. A single cluster can span multiple zones, but only GCE and AWS currently support this, and the vast majority of use cases never touch it. By default a cluster belongs to exactly one zone, so cluster and zone can be treated as the same level; whenever zone-related layers show up below, we will simply step over them. If you are interested in the zone concept, see:
https://kubernetes.io/docs/setup/best-practices/multiple-zones/
Continuing down: pkg/scheduler/internal/cache/node_tree.go:176
--> pkg/scheduler/internal/cache/node_tree.go:47
// nodeArray is a struct that has nodes that are in a zone.
// We use a slice (as opposed to a set/map) to store the nodes because iterating over the nodes is
// a lot more frequent than searching them by name.
type nodeArray struct {
	nodes     []string
	lastIndex int
}

func (na *nodeArray) next() (nodeName string, exhausted bool) {
	if len(na.nodes) == 0 {
		klog.Error("The nodeArray is empty. It should have been deleted from NodeTree.")
		return "", false
	}
	if na.lastIndex >= len(na.nodes) {
		return "", true
	}
	nodeName = na.nodes[na.lastIndex]
	na.lastIndex++
	return nodeName, false
}
Sure enough, the nodeArray struct maintains a lastIndex cursor for handing out nodes, confirming the guess above.
Back at pkg/scheduler/core/generic_scheduler.go:461, we now step into podFitsOnNode proper:
func podFitsOnNode(
	pod *v1.Pod,
	meta predicates.PredicateMetadata,
	info *schedulernodeinfo.NodeInfo,
	predicateFuncs map[string]predicates.FitPredicate,
	queue internalqueue.SchedulingQueue,
	alwaysCheckAllPredicates bool,
) (bool, []predicates.PredicateFailureReason, error) {
	var failedPredicates []predicates.PredicateFailureReason

	podsAdded := false
	for i := 0; i < 2; i++ {
		metaToUse := meta
		nodeInfoToUse := info
		if i == 0 {
			// First pass: run the predicates against a copy of the node info
			// that includes the nominated pods with priority >= this pod.
			podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(pod, meta, info, queue)
		} else if !podsAdded || len(failedPredicates) != 0 {
			break
		}
		for _, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []predicates.PredicateFailureReason
				err     error
			)
			//TODO (yastij) : compute average predicate restrictiveness to export it as Prometheus metric
			if predicate, exist := predicateFuncs[predicateKey]; exist {
				fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
				if err != nil {
					return false, []predicates.PredicateFailureReason{}, err
				}
				... // omitted
			}
		}
	}

	return len(failedPredicates) == 0, failedPredicates, nil
}
Comments and some code are omitted above. Based on the comments inside the podFitsOnNode function, a few notes:
1. Specifying a scheduling priority for a pod via pod.spec.priority went GA in 1.14, and all of the scheduling-related features here take pod priority into account. Because of priority, besides the normal Schedule action there is also a Preempt (preemption) action, and this podFitsOnNode() method is called from both places.
2. During a Schedule pass, the scheduler takes the pods already on the current node, compares their priority with the pod being scheduled, keeps every pod whose priority is greater than or equal to the nominated pod's, and adds their resource requests to the nominated pod's requests; the predicate calculations then run against this aggregate. For example, suppose node A has memory capacity 128Gi and currently carries 20 pods, 10 of which have priority >= the nominated pod with sum(request.memory) = 100Gi. If the nominated pod's request.memory = 32Gi, then (100+32) > 128, so the memory check fails and the node is filtered out; if the nominated pod's request.memory = 16Gi, then (100+16) < 128 and the memory check passes (see the sketch after this list). But what about the remaining 10 lower-priority pods, which also occupy memory? The handling is: if their usage leaves the node with too few resources to schedule the nominated pod, the scheduler can evict them from the node. That is Preempt; preemption will be covered in a later article.
3. For each nominated pod, the filtering pass is executed one extra time. Why repeat it? Consider scenarios involving inter-pod affinity: if pod A has affinity for pod B and they are scheduled to the node together, but pod B has not actually finished scheduling and starting yet, pod A's inter-pod affinity predicates would be bound to fail. Repeating the filtering pass once is therefore necessary.
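To make the arithmetic in point 2 concrete, here is a toy sketch of the aggregation. fitsMemory and its parameters are hypothetical; the real predicate works on full resource vectors, not just memory:

// fitsMemory mirrors the aggregation described above: sum the memory requests
// of the pods whose priority is >= the nominated pod, add the nominated pod's
// own request, and compare the total against the node's capacity.
func fitsMemory(nodeCapGi int64, higherOrEqualPriorityReqsGi []int64, nominatedReqGi int64) bool {
	var sumGi int64
	for _, r := range higherOrEqualPriorityReqsGi {
		sumGi += r
	}
	return sumGi+nominatedReqGi <= nodeCapGi
}

// With the numbers from the example (10 pods summing to 100Gi on a 128Gi node):
//   fitsMemory(128, reqs, 32) == false // 100+32 > 128: memory predicate fails
//   fitsMemory(128, reqs, 16) == true  // 100+16 < 128: memory predicate passes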
With that understanding, let's continue with the code. At pkg/scheduler/core/generic_scheduler.go:608 the predicates proper start running one by one. So where are the predicate functions, and where is their order defined? As covered in the previous article (P2, the framework), the default scheduling algorithms all live under pkg/scheduler/algorithm/. Let's keep going.
Predicates Ordering / Predicates Function
The predicate keys, functions, and ordering are all concentrated in the file pkg/scheduler/algorithm/predicates/predicates.go.
Predicate ordering:
pkg/scheduler/algorithm/predicates/predicates.go:142
// default predicate ordering
var (
	predicatesOrdering = []string{CheckNodeConditionPred, CheckNodeUnschedulablePred,
		GeneralPred, HostNamePred, PodFitsHostPortsPred,
		MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
		PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
		CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
		MaxAzureDiskVolumeCountPred, MaxCinderVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
		CheckNodeMemoryPressurePred, CheckNodePIDPressurePred, CheckNodeDiskPressurePred, MatchInterPodAffinityPred}
)
Predicate keys, with the official comments:
const (
	// MatchInterPodAffinityPred defines the name of predicate MatchInterPodAffinity.
	MatchInterPodAffinityPred = "MatchInterPodAffinity"
	// CheckVolumeBindingPred defines the name of predicate CheckVolumeBinding.
	CheckVolumeBindingPred = "CheckVolumeBinding"
	// CheckNodeConditionPred defines the name of predicate CheckNodeCondition.
	CheckNodeConditionPred = "CheckNodeCondition"
	// GeneralPred defines the name of predicate GeneralPredicates.
	GeneralPred = "GeneralPredicates"
	// HostNamePred defines the name of predicate HostName.
	HostNamePred = "HostName"
	// PodFitsHostPortsPred defines the name of predicate PodFitsHostPorts.
	PodFitsHostPortsPred = "PodFitsHostPorts"
	// MatchNodeSelectorPred defines the name of predicate MatchNodeSelector.
	MatchNodeSelectorPred = "MatchNodeSelector"
	// PodFitsResourcesPred defines the name of predicate PodFitsResources.
	PodFitsResourcesPred = "PodFitsResources"
	// NoDiskConflictPred defines the name of predicate NoDiskConflict.
	NoDiskConflictPred = "NoDiskConflict"
	// PodToleratesNodeTaintsPred defines the name of predicate PodToleratesNodeTaints.
	PodToleratesNodeTaintsPred = "PodToleratesNodeTaints"
	// CheckNodeUnschedulablePred defines the name of predicate CheckNodeUnschedulablePredicate.
	CheckNodeUnschedulablePred = "CheckNodeUnschedulable"
	// PodToleratesNodeNoExecuteTaintsPred defines the name of predicate PodToleratesNodeNoExecuteTaints.
	PodToleratesNodeNoExecuteTaintsPred = "PodToleratesNodeNoExecuteTaints"
	// CheckNodeLabelPresencePred defines the name of predicate CheckNodeLabelPresence.
	CheckNodeLabelPresencePred = "CheckNodeLabelPresence"
	// CheckServiceAffinityPred defines the name of predicate checkServiceAffinity.
	CheckServiceAffinityPred = "CheckServiceAffinity"
	// MaxEBSVolumeCountPred defines the name of predicate MaxEBSVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxEBSVolumeCountPred = "MaxEBSVolumeCount"
	// MaxGCEPDVolumeCountPred defines the name of predicate MaxGCEPDVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxGCEPDVolumeCountPred = "MaxGCEPDVolumeCount"
	// MaxAzureDiskVolumeCountPred defines the name of predicate MaxAzureDiskVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxAzureDiskVolumeCountPred = "MaxAzureDiskVolumeCount"
	// MaxCinderVolumeCountPred defines the name of predicate MaxCinderDiskVolumeCount.
	// DEPRECATED
	// All cloudprovider specific predicates are deprecated in favour of MaxCSIVolumeCountPred.
	MaxCinderVolumeCountPred = "MaxCinderVolumeCount"
	// MaxCSIVolumeCountPred defines the predicate that decides how many CSI volumes should be attached
	MaxCSIVolumeCountPred = "MaxCSIVolumeCountPred"
	// NoVolumeZoneConflictPred defines the name of predicate NoVolumeZoneConflict.
	NoVolumeZoneConflictPred = "NoVolumeZoneConflict"
	// CheckNodeMemoryPressurePred defines the name of predicate CheckNodeMemoryPressure.
	CheckNodeMemoryPressurePred = "CheckNodeMemoryPressure"
	// CheckNodeDiskPressurePred defines the name of predicate CheckNodeDiskPressure.
	CheckNodeDiskPressurePred = "CheckNodeDiskPressure"
	// CheckNodePIDPressurePred defines the name of predicate CheckNodePIDPressure.
	CheckNodePIDPressurePred = "CheckNodePIDPressure"
	// DefaultMaxGCEPDVolumes defines the maximum number of PD Volumes for GCE
	// GCE instances can have up to 16 PD volumes attached.
	DefaultMaxGCEPDVolumes = 16
	// DefaultMaxAzureDiskVolumes defines the maximum number of PD Volumes for Azure
	// Larger Azure VMs can actually have much more disks attached.
	// TODO We should determine the max based on VM size
	DefaultMaxAzureDiskVolumes = 16
	// KubeMaxPDVols defines the maximum number of PD Volumes per kubelet
	KubeMaxPDVols = "KUBE_MAX_PD_VOLS"
	// EBSVolumeFilterType defines the filter name for EBSVolumeFilter.
	EBSVolumeFilterType = "EBS"
	// GCEPDVolumeFilterType defines the filter name for GCEPDVolumeFilter.
	GCEPDVolumeFilterType = "GCE"
	// AzureDiskVolumeFilterType defines the filter name for AzureDiskVolumeFilter.
	AzureDiskVolumeFilterType = "AzureDisk"
	// CinderVolumeFilterType defines the filter name for CinderVolumeFilter.
	CinderVolumeFilterType = "Cinder"
)
Predicate Functions
The function corresponding to each predicate key is generally named ${KEY}Predicate. The function bodies are all fairly straightforward, so I won't walk through them one by one; feel free to browse them yourself. Here is a single example:
pkg/scheduler/algorithm/predicates/predicates.go:1567
// CheckNodeMemoryPressurePredicate checks if a pod can be scheduled on a node
// reporting memory pressure condition.
func CheckNodeMemoryPressurePredicate(pod *v1.Pod, meta PredicateMetadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error) {
	var podBestEffort bool
	if predicateMeta, ok := meta.(*predicateMetadata); ok {
		podBestEffort = predicateMeta.podBestEffort
	} else {
		// We couldn't parse metadata - fallback to computing it.
		podBestEffort = isPodBestEffort(pod)
	}
	// pod is not BestEffort pod
	if !podBestEffort {
		return true, nil, nil
	}
	// check if node is under memory pressure
	if nodeInfo.MemoryPressureCondition() == v1.ConditionTrue {
		return false, []PredicateFailureReason{ErrNodeUnderMemoryPressure}, nil
	}
	return true, nil, nil
}
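As a usage illustration, the predicate can also be invoked directly, since every predicate shares the signature (pod, meta, nodeInfo) -> (fit, reasons, err). This is a hedged sketch assuming the 1.14 package layout; passing nil metadata deliberately exercises the fallback branch above:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/algorithm/predicates"
	schedulernodeinfo "k8s.io/kubernetes/pkg/scheduler/nodeinfo"
)

func main() {
	// A node currently reporting MemoryPressure=True.
	node := &v1.Node{
		Status: v1.NodeStatus{
			Conditions: []v1.NodeCondition{
				{Type: v1.NodeMemoryPressure, Status: v1.ConditionTrue},
			},
		},
	}
	nodeInfo := schedulernodeinfo.NewNodeInfo()
	if err := nodeInfo.SetNode(node); err != nil {
		panic(err)
	}

	// A BestEffort pod: no requests or limits on any container.
	pod := &v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{{Name: "c"}}}}

	// nil metadata forces the predicate to recompute podBestEffort itself.
	fit, reasons, err := predicates.CheckNodeMemoryPressurePredicate(pod, nil, nodeInfo)
	fmt.Println(fit, reasons, err) // expected: false, node under memory pressure
}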
At this point the filtering process should be clear!
Recap
The trickier points (highlights?) of the filtering code, called out:
- node-granularity concurrency control
- aggregating the resource requests of pods at or above the nominated pod's priority
- repeating the filtering pass once
That wraps up this article on the scheduler's filtering algorithm; the next one digs into the details of the scheduler's priority (scoring) algorithm.