How the Kubernetes Scheduler Works

This post walks through the algorithm behind the Kubernetes Scheduler, focusing on how the filtering (Predicates) and scoring (Priorities) steps work, and on the Default Policies that come with the default configuration. In a follow-up post I will dig into the Kubernetes Scheduler source code to look at the concrete implementation details and at how to develop a custom Policy.

The Scheduler and Its Algorithm

The Kubernetes Scheduler is a component of the Kubernetes Master. It is usually deployed on the same node as the API Server and the Controller Manager; together the three make up the core of the Master.

The Scheduler's job in one sentence: take each Pod whose PodSpec.NodeName is empty and, one at a time, run it through the two steps of filtering (Predicates) and scoring (Priorities) to pick the most suitable Node as that Pod's destination.

Expanding those two steps gives the description of the Scheduler's algorithm:

  • Predicates (filtering): according to the configured Predicates Policies (by default, the default predicates policies set defined by the DefaultProvider), filter out every Node that fails any of these Policies; the Nodes that remain become the input to the Priorities step.
  • Priorities (scoring): according to the configured Priorities Policies (by default, the default priorities policies set defined by the DefaultProvider), score and rank the filtered Nodes; the Node with the highest score is the best fit, and the Pod is bound to that Node.

    If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.

So the algorithm driving the whole scheduling process is itself very simple; the real logic sits in the Policies. Below we look at the Kubernetes Predicates and Priorities Policies.
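
To make the flow concrete before diving into the individual policies, here is a minimal, self-contained sketch of the two steps. Everything in it is hypothetical (the sketch package, the Pod/Node stand-ins, scheduleOne); it only mirrors the description above, not the actual scheduler source:

package sketch

import (
    "fmt"
    "math/rand"
)

// Minimal stand-ins for the real API objects, for illustration only.
type Pod struct{ Name string }
type Node struct{ Name string }

// A predicate answers "does this pod fit on this node?".
type PredicateFunc func(pod *Pod, node *Node) bool

// A priority gives a node a 0-10 score for this pod; each carries a weight.
type WeightedPriority struct {
    Score  func(pod *Pod, node *Node) int
    Weight int
}

// scheduleOne mirrors the two-step flow: filter, score, pick the best.
func scheduleOne(pod *Pod, nodes []*Node,
    predicates []PredicateFunc, priorities []WeightedPriority) (*Node, error) {

    // Step 1: Predicates – a node survives only if it passes every policy.
    var feasible []*Node
    for _, n := range nodes {
        fits := true
        for _, p := range predicates {
            if !p(pod, n) {
                fits = false
                break
            }
        }
        if fits {
            feasible = append(feasible, n)
        }
    }
    if len(feasible) == 0 {
        return nil, fmt.Errorf("FailedPredicates: no node fits pod %q", pod.Name)
    }

    // Step 2: Priorities – weighted sum of per-policy 0-10 scores.
    best, bestScore := []*Node{}, -1
    for _, n := range feasible {
        score := 0
        for _, pr := range priorities {
            score += pr.Weight * pr.Score(pod, n)
        }
        if score > bestScore {
            best, bestScore = []*Node{n}, score
        } else if score == bestScore {
            best = append(best, n)
        }
    }

    // Ties for the highest score are broken at random.
    return best[rand.Intn(len(best))], nil
}

The real scheduler does the same thing, but runs the predicate checks and the priority functions concurrently, as described below.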

Predicates and Priorities Policies

Predicates Policies

Predicates Policies are what the Scheduler uses to filter down to the Nodes that satisfy the defined conditions. Nodes are checked concurrently (with at most 16 goroutines): for each Node, every configured Predicates Policy is evaluated in turn, and as soon as a single Policy fails, that Node is eliminated.

Note: the number of concurrent goroutines equals the number of Nodes, but never exceeds 16; the work is driven by a queue.
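
That "at most 16 goroutines fed by a queue" behavior can be sketched with a buffered channel acting as the work queue. The names below (filterNodesParallel, checkNode) are made up for illustration; the actual code uses a queue-based parallelization helper:

package sketch

import "sync"

// filterNodesParallel runs checkNode over every node index with at most
// maxWorkers goroutines, mirroring the "one queue, up to 16 workers" scheme.
func filterNodesParallel(numNodes int, checkNode func(i int) bool) []int {
    const maxWorkers = 16

    workers := maxWorkers
    if numNodes < workers {
        workers = numNodes // never more goroutines than nodes
    }

    queue := make(chan int, numNodes) // the work queue of node indices
    for i := 0; i < numNodes; i++ {
        queue <- i
    }
    close(queue)

    fits := make([]bool, numNodes)
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := range queue {
                // checkNode is assumed to run all predicates for node i;
                // the node is kept only if every one of them passes.
                fits[i] = checkNode(i)
            }
        }()
    }
    wg.Wait()

    var feasible []int
    for i, ok := range fits {
        if ok {
            feasible = append(feasible, i)
        }
    }
    return feasible
}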

Kubernetes provides the following Predicates Policies. You can add --policy-config-file to the kube-scheduler startup parameters to specify the set of Policies to apply, for example:

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    ...
  ]
}

  1. NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, iSCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.

  2. NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.

  3. PodFitsResources: Check if the free resources (CPU and Memory) meet the Pod's requests. The free resource is measured as the node's capacity minus the sum of the requests of all Pods already on the node. To learn more about resource QoS in Kubernetes, please check the QoS proposal. (A minimal sketch of this check appears after this list.)

  4. PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node.

  5. HostName: Filter out all nodes except the one specified in the PodSpec’s NodeName field.

  6. MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod’s nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.

  7. MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume – see Amazon’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  8. MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows – see GCE’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  9. CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting the memory pressure condition. Currently, no BestEffort pod should be placed on a node under memory pressure, as it would be automatically evicted by the kubelet.

  10. CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting the disk pressure condition. Currently, no pod should be placed on a node under disk pressure, as it would be automatically evicted by the kubelet.
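
As a concrete illustration of PodFitsResources (item 3 above), the check boils down to comparing the Pod's requests with the node's capacity minus what is already requested there. The Resource struct and field names below are simplified stand-ins, not the real scheduler types:

package sketch

// Resource is a simplified stand-in for the scheduler's resource struct.
type Resource struct {
    MilliCPU int64 // CPU in millicores
    Memory   int64 // memory in bytes
}

// podFitsResources reports whether the pod's requests fit into what is
// left on the node: capacity minus the sum of the requests of the pods
// already on the node (actual usage is NOT considered, only requests).
func podFitsResources(podRequest, alreadyRequested, capacity Resource) bool {
    freeCPU := capacity.MilliCPU - alreadyRequested.MilliCPU
    freeMem := capacity.Memory - alreadyRequested.Memory
    return podRequest.MilliCPU <= freeCPU && podRequest.Memory <= freeMem
}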

The default DefaultProvider selects the following Predicates Policies:

  1. NoVolumeZoneConflict
  2. MaxEBSVolumeCount
  3. MaxGCEPDVolumeCount
  4. MatchInterPodAffinity

    Note: fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of the affinity data (JSON serialized) in a Pod's Annotations:

    AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

  5. NoDiskConflict
  6. GeneralPredicates
    • PodFitsResources
      • pod, in number
      • cpu, in cores
      • memory, in bytes
      • alpha.kubernetes.io/nvidia-gpu, in devices (as of v1.4, each node supports at most 1 GPU)
    • PodFitsHost
    • PodFitsHostPorts (sketched after this list)
    • PodSelectorMatches
  7. PodToleratesNodeTaints
  8. CheckNodeMemoryPressure
  9. CheckNodeDiskPressure
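
Of the checks bundled into GeneralPredicates, PodFitsHostPorts is a good example of how lightweight each one is: the pod is rejected as soon as one of the host ports it asks for is already occupied on the node. A minimal sketch, with made-up parameter names:

package sketch

// podFitsHostPorts reports whether any host port the pod asks for is
// already taken on the node. usedPorts is the set of host ports occupied
// by pods already bound to that node.
func podFitsHostPorts(wantedPorts []int, usedPorts map[int]bool) bool {
    for _, p := range wantedPorts {
        if p == 0 {
            continue // 0 means "no host port requested" for that container port
        }
        if usedPorts[p] {
            return false // collision: the node cannot host this pod
        }
    }
    return true
}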

Priorities Policies

The Nodes that survive the filtering step then move on to the scoring step. Here, goroutines are started concurrently, one for each priority policy; inside each goroutine the corresponding policy implementation walks all of the filtered Nodes and scores each of them, every Node getting a 0-10 score per Policy (0 is the lowest, 10 the highest). Once the goroutines for all of the policies have finished, each Node's per-policy scores are multiplied by the weight configured for the respective priorities policy and summed, and that weighted sum is the Node's final score.

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)

Note: as in the filtering step, the number of concurrent goroutines is the number of Nodes, but capped at 16 and driven by a queue.

A thought: if no Node passes filtering, the scheduler returns a FailedPredicates error right away and the scoring phase is never triggered, which is reasonable. But if exactly one Node passes filtering, the scoring phase is still triggered and runs exactly the same flow as with many Nodes. In practice, when only one Node qualifies, the scoring step could simply return that Node as the final scheduling result without running the whole scoring pass.

If, after scoring and ranking, several Nodes tie for the highest score, the scheduler picks one of them at random as the target Node.
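
Putting the scoring description together: one goroutine per priority policy, each returning a 0-10 score for every filtered Node, then a weighted sum per Node as in the formula below. The sketch here is illustrative only; the types and names do not come from the scheduler source:

package sketch

import "sync"

// priorityPolicy is one priority: scoreAll returns a 0-10 score for each
// of the filtered nodes (index-aligned), and weight is its configured weight.
type priorityPolicy struct {
    scoreAll func(numNodes int) []int
    weight   int
}

// prioritizeNodes runs every priority policy in its own goroutine, waits
// for all of them, then folds the per-policy scores into one weighted sum
// per node: finalScore(node) = sum over policies of weight * score(node).
func prioritizeNodes(numNodes int, policies []priorityPolicy) []int {
    perPolicy := make([][]int, len(policies))

    var wg sync.WaitGroup
    for i, p := range policies {
        wg.Add(1)
        go func(i int, p priorityPolicy) {
            defer wg.Done()
            perPolicy[i] = p.scoreAll(numNodes) // one 0-10 score per node
        }(i, p)
    }
    wg.Wait()

    final := make([]int, numNodes)
    for i, p := range policies {
        for node, s := range perPolicy[i] {
            final[node] += p.weight * s
        }
    }
    return final // the caller picks the maximum, breaking ties at random
}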

Kubernetes provides the following Priorities Policies. You can add --policy-config-file to the kube-scheduler startup parameters to specify the set of Policies to apply, for example:

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    ...
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ]
}

  • LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption. (A worked sketch of this calculation appears after this list.)
  • BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
  • SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
  • CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
  • ImageLocalityPriority: Nodes are prioritized based on the locality of the images requested by a pod. Nodes that already hold a larger total size of the images required by the pod are preferred over nodes that hold none, or only a small total size, of those images.
  • NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
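
To make the LeastRequestedPriority bullet concrete: per resource, the score is the free fraction scaled to 0-10, and CPU and memory are averaged. The function and the example numbers below are a simplified, assumption-laden sketch (integer arithmetic, no guard for over-committed nodes beyond zero capacity):

package sketch

// leastRequestedScore computes a 0-10 LeastRequestedPriority-style score
// for one node: ((capacity - requested) / capacity) * 10 per resource,
// with CPU and memory weighted equally (averaged).
func leastRequestedScore(requestedCPU, capacityCPU, requestedMem, capacityMem int64) int64 {
    if capacityCPU == 0 || capacityMem == 0 {
        return 0 // a node reporting zero capacity gets the lowest score
    }
    cpuScore := (capacityCPU - requestedCPU) * 10 / capacityCPU
    memScore := (capacityMem - requestedMem) * 10 / capacityMem
    return (cpuScore + memScore) / 2
}

// Example: a node with 4000m CPU and 8Gi memory, where the pods already on
// it plus the incoming pod together request 2000m CPU and 4Gi memory:
//   cpuScore = (4000-2000)*10/4000 = 5
//   memScore = (8-4)*10/8         = 5   (memory shown in Gi for readability)
//   final    = (5+5)/2            = 5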

The default DefaultProvider selects the following Priorities Policies:

  1. SelectorSpreadPriority, default weight 1
  2. InterPodAffinityPriority, default weight 1

    • Pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
    • AffinityAnnotationKey represents the key of the affinity data (JSON serialized) in a Pod's Annotations:

    scheduler.alpha.kubernetes.io/affinity="..."

  3. LeastRequestedPriority, default weight 1

  4. BalancedResourceAllocation, default weight 1
  5. NodePreferAvoidPodsPriority, default weight 10000

    Note: the weight here is deliberately huge (10000). A non-zero score, once weighted, dominates a node's final sum, while a score of 0 leaves the node so far behind the high scorers that it is bound to be eliminated. The analysis goes as follows (a minimal sketch of this scoring rule appears after this list):

    If the Node's Annotations do not set the key

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    then the node scores 10 for this policy; multiplied by the weight of 10000, the node gets at least 100,000 points from this policy alone.

    If the Node's Annotations do set

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    and the pod being scheduled is controlled by a ReplicationController or ReplicaSet, the node scores 0 for this policy, leaving it hopelessly far behind the nodes without that Annotation. In other words, that Node is certain to be eliminated.

  6. NodeAffinityPriority, default weight 1

  7. TaintTolerationPriority, default weight 1
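
The NodePreferAvoidPodsPriority rule analysed in item 5 reduces to a 0-or-10 score that the 10000 weight then amplifies. A minimal sketch following the description above (the parameter names are made up):

package sketch

// nodePreferAvoidScore returns the 0-10 score item 5 describes: 10 unless
// the node carries the preferAvoidPods annotation AND the pod is controlled
// by a ReplicationController or ReplicaSet, in which case it returns 0.
func nodePreferAvoidScore(nodeHasPreferAvoidAnnotation bool, controllerKind string) int {
    if nodeHasPreferAvoidAnnotation &&
        (controllerKind == "ReplicationController" || controllerKind == "ReplicaSet") {
        return 0
    }
    return 10
}

// With the default weight of 10000, a score of 10 contributes 100,000 to the
// node's total, while a score of 0 contributes nothing – which is why such a
// node is effectively ruled out.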

Scheduler algorithm flow chart

[Figure: scheduler algorithm flow chart]

Summary

  • The Kubernetes scheduler's job is to schedule each Pod onto the most suitable Node.
  • The whole scheduling process has two steps: filtering (Predicates) and scoring (Priorities).
  • The default scheduling policy provider is DefaultProvider; the policies it includes are listed above.
  • You can point the kube-scheduler startup parameter --policy-config-file at a custom JSON file to assemble your own set of Predicates and Priorities policies in the format shown above.