Kubernetes: the scheduler and the scheduling process

Background:
This article covers the principles, mechanisms, and practice of Kubernetes scheduling; partial source-code analysis will follow in later posts. All source code in this article is based on version 1.17.
References:
https://www.cnblogs.com/xzkzzz/p/9963511.html (thanks to the author for an excellent summary)
Official documentation, scheduling section:
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/

Scheduler overview
When the Scheduler learns about a newly created Pod through the API server's watch interface, it checks the list of Nodes that could satisfy the Pod's requirements and runs its scheduling logic; once scheduling succeeds, the Pod is bound to the chosen node. The Scheduler is the link between the two halves of the system: upstream, it accepts newly created Pods and finds each one a place to land (a Node); downstream, once placement is done, the kubelet on the target Node takes over and manages the rest of the Pod's lifecycle. Concretely, the Scheduler binds each pending Pod to a suitable Node in the cluster according to specific scheduling algorithms and policies, and sends the binding to the API server, which writes it to etcd. Three objects are involved in the whole scheduling process: the list of Pods waiting to be scheduled, the list of available Nodes, and the scheduling algorithms and policies.

The scheduling flow provided by the Kubernetes Scheduler has three steps:

Predicate: iterate over the node list and keep only the candidate nodes that meet the Pod's requirements; Kubernetes ships with a number of built-in predicate rules to choose from.
Priority: score each remaining candidate node with the priority rules and pick the one with the highest score.
Select: if several nodes share the highest score, pick one of them at random. (A minimal sketch of the whole flow follows.)
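
To make the three phases concrete, here is a minimal sketch of the predicate -> priority -> select flow in Go. This is not the kube-scheduler implementation; the Node and Pod types and the filter/score function signatures are assumptions made purely for illustration.

package main

import (
	"fmt"
	"math/rand"
)

// Illustrative stand-ins for the real API objects.
type Node struct{ Name string }
type Pod struct{ Name string }

type filterFunc func(Pod, Node) bool // predicate: is this node feasible?
type scoreFunc func(Pod, Node) int   // priority: higher is better

// schedule mirrors the predicate -> priority -> select flow described above.
func schedule(pod Pod, nodes []Node, filters []filterFunc, scorers []scoreFunc) (Node, bool) {
	// 1. Predicate phase: keep only the nodes that pass every filter.
	var feasible []Node
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(pod, n) {
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return Node{}, false // unschedulable
	}

	// 2. Priority phase: sum the scores and remember the best nodes.
	var best []Node
	bestScore := -1
	for _, n := range feasible {
		total := 0
		for _, s := range scorers {
			total += s(pod, n)
		}
		switch {
		case total > bestScore:
			best, bestScore = []Node{n}, total
		case total == bestScore:
			best = append(best, n)
		}
	}

	// 3. Select phase: break ties between the top-scoring nodes at random.
	return best[rand.Intn(len(best))], true
}

func main() {
	nodes := []Node{{"node-1"}, {"node-2"}, {"node-3"}}
	notNode3 := func(_ Pod, n Node) bool { return n.Name != "node-3" }
	preferNode2 := func(_ Pod, n Node) int {
		if n.Name == "node-2" {
			return 10
		}
		return 0
	}
	chosen, ok := schedule(Pod{"web-1"}, nodes, []filterFunc{notNode3}, []scoreFunc{preferNode2})
	fmt.Println(chosen.Name, ok) // node-2 true
}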

Predicate policies
CheckNodeConditionPred checks whether the node is in a healthy condition.
GeneralPred covers HostName (if the pod sets pod.spec.hostname, checks whether the node matches it) and PodFitsHostPorts (checks whether the host ports the pod wants to expose, pod.spec.containers.ports.hostPort, are already taken).
MatchNodeSelector checks whether the node's labels satisfy the nodeSelector defined by the pod (pod.spec.nodeSelector).
PodFitsResources checks whether the node's resources can satisfy the pod's requests; for example, a pod that requires at least 2C4G will not be scheduled onto a node with less than that available (use kubectl describe node <NODE> to see resource usage; see also the sketch after this list).
NoDiskConflict checks whether the volumes requested by the pod conflict with volumes already in use on the node (not enabled by default).
PodToleratesNodeTaints checks whether the pod's tolerations (pod.spec.tolerations) can tolerate the node's taints.
CheckNodeLabelPresence checks whether the specified labels exist on the node (not enabled by default).
CheckServiceAffinity tries to place pods that belong to the same service onto the same node (not enabled by default).
CheckVolumeBinding checks whether the node can bind the pod's volumes (not enabled by default).
NoVolumeZoneConflict checks whether the pod's volumes and the node are in the same zone (not enabled by default).
CheckNodeMemoryPressure checks whether the node is under memory pressure.
CheckNodeDiskPressure checks whether the node is under disk I/O pressure.
CheckNodePIDPressure checks whether the node is under PID pressure.
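
To get a feel for what a predicate does, here is a simplified sketch of the PodFitsResources idea; the resources type and the function signature are assumptions for illustration, not the v1.17 implementation.

package sketch

// resources is a simplified request/capacity pair used only for this sketch.
type resources struct {
	milliCPU int64 // CPU in millicores
	memory   int64 // memory in bytes
}

// podFitsResources reports whether the node's remaining capacity can hold the
// pod: the pod's request must fit into allocatable minus what is already requested.
func podFitsResources(podRequest, nodeAllocatable, nodeRequested resources) bool {
	if podRequest.milliCPU > nodeAllocatable.milliCPU-nodeRequested.milliCPU {
		return false // not enough free CPU
	}
	if podRequest.memory > nodeAllocatable.memory-nodeRequested.memory {
		return false // not enough free memory
	}
	return true
}
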
Predicate source code

Priority policies
least_requested favors the node with the lowest resource consumption, scored by the free ratio of each resource, roughly (capacity - sum(requested)) * 10 / capacity for CPU and memory (see the sketch after this list).
balanced_resource_allocation favors the node whose resource usage is the most balanced across CPU and memory.
node_prefer_avoid_pods scores nodes by their "prefer avoid pods" preference (the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods).
taint_toleration checks the pod's spec.tolerations against the node's taints list; the more entries that match, the lower the score.
selector_spreading tries to keep pods of the same service off the same node; the fewer pods of that service a node already runs, the higher its score.
interpod_affinity walks the inter-pod affinity terms against the node; the more terms that match, the higher the score.
most_requested favors the node with the highest consumption (i.e. tries to use up one node's resources before spreading to others).
node_label scores by node labels: a node that has the label scores, one that does not gets nothing; the more matching labels, the higher the score.
image_locality gives points to nodes that already have the images the pod needs; the more of those images a node holds (measured by the total size of the images present), the higher its score.
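
As an example, the least_requested formula quoted above can be sketched like this (an illustration of the quoted formula only; the actual function in the source scales to the scheduler's maximum node score rather than a literal 10):

package sketch

// leastRequestedScore scores a single resource by its unused fraction:
// the more free capacity a node has, the higher the score (0 when full).
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}
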
Priority source code:
https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/scheduler/algorithm/priorities/

All priority policies defined in the code:

package priorities

const (
	// EqualPriority defines the name of prioritizer function that gives an equal weight of one to all nodes.
	EqualPriority = "EqualPriority"
	// MostRequestedPriority defines the name of prioritizer function that gives used nodes higher priority.
	MostRequestedPriority = "MostRequestedPriority"
	// RequestedToCapacityRatioPriority defines the name of RequestedToCapacityRatioPriority.
	RequestedToCapacityRatioPriority = "RequestedToCapacityRatioPriority"
	// SelectorSpreadPriority defines the name of prioritizer function that spreads pods by minimizing
	// the number of pods (belonging to the same service or replication controller) on the same node.
	SelectorSpreadPriority = "SelectorSpreadPriority"
	// ServiceSpreadingPriority is largely replaced by "SelectorSpreadPriority".
	ServiceSpreadingPriority = "ServiceSpreadingPriority"
	// InterPodAffinityPriority defines the name of prioritizer function that decides which pods should or
	// should not be placed in the same topological domain as some other pods.
	InterPodAffinityPriority = "InterPodAffinityPriority"
	// LeastRequestedPriority defines the name of prioritizer function that prioritize nodes by least
	// requested utilization.
	LeastRequestedPriority = "LeastRequestedPriority"
	// BalancedResourceAllocation defines the name of prioritizer function that prioritizes nodes
	// to help achieve balanced resource usage.
	BalancedResourceAllocation = "BalancedResourceAllocation"
	// NodePreferAvoidPodsPriority defines the name of prioritizer function that priorities nodes according to
	// the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
	NodePreferAvoidPodsPriority = "NodePreferAvoidPodsPriority"
	// NodeAffinityPriority defines the name of prioritizer function that prioritizes nodes which have labels
	// matching NodeAffinity.
	NodeAffinityPriority = "NodeAffinityPriority"
	// TaintTolerationPriority defines the name of prioritizer function that prioritizes nodes that marked
	// with taint which pod can tolerate.
	TaintTolerationPriority = "TaintTolerationPriority"
	// ImageLocalityPriority defines the name of prioritizer function that prioritizes nodes that have images
	// requested by the pod present.
	ImageLocalityPriority = "ImageLocalityPriority"
	// ResourceLimitsPriority defines the nodes of prioritizer function ResourceLimitsPriority.
	ResourceLimitsPriority = "ResourceLimitsPriority"
	// EvenPodsSpreadPriority defines the name of prioritizer function that prioritizes nodes
	// which have pods and labels matching the incoming pod's topologySpreadConstraints.
	EvenPodsSpreadPriority = "EvenPodsSpreadPriority"
)

Let's take image_locality as an example and see how its score is calculated.

// calculatePriority returns the priority of a node. Given the sumScores of requested images on the node, the node's
// priority is obtained by scaling the maximum priority value with a ratio proportional to the sumScores.
func calculatePriority(sumScores int64) int {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}

	return int(int64(framework.MaxNodeScore) * (sumScores - minThreshold) / (maxThreshold - minThreshold))
}

// sumImageScores returns the sum of image scores of all the containers that are already on the node.
// Each image receives a raw score of its size, scaled by scaledImageScore. The raw scores are later used to calculate
// the final score. Note that the init containers are not considered for it's rare for users to deploy huge init containers.
func sumImageScores(nodeInfo *schedulernodeinfo.NodeInfo, containers []v1.Container, totalNumNodes int) int64 {
	var sum int64
	imageStates := nodeInfo.ImageStates()

	for _, container := range containers {
		if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
			sum += scaledImageScore(state, totalNumNodes)
		}
	}

	return sum
}

// scaledImageScore returns an adaptively scaled score for the given state of an image.
// The size of the image is used as the base score, scaled by a factor which considers how much nodes the image has "spread" to.
// This heuristic aims to mitigate the undesirable "node heating problem", i.e., pods get assigned to the same or
// a few nodes due to image locality.
func scaledImageScore(imageState *schedulernodeinfo.ImageStateSummary, totalNumNodes int) int64 {
	spread := float64(imageState.NumNodes) / float64(totalNumNodes)
	return int64(float64(imageState.Size) * spread)
}

As you can see, the code walks the pod's containers and adds a size-based score for every image that is already present on the node; the more (and the larger) of the required images a node already holds, the higher its score (a rough worked example follows the snippet):

 if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
			sum += scaledImageScore(state, totalNumNodes)
		}
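
To put some numbers on calculatePriority, here is a rough worked example. The threshold constants and the maximum node score below are assumed values for this illustration, not quotations from the v1.17 source.

package main

import "fmt"

const (
	mb           int64 = 1024 * 1024
	minThreshold       = 23 * mb   // assumed lower bound on the summed image scores
	maxThreshold       = 1000 * mb // assumed upper bound
	maxNodeScore int64 = 100       // assumed maximum node score
)

// examplePriority mirrors the clamping and scaling done by calculatePriority.
func examplePriority(sumScores int64) int64 {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}
	return maxNodeScore * (sumScores - minThreshold) / (maxThreshold - minThreshold)
}

func main() {
	// A node that already holds roughly 500 MB worth of the pod's images
	// (after spread scaling) lands near the middle of the score range.
	fmt.Println(examplePriority(500 * mb)) // 48
}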

Now let's look at how MostRequestedPriority scores nodes.

package priorities

import framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"

var (
	mostRequestedRatioResources = DefaultRequestedRatioResources
	mostResourcePriority        = &ResourceAllocationPriority{"MostResourceAllocation", mostResourceScorer, mostRequestedRatioResources}

	// MostRequestedPriorityMap is a priority function that favors nodes with most requested resources.
	// It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes
	// based on the maximum of the average of the fraction of requested to capacity.
	// Details: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
	MostRequestedPriorityMap = mostResourcePriority.PriorityMap
)

func mostResourceScorer(requested, allocable ResourceToValueMap, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64 {
	var nodeScore, weightSum int64
	for resource, weight := range mostRequestedRatioResources {
		resourceScore := mostRequestedScore(requested[resource], allocable[resource])
		nodeScore += resourceScore * weight
		weightSum += weight
	}
	return (nodeScore / weightSum)

}

// The used capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more resources are used the higher the score is. This function
// is almost a reversed version of least_requested_priority.calculateUnusedScore
// (10 - calculateUnusedScore). The main difference is in rounding. It was added to
// keep the final formula clean and not to modify the widely used (by users
// in their default scheduling policies) calculateUsedScore.
func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	if requested > capacity {
		return 0
	}

	return (requested * framework.MaxNodeScore) / capacity
}

For each node, the scorer works with two quantities per resource: how much the pods placed there have requested, and how much the node can actually allocate. The resources considered and their weights come from a resourceToWeightMap:

resourceToWeightMap ResourceToWeightMap
// contains resource names and their weights.

As for resources, only memory and CPU are taken into account by default:

var DefaultRequestedRatioResources = ResourceToWeightMap{v1.ResourceMemory: 1, v1.ResourceCPU: 1}

For volumes (this part involves PVs), using volumes can also contribute to the score; it is gated by the features.BalanceAttachedNodeVolumes feature gate:

if len(pod.Spec.Volumes) >= 0 && utilfeature.DefaultFeatureGate.Enabled(features.BalanceAttachedNodeVolumes) && nodeInfo.TransientInfo != nil {
		score = r.scorer(requested, allocatable, true, nodeInfo.TransientInfo.TransNodeInfo.RequestedVolumes, nodeInfo.TransientInfo.TransNodeInfo.AllocatableVolumesCount)
	} else {
		score = r.scorer(requested, allocatable, false, 0, 0)
	}
type ResourceToWeightMap map[v1.ResourceName]int64

The allocatable and requested amounts are collected per resource into allocatable[resource] and requested[resource]:

requested := make(ResourceToValueMap, len(r.resourceToWeightMap))
	allocatable := make(ResourceToValueMap, len(r.resourceToWeightMap))
	for resource := range r.resourceToWeightMap {
		allocatable[resource], requested[resource] = calculateResourceAllocatableRequest(nodeInfo, pod, resource)
	}

The score is then computed from the requested amounts and the per-resource weights:

for resource, weight := range mostRequestedRatioResources {
		resourceScore := mostRequestedScore(requested[resource], allocable[resource])
		nodeScore += resourceScore * weight
		weightSum += weight
	}
	return (nodeScore / weightSum)
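
To make the arithmetic concrete, here is a worked example under assumed numbers; the weights of 1 match DefaultRequestedRatioResources above, and the maximum node score of 100 is an assumption made for this illustration.

package main

import "fmt"

func main() {
	const maxNodeScore = 100
	cpuScore := int64(3000) * maxNodeScore / 4000       // 3000m requested of a 4000m node -> 75
	memScore := int64(2<<30) * maxNodeScore / (8 << 30) // 2Gi requested of an 8Gi node    -> 25
	nodeScore := (cpuScore*1 + memScore*1) / (1 + 1)    // weighted average over CPU and memory
	fmt.Println(nodeScore)                              // 50
}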

In short, the more of a node's capacity has already been requested by the pods running on it, the higher its mostRequestedScore; this policy tends to pack workloads onto as few nodes as possible.
The remaining priority policies will be covered in a follow-up.

Finally, the priority phase adds up a total score for each node, and the node with the highest total wins this round of scheduling.

Advanced scheduling: affinity and anti-affinity
When we want to steer a Pod onto particular nodes, we can use the advanced scheduling mechanisms, which fall into:
Node selectors: nodeSelector, nodeName
Node affinity scheduling: nodeAffinity
Pod affinity scheduling: podAffinity
Pod anti-affinity scheduling: podAntiAffinity
nodeAffinity
kubectl explain pod.spec.affinity.nodeAffinity

requiredDuringSchedulingIgnoredDuringExecution: hard affinity; the affinity rules must be satisfied.
matchExpressions: match expressions against node labels; for example, a pod can define the key zone, the operator In, and the values foo and bar, meaning it will only be scheduled onto nodes whose zone label is foo or bar.
matchFields: match fields; same idea as above, but it matches node fields rather than labels.
preferredDuringSchedulingIgnoredDuringExecution: soft affinity; satisfied if possible, but not required.
preference: the preferred node selector term.
weight: a weight in the range 1-100; for each node that meets all the other scheduling requirements, the scheduler iterates over the elements of this field and adds the weight to the node's sum whenever the node matches the corresponding term.
The supported operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt. NotIn and DoesNotExist can be used to achieve node anti-affinity behavior. (A sketch of both forms follows.)
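
Here is a sketch of how these fields fit together, using the Go API types from k8s.io/api/core/v1; the zone values mirror the example above, while the disktype=ssd preference is an assumption added for illustration. In a manifest the same structure would live under the pod's spec.affinity.nodeAffinity.

package sketch

import v1 "k8s.io/api/core/v1"

// nodeAffinityExample builds the affinity described above: a hard requirement
// that the node's "zone" label be "foo" or "bar", plus a soft preference for
// nodes labeled disktype=ssd.
func nodeAffinityExample() *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			// Hard affinity: the pod is only scheduled onto matching nodes.
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "zone",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"foo", "bar"},
					}},
				}},
			},
			// Soft affinity: matching nodes get extra weight during scoring.
			PreferredDuringSchedulingIgnoredDuringExecution: []v1.PreferredSchedulingTerm{{
				Weight: 50,
				Preference: v1.NodeSelectorTerm{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "disktype",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"ssd"},
					}},
				},
			}},
		},
	}
}
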
podAffinity
A typical Pod affinity scenario: the nodes of a Kubernetes cluster are spread across different zones or data centers, and service A and service B are required to run in the same zone or the same data center; that is when affinity scheduling is needed.

kubectl explain pod.spec.affinity.podAffinity works the same way as nodeAffinity: there are both hard and soft affinity variants.

Hard affinity (a sketch follows the field list):

labelSelector selects which group of Pods to be co-located with.
namespaces selects which namespaces to search for those Pods.
topologyKey specifies which node label key defines the topology domain.
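
Here is a sketch of the hard form with the Go API types; the app=cache label, the default namespace, and the kubernetes.io/hostname topology key are assumptions chosen for illustration. It asks the scheduler to place the pod in the same topology domain (here, the same node) as pods labeled app=cache.

package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podAffinityExample builds a hard pod affinity: schedule this pod onto a node
// that already runs a pod labeled app=cache in the "default" namespace.
func podAffinityExample() *v1.Affinity {
	return &v1.Affinity{
		PodAffinity: &v1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "cache"}, // which pods to be co-located with
				},
				Namespaces:  []string{"default"},      // where to look for those pods
				TopologyKey: "kubernetes.io/hostname", // what counts as "the same place"
			}},
		},
	}
}

Pod anti-affinity uses the same PodAffinityTerm shape, just placed under the PodAntiAffinity field instead.
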
Taint and toleration scheduling (Taints and Tolerations)
The previous two mechanisms let the pod choose where it goes; with taints, the node chooses which pods it accepts. A taint is a key/value property defined on a node whose main purpose is to let the node repel pods, i.e. reject pods that do not satisfy the node's rules. Taints and Tolerations work together to keep pods away from unsuitable nodes: one or more taints can be applied to a node, and any pod that does not tolerate those taints will not be accepted by that node.

Taint
Taints are an attribute of the node; let's look at how they are defined.

kubectl explain node.spec.taints (a list of objects)

key defines a key.
value defines a value.
effect defines what happens when a pod cannot tolerate this taint. There are three effects: NoSchedule only affects the scheduling process and leaves existing pods alone; PreferNoSchedule means the system tries to avoid placing non-tolerating pods on the node, but it is not mandatory (a soft version of NoSchedule); NoExecute affects both scheduling and existing pods: pods that do not tolerate the taint are evicted. (A sketch follows.)
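
Here is a sketch with the Go API types: a NoSchedule taint as it would appear in node.spec.taints, and the toleration a pod would need in pod.spec.tolerations to still be schedulable onto that node. The dedicated=gpu key/value pair is an assumption made for illustration.

package sketch

import v1 "k8s.io/api/core/v1"

// A NoSchedule taint: pods that do not tolerate dedicated=gpu will not be
// scheduled onto a node carrying this taint.
var exampleTaint = v1.Taint{
	Key:    "dedicated",
	Value:  "gpu",
	Effect: v1.TaintEffectNoSchedule,
}

// The matching toleration a pod would carry so that the taint above no longer
// repels it.
var exampleToleration = v1.Toleration{
	Key:      "dedicated",
	Operator: v1.TolerationOpEqual, // "Equal" compares key and value; "Exists" ignores the value
	Value:    "gpu",
	Effect:   v1.TaintEffectNoSchedule,
}

On the command line the same taint would be added with kubectl taint nodes <node> dedicated=gpu:NoSchedule.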

The source-code analysis for affinity will continue in a later post.
My knowledge is limited and mistakes are hard to avoid; if you spot any, please leave a comment.
