Kubernetes: the scheduler and the scheduling process

Background:
This article covers the principles, mechanisms, and practice of Kubernetes scheduling; partial source-code analysis will follow in later posts. All source code in this article is based on version 1.17.
References:
https://www.cnblogs.com/xzkzzz/p/9963511.html (thanks to the author for an excellent summary)
Official documentation, scheduling section:
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/

Scheduler overview
When the Scheduler learns about a newly created Pod through the API server's watch interface, it checks the list of Nodes that could satisfy the Pod's requirements and runs its scheduling logic; once scheduling succeeds, the Pod is bound to the chosen node. The Scheduler is the link between the two halves of the system: upstream, it accepts newly created Pods and finds each one a place to land (a Node); downstream, once placement is done, the kubelet on the target Node takes over and manages the rest of the Pod's lifecycle. Concretely, the Scheduler binds each pending Pod to a suitable Node in the cluster according to specific scheduling algorithms and policies, and sends the binding to the API server, which writes it to etcd. Three objects are involved in the whole scheduling process: the list of Pods waiting to be scheduled, the list of available Nodes, and the scheduling algorithms and policies.

The scheduling flow provided by the Kubernetes Scheduler has three steps:

Predicate: iterate over the node list and keep only the candidate nodes that meet the Pod's requirements; Kubernetes ships with a number of built-in predicate rules to choose from.
Priority: score each remaining candidate node with the priority rules and pick the one with the highest score.
Select: if several nodes share the highest score, pick one of them at random. (A minimal sketch of the whole flow follows.)
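
To make the three phases concrete, here is a minimal sketch of the predicate -> priority -> select flow in Go. This is not the kube-scheduler implementation; the Node and Pod types and the filter/score function signatures are assumptions made purely for illustration.

package main

import (
	"fmt"
	"math/rand"
)

// Illustrative stand-ins for the real API objects.
type Node struct{ Name string }
type Pod struct{ Name string }

type filterFunc func(Pod, Node) bool // predicate: is this node feasible?
type scoreFunc func(Pod, Node) int   // priority: higher is better

// schedule mirrors the predicate -> priority -> select flow described above.
func schedule(pod Pod, nodes []Node, filters []filterFunc, scorers []scoreFunc) (Node, bool) {
	// 1. Predicate phase: keep only the nodes that pass every filter.
	var feasible []Node
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(pod, n) {
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return Node{}, false // unschedulable
	}

	// 2. Priority phase: sum the scores and remember the best nodes.
	var best []Node
	bestScore := -1
	for _, n := range feasible {
		total := 0
		for _, s := range scorers {
			total += s(pod, n)
		}
		switch {
		case total > bestScore:
			best, bestScore = []Node{n}, total
		case total == bestScore:
			best = append(best, n)
		}
	}

	// 3. Select phase: break ties between the top-scoring nodes at random.
	return best[rand.Intn(len(best))], true
}

func main() {
	nodes := []Node{{"node-1"}, {"node-2"}, {"node-3"}}
	notNode3 := func(_ Pod, n Node) bool { return n.Name != "node-3" }
	preferNode2 := func(_ Pod, n Node) int {
		if n.Name == "node-2" {
			return 10
		}
		return 0
	}
	chosen, ok := schedule(Pod{"web-1"}, nodes, []filterFunc{notNode3}, []scoreFunc{preferNode2})
	fmt.Println(chosen.Name, ok) // node-2 true
}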

Predicate policies
CheckNodeConditionPred checks whether the node is in a healthy condition.
GeneralPred covers HostName (if the pod sets pod.spec.hostname, checks whether the node matches it) and PodFitsHostPorts (checks whether the host ports the pod wants to expose, pod.spec.containers.ports.hostPort, are already taken).
MatchNodeSelector checks whether the node's labels satisfy the nodeSelector defined by the pod (pod.spec.nodeSelector).
PodFitsResources checks whether the node's resources can satisfy the pod's requests; for example, a pod that requires at least 2C4G will not be scheduled onto a node with less than that available (use kubectl describe node <NODE> to see resource usage; see also the sketch after this list).
NoDiskConflict checks whether the volumes requested by the pod conflict with volumes already in use on the node (not enabled by default).
PodToleratesNodeTaints checks whether the pod's tolerations (pod.spec.tolerations) can tolerate the node's taints.
CheckNodeLabelPresence checks whether the specified labels exist on the node (not enabled by default).
CheckServiceAffinity tries to place pods that belong to the same service onto the same node (not enabled by default).
CheckVolumeBinding checks whether the node can bind the pod's volumes (not enabled by default).
NoVolumeZoneConflict checks whether the pod's volumes and the node are in the same zone (not enabled by default).
CheckNodeMemoryPressure checks whether the node is under memory pressure.
CheckNodeDiskPressure checks whether the node is under disk I/O pressure.
CheckNodePIDPressure checks whether the node is under PID pressure.
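
To get a feel for what a predicate does, here is a simplified sketch of the PodFitsResources idea; the resources type and the function signature are assumptions for illustration, not the v1.17 implementation.

package sketch

// resources is a simplified request/capacity pair used only for this sketch.
type resources struct {
	milliCPU int64 // CPU in millicores
	memory   int64 // memory in bytes
}

// podFitsResources reports whether the node's remaining capacity can hold the
// pod: the pod's request must fit into allocatable minus what is already requested.
func podFitsResources(podRequest, nodeAllocatable, nodeRequested resources) bool {
	if podRequest.milliCPU > nodeAllocatable.milliCPU-nodeRequested.milliCPU {
		return false // not enough free CPU
	}
	if podRequest.memory > nodeAllocatable.memory-nodeRequested.memory {
		return false // not enough free memory
	}
	return true
}
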
Predicate source code

Priority policies
least_requested favors the node with the lowest resource consumption, scored by the free ratio of each resource, roughly (capacity - sum(requested)) * 10 / capacity for CPU and memory (see the sketch after this list).
balanced_resource_allocation favors the node whose resource usage is the most balanced across CPU and memory.
node_prefer_avoid_pods scores nodes by their "prefer avoid pods" preference (the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods).
taint_toleration checks the pod's spec.tolerations against the node's taints list; the more entries that match, the lower the score.
selector_spreading tries to keep pods of the same service off the same node; the fewer pods of that service a node already runs, the higher its score.
interpod_affinity walks the inter-pod affinity terms against the node; the more terms that match, the higher the score.
most_requested favors the node with the highest consumption (i.e. tries to use up one node's resources before spreading to others).
node_label scores by node labels: a node that has the label scores, one that does not gets nothing; the more matching labels, the higher the score.
image_locality gives points to nodes that already have the images the pod needs; the more of those images a node holds (measured by the total size of the images present), the higher its score.
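
As an example, the least_requested formula quoted above can be sketched like this (an illustration of the quoted formula only; the actual function in the source scales to the scheduler's maximum node score rather than a literal 10):

package sketch

// leastRequestedScore scores a single resource by its unused fraction:
// the more free capacity a node has, the higher the score (0 when full).
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}
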
Priority source code:
https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/scheduler/algorithm/priorities/

All priority policies defined in the code:

package priorities

const (
	// EqualPriority defines the name of prioritizer function that gives an equal weight of one to all nodes.
	EqualPriority = "EqualPriority"
	// MostRequestedPriority defines the name of prioritizer function that gives used nodes higher priority.
	MostRequestedPriority = "MostRequestedPriority"
	// RequestedToCapacityRatioPriority defines the name of RequestedToCapacityRatioPriority.
	RequestedToCapacityRatioPriority = "RequestedToCapacityRatioPriority"
	// SelectorSpreadPriority defines the name of prioritizer function that spreads pods by minimizing
	// the number of pods (belonging to the same service or replication controller) on the same node.
	SelectorSpreadPriority = "SelectorSpreadPriority"
	// ServiceSpreadingPriority is largely replaced by "SelectorSpreadPriority".
	ServiceSpreadingPriority = "ServiceSpreadingPriority"
	// InterPodAffinityPriority defines the name of prioritizer function that decides which pods should or
	// should not be placed in the same topological domain as some other pods.
	InterPodAffinityPriority = "InterPodAffinityPriority"
	// LeastRequestedPriority defines the name of prioritizer function that prioritize nodes by least
	// requested utilization.
	LeastRequestedPriority = "LeastRequestedPriority"
	// BalancedResourceAllocation defines the name of prioritizer function that prioritizes nodes
	// to help achieve balanced resource usage.
	BalancedResourceAllocation = "BalancedResourceAllocation"
	// NodePreferAvoidPodsPriority defines the name of prioritizer function that priorities nodes according to
	// the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
	NodePreferAvoidPodsPriority = "NodePreferAvoidPodsPriority"
	// NodeAffinityPriority defines the name of prioritizer function that prioritizes nodes which have labels
	// matching NodeAffinity.
	NodeAffinityPriority = "NodeAffinityPriority"
	// TaintTolerationPriority defines the name of prioritizer function that prioritizes nodes that marked
	// with taint which pod can tolerate.
	TaintTolerationPriority = "TaintTolerationPriority"
	// ImageLocalityPriority defines the name of prioritizer function that prioritizes nodes that have images
	// requested by the pod present.
	ImageLocalityPriority = "ImageLocalityPriority"
	// ResourceLimitsPriority defines the nodes of prioritizer function ResourceLimitsPriority.
	ResourceLimitsPriority = "ResourceLimitsPriority"
	// EvenPodsSpreadPriority defines the name of prioritizer function that prioritizes nodes
	// which have pods and labels matching the incoming pod's topologySpreadConstraints.
	EvenPodsSpreadPriority = "EvenPodsSpreadPriority"
)

Let's take image_locality as an example and see how its score is calculated.

// calculatePriority returns the priority of a node. Given the sumScores of requested images on the node, the node's
// priority is obtained by scaling the maximum priority value with a ratio proportional to the sumScores.
func calculatePriority(sumScores int64) int {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}

	return int(int64(framework.MaxNodeScore) * (sumScores - minThreshold) / (maxThreshold - minThreshold))
}

// sumImageScores returns the sum of image scores of all the containers that are already on the node.
// Each image receives a raw score of its size, scaled by scaledImageScore. The raw scores are later used to calculate
// the final score. Note that the init containers are not considered for it's rare for users to deploy huge init containers.
func sumImageScores(nodeInfo *schedulernodeinfo.NodeInfo, containers []v1.Container, totalNumNodes int) int64 {
	var sum int64
	imageStates := nodeInfo.ImageStates()

	for _, container := range containers {
		if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
			sum += scaledImageScore(state, totalNumNodes)
		}
	}

	return sum
}

// scaledImageScore returns an adaptively scaled score for the given state of an image.
// The size of the image is used as the base score, scaled by a factor which considers how much nodes the image has "spread" to.
// This heuristic aims to mitigate the undesirable "node heating problem", i.e., pods get assigned to the same or
// a few nodes due to image locality.
func scaledImageScore(imageState *schedulernodeinfo.ImageStateSummary, totalNumNodes int) int64 {
	spread := float64(imageState.NumNodes) / float64(totalNumNodes)
	return int64(float64(imageState.Size) * spread)
}

As you can see, the code walks the pod's containers and adds a size-based score for every image that is already present on the node; the more (and the larger) of the required images a node already holds, the higher its score (a rough worked example follows the snippet):

 if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
			sum += scaledImageScore(state, totalNumNodes)
		}
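
To put some numbers on calculatePriority, here is a rough worked example. The threshold constants and the maximum node score below are assumed values for this illustration, not quotations from the v1.17 source.

package main

import "fmt"

const (
	mb           int64 = 1024 * 1024
	minThreshold       = 23 * mb   // assumed lower bound on the summed image scores
	maxThreshold       = 1000 * mb // assumed upper bound
	maxNodeScore int64 = 100       // assumed maximum node score
)

// examplePriority mirrors the clamping and scaling done by calculatePriority.
func examplePriority(sumScores int64) int64 {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}
	return maxNodeScore * (sumScores - minThreshold) / (maxThreshold - minThreshold)
}

func main() {
	// A node that already holds roughly 500 MB worth of the pod's images
	// (after spread scaling) lands near the middle of the score range.
	fmt.Println(examplePriority(500 * mb)) // 48
}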

Now let's look at how MostRequestedPriority scores nodes.

package priorities

import framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"

var (
	mostRequestedRatioResources = DefaultRequestedRatioResources
	mostResourcePriority        = &ResourceAllocationPriority{"MostResourceAllocation", mostResourceScorer, mostRequestedRatioResources}

	// MostRequestedPriorityMap is a priority function that favors nodes with most requested resources.
	// It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes
	// based on the maximum of the average of the fraction of requested to capacity.
	// Details: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
	MostRequestedPriorityMap = mostResourcePriority.PriorityMap
)

func mostResourceScorer(requested, allocable ResourceToValueMap, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64 {
	var nodeScore, weightSum int64
	for resource, weight := range mostRequestedRatioResources {
		resourceScore := mostRequestedScore(requested[resource], allocable[resource])
		nodeScore += resourceScore * weight
		weightSum += weight
	}
	return (nodeScore / weightSum)

}

// The used capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more resources are used the higher the score is. This function
// is almost a reversed version of least_requested_priority.calculateUnusedScore
// (10 - calculateUnusedScore). The main difference is in rounding. It was added to
// keep the final formula clean and not to modify the widely used (by users
// in their default scheduling policies) calculateUsedScore.
func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	if requested > capacity {
		return 0
	}

	return (requested * framework.MaxNodeScore) / capacity
}

For each node, the scorer works with two quantities per resource: how much the pods placed there have requested, and how much the node can actually allocate. The resources considered and their weights come from a resourceToWeightMap:

resourceToWeightMap ResourceToWeightMap
// contains resource names and their weights.

As for resources, only memory and CPU are taken into account by default:

var DefaultRequestedRatioResources = ResourceToWeightMap{v1.ResourceMemory: 1, v1.ResourceCPU: 1}

For volumes (this part involves PVs), using volumes can also contribute to the score; it is gated by the features.BalanceAttachedNodeVolumes feature gate:

if len(pod.Spec.Volumes) >= 0 && utilfeature.DefaultFeatureGate.Enabled(features.BalanceAttachedNodeVolumes) && nodeInfo.TransientInfo != nil {
		score = r.scorer(requested, allocatable, true, nodeInfo.TransientInfo.TransNodeInfo.RequestedVolumes, nodeInfo.TransientInfo.TransNodeInfo.AllocatableVolumesCount)
	} else {
		score = r.scorer(requested, allocatable, false, 0, 0)
	}
type ResourceToWeightMap map[v1.ResourceName]int64

The allocatable and requested amounts are collected per resource into allocatable[resource] and requested[resource]:

requested := make(ResourceToValueMap, len(r.resourceToWeightMap))
	allocatable := make(ResourceToValueMap, len(r.resourceToWeightMap))
	for resource := range r.resourceToWeightMap {
		allocatable[resource], requested[resource] = calculateResourceAllocatableRequest(nodeInfo, pod, resource)
	}

The score is then computed from the requested amounts and the per-resource weights:

for resource, weight := range mostRequestedRatioResources {
		resourceScore := mostRequestedScore(requested[resource], allocable[resource])
		nodeScore += resourceScore * weight
		weightSum += weight
	}
	return (nodeScore / weightSum)
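
To make the arithmetic concrete, here is a worked example under assumed numbers; the weights of 1 match DefaultRequestedRatioResources above, and the maximum node score of 100 is an assumption made for this illustration.

package main

import "fmt"

func main() {
	const maxNodeScore = 100
	cpuScore := int64(3000) * maxNodeScore / 4000       // 3000m requested of a 4000m node -> 75
	memScore := int64(2<<30) * maxNodeScore / (8 << 30) // 2Gi requested of an 8Gi node    -> 25
	nodeScore := (cpuScore*1 + memScore*1) / (1 + 1)    // weighted average over CPU and memory
	fmt.Println(nodeScore)                              // 50
}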

In short, the more of a node's capacity has already been requested by the pods running on it, the higher its mostRequestedScore; this policy tends to pack workloads onto as few nodes as possible.
The remaining priority policies will be covered in a follow-up.

Finally, the priority phase adds up a total score for each node, and the node with the highest total wins this round of scheduling.

Advanced scheduling: affinity and anti-affinity
When we want to steer a Pod onto particular nodes, we can use the advanced scheduling mechanisms, which fall into:
Node selectors: nodeSelector, nodeName
Node affinity scheduling: nodeAffinity
Pod affinity scheduling: podAffinity
Pod anti-affinity scheduling: podAntiAffinity
nodeAffinity
kubectl explain pod.spec.affinity.nodeAffinity

requiredDuringSchedulingIgnoredDuringExecution: hard affinity; the affinity rules must be satisfied.
matchExpressions: match expressions against node labels; for example, a pod can define the key zone, the operator In, and the values foo and bar, meaning it will only be scheduled onto nodes whose zone label is foo or bar.
matchFields: match fields; same idea as above, but it matches node fields rather than labels.
preferredDuringSchedulingIgnoredDuringExecution: soft affinity; satisfied if possible, but not required.
preference: the preferred node selector term.
weight: a weight in the range 1-100; for each node that meets all the other scheduling requirements, the scheduler iterates over the elements of this field and adds the weight to the node's sum whenever the node matches the corresponding term.
The supported operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt. NotIn and DoesNotExist can be used to achieve node anti-affinity behavior. (A sketch of both forms follows.)
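
Here is a sketch of how these fields fit together, using the Go API types from k8s.io/api/core/v1; the zone values mirror the example above, while the disktype=ssd preference is an assumption added for illustration. In a manifest the same structure would live under the pod's spec.affinity.nodeAffinity.

package sketch

import v1 "k8s.io/api/core/v1"

// nodeAffinityExample builds the affinity described above: a hard requirement
// that the node's "zone" label be "foo" or "bar", plus a soft preference for
// nodes labeled disktype=ssd.
func nodeAffinityExample() *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			// Hard affinity: the pod is only scheduled onto matching nodes.
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "zone",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"foo", "bar"},
					}},
				}},
			},
			// Soft affinity: matching nodes get extra weight during scoring.
			PreferredDuringSchedulingIgnoredDuringExecution: []v1.PreferredSchedulingTerm{{
				Weight: 50,
				Preference: v1.NodeSelectorTerm{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "disktype",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"ssd"},
					}},
				},
			}},
		},
	}
}
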
podAffinity
A typical Pod affinity scenario: the nodes of a Kubernetes cluster are spread across different zones or data centers, and service A and service B are required to run in the same zone or the same data center; that is when affinity scheduling is needed.

kubectl explain pod.spec.affinity.podAffinity works the same way as nodeAffinity: there are both hard and soft affinity variants.

Hard affinity (a sketch follows the field list):

labelSelector selects which group of Pods to be co-located with.
namespaces selects which namespaces to search for those Pods.
topologyKey specifies which node label key defines the topology domain.
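
Here is a sketch of the hard form with the Go API types; the app=cache label, the default namespace, and the kubernetes.io/hostname topology key are assumptions chosen for illustration. It asks the scheduler to place the pod in the same topology domain (here, the same node) as pods labeled app=cache.

package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podAffinityExample builds a hard pod affinity: schedule this pod onto a node
// that already runs a pod labeled app=cache in the "default" namespace.
func podAffinityExample() *v1.Affinity {
	return &v1.Affinity{
		PodAffinity: &v1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "cache"}, // which pods to be co-located with
				},
				Namespaces:  []string{"default"},      // where to look for those pods
				TopologyKey: "kubernetes.io/hostname", // what counts as "the same place"
			}},
		},
	}
}

Pod anti-affinity uses the same PodAffinityTerm shape, just placed under the PodAntiAffinity field instead.
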
Taint and toleration scheduling (Taints and Tolerations)
The previous two mechanisms let the pod choose where it goes; with taints, the node chooses which pods it accepts. A taint is a key/value property defined on a node whose main purpose is to let the node repel pods, i.e. reject pods that do not satisfy the node's rules. Taints and Tolerations work together to keep pods away from unsuitable nodes: one or more taints can be applied to a node, and any pod that does not tolerate those taints will not be accepted by that node.

Taint
Taints are an attribute of the node; let's look at how they are defined.

kubectl explain node.spec.taints (a list of objects)

key defines a key.
value defines a value.
effect defines what happens when a pod cannot tolerate this taint. There are three effects: NoSchedule only affects the scheduling process and leaves existing pods alone; PreferNoSchedule means the system tries to avoid placing non-tolerating pods on the node, but it is not mandatory (a soft version of NoSchedule); NoExecute affects both scheduling and existing pods: pods that do not tolerate the taint are evicted. (A sketch follows.)
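
Here is a sketch with the Go API types: a NoSchedule taint as it would appear in node.spec.taints, and the toleration a pod would need in pod.spec.tolerations to still be schedulable onto that node. The dedicated=gpu key/value pair is an assumption made for illustration.

package sketch

import v1 "k8s.io/api/core/v1"

// A NoSchedule taint: pods that do not tolerate dedicated=gpu will not be
// scheduled onto a node carrying this taint.
var exampleTaint = v1.Taint{
	Key:    "dedicated",
	Value:  "gpu",
	Effect: v1.TaintEffectNoSchedule,
}

// The matching toleration a pod would carry so that the taint above no longer
// repels it.
var exampleToleration = v1.Toleration{
	Key:      "dedicated",
	Operator: v1.TolerationOpEqual, // "Equal" compares key and value; "Exists" ignores the value
	Value:    "gpu",
	Effect:   v1.TaintEffectNoSchedule,
}

On the command line the same taint would be added with kubectl taint nodes <node> dedicated=gpu:NoSchedule.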

The source-code analysis for affinity will continue in a later post.
My knowledge is limited and mistakes are hard to avoid; if you spot any, please leave a comment.
