kubernetes: the Scheduler and the Scheduling Process

Background:
This article covers the principles, methods, and practice of Kubernetes scheduling; some source-code analysis will follow in later parts. All source code in this article is based on version 1.17.
References:
https://www.cnblogs.com/xzkzzz/p/9963511.html - thanks to the author for an excellent summary
Official documentation, scheduling section:
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/

Scheduler overview
When the Scheduler learns of a newly created Pod through the API server's watch interface, it examines all Nodes that could satisfy the Pod's requirements and runs its scheduling logic; once scheduling succeeds, the Pod is bound to the target Node. The Scheduler plays a connecting role in the system: upstream, it receives newly created Pods and finds them a place to land (a Node); downstream, once placement is done, the kubelet on the target Node takes over and manages the rest of the Pod's lifecycle. Concretely, the Scheduler binds each pending Pod to a suitable Node in the cluster according to specific scheduling algorithms and policies, and hands the binding to the API server, which writes it to etcd. The whole process involves three things: the list of Pods waiting to be scheduled, the list of available Nodes, and the scheduling algorithms and policies.

The Kubernetes Scheduler performs scheduling in three steps:

Predicates: iterate over the node list and pick out the candidate nodes that meet the Pod's requirements; Kubernetes ships with a number of built-in predicate rules to choose from.
Priorities: score each candidate node with the priority rules and keep the node with the highest score.
Select: if several nodes share the highest score, pick one of them at random (a flow sketch follows below).
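
To make these three steps easier to picture, here is a minimal, self-contained sketch of the filter / score / select loop. The Node and Pod types, the fits and score helpers, and all the numbers are simplified stand-ins of my own, not the real scheduler API.

package main

import (
	"fmt"
	"math/rand"
)

type Node struct {
	Name       string
	FreeCPU    int64 // millicores still available
	FreeMemory int64 // bytes still available
}

type Pod struct {
	Name   string
	CPU    int64
	Memory int64
}

// Predicate phase: a hard yes/no filter, the node either fits the pod or it does not.
func fits(pod Pod, node Node) bool {
	return node.FreeCPU >= pod.CPU && node.FreeMemory >= pod.Memory
}

// Priority phase: a 0-100 score, here in a "least requested" spirit - the more room left, the higher the score.
func score(pod Pod, node Node) int64 {
	cpuScore := (node.FreeCPU - pod.CPU) * 100 / node.FreeCPU
	memScore := (node.FreeMemory - pod.Memory) * 100 / node.FreeMemory
	return (cpuScore + memScore) / 2
}

func schedule(pod Pod, nodes []Node) (Node, error) {
	// 1. predicates: keep only feasible nodes
	var feasible []Node
	for _, n := range nodes {
		if fits(pod, n) {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return Node{}, fmt.Errorf("no node fits pod %s", pod.Name)
	}
	// 2. priorities: score the feasible nodes and remember the best ones
	best := int64(-1)
	var top []Node
	for _, n := range feasible {
		if s := score(pod, n); s > best {
			best, top = s, []Node{n}
		} else if s == best {
			top = append(top, n)
		}
	}
	// 3. select: break ties between equally scored nodes at random
	return top[rand.Intn(len(top))], nil
}

func main() {
	nodes := []Node{
		{"node-1", 4000, 8 << 30},
		{"node-2", 2000, 4 << 30},
	}
	node, err := schedule(Pod{Name: "web", CPU: 1000, Memory: 1 << 30}, nodes)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("scheduled to", node.Name)
}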

Predicate strategies
CheckNodeConditionPred: checks whether the node itself is healthy.
GeneralPred: HostName (if the pod sets pod.spec.hostname, checks whether the node name matches it) and PodFitsHostPorts (checks whether the hostPorts the pod wants to expose, pod.spec.containers.ports.hostPort, are already in use on the node; a simplified sketch follows at the end of this list).
MatchNodeSelector: checks whether the node's labels satisfy the pod's nodeSelector (pod.spec.nodeSelector).
PodFitsResources: checks whether the node's resources can satisfy the pod's requirements (a pod that needs at least 2C4G will not be scheduled onto a node with less than that available; kubectl describe node <NODE> shows a node's resource usage).
NoDiskConflict: checks whether a volume the pod defines is already in use on the node (not enabled by default).
PodToleratesNodeTaints: checks whether the pod's tolerations (pod.spec.tolerations) can tolerate the node's taints.
CheckNodeLabelPresence: checks whether particular labels exist on the node (not enabled by default).
CheckServiceAffinity: based on the service a pod belongs to, tries to place pods of the same service onto the same node (not enabled by default).
CheckVolumeBinding: checks whether the pod's volumes can be bound (not enabled by default).
NoVolumeZoneConflict: checks whether the pod and its volumes are in the same zone (not enabled by default).
CheckNodeMemoryPressure: checks whether the node is under memory pressure.
CheckNodeDiskPressure: checks whether the node is under disk I/O pressure.
CheckNodePIDPressure: checks whether the node is under PID pressure.
Predicate source code
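
Before moving on, here is a hedged illustration of what a single predicate boils down to, in the spirit of PodFitsHostPorts: it gives a yes/no answer for one node. The function name and types below are simplified stand-ins, not the real predicate signature.

package main

import "fmt"

// podFitsHostPorts reports whether the node can host a pod that asks for the given
// hostPorts; usedHostPorts is a simplified stand-in for the node's port accounting.
func podFitsHostPorts(wantedPorts []int32, usedHostPorts map[int32]bool) bool {
	for _, p := range wantedPorts {
		if p == 0 {
			continue // 0 means the container did not request a hostPort
		}
		if usedHostPorts[p] {
			return false // conflict: this node is filtered out in the predicate phase
		}
	}
	return true
}

func main() {
	used := map[int32]bool{80: true, 443: true}
	fmt.Println(podFitsHostPorts([]int32{8080}, used)) // true  -> node stays a candidate
	fmt.Println(podFitsHostPorts([]int32{80}, used))   // false -> node is filtered out
}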

Priority strategies
least_requested: favors the node with the lowest consumption, evaluated from the free ratio, e.g. for CPU: (capacity - sum(requested)) * 10 / capacity (see the sketch after the source link below).
balanced_resource_allocation: picks, from the node list, the node whose resource usage (CPU and memory) is the most balanced.
node_prefer_avoid_pods: node preference, based on the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods.
taint_toleration: matches the pod's spec.tolerations against the node's taints list; the more entries that match, the lower the score.
selector_spreading: tries to keep the pod away from nodes that already run other pods of the same service; the fewer pods of the same service a node has, the higher its score.
interpod_affinity: walks the pod-affinity terms matched on the node; the more matches, the higher the score.
most_requested: favors the node with the highest consumption (tries to use up one node's resources before moving to another).
node_label: scores nodes by label: a node that has the label scores, one that does not gets nothing; the more labels, the higher the score.
image_locality: a node that already has the images the pod needs scores points; the more of the required images it holds, the higher the score (based on the total size of the images already present).
Priority source code:
https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/scheduler/algorithm/priorities/
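
As promised next to least_requested above, here is a standalone sketch of its formula. The function name, the maxNodeScore constant and the example numbers are mine; the arithmetic mirrors the free-ratio idea from the list (and is the inverse of mostRequestedScore shown further down).

package main

import "fmt"

const maxNodeScore int64 = 100 // mirrors framework.MaxNodeScore in 1.17

// leastRequestedScore: the fewer resources already requested on the node, the higher the score.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return ((capacity - requested) * maxNodeScore) / capacity
}

func main() {
	// A 4-core node with 1 core already requested scores 75 ...
	fmt.Println(leastRequestedScore(1000, 4000)) // 75
	// ... and only 25 once 3 cores are requested.
	fmt.Println(leastRequestedScore(3000, 4000)) // 25
}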

All priority strategies defined in the code:

package priorities

const (
	// EqualPriority defines the name of prioritizer function that gives an equal weight of one to all nodes.
	EqualPriority = "EqualPriority"
	// MostRequestedPriority defines the name of prioritizer function that gives used nodes higher priority.
	MostRequestedPriority = "MostRequestedPriority"
	// RequestedToCapacityRatioPriority defines the name of RequestedToCapacityRatioPriority.
	RequestedToCapacityRatioPriority = "RequestedToCapacityRatioPriority"
	// SelectorSpreadPriority defines the name of prioritizer function that spreads pods by minimizing
	// the number of pods (belonging to the same service or replication controller) on the same node.
	SelectorSpreadPriority = "SelectorSpreadPriority"
	// ServiceSpreadingPriority is largely replaced by "SelectorSpreadPriority".
	ServiceSpreadingPriority = "ServiceSpreadingPriority"
	// InterPodAffinityPriority defines the name of prioritizer function that decides which pods should or
	// should not be placed in the same topological domain as some other pods.
	InterPodAffinityPriority = "InterPodAffinityPriority"
	// LeastRequestedPriority defines the name of prioritizer function that prioritize nodes by least
	// requested utilization.
	LeastRequestedPriority = "LeastRequestedPriority"
	// BalancedResourceAllocation defines the name of prioritizer function that prioritizes nodes
	// to help achieve balanced resource usage.
	BalancedResourceAllocation = "BalancedResourceAllocation"
	// NodePreferAvoidPodsPriority defines the name of prioritizer function that priorities nodes according to
	// the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".
	NodePreferAvoidPodsPriority = "NodePreferAvoidPodsPriority"
	// NodeAffinityPriority defines the name of prioritizer function that prioritizes nodes which have labels
	// matching NodeAffinity.
	NodeAffinityPriority = "NodeAffinityPriority"
	// TaintTolerationPriority defines the name of prioritizer function that prioritizes nodes that marked
	// with taint which pod can tolerate.
	TaintTolerationPriority = "TaintTolerationPriority"
	// ImageLocalityPriority defines the name of prioritizer function that prioritizes nodes that have images
	// requested by the pod present.
	ImageLocalityPriority = "ImageLocalityPriority"
	// ResourceLimitsPriority defines the nodes of prioritizer function ResourceLimitsPriority.
	ResourceLimitsPriority = "ResourceLimitsPriority"
	// EvenPodsSpreadPriority defines the name of prioritizer function that prioritizes nodes
	// which have pods and labels matching the incoming pod's topologySpreadConstraints.
	EvenPodsSpreadPriority = "EvenPodsSpreadPriority"
)

Let's take image_locality as an example and look at how its score is calculated.

// calculatePriority returns the priority of a node. Given the sumScores of requested images on the node, the node's
// priority is obtained by scaling the maximum priority value with a ratio proportional to the sumScores.
func calculatePriority(sumScores int64) int {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}

	return int(int64(framework.MaxNodeScore) * (sumScores - minThreshold) / (maxThreshold - minThreshold))
}

// sumImageScores returns the sum of image scores of all the containers that are already on the node.
// Each image receives a raw score of its size, scaled by scaledImageScore. The raw scores are later used to calculate
// the final score. Note that the init containers are not considered for it's rare for users to deploy huge init containers.
func sumImageScores(nodeInfo *schedulernodeinfo.NodeInfo, containers []v1.Container, totalNumNodes int) int64 {
	var sum int64
	imageStates := nodeInfo.ImageStates()

	for _, container := range containers {
		if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
			sum += scaledImageScore(state, totalNumNodes)
		}
	}

	return sum
}

// scaledImageScore returns an adaptively scaled score for the given state of an image.
// The size of the image is used as the base score, scaled by a factor which considers how much nodes the image has "spread" to.
// This heuristic aims to mitigate the undesirable "node heating problem", i.e., pods get assigned to the same or
// a few nodes due to image locality.
func scaledImageScore(imageState *schedulernodeinfo.ImageStateSummary, totalNumNodes int) int64 {
	spread := float64(imageState.NumNodes) / float64(totalNumNodes)
	return int64(float64(imageState.Size) * spread)
}

As you can see, it does walk through the pod's containers and add up the scores of the images already present on the node; the more (and the larger) of the required images the node has, the higher the score:

if state, ok := imageStates[normalizedImageName(container.Image)]; ok {
	sum += scaledImageScore(state, totalNumNodes)
}
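
To get a feel for the numbers, here is a small worked example of calculatePriority. It assumes the thresholds v1.17 uses for this priority (23 MB and 1000 MB) and a maximum node score of 100; the image sizes fed into it are made up for illustration.

package main

import "fmt"

const (
	mb           int64 = 1024 * 1024
	minThreshold int64 = 23 * mb   // assumed lower bound, as in v1.17
	maxThreshold int64 = 1000 * mb // assumed upper bound, as in v1.17
	maxNodeScore int64 = 100
)

func calculatePriority(sumScores int64) int {
	if sumScores < minThreshold {
		sumScores = minThreshold
	} else if sumScores > maxThreshold {
		sumScores = maxThreshold
	}
	return int(maxNodeScore * (sumScores - minThreshold) / (maxThreshold - minThreshold))
}

func main() {
	// The node already holds roughly 500 MB of the pod's images (after spread scaling): score 48.
	fmt.Println(calculatePriority(500 * mb)) // 48
	// The node has none of the images: clamped to the minimum, score 0.
	fmt.Println(calculatePriority(0)) // 0
}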

Now let's look at how MostRequestedPriority produces its score.

package priorities

import framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"

var (
	mostRequestedRatioResources = DefaultRequestedRatioResources
	mostResourcePriority        = &ResourceAllocationPriority{"MostResourceAllocation", mostResourceScorer, mostRequestedRatioResources}

	// MostRequestedPriorityMap is a priority function that favors nodes with most requested resources.
	// It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes
	// based on the maximum of the average of the fraction of requested to capacity.
	// Details: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
	MostRequestedPriorityMap = mostResourcePriority.PriorityMap
)

func mostResourceScorer(requested, allocable ResourceToValueMap, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64 {
	var nodeScore, weightSum int64
	for resource, weight := range mostRequestedRatioResources {
		resourceScore := mostRequestedScore(requested[resource], allocable[resource])
		nodeScore += resourceScore * weight
		weightSum += weight
	}
	return (nodeScore / weightSum)

}

// The used capacity is calculated on a scale of 0-10
// 0 being the lowest priority and 10 being the highest.
// The more resources are used the higher the score is. This function
// is almost a reversed version of least_requested_priority.calculateUnusedScore
// (10 - calculateUnusedScore). The main difference is in rounding. It was added to
// keep the final formula clean and not to modify the widely used (by users
// in their default scheduling policies) calculateUsedScore.
func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	if requested > capacity {
		return 0
	}

	return (requested * framework.MaxNodeScore) / capacity
}

For scoring, each node is described by two numbers per resource: how much has been requested by the workloads already placed on it, and how much the node can actually allocate.

resourceToWeightMap ResourceToWeightMap
// holds the resource names and their weights.

Only two resources are considered here: memory and CPU.

var DefaultRequestedRatioResources = ResourceToWeightMap{v1.ResourceMemory: 1, v1.ResourceCPU: 1}

As for volumes (this is about PVs): if the pod uses volumes, they are also factored into the score, which is gated by the features.BalanceAttachedNodeVolumes feature gate:

if len(pod.Spec.Volumes) >= 0 && utilfeature.DefaultFeatureGate.Enabled(features.BalanceAttachedNodeVolumes) && nodeInfo.TransientInfo != nil {
	score = r.scorer(requested, allocatable, true, nodeInfo.TransientInfo.TransNodeInfo.RequestedVolumes, nodeInfo.TransientInfo.TransNodeInfo.AllocatableVolumesCount)
} else {
	score = r.scorer(requested, allocatable, false, 0, 0)
}

type ResourceToWeightMap map[v1.ResourceName]int64

The allocatable and requested amounts are recorded per resource into allocatable[resource] and requested[resource]:

requested := make(ResourceToValueMap, len(r.resourceToWeightMap))
allocatable := make(ResourceToValueMap, len(r.resourceToWeightMap))
for resource := range r.resourceToWeightMap {
	allocatable[resource], requested[resource] = calculateResourceAllocatableRequest(nodeInfo, pod, resource)
}

The score is then computed from the requested resources and their weights:

for resource, weight := range mostRequestedRatioResources {
	resourceScore := mostRequestedScore(requested[resource], allocable[resource])
	nodeScore += resourceScore * weight
	weightSum += weight
}
return (nodeScore / weightSum)

In short, the more resources the workloads already placed on a node request, the higher its mostRequestedScore.
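
A quick numeric check of that, with made-up numbers: a node with 4 CPU and 8 GiB of capacity on which 3 CPU and 2 GiB are already requested. The helper below re-implements the formula above in standalone form.

package main

import "fmt"

const maxNodeScore int64 = 100

func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (requested * maxNodeScore) / capacity
}

func main() {
	cpuScore := mostRequestedScore(3000, 4000)   // 75: CPU is heavily requested
	memScore := mostRequestedScore(2<<30, 8<<30) // 25: memory much less so
	// CPU and memory both carry weight 1, so the node score is the plain average.
	fmt.Println((cpuScore*1 + memScore*1) / (1 + 1)) // 50
}
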
The remaining priority strategies will be covered in a follow-up.

Finally, the priority phase adds up a total score for each node, and the node with the highest score wins this round of scheduling.

Advanced scheduling: affinity and anti-affinity
When we want a pod to be scheduled onto the nodes we expect, we can use the advanced scheduling mechanisms, which fall into:
Node selectors: nodeSelector, nodeName
Node affinity: nodeAffinity
Pod affinity: podAffinity
Pod anti-affinity: podAntiAffinity
nodeAffinity
kubectl explain pod.spec.affinity.nodeAffinity

requiredDuringSchedulingIgnoredDuringExecution: hard affinity; the affinity rules must be satisfied.
matchExpressions: label match expressions. For example, a pod can declare key zone, operator In (contained in), values foo and bar; the pod will then only be scheduled onto nodes whose zone label is foo or bar (see the sketch after this list).
matchFields: field match expressions; similar to the above, but they match node fields rather than node labels.
preferredDuringSchedulingIgnoredDuringExecution: soft affinity; satisfied if possible, not a hard requirement.
preference: the preferred node affinity term.
weight: a weight in the range 1-100. For each node that meets all the other scheduling requirements, the scheduler iterates over the elements of this field and adds the weight to the node's sum whenever the node matches the corresponding term.
The supported operators are In, NotIn, Exists, DoesNotExist, Gt and Lt. NotIn and DoesNotExist can be used to achieve node anti-affinity behavior.
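
As referenced in the matchExpressions item, here is a simplified sketch of how such a requirement is evaluated against a node's labels; the requirement struct and the matches function are stand-ins of my own, not the real k8s API types.

package main

import "fmt"

type requirement struct {
	Key      string
	Operator string // In, NotIn, Exists, DoesNotExist
	Values   []string
}

// matches reports whether a single node, described by its labels, satisfies one requirement.
func matches(req requirement, nodeLabels map[string]string) bool {
	val, ok := nodeLabels[req.Key]
	switch req.Operator {
	case "Exists":
		return ok
	case "DoesNotExist":
		return !ok
	case "In":
		if !ok {
			return false
		}
		for _, v := range req.Values {
			if v == val {
				return true
			}
		}
		return false
	case "NotIn":
		if !ok {
			return true
		}
		for _, v := range req.Values {
			if v == val {
				return false
			}
		}
		return true
	}
	return false
}

func main() {
	node := map[string]string{"zone": "foo"}
	req := requirement{Key: "zone", Operator: "In", Values: []string{"foo", "bar"}}
	fmt.Println(matches(req, node)) // true: this node satisfies the hard affinity term
}
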
podAffinity
A typical pod affinity scenario: the nodes of a k8s cluster are spread across different zones or server rooms, and service A and service B are required to run in the same zone or the same room; that is when pod affinity scheduling is needed.

kubectl explain pod.spec.affinity.podAffinity - same as nodeAffinity: there are both hard and soft affinity rules.

Hard affinity:

labelSelector: selects the group of pods to be affine with.
namespaces: which namespaces to look for those pods in.
topologyKey: the node label key that defines the topology domain (see the sketch below).
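
The sketch below shows what topologyKey means in practice: two nodes belong to the same topology domain when they carry the same value for that label key. The function and label values are illustrative only, not scheduler code.

package main

import "fmt"

// sameTopologyDomain reports whether two nodes fall into the same topology domain for a given key.
func sameTopologyDomain(topologyKey string, nodeA, nodeB map[string]string) bool {
	va, okA := nodeA[topologyKey]
	vb, okB := nodeB[topologyKey]
	return okA && okB && va == vb
}

func main() {
	nodeA := map[string]string{"kubernetes.io/hostname": "node-1", "topology.kubernetes.io/zone": "zone-a"}
	nodeB := map[string]string{"kubernetes.io/hostname": "node-2", "topology.kubernetes.io/zone": "zone-a"}

	// Grouping by zone: both nodes are in zone-a, so a zone-level podAffinity term treats them as one domain.
	fmt.Println(sameTopologyDomain("topology.kubernetes.io/zone", nodeA, nodeB)) // true
	// Grouping by hostname: each node is its own domain.
	fmt.Println(sameTopologyDomain("kubernetes.io/hostname", nodeA, nodeB)) // false
}
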
Taint and toleration scheduling
The previous mechanisms are the pod choosing where it wants to run; taint scheduling is the node choosing which pods it will accept. A taint is a key/value attribute defined on a node, and its main purpose is to let the node repel pods, i.e. reject pods that do not satisfy the node's rules. Taints and tolerations work together to keep pods away from unsuitable nodes: one or more taints can be applied to a node, and the node will not accept any pod that does not tolerate those taints.

Taint
A taint is an attribute of a node; let's see how taints are defined:

kubectl explain node.spec.taints (a list of objects)

key: the taint key.
value: the taint value.
effect: what happens when a pod cannot tolerate this taint; there are three effects: NoSchedule only affects scheduling and leaves existing pods untouched; PreferNoSchedule means the system tries to avoid placing intolerant pods on the node but does not guarantee it, i.e. a soft version of NoSchedule; NoExecute affects both scheduling and existing pods, and pods that do not tolerate the taint will be evicted (a matching sketch follows below).
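
To tie key, value, operator and effect together, here is a simplified sketch of how one toleration is matched against one taint; the structs are stand-ins for node.spec.taints and pod.spec.tolerations, not the real API types.

package main

import "fmt"

type taint struct {
	Key, Value, Effect string // Effect: NoSchedule, PreferNoSchedule or NoExecute
}

type toleration struct {
	Key, Operator, Value, Effect string // Operator: Equal or Exists
}

// tolerates reports whether the toleration covers the taint: key and effect must line up,
// and the operator decides whether the value has to match as well.
func tolerates(tol toleration, t taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	if tol.Operator == "Exists" {
		return true // any value is tolerated
	}
	return tol.Value == t.Value // Equal (the default) also compares values
}

func main() {
	t := taint{Key: "node-role", Value: "infra", Effect: "NoSchedule"}

	fmt.Println(tolerates(toleration{Key: "node-role", Operator: "Equal", Value: "infra", Effect: "NoSchedule"}, t)) // true
	fmt.Println(tolerates(toleration{Key: "node-role", Operator: "Exists"}, t))                                      // true: value is ignored
	fmt.Println(tolerates(toleration{Key: "other", Operator: "Exists"}, t))                                          // false: key does not match
}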

Source-code analysis of affinity will continue in a later post.
My knowledge is limited and mistakes are hard to avoid; if you find any, please leave a comment.
