8. Deep dive into k8s: QoS resource control and eviction, with source code analysis

Please credit the source when republishing. This article was published on luozhiyun's blog: https://www.luozhiyun.com. The source code analyzed is version 1.19.


It's the weekend again, a chance to sit down quietly and savor some source code. This article covers resource reclamation, which is a lot of ground, so it will be fairly long. We will see what k8s does when resources run short, what it takes into account when reclaiming them, and why our pods sometimes get killed seemingly out of nowhere.

limit&request

In k8s, CPU and memory resources are constrained mainly through limits and requests, defined in the YAML file as follows:

spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory

During scheduling, kube-scheduler computes only with the requests values, while what actually caps resource usage is the limit.

Here is an example adapted from the official docs:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"

In this example, the args pass cpus equal to 2, telling the container to stress 2 CPUs. But our limit is 1 and our request is 0.5.

After creating this pod, running kubectl top to inspect resource usage shows that CPU usage never exceeds 1:

NAME                        CPU(cores)   MEMORY(bytes)
cpu-demo                    974m         <something>

This shows that the pod's CPU is capped at 1 CPU; even if the container wants more, it cannot get it.

When a container does not specify a request, the request defaults to the same value as the limit.
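This defaulting rule can be sketched in a few lines of Go. `fillRequests` is a hypothetical helper written only for illustration, not the actual apiserver defaulting code:

```go
package main

// fillRequests mirrors the defaulting rule described above: when a
// container sets a limit for a resource but no request, the request
// defaults to the limit. This is an illustrative sketch; the real
// logic lives in the apiserver's defaulting code, and fillRequests
// is a hypothetical name.
func fillRequests(limits, requests map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range requests {
		out[k] = v
	}
	for k, v := range limits {
		if _, ok := out[k]; !ok {
			out[k] = v // no explicit request: default it to the limit
		}
	}
	return out
}
```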

QoS Model and Eviction

Now let's look at the different QoS classes that different combinations of requests and limits produce.

There are three QoS classes in kubernetes:

  1. Guaranteed: every container in the pod has limits and requests set for all resources, with limit equal to request and nonzero;

  2. Burstable: the pod does not meet the Guaranteed criteria, but at least one of its containers sets requests or limits;

  3. BestEffort: neither requests nor limits are set anywhere in the pod;
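The three rules can be condensed into a toy classifier. This is a deliberately simplified, single-container sketch over string maps; the real GetPodQOS, analyzed later in this article, aggregates quantities across all containers of the pod:

```go
package main

type resources struct {
	requests map[string]string
	limits   map[string]string
}

// qosClass is a deliberately simplified, single-container version of the
// three rules above. It assumes API defaulting has already filled missing
// requests from limits, and compares values as plain strings; the real
// GetPodQOS aggregates quantities across all containers.
func qosClass(r resources) string {
	if len(r.requests) == 0 && len(r.limits) == 0 {
		return "BestEffort"
	}
	for _, res := range []string{"cpu", "memory"} {
		lim, ok := r.limits[res]
		if !ok || lim != r.requests[res] {
			return "Burstable"
		}
	}
	return "Guaranteed"
}
```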

When the host is under resource pressure, the kubelet evicts pods (i.e. reclaims their resources) in QoS order: BestEffort > Burstable > Guaranteed.

Eviction comes in two modes, Soft and Hard. Soft eviction lets you set a grace period for the eviction process: the kubelet waits out the user-configured grace period before evicting, whereas Hard eviction acts immediately.

So when does eviction happen? We can set thresholds for eviction. For example, with a memory eviction hard threshold of 100M, once the node's available memory drops below 100M, the kubelet ranks all pods on the node by their QoS class and memory usage, and evicts the top-ranked pods to free up enough memory.

Thresholds are defined in the form [eviction-signal][operator][quantity].

eviction-signal

According to the official documentation, the eviction signals are:

| Eviction Signal | Description |
| --- | --- |
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available | nodefs.available := node.stats.fs.available |
| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree |

nodefs and imagefs denote two filesystem partitions:

nodefs: the filesystem the kubelet uses for volumes, daemon logs, and so on.

imagefs: the filesystem the container runtime uses to store images and container writable layers.

operator

The required relational operator, such as "<".

quantity

The magnitude of the threshold: either an absolute size, such as 1Gi, or a percentage, such as 10%.
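Converting such a quantity into a concrete byte count against node capacity can be sketched as below. `thresholdBytes` is a hypothetical simplification of what `evictionapi.GetThresholdQuantity` does; the real code works on `resource.Quantity` values and also understands suffixes like "1Gi", which this sketch does not parse:

```go
package main

import (
	"strconv"
	"strings"
)

// thresholdBytes turns a threshold quantity such as "10%" or a plain byte
// count into bytes, given the node capacity in bytes. It is a hedged
// simplification of evictionapi.GetThresholdQuantity: the real code works
// on resource.Quantity values and also understands suffixes like "1Gi",
// which this sketch does not parse.
func thresholdBytes(value string, capacityBytes int64) int64 {
	if strings.HasSuffix(value, "%") {
		pct, _ := strconv.ParseFloat(strings.TrimSuffix(value, "%"), 64)
		return int64(pct / 100 * float64(capacityBytes))
	}
	n, _ := strconv.ParseInt(value, 10, 64)
	return n
}
```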

If the kubelet cannot reclaim memory before the node hits a system OOM, the oom_killer computes an oom_score from each container's share of the node's memory and then kills the highest-scoring container.

QoS Source Code Analysis

The QoS code lives in the pkg/apis/core/v1/helper/qos package:

qos#GetPodQOS

//pkg/apis/core/v1/helper/qos/qos.go
func GetPodQOS(pod *v1.Pod) v1.PodQOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	allContainers := []v1.Container{}
	//collect all containers, including init containers
	allContainers = append(allContainers, pod.Spec.Containers...)
	allContainers = append(allContainers, pod.Spec.InitContainers...)
	//iterate over every container
	for _, container := range allContainers {
		// process requests
		//walk the requests and accumulate the cpu and memory values
		for name, quantity := range container.Resources.Requests {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.DeepCopy()
				if _, exists := requests[name]; !exists {
					requests[name] = delta
				} else {
					delta.Add(requests[name])
					requests[name] = delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		//walk the limits and accumulate the cpu and memory values
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}
		//if limits does not set both cpu and memory, the pod is not Guaranteed
		if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
			isGuaranteed = false
		}
	}
	//if neither requests nor limits are set anywhere, the pod is BestEffort
	if len(requests) == 0 && len(limits) == 0 {
		return v1.PodQOSBestEffort
	}
	// Check if requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	// limits and requests are both set and match for every resource: Guaranteed
	if isGuaranteed &&
		len(requests) == len(limits) {
		return v1.PodQOSGuaranteed
	}
	return v1.PodQOSBurstable
}

The comments above say it all; the method is quite simple, so I won't belabor it.

Next comes the QoS OOM scoring mechanism: each pod gets a score that decides which pods are killed first. The higher the score, the more likely the pod is to be killed.

policy

//pkg/kubelet/qos/policy.go
// the higher the score, the more likely the process is to be killed
const (
	// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
	KubeletOOMScoreAdj int = -999
	// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)

policy#GetContainerOOMScoreAdjust

//pkg/kubelet/qos/policy.go
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	//static pods, mirror pods and critical pods go straight to guaranteedOOMScoreAdj
	if types.IsCriticalPod(pod) {
		// Critical pods should be the last to get killed.
		return guaranteedOOMScoreAdj
	}
	//get the pod's QoS class; only Guaranteed and BestEffort are handled here
	switch v1qos.GetPodQOS(pod) {
	case v1.PodQOSGuaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case v1.PodQOSBestEffort:
		return besteffortOOMScoreAdj
	} 
	memoryRequest := container.Resources.Requests.Memory().Value()
	//the less memory requested, the higher the score
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	 
	//clamp so that a burstable pod always keeps a higher OOM score than guaranteed pods
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	 
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}

This method scores the pods. Static pods, mirror pods and critical pods are treated directly as guaranteed;

it then calls the GetPodQOS method to obtain the pod's QoS class. If the pod is burstable, its score is computed from its memory request: the less memory it requests, the higher the score. If the score would fall below 1000 + guaranteedOOMScoreAdj, that is, below 2, it is clamped to 2 so it cannot drop too low.
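Plugging numbers into the Burstable branch makes the clamping concrete. The helper below re-implements just that arithmetic from GetContainerOOMScoreAdjust for illustration; the constants are the ones quoted from policy.go above:

```go
package main

// burstableOOMScoreAdj reproduces just the Burstable arithmetic from
// GetContainerOOMScoreAdjust above, including the two clamps at the
// edges; the constants are the values quoted from policy.go.
func burstableOOMScoreAdj(memoryRequest, memoryCapacity int64) int {
	const guaranteedOOMScoreAdj = -998
	const besteffortOOMScoreAdj = 1000
	adj := 1000 - (1000*memoryRequest)/memoryCapacity
	if int(adj) < 1000+guaranteedOOMScoreAdj { // i.e. below 2
		return 1000 + guaranteedOOMScoreAdj
	}
	if int(adj) == besteffortOOMScoreAdj {
		return int(adj - 1)
	}
	return int(adj)
}
```

For example, a container requesting 100Mi on a 1Gi node gets an adjustment of 903, far above the -998 of guaranteed containers, so under memory pressure the kernel will prefer to kill it.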

Eviction Manager Source Code Analysis

When kubelet instantiates its Kubelet object, it calls eviction.NewManager to create an evictionManager. Then, when kubelet starts working in its Run method, it launches a goroutine that executes updateRuntimeUp every 5s.

In updateRuntimeUp, once the runtime is confirmed to be up, initializeRuntimeDependentModules is called to initialize the modules that depend on the runtime.

initializeRuntimeDependentModules then calls the evictionManager's Start method to start it.

The code is below; we'll leave the detailed kubelet flow for a later analysis:

func NewMainKubelet(...){
	...
	evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.podManager.GetMirrorPodByPod, klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock, etcHostsPathFunc)

	klet.evictionManager = evictionManager
    ...
}

func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ...
    go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
    ...
}

func (kl *Kubelet) updateRuntimeUp() {
    ...
    kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
    ...
}


func (kl *Kubelet) initializeRuntimeDependentModules() {
    ...
    kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)
    ...
}

Now let's go to pkg/kubelet/eviction/eviction_manager.go and see how the Start method implements eviction.

managerImpl#Start

// start a control loop that monitors and responds to resource pressure
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
	thresholdHandler := func(message string) {
		klog.Infof(message)
		m.synchronize(diskInfoProvider, podFunc)
	}
	//whether to use kernel memcg notifications
	if m.config.KernelMemcgNotification {
		for _, threshold := range m.config.Thresholds {
			if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
				notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
				if err != nil {
					klog.Warningf("eviction manager: failed to create memory threshold notifier: %v", err)
				} else {
					go notifier.Start()
					m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
				}
			}
		}
	}
	// start the eviction manager monitoring
	// start a goroutine that runs synchronize every monitoringInterval (10s)
	go func() {
		for { 
			//synchronize is the main eviction control loop; it returns the evicted pods, or nil
			if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
				klog.Infof("eviction manager: pods %s evicted, waiting for pod to be cleaned up", format.Pods(evictedPods))
				m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
			} else {
				time.Sleep(monitoringInterval)
			}
		}
	}()
}

The synchronize method below is long; it calls for some patience:

managerImpl#synchronize

  1. Register the ranking function for each of the eviction signals introduced above, and register the node-level resource reclaim functions

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	if m.dedicatedImageFs == nil {
    		hasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()
    		if ok != nil {
    			return nil
    		}
    		m.dedicatedImageFs = &hasImageFs
    		//register the pod ranking function for each eviction signal
    		m.signalToRankFunc = buildSignalToRankFunc(hasImageFs)
    		// register node resource reclaim functions, e.g. imagefs.available maps to deleting unused containers and unused images
    		m.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)
    	}
    	...
    }
    

    Let's look at the implementation of buildSignalToRankFunc:

    func buildSignalToRankFunc(withImageFs bool) map[evictionapi.Signal]rankFunc {
    	signalToRankFunc := map[evictionapi.Signal]rankFunc{
    		evictionapi.SignalMemoryAvailable:            rankMemoryPressure,
    		evictionapi.SignalAllocatableMemoryAvailable: rankMemoryPressure,
    		evictionapi.SignalPIDAvailable:               rankPIDPressure,
    	} 
    	if withImageFs { 
    		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes) 
    		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot}, resourceInodes)
    	} else { 
    		signalToRankFunc[evictionapi.SignalNodeFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalNodeFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
    		signalToRankFunc[evictionapi.SignalImageFsAvailable] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, v1.ResourceEphemeralStorage)
    		signalToRankFunc[evictionapi.SignalImageFsInodesFree] = rankDiskPressureFunc([]fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}, resourceInodes)
    	}
    	return signalToRankFunc
    }
    

    This method puts the ranking function for each eviction signal, such as MemoryAvailable, NodeFsAvailable and ImageFsAvailable, into a map and returns it.

  2. Fetch all active pods and the overall stats

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//fetch the currently active pods
    	activePods := podFunc()
    	updateStats := true
    	//fetch the node's overall summary, i.e. the node stats and pod stats
    	summary, err := m.summaryProvider.Get(updateStats)
    	if err != nil {
    		klog.Errorf("eviction manager: failed to get summary stats: %v", err)
    		return nil
    	}
    	//if the notifiers have gone more than 10s without a refresh, update them
    	if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
    		m.thresholdsLastUpdated = m.clock.Now()
    		for _, notifier := range m.thresholdNotifiers {
    			if err := notifier.UpdateThreshold(summary); err != nil {
    				klog.Warningf("eviction manager: failed to update %s: %v", notifier.Description(), err)
    			}
    		}
    	}
    	...
    }
    
  3. Build the corresponding statistics from the summary into an observations object

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//build per-signal statistics from the summary into observations, e.g. SignalMemoryAvailable, SignalNodeFsAvailable, etc.
    	observations, statsFunc := makeSignalObservations(summary)
    	...
    }
    

    Here is an excerpt of makeSignalObservations:

    func makeSignalObservations(summary *statsapi.Summary) (signalObservations, statsFunc) { 
    	...
    	if memory := summary.Node.Memory; memory != nil && memory.AvailableBytes != nil && memory.WorkingSetBytes != nil {
    		result[evictionapi.SignalMemoryAvailable] = signalObservation{
    			available: resource.NewQuantity(int64(*memory.AvailableBytes), resource.BinarySI),
    			capacity:  resource.NewQuantity(int64(*memory.AvailableBytes+*memory.WorkingSetBytes), resource.BinarySI),
    			time:      memory.Time,
    		}
    	}
    	... 
    }
    

    This method packages the resource usage from the summary into the result, keyed by eviction signal.

  4. Use the collected observations to determine which thresholds have been met

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//use the observations to decide which thresholds have been reached, and return them
    	thresholds = thresholdsMet(thresholds, observations, false)
    	
        if len(m.thresholdsMet) > 0 {
    		//minimum eviction reclaim policy
    		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
    		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    	}
    	...
    }
    

    thresholdsMet

    func thresholdsMet(thresholds []evictionapi.Threshold, observations signalObservations, enforceMinReclaim bool) []evictionapi.Threshold {
    	results := []evictionapi.Threshold{}
    	for i := range thresholds {
    		threshold := thresholds[i]
    		observed, found := observations[threshold.Signal]
    		if !found {
    			klog.Warningf("eviction manager: no observation found for eviction signal %v", threshold.Signal)
    			continue
    		} 
    		thresholdMet := false
    		// compute the threshold quantity from the resource capacity
    		quantity := evictionapi.GetThresholdQuantity(threshold.Value, observed.capacity) 
    		//minimum eviction reclaim policy; see: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim
    		if enforceMinReclaim && threshold.MinReclaim != nil {
    			quantity.Add(*evictionapi.GetThresholdQuantity(*threshold.MinReclaim, observed.capacity))
    		}
    		//Cmp returns 1 when quantity is greater than observed.available
    		thresholdResult := quantity.Cmp(*observed.available)
    		//check the operator
    		switch threshold.Operator {
    		//for the "<" operator, the threshold is met when thresholdResult is greater than 0
    		case evictionapi.OpLessThan:
    			thresholdMet = thresholdResult > 0
    		}
    		//appending to results means the threshold has been reached
    		if thresholdMet {
    			results = append(results, threshold)
    		}
    	}
    	return results
    }
    

    thresholdsMet iterates over all the thresholds, looking up each eviction signal's resource situation in observations. As mentioned above, a configured threshold can be 1Gi or a percentage, so GetThresholdQuantity is called to convert it into a concrete quantity;

    it then applies the minimum eviction reclaim policy to decide whether the amount to reclaim should be raised further; see the docs for details: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#minimum-eviction-reclaim;

    finally it compares quantity with available, and if the threshold has been reached, adds the threshold to the results collection and returns it.

  5. Record when each eviction signal was first observed, and map the eviction signals to node conditions

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	now := m.clock.Now()
    	//record the first time each eviction signal was observed; use now if there is no record yet
    	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
    
    	// the set of node conditions that are triggered by currently observed thresholds
    	// the kubelet maps each eviction signal to its corresponding node condition
    	nodeConditions := nodeConditions(thresholds)
    	if len(nodeConditions) > 0 {
    		klog.V(3).Infof("eviction manager: node conditions - observed: %v", nodeConditions)
    	}
    	...
    }
    

    nodeConditions

    func nodeConditions(thresholds []evictionapi.Threshold) []v1.NodeConditionType {
    	results := []v1.NodeConditionType{}
    	for _, threshold := range thresholds {
    		if nodeCondition, found := signalToNodeCondition[threshold.Signal]; found {
    			//check whether results already contains nodeCondition
    			if !hasNodeCondition(results, nodeCondition) {
    				results = append(results, nodeCondition)
    			}
    		}
    	}
    	return results
    }
    

    The nodeConditions method simply maps each signal to a node condition via signalToNodeCondition, which is defined as:

    	signalToNodeCondition = map[evictionapi.Signal]v1.NodeConditionType{}
    	signalToNodeCondition[evictionapi.SignalMemoryAvailable] = v1.NodeMemoryPressure
    	signalToNodeCondition[evictionapi.SignalAllocatableMemoryAvailable] = v1.NodeMemoryPressure
    	signalToNodeCondition[evictionapi.SignalImageFsAvailable] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalNodeFsAvailable] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalImageFsInodesFree] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalNodeFsInodesFree] = v1.NodeDiskPressure
    	signalToNodeCondition[evictionapi.SignalPIDAvailable] = v1.NodePIDPressure
    

    In other words, eviction signals map to MemoryPressure or DiskPressure, tabulated below:

    | Node Condition | Eviction Signal | Description |
    | --- | --- | --- |
    | MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
    | DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
  6. Merge this round's node conditions with the previously observed ones, keeping the latest

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//merge this round's node conditions with the last observed ones, keeping the latest
    	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
    	...
    }
    
  7. Prevent the node conditions from flapping when resources oscillate around the threshold

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//the PressureTransitionPeriod parameter defaults to 5 minutes
    	//this prevents node conditions from flapping when resources oscillate around the threshold
    	//see the docs: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#oscillation-of-node-conditions
    	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    	if len(nodeConditions) > 0 {
    		klog.V(3).Infof("eviction manager: node conditions - transition period not met: %v", nodeConditions)
    	}
    	...
    }
    

    nodeConditionsObservedSince

    func nodeConditionsObservedSince(observedAt nodeConditionsObservedAt, period time.Duration, now time.Time) []v1.NodeConditionType {
    	results := []v1.NodeConditionType{}
    	for nodeCondition, at := range observedAt {
    		duration := now.Sub(at)
    		if duration < period {
    			results = append(results, nodeCondition)
    		}
    	}
    	return results
    }
    

    Conditions whose last observation is more than 5 minutes old are dropped.

  8. Handle eviction-soft thresholds

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//apply the eviction-soft grace period; a threshold joins the set only once it has been met longer than its grace period
    	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    	...
    }
    

    thresholdsMetGracePeriod

    func thresholdsMetGracePeriod(observedAt thresholdsObservedAt, now time.Time) []evictionapi.Threshold {
    	results := []evictionapi.Threshold{}
    	for threshold, at := range observedAt {
    		duration := now.Sub(at)
    		//soft eviction thresholds must hold for their grace period before they can trigger
    		if duration < threshold.GracePeriod {
    			klog.V(2).Infof("eviction manager: eviction criteria not yet met for %v, duration: %v", formatThreshold(threshold), duration)
    			continue
    		}
    		results = append(results, threshold)
    	}
    	return results
    }
    
  9. Store the state, then compare and update

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	// update internal state
    	m.Lock()
    	m.nodeConditions = nodeConditions
    	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    	m.thresholdsMet = thresholds
     
    	// keep only the thresholds whose stats have been updated since the last observations
    	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    	debugLogThresholdsWithObservation("thresholds - updated stats", thresholds, observations)
    
    	//record this round's observations as the last observations
    	m.lastObservations = observations
    	m.Unlock()
    	...
    }
    
  10. After sorting, find the first threshold to relieve and its corresponding resource

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	...
    	//if the eviction signal set is empty, this round ends here
    	if len(thresholds) == 0 {
    		klog.V(3).Infof("eviction manager: no resources are starved")
    		return nil
    	}
     
    	//sort, then take the first element of the thresholds collection
    	sort.Sort(byEvictionPriority(thresholds))
    	thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
    	if !foundAny {
    		return nil
    	}
    	...
    }
    

    getReclaimableThreshold

    func getReclaimableThreshold(thresholds []evictionapi.Threshold) (evictionapi.Threshold, v1.ResourceName, bool) {
    	//iterate over thresholds and map each eviction signal to its resource
    	for _, thresholdToReclaim := range thresholds {
    		if resourceToReclaim, ok := signalToResource[thresholdToReclaim.Signal]; ok {
    			return thresholdToReclaim, resourceToReclaim, true
    		}
    		klog.V(3).Infof("eviction manager: threshold %s was crossed, but reclaim is not implemented for this threshold.", thresholdToReclaim.Signal)
    	}
    	return evictionapi.Threshold{}, "", false
    }
    

    Here is the definition of signalToResource:

    	signalToResource = map[evictionapi.Signal]v1.ResourceName{}
    	signalToResource[evictionapi.SignalMemoryAvailable] = v1.ResourceMemory
    	signalToResource[evictionapi.SignalAllocatableMemoryAvailable] = v1.ResourceMemory
    	signalToResource[evictionapi.SignalImageFsAvailable] = v1.ResourceEphemeralStorage
    	signalToResource[evictionapi.SignalImageFsInodesFree] = resourceInodes
    	signalToResource[evictionapi.SignalNodeFsAvailable] = v1.ResourceEphemeralStorage
    	signalToResource[evictionapi.SignalNodeFsInodesFree] = resourceInodes
    	signalToResource[evictionapi.SignalPIDAvailable] = resourcePids
    

    signalToResource groups the eviction signals into memory, ephemeral-storage, inodes and pids.

  11. Reclaim node-level resources

```go
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...
	//reclaim node-level resources
	if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
		klog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return nil
	}
	...
}
```

**reclaimNodeLevelResources**

```go
func (m *managerImpl) reclaimNodeLevelResources(signalToReclaim evictionapi.Signal, resourceToReclaim v1.ResourceName) bool {
	//call the functions registered in buildSignalToNodeReclaimFuncs
	nodeReclaimFuncs := m.signalToNodeReclaimFuncs[signalToReclaim]
	for _, nodeReclaimFunc := range nodeReclaimFuncs { 
		// delete unused images, or pods and containers that are already dead
		if err := nodeReclaimFunc(); err != nil {
			klog.Warningf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
		}

	}
	//after reclaiming, re-check resource usage; if no threshold is met any more, stop here
	if len(nodeReclaimFuncs) > 0 {
		summary, err := m.summaryProvider.Get(true)
		if err != nil {
			klog.Errorf("eviction manager: failed to get summary stats after resource reclaim: %v", err)
			return false
		}
 
		observations, _ := makeSignalObservations(summary)
		debugLogObservations("observations after resource reclaim", observations)
 
		thresholds := thresholdsMet(m.config.Thresholds, observations, false)
		debugLogThresholdsWithObservation("thresholds after resource reclaim - ignoring grace period", thresholds, observations)

		if len(thresholds) == 0 {
			return true
		}
	}
	return false
}
```

First, the reclaim functions for the signal to be relieved are looked up in signalToNodeReclaimFuncs; they were registered in buildSignalToNodeReclaimFuncs above, e.g.:

```
nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
```

These call the corresponding GC methods, deleting unused containers and unused images to free resources.

It then checks whether any threshold is still exceeded after the reclaim; if not, the loop ends right there.
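The control flow just described can be condensed into a small sketch. `reclaimFuncs` and `stillMet` are hypothetical stand-ins for `signalToNodeReclaimFuncs[signal]` and a re-run of thresholdsMet against fresh observations:

```go
package main

// tryNodeReclaim condenses the reclaimNodeLevelResources flow: run every
// node-level reclaim function for the signal (e.g. container GC, image GC)
// and report whether the pressure is gone afterwards. reclaimFuncs and
// stillMet are hypothetical stand-ins for signalToNodeReclaimFuncs[signal]
// and a re-run of thresholdsMet against fresh observations.
func tryNodeReclaim(reclaimFuncs []func() error, stillMet func() bool) bool {
	for _, f := range reclaimFuncs {
		_ = f() // the real code logs errors and keeps going; ignored here
	}
	// only re-check (and possibly declare success) if anything was run
	return len(reclaimFuncs) > 0 && !stillMet()
}
```

When this returns false, pod-level eviction has to proceed, which is exactly what the remaining steps of synchronize do.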
  12. Fetch the corresponding ranking function and sort the pods

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	... 
    	//fetch the eviction signal's ranking function, registered in buildSignalToRankFunc
    	rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
    	if !ok {
    		klog.Errorf("eviction manager: no ranking function for signal %s", thresholdToReclaim.Signal)
    		return nil
    	}
    
    	//if there are no active pods, return directly
    	if len(activePods) == 0 {
    		klog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
    		return nil
    	}
     
    	//rank the pods by the resource in question
    	rank(activePods, statsFunc)
    	...
    }
    
  13. Evict the sorted pods and return

    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    	... 
    	for i := range activePods {
    		pod := activePods[i]
    		gracePeriodOverride := int64(0)
    		if !isHardEvictionThreshold(thresholdToReclaim) {
    			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
    		}
    		message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
    		//kill pod
    		if m.evictPod(pod, gracePeriodOverride, message, annotations) {
    			metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
    			return []*v1.Pod{pod}
    		}
    	}
    	...
    }
    

    As soon as a single pod is evicted, the method returns.

That concludes the analysis of the eviction manager.
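To make the ranking step concrete, here is a much-simplified stand-in for a rank function such as rankMemoryPressure: pods consuming the furthest above their memory request are evicted first. Since BestEffort pods request nothing, any usage counts as overage, which lines up with the BestEffort > Burstable > Guaranteed order from earlier. The real ranking also considers whether usage exceeds the request at all, and pod priority; those details are omitted here:

```go
package main

import "sort"

type podStat struct {
	name    string
	usage   int64 // working-set bytes
	request int64 // memory request bytes
}

// rankForMemoryEviction is a much-simplified stand-in for a rank function
// such as rankMemoryPressure: pods consuming the furthest above their
// memory request come first. The real ranking also orders by whether
// usage exceeds the request at all, and by pod priority.
func rankForMemoryEviction(pods []podStat) []string {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
	names := make([]string, len(pods))
	for i, p := range pods {
		names[i] = p.name
	}
	return names
}
```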

Summary

This article covered how resource control works. We saw that the limit and request settings affect the priority with which a pod gets deleted, so setting sensible limits and requests makes our pods less likely to be killed. Through the source we also learned that limits and requests determine the QoS class, which in turn affects the OOM score and thus the priority with which a pod is killed.

We then went through the source to see how k8s handles thresholds and on what basis pods are killed when resources run short; that part took up most of the article. The source also shows how much k8s considers during eviction: how to handle oscillation of node conditions, which kinds of resources to reclaim first, how the minimum-reclaim policy is implemented, and so on.

References

https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/

https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/

https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

https://zhuanlan.zhihu.com/p/38359775

https://cloud.tencent.com/developer/article/1097431

https://developer.aliyun.com/article/679216

https://my.oschina.net/jxcdwangtao/blog/841937
