kube-scheduler source code analysis
Building from source
I find the official build scripts too cumbersome, so I build the k8s code in a cruder, more direct way. The official scripts are of course still quite useful when you need to build every component, cross-compile, or cut a release.
Build inside a container:
docker run -it -v /work/src/k8s.io/kubernetes:/go/src/k8s.io/kubernetes golang:1.11.2 bash
Building inside a container guarantees a clean environment.
Once inside the shell, just go to kube-scheduler's main directory and build:
cd cmd/kube-scheduler && go build
That's it, the binary is produced.
Hooking the source build into CI/CD
For a power user, automation is a must; the build servers are more powerful, so building through CI/CD is faster. Here are some of my configs:
- I bake the vendor directory into the build base image, because vendor is large and rarely changes
$ cat Dockerfile-build1.12.2
FROM golang:1.11.2
COPY vendor/ /vendor
With that, the vendor directory inside the repo can be deleted.
- .drone.yml
workspace:
  base: /go/src/k8s.io
  path: kubernetes

pipeline:
  build:
    image: fanux/kubernetes-build:1.12.2-beta.3
    commands:
      - make all WHAT=cmd/kube-scheduler GOFLAGS=-v

  publish:
    image: plugins/docker
    registry: xxx
    username: xxx
    password: xxx
    email: xxx
    repo: xxx/container/kube-scheduler
    tags: ${DRONE_TAG=latest}
    dockerfile: dockerfile/Dockerfile-kube-scheduler
    insecure: true
    when:
      event: [push, tag]
- Dockerfile: with static compilation you do not even need a base image
$ cat dockerfile/Dockerfile-kube-scheduler
FROM scratch
COPY _output/local/bin/linux/amd64/kube-scheduler /
CMD ["/kube-scheduler"]
For things like kubeadm that are delivered as plain binaries, you can build them and upload the result straight to nexus, and use a drone deploy event to choose whether kubeadm should be built:
  build_kubeadm:
    image: fanux/kubernetes-build:1.12.2-beta.3
    commands:
      - make all WHAT=cmd/kubeadm GOFLAGS=-v
      - curl -v -u container:container --upload-file kubeadm http://172.16.59.153:8081/repository/kubernetes/kubeadm/
    when:
      event: deployment
      environment: kubeadm
The big pitfall of a bare go build
It turned out the kubeadm binary built this way does not actually work, possibly because of the base image chosen for the build, or because some generated code was never produced:
[signal SIGSEGV: segmentation violation code=0x1 addr=0x63 pc=0x7f2b7f5f057c]
runtime stack:
runtime.throw(0x17c74a8, 0x2a)
/usr/local/go/src/runtime/panic.go:608 +0x72
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:374 +0x2f2
I will fill in the CD configuration later.
With this setup building the scheduler code takes roughly 40 seconds; if vendor could be symlinked it would save another ten-odd seconds.
The scheduler cache
The cache state machine
   +-------------------------------------------+  +----+
   |                            Add            |  |    |
   |                                           |  |    | Update
   +      Assume Add            v              v  |
Initial +--------> Assumed +------------+---> Added <--+
   ^                +   +               |       +
   |                |   |               |       |
   |                |   |           Add |       | Remove
   |                |   |               |       |
   |                |   |               +       |
   +----------------+   +-----------> Expired +----> Deleted
- Assume: a tentative schedule; the pod's requirements (e.g. how much CPU and memory it requests) are aggregated onto the node's info, and if the assume times out they have to be subtracted again (see the sketch after this list)
- AddPod: checks whether the pod has already been assumed and whether that assume has expired; an expired pod gets re-added
- RemovePod: the pod's information is cleared from that node
- The cache has other interfaces too, e.g. the node-related ones such as AddNode, UpdateNode and so on
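To make the assume/expire bookkeeping in the first bullet concrete, here is a minimal sketch using my own simplified types (the real schedulerCache tracks full Resource objects, pod keys, binding state and deadlines; everything below is illustrative only):
// Simplified model of the cache's assume/expire bookkeeping.
type podRequest struct {
	milliCPU int64
	memory   int64
}

type nodeInfo struct {
	requestedMilliCPU int64
	requestedMemory   int64
	pods              map[string]podRequest
}

// assumePod aggregates the pod's requests onto the node it was
// tentatively scheduled to.
func (n *nodeInfo) assumePod(key string, req podRequest) {
	n.requestedMilliCPU += req.milliCPU
	n.requestedMemory += req.memory
	n.pods[key] = req
}

// expirePod undoes the aggregation when the assume times out,
// i.e. the binding never completed within the TTL.
func (n *nodeInfo) expirePod(key string) {
	req, ok := n.pods[key]
	if !ok {
		return
	}
	n.requestedMilliCPU -= req.milliCPU
	n.requestedMemory -= req.memory
	delete(n.pods, key)
}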
The cache implementation
type schedulerCache struct {
	stop   <-chan struct{}
	ttl    time.Duration
	period time.Duration
	// This mutex guards all fields within this cache struct.
	mu sync.RWMutex
	// a set of assumed pod keys.
	// The key could further be used to get an entry in podStates.
	assumedPods map[string]bool
	// a map from pod key to podState.
	podStates map[string]*podState
	nodes     map[string]*NodeInfo
	nodeTree  *NodeTree
	pdbs      map[string]*policy.PodDisruptionBudget
	// A map from image name to its imageState.
	imageStates map[string]*imageState
}
This holds all the information that basic scheduling needs.
Take the AddPod interface as an example; essentially it just puts a pod seen by the informer into the cache's map:
cache.addPod(pod)
ps := &podState{
pod: pod,
}
cache.podStates[key] = ps
The node tree
Node information is kept in this structure:
type NodeTree struct {
	tree      map[string]*nodeArray // a map from zone (region-zone) to an array of nodes in the zone.
	zones     []string              // a list of all the zones in the tree (keys)
	zoneIndex int
	NumNodes  int
	mu        sync.RWMutex
}
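The zone-keyed layout exists so the tree can hand out nodes round-robin across zones, spreading pods over failure domains. A simplified sketch of that idea (the type and field names below are mine, not the upstream code):
// Simplified zone round-robin: consecutive calls return nodes from
// different zones, which is how pods get spread across failure domains.
type simpleNodeTree struct {
	zones       []string            // zone names in a stable order
	nodesByZone map[string][]string // zone -> node names
	zoneIndex   int                 // which zone serves the next request
	nodeIndex   map[string]int      // per-zone cursor
}

func (t *simpleNodeTree) next() string {
	if len(t.zones) == 0 {
		return ""
	}
	if t.nodeIndex == nil {
		t.nodeIndex = map[string]int{}
	}
	zone := t.zones[t.zoneIndex%len(t.zones)]
	t.zoneIndex++
	nodes := t.nodesByZone[zone]
	if len(nodes) == 0 {
		return ""
	}
	i := t.nodeIndex[zone] % len(nodes)
	t.nodeIndex[zone] = i + 1
	return nodes[i]
}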
While running, the cache loops and cleans up expired assumed pods:
func (cache *schedulerCache) run() {
	go wait.Until(cache.cleanupExpiredAssumedPods, cache.period, cache.stop)
}
scheduler
The two most important things inside the scheduler are the cache and the scheduling algorithm:
type Scheduler struct {
	config *Config -------> SchedulerCache
	               |
	               +------> Algorithm
}
Once the cache has synced, the scheduler just schedules one pod at a time:
func (sched *Scheduler) Run() {
	if !sched.config.WaitForCacheSync() {
		return
	}
	go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}
Now for the core logic:
+-----------+
| Get a pod |
+-----------+
      |
+--------------------------------------------------------------------------------+
| If the pod has a DeletionTimestamp it is not scheduled; kubelet will delete it |
+--------------------------------------------------------------------------------+
      |
+-----------------------------------------------------+
| Pick a suggestedHost, i.e. a node that fits the pod |
+-----------------------------------------------------+
      |_____________ if no node fits, fall into the preemption logic
      |              (similar to the swarm scheduler I once wrote)
      |
+---------------------------------------------------------------------+
| The pod is not really on the node yet, but the cache is told it is: |
| it becomes an assumed pod; volumes are checked first,               |
| then: err = sched.assume(assumedPod, suggestedHost)                 |
+---------------------------------------------------------------------+
      |
+------------------------------+
| Bind the pod asynchronously: |
| bind the volumes first,      |
| then bind the pod            |
+------------------------------+
      |
+---------------------+
| Expose some metrics |
+---------------------+
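Written out as a condensed Go sketch (stand-in types of my own; the real scheduleOne in pkg/scheduler/scheduler.go also deals with volume binding, error handlers and metrics), the flow above looks roughly like this:
// Heavily condensed sketch of the scheduleOne control flow.
type fakePod struct {
	name              string
	deletionTimestamp *string
}

type fakeScheduler struct {
	nextPod   func() *fakePod
	schedule  func(*fakePod) (string, error) // pick a suggestedHost
	preempt   func(*fakePod, error)          // the preemption path
	assume    func(*fakePod, string) error   // tell the cache "it is on this node"
	bindAsync func(*fakePod, string)         // volumes first, then the pod
}

func (s *fakeScheduler) scheduleOne() {
	pod := s.nextPod()

	// A pod already marked for deletion is skipped; kubelet removes it
	// when it sees DeletionTimestamp.
	if pod.deletionTimestamp != nil {
		return
	}

	host, err := s.schedule(pod)
	if err != nil {
		// No node fits: fall into the preemption logic.
		s.preempt(pod, err)
		return
	}

	// The pod becomes an assumed pod in the cache.
	if err := s.assume(pod, host); err != nil {
		return
	}

	// Binding happens asynchronously so the next pod can be scheduled
	// right away.
	s.bindAsync(pod, host)
}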
The bind action:
err := sched.bind(assumedPod, &v1.Binding{
	ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
	Target: v1.ObjectReference{
		Kind: "Node",
		Name: suggestedHost,
	},
})
It binds the pod first, then tells the cache that binding has finished:
err := sched.config.GetBinder(assumed).Bind(b)
if err := sched.config.SchedulerCache.FinishBinding(assumed);
The bind flow:
+----------------+
| GetBinder.Bind |
+----------------+
        |
+--------------------------------------------------------------+
| Tell the cache binding is done (the FinishBinding interface) |
+--------------------------------------------------------------+
        |
+--------------------------------------------------------------------+
| On failure, ForgetPod and update the pod status to BindingRejected |
+--------------------------------------------------------------------+
The bind implementation
In the end it just calls the apiserver's bind interface:
func (b *binder) Bind(binding *v1.Binding) error {
	glog.V(3).Infof("Attempting to bind %v to %v", binding.Name, binding.Target.Name)
	return b.Client.CoreV1().Pods(binding.Namespace).Bind(binding)
}
The scheduling algorithm
▾ algorithm/
  ▸ predicates/   the predicate (filtering) functions
  ▸ priorities/   the priority (scoring) functions
The most important thing now is how a node actually gets picked:
suggestedHost, err := sched.schedule(pod)
In other words, the implementation of the scheduling algorithm:
type ScheduleAlgorithm interface {
	// Given a pod and the node lister, return a node that fits
	Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
	// Used for resource preemption
	Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
	// The set of predicate functions
	Predicates() map[string]FitPredicate
	   |  is this node suitable for this pod? if not, the reasons are returned
	   +-------type FitPredicate func(pod *v1.Pod, meta PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []PredicateFailureReason, error)
	// Returns the priority configs; the two key functions are map and reduce
	Prioritizers() []PriorityConfig
	   |____________PriorityMapFunction    computes a single node's score
	   |____________PriorityReduceFunction computes every node's final score from the map results
	   |____________PriorityFunction       deprecated
}
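To get a feel for the FitPredicate shape, here is a hypothetical toy predicate, not one of the built-ins; the built-in predicates such as PodFitsResources follow exactly this signature, and the import paths below match the 1.12-era tree discussed here (they may differ in other versions):
import (
	"fmt"

	"k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/algorithm"
	"k8s.io/kubernetes/pkg/scheduler/algorithm/predicates"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/cache"
)

// ToyLabelPredicate is a hypothetical FitPredicate: a node fits only
// if it carries the label example.com/schedulable=true.
func ToyLabelPredicate(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}
	if node.Labels["example.com/schedulable"] == "true" {
		return true, nil, nil
	}
	// Reusing a built-in failure reason here purely for illustration.
	return false, []algorithm.PredicateFailureReason{predicates.ErrNodeSelectorNotMatch}, nil
}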
The scheduling algorithm can be produced in two ways:
- Provider: the default way, a general-purpose scheduler
- Policy: the policy-driven way, for specialized schedulers
In the end a scheduler is new'd:
priorityConfigs, err := c.GetPriorityFunctionConfigs(priorityKeys)
priorityMetaProducer, err := c.GetPriorityMetadataProducer()
predicateMetaProducer, err := c.GetPredicateMetadataProducer()

algo := core.NewGenericScheduler(
	c.schedulerCache,
	c.equivalencePodCache,
	c.podQueue,
	predicateFuncs, ============> the predicate and priority functions are injected here
	predicateMetaProducer,
	priorityConfigs,
	priorityMetaProducer,
	extenders,
	c.volumeBinder,
	c.pVCLister,
	c.alwaysCheckAllPredicates,
	c.disablePreemption,
	c.percentageOfNodesToScore,
)
type genericScheduler struct {
	cache                    schedulercache.Cache
	equivalenceCache         *equivalence.Cache
	schedulingQueue          SchedulingQueue
	predicates               map[string]algorithm.FitPredicate
	priorityMetaProducer     algorithm.PriorityMetadataProducer
	predicateMetaProducer    algorithm.PredicateMetadataProducer
	prioritizers             []algorithm.PriorityConfig
	extenders                []algorithm.SchedulerExtender
	lastNodeIndex            uint64
	alwaysCheckAllPredicates bool
	cachedNodeInfoMap        map[string]*schedulercache.NodeInfo
	volumeBinder             *volumebinder.VolumeBinder
	pvcLister                corelisters.PersistentVolumeClaimLister
	disablePreemption        bool
	percentageOfNodesToScore int32
}
This scheduler implements the interface defined by ScheduleAlgorithm.
The Schedule flow:
+-----------------------------------------------+
| trace: log which pod is about to be scheduled |
+-----------------------------------------------+
        |
+-------------------------------------------------------------+
| Basic pod checks, mainly volumes and the deletion timestamp |
+-------------------------------------------------------------+
        |
+-----------------------------------------------------+
| List the nodes and update the cache's node info map |
+-----------------------------------------------------+
        |
+-------------------------------------------------------------------------+
| Predicates: return the nodes that fit and the reasons the others failed |
+-------------------------------------------------------------------------+
        |
+----------------------------------------------------------------------+
| Priorities: if only one node passed the predicates, use it directly; |
| otherwise run the priority (scoring) phase                           |
+----------------------------------------------------------------------+
        |
+---------------------------------------------------------+
| Pick the highest-scoring node from the priority results |
+---------------------------------------------------------+
The predicate phase
It has two main parts:
- The predicates themselves: check whether a node qualifies
- Running the extenders: custom scheduler extensions; the built-in HTTP extender sends the predicate results to you so you can filter them further
podFitsOnNode decides whether this node is suitable for scheduling this pod.
A small aside before that: the scheduler has an ecache (equivalence cache):
The Equivalence Class is currently used to speed up the predicate phase in the Kubernetes scheduler and improve its throughput.
The scheduler keeps the Equivalence Cache up to date; when certain events happen (for example a node is deleted or a pod is bound),
the related entries in the Equivalence Cache have to be invalidated immediately.
An Equivalence Class groups the information of pods that share the same requirements and constraints.
During the predicate phase the scheduler only needs to run the predicates for one pod in the class, store the result in the
Equivalence Cache, and let the other pods in the class (the equivalent pods) reuse it. The normal predicate flow only runs
when there is no reusable predicate result in the cache.
The ecache deserves its own deeper discussion later; this post focuses on the core architecture and flow.
So this part is fairly simple: every predicate function is run once.
They are ordered first, via predicates.Ordering():
if predicate, exist := predicateFuncs[predicateKey]; exist {
fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
The order looks like this:
predicatesOrdering = []string{CheckNodeConditionPred, CheckNodeUnschedulablePred,
GeneralPred, HostNamePred, PodFitsHostPortsPred,
MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
MaxAzureDiskVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
CheckNodeMemoryPressurePred, CheckNodePIDPressurePred, CheckNodeDiskPressurePred, MatchInterPodAffinityPred}
These predicate functions live in a map whose key is a string and whose value is the predicate function; now go back and look at how that map gets registered:
predicateFuncs, err := c.GetPredicates(predicateKeys)
pkg/scheduler/algorithmprovider/defaults/defaults.go registers these functions, for example:
factory.RegisterFitPredicate(predicates.NoDiskConflictPred, predicates.NoDiskConflict),
factory.RegisterFitPredicate(predicates.GeneralPred, predicates.GeneralPredicates),
factory.RegisterFitPredicate(predicates.CheckNodeMemoryPressurePred, predicates.CheckNodeMemoryPressurePredicate),
factory.RegisterFitPredicate(predicates.CheckNodeDiskPressurePred, predicates.CheckNodeDiskPressurePredicate),
factory.RegisterFitPredicate(predicates.CheckNodePIDPressurePred, predicates.CheckNodePIDPressurePredicate),
The registration logic is then invoked directly from init functions.
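Registering a custom predicate would then just be one more RegisterFitPredicate call from an init function. A hedged sketch, reusing the hypothetical ToyLabelPredicate from above (a provider or policy still has to reference that key for it to actually be used):
import "k8s.io/kubernetes/pkg/scheduler/factory"

func init() {
	// Hypothetical registration: "ToyLabelPredicate" is the key a
	// provider or policy would later refer to.
	factory.RegisterFitPredicate("ToyLabelPredicate", ToyLabelPredicate)
}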
The priority phase
PrioritizeNodes can roughly be split into three steps:
- Map: compute a single node's score for one priority
- Reduce: aggregate the per-node results and compute every node's final score
- Extender: much the same as in the predicate phase
Priority functions are registered the same way, so I won't repeat that:
factory.RegisterPriorityFunction2("LeastRequestedPriority", priorities.LeastRequestedPriorityMap, nil, 1),
// Prioritizes nodes to help achieve balanced resource usage
factory.RegisterPriorityFunction2("BalancedResourceAllocation", priorities.BalancedResourceAllocationMap, nil, 1),
Each registration registers two functions, a map and a reduce. To understand the map/reduce split better, look at one implementation:
factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1)
Node affinity map and reduce
The core of the map function is easy to follow:
if the node satisfies a preferred affinity term, the term's weight is added to the score
count += preferredSchedulingTerm.Weight
return schedulerapi.HostPriority{
	Host:  node.Name,
	Score: int(count), // the accumulated score
}, nil
The reduce:
A node goes through many map functions, each of which produces one score (node affinity yields one, pod affinity another), so a node maps to many scores.
Leaving out the reverse logic (where a higher raw score means a lower priority):
var maxCount int
for i := range result {
	if result[i].Score > maxCount {
		maxCount = result[i].Score // the largest of all the scores
	}
}
for i := range result {
	score := result[i].Score
	score = maxPriority * score / maxCount // normalize: multiply by maxPriority (= 10) and divide by the maximum
	result[i].Score = score
}
After this normalization every score falls into the range [0, maxPriority].
for i := range priorityConfigs {
	if priorityConfigs[i].Function != nil {
		continue
	}
	results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
	if err != nil {
		appendError(err)
		results[i][index].Host = nodes[index].Name
	}
}
err := config.Reduce(pod, meta, nodeNameToInfo, results[index]);
Note the results variable here; it is important for understanding, and it is a two-dimensional array:

| priority      | node1 | node2 | node3 |
|---------------|-------|-------|-------|
| node affinity | 1     | 2     | 1     |
| pod affinity  | 1     | 3     | 6     |
| ...           | ...   | ...   | ...   |

So when reduce takes one row, it is really processing every node's score for one particular priority.
result[i].Score += results[j][i].Score * priorityConfigs[j].Weight   (the 2-D table collapses into 1-D)
After reduce, a node's final score equals the sum of its per-priority scores each multiplied by that priority's weight; finally the nodes are sorted and the highest score wins (1-D collapses into a single choice).
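A tiny self-contained toy of this normalize-then-weight aggregation (toy numbers, not real scheduler output):
// Toy version of the scoring math: normalize each priority's raw scores
// to [0, maxPriority], then weight and sum them per node (2-D -> 1-D).
func toyFinalScores(results [][]int, weights []int, maxPriority int) []int {
	if len(results) == 0 {
		return nil
	}
	// Normalize every row (one priority across all nodes).
	for i := range results {
		maxCount := 0
		for _, s := range results[i] {
			if s > maxCount {
				maxCount = s
			}
		}
		if maxCount == 0 {
			continue
		}
		for j := range results[i] {
			results[i][j] = maxPriority * results[i][j] / maxCount
		}
	}
	// Weighted sum per node.
	final := make([]int, len(results[0]))
	for j := range final {
		for i := range results {
			final[j] += results[i][j] * weights[i]
		}
	}
	return final
}

// With the table above: toyFinalScores([][]int{{1, 2, 1}, {1, 3, 6}}, []int{1, 2}, 10)
// returns [7 20 25], so node3 wins.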
The scheduling queue (SchedulingQueue)
The scheduler config has a NextPod method that fetches a pod and hands it over for scheduling:
pod := sched.config.NextPod()
The config is initialized here:
pkg/scheduler/factory/factory.go
NextPod: func() *v1.Pod {
	return c.getNextPod()
},

func (c *configFactory) getNextPod() *v1.Pod {
	pod, err := c.podQueue.Pop()
	if err == nil {
		return pod
	}
	...
}
The queue interface:
type SchedulingQueue interface {
	Add(pod *v1.Pod) error
	AddIfNotPresent(pod *v1.Pod) error
	AddUnschedulableIfNotPresent(pod *v1.Pod) error
	Pop() (*v1.Pod, error)
	Update(oldPod, newPod *v1.Pod) error
	Delete(pod *v1.Pod) error
	MoveAllToActiveQueue()
	AssignedPodAdded(pod *v1.Pod)
	AssignedPodUpdated(pod *v1.Pod)
	WaitingPodsForNode(nodeName string) []*v1.Pod
	WaitingPods() []*v1.Pod
}
Two implementations are provided, a priority queue and a FIFO:
func NewSchedulingQueue() SchedulingQueue {
	if util.PodPriorityEnabled() {
		return NewPriorityQueue() // heap-based, ordered by pod priority
	}
	return NewFIFO() // a simple first-in, first-out queue
}
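Just to make "heap-based, ordered by pod priority" concrete, here is a toy comparator sketch; this is not the real PriorityQueue, which also maintains an unschedulable sub-queue:
import "container/heap"

// Toy heap ordered by pod priority: the highest priority pops first.
type queuedPod struct {
	name     string
	priority int32
}

type podHeap []queuedPod

func (h podHeap) Len() int            { return len(h) }
func (h podHeap) Less(i, j int) bool  { return h[i].priority > h[j].priority }
func (h podHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(queuedPod)) }
func (h *podHeap) Pop() interface{} {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// Usage:
//   h := &podHeap{}
//   heap.Push(h, queuedPod{"low", 0})
//   heap.Push(h, queuedPod{"high", 1000})
//   next := heap.Pop(h).(queuedPod) // "high" comes out first
var _ heap.Interface = (*podHeap)(nil)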
The queue implementations are simple enough that I won't analyse them in depth; what matters more is the relationship between the queue, the scheduler and the cache:
AddFunc:    c.addPodToCache,
UpdateFunc: c.updatePodInCache,
DeleteFunc: c.deletePodFromCache,
     |  after the informer sees a pod creation event, both the cache and the queue get updated
     V
if err := c.schedulerCache.AddPod(pod); err != nil {
	glog.Errorf("scheduler cache AddPod failed: %v", err)
}
c.podQueue.AssignedPodAdded(pod)
+------------+  ADD   +-------------+  POP   +-----------+
|  informer  |------->|  sche Queue |------->| scheduler |
+------------+    |   +-------------+        +----^------+
                  +--> +-------------+            |
                       |  sche cache |<-----------+
                       +-------------+
Extender
Scheduler extensions
There are three ways to customize the scheduler:
- Modify the scheduler code and rebuild: nothing much to discuss
- Write your own scheduler and select it per pod at scheduling time: fairly simple, but it cannot cooperate with the default scheduler
- Write a scheduler extender: k8s finishes its own pass, throws the qualifying nodes at you, and you filter and score them further. This is the focus here; newer versions changed a few things, so the old write-ups may no longer apply (reference)
- There is also an example scheduler extender here
Material on the third approach is scarce; many details can only be answered by the code, and reading the code with concrete questions in mind works best.
The Extender interface
+-----------------------------------+       +----------+
| kube-scheduler -> extender client |------>| extender | (the extension you develop yourself, running as a separate process)
+-----------------------------------+       +----------+
This interface is implemented by kube-scheduler; the HTTPExtender implementation is described below.
type SchedulerExtender interface {
	// The most important method: given the pod and the node list, it returns the nodes that fit
	Filter(pod *v1.Pod,
		nodes []*v1.Node, nodeNameToInfo map[string]*schedulercache.NodeInfo,
	) (filteredNodes []*v1.Node, failedNodesMap schedulerapi.FailedNodesMap, err error)
	// Scores the nodes; used during the priority phase
	Prioritize(pod *v1.Pod, nodes []*v1.Node) (hostPriorities *schedulerapi.HostPriorityList, weight int, err error)
	// Bind mainly notifies the extender which node the scheduler finally chose
	Bind(binding *v1.Binding) error
	// IsBinder returns whether this extender is configured for the Bind method.
	IsBinder() bool
	// Lets the extender pick out only the pods it cares about, e.g. by label
	IsInterested(pod *v1.Pod) bool
	// ProcessPreemption returns nodes with their victim pods processed by extender based on
	// given:
	//   1. Pod to schedule
	//   2. Candidate nodes and victim pods (nodeToVictims) generated by previous scheduling process.
	//   3. nodeNameToInfo to restore v1.Node from node name if extender cache is enabled.
	// The possible changes made by extender may include:
	//   1. Subset of given candidate nodes after preemption phase of extender.
	//   2. A different set of victim pod for every given candidate node after preemption phase of extender.
	// I guess this is related to affinity-like behaviour; not entirely sure, TODO
	ProcessPreemption(
		pod *v1.Pod,
		nodeToVictims map[*v1.Node]*schedulerapi.Victims,
		nodeNameToInfo map[string]*schedulercache.NodeInfo,
	) (map[*v1.Node]*schedulerapi.Victims, error)
	// The priority preemption feature; optional to implement
	SupportsPreemption() bool
	// What happens when the extender cannot be reached; if this returns true,
	// scheduling must not fail just because the extender is unreachable
	IsIgnorable() bool
}
Upstream ships an HTTPExtender implementation, worth a look:
type HTTPExtender struct {
	extenderURL      string
	preemptVerb      string
	filterVerb       string // URL verb for the filter (predicate) call
	prioritizeVerb   string // URL verb for the prioritize call
	bindVerb         string
	weight           int
	client           *http.Client
	nodeCacheCapable bool
	managedResources sets.String
	ignorable        bool
}
Look at its filter and prioritize logic:
args = &schedulerapi.ExtenderArgs{ // which pod is being scheduled and which nodes are candidates; the response uses the same structure
	Pod:       pod,
	Nodes:     nodeList,
	NodeNames: nodeNames,
}
if err := h.send(h.filterVerb, args, &result); err != nil { // sends an HTTP request to the extender (the HTTP server you implement) and gets the filtered result back
	return nil, nil, err
}
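From the extender's side, all you need is an HTTP server that accepts these arguments and answers with a filter result. A hedged sketch of such a server: the payload structs below are hand-written mirrors of the wire format rather than imports of the real schedulerapi types, the name-only handling assumes nodeCacheCapable is enabled, and the /filter path and port are arbitrary:
import (
	"encoding/json"
	"net/http"
	"strings"
)

// Hand-written mirrors of the extender wire format (see the
// ExtenderArgs / ExtenderFilterResult types in pkg/scheduler/api);
// only the fields used below are declared.
type extenderArgs struct {
	Pod       json.RawMessage `json:"pod"`
	NodeNames *[]string       `json:"nodenames,omitempty"`
}

type extenderFilterResult struct {
	NodeNames   *[]string         `json:"nodenames,omitempty"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Toy policy: reject nodes whose name starts with "edge-".
	accepted := []string{}
	failed := map[string]string{}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			if strings.HasPrefix(name, "edge-") {
				failed[name] = "edge nodes are excluded by the toy extender"
				continue
			}
			accepted = append(accepted, name)
		}
	}

	json.NewEncoder(w).Encode(extenderFilterResult{
		NodeNames:   &accepted,
		FailedNodes: failed,
	})
}

func main() {
	// The path must match the filterVerb configured in the scheduler policy.
	http.HandleFunc("/filter", filterHandler)
	http.ListenAndServe(":8888", nil)
}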
Where the HTTPExtender configuration comes from
The extender settings come from the scheduler policy, which can be supplied as a policy file or through a ConfigMap; the related constants:
NamespaceSystem string = "kube-system"
SchedulerDefaultLockObjectNamespace string = metav1.NamespaceSystem
// SchedulerPolicyConfigMapKey defines the key of the element in the
// scheduler's policy ConfigMap that contains scheduler's policy config.
SchedulerPolicyConfigMapKey = "policy.cfg"
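The extenders themselves are declared inside that policy: each entry of the policy's extenders list is decoded into an ExtenderConfig and used to build an HTTPExtender. Roughly (abridged from the 1.12-era pkg/scheduler/api; the exact field set may differ in other versions):
import "time"

// Abridged sketch of what each "extenders" entry in the policy decodes into.
type ExtenderConfig struct {
	URLPrefix        string        // e.g. http://my-extender.kube-system:8888
	FilterVerb       string        // appended to URLPrefix for the filter call
	PrioritizeVerb   string        // appended for the prioritize call
	PreemptVerb      string
	BindVerb         string
	Weight           int           // weight applied to the extender's scores
	EnableHTTPS      bool
	HTTPTimeout      time.Duration
	NodeCacheCapable bool          // send only node names instead of full node objects
	Ignorable        bool          // scheduling does not fail if the extender is down
}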
Summary
The scheduler code is actually quite well written, much better than kube-proxy, and its extensibility is decent. That said, the scheduler looks headed for a big refactor: right now it does not handle batch workloads such as deep-learning training jobs well,
and the one-by-one scheduling model is baked into the whole architecture, so supporting smarter scheduling elegantly probably cannot avoid that refactor.
For discussion, join QQ group 98488045.