Patching Kubernetes Cluster Nodes with Zero Downtime

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Salesforce's Einstein Vision and Language services are deployed on AWS Elastic Kubernetes Service (EKS) clusters. One of the primary security and compliance requirements is patching the operating system of the cluster nodes. The nodes that host the services must be updated regularly with patches, which reduce the vulnerabilities that could expose the virtual machines to attack."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The patching process"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Einstein services are deployed as Kubernetes Pods on immutable EC2 node groups (also known as AWS Auto Scaling groups, or ASGs). Patching involves building a new Amazon Machine Image (AMI) that contains all the latest security patches. The new AMI is then used to update the node group one EC2 instance at a time: a new instance is launched, and once it passes its health checks, an old instance is terminated. This repeats until every EC2 instance in the node group has been replaced, a process known as a rolling update."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"However, this patching process poses a challenge. When an old EC2 instance is terminated, the service Pods running on it are terminated too. If Pod termination is not handled properly, user requests may fail. Terminating a Pod gracefully requires support from both infrastructure components (the Kubernetes API and the AWS ASG) and application components (the service\/application container)."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Graceful application termination"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first step is to terminate the application gracefully. Terminating a Pod can abruptly stop the Docker containers in it, along with the processes running inside them. In-flight requests may then be cut off, causing failures in the upstream services that were calling the application at that moment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When an EC2 instance is terminated during patching, the Pods on that instance are evicted. Each Pod is marked for termination, and the kubelet running on the EC2 instance begins shutting it down by sending a SIGTERM signal. If the application running in the Pod has no logic to handle SIGTERM, its in-progress work may be terminated abruptly. The application therefore needs to be updated to handle this signal and shut down gracefully."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, here is one way to implement graceful shutdown in a Java application (the details vary by framework):"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"public static final int gracefulShutdownTimeoutSeconds = 30;\n\n@Override\npublic void onApplicationEvent(@NotNull ContextClosedEvent contextClosedEvent) {\n    \/\/ Pause the connector so it stops accepting new requests.\n    this.connector.pause();\n    Executor executor = this.connector.getProtocolHandler().getExecutor();\n    if (executor instanceof ThreadPoolExecutor) {\n        try {\n            ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) executor;\n            \/\/ Stop taking new tasks and let in-flight tasks finish.\n            threadPoolExecutor.shutdown();\n            logger.warn(\"Gracefully shutdown the service.\");\n            if (!threadPoolExecutor.awaitTermination(gracefulShutdownTimeoutSeconds, TimeUnit.SECONDS)) {\n                logger.warn(\"Forcefully shutdown the service after {} seconds.\", gracefulShutdownTimeoutSeconds);\n                threadPoolExecutor.shutdownNow();\n            }\n        } catch (InterruptedException ex) {\n            \/\/ Restore the interrupt flag for the caller.\n            Thread.currentThread().interrupt();\n        }\n    }\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the snippet above, the handler runs when shutdown is triggered: it stops accepting new work and gives in-flight tasks up to 30 seconds to finish before forcing the thread pool to shut down."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If a Pod consists of multiple containers and the order in which they terminate matters, it is best to define a container "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/concepts\/containers\/container-lifecycle-hooks\/","title":"","type":null},"content":[{"type":"text","text":"preStop hook"}]},{"type":"text","text":" to ensure that the containers terminate in the correct order (for example, the application container before the logging sidecar container)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While shutting down a Pod, the kubelet runs any container lifecycle hooks that are defined. Our Pods contain multiple containers, so termination order matters to us. We defined the following preStop hook for the application container:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"lifecycle:\n  preStop:\n    exec:\n      command:\n        - \/bin\/sh\n        - -c\n        - kill -SIGTERM 1 && while ps -p 1 > \/dev\/null; do sleep 1; done;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The action defined in the preStop hook sends a SIGTERM signal to the main process in the Docker container (PID 1), then polls once per second until that process has exited. This lets the process complete any pending work and terminate cleanly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By default, Kubernetes allows a Pod 30 seconds to terminate, and the preStop hook must finish within that grace period. In our case this is enough time for the process to terminate gracefully. If the default is not sufficient, a different value can be set with the "},{"type":"codeinline","content":[{"type":"text","text":"terminationGracePeriodSeconds"}]},{"type":"text","text":" field in the Pod spec."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Graceful EC2 instance termination"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, our services run on node groups of EC2 instances. Graceful termination of EC2 instances can be achieved with AWS ASG lifecycle hooks and AWS Lambda."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"AWS EC2 Auto Scaling lifecycle hooks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lifecycle hooks let us pause an instance as it launches or before it terminates and perform custom actions. Once an instance is paused, you can complete the lifecycle action by triggering a Lambda function or by running commands on the instance. The instance remains in the wait state until the lifecycle action is completed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We use a lifecycle hook on termination to put the instance being terminated into the Terminating:Wait state. For more details on ASG lifecycle hooks, see the "},{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/autoscaling\/ec2\/userguide\/lifecycle-hooks.html","title":"","type":null},"content":[{"type":"text","text":"AWS documentation"}]},{"type":"text","text":"."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"AWS Lambda"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We use the "},{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/serverless\/sam\/","title":"","type":null},"content":[{"type":"text","text":"SAM"}]},{"type":"text","text":" framework to deploy a Lambda function (built in-house, which we call node-drainer) that is triggered by specific ASG lifecycle hook events. The diagram below shows the sequence of events involved in gracefully terminating an EC2 instance in a node group."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e6\/e6e9271f3d3276da7959a285b61a34a5.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When Patching Automation requests termination of an instance, the lifecycle hook kicks in and puts the instance in the Terminating:Wait state."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the instance is in the Terminating:Wait state, the lifecycle hook triggers the AWS Lambda function."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Lambda function calls the Kubernetes API to cordon the instance being terminated. Cordoning prevents new Pods from being scheduled on that instance."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After the instance is cordoned, all Pods on it are evicted and rescheduled onto healthy nodes."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes takes care of bringing up the replacement Pods on healthy instances."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The lifecycle hook waits until all Pods have been evicted from the instance and the new Pods are running on a healthy instance."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the node is completely drained, the lifecycle hook releases the instance from the wait state and termination continues."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This ensures that all in-flight requests are completed before the Pods are removed from the node."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Meanwhile, the new Pods are up and running to serve new requests."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This graceful shutdown process ensures that no Pod is shut down abruptly and that there is no service disruption."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"RBAC (role-based access control)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To access Kubernetes resources from the AWS Lambda function, we created an IAM role, a "},{"type":"codeinline","content":[{"type":"text","text":"clusterrole"}]},{"type":"text","text":" and a "},{"type":"codeinline","content":[{"type":"text","text":"clusterrolebinding"}]},{"type":"text","text":". The IAM role grants access to the ASG, while the "},{"type":"codeinline","content":[{"type":"text","text":"clusterrole"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"clusterrolebinding"}]},{"type":"text","text":" grant the "},{"type":"codeinline","content":[{"type":"text","text":"node-drainer"}]},{"type":"text","text":" Lambda function permission to evict Kubernetes Pods."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"IAM role policy"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Action\": [\n        \"autoscaling:CompleteLifecycleAction\",\n        \"ec2:DescribeInstances\",\n        \"eks:DescribeCluster\",\n        \"sts:GetCallerIdentity\"\n      ],\n      \"Resource\": \"*\",\n      \"Effect\": \"Allow\"\n    }\n  ]\n}"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Clusterrole"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"kind: ClusterRole\napiVersion: rbac.authorization.k8s.io\/v1\nmetadata:\n  name: lambda-cluster-access\nrules:\n  - apiGroups: [\"\"]\n    resources: [\"pods\", \"pods\/eviction\", \"nodes\"]\n    verbs: [\"create\", \"list\", \"patch\"]"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Clusterrolebinding"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"kind: ClusterRoleBinding\napiVersion: rbac.authorization.k8s.io\/v1\nmetadata:\n  name: lambda-user-cluster-role-binding\nsubjects:\n  - kind: User\n    name: lambda\n    apiGroup: rbac.authorization.k8s.io\nroleRef:\n  kind: ClusterRole\n  name: lambda-cluster-access\n  apiGroup: rbac.authorization.k8s.io"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By combining AWS Lambda, AWS EC2 Auto Scaling lifecycle hooks, and graceful application process termination, we achieve zero downtime while frequently rolling out replacement EC2 instances during patching."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/engineering.salesforce.com\/zero-downtime-node-patching-in-a-kubernetes-cluster-cdceb21c8c8c"}]}]}