Zero-Downtime OS Patching for Kubernetes Cluster Nodes

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Salesforce's Einstein Vision and Language services are deployed on AWS Elastic Kubernetes Service (EKS) clusters. One of the primary security and compliance requirements is patching the operating system of the cluster nodes. The nodes that run our services need regular OS updates, delivered as patches. These patches reduce the vulnerabilities that could expose the virtual machines to attack."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The patching process"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Einstein services are deployed as Kubernetes Pods on immutable EC2 node groups (also known as AWS Auto Scaling groups, or ASGs). Patching involves building a new Amazon Machine Image (AMI) that contains all the updated security patches. The new AMI is used to update the node group, which launches one new EC2 instance at a time. Once the new instance passes its health checks, an old instance is terminated. This continues until every EC2 instance in the node group has been replaced, a process known as a rolling update."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"However, this patching process poses a challenge. When an old EC2 instance is terminated, the service Pods running on it are terminated too. If Pod termination is not handled properly, user requests may fail. Terminating a Pod gracefully requires support from both the infrastructure components (the Kubernetes API and the AWS ASG) and the application components (the service\/application container)."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Graceful application termination"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first step is to terminate the application gracefully. Terminating a Pod can abruptly stop the Docker containers in it, along with the processes running inside them. In-flight requests are then cut off, and the upstream services that were calling the application at that moment see failures."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When an EC2 instance is terminated during patching, the Pods on that instance are evicted. Each Pod is marked as terminating, and the kubelet running on the EC2 instance starts shutting it down by sending a SIGTERM signal. If the application running in the Pod has no logic for handling SIGTERM, work in progress may be cut off abruptly. You therefore need to update the application to handle this signal and shut down gracefully."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, here is one way to implement graceful shutdown in a Java application (the details differ between frameworks):"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"public static final int gracefulShutdownTimeoutSeconds = 30;\n\n@Override\npublic void onApplicationEvent(@NotNull ContextClosedEvent contextClosedEvent) {\n    \/\/ Stop accepting new requests on the connector.\n    this.connector.pause();\n    Executor executor = this.connector.getProtocolHandler().getExecutor();\n    if (executor instanceof ThreadPoolExecutor) {\n        try {\n            ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) executor;\n            \/\/ Stop taking new tasks; in-flight requests keep running.\n            threadPoolExecutor.shutdown();\n            logger.warn(\"Gracefully shutdown the service.\");\n            if (!threadPoolExecutor.awaitTermination(gracefulShutdownTimeoutSeconds, TimeUnit.SECONDS)) {\n                logger.warn(\"Forcefully shutdown the service after {} seconds.\", gracefulShutdownTimeoutSeconds);\n                threadPoolExecutor.shutdownNow();\n            }\n        } catch (InterruptedException ex) {\n            \/\/ Restore the interrupt flag for the caller.\n            Thread.currentThread().interrupt();\n        }\n    }\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the snippet above, the connector stops accepting new requests and the thread pool is shut down. If the pool has not finished within 30 seconds it is forced to stop, which gives the application 30 seconds to complete work that is already in progress."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If a Pod consists of multiple containers and the order in which they terminate matters, it is best to define a container "},{"type":"link","attrs":{"href":"https:\/\/kubernetes.io\/docs\/concepts\/containers\/container-lifecycle-hooks\/","title":"","type":null},"content":[{"type":"text","text":"preStop hook"}]},{"type":"text","text":" so that the containers terminate in the right order (for example, the application container before the logging sidecar container)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While shutting down a Pod, the kubelet runs the container lifecycle hooks, if any are defined. In our case a Pod has multiple containers, so termination order matters to us. We defined a preStop hook for the application container, as shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"lifecycle:\n  preStop:\n    exec:\n      command:\n        - \/bin\/sh\n        - -c\n        - kill -SIGTERM 1 && while ps -p 1 > \/dev\/null; do sleep 1; done;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The action defined in the preStop hook sends a SIGTERM signal to the process in the Docker container (PID 1) and then polls at 1-second intervals until that process has exited. This lets the process finish any pending work and terminate cleanly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The default termination grace period, which also bounds the preStop hook, is 30 seconds. In our case, that gives the process enough time to terminate gracefully. If the default is not enough, set a different value with the "},{"type":"codeinline","content":[{"type":"text","text":"terminationGracePeriodSeconds"}]},{"type":"text","text":" field in the Pod spec."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Graceful EC2 instance termination"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As described above, our services run on node groups of EC2 instances. Graceful termination of an EC2 instance can be achieved with AWS ASG lifecycle hooks and the AWS Lambda service."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"AWS EC2 Auto Scaling lifecycle hooks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lifecycle hooks let us pause an instance before a new instance goes into service or an old one is terminated, and perform custom actions. While the instance is paused, you can complete the lifecycle action by triggering a Lambda function or by running commands on the instance. The instance stays in the wait state until the lifecycle action completes or the hook's timeout expires."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We use the Terminating:Wait lifecycle hook to put the instance being terminated into a wait state. For more details on ASG lifecycle hooks, see the "},{"type":"link","attrs":{"href":"https:\/\/docs.aws.amazon.com\/autoscaling\/ec2\/userguide\/lifecycle-hooks.html","title":"","type":null},"content":[{"type":"text","text":"AWS documentation"}]},{"type":"text","text":"."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"AWS Lambda"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We use the "},{"type":"link","attrs":{"href":"https:\/\/aws.amazon.com\/serverless\/sam\/","title":"","type":null},"content":[{"type":"text","text":"SAM"}]},{"type":"text","text":" framework to deploy a Lambda function (developed in-house, which we call node-drainer) that is triggered when specific ASG lifecycle hook events occur. The figure below shows the sequence of events involved in gracefully terminating an EC2 instance in a node group."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e6\/e6e9271f3d3276da7959a285b61a34a5.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When Patching Automation requests termination of an instance, the lifecycle hook kicks in and puts the instance into the Terminating:Wait state."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the instance is in the Terminating:Wait state, the lifecycle hook triggers the AWS Lambda function."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Lambda function calls the Kubernetes API and cordons the instance being terminated. Cordoning prevents new Pods from being scheduled on that instance."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the instance is cordoned, all Pods on it are evicted so that they can be placed on a healthy node."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kubernetes takes care of bringing up replacement Pods on healthy instances."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The lifecycle hook waits until all Pods have been evicted from the instance and the new Pods are up on a healthy instance."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the node is fully drained, the lifecycle hook removes the wait state and termination proceeds."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This ensures that all existing requests are completed before the Pods are removed from the node."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Meanwhile, we make sure that the new Pods are up to serve new requests."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This graceful shutdown process ensures that no Pod is shut down abruptly and that there is no service disruption."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"RBAC (role-based access control)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To access Kubernetes resources from the AWS Lambda function, we created an IAM role, a "},{"type":"codeinline","content":[{"type":"text","text":"clusterrole"}]},{"type":"text","text":" and a "},{"type":"codeinline","content":[{"type":"text","text":"clusterrolebinding"}]},{"type":"text","text":". The IAM role grants access to the ASG, and the "},{"type":"codeinline","content":[{"type":"text","text":"clusterrole"}]},{"type":"text","text":" and "},{"type":"codeinline","content":[{"type":"text","text":"clusterrolebinding"}]},{"type":"text","text":" grant the "},{"type":"codeinline","content":[{"type":"text","text":"node-drainer"}]},{"type":"text","text":" Lambda function permission to evict Kubernetes Pods."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"IAM role policy"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Action\": [\n        \"autoscaling:CompleteLifecycleAction\",\n        \"ec2:DescribeInstances\",\n        \"eks:DescribeCluster\",\n        \"sts:GetCallerIdentity\"\n      ],\n      \"Resource\": \"*\",\n      \"Effect\": \"Allow\"\n    }\n  ]\n}"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Clusterrole"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"kind: ClusterRole\napiVersion: rbac.authorization.k8s.io\/v1\nmetadata:\n  name: lambda-cluster-access\nrules:\n  - apiGroups: [\"\"]\n    resources: [\"pods\", \"pods\/eviction\", \"nodes\"]\n    verbs: [\"create\", \"list\", \"patch\"]"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Clusterrolebinding"}]},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"kind: ClusterRoleBinding\napiVersion: rbac.authorization.k8s.io\/v1\nmetadata:\n  name: lambda-user-cluster-role-binding\nsubjects:\n  - kind: User\n    name: lambda\n    apiGroup: rbac.authorization.k8s.io\nroleRef:\n  kind: ClusterRole\n  name: lambda-cluster-access\n  apiGroup: rbac.authorization.k8s.io"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By combining AWS Lambda, AWS EC2 Auto Scaling lifecycle hooks, and graceful application process termination, we achieve zero downtime while frequently replacing EC2 instances through rolling updates during patching."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/engineering.salesforce.com\/zero-downtime-node-patching-in-a-kubernetes-cluster-cdceb21c8c8c"}]}]}
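The drain sequence described in the EC2 termination section (cordon the node, evict its Pods, wait until it is empty, then release the Terminating:Wait hold) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the actual node-drainer: the `Cluster` and `LifecycleHook` interfaces, the `FakeCluster` class, and the `drain` method are hypothetical names that stand in for the Kubernetes API calls and for `autoscaling:CompleteLifecycleAction`, and the node name doubles as the instance ID for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a node-drainer flow; every type here is hypothetical. */
public class NodeDrainerSketch {

    /** Hypothetical stand-in for the Kubernetes API calls a drainer would make. */
    interface Cluster {
        void cordon(String node);          // mark the node unschedulable
        List<String> podsOn(String node);  // Pods still running on the node
        void evict(String pod);            // request eviction of one Pod
    }

    /** Hypothetical stand-in for autoscaling:CompleteLifecycleAction. */
    interface LifecycleHook {
        void complete(String instanceId, String result);
    }

    /** Cordon, evict, poll until the node is empty, then let termination continue. */
    static boolean drain(Cluster cluster, LifecycleHook hook, String node, int maxPolls) {
        cluster.cordon(node);
        for (String pod : cluster.podsOn(node)) {
            cluster.evict(pod);
        }
        for (int i = 0; i < maxPolls; i++) {
            if (cluster.podsOn(node).isEmpty()) {
                hook.complete(node, "CONTINUE");
                return true;
            }
            // A real implementation would sleep between polls.
        }
        return false; // not drained in time; the hook stays in its wait state
    }

    /** In-memory fake used only to exercise the sketch. */
    static class FakeCluster implements Cluster {
        final List<String> events = new ArrayList<>();
        final Map<String, List<String>> podsByNode = new LinkedHashMap<>();

        public void cordon(String node) {
            events.add("cordon " + node);
        }

        public List<String> podsOn(String node) {
            // Return a copy so callers can evict while iterating.
            return new ArrayList<>(podsByNode.getOrDefault(node, new ArrayList<>()));
        }

        public void evict(String pod) {
            events.add("evict " + pod);
            for (List<String> pods : podsByNode.values()) {
                pods.remove(pod);
            }
        }
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        cluster.podsByNode.put("node-a", new ArrayList<>(Arrays.asList("api-1", "api-2")));
        List<String> completed = new ArrayList<>();
        boolean drained = drain(cluster, (id, result) -> completed.add(id + ":" + result), "node-a", 3);
        System.out.println(cluster.events + " drained=" + drained + " " + completed);
    }
}
```

Keeping the cluster and the lifecycle hook behind small interfaces makes the ordering easy to test without AWS or a live cluster; a real function would replace the fake with actual Kubernetes and Auto Scaling clients and would also need to respect each Pod's termination grace period.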