Alibaba Tech in Practice: How to Efficiently Operate Hundreds of Thousands of Cloud Servers

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Do you still need operations after moving to the cloud? The answer: of course you do."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Moving to the cloud does simplify part of the operations workload; for example, the day-to-day server maintenance of traditional IT is handed over to the cloud provider. But as cloud products grow in variety and deployments grow in scale, operating cloud resources efficiently is becoming a real challenge for operations engineers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the “Elastic Engineering and Operations” track of the recently concluded "},{"type":"link","attrs":{"href":"https:\/\/qcon.infoq.cn\/2020\/shanghai\/schedule","title":"","type":null},"content":[{"type":"text","text":"QCon Global Software Development Conference (Shanghai) 2020"}]},{"type":"text","text":", Zhao Yu (趙昱, alias 巴梨), a senior technical expert at Alibaba Cloud, shared the practices and lessons learned in automating the operation of hundreds of thousands of ECS instances after the Alibaba economy moved fully to the cloud. This article is based on that talk."}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/mo8Ql29dPcCQf8vJ.jpeg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Zhao Yu, senior technical expert at Alibaba Cloud"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Four Major Challenges of Cloud Operations"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As cloud computing spreads and matures, more and more enterprises are moving to the cloud. In recent years the Alibaba economy has been moving fully to the cloud, and the operations problems it has run into are similar to those of most enterprises. They fall into four main areas:"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/HLh2K0CeehjDY9AH.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","att
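rs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"First, scale."},{"type":"text","text":" Traditional Human Ops and script-based management work when there are few resources, but they stop working once the scale grows. Managing a few dozen machines by hand is a completely different matter from managing tens of thousands, and as the variety of cloud resource types keeps growing, the complexity of managing and operating them rises exponentially."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Second, security."},{"type":"text","text":" Moving the Alibaba economy to the cloud involves hundreds of business teams and a very large number of operations engineers, so access control, auditing, and approval are both complex and critical. Data and resources are company assets: permissions that are too broad increase the risk of mistakes, while permissions that are too narrow increase management overhead. Using cloud accounts and resources securely is a major challenge for administrators."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Third, efficiency."},{"type":"text","text":" As resources grow in scale, cloud operations must also address how to manage and operate them efficiently and how to improve developer productivity."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fourth, cost."},{"type":"text","text":" Business teams, including both the people who use the resources and the finance staff, have a clear need for cost optimization. They want resource usage bills broken down along different dimensions as a basis for cost optimization measures."}]}]}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/KQ2E4Qm1iaB61Q16.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditionally, resource allocation is handled by a dedicated resource operations team, and project development teams only consume the resources. As the business scales, this model becomes largely unworkable. Basic configuration management has to be delegated to the business project teams themselves, and this shift in the operations model poses its own challenge for managing an enterprise's cloud resources."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In fact, cloud operations in the Alibaba economy also went through a transition from manual operations to standardized, data-driven, and process-driven operations. By 2016, the prototype of the internal cloud resource management platform, the “Zeus operations system”, had largely taken shape, turning operations capabilities and experience into standards, processes, and systems. As the scale of managed resources grew and requirements diversified, the Zeus operations system went on to take over the control of cloud resources as well."}]},{"type":"heading","attrs":{"al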
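ign":null,"level":3},"content":[{"type":"text","text":"How to Efficiently Operate Hundreds of Thousands of Cloud Servers"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Today the Zeus operations system manages more than 20 kinds of cloud products and resources for hundreds of business teams inside Alibaba Group, including hundreds of thousands of ECS instances. It provides each business team not only with resource management and operations capabilities but also with cost analysis and governance."}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/rzhyH1eLFvseIJnh.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure: Overall architecture of the Zeus operations platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Overall, the Zeus operations platform consists of five modules: resource management, system operations, application operations, monitoring management, and cost analysis. Upward, it serves business teams through a console and OpenAPI. Downward, it relies on Alibaba Cloud services such as cloud monitoring, resource orchestration, operations orchestration, the tag system, auto scaling, the operations channel, and the financial system to manage a wide range of cloud resources, including log service, cloud servers, networking, and object storage."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Account Management"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/XdM9VDeVCmbQAHF4.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For historical reasons, the Zeus operations platform supports two account models side by side: a standalone master account and managed accounts. The standalone master account is the Zeus operations platform's service account on Alibaba Cloud, and it holds the resources of a great many business teams, which fully delegate their operations functions to Zeus. Because this removes a lot of preparatory work, it is the model we recommend to business teams. In addition, because it is a service account, business teams cannot log in to it directly; they can only operate through the graphical console entry point, which reduces the risk of operator error."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Managed accounts are legacy operations accounts that predate the Zeus operations platform. To help business teams manage these legacy accounts better, the platform offers an account hosting service in which each legacy account grants administrator permissions to the Zeus service account. Because the root account and sub-accounts of a managed account are integrated with the group's login system, operations engineers can log in directly to manage resources."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Permission Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"im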
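age","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/kzfK4J1700mHk1NF.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The core idea of permission management is application grouping: within each application group, permissions are differentiated by role, and each person is assigned a role on the relevant applications."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We define roles such as Owner, developer, operations, and security for each application, and each role has different permissions. The Owner has full control (so-called god-mode permissions) over the resources under the application and is also responsible for approvals. Developers handle day-to-day CI work and testing in the daily and pre-release environments. Operations engineers can approve production releases. Security staff are mainly responsible for system-level operations, including security scanning and code scanning."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All cloud resources are attached to their applications through tags. With this permission model, an administrator can see which applications a given person has permissions on, and also which people have permissions on a given application."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Resource Grouping"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/fhdtt7hYEGsWDkW6.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Built on Alibaba Cloud's tag system, the Zeus operations system can classify resources along many dimensions, such as department, environment, and Region. It tags each resource it creates so that business teams can easily find, manage, and operate their resources. Tag-based management brings otherwise unorganized resources under control for operations and monitoring, and even for cost allocation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For managed accounts, resources can be operated through the API. Business teams are asked to set their tags according to an agreed convention; the system parses offline cloud monitoring notifications and, once it detects a data change, synchronizes it to Zeus and the CMDB."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Resource Delivery"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/7LowXTGSaoABmncp.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The biggest challenge in resource delivery is that cloud resources are deployed across multiple regions and come in many types. Alibaba Cloud currently has more than a hundred resource types, and operating on each of them by writing code against its API is both complex and inefficient. Moreover, most business scenarios do not involve delivering a single resource, and composing resources one by one is very time-consuming. Business teams generally ask for scenario-based delivery: most scenarios follow a standardized, commonly used pattern, so delivering by scenario can greatly improve delivery efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To meet this need for scenario-based delivery, we initially relied on scripts, which consumed a great deal of effort and manpower and was inefficient. To handle the many kinds of resource provisioning scenarios, the Zeus operations system adopted an Infrastructure as Code mechanism for resource orchestration, the same idea behind the open source Terraform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here the Zeus operations system uses ROS, the resource orchestration tool provided by Alibaba Cloud, and integrates the group's approval workflow so that resource deployment is standardized and process-driven. The system abstracts common scenarios into resource orchestration templates and delivers all the resources for a scenario with one click. Templates have greatly improved our delivery efficiency and have also lowered the barrier to onboarding new resource types."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Operations Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/0yKJTadN9G4lCYZ9.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In terms of the type of work involved, operations is layered too. System-level capabilities such as patch management, security scanning, and security protection belong to the platform, and business teams do not need to care about them; the Zeus operations system abstracts these capabilities and manages them through a unified mechanism."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the application layer, the work mainly involves resource operations and CI\/CD. For application resource operations, the Zeus operations system abstracts common operations actions into operations orchestration templates and uses Alibaba Cloud's operations orchestration service to orchestrate workflows. It defines common operations scenarios while also letting business teams define their own custom operations, so that operations procedures can be accumulated and reused. In addition, it uses the underlying capabilities to support operations triggered by schedules, alerts, and events, further improving operations efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For CI\/CD, the Zeus operations system mainly uses Alibaba Group's Aone (雲效) system, which supports batched releases based on software packages and images and also allows custom operations."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Monitoring and Alerting"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/SfJtNjEo9kWV4pTf.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"paste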
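Pass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Classified by information source, monitoring and alerting can be divided into resource monitoring, application monitoring, and business monitoring. The higher the layer, the more accurate the monitoring and alerting, but the less general it is. The Zeus operations system implements several ways of handling alerts. Through integration with the monitoring system, alerts are dispatched to the contacts of each group through channels such as SMS and DingTalk. For automation scenarios, it connects to auto scaling and operations orchestration to trigger automatic actions, closing the automation loop."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Diagnosis and Repair"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/nalxgmB8YfIPrsRp.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As more resources and businesses came onto the platform, internal business teams raised more and more questions about ECS instances, networking, and similar issues. We needed to resolve these problems faster, and the operations platform also needed to be able to demonstrate when a problem did not originate with it. So we worked with the ECS, networking, and operating system teams inside Alibaba Cloud to build a case library and a knowledge base from historical data. Combined with expert experience, this became our diagnosis and repair capability: one-click diagnosis helps business teams pinpoint the specific problem quickly, and for common problems we abstracted frequently used repair scripts to provide one-click repair."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taking ECS instances as an example, monitoring-based diagnosis of the instance identifies the root cause. We then provide a manual repair plan as well as one-click automatic repair through operations orchestration, and the process supports taking a snapshot for rollback. This work has significantly reduced the volume of requests handled by our on-call staff."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Cost Management"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main goal of cost management is cost optimization. Many business teams request large numbers of cloud servers and later find that some machines are barely used or have low CPU utilization, which wastes resources. Through its cost management capabilities, the Zeus operations system conveys cost awareness to business teams and pushes them to carry out cost optimization."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our approach relies on two mechanisms: a checkpoint before provisioning and cost allocation during use. First, resource requests pass through an approval checkpoint; if the requested specification is unusually high, the system raises a prompt asking whether the request is reasonable. Then, while resources are in use, the cost allocation capability built on tags and application groups apportions usage fees to the corresponding departments and project teams, and bills are sent to business teams periodically. Finance analyzes the departmental bills to identify projects that cost more than they bring in, and the bills also push business teams to optimize their own resource usage, for example by switching to auto scaling or adjusting resource specifications."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Summary"}]},{"type":"image","attrs":{"src":"https:\/\/uploader.shimo.im\/f\/mYlrGciyk3ETqmtT.jpg!thumbnail","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article described how the Zeus operations system manages cloud resources efficiently as the Alibaba economy moved to the cloud. In short, it does so through standardization, process-driven workflows, automation, and data-driven operations. We hope it offers a useful reference for operations engineers facing similar problems on the cloud."}]}]}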