Learn from My Mistakes: The Infrastructure Pitfalls I Fell Into

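{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As my career has progressed, I keep getting a strange, slightly surreal sense of déjà vu in meetings. A small issue that a colleague mentions in passing reminds me of a meeting I once sat through about the very same problem. I also remember how a bad decision I made back then turned the following months into a nightmare. So I jump up reflexively and shout “Whatever you do, don't do that!” My colleagues are startled, but what they don't know is that the reaction comes from deep-seated fear."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So I want to use this post to talk about the biggest mistakes I've made, in the hope of giving those who come after me some guidance. I know that whether or not you read this, you will still make the mistakes you were going to make and fall into the pits you were going to fall into. But whether it serves as a warning to you or a retrospective for me, I want one post that collects the disastrous decisions I once firmly believed in and even poured real effort into."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Don't Rush to Migrate an Application from the Data Center to the Cloud"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cloud advocates, hold your fire. I'm a loyal fan of cloud services myself, but we should admit that applications running in a physical data center rarely migrate to the cloud seamlessly. I have taken part in three separate attempts to move applications written for a specific data center to the cloud at scale, and every time unexpected problems surfaced and shattered our rosy expectations for the migration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/57\/57ce2eb98282e29e79b3681dd6d3fbfc.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"How I felt when I hit my first unsolvable cloud migration problem"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While writing and testing an application, developers have already made assumptions about how the target environment behaves: how the servers work, what performance the application can expect, how reliable the network is, roughly what latency to expect, and so on. Anyone with enough experience forms these rough judgments the moment they first touch a project. But when you try to package up an application, especially an older one, and move it to a different runtime environment, all kinds of odd behavior appear. We start hitting errors we have never seen before, and we have to live with one architectural decision after another made only to support the migration."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{
"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before long, everyone finds that the expected value of the migration has been used up, and some ill-advised attempts (such as using a direct-connect service to bridge the on-premises data center and AWS seamlessly) produce unintended consequences. Our list of decisions piles up and grows quickly, and we hand the cloud provider one edge case after another. Along the way we also find plenty of things that simply cannot be migrated, so we end up stuck between two environments: the data center we still have to maintain, plus the new cloud account. The early enthusiasm is gone, and all that remains is regret."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lessons Learned"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Port the application to the cloud. Give developers an environment fully isolated from the data center, have them port the application to the cloud there, and then schedule 4 to 8 hours of downtime for the application. During that window we can switch over the persistence layer step by step, then change the DNS entries to point at the new cloud version. This is a necessary evil; every attempt to avoid the downtime leads to one worse decision after another. In short, face the cost and move forward."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Or, more directly, design the application for the cloud environment from the very start."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Don't Write Your Own Secrets Management System"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I don't know whether it is just bad luck, but I keep running into homegrown systems. Many organizations are especially fond of building their own secrets systems: sometimes by injecting environment variables into the system, sometimes by calling a decryption API backed by RSA keys. And yes, I admit I used to be one of those overconfident people who assumed it “couldn't be that hard.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On one occasion I must have lost my mind, because I decided on a whim to manage our own secrets in a PostgREST application I was responsible for. So I wrote a program that generated JWTs based on various criteria and returned them to the application. That way the application could access its secrets safely and without worry... at least in theory."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a1\/a18da7378ea4b0e75b877cdcaef5b681.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":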
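null,"origin":null},"content":[{"type":"text","text":"In PostgREST's defense, it did exactly what it was meant to do. The problem was that secrets management turned out to be far more complicated than I first expected. We ran into caching first: with the service under more than a million requests per hour, how do we keep a subset of servers as the source of truth? We eventually solved it by adjusting the Nginx configuration, but I really should have seen it coming."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The next problem was another slap in the face. Pushing new versions was not an issue, but secrets often could not be versioned on the client side. I had authenticated the application and it could see the correct secrets; but during rotation, two secrets are valid at the same time, which I genuinely had not anticipated. As with the previous problem, this was not hard to fix, but over time I hit more and more edge cases in the service and finally admitted I had made a big mistake."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In reality, secrets management is a classic high-risk, low-reward service. It neither improves the customer experience directly nor earns recognition from management; it just eats my time with endless debugging while forcing me to learn new things as I go. For the sake of this one small homegrown feature, I ended up working through everything from multi-region availability (cross-region synchronization problems) to service hardening."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lessons Learned"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Just use AWS Secrets Manager or Vault. I personally prefer Secrets Manager, but choose whichever you like. Either is better than writing your own, because homegrown solutions tend to miss the edge cases and offer no real advantage over off-the-shelf services. Take it from my experience: one small homegrown application can easily become the cause of outages and rising operating costs."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Don't Run Your Own Kubernetes Cluster"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I know, many of you feel fully capable of running a Kubernetes cluster. etcd, setting up all the certificates, done in minutes. Fair enough, but when facing the question “should we run our own K8s cluster?”, let's walk through the following decision tree:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Are you a Fortune 100 company? If not, give up. You could do it, but there is really no need. Using a managed K8s cluster lets us benefit from the powerful features the provider adds. AWS EKS includes some impressive options, including AWS SSO support in the kubeconfig file and IAM roles for ServiceAccounts that let pods access AWS resources. Best of all, the entire control plane costs less than 1,000 US dollars a year to run. If that still doesn't persuade your restless heart, let's get down to something more practical."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One major advantage of a cloud service is that other people do the beta testing for you."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I don't understand why so many people fail to grasp something this simple. Yes, you can certainly run your own K8s cluster smoothly, but for what? AWS has thousands upon thousands of beta testers helping to safeguard each EKS release. More importantly, AWS engineers are highly skilled and dependable, and the only reward for grudgingly running your own cluster inside AWS seems to be the illusion that you can “switch cloud providers” at any time. Yes, it is only an illusion, and I will discuss it in more detail below."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lessons Learned"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hand everything over to the cloud provider. Once you do, the problems are theirs to handle. Isn't it better to let developers have an easier, more comfortable life?"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Don't Design for Multiple Cloud Providers"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is what I had always believed, but an experienced manager talked me out of it, stressing that the company should be able to switch cloud providers at any time. It sounded reasonable, so we set off down this road of no return. These days I would call this way of thinking “premature optimization syndrome.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Following the multi-cloud approach, I began auditing for “multi-cloud compatibility” and replacing the prebuilt SDKs that AWS provides with SDKs we maintained ourselves. All of this was to create a false sense of security: if AWS, the hugely popular cloud giant, suddenly went under one day, we could move our workloads at scale fairly smoothly and recoup the investment. I suspect people do this to prove some kind of impressive foresight, or because outsized ambition has led their thinking astray."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It was also the most foolish thing we ever attempted, and it made delivering products to customers far harder. Once you have chosen AWS, please don't pretend your application still needs to be deployable to other clouds. If AWS disappeared in a few days, you would indeed have to migrate; but how many people can claim their company will outlast AWS? If you lack that confidence, why spend so much time and money on a cloud-agnostic translation layer?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"o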
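rigin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All we ended up with was a pile of libraries that could never keep up, and developers so worn out that they had no time to explore the new features AWS kept releasing. These custom libraries could not take advantage of AWS's cloud features, and we never once tried to switch cloud providers or run a dual-cloud deployment, because economically it made no sense. The development team spent a long time on it and got nothing in return."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lessons Learned"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If someone around you still preaches that “we must not depend on a single cloud provider,” please talk some sense into them on my behalf. As with a data center, an application that has been designed, tested, and run successfully in AWS for years usually carries expectations and patterns matched to that environment. Any optimization made for some other, unknown design inevitably sacrifices the value of the current provider's features and puts unnecessary extra work on the development team."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Don't be the person for whom talk is cheap; going against everyone rarely makes you the hero who saves the day and more often makes you the clown in front of everyone. Also, even if people broadly agree that moving to another cloud provider makes solid economic sense, set aside at least 3 months for application testing and porting before deciding whether to actually go ahead."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cloud providers do create dependence, just as programming languages do. We cannot switch on a whim, and a full port is not very realistic either. So treat a migration proposal as an exercise to try occasionally, and focus most of your energy on integrating the product fully with the current environment."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Don't Keep Piling On Alerts"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I'm sure you have seen this at work. There is a monitor in the office dedicated to showing graphs or CloudWatch alarms. Some alerts fire at regular intervals, but colleagues are used to them and simply ignore them. Press the point and the answer is just, “We only want to see how often it fires; as long as it doesn't suddenly spike, we're fine.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This indifference eventually spills over to the alerts that are truly urgent, and we miss the window to fix critical problems. Could we reduce the number of alerts, then? No, the service management team feels the reminder should stay, even if nobody looks at it. So over time nobody can say which alerts matter and which don't; it only comes up again when a new hire joins, and colleagues repeat the same explanation. Eventually the system is bound to suffer a major outage, because an “unimportant” alert was carrying very important information that nobody was watching."}]},{"type":"paragraph","attrs":{"indent":0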
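,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I have done exactly this, and I made up “if an alert was set up, there must be a reason for it” to brush off new employees. I was wrong; I should have backed the view that we should tear everything down and start over. I hope the decision that pained me years ago does not become a real hazard for you today."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lessons Learned"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Whenever an alert fires, it must mean the system cannot recover on its own and someone has to step in immediately. Alerts should be taken seriously, and we should not build failure into the application's design. Take “sometimes our service needs a restart, just SSH in and restart it”: that is routine and should not be set up as an alert. If the restart fails, that is a separate problem, and we should not confuse the two."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Don't let junk alerts slowly pollute our lives. If almost every alert on the platform is meaningless, tear it down and start over. If the system sends 600 reminder emails a day, that is no different from sending none. Excessive alerts only numb us, which means the alerting system is not actually protecting us. That is how the human brain works: we cannot expect someone who keeps receiving junk alerts to pick out the one that truly matters."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Don't Write Internal CLI Tools in Python"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I'll keep this one brief."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Nobody knows how to install and package Python applications correctly. If you decide to write an internal tool in Python, you have only two options: either make the application fully portable, or choose Go or Rust instead. Trust me, otherwise the users who run into trouble during installation may well come after you with a knife."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/matduggan.com\/mistakes\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/matduggan.com\/mistakes\/"}]}]}]}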