5 Important Lessons I Learned from 10 Outages

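{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article summarizes problems that recurred across the many outages we have looked at. As engineering teams worked through these incidents, certain patterns (whether as risks or as assets) showed up almost every time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From these recurring patterns we have distilled some lessons that engineering teams can prepare to adopt. We hope you can take something useful from them and get ready too."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Lesson 1: Circular dependencies will break your operational tools"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using what you build is good practice. After all, if you won't do it, how can you expect customers to use your products and services? And if you won't stake your own company's productivity on them, how can you vouch for the workflows those products and services implement?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"But this healthy habit can backfire, because it creates dependency cycles. A dependency cycle means you rely on your own system... to fix your system."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Other instances of this pattern grow out of a well-meant motto: don't repeat yourself. Why run a second kind of database just for monitoring? Your production database already runs fine, so the telemetry might as well go there too."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These cycles have serious consequences during an outage. For example, you may need authentication to reach the operations systems you would use to fix the authentication module... or your monitoring may need the database to be healthy in order to serve the metrics that would tell you what is wrong with the database. Either way, you are stuck in a loop."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Even customer communication sometimes breaks, because you use your own systems to tell customers about system status."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","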
text":"Citations"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/episode-1-slack-vs-tgws\/","title":"","type":null},"content":[{"type":"text","text":"Episode 1"}]},{"type":"text","text":", Slack vs. TGWs: Slack could not reach the dashboards that would show what was wrong with their system, because AWS Transit Gateway had to be healthy to carry HTTP traffic to those dashboards. Unfortunately, the TGW was not healthy."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/monzos-2019-cassandra-outage\/","title":"","type":null},"content":[{"type":"text","text":"Episode 3"}]},{"type":"text","text":", Monzo's 2019 Cassandra outage: Monzo's production database was failing, and fixing it required authenticating system access and deploying code, but the former depended on that same production database."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/kinesis-hits-the-thread-limit\/","title":"","type":null},"content":[{"type":"text","text":"Episode 10"}]},{"type":"text","text":", Kinesis hits the thread limit: AWS could not update its status page about the Kinesis-related outage, because updating the status page depended on Kinesis."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/salesforce-publishes-a-controversial-postmortem-and-breaks-their-dns\/","title":"","type":null},"content":[{"type":"text","text":"Episode 11"}]},{"type":"text","text":", Salesforce publishes a controversial postmortem: Salesforce could not update their status page because it was hosted on a Heroku-based service, and since they own Heroku and have integrated it into their infrastructure, Heroku's health depended on the health of their own systems."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Lesson 2: Foolish automation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null
,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Everyone is very excited about the modern public cloud and the countless APIs it offers. Elasticity! Orchestration! All operations can be automated so that no human ever gets woken up!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"But this enthusiasm sometimes leads us to over-automate systems in ways that make degraded cases hard to test. Those untested degraded cases can do a great deal of damage, which dwarfs the small efficiency or cost advantage that automated decisions deliver while the system is healthy."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Even when automation is the sensible choice (because the system needs frequent adjustment, and\/or the money at stake in those adjustments is significant), it sometimes lacks the necessary “panic mode” for recognizing when its inputs have gone outside the normal range. In those cases the automation should stop acting and notify the operators, because it is about to start making some very illogical decisions."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Citations"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/episode-1-slack-vs-tgws\/","title":"","type":null},"content":[{"type":"text","text":"Episode 1"}]},{"type":"text","text":", Slack vs. TGWs: because CPUs sat idle during a network problem, Slack's automation dropped a bunch of servers they “didn't need” (narrator: they did need them), then launched too many servers when traffic surged, exceeding the file descriptor limit on their systems."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/githubs-43-second-network-partition\/","title":"","type":null},"content":[{"type":"text","text":"Episode 6"}]},{"type":"text","text":", GitHub's 43-second network partition: during a 43-second network partition, GitHub's database automation performed a cross-country promotion of a primary database whose records were incomplete."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject
.com\/podcast\/auth0s-seriously-congested-database\/","title":"","type":null},"content":[{"type":"text","text":"Episode 8"}]},{"type":"text","text":", Auth0's seriously congested database: when requests slowed down because of a database bottleneck, Auth0 launched twice as many frontends, which brought even more traffic and made the problem worse."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Lesson 3: It's 2021, and databases are still tricky"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What if everything were stateless? Those pesky databases are always causing us trouble. Even problems that surface in the frontend tier are often congestion caused by an upstream database, with the root cause traceable to a bottleneck deep in the service stack."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There is so much material on this topic that we have broken it into three sub-lessons:"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lesson 3a: Production databases should serve mostly point queries or tightly bounded ranges"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Production systems like flat, uniform, low-variance load. Database servers like lots of very fast queries, ideally all backed by indexes, so that even the worst-case cost stays bounded."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To ensure this, run your arbitrary batch queries on dedicated secondary servers, or in an OLAP system such as BigQuery or Snowflake. Dumping to CSV and running parallel grep works too. However complex those batch queries are, and whatever your dataset size and workflow, keep them off the production primary."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"And if you do not understand your query time distribution well enough to know whether crazy table scans are hiding in the tail, add that monitoring right away."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Citations"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"numbe
r":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/gitlabs-2017-postgres-outage\/","title":"","type":null},"content":[{"type":"text","text":"Episode 2"}]},{"type":"text","text":", GitLab's 2017 Postgres outage: a very expensive, long-running account deletion ran live on their production database, causing congestion and failure."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/auth0-silently-loses-some-indexes\/","title":"","type":null},"content":[{"type":"text","text":"Episode 5"}]},{"type":"text","text":", Auth0 silently loses some indexes: an unmonitored failure while creating indexes suddenly turned some queries into scans, which greatly increased database load and eventually caused an outage."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/auth0s-seriously-congested-database\/","title":"","type":null},"content":[{"type":"text","text":"Episode 8"}]},{"type":"text","text":", Auth0's seriously congested database: some especially expensive scans running on the production system made the database problems worse."}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lesson 3b: Avoid “middle magic” in your databases"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What is middle magic? Let's walk through the options."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Good option: use something boring like MySQL and handle sharding yourself. It is a hassle, because you have to do a lot of extra work in the application layer, but when it breaks you will probably understand how it works. This may have been the right idea 10 years ago, and it still looks fine today."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Better option: just buy a bigger server and run a single unsharded MySQL\/PostgreSQL server with one or two replicas. This has always been a good plan, so choose it whenever you can."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Probably the best option from 2021 onward: pay a cloud provider to run the database for you, including all the backups, failover, and so on. If you really like, you can even use a fancy database such as Cloud Spanner or DynamoDB. Relying completely and opaquely on a third party used to be unthinkable, but it may be the best approach in 2021. The big providers do this very well, and besides, your company already runs on them, so if they did it badly you would be in trouble anyway. The downside is that it will cost you, because these services are priced high."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Playing-with-fire option: use something that claims to solve all your scaling and failover problems automatically, but that you still have to operate yourself, and that has far less production history than something like MySQL. When it goes wrong, very few people know how to operate it or understand its internals well enough to diagnose the complex failure modes of its orchestration. The suspects we ran into in these outages include MongoDB and Cassandra."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Citations"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/monzos-2019-cassandra-outage\/","title":"","type":null},"content":[{"type":"text","text":"Episode 3"}]},{"type":"text","text":", Monzo's 2019 Cassandra outage: scaling out a Cassandra cluster involved a lot of hard-to-understand configuration pitfalls."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/auth0-silently-loses-some-indexes\/","title":"","type":null},"content":[{"type":"text","text":"Episode 5"}]},{"type":"text","text":", Auth0 silently loses some indexes: resyncing a replica in Mongo without degrading live traffic is hard to do."}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Lesson 3c: What matters is restores, not backups, and how long they take"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If you cannot prove that you can restore something, the backup is worthless. You also have to restore the right records, and the restore cannot take too long."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at the possible scenarios:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The backup never ran... how could that be, when I was supposedly monitoring it!"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The backup ran and produced a file in S3. How much that means depends on how far your backup validation goes. The file may be empty, or the only useful string in it may be: "},{"type":"codeinline","content":[{"type":"text","text":"Error: permission denied on directory \/data"}]},{"type":"text","text":". Your company is finished."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The backup appears to contain plenty of important data, but it was corrupted during upload. Your company is finished."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The backup contains a valid database! But because of a loop bug in the backup script, every shard is shard 0. 87.5% of your company is gone."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Every backup contains a correct, valid database! But you can only download it from a cheap storage class over a link with 85 ms of latency, so the restore takes 2 weeks. Your company is still gone."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So make sure you prove that your restores work (automate and monitor this step rather than verifying it once in a while), and make sure they finish within an acceptable time. A 4-hour outage is a bad day, but after a 4-day outage your company is finished. Make sure company policy can tolerate the restore time, and get leadership to sign off, so that nobody panics when the engineering team needs 7 hours to restore the database during a disaster."}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Citations"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/gitlabs-2017-postgres-outage\/","title":"","type":null},"content":[{"type":"text","text":"Episode 2"}]},{"type":"text","text":", GitLab's 2017 Postgres outage: the backup script ran every day and put its output in S3... until a software update broke it. The corresponding recovery had never really been tested."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/githubs-43-second-network-partition\/","title":"","type":null},"content":[{"type":"text","text":"Episode 6"}]},{"type":"text","text":", GitHub's 43-second network partition: the restore took a very long time (more than 10 hours), especially during peak traffic, which left the site degraded for a long time."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Lesson 4: Deploy slowly, in stages"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Despite our best efforts, mistakes still happen. We introduce bugs, or misconfigure something, or push out a bad firewall rule, or something else."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"But staged deployment contains the problem to a known scope, so you can see where the smoke is before the fire spreads and burns down the whole site."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many of the teams we discussed have a careful deployment approach that ensures their own employees are the first users to try a change to the service, and that only a small fraction of customers see a new deployment early."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"conten
t":[{"type":"text","text":"Here is a concrete example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Deploy to your "},{"type":"text","marks":[{"type":"strong"}],"text":"dogfooding"},{"type":"text","text":" (internal use) cluster: every hour, or on every changeset, the current HEAD is deployed to your employees. This lets your own team catch problems before customers do."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Canary"},{"type":"text","text":" cluster: on your release cadence (maybe once a day?), a release candidate is pushed to a small deployment that exposes it to a small fraction of your users. Some companies pick one of their dozens of data centers as the canary; others carve out a slice of the user base by user_id or something similar. A release manager may watch the new version's metrics in the canary audience closely before moving on to..."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Production"},{"type":"text","text":". Now it goes out to the wider world. Depending on the criticality of the service and the release cadence, the production deployment sometimes happens all at once and sometimes is staged further, for example one data center at a time."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For companies that take this approach, minor problems often go unnoticed by most users, because the dogfooding, canary, or another stage caught them first."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Where a company did not use staged deployment, things clearly went less well... and the team writing the postmortem is often the first to point out how much difference a staged deployment would have made."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Citations"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtim
eproject.com\/podcast\/one-subtle-regex-takes-down-cloudflare\/","title":"","type":null},"content":[{"type":"text","text":"Episode 4"}]},{"type":"text","text":", One subtle regex takes down Cloudflare: Cloudflare very quickly deployed an expensive new regex-based rule, which took down the entire site through CPU exhaustion."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/salesforce-publishes-a-controversial-postmortem-and-breaks-their-dns\/","title":"","type":null},"content":[{"type":"text","text":"Episode 11"}]},{"type":"text","text":", Salesforce publishes a controversial postmortem: a rapid rollout of a DNS configuration change took all of their name servers offline."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Lesson 5: Prepare for failure with policies and plans written in advance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, although we would all like to believe that if we test thoroughly enough and arrange everything carefully we will never see another large-scale outage... we all know they will still happen sooner or later."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So, as many of these outages showed, recovery is much easier when policies and plans have been built into our systems and playbooks before the outage begins."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Policy means thinking things through and deciding in advance. For example: if the whole site goes down from overload, which traffic do we shed first to recover? What types or classes of customers does that traffic belong to? If those decisions are made ahead of time, signed off by leadership, and perhaps even vetted by lawyers, the engineering team will find it much easier to bring load back below the threshold."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A plan means that we can build something like a “panic mode” in which orchestration stops, load balancers become less clever, and non-essential work pauses automatically. We can have a runtime parameter that sheds a little load when adjusted, so we do not have to turn everything off and on again and unleash a thundering herd of clients."}]},{"type":"heading","attrs":{"align":null,"level":
4},"content":[{"type":"text","text":"Citations"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/episode-1-slack-vs-tgws\/","title":"","type":null},"content":[{"type":"text","text":"Episode 1"}]},{"type":"text","text":", Slack vs. TGWs: Slack used the Envoy proxy's panic mode, which greatly improved the load-balancing algorithm's chances of finding healthy hosts under overload."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/one-subtle-regex-takes-down-cloudflare\/","title":"","type":null},"content":[{"type":"text","text":"Episode 4"}]},{"type":"text","text":", One subtle regex takes down Cloudflare: Cloudflare already had policies and supporting terms of use in place that allowed them to turn off the global Web Application Firewall when the service was failing. They also had a runtime parameter that let them disable it immediately without deploying code."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/githubs-43-second-network-partition\/","title":"","type":null},"content":[{"type":"text","text":"Episode 6"}]},{"type":"text","text":", GitHub's 43-second network partition: GitHub turned off webhook invocations and GitHub Pages builds while recovering from overload."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/downtimeproject.com\/podcast\/how-coinbase-unleashed-a-thundering-herd\/","title":"","type":null},"content":[{"type":"text","text":"Episode 9"}]},{"type":"text","text":", How Coinbase unleashed a thundering herd: Coinbase needed to configure one of its clusters, and turning all traffic off and back on unleashed a thundering herd of clients; they should have restored traffic gradually."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"nu
mber":0,"align":null,"origin":null},"content":[{"type":"text","text":"After reviewing all of these highly stressful outages, we came to a very encouraging conclusion: a handful of common practices, including many of those listed above, can prevent or substantially mitigate the impact of a wide variety of site outages."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/downtimeproject.com\/podcast\/7-lessons-from-10-outages\/"}]}]}
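
The “panic mode” idea from Lessons 2 and 5 can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the automation any of the companies above actually run: the function name `decide_scaling`, the `Decision` type, the two input metrics, and both thresholds are hypothetical. It shows automation that checks whether its inputs look plausible before acting, and that hands control to a human when they do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str  # "scale_down", "scale_up", "hold", or "panic"
    reason: str


def decide_scaling(cpu_util: float, error_rate: float,
                   current: int, proposed: int,
                   max_step_fraction: float = 0.25,
                   max_error_rate: float = 0.05) -> Decision:
    """Hypothetical autoscaler guard; all names and thresholds are assumptions."""
    # Metrics outside their valid range mean the inputs cannot be trusted.
    if not (0.0 <= cpu_util <= 1.0) or not (0.0 <= error_rate <= 1.0):
        return Decision("panic", "metrics outside valid range")
    # Elevated errors may mean traffic is failing to arrive, so idle CPU
    # is not evidence that servers are surplus. Do not remove capacity.
    if error_rate > max_error_rate and proposed < current:
        return Decision("panic", "refusing to scale down while errors are elevated")
    # Cap how far a single automated step may move the fleet size.
    max_step = max(1, int(current * max_step_fraction))
    if abs(proposed - current) > max_step:
        return Decision("panic", "requested change exceeds the per-step limit")
    if proposed < current:
        return Decision("scale_down", "within limits")
    if proposed > current:
        return Decision("scale_up", "within limits")
    return Decision("hold", "no change requested")
```

A caller would treat `"panic"` as "stop automated changes and notify the on-call operator" rather than as an error to retry. The per-step cap is one possible design choice; a real system would likely also look at how quickly the metrics are changing.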