Facebook's Worst Outage Ever: Is It Time for Internet Companies to Rethink Their Architecture?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Not long after being accused of “profiting from algorithms that amplify hate speech,” Facebook is in crisis once again."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"At around 11:39 a.m. Eastern Time on October 4, Facebook, Instagram, and the messaging app WhatsApp suffered a large-scale outage. It lasted nearly seven hours, making it Facebook's longest outage since 2008."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data from the U.S. outage-tracking site Downdetector showed that Facebook was almost completely offline across Europe, the Americas, and Oceania, and was also unreachable in Asian countries including Japan, South Korea, and India. WhatsApp and Facebook Messenger, two WeChat-style instant messaging products, reportedly have 2 billion and 1.3 billion users worldwide respectively, and the social platform Instagram has another 1 billion."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Beyond stranding billions of users, the outage also left Facebook employees unable to communicate with one another through internal tools. Facebook's email and tooling are all hosted in-house, so many employees could not do their work either."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/10\/100e8ad14861221ff45438cb24e90604.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Facebook CTO Mike Schroepfer apologizes on Twitter"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"A Disaster Triggered by a Single Command"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Facebook said the root cause was a faulty command issued during routine maintenance, which made its DNS servers unreachable and severed the connections between Facebook's entire backbone network and its data centers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The backbone is the global network Facebook built to connect all of its computing facilities. It consists of tens of thousands of miles of fiber-optic cable that span the globe and link its data centers together. Facebook infrastructure vice president Santosh 
Janardhan explained in "},{"type":"link","attrs":{"href":"https:\/\/engineering.fb.com\/2021\/10\/05\/networking-traffic\/outage-details\/","title":null,"type":null},"content":[{"type":"text","text":"a blog post"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" that these data centers come in two forms. Some are “massive buildings” housing millions of machines that store data and run heavy computational loads. Others are smaller facilities that connect the backbone network to the broader internet and to the people using Facebook's platforms."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"When a user opens an app and loads a feed or messages, the app's request for data travels from the device to the nearest facility, which then communicates directly over the backbone network with a larger data center. There the information the app needs is retrieved and processed, and the result is sent back over the network to the user's phone."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Day-to-day infrastructure maintenance is heavy work. Engineers often need to take part of the backbone offline for maintenance, for example to repair a fiber line, add capacity, or update the software on the routers themselves. That is where this outage originated."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"According to Janardhan, during one of these routine maintenance jobs, engineers issued a command intended to assess the availability of global backbone capacity. It unintentionally took down every connection in the backbone network, effectively disconnecting Facebook's data centers around the world from one another. Facebook's systems are designed to audit commands like this to prevent such mistakes, but a bug in the audit tool kept it from properly stopping the command."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The change completely disconnected the server connections between Facebook's data centers and the internet, and that set off a chain reaction that made matters worse."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Because the entire backbone was down, every DNS server location declared its connectivity unhealthy and withdrew its Border Gateway Protocol (BGP) advertisements. The end result was that Facebook's DNS servers were still running but unreachable, so the rest of the internet could not find Facebook's servers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Answering DNS queries is one of the important jobs the smaller facilities perform. DNS is the address book of the internet: it translates the simple names users type into a browser into specific server IP addresses. Those translation queries are answered by Facebook's authoritative name servers, which themselves occupy well-known IP addresses, and those addresses are in turn advertised to the rest of the internet via the Border Gateway Protocol (BGP). To ensure reliable operation, the DNS servers disable those BGP advertisements whenever they cannot reach the data centers themselves, since that indicates an unhealthy network connection."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Put simply, Facebook took away the map that tells the world's computers how to find its online properties. As a result, when someone typed Facebook.com into a web browser, the browser had no idea where to find Facebook.com and returned an error page."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Why It Could Not Be Fixed Quickly"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Why did the outage last nearly seven hours?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Janardhan said engineers faced two major obstacles while working on the fix. First, they could not reach Facebook's own data centers through normal means, because the backbone was already down. Second, the total loss of DNS broke the internal tools Facebook normally uses to investigate and resolve outages."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With both the backbone and out-of-band network access down, engineers had to travel to the data centers in person to debug the issue and restart the systems. That took time, because these facilities are built with high levels of physical and system security."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#555555","name":"user"}}],"text":"The faulty update blocked Facebook employees, most of whom were working remotely, from restoring or changing the systems. Meanwhile, those with physical access to Facebook's buildings could not reach Facebook's internal tools."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“They're hard to get into, and once you're inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” Janardhan wrote."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/83\/834b69bc4af3aefccd42b96f25c1da41.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Some experts estimate that each hour of a global Facebook, Instagram, and WhatsApp outage costs the world economy $160 million. Facebook's share price also fell as much as 6% in intraday trading, wiping more than $6 billion off Mark Zuckerberg's personal fortune in a single day."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Misfortunes never come singly. During the global outage, data on more than 1.5 billion Facebook users was reportedly put up for sale on a hacker forum. But Facebook 
denied that the alleged leak of user data had anything to do with the outage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"“We want to make clear that there was no malicious activity behind this outage. Its root cause was a faulty configuration change on our end. We also have no evidence that user data was compromised as a result of this downtime,” Janardhan "},{"type":"link","attrs":{"href":"https:\/\/engineering.fb.com\/2021\/10\/04\/networking-traffic\/outage\/","title":null,"type":null},"content":[{"type":"text","text":"said"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"An Architectural Flaw"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"At 6:33 p.m. Eastern Time, "},{"type":"link","attrs":{"href":"https:\/\/twitter.com\/Facebook\/status\/1445155265360416773?s=20","title":null,"type":null},"content":[{"type":"text","text":"Facebook tweeted"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" that its apps and services had begun coming back online. As backbone connectivity was restored across the data center regions, everything else came back with it. But the problem was not truly over."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Bringing every service back at once carried risks of its own, because the surge in traffic could easily cause a new round of crashes. Individual data centers also reported that the outage had cut their power usage by tens of megawatts, and a sudden reversal of that drop could put everything from electrical systems to caches at risk."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Janardhan said Facebook has long run “storm” drills, which simulate a major system failure by taking a service, a data center, or an entire region offline and stress-testing all the infrastructure and software involved. It had never, however, rehearsed its global backbone going offline, and it will now look for feasible ways to prepare for that scenario."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Angelique Medina, director of product marketing at Cisco ThousandEyes, which monitors internet traffic and outages, said the incident exposed a weakness in Facebook's architecture: if your own DNS fails and you have no backup DNS, you can end up with a prolonged outage. “So I think one of the big lessons here is to have redundant DNS.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Medina said a more robust architecture would have two DNS services so that one can back up the other. Amazon, for example, whose AWS itself offers a DNS service, uses two external services for its own DNS, Dyn and UltraDNS, according to Medina."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The outage also adds to the troubles of a Facebook already under antitrust investigation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"U.S. Representative Alexandria Ocasio-Cortez said the large-scale outage highlights the company's monopoly position in global communications and other services. On Twitter, she wrote that Monday's outage was a reminder of that monopoly and showed once again that Facebook should be broken up."}]}]}