Salesforce Outage: Quick Fix Backfires as Engineer Accidentally Triggers Global Downtime

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Recently, a maintenance engineer's mistaken operation landed Salesforce in unexpectedly serious trouble."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A few days ago, Salesforce suffered a global outage that lasted five hours. Announcing a five-hour outage is never easy, especially when it seriously affects Salesforce's 150,000 customers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The outage began when a maintenance engineer tried to take a shortcut around the approval process in order to fix a problem quickly. Instead, the change misconfigured Salesforce's Domain Name System (DNS) servers, leaving people unable to reach several of the company's core software-as-a-service products for hours. During that time customers could not log in reliably, and even the service status page was unreachable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for the engineer who chose to bypass established policy and inadvertently caused the incident, Salesforce said: “We have taken appropriate action with this employee.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"At first, Salesforce could not work out why it was down"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Salesforce is one of the most popular cloud software applications today. It is reportedly used by millions of employees at roughly 150,000 organizations worldwide. Its services cover every aspect of customer relationship management, from ordinary contact management and product catalogs to order management, opportunity management and sales management. Users do not need to spend heavily on money and staff to maintain, store and manage records, because all records and data are stored on Salesforce.com."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At about 21:00 UTC on the 11th of this month, Salesforce's services became unavailable. Because many companies use Customer Cloud to handle their users' requests, those customers were all affected. Some anxious customers resorted to calling Salesforce by phone but could not reach anyone: an automated message said that a service disruption was in progress and directed callers to an online page."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/31\/311776d039ca8a06a283d2d5b6d77234.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parker Harris, Salesforce's chief technology officer and co-founder, then tweeted a hint that the problem was related to DNS."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/28\/281f01790d4f7c871e15f3b2d677fcd7.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although Harris sounded fairly optimistic, the problem in fact dragged on without a fix. Worse, because the status page had gone offline as well, the company could communicate with customers only through social media."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because the outage was so unusual, some speculated that it might be the result of a cyberattack, especially given the recent cyberattack on US fuel supplies. Salesforce partner Groundswell Cloud instead guessed that the failure was related to AWS, saying there was no sign of an attack at that stage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/37\/37d2d410ca9f01c1dab3957459605886.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What exactly did the engineer do?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Salesforce kept updating its root cause analysis after the incident. A few hours later, Chief Availability Officer Darryn Dieken held a customer briefing, where he stressed that some adjustments were still needed before the fix would be fully complete."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It was at this briefing that Salesforce gave a full account of the incident and of the steps the engineer took. Although Salesforce has long prided itself on highly automated internal processes, quite a few steps still have to be done by hand, and DNS is one of them. At the time, an engineer was about to make a configuration change that would connect the DNS system to a new Salesforce Hyperforce environment in Australia."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DNS changes are nothing unusual, and the configuration script the engineer used had a stable four-year track record. But Salesforce has always insisted on “staggered” rollouts to limit the blast radius of any error, so the engineer had to make the change slowly, by hand."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Things did not go smoothly. According to Dieken, the engineer wrongly decided to shortcut the normal change process by using the so-called “Emergency Break Fix” (EBF) process. EBF is meant only for serious incidents or for cases where a large number of emergency patches must be deployed quickly, so choosing it meant taking a non-incremental “shortcut” that sidesteps approvals. The engineer's reasoning was simple: with years of experience and a stable, reliable script, what was there to worry about?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Everyone knows what happened next: a classic case of “the clown turned out to be me.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/c3\/c309876a9c4f5b3ddd50e0b062e1a57c.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dieken added: “For reasons we don't fully understand, this employee decided to run a global deployment.” With the usual staggered rollout bypassed, the DNS change required every server to restart."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"That in itself is no great matter; it might cause a brief interruption, but nothing catastrophic. As it turned out, though, the “stable, reliable” script contained a bug. Under real load, the script could time out and prevent the remaining steps from running properly. That is exactly what happened: as the update rolled out across Salesforce's data centers, the timeout was triggered again and again. Servers therefore failed to start certain tasks after restarting and could not run correctly, so customers naturally could not access Salesforce products as usual."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Then things got worse. The Salesforce team decided to use its tools for handling bad servers to “pull the emergency switch” and force a rollback and a restart of the machines."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dieken said ruefully: “But it was only then that we discovered a circular dependency. Those production tools work only if the DNS servers are up.” Staff did eventually step in and repair the servers, of course. But the incident had already had a major impact on customers, and Salesforce had to put a great deal of effort into calming the resulting chaos."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To prevent similar problems from recurring, Salesforce decided to add safeguards that block any manual global deployment and to automate the entire process. Dieken also admitted that Salesforce's test coverage had proved inadequate; in other words, the script had not been tested thoroughly enough. Finally, Salesforce still needs to resolve the major problem that its recovery tools depend on DNS."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What annoyed customers most about the outage was that, with the Salesforce status site down as well, they could follow official outage news only on Salesforce's social media accounts. What use is a status page that cannot show that something is broken? Dieken explained: “We have always kept ample capacity on hand for all kinds of demand spikes, but we never anticipated a load like this.” Autoscaling has now gone live, though, so things should improve from here; at the very least, the status page should not fail so badly again."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As for the maintenance engineer, Dieken had earlier said “we are not going to blame the employee,” but later stated: “We have taken appropriate action with this employee.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"References:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/help.salesforce.com\/articleView?id=000358392&type=1&mode=1","title":"","type":null},"content":[{"type":"text","text":"https:\/\/help.salesforce.com\/articleView?id=000358392&type=1&mode=1"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.theregister.com\/2021\/05\/19\/salesforce_root_cause\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/www.theregister.com\/2021\/05\/19\/salesforce_root_cause\/"}]}]}]}