Facebook Engineering Experience: PCIe Fault Monitoring and Repair

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d4e802984155bc684607b9c07de406b5.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"PCIe is currently the most widely used high-speed hardware bus interface. Mainstream GPUs, NICs, storage devices, and similar systems nearly all communicate with the CPU over the PCIe bus, so for large-scale data centers the stability and maintainability of PCIe are critical to the reliability and availability of the entire data center. In this article, Facebook describes its experience with automatically monitoring and remediating PCIe errors in its data centers. Original article: ","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fengineering.fb.com%2F2021%2F06%2F02%2Fdata-center-engineering%2Fhow-facebook-deals-with-pcie-faults-to-keep-our-data-centers-running-reliably%2F","title":null,"type":null},"content":[{"type":"text","text":"How Facebook deals with PCIe faults to keep our data centers running reliably","attrs":{}}],"marks":[{"type":"italic"}]}]}],"attrs":{}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Thanks to continual increases in transfer rates, PCIe (Peripheral Component Interconnect Express) keeps pushing the boundaries of computing. PCIe now supports more parallel data lanes while taking up less and less space on the motherboard. Today, PCIe-attached hardware supports faster data transfer and is one of the standard ways of connecting components to servers.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Facebook's data centers contain millions of PCIe-based hardware components, including ASIC-based accelerators for video processing and AI inference, GPUs, NICs, and SSDs. They are connected either directly to a PCIe slot on the server motherboard or through a PCIe switch such as a carrier card.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Like any other hardware, PCIe-based components are susceptible to various kinds of hardware, firmware, and software failures, and their performance can also degrade. The variety of components and vendors, the volume of failures, and the challenges of scale make monitoring, data collection, and fault isolation for PCIe-based components difficult.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We developed a system to detect, diagnose, remediate, and repair these problems. Since putting it in place, our hardware fleet has become more reliable and more resilient, and therefore more efficiently used. We believe the whole industry can benefit from this information and these strategies, and that they can help build industry standards around this common problem.","attrs":{}}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e6/e6a27ec89f5cf7962e671f52742052cd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":9}}],"text":"Facebook's data centers use a range of PCIe-based hardware components","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Tools for PCIe fault localization","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, a brief overview of the tools we use:","attrs":{}}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"PCIcrawler","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[1][2]","attrs":{}}],"attrs":{}},{"type":"text","text":": An open source, Python-based command-line tool for displaying, filtering, and exporting information about PCI or PCIe buses and devices, including the PCI topology and PCIe AER (Advanced Error Reporting) errors. It can produce a visual tree-style report that is convenient for debugging, and it can also emit machine-parsable JSON that analysis tools can consume, which makes PCIcrawler suitable for large-scale deployment.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"MachineChecker","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[3]","attrs":{}}],"attrs":{}},{"type":"text","text":": An internal tool for quickly evaluating the production worthiness of servers from a hardware standpoint. MachineChecker helps detect and diagnose hardware problems. It can be run as a command-line tool, and is also available as a library or a service.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An internal tool for taking a snapshot of the target host's hardware configuration, along with its hardware modules.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An internal utility service that parses custom dmesg output and SELs (System Event Logs) to detect and report PCIe errors across millions of servers. The tool periodically parses the logs on each server and records the rate of correctable errors in a file on that server. Rates are recorded per 10 minutes, per 30 minutes, per hour, per six hours, and per day. Depending on the requirements of each platform and service, these rates are used to determine which servers have exceeded the configured tolerable threshold for the PCIe corrected error rate.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"IPMI Tool","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[4]","attrs":{}}],"attrs":{}},{"type":"text","text":": A utility for managing and configuring devices that support IPMI (Intelligent Platform Management Interface). IPMI is an open standard for monitoring, logging, recovery, management, and control of hardware, implemented independently of the main CPU, BIOS, and operating system. We mainly use it to extract SELs (System Event Logs) manually for inspection, debugging, and study.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"OpenBMC","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[5]","attrs":{}}],"attrs":{}},{"type":"text","text":": A Linux distribution for embedded devices that have a BMC (Baseboard Management Controller).","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"FBAR (Facebook Auto Remediation)","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[6]","attrs":{}}],"attrs":{}},{"type":"text","text":": A system and a set of daemons that automatically execute code in response to software and hardware signals detected on individual servers. Every day, without human intervention, FBAR takes faulty servers out of production and sends physical hardware repair requests to our data center teams, so isolated failures are no longer a problem.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Scuba","attrs":{}},{"type":"sup","content":[{"type":"text","text":"[7]","attrs":{}}],"attrs":{}},{"type":"text","text":": A fast, scalable, distributed in-memory database built at Facebook. It is the data management system Facebook uses for most of its real-time analysis.","attrs":{}}]}]}],"attrs":{}},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"PCIe fault analysis","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The diversity of PCIe hardware components (ASICs, NICs, SSDs, and so on) makes analyzing PCIe-related problems a daunting task. These components come from different vendors, run different firmware versions, and host different applications. On top of that, the applications themselves may have different compute and storage needs, application configurations, and fault tolerance requirements.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using the tools listed above, we have been conducting studies to address these challenges and to determine the root causes of PCIe hardware failures and performance degradation.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some problems are obvious. For example, a PCIe fatal uncorrected error is definitely bad news, even if there is only one instance on a particular server. MachineChecker can detect this and flag the faulty hardware, which is eventually replaced.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Depending on the error condition, uncorrectable errors are further classified as nonfatal or fatal. Nonfatal errors make a particular transaction unreliable while the PCIe link itself remains fully functional. Fatal errors, by contrast, make the link itself unreliable. In our experience, for any uncorrected PCIe error, replacing the hardware component (and sometimes the motherboard) is the most effective action.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Other problems can seem harmless at first. For example, PCIe corrected errors are correctable by definition and, in practice, are mostly corrected well. Correctable errors should have no effect on the functionality of the interface. However, how often they occur matters. If the rate exceeds a particular threshold, it causes performance degradation that is unacceptable for some applications.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We carried out an in-depth study to correlate performance degradation and system stalls with the PCIe corrected error rate. Determining the right threshold is a further challenge, because different platforms and applications have different system configurations and requirements. We rolled out the PCIe error logging service, observed the failures in Scuba, and correlated events, system stalls, and PCIe faults to determine the threshold for each platform. We found that replacing the hardware is the most effective solution once the PCIe corrected error rate crosses a particular threshold.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PCIe defines two error reporting paradigms: the baseline capability and the AER capability. The baseline capability is required of all PCIe components and provides a minimum defined set of error reporting. The AER capability is implemented through a PCIe AER extended capability structure and provides more robust error reporting. The PCIe AER driver provides the infrastructure that supports the PCIe AER capability, and we collect this information with PCIcrawler.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We recommend that every vendor adopt the PCIe AER capability and PCIcrawler rather than relying on custom tools, which lack generality. The output of custom tools is hard to parse and therefore harder to maintain. Custom tools also mean that integrating a new vendor, a new kernel version, or a new hardware type takes a great deal of time and effort.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Bad (downgraded) link speed, usually running at 1/2 or 1/4 of the expected speed, and bad (downgraded) link width, usually running at 1/2, 1/4, or even 1/8 of the expected width, are another class of PCIe-related faults to watch for. Because the hardware is still working, just not at its best, these faults are hard to detect without some form of automation.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our at-scale study found that most of these faults can be fixed by resetting the hardware component. That is why we try a reset first, before marking the hardware as faulty.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because a minority of these faults can be fixed by a reboot, we also record historical repair actions. We have special rules for identifying hardware that fails repeatedly. For example, if the same hardware component on the same server fails a predefined number of times within a predetermined time interval, then after a predetermined number of resets we automatically mark it as faulty and swap it out. If replacing the component does not fix the problem, we have to go further and replace the motherboard.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We also keep a close eye on repair trends to identify atypical failure rates. In one case, using data from a custom Scuba table together with the charts and timelines generated from it, we traced a performance degradation to a specific firmware release from a specific vendor. We then worked with the vendor to roll out new firmware that fixed the problem.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Just as important, we rate-limit remediations and repairs as a safety net, so that bugs in the code, configuration errors, and similar problems do not leak into production, where mishandling them could lead to serious consequences such as service outages.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With this overall approach, we have been able to add hardware health coverage and repair thousands of servers and server components. Every week, we detect, diagnose, remediate, and repair a variety of PCIe faults on hundreds of servers.","attrs":{}}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"PCIe fault handling workflow","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here is a step-by-step breakdown of how we identify and repair PCIe faults:","attrs":{}}]},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"MachineChecker runs periodically as a service on the millions of hardware servers and switches in our production fleet. Its checks cover PCIe link speed, PCIe link width, and the PCIe uncorrected and corrected error rates.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"For a given PCIe endpoint, we use the PCIe topology information from PCIcrawler to find its upstream parent port. We refer to the two ends together as a PCIe link.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"We rely on the output of PCIcrawler, which in turn depends on the generic registers LnkSta, LnkSta2, LnkCtl, and LnkCtl2.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Compute the expected speed: expected_speed = min(upstream_target_speed, endpoint_capable_speed, upstream_capable_speed).","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Compute the current speed: current_speed = min(endpoint_current_speed, upstream_current_speed).","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"current_speed must equal expected_speed. In other words, the current speed at either end should equal the minimum of the endpoint capable speed, the upstream capable speed, and the upstream target speed.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"Compute the expected PCIe link width: expected_width = min(pcie_upstream_device capable_width, pcie_endpoint_device capable_width).","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":8,"align":null,"origin":null},"content":[{"type":"text","text":"If the upstream device's current width is less than expected_width, we flag the link as bad.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":9,"align":null,"origin":null},"content":[{"type":"text","text":"The PCIe error logging service runs independently on our hardware servers and records the rates of corrected and uncorrectable errors in a predetermined format (JSON).","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":10,"align":null,"origin":null},"content":[{"type":"text","text":"MachineChecker checks for uncorrected errors. Even a single uncorrected error event qualifies the server as faulty.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":11,"align":null,"origin":null},"content":[{"type":"text","text":"During its periodic run, MachineChecker also looks up the files generated on the server and checks them against the per-platform thresholds preconfigured in Configerator (Facebook's configuration management system). If a rate exceeds the configured threshold, the hardware is marked as faulty. These thresholds are easy to tune per platform.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null}
,"content":[{"type":"paragraph","attrs":{"indent":0,"number":12,"align":null,"origin":null},"content":[{"type":"text","text":"PCIcrawler comes preinstalled on all of our hardware servers and is used to check for PCIe AER issues.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":13,"align":null,"origin":null},"content":[{"type":"text","text":"Facebook's internal hardware configuration tool lets us associate a PCIe address with a given hardware component.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":14,"align":null,"origin":null},"content":[{"type":"text","text":"MachineChecker uses PCIcrawler (for link width, link speed, and AER information) and the PCIe error parsing service (which draws on SEL and dmesg) to identify hardware problems and create alerts or alarms. MachineChecker uses the information from our internal tool to identify the hardware component associated with a PCIe address, and assists data center operators, who may need to replace the hardware, by supplying additional details such as the component's location, module information, and vendor.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":15,"align":null,"origin":null},"content":[{"type":"text","text":"Application engineers can subscribe to these alerts or alarms and customize workflows for monitoring, alerting, remediation, and custom repair.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":16,"align":null,"origin":null},"content":[{"type":"text","text":"We can apply a specific remediation to a subset of all alerts. Remediations can also be fine-tuned with special handling, for example restricting the remediation to a firmware upgrade when a particular case is known to call for it.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":17,"align":null,"origin":null},"content":[{"type":"text","text":"If remediation fails completely, a hardware repair ticket is created automatically so that data center operators can replace the bad hardware component or server with a tested good one.","attrs":{}}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":18,"align":null,"origin":null},"content":[{"type":"text","text":"We have rate limiting in several places as a safety net that keeps bugs in the code and misconfigurations from leaking into production. Left unhandled, such problems could lead to service outages.","attrs":{}}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have been able to add hardware health coverage and repair thousands of servers and server components this way. Every week, we detect, diagnose, remediate, and repair a variety of PCIe faults on hundreds of servers. This makes our hardware more reliable, more resilient, and more efficiently used.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"References:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] ","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fgithub.com%2Ffacebook%2Fpcicrawler","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/facebook/pcicrawler","attrs":{}}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] ","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fengineering.fb.com%2F2020%2F08%2F05%2Fopen-source%2Fpcicrawler%2F","title":null,"type":null},"content":[{"type":"text","text":"https://engineering.fb.com/2020/08/05/open-source/pcicrawler/","attrs":{}}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[3] ","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fengineering.fb.com%2F2020%2F12%2F09%2Fdata-center-engineering%2Fhow-facebook-keeps-its-large-scale-infrastructure-hardware-up-and-running%2F","title":null,"type":null},"content":[{"type":"text","text":"https://engineering.fb.com/2020/12/09/data-center-engineering/how-facebook-keeps-its-large-scale-infrastructure-hardware-up-and-running/","attrs":{}}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[4] 
","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fgithub.com%2Fipmitool%2F","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/ipmitool/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[5] ","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fgithub.com%2Fopenbmc%2F","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/openbmc/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[6] ","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fengineering.fb.com%2F2016%2F07%2F11%2Fproduction-engineering%2Fmaking-facebook-self-healing-automating-proactive-rack-maintenance%2F","title":null,"type":null},"content":[{"type":"text","text":"https://engineering.fb.com/2016/07/11/production-engineering/making-facebook-self-healing-automating-proactive-rack-maintenance/","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[7] ","attrs":{}},{"type":"link","attrs":{"href":"https://links.jianshu.com/go?to=https%3A%2F%2Fresearch.fb.com%2Fwp-content%2Fuploads%2F2016%2F11%2Fscuba-diving-into-data-at-facebook.pdf","title":null,"type":null},"content":[{"type":"text","text":"https://research.fb.com/wp-content/uploads/2016/11/scuba-diving-into-data-at-facebook.pdf","attrs":{}}]}]}],"attrs":{}}]}
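The link speed and width checks in steps 4 through 8 of the workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Facebook's implementation: the `PortLink` container, its field names, and the `check_link` function are hypothetical, and the values are assumed to have already been read from the LnkSta, LnkSta2, LnkCtl, and LnkCtl2 registers (for example, from PCIcrawler output).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PortLink:
    """Link fields for one PCIe port (hypothetical container, not a pcicrawler type)."""
    capable_speed: float                  # highest speed the port supports, in GT/s
    current_speed: float                  # speed currently negotiated, in GT/s
    capable_width: int                    # widest link the port supports, in lanes
    current_width: int                    # width currently negotiated, in lanes
    target_speed: Optional[float] = None  # assumed to be set only on the upstream port


def expected_speed(upstream: PortLink, endpoint: PortLink) -> float:
    # Step 4: min(upstream_target_speed, endpoint_capable_speed, upstream_capable_speed).
    speeds = [endpoint.capable_speed, upstream.capable_speed]
    if upstream.target_speed is not None:
        speeds.append(upstream.target_speed)
    return min(speeds)


def current_speed(upstream: PortLink, endpoint: PortLink) -> float:
    # Step 5: min(endpoint_current_speed, upstream_current_speed).
    return min(endpoint.current_speed, upstream.current_speed)


def expected_width(upstream: PortLink, endpoint: PortLink) -> int:
    # Step 7: min of the capable widths at both ends of the link.
    return min(upstream.capable_width, endpoint.capable_width)


def check_link(upstream: PortLink, endpoint: PortLink) -> List[str]:
    """Return the problems found on one PCIe link; an empty list means healthy."""
    problems = []
    # Step 6: the current speed must equal the expected speed.
    if current_speed(upstream, endpoint) != expected_speed(upstream, endpoint):
        problems.append("degraded_speed")
    # Step 8: the link is bad if the upstream current width is below the expected width.
    if upstream.current_width < expected_width(upstream, endpoint):
        problems.append("degraded_width")
    return problems
```

Taking the minimum over both ends matters because a slower or narrower endpoint on a faster port is not a fault: an endpoint capable of 8 GT/s x4 attached to a 16 GT/s x16 port is healthy when it runs at 8 GT/s x4.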