How does the Telemetry standard log interface improve operations efficiency?

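{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When enterprise SSDs are deployed and operated in storage systems, logs sometimes have to be collected for further debugging and analysis. Traditionally, SSD vendors defined their own commands and tools for this, but tools and command formats differ from vendor to vendor, which drives up operations and maintenance costs for storage systems. Some storage systems also impose strict requirements on SSDs and forbid vendor-specific commands, which makes log collection even more challenging. The standardized Telemetry log collection interface was created to address this.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What is Telemetry?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry is a standard interface for collecting error logs, defined in the NVM Express Base Specification and added in NVMe 1.3. Telemetry uses a single standard interface to collect vendor-defined SSD data logs: the user issues one standard command to obtain the required logs and then sends them to the SSD vendor for further analysis. This reduces the likelihood that an SSD must be pulled from the deployed system just to collect logs, cuts on-site debug time, and improves operations efficiency. Typical use cases include field failure classification, periodic health monitoring, and problem localization; locating and analyzing problems quickly in turn improves product reliability.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Telemetry log collection modes: Host/Controller-Initiated","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Telemetry feature defines a mechanism for collecting vendor-defined SSD logs. Data collection can be Host-Initiated or Controller-Initiated: in the former, the host initiates the capture of key SSD information; in the latter, the SSD captures data it considers important, as defined by the SSD vendor.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/45/45869fbbfa6aa7c61cb60132e0916257.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data collected at the host's request is recorded in the Telemetry Host-Initiated Log, and data collected at the controller's initiative is recorded in the Telemetry Controller-Initiated Log.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Host-Initiated Telemetry Log (Log Page 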
ID=07): stored only in DDR, not saved to NAND","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Controller-Initiated Telemetry Log (Log Page ID=08): stored in NAND","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Parsing Telemetry logs","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each Telemetry data log consists of a set of individual Telemetry Data Blocks. Each block is 512 bytes, and Telemetry data is returned in units of Telemetry Data Blocks, so the returned size must be an integer multiple of 512 bytes. Block 0 is the log header. The NVMe specification defines the standard Telemetry header (bytes 0 to 511). The Host-Initiated and Controller-Initiated Telemetry logs have similar data structures; the figure below shows the structure of the Telemetry Host-Initiated log. Within the header, bytes 384:511 (Reason Identifier) are a vendor-defined field, and its layout may differ between Controller-Initiated logs triggered in different ways.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/13/13856bea0ed245e135bae1634b0ccf98.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry log data blocks must be returned in integer multiples of 512 bytes:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d9/d968d33edc47795913d8593a3447af43.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each Telemetry data log is divided into three Telemetry Data Areas (Area 1, Area 2, and Area 3; each later area contains the earlier ones).","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Area 
1: Stage one. Confirm that a problem exists by collecting the smallest data set needed to distinguish it from other problems.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Area 2: Stage two. Collect and analyze a deeper, medium-sized data set to determine the source of the problem.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Area 3: Stage three. Collect the largest and most complete data set to diagnose the problem.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All Telemetry Data Areas start from Telemetry Area 1. Byte 383, the Telemetry Controller-Initiated Data Generation Number, is used to check whether the data has been read completely.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following are the Host-Initiated Telemetry Log examples given in the NVMe specification; the Controller-Initiated Telemetry Log is similar.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"When no Telemetry log data has been collected","attrs":{}}]}]}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 1 Last Block = 0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 2 Last Block = 
0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 3 Last Block = 0","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. When all three areas contain data (the numbers below are examples only; actual values will vary)","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 1 Last Block = 65","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 2 Last Block = 1000","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 3 Last Block = 30000","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e0/e07cb840973f7c86b1e1098292ba835e.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. 
When only Area 2 contains data (the numbers below are examples only; actual values will vary)","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 1 Last Block = 0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 2 Last Block = 1000","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 3 Last Block = 1000","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3b/3bbe1dda14619fef53d2b937652d1eb1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Differences in Host/Controller-Initiated Telemetry triggers","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Host-Initiated and Controller-Initiated data are prepared, collected, and submitted in similar ways. The main difference is what triggers collection: a Controller-Initiated Telemetry Log is triggered in vendor-defined scenarios such as bad blocks or PCIe timeouts.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The typical Telemetry workflow is:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) The host checks the Identify Controller data structure 
to determine whether the controller supports Telemetry logs;","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ed/ed0922b317ade8b0af8bc9a6a17df4b4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) The host prepares an area in which to store Telemetry data when needed;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) To be notified when Controller-Initiated Telemetry data is available, the host enables Telemetry log notices through the Asynchronous Event Configuration feature;","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4f4bf717a58a3eb29f361a7b808be69a.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) If the host decides to collect Host-Initiated Telemetry data, or the controller signals that Controller-Initiated Telemetry data is available:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a. The host reads the appropriate blocks of the Telemetry Data Area from the Host-Initiated log or the Controller-Initiated log. If possible, the host should collect Telemetry Data Areas 1, 2, and 3. The host reads the log in units of 512-byte Telemetry Data Blocks. As part of its final read of the Controller-Initiated log, the host clears the Retain Asynchronous Event (RAE) bit to 0;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b. 
For a Controller-Initiated log, the host re-reads the log page header and verifies that the Controller-Initiated Data Generation Number matches the value originally read. If the values do not match, the captured data is incomplete and must be read again;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"c. Once all Telemetry data has been saved, it should be forwarded to the SSD vendor.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"One-click vendor-customized collection with NVMe-CLI","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The open-source nvme-cli tool supports Telemetry log collection. The header of a collected Telemetry log can be parsed directly, but the log data itself is encrypted and must be returned to the SSD vendor for decoding. Taking nvme-cli v1.12 as an example:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8d7ab9fc3e12ee30a8ac8c6c5fc41bd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Example 1: collect one host-initiated telemetry-log.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"# nvme telemetry-log /dev/nvme0 -o host_tel.log -g 1","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Example 2: collect one controller-initiated telemetry-log (to obtain all of them, run the command multiple times).","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"# nvme telemetry-log /dev/nvme0 -o controller_tel.log -c","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In real deployments, multiple Telemetry logs may have been stored. SSD vendors such as Memblaze provide automated collection tools built on nvme-cli that gather all Telemetry logs in one pass.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"OCP Spec 2.0 Telemetry 
extensions","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The OCP (Open Compute Project) Datacenter NVMe® SSD Specification was led by Facebook and Microsoft and defines the SSD requirements of hyperscale operators. The specification reached version 2.0 in 2021.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OCP 2.0 adds support for Telemetry log collection over NVMe-MI out-of-band management implemented on PCIe VDM, which may also become a requirement for operators in China in the future.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/df2e224ab2c8ca63fafb94ede68ee044.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"References:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NVM Express™ Base Specification, Revision 1.4","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OCP Datacenter NVMe® SSD Specification Version 2.0","attrs":{}}]}]}],"attrs":{}}]}