How Does the Standard Telemetry Log Interface Improve Operations Efficiency?

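{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When enterprise SSDs are deployed and operated in a storage system, logs sometimes have to be collected for further debugging and analysis. Traditionally, SSD vendors define their own commands and tools for this, but the tools and command formats differ from vendor to vendor, which drives up operations cost for the storage system. Some storage systems also place strict requirements on SSDs and do not allow vendor-defined commands to be issued, which makes log collection even harder. The standardized Telemetry log collection interface was created to address this.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What is Telemetry?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry is a standard interface for collecting error logs, defined in the NVM Express Base Specification and introduced in NVMe 1.3. It collects vendor-defined SSD log data through a single standard interface: the user issues one standard command to retrieve the required logs and then sends them to the SSD vendor for further analysis. This reduces the likelihood that an SSD has to be pulled from the deployed system just to collect logs, shortens on-site debug time, and improves operations efficiency. Typical Telemetry use cases include field failure classification, periodic health monitoring, and problem localization; faster root-cause analysis in turn helps improve product reliability.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Telemetry log collection modes: Host/Controller-Initiated","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Telemetry feature defines a mechanism for collecting vendor-defined SSD logs. Collection can be Host-Initiated or Controller-Initiated: in the former, the host requests key information from the SSD; in the latter, the SSD captures data it considers important, as defined by the SSD vendor.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/45/45869fbbfa6aa7c61cb60132e0916257.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data collected at the host's request is recorded in the Telemetry Host-Initiated Log, and data collected at the controller's initiative is recorded in the Telemetry Controller-Initiated Log.","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Host-Initiated Telemetry Log (Log Page 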
ID=07h): held only in DDR, not saved to NAND","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Controller-Initiated Telemetry Log (Log Page ID=08h): stored in NAND","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Parsing Telemetry logs","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each Telemetry data log consists of a set of individual Telemetry Data Blocks of 512 bytes each, and Telemetry data is returned in units of these blocks (so the transfer length must be a multiple of 512 bytes). Block 0 is the log header. The NVMe specification defines the standard Telemetry header (bytes 0 to 511). The Host-Initiated and Controller-Initiated Telemetry logs have similar data structures; the figure below shows the structure of the Telemetry Host-Initiated log. Bytes 384:511, the Reason Identifier, are vendor-defined, so their layout may differ between Controller-Initiated logs triggered in different ways.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/13/13856bea0ed245e135bae1634b0ccf98.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry log data must be returned in multiples of 512 bytes:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d9/d968d33edc47795913d8593a3447af43.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each Telemetry data log is divided into three Telemetry Data Areas (Area 1, Area 2, and Area 3; each later area includes the earlier ones).","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Area 
1: first stage. Confirm that a problem exists by collecting the smallest data set needed to tell this problem apart from others.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Area 2: second stage. Collect and analyze a deeper, medium-sized data set to determine the source of the problem.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data Area 3: third stage. Collect the largest and most complete data set to diagnose the problem.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All Telemetry data areas start from Telemetry Area 1. Byte 383 of the header, the Telemetry Controller-Initiated Data Generation Number, is used to check whether the data has been read completely.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following Host-Initiated Telemetry Log examples are taken from the NVMe specification; the Controller-Initiated Telemetry Log behaves similarly.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"When no Telemetry log data has been collected","attrs":{}}]}]}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 1 Last Block = 0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 2 Last Block = 
0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 3 Last Block = 0","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. When all three areas contain data (the numbers below are examples only; actual values will vary)","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 1 Last Block = 65","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 2 Last Block = 1000","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 3 Last Block = 30000","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e0/e07cb840973f7c86b1e1098292ba835e.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. 
When only Area 2 contains data (the numbers below are examples only; actual values will vary)","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 1 Last Block = 0","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 2 Last Block = 1000","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Telemetry Host-Initiated Data Area 3 Last Block = 1000","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3b/3bbe1dda14619fef53d2b937652d1eb1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Differences in Host/Controller-Initiated Telemetry triggers","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Host-Initiated and Controller-Initiated data are prepared, collected, and submitted in similar ways; the main difference is what triggers collection. A Controller-Initiated Telemetry Log is triggered in scenarios defined by the SSD vendor, such as bad blocks or PCIe timeouts.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The typical Telemetry workflow is:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) The host checks the Identify Controller data structure 
to determine whether the controller supports Telemetry logs;","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ed/ed0922b317ade8b0af8bc9a6a17df4b4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) The host prepares an area in which to store Telemetry data when needed;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) To be notified when Controller-Initiated Telemetry data is available, the host enables Telemetry Log Notices using the Asynchronous Event Configuration feature;","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4f4bf717a58a3eb29f361a7b808be69a.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) If the host decides to collect Host-Initiated Telemetry data, or the controller signals that Controller-Initiated Telemetry data is available:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a. The host reads the appropriate blocks of the Telemetry Data Areas from the Host-Initiated log or the Controller-Initiated log. If possible, the host should collect Telemetry Data Areas 1, 2, and 3. The host reads the log in 512-byte Telemetry Data Block units. As part of the last read of the Controller-Initiated log, the host clears the Retain Asynchronous Event (RAE) bit to 0;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b. 
If the log is Controller-Initiated, the host re-reads the log page header and checks that the Controller-Initiated Data Generation Number matches the value read originally. If the values do not match, the captured data is incomplete and the log must be read again;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"c. Once all Telemetry data has been saved, it should be forwarded to the SSD vendor.","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"One-step collection with NVMe-CLI and vendor tools","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The open-source nvme-cli tool supports Telemetry log collection. The header of a collected Telemetry log can be parsed, but the log data itself is encrypted and must be returned to the SSD vendor for decoding. Using nvme-cli v1.12 as an example:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8d7ab9fc3e12ee30a8ac8c6c5fc41bd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Example 1: collect one host-initiated telemetry log.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"# nvme telemetry-log /dev/nvme0 -o host_tel.log -g 1","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Example 2: collect one controller-initiated telemetry log (run the command multiple times to retrieve all of them).","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"# nvme telemetry-log /dev/nvme0 -o controller_tel.log -c","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In real deployments several Telemetry logs may have been stored. SSD vendors such as Memblaze provide automated collection tools built on nvme-cli that can collect all Telemetry logs in one pass.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"OCP Spec 2.0 Telemetry 
Extensions","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The OCP (Open Compute Project) Datacenter NVMe® SSD Specification was developed under the lead of Facebook and Microsoft and captures the SSD requirements of hyperscale operators; it reached version 2.0 in 2021.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OCP 2.0 adds support for Telemetry log collection through NVMe-MI out-of-band management over PCIe VDM, which may also become a requirement for service providers in China.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/df2e224ab2c8ca63fafb94ede68ee044.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"References:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NVM Express™ Base Specification, Revision 1.4","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OCP Datacenter NVMe® SSD Specification, Version 2.0","attrs":{}}]}]}],"attrs":{}}]}
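The header layout and the three Last Block examples described in the article can be sketched in a few lines of Python. This is a minimal illustration rather than a vendor tool: the function names (`parse_telemetry_header`, `area_block_ranges`, `total_log_bytes`) are hypothetical, and the field offsets assumed here (bytes 8 to 13 for the three little-endian 16-bit Last Block fields, byte 382 for Controller-Initiated Data Available, byte 383 for the generation number, bytes 384 to 511 for the Reason Identifier) should be checked against the revision of the NVMe specification you target.

```python
import struct

BLOCK_SIZE = 512  # size of one Telemetry Data Block in bytes


def parse_telemetry_header(header: bytes) -> dict:
    """Decode a few fields of block 0 (the header) of a Telemetry log.

    Offsets are assumptions based on the NVMe 1.3/1.4 header layout.
    """
    if len(header) < BLOCK_SIZE:
        raise ValueError("the Telemetry header is one full 512-byte block")
    # Three little-endian 16-bit "Data Area N Last Block" fields at bytes 8..13.
    area1, area2, area3 = struct.unpack_from("<HHH", header, 8)
    return {
        "log_id": header[0],                # 0x07 host-initiated, 0x08 controller-initiated
        "area_last_block": (area1, area2, area3),
        "ctrl_data_available": header[382],
        "ctrl_generation": header[383],     # re-read and compare after collecting
        "reason_id": header[384:512],       # vendor-defined
    }


def area_block_ranges(last_blocks):
    """Return the (first, last) block range that each area adds, or None if empty.

    Last Block values are cumulative, so an area is empty when its value
    does not exceed the previous area's value.
    """
    ranges = []
    prev = 0  # block 0 is the header; data blocks start at 1
    for last in last_blocks:
        ranges.append((prev + 1, last) if last > prev else None)
        prev = max(prev, last)
    return ranges


def total_log_bytes(last_blocks):
    """Bytes needed to hold the header plus every data block."""
    return (max(last_blocks) + 1) * BLOCK_SIZE


if __name__ == "__main__":
    # Synthetic header reproducing the "all three areas contain data" example.
    hdr = bytearray(BLOCK_SIZE)
    hdr[0] = 0x07
    struct.pack_into("<HHH", hdr, 8, 65, 1000, 30000)
    info = parse_telemetry_header(bytes(hdr))
    print(area_block_ranges(info["area_last_block"]))
    print(total_log_bytes(info["area_last_block"]))
```

For a Controller-Initiated log, a collector built on this sketch would call `parse_telemetry_header` once before and once after reading the data blocks and compare the two `ctrl_generation` values; a mismatch means the log has to be read again, as step b of the workflow describes.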