从Ftrace开始内核探索之旅

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"前言","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"操作系统内核对应用开发工程师来说就像一个黑盒,似乎很难窥探到其内部的运行机制。其实Linux内核很早就内置了一个强大的tracing工具:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Ftrace","attrs":{}},{"type":"text","text":",它几乎可以跟踪内核的所有函数,不仅可以用于调试和分析,还可以用于观察学习Linux内核的内部运行。虽然Ftrace在2008年就加入了内核,但很多应用开发工程师仍然不知道它的存在。本文就给你介绍一下Ftrace的基本使用。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Ftrace初体验","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"先用一个例子体验一下使用简单且功能强大的Ftrace。使用 root 用户进入/sys/kernel/debug/tracing目录,执行 echo 和 cat 命令:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4b8a5898a65576a3818a00d142b615f6.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们使用Ftrace的function_graph功能显示了内核函数 _do_fork() 所有子函数调用。左边的第一列是执行函数的 CPU,第二列 DURATION 显示在相应函数中花费的时间。我们注意到最后一行的耗时之前有个 + 号,提示用户注意延迟高的函数。+ 代表耗时大于 10 μs。如果耗时大于 100 μs,则显示 ! 号。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们知道,fork 是建立父进程的一个完整副本,然后作为子进程执行。那么_do_fork()的第一件大事就是调用 copy_process() 复制父进程的数据结构,从上面输出的调用链信息也验证了这一点。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用完后执行下面的命令关闭function_graph:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo nop > current_tracer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo > set_graph_function","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用 Ftrace 的 function_graph 功能,可以查看内核函数的子函数调用链,帮助我们理解复杂的代码流程,而这只是 Ftrace 的功能之一。这么强大的功能,我们不必安装额外的用户空间工具,只要使用 echo 和 cat 命令访问特定的文件就能实现。Ftrace 对用户的使用接口正是tracefs文件系统。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"tracefs 文件系统","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用户通过tracefs文件系统使用Ftrace,这很符合一切皆文件的Linux哲学。tracefs文件系统一般挂载在/sys/kernel/tracing目录。由于Ftrace最初是debugfs文件系统的一部分,后来才被拆分为自己的tracefs。所以如果系统已经挂载了debugfs,那么仍然会保留原来的目录结构,将tracefs挂载到debugfs的子目录下。我们可以使用 mount 命令查看当前系统debugfs和tracefs挂载点:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/50/5003c3621d7b8de8011afd07c1241562.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我使用的系统是Ubuntu 20.04.2 LTS,可以看到,为了保持兼容,tracefs同时挂载到了/sys/kernel/tracing和/sys/kernel/debug/tracing。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"tracefs下的文件主要分两类:控制文件和输出文件。这些文件的名字都很直观,像前面例子通过 current_tracer 设置当前要使用的 tracer,然后从 trace中读取结果。还有像 available_tracers 包含了当前内核可用的 tracer,可以设置 trace_options 自定义输出。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/73/736d59447d1d3f8a50955fe022cc8104.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文后面的示例假定你已经处在了/sys/kernel/tracing或/sys/kernel/debug/tracing目录下。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"函数跟踪","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ftrace 实际上代表的就是function trace(函数跟踪),因此函数追踪是Ftrace最初的一个主要功能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ftrace 可以跟踪几乎所有内核函数调用的详细信息,这是怎么做到的呢?简单来说,在编译内核的时候使用了 gcc 的 -pg 选项,编译器会在每个内核函数的入口处调用一个特殊的汇编函数“mcount” 或 “__fentry__”,如果跟踪功能被打开,mcount/fentry 会调用当前设置的 tracer,tracer将不同的数据写入","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ring buffer。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a4/a4785a25b9f3a5772ab00192b0a1b022.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从上图可以看出,Ftrace 提供的 function hooks 机制在内核函数入口处埋点,根据配置调用特定的 tracer, tracer将数据写入ring buffer。Ftrace实现了一个无锁的ring buffer,所有的跟踪信息都存储在ring buffer中。用户通过 tracefs 文件系统接口访问函数跟踪的输出结果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你可能已经意识到,如果每个内核函数入口都加入跟踪代码,必然会非常影响内核的性能,幸好Ftrace支持动态跟踪功能。如果启用了","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CONFIG_DYNAMIC_FTRACE","attrs":{}},{"type":"text","text":"选项,编译内核时所有的mcount/fentry调用点都会被收集记录。在内核的初始化启动过程中,会根据编译期记录的列表,将mcount/fentry调用点替换为","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NOP","attrs":{}},{"type":"text","text":"指令。","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NOP","attrs":{}},{"type":"text","text":"就是 no-operation,不做任何事,直接转到下一条指令。因此在没有开启跟踪功能的情况下,Ftrace不会对内核性能产生任何影响。在开启追踪功能时,Ftrace才会将","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NOP","attrs":{}},{"type":"text","text":"指令替换为mcount/fentry。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"启用函数追踪功能,只需要将 current_tracer 文件的内容设置为 \"function\":","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c5/c5c9bcf6b3bce79b56ef353592ecbef3.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文件头已经很好的解释了每一列的含义。前两项是被追踪的任务名称和 PID,大括号内是执行跟踪的CPU。TIMESTAMP 是启动后的时间,后面是被追踪的函数,它的调用者在  set_ftrace_pid","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo function > current_tracer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果设置了 function-fork 选项,那么当一个 PID 被列在 set_ftrace_pid 这个文件中时,其子任务的 PID 将被自动添加到这个文件中,并且子任务也将被 tracer 追踪。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo function-fork > trace_options","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"取消function-fork 选项:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4d/4dd121a8421752fb33a0a4e42dd32d16.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"取消 set_ftrace_pid 的设置:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo > set_ftrace_pid","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Ftrace function_graph","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文章开始例子已经展示过,function_graph 可以打印出函数的调用图,揭示代码的流程。function_graph 不仅跟踪函数的输入,而且跟踪函数的返回,这使得 tracer 能够知道被调用的函数的深度。function_graph 可以让人更容易跟踪内核的执行流程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们再看一个例子:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7d/7d321b60f468a219f00f35e89c7a2e9a.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面提到过,函数耗时大于 10 μs,前面会有 + 号提醒用户注意,其他的符号还有:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"$","attrs":{}},{"type":"text","text":" :延迟大于1秒","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"@ ","attrs":{}},{"type":"text","text":":延迟大于 100 ms","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"* ","attrs":{}},{"type":"text","text":":延迟大于 10 ms","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"# ","attrs":{}},{"type":"text","text":":延迟大于 1 ms","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"! ","attrs":{}},{"type":"text","text":":延迟大于 100 μs","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"+ ","attrs":{}},{"type":"text","text":":延迟大于 10 μs","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"函数Profiler","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"函数Profiler提供了内核函数调用的统计数据,可以观察哪些内核函数正在被使用,并能发现哪些函数的执行耗时最长。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d7/d7c7248e57b84966928f7d031b06a29f.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这里有一个要注意的地方,确保使用的是 0 >,而不是 0>。这两者的含义不一样,0>是对文件描述符 0 的重定向。同样要避免使用 1>,因为这是对文件描述符 1 的重定向。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"现在可以从 trace_stat 目录中读取 profile 的统计数据。在这个目录中,profile 数据按照 CPU 保存在名为 function[n] 文件中。我使用的4核CPU,看一下profile 结果:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/94/94cd612cb7bb5272b5ea72ac6f784867.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一行是每一列的名称,分别是函数名称(Function),调用次数(Hit),函数的总时间(Time)、平均函数时间(Avg)和标准差(s^2)。输出结果显示,tcp_sendmsg() 在3个 CPU 上都是最频繁的,tcp_v4_rcv() 在 CPU2 上被调用了1618次,平均延迟为 17.218 us。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最后要注意一点,在使用 Ftrace Profiler 时,尽量通过 set_ftrace_filter 限制 profile 的范围,避免对所有的内核函数都进行 profile。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"追踪点 Tracepoints","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tracepoints是内核的静态埋点。内核维护者在他认为重要的位置放置静态 tracepoints 记录上下文信息,方便后续排查问题。例如系统调用的开始和结束,中断被触发,网络数据包发送等等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Linux的早期,内核维护者就一直想在内核中加入静态 tracepoints,尝试过各种策略。Ftrace 创造了Event Tracing 基础设施,让开发者使用 TRACE_EVENT() 宏添加内核 tracepoints,不用创建自定义内核模块,使用 Event Tracing 基础设施来注册埋点函数。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"现在内核中的Tracepoints都使用了 TRACE_EVENT() 宏来定义,tracepoints 记录的上下文信息作为 Trace events 进入 Event Tracing 基础设施,这样我们就可以复用 Ftrace 的 tracefs ,通过文件接口来配置 tracepoint events,并使用 trace 或 trace_pipe 文件查看事件输出。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所有的 tracepoint events 的控制文件都在 events 目录下,按照类别以子目录形式组织:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/02/02ae376f6f14df89f4169fae9430a6b1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们以 events/sched/sched_process_fork 事件为例,该事件是在 include/trace/events/sched.h 中由 TRACE_EVENT 宏所定义:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/07de2cd771999c5c2b3a37e33b59b8c4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TRACE_EVENT 宏会根据事件名称 sched_process_fork 生成 tracepoint 方法 trace_sched_process_fork()。你会在 kernel/fork.c 的 _do_fork() 中看到调用这个 tracepoint 方法。_do_fork() 是进程 fork 的主流程,在这里放置 tracepoint 是一个合适的位置,trace_sched_process_fork(current, p) 记录当前进程和 fork 出的子进程信息:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ea/ea9ff1c30eb462760e3b91aba725ee9b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 events/sched/sched_process_fork 目录下,有这个事件的控制文件:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7c/7c5e2bac412d270b490fa6322cccab2b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们演示如何通过 enable 文件开启和关闭这个 tracepoint 事件:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/57/5706309d333c917117fa2f5338448337.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前五列分别是进程名称,PID,CPU ID,irqs-off 等标志位,timestamp 和 tracepoint 事件名称。其余部分是 tracepoint 格式字符串,包含当前这个 tracepoint 记录的重要信息。格式字符串可以在 events/sched/sched_process_fork/format 文件中查看:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1b/1bf9278cbbb26263a6546e1a535105fd.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过这个 format 文件,我们可以了解这个 tracepoint 事件每个字段的含义。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们再演示一个使用 trigger 控制文件的例子:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1b/1bf9278cbbb26263a6546e1a535105fd.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这个例子使用了 hist triggers,通过 sched_process_fork 事件来统计 _do_fork 的次数,并按照进程ID生成直方图。输出显示了 PID 24493 在追踪期间 fork 了24个子进程,最后几行显示了统计数据。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"关于 Hist Triggers 的详细介绍可以参考文档 Event Histograms。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我的系统内核版本是 5.8.0-59-generic,当前可用的 tracepoints events 有2547个:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d40b3d866d53bbc8c3066dd6d1702541.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Event Tracing 基础设施应该是 Ftrace 的另一大贡献,它提供的 TRACE_EVENT 宏统一了内核 tracepoint 的实现方式,为 tracepoint events 提供了基础支持。perf 的 tracepoint events 也是基于 Ftrace 实现的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"利用 Tracepoints 理解内核代码","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由于 tracepoints 是内核维护者在流程重要位置设置的埋点,因此我们可以从 tracepoints 入手来学习内核代码。所有的 tracepoints 都定义在 include/trace/events/ 目录下的头文件中,例如进程调度相关的 tracepoints 定义在 include/trace/events/sched.h中,我们以 sched_switch 为例:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f68a4ba1b30c1ff370f439a534f14827.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"TRACE_EVENT 宏会根据事件名称 sched_switch 生成 tracepoint 方法 trace_sched_switch(),在源码中查找该方法,发现在 kernel/sched/core.c 的 __schedule()中调用了trace_sched_switch() :","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6a/6a8ba7121acda7e47f07368867dbb7c6.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这样我们就找到了 scheduler 的主流程,可以从这里开始阅读进程调度的源码。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"写在最后","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ftrace 就包含在内核源码中 kernel/trace,理解了 Ftrace 内核不再是黑箱,你会有豁然开朗的感觉,内核源码忽然有条理了起来。让我们从 Ftrace 开始内核探索之旅吧。","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章