Starting a Kernel Exploration Journey with Ftrace

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Preface","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To application developers, the operating system kernel is like a black box whose inner workings seem hard to observe. Yet the Linux kernel has long shipped with a powerful built-in tracing tool: ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Ftrace","attrs":{}},{"type":"text","text":". It can trace almost every kernel function, and it is useful not only for debugging and analysis but also for watching and learning how the Linux kernel runs internally. Although Ftrace was merged into the kernel in 2008, many application developers still do not know it exists. This article introduces the basics of using Ftrace.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"A First Taste of Ftrace","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's start with an example that shows how simple yet powerful Ftrace is. As root, change to the /sys/kernel/debug/tracing directory and run the following echo and cat commands:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4b/4b8a5898a65576a3818a00d142b615f6.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here we used Ftrace's function_graph feature to display every child function called by the kernel function _do_fork(). The first column on the left is the CPU that executed the function, and the second column, DURATION, shows the time spent in each function. Note the + sign in front of the duration on the last line: it draws attention to high-latency functions. A + means the duration exceeded 10 μs. If the duration exceeds 100 μs, a ! 
sign is shown instead.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As we know, fork creates a complete copy of the parent process and then runs it as the child. So the first major job of _do_fork() is to call copy_process() to duplicate the parent's data structures, and the call chain in the output above confirms this.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When you are done, run the following commands to turn function_graph off:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo nop > current_tracer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo > set_graph_function","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ftrace's function_graph feature shows the chain of child functions called by a kernel function, which helps us follow complex code paths, and it is only one of Ftrace's features. We get all this without installing any extra user-space tools: echo and cat against a few special files are enough. The interface Ftrace exposes to users is the tracefs file system.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"The tracefs File System","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Users work with Ftrace through the tracefs file system, which fits the Linux philosophy that everything is a file. tracefs is normally mounted at /sys/kernel/tracing. Ftrace was originally part of the debugfs file system and was only later split out into its own tracefs, so if debugfs is already mounted, the original directory layout is kept and tracefs is also mounted under a debugfs subdirectory. To see where debugfs and tracefs are currently mounted, use the mount 
command:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/50/5003c3621d7b8de8011afd07c1241562.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"My system is Ubuntu 20.04.2 LTS. As you can see, for compatibility tracefs is mounted at both /sys/kernel/tracing and /sys/kernel/debug/tracing.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The files under tracefs fall into two main groups: control files and output files. Their names are self-explanatory. In the earlier example we set the tracer through current_tracer and then read the result from trace. Similarly, available_tracers lists the tracers available in the running kernel, and trace_options lets you customize the output.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/73/736d59447d1d3f8a50955fe022cc8104.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The remaining examples in this article assume you are already in /sys/kernel/tracing or /sys/kernel/debug/tracing.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Function Tracing","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ftrace actually stands for function trace, so function tracing was its original core feature.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ftrace 
can trace the details of almost every kernel function call. How does it do that? In short, the kernel is compiled with gcc's -pg option, which makes the compiler insert, at the entry of every kernel function, a call to a special assembly function named “mcount” or “__fentry__”. When tracing is turned on, mcount/fentry calls the currently configured tracer, and the tracer writes its data into the ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ring buffer.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a4/a4785a25b9f3a5772ab00192b0a1b022.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the figure above shows, the function hooks mechanism provided by Ftrace instruments the entry of kernel functions and calls a specific tracer according to the configuration, and the tracer writes data into the ring buffer. Ftrace implements a lock-free ring buffer, and all trace data is stored in it. Users read the function tracing output through the tracefs file system interface.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As you may have realized, adding tracing code to the entry of every kernel function would hurt kernel performance badly. Fortunately, Ftrace supports dynamic tracing. If the ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CONFIG_DYNAMIC_FTRACE","attrs":{}},{"type":"text","text":" option is enabled, all mcount/fentry call sites are collected and recorded when the kernel is built. During kernel initialization at boot, those call sites are replaced with ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NOP","attrs":{}},{"type":"text","text":" instructions, based on the list recorded at build time. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NOP","attrs":{}},{"type":"text","text":" means no-operation: it does nothing and simply moves on to the next instruction. As a result, Ftrace has virtually no impact on kernel performance while tracing is off. Only when tracing is turned on does Ftrace replace the ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NOP","attrs":{}},{"type":"text","text":" instructions with calls to mcount/fentry.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To enable function tracing, simply set the contents of the current_tracer file to 
\"function\":","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c5/c5c9bcf6b3bce79b56ef353592ecbef3.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The file header already explains each column well. The first two fields are the name and PID of the traced task, and the number in square brackets is the CPU on which the trace was taken. TIMESTAMP is the time since boot. It is followed by the traced function, with its caller shown after the <- symbol. To trace only specific processes, write their PIDs to set_ftrace_pid.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo function > current_tracer","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the function-fork option is set, then once a PID is listed in set_ftrace_pid, the PIDs of its child tasks are automatically added to the file and the children are traced as well.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo function-fork > trace_options","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To clear the function-fork option:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4d/4dd121a8421752fb33a0a4e42dd32d16.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To clear the set_ftrace_pid setting:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# echo > 
set_ftrace_pid","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Ftrace function_graph","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the example at the start of this article showed, function_graph prints a call graph of functions and reveals the flow of the code. function_graph traces not only function entry but also function return, which lets the tracer know the depth of each called function. This makes the kernel's execution flow much easier to follow.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at another example:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7d/7d321b60f468a219f00f35e89c7a2e9a.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned earlier, a function that takes longer than 10 μs is flagged with a + sign. The other markers are:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"$","attrs":{}},{"type":"text","text":" : latency greater than 1 second","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"@ ","attrs":{}},{"type":"text","text":": latency greater than 100 
ms","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"* ","attrs":{}},{"type":"text","text":": latency greater than 10 ms","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"# ","attrs":{}},{"type":"text","text":": latency greater than 1 ms","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"! ","attrs":{}},{"type":"text","text":": latency greater than 100 μs","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"+ ","attrs":{}},{"type":"text","text":": latency greater than 10 μs","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Function Profiler","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The function profiler provides statistics on kernel function calls. It shows which kernel functions are being used and reveals which ones take the longest to execute.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d7/d7c7248e57b84966928f7d031b06a29f.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One thing to watch out for: make sure you type 0 > (with a space) and not 0>. The two mean different things: 0> redirects file descriptor 0. Likewise, avoid 1>, which redirects file descriptor 1 
rather than writing the value 1.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The profile statistics can now be read from the trace_stat directory. There, the profile data is stored per CPU in files named function[n]. My machine has a 4-core CPU; here is the profile result:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/94/94cd612cb7bb5272b5ea72ac6f784867.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first line names the columns: function name (Function), call count (Hit), total time in the function (Time), average time per call (Avg), and sample variance (s^2). The output shows that tcp_sendmsg() was the most frequently called function on 3 of the CPUs, and that tcp_v4_rcv() was called 1618 times on CPU2 with an average latency of 17.218 us.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One last point: when using the Ftrace profiler, try to narrow the scope with set_ftrace_filter so that you do not profile every kernel function.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Tracepoints","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tracepoints are static instrumentation points in the kernel. Kernel maintainers place static tracepoints at locations they consider important to record context for later troubleshooting, for example the start and end of a system call, an interrupt firing, or a network packet being sent.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the early days of Linux, kernel maintainers wanted to add static tracepoints to the kernel and tried various approaches. Ftrace created the Event Tracing infrastructure, which lets developers use the TRACE_EVENT() macro to add kernel 
tracepoints and register their probe functions through the Event Tracing infrastructure, with no need to write a custom kernel module.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tracepoints in today's kernel are defined with the TRACE_EVENT() macro. The context a tracepoint records enters the Event Tracing infrastructure as a trace event, so we can reuse Ftrace's tracefs: tracepoint events are configured through the file interface, and their output is read from the trace or trace_pipe file.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The control files for all tracepoint events live under the events directory, organized into subdirectories by category:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/02/02ae376f6f14df89f4169fae9430a6b1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take the events/sched/sched_process_fork event as an example. It is defined with the TRACE_EVENT macro in include/trace/events/sched.h:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/07de2cd771999c5c2b3a37e33b59b8c4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the event name sched_process_fork, the TRACE_EVENT macro generates the tracepoint function trace_sched_process_fork(). You will find it called in _do_fork() in kernel/fork.c. _do_fork() is the main path of process fork, so it is a fitting place for a tracepoint. trace_sched_process_fork(current, p) records information about the current process and the newly forked 
child process:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ea/ea9ff1c30eb462760e3b91aba725ee9b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The events/sched/sched_process_fork directory contains the control files for this event:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7c/7c5e2bac412d270b490fa6322cccab2b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here is how to turn this tracepoint event on and off through the enable file:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/57/5706309d333c917117fa2f5338448337.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The leading columns are the process name, PID, CPU ID, flags such as irqs-off, the timestamp, and the tracepoint event name. The rest is the tracepoint's format string, which carries the key information this tracepoint records. The format string can be viewed in the events/sched/sched_process_fork/format file:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1b/1bf9278cbbb26263a6546e1a535105fd.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The format file tells us the meaning of each field of this tracepoint 
event.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, an example that uses the trigger control file:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1b/1bf9278cbbb26263a6546e1a535105fd.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This example uses hist triggers to count calls to _do_fork through the sched_process_fork event and builds a histogram keyed by process ID. The output shows that PID 24493 forked 24 child processes during the trace, and the last few lines show summary statistics.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For a detailed introduction to hist triggers, see the Event Histograms documentation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"My kernel version is 5.8.0-59-generic, and it currently offers 2547 tracepoint events:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d40b3d866d53bbc8c3066dd6d1702541.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Event Tracing infrastructure is arguably Ftrace's other major contribution. Its TRACE_EVENT macro unified the way kernel tracepoints are implemented and laid the groundwork for tracepoint events. perf's tracepoint events are built on Ftrace as well.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Using 
Tracepoints to Understand Kernel Code","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because tracepoints are placed by kernel maintainers at key points in the code flow, they make a good starting point for studying kernel code. Most tracepoints are defined in header files under include/trace/events/. For example, the tracepoints for process scheduling are defined in include/trace/events/sched.h. Take sched_switch as an example:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f68a4ba1b30c1ff370f439a534f14827.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the event name sched_switch, the TRACE_EVENT macro generates the tracepoint function trace_sched_switch(). Searching the source for it shows that it is called in __schedule() in kernel/sched/core.c:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6a/6a8ba7121acda7e47f07368867dbb7c6.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"That leads us to the scheduler's main path, and we can start reading the process scheduling source from there.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Closing Thoughts","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ftrace lives in the kernel source tree under kernel/trace. Once you understand Ftrace, the kernel is no longer a black box: things click into place and the kernel source suddenly looks organized. Let's start our kernel exploration journey with Ftrace.","attrs":{}}]}]}
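The function_graph workflow described in the article (set a function in set_graph_function, switch current_tracer, then restore both) can be wrapped in two small shell helpers. This is only a minimal sketch and not part of the original article: the helper names graph_on and graph_off and the TRACEFS variable are made up for illustration, and it assumes tracefs is mounted at /sys/kernel/tracing and that the helpers are called as root.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the article).
# TRACEFS defaults to the tracefs mount point mentioned in the article.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# graph_on FUNC: graph only FUNC and the functions it calls.
graph_on() {
    echo "$1" > "$TRACEFS/set_graph_function" &&
    echo function_graph > "$TRACEFS/current_tracer"
}

# graph_off: switch back to the nop tracer and clear the function filter,
# mirroring the two cleanup commands shown in the article.
graph_off() {
    echo nop > "$TRACEFS/current_tracer" &&
    echo > "$TRACEFS/set_graph_function"
}
```

As root, a session might look like `graph_on _do_fork`, then `cat "$TRACEFS/trace"`, then `graph_off`. If the kernel does not have a function named _do_fork, pick one of the names listed in the available_filter_functions file instead.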