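System performance analysis tool perf (2 of 2): a brief look at how perf works

Original link: https://my.oschina.net/u/2475751/blog/1823736

Background

In earlier work I used perf to measure CPU performance metrics such as CPI [1], cache misses, and memory bandwidth, and I also ported patches related to perf uncore [2]. That made me curious: roughly how does perf work? With that question in mind, this article shares what I have learned.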

perf-list

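The events listed by perf list fall into these categories: 1. hardware, such as cache-misses; 2. software, such as context-switches; 3. cache, such as L1-dcache-loads; 4. tracepoint; 5. pmu. However, perf list only shows events that have symbolic names and leaves out many hardware-specific events. Those are called Raw Hardware Events and are described in man perf-list.

For example, the PMU is a set of hardware that monitors various aspects of CPU performance, covering core, offcore, and uncore events. Taking perf uncore alone, Intel processors provide a variety of performance monitoring units, such as the memory controller (IMC) and the power control unit (PCU); see the "Intel® Xeon® Processor E5 and E7 v4 Product Families Uncore Performance Monitoring Reference Manual" [3] for details. These uncore PMU devices are registered in MSR space or PCICFG space [4] and can be seen with the following command (devices of the same class omitted):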

$ls /sys/devices/ | grep uncore
uncore_cbox_0
uncore_ha_0
uncore_imc_0
uncore_pcu
uncore_qpi_0
uncore_r2pcie
uncore_r3qpi_0
uncore_ubox

However, perf list shows only the IMC-related events:

$perf list|grep uncore
  uncore_imc_0/cas_count_read/                       [Kernel PMU event]
  uncore_imc_0/cas_count_write/                      [Kernel PMU event]
  uncore_imc_0/clockticks/                           [Kernel PMU event]
 ...                    
  uncore_imc_3/cas_count_read/                       [Kernel PMU event]
  uncore_imc_3/cas_count_write/                      [Kernel PMU event]
  uncore_imc_3/clockticks/                           [Kernel PMU event]

Why does perf list not show the other uncore events? Reading the code, perf list reads the events supported by each uncore device through sysfs; see linux/tools/perf/util/pmu.c:pmu_aliases():

/*
 * Reading the pmu event aliases definition, which should be located at:
 * /sys/bus/event_source/devices/<dev>/events as sysfs group attributes.
 */
 static int pmu_aliases(const char *name, struct list_head *head)
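
As a rough illustration of where that lookup goes, the sketch below builds the sysfs directory named in the comment above and lists what it contains. The helper names pmu_events_path() and list_pmu_events() are made up for this example and are not part of perf; the real pmu_aliases() also parses the contents of each file.

```c
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper: format the sysfs events directory of one PMU.
 * Returns 0 on success, -1 if the buffer is too small. */
int pmu_events_path(char *buf, size_t len, const char *pmu)
{
        int n = snprintf(buf, len,
                         "/sys/bus/event_source/devices/%s/events", pmu);

        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: print every file in the PMU's events directory
 * (event aliases plus companions such as .scale and .unit).
 * Returns the number of files, or -1 if the directory cannot be opened,
 * for example when the driver registers no events group. */
int list_pmu_events(const char *pmu)
{
        char path[256];
        struct dirent *de;
        DIR *dir;
        int count = 0;

        if (pmu_events_path(path, sizeof(path), pmu) < 0)
                return -1;
        dir = opendir(path);
        if (!dir)
                return -1;
        while ((de = readdir(dir)) != NULL) {
                if (de->d_name[0] == '.')
                        continue;       /* skip "." and ".." */
                printf("%s/%s/\n", pmu, de->d_name);
                count++;
        }
        closedir(dir);
        return count;
}
```

Calling list_pmu_events("uncore_imc_0") on a machine that has this PMU should print names in the same pmu/event/ form that perf list uses; for a PMU without an events directory it returns -1.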

Looking at the perf uncore driver code, only the iMC uncore device registers events attributes; see arch/x86/events/intel/uncore_snbep.c:hswep_uncore_imc_events:

static struct uncore_event_desc hswep_uncore_imc_events[] = {
        INTEL_UNCORE_EVENT_DESC(clockticks,      "event=0x00,umask=0x00"),
        INTEL_UNCORE_EVENT_DESC(cas_count_read,  "event=0x04,umask=0x03"),
        INTEL_UNCORE_EVENT_DESC(cas_count_read.scale, "6.103515625e-5"),
        INTEL_UNCORE_EVENT_DESC(cas_count_read.unit, "MiB"),
        INTEL_UNCORE_EVENT_DESC(cas_count_write, "event=0x04,umask=0x0c"),
        INTEL_UNCORE_EVENT_DESC(cas_count_write.scale, "6.103515625e-5"),
        INTEL_UNCORE_EVENT_DESC(cas_count_write.unit, "MiB"),
        { /* end: all zeroes */ },
};
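
The .scale string in the table above looks arbitrary, but it is exactly 64 / 2^20. My reading, which the driver source quoted here does not spell out, is that each CAS command moves one 64-byte cache line, so the scale converts a CAS count into MiB. A minimal sketch under that assumption (the function name is invented):

```c
#include <stdint.h>

/* Assumption: one CAS command transfers one 64-byte cache line. */
#define CAS_BYTES       64.0
#define BYTES_PER_MIB   (1024.0 * 1024.0)

/* Convert a raw cas_count_read or cas_count_write value to MiB.
 * CAS_BYTES / BYTES_PER_MIB is 2^-14 = 6.103515625e-5, the .scale
 * value in the table above. */
double cas_count_to_mib(uint64_t cas_count)
{
        return (double)cas_count * (CAS_BYTES / BYTES_PER_MIB);
}
```

With this reading, 16384 CAS commands correspond to 1 MiB of traffic, and dividing by the measurement interval gives bandwidth.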

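In practice, of all the uncore devices, the one system engineers probably use most is the memory bandwidth monitoring provided by the iMC. Other, less commonly used uncore PMU events can be specified as Raw Hardware Events by looking them up in the Intel uncore manual [5].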
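
To make the raw form more concrete, here is a sketch of how an "event=...,umask=..." descriptor, like the strings in hswep_uncore_imc_events, can be folded into a single config value. It assumes the event code sits in bits 0-7 and the umask in bits 8-15; the layout a given PMU really uses is published under /sys/bus/event_source/devices/<dev>/format/, so treat the bit positions as an assumption. The function is hypothetical and not a perf API.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not a perf API. Parses "event=0x..,umask=0x.."
 * and packs it assuming event -> bits 0-7 and umask -> bits 8-15.
 * Returns 0 on success, -1 if the string does not match. */
int uncore_desc_to_config(const char *desc, uint64_t *config)
{
        unsigned int event, umask;

        if (sscanf(desc, "event=%x,umask=%x", &event, &umask) != 2)
                return -1;
        if (event > 0xff || umask > 0xff)
                return -1;
        *config = (uint64_t)event | ((uint64_t)umask << 8);
        return 0;
}
```

For the cas_count_read descriptor "event=0x04,umask=0x03" this yields 0x304.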

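While using it I found a bug in perf list: the iMC channel numbering is incorrect. I sent a patch that was reviewed by an Intel engineer but has not yet been merged upstream; see "perf/x86/intel/uncore: allocate pmu index for pci device dynamically" [6]. The problem is quite obvious, and at first I did not believe upstream or Intel would let such an obvious problem stand. The issue itself is minor, but the lesson I took from fixing it is that perf may hide problems of this kind, so it pays to stay alert during testing and, where possible, roughly cross-check the results with another measurement method.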

perf-stat

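perf-stat is the most commonly used command. In the words of the man page, its job is to "Run a command and gather performance counter statistics". The perf-record command can be seen as a wrapper around perf-stat: the core code path is the same, with periodic sampling added and the results written to a file in a format that perf-report can parse. So I was curious how perf-stat works.

perf consists of two parts, the user-space perf tool and the kernel-space perf driver, plus the system call that connects them, sys_perf_event_open.

The simplest perf stat example

The perf tool is maintained in the kernel tree, so building and debugging it is very convenient: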

$cd linux/tools/perf
$make
...
$./perf stat ls
...
 Performance counter stats for 'ls':

          1.011337      task-clock:u (msec)       #    0.769 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               105      page-faults:u             #    0.104 M/sec
         1,105,427      cycles:u                  #    1.093 GHz
         1,406,263      instructions:u            #    1.27  insn per cycle
           282,440      branches:u                #  279.274 M/sec
             9,686      branch-misses:u           #    3.43% of all branches

       0.001314310 seconds time elapsed

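This is a very simple perf-stat invocation: it runs ls and, since no event is specified, prints several default performance metrics. Below we use this command as the example for analyzing how perf-stat works.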
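
The right-hand column of the output above appears to be plain ratios of the raw counts, and it is easy to check by hand. The sketch below recomputes a few of them; the function names are mine, and this is a cross-check of the arithmetic, not a copy of how perf itself computes the column.

```c
/* Ratio helper; returns 0 when the denominator is 0. */
double ratio(double num, double den)
{
        return den != 0.0 ? num / den : 0.0;
}

/* "insn per cycle": instructions retired per CPU cycle. */
double insn_per_cycle(double instructions, double cycles)
{
        return ratio(instructions, cycles);
}

/* CPI, the reciprocal metric mentioned in the Background section. */
double cycles_per_insn(double cycles, double instructions)
{
        return ratio(cycles, instructions);
}

/* branch-misses as a percentage of all branches. */
double branch_miss_pct(double misses, double branches)
{
        return 100.0 * ratio(misses, branches);
}

/* "CPUs utilized": task-clock over wall-clock time, both in ms. */
double cpus_utilized(double task_clock_ms, double elapsed_ms)
{
        return ratio(task_clock_ms, elapsed_ms);
}
```

Plugging in the numbers from the ls run, insn_per_cycle(1406263, 1105427) is about 1.272, branch_miss_pct(9686, 282440) is about 3.43, and cpus_utilized(1.011337, 1.314310) is about 0.769, which agrees with the printed values. The corresponding CPI is about 0.786.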

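User-space workflow

If perf-stat is not given any event with -e, the function add_default_attributes() adds 8 events by default. The event is the core object of the perf tool, and every command works around events. perf-stat can take several events at once; they are organized by one central global variable, struct perf_evlist *evsel_list. Only a few of its most important members are listed here: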

struct perf_evlist {
        struct list_head entries;
        bool             enabled;
        struct {
                int     cork_fd;
                pid_t   pid;
        } workload;
        struct fdarray   pollfd;
        struct thread_map *threads;
        struct cpu_map    *cpus;
        struct events_stats stats;
        ...
};

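entries: the list of all events, that is, struct perf_evsel objects;
pid: the pid of the process that runs cmd, here the process running ls;
pollfd: holds the fds returned by sys_perf_event_open();
threads: perf-stat can take several threads with -t and counts only while those threads are running;
cpus: perf-stat can take several CPUs with -C and counts only while the program runs on those CPUs;
stats: the counting statistics; after perf-stat reads the counter values it still has to convert or aggregate them.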

perf_evlist::entries is a linked list of events. Each linked object is one event, represented by struct perf_evsel, whose most important members are:

struct perf_evsel {
        char                    *name;
        struct perf_event_attr  attr;
        struct perf_counts      *counts;
        struct xyarray          *fd;
        struct cpu_map          *cpus;
        struct thread_map       *threads;
        ...
};

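name: the name of the event;
attr: the attributes of the event, a very important argument passed to the perf system call;
cpus, threads, fd: perf-stat can restrict which tasks or which CPUs an event is counted for. This is in effect a two-dimensional table represented by struct xyarray: the final count is split into cpus*threads smaller counters, sys_perf_event_open() asks the perf driver to create one sub-counter for each component, and each call returns one fd;
counts: perf_counts::values holds each component count, and perf_counts::aggr holds the aggregate of all components.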

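perf's performance counters are in essence special hardware registers. perf abstracts this hardware capability and provides per-CPU and per-thread "virtual" 64-bit counters for events. When perf-stat is given no thread or CPU, the two-dimensional table collapses to a single point: one event corresponds to one counter and one fd.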
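
The cpus x threads table can be pictured with a small toy model. The sketch below stores one value per (cpu, thread) pair in a flat row-major array and sums them, which is the role the text gives to perf_counts::values and perf_counts::aggr. The type and function names are invented for this example and are not perf's xyarray API.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the cpus x threads table of sub-counters. */
struct count_grid {
        size_t          ncpus;
        size_t          nthreads;
        const uint64_t  *values;  /* ncpus * nthreads entries, row-major */
};

/* Value of the sub-counter at (cpu index, thread index). */
uint64_t grid_value(const struct count_grid *g, size_t cpu, size_t thread)
{
        return g->values[cpu * g->nthreads + thread];
}

/* Sum of all sub-counters: the aggregate reported for the event. */
uint64_t grid_aggregate(const struct count_grid *g)
{
        uint64_t sum = 0;
        size_t cpu, thread;

        for (cpu = 0; cpu < g->ncpus; cpu++)
                for (thread = 0; thread < g->nthreads; thread++)
                        sum += grid_value(g, cpu, thread);
        return sum;
}
```

With ncpus = 1 and nthreads = 1 the grid degenerates to the single counter described above, and the aggregate equals that one value.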

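With the core data structures introduced, we can finally look at the perf-stat workflow. The main logic lives in __run_perf_stat() and goes roughly like this: a. fork a child process that will run cmd, the ls command in our example; b. for every event, create a counter through the sys_perf_event_open() system call; c. send a message to the child through a pipe so that it execs the command (ls in the example), and enable the counters immediately; d. when the program finishes, disable the counters and read them. The user-space workflow is roughly: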

__run_perf_stat()
  perf_evlist__prepare_workload()
  create_perf_stat_counter()
     sys_perf_event_open()
  enable_counters()
     perf_evsel__run_ioctl(evsel, ncpus, nthreads, PERF_EVENT_IOC_ENABLE)
        ioctl(fd, ioc, arg)
  wait()
  disable_counters()
     perf_evsel__run_ioctl(evsel, ncpus, nthreads, PERF_EVENT_IOC_DISABLE)
  read_counters()
    perf_evsel__read(evsel, cpu, thread, count)
      readn(fd, count, size)
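
Steps a and c, forking first and only letting the child exec after a message arrives on a pipe, can be sketched in isolation. The function below is a simplified illustration with an invented name; perf's own version is perf_evlist__prepare_workload(), which does more (for example, the cork_fd bookkeeping seen in struct perf_evlist).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that blocks on a pipe until the parent "uncorks" it,
 * then execs argv. Returns the child's exit status, or -1 on error. */
int run_corked_workload(char *const argv[])
{
        int cork[2];
        int status;
        char go = 0;
        pid_t pid;

        if (pipe(cork) < 0)
                return -1;

        pid = fork();
        if (pid < 0) {
                close(cork[0]);
                close(cork[1]);
                return -1;
        }
        if (pid == 0) {
                /* Child: wait for the parent's go-ahead before exec. */
                close(cork[1]);
                if (read(cork[0], &go, 1) != 1)
                        _exit(126);
                close(cork[0]);
                execvp(argv[0], argv);
                _exit(127);     /* exec failed */
        }

        /* Parent: this is the point where counters for 'pid' would be
         * created, before the workload has started running. */
        close(cork[0]);
        if (write(cork[1], &go, 1) != 1) {
                close(cork[1]);         /* child sees EOF and exits */
                waitpid(pid, NULL, 0);
                return -1;
        }
        close(cork[1]);

        if (waitpid(pid, &status, 0) < 0)
                return -1;
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The point of the cork is ordering: the pid exists before the command runs, so counters can be attached to it without missing the start of the workload.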

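The user-space workflow is fairly clear: in the end the counters are controlled through ioctl() and read through read(). All of this convenience is made possible by the perf system call sys_perf_event_open(), so let us see what that system call does.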

The perf system call

The perf system call opens an fd for a virtual counter, and perf-stat then sends requests to the perf kernel driver through this fd. The system call is defined as follows (linux/kernel/events/core.c):

/**
 * sys_perf_event_open - open a performance event, associate it to a task/cpu
 *
 * @attr_uptr:  event_id type attributes for monitoring/sampling
 * @pid:                target pid
 * @cpu:                target cpu
 * @group_fd:           group leader event fd
 */
SYSCALL_DEFINE5(perf_event_open,
                struct perf_event_attr __user *, attr_uptr,
                pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)

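Note that struct perf_event_attr is a structure that carries a lot of information; the kernel has a document describing it in detail [7]. The man page explains how to use the other parameters and ends with a user-space programming example; see man perf_event_open.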
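
In the spirit of that man page example, a minimal counting-mode sketch looks roughly like this. The helper names init_counting_attr() and count_instructions() are mine; the choice to exclude kernel and hypervisor events is an assumption made so that the example has a chance of working without elevated privileges. Whether the open succeeds depends on the machine (PMU availability, perf_event_paranoid, container seccomp policy), so the function reports failure instead of assuming success.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc has no wrapper for this system call, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Fill attr for a counter that starts disabled and counts user space only. */
void init_counting_attr(struct perf_event_attr *attr,
                        uint32_t type, uint64_t config)
{
        memset(attr, 0, sizeof(*attr));
        attr->size = sizeof(*attr);
        attr->type = type;
        attr->config = config;
        attr->disabled = 1;
        attr->exclude_kernel = 1;
        attr->exclude_hv = 1;
}

/* Count user-space instructions retired while work() runs in the calling
 * thread. Returns 0 and stores the count, or -1 if the counter cannot be
 * opened or read. */
int count_instructions(void (*work)(void), uint64_t *count)
{
        struct perf_event_attr attr;
        int fd, ret = -1;

        init_counting_attr(&attr, PERF_TYPE_HARDWARE,
                           PERF_COUNT_HW_INSTRUCTIONS);
        /* pid = 0, cpu = -1: the calling thread, on whichever CPU it runs. */
        fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0)
                return -1;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        work();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count))
                ret = 0;
        close(fd);
        return ret;
}
```

This mirrors the perf-stat flow shown earlier on a small scale: open a counter, enable it with ioctl(), run the workload, disable it, and read() the 64-bit value.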

sys_perf_event_open() mainly does the following:

a. Based on struct perf_event_attr, create and initialize a struct perf_event, which contains several important members:

/**
 * struct perf_event - performance event kernel representation:
 */
struct perf_event {
	struct pmu                      *pmu; // hardware PMU abstraction
	local64_t                       count; // 64-bit virtual counter
	u64                             total_time_enabled;
	u64                             total_time_running;
	struct perf_event_context       *ctx; // task-related context
...
}

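b. Find or create a struct perf_event_context for this event. Context and event have a 1:N relationship. A context is associated with a process's task_struct, and perf_event_context::event_list holds all events that are interested in that process. Its important members include: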

struct perf_event_context {
        struct pmu                      *pmu;
        struct list_head                event_list;
        struct task_struct              *task;
        ...
}

c. Associate the event with a context; see perf_install_in_context();

d. Finally, bind the fd to perf_fops:

static const struct file_operations perf_fops = {
        .llseek                 = no_llseek,
        .release                = perf_release,
        .read                   = perf_read,
        .poll                   = perf_poll,
        .unlocked_ioctl         = perf_ioctl,
        .compat_ioctl           = perf_compat_ioctl,
        .mmap                   = perf_mmap,
        .fasync                 = perf_fasync,
};

The call chain of the perf system call is roughly:

sys_perf_event_open()
	get_unused_fd_flags()
	perf_event_alloc()
	find_get_context()
		alloc_perf_context()
	anon_inode_getfile()
	perf_install_in_context()
		add_event_to_ctx()
	fd_install(event_fd, event_file)

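Kernel-space workflow

A perf event works in one of two modes: counting and sampling.

Counting mode sums the occurrences of an event over all specified CPUs and processes; the value is obtained with read().

Sampling mode periodically writes the counting results into a ring buffer created by mmap(). The simple perf-stat example at the beginning uses counting mode.

Next, we mainly want to answer these questions:

How are counters enabled and disabled?
When does counting take place?
How are the results read?

The entry points for answering these questions are almost all in the file operations that perf implements: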

static const struct file_operations perf_fops = {
        .read                   = perf_read,
        .unlocked_ioctl         = perf_ioctl,
...

First, let us look at how a counter is enabled. The main steps are:

perf_ioctl()
	__perf_event_enable()
		ctx_sched_out() IF ctx->is_active
		ctx_resched()
			perf_pmu_disable()
			task_ctx_sched_out()
			cpu_ctx_sched_out()
			perf_event_sched_in()
				event_sched_in()
					event->pmu->add(event, PERF_EF_START)
			perf_pmu_enable()
				pmu->pmu_enable(pmu)

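This path involves a lot of scheduling-related handling, which makes the logic look complicated; we will not go into the scheduling details for now. Hardware PMU resources are limited: when there are more events than available PMCs, several virtual counters share (multiplex) the hardware PMCs. So the PMU first adds the event to its active list (pmu->add(event, PERF_EF_START)) and finally enables counting (pmu->pmu_enable(pmu)). The PMU is CPU-architecture specific and presumably has its own logic for assigning a concrete hardware PMC to an event, which we will not dig into here.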
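
Multiplexing is also why struct perf_event carries total_time_enabled and total_time_running. When an event held a hardware counter for only part of the time it was enabled, a common way to use the two values is to extrapolate the raw count linearly. The sketch below shows that estimate; the function name is invented, and the linear extrapolation is an approximation that assumes the event rate was about the same while the event was not on a counter.

```c
#include <stdint.h>

/* Estimate the count over the whole enabled interval for an event that
 * only held a hardware counter during part of it.
 * Returns 0 if the event never ran. */
uint64_t scale_multiplexed_count(uint64_t raw, uint64_t time_enabled,
                                 uint64_t time_running)
{
        if (time_running == 0)
                return 0;
        if (time_running >= time_enabled)
                return raw;     /* never descheduled: no scaling needed */
        return (uint64_t)((double)raw * (double)time_enabled /
                          (double)time_running);
}
```

For example, a raw count of 1000 collected while the event was running for half of its enabled time is reported as an estimated 2000.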

Next, how the counter result is obtained. The rough call chain is:

perf_read()
	perf_read_one()
		perf_event_read_value()
			__perf_event_read()
				pmu->start_txn(pmu, PERF_PMU_TXN_READ)
				pmu->read(event)
				pmu->commit_txn(pmu)

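The PMU ultimately obtains the counter value through rdpmcl(counter, val) and stores it in perf_event::count. For a description of the PMU operations, see include/linux/perf_event.h:struct pmu{}. Their implementation is architecture specific; on x86, read() is implemented by arch/x86/events/core.c:x86_pmu_read().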
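
One detail worth spelling out: a hardware counter is usually narrower than the 64-bit virtual counter, so the value read from hardware cannot simply be copied; the difference since the previous reading has to be accumulated, and that difference must survive a wrap-around of the narrow counter. The sketch below shows one way to compute such a delta. The 48-bit width used in the example is an assumption, and the function is a simplified illustration, not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Events counted between two raw readings of a hardware counter that is
 * only 'width' bits wide (1..64), allowing for one wrap-around. */
uint64_t counter_delta(uint64_t prev_raw, uint64_t new_raw, unsigned width)
{
        unsigned shift;
        uint64_t delta;

        assert(width >= 1 && width <= 64);
        shift = 64 - width;

        /* Shifting both readings to the top of a 64-bit word makes the
         * subtraction wrap at 2^width instead of 2^64. */
        delta = (new_raw << shift) - (prev_raw << shift);
        return delta >> shift;
}
```

With a 48-bit counter, going from 0xFFFFFFFFFFFF to 0x5 gives a delta of 6 rather than a huge bogus value, and that delta is what would be added to the 64-bit count.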

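An event can carry restrictions so that it is counted only while the specified process is running on the specified CPU; this is the question of counting time raised above. As one would expect, that time is determined at context switches. When the target process is switched off the target CPU, the PMU stops counting and the value of the hardware register is saved in a variable in memory; the reverse happens when the process is switched back in. This resembles the way the hardware context is saved at a context switch. These points can be seen in kernel/sched/core.c.

Before the context switch: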

prepare_task_switch()
	perf_event_task_sched_out()
		__perf_event_task_sched_out() // stop each event and update the event value in event->count
			perf_pmu_sched_task()
				pmu->sched_task(cpuctx->task_ctx, sched_in)

After the context switch:

finish_task_switch()
	perf_event_task_sched_in()
		perf_event_context_sched_in()
			perf_event_sched_in()
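
The save-and-restore idea behind these two hooks can be modeled in a few lines. In the toy below, a per-task virtual counter remembers the hardware reading at sched-in and folds the difference into its own total at sched-out, so events that occur while other tasks run are never attributed to it. All names are invented for the sketch, and it ignores counter width and multiplexing.

```c
#include <stdint.h>

/* Toy model of one per-task virtual counter. */
struct vcounter {
        uint64_t count;         /* accumulated value, like perf_event::count */
        uint64_t hw_start;      /* hardware reading at the last sched-in */
        int      running;       /* 1 while the task is on the CPU */
};

/* Task switched in: remember where the hardware counter stands. */
void vcounter_sched_in(struct vcounter *vc, uint64_t hw_now)
{
        vc->hw_start = hw_now;
        vc->running = 1;
}

/* Task switched out: fold the events seen since sched-in into count. */
void vcounter_sched_out(struct vcounter *vc, uint64_t hw_now)
{
        if (!vc->running)
                return;
        vc->count += hw_now - vc->hw_start;
        vc->running = 0;
}
```

If the task runs while the hardware counter goes from 100 to 150 and later from 400 to 420, the virtual counter ends at 70; the 250 events in between belong to whatever else was running.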

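Summary

Analyzing the two basic commands perf-list and perf-stat raised some interesting questions, and in trying to answer them I have summarized most of what I currently understand about perf. That said, this article only gives a rough outline of how perf works and does not analyze hardware-specific code such as the PMU layer or perf uncore; I hope to fill that in later.

Finally, anyone who has read this far clearly wants to understand performance testing in more depth, so as a bonus here is a book recommendation: "Systems Performance: Enterprise and the Cloud". Its author is an engineer with many years of hands-on performance optimization work, and you have probably heard of the flame graph tool he wrote. See also his page "perf Examples".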

Cheers!

References

[1] Cycles per instruction: https://en.wikipedia.org/wiki/Cycles_per_instruction
[2] Uncore: https://en.wikipedia.org/wiki/Uncore
[3] "Intel® Xeon® Processor E5 and E7 v4 Product Families Uncore Performance Monitoring Reference Manual"
[4] "Linux Device Drivers", chapter 2, PCI drivers
[5] Intel uncore manual; see [3]
[6] https://patchwork.kernel.org/patch/10412883/
[7] linux/tools/perf/design.txt
