System events:
The perf tool supports a rich set of measurable event types, and together with the underlying kernel interface it can measure events coming from several different sources.
Some events come from pure kernel counters; these are called software events. Examples include context-switches and minor-faults.
Another source is the processor itself and its Performance Monitoring Unit (PMU), which provides events for measuring micro-architectural behavior such as cycle counts, instructions retired, L1 cache misses, and so on.
These are called PMU hardware events, or simply hardware events, and the available set varies from one processor model to another. The perf_events interface also defines a set of generic hardware event names.
On each processor these generic names are mapped to the actual events the CPU provides; if no matching event exists, the name cannot be used.
These are referred to as hardware events and hardware cache events.
Finally, there are tracepoint events, implemented by the kernel ftrace framework; they are only available on 2.6.3x and newer kernels.
Tracepoints:
Tracepoints are hooks scattered through the kernel source code. Once enabled, they fire whenever the surrounding code path executes, and various trace/debug tools can make use of that; perf is one such user.
For example, to learn how the kernel memory-management subsystem behaves while an application runs, you can use the tracepoints embedded in the slab allocator: whenever the kernel reaches one of them, perf is notified.
perf records the events the tracepoints generate and produces a report; by analyzing it, a performance engineer can see the details of kernel behavior during the run and diagnose performance symptoms much more precisely.
Common perf tools:
perf list: list all event types
perf top: a system-wide monitoring tool similar to top
perf stat: collect performance counters for a command or for an existing process
perf record & perf report: record events first, then print them as a detailed report
- perf list
[root@localhost test]# perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
ref-cycles [Hardware event]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
minor-faults [Software event]
major-faults [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
L1-icache-prefetch-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-stores [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-prefetches [Hardware cache event]
dTLB-prefetch-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
branch-load-misses [Hardware cache event]
rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
sunrpc:rpc_call_status [Tracepoint event]
sunrpc:rpc_bind_status [Tracepoint event]
sunrpc:rpc_connect_status [Tracepoint event]
sunrpc:rpc_task_begin [Tracepoint event]
sunrpc:rpc_task_run_action [Tracepoint event]
sunrpc:rpc_task_complete [Tracepoint event]
sunrpc:rpc_task_sleep [Tracepoint event]
sunrpc:rpc_task_wakeup [Tracepoint event]
ext4:ext4_free_inode [Tracepoint event]
ext4:ext4_request_inode [Tracepoint event]
ext4:ext4_allocate_inode [Tracepoint event]
ext4:ext4_write_begin [Tracepoint event]
ext4:ext4_da_write_begin [Tracepoint event]
... (remaining output omitted)
The output of perf list differs from system to system; on a 2.6.35 kernel the list is already quite long. However many entries there are, they fall into three categories:
Hardware events are produced by the PMU, e.g. cache hits; sample these when you need to know how a program uses hardware features. Software events are produced by kernel software, e.g. process switches or tick counts.
Tracepoint events are triggered by static tracepoints in the kernel; they expose details of kernel behavior while a program runs, e.g. how many allocations the slab allocator made.
While the operating system runs, the scheduling priority of the different kinds of work, from highest to lowest, is:
hard interrupts -> soft interrupts -> real-time processes -> kernel processes -> user processes
Hard Interrupts – Devices tell the kernel that they are done processing. For example, a NIC delivers a packet or a hard drive completes an IO request.
Soft Interrupts – Kernel software interrupts used for kernel maintenance. For example, the kernel clock tick handler is a soft interrupt; it checks that a process has not exceeded its allotted time on a processor.
Real Time Processes – Real-time processes have higher priority than the kernel itself and may come on the CPU and preempt (or "kick off") the kernel, stealing time slices from all non-real-time processes. (The Linux 2.4 kernel is NOT fully preemptible, making it less than ideal for real-time application programming.)
Kernel (System) Processes – All kernel processing runs at this priority level.
User Processes – This space is often referred to as "userland"; all software applications run here, at the lowest priority in the kernel's scheduling mechanism.
- perf top
Samples: 475 of event 'cycles', Event count (approx.): 165419249
10.30% [kernel] [k] kallsyms_expand_symbol
7.75% perf [.] 0x000000000005f190
5.68% libc-2.12.so [.] memcpy
5.45% [kernel] [k] format_decode
5.45% libc-2.12.so [.] __strcmp_sse42
5.45% perf [.] symbols__insert
5.24% [kernel] [k] memcpy
4.86% [kernel] [k] vsnprintf
4.62% [kernel] [k] string
4.45% [kernel] [k] number
3.35% libc-2.12.so [.] __strstr_sse42
3.15% perf [.] hex2u64
2.72% perf [.] rb_next
2.10% [kernel] [k] pointer
1.89% [kernel] [k] security_real_capable_noaudit
1.89% [kernel] [k] strnlen
1.88% perf [.] rb_insert_color
1.68% libc-2.12.so [.] _int_malloc
1.68% libc-2.12.so [.] _IO_getdelim
1.28% [kernel] [k] update_iter
1.05% [kernel] [k] s_show
1.05% libc-2.12.so [.] memchr
1.05% [kernel] [k] get_task_cred
0.88% [kernel] [k] seq_read
0.85% [kernel] [k] clear_page_c
0.84% perf [.] symbol__new
0.63% libc-2.12.so [.] __libc_calloc
0.63% [kernel] [k] copy_user_generic_string
0.63% [kernel] [k] seq_vprintf
0.63% perf [.] kallsyms__parse
0.63% libc-2.12.so [.] _IO_feof
0.63% [kernel] [k] strcmp
0.63% [kernel] [k] page_fault
0.63% perf [.] dso__load_sym
0.42% perf [.] strstr@plt
0.42% libc-2.12.so [.] __strchr_sse42
0.42% [kernel] [k] seq_printf
0.42% libc-2.12.so [.] __memset_sse2
0.42% libelf-0.152.so [.] gelf_getsym
0.25% [kernel] [k] __mutex_init
0.21% [kernel] [k] _spin_lock
0.21% [kernel] [k] s_next
0.21% [kernel] [k] fsnotify
0.21% [kernel] [k] sys_read
0.21% [kernel] [k] __link_path_walk
0.21% [kernel] [k] intel_pmu_disable_all
0.21% [kernel] [k] mem_cgroup_charge_common
0.21% [kernel] [k] update_shares
0.21% [kernel] [k] native_read_tsc
0.21% [kernel] [k] apic_timer_interrupt
0.21% [kernel] [k] auditsys
0.21% [kernel] [k] do_softirq
0.21% [kernel] [k] update_process_times
0.21% [kernel] [k] native_sched_clock
0.21% [kernel] [k] mutex_lock
0.21% [kernel] [k] module_get_kallsym
perf top works like top: it gives a system-wide view of all events, sorted by count.
- perf stat
perf stat can run a command directly or attach to an existing process by pid; when it finishes, it prints the performance-counter statistics for the run.
[root@localhost test]# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 0.790059 s, 648 MB/s
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':
791.176286 task-clock # 1.000 CPUs utilized
1 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
247 page-faults # 0.312 K/sec
1,248,519,452 cycles # 1.578 GHz [83.29%]
366,166,452 stalled-cycles-frontend # 29.33% frontend cycles idle [83.33%]
155,120,002 stalled-cycles-backend # 12.42% backend cycles idle [66.62%]
1,947,919,494 instructions # 1.56 insns per cycle
# 0.19 stalled cycles per insn [83.31%]
355,465,524 branches # 449.287 M/sec [83.41%]
2,021,648 branch-misses # 0.57% of all branches [83.42%]
0.791116595 seconds time elapsed
To monitor an existing process, give its pid; press Ctrl-C to stop, and perf prints the statistics for the elapsed interval:
[root@localhost test]# perf stat -p 18669
^C
Performance counter stats for process id '18669':
1.520699 task-clock # 0.001 CPUs utilized
56 context-switches # 0.037 M/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
2,178,120 cycles # 1.432 GHz [63.18%]
1,410,393 stalled-cycles-frontend # 64.75% frontend cycles idle [90.94%]
942,665 stalled-cycles-backend # 43.28% backend cycles idle
1,067,824 instructions # 0.49 insns per cycle
# 1.32 stalled cycles per insn
193,104 branches # 126.984 M/sec
14,544 branch-misses # 7.53% of all branches [61.93%]
2.061889979 seconds time elapsed
[root@localhost test]# perf stat
usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
--filter <filter>
event filter
-i, --no-inherit child tasks do not inherit counters
-p, --pid <pid> stat events on existing process id
-t, --tid <tid> stat events on existing thread id
-a, --all-cpus system-wide collection from all CPUs
-g, --group put the counters into a counter group
-c, --scale scale/normalize counters
-v, --verbose be more verbose (show counter open errors, etc)
-r, --repeat <n> repeat command and print average + stddev (max: 100)
-n, --null null run - dont start any counters
-d, --detailed detailed run - start a lot of events
-S, --sync call sync() before starting a run
-B, --big-num print large numbers with thousands' separators
-C, --cpu <cpu> list of cpus to monitor in system-wide
-A, --no-aggr disable CPU count aggregation
-x, --field-separator <separator>
print counts with custom separator
-G, --cgroup <name> monitor event in cgroup name only
-o, --output <file> output file name
--append append to the output file
--log-fd <n> log output to fd, instead of stderr
--pre <command> command to run prior to the measured command
--post <command> command to run after to the measured command
perf stat has other useful options. -e selects the events to count:
perf stat -e cycles,instructions,cache-misses [...]
You can also restrict the privilege level that is monitored:
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000
The modifiers mean:
Modifiers | Description | Example |
---|---|---|
u | monitor at priv level 3, 2, 1 (user) | event:u |
k | monitor at priv level 0 (kernel) | event:k |
h | monitor hypervisor events on a virtualization environment | event:h |
H | monitor host machine on a virtualization environment | event:H |
G | monitor guest machine on a virtualization environment | event:G |
Operating system resources generally fall into four areas: CPU, memory, IO, and network.
CPU and IO are the most common bottlenecks, so I wrote two small programs to simulate each case, then used the perf tools to track down the cause.
Program 1 computes primes and exists purely to burn CPU:
#include <stdio.h>
int IsPrime(int num)
{
    int i;
    for (i = 2; i <= num/2; i++)
        if (num % i == 0)
            return 0;
    return 1;
}

int main(void)
{
    int num;
    for (num = 2; num <= 10000000; num++)
        IsPrime(num);
    return 0;
}
Program 2 starts n threads; each thread loops, writing a 16 MB buffer to its own file and then reading it back, with a short sleep between iterations. It exists to stress system IO:
#include <unistd.h>
#include <stdio.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <pthread.h>
#include <vector>
using namespace std;
volatile bool gquit = false; // written by main, polled by worker threads
void* threadfun(void* param)
{
string filename = *(string*)param;
unlink(filename.c_str());
int fd = open(filename.c_str(), O_CREAT | O_RDWR, 0666); // octal mode, not decimal 666
if (fd < 0) return 0;
string a(1024*16*1024, 'a');
bool flag = true;
while (!gquit)
{
if (flag)
{
write(fd, a.c_str(), a.length());
}
else
{
read(fd, &a[0], a.length());
lseek(fd, 0, SEEK_SET);
}
flag = !flag;
usleep(1);
}
close(fd);
return 0;
}
int main(int argc, char* argv[])
{
if (argc != 2)
{
printf("need thread num\n");
return 0;
}
int threadnum = atoi(argv[1]);
vector<string> v(threadnum, "./testio.filex_");
vector<pthread_t> vpid(threadnum, 0);
for(int i = 0 ; i < threadnum ; ++ i)
{
pthread_attr_t attr;
pthread_attr_init(&attr);
static char buf[32];
snprintf(buf, sizeof(buf), "%d", i);
v[i] += buf;
pthread_create(&vpid[i], & attr, threadfun, (void*)&v[i]);
}
printf("press q to quit\n");
while(getchar() != 'q');
gquit = true;
for(int i = 0 ; i < threadnum ; ++ i)
{
void* ret = NULL;
pthread_join(vpid[i], &ret);
unlink(v[i].c_str());
}
return 0;
}
After starting program 1, I first monitor the whole system with perf top:
Samples: 1K of event 'cycles', Event count (approx.): 690036067
74.14% cpu [.] IsPrime
2.45% [kernel] [k] kallsyms_expand_symbol
2.21% perf [.] 0x000000000005f225
1.65% perf [.] symbols__insert
1.50% perf [.] hex2u64
1.41% libc-2.12.so [.] __strcmp_sse42
1.26% [kernel] [k] string
1.25% [kernel] [k] format_decode
1.10% [kernel] [k] vsnprintf
0.95% [kernel] [k] strnlen
0.85% [kernel] [k] memcpy
0.80% libc-2.12.so [.] memcpy
0.70% [kernel] [k] number
0.60% perf [.] rb_next
0.60% libc-2.12.so [.] _int_malloc
0.55% libc-2.12.so [.] __strstr_sse42
0.55% [kernel] [k] pointer
0.55% perf [.] rb_insert_color
The top entry shows that the IsPrime function of the cpu process is consuming the most CPU.
Next, monitor the process itself with perf stat and its pid:
[root@localhost test]# perf stat -p 10704
^C
Performance counter stats for process id '10704':
4067.488066 task-clock # 1.001 CPUs utilized [100.00%]
5 context-switches # 0.001 K/sec [100.00%]
0 cpu-migrations # 0.000 K/sec [100.00%]
0 page-faults # 0.000 K/sec
10,743,902,971 cycles # 2.641 GHz [83.32%]
5,863,649,174 stalled-cycles-frontend # 54.58% frontend cycles idle [83.32%]
1,347,218,352 stalled-cycles-backend # 12.54% backend cycles idle [66.69%]
14,625,550,503 instructions # 1.36 insns per cycle
# 0.40 stalled cycles per insn [83.34%]
1,950,981,334 branches # 479.653 M/sec [83.35%]
46,273 branch-misses # 0.00% of all branches [83.32%]
4.064800952 seconds time elapsed
This gives the statistics for the various system events.
Then use perf record and perf report for a more detailed analysis:
[root@localhost test]# perf record -g -- ./cpu
^C[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.875 MB perf.data (~38244 samples) ]
./cpu: Interrupt
[root@localhost test]# perf report
Samples: 11K of event 'cycles', Event count (approx.):6815943846
+ 99.70% cpu cpu [.] IsPrime
+ 0.05% cpu [kernel.kallsyms] [k] hrtimer_interrupt
+ 0.03% cpu cpu [.] main
+ 0.02% cpu [kernel.kallsyms] [k] native_write_msr_safe
+ 0.02% cpu [kernel.kallsyms] [k] _spin_unlock_irqrestore
+ 0.02% cpu [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
+ 0.01% cpu [kernel.kallsyms] [k] rb_insert_color
+ 0.01% cpu [kernel.kallsyms] [k] raise_softirq
+ 0.01% cpu [kernel.kallsyms] [k] rcu_process_gp_end
+ 0.01% cpu [kernel.kallsyms] [k] scheduler_tick
+ 0.01% cpu [kernel.kallsyms] [k] do_timer
+ 0.01% cpu [kernel.kallsyms] [k] _spin_lock_irqsave
+ 0.01% cpu [kernel.kallsyms] [k] apic_timer_interrupt
+ 0.01% cpu [kernel.kallsyms] [k] tick_sched_timer
+ 0.01% cpu [kernel.kallsyms] [k] _spin_lock
+ 0.01% cpu [kernel.kallsyms] [k] __remove_hrtimer
+ 0.01% cpu [kernel.kallsyms] [k] idle_cpu
The function call breakdown is now very clear. This program is trivial, so the problem is easy to spot, but when some feature of a large project misbehaves, the perf tools make it just as convenient to pinpoint the cause. Next, the IO-bound example.
Likewise, start program 2 and first use perf top and perf stat -p pid to identify which process is causing the performance problem:
Samples: 394K of event 'cycles', Event count (approx.): 188517759438
75.39% [kernel] [k] _spin_lock
11.39% [kernel] [k] copy_user_generic_string
1.89% [jbd2] [k] jbd2_journal_stop
1.66% [kernel] [k] find_get_page
1.51% [jbd2] [k] start_this_handle
0.95% [ext4] [k] ext4_da_write_end
0.76% [kernel] [k] iov_iter_fault_in_readable
0.74% [ext4] [k] ext4_journal_start_sb
0.56% [jbd2] [k] __jbd2_log_space_left
0.44% [ext4] [k] __ext4_journal_stop
0.44% [kernel] [k] __block_prepare_write
0.41% [kernel] [k] __wake_up_bit
0.35% [kernel] [k] _cond_resched
0.32% [kernel] [k] kmem_cache_free
Many spin-lock events and assorted file-operation events are visible.
Analyze directly with perf record and perf report to get the result:
[root@localhost test]# perf record -g -- ./io 10
press q to quit
q
[ perf record: Woken up 83 times to write data ]
[ perf record: Captured and wrote 23.361 MB perf.data (~1020643 samples) ]
[root@localhost test]# perf report
Samples: 138K of event 'cycles', Event count (approx.): 64858537028
+ 65.87% io [kernel.kallsyms] [k] _spin_lock
+ 14.19% io [kernel.kallsyms] [k] copy_user_generic_string
+ 2.74% io [jbd2] [k] jbd2_journal_stop
+ 2.46% io [kernel.kallsyms] [k] find_get_page
+ 2.11% io [jbd2] [k] start_this_handle
+ 1.06% io [kernel.kallsyms] [k] iov_iter_fault_in_readable
+ 1.05% io [ext4] [k] ext4_journal_start_sb
+ 0.85% io [ext4] [k] ext4_da_write_end
+ 0.73% io [jbd2] [k] __jbd2_log_space_left
+ 0.66% io [kernel.kallsyms] [k] generic_write_end
+ 0.61% io [ext4] [k] __ext4_journal_stop
+ 0.57% io [kernel.kallsyms] [k] __block_prepare_write
+ 0.54% io [kernel.kallsyms] [k] __wake_up_bit
That covers the common perf tools. perf offers more than these, but they are the most basic and most frequently used.
See the official wiki https://perf.wiki.kernel.org/index.php/Tutorial for more information.