#Locating System Performance Bottlenecks# perf

perf is a tuning tool for Linux 2.6+ kernels. It abstracts Linux performance measurement into a common set of methods that work across different CPU hardware, and its data comes from the perf_event interface provided by recent Linux kernels.
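
The perf tool itself is built on top of this kernel interface. Purely as an illustration (not part of the original article), the minimal sketch below counts retired instructions for the calling process through the perf_event_open syscall; the busy-loop workload and the parameter choices are assumptions made only for demonstration.

/* Minimal sketch of the perf_event interface: count retired user-space
 * instructions for this process while it runs a small busy loop. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr,
                            pid_t pid, int cpu, int group_fd, unsigned long flags)
{
    /* glibc provides no wrapper, so call the raw syscall. */
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;        /* count user space only, like the :u modifier */

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 10000000UL; i++) sum += i;   /* measured work */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    read(fd, &count, sizeof(count));
    printf("instructions: %lld\n", count);
    close(fd);
    return 0;
}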

System events:
The perf tool supports a list of measurable event types. The tool and the underlying kernel interface can measure events coming from several different sources.
For example, some events are pure kernel counters; in that case they are called software events. Examples include context-switches, minor-faults, and so on.


Another source of events is the processor itself and its Performance Monitoring Unit (PMU). The PMU provides a list of events for measuring micro-architectural behaviour, such as the number of cycles, instructions retired, L1 cache misses, and so on.

These events are called PMU hardware events, or hardware events for short, and each processor model has its own set of hardware event types.

The perf_events interface also provides a small set of common hardware event names.
On each processor, these generic names are mapped to the events the CPU actually provides; if no matching event exists, the name cannot be used.
These are also referred to as hardware events and hardware cache events.

Finally, there are tracepoint events, which are implemented by the kernel's ftrace framework; these are only available on 2.6.3x and newer kernels.

Tracepoints:
A tracepoint is a hook placed in the kernel source code; once enabled, it fires whenever the corresponding code path is executed. This feature can be used by various trace/debug tools, and perf is one of those users.
For example, if you want to know how the kernel memory management subsystem behaves while your application is running, you can use the tracepoints embedded in the slab allocator. When the kernel reaches one of these tracepoints, perf is notified.
Perf records the events produced by the tracepoints and generates a report; by analysing that report, a performance engineer can learn the details of what the kernel was doing while the program ran and make a more accurate diagnosis of the symptoms.
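
To make the mechanism more concrete, here is a hedged sketch (not from the original article) of consuming a tracepoint through the same perf_event interface. It assumes the kmem:kmalloc tracepoint exists and that its id is exported under /sys/kernel/debug/tracing; both depend on the kernel version and on where debugfs/tracefs is mounted.

/* Sketch: count how often the kmem:kmalloc tracepoint fires on behalf of
 * this task. Assumes root privileges and a mounted debugfs. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    /* Each tracepoint has a numeric id exported by ftrace (path assumed). */
    FILE *f = fopen("/sys/kernel/debug/tracing/events/kmem/kmalloc/id", "r");
    if (!f) { perror("tracepoint id"); return 1; }
    long long id = 0;
    fscanf(f, "%lld", &id);
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.size = sizeof(attr);
    attr.config = id;               /* which tracepoint to count */
    attr.disabled = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* Work that enters the kernel; any kmalloc calls made on behalf of this
     * task while the counter is enabled are counted. */
    for (int i = 0; i < 1000; i++) {
        int nullfd = open("/dev/null", O_RDONLY);
        if (nullfd >= 0) close(nullfd);
    }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    read(fd, &count, sizeof(count));
    printf("kmem:kmalloc hits for this task: %lld\n", count);
    close(fd);
    return 0;
}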

Commonly used perf commands:
perf list: list all available event types
perf top: system-wide monitoring, similar to top
perf stat: collect performance statistics for a command or a specific process

perf record & perf report: record events first, then print them in the form of a detailed report


  • perf list

[root@localhost test]# perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]
  dTLB-prefetch-misses                               [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
  
  sunrpc:rpc_call_status                             [Tracepoint event]
  sunrpc:rpc_bind_status                             [Tracepoint event]
  sunrpc:rpc_connect_status                          [Tracepoint event]
  sunrpc:rpc_task_begin                              [Tracepoint event]
  sunrpc:rpc_task_run_action                         [Tracepoint event]
  sunrpc:rpc_task_complete                           [Tracepoint event]
  sunrpc:rpc_task_sleep                              [Tracepoint event]
  sunrpc:rpc_task_wakeup                             [Tracepoint event]
  ext4:ext4_free_inode                               [Tracepoint event]
  ext4:ext4_request_inode                            [Tracepoint event]
  ext4:ext4_allocate_inode                           [Tracepoint event]
  ext4:ext4_write_begin                              [Tracepoint event]
  ext4:ext4_da_write_begin                           [Tracepoint event]
  ............ (remaining output omitted)

The output of perf list differs from system to system. On a 2.6.35 kernel the list is already quite long, but however many entries there are, they can be divided into three categories:

Hardware events are generated by the PMU hardware, such as cache hits; when you need to understand how a program uses hardware features, sample these events.
Software events are generated by kernel software, such as process switches and tick counts.
Tracepoint events are triggered by static tracepoints in the kernel; they are used to examine the details of kernel behaviour while a program runs, such as the number of allocations made by the slab allocator.


While the operating system is running, scheduling priority from high to low is:
hard interrupts -> soft interrupts -> real-time processes -> kernel processes -> user processes

Hard Interrupts – Devices tell the kernel that they are done processing. For example, a NIC delivers a packet or a hard drive completes an IO request.
Soft Interrupts – These are kernel software interrupts that have to do with maintenance of the kernel. For example, the kernel clock tick thread is a soft interrupt; it checks to make sure a process has not passed its allotted time on a processor.
Real Time Processes – Real time processes have more priority than the kernel itself. A real time process may come on the CPU and preempt (or “kick off”) the kernel. The Linux 2.4 kernel is NOT a fully preemptable kernel, making it not ideal for real time application programming.
Kernel (System) Processes – All kernel processing is handled at this level of priority.
User Processes – This space is often referred to as “userland”. All software applications run in the user space, which has the lowest priority in the kernel scheduling mechanism.


  • perf top

Samples: 475  of event 'cycles', Event count (approx.): 165419249
 10.30%  [kernel]          [k] kallsyms_expand_symbol
  7.75%  perf              [.] 0x000000000005f190
  5.68%  libc-2.12.so      [.] memcpy
  5.45%  [kernel]          [k] format_decode
  5.45%  libc-2.12.so      [.] __strcmp_sse42
  5.45%  perf              [.] symbols__insert
  5.24%  [kernel]          [k] memcpy
  4.86%  [kernel]          [k] vsnprintf
  4.62%  [kernel]          [k] string
  4.45%  [kernel]          [k] number
  3.35%  libc-2.12.so      [.] __strstr_sse42
  3.15%  perf              [.] hex2u64
  2.72%  perf              [.] rb_next
  2.10%  [kernel]          [k] pointer
  1.89%  [kernel]          [k] security_real_capable_noaudit
  1.89%  [kernel]          [k] strnlen
  1.88%  perf              [.] rb_insert_color
  1.68%  libc-2.12.so      [.] _int_malloc
  1.68%  libc-2.12.so      [.] _IO_getdelim
  1.28%  [kernel]          [k] update_iter
  1.05%  [kernel]          [k] s_show
  1.05%  libc-2.12.so      [.] memchr
  1.05%  [kernel]          [k] get_task_cred
  0.88%  [kernel]          [k] seq_read
  0.85%  [kernel]          [k] clear_page_c
  0.84%  perf              [.] symbol__new
  0.63%  libc-2.12.so      [.] __libc_calloc
  0.63%  [kernel]          [k] copy_user_generic_string
  0.63%  [kernel]          [k] seq_vprintf
  0.63%  perf              [.] kallsyms__parse
  0.63%  libc-2.12.so      [.] _IO_feof
  0.63%  [kernel]          [k] strcmp
  0.63%  [kernel]          [k] page_fault
  0.63%  perf              [.] dso__load_sym
  0.42%  perf              [.] strstr@plt
  0.42%  libc-2.12.so      [.] __strchr_sse42
  0.42%  [kernel]          [k] seq_printf
  0.42%  libc-2.12.so      [.] __memset_sse2
  0.42%  libelf-0.152.so   [.] gelf_getsym
  0.25%  [kernel]          [k] __mutex_init
  0.21%  [kernel]          [k] _spin_lock
  0.21%  [kernel]          [k] s_next
  0.21%  [kernel]          [k] fsnotify
  0.21%  [kernel]          [k] sys_read
  0.21%  [kernel]          [k] __link_path_walk
  0.21%  [kernel]          [k] intel_pmu_disable_all
  0.21%  [kernel]          [k] mem_cgroup_charge_common
  0.21%  [kernel]          [k] update_shares
  0.21%  [kernel]          [k] native_read_tsc
  0.21%  [kernel]          [k] apic_timer_interrupt
  0.21%  [kernel]          [k] auditsys
  0.21%  [kernel]          [k] do_softirq
  0.21%  [kernel]          [k] update_process_times
  0.21%  [kernel]          [k] native_sched_clock
  0.21%  [kernel]          [k] mutex_lock
  0.21%  [kernel]          [k] module_get_kallsym
perf top works much like top: it gives a system-wide view of all events, sorted by sample count.

  • perf stat

perf stat can be run directly with a command, or attached to an existing process by pid; when it finishes, it prints the performance statistics collected for that run.

[root@localhost test]# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 0.790059 s, 648 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

        791.176286 task-clock                #    1.000 CPUs utilized          
                 1 context-switches          #    0.001 K/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
               247 page-faults               #    0.312 K/sec                  
     1,248,519,452 cycles                    #    1.578 GHz                     [83.29%]
       366,166,452 stalled-cycles-frontend   #   29.33% frontend cycles idle    [83.33%]
       155,120,002 stalled-cycles-backend    #   12.42% backend  cycles idle    [66.62%]
     1,947,919,494 instructions              #    1.56  insns per cycle        
                                             #    0.19  stalled cycles per insn [83.31%]
       355,465,524 branches                  #  449.287 M/sec                   [83.41%]
         2,021,648 branch-misses             #    0.57% of all branches         [83.42%]

       0.791116595 seconds time elapsed

You can also target a specific process by pid; press Ctrl-C to stop, and perf prints the statistics gathered over that interval.

[root@localhost test]# perf stat -p 18669
^C
 Performance counter stats for process id '18669':

          1.520699 task-clock                #    0.001 CPUs utilized          
                56 context-switches          #    0.037 M/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
                 0 page-faults               #    0.000 K/sec                  
         2,178,120 cycles                    #    1.432 GHz                     [63.18%]
         1,410,393 stalled-cycles-frontend   #   64.75% frontend cycles idle    [90.94%]
           942,665 stalled-cycles-backend    #   43.28% backend  cycles idle   
         1,067,824 instructions              #    0.49  insns per cycle        
                                             #    1.32  stalled cycles per insn
           193,104 branches                  #  126.984 M/sec                  
            14,544 branch-misses             #    7.53% of all branches         [61.93%]

       2.061889979 seconds time elapsed

[root@localhost test]# perf stat

 usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
        --filter <filter>
                          event filter
    -i, --no-inherit      child tasks do not inherit counters
    -p, --pid <pid>       stat events on existing process id
    -t, --tid <tid>       stat events on existing thread id
    -a, --all-cpus        system-wide collection from all CPUs
    -g, --group           put the counters into a counter group
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100)
    -n, --null            null run - dont start any counters
    -d, --detailed        detailed run - start a lot of events
    -S, --sync            call sync() before starting a run
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -A, --no-aggr         disable CPU count aggregation
    -x, --field-separator <separator>
                          print counts with custom separator
    -G, --cgroup <name>   monitor event in cgroup name only
    -o, --output <file>   output file name
        --append          append to the output file
        --log-fd <n>      log output to fd, instead of stderr
        --pre <command>   command to run prior to the measured command
        --post <command>  command to run after to the measured command

perf stat has other useful options as well, such as -e to select specific events:

perf stat -e cycles,instructions,cache-misses [...]
You can also restrict the privilege level being monitored:

perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000
The modifiers have the following meanings:

Modifier   Description                                                 Example
u          monitor at priv level 3, 2, 1 (user)                        event:u
k          monitor at priv level 0 (kernel)                            event:k
h          monitor hypervisor events on a virtualization environment   event:h
H          monitor host machine on a virtualization environment        event:H
G          monitor guest machine on a virtualization environment       event:G

These modifiers are effectively boolean flags: u means user space, k means kernel space, and the last three relate to virtualization.
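
For reference only, the sketch below (again an assumption-laden illustration, not the article's code) shows what :u and :k correspond to at the perf_event_attr level, namely the exclude_kernel and exclude_user bits; the /dev/null write loop is just a stand-in workload.

/* Sketch: count CPU cycles separately for user space (like cycles:u) and
 * kernel space (like cycles:k) while this process issues write syscalls. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_cycles(int exclude_kernel, int exclude_user)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = exclude_kernel;   /* :u keeps user, drops kernel */
    attr.exclude_user = exclude_user;       /* :k keeps kernel, drops user */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_u = open_cycles(1, 0);   /* like cycles:u */
    int fd_k = open_cycles(0, 1);   /* like cycles:k */
    if (fd_u < 0 || fd_k < 0) { perror("perf_event_open"); return 1; }

    /* Workload with both user-space computation and kernel work (write syscalls). */
    int nullfd = open("/dev/null", O_WRONLY);
    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 100000; i++)
        write(nullfd, buf, sizeof(buf));
    close(nullfd);

    long long u = 0, k = 0;
    read(fd_u, &u, sizeof(u));
    read(fd_k, &k, sizeof(k));
    printf("user cycles:   %lld\nkernel cycles: %lld\n", u, k);
    close(fd_u); close(fd_k);
    return 0;
}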


Operating system resources generally fall into four areas: CPU, memory, IO, and network.

CPU and IO are where bottlenecks usually show up, so I wrote two small programs to simulate those two situations and then used the perf tool set to track down the cause.

Program 1 computes prime numbers and exists purely to burn CPU:

#include <stdio.h>

/* Trial-division primality test: returns 1 if num is prime, 0 otherwise. */
int IsPrime(int num)
{
    int i = 2;
    for (; i <= num / 2; i++)
        if (0 == num % i)
            return 0;
    return 1;
}

int main(void)
{
    int num;
    for (num = 2; num <= 10000000; num++)
        IsPrime(num);
    return 0;
}

Program 2 starts n threads; each thread alternately writes a 16 MB buffer to a file and reads it back in a tight loop, sleeping briefly between operations, to stress the system's IO:

#include <unistd.h>
#include <stdio.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <vector>
#include <pthread.h>
using namespace std;

// Set to true by the main thread to ask the worker threads to stop.
volatile bool gquit = false;

void* threadfun(void* param)
{
    string filename = *(string*)param;

    unlink(filename.c_str());

    // 0666 is an octal file mode (the original decimal 666 yields odd permissions).
    int fd = open(filename.c_str(), O_CREAT | O_RDWR, 0666);
    if (fd < 0) return 0;

    // 16 MB buffer, alternately written to the file and read back from it.
    string a(1024*16*1024, 'a');
    bool flag = true;
    while (!gquit)
    {

        if (flag)
        {
            write(fd, a.c_str(), a.length());
        }
        else
        {
            read(fd, &a[0], a.length());
            lseek(fd, 0, SEEK_SET);
        }
        flag = !flag;
        usleep(1);
    }
    close(fd);
    return 0;
}


int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        printf("need thread num\n");
        return 0;
    }
        
    int threadnum = atoi(argv[1]);

    vector<string> v(threadnum, "./testio.filex_");
    vector<pthread_t> vpid(threadnum, 0);
    for(int i = 0 ; i < threadnum ; ++ i)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);

        static char buf[32];
        snprintf(buf, sizeof(buf), "%d", i);
        v[i] += buf;
        pthread_create(&vpid[i], & attr, threadfun, (void*)&v[i]); 
        
    }
    printf("press q to quit\n");    
    
    while(getchar() != 'q');

    gquit = true;
    for(int i = 0 ; i < threadnum ; ++ i)
    {
        void* ret = NULL;
        pthread_join(vpid[i], &ret);    
        unlink(v[i].c_str());
    }   
    return 0;
}

After starting program 1, I monitored the whole system with perf top:

Samples: 1K of event 'cycles', Event count (approx.): 690036067
 74.14%  cpu               [.] IsPrime
  2.45%  [kernel]          [k] kallsyms_expand_symbol
  2.21%  perf              [.] 0x000000000005f225
  1.65%  perf              [.] symbols__insert
  1.50%  perf              [.] hex2u64
  1.41%  libc-2.12.so      [.] __strcmp_sse42
  1.26%  [kernel]          [k] string
  1.25%  [kernel]          [k] format_decode
  1.10%  [kernel]          [k] vsnprintf
  0.95%  [kernel]          [k] strnlen
  0.85%  [kernel]          [k] memcpy
  0.80%  libc-2.12.so      [.] memcpy
  0.70%  [kernel]          [k] number
  0.60%  perf              [.] rb_next
  0.60%  libc-2.12.so      [.] _int_malloc
  0.55%  libc-2.12.so      [.] __strstr_sse42
  0.55%  [kernel]          [k] pointer
  0.55%  perf              [.] rb_insert_color

The first entry shows that the IsPrime function of the cpu process accounts for most of the CPU usage.

Next, monitor that process with perf stat and its pid:

[root@localhost test]# perf stat -p 10704
^C
 Performance counter stats for process id '10704':

       4067.488066 task-clock                #    1.001 CPUs utilized           [100.00%]
                 5 context-switches          #    0.001 K/sec                   [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
                 0 page-faults               #    0.000 K/sec                  
    10,743,902,971 cycles                    #    2.641 GHz                     [83.32%]
     5,863,649,174 stalled-cycles-frontend   #   54.58% frontend cycles idle    [83.32%]
     1,347,218,352 stalled-cycles-backend    #   12.54% backend  cycles idle    [66.69%]
    14,625,550,503 instructions              #    1.36  insns per cycle        
                                             #    0.40  stalled cycles per insn [83.34%]
     1,950,981,334 branches                  #  479.653 M/sec                   [83.35%]
            46,273 branch-misses             #    0.00% of all branches         [83.32%]

       4.064800952 seconds time elapsed

This shows the statistics for the various system events.

Then use perf record and perf report for a more detailed analysis:

[root@localhost test]# perf record -g -- ./cpu    
^C[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.875 MB perf.data (~38244 samples) ]
./cpu: Interrupt


[root@localhost test]# perf report
Samples: 11K of event 'cycles', Event count (approx.): 6815943846
+  99.70%  cpu  cpu                [.] IsPrime
+   0.05%  cpu  [kernel.kallsyms]  [k] hrtimer_interrupt
+   0.03%  cpu  cpu                [.] main
+   0.02%  cpu  [kernel.kallsyms]  [k] native_write_msr_safe
+   0.02%  cpu  [kernel.kallsyms]  [k] _spin_unlock_irqrestore
+   0.02%  cpu  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
+   0.01%  cpu  [kernel.kallsyms]  [k] rb_insert_color
+   0.01%  cpu  [kernel.kallsyms]  [k] raise_softirq
+   0.01%  cpu  [kernel.kallsyms]  [k] rcu_process_gp_end
+   0.01%  cpu  [kernel.kallsyms]  [k] scheduler_tick
+   0.01%  cpu  [kernel.kallsyms]  [k] do_timer
+   0.01%  cpu  [kernel.kallsyms]  [k] _spin_lock_irqsave
+   0.01%  cpu  [kernel.kallsyms]  [k] apic_timer_interrupt
+   0.01%  cpu  [kernel.kallsyms]  [k] tick_sched_timer
+   0.01%  cpu  [kernel.kallsyms]  [k] _spin_lock
+   0.01%  cpu  [kernel.kallsyms]  [k] __remove_hrtimer
+   0.01%  cpu  [kernel.kallsyms]  [k] idle_cpu


The breakdown of the individual functions is very clear. Because this program is trivial, the problem is easy to spot; but when a feature of a large project misbehaves, the perf tools make it just as convenient to pinpoint the cause. Next, let's look at an IO-heavy example.

As before, start program 2 first and use perf top and perf stat -p pid to identify which process is causing the performance problem.

Samples: 394K of event 'cycles', Event count (approx.): 188517759438
 75.39%  [kernel]             [k] _spin_lock
 11.39%  [kernel]             [k] copy_user_generic_string
  1.89%  [jbd2]               [k] jbd2_journal_stop
  1.66%  [kernel]             [k] find_get_page
  1.51%  [jbd2]               [k] start_this_handle
  0.95%  [ext4]               [k] ext4_da_write_end
  0.76%  [kernel]             [k] iov_iter_fault_in_readable
  0.74%  [ext4]               [k] ext4_journal_start_sb
  0.56%  [jbd2]               [k] __jbd2_log_space_left
  0.44%  [ext4]               [k] __ext4_journal_stop
  0.44%  [kernel]             [k] __block_prepare_write
  0.41%  [kernel]             [k] __wake_up_bit
  0.35%  [kernel]             [k] _cond_resched
  0.32%  [kernel]             [k] kmem_cache_free

A large number of spin-lock events and various file-operation events are visible.


Then analyse directly with perf record and perf report:

[root@localhost test]# perf record -g -- ./io 10
press q to quit
q
[ perf record: Woken up 83 times to write data ]
[ perf record: Captured and wrote 23.361 MB perf.data (~1020643 samples) ]

[root@localhost test]# perf report
Samples: 138K of event 'cycles', Event count (approx.): 64858537028
+  65.87%  io  [kernel.kallsyms]    [k] _spin_lock
+  14.19%  io  [kernel.kallsyms]    [k] copy_user_generic_string
+   2.74%  io  [jbd2]               [k] jbd2_journal_stop
+   2.46%  io  [kernel.kallsyms]    [k] find_get_page
+   2.11%  io  [jbd2]               [k] start_this_handle
+   1.06%  io  [kernel.kallsyms]    [k] iov_iter_fault_in_readable
+   1.05%  io  [ext4]               [k] ext4_journal_start_sb
+   0.85%  io  [ext4]               [k] ext4_da_write_end
+   0.73%  io  [jbd2]               [k] __jbd2_log_space_left
+   0.66%  io  [kernel.kallsyms]    [k] generic_write_end
+   0.61%  io  [ext4]               [k] __ext4_journal_stop
+   0.57%  io  [kernel.kallsyms]    [k] __block_prepare_write
+   0.54%  io  [kernel.kallsyms]    [k] __wake_up_bit


That covers the most commonly used perf tools. perf offers more than these, but they are the most basic and the ones used most often.

See the official wiki at https://perf.wiki.kernel.org/index.php/Tutorial for more information.




