perf_event_open 學習 —— 通過read的方式讀取硬件技術器

示例程序1

剛剛入職的時候我就研究了perf_event_open()這個巨無霸級別的系統調用,還用Python封裝了一層,非常便於獲取計數器。然而之後由於工作上一直直接用perf命令來獲取各種計數器的值,於是就有所淡忘。又後來自己的筆記本重裝,代碼也不見了。最近由於需要開發一款略有複雜性的工具,對性能要求很高,而且需要能夠實時查詢結果,所以必須使用C的接口自己開發,而不能再使用命令行。

其實關於perf_event_open()這個系統調用的一切用法,在官方手冊《perf_event_open(2) - Linux manual page》上都有所涉及,雖然未提及有些編程的細節。

大致上講,perf_event_open()有兩個使用模式,一個叫做計數,一個叫做採樣。所謂計數,就是測量一段時間內某個事件發生的次數,比如獲取每1秒內運行的指令數。所謂採樣,就是在某個時間點查看某個狀態,比如我想每1秒記錄下當前的IP寄存器的值。從編程的角度看,計數模式要容易理解地多,也容易實現地多。所以這篇博客先講怎麼使用perf的計數模式。

最簡單的計數模式就是隻監測一個計數器。比如我每一秒獲取剛剛過去的那一秒內的指令數(instructions)。複雜的計數模式就是同時監測多個計數器。比如我每一秒獲取剛剛過去的那一秒內的指令數(instructions)、時鐘週期數(cycles)、分支指令數(branch instructions)等等。接下來就先講單一計數器模式,而後講多計數器模式。

單計數器

=階段一:單計數器=

就以剛剛的例子作爲需求,我們要每一秒輸出剛剛一秒內運行的指令數。那麼代碼是這樣的:

single.c

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

//目前perf_event_open在glibc中沒有封裝,需要手工封裝一下
int perf_event_open(struct perf_event_attr *attr,pid_t pid,int cpu,int group_fd,unsigned long flags)
{
    return syscall(__NR_perf_event_open,attr,pid,cpu,group_fd,flags);
}

int main()
{
    struct perf_event_attr attr;
    memset(&attr,0,sizeof(struct perf_event_attr));
    attr.size=sizeof(struct perf_event_attr);
    //監測硬件
    attr.type=PERF_TYPE_HARDWARE;
    //監測指令數
    attr.config=PERF_COUNT_HW_INSTRUCTIONS;
    //初始狀態爲禁用
    attr.disabled=1;
    //創建perf文件描述符,其中pid=0,cpu=-1表示監測當前進程,不論運行在那個cpu上
    int fd=perf_event_open(&attr,0,-1,-1,0);
    if(fd<0)
    {
        perror("Cannot open perf fd!");
        return 1;
    }
    //啓用(開始計數)
    ioctl(fd,PERF_EVENT_IOC_ENABLE,0);
    while(1)
    {
        uint64_t instructions;
        //讀取最新的計數值
        read(fd,&instructions,sizeof(instructions));
        printf("instructions=%ld\n",instructions);
        sleep(1);
    }
}

不需要任何的編譯選項,直接gcc,然後運行:

gcc single.c -o single
sudo ./single

image

可以看到指令數不斷增加。其實這個增速非常緩慢,因爲循環裏面除了printf什麼事都沒幹。這個指令數是perf始能之後的累積值。那麼如果想要獲取每一秒內的指令數呢?一種辦法就是自己在應用層計算前後兩次的差分,另一種辦法是告訴kernel把計數器清零:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

//目前perf_event_open在glibc中沒有封裝,需要手工封裝一下
int perf_event_open(struct perf_event_attr *attr,pid_t pid,int cpu,int group_fd,unsigned long flags)
{
    return syscall(__NR_perf_event_open,attr,pid,cpu,group_fd,flags);
}

int main()
{
    struct perf_event_attr attr;
    memset(&attr,0,sizeof(struct perf_event_attr));
    attr.size=sizeof(struct perf_event_attr);
    //監測硬件
    attr.type=PERF_TYPE_HARDWARE;
    //監測指令數
    attr.config=PERF_COUNT_HW_INSTRUCTIONS;
    //初始狀態爲禁用
    attr.disabled=1;
    //創建perf文件描述符,其中pid=0,cpu=-1表示監測當前進程,不論運行在那個cpu上
    int fd=perf_event_open(&attr,0,-1,-1,0);
    if(fd<0)
    {
        perror("Cannot open perf fd!");
        return 1;
    }
    //啓用(開始計數)
    ioctl(fd,PERF_EVENT_IOC_ENABLE,0);
    while(1)
    {
        uint64_t instructions;
        //讀取最新的計數值
        read(fd,&instructions,sizeof(instructions));
        //讀取後清零
	ioctl(fd,PERF_EVENT_IOC_RESET,0);
        printf("instructions=%ld\n",instructions);
        sleep(1);
    }
}

每個計數器本身是有很多選項的,比如可以指定監測的進程號(pid)、cpu號,也可以設置只監測用戶態的事件,或只監測內核態的事件等等,具體可以查看手冊。另外,計數器可以監測的事件也有很多,有硬件的,有軟件的。這裏截取了手冊上一些常用的:
image
image

多計數器

===階段二:多計數器

如果你說多個計數器簡單,創建多個perf文件描述符,然後每個都用read去讀嘛。額,確實可以!但是呢,當監測的事件很多,而且讀取頻率很高時,那麼read()調用的開銷就不可再忽略了。那麼如果能夠把多個計數器的值通過一次read()調用獲取,性能上能夠提高不少。perf提供了“組”的概念,這也是多計數器perf的核心內容。

讀讀手冊的這一段:

image

所以呢,第一個perf fd還是和本來一樣的方法創建(除了要多設置一個read_format),而後面的perf_event_open()中的參數group_fd就傳入第一個perf fd。這樣他們就成爲了一個組,並且用第一個perf fd代表整個組。

下面代碼演示瞭如何同時監測指令數和時鐘週期數。

multi.c

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

//目前perf_event_open在glibc中沒有封裝,需要手工封裝一下
int perf_event_open(struct perf_event_attr *attr,pid_t pid,int cpu,int group_fd,unsigned long flags)
{
    return syscall(__NR_perf_event_open,attr,pid,cpu,group_fd,flags);
}

//每次read()得到的結構體
struct read_format
{
    //計數器數量(爲2)
    uint64_t nr;
    //兩個計數器的值
    uint64_t values[2];
};

int main()
{
    struct perf_event_attr attr;
    memset(&attr,0,sizeof(struct perf_event_attr));
    attr.size=sizeof(struct perf_event_attr);
    //監測硬件
    attr.type=PERF_TYPE_HARDWARE;
    //監測指令數
    attr.config=PERF_COUNT_HW_INSTRUCTIONS;
    //初始狀態爲禁用
    attr.disabled=1;
    //每次讀取一個組
    attr.read_format=PERF_FORMAT_GROUP;
    //創建perf文件描述符,其中pid=0,cpu=-1表示監測當前進程,不論運行在那個cpu上
    int fd=perf_event_open(&attr,0,-1,-1,0);
    if(fd<0)
    {
        perror("Cannot open perf fd!");
        return 1;
    }
    //接下來創建第二個計數器
    memset(&attr,0,sizeof(struct perf_event_attr));
    attr.size=sizeof(struct perf_event_attr);
    //監測
    attr.type=PERF_TYPE_HARDWARE;
    //監測時鐘週期數
    attr.config=PERF_COUNT_HW_CPU_CYCLES;
    //初始狀態爲禁用
    attr.disabled=1;
    //創建perf文件描述符
    int fd2=perf_event_open(&attr,0,-1,fd,0);
    if(fd2<0)
    {
        perror("Cannot open perf fd2!");
        return 1;
    }
    //啓用(開始計數),注意PERF_IOC_FLAG_GROUP標誌
    ioctl(fd,PERF_EVENT_IOC_ENABLE,PERF_IOC_FLAG_GROUP);
    while(1)
    {
        struct read_format aread;
        //讀取最新的計數值,每次讀取一個結構體
        read(fd,&aread,sizeof(struct read_format));
        printf("instructions=%ld,cycles=%ld\n",aread.values[0],aread.values[1]);
        sleep(1);
    }
}

可以注意到,最大的變化就是數據的讀取。當使用了“組”的形式之後,那麼每次read()就是讀取一個特定的結構體。這個結構體struct read_format不是固定的,會根據組內計數器數量和struct perf_event_attr的read_format字段的設置而變化。上面的代碼用了最簡單的方法,把struct perf_event_attr的read_format設置爲PERF_FORMAT_GROUP,那麼每次讀取的結構體其實就是1+nr個64位整數,其中第一個整數nr就是計數器數量,後面nr個整數就是每一個計數器的值,順序和加入組的順序相同。

第二大變化就是ioctl()的第三個參數由0變爲了PERF_IOC_FLAG_GROUP。這個標誌表明操作是對組進行的,可以理解爲kernel幫我們把ioctl依次作用在了組的每一個計數器上。所以呢,如果每次讀取後,要把組內所有計數器都清空,需要使用:

ioctl(fd,PERF_EVENT_IOC_RESET,PERF_IOC_FLAG_GROUP);

示例程序2

The perf_event_open Linux system call can be used to read hardware counters. In this section, two examples are provided. The first example shows how to read a single counter, the second example shows how to read a group of counters without multiplexing. perf_event_open does not support multiplexing.

Configure a single counter

The example below shows how to use the perf_event_open system call to read a single counter.

Use a text editor to create a file named perf_event_example1.c and paste the code below into the file:

#include <linux/perf_event.h> /* Definition of PERF_* constants */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#include <inttypes.h>

// The function to counting through (called in main)
void code_to_measure(){
  int sum = 0;
    for(int i = 0; i < 1000000000; ++i){
      sum += 1;
    }
}

// Executes perf_event_open syscall and makes sure it is succesful or exit
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, int group_fd, unsigned long flags){
  int fd;
  fd = syscall(SYS_perf_event_open, hw_event, pid, cpu, group_fd, flags);
  if (fd == -1) {
    fprintf(stderr, "Error creating event");
    exit(EXIT_FAILURE);
  }

  return fd;
}

int main() {
  int fd;
  uint64_t  val;
  struct perf_event_attr  pe;

  // Configure the event to count
  memset(&pe, 0, sizeof(struct perf_event_attr));
  pe.type = PERF_TYPE_HARDWARE;
  pe.size = sizeof(struct perf_event_attr);
  pe.config = PERF_COUNT_HW_INSTRUCTIONS;
  pe.disabled = 1;
  pe.exclude_kernel = 1;   // Do not measure instructions executed in the kernel
  pe.exclude_hv = 1;  // Do not measure instructions executed in a hypervisor

  // Create the event
  fd = perf_event_open(&pe, 0, -1, -1, 0);

  //Reset counters and start counting
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  // Example code to count through
  code_to_measure();
  // Stop counting
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

  // Read and print result
  read(fd, &val, sizeof(val));
  printf("Instructions retired: %"PRIu64"\n", val);

  // Clean up file descriptor
  close(fd);

  return 0;
}

The example counts the number of instructions executed in the code_to_measure function.

Just as with PAPI, the counter is started right before the call to code_to_measure and the counter is stopped and read just after the call to code_to_measure.

The event being counted is PERF_COUNT_HW_INSTRUCTIONS which maps to the Arm PMU INST_RETIRED (ID: 0x08) event.

The perf_event_open documentation lists the preset events that can be used.

It is also possible to use a raw event code if a preset doesn’t exist. The data structure perf_event_attr is how the event to count is configured. This data structure has numerous fields. In the example above, the data structure is setup so that instructions executed in the kernel (or Arm exception level EL1) and instructions executed in the hypervisor (or Arm exception level EL2) are not counted. This means the example is only counting user space instructions executed (or Arm exception level EL0).

You can review the manual page to understand the configuration options for event counting.

Compile the example using the GNU compiler:

gcc perf_event_example1.c -o perf_event_example1

Run the application as root (or using sudo):

sudo ./perf_event_example1

The output will be similar to:

image

Your counter value may be different from what is shown above. There are many variables that change the count including the CPU design and the compiler.

Configure multiple counters (no multiplexing)

Counting a group of events makes it possible to calculate ratios like Instructions Per Cycle (IPC). Below is an example of counting multiple events.

Use a text editor to create a file named perf_event_example2.c and paste the code below into the file:

#include <linux/perf_event.h> /* Definition of PERF_* constants */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#include <inttypes.h>
#define TOTAL_EVENTS 6

// The function to counting through (called in main)
void code_to_measure(){
  int sum = 0;
  for(int i = 0; i < 1000000000; ++i){
    sum += 1;
  }
}

// Executes perf_event_open syscall and makes sure it is succesful or exit
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, int group_fd, unsigned long flags){
  int fd;
  fd = syscall(SYS_perf_event_open, hw_event, pid, cpu, group_fd, flags);
  if (fd == -1) {
    fprintf(stderr, "Error creating event");
    exit(EXIT_FAILURE);
  }

  return fd;
}

// Helper function to setup a perf event structure (perf_event_attr; see man perf_open_event)
void configure_event(struct perf_event_attr *pe, uint32_t type, uint64_t config){
  memset(pe, 0, sizeof(struct perf_event_attr));
  pe->type = type;
  pe->size = sizeof(struct perf_event_attr);
  pe->config = config;
  pe->read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
  pe->disabled = 1;
  pe->exclude_kernel = 1;
  pe->exclude_hv = 1;
}

// Format of event data to read
// Note: This format changes depending on perf_event_attr.read_format
// See `man perf_event_open` to understand how this structure can be different depending on event config
// This read_format structure corresponds to when PERF_FORMAT_GROUP & PERF_FORMAT_ID are set
struct read_format {
  uint64_t nr;
  struct {
    uint64_t value;
    uint64_t id;
  } values[TOTAL_EVENTS];
};

int main() {
  int fd[TOTAL_EVENTS];  // fd[0] will be the group leader file descriptor
  int id[TOTAL_EVENTS];  // event ids for file descriptors
  uint64_t pe_val[TOTAL_EVENTS]; // Counter value array corresponding to fd/id array.
  struct perf_event_attr pe[TOTAL_EVENTS];  // Configuration structure for perf events (see man perf_event_open)
  struct read_format counter_results;

  // Configure the group of PMUs to count
  configure_event(&pe[0], PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
  configure_event(&pe[1], PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
  configure_event(&pe[2], PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND);
  configure_event(&pe[3], PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_BACKEND);
  configure_event(&pe[4], PERF_TYPE_RAW, 0x70);  // Count of speculative loads (see Arm PMU docs)
  configure_event(&pe[5], PERF_TYPE_RAW, 0x71);  // Count of speculative stores (see Arm PMU docs)

  // Create event group leader
  fd[0] = perf_event_open(&pe[0], 0, -1, -1, 0);
  ioctl(fd[0], PERF_EVENT_IOC_ID, &id[0]);
  // Let's create the rest of the events while using fd[0] as the group leader
  for(int i = 1; i < TOTAL_EVENTS; i++){
    fd[i] = perf_event_open(&pe[i], 0, -1, fd[0], 0);
    ioctl(fd[i], PERF_EVENT_IOC_ID, &id[i]);
  }

  // Reset counters and start counting; Since fd[0] is leader, this resets and enables all counters
  // PERF_IOC_FLAG_GROUP required for the ioctl to act on the group of file descriptors
  ioctl(fd[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
  ioctl(fd[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

  // Example code to count through
  code_to_measure();

  // Stop all counters
  ioctl(fd[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

  // Read the group of counters and print result
  read(fd[0], &counter_results, sizeof(struct read_format));
  printf("Num events captured: %"PRIu64"\n", counter_results.nr);
  for(int i = 0; i < counter_results.nr; i++) {
    for(int j = 0; j < TOTAL_EVENTS ;j++){
      if(counter_results.values[i].id == id[j]){
        pe_val[i] = counter_results.values[i].value;
      }
    }
  }
  printf("CPU cycles: %"PRIu64"\n", pe_val[0]);
  printf("Instructions retired: %"PRIu64"\n", pe_val[1]);
  printf("Frontend stall cycles: %"PRIu64"\n", pe_val[2]);
  printf("Backend stall cycles: %"PRIu64"\n", pe_val[3]);
  printf("Loads executed speculatively: %"PRIu64"\n", pe_val[4]);
  printf("Stores executed speculatively: %"PRIu64"\n", pe_val[5]);

  // Close counter file descriptors
  for(int i = 0; i < TOTAL_EVENTS; i++){
    close(fd[i]);
  }

  return 0;
}

Near the top of the code there is a data structure called read_format. It is setup to contain TOTAL_EVENTS (6 in this case) of an inner structure called values. This structure is populated when the group of 6 counters is read.

  • Note1
    The read_format structure can take different forms depending on how the perf_event_attr structure is configured. Refer to the man page for more information.

In addition to read_format, there is also the perf_event_attr structure which allows configuration of each of the 6 events. This is why the perf_event_attr structure array called pe is a size of TOTAL_EVENTS (or 6 in this case). This means there is 1 perf_event_attr structure per event to count.

  • Note2
    It is possible to reuse one perf_event_attr structure for setting up all events but this is not done here.

The events to count are configured using the configure_event function. In this example, there are 6 events to count, 4 are the preset events of PERF_COUNT_HW_CPU_CYCLES, PERF_COUNT_HW_INSTRUCTIONS, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND and PERF_COUNT_HW_STALLED_CYCLES_BACKEND.

The last two are raw events 0x70 and 0x71 which correspond to loads executed speculatively (LD_SPEC) and stores executed speculatively (ST_SPEC).

Remember that these event codes (0x70 and 0x71) can be found in the TRM for the CPU.

These last two events are examples of how an event that might not have a preset can be counted. Of these 6 events, one needs to be selected as the group leader. When this is done, whenever an action on the group leader is taken (such as start counting), that action is taken on all of the counters in the group.

The last thing that is different in this example is the ioctl calls that reset, start and stop the group of counters. There is an additional flag called PERF_IOC_FLAG_GROUP. This is required to trigger the entire group to count. If this is missing then only the group leader will be counted.

Compile the example using the GNU compiler:

gcc perf_event_example2.c -o perf_event_example2

Run the application as root (or using sudo):

sudo ./perf_event_example2

The output will be similar to:

image

Your counter values may be different from the output above.

If you want to measure more counters than is supported by the CPU, you will need to implement multiplexing yourself.

If you choose to do this, be sure to set the PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING fields in the perf_event_attr.read_format structure. This is done by ORing these flags into the same line you see PERF_FORMAT_GROUP and PERF_FORMAT_ID above. If this is done, the read_format structure will need to be changed to include the time enabled and time running fields. If this multiplexing is implemented, the resulting counts should be taken as an estimate.

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章