Linux C/C++ Debugging, Part 4: The Limitations of callgrind

Original

2019-09-04 15:24

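In the previous article I gave an overview of how to use callgrind. As that showed, callgrind is a non-intrusive, nearly foolproof profiling tool. When I first started using it, it felt so handy that I wanted to run callgrind on every program I came across. After using it in more depth, though, I found that callgrind is no silver bullet and has some shortcomings. Their root cause is that callgrind measures performance in instruction counts, whereas programmers measure performance in elapsed time. Instruction count and elapsed time are only positively correlated, not proportional, so performance data gathered by inserting timestamps can disagree with the data callgrind reports. I will illustrate these discrepancies with three examples.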

Hardware and software environment:

OS: Ubuntu 18.04.3(Linux 4.15.0-58-generic)

CPU: Intel® Core™ i5-8250U @1.60GHz

Compiler: gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

The effect of the cache

For a two-dimensional array, row-major access is faster than column-major access, thanks to the CPU cache. Consider the following example:

#include "tool.h"
#include <vector>

const int SIZE = 2000;
std::vector<double> rowFirstVector;
std::vector<double> columnFirstVector;

double rowFirst()
{
    Logger l("row first access");
    double total = 0.0; // must be initialized before accumulating
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            total += rowFirstVector[i * SIZE + j];
        }
    }
    return total;
}

double columnFirst()
{
    Logger l("column first access");
    double total = 0.0; // must be initialized before accumulating
    for (int i = 0; i < SIZE; i++)
    {
        for (int j = 0; j < SIZE; j++)
        {
            total += columnFirstVector[j * SIZE + i];
        }
    }
    return total;
}

int main()
{
    rowFirstVector = std::vector<double>(SIZE * SIZE);
    rowFirst();

    columnFirstVector = std::vector<double>(SIZE * SIZE);
    columnFirst();

    return 0;
}
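
tool.h is not reproduced in this article, so the exact Logger implementation is unknown. As a minimal sketch, assuming it is built on std::chrono::steady_clock and prints in the same "name cost N ms" format as the results in this article, an RAII timer along the following lines would be enough to compile the examples. The elapsedMs method and the member names are hypothetical, added only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical stand-in for the Logger in tool.h (the real header is not shown).
class Logger
{
    using Clock = std::chrono::steady_clock;

public:
    // Records the start time when the object is created.
    explicit Logger(std::string name)
        : name_(std::move(name)), start_(Clock::now())
    {
    }

    // Non-copyable: each instance owns exactly one timing scope.
    Logger(const Logger &) = delete;
    Logger &operator=(const Logger &) = delete;

    // Milliseconds elapsed since construction.
    double elapsedMs() const
    {
        const std::chrono::duration<double, std::milli> elapsed =
            Clock::now() - start_;
        assert(elapsed.count() >= 0.0); // steady_clock never runs backwards
        return elapsed.count();
    }

    // RAII: runs when the enclosing scope exits and prints the cost.
    ~Logger()
    {
        std::printf("%s cost %.3f ms\n", name_.c_str(), elapsedMs());
    }

private:
    std::string name_;
    Clock::time_point start_;
};
```

Declaring `Logger l("row first access");` as the first statement of a function is then enough: the destructor prints the elapsed time when the function returns, whichever return path is taken.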

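The Logger class in the code records timestamps and prints the elapsed time; it is implemented with the RAII idiom. Because a two-dimensional array is laid out linearly in memory anyway, a vector is used in its place. rowFirst accesses memory in row-major order and columnFirst in column-major order. Compiling the source with O2 optimization and running it gives: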

row first access cost 15.018 ms
column first access cost 22.841 ms

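Row-major access is faster than column-major access, as expected. callgrind, however, sees it differently: in its view rowFirst and columnFirst perform equally well, because they execute the same number of instructions.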
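
To make the "same instructions, different memory behaviour" point concrete, here is a small sketch that walks the same index sequences as rowFirst and columnFirst without touching real memory. It counts how often an access lands on a different cache line than the previous one. This is only a crude proxy for locality, not a cache simulation, and the 64-byte line size, the AccessStats type and the countLineChanges function are assumptions introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache-line size; 64 bytes is common on x86 but is an assumption here.
const std::size_t LINE_BYTES = 64;

struct AccessStats
{
    std::size_t accesses;    // one load per element, identical for both orders
    std::size_t lineChanges; // accesses that land on a different line than the previous one
};

// Walks an n x n array of doubles stored row by row.
// rowMajor == true visits index i * n + j, false visits index j * n + i.
AccessStats countLineChanges(std::size_t n, bool rowMajor)
{
    assert(n > 0);
    AccessStats stats = {0, 0};
    std::size_t previousLine = static_cast<std::size_t>(-1); // no line seen yet
    for (std::size_t i = 0; i < n; i++)
    {
        for (std::size_t j = 0; j < n; j++)
        {
            const std::size_t index = rowMajor ? i * n + j : j * n + i;
            const std::size_t line = index * sizeof(double) / LINE_BYTES;
            stats.accesses++;
            if (line != previousLine)
            {
                stats.lineChanges++;
                previousLine = line;
            }
        }
    }
    return stats;
}
```

Under these assumptions, with n = 2000 both orders perform 4,000,000 accesses, but the row-major walk moves to a new line only 500,000 times while the column-major walk does so on every access. An instruction counter sees only the first number.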

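Of course, callgrind can be run with cache simulation enabled (--cache-sim=yes) to collect cache-miss statistics, but those numbers cannot be combined with the instruction counts. The two can only be compared separately, which is not very intuitive.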

The effect of per-instruction latency

Floating-point instructions are slower than integer instructions (at least on my machine), which leads to this example:

#include "tool.h"
#include <vector>

const int SIZE = 10000000;
std::vector<int> intVector;
std::vector<double> doubleVector;

int intMultiply()
{
    Logger l("intMultiply");
    int result = 0; // must be initialized before accumulating
    for (int i = 0; i < SIZE; i++)
    {
        result += intVector[i] * 1;
    }
    return result;
}

double doubleMultiply()
{
    Logger l("doubleMultiply");
    double result = 0.0; // must be initialized before accumulating
    for (int i = 0; i < SIZE; i++)
    {
        result += doubleVector[i] * 1.0;
    }
    return result;
}

int main()
{
    intVector = std::vector<int>(SIZE);
    intMultiply();
    doubleVector = std::vector<double>(SIZE);
    doubleMultiply();
    return 0;
}

The results are as follows, matching our expectations:

intMultiply cost 7.388 ms
doubleMultiply cost 29.994 ms

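callgrind's result, though, once again cannot tell which one is faster and which is slower, because both functions execute the same number of instructions.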
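
The gap between instruction count and time can be illustrated with a toy cost model. The sketch below only contrasts unit weighting with latency weighting; the InstructionKind type, both function names, and the latency figures used in the usage note (1 cycle for an integer add, 4 cycles for a floating-point add) are assumptions chosen for illustration, not measurements taken on the machine above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One kind of instruction in a loop body: how often it runs and an assumed
// latency in cycles. The latencies are illustrative inputs, not measurements.
struct InstructionKind
{
    std::uint64_t count;
    std::uint64_t latencyCycles;
};

// What an instruction counter sees: every instruction weighs the same.
std::uint64_t instructionCount(const std::vector<InstructionKind> &kinds)
{
    std::uint64_t total = 0;
    for (const InstructionKind &kind : kinds)
    {
        total += kind.count;
    }
    return total;
}

// A crude time model: each instruction is weighted by its latency.
// It ignores pipelining, out-of-order execution and memory stalls.
std::uint64_t latencyWeightedCycles(const std::vector<InstructionKind> &kinds)
{
    std::uint64_t total = 0;
    for (const InstructionKind &kind : kinds)
    {
        assert(kind.latencyCycles > 0); // a zero latency would hide the instruction
        total += kind.count * kind.latencyCycles;
    }
    return total;
}
```

Feeding it `{{10000000, 1}}` for the integer loop and `{{10000000, 4}}` for the floating-point loop gives identical instruction counts but a fourfold difference in weighted cycles. That is the kind of difference a pure instruction count cannot express.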

System call performance cannot be measured

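This one is mainly a consequence of how callgrind is implemented. callgrind in effect runs the program on a virtual machine (a synthetic CPU), executing each instruction of the program under its own control and counting it as it goes. A system call, however, cannot be handled this way. It has to be handed over to the real CPU, and the kernel-space implementation of the system call also runs on the real CPU, so callgrind cannot count those instructions. Hence the following example: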

#include "tool.h"
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <stdlib.h> // malloc, free

const int MB_BYTES = 1024 * 1024;

void read1M()
{
    Logger l("read1M");
    int fd = open("./1M.txt", O_RDONLY);
    assert(fd >= 0);
    const int SIZE = 1 * MB_BYTES;
    char *buf = static_cast<char *>(malloc(SIZE));
    int result = read(fd, buf, SIZE);
    assert(result == SIZE); // the whole file should have been read
    close(fd);
    free(buf); // release the buffer allocated above
}

void read10M()
{
    Logger l("read10M");
    int fd = open("./10M.txt", O_RDONLY);
    assert(fd >= 0);
    const int SIZE = 10 * MB_BYTES;
    char *buf = static_cast<char *>(malloc(SIZE));
    int result = read(fd, buf, SIZE);
    assert(result == SIZE); // the whole file should have been read
    close(fd);
    free(buf); // release the buffer allocated above
}

int main()
{
    read1M();
    read10M();
    return 0;
}

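1M.txt is a 1 MB file and 10M.txt is a 10 MB file. We would naturally expect reading the former to be faster than reading the latter, even if the time is not necessarily proportional to the size. The results are as follows, matching our expectations: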

read1M cost 1.801 ms
read10M cost 8.935 ms

callgrind's statistics, however, look like this:

[Screenshot: callgrind results for read1M and read10M]
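
One way to see how much of a run happens inside the kernel, and is therefore invisible to an instruction counter in user space, is to ask the operating system for the process's user and system CPU time. The following is a minimal sketch using the POSIX getrusage call; the CpuTimes type and the cpuTimesNow and toMs helpers are hypothetical names introduced here, not part of tool.h.

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/time.h>

struct CpuTimes
{
    double userMs;   // CPU time spent in user space
    double systemMs; // CPU time spent in the kernel on behalf of this process
};

// Converts a timeval (seconds + microseconds) to milliseconds.
static double toMs(const timeval &tv)
{
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// Snapshot of the CPU time consumed so far by the calling process.
CpuTimes cpuTimesNow()
{
    struct rusage usage;
    const int rc = getrusage(RUSAGE_SELF, &usage);
    assert(rc == 0); // fails only for invalid arguments
    (void)rc;        // avoid an unused-variable warning when NDEBUG is defined
    return CpuTimes{toMs(usage.ru_utime), toMs(usage.ru_stime)};
}
```

Taking one snapshot before read10M() and one after, then subtracting the systemMs fields, gives a rough idea of the kernel-side cost. The accounting resolution may be too coarse for a call that lasts only a few milliseconds, so repeating the read in a loop may be necessary to get a usable number.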

Summary

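I have listed three factors here to illustrate the problem; in practice there may be more. As a program grows, these factors stack up and jointly distort callgrind's statistics, making the data less useful as a reference.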

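I am certainly not saying callgrind is useless. But when your boss asks which module is the performance bottleneck, do not simply hand over a callgrind report. That data is for our own reference only.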
