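1. Background

I have recently been tuning BERT training performance with kernels generated by an AI compiler. While analyzing the BERT timeline, I found that in every step the first two cinn_instruction_run calls are always followed by a blank of about 2.5 ms on the GPU. The host had actually issued the CUDA API call much earlier, so why does the GPU wait that long before executing?

According to the NVIDIA official forum, the latency of a CUDA kernel launch is normally on the order of microseconds.

From the NVIDIA official documentation: CUDA kernel launch latency could be defined as the time range from the beginning of the launch API call to the beginning of the kernel execution. There are about 20 µs of launch latency. If the launch API call takes 10 µs on your system, you can only launch at most 100,000 kernels per second.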
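The arithmetic in the quoted passage is easy to check, and it also puts the 2.5 ms blank in perspective. The sketch below is only a back-of-the-envelope calculation: the function names are made up for illustration, and the inputs are the 10 µs, 20 µs, and 2.5 ms figures mentioned above.

```python
def max_launches_per_second(api_call_us: float) -> float:
    """Upper bound on launches per second if every launch API call
    occupies the CPU thread for api_call_us microseconds."""
    return 1_000_000 / api_call_us


def gap_in_launch_latencies(gap_ms: float, launch_latency_us: float) -> float:
    """How many typical launch latencies fit into an observed GPU blank."""
    return gap_ms * 1000 / launch_latency_us


# 10 us per API call gives the 100,000 launches/s quoted in the documentation.
print(max_launches_per_second(10))        # 100000.0

# A 2.5 ms blank is 125 times the ~20 us launch latency quoted above.
print(gap_in_launch_latencies(2.5, 20))   # 125.0
```

In other words, the observed blank is roughly two orders of magnitude larger than an ordinary launch latency, so it cannot be explained by launch overhead alone.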

From a discussion on the NVIDIA official forum: Kernel launch overhead is frequently cited as 5 microseconds. My understanding of the PCIe transactions is limited, but best I know a kernel launch requires at least two transactions: (1) host sending a kernel launch command to the GPU (2) GPU sending an acknowledgement back to the host.

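2. Studying the NVIDIA documentation

2.1 Official documentation

According to NVIDIA's official documentation, the overhead consists mainly of the following parts:

CPU wrapper
memory
GPU launch overhead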

2.2 CPU wrapper

This part covers mutex-lock contention that occurs inside the driver when kernels are launched from multiple threads. If such contention occurs, pthread_mutex_lock calls appear in the OS runtime row of the Nsight timeline.

Original text: This includes any mutex-lock contention that occurs in the driver if doing multi-threaded launching. You can see if you are hitting mutex contention within the driver by collecting OS Runtime data, which shows any pthread_mutex_lock calls lasting above a user-settable threshold.

2.3 memory

This part is mainly the cost of moving data, such as H2D, D2H, and D2D copies.

Original text: This is the overhead of moving data back and forth from the CPU to the GPU, or from one GPU to another. For example, this would be the time it takes to read the input tensors and writing output to DRAM.

2.4 GPU launch overhead

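This part is the time between the GPU retrieving a command and the GPU starting to execute it.

Original text: This is the time it takes for the GPU to retrieve the command and begin executing it.

The main contributing factors are:

Different contexts may be active on the GPU, and running a new application requires a context switch. For example, if the GPU is rendering the PC desktop, it has to switch contexts before it can run another command. A context switch, if it happens, is shown when GPU Context Switch data is collected. (Green means no switch took place.)
The GPU may be blocked by an earlier command, which triggers a wait.
CUDA supports multiple streams; the kernels within a stream execute sequentially, and memcpys must also execute in order.
The GPU must honor priorities and run higher-priority kernels first.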

The NVIDIA documentation also contains an important piece of information: Nsight additionally samples CPU IP/backtrace information in real time (the sampling points in the yellow box in the figure above), which helps determine what the CPU is doing at a given moment.

Original text: Sampling data was also collected, as you can see by the orange/yellow marks below the thread state timeline. Each mark represents the point when a CPU IP/backtrace sample was collected. When this screenshot was captured, the mouse (not shown) was hovering on the sampling mark just above the left side of the tooltip. The tooltip shows the CPU IP/backtrace for that thread at that moment. Looking at the vectorAdd source code, you can easily see the application was checking the results of the GPU’s calculation.

3. GLOG_v logs and source code

First, here is the code of the kernel that follows the blank:

function fn_broadcast_to_224_elementwise_add_225_reshape_264_transpose_303_1614_kernel (_linear_2__b_0, _var_1137, _var_1381)
if ((blockIdx.x < 12288)) {
  if ((threadIdx.x < 1024)) {
    var_1381[((1024 * blockIdx.x) + threadIdx.x)] = (var_1137[((((blockIdx.x % 96) / 8) * 64) + ((768 * (threadIdx.x / 64)) + (((blockIdx.x / 96) * 98304) + ((12288 * (blockIdx.x % 8)) + (threadIdx.x % 64)))))] + linear_2__b_0[(((threadIdx.x % 64) + (((blockIdx.x % 96) / 8) * 64)) % 768)])
  }
}
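The index arithmetic in this kernel is easier to read once it is checked against the op names in the kernel's name. The NumPy sketch below is my own verification, not CINN code. It assumes the output is laid out as [128, 12, 128, 64] (batch, heads, sequence, head size) and the input as [128, 128, 768], and it samples random (blockIdx.x, threadIdx.x) pairs to confirm that the generated indices correspond to a bias add followed by reshape and transpose.

```python
import numpy as np

# Assumed layout: batch, sequence length, heads, head size (H * D = 768).
N, S, H, D = 128, 128, 12, 64


def kernel_input_index(b, t):
    """Flat read index into var_1137, copied from the generated kernel."""
    return (((b % 96) // 8) * 64) + ((768 * (t // 64)) +
            (((b // 96) * 98304) + ((12288 * (b % 8)) + (t % 64))))


def kernel_bias_index(b, t):
    """Flat read index into linear_2__b_0, copied from the generated kernel."""
    return ((t % 64) + (((b % 96) // 8) * 64)) % 768


rng = np.random.default_rng(0)
b = rng.integers(0, 12288, size=100_000)   # sampled blockIdx.x values
t = rng.integers(0, 1024, size=100_000)    # sampled threadIdx.x values

# Element written by (b, t), interpreted as an index into [N, H, S, D].
n, h, s, d = np.unravel_index(1024 * b + t, (N, H, S, D))

# The same element before the transpose lives at [n, s, h, d] in [N, S, H, D],
# which is the [N, S, 768] input flattened.
expected_input = np.ravel_multi_index((n, s, h, d), (N, S, H, D))
expected_bias = h * D + d

mapping_matches = bool(np.array_equal(kernel_input_index(b, t), expected_input))
bias_matches = bool(np.array_equal(kernel_bias_index(b, t), expected_bias))
print(mapping_matches, bias_matches)
```

Under these assumptions the kernel is a plain elementwise gather and add, with nothing in it that would explain a millisecond-scale delay.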

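However, analyzing the initial blank across different steps shows that it does not always appear in the same place. In BERT training, the first few kernels correspond to three parallel pairs of matmul + fn_broadcast_to_elementwise_add_reshape_transpose, abbreviated below as matmul + fn_xx:

Scenario 1: the large blank appears after fn_xx
Scenario 2: the large blank appears after matmul
So the large blank is not tightly coupled to any particular kernel, and some other underlying cause may be involved. The first thing to find out is therefore which factor actually affects this latency.

Looking at the timelines of different steps as a whole, the GPU blank is not stable from step to step: in some steps GPU utilization is fairly good, while in others there are noticeably more blanks, as the timeline figure shows.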
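Comparing steps by eye is error-prone, so it can help to list the idle gaps programmatically from exported kernel start and end times. The sketch below is a minimal illustration, not part of the original analysis: the function name, the tuple layout, and the sample timings are all made up, with the 2.5 ms gap chosen to mirror the blank described above.

```python
from typing import List, Tuple

Kernel = Tuple[str, float, float]   # (name, start_ms, end_ms), hypothetical layout


def find_idle_gaps(kernels: List[Kernel], threshold_ms: float) -> List[Tuple[str, str, float]]:
    """Return (previous kernel, next kernel, gap) for every idle period
    on the stream that is at least threshold_ms long."""
    if not kernels:
        return []
    ordered = sorted(kernels, key=lambda k: k[1])
    gaps = []
    prev_name, _, busy_until = ordered[0]
    for name, start, end in ordered[1:]:
        gap = start - busy_until
        if gap >= threshold_ms:
            gaps.append((prev_name, name, gap))
        if end > busy_until:
            prev_name, busy_until = name, end
    return gaps


# Made-up timings for one step, in milliseconds.
step = [
    ("matmul_0", 0.0, 0.25),
    ("fn_xx_0", 0.25, 0.5),
    ("matmul_1", 3.0, 3.25),
]
gaps = find_idle_gaps(step, threshold_ms=1.0)
print(gaps)
```

Running this per step gives a list of which kernel each large blank follows, which is the comparison made manually in the two scenarios above.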

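4. Cross-checking on a new A100 machine

The detailed timeline analysis on A100 machine A contradicted what experience would suggest. To rule out machine effects, I installed Nsight on A100 machine B, which had previously been taken offline from the distributed job queue, and used the same script to capture another set of timeline files. (A is a development machine shared by many people; B has few users.)

The timeline shows that the situation on the new A100 machine is simpler: the blank consistently appears after the cuBLAS API call of the first matmul, and the duration of the HgemmStridedBatched call matches the blank exactly. This is completely different from machine A, where, in the timeline shown at the beginning of this document, the HgemmStridedBatched API call is offset in time from the activity on the GPU stream.

HgemmStridedBatched may therefore be a lead worth digging into. First, compare against the kernels in the timeline of the native dynamic-to-static run. As the figure below shows, they fall into two main categories: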

GemmEx
GemmStridedBatchedEx

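CINN uses HgemmStridedBatched uniformly. I looked at the source of the entry function for this API call and added VLOG statements in each branch to print the necessary information and see which branches are taken: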

void cinn_call_cublas(void *v_args,
                      int num_args,
                      bool trans_a,
                      bool trans_b,
                      bool trans_o,
                      float alpha,
                      float beta,
                      int a1,
                      int a2,
                      int a3,
                      int a4,
                      int b1,
                      int b2,
                      int b3,
                      int b4,
                      void *stream) {
                      
    // omitted
    CUBLAS_CALL(cublasGemmStridedBatched(cuda_dtype,
                                         cuhandle,
                                         trans_op_l,
                                         trans_op_r,
                                         m,
                                         n,
                                         k,
                                         alpha,
                                         lhs,
                                         ldl,
                                         stride_l,
                                         rhs,
                                         ldr,
                                         stride_r,
                                         beta,
                                         C,
                                         ldc,
                                         m * n,
                                         batch));
 
    // omitted
}

Scenario 1: [128, 128, 768] * [768, 768]. CINN calls cublasGemmStridedBatched here, whereas the main framework calls GemmEx, so this is not quite what we expect.

I0508 09:34:19.758335 99667 cuda_util.cc:134] a1: 1
I0508 09:34:19.758383 99667 cuda_util.cc:135] a2: 128
I0508 09:34:19.758399 99667 cuda_util.cc:136] a3: 128
I0508 09:34:19.758404 99667 cuda_util.cc:137] a4: 768
I0508 09:34:19.758407 99667 cuda_util.cc:138] b1: 1
I0508 09:34:19.758412 99667 cuda_util.cc:139] b2: 1
I0508 09:34:19.758419 99667 cuda_util.cc:140] b3: 768
I0508 09:34:19.758422 99667 cuda_util.cc:141] b4: 768
I0508 09:34:19.758430 99667 cuda_util.cc:183] call cublasGemmStridedBatched with batch 128， isl: 0 isr: 98304

Scenario 2: [128, 12, 128, 64] * [128, 12, 128, 64], same shapes, trans_b = True. As expected.

I0508 09:47:09.494791 100378 cuda_util.cc:134] a1: 128
I0508 09:47:09.494797 100378 cuda_util.cc:135] a2: 12
I0508 09:47:09.494799 100378 cuda_util.cc:136] a3: 128
I0508 09:47:09.494801 100378 cuda_util.cc:137] a4: 64
I0508 09:47:09.494804 100378 cuda_util.cc:138] b1: 128
I0508 09:47:09.494807 100378 cuda_util.cc:139] b2: 12
I0508 09:47:09.494809 100378 cuda_util.cc:140] b3: 128
I0508 09:47:09.494812 100378 cuda_util.cc:141] b4: 64
I0508 09:47:09.494813 100378 cuda_util.cc:142] trans_a: 0
I0508 09:47:09.494817 100378 cuda_util.cc:143] trans_b: 1
I0508 09:47:09.494818 100378 cuda_util.cc:144] trans_o: 0
I0508 09:47:09.494822 100378 cuda_util.cc:217] call cublasGemmStridedBatched sl: 8192 sr: 8192

Scenario 3: [128, 12, 128, 128] * [128, 12, 128, 64]. As expected.

I0508 09:47:09.495852 100378 cuda_util.cc:134] a1: 128
I0508 09:47:09.495857 100378 cuda_util.cc:135] a2: 12
I0508 09:47:09.495860 100378 cuda_util.cc:136] a3: 128
I0508 09:47:09.495862 100378 cuda_util.cc:137] a4: 128
I0508 09:47:09.495865 100378 cuda_util.cc:138] b1: 128
I0508 09:47:09.495867 100378 cuda_util.cc:139] b2: 12
I0508 09:47:09.495870 100378 cuda_util.cc:140] b3: 128
I0508 09:47:09.495872 100378 cuda_util.cc:141] b4: 64
I0508 09:47:09.495874 100378 cuda_util.cc:142] trans_a: 0
I0508 09:47:09.495877 100378 cuda_util.cc:143] trans_b: 0
I0508 09:47:09.495879 100378 cuda_util.cc:144] trans_o: 0
I0508 09:47:09.495882 100378 cuda_util.cc:217] call cublasGemmStridedBatched sl: 8192 sr: 16384
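The batch counts and strides in these logs can be reproduced from the a1..a4 and b1..b4 values. The sketch below is a reconstruction that is merely consistent with the three logs, not CINN's actual code. It assumes that the first two dimensions form the batch, that an operand with batch size 1 gets stride 0 so it is reused for every batch, and that the operands are swapped before the cuBLAS call (left = B, right = A), as is common when driving column-major cuBLAS with row-major data. The function name and the returned keys are made up.

```python
def strided_batched_args(a_dims, b_dims):
    """Derive batch count and per-operand strides from the logged dimensions.

    Assumptions (not taken from the CINN source): dims 1 and 2 are batch
    dimensions, a batch-1 operand has stride 0, and the operands are swapped.
    """
    a1, a2, a3, a4 = a_dims
    b1, b2, b3, b4 = b_dims
    a_batch, b_batch = a1 * a2, b1 * b2
    stride_a = a3 * a4 if a_batch > 1 else 0
    stride_b = b3 * b4 if b_batch > 1 else 0
    return {
        "batch": max(a_batch, b_batch),
        "stride_l": stride_b,   # left operand of the cuBLAS call is B
        "stride_r": stride_a,   # right operand of the cuBLAS call is A
    }


scenario_1 = strided_batched_args((1, 128, 128, 768), (1, 1, 768, 768))
scenario_2 = strided_batched_args((128, 12, 128, 64), (128, 12, 128, 64))
scenario_3 = strided_batched_args((128, 12, 128, 128), (128, 12, 128, 64))
print(scenario_1)   # batch 128, stride_l 0, stride_r 98304, as in the first log
print(scenario_2)   # stride_l 8192, stride_r 8192, as in the second log
print(scenario_3)   # stride_l 8192, stride_r 16384, as in the third log
```

Scenario 1 is the odd one out: the weight has stride 0, so all 128 batches multiply by the same [768, 768] matrix, which is why a single GEMM would do.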

5. Optimization ideas

5.1 Minimal reproduction

I extracted a minimal test case from the BERT model structure:

#!/usr/bin/env python3
# Please set "export PYTHONPATH=${CINN_ROOT}/build/python:${PYTHONPATH}" first
import paddle
import unittest
import numpy as np
import cinn
from cinn.frontend import *
from cinn.common import *
from op_test import OpTest

class TestGroup(unittest.TestCase):
  def test_group(self):
    builder = NetBuilder("matmul")
    x_shape = [128, 128, 768]
    y_shape = [768, 768]

    x = builder.create_input(Float16(),x_shape, "x")
    y = builder.create_input(Float16(), y_shape, "y")
    out = builder.matmul(
            x, y, transpose_x=False, transpose_y=False)

    feed_list = [x, y]
    fetch_list = [out]

    prog = builder.build()

    feed_data = [OpTest.random(shape=var.shape(), dtype=var.type()) for var in feed_list]
    result = prog.build_and_get_output(DefaultNVGPUTarget(), feed_list, feed_data, fetch_list)

    result = [res.numpy(DefaultNVGPUTarget()) for res in result]
    for i in range(len(result)):
      info_str = fetch_list[i].name()
      info_str += ", shape=" + str(result[i].shape)
      info_str += ", dtype=" + str(result[i].dtype) + ":\n"
      print(info_str)


if __name__ == "__main__":
  unittest.main()

5.2 Fix PR

The fix follows the main framework by porting its y_batch_size=1 && trans_a = False branch into CINN; see PR: https://github.com/PaddlePaddle/CINN/pull/1407
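The reason such a branch can avoid the strided-batched call is that when the right-hand matrix has batch size 1 and the left operand is not transposed, the batch dimension can be folded into the row dimension and the whole product becomes one ordinary GEMM. The NumPy sketch below only demonstrates that equivalence. It is my reading of what the ported branch relies on, not code from the PR, and it uses small float64 stand-ins for the real FP16 [128, 128, 768] * [768, 768] shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
B, M, K, N = 4, 8, 16, 16            # small stand-ins for 128, 128, 768, 768
x = rng.standard_normal((B, M, K))   # batched left operand
y = rng.standard_normal((K, N))      # shared right operand (batch size 1)

# Strided-batched view: B independent [M, K] x [K, N] products that all
# reuse the same y (stride 0 on y).
batched = np.stack([x[i] @ y for i in range(B)])

# Folded view: a single [B * M, K] x [K, N] GEMM, reshaped back afterwards.
folded = (x.reshape(B * M, K) @ y).reshape(B, M, N)

same_result = bool(np.allclose(batched, folded))
print(same_result)
```

The fold only works because x is contiguous and untransposed, which matches the trans_a = False condition named above.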

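5.3 Measuring the benefit

Measured on machine B, BERT training showed no obvious gain. The timeline captured with Nsight shows that the GPU blank is still there.

5.4 Consulting NVIDIA engineers

I consulted engineers at NVIDIA, who replied that the call takes longer because cuBLAS is loading the kernel into memory for the first time. They also noted that the main framework has introduced cuBLASLt and caches the creation of descriptors inside the API, which may help, whereas CINN has no such mechanism.
One way to address this is to follow the main framework and implement an AutoTune + Cache mechanism.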
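The general shape of an AutoTune + Cache mechanism can be sketched without any GPU code: time each candidate implementation once per problem shape, remember the winner, and skip the search on later calls with the same key. The class below is a hypothetical CPU-only illustration of that pattern. The class name, the candidate functions, and the cache key are all made up; a real implementation in the main framework would time cuBLASLt algorithms and cache their descriptors instead.

```python
import time
from typing import Callable, Dict, Hashable, Sequence


def matmul_loops(a, b):
    """Candidate 1: plain triple loop."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
    return out


def matmul_zip(a, b):
    """Candidate 2: row-by-column dot products via zip."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class AutoTuneCache:
    """Pick the fastest candidate once per key and reuse that choice afterwards."""

    def __init__(self, candidates: Sequence[Callable], repeats: int = 3):
        self._candidates = list(candidates)
        self._repeats = repeats
        self._best: Dict[Hashable, Callable] = {}
        self.tune_count = 0   # how many times the timing search actually ran

    def run(self, key: Hashable, *args):
        impl = self._best.get(key)
        if impl is None:
            impl = self._tune(key, args)
        return impl(*args)

    def _tune(self, key: Hashable, args) -> Callable:
        self.tune_count += 1
        best_impl, best_time = None, float("inf")
        for impl in self._candidates:
            impl(*args)   # warm-up call, excluded from the measurement
            start = time.perf_counter()
            for _ in range(self._repeats):
                impl(*args)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_impl, best_time = impl, elapsed
        self._best[key] = best_impl
        return best_impl


cache = AutoTuneCache([matmul_loops, matmul_zip])
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
key = (2, 2, 2)            # (M, N, K) used as the cache key
out = cache.run(key, a, b)
out_again = cache.run(key, a, b)   # served from the cache, no second search
print(out, cache.tune_count)
```

The warm-up call inside the tuning loop matters for the same reason the NVIDIA engineers gave: the first call pays a one-time loading cost that should not be attributed to the algorithm being timed.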
