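CUDA Matrix Multiplication

Background

In most cases we do not need to implement matrix multiplication ourselves: NVIDIA ships the CUDA-based cuBLAS library, and calling its routines is enough. Occasionally, though, we need our own implementation. Here we write our own version of the cublasSgemm function.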

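Introduction to cublasSgemm

The operation performed by cublasSgemm can be written as:

$C = \alpha \, \mathrm{op}(A) \, \mathrm{op}(B) + \beta \, C$, where $\alpha$ and $\beta$ are scalars, $A$, $B$ and $C$ are matrices, and $\mathrm{op}(X)$ is either $X$ or its transpose $X^T$.

The prototype of cublasSgemm is:

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc)

- Parameters:
  op(A) is an m x k matrix, op(B) is a k x n matrix, and C is an m x n matrix.
  Note: if A is to be transposed, then A^T is the m x k matrix (so A itself is stored as k x m); if A is not transposed, A is the m x k matrix. The same applies to B.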
  lda >= (transa == CUBLAS_OP_N) ? max(1,m) : max(1,k)
  ldb >= (transb == CUBLAS_OP_N) ? max(1,k) : max(1,n)
  ldc >= max(1,m)
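  Note: ld is short for leading dimension. cuBLAS stores matrices in column-major order, so the leading dimension is the number of elements between the start of one column and the start of the next in memory. The constraint is >= rather than = because the columns may be padded, for example for memory alignment.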
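To make the formula and the role of lda, ldb and ldc concrete, here is a minimal CPU reference sketch. It is not taken from the original source file: the names `Op`, `at` and `sgemm_ref` are hypothetical, `Op` stands in for cublasOperation_t so the code builds without CUDA, and column-major storage is assumed as in cuBLAS.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for cublasOperation_t (assumption, not the cuBLAS type).
enum class Op { N, T };

// Element (row, col) of op(X), where X is stored column-major
// with leading dimension ld.
inline float at(const float* X, Op op, int row, int col, int ld) {
    return (op == Op::N) ? X[col * ld + row]   // X(row, col)
                         : X[row * ld + col];  // X^T(row, col) == X(col, row)
}

// C = alpha * op(A) * op(B) + beta * C, with op(A) m x k, op(B) k x n, C m x n.
void sgemm_ref(Op transa, Op transb, int m, int n, int k,
               float alpha, const float* A, int lda,
               const float* B, int ldb,
               float beta, float* C, int ldc) {
    assert(ldc >= m);  // same constraint as ldc >= max(1, m)
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float sum = 0.0f;
            for (int i = 0; i < k; ++i) {
                sum += at(A, transa, row, i, lda) * at(B, transb, i, col, ldb);
            }
            float& c = C[col * ldc + row];
            c = alpha * sum + beta * c;
        }
    }
}
```

A reference like this is mainly useful for checking a GPU kernel on small inputs. Note that no matrix is physically transposed; only the index computation in `at` changes.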

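Implementation

We implement the kernel in two ways, without and with optimization; the complete program is available from the links in the Resources section. Note: although A and B may need to be transposed, no physical transpose is required. It is enough to compute the element addresses differently.

Method 1: no optimization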

__global__ void
matrixMul_(cublasHandle_t handle,
       cublasOperation_t transA, cublasOperation_t transB,
       int M, int N, int K,
       const float alpha,
       const float *A, int lda,
       const float *B, int ldb,
       const float beta,
       float *C, int ldc)
{
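    // Each block computes one element of C and the kernel does no bounds
    // checking, so it assumes a launch with grid (N, M) and one thread per block
    int bx = blockIdx.x;   // column index in C
    int by = blockIdx.y;   // row index in C

    float sum = 0.0f;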
    for(int i = 0; i < K; i++) {
        int a_addr = (transA == CUBLAS_OP_T) ? _GET_ADDR_T(by, i, lda) : _GET_ADDR_N(by, i, lda);
        int b_addr = (transB == CUBLAS_OP_T) ? _GET_ADDR_T(i, bx, ldb) : _GET_ADDR_N(i, bx, ldb);
        sum += A[a_addr] * B[b_addr];
    }

    int c_addr = _GET_ADDR_N(by, bx, ldc);
    C[c_addr] = alpha * sum + beta * C[c_addr];
}

Note: _GET_ADDR_T and _GET_ADDR_N compute the element address with and without transposition, respectively.
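The two macros are not shown in this post. One plausible pair of definitions, assuming column-major storage as in cuBLAS and the argument order (row, col, ld) used by the kernels, might look like the sketch below. These definitions are an assumption and may differ from the ones in the original source file.

```cpp
#include <cassert>

// Assumed definitions (not from the original source file).
// Column-major storage: element (row, col) lives at col * ld + row.
#define _GET_ADDR_N(row, col, ld) ((col) * (ld) + (row))

// Element (row, col) of the transposed view is element (col, row)
// of the matrix as stored, so the two indices swap roles.
#define _GET_ADDR_T(row, col, ld) ((row) * (ld) + (col))
```

With ld = 4, for example, `_GET_ADDR_N(2, 1, 4)` and `_GET_ADDR_T(1, 2, 4)` both evaluate to 6: element (1, 2) of the transposed view is element (2, 1) of the stored matrix.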

Method 2: with optimization

// C = alpha * op(A) * op(B) + beta * C
// handle: unused, kept only to mirror the cublasSgemm signature
template <int BLOCK_SIZE> __global__ void
matrixMul_2(cublasHandle_t handle,
        cublasOperation_t transa, cublasOperation_t transb,
        int m, int n, int k,
        const float alpha,   // by value, not float*: a pointer to host memory cannot be dereferenced in device code
        const float *A, int lda,
        const float *B, int ldb,
        const float beta,    // by value, for the same reason as alpha
        float *C, int ldc)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Number of tiles along the k dimension (columns of op(A), rows of op(B)),
    // rounded up so that a partial last tile is still processed
    int steps = (k + BLOCK_SIZE - 1) / BLOCK_SIZE;

    // Csub accumulates the element of the block sub-matrix
    // that is computed by this thread
    float Csub = 0.0f;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for(int step = 0; step < steps; ++step)
    {
        // Shared-memory tiles As and Bs hold the current
        // sub-matrices of A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from device memory
        // to shared memory, each thread loads
        // one element of each matrix
        int a_x = BLOCK_SIZE * step + tx;
        int a_y = BLOCK_SIZE * by   + ty;
        int a_addr = (transa == CUBLAS_OP_T) ? _GET_ADDR_T(a_y, a_x, lda)
                                             : _GET_ADDR_N(a_y, a_x, lda);
        As[ty][tx] = _OUT_OF_RANGE(a_y, m, a_x, k) ? 0.0f
                                                   : A[a_addr];

        int b_x = BLOCK_SIZE * bx   + tx;
        int b_y = BLOCK_SIZE * step + ty;
        int b_addr = (transb == CUBLAS_OP_T) ? _GET_ADDR_T(b_y, b_x, ldb)
                                             : _GET_ADDR_N(b_y, b_x, ldb);
        Bs[ty][tx] = _OUT_OF_RANGE(b_y, k, b_x, n) ? 0.0f
                                                   : B[b_addr];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two tiles together;
        // each thread computes one element
        // of the block sub-matrix
        for(int bs = 0; bs < BLOCK_SIZE; ++bs)
        {
            Csub += As[ty][bs] * Bs[bs][tx];
        }

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    int c_x = bx * BLOCK_SIZE + tx;
    int c_y = by * BLOCK_SIZE + ty;
    int c_addr = _GET_ADDR_N(c_y, c_x, ldc);
    if(!_OUT_OF_RANGE(c_y, m, c_x, n)) {
        C[c_addr] = alpha * Csub + beta * C[c_addr];
    }
}

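Note: __syncthreads() synchronizes the threads within a block, and BLOCK_SIZE sets the size of each tile. Because every thread loads exactly one element of As and one of Bs, the kernel must be launched with BLOCK_SIZE x BLOCK_SIZE threads per block.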
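The kernel relies on _OUT_OF_RANGE and on a grid large enough to cover C, neither of which is shown here. The following host-side sketch illustrates both under stated assumptions: the macro definition is a guess based on how the kernel calls it (row, row count, column, column count), and `Dim2` and `grid_for` are hypothetical names, with `Dim2` standing in for CUDA's dim3 so the code builds without nvcc.

```cpp
#include <cassert>

// Assumed definition (not from the original source file): true when
// (row, col) falls outside a rows x cols matrix.
#define _OUT_OF_RANGE(row, rows, col, cols) ((row) >= (rows) || (col) >= (cols))

// Plain stand-in for CUDA's dim3 (hypothetical helper type).
struct Dim2 { int x; int y; };

// Number of BLOCK_SIZE x BLOCK_SIZE tiles needed to cover an m x n matrix C.
// x spans the columns (n) and y spans the rows (m), matching bx and by
// in the kernel. Both divisions round up so partial tiles are included.
inline Dim2 grid_for(int m, int n, int block_size) {
    assert(block_size > 0);
    return Dim2{ (n + block_size - 1) / block_size,
                 (m + block_size - 1) / block_size };
}
```

For m = 100, n = 70 and BLOCK_SIZE = 32 this gives a 3 x 4 grid. In CUDA code the launch would then look roughly like `matrixMul_2<32><<<dim3(g.x, g.y), dim3(32, 32)>>>(...)`; threads that land outside C are filtered by the _OUT_OF_RANGE check before the final write.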

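Comparison
- Method 1 uses no optimization: every operand is read from GPU global memory, which is slow. Computing one element of C takes $k$ multiplications and $2k$ loads (counting A and B only), so $\alpha\,\mathrm{op}(A)\,\mathrm{op}(B)$ needs $2mnk$ loads in total, and $\alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$ needs $2mnk + mn$ global reads and $mn$ writes.
- Method 2 uses shared memory: at each step a thread block loads one tile of A and one tile of B (each of size BLOCK_SIZE x BLOCK_SIZE) from global memory into shared memory, and each thread block computes one tile of C. Loading one pair of tiles costs $2 \cdot \mathrm{BLOCK\_SIZE}^2$ global reads. Computing one tile of C requires $\lceil k / \mathrm{BLOCK\_SIZE} \rceil$ pairs of tiles, that is, $2 \cdot \mathrm{BLOCK\_SIZE}^2 \cdot \lceil k / \mathrm{BLOCK\_SIZE} \rceil$ reads. C consists of $\lceil m / \mathrm{BLOCK\_SIZE} \rceil \cdot \lceil n / \mathrm{BLOCK\_SIZE} \rceil$ tiles, so the total number of global reads is $2 \cdot \mathrm{BLOCK\_SIZE}^2 \cdot \lceil k / \mathrm{BLOCK\_SIZE} \rceil \cdot \lceil m / \mathrm{BLOCK\_SIZE} \rceil \cdot \lceil n / \mathrm{BLOCK\_SIZE} \rceil \approx 2mnk / \mathrm{BLOCK\_SIZE}$. Computing $\alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$ therefore needs about $2mnk / \mathrm{BLOCK\_SIZE} + mn$ global reads and $mn$ writes.
- Comparing the two, shared memory reduces the number of global reads of A and B to $1 / \mathrm{BLOCK\_SIZE}$ of the original, while the number of global reads of C is unchanged. If the time spent reading shared memory and the reads of C are ignored, the running time of Method 2 is about $1 / \mathrm{BLOCK\_SIZE}$ of that of Method 1. In practice the speedup is lower, because Method 2 calls a synchronization function and reading shared memory also takes time. Method 2 can be illustrated by the figure below:

[Figure: illustration of Method 2]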
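The read counts in this comparison can be checked numerically. The sketch below evaluates both formulas for given sizes; the function names are hypothetical, and the tiled count is an upper bound because the kernel substitutes zeros for out-of-range elements instead of reading them.

```cpp
#include <cassert>
#include <cstdint>

// Integer division rounded up, i.e. ceil(a / b) for positive values.
inline std::int64_t ceil_div(std::int64_t a, std::int64_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Method 1: every multiplication reads one element of A and one of B
// from global memory, giving 2 * m * n * k reads.
inline std::int64_t naive_loads(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2 * m * n * k;
}

// Method 2: each tile of C needs ceil(k / bs) pairs of bs x bs tiles,
// and C has ceil(m / bs) * ceil(n / bs) tiles.
inline std::int64_t tiled_loads(std::int64_t m, std::int64_t n, std::int64_t k,
                                std::int64_t bs) {
    return 2 * bs * bs * ceil_div(k, bs) * ceil_div(m, bs) * ceil_div(n, bs);
}
```

For m = n = k = 1024 and BLOCK_SIZE = 32, Method 1 performs 2147483648 reads of A and B and Method 2 performs 67108864, a factor of exactly 32. When the sizes are not multiples of BLOCK_SIZE the ratio is somewhat smaller, which is why the text writes the total as an approximation.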
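
Row-major vs column-major
- Suppose we have the 3x3 matrix A = [1 2 3; 4 5 6; 7 8 9]. C/C++ arrays are row-major, so A is stored in memory as [1 2 3 4 5 6 7 8 9]. cuBLAS, however, assumes column-major storage, in which A is stored as [1 4 7 2 5 8 3 6 9].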
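The two storage orders can be reproduced with a short sketch. The `Matrix` alias and both function names are hypothetical and serve only to flatten the example matrix in each order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// a[row][col], used here only for illustration.
using Matrix = std::vector<std::vector<int>>;

// Row-major: walk each row in turn, so the elements of a row are adjacent.
std::vector<int> to_row_major(const Matrix& a) {
    std::vector<int> out;
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < a[r].size(); ++c)
            out.push_back(a[r][c]);
    return out;
}

// Column-major: walk each column in turn, so the elements of a column are adjacent.
std::vector<int> to_col_major(const Matrix& a) {
    std::vector<int> out;
    if (a.empty()) return out;
    for (std::size_t c = 0; c < a[0].size(); ++c)
        for (std::size_t r = 0; r < a.size(); ++r)
            out.push_back(a[r][c]);
    return out;
}
```

A practical consequence is that a row-major m x n buffer, read as column-major, is the n x m transpose of the intended matrix. This is one reason the transpose flags of cublasSgemm are useful when the data comes from ordinary C/C++ arrays.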

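Resources

Matrix multiplication source file
Build script
Note: the source file compares the cuBLAS routine with our own implementation. The results differ slightly, which is expected: because floating-point arithmetic has limited precision, the result of a*b*c may differ slightly from that of a*c*b.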
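The order dependence is easy to reproduce on the CPU. The sketch below uses addition rather than the multiplication mentioned in the note, because the order of accumulation is where two gemm implementations typically diverge. The function names are hypothetical, IEEE 754 single precision with round-to-nearest is assumed, and the volatile temporaries are there so that each intermediate result is rounded to float.

```cpp
#include <cassert>

// Evaluates (a + b) + c in single precision.
inline float sum_left(float a, float b, float c) {
    volatile float t = a + b;   // rounded to float before the next step
    volatile float r = t + c;
    return r;
}

// Evaluates a + (b + c) in single precision.
inline float sum_right(float a, float b, float c) {
    volatile float t = b + c;   // rounded to float before the next step
    volatile float r = a + t;
    return r;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, `sum_left` returns 1 while `sum_right` returns 0: neighbouring floats near 1e8 are 8 apart, so -1e8f + 1.0f rounds back to -1e8f. This input is chosen to exaggerate the effect; for typical matrix data the two orders differ only in the last few bits, so comparisons against cuBLAS are better done with a small tolerance than with exact equality.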
