[Tech Blog] GPU Programming: Implementing an MNIST CNN from Scratch

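Most people probably first encounter GPUs through games, where a high-performance GPU delivers an outstanding gaming experience. What has made the GPU familiar to a much wider audience, however, is the boom in machine learning and deep learning (some people also use GPUs for Bitcoin mining): large datasets and heavy computation demand faster processing, so GPU programming has become increasingly common. Deep learning practitioners often jokingly call themselves "alchemists", because the daily routine is: build a network, tune parameters, tune parameters, tune parameters... This is even more true for beginners, who reproduce one network after another yet still feel somewhat lost. I think this feeling comes from the "black box" nature of deep learning: most of our time goes into tuning parameters, while the computation inside the network remains vague to us. This article therefore tries to write a simple CNN from scratch with GPU programming, to understand from the inside how the network runs, in the hope of dispelling some of that confusion and of learning GPU programming through a simple application. The article has two parts:

An introduction to GPU programming and how to get started.
Building an MNIST CNN from scratch with GPU programming.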

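1 Introduction to GPU Programming

1.1 Introduction

Figure 1 compares the physical structure of a CPU and a GPU (image from the web). The colored squares in the figure represent processing cores. A CPU has few cores (only 8 in the figure), but each one is very powerful; a GPU has a great many cores, but each one has limited capability. The number of cores determines how many threads can run at the same time, so for computational workloads a GPU can run far more threads in parallel than a CPU, which is why a GPU can be used for acceleration. A GPU alone is not enough, though: complex control flow still has to be handled by the CPU, so today's computers generally pair a CPU with a GPU. The CPU controls the overall scheduling of the program and hands heavy computation to the GPU when needed.

Figure 1: Physical structure of a CPU compared with a GPU

1.2 GPU Programming with Numba

Nvidia provides a framework for programming its own GPUs (CUDA), with implementations for several programming languages. This article introduces Numba, a GPU programming model that supports Python. (Before continuing, make sure the Numba module is installed in your Python environment; if not, install it the same way as any other module.) In GPU programming, the CPU side is called the host and the GPU side the device. A function that runs on the CPU is a host function; a function that runs on the GPU and is called from the CPU is a kernel function; a function that runs on the GPU and is called from the GPU is a device function. Functions that run on the GPU must be defined with a dedicated decorator, for example: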

from numba import cuda
import numpy as np

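@cuda.jit()
def my_kernel(io_array):  # kernel function: runs on the GPU, launched from the CPU
    pass  # TODO: kernel body

@cuda.jit(device=True)
def device_func(io_array):  # device function: runs on the GPU, callable only from GPU code
    pass  # TODO: function body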

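Calling a kernel involves three steps: the host copies the data to the device, the host launches the kernel to compute on the GPU, and after the computation finishes the results are copied back to the host. One convenience of Numba is that these transfers need not be written explicitly: when a NumPy array is passed to a kernel, Numba copies it to the device and back automatically. Note that a kernel cannot return a value; a hallmark of GPU programming is that results are passed back through the arguments. How, then, do we control which thread processes what? By mapping array indices to thread ids. For example, to add 1 to every element of a matrix A, the code can be written directly as follows: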

from numba import cuda
import numpy as np

@cuda.jit()
def my_kernel(io_array):
    tx = cuda.threadIdx.x  # thread index along x within the block
    ty = cuda.threadIdx.y  # thread index along y within the block
    tz = cuda.blockIdx.x   # index of the block within the grid
    io_array[tz, tx, ty] += 1  # use the ids as array indices and update that element

A = np.zeros(shape=[10, 32, 32])
gridsize = (10)
blocksize = (32, 32)
my_kernel[gridsize, blocksize](A)  # special launch syntax: the kernel must be told how the threads are organized
print(A)

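The idea of the code above is to process each element of A with its own thread, using the rule that a thread's id corresponds to an index into A. gridsize and blocksize define a thread layout with the same structure as A; every thread in it runs the kernel, and the variables tx, ty and tz in the kernel hold the current thread's id, so each thread in fact handles only the element that its id points to. gridsize and blocksize can largely be chosen without considering the underlying physical structure, as long as hardware limits are respected (for example, at most 1024 threads per block on current Nvidia GPUs, which the 32 x 32 block here reaches exactly); the mapping from these logical threads to the physical hardware is handled for us. Real GPU programming involves many more considerations than can be listed here; see the official documentation for details. With the idea above as a foundation, though, we can already write a simple neural network.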
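
To make the index mapping concrete without needing a GPU, the launch can be imitated on the CPU with plain loops. This is only an illustrative sketch: emulate_launch and add_one are hypothetical helper names introduced here, not part of Numba, and the loops run sequentially, whereas on a real GPU the threads run in parallel.

```python
import numpy as np

def emulate_launch(kernel, grid_size, block_size, *args):
    """Call `kernel` once per (block, thread) index, sequentially on the CPU."""
    for bx in range(grid_size):              # plays the role of cuda.blockIdx.x
        for tx in range(block_size[0]):      # plays the role of cuda.threadIdx.x
            for ty in range(block_size[1]):  # plays the role of cuda.threadIdx.y
                kernel(bx, tx, ty, *args)

def add_one(bx, tx, ty, io_array):
    # Same body as my_kernel, with the thread ids passed in explicitly.
    io_array[bx, tx, ty] += 1

A = np.zeros(shape=(10, 32, 32))
emulate_launch(add_one, 10, (32, 32), A)
print((A == 1).all())  # every element was touched by exactly one "thread"
```

Each (bx, tx, ty) triple is visited exactly once, which is why every element of A ends up incremented exactly once.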

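2 Building a CNN from Scratch with the GPU

The main building blocks of a CNN are fully connected layers, convolutional layers, pooling layers, nonlinear activation layers and an output loss layer, each with many possible options. Here we build only what the MNIST CNN contains: a simple convolutional layer, a pooling layer, the ReLU activation, a fully connected layer and a cross-entropy loss output. This article focuses on implementing forward propagation and chain-rule differentiation of the convolutional and pooling layers on the GPU. The fully connected and activation computations involve only relatively simple matrix operations, so they are handled directly with NumPy. (For reasons of space, the principles and code of those operations are not covered here; the code for the whole network can be downloaded from my GitHub, linked at the end of the article.)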
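
For readers who want a rough picture of the NumPy-side pieces, the following is a minimal sketch of ReLU and softmax cross-entropy. It is not the code from the repository: the function names are hypothetical, labels are assumed to be integer class indices, and the backward function returns per-sample gradients without dividing by the batch size, on the assumption that the division happens later in the parameter update.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(x, 0)

def relu_backward(x, grad_out):
    # The gradient passes through only where the input was positive.
    return grad_out * (x > 0)

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_forward(logits, labels):
    probs = softmax(logits)
    n = logits.shape[0]
    loss = -np.log(probs[np.arange(n), labels]).mean()
    return loss, probs

def cross_entropy_backward(probs, labels):
    # d(loss_n)/d(logits_n) = probs_n - one_hot(label_n), per sample.
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    return grad

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])
loss, probs = cross_entropy_forward(logits, labels)
grad = cross_entropy_backward(probs, labels)
print(loss, grad.shape)
```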

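2.1 Pooling and Convolution on the GPU

We start with the pooling layer. Only a simple 2 x 2 max pooling operation is implemented here. Figure 2 (image from the web) shows how its forward and backward computations work.

Figure 2: Forward and backward computation of the pooling layer

Forward propagation is simple; backward propagation needs the Index matrix shown in the figure. The implementation here saves the input and output on every forward pass. During backpropagation it compares the input with the output: positions where the values are equal are the positions that the pooling selected, so the gradient is passed to those positions and all other positions are set to 0 (if several entries in a window tie for the maximum, each of them receives the gradient). The code is as follows: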

class MaxPooling2x2(object):
    def __init__(self):
        self.inputs = None
        self.outputs = None
        self.g_inputs = None

    def forward(self, inputs):
        self.inputs = inputs.copy()
        self.outputs = np.zeros(shape=(inputs.shape[0], inputs.shape[1]//2, inputs.shape[2]//2, inputs.shape[3]), dtype=np.float32)
        grid = (inputs.shape[3])
        block = (self.outputs.shape[1], self.outputs.shape[2])
        pool[grid, block](self.inputs, self.outputs)
        return self.outputs

    def backward(self, gradient_loss_this_outputs):
        self.g_inputs = np.zeros(shape=self.inputs.shape, dtype=np.float32)
        grid = (self.outputs.shape[3])
        block = (self.outputs.shape[1], self.outputs.shape[2])
        cal_pool_gradient[grid, block](self.inputs, self.outputs, gradient_loss_this_outputs, self.g_inputs)
        return self.g_inputs

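The core computations of the forward and backward passes are implemented as GPU kernels. In the forward pass, for a single sample each thread handles one 2 x 2 region and outputs its maximum, looping over the batch; the backward pass uses the same thread arrangement. The code is as follows: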

@cuda.jit()
def pool(inputs, outputs):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    d = cuda.blockIdx.x

    for i in range(inputs.shape[0]):
        outputs[i, tx, ty, d] = max(inputs[i, 2 * tx, 2 * ty, d], inputs[i, 2 * tx + 1, 2 * ty, d],
                                    inputs[i, 2 * tx, 2 * ty + 1, d], inputs[i, 2 * tx + 1, 2 * ty + 1, d])


@cuda.jit()
def cal_pool_gradient(inputs, outputs, gradient_to_outputs, grident_to_inputs):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    d = cuda.blockIdx.x

    for k in range(outputs.shape[0]):
        for i in range(2):
            for j in range(2):
                if outputs[k, tx, ty, d] == inputs[k, 2 * tx + i, 2 * ty + j, d]:
                    grident_to_inputs[k, 2 * tx + i, 2 * ty + j, d] = gradient_to_outputs[k, tx, ty, d]
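
When debugging kernels like these, it can help to compare them against a CPU reference. Below is a small NumPy sketch of the same 2 x 2 max pooling in NHWC layout; pool_forward_ref and pool_backward_ref are hypothetical names, the height and width are assumed to be even, and ties are handled the same way as in the kernel above (every tied position receives the gradient).

```python
import numpy as np

def pool_forward_ref(inputs):
    """2x2 max pooling with stride 2 on an (N, H, W, C) array."""
    n, h, w, c = inputs.shape
    windows = inputs.reshape(n, h // 2, 2, w // 2, 2, c)
    return windows.max(axis=(2, 4))

def pool_backward_ref(inputs, outputs, grad_outputs):
    """Route each output gradient to the input positions equal to the window max."""
    # Upsample outputs and gradients back to the input resolution.
    up_out = outputs.repeat(2, axis=1).repeat(2, axis=2)
    up_grad = grad_outputs.repeat(2, axis=1).repeat(2, axis=2)
    return np.where(inputs == up_out, up_grad, 0).astype(np.float32)

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
out = pool_forward_ref(x)
g = pool_backward_ref(x, out, np.ones_like(out))
print(out[0, :, :, 0])  # maximum of each 2x2 window
print(g[0, :, :, 0])    # 1 where the maximum was taken, 0 elsewhere
```

Running the GPU kernels and these functions on the same random batch and checking the results with np.allclose is one simple way to gain confidence in the kernel code.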

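The principles of forward and backward propagation through a convolution are not repeated here. In the implementation, the convolutional layer likewise saves its input and output. In the forward pass, the thread layout is given the same size as the output tensor so that each thread computes one output, looping batchSize times to complete all outputs. In the backward pass, the thread layout is given the same size as the gradient of the loss with respect to this layer's output; each thread computes the gradient contributions that its output element makes to the related inputs and parameters and accumulates them, again looping batchSize times. Once all threads have finished, the results are the gradients of the loss with respect to the inputs and to the parameters. The code is as follows: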

class Conv2D(object):
    def __init__(self, in_channels, kernel_size, features):
        self.features = features
        self.ksize = kernel_size
        weights_scale = np.sqrt(kernel_size * kernel_size * in_channels / 2)
        self.weights = np.random.standard_normal((features, kernel_size, kernel_size, in_channels)) / weights_scale
        self.biases = np.random.standard_normal(features) / weights_scale

        self.g_weights = None
        self.g_biases = None
        self.g_inputs = None
        self.inputs = None
        self.outputs = None

    def forward(self, inputs):
        self.inputs = np.zeros(shape=(inputs.shape[0], inputs.shape[1]+(self.ksize // 2)*2,
                                      inputs.shape[2] + (self.ksize // 2)*2, inputs.shape[3]), dtype=np.float32)
        self.inputs[:, self.ksize // 2: inputs.shape[1] + self.ksize // 2,
        self.ksize // 2: inputs.shape[2] + self.ksize // 2, :] = inputs.copy()
        self.outputs = np.zeros(shape=(inputs.shape[0], inputs.shape[1], inputs.shape[2], self.features), dtype=np.float32)
        grid = (self.features)
        block = (inputs.shape[1], inputs.shape[2])
        cov[grid, block](self.inputs, self.weights, self.biases, self.outputs)
        return self.outputs

    def backward(self, gradient_loss_to_this_outputs):
        self.g_inputs = np.zeros(shape=self.inputs.shape, dtype=np.float32)
        self.g_weights = np.zeros(self.weights.shape, dtype=np.float32)
        self.g_biases = np.zeros(self.biases.shape, dtype=np.float32)
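        grid = (self.features)
        # One thread per element of the output gradient (not of the padded input),
        # so that no thread indexes outside gradient_loss_to_this_outputs.
        block = (gradient_loss_to_this_outputs.shape[1], gradient_loss_to_this_outputs.shape[2])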
        cal_cov_grident[grid, block](self.inputs, self.weights, gradient_loss_to_this_outputs,
                                     self.g_weights, self.g_biases, self.g_inputs)

        self.g_inputs = self.g_inputs[:, self.ksize//2: self.g_inputs.shape[1] - self.ksize//2,
                        self.ksize//2: self.g_inputs.shape[2] - self.ksize//2, :]
        return self.g_inputs

    def update_parameters(self, lr):
        self.weights -= self.g_weights * lr / self.inputs.shape[0]
        self.biases -= self.g_biases * lr / self.inputs.shape[0]
        
@cuda.jit()
def cov(inputs, weights, biases, outputs):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    f_num = cuda.blockIdx.x

    for n in range(inputs.shape[0]):
        for i in range(weights.shape[1]):
            for j in range(weights.shape[2]):
                for k in range(weights.shape[3]):
                    outputs[n, tx, ty, f_num] += (inputs[n, tx + i, ty + j, k] * weights[f_num, i, j, k])
        outputs[n, tx, ty, f_num] += biases[f_num]


@cuda.jit()
def cal_cov_grident(inputs, weights, gradient_loss_to_this_outputs, g_weights, g_biases, g_inputs):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    f_num = cuda.blockIdx.x

    for n in range(gradient_loss_to_this_outputs.shape[0]):
        g = gradient_loss_to_this_outputs[n, tx, ty, f_num]
        for i in range(weights.shape[1]):
            for j in range(weights.shape[2]):
                for k in range(weights.shape[3]):
                    # Many threads accumulate into the same elements, so a plain
                    # "+=" would be a data race; atomic adds make the updates safe.
                    cuda.atomic.add(g_inputs, (n, tx + i, ty + j, k), g * weights[f_num, i, j, k])
                    cuda.atomic.add(g_weights, (f_num, i, j, k), g * inputs[n, tx + i, ty + j, k])
        cuda.atomic.add(g_biases, f_num, g)
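
As with pooling, a CPU reference is useful for checking the convolution kernels. The sketch below mirrors the layout used above (NHWC inputs, weights of shape (features, k, k, channels), zero padding of k // 2 on each side, stride 1) in plain NumPy. The function names are hypothetical and an odd kernel size is assumed.

```python
import numpy as np

def conv_forward_ref(inputs, weights, biases):
    """'Same' cross-correlation: inputs (N, H, W, C), weights (F, k, k, C)."""
    n, h, w, c = inputs.shape
    f, k, _, _ = weights.shape
    p = k // 2
    padded = np.pad(inputs, ((0, 0), (p, p), (p, p), (0, 0)))
    outputs = np.zeros((n, h, w, f), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            patch = padded[:, i:i + h, j:j + w, :]  # (N, H, W, C)
            outputs += np.tensordot(patch, weights[:, i, j, :], axes=([3], [1]))
    return outputs + biases

def conv_backward_ref(inputs, weights, grad_outputs):
    """Gradients of the loss w.r.t. the unpadded inputs, the weights and the biases."""
    n, h, w, c = inputs.shape
    f, k, _, _ = weights.shape
    p = k // 2
    padded = np.pad(inputs, ((0, 0), (p, p), (p, p), (0, 0)))
    g_padded = np.zeros_like(padded, dtype=np.float64)
    g_weights = np.zeros(weights.shape, dtype=np.float64)
    for i in range(k):
        for j in range(k):
            patch = padded[:, i:i + h, j:j + w, :]
            # Sum over batch and spatial positions, leaving an (F, C) slice.
            g_weights[:, i, j, :] = np.tensordot(grad_outputs, patch, axes=([0, 1, 2], [0, 1, 2]))
            g_padded[:, i:i + h, j:j + w, :] += np.tensordot(
                grad_outputs, weights[:, i, j, :], axes=([3], [0]))
    g_biases = grad_outputs.sum(axis=(0, 1, 2))
    # Crop the padding, as Conv2D.backward does.
    return g_padded[:, p:p + h, p:p + w, :], g_weights, g_biases

demo_x = np.random.standard_normal((2, 6, 6, 3))
demo_w = np.random.standard_normal((4, 3, 3, 3))
demo_out = conv_forward_ref(demo_x, demo_w, np.zeros(4))
demo_gi, demo_gw, demo_gb = conv_backward_ref(demo_x, demo_w, np.ones_like(demo_out))
print(demo_out.shape, demo_gi.shape, demo_gw.shape, demo_gb.shape)
```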

2.2 Building the MNIST CNN Model

First, import the modules implemented above and load the dataset:

import numpy as np
import time
from read_mnist import DataSet
from my_deep_learning_pkg import FullyConnect, ReLu, CrossEntropy, MaxPooling2x2, Conv2D

# read data
mnistDataSet = DataSet()

Build the network from these hand-written classes:

# construct neural network
conv1 = Conv2D(1, 5, 32)
reLu1 = ReLu()
pool1 = MaxPooling2x2()
conv2 = Conv2D(32, 5, 64)
reLu2 = ReLu()
pool2 = MaxPooling2x2()
fc1 = FullyConnect(7*7*64, 512)
reLu3 = ReLu()
fc2 = FullyConnect(512, 10)
lossfunc = CrossEntropy()
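
The 7*7*64 input size of fc1 follows from how the layers change the tensor shape. As a quick check, here is a small sketch that traces the shape of one 28 x 28 x 1 MNIST image through the network; the two helper functions are hypothetical and only encode that the convolution above keeps height and width while 2 x 2 pooling halves them.

```python
def conv_same_shape(shape, features):
    # The padded convolution keeps height and width and sets the channel count.
    h, w, _ = shape
    return (h, w, features)

def pool2x2_shape(shape):
    # 2x2 max pooling halves height and width.
    h, w, c = shape
    return (h // 2, w // 2, c)

shape = (28, 28, 1)                   # one MNIST image, HWC
shape = conv_same_shape(shape, 32)    # conv1 -> (28, 28, 32)
shape = pool2x2_shape(shape)          # pool1 -> (14, 14, 32)
shape = conv_same_shape(shape, 64)    # conv2 -> (14, 14, 64)
shape = pool2x2_shape(shape)          # pool2 -> (7, 7, 64)
flat = shape[0] * shape[1] * shape[2]
print(shape, flat)                    # (7, 7, 64) 3136
```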

Forward propagation, backpropagation and the gradient update during training:

# train
lr = 1e-2
for epoch in range(10):
    for i in range(600):
        train_data, train_label = mnistDataSet.next_batch(100)

        # forward
        A = conv1.forward(train_data)
        A = reLu1.forward(A)
        A = pool1.forward(A)
        A = conv2.forward(A)
        A = reLu2.forward(A)
        A = pool2.forward(A)
        A = A.reshape(A.shape[0], 7*7*64)
        A = fc1.forward(A)
        A = reLu3.forward(A)
        A = fc2.forward(A)
        loss = lossfunc.forward(A, train_label)

        # backward
        grad = lossfunc.backward()
        grad = fc2.backward(grad)
        grad = reLu3.backward(grad)
        grad = fc1.backward(grad)
        grad = grad.reshape(grad.shape[0], 7, 7, 64)
        grad = pool2.backward(grad)
        grad = reLu2.backward(grad)
        grad = conv2.backward(grad)
        grad = grad.copy()
        grad = pool1.backward(grad)
        grad = reLu1.backward(grad)
        grad = conv1.backward(grad)

        # update parameters
        fc2.update_parameters(lr)
        fc1.update_parameters(lr)
        conv2.update_parameters(lr)
        conv1.update_parameters(lr)

Evaluate on the test set:

test_index = 0
sum_accu = 0
for j in range(100):
    test_data, test_label = mnistDataSet.test_data[test_index: test_index + 100], \
                            mnistDataSet.test_label[test_index: test_index + 100]
    A = conv1.forward(test_data)
    A = reLu1.forward(A)
    A = pool1.forward(A)
    A = conv2.forward(A)
    A = reLu2.forward(A)
    A = pool2.forward(A)
    A = A.reshape(A.shape[0], 7 * 7 * 64)
    A = fc1.forward(A)
    A = reLu3.forward(A)
    A = fc2.forward(A)
    preds = lossfunc.cal_softmax(A)
    preds = np.argmax(preds, axis=1)
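    sum_accu += np.mean(preds == test_label)  # accuracy of this batch, in [0, 1]
    test_index += 100
# Summing the 100 per-batch accuracies gives the overall accuracy directly as a percentage.
# epoch and i are the loop variables of the training loop above.
print("epoch{} train_number{} accuracy: {}%".format(epoch+1, i+1, sum_accu))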

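Figure 3 shows the program's output. Accuracy rises steadily as training proceeds and reaches 91% after one epoch, which suggests that the code is working correctly.

Figure 3: Program output

Code Download Link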

https://github.com/WHDY/mnist_cnn_numba_cuda

