[Code Reading] A Detailed Look at Defining Your Own CUDA Functions in PyTorch

Many modules of today's 3D networks, especially point-based ones, have no official implementation in PyTorch, so we have to write them ourselves; the FPS, grouping, and ball-query functions in PointNet++ are typical examples. I had only used these functions before, and my changes were limited to the Python level. This time, let's dig into how to define such a function ourselves and hook it into PyTorch so that it can be called like any other op.

There is actually very detailed official documentation on how to define a custom function and plug it into PyTorch. Writing the function itself, of course, also requires CUDA programming knowledge; here we first cover the surrounding plumbing and assume the function is already written. The official tutorial explains the procedure clearly; I will illustrate it with PointNet++. For the full details, read the official documentation first:
https://pytorch.org/tutorials/advanced/cpp_extension.html?highlight=pybind11_module

The official documentation clearly describes two ways to make a custom CUDA function available in PyTorch: one compiles it ahead of time into a Python package; the other compiles and loads it at runtime.

Personally, I find the compiled approach better: it produces a Python package, which makes it easy to reuse in other projects.

First, let's look at the PyTorch interface setup, assuming the functions themselves are already written.

Setting up the PyTorch interface

The compiled approach

The PointNet++ version used here is the one from this link:
https://github.com/sshaoshuai/Pointnet2.PyTorch/tree/5a4416f51ceaeba242828cabf39133433336850d

Assume the functions we want to expose are already written; in this example they are the xxx.cpp, xxx.cu, and xxx.h files under pointnet2/src.

How do we hook them into the PyTorch interface? That is what pointnet2/setup.py does:

# these two imports are boilerplate; setuptools turns our custom functions into an installable package
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
	# the name of the package is pointnet2
    name='pointnet2', 
    ext_modules=[
    	# the extension module's name is pointnet2_cuda, i.e. in Python you will import pointnet2_cuda
    	# list the xxx.cpp and xxx.cu files that belong to this extension
        CUDAExtension('pointnet2_cuda', [
            'src/pointnet2_api.cpp',
            
            'src/ball_query.cpp', 
            'src/ball_query_gpu.cu',
            'src/group_points.cpp', 
            'src/group_points_gpu.cu',
            'src/interpolate.cpp', 
            'src/interpolate_gpu.cu',
            'src/sampling.cpp', 
            'src/sampling_gpu.cu',
        ],
        # the options below can stay as they are
        extra_compile_args={'cxx': ['-g'],
                            'nvcc': ['-O2']})
    ],
    cmdclass={'build_ext': BuildExtension}
)

Before we can use these functions, we first need to run

python setup.py install

This essentially bundles the functions we defined into a package and installs it. Which raises a question: the package is installed, but what interface do we use to call the functions?

That part is defined in pointnet2/src/pointnet2_api.cpp:

#include <torch/serialize/tensor.h>
#include <torch/extension.h>

// first include all of the functions we have written
#include "ball_query_gpu.h"
#include "group_points_gpu.h"
#include "sampling_gpu.h"
#include "interpolate_gpu.h"

// use PYBIND11_MODULE, which is pulled in via torch/extension.h
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
	// the name used when calling from Python: ball_query_wrapper
	// the C++ function it binds to: ball_query_wrapper_fast
	// the docstring shown by help() in Python: "ball_query_wrapper_fast"
    m.def("ball_query_wrapper", &ball_query_wrapper_fast, "ball_query_wrapper_fast");

    m.def("group_points_wrapper", &group_points_wrapper_fast, "group_points_wrapper_fast");
    m.def("group_points_grad_wrapper", &group_points_grad_wrapper_fast, "group_points_grad_wrapper_fast");

    m.def("gather_points_wrapper", &gather_points_wrapper_fast, "gather_points_wrapper_fast");
    m.def("gather_points_grad_wrapper", &gather_points_grad_wrapper_fast, "gather_points_grad_wrapper_fast");

    m.def("furthest_point_sampling_wrapper", &furthest_point_sampling_wrapper, "furthest_point_sampling_wrapper");
    
    m.def("three_nn_wrapper", &three_nn_wrapper_fast, "three_nn_wrapper_fast");
    m.def("three_interpolate_wrapper", &three_interpolate_wrapper_fast, "three_interpolate_wrapper_fast");
    m.def("three_interpolate_grad_wrapper", &three_interpolate_grad_wrapper_fast, "three_interpolate_grad_wrapper_fast");
}

That completes the interface PyTorch will call. Now, how is it actually used?
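
Before looking at the wrapper code, a quick sanity check (a hypothetical interactive session, assuming python setup.py install above succeeded and a CUDA build of PyTorch is installed) confirms that the compiled module and its bindings are visible from Python:

import torch           # load torch first so the extension can find the libtorch symbols
import pointnet2_cuda  # the module name passed to CUDAExtension in setup.py

# the names registered via m.def(...) in pointnet2_api.cpp
print([name for name in dir(pointnet2_cuda) if not name.startswith('_')])
# the third argument of m.def(...) becomes the docstring shown by help()
help(pointnet2_cuda.ball_query_wrapper)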

This lives in pointnet2/pointnet2_utils.py; take GatherOperation as an example:

import torch
from torch.autograd import Variable
from torch.autograd import Function
import torch.nn as nn
from typing import Tuple

# import the package we defined ourselves
import pointnet2_cuda as pointnet2

# to define a PyTorch op, subclass torch.autograd.Function
class GatherOperation(Function):
	
	# define the forward pass; variables stashed on ctx are passed on to backward
    @staticmethod
    def forward(ctx, features: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
        """
        :param ctx:
        :param features: (B, C, N)
        :param idx: (B, npoint) index tensor of the features to gather
        :return:
            output: (B, C, npoint)
        """
        assert features.is_contiguous()
        assert idx.is_contiguous()

        B, npoint = idx.size()
        _, C, N = features.size()
        output = torch.cuda.FloatTensor(B, C, npoint)
		
		# call the function we defined to do the actual computation
        pointnet2.gather_points_wrapper(B, C, N, npoint, features, idx, output)
		
		# stash the variables needed in the backward pass on ctx
        ctx.for_backwards = (idx, C, N)
        return output
	
	# define the backward pass; its first argument is ctx, and the number of remaining arguments equals the number of outputs of forward
    @staticmethod
    def backward(ctx, grad_out):
		# retrieve the variables saved during the forward pass from ctx
        idx, C, N = ctx.for_backwards
        B, npoint = idx.size()

        grad_features = Variable(torch.cuda.FloatTensor(B, C, N).zero_())
        grad_out_data = grad_out.data.contiguous()
        pointnet2.gather_points_grad_wrapper(B, C, N, npoint, grad_out_data, idx, grad_features.data)

		# the number of returned values must match the number of inputs to forward (excluding ctx)
        return grad_features, None

# a custom Function is invoked as outputs = xxx.apply(inputs); grabbing apply here lets us simply write outputs = gather_operation(inputs)
gather_operation = GatherOperation.apply
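
As a usage sketch (assuming the extension has been built and a GPU is available; the import path depends on where pointnet2_utils.py sits in your project), gather_operation picks npoint feature columns out of each batch element according to idx and is differentiable with respect to features:

import torch
from pointnet2_utils import gather_operation  # adjust to your project layout

B, C, N, npoint = 2, 64, 1024, 128
features = torch.randn(B, C, N, device='cuda', requires_grad=True)        # (B, C, N), float32
idx = torch.randint(0, N, (B, npoint), device='cuda', dtype=torch.int32)  # (B, npoint), int32 as the wrapper expects

new_features = gather_operation(features, idx)  # (B, C, npoint)
new_features.sum().backward()                   # routes through GatherOperation.backward
print(new_features.shape, features.grad.shape)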

The runtime-loading approach

Take the code in PVCNN as the example. In PVCNN, the xxx.cpp, xxx.cu, and xxx.h files all live in the modules/functional/src folder.

Following the same order as the compiled approach: first, how do the xxx.cpp and xxx.cu files become known to PyTorch? That happens in modules/functional/backend.py:

import os

from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))
_backend = load(name='_pvcnn_backend',
                extra_cflags=['-O3', '-std=c++17'],
                sources=[os.path.join(_src_path,'src', f) for f in [
                    'ball_query/ball_query.cpp',
                    'ball_query/ball_query.cu',
                    'grouping/grouping.cpp',
                    'grouping/grouping.cu',
                    'interpolate/neighbor_interpolate.cpp',
                    'interpolate/neighbor_interpolate.cu',
                    'interpolate/trilinear_devox.cpp',
                    'interpolate/trilinear_devox.cu',
                    'sampling/sampling.cpp',
                    'sampling/sampling.cu',
                    'voxelization/vox.cpp',
                    'voxelization/vox.cu',
                    'bindings.cpp',
                ]]
                )
__all__ = ['_backend']

This is essentially boilerplate; only name and sources need to be adapted to your own project.
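
As a minimal sketch for your own project (the file names my_op.cpp, my_op.cu and bindings.cpp are placeholders), the pattern boils down to a single load() call; the first import triggers the gcc/nvcc build and the result is cached, so later imports are fast:

import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))
my_backend = load(
    name='my_backend',          # name of the generated extension module
    sources=[os.path.join(_src_path, 'src', f) for f in [
        'my_op.cpp', 'my_op.cu', 'bindings.cpp',  # placeholder file names
    ]],
    extra_cflags=['-O3'],
    verbose=True,               # print the build commands on the first compile
)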

The next question: PyTorch now knows about these sources, but what are the Python-callable entry points, and which C++ function does each name map to? This is defined in exactly the same way as in the compiled approach, in modules/functional/src/bindings.cpp:

#include <pybind11/pybind11.h>

#include "ball_query/ball_query.hpp"
#include "grouping/grouping.hpp"
#include "interpolate/neighbor_interpolate.hpp"
#include "interpolate/trilinear_devox.hpp"
#include "sampling/sampling.hpp"
#include "voxelization/vox.hpp"

PYBIND11_MODULE(_pvcnn_backend, m) {
  m.def("gather_features_forward", &gather_features_forward,
        "Gather Centers' Features forward (CUDA)");
  m.def("gather_features_backward", &gather_features_backward,
        "Gather Centers' Features backward (CUDA)");
  m.def("furthest_point_sampling", &furthest_point_sampling_forward,
        "Furthest Point Sampling (CUDA)");
  m.def("ball_query", &ball_query_forward, "Ball Query (CUDA)");
  m.def("grouping_forward", &grouping_forward,
        "Grouping Features forward (CUDA)");
  m.def("grouping_backward", &grouping_backward,
        "Grouping Features backward (CUDA)");
  m.def("three_nearest_neighbors_interpolate_forward",
        &three_nearest_neighbors_interpolate_forward,
        "3 Nearest Neighbors Interpolate forward (CUDA)");
  m.def("three_nearest_neighbors_interpolate_backward",
        &three_nearest_neighbors_interpolate_backward,
        "3 Nearest Neighbors Interpolate backward (CUDA)");

  m.def("trilinear_devoxelize_forward", &trilinear_devoxelize_forward,
        "Trilinear Devoxelization forward (CUDA)");
  m.def("trilinear_devoxelize_backward", &trilinear_devoxelize_backward,
        "Trilinear Devoxelization backward (CUDA)");
  m.def("avg_voxelize_forward", &avg_voxelize_forward,
        "Voxelization forward with average pooling (CUDA)");
  m.def("avg_voxelize_backward", &avg_voxelize_backward,
        "Voxelization backward (CUDA)");
}

Next question: now that we know how the functions in xxx.cpp are called from Python, how do we wrap them into a PyTorch Function? Again, this is done exactly as in the compiled approach. Take modules/functional/grouping.py as an example:

from torch.autograd import Function

from modules.functional.backend import _backend

__all__ = ['grouping']


class Grouping(Function):
    @staticmethod
    def forward(ctx, features, indices):
        """
        :param ctx:
        :param features: features of points, FloatTensor[B, C, N]
        :param indices: neighbor indices of centers, IntTensor[B, M, U], M is #centers, U is #neighbors
        :return:
            grouped_features: grouped features, FloatTensor[B, C, M, U]
        """
        features = features.contiguous()
        indices = indices.contiguous()
        ctx.save_for_backward(indices)
        ctx.num_points = features.size(-1)
        return _backend.grouping_forward(features, indices)

    @staticmethod
    def backward(ctx, grad_output):
        indices, = ctx.saved_tensors
        grad_features = _backend.grouping_backward(grad_output.contiguous(), indices, ctx.num_points)
        return grad_features, None
        
grouping = Grouping.apply

With this, it can be used exactly like any other torch.autograd.Function.
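
For example, a quick smoke test (a sketch assuming the PVCNN repo layout and an available GPU) that exercises both the forward and the backward path:

import torch
from modules.functional.grouping import grouping  # import path follows the PVCNN repo layout

B, C, N, M, U = 2, 32, 1024, 256, 16
features = torch.randn(B, C, N, device='cuda', requires_grad=True)          # FloatTensor[B, C, N]
indices = torch.randint(0, N, (B, M, U), device='cuda', dtype=torch.int32)  # IntTensor[B, M, U]

grouped = grouping(features, indices)  # FloatTensor[B, C, M, U]
grouped.mean().backward()              # exercises Grouping.backward and the CUDA kernel behind it
print(grouped.shape, features.grad.shape)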

Key points

Below are a few passages from the official PyTorch tutorial that I consider particularly important.

A wonderful fact about PyTorch’s ATen backend is that it abstracts the computing device you are running on. This means the same code we wrote for CPU can also run on GPU, and individual operations will correspondingly dispatch to GPU-optimized implementations. For certain operations like matrix multiply (like mm or addmm), this is a big win. Let’s take a look at how much performance we gain from running our C++ code with CUDA tensors. No changes to our implementation are required, we simply need to put our tensors in GPU memory from Python, with either adding device=cuda_device argument at creation time or using .to(cuda_device) after creation:

The general strategy for writing a CUDA extension is to first write a C++ file which defines the functions that will be called from Python, and binds those functions to Python with pybind11. Furthermore, this file will also declare functions that are defined in CUDA (.cu) files. The C++ functions will then do some checks and ultimately forward its calls to the CUDA functions. In the CUDA files, we write our actual CUDA kernels. The cpp_extension package will then take care of compiling the C++ sources with a C++ compiler like gcc and the CUDA sources with NVIDIA’s nvcc compiler. This ensures that each compiler takes care of files it knows best to compile. Ultimately, they will be linked into one shared library that is available to us from Python code.

How to write the CUDA functions themselves

Everything above was about plugging already-written functions into PyTorch. Here we turn to writing the CUDA side itself, which requires some CUDA programming knowledge. I will not go into CUDA fundamentals or theory; I will simply make clear what each part does, which parts are fixed boilerplate, and which parts you have to change for your own task.

We again start with PointNet++, then look at Faster-RCNN as a cross-check, and finally at PVCNN.

PointNet++

PointNet++ here refers to the following version:
https://github.com/sshaoshuai/Pointnet2.PyTorch/tree/5a4416f51ceaeba242828cabf39133433336850d

Take the simplest case, the furthest point sampling (FPS) implementation, as the example: it only returns indices, involves a single function, and needs no backward pass. The FPS-related code lives in pointnet2/src/sampling.cpp, pointnet2/src/sampling_gpu.h, and pointnet2/src/sampling_gpu.cu.

First look at sampling.cpp. This is the host-side (CPU) function and also the interface that Python sees:

#include <torch/serialize/tensor.h>
#include <ATen/cuda/CUDAContext.h>
#include <vector>
#include <THC/THC.h>

// include the GPU version of the function
#include "sampling_gpu.h"

// the global THC state; needed below to fetch the current CUDA stream
extern THCState *state;


int furthest_point_sampling_wrapper(int b, int n, int m, 
    at::Tensor points_tensor, at::Tensor temp_tensor, at::Tensor idx_tensor) {
	/*
	Inputs:
	b: batch size
	n: number of points in the original point cloud
	m: number of points to sample
	points_tensor: the original point cloud, of size b*n
	temp_tensor: an intermediate buffer, of size b*n
	idx_tensor: the output, storing the sampled indices, of size b*m

	points_tensor, temp_tensor and idx_tensor are all tensors on the CUDA device
	*/
    const float *points = points_tensor.data<float>();
    float *temp = temp_tensor.data<float>();
    int *idx = idx_tensor.data<int>();
	
	// fetch the current CUDA stream; this line does not need to change
    cudaStream_t stream = THCState_getCurrentStream(state);
	// call the launcher defined in sampling_gpu.cu
    furthest_point_sampling_kernel_launcher(b, n, m, points, temp, idx, stream);
    return 1;
}
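
For reference, the Python side that drives this wrapper is a Function just like GatherOperation above. A paraphrased sketch: it allocates the idx output and the temp buffer (pre-filled with a large distance) before handing everything to the wrapper:

import torch
from torch.autograd import Function
import pointnet2_cuda as pointnet2

class FurthestPointSampling(Function):
    @staticmethod
    def forward(ctx, xyz: torch.Tensor, npoint: int) -> torch.Tensor:
        # xyz: (B, N, 3) point coordinates; npoint: number of points to sample
        assert xyz.is_contiguous()
        B, N, _ = xyz.size()
        output = torch.cuda.IntTensor(B, npoint)         # idx_tensor, (B, npoint)
        temp = torch.cuda.FloatTensor(B, N).fill_(1e10)  # temp_tensor, running minimum distances
        pointnet2.furthest_point_sampling_wrapper(B, N, npoint, xyz, temp, output)
        return output

    @staticmethod
    def backward(ctx, grad_out):
        # FPS only produces indices, so there is nothing to differentiate
        return None, None

furthest_point_sample = FurthestPointSampling.apply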

Now let's see what sampling_gpu.cu contains:

__device__ void __update(float *__restrict__ dists, int *__restrict__ dists_i, int idx1, int idx2){
    const float v1 = dists[idx1], v2 = dists[idx2];
    const int i1 = dists_i[idx1], i2 = dists_i[idx2];
    dists[idx1] = max(v1, v2);
    dists_i[idx1] = v2 > v1 ? i2 : i1;
}

template <unsigned int block_size>
__global__ void furthest_point_sampling_kernel(int b, int n, int m, 
    const float *__restrict__ dataset, float *__restrict__ temp, int *__restrict__ idxs) {
    // dataset: (B, N, 3)
    // tmp: (B, N)
    // output:
    //      idx: (B, M)

    if (m <= 0) return;
    __shared__ float dists[block_size];
    __shared__ int dists_i[block_size];

    int batch_index = blockIdx.x;
    dataset += batch_index * n * 3;
    temp += batch_index * n;
    idxs += batch_index * m;

    int tid = threadIdx.x;
    const int stride = block_size;

    int old = 0;
    if (threadIdx.x == 0)
    idxs[0] = old;

    __syncthreads();
    for (int j = 1; j < m; j++) {
    int besti = 0;
    float best = -1;
    float x1 = dataset[old * 3 + 0];
    float y1 = dataset[old * 3 + 1];
    float z1 = dataset[old * 3 + 2];
    for (int k = tid; k < n; k += stride) {
        float x2, y2, z2;
        x2 = dataset[k * 3 + 0];
        y2 = dataset[k * 3 + 1];
        z2 = dataset[k * 3 + 2];
        // float mag = (x2 * x2) + (y2 * y2) + (z2 * z2);
        // if (mag <= 1e-3)
        // continue;

        float d = (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) + (z2 - z1) * (z2 - z1);
        float d2 = min(d, temp[k]);
        temp[k] = d2;
        besti = d2 > best ? k : besti;
        best = d2 > best ? d2 : best;
    }
    dists[tid] = best;
    dists_i[tid] = besti;
    __syncthreads();

    if (block_size >= 1024) {
        if (tid < 512) {
            __update(dists, dists_i, tid, tid + 512);
        }
        __syncthreads();
    }

    if (block_size >= 512) {
        if (tid < 256) {
            __update(dists, dists_i, tid, tid + 256);
        }
        __syncthreads();
    }
    if (block_size >= 256) {
        if (tid < 128) {
            __update(dists, dists_i, tid, tid + 128);
        }
        __syncthreads();
    }
    if (block_size >= 128) {
        if (tid < 64) {
            __update(dists, dists_i, tid, tid + 64);
        }
        __syncthreads();
    }
    if (block_size >= 64) {
        if (tid < 32) {
            __update(dists, dists_i, tid, tid + 32);
        }
        __syncthreads();
    }
    if (block_size >= 32) {
        if (tid < 16) {
            __update(dists, dists_i, tid, tid + 16);
        }
        __syncthreads();
    }
    if (block_size >= 16) {
        if (tid < 8) {
            __update(dists, dists_i, tid, tid + 8);
        }
        __syncthreads();
    }
    if (block_size >= 8) {
        if (tid < 4) {
            __update(dists, dists_i, tid, tid + 4);
        }
        __syncthreads();
    }
    if (block_size >= 4) {
        if (tid < 2) {
            __update(dists, dists_i, tid, tid + 2);
        }
        __syncthreads();
    }
    if (block_size >= 2) {
        if (tid < 1) {
            __update(dists, dists_i, tid, tid + 1);
        }
        __syncthreads();
    }

    old = dists_i[0];
    if (tid == 0)
        idxs[j] = old;
    }
}

void furthest_point_sampling_kernel_launcher(int b, int n, int m, 
    const float *dataset, float *temp, int *idxs, cudaStream_t stream) {
    // dataset: (B, N, 3)
    // tmp: (B, N)
    // output:
    //      idx: (B, M)

    cudaError_t err;
    unsigned int n_threads = opt_n_threads(n);

    switch (n_threads) {
        case 1024:
        furthest_point_sampling_kernel<1024><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 512:
        furthest_point_sampling_kernel<512><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 256:
        furthest_point_sampling_kernel<256><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 128:
        furthest_point_sampling_kernel<128><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 64:
        furthest_point_sampling_kernel<64><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 32:
        furthest_point_sampling_kernel<32><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 16:
        furthest_point_sampling_kernel<16><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 8:
        furthest_point_sampling_kernel<8><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 4:
        furthest_point_sampling_kernel<4><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 2:
        furthest_point_sampling_kernel<2><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        case 1:
        furthest_point_sampling_kernel<1><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs); break;
        default:
        furthest_point_sampling_kernel<512><<<b, n_threads, 0, stream>>>(b, n, m, dataset, temp, idxs);
    }

    err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}

Admittedly, it is long. I will not walk through what it does step by step: if you know CUDA it is fairly easy to follow, and if you don't, a short explanation would not do it justice. The one thing I consider essential is the kernel launch statement:

kernel<<<num_block, num_thread, 0, stream>>>(a, b, c)

Inside the <<<...>>>, the last argument is the stream obtained in the .cpp file.

One thing that may look puzzling is the <512>, <256>, ... in front of <<<...>>>: furthest_point_sampling_kernel is declared as a C++ template with a block_size parameter, so furthest_point_sampling_kernel<512> instantiates the kernel with block_size fixed at compile time, which lets the shared-memory arrays and the reduction above treat it as a constant.

To summarize, here is the recipe for writing such a function yourself.
First write xxx.cpp:

#include <THC/THC.h>
extern THCState *state;
#include "xxx.h"

int xxx_wrapper(inputs) {
    ...

    cudaStream_t stream = THCState_getCurrentStream(state);
    xxx_launcher(inputs, stream);
    return 1;
}

Then define xxx.h, which declares the wrapper (so the api .cpp that does the pybind binding can see it) and the launcher (so xxx.cpp can call it):

int xxx_wrapper(inputs);

void xxx_launcher(inputs, cudaStream_t stream);

Finally define xxx.cu:

#include "xxx.h"

__global__ void xxx_kernel(kernel_inputs) {
	...
}

void xxx_launcher(inputs, cudaStream_t stream) {
	...
	xxx_kernel<<<n_blocks, n_threads, 0, stream>>>(kernel_inputs);
	...
}

To be honest, I am not completely sure about the exact set of includes either; when in doubt, including a few extra headers generally does no harm.

That wraps up PointNet++. Next, let's use the Faster-RCNN code to check the summary above.

Faster-RCNN

The following version is used as the example:
https://github.com/jwyang/faster-rcnn.pytorch

Take roi_crop as the example. In lib/model/roi_crop/src, roi_crop_cuda.c contains the following:

#include <THC/THC.h>
#include <stdbool.h>
#include <stdio.h>
#include "roi_crop_cuda_kernel.h"

#define real float

extern THCState *state;

int BilinearSamplerBHWD_updateOutput_cuda(THCudaTensor *inputImages, THCudaTensor *grids, THCudaTensor *output){
  int success = 0;
  success = BilinearSamplerBHWD_updateOutput_cuda_kernel(THCudaTensor_size(state, output, 1),
							 THCudaTensor_size(state, output, 3),
							 THCudaTensor_size(state, output, 2),
							 THCudaTensor_size(state, output, 0),
							 THCudaTensor_size(state, inputImages, 1),
							 THCudaTensor_size(state, inputImages, 2),
							 THCudaTensor_size(state, inputImages, 3),
							 THCudaTensor_size(state, inputImages, 0),
							 THCudaTensor_data(state, inputImages),
							 THCudaTensor_stride(state, inputImages, 0),
							 THCudaTensor_stride(state, inputImages, 1),
							 THCudaTensor_stride(state, inputImages, 2),
							 THCudaTensor_stride(state, inputImages, 3),
							 THCudaTensor_data(state, grids),
							 THCudaTensor_stride(state, grids, 0),
							 THCudaTensor_stride(state, grids, 3),
							 THCudaTensor_stride(state, grids, 1),
							 THCudaTensor_stride(state, grids, 2),
							 THCudaTensor_data(state, output),
							 THCudaTensor_stride(state, output, 0),
							 THCudaTensor_stride(state, output, 1),
							 THCudaTensor_stride(state, output, 2),
							 THCudaTensor_stride(state, output, 3),
							 THCState_getCurrentStream(state));

  //check for errors
  if (!success) {
    THError("aborting");
  }
  return 1;
}

As you can see from the code above, the same pattern shows up again: the THCState state and the CUDA stream!

And roi_crop_cuda_kernel.cu likewise contains:

bilinearSamplingFromGrid<<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock, kThreadsPerBlock, 0, stream>>>(output_size,...);

i.e. the same <<<n_block, n_thread, 0, stream>>> launch form.

PVCNN

I will not paste the code here. It follows the same xxx.cpp + xxx.cu pattern: a launcher function in the .cu file that the .cpp calls, and a kernel function that does the computation on the GPU. One difference worth noting is that, as PVCNN shows, passing a stream explicitly is not strictly required.
