Some Problems You May Encounter in Multi-GPU Programming

Original

2019-09-04 21:12

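Recently I was writing a multi-GPU program. After writing it the conventional way, the program hung right after the kernel launches and produced no output. Here is the code: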

// includes, system
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
// includes, project
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cufft.h>
#include <cuComplex.h>
//CUFFT Header file
#include <cufftXt.h>

#define NX 9
#define NY 10
#define NZ 10
#define NZ2 (NZ/2+1)
#define NN (NX*NY*NZ)
#define L (2*M_PI)
#define TX 8
#define TY 8
#define TZ 8
#define MAXGPU 16

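// Each thread writes one element of the slice starting at f1;
// offset is the index of that slice's first element in the full array.
__global__
void initialize(int NX_per_GPU, cufftComplex *f1, int Nelements, int offset)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < Nelements) {
        f1[index].x = index + offset;
        f1[index].y = 0.0f;
    }
}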


int main (void)
{


    int i, j, k, idx, NX_per_GPU[MAXGPU];

    // Set GPU's to use and list device properties
    int nGPUs = 2, deviceNum[MAXGPU];
    for(i = 0; i<nGPUs; ++i)
    {
        deviceNum[i] = i;
        cudaSetDevice(deviceNum[i]);
        printf("set id num : %d \n",deviceNum[i]);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, deviceNum[i]);
        printf("  Device name: %s\n", prop.name);
        printf("  Memory Clock Rate (KHz): %d\n",
           prop.memoryClockRate);
        printf("  Memory Bus Width (bits): %d\n",
           prop.memoryBusWidth);
        printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
            2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
    }

    printf("Running Multi_GPU_FFT_check using %d GPUs on a %dx%dx%d grid.\n",nGPUs,NX,NY,NZ);


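    // divide data along X: every GPU except the last gets NX/nGPUs planes,
    // and the last GPU takes whatever remains

    for(i=0;i<nGPUs-1;i++)
    	NX_per_GPU[i]=NX/nGPUs;
    NX_per_GPU[i]=NX - NX_per_GPU[0]*i;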


    printf(" divide data : %d %d \n",NX_per_GPU[0],NX_per_GPU[1]);


    // Declare variables
    cufftComplex *u;

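    // Allocate memory for arrays
    // NOTE: cudaMalloc allocates on the current device only, which here is
    // the last device selected in the loop above (deviceNum[nGPUs-1]).
    cudaMalloc(&u, sizeof(cufftComplex)*NN );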

    cufftComplex *data_cpu = (cufftComplex *)malloc(sizeof(cufftComplex)*NN);

    // Launch CUDA kernel to initialize velocity field

    int offset=0;
    //nGPUs=1;
    for (i = 0; i<nGPUs; ++i){
        cudaSetDevice(deviceNum[i]);
        int threadsPerBlock = 256;
        int blocksPerGrid =(NX_per_GPU[i]*NY*NZ + threadsPerBlock - 1) / threadsPerBlock;

        initialize<<<blocksPerGrid, threadsPerBlock>>>(NX_per_GPU[i], &u[offset],NX_per_GPU[i]*NY*NZ,offset);

        offset+=NX_per_GPU[i]*NY*NZ;

    }
/*
    for (i = 0; i<nGPUs; ++i){
        cudaSetDevice(deviceNum[i]);
        cudaDeviceSynchronize();
    }*/
    offset=0;
    for(i = 0; i<nGPUs; ++i)
    {
    	cudaSetDevice(deviceNum[i]);
    	cudaMemcpy(data_cpu,u,sizeof(cufftComplex)*NX*NY*NZ,cudaMemcpyDeviceToHost);
    }
    for(int i=80;i<80+10;i++)
    	printf("%f ",data_cpu[i].x);
    printf("\n");
    for(int i=800;i<800+10;i++)
    	printf("%f ",data_cpu[i].x);
    printf("\n");
    return 0;
}

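After some analysis, I found the cause: memory allocated with cudaMalloc lives on a single card, namely the device that is current when the call is made. Logically, for the other card to access that memory, the data has to travel over PCIe. Possibly because of how my server is configured, the program stayed stuck at the kernel launch and simply kept running.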
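To probe whether one card can reach another card's memory directly on a given machine, the runtime's peer-access query can be used. The following is only a minimal sketch, not part of my original program; it assumes the CUDA runtime is installed and simply prints what it reports for every device pair. Checking the return value of each runtime call in the same way (for example after cudaDeviceSynchronize) can also help surface an error instead of a silent stall.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int nDevices = 0;
    cudaError_t err = cudaGetDeviceCount(&nDevices);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d CUDA device(s) found\n", nDevices);

    // Ask the runtime, for every ordered pair (a, b), whether device a
    // can directly access memory that was allocated on device b.
    for (int a = 0; a < nDevices; ++a) {
        for (int b = 0; b < nDevices; ++b) {
            if (a == b) continue;
            int canAccess = 0;
            err = cudaDeviceCanAccessPeer(&canAccess, a, b);
            if (err != cudaSuccess) {
                fprintf(stderr, "cudaDeviceCanAccessPeer: %s\n",
                        cudaGetErrorString(err));
                return 1;
            }
            printf("device %d -> device %d: peer access %s\n",
                   a, b, canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```

If a pair is reported as supported, peer access still has to be enabled explicitly (cudaDeviceEnablePeerAccess) before a kernel on one device dereferences a pointer from the other; my program above never did that.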

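There are two ways to modify this code so that the program runs correctly:

1. Allocate each card's portion of the data on that card with its own malloc call, so that each kernel launch only touches memory local to its device.

2. Replace the cudaMalloc call with cudaMallocManaged, i.e. unified memory. Both cards can then read the data, and each one can reach its part through the offset.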
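Both options can be sketched in one small program. This is an illustration under assumptions rather than the exact code I ended up with: the CHECK macro and the helper names runPerDevice, runManaged and verify are made up for the example, the kernel drops the unused NX_per_GPU parameter, and the devices are assumed to support managed memory. It uses up to two GPUs and falls back to one if only one is present.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(EXIT_FAILURE); } } while (0)

__global__ void initialize(cufftComplex *f1, int nElements, int offset)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < nElements) {
        f1[index].x = (float)(index + offset);
        f1[index].y = 0.0f;
    }
}

// Option 1: each device allocates and fills only its own slice.
void runPerDevice(int nGPUs, const int *count, cufftComplex *host)
{
    std::vector<cufftComplex *> d(nGPUs, nullptr);
    int offset = 0;
    for (int g = 0; g < nGPUs; ++g) {
        CHECK(cudaSetDevice(g));
        CHECK(cudaMalloc(&d[g], sizeof(cufftComplex) * count[g]));
        initialize<<<(count[g] + 255) / 256, 256>>>(d[g], count[g], offset);
        CHECK(cudaGetLastError());
        offset += count[g];
    }
    offset = 0;
    for (int g = 0; g < nGPUs; ++g) {
        CHECK(cudaSetDevice(g));
        // A blocking cudaMemcpy waits for the kernel launched above on this device.
        CHECK(cudaMemcpy(host + offset, d[g], sizeof(cufftComplex) * count[g],
                         cudaMemcpyDeviceToHost));
        CHECK(cudaFree(d[g]));
        offset += count[g];
    }
}

// Option 2: one managed allocation, each device writes at its offset.
void runManaged(int nGPUs, const int *count, int total, cufftComplex *host)
{
    cufftComplex *u = nullptr;
    CHECK(cudaMallocManaged(&u, sizeof(cufftComplex) * total));
    int offset = 0;
    for (int g = 0; g < nGPUs; ++g) {
        CHECK(cudaSetDevice(g));
        initialize<<<(count[g] + 255) / 256, 256>>>(u + offset, count[g], offset);
        CHECK(cudaGetLastError());
        offset += count[g];
    }
    // Wait for every device before the host touches the managed buffer.
    for (int g = 0; g < nGPUs; ++g) {
        CHECK(cudaSetDevice(g));
        CHECK(cudaDeviceSynchronize());
    }
    memcpy(host, u, sizeof(cufftComplex) * total);
    CHECK(cudaFree(u));
}

// Element i of the full array should hold the value i.
static bool verify(const cufftComplex *h, int n)
{
    for (int i = 0; i < n; ++i)
        if (h[i].x != (float)i) return false;
    return true;
}

int main()
{
    const int NX = 9, NY = 10, NZ = 10, total = NX * NY * NZ;
    int nGPUs = 0;
    CHECK(cudaGetDeviceCount(&nGPUs));
    if (nGPUs > 2) nGPUs = 2;

    // Same split as in the post: the last GPU takes the remainder.
    std::vector<int> count(nGPUs);
    int perGPU = (NX / nGPUs) * NY * NZ;
    for (int g = 0; g < nGPUs - 1; ++g) count[g] = perGPU;
    count[nGPUs - 1] = total - perGPU * (nGPUs - 1);

    std::vector<cufftComplex> host(total);
    runPerDevice(nGPUs, count.data(), host.data());
    printf("per-device cudaMalloc: %s\n", verify(host.data(), total) ? "ok" : "MISMATCH");

    memset(host.data(), 0, sizeof(cufftComplex) * total);
    runManaged(nGPUs, count.data(), total, host.data());
    printf("cudaMallocManaged:     %s\n", verify(host.data(), total) ? "ok" : "MISMATCH");
    return 0;
}
```

Option 1 keeps every access local to the device that owns the slice, at the cost of tracking one pointer per GPU. Option 2 keeps the single-pointer-plus-offset style of the original code and leaves data placement to the runtime.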
