Understanding and setting embed, stride, and dist in FFT transforms, using cufftPlanMany as an example

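Whenever I needed an FFT over a custom data layout, I used to write a throwaway demo each time. Now that I finally understand how it works, I am writing it down for future reference.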

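For example, to run a Fourier transform over every row or every column of a 2-D array, you configure a plan with cufftPlanMany and then process all the transforms as one batch.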

cufftPlanMany is declared as follows:

cufftResult cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed,
	int istride, int idist, int *onembed, int ostride, int odist, cufftType type, int batch);

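Take a 10*5 2-D array stored row-major, that is, 5 rows of 10 elements each, and suppose we want an FFT of every row:

NX=10; NY=5;

rank=1: each transform is a 1-D Fourier transform.

n: an array with rank elements giving the transform size in each dimension. Here n has a single element, n[0]=NX, so each 1-D transform takes 10 elements.

inembed: describes the storage dimensions of the input data in memory. The input here is a 2-D array, so this example declares inembed with two elements, inembed[0]=NX, inembed[1]=NY. Strictly speaking, cuFFT only reads rank entries, and the outermost entry inembed[0] is effectively ignored because idist already carries that information. What matters is that inembed is not NULL: if it is NULL, cuFFT ignores istride and idist and assumes contiguous data.

istride: the distance, in elements, between two successive elements inside one transform. Here istride=1.

idist: the distance, in elements, between the first elements of two consecutive transforms in the batch. Here idist=NX.

batch: the number of independent transforms. Here batch=NY.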
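
To make the roles of istride and idist concrete, here is a minimal host-side sketch of the addressing rule the cuFFT documentation gives for a batched 1-D layout: element x of transform b lives at offset b * idist + x * istride. The helper names layout_offset and signal_offsets are hypothetical and only serve the illustration; they are not part of cuFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in elements) of element x of transform b in a batched 1-D
 * layout: b * dist + x * stride. Hypothetical helper, not a cuFFT call. */
size_t layout_offset(size_t b, size_t x, size_t stride, size_t dist)
{
	assert(stride > 0);
	return b * dist + x * stride;
}

/* Write the n offsets read by transform b into out[0..n-1]. */
void signal_offsets(size_t b, size_t n, size_t stride, size_t dist, size_t *out)
{
	for (size_t x = 0; x < n; x++)
		out[x] = layout_offset(b, x, stride, dist);
}
```

With the row settings (n=10, stride=1, dist=10), transform 2 reads offsets 20 through 29, which is exactly the third row of the array.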

This example runs a C2C transform, so the input and output have the same format and layout. The code is as follows:

#include <stdlib.h>
#include <stdio.h>

#include <string.h>
#include <math.h>

#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#define Ndim 2
#define NX 10	/* elements per row = length of each transform */
#define NY 5	/* number of rows = batch count */


void testplanmany() {

	int N[2];
	N[0] = NX, N[1] = NY;
	int NXY = N[0] * N[1];
	cufftComplex *input = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	cufftComplex *output = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	int i;
	/* Fill the array with 0, 1, 2, ... so the results are easy to check. */
	for (i = 0; i < NXY; i++) {
		input[i].x = i % 1000;
		input[i].y = 0;
	}
	cufftComplex *d_inputData, *d_outData;
	cudaMalloc((void**) &d_inputData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMalloc((void**) &d_outData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMemcpy(d_inputData, input, N[0] * N[1] * sizeof(cufftComplex), cudaMemcpyHostToDevice);
	cufftHandle plan;
	/*
	cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed,
	int istride, int idist, int *onembed, int ostride,
	int odist, cufftType type, int batch);
	 */
	int rank=1;
	int n[1];
	n[0]=NX;		/* each transform has NX points */
	int istride=1;		/* elements of one row are contiguous */
	int idist = NX;		/* the next row starts NX elements later */
	int ostride=1;
	int odist = NX;
	/* Must be non-NULL, otherwise stride and dist are ignored. */
	int inembed[2];
	int onembed[2];
	inembed[0]=NX;  onembed[0]=NX;
	inembed[1] = NY; onembed[1] = NY;	/* the original wrote onembed[0] twice */

	cufftPlanMany(&plan,rank,n,inembed, istride ,idist , onembed, ostride,odist, CUFFT_C2C, NY);
	cufftExecC2C(plan, d_inputData, d_outData, CUFFT_FORWARD);
	cudaMemcpy(output, d_outData, NXY * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

	/* Print one transform (one row) per block of NX lines. */
	for (i = 0; i < NXY; i++) {
		if(i%NX==0)
			printf("\n");
		printf("%f %f \n", output[i].x, output[i].y);
	}

	cufftDestroy(plan);
	free(input);
	free(output);
	cudaFree(d_inputData);
	cudaFree(d_outData);
}

int main() {

	testplanmany();
	return 0;
}

The actual output is:

45.000000 0.000000 
-5.000000 15.388418 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632713 
-5.000000 -6.881910 
-5.000000 -15.388418 

145.000000 0.000000 
-5.000000 15.388418 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632713 
-5.000000 -6.881910 
-5.000000 -15.388418 

245.000000 0.000000 
-4.999997 15.388416 
-5.000000 6.881910 
-5.000000 3.632713 
-5.000000 1.624597 
-5.000000 0.000000 
-5.000000 -1.624597 
-5.000000 -3.632713 
-5.000000 -6.881910 
-4.999997 -15.388416 

345.000000 0.000000 
-4.999998 15.388418 
-4.999999 6.881909 
-4.999999 3.632712 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-4.999999 -3.632712 
-4.999999 -6.881909 
-4.999998 -15.388418 

445.000000 0.000000 
-4.999999 15.388418 
-5.000001 6.881911 
-5.000000 3.632714 
-5.000000 1.624598 
-5.000000 0.000000 
-5.000000 -1.624598 
-5.000000 -3.632714 
-5.000001 -6.881911 
-4.999999 -15.388418

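Now suppose we want an FFT of every column of the same data:

NX=10; NY=5;

rank=1: each transform is a 1-D Fourier transform.

n: an array with rank elements giving the transform size in each dimension. Here n has a single element, n[0]=NY, so each 1-D transform takes 5 elements.

inembed: describes the storage dimensions of the input data in memory. As before, the example declares two elements, inembed[0]=NX, inembed[1]=NY. For a rank-1 plan cuFFT effectively ignores the values, but the pointer must be non-NULL for istride and idist to take effect.

istride: the distance, in elements, between two successive elements inside one transform. Here istride=NX, because consecutive elements of a column are one full row apart.

idist: the distance, in elements, between the first elements of two consecutive transforms. Here idist=1, because the first elements of two adjacent columns are neighbors in memory.

batch: the number of independent transforms. Here batch=NX.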
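
Before handing a strided layout to the library, it can help to check on the host that the parameters stay inside the buffer and pick out the intended elements. The sketch below does both for a batched 1-D layout; layout_max_offset and gather_signal are hypothetical helper names, and the offset rule b * dist + x * stride is the one described in the cuFFT advanced data layout documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Largest element offset touched by a batched 1-D layout.
 * It must be smaller than the number of elements in the buffer. */
size_t layout_max_offset(size_t n, size_t stride, size_t dist, size_t batch)
{
	assert(n > 0 && batch > 0);
	return (batch - 1) * dist + (n - 1) * stride;
}

/* Copy transform b out of a strided layout into contiguous dst[0..n-1]. */
void gather_signal(const float *src, float *dst, size_t b,
                   size_t n, size_t stride, size_t dist)
{
	for (size_t x = 0; x < n; x++)
		dst[x] = src[b * dist + x * stride];
}
```

With the column settings (n=5, stride=10, dist=1, batch=10) the largest offset is 49, which fits the 50-element array, and gathering transform 3 from an array filled with 0, 1, 2, ... yields 3, 13, 23, 33, 43, that is, column 3.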

The code is as follows:

#include <stdlib.h>
#include <stdio.h>

#include <string.h>
#include <math.h>

#include <cuda_runtime.h>
#include <cufft.h>
#include "device_launch_parameters.h"
#define Ndim 2
#define NX 10	/* elements per row = number of columns = batch count */
#define NY 5	/* number of rows = length of each column transform */


void testplanmany() {

	int N[2];
	N[0] = NX, N[1] = NY;
	int NXY = N[0] * N[1];
	cufftComplex *input = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	cufftComplex *output = (cufftComplex*) malloc(NXY * sizeof(cufftComplex));
	int i;
	/* Fill the array with 0, 1, 2, ... so the results are easy to check. */
	for (i = 0; i < NXY; i++) {
		input[i].x = i % 1000;
		input[i].y = 0;
	}
	cufftComplex *d_inputData, *d_outData;
	cudaMalloc((void**) &d_inputData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMalloc((void**) &d_outData, N[0] * N[1] * sizeof(cufftComplex));
	cudaMemcpy(d_inputData, input, N[0] * N[1] * sizeof(cufftComplex), cudaMemcpyHostToDevice);
	cufftHandle plan;
	/*
	cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed,
	int istride, int idist, int *onembed, int ostride,
	int odist, cufftType type, int batch);
	 */
	int rank=1;
	int n[1];
	n[0]=NY;		/* each transform has NY points */
	int istride=NX;		/* elements of one column are NX apart */
	int idist = 1;		/* the next column starts one element later */
	int ostride=NX;
	int odist = 1;
	/* Must be non-NULL, otherwise stride and dist are ignored. */
	int inembed[2];
	int onembed[2];
	inembed[0]=NX;  onembed[0]=NX;
	inembed[1] = NY; onembed[1] = NY;	/* the original wrote onembed[0] twice */

	cufftPlanMany(&plan,rank,n,inembed, istride ,idist , onembed, ostride,odist, CUFFT_C2C, NX);
	cufftExecC2C(plan, d_inputData, d_outData, CUFFT_FORWARD);
	cudaMemcpy(output, d_outData, NXY * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

	/* The output is strided too, so each block of NX lines holds one
	   frequency bin for all NX columns. */
	for (i = 0; i < NXY; i++) {
		if(i%NX==0)
			printf("\n");
		printf("%f %f \n", output[i].x, output[i].y);
	}

	cufftDestroy(plan);
	free(input);
	free(output);
	cudaFree(d_inputData);
	cudaFree(d_outData);
}

int main() {

	testplanmany();
	return 0;
}

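The result is shown below. Because the output uses ostride=NX and odist=1, each block of 10 lines holds one frequency bin for all 10 columns, rather than one complete transform: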


100.000000 0.000000 
105.000000 0.000000 
110.000000 0.000000 
115.000000 0.000000 
120.000000 0.000000 
125.000000 0.000000 
130.000000 0.000000 
135.000000 0.000000 
140.000000 0.000000 
145.000000 0.000000 

-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 
-25.000000 34.409550 

-25.000002 8.122991 
-25.000002 8.122991 
-24.999998 8.122991 
-24.999998 8.122991 
-24.999998 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 
-25.000000 8.122991 

-25.000002 -8.122991 
-25.000002 -8.122991 
-24.999998 -8.122991 
-24.999998 -8.122991 
-24.999998 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 
-25.000000 -8.122991 

-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550 
-25.000000 -34.409550
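
Both result listings can be cross-checked against a closed form, since every row and every column of the test data is an arithmetic sequence. The derivation below is my own check rather than anything taken from the cuFFT documentation; it assumes the unnormalized forward transform with exponent $-2\pi i k n / N$ and uses $a$ for the first element and $d$ for the step.

```latex
x_n = a + d\,n, \quad n = 0, \dots, N-1, \qquad
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}

X_0 = N a + d \, \frac{N(N-1)}{2}, \qquad
X_k = \frac{d N}{2} \left( -1 + i \cot\frac{\pi k}{N} \right)
\quad (1 \le k \le N-1)
```

For the rows, $d = 1$ and $N = 10$ give $X_k = -5 + 5i\cot(\pi k/10)$, for example $-5 + 15.388i$ at $k = 1$, and $X_0 = 45, 145, \dots$ as $a$ goes through $0, 10, \dots$. For the columns, $d = 10$ and $N = 5$ give $X_k = -25 + 25i\cot(\pi k/5)$, that is $-25 + 34.4095i$ at $k = 1$ and $-25 + 8.1230i$ at $k = 2$, with $X_0 = 5a + 100$. The offset $a$ only affects $X_0$, which is why the nonzero bins are identical across rows and across columns.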

This post also drew on the following resource:

https://rocfft.readthedocs.io/en/latest/real.html#setting-strides
