随想录（windows上cuda环境安装）

cuda是nvidia公司用于gpu开发的一门语言。它来自于c，但是又对c进行了扩展。目前cuda广泛用于高性能计算、深度学习训练、嵌入式设备等各种应用场景。然而cuda入门容易，深入困难，如果只是基本概念学习，不足以对其有深刻的认识，因此最好结合具体的代码来一起开发，才能真正将gpu的作用发挥出来。

因此，我查看了个人笔记本，虽然显示的是nvidia mx150，一个比较低阶的移动GPU版本，但是也可以用cuda进行开发。所以利用下午这一段时间，安装了一下cuda环境，收获很多。

1、安装visual studio 2015

目前cuda支持vs2012、vs2013、vs2015、vs2017、vs2019。

2、下载cuda安装包

可以选择适合自己的cuda包，我这里下载的是cuda_10.2.89_441.22_win10.exe。

3、安装cuda软件

基本上不停的按下一步、下一步就可以了。中间cuda会先自解压，然后再进行安装。

4、确认cuda是否安装成功

在cmd下面输入nvcc --help，如果有打印信息，代表一切ok。

5、查找示例代码

如果安装没有问题，在C:\ProgramData\NVIDIA Corporation\CUDA Samples会看到一个目录，目录下有很多的示例代码。

6、编译示例代码

示例代码很多，分别是0_Simple、1_Utilities、2_Graphics、3_Imaging、4_Finance、5_Simulations、6_Advanced、7_CUDALibraries。刚开始的时候可以只选择编译一部分内容，比如0_Simple，如果编译、运行本身没多大问题，说明我们的安装是非常成功的。

7、简单的示例代码vectorAdd.cu

代码内容就是一个向量的计算，比较简洁，但是对我们了解cuda加速的原理足够了。

/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>
/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}

8、验证是否可以从visual studio创建nvida工程

9、从windows到ubuntu开发环境

windows上面的vs环境调试比较方便，整体使用比较容易。对于开发嵌入式设备的朋友来说，在移植到nvidia jetson ubuntu环境之前，最好还是在windows环境上优化好，这样可以节省不少的时间。

随想录（windows上cuda环境安装）

使用c#强大的表达式树实现对象的深克隆之解决循环引用的问题

GPT-4o 引领人机交互新风向，向量数据库赛道沸腾了

free AI online tools All In One

痞子衡嵌入式：恩智浦i.MX RT1xxx系列MCU启动那些事（12.A）- uSDHC eMMC启动时间(RT1170)

基于Ubuntu-22.04安装K8s-v1.28.2实验（二）使用kube-vip实现集群VIP访问

基于Ubuntu-22.04安装K8s-v1.28.2实验（三）数据卷挂载NFS（网络文件系统）

企业大模型如何成为自己数据的“百科全书”？

本地SSL证书过期输入命令在IIS自动生成

.NET周刊【5月第2期 2024-05-12】

基于Ubuntu-22.04安装K8s-v1.28.2实验（一）部署K8s

隨想錄（386cpu保護模式）

機器學習與計算機視覺（計算機視覺的嵌入式平臺）

機器學習與計算機視覺（FPGA的圖像處理方法）

隨想錄（文件系統的第一個用戶程序shell）

隨想錄（反調試技術）

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結