Perform Element-wise Addition of Arrays using CUDA C++

When working with large datasets or performance-critical programs, speed and efficiency matter. A common task in such programs is adding two arrays element by element, so that each output element is the sum of the elements at the same index in the inputs. For small arrays the cost is negligible, but the work grows linearly with the array size. Because every output element can be computed independently of the others, the operation maps naturally onto a GPU, and CUDA can speed it up considerably for large arrays.

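The code below defines a kernel function, vectorAdd, that every GPU thread executes. Each thread computes its global index from its block index and thread index and, if that index is within bounds, adds one pair of corresponding elements from the input arrays.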

In the main function, the host vectors are initialized and device memory is allocated with cudaMalloc. The input data is then copied from host to device memory with cudaMemcpy. The kernel is launched with 256 threads per block and enough blocks to cover all array elements. Once it has run, the result is copied back to the host with another cudaMemcpy. Finally, the output is printed and the device memory is released with cudaFree to avoid leaks.
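The block count used for the launch is a ceiling division: it rounds the element count up to a whole number of blocks. As a minimal sketch in plain host-side C++ (blocksFor is a hypothetical helper name, not something the program itself defines), the calculation can be isolated like this:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration: the smallest number of blocks such that
// blocks * blockSize >= n, i.e. the ceiling of n / blockSize.
std::size_t blocksFor(std::size_t n, std::size_t blockSize) {
    assert(blockSize > 0);  // a zero block size would divide by zero
    return (n + blockSize - 1) / blockSize;
}
```

For the 18-element arrays and a block size of 256, blocksFor(18, 256) is 1, so a single block of 256 threads is launched and the bounds check in the kernel keeps the 238 surplus threads from touching memory. The full program uses the same expression inline.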

#include <iostream>
#include <vector>

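// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *result, const size_t n) {
    // Global index of this thread across the whole grid.
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Threads whose index falls past the end of the arrays do nothing.
    if (i < n) {
        result[i] = a[i] + b[i];
    }
}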

int main() {
    std::vector<float> a = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    };
    std::vector<float> b = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    std::vector<float> result(a.size());

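    // Allocate device buffers for both inputs and the result.
    size_t bytes = a.size() * sizeof(float);
    float *da, *db, *dresult;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dresult, bytes);

    // Copy the inputs from host memory to device memory.
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);

    // Use 256 threads per block and round the block count up so every element is covered.
    size_t blockSize = 256;
    size_t numBlocks = (a.size() + blockSize - 1) / blockSize;

    // Launch the kernel, then copy the result back to the host.
    // This cudaMemcpy does not start until the kernel has finished.
    vectorAdd<<< numBlocks, blockSize >>>(da, db, dresult, a.size());
    cudaMemcpy(result.data(), dresult, bytes, cudaMemcpyDeviceToHost);

    // Print the sums.
    for (auto value: result) {
        std::cout << value << " ";
    }

    // Release the device memory.
    cudaFree(da);
    cudaFree(db);
    cudaFree(dresult);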

    return 0;
}

Output:

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35
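
To see why the program prints these values, the launch can be emulated on the CPU. The following plain C++ sketch is an illustration under stated assumptions: emulateVectorAdd is a hypothetical name, and the two nested loops run sequentially, whereas the GPU executes the threads in parallel and in no guaranteed order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the vectorAdd launch, for illustration only.
// The outer loop stands in for blockIdx.x and the inner loop for threadIdx.x.
std::vector<float> emulateVectorAdd(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t blockSize) {
    assert(a.size() == b.size());
    assert(blockSize > 0);
    const std::size_t n = a.size();
    // Same ceiling division as in the program's launch configuration.
    const std::size_t numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<float> result(n);
    for (std::size_t block = 0; block < numBlocks; ++block) {
        for (std::size_t thread = 0; thread < blockSize; ++thread) {
            // Global index, computed the same way as in the kernel.
            const std::size_t i = blockSize * block + thread;
            if (i < n) {  // same bounds guard as the kernel
                result[i] = a[i] + b[i];
            }
        }
    }
    return result;
}
```

Calling emulateVectorAdd(a, b, 256) with the two 18-element inputs from the program yields the same 18 sums as the output shown above. Comparing a kernel's result with a host reference like this is a simple sanity check.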
