When working with large datasets on a GPU, it is often necessary to keep values within a fixed range. This operation, called clamping, is embarrassingly parallel and maps naturally onto CUDA: every element can be checked and adjusted independently of the others. This tutorial shows how to clamp the elements of an array to a given range using CUDA C++.
We define a kernel called clamp that takes a pointer to the data array, the total number of elements, and the minimum and maximum bounds. Each thread handles one element: any value below the minimum is raised to the minimum, and any value above the maximum is lowered to the maximum.
In the main function, the input array is first created on the host, and device memory is allocated with cudaMalloc. The data is copied to the GPU with cudaMemcpy, and the kernel is launched with enough blocks of 256 threads to cover every element. Because cudaMemcpy synchronizes with the device, the updated array can then be copied back to the host, printed to the console, and the device memory released with cudaFree.
#include <iostream>
#include <vector>

// Clamp each element of data to the range [min, max].
__global__ void clamp(float *data, const size_t n, const float min, const float max) {
    const unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        if (data[i] < min) {
            data[i] = min;
        }
        if (data[i] > max) {
            data[i] = max;
        }
    }
}

int main() {
    std::vector<float> a = {
        -0.4f, -0.3f, -0.2f, -0.1f, 0.0f, 0.1f, 0.2f, 0.3f, 0.4f,
        0.5f, 0.6f, 0.7f, 0.8f, 1.0f, 1.1f, 1.2f, 1.3f, 1.4f,
    };
    const size_t bytes = a.size() * sizeof(float);

    // Allocate device memory and copy the input to the GPU.
    float *da;
    cudaMalloc(&da, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover every element.
    const size_t blockSize = 256;
    const size_t numBlocks = (a.size() + blockSize - 1) / blockSize;
    clamp<<<numBlocks, blockSize>>>(da, a.size(), 0.0f, 1.0f);

    // cudaMemcpy synchronizes with the device before copying back.
    cudaMemcpy(a.data(), da, bytes, cudaMemcpyDeviceToHost);

    for (auto value : a) {
        std::cout << value << " ";
    }
    std::cout << "\n";

    cudaFree(da);
    return 0;
}
Output:
0 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1 1 1 1 1