Limit Array Elements Within Range using C++ SIMD

In many applications, the values in an array must stay within a specific range. Restricting them to that range is known as clamping, and it helps prevent out-of-range values from propagating into later computations. The simple approach is to iterate through the array and check each element, but this can become slow on large datasets. Fortunately, the SIMD (single instruction, multiple data) units in modern CPUs let us process several elements with one instruction, which can result in significant performance gains.

Here's the basic implementation:

#include <cstddef>
#include <iostream>
#include <vector>

// Clamp each of the n elements of data to the range [min, max], in place.
void clamp(float *data, const size_t n, const float min, const float max) {
    for (size_t i = 0; i < n; ++i) {
        if (data[i] < min) {
            data[i] = min;
        }
        if (data[i] > max) {
            data[i] = max;
        }
    }
}

int main() {
    std::vector<float> a = {
        -0.4, -0.3, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.4,
        0.5, 0.6, 0.7, 0.8, 1, 1.1, 1.2, 1.3, 1.4,
    };

    clamp(a.data(), a.size(), 0, 1);
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

In this code, the clamp function accepts an array and adjusts each element to be within the provided minimum and maximum bounds. The main function initializes an array, applies the clamp function, and then prints the resulting values. Output:

0 0 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1 1 1 1 1
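
The per-element comparison above can also be expressed with std::clamp from <algorithm>. The following is a minimal sketch, assuming a C++17 compiler; the name clampStd is hypothetical and chosen only to avoid clashing with the clamp function in this article.

```cpp
#include <algorithm>  // std::clamp (C++17)
#include <cassert>
#include <cstddef>

// Hypothetical alternative to the article's clamp(): same in-place behavior,
// but the two comparisons are delegated to std::clamp.
void clampStd(float *data, const std::size_t n, const float min, const float max) {
    assert(min <= max);  // std::clamp has undefined behavior if min > max
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = std::clamp(data[i], min, max);
    }
}
```

Calling clampStd(a.data(), a.size(), 0, 1) on the vector from the example above should give the same result as the hand-written loop.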

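To improve performance, we can use 256-bit SIMD instructions, which process eight float values at a time. The single-precision intrinsics used below were introduced with AVX, so they are available on any AVX2-capable CPU. With GCC or Clang, the code typically needs a flag such as -mavx2 to compile.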

Here's how to implement the clamping function using AVX2:

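#include <cstddef>
#include <immintrin.h>

void clamp(float *data, const size_t n, const float min, const float max) {
    // Broadcast the bounds into all 8 lanes of a 256-bit register.
    __m256 vmin = _mm256_set1_ps(min);
    __m256 vmax = _mm256_set1_ps(max);

    size_t i = 0;
    // Main loop: clamp 8 floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);         // unaligned load
        __m256 vdataMin = _mm256_max_ps(vdata, vmin);     // raise to at least min
        __m256 vresult = _mm256_min_ps(vdataMin, vmax);   // lower to at most max
        _mm256_storeu_ps(&data[i], vresult);              // unaligned store
    }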

    for (; i < n; ++i) {
        if (data[i] < min) {
            data[i] = min;
        }
        if (data[i] > max) {
            data[i] = max;
        }
    }
}

In this optimized implementation, the clamp function uses AVX2 intrinsics for efficient processing:

  • _mm256_set1_ps creates the vectors vmin and vmax, with the minimum and maximum values replicated across all 8 elements.
  • _mm256_loadu_ps loads 8 floating-point numbers from the array.
  • _mm256_max_ps clamps each value to ensure it is not below the minimum.
  • _mm256_min_ps then clamps the result to ensure it does not exceed the maximum.
  • _mm256_storeu_ps stores the modified values back into the array.

The function processes chunks of 8 elements at a time, and any remaining elements are handled using the basic approach to ensure all values are clamped.
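
If the program may run on CPUs without AVX, one option is to pick the implementation at run time. The following is only a sketch under stated assumptions: it relies on the GCC/Clang extensions __attribute__((target("avx"))) and __builtin_cpu_supports, and the names CLAMP_HAVE_AVX_PATH, clampScalar, clampAvx, and clampDispatch are hypothetical, not part of the original code. On other compilers or architectures it falls back to the scalar loop.

```cpp
#include <cstddef>

// Assumption: the AVX path is only enabled for GCC/Clang on x86.
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define CLAMP_HAVE_AVX_PATH 1
#endif

// Scalar fallback, same logic as the basic implementation.
void clampScalar(float *data, std::size_t n, float min, float max) {
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < min) data[i] = min;
        if (data[i] > max) data[i] = max;
    }
}

#ifdef CLAMP_HAVE_AVX_PATH
// The target attribute lets this one function use AVX instructions
// even if the rest of the file is compiled without -mavx or -mavx2.
__attribute__((target("avx")))
void clampAvx(float *data, std::size_t n, float min, float max) {
    const __m256 vmin = _mm256_set1_ps(min);
    const __m256 vmax = _mm256_set1_ps(max);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        v = _mm256_min_ps(_mm256_max_ps(v, vmin), vmax);
        _mm256_storeu_ps(&data[i], v);
    }
    // Fewer than 8 elements remain: finish with the scalar loop.
    clampScalar(data + i, n - i, min, max);
}
#endif

// Uses the AVX path when the CPU reports support, otherwise the scalar loop.
void clampDispatch(float *data, std::size_t n, float min, float max) {
#ifdef CLAMP_HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        clampAvx(data, n, min, max);
        return;
    }
#endif
    clampScalar(data, n, min, max);
}
```

A caller would use clampDispatch(a.data(), a.size(), 0, 1) in place of clamp. The check runs on every call, so for very small arrays it may be worth resolving the choice once and storing a function pointer.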
