In performance-critical applications such as game engines, real-time simulations, or high-throughput data processing, handling large arrays efficiently is crucial. One operation you might encounter is clamping all positive values in an array to zero. A naive loop works fine for small inputs, but it doesn't take full advantage of modern CPUs. SIMD (Single Instruction, Multiple Data) lets a single instruction operate on multiple data elements in parallel, which can significantly speed up such operations.
Here's the basic implementation:
#include <iostream>
#include <vector>
void clampToZero(float *data, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = data[i] > 0 ? 0 : data[i];
    }
}
int main() {
    std::vector<float> a = {
        -1.5, 0, 1, -1.5, 2, -2.5, 3, -3.5, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13,
    };
    clampToZero(a.data(), a.size());
    for (auto value: a) {
        std::cout << value << " ";
    }
    return 0;
}
The scalar approach iterates over each element in the array and checks whether it is positive; if so, it replaces the value with zero. Output:
-1.5 0 0 -1.5 0 -2.5 0 -3.5 0 -5 0 -7 0 -9 0 -11 0 -13
Although this method performs well on small arrays, as written it handles one element per iteration, so it becomes inefficient on larger datasets.
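The same per-element rule can also be expressed as "take the smaller of the value and zero", which is the form the vectorized version relies on. The sketch below is only an illustration: the name clampToZeroMin is hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical variant of clampToZero. std::min(x, 0.0f) yields 0 only
// when 0 < x, so non-positive values pass through unchanged.
void clampToZeroMin(float *data, const std::size_t n) {
    assert(data != nullptr || n == 0); // precondition: valid buffer
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = std::min(data[i], 0.0f);
    }
}
```

Calling clampToZeroMin(a.data(), a.size()) on the example vector should give the same values as the loop with the conditional expression.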
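Here's the optimized implementation using 256-bit AVX intrinsics (the float operations used here need only AVX, so they are also available on any AVX2-capable CPU):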
#include <immintrin.h>
#include <cstddef>
void clampToZero(float *data, const size_t n) {
    const __m256 zero = _mm256_setzero_ps();
    size_t i = 0;
    // Process 8 floats (one 256-bit register) per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vdataZero = _mm256_min_ps(vdata, zero);
        _mm256_storeu_ps(&data[i], vdataZero);
    }
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; ++i) {
        data[i] = data[i] > 0 ? 0 : data[i];
    }
}
Here's how the vectorized version operates:
- _mm256_setzero_ps initializes a vector filled with zeros.
- _mm256_loadu_ps reads 8 floating-point values at once from the input array.
- _mm256_min_ps performs an element-wise comparison between the loaded values and zero, selecting the smaller of the two. This effectively sets all positive values to zero, while keeping non-positives unchanged.
- _mm256_storeu_ps writes the modified vector back into the original array.
Any leftover elements that don't fit into a full SIMD register are handled with the regular scalar loop.
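To make the control flow above concrete without requiring AVX hardware, here is a portable scalar model of the same structure. It is only a sketch: minLane and clampToZeroModel are hypothetical names, and minLane mimics the per-lane rule documented for the x86 MINPS instruction (the first operand is returned only if it is strictly less than the second; otherwise the second is returned). One consequence worth knowing: with the operand order used above, a NaN input becomes 0 inside the full 8-element chunks, whereas the scalar tail loop leaves a NaN untouched.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar model of a single MINPS lane: a is returned only when a < b holds.
inline float minLane(float a, float b) { return a < b ? a : b; }

// Hypothetical model of the vectorized function: full chunks of 8 "lanes",
// followed by a scalar tail for the remaining n % 8 elements.
void clampToZeroModel(float *data, const std::size_t n) {
    assert(data != nullptr || n == 0); // precondition: valid buffer
    constexpr std::size_t kLanes = 8;  // floats per 256-bit register
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            data[i + lane] = minLane(data[i + lane], 0.0f);
        }
    }
    for (; i < n; ++i) {
        data[i] = data[i] > 0 ? 0 : data[i];
    }
}
```

For the 18-element example array, the chunked loop covers indices 0 through 15 (two chunks of 8) and the tail loop handles the last two elements. If NaN inputs matter, swapping the operands of the min, so that zero comes first, would make the chunked path keep NaNs just as the tail loop does.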