In performance-critical applications like game engines, real-time simulations, or high-throughput data processing, handling large arrays efficiently is crucial. One common operation you might encounter is clamping all positive values in an array to zero. While a naive loop works just fine for small data sizes, it doesn't leverage the full power of modern CPUs. SIMD (single instruction, multiple data) allows a single instruction to operate on multiple data elements in parallel, which can speed up such operations significantly.
Here's the basic implementation:
#include <cstddef>
#include <iostream>
#include <vector>

// Replace every positive element with zero; leave all other values untouched.
void clampToZero(float *data, const std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = data[i] > 0 ? 0 : data[i];
    }
}

int main() {
    std::vector<float> a = {
        -1.5, 0, 1, -1.5, 2, -2.5, 3, -3.5, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13,
    };
    clampToZero(a.data(), a.size());
    for (auto value : a) {
        std::cout << value << " ";
    }
    std::cout << '\n';
    return 0;
}
The scalar approach iterates over each element in the array, checks whether it's positive, and if so replaces it with zero. Output:
-1.5 0 0 -1.5 0 -2.5 0 -3.5 0 -5 0 -7 0 -9 0 -11 0 -13
This method is fine for small arrays, but it processes only one element per iteration (unless the compiler manages to auto-vectorize the loop), so it leaves performance on the table for larger datasets.
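Here's the optimized implementation using 256-bit AVX intrinsics (every intrinsic used here belongs to the original AVX set, so AVX2 hardware is sufficient but not required):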
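#include <cstddef>
#include <immintrin.h>

void clampToZero(float *data, const std::size_t n) {
    __m256 zero = _mm256_setzero_ps();  // eight lanes of 0.0f
    std::size_t i = 0;
    // Main loop: process eight floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);       // unaligned load
        __m256 vdataZero = _mm256_min_ps(vdata, zero);  // per-lane min(x, 0)
        _mm256_storeu_ps(&data[i], vdataZero);          // unaligned store
    }
    // Tail: fewer than eight elements remain.
    for (; i < n; ++i) {
        data[i] = data[i] > 0 ? 0 : data[i];
    }
}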
Here's how the vectorized version operates:
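_mm256_setzero_ps initializes a vector filled with zeros.
_mm256_loadu_ps reads 8 floating-point values at once from the input array; the address does not have to be 32-byte aligned.
_mm256_min_ps performs an element-wise comparison between the loaded values and zero, selecting the smaller of the two. This effectively sets all positive values to zero, while keeping negative values unchanged. One caveat: when a lane holds NaN, the instruction returns its second operand, so with the operand order used here a NaN input becomes zero, whereas the scalar loop leaves it as NaN.
_mm256_storeu_ps writes the modified vector back into the original array.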
Any leftover elements that don't fit into a full SIMD register are handled with the regular scalar loop.
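One way to check that the vectorized version matches the scalar one is to run both on the same input and compare the results bit for bit. The sketch below is a minimal, hypothetical harness rather than part of the original code: the names clampToZeroScalar, clampToZeroAvx, sameAsScalar and the CLAMP_AVX_TARGET macro are made up for illustration, and it assumes GCC or Clang on an x86-64 CPU with AVX support. It also passes zero as the first operand of _mm256_min_ps. The instruction returns its second operand when a lane is NaN or when both lanes are zero, so with this order NaN and -0.0f come back unchanged, just as in the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <limits>
#include <vector>
#include <immintrin.h>

// Assumption: GCC or Clang. The target attribute enables AVX for a single
// function, so the file also builds without passing -mavx.
#if defined(__GNUC__)
#define CLAMP_AVX_TARGET __attribute__((target("avx")))
#else
#define CLAMP_AVX_TARGET
#endif

// Scalar reference, same logic as the basic implementation.
void clampToZeroScalar(float *data, const std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = data[i] > 0 ? 0 : data[i];
    }
}

// Hypothetical variant of the vectorized loop with the operands swapped:
// zero comes first, so NaN and -0.0f lanes return the input value.
CLAMP_AVX_TARGET
void clampToZeroAvx(float *data, const std::size_t n) {
    assert(data != nullptr || n == 0);
    const __m256 zero = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(&data[i]);
        _mm256_storeu_ps(&data[i], _mm256_min_ps(zero, v));
    }
    for (; i < n; ++i) {
        data[i] = data[i] > 0 ? 0 : data[i];
    }
}

// Returns true when both versions produce bit-identical output.
bool sameAsScalar(std::vector<float> input) {
    std::vector<float> expected = input;
    clampToZeroScalar(expected.data(), expected.size());
    clampToZeroAvx(input.data(), input.size());
    return std::memcmp(input.data(), expected.data(),
                       input.size() * sizeof(float)) == 0;
}
```

Calling sameAsScalar with a mix of positive, negative, NaN, and infinite values, and with a length that is not a multiple of 8, exercises both the vector loop and the scalar tail.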