Counting negative elements in an array is a common task that can be optimized with SIMD operations. Using SIMD allows us to process multiple elements simultaneously, enhancing performance, especially for large datasets.
The scalar implementation:
#include <iostream>
#include <vector>
size_t countNegative(const float *data, const size_t n) {
size_t count = 0;
for (size_t i = 0; i < n; ++i) {
if (data[i] < 0.0f) {
++count;
}
}
return count;
}
int main() {
std::vector<float> a = {
-2.1, 0, 4.7, 9.8, -7.2, 0, 3.3, 0, 2.1,
15, 0, 8.2, -8.3, 0, -4.2, 6.1, 9.9, 0,
};
auto value = countNegative(a.data(), a.size());
std::cout << value;
return 0;
}
In the code, we iterate through each element in the array, checking if it is negative and incrementing a counter if true. The code outputs 4.
This approach works well for small datasets, but may become inefficient for larger arrays due to its sequential nature.
The AVX2 implementation:
#include <immintrin.h>
size_t countNegative(const float *data, const size_t n) {
size_t count = 0;
__m256 zero = _mm256_setzero_ps();
size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 vdata = _mm256_loadu_ps(&data[i]);
__m256 cmp = _mm256_cmp_ps(vdata, zero, _CMP_LT_OQ);
size_t mask = _mm256_movemask_ps(cmp);
count += _mm_popcnt_u32(mask);
}
for (; i < n; ++i) {
if (data[i] < 0.0f) {
++count;
}
}
return count;
}
Explanation of AVX2 instructions:
_mm256_setzero_ps
creates a vector of eight zero values which will be used for comparison._mm256_loadu_ps
loads eight float values from the array._mm256_cmp_ps
compares each element with zero, returning a result to reflect negative elements._mm256_movemask_ps
converts the comparison result to a bit mask, where each bit indicates whether the corresponding element is negative._mm_popcnt_u32
counts the number of set bits in the mask, which tells us how many of the eight elements are negative.
Any remaining elements are handled in a final loop.
Leave a Comment
Cancel reply