Counting positive elements in an array is a common task, especially in fields like signal processing and data analysis, where you need to filter or transform data based on conditions. A basic scalar approach iterates through each element in the array, checking if it's greater than zero. However, SIMD can process multiple elements simultaneously, resulting in a faster implementation for large arrays.
The scalar version:
#include <iostream>
#include <vector>
size_t countPositive(const float *data, const size_t n) {
size_t count = 0;
for (size_t i = 0; i < n; ++i) {
if (data[i] > 0.0f) {
++count;
}
}
return count;
}
int main() {
std::vector<float> a = {
-2.1, 0, 4.7, 9.8, -7.2, 0, 3.3, 0, 2.1,
15, 0, 8.2, -8.3, 0, -4.2, 6.1, 9.9, 0,
};
auto value = countPositive(a.data(), a.size());
std::cout << value;
return 0;
}
In this code, each element of the array is checked in a loop to see if it's greater than zero. The counter is incremented if the condition is met. The code outputs 8.
Here's the optimized implementation using AVX2:
#include <immintrin.h>
size_t countPositive(const float *data, const size_t n) {
size_t count = 0;
__m256 zero = _mm256_setzero_ps();
size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 vdata = _mm256_loadu_ps(&data[i]);
__m256 cmp = _mm256_cmp_ps(vdata, zero, _CMP_GT_OQ);
size_t mask = _mm256_movemask_ps(cmp);
count += _mm_popcnt_u32(mask);
}
for (; i < n; ++i) {
if (data[i] > 0.0f) {
++count;
}
}
return count;
}
Explanation of AVX2 instructions:
_mm256_setzero_ps
creates a vector of eight zeros, which we use to compare against array elements._mm256_loadu_ps
loads eight floating-point elements from the array._mm256_cmp_ps
compares each element in the loaded vector with zero, returning a result to reflect positive elements._mm256_movemask_ps
converts the comparison result into a bit mask, where each bit indicates whether the corresponding element is positive._mm_popcnt_u32
counts the number of set bits in the mask, which tells us how many of the eight elements are positive.
Any remaining elements are processed in a final fallback loop.
Leave a Comment
Cancel reply