Calculating the mean of an array is a common operation that involves summing all elements and dividing by the number of elements. When using SIMD, we can perform this calculation faster by processing multiple elements simultaneously.
The scalar version:
#include <cstddef>
#include <iostream>
#include <vector>

float calculateMean(const float *data, const std::size_t n) {
    // Accumulate every element into a single running total.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    // Note: n must be non-zero, otherwise this divides by zero.
    return sum / static_cast<float>(n);
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    auto value = calculateMean(a.data(), a.size());
    std::cout << value << '\n';
    return 0;
}
The loop iterates over each element, adding it to a running total. Once the loop completes, the mean is calculated by dividing the total by the array size. For the example input, the elements 1 through 18 sum to 171, so the program prints 171 / 18 = 9.5.
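Code optimization with AVX2 (the floating-point intrinsics used here require only AVX, which every AVX2-capable CPU also supports):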
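#include <immintrin.h>
#include <cstddef>

float calculateMean(const float *data, const std::size_t n) {
    // Eight partial sums, one per 32-bit lane, all starting at zero.
    __m256 vsum = _mm256_setzero_ps();
    std::size_t i = 0;
    // Main loop: load eight floats (unaligned) and add them lane by lane.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vsum = _mm256_add_ps(vsum, vdata);
    }
    // Fold the upper 128 bits onto the lower 128 bits: 8 partial sums -> 4.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);
    bottom = _mm_add_ps(bottom, top);
    // Two horizontal adds: 4 -> 2 -> 1, leaving the total in every lane.
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);
    // Read lane 0 as a scalar.
    float sum = _mm_cvtss_f32(bottom);
    // Scalar tail: handle the remaining n % 8 elements.
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum / static_cast<float>(n);
}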
There are three main parts:
- Vectorized Sum Accumulation
The _mm256_loadu_ps intrinsic loads eight elements from the array, and _mm256_add_ps accumulates these values in a vector.
- Horizontal Summation
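After the loop, the eight partial sums in vsum are combined: _mm256_extractf128_ps and _mm_add_ps fold the upper four lanes onto the lower four, and two _mm_hadd_ps calls reduce the remaining four values to a single sum.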
- Final Sum and Mean
Remaining elements (if any) are added to the total sum in a final scalar loop, and the sum is then divided by the element count to obtain the mean.
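To make the three parts easier to follow, here is a minimal portable sketch that mimics them without intrinsics. It is an illustration only, not a replacement for the AVX code: the name calculateMeanLanes is hypothetical, and a plain std::array of eight floats stands in for the __m256 register.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not part of the original code): emulates the AVX
// routine with an 8-element array playing the role of the vsum register.
float calculateMeanLanes(const float *data, const std::size_t n) {
    std::array<float, 8> lanes{};  // eight partial sums, all zero

    // Part 1: vectorized accumulation. Lane k sums elements k, k + 8, ...
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (std::size_t k = 0; k < 8; ++k) {
            lanes[k] += data[i + k];
        }
    }

    // Part 2: horizontal summation, 8 -> 4 -> 2 -> 1 partial sums.
    // Fold the upper four lanes onto the lower four (extract + add).
    float q[4];
    for (std::size_t k = 0; k < 4; ++k) {
        q[k] = lanes[k] + lanes[k + 4];
    }
    // First horizontal add: sum adjacent pairs.
    const float p0 = q[0] + q[1];
    const float p1 = q[2] + q[3];
    // Second horizontal add: sum the two remaining values.
    float sum = p0 + p1;

    // Part 3: scalar tail for the last n % 8 elements, then the mean.
    for (; i < n; ++i) {
        sum += data[i];
    }
    return sum / static_cast<float>(n);
}
```

For the 18-element example, the main loop runs twice and leaves the partial sums 10, 12, ..., 24 in the lanes. The reduction gives 136, the tail adds 17 and 18 for a total of 171, and the result is 9.5. Because the additions happen in a different order than in the scalar loop, results on other inputs can differ from the scalar version in the last bits due to floating-point rounding.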