When working with large datasets in data analysis, statistics, or scientific computing, calculating metrics like standard deviation can become time-consuming. Standard deviation, which measures how spread out numbers are in a dataset, requires iterating through data to compute both the mean and the variance. By leveraging SIMD, we can significantly accelerate this process.
The scalar version:
#include <iostream>
#include <vector>
#include <cmath>
float calculateStd(const float *data, const size_t n) {
float mean = 0.0f;
for (size_t i = 0; i < n; ++i) {
mean += data[i];
}
mean /= (float) n;
float result = 0.0f;
for (size_t i = 0; i < n; ++i) {
float diff = data[i] - mean;
result += diff * diff;
}
return std::sqrt(result / (float) n);
}
int main() {
std::vector<float> a = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
};
auto value = calculateStd(a.data(), a.size());
std::cout << value;
return 0;
}
Standard deviation calculation works as follows:
- Mean Calculation: It iterates over each element to compute the mean.
- Variance Accumulation: It calculates the squared difference between each element and the mean, accumulating this to get the variance.
- Standard Deviation: The square root of the variance gives the standard deviation.
The code outputs 5.18813.
AVX2 implementation:
#include <immintrin.h>
float calculateStd(const float *data, const size_t n) {
__m256 vmean = _mm256_setzero_ps();
size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 vdata = _mm256_loadu_ps(&data[i]);
vmean = _mm256_add_ps(vmean, vdata);
}
__m128 bottom = _mm256_castps256_ps128(vmean);
__m128 top = _mm256_extractf128_ps(vmean, 1);
bottom = _mm_add_ps(bottom, top);
bottom = _mm_hadd_ps(bottom, bottom);
bottom = _mm_hadd_ps(bottom, bottom);
float mean = _mm_cvtss_f32(bottom);
for (; i < n; ++i) {
mean += data[i];
}
mean /= (float) n;
vmean = _mm256_set1_ps(mean);
__m256 vresult = _mm256_setzero_ps();
for (i = 0; i + 8 <= n; i += 8) {
__m256 vdata = _mm256_loadu_ps(&data[i]);
__m256 vdiff = _mm256_sub_ps(vdata, vmean);
vdiff = _mm256_mul_ps(vdiff, vdiff);
vresult = _mm256_add_ps(vresult, vdiff);
}
bottom = _mm256_castps256_ps128(vresult);
top = _mm256_extractf128_ps(vresult, 1);
bottom = _mm_add_ps(bottom, top);
bottom = _mm_hadd_ps(bottom, bottom);
bottom = _mm_hadd_ps(bottom, bottom);
float result = _mm_cvtss_f32(bottom);
for (; i < n; ++i) {
float diff = data[i] - mean;
result += diff * diff;
}
return std::sqrt(result / (float) n);
}
Code explanation:
- Mean Calculation:
The code calculates the sum by loading 8 elements at a time from the array, accumulating these into vmean
with _mm256_add_ps
. After processing in chunks of 8, the partial sums in vmean
are horizontally summed using _mm_hadd_ps
and then combined. Any remaining elements are added in a scalar loop to get the total sum, and the mean is then calculated by dividing by n
.
- Variance Accumulation
For each chunk of 8 elements, we calculate the difference from the mean using _mm256_sub_ps
, square each difference with _mm256_mul_ps
, and accumulate these squared differences into vresult
. Again, we horizontally sum the final vresult
to combine partial variances, and any leftover elements are handled by a scalar loop.
- Standard Deviation
The accumulated variance is averaged by dividing by n
, and finally, we calculate the square root of this average to get the standard deviation.
Leave a Comment
Cancel reply