Calculate Standard Deviation of Array Elements using C++ SIMD

Calculate Standard Deviation of Array Elements using C++ SIMD

When working with large datasets in data analysis, statistics, or scientific computing, calculating metrics like standard deviation can become time-consuming. Standard deviation, which measures how spread out numbers are in a dataset, requires iterating through data to compute both the mean and the variance. By leveraging SIMD, we can significantly accelerate this process.

The scalar version:

#include <iostream>
#include <vector>
#include <cmath>

float calculateStd(const float *data, const size_t n) {
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        mean += data[i];
    }
    mean /= (float) n;

    float result = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float diff = data[i] - mean;
        result += diff * diff;
    }

    return std::sqrt(result / (float) n);
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };

    auto value = calculateStd(a.data(), a.size());
    std::cout << value;

    return 0;
}

Standard deviation calculation works as follows:

  • Mean Calculation: It iterates over each element to compute the mean.
  • Variance Accumulation: It calculates the squared difference between each element and the mean, accumulating this to get the variance.
  • Standard Deviation: The square root of the variance gives the standard deviation.

The code outputs 5.18813.

AVX2 implementation:

#include <immintrin.h>

float calculateStd(const float *data, const size_t n) {
    __m256 vmean = _mm256_setzero_ps();

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vmean = _mm256_add_ps(vmean, vdata);
    }

    __m128 bottom = _mm256_castps256_ps128(vmean);
    __m128 top = _mm256_extractf128_ps(vmean, 1);

    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    float mean = _mm_cvtss_f32(bottom);
    for (; i < n; ++i) {
        mean += data[i];
    }
    mean /= (float) n;

    vmean = _mm256_set1_ps(mean);
    __m256 vresult = _mm256_setzero_ps();

    for (i = 0; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vdiff = _mm256_sub_ps(vdata, vmean);
        vdiff = _mm256_mul_ps(vdiff, vdiff);
        vresult = _mm256_add_ps(vresult, vdiff);
    }

    bottom = _mm256_castps256_ps128(vresult);
    top = _mm256_extractf128_ps(vresult, 1);

    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    float result = _mm_cvtss_f32(bottom);
    for (; i < n; ++i) {
        float diff = data[i] - mean;
        result += diff * diff;
    }

    return std::sqrt(result / (float) n);
}

Code explanation:

  • Mean Calculation:

The code calculates the sum by loading 8 elements at a time from the array, accumulating these into vmean with _mm256_add_ps. After processing in chunks of 8, the partial sums in vmean are horizontally summed using _mm_hadd_ps and then combined. Any remaining elements are added in a scalar loop to get the total sum, and the mean is then calculated by dividing by n.

  • Variance Accumulation

For each chunk of 8 elements, we calculate the difference from the mean using _mm256_sub_ps, square each difference with _mm256_mul_ps, and accumulate these squared differences into vresult. Again, we horizontally sum the final vresult to combine partial variances, and any leftover elements are handled by a scalar loop.

  • Standard Deviation

The accumulated variance is averaged by dividing by n, and finally, we calculate the square root of this average to get the standard deviation.

Leave a Comment

Cancel reply

Your email address will not be published.