Calculate Mean Absolute Error using C++ SIMD

In machine learning and statistical analysis, the Mean Absolute Error (MAE) is a common metric for evaluating the accuracy of predictions. It is the average of the absolute differences between predicted and actual values. A basic scalar implementation is fine for small datasets, but processing several elements per instruction with SIMD can substantially speed up the calculation on large arrays.

The traditional implementation:

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Scalar reference: average of |a[i] - b[i]| over n elements.
float mae(const float *a, const float *b, const size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += std::fabs(a[i] - b[i]);
    }

    return sum / (float) n;
}

int main() {
    std::vector<float> a = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        11, 12, 13, 14, 15, 16, 17,
    };
    std::vector<float> b = {
        0.5, 1, 2.5, 3, 4.5, 5, 6.5, 7, 8.5,
        9, 10.5, 11, 12.5, 13, 14.5, 15, 16.5, 17,
    };

    float value = mae(a.data(), b.data(), a.size());
    std::cout << value;

    return 0;
}

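In this code, the function mae calculates the absolute difference between corresponding elements of two arrays, sums them up, and divides by the number of elements to get the mean. Nine of the 18 pairs differ by 0.5 and the rest are equal, so the sum is 4.5 and the mean is 4.5 / 18 = 0.25. Output: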

0.25

While this method is straightforward, it processes one element per loop iteration, which can become a bottleneck for larger datasets.

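Here's the optimized implementation using AVX2, which processes 8 floats per loop iteration. The 256-bit float intrinsics used here were introduced with AVX, so they are available on any AVX2-capable CPU (for example, when compiling with -mavx2 on GCC or Clang):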

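#include <immintrin.h>
#include <cmath>
#include <cstddef>

float mae(const float *a, const float *b, const size_t n) {
    // -0.0f has only the sign bit set in every lane.
    __m256 signMask = _mm256_set1_ps(-0.0f);
    // Eight running partial sums, one per lane.
    __m256 vsum = _mm256_setzero_ps();

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vdiff = _mm256_sub_ps(va, vb);
        // Clear the sign bit of each lane: |a[i] - b[i]|.
        vdiff = _mm256_andnot_ps(signMask, vdiff);
        vsum = _mm256_add_ps(vsum, vdiff);
    }

    // Horizontal reduction: fold the upper 128 bits onto the lower 128 bits.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);

    bottom = _mm_add_ps(bottom, top);
    // Two horizontal adds leave the total in every lane.
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    // Take lane 0, then add the leftover elements (fewer than 8).
    float sum = _mm_cvtss_f32(bottom);
    for (; i < n; ++i) {
        sum += std::fabs(a[i] - b[i]);
    }

    return sum / (float) n;
}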

Breakdown of the AVX2 implementation:

  • _mm256_loadu_ps loads 8 elements at a time from the input arrays.
  • _mm256_sub_ps computes the differences between the two vectors.
  • _mm256_andnot_ps computes (NOT signMask) AND vdiff, which clears the sign bit of every lane and so yields the absolute values without a conditional statement.
  • _mm256_add_ps accumulates the results in the vsum register, which holds 8 partial sums.

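To convert the SIMD result back to a scalar, we first split vsum into its two 128-bit halves and add them together, which leaves 4 partial sums. Two _mm_hadd_ps calls then add adjacent lanes until every lane holds the total, and _mm_cvtss_f32 extracts the lowest lane as a float.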

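The remaining elements (fewer than 8, when n is not a multiple of 8) are handled with a standard scalar loop. Because the SIMD version adds the values in a different order than the scalar version, the two results can differ slightly due to floating-point rounding.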
