In machine learning and statistical analysis, the Mean Absolute Error (MAE) is a common metric for evaluating the accuracy of predictions. It measures the average magnitude of the errors between predicted values and actual values. A basic scalar implementation of MAE is fine for small datasets, but for large arrays the calculation can be sped up considerably with SIMD, which processes several elements per instruction.
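For reference, the definition can be written out as a formula. The symbols here are chosen to match the code in this post and are otherwise an assumption: $a_i$ and $b_i$ are the two values being compared at position $i$, and $n$ is the number of elements.

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=0}^{n-1} \lvert a_i - b_i \rvert
$$

Because of the absolute value, the result is the same whichever array holds the predictions and whichever holds the actual values.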

The traditional implementation:

```
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Scalar MAE: mean of |a[i] - b[i]| over n elements.
float mae(const float *a, const float *b, const std::size_t n) {
    float sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::fabs(a[i] - b[i]);
    }
    return sum / (float) n;
}

int main() {
    std::vector<float> a = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        11, 12, 13, 14, 15, 16, 17,
    };
    std::vector<float> b = {
        0.5, 1, 2.5, 3, 4.5, 5, 6.5, 7, 8.5,
        9, 10.5, 11, 12.5, 13, 14.5, 15, 16.5, 17,
    };
    // Both vectors hold 18 elements; 9 of the pairs differ by 0.5.
    float value = mae(a.data(), b.data(), a.size());
    std::cout << value;
    return 0;
}
```

In this code, the function `mae` calculates the absolute difference between corresponding elements of two arrays, sums them up, and divides by the number of elements to get the mean. Output:

`0.25`

While this method is straightforward, it can become a bottleneck for larger datasets.

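Here's the optimized implementation using AVX2 (the 256-bit float intrinsics it relies on were introduced with AVX, so every AVX2-capable CPU supports them):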

```
#include <immintrin.h>
#include <cmath>
#include <cstddef>

float mae(const float *a, const float *b, const std::size_t n) {
    // -0.0f has only the sign bit set, in every lane.
    __m256 signMask = _mm256_set1_ps(-0.0f);
    __m256 vsum = _mm256_setzero_ps();
    std::size_t i = 0;
    // Main loop: 8 floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vdiff = _mm256_sub_ps(va, vb);
        // (~signMask) & vdiff clears the sign bit, giving |a - b|.
        vdiff = _mm256_andnot_ps(signMask, vdiff);
        vsum = _mm256_add_ps(vsum, vdiff);
    }
    // Horizontal reduction: 8 lanes -> 4 lanes -> 1 value.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);
    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);
    float sum = _mm_cvtss_f32(bottom);
    // Scalar tail for the last n % 8 elements.
    for (; i < n; ++i) {
        sum += std::fabs(a[i] - b[i]);
    }
    return sum / (float) n;
}
```

Breakdown of the AVX2 implementation:

- `_mm256_loadu_ps` loads 8 elements at a time from the input arrays.
- `_mm256_sub_ps` computes the differences between the two vectors.
- `_mm256_andnot_ps` using a sign mask calculates the absolute values without needing a conditional statement.
- `_mm256_add_ps` is used to accumulate results in the `vsum` register.
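The sign-mask step may be easier to follow on a single value. The sketch below is a scalar illustration only, not part of the AVX2 function: `abs_by_mask` is a hypothetical helper name, and it assumes `float` is a 32-bit IEEE 754 type. It applies the same "NOT mask AND value" operation that `_mm256_andnot_ps` performs in each of the 8 lanes.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: absolute value by clearing the sign bit,
// mirroring what _mm256_andnot_ps(signMask, v) does per lane.
// Assumes float is a 32-bit IEEE 754 type.
float abs_by_mask(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");

    const float neg_zero = -0.0f;       // only the sign bit is set
    std::uint32_t mask;
    std::uint32_t bits;
    std::memcpy(&mask, &neg_zero, sizeof mask);  // copy the bit pattern of -0.0f
    std::memcpy(&bits, &x, sizeof bits);         // copy the bit pattern of x

    bits = ~mask & bits;                // andnot: keep every bit except the sign

    std::memcpy(&x, &bits, sizeof x);   // back to float
    return x;
}
```

Since only one bit changes, no comparison or branch is needed, which is what makes the trick a good fit for SIMD lanes.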

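To convert the SIMD result back to a scalar, we first split `vsum` into two 128-bit halves and add them together, which leaves 4 partial sums. Two `_mm_hadd_ps` calls then add adjacent lanes until the lowest lane holds the total, and `_mm_cvtss_f32` extracts it as a `float`.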
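The reduction can be traced lane by lane in plain C++. This is a scalar model written for illustration, with `reduce8` as a hypothetical name; it uses no intrinsics and only mimics the order in which the lanes are combined.

```cpp
#include <array>

// Hypothetical scalar model of the horizontal reduction of 8 lanes.
float reduce8(const std::array<float, 8>& lanes) {
    // _mm_add_ps(bottom, top): add the upper 4 lanes to the lower 4.
    const std::array<float, 4> q = {
        lanes[0] + lanes[4], lanes[1] + lanes[5],
        lanes[2] + lanes[6], lanes[3] + lanes[7],
    };
    // First _mm_hadd_ps(q, q): sums of adjacent pairs,
    // giving {q0+q1, q2+q3, q0+q1, q2+q3}.
    const std::array<float, 4> h = {
        q[0] + q[1], q[2] + q[3], q[0] + q[1], q[2] + q[3],
    };
    // Second _mm_hadd_ps(h, h): every lane now holds h0+h1.
    // _mm_cvtss_f32 reads the lowest lane.
    return h[0] + h[1];
}
```

For the lanes `{1, 2, 3, 4, 5, 6, 7, 8}` the intermediate values are `{6, 8, 10, 12}`, then `{14, 22, 14, 22}`, and the result is 36.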

The remaining elements (at most 7, when `n` is not a multiple of 8) are handled with a standard scalar loop.
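The split between the 8-wide main loop and the scalar tail can also be modelled without intrinsics, which is handy for checking the loop bounds on a machine without AVX. In this sketch `mae_chunked` is a hypothetical name and the 8-element array `lanes` stands in for the `vsum` register; it is a portable model of the structure, not a replacement for the AVX2 code.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical portable model of the loop structure: 8 partial sums
// play the role of the 8 lanes of vsum. No intrinsics are used.
float mae_chunked(const float* a, const float* b, std::size_t n) {
    float lanes[8] = {0.0f};
    std::size_t i = 0;

    // Main loop: runs only while a full block of 8 elements remains.
    for (; i + 8 <= n; i += 8) {
        for (std::size_t k = 0; k < 8; ++k) {
            lanes[k] += std::fabs(a[i + k] - b[i + k]);
        }
    }

    // Combine the partial sums (plain left-to-right here, not the
    // pairwise order of the intrinsic version).
    float sum = 0.0f;
    for (float lane : lanes) {
        sum += lane;
    }

    // Tail: the last n % 8 elements, one at a time.
    for (; i < n; ++i) {
        sum += std::fabs(a[i] - b[i]);
    }
    return sum / static_cast<float>(n);
}
```

With the 18-element arrays from the first example, the main loop covers indices 0 to 15 in two iterations and the tail handles indices 16 and 17. Because the partial sums are added in a different order than in the scalar version, results on arbitrary data can differ in the last bits, so comparisons between implementations are usually done with a small tolerance.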
