Find Element-wise Maximum of Array Elements using C++ SIMD

In computational tasks, taking the element-wise maximum of two arrays is a common operation. A scalar implementation works well for small datasets, but it can become a bottleneck on large arrays. With the SIMD instructions available in modern CPUs, we can compare several pairs of elements per instruction (eight floats at a time with 256-bit registers), which can significantly boost performance.

The scalar implementation:

#include <algorithm>
#include <iostream>
#include <vector>

void vectorMax(const float *a, const float *b, float *result, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = std::max(a[i], b[i]);
    }
}

int main() {
    std::vector<float> a = {
        5, -1, 10, -3, 14, 5, -6, 17, 8, 3, -12, 11, 2, 13, -7, 9, 16, 1
    };
    std::vector<float> b = {
        4, 2, -3, 12, 5, -6, 15, 8, -9, 10, 1, 20, -13, 14, 7, -5, 4, 18
    };
    std::vector<float> result(a.size());

    vectorMax(a.data(), b.data(), result.data(), a.size());
    for (auto value: result) {
        std::cout << value << " ";
    }

    return 0;
}

This approach compares each corresponding pair of elements from two arrays and stores the maximum value in a result array. Output:

5 2 10 12 14 5 15 17 8 10 1 20 2 14 7 9 16 18

While this is simple and effective, it handles one pair of elements per loop iteration and does not explicitly use the SIMD units of modern hardware, which can make it slow for larger datasets.

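The AVX2 implementation (the floating-point intrinsics used here require only AVX, which every AVX2-capable CPU also supports):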

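#include <algorithm>   // std::max, used by the scalar tail loop
#include <cstddef>     // size_t
#include <immintrin.h> // 256-bit SIMD intrinsics

void vectorMax(const float *a, const float *b, float *result, const size_t n) {
    size_t i = 0;
    // Process 8 floats per iteration while at least 8 elements remain.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);     // unaligned load of a[i..i+7]
        __m256 vb = _mm256_loadu_ps(&b[i]);     // unaligned load of b[i..i+7]
        __m256 vresult = _mm256_max_ps(va, vb); // lane-wise maximum
        _mm256_storeu_ps(&result[i], vresult);  // unaligned store to result[i..i+7]
    }

    // Handle the remaining n % 8 elements one at a time.
    for (; i < n; ++i) {
        result[i] = std::max(a[i], b[i]);
    }
}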

Explanation of AVX2 code:

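  • _mm256_loadu_ps loads 8 consecutive floats from memory into a 256-bit register; it is called once per input array, and the "u" means the address does not need to be 32-byte aligned.
  • _mm256_max_ps computes the maximum of each of the 8 pairs in a single instruction.
  • _mm256_storeu_ps stores the 8 results back into the result array, again without an alignment requirement.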
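One subtlety of _mm256_max_ps is how it treats NaN. As documented for the underlying MAXPS instruction, each lane is selected as a > b ? a : b, so when either input is NaN the second operand is returned, whereas std::max(a, b) returns the first operand in that situation. The sketch below models the per-lane rule in scalar code so the difference can be observed without AVX hardware. The names maxLikeMaxps and demoNanDifference are hypothetical helpers introduced for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Scalar model of one lane of _mm256_max_ps(a, b).
// Any comparison involving NaN is false, so b is returned in that case.
inline float maxLikeMaxps(float a, float b) {
    return a > b ? a : b;
}

// Shows where the lane model and std::max disagree (hypothetical demo helper).
inline void demoNanDifference() {
    const float nan = std::numeric_limits<float>::quiet_NaN();

    // NaN in the first operand: the lane model yields b, std::max yields a.
    assert(maxLikeMaxps(nan, 1.0f) == 1.0f);
    assert(std::isnan(std::max(nan, 1.0f)));

    // NaN in the second operand: the lane model yields b (NaN), std::max yields a.
    assert(std::isnan(maxLikeMaxps(1.0f, nan)));
    assert(std::max(1.0f, nan) == 1.0f);

    // For ordinary values both agree.
    assert(maxLikeMaxps(3.0f, 2.0f) == std::max(3.0f, 2.0f));
}
```

If the inputs may contain NaN, using the same a > b ? a : b rule in the leftover loop keeps the last few elements consistent with the vectorized part. For NaN-free data the two forms give identical results.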

A scalar loop handles any leftover elements if the array size isn't a multiple of 8.
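To build the vector loop, GCC and Clang need AVX code generation enabled (for example with -mavx or -march=native). As a minimal sketch, the vector part can be wrapped in a check of the predefined __AVX__ macro so that the same source still compiles, as purely scalar code, when that option is absent. The name vectorMaxGuarded is illustrative and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

void vectorMaxGuarded(const float *a, const float *b, float *result, std::size_t n) {
    assert(n == 0 || (a && b && result)); // non-empty input needs valid pointers
    std::size_t i = 0;
#if defined(__AVX__)
    // Vector part: 8 floats per iteration, compiled only when AVX is enabled.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(result + i, _mm256_max_ps(va, vb));
    }
#endif
    // Scalar part: the n % 8 leftovers, or every element when AVX is disabled.
    for (; i < n; ++i) {
        result[i] = a[i] > b[i] ? a[i] : b[i];
    }
}
```

With the 18-element arrays from the scalar example, an AVX build handles the first 16 elements in two vector iterations and the last 2 in the scalar loop. This guard only affects compilation: a binary built with AVX enabled still needs a CPU that supports AVX at run time.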
