Find Element-wise Minimum of Array Elements using C++ SIMD

Find Element-wise Minimum of Array Elements using C++ SIMD

In performance-critical applications like image processing, scientific simulations, and machine learning, computing operations like element-wise minimum across arrays can become a bottleneck. Thankfully, SIMD instructions available in modern CPUs allow us to process multiple data elements simultaneously - greatly accelerating such operations.

Here's the basic version using a loop:

#include <iostream>
#include <vector>

void vectorMin(const float *a, const float *b, float *result, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = std::min(a[i], b[i]);
    }
}

int main() {
    std::vector<float> a = {
        5, -1, 10, -3, 14, 5, -6, 17, 8, 3, -12, 11, 2, 13, -7, 9, 16, 1
    };
    std::vector<float> b = {
        4, 2, -3, 12, 5, -6, 15, 8, -9, 10, 1, 20, -13, 14, 7, -5, 4, 18
    };
    std::vector<float> result(a.size());

    vectorMin(a.data(), b.data(), result.data(), a.size());
    for (auto value: result) {
        std::cout << value << " ";
    }

    return 0;
}

It's simple and portable, but processes elements one by one. Output:

4 -1 -3 -3 5 -6 -6 8 -9 3 -12 11 -13 13 -7 -5 4 1

This approach doesn't take advantage of modern hardware capabilities, resulting in inefficiencies with larger datasets.

To leverage SIMD, we use AVX2, which operate on 256-bit wide registers - allowing us to process 8 floats at once:

#include <immintrin.h>

void vectorMin(const float *a, const float *b, float *result, const size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vresult = _mm256_min_ps(va, vb);
        _mm256_storeu_ps(&result[i], vresult);
    }

    for (; i < n; ++i) {
        result[i] = std::min(a[i], b[i]);
    }
}

Here's an explanation of the AVX2 code:

  • _mm256_loadu_ps - loads 8 floating-point elements from each input array.
  • _mm256_min_ps - computes the minimum of each corresponding pair of elements.
  • _mm256_storeu_ps - stores the result back into the result array.

A scalar loop processes any remaining elements when the array size is not divisible by 8.

Leave a Comment

Cancel reply

Your email address will not be published.