Perform Element-wise Addition of Arrays using C++ SIMD

Efficient computation is critical when working with large datasets or performance-intensive applications. One common operation is element-wise addition of arrays: each element of one array is added to the element at the same index of another. Modern CPUs provide SIMD (Single Instruction, Multiple Data) instructions, which apply one operation to several values at once and can speed up this process significantly.

Here's the basic implementation:

#include <cstddef>
#include <iostream>
#include <vector>

// Adds a and b element by element and writes the n sums to result.
void vectorAdd(const float *a, const float *b, float *result, const std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

int main() {
    std::vector<float> a = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    };
    std::vector<float> b = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    std::vector<float> result(a.size());

    vectorAdd(a.data(), b.data(), result.data(), a.size());
    for (auto value: result) {
        std::cout << value << " ";
    }

    return 0;
}

In this code, we define a vectorAdd function that takes two input arrays and computes their element-wise sum, storing the result in a third array. The main function initializes two vectors and performs the addition, and then prints the resulting array. Output:

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35

The above method works fine for small arrays, but it adds only one pair of elements per loop iteration. Unless the compiler vectorizes the loop automatically, this can become a bottleneck for large arrays in applications where speed is critical.
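To see how the scalar loop behaves on a particular machine, we can time it. The following is only a minimal sketch: the names vectorAddScalar and timeScalarAddMicros are hypothetical helpers introduced for illustration, and a single timed call is a rough measurement, not a proper benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Same scalar loop as in the listing above, under a different (hypothetical) name.
void vectorAddScalar(const float *a, const float *b, float *result, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Hypothetical helper: times one call on n elements and returns microseconds.
// The last sum is written to *lastOut so that the computed result is actually used.
double timeScalarAddMicros(std::size_t n, float *lastOut) {
    std::vector<float> a(n, 1.0f), b(n, 2.0f), result(n);

    auto start = std::chrono::steady_clock::now();
    vectorAddScalar(a.data(), b.data(), result.data(), n);
    auto stop = std::chrono::steady_clock::now();

    if (lastOut != nullptr && n > 0) {
        *lastOut = result[n - 1];
    }
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Calling timeScalarAddMicros with increasing values of n (and with different compiler optimization levels) gives a rough idea of how the runtime grows. The numbers depend heavily on the compiler, flags, and hardware.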

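Here's the optimized implementation using AVX intrinsics. The 256-bit floating-point operations used here belong to AVX, so AVX2 is not required. With GCC or Clang, compile with -mavx; with MSVC, use /arch:AVX.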

#include <cstddef>
#include <immintrin.h>

void vectorAdd(const float *a, const float *b, float *result, const std::size_t n) {
    std::size_t i = 0;
    // Main loop: add 8 floats per iteration using 256-bit registers.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vresult = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&result[i], vresult);
    }

    // Tail loop: add the remaining (fewer than 8) elements one at a time.
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

In this version, the vectorAdd function uses AVX intrinsics:

  • _mm256_loadu_ps loads 8 single-precision floats (256 bits) from an array; the address does not need to be 32-byte aligned.
  • _mm256_add_ps adds the 8 pairs of elements in parallel.
  • _mm256_storeu_ps stores the 8 sums back into the result array, again without an alignment requirement.

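The first loop processes chunks of 8 elements at a time, and any remaining elements are handled in a scalar fallback loop at the end. For the 18-element arrays used earlier, the first loop covers 16 elements in two iterations and the fallback loop adds the last 2.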
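Since the intrinsics only compile when AVX is enabled, one possible way to keep the code buildable everywhere is to guard the SIMD loop with the __AVX__ macro, which GCC, Clang, and MSVC define when AVX code generation is turned on. The sketch below assumes that approach; vectorAddPortable and matchesScalar are hypothetical names introduced for illustration, and the second function simply compares the result with a plain scalar sum.

```cpp
#include <cstddef>
#include <vector>

#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics, included only when AVX is enabled
#endif

// Hypothetical name: adds a and b element-wise, using AVX when the build enables it.
void vectorAddPortable(const float *a, const float *b, float *result, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Full chunks of 8 floats; for n = 18 this runs for i = 0 and i = 8.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(result + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar tail, or the whole array when AVX is not enabled.
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Hypothetical helper: returns true if the result equals a plain scalar sum
// for arrays of length n filled with small integer values.
bool matchesScalar(std::size_t n) {
    std::vector<float> a(n), b(n), out(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(i + 1);
    }
    vectorAddPortable(a.data(), b.data(), out.data(), n);
    for (std::size_t i = 0; i < n; ++i) {
        if (out[i] != a[i] + b[i]) {
            return false;
        }
    }
    return true;
}
```

Checking lengths such as 0, 7, 8, and 18 exercises the empty case, a tail-only case, exactly one full chunk, and chunks plus a tail. Note that this is a compile-time choice: a binary built with AVX enabled still requires a CPU that supports AVX at run time.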
