Efficient computation is critical when working with large datasets or performance-intensive applications, and one common operation is element-wise addition of arrays: each element of one array is added to the corresponding element of the other. Modern CPUs offer SIMD (single instruction, multiple data) instructions that can speed this up significantly by processing several elements at once.

Here's the basic implementation:

```
#include <cstddef>
#include <iostream>
#include <vector>

// Adds a[i] + b[i] for every i in [0, n) and writes the sum to result[i].
void vectorAdd(const float *a, const float *b, float *result, const std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

int main() {
    std::vector<float> a = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    };
    std::vector<float> b = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    std::vector<float> result(a.size());
    vectorAdd(a.data(), b.data(), result.data(), a.size());
    // Print the sums separated by spaces.
    for (auto value : result) {
        std::cout << value << " ";
    }
    return 0;
}
```

In this code, we define a `vectorAdd` function that takes two input arrays and computes their element-wise sum, storing the result in a third array. The main function initializes two vectors, performs the addition, and then prints the resulting array. Output:

`1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35`

This approach works fine for small arrays, but as written it adds only one pair of elements per loop iteration. For large arrays in performance-critical code, that can leave much of the CPU's vector hardware unused unless the compiler auto-vectorizes the loop.
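To see how the scalar loop behaves as arrays grow, one option is to time it with `std::chrono`. The following is only a rough sketch: the helper name `averageAddMillis` and the fill values are made up for illustration, and the numbers it reports will vary with compiler, optimization level, and hardware.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same scalar loop as in the basic implementation.
void vectorAdd(const float *a, const float *b, float *result, const std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Hypothetical helper: average wall-clock milliseconds per vectorAdd call
// on arrays of length n. An optimizing compiler may simplify the repeated
// calls, so treat the result as a rough indication only.
double averageAddMillis(std::size_t n, int repeats) {
    assert(repeats > 0);
    std::vector<float> a(n, 1.0f), b(n, 2.0f), result(n);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        vectorAdd(a.data(), b.data(), result.data(), n);
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Calling `averageAddMillis` with a few increasing values of `n` gives a baseline to compare a vectorized version against.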

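Here's the optimized implementation using 256-bit AVX intrinsics (these single-precision operations require only AVX, so they also run on any AVX2-capable CPU):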

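```
#include <cstddef>
#include <immintrin.h>

// Build with AVX enabled, e.g. -mavx on GCC/Clang or /arch:AVX on MSVC.
void vectorAdd(const float *a, const float *b, float *result, const std::size_t n) {
    std::size_t i = 0;
    // Main loop: add 8 floats per iteration using 256-bit registers.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);      // unaligned load of a[i..i+7]
        __m256 vb = _mm256_loadu_ps(&b[i]);      // unaligned load of b[i..i+7]
        __m256 vresult = _mm256_add_ps(va, vb);  // 8 additions in parallel
        _mm256_storeu_ps(&result[i], vresult);   // unaligned store of the sums
    }
    // Tail loop: handle the remaining n % 8 elements one at a time.
    for (; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}
```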

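In this version, the `vectorAdd` function uses three AVX intrinsics:

- `_mm256_loadu_ps` loads 8 single-precision floating-point numbers from an array into a 256-bit register; the `u` indicates that the address does not need to be aligned.
- `_mm256_add_ps` performs parallel addition of these 8 pairs of elements.
- `_mm256_storeu_ps` stores the 8 resulting sums back into the result array, again without an alignment requirement.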

The loop processes chunks of 8 elements at a time, and any remaining elements are handled in a fallback loop at the end.
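One way to check that the fallback loop really covers lengths that are not a multiple of 8 is to compare the vectorized routine against the scalar one over a range of sizes. The sketch below is an illustration rather than part of the original code. The names `vectorAddScalar`, `vectorAddAvx`, `vectorAddDispatch`, `matchesScalar`, and `checkTailHandling` are hypothetical. It assumes GCC or Clang on x86, where `__attribute__((target("avx")))` and `__builtin_cpu_supports` are available. On other platforms it simply falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: GCC or Clang targeting x86. Elsewhere only the scalar path is built.
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define USE_AVX_PATH 1
#else
#define USE_AVX_PATH 0
#endif

// Scalar reference, identical to the basic implementation.
void vectorAddScalar(const float *a, const float *b, float *result, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

#if USE_AVX_PATH
// The target attribute enables AVX for this function only, so the file
// compiles even without -mavx.
__attribute__((target("avx")))
void vectorAddAvx(const float *a, const float *b, float *result, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {              // full chunks of 8 floats
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(result + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) {                      // remaining n % 8 elements
        result[i] = a[i] + b[i];
    }
}
#endif

// Uses the AVX version only when the running CPU reports AVX support.
void vectorAddDispatch(const float *a, const float *b, float *result, std::size_t n) {
#if USE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        vectorAddAvx(a, b, result, n);
        return;
    }
#endif
    vectorAddScalar(a, b, result, n);
}

// Returns true when the dispatched version matches the scalar reference for length n.
bool matchesScalar(std::size_t n) {
    std::vector<float> a(n), b(n), expected(n), actual(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(i + 1);
    }
    vectorAddScalar(a.data(), b.data(), expected.data(), n);
    vectorAddDispatch(a.data(), b.data(), actual.data(), n);
    return expected == actual;
}

// Checks every length from 0 to 33, covering empty input, tail-only input,
// exact multiples of 8, and multiples of 8 plus a remainder.
void checkTailHandling() {
    for (std::size_t n = 0; n <= 33; ++n) {
        assert(matchesScalar(n));
    }
}
```

For the 18-element arrays from the first example, the main loop runs twice (covering 16 elements) and the fallback loop handles the last 2, so `matchesScalar(18)` exercises both paths.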
