Numerical code often needs to apply the same operation to every element of an array, such as dividing by a scalar value. The straightforward approach iterates over the elements one at a time, which can become a performance bottleneck on large datasets. Fortunately, modern CPUs provide SIMD (single instruction, multiple data) instructions, which apply one operation to several values at once.

Here's the straightforward implementation:

```
#include <cstddef>
#include <iostream>
#include <vector>

// Divide each of the n elements of data by divisor, in place.
void divide(float *data, const std::size_t n, const float divisor) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] /= divisor;
    }
}

int main() {
    std::vector<float> a = {
        0, 14, 28, 42, 56, 70, 84, 98, 112, 126, 140, 154, 168, 182, 196, 210, 224, 255
    };
    divide(a.data(), a.size(), 255);
    for (auto value : a) {
        std::cout << value << " ";
    }
    return 0;
}
```

In this code, the `divide` function divides each element of the array by the divisor, in place. The `main` function initializes a vector, calls `divide` on it, and prints the modified contents. Part of the output:

`0 0.054902 0.109804 0.164706 ... 0.768627 0.823529 0.878431 1`

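To improve performance, we can use the 256-bit SIMD instructions available on AVX2-capable CPUs. A 256-bit register holds 8 single-precision floats, so a single instruction can divide 8 elements at once. (The floating-point intrinsics used below were introduced with AVX, so every AVX2-capable CPU supports them.)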

Here's how to implement the same operation using AVX2 (with GCC or Clang, compile with `-mavx2`):

```
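#include <immintrin.h>
#include <cstddef>

void divide(float *data, const std::size_t n, const float divisor) {
    // Broadcast the divisor into all 8 lanes of a 256-bit register.
    __m256 vdivisor = _mm256_set1_ps(divisor);
    std::size_t i = 0;
    // Main loop: divide 8 floats per iteration using unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vresult = _mm256_div_ps(vdata, vdivisor);
        _mm256_storeu_ps(&data[i], vresult);
    }
    // Tail loop: handle the remaining n % 8 elements one at a time.
    for (; i < n; ++i) {
        data[i] /= divisor;
    }
}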
```

In this optimized version, the `divide` function leverages four intrinsics from `<immintrin.h>`:

- `_mm256_set1_ps` creates a vector containing the scalar divisor replicated across all 8 elements.
- `_mm256_loadu_ps` loads 8 floating-point numbers from the array; the address does not need to be aligned.
- `_mm256_div_ps` executes the division operation in parallel for the 8 loaded elements.
- `_mm256_storeu_ps` stores the results back into the array, again without an alignment requirement.
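To make these four steps concrete, here is a minimal sketch that applies them to a single group of 8 floats and falls back to scalar code when the CPU lacks AVX. The names `divide8`, `divide8_avx`, and `divide8_scalar` are hypothetical. The `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, assumed here so that the file builds without extra flags; other compilers need a different mechanism.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Scalar reference (hypothetical helper): out[i] = in[i] / divisor for 8 floats.
void divide8_scalar(const float *in, float *out, float divisor) {
    for (std::size_t i = 0; i < 8; ++i) {
        out[i] = in[i] / divisor;
    }
}

// Vector version (hypothetical helper). The target attribute is a GCC/Clang
// extension that enables AVX code generation for this one function.
__attribute__((target("avx")))
void divide8_avx(const float *in, float *out, float divisor) {
    __m256 vdivisor = _mm256_set1_ps(divisor);        // broadcast divisor to 8 lanes
    __m256 vdata = _mm256_loadu_ps(in);               // load 8 floats (unaligned)
    __m256 vresult = _mm256_div_ps(vdata, vdivisor);  // 8 divisions at once
    _mm256_storeu_ps(out, vresult);                   // store 8 floats (unaligned)
}

// Dispatcher (hypothetical helper): use AVX only if the running CPU reports it.
void divide8(const float *in, float *out, float divisor) {
    assert(in != nullptr && out != nullptr);
    if (__builtin_cpu_supports("avx")) {
        divide8_avx(in, out, divisor);
    } else {
        divide8_scalar(in, out, divisor);
    }
}
```

Calling `divide8` on the first eight values of the example array with a divisor of 255 should give the same results as the scalar loop, since both paths perform IEEE 754 single-precision division (assuming default rounding and no fast-math options).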

The loop processes chunks of 8 elements, and any remaining elements are handled in a fallback loop.
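For the 18-element array from the first example, the vectorized loop runs twice (covering 16 elements) and the scalar loop handles the last 2. The sketch below computes that split for any length; `kLanes`, `ChunkSplit`, and `split_chunks` are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>

// Number of floats that fit in one 256-bit register.
constexpr std::size_t kLanes = 8;

struct ChunkSplit {
    std::size_t full_chunks;  // iterations of the vectorized loop
    std::size_t tail;         // elements left for the scalar loop
};

// Mirrors the loop condition `i + 8 <= n`: only complete groups of 8
// are processed with SIMD, the rest goes to the scalar loop.
constexpr ChunkSplit split_chunks(std::size_t n) {
    return ChunkSplit{n / kLanes, n % kLanes};
}

// The 18-element example: two full chunks plus a tail of two.
static_assert(split_chunks(18).full_chunks == 2, "18 elements give 2 full chunks");
static_assert(split_chunks(18).tail == 2, "18 elements leave a tail of 2");
```

The condition `i + 8 <= n` is what keeps the vector loop from reading or writing past the end of the array: when fewer than 8 elements remain (including the case where `n` is less than 8 from the start), the vector loop does not run and the scalar loop takes over.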
