Divide Array Elements by Scalar using C++ SIMD

Numerical applications often need to apply the same operation to every element of an array, such as dividing by a scalar value. The straightforward approach is to iterate through the elements one at a time, but this can become a performance bottleneck with large datasets. Fortunately, modern CPUs provide SIMD (single instruction, multiple data) instructions, which apply one operation to several elements at once.

Here's the straightforward implementation:

#include <iostream>
#include <vector>

void divide(float *data, const size_t n, const float divisor) {
    for (size_t i = 0; i < n; ++i) {
        data[i] /= divisor;
    }
}

int main() {
    std::vector<float> a = {
        0, 14, 28, 42, 56, 70, 84, 98, 112, 126, 140, 154, 168, 182, 196, 210, 224, 255
    };

    divide(a.data(), a.size(), 255);
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

In this code, the divide function divides each element of the array by the divisor in place. The main function initializes a vector of 18 values, calls divide with a divisor of 255, and prints the modified vector. Part of the output:

0 0.054902 0.109804 0.164706 ... 0.768627 0.823529 0.878431 1

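To improve performance, we can use the 256-bit SIMD instructions available on AVX2-capable CPUs. Each instruction operates on eight single-precision floats at once, so the division is performed on multiple elements simultaneously. (The floating-point intrinsics used here were introduced with the original AVX extension, so AVX support alone is enough.)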
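As a rough illustration of what "multiple elements simultaneously" buys, the sketch below counts how the work splits when eight floats fit into one 256-bit register. The struct SimdWork and the function count_simd_work are hypothetical names introduced only for this illustration, and the default of 8 lanes assumes 32-bit floats.

```cpp
#include <cstddef>

// Hypothetical helper: how the work splits for a given SIMD width.
struct SimdWork {
    std::size_t vector_iterations; // full groups of `lanes` elements
    std::size_t scalar_leftovers;  // elements that do not fill a group
};

// lanes = 8 assumes 32-bit floats in a 256-bit register.
constexpr SimdWork count_simd_work(std::size_t n, std::size_t lanes = 8) {
    return SimdWork{n / lanes, n % lanes};
}
```

For the 18-element array above, count_simd_work(18) gives 2 vector iterations and 2 leftover elements, compared with 18 iterations of the scalar loop.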

Here's how to implement the same operation with 256-bit AVX intrinsics (with GCC or Clang, compile with -mavx or -mavx2):

#include <cstddef>
#include <immintrin.h>

void divide(float *data, const size_t n, const float divisor) {
    __m256 vdivisor = _mm256_set1_ps(divisor);

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vresult = _mm256_div_ps(vdata, vdivisor);
        _mm256_storeu_ps(&data[i], vresult);
    }

    for (; i < n; ++i) {
        data[i] /= divisor;
    }
}

In this optimized version, the divide function uses the following 256-bit AVX intrinsics:

  • _mm256_set1_ps creates a vector with the scalar divisor replicated across all 8 elements.
  • _mm256_loadu_ps loads 8 consecutive floats from the array; the address does not need to be 32-byte aligned.
  • _mm256_div_ps divides the 8 loaded elements by the divisor in parallel.
  • _mm256_storeu_ps stores the 8 results back into the array, again without an alignment requirement.

The main loop processes the array in chunks of 8 elements. Its condition, i + 8 <= n, stops it before a partial chunk, and the remaining n % 8 elements are handled by the scalar fallback loop.
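One way to gain confidence in the vectorized loop and its fallback is to compare it against the scalar version on the same input. The following is a minimal sketch, not part of the original code: the names divide_scalar, divide_simd and same_results are hypothetical, and the __AVX__ preprocessor guard is an added assumption so that the file still builds (using only the scalar loop) when AVX is not enabled at compile time. The exact comparison assumes default floating-point settings (for example, no -ffast-math) and no NaN values in the results.

```cpp
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Scalar reference, same loop as the straightforward version.
void divide_scalar(float *data, const std::size_t n, const float divisor) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] /= divisor;
    }
}

// Vectorized version. The __AVX__ guard is an addition for this sketch:
// without AVX enabled, only the scalar loop below is compiled.
void divide_simd(float *data, const std::size_t n, const float divisor) {
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 vdivisor = _mm256_set1_ps(divisor);
    for (; i + 8 <= n; i += 8) {
        const __m256 vdata = _mm256_loadu_ps(&data[i]);
        _mm256_storeu_ps(&data[i], _mm256_div_ps(vdata, vdivisor));
    }
#endif
    // Leftover elements (or all of them when AVX is not enabled).
    for (; i < n; ++i) {
        data[i] /= divisor;
    }
}

// Hypothetical check: run both versions on copies of the input and compare.
// Uses exact equality, so it reports false if any result is NaN.
bool same_results(std::vector<float> input, const float divisor) {
    std::vector<float> expected = input;
    divide_scalar(expected.data(), expected.size(), divisor);
    divide_simd(input.data(), input.size(), divisor);
    return input == expected;
}
```

Calling same_results with the 18-element vector from the first example and a divisor of 255 exercises both the 8-element chunks and the 2-element tail.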
