Multiply Array Elements by Scalar using C++ SIMD

Multiplying each element of an array by a scalar is a common operation in many applications, from image processing to data normalization. With SIMD instructions, a single instruction operates on several elements at once (eight floats per 256-bit register), which can speed up this operation considerably.

Here's the straightforward implementation:

#include <iostream>
#include <vector>

void multiply(float *data, const size_t n, const float multiplier) {
    for (size_t i = 0; i < n; ++i) {
        data[i] *= multiplier;
    }
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };

    multiply(a.data(), a.size(), 255);
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

The code iterates over the array and multiplies each element by the given multiplier. Part of the output:

255 510 765 ... 4080 4335 4590

This scalar approach processes one element per iteration, which can become a bottleneck for large arrays unless the compiler vectorizes the loop on its own.

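Here's how to implement the same operation using AVX intrinsics. The 256-bit float intrinsics used here are part of AVX, so AVX2 support is not strictly required; with GCC or Clang, compile with -mavx (or -mavx2):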

#include <iostream>
#include <vector>
#include <immintrin.h>

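void multiply(float *data, const size_t n, const float multiplier) {
    // Broadcast the multiplier into all eight lanes of a 256-bit register.
    __m256 vmultiplier = _mm256_set1_ps(multiplier);

    // Vector loop: handle eight floats per iteration while a full block remains.
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);            // unaligned load
        __m256 vresult = _mm256_mul_ps(vdata, vmultiplier);  // lane-wise multiply
        _mm256_storeu_ps(&data[i], vresult);                 // unaligned store
    }

    // Fallback loop: handle the remaining (fewer than eight) elements.
    for (; i < n; ++i) {
        data[i] *= multiplier;
    }
}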

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };

    multiply(a.data(), a.size(), 255);
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

Here's how it works:

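  • _mm256_set1_ps creates a vector in which each of the eight elements is set to multiplier.
  • _mm256_loadu_ps loads eight consecutive floats from the array. The u suffix means the address does not have to be 32-byte aligned.
  • _mm256_mul_ps multiplies the two vectors lane by lane, so each of the eight loaded elements is multiplied by multiplier.
  • _mm256_storeu_ps stores the eight results back into the array, again without an alignment requirement.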
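
To make the lane-by-lane behavior of these four intrinsics concrete, here is a plain C++ model of one iteration of the vector loop. This is only an illustration: the Lanes type and the model_* functions are hypothetical names invented for this sketch, and they mimic what the intrinsics compute rather than how the hardware does it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for __m256: eight float lanes.
using Lanes = std::array<float, 8>;

// Models _mm256_set1_ps: broadcast one value to all eight lanes.
Lanes model_set1(float value) {
    Lanes v;
    v.fill(value);
    return v;
}

// Models _mm256_loadu_ps: copy eight consecutive floats from memory.
Lanes model_loadu(const float *src) {
    Lanes v;
    for (std::size_t lane = 0; lane < 8; ++lane) {
        v[lane] = src[lane];
    }
    return v;
}

// Models _mm256_mul_ps: multiply matching lanes of two vectors.
Lanes model_mul(const Lanes &a, const Lanes &b) {
    Lanes v;
    for (std::size_t lane = 0; lane < 8; ++lane) {
        v[lane] = a[lane] * b[lane];
    }
    return v;
}

// Models _mm256_storeu_ps: write the eight lanes back to memory.
void model_storeu(float *dst, const Lanes &v) {
    for (std::size_t lane = 0; lane < 8; ++lane) {
        dst[lane] = v[lane];
    }
}

// One iteration of the vector loop, expressed with the model functions.
// data must point to at least eight floats.
void multiply_block_of_eight(float *data, float multiplier) {
    assert(data != nullptr);
    const Lanes vmultiplier = model_set1(multiplier);
    const Lanes vdata = model_loadu(data);
    const Lanes vresult = model_mul(vdata, vmultiplier);
    model_storeu(data, vresult);
}
```

Calling multiply_block_of_eight on a block holding 1 through 8 with a multiplier of 255 leaves 255 in the first element and 2040 in the last, matching what the real intrinsics would produce for the same block.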

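The main loop processes 8 elements at a time, and any leftover elements are handled by the fallback loop. For the 18-element array above, two vector iterations cover the first 16 elements and the fallback loop handles the last 2.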
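
The fallback loop is what keeps the function correct for any array length, and one way to gain confidence in it is to compare the vector version against the scalar one for lengths around multiples of eight. The sketch below does that. It rests on several assumptions that are not part of the original code: the names multiply_scalar, multiply_avx, multiply_checked and matches_scalar are made up for illustration, and the AVX path relies on the GCC/Clang target attribute and __builtin_cpu_supports, so on other compilers or non-x86 machines it simply falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: the AVX path is only built with GCC or Clang on x86, where the
// target attribute and __builtin_cpu_supports are available.
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAS_AVX_PATH 1
#else
#define SKETCH_HAS_AVX_PATH 0
#endif

// Reference implementation: one element per iteration.
void multiply_scalar(float *data, std::size_t n, float multiplier) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= multiplier;
    }
}

#if SKETCH_HAS_AVX_PATH
// Same loop structure as the AVX version above. The target attribute lets
// this one function use AVX even when the file is built without -mavx.
__attribute__((target("avx")))
void multiply_avx(float *data, std::size_t n, float multiplier) {
    const __m256 vmultiplier = _mm256_set1_ps(multiplier);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256 vdata = _mm256_loadu_ps(&data[i]);
        _mm256_storeu_ps(&data[i], _mm256_mul_ps(vdata, vmultiplier));
    }
    for (; i < n; ++i) {
        data[i] *= multiplier;  // leftover elements
    }
}
#endif

// Hypothetical dispatcher: uses AVX only if the running CPU reports it.
void multiply_checked(float *data, std::size_t n, float multiplier) {
    assert(data != nullptr || n == 0);
#if SKETCH_HAS_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        multiply_avx(data, n, multiplier);
        return;
    }
#endif
    multiply_scalar(data, n, multiplier);
}

// Returns true if both versions agree on the array 1, 2, ..., n.
bool matches_scalar(std::size_t n, float multiplier) {
    std::vector<float> expected(n), actual(n);
    for (std::size_t i = 0; i < n; ++i) {
        expected[i] = actual[i] = static_cast<float>(i + 1);
    }
    multiply_scalar(expected.data(), n, multiplier);
    multiply_checked(actual.data(), n, multiplier);
    return expected == actual;
}
```

Running matches_scalar for every length from 0 to 40 exercises empty input, lengths shorter than one vector, exact multiples of eight, and lengths with a tail, which are the cases where an off-by-one in the loop bounds would show up.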
