Calculate Dot Product of Arrays using C++ SIMD

The dot product is a fundamental operation in many fields, including computer graphics, physics simulations, and machine learning. It measures the similarity between two vectors and is calculated as the sum of the products of their corresponding elements. While the naive approach to compute the dot product iterates through each element one at a time, this can be inefficient for large arrays. Fortunately, modern CPUs provide SIMD (Single Instruction, Multiple Data) instructions, which apply one operation to several elements at once.
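Written out, for two vectors of length n this is:

```latex
\mathbf{a} \cdot \mathbf{b} = \sum_{i=0}^{n-1} a_i \, b_i
```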

Here's the basic implementation:

#include <iostream>
#include <vector>

float dotProduct(const float *a, const float *b, const size_t n) {
    float result = 0;
    for (size_t i = 0; i < n; ++i) {
        result += a[i] * b[i];
    }

    return result;
}

int main() {
    std::vector<float> a = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    };
    std::vector<float> b = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };

    float value = dotProduct(a.data(), b.data(), a.size());
    std::cout << value << "\n";

    return 0;
}

In this code, the dotProduct function calculates the dot product by iterating through the elements of the arrays, multiplying each pair of elements, and summing the results. The main function initializes two vectors, calls the dotProduct function, and prints the final result. Output:

1938

Let's now look at an optimized version of the dot product calculation that uses AVX2 SIMD intrinsics to take advantage of parallel processing.

Here's how to implement the dot product using AVX2:

#include <cstddef>
#include <immintrin.h>

float dotProduct(const float *a, const float *b, const size_t n) {
    // 8 running partial sums, one per lane.
    __m256 vsum = _mm256_setzero_ps();

    // Process 8 elements per iteration.
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vmul = _mm256_mul_ps(va, vb);
        vsum = _mm256_add_ps(vsum, vmul);
    }

    // Reduce the 8 lanes of vsum to a single float.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);

    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    float result = _mm_cvtss_f32(bottom);

    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; ++i) {
        result += a[i] * b[i];
    }

    return result;
}

In this optimized version, the dotProduct function employs AVX2 intrinsics for efficient computation (compile with AVX2 enabled, e.g. g++ -O2 -mavx2):

  • _mm256_setzero_ps initializes a 256-bit register that holds the accumulated sums.
  • _mm256_loadu_ps loads 8 (possibly unaligned) floats from each array.
  • _mm256_mul_ps multiplies the two vectors element-wise.
  • _mm256_add_ps adds the products into the vsum accumulator.

To convert the SIMD result back to a scalar, we first split vsum into its two 128-bit halves and add them together, then apply two horizontal adds (_mm_hadd_ps) to collapse the remaining four lanes into a single value.

Finally, any remaining elements that did not fill a complete 8-element SIMD iteration are accumulated with the plain scalar loop.
