The dot product is a fundamental operation in many fields, including computer graphics, physics simulations, and machine learning. It measures the similarity between two vectors and is calculated as the sum of the products of their corresponding elements; for example, the dot product of (1, 2, 3) and (4, 5, 6) is 1·4 + 2·5 + 3·6 = 32. While the naive approach iterates through each element one at a time, this can be inefficient for large arrays. Fortunately, modern CPUs support SIMD (Single Instruction, Multiple Data) instructions, which process several elements in a single operation.
Here's the basic implementation:
#include <iostream>
#include <vector>

// Scalar dot product: multiply corresponding elements and sum the products.
float dotProduct(const float *a, const float *b, const size_t n) {
    float result = 0;
    for (size_t i = 0; i < n; ++i) {
        result += a[i] * b[i];
    }
    return result;
}

int main() {
    std::vector<float> a = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    };
    std::vector<float> b = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    float value = dotProduct(a.data(), b.data(), a.size());
    std::cout << value << "\n";
    return 0;
}
In this code, the dotProduct function calculates the dot product by iterating through the elements of the arrays, multiplying each pair of elements, and summing the results. The main function initializes two vectors, calls the dotProduct function, and prints the final result.

Output:

1938
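For reference, the C++ standard library already provides this computation through std::inner_product. A minimal sketch, assuming the same vectors a and b as in main above:

#include <numeric> // std::inner_product

// Equivalent to dotProduct(a.data(), b.data(), a.size()):
float value = std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);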
Let's now look at an optimized version of the dot product calculation that uses AVX2 SIMD intrinsics, which operate on 256-bit registers holding 8 floats at a time. Note that the code below must be built with AVX2 enabled (e.g. -mavx2 on GCC/Clang).
Here's how to implement the dot product using AVX2:
#include <immintrin.h> // AVX2 intrinsics
#include <cstddef>     // size_t

float dotProduct(const float *a, const float *b, const size_t n) {
    __m256 vsum = _mm256_setzero_ps(); // 8 partial sums, initialized to zero
    size_t i = 0;
    // Main loop: process 8 floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);  // load 8 elements of a
        __m256 vb = _mm256_loadu_ps(&b[i]);  // load 8 elements of b
        __m256 vmul = _mm256_mul_ps(va, vb); // element-wise products
        vsum = _mm256_add_ps(vsum, vmul);    // accumulate partial sums
    }
    // Horizontal reduction: collapse the 8 partial sums into one float.
    __m128 bottom = _mm256_castps256_ps128(vsum); // lower 4 floats
    __m128 top = _mm256_extractf128_ps(vsum, 1);  // upper 4 floats
    bottom = _mm_add_ps(bottom, top);             // 4 pairwise sums
    bottom = _mm_hadd_ps(bottom, bottom);         // 2 sums
    bottom = _mm_hadd_ps(bottom, bottom);         // 1 sum in lane 0
    float result = _mm_cvtss_f32(bottom);
    // Handle any leftover elements (when n is not a multiple of 8).
    for (; i < n; ++i) {
        result += a[i] * b[i];
    }
    return result;
}
In this optimized version, the dotProduct function employs AVX2 intrinsics for efficient computation:

- _mm256_setzero_ps initializes a register to hold the accumulated sums.
- _mm256_loadu_ps loads 8 elements from each array (the unaligned variant, so the data need not be 32-byte aligned).
- _mm256_mul_ps performs element-wise multiplication of the two vectors.
- _mm256_add_ps accumulates the results in the vsum register.
To convert the SIMD result back to a scalar, we first split vsum into its lower and upper 128-bit halves and add them together, then apply two horizontal adds (_mm_hadd_ps) to reduce the four remaining values to a single sum.
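The same reduction, factored into a standalone helper for clarity (a sketch; the name hsum256 is ours, not part of any library):

// Reduce a __m256 holding 8 floats to their scalar sum.
static float hsum256(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);   // lower 4 floats (a free cast)
    __m128 hi = _mm256_extractf128_ps(v, 1); // upper 4 floats
    __m128 s = _mm_add_ps(lo, hi);           // 4 pairwise sums
    s = _mm_hadd_ps(s, s);                   // 2 sums
    s = _mm_hadd_ps(s, s);                   // final sum in lane 0
    return _mm_cvtss_f32(s);                 // extract lane 0
}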
Finally, any remaining elements that did not fit into the 8-wide SIMD loop (when n is not a multiple of 8) are handled by the plain scalar loop at the end.
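As a possible refinement, on CPUs that also support FMA (and when compiling with -mfma), the multiply and add in the main loop can be fused into a single instruction. A minimal sketch of the changed loop, under that assumption:

// Fused multiply-add variant of the accumulation loop (assumes FMA support,
// e.g. compile with -mavx2 -mfma).
for (; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(&a[i]);
    __m256 vb = _mm256_loadu_ps(&b[i]);
    vsum = _mm256_fmadd_ps(va, vb, vsum); // vsum += va * vb in one instruction
}

Besides saving an instruction per iteration, the fused operation rounds only once, which can slightly improve accuracy.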