Calculating the Euclidean distance between two arrays is a fundamental operation in many applications. The Euclidean distance provides a measure of similarity between two arrays by summing the squared differences of their elements and taking the square root of the result. While the operation is simple, calculating this distance for large arrays can become computationally expensive. Using SIMD can optimize this process significantly by handling multiple elements simultaneously.
Here's a basic scalar implementation:
#include <iostream>
#include <vector>
#include <cmath>
float distance(const float *a, const float *b, const size_t n) {
float sum = 0;
for (size_t i = 0; i < n; ++i) {
float diff = a[i] - b[i];
sum += diff * diff;
}
return std::sqrt(sum);
}
int main() {
std::vector<float> a = {
17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
};
std::vector<float> b = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
};
auto value = distance(a.data(), b.data(), a.size());
std::cout << value;
return 0;
}
The code iterates through the elements of two arrays, calculates the squared difference for each pair of elements, and accumulates these differences. Finally, the square root of the accumulated sum gives the Euclidean distance. The code output is 44.2267.
Here's the optimized implementation using AVX2:
#include <immintrin.h>
float distance(const float *a, const float *b, const size_t n) {
__m256 vsum = _mm256_setzero_ps();
size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vdiff = _mm256_sub_ps(va, vb);
vdiff = _mm256_mul_ps(vdiff, vdiff);
vsum = _mm256_add_ps(vsum, vdiff);
}
__m128 bottom = _mm256_castps256_ps128(vsum);
__m128 top = _mm256_extractf128_ps(vsum, 1);
bottom = _mm_add_ps(bottom, top);
bottom = _mm_hadd_ps(bottom, bottom);
bottom = _mm_hadd_ps(bottom, bottom);
float sum = _mm_cvtss_f32(bottom);
for (; i < n; ++i) {
float diff = a[i] - b[i];
sum += diff * diff;
}
return std::sqrt(sum);
}
Explanation of the AVX2 code:
_mm256_setzero_ps
initializes the vector with zero, creating a starting point for accumulating results._mm256_loadu_ps
loads eight elements from arrays._mm256_sub_ps
computes the element-wise difference between vectors._mm256_mul_ps
multiplies corresponding elements._mm256_add_ps
accumulates the squared differences to track the sum.
We reduce the eight values in vsum
to a single scalar sum by extracting the upper and lower halves, summing them, and using horizontal additions.
If the array size is not a multiple of eight, the loop processes any remaining elements using scalar arithmetic.
Leave a Comment
Cancel reply