The Manhattan distance is the sum of the absolute differences between corresponding elements in two arrays. SIMD is highly effective in optimizing such operations, enabling us to process multiple array elements simultaneously.
The traditional implementation:
#include <iostream>
#include <vector>
#include <cmath>
float distance(const float *a, const float *b, const size_t n) {
float sum = 0;
for (size_t i = 0; i < n; ++i) {
float diff = a[i] - b[i];
sum += std::fabs(diff);
}
return sum;
}
int main() {
std::vector<float> a = {
17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
};
std::vector<float> b = {
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
};
auto value = distance(a.data(), b.data(), a.size());
std::cout << value;
return 0;
}
In the code, the function iterates over each element in the arrays, calculates the absolute difference, and adds it to a cumulative sum to calculate the Manhattan distance. The code output is 162.
This implementation is straightforward, but may be slow for large arrays due to the lack of parallel processing.
Here's the optimized implementation using AVX2:
#include <immintrin.h>
float distance(const float *a, const float *b, const size_t n) {
__m256 signMask = _mm256_set1_ps(-0.0f);
__m256 vsum = _mm256_setzero_ps();
size_t i = 0;
for (; i + 8 <= n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vdiff = _mm256_sub_ps(va, vb);
vdiff = _mm256_andnot_ps(signMask, vdiff);
vsum = _mm256_add_ps(vsum, vdiff);
}
__m128 bottom = _mm256_castps256_ps128(vsum);
__m128 top = _mm256_extractf128_ps(vsum, 1);
bottom = _mm_add_ps(bottom, top);
bottom = _mm_hadd_ps(bottom, bottom);
bottom = _mm_hadd_ps(bottom, bottom);
float sum = _mm_cvtss_f32(bottom);
for (; i < n; ++i) {
float diff = a[i] - b[i];
sum += std::fabs(diff);
}
return sum;
}
Explanation of key AVX2 instructions:
_mm256_setzero_ps
initializes vector with all elements set to zero, used as an accumulator._mm256_set1_ps
sets a mask with -0.0f in all positions, allowing us to calculate the absolute value by masking the sign bit._mm256_loadu_ps
loads eight floating-point values from arrays._mm256_sub_ps
performs element-wise subtraction._mm256_andnot_ps
clears the sign bit in each element to obtain the absolute value of each difference._mm256_add_ps
adds corresponding elements from vectors, accumulating the absolute differences._mm256_castps256_ps128
,_mm256_extractf128_ps
,_mm_hadd_ps
used to reduce the values stored in the 256-bit register down to a single floating-point sum for the final result.
Any remaining elements are processed in a final fallback loop.
Leave a Comment
Cancel reply