Calculate Manhattan Distance Between Two Arrays using C++ SIMD

Calculate Manhattan Distance Between Two Arrays using C++ SIMD

The Manhattan distance is the sum of the absolute differences between corresponding elements in two arrays. SIMD is highly effective in optimizing such operations, enabling us to process multiple array elements simultaneously.

The traditional implementation:

#include <iostream>
#include <vector>
#include <cmath>

float distance(const float *a, const float *b, const size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; ++i) {
        float diff = a[i] - b[i];
        sum += std::fabs(diff);
    }

    return sum;
}

int main() {
    std::vector<float> a = {
        17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
    };
    std::vector<float> b = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };

    auto value = distance(a.data(), b.data(), a.size());
    std::cout << value;

    return 0;
}

In the code, the function iterates over each element in the arrays, calculates the absolute difference, and adds it to a cumulative sum to calculate the Manhattan distance. The code output is 162.

This implementation is straightforward, but may be slow for large arrays due to the lack of parallel processing.

Here's the optimized implementation using AVX2:

#include <immintrin.h>

float distance(const float *a, const float *b, const size_t n) {
    __m256 signMask = _mm256_set1_ps(-0.0f);
    __m256 vsum = _mm256_setzero_ps();

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vdiff = _mm256_sub_ps(va, vb);
        vdiff = _mm256_andnot_ps(signMask, vdiff);
        vsum = _mm256_add_ps(vsum, vdiff);
    }

    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);

    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    float sum = _mm_cvtss_f32(bottom);
    for (; i < n; ++i) {
        float diff = a[i] - b[i];
        sum += std::fabs(diff);
    }

    return sum;
}

Explanation of key AVX2 instructions:

  • _mm256_setzero_ps initializes vector with all elements set to zero, used as an accumulator.
  • _mm256_set1_ps sets a mask with -0.0f in all positions, allowing us to calculate the absolute value by masking the sign bit.
  • _mm256_loadu_ps loads eight floating-point values from arrays.
  • _mm256_sub_ps performs element-wise subtraction.
  • _mm256_andnot_ps clears the sign bit in each element to obtain the absolute value of each difference.
  • _mm256_add_ps adds corresponding elements from vectors, accumulating the absolute differences.
  • _mm256_castps256_ps128, _mm256_extractf128_ps, _mm_hadd_ps used to reduce the values stored in the 256-bit register down to a single floating-point sum for the final result.

Any remaining elements are processed in a final fallback loop.

Leave a Comment

Cancel reply

Your email address will not be published.