Calculate Reciprocal Square Root of Array Elements using C++ SIMD

Calculate Reciprocal Square Root of Array Elements using C++ SIMD

Calculating the reciprocal square root is a frequent operation in scientific computing, machine learning, and graphics applications. A scalar implementation of this operation works well for small datasets. However, for larger arrays, performance becomes a concern. Using SIMD can greatly accelerate this process by performing multiple reciprocal square root calculations simultaneously.

Here's the straightforward implementation:

#include <iostream>
#include <vector>
#include <cmath>

void reciprocalSqrt(float *data, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = 1.0f / std::sqrt(data[i]);
    }
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };

    reciprocalSqrt(a.data(), a.size());
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

In this code, we iterate over each element, calculate the square root, take its reciprocal, and then store it back into the array. A part of the output:

1 0.707107 0.57735 ... 0.25 0.242536 0.235702

For large datasets, this approach does not fully utilize the CPU capability to perform multiple operations in parallel.

Here's how to perform the same operation using AVX2:

#include <immintrin.h>

void reciprocalSqrt(float *data, const size_t n) {
    __m256 one = _mm256_set1_ps(1.0f);

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vsqrt = _mm256_sqrt_ps(vdata);
        __m256 vresult = _mm256_div_ps(one, vsqrt);
        _mm256_storeu_ps(&data[i], vresult);
    }

    for (; i < n; ++i) {
        data[i] = 1.0f / std::sqrt(data[i]);
    }
}

Explanation of AVX2 code:

  • _mm256_loadu_ps loads 8 elements from the input array.
  • _mm256_sqrt_ps computes the square root.
  • _mm256_div_ps calculates the reciprocal.
  • _mm256_storeu_ps stores the result of each reciprocal square root back in the array.

Any elements left over are processed in a final scalar loop.

Leave a Comment

Cancel reply

Your email address will not be published.