In many numerical applications, you need to perform the same operation on a large set of data elements, such as calculating the square root of each element in an array. Traditionally, this is done with a loop that iterates through the array and applies the sqrt function element-wise. When the arrays are large, however, performance becomes a critical factor. Fortunately, modern CPUs provide SIMD (Single Instruction, Multiple Data) instructions, which apply one operation to several elements at once: an AVX2 register is 256 bits wide and holds eight single-precision floats.
Here's the straightforward implementation:
#include <iostream>
#include <vector>
#include <cmath>

// Scalar version: process one element per loop iteration.
void sqrt(float *data, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    sqrt(a.data(), a.size());
    for (auto value : a) {
        std::cout << value << " ";
    }
    return 0;
}
This code calculates the square root of each element in the array and prints the results. Part of the output:
1 1.41421 1.73205 2 ... 3.87298 4 4.12311 4.24264
Here's how to perform the same operation using AVX2:
#include <immintrin.h>
#include <cmath>
#include <cstddef>

// AVX2 version: process 8 floats per loop iteration.
void sqrt(float *data, const size_t n) {
    size_t i = 0;
    // Vectorized loop: handle 8 elements at a time.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);  // unaligned load of 8 floats
        __m256 vsqrt = _mm256_sqrt_ps(vdata);      // 8 square roots in parallel
        _mm256_storeu_ps(&data[i], vsqrt);         // unaligned store of the results
    }
    // Scalar tail: handle the remaining (fewer than 8) elements.
    for (; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}
In this optimized version, the sqrt function uses AVX2 intrinsics:
_mm256_loadu_ps loads 8 floating-point numbers from the array.
_mm256_sqrt_ps computes the square root of all 8 floats in parallel.
_mm256_storeu_ps stores the results back into the array.
Remaining elements are processed with the traditional scalar method in the second loop.
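To get a rough sense of the speedup, the two versions can be timed side by side. Below is a minimal benchmark sketch, not a definitive measurement: the functions are renamed sqrt_scalar and sqrt_avx2 so both can live in one file, the array size is an arbitrary choice, and the code assumes a compiler and CPU with AVX2 support (e.g. g++ -O2 -mavx2 bench.cpp).
#include <immintrin.h>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Renamed copies of the two functions above (names exist only for this sketch).
void sqrt_scalar(float *data, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}

void sqrt_avx2(float *data, const size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        _mm256_storeu_ps(&data[i], _mm256_sqrt_ps(v));
    }
    for (; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}

int main() {
    const size_t n = 1 << 24;  // ~16 million elements (arbitrary size)
    std::vector<float> a(n, 2.0f), b(n, 2.0f);

    auto t0 = std::chrono::steady_clock::now();
    sqrt_scalar(a.data(), a.size());
    auto t1 = std::chrono::steady_clock::now();
    sqrt_avx2(b.data(), b.size());
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("scalar: %lld ms, avx2: %lld ms\n",
                (long long)ms(t1 - t0), (long long)ms(t2 - t1));
    return 0;
}
On a CPU with AVX2, the vectorized loop typically finishes noticeably faster than the scalar one, although the exact ratio depends on the compiler flags, the hardware, and whether the working set fits in cache.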