In many numerical applications, you need to perform the same operation on a large set of data elements, such as calculating the square root of each element in an array. Traditionally, this is done with a loop that iterates through the array and applies the sqrt function element-wise. When the arrays are large, however, performance becomes a critical factor. Fortunately, modern CPUs provide SIMD (Single Instruction, Multiple Data) instructions, which apply one operation to several elements at once: an AVX2 register is 256 bits wide and holds eight single-precision floats.
Here's the straightforward implementation:
#include <iostream>
#include <vector>
#include <cmath>

// Scalar version: process one element per loop iteration.
void sqrt(float *data, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    sqrt(a.data(), a.size());
    for (auto value : a) {
        std::cout << value << " ";
    }
    return 0;
}
This code calculates the square root of each element in the array and prints the results. Part of the output:
1 1.41421 1.73205 2 ... 3.87298 4 4.12311 4.24264
Here's how to perform the same operation using AVX2:
#include <immintrin.h>
#include <cmath>
#include <cstddef>

// AVX2 version: process 8 floats per loop iteration.
void sqrt(float *data, const size_t n) {
    size_t i = 0;
    // Vectorized loop: handle 8 elements at a time.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);  // unaligned load of 8 floats
        __m256 vsqrt = _mm256_sqrt_ps(vdata);      // 8 square roots in parallel
        _mm256_storeu_ps(&data[i], vsqrt);         // unaligned store of the results
    }
    // Scalar tail: handle the remaining (fewer than 8) elements.
    for (; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}
In this optimized version, the sqrt function uses AVX2 intrinsics:
_mm256_loadu_ps loads 8 floating-point numbers from the array.
_mm256_sqrt_ps computes the square root of all 8 floats in parallel.
_mm256_storeu_ps stores the results back into the array.
Remaining elements are processed with the traditional scalar method in the second loop.
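To get a rough sense of the speedup, the two versions can be timed side by side. Below is a minimal benchmark sketch, not a definitive measurement: the functions are renamed sqrt_scalar and sqrt_avx2 so both can live in one file, the array size is an arbitrary choice, and the code assumes a compiler and CPU with AVX2 support (e.g. g++ -O2 -mavx2 bench.cpp).
#include <immintrin.h>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Renamed copies of the two functions above (names exist only for this sketch).
void sqrt_scalar(float *data, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}

void sqrt_avx2(float *data, const size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        _mm256_storeu_ps(&data[i], _mm256_sqrt_ps(v));
    }
    for (; i < n; ++i) {
        data[i] = std::sqrt(data[i]);
    }
}

int main() {
    const size_t n = 1 << 24;  // ~16 million elements (arbitrary size)
    std::vector<float> a(n, 2.0f), b(n, 2.0f);

    auto t0 = std::chrono::steady_clock::now();
    sqrt_scalar(a.data(), a.size());
    auto t1 = std::chrono::steady_clock::now();
    sqrt_avx2(b.data(), b.size());
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("scalar: %lld ms, avx2: %lld ms\n",
                (long long)ms(t1 - t0), (long long)ms(t2 - t1));
    return 0;
}
On a CPU with AVX2, the vectorized loop typically finishes noticeably faster than the scalar one, although the exact ratio depends on the compiler flags, the hardware, and whether the working set fits in cache.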