The ceil function rounds a value up to the smallest integer that is not less than it; applied to an array, it does this for every element. For example, applying ceil to 4.3 returns 5, while applying it to -2.7 returns -2. When working with large data arrays, even a simple ceil applied element by element can become a noticeable cost. SIMD allows us to apply the same operation to multiple data elements simultaneously, using the vector units in modern CPUs.
Here's a basic implementation:
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Rounds every element of data up, in place, one element per iteration.
void ceil(float *data, const std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = std::ceil(data[i]);
    }
}

int main() {
    std::vector<float> a = {
        -2.1, -3.5, 4.7, 9.8, -7.2, 0, 3.3, -1.9, 2.1,
        15, 1.4, 8.2, -8.3, -5.5, -4.2, 6.1, 9.9, -2.8,
    };
    ceil(a.data(), a.size());
    // Print the rounded values separated by spaces.
    for (auto value : a) {
        std::cout << value << " ";
    }
    return 0;
}
This code uses the standard C++ library function std::ceil, which processes one element at a time. Output:
-2 -3 5 10 -7 0 4 -1 3 15 2 9 -8 -5 -4 7 10 -2
This is a straightforward implementation, but it is not optimized for large arrays.
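Here's the optimized version using 256-bit SIMD intrinsics. The three intrinsics used here belong to the AVX instruction set, so they are also available on any AVX2-capable CPU: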
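#include <immintrin.h>
#include <cmath>
#include <cstddef>

void ceil(float *data, const std::size_t n) {
    std::size_t i = 0;
    // Main loop: round eight floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);   // unaligned load of 8 floats
        __m256 vresult = _mm256_ceil_ps(vdata);     // round each lane up
        _mm256_storeu_ps(&data[i], vresult);        // unaligned store, in place
    }
    // Tail loop: handle the remaining n % 8 elements one at a time.
    for (; i < n; ++i) {
        data[i] = std::ceil(data[i]);
    }
}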
Explanation of the SIMD code:
_mm256_loadu_ps loads eight elements from the array.
_mm256_ceil_ps applies the ceil function to each of the eight elements simultaneously.
_mm256_storeu_ps writes the results back to the same positions in the original array.
A final scalar loop handles the remaining n % 8 elements one at a time, so arrays whose length is not a multiple of eight are still fully processed.
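The SIMD loop above only runs on a CPU that supports AVX, and with GCC or Clang it typically needs AVX enabled at compile time as well (for example with -mavx). The following is a minimal sketch of one way to guard against that, not a definitive implementation. It assumes GCC or Clang on x86 and relies on two compiler extensions, the target("avx") function attribute and __builtin_cpu_supports. The names ceil_scalar, ceil_avx, ceil_dispatch, matches_scalar, and CEIL_HAS_X86 are hypothetical and exist only for this illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define CEIL_HAS_X86 1  // hypothetical macro, used only in this sketch
#else
#define CEIL_HAS_X86 0
#endif

// Scalar reference: one element per iteration.
void ceil_scalar(float *data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = std::ceil(data[i]);
    }
}

#if CEIL_HAS_X86
// Same loop structure as the SIMD version above. The target attribute
// (a GCC/Clang extension) enables AVX for this one function only.
__attribute__((target("avx")))
void ceil_avx(float *data, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);
        _mm256_storeu_ps(&data[i], _mm256_ceil_ps(v));
    }
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; ++i) {
        data[i] = std::ceil(data[i]);
    }
}
#endif

// Takes the AVX path only when the running CPU reports AVX support;
// otherwise falls back to the scalar loop.
void ceil_dispatch(float *data, std::size_t n) {
#if CEIL_HAS_X86
    if (__builtin_cpu_supports("avx")) {
        ceil_avx(data, n);
        return;
    }
#endif
    ceil_scalar(data, n);
}

// Returns true when the dispatched result equals the scalar result.
bool matches_scalar(std::vector<float> input) {
    std::vector<float> expected = input;
    ceil_scalar(expected.data(), expected.size());
    ceil_dispatch(input.data(), input.size());
    return input == expected;
}
```

Calling matches_scalar with the 18-element array from the basic example exercises both the eight-wide main loop (16 elements) and the scalar tail (2 elements), which makes it a convenient sanity check before measuring performance.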