The floor function rounds a number down to the nearest integer that does not exceed it; applied to an array, it does this for each element. For instance, applying floor to 4.7 gives 4.0, and to -2.1 gives -3.0. When processing large data arrays, even a basic element-wise operation like this can affect performance. SIMD (Single Instruction, Multiple Data) improves efficiency by performing the same operation on multiple data elements at once, taking advantage of the vector units in modern CPUs.
The straightforward implementation:
#include <iostream>
#include <vector>
#include <cmath>

void floor(float *data, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        data[i] = std::floor(data[i]);
    }
}

int main() {
    std::vector<float> a = {
        -2.1, -3.5, 4.7, 9.8, -7.2, 0, 3.3, -1.9, 2.1,
        15, 1.4, 8.2, -8.3, -5.5, -4.2, 6.1, 9.9, -2.8,
    };
    floor(a.data(), a.size());
    for (auto value: a) {
        std::cout << value << " ";
    }
    return 0;
}
This code utilizes the standard C++ library function std::floor, which processes elements individually. Output:
-3 -4 4 9 -8 0 3 -2 2 15 1 8 -9 -6 -5 6 9 -3
This works well for small arrays. For large arrays, though, handling one element per iteration can leave the CPU's vector units unused unless the compiler vectorizes the loop itself.
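To see how the scalar loop behaves on a given machine, it can be timed directly. The following is a minimal sketch, not a rigorous benchmark: the helper name `scalar_floor_ns_per_element` and the input pattern are assumptions made for illustration, and the result depends heavily on the compiler, optimization flags, and hardware.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the article): returns the average time,
// in nanoseconds per element, of the scalar std::floor loop over n floats.
double scalar_floor_ns_per_element(std::size_t n) {
    if (n == 0) {
        return 0.0;  // nothing to measure
    }

    // Fill the array with a mix of negative and positive non-integers.
    std::vector<float> data(n);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = 0.37f * static_cast<float>(i) - 100.0f;
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = std::floor(data[i]);
    }
    const auto stop = std::chrono::steady_clock::now();

    // Read one result through a volatile so the loop is not optimized away.
    volatile float sink = data[n / 2];
    (void)sink;

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return total_ns / static_cast<double>(n);
}
```

Calling the helper with a few different sizes (for example 1,000 and 10,000,000 elements) gives a rough baseline to compare a vectorized version against. A single run is noisy, so repeated measurements are advisable.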
Here's the optimized version using 256-bit AVX intrinsics, which any AVX2-capable CPU supports:
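#include <immintrin.h>
#include <cmath>    // std::floor for the scalar tail
#include <cstddef>  // size_t

void floor(float *data, const size_t n) {
    size_t i = 0;
    // Main loop: floor eight floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vresult = _mm256_floor_ps(vdata);
        _mm256_storeu_ps(&data[i], vresult);
    }
    // Scalar tail: handle the remaining n % 8 elements.
    for (; i < n; ++i) {
        data[i] = std::floor(data[i]);
    }
}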
Explanation of the AVX2 code:
_mm256_loadu_ps loads eight floating-point values from the array.
_mm256_floor_ps applies the floor function to each of the eight elements in parallel.
_mm256_storeu_ps writes the processed elements back to the original array.
For arrays with lengths that aren't a multiple of eight, we process the remaining elements one by one to ensure no data is missed.
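The vector loop and the scalar tail can be combined with a runtime check so the same binary also runs on CPUs without AVX. The sketch below is one possible arrangement, not the article's own code: the names `floor_avx` and `floor_inplace` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, so other compilers or non-x86 platforms simply take the scalar path.

```cpp
#include <cmath>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1

// Assumption: GCC/Clang on x86. The target attribute enables AVX for this
// one function, so the file does not need to be compiled with -mavx.
__attribute__((target("avx")))
static void floor_avx(float *data, std::size_t n) {
    std::size_t i = 0;
    // Floor eight floats per iteration using unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_floor_ps(v));
    }
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; ++i) {
        data[i] = std::floor(data[i]);
    }
}
#endif

// Hypothetical entry point: uses the AVX path when the CPU reports support
// for it, and falls back to the plain scalar loop otherwise.
void floor_inplace(float *data, std::size_t n) {
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        floor_avx(data, n);
        return;
    }
#endif
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = std::floor(data[i]);
    }
}
```

Calling `floor_inplace(a.data(), a.size())` on the 18-element array from the first example should produce the same output as the scalar version: two full vector iterations cover the first 16 elements and the tail loop handles the last two.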