Reversing the elements of an array is a common operation in programming, and it can be implemented either with a plain scalar loop or with SIMD instructions. SIMD processes multiple data elements with a single instruction, which can provide a significant performance boost over a scalar implementation, especially for large arrays.
Here's the implementation using the scalar approach:
#include <cstddef>
#include <iostream>
#include <vector>

// Copies the n elements of data into result in reverse order.
void reverse(const float *data, float *result, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = data[n - i - 1];
    }
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    std::vector<float> result(a.size());

    reverse(a.data(), result.data(), a.size());

    for (auto value: result) {
        std::cout << value << " ";
    }

    return 0;
}
In the scalar version, we iterate over the output positions from start to end, and each position i receives the input element at index n - i - 1. Output:
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
While simple, this approach processes one element at a time, which can become inefficient with large datasets.
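For comparison, the same out-of-place reversal can be written with the standard library algorithm std::reverse_copy. The following is only a minimal sketch: the helper name reverse_std is hypothetical and not part of the original code, and the assertion reflects the requirement that the input and output buffers must not overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (name chosen for illustration only); it keeps the
// same signature as the scalar reverse() so the two can be swapped.
void reverse_std(const float *data, float *result, const std::size_t n) {
    // std::reverse_copy requires non-overlapping source and destination.
    assert(n == 0 || data != result);
    // Copies [data, data + n) into result, last element first.
    std::reverse_copy(data, data + n, result);
}
```

Calling reverse_std(a.data(), result.data(), a.size()) in place of reverse should produce the same output as the loop-based version.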
Here's the optimized implementation using AVX2:
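#include <cstddef>
#include <immintrin.h>

// Requires a CPU and compiler with AVX2 support (e.g. -mavx2 on GCC/Clang).
void reverse(const float *data, float *result, const size_t n) {
    // Lane indices that reverse an 8-element vector; constant for every chunk.
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);

    size_t i = 0;
    // Process full 8-element chunks, reading from the end of the input.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[n - i - 8]);
        __m256 vresult = _mm256_permutevar8x32_ps(vdata, idx);
        _mm256_storeu_ps(&result[i], vresult);
    }

    // Handle the remaining (fewer than 8) elements one at a time.
    for (; i < n; ++i) {
        result[i] = data[n - i - 1];
    }
}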
Here's how the AVX2 version works:
_mm256_loadu_ps
- loads 8 floating-point elements from the input array.
_mm256_setr_epi32
- creates a vector containing a set of specific indices. In this case, it initializes an index vector with the values {7, 6, 5, 4, 3, 2, 1, 0}. This vector is used to reverse the order of the elements in the permutation step.
_mm256_permutevar8x32_ps
- takes the loaded vector and applies the indices in the index vector to permute (rearrange) the elements. Essentially, it reverses the order of the 8 elements in the vector.
_mm256_storeu_ps
- after the order of the elements has been reversed, stores the processed 8-element vector into the result array.
After processing as many full 8-element chunks as possible, any remaining elements (fewer than 8) are processed individually using the scalar method.
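To make the chunk-and-tail logic easier to follow without AVX2 hardware, the steps above can be modelled in plain scalar code. This is only an illustrative sketch, not the AVX2 implementation itself: the function name reverse_chunked_model and the constant kLanes are hypothetical, and the inner loop merely imitates what the permute instruction does, where output lane j receives input lane idx[j].

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar model of the AVX2 routine (names are illustrative).
void reverse_chunked_model(const float *data, float *result, const std::size_t n) {
    assert(n == 0 || data != result);  // out-of-place only

    constexpr std::size_t kLanes = 8;  // a 256-bit vector holds 8 floats
    constexpr std::array<std::size_t, kLanes> idx = {7, 6, 5, 4, 3, 2, 1, 0};

    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        // Models the load: the chunk is taken from the end of the input.
        const float *chunk = &data[n - i - kLanes];
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            // Models the permute and store: output lane takes input lane idx[lane].
            result[i + lane] = chunk[idx[lane]];
        }
    }

    // Tail: fewer than kLanes elements remain, handled one by one.
    for (; i < n; ++i) {
        result[i] = data[n - i - 1];
    }
}
```

For the 18-element example, this model handles two full chunks (16 elements) and leaves the last two output positions to the tail loop.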