Reverse Array Elements using C++ SIMD

Reversing the elements of an array is a common operation, and it can be implemented with either scalar code or SIMD. SIMD processes several elements with a single instruction, which can be considerably faster than a scalar loop, especially for large arrays.

Here's the implementation using the scalar approach:

#include <iostream>
#include <vector>

void reverse(const float *data, float *result, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = data[n - i - 1];
    }
}

int main() {
    std::vector<float> a = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
    };
    std::vector<float> result(a.size());

    reverse(a.data(), result.data(), a.size());
    for (auto value: result) {
        std::cout << value << " ";
    }

    return 0;
}

The scalar version walks the result array from start to end and fills each position with the corresponding element taken from the end of the input. Output:

18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

While simple, this loop handles one element per iteration as written, which can be slow for large arrays.
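Before optimizing, it helps to have a reference to test against. The sketch below compares the scalar loop with the standard library's std::reverse_copy, which produces the same result. The names reverse_scalar and matches_reverse_copy are hypothetical and used only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Same loop as the scalar version above, renamed (hypothetical name) so this
// sketch is self-contained.
void reverse_scalar(const float *data, float *result, const std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = data[n - i - 1];
    }
}

// Hypothetical test helper: returns true when reverse_scalar agrees with
// std::reverse_copy for the given input.
bool matches_reverse_copy(const std::vector<float> &input) {
    std::vector<float> ours(input.size());
    std::vector<float> reference(input.size());
    reverse_scalar(input.data(), ours.data(), input.size());
    std::reverse_copy(input.begin(), input.end(), reference.begin());
    return ours == reference;
}
```

The same kind of comparison can be reused for any faster implementation, as long as it is run with lengths that are not multiples of the vector width.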

Here's the optimized implementation using AVX2 (the compiler must have AVX2 enabled, for example with -mavx2 on GCC or Clang):

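#include <immintrin.h>
#include <cstddef>

void reverse(const float *data, float *result, const size_t n) {
    // Lane order that reverses an 8-element vector. It never changes,
    // so it is built once outside the loop.
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);

    size_t i = 0;
    // Vector loop: read 8 floats from the back of data and write them,
    // reversed, to the front of result.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[n - i - 8]);
        __m256 vresult = _mm256_permutevar8x32_ps(vdata, idx);
        _mm256_storeu_ps(&result[i], vresult);
    }

    // Scalar tail: the remaining n % 8 elements.
    for (; i < n; ++i) {
        result[i] = data[n - i - 1];
    }
}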

Here's how the AVX2 version works:

  • _mm256_loadu_ps - loads 8 single-precision floats from the input array, working backward from its end. The address does not need to be aligned.
  • _mm256_setr_epi32 - creates an index vector with the values {7, 6, 5, 4, 3, 2, 1, 0}. It tells the next step which source element goes into each position.
  • _mm256_permutevar8x32_ps - rearranges the elements of the loaded vector according to the index vector. With the indices above, it reverses the order of the 8 elements.
  • _mm256_storeu_ps - stores the reversed 8-element vector into the result array, again with no alignment requirement.
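To make the permute step concrete, here is a plain C++ model of what _mm256_permutevar8x32_ps does with its two operands. It is only a sketch for understanding the lane selection: permute8_model is a hypothetical helper, not an Intel intrinsic, and it uses std::array in place of the vector registers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an Intel intrinsic): models the lane selection
// performed by _mm256_permutevar8x32_ps in plain C++.
std::array<float, 8> permute8_model(const std::array<float, 8> &src,
                                    const std::array<std::int32_t, 8> &idx) {
    std::array<float, 8> dst{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        // Only the low 3 bits of each index are used, so every index
        // selects one of the source lanes 0..7.
        dst[lane] = src[static_cast<std::size_t>(idx[lane] & 7)];
    }
    return dst;
}
```

With idx = {7, 6, 5, 4, 3, 2, 1, 0}, output lane 0 receives source lane 7, output lane 1 receives source lane 6, and so on, which is exactly a reversal of the 8 elements.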

After processing as many full 8-element chunks as possible, any remaining elements (fewer than 8) are processed individually using the scalar method.
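The split between vector chunks and the scalar tail can be sketched without any intrinsics. The helper below mirrors the loop bounds of the AVX2 function and records which offsets each iteration would read and write. ChunkPlan, ReversePlan and plan_reverse are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration; not part of the original code.
struct ChunkPlan {
    std::size_t load_offset;   // first index read from data
    std::size_t store_offset;  // first index written in result
};

struct ReversePlan {
    std::vector<ChunkPlan> chunks;  // one entry per full 8-element iteration
    std::size_t tail_start;         // first result index left to the scalar loop
    std::size_t tail_count;         // number of elements left to the scalar loop
};

// Mirrors the loop bounds of the AVX2 reverse without touching any data.
ReversePlan plan_reverse(std::size_t n) {
    ReversePlan plan{};
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        plan.chunks.push_back({n - i - 8, i});
    }
    plan.tail_start = i;
    plan.tail_count = n - i;
    return plan;
}
```

For the 18-element array used earlier, plan_reverse(18) yields two chunks: the first reads data[10..17] and writes result[0..7], the second reads data[2..9] and writes result[8..15]. The scalar loop then fills result[16] and result[17] from data[1] and data[0].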
