In many numerical applications, you often need to perform operations on arrays, such as division by a scalar value. While a straightforward approach is to iterate through each element, this can become a performance bottleneck, especially when dealing with large datasets. Fortunately, modern CPUs provide SIMD.
Here's the straightforward implementation:
#include <iostream>
#include <vector>
void divide(float *data, const size_t n, const float divisor) {
    for (size_t i = 0; i < n; ++i) {
        data[i] /= divisor;
    }
}
int main() {
    std::vector<float> a = {
        0, 14, 28, 42, 56, 70, 84, 98, 112, 126, 140, 154, 168, 182, 196, 210, 224, 255
    };
    divide(a.data(), a.size(), 255);
    for (auto value: a) {
        std::cout << value << " ";
    }
    return 0;
}In this code, we define a function divide computing the result of each element divided by the divisor. The main function initializes an array and creates a result vector. After calling the divide function, it prints the resulting array. A part of the output:
0 0.054902 0.109804 0.164706 ... 0.768627 0.823529 0.878431 1To improve performance, we can use SIMD instructions provided by AVX2. This enables us to perform division on multiple elements simultaneously.
Here's how to implement the same operation using AVX2:
#include <immintrin.h>
void divide(float *data, const size_t n, const float divisor) {
    __m256 vdivisor = _mm256_set1_ps(divisor);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vresult = _mm256_div_ps(vdata, vdivisor);
        _mm256_storeu_ps(&data[i], vresult);
    }
    for (; i < n; ++i) {
        data[i] /= divisor;
    }
}In this optimized version, the divide function leverages AVX2 intrinsics:
- _mm256_set1_pscreates a vector containing the scalar divisor replicated across all elements.
- _mm256_loadu_psloads 8 floating-point numbers from the array.
- _mm256_div_psexecutes the division operation in parallel for the 8 loaded elements.
- _mm256_storeu_psstores the results back into the array.
The loop processes chunks of 8 elements, and any remaining elements are handled in a fallback loop.
 
             
                         
                         
                        
Leave a Comment
Cancel reply