In performance-critical applications like image processing, scientific simulations, and machine learning, operations such as the element-wise minimum of two arrays can become a bottleneck. Thankfully, the SIMD (Single Instruction, Multiple Data) instructions available in modern CPUs let us process multiple data elements simultaneously, greatly accelerating such operations.
Here's the basic version using a loop:
#include <algorithm>  // std::min
#include <iostream>
#include <vector>

void vectorMin(const float *a, const float *b, float *result, const size_t n) {
    for (size_t i = 0; i < n; ++i) {
        result[i] = std::min(a[i], b[i]);
    }
}
int main() {
    std::vector<float> a = {
        5, -1, 10, -3, 14, 5, -6, 17, 8, 3, -12, 11, 2, 13, -7, 9, 16, 1
    };
    std::vector<float> b = {
        4, 2, -3, 12, 5, -6, 15, 8, -9, 10, 1, 20, -13, 14, 7, -5, 4, 18
    };
    std::vector<float> result(a.size());
    vectorMin(a.data(), b.data(), result.data(), a.size());
    for (auto value : result) {
        std::cout << value << " ";
    }
    return 0;
}
It's simple and portable, but processes elements one by one. Output:
4 -1 -3 -3 5 -6 -6 8 -9 3 -12 11 -13 13 -7 -5 4 1
This approach doesn't exploit the data-level parallelism available in modern hardware, so it becomes increasingly inefficient as datasets grow.
To leverage SIMD, we use AVX2 intrinsics, which operate on 256-bit wide registers, allowing us to process 8 floats at once:
#include <algorithm>   // std::min for the scalar tail
#include <immintrin.h> // AVX2 intrinsics

void vectorMin(const float *a, const float *b, float *result, const size_t n) {
    size_t i = 0;
    // Vector loop: process 8 floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vresult = _mm256_min_ps(va, vb);
        _mm256_storeu_ps(&result[i], vresult);
    }
    // Scalar tail: handle the remaining n % 8 elements.
    for (; i < n; ++i) {
        result[i] = std::min(a[i], b[i]);
    }
}
Here's an explanation of the AVX2 intrinsics:
- _mm256_loadu_ps - loads 8 floating-point elements from each input array.
- _mm256_min_ps - computes the minimum of each corresponding pair of elements.
- _mm256_storeu_ps - stores the 8 results back into the result array.
A scalar loop processes any remaining elements when the array size is not divisible by 8.