Normalization is a common step in data processing, particularly in machine learning and statistical applications. One popular variant is L1 normalization, which divides every element of an array by the sum of the absolute values of all elements, so that the absolute values of the result sum to 1. Both steps apply the same operation to every element, so they map well onto the SIMD instructions of modern CPUs, which process eight 32-bit floats at a time in a 256-bit register.

Here's the basic implementation:

```
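#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

void normalizeL1(float *data, const size_t n) {
    // Step 1: sum the absolute values of all elements.
    float sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += std::fabs(data[i]);
    }

    // An empty or all-zero array cannot be normalized; leave it unchanged
    // instead of dividing by zero.
    if (sum == 0) {
        return;
    }

    // Step 2: divide each element by that sum.
    for (size_t i = 0; i < n; ++i) {
        data[i] /= sum;
    }
}

int main() {
    std::vector<float> a = {
        -1.5, 0, 1, -1.5, 2, -2.5, 3, -3.5, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13,
    };

    normalizeL1(a.data(), a.size());

    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}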
```

The scalar version is straightforward and works in two steps: it sums the absolute values of the array elements, then divides each element by that sum. For the sample input the absolute values add up to 100, so every element ends up divided by 100.

Output:

`-0.015 0 0.01 ... -0.11 0.12 -0.13`
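A simple way to sanity-check any implementation of this routine is to confirm that the absolute values of the result sum to 1, up to floating-point rounding. The sketch below is one possible check, not part of the original code: the helper names `l1Norm`, `normalizedL1`, and `hasUnitL1Norm`, as well as the default tolerance of `1e-5`, are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: L1 norm (sum of absolute values) of a vector.
float l1Norm(const std::vector<float> &v) {
    float sum = 0.0f;
    for (float x : v) {
        sum += std::fabs(x);
    }
    return sum;
}

// Hypothetical helper: returns an L1-normalized copy of v.
// The input must contain at least one non-zero element.
std::vector<float> normalizedL1(std::vector<float> v) {
    const float sum = l1Norm(v);
    assert(sum > 0.0f && "an all-zero input cannot be L1-normalized");
    for (float &x : v) {
        x /= sum;
    }
    return v;
}

// Hypothetical check: true when the L1 norm of v is 1 within `tol`.
// The tolerance is an assumed value; long arrays may need a looser one.
bool hasUnitL1Norm(const std::vector<float> &v, float tol = 1e-5f) {
    return std::fabs(l1Norm(v) - 1.0f) <= tol;
}
```

Calling `hasUnitL1Norm(normalizedL1(a))` on the sample array should return `true`. An exact comparison with `1.0f` would be too strict, because each division rounds its result.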

For larger datasets, SIMD can significantly accelerate the normalization process by processing multiple elements simultaneously.

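Here's the optimized implementation using AVX2. The 256-bit floating-point intrinsics it uses are already part of the original AVX instruction set, so the code also runs on CPUs that support AVX but not AVX2: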

```
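#include <immintrin.h>
#include <cmath>
#include <cstddef>

void normalizeL1(float *data, const size_t n) {
    // -0.0f has only the sign bit set, so it serves as a sign-bit mask.
    __m256 signMask = _mm256_set1_ps(-0.0f);
    __m256 vsum = _mm256_setzero_ps();

    // Accumulate absolute values, 8 floats per iteration.
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_andnot_ps(signMask, vdata);  // clear the sign bit
        vsum = _mm256_add_ps(vsum, vdata);
    }

    // Horizontal sum: fold the 8 lanes of vsum into a single float.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);
    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);
    float sum = _mm_cvtss_f32(bottom);

    // Add the remaining (n % 8) elements.
    for (; i < n; ++i) {
        sum += std::fabs(data[i]);
    }

    // An empty or all-zero array cannot be normalized; leave it unchanged
    // instead of dividing by zero.
    if (sum == 0) {
        return;
    }

    // Divide every element by the sum, again 8 floats per iteration.
    vsum = _mm256_set1_ps(sum);
    i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_div_ps(vdata, vsum);
        _mm256_storeu_ps(&data[i], vdata);
    }

    // Divide the remaining elements.
    for (; i < n; ++i) {
        data[i] /= sum;
    }
}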
```
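The vectorized loop adds the elements in a different order than the scalar loop: it keeps eight independent partial sums, folds them together at the end, and only then adds the leftover elements. Because floating-point addition is not associative, the two versions can differ in the last few bits on large or badly scaled inputs. The sketch below is a portable scalar model of that summation order, intended only to make the order visible without intrinsics. The function names `laneWiseAbsSum` and `sequentialAbsSum` are assumptions for illustration and do not appear in the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical model of the vectorized summation order, written without
// intrinsics: 8 partial sums (one per lane), a pairwise fold, then the tail.
float laneWiseAbsSum(const float *data, std::size_t n) {
    assert(data != nullptr || n == 0);

    // One running sum per SIMD lane, like the 8 floats held in vsum.
    float lanes[8] = {0.0f};
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (std::size_t l = 0; l < 8; ++l) {
            lanes[l] += std::fabs(data[i + l]);
        }
    }

    // Upper 128-bit half added to the lower half (extractf128 + add_ps).
    float q[4];
    for (std::size_t l = 0; l < 4; ++l) {
        q[l] = lanes[l] + lanes[l + 4];
    }

    // Two horizontal adds: first neighbouring pairs, then the two results.
    const float p0 = q[0] + q[1];
    const float p1 = q[2] + q[3];
    float sum = p0 + p1;

    // Leftover (n % 8) elements are added one by one.
    for (; i < n; ++i) {
        sum += std::fabs(data[i]);
    }
    return sum;
}

// Reference: plain left-to-right summation, as in the scalar version.
float sequentialAbsSum(const float *data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::fabs(data[i]);
    }
    return sum;
}
```

When testing the SIMD version against the scalar one, comparing the results with a small relative tolerance is therefore safer than requiring bit-identical output.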

Key points of the AVX2 implementation:

- `_mm256_loadu_ps` loads 8 float values from the array; the address does not have to be 32-byte aligned.
- `_mm256_andnot_ps` computes the absolute value: it ANDs each element with the inverted sign mask, which clears the sign bit.
- `_mm256_add_ps` accumulates the absolute values into 8 running partial sums.
- `_mm256_div_ps` performs element-wise division to normalize the array.
- `_mm256_storeu_ps` stores the normalized values back to the array.
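The absolute-value step relies on the IEEE 754 layout of a 32-bit float, where the top bit is the sign. `-0.0f` has only that bit set, so `_mm256_andnot_ps(signMask, vdata)` computes `~signMask & vdata` and clears the sign in all eight lanes at once. The scalar sketch below shows the same bit operation on a single value. The function name `absByClearingSignBit` is an assumption for illustration, and the code assumes that `float` is a 32-bit IEEE 754 type.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical scalar equivalent of _mm256_andnot_ps(signMask, x) for one
// float. Assumes float is a 32-bit IEEE 754 value.
float absByClearingSignBit(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");

    // Reinterpret the float and the -0.0f mask as raw bits.
    const float negZero = -0.0f;
    std::uint32_t bits = 0;
    std::uint32_t signMask = 0;
    std::memcpy(&bits, &x, sizeof bits);
    std::memcpy(&signMask, &negZero, sizeof signMask);
    assert(signMask == 0x80000000u);  // only the sign bit is set

    // AND-NOT: keep every bit of x except the sign bit.
    bits = ~signMask & bits;

    float result = 0.0f;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

For ordinary values this matches `std::fabs`: `absByClearingSignBit(-1.5f)` yields `1.5f`, and `-0.0f` becomes `+0.0f`.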
