Normalization is a widely used technique in data processing. One popular method is L2 normalization, also known as Euclidean normalization. It scales an array so that its L2 norm (the square root of the sum of its squared elements) equals 1. This technique is common in machine learning, especially in vector space models. A scalar implementation works well for small datasets, but vectorizing the calculation with SIMD can greatly improve performance on large ones.
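Written out, the definition above looks as follows. The symbols $x$, $\hat{x}$ and $n$ are notation introduced here for illustration: $x$ is the input array, $n$ its length, and $\hat{x}$ the normalized result.

```latex
\[
\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{x}_i = \frac{x_i}{\|x\|_2},
\qquad
\|\hat{x}\|_2 = 1 \quad \text{whenever } \|x\|_2 \neq 0 .
\]
```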

Here's the scalar version of L2 normalization:

```
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

void normalizeL2(float *data, const size_t n) {
    // Step 1: sum of squares, then square root gives the L2 norm.
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i] * data[i];
    }
    sum = std::sqrt(sum);

    // Step 2: divide every element by the norm.
    for (size_t i = 0; i < n; ++i) {
        data[i] /= sum;
    }
}

int main() {
    std::vector<float> a = {
        -1.5, 0, 1, -1.5, 2, -2.5, 3, -3.5, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13,
    };

    normalizeL2(a.data(), a.size());

    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}
```

This approach is a straightforward two-step process. First, it sums the squares of all elements and takes the square root of that sum. Then it normalizes each element by dividing it by the computed value. Output:

`-0.0516934 0 0.0344623 ... -0.379085 0.413547 -0.44801`
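A quick way to sanity-check this output is to recompute the norm of the result, which should be 1 up to rounding. For the input above the sum of squares is 842, so every element is divided by √842 ≈ 29.017. The sketch below shows one way to do that check. `l2Norm` and `normalizeL2Checked` are hypothetical helper names, not part of the original listing, and returning `false` for an all-zero array is an assumed policy for avoiding a division by zero.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns the L2 norm (square root of the sum of squares).
float l2Norm(const float *data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i] * data[i];
    }
    return std::sqrt(sum);
}

// Hypothetical variant of the scalar routine: leaves the array untouched and
// returns false when the norm is zero, instead of dividing by zero.
bool normalizeL2Checked(float *data, std::size_t n) {
    const float norm = l2Norm(data, n);
    if (norm == 0.0f) {
        return false;
    }
    for (std::size_t i = 0; i < n; ++i) {
        data[i] /= norm;
    }
    return true;
}
```

After a successful call, `l2Norm(data, n)` should be within a small tolerance of 1 (for example `1e-5f` for `float`). Comparing against exactly 1 is likely to fail because of rounding.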

For larger arrays, SIMD instructions can significantly reduce processing time by operating on several elements at once.

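Here's the AVX2 code. The 256-bit floating-point intrinsics it uses require only AVX, so it can be built with `-mavx` or `-mavx2` on GCC and Clang: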

```
#include <immintrin.h>
#include <cmath>
#include <cstddef>

void normalizeL2(float *data, const size_t n) {
    // Step 1: accumulate the squares in 8 parallel partial sums.
    __m256 vsum = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_mul_ps(vdata, vdata);
        vsum = _mm256_add_ps(vsum, vdata);
    }

    // Horizontal reduction: fold the 8 partial sums into a single float.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);
    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);
    float sum = _mm_cvtss_f32(bottom);

    // Scalar tail: add the squares of the remaining (n % 8) elements.
    for (; i < n; ++i) {
        sum += data[i] * data[i];
    }

    sum = std::sqrt(sum);

    // Step 2: divide by the norm, 8 elements at a time.
    vsum = _mm256_set1_ps(sum);
    i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_div_ps(vdata, vsum);
        _mm256_storeu_ps(&data[i], vdata);
    }

    // Scalar tail for the remaining elements.
    for (; i < n; ++i) {
        data[i] /= sum;
    }
}
```

Explanation of AVX2 instructions:

- `_mm256_loadu_ps` loads 8 floating-point values from the array.
- `_mm256_mul_ps` multiplies elements in parallel, used here to square each element.
- `_mm256_add_ps` adds elements of two vectors, used here to accumulate the sum of squares.
- `_mm256_set1_ps` sets all elements in a register to a single value (the computed L2 norm).
- `_mm256_div_ps` divides each element by the L2 norm for normalization.
- `_mm256_storeu_ps` stores the normalized values back to the array.
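The explanation above does not cover the horizontal reduction that turns the eight partial sums in `vsum` into one scalar. As a sketch, that step can be isolated into a helper. `hsum256`, `sum8` and the `TARGET_AVX` macro are hypothetical names introduced here, and the macro is only an assumed convenience so the snippet builds on GCC or Clang without passing `-mavx`.

```cpp
#include <immintrin.h>

// Assumed convenience: enable AVX for individual functions on GCC/Clang.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Hypothetical helper: adds the 8 float lanes of v and returns the total.
TARGET_AVX inline float hsum256(__m256 v) {
    __m128 bottom = _mm256_castps256_ps128(v);   // low 128 bits: lanes 0..3
    __m128 top = _mm256_extractf128_ps(v, 1);    // high 128 bits: lanes 4..7
    bottom = _mm_add_ps(bottom, top);            // 4 partial sums
    bottom = _mm_hadd_ps(bottom, bottom);        // pairwise add: 2 distinct sums
    bottom = _mm_hadd_ps(bottom, bottom);        // every lane now holds the total
    return _mm_cvtss_f32(bottom);                // read lane 0 as a float
}

// Hypothetical wrapper: loads 8 floats from p (unaligned) and returns their sum.
TARGET_AVX inline float sum8(const float *p) {
    return hsum256(_mm256_loadu_ps(p));
}
```

With such a helper, the five reduction lines after the first loop in `normalizeL2` could be written as `float sum = hsum256(vsum);`. Each `_mm_hadd_ps` adds adjacent pairs of lanes, which is why two passes are enough to collapse four values into one.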
