Normalize Array Elements by L2 Norm using C++ SIMD

Normalization is a common step in data processing. One popular method is L2 normalization, also known as Euclidean normalization. It scales an array so that its L2 norm (the square root of the sum of its squared elements) becomes 1. The technique is widely used in machine learning, especially in vector space models. A scalar implementation works well for small datasets, but for large datasets, computing it with SIMD can greatly improve performance.
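Written out as a formula, with x denoting the input array of n elements (notation introduced here only for illustration), the operation is:

```latex
\[
\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}, \qquad
\hat{x}_i = \frac{x_i}{\|x\|_2}, \quad i = 1, \dots, n
\]
```

The normalized array has an L2 norm of 1, provided the norm of the input is nonzero.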

Here's the scalar version of L2 norm:

#include <iostream>
#include <vector>
#include <cmath>

void normalizeL2(float *data, const size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i] * data[i];
    }
    sum = std::sqrt(sum);
    for (size_t i = 0; i < n; ++i) {
        data[i] /= sum;
    }
}

int main() {
    std::vector<float> a = {
        -1.5, 0, 1, -1.5, 2, -2.5, 3, -3.5, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13,
    };

    normalizeL2(a.data(), a.size());
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

This approach is a straightforward two-step process. First, it sums the squares of all elements and takes the square root of the result, which gives the L2 norm. Then it normalizes each element by dividing it by this value. Output:

-0.0516934 0 0.0344623 ... -0.379085 0.413547 -0.44801
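One edge case worth noting: if every element is zero, the norm is zero and the division produces NaN values. Below is a minimal sketch of one way to guard against that, assuming the caller prefers the data to be left untouched in that case. The name normalizeL2Checked and the boolean return value are hypothetical and not part of the code above.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical variant of normalizeL2: returns false and leaves the data
// untouched when the norm is zero or not finite, instead of dividing by it.
bool normalizeL2Checked(float *data, const std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i] * data[i];
    }
    const float norm = std::sqrt(sum);
    if (norm == 0.0f || !std::isfinite(norm)) {
        return false; // nothing sensible to scale by
    }
    for (std::size_t i = 0; i < n; ++i) {
        data[i] /= norm;
    }
    return true;
}
```

Whether to skip the array, report an error, or add a small epsilon to the norm depends on the application; this sketch only shows the first option.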

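For larger arrays, SIMD operations can significantly reduce processing time by handling several elements with a single instruction. A 256-bit register holds 8 single-precision floats, so each vector instruction works on 8 elements at once.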

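Here's the AVX2 code (the 256-bit floating-point intrinsics it uses are available from AVX onward, so it must be compiled with AVX support enabled, for example with -mavx or -mavx2 on GCC and Clang):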

#include <cmath>
#include <cstddef>
#include <immintrin.h>

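void normalizeL2(float *data, const size_t n) {
    // Accumulator holding 8 partial sums of squares.
    __m256 vsum = _mm256_setzero_ps();

    size_t i = 0;
    // Square and accumulate 8 elements per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_mul_ps(vdata, vdata);
        vsum = _mm256_add_ps(vsum, vdata);
    }

    // Horizontal reduction: add the lower and upper 128-bit halves,
    // then add adjacent elements until the total is in the lowest element.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);

    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    float sum = _mm_cvtss_f32(bottom);
    // Add the squares of the remaining (fewer than 8) elements.
    for (; i < n; ++i) {
        sum += data[i] * data[i];
    }

    // Compute the L2 norm and broadcast it to all 8 elements of a register.
    sum = std::sqrt(sum);
    vsum = _mm256_set1_ps(sum);

    i = 0;
    // Divide 8 elements at a time by the norm.
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_div_ps(vdata, vsum);
        _mm256_storeu_ps(&data[i], vdata);
    }

    // Divide the remaining elements.
    for (; i < n; ++i) {
        data[i] /= sum;
    }
}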

Explanation of AVX2 instructions:

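  • _mm256_setzero_ps creates a register with all 8 elements set to zero, used here as the initial accumulator.
  • _mm256_loadu_ps loads 8 floating-point values from the array; the address does not need to be aligned.
  • _mm256_mul_ps multiplies elements in parallel, used here to square each element.
  • _mm256_add_ps adds elements of two vectors, used here to accumulate the sum of squares.
  • _mm256_castps256_ps128 and _mm256_extractf128_ps give the lower and upper 128-bit halves of the accumulator.
  • _mm_add_ps adds the two halves, and _mm_hadd_ps adds adjacent elements; after two such steps the total is in the lowest element.
  • _mm_cvtss_f32 extracts the lowest element as a scalar float.
  • _mm256_set1_ps sets all elements in a register to a single value (the computed L2 norm).
  • _mm256_div_ps divides each element by the L2 norm for normalization.
  • _mm256_storeu_ps stores the normalized values back to the array.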
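The step between the two loops, which collapses the 8 partial sums into one value, can be harder to follow than the rest. The sketch below is a plain scalar model of that reduction, written only for illustration: it uses std::array in place of SIMD registers, and the names Lanes4, haddModel, and horizontalSumModel are hypothetical. Comments indicate which intrinsic from the code above each line imitates.

```cpp
#include <array>
#include <cstddef>

using Lanes4 = std::array<float, 4>;

// Scalar model of _mm_hadd_ps(a, b): sums adjacent pairs of a, then of b.
Lanes4 haddModel(const Lanes4 &a, const Lanes4 &b) {
    return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}

// Scalar model of the reduction: 8 elements -> 4 elements -> total in element 0.
float horizontalSumModel(const std::array<float, 8> &v) {
    Lanes4 bottom = {v[0], v[1], v[2], v[3]}; // _mm256_castps256_ps128
    Lanes4 top = {v[4], v[5], v[6], v[7]};    // _mm256_extractf128_ps(v, 1)
    for (std::size_t k = 0; k < 4; ++k) {
        bottom[k] += top[k];                  // _mm_add_ps
    }
    bottom = haddModel(bottom, bottom);       // pair sums: {s01, s23, s01, s23}
    bottom = haddModel(bottom, bottom);       // every element now holds the total
    return bottom[0];                         // _mm_cvtss_f32
}
```

For the input {1, 2, 3, 4, 5, 6, 7, 8}, the first addition gives {6, 8, 10, 12}, the first pairwise step gives {14, 22, 14, 22}, and the second leaves 36 in every element.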
