Normalize Array Elements by L-infinity Norm using C++ SIMD

Normalizing data is a common preprocessing step in many scientific and machine learning applications. One widely used method divides every element by the array's L-infinity norm. The L-infinity norm (or maximum norm) of a vector is the maximum absolute value of its elements, so after normalization every value lies in the range [-1, 1]. A scalar loop works fine for small datasets, but it handles only one element per iteration. SIMD instructions process several elements at once (8 floats per 256-bit AVX register), which can speed up the operation considerably on large arrays.

Here's the scalar version:

#include <iostream>
#include <vector>
#include <cmath>

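// Divides every element by the largest absolute value in the array.
void normalizeLInfinity(float *data, const size_t n) {
    // Pass 1: find the L-infinity norm (maximum absolute value).
    float value = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float adata = std::fabs(data[i]);
        if (adata > value) {
            value = adata;
        }
    }
    // An empty or all-zero array has norm 0; return early to avoid 0/0.
    if (value == 0.0f) {
        return;
    }
    // Pass 2: scale every element by the norm.
    for (size_t i = 0; i < n; ++i) {
        data[i] /= value;
    }
}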

int main() {
    std::vector<float> a = {
        -2.1, -3.5, 4.7, 9.8, -7.2, 0, 3.3, -1.9, 2.1,
        -15, 1.4, 8.2, -8.3, -5.5, -4.2, 6.1, 9.9, -2.8,
    };

    normalizeLInfinity(a.data(), a.size());
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

The scalar implementation is straightforward: it finds the maximum absolute value, then divides each element in the array by that value to normalize the data. Part of the output:

-0.14 -0.233333 0.313333 ... 0.406667 0.66 -0.186667

While the scalar implementation works fine, it processes elements one by one in a loop, which can be inefficient for large arrays.

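Here's the AVX2 code. The 256-bit floating-point intrinsics it uses are already available with AVX, so with GCC or Clang it can be built with either -mavx or -mavx2: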

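#include <immintrin.h>
#include <cmath>    // std::fabs
#include <cstddef>  // size_t

void normalizeLInfinity(float *data, const size_t n) {
    // -0.0f has only the sign bit set, so andnot with it clears the sign bit.
    __m256 signMask = _mm256_set1_ps(-0.0f);
    __m256 vmax = _mm256_set1_ps(0.0f);

    // Pass 1: keep a running maximum of |x| in each of the 8 lanes.
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        __m256 vadata = _mm256_andnot_ps(signMask, vdata);
        vmax = _mm256_max_ps(vmax, vadata);
    }

    // Reduce the 8 lane maxima to a single value.
    float maxArray[8];
    _mm256_storeu_ps(maxArray, vmax);
    float value = 0.0f;  // start from 0, not data[0], so n == 0 reads nothing
    for (size_t j = 0; j < 8; ++j) {
        if (maxArray[j] > value) {
            value = maxArray[j];
        }
    }

    // Scalar tail: elements left over when n is not a multiple of 8.
    for (; i < n; ++i) {
        float adata = std::fabs(data[i]);
        if (adata > value) {
            value = adata;
        }
    }

    // An empty or all-zero array has norm 0; return early to avoid 0/0.
    if (value == 0.0f) {
        return;
    }

    // Broadcast the norm to all 8 lanes.
    vmax = _mm256_set1_ps(value);

    // Pass 2: divide 8 elements at a time by the norm.
    i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_div_ps(vdata, vmax);
        _mm256_storeu_ps(&data[i], vdata);
    }

    // Scalar tail for the remaining elements.
    for (; i < n; ++i) {
        data[i] /= value;
    }
}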

Explanation of the AVX2 code:

Finding the maximum absolute value:

  • _mm256_loadu_ps loads 8 elements at once from the array.
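  • _mm256_andnot_ps computes the absolute values of the 8 elements: it ANDs each element with the inverted mask, and because signMask (-0.0f) has only the sign bit set, this clears the sign bits.
  • _mm256_max_ps takes the element-wise maximum of vmax and the absolute values, so each of the 8 lanes of vmax keeps the largest value seen in that lane so far.
  • After the loop, vmax is stored to a temporary array and its 8 lanes are reduced to a single maximum with a scalar loop.
  • Any remaining elements (when n is not a multiple of 8) are handled using a scalar loop.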
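
To make the max-finding steps more concrete, here is a portable scalar sketch that models what the intrinsics do lane by lane. It is an illustration only, not the AVX code itself: the names absViaSignMask and maxAbsLanewise are made up for this example, the 8-element lanes array stands in for the vmax register, and the bit trick assumes a 32-bit IEEE 754 float.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Hypothetical helper: clears the sign bit of one float, which is what
// _mm256_andnot_ps(signMask, v) does in each of the 8 lanes.
float absViaSignMask(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);         // copy out the raw bits
    const std::uint32_t signMask = 0x80000000u;  // bit pattern of -0.0f
    bits = ~signMask & bits;                     // andnot: (~mask) & value
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Hypothetical scalar model of the max-finding phase.
float maxAbsLanewise(const float *data, std::size_t n) {
    float lanes[8] = {0.0f};  // models vmax: one running maximum per lane
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (std::size_t lane = 0; lane < 8; ++lane) {
            float a = absViaSignMask(data[i + lane]);
            if (a > lanes[lane]) {
                lanes[lane] = a;  // models _mm256_max_ps for this lane
            }
        }
    }

    // Horizontal reduction: collapse the 8 lane maxima into one value.
    float value = 0.0f;
    for (std::size_t lane = 0; lane < 8; ++lane) {
        if (lanes[lane] > value) {
            value = lanes[lane];
        }
    }

    // Scalar tail for the elements that did not fill a group of 8.
    for (; i < n; ++i) {
        float a = absViaSignMask(data[i]);
        if (a > value) {
            value = a;
        }
    }
    return value;
}
```

For the 18-element sample array used earlier, maxAbsLanewise(a.data(), a.size()) returns 15, the same norm the scalar version finds. The point of the model is that no single lane knows the overall maximum until the reduction step runs.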

Normalization:

  • _mm256_set1_ps sets all elements in a register to a single value (the computed L-infinity norm).
  • _mm256_div_ps divides all 8 elements in the vector by the L-infinity norm in parallel.
  • _mm256_storeu_ps stores the normalized values back to the array.
  • Any remaining elements are handled using a scalar loop.
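
One way to sanity-check either implementation is to test a property of the result instead of comparing individual values: after normalization every element should lie in [-1, 1], and for a non-zero input the largest absolute value should be exactly 1. The sketch below is not part of the original code, and isLInfinityNormalized is a hypothetical name chosen for this example.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical checker: returns true if the largest absolute value in the
// array is exactly 1 and no element falls outside [-1, 1].
bool isLInfinityNormalized(const float *data, std::size_t n) {
    float maxAbs = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float a = std::fabs(data[i]);
        if (!(a <= 1.0f)) {
            return false;  // out of range; this form also rejects NaN
        }
        if (a > maxAbs) {
            maxAbs = a;
        }
    }
    return maxAbs == 1.0f;
}
```

The exact comparison with 1.0f is intentional: in IEEE 754 arithmetic, dividing a finite non-zero value by itself gives exactly 1, so the element that held the maximum ends up as 1 or -1. An empty or all-zero array returns false, since it has no element of magnitude 1.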
