Normalize Array Elements by L1 Norm using C++ SIMD

Normalization is a common technique in data processing, particularly in machine learning and statistical applications. One popular method is normalization by the L1 norm, which divides every element of an array by the sum of the absolute values of all elements, so that the absolute values of the result sum to 1. Modern CPUs provide SIMD (single instruction, multiple data) instructions that process several elements at once, which can speed up this computation significantly.

Here's the basic implementation:

#include <iostream>
#include <vector>
#include <cmath>

void normalizeL1(float *data, const size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += std::fabs(data[i]);
    }
    for (size_t i = 0; i < n; ++i) {
        data[i] /= sum;
    }
}

int main() {
    std::vector<float> a = {
        -1.5, 0, 1, -1.5, 2, -2.5, 3, -3.5, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13,
    };

    normalizeL1(a.data(), a.size());
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

The scalar version is straightforward and takes two passes over the array: the first sums the absolute values of the elements, and the second divides each element by that sum. For the sample input the absolute values sum to 100, so the program prints:

-0.015 0 0.01 ... -0.11 0.12 -0.13
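A quick way to sanity-check a normalized array is to confirm that its absolute values sum to 1, allowing for floating-point rounding. The following is a minimal sketch of such a check; the helper names l1Norm and isL1Normalized and the default tolerance are assumptions made for illustration and are not part of the original code.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not from the article): returns the L1 norm,
// i.e. the sum of the absolute values of all elements.
float l1Norm(const std::vector<float> &v) {
    float sum = 0.0f;
    for (float x : v) {
        sum += std::fabs(x);
    }
    return sum;
}

// Hypothetical helper: true when the L1 norm is within `tol` of 1.
// The tolerance is an assumed value; accumulated rounding error grows
// with the array length, so larger arrays may need a looser bound.
bool isL1Normalized(const std::vector<float> &v, float tol = 1e-5f) {
    return std::fabs(l1Norm(v) - 1.0f) <= tol;
}
```

After calling normalizeL1(a.data(), a.size()) on the sample vector, isL1Normalized(a) would be expected to return true, up to rounding error.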

For larger datasets, SIMD can significantly accelerate the normalization process by processing multiple elements simultaneously.

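Here's the optimized implementation using AVX2 (the 256-bit floating-point intrinsics used here are available from AVX onward, so any AVX2-capable CPU supports them):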

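#include <immintrin.h>
#include <cmath>    // std::fabs
#include <cstddef>  // size_t

void normalizeL1(float *data, const size_t n) {
    // -0.0f has only the sign bit set, so it serves as a sign-bit mask.
    __m256 signMask = _mm256_set1_ps(-0.0f);
    __m256 vsum = _mm256_setzero_ps();

    // Pass 1: accumulate |x| in 8 lanes, 8 floats per iteration.
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_andnot_ps(signMask, vdata);  // clear sign bit: |x|
        vsum = _mm256_add_ps(vsum, vdata);
    }

    // Horizontal reduction: fold the 8 partial sums into one scalar.
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);

    bottom = _mm_add_ps(bottom, top);
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    float sum = _mm_cvtss_f32(bottom);
    // Scalar tail: add the remaining n % 8 elements.
    for (; i < n; ++i) {
        sum += std::fabs(data[i]);
    }

    // Pass 2: divide every element by the sum, 8 floats per iteration.
    vsum = _mm256_set1_ps(sum);

    i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_div_ps(vdata, vsum);
        _mm256_storeu_ps(&data[i], vdata);
    }

    // Scalar tail for the remaining elements.
    for (; i < n; ++i) {
        data[i] /= sum;
    }
}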

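Key points of the AVX2 implementation:

  • _mm256_loadu_ps loads 8 float values from the array; no particular alignment is required.
  • _mm256_andnot_ps computes (NOT signMask) AND vdata, which clears the sign bit of each element and so yields its absolute value.
  • _mm256_add_ps accumulates the absolute values into 8 partial sums.
  • _mm256_extractf128_ps, _mm_add_ps and _mm_hadd_ps reduce the 8 partial sums to a single scalar sum.
  • _mm256_div_ps performs element-wise division to normalize the array.
  • _mm256_storeu_ps stores the normalized values back to the array.
  • The scalar loops after each vectorized loop handle the leftover elements when n is not a multiple of 8.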
