Normalize Array Elements by L1 Norm using C++ SIMD

October 24, 2024
C++

Normalization is a common technique in data processing, particularly for machine learning and statistical applications. One popular method is normalization by the L1 norm, which divides every element of an array by the sum of the absolute values of all elements, so that the absolute values of the result sum to 1. Modern CPUs provide SIMD (single instruction, multiple data) instructions that operate on several array elements at once, which can speed up this computation considerably.
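As a short worked example of the definition (the four-element vector here is chosen purely for illustration and does not come from the code in this article):

```latex
\[
  \|x\|_1 = \sum_{j=1}^{n} |x_j|, \qquad
  x_i' = \frac{x_i}{\|x\|_1}
\]
\[
  x = (1,\,-2,\,3,\,-4) \;\Rightarrow\; \|x\|_1 = 1 + 2 + 3 + 4 = 10
  \;\Rightarrow\; x' = (0.1,\,-0.2,\,0.3,\,-0.4)
\]
```

The signs of the elements are preserved; only their magnitudes are rescaled.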

Here's the basic implementation:

#include <iostream>
#include <vector>
#include <cmath>

void normalizeL1(float *data, const size_t n) {
    float sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += std::fabs(data[i]);
    }
    for (size_t i = 0; i < n; ++i) {
        data[i] /= sum;
    }
}

int main() {
    std::vector<float> a = {
        -1.5, 0, 1, -1.5, 2, -2.5, 3, -3.5, 4, -5, 6, -7, 8, -9, 10, -11, 12, -13,
    };

    normalizeL1(a.data(), a.size());
    for (auto value: a) {
        std::cout << value << " ";
    }

    return 0;
}

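The scalar version is straightforward and makes two passes over the array: the first sums the absolute values of the elements, and the second divides each element by that sum. The input must contain at least one nonzero element; otherwise the sum is zero and the division produces NaN values. In this example the absolute values sum to 100, so the output is: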

-0.015 0 0.01 ... -0.11 0.12 -0.13
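A quick way to sanity-check a normalization routine is to confirm that the absolute values of the result sum to 1 within a small tolerance. The sketch below is one possible check, not part of the original code: the helper names `l1Norm`, `normalizeL1Scalar`, and `isL1Normalized`, as well as the tolerance of `1e-5f`, are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: sum of absolute values (the L1 norm).
float l1Norm(const float *data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::fabs(data[i]);
    }
    return sum;
}

// Same two-pass scalar algorithm as above, renamed so it can coexist
// with other versions of normalizeL1.
void normalizeL1Scalar(float *data, std::size_t n) {
    const float sum = l1Norm(data, n);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] /= sum;
    }
}

// Hypothetical check: true if the L1 norm is 1 within the given tolerance.
// The default tolerance is an assumption; larger arrays may need a looser one.
bool isL1Normalized(const std::vector<float> &v, float tol = 1e-5f) {
    return std::fabs(l1Norm(v.data(), v.size()) - 1.0f) <= tol;
}
```

Calling `isL1Normalized(a)` after `normalizeL1(a.data(), a.size())` in `main` should return true for the example array, since floating-point rounding keeps the sum very close to, but not necessarily exactly, 1.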

For larger datasets, SIMD can significantly accelerate the normalization process by processing multiple elements simultaneously.

Here's the optimized implementation using AVX2:

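#include <immintrin.h>
#include <cmath>    // std::fabs
#include <cstddef>  // size_t

// Note: the floating-point intrinsics used here require only AVX (plus SSE3
// for _mm_hadd_ps), so the code runs on any AVX2-capable CPU.
// Build with, for example, -mavx2 on GCC or Clang.
void normalizeL1(float *data, const size_t n) {
    // -0.0f has only the sign bit set, so it serves as a sign-bit mask.
    __m256 signMask = _mm256_set1_ps(-0.0f);
    __m256 vsum = _mm256_setzero_ps();

    // Pass 1: accumulate absolute values, 8 floats per iteration.
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        // andnot computes (~signMask) & vdata, which clears the sign bit.
        vdata = _mm256_andnot_ps(signMask, vdata);
        vsum = _mm256_add_ps(vsum, vdata);
    }

    // Horizontal reduction: add the upper 128-bit half to the lower half...
    __m128 bottom = _mm256_castps256_ps128(vsum);
    __m128 top = _mm256_extractf128_ps(vsum, 1);

    bottom = _mm_add_ps(bottom, top);
    // ...then two horizontal adds leave the total in every lane.
    bottom = _mm_hadd_ps(bottom, bottom);
    bottom = _mm_hadd_ps(bottom, bottom);

    // Extract the total and add the tail elements that did not fill a vector.
    float sum = _mm_cvtss_f32(bottom);
    for (; i < n; ++i) {
        sum += std::fabs(data[i]);
    }

    // Broadcast the final sum to all 8 lanes for the division pass.
    vsum = _mm256_set1_ps(sum);

    // Pass 2: divide 8 floats per iteration by the sum.
    i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vdata = _mm256_loadu_ps(&data[i]);
        vdata = _mm256_div_ps(vdata, vsum);
        _mm256_storeu_ps(&data[i], vdata);
    }

    // Divide the remaining tail elements one at a time.
    for (; i < n; ++i) {
        data[i] /= sum;
    }
}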
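The absolute-value step in the first loop relies on a bit trick: `-0.0f` has only the sign bit set, and `_mm256_andnot_ps(signMask, vdata)` computes `~signMask & vdata`, which clears that bit in every lane. The following scalar sketch mirrors the same operation on a single value so it can be tried without SIMD support. The helper name `absViaSignMask` is hypothetical, and the code assumes 32-bit IEEE 754 floats.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of _mm256_andnot_ps(signMask, vdata) for one lane.
// Assumes float is a 32-bit IEEE 754 value.
float absViaSignMask(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes 32-bit floats");

    const float negZero = -0.0f;
    std::uint32_t mask = 0;
    std::uint32_t bits = 0;
    std::memcpy(&mask, &negZero, sizeof mask);  // only the sign bit is set
    std::memcpy(&bits, &x, sizeof bits);

    bits = ~mask & bits;  // same operation as andnot: clear the sign bit

    float result = 0.0f;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

For example, `absViaSignMask(-3.5f)` returns `3.5f`, and positive inputs pass through unchanged. Using `std::memcpy` rather than a pointer cast keeps the reinterpretation well defined in standard C++.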

Key points of the AVX2 implementation: