Compare Strings using grapheme_levenshtein in PHP 8.5

PHP includes a built-in levenshtein function that calculates the Levenshtein distance between two strings. This distance is the minimum number of single-character insertions, deletions, or replacements needed to transform one string into the other. It is useful for string comparison, such as spotting typos or finding similar words. However, the levenshtein function works byte by byte, so for strings in a multibyte encoding like UTF-8 its results can be misleading: a single character stored as several bytes counts as several edits.

<?php

echo levenshtein('cafe', 'cafe').PHP_EOL; // 0
echo levenshtein('cafe', 'cafa').PHP_EOL; // 1
echo levenshtein('cafe', 'café').PHP_EOL; // 2 (Why not 1?)
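The byte-level view explains the last result. The short sketch below is an illustration rather than part of the original example: it assumes the mbstring extension is available for mb_strlen and writes 'é' as the precomposed code point U+00E9. In UTF-8 that one character occupies two bytes, so levenshtein sees one replaced byte plus one inserted byte.

```php
<?php

// 'café' written with an explicit precomposed é (U+00E9).
$word = "caf\u{00E9}";

echo strlen($word).PHP_EOL;             // 5 (bytes)
echo mb_strlen($word, 'UTF-8').PHP_EOL; // 4 (code points; needs mbstring)
echo bin2hex("\u{00E9}").PHP_EOL;       // c3a9 (the two bytes of é in UTF-8)
```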

Since PHP 8.5, the Intl extension includes the grapheme_levenshtein function, which calculates the Levenshtein distance between strings with proper UTF-8 support. Unlike levenshtein, it compares strings by graphemes rather than raw bytes, so multibyte characters are counted correctly. A grapheme is what users perceive as a single character, even if it is made up of multiple code points, for example the letter e followed by a combining acute accent (U+0301).

<?php

echo grapheme_levenshtein('cafe', 'cafe').PHP_EOL; // 0
echo grapheme_levenshtein('cafe', 'cafa').PHP_EOL; // 1
echo grapheme_levenshtein('cafe', 'café').PHP_EOL; // 1
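For the typo-spotting use case, a grapheme-aware distance can drive a simple "closest word" lookup. The following is a minimal sketch, not a definitive implementation: codepoint_levenshtein, text_distance, and closest_word are hypothetical helper names introduced here, and the code assumes the mbstring extension is loaded. It calls grapheme_levenshtein when it exists (PHP 8.5 with Intl) and otherwise falls back to a plain Levenshtein over code points, which handles precomposed characters like 'é' but does not merge combining sequences into graphemes.

```php
<?php

// Hypothetical fallback: Levenshtein distance over Unicode code points.
// It does NOT group combining sequences into graphemes.
function codepoint_levenshtein(string $a, string $b): int
{
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $n = count($t);

    // $prev holds the distances for the previous row of the edit matrix.
    $prev = range(0, $n);
    foreach ($s as $i => $sc) {
        $curr = [$i + 1];
        foreach ($t as $j => $tc) {
            $cost = ($sc === $tc) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1, // deletion
                $curr[$j] + 1,     // insertion
                $prev[$j] + $cost  // replacement or match
            );
        }
        $prev = $curr;
    }

    return $prev[$n];
}

// Hypothetical wrapper: prefer grapheme_levenshtein when it is available.
function text_distance(string $a, string $b): int
{
    if (function_exists('grapheme_levenshtein')) {
        $d = grapheme_levenshtein($a, $b);
        if ($d !== false) { // defensive check, in case the call reports failure
            return $d;
        }
    }

    return codepoint_levenshtein($a, $b);
}

// Hypothetical helper: return the candidate with the smallest distance.
function closest_word(string $input, array $candidates): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($candidates as $candidate) {
        $d = text_distance($input, $candidate);
        if ($d < $bestDistance) {
            $best = $candidate;
            $bestDistance = $d;
        }
    }

    return $best;
}

echo closest_word("caff\u{00E9}", ['cafe', "caf\u{00E9}", 'coffee']).PHP_EOL;
```

With the code-point fallback, the distances from 'cafFé'-style input "caffé" to the three candidates are 2, 1, and 3, so the lookup returns 'café'. On ties, this sketch keeps the first candidate it saw.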
