Discussion on: Converting UTF-8 strings to ASCII using the ICU Transliterator

Do you use Laravel? How does the Transliterator compare to str_slug in performance, and in the strings it produces? Thanks!

Here's the test, 10,000 iterations over 2 strings:

$string1 = '<?php François😎: _+ / Стравинский`😜.';
$string2 = 'Daniël Renée François Bjørn in’t Veld';

$time = microtime(true);

for ($i = 0; $i < 10000; $i++) {
    slugify($string1);
    slugify($string2);
}

echo 'slugify: '.round(microtime(true) - $time, 3).' seconds - '.slugify($string1).' - '.slugify($string2)."\n";

$time = microtime(true);

for ($i = 0; $i < 10000; $i++) {
    str_slug($string1);
    str_slug($string2);
}

echo 'str_slug: '.round(microtime(true) - $time, 3).' seconds - '.str_slug($string1).' - '.str_slug($string2)."\n";

And results:

slugify: 12.817 seconds - php-francois-stravinskij - daniel-renee-francois-bjorn-int-veld
str_slug: 0.151 seconds - php-francois-stravinskii - daniel-renee-francois-bjorn-int-veld

Laravel's str_slug function performs much better, but the results are not the same.

Bart van Raaij • Oct 17 '20

That’s a great question Lito — which you’ve answered yourself :-)
Because the PHP Transliterator is a wrapper for the native ICU lib in C, I’m not surprised it performs a lot worse than Laravel’s native PHP str_slug.

I’ll take a look at Laravel’s implementation tomorrow. 👍🏻 Very curious how they do it.

Lito • Oct 17 '20

For me, anything related to performance is always a MUST. I work with a lot of data and I always need an efficient solution for every problem :)

Bart van Raaij • Oct 18 '20 • Edited

I've taken a look at Laravel's str_slug. It uses voku\helper\ASCII::to_ascii under the hood.
That lib and function use a quite clever in-memory cache at runtime, in which every character is cached in an array:
github.com/voku/portable-ascii/blo...
Subsequent transforms are much faster because characters that have already been seen don't need to be transformed again.
This is of course highly beneficial to performance.
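
To make the caching idea concrete, here is a minimal sketch of per-character memoization. It is not voku's actual code: the class name CachedAsciiFolder, its fold() method, the misses counter and the toy $toyConvert callback are all made up for illustration, and the callback stands in for whatever expensive conversion you would really use.

```php
<?php
declare(strict_types=1);

// Hypothetical illustration of a per-character cache; not taken from
// voku/portable-ascii or Laravel.
final class CachedAsciiFolder
{
    /** @var array<string, string> character => cached replacement */
    private array $cache = [];

    /** @var callable(string): string the (assumed expensive) converter */
    private $convert;

    /** Number of times the converter actually had to run. */
    public int $misses = 0;

    public function __construct(callable $convert)
    {
        $this->convert = $convert;
    }

    public function fold(string $text): string
    {
        // With the "u" flag the empty pattern splits on UTF-8 characters
        // instead of bytes.
        $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false) {
            return ''; // input was not valid UTF-8
        }

        $out = '';
        foreach ($chars as $char) {
            if (!isset($this->cache[$char])) {
                // First time we see this character: convert and remember it.
                $this->cache[$char] = ($this->convert)($char);
                $this->misses++;
            }
            $out .= $this->cache[$char];
        }

        return $out;
    }
}

// Toy converter with a four-entry lookup table, purely for the demo.
$toyConvert = fn (string $c): string => strtr($c, [
    'é' => 'e', 'ë' => 'e', 'ç' => 'c', 'ø' => 'o',
]);

$folder = new CachedAsciiFolder($toyConvert);
echo $folder->fold('Renée'), "\n"; // converter runs once per distinct character
echo $folder->fold('Renée'), "\n"; // served entirely from the cache
echo $folder->misses, "\n";
```

In a 10,000-iteration loop over the same two strings, a cache like this means the converter only does real work during the first pass, which is at least consistent with the gap in the timings above.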

The output difference between my slugify() and voku's to_ascii is explained by the fact that the latter takes a locale into account (English by default).

That being said: my "bonus tip" slugify example was never meant to be production code. It's just another example of what the ICU Transliterator can do. Of course there are other libs out there that do the same kind of thing, and they may well be better or faster at it because a lot of development goes into them.
I hope you liked my article anyway, even if it's not directly usable for you. 🤞🏻

Lito • Oct 18 '20

Oh! caches 😅

Your article is great! And it's perfect, as the title says, for understanding how UTF-8 to ASCII conversion works.