DEV Community

Konstantin Läufer

Unicode string length can mean different things in different languages

I was working on a text processing example in several programming languages, including C++, Java, Rust, and Scala, and noticed that the reported string lengths did not agree.

It turned out that these are due to Unicode string length meaning different things in different languages:

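  • In Java, Scala, etc., the length() method returns the number of UTF-16 code units. For text made of precomposed characters in the Basic Multilingual Plane, this matches the number of characters a human reader sees, but characters outside that plane (such as most emoji) count as two, and each combining mark counts separately.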

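  • By contrast, in C++, Go, and Rust, the equivalent functions and methods return the number of bytes required to store the string (its UTF-8 encoding, in the case of Go and Rust). Counting code points takes an explicit step, such as chars().count() in Rust or a conversion to []rune in Go.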

jshell> "résumé".length()
$1 ==> 6

evcxr
Welcome to evcxr. For help, type :help
>> "résumé".len()
8
>> "résumé".chars().count()
6

len([]rune("résumé")) // returns 6
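
To make the differences concrete, here is a minimal Rust sketch that measures the same text in three units: UTF-8 bytes, Unicode scalar values, and UTF-16 code units (the unit that Java's String.length() is documented to count). The helper names utf8_bytes, scalar_values, and utf16_units are hypothetical and chosen only for illustration. The sample strings use escapes so that the exact code points do not depend on how an editor normalizes "é".

```rust
/// Number of bytes in the UTF-8 encoding (what `str::len` returns).
fn utf8_bytes(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values (what `chars().count()` returns).
fn scalar_values(s: &str) -> usize {
    s.chars().count()
}

/// Number of UTF-16 code units, the unit counted by Java's `String.length()`.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    // Hypothetical sample inputs, written with escapes to pin down the code points.
    let samples = [
        ("precomposed", "r\u{e9}sum\u{e9}"),    // é as the single code point U+00E9
        ("decomposed", "re\u{301}sume\u{301}"), // e followed by combining acute U+0301
        ("emoji", "\u{1f600}"),                 // one code point outside the BMP
    ];
    for (label, s) in samples {
        println!(
            "{}: bytes={} scalars={} utf16={}",
            label,
            utf8_bytes(s),
            scalar_values(s),
            utf16_units(s)
        );
    }
}
```

On these inputs the counts are 8/6/6 for the precomposed form, 10/8/8 for the decomposed form, and 4/1/2 for the emoji. None of the three measures counts user-perceived characters (grapheme clusters); the Rust standard library does not provide that, so it would need a text segmentation library.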

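Apparently it's a bit more complicated in C++, where std::string::size() counts char elements (bytes, for UTF-8 text) and std::string offers no member function for counting code points.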
