Philippe Arteau

Posted on • Edited on • Originally published at gosecure.net


Weird Unicode Behaviors

So you know about Unicode code points and encodings (UTF-8, UTF-16), but are you aware that a few standard conversions have surprising outcomes? In this short post, I'll list some of the most surprising behaviors.

Case Mapping

Case mapping is the behavior behind the uppercase and lowercase functions in your favorite language. Unexpected behaviors can sometimes lead to bugs, some of them affecting software security.

While the strings “go\u017Fecure” and “gosecure” are not equal, code that applies an uppercase transformation to both strings before comparing them could mistakenly treat them as equal.

Here is a demonstration in Python.

>>> "GO\u017FECURE" == "GOSECURE"
False
>>> "GO\u017FECURE".upper() == "GOSECURE"
True


The same behavior applies to Java (shown here in JShell):

jshell> "ADM\u0131N".toUpperCase().equals("ADMIN")
$1 ==> true

This behavior occurs because the Unicode specification maps the characters ı (U+0131) and ſ (U+017F) to the ASCII characters I and S when they are uppercased. Aside from a few exceptions, you can assume that your language applies these transformations by default.
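To see which characters are affected in your own runtime, you can brute-force the code point range. The following is a minimal sketch, not an authoritative list: the function name `ascii_case_collisions` is my own, and the results depend on the Unicode version bundled with your Python interpreter. Besides ı and ſ, it also reports the Kelvin sign K (U+212A), which lowercases to `k`, and multi-character expansions such as ß to `SS`.

```python
import sys


def ascii_case_collisions(start=0x80, stop=sys.maxunicode + 1):
    """Find non-ASCII code points whose upper() or lower() result is pure ASCII.

    Returns a list of (code_point_label, operation, ascii_result) tuples.
    """
    hits = []
    for cp in range(start, stop):
        ch = chr(cp)
        for name, mapped in (("upper", ch.upper()), ("lower", ch.lower())):
            # The input is non-ASCII, so an all-ASCII result means the case
            # mapping silently moved the character into the ASCII range.
            if mapped.isascii():
                hits.append((f"U+{cp:04X}", name, mapped))
    return hits


if __name__ == "__main__":
    for code_point, operation, result in ascii_case_collisions():
        print(code_point, operation, repr(result))
```

The full scan takes a second or so; pass a smaller `stop` (for example `0x10000`) if you only care about the Basic Multilingual Plane.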

Normalization

The purpose of normalization is to convert strings that have the same or an equivalent “meaning” into a single form so that they can be matched against each other.

Here is a demonstration using the normalization function in Ruby. Five Unicode characters become six characters, all of them ASCII.

irb(main):003:0> "\u216E\u32CE\uFF0E\u209C\u2134".unicode_normalize(:nfkc)
=> "DeV.to"
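For comparison, here is a rough Python equivalent using the standard `unicodedata` module. The helper names `nfkc_report` and `explain` are hypothetical and only serve the illustration; the second one lists which individual characters are rewritten, which can help when auditing a suspicious input.

```python
import unicodedata


def nfkc_report(text):
    """Return (normalized_text, changed) for a string under NFKC."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized, normalized != text


def explain(text):
    """List each character that NFKC rewrites when normalized on its own.

    Character-by-character normalization is a simplification: combining
    sequences can behave differently when the whole string is normalized.
    """
    rows = []
    for ch in text:
        mapped = unicodedata.normalize("NFKC", ch)
        if mapped != ch:
            name = unicodedata.name(ch, "<unnamed>")
            rows.append((f"U+{ord(ch):04X}", name, mapped))
    return rows


if __name__ == "__main__":
    sample = "\u216E\u32CE\uFF0E\u209C\u2134"
    print(nfkc_report(sample))  # ('DeV.to', True)
    for row in explain(sample):
        print(row)
```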

APIs can sometimes hide those transformations. For example, in C#, the Uri class normalizes the hostname of the URI it is given.

> Console.Write(new Uri("https://faceboo\u212A.com").Host == "facebook.com");
True
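A similar hidden transformation can be reproduced in Python. This is a sketch under one assumption: it relies on the standard library's built-in `idna` codec, which applies case folding and NFKC normalization to each hostname label before encoding it. The wrapper name `ascii_hostname` is hypothetical.

```python
def ascii_hostname(host: str) -> str:
    """Encode a hostname with the built-in "idna" codec and return it as text.

    The codec normalizes each label first, so look-alike characters can be
    folded into plain ASCII without the caller asking for it.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    spoofed = "faceboo\u212A.com"  # ends the first label with the Kelvin sign
    print(spoofed == "facebook.com")                  # False
    print(ascii_hostname(spoofed) == "facebook.com")  # True
```

Any code that validates a hostname before this kind of encoding step and trusts the result after it is checking a different string than the one that gets used.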

Here is a list of APIs with such behavior (see Section Auditing Source Code).

Testing your application

If you have an application that is susceptible to issues related to Unicode, you can use this simple cheat sheet.
This cheat sheet can be used by developers to build regression test cases to make sure no characters are being misinterpreted.
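As a starting point for such regression tests, here is a minimal sketch built only from the examples in this post. The table `TRICKY_INPUTS` and the helper `collides` are hypothetical names; in a real test suite you would feed each tricky input to your own validation or comparison code and assert that it is rejected.

```python
import unicodedata

# Non-ASCII inputs from this post, paired with the ASCII string they turn into.
TRICKY_INPUTS = {
    "GO\u017FECURE": "GOSECURE",                       # long s, uppercased
    "ADM\u0131N": "ADMIN",                             # dotless i, uppercased
    "faceboo\u212A.com": "facebook.com",               # Kelvin sign, lowercased
    "\u216E\u32CE\uFF0E\u209C\u2134": "DeV.to",        # NFKC normalization
}


def collides(raw: str, target: str) -> bool:
    """True if raw differs from target but equals it after a common transformation."""
    if raw == target:
        return False
    nfkc = unicodedata.normalize("NFKC", raw)
    variants = {raw.upper(), raw.lower(), nfkc, nfkc.upper(), nfkc.lower()}
    return target in variants


if __name__ == "__main__":
    for raw, target in TRICKY_INPUTS.items():
        # Every entry should print True; replace this check with a call into
        # the application code you want to protect.
        print(repr(raw), "->", target, collides(raw, target))
```

Note that `collides` also returns True for ordinary case differences such as "gosecure" versus "GOSECURE", so it is only meaningful for inputs like the ones in the table.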

Unicode Cheat sheet


If you are curious, you can read the full article with a security focus.
