Mohamed Hamdy

Posted on Mar 19

🧩 How hidden unicode characters can break your system

#unicode #validation #sanitization

Introduction

While working on a feature that sends notifications to phone numbers, I recently faced one of the trickiest bugs I’ve encountered. Everything looked correct: the phone number was valid and the logs seemed fine, yet the system kept rejecting the request.

After digging deeper, the culprit turned out to be something almost invisible: a hidden Unicode character inside the phone number.

In this article, I’ll explain:

What Unicode characters are
What hidden Unicode characters are
How they can silently break your systems
The real bug I faced
How I solved it

What Are Unicode Characters?

Computers represent text using numeric codes. Early systems used ASCII, which uses 7 bits per character and therefore supports only 128 characters: mainly English letters, digits, and basic symbols.

To support all languages, symbols and emojis, the industry adopted the Unicode standard.

Unicode assigns a unique code point to every character.

Examples:

Character	Unicode Code Point
A	U+0041
5	U+0035
ع	U+0639
😊	U+1F60A

This allows computers to represent text from virtually every language consistently.
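To see these code points for yourself, here is a minimal sketch in PHP. The helper name `codePoint` is my own invention for illustration, and the snippet assumes the mbstring extension is installed (`mb_ord` is available since PHP 7.2).

```php
<?php
// Hypothetical helper: formats the code point of a single character as U+XXXX.
// Assumes the mbstring extension is loaded and the input is valid UTF-8.
function codePoint(string $char): string
{
    return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

foreach (['A', '5', 'ع', '😊'] as $char) {
    echo $char, ' => ', codePoint($char), PHP_EOL;
}
```

The loop prints the same code points listed in the table above.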

Hidden Unicode Characters

Some Unicode characters are not visible when rendered, but they still exist inside the string.

These characters are often used for:

text direction control
formatting
invisible separators

Examples include:

Character	Unicode Code Point	Description
LRM	U+200E	Left-to-Right Mark
RLM	U+200F	Right-to-Left Mark
ZWSP	U+200B	Zero Width Space

These characters can end up in a string through:

copying text from messaging apps
copying text from PDFs
input from multilingual keyboards
some mobile devices

The problem is that they are invisible, but they still affect string processing.
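A small sketch of what "affect string processing" means in practice. The two variables below are made up for illustration; the second one starts with a Zero Width Space written with PHP's `\u{...}` escape (PHP 7.0+), and `mb_strlen` assumes the mbstring extension.

```php
<?php
// Two strings that render identically on screen.
$visible = '12345';
$hidden  = "\u{200B}12345"; // starts with a Zero Width Space (U+200B)

var_dump($visible === $hidden);         // bool(false): the strings are not equal
var_dump(strlen($visible));             // int(5)
var_dump(strlen($hidden));              // int(8): U+200B takes 3 bytes in UTF-8
var_dump(mb_strlen($hidden, 'UTF-8'));  // int(6): one extra character
```

Comparisons, length checks, and lookups all see the extra character even though a human reader does not.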

The Bug I Faced

I was sending a request to an external service with a phone number like this: "+2‭12345678"

Everything looked normal, but the API kept rejecting the number.

When I logged the request payload, something was off, even though at first glance the number seemed valid.

But there was actually a hidden Unicode character before the digits.

The real string looked like this internally:

\u202D123456789

U+202D is a Left-to-Right Override (LRO) character.

It doesn't display visually, but it changes how text direction is interpreted.

This character was likely introduced when the phone number was copied from a contact list or from a chat message containing Arabic text.

Why This Breaks Systems

Most phone number validations expect digits only.

For example:

/^[0-9]+$/

When a hidden Unicode character exists in the string:

\u202D123456789

The regex fails because the string does not actually start with a digit.

So even though the number looks correct, it fails validation.
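Here is a minimal reproduction of that failure, using the same regex as above. The variable names are placeholders, and the "dirty" value is built with the `\u{202D}` escape so the hidden character is visible in the source.

```php
<?php
$pattern = '/^[0-9]+$/';

$clean = '123456789';
$dirty = "\u{202D}123456789"; // same digits, prefixed with U+202D (LRO)

$cleanMatches = preg_match($pattern, $clean);
$dirtyMatches = preg_match($pattern, $dirty);

var_dump($cleanMatches); // int(1): passes validation
var_dump($dirtyMatches); // int(0): rejected, the first bytes are not a digit
```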

How I Debugged It

To detect the issue, I inspected the string at the byte level.

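In PHP, var_dump prints the string together with its byte length, so a length larger than the number of visible characters gives the hidden character away: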

var_dump($phoneNumber);

To be sure, you can also paste the string into any online Unicode parser. These are the two strings I compared:

+2‭12345678 => the one with hidden LTR unicode
+212345678 => the normal number

This revealed the unexpected Unicode character before the digits.
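If you prefer to stay inside PHP, something like the following sketch gives the same information without an external tool. `listCodePoints` is a hypothetical helper name, the sample value is shortened for readability, and the code assumes the mbstring extension.

```php
<?php
// Hypothetical helper: returns the code point of every character in a UTF-8 string.
function listCodePoints(string $text): array
{
    // The /u modifier makes preg_split cut between characters, not bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // the input was not valid UTF-8
    }

    $codePoints = [];
    foreach ($chars as $char) {
        $codePoints[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $codePoints;
}

$phoneNumber = "\u{202D}123"; // shortened sample with a leading LRO

echo bin2hex($phoneNumber), PHP_EOL;                       // e280ad313233
echo implode(' ', listCodePoints($phoneNumber)), PHP_EOL;  // U+202D U+0031 U+0032 U+0033
```

The `e280ad` prefix in the hex dump is the UTF-8 encoding of U+202D, which is exactly the part that never shows up on screen.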

The Solution

The fix was to sanitize the input and remove non-digit characters.

Example:

$phoneNumber = preg_replace('/\D/', '', $phoneNumber);

This removes everything except digits. Note that it also strips a leading +, so add the prefix back if the receiving API expects it.

Another safe approach is to explicitly remove invisible Unicode characters:

$phoneNumber = preg_replace('/[\x{200B}-\x{200F}\x{202A}-\x{202E}]/u', '', $phoneNumber);

Whichever approach you pick, normalize the phone number before validating it.
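Putting the pieces together, a normalize-then-validate step could look roughly like this. It is only a sketch: `normalizePhoneNumber` is a hypothetical name, the 6 to 15 digit range is an assumption you should adapt to your provider's rules, and instead of the explicit ranges shown earlier it uses `\p{Cf}` (the Unicode "format" category), which is a broader match that includes LRM, RLM, ZWSP and the U+202A to U+202E controls.

```php
<?php
// Hypothetical helper: strips invisible format characters, then validates.
// Returns the cleaned number, or null when the input is not a plausible phone number.
function normalizePhoneNumber(string $raw): ?string
{
    // Remove every character in Unicode category Cf (format characters).
    $clean = preg_replace('/\p{Cf}+/u', '', $raw);
    if ($clean === null) {
        return null; // preg_replace failed, e.g. the input was not valid UTF-8
    }

    $clean = trim($clean);

    // Assumption: an optional leading + followed by 6 to 15 ASCII digits.
    if (preg_match('/^\+?[0-9]{6,15}$/', $clean) !== 1) {
        return null;
    }
    return $clean;
}

var_dump(normalizePhoneNumber("+2\u{202D}12345678")); // string(10) "+212345678"
var_dump(normalizePhoneNumber('not a number'));       // NULL
```

Unlike the `\D` approach, this keeps the leading + and rejects input that still looks wrong after cleaning, rather than silently reshaping it.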

Key Takeaways

Hidden Unicode characters can silently break systems in ways that are extremely difficult to detect.

Things that can be affected include:

phone number validation
authentication tokens
payment identifiers
database lookups

If your system accepts user input, always consider sanitizing and normalizing text before processing it.

Invisible characters might be hiding in plain sight.
