Yasser Elgammal

Posted on Mar 25

Handling Invisible Characters with PHP

#php #laravel

As developers, we often assume that the data users enter into our forms is exactly what we see. But in reality, inputs can be deceiving. Sometimes a user enters a phone number or ID that looks correct, but your validation fails. Why?

A client was entering a valid number (let’s say 51000000) into a form, but our backend validation kept rejecting it.

No errors in the logic.
The input looked fine.
Manual testing with the same number passed.

Why This Happens

The submitted value contained a character that is not visible on screen. These invisible characters often sneak in when users copy text from:

Messaging apps (e.g., WhatsApp, Slack).
Formatted documents (e.g., Word, rich text editors).
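
One quick way to confirm the suspicion is to compare what you see with the bytes PHP actually received. This is only a sketch: it assumes the pasted value carried a zero-width space (the exact character in the original incident is not stated), and `$typed` and `$pasted` are illustrative names.

```php
<?php
// Hypothetical values: the same digits typed by hand vs. pasted with a hidden U+200B.
$typed  = "51000000";
$pasted = "5\u{200B}1000000";

var_dump($typed === $pasted);   // bool(false): the strings are not equal
var_dump(strlen($typed));       // int(8): eight bytes, one per digit
var_dump(strlen($pasted));      // int(11): U+200B adds three UTF-8 bytes
var_dump(ctype_digit($pasted)); // bool(false): not every byte is a digit
echo bin2hex($pasted), PHP_EOL; // 35e2808b31303030303030
```

The `e2808b` in the hex dump is the UTF-8 encoding of U+200B, which is why a value that looks identical fails a digits-only check.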

Common Types of Invisible Characters

1. Whitespace Characters

Regular Space (U+0020): The standard space character.
Tab (U+0009): Written as \t, adds horizontal spacing.
Newlines:
- Line Feed (U+000A) → \n
- Carriage Return (U+000D) → \r

2. Zero-Width Characters

Zero-Width Space (U+200B): Doesn’t show up visually but still exists in the text.
Zero-Width Non-Joiner (U+200C): Used in some languages (like Arabic or Persian) to prevent character joining without adding space.
Zero-Width Joiner (U+200D): Used to force characters to join without any visible space.

3. Directional Characters

Left-to-Right Mark (LRM) (U+200E): Affects text direction but is invisible.
Right-to-Left Mark (RLM) (U+200F): Same as above but for right-to-left languages.

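4. Format, Separator, and Control Characters

Soft Hyphen (U+00AD): A format character that is invisible in most cases and shows up only where a word is broken across lines.
Non-Breaking Space (U+00A0): A space separator that looks like a regular space but prevents the line from breaking at that position.
Control characters such as BEL (U+0007): Non-printing codes that have no visible glyph.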
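
To see which of the characters above are present in a given value, it can help to list their code points. The following is a minimal sketch, assuming PHP 7.4+ with the mbstring extension; `findHiddenCodePoints` is a hypothetical helper name, not a PHP or Laravel function. Note that it also reports regular spaces, since they are separators too.

```php
<?php
/**
 * List every control, format, or separator character in a UTF-8 string
 * as "U+XXXX" labels. Hypothetical helper for debugging only.
 */
function findHiddenCodePoints(string $value): array
{
    // \p{C} = control/format characters, \p{Z} = space/line/paragraph separators
    if (preg_match_all('/[\p{C}\p{Z}]/u', $value, $matches) === false) {
        return []; // matching failed, e.g. $value is not valid UTF-8
    }

    return array_map(
        fn (string $char): string => sprintf('U+%04X', mb_ord($char, 'UTF-8')),
        $matches[0]
    );
}

// Reports U+200B, U+00A0, U+0009, U+200F in that order
print_r(findHiddenCodePoints("5\u{200B}10\u{00A0}00\t000\u{200F}"));
```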

What Laravel Trims by Default

Laravel's TrimStrings middleware, which is enabled by default, automatically trims whitespace (spaces, tabs, newlines) from the start and end of each request input value. It never touches the middle of a value, so it does not remove invisible Unicode characters embedded between the digits, such as:

Zero-width spaces (\u{200B})
Left-to-right marks
Other hidden characters
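
The effect of edge-only trimming can be sketched in plain PHP. This example uses the built-in `trim()` as a stand-in, which is an assumption for illustration; it is not Laravel's TrimStrings implementation.

```php
<?php
// Stand-in for edge trimming: plain trim(), not Laravel's middleware itself.
$raw     = "  5\u{200B}1000000\n";
$trimmed = trim($raw);

var_dump($trimmed === "5\u{200B}1000000"); // bool(true): outer whitespace is gone
var_dump(ctype_digit($trimmed));           // bool(false): U+200B is still inside
```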

🛠 Solution

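This pattern removes every character in the Unicode "format" category (\p{Cf}), which covers zero-width characters and directional marks:

$number = "5\u{200B}1000000"; // zero-width space hidden after the first digit
$cleanedNumber = preg_replace('/[\p{Cf}]/u', '', $number);
echo $cleanedNumber; // 51000000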

💡 Another Solution

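This pattern removes all control and format characters (\p{C}) plus whitespace (\s), including whitespace in the middle of the value. Use it for fields such as numbers or IDs, where spaces are never meaningful.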

$cleanedNumber = preg_replace('/[\p{C}\s]+/u', '', $number);
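
In practice the cleanup is usually followed by a validity check. Below is a minimal sketch of how the two steps might be combined; `normalizeDigits` is a hypothetical helper name. It differs from the one-liner above by adding `\p{Z}` to the character class, so that removing non-breaking spaces does not depend on how `\s` is interpreted.

```php
<?php
/**
 * Strip invisible characters and whitespace, then confirm only ASCII digits remain.
 * Hypothetical helper; returns null when the input cannot be cleaned into digits.
 */
function normalizeDigits(string $input): ?string
{
    // \p{C}: control and format chars; \p{Z}: all Unicode separators, incl. NBSP
    $cleaned = preg_replace('/[\p{C}\p{Z}\s]+/u', '', $input);

    // preg_replace() returns null on failure, e.g. when $input is not valid UTF-8
    if ($cleaned === null || $cleaned === '') {
        return null;
    }

    // Match ASCII 0-9 explicitly so digits from other scripts are not accepted
    return preg_match('/^[0-9]+$/D', $cleaned) === 1 ? $cleaned : null;
}

var_dump(normalizeDigits("5\u{200B}1000 000\u{00A0}")); // string(8) "51000000"
var_dump(normalizeDigits("51000000abc"));               // NULL
var_dump(normalizeDigits("\xFF51000000"));              // NULL (invalid UTF-8)
```

Returning null instead of a partly cleaned string keeps the caller from silently accepting input that still contains unexpected characters.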

The table below shows exactly what the /[\p{C}\s]+/u pattern removes:

| Type | Unicode Code | Example in `$number` | Notes |
| --- | --- | --- | --- |
| Space | `U+0020` | `5 1000000` ← regular space between digits | Matches `\s` |
| Tab | `U+0009` | `5\t1000000` | Matches `\s` |
| Newline | `U+000A`, `U+000D` | `5\n1000000` or `5\r1000000` | Matches `\s` |
| Zero-Width Space | `U+200B` | `5\u{200B}1000000` ← visually invisible | Matches `\p{Cf}` (part of `\p{C}`) |
| Zero-Width Joiner | `U+200D` | `5\u{200D}1000000` ← visually invisible | Matches `\p{Cf}` (part of `\p{C}`) |
| Right-to-Left Mark (RLM) | `U+200F` | `5\u{200F}1000000` ← invisible RTL marker | Matches `\p{Cf}` (part of `\p{C}`) |
| Soft Hyphen | `U+00AD` | `5\u{00AD}1000000` ← doesn't show normally | Matches `\p{Cf}` (part of `\p{C}`) |
| Non-Breaking Space (NBSP) | `U+00A0` | `5\u{00A0}1000000` ← looks like a space | Matches `\p{Zs}`; with the `u` modifier, PHP's `\s` also matches it |
| Control Character (e.g. BEL) | `U+0007` | `5\u{0007}1000000` ← invisible bell char | Matches `\p{Cc}` (part of `\p{C}`) |

🧾 Summary

Sometimes, what you don't see in the input is exactly what causes the problem.

It’s a subtle but important reminder to always sanitize and normalize user input before trusting it.
