
Muhammad Arslan

Why Your Profanity Filter Fails Against Unicode (And How to Fix It)

Most profanity filters only check raw input.

That’s the problem.

You can block fuck.

But what about:

fu\u0441k (Cyrillic “с” instead of Latin “c”)

ｆｕｃｋ (fullwidth Unicode characters)

f.u.c.k (separator bypass)

Fr33 m0ney (leet-speak)

fuuuuck (character stretching)

They all bypass typical word-list filters.
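To see why, note that a naive word-list check compares raw code points, not visual appearance, so a single look-alike letter defeats it (illustrative snippet, not any particular library):

```typescript
// Naive word-list check: compares code points, not what the user sees.
const banned = ['fuck'];
const input = 'fu\u0441k'; // 'с' is Cyrillic U+0441, visually identical to Latin 'c'

const flagged = banned.some((word) => input.includes(word));
console.log(flagged); // false — the filter never sees a match
```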

The issue isn’t your regex.
It’s the order of operations.

Normalize First. Validate Second.

Before checking profanity or spam, input should be normalized:

  • Unicode NFKC normalization
  • Zero-width character removal
  • Separator stripping
  • Homoglyph mapping
  • Leet-speak normalization
  • Repetition reduction

After normalization, all evasions collapse into a canonical form.
Then your profanity/spam logic actually works.
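The steps above can be sketched in a few lines. This is a minimal illustration, not input-shield's actual implementation — the homoglyph and leet tables here are tiny samples, where a real mapping covers thousands of Unicode confusables:

```typescript
// Illustrative normalize-first pipeline (NOT input-shield's internals).
function normalizeInput(input: string): string {
  // 1. Unicode NFKC: folds fullwidth forms (ｆｕｃｋ → fuck) and other compatibility variants
  let s = input.normalize('NFKC').toLowerCase();

  // 2. Remove zero-width characters (ZWSP, ZWNJ, ZWJ, word joiner, BOM)
  s = s.replace(/[\u200b-\u200d\u2060\ufeff]/g, '');

  // 3. Strip common separators (f.u.c.k → fuck)
  s = s.replace(/[.\-_*+,\s]+/g, '');

  // 4. Map Cyrillic look-alikes to Latin (sample table only)
  const homoglyphs: Record<string, string> = {
    '\u0430': 'a', '\u0435': 'e', '\u043e': 'o', '\u0440': 'p', '\u0441': 'c',
  };
  s = s.replace(/[\u0430\u0435\u043e\u0440\u0441]/g, (ch) => homoglyphs[ch]);

  // 5. Fold leet-speak digits and symbols (sample table only)
  const leet: Record<string, string> = {
    '0': 'o', '1': 'i', '3': 'e', '4': 'a', '5': 's', '7': 't', '@': 'a', '$': 's',
  };
  s = s.replace(/[013457@$]/g, (ch) => leet[ch]);

  // 6. Collapse character runs (fuuuuck → fuck). This is lossy for legitimate
  //    doubles ("assess" → "ases"), so real filters usually match both forms.
  s = s.replace(/(.)\1+/g, '$1');

  return s;
}

console.log(normalizeInput('fu\u0441k'));  // "fuck"
console.log(normalizeInput('f.u.c.k'));   // "fuck"
console.log(normalizeInput('fuuuuck'));   // "fuck"
```

Once every variant collapses to the same canonical string, a plain word-list lookup is enough to catch all of them.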

What I Built

I created @marslanmustafa/input-shield — a zero-dependency TypeScript validation package that:

  • Detects Unicode homoglyph attacks
  • Catches leet-based spam
  • Blocks stretched profanity
  • Detects gibberish (e.g. asdfghjkl)
  • Supports Zod integration
  • Validates HTML email content safely

Example:

import { createValidator } from '@marslanmustafa/input-shield';

const validator = createValidator()
  .field('Message')
  .min(2).max(500)
  .noProfanity()
  .noSpam()
  .noGibberish();

validator.validate('fu\u0441k'); 
// → blocked

Why This Matters

Unicode homoglyph attacks are not edge cases.
They’re easy, invisible, and widely ignored.

If you're validating user input in production, normalization isn’t optional. It’s required.

Links:

GitHub · npm
