TL;DR: A production-ready profanity filter isn't just a list of banned words; it's a pipeline. You start with sanitization to normalize character substitutions, followed by a Trie for efficient prefix matching. To avoid the Scunthorpe problem, you cross-reference matches against an allow-list or use context-aware ML models to score intent, balancing raw speed with semantic accuracy.
Building a content filter seems like a junior-level task until you actually have to deploy it to a live chat or a comment section. If you just run a regex or String.contains() over a list of banned words, you’ll quickly discover that users are incredibly creative at bypassing filters. Whether it's adding periods (b.u.m), substituting lookalike symbols (bµm), or hiding a word inside a valid one (bumpy), a simple search-and-replace won't cut it. You need a multi-stage pipeline that balances performance with accuracy.
How do you handle character substitutions and leetspeak?
Sanitization normalizes the input before it ever hits your matching logic by stripping non-alpha characters and mapping homoglyphs back to their base ASCII equivalents.
Before you run any comparisons, you need a canonical version of the text. This involves two steps: stripping noise (punctuation, whitespace, and special characters) and character mapping. If a user types b.u.m, your sanitizer should collapse that into bum. If they use @ for a or 0 for o, you map those visual lookalikes back to their standard letters.
```javascript
// Conceptual normalization flow: map lookalikes FIRST, then strip noise.
// (If you strip symbols before mapping, '@' is deleted before it can become 'a'.)
const map = { '@': 'a', '0': 'o', '1': 'i', '3': 'e', '5': 's', 'µ': 'u' };
const clean = input
  .toLowerCase()
  .split('')
  .map(c => map[c] || c)   // Map visual lookalikes to base letters
  .join('')
  .replace(/[^a-z]/g, ''); // Then strip punctuation and leftover symbols
```
This "clean" string is what you actually pass to your detection engine. Without this step, your dictionary would need to be millions of permutations long to catch even the simplest evasions.
Why use a Trie instead of a simple Hash Map?
A Trie (prefix tree) gives O(L) lookup, where L is the length of the word being matched, making it significantly more efficient than hashing every substring when scanning a continuous stream of text for banned terms.
In a standard hash map approach, finding every banned word in a 500-character paragraph means generating every possible substring and checking each one against the map — an O(N²) explosion of candidates. With a Trie, you walk the string and the Trie together: starting from each position, you advance through the Trie as long as characters match, and the moment you hit a terminal node you’ve found a match. This is the difference between a filter that lags your app and one that processes thousands of messages per second. It lets you identify not just exact matches, but matches embedded within a larger stream of characters, without re-scanning the string for every entry in your database.
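As a minimal sketch (the class name and structure are illustrative, not a specific library's API), a Trie scan over a sanitized string might look like this. Note that this simple version restarts from the root at each index; a production filter chasing a true single pass would typically use an Aho–Corasick automaton instead:

```javascript
// Minimal Trie: insert banned words, then scan a sanitized string for matches.
class Trie {
  constructor() { this.root = {}; }

  insert(word) {
    let node = this.root;
    for (const ch of word) {
      node = node[ch] ??= {}; // Create child nodes lazily
    }
    node.isEnd = true; // Terminal node marks a complete banned word
  }

  // Returns inclusive [start, end] index ranges of every match in `text`
  findMatches(text) {
    const matches = [];
    for (let i = 0; i < text.length; i++) {
      let node = this.root;
      for (let j = i; j < text.length && node[text[j]]; j++) {
        node = node[text[j]];
        if (node.isEnd) matches.push([i, j]);
      }
    }
    return matches;
  }
}
```

Because matches come back as index ranges rather than booleans, the later allow-listing stage can inspect exactly where in the message the hit occurred.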
How do you solve the Scunthorpe Problem?
To solve the Scunthorpe problem, you must validate flagged matches against an allow-list to ensure the "bad word" isn't actually a substring of a legitimate word like "bumpy" or "album."
This is where many engineers get stuck. If your Trie flags the word "bum," you shouldn't immediately trigger a block. Instead, you need to perform a look-ahead and look-behind on the original string. This is essentially a secondary validation step. If the Trie identifies a match at index i through j, you check if that specific range is part of a known-good word in your allow-list.
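A hedged sketch of that range check, assuming a lowercased original string and a hypothetical ALLOW_LIST set: expand the flagged [i, j] range outward to the full alphabetic word around it, then see if that word is known-good.

```javascript
// Illustrative allow-list; a real one would be much larger.
const ALLOW_LIST = new Set(["bumpy", "album", "scunthorpe"]);

// Given a flagged inclusive range [i, j] in lowercased text, expand outward
// to the surrounding alphabetic word and check it against the allow-list.
function isFalsePositive(text, i, j) {
  let start = i, end = j;
  while (start > 0 && /[a-z]/.test(text[start - 1])) start--;
  while (end < text.length - 1 && /[a-z]/.test(text[end + 1])) end++;
  const surroundingWord = text.slice(start, end + 1);
  return ALLOW_LIST.has(surroundingWord); // Embedded in a known-good word?
}
```

If the surrounding word is exactly the flagged match (e.g. a standalone "bum"), it won't be in the allow-list and the block proceeds as normal.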
| Filtering Stage | Technical Goal | Latency Cost |
|---|---|---|
| Sanitization | Normalize input characters | Low (O(N)) |
| Trie Traversal | Fast prefix matching | Low (O(L)) |
| Allow-listing | Resolve Scunthorpe false positives | Moderate (O(Match Count)) |
| ML Inference | Context and intent scoring | High (O(Inference Time)) |
When should you use Machine Learning instead of a Trie?
Use ML scoring when you need to detect intent or harassment that doesn't rely on specific banned words, but be aware that ML introduces a significant latency trade-off compared to the Trie approach.
A Trie is a deterministic, high-speed tool. It is great at finding "bad words," but it’s terrible at finding "bad behavior." It can't catch sarcasm or a user being hostile without using slurs. This is where models like Google’s Perspective API or custom BERT-based classifiers come in. They provide a toxicity score (0 to 1) based on the context of the whole sentence.
However, from a systems design perspective, you shouldn't run every single message through an ML model. Inference is expensive and slow. A common pattern is to use the Trie as a first-pass filter. If the Trie catches a high-confidence match, you block it immediately. If the message passes the Trie but the user has been flagged recently or the message contains suspicious patterns, you then asynchronously or conditionally route it to the ML model for a deeper score. This saves your CPU cycles for messages that actually need the semantic analysis.
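One way to sketch that routing, with every dependency injected as a hypothetical stand-in (sanitize, trieMatches, isUserFlagged, and scoreToxicity are illustrative names, and the 0.8 threshold is arbitrary):

```javascript
// Two-tier moderation sketch: deterministic Trie first, ML only when needed.
// All helpers come in via `deps`; a real system wires in its own sanitizer,
// Trie, user-reputation check, and ML client.
async function moderate(message, user, deps) {
  const { sanitize, trieMatches, isUserFlagged, scoreToxicity } = deps;

  const clean = sanitize(message);     // Stage 1: normalize evasions
  if (trieMatches(clean).length > 0) { // Stage 2: cheap deterministic pass
    return { action: "block", reason: "dictionary" };
  }

  // Stage 3: pay for inference only when cheaper signals look suspicious
  if (isUserFlagged(user)) {
    const score = await scoreToxicity(message); // hypothetical 0..1 score
    if (score > 0.8) return { action: "block", reason: "ml" };
  }

  return { action: "allow" };
}
```

The returned `reason` field makes it easy to audit later which tier blocked a message.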
FAQ
How do you handle words that are safe in one context but not another?
This is the limitation of the Trie. If a word’s toxicity is context-dependent, you have to rely on ML scoring. A Trie can only tell you if a word exists; only a transformer-based model can tell you what that word means in that specific sentence.
What happens if a user uses Unicode characters that look like Latin letters?
This is a sanitization edge case known from "IDN homograph attacks." Your character map needs to include common Unicode lookalikes (like the Cyrillic 'а', which is a different code point from the Latin 'a') and fold them back to their Latin counterparts before the text hits the Trie.
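A sketch of that Unicode pre-pass: String.prototype.normalize('NFKD') folds compatibility characters (fullwidth letters, accented forms), but Cyrillic homoglyphs are distinct code points that NFKD leaves alone, so they need an explicit map. The HOMOGLYPHS table below is a small illustrative subset, not a complete confusables list:

```javascript
// Cyrillic letters that render identically to Latin ones (partial list).
const HOMOGLYPHS = { '\u0430': 'a', '\u043e': 'o', '\u0435': 'e', '\u0441': 'c', '\u0440': 'p' };

function foldUnicode(input) {
  return input
    .normalize('NFKD')                   // Fullwidth "ａ" -> "a", "é" -> "e" + accent
    .replace(/[\u0300-\u036f]/g, '')     // Drop the combining accents
    .split('')
    .map(c => HOMOGLYPHS[c] ?? c)        // Map Cyrillic lookalikes explicitly
    .join('');
}
```

Run this before the leetspeak map so both kinds of lookalike collapse to the same canonical letters.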
Should I block the message or just mask the bad words?
In high-throughput systems, masking (***) is often preferred because it provides immediate feedback to the user without breaking the flow of the UI. However, for severe toxicity, outright blocking is necessary to prevent the storage of harmful content in your database.
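Masking itself is trivial once you have match ranges from the detection stage — a small sketch, assuming inclusive [start, end] ranges like those the Trie scan produces:

```javascript
// Replace each matched inclusive [start, end] range with asterisks,
// leaving the rest of the message intact.
function maskMatches(text, ranges) {
  const chars = text.split('');
  for (const [start, end] of ranges) {
    for (let i = start; i <= end; i++) chars[i] = '*';
  }
  return chars.join('');
}
```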