Doogal Simpson
Posted on • Originally published at doogal.dev

Scaling Profanity Filters: Why I Use Tries for Real-Time Chat

TL;DR: When I'm building high-traffic chat systems, a standard list lookup for profanity is too slow because search time grows with the size of the dictionary. I use a Trie (prefix tree) to move to O(K) performance. This ensures that filtering speed depends entirely on the length of the word being checked, not the number of banned words.

I’ve seen developers fall into the same trap over and over: they maintain a blacklist of 5,000 words and run a basic .contains() check for every word in a player's message. In a small app, you won't notice. But when I'm looking at game architecture handling millions of messages, that O(N) overhead is a disaster.

Every time you add a new word to that list, you're increasing the workload for your CPU. If you're processing a sentence with ten words against a list of 5,000, you’re potentially doing 50,000 comparisons. That overhead becomes unsustainable when you scale. To fix this, I move the logic into a Trie.
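For reference, the naive approach looks something like this (a minimal sketch; the word list and the `containsProfanity` name are illustrative, not from a real codebase):

```typescript
// Naive O(N × M) filter: every word in the message is compared against
// every entry in the blacklist.
const bannedWords: string[] = ["bum", "jerk", "dolt"]; // imagine 5,000 entries

function containsProfanity(message: string): boolean {
  const words = message.toLowerCase().split(/\s+/);
  // 10 words x 5,000 entries = up to 50,000 comparisons per message
  return words.some(word => bannedWords.includes(word));
}
```

Every new entry in `bannedWords` makes every future message slightly more expensive to check.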

Why is a simple list lookup too slow for profanity filters?

I find that linear searches force the CPU to iterate through a dictionary until it finds a match, which means latency spikes as the blacklist grows. This dictionary-dependent speed is a bottleneck in any real-time system where milliseconds matter.

If you have a bank of 5,000 bad words, your server has to ask "Is it this word?" five thousand times for every piece of text sent. When I'm scaling a game to millions of active users, that's billions of unnecessary operations. You aren't just checking if a word exists; you're wasting cycles re-scanning the same prefixes across a flat array.

What is a Trie and how does it optimize string searching?

I use a Trie—a type of prefix tree—to break words down into a sequence of character nodes. This shifts the complexity from O(N) to O(K), where performance is determined solely by the number of letters in the word I’m checking.

Think of it as a roadmap of characters. Instead of scanning a whole list, I start at a root node and follow a path. For the word "BUM," I jump to 'B', then 'U', then 'M'. If the 'M' node is flagged as a word, I've caught a match. This is where the performance win happens: I only ever make three jumps, whether my dictionary has 50 words or 50,000. The size of the blacklist no longer affects my search time.

How do you implement a Trie for word validation?

I implement this by building a node structure where each node contains a map of its children and a boolean flag. The search method simply walks through these maps character by character until it hits a terminal node or a dead end.

```typescript
class TrieNode {
  children: Map<string, TrieNode> = new Map();
  isBadWord: boolean = false;

  // Add a word to the trie, creating child nodes as needed.
  insert(word: string): void {
    let current: TrieNode = this;
    for (const char of word) {
      if (!current.children.has(char)) current.children.set(char, new TrieNode());
      current = current.children.get(char)!;
    }
    current.isBadWord = true;
  }

  // Walk the children maps character by character until a dead end.
  search(word: string): boolean {
    let current: TrieNode | undefined = this;
    for (const char of word) {
      current = current.children.get(char);
      if (!current) return false;
    }
    return current.isBadWord;
  }
}
```

What are the trade-offs when choosing a Trie over a List?

The primary trade-off I consider is that a Trie trades memory for speed. While it provides near-instant lookups, it requires more RAM to store the individual node objects and character maps compared to a flat array of strings.

| Metric | Array / List Search | Trie (Prefix Tree) |
| --- | --- | --- |
| Search Time | O(N) - depends on list size | O(K) - depends on word length |
| Scaling Cost | Linear - slower as the list grows | Constant - list size doesn't matter |
| Memory Usage | Minimal (stores strings as-is) | Moderate (object/map overhead) |
| Best For | Small, static lists | High-velocity chat streams |

In my experience, the memory overhead is almost always worth the massive CPU savings. For a modern game server, a few extra megabytes of RAM to keep a 5,000-word Trie in memory is a small price to pay for lightning-fast validation across millions of chat messages.

Cheers.

FAQ

How does a Trie handle substrings or "hidden" words?

To catch words buried in other strings, I don't just check the whole word; I start a Trie traversal at every character index of the player's message. This allows me to catch banned terms even if they are part of a larger, unformatted string.
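That scan can be sketched like this (a standalone version with hypothetical `insert` and `containsBadWord` helpers, so it runs on its own):

```typescript
class TrieNode {
  children = new Map<string, TrieNode>();
  isBadWord = false;
}

// Build the trie from a banned word, creating nodes as needed.
function insert(root: TrieNode, word: string): void {
  let node = root;
  for (const ch of word) {
    if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
    node = node.children.get(ch)!;
  }
  node.isBadWord = true;
}

// Start a fresh traversal at every index so "xxbumxx" still matches "bum".
function containsBadWord(root: TrieNode, text: string): boolean {
  const chars = [...text]; // code-point aware split
  for (let i = 0; i < chars.length; i++) {
    let node = root;
    for (let j = i; j < chars.length; j++) {
      const next = node.children.get(chars[j]);
      if (!next) break;           // dead end: restart at the next index
      if (next.isBadWord) return true;
      node = next;
    }
  }
  return false;
}
```

Note this makes the worst case O(M × K) for a message of M characters, which is still independent of the blacklist size.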

Is a Trie faster than a Hash Set?

In many cases, yes. A Hash Set offers O(1) average lookups, but hashing the key is itself O(K): you have to process the entire input string before you can check it. A Trie lets me fail fast the moment a prefix doesn't match, which is often more efficient for long strings or partial matches.

Can I use a Trie for languages with different alphabets?

Definitely. Since I use a Map for the children, the Trie doesn't care if the keys are ASCII, UTF-8, or emojis. As long as you can map a character to a node, the O(K) lookup logic remains identical across different languages.
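One JavaScript/TypeScript detail that makes this work: iterating a string with `for...of` (or spreading it) walks Unicode code points rather than UTF-16 units, so an accented letter or an emoji lands in the Map as a single key:

```typescript
// "café🙂" written with escapes so the encoding is unambiguous.
const word = "caf\u00E9\u{1F642}";

const keys = [...word];   // ["c", "a", "f", "é", "🙂"]
console.log(keys.length); // 5 code points, so 5 trie keys
console.log(word.length); // 6 UTF-16 units (the emoji takes two)
```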
