Crossword helper internals: regex vs trie for pattern matching
If you’ve ever spent a Sunday morning staring at a crossword puzzle, you know the frustration of having a word like C_A_E and absolutely no idea what fits. As developers, our first instinct is to build a tool to solve it. But when you move from a simple script to a production-grade word finder, you quickly hit a wall: how do you search a dictionary of 100,000+ words efficiently?
I recently spent some time refactoring a crossword solver, and I learned that the choice between a Regex-based approach and a Trie-based approach is a classic study in the trade-off between memory, startup time, and query latency.
The Regex Approach: The "Quick and Dirty"
The most intuitive way to solve C_A_E is to convert the pattern into a Regular Expression. You replace the underscores with a wildcard (like .) and anchor the string.
import re
def find_with_regex(pattern, dictionary):
# Convert C_A_E to ^c.a.e$
regex = re.compile(f"^{pattern.replace('_', '.')}$", re.IGNORECASE)
return [word for word in dictionary if regex.match(word)]
Why it works
It’s incredibly simple. You don’t need to pre-process your data, and the code is readable. If you are building a small tool or a quick prototype, this is the way to go.
The bottleneck
The problem is that re.match is an $O(n)$ operation relative to the size of your dictionary. For every single query, your CPU has to iterate through every word in your list, compile the regex, and perform the match. On a dictionary of 100,000 words, a single lookup takes roughly 50ms. That might sound fast, but if you’re building a site like a2zwordfinder.com, where users expect instant, type-ahead results, that latency adds up quickly.
The Trie Approach: The "Spatial Index"
A Trie (or prefix tree) is a tree-like data structure where each node represents a character. By traversing the tree, you can prune entire branches that don't match your pattern.
To handle crossword patterns, we don't just store words; we store them in a way that respects position. If we are looking for C_A_E, we only traverse the branch starting with C, then skip the next node (the wildcard), move to A, and so on.
class TrieNode:
def __init__(self):
self.children = {}
self.is_word = False
def search_trie(node, pattern, index=0):
if index == len(pattern):
return [[]] if node.is_word else []
char = pattern[index]
results = []
if char == '_':
for child_char, child_node in node.children.items():
for res in search_trie(child_node, pattern, index + 1):
results.append([child_char] + res)
elif char in node.children:
for res in search_trie(node.children[char], pattern, index + 1):
results.append([char] + res)
return results
Why it wins
The Trie turns your search into an $O(k)$ operation, where $k$ is the length of the pattern. Because you are only visiting nodes that could possibly match, you aren't scanning the entire dictionary.
In my benchmarks, once the Trie is built (which takes about 200ms on startup), the query time drops to roughly 0.1ms. That is a 500x speedup over the regex approach.
The Trade-off: When to use which?
Choosing between these two isn't about which is "better," but about the constraints of your application.
Use Regex if:
- Memory is tight: A Trie can consume significant RAM because of the overhead of storing thousands of node objects.
- Your dictionary is small: If you’re only searching through a few thousand words, the overhead of building a Trie isn't worth the performance gain.
- You need flexibility: Regex allows for complex patterns (like "starts with C, ends with E, and contains at least two vowels") that are much harder to implement in a standard Trie.
Use a Trie if:
- You are building a high-traffic service: If you look at professional-grade tools like Puzzle Depot, they prioritize sub-millisecond response times. A Trie is essential for providing a snappy user experience.
- You have a static dictionary: If your word list doesn't change often, you can build the Trie once at server startup and keep it in memory.
- You need "Search-as-you-type": The speed of the Trie allows you to filter results in real-time as the user types, which is a massive UX win.
Final Thoughts
Building a crossword helper taught me that performance optimization is rarely about finding the "fastest" algorithm and almost always about understanding the lifecycle of your data.
If you’re just starting out, stick with Regex. It’s clean, maintainable, and gets the job done. But if you find yourself hitting that 50ms latency wall and your users are starting to notice, it’s time to reach for a Trie. It’s a bit more work to implement, but the performance gains are undeniable.
Have you built a word-finding tool? Did you go the Regex route or build a custom index? Let me know in the comments!
Top comments (0)