How anagram solvers actually work: algorithms behind the scenes
If you’ve ever built a word game or a tool to help with Scrabble, you’ve likely run into the "anagram problem." Given a string of characters, how do you efficiently find every valid word in the dictionary that can be formed using those letters?
A naive approach—generating every possible permutation of the input string and checking if each exists in a set—is a recipe for disaster. For a 10-letter word, you’re looking at 3,628,800 permutations. That’s not just slow; it’s unusable.
To build a production-grade anagram solver, we need to shift our thinking from permutation to canonical representation.
The Canonical Sorted-Key Approach
The most efficient way to solve for exact anagrams is to normalize the dictionary. If two words are anagrams, they contain the exact same characters with the same frequencies. Therefore, if you sort the letters of any word alphabetically, all anagrams of that word will result in the same "key."
For example, "listen" and "silent" both become "eilnst" when sorted.
The Implementation
We preprocess our dictionary into a hash map (or dictionary in Python) where the key is the sorted string and the value is a list of matching words.
```python
from collections import defaultdict

def build_anagram_index(word_list):
    # Preprocessing is O(n * L log L) for n words of average length L
    index = defaultdict(list)
    for word in word_list:
        # Canonicalize: sort the letters
        key = "".join(sorted(word.lower()))
        index[key].append(word)
    return index

def find_anagrams(word, index):
    # Each lookup costs O(L log L) for the sort, then O(1) for the hash map
    key = "".join(sorted(word.lower()))
    return index.get(key, [])

# Example: "tinsel" shares the key "eilnst" with its anagrams
index = build_anagram_index(["listen", "silent", "enlist", "google"])
find_anagrams("tinsel", index)  # -> ["listen", "silent", "enlist"]
```
This approach is incredibly fast. The lookup time is effectively $O(L \log L)$ where $L$ is the length of the input word (due to the sorting step). Once sorted, the hash map lookup is $O(1)$.
Moving Beyond Exact Matches: The Trie
The sorted-key approach is perfect for finding full-word anagrams, but what if you want to find words that can be formed from a subset of your letters? Or what if you want to support "wildcard" tiles?
This is where the Trie (Prefix Tree) shines. A Trie stores words as a tree structure where each node represents a character.
Instead of sorting, you traverse the tree. If you have the letters "a, r, t," you start at the root and move to the 'a' branch, then 'r', then 't'. If you hit a node marked as a "word end," you’ve found a valid word.
Why use a Trie?
Tries are excellent for partial matches and prefix searching. If you are building a game like Boggle or a crossword helper, you can prune the search space early. If the current path in your Trie doesn't exist, you stop searching that branch immediately.
```javascript
class TrieNode {
  constructor() {
    this.children = {};
    this.isEndOfWord = false;
  }
}

// Searching for words that can be formed by a set of letters.
// `letters` is a frequency map of the available tiles, e.g. { a: 1, r: 1, t: 1 }.
function findWords(node, letters, currentWord, results) {
  if (node.isEndOfWord) results.push(currentWord);
  for (const char in node.children) {
    if (letters[char] > 0) {
      letters[char]--;
      findWords(node.children[char], letters, currentWord + char, results);
      letters[char]++; // Backtrack
    }
  }
}
```
This is significantly more flexible than the sorted-key method, though it requires more memory to store the tree structure. For a more visual look at how these letter-based constraints work in practice, check out lettersintowords.com.
The Bit-Vector Optimization
If you are working in a memory-constrained environment or need to perform millions of subset checks per second, you can represent a word's character count using a bit-vector.
Since there are only 26 letters in the English alphabet, you can map each letter to a specific bit position. However, a simple bit-mask only tells you if a letter is present. To handle anagrams, you need a frequency count. You can store the count of each letter in a fixed-size array (or a 64-bit integer if the counts are small).
Checking whether word A can be formed from word B's letters then becomes a simple element-wise comparison of the two count vectors: A fits inside B if `wordA_counts[i] <= wordB_counts[i]` for all `i`.
This is the "secret sauce" behind high-performance engines that need to check if a rack of tiles can form a specific word without traversing a massive tree structure.
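As a minimal sketch of that check (the function names `letterCounts` and `canForm` are illustrative, not from any particular engine):

```javascript
// Reduce a word to a fixed 26-slot array of letter frequencies
function letterCounts(word) {
  const counts = new Array(26).fill(0);
  for (const char of word.toLowerCase()) {
    const i = char.charCodeAt(0) - 97; // 'a' -> 0, 'z' -> 25
    if (i >= 0 && i < 26) counts[i]++;
  }
  return counts;
}

// `word` can be built from `rack` iff it needs no letter more
// often than the rack supplies it
function canForm(word, rack) {
  const need = letterCounts(word);
  const have = letterCounts(rack);
  for (let i = 0; i < 26; i++) {
    if (need[i] > have[i]) return false;
  }
  return true;
}

canForm("listen", "silentx"); // true: the rack covers l, i, s, t, e, n
canForm("letter", "relate");  // false: "letter" needs two t's
```

The whole check is a fixed 26-iteration loop with no allocation beyond the two small arrays, which is why it holds up at millions of subset checks per second.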
What I’d try next
If I were scaling this for a massive dictionary (like the Scrabble Tournament Word List), I’d look into Bloom Filters. They allow you to check if a word might exist in the dictionary with very little memory overhead, acting as a fast-fail layer before hitting the more expensive Trie or Hash Map.
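A Bloom filter sketch might look like the following. This is a toy illustration, not a production library: the class name, the bit-array size, the number of hash functions, and the seeded FNV-1a-style hash are all choices of mine for the example.

```javascript
// Toy Bloom filter: k seeded hashes set k bits per word. If any of a
// word's bits is unset, the word is definitely not in the dictionary;
// if all are set, it *might* be (false positives are possible).
class BloomFilter {
  constructor(bits = 1 << 16, hashes = 3) {
    this.size = bits;
    this.hashes = hashes;
    this.bitset = new Uint8Array(bits);
  }
  // Simple FNV-1a-style hash, seeded to derive k different functions
  hash(word, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < word.length; i++) {
      h ^= word.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.size;
  }
  add(word) {
    for (let s = 0; s < this.hashes; s++) this.bitset[this.hash(word, s)] = 1;
  }
  mightContain(word) {
    for (let s = 0; s < this.hashes; s++) {
      if (!this.bitset[this.hash(word, s)]) return false; // definitely absent
    }
    return true; // possibly present
  }
}

const filter = new BloomFilter();
["listen", "silent", "enlist"].forEach((w) => filter.add(w));
// mightContain never gives a false negative for added words;
// a junk string like "zzzz" is rejected with high probability
```

The fast-fail property is the whole point: a negative answer is free and certain, so only the (rare) positive answers ever reach the Trie or hash map.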
I’d also experiment with DAWGs (Directed Acyclic Word Graphs). A DAWG is essentially a compressed Trie. Since many words share suffixes (like "-ing" or "-tion"), a DAWG merges these nodes, drastically reducing the memory footprint of your dictionary while maintaining the same lookup speed as a standard Trie.
Whether you choose the simplicity of the sorted-key hash map or the power of a Trie, the key is understanding the trade-off between your dictionary's memory footprint and the speed of your search. Happy coding!