DEV Community

Dean Gilley

Solving Wordle with information theory: entropy, guess trees, and why greedy wins

If you spent any time on the internet in early 2022, you likely saw the viral sensation that was Wordle. While most players were relying on intuition or a lucky "ADIEU" opener, the engineering community was busy treating the game as a classic information theory problem.

At its core, Wordle is a game of reducing uncertainty. Every time you submit a guess, the game provides a feedback pattern (gray, yellow, or green). This feedback acts as a filter, narrowing down the set of possible secret words. To solve the game efficiently, we don't just want to guess words; we want to maximize the information we gain from each guess.
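That filtering step can be sketched directly. The helper names below (`get_feedback`, `filter_candidates`) are just for this example, not from any particular Wordle codebase; feedback is encoded as 2 = green, 1 = yellow, 0 = gray:

```python
from collections import Counter

def get_feedback(guess, secret):
    # Score a guess against a secret: 2 = green, 1 = yellow, 0 = gray
    feedback = [0] * len(guess)
    unmatched = Counter()
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = 2          # right letter, right position
        else:
            unmatched[s] += 1        # secret letter still available for yellows
    for i, g in enumerate(guess):
        if feedback[i] == 0 and unmatched[g] > 0:
            feedback[i] = 1          # right letter, wrong position
            unmatched[g] -= 1        # consume it so duplicates aren't over-counted
    return tuple(feedback)

def filter_candidates(candidates, guess, observed):
    # Keep only the secrets that would have produced the observed feedback
    return [w for w in candidates if get_feedback(guess, w) == observed]
```

For example, guessing "crane" against the secret "brake" yields `(0, 2, 2, 0, 2)`, and filtering a candidate list by that pattern discards every word inconsistent with it.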

The Math: Shannon Entropy

In information theory, we measure the "surprise" or information content of an event using Shannon entropy. For a given guess, we want to choose a word that splits the remaining possible secret words into the most balanced "buckets" of feedback patterns.

If a guess splits the remaining words into $N$ possible feedback patterns, where each pattern $i$ occurs with probability $p_i$, the entropy $H$ is defined as:

$$H = -\sum_{i=1}^{N} p_i \log_2(p_i)$$

A guess that results in a uniform distribution of feedback patterns—where every possible outcome is equally likely—maximizes entropy. This is the "gold standard" for a first guess.
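As a quick worked example (numbers chosen purely for illustration): suppose a guess splits 128 remaining candidates into four equally likely feedback buckets, so each $p_i = 1/4$:

$$H = -\sum_{i=1}^{4} \frac{1}{4}\log_2\!\left(\frac{1}{4}\right) = -4 \cdot \frac{1}{4} \cdot (-2) = 2 \text{ bits}$$

Two bits of information means the guess cuts the candidate space by a factor of $2^2 = 4$ on average, from 128 words down to 32.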

Implementing Entropy in Python

Calculating this is surprisingly straightforward. We need to simulate the feedback for every possible secret word in our list and group them by the resulting pattern.

import math
from collections import Counter

def calculate_entropy(guess, possible_words):
    # get_feedback scores a guess against a secret and returns a
    # tuple like (0, 1, 2, 0, 0): 0 = gray, 1 = yellow, 2 = green
    patterns = [get_feedback(guess, secret) for secret in possible_words]

    counts = Counter(patterns)
    total = len(possible_words)
    entropy = 0.0

    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy

Why TARES beats ADIEU

Many players swear by "ADIEU" because it hits four vowels. However, from an information theory perspective, "ADIEU" is suboptimal. It is a "vowel-heavy" strategy that often results in highly skewed feedback. You might get a lot of yellow letters, but you haven't effectively partitioned the remaining word space.

"TARES," on the other hand, is a powerhouse. It hits high-frequency consonants (T, R, S) and common vowels (A and E). When you run the entropy calculation across the standard Wordle dictionary, "TARES" (or "SALET" or "CRANE") consistently provides a higher average information gain than "ADIEU". It doesn't just tell you what is in the word; it also rules out large swaths of the dictionary that don't fit.

If you want to see how these strategies hold up against different word lists, or to play with real Wordle variants, you can test your own openers against various constraints.

Greedy vs. Minimax: The Search for Perfection

The "greedy" approach—picking the word with the highest entropy at each step—is incredibly effective. It will solve the vast majority of Wordle puzzles in 3 to 4 guesses. However, it is not mathematically "optimal."

A greedy solver only looks one step ahead: it optimizes the next guess, not the total number of guesses. To approach the optimal average of roughly 3.42 guesses, you need a Minimax approach.

A Minimax solver builds a full decision tree. It asks: "If I pick word X, what is the worst-case scenario for the remaining number of guesses?" It then chooses the move that minimizes that worst-case outcome.

def minimax(possible_words):
    # Base case: with at most two candidates left, each may cost one guess
    if len(possible_words) <= 2:
        return len(possible_words)

    # Branching factor: simulate every feedback pattern for every guess
    # This is computationally expensive!
    best = float("inf")
    for guess in possible_words:
        # Partition the candidates by the feedback each secret would produce
        buckets = {}
        for secret in possible_words:
            buckets.setdefault(get_feedback(guess, secret), []).append(secret)
        # Worst case for this guess: the hardest bucket, plus the guess itself
        worst = 1 + max(minimax(subset)
                        for pattern, subset in buckets.items()
                        if pattern != (2, 2, 2, 2, 2))
        best = min(best, worst)
    return best

While the greedy approach is a simple loop, the Minimax approach requires a recursive search through a massive state space. It is the difference between a quick script and a heavy-duty search algorithm. If you ever get stuck on a particularly brutal daily puzzle, a2zwords.com can serve as a helpful companion tool to see what the "optimal" path might have looked like.

Moving to Production

If you were to build a production-grade Wordle solver, you would quickly run into performance bottlenecks. Here is how you would scale it:

  1. Memoization/Caching: The state space of Wordle is finite. You should cache the entropy results for every (guess, remaining_word_list) tuple. Once you've calculated the entropy of "CRANE" for a specific subset of words, you never need to calculate it again.
  2. Bitmasking: Instead of storing words as strings, represent them as bitmasks. Comparing a guess to a secret word becomes a series of bitwise AND/OR operations, which are orders of magnitude faster than string manipulation.
  3. Polyglot Dictionaries: A production solver should handle multiple languages. By decoupling the solver logic from the dictionary file (using JSON or SQLite), you can swap between English, Spanish, or even custom word lists without changing a line of code.
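The bitmasking idea in point 2 can be sketched as follows. This is a toy letter-membership scheme (one possible encoding, not the canonical one): each word becomes a 26-bit integer with one bit per letter, so "do these words share any letters?" collapses to a single bitwise AND:

```python
def letter_mask(word):
    # One bit per letter of the alphabet: bit 0 = 'a', bit 1 = 'b', ... bit 25 = 'z'
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask

def shared_letter_count(word_a, word_b):
    # Distinct letters the two words have in common: one AND plus a popcount
    return bin(letter_mask(word_a) & letter_mask(word_b)).count("1")
```

`shared_letter_count("crane", "adieu")` is 2 (the shared A and E), and a zero overlap lets a solver discard a word without ever comparing characters. Note that the mask drops positional and duplicate-letter information, so it works as a fast pre-filter, not a replacement for full feedback scoring.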

Wordle is a perfect example of how a simple game can be transformed into a rigorous exercise in computer science. Whether you prefer the quick-and-dirty greedy approach or the exhaustive perfection of a Minimax tree, the math remains the same: it’s all about maximizing the information you extract from every single guess.
