Vinit Jogani

Posted on Mar 31, 2022

The optimal strategy for solving a Wordle....

#algorithms #go #python #programming

Okay, so like many other people, I've been addicted to that word game for a while now. As any engineer, I began to wonder if there is an optimal strategy to play the game and if we can write a program to discover this. This post just goes through my thought process and a sequence of strategies that achieve varying levels of success on the game. Enough talk, let's dive right in.

Pre-work

To test out all the algorithms in this post, we would first need a dictionary of all possible 5-letter words that Wordle could use. I ended up using NLTK to retrieve this list:

from nltk.corpus import words
[w for w in words.words() if len(w) == 5]

This gave me ~10K words to work with. In practice, Wordle uses a subset of these to be valid words so it's not a completely accurate dictionary, but hopefully that subset is a relatively uniform sample of the 10K words so our estimates on the full dictionary carry forward.

Idea 1

Alright, intuitively, what is it that we try to do when we solve a Wordle? A common (and reasonable, at first?) strategy seems to start with covering the most frequent letters. For example, words that have many vowels (e.g. "adieu") so that they can really help us narrow down our search.

We can implement this by a simple character-frequency counter to compute the most frequent characters. Then, we can score each of the words in our dictionary as the sum of the frequencies of each individual character. For example:

score['adieu'] = freq['a'] + freq['d'] + ...

When we get some feedback from the game, e.g. it "a" and "e" are exact matches, "d" is a soft match and "i" and "u" don't exist, we can prune our dictionary to only include words that satisfy these constraints and re-run the scoring.

Problem 1: The first problem we run into here is that the optimal selection of the word may not be a word that satisfies all the existing constraints. At least in the "easy mode" on Wordle, where we are not required to make use of all evidence received so far, it may be beneficial to use a word that violates some constraints but still offers more information.

Problem 2: Another problem here is that this will naturally try to favor words which repeat high-frequency letters. For example, "alala" happened to be a valid word in the dictionary and it has 3 "a"s so it is naturally scored higher. However, that's clearly not of value because it lacks coverage.

To solve these problems, we come up with our second idea.

Idea 2

To solve the first problem, we can maintain two sets of words: (1) A = all words in the dictionary, and (2) S = words that satisfy constraints so far. We rank all words in A but only compute the frequency maps using the words in S. This makes sense because it lets us choose a word outside our dictionary as long as it makes decent use of the information that we have so far.

To solve the second problem, we just add the score for each unique letter in the word. That is, repeated letters don't get multiple points. This should help by scoring words like "alala" much lower because it will only get points from "a" and "l".

Problem 1: Some words have repeats, so we might need special checks to detect if we're close to solving the puzzle and in that case, change the scoring mechanism perhaps.

Problem 2: While this strategy makes use of the green/black responses from the game, it doesn't make the best use of yellow responses (i.e. soft matches). That is, suppose we know that "d" exists in the target word but NOT at the second position, it should ideally rank "midst" higher than "adieu" because it will give us more information about the location of "d" assuming all other letters have a score of 0.

We patch this issue in our third idea.

Idea 3

Let's start with the second problem first. The real problem we face there is that we treat letters in all positions equally wherein we know that position matters. Instead, we can compute position-wise character frequencies instead of global frequencies. Then, we can score words as follows:

score['adieu'] = freq[('a', 0)] + freq[('d', 1)] + ...

That is, the score for "adieu" is how many times we have seen "a" in the 0-th position plus how many times we have seen "d" in the 1-st position, etc.

This nicely incorporates yellow feedback because the positional scores for those words will go down. It turns out that this actually works fairly well and problem 1 seems to resolve itself automatically -- after just a few guesses, the repeat-lettered words start scoring high enough to be considered by the model even if they contribute to the score.

There are still some big problems, however.

Problem 1: This strategy is still not perfect -- it uses the evidence a bit too much. For example, consider that we tried the word "bribe" and we got the feedback that all letters are correct except the one on the 4th position. What our model will do at the moment is try all the words like "brice", "bride" and "brine" one after the other. However, we could gain a lot more information by trying a word like "dance" which would tell us if "d", "n" or "c" occur and help us eliminate the other two directly.

Problem 2: The BIG problem here is that our initial intuition about picking high frequency letters is actually completely wrong. HOW? Well, consider this. You find that "a" has the highest frequency. You try a word with "a" and you find that it does indeed exist in your word. So what can you do with it? Well not much -- because it's already a high frequency word and you aren't able to eliminate many words. Instead, you want to ideally pick a word which is somewhere in the middle so that either way, we can eliminate 50% of the other words.

We can fix this.

Idea 4

Again, let's start with the second problem. The "pick a letter from the middle" concept actually has a formal word: find letters with the highest entropy. That is, we don't care for letters that occur in 99% of the words or 1% of the words. We care about letters that occur in ~50% of the words. These letters will have a higher entropy. In particular, let's say we normalize all our frequencies by dividing with the length of the dictionary so that they are relative frequencies instead:

p[('a', 0)] = freq[('a', 0)] / N  # where N is len(A)

These are 0-1 values. Then, we compute entropy as:

entropy[x] = p[x] * (1 - p[x])  # for all x

This value is highest when p[x] is close to 50% and low on either sides. This means we can then swap out our scoring mechanism to be:

score['adieu'] = entropy[('a', 0)] + entropy[('d', 1)] + ...

Which brings us to the first problem -- combining local (position-dependant) and global (position-independant) information. Here, we can use a little hack -- instead of solely relying on global frequency counts or position-wise counts, compute two scores and then average them:

local_score['adieu'] = entropy[('a', 0)] + entropy[('d', 1)] + ...
global_score['adieu'] = entropy['a'] + entropy['d'] + ...

score = (local_score['adieu'] + global_score['adieu']) / 2

Why doing this in that exact manner does not have a strong conceptual reasoning -- for instance, why not multiply the two scores? Turns out, all those ways of combining the scores seem to work just as well.

At this point, our method gets a success rate of about 99%! That is, it is able to guess the actual word within 6 tries 99% of the time. Pretty good! But can we do better? Really why this doesn't work is that we are computing some joint likelihood of each word while assuming each letter is independent of the other letters in the word. This makes sense because computing the joint distribution gets very hard very quickly. However, perhaps we can take a more holistic approach to this.

Idea 5

At this point, I decided to take a step back and gather the high level learnings:

The entropy trick works better than the frequency trick because it splits our dictionary in a more consistent manner.
The dictionary pruning strategy helps incorporate new information into the scoring strategy.
Using a combination of local and global scores is necessary.

Really, though, the metric to determine whether a selected word is good or not boils down to how many other words it can eliminate on average. That is, consider this brute force algorithm:

For each query word, w1, in the dictionary:
- For each target word, w2, in the dictionary:
  - Compute the feedback we would receive if we guessed w1 when the actual word was w2
  - Calculate the number of words this feedback would help us prune, num_pruned
- Compute the average of num_pruned

In effect, this would give us the expected value of the number of eliminations a word can provide, where the expectation is over all possible target words (which is unknown to us).

When implemented naively, this is an O(N^3) algorithm due to the nested loops and the fact that computing num_pruned is itself an O(N) operation. That is very expensive. However, note that the possible feedbacks for any two words is just 3^5=243 which is a relatively small number (compared to 10K). So we can actually do this in O(N^2) time.

To understand this, however, you need to have a solid understanding of how we can represent feedback. Say the actual word for today is "bribe", and you guess a word like "bikes". You would receive some combination of green, yellow and black boxes. You can represent this as a string of symbols where "+" is green, "-" is black and "~" is yellow. So in this case, you might get "+~-~-". Nice, right?

Now, suppose you are given a guess word w1. You want to find how many other words it will eliminate on average. You go through all possible other words and find that feedback f1 occurs 2 times, f2 occurs 5 times and f3 occurs 1 time. Whenever you get a feedback, you can eliminate all the words that get any of the other feedbacks. So eliminations are just an inverse of the feedback distribution, e.g. e1 = N - f1. You can compute this elimination distribution in O(N^2) time.

Now, you can do another O(N^2) loop where you go through all possible guess word and target word pairs (w1, w2) and compute the feedback, f. You already know how many words the guess-feedback pair (w1, f) will allow you to eliminate based on the distribution above. You can average this over all w2 and you get the average number of eliminations for w1.

When you get some actual feedback (instead of simulated feedbacks over all possible target words), you can simply eliminate those words from the list of possible targets and the algorithm should just work.

It is possible to make this run even faster with some Monte-Carlo Simulation. That is, since we only care about the expectation, we can randomly sample a small set of target words w2 instead of going through all possibilities. While this may be noisy, it should approximate the same results with a large enough sample size!

And that's that -- this would actually give us the theoretically optimal starting word, which happens to be the word "raise"! This actually has a 100% success rate but is much slower to run. To speed it up, I decided to implement it in Golang instead of Python for easy parallelization with goroutines. Also, once this "raise" word was computed (took ~5mins), we could just assume you used that as your starting point which would prune over 97% of the words on average. Since you will now only have 300 or so remaining words, the quadratic runtime is very tractable.

The code for this algorithm can be found here.

Conclusion

Through all these ideas, we just discussed two algorithms. The first one is a noisier algorithm that runs in O(N) time. It is not perfect because it makes the naive assumption of independence (kinda like Naive Bayes!) between letters of the word. Then, we present an O(N^2) algorithm that performs a more thorough estimation of the number of eliminations each word can provide.

Are we done? Well not quite. This is the optimal solution to the one-shot problem. However, in theory, suppose our choice of the first word is very good, it might actually use up all the nice letters and only leave bad words for our second guess. That means in the two-shot problem, we might have another more optimal strategy. In our case, it is a six-shot problem. Doing this would take O((N^2)^6) = O(N^12) steps which is pretty expensive. Until we get some quantum computers working full time on solving the Wordle problem, we will have to be satisfied with the approximation we have achieved so far...

DEV Community