Tariq Mehmood

Posted on Nov 8

Building a New York Times Connections Solver with Python

#webdev #javascript #programming #python

If you’re a fan of word games, you’ve probably come across the puzzle from The New York Times called Connections. In this game you’re shown 16 words and you must split them into 4 groups of 4, each group sharing a hidden common theme.

Recently I built a Solver for this puzzle (or at least a helper tool) and in this post I’ll walk through how I approached it, the code that underlies it, and the challenges & trade-offs you’ll face when building something like this. If you’ve got your own favorite puzzle solver tools, I hope this inspires you to build one too.

What is NYT Connections & why build a solver?

The puzzle

Here’s a quick summary of the NYT Connections mechanics:

You have 16 words, in a 4x4 grid.
You must guess 4 sets of 4 words each, where each set shares some semantic (or playful) connection.
The categories are hidden, and each category is assigned a difficulty colour: yellow (easiest) → green → blue → purple (hardest).
You can make up to four mistakes (four “lives”) before the game ends.

Why build a solver?

There are several motivating factors:

Even though the puzzle looks simple, the hidden categories and subtle misdirections (words that could belong to more than one group) make it hard. As one commentary put it, the game “makes you feel bad” because you keep tripping on the ambiguous connections (raphkoster.com).
Building a solver is an interesting programming challenge: how do you algorithmically detect which words belong together? How do you prioritise semantically-strong clusters vs wordplay?
It gives you a chance to explore natural-language processing (NLP) heuristics, embeddings, lexical resources, and design trade-offs between “fully automated solve” vs “hint-assisted helper”.
For me, I wanted a tool I could use when I’m stuck on a day’s puzzle—but also reflectively learn from (so I get better at the game).

Approach: High-level design

Here is the architecture I settled on:

Input parsing – read the 16 words for the day (either manually entered, or scraped from a source).
Feature extraction – for each word, compute semantic embeddings (via a pre-trained model), lexical attributes (like word length, part-of-speech, morphological features), and possibly co-occurrence / category-likeness features.
Candidate grouping – generate candidate groupings of 4 words from the 16 (combinatorial: C(16,4)=1820 possibilities), then evaluate each grouping via a scoring function that estimates “how likely these 4 share a theme”.
Backtracking / covering – once you pick one group, you remove those 4 words from consideration and repeat on the remaining 12, 8, then 4. Use heuristic search to pick the best covering of the 16 into 4 groups.
Ranking & hint output – output the best groupings, optionally highlight which category is likely easiest/hardest, or even reveal one word from the category as a “hint”.
User interface – a simple web UI (or CLI) where the user can paste the 16 words, press “Solve” or “Hint”, then see suggestions.
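The sizes involved in the candidate-grouping and covering steps can be checked with a quick calculation. This is only a sanity-check sketch; count_partitions is a hypothetical helper name, not part of the solver itself.

```python
import math

def count_partitions(n_words: int = 16, group_size: int = 4) -> int:
    """Number of ways to split n_words into unordered groups of group_size."""
    n_groups = n_words // group_size
    return math.factorial(n_words) // (
        math.factorial(group_size) ** n_groups * math.factorial(n_groups)
    )

print(math.comb(16, 4))    # candidate 4-word groups: 1820
print(count_partitions())  # complete partitions of the grid: 2627625
```

Both numbers are small enough to enumerate, which is why a search over groupings is practical for a 4x4 grid.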

Below is a rough roadmap to the code segments.

Code walkthrough

I’ll use Python for the core solver logic + a minimal Flask (or FastAPI) web front-end. For embedding I’ll use the sentence-transformers library.

Setup & imports

# solver.py
import itertools
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Optional: morphological / lexical features
import nltk
from nltk.corpus import wordnet as wn

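Install the dependencies with pip install sentence-transformers flask nltk scikit-learn numpy (the PyPI package is scikit-learn, not sklearn). WordNet also needs a one-time download via nltk.download('wordnet').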

Feature extraction

# Initialize embedding model
embed_model = SentenceTransformer('all-mpnet-base-v2')

def embed_words(words: list[str]) -> np.ndarray:
    """Compute embedding vectors for each word."""
    return embed_model.encode(words, convert_to_numpy=True, normalize_embeddings=True)

def lexical_features(word: str) -> dict:
    """Compute simple lexical features (length, synsets, POS etc)."""
    # For simplicity we just count synset counts
    synsets = wn.synsets(word)
    return {
        'length': len(word),
        'synset_count': len(synsets),
        # you can add more ...
    }

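The embeddings capture how close the words are in meaning, while the lexical features capture surface-level connections such as length. Note that score_group below computes word lengths itself and does not call lexical_features; synset_count is there if you want to extend the score.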
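As one example of an extra lexical feature, here is a minimal sketch that measures a shared prefix or suffix across a candidate group. The function name shared_affix_len is hypothetical, and this feature is not part of the scoring shown in this post; it is just one way the "you can add more" slot could be filled.

```python
import os

def shared_affix_len(words: list[str]) -> tuple[int, int]:
    """Return (common prefix length, common suffix length) for the words."""
    lowered = [w.lower() for w in words]
    # os.path.commonprefix compares character by character, not by path part
    prefix = os.path.commonprefix(lowered)
    # Reverse each word so the common suffix becomes a common prefix
    suffix = os.path.commonprefix([w[::-1] for w in lowered])
    return len(prefix), len(suffix)

print(shared_affix_len(["SUNDAY", "MONDAY", "FRIDAY", "TODAY"]))  # (0, 3)
```

A non-zero suffix or prefix length could be turned into a small bonus in the group score, in the same way the length heuristic is.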

Scoring a candidate group

def score_group(words: list[str], embeddings: np.ndarray, idx_map: dict) -> float:
    """
    Score how well these four words form a coherent group.
    Higher score = more likely “correct group”.
    Using cosine similarity among embeddings + lexical heuristics.
    """
    idxs = [ idx_map[w] for w in words ]
    sub_emb = embeddings[idxs]
    # compute pairwise cosine similarities
    sim_mat = cosine_similarity(sub_emb)
    # we only care about the off-diagonal elements
    n = len(words)
    sims = []
    for i in range(n):
        for j in range(i+1, n):
            sims.append(sim_mat[i,j])
    avg_sim = np.mean(sims)
    # lexical heuristics: e.g., if lengths are similar, more likely
    lengths = [ len(w) for w in words ]
    length_std = np.std(lengths)
    lex_score = 1.0 / (1.0 + length_std)  # smaller std → higher score
    # final combined score (weights tuned empirically)
    return avg_sim * 0.7 + lex_score * 0.3

What the above does: groups whose words are semantically close (high embedding cosine similarity) score better; also groups of words with similar length (or similar lexical shape) get a small bonus.
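The pairwise loop can also be written without explicit Python loops. The sketch below assumes the rows are already L2-normalised (as embed_words does with normalize_embeddings=True), so a dot product equals cosine similarity. The name mean_pairwise_similarity is hypothetical.

```python
import numpy as np

def mean_pairwise_similarity(sub_emb: np.ndarray) -> float:
    """Average cosine similarity over all unordered pairs of rows.

    Assumes each row has unit length, so dot product == cosine similarity.
    """
    sims = sub_emb @ sub_emb.T
    # Indices of the strict upper triangle: each pair (i, j) with i < j once
    upper = np.triu_indices(len(sub_emb), k=1)
    return float(sims[upper].mean())

# Toy example: two pairs of identical unit vectors, orthogonal across pairs
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(mean_pairwise_similarity(toy))  # 2 matching pairs out of 6
```

For four words this makes no practical speed difference; it mainly shortens the code.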

Generating candidates

def generate_all_groups(words: list[str], embeddings: np.ndarray) -> list[tuple[list[str], float]]:
    idx_map = { w : i for i,w in enumerate(words) }
    candidates = []
    for combo in itertools.combinations(words, 4):
        s = score_group(list(combo), embeddings, idx_map)
        candidates.append((list(combo), s))
    # sort descending
    candidates.sort(key=lambda x: x[1], reverse=True)
    return candidates

Selecting cover sets of 4 groups

def pick_best_cover(words: list[str], embeddings: np.ndarray, top_k=50):
    """
    From all candidate groups, pick top_k best,
    then try to pick combinations of them that cover the 16 words without overlap.
    Return best covering (list of 4 groups).
    """
    candidates = generate_all_groups(words, embeddings)[:top_k]
    best_cover = None
    best_score = float('-inf')  # scores can be negative, so don't start at -1
    # Try all combinations of 4 groups from top_k
    for groups in itertools.combinations(candidates, 4):
        group_words = [ tuple(g[0]) for g in groups ]
        # flatten and check overlap
        all_words = sum(group_words, ())
        if len(set(all_words)) == 16:
            # compute combined score
            score = sum(g[1] for g in groups)
            if score > best_score:
                best_score = score
                best_cover = groups
    return best_cover, best_score

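The above brute-forces combinations of the top candidate groups (limited to top_k) to find a cover of the 16 words by 4 non-overlapping groups. The number of combinations grows as C(top_k, 4): about 230,000 for top_k=50 and about 3.9 million for top_k=100, so you will need pruning, heuristics and cutoffs if top_k is large. Because the highest-scoring candidates often share words, a small top_k may also contain no valid cover at all.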
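An alternative that avoids the top_k cutoff is a backtracking search over partitions, as described in the design section. The sketch below is one possible version, not the code I shipped: best_cover and score_fn are hypothetical names, it assumes all words are distinct, and it treats the score of a cover as the sum of its group scores, as pick_best_cover does.

```python
import itertools
from functools import lru_cache
from typing import Callable, Sequence

def best_cover(
    words: Sequence[str],
    score_fn: Callable[[tuple[str, ...]], float],
    group_size: int = 4,
) -> tuple[list[list[str]], float]:
    """Find the partition of words into groups with the highest total score."""
    if len(set(words)) != len(words) or len(words) % group_size:
        raise ValueError("Need distinct words, a multiple of group_size.")

    @lru_cache(maxsize=None)
    def search(remaining: tuple[str, ...]):
        if not remaining:
            return (), 0.0
        first, rest = remaining[0], remaining[1:]
        best_groups, best_total = (), float("-inf")
        # The first remaining word must be in some group, so only try groups
        # containing it; this avoids visiting a partition in several orders.
        for partners in itertools.combinations(rest, group_size - 1):
            group = (first, *partners)
            left = tuple(w for w in rest if w not in partners)
            sub_groups, sub_total = search(left)
            total = score_fn(group) + sub_total
            if total > best_total:
                best_groups, best_total = (group, *sub_groups), total
        return best_groups, best_total

    groups, total = search(tuple(words))
    return [list(g) for g in groups], total
```

To plug in the scorer from this post, something like score_fn=lambda g: score_group(list(g), embeddings, idx_map) should work. The same group can be scored more than once across branches, so caching score_fn is worth considering if scoring is slow.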

Putting it all together

def solve_puzzle(words: list[str]):
    embeddings = embed_words(words)
    cover, score = pick_best_cover(words, embeddings, top_k=100)
    if cover is None:
        print("No non-overlapping cover found in top candidates; try increasing top_k.")
        return None
    # Return groups in descending score order
    groups_sorted = sorted(cover, key=lambda x: x[1], reverse=True)
    return [ g[0] for g in groups_sorted ]

Web interface (Flask)

# app.py
from flask import Flask, request, render_template
from solver import solve_puzzle

app = Flask(__name__)

@app.route('/', methods=['GET','POST'])
def home():
    if request.method == 'POST':
        text = request.form['words']
        words = [ w.strip() for w in text.split(',') if w.strip() ]
        if len(words) != 16:
            return render_template('index.html', error="Please enter exactly 16 comma-separated words.")
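        groups = solve_puzzle(words)
        if groups is None:
            # solve_puzzle returns None when no non-overlapping cover was found
            return render_template('index.html', error="No grouping found for these words.")
        return render_template('index.html', groups=groups, original=words)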
    return render_template('index.html')

if __name__ == "__main__":
    app.run(debug=True)

And the HTML template (templates/index.html) might look like:

<!doctype html>
<html>
<head><title>NYT Connections Solver</title></head>
<body>
  <h1>NYT Connections Solver</h1>
  {% if error %}
    <p style="color:red">{{ error }}</p>
  {% endif %}
  <form method="post">
    <textarea name="words" rows="4" cols="80" placeholder="Enter your 16 words, comma separated"></textarea><br>
    <button type="submit">Solve</button>
  </form>
  {% if groups %}
    <h2>Suggested Groups</h2>
    <ol>
    {% for grp in groups %}
      <li>{{ grp | join(', ') }}</li>
    {% endfor %}
    </ol>
  {% endif %}
</body>
</html>
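The form handler above splits on commas only. A slightly more forgiving parser could also accept one word per line and reject duplicates, which matters because generate_all_groups builds a word-to-index map that would collapse repeated words. This is a hedged sketch; parse_words is a hypothetical helper, not part of the app shown here.

```python
import re

def parse_words(text: str, expected: int = 16) -> list[str]:
    """Split pasted text on commas or newlines, trim, and validate."""
    words = [w.strip() for w in re.split(r"[,\n]+", text) if w.strip()]
    if len(words) != expected:
        raise ValueError(f"Expected {expected} words, got {len(words)}.")
    # Compare case-insensitively so "Apple" and "APPLE" count as duplicates
    if len({w.casefold() for w in words}) != len(words):
        raise ValueError("Duplicate words found.")
    return words

print(parse_words("HOT DOG, bun,\nketchup , mustard", expected=4))
```

Multi-word entries such as "HOT DOG" survive because only commas and newlines act as separators.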

What this solver achieves & its limitations

Achievements

It automates the tedious enumeration of candidate 4-word sets and ranks them by a plausible score.
It supports a hint mode: if you only want the top-1 group first, you can show just the highest-scoring group, and then reveal more if needed.
It helps you learn: by seeing which groups the solver thinks are strongest, you can compare your intuition and perhaps improve your strategy.
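A hint mode along these lines can be very small. The sketch below assumes the groups are already sorted best-first, as solve_puzzle returns them; reveal_groups and one_word_hint are hypothetical names.

```python
def reveal_groups(groups: list[list[str]], n_groups: int = 1) -> list[list[str]]:
    """Return only the first n_groups groups (assumed sorted best-first)."""
    return groups[:max(0, n_groups)]

def one_word_hint(groups: list[list[str]]) -> str:
    """Return a single word from the highest-scoring group."""
    return groups[0][0]

solved = [["MON", "TUE", "WED", "THU"], ["RED", "BLUE", "GREEN", "PINK"]]
print(one_word_hint(solved))     # MON
print(reveal_groups(solved, 1))  # [['MON', 'TUE', 'WED', 'THU']]
```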

Limitations & caveats

Embedding similarity is not enough: Many of the hardest categories in Connections rely on subtle word-play (e.g., words that sound like letters, puns, multi-word expressions) or very specific cultural references that embeddings don’t necessarily capture. As one academic paper on arXiv points out, this puzzle is “a deceptively simple text classification task that stumps system-1 thinkers”.
Overfitting/False positives: The solver might pick 4 words that are “very similar” in meaning (e.g., synonyms) but that don’t match what the NYT theme intended. The theme might be less about “meaning” and more about “sound”, or “prefix/suffix”, or “it appears in a phrase”.
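Brute-force covering is expensive: For 16 words there are 1,820 candidate groups and 2,627,625 ways to partition the grid into four groups of four, which is manageable. Naively trying every combination of 4 groups out of all 1,820 candidates, however, means roughly 4.6 × 10^11 combinations, which is why the code above restricts itself to top_k. Larger grids (say 20 words or more) would scale poorly either way.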
No guarantee of uniqueness: The real puzzle has exactly one solution, but the solver might find multiple plausible covers with similar scores. The user still needs human judgment to decide which one “fits” best.
No direct integration: This solver does not (and cannot legally) fetch the official NYT puzzle data automatically (without permission). So you’ll likely rely on manual entry of the 16 words or a scraping workaround (which may run afoul of terms of service).

UX considerations

When I built the web UI I kept the following user-experience aspects in mind:

Minimal friction: Paste 16 words → click Solve → get results. No login required, no complex setup.
Progressive disclosure: Some users just want one hint (e.g., “give me the easiest category”). Others want the full solution. So I added a “Hint only” mode.
Highlighting uncertainty: Since the solver is approximate, I indicate a “confidence score” (e.g., grouped words with score > 0.8 are likely) and perhaps highlight groups with lower confidence so the user knows “this one might be wrong”.
Color-coding: Even though I don’t know the exact NYT colour assignment, I choose to display groups in order of descending score and label them “Likely easiest”, “Likely moderate”, etc.
Mobile friendly: Many users play the puzzle on mobile; so the web UI is responsive and the textarea is thoughtfully sized.
Educational value: I put an option “Explain why this group might work” which shows a small explanation (based on the lexical/embedding heuristics) so the user can learn: e.g., “These words are very close in embedding space” or “These words share a common prefix/suffix pattern”.
Respect for puzzle integrity: If you embed a full “solve” mode, you are essentially spoiling the puzzle; so I also added a moral disclaimer to encourage users to try the puzzle on their own first before using full solve mode.
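The uncertainty and labelling ideas above could be combined in one small helper. In this sketch the 0.8 threshold is the example value mentioned above, the labels beyond “Likely easiest” and “Likely moderate” are my own placeholders, and annotate_groups is a hypothetical name.

```python
LABELS = ["Likely easiest", "Likely moderate", "Likely harder", "Likely hardest"]

def annotate_groups(
    scored_groups: list[tuple[list[str], float]], threshold: float = 0.8
) -> list[dict]:
    """Sort groups by score and attach a difficulty label and confidence flag."""
    ranked = sorted(scored_groups, key=lambda g: g[1], reverse=True)
    annotated = []
    for rank, (words, score) in enumerate(ranked):
        annotated.append({
            "words": words,
            "score": round(score, 3),
            # Reuse the last label if there are more groups than labels
            "label": LABELS[min(rank, len(LABELS) - 1)],
            "confident": score > threshold,
        })
    return annotated
```

The template can then render the "confident" flag as a visual cue, for example greying out groups that fall below the threshold.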

Strategy & heuristics inspired by human players

When I studied how human players play Connections, I noticed some common heuristics and thinking-patterns. I built some of these into the solver (or at least tried to mimic them):

Scan for the obvious – Some groups are very straightforward (the so-called “Yellow” easiest category). For instance, “Monday”, “Tuesday”, “Wednesday”, “Thursday” might appear together as days of the week. Word-game blogs call this out.
Think about multi-meaning words / word-play – Some words in the grid may belong to multiple plausible groups, but the correct theme might hinge on a pun, a prefix, or a sound-alike. One Reddit comment:

“Watch out for words that seem to belong to multiple categories!”
My solver tries to penalise overly ambiguous candidates (e.g., words with many synsets).

Eliminate red herrings – The NYT puzzle makers often plant distractor words that look like they go in one category but don’t. For example, a word may appear to be a “color” but in fact is part of a phrase that links to something else. Recognising this helps. I introduced a heuristic: if a word easily fits several high-score groups, then maybe it’s a distractor.
Solve 3 groups and the last one falls in place – Many players recognise that once 12 words are grouped, the remaining 4 must form the final category. This reduces search space. One comment remarks:

“I fall for the traps all the damn time lol”
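The red-herring heuristic above (a word that fits several high-scoring groups may be a distractor) can be sketched by counting appearances among the top candidates. This is an illustration under assumptions, not the exact code in my solver: likely_distractors, top_n and min_count are hypothetical, and the input is the (words, score) list that generate_all_groups returns.

```python
from collections import Counter

def likely_distractors(
    candidates: list[tuple[list[str], float]], top_n: int = 20, min_count: int = 2
) -> list[str]:
    """Words that appear in at least min_count of the top_n scored groups."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_n]
    counts = Counter(word for words, _ in ranked for word in words)
    return sorted(w for w, n in counts.items() if n >= min_count)

toy = [
    (["a", "b", "c", "d"], 0.9),
    (["a", "e", "f", "g"], 0.8),
    (["h", "i", "j", "k"], 0.1),
]
print(likely_distractors(toy, top_n=2))  # ['a']
```

One caveat: a word in a genuinely strong group also shows up in many near-variants of that group, so these counts are a prompt for human judgment more than a verdict.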

Final Thoughts

Building a NYT Connections Solver is a perfect small NLP project:

It’s conceptually simple.
It touches on embeddings, clustering, and visualization.
It’s fun and rewarding to see AI recognize human-like patterns.

This project taught me a lot about how meaning can be quantified — and how sometimes, even machines can make surprising connections.

DEV Community