Michael Truong

Posted on Jun 4 • Edited on Jul 8

Schema first, prompt second: valid JSON wasn't enough

#ai #webdev #typescript #node

Over the last month I've been building Codenames AI, a small web game where an LLM plays Codenames with you. The guesser never sees unrevealed card identities. The server sends the board state and a clue; the model returns structured guesses with confidence scores and short explanations.

When I started, I assumed the hard part was prompting. I was half right. Getting something reasonable out of the model was fast. Making the system safe to expose to players was not.

My first milestone felt responsible: response_format: { type: "json_object" } on the chat completion, plus Zod schemas for the response body. If the JSON didn't parse or failed Zod, retry. Ship it.

Then I watched the model comply perfectly with the schema and still propose moves that would ruin a game.

Valid JSON, invalid game

Here's the distinction that mattered.

JSON schema (via Zod) answers: Did the model return the keys and types I asked for?

Domain validation answers: Is this output allowed on this board, for this clue, under these rules?

Those are not the same questions.

Three examples I hit while testing and running the game:

1. The model echoed the clue as a guess.

Codenames forbids guessing the clue word. The model would sometimes put it in guesses[] anyway—confidently, with a tidy explanation object. Zod was thrilled. The game was not.

2. The model hallucinated words that weren't on the board.

Perfect JSON. A guess list full of words that don't exist on the 25-card grid, or that were already revealed. Again, schema-valid.

3. The spymaster returned illegal clues.

Single-word clues can't match a codename, can't be a substring of one (or vice versa), and can't be near-miss spellings. The model regularly suggested clues that a human referee would reject. Valid JSON every time.

I spent too long fixing these by adding sentences to the system prompt. That helped a little. It did not help enough.

What actually moved reliability

The bigger wins came from code paths I treated as boring infrastructure.

Sanitization before trust. After Zod parses the guess payload, we strip clue echoes, off-board words, revealed cards, and duplicates, then realign the explanation array with whatever survived. The model can return whatever explanation it wants; the server decides which guesses survive validation.
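A minimal sketch of that sanitization pass, not the game's actual code. The `GuessPayload` and `BoardCard` shapes, the `sanitizeGuesses` name, and the assumption that explanations are a string array parallel to `guesses` are all hypothetical:

```typescript
// Hypothetical shapes: the real payload types are not shown in this post.
interface GuessPayload {
  guesses: string[];
  explanations: string[]; // assumed to line up with guesses, index for index
}

interface BoardCard {
  word: string;
  revealed: boolean;
}

const normalize = (word: string): string => word.trim().toUpperCase();

function sanitizeGuesses(
  payload: GuessPayload,
  board: BoardCard[],
  clue: string,
): GuessPayload {
  // Only unrevealed board words are legal guesses.
  const guessable = new Set(
    board.filter((card) => !card.revealed).map((card) => normalize(card.word)),
  );
  const clueWord = normalize(clue);
  const seen = new Set<string>();
  const guesses: string[] = [];
  const explanations: string[] = [];

  payload.guesses.forEach((raw, index) => {
    const word = normalize(raw);
    if (word === clueWord) return; // clue echo
    if (!guessable.has(word)) return; // off-board or already revealed
    if (seen.has(word)) return; // duplicate
    seen.add(word);
    guesses.push(word);
    // Keep each surviving guess paired with its original explanation.
    explanations.push(payload.explanations[index] ?? "");
  });

  return { guesses, explanations };
}
```

The function is pure, so it can be unit-tested against recorded model responses without calling the model.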

Deterministic validators with explicit error strings. Clue validation returns things like "Clue cannot be a substring of a board word" rather than "invalid." Those strings go back into the next attempt as rejectionFeedback, alongside an exclude list of clue words that already failed, so the next attempt can avoid repeating the same violations.
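Roughly what such a validator could look like. This is a sketch under assumptions: only the substring message is quoted from the real game, the other error strings are made up for illustration, "near-miss spelling" is approximated as an edit distance of 1 or less, and the `RetryContext` shape around `rejectionFeedback` is hypothetical:

```typescript
type ClueCheck = { ok: true } | { ok: false; reason: string };

// Classic Levenshtein distance, computed with a single rolling row.
function editDistance(a: string, b: string): number {
  const row: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

function validateClue(clue: string, boardWords: string[]): ClueCheck {
  const candidate = clue.trim().toUpperCase();
  if (candidate.length === 0 || /\s/.test(candidate)) {
    return { ok: false, reason: "Clue must be a single word" };
  }
  for (const raw of boardWords) {
    const word = raw.trim().toUpperCase();
    if (candidate === word) {
      return { ok: false, reason: "Clue cannot match a board word" };
    }
    if (word.includes(candidate)) {
      return { ok: false, reason: "Clue cannot be a substring of a board word" };
    }
    if (candidate.includes(word)) {
      return { ok: false, reason: "Clue cannot contain a board word" };
    }
    // Assumed threshold: one edit away counts as a near miss.
    if (editDistance(candidate, word) <= 1) {
      return { ok: false, reason: "Clue cannot be a near-miss spelling of a board word" };
    }
  }
  return { ok: true };
}

// Hypothetical retry state: named reasons plus clue words to avoid next time.
interface RetryContext {
  rejectionFeedback: string[];
  exclude: string[];
}

function recordRejection(ctx: RetryContext, clue: string, reason: string): RetryContext {
  return {
    rejectionFeedback: [...ctx.rejectionFeedback, `"${clue}": ${reason}`],
    exclude: [...ctx.exclude, clue.trim().toUpperCase()],
  };
}
```

Each failed attempt appends to the context, and the next prompt can include both lists verbatim.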

Post-processing for uncertainty. Even valid guesses get filtered by a confidence threshold before the client plays them. If nothing clears the bar, the API returns an empty guess list—the AI Guesser passes the turn rather than firing a weak pick. That's a product decision, but it only works because the earlier layers stopped nonsense from masquerading as success.
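The threshold step itself is small. A sketch, assuming confidence is a number between 0 and 1 and that the `ScoredGuess` shape and the threshold value are placeholders rather than the game's real ones:

```typescript
// Hypothetical shape for a guess that already passed sanitization and validation.
interface ScoredGuess {
  word: string;
  confidence: number; // assumed to be in the range 0..1
}

// Keeps only guesses at or above the threshold.
// An empty result means the AI Guesser passes the turn.
function filterByConfidence(guesses: ScoredGuess[], threshold: number): ScoredGuess[] {
  return guesses.filter(
    (guess) => Number.isFinite(guess.confidence) && guess.confidence >= threshold,
  );
}
```

Treating a non-finite confidence as "below the bar" is a choice made for this sketch: a malformed score results in a pass rather than a played card.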

None of this requires knowing Codenames. It's the same shape as any LLM feature with invariants: inventory counts that can't go negative, user IDs that must exist, action enums that must match state machines.

Mistakes, surprises and tradeoffs

Mistake: Treating structured output as the guardrail. It only enforced shape.

Surprise: Sanitization outperformed prompt engineering for the dumbest failures (echoed clue, off-board tokens). Cheap deterministic filters beat another paragraph of "IMPORTANT RULES."

Surprise: Retry feedback with the reason a clue failed worked better than "try again." The model stopped repeating substring violations faster when the server named the violation.

Tradeoff: Retries burn tokens. Logging validation errors per attempt was essential to know whether we had a prompt problem or a missing rule.

Tradeoff: Sanitization can mask drift. If you silently drop bad guesses, monitor what you're dropping or you'll quietly turn the validator into the thing making all the decisions.
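One way to keep that visible is to have the sanitizer report what it dropped and why, then aggregate per reason. A sketch with hypothetical names (`DropReason`, `tallyDrops`, `dropRate`); the post's actual logging is not shown here:

```typescript
// Hypothetical drop reasons, mirroring the sanitization rules described above.
type DropReason = "clue_echo" | "off_board" | "already_revealed" | "duplicate";

interface DroppedGuess {
  word: string;
  reason: DropReason;
}

// Counts dropped guesses per reason so a spike in one category stands out.
function tallyDrops(dropped: DroppedGuess[]): Record<DropReason, number> {
  const counts: Record<DropReason, number> = {
    clue_echo: 0,
    off_board: 0,
    already_revealed: 0,
    duplicate: 0,
  };
  for (const entry of dropped) {
    counts[entry.reason] += 1;
  }
  return counts;
}

// Fraction of the model's guesses that the sanitizer removed.
function dropRate(totalGuesses: number, dropped: DroppedGuess[]): number {
  return totalGuesses === 0 ? 0 : dropped.length / totalGuesses;
}
```

A rising drop rate after a prompt or model change is the signal that the validator is doing work the model used to do.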

What I'd do on the next project

1. Define the wire shape (JSON + schema).
2. List domain invariants as pure functions with test cases.
3. Add sanitization for the failure modes observed in the first 50 live calls.
4. Only then invest in prompt nuance, and feed validator messages into retries.

Prompt engineering still matters for quality. It is not a substitute for enforcement when the user can lose a game—or money, or data—because the model followed the JSON spec and ignored reality.

Takeaway: If your LLM integration stops at "parse JSON, call it a day," you haven't finished the feature. You've finished the demo.

If you'd like to see the project that inspired these lessons, you can try Codenames AI.

Top comments (7)

xulingfeng • Jun 4

The valid-JSON-invalid-game distinction is the same gap we hit with agent memory validation — Zod (or any schema) tells you the shape is right, not whether the content should govern action. The Codenames example makes it concrete in a way abstract architecture talk doesn't. In our case we added a separate domain-validation pass (authority layer) after the schema pass, and it caught things like superseded policy being treated as current. Do you keep both layers in the same service or split them (schema at the edge, domain validation closer to the game logic)?

Michael Truong • Jun 5

Same service. I kept them as separate layers in code, not separate services.

Schema validation runs right after the model response (JSON parse → Zod). Its job is just “did I get the shape I expected?” Once that passes, game-specific validators take over: board membership, revealed cards, clue legality, explanation alignment, intent targets, etc.
On the guesser path there’s also a sanitization step in between that strips clue echoes, off-board words, and duplicates before the strict checks run.

What surprised me was how much still wasn’t game-valid even when the schema passed. I initially assumed JSON mode plus a schema would eliminate most of the reliability work, but it really only solved parsing and shape.

Confidence filtering is a step after that. Once validation passes, low-confidence guesses get dropped before anything hits the client.

Your memory example sounds similar. The retrieved memory may be well-formed, but you still need a second layer that decides whether it’s current, authoritative, or even applicable to the action being taken.

Glad the Codenames example made that distinction clearer. That was exactly what I was aiming for.

xulingfeng • Jun 5

This is exactly the kind of layered validation I push for in test automation — most teams stop at the first layer and call it done.

"Schema passed? Ship it." Then production finds the gaps. Your Codenames example is exactly what we see in testing: a test can pass schema validation and tell you nothing about whether the feature actually works.

The part that stood out to me as a QA person: your sanitization step that strips clue echoes and off-board words before the strict checks. That's "defensive validation" — treat the input as dirty until proven clean. Most pipelines do it the other way around.

And the error-string feedback — returning "Clue cannot be a substring" instead of "invalid" — that's solid assertion design. A pass/fail tells you nothing. A named failure tells the system what to fix.

Three validation layers, each with a different job, none trusting the one before it. Good LLM architecture and good QA architecture look the same.

Michael Truong • Jun 5

The QA parallel makes sense to me. "Schema passed" and "feature works" really are different assertions.

On defensive validation, that's exactly what the guess sanitization step became. We strip clue echoes, off-board words, and duplicates before the strict checks run rather than treating a schema-clean payload as trustworthy.

The named error feedback ended up paying off most on the spymaster retry loop. Returning something like "Clue cannot be a substring of an unrevealed board word" gave the model a much better chance of recovering than a generic "invalid clue". We also fed back previously rejected clues so it wouldn't keep trying the same ideas, a problem that got worse whenever the model's output was more deterministic than usual.

One extension we added later was Judge mode for the spymaster. Instead of generating a single clue and retrying on failure, we generate several candidates in one batch, run the validators across all of them, and expect some to be rejected. If enough survive, a judge pass picks the winner. In practice that was usually cheaper than serial retries because one response gave us multiple shots and the pruning happened in code.

Your "three layers, none trusting the one before" framing matches how it felt in practice.

xulingfeng • Jun 5

Judge mode is smart — basically a mini A/B test in one LLM call, then let code pick the winner. That's parallel test execution applied to prompt engineering.
The "feed back rejected clues" bit hits hard. I've seen the exact same pattern in test case generation: if you don't tell the system why something failed AND what's already been tried, it loops on the same broken idea forever. That's not a model problem, it's a feedback design problem.
One question though: did the Judge ever fall into consensus bias — picking the safest candidate over the actually best one? I've seen that happen when the evaluator and the generator share too much context.

Michael Truong • Jun 5

On consensus bias: partly yes, by design. The judge was primarily a safety referee rather than a creativity ranker. If all the surviving candidates were mediocre, it would usually choose the safest option. That trade-off was acceptable because an over-cautious clue still produces a playable game, whereas a clue that drifts toward the assassin can end it immediately.

We eventually ended up experimenting with alternative selection strategies because different selectors produced noticeably different personalities. That turned out to be almost as interesting as the clue generation itself.

xulingfeng • Jun 5

The safety-referee vs creativity-ranker framing clicks. In testing we call that the false-positive/false-negative trade-off — you tune for the scenario that costs more to miss, not the one you'd ideally have.
The selector-personality link is the part that'd keep me up at night though. If the same generation pipeline produces measurably different game feels depending on which selector runs, then the selection strategy isn't just a filter — it's a creative director you didn't hire. That's the kind of emergent behavior that makes this space fascinating.
Appreciate you sharing the internals 🙌