<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Truong</title>
    <description>The latest articles on DEV Community by Michael Truong (@michaeltruong).</description>
    <link>https://dev.to/michaeltruong</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3965775%2F868d43f8-59c8-45ca-93f1-3f2428fb222d.jpg</url>
      <title>DEV Community: Michael Truong</title>
      <link>https://dev.to/michaeltruong</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michaeltruong"/>
    <language>en</language>
    <item>
      <title>The agent plan had every step except where to stop</title>
      <dc:creator>Michael Truong</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:29:47 +0000</pubDate>
      <link>https://dev.to/michaeltruong/the-agent-plan-had-every-step-except-where-to-stop-357h</link>
      <guid>https://dev.to/michaeltruong/the-agent-plan-had-every-step-except-where-to-stop-357h</guid>
      <description>&lt;p&gt;I've been running multi-slice agent plans in the &lt;a href="https://codenames-ai.com/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=agent-plans-authority-handoffs&amp;amp;utm_content=intro" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt; repo — Renovate migrations, content-pipeline skills, dependency upgrades. I split multi-PR work into &lt;strong&gt;slices&lt;/strong&gt; (usually one pull request each), each backed by a markdown file with file paths, verification commands, and merge-safe acceptance criteria.&lt;/p&gt;

&lt;p&gt;You do not need Cursor to recognize the shape: any agent workflow that can open branches, push commits, or merge PRs from a written plan has the same gap. In my setup I paste each slice into a fresh agent chat as a delegation prompt — not a ticket summary, but executable instructions — and start a new chat when that PR is ready.&lt;/p&gt;

&lt;p&gt;I assumed the checklist was enough. The plan described &lt;em&gt;what&lt;/em&gt; to build. I treated &lt;em&gt;how far the agent could go&lt;/em&gt; as implicit.&lt;/p&gt;

&lt;p&gt;Then an agent merged a pull request I expected to review first.&lt;/p&gt;

&lt;h2&gt;The merge that reframed planning&lt;/h2&gt;

&lt;p&gt;The trigger was mundane. During the first slice of a Renovate migration, an agent regrouped dependency buckets in &lt;code&gt;renovate.json&lt;/code&gt; — config-only, no version bumps, no runtime behavior. It ran lint and typecheck, opened the pull request, and merged it.&lt;/p&gt;

&lt;p&gt;The change itself was reasonable. Config-only &lt;code&gt;renovate.json&lt;/code&gt; regrouping is exactly the kind of slice you'd want off your plate.&lt;/p&gt;

&lt;p&gt;What surprised me was the &lt;em&gt;absence of a documented stop line&lt;/em&gt;. The migration plan described the edit, the verification commands, and the acceptance criteria. It did not say whether the executing agent should stop at "open PR" or continue to "merge after green checks." The plan was an implementation spec. The agent treated it as permission to finish the job.&lt;/p&gt;

&lt;h2&gt;Implementation specs vs authority handoffs&lt;/h2&gt;

&lt;p&gt;Traditional engineering plans answer: &lt;strong&gt;what work should happen, in what order, with what verification?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agent plans increasingly need a second answer: &lt;strong&gt;how much autonomy does the next actor get?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Those questions diverge the moment an agent can take repository actions — create branches, push commits, open pull requests, merge — instead of only recommending diffs in chat.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Implementation plan&lt;/th&gt;
&lt;th&gt;Authority handoff&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What to change&lt;/td&gt;
&lt;td&gt;File paths, diffs, acceptance&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How to verify&lt;/td&gt;
&lt;td&gt;Commands, CI checks&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Where to stop&lt;/td&gt;
&lt;td&gt;Often implicit ("human reviews")&lt;/td&gt;
&lt;td&gt;Must be explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Who enforces limits&lt;/td&gt;
&lt;td&gt;Code review habit&lt;/td&gt;
&lt;td&gt;Plan recommendation + branch protection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A human teammate might read "prepare this for review" and stop. An agent reads a completed checklist and reasonably asks: "Verification passed — what's left?"&lt;/p&gt;

&lt;h2&gt;The first response wasn't the plan&lt;/h2&gt;

&lt;p&gt;My first reaction was not to rewrite the migration plan. It was to tighten the repository boundary.&lt;/p&gt;

&lt;p&gt;Branch protection became the safety layer GitHub enforces when the plan stays silent: required CI checks on &lt;code&gt;main&lt;/code&gt;, review rules, and merge gates. It is infrastructure answering "may this land on &lt;code&gt;main&lt;/code&gt;?" regardless of what the agent thought the plan implied.&lt;/p&gt;

&lt;p&gt;That helped. It also surfaced the next question: if branch protection is the final gate, what should the &lt;em&gt;plan&lt;/em&gt; say about intent before the gate?&lt;/p&gt;

&lt;p&gt;Repository guardrails and plan language solve different problems. Branch protection is authoritative — if merge is blocked, the agent stops. But protection alone does not tell the agent whether &lt;em&gt;this slice&lt;/em&gt; was supposed to end at an open PR or proceed to merge. You still need the handoff to be legible before someone reviews the diff.&lt;/p&gt;

&lt;h2&gt;Making execution authority explicit&lt;/h2&gt;

&lt;p&gt;The follow-up was documentation, not a ban on agent merges.&lt;/p&gt;

&lt;p&gt;The portable fix: before any implementation detail, every slice names exactly how far the executor may go. We use two levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Agent instruction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Open PR only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Do not merge. Stop after opening the PR.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elevated&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Merge granted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You may merge after documented verification passes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Default is &lt;strong&gt;Open PR only&lt;/strong&gt;. &lt;strong&gt;Merge granted&lt;/strong&gt; requires explicit rationale — config-only changes, docs-only closure PRs, isolated tooling with green CI. Branch protection remains the final gate even when merge is recommended.&lt;/p&gt;

&lt;p&gt;Each slice also states &lt;strong&gt;Rationale&lt;/strong&gt; (why this level fits) and copies the &lt;strong&gt;Agent instruction&lt;/strong&gt; verbatim into the prompt so a fresh chat is self-contained. A plan-level summary table at the top lets you scan a multi-PR plan and see where merge is elevated before you read file paths.&lt;/p&gt;
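
&lt;p&gt;As a rough illustration of that block, here is a minimal Python sketch that renders the authority header for a slice and the plan-level summary table. The labels and agent instructions come from the table above; the names &lt;code&gt;Slice&lt;/code&gt;, &lt;code&gt;authority_block&lt;/code&gt;, and &lt;code&gt;summary_table&lt;/code&gt;, the slice titles, and the exact header wording are assumptions made for the example, not the contents of the real plan template.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Authority(Enum):
    """The two levels from the table, each paired with its agent instruction."""

    OPEN_PR_ONLY = ("Open PR only", "Do not merge. Stop after opening the PR.")
    MERGE_GRANTED = ("Merge granted", "You may merge after documented verification passes.")

    @property
    def label(self) -> str:
        return self.value[0]

    @property
    def instruction(self) -> str:
        return self.value[1]


@dataclass(frozen=True)
class Slice:
    title: str      # hypothetical slice title
    rationale: str  # why this level fits
    authority: Authority = Authority.OPEN_PR_ONLY  # default level


def authority_block(s: Slice) -> str:
    """Render the lines that sit at the top of the delegation prompt."""
    return "\n".join([
        f"Recommended execution authority: {s.authority.label}",
        f"Rationale: {s.rationale}",
        f"Agent instruction: {s.authority.instruction}",
    ])


def summary_table(slices: list[Slice]) -> str:
    """One markdown row per slice so elevated merges stand out when scanning."""
    rows = ["| Slice | Authority |", "| --- | --- |"]
    rows += [f"| {s.title} | {s.authority.label} |" for s in slices]
    return "\n".join(rows)


plan = [
    Slice("Slice 1: regroup renovate.json", "config-only regrouping", Authority.MERGE_GRANTED),
    Slice("Slice 2: dependency bumps", "runtime-adjacent dependency bumps need human review"),
]
print(summary_table(plan))
print(authority_block(plan[1]))
```

&lt;p&gt;Because the default lives in the data structure, a slice that never mentions authority still renders as Open PR only, which matches the rule that elevation has to be argued for.&lt;/p&gt;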

&lt;p&gt;&lt;strong&gt;Handoff model:&lt;/strong&gt; On the regrouping slice, the checklist implied edit, verify, and open PR; nothing stated whether merge was in scope, so the agent treated verification success as permission to finish. The chain we wanted spelled out: plan recommends authority → human accepts by executing the plan → agent follows the recommendation → branch protection enforces the final boundary.&lt;/p&gt;

&lt;p&gt;In our private repo, a follow-up docs change codified this as &lt;strong&gt;Recommended execution authority&lt;/strong&gt; in our planning standards and plan template — motivated directly by the regrouping merge. You do not need those files to apply the pattern; you need the label on every slice before the agent reads the checklist.&lt;/p&gt;

&lt;p&gt;The Renovate migration's first slice is the motivating example: config-only grouping where merge &lt;em&gt;can&lt;/em&gt; be reasonable — if the plan says so out loud.&lt;/p&gt;

&lt;h2&gt;What changed on the next slice&lt;/h2&gt;

&lt;p&gt;The Renovate migration's second slice was the first prompt I rewrote with authority at the top: &lt;strong&gt;Open PR only&lt;/strong&gt;, a one-line rationale ("runtime-adjacent dependency bumps need human review"), and an imperative agent instruction copied verbatim into the chat. The regrouping slice would have been legible with the same block — either &lt;strong&gt;Merge granted&lt;/strong&gt; with rationale for config-only regrouping, or explicitly &lt;strong&gt;Open PR only&lt;/strong&gt;; silence defaulted to "finish the job."&lt;/p&gt;

&lt;p&gt;I am not arguing for autonomous merge bots on every repo. The lesson is narrower: &lt;strong&gt;once agents act, plans delegate autonomy whether you write that down or not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Human delegation has always been fuzzy — "take a pass at this" means different things to different people. Agent delegation punishes ambiguity faster because the agent will complete every step it can justify from the text in front of it.&lt;/p&gt;

&lt;p&gt;The plan becomes the contract between author and executor. Implementation steps say what to build. Authority steps say how far to carry it.&lt;/p&gt;

&lt;h2&gt;Why not just forbid agent merges?&lt;/h2&gt;

&lt;p&gt;Fair pushback. If unexpected merges are the risk, disable merge capability and be done.&lt;/p&gt;

&lt;p&gt;That misses what actually happened on the regrouping merge. The merge was not reckless — it was a config-only change with local verification and CI checks. Forbidding all agent merges would have blocked a useful outcome and pushed the work back to manual toil.&lt;/p&gt;

&lt;p&gt;The interesting conclusion is not "agents should never merge." It is &lt;strong&gt;"agents need explicit authority boundaries."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the right recommendation is &lt;strong&gt;Open PR only&lt;/strong&gt; — runtime migrations, sensitive paths, slices that need human judgment before landing. Sometimes &lt;strong&gt;Merge granted&lt;/strong&gt; is appropriate — docs-only closure, config-only regrouping, low-risk tooling with clear verification. The plan author chooses per slice. The agent follows the label. Branch protection catches mistakes either way.&lt;/p&gt;

&lt;p&gt;Without the label, the agent invents its own stopping point from task completion heuristics. That is how you get surprised by a merge that was, by some readings, the correct next step.&lt;/p&gt;

&lt;h2&gt;What I'd do on the next agent plan&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default every slice to Open PR only&lt;/strong&gt; unless I can defend merge with rationale and verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Put authority at the top of each slice&lt;/strong&gt; — recommended level, rationale, imperative agent instruction — not buried after acceptance criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mirror authority in the plan-level summary table&lt;/strong&gt; so scanning a multi-PR plan shows where elevation happens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat branch protection as enforcement, not specification&lt;/strong&gt; — it blocks bad merges; it does not replace telling the agent where to stop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-read the plan as a handoff&lt;/strong&gt;, not a spec: if I pasted this into a fresh agent chat, would "stop after PR" vs "merge after green CI" be unambiguous?&lt;/li&gt;
&lt;/ol&gt;
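
&lt;p&gt;The "re-read as a handoff" check can be partly mechanized. The sketch below is one possible lint, assuming a hypothetical plan format in which each slice starts with a &lt;code&gt;## Slice&lt;/code&gt; heading, the authority line starts with &lt;code&gt;Recommended execution authority:&lt;/code&gt;, and checklist items start with &lt;code&gt;- [ ]&lt;/code&gt;. None of those conventions are taken from the real template; adjust them to whatever your plans use.&lt;/p&gt;

```python
LABELS = ("Open PR only", "Merge granted")
AUTHORITY_PREFIX = "Recommended execution authority:"  # assumed line format


def split_slices(plan_md: str) -> dict[str, list[str]]:
    """Group plan lines under each '## Slice' heading (assumed heading style)."""
    slices: dict[str, list[str]] = {}
    current = None
    for line in plan_md.splitlines():
        if line.startswith("## Slice"):
            current = line[3:].strip()
            slices[current] = []
        elif current is not None:
            slices[current].append(line.strip())
    return slices


def missing_authority(plan_md: str) -> list[str]:
    """Return slice titles whose label is absent, unknown, or below the checklist."""
    problems = []
    for title, lines in split_slices(plan_md).items():
        labelled = False
        for line in lines:
            if line.startswith(AUTHORITY_PREFIX):
                labelled = line[len(AUTHORITY_PREFIX):].strip() in LABELS
                break
            if line.startswith("- [ ]"):
                break  # checklist reached before any authority line
        if not labelled:
            problems.append(title)
    return problems


example_plan = """\
## Slice 1: regroup renovate.json
Recommended execution authority: Merge granted
- [ ] run lint and typecheck

## Slice 2: dependency bumps
- [ ] bump packages
Recommended execution authority: Open PR only
"""
print(missing_authority(example_plan))  # ['Slice 2: dependency bumps']
```

&lt;p&gt;Slice 2 is flagged even though it has a label, because the label sits after the checklist and an agent reading top to bottom would reach the work before the stop line.&lt;/p&gt;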

&lt;p&gt;Prompt engineering still matters for implementation quality. It does not substitute for stating how much autonomy you are delegating when the executor can act on the repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; When agents can merge, push, and open PRs, a plan that only describes &lt;em&gt;what&lt;/em&gt; to build is incomplete. You are handing off work &lt;em&gt;and&lt;/em&gt; authority — write both down, or the agent will infer the second from the first.&lt;/p&gt;




&lt;p&gt;If you'd like to see the project that inspired these lessons, you can try &lt;a href="https://codenames-ai.com/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=agent-plans-authority-handoffs&amp;amp;utm_content=footer" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>programming</category>
      <category>governance</category>
    </item>
    <item>
      <title>One good example beat every AI writing rule I wrote</title>
      <dc:creator>Michael Truong</dc:creator>
      <pubDate>Fri, 12 Jun 2026 07:38:06 +0000</pubDate>
      <link>https://dev.to/michaeltruong/one-good-example-beat-every-ai-writing-rule-i-wrote-7oo</link>
      <guid>https://dev.to/michaeltruong/one-good-example-beat-every-ai-writing-rule-i-wrote-7oo</guid>
      <description>&lt;p&gt;I've been building an AI-assisted content pipeline around &lt;a href="https://codenames-ai.com/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=one-example-beats-style-guide&amp;amp;utm_content=intro" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt; — field reports from the repo, drafted in markdown, synced to dev.to. The part I assumed would be hard was publish automation. The part that actually burned time was teaching the model how to &lt;em&gt;sound&lt;/em&gt; like me.&lt;/p&gt;

&lt;h2&gt;The experiment&lt;/h2&gt;

&lt;p&gt;I started where most people start: the prompt. I wrote a Cursor rule with tone guidance, pacing notes, section shapes, and a list of things to avoid. If the draft felt flat, add another paragraph to the rule. If it over-corrected, tighten the rule. Iterate until the voice stabilizes.&lt;/p&gt;

&lt;p&gt;That felt like the correct lever.&lt;/p&gt;

&lt;p&gt;I assumed a longer, more detailed AI writing rule would produce better drafts. Voice felt like something you could specify in prose: a style encyclopedia with tone, pacing, and guardrails.&lt;/p&gt;

&lt;h2&gt;The failure loop&lt;/h2&gt;

&lt;p&gt;Each revision made the output worse in a &lt;em&gt;different&lt;/em&gt; way.&lt;/p&gt;

&lt;p&gt;The cycle was predictable: generate a draft, dislike the tone, add rules, get over-correction, revert partway, add different rules, hit a new failure mode. Some passes sounded like generic engineering docs: correct, but missing the observations that made the article worth reading. Others had no concrete details. Others followed every instruction and lost personality entirely.&lt;/p&gt;

&lt;p&gt;The rule file kept growing. The drafts kept rotating through new ways to miss the mark.&lt;/p&gt;

&lt;h2&gt;The accidental discovery&lt;/h2&gt;

&lt;p&gt;The useful move, in hindsight, was deleting most of the rules.&lt;/p&gt;

&lt;p&gt;I replaced the checklist with one shipped article: &lt;a href="https://dev.to/michaeltruong/schema-first-prompt-second-valid-json-wasnt-enough-3nhm"&gt;Schema first, prompt second: valid JSON wasn't enough&lt;/a&gt;. That post already had the shape I wanted — field report, wrong assumption up front, specific failures, tradeoffs, a single takeaway.&lt;/p&gt;

&lt;p&gt;The Cursor rule shrank to a pointer: read the example, match the example.&lt;/p&gt;

&lt;p&gt;"Write more like this article" beat "be direct, avoid metaphors, use short paragraphs, include a takeaway."&lt;/p&gt;

&lt;p&gt;Drafts stopped sounding like engineering documentation. They started carrying the observations and pacing of a field report instead of a rule checklist.&lt;/p&gt;

&lt;h2&gt;Why the example transferred better&lt;/h2&gt;

&lt;p&gt;Rules describe voice from the outside. An example demonstrates it.&lt;/p&gt;

&lt;p&gt;For long-form writing, an exemplar turned out to be closer to a spec than a style encyclopedia. Rule text and example text fail differently — a checklist compresses badly; an example carries decisions that are hard to encode as rules: pacing, level of detail, how much context to provide, and when to introduce examples.&lt;/p&gt;

&lt;p&gt;When I asked for "direct engineer-to-engineer tone," the model complied literally and stripped the texture that makes a post readable. When I pointed at a finished article, it copied structural choices I hadn't thought to name: opening with context and a wrong assumption, using bold labels for contrast, ending sections with a concrete mistake instead of a principle.&lt;/p&gt;

&lt;p&gt;The interesting part wasn't that the example contained better instructions. It contained decisions I didn't know how to describe.&lt;/p&gt;

&lt;p&gt;I could recognize those choices when I saw them. I just wasn't very good at encoding them as rules.&lt;/p&gt;

&lt;h2&gt;What changed&lt;/h2&gt;

&lt;p&gt;Git history tells the story cleanly: the checklist-era rule peaked at &lt;strong&gt;69 lines&lt;/strong&gt;; the example-pointer rule landed at &lt;strong&gt;23 lines&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After the switch, I spent less time fighting over-compliance and stripping generic phrasing. Voice became more consistent across drafts because the target was an article, not a growing instruction list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintenance lesson:&lt;/strong&gt; At 69 lines, the rule had enough instructions to contradict itself. A single canonical example stays honest. If the next post should sound different, update the example or add a second one for a new format. The rule stays an import statement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; One example encodes one format. Field reports work; a tutorial or release note might need a second exemplar later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Examples can go stale. If the canonical post ages badly, future drafts inherit the wrong target. Treat the example like code you refactor, not like documentation you forget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd do differently next time:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ship one article I'm proud of before investing in voice rules.&lt;/li&gt;
&lt;li&gt;Point agents at that article.&lt;/li&gt;
&lt;li&gt;Keep the Cursor rule as workflow plus a link, not a paraphrase of the example.&lt;/li&gt;
&lt;li&gt;Add rules only for things examples can't carry: where files live, what not to paste into Notion, publish steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt engineering still matters for facts, structure, and evidence gathering. For &lt;em&gt;tone&lt;/em&gt; on long-form posts, though, one good example beat every style guide I wrote.&lt;/p&gt;

&lt;h2&gt;Takeaway&lt;/h2&gt;

&lt;p&gt;If your AI writing rules keep growing and the drafts keep getting worse, stop adding rules. Find an article that already sounds right and make that the spec.&lt;/p&gt;




&lt;p&gt;If you'd like to see the project behind these posts, try &lt;a href="https://codenames-ai.com/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=one-example-beats-style-guide&amp;amp;utm_content=footer" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Schema first, prompt second: valid JSON wasn't enough</title>
      <dc:creator>Michael Truong</dc:creator>
      <pubDate>Thu, 04 Jun 2026 05:27:30 +0000</pubDate>
      <link>https://dev.to/michaeltruong/schema-first-prompt-second-valid-json-wasnt-enough-3nhm</link>
      <guid>https://dev.to/michaeltruong/schema-first-prompt-second-valid-json-wasnt-enough-3nhm</guid>
      <description>&lt;p&gt;Over the last month I've been building &lt;a href="https://codenames-ai.com/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=schema-first-valid-json-wasnt-enough&amp;amp;utm_content=intro" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt;, a small web game where an LLM plays Codenames with you. The guesser never sees unrevealed card identities. The server sends the board state and a clue; the model returns structured guesses with confidence scores and short explanations.&lt;/p&gt;

&lt;p&gt;When I started, I assumed the hard part was prompting. I was half right. Getting &lt;em&gt;something&lt;/em&gt; reasonable out of the model was fast. Making the system safe to expose to players was not.&lt;/p&gt;

&lt;p&gt;My first milestone felt responsible: &lt;code&gt;response_format: { type: "json_object" }&lt;/code&gt; on the chat completion, plus Zod schemas for the response body. If the JSON didn't parse or failed Zod, retry. Ship it.&lt;/p&gt;

&lt;p&gt;Then I watched the model comply perfectly with the schema and still propose moves that would ruin a game.&lt;/p&gt;

&lt;h2&gt;Valid JSON, invalid game&lt;/h2&gt;

&lt;p&gt;Here's the distinction that mattered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON schema (via Zod) answers:&lt;/strong&gt; Did the model return the keys and types I asked for?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain validation answers:&lt;/strong&gt; Is this output allowed on &lt;em&gt;this&lt;/em&gt; board, for &lt;em&gt;this&lt;/em&gt; clue, under &lt;em&gt;these&lt;/em&gt; rules?&lt;/p&gt;

&lt;p&gt;Those are not the same questions.&lt;/p&gt;

&lt;p&gt;Three examples I hit while testing and running the game:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The model echoed the clue as a guess.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Codenames forbids guessing the clue word. The model would sometimes put it in &lt;code&gt;guesses[]&lt;/code&gt; anyway—confidently, with a tidy explanation object. Zod was thrilled. The game was not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The model hallucinated words that weren't on the board.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Perfect JSON. A guess list full of words that don't exist on the 25-card grid, or that were already revealed. Again, schema-valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The spymaster returned illegal clues.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Single-word clues can't match a codename, can't be a substring of one (or vice versa), and can't be near-miss spellings. The model regularly suggested clues that a human referee would reject. Valid JSON every time.&lt;/p&gt;

&lt;p&gt;I spent too long fixing these by adding sentences to the system prompt. That helped a little. It did not help enough.&lt;/p&gt;

&lt;h2&gt;What actually moved reliability&lt;/h2&gt;

&lt;p&gt;The bigger wins came from code paths I treated as boring infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sanitization before trust.&lt;/strong&gt; After Zod parses the guess payload, we strip clue echoes, off-board words, revealed cards, and duplicates, then realign the explanation array with whatever survived. The model can return whatever explanation it wants; the server decides which guesses survive validation.&lt;/p&gt;
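
&lt;p&gt;A minimal sketch of that sanitization pass, written in Python for brevity rather than copied from the game's server. It assumes explanations are index-aligned with guesses and that words compare case-insensitively; the function name and parameters are hypothetical.&lt;/p&gt;

```python
def sanitize_guesses(
    guesses: list[str],
    explanations: list[str],
    clue: str,
    board: list[str],
    revealed: list[str],
) -> tuple[list[str], list[str]]:
    """Drop clue echoes, off-board words, revealed cards, and duplicates.

    Explanations are assumed to be index-aligned with guesses, so each
    surviving guess keeps its own explanation.
    """
    on_board = {w.lower() for w in board}
    already_revealed = {w.lower() for w in revealed}
    clue_word = clue.strip().lower()
    seen = set()
    kept_guesses, kept_explanations = [], []
    for guess, explanation in zip(guesses, explanations):
        word = guess.strip().lower()
        if word == clue_word:
            continue  # clue echo
        if word not in on_board or word in already_revealed:
            continue  # hallucinated word or already-revealed card
        if word in seen:
            continue  # duplicate
        seen.add(word)
        kept_guesses.append(guess)
        kept_explanations.append(explanation)
    return kept_guesses, kept_explanations


kept, why = sanitize_guesses(
    guesses=["fruit", "apple", "banana", "tiger", "Apple", "river"],
    explanations=["e0", "e1", "e2", "e3", "e4", "e5"],
    clue="fruit",
    board=["apple", "river", "piano", "tiger"],
    revealed=["tiger"],
)
print(kept, why)  # ['apple', 'river'] ['e1', 'e5']
```

&lt;p&gt;Filtering guesses and explanations in the same loop is what keeps the two arrays aligned; filtering them separately is an easy way to attach the wrong explanation to a surviving guess.&lt;/p&gt;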

&lt;p&gt;&lt;strong&gt;Deterministic validators with explicit error strings.&lt;/strong&gt; Clue validation returns things like "Clue cannot be a substring of a board word" rather than a bare "invalid." Those strings go back into the next attempt as &lt;code&gt;rejectionFeedback&lt;/code&gt;, alongside an exclude list of clue words that already failed, so the next attempt can avoid repeating the same violations.&lt;/p&gt;
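
&lt;p&gt;Here is a small Python sketch of that loop under simplifying assumptions. Only the substring message is quoted from the real validator; the other error strings, the &lt;code&gt;propose&lt;/code&gt; callback standing in for the model call, and the attempt limit are made up for illustration, and near-miss spelling checks are left out.&lt;/p&gt;

```python
from typing import Callable, Optional


def validate_clue(clue: str, board: list[str]) -> Optional[str]:
    """Return None if the clue is legal, else a specific rejection reason."""
    c = clue.strip().lower()
    if not c or " " in c:
        return "Clue must be a single word"
    for word in (w.lower() for w in board):
        if c == word:
            return "Clue cannot match a board word"
        if c in word:
            return "Clue cannot be a substring of a board word"
        if word in c:
            return "Clue cannot contain a board word"
    return None


def clue_with_retries(
    propose: Callable[[list[str], list[str]], str],
    board: list[str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call propose(rejection_feedback, excluded) until a clue validates."""
    rejection_feedback: list[str] = []
    excluded: list[str] = []
    for _ in range(max_attempts):
        clue = propose(rejection_feedback, excluded)
        error = validate_clue(clue, board)
        if error is None:
            return clue
        rejection_feedback.append(error)  # the named violation, not just "invalid"
        excluded.append(clue)
    return None  # caller decides what to do when every attempt fails


def fake_model(feedback: list[str], excluded: list[str]) -> str:
    """Stand-in for the LLM call: one new candidate per failed attempt."""
    return ["apple", "pian", "music"][len(excluded)]


print(clue_with_retries(fake_model, ["apple", "piano", "river"]))  # music
```

&lt;p&gt;The validator is a pure function of the clue and the board, so each rule can carry its own test case without involving the model at all.&lt;/p&gt;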

&lt;p&gt;&lt;strong&gt;Post-processing for uncertainty.&lt;/strong&gt; Even valid guesses get filtered by a confidence threshold before the client plays them. If nothing clears the bar, the API returns an empty guess list—the AI Guesser passes the turn rather than firing a weak pick. That's a product decision, but it only works because the earlier layers stopped nonsense from masquerading as success.&lt;/p&gt;
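
&lt;p&gt;The threshold step is small enough to sketch in a few lines of Python. The 0.6 cutoff and the dictionary shape are placeholders chosen for the example, not the values the game uses.&lt;/p&gt;

```python
def playable_guesses(guesses: list[dict], threshold: float = 0.6) -> list[dict]:
    """Keep only guesses whose confidence clears the bar, in original order.

    An empty result means the guesser passes the turn.
    """
    return [g for g in guesses if g["confidence"] >= threshold]


sanitized = [
    {"word": "apple", "confidence": 0.9},
    {"word": "river", "confidence": 0.3},
]
print(playable_guesses(sanitized))      # [{'word': 'apple', 'confidence': 0.9}]
print(playable_guesses(sanitized[1:]))  # []
```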

&lt;p&gt;None of this requires knowing Codenames. It's the same shape as any LLM feature with invariants: inventory counts that can't go negative, user IDs that must exist, action enums that must match state machines.&lt;/p&gt;

&lt;h2&gt;Mistakes, surprises and tradeoffs&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Treating structured output as the guardrail. It only enforced shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprise:&lt;/strong&gt; Sanitization outperformed prompt engineering for the dumbest failures (echoed clue, off-board tokens). Cheap deterministic filters beat another paragraph of "IMPORTANT RULES."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprise:&lt;/strong&gt; Retry feedback with the &lt;em&gt;reason&lt;/em&gt; a clue failed worked better than "try again." The model stopped repeating substring violations faster when the server named the violation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Retries burn tokens. Logging validation errors per attempt was essential to know whether we had a prompt problem or a missing rule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Sanitization can mask drift. If you silently drop bad guesses, monitor what you're dropping or you'll quietly turn the validator into the thing making all the decisions.&lt;/p&gt;
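
&lt;p&gt;One lightweight way to watch for that drift is to count what the sanitizer removes and why. The Python sketch below is an assumption about how such a counter could look; &lt;code&gt;DropMonitor&lt;/code&gt; and the reason names are hypothetical and not part of the game's code.&lt;/p&gt;

```python
from collections import Counter


class DropMonitor:
    """Tallies how many guesses the sanitizer drops and for which reason."""

    def __init__(self) -> None:
        self.total = 0
        self.dropped = Counter()

    def record(self, kept: bool, reason: str = "") -> None:
        """Call once per model guess, passing the drop reason when it is removed."""
        self.total += 1
        if not kept:
            self.dropped[reason] += 1

    def drop_rate(self) -> float:
        """Share of model guesses removed by sanitization."""
        return sum(self.dropped.values()) / self.total if self.total else 0.0


monitor = DropMonitor()
monitor.record(kept=True)
monitor.record(kept=False, reason="clue_echo")
monitor.record(kept=False, reason="off_board")
monitor.record(kept=False, reason="off_board")
print(monitor.drop_rate())             # 0.75
print(monitor.dropped.most_common(1))  # [('off_board', 2)]
```

&lt;p&gt;A rising drop rate, or one reason dominating the tally, is the signal that the prompt or the model has drifted and the validator is quietly doing the model's job.&lt;/p&gt;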

&lt;h2&gt;What I'd do on the next project&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Define the wire shape (JSON + schema).&lt;/li&gt;
&lt;li&gt;List domain invariants as pure functions with test cases.&lt;/li&gt;
&lt;li&gt;Add sanitization for the failure modes observed in the first 50 live calls.&lt;/li&gt;
&lt;li&gt;Only then invest in prompt nuance—and feed validator messages into retries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prompt engineering still matters for quality. It is not a substitute for enforcement when the user can lose a game—or money, or data—because the model followed the JSON spec and ignored reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; If your LLM integration stops at "parse JSON, call it a day," you haven't finished the feature. You've finished the demo.&lt;/p&gt;




&lt;p&gt;If you'd like to see the project that inspired these lessons, you can try &lt;a href="https://codenames-ai.com/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=schema-first-valid-json-wasnt-enough&amp;amp;utm_content=footer" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>typescript</category>
      <category>node</category>
    </item>
  </channel>
</rss>
