<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Truong</title>
    <description>The latest articles on DEV Community by Michael Truong (@michaeltruong).</description>
    <link>https://dev.to/michaeltruong</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3965775%2F868d43f8-59c8-45ca-93f1-3f2428fb222d.jpg</url>
      <title>DEV Community: Michael Truong</title>
      <link>https://dev.to/michaeltruong</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michaeltruong"/>
    <language>en</language>
    <item>
      <title>Schema first, prompt second: valid JSON wasn't enough</title>
      <dc:creator>Michael Truong</dc:creator>
      <pubDate>Thu, 04 Jun 2026 05:27:30 +0000</pubDate>
      <link>https://dev.to/michaeltruong/schema-first-prompt-second-valid-json-wasnt-enough-3nhm</link>
      <guid>https://dev.to/michaeltruong/schema-first-prompt-second-valid-json-wasnt-enough-3nhm</guid>
      <description>&lt;p&gt;Over the last month I've been building &lt;a href="https://codenames-ai.com/" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt;, a small web game where an LLM plays Codenames with you. The guesser never sees unrevealed card identities. The server sends the board state and a clue; the model returns structured guesses with confidence scores and short explanations.&lt;/p&gt;

&lt;p&gt;When I started, I assumed the hard part was prompting. I was half right. Getting &lt;em&gt;something&lt;/em&gt; reasonable out of the model was fast. Making the system safe to expose to players was not.&lt;/p&gt;

&lt;p&gt;My first milestone felt responsible: &lt;code&gt;response_format: { type: "json_object" }&lt;/code&gt; on the chat completion, plus Zod schemas for the response body. If the JSON didn't parse or failed Zod, retry. Ship it.&lt;/p&gt;

&lt;p&gt;Then I watched the model comply perfectly with the schema and still propose moves that would ruin a game.&lt;/p&gt;

&lt;h2&gt;Valid JSON, invalid game&lt;/h2&gt;

&lt;p&gt;Here's the distinction that mattered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON schema (via Zod) answers:&lt;/strong&gt; Did the model return the keys and types I asked for?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain validation answers:&lt;/strong&gt; Is this output allowed on &lt;em&gt;this&lt;/em&gt; board, for &lt;em&gt;this&lt;/em&gt; clue, under &lt;em&gt;these&lt;/em&gt; rules?&lt;/p&gt;

&lt;p&gt;Those are not the same questions.&lt;/p&gt;

&lt;p&gt;Three examples I hit while testing and running the game:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The model echoed the clue as a guess.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Codenames forbids guessing the clue word. The model would sometimes put it in &lt;code&gt;guesses[]&lt;/code&gt; anyway—confidently, with a tidy explanation object. Zod was thrilled. The game was not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The model hallucinated words that weren't on the board.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Perfect JSON. A guess list full of words that don't exist on the 25-card grid, or that were already revealed. Again, schema-valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The spymaster returned illegal clues.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Single-word clues can't match a codename, can't be a substring of one (or vice versa), and can't be near-miss spellings. The model regularly suggested clues that a human referee would reject. Valid JSON every time.&lt;/p&gt;

&lt;p&gt;I spent too long fixing these by adding sentences to the system prompt. That helped a little. It did not help enough.&lt;/p&gt;

&lt;h2&gt;What actually moved reliability&lt;/h2&gt;

&lt;p&gt;The bigger wins came from code paths I treated as boring infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sanitization before trust.&lt;/strong&gt; After Zod parses the guess payload, we strip clue echoes, off-board words, revealed cards, and duplicates, then realign the explanation array with whatever survived. The model can return whatever explanation it wants; the server decides which guesses survive validation.&lt;/p&gt;
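
&lt;p&gt;A minimal sketch of that sanitization pass, assuming the payload has already passed schema validation. The &lt;code&gt;Guess&lt;/code&gt; and &lt;code&gt;GuessPayload&lt;/code&gt; shapes, the &lt;code&gt;sanitizeGuesses&lt;/code&gt; name, the case-insensitive comparison, and the index-aligned explanation array are all illustrative assumptions, not the game's actual code.&lt;/p&gt;

```typescript
// Hypothetical shapes; the real payload may differ.
type Guess = { word: string; confidence: number };
type GuessPayload = { guesses: Guess[]; explanations: string[] };

// Normalize so "Apple" and " apple " compare equal (an assumption).
const norm = (s: string): string => s.trim().toLowerCase();

function sanitizeGuesses(
  payload: GuessPayload,
  clue: string,
  unrevealedWords: string[],
): GuessPayload {
  const allowed = new Set(unrevealedWords.map(norm));
  const seen = new Set([norm(clue)]); // pre-seeding drops clue echoes
  const guesses: Guess[] = [];
  const explanations: string[] = [];

  payload.guesses.forEach((guess, i) => {
    const key = norm(guess.word);
    // Off-board and already revealed words are both absent from `allowed`.
    if (!allowed.has(key) || seen.has(key)) return;
    seen.add(key);
    guesses.push(guess);
    // Keep explanations index-aligned with the surviving guesses.
    explanations.push(payload.explanations[i] ?? "");
  });

  return { guesses, explanations };
}
```

&lt;p&gt;Because the function only ever removes entries, its output can be passed to the rest of the pipeline without re-checking board membership.&lt;/p&gt;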

&lt;p&gt;&lt;strong&gt;Deterministic validators with explicit error strings.&lt;/strong&gt; Clue validation returns messages like "Clue cannot be a substring of a board word" rather than a bare "invalid." Those strings go back into the next attempt as &lt;code&gt;rejectionFeedback&lt;/code&gt;, alongside an exclude list of clue words that already failed, so the next attempt can avoid repeating the same violations.&lt;/p&gt;
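
&lt;p&gt;Here is one way such a validator and its retry context could look. Only the &lt;code&gt;rejectionFeedback&lt;/code&gt; field name comes from the project; &lt;code&gt;validateClue&lt;/code&gt;, &lt;code&gt;RetryContext&lt;/code&gt;, &lt;code&gt;excludedClues&lt;/code&gt;, and &lt;code&gt;addRejection&lt;/code&gt; are hypothetical names, and the near-miss spelling rule is left out to keep the sketch short.&lt;/p&gt;

```typescript
// Hypothetical validator: returns every rule the clue breaks as a readable string.
// An empty array means the clue passed these checks.
function validateClue(clue: string, boardWords: string[]): string[] {
  const c = clue.trim().toLowerCase();
  const errors: string[] = [];
  if (!/^[a-z]+$/.test(c)) {
    errors.push("Clue must be a single word");
    return errors;
  }
  for (const raw of boardWords) {
    const w = raw.trim().toLowerCase();
    if (c === w) {
      errors.push(`Clue cannot match the board word "${raw}"`);
    } else if (w.includes(c)) {
      errors.push(`Clue cannot be a substring of the board word "${raw}"`);
    } else if (c.includes(w)) {
      errors.push(`Clue cannot contain the board word "${raw}"`);
    }
  }
  return errors;
}

// Hypothetical retry context, accumulated across failed attempts.
type RetryContext = { rejectionFeedback: string[]; excludedClues: string[] };

// Returns a new context so earlier attempts are never mutated.
function addRejection(ctx: RetryContext, clue: string, errors: string[]): RetryContext {
  return {
    rejectionFeedback: [...ctx.rejectionFeedback, ...errors],
    excludedClues: [...ctx.excludedClues, clue],
  };
}
```

&lt;p&gt;The retry loop would then serialize the context into the next prompt, so the model sees both the rejected words and the specific rule each one broke.&lt;/p&gt;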

&lt;p&gt;&lt;strong&gt;Post-processing for uncertainty.&lt;/strong&gt; Even valid guesses get filtered by a confidence threshold before the client plays them. If nothing clears the bar, the API returns an empty guess list—the AI Guesser passes the turn rather than firing a weak pick. That's a product decision, but it only works because the earlier layers stopped nonsense from masquerading as success.&lt;/p&gt;
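
&lt;p&gt;The threshold step itself is small. In this sketch the &lt;code&gt;playableGuesses&lt;/code&gt; name and the 0.6 default are placeholders, not the values the game uses.&lt;/p&gt;

```typescript
type ScoredGuess = { word: string; confidence: number };

// 0.6 is an arbitrary illustrative default, not the game's real threshold.
function playableGuesses(guesses: ScoredGuess[], minConfidence = 0.6): ScoredGuess[] {
  return guesses
    // Treat NaN or infinite scores as untrustworthy rather than comparing them.
    .filter((g) => Number.isFinite(g.confidence))
    .filter((g) => g.confidence >= minConfidence);
}

// An empty result is a valid outcome: the caller treats it as "pass the turn".
```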

&lt;p&gt;You don't need to know Codenames for any of this to apply. It's the same shape as any LLM feature with invariants: inventory counts that can't go negative, user IDs that must exist, action enums that must match state machines.&lt;/p&gt;

&lt;h2&gt;Mistakes, surprises and tradeoffs&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Treating structured output as the guardrail. It only enforced shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprise:&lt;/strong&gt; Sanitization outperformed prompt engineering for the dumbest failures (echoed clue, off-board tokens). Cheap deterministic filters beat another paragraph of "IMPORTANT RULES."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprise:&lt;/strong&gt; Retry feedback with the &lt;em&gt;reason&lt;/em&gt; a clue failed worked better than "try again." The model stopped repeating substring violations faster when the server named the violation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Retries burn tokens. Logging validation errors per attempt was essential to know whether we had a prompt problem or a missing rule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Sanitization can mask drift. If you silently drop bad guesses, monitor what you're dropping or you'll quietly turn the validator into the thing making all the decisions.&lt;/p&gt;
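
&lt;p&gt;One way to keep an eye on this is to tally why each guess was dropped and track the drop rate per request. The reason labels and the &lt;code&gt;tallyDrops&lt;/code&gt; and &lt;code&gt;dropRate&lt;/code&gt; helpers below are hypothetical; where the numbers end up (logs, metrics, a dashboard) is left open.&lt;/p&gt;

```typescript
type DropReason = "clue_echo" | "off_board" | "already_revealed" | "duplicate";

// Number of dropped guesses per reason for a single model response.
type DropCounts = { [reason in DropReason]: number };

function tallyDrops(
  guessWords: string[],
  clue: string,
  boardWords: string[],
  revealedWords: string[],
): DropCounts {
  const norm = (s: string): string => s.trim().toLowerCase();
  const board = new Set(boardWords.map(norm));
  const revealed = new Set(revealedWords.map(norm));
  const seen: string[] = [];
  const counts: DropCounts = { clue_echo: 0, off_board: 0, already_revealed: 0, duplicate: 0 };

  for (const raw of guessWords) {
    const w = norm(raw);
    if (w === norm(clue)) counts.clue_echo += 1;
    else if (!board.has(w)) counts.off_board += 1;
    else if (revealed.has(w)) counts.already_revealed += 1;
    else if (seen.includes(w)) counts.duplicate += 1;
    else seen.push(w); // a guess that survives sanitization
  }
  return counts;
}

// Share of the model's guesses that sanitization removed.
function dropRate(counts: DropCounts, totalGuesses: number): number {
  if (totalGuesses === 0) return 0;
  const dropped =
    counts.clue_echo + counts.off_board + counts.already_revealed + counts.duplicate;
  return dropped / totalGuesses;
}
```

&lt;p&gt;A drop rate that climbs after a model or prompt change is a signal that the filter is doing more of the work than it used to.&lt;/p&gt;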

&lt;h2&gt;What I'd do on the next project&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Define the wire shape (JSON + schema).&lt;/li&gt;
&lt;li&gt;List domain invariants as pure functions with test cases.&lt;/li&gt;
&lt;li&gt;Add sanitization for the failure modes observed in the first 50 live calls.&lt;/li&gt;
&lt;li&gt;Only then invest in prompt nuance—and feed validator messages into retries.&lt;/li&gt;
&lt;/ol&gt;
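
&lt;p&gt;Step 2 might look like this for a different domain. The sketch borrows the "inventory counts that can't go negative" example; &lt;code&gt;StockAction&lt;/code&gt; and &lt;code&gt;checkStockAction&lt;/code&gt; are made-up names, and the cases are a plain table so they run without a test framework.&lt;/p&gt;

```typescript
// Hypothetical action an LLM might propose in an inventory feature.
type StockAction = { sku: string; delta: number };

// Pure invariant check: returns an error string, or null when the action is allowed.
function checkStockAction(
  action: StockAction,
  stock: { [sku: string]: number },
): string | null {
  const current = stock[action.sku];
  if (current === undefined) return `Unknown SKU "${action.sku}"`;
  if (!Number.isInteger(action.delta)) return "Delta must be an integer";
  if (0 > current + action.delta) return "Inventory count cannot go negative";
  return null;
}

// Table-driven cases: each row pairs an input with the expected verdict.
const stock = { widget: 3 };
const cases: [StockAction, string | null][] = [
  [{ sku: "widget", delta: -3 }, null],
  [{ sku: "widget", delta: -4 }, "Inventory count cannot go negative"],
  [{ sku: "widget", delta: 0.5 }, "Delta must be an integer"],
  [{ sku: "gadget", delta: 1 }, 'Unknown SKU "gadget"'],
];

for (const [action, expected] of cases) {
  const actual = checkStockAction(action, stock);
  if (actual !== expected) {
    throw new Error(`Expected ${expected}, got ${actual} for ${JSON.stringify(action)}`);
  }
}
```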

&lt;p&gt;Prompt engineering still matters for quality. It is not a substitute for enforcement when the user can lose a game—or money, or data—because the model followed the JSON spec and ignored reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; If your LLM integration stops at "parse JSON, call it a day," you haven't finished the feature. You've finished the demo.&lt;/p&gt;

&lt;p&gt;If you'd like to see the project that inspired these lessons, you can try &lt;a href="https://codenames-ai.com/" rel="noopener noreferrer"&gt;Codenames AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>typescript</category>
      <category>node</category>
    </item>
  </channel>
</rss>
