DEV Community: Michael Truong

I fixed my AI reviewer. Then I kept solving the wrong problem

Michael Truong — Wed, 08 Jul 2026 05:09:51 +0000

I've been building an AI-assisted editorial pipeline for technical writing. Notion cards become markdown drafts in the repo, pass through review, then sync to dev.to.

Last month I shipped a post about the first big fix to my editor-critique reviewer skill, "The AI reviewer scored 23/25 and missed the point". The problem was sequence. A score-first pass treated a polished rubric as the first lens and produced QA feedback when I needed editorial feedback. Reordering the skill so analysis precedes scoring fixed that.

I assumed the next improvements would come from rubric tuning. Longer prompts. Another scoring dimension. Sharper checklists.

That assumption was half right. The rubric still matters. But every useful fix after the baseline shared a different shape.

The pattern I kept missing

After I moved analysis ahead of scoring, reviewer failures kept arriving from different incidents. A critique that agreed with itself too easily. Drafts that grew with every revision and never shed material. A middle section that felt like a second article.

Each time I reached for the same lever: expand the rubric, add a rule, lengthen the prompt.

Incident 1: When the reviewer needs to argue with itself

editor-critique produced decisive scorecards and prioritized feedback, but the report rarely challenged its own conclusions. A draft could earn a "Ready to sync" recommendation with medium-priority items left unexamined.

Score-first review had failed because it judged too early. This failure was different: the primary critique could be thorough and still under-falsified.

The fix was another staged pass. Once the primary critique is drafted, freeze it. Run an adversarial review that assumes the primary assessment is wrong and must back every challenge with counter-evidence drawn from the draft. Then synthesize: change the publication recommendation only when falsification is material.

I added adversarial review, synthesis, and canonical report assembly as new skill steps. A follow-up pass tightened adversarial review with an anchor requirement: every counter-evidence bullet must name the frozen primary claim it challenges. No orphan hypotheticals like "title spoils thesis?" when the primary critique already praised title strategy.

Before:

Editorial read-through
→ Score
→ Critique
→ Post report

After:

Editorial read-through
→ Score
→ Primary critique
→ Adversarial review (frozen inputs)
→ Synthesis
→ Post report
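The frozen handoff and the anchor requirement can be sketched in a few lines. This is a minimal illustration, not the actual editor-critique skill format: the type names, the `material` flag, and the "Needs revision" label are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the adversarial pass cannot edit primary claims
class PrimaryClaim:
    claim_id: str
    text: str


@dataclass(frozen=True)
class CounterEvidence:
    challenges: str  # claim_id of the frozen primary claim being challenged
    evidence: str    # counter-evidence drawn from the draft
    material: bool   # would this change the publication recommendation?


def anchored_only(claims, counters):
    """Drop orphan counter-evidence that names no frozen primary claim."""
    known = {claim.claim_id for claim in claims}
    return [c for c in counters if c.challenges in known]


def synthesize(recommendation, claims, counters):
    """Change the recommendation only when anchored falsification is material."""
    if any(c.material for c in anchored_only(claims, counters)):
        return "Needs revision"  # hypothetical label
    return recommendation


claims = (PrimaryClaim("title", "Title strategy works"),)
counters = [
    CounterEvidence("title", "Title states the lesson before the incident", True),
    CounterEvidence("pacing", "Middle may drag?", True),  # orphan: no such claim
]
```

With these inputs the orphan "pacing" bullet is discarded, and only the anchored "title" challenge can move the recommendation away from "Ready to sync".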

That was the first time staging a different kind of reasoning into its own pass beat rubric expansion. Two more failures would repeat the same shape before I stopped treating it as coincidence.

Incident 2: When critique only adds

Self-falsification helped, but drafts were still growing. Critiquing "Upgrades don't have to be a blind trust exercise" showed why: feedback was consistently additive and never subtractive. editor-critique reliably found missing framing and evidence boundaries. It did not ask what should be removed when new material arrived.

The result was layered drafts: an opening stacked on another opening, the same four-step investigation loop restated in three sections, a mental-model diagram that walked through event flow the prose had already established in the previous section.

The fix was not "be shorter" in the rubric. It was naming another cognitive job in the read-through: subtractive editing. Every paragraph should continue earning its place. Flag existing redundancy and addition-induced redundancy. Pair expansion recommendations with material that would become redundant if adopted.

A companion technique, single-owner ideas, lists 2–4 core ideas and flags when the same idea appears in multiple sections without new evidence. I codified subtractive editing in the skill file along with a test case that catches additive-only critique regressions and a lightweight subtractive pass in the human revision step.

The primary critique still owns expansion. Subtractive editing is a separate observational pass, not a rewrite engine.
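The single-owner check is mechanical enough to sketch. The snippet below assumes a reviewer has already tagged each section with the core ideas it restates. The function name and the sample section titles are hypothetical. Deciding whether a repeat brings new evidence stays a judgment call, so this only surfaces candidates.

```python
from collections import defaultdict


def multi_owner_ideas(section_ideas):
    """Return ideas that appear in more than one section, with their sections.

    section_ideas maps a section title to the core ideas it restates.
    """
    owners = defaultdict(list)
    for section, ideas in section_ideas.items():
        # dict.fromkeys dedupes repeats inside one section, keeping order
        for idea in dict.fromkeys(ideas):
            owners[idea].append(section)
    return {idea: secs for idea, secs in owners.items() if len(secs) > 1}


draft = {
    "Opening": ["investigation loop"],
    "Investigation before implementation": ["investigation loop", "custom risk"],
    "What I'd do next": ["investigation loop"],
}
flags = multi_owner_ideas(draft)
```

Here `flags` contains only "investigation loop", listed against all three sections, which is the cue to ask which section should own it.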

Incident 3: When a section becomes its own article

The last failure pushed past critique mechanics into reader cognition. While I was critiquing draft variants in my editorial workflow, several middle-body sections were technically correct but felt wrong in context. In one draft, an implementation walkthrough interrupted the investigation arc. In another, a full section on validation tooling read like its own mini-article.

The failure mode was narrow: a section stopped advancing the reader's current question and temporarily made another explanatory thread the center of gravity.

Adding a rubric dimension for "section focus" would have been vague. What worked was an observational lens in the editorial read-through step: name the primary thread, name the secondary thread, decide whether to compress, delay, embed later, or leave as-is.

I codified this as a Secondary explanatory thread lens in the skill file. The rubric stayed the same. The lens simply added a named cognitive job: track whether prose is serving the reader's current question or drifting into a side article.
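One way to picture the lens is as a small observation record with a closed set of decisions. This is a sketch under assumptions: the class and field names are invented for illustration, and only the four decisions (compress, delay, embed later, leave as-is) come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class ThreadAction(Enum):
    COMPRESS = "compress"
    DELAY = "delay"
    EMBED_LATER = "embed later"
    LEAVE = "leave as-is"


@dataclass(frozen=True)
class ThreadObservation:
    section: str
    primary_thread: str    # the reader's current question
    secondary_thread: str  # the thread competing for the center of gravity
    action: ThreadAction

    def note(self) -> str:
        """Render the observation as one line for the critique report."""
        return (
            f"{self.section}: primary thread is '{self.primary_thread}', "
            f"secondary thread is '{self.secondary_thread}' "
            f"-> {self.action.value}"
        )


obs = ThreadObservation(
    "Validation tooling", "investigation arc", "validation tooling",
    ThreadAction.COMPRESS,
)
```

The record produces an observation, not a score, which keeps the lens separate from the rubric.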

What stayed constant

Three incidents, three skill changes, one pattern. Across all three, a few constraints held:

The five-dimension rubric stayed mostly intact.
Read-only governance did not change: critique still does not write repo files or gate publish.
Each pass added another observational lens, not another scoring dimension.
The expensive part was naming the cognitive job precisely enough to operationalize in a skill file.

The recurring mistake was treating distinct kinds of reasoning as one undifferentiated pass. Each fix changed the sequence, not the rubric weight. A capable reviewer can read before it scores and still under-read if falsifying primary judgment, displacing redundant prose, and tracking reader focus all compete in the same step.

Before you expand the rubric

List the failure modes that survived your last sequence fix.
For each one, name the cognitive job that failed (self-falsification, subtractive editing, reader-focus tracking).
Stage that job as its own observational pass with a frozen handoff to the next step.
Expand the rubric only if that observational pass still misses failures in production.

Once editor-critique understood before judging, the remaining improvements came from separating kinds of reasoning into distinct stages, not from a bigger rubric or a longer single pass. I suspect the pattern may generalize beyond editorial critique.

Takeaway: When a reviewer skill plateaus after a sequence fix, ask which cognitive jobs are still sharing one undifferentiated pass. Stage them before you expand the rubric.

If you'd like to see the project behind these workflow experiments, try Codenames AI.

Upgrades don't have to be a blind trust exercise

Michael Truong — Fri, 03 Jul 2026 09:02:30 +0000

I've been building Codenames AI as a solo project I want to keep alive. Renovate keeps dependency upgrades moving so that maintenance does not eventually crush momentum.

You don't need my exact setup to follow along.

Every project accumulates maintenance work. Framework upgrades are one place that work stalls, not because engineers do not know how to migrate, but because proving what actually needs to change takes time. That tradeoff shows up on a hobby repo as "I'll look at this when I have an uninterrupted evening." It shows up in production as major versions piling up while investigation competes with feature work.

That's the problem I was trying to solve.

Before AI, my realistic choices were narrow: trust the automation and hope, spend hours mapping release notes to my codebase, or leave the upgrade sitting.

I was not trying to invent a better review process.

I was trying to keep maintenance cost below available time.

AI changed the cost of investigation enough that I stopped treating it as something to postpone.

When Renovate opened a pull request for one of those framework majors, upgrading Vite from ^6.0.11 to ^8.0.0 and @vitejs/plugin-react from ^4.3.4 to ^6.0.0, my first instinct was still to treat it like migration work. Read the guides. Find the breaking changes. Plan the code changes. Validate the app. Then merge.

That was the wrong starting assumption.

The useful work was not implementing the migration. The useful work was proving whether a migration existed for this repo at all.

The review gate was right

Renovate grouped the update as a frontend React/Vite major. My review policy sorts packages into low-risk and high-touch buckets: a patch-level bump to a type definition or a lint plugin can auto-merge, but anything that builds or serves the app (the bundler, the React plugin) is high-touch.

Both Vite and @vitejs/plugin-react sit in that high-touch bucket, so any version change routes to a human instead of auto-merge.
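The bucket policy described above can be sketched as a routing function. This is an illustration, not the repo's Renovate configuration: the package-name prefixes and the choice to send unclassified packages to a human are assumptions. The text names only Vite and @vitejs/plugin-react as high-touch, and type definitions and lint plugins as low-risk examples.

```python
# Named in the post as high-touch: anything that builds or serves the app.
HIGH_TOUCH = {"vite", "@vitejs/plugin-react"}


def is_low_risk(package: str) -> bool:
    """Assumed naming convention for type definitions and lint plugins."""
    return package.startswith("@types/") or package.startswith("eslint-plugin-")


def route_update(package: str, update_type: str) -> str:
    """Return 'auto-merge' or 'human-review' for a dependency update.

    update_type is one of 'patch', 'minor', 'major'.
    """
    if package in HIGH_TOUCH:
        return "human-review"  # any version change routes to a human
    if is_low_risk(package) and update_type == "patch":
        return "auto-merge"
    return "human-review"      # assumption: unclassified updates take the safer lane
```

For example, `route_update("vite", "patch")` still returns "human-review", while `route_update("@types/node", "patch")` returns "auto-merge".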

The pull request looked serious on paper:

Vite 8 release notes included explicit breaking changes.
The lockfile churn was large because Vite moved from the Rollup-centered dependency shape toward Rolldown packages.

The old workflow would have started with a migration plan. The evidence-first workflow started with a different goal:

Which documented breaking changes actually apply to this repository?

CI showed the branch built; it did not show the breaking changes were irrelevant to this repository. I ran the PR through an AI-assisted review and used the four-step checklist below to audit the result. My job was not to re-derive every fact by hand; it was to decide whether the evidence was enough to merge.

Investigation before implementation

The checklist covers four increasingly specific questions. For the Vite bump, the resulting evidence packet looked like this:

1. Inspect the upstream change.

Both packages shipped documented breaking changes. The review used those notes to name possible failure modes, not to assume which ones touched this repo. Vite 8's release notes called out, for example, SSR pipeline shifts and stricter import.meta.hot handling.

2. Map those changes to actual usage.

The packet reframed the question from "does Vite 8 have breaking changes?" to "does this app use the surfaces those changes break?" This repo has an ordinary Vite React setup with no custom SSR.

3. Identify custom risk.

The packet flagged one meaningful project-specific area: a small custom Vite plugin, cssBeforeModuleScript, that hooks transformIndexHtml to reorder the stylesheet and module-script tags. A bundler swap from Rollup to Rolldown could plausibly change that behavior, so this remained unresolved until the app was exercised.

4. Validate the app.

CI's test job had already been green; the packet still called for proof the custom-risk path and production build held. A ready Vercel preview closed that gap: it ran a production build through the new Vite with cssBeforeModuleScript included, and the rendered page was where broken stylesheet or module-script ordering would have shown up.

At that point, the recommendation changed.

The packet had not found migration work. It had found enough evidence that no migration was required.
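A rough sketch of the evidence packet as a data structure, assuming each of the four steps needs a recorded finding and every flagged risk must be closed before a human audits the result. The class, field names, and recommendation strings are hypothetical.

```python
from dataclasses import dataclass, field

STEPS = (
    "Inspect the upstream change",
    "Map those changes to actual usage",
    "Identify custom risk",
    "Validate the app",
)


@dataclass
class EvidencePacket:
    findings: dict = field(default_factory=dict)    # step -> evidence note
    open_risks: list = field(default_factory=list)  # flagged but unresolved

    def unanswered(self):
        """Steps that still have no recorded evidence."""
        return [step for step in STEPS if not self.findings.get(step)]

    def recommendation(self) -> str:
        if self.unanswered() or self.open_risks:
            return "keep investigating"
        return "ready for human audit"


packet = EvidencePacket()
packet.findings[STEPS[0]] = "Both packages document breaking changes"
packet.findings[STEPS[1]] = "Ordinary Vite React setup, no custom SSR"
packet.findings[STEPS[2]] = "cssBeforeModuleScript hooks transformIndexHtml"
packet.open_risks.append("cssBeforeModuleScript under Rolldown")
before = packet.recommendation()

packet.findings[STEPS[3]] = "Vercel preview ran a production build with the plugin"
packet.open_risks.clear()
after = packet.recommendation()
```

The packet never merges anything. It only reports whether there is enough on file for a human to make the call.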

Evidence over migration plans

The merged PR changed two files: frontend/package.json and package-lock.json.

No source files. No Vite config rewrite. No component changes. No test rewrites. No custom shim.

That is easy to misread as "the upgrade was trivial." It was not trivial. The pull request carried real risk signals. The absence of source changes only became meaningful after the investigation showed that those risk signals did not require code changes.

Implementation belongs only after the evidence asks for it. The human job shifts from "please migrate this dependency" to auditing whether the packet answers the four steps above.

What this changed about manual review

Before this, "review manually" sounded like a parking lot. A major upgrade arrived, the automation refused to merge it, and the human picked it up later. Assembling the evidence packet by hand was often what stalled the review.

Now I treat "review manually" as an evidence-gathering lane. AI makes that lane practical: much of the packet assembly no longer has to happen in one sitting at your keyboard. That's what changed the economics. Investigation stopped being the expensive part that made maintenance easy to postpone. On the Vite bump, AI helped assemble the evidence; I audited whether it was enough to merge without migration work.

For low-risk patches the question stays simple: did CI pass and did the diff stay inside package files? For high-touch framework upgrades it gets richer: can the packet cover upstream changes, repo usage, custom risk, and real app validation well enough for a human to decide without first doing speculative migration work?
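The low-risk question is simple enough to express as a check. A minimal sketch, assuming an npm-style repo where "package files" means package.json and package-lock.json; the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed set of package files; other package managers use other lockfiles.
PACKAGE_FILES = {"package.json", "package-lock.json"}


def diff_stays_in_package_files(changed_paths) -> bool:
    """True when every changed path is a package manifest or lockfile."""
    paths = list(changed_paths)
    # An empty diff is treated as a failure: there is nothing to approve.
    return bool(paths) and all(
        PurePosixPath(path).name in PACKAGE_FILES for path in paths
    )


def low_risk_patch_ok(ci_passed: bool, changed_paths) -> bool:
    """Both conditions from the low-risk lane: green CI and a contained diff."""
    return ci_passed and diff_stays_in_package_files(changed_paths)
```

The merged Vite PR would pass the diff half of this check, since it touched only frontend/package.json and package-lock.json. It still routed to a human because the package itself sat in the high-touch bucket.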

That does not remove judgment or the merge decision. It moves both earlier: gather the packet first, then audit whether implementation is actually required.

What I'd do on the next major upgrade

Classify the package honestly. Runtime and framework packages deserve more evidence than a patch-level dev tool bump.
Run the four-step loop above against release notes, repo usage, custom risk, and real app validation.
Treat zero source changes as a conclusion, not an assumption. If no implementation is required, say what evidence proved that.

Takeaway: When investigation is cheap enough to run, deferral stops being the default. Do the audit before you write the migration plan.


The AI reviewer scored 23/25 and missed the point

Michael Truong — Fri, 26 Jun 2026 15:14:45 +0000

I've been building an AI-assisted editorial pipeline for my technical writing. Notion cards become markdown drafts in the repo, pass through review, then sync to dev.to.

The motivation was simple: I already had a review loop I trusted for code. Open a PR, run Cursor's Bugbot against a review guide, fix what mattered, merge. I wanted the same rhythm for writing: draft, critique, revise, publish. So I built my own AI review skill called editor-critique.

I had also started adding HTML comments inside drafts, much like code comments. They captured the editorial intent behind a section, including why it opened where it did and why evidence sat where it did, without becoming part of the published post.

That made the review step look straightforward. Give the AI a rubric, score the draft, return prioritized feedback.

If the rubric was good, I assumed the critique would be good.

That assumption failed in a very specific way.

The first version of editor-critique did what I asked. It read a draft, applied five scoring dimensions, and produced a polished report. While reviewing my article, "The agent plan had every step except where to stop", it scored the piece 23/25 and mostly suggested polish.

It also missed the feedback I actually needed.

Valid rubric, shallow read

The draft did not need another pass on commas and section labels. It needed a colder editorial read.

A useful reviewer should have asked:

Does the title reveal the lesson before the incident earns it?
Does the article assume private repo context a dev.to reader will not have?
Are links to PRs, plans, and standards supporting evidence, or required reading?
Is governance framing outrunning what the incident actually proved?

Those are reader-journey questions, not formatting checks.

The score-first reviewer treated the rubric as the first lens. If the thesis was present, evidence was named, and the arc looked complete, the draft read as ready. The rubric turned critique into publication preflight: complete sections, reasonable voice, no obvious holes.

Useful, but not enough.

What changed in the sequence

I revised the reviewer skill so analysis precedes scoring.

Before:

Load draft
→ Score rubric dimensions
→ Generate critique

After:

Load draft
→ Editorial read-through
→ Score rubric dimensions
→ Generate critique

The rubric stayed. It stopped being the opening move.

Before scoring, the reviewer now reads visible prose like a cold dev.to audience member. It mentally strips author notes and asks whether the lesson would still work if repo links and hidden rationale disappeared. Then it checks thesis timing, audience assumptions, reference framing, and speculation drift.

The annotation loop mattered here. Because the comments sat beside the sections they explained, critique could compare intent against effect: the note described what the section was trying to do, while the reader-facing paragraph showed whether it actually did it. Sometimes the article needed the edit. Sometimes the annotation exposed that editor-critique itself was reading the section too mechanically. Either way, the disagreement became useful training material for the reviewer skill.
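Separating author notes from reader-facing prose can be sketched with a regular expression. This assumes the notes are plain HTML comments and that no comment-like text appears inside code samples in the draft. The function names are hypothetical.

```python
import re

# Matches HTML comments, including ones that span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def reader_view(draft: str) -> str:
    """Return the draft as a cold reader would see it, without author notes."""
    return HTML_COMMENT.sub("", draft)


def author_notes(draft: str) -> list:
    """Return the editorial-intent notes, stripped of the comment delimiters."""
    # Each match starts with "<!--" (4 chars) and ends with "-->" (3 chars).
    return [match[4:-3].strip() for match in HTML_COMMENT.findall(draft)]


draft = (
    "<!-- intent: open on the wrong assumption -->\n"
    "I assumed the checklist was enough.\n"
)
```

Feeding `reader_view(draft)` to the read-through and `author_notes(draft)` to a later comparison step keeps intent and effect apart until the reviewer is ready to compare them.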

Only after that read does it assign scores.

The output became more editorial. Instead of asking only "does this draft satisfy the rubric?", it started asking "what will break for the reader?"

On the same article, the revised reviewer surfaced title spoiling the lesson, private PR assumptions, weak framing for repo artifacts, and governance language potentially ahead of the evidence. The 23/25 pass had treated those as minor or invisible.

Why order beat rubric tuning

A rubric compresses judgment into categories: thesis, structure, evidence, voice, readiness. That compression helps consistency.

Compression too early can hide the problem.

Once the reviewer committed to a numerical assessment, the rest of the report tended to justify that assessment. A 23/25 draft needed 23/25 feedback, so the model organized its reasoning around why the piece was mostly ready instead of independently discovering what a reader would struggle with.

It is a little like running a linter before reading a design doc. The linter can confirm imports and formatting are clean. It cannot tell you whether the design makes sense. Start with the linter and the document can feel more complete than it is.

That is what happened here. The rubric was not bad. It was premature.

Once analysis came first, the same categories became more honest. "Evidence and specificity" could include link-only dependence. "Thesis and opening" could include title spoiling the lesson. "Publish readiness" could include whether prose survives without private repo access.

The score became a summary of the read-through, not a substitute for it.

QA review vs editorial review

The revision made me distinguish two kinds of AI review.

QA review asks: Did the artifact satisfy the stated criteria?

Editorial review asks: What will the reader misunderstand, miss, or not believe?

This was not completely new to me. In code review, I already used different Bugbot guides depending on what I wanted it to optimize for: security, game-state changes, UX regressions, or plan intent. The same diff could be reviewed through different lenses.

Writing turned out to have the same property as code review. A QA reviewer checks completeness and publishing criteria. An editorial reviewer reads for audience confusion and belief. The artifact stayed the same. The review lens changed.

Both matter. Broken frontmatter, missing sections, or absent takeaways still need QA. But if the reviewer starts and ends there, it can produce a confident report that never engages the reader's path through the article.

The first reviewer was not useless. It was doing QA under the name of critique.

The revised reviewer still scores, but it has to earn the score by reading first.

That sequencing shift moved output from "this article is mostly ready" toward "this article assumes too much context, reveals its lesson too early, and needs stronger in-narrative evidence before the governance argument about where an agent should stop lands."

That is the feedback I needed.

What I'd do on the next reviewer

For the next AI reviewer I build, I would design sequence before I tune rubric dimensions.

Start with an ungated read. Inspect audience, intent, risk, and evidence before scoring thresholds appear.
Make the rubric summarize the analysis. Scores should cite read-through observations, not invent them after the fact.
Separate checklist pass from judgment pass. "Is it complete?" and "is it good?" are different questions.
Force reader-impact language. Critique items should say what breaks for the reader, not only which rule was violated.
Let scores come last. Once a number appears, everything organizes around it.
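The rule that scores should cite read-through observations can be enforced structurally. A sketch under assumptions: the dimension names follow the categories listed earlier, the 0-5 range per dimension is inferred from the 25-point total, and the types are hypothetical.

```python
from dataclasses import dataclass

DIMENSIONS = ("thesis", "structure", "evidence", "voice", "readiness")


@dataclass(frozen=True)
class Score:
    dimension: str
    points: int   # assumed 0-5 per dimension, 25 in total
    cites: tuple  # indexes into the read-through observations


def validate_scores(observations, scores) -> int:
    """Reject scores that do not point back at read-through observations."""
    if not observations:
        raise ValueError("run the editorial read-through before scoring")
    for score in scores:
        if score.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {score.dimension}")
        if not score.cites:
            raise ValueError(f"{score.dimension} score cites no observation")
        if any(i < 0 or i >= len(observations) for i in score.cites):
            raise ValueError(f"{score.dimension} cites a missing observation")
    return sum(score.points for score in scores)


observations = ["Title reveals the lesson before the incident earns it"]
total = validate_scores(observations, [Score("thesis", 4, (0,))])
```

A score with an empty `cites` tuple fails validation, so a number cannot appear without an observation behind it.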

This is not only about writing. I suspect the same pattern may apply to PR review, architecture review, incident analysis, and evaluation reports: if a reviewer scores before it understands, it overfits to the rubric and under-reads the situation.

The shape feels portable. Evaluation criteria are not enough. The order in which a reviewer thinks changes what it notices.

Takeaway: If your AI reviewer keeps producing technically correct but shallow feedback, do not only rewrite the rubric. Move analysis before scoring.

Editor's note (July 2026)

This article documents the first major architectural change to editor-critique: separating analysis from scoring. That sequence change held up, but it also exposed a new class of reviewer failures that couldn't be solved through rubric expansion alone. The follow-up, "I fixed my AI reviewer. Then I kept solving the wrong problem", explores that next stage.


The agent plan had every step except where to stop

Michael Truong — Fri, 19 Jun 2026 06:29:47 +0000

I've been running multi-slice agent plans in the Codenames AI repo — Renovate migrations, content-pipeline skills, dependency upgrades. I split multi-PR work into slices (usually one pull request each), each backed by a markdown file with file paths, verification commands, and merge-safe acceptance criteria.

You do not need Cursor to recognize the shape: any agent workflow that can open branches, push commits, or merge PRs from a written plan has the same gap. In my setup I paste each slice into a fresh agent chat as a delegation prompt — not a ticket summary, but executable instructions — and start a new chat when that PR is ready.

I assumed the checklist was enough. The plan described what to build. I treated how far the agent could go as implicit.

Then an agent merged a pull request I expected to review first.

The merge that reframed planning

The trigger was mundane. During the first slice of a Renovate migration, an agent regrouped dependency buckets in renovate.json: config-only, no version bumps, no change to runtime behavior. It ran lint and typecheck, opened the pull request, and merged it.

The change itself was reasonable. Config-only renovate.json regrouping is exactly the kind of slice you'd want off your plate.

What surprised me was the absence of a documented stop line. The migration plan described the edit, the verification commands, and the acceptance criteria. It did not say whether the executing agent should stop at "open PR" or continue to "merge after green checks." The plan was an implementation spec. The agent treated it as permission to finish the job.

Implementation specs vs authority handoffs

Traditional engineering plans answer: what work should happen, in what order, with what verification?

Agent plans increasingly need a second answer: how much autonomy does the next actor get?

Those questions diverge the moment an agent can take repository actions — create branches, push commits, open pull requests, merge — instead of only recommending diffs in chat.

Question	Implementation plan	Authority handoff
What to change	File paths, diffs, acceptance	Same
How to verify	Commands, CI checks	Same
Where to stop	Often implicit ("human reviews")	Must be explicit
Who enforces limits	Code review habit	Plan recommendation + branch protection

A human teammate might read "prepare this for review" and stop. An agent reads a completed checklist and reasonably asks: "Verification passed — what's left?"

The first response wasn't the plan

My first reaction was not to rewrite the migration plan. It was to tighten the repository boundary.

Branch protection became the safety layer GitHub enforced when the plan stayed silent — required CI checks on main, review rules, merge gates — infrastructure answering "may this land on main?" regardless of what the agent thought the plan implied.

That helped. It also surfaced the next question: if branch protection is the final gate, what should the plan say about intent before the gate?

Repository guardrails and plan language solve different problems. Branch protection is authoritative — if merge is blocked, the agent stops. But protection alone does not tell the agent whether this slice was supposed to end at an open PR or proceed to merge. You still need the handoff to be legible before someone reviews the diff.

Making execution authority explicit

The follow-up was documentation, not a ban on agent merges.

The portable fix: before any implementation detail, every slice names exactly how far the executor may go. We use two levels:

Level	Label	Agent instruction
Default	Open PR only	Do not merge. Stop after opening the PR.
Elevated	Merge granted	You may merge after documented verification passes.

Default is Open PR only. Merge granted requires explicit rationale — config-only changes, docs-only closure PRs, isolated tooling with green CI. Branch protection remains the final gate even when merge is recommended.

Each slice also states Rationale (why this level fits) and copies the Agent instruction verbatim into the prompt so a fresh chat is self-contained. A plan-level summary table at the top lets you scan a multi-PR plan and see where merge is elevated before you read file paths.
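A slice with authority at the top could be modeled like this. It is a sketch, not the private plan template: the class names, the prompt layout, and the placeholder rationale for the default level are assumptions. The two labels and their agent instructions are taken from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class Authority(Enum):
    OPEN_PR_ONLY = ("Open PR only", "Do not merge. Stop after opening the PR.")
    MERGE_GRANTED = (
        "Merge granted",
        "You may merge after documented verification passes.",
    )

    @property
    def label(self) -> str:
        return self.value[0]

    @property
    def instruction(self) -> str:
        return self.value[1]


@dataclass(frozen=True)
class Slice:
    title: str
    checklist: tuple
    authority: Authority = Authority.OPEN_PR_ONLY  # default level
    rationale: str = ""

    def __post_init__(self):
        # Elevation must be defended in writing.
        if self.authority is Authority.MERGE_GRANTED and not self.rationale.strip():
            raise ValueError("Merge granted requires an explicit rationale")

    def prompt(self) -> str:
        """Build a self-contained prompt with authority before the checklist."""
        lines = [
            f"Recommended execution authority: {self.authority.label}",
            f"Rationale: {self.rationale or 'default level'}",
            f"Agent instruction: {self.authority.instruction}",
            "",
            f"Slice: {self.title}",
        ]
        lines += [f"- {step}" for step in self.checklist]
        return "\n".join(lines)


regroup = Slice(
    "Regroup renovate.json buckets",
    ("Edit renovate.json", "Run lint and typecheck"),
)
```

Because the default is Open PR only, forgetting to set the level produces the conservative instruction, not silence.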

Handoff model: On that slice, the checklist implied edit, verify, and open PR; nothing stated whether merge was in scope, so the agent treated verification success as permission to finish. The chain we wanted spelled out: plan recommends authority → human accepts by executing the plan → agent follows the recommendation → branch protection enforces the final boundary.

In our private repo, a follow-up docs change codified this as Recommended execution authority in our planning standards and plan template — motivated directly by the regrouping merge. You do not need those files to apply the pattern; you need the label on every slice before the agent reads the checklist.

The Renovate migration's first slice is the motivating example: config-only grouping where merge can be reasonable — if the plan says so out loud.

What changed on the next slice

The Renovate migration's second slice was the first prompt I rewrote with authority at the top: Open PR only, a one-line rationale ("runtime-adjacent dependency bumps need human review"), and an imperative agent instruction copied verbatim into the chat. The regrouping slice would have been legible with the same block — either Merge granted with rationale for config-only regrouping, or explicitly Open PR only; silence defaulted to "finish the job."

I am not arguing for autonomous merge bots on every repo. The lesson is narrower: once agents act, plans delegate autonomy whether you write that down or not.

Human delegation has always been fuzzy — "take a pass at this" means different things to different people. Agent delegation punishes ambiguity faster because the agent will complete every step it can justify from the text in front of it.

The plan becomes the contract between author and executor. Implementation steps say what to build. Authority steps say how far to carry it.

Why not just forbid agent merges?

Fair pushback. If unexpected merges are the risk, disable merge capability and be done.

That misses what actually happened on the regrouping merge. The merge was not reckless — it was a config-only change with local verification and CI checks. Forbidding all agent merges would have blocked a useful outcome and pushed the work back to manual toil.

The interesting conclusion is not "agents should never merge." It is "agents need explicit authority boundaries."

Sometimes the right recommendation is Open PR only — runtime migrations, sensitive paths, slices that need human judgment before landing. Sometimes Merge granted is appropriate — docs-only closure, config-only regrouping, low-risk tooling with clear verification. The plan author chooses per slice. The agent follows the label. Branch protection catches mistakes either way.

Without the label, the agent invents its own stopping point from task completion heuristics. That is how you get surprised by a merge that was, by some readings, the correct next step.

What I'd do on the next agent plan

Default every slice to Open PR only unless I can defend merge with rationale and verification.
Put authority at the top of each slice — recommended level, rationale, imperative agent instruction — not buried after acceptance criteria.
Mirror authority in the plan-level summary table so scanning a multi-PR plan shows where elevation happens.
Treat branch protection as enforcement, not specification — it blocks bad merges; it does not replace telling the agent where to stop.
Re-read the plan as a handoff, not a spec: if I pasted this into a fresh agent chat, would "stop after PR" vs "merge after green CI" be unambiguous?
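The last check in that list can be partly automated. The sketch below assumes a plan is available as a list of dictionaries with a title and an optional authority label. The function name and the "UNLABELED" marker are hypothetical.

```python
VALID_LABELS = ("Open PR only", "Merge granted")


def authority_summary(slices):
    """Render a plan-level summary and list slices that lack a valid label.

    slices is a list of dicts with 'title' and an optional 'authority' key.
    """
    rows, missing = ["Slice | Authority"], []
    for item in slices:
        authority = item.get("authority")
        if authority not in VALID_LABELS:
            missing.append(item["title"])
            authority = "UNLABELED"
        rows.append(f"{item['title']} | {authority}")
    return "\n".join(rows), missing


plan = [
    {"title": "Regroup renovate.json", "authority": "Merge granted"},
    {"title": "Bump runtime deps"},  # silent slice: no stop line stated
]
table, missing = authority_summary(plan)
```

Any title in `missing` is a slice where the agent would have to infer its own stopping point.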

Prompt engineering still matters for implementation quality. It does not substitute for stating how much autonomy you are delegating when the executor can act on the repo.

Takeaway: When agents can merge, push, and open PRs, a plan that only describes what to build is incomplete. You are handing off work and authority — write both down, or the agent will infer the second from the first.


One good example beat every AI writing rule I wrote

Michael Truong — Fri, 12 Jun 2026 07:38:06 +0000

I've been building an AI-assisted content pipeline around Codenames AI — field reports from the repo, drafted in markdown, synced to dev.to. The part I assumed would be hard was publish automation. The part that actually burned time was teaching the model how to sound like me.

The experiment

I started where most people start: the prompt. I wrote a Cursor rule with tone guidance, pacing notes, section shapes, and a list of things to avoid. If the draft felt flat, add another paragraph to the rule. If it over-corrected, tighten the rule. Iterate until the voice stabilizes.

That felt like the correct lever.

I assumed a longer, more detailed AI writing rule would produce better drafts. Voice felt like something you could specify in prose: a style encyclopedia with tone, pacing, and guardrails.

The failure loop

Each revision made the output worse in a different way.

The cycle was predictable: generate a draft, dislike the tone, add rules, get over-correction, revert partway, add different rules, hit a new failure mode. Some passes sounded like generic engineering docs: correct, but missing the observations that made the article worth reading. Others had no concrete details. Others followed every instruction and lost personality entirely.

The rule file kept growing. The drafts kept rotating through new ways to miss the mark.

The accidental discovery

The useful move, in hindsight, was deleting most of the rules.

I replaced the checklist with one shipped article, "Schema first, prompt second: valid JSON wasn't enough". That post already had the shape I wanted: field report, wrong assumption up front, specific failures, tradeoffs, a single takeaway.

The Cursor rule shrank to a pointer: read the example, match the example.

"Write more like this article" beat "be direct, avoid metaphors, use short paragraphs, include a takeaway."

Drafts stopped sounding like engineering documentation. They started carrying the observations and pacing of a field report instead of a rule checklist.

Why the example transferred better

Rules describe voice from the outside. An example demonstrates it.

For long-form writing, an exemplar turned out to be closer to a spec than a style encyclopedia. Rule text and example text fail differently — a checklist compresses badly; an example carries decisions that are hard to encode as rules: pacing, level of detail, how much context to provide, and when to introduce examples.

When I asked for "direct engineer-to-engineer tone," the model complied literally and stripped the texture that makes a post readable. When I pointed at a finished article, it copied structural choices I hadn't thought to name: opening with context and a wrong assumption, using bold labels for contrast, ending sections with a concrete mistake instead of a principle.

The interesting part wasn't that the example contained better instructions. It contained decisions I didn't know how to describe.

I could recognize those choices when I saw them. I just wasn't very good at encoding them as rules.

What changed

Git history tells the story cleanly: the checklist-era rule peaked at 69 lines; the example-pointer rule landed at 23 lines.

After the switch, I spent less time fighting over-compliance and stripping generic phrasing. Voice became more consistent across drafts because the target was an article, not a growing instruction list.

Maintenance lesson: At 69 lines, the rule had enough instructions to contradict itself. A single canonical example stays honest. If the next post should sound different, update the example or add a second one for a new format. The rule stays an import statement.

Tradeoff: One example encodes one format. Field reports work; a tutorial or release note might need a second exemplar later.

Tradeoff: Examples can go stale. If the canonical post ages badly, future drafts inherit the wrong target. Treat the example like code you refactor, not like documentation you forget.

What I'd do differently next time:

Ship one article I'm proud of before investing in voice rules.
Point agents at that article.
Keep the Cursor rule as workflow plus a link, not a paraphrase of the example.
Add rules only for things examples can't carry: where files live, what not to paste into Notion, publish steps.

Prompt engineering still matters for facts, structure, and evidence gathering. For tone on long-form posts, though, one good example beat every style guide I wrote.

Takeaway

If your AI writing rules keep growing and the drafts keep getting worse, stop adding rules. Find an article that already sounds right and make that the spec.

If you'd like to see the project behind these posts, try Codenames AI.

Schema first, prompt second: valid JSON wasn't enough

Michael Truong — Thu, 04 Jun 2026 05:27:30 +0000

Over the last month I've been building Codenames AI, a small web game where an LLM plays Codenames with you. The guesser never sees unrevealed card identities. The server sends the board state and a clue; the model returns structured guesses with confidence scores and short explanations.

When I started, I assumed the hard part was prompting. I was half right. Getting something reasonable out of the model was fast. Making the system safe to expose to players was not.

My first milestone felt responsible: response_format: { type: "json_object" } on the chat completion, plus Zod schemas for the response body. If the JSON didn't parse or failed Zod, retry. Ship it.

Then I watched the model comply perfectly with the schema and still propose moves that would ruin a game.

Valid JSON, invalid game

Here's the distinction that mattered.

Schema validation (via Zod) answers: Did the model return the keys and types I asked for?

Domain validation answers: Is this output allowed on this board, for this clue, under these rules?

Those are not the same questions.
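A minimal sketch of the two questions side by side. The hand-rolled hasGuessShape stands in for the Zod schema so the block runs without dependencies; the payload shape, the three-word board, and isLegalOnBoard are hypothetical names for illustration, not the game's actual code.

```typescript
// Hypothetical payload shape; the real schema may carry more fields.
interface GuessPayload {
  guesses: string[];
}

// Shape check (stand-in for a Zod schema): right keys, right types?
function hasGuessShape(value: unknown): value is GuessPayload {
  if (typeof value !== "object" || value === null) return false;
  const guesses = (value as { guesses?: unknown }).guesses;
  return Array.isArray(guesses) && guesses.every((g) => typeof g === "string");
}

// Domain check: is every guess allowed on this board, for this clue?
function isLegalOnBoard(payload: GuessPayload, board: string[], clue: string): boolean {
  const onBoard = new Set(board.map((w) => w.toLowerCase()));
  const clueWord = clue.toLowerCase();
  return payload.guesses.every((g) => {
    const word = g.toLowerCase();
    return onBoard.has(word) && word !== clueWord;
  });
}

const board = ["apple", "river", "castle"];
const raw: unknown = JSON.parse('{"guesses": ["ocean", "river"]}');

const shapeOk = hasGuessShape(raw); // true: keys and types are fine
const domainOk = hasGuessShape(raw) && isLegalOnBoard(raw, board, "water"); // false: "ocean" is not on the board
```

The same payload passes the first check and fails the second, which is the gap the rest of this post is about.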

Three examples I hit while testing and running the game:

1. The model echoed the clue as a guess.

Codenames forbids guessing the clue word. The model would sometimes put it in guesses[] anyway—confidently, with a tidy explanation object. Zod was thrilled. The game was not.

2. The model hallucinated words that weren't on the board.

Perfect JSON. A guess list full of words that don't exist on the 25-card grid, or that were already revealed. Again, schema-valid.

3. The spymaster returned illegal clues.

Single-word clues can't match a codename, can't be a substring of one (or vice versa), and can't be near-miss spellings. The model regularly suggested clues that a human referee would reject. Valid JSON every time.

I spent too long fixing these by adding sentences to the system prompt. That helped a little. It did not help enough.

What actually moved reliability

The bigger wins came from code paths I treated as boring infrastructure.

Sanitization before trust. After Zod parses the guess payload, we strip clue echoes, off-board words, revealed cards, and duplicates, then realign the explanation array with whatever survived. The model can return whatever explanation it wants; the server decides which guesses survive validation.
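Here is a sketch of what that sanitization pass might look like. The type names, field names, and the sanitizeGuesses function are assumptions made for illustration; only the four filters and the realignment step come from the description above.

```typescript
// Hypothetical types; field names are assumptions.
interface Guess {
  word: string;
  confidence: number;
}

interface GuessPayload {
  guesses: Guess[];
  explanations: string[]; // explanations[i] describes guesses[i]
}

interface BoardState {
  words: string[];    // every codename on the grid
  revealed: string[]; // codenames already flipped
}

function sanitizeGuesses(payload: GuessPayload, board: BoardState, clue: string): GuessPayload {
  const norm = (s: string) => s.trim().toLowerCase();
  const onBoard = new Set(board.words.map(norm));
  const revealed = new Set(board.revealed.map(norm));
  const clueWord = norm(clue);
  const seen = new Set<string>();

  const guesses: Guess[] = [];
  const explanations: string[] = [];

  payload.guesses.forEach((guess, i) => {
    const word = norm(guess.word);
    const keep =
      word !== clueWord &&   // drop clue echoes
      onBoard.has(word) &&   // drop off-board words
      !revealed.has(word) && // drop revealed cards
      !seen.has(word);       // drop duplicates
    if (!keep) return;
    seen.add(word);
    guesses.push(guess);
    // Carry the matching explanation along so indexes stay aligned.
    explanations.push(payload.explanations[i] ?? "");
  });

  return { guesses, explanations };
}

const clean = sanitizeGuesses(
  {
    guesses: [
      { word: "water", confidence: 0.9 }, // clue echo
      { word: "river", confidence: 0.8 },
      { word: "ocean", confidence: 0.7 }, // not on the board
      { word: "River", confidence: 0.6 }, // duplicate
      { word: "apple", confidence: 0.5 }, // already revealed
    ],
    explanations: ["e0", "e1", "e2", "e3", "e4"],
  },
  { words: ["river", "apple", "castle"], revealed: ["apple"] },
  "water",
);
// Only "river" survives, paired with its original explanation "e1".
```

Filtering guesses and explanations in the same loop is one way to keep the two arrays aligned; filtering them separately is where index drift tends to creep in.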

Deterministic validators with explicit error strings. Clue validation returns messages like "Clue cannot be a substring of a board word" rather than a bare "invalid." Those strings go back into the next attempt as rejectionFeedback, alongside an exclude list of clue words that already failed, so the next attempt can avoid repeating the same violations.
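A rough sketch of a validator in that style. Only the substring message and the rejectionFeedback name come from the post; validateClue, RetryContext, excludeClues, and the other messages are hypothetical, and the near-miss spelling rule is left out to keep the block short.

```typescript
// Returns a specific rejection reason, or null when the clue is legal.
function validateClue(clue: string, boardWords: string[]): string | null {
  const c = clue.trim().toLowerCase();
  if (c.length === 0 || /\s/.test(c)) return "Clue must be a single word";
  for (const raw of boardWords) {
    const w = raw.toLowerCase();
    if (c === w) return "Clue cannot match a board word";
    if (w.includes(c)) return "Clue cannot be a substring of a board word";
    if (c.includes(w)) return "Clue cannot contain a board word";
  }
  return null;
}

// Hypothetical shape for what the next attempt is told.
interface RetryContext {
  rejectionFeedback: string[]; // named violations from earlier attempts
  excludeClues: string[];      // clue words that already failed
}

// Returns a new context instead of mutating, which keeps attempts easy to log.
function recordRejection(ctx: RetryContext, clue: string, reason: string): RetryContext {
  return {
    rejectionFeedback: [...ctx.rejectionFeedback, `${clue}: ${reason}`],
    excludeClues: [...ctx.excludeClues, clue.toLowerCase()],
  };
}

const boardWords = ["river", "apple", "castle"];
const reason = validateClue("cast", boardWords);
// reason is "Clue cannot be a substring of a board word"

let ctx: RetryContext = { rejectionFeedback: [], excludeClues: [] };
if (reason !== null) ctx = recordRejection(ctx, "cast", reason);
```

Returning the first violation as a string keeps the function pure and makes each rule easy to cover with a one-line test case.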

Post-processing for uncertainty. Even valid guesses get filtered by a confidence threshold before the client plays them. If nothing clears the bar, the API returns an empty guess list—the AI Guesser passes the turn rather than firing a weak pick. That's a product decision, but it only works because the earlier layers stopped nonsense from masquerading as success.
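The threshold step can be sketched in a few lines. The 0.6 cutoff and the selectPlayableGuesses name are placeholders; the post does not state the value or the function the game uses.

```typescript
interface Guess {
  word: string;
  confidence: number;
}

// Hypothetical cutoff; tune against real games.
const CONFIDENCE_THRESHOLD = 0.6;

// Keeps only guesses that clear the bar. An empty result means "pass the turn".
function selectPlayableGuesses(guesses: Guess[], threshold: number = CONFIDENCE_THRESHOLD): Guess[] {
  return guesses.filter((g) => Number.isFinite(g.confidence) && g.confidence >= threshold);
}

const playable = selectPlayableGuesses([
  { word: "river", confidence: 0.82 },
  { word: "castle", confidence: 0.41 },
]);
// playable contains only "river"

const passTurn = selectPlayableGuesses([{ word: "castle", confidence: 0.41 }]).length === 0;
// passTurn is true: nothing cleared the bar
```

The Number.isFinite guard is an extra assumption: it treats a NaN or missing confidence as a failed guess rather than letting it slip through a comparison.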

None of this requires knowing Codenames. It's the same shape as any LLM feature with invariants: inventory counts that can't go negative, user IDs that must exist, action enums that must match state machines.

Mistakes, surprises and tradeoffs

Mistake: Treating structured output as the guardrail. It only enforced shape.

Surprise: Sanitization outperformed prompt engineering for the dumbest failures (echoed clue, off-board tokens). Cheap deterministic filters beat another paragraph of "IMPORTANT RULES."

Surprise: Retry feedback with the reason a clue failed worked better than "try again." The model stopped repeating substring violations faster when the server named the violation.

Tradeoff: Retries burn tokens. Logging validation errors per attempt was essential to know whether we had a prompt problem or a missing rule.

Tradeoff: Sanitization can mask drift. If you silently drop bad guesses, monitor what you're dropping or you'll quietly turn the validator into the thing making all the decisions.
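One way to monitor what gets dropped is a small counter per drop reason. This is a sketch under assumptions: the reason names, DropStats, and the 0.2 alert rate are all made up for illustration and do not come from the game's code.

```typescript
type DropReason = "clue_echo" | "off_board" | "revealed" | "duplicate";
const DROP_REASONS: DropReason[] = ["clue_echo", "off_board", "revealed", "duplicate"];

interface DropStats {
  seen: number;                        // every guess the model returned
  dropped: Record<DropReason, number>; // guesses the sanitizer removed, by reason
}

function newDropStats(): DropStats {
  return { seen: 0, dropped: { clue_echo: 0, off_board: 0, revealed: 0, duplicate: 0 } };
}

// Call once per guess; pass null when the guess survived sanitization.
function recordGuess(stats: DropStats, reason: DropReason | null): void {
  stats.seen += 1;
  if (reason !== null) stats.dropped[reason] += 1;
}

function dropRate(stats: DropStats): number {
  if (stats.seen === 0) return 0;
  const dropped = DROP_REASONS.reduce((sum, r) => sum + stats.dropped[r], 0);
  return dropped / stats.seen;
}

const stats = newDropStats();
recordGuess(stats, null);
recordGuess(stats, "off_board");
recordGuess(stats, null);
recordGuess(stats, null);

const rate = dropRate(stats); // 0.25: one of four guesses was dropped

// Hypothetical alert level: a rising rate suggests the prompt or model has drifted.
const ALERT_RATE = 0.2;
const shouldAlert = rate > ALERT_RATE; // true
```

Splitting the count by reason matters more than the overall rate: a jump in off_board drops points at a different problem than a jump in duplicates.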

What I'd do on the next project

Define the wire shape (JSON + schema).
List domain invariants as pure functions with test cases.
Add sanitization for the failure modes observed in the first 50 live calls.
Only then invest in prompt nuance—and feed validator messages into retries.
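The last step, feeding validator messages into retries, might be sketched like this. The model call is stubbed as a synchronous function so the block runs standalone; a real call would be async. Propose, Validate, AttemptLog, and clueWithRetries are hypothetical names.

```typescript
// `propose` wraps the model call; `validate` is a pure invariant check.
type Propose = (feedback: string[], exclude: string[]) => string;
type Validate = (clue: string) => string | null;

interface AttemptLog {
  attempt: number;
  clue: string;
  error: string | null;
}

function clueWithRetries(
  propose: Propose,
  validate: Validate,
  maxAttempts: number = 3,
): { clue: string | null; log: AttemptLog[] } {
  const feedback: string[] = [];
  const exclude: string[] = [];
  const log: AttemptLog[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Pass copies so the proposer cannot mutate retry state.
    const clue = propose([...feedback], [...exclude]);
    const error = validate(clue);
    log.push({ attempt, clue, error }); // per-attempt record for later debugging
    if (error === null) return { clue, log };
    feedback.push(error); // name the violation for the next attempt
    exclude.push(clue);   // and ban the clue that just failed
  }
  return { clue: null, log }; // out of attempts: caller decides what to do
}

// Toy validator: rejects any clue that is a substring of "castle".
const validate: Validate = (clue) =>
  "castle".includes(clue) ? "Clue cannot be a substring of a board word" : null;

// Stub model: keeps proposing "cast" until told it is excluded.
const stubModel: Propose = (_feedback, exclude) =>
  exclude.includes("cast") ? "fortress" : "cast";

const result = clueWithRetries(stubModel, validate);
// result.clue is "fortress" after two attempts
```

Keeping the per-attempt log next to the loop makes it easier to tell afterward whether a failure was a prompt problem or a missing rule.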

Prompt engineering still matters for quality. It is not a substitute for enforcement when the user can lose a game—or money, or data—because the model followed the JSON spec and ignored reality.

Takeaway: If your LLM integration stops at "parse JSON, call it a day," you haven't finished the feature. You've finished the demo.

If you'd like to see the project that inspired these lessons, you can try Codenames AI.