Michael Truong

Posted on Jun 26 • Edited on Jul 9

The AI reviewer scored 23/25 and missed the point

#ai #workflow

Prioritizing analysis over numerical scores

I've been building an AI-assisted editorial pipeline for my technical writing. Notion cards become markdown drafts in the repo, pass through review, then sync to dev.to.

The motivation was simple: I already had a review loop I trusted for code. Open a PR, run Cursor's Bugbot against a review guide, fix what mattered, merge. I wanted the same rhythm for writing: draft, critique, revise, publish. So I built my own AI review skill called editor-critique.

I had also started adding HTML comments inside drafts, much like code comments. They captured the editorial intent behind a section, including why it opened where it did and why evidence sat where it did, without becoming part of the published post.

That made the review step look straightforward. Give the AI a rubric, score the draft, return prioritized feedback.

If the rubric was good, I assumed the critique would be good.

That assumption failed in a very specific way.

The first version of editor-critique did what I asked. It read a draft, applied five scoring dimensions, and produced a polished report. While reviewing my article, "The agent plan had every step except where to stop", it scored the piece 23/25 and mostly suggested polish.

It also missed the feedback I actually needed.

Valid rubric, shallow read

The draft did not need another pass on commas and section labels. It needed a colder editorial read.

A useful reviewer should have asked:

Does the title reveal the lesson before the incident earns it?
Does the article assume private repo context a dev.to reader will not have?
Are links to PRs, plans, and standards supporting evidence, or required reading?
Is governance framing outrunning what the incident actually proved?

Those are reader-journey questions, not formatting checks.

The score-first reviewer treated the rubric as the first lens. If the thesis was present, evidence was named, and the arc looked complete, the draft read as ready. The rubric turned critique into publication preflight: complete sections, reasonable voice, no obvious holes.

Useful, but not enough.

What changed in the sequence

I revised the reviewer skill so analysis precedes scoring.

Before:

Load draft
→ Score rubric dimensions
→ Generate critique

After:

Load draft
→ Editorial read-through
→ Score rubric dimensions
→ Generate critique

The rubric stayed. It stopped being the opening move.

Before scoring, the reviewer now reads visible prose like a cold dev.to audience member. It mentally strips author notes and asks whether the lesson still works if repo links and hidden rationale disappeared. Then it checks thesis timing, audience assumptions, reference framing, and speculation drift.

The annotation loop mattered here. Because the comments sat beside the sections they explained, critique could compare intent against effect: the note described what the section was trying to do, while the reader-facing paragraph showed whether it actually did it. Sometimes the article needed the edit. Sometimes the annotation exposed that editor-critique itself was reading the section too mechanically. Either way, the disagreement became useful training material for the reviewer skill.

Only after that read does it assign scores.

The output became more editorial. Instead of asking only "does this draft satisfy the rubric?", it started asking "what will break for the reader?"

On the same article, the revised reviewer surfaced title spoiling the lesson, private PR assumptions, weak framing for repo artifacts, and governance language potentially ahead of the evidence. The 23/25 pass had treated those as minor or invisible.

Why order beat rubric tuning

A rubric compresses judgment into categories: thesis, structure, evidence, voice, readiness. That compression helps consistency.

Compression too early can hide the problem.

Once the reviewer committed to a numerical assessment, the rest of the report tended to justify that assessment. A 23/25 draft needed 23/25 feedback, so the model organized its reasoning around why the piece was mostly ready instead of independently discovering what a reader would struggle with.

It is a little like running a linter before reading a design doc. The linter can confirm imports and formatting are clean. It cannot tell you whether the design makes sense. Start with the linter and the document can feel more complete than it is.

That is what happened here. The rubric was not bad. It was premature.

Once analysis came first, the same categories became more honest. "Evidence and specificity" could include link-only dependence. "Thesis and opening" could include title spoiling the lesson. "Publish readiness" could include whether prose survives without private repo access.

The score became a summary of the read-through, not a substitute for it.

QA review vs editorial review

The revision made me distinguish two kinds of AI review.

QA review asks: Did the artifact satisfy the stated criteria?

Editorial review asks: What will the reader misunderstand, miss, or not believe?

This was not completely new to me. In code review, I already used different Bugbot guides depending on what I wanted it to optimize for: security, game-state changes, UX regressions, or plan intent. The same diff could be reviewed through different lenses.

Writing turned out to have the same property as code review. A QA reviewer checks completeness and publishing criteria. An editorial reviewer reads for audience confusion and belief. The artifact stayed the same. The review lens changed.

Both matter. Broken frontmatter, missing sections, or absent takeaways still need QA. But if the reviewer starts and ends there, it can produce a confident report that never engages the reader's path through the article.

The first reviewer was not useless. It was doing QA under the name of critique.

The revised reviewer still scores, but it has to earn the score by reading first.

That sequencing shift moved output from "this article is mostly ready" toward "this article assumes too much context, reveals its lesson too early, and needs stronger in-narrative evidence before the governance argument about where an agent should stop lands."

That is the feedback I needed.

What I'd do on the next reviewer

For the next AI reviewer I build, I would design sequence before I tune rubric dimensions.

Start with an ungated read. Inspect audience, intent, risk, and evidence before scoring thresholds appear.
Make the rubric summarize the analysis. Scores should cite read-through observations, not invent them after the fact.
Separate checklist pass from judgment pass. "Is it complete?" and "is it good?" are different questions.
Force reader-impact language. Critique items should say what breaks for the reader, not only which rule was violated.
Let scores come last. Once a number appears, everything organizes around it.

This is not only about writing. I suspect the same pattern may apply to PR review, architecture review, incident analysis, and evaluation reports: if a reviewer scores before it understands, it overfits to the rubric and under-reads the situation.

The shape feels portable. Evaluation criteria are not enough. The order in which a reviewer thinks changes what it notices.

Takeaway: If your AI reviewer keeps producing technically correct but shallow feedback, do not only rewrite the rubric. Move analysis before scoring.

Editor's note (July 2026)

This article documents the first major architectural change to editor-critique: separating analysis from scoring. That sequence change held up, but it also exposed a new class of reviewer failures that couldn't be solved through rubric expansion alone. The follow-up, I fixed my AI reviewer. Then I kept solving the wrong problem, explores that next stage.

If you'd like to see the project behind these workflow experiments, try Codenames AI.

Top comments (21)

TxDesk • Jun 27

the line that does the work here is "once the reviewer committed to a number, the rest justified it." that's the real mechanism, and it's bigger than scoring: any verdict emitted early becomes a thing the reasoning defends instead of tests. the score is just the most common early verdict.

moving analysis before scoring fixes the worst version of it, but i'd watch for the residual: an analysis-first reviewer still anchors, just on its own first read instead of the rubric. better, because the first read is reader-shaped, but it's the same shape of bug one layer in. the read becomes the thing later passes justify.

the only thing i've found that doesn't drift is a second pass whose explicit job is to disagree with the first, not extend it. not "re-read and refine" but "argue the opposite verdict and see if it holds." analysis-first gets you an honest opening read; an adversarial second pass is what stops that read from becoming the new rubric. curious whether your editor-critique does one read or whether the read-through itself gets challenged before the score lands.

Michael Truong • Jun 27

Right now editor-critique does one editorial read-through before the score. It does not explicitly challenge that read with a second, adversarial pass. The current fix was aimed at preventing the score from anchoring the analysis, not at making the analysis debate another perspective before it becomes the basis for the score.

I think you're pointing at a second failure mode I hadn't separated cleanly before.

There are really two choices when designing a reviewer. The first is which domain it reviews from: editorial, security, UX, architecture, and so on. The second is who it is reading as within that domain. A security reviewer reading as a cooperative user and a security reviewer reading as an attacker are wearing the same domain hat, but playing very different roles.

In practice I probably wouldn't run every possible persona on every review, any more than I'd invite every engineer on a team to review every PR. The interesting orchestration problem is choosing the right mix of perspectives for the task.

So analysis-before-scoring solves one failure mode for me. But your comment suggests another one: even a good analysis can become the new anchor if it is never challenged from a genuinely different perspective.

TxDesk • Jun 30

the persona axis is the right second dimension, and the PR-reviewer analogy is good, but i think it hides one thing. the personas you list (editorial, security, UX, architecture) are all constructive readers from different domains. the adversarial pass isn't a fifth persona on that list, it's orthogonal to it. it's a mode any of them can run against its own first read. security-as-cooperative-user vs security-as-attacker isn't two domains, it's one persona in constructive then adversarial mode.

which matters for the orchestration question, because it's not "which of N personas do i invite," it's "which personas, each of them then made to argue against what it just concluded." the second pass isn't a different seat at the table. it's the same reviewer being told its own verdict is the thing under review now.

the cheap version i keep coming back to: after any reviewer produces its read, run one more turn whose only instruction is "argue this is wrong." not a new domain, not a new persona. same context, inverted burden. it catches the anchor specifically because it can't inherit the first read's confidence, its whole job is to attack it. curious if that's cheaper in your pipeline than spinning up a second full persona, since it's one extra turn on the same read rather than another reviewer to design.

Michael Truong • Jun 30

That's a sharp distinction. I hadn't separated "persona" from "mode" before. I was thinking about choosing different reviewers, whereas you're describing asking the same reviewer to attack its own conclusion.

The nice part is that this is cheap to test. editor-critique is already a multi-turn editorial conversation, so I can insert one extra turn that's simply "argue your previous conclusion is wrong" before the final synthesis and see whether it actually changes the outcome.

If it consistently catches issues the current flow misses, then I think that's evidence for another distinct failure mode beyond score anchoring. I'd also want to watch for the opposite failure mode: generating objections for the sake of it. One thing I've noticed while iterating on these review workflows is that every additional critique loop creates pressure to change something. So the interesting question isn't whether it finds more suggestions, but whether it actually changes the editorial judgment.

TxDesk • Jul 1

that opposite failure mode is the real risk, and i think it's the same disease one more layer up: the first pass bends toward defending the verdict, the forced-objection pass bends toward manufacturing dissent. both are the reasoning taking the shape of the prompt instead of the artifact.

which points at the fix. the adversarial turn can't be scored on whether it produces objections, it has to be scored on whether the objection survives contact with the draft. a manufactured one dies the moment you check it against the text, the evidence isn't there. a real one holds. so the second pass needs the same discipline as the first: the objection is a claim, and it's only worth acting on if it's checkable against the thing, not just plausible-sounding.

so maybe the test isn't "did the adversarial turn change the judgment," it's "did it raise an objection that re-grounds in the draft and still stands." if yes, that's a real miss the first pass anchored past. if the objection evaporates when you point it back at the text, it was loop-pressure, discard it. same standard both directions: trust what re-derives against the artifact, distrust what only sounds right. curious if editor-critique could gate the objection that way, make it cite the passage it breaks on before it counts.

Michael Truong • Jul 1

This turned out to be exactly the failure mode when I implemented the adversarial pass.

The easy part was getting it to generate objections. The hard part was stopping it from generating plausible sounding objections just because the prompt told it to disagree.

I ended up treating the adversarial pass less as "argue the opposite" and more as "prove the previous editor materially misread the draft." If it can't point back to the draft and show where the first read broke, the objection doesn't earn the right to change the recommendation.

One useful constraint was separating finding objections from acting on them. The primary critique stays frozen. The adversarial pass can only challenge it. Then a synthesis step asks whether any challenge actually survived contact with the draft strongly enough to change the editorial judgement. Often it doesn't, and that's a perfectly valid outcome.

So I think you've identified the real evaluation criterion. The interesting metric isn't "did the adversarial turn disagree?" It's "did it uncover a draft-supported miss that changed the editorial judgement?" Otherwise we're just rewarding the model for being argumentative.

TxDesk • Jul 2

"the objection doesn't earn the right to change the recommendation" is the exact line. and freezing the primary critique so the adversarial pass can only challenge, never rewrite, is the part i didn't have and will steal, it stops the second pass from quietly becoming the new first draft.

the "often it doesn't, and that's a valid outcome" point is the one i'd underline for anyone building this. a review step that can conclude "nothing survived, ship it unchanged" is doing its job. the failure mode is a pass that feels obligated to justify its own existence by finding something. good series, this was a genuinely useful one to think through.

Nazar Boyko • Jun 26

This matches something about any scored review, human or AI. The moment a number lands, the rest of the write-up turns into a defense of that number instead of an honest read. Putting analysis first is such a small reorder for how much it changes what gets noticed. One thing I'd be curious about: do you keep the read-through and the score as separate outputs, so you can see whether the score followed the read or quietly drifted back to justify itself?

Michael Truong • Jun 27

Yeah, that's the exact failure mode I'm trying to guard against.

Right now editor-critique keeps them separate in the execution sequence, but not as two separate visible outputs. The skill has to do an editorial read-through first — visible prose only, comments stripped — then score afterward. The scorecard notes have to cite observations from that read-through, and high/medium critique items have to explain the reader impact, so the score is summarizing the earlier read rather than becoming a justification for itself.

So the final critique is intended to present the conclusions of that process, not every intermediate step.

That said, your question did make me think about a useful debug mode. The default output is optimized for the editorial workflow, but a verbose mode could emit a short "read-through observations" block before the score, or explicitly link each score note back to a specific observation. That would make it much easier to inspect the review process when evolving the reviewer itself.

sehwan Moon • Jun 26

"This is exactly the gap AINAScan targets — AI reviewers score high on syntax and logic, but consistently miss structural bugs: save functions that never write to DB, async functions with no await, parameters with zero effect on return values. High score, wrong question."

Michael Truong • Jun 26

This reminds me of a discussion on my previous post about reviewing the change against frozen intent. The implementation can be perfectly consistent with the question the reviewer is asking, but if that isn't the right question, you can still end up with a very confident review that misses the thing that matters.

This article ended up being another example of the same idea, just from the writing side instead of code.

sehwan Moon • Jun 27

"Exactly — 'asking the right question' is the whole problem. Most static analysis tools ask 'is this syntax valid?' AINAScan asks 'does this function actually do what its name says?' That's why MISSING_WRITE catches save functions that return success without touching the DB — the code is correct, the question was wrong."

Pon • Jun 27

There's a third review you didn't list that breaks the order fix harder: security. Editorial asks what the reader will misunderstand. Security asks what an attacker does that the author never intended. Analysis-before-scoring helps editorial because the model can do the cold read, it can play a confused reader. But it reads code as a cooperative user by default, the person using the feature the way it was meant. Nobody tells it to read as the person trying to abuse it. So a security diff gets the same 23/25: a fluent read-through that lands on "no obvious issues," then a score summarizing a read-through that was never adversarial to begin with. Order still matters, but the security hat needs the analysis pass to be told who to read as, or it reads as the wrong person and scores high for it.

Michael Truong • Jun 27

It feels like there are actually two layers to choosing the review lens.

The first is which domain you're reviewing from (editorial, security, UX, etc.). The second is who you're reading as within that domain.

For editorial, that might be a cold reader, a skeptic, or even a devil's advocate. For security, it might be the difference between reviewing as someone checking the implementation against the spec versus someone actively trying to abuse it.

In hindsight, analysis-before-scoring solved one problem for me, but it assumes the analysis is happening from the right perspective in the first place. If the reviewer starts from the wrong persona, you can still end up with a very coherent review that answers the wrong question.

Pon • Jun 27

That second layer is exactly where I got burned. I had a review pass reading as someone checking the code against the spec, and every check came back clean because the code did exactly what the spec said. The spec just never mentioned the user who lies about their own ID. What helped was making the persona an explicit input instead of letting it default. I name who is reading before the analysis runs: read this as someone trying to pull another tenant's rows. Same model, same diff, very different findings. And you nailed the dangerous part, a coherent review that answers the wrong question looks finished, so nobody goes back to check it.

Mike Czerwinski • Jun 28

Reordering analysis before scoring removes one anchor and leaves a quieter one. The number stopped fixing the verdict, but the editorial read is still one model reading as a reader it picked, and the miss you are describing lives in the reader it did not pick. vollos put the seam in the comments: the spec-clean pass came back clean because the check read against a spec that never mentioned the case. An editorial read has the same shape. It can only surface what the chosen persona would notice, and missed the point is by definition the point no assigned reader was looking for. The fix that bites is not a better first pass, it is a second reader the first one cannot author: an adversarial read whose only job is to find the question the draft pre-empted. Reorder lowers how hard the score anchors the analysis. It does not add anyone to the room who can disagree with it.

Michael Truong • Jun 28

I think the conversation has gradually split what I originally thought was one design choice into three:

Which domain is reviewing? (editorial, security, UX, architecture...)
Which persona is reading within that domain? (cooperative reader, attacker, skeptic, devil's advocate...)
How many independent perspectives do you actually want before converging on a recommendation?

I've decided to keep editor-critique as a single editorial read-through for the time being. The immediate goal is generating useful editorial improvements rather than building a reviewer that can establish objective truth from multiple perspectives.

But this discussion has convinced me there's a separate orchestration problem once you start asking who else should get a vote? That's probably closer to assembling the right review team than designing a better individual reviewer.

For now, I'm happy to acknowledge that trade-off rather than prematurely complicate the workflow. If I eventually introduce multiple personas, I think it would be because the task genuinely benefits from a second perspective, not because every review should automatically become a committee.

Mike Czerwinski • Jun 28

The split tracks. Single editorial read is the right default once you accept that the goal is generating improvements, not establishing objective truth. The committee framing only earns its cost when the cost of being wrong is high enough to want a vote.

The trigger I would offer for when to escalate, since you are keeping it as a possible future rather than a default: not every review needs a second persona, but reviews where the first reader returns high-confidence approval are exactly the ones where a second-reader-who-disagrees-by-construction pays. Easy passes are when single editors miss things, because the prompt has already converged on a frame and nothing is contesting it. Hard finds find themselves. Easy approvals need someone whose job is to refuse to be convinced.

Keeping the workflow simple now and adding the second persona only when you see a class of failures it would have caught is the right shape. The trap is adding it because it sounds rigorous.

xulingfeng • Jun 26

Been on the other end of AI content moderation — shadow-banned, hit with the 'low quality' flag, had to argue my way back. So yeah, this one hit close to home. The gap between 'did it pass the checklist' and 'would a human actually care' is where most tools just... stop. And it's not just reviewers. Moderation, test suites, coverage reports — same trap every time. The editorial read-through is the piece everyone skips. Scoring is easy. Reading is hard. Glad you're doing the hard part first👊

Michael Truong • Jun 26

I really like the way you framed it as the gap between "did it pass the checklist?" and "would a human actually care?" That captures the failure mode really well.

It's also interesting that you've seen the same pattern in moderation, test suites, and coverage reports. That was one of the things I was hoping to explore beyond writing. In code review I'd already found that different review lenses (security, UX, plan intent, etc.) surface very different feedback on the same change. This felt like another example where how the reviewer approaches the artifact matters just as much as the rubric itself.

"Scoring is easy. Reading is hard." is a fantastic summary. We might have to borrow that one. 👊

xulingfeng • Jun 26

Thanks Michael! And yeah — security lens vs UX lens on the same diff is basically two different reviews. Same code, different question. Makes me think maybe the fix isn't better rubrics, but just asking 'what hat are we wearing today?' before looking at anything

View full discussion (21 comments)