<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: J Wang</title>
    <description>The latest articles on DEV Community by J Wang (@jinhua_wang_cfd95305d53e5).</description>
    <link>https://dev.to/jinhua_wang_cfd95305d53e5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4001129%2F8f38245b-7b88-42e7-9295-6a5636ad7f05.png</url>
      <title>DEV Community: J Wang</title>
      <link>https://dev.to/jinhua_wang_cfd95305d53e5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jinhua_wang_cfd95305d53e5"/>
    <language>en</language>
    <item>
      <title>How to grade an AI agent's output before it ships</title>
      <dc:creator>J Wang</dc:creator>
      <pubDate>Wed, 24 Jun 2026 19:18:37 +0000</pubDate>
      <link>https://dev.to/jinhua_wang_cfd95305d53e5/how-to-grade-an-ai-agents-output-before-it-ships-4571</link>
      <guid>https://dev.to/jinhua_wang_cfd95305d53e5/how-to-grade-an-ai-agents-output-before-it-ships-4571</guid>
      <description>&lt;p&gt;AI agents now produce work — code, support replies, claims decisions, research memos, documents — faster than any team can review it. The uncomfortable part: most models are aligned to be &lt;em&gt;helpful and agreeable&lt;/em&gt;, so an agent tends to approve its own output. At any real scale, that means unreviewed agent work reaches production.&lt;/p&gt;

&lt;p&gt;The fix isn't "review everything by hand" (you can't) or "trust the model" (it's the thing being checked). It's an &lt;strong&gt;acceptance gate&lt;/strong&gt;: an automated checkpoint between an agent and production that grades each output against an explicit policy and decides what happens to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four-band acceptance model
&lt;/h2&gt;

&lt;p&gt;A useful gate doesn't return a vibe — it returns a score and one of four decisions, so the outcome is policy-bound and auditable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ship&lt;/strong&gt; — meets the policy; accept it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;route to fix&lt;/strong&gt; — close, but send it back with the located flaws and concrete upgrades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;quarantine&lt;/strong&gt; — hold for human review; don't ship yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;block&lt;/strong&gt; — fails the policy; must not reach production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The score is a single number (say 0.0–1.0, where 1.0 = ship and 0.0 = must block). The bands turn that number into an action your pipeline can branch on.&lt;/p&gt;
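
&lt;p&gt;As a minimal sketch of that branching step, the mapping below turns a score into a band. The cut-offs (0.40, 0.60, 0.85), the placement of &lt;code&gt;quarantine&lt;/code&gt; below &lt;code&gt;route_to_fix&lt;/code&gt;, and the function name &lt;code&gt;band_for&lt;/code&gt; are assumptions made for illustration; a real policy would set its own thresholds.&lt;/p&gt;

```python
from bisect import bisect_right

# Hypothetical cut-offs: the article names the four bands but not the thresholds.
THRESHOLDS = [0.40, 0.60, 0.85]
# Assumed ordering from worst to best; index i applies once i thresholds are reached.
BANDS = ["block", "quarantine", "route_to_fix", "ship"]


def band_for(score: float) -> str:
    """Map a 0.0-1.0 score to one of the four bands."""
    if not (score >= 0.0 and 1.0 >= score):
        raise ValueError(f"score out of range: {score!r}")
    # bisect_right counts how many thresholds the score has reached or passed.
    return BANDS[bisect_right(THRESHOLDS, score)]
```

&lt;p&gt;Keeping the thresholds in one sorted list means a policy change is a data edit rather than a rewrite of branching logic.&lt;/p&gt;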

&lt;h2&gt;
  
  
  Why a &lt;em&gt;hostile&lt;/em&gt; critic, not a friendly one
&lt;/h2&gt;

&lt;p&gt;The critical design choice: the grader should be aligned the &lt;strong&gt;opposite&lt;/strong&gt; way from the agent that produced the work. A general "LLM-as-a-judge" is helpful-by-default, so it rubber-stamps. An acceptance critic should be &lt;strong&gt;hostile-by-default&lt;/strong&gt;: aligned to find reasons to &lt;em&gt;block&lt;/em&gt;, grading against your acceptance criteria, and evaluating not just the final artifact but the &lt;strong&gt;trajectory&lt;/strong&gt; the agent took to get there.&lt;/p&gt;

&lt;p&gt;This is the part teams get wrong: they reuse a friendly model as the judge and wonder why it never catches anything. A grader that doesn't push back under pressure is worse than no grader, because it manufactures false confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop, concretely
&lt;/h2&gt;

&lt;p&gt;The gate is most useful when the agent can run it itself and iterate to a passing band. Here's the shape using &lt;a href="https://seaotter.ai" rel="noopener noreferrer"&gt;OtterScore&lt;/a&gt;, a hostile-by-default critic you call over HTTP or MCP:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. get a free key (no human required)&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.seaotter.ai/api/v1/agent-keys/signup &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"email":"you@example.com"}'&lt;/span&gt;

&lt;span class="c"&gt;# 2. grade the work (async; tolerates a cold GPU)&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.seaotter.ai/api/v1/eval/jobs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$OTTER_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"submission":"async","user_prompt":"&amp;lt;what the work was for&amp;gt;",
       "artifact_parts":[{"mime_type":"text/plain","text":"&amp;lt;your work&amp;gt;"}]}'&lt;/span&gt;

&lt;span class="c"&gt;# 3. poll until completed (JOB_ID is taken from the step 2 response)&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.seaotter.ai/api/v1/eval/jobs/&lt;span class="nv"&gt;$JOB_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$OTTER_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# -&amp;gt; { "status":"completed", "result_summary":{ "band":"ship", "score":0.95 } }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the band comes back &lt;code&gt;route_to_fix&lt;/code&gt; or &lt;code&gt;block&lt;/code&gt;, the response includes the located flaws and concrete upgrades — feed those back to the agent, regenerate, and re-grade until it clears the bar. Prefer MCP? Connect the hosted server by URL with no install: &lt;code&gt;https://mcp.seaotter.ai/mcp&lt;/code&gt;.&lt;/p&gt;
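
&lt;p&gt;The grade, fix, and re-grade cycle can be sketched as a small loop. This is an illustration under stated assumptions, not the OtterScore client: &lt;code&gt;grade&lt;/code&gt; and &lt;code&gt;regenerate&lt;/code&gt; are hypothetical callables you would supply (for example, a wrapper around the HTTP calls above and a call back into your agent), the &lt;code&gt;flaws&lt;/code&gt; key is an assumed name for the returned feedback, and the round limit is an added safeguard.&lt;/p&gt;

```python
from typing import Callable

# Bands that the article says come back with flaws and upgrades to act on.
RETRY_BANDS = {"route_to_fix", "block"}


def gate_with_retries(
    draft: str,
    grade: Callable[[str], dict],
    regenerate: Callable[[str, list], str],
    max_rounds: int = 3,
) -> tuple[str, dict]:
    """Grade a draft, feed flaws back to the agent, and re-grade.

    Stops as soon as the band is not retryable (for example "ship" or
    "quarantine") or after max_rounds regeneration attempts.
    """
    verdict = grade(draft)
    for _ in range(max_rounds):
        if verdict["band"] not in RETRY_BANDS:
            break
        # "flaws" is a hypothetical key for the located flaws in the response.
        draft = regenerate(draft, verdict.get("flaws", []))
        verdict = grade(draft)
    return draft, verdict
```

&lt;p&gt;Capping the rounds matters: if the agent cannot clear the bar, the last verdict is returned so the pipeline can hold or block the work instead of looping forever.&lt;/p&gt;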

&lt;h2&gt;
  
  
  What makes the data hard (and the moat real)
&lt;/h2&gt;

&lt;p&gt;The genuinely hard problem isn't the loop — it's the &lt;em&gt;training data&lt;/em&gt; for the critic. The only data worth training an acceptance critic on is agent work that &lt;strong&gt;fools a strong discriminator&lt;/strong&gt;. Easy, obviously-bad examples teach it nothing. So you build the corpus adversarially: generate or mine flawed work, score it with a strong critic, and &lt;strong&gt;keep only the cases the critic misses&lt;/strong&gt;. That fail-set is the only thing that compounds, because by construction it's what a strong grader can't yet catch.&lt;/p&gt;
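
&lt;p&gt;The filtering step in that recipe is simple to state in code. The sketch below assumes every input sample is already known to be flawed and that &lt;code&gt;critic_band&lt;/code&gt; is a hypothetical callable returning the band a critic assigns; treating only &lt;code&gt;ship&lt;/code&gt; as a miss is also an assumption.&lt;/p&gt;

```python
from typing import Callable, Iterable


def build_fail_set(
    flawed_samples: Iterable[str],
    critic_band: Callable[[str], str],
    accepted_bands: frozenset = frozenset({"ship"}),
) -> list[str]:
    """Keep only the known-flawed samples that the critic wrongly accepts.

    Samples the critic already rejects are dropped: they teach it nothing new.
    """
    return [s for s in flawed_samples if critic_band(s) in accepted_bands]
```

&lt;p&gt;Everything the function returns is, by construction, a case the current critic gets wrong, which is what makes the set useful for the next round of training.&lt;/p&gt;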

&lt;h2&gt;
  
  
  Where to take it next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Score &lt;strong&gt;whole workflows&lt;/strong&gt;, not just single steps — a topology-aware composite plus a per-step critique tells you which stage of an agent pipeline is the weak link.&lt;/li&gt;
&lt;li&gt;Make the policy &lt;strong&gt;yours&lt;/strong&gt; — bring your own rubric/acceptance criteria so the gate enforces &lt;em&gt;your&lt;/em&gt; bar, not a generic notion of quality.&lt;/li&gt;
&lt;li&gt;Keep an &lt;strong&gt;audit trail&lt;/strong&gt; — every verdict recorded as signed evidence, so "why did this ship?" always has an answer.&lt;/li&gt;
&lt;/ul&gt;
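
&lt;p&gt;For the audit-trail idea, one possible shape is a record that carries the verdict plus a tag computed over it. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in for signing; that choice, the record layout, and the function names are assumptions for illustration, and the article does not say how OtterScore signs its evidence. An HMAC is a shared-secret check, so a public-key signature would be needed if third parties must verify records.&lt;/p&gt;

```python
import hashlib
import hmac
import json


def _canonical(verdict: dict) -> bytes:
    # Sorted keys and fixed separators give the same bytes for the same verdict.
    return json.dumps(verdict, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_verdict(verdict: dict, key: bytes) -> dict:
    """Return an audit record: the verdict plus an HMAC-SHA256 tag over it."""
    tag = hmac.new(key, _canonical(verdict), hashlib.sha256).hexdigest()
    return {"verdict": verdict, "signature": tag}


def verify_record(record: dict, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, _canonical(record["verdict"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])
```

&lt;p&gt;Any later change to a stored verdict makes &lt;code&gt;verify_record&lt;/code&gt; return &lt;code&gt;False&lt;/code&gt;, so the log can answer "why did this ship?" with evidence that has not been edited after the fact.&lt;/p&gt;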

&lt;p&gt;The full breakdown — the four-band model, the API, and the FAQ — is here: &lt;strong&gt;&lt;a href="https://seaotter.ai/docs/ai-agent-evaluation" rel="noopener noreferrer"&gt;AI agent evaluation: how to evaluate and gate agent output&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you're shipping agents to production, put a hostile gate in front of them before the unreviewed output does the deciding.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llmops</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
