<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nazar Kalytiuk</title>
    <description>The latest articles on DEV Community by Nazar Kalytiuk (@nazar_kalytiuk_2659345d1f).</description>
    <link>https://dev.to/nazar_kalytiuk_2659345d1f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904543%2F7bcf5af5-8ad9-471c-88be-2c2d6987f674.png</url>
      <title>DEV Community: Nazar Kalytiuk</title>
      <link>https://dev.to/nazar_kalytiuk_2659345d1f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nazar_kalytiuk_2659345d1f"/>
    <language>en</language>
    <item>
      <title>An API testing tool built specifically for AI agent loops</title>
      <dc:creator>Nazar Kalytiuk</dc:creator>
      <pubDate>Wed, 29 Apr 2026 14:23:59 +0000</pubDate>
      <link>https://dev.to/nazar_kalytiuk_2659345d1f/an-api-testing-tool-built-specifically-for-ai-agent-loops-3bma</link>
      <guid>https://dev.to/nazar_kalytiuk_2659345d1f/an-api-testing-tool-built-specifically-for-ai-agent-loops-3bma</guid>
      <description>&lt;p&gt;I was working on a small API for an internal tool. I wanted my coding agent — Claude Code, in this case, but Cursor or opencode would have done — to take the boring part off my plate: write a happy-path test for each endpoint I added, run it, and fix it when something broke.&lt;/p&gt;

&lt;p&gt;The "write" part was great. Claude generated reasonable tests on the first shot. The "run" part was fine.&lt;/p&gt;

&lt;p&gt;The "fix" part is where it fell apart.&lt;/p&gt;

&lt;p&gt;Here's a typical failure as Jest spits it out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AssertionError: expected 200 to equal 404
    at Object.&amp;lt;anonymous&amp;gt; (/path/to/test.js:14:23)
    at processImmediate (node:internal/timers:483:21)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent would read this, guess "the URL is wrong" and patch the test. Sometimes that worked. But often the actual problem was different — the server wasn't even running, or the response shape had drifted from &lt;code&gt;{"uuid": "..."}&lt;/code&gt; to &lt;code&gt;{"request": {"uuid": "..."}}&lt;/code&gt;, or the body was perfectly valid JSON but the JSONPath in my assertion was wrong, or the timeout was tripping before the response came back.&lt;/p&gt;

&lt;p&gt;All of those look identical in stderr. They all collapse into the prose phrase "200 != 404." The agent had no way to tell them apart, so it kept reaching for the same fix (change the URL) and was right maybe 30% of the time.&lt;/p&gt;

&lt;p&gt;I tried a few stopgaps — adding richer error messages, parsing the stack trace heuristically — and they all got me to maybe 50% first-fix correctness. Not enough: still below the bar where you can leave the agent alone and trust the loop to converge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The unlock isn't better stderr — it's structure
&lt;/h2&gt;

&lt;p&gt;The model doesn't need a more eloquent error message. It needs &lt;em&gt;data&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If the test runner returned, on failure, a JSON shape like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failure_category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assertion_failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TARN-A-STATUS-MISMATCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actual"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Status 404 often means the URL or HTTP method is wrong."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Check the endpoint exists and that the path matches your route registration."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the agent's branching logic becomes obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;failure_category == "connection_error"&lt;/code&gt; → server isn't reachable. Don't touch the test, check &lt;code&gt;base_url&lt;/code&gt;, kill-and-restart the dev server.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failure_category == "timeout"&lt;/code&gt; → either bump the timeout or look at server perf. Don't change assertions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failure_category == "assertion_failed"&lt;/code&gt; AND &lt;code&gt;TARN-A-STATUS-MISMATCH&lt;/code&gt; → look at the response body and the URL. Probably the endpoint or method is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failure_category == "assertion_failed"&lt;/code&gt; AND &lt;code&gt;TARN-A-BODY-SHAPE&lt;/code&gt; → response shape changed. Update the JSONPath, don't touch the URL.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failure_category == "capture_error"&lt;/code&gt; → previous step's assertion passed but &lt;code&gt;$.id&lt;/code&gt; couldn't be extracted. The shape of &lt;em&gt;that&lt;/em&gt; response drifted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't magic. It's just data instead of prose. The agent can branch on a six-state enum trivially. It cannot reliably branch on a sentence.&lt;/p&gt;
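&lt;p&gt;A minimal sketch of that branching in Python. The field names match the report shape above; the suggested actions are illustrative strings I made up, not Tarn APIs:&lt;/p&gt;

```python
# Sketch of the dispatch an agent (or a thin wrapper script) can do once
# failures are data. Categories and error codes follow the report shape
# shown above; the returned actions are illustrative, not Tarn output.

def plan_fix(failure):
    category = failure["failure_category"]
    code = failure.get("error_code", "")

    if category == "connection_error":
        return "restart the dev server and verify base_url"
    if category == "timeout":
        return "raise the step timeout or profile the endpoint"
    if category == "capture_error":
        return "inspect the previous step's response and update the capture JSONPath"
    if category == "assertion_failed":
        if code == "TARN-A-STATUS-MISMATCH":
            return "check the endpoint path and HTTP method"
        if code == "TARN-A-BODY-SHAPE":
            return "update the body JSONPath, not the URL"
    return "read the full request and response in the report"

print(plan_fix({"failure_category": "assertion_failed",
                "error_code": "TARN-A-STATUS-MISMATCH"}))
```

&lt;p&gt;Ten lines of branching replaces the 30%-accurate guessing from before.&lt;/p&gt;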

&lt;p&gt;So I built that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tarn — what it actually is
&lt;/h2&gt;

&lt;p&gt;Tarn is a CLI-first API testing tool I wrote in Rust. The whole bet is the contract above: every failure comes back with a stable category, a stable error code, and a list of remediation hints.&lt;/p&gt;

&lt;p&gt;Tests are &lt;code&gt;.tarn.yaml&lt;/code&gt; files. The minimal one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Health check&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GET /health&lt;/span&gt;
    &lt;span class="na"&gt;request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GET&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;env.base_url&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}/health"&lt;/span&gt;
    &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;YAML on purpose. Models already know YAML — there is no DSL to teach. There is no test framework to bootstrap. An LLM writes a &lt;code&gt;.tarn.yaml&lt;/code&gt; file, you run &lt;code&gt;tarn run&lt;/code&gt;, it goes.&lt;/p&gt;

&lt;p&gt;A more realistic test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;User CRUD&lt;/span&gt;
&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;base_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/api/v1"&lt;/span&gt;

&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;create_and_verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create user&lt;/span&gt;
        &lt;span class="na"&gt;request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
          &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;env.base_url&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}/users"&lt;/span&gt;
          &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane"&lt;/span&gt;
            &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jane.{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$random_hex(6)&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}@example.com"&lt;/span&gt;
        &lt;span class="na"&gt;capture&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.id"&lt;/span&gt;
        &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;201&lt;/span&gt;
          &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.id"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;string&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;not_empty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;true&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify user&lt;/span&gt;
        &lt;span class="na"&gt;request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GET&lt;/span&gt;
          &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;env.base_url&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}/users/{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;capture.user_id&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
        &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
          &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.id"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;capture.user_id&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;{{ $random_hex(6) }}&lt;/code&gt; is a built-in faker so each run gets a unique email. &lt;code&gt;capture&lt;/code&gt; plucks &lt;code&gt;$.id&lt;/code&gt; from the create response and types it (string stays string, number stays number — important for downstream JSONPath assertions). The second step interpolates &lt;code&gt;{{ capture.user_id }}&lt;/code&gt; into both the URL and the assertion.&lt;/p&gt;
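&lt;p&gt;A tiny illustration (plain Python, not Tarn internals) of why the typed capture matters:&lt;/p&gt;

```python
# If $.id is the number 42 and the runner stringifies it on capture,
# the follow-up assertion compares "42" to 42 and fails for a reason
# that has nothing to do with the API.

create_response = {"id": 42, "name": "Jane"}

captured_as_string = str(create_response["id"])   # naive capture: type is lost
captured_typed = create_response["id"]            # typed capture: number stays number

verify_response = {"id": 42, "name": "Jane"}
print(captured_as_string == verify_response["id"])  # False: "42" vs 42
print(captured_typed == verify_response["id"])      # True
```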

&lt;p&gt;Default human output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tarn run tests/users.tarn.yaml
&lt;span class="go"&gt; ● User CRUD / create_and_verify
   ✓ Create user (123ms)
   ✓ Verify user (45ms)
 Results: 1 test passed (180ms)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For agents and CI, ask for JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tarn run tests/users.tarn.yaml &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--json-mode&lt;/span&gt; compact
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get a complete machine-readable run report with &lt;code&gt;failure_category&lt;/code&gt;, &lt;code&gt;error_code&lt;/code&gt;, request, response, captures, durations, all of it. Successful steps are summarized; failed steps include the full request and response so the agent has every byte it needs to diagnose the issue without re-running anything.&lt;/p&gt;
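&lt;p&gt;To make that concrete, here is a minimal Python consumer that pulls the failed steps out of a run report. The shape is assumed from the failure example later in this post; the authoritative schema is &lt;code&gt;schemas/v1/report.json&lt;/code&gt;:&lt;/p&gt;

```python
import json

# Minimal consumer for a Tarn JSON run report. The shape here is assumed
# from the examples in this post, not copied from the schema.
report_text = '''
{"tests": [{"name": "Upload avatar", "status": "failed",
  "steps": [{"name": "POST avatar",
             "failure_category": "assertion_failed",
             "error_code": "TARN-A-STATUS-MISMATCH"}]}]}
'''
report = json.loads(report_text)

# Collect (test, step, category, code) for every failed step.
failed = [
    (t["name"], s["name"], s["failure_category"], s["error_code"])
    for t in report["tests"] if t["status"] == "failed"
    for s in t["steps"] if "failure_category" in s
]
for test_name, step_name, category, code in failed:
    print(f"{test_name} / {step_name}: {category} ({code})")
```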

&lt;h2&gt;
  
  
  The agent loop in practice
&lt;/h2&gt;

&lt;p&gt;Here is what a realistic loop looks like when I'm pairing with Claude Code on a new endpoint.&lt;/p&gt;

&lt;p&gt;Me: "I just added &lt;code&gt;POST /users/:id/avatar&lt;/code&gt; for multipart avatar uploads. Write a test for it."&lt;/p&gt;

&lt;p&gt;Claude Code writes &lt;code&gt;tests/avatar.tarn.yaml&lt;/code&gt; with a multipart upload step. Runs &lt;code&gt;tarn run tests/avatar.tarn.yaml --format json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Output (failure):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Upload avatar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST avatar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"failure_category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assertion_failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"error_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TARN-A-STATUS-MISMATCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:3000/api/v1/users/abc-123/avatar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"multipart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"avatar.png"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4321&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"missing field 'avatar'"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Server returned 400 with body containing 'missing field'. Check that the multipart field name matches the server expectation."&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude reads &lt;code&gt;failure_category: "assertion_failed"&lt;/code&gt;, sees the hint about a missing field, looks at the response body — &lt;code&gt;missing field 'avatar'&lt;/code&gt; — and the request — &lt;code&gt;name: "file"&lt;/code&gt;. Patches the YAML to use &lt;code&gt;name: "avatar"&lt;/code&gt;. Re-runs. Green.&lt;/p&gt;

&lt;p&gt;Total round trip: maybe 30 seconds. No human in the middle. The interesting part is that the agent didn't have to &lt;em&gt;guess&lt;/em&gt; — it had a &lt;code&gt;failure_category&lt;/code&gt; to branch on, a hint to read first, and the request/response to confirm.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP — making it tool-native
&lt;/h2&gt;

&lt;p&gt;The next thing I built was &lt;code&gt;tarn-mcp&lt;/code&gt;, a server that implements the Model Context Protocol. Instead of Claude Code shelling out to &lt;code&gt;tarn run --format json&lt;/code&gt; and parsing stdout, it can call typed MCP tools directly.&lt;/p&gt;

&lt;p&gt;The tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tarn_run&lt;/code&gt; — execute a test or directory, structured JSON return&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tarn_validate&lt;/code&gt; — syntax check before running&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tarn_fix_plan&lt;/code&gt; — consume a failure report, emit structured fix suggestions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tarn_inspect&lt;/code&gt; — drill into a specific failure (&lt;code&gt;file::test::step&lt;/code&gt;) without parsing the full report&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tarn_rerun_failed&lt;/code&gt; — replay only failing &lt;code&gt;(file, test)&lt;/code&gt; pairs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tarn_diff&lt;/code&gt; — compare two run reports, bucket failures into &lt;code&gt;new&lt;/code&gt; / &lt;code&gt;fixed&lt;/code&gt; / &lt;code&gt;persistent&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;a few more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configure it in your &lt;code&gt;.mcp.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tarn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tarn-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Claude Code, opencode, Cursor, Windsurf — anything that speaks MCP — can call these as tools. Faster, less brittle, and the agent doesn't waste tokens re-parsing the same stdout format on every call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;A few things from real use I didn't expect when I started:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. YAML mattered more than the structured failures.&lt;/strong&gt; I expected the structured-JSON-failure thing to be the headline win. It is — but the bigger jump in agent first-shot correctness came from switching the test format from a Jest-style DSL to plain YAML. From maybe 60% correct first tests to maybe 90%. Models generate cleaner YAML than they generate any test-framework DSL, full stop. That was bigger than I gave it credit for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Failure cascades are the real problem, not single failures.&lt;/strong&gt; If step 3 fails because step 2 couldn't &lt;code&gt;capture: $.id&lt;/code&gt;, then steps 4, 5, 6 all show as failed for unrelated reasons (they were trying to use a &lt;code&gt;user_id&lt;/code&gt; that doesn't exist). The naive agent tries to fix step 6 first — and step 6 looks fine. Confusion compounds. Tarn collapses these cascades into a single root-cause entry — &lt;code&gt;cascades: 5&lt;/code&gt; rather than five individual failures. That single change made loops noticeably more efficient.&lt;/p&gt;
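&lt;p&gt;The collapse itself is simple to picture. A hypothetical sketch, not Tarn's actual implementation:&lt;/p&gt;

```python
# Illustration of cascade collapsing: once a step fails to capture a
# variable, every later failed step in the same test is a cascade, so
# the report shows one root cause plus a count instead of N failures.

steps = [
    {"name": "login", "status": "passed"},
    {"name": "create user", "status": "failed", "failure_category": "capture_error"},
    {"name": "get user", "status": "failed"},
    {"name": "update user", "status": "failed"},
    {"name": "delete user", "status": "failed"},
]

root = next(s for s in steps if s["status"] == "failed")        # first failure is the root
cascades = sum(1 for s in steps if s["status"] == "failed") - 1  # the rest are cascades
print(f"root cause: {root['name']} ({root['failure_category']}), cascades: {cascades}")
```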

&lt;p&gt;&lt;strong&gt;3. The &lt;code&gt;tarn_fix_plan&lt;/code&gt; tool is the most uncertain piece.&lt;/strong&gt; I built it as an MCP tool that consumes a failure report and emits structured fix &lt;em&gt;suggestions&lt;/em&gt;. But I'm honestly not sure that's the right level of abstraction. Maybe the model should just see the raw failure report and plan its own fix, and &lt;code&gt;tarn_fix_plan&lt;/code&gt; is over-engineering. I haven't decided yet. If you've built similar agent tools, I'd love to hear which side of this you've landed on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it deliberately doesn't do
&lt;/h2&gt;

&lt;p&gt;I want to be transparent about scope:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No XPath / HTML assertions.&lt;/strong&gt; Hurl is better for HTML scraping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No full Hurl-style filter DSL.&lt;/strong&gt; Hurl wins on filter depth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No OpenAPI-first test generation.&lt;/strong&gt; People keep asking; I'm not yet convinced this is the right fit for the agent loop, where the model generates tests from informal specs anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No GUI.&lt;/strong&gt; Bruno has an excellent one. If you want a GUI, use Bruno; Tarn is for CI and agent loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No record-replay.&lt;/strong&gt; Trace-based testing tools exist for that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tarn's bet is specifically the write-run-fix slice that an AI coding agent drives. If you're hand-writing tests as a human, Hurl or Bruno will probably make you happier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you're driving an agent loop where API tests are part of the picture, Tarn might fit. The install is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/NazarKalytiuk/tarn/main/install.sh | sh
tarn init
tarn run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single static binary, musl-linked so it runs on any Linux from Alpine to RHEL, plus macOS (Intel + Apple Silicon) and Windows. MIT-licensed. The &lt;code&gt;install.sh&lt;/code&gt; also lays down &lt;code&gt;tarn-mcp&lt;/code&gt; and &lt;code&gt;tarn-lsp&lt;/code&gt; (a Language Server for in-editor diagnostics on &lt;code&gt;.tarn.yaml&lt;/code&gt; files) when those are available in the release archive.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/NazarKalytiuk/tarn" rel="noopener noreferrer"&gt;https://github.com/NazarKalytiuk/tarn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://nazarkalytiuk.github.io/tarn/" rel="noopener noreferrer"&gt;https://nazarkalytiuk.github.io/tarn/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MCP setup: &lt;a href="https://nazarkalytiuk.github.io/tarn/mcp.html" rel="noopener noreferrer"&gt;https://nazarkalytiuk.github.io/tarn/mcp.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm specifically interested in feedback on three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The JSON failure schema (&lt;code&gt;schemas/v1/report.json&lt;/code&gt;) — does the failure-category taxonomy feel complete, too coarse, or too fine?&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;tarn_fix_plan&lt;/code&gt; (the MCP fix-suggestion tool) is the right abstraction, or whether it should just emit raw failures and let the model plan the fix itself.&lt;/li&gt;
&lt;li&gt;What's missing for &lt;em&gt;your&lt;/em&gt; specific agent loop — what would make you switch from your current setup, if you have one?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you build something with it, drop a note in the GitHub issues or come find me. I'm reachable.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
