<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Richard Kakengi</title>
    <description>The latest articles on DEV Community by Richard Kakengi (@dimwiddle).</description>
    <link>https://dev.to/dimwiddle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3754393%2F557ced35-6528-40ae-87a6-f38a0e3cd639.png</url>
      <title>DEV Community: Richard Kakengi</title>
      <link>https://dev.to/dimwiddle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dimwiddle"/>
    <language>en</language>
    <item>
      <title>The MCP I built for AI agents backfired</title>
      <dc:creator>Richard Kakengi</dc:creator>
      <pubDate>Tue, 03 Mar 2026 17:07:09 +0000</pubDate>
      <link>https://dev.to/dimwiddle/the-mcp-i-built-for-ai-agents-backfired-4c8j</link>
      <guid>https://dev.to/dimwiddle/the-mcp-i-built-for-ai-agents-backfired-4c8j</guid>
      <description>&lt;p&gt;There's been an uprising in new spec driven processes and workflows which focus on human-in-the-loop development; this project's target is to add a deterministic behaviour alignment layer in to this process that can be run solely by the agent — SpecLeft.&lt;/p&gt;

&lt;p&gt;I've started SpecLeft as an open-source, agent-native CLI tool to guide the agentic coding workflow. The aim is for it to act as a lightweight trust layer between the PRD and the codebase.&lt;/p&gt;

&lt;p&gt;See my previous &lt;a href="https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk"&gt;post&lt;/a&gt; on an experiment I ran comparing LLMs coding with and without a spec-driven process. The results were quite surprising!&lt;/p&gt;

&lt;p&gt;The main roadblock I hit previously with the spec-driven approach was the HUGE amount of token bloat the specs created at the start of the context window, which led me to look for ways to reduce that footprint.&lt;/p&gt;

&lt;p&gt;If you're not familiar with tokens and context windows, here's a good video breaking down how LLMs work:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/-QVoIxEpFkM"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;In this round of the experiment, SpecLeft v0.3.0 introduced optimisation techniques to make the CLI commands more token-efficient.&lt;/p&gt;

&lt;p&gt;I also implemented an MCP server to see whether it improves CLI utilisation, as well as distribution to agents overall. I was aware of the overhead MCP servers add, so I designed this one to minimise it: a single tool and three resources. Let's see if it worked...&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;... It didn't work, but it shows promise.&lt;/p&gt;

&lt;p&gt;The SpecLeft MCP token overhead is real: +77% total tokens and +47% time taken compared to the baseline (without SpecLeft). The baseline code was also cleaner and better structured, to be honest.&lt;/p&gt;

&lt;p&gt;Good news is the output tokens dropped 21%, which tells me the spec context is doing &lt;em&gt;something&lt;/em&gt; useful. It suggests agents were less verbose and more targeted when working with the Spec -&amp;gt; TDD workflow. &lt;/p&gt;

&lt;p&gt;It's the strongest signal yet that the SpecLeft approach has legs, although the cost-to-benefit ratio is way off right now.&lt;/p&gt;

&lt;p&gt;The goal now is to get SpecLeft's overhead down to ≤+10% on input tokens and time taken. It's a specific target, and it's measurable — which means it's fixable (hopefully).&lt;/p&gt;

&lt;p&gt;The next few versions are going to address this and get closer to that goal.&lt;/p&gt;

&lt;p&gt;The project is fully &lt;em&gt;open source&lt;/em&gt; and any feedback and contributions are welcome at &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Previous Experiment
&lt;/h2&gt;

&lt;p&gt;The results from the first experiment (&lt;a href="https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk"&gt;link here&lt;/a&gt;) showed me there's promise in the SDD -&amp;gt; TDD workflow, especially when it comes to AI agents understanding the behaviour and goal of the system.&lt;/p&gt;

&lt;p&gt;The main takeaway was the reduced need for iterations, since tests passed sooner.&lt;/p&gt;

&lt;p&gt;The pain was felt most acutely in token usage and time taken.&lt;/p&gt;

&lt;h3&gt;
  
  
  How SpecLeft was improved for this experiment
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Default output is &lt;code&gt;--format json&lt;/code&gt; (COMPACT mode)&lt;/li&gt;
&lt;li&gt;Removing excessive characters and white space from JSON output&lt;/li&gt;
&lt;li&gt;MCP Server with handshake utility, one tool and three resources&lt;/li&gt;
&lt;li&gt;The MCP server is mainly intended for more effective distribution to agents&lt;/li&gt;
&lt;/ul&gt;
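&lt;p&gt;To give a feel for what the JSON compaction buys, here's a minimal sketch using Python's standard library (an illustrative payload, not SpecLeft's actual output schema):&lt;/p&gt;

```python
import json

# Illustrative payload; not SpecLeft's real output schema.
payload = {"feature": "approval-workflow", "scenarios": 20, "status": "pending"}

pretty = json.dumps(payload, indent=2)                # human-friendly, whitespace-heavy
compact = json.dumps(payload, separators=(",", ":"))  # drops the spaces after , and :

print(len(pretty), len(compact))  # compact is meaningfully shorter
```

&lt;p&gt;Every character saved in command output is a character the agent never has to carry in its context window.&lt;/p&gt;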




&lt;h2&gt;
  
  
  The Experiment Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Without MCP&lt;/th&gt;
&lt;th&gt;With MCP&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens&lt;/td&gt;
&lt;td&gt;305,182&lt;/td&gt;
&lt;td&gt;496,440&lt;/td&gt;
&lt;td&gt;+191,258 (+63%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens&lt;/td&gt;
&lt;td&gt;70,548&lt;/td&gt;
&lt;td&gt;56,016&lt;/td&gt;
&lt;td&gt;−14,532 (−21%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read&lt;/td&gt;
&lt;td&gt;4,511,360&lt;/td&gt;
&lt;td&gt;8,089,728&lt;/td&gt;
&lt;td&gt;+3,578,368 (+79%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;4,887,090&lt;/td&gt;
&lt;td&gt;8,642,184&lt;/td&gt;
&lt;td&gt;+3,755,094 (+77%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactions&lt;/td&gt;
&lt;td&gt;119&lt;/td&gt;
&lt;td&gt;141&lt;/td&gt;
&lt;td&gt;+22 (+18%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;30m&lt;/td&gt;
&lt;td&gt;44m&lt;/td&gt;
&lt;td&gt;+14m (+47%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context fill&lt;/td&gt;
&lt;td&gt;35%&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;+27pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
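&lt;p&gt;The deltas in the table are easy to recompute from the raw numbers; a quick sanity check in Python:&lt;/p&gt;

```python
def delta(before, after):
    """Return (absolute change, percentage change rounded to a whole percent)."""
    return after - before, round(100 * (after - before) / before)

# Figures from the results table above.
print(delta(305_182, 496_440))      # input tokens:  (191258, 63)
print(delta(70_548, 56_016))        # output tokens: (-14532, -21)
print(delta(4_887_090, 8_642_184))  # total tokens:  (3755094, 77)
```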

&lt;p&gt;Measurement Tool: &lt;a href="https://github.com/Shlomob/ocmonitor-share" rel="noopener noreferrer"&gt;OpenCode Monitor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I have changed the token measurement tool from the first experiment to give a more granular perspective on the experiment.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Without SpecLeft MCP
&lt;/h3&gt;

&lt;p&gt;The agent performed fairly well here; however, multiple iterations were required to get the app working as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbpm2bwepd7lajtd8yv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbpm2bwepd7lajtd8yv6.png" alt="Baseline MCP token usage results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Retro
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Failed test runs before pass: 3&lt;/li&gt;
&lt;li&gt;Effort split: spec externalisation 15%, implementation 55%, testing 20%, behaviour verification 10%&lt;/li&gt;
&lt;li&gt;Scope clarity grades: spec externalisation B, implementation B+, testing A-, behaviour verification B&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/4" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Implement approval workflow API
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#4&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F121200859%3Fv%3D4" alt="Dimwiddle avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;Dimwiddle&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/4" rel="noopener noreferrer"&gt;&lt;time&gt;Feb 24, 2026&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Summary&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;build document lifecycle, multi-reviewer, delegation, and escalation flows with SQLAlchemy-backed services&lt;/li&gt;
&lt;li&gt;add notification logging and explicit state-transition validation&lt;/li&gt;
&lt;li&gt;add behavior-driven pytest coverage from the derived spec&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Testing&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;uv run pytest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/SpecLeft/specleft-delta-demo/pull/4" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;




&lt;h3&gt;
  
  
  With SpecLeft MCP
&lt;/h3&gt;

&lt;p&gt;The source code behaved correctly without multiple iterations. The failures before everything went green were caused by issues in the test logic, not the application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxthmn3itjex53eu3iexu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxthmn3itjex53eu3iexu.png" alt="SpecLeft MCP token usage results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: one of the stranger decisions by the agent was writing all the FastAPI code in &lt;code&gt;main.py&lt;/code&gt; - not sure why that happened!&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Retrospective
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Failed test runs before green: 3 (initial module import errors, escalation reviewer_ids, escalation event visibility)&lt;/li&gt;
&lt;li&gt;Effort split: spec externalisation 20%, implementation 45%, testing 20%, behaviour verification 15%&lt;/li&gt;
&lt;li&gt;Clarity grades: spec externalisation A-, implementation B+, testing B, behaviour verification B+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/3" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Implement document approval workflow API
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#3&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F121200859%3Fv%3D4" alt="Dimwiddle avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;Dimwiddle&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/3" rel="noopener noreferrer"&gt;&lt;time&gt;Feb 24, 2026&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Summary&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;build document lifecycle, review, delegation, and escalation API flows backed by SQLAlchemy&lt;/li&gt;
&lt;li&gt;generate SpecLeft feature specs and map each scenario to tests&lt;/li&gt;
&lt;li&gt;add notification tracking and escalation history in responses&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Testing&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;uv run pytest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/SpecLeft/specleft-delta-demo/pull/3" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The MCP overhead is the problem. Input tokens up 63%, time up 47%, context fill nearly doubled to 62%. The code produced without SpecLeft was the stronger result this run.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;one bright spot&lt;/strong&gt;: output tokens fell 21%. Agents were more decisive when the spec context was there — they just paid too much to get it.&lt;/p&gt;

&lt;p&gt;It's becoming quite clear that AI agents need strong context and a well-defined technical scope for agent-driven development to come anywhere close to production-ready code.&lt;/p&gt;

&lt;p&gt;That's what the next version is targeting.&lt;/p&gt;




&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;I’m making it a goal of the SpecLeft project to get to a maximum of +10% input tokens and time taken, relative to running without SpecLeft.&lt;/p&gt;

&lt;p&gt;My approach to providing an MCP for SpecLeft has likely hindered the token utilisation of the LLM; this is something I will investigate more. &lt;/p&gt;

&lt;p&gt;The improvements I'm currently considering are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Condensing the SKILL.md to be more of an educational guide, rather than a CLI reference. This should teach the agent to run commands much more efficiently and not bloat the context window with anti-patterns.&lt;/li&gt;
&lt;li&gt;Compact the command output even more, e.g. &lt;code&gt;specleft next&lt;/code&gt; is limited to one item by default.&lt;/li&gt;
&lt;li&gt;Run the experiment without an MCP - SKILL and CLI only. &lt;/li&gt;
&lt;/ol&gt;
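&lt;p&gt;For improvement 2, the idea can be sketched like this (a hypothetical &lt;code&gt;next_item&lt;/code&gt; helper and data model, not SpecLeft's actual implementation):&lt;/p&gt;

```python
import json

# Hypothetical pending-scenario queue; the real specleft data model may differ.
pending = [
    {"id": "wf-01", "title": "draft to review"},
    {"id": "wf-02", "title": "multi-reviewer approval"},
]

def next_item(items, limit=1):
    """Emit only the first pending item(s) as compact JSON, sparing the context window."""
    return json.dumps(items[:limit], separators=(",", ":"))

print(next_item(pending))  # one compact item instead of the full queue
```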

&lt;h3&gt;
  
  
  Your thoughts
&lt;/h3&gt;

&lt;p&gt;Do you have any suggestions on token optimisations I can take?&lt;/p&gt;

&lt;p&gt;Any contributions and feedback are welcome: &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>python</category>
    </item>
    <item>
      <title>AI Agents Can't Mark Their Own Homework [Case Study]</title>
      <dc:creator>Richard Kakengi</dc:creator>
      <pubDate>Tue, 17 Feb 2026 17:01:12 +0000</pubDate>
      <link>https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk</link>
      <guid>https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk</guid>
      <description>&lt;p&gt;I ran an experiment with the same project through two AI LLM model scenarios — once with a standard prompt, once with spec driven workflow. The results weren't what I expected.&lt;/p&gt;

&lt;p&gt;The headline isn't about tokens or the best performing LLM model. It's about measuring what the agents &lt;em&gt;thought&lt;/em&gt; they delivered versus what they &lt;em&gt;actually&lt;/em&gt; delivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft-delta-demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models:&lt;/strong&gt; Claude Opus 4.6, GPT-5.2-Codex &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coding Agent:&lt;/strong&gt; OpenCode 1.1.36&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR — The Good, The Bad, The Ugly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Good:&lt;/strong&gt; Spec-driven runs caught real bugs that baseline runs shipped silently. Claude Opus with specs found 3 defects during behaviour verification — including a classic Python truthiness trap that would have hit production. GPT-Codex with SpecLeft naturally adopted TDD without being told to. Both agents had fewer failed test runs with specs guiding them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bad:&lt;/strong&gt; Token usage roughly tripled. Baseline complete runs used 53k–83k tokens. Spec-driven runs used 146k–147k. The spec externalisation phase alone consumed more tokens than some baseline implementations. Time increased too — Codex went from ~18 minutes to ~38 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ugly:&lt;/strong&gt; When asked to self-assess, the baseline agents gave themselves a clean bill of health. Opus 4.6 with only a PRD reported 0 issues. The code had bugs and &lt;strong&gt;missed a key scenario&lt;/strong&gt; from the PRD — the agent just had no framework to find them. It marked its own homework and gave itself an A+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The takeaway:&lt;/strong&gt; In its current state, spec-driven development introduces an upfront token tax but produces code with fewer hidden defects. Whether that trade-off is worth it depends on whether you're building a side project or something that matters in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AI coding agents are fast. Impressively fast. You can hand one a well-written product scope and a FastAPI project and it'll have routes, models, services, and tests in under 15 minutes.&lt;/p&gt;

&lt;p&gt;But, as we know, "tests pass" isn't the same as "the system is correct."&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;SpecLeft&lt;/a&gt; — an open source spec-driven development tool that externalises behaviour into structured markdown specs and generates pytest scaffolding, with traceable links between them. &lt;/p&gt;

&lt;p&gt;The idea is simple: define what correct looks like &lt;em&gt;before&lt;/em&gt; the agent starts coding, then verify against it. The workflow looks like BDD, but quacks like TDD.&lt;/p&gt;

&lt;p&gt;There are many spec-driven dev tools out there (sorry, yes, this is another one), but they are generally built for AI-assisted dev workflows, so a human developer still needs to drive. SpecLeft tries a different approach: it is agent-native, meaning it's optimised for AI agent adoption, with an agent contract to verify safety. This adds trust to building software without too much intervention or technical review.&lt;/p&gt;

&lt;p&gt;To summarise the goal: can we trust AI agents to develop software that actually behaves as it should, while keeping the code readable and maintainable and fulfilling the intent?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The application:&lt;/strong&gt; A document approval workflow API — documents move through draft → review → approved/rejected, with multi-reviewer approval, time-bound delegation, automatic escalation, and a handful of edge cases.&lt;/p&gt;

&lt;p&gt;This isn't a basic CRUD system built for a nice vibe-coding showcase. The scope includes a state machine, concurrent decision handling, time-based logic, and business rules that interact with each other: complex enough that an agent can't just wing it.&lt;/p&gt;
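&lt;p&gt;To illustrate what explicit state-transition validation means for a lifecycle like this, here's a minimal sketch (my own illustration with an assumed &lt;code&gt;escalated&lt;/code&gt; state, not the demo repo's code):&lt;/p&gt;

```python
# Allowed moves for the document lifecycle described above.
# "escalated" and the resubmission edge are my assumptions from the PRD summary.
TRANSITIONS = {
    "draft": {"review"},
    "review": {"approved", "rejected", "escalated"},
    "escalated": {"approved", "rejected"},
    "rejected": {"draft"},   # resubmission starts a new review cycle
    "approved": set(),       # terminal state
}

def transition(state, target):
    """Validate a lifecycle move; raise on anything the spec does not allow."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} to {target}")
    return target

state = transition("draft", "review")
state = transition(state, "approved")
```

&lt;p&gt;Making the legal moves a data structure, rather than scattered if-statements, is exactly the kind of behaviour an agent can verify against a spec.&lt;/p&gt;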

&lt;p&gt;&lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/PRD.md" rel="noopener noreferrer"&gt;Product Scope&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same starting commit for both runs&lt;/li&gt;
&lt;li&gt;Same PRD (&lt;code&gt;prd.md&lt;/code&gt;) with 5 features and 20 scenarios&lt;/li&gt;
&lt;li&gt;Same models (Opus 4.6 and Codex 5.2)&lt;/li&gt;
&lt;li&gt;Same coding agent (OpenCode 1.1.36)&lt;/li&gt;
&lt;li&gt;Two runs per model: baseline prompt vs SpecLeft-assisted workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Controlled variables:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tech stack (FastAPI + SQLAlchemy + SQLite + pytest)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/SKILLS.md" rel="noopener noreferrer"&gt;Agent skill&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Virtual environment with UV &lt;/li&gt;
&lt;li&gt;Product requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only difference was whether SpecLeft was involved.&lt;/p&gt;

&lt;p&gt;💻 &lt;em&gt;Repos and Session playbacks have been attached to each test run.&lt;/em&gt;&lt;br&gt;
🎥 &lt;em&gt;Session has to be downloaded and played with &lt;a href="https://docs.asciinema.org/manual/cli/quick-start/" rel="noopener noreferrer"&gt;asciinema&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Workflow A — Baseline (No SpecLeft)
&lt;/h2&gt;

&lt;p&gt;The agent gets a straightforward prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an autonomous agent guided by a planning-first workflow.
Build a document approval API using FastAPI and SQLAlchemy.
The project has had the initial setup already.
Follow ../prd.md for product requirements.
Follow ../SKILLS.md for instructions.
Include tests and ensure they pass.
Stop when all features are complete.
Go with your own recommendations for system behaviour instead of verifying with me.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I walked away and let it run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Opus 4.6 — Baseline
&lt;/h3&gt;

&lt;p&gt;Prompt entered, and Opus took its time. It spent a solid chunk of the session reading and analysing the PRD before writing anything. Implementation and tests came out together — not in separate phases, but interleaved. The server started first time. The first test run had 2 failures, both resolved quickly.&lt;/p&gt;

&lt;p&gt;Total time: &lt;strong&gt;13 minutes 53 seconds&lt;/strong&gt;. Total tokens: &lt;strong&gt;83,243&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When asked for a retrospective, Opus reported &lt;strong&gt;0 issues found&lt;/strong&gt;. Clean run. Everything looked good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyoq25oekr15zylkttmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyoq25oekr15zylkttmy.png" alt="Claude Opus 4.6 Opencode Snapshot" width="800" height="488"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt;: &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/claude-opus-test/without-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Session Playback&lt;/strong&gt; (asciinema cast): &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/claude-opus-test/without-specleft/claude-opus-no-specs.cast" rel="noopener noreferrer"&gt;claude-opus-no-specs.cast&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bugs Discovered Post-Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing Auto-Escalation Feature&lt;/strong&gt;: Despite PRD requiring automatic escalation after timeouts, only manual escalation is implemented. The &lt;code&gt;check_and_escalate&lt;/code&gt; function exists but performs no escalation, &lt;strong&gt;violating core business requirements.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Potential Timezone Brittleness&lt;/strong&gt;: Delegation expiry checks assume naive datetimes are UTC, which could fail if assumptions are incorrect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Concurrency Risks&lt;/strong&gt;: No explicit locking for concurrent reviewer decisions, potentially leading to race conditions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  GPT Codex 5.2 — Baseline
&lt;/h3&gt;

&lt;p&gt;Codex moved faster and more aggressively. Implementation came out in parallel batches — models, schemas, routes, services written simultaneously. But it backtracked more. Tests failed 4 times before going green. Server failed to start on the first attempt. Behaviour verification required 4 patches to &lt;code&gt;services.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Total time: &lt;strong&gt;~18 minutes&lt;/strong&gt;. Total tokens: &lt;strong&gt;53,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The retrospective was vague: logic gaps were "caught early," timezone handling was a known issue. No specific bugs named.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu0gpt4ujo7f5vf577ws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu0gpt4ujo7f5vf577ws.png" alt="Codex 5.2 Baseline snapshot" width="800" height="491"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt; : &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/gpt-codex-test/without-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Playback&lt;/strong&gt; (asciinema cast): &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/gpt-codex-test/without-specleft/gpt-codex-no-specs.cast" rel="noopener noreferrer"&gt;gpt-codex-no-specs.cast&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Baseline Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Codex 5.2&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;53,000&lt;/td&gt;
&lt;td&gt;83,243&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tests passed&lt;/td&gt;
&lt;td&gt;19 (100%)&lt;/td&gt;
&lt;td&gt;53 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed test runs&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Issues found in retro&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to completion&lt;/td&gt;
&lt;td&gt;~18m&lt;/td&gt;
&lt;td&gt;13m 53s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens before implementation&lt;/td&gt;
&lt;td&gt;14,000&lt;/td&gt;
&lt;td&gt;~33,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both agents declared the job done. Tests pass. Features work. Ship it?&lt;/p&gt;


&lt;h2&gt;
  
  
  Workflow B — With SpecLeft
&lt;/h2&gt;

&lt;p&gt;Same project, same &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/PRD.md" rel="noopener noreferrer"&gt;PRD&lt;/a&gt;. But this time SpecLeft is installed as a dependency, and the prompt tells the agent to externalise behaviour before writing code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an autonomous agent guided by a planning-first workflow.
Build a document approval API using FastAPI and SQLAlchemy.
The project has had the initial setup already.
Follow ../prd.md for product requirements.
Follow ../SKILLS.md for instructions.
Initialize SpecLeft and use its commands to externalize behaviour before implementation.
I have installed v0.2.2.
Only if required, use doc: https://github.com/SpecLeft/specleft/blob/main/AI_AGENTS.md for more context.
Do not write implementation code until behaviour is explicit.
Go with your own recommendations for system behaviour instead of verifying with me.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I walked away again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: AI_AGENTS.md helps the agent use the SpecLeft tool more effectively.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Opus 4.6 — With SpecLeft
&lt;/h3&gt;

&lt;p&gt;Opus externalised all 5 features into SpecLeft specs before writing a line of implementation code. It updated scenario priorities to match feature priorities — a decision it made on its own. Then it generated test skeletons with &lt;code&gt;specleft test skeleton&lt;/code&gt;, giving it 27 decorated test stubs mapped directly to scenarios.&lt;/p&gt;
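&lt;p&gt;I'll illustrate the shape of such a scenario-mapped stub with a toy decorator (hypothetical names throughout; the decorator and scenario IDs SpecLeft actually emits may differ):&lt;/p&gt;

```python
# Toy sketch of a scenario-mapped test stub, not SpecLeft's real skeleton output.
def scenario(scenario_id):
    """Attach a spec scenario ID to a test function for traceability."""
    def mark(test_fn):
        test_fn.scenario_id = scenario_id  # traceable link back to the spec
        return test_fn
    return mark

@scenario("approval-workflow/multi-reviewer-approval")
def test_document_approved_when_all_reviewers_accept():
    ...  # stub body left for the agent to fill in

print(test_document_approved_when_all_reviewers_accept.scenario_id)
```

&lt;p&gt;The point is the traceable link: every test carries the ID of the spec scenario it proves, so nothing is verified "by vibes".&lt;/p&gt;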

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy3h0zdroycr07j052hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy3h0zdroycr07j052hy.png" alt="Claude Opus 4.6 with SpecLeft" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First test run: 25/27 passed. The 2 failures were test logic issues, not application bugs. The core service layer was correct on first implementation.&lt;/p&gt;

&lt;p&gt;Then came behaviour verification. And this is where it got interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1:&lt;/strong&gt; &lt;code&gt;timeout_hours or doc.escalation_timeout_hours or 24&lt;/code&gt; — when &lt;code&gt;timeout_hours=0&lt;/code&gt;, Python treats &lt;code&gt;0&lt;/code&gt; as falsy and falls through to the default of &lt;code&gt;24&lt;/code&gt;. Classic truthiness trap. The unit tests didn't catch it because they manipulated &lt;code&gt;review_started_at&lt;/code&gt; directly with 25-hour backdating, never testing with &lt;code&gt;timeout_hours=0&lt;/code&gt;.&lt;/p&gt;
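&lt;p&gt;The trap is easy to reproduce in isolation, and &lt;code&gt;is None&lt;/code&gt; checks are the usual fix (a distilled sketch, not the demo repo's exact code):&lt;/p&gt;

```python
def effective_timeout_buggy(timeout_hours, doc_default=None):
    # 0 is falsy, so an explicit zero-hour timeout silently becomes 24.
    return timeout_hours or doc_default or 24

def effective_timeout_fixed(timeout_hours, doc_default=None):
    # Only fall through on None, so 0 survives as a real value.
    if timeout_hours is not None:
        return timeout_hours
    if doc_default is not None:
        return doc_default
    return 24

print(effective_timeout_buggy(0))  # 24, the bug
print(effective_timeout_fixed(0))  # 0, as intended
```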

&lt;p&gt;&lt;strong&gt;Bug 2:&lt;/strong&gt; &lt;code&gt;review_cycle&lt;/code&gt; in the &lt;code&gt;DocumentResponse&lt;/code&gt; schema had a default value of &lt;code&gt;1&lt;/code&gt;, but the model never exposed the actual cycle count. Pydantic's &lt;code&gt;from_attributes&lt;/code&gt; silently fell back to the default. A resubmitted document showed &lt;code&gt;review_cycle: 1&lt;/code&gt; when it should have been &lt;code&gt;2&lt;/code&gt;.&lt;/p&gt;
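&lt;p&gt;Setting Pydantic aside, the underlying failure mode is an attribute lookup with a default fallback; here's a stripped-down sketch of the same mechanism in plain Python (hypothetical &lt;code&gt;Document&lt;/code&gt; and &lt;code&gt;to_response&lt;/code&gt; names):&lt;/p&gt;

```python
class Document:
    # ORM-ish model that never exposes the real cycle count.
    def __init__(self, title):
        self.title = title
        # note: no review_cycle attribute at all

def to_response(doc):
    # Mimics a from_attributes-style mapping with a schema default of 1:
    # a missing attribute silently becomes the default instead of an error.
    return {
        "title": doc.title,
        "review_cycle": getattr(doc, "review_cycle", 1),
    }

resp = to_response(Document("Q3 budget"))
print(resp["review_cycle"])  # 1, even if the document is on its second cycle
```

&lt;p&gt;Nothing crashes, which is exactly why only behaviour verification against the spec caught it.&lt;/p&gt;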

&lt;p&gt;&lt;strong&gt;Bug 3:&lt;/strong&gt; Escalation test logic accessed response data before checking the status code — a test fragility that would cause misleading failures.&lt;/p&gt;
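&lt;p&gt;The fix for that kind of fragility is to assert on the status code before touching the body, so a failed request reports the real problem rather than a missing key (a sketch with a stand-in response object):&lt;/p&gt;

```python
class FakeResponse:
    # Stand-in for an HTTP client response inside a test.
    def __init__(self, status_code, body):
        self.status_code = status_code
        self._body = body
    def json(self):
        return self._body

def assert_escalated(resp):
    # Status first: if the request failed, the assertion message shows the
    # status code instead of a confusing KeyError from an error body.
    assert resp.status_code == 200, resp.status_code
    assert resp.json()["status"] == "escalated"

assert_escalated(FakeResponse(200, {"status": "escalated"}))
```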

&lt;p&gt;Total time: &lt;strong&gt;21 minutes 1 second&lt;/strong&gt;. Total tokens: &lt;strong&gt;~147,000&lt;/strong&gt; (across two context windows, with compaction at 105k).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnmc244s1e4xcj8fprjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnmc244s1e4xcj8fprjp.png" alt="Claude Opus with SpecLeft Snapshot 2" width="800" height="472"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/claude-opus-test/with-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Session Playback (asciinema download):&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/claude-opus-test/with-specleft/claude-opus.cast" rel="noopener noreferrer"&gt;claude-opus.cast&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  GPT Codex 5.2 — With SpecLeft
&lt;/h3&gt;

&lt;p&gt;This was the surprise. Codex consumed the SpecLeft specs and test skeletons, and then did something I didn't engineer: &lt;strong&gt;it wrote functional test logic before implementation code.&lt;/strong&gt; Genuine TDD, driven by the structure of the skeletons. The scaffolding naturally guided the agent into writing assertions first, then building the code to satisfy them. Sweet!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6xruu1n3lp1x762pw24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6xruu1n3lp1x762pw24.png" alt="GPT Codex with SpecLeft snapshot" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It read all the specs — which burned tokens on context — but that context clearly influenced implementation quality. Tests failed twice before going green, down from 4 in the baseline run.&lt;/p&gt;

&lt;p&gt;Total time: &lt;strong&gt;~38 minutes&lt;/strong&gt;. Total tokens: &lt;strong&gt;146,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbie380rk1i7e439rslb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbie380rk1i7e439rslb.png" alt="GPT Codex 5.2 with SpecLeft Test snapshot 2" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/gpt-codex-test/with-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Playback&lt;/strong&gt; (Asciinema download): &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/gpt-codex-test/with-specleft/gpt-codex-specs.cast" rel="noopener noreferrer"&gt;gpt-codex-specs.cast&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SpecLeft Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Codex 5.2&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;146,499&lt;/td&gt;
&lt;td&gt;~147,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tests passed&lt;/td&gt;
&lt;td&gt;27 (100%)&lt;/td&gt;
&lt;td&gt;27 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed test runs&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Issues found in retro&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to completion&lt;/td&gt;
&lt;td&gt;~38m&lt;/td&gt;
&lt;td&gt;21m 1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens to externalise specs&lt;/td&gt;
&lt;td&gt;49,000&lt;/td&gt;
&lt;td&gt;45,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens before implementation&lt;/td&gt;
&lt;td&gt;89,000&lt;/td&gt;
&lt;td&gt;63,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;p&gt;Opus without specs generated 53 tests, nearly double the SpecLeft run's 27 — but quantity isn't coverage. The 53 tests were whatever the agent decided mattered, with no traceability to product requirements; the missing auto-escalate requirement shows exactly that. The 27 SpecLeft tests each map to a specific scenario in the PRD.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Codex Baseline&lt;/th&gt;
&lt;th&gt;Codex + SpecLeft&lt;/th&gt;
&lt;th&gt;Opus Baseline&lt;/th&gt;
&lt;th&gt;Opus + SpecLeft&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;53,000&lt;/td&gt;
&lt;td&gt;146,000&lt;/td&gt;
&lt;td&gt;83,243&lt;/td&gt;
&lt;td&gt;~147,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tests passed&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed test runs&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs found during retro&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing Requirements&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Missing Requirements: count of unimplemented PRD features.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Which Stack Stayed on Track the Best?
&lt;/h2&gt;

&lt;p&gt;Looking at the code and testing the API manually, both spec driven runs are strong; it's pretty even. Codex had a much cleaner data model and a modern SQLAlchemy implementation, while Opus was flatter in its design. With that in mind, I'd feel better about picking up the Codex SpecLeft project in a realistic situation. That said, the code wasn't mind-blowing either, especially the lack of exception handling around database queries in the service layer.&lt;/p&gt;

&lt;p&gt;I've also prompted a few neutral agents (Gemini-3, Kimi K2.5, Grok) to evaluate the codebases on quality, maintainability, and correctness. &lt;/p&gt;

&lt;p&gt;The full analysis is in the &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/AGENT_RESULTS.md" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's the Takeaway?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Agents can't assess their own output
&lt;/h3&gt;

&lt;p&gt;The fact that the critical defects were missed by the agent itself but caught by external verification highlights a fundamental limitation: &lt;strong&gt;AI agents can't reliably assess their own output without structured external criteria.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On top of that, the baseline code was brittle. The Codex baseline shipped with &lt;strong&gt;175 deprecation warnings&lt;/strong&gt; in its test suite—technical debt that the agent completely ignored because the tests technically "passed."&lt;/p&gt;

&lt;p&gt;In contrast, the SpecLeft agent &lt;em&gt;did&lt;/em&gt; introduce bugs during development—like the &lt;code&gt;timeout_hours=0&lt;/code&gt; truthiness trap and the &lt;code&gt;review_cycle&lt;/code&gt; default issue. But crucially, &lt;strong&gt;it found and fixed them.&lt;/strong&gt; The structured verification process forced the agent to confront its own logic errors, whereas the baseline agent simply marked its own homework as "correct" and moved on.&lt;/p&gt;

&lt;h3&gt;
  
  
  TDD emerged naturally from the workflow
&lt;/h3&gt;

&lt;p&gt;This was unplanned but a pleasant surprise! Codex with SpecLeft generated test skeletons via &lt;code&gt;specleft test skeleton&lt;/code&gt;, and those skeletons guided the agent into writing test assertions before implementation code. Not because the prompt said "do TDD" — it didn't. The structure of the scaffolding naturally produced that workflow.&lt;/p&gt;

&lt;p&gt;What was even more interesting was how the agent approached the implementation. Based on the agent logging, it seemed to be reasoning about the overall behaviour of the app, rather than purely about the logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  The SDD token cost is real and significant
&lt;/h3&gt;

&lt;p&gt;No getting around it. SpecLeft runs used 2–3x more tokens than baseline. The spec externalisation phase (45k–49k tokens) is pure overhead if you measure by "tokens to first passing test." The baseline agents started writing code sooner and finished sooner. From what I've seen, this is a common problem with other SDD tools too; it makes sense, as specs are a lot of additional context.&lt;/p&gt;
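&lt;p&gt;The "2–3x" figure is just the totals from the tables above, back-of-envelope:&lt;/p&gt;

```python
# Overhead ratio from the runs above (token counts taken from the result tables).
baseline = {"codex": 53_000, "opus": 83_243}
specleft = {"codex": 146_000, "opus": 147_000}

for model in baseline:
    ratio = specleft[model] / baseline[model]
    print(f"{model}: {ratio:.1f}x tokens with SpecLeft")
# codex comes out around 2.8x, opus around 1.8x
```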

&lt;p&gt;The question is whether having "passing tests" is the right finish line. The risk is the baseline code ships to production without a key piece of functionality, which would take time to find, diagnose and fix – my guess is that it'll cost more than 90k tokens plus impact on the end users.&lt;/p&gt;

&lt;p&gt;It's worth investigating whether token usage can be optimised with context engineering techniques.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It on your PRD.md
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;specleft

specleft init

specleft doctor

specleft status

specleft plan 

&lt;span class="c"&gt;# or add individual features&lt;/span&gt;
specleft features add
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft&lt;/a&gt; &lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://specleft.dev/docs/getting-started/installation" rel="noopener noreferrer"&gt;specleft.dev&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Over to You
&lt;/h2&gt;

&lt;p&gt;The data is in the repo. The recordings are linked above. Run it yourself if you want — same PRD, same setup, different agent if you like.&lt;/p&gt;

&lt;p&gt;The bigger question: &lt;strong&gt;How are you verifying agent output today, or are you going with the pure vibe-coding approach?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Personally, here's what I'm thinking about: &lt;strong&gt;should that traceability be enforced in CI?&lt;/strong&gt; A gate that fails the build if critical scenarios aren't implemented — not as a suggestion, but as a policy. Or is visibility enough?&lt;/p&gt;
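&lt;p&gt;One possible shape for such a gate, assuming a JSON coverage report — the report format and field names here are entirely hypothetical, just to make the idea concrete:&lt;/p&gt;

```python
# Hypothetical CI gate: fail the build if any critical scenario is unimplemented.
# The report format and field names are invented for illustration.
import json
import sys

def gate(report_path):
    with open(report_path) as f:
        report = json.load(f)
    missing = [s["id"] for s in report["scenarios"]
               if s["priority"] == "critical" and not s["implemented"]]
    if missing:
        print("unimplemented critical scenarios:", ", ".join(missing))
        return 1  # non-zero exit fails the CI job
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```

&lt;p&gt;Run as a CI step, a non-zero exit turns "visibility" into policy: the build goes red until the critical scenarios exist.&lt;/p&gt;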

&lt;p&gt;I've started working on CI enforcement for behaviour functionality — &lt;a href="https://specleft.dev/enforce" rel="noopener noreferrer"&gt;request early access&lt;/a&gt; if you use Python and AI agents in your dev workflow and want to be involved.&lt;/p&gt;

&lt;p&gt;Drop a comment — I'm keen to hear your thoughts.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>python</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
