<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Richard Kakengi</title>
    <description>The latest articles on DEV Community by Richard Kakengi (@dimwiddle).</description>
    <link>https://dev.to/dimwiddle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3754393%2F557ced35-6528-40ae-87a6-f38a0e3cd639.png</url>
      <title>DEV Community: Richard Kakengi</title>
      <link>https://dev.to/dimwiddle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dimwiddle"/>
    <language>en</language>
    <item>
      <title>The MCP I built for AI agents backfired</title>
      <dc:creator>Richard Kakengi</dc:creator>
      <pubDate>Tue, 03 Mar 2026 17:07:09 +0000</pubDate>
      <link>https://dev.to/dimwiddle/the-mcp-i-built-for-ai-agents-backfired-4c8j</link>
      <guid>https://dev.to/dimwiddle/the-mcp-i-built-for-ai-agents-backfired-4c8j</guid>
      <description>&lt;p&gt;There's been an uprising in new spec driven processes and workflows which focus on human-in-the-loop development; this project's target is to add a deterministic behaviour alignment layer in to this process that can be run solely by the agent — SpecLeft.&lt;/p&gt;

&lt;p&gt;I've started SpecLeft as an open-source, agent-native CLI tool to guide the agentic coding workflow. The aim is for it to act as a lightweight trust layer between the PRD and the codebase.&lt;/p&gt;

&lt;p&gt;See my previous &lt;a href="https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk"&gt;post&lt;/a&gt; on an experiment I ran comparing LLMs coding with and without a spec-driven process. The results were quite surprising!&lt;/p&gt;

&lt;p&gt;The main roadblock I hit previously with the spec-driven approach was the HUGE amount of token bloat the specs created at the start of the context window, which led me to look for ways to reduce that footprint.&lt;/p&gt;

&lt;p&gt;If you're not familiar with tokens and context windows, here's a good video breaking down how LLMs work:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/-QVoIxEpFkM"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;In this round of the experiment, SpecLeft v0.3.0 introduced optimisation techniques to make the CLI commands more token-efficient.&lt;/p&gt;

&lt;p&gt;I also implemented an MCP server to see whether it improves CLI utilisation, as well as distribution to agents overall. I was aware of the overhead MCP servers add, so I designed this one to minimise it: a single tool and three resources. Let's see if it worked...&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;... It didn't work, but it shows promise.&lt;/p&gt;

&lt;p&gt;The SpecLeft MCP token overhead is real: +77% total tokens and +47% time taken compared to the baseline (without SpecLeft). The baseline code was also cleaner and better structured, to be honest.&lt;/p&gt;

&lt;p&gt;Good news is the output tokens dropped 21%, which tells me the spec context is doing &lt;em&gt;something&lt;/em&gt; useful. It suggests agents were less verbose and more targeted when working with the Spec -&amp;gt; TDD workflow. &lt;/p&gt;

&lt;p&gt;It's the strongest signal yet that the SpecLeft approach has legs, although the cost-to-benefit ratio is way off right now.&lt;/p&gt;

&lt;p&gt;The goal now is to get SpecLeft's overhead down to ≤+10% on input tokens and time taken. It's a specific target, and it's measurable — which means it's fixable (hopefully).&lt;/p&gt;

&lt;p&gt;The next few versions are going to address this and get closer to that goal.&lt;/p&gt;

&lt;p&gt;The project is fully &lt;em&gt;open source&lt;/em&gt; and any feedback and contributions are welcome at &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Previous Experiment
&lt;/h2&gt;

&lt;p&gt;The results from the first experiment (&lt;a href="https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk"&gt;link here&lt;/a&gt;) showed me there's promise in the SDD -&amp;gt; TDD workflow, especially when it comes to AI agents understanding the behaviour and goal of the system.&lt;/p&gt;

&lt;p&gt;The main takeaway was the reduced need for iterations, since tests passed sooner.&lt;/p&gt;

&lt;p&gt;The pain was felt most acutely in token usage and time taken.&lt;/p&gt;

&lt;h3&gt;
  
  
  How SpecLeft was improved for this experiment
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Default output is &lt;code&gt;--format json&lt;/code&gt; (COMPACT mode)&lt;/li&gt;
&lt;li&gt;Removing excessive characters and white space from JSON output&lt;/li&gt;
&lt;li&gt;MCP Server with handshake utility, one tool and three resources&lt;/li&gt;
&lt;li&gt;The MCP server is mainly intended for more effective distribution to agents&lt;/li&gt;
&lt;/ul&gt;
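&lt;p&gt;To give a feel for what the JSON compaction buys, here's a minimal sketch using Python's standard library (an illustrative payload, not SpecLeft's actual output schema):&lt;/p&gt;

```python
import json

# Illustrative payload; not SpecLeft's real output schema.
payload = {"feature": "approval-workflow", "scenarios": 20, "status": "pending"}

pretty = json.dumps(payload, indent=2)                # human-friendly, whitespace-heavy
compact = json.dumps(payload, separators=(",", ":"))  # drops the spaces after , and :

print(len(pretty), len(compact))  # compact is meaningfully shorter
```

&lt;p&gt;Every character saved in command output is a character the agent never has to carry in its context window.&lt;/p&gt;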




&lt;h2&gt;
  
  
  The Experiment Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Without MCP&lt;/th&gt;
&lt;th&gt;With MCP&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input tokens&lt;/td&gt;
&lt;td&gt;305,182&lt;/td&gt;
&lt;td&gt;496,440&lt;/td&gt;
&lt;td&gt;+191,258 (+63%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens&lt;/td&gt;
&lt;td&gt;70,548&lt;/td&gt;
&lt;td&gt;56,016&lt;/td&gt;
&lt;td&gt;−14,532 (−21%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read&lt;/td&gt;
&lt;td&gt;4,511,360&lt;/td&gt;
&lt;td&gt;8,089,728&lt;/td&gt;
&lt;td&gt;+3,578,368 (+79%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;4,887,090&lt;/td&gt;
&lt;td&gt;8,642,184&lt;/td&gt;
&lt;td&gt;+3,755,094 (+77%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactions&lt;/td&gt;
&lt;td&gt;119&lt;/td&gt;
&lt;td&gt;141&lt;/td&gt;
&lt;td&gt;+22 (+18%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;30m&lt;/td&gt;
&lt;td&gt;44m&lt;/td&gt;
&lt;td&gt;+14m (+47%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context fill&lt;/td&gt;
&lt;td&gt;35%&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;+27pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
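&lt;p&gt;The deltas in the table are easy to recompute from the raw numbers; a quick sanity check in Python:&lt;/p&gt;

```python
def delta(before, after):
    """Return (absolute change, percentage change rounded to a whole percent)."""
    return after - before, round(100 * (after - before) / before)

# Figures from the results table above.
print(delta(305_182, 496_440))      # input tokens:  (191258, 63)
print(delta(70_548, 56_016))        # output tokens: (-14532, -21)
print(delta(4_887_090, 8_642_184))  # total tokens:  (3755094, 77)
```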

&lt;p&gt;Measurement Tool: &lt;a href="https://github.com/Shlomob/ocmonitor-share" rel="noopener noreferrer"&gt;OpenCode Monitor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I have changed the token measurement tool from the first experiment to give a more granular perspective on the experiment.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Without SpecLeft MCP
&lt;/h3&gt;

&lt;p&gt;The agent performed fairly well here; however, multiple iterations were required to get the app working as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbpm2bwepd7lajtd8yv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbpm2bwepd7lajtd8yv6.png" alt="Baseline MCP token usage results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Retro
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Failed test runs before pass: 3&lt;/li&gt;
&lt;li&gt;Effort split: spec externalisation 15%, implementation 55%, testing 20%, behaviour verification 10%&lt;/li&gt;
&lt;li&gt;Scope clarity grades: spec externalisation B, implementation B+, testing A-, behaviour verification B&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/4" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Implement approval workflow API
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#4&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F121200859%3Fv%3D4" alt="Dimwiddle avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;Dimwiddle&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/4" rel="noopener noreferrer"&gt;&lt;time&gt;Feb 24, 2026&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Summary&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;build document lifecycle, multi-reviewer, delegation, and escalation flows with SQLAlchemy-backed services&lt;/li&gt;
&lt;li&gt;add notification logging and explicit state-transition validation&lt;/li&gt;
&lt;li&gt;add behavior-driven pytest coverage from the derived spec&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Testing&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;uv run pytest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/SpecLeft/specleft-delta-demo/pull/4" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;




&lt;h3&gt;
  
  
  With SpecLeft MCP
&lt;/h3&gt;

&lt;p&gt;The source code behaved correctly without multiple iterations. The failures before everything went green were caused by issues in the test logic, not the application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxthmn3itjex53eu3iexu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxthmn3itjex53eu3iexu.png" alt="SpecLeft MCP token usage results"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: one of the stranger decisions by the agent was writing all the FastAPI code in &lt;code&gt;main.py&lt;/code&gt; - not sure why that happened!&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Retrospective
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Failed test runs before green: 3 (initial module import errors, escalation reviewer_ids, escalation event visibility)&lt;/li&gt;
&lt;li&gt;Effort split: spec externalisation 20%, implementation 45%, testing 20%, behaviour verification 15%&lt;/li&gt;
&lt;li&gt;Clarity grades: spec externalisation A-, implementation B+, testing B, behaviour verification B+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/3" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Implement document approval workflow API
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#3&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F121200859%3Fv%3D4" alt="Dimwiddle avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/Dimwiddle" rel="noopener noreferrer"&gt;Dimwiddle&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/pull/3" rel="noopener noreferrer"&gt;&lt;time&gt;Feb 24, 2026&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Summary&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;build document lifecycle, review, delegation, and escalation API flows backed by SQLAlchemy&lt;/li&gt;
&lt;li&gt;generate SpecLeft feature specs and map each scenario to tests&lt;/li&gt;
&lt;li&gt;add notification tracking and escalation history in responses&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Testing&lt;/h2&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;uv run pytest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/SpecLeft/specleft-delta-demo/pull/3" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The MCP overhead is the problem. Input tokens up 63%, time up 47%, context fill nearly doubled to 62%. The code produced without SpecLeft was the stronger result this run.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;one bright spot&lt;/strong&gt;: output tokens fell 21%. Agents were more decisive when the spec context was there — they just paid too much to get it.&lt;/p&gt;

&lt;p&gt;It's becoming quite clear that AI agents need strong context and a well-defined technical scope for agent-driven development to come anywhere close to production-ready code.&lt;/p&gt;

&lt;p&gt;That's what the next version is targeting.&lt;/p&gt;




&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;I’m making it a goal of the SpecLeft project to get to a maximum of +10% input tokens and time taken, relative to running without SpecLeft.&lt;/p&gt;

&lt;p&gt;My approach to providing an MCP for SpecLeft has likely hindered the token utilisation of the LLM; this is something I will investigate more. &lt;/p&gt;

&lt;p&gt;The improvements I'm currently considering are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Condensing the SKILL.md to be more of an educational guide, rather than a CLI reference. This should teach the agent to run commands much more efficiently and not bloat the context window with anti-patterns.&lt;/li&gt;
&lt;li&gt;Compact the command output even more, e.g. &lt;code&gt;specleft next&lt;/code&gt; is limited to one item by default.&lt;/li&gt;
&lt;li&gt;Run the experiment without an MCP - SKILL and CLI only. &lt;/li&gt;
&lt;/ol&gt;
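&lt;p&gt;For improvement 2, the idea can be sketched like this (a hypothetical &lt;code&gt;next_item&lt;/code&gt; helper and data model, not SpecLeft's actual implementation):&lt;/p&gt;

```python
import json

# Hypothetical pending-scenario queue; the real specleft data model may differ.
pending = [
    {"id": "wf-01", "title": "draft to review"},
    {"id": "wf-02", "title": "multi-reviewer approval"},
]

def next_item(items, limit=1):
    """Emit only the first pending item(s) as compact JSON, sparing the context window."""
    return json.dumps(items[:limit], separators=(",", ":"))

print(next_item(pending))  # one compact item instead of the full queue
```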

&lt;h3&gt;
  
  
  Your thoughts
&lt;/h3&gt;

&lt;p&gt;Do you have any suggestions on token optimisations I can take?&lt;/p&gt;

&lt;p&gt;Any contributions and feedback are welcome: &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>python</category>
    </item>
    <item>
      <title>AI Agents Can't Mark Their Own Homework [Case Study]</title>
      <dc:creator>Richard Kakengi</dc:creator>
      <pubDate>Tue, 17 Feb 2026 17:01:12 +0000</pubDate>
      <link>https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk</link>
      <guid>https://dev.to/dimwiddle/ai-agents-cant-mark-their-own-homework-case-study-26mk</guid>
      <description>&lt;p&gt;I ran an experiment with the same project through two AI LLM model scenarios — once with a standard prompt, once with spec driven workflow. The results weren't what I expected.&lt;/p&gt;

&lt;p&gt;The headline isn't about tokens or the best performing LLM model. It's about measuring what the agents &lt;em&gt;thought&lt;/em&gt; they delivered versus what they &lt;em&gt;actually&lt;/em&gt; delivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft-delta-demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models:&lt;/strong&gt; Claude Opus 4.6, GPT-5.2-Codex &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coding Agent:&lt;/strong&gt; OpenCode 1.1.36&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR — The Good, The Bad, The Ugly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Good:&lt;/strong&gt; Spec-driven runs caught real bugs that baseline runs shipped silently. Claude Opus with specs found 3 defects during behaviour verification — including a classic Python truthiness trap that would have hit production. GPT-Codex with SpecLeft naturally adopted TDD without being told to. Both agents had fewer failed test runs with specs guiding them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bad:&lt;/strong&gt; Token usage roughly tripled. Baseline complete runs used 53k–83k tokens. Spec-driven runs used 146k–147k. The spec externalisation phase alone consumed more tokens than some baseline implementations. Time increased too — Codex went from ~18 minutes to ~38 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ugly:&lt;/strong&gt; When asked to self-assess, the baseline agents gave themselves a clean bill of health. Opus 4.6 with only a PRD reported 0 issues. The code had bugs and &lt;strong&gt;missed a key scenario&lt;/strong&gt; from the PRD — the agent just had no framework to find them. It marked its own homework and gave itself an A+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The takeaway:&lt;/strong&gt; In its current state, spec-driven development introduces an upfront token tax but produces code with fewer hidden defects. Whether that trade-off is worth it depends on whether you're building a side project or something that matters in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AI coding agents are fast. Impressively fast. You can hand one a well-written product scope and a FastAPI project and it'll have routes, models, services, and tests in under 15 minutes.&lt;/p&gt;

&lt;p&gt;But, as we know, "tests pass" isn't the same as "the system is correct."&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;SpecLeft&lt;/a&gt; — an open source spec-driven development tool that externalises behaviour into structured markdown specs and generates pytest scaffolding, with traceable links between them. &lt;/p&gt;

&lt;p&gt;The idea is simple: define what correct looks like &lt;em&gt;before&lt;/em&gt; the agent starts coding, then verify against it. The workflow looks like BDD, but quacks like TDD.&lt;/p&gt;

&lt;p&gt;There are many spec-driven dev tools out there (sorry, yes, this is another one), but they are generally built for AI-assisted dev workflows, so a human developer still needs to drive. SpecLeft tries a different approach: it is agent-native, meaning it's optimised for AI agent adoption, with an agent contract to verify safety. This adds trust to building software without too much intervention or technical review.&lt;/p&gt;

&lt;p&gt;To summarise the goal: can we trust AI agents to develop software that actually behaves as it should, while keeping the code readable and maintainable and fulfilling the intent?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The application:&lt;/strong&gt; A document approval workflow API — documents move through draft → review → approved/rejected, with multi-reviewer approval, time-bound delegation, automatic escalation, and a handful of edge cases.&lt;/p&gt;

&lt;p&gt;This isn't a basic CRUD system built for a nice vibe-coding showcase. The scope includes a state machine, concurrent decision handling, time-based logic, and business rules that interact with each other: complex enough that an agent can't just wing it.&lt;/p&gt;
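&lt;p&gt;To illustrate what explicit state-transition validation means for a lifecycle like this, here's a minimal sketch (my own illustration with an assumed &lt;code&gt;escalated&lt;/code&gt; state, not the demo repo's code):&lt;/p&gt;

```python
# Allowed moves for the document lifecycle described above.
# "escalated" and the resubmission edge are my assumptions from the PRD summary.
TRANSITIONS = {
    "draft": {"review"},
    "review": {"approved", "rejected", "escalated"},
    "escalated": {"approved", "rejected"},
    "rejected": {"draft"},   # resubmission starts a new review cycle
    "approved": set(),       # terminal state
}

def transition(state, target):
    """Validate a lifecycle move; raise on anything the spec does not allow."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} to {target}")
    return target

state = transition("draft", "review")
state = transition(state, "approved")
```

&lt;p&gt;Making the legal moves a data structure, rather than scattered if-statements, is exactly the kind of behaviour an agent can verify against a spec.&lt;/p&gt;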

&lt;p&gt;&lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/PRD.md" rel="noopener noreferrer"&gt;Product Scope&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same starting commit for both runs&lt;/li&gt;
&lt;li&gt;Same PRD (&lt;code&gt;prd.md&lt;/code&gt;) with 5 features and 20 scenarios&lt;/li&gt;
&lt;li&gt;Same models (Opus 4.6 and Codex 5.2)&lt;/li&gt;
&lt;li&gt;Same coding agent (OpenCode 1.1.36)&lt;/li&gt;
&lt;li&gt;Two runs per model: baseline prompt vs SpecLeft-assisted workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Controlled variables:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tech stack (FastAPI + SQLAlchemy + SQLite + pytest)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/SKILLS.md" rel="noopener noreferrer"&gt;Agent skill&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Virtual environment with UV &lt;/li&gt;
&lt;li&gt;Product requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only difference was whether SpecLeft was involved.&lt;/p&gt;

&lt;p&gt;💻 &lt;em&gt;Repos and Session playbacks have been attached to each test run.&lt;/em&gt;&lt;br&gt;
🎥 &lt;em&gt;Session has to be downloaded and played with &lt;a href="https://docs.asciinema.org/manual/cli/quick-start/" rel="noopener noreferrer"&gt;asciinema&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Workflow A — Baseline (No SpecLeft)
&lt;/h2&gt;

&lt;p&gt;The agent gets a straightforward prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an autonomous agent guided by a planning-first workflow.
Build a document approval API using FastAPI and SQLAlchemy.
The project has had the initial setup already.
Follow ../prd.md for product requirements.
Follow ../SKILLS.md for instructions.
Include tests and ensure they pass.
Stop when all features are complete.
Go with your own recommendations for system behaviour instead of verifying with me.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I walked away and let it run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Opus 4.6 — Baseline
&lt;/h3&gt;

&lt;p&gt;Prompt entered, and Opus took its time. It spent a solid chunk of the session reading and analysing the PRD before writing anything. Implementation and tests came out together — not in separate phases, but interleaved. The server started first time. The first test run had 2 failures, both resolved quickly.&lt;/p&gt;

&lt;p&gt;Total time: &lt;strong&gt;13 minutes 53 seconds&lt;/strong&gt;. Total tokens: &lt;strong&gt;83,243&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When asked for a retrospective, Opus reported &lt;strong&gt;0 issues found&lt;/strong&gt;. Clean run. Everything looked good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyoq25oekr15zylkttmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyoq25oekr15zylkttmy.png" alt="Claude Opus 4.6 Opencode Snapshot" width="800" height="488"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt;: &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/claude-opus-test/without-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Session Playback&lt;/strong&gt; (asciinema cast): &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/claude-opus-test/without-specleft/claude-opus-no-specs.cast" rel="noopener noreferrer"&gt;claude-opus-no-specs.cast&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bugs Discovered Post-Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing Auto-Escalation Feature&lt;/strong&gt;: Despite PRD requiring automatic escalation after timeouts, only manual escalation is implemented. The &lt;code&gt;check_and_escalate&lt;/code&gt; function exists but performs no escalation, &lt;strong&gt;violating core business requirements.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Potential Timezone Brittleness&lt;/strong&gt;: Delegation expiry checks assume naive datetimes are UTC, which could fail if assumptions are incorrect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Concurrency Risks&lt;/strong&gt;: No explicit locking for concurrent reviewer decisions, potentially leading to race conditions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  GPT Codex 5.2 — Baseline
&lt;/h3&gt;

&lt;p&gt;Codex moved faster and more aggressively. Implementation came out in parallel batches — models, schemas, routes, services written simultaneously. But it backtracked more. Tests failed 4 times before going green. Server failed to start on the first attempt. Behaviour verification required 4 patches to &lt;code&gt;services.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Total time: &lt;strong&gt;~18 minutes&lt;/strong&gt;. Total tokens: &lt;strong&gt;53,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The retrospective was vague: logic gaps were "caught early," timezone handling was a known issue. No specific bugs named.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu0gpt4ujo7f5vf577ws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu0gpt4ujo7f5vf577ws.png" alt="Codex 5.2 Baseline snapshot" width="800" height="491"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Code&lt;/strong&gt; : &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/gpt-codex-test/without-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Playback&lt;/strong&gt; (asciinema cast): &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/gpt-codex-test/without-specleft/gpt-codex-no-specs.cast" rel="noopener noreferrer"&gt;gpt-codex-no-specs.cast&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Baseline Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Codex 5.2&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;53,000&lt;/td&gt;
&lt;td&gt;83,243&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tests passed&lt;/td&gt;
&lt;td&gt;19 (100%)&lt;/td&gt;
&lt;td&gt;53 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed test runs&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Issues found in retro&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to completion&lt;/td&gt;
&lt;td&gt;~18m&lt;/td&gt;
&lt;td&gt;13m 53s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens before implementation&lt;/td&gt;
&lt;td&gt;14,000&lt;/td&gt;
&lt;td&gt;~33,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both agents declared the job done. Tests pass. Features work. Ship it?&lt;/p&gt;


&lt;h2&gt;
  
  
  Workflow B — With SpecLeft
&lt;/h2&gt;

&lt;p&gt;Same project, same &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/PRD.md" rel="noopener noreferrer"&gt;PRD&lt;/a&gt;. But this time SpecLeft is installed as a dependency, and the prompt tells the agent to externalise behaviour before writing code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an autonomous agent guided by a planning-first workflow.
Build a document approval API using FastAPI and SQLAlchemy.
The project has had the initial setup already.
Follow ../prd.md for product requirements.
Follow ../SKILLS.md for instructions.
Initialize SpecLeft and use its commands to externalize behaviour before implementation.
I have installed v0.2.2.
Only if required, use doc: https://github.com/SpecLeft/specleft/blob/main/AI_AGENTS.md for more context.
Do not write implementation code until behaviour is explicit.
Go with your own recommendations for system behaviour instead of verifying with me.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I walked away again.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: AI_AGENTS.md helps the agent use the SpecLeft tool more effectively.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Opus 4.6 — With SpecLeft
&lt;/h3&gt;

&lt;p&gt;Opus externalised all 5 features into SpecLeft specs before writing a line of implementation code. It updated scenario priorities to match feature priorities — a decision it made on its own. Then it generated test skeletons with &lt;code&gt;specleft test skeleton&lt;/code&gt;, giving it 27 decorated test stubs mapped directly to scenarios.&lt;/p&gt;
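&lt;p&gt;I'll illustrate the shape of such a scenario-mapped stub with a toy decorator (hypothetical names throughout; the decorator and scenario IDs SpecLeft actually emits may differ):&lt;/p&gt;

```python
# Toy sketch of a scenario-mapped test stub, not SpecLeft's real skeleton output.
def scenario(scenario_id):
    """Attach a spec scenario ID to a test function for traceability."""
    def mark(test_fn):
        test_fn.scenario_id = scenario_id  # traceable link back to the spec
        return test_fn
    return mark

@scenario("approval-workflow/multi-reviewer-approval")
def test_document_approved_when_all_reviewers_accept():
    ...  # stub body left for the agent to fill in

print(test_document_approved_when_all_reviewers_accept.scenario_id)
```

&lt;p&gt;The point is the traceable link: every test carries the ID of the spec scenario it proves, so nothing is verified "by vibes".&lt;/p&gt;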

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy3h0zdroycr07j052hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy3h0zdroycr07j052hy.png" alt="Claude Opus 4.6 with SpecLeft" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First test run: 25/27 passed. The 2 failures were test logic issues, not application bugs. The core service layer was correct on first implementation.&lt;/p&gt;

&lt;p&gt;Then came behaviour verification. And this is where it got interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1:&lt;/strong&gt; &lt;code&gt;timeout_hours or doc.escalation_timeout_hours or 24&lt;/code&gt; — when &lt;code&gt;timeout_hours=0&lt;/code&gt;, Python treats &lt;code&gt;0&lt;/code&gt; as falsy and falls through to the default of &lt;code&gt;24&lt;/code&gt;. Classic truthiness trap. The unit tests didn't catch it because they manipulated &lt;code&gt;review_started_at&lt;/code&gt; directly with 25-hour backdating, never testing with &lt;code&gt;timeout_hours=0&lt;/code&gt;.&lt;/p&gt;
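&lt;p&gt;The trap is easy to reproduce in isolation, and &lt;code&gt;is None&lt;/code&gt; checks are the usual fix (a distilled sketch, not the demo repo's exact code):&lt;/p&gt;

```python
def effective_timeout_buggy(timeout_hours, doc_default=None):
    # 0 is falsy, so an explicit zero-hour timeout silently becomes 24.
    return timeout_hours or doc_default or 24

def effective_timeout_fixed(timeout_hours, doc_default=None):
    # Only fall through on None, so 0 survives as a real value.
    if timeout_hours is not None:
        return timeout_hours
    if doc_default is not None:
        return doc_default
    return 24

print(effective_timeout_buggy(0))  # 24, the bug
print(effective_timeout_fixed(0))  # 0, as intended
```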

&lt;p&gt;&lt;strong&gt;Bug 2:&lt;/strong&gt; &lt;code&gt;review_cycle&lt;/code&gt; in the &lt;code&gt;DocumentResponse&lt;/code&gt; schema had a default value of &lt;code&gt;1&lt;/code&gt;, but the model never exposed the actual cycle count. Pydantic's &lt;code&gt;from_attributes&lt;/code&gt; silently fell back to the default. A resubmitted document showed &lt;code&gt;review_cycle: 1&lt;/code&gt; when it should have been &lt;code&gt;2&lt;/code&gt;.&lt;/p&gt;
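&lt;p&gt;Setting Pydantic aside, the underlying failure mode is an attribute lookup with a default fallback; here's a stripped-down sketch of the same mechanism in plain Python (hypothetical &lt;code&gt;Document&lt;/code&gt; and &lt;code&gt;to_response&lt;/code&gt; names):&lt;/p&gt;

```python
class Document:
    # ORM-ish model that never exposes the real cycle count.
    def __init__(self, title):
        self.title = title
        # note: no review_cycle attribute at all

def to_response(doc):
    # Mimics a from_attributes-style mapping with a schema default of 1:
    # a missing attribute silently becomes the default instead of an error.
    return {
        "title": doc.title,
        "review_cycle": getattr(doc, "review_cycle", 1),
    }

resp = to_response(Document("Q3 budget"))
print(resp["review_cycle"])  # 1, even if the document is on its second cycle
```

&lt;p&gt;Nothing crashes, which is exactly why only behaviour verification against the spec caught it.&lt;/p&gt;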

&lt;p&gt;&lt;strong&gt;Bug 3:&lt;/strong&gt; Escalation test logic accessed response data before checking the status code — a test fragility that would cause misleading failures.&lt;/p&gt;
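&lt;p&gt;The fix for that kind of fragility is to assert on the status code before touching the body, so a failed request reports the real problem rather than a missing key (a sketch with a stand-in response object):&lt;/p&gt;

```python
class FakeResponse:
    # Stand-in for an HTTP client response inside a test.
    def __init__(self, status_code, body):
        self.status_code = status_code
        self._body = body
    def json(self):
        return self._body

def assert_escalated(resp):
    # Status first: if the request failed, the assertion message shows the
    # status code instead of a confusing KeyError from an error body.
    assert resp.status_code == 200, resp.status_code
    assert resp.json()["status"] == "escalated"

assert_escalated(FakeResponse(200, {"status": "escalated"}))
```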

&lt;p&gt;Total time: &lt;strong&gt;21 minutes 1 second&lt;/strong&gt;. Total tokens: &lt;strong&gt;~147,000&lt;/strong&gt; (across two context windows, with compaction at 105k).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnmc244s1e4xcj8fprjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnmc244s1e4xcj8fprjp.png" alt="Claude Opus with SpecLeft Snapshot 2" width="800" height="472"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/claude-opus-test/with-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Session Playback (asciinema download):&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/claude-opus-test/with-specleft/claude-opus.cast" rel="noopener noreferrer"&gt;claude-opus.cast&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  GPT Codex 5.2 — With SpecLeft
&lt;/h3&gt;

&lt;p&gt;This was the surprise. Codex consumed the SpecLeft specs and test skeletons, and then did something I didn't engineer: &lt;strong&gt;it wrote functional test logic before implementation code.&lt;/strong&gt; Genuine TDD, driven by the structure of the skeletons. The scaffolding naturally guided the agent into writing assertions first, then building the code to satisfy them. Sweet!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6xruu1n3lp1x762pw24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6xruu1n3lp1x762pw24.png" alt="GPT Codex with SpecLeft snapshot" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It read all the specs — which burned tokens on context — but that context clearly influenced implementation quality. Tests failed twice before going green, down from 4 in the baseline run.&lt;/p&gt;

&lt;p&gt;Total time: &lt;strong&gt;~38 minutes&lt;/strong&gt;. Total tokens: &lt;strong&gt;146,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbie380rk1i7e439rslb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbie380rk1i7e439rslb.png" alt="GPT Codex 5.2 with SpecLeft Test snapshot 2" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/tree/gpt-codex-test/with-specleft" rel="noopener noreferrer"&gt;Branch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Playback&lt;/strong&gt; (Asciinema download): &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/gpt-codex-test/with-specleft/gpt-codex-specs.cast" rel="noopener noreferrer"&gt;gpt-codex-specs.cast&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SpecLeft Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Codex 5.2&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;146,499&lt;/td&gt;
&lt;td&gt;~147,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tests passed&lt;/td&gt;
&lt;td&gt;27 (100%)&lt;/td&gt;
&lt;td&gt;27 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed test runs&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Issues found in retro&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to completion&lt;/td&gt;
&lt;td&gt;~38m&lt;/td&gt;
&lt;td&gt;21m 1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens to externalise specs&lt;/td&gt;
&lt;td&gt;49,000&lt;/td&gt;
&lt;td&gt;45,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens before implementation&lt;/td&gt;
&lt;td&gt;89,000&lt;/td&gt;
&lt;td&gt;63,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;p&gt;Opus without specs generated 53 tests, nearly double the SpecLeft run's 27 — but quantity isn't coverage. The 53 tests were whatever the agent decided mattered, with no traceability to product requirements; the missing auto-escalate requirement shows exactly that. The 27 SpecLeft tests each map to a specific scenario in the PRD.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Codex Baseline&lt;/th&gt;
&lt;th&gt;Codex + SpecLeft&lt;/th&gt;
&lt;th&gt;Opus Baseline&lt;/th&gt;
&lt;th&gt;Opus + SpecLeft&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;53,000&lt;/td&gt;
&lt;td&gt;146,000&lt;/td&gt;
&lt;td&gt;83,243&lt;/td&gt;
&lt;td&gt;~147,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tests passed&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed test runs&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs found during retro&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing Requirements&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Missing Requirements: count of unimplemented PRD features.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Which Stack Stayed on Track the Best?
&lt;/h2&gt;

&lt;p&gt;Looking at the code and testing the API manually, both spec driven runs are strong; it's pretty even. Codex had a much cleaner data model and a modern SQLAlchemy implementation, while Opus was flatter in its design. With that in mind, I'd feel better about picking up the Codex SpecLeft project in a realistic situation. That said, the code wasn't mind-blowing either, especially the lack of exception handling around database queries in the service layer.&lt;/p&gt;

&lt;p&gt;I've also prompted a few neutral agents (Gemini-3, Kimi K2.5, Grok) to evaluate the codebases on quality, maintainability, and correctness. &lt;/p&gt;

&lt;p&gt;The full analysis is in the &lt;a href="https://github.com/SpecLeft/specleft-delta-demo/blob/main/AGENT_RESULTS.md" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's the Takeaway?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Agents can't assess their own output
&lt;/h3&gt;

&lt;p&gt;The fact that the critical defects were missed by the agent itself but caught by external verification highlights a fundamental limitation: &lt;strong&gt;AI agents can't reliably assess their own output without structured external criteria.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On top of that, the baseline code was brittle. The Codex baseline shipped with &lt;strong&gt;175 deprecation warnings&lt;/strong&gt; in its test suite—technical debt that the agent completely ignored because the tests technically "passed."&lt;/p&gt;

&lt;p&gt;In contrast, the SpecLeft agent &lt;em&gt;did&lt;/em&gt; introduce bugs during development—like the &lt;code&gt;timeout_hours=0&lt;/code&gt; truthiness trap and the &lt;code&gt;review_cycle&lt;/code&gt; default issue. But crucially, &lt;strong&gt;it found and fixed them.&lt;/strong&gt; The structured verification process forced the agent to confront its own logic errors, whereas the baseline agent simply marked its own homework as "correct" and moved on.&lt;/p&gt;

&lt;h3&gt;
  
  
  TDD emerged naturally from the workflow
&lt;/h3&gt;

&lt;p&gt;This was unplanned but a pleasant surprise! Codex with SpecLeft generated test skeletons via &lt;code&gt;specleft test skeleton&lt;/code&gt;, and those skeletons guided the agent into writing test assertions before implementation code. Not because the prompt said "do TDD" — it didn't. The structure of the scaffolding naturally produced that workflow.&lt;/p&gt;

&lt;p&gt;What was even more interesting was how the agent approached the implementation. Based on the agent logging, it seemed to be reasoning about the overall behaviour of the app, rather than purely about the logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  The SDD token cost is real and significant
&lt;/h3&gt;

&lt;p&gt;No getting around it. SpecLeft runs used 2–3x more tokens than baseline. The spec externalisation phase (45k–49k tokens) is pure overhead if you measure by "tokens to first passing test." The baseline agents started writing code sooner and finished sooner. From what I've seen, this is a common problem with other SDD tools too; it makes sense, as specs are a lot of additional context.&lt;/p&gt;
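&lt;p&gt;The "2–3x" figure is just the totals from the tables above, back-of-envelope:&lt;/p&gt;

```python
# Overhead ratio from the runs above (token counts taken from the result tables).
baseline = {"codex": 53_000, "opus": 83_243}
specleft = {"codex": 146_000, "opus": 147_000}

for model in baseline:
    ratio = specleft[model] / baseline[model]
    print(f"{model}: {ratio:.1f}x tokens with SpecLeft")
# codex comes out around 2.8x, opus around 1.8x
```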

&lt;p&gt;The question is whether having "passing tests" is the right finish line. The risk is the baseline code ships to production without a key piece of functionality, which would take time to find, diagnose and fix – my guess is that it'll cost more than 90k tokens plus impact on the end users.&lt;/p&gt;

&lt;p&gt;It's worth investigating whether token usage can be optimised with context engineering techniques.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It on your PRD.md
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;specleft

specleft init

specleft doctor

specleft status

specleft plan 

&lt;span class="c"&gt;# or add individual features&lt;/span&gt;
specleft features add
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/SpecLeft/specleft" rel="noopener noreferrer"&gt;https://github.com/SpecLeft/specleft&lt;/a&gt; &lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://specleft.dev/docs/getting-started/installation" rel="noopener noreferrer"&gt;specleft.dev&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Over to You
&lt;/h2&gt;

&lt;p&gt;The data is in the repo. The recordings are linked above. Run it yourself if you want — same PRD, same setup, different agent if you like.&lt;/p&gt;

&lt;p&gt;The bigger question: &lt;strong&gt;How are you verifying agent output today, or are you going with the pure vibe-coding approach?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Personally, here's what I'm thinking about: &lt;strong&gt;should that traceability be enforced in CI?&lt;/strong&gt; A gate that fails the build if critical scenarios aren't implemented — not as a suggestion, but as a policy. Or is visibility enough?&lt;/p&gt;
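&lt;p&gt;One possible shape for such a gate, assuming a JSON coverage report — the report format and field names here are entirely hypothetical, just to make the idea concrete:&lt;/p&gt;

```python
# Hypothetical CI gate: fail the build if any critical scenario is unimplemented.
# The report format and field names are invented for illustration.
import json
import sys

def gate(report_path):
    with open(report_path) as f:
        report = json.load(f)
    missing = [s["id"] for s in report["scenarios"]
               if s["priority"] == "critical" and not s["implemented"]]
    if missing:
        print("unimplemented critical scenarios:", ", ".join(missing))
        return 1  # non-zero exit fails the CI job
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```

&lt;p&gt;Run as a CI step, a non-zero exit turns "visibility" into policy: the build goes red until the critical scenarios exist.&lt;/p&gt;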

&lt;p&gt;I've started working on CI enforcement for behaviour functionality — &lt;a href="https://specleft.dev/enforce" rel="noopener noreferrer"&gt;request early access&lt;/a&gt; if you use Python and AI agents in your dev workflow and want to be involved.&lt;/p&gt;

&lt;p&gt;Drop a comment — I'm keen to hear your thoughts.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>python</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
