There's been a rise in new spec-driven processes and workflows that focus on human-in-the-loop development; this project's aim is to add a deterministic behaviour-alignment layer into that process that can be run solely by the agent — SpecLeft.
I've started this open source, agent-native, CLI tool to guide the agentic coding workflow. The aim is for it to act as a lightweight trust layer between the PRD and the codebase.
See my previous post on an experiment I ran comparing LLMs coding with and without a spec-driven process. The results were quite surprising!
The main roadblock I faced previously with the spec-driven approach was the HUGE amount of token bloat the specs created at the start of the context window — which led me to start looking for a way to reduce that context usage.
If you're not familiar with tokens and context windows — here's a good video breaking down how LLMs work:
In this round of the experiment, SpecLeft v0.3.0 introduced token optimisation techniques to make the CLI commands more token efficient.
I also implemented an MCP server to see if it improves CLI utilisation, as well as distribution to agents overall. I was aware of MCP overhead, so I designed the server to minimise it: one tool and three resources. Let's see if it worked...
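A quick way to see where MCP overhead comes from: everything a server advertises in its `tools/list` and `resources/list` responses gets injected into the agent's context window. The sketch below (tool and resource names are hypothetical, not SpecLeft's real API) builds a one-tool, three-resource surface and estimates its context cost with the rough ~4-characters-per-token heuristic:

```python
import json

# Hypothetical advertised surface for a SpecLeft-style MCP server:
# one tool and three resources (names are illustrative, not the real API).
tools_list_result = {
    "tools": [{
        "name": "specleft_next",
        "description": "Return the next unimplemented spec scenario.",
        "inputSchema": {"type": "object", "properties": {}, "required": []},
    }]
}
resources_list_result = {
    "resources": [
        {"uri": "spec://features", "name": "Feature specs"},
        {"uri": "spec://scenarios", "name": "Scenario index"},
        {"uri": "spec://status", "name": "Implementation status"},
    ]
}

# Everything advertised here lands in the agent's context window, so its
# serialized size is a floor on the per-session overhead the server adds.
surface = json.dumps(tools_list_result) + json.dumps(resources_list_result)
approx_tokens = len(surface) // 4  # rough rule of thumb: ~4 chars per token
print(len(surface), approx_tokens)
```

Even a deliberately small surface like this has a non-zero floor, before the agent makes a single call.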
TL;DR
... It didn't work, but has promise.
The SpecLeft MCP token overhead is real: +77% total tokens and +47% time taken compared to the baseline (without SpecLeft). The baseline code was also cleaner and better structured, to be honest.
Good news is the output tokens dropped 21%, which tells me the spec context is doing something useful. It suggests agents were less verbose and more targeted when working with the Spec -> TDD workflow.
It's the strongest signal yet that the SpecLeft approach has legs, although the cost-to-benefit ratio is just way off right now.
The goal now is to get SpecLeft's overhead down to ≤+10% on input tokens and time taken. It's a specific target, and it's measurable — which means it's fixable (hopefully).
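To make the target concrete against this run's numbers (a quick back-of-the-envelope, not part of SpecLeft itself):

```python
# What the <= +10% target means against this run's baseline numbers.
baseline_input_tokens = 305_182   # input tokens without SpecLeft
baseline_minutes = 30             # duration without SpecLeft

target_input_ceiling = int(baseline_input_tokens * 1.10)   # 335,700 tokens
target_minutes_ceiling = baseline_minutes * 1.10           # 33.0 minutes

measured_input = 496_440  # input tokens with the SpecLeft MCP this run
overshoot = measured_input - target_input_ceiling
print(target_input_ceiling, target_minutes_ceiling, overshoot)
```

In other words, this run overshot the input-token ceiling by roughly 160k tokens, so the gap to close is large but well defined.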
The next few versions are going to address this and get closer to that goal.
The project is fully open source and any feedback and contributions are welcome at https://github.com/SpecLeft/specleft
The Previous Experiment
The results from the first experiment showed me there's promise in the SDD -> TDD workflow, especially when it comes to AI agents understanding the behaviour and goal of the system.
The main takeaway was the reduced need for iterations due to tests passing quicker.
The pain was felt excruciatingly in the token usage and the time taken.
How SpecLeft was improved for this experiment
- Default output is `--format json` (COMPACT mode), removing excessive characters and whitespace from the JSON output
- MCP server with a handshake utility, one tool, and three resources
- The MCP server is mostly for more effective distribution to agents
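For anyone curious what COMPACT mode amounts to, the standard-library way to strip whitespace from JSON in Python is the `separators` argument to `json.dumps`; the payload here is illustrative, not SpecLeft's real output:

```python
import json

# Illustrative payload, not SpecLeft's actual schema.
payload = {"feature": "approval-workflow", "scenarios": [{"id": 1, "status": "pending"}]}

pretty = json.dumps(payload, indent=2)                # verbose, human-friendly output
compact = json.dumps(payload, separators=(",", ":"))  # COMPACT mode: no spaces or newlines

saved = len(pretty) - len(compact)
print(len(pretty), len(compact), saved)
```

The compact form round-trips to the identical object; the only thing removed is whitespace the model would otherwise pay tokens for.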
The Experiment Results
| Metric | Without MCP | With MCP | Delta |
|---|---|---|---|
| Input tokens | 305,182 | 496,440 | +191,258 (+63%) |
| Output tokens | 70,548 | 56,016 | −14,532 (−21%) |
| Cache read | 4,511,360 | 8,089,728 | +3,578,368 (+79%) |
| Total tokens | 4,887,090 | 8,642,184 | +3,755,094 (+77%) |
| Interactions | 119 | 141 | +22 (+18%) |
| Duration | 30m | 44m | +14m (+47%) |
| Context fill | 35% | 62% | +27pp |
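The percentage deltas in the table fall straight out of the raw numbers:

```python
# Recomputing the table's deltas from the raw token counts.
without_mcp = {"input": 305_182, "output": 70_548, "total": 4_887_090}
with_mcp    = {"input": 496_440, "output": 56_016, "total": 8_642_184}

def delta_pct(before, after):
    """Percentage change from before to after, rounded to the nearest whole point."""
    return round(100 * (after - before) / before)

print(delta_pct(without_mcp["input"], with_mcp["input"]))    # -> 63
print(delta_pct(without_mcp["output"], with_mcp["output"]))  # -> -21
print(delta_pct(without_mcp["total"], with_mcp["total"]))    # -> 77
```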
Measurement Tool: OpenCode Monitor
Note: I have changed the token measurement tool from the first experiment to give a more granular perspective on the experiment.
Without SpecLeft MCP
The agent performed fairly well here; however, multiple iterations were required to get the app working as expected.
Agent Retro
- Failed test runs before pass: 3
- Effort split: spec externalisation 15%, implementation 55%, testing 20%, behaviour verification 10%
- Scope clarity grades: spec externalisation B, implementation B+, testing A-, behaviour verification B
Implement approval workflow API (#4)
- build document lifecycle, multi-reviewer, delegation, and escalation flows with SQLAlchemy-backed services
- add notification logging and explicit state-transition validation
- add behavior-driven pytest coverage from the derived spec
- `uv run pytest`
With SpecLeft MCP
The source code was implemented to behave correctly and didn't take multiple iterations; the failures before everything worked were in the test logic, not the implementation.
Note: one of the stranger decisions by the agent was that the FastAPI code was written entirely in main.py - not sure why that happened!
Agent Retrospective
- Failed test runs before green: 3 (initial module import errors, escalation reviewer_ids, escalation event visibility)
- Effort split: spec externalisation 20%, implementation 45%, testing 20%, behaviour verification 15%
- Clarity grades: spec externalisation A-, implementation B+, testing B, behaviour verification B+
Implement document approval workflow API (#3)
- build document lifecycle, review, delegation, and escalation API flows backed by SQLAlchemy
- generate SpecLeft feature specs and map each scenario to tests
- add notification tracking and escalation history in responses
- `uv run pytest`
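Both issue cards ask for spec scenarios to be mapped to tests with explicit state-transition validation. As a purely illustrative sketch (this is not the code either agent produced), that pattern looks something like:

```python
# Illustrative only: a minimal document-approval state machine with explicit
# state-transition validation, plus scenario-style assertions of the kind a
# spec -> test mapping would produce.
VALID_TRANSITIONS = {
    "draft": {"submitted"},
    "submitted": {"approved", "rejected", "escalated"},
    "escalated": {"approved", "rejected"},
}

class Document:
    def __init__(self):
        self.state = "draft"
        self.history = []  # audit trail of (from_state, to_state)

    def transition(self, new_state):
        if new_state not in VALID_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.history.append((self.state, new_state))
        self.state = new_state

# Scenario: a submitted document can be escalated, then approved.
doc = Document()
doc.transition("submitted")
doc.transition("escalated")
doc.transition("approved")
assert doc.state == "approved"

# Scenario: approving a draft directly must be rejected.
try:
    Document().transition("approved")
except ValueError:
    pass  # expected: drafts can only be submitted
```

Each spec scenario becomes one small, deterministic assertion block, which is what makes the behaviour checkable by the agent alone.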
Summary
The MCP overhead is the problem. Input tokens up 63%, time up 47%, context fill nearly doubled to 62%. The code produced without SpecLeft was the stronger result this run.
The one bright spot: output tokens fell 21%. Agents were more decisive when the spec context was there — they just paid too much to get it.
It's becoming quite clear that AI agents need strong context and a tight technical scope for agentic software development to get anywhere close to production-ready code.
That's what the next version is targeting.
The takeaway
I'm making it a goal of this SpecLeft project to get to a maximum of +10% input tokens and time taken, relative to running without SpecLeft.
My approach to providing an MCP for SpecLeft has likely hindered the LLM's token utilisation; this is something I will investigate further.
The next improvements I'm thinking of are:
- Condensing SKILL.md into more of an educational guide, rather than a CLI reference. This should teach the agent to run commands much more efficiently and not bloat the context window with anti-patterns.
- Compacting the command output even more, e.g. `specleft next` is limited to one item by default.
- Running the experiment without an MCP: SKILL and CLI only.
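A hypothetical sketch of what a one-item-by-default `next` command could look like (the real `specleft next` flags and output may differ):

```python
import json

# Hypothetical spec state; the real SpecLeft data model may differ.
SCENARIOS = [
    {"id": "S1", "status": "done"},
    {"id": "S2", "status": "pending"},
    {"id": "S3", "status": "pending"},
]

def next_items(scenarios, limit=1):
    """Return at most `limit` pending scenarios; one by default."""
    pending = [s for s in scenarios if s["status"] == "pending"]
    return pending[:limit]

# Compact, single-item output keeps the agent's context small.
print(json.dumps(next_items(SCENARIOS), separators=(",", ":")))
# -> [{"id":"S2","status":"pending"}]
```

The idea is that the agent pulls exactly one unit of work at a time, instead of the whole spec landing in the context window.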
Your thoughts
Do you have any suggestions on token optimisations I can take?
Any contributions and feedback are welcome: https://github.com/SpecLeft/specleft.


Comments
The 63% input token increase is interesting because it's basically the cost of giving the agent a manual. I hit something similar — verbose schemas made agents more precise but slower and more expensive. The output token decrease (agents being more decisive) suggests the spec is working, just at too high a cost. Have you tried sending only the relevant spec sections based on the current task rather than the full doc?
Yeah, that's been a tricky one - I did try something like that with relevant spec segments, but it didn't work as intended. That was a takeaway from the first experiment, so I wanted to make sure the agent only receives the minimum the model needs to implement the next feature/scenario from the specs.
I made an improvement in SpecLeft to add a 'compact mode' setting for the CLI command output — however, the agent went against that anyway and decided to read the verbose output instead.
This has made me think that educating the agent on the CLI tool is a critical piece to this.
Something I will explore with skills in the next round.