<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrey Kucherenko</title>
    <description>The latest articles on DEV Community by Andrey Kucherenko (@kucherenko).</description>
    <link>https://dev.to/kucherenko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F203305%2F7d767a58-54d2-4bc8-946a-13a1482f9414.png</url>
      <title>DEV Community: Andrey Kucherenko</title>
      <link>https://dev.to/kucherenko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kucherenko"/>
    <language>en</language>
    <item>
      <title>16 Frameworks. One Blind Spot</title>
      <dc:creator>Andrey Kucherenko</dc:creator>
      <pubDate>Wed, 06 May 2026 13:12:36 +0000</pubDate>
      <link>https://dev.to/kucherenko/16-frameworks-one-blind-spot-20cg</link>
      <guid>https://dev.to/kucherenko/16-frameworks-one-blind-spot-20cg</guid>
      <description>&lt;p&gt;Every AI agent framework on the market today has the same fatal flaw.&lt;/p&gt;

&lt;p&gt;They will take your half-baked, 3 a.m., "wouldn't it be cool if..." idea and dutifully - &lt;strong&gt;expensively&lt;/strong&gt; - turn it into a beautifully specified, well-architected, properly tested pile of garbage.&lt;/p&gt;

&lt;p&gt;Spec-driven? Sure. &lt;br&gt;
Multi-agent? Of course. &lt;br&gt;
Self-reviewing? Absolutely. &lt;br&gt;
Adversarial? They'll claim it.&lt;/p&gt;

&lt;p&gt;But ask any of them a single question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Did you challenge the idea before you built it?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Silence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Part 1 of a two-part series. Part 1 maps the landscape and identifies the structural gap. Part 2 shows what filling that gap actually looks like - at the level of agents, prompts, and termination conditions.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;The garbage-in industrial complex&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Here's the workflow every AI framework optimizes for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User has an idea.&lt;/li&gt;
&lt;li&gt;Framework generates a spec.&lt;/li&gt;
&lt;li&gt;Framework writes the code.&lt;/li&gt;
&lt;li&gt;Framework tests the code.&lt;/li&gt;
&lt;li&gt;User gets a working implementation of a bad idea.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Sixteen frameworks.&lt;/strong&gt; The space exploded after GitHub released Spec-Kit in September 2025 - every framework I'll show you below either predates that by months or appeared in its wake. All of them are sophisticated machinery for steps 2 through 4:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Planning agents. &lt;/li&gt;
&lt;li&gt;Reviewer agents. &lt;/li&gt;
&lt;li&gt;Test-writing agents. &lt;/li&gt;
&lt;li&gt;Constitution enforcers. &lt;/li&gt;
&lt;li&gt;Traceability auditors. &lt;/li&gt;
&lt;li&gt;Agents that argue with each other about implementation details.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Nobody is arguing about whether the thing should exist.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Concrete example: &lt;strong&gt;Spec-Kit,&lt;/strong&gt; one of the most-installed spec-driven development (SDD) tools, will dutifully run &lt;code&gt;specify init&lt;/code&gt; and create a &lt;code&gt;.specify/&lt;/code&gt; directory with a nine-article constitution, a spec template, a plan template, and a tasks template - &lt;strong&gt;for an idea that should never have been written down.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The constitution will then block your merge if the implementation violates Article III's testing standards. It will not block the merge if the feature itself was &lt;strong&gt;a bad idea&lt;/strong&gt;. That's not in scope.&lt;/p&gt;

&lt;p&gt;The dirty secret of the AI agent space is that most ideas users bring to these systems are bad. Not evil-bad. Just unexamined. Half-thought. &lt;strong&gt;Solving the wrong problem.&lt;/strong&gt; Built on assumptions that fall apart under thirty seconds of pressure.&lt;/p&gt;

&lt;p&gt;And the frameworks &lt;strong&gt;happily burn tokens&lt;/strong&gt; turning that into an artifact.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;"But we have reviewers"&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;No, you don't. Not really.&lt;/p&gt;

&lt;p&gt;I want to be specific here, because "frameworks have reviewers" is technically true and substantively misleading. Let me walk through what current reviewer agents actually do - and what they don't.&lt;/p&gt;

&lt;p&gt;Take &lt;strong&gt;Superpowers&lt;/strong&gt;, one of the most-installed Claude Code plugins right now. After implementation, two reviewer subagents spawn in fresh contexts: a spec compliance reviewer and a code quality reviewer. Each returns a PASS or FAIL verdict. If either fails, the task isn't marked complete. &lt;/p&gt;

&lt;p&gt;This is genuinely good engineering - it eliminates the most common AI coding failure mode, which is the agent declaring victory on code that doesn't actually work. But notice the framing. The spec compliance reviewer asks: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does this code match the spec?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The code quality reviewer asks: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is this code well-written? &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Neither one asks &lt;strong&gt;whether the spec made sense in the first place.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take &lt;strong&gt;MUSUBI&lt;/strong&gt;, the most formally rigorous open-source SDD tool I've seen. It ships with a @traceability-auditor skill that blocks merges if requirement-to-test coverage drops below 100%, and a @constitution-enforcer skill that checks every file against nine constitutional articles before merging. &lt;/p&gt;

&lt;p&gt;The traceability matrix maps every requirement to a design section, a code file, and a test file. Beautiful enforcement. But the requirements themselves - the things being traced - were written before the auditor ever ran. They were never on trial. &lt;strong&gt;If the requirement was wrong, the matrix just makes sure the wrong thing is fully implemented and fully tested.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take &lt;strong&gt;CSDD&lt;/strong&gt;, the academic methodology from Marri. It embeds security constraints directly into the project's constitution — &lt;em&gt;SEC-001 (CWE-89) - SQL Injection Prevention,&lt;/em&gt; with MUST-level enforcement, semgrep integration, and rejection of any code that violates the rule. The reported result is a 73% reduction in security defects. Genuinely impressive - for security defects [1].  &lt;strong&gt;The constitution still doesn't ask whether the feature should exist.&lt;/strong&gt; It just makes sure that &lt;strong&gt;if&lt;/strong&gt; you build the feature, you don't build it with SQL injection in it.&lt;/p&gt;

&lt;p&gt;Or look at &lt;strong&gt;Intent&lt;/strong&gt;, the commercial L2 framework with a $252M war chest behind it. Its Verifier agent atomically updates the spec when the implementing agent discovers an API change or a missing requirement. &lt;/p&gt;

&lt;p&gt;This is genuinely innovative - bidirectional Living Spec is a real advance. But the Verifier verifies the &lt;em&gt;spec-to-code consistency&lt;/em&gt;, not the &lt;em&gt;idea-to-need consistency&lt;/em&gt;. &lt;strong&gt;The spec is still gospel.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I could keep going. Every reviewer in every framework reviews &lt;strong&gt;the implementation against the spec.&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;They check whether the code matches what was planned. &lt;br&gt;
They catch bugs. &lt;br&gt;
They suggest refactors. &lt;br&gt;
They block bad merges.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;None of them ask:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should this have been planned in the first place?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The spec is treated as gospel the moment it's written. Every downstream agent assumes the upstream decision was correct. The whole pipeline is a confidence-laundering machine — your vague idea goes in one end, a fully-specced deliverable comes out the other, and at no point did anyone say: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Wait, this is dumb.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's not a reviewer problem. That's a design problem. Reviewers can only check what they're pointed at, and they're all pointed downstream.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;Sixteen frameworks. One empty column.&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;I went through the most popular spec-driven frameworks I could find - open-source and commercial, production and research alike. Sixteen of them.&lt;/p&gt;

&lt;p&gt;For each one I cataloged: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what quality mechanisms it actually ships;&lt;/li&gt;
&lt;li&gt;what gates it enforces;&lt;/li&gt;
&lt;li&gt;what it blocks;&lt;/li&gt;
&lt;li&gt;what it merely suggests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mma04tud7s2riw3xw61.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mma04tud7s2riw3xw61.jpg" alt="Sixteen frameworks. One empty column." width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The frameworks group into three rigor levels. The taxonomy comes from Piskala's 2026 paper [2], and I find it useful precisely because it cuts across popularity - the most-starred tool isn't necessarily the most rigorous, and the most rigorous tool isn't necessarily the most useful. Three different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L1 - Spec-First.&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You write a spec before code. The spec lives in the feature branch. After merge, it becomes historical documentation. The promise is flexibility; the cost is low. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most frameworks live here: Spec-Kit, Superpowers, BMAD-METHOD, GSD, cc-sdd, Don Cheli SDD, Agent OS v3, Shotgun, SpecSwarm, OWASP Skill, WordPress SDD. &lt;/p&gt;

&lt;p&gt;Eleven of the sixteen. This is where almost all real-world adoption happens, because the overhead is bearable and the discipline is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2 - Spec-Anchored.&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The spec lives the full project lifecycle. Every change requires an addendum. Every addendum propagates to design, code, and tests. The promise is auditability and traceability; the cost is process overhead. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three frameworks here: MUSUBI (orthodox traceability with EARS format), Intent (modern bidirectional Living Spec), OpenSpec (brownfield-first with delta specs).&lt;/p&gt;

&lt;p&gt;This level pays off when an auditor is going to ask hard questions and you need to answer them with receipts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L3 - Spec-as-Source.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The spec is the only edited artifact. Code is generated and carries a literal &lt;code&gt;// GENERATED FROM SPEC - DO NOT EDIT&lt;/code&gt; comment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two examples: Tessl (commercial, currently in beta) and CSDD (academic). The promise is zero spec drift by construction. &lt;/p&gt;

&lt;p&gt;The cost is that this approach is still aspirational outside narrow contexts - generated-code workflows, security-critical microservices, certain OSS libraries.&lt;/p&gt;

&lt;p&gt;Now look across all three levels at the guardrails each framework offers. Every kind you can imagine is present somewhere.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spec gates&lt;/strong&gt; that block implementation without an approved design - the &lt;code&gt;HARD-GATE&lt;/code&gt; in Superpowers' brainstorming skill, the prerequisite chain in Spec-Kit's CLI, the Contract signature in MUSUBI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TDD enforcement&lt;/strong&gt; that refuses production code without a failing test first - Superpowers' iron law, MUSUBI's blocking Article III, Don Cheli SDD's TypeScript orchestrator gate at 85% coverage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Constitutions&lt;/strong&gt; that block merges when code violates project principles - Spec-Kit's nine articles, MUSUBI's nine articles, CSDD's CWE-mapped security clauses, WordPress SDD's domain-specific constitution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traceability matrices&lt;/strong&gt; linking requirement to design to code to test - MUSUBI's 100% coverage requirement, CSDD's Compliance Traceability Matrix, Tessl's Spec Registry, Intent's bidirectional sync.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code review subagents&lt;/strong&gt; with fresh contexts and PASS/FAIL verdicts - Superpowers' two reviewers, MUSUBI's traceability-auditor, Intent's Verifier.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Five columns of guardrails.&lt;/strong&gt; All of them point in the same direction. Every single one is asking a variant of the same question: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did the implementation follow the rules?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The last two columns of the matrix - &lt;strong&gt;Adversarial debate&lt;/strong&gt; and &lt;strong&gt;Nash equilibrium&lt;/strong&gt; - &lt;strong&gt;are empty for fifteen&lt;/strong&gt; of the sixteen frameworks.&lt;/p&gt;

&lt;p&gt;Nobody - &lt;strong&gt;with one exception&lt;/strong&gt; - puts the idea under attack before the spec gets written. That is the hole.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;The phase that's missing&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;There's a phase that has to happen &lt;strong&gt;before&lt;/strong&gt; specification, and it's the most important one. Call it whatever you want. I call it &lt;strong&gt;The Grilling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grilling is what happens when you take an idea to a room full of people who don't care about your feelings and they tear it apart for an hour.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You leave that room with one of three outcomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The idea is dead.&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Good. You just saved a week.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The idea is different now.&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Better. The thing you build is the thing that should exist.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The idea survives intact.&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Rare. But now it's &lt;em&gt;load-bearing,&lt;/em&gt; not vibes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every good engineering org runs some version of this. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RFCs with adversarial review. &lt;/li&gt;
&lt;li&gt;Design docs with explicit Alternatives Considered and Drawbacks sections. &lt;/li&gt;
&lt;li&gt;Pre-mortems. &lt;/li&gt;
&lt;li&gt;Red teams. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole &lt;strong&gt;point&lt;/strong&gt; is to find out the idea is bad &lt;strong&gt;before&lt;/strong&gt; you've built it, not after.&lt;/p&gt;

&lt;p&gt;AI frameworks skipped this entirely. They went straight from &lt;em&gt;"user said something"&lt;/em&gt; to &lt;em&gt;"let's spec it."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;What's missing isn't another reviewer&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;It would be tempting to look at this gap and say: just add another reviewer agent, one that reviews the spec instead of the code. Problem solved.&lt;/p&gt;

&lt;p&gt;It wouldn't work, and the reasons it wouldn't work are &lt;strong&gt;exactly what makes the missing phase interesting.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A reviewer agent operates on a single artifact and checks it against rules. That works for &lt;em&gt;code-against-spec&lt;/em&gt; or &lt;em&gt;spec-against-constitution&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;It doesn't work for &lt;em&gt;idea-against-reality&lt;/em&gt;, because there's &lt;strong&gt;no static&lt;/strong&gt; rulebook to check against. The check has to be &lt;strong&gt;dynamic&lt;/strong&gt; - you have to actually &lt;strong&gt;attack the idea&lt;/strong&gt; from multiple angles, see what breaks, see what holds, see where the assumptions are fragile. You can't write that as a checklist.&lt;/p&gt;

&lt;p&gt;It also has to run &lt;em&gt;with grounding&lt;/em&gt;. An attacker that doesn't know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what's already in the codebase;&lt;/li&gt;
&lt;li&gt;what's already been tried;&lt;/li&gt;
&lt;li&gt;what failed before;&lt;/li&gt;
&lt;li&gt;what constraints apply. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That attacker just &lt;strong&gt;hallucinates&lt;/strong&gt; objections that don't apply to the real system. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Devil's Advocate role only has weight when it's standing on verified ground truth.&lt;/p&gt;
&lt;/blockquote&gt;
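&lt;p&gt;A minimal sketch of that grounding requirement - every name here is hypothetical, not from any of the frameworks above. The Devil's Advocate only gets to keep an objection whose premise is confirmed by reconnaissance facts:&lt;/p&gt;

```python
# Illustrative sketch only - hypothetical names, not a real framework's API.
# The point: an objection is only admitted when its premise is confirmed
# by reconnaissance facts about the real system.

def grounded_objections(candidate_objections, ground_truth):
    """candidate_objections: iterable of (premise, objection) pairs.
    ground_truth: set of facts gathered during reconnaissance."""
    admitted = []
    for premise, objection in candidate_objections:
        if premise in ground_truth:   # premise verified: the objection has weight
            admitted.append(objection)
        # unverified premise: the objection is a hallucination, drop it
    return admitted
```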

&lt;p&gt;And it has to &lt;em&gt;terminate&lt;/em&gt; on a substantive condition, not a procedural one. &lt;em&gt;"We've talked enough"&lt;/em&gt; is not the same as &lt;em&gt;"The idea has been adequately tested."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Frameworks that have multi-agent debate often &lt;strong&gt;stop&lt;/strong&gt; after a fixed number of rounds, or when a synthesizer declares the discussion mature. &lt;strong&gt;That's exhaustion, not equilibrium.&lt;/strong&gt;&lt;/p&gt;
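&lt;p&gt;The difference can be sketched in a few lines - a toy model with hypothetical agent interfaces, not any framework's actual debate loop:&lt;/p&gt;

```python
# Toy contrast between "exhaustion" and "equilibrium" termination.
# The proposer/advocate interfaces here are hypothetical.

def debate_exhaustion(proposer, advocate, rounds=3):
    """Stop after a fixed number of rounds, settled or not."""
    position = proposer.open()
    for _ in range(rounds):
        objections = advocate.attack(position)
        position = proposer.revise(position, objections)
    return position  # may still be contested

def debate_equilibrium(proposer, advocate, max_rounds=50):
    """Stop only when a full round moves nothing: the proposer keeps its
    position and the advocate's objections don't change - a crude proxy
    for neither side having a profitable deviation."""
    position = proposer.open()
    last_objections = None
    for _ in range(max_rounds):
        objections = advocate.attack(position)
        revised = proposer.revise(position, objections)
        if revised == position and objections == last_objections:
            return position, True        # settled
        position, last_objections = revised, objections
    return position, False               # cap hit without settling
```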

&lt;blockquote&gt;
&lt;p&gt;So what would a framework with this phase actually look like?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you take the question seriously, the constraints multiply quickly. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can't be a single prompt. &lt;/li&gt;
&lt;li&gt;It can't happen before reconnaissance of the actual codebase. &lt;/li&gt;
&lt;li&gt;It needs structural termination.&lt;/li&gt;
&lt;li&gt;It has to live inside a larger pipeline with downstream gates, otherwise the verdict gets quietly ignored.&lt;/li&gt;
&lt;/ul&gt;
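&lt;p&gt;That last constraint - the verdict must gate what comes after - can be sketched as a hard precondition. Again, every name here is hypothetical, not a real framework's API:&lt;/p&gt;

```python
# Hypothetical sketch of a downstream gate: the spec phase refuses to run
# unless an adversarial-review verdict is on record, and refuses outright
# if the verdict killed the idea.

VERDICTS = ("dead", "revised", "survived")

class GateError(RuntimeError):
    pass

class Pipeline:
    def __init__(self):
        self.verdict = None

    def grill(self, idea, verdict):
        if verdict not in VERDICTS:
            raise ValueError(verdict)
        self.verdict = verdict           # recorded, not advisory
        return None if verdict == "dead" else idea

    def specify(self, idea):
        if self.verdict is None:
            raise GateError("no grilling verdict on record")
        if self.verdict == "dead":
            raise GateError("idea was killed in grilling")
        return {"spec_for": idea, "verdict": self.verdict}
```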

&lt;p&gt;I spent a lot of time trying to add this on top of existing frameworks. None of them quite fit. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The L1 tools are too downstream — by the time their gates fire, the idea is already a spec, and the spec is already being implemented. &lt;/li&gt;
&lt;li&gt;The L2 tools add traceability but still treat the spec as input, not as the thing being challenged. &lt;/li&gt;
&lt;li&gt;The L3 tools generate code from spec - but that just makes the spec even more sacred, not less.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mismatch was structural. The missing phase had to sit upstream of where every existing framework starts.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;&lt;a href="https://github.com/kucherenko/gangsta" rel="noopener noreferrer"&gt;Gangsta Agents&lt;/a&gt;&lt;/strong&gt; - a new open-source agentic framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where adversarial scrutiny is a phase, not a prompt; &lt;/li&gt;
&lt;li&gt;where it sits before the spec, not after it; &lt;/li&gt;
&lt;li&gt;and where the verdict it produces flows through a chain of gates that the team can't quietly skip.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The framework has six phases: &lt;br&gt;
&lt;strong&gt;Reconnaissance → Grilling → Sit-Down → Resource Development → The Hit → Laundering.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;The mafia metaphor is intentional&lt;/em&gt; - every phase has a role, every artifact has a name, every transition has a signature from the Don (user). It's a young project (first stable release in April 2026), and the rest of the pipeline is worth its own treatment.&lt;/p&gt;

&lt;p&gt;But the phase that's genuinely new - the phase that doesn't exist anywhere else - is Phase 2.&lt;/p&gt;

&lt;h3&gt;&lt;strong&gt;Coming in Part 2&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Part 2 is the deep dive: how Grilling actually runs as a phase, what the three subagents do (Proposer, Devil's Advocate, Synthesizer), why the user sits at the table as the Don, what Nash Equilibrium termination looks like in practice, and when not to use Grilling at all. With diagrams, decision rules, and a concrete session example.&lt;/p&gt;

&lt;p&gt;If the chart in this article changed how you see the SDD landscape, the next one is where the alternative gets specified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kucherenko/gangsta" rel="noopener noreferrer"&gt;github.com/kucherenko/gangsta&lt;/a&gt; &lt;br&gt;
&lt;a href="https://gangsta.page" rel="noopener noreferrer"&gt;gangsta.page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;[1] Marri, S. R. Constitutional Spec-Driven Development: Enforcing Security by Construction in AI-Assisted Code Generation. arXiv:2602.02584, January 2026. &lt;a href="https://arxiv.org/abs/2602.02584" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.02584&lt;/a&gt;. The 73% figure is from a single banking microservices case study (n=1) — replication needed, but the methodology is solid.&lt;/p&gt;

&lt;p&gt;[2] Piskala, D. B. Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants. arXiv:2602.00180, January 2026. &lt;a href="https://arxiv.org/abs/2602.00180" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2602.00180&lt;/a&gt;. Source of the L1/L2/L3 rigor taxonomy used in the matrix.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
