How I Prevented Quality Collapse in AI-Driven Dev with AI Agents and SDD

  • AI-driven development is fast (sometimes). But can you answer "is this code really OK?"
  • When bugs show up and you tell the AI to fix them, does the debugging ever actually end?
  • Are you quietly hand-fixing AI-generated code line by line?

...Yeah, that was me until last year (lol)

Hoping to save others from the same trap, here's a breakdown of SDD (Spec-Driven Development) and A-SDLC (Agentic SDLC) — the backbone of AIDD (AI-Driven Development).

Terminology

AIDD : AI-Driven Development. A development style where AI leads dev tasks and humans supervise and evaluate.

SDD : Spec-Driven Development. An approach where the spec drives design, implementation, and testing.

A-SDLC : Agentic SDLC. A development style where AI agents autonomously execute the entire Software Development Life Cycle.


Let's get straight to the point.

SDD lives or dies by its spec template and process — and the process can be automated by AI

Here's a bad flow vs. a good flow:

(Figure: bad flow vs. good flow)

Before: humans test, tell AI to fix, test again... infinite loop.

After: a separate AI agent inspects quality before moving to the next phase.

I wrote "inspection" here, not "review," and the distinction matters: an inspection is a formal examination with checklists, severity classifications, and defined pass/fail criteria (IEEE 1028).

That's what the AI should be doing. There is a defined checklist, and the gate won't open until Critical/High findings reach zero.

Even in human team development, quality suffers when one person does everything. The same applies to AI-driven development.
In A-SDLC, the key is to clearly separate phases and roles across multiple AI agents.

A sample AI agent team:

(Figure: AI agent team)

Quality Gates = Inspection Criteria

Specifically, the inspection agent examines from 6 perspectives:

| # | Perspective | What it checks (examples) |
|---|---|---|
| R1 | Spec quality | Requirements have IDs? No ambiguous wording? Edge cases covered? |
| R2 | Design principles | Any SRP, OCP, DIP, DRY, KISS, or YAGNI violations? |
| R3 | Code quality | Error handling, null safety, defensive programming |
| R4 | State transitions | Deadlocks, race conditions, missing state transitions |
| R5 | Performance | Algorithm efficiency, memory, network |
| R6 | Consistency | Spec ↔ design ↔ implementation ↔ tests — all connected? |

If even one Critical/High issue remains, the next phase is blocked. That's the quality gate.
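As a sketch, the gate itself reduces to a simple predicate over the inspection findings. The finding shape here (`{ id, severity, summary }`) is my own illustration, not gr-sw-maker's actual report format:

```javascript
// Sketch of a quality gate check. The finding object shape is hypothetical;
// gr-sw-maker's real inspection reports follow its review standards.
function gatePasses(findings) {
  // Only Critical/High findings block the next phase;
  // Medium/Low are recorded and carried forward.
  return !findings.some(
    (f) => f.severity === "Critical" || f.severity === "High"
  );
}

const findings = [
  { id: "F-002", severity: "Medium", summary: "FR-05 thresholds undefined" },
];
console.log(gatePasses(findings)); // true: a Medium finding alone does not block
```

The point is that the decision is mechanical: no human judgment call about "good enough," just a severity count.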

Using gr-sw-maker, an OSS framework that implements this approach, I'll walk through a real example: building an earthquake map site in just 45 minutes (10 minutes of hands-on time).


① Have AI write the spec — incrementally, with inspections in between

The "S" in SDD. This is the foundation. If your AI-driven dev or SDD isn't producing results, this is where it's weak.

The spec "template" is everything

For human specs, IEEE 830 (SRS) is well-known, but it was designed for humans to read and write.
The chapter structure is too heavy, granularity varies... Hand this to an AI and it becomes a hallucination factory.

So I created ANMS (AI-Native Minimal Spec), a spec template designed for AI.
Clean Architecture's Stable Dependencies Principle is applied to the spec itself — upper layers (purpose, requirements) are stable, lower layers (design, tests) are easy to change.
...This is a unique design choice in gr-sw-maker.

Even with this spec, having AI write it all at once will fail.

  • First, have it write purpose and requirements (Ch1-2) and pass inspection.
  • Then proceed to design and test strategy (Ch3-6).

If the foundation is shaky and you proceed to design, you'll redo everything later. ...Sound familiar?

Natural language makes AI guess wrong

If you write:

Change marker color based on earthquake size

What's "size"? What are the thresholds? What colors? AI interprets freely, producing something different every time.

EARS notation eliminates ambiguity

The same requirement in EARS (Easy Approach to Requirements Syntax):

FR-05: The system SHALL display each earthquake marker with a color determined by magnitude: green for below 3.0, yellow for 3.0–5.0, orange for 5.0–7.0, red for 7.0 and above.

Subject, verb, conditions, thresholds — all there. No room for AI misinterpretation. (Actual spec Ch2)

EARS examples (most requirements can be written with these patterns):

| Pattern | Syntax | Example (actual) |
|---|---|---|
| Ubiquitous | The system SHALL... | Display map with Leaflet.js |
| Event | When [event], the system SHALL... | Marker click → show popup |
| State | While [state], the system SHALL... | During data fetch → show loading |
| Unwanted | If [condition], then the system SHALL... | API failure → show error message |
| Constraint | The system SHALL NOT... | Reject future dates |

After writing the spec, have it inspected

The completed spec is examined by a dedicated agent. Here's an actual finding:

F-002 (Medium): FR-05 mentions "color gradient" but thresholds are undefined. Cannot be tested.
Fix: Specify green for below 3.0, yellow for 3.0–5.0...

This kind of ambiguity is eliminated through AI-to-AI inspection before moving to the next phase.
A human would take ages, but AI finishes this inspection in seconds. That's the power of A-SDLC.
(Actual example)


② Have AI design — expressing Clean Architecture with colors

Once the spec is solid, the design agent creates the architecture.

A unique feature of ANMS is color-coding components by layer. This makes dependency direction violations instantly visible.

  • Critical: At the architecture design stage, separate means from ends and organize dependencies into a clean architecture. This makes the system resilient to bug fixes and spec changes.

(Figure: clean architecture layers)

The Domain layer (orange) is pure functions with zero external dependencies. Dependency direction flows UI → Adapter → Domain, never reversed.
If colors are mixed up, the design is wrong. The inspection agent checks this too.
(Actual design Ch3)
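To make the dependency rule concrete, here is a minimal sketch. The function names are my own illustration, not taken from the actual earthquake-map code: the Domain layer is pure and imports nothing, while the Adapter layer feeds it, never the reverse.

```javascript
// Domain layer (orange): pure functions, zero external dependencies.
// Knows nothing about Leaflet, fetch, or the DOM.
function filterByTimeSpan(quakes, fromMs, toMs) {
  return quakes.filter((q) => q.time >= fromMs && q.time <= toMs);
}

// Adapter layer: converts an external API shape (GeoJSON-like here)
// into plain domain objects. It depends on Domain's data shapes;
// Domain never imports from here.
function adaptApiResponse(geoJson) {
  return geoJson.features.map((f) => ({
    magnitude: f.properties.mag,
    time: f.properties.time,
  }));
}

// The UI layer would call adaptApiResponse, then hand the plain
// objects to domain functions such as filterByTimeSpan.
```

Because the Domain layer is pure, it can be unit-tested with no mocks at all, which is part of why the test derivation in the next step works so cleanly.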


③ Derive test cases from the spec

This is the best part.

gr-sw-maker has test specifications written in Gherkin notation, so test code can be derived directly from the spec.

```gherkin
Scenario: SC-005 Marker color by magnitude (traces: FR-05)
  Given earthquake data has been successfully fetched
  Then markers for magnitude below 3.0 are displayed in green
  And markers for magnitude 3.0–5.0 are displayed in yellow
  And markers for magnitude 5.0–7.0 are displayed in orange
  And markers for magnitude 7.0 and above are displayed in red
```

Each line of Gherkin becomes a test case. (Actual spec Ch4)

Auto-generated test code:

```javascript
import { magnitudeToColor } from "../../src/domain/magnitude-scale.js";

describe("magnitudeToColor", () => {
  it("returns green (#00CC00) for magnitude below 3.0", () => {
    expect(magnitudeToColor(2.9)).toBe("#00CC00");
  });
  it("returns yellow (#CCCC00) for magnitude 3.0 to 4.9", () => {
    expect(magnitudeToColor(4.0)).toBe("#CCCC00");
  });
  it("returns orange (#FF8800) for magnitude 5.0 to 6.9", () => {
    expect(magnitudeToColor(6.0)).toBe("#FF8800");
  });
  it("returns red (#CC0000) for magnitude 7.0 and above", () => {
    expect(magnitudeToColor(7.5)).toBe("#CC0000");
  });
});
```
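For reference, the domain function these tests exercise could be as small as the following sketch. The real implementation lives in `src/domain/magnitude-scale.js`; I'm assuming the 5.0 boundary falls into orange, consistent with the generated tests above:

```javascript
// Sketch of the FR-05 color mapping; hex values match the generated tests.
// Boundary assumption: each threshold belongs to the higher bucket.
function magnitudeToColor(magnitude) {
  if (magnitude < 3.0) return "#00CC00"; // green: below 3.0
  if (magnitude < 5.0) return "#CCCC00"; // yellow: 3.0–5.0
  if (magnitude < 7.0) return "#FF8800"; // orange: 5.0–7.0
  return "#CC0000"; // red: 7.0 and above
}
```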

Similar to TDD (Test-Driven Development), gr-sw-maker creates test cases (acceptance criteria) before implementation.
The goal — what tests to pass — is defined before coding begins, minimizing rework.

What's also important here: the implementing agent and the testing agent are separate.
The problem of testing your own code is structurally eliminated. (Actual test code)

Statement coverage: 98.93% (domain + adapter layers, target was 80%). Since tests are derived from Gherkin scenarios, coverage structurally matches what's specified.

Implementation inspection results

Inspection results for the earthquake-map implementation:

| Item | Result |
|---|---|
| Verdict | PASS |
| Critical | 0 |
| High | 0 |
| Medium | 3 |
| Low | 3 |

Critical/High is 0, so it passes. Medium/Low are recorded and carried forward. (Actual inspection record)


④ The full run — results

So what actually happened? I built an earthquake map app from a 5-line memo.
Here's the summary.

Initial input (this is all):

```
Earthquake Map
I want to see where and when earthquakes happened in the world.
Runs in the browser only. No server.
I want to zoom into any area.
I want to specify a time span.
```

Two spec changes mid-development:

  • Add an Update button
  • Make it work on local PC

Output:

| Item | Value |
|---|---|
| Functional requirements | 20 |
| Gherkin scenarios | 21 |
| Unit tests | 32 (100% pass) |
| Code coverage | 98.93% |
| Vulnerabilities | 0 |
| Total time | 45 min |

I didn't write a single line of code. Didn't even look at it (lol)

👉 Live app
📝 Introduction article (Zenn)
📄 Spec (with component & sequence diagrams)
💬 Full AI chat transcript
📊 Timeline & metrics

Sounds easy so far, but...

What this article covered is just part of the picture. Under the hood, multiple AI sub-agents are orchestrated in complex ways.

The full picture of gr-sw-maker

You don't need to read everything at these links, but at least look at the diagrams.

| Component | Role |
|---|---|
| ANMS (spec template) | The quality pillar |
| Agent configuration (20+) | Agents and their coordination flow |
| Agent implementations | Prompts for each agent |
| Process rules | Discipline for agents |
| Review standards | Inspection criteria |
| Document management rules | Artifact types, naming, versioning |

Binding AI agents with strict rules might seem harsh. But just like human projects, it's what prevents death marches and leads to happier development.

⑤ Have AI reflect — AI improving its own process

gr-sw-maker doesn't stop here.
After development, AI runs a retrospective.
...You do that in human projects too, right? (lol)

Improvement proposals from the earthquake-map retrospective:

| ID | Problem | Improvement |
|---|---|---|
| IP-01 | ES modules didn't work with file://, causing a full redo | Add compatibility check in early phases for "no server" projects |
| IP-02 | Ambiguity between auto-fetch vs. button-triggered data loading | Add trigger method question to interview template |
| IP-03 | Leaflet's async rendering caused screenshot verification timeouts | Document screenshot limitations for async UIs |
| IP-04 | No procedure for closing inspection findings | Make finding closure an explicit orchestrator step |

All of these were identified by AI itself. It even produced a gap analysis of the gr-sw-maker framework (4 SDD items + 6 A-SDLC items).
(The retrospective is recorded at the end of the chat transcript)

These findings feed back into the next project, and the framework itself improves over time.


Challenges ahead

The retrospective surfaced these honestly:

  • Cost: More agents means more token consumption. It's a speed-vs-quality tradeoff.
    → I accept this.

  • Process gaps: Like IP-04, the procedure for closing inspection findings is still weak.
    → Run more diverse projects, retrospect, and close the gaps.
    ...Humans hate this kind of grunt work, but AI doesn't complain.

  • Multi-language: Spec templates and prompts are written in both Japanese and English, so AI translation enables other languages.
    I actually developed gr-sw-maker in Japanese and translated to English. The earthquake map was developed in English.
    → Next I'd like to try German or Chinese (whichever has better translation accuracy).

  • Beyond Claude Code: Designed to avoid vendor lock-in so it works with other AI tools, though porting is needed (~15%). Naturally, AI does the porting.
    See the porting guide for details.
    → Maybe Cursor next?

  • Scaling up: Applying to complex, large-scale multi-service systems. A single spec won't suffice — multiple specs (ANPS) or graph specs (ANGS) will be needed.
    → This is where individual effort hits its limits. Both money and time.

  • Human judgment remains: Acceptance testing and final decisions are human. It's nearly fully automated, but not unmanned.
    Critical: You need enough skill to review AI-inspected artifacts and make the AI say "you're absolutely right."


Summary

  • Quality collapses because you have AI write code immediately.
  • Separate agents for spec → SW design & test design → implementation → testing.
  • Set up quality gates between phases with inspection every time.

It's exactly what human development does. Just have AI do it faithfully.

Well, getting that to actually work is surprisingly tricky (lol)
When it is, try gr-sw-maker as a starting point.
I hope it makes your development easier and your products better.


Try it now

```shell
npm init gr-sw-maker
```

Standards and concepts referenced in this article

© 2026 GoodRelax. MIT License.
