Teemu Piirainen
How I Validate Quality When AI Agents Write My Code

Someone asked me the best question after I posted about managing AI agents like a dev team:

And how do you validate quality?

Fair point. If AI is writing the code, who's making sure it actually works?

My solution: a system of enforced gates that makes shipping bad code harder than shipping good code. Here's how I built that system.


The Mental Model: Quality Is a Pipeline, Not a Checkpoint

Often we think of quality as something you check at the end. Run the tests. Do a code review. Ship it.

But we already learned this lesson with SDLC and SSDLC:

security and quality must be embedded in every phase, not bolted on at the end.

The same principle applies when AI writes the code. The difference is that you can't rely on an AI agent's developer discipline to follow the process. Your AI framework must enforce it through gates that agents cannot bypass.

AI agents can produce plausible-looking code that passes superficial inspection but drifts from requirements, violates architecture patterns, or introduces subtle bugs. I first tried the obvious approach: detailed instructions telling the coding agent to handle testing, architecture patterns, and edge cases all at once. It never worked reliably. The breakthrough came when I loosened the constraints. Let the LLM write its best code freely, then build independent validation gates with separate agents that catch what the first one missed.

My workflow has eight quality gates. Code must pass through all of them before it reaches production.

If issues surface at Gate 5, 6, or 7, the fix flows back through Gate 3 → 4 before proceeding. In my experience, most issues are caught at Gate 4.


Gate 1: Requirements Definition (~70% of My Time)

This is the most counterintuitive part. In an AI-native workflow, I spend roughly 70% of my time defining requirements, not writing code. My role has shifted from how to build it to what to build and why. The code is the agent's job. Getting the requirements right is mine.

Why does this matter for quality? Because agents are extremely literal. Give them vague instructions and they'll build something that technically matches what you said but misses what you meant. The quality of AI output is directly proportional to the clarity of input.

How It Works

I use a requirements-analyst agent that:

  1. Reads the issue from our project management tool (Linear)
  2. Researches business requirements documentation to map functional and non-functional requirements
  3. Searches for industry patterns and best practices
  4. Asks me clarifying questions until requirements are unambiguous
  5. Decomposes epics into right-sized stories with clear acceptance criteria

Every issue gets a structured format:

## What
[Problem to solve]

## Why
[Business value]

## Context
[Constraints, dependencies, scope]

## Acceptance Criteria
- [ ] Criterion 1 (specific, testable)
- [ ] Criterion 2
- [ ] Criterion 3

The key insight: acceptance criteria are the contract between me and the agents. If a criterion is vague, the agent will interpret it however it wants. If it's specific and testable, the agent has a clear target, and so does the validator that checks the work later.

But requirements alone aren't enough. I also maintain architecture documentation: files that describe the project's patterns, conventions, data models, and design system. When a code-architect agent later designs the implementation, it reads these docs and follows established patterns rather than inventing its own. The requirements define what, the architecture docs constrain how.

What This Prevents

  • Scope creep (agents build exactly what's specified, nothing more)
  • Spec drift (each sub-task traces back to business requirements)
  • Wasted iterations (ambiguities are resolved before any code is written)

Gate 2: Architecture Design

Before any code is written, a code-architect agent takes the requirements from Gate 1 and the architecture documentation I maintain, then designs the implementation. For example, my project maintains docs like these:

docs/code-documentation/
├── architecture-backend.md
├── architecture-frontend.md
├── business-requirements.md
├── gcp-setup.md
├── design-system.md
├── testing-guidelines.md
└── ...

I typically maintain 10-20 such documents per project. These are living documents that evolve with the codebase. They serve as context for every agent, ensuring each one understands the project's patterns, conventions, and constraints before making any decisions.

The architect agent reads relevant docs before designing anything, so it follows established patterns instead of inventing its own. Its process:

  • Reads project architecture docs to understand established patterns and conventions
  • Analyzes the existing codebase for relevant precedents
  • Researches best practices for the specific technology stack
  • Designs the feature architecture with specific file paths, component responsibilities, and data flow
  • Breaks the work into ordered implementation phases
  • Creates sub-issues for each phase with its own acceptance criteria

I review the blueprint if it proposes changes to the general architecture. Personally, I want to understand and own the high-level design, but that's a preference, not a requirement of the system.

The blueprint specifies:

  • Every file to create or modify
  • Component responsibilities and boundaries
  • Data flow from entry points through transformations to outputs
  • Build sequence that defines which phases must complete before others can start

Each sub-issue carries its own acceptance criteria, which means the validator at Gate 4 has specific targets to check against. The quality chain is: requirements → architecture → implementation → validation, and each gate feeds the next.

What This Prevents

  • Architecture drift (agents follow established patterns, not their own ideas)
  • Integration failures (data flow is designed upfront, not discovered during integration)
  • Over-engineering (scope is bounded by the blueprint)

Gate 3: Implementation with Built-in Validation

The developer agents (separate agents for each domain: backend, frontend, and so on) don't just write code and hand it off. They have mandatory validation steps built into their process. Why separate agents? Each one has a focused prompt, an isolated context window, and role-specific evaluation criteria. A backend agent doesn't need to know about React patterns, and vice versa.

Incremental Testing

After modifying or creating each file, the agent runs only the related test file, not the full suite. This is deliberate: running all tests after every file change slows the agent dramatically, especially as the project grows and integration tests get heavier. By scoping to the affected test file, feedback cycles stay at seconds instead of minutes. The agent must fix failures before moving to the next file. This works well when test boundaries are clear (one service = one test file), and catches issues at the smallest possible scope. The full test suite runs later at Gate 4.
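The file-to-test mapping can be sketched as a tiny helper, assuming the "one service = one test file" naming convention described above. The naming scheme here is an assumption for illustration, not the article's actual code:

```typescript
import * as path from "path";

// Map a changed source file to its sibling test file, so the agent can
// run only that test instead of the full suite.
// Assumed convention: src/services/user.ts -> src/services/user.test.ts
export function relatedTestFile(changedFile: string): string {
  const dir = path.dirname(changedFile);
  const base = path.basename(changedFile).replace(/\.ts$/, "");
  return path.join(dir, `${base}.test.ts`);
}
```

The agent would then invoke the test runner with only this file as an argument, keeping each feedback cycle in the seconds range.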

Pre-Completion Validation

Before reporting back, every developer agent must run and pass three checks:

  1. Type-check: zero errors
  2. Lint: zero errors
  3. Test suite: all tests pass + coverage >= 90% for new/modified files (as a minimum guardrail, not a quality metric, since high coverage alone doesn't prove tests are meaningful)

These checks use custom validation scripts that produce compact, structured output: a 5-line summary instead of hundreds of lines of test runner noise. This matters because verbose tool output slows AI agents down significantly. When agents can parse results in seconds instead of scrolling through walls of text, the feedback loop stays tight.
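A minimal sketch of the compact-output idea: collapse a test runner's report into a few lines an agent can parse at a glance. The report shape and field names here are assumptions, not any specific runner's format:

```typescript
// Assumed shape of a raw test report; real runners emit far more noise.
interface Report {
  passed: number;
  failed: number;
  coverage: number; // percent for new/modified files
  failures: string[]; // failure descriptions, if any
}

// Condense the report into a short, structured summary.
export function summarize(r: Report): string {
  const status = r.failed === 0 && r.coverage >= 90 ? "PASS" : "FAIL";
  return [
    `STATUS: ${status}`,
    `TESTS: ${r.passed} passed, ${r.failed} failed`,
    `COVERAGE: ${r.coverage}% (threshold 90%)`,
    `FAILURES: ${r.failures.slice(0, 3).join("; ") || "none"}`,
  ].join("\n");
}
```

The point of the design is that the agent reads four predictable lines instead of scrolling through raw runner output.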

What This Prevents

  • Cascading failures (small scope means bugs are isolated to one subtask)
  • Test regressions (existing tests must still pass before moving on)
  • Untested code (90% coverage enforced per file)

Gate 4: Code Validator Agent

After each developer agent completes, a dedicated code-validator agent runs independently. This is the quality gate that blocks commits.

The validator:

  1. Reads the issue and acceptance criteria
  2. Inspects recent changes and existing tests
  3. Runs the full test suite for affected packages
  4. Generates and reviews coverage reports
  5. Performs a code review focusing on correctness, edge cases, security, and project conventions
  6. Decides: PASS or FAIL

This review focuses on the current subtask in isolation. The broader feature-level review happens at Gate 5.

Confidence Scoring

The validator rates each potential issue on a 0-100 confidence scale:

| Score | Meaning |
|-------|---------|
| 0 | False positive, not a real issue |
| 25 | Might be real, might be false positive |
| 50 | Real issue, but minor |
| 75 | Verified real issue, will impact functionality |
| 100 | Confirmed critical issue |

Only issues with confidence >= 75 are reported. The scoring uses structured prompts that require the agent to provide evidence for each finding. No evidence, no report. It's a pragmatic filtering mechanism that dramatically reduces noise and false positives.
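The filtering rule can be sketched in a few lines; the `Finding` shape is invented for illustration, but it captures the two conditions above: confidence at or above 75, and evidence attached:

```typescript
// Assumed shape of a validator finding; field names are illustrative.
interface Finding {
  description: string;
  confidence: number; // 0-100, per the scale above
  evidence: string;   // e.g. a file/line citation; empty = unsubstantiated
}

// Only confident, evidence-backed findings are reported to the developer agent.
export function reportable(findings: Finding[]): Finding[] {
  return findings.filter(f => f.confidence >= 75 && f.evidence.length > 0);
}
```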

The Hard Rule

Commits are blocked until the validator returns PASS. If it returns FAIL, the developer agent is re-invoked to fix the issues, and the validator runs again. The workflow enforces this automatically, so there's no way to skip it.

Developer Agent
  ↓
Validator (FAIL)
  ↓
Developer Agent (fix)
  ↓
Validator (PASS)
  ↓
Commit allowed
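The loop above could be driven by something like this sketch, where `develop` and `validate` stand in for the real agent invocations; the function names and retry cap are invented for illustration:

```typescript
type Verdict = "PASS" | "FAIL";

// Re-invoke the developer agent with the validator's findings until the
// validator passes, up to a retry cap; only a PASS unblocks the commit.
async function fixLoop(
  develop: (feedback?: string) => Promise<void>,
  validate: () => Promise<{ verdict: Verdict; feedback: string }>,
  maxRetries = 3,
): Promise<Verdict> {
  await develop(); // initial implementation
  for (let i = 0; i <= maxRetries; i++) {
    const { verdict, feedback } = await validate();
    if (verdict === "PASS") return "PASS"; // commit allowed
    await develop(feedback); // fix round with specific findings
  }
  return "FAIL"; // escalate after exhausting retries
}
```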

What This Prevents

  • Convention violations (code that works but doesn't follow project patterns)
  • Coverage regressions (no commit without meeting the threshold)
  • Blind spots from the writing agent (independent review catches what the author missed)

Gate 5: Multi-Agent Code Review

While Gate 4 validates each subtask in isolation, Gate 5 reviews the entire feature across all commits before creating a pull request. A code review skill runs multiple specialized agents in parallel:

Parallel Review Agents

Four agents run simultaneously, each with a different focus:

  1. Architecture Compliance: Audit changes against architecture documentation, flag violations with exact rule citations
  2. Bug Detection: Scan the diff for logic errors, null handling issues, and edge cases
  3. Security Review: Check for vulnerabilities, injection risks, and unsafe patterns in changed code
  4. E2E Test: Run an end-to-end test that exercises the new feature from the user's perspective

Validation Round

Each flagged issue goes through a separate validation agent that confirms the issue actually exists. This filters out false positives before any findings are reported.

High-Signal Only

The review explicitly does not flag:

  • Code style concerns (linters handle that)
  • Subjective improvements
  • Pre-existing issues not introduced in this change
  • Pedantic nitpicks
  • Patterns used consistently elsewhere in the codebase

What This Prevents

  • Architectural violations slipping through
  • Security issues in new code
  • Logic bugs that tests don't cover

Gate 6: CI/CD Pipeline

Gates 3-5 all run on the agent's machine. Gate 6 is the first time the code runs in a completely independent environment. When the pull request is marked ready for review, CI runs the full pipeline from scratch.

Detect Changed Packages
        ↓
  Lint & Type Check
        ↓
  ┌─────┼─────┐
  ↓     ↓     ↓
Pkg A Pkg B Pkg C   (tests in parallel)
  ↓     ↓     ↓
  └─────┼─────┘
        ↓
      Build
        ↓
  Static Scanners
        ↓
  Ready for Review

Smart Change Detection

The CI pipeline detects which packages changed and only runs their tests. If shared types change, all dependent packages are retested automatically because cascading dependencies are tracked. This keeps CI fast on small changes while still catching cross-package breakage.
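The cascading part can be sketched as a breadth-first walk over a reverse dependency map; the map's shape (package name to the packages that depend on it) is an assumption for illustration:

```typescript
// Given the directly changed packages and a reverse dependency map,
// return everything that must be retested, including transitive dependents.
export function affectedPackages(
  changed: string[],
  dependents: Record<string, string[]>, // pkg -> packages that depend on it
): Set<string> {
  const affected = new Set(changed);
  const queue = [...changed];
  while (queue.length > 0) {
    const pkg = queue.shift()!;
    for (const dep of dependents[pkg] ?? []) {
      if (!affected.has(dep)) {
        affected.add(dep);
        queue.push(dep); // cascade: dependents of dependents are retested too
      }
    }
  }
  return affected;
}
```

So a change to a shared-types package fans out to every consumer, while a leaf-package change stays scoped to itself.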

What the Pipeline Runs

  1. Lint & Type Check: Static analysis across all changed packages
  2. Per-package tests: Unit and integration tests run in parallel for each affected package
  3. Build: Full production build of all changed modules
  4. Static Scanners: Run static analysis tools to catch potential security issues before merging

Draft PR Strategy

PRs are always created as drafts first. CI skips draft PRs to save CI minutes. When ready for review, the PR is marked as non-draft, which triggers the full pipeline. This means CI resources are only spent on code that's already passed all local gates (Gate 3 + Gate 4).

What This Prevents

  • Environment-specific failures (clean CI, not the developer's machine)
  • Cross-package breakage (shared type changes tested across all dependents)
  • Build failures in production configuration

Gate 7: Human Review and Merge

This is the only manual approval gate in the entire pipeline. After CI passes, I personally review the changes before merging. This is a critical checkpoint that forces me to consciously take ownership of delivered code. I want to understand what changed at a high level so that I'm able to steer future work and make informed decisions about architecture and design patterns.

The review is intentionally lightweight. By this point, the code has already passed five automated gates. I'm not hunting for bugs or style issues. I'm checking that the feature makes sense, the approach aligns with where the project is heading, and nothing looks fundamentally wrong.

What This Prevents

  • Losing architectural awareness (I stay informed about every change)
  • Autopilot merging (conscious decision to ship, not rubber-stamping)
  • Strategic drift (changes that technically work but move the project in the wrong direction)

Gate 8: Deployment Verification

On merge to main, automated release tooling creates a versioned release, and the deploy pipeline runs:

  1. Validates environment variables before building (catches missing config early)
  2. Builds all changed modules with production configuration
  3. Deploys only changed components: backend, frontend, and infrastructure rules are deployed independently based on what actually changed
  4. Verifies all deployments succeeded: if any component fails, the release is marked as failed with actionable retry instructions

What This Prevents

  • Deploying with missing or misconfigured environment variables
  • Deploying unchanged components unnecessarily
  • Silent partial failures (one component fails but the release looks successful)

The System in Practice

Here's what a typical feature looks like flowing through these gates:

1. Define requirements           [~1 hour]
2. Architecture design           [~10 min]
3. Implementation + tests        [~0.5-6 hours in total]
4. Validator after each phase    [~3 min each]
5. Code review before PR         [~5 min]
6. CI pipeline                   [~8 min]
7. I review and merge            [~10 min]
8. Deploy on merge               [~5 min]

Regardless of feature size, the validation overhead stays roughly bounded: about 20 minutes of automated checks. The implementation time scales with complexity, but the quality gates are much less variable. That's the point.

When the Pipeline Catches Something

Here's a real example. A developer agent implemented a new feature that added a field to a shared data model. Unit tests passed. Type-check passed. Coverage was above 90%. The agent reported success.

Then the validator ran. It detected that while the new field existed in the TypeScript interface and the backend service, the Firestore converter (responsible for translating between the database and the application) was never updated. Data would be written to the database but silently lost on read. The validator returned FAIL, the developer agent was re-invoked with the specific finding, and it fixed the converter in under a minute.

Without Gate 4, that bug would have shipped. Unit tests didn't catch it because they mocked the database layer. The type system didn't catch it because the converter used a spread operator that silently dropped unknown fields. Only an independent agent reviewing the full change against project conventions found the gap.
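Here is an illustrative reconstruction of that failure mode, not the project's actual code. The interface, field names, and converter are invented, but they show how a spread over defaults keeps the type-checker happy while silently dropping a newly added field on read:

```typescript
interface Conversation {
  id: string;
  title: string;
  archived?: boolean; // the newly added field
}

const defaults: Conversation = { id: "", title: "" };

// The converter spreads defaults and then copies known fields explicitly.
// Because `archived` was never added here, it is silently lost on read,
// and TypeScript raises no error since the result still satisfies the type.
function fromFirestore(data: Record<string, unknown>): Conversation {
  return { ...defaults, id: String(data.id), title: String(data.title) };
}
```

Unit tests with a mocked database layer never exercise this path, which is exactly why only an independent review of the full change caught it.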

That failure became a permanent memory entry. Now every agent touching shared data models gets warned: "Converter updates require synchronized changes in 4+ locations."


Long-Term Memory: Quality That Improves Itself

Without persistent memory, every session starts from zero. Agents repeat the same mistakes, validators catch the same failures, and I re-explain the same constraints. The quality gates work, but they don't get better.

Long-term memory closes this gap. It forms a feedback loop with the gates: the validator catches a failure, that failure becomes a memory entry, and in the next session the developer agent gets warned before it writes a single line of code. The agent avoids the mistake. The validator confirms. Gates catch problems once. Memory prevents them forever.

This compounds. Early in a project, agents make more mistakes and the validator catches them frequently. After 10+ runs, agents start each session already knowing dozens of project-specific traps. Validator failures become rarer. The system gets faster because it spends less time fixing and re-running.

Here are a few real pitfalls that the pipeline caught and encoded:

  • Zod/TypeScript sync: Adding interface fields requires updating Zod schemas AND all consumers
  • Test mock indices: New LLM-calling nodes shift ALL mock call indices in integration tests
  • Config wiring: Adding a parameter signature without reading config is a silent no-op
  • Converter updates: New conversation fields require synchronized updates in 4+ locations

These aren't hypothetical. Each one caused a real failure, was caught by the validation pipeline, and became permanent institutional knowledge. Every project accumulates its own version of this list.
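A memory entry can be as simple as a trigger plus a warning that is surfaced at session start. This is a hypothetical sketch of the idea; the field names and matching rule are invented, and the one stored entry is taken from the list above:

```typescript
// Assumed shape of a long-term memory entry.
interface MemoryEntry {
  trigger: string; // when to surface the warning
  warning: string; // the lesson learned from a past failure
}

const memory: MemoryEntry[] = [
  {
    trigger: "shared data model",
    warning: "Converter updates require synchronized changes in 4+ locations",
  },
];

// At session start, match the task description against stored triggers so the
// developer agent is warned before writing any code.
export function warningsFor(taskDescription: string): string[] {
  const task = taskDescription.toLowerCase();
  return memory.filter(m => task.includes(m.trigger)).map(m => m.warning);
}
```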

The result is a quality system that develops itself. Every feature it validates teaches it how to validate the next one better. No human intervention needed for this loop to run. This is fundamentally different from static tooling that works the same way on day one and day three hundred.


What I've Learned

1. Front-load requirements, not reviews

The biggest quality lever isn't better testing. It's clearer requirements. When I spend an hour defining exactly what a feature should do with specific acceptance criteria, the agents produce correct code on the first try far more often than when I rush through requirements and rely on review to catch problems.

2. Separate writing from validation

Don't ask the same agent to write code and verify it. That's like having students grade their own exams. The coding agent's job is to write the best code it can. The validator is a separate agent with a separate prompt, separate context, and explicit permission to fail the work. It has no incentive to pass. This separation is what makes the gates trustworthy.

3. One subtask at a time

The natural instinct is to implement a full feature and test at the end. That's where quality breaks down. Instead, break the work into small subtasks, implement one, validate it, commit it, then move to the next. Each commit is a known-good checkpoint. When something fails, the blast radius is one subtask, not an entire feature. This pattern is counterintuitive but it's the most practical change another developer could adopt immediately.

4. Enforce the process in the framework, not in prompts

You can't tell an AI agent to "be careful" and expect consistent results. The quality comes from a workflow that runs validation automatically after every subtask, not from instructions asking agents to remember to test. Bake the gates into the framework so they execute by default. When skipping a gate is harder than following it, quality becomes a property of the system rather than a hope.

5. This is an engineering problem, not an AI problem

The question isn't:

Can AI write good code?

It can.

The question is:

Does your system prevent bad code from shipping?

That requires overlapping automated gates, independent validation agents, long-term memory, and a workflow that enforces all of it. No single technique is enough. The system is the product.


Tools: Claude Code, Codex, specialized AI agents per role, skills, long-term memory for persistent learnings, git worktrees, Linear for issue tracking, GitHub Actions for CI/CD.
