DEV Community: Travis Martin

Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes

Travis Martin — Tue, 23 Jun 2026 16:35:59 +0000

Series context: This is a follow-up to How I Automate Parts of My SDLC with AI Agents. Earlier posts covered the pipeline overview, the Validate phase, agent state management, and rate-limit resilience. This one is about the part that holds the whole thing together: how I know whether a change to my harness actually made it better.

The Question I Couldn't Answer

I changed a prompt. The next run looked better. Did the prompt help, or did I get lucky?

For a long time my honest answer was "I think so?" I'd tweak a slash command, run my pipeline on a feature, watch it succeed, and ship the change. That's vibes-based prompt engineering, and almost everyone building agents does it — because the alternative feels impossible.

Here's the trap. A coding agent is non-deterministic. Run the same task twice and you get two different trajectories. So a single good run tells you almost nothing: you can't separate "my change helped" from "the dice came up nice this time." And open-ended feature work has no single right answer, so you can't just write a unit test for "did the agent build the feature well."

That's two problems stacked on top of each other: non-determinism and no ground-truth. If you don't solve both, every harness change is a coin flip you can't see.

This post is how I solved it — by stealing the discipline from ML evaluation and pointing it at my own harness.

Where Eval Sits in the Pipeline

Quick orientation for anyone new to the series. My ADW harness runs a feature through seven phases:

research → plan → build → validate → test → review → document

Eval isn't one of those phases. Eval is meta — it wraps the whole harness and asks a different question:

                  ┌─────────────────────────────────────┐
   harness  ─────►│  run the full pipeline on N tasks,  │
   change         │  M times each, score every run      │────► verdict:
                  │  (code graders + LLM judge)         │      better / worse / noise
                  └─────────────────────────────────────┘

The phases build software. The eval framework measures whether my changes to how the phases work are improvements or regressions. It's testing the tester.

The Core Idea: A Frozen Benchmark You A/B Against

The whole thing rests on one move borrowed from ML: fix everything except the variable you're testing.

A few terms I'll use (this harness has its own vocabulary, so here are the one-liners):

Target — a sample codebase a task runs against, vendored: copied and frozen into the repo so scores stay comparable over months. If the target drifts, your scores aren't measuring your harness anymore.
Task — one benchmark item: a prompt plus an oracle.
Oracle / acceptance checks — the deterministic definition of "correct": shell commands that must exit 0. No oracle, no task.
Variant — a labeled config under test: a planner, a flag, a branch. This is the thing you're A/B-ing.
Trial — one run of one task. Because agents are non-deterministic, every task runs N times (default 3).
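To make the Task and Oracle definitions concrete, here is a minimal sketch in Python. The `Task` dataclass, its field names, and `oracle_passes` are hypothetical names chosen for illustration, not the harness's actual code; the only thing taken from the text is that an oracle is a list of shell commands that must all exit 0.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class Task:
    """One benchmark item: a prompt plus its acceptance checks (hypothetical shape)."""
    task_id: str
    prompt: str
    oracle: list[str] = field(default_factory=list)  # shell commands that must exit 0

    def __post_init__(self) -> None:
        # "No oracle, no task": refuse to build a task that cannot be graded.
        if not self.oracle:
            raise ValueError(f"task {self.task_id} has no oracle")


def oracle_passes(task: Task, workdir: str) -> bool:
    """Run every acceptance check in workdir; correct means all of them exit 0."""
    for command in task.oracle:
        result = subprocess.run(command, shell=True, cwd=workdir, capture_output=True)
        if result.returncode != 0:
            return False
    return True
```

Keeping the oracle as plain shell commands means the same grader works for a pytest target and a TypeScript target without knowing anything about either.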

I run two difficulty tiers:

Tier 1 — a tiny hermetic notes CLI with pytest. Cheap regression gate, ~5 min/trial.
Tier 2 — a frozen ~69K-LOC Express/TypeScript backend with Postgres + Redis. Realistic headroom, ~20 min/trial (longer when the build loop has to retry). This is where the interesting failures live.

Thirteen tasks across the two tiers — features, bugs, and chores. Each trial copies the target to a clean temp dir, runs the full SDLC against it, and grades the result. Same tasks, same targets, every time. Now a change is measurable.
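The "copy the target to a clean temp dir" step might look roughly like the sketch below. `run_trial` and the `run_pipeline` callback are assumed names; the callback stands in for running the full SDLC and grading the result.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_trial(target: Path, run_pipeline: Callable[[Path], bool]) -> bool:
    """Copy the frozen target to a clean temp dir, run against the copy, return the grade."""
    with tempfile.TemporaryDirectory() as scratch:
        workdir = Path(scratch) / target.name
        shutil.copytree(target, workdir)  # the vendored target itself is never touched
        return run_pipeline(workdir)      # stand-in for "full SDLC, then grade"
```

Because every trial starts from a fresh copy, one trial's edits cannot leak into the next, which is what keeps scores comparable.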

How I Grade a Run (and Why I Use Two Judges)

Each trial gets scored two completely different ways, on purpose.

Code graders — deterministic. These are the things a machine can check without an opinion:

test pass-ratio (did the target's own suite go green?)
behavioral acceptance oracle (the task-specific "what correct looks like")
phases completed, test-retry count
cost — the agent-under-test's token spend, pulled from the API-recorded usage
wall-clock time
diff size (did it change three files or thirty?)

LLM judge — probabilistic. A separate, fixed model scores what determinism can't: spec quality (0–5) and implementation fidelity (0–5). One combined call, memoized so re-runs don't re-pay, and its cost is itemized separately from the agent under test — you never want your judge's spend polluting the number you're optimizing.
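One way to get the "memoized so re-runs don't re-pay" behavior is to key the judge's verdict on a hash of its inputs. This is a sketch under assumptions: `MemoizedJudge`, `call_judge`, and the returned dict shape are hypothetical, and a real version would likely persist the cache to disk.

```python
import hashlib
import json
from typing import Callable

# (spec, diff) -> {"spec_quality": int, "impl_fidelity": int}
JudgeFn = Callable[[str, str], dict]


class MemoizedJudge:
    """Caches judge verdicts by content hash so re-scoring the same run costs nothing."""

    def __init__(self, judge_model: str, call_judge: JudgeFn) -> None:
        self.judge_model = judge_model
        self.call_judge = call_judge
        self.cache: dict[str, dict] = {}
        self.paid_calls = 0  # judge spend is counted here, apart from the agent's cost

    def score(self, spec: str, diff: str) -> dict:
        payload = json.dumps([self.judge_model, spec, diff])
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.call_judge(spec, diff)
            self.paid_calls += 1
        return self.cache[key]
```

Including the judge model in the key means swapping the judge invalidates old verdicts instead of silently mixing two scoring scales.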

Why both? Because they catch different failures. The execution signal (tests) tells you the code runs; the judge tells you the code is good. A change can make tests pass while quietly trashing code quality, or write beautiful code that fails a behavioral check. The two signals are complementary, not redundant — lean on only one and you'll optimize a blind spot.

The Metric That Actually Matters: pass@k vs pass^k

This is the part I wish someone had told me earlier.

If you average your trials, you get a mean. Means lie about agents. A task that passes 2 of 3 trials scores 0.67 and a task that passes 3 of 3 scores 1.00; averaged into a suite-level number, that gap can look like noise, when actually one task is reliable and the other is flaky.

So I report two reliability numbers instead:

pass@k — did at least one of k trials pass? This is the capability ceiling: can the agent do this task at all?
pass^k — did all k trials pass? This is consistency: can it do it every time?

A concrete example from my own suite — Task 08, a CRUD feature on the tier-2 backend, run three times:

Task 08 (seasonal-tips CRUD, tier-2 backend):
   trials       → fail, pass, pass
   mean success = 0.67   ← "mostly fine." Looks shippable.
   pass@3       = 1.00   ← it CAN do this task.
   pass^3       = 0.00   ← but not every time. Not reliable.

A mean would've called Task 08 "mostly fine" at 0.67. But pass^3 is all-or-nothing — every trial passes or it's a zero — and here it's a zero: the capability is there, the reliability isn't. That's a completely different engineering problem, and you can't fix what your metric hides.
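The three numbers for Task 08 can be reproduced with a few lines of Python. The function names are mine, and `suite_score` assumes the suite-level figure is a plain average of the per-task metric, which the post does not spell out.

```python
def _require_trials(trials: list[bool]) -> None:
    if not trials:
        raise ValueError("need at least one trial")


def mean_success(trials: list[bool]) -> float:
    """Fraction of trials that passed (the number that hides flakiness)."""
    _require_trials(trials)
    return sum(trials) / len(trials)


def pass_at_k(trials: list[bool]) -> float:
    """1.0 if at least one of the k trials passed: the capability ceiling."""
    _require_trials(trials)
    return 1.0 if any(trials) else 0.0


def pass_hat_k(trials: list[bool]) -> float:
    """1.0 only if all k trials passed: consistency."""
    _require_trials(trials)
    return 1.0 if all(trials) else 0.0


def suite_score(per_task: dict[str, list[bool]], metric) -> float:
    """Average a per-task metric across the suite (assumed aggregation)."""
    return sum(metric(trials) for trials in per_task.values()) / len(per_task)


task_08 = [False, True, True]  # fail, pass, pass
print(round(mean_success(task_08), 2), pass_at_k(task_08), pass_hat_k(task_08))
# prints: 0.67 1.0 0.0
```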

The rule I follow: a difference smaller than the spread across trials is noise, not signal. If variant A scores 0.71 and variant B scores 0.68 but the trial-to-trial spread is ±0.15, you've discovered nothing. Run more trials or make the change bigger.
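The noise rule can be written down as a small decision function. This sketch assumes "spread" means the population standard deviation of each variant's per-repetition scores, taking the larger of the two; the post does not say how spread is computed, so treat that choice and the name `verdict` as illustrative.

```python
from statistics import mean, pstdev


def verdict(baseline: list[float], candidate: list[float]) -> str:
    """Compare two variants' per-repetition scores; return better / worse / noise."""
    delta = mean(candidate) - mean(baseline)
    # Assumption: "spread" is the larger population standard deviation of the two variants.
    spread = max(pstdev(baseline), pstdev(candidate))
    if abs(delta) <= spread:
        return "noise"
    return "better" if delta > 0 else "worse"
```

With means of 0.71 and 0.68 and a spread well above 0.03, this returns "noise", which is the case described above.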

Reading a Scorecard

Here's the format my compare.py spits out when I A/B two planners on the same tasks — the kind of comparison I use to settle a question like is OpenSpec actually a better planner than my original two-agent approach, or do I just like it?

The numbers below are illustrative — they show the shape of the answer compare.py gives, not a published result. The point is what each row tells you, not these exact values.

A/B: travis (baseline) vs openspec        tasks=13  trials=3
======================================================================
                       travis            openspec          Δ
----------------------------------------------------------------------
pass@3  (capability)   0.85              0.92             +0.07
pass^3  (reliability)  0.54              0.77             +0.23  ◄ the real win
mean spec quality      3.9 / 5           4.3 / 5          +0.4
mean impl fidelity     3.7 / 5           4.1 / 5          +0.4
avg SUT cost / task    $0.71             $0.63            -$0.08
avg diff size          214 LOC           158 LOC          -56
avg wall time          8m12s             7m41s            -31s
----------------------------------------------------------------------
verdict: openspec wins on reliability and cost; capability gap
         within trial spread (±0.06) — treat as tied there.

The row to read for isn't capability — if both planners can eventually solve most tasks, pass@3 lands in the same place and tells you little. The row that decides it is pass^3: a jump there means a planner didn't make the agent smarter, it made it consistent. That's the kind of conclusion you simply cannot reach by eyeballing a couple of runs — and it's exactly what I built the scorecard to surface before I promote a planner to the default slot.

That's the whole payoff. A scorecard like this turns "I think it's better" into "it's +X on pass^3 at lower cost, and the capability gap is within noise" — or it tells you there's no difference worth shipping, which is just as useful.

Keeping the Benchmark Honest: Saturation and the Capture Loop

A benchmark has a failure mode of its own: saturation. When your tasks get easy enough that every variant aces them, the suite stops discriminating — good change and bad change score the same, and you're back to flying blind.

Two things keep mine sharp.

1. Hard-mode tasks with no greppable answer. My nastiest task is a real, unplanted latent defect I found in the tier-2 backend: a Sequelize findAndCountAll combined with a hasMany include that inflates count by the number of JOINed rows. There's no magic keyword to grep for — the agent has to actually understand ORM semantics to find it. And there are two instances of the bug while the prompt only reports one, so the task also grades whether the agent sweeps for siblings or fixes the one and leaves. That single task discriminates harnesses that "look right" from harnesses that investigate.

2. The capture loop. When a real run produces a buggy result on a vendored target, I mint it into a new permanent task with /capture_eval_task. The acceptance oracle is harvested from my fix, and before I trust the task I run a saturation check — a quick A/B that only earns the task a spot if the current harness doesn't already ace it. In other words: every real failure becomes a permanent guard, and the benchmark gets harder exactly as fast as the agent gets better. The suite can't go stale because my own mistakes keep feeding it.
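The saturation check reduces to a simple admission rule: a captured task is only worth keeping if the current harness does not already pass it on every trial. A minimal sketch, with `earns_a_spot` as a hypothetical name:

```python
def earns_a_spot(current_harness_trials: list[bool]) -> bool:
    """Keep a captured task only if the current harness does not already ace it."""
    if not current_harness_trials:
        raise ValueError("run at least one trial before admitting a task")
    # all() passing means pass^k is already 1.0, so the task cannot discriminate.
    return not all(current_harness_trials)
```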

This is the closed loop that turns "a pile of tasks" into a system that stays useful.

Bonus: Finding the Cheapest Thing That Still Works

Once you can score variants, a fun question opens up: what's the cheapest model-and-effort setting that still passes a task?

I run a model×effort sweep — each cell of the grid is just another eval variant — and report prints a Pareto view plus the cheapest cell that still gets a full pass. That's how I reason about which model each phase gets: today the harness runs Sonnet for build and the execution-heavy phases (validate, test, review, document) and reserves Opus for planning, where spec quality moves pass^k the most. The sweep is what turns "can I drop a tier here?" from a vibe into a lookup — it shows exactly where going cheaper stops being free.
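A Pareto view over sweep cells, plus the "cheapest cell that still gets a full pass" lookup, can be sketched as below. The `Cell` shape, the model and effort labels, and every number in the usage example are made up for illustration; they are not results from the actual sweep.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    model: str
    effort: str
    cost_usd: float
    pass_rate: float  # fraction of trials passed


def pareto_front(cells: list[Cell]) -> list[Cell]:
    """Keep cells that no other cell matches or beats on both cost and pass rate."""
    front = []
    for c in cells:
        dominated = any(
            o.cost_usd <= c.cost_usd
            and o.pass_rate >= c.pass_rate
            and (o.cost_usd < c.cost_usd or o.pass_rate > c.pass_rate)
            for o in cells
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost_usd)


def cheapest_full_pass(cells: list[Cell]) -> Optional[Cell]:
    """Cheapest cell whose trials all passed, or None if no cell gets a full pass."""
    full = [c for c in cells if c.pass_rate == 1.0]
    return min(full, key=lambda c: c.cost_usd) if full else None


# Hypothetical sweep results, for illustration only.
sweep = [
    Cell("small", "low", 0.20, 0.33),
    Cell("small", "high", 0.35, 1.0),
    Cell("large", "low", 0.60, 0.67),
    Cell("large", "high", 1.10, 1.0),
]
```

In this made-up grid, both "large" cells are dominated by small/high, so the front has two cells and the cheapest full pass is small/high.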

Skeptic's Corner

"This is a lot of infrastructure for a side project." It is. But it's also the single highest-leverage thing I built. Every other improvement to the harness — every prompt tweak, every new planner, every phase change — is now measurable instead of hopeful. The eval framework pays for itself the first time it stops you from shipping a regression you were sure was an improvement.

"Thirteen tasks isn't statistically significant." Correct, and I don't pretend otherwise — that's exactly why I report spread and refuse to call anything inside the noise band a win. The goal isn't a p-value, it's to stop fooling myself. Thirteen real tasks scored honestly beats one impressive demo every time.

"The agent is an LLM, and it's graded by an LLM. Isn't that circular?" That's why the judge is fixed, memoized, cost-isolated, and paired with deterministic graders. The execution signal is the ground truth; the judge only scores the things tests can't see. If they disagree, that disagreement is itself a signal worth reading.

The Bigger Picture

Coding agents are easy to demo and hard to trust. The thing that moved my harness from "neat demo" to "system I iterate on with confidence" wasn't a smarter prompt or a bigger model — it was deciding to measure. Frozen targets, real tasks, two kinds of graders, and reliability numbers that don't let a flaky 2-of-3 hide behind a friendly average.

Most "AI agent" projects skip this part because it's unglamorous. That's exactly why it's worth doing. The discipline is the moat.

What's Next

The next post in the series gets back into the pipeline itself: the Review phase — how the review agent compares what was built against the original spec, categorizes issues by severity, and how I auto-patch blockers. After that, a deeper look at the OpenSpec planner integration — and how I'm using the scorecard above to decide whether it earns the default slot.

If you're building your own harness, the one thing I'd steal first isn't any single phase — it's the eval loop. Build the thing that tells you whether you're improving, and everything else gets easier.

The Validate Phase: How I Catch AI Code Issues Before They Reach My Tests

Travis Martin — Sun, 08 Mar 2026 12:52:12 +0000

Series context: This is a deep-dive follow-up to How I Automate Parts of My SDLC with AI Agents. If you haven't read that post, the short version: I built an agentic dev workflow (ADW) that automates my full development cycle: Plan → Build → Validate → Test → Review → Document. This post focuses on the Validate phase.

Why Validation Is the Most Underrated Phase

The elephant in the room: AI-generated code is fast but imperfect
Linters and static analysis exist for human-written code, so why would AI-written code get a free pass?
Without a validate phase, imperfections land directly in your test agent (or worse, in review)
The validate phase is the quality gate that makes the rest of the pipeline trustworthy
Quick recap of where it sits in the pipeline:

Plan → Build → [Validate ×3] → [Test ×3] → Review → Document
                    ↑
              You are here

What Validation Is NOT (Scope Clarity)

Not running unit tests: that is the Test Agent's job (separate agent, separate concerns)
Not running the application
No external service calls or DB connections
Purely static analysis: we only analyze the code itself, nothing needs to execute

This separation is intentional. Each agent does one thing well (SRP). Keeping validation static means it is fast enough to retry 3 times without killing your pipeline's momentum or burning a hole in your wallet.

The Tool Stack

JavaScript / TypeScript (the original)

ESLint with custom architectural rules
Custom rules enforce things like: no direct fetch in components, no model imports in routes
One command, JSON output, done
I create custom Claude commands that encapsulate the exact flow each agent needs. For an actual example, see the validate command file; a simplified version of the validation agent's command looks like this:

cd backend && npm run validate:architecture:json
cd frontend && npm run validate:architecture:json

Java / Spring Boot (the new addition)

The same philosophy, different tools. Here is the parallel:

Concern          JS Tool               Java Tool
Architecture     ESLint custom rules   ArchUnit
Code style       ESLint                Checkstyle
Code smells      ESLint plugins        PMD
Bug patterns     -                     SpotBugs
Fast fail gate   implicit              mvn compile

Execution order matters, fastest checks first:

# 1. Fast fail: stop here if this breaks, no point running anything else
mvn compile -q

# 2. Style + formatting
mvn checkstyle:check

# 3. Code smells and complexity
mvn pmd:check

# 4. Bytecode-level bug patterns
mvn spotbugs:check

# 5. Architecture rules only, isolated by JUnit tag
mvn test -Dgroups=architecture -Dsurefire.failIfNoSpecifiedTests=false

Why compile first? SpotBugs and ArchUnit analyze compiled bytecode (Checkstyle and PMD read source), so the later steps need a successful build anyway. A compile failure is also the cheapest signal: there is no point running ArchUnit on code that does not compile.
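If the validation agent drives these checks from Python, as the rest of the harness does, the fail-fast ordering can be sketched like this. `run_checks` and `JAVA_CHECKS` are hypothetical names; the Maven commands are the ones listed above.

```python
import subprocess


def run_checks(checks: list[tuple[str, str]], cwd: str = ".") -> tuple[bool, list[str]]:
    """Run (name, command) pairs in order, stopping at the first non-zero exit."""
    ran = []
    for name, command in checks:
        ran.append(name)
        result = subprocess.run(command, shell=True, cwd=cwd, capture_output=True)
        if result.returncode != 0:
            return False, ran  # later, more expensive checks never start
    return True, ran


# Cheapest first, in the same order as the shell listing.
JAVA_CHECKS = [
    ("compile", "mvn compile -q"),
    ("checkstyle", "mvn checkstyle:check"),
    ("pmd", "mvn pmd:check"),
    ("spotbugs", "mvn spotbugs:check"),
    ("architecture", "mvn test -Dgroups=architecture -Dsurefire.failIfNoSpecifiedTests=false"),
]
```

Calling `run_checks(JAVA_CHECKS, cwd="backend")` would return the pass/fail flag plus the names of the checks that actually ran, so the log shows where the gate stopped.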

Architecture Rules with ArchUnit

Brief intro: ArchUnit lets you write your architecture decisions as executable tests
These are not regular unit tests: they validate structure, not logic
Tag them separately so the Validation Agent and Test Agent have zero overlap

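import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.classes;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTag;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;
// Assumes Spring's @Transactional; swap the import if you use the Jakarta one.
import org.springframework.transaction.annotation.Transactional;

// ArchUnit's JUnit 5 engine reads its own @ArchTag rather than Jupiter's @Tag.
// The tag is still visible to Surefire's -Dgroups / -DexcludedGroups filters.
@ArchTag("architecture")
@AnalyzeClasses(packages = "com.yourapp")
class ArchitectureRules {

    // Controllers must not call repositories directly
    @ArchTest
    static final ArchRule no_direct_repo_in_controllers =
        noClasses()
            .that().resideInAPackage("..controller..")
            .should().dependOnClassesThat()
            .resideInAPackage("..repository..");

    // Services must not import Spring MVC annotations
    @ArchTest
    static final ArchRule services_must_not_use_mvc =
        noClasses()
            .that().resideInAPackage("..service..")
            .should().dependOnClassesThat()
            .resideInAPackage("org.springframework.web.bind.annotation..");

    // Naming conventions enforced
    @ArchTest
    static final ArchRule controllers_named_correctly =
        classes()
            .that().resideInAPackage("..controller..")
            .should().haveSimpleNameEndingWith("Controller");

    // @Transactional only allowed on service layer (this checks class-level annotations)
    @ArchTest
    static final ArchRule transactional_only_on_services =
        noClasses()
            .that().resideOutsideOfPackage("..service..")
            .should().beAnnotatedWith(Transactional.class);
}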

The Test Agent excludes this tag so there is zero overlap between the two agents:

# Validation Agent runs ONLY architecture tests
mvn test -Dgroups=architecture -Dsurefire.failIfNoSpecifiedTests=false

# Test Agent runs everything EXCEPT architecture tests
mvn test -DexcludedGroups=architecture

The Standardized Violation Schema: The Secret Sauce

Each tool has its own output format, which is noisy and inconsistent across tools
The agent cannot reliably reason about what to fix if the input format varies per tool
Solution: normalize everything into one consistent JSON schema before feeding it into the fix loop
AI can help with the normalization step: write a prompt that takes raw tool output and maps it to the schema, and give it a few examples of the required input and output format
This is the same schema used in the JS version: the contract does not change, only the tools that populate it do
Tools change, but the schema stays stable and consistent across languages. This is the key to making the rest of the pipeline tool-agnostic

[
  {
    "rule": "ArchUnit/no-direct-repo-in-controllers",
    "file": "src/main/java/com/app/controller/UserController.java",
    "line": 34,
    "column": null,
    "severity": "error",
    "message": "Controllers should not import repositories directly. Use a service.",
    "fix_suggestion": "Replace UserRepository injection with UserService. Controllers should only depend on the service layer."
  },
  {
    "rule": "checkstyle/MethodLength",
    "file": "src/main/java/com/app/service/OrderService.java",
    "line": 87,
    "column": 1,
    "severity": "warning",
    "message": "Method length is 72 lines (max 50).",
    "fix_suggestion": "Extract the validation logic into a private helper method to reduce method length."
  }
]

Two severity levels, two behaviors:

severity: "error" → fails validation, triggers the auto-fix retry loop
severity: "warning" → logged and visible but does not fail the phase

A note on fix_suggestion for Java: ESLint can generate suggestions natively. Java tools cannot. Instead, maintain a small rule registry: a lookup map of rule name → suggestion string that the normalizer uses when building the output. It takes upfront effort, but it pays off on every retry cycle.
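A deterministic normalizer backed by a rule registry could look like the sketch below. The function names, the registry contents, and the fallback string are assumptions for illustration; the output keys match the schema shown above.

```python
from typing import Optional

# Hypothetical registry: rule id -> suggestion string, maintained by hand.
RULE_SUGGESTIONS = {
    "checkstyle/MethodLength": "Extract logic into a private helper method to reduce method length.",
}
FALLBACK_SUGGESTION = "No suggestion registered for this rule yet."


def normalize(tool: str, rule: str, file: str, line: int, message: str,
              severity: str, column: Optional[int] = None) -> dict:
    """Map one raw tool finding onto the shared violation schema."""
    if severity not in ("error", "warning"):
        raise ValueError(f"unknown severity: {severity}")
    rule_id = f"{tool}/{rule}"
    return {
        "rule": rule_id,
        "file": file,
        "line": line,
        "column": column,
        "severity": severity,
        "message": message,
        "fix_suggestion": RULE_SUGGESTIONS.get(rule_id, FALLBACK_SUGGESTION),
    }


def phase_fails(violations: list[dict]) -> bool:
    """Only severity "error" fails validation; warnings are logged but pass."""
    return any(v["severity"] == "error" for v in violations)
```

The fallback keeps the schema complete even for rules nobody has written a suggestion for yet, so the fix agent never has to handle a missing key.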

The Auto-Fix Retry Loop

When violations are found, the agent does not stop: it feeds the structured violations back to the LLM to fix, then re-validates
Hard cap at 3 retries before escalating to human intervention
Two failure modes to guard against, both described after the flow below:

Build Output
     ↓
Run Validation Tools
     ↓
Normalize all output → JSON violations array
     ↓
violations.length > 0?
  YES → Feed violations to fix agent → Re-validate (max 3 attempts)
  NO  → Phase complete ✅
     ↓
Still failing after 3 retries? → Halt, surface to human 🛑

Violation diffing between retries: track the count before and after each fix attempt. If the count is not going down, the agent is stuck. Escalate instead of looping.

Regression detection: if new violations appear that were not present before a fix attempt, the fix introduced a regression. Treat this as a separate signal and re-run from the last clean state rather than continuing forward.
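Putting the retry cap and both guards together, the loop might be sketched as follows. The `validate` and `fix` callables are hypothetical stand-ins for the tool run and the fix agent, and identifying a violation by (rule, file) is my assumption, since line numbers shift as code is edited. Rolling back to the last clean state is left to the caller.

```python
from typing import Callable

Violations = list[dict]


def _errors(violations: Violations) -> Violations:
    # Only severity "error" drives the retry loop; warnings pass through.
    return [v for v in violations if v["severity"] == "error"]


def _identity(violations: Violations) -> set[tuple[str, str]]:
    # Assumption: (rule, file) identifies a violation across fix attempts.
    return {(v["rule"], v["file"]) for v in violations}


def validate_with_retries(
    validate: Callable[[], Violations],
    fix: Callable[[Violations], None],
    max_retries: int = 3,
) -> str:
    """Return "clean", "stuck", "regression", or "escalate"."""
    current = _errors(validate())
    for _ in range(max_retries):
        if not current:
            return "clean"
        fix(current)
        after = _errors(validate())
        if _identity(after) - _identity(current):
            return "regression"  # the fix introduced something new; caller should roll back
        if len(after) >= len(current):
            return "stuck"       # count is not going down; escalate instead of looping
        current = after
    return "clean" if not current else "escalate"
```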

What this looks like in your pipeline output:

Phase 3: Validation
======================================================================
SUCCESS
   Critical: 0, Warnings: 2, Attempts: 2

   Found 4 violations on first pass
   Auto-fixed: controller importing repository directly, method too long,
               missing @Override annotation, unused import
   Re-validated: CLEAN

What About SonarQube?

You might already have SonarQube running in CI. Does it belong here too?
Short answer: no, not inside the validation agent
mvn sonar:sonar is slow and expensive, a bad fit for a loop that may run 3 times
This agent runs before a push, so there is no Sonar result to even poll yet
SonarQube's natural home is post-push in CI, as a final safety net before merge
Instead, run SonarLint in connected mode locally in your IDE: same quality profile as your server, zero pipeline cost

The principle: order checks by cost. Cheap and fast first, expensive later (or delegate to CI entirely). This is why compile runs before Checkstyle, and Checkstyle before ArchUnit.

Before and After: What Your Test Agent Receives

Without a validation agent the test agent receives raw AI output that may include:

A controller calling a repository directly (arch violation)
A method that is 90 lines long (PMD)
Unused imports (Checkstyle)
A missing null check (SpotBugs)
The test agent now has to fight bad structure AND broken tests simultaneously

With a validation agent the test agent receives code that has already passed:

Compile check
Style and formatting rules
Smell and complexity thresholds
Bug pattern analysis
Your architectural boundaries

The test agent works with clean, well-structured code every time. That is why your test agent rarely needs all 3 of its own retries.

Key Takeaway

The validate phase is not about distrusting AI. It is about applying the same rigor to AI-generated code that you apply to any code: the same linters, the same architectural rules, the same standards your team already agreed on. The difference is that it runs automatically, fixes itself, and only escalates to you when it genuinely cannot resolve the issue.

The code that reaches your Test Agent has already been through a compile check, style validation, smell detection, bug pattern analysis, and your architecture rules. You are not reviewing raw AI output. You are reviewing code that has already been through the gauntlet.

What's Next

I plan to write about the Review Agent next: why it's needed, and how we can apply a bit more automation when issues are found during that phase.
The Review Agent: did the AI actually build what the spec asked for?

How I Automate Parts of My Software Development Lifecycle with AI Agents

Travis Martin — Wed, 21 Jan 2026 14:56:06 +0000

Every developer knows the drill: You get a feature request. You create a branch. You write a plan (maybe). You implement. You write tests. You review. You document. Rinse and repeat. What if an AI could handle the tedious parts while you focus on the interesting problems? That's exactly what I built. I call it AI Developer Workflows (ADW). In this post, I'll show you how I automated the complete software development lifecycle using AI agents, and how you can do the same. I have templates for TypeScript, Go, and Java, but it can easily be adjusted to other languages: you just need to update the prompt commands.

My Day-to-Day Development Workflow

Most of my day wasn't spent solving interesting problems. It was spent on ceremony. Can I take my workflow and get AI to automate parts or all of it? Here is my day before my AI workflow:

Read through all the Jira tasks, find the one I like the most, and assign it to myself, hoping the details exist and it's NOT just a one-liner that the PM created.
Planning: Read a lot of code and decide HOW my feature will fit into the code base.
Implementation Plan: Once I have all relevant files or identified the areas of change, I create a document to keep track of all the changes that are needed.
Implementation: Start coding!
Oh yeah, tests: I then write my tests. I know I probably should follow TDD or some framework.
Review: Now that everything is done, let's compare the feature with the actual Jira ticket. Did we build the correct thing? Did we miss anything? Hopefully not, and also pray for NO scope creep.
Documentation? Lol

The Solution: AI Developer Workflow (ADW)

ADW is a framework that orchestrates AI agents through a complete SDLC pipeline. The idea is to have each agent do one task extremely well (SRP), rather than trying to get a single agent to perform multiple tasks.

┌───────────────────────────────────────────────────────────────────────┐
│                         ADW Pipeline                                  │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  Prompt ──► Plan ──► Build ──► Validate ──► Test ──► Review ──► Doc   │
│               │         │          │          │         │        │    │
│               ▼         ▼          ▼          ▼         ▼        ▼    │
│            Spec      Code      Quality     Fixes    Issues    Docs    │
│            File     Changes    Enforced   Applied   Fixed   Created   │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

1. Plan Phase:

For larger codebases this can be broken into two parts, Research and Planning: the research agent does a deep analysis of which files are relevant before passing them to the planning agent, which keeps the context window under tighter control. This is explained in the presentation "I shipped code I don't understand and I bet you have too" by Jake Nations.

Next you provide a prompt like "Add user authentication with JWT tokens." The planning agent:

Researches your codebase
Identifies relevant files and patterns (Important)
Creates a detailed implementation spec
Outputs a structured plan file

2. Build Phase

The builder agent reads the spec and:

Implements the feature following your codebase patterns
Creates necessary files and modifications
Follows existing conventions automatically

3. Validate Phase (The "AI Writes Bad Code" Killer)

This is the phase that addresses the elephant in the room: "But AI-generated code is garbage!"

The validation agent:

Runs linters, static analysis, and architectural rules
Catches anti-patterns, code smells, and style violations
Automatically fixes violations and retries
Enforces YOUR coding standards, not generic ones

This isn't just go fmt. It's running tools like golangci-lint with your custom ruleset, checking for security issues, verifying architectural boundaries, and ensuring the AI-generated code follows the same standards as human-written code. For non-Go developers, tools like ArchUnit (Java) and ArchUnitTS (TypeScript) enforce architecture rules.

If the AI writes code that violates your standards? The validation agent fixes it automatically, then re-validates. Up to N retries until it's clean.

4. Test Phase

I would say this is one of the more important agents: if you get it right, you shouldn't hit any regressions (mostly). What I like to do here is set up both unit tests and integration tests, then ensure that the slash command knows how to execute them both. This way, if we break anything, this agent will find the issues and fix them. NOTE: we also need to explain HOW the AI can troubleshoot issues. We need good logging (any production app should have great logging), and again we need to tell the AI how to search these logs if it does encounter issues. Do this well and you will save A LOT of tokens and time. In my apps I always add centralized logging and explain to the AI how to search these logs effectively.

The test agent:

Runs your test suite
If tests fail, analyzes the failures
Attempts to fix issues automatically
Retries up to N times (configurable)

5. Review Phase

The review agent:

Compares implementation against the original spec
Identifies gaps, bugs, or missing requirements
Categorizes issues by severity (blocker, tech debt, skippable)
Creates a review report (a separate agent, travis_patch.py, resolves these issues if there are blockers)
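The severity categories can drive a simple pass/fail gate. In this sketch the severity strings, the issue shape, and `review_status` are hypothetical; the default of failing only on blockers mirrors how the author describes the current setting later in this post.

```python
def review_status(issues: list[dict],
                  fail_on: frozenset[str] = frozenset({"blocker"})) -> str:
    """PASSED unless any issue has a severity listed in fail_on."""
    return "FAILED" if any(i["severity"] in fail_on for i in issues) else "PASSED"
```

Passing a wider `fail_on`, for example `frozenset({"blocker", "tech_debt"})`, tightens the gate without touching the review agent itself.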

6. Document Phase

The documentation agent:

Generates user-facing documentation
Updates relevant README sections
Creates API documentation if applicable
Updates conditional_docs.md, which lets us conditionally load documentation when using /feature, so the AI can dynamically load documents when a new feature requires them.

What This Looks Like in Practice

Here's the magic. One command:

~ uv run travis_sdlc.py "Add rate limiting to the API endpoints"

======================================================================
  Travis SDLC Workflow
  ADW ID: a1b2c3d4
======================================================================

Phase 1: Planning
======================================================================
✅ SUCCESS
   File: specs/feature-a1b2c3d4-api-rate-limiting.md

Phase 2: Implementation
======================================================================
✅ SUCCESS

Phase 3: Validation
======================================================================
✅ SUCCESS
   Critical: 0, Warnings: 2, Attempts: 2

   ↳ Found 3 violations on first pass
   ↳ Auto-fixed: unused variable, missing error check, import order
   ↳ Re-validated: CLEAN

Phase 4: Testing
======================================================================
✅ SUCCESS
   Passed: 47, Failed: 0, Attempts: 1

Phase 5: Review
======================================================================
✅ SUCCESS
   Issues: 0

Phase 6: Documentation
======================================================================
✅ SUCCESS
   Path: app_docs/feature-a1b2c3d4-api-rate-limiting.md

======================================================================
  ✅ WORKFLOW COMPLETED SUCCESSFULLY
======================================================================

Notice Phase 3: the validation found 3 violations, automatically fixed them, and re-validated. Since AI does NOT always write the best code, we need to build checks into our agents that enforce coding standards.

From a single prompt to a fully implemented, tested, reviewed, and documented feature.

Skeptic's Corner: Addressing the Hard Questions

I have been working with AI for a few years now, and one of the most common pushbacks is about the quality and quantity of the code it outputs.

AI-generated code is unmaintainable garbage.

You are correct! Unfiltered AI output often has issues: unused variables, missing error handling, duplicate methods, attempts to rebuild a class we already have, and so on. Here is the thing: human developers write code with issues too. This is why we have linters, code reviews, and CI pipelines. ADW applies the same rigor to AI-generated code through the Validate Phase.

┌─────────────────────────────────────────────────────────────────┐
│                    Validation Phase Loop                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Build Output ──► Validate ──► Violations? ──►  Auto-Fix ──┐   │
│                        ▲                                    │   │
│                        └────────────── Re-validate ◄────────┘   │
│                                                                 │
│   Max 3 retries, then human intervention required               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The validation phase runs golangci-lint, eslint, security scanners, and your custom architectural rules. If violations are found, the agent fixes them automatically and revalidates. The AI doesn't write perfect code. But the system catches and corrects mistakes before they reach you.
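The loop in the diagram can be sketched in a few lines of Python. This is a minimal sketch, not the ADW source: `validate_with_autofix`, `ValidationResult`, and the two callables are hypothetical names standing in for whatever runs the linters and asks the agent for a fix. Only the cap of 3 retries comes from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_RETRIES = 3  # "Max 3 retries" from the diagram


@dataclass
class ValidationResult:
    passed: bool
    attempts: int          # how many times validation ran
    violations: List[str]  # violations left after the last run


def validate_with_autofix(
    validate: Callable[[], List[str]],        # hypothetical: runs linters, returns violations
    auto_fix: Callable[[List[str]], None],    # hypothetical: asks the agent to fix them
    max_retries: int = MAX_RETRIES,
) -> ValidationResult:
    """Validate, auto-fix, and re-validate until clean or out of retries."""
    violations = validate()
    attempts = 1
    retries = 0
    while violations and retries < max_retries:
        auto_fix(violations)
        violations = validate()  # re-validate after every fix
        attempts += 1
        retries += 1
    # passed=False here means a human needs to step in
    return ValidationResult(passed=not violations, attempts=attempts, violations=violations)
```

Passing the validator and the fixer in as callables keeps the loop itself independent of any particular linter or language.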

AI just creates tech debt that I'll have to clean up later.

This is a valid concern. AI can take shortcuts, copy-paste patterns inappropriately, or ignore edge cases. The ADW addresses this with the Review Phase. Below is a recent log from a review that found a few issues in my application. This phase can be customized to FAIL when a condition is met; I currently have it set to fail only on blockers. The agent also writes a detailed plan for fixing the issues, saved here: specs/review_issues/review-4cb749dc.md. If I want the AI to fix them, I run uv run .awd/travis/travis_patch.py 4cb749dc (the argument is the ID of this job, and the script picks up everything else on its own).

======================================================================
  Phase 5: Review
======================================================================

ADW Logger initialized - ID: 4cb749dc
Travis Review starting - ADW ID: 4cb749dc
Reviewing implementation against spec: specs/feature-a1b2c3d4-nuclei-vulnerability-scanning.md

Review Summary:
  Status: PASSED
  Tests: PASSED
  Build: PASSED
  Summary: The Nuclei vulnerability scanning feature has been successfully implemented with all core functionality working as specified. The implementation includes proper CLI commands (vuln scan, vuln list), database persistence with migration, Nuclei tool integration, and comprehensive test coverage. All tests pass and the build succeeds. Minor issues exist around missing config validation, incomplete findings integration, and lack of repository tests, but none are blocking the release of this feature.

  Issues Found: 7

  Issue #1:
    Severity: tech_debt
    File: internal/cli/vuln_scan.go:541
    Description: The --custom flag uses BuildCustomTemplateArgs but the implementation uses -templates flag which differs from Nuclei's actual -t flag for custom templates. This may cause issues when users try to specify custom template paths.
    Resolution: Update BuildCustomTemplateArgs in pkg/recon/vulnscan/nuclei.go to use -t flag instead of -templates flag for custom template paths, matching Nuclei's actual CLI interface.

  Issue #2:
    Severity: skippable
    File: configs/default.yaml:181
    Description: The config includes 'severity' and 'exclude_templates' fields, but these are not validated or used anywhere in the codebase. The CLI always requires explicit --severity or --templates flags.
    Resolution: Either implement support for reading default severity and exclude_templates from config file in the scan command, or remove these unused fields from the config schema to avoid user confusion.

  Issue #3:
    Severity: tech_debt
    File: specs/feature-a1b2c3d4-nuclei-vulnerability-scanning.md:209-212
    Description: The spec mentions 'Auto-create findings from critical/high severity results' and 'Link vuln_scans to findings table' as acceptance criteria, but this integration is not implemented. The code only stores in vuln_scans table without creating finding entries.
    Resolution: Add integration with the findings table to auto-create findings for critical/high severity vulnerabilities. This can be done by calling the findings repository after saving vuln scans with high severity.

  Issue #4:
    Severity: tech_debt
    File: internal/repository/vuln_scan.go
    Description: No tests exist for the VulnScanRepository despite the repository having complex JSON serialization logic for references and extracted_results. This creates risk of bugs in database operations.
    Resolution: Add unit tests for VulnScanRepository covering Create, GetByID, GetBySeverity, GetByHost, GetByCVE, Exists, and JSON serialization/deserialization of array fields.

  Issue #5:
    Severity: skippable
    File: .gitignore:9
    Description: The bbrecon binary was removed from .gitignore, which will cause the compiled binary to be tracked by git. This is generally not desired for build artifacts.
    Resolution: Add 'bbrecon' back to .gitignore to prevent the binary from being committed to the repository.

  Issue #6:
    Severity: tech_debt
    File: pkg/recon/vulnscan/nuclei.go:1762-1768
    Description: BuildTemplateArgs uses -tags flag for template categories, but this may not match Nuclei's expected behavior. Nuclei template categories like 'cves' typically use the -t flag with path like '-t cves/' not '-tags cves'.
    Resolution: Verify the correct Nuclei flag for template categories and update BuildTemplateArgs to use -t flag with category paths (e.g., '-t cves/') instead of -tags. Add integration test with actual Nuclei to verify.

  Issue #7:
    Severity: skippable
    File: internal/cli/vuln_scan.go:702
    Description: The getTargetsFromDB function only builds HTTPS URLs but some services may only be accessible via HTTP. This could miss vulnerabilities on HTTP-only services.
    Resolution: Update getTargetsFromDB to check the subdomain's HTTPScheme or URL field if available, or build both HTTP and HTTPS URLs based on the actual probed protocol from the web recon phase.
Review issues written to specs/review_issues/review-4cb749dc.md

Review issues file created: specs/review_issues/review-4cb749dc.md
Review phase completed successfully

Phase 5: Review: ✅ SUCCESS

The review agent specifically looks for:

Spec compliance: Did we actually build what was planned?
Tech debt indicators: Shortcuts, TODOs, incomplete error handling
Missing edge cases: What happens when X fails?

Issues are categorized by severity. Blockers stop the workflow. Tech debt and skippable issues are documented but don't block. You decide what bar to set.
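A severity gate like this can be expressed as a small predicate. The sketch below assumes issues arrive as records with a `severity` field. The values `tech_debt` and `skippable` appear in the log above, while the exact string `blocker`, the `ReviewIssue` type, and the `review_passes` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ReviewIssue:
    severity: str      # e.g. "blocker" (assumed), "tech_debt", "skippable"
    file: str
    description: str


def review_passes(
    issues: Iterable[ReviewIssue],
    fail_on: FrozenSet[str] = frozenset({"blocker"}),
) -> bool:
    """Return True unless any issue has a severity listed in `fail_on`."""
    return not any(issue.severity in fail_on for issue in issues)
```

Raising the bar is then a one-argument change, for example `review_passes(issues, fail_on=frozenset({"blocker", "tech_debt"}))`.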

AI doesn't understand MY codebase. It'll write code that doesn't fit.

This is why the Planning Phase exists. Before writing any code, the planning agent:

Reads your README.md and DESIGN.md
Searches for similar coding patterns in your codebase.
Identifies the files it needs to modify.
Creates a spec that follows YOUR conventions and does NOT make up patterns.

AI analyzes your existing patterns before generating any code. The commands are built to extend what already exists, and only create new components when nothing relevant is found.

I have not tested this, but if the codebase is large you can easily split this phase into two parts: (1) a research agent and (2) a planning agent that uses the research agent's results. The research agent's main job is to find what is relevant to the feature within the large codebase and pass it to the planner. This way we are NOT wasting the planner's context on finding everything the feature needs.

You still have to review everything anyway. What's the point?

Yes, we should always review everything. However, here is the difference between the two:

Without ADW:

Review raw AI output
Find the 12 linting errors
Notice missing error handling
Realize it didn't follow your patterns
Send it back, wait, review again

With ADW:

Validation already caught the linting errors
Tests already verified basic functionality
Review already flagged tech debt
I'm reviewing polished code, not first drafts

The code that reaches me has already passed:

Linter (validation phase)
Static analysis (validation phase)
Unit tests (test phase)
Spec compliance (review phase)

My review is the final check, not the first line of defense.

The Bottom Line

ADW doesn't trust the AI output. It verifies, validates, tests, and reviews automatically. AI is fast but imperfect, and the system is designed to catch imperfections programmatically before they reach you. You still need to review the code, but what you review has already passed multiple quality gates; it is not raw AI output. ADW is a good start on your feature: it doesn't always one-shot it, but it will get you 80–90% of the way there.

The Architecture

Now that we've addressed some of the skepticism, let's look at the architecture. ADW consists of three layers, and only one of them is language-specific.

┌─────────────────────────────────────────────────────────────────┐
│                    ADW Architecture                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Layer 3: Orchestrator (Python)     LANGUAGE-AGNOSTIC   │    │
│  │  travis_sdlc.py - chains phases, manages state          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                            │                                    │
│                            ▼                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Layer 2: Slash Commands (Markdown)  LANGUAGE-SPECIFIC  │    │
│  │  /test, /validate, /review - customize per language     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                            │                                    │
│                            ▼                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Layer 1: Agent Module (Python)     LANGUAGE-AGNOSTIC   │    │
│  │  Claude Code execution, retry logic, state management   │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Layer 1: Core Agent Module (Language-Agnostic)

The foundation that handles:

Claude Code CLI execution
Retry logic for transient failures
Output parsing (JSONL → JSON)
State management

This never changes regardless of what language your project uses.
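Two of these responsibilities, output parsing and retries, are easy to picture in code. The following is a minimal sketch under stated assumptions: `parse_jsonl` and `with_retries` are hypothetical helpers, not functions from the ADW repo, and the exponential backoff is one common choice for transient failures rather than a description of what the module actually does.

```python
import json
import time
from typing import Any, Callable, Dict, List, Tuple, Type, TypeVar

T = TypeVar("T")


def parse_jsonl(raw: str) -> List[Dict[str, Any]]:
    """Turn JSONL (one JSON object per line) into a list of dicts, skipping blank lines."""
    messages = []
    for line in raw.splitlines():
        line = line.strip()
        if line:
            messages.append(json.loads(line))
    return messages


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on `retry_on` errors and doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the backoff can be swapped out in tests; in normal use the default `time.sleep` applies.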

Layer 2: Slash Commands (Language-Specific)

This is where the magic happens. Slash commands are markdown templates that define HOW each phase executes for YOUR language:

.claude/commands/
├── test.md        # "Run go test ./..." or "npm test" or "mvn test"
├── validate.md    # "Run golangci-lint" or "eslint" or "checkstyle"
├── feature.md     # Plan format with language-specific patterns
├── implement.md   # Implementation instructions
├── review.md      # Review criteria
└── document.md    # Documentation format

To support a new language, you only customize these files. For example, here's how /test differs by language. To check that a command works in Claude Code, open the Claude terminal and type /test; if your tests run correctly, that is how the agent will execute the command.

**Go:**
## Test Execution
- Command: `go test ./... -v -race -coverprofile=coverage.out`
- test_name: "go_test"

**TypeScript:**
## Test Execution
- Command: `npm test -- --coverage --watchAll=false`
- test_name: "jest_test"

**Java:**
## Test Execution
- Command: `mvn test -B`
- test_name: "maven_test"

Same orchestrator. Same workflow. Different language-specific commands.

Layer 3: Orchestrator (Language-Agnostic)

The travis_sdlc.py script that:

Chains phases together
Manages state between phases
Handles failures and retries
Provides observability and logging

This is pure Python and doesn't know or care what language your project uses. It just calls the slash commands and processes the results. NOTE: Each of these files can be run independently; they are meant to be isolated function calls to the Claude SDK. I learned this from IndyDevDan.
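The chaining itself can be pictured as a loop over named phases that share a state object. This is a rough sketch, not the real travis_sdlc.py: `WorkflowState`, `run_workflow`, and the field names are hypothetical, and the sketch only stops at the first failure (per-phase retries are left out to keep it short).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class WorkflowState:
    """State handed from phase to phase (field names are hypothetical)."""
    adw_id: str
    completed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)
    failed_phase: Optional[str] = None


# A phase is a name plus a callable that returns True on success.
Phase = Tuple[str, Callable[[WorkflowState], bool]]


def run_workflow(
    phases: List[Phase],
    state: WorkflowState,
    skip: Iterable[str] = (),
) -> WorkflowState:
    """Run phases in order, honoring skips and stopping at the first failure."""
    to_skip = set(skip)
    for name, run in phases:
        if name in to_skip:
            state.skipped.append(name)
            continue
        if not run(state):
            state.failed_phase = name  # later phases never run
            break
        state.completed.append(name)
    return state
```

Because each phase is just a callable, any one of them can also be invoked on its own, which matches the idea of isolated, independently runnable phase scripts.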

Why This Matters

This architecture means:

One orchestrator to maintain: The Python ADW code works for any language
Easy to add new languages: Just write new slash commands
Shareable workflow logic: Test/retry/review logic is universal
Customizable per project: Each repo can have its own command variations

Want to use ADW on a Rust project? Write /test.md with cargo test, /validate.md with clippy, and you're done. The orchestrator handles the rest.

Getting Started

Prerequisites:

Claude Code CLI installed and authenticated
Python 3.11+ with uv package manager (install with brew)
Your codebase with a README.md and basic structure

Add a .env file with the following:

CLAUDE_CODE_PATH=claude # Path to the Claude Code CLI; the default should be this

Once the prerequisites are in place, you need to EDIT the slash commands to match your repo. A big-brain move is to have Claude Code do it for you, maybe with a prompt like this:

Please REVIEW the slash commands located in @.claude/commands/* 
there are some language specific files like: 
- @.claude/commands/test.md 
- @.claude/commands/validate.md
...

We need to update them to MATCH our system in this repo ensure all commands 
are correctly matching our system.

Example Commands

# Simple feature
uv run travis_sdlc.py "Add a health check endpoint"

# Bug fix
uv run travis_sdlc.py "Fix the memory leak in the cache module" --plan-type bug

# Chore/refactor
uv run travis_sdlc.py "Refactor the logging to use structured output" --plan-type chore

# Use a more powerful model for complex tasks
uv run travis_sdlc.py "Implement OAuth2" --model opus

# Skip optional phases
uv run travis_sdlc.py "Quick fix" --skip-review --skip-document

# Increase test retry attempts
uv run travis_sdlc.py "Tricky feature" --max-test-retries 5
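The flags in these examples could be declared with `argparse` roughly as follows. This is a sketch of one possible interface, not the actual travis_sdlc.py: the `feature` default for `--plan-type`, the default of 3 for `--max-test-retries`, and the `build_parser` name are assumptions. Only the flag names and the `bug`, `chore`, and `opus` values come from the examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags used in the example commands."""
    parser = argparse.ArgumentParser(prog="travis_sdlc.py")
    parser.add_argument("prompt", help="plain-language description of the work")
    parser.add_argument(
        "--plan-type",
        choices=["feature", "bug", "chore"],
        default="feature",  # assumed default
    )
    parser.add_argument("--model", default=None, help="model override, e.g. opus")
    parser.add_argument("--skip-review", action="store_true")
    parser.add_argument("--skip-document", action="store_true")
    parser.add_argument("--max-test-retries", type=int, default=3)  # assumed default
    return parser
```

Parsing `["Quick fix", "--skip-review", "--skip-document"]` with this parser sets both skip flags to `True` and leaves the other options at their defaults.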

What's Next

This post covered the "what" and "why" of ADW. In the next posts, I plan to go deeper into the following:

The Planning Phase: How to write effective prompts and customize plan templates
Validation: Enforcing code quality with linters, auto-fixes, and custom rules
Test & Review: Handling failures, auto-fixes, and quality gates
Customizing Slash Commands: Adapting ADW for Go, Java, TypeScript, Rust, or any language

The ADW framework is available on GitHub: [https://github.com/travism26/claude_code_agent_templates]. I'd love to hear how you're using it. Drop a comment or reach out on Twitter [@travism26].

Shoutouts

IndyDevDan: I took his course, and many of the ideas I learned and expanded on here come from it.