<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Travis Martin</title>
    <description>The latest articles on DEV Community by Travis Martin (@rickjms).</description>
    <link>https://dev.to/rickjms</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2188150%2F43f9028f-2d43-4168-98c1-63d5c2b3471d.jpg</url>
      <title>DEV Community: Travis Martin</title>
      <link>https://dev.to/rickjms</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rickjms"/>
    <language>en</language>
    <item>
      <title>Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes</title>
      <dc:creator>Travis Martin</dc:creator>
      <pubDate>Tue, 23 Jun 2026 16:35:59 +0000</pubDate>
      <link>https://dev.to/rickjms/eval-driven-agent-development-how-i-stopped-tuning-prompts-on-vibes-1189</link>
      <guid>https://dev.to/rickjms/eval-driven-agent-development-how-i-stopped-tuning-prompts-on-vibes-1189</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series context:&lt;/strong&gt; This is a follow-up to &lt;a href="https://dev.to/rickjms/how-i-automate-parts-of-my-software-development-lifecycle-with-ai-agents-43h7"&gt;How I Automate Parts of My SDLC with AI Agents&lt;/a&gt;. Earlier posts covered the pipeline overview, the Validate phase, agent state management, and rate-limit resilience. This one is about the part that holds the whole thing together: how I know whether a change to my harness actually made it &lt;em&gt;better&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Question I Couldn't Answer
&lt;/h2&gt;

&lt;p&gt;I changed a prompt. The next run looked better. Did the prompt help, or did I get lucky?&lt;/p&gt;

&lt;p&gt;For a long time my honest answer was "I think so?" I'd tweak a slash command, run my pipeline on a feature, watch it succeed, and ship the change. That's vibes-based prompt engineering, and almost everyone building agents does it — because the alternative feels impossible.&lt;/p&gt;

&lt;p&gt;Here's the trap. A coding agent is non-deterministic. Run the same task twice and you get two different trajectories. So a single good run tells you almost nothing: you can't separate "my change helped" from "the dice came up nice this time." And open-ended feature work has no single right answer, so you can't just write a unit test for "did the agent build the feature well."&lt;/p&gt;

&lt;p&gt;That's two problems stacked on top of each other: &lt;strong&gt;non-determinism&lt;/strong&gt; and &lt;strong&gt;no ground-truth&lt;/strong&gt;. If you don't solve both, every harness change is a coin flip you can't see.&lt;/p&gt;

&lt;p&gt;This post is how I solved it — by stealing the discipline from ML evaluation and pointing it at my own harness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Eval Sits in the Pipeline
&lt;/h2&gt;

&lt;p&gt;Quick orientation for anyone new to the series. My ADW harness runs a feature through seven phases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;research → plan → build → validate → test → review → document
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eval isn't one of those phases. Eval is &lt;em&gt;meta&lt;/em&gt; — it wraps the whole harness and asks a different question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  ┌─────────────────────────────────────┐
   harness  ─────►│  run the full pipeline on N tasks,  │
   change         │  M times each, score every run      │────► verdict:
                  │  (code graders + LLM judge)         │      better / worse / noise
                  └─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The phases build software. The eval framework measures whether &lt;em&gt;my changes to how the phases work&lt;/em&gt; are improvements or regressions. It's testing the tester.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: A Frozen Benchmark You A/B Against
&lt;/h2&gt;

&lt;p&gt;The whole thing rests on one move borrowed from ML: &lt;strong&gt;fix everything except the variable you're testing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A few terms I'll use (this harness has its own vocabulary, so here are the one-liners):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Target&lt;/strong&gt; — a sample codebase a task runs against, &lt;em&gt;vendored&lt;/em&gt;: copied and frozen into the repo so scores stay comparable over months. If the target drifts, your scores aren't measuring your harness anymore.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task&lt;/strong&gt; — one benchmark item: a prompt plus an &lt;strong&gt;oracle&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oracle / acceptance checks&lt;/strong&gt; — the deterministic definition of "correct": shell commands that must exit &lt;code&gt;0&lt;/code&gt;. No oracle, no task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variant&lt;/strong&gt; — a labeled config under test: a planner, a flag, a branch. This is the thing you're A/B-ing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trial&lt;/strong&gt; — one run of one task. Because agents are non-deterministic, every task runs N times (default 3).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I run two difficulty tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1&lt;/strong&gt; — a tiny hermetic notes CLI with pytest. Cheap regression gate, ~5 min/trial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2&lt;/strong&gt; — a frozen &lt;strong&gt;~69K-LOC Express/TypeScript backend&lt;/strong&gt; with Postgres + Redis. Realistic headroom, ~20 min/trial (longer when the build loop has to retry). This is where the interesting failures live.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thirteen tasks across the two tiers — features, bugs, and chores. Each trial copies the target to a clean temp dir, runs the full SDLC against it, and grades the result. Same tasks, same targets, every time. Now a change is measurable.&lt;/p&gt;
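&lt;p&gt;As a rough illustration of the trial isolation described above, here is a minimal Python sketch. It is not the harness's actual code: &lt;code&gt;run_oracle&lt;/code&gt; and its &lt;code&gt;mutate&lt;/code&gt; hook (a stand-in for the full pipeline run) are hypothetical names, and it assumes the oracle is a list of shell command strings.&lt;/p&gt;

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_oracle(target_dir, oracle_cmds, mutate=None):
    """Copy the frozen target to a fresh temp dir, optionally apply a change,
    then run every oracle command; the trial passes only if all exit 0."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "target"
        shutil.copytree(target_dir, workdir)  # the vendored copy is never modified
        if mutate is not None:
            mutate(workdir)  # hypothetical stand-in for the agent's pipeline run
        for cmd in oracle_cmds:
            result = subprocess.run(cmd, shell=True, cwd=workdir)
            if result.returncode != 0:
                return False  # one failing check fails the whole trial
        return True
```

&lt;p&gt;Because every trial starts from a fresh copy, nothing a previous run wrote can leak into the next one, which is what keeps scores comparable across runs.&lt;/p&gt;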




&lt;h2&gt;
  
  
  How I Grade a Run (and Why I Use Two Judges)
&lt;/h2&gt;

&lt;p&gt;Each trial gets scored two completely different ways, on purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code graders — deterministic.&lt;/strong&gt; These are the things a machine can check without an opinion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;test pass-ratio (did the target's own suite go green?)&lt;/li&gt;
&lt;li&gt;behavioral acceptance oracle (the task-specific "what correct looks like")&lt;/li&gt;
&lt;li&gt;phases completed, test-retry count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cost&lt;/strong&gt; — the agent-under-test's token spend, pulled from the API-recorded usage&lt;/li&gt;
&lt;li&gt;wall-clock time&lt;/li&gt;
&lt;li&gt;diff size (did it change three files or thirty?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LLM judge — probabilistic.&lt;/strong&gt; A separate, fixed model scores what determinism can't: spec quality (0–5) and implementation fidelity (0–5). One combined call, &lt;strong&gt;memoized&lt;/strong&gt; so re-runs don't re-pay, and its cost is itemized &lt;em&gt;separately&lt;/em&gt; from the agent under test — you never want your judge's spend polluting the number you're optimizing.&lt;/p&gt;

&lt;p&gt;Why both? Because they catch different failures. The execution signal (tests) tells you the code &lt;em&gt;runs&lt;/em&gt;; the judge tells you the code is &lt;em&gt;good&lt;/em&gt;. A change can make tests pass while quietly trashing code quality, or write beautiful code that fails a behavioral check. The two signals are complementary, not redundant — lean on only one and you'll optimize a blind spot.&lt;/p&gt;
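&lt;p&gt;One way to picture the memoized judge is a cache keyed on everything that could change the score. This is a sketch under assumptions, not the harness's implementation: &lt;code&gt;MemoizedJudge&lt;/code&gt;, &lt;code&gt;judge_fn&lt;/code&gt;, and the in-memory dictionary are all hypothetical, and a real cache would probably live on disk.&lt;/p&gt;

```python
import hashlib
import json


class MemoizedJudge:
    """Wrap a judge callable so identical inputs are only scored (and paid for) once."""

    def __init__(self, judge_fn, model_id):
        self.judge_fn = judge_fn  # hypothetical: calls the fixed judge model
        self.model_id = model_id  # part of the key, so swapping judges invalidates the cache
        self.cache = {}
        self.paid_calls = 0       # judge spend is tracked apart from the agent under test

    def score(self, spec, diff):
        payload = json.dumps([self.model_id, spec, diff])
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.paid_calls += 1
            self.cache[key] = self.judge_fn(spec, diff)
        return self.cache[key]
```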




&lt;h2&gt;
  
  
  The Metric That Actually Matters: pass@k vs pass^k
&lt;/h2&gt;

&lt;p&gt;This is the part I wish someone had told me earlier.&lt;/p&gt;

&lt;p&gt;If you average your trials, you get a mean. &lt;strong&gt;Means lie about agents.&lt;/strong&gt; A task that succeeds 2 out of 3 times and a task that succeeds 3 out of 3 times average out to "67% vs 100%", a gap that is easy to wave off as noise, when actually one is reliable and the other fails a third of the time.&lt;/p&gt;

&lt;p&gt;So I report two reliability numbers instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pass@k&lt;/strong&gt; — did &lt;em&gt;at least one&lt;/em&gt; of k trials pass? This is the &lt;strong&gt;capability ceiling&lt;/strong&gt;: can the agent do this task &lt;em&gt;at all&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pass^k&lt;/strong&gt; — did &lt;em&gt;all&lt;/em&gt; k trials pass? This is &lt;strong&gt;consistency&lt;/strong&gt;: can it do it &lt;em&gt;every time&lt;/em&gt;?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A concrete example from my own suite — Task 08, a CRUD feature on the tier-2 backend, run three times:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task 08 (seasonal-tips CRUD, tier-2 backend):
   trials       → fail, pass, pass
   mean success = 0.67   ← "mostly fine." Looks shippable.
   pass@3       = 1.00   ← it CAN do this task.
   pass^3       = 0.00   ← but not every time. Not reliable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A mean would've called Task 08 "mostly fine" at 0.67. But pass^3 is all-or-nothing — every trial passes or it's a zero — and here it's a zero: the capability is there, the &lt;em&gt;reliability&lt;/em&gt; isn't. That's a completely different engineering problem, and you can't fix what your metric hides.&lt;/p&gt;
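&lt;p&gt;The three numbers for a single task can be computed in a few lines. This is a minimal sketch using the definitions given above (at least one pass, and all pass); the function name &lt;code&gt;summarize&lt;/code&gt; is hypothetical and not taken from &lt;code&gt;compare.py&lt;/code&gt;.&lt;/p&gt;

```python
def summarize(trials):
    """trials: list of booleans, one per trial of a single task."""
    k = len(trials)
    return {
        "mean": sum(trials) / k,                    # fraction of trials that passed
        f"pass@{k}": 1.0 if any(trials) else 0.0,   # capability: at least one pass
        f"pass^{k}": 1.0 if all(trials) else 0.0,   # consistency: every trial passed
    }


# Task 08 from the example above: fail, pass, pass
task_08 = summarize([False, True, True])
```

&lt;p&gt;For Task 08 this gives a mean of about 0.67, a pass@3 of 1.0, and a pass^3 of 0.0, matching the block above.&lt;/p&gt;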

&lt;p&gt;The rule I follow: &lt;strong&gt;a difference smaller than the spread across trials is noise, not signal.&lt;/strong&gt; If variant A scores 0.71 and variant B scores 0.68 but the trial-to-trial spread is ±0.15, you've discovered nothing. Run more trials or make the change bigger.&lt;/p&gt;
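&lt;p&gt;The noise rule can be written down as a small check. This sketch assumes "spread" means the sample standard deviation across trials, which the post does not specify; &lt;code&gt;verdict&lt;/code&gt; is a hypothetical name and the scores are made up to match the 0.71 vs 0.68 example.&lt;/p&gt;

```python
from statistics import mean, stdev


def verdict(a_scores, b_scores):
    """Compare variant B against baseline A using per-trial scores."""
    diff = mean(b_scores) - mean(a_scores)
    # Assumption: spread is the larger of the two sample standard deviations.
    spread = max(stdev(a_scores), stdev(b_scores))
    if abs(diff) > spread:
        return "better" if diff > 0 else "worse"
    return "noise"


# Hypothetical trials: means of 0.71 and 0.68, each with a spread of 0.15
variant_a = [0.56, 0.71, 0.86]
variant_b = [0.53, 0.68, 0.83]
result = verdict(variant_a, variant_b)  # the 0.03 gap sits inside the spread
```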




&lt;h2&gt;
  
  
  Reading a Scorecard
&lt;/h2&gt;

&lt;p&gt;Here's the format my &lt;code&gt;compare.py&lt;/code&gt; spits out when I A/B two planners on the same tasks — the kind of comparison I use to settle a question like &lt;em&gt;is OpenSpec actually a better planner than my original two-agent approach, or do I just like it?&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The numbers below are illustrative — they show the &lt;strong&gt;shape&lt;/strong&gt; of the answer &lt;code&gt;compare.py&lt;/code&gt; gives, not a published result. The point is what each row tells you, not these exact values.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A/B: travis (baseline) vs openspec        tasks=13  trials=3
======================================================================
                       travis            openspec          Δ
----------------------------------------------------------------------
pass@3  (capability)   0.85              0.92             +0.07
pass^3  (reliability)  0.54              0.77             +0.23  ◄ the real win
mean spec quality      3.9 / 5           4.3 / 5          +0.4
mean impl fidelity     3.7 / 5           4.1 / 5          +0.4
avg SUT cost / task    $0.71             $0.63            -$0.08
avg diff size          214 LOC           158 LOC          -56
avg wall time          8m12s             7m41s            -31s
----------------------------------------------------------------------
verdict: openspec wins on reliability and cost; capability gap
         within trial spread (±0.08), so treat as tied there.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The row to read isn't capability: if both planners can &lt;em&gt;eventually&lt;/em&gt; solve most tasks, pass@3 lands in the same place and tells you little. The row that decides it is &lt;strong&gt;pass^3&lt;/strong&gt;: a jump there means a planner didn't make the agent smarter, it made it &lt;em&gt;consistent&lt;/em&gt;. That's the kind of conclusion you simply cannot reach by eyeballing a couple of runs, and it's exactly what I built the scorecard to surface before I promote a planner to the default slot.&lt;/p&gt;

&lt;p&gt;That's the whole payoff. A scorecard like this turns "I think it's better" into "it's +X on pass^3 at lower cost, and the capability gap is within noise" — or it tells you there's no difference worth shipping, which is just as useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Keeping the Benchmark Honest: Saturation and the Capture Loop
&lt;/h2&gt;

&lt;p&gt;A benchmark has a failure mode of its own: &lt;strong&gt;saturation.&lt;/strong&gt; When your tasks get easy enough that every variant aces them, the suite stops discriminating — good change and bad change score the same, and you're back to flying blind.&lt;/p&gt;

&lt;p&gt;Two things keep mine sharp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Hard-mode tasks with no greppable answer.&lt;/strong&gt; My nastiest task is a &lt;em&gt;real, unplanted&lt;/em&gt; latent defect I found in the tier-2 backend: a Sequelize &lt;code&gt;findAndCountAll&lt;/code&gt; combined with a &lt;code&gt;hasMany&lt;/code&gt; include that inflates &lt;code&gt;count&lt;/code&gt; by the number of JOINed rows. There's no magic keyword to grep for — the agent has to actually understand ORM semantics to find it. And there are &lt;em&gt;two&lt;/em&gt; instances of the bug while the prompt only reports one, so the task also grades whether the agent sweeps for siblings or fixes the one and leaves. That single task discriminates harnesses that "look right" from harnesses that investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The capture loop.&lt;/strong&gt; When a real run produces a buggy result on a vendored target, I mint it into a new permanent task with &lt;code&gt;/capture_eval_task&lt;/code&gt;. The acceptance oracle is harvested from &lt;em&gt;my&lt;/em&gt; fix, and before I trust the task I run a &lt;strong&gt;saturation check&lt;/strong&gt; — a quick A/B that only earns the task a spot if the current harness &lt;em&gt;doesn't already ace it&lt;/em&gt;. In other words: every real failure becomes a permanent guard, and the benchmark gets harder exactly as fast as the agent gets better. The suite can't go stale because my own mistakes keep feeding it.&lt;/p&gt;

&lt;p&gt;This is the closed loop that turns "a pile of tasks" into a system that stays useful.&lt;/p&gt;
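&lt;p&gt;To make the count inflation behind the hard-mode task concrete, here is a minimal sketch in plain SQL through Python's &lt;code&gt;sqlite3&lt;/code&gt; module rather than Sequelize. The &lt;code&gt;posts&lt;/code&gt; and &lt;code&gt;comments&lt;/code&gt; tables are hypothetical stand-ins for a parent model with a &lt;code&gt;hasMany&lt;/code&gt; child. Sequelize's &lt;code&gt;findAndCountAll&lt;/code&gt; accepts a &lt;code&gt;distinct&lt;/code&gt; option for this situation; whether that is the fix the task's oracle expects is not stated here.&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'first'), (2, 'second');
    INSERT INTO comments VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 1, 'c'), (4, 2, 'd');
""")

# Joining the child table produces one row per (post, comment) pair.
join = "FROM posts p LEFT JOIN comments c ON c.post_id = p.id"

naive = con.execute(f"SELECT COUNT(*) {join}").fetchone()[0]
fixed = con.execute(f"SELECT COUNT(DISTINCT p.id) {join}").fetchone()[0]

print(naive, fixed)  # prints "4 2": two posts, but four joined rows
```

&lt;p&gt;A pagination total built from the naive count would claim four posts when only two exist, and nothing in the query text looks wrong at a glance.&lt;/p&gt;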




&lt;h2&gt;
  
  
  Bonus: Finding the Cheapest Thing That Still Works
&lt;/h2&gt;

&lt;p&gt;Once you can score variants, a fun question opens up: what's the &lt;em&gt;cheapest&lt;/em&gt; model-and-effort setting that still passes a task?&lt;/p&gt;

&lt;p&gt;I run a model×effort sweep — each cell of the grid is just another eval variant — and &lt;code&gt;report&lt;/code&gt; prints a Pareto view plus the cheapest cell that still gets a full pass. That's how I reason about which model each phase gets: today the harness runs Sonnet for build and the execution-heavy phases (validate, test, review, document) and reserves Opus for planning, where spec quality moves pass^k the most. The sweep is what turns "can I drop a tier here?" from a vibe into a lookup — it shows exactly where going cheaper stops being free.&lt;/p&gt;
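&lt;p&gt;The sweep's two outputs can be sketched as follows. Everything here is illustrative: the cell fields, the model and effort labels, and the costs are hypothetical, and this is not the code behind &lt;code&gt;report&lt;/code&gt;.&lt;/p&gt;

```python
def cheapest_full_pass(cells):
    """Return the lowest-cost cell whose pass^k is 1.0, or None if no cell qualifies."""
    passing = [c for c in cells if c["pass_all"] == 1.0]
    return min(passing, key=lambda c: c["cost"]) if passing else None


def pareto_front(cells):
    """Keep cells that no other cell beats on cost and reliability at once."""
    front = []
    for c in cells:
        dominated = any(
            c["cost"] >= o["cost"]
            and o["pass_all"] >= c["pass_all"]
            and (c["cost"] > o["cost"] or o["pass_all"] > c["pass_all"])
            for o in cells
        )
        if not dominated:
            front.append(c)
    return front


# Hypothetical sweep results for one task
cells = [
    {"model": "small", "effort": "low", "cost": 0.20, "pass_all": 0.0},
    {"model": "small", "effort": "high", "cost": 0.35, "pass_all": 1.0},
    {"model": "large", "effort": "low", "cost": 0.60, "pass_all": 1.0},
    {"model": "large", "effort": "high", "cost": 0.90, "pass_all": 1.0},
]
best = cheapest_full_pass(cells)
front = pareto_front(cells)
```

&lt;p&gt;In this made-up grid the small model at high effort is the cheapest full pass, and both large-model cells drop off the Pareto front because they cost more without being more reliable.&lt;/p&gt;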




&lt;h2&gt;
  
  
  Skeptic's Corner
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"This is a lot of infrastructure for a side project."&lt;/strong&gt; It is. But it's also the single highest-leverage thing I built. Every other improvement to the harness — every prompt tweak, every new planner, every phase change — is now measurable instead of hopeful. The eval framework pays for itself the first time it stops you from shipping a regression you were &lt;em&gt;sure&lt;/em&gt; was an improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Thirteen tasks isn't statistically significant."&lt;/strong&gt; Correct, and I don't pretend otherwise — that's exactly why I report spread and refuse to call anything inside the noise band a win. The goal isn't a p-value, it's to stop fooling myself. Thirteen real tasks scored honestly beats one impressive demo every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"An LLM's output is graded by another LLM. Isn't that circular?"&lt;/strong&gt; That's why the judge is fixed, memoized, cost-isolated, and &lt;em&gt;paired with deterministic graders&lt;/em&gt;. The execution signal is the ground truth; the judge only scores the things tests can't see. If they disagree, that disagreement is itself a signal worth reading.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Coding agents are easy to demo and hard to trust. The thing that moved my harness from "neat demo" to "system I iterate on with confidence" wasn't a smarter prompt or a bigger model — it was deciding to &lt;strong&gt;measure&lt;/strong&gt;. Frozen targets, real tasks, two kinds of graders, and reliability numbers that don't let a flaky 2-of-3 hide behind a friendly average.&lt;/p&gt;

&lt;p&gt;Most "AI agent" projects skip this part because it's unglamorous. That's exactly why it's worth doing. The discipline is the moat.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The next post in the series gets back into the pipeline itself: the &lt;strong&gt;Review phase&lt;/strong&gt; — how the review agent compares what was built against the original spec, categorizes issues by severity, and how I auto-patch blockers. After that, a deeper look at the &lt;strong&gt;OpenSpec planner integration&lt;/strong&gt; — and how I'm using the scorecard above to decide whether it earns the default slot.&lt;/p&gt;

&lt;p&gt;If you're building your own harness, the one thing I'd steal first isn't any single phase — it's the eval loop. Build the thing that tells you whether you're improving, and everything else gets easier.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
    </item>
    <item>
      <title>The Validate Phase: How I Catch AI Code Issues Before They Reach My Tests</title>
      <dc:creator>Travis Martin</dc:creator>
      <pubDate>Sun, 08 Mar 2026 12:52:12 +0000</pubDate>
      <link>https://dev.to/rickjms/the-validate-phase-how-i-catch-ai-code-issues-before-they-reach-my-tests-31c8</link>
      <guid>https://dev.to/rickjms/the-validate-phase-how-i-catch-ai-code-issues-before-they-reach-my-tests-31c8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series context:&lt;/strong&gt; This is a deep-dive follow-up to &lt;a href="https://dev.to/rickjms/how-i-automate-parts-of-my-software-development-lifecycle-with-ai-agents-43h7"&gt;How I Automate Parts of My SDLC with AI Agents&lt;/a&gt;. If you haven't read that post, the short version: I built an agentic dev workflow (ADW) that automates my full development cycle: Plan → Build → &lt;strong&gt;Validate&lt;/strong&gt; → Test → Review → Document. This post focuses on the Validate phase.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Validation Is the Most Underrated Phase
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The elephant in the room: AI-generated code is fast but imperfect&lt;/li&gt;
&lt;li&gt;Linters and static analysis exist for human-written code, so why would AI-written code get a free pass?&lt;/li&gt;
&lt;li&gt;Without a validate phase, imperfections land directly in your test agent (or worse, in review)&lt;/li&gt;
&lt;li&gt;The validate phase is the quality gate that makes the rest of the pipeline trustworthy&lt;/li&gt;
&lt;li&gt;Quick recap of where it sits in the pipeline:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Plan → Build → [Validate ×3] → [Test ×3] → Review → Document
                    ↑
              You are here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Validation Is NOT (Scope Clarity)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Not running unit tests: that is the Test Agent's job (separate agent, separate concerns)&lt;/li&gt;
&lt;li&gt;Not running the application&lt;/li&gt;
&lt;li&gt;No external service calls or DB connections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purely static analysis:&lt;/strong&gt; we only analyze the code itself; none of the application code needs to run&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This separation is intentional. Each agent does one thing well (the single-responsibility principle, SRP). Keeping validation static means it is fast enough to retry 3 times without killing your pipeline's momentum or burning a hole in your wallet.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Tool Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  JavaScript / TypeScript (the original)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ESLint with custom architectural rules&lt;/li&gt;
&lt;li&gt;Custom rules enforce things like: no direct fetch in components, no model imports in routes&lt;/li&gt;
&lt;li&gt;One command, JSON output, done&lt;/li&gt;
&lt;li&gt;For a full example, &lt;a href="https://github.com/travism26/claude_code_agent_templates/blob/main/example_projects/role_matcher_typescript_webapp/.claude/commands/validate.md" rel="noopener noreferrer"&gt;see this validate file&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I create custom Claude commands that encapsulate the exact flow each agent needs. A simplified version of the validation agent's command looks like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;backend &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run validate:architecture:json
&lt;span class="nb"&gt;cd &lt;/span&gt;frontend &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run validate:architecture:json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Java / Spring Boot (the new addition)
&lt;/h3&gt;

&lt;p&gt;The same philosophy, different tools. Here is the parallel:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;JS Tool&lt;/th&gt;
&lt;th&gt;Java Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;ESLint custom rules&lt;/td&gt;
&lt;td&gt;ArchUnit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code style&lt;/td&gt;
&lt;td&gt;ESLint&lt;/td&gt;
&lt;td&gt;Checkstyle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code smells&lt;/td&gt;
&lt;td&gt;ESLint plugins&lt;/td&gt;
&lt;td&gt;PMD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug patterns&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;SpotBugs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast fail gate&lt;/td&gt;
&lt;td&gt;implicit&lt;/td&gt;
&lt;td&gt;mvn compile&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Execution order matters. Fastest checks first:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Fast fail: stop here if this breaks, no point running anything else&lt;/span&gt;
mvn compile &lt;span class="nt"&gt;-q&lt;/span&gt;

&lt;span class="c"&gt;# 2. Style + formatting&lt;/span&gt;
mvn checkstyle:check

&lt;span class="c"&gt;# 3. Code smells and complexity&lt;/span&gt;
mvn pmd:check

&lt;span class="c"&gt;# 4. Bytecode-level bug patterns&lt;/span&gt;
mvn spotbugs:check

&lt;span class="c"&gt;# 5. Architecture rules only, isolated by JUnit tag&lt;/span&gt;
mvn &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-Dgroups&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;architecture &lt;span class="nt"&gt;-Dsurefire&lt;/span&gt;.failIfNoSpecifiedTests&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why compile first?&lt;/strong&gt; SpotBugs and ArchUnit analyze compiled bytecode, so they need a successful build anyway. A compile failure is also the cheapest signal: there is no point running ArchUnit on code that does not compile.&lt;/p&gt;
&lt;/blockquote&gt;
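&lt;p&gt;The ordering above can be driven by a small fail-fast runner. This is a sketch, not the validation agent's actual command: &lt;code&gt;run_gate&lt;/code&gt; and the injectable &lt;code&gt;runner&lt;/code&gt; are hypothetical, and stopping at the first failing step for &lt;em&gt;every&lt;/em&gt; step is an assumption (the comments above only say so explicitly for the compile step). The Maven commands are copied from the block above.&lt;/p&gt;

```python
import subprocess

# Cheapest checks first, as listed above.
STEPS = [
    ("compile", ["mvn", "compile", "-q"]),
    ("checkstyle", ["mvn", "checkstyle:check"]),
    ("pmd", ["mvn", "pmd:check"]),
    ("spotbugs", ["mvn", "spotbugs:check"]),
    ("archunit", ["mvn", "test", "-Dgroups=architecture",
                  "-Dsurefire.failIfNoSpecifiedTests=false"]),
]


def run_gate(steps=STEPS, runner=None):
    """Run steps in order and stop at the first non-zero exit code.

    Returns (names_that_passed, name_that_failed_or_None).
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    passed = []
    for name, cmd in steps:
        if runner(cmd) != 0:
            return passed, name  # later, more expensive steps are skipped
        passed.append(name)
    return passed, None
```

&lt;p&gt;Passing a fake &lt;code&gt;runner&lt;/code&gt; makes the ordering testable without Maven installed.&lt;/p&gt;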




&lt;h2&gt;
  
  
  Architecture Rules with ArchUnit
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Brief intro: ArchUnit lets you write your architecture decisions as executable tests&lt;/li&gt;
&lt;li&gt;These are not regular unit tests: they validate structure, not logic&lt;/li&gt;
&lt;li&gt;Tag them separately so the Validation Agent and Test Agent have zero overlap
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
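&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.tngtech.archunit.junit.AnalyzeClasses&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.tngtech.archunit.junit.ArchTest&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.tngtech.archunit.lang.ArchRule&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.junit.jupiter.api.Tag&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.springframework.transaction.annotation.Transactional&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import static&lt;/span&gt; &lt;span class="nn"&gt;com.tngtech.archunit.lang.syntax.ArchRuleDefinition.classes&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import static&lt;/span&gt; &lt;span class="nn"&gt;com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;@Tag&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"architecture"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;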
&lt;span class="nd"&gt;@AnalyzeClasses&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"com.yourapp"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ArchitectureRules&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Controllers must not call repositories directly&lt;/span&gt;
    &lt;span class="nd"&gt;@ArchTest&lt;/span&gt;
    &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ArchRule&lt;/span&gt; &lt;span class="n"&gt;no_direct_repo_in_controllers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;noClasses&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;that&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;resideInAPackage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"..controller.."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;should&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;dependOnClassesThat&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;resideInAPackage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"..repository.."&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Services must not import Spring MVC annotations&lt;/span&gt;
    &lt;span class="nd"&gt;@ArchTest&lt;/span&gt;
    &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ArchRule&lt;/span&gt; &lt;span class="n"&gt;services_must_not_use_mvc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;noClasses&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;that&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;resideInAPackage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"..service.."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;should&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;dependOnClassesThat&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;resideInAPackage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.springframework.web.bind.annotation.."&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Naming conventions enforced&lt;/span&gt;
    &lt;span class="nd"&gt;@ArchTest&lt;/span&gt;
    &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ArchRule&lt;/span&gt; &lt;span class="n"&gt;controllers_named_correctly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;that&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;resideInAPackage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"..controller.."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;should&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;haveSimpleNameEndingWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Controller"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// @Transactional only allowed on service layer&lt;/span&gt;
    &lt;span class="nd"&gt;@ArchTest&lt;/span&gt;
    &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ArchRule&lt;/span&gt; &lt;span class="n"&gt;transactional_only_on_services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;noClasses&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;that&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;resideOutsideOfPackage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"..service.."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;should&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;beAnnotatedWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Transactional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Test Agent excludes this tag so there is zero overlap between the two agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Validation Agent runs ONLY architecture tests&lt;/span&gt;
mvn &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-Dgroups&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;architecture &lt;span class="nt"&gt;-Dsurefire&lt;/span&gt;.failIfNoSpecifiedTests&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;

&lt;span class="c"&gt;# Test Agent runs everything EXCEPT architecture tests&lt;/span&gt;
mvn &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-DexcludedGroups&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;architecture
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Standardized Violation Schema: The Secret Sauce
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Each tool has its own output format, which can be noisy and inconsistent across tools&lt;/li&gt;
&lt;li&gt;The agent cannot reliably reason about what to fix if the input format varies per tool&lt;/li&gt;
&lt;li&gt;Solution: normalize everything into one consistent JSON schema before feeding it into the fix loop

&lt;ul&gt;
&lt;li&gt;AI can help with this normalization step: write a prompt that takes raw tool output and maps it to the schema, and give it a few examples of the required input and output formats.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;This is the same schema used in the JS version: the contract does not change, only the tools that populate it do

&lt;ul&gt;
&lt;li&gt;Tools change, but the schema stays stable and consistent across languages. This is the key to making the rest of the pipeline tool-agnostic.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ArchUnit/no-direct-repo-in-controllers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/main/java/com/app/controller/UserController.java"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Controllers should not import repositories directly. Use a service."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fix_suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Replace UserRepository injection with UserService. Controllers should only depend on the service layer."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"checkstyle/MethodLength"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/main/java/com/app/service/OrderService.java"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Method length is 72 lines (max 50)."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fix_suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Extract the validation logic into a private helper method to reduce method length."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two severity levels, two behaviors:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;severity: "error"&lt;/code&gt; → fails validation, triggers the auto-fix retry loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;severity: "warning"&lt;/code&gt; → logged and visible but does not fail the phase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A note on fix_suggestion for Java:&lt;/strong&gt; ESLint can generate suggestions natively. Java tools cannot. Instead, maintain a small rule registry: a lookup map of rule name → suggestion string that the normalizer uses when building the output. It takes some upfront effort, but it pays off on every retry cycle.&lt;/p&gt;
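&lt;p&gt;A minimal Python sketch of such a rule registry, assuming each raw finding has already been parsed into a dict. The input keys (&lt;code&gt;check&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;), the function name &lt;code&gt;normalize&lt;/code&gt;, and the fallback suggestion are hypothetical and not part of any tool's real output format; only the output fields come from the schema shown earlier.&lt;/p&gt;

```python
# Hypothetical rule registry: rule name mapped to a suggestion string.
RULE_REGISTRY = {
    "checkstyle/MethodLength": (
        "Extract the validation logic into a private helper method "
        "to reduce method length."
    ),
    "ArchUnit/no-direct-repo-in-controllers": (
        "Replace the repository injection with a service. "
        "Controllers should only depend on the service layer."
    ),
}

# Assumed fallback for rules that have no registry entry yet.
DEFAULT_SUGGESTION = "No suggestion registered for this rule; follow the message."


def normalize(tool, raw):
    """Map one parsed finding (hypothetical input keys) to the shared schema."""
    rule = f"{tool}/{raw['check']}"
    return {
        "rule": rule,
        "file": raw["path"],
        "line": raw.get("line"),
        "column": raw.get("column"),
        # Assumption: a finding without a severity is treated as an error.
        "severity": raw.get("severity", "error"),
        "message": raw["message"],
        "fix_suggestion": RULE_REGISTRY.get(rule, DEFAULT_SUGGESTION),
    }
```

&lt;p&gt;Keeping the registry as plain data means a new rule only needs one new entry, and the normalizer itself does not change.&lt;/p&gt;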




&lt;h2&gt;
  
  
  The Auto-Fix Retry Loop
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;When violations are found, the agent does not stop: it feeds the structured violations back to the LLM to fix, then re-validates&lt;/li&gt;
&lt;li&gt;Hard cap at 3 retries before escalating to human intervention&lt;/li&gt;
&lt;li&gt;Two failure modes to guard against: a stuck loop where the violation count stops dropping, and fixes that introduce new violations
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build Output
     ↓
Run Validation Tools
     ↓
Normalize all output → JSON violations array
     ↓
violations.length &amp;gt; 0?
  YES → Feed violations to fix agent → Re-validate (max 3 attempts)
  NO  → Phase complete ✅
     ↓
Still failing after 3 retries? → Halt, surface to human 🛑
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
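&lt;p&gt;The flow above could be sketched in Python roughly as follows. The &lt;code&gt;validate&lt;/code&gt; and &lt;code&gt;fix&lt;/code&gt; callables are hypothetical stand-ins for the real tool run and the LLM fix call, and the returned dict shape is an assumption. Only the retry cap of 3 and the rule that errors (not warnings) fail the phase come from the text.&lt;/p&gt;

```python
MAX_RETRIES = 3  # hard cap before escalating to a human


def run_validation_phase(validate, fix, max_retries=MAX_RETRIES):
    """Run validate(), then up to max_retries rounds of fix plus re-validate.

    validate() returns a list of violation dicts in the shared schema.
    fix(errors) asks the agent to repair the given error-level violations.
    Both callables are hypothetical placeholders.
    """
    violations = validate()
    errors = []
    for retry in range(max_retries + 1):
        # Only error-level violations fail the phase; warnings are logged.
        errors = [v for v in violations if v["severity"] == "error"]
        if not errors:
            return {"status": "clean", "retries": retry, "remaining": violations}
        if retry == max_retries:
            break
        fix(errors)
        violations = validate()
    # Still failing after the cap: halt and surface to a human.
    return {"status": "needs_human", "retries": max_retries, "errors": errors}
```

&lt;p&gt;Passing the tool run and the fix step in as callables keeps the loop itself tool-agnostic, which matches the idea that only the tools populating the schema change.&lt;/p&gt;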



&lt;p&gt;&lt;strong&gt;Violation diffing between retries:&lt;/strong&gt; track the count before and after each fix attempt. If the count is not going down, the agent is stuck. Escalate instead of looping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression detection:&lt;/strong&gt; if new violations appear that were not present before a fix attempt, the fix introduced a regression. Treat this as a separate signal and re-run from the last clean state rather than continuing forward.&lt;/p&gt;
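&lt;p&gt;Both guards can be sketched as one comparison between two validation passes. This is an illustration under assumptions: the function names and the returned signal strings are hypothetical, and identifying a violation by rule plus file (ignoring line numbers, which shift when code is edited) is my simplification rather than something the pipeline prescribes.&lt;/p&gt;

```python
def violation_key(violation):
    # Assumption: rule plus file identifies a violation. Line numbers are
    # left out because they move whenever the fix agent edits the file.
    return (violation["rule"], violation["file"])


def compare_attempts(before, after):
    """Classify a fix attempt from the violations seen before and after it."""
    before_keys = {violation_key(v) for v in before}
    after_keys = {violation_key(v) for v in after}

    # Anything present now that was absent before was introduced by the fix.
    introduced = sorted(after_keys - before_keys)
    if introduced:
        return {"signal": "regression", "introduced": introduced}

    # The count did not go down, so the agent is stuck: escalate.
    if len(after) >= len(before):
        return {"signal": "stuck", "introduced": []}

    return {"signal": "progress", "introduced": []}
```

&lt;p&gt;One limitation of this key: a second violation of the same rule in the same file does not register as new, so a stricter key may be worth the extra noise.&lt;/p&gt;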

&lt;p&gt;&lt;strong&gt;What this looks like in your pipeline output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 3: Validation
======================================================================
SUCCESS
   Critical: 0, Warnings: 2, Attempts: 2

   Found 4 violations on first pass
   Auto-fixed: controller importing repository directly, method too long,
               missing @Override annotation, unused import
   Re-validated: CLEAN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What About SonarQube?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You might already have SonarQube running in CI. Does it belong here too?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Short answer: no, not inside the validation agent&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mvn sonar:sonar&lt;/code&gt; is slow and expensive, which makes it a bad fit for a loop that may run 3 times&lt;/li&gt;
&lt;li&gt;This agent runs &lt;strong&gt;before a push&lt;/strong&gt;, so there is no Sonar result to even poll yet&lt;/li&gt;
&lt;li&gt;SonarQube's natural home is post-push in CI, as a final safety net before merge&lt;/li&gt;
&lt;li&gt;Instead, run &lt;strong&gt;SonarLint in connected mode&lt;/strong&gt; locally in your IDE: same quality profile as your server, zero pipeline cost&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The principle: order checks by cost. Cheap and fast first, expensive later (or delegate to CI entirely). This is why compile runs before Checkstyle, and Checkstyle before ArchUnit.&lt;/p&gt;
&lt;/blockquote&gt;
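&lt;p&gt;A small Python sketch of that ordering, assuming the checks are Maven invocations and that the run stops at the first failing check (the stop-early behavior is my assumption, not something stated above). The &lt;code&gt;CHECKS&lt;/code&gt; list and the function name are hypothetical; adjust the commands to the goals your build actually defines.&lt;/p&gt;

```python
import subprocess

# Hypothetical check list, ordered cheapest first.
CHECKS = [
    ("compile", ["mvn", "-q", "compile"]),
    ("checkstyle", ["mvn", "-q", "checkstyle:check"]),
    ("architecture", ["mvn", "test", "-Dgroups=architecture"]),
]


def run_in_cost_order(checks, run=subprocess.run):
    """Run checks in the given order and stop at the first failure.

    The run parameter exists so a fake runner can be injected in tests.
    """
    for name, command in checks:
        result = run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # A cheap check failed, so the expensive ones are skipped.
            return {"failed": name, "output": result.stdout + result.stderr}
    return {"failed": None, "output": ""}
```

&lt;p&gt;Stopping early means a compile error never pays for an ArchUnit run, which is where most of the time savings would come from.&lt;/p&gt;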




&lt;h2&gt;
  
  
  Before and After: What Your Test Agent Receives
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without a validation agent&lt;/strong&gt; the test agent receives raw AI output that may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A controller calling a repository directly (arch violation)&lt;/li&gt;
&lt;li&gt;A method that is 90 lines long (PMD)&lt;/li&gt;
&lt;li&gt;Unused imports (Checkstyle)&lt;/li&gt;
&lt;li&gt;A missing null check (SpotBugs)&lt;/li&gt;
&lt;li&gt;The test agent now has to fight bad structure AND broken tests simultaneously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With a validation agent&lt;/strong&gt; the test agent receives code that has already passed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compile check&lt;/li&gt;
&lt;li&gt;Style and formatting rules&lt;/li&gt;
&lt;li&gt;Smell and complexity thresholds&lt;/li&gt;
&lt;li&gt;Bug pattern analysis&lt;/li&gt;
&lt;li&gt;Your architectural boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test agent works with clean, well-structured code every time. That is why your test agent rarely needs all 3 of its own retries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;p&gt;The validate phase is not about distrusting AI. It is about applying the same rigor to AI-generated code that you apply to any code: the same linters, the same architectural rules, the same standards your team already agreed on. The difference is that it runs automatically, fixes itself, and only escalates to you when it genuinely cannot resolve the issue.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The code that reaches your Test Agent has already been through a compile check, style validation, smell detection, bug pattern analysis, and your architecture rules. You are not reviewing raw AI output. You are reviewing code that has already been through the gauntlet.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I plan to write about the Review Agent next: why it is needed, and how we can apply a bit more automation when issues are found during that phase.&lt;/li&gt;
&lt;li&gt;The Review Agent: did the AI actually build what the spec asked for?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>codequality</category>
    </item>
    <item>
      <title>How I Automate Parts of My Software Development Lifecycle with AI Agents</title>
      <dc:creator>Travis Martin</dc:creator>
      <pubDate>Wed, 21 Jan 2026 14:56:06 +0000</pubDate>
      <link>https://dev.to/rickjms/how-i-automate-parts-of-my-software-development-lifecycle-with-ai-agents-43h7</link>
      <guid>https://dev.to/rickjms/how-i-automate-parts-of-my-software-development-lifecycle-with-ai-agents-43h7</guid>
      <description>&lt;p&gt;Every developer knows the drill: You get a feature request. You create a branch. You write a plan (maybe). You implement. You write tests. You review. You document. Rinse and repeat. What if an AI could handle the tedious parts while you focus on the interesting problems? That's exactly what I built; I call it AI Developer Workflows (ADW). In this post, I'll show you how I automated the complete software development lifecycle using AI agents, and how you can do the same. I have templates for TypeScript, Go, and Java, but the workflow can easily be adjusted to other languages: you just need to update the prompt commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Day-to-Day Development Workflow
&lt;/h2&gt;

&lt;p&gt;Most of my day wasn't spent solving interesting problems. It was spent on ceremony. Can I take my workflow and get AI to automate parts or all of it? Here is my day before my AI workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read through all the Jira tasks, find the one I like the most, and assign it to myself, hoping the details exist and it's NOT just a one-liner that the PM created.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planning&lt;/strong&gt;: Read a lot of code and decide HOW my feature will fit into the code base.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation Plan&lt;/strong&gt;: Once I have all relevant files or identified the areas of change, I create a document to keep track of all the changes that are needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Start coding!&lt;/li&gt;
&lt;li&gt;Oh yeah, tests: I then write my tests. I know I probably should follow TDD or some framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review&lt;/strong&gt;: Now that everything is done, let's compare the feature with the actual Jira ticket. Did we build the correct thing? Did we miss anything? Hopefully not, and also pray for NO scope creep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;? Lol&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Solution: AI Developer Workflow&amp;nbsp;(ADW)
&lt;/h2&gt;

&lt;p&gt;ADW is a framework that orchestrates AI agents through a complete SDLC pipeline. The idea is to have each agent do one task extremely well (the single responsibility principle, SRP) rather than trying to get a single agent to perform multiple tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;┌───────────────────────────────────────────────────────────────────────┐
│                         ADW Pipeline                                  │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  Prompt ──► Plan ──► Build ──► Validate ──► Test ──► Review ──► Doc   │
│               │         │          │          │         │        │    │
│               ▼         ▼          ▼          ▼         ▼        ▼    │
│            Spec      Code      Quality     Fixes    Issues    Docs    │
│            File     Changes    Enforced   Applied   Fixed   Created   │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Plan&amp;nbsp;Phase
&lt;/h3&gt;

&lt;p&gt;For larger codebases this can be broken into two parts, &lt;strong&gt;Research&lt;/strong&gt; and &lt;strong&gt;Planning&lt;/strong&gt;: the &lt;strong&gt;research agent&lt;/strong&gt; does a deep analysis of which files are relevant before passing the result to the &lt;strong&gt;planning agent&lt;/strong&gt;, which keeps tighter control of the context window. This is explained in a &lt;a href="https://youtu.be/eIoohUmYpGI?si=7JUODxAs0FqnVqmq&amp;amp;t=665" rel="noopener noreferrer"&gt;presentation&lt;/a&gt;: "I shipped code I don't understand and I bet you have too" by Jake Nations.&lt;/p&gt;

&lt;p&gt;Next you provide a prompt like "&lt;strong&gt;Add user authentication with JWT tokens.&lt;/strong&gt;" The planning agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Researches your codebase&lt;/li&gt;
&lt;li&gt;Identifies relevant files and patterns (Important)&lt;/li&gt;
&lt;li&gt;Creates a detailed implementation spec&lt;/li&gt;
&lt;li&gt;Outputs a structured plan file&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Build&amp;nbsp;Phase
&lt;/h3&gt;

&lt;p&gt;The builder agent reads the spec and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implements the feature following your codebase patterns&lt;/li&gt;
&lt;li&gt;Creates necessary files and modifications&lt;/li&gt;
&lt;li&gt;Follows existing conventions automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Validate Phase (The "AI Writes Bad Code"&amp;nbsp;Killer)
&lt;/h3&gt;

&lt;p&gt;This is the phase that addresses the elephant in the room: "&lt;strong&gt;But AI-generated code is garbage!&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;The validation agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs linters, static analysis, and architectural rules&lt;/li&gt;
&lt;li&gt;Catches anti-patterns, code smells, and style violations&lt;/li&gt;
&lt;li&gt;Automatically fixes violations and retries&lt;/li&gt;
&lt;li&gt;Enforces YOUR coding standards, not generic ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't just &lt;code&gt;go fmt&lt;/code&gt;. It's running tools like &lt;code&gt;golangci-lint&lt;/code&gt; with your custom ruleset, checking for security issues, verifying architectural boundaries, and ensuring the AI-generated code follows the same standards as code written by us humans. For non-Go developers, there are tools like &lt;strong&gt;ArchUnit&lt;/strong&gt; (Java) and &lt;strong&gt;ArchUnitTS&lt;/strong&gt; (TypeScript) to enforce architecture rules.&lt;/p&gt;

&lt;p&gt;If the AI writes code that violates your standards? The validation agent fixes it automatically, then re-validates. Up to N retries until it's clean.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Test&amp;nbsp;Phase
&lt;/h3&gt;

&lt;p&gt;I would say this is one of the more important agents: if you get this one right, you shouldn't hit any regressions (mostly). What I like to do here is set up both unit tests and integration tests, then ensure that the slash command knows how to execute them both. This way, if we break anything, this agent will find the issues and fix them. NOTE: we also need to explain HOW the AI can troubleshoot issues. We need good logging (any production app should have great logging), and we need to tell the AI how to search these logs if it does encounter issues. Do this well and you will save A LOT of tokens and time. In my apps I always add centralized logging and explain to the AI how to search these logs effectively.&lt;/p&gt;

&lt;p&gt;The test agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs your test suite&lt;/li&gt;
&lt;li&gt;If tests fail, analyzes the failures&lt;/li&gt;
&lt;li&gt;Attempts to fix issues automatically&lt;/li&gt;
&lt;li&gt;Retries up to N times (configurable)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Review&amp;nbsp;Phase
&lt;/h3&gt;

&lt;p&gt;The review agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compares implementation against the original spec&lt;/li&gt;
&lt;li&gt;Identifies gaps, bugs, or missing requirements&lt;/li&gt;
&lt;li&gt;Categorizes issues by severity (blocker, tech debt, skippable)&lt;/li&gt;
&lt;li&gt;Creates a review report (we have an agent, travis_patch.py, to resolve these issues if there are blockers)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Document&amp;nbsp;Phase
&lt;/h3&gt;

&lt;p&gt;The documentation agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates user-facing documentation&lt;/li&gt;
&lt;li&gt;Updates relevant README sections&lt;/li&gt;
&lt;li&gt;Creates API documentation if applicable&lt;/li&gt;
&lt;li&gt;Updates our conditional_docs.md, which lets us conditionally load documentation when we use &lt;code&gt;/feature&lt;/code&gt;, so the AI can dynamically load documents when the new feature requires them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What This Looks Like in&amp;nbsp;Practice
&lt;/h3&gt;

&lt;p&gt;Here's the magic. One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;~ uv run travis_sdlc.py "Add rate limiting to the API endpoints"

======================================================================
  Travis SDLC Workflow
&lt;span class="gu"&gt;  ADW ID: a1b2c3d4
======================================================================
&lt;/span&gt;
&lt;span class="gu"&gt;Phase 1: Planning
======================================================================
&lt;/span&gt;✅ SUCCESS
   File: specs/feature-a1b2c3d4-api-rate-limiting.md

&lt;span class="gu"&gt;Phase 2: Implementation
======================================================================
&lt;/span&gt;✅ SUCCESS

&lt;span class="gu"&gt;Phase 3: Validation
======================================================================
&lt;/span&gt;✅ SUCCESS
   Critical: 0, Warnings: 2, Attempts: 2

   ↳ Found 3 violations on first pass
   ↳ Auto-fixed: unused variable, missing error check, import order
   ↳ Re-validated: CLEAN

&lt;span class="gu"&gt;Phase 4: Testing
======================================================================
&lt;/span&gt;✅ SUCCESS
   Passed: 47, Failed: 0, Attempts: 1

&lt;span class="gu"&gt;Phase 5: Review
======================================================================
&lt;/span&gt;✅ SUCCESS
   Issues: 0

&lt;span class="gu"&gt;Phase 6: Documentation
======================================================================
&lt;/span&gt;✅ SUCCESS
   Path: app_docs/feature-a1b2c3d4-api-rate-limiting.md

======================================================================
&lt;span class="gu"&gt;  ✅ WORKFLOW COMPLETED SUCCESSFULLY
======================================================================
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;strong&gt;Phase 3&lt;/strong&gt;: The validation found 3 violations, &lt;strong&gt;automatically fixed&lt;/strong&gt; them, and re-validated. Because AI does NOT always write the best code, we need to build checks into our agents that enforce coding standards.&lt;/p&gt;

&lt;p&gt;From a single prompt to a fully implemented, tested, reviewed, and documented feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skeptic's Corner: Addressing the Hard Questions
&lt;/h2&gt;

&lt;p&gt;I have been working with AI for a few years now, and one of the most common pushbacks is about the quality and quantity of the code it outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI-generated code is unmaintainable garbage.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;You are correct!&lt;/strong&gt; Unfiltered AI output often has issues: unused variables, missing error handling, duplicate methods, attempts to rebuild a class we already have, etc. Here is the thing: human developers write code with issues too. This is why we have linters, code reviews, and CI pipelines. The ADW approach applies the same rigor to AI-generated code through the &lt;strong&gt;Validate Phase&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    Validation Phase Loop                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Build Output ──► Validate ──► Violations? ──►  Auto-Fix ──┐   │
│                        ▲                                    │   │
│                        └────────────── Re-validate ◄────────┘   │
│                                                                 │
│   Max 3 retries, then human intervention required               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The validation phase runs &lt;code&gt;golangci-lint&lt;/code&gt;, &lt;code&gt;eslint&lt;/code&gt;, security scanners, and &lt;strong&gt;your custom architectural rules&lt;/strong&gt;. If violations are found, the agent fixes them automatically and revalidates. The AI doesn't write perfect code. But the &lt;strong&gt;system catches and corrects mistakes before they reach you.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AI just creates tech debt that I'll have to clean up&amp;nbsp;later.
&lt;/h3&gt;

&lt;p&gt;This is a valid concern. AI can take shortcuts, copy-paste patterns inappropriately, or ignore edge cases. The ADW addresses this with the &lt;strong&gt;Review Phase&lt;/strong&gt;. Below is a recent log from a run that found a few issues in my application. This phase can be customized to FAIL when a condition is met; I currently have it set to fail only on blockers. I get the agent to create a detailed plan for fixing these issues, and the file is saved here: specs/review_issues/review-4cb749dc.md. If I want the AI to fix them, I just run the command &lt;code&gt;uv run .awd/travis/travis_patch.py 4cb749dc&lt;/code&gt; (the argument is the ID of this job, and it will pick everything up on its own).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;======================================================================
&lt;span class="gu"&gt;  Phase 5: Review
======================================================================
&lt;/span&gt;
ADW Logger initialized - ID: 4cb749dc
Travis Review starting - ADW ID: 4cb749dc
Reviewing implementation against spec: specs/feature-a1b2c3d4-nuclei-vulnerability-scanning.md

Review Summary:
  Status: PASSED
  Tests: PASSED
  Build: PASSED
  Summary: The Nuclei vulnerability scanning feature has been successfully implemented with all core functionality working as specified. The implementation includes proper CLI commands (vuln scan, vuln list), database persistence with migration, Nuclei tool integration, and comprehensive test coverage. All tests pass and the build succeeds. Minor issues exist around missing config validation, incomplete findings integration, and lack of repository tests, but none are blocking the release of this feature.

  Issues Found: 7

  Issue #1:
    Severity: tech_debt
    File: internal/cli/vuln_scan.go:541
    Description: The --custom flag uses BuildCustomTemplateArgs but the implementation uses -templates flag which differs from Nuclei's actual -t flag for custom templates. This may cause issues when users try to specify custom template paths.
    Resolution: Update BuildCustomTemplateArgs in pkg/recon/vulnscan/nuclei.go to use -t flag instead of -templates flag for custom template paths, matching Nuclei's actual CLI interface.

  Issue #2:
    Severity: skippable
    File: configs/default.yaml:181
    Description: The config includes 'severity' and 'exclude_templates' fields, but these are not validated or used anywhere in the codebase. The CLI always requires explicit --severity or --templates flags.
    Resolution: Either implement support for reading default severity and exclude_templates from config file in the scan command, or remove these unused fields from the config schema to avoid user confusion.

  Issue #3:
    Severity: tech_debt
    File: specs/feature-a1b2c3d4-nuclei-vulnerability-scanning.md:209-212
    Description: The spec mentions 'Auto-create findings from critical/high severity results' and 'Link vuln_scans to findings table' as acceptance criteria, but this integration is not implemented. The code only stores in vuln_scans table without creating finding entries.
    Resolution: Add integration with the findings table to auto-create findings for critical/high severity vulnerabilities. This can be done by calling the findings repository after saving vuln scans with high severity.

  Issue #4:
    Severity: tech_debt
    File: internal/repository/vuln_scan.go
    Description: No tests exist for the VulnScanRepository despite the repository having complex JSON serialization logic for references and extracted_results. This creates risk of bugs in database operations.
    Resolution: Add unit tests for VulnScanRepository covering Create, GetByID, GetBySeverity, GetByHost, GetByCVE, Exists, and JSON serialization/deserialization of array fields.

  Issue #5:
    Severity: skippable
    File: .gitignore:9
    Description: The bbrecon binary was removed from .gitignore, which will cause the compiled binary to be tracked by git. This is generally not desired for build artifacts.
    Resolution: Add 'bbrecon' back to .gitignore to prevent the binary from being committed to the repository.

  Issue #6:
    Severity: tech_debt
    File: pkg/recon/vulnscan/nuclei.go:1762-1768
    Description: BuildTemplateArgs uses -tags flag for template categories, but this may not match Nuclei's expected behavior. Nuclei template categories like 'cves' typically use the -t flag with path like '-t cves/' not '-tags cves'.
    Resolution: Verify the correct Nuclei flag for template categories and update BuildTemplateArgs to use -t flag with category paths (e.g., '-t cves/') instead of -tags. Add integration test with actual Nuclei to verify.

  Issue #7:
    Severity: skippable
    File: internal/cli/vuln_scan.go:702
    Description: The getTargetsFromDB function only builds HTTPS URLs but some services may only be accessible via HTTP. This could miss vulnerabilities on HTTP-only services.
    Resolution: Update getTargetsFromDB to check the subdomain's HTTPScheme or URL field if available, or build both HTTP and HTTPS URLs based on the actual probed protocol from the web recon phase.
Review issues written to specs/review_issues/review-4cb749dc.md

Review issues file created: specs/review_issues/review-4cb749dc.md
Review phase completed successfully

Phase 5: Review: ✅ SUCCESS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The review agent specifically looks for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spec compliance: Did we actually build what was planned?&lt;/li&gt;
&lt;li&gt;Tech debt indicators: Shortcuts, TODOs, incomplete error handling&lt;/li&gt;
&lt;li&gt;Missing edge cases: What happens when X fails?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Issues are categorized by severity. &lt;strong&gt;Blockers&lt;/strong&gt; stop the workflow. &lt;strong&gt;Tech debt / skippable&lt;/strong&gt; is documented but doesn't block. &lt;strong&gt;You decide&lt;/strong&gt; what bar to set.&lt;/p&gt;
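&lt;p&gt;As a rough illustration of setting that bar, here is a Python sketch of a severity gate. The severity names match the review log above, but the numeric ranking, the function name &lt;code&gt;review_passes&lt;/code&gt;, and the &lt;code&gt;fail_at&lt;/code&gt; parameter are hypothetical and are not taken from the actual ADW code.&lt;/p&gt;

```python
# Assumed ordering, least to most severe.
SEVERITY_RANK = {"skippable": 0, "tech_debt": 1, "blocker": 2}


def review_passes(issues, fail_at="blocker"):
    """Return True when no issue reaches the fail_at severity."""
    threshold = SEVERITY_RANK[fail_at]
    return all(threshold > SEVERITY_RANK[issue["severity"]] for issue in issues)


# The seven issues from the log above: four tech_debt, three skippable.
issues = [
    {"severity": "tech_debt"}, {"severity": "skippable"},
    {"severity": "tech_debt"}, {"severity": "tech_debt"},
    {"severity": "skippable"}, {"severity": "tech_debt"},
    {"severity": "skippable"},
]
print(review_passes(issues))                       # fails only on blockers
print(review_passes(issues, fail_at="tech_debt"))  # stricter bar
```

&lt;p&gt;With the default bar the run passes, as it did in the log; lowering &lt;code&gt;fail_at&lt;/code&gt; to &lt;code&gt;tech_debt&lt;/code&gt; would make the same review fail.&lt;/p&gt;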

&lt;h3&gt;
  
  
  AI doesn't understand MY codebase. It'll write code that doesn't&amp;nbsp;fit.
&lt;/h3&gt;

&lt;p&gt;This is why the &lt;strong&gt;Planning Phase&lt;/strong&gt; exists. Before writing any code, the planning agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads your &lt;code&gt;README.md&lt;/code&gt; and &lt;code&gt;DESIGN.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Searches for similar coding patterns in your codebase.&lt;/li&gt;
&lt;li&gt;Identifies the files it needs to modify.&lt;/li&gt;
&lt;li&gt;Creates a spec that follows YOUR conventions and does NOT make up patterns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AI analyzes your existing patterns before generating any code. The commands are built to extend what already exists, and only create new components when nothing relevant is found.&lt;/p&gt;

&lt;p&gt;I have not tested this, but if the &lt;strong&gt;codebase is large&lt;/strong&gt; you can break this phase into two parts: 1. a &lt;strong&gt;Research agent&lt;/strong&gt;, and 2. a &lt;strong&gt;Planning agent&lt;/strong&gt; that uses the research agent's results. The &lt;strong&gt;Research Agent&lt;/strong&gt;'s main job is to find what is relevant to this feature within the large codebase and pass it to the planner. This way we are NOT wasting the planner's context trying to find everything needed for feature 'X'.&lt;/p&gt;

&lt;h3&gt;
  
  
  You still have to review everything anyway. What's the&amp;nbsp;point?
&lt;/h3&gt;

&lt;p&gt;Yes, we should always review everything. However, here is the difference between the two:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without ADW:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review raw AI output&lt;/li&gt;
&lt;li&gt;Find the 12 linting errors&lt;/li&gt;
&lt;li&gt;Notice missing error handling&lt;/li&gt;
&lt;li&gt;Realize it didn't follow your patterns&lt;/li&gt;
&lt;li&gt;Send it back, wait, review again&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With ADW:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validation already caught the linting errors&lt;/li&gt;
&lt;li&gt;Tests already verified basic functionality&lt;/li&gt;
&lt;li&gt;Review already flagged tech debt&lt;/li&gt;
&lt;li&gt;I'm reviewing &lt;strong&gt;polished&lt;/strong&gt; code, not first drafts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code that reaches me has already &lt;strong&gt;passed&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linter (validation phase)&lt;/li&gt;
&lt;li&gt;Static analysis (validation phase)&lt;/li&gt;
&lt;li&gt;Unit tests (test phase)&lt;/li&gt;
&lt;li&gt;Spec compliance (review phase)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My review is the final check, not the first line of defense.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom&amp;nbsp;Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ADW doesn't trust the AI&lt;/strong&gt; output. It verifies, validates, tests, and reviews automatically. AI is fast but imperfect, so the system is designed to &lt;strong&gt;catch imperfections programmatically&lt;/strong&gt; before they reach you. You still need to review the code, but what you review has already passed multiple quality gates; it is not raw AI output. ADW is a good start for your feature: it doesn't always one-shot it, but it will get you 80–90% of the way there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Now that we have addressed some of the skepticism, let's look at the architecture. ADW consists of three layers, and only one of them is language-specific.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    ADW Architecture                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Layer 3: Orchestrator (Python)     LANGUAGE-AGNOSTIC   │    │
│  │  travis_sdlc.py - chains phases, manages state          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                            │                                    │
│                            ▼                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Layer 2: Slash Commands (Markdown)  LANGUAGE-SPECIFIC  │    │
│  │  /test, /validate, /review - customize per language     │    │
│  └─────────────────────────────────────────────────────────┘    │
│                            │                                    │
│                            ▼                                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Layer 1: Agent Module (Python)     LANGUAGE-AGNOSTIC   │    │
│  │  Claude Code execution, retry logic, state management   │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 1: Core Agent Module (Language-Agnostic)
&lt;/h3&gt;

&lt;p&gt;The foundation that handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code CLI execution&lt;/li&gt;
&lt;li&gt;Retry logic for transient failures&lt;/li&gt;
&lt;li&gt;Output parsing (JSONL → JSON)&lt;/li&gt;
&lt;li&gt;State management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This never changes regardless of what language your project uses.&lt;/p&gt;
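
&lt;p&gt;To make two of those responsibilities concrete, here is a minimal sketch of JSONL parsing and retrying. It is not the actual ADW code: the names &lt;code&gt;parse_jsonl&lt;/code&gt; and &lt;code&gt;run_with_retry&lt;/code&gt; are hypothetical, and treating &lt;code&gt;RuntimeError&lt;/code&gt; as the "transient" failure is an assumption made purely for illustration.&lt;/p&gt;

```python
import json
import time


def parse_jsonl(raw):
    """Turn JSONL text (one JSON object per line) into a list of dicts."""
    messages = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        messages.append(json.loads(line))
    return messages


def run_with_retry(call, max_attempts=3, delay_seconds=0.0):
    """Call `call()` until it succeeds or the attempts run out.

    Only RuntimeError counts as transient here; that is an assumption
    for this sketch, not how ADW classifies failures.
    """
    attempts = max(1, max_attempts)  # always try at least once
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except RuntimeError as error:
            last_error = error
            if attempt != attempts:
                time.sleep(delay_seconds)  # back off before the next try
    raise last_error
```

&lt;p&gt;Keeping both helpers free of any project-specific knowledge is what lets a layer like this stay language-agnostic.&lt;/p&gt;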

&lt;h3&gt;
  
  
  Layer 2: Slash Commands (Language-Specific)
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. Slash commands are markdown templates that define &lt;strong&gt;HOW&lt;/strong&gt; each phase executes for YOUR language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;.claude/commands/
├── test.md        # "Run go test ./..." or "npm test" or "mvn test"
├── validate.md    # "Run golangci-lint" or "eslint" or "checkstyle"
├── feature.md     # Plan format with language-specific patterns
├── implement.md   # Implementation instructions
├── review.md      # Review criteria
└── document.md    # Documentation format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To support a new language, you only customize these files. For example, here's how &lt;code&gt;/test&lt;/code&gt; differs by language. To check that a command works in Claude Code, open the Claude terminal and type &lt;code&gt;/test&lt;/code&gt;. If your tests run correctly, that is how the agent will execute the command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**Go:**&lt;/span&gt;
&lt;span class="gu"&gt;## Test Execution&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Command: &lt;span class="sb"&gt;`go test ./... -v -race -coverprofile=coverage.out`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; test_name: "go_test"

&lt;span class="gs"&gt;**TypeScript:**&lt;/span&gt;
&lt;span class="gu"&gt;## Test Execution&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Command: &lt;span class="sb"&gt;`npm test -- --coverage --watchAll=false`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; test_name: "jest_test"

&lt;span class="gs"&gt;**Java:**&lt;/span&gt;
&lt;span class="gu"&gt;## Test Execution&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Command: &lt;span class="sb"&gt;`mvn test -B`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; test_name: "maven_test"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same orchestrator. Same workflow. Different language-specific commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Orchestrator (Language-Agnostic)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;travis_sdlc.py&lt;/code&gt; script that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chains phases together&lt;/li&gt;
&lt;li&gt;Manages state between phases&lt;/li&gt;
&lt;li&gt;Handles failures and retries&lt;/li&gt;
&lt;li&gt;Provides observability and logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is pure Python and doesn't know or care what language your project uses. It just calls the slash commands and processes the results. Note: each of these files can be run independently; they are meant to be isolated function calls to the Claude SDK. I learned this from IndyDevDan.&lt;/p&gt;
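
&lt;p&gt;As a rough illustration of "chains phases, manages state", the sketch below threads a shared state dict through a list of phases and stops at the first failure. It is not the real &lt;code&gt;travis_sdlc.py&lt;/code&gt;: &lt;code&gt;run_pipeline&lt;/code&gt;, the state keys, and the stand-in phase functions are all hypothetical.&lt;/p&gt;

```python
def run_pipeline(phases, state, skip=()):
    """Run phases in order, threading a shared state dict through them.

    `phases` is a list of (name, function) pairs. Each function takes
    the state dict and returns True on success. The pipeline stops at
    the first failing phase and records which one it was.
    """
    for name, phase in phases:
        if name in skip:
            state["log"].append(name + ": skipped")
            continue
        ok = phase(state)
        state["log"].append(name + (": ok" if ok else ": failed"))
        if not ok:
            state["failed_phase"] = name
            return False
    return True


if __name__ == "__main__":
    # Stand-in phases; in a real run each one would invoke a slash command.
    demo_phases = [
        ("plan", lambda state: True),
        ("implement", lambda state: True),
        ("validate", lambda state: True),
        ("test", lambda state: True),
        ("review", lambda state: True),
        ("document", lambda state: True),
    ]
    demo_state = {"log": []}
    run_pipeline(demo_phases, demo_state, skip=("review", "document"))
    print(demo_state["log"])
```

&lt;p&gt;Because the loop only sees names and callables, nothing in it depends on the project's language, and skipping optional phases is just a membership check.&lt;/p&gt;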

&lt;h3&gt;
  
  
  Why This&amp;nbsp;Matters
&lt;/h3&gt;

&lt;p&gt;This architecture means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One orchestrator to maintain: the Python ADW code works for any language&lt;/li&gt;
&lt;li&gt;Easy to add new languages: just write new slash commands&lt;/li&gt;
&lt;li&gt;Shareable workflow logic: test/retry/review logic is universal&lt;/li&gt;
&lt;li&gt;Customizable per project: each repo can have its own command variations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Want to use ADW on a Rust project? Write &lt;code&gt;test.md&lt;/code&gt; with &lt;code&gt;cargo test&lt;/code&gt;, &lt;code&gt;validate.md&lt;/code&gt; with &lt;code&gt;cargo clippy&lt;/code&gt;, and you're done. The orchestrator handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code CLI installed and authenticated&lt;/li&gt;
&lt;li&gt;Python 3.11+ with the &lt;code&gt;uv&lt;/code&gt; package manager (for example, &lt;code&gt;brew install uv&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Your codebase with a README.md and basic structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add a&amp;nbsp;.env file with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;claude &lt;span class="c"&gt;# Path to the Claude Code executable; "claude" should work as the default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
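
&lt;p&gt;If you want to sanity-check that setting before a run, something like the sketch below would do. It assumes the variable has already been loaded into the process environment, and &lt;code&gt;resolve_claude_path&lt;/code&gt; is a hypothetical helper, not part of ADW.&lt;/p&gt;

```python
import os
import shutil


def resolve_claude_path(env=None):
    """Return the full path to the Claude Code executable, or None.

    Reads CLAUDE_CODE_PATH and falls back to "claude", matching the
    .env default shown above.
    """
    env = os.environ if env is None else env
    configured = env.get("CLAUDE_CODE_PATH", "claude")
    # shutil.which searches PATH and returns None when nothing is found.
    return shutil.which(configured)


if __name__ == "__main__":
    path = resolve_claude_path()
    if path is None:
        print("Claude Code not found; check CLAUDE_CODE_PATH")
    else:
        print("Using Claude Code at " + path)
```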



&lt;p&gt;Once the prerequisites are in place, you need to edit the slash commands to match your repo. A big-brain move is to get Claude Code to do it for you with a prompt like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Please REVIEW the slash commands located in @.claude/commands/&lt;span class="err"&gt;*&lt;/span&gt; 
there are some language specific files like: 
&lt;span class="p"&gt;-&lt;/span&gt; @.claude/commands/test.md 
&lt;span class="p"&gt;-&lt;/span&gt; @.claude/commands/validate.md
...

We need to update them to MATCH our system in this repo. Ensure all commands 
correctly match our system.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example Commands
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Simple feature&lt;/span&gt;
uv run travis_sdlc.py "Add a health check endpoint"

&lt;span class="gh"&gt;# Bug fix&lt;/span&gt;
uv run travis_sdlc.py "Fix the memory leak in the cache module" --plan-type bug

&lt;span class="gh"&gt;# Chore/refactor&lt;/span&gt;
uv run travis_sdlc.py "Refactor the logging to use structured output" --plan-type chore

&lt;span class="gh"&gt;# Use a more powerful model for complex tasks&lt;/span&gt;
uv run travis_sdlc.py "Implement OAuth2" --model opus

&lt;span class="gh"&gt;# Skip optional phases&lt;/span&gt;
uv run travis_sdlc.py "Quick fix" --skip-review --skip-document

&lt;span class="gh"&gt;# Increase test retry attempts&lt;/span&gt;
uv run travis_sdlc.py "Tricky feature" --max-test-retries 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This post covered the "what" and "why" of ADW. In the next posts, I plan to go deeper into the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Planning Phase: how to write effective prompts and customize plan templates&lt;/li&gt;
&lt;li&gt;Validation: enforcing code quality with linters, auto-fixes, and custom rules&lt;/li&gt;
&lt;li&gt;Test &amp;amp; Review: handling failures, auto-fixes, and quality gates&lt;/li&gt;
&lt;li&gt;Customizing Slash Commands: adapting ADW for Go, Java, TypeScript, Rust, or any language&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The ADW framework is available on GitHub: &lt;a href="https://github.com/travism26/claude_code_agent_templates" rel="noopener noreferrer"&gt;https://github.com/travism26/claude_code_agent_templates&lt;/a&gt;. I'd love to hear how you're using it. Drop a comment or reach out on Twitter (@travism26).&lt;/p&gt;

&lt;h2&gt;
  
  
  Shoutouts
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/@indydevdan" rel="noopener noreferrer"&gt;IndyDevDan&lt;/a&gt;: I took his course, and many of the ideas I learned and expanded upon here come from it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>softwaredevelopment</category>
    </item>
  </channel>
</rss>
