Tzvi Gregory Kaidanov

Posted on Jun 8

Zero-Stall AI: Building a Self-Managing TDD Pipeline with Autonomous Agents

#agents #ai #automation #testing


tl;dr — Point an AI agent at a GitHub Issue, have it write a failing E2E test, implement the fix, commit with full provenance metadata, deploy to staging, and ping you on Telegram. One tap to approve. Ship.

The Problem with AI-Assisted Development Today

Most teams using AI coding assistants hit the same wall:

- Agents stall waiting for a human to respond in the IDE
- Context windows expire mid-task, losing all progress
- There is no audit trail: you don't know which model wrote what, how long it took, or how many tokens it cost
- Tests are an afterthought: the AI writes code first and tests later, if at all
- Staging review requires a laptop, which kills async workflows

This post describes a systematic architecture that solves all five.

Core Idea: GitHub Issues as the AI's Working Memory

The foundation is simple: a GitHub Issue is the single source of truth for every unit of work.

Each issue contains:

- A reference to the relevant PRD section
- Acceptance criteria written as plain-language assertions
- The last iteration snapshot (what was done, what failed, what's next)
- Links to test artefacts (video, trace, HTML report)
- Token/time metadata from every AI session

When an AI agent starts a task, it reads the issue. When it ends — whether it finished or ran out of tokens — it writes back to the issue. The next agent (or the same one in a new session) picks up exactly where things left off.

Why Issues and not a file? Issues survive branch switches, are visible to all team members, support comments and labels, and integrate natively with CI/CD triggers.
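As a rough sketch of the read-at-start, write-at-end contract, the two GitHub REST calls involved can be described as plain request plans. The endpoint paths are GitHub's issue and issue-comment endpoints; `IssueRef`, `HttpRequestPlan`, and both function names are hypothetical and only illustrate the idea.

```typescript
// Hypothetical shapes: none of these names come from the article.
interface IssueRef {
  owner: string;
  repo: string;
  issueNumber: number;
}

interface HttpRequestPlan {
  method: "GET" | "POST";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

const GITHUB_API = "https://api.github.com";

function githubHeaders(token: string): Record<string, string> {
  return {
    Accept: "application/vnd.github+json",
    Authorization: `Bearer ${token}`,
  };
}

// Session start: fetch the issue (title, body, labels) as shared context.
function planReadIssue(ref: IssueRef, token: string): HttpRequestPlan {
  return {
    method: "GET",
    url: `${GITHUB_API}/repos/${ref.owner}/${ref.repo}/issues/${ref.issueNumber}`,
    headers: githubHeaders(token),
  };
}

// Session end: append the snapshot as a new comment on the same issue.
function planWriteSnapshot(
  ref: IssueRef,
  token: string,
  snapshotMarkdown: string
): HttpRequestPlan {
  return {
    method: "POST",
    url: `${GITHUB_API}/repos/${ref.owner}/${ref.repo}/issues/${ref.issueNumber}/comments`,
    headers: githubHeaders(token),
    body: JSON.stringify({ body: snapshotMarkdown }),
  };
}
```

Keeping the plan separate from the network call makes the contract easy to test; an agent would pass `plan.url`, `plan.method`, `plan.headers`, and `plan.body` to whatever HTTP client it already uses.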

The 6-Phase TDD Loop

┌──────────────────────────────────────────┐
│         📋  GITHUB ISSUE                 │
│  PRD ref · Acceptance Criteria · State   │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  🔴  RED PHASE                           │
│  Write Playwright spec from criteria     │
│  Test MUST fail before code is written   │
└──────────────────┬───────────────────────┘
                   │ FAIL confirmed ✓
                   ▼
┌──────────────────────────────────────────┐
│  🛠️  GREEN PHASE                         │
│  AI implements minimal fix               │  ◄──── loops here on CHANGE
│  Guardian reviews: types · no duplication│
└──────────────────┬───────────────────────┘
                   │ re-run test
                   ▼
┌──────────────────────────────────────────┐
│  ✅  PASS                                │
│  Video · Trace · HTML report saved       │
│  Artefacts posted to Issue comment       │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  📦  GITOPS COMMIT                       │
│  branch: tdd/issue-slug                  │
│  platform · model · tokens · duration    │
│  rollback tag created                    │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  🚀  STAGING DEPLOY                      │
│  Auto-deploy on tdd/* push               │
│  Preview URL → Issue + Telegram          │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│  👤  HUMAN REVIEW  (on your phone)       │
│  Artefacts + checklist via Telegram      │
│                                          │
│  APPROVE ──► Merge to main               │
│  CHANGE  ──► Back to GREEN PHASE         │
│  ESCALATE──► human-blocked · agent exits │
└──────────────────────────────────────────┘

The one invariant rule: A test must fail before any code is written. This forces acceptance criteria to be precise, ensures the test exercises the right behaviour, and gives a clear signal when the implementation is complete.

Phase Breakdown

Phase 1 — Read Issue Context

An issue-knowledge-manager agent reads the current issue, extracts the PRD reference, parses acceptance criteria, and loads the last iteration snapshot. This costs ~500 tokens and takes under 10 seconds. Every subsequent agent in the loop starts from this shared context.
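What "parses acceptance criteria" means depends on how your issues are written. The sketch below assumes a layout the article does not prescribe: a `PRD:` line for the reference and a task list under an "Acceptance Criteria" heading. `parseIssueBody` and `IssueContext` are hypothetical names.

```typescript
interface IssueContext {
  prdRef: string | null;
  criteria: { text: string; done: boolean }[];
}

// Assumed issue layout (not from the article):
//   PRD: <reference>
//   ## Acceptance Criteria
//   - [ ] criterion text
function parseIssueBody(body: string): IssueContext {
  let prdRef: string | null = null;
  const criteria: { text: string; done: boolean }[] = [];
  let inCriteria = false;

  for (const rawLine of body.split(/\r?\n/)) {
    const line = rawLine.trim();

    const prd = /^PRD:\s*(.+)$/i.exec(line);
    if (prd) {
      prdRef = prd[1].trim();
      continue;
    }

    // Any heading switches the current section on or off.
    const heading = /^#{1,6}\s+(.*)$/.exec(line);
    if (heading) {
      inCriteria = /^acceptance criteria$/i.test(heading[1].trim());
      continue;
    }

    // Only task-list items inside the criteria section count.
    const item = /^[-*]\s+\[( |x|X)\]\s+(.+)$/.exec(line);
    if (inCriteria && item) {
      criteria.push({ text: item[2].trim(), done: item[1] !== " " });
    }
  }
  return { prdRef, criteria };
}
```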

Phase 2 — RED (Write Failing Test)

A qa-test-engineer agent writes an E2E spec from the acceptance criteria. Before handing off, it runs the test suite and confirms the new test fails. A test that passes immediately means the criterion was already satisfied — or the test is wrong. Either way, stop and investigate.
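The hand-off rule can be made explicit as a small gate. This is a sketch under assumptions: the outcome names are loosely modelled on common test-runner statuses, and treating a timeout or skip as "RED not confirmed" is a design choice of this example, not something the article specifies.

```typescript
type SpecOutcome = "passed" | "failed" | "timedOut" | "skipped";

interface RedCheck {
  proceed: boolean;
  reason: string;
}

// Decide whether the loop may move from RED to GREEN,
// based on the outcome of the newly written spec.
function checkRedPhase(newSpecOutcome: SpecOutcome): RedCheck {
  if (newSpecOutcome === "failed") {
    return { proceed: true, reason: "RED confirmed: new spec fails before implementation" };
  }
  if (newSpecOutcome === "passed") {
    return {
      proceed: false,
      reason: "spec passed immediately: criterion already satisfied or test is wrong",
    };
  }
  // Assumption: a timeout or skip does not prove the assertion itself fails.
  return { proceed: false, reason: `RED not confirmed: spec ${newSpecOutcome}` };
}
```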

Phase 3 — GREEN (Implement)

A frontend-dev or backend-dev agent implements the minimal code change. A guardian agent then reviews the diff: no new `any` types, no duplicate logic, patterns consistent with the codebase. Only after the guardian approves does the loop re-run the spec written in Phase 2.
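One of the guardian's checks, "no new `any` types", can be approximated mechanically. The following is a heuristic sketch over a unified diff, not the article's guardian: the regular expression is an assumption and will miss some forms; duplication and pattern checks are out of scope here.

```typescript
// Return the added lines of a unified diff that appear to introduce `any`.
// Heuristic only: matches ": any", "as any", and "<any>" on "+" lines.
function findAddedAnyTypes(unifiedDiff: string): string[] {
  const anyPattern = /(:\s*any\b|\bas\s+any\b|<any>)/;
  const hits: string[] = [];

  for (const line of unifiedDiff.split(/\r?\n/)) {
    // "+" marks an added line; "+++" is the new-file header, not content.
    const isAdded = line.charAt(0) === "+" && line.slice(0, 3) !== "+++";
    if (isAdded && anyPattern.test(line)) {
      hits.push(line.slice(1).trim());
    }
  }
  return hits;
}
```

An empty result lets the review continue; a non-empty one gives the guardian concrete lines to send back to the dev agent.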

Phase 4 — Commit with Provenance

Once the test passes, a release-automation agent commits with structured metadata:

[vscode/claude-sonnet-4] fix: table sort order matches canvas view

Issue: #32
Platform: VSCode Extension
Model: claude-sonnet-4
Tokens used: ~11,200
Duration: 22 min
Tests: 14/14 passing
Staging: https://your-app-pr-42.vercel.app

A rollback tag is created for this commit: test-pass/32/2026-03-31. Any other AI environment can return to this exact passing state with one command.
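To keep the metadata convention consistent across agents, the message can be generated rather than typed. A minimal sketch that reproduces the example above; the `Provenance` interface and both function names are hypothetical.

```typescript
interface Provenance {
  platformSlug: string; // e.g. "vscode"
  platformName: string; // e.g. "VSCode Extension"
  model: string;
  summary: string; // conventional-commit style subject
  issue: number;
  tokensUsed: number;
  durationMin: number;
  testsPassed: number;
  testsTotal: number;
  stagingUrl: string;
}

// 11200 -> "11,200"
function formatThousands(n: number): string {
  return String(n).replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

function buildCommitMessage(p: Provenance): string {
  return [
    `[${p.platformSlug}/${p.model}] ${p.summary}`,
    "",
    `Issue: #${p.issue}`,
    `Platform: ${p.platformName}`,
    `Model: ${p.model}`,
    `Tokens used: ~${formatThousands(p.tokensUsed)}`,
    `Duration: ${p.durationMin} min`,
    `Tests: ${p.testsPassed}/${p.testsTotal} passing`,
    `Staging: ${p.stagingUrl}`,
  ].join("\n");
}

// Tag name in the article's test-pass/<issue>/<date> form.
function rollbackTag(issue: number, isoDate: string): string {
  return `test-pass/${issue}/${isoDate}`;
}
```

The resulting strings can be handed to `git commit -F <file>` and `git tag <name>`.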

Phase 5 — Deploy to Staging

A devops-engineer agent ensures every tdd/* branch triggers an automatic staging deploy. The preview URL is posted to the GitHub Issue and sent via the messaging gateway.

Phase 6 — Human Review on Your Phone

The orchestrator sends a notification containing:

- A link to the Playwright video recording
- The HTML test report
- The staging preview URL
- A checklist of acceptance criteria with pass/fail status

You reply with one word: APPROVE, CHANGE, or ESCALATE. The agent handles the rest.
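A sketch of both directions of this exchange, assuming Telegram is the gateway. The payload fields (`chat_id`, `text`, `reply_markup.inline_keyboard`, `callback_data`) follow the Telegram Bot API's `sendMessage` method; the inline buttons, the `ReviewPacket` shape, and the `DECISION:issue` callback format are assumptions of this example.

```typescript
type ReviewDecision = "APPROVE" | "CHANGE" | "ESCALATE";

interface ReviewPacket {
  issue: number;
  videoUrl: string;
  reportUrl: string;
  stagingUrl: string;
  checklist: { criterion: string; passed: boolean }[];
}

const REVIEW_DECISIONS: ReviewDecision[] = ["APPROVE", "CHANGE", "ESCALATE"];

// What the orchestrator does after each decision, per the loop diagram.
const NEXT_STEP: Record<ReviewDecision, string> = {
  APPROVE: "merge to main",
  CHANGE: "return to GREEN phase",
  ESCALATE: "label issue human-blocked and exit",
};

// Body for a sendMessage call: text plus one row of three buttons.
function buildReviewMessage(chatId: number, packet: ReviewPacket) {
  const checklist = packet.checklist
    .map((c) => `${c.passed ? "PASS" : "FAIL"}  ${c.criterion}`)
    .join("\n");
  const text = [
    `Review requested for issue #${packet.issue}`,
    `Video: ${packet.videoUrl}`,
    `Report: ${packet.reportUrl}`,
    `Staging: ${packet.stagingUrl}`,
    "",
    checklist,
  ].join("\n");
  return {
    chat_id: chatId,
    text,
    reply_markup: {
      inline_keyboard: [
        REVIEW_DECISIONS.map((d) => ({ text: d, callback_data: `${d}:${packet.issue}` })),
      ],
    },
  };
}

// Accept a typed one-word reply; anything else is ignored.
function parseDecision(reply: string): ReviewDecision | null {
  const word = reply.trim().toUpperCase();
  for (const d of REVIEW_DECISIONS) {
    if (d === word) return d;
  }
  return null;
}
```

Returning `null` for unrecognized replies keeps the agent from guessing; it can re-send the prompt instead.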

Safety: The Zero-Stall Guarantee

The biggest practical failure mode for AI agents is getting stuck. Here is the full safety net:

Trigger                        →   Agent Response
─────────────────────────────────────────────────────────────────────
Token count > 80k (soft)       →   Save snapshot to Issue · continue
Token count > 95k (hard)       →   Save snapshot · Telegram alert · EXIT
No tool response for 30 min    →   Save snapshot · Telegram "stalled on #N" · EXIT
Same action repeated 3×        →   Break loop · log to Issue · Telegram · EXIT
API key exhausted              →   Rotate to fallback key · log rotation · continue

The exit contract: An agent that exits cleanly always writes a snapshot to the issue first — what was completed, what was in progress, the exact file and line being worked on. Any agent that reads this snapshot can continue from that exact point, in any environment.
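The trigger table translates naturally into a pure decision function that the orchestrator could call between steps. This is a sketch: the thresholds come from the table, while the `AgentVitals` shape, the action names, and the priority order when several triggers fire at once are assumptions.

```typescript
interface AgentVitals {
  tokensUsed: number;
  minutesSinceToolResponse: number;
  recentActions: string[]; // oldest first, most recent last
  apiKeyExhausted: boolean;
}

type StallAction =
  | { kind: "continue" }
  | { kind: "snapshot-and-continue" }
  | { kind: "rotate-key-and-continue" }
  | { kind: "exit"; alert: string };

const SOFT_TOKEN_LIMIT = 80000;
const HARD_TOKEN_LIMIT = 95000;
const STALL_MINUTES = 30;
const REPEAT_LIMIT = 3;

// True when the last `times` actions are identical.
function repeatedTail(actions: string[], times: number): boolean {
  if (actions.length < times) return false;
  const tail = actions.slice(actions.length - times);
  return tail.every((a) => a === tail[0]);
}

// Exit conditions are checked first (assumed priority), then recoverable ones.
function decideStallAction(v: AgentVitals, issue: number): StallAction {
  if (v.tokensUsed > HARD_TOKEN_LIMIT) {
    return { kind: "exit", alert: `token budget exhausted on #${issue}` };
  }
  if (v.minutesSinceToolResponse >= STALL_MINUTES) {
    return { kind: "exit", alert: `stalled on #${issue}` };
  }
  if (repeatedTail(v.recentActions, REPEAT_LIMIT)) {
    return { kind: "exit", alert: `loop detected on #${issue}` };
  }
  if (v.apiKeyExhausted) return { kind: "rotate-key-and-continue" };
  if (v.tokensUsed > SOFT_TOKEN_LIMIT) return { kind: "snapshot-and-continue" };
  return { kind: "continue" };
}
```

Every `exit` result implies the exit contract: write the snapshot to the issue first, then send the alert.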

Agent Delegation Map

📋 GITHUB ISSUE  (Source of Truth)
         │
         ▼
┌─────────────────────────────────────┐  ORCHESTRATION LAYER
│  tdd-orchestrator                   │  Drives the loop · enforces token budget
│  issue-knowledge-manager            │  Reads + writes issue state
└─────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐  IMPLEMENTATION LAYER
│  qa-test-engineer                   │  Playwright specs · artefact collection
│  frontend-dev / backend-dev         │  Minimal fix · strict types · no any
│  guardian                           │  Code review gate · no duplication
└─────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐  SHIPPING LAYER
│  release-automation                 │  Commit · metadata · rollback tag · PR
│  devops-engineer                    │  Staging deploy · preview URL
└─────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐  HUMAN LOOP
│  Messaging Gateway (Telegram/Slack) │  Artefacts + checklist to your phone
│  You                                │  APPROVE · CHANGE · ESCALATE
└─────────────────────────────────────┘
         │
         └──────── decision flows back to orchestrator

Each layer is independently replaceable. Swap Playwright for Cypress. Swap Telegram for Slack. The orchestration contract stays the same.
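One way to read "the orchestration contract stays the same" is as an interface the orchestrator depends on, with each tool behind an adapter. A minimal sketch for the messaging layer; `MessagingGateway`, `RecordingGateway`, and `announceStaging` are hypothetical names, and a real adapter would call the Telegram or Slack API inside `send`.

```typescript
// The only thing the orchestrator knows about the messaging layer.
interface MessagingGateway {
  send(text: string): Promise<void>;
}

// Stand-in adapter that records messages instead of sending them.
class RecordingGateway implements MessagingGateway {
  readonly sent: string[] = [];
  private readonly channel: string;

  constructor(channel: string) {
    this.channel = channel;
  }

  async send(text: string): Promise<void> {
    this.sent.push(`[${this.channel}] ${text}`);
  }
}

// Orchestrator-side code is written once against the interface.
async function announceStaging(
  gateway: MessagingGateway,
  issue: number,
  previewUrl: string
): Promise<void> {
  await gateway.send(`Issue #${issue} is on staging: ${previewUrl}`);
}
```

Swapping Telegram for Slack then means writing one new class; `announceStaging` and the rest of the loop are unchanged.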

The Commit as a Time Capsule

Every commit in this system is self-describing. Someone (or another AI) reading the git log six months from now can reconstruct exactly:

| Field | What it tells you |
| --- | --- |
| Commit message | What changed |
| Issue reference | Why it changed (links to acceptance criteria) |
| `Tests: 14/14` | How it was validated |
| `Model: claude-sonnet-4` | What wrote it |
| `Tokens used: ~11,200` | What it cost |
| Staging URL | Where to see it live |
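Because the metadata is written as `Key: value` lines, reading it back is a few lines of parsing. A sketch, assuming the commit body follows the Phase 4 layout; `parseProvenance` is a hypothetical helper.

```typescript
// Turn a commit message in the Phase 4 format into a field map.
// The first line is stored under "Subject"; "Key: value" lines follow.
function parseProvenance(message: string): Record<string, string> {
  const lines = message.split(/\r?\n/);
  const fields: Record<string, string> = { Subject: lines[0] || "" };

  for (const line of lines.slice(1)) {
    const match = /^([A-Za-z][A-Za-z ]*):\s*(.+)$/.exec(line);
    if (match) {
      fields[match[1]] = match[2].trim();
    }
  }
  return fields;
}
```

The input can come from `git log -1 --format=%B <commit>`, which prints the raw subject and body of a commit.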

This is not overhead. It is the foundation of trustworthy AI-assisted development.

Iteration Snapshot Format

Every time an agent writes back to an issue, it uses this structured template:

## Iteration Snapshot — 2026-03-31 14:22

**Status:** PASS
**Agent:** qa-test-engineer + frontend-dev
**Platform:** VSCode Extension | **Model:** claude-sonnet-4
**Tokens:** ~13,400 | **Duration:** 24 min

### Completed this iteration
- Wrote Playwright spec for acceptance criterion 2 (table sort order)
- Confirmed RED: test failed on `expect(rows[0]).toBe('SKU-001')`
- Implemented sort fix in `TableView.tsx:214`
- Confirmed GREEN: 14/14 tests passing
- Committed: `abc1234` · Tagged: `test-pass/32/2026-03-31`

### Artefacts
- [Test Video](./playwright-report/videos/sort-order.webm)
- [HTML Report](./playwright-report/index.html)
- [Staging Preview](https://your-app-pr-42.vercel.app)

### Next step
Acceptance criterion 3 — clicking a table row should select the node on canvas.
Start at: `TableView.tsx` + `useCanvasSelection` hook.
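For the resuming agent, the useful parts of this template are the status, the completed items, and the next step. A sketch of a reader for the template as shown above; `readSnapshot` and `ResumePoint` are hypothetical names, and the parsing relies on the exact `**Status:**` and `###` conventions of this template.

```typescript
interface ResumePoint {
  status: string | null;
  completed: string[];
  nextStep: string;
}

// Extract the fields a resuming agent needs from a snapshot comment.
function readSnapshot(markdown: string): ResumePoint {
  let status: string | null = null;
  const completed: string[] = [];
  const nextLines: string[] = [];
  let section = "";

  for (const rawLine of markdown.split(/\r?\n/)) {
    const line = rawLine.trim();

    const statusMatch = /^\*\*Status:\*\*\s*(\S+)/.exec(line);
    if (statusMatch) {
      status = statusMatch[1];
      continue;
    }

    // "### ..." headings delimit the sections of the template.
    const heading = /^###\s+(.+)$/.exec(line);
    if (heading) {
      section = heading[1].toLowerCase();
      continue;
    }

    if (section.indexOf("completed") === 0 && line.indexOf("- ") === 0) {
      completed.push(line.slice(2));
    } else if (section === "next step" && line !== "") {
      nextLines.push(line);
    }
  }
  return { status, completed, nextStep: nextLines.join(" ") };
}
```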

What This Unlocks

| Before | After |
| --- | --- |
| Agent stalls waiting for IDE response | Times out, saves state, exits, pings you |
| Context window resets kill progress | Issue snapshot makes work resumable from any environment |
| No idea what the AI changed or why | Every commit is a fully documented time capsule |
| Tests written after the fact (if at all) | Tests define done: no test, no merge |
| Staging review requires a laptop | One-tap approve from your phone |
| Token exhaustion = lost work | Snapshot at 80k, graceful exit at 95k |
| AI writes the same pattern twice | Guardian agent blocks duplication before commit |

Getting Started Checklist

- [ ] Define acceptance criteria in GitHub Issues (not just task descriptions)
- [ ] Set up E2E testing (Playwright) with video: 'on' and trace: 'on'
- [ ] Configure branch-based staging deploys (Vercel, Netlify, or equivalent)
- [ ] Set up a messaging gateway for human-in-the-loop notifications (a Telegram bot is easiest)
- [ ] Write agent definition files for each role (orchestrator, issue manager, qa, dev, guardian, release, devops)
- [ ] Establish the commit metadata convention and enforce it from day one
- [ ] Set token budget thresholds: 80k soft and 95k hard is a solid baseline
- [ ] Create an issue snapshot template so all agents write consistent state
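For the Playwright item, a minimal config sketch might look like the following. `video`, `trace`, and the HTML reporter are standard Playwright Test options; the `testDir` value is a placeholder assumption, so adjust it to your repository layout.

```typescript
// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Assumed location of the E2E specs; not specified in the article.
  testDir: "./e2e",

  // Write the HTML report to disk without opening a browser,
  // so the agent can link to it from the issue.
  reporter: [["html", { open: "never" }]],

  use: {
    // Record a video and a trace for every test, pass or fail.
    video: "on",
    trace: "on",
  },
});
```

Recording on every run costs disk space and some speed; the trade-off is that passing runs also produce artefacts for the human review step.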

Conclusion

The goal is not to remove humans from software development. It is to remove humans from the parts that do not require human judgment — running tests, writing boilerplate, deploying previews, rotating API keys — and to surface the parts that do, cleanly, on the device you actually have in your hand.

A GitHub Issue with a clear acceptance criterion, an E2E test that fails first, a commit that documents its own provenance, and a one-tap decision from your phone — that is a workflow a team can trust, audit, and scale.

The AI does not need to be perfect. It needs to be accountable.

Tags: #aiagents #tdd #devops #llmops #playwright #vercel #gitops

DEV Community