tl;dr — Point an AI agent at a GitHub Issue, have it write a failing E2E test, implement the fix, commit with full provenance metadata, deploy to staging, and ping you on Telegram. One tap to approve. Ship.
The Problem with AI-Assisted Development Today
Most teams using AI coding assistants hit the same wall:
- Agents stall waiting for a human to respond in the IDE
- Context windows expire mid-task, losing all progress
- No audit trail — you don't know which model wrote what, how long it took, or how many tokens it cost
- Tests are an afterthought: the AI writes code first and tests later, if at all
- Staging review requires a laptop — killing async workflows
This post describes an architecture that addresses all five.
Core Idea: GitHub Issues as the AI's Working Memory
The foundation is simple: a GitHub Issue is the single source of truth for every unit of work.
Each issue contains:
- A reference to the relevant PRD section
- Acceptance criteria written as plain-language assertions
- The last iteration snapshot (what was done, what failed, what's next)
- Links to test artefacts (video, trace, HTML report)
- Token/time metadata from every AI session
When an AI agent starts a task, it reads the issue. When it ends — whether it finished or ran out of tokens — it writes back to the issue. The next agent (or the same one in a new session) picks up exactly where things left off.
Why Issues and not a file? Issues survive branch switches, are visible to all team members, support comments and labels, and integrate natively with CI/CD triggers.
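To make the "issue as memory" idea concrete, here is a minimal sketch of what an agent might hold in memory and how it could write a snapshot back. The `WorkItem` shape and the `buildSnapshotComment` helper are hypothetical names for illustration. The endpoint is GitHub's REST route for creating an issue comment. The function only builds the request, so any HTTP client can send it.

```typescript
// Hypothetical shape of one unit of work, mirroring the fields listed above.
interface WorkItem {
  issueNumber: number;
  prdSection: string;                              // reference to the relevant PRD section
  acceptanceCriteria: string[];                    // plain-language assertions
  lastSnapshot: string | null;                     // markdown left by the previous session
  artefactLinks: string[];                         // video, trace, HTML report
  sessions: { tokens: number; minutes: number }[]; // token/time metadata per AI session
}

// Plain description of an HTTP call, so the sketch needs no HTTP client.
interface IssueCommentRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the request that appends a snapshot as a new issue comment.
// Endpoint: POST /repos/{owner}/{repo}/issues/{issue_number}/comments
function buildSnapshotComment(
  owner: string,
  repo: string,
  item: WorkItem,
  snapshotMarkdown: string,
  token: string,
): IssueCommentRequest {
  return {
    method: "POST",
    url: `https://api.github.com/repos/${owner}/${repo}/issues/${item.issueNumber}/comments`,
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ body: snapshotMarkdown }),
  };
}
```

Writing each snapshot as a new comment, rather than editing the issue body, keeps the full iteration history visible. That choice is an assumption of this sketch, not a requirement of the approach.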
The 6-Phase TDD Loop
┌──────────────────────────────────────────┐
│ 📋 GITHUB ISSUE │
│ PRD ref · Acceptance Criteria · State │
└──────────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ 🔴 RED PHASE │
│ Write Playwright spec from criteria │
│ Test MUST fail before code is written │
└──────────────────┬───────────────────────┘
│ FAIL confirmed ✓
▼
┌──────────────────────────────────────────┐
│ 🛠️ GREEN PHASE │
│ AI implements minimal fix │ ◄──── loops here on CHANGE
│ Guardian reviews: types · no duplication│
└──────────────────┬───────────────────────┘
│ re-run test
▼
┌──────────────────────────────────────────┐
│ ✅ PASS │
│ Video · Trace · HTML report saved │
│ Artefacts posted to Issue comment │
└──────────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ 📦 GITOPS COMMIT │
│ branch: tdd/issue-slug │
│ platform · model · tokens · duration │
│ rollback tag created │
└──────────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ 🚀 STAGING DEPLOY │
│ Auto-deploy on tdd/* push │
│ Preview URL → Issue + Telegram │
└──────────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ 👤 HUMAN REVIEW (on your phone) │
│ Artefacts + checklist via Telegram │
│ │
│ APPROVE ──► Merge to main │
│ CHANGE ──► Back to GREEN PHASE │
│ ESCALATE──► human-blocked · agent exits │
└──────────────────────────────────────────┘
The one invariant rule: A test must fail before any code is written. This forces acceptance criteria to be precise, ensures the test exercises the right behaviour, and gives a clear signal when the implementation is complete.
Phase Breakdown
Phase 1 — Read Issue Context
An issue-knowledge-manager agent reads the current issue, extracts the PRD reference, parses acceptance criteria, and loads the last iteration snapshot. This costs ~500 tokens and takes under 10 seconds. Every subsequent agent in the loop starts from this shared context.
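As a rough illustration of the parsing step, the sketch below pulls acceptance criteria out of an issue body. It assumes a convention the post does not specify: criteria are written as task-list items under a heading that starts with "Acceptance Criteria". The `parseAcceptanceCriteria` name is hypothetical.

```typescript
interface ParsedCriterion {
  text: string;
  done: boolean; // true when the task-list box is ticked
}

// Collects task-list items that sit under an "Acceptance Criteria" heading.
function parseAcceptanceCriteria(issueBody: string): ParsedCriterion[] {
  const criteria: ParsedCriterion[] = [];
  let inSection = false;
  for (const line of issueBody.split(/\r?\n/)) {
    const heading = /^#{1,6}\s+(.*)$/.exec(line);
    if (heading) {
      // Enter the section on its heading, leave it on any other heading.
      inSection = /^acceptance criteria\b/i.test(heading[1].trim());
      continue;
    }
    const item = /^\s*[-*]\s+\[( |x|X)\]\s+(.+)$/.exec(line);
    if (inSection && item) {
      criteria.push({ text: item[2].trim(), done: item[1] !== " " });
    }
  }
  return criteria;
}
```

Task-list items outside the section are ignored, so checklists elsewhere in the issue do not leak into the criteria.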
Phase 2 — RED (Write Failing Test)
A qa-test-engineer agent writes an E2E spec from the acceptance criteria. Before handing off, it runs the test suite and confirms the new test fails. A test that passes immediately means the criterion was already satisfied — or the test is wrong. Either way, stop and investigate.
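The RED gate itself is a small decision. The sketch below shows one way to encode it. The `SpecOutcome` and `checkRedPhase` names are hypothetical, and the outcomes are assumed to come from whatever test runner is in use.

```typescript
type SpecOutcome = "passed" | "failed" | "skipped";

interface RedCheck {
  proceed: boolean; // true only when every new spec failed
  reason: string;
}

// newSpecs: outcomes of the specs just written for the current criterion.
function checkRedPhase(newSpecs: SpecOutcome[]): RedCheck {
  if (newSpecs.length === 0) {
    return { proceed: false, reason: "no new spec was run" };
  }
  if (newSpecs.some((outcome) => outcome !== "failed")) {
    return {
      proceed: false,
      reason: "new spec did not fail: criterion already met, or the test is wrong",
    };
  }
  return { proceed: true, reason: "RED confirmed" };
}
```

A skipped spec is treated the same as a passing one here, since neither proves that the test exercises the missing behaviour.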
Phase 3 — GREEN (Implement)
A frontend-dev or backend-dev agent implements the minimal code change. A guardian agent then reviews the diff: no new `any` types, no duplicate logic, patterns consistent with the codebase. Only after the guardian approves is the test from Phase 2 re-run.
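One of the guardian's checks can be approximated mechanically. The sketch below is a deliberately naive heuristic that scans the added lines of a unified diff for explicit `any` annotations. The `findNewAnyTypes` name is hypothetical, and a real gate would more likely lean on the compiler or a lint rule than on a regular expression.

```typescript
// Returns the added lines of a unified diff that introduce an explicit `any`.
// Heuristic only: it does not parse TypeScript, so it can miss cases or
// flag text inside strings and comments.
function findNewAnyTypes(unifiedDiff: string): string[] {
  const anyAnnotation = /(:\s*any\b|\bas\s+any\b|<any>)/;
  return unifiedDiff
    .split(/\r?\n/)
    // Keep added lines, but skip the "+++ b/file" header line.
    .filter((line) => line.charAt(0) === "+" && line.slice(0, 3) !== "+++")
    .map((line) => line.slice(1))
    .filter((line) => anyAnnotation.test(line));
}
```

An empty result lets the diff through this particular check. A non-empty result gives the guardian exact lines to quote back to the implementing agent.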
Phase 4 — Commit with Provenance
Once the test passes, a release-automation agent commits with structured metadata:
[vscode/claude-sonnet-4] fix: table sort order matches canvas view
Issue: #32
Platform: VSCode Extension
Model: claude-sonnet-4
Tokens used: ~11,200
Duration: 22 min
Tests: 14/14 passing
Staging: https://your-app-pr-42.vercel.app
A rollback tag is created before the commit: test-pass/32/2026-03-31. Any other AI environment can roll back to this exact state with one command.
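A convention like this is easier to enforce when the message is generated rather than typed. Here is a minimal sketch that reproduces the fields shown above. The `Provenance` type and both function names are hypothetical, and the blank line after the subject follows the usual git convention rather than anything the format requires.

```typescript
interface Provenance {
  platformSlug: string; // short form used in the subject, e.g. "vscode"
  platform: string;     // e.g. "VSCode Extension"
  model: string;
  summary: string;      // e.g. "fix: table sort order matches canvas view"
  issue: number;
  tokensUsed: number;
  durationMin: number;
  testsPassed: number;
  testsTotal: number;
  stagingUrl: string;
}

// Renders the subject line plus one "Key: value" line per metadata field.
function formatCommitMessage(p: Provenance): string {
  return [
    `[${p.platformSlug}/${p.model}] ${p.summary}`,
    "", // blank line separating subject from body
    `Issue: #${p.issue}`,
    `Platform: ${p.platform}`,
    `Model: ${p.model}`,
    `Tokens used: ~${p.tokensUsed.toLocaleString("en-US")}`,
    `Duration: ${p.durationMin} min`,
    `Tests: ${p.testsPassed}/${p.testsTotal} passing`,
    `Staging: ${p.stagingUrl}`,
  ].join("\n");
}

// Tag name in the test-pass/<issue>/<YYYY-MM-DD> form, using the UTC date.
function rollbackTag(issue: number, when: Date): string {
  return `test-pass/${issue}/${when.toISOString().slice(0, 10)}`;
}
```

With the tag in place, the "one command" could be something like `git checkout test-pass/32/2026-03-31`. The exact command depends on how the team prefers to roll back.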
Phase 5 — Deploy to Staging
A devops-engineer agent ensures every tdd/* branch triggers an automatic staging deploy. The preview URL is posted to the GitHub Issue and sent via the messaging gateway.
Phase 6 — Human Review on Your Phone
The orchestrator sends a notification containing:
- A link to the Playwright video recording
- The HTML test report
- The staging preview URL
- A checklist of acceptance criteria with pass/fail status
You reply with one word: APPROVE, CHANGE, or ESCALATE. The agent handles the rest.
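The review step reduces to two small pieces: composing the notification and interpreting the reply. The sketch below shows both under stated assumptions. All names are hypothetical, the message is plain text that any gateway could send, and the next-step descriptions paraphrase the loop diagram.

```typescript
type ReviewDecision = "APPROVE" | "CHANGE" | "ESCALATE";

// What the orchestrator does next for each decision, per the loop diagram.
const NEXT_STEP: Record<ReviewDecision, string> = {
  APPROVE: "merge to main",
  CHANGE: "return to GREEN phase",
  ESCALATE: "mark issue human-blocked and exit",
};

const DECISIONS: ReviewDecision[] = ["APPROVE", "CHANGE", "ESCALATE"];

// Accepts the one-word reply in any letter case; anything else yields null.
function parseDecision(reply: string): ReviewDecision | null {
  const word = reply.trim().toUpperCase();
  for (const decision of DECISIONS) {
    if (decision === word) return decision;
  }
  return null;
}

interface ReviewRequest {
  issue: number;
  videoUrl: string;
  reportUrl: string;
  stagingUrl: string;
  checklist: { text: string; passed: boolean }[];
}

// Plain-text notification body listing artefacts and criterion status.
function buildReviewMessage(r: ReviewRequest): string {
  const checklist = r.checklist.map((c) => `${c.passed ? "PASS" : "FAIL"}  ${c.text}`);
  return [
    `Review requested for issue #${r.issue}`,
    `Video: ${r.videoUrl}`,
    `Report: ${r.reportUrl}`,
    `Staging: ${r.stagingUrl}`,
    ...checklist,
    "Reply APPROVE, CHANGE, or ESCALATE.",
  ].join("\n");
}
```

Returning `null` for anything unrecognised lets the orchestrator ask again instead of guessing at intent.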
Safety: The Zero-Stall Guarantee
The biggest practical failure mode for AI agents is getting stuck. Here is the full safety net:
Trigger → Agent Response
─────────────────────────────────────────────────────────────────────
Token count > 80k (soft) → Save snapshot to Issue · continue
Token count > 95k (hard) → Save snapshot · Telegram alert · EXIT
No tool response for 30 min → Save snapshot · Telegram "stalled on #N" · EXIT
Same action repeated 3× → Break loop · log to Issue · Telegram · EXIT
API key exhausted → Rotate to fallback key · log rotation · continue
The exit contract: An agent that exits cleanly always writes a snapshot to the issue first — what was completed, what was in progress, the exact file and line being worked on. Any agent that reads this snapshot can continue from that exact point, in any environment.
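The trigger table translates almost directly into a guard function that the orchestrator can call between steps. This is a sketch, not a reference implementation. The type and function names are hypothetical, and the order in which triggers are checked (exits first, then recoverable cases) is an assumption the table does not spell out.

```typescript
interface AgentSession {
  tokensUsed: number;
  minutesSinceToolResponse: number;
  recentActions: string[]; // oldest first
  apiKeyExhausted: boolean;
}

type GuardVerdict =
  | "continue"
  | "snapshot-and-continue"   // soft token limit
  | "rotate-key-and-continue" // API key exhausted
  | "snapshot-alert-exit"     // hard token limit or stall
  | "break-loop-alert-exit";  // same action repeated

const SOFT_TOKEN_LIMIT = 80000;
const HARD_TOKEN_LIMIT = 95000;
const STALL_MINUTES = 30;
const MAX_REPEATS = 3;

// True when the last `times` actions are all identical.
function sameActionRepeated(actions: string[], times: number): boolean {
  if (actions.length < times) return false;
  const tail = actions.slice(-times);
  return tail.every((action) => action === tail[0]);
}

function evaluateGuards(s: AgentSession): GuardVerdict {
  // Exit conditions are checked before recoverable ones (assumed ordering).
  if (s.tokensUsed > HARD_TOKEN_LIMIT) return "snapshot-alert-exit";
  if (s.minutesSinceToolResponse >= STALL_MINUTES) return "snapshot-alert-exit";
  if (sameActionRepeated(s.recentActions, MAX_REPEATS)) return "break-loop-alert-exit";
  if (s.apiKeyExhausted) return "rotate-key-and-continue";
  if (s.tokensUsed > SOFT_TOKEN_LIMIT) return "snapshot-and-continue";
  return "continue";
}
```

Keeping the thresholds as named constants makes the 80k/95k baseline easy to tune per model.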
Agent Delegation Map
📋 GITHUB ISSUE (Source of Truth)
│
▼
┌─────────────────────────────────────┐ ORCHESTRATION LAYER
│ tdd-orchestrator │ Drives the loop · enforces token budget
│ issue-knowledge-manager │ Reads + writes issue state
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐ IMPLEMENTATION LAYER
│ qa-test-engineer │ Playwright specs · artefact collection
│ frontend-dev / backend-dev │ Minimal fix · strict types · no any
│ guardian │ Code review gate · no duplication
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐ SHIPPING LAYER
│ release-automation │ Commit · metadata · rollback tag · PR
│ devops-engineer │ Staging deploy · preview URL
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐ HUMAN LOOP
│ Messaging Gateway (Telegram/Slack) │ Artefacts + checklist to your phone
│ You │ APPROVE · CHANGE · ESCALATE
└─────────────────────────────────────┘
│
└──────── decision flows back to orchestrator
Each layer is independently replaceable. Swap Playwright for Cypress. Swap Telegram for Slack. The orchestration contract stays the same.
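One way to keep layers swappable is to have the orchestrator depend on small interfaces instead of concrete tools. The sketch below illustrates the idea with hypothetical `SpecRunner` and `Gateway` contracts. The methods are synchronous only to keep the example short, and real adapters would almost certainly be asynchronous.

```typescript
// Contract for the test layer: a Playwright or Cypress adapter would implement it.
interface SpecRunner {
  name: string;
  run(specPath: string): boolean; // true when the spec passes
}

// Contract for the human-loop layer: a Telegram or Slack adapter would implement it.
interface Gateway {
  name: string;
  send(text: string): void;
}

// The orchestrator only sees the contracts, never the concrete tools.
function reportSpec(runner: SpecRunner, gateway: Gateway, specPath: string): boolean {
  const passed = runner.run(specPath);
  gateway.send(`${specPath}: ${passed ? "PASS" : "FAIL"} (via ${runner.name})`);
  return passed;
}
```

Swapping tools then means writing a new adapter that satisfies the interface, with no change to `reportSpec`.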
The Commit as a Time Capsule
Every commit in this system is self-describing. Someone (or another AI) reading the git log six months from now can reconstruct exactly:
| Field | What it tells you |
|---|---|
| Commit message | What changed |
| Issue reference | Why it changed (links to acceptance criteria) |
| `Tests: 14/14` | How it was validated |
| `Model: claude-sonnet-4` | What wrote it |
| `Tokens: ~11,200` | What it cost |
| Staging URL | Where to see it live |
This is not overhead. It is the foundation of trustworthy AI-assisted development.
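Because the metadata is plain `Key: value` lines, reading it back out of a commit takes only a few lines of code. The sketch below assumes the commit format shown earlier. The `parseProvenance` name is hypothetical, and the message text is assumed to be obtained separately, for example from `git log`.

```typescript
// Extracts "Key: value" lines from a commit message body.
// The first line (the subject) is skipped so "fix: ..." is not mistaken for a field.
function parseProvenance(message: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of message.split(/\r?\n/).slice(1)) {
    const match = /^([A-Za-z][A-Za-z ]*):\s+(.+)$/.exec(line);
    if (match) {
      fields[match[1]] = match[2];
    }
  }
  return fields;
}
```

A later agent, or a reporting script, could use the result to total token cost per issue or to list which models touched a file.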
Iteration Snapshot Format
Every time an agent writes back to an issue, it uses this structured template:
## Iteration Snapshot — 2026-03-31 14:22
**Status:** PASS
**Agent:** qa-test-engineer + frontend-dev
**Platform:** VSCode Extension | **Model:** claude-sonnet-4
**Tokens:** ~13,400 | **Duration:** 24 min
### Completed this iteration
- Wrote Playwright spec for acceptance criterion 2 (table sort order)
- Confirmed RED: test failed on `expect(rows[0]).toBe('SKU-001')`
- Implemented sort fix in `TableView.tsx:214`
- Confirmed GREEN: 14/14 tests passing
- Committed: `abc1234` · Tagged: `test-pass/32/2026-03-31`
### Artefacts
- [Test Video](./playwright-report/videos/sort-order.webm)
- [HTML Report](./playwright-report/index.html)
- [Staging Preview](https://your-app-pr-42.vercel.app)
### Next step
Acceptance criterion 3 — clicking a table row should select the node on canvas.
Start at: `TableView.tsx` + `useCanvasSelection` hook.
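Generating the snapshot from structured data keeps every agent's output consistent. The sketch below renders the template above from a hypothetical `IterationSnapshot` object. The field names are assumptions chosen to mirror the template, not part of any existing tool.

```typescript
interface IterationSnapshot {
  timestamp: string; // e.g. "2026-03-31 14:22"
  status: string;    // e.g. "PASS"
  agents: string[];
  platform: string;
  model: string;
  tokens: number;
  durationMin: number;
  completed: string[];
  artefacts: { label: string; url: string }[];
  nextStep: string;
}

// Renders the snapshot as markdown in the same layout as the template.
function renderSnapshot(s: IterationSnapshot): string {
  return [
    // \u2014 is the dash character used in the template heading.
    `## Iteration Snapshot \u2014 ${s.timestamp}`,
    `**Status:** ${s.status}`,
    `**Agent:** ${s.agents.join(" + ")}`,
    `**Platform:** ${s.platform} | **Model:** ${s.model}`,
    `**Tokens:** ~${s.tokens.toLocaleString("en-US")} | **Duration:** ${s.durationMin} min`,
    "### Completed this iteration",
    ...s.completed.map((item) => `- ${item}`),
    "### Artefacts",
    ...s.artefacts.map((a) => `- [${a.label}](${a.url})`),
    "### Next step",
    s.nextStep,
  ].join("\n");
}
```

The returned string can be posted as an issue comment as-is, which is what makes the snapshot readable by both humans and the next agent.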
What This Unlocks
| Before | After |
|---|---|
| Agent stalls waiting for IDE response | Times out, saves state, exits, pings you |
| Context window resets kill progress | Issue snapshot = resumable from any environment |
| No idea what the AI changed or why | Every commit is a fully documented time capsule |
| Tests written after the fact (if at all) | Tests define done — no test, no merge |
| Staging review requires a laptop | One-tap approve from your phone |
| Token exhaustion = lost work | Snapshot at 80k, graceful exit at 95k |
| AI writes the same pattern twice | Guardian agent blocks duplication before commit |
Getting Started Checklist
- [ ] Define acceptance criteria in GitHub Issues (not just task descriptions)
- [ ] Set up E2E testing (Playwright) with `video: 'on'` and `trace: 'on'`
- [ ] Configure branch-based staging deploys (Vercel, Netlify, or equivalent)
- [ ] Set up a messaging gateway for human-in-the-loop notifications (Telegram bot is easiest)
- [ ] Write agent definition files for each role (orchestrator, qa, dev, release, devops)
- [ ] Establish the commit metadata convention — enforce it from day one
- [ ] Set token budget thresholds — 80k soft, 95k hard is a solid baseline
- [ ] Create an issue snapshot template so all agents write consistent state
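For the Playwright item in the checklist, a minimal `playwright.config.ts` might look like the sketch below. The `testDir` value and the report folder are assumptions to adapt to your project. `video`, `trace`, and the HTML reporter are standard Playwright options.

```typescript
import { defineConfig } from "@playwright/test";

export default defineConfig({
  testDir: "./e2e", // assumed location of the specs
  // Write the HTML report to a folder without opening a browser afterwards.
  reporter: [["html", { outputFolder: "playwright-report", open: "never" }]],
  use: {
    video: "on", // record a video for every test
    trace: "on", // record a trace for every test
  },
});
```

Recording everything costs disk space and run time. Once the loop is stable, `retain-on-failure` is a lighter setting for both options.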
Conclusion
The goal is not to remove humans from software development. It is to remove humans from the parts that do not require human judgment — running tests, writing boilerplate, deploying previews, rotating API keys — and to surface the parts that do, cleanly, on the device you actually have in your hand.
A GitHub Issue with a clear acceptance criterion, an E2E test that fails first, a commit that documents its own provenance, and a one-tap decision from your phone — that is a workflow a team can trust, audit, and scale.
The AI does not need to be perfect. It needs to be accountable.