DEV Community: Creatman

Playwright MCP burns 1.5M tokens. CLI does it in 27k. So I built the skill that splits the phases.

Creatman — Sun, 03 May 2026 14:16:12 +0000

I wanted to test my web apps. That's it. A Next.js portfolio and a SaaS chat: run some accessibility checks, catch console errors, verify nothing's broken on mobile. The kind of thing you do before pushing to production.

I opened Claude Code, connected Playwright MCP, typed "test the app" and watched it burn through tokens like there was no tomorrow. Then /compact fired at 18% text context. Then I discovered the invisible image budget. Then I spent three days building the tool I wished existed.

This is the story of how a routine testing session turned into webtest-orch — an open-source Claude Code skill that does e2e testing without bankrupting your token budget or hitting invisible context limits.

The problem: MCP is great at exploring, terrible at replaying

In November 2025, Pramod Dutta published an analysis that went around the AI-testing corner of the internet: Playwright MCP burns ~114k tokens per single test. The Özal benchmark on Microsoft's own issue tracker shows e-commerce verify workflows hitting ~1.5M tokens on MCP. The Playwright CLI? Still ~27k.

That's a 50–60× asymmetry. The cause is architectural: MCP keeps the LLM in the browser loop for every action — navigate, click, wait, screenshot, reason, repeat. Great for discovering a UI you've never seen. Catastrophically expensive for replaying the same flow a second time.
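As a sanity check on that range, here is the arithmetic using only the two figures quoted above (the constant names are mine):

```python
# Approximate token counts quoted above.
MCP_WORKFLOW_TOKENS = 1_500_000  # e-commerce verify workflow through MCP
CLI_TOKENS = 27_000              # Playwright CLI

ratio = MCP_WORKFLOW_TOKENS / CLI_TOKENS
saved = MCP_WORKFLOW_TOKENS - CLI_TOKENS

print(f"MCP uses about {ratio:.0f}x the tokens of the CLI")   # about 56x
print(f"Tokens avoided by going through the CLI: {saved:,}")  # 1,473,000
```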

Microsoft's own README has since been updated to recommend CLI + Skills over MCP for coding agents. The official Test Agents documentation now ships a Planner / Generator / Healer triplet as the supported architecture — not "agent stays inside MCP for the whole session."

The fix isn't "use less Playwright MCP." It's split exploration from replay.

The second problem: the invisible image budget

While debugging my token usage, I found something worse. Claude Code has a second context limit that nobody documents — an inline-image budget of roughly 50–100 blocks per session. No counter. No warning. No --show-image-budget flag.

Every Playwright:browser_take_screenshot returns one image block. Fifty screenshots in, you've used 0.4% of your text budget and 100% of your image budget. Then /compact fires while your text context is 80% empty. The agent loses everything that wasn't disk-backed.

I tried three fixes:

"Just take fewer screenshots" — discipline drifts by turn 30 in any real exploratory session.

CLAUDE.md rule "never take screenshots" — soft rules survive about 30 turns under pressure. The agent reaches for a screenshot when stuck on a modal and rationalises the rule violation in the same turn.

context: fork in skill frontmatter — the documented official field, which silently failed to register the skill on my Claude Code 2.1.x Windows build. 90 minutes of debugging before I gave up and put the protection in the skill body instead.

What actually works: Task subagents have isolated image budgets. Whatever a subagent reads doesn't count against the parent. I verified empirically — dispatched a subagent to read 6 PNGs, return 6 text descriptions, parent counter stayed at zero.

Three days later: webtest-orch

By day three I had a working skill built around one architectural invariant: the parent chat never receives an image, ever. Four patterns enforce it:

Pattern A (90% of work): ARIA-tree exploration via Playwright:browser_snapshot — returns the page as text, not images. Same locator information as a screenshot, except the agent can grep and diff it. Zero image-budget cost.

Pattern B (3–5 per run): When vision is genuinely needed (pixel-diff fired, layout check on a zero-baseline page), a Task subagent reads ONE image and returns ONE text line. The subagent burns its own budget; the parent stays clean.

Pattern C: Playwright's built-in toHaveScreenshot() returns diff% as JSON when run through npx playwright test. Text the whole way down. Vision tokens only burn when the diff genuinely fires — and even then it's Pattern B.

Pattern D: Screenshots on disk (failure artifacts, MCP cache) cost zero unless explicitly Read. Don't conflate "file exists" with "file costs context."

The skill flow: first run → Playwright MCP walks the app via ARIA snapshots (in a subagent) and generates *.spec.ts. Every run after → npx playwright test directly: deterministic, ~zero ongoing LLM cost. Bugs are fingerprinted with SHA-256 composite keys, and a run-diff classifies each one as new / regression / persisting / fixed.
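To make the fingerprinting and run-diff idea concrete, here is a minimal Python sketch. It is not the skill's actual code: the fields in the composite key (`check_id`, `route`, `target`) and the exact meaning of each label are my assumptions about how such a classifier could work.

```python
import hashlib


def fingerprint(check_id: str, route: str, target: str) -> str:
    """Hash a composite key (assumed fields) into a stable bug fingerprint."""
    # The unit separator keeps ("a", "bc") and ("ab", "c") from colliding.
    composite = "\x1f".join([check_id, route, target])
    return hashlib.sha256(composite.encode("utf-8")).hexdigest()


def classify(current: set[str], previous: set[str], ever_seen: set[str]) -> dict[str, str]:
    """Label fingerprints from the current and previous runs.

    current   -- fingerprints found in this run
    previous  -- fingerprints found in the run just before it
    ever_seen -- fingerprints found in any earlier run (assumed history store)
    """
    labels: dict[str, str] = {}
    for fp in current:
        if fp in previous:
            labels[fp] = "persisting"
        elif fp in ever_seen:
            labels[fp] = "regression"  # was gone last run, is back now
        else:
            labels[fp] = "new"
    for fp in previous - current:
        labels[fp] = "fixed"
    return labels


# Hypothetical example values.
contrast = fingerprint("color-contrast", "/", "a.nav-link")
heading = fingerprint("heading-order", "/projects", "h3")
print(classify(current={contrast}, previous={contrast, heading}, ever_seen={contrast, heading}))
```

Because the fingerprint depends only on the key fields, the same bug hashes to the same value on every run, which is what makes the set comparisons meaningful.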

The issues[] collector pattern

Every generated spec collects all soft checks into one array:

import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('home page baseline', async ({ page }) => {
  const consoleErrors: string[] = [];
  const failedRequests: string[] = [];
  const issues: string[] = [];

  // Attach listeners before navigating so nothing emitted during load is missed.
  page.on('pageerror', (e) => consoleErrors.push(`pageerror: ${e.message}`));
  page.on('console', (m) => m.type() === 'error' && consoleErrors.push(`console: ${m.text()}`));
  page.on('response', (r) => r.status() >= 400 && failedRequests.push(`${r.status()} ${r.url()}`));

  // Relative URL: resolved against baseURL from the Playwright config.
  await page.goto('/');

  // Run the axe-core audit and record every violation as one line in issues[].
  const a11y = await new AxeBuilder({ page })
    .withTags(['wcag2a','wcag2aa','wcag21aa','wcag22aa']).analyze();
  a11y.violations.forEach((v) =>
    issues.push(`a11y[${v.impact}] ${v.id}: ${v.help} (${v.nodes.length}x nodes)`));

  // overflow, heading hierarchy, touch targets, html-lang — all push into issues[]

  // Assert once at the end so a failure lists every problem, not just the first.
  expect(issues, `${issues.length} issues found:\n  - ${issues.join('\n  - ')}`).toEqual([]);
  expect(consoleErrors).toEqual([]);
  expect(failedRequests).toEqual([]);
});

When a test fails, you get every problem in one message. A post-run script parses the output and emits one bug record per issue, each with a stable fingerprint for cross-run diffing.
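A post-run parser for that message could look roughly like the Python sketch below. It assumes the function receives only the custom assertion message built in the spec above (not the full reporter output), and the `BugRecord` shape is hypothetical. Only the a11y line format appears in the spec, so every other bullet is kept as raw text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches "  - a11y[serious] color-contrast: <help text> (8x nodes)"
A11Y_LINE = re.compile(
    r"^\s*-\s+a11y\[(?P<impact>[^\]]+)\]\s+(?P<rule>[\w-]+):\s+"
    r"(?P<help>.*?)\s+\((?P<count>\d+)x nodes\)\s*$"
)
BULLET = re.compile(r"^\s*-\s+(?P<text>.+?)\s*$")


@dataclass
class BugRecord:  # hypothetical record shape
    kind: str
    rule: str
    message: str
    impact: Optional[str] = None
    node_count: Optional[int] = None


def parse_issues(failure_message: str) -> list[BugRecord]:
    """Turn the '<n> issues found:' message into one record per bullet."""
    records: list[BugRecord] = []
    for line in failure_message.splitlines():
        a11y = A11Y_LINE.match(line)
        if a11y:
            records.append(
                BugRecord("a11y", a11y["rule"], a11y["help"], a11y["impact"], int(a11y["count"]))
            )
            continue
        bullet = BULLET.match(line)
        if bullet:
            # Other checks: their line format is not shown, so keep the raw text.
            records.append(BugRecord("other", "unknown", bullet["text"]))
    return records


message = (
    "1 issues found:\n"
    "  - a11y[serious] color-contrast: Elements must meet minimum "
    "color contrast ratio thresholds (8x nodes)"
)
for record in parse_issues(message):
    print(record)
```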

What gets tested out of the box

Every generated spec enforces: console error listeners (attached before page.goto(), with a noise-filter for GTM/Stripe/Sentry/Next.js/Supabase/ResizeObserver), axe-core WCAG audit (wcag2a through wcag22aa), heading hierarchy (no h1 → h3 jumps), touch-target sizing (WCAG 2.5.8 AA = 24×24 CSS px), horizontal overflow detection, and html lang presence. Visual regression uses Playwright's built-in toHaveScreenshot() — zero external dependencies.

Severity is auto-assigned from axe impact fields and error class, with three override mechanisms: [severity:S0] inline in the collector, in the test name, or // @severity: S0 as a comment before test(). Tracker mappings for Linear, GitHub, and Jira ship out of the box.
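One way the severity resolution could work is sketched below in Python. The impact-to-severity table, the default, and the precedence order (inline tag, then test name, then comment) are my assumptions, not documented behavior of the skill.

```python
import re
from typing import Optional

# Assumed mapping from axe-core impact levels to severities.
IMPACT_TO_SEVERITY = {"critical": "S0", "serious": "S1", "moderate": "S2", "minor": "S3"}

INLINE_TAG = re.compile(r"\[severity:(S\d)\]")          # "[severity:S0]"
COMMENT_TAG = re.compile(r"//\s*@severity:\s*(S\d)")    # "// @severity: S0"


def resolve_severity(
    issue_text: str,
    test_name: str = "",
    preceding_comment: str = "",
    impact: Optional[str] = None,
    default: str = "S2",
) -> str:
    """Return an explicit override if present, else derive severity from impact."""
    candidates = (
        (INLINE_TAG, issue_text),          # tag inside the collected issue string
        (INLINE_TAG, test_name),           # tag inside the test title
        (COMMENT_TAG, preceding_comment),  # comment placed before test()
    )
    for pattern, text in candidates:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return IMPACT_TO_SEVERITY.get(impact, default)


print(resolve_severity("a11y[serious] color-contrast", impact="serious"))   # S1
print(resolve_severity("checkout total wrong [severity:S0]", impact="minor"))  # S0
```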

The honest competitive picture

Octomind posted a farewell letter on April 30, 2026. The paid names still standing — QA Wolf ($60–250k/year per Bug0's analysis), Mabl, BrowserStack AI — sell real value: cloud parallelism, human triage, SOC 2, SLAs.

webtest-orch does not compete with any of those at scale. No managed cloud, no human review layer, no compliance certification. For solo devs and small teams already on Claude Code with no QA budget, it's a credible $0/mo option. For a 50-person team running cross-browser nightly regression — it isn't, and pretending otherwise would be dishonest.

The honest peer group is the free/OSS tier: Microsoft's native playwright init-agents --loop=claude (Planner/Generator/Healer triplet) and Magnitude (Apache/MIT, vision-AI framework). webtest-orch's deltas: out-of-the-box axe-core + console + network audit, bug fingerprinting with run-diff, and tracker mappings. None of those are in the free alternatives.

What I deliberately did not build

No self-healing. The QA community has been pushing back on self-healing — the failure mode is well-documented: healer picks a visually-similar-but-wrong element, test goes green, bug ships. webtest-orch prefers red over false-green.

No vendor cloud. Tests stay in your repo. Reports on your filesystem. If the npm package disappears tomorrow, your suite still runs.

No "AI writes all your tests" pitch. This is a complement, not a replacement, for engineering judgment. It's especially good at the boring 80%: a11y, console, network, responsive, regression diffs.

Two real validation runs

Both apps are public on GitHub — not synthetic benchmarks.

CreatmanCEO/portfolio (creatman.site) — static Next.js portfolio, mobile viewport. 4 real bugs found, 0 false positives: axe-core color-contrast failure on 8 elements (S1), two touch-targets under 24×24 px (S2), heading-jump h1→h3 on /projects (S2). Image budget burned in parent: zero.

CreatmanCEO/lingua-companion — voice-first AI language-learning SaaS in private beta (Next.js 16 + FastAPI + Supabase + WebSocket). 11 specs across login, chat, translation, TTS, settings, phrase library, scenario mode, stats, logout. 10/10 green after 4 iterations, ~12 min wall-clock. The dogfood round produced 6 fixes for v0.2.0.

Quick start (3 minutes)

npx webtest-orch@beta install

# If MCPs are missing:
claude mcp add --scope user playwright npx @playwright/mcp@latest
claude mcp add --scope user chrome-devtools npx chrome-devtools-mcp@latest

Drop a .env.test in your project with TEST_BASE_URL, restart Claude Code, say "test the app." The skill auto-detects authed vs public, scaffolds Playwright + axe-core, runs the exploratory pass, writes reports/<run-id>/index.html.

Status

0.3.1-beta — 113 tests, full CI on Linux/macOS/Windows. MIT license. Looking for early users on Linux/macOS — there's a 5-minute OS-compatibility report template in the issues.

Repo: github.com/CreatmanCEO/webtest-orch. License MIT.

The companion piece on the other invisible token drain in agent loops — hierarchical project context — is here: The context problem nobody talks about.

— Nick (Creatman). Full-stack dev, working with Claude Code daily on 15+ web apps. Open to remote opportunities — creatmanick@gmail.com

The context problem nobody talks about: why AI coding agents waste 80% of tokens on files they already read yesterday

Creatman — Fri, 17 Apr 2026 20:34:21 +0000

Every AI coding agent — Claude Code, Cursor, Codex, Gemini CLI — starts every session completely blind. It doesn't know your projects. It doesn't know your servers. It doesn't remember that you spent three hours yesterday debugging the payment system.

So it greps. It reads file after file. It SSHs into your server to check what's running. It asks you "which project?" for the hundredth time. By the time it's oriented, you've burned half your context window on reconnaissance.

I manage 15 projects across 4 VPS servers. Re-orienting the agent was costing me hours, and a large share of my context window, every day. So I built a fix.

The pattern: hierarchical context

The idea is dead simple. Instead of the agent searching bottom-up (grep everything → read files → build understanding), give it a top-down map:

Level 0: Project Map     — knows ALL your projects       (~2KB, always loaded)
Level 1: Project Detail  — architecture of one project   (~5KB, on demand)  
Level 2: Source Files     — actual code                   (only when needed)

That's it. Three files instead of fifty. The agent reads the map, knows where to look, and goes straight to the answer.

What this looks like in practice

Without hierarchy

You: "What payment methods does Project A support?"

Agent:

Greps C:\Users\ for anything payment-related (3 tool calls)
Finds 6 candidate files, reads them all (6 tool calls)
Realizes it's the wrong project, searches more (4 tool calls)
SSHs into your server to read the production config (2 tool calls)
Finally answers — 15+ tool calls, 80K+ tokens, 8 minutes

With hierarchy

You: "What payment methods does Project A support?"

Agent:

Reads Level 0 — sees Project A is at ~/projects/a/ (already loaded, 0 calls)
Reads ~/projects/a/CLAUDE.md — sees "Payments: Stars + CryptoCloud" (1 call)
Answers immediately — 1 tool call, ~15K tokens, 10 seconds

Same question. Same agent. Same model. The only difference is a 2KB file that says "here's where everything is."

Setting it up (10 minutes)

Step 1: Create your project map

Add this to your agent's global instruction file (~/.claude/CLAUDE.md for Claude Code, .cursorrules for Cursor, AGENTS.md for Codex):

## Project Map

| Project | Local path | Server |
|---------|-----------|--------|
| **Auth Service** | `~/projects/auth/` | prod-1:/root/auth/ |
| **Landing Page** | `~/projects/landing/` | Cloudflare Pages |
| **Mobile App** | `~/projects/mobile/` | — |
| **Admin Panel** | `~/projects/admin/` | prod-1 (Docker) |

### Servers
| Name | IP | Key |
|------|-----|-----|
| prod-1 | x.x.x.x | ~/.ssh/prod |
| staging | y.y.y.y | ~/.ssh/staging |

### Rule
Read project CLAUDE.md before reading source files.

This is your Level 0. It's ~2KB. The agent loads it automatically at the start of every session.
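The agent does this hop by reading the file, not by running code, but the lookup itself is simple enough to sketch. The Python below parses a project map in the table format shown above and resolves a project name to its Level 1 file. The function names are hypothetical.

```python
from typing import Optional

PROJECT_MAP = """\
| Project | Local path | Server |
|---------|-----------|--------|
| **Auth Service** | `~/projects/auth/` | prod-1:/root/auth/ |
| **Landing Page** | `~/projects/landing/` | Cloudflare Pages |
"""


def parse_project_map(markdown: str) -> dict[str, dict[str, str]]:
    """Read the three-column table into {project name: {path, server}}."""
    projects: dict[str, dict[str, str]] = {}
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip anything that is not a data row: wrong width, header, separator.
        if len(cells) != 3 or cells[0] == "Project" or set(cells[0]) <= {"-"}:
            continue
        name = cells[0].strip("*")
        projects[name.lower()] = {"path": cells[1].strip("`"), "server": cells[2]}
    return projects


def level1_file(projects: dict[str, dict[str, str]], name: str) -> Optional[str]:
    """Return the CLAUDE.md path for a project, or None if it is not on the map."""
    entry = projects.get(name.lower())
    return entry["path"].rstrip("/") + "/CLAUDE.md" if entry else None


projects = parse_project_map(PROJECT_MAP)
print(level1_file(projects, "Auth Service"))  # ~/projects/auth/CLAUDE.md
```

One lookup replaces the whole grep phase: the map tells the agent which single file to open next.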

Step 2: Add CLAUDE.md to each project

In each project root, create a context file:

# Auth Service — CLAUDE.md

## Status: LIVE
API for user authentication. Handles OAuth, JWT, rate limiting.

## Tech Stack
Python 3.12, FastAPI, PostgreSQL, Redis

## Key Files
- main.py — entry point, route registration
- auth/jwt.py — token generation and validation  
- auth/oauth.py — Google/GitHub OAuth providers
- models/user.py — SQLAlchemy user model

## Deployment
- Server: prod-1 (x.x.x.x)
- Service: auth-service.service
- Logs: journalctl -u auth-service -f

This is Level 1. ~3-5KB per project. The agent reads it when you mention the project and immediately knows the architecture.

Step 3 (optional): Add Graphify for code navigation

Graphify turns your codebase into a knowledge graph. Run it once per project:

pip install graphifyy
cd ~/projects/auth

In your AI agent:

/graphify .
graphify claude install

Now the agent has Level 1.5 — a structural map of your code. Before grepping, it consults the graph and knows exactly which file to read.

Step 4 (optional): Connect Claude Desktop via MCP

If you use Claude Desktop, add Graphify as an MCP server:

{
  "mcpServers": {
    "graphify": {
      "command": "python",
      "args": ["-m", "graphify.serve", "/path/to/graphify-out/graph.json"]
    }
  }
}

Desktop automatically calls query_graph when you ask about your projects. No prompting needed — it just works.

Real test results

I ran the same questions with and without the hierarchy. Same model (Haiku — the cheapest), same machine, same projects.

"What is the architecture of Project A?"

	Blind agent	With hierarchy
Tool calls	12	1
Behavior	Grep → read 4 files → build answer	Read CLAUDE.md → answer
Accuracy	Correct	Correct

"Which of my projects use library X?"

	Blind agent	With hierarchy
Tool calls	44	2
Behavior	Scan entire disk	Targeted grep in known paths
Accuracy	Missed 1 of 3 projects	Found all 3

"Where is Project B deployed? Service name? Logs?"

	Blind agent	With hierarchy
Tool calls	9	0
Behavior	Read configs + SSH into server	Answered from context
SSH needed	Yes	No

The blind agent in the second test ("Which of my projects use library X?") actually missed a project that the hierarchy-equipped agent found. More context didn't just save tokens; it produced better answers.

Why this works

AI coding agents are fundamentally search engines. When you ask a question, they search for the answer. The quality of the answer depends on the quality of the search.

Without context, the agent searches blind: grep everything, read everything, hope to find the right files. With a hierarchy, the search is directed: check the map, go to the right project, read the right file.

This isn't a new idea. It's how humans navigate codebases — you don't grep -r your company's entire monorepo every time someone asks about a service. You know which repo, which module, which file. The hierarchy gives the agent the same knowledge.

What this is NOT

Not a framework. It's a pattern — three markdown files.
Not a token compression tool. The savings come from not reading files, not from compressing them.
Not a replacement for Graphify. Graphify handles code-level navigation. This handles project-level navigation. They complement each other.
Not magic. If your project doesn't have a CLAUDE.md, the agent still greps. You have to write the context files.

The full setup

Templates, scripts, and multi-platform guides:

github.com/CreatmanCEO/ai-context-hierarchy

Includes:

Level 0 and Level 1 templates
Conversation indexing scripts (Claude Code sessions + Desktop export → searchable markdown)
VPS sync command template
Platform-specific setup for Claude Code, Cursor, Codex, Gemini CLI

Bonus: conversation indexing

Your past conversations with the AI contain architectural decisions, debugging sessions, deployment notes. But the agent can't search them.

The repo includes parsers that convert Claude Code session logs (.jsonl) and Claude Desktop exported chats into markdown files with YAML frontmatter:

---
title: "Fixed payment webhook"
date: 2026-04-14
project: auth-service
topics: ["webhook", "cryptocloud", "cloudflare"]
files_touched: ["payments.py", "webhook.py"]
---

Index these with Graphify and the agent can find "what did we decide about the payment flow last week" without you re-explaining it.
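For a sense of what such a parser involves, here is a minimal Python sketch, not the repo's actual script. The session-log schema (one JSON object per line with `message.role` and `message.content`) is an assumption on my part. The frontmatter fields mirror the example above.

```python
import json
from datetime import date


def render_frontmatter(
    title: str, day: date, project: str, topics: list[str], files_touched: list[str]
) -> str:
    """Build the YAML frontmatter block; JSON strings and lists are valid YAML here."""
    lines = [
        "---",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"date: {day.isoformat()}",
        f"project: {project}",
        f"topics: {json.dumps(topics, ensure_ascii=False)}",
        f"files_touched: {json.dumps(files_touched, ensure_ascii=False)}",
        "---",
    ]
    return "\n".join(lines)


def extract_text(jsonl: str) -> list[str]:
    """Pull plain-text turns out of a session log (assumed schema, see above)."""
    turns: list[str] = []
    for raw in jsonl.splitlines():
        if not raw.strip():
            continue
        message = json.loads(raw).get("message") or {}
        content = message.get("content")
        if isinstance(content, list):
            # Content blocks: keep text, drop tool calls and other block types.
            content = "\n".join(
                block.get("text", "")
                for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            )
        if isinstance(content, str) and content.strip():
            turns.append(f"**{message.get('role', 'unknown')}**: {content.strip()}")
    return turns


print(render_frontmatter(
    "Fixed payment webhook", date(2026, 4, 14), "auth-service",
    ["webhook", "cryptocloud", "cloudflare"], ["payments.py", "webhook.py"],
))
```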

Start here

Write a project map (5 minutes)
Add CLAUDE.md to your main project (5 minutes)
Ask the agent about your project in a new session
Watch it answer without grepping

That's the whole thing. No dependencies, no installation, no configuration. Just three markdown files that turn your blind agent into one that knows where to look.

Built with Graphify for code-level navigation. Source and templates: ai-context-hierarchy.

A Government Messenger Detects VPNs and Reports Server IPs. Here's How I Protected My Infrastructure.

Creatman — Wed, 18 Mar 2026 08:26:55 +0000

Credit First

Everything in this article about how Max's HOST_REACHABILITY module works is based on the research of others. I want to acknowledge them clearly:

"runetfreedom" — published the original deep analysis on Habr (March 5, 2026, 485K+ views, 1055+ upvotes) documenting the HOST_REACHABILITY module through JADX decompilation
RKS Global — independently verified the findings
RigOlit — published a detailed security analysis on GitHub (August 2025), later deleted it under pressure. The original repo was preserved in Web Archive
Denis Yagodin — conducted RAG analysis of decompiled code (Medium), discovered the fsb class and intentionally weakened TLS cipher suites
Scamshot Telegram channel — confirmed clipboard access and installed app list exfiltration through dynamic analysis
Novaya Gazeta Europe — verified RigOlit's analysis with technical experts before it was removed

These researchers took real personal risk to expose this. A security researcher in Russia who publishes critical analysis of a government-backed app faces genuine consequences — and one of them already had to delete their work.

My contribution is narrow but practical: I read their research, understood the threat to my VPN infrastructure, and built a working server-side protection system in one day.

What is Max and Why Should VPN Operators Care?

Max (formerly VK Messenger, ru.oneme.app) is developed by VK, Russia's largest social media company, which is controlled by Gazprom structures. Since September 2025, it has been mandatorily preinstalled on every smartphone sold in Russia. It claims 100 million registered users (according to VK's own data, which cannot be independently verified).

VK is run by Vladimir Kirienko — son of the First Deputy Chief of Staff of the Presidential Administration. Max has no end-to-end encryption and is legally required to integrate with SORM (FSB wiretapping infrastructure).

I run VPN infrastructure (VLESS+Reality with DPI bypass) for approximately 165 clients in Russia. When the HOST_REACHABILITY research was published, I realized my servers were directly at risk of being identified and blocked.

How HOST_REACHABILITY Works (Based on Published Research)

The following is a summary of findings published by "runetfreedom" on Habr and verified by RKS Global. I'm condensing it here for VPN operators who need to understand the threat model.

Step 1: VPN Detection

The module uses Android's NetworkCapabilities.TRANSPORT_VPN API to detect active VPN connections.

Step 2: IP Identification via 6 Services

If VPN is detected, the module queries six IP detection services simultaneously — three Russian, three foreign:

Service	Location
`ipv4-internet.yandex.net`	Russia
`ipv6-internet.yandex.net`	Russia
`ip.mail.ru`	Russia (VK-owned)
`ifconfig.me`	International
`api.ipify.org`	Cloudflare
`checkip.amazonaws.com`	AWS

The mix is deliberate: if a user has split-tunneling, Russian and foreign services return different IPs, revealing the VPN server address. For personal VPN servers where inbound and outbound IPs match, the server is directly exposed and tied to the user's identity.
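To see why the comparison exposes the server in both tunnel modes, here is the inference written out in Python. This is my reading of the threat model described above, not decompiled logic, and the addresses come from the reserved documentation ranges.

```python
def exposed_server_ips(
    vpn_active: bool, russian_view: set[str], foreign_view: set[str]
) -> set[str]:
    """Return the addresses an observer could attribute to the VPN server.

    russian_view -- client IPs reported by the Russian lookup services
    foreign_view -- client IPs reported by the foreign lookup services
    """
    if not vpn_active:
        return set()
    if russian_view == foreign_view:
        # Full tunnel: every service sees the VPN exit address.
        return set(foreign_view)
    # Split tunnel: Russian services see the ISP address, foreign ones the VPN exit.
    return foreign_view - russian_view


# Split tunnel: the two groups disagree, and the difference is the server.
print(exposed_server_ips(True, {"198.51.100.20"}, {"203.0.113.7"}))  # {'203.0.113.7'}
# Full tunnel: both groups agree on the server address.
print(exposed_server_ips(True, {"203.0.113.7"}, {"203.0.113.7"}))    # {'203.0.113.7'}
```

Either way the server address leaks once the payload reaches the collector, which is why the defense below focuses on stopping the upload rather than the lookups.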

Step 3: Censorship Bypass Test

The module tests reachability of blocked resources (Telegram, WhatsApp) using ICMP ping and TCP:443. If blocked resources are reachable — the user is bypassing censorship.

Step 4: Data Package

Everything is sent to api.oneme.ru via a proprietary binary protocol (10-byte header + MessagePack + LZ4), mixed with regular messenger traffic: VPN flag, server IP, host reachability results, carrier, userId, sessionId.

Step 5: Blocking Pipeline

Data flows: Max → api.oneme.ru → VK → Roskomnadzor → TSPU → IP blocked.

Critical Detail

The module is controlled by a server-side flag returned per-user during authentication — enabling targeted activation for specific accounts.

For full technical details — decompiled class names, opcode structure, ASN lists — read the original research on Habr.

What I Built: Two-Layer Server-Side Protection

After reading the research, I needed to protect my infrastructure without requiring any changes on the client side. Here's what I implemented in one day.

Layer 1: Domain Blocking via Xray Routing

Traffic to Max domains is dropped at the Xray level before leaving the VPN server. The HOST_REACHABILITY module can collect data on the device — but it can't send it home.

Domains blocked:

Domain	Reason	False positive risk
`domain:oneme.ru`	Core Max API — HOST_REACHABILITY payload channel	Zero
`domain:max.ru`	Max web platform	Zero
`full:ip.mail.ru`	VK-owned IP detection service used by the module	Minimal

Domains NOT blocked (to avoid breaking other apps):

vk.com, mail.ru — used by other VK ecosystem apps
ifconfig.me, api.ipify.org, yandex.net — legitimate services used by thousands of apps

Xray configuration:

{
  "outbounds": [
    { "protocol": "freedom", "tag": "direct" },
    { 
      "protocol": "blackhole", 
      "settings": { "response": { "type": "http" } }, 
      "tag": "block-max" 
    }
  ],
  "routing": {
    "domainStrategy": "AsIs",
    "rules": [
      {
        "type": "field",
        "domain": [
          "domain:oneme.ru",
          "full:ip.mail.ru",
          "domain:max.ru"
        ],
        "outboundTag": "block-max"
      }
    ]
  }
}

Critical: Sniffing must be enabled on every inbound — without it, Xray can't see the domain in TLS ClientHello:

"sniffing": {
  "enabled": true,
  "destOverride": ["http", "tls"]
}
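Since a single inbound without sniffing silently bypasses the domain rules, a small pre-deploy check can help. The Python sketch below scans a parsed Xray config and lists inbounds that lack the sniffing block shown above. The function name and the fallback label for untagged inbounds are my own.

```python
def inbounds_missing_sniffing(config: dict) -> list[str]:
    """Return tags of inbounds where sniffing is off or misses http/tls overrides."""
    missing: list[str] = []
    for index, inbound in enumerate(config.get("inbounds", [])):
        sniffing = inbound.get("sniffing") or {}
        overrides = set(sniffing.get("destOverride", []))
        if not sniffing.get("enabled") or not {"http", "tls"} <= overrides:
            # Untagged inbounds are reported by position instead.
            missing.append(inbound.get("tag", f"inbound[{index}]"))
    return missing


# Hypothetical config fragment with one correct and one unprotected inbound.
example = {
    "inbounds": [
        {"tag": "vless-reality", "sniffing": {"enabled": True, "destOverride": ["http", "tls"]}},
        {"tag": "vless-ws"},
    ]
}
print(inbounds_missing_sniffing(example))  # ['vless-ws']
```

In practice the dict would come from `json.load` on your real config file; an empty result means every inbound can see the domain it is routing.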

Layer 2: Real-Time Monitoring + Telegram Alerts

A Python daemon watches Xray access logs and sends instant Telegram alerts when a client's device tries to contact Max domains:

⚠️ MAX DETECTED
Client: [CLIENT_NAME]
Domain: api.oneme.ru:443
Server: xray-direct
Time: 2026-03-17 12:00:00
Traffic to Max blocked at Xray level.

Deduplication: one alert per client per hour. Runs as a systemd service.
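The core of such a daemon fits in a few functions. The sketch below is a simplified Python illustration, not my production code: the access-log line shape in the regex is an assumption (Xray's format varies between versions), and sending the message to Telegram plus tailing the log file are left out.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

BLOCKED_SUFFIXES = ("oneme.ru", "max.ru")  # domain: rules match subdomains too
BLOCKED_EXACT = ("ip.mail.ru",)            # full: rule matches this host only
DEDUP_WINDOW = timedelta(hours=1)

# Assumed line shape:
# 2026/03/17 12:00:00 203.0.113.7:51234 accepted tcp:api.oneme.ru:443 [in -> block-max] email: alice
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\S*\s.*?accepted\s+\w+:"
    r"(?P<host>[^:\s]+):(?P<port>\d+).*?email:\s*(?P<client>\S+)"
)


def is_max_domain(host: str) -> bool:
    host = host.lower().rstrip(".")
    if host in BLOCKED_EXACT:
        return True
    return any(host == s or host.endswith("." + s) for s in BLOCKED_SUFFIXES)


class AlertDeduper:
    """Allow at most one alert per client per window."""

    def __init__(self, window: timedelta = DEDUP_WINDOW) -> None:
        self.window = window
        self.last_alert: dict[str, datetime] = {}

    def should_alert(self, client: str, when: datetime) -> bool:
        last = self.last_alert.get(client)
        if last is not None and when - last < self.window:
            return False
        self.last_alert[client] = when
        return True


def process_line(line: str, deduper: AlertDeduper, server: str) -> Optional[str]:
    """Return alert text for a log line that hits a Max domain, else None."""
    match = LOG_LINE.match(line)
    if not match or not is_max_domain(match["host"]):
        return None
    when = datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S")
    if not deduper.should_alert(match["client"], when):
        return None
    return (
        "⚠️ MAX DETECTED\n"
        f"Client: {match['client']}\n"
        f"Domain: {match['host']}:{match['port']}\n"
        f"Server: {server}\n"
        f"Time: {when:%Y-%m-%d %H:%M:%S}\n"
        "Traffic to Max blocked at Xray level."
    )
```

Keeping the dedup state keyed by client rather than by domain means a device that probes several Max hosts in a burst still produces a single alert.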

Deployment

Applied to two Xray services:

VLESS+Reality (direct connection)
VLESS+WebSocket via Cloudflare

All ~165 clients protected automatically. Zero client-side changes. Hiddify, v2rayNG, Streisand continue working normally.

Limitations

DoH / Hardcoded IPs: If Max switches to DNS-over-HTTPS or hardcoded IPs, domain-based blocking fails. Fallback plan: iptables DROP for VK IP ranges (AS47541, AS47764).
Split-tunneling: If a user only routes foreign traffic through VPN, Max reaches api.oneme.ru directly. Server-side protection can't help. Recommendation: full tunnel.
Module updates: When Max changes domains or protocols, the blocklist needs updating. Monitoring helps detect new patterns.
The only guarantee: Don't install Max on a device that connects to VPN.

The Bigger Picture

After the Habr publication went viral, VK released an update removing the Telegram/WhatsApp checks — but the module code was not removed from the APK. It's dormant, waiting.

The researchers who exposed this took real risk. One deleted their work under pressure. The least we can do as infrastructure operators is take their findings seriously and build actual protections.

If you operate VPN infrastructure with users in Russia — the Xray configuration above takes 10 minutes to deploy and protects all your clients immediately.

References

Original research by "runetfreedom" — Habr, March 5, 2026
RKS Global — independent verification
RigOlit — GitHub analysis (August 2025, deleted, preserved in Web Archive)
Denis Yagodin — RAG analysis of decompiled code (Medium)
Scamshot — dynamic analysis (Telegram channel)
Novaya Gazeta Europe — investigative reporting
Human Rights Watch, Reporters Without Borders, Access Now — policy analysis

I'm Creatman — I build solutions for problems I encounter. When researchers expose a threat, I build the defense.

GitHub: github.com/CreatmanCEO | Portfolio: creatman.site | LinkedIn: linkedin.com/in/creatman

I Stopped Claude Code From Breaking My Projects. Here's the Exact Setup

Creatman — Thu, 05 Mar 2026 16:02:00 +0000

A practical guide to anti-regression AI coding with Claude Code, subagents, hooks, and Google Antigravity — from someone who was ready to quit.

Six months ago, my workflow with Claude Code looked like this: build a working prototype in an hour, spend the next three hours watching Claude systematically destroy it while "improving" things. Every developer using AI coding agents knows this pattern. You ask for a small change, and suddenly your auth module is rewritten, three tests are deleted, and there's a new dependency you never asked for.

I'm a full stack developer running my own company (CREATMAN), and I code solo most of the time. I can't afford to babysit an AI agent on every keystroke. I needed a setup where Claude Code would help me ship faster without regressing what already works.

After weeks of research, community deep-dives, and a lot of trial and error, I found a combination that actually works. This article is everything I learned — the tools, the config files, and the exact workflow I use daily.

Why Claude Code Breaks Things (It's Not What You Think)

The root cause isn't that Claude is "dumb." It's context window exhaustion.

Claude Code operates in a ~200K token window. About 80% of that gets consumed by file reads and tool results — not your conversation. By the time you're at 90% utilization, Claude literally cannot hold your project's architecture in its working memory anymore. It forgets patterns from earlier in the session, contradicts its own decisions, and starts generating code that conflicts with what it wrote 20 minutes ago.

The community calls this "context drift" and it's the #1 source of regressions. The fix isn't "write better prompts" — it's architectural.

The Stack: Antigravity + Claude Code + Subagents

Here's what I landed on after testing Cursor, VS Code, Warp, Zed, and several other setups.

Google Antigravity as the IDE

Antigravity is Google's free, agent-first IDE (a VS Code fork). I chose it for three reasons:

Free. I'm already paying $100/month for Claude Max — I'm not adding Cursor on top.
VS Code compatible. My extensions, keybindings, and theme transferred over in one click.
Built-in browser agent. Antigravity's Gemini 3 Pro agent can autonomously open a browser, navigate your app, and test UI. No Playwright setup needed for basic visual checks.

I run Claude Code in Antigravity's terminal with my Max subscription. Gemini 3 Pro runs in the Agent Manager panel. Two AI brains, one IDE, zero extra cost.

Claude Code with Anti-Regression Config

The terminal Claude Code is where real development happens. But instead of running it raw, I set up three layers of protection.

Layer 1: CLAUDE.md — The File That Changes Everything

If you take one thing from this article, make it this. CLAUDE.md is a special file that Claude reads at the start of every session. It survives compaction. It's your project's constitution.

Here's my actual template:

# Project Name

## Architecture
- **Frontend**: React + TypeScript
- **Backend**: Python 3.11, FastAPI
- **Database**: PostgreSQL
- **Deploy**: Docker on VPS

## Key Commands
- `make dev` — Start development servers
- `make test` — Run full test suite
- `make lint` — Lint everything

## CRITICAL RULES
- NEVER delete or rewrite working tests without explicit request
- NEVER delete files without confirmation
- ALWAYS run tests after any code change
- ALWAYS do git checkpoint before large refactors
- One task at a time. Do NOT make multiple changes simultaneously
- If unsure — ASK, don't guess

## Working Style
- Plan FIRST, code SECOND. Never start coding without confirmed plan
- Small diffs. One file → tests → next file
- After every change: run tests and show results
- Use subagents for codebase research

The CRITICAL RULES section is where the magic happens. Claude follows these instructions with remarkable consistency — because they're injected before every conversation, not buried 50 messages deep where they'd get compacted away.

Key insight from SFEIR Institute's research: 60% of Claude Code support tickets come from the "ghost context" anti-pattern — working without a CLAUDE.md. A simple CLAUDE.md resolves the issue in 90% of cases.

Layer 2: Subagents — Isolated AI Specialists

Subagents are Claude Code's most underrated feature. Each runs in its own context window and returns only a summary to your main session. This means:

Research doesn't pollute your implementation context
A reviewer can check code without knowing (or forgetting) what the planner decided
You can run them in parallel

I use three subagents. Create them as markdown files in .claude/agents/:

.claude/agents/planner.md — Researches the codebase and writes implementation plans. Never writes code. Saves plans to ./plans/ so other agents (and future sessions) can reference them.

.claude/agents/tester.md — Writes tests following existing patterns, runs the full test suite (not just new tests), and reports regressions.

.claude/agents/code-reviewer.md — Reviews changes for regressions, security issues, and pattern violations. Outputs severity-rated findings with file:line references.

The key is adding a reminder to your CLAUDE.md:

## Agents
- Use `planner` agent before any complex task
- Use `tester` agent after code changes
- Use `code-reviewer` agent before commits

Without this reminder, Claude tends to do everything in the main context — which is exactly what causes context bloat and regressions.

Layer 3: Hooks — Automated Safety Nets

Hooks fire automatically on Claude Code lifecycle events. My most valuable hook blocks commits when tests fail:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash(git commit*)",
        "hooks": [
          {
            "type": "command",
            "command": "python -m pytest tests/ -x --timeout=60 || (echo {\"block\": true, \"message\": \"Tests failing.\"} 1>&2 && exit 2)",
            "timeout": 120
          }
        ]
      }
    ]
  }
}

This creates a hard gate: Claude literally cannot commit code that breaks tests. No exceptions, no "I'll fix it later." The feedback loop forces it to fix regressions before moving forward.
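If the matcher on your Claude Code version only matches tool names, the same gate can be moved into a script that inspects the command itself. The Python sketch below relies on the hook interface as I understand it (the event arrives as JSON on stdin with `tool_name` and `tool_input`, and exit code 2 blocks the call with stderr reported back to the agent). Treat those details, and the function names, as assumptions to verify against the hooks reference.

```python
import json
import subprocess
import sys
from typing import Callable


def is_git_commit(event: dict) -> bool:
    """True when the pending tool call is a Bash command that runs `git commit`."""
    if event.get("tool_name") != "Bash":
        return False
    command = (event.get("tool_input") or {}).get("command", "")
    return "git commit" in command


def gate(event: dict, run_tests: Callable[[], bool]) -> int:
    """Return the hook exit code: 0 lets the call proceed, 2 blocks it."""
    if not is_git_commit(event):
        return 0
    return 0 if run_tests() else 2


def run_pytest() -> bool:
    """Run the suite, stopping at the first failure; explain on stderr if it fails."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "tests/", "-x"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print("Tests failing. Fix them before committing.", file=sys.stderr)
    return result.returncode == 0


def main() -> None:
    """Entry point when saved as a hook script: read the event, exit with the verdict."""
    sys.exit(gate(json.load(sys.stdin), run_pytest))
```

Saved as a script with a final `main()` call, this would be registered as the hook command under a plain `Bash` matcher. Non-commit commands exit 0 immediately, so the test suite only runs when a commit is attempted.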

The Daily Workflow

Here's how an actual development session looks:

1. Start the session

Open Antigravity. Terminal. claude. Claude reads CLAUDE.md automatically.

2. Before any complex task — plan first

> Use the planner agent to research our codebase and create
> an implementation plan for [feature]. Do NOT write any code.

Review the plan. Question assumptions. Adjust. Only then:

> Implement Step 1 from the plan. Run tests after.

3. Monitor context religiously

Check /context periodically (/cost reports spend, not how full the context window is)
At 60-70% context → /compact "Preserve: modified files list, test results, current plan step"
Switching topics → /clear

4. Use checkpoints as undo

Before risky changes: git add -A && git commit -m "checkpoint: before auth refactor"

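If Claude breaks something: Esc + Esc opens the rewind menu, where you can restore code only. Or git reset --hard HEAD, plus git clean -fd if Claude also created new files, since reset leaves untracked files in place.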
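If you script this step, a minimal sketch could look like the following. The helper names are hypothetical, and the runner parameter exists only so the commands can be inspected without touching a real repository.

```python
import subprocess


def checkpoint_commands(label):
    """Commands that stage everything and record a checkpoint commit."""
    return [
        ["git", "add", "-A"],
        ["git", "commit", "-m", f"checkpoint: {label}"],
    ]


def rollback_commands(include_untracked=False):
    """Commands that return the working tree to the last commit."""
    commands = [["git", "reset", "--hard", "HEAD"]]
    if include_untracked:
        # Destructive: also deletes new files and directories git is not tracking.
        commands.append(["git", "clean", "-fd"])
    return commands


def run_all(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for command in commands:
        # Note: `git commit` exits non-zero when there is nothing to commit.
        runner(command, check=True)
```

For example, run_all(checkpoint_commands("before auth refactor")) before the risky change, and run_all(rollback_commands(include_untracked=True)) if it goes wrong.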

5. Review before commit

> Use code-reviewer agent to check all changes
> Use tester agent to run the full test suite

6. Parallel sessions for complex work

Split terminal in Antigravity:

Terminal 1: Claude Code — main implementation
Terminal 2: Claude Code — tests in parallel
Terminal 3: dev server

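Two Claude sessions = two independent context windows, effectively 400K tokens at 200K each (the dev server terminal is not a Claude session). CLAUDE.md serves as shared memory between them, since each session loads it at startup.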

The Antigravity Bonus: Two Brains

While Claude handles the heavy coding in the terminal, I use Antigravity's Gemini 3 Pro for:

UI testing via built-in browser — "Test the login flow on localhost:3000 and screenshot any errors"
Second opinion — paste Claude's plan into Gemini's agent panel and ask for critique
Documentation — Gemini is solid at generating docs from code

This dual-model approach gives you the coding power of Claude and the planning/context strengths of Gemini without paying for two subscriptions.

MCP Servers Worth Installing

For browser automation beyond what Antigravity's built-in browser offers:

claude mcp add playwright -- npx -y @executeautomation/playwright-mcp-server

Now Claude can open browsers, take screenshots, click elements, and verify UI — all through natural language.

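For GitHub integration (the server authenticates with a personal access token read from the GITHUB_PERSONAL_ACCESS_TOKEN environment variable):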

claude mcp add github -- npx -y @modelcontextprotocol/server-github

Verify everything connected with /mcp in Claude Code.

What Actually Changed

Before this setup, I'd lose 2-3 hours per day to regression whack-a-mole. Claude would "fix" one thing and break two others. I'd context-switch between debugging Claude's mistakes and actually building features.

After implementing CLAUDE.md + subagents + hooks + checkpoints:

Regressions dropped dramatically. The hook that blocks commits on failing tests alone is worth the entire setup time.
Sessions are productive longer. Subagents keep the main context clean. I can work for 2+ hours before needing to compact.
Recovery is instant. Checkpoints + git mean I'm never more than 10 seconds away from a working state.
I actually trust the output. The planner → implement → review → test pipeline catches issues before they compound.

Quick Start (15 Minutes)

If you want to try this today, grab the ready-to-use configs from the GitHub repo or set it up manually:

Create CLAUDE.md in your project root with your architecture, commands, and CRITICAL RULES
Create .claude/agents/tester.md — even just one subagent for testing makes a huge difference
Add the commit-blocking hook to .claude/settings.json
Start every complex task with "Create a plan first. Do NOT write code."
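A quick way to confirm the first three items are in place is a small check script. This is a sketch under my own assumptions: the function name is hypothetical, and it only verifies that the files exist and that settings.json has some PreToolUse hook configured, not that the hook does the right thing. The fourth item is a habit, so there is nothing to check.

```python
import json
from pathlib import Path


def check_setup(project_root="."):
    """Return a dict mapping each checkable quick-start item to True or False."""
    root = Path(project_root)
    settings_path = root / ".claude" / "settings.json"
    has_hook = False
    if settings_path.is_file():
        try:
            settings = json.loads(settings_path.read_text(encoding="utf-8"))
            # Any non-empty PreToolUse list counts; the hook's content is not inspected.
            has_hook = bool(settings.get("hooks", {}).get("PreToolUse"))
        except (json.JSONDecodeError, AttributeError):
            has_hook = False  # malformed or unexpectedly shaped settings file
    return {
        "CLAUDE.md": (root / "CLAUDE.md").is_file(),
        "tester agent": (root / ".claude" / "agents" / "tester.md").is_file(),
        "PreToolUse hook": has_hook,
    }


def report(status):
    """Format the result of check_setup as one line per item."""
    return "\n".join(f"[{'ok' if ok else 'missing'}] {item}" for item, ok in status.items())
```

Running print(report(check_setup())) from the project root lists which pieces are still missing.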

That's it. These four changes will transform your Claude Code experience more than any model upgrade or IDE switch.

Resources

claude-code-antiregression-setup — All configs from this article: CLAUDE.md template, subagents, hooks, rules, workflow docs
awesome-claude-code — Curated list of skills, hooks, agents, commands
claude-code-workflows — Production-ready workflow plugins (17 agents, 10 commands)
claude-code-templates — CLI to install 100+ ready-made agents and hooks
Claude Code Official Docs: Memory — How CLAUDE.md and auto-memory work
Claude Code Official Docs: Best Practices — Anthropic's own recommendations

I'm Nick, a full stack developer and digital architect at CREATMAN. I build backend systems, AI tools, and VPN infrastructure. The complete anti-regression setup from this article — CLAUDE.md template, all three subagents, hooks, rules, and workflow docs — is available as a ready-to-use repo: claude-code-antiregression-setup. Clone it, fill in the template, and start shipping without fear. I'm open to remote opportunities — find me on GitHub.