As AI agents write more of our production code, I started asking a simple question: who's testing the code the AI just wrote?
Not unit tests. We still have those. Not Playwright E2E suites. Those too. I'm talking about a new layer that sits alongside everything else. One that the AI itself writes as it builds features.
Here's what we built.
## The problem
AI agents already write E2E tests. Playwright, Cypress. The agent generates the code, it runs, it passes. Job done, right?
Not quite. Those tests are deterministic. They assert on exact selectors, exact text, exact DOM structure. The agent that wrote `page.locator('#sidebar-nav > ul > li:nth-child(3)')` has baked in an assumption about the HTML that will break the moment another agent (or a human) touches that component. The test didn't get worse. The UI moved on and the test couldn't keep up.
This is the real problem: deterministic tests written by AI are still brittle tests. The AI just writes them faster. It doesn't make them less fragile.
What we actually needed was a test that behaves the way a human tester does. Look at the screen, find the login button (wherever it is), click it, check what happens. Not "find element with ID btn-submit" but "find the button that says Sign In."
## The solution: markdown test files
Each regression test is a markdown file. Plain English. Structured steps. No code.
Here's a real test from our suite:
```markdown
# Test 001: Login as SuperAdmin

| Field | Value |
|--------------|--------------------------|
| **ID** | AI-REG-001 |
| **Priority** | P0 (Critical) |
| **Area** | Authentication |
| **Requires** | testData/LoginCreds.json |

## Steps

### Step 1: Navigate to the app
- **Action**: Open browser and navigate to the app URL
- **Expected**: Since the user is not authenticated, the app redirects to /login
- **Verify**: URL ends with /login

### Step 2: Verify login page elements
- **Action**: Take a snapshot of the login page
- **Expected**: The login form is visible with an email input, password input, and Sign In button
- **Verify**: All three elements are present

### Step 3: Enter SuperAdmin email
- **Action**: Fill the email input with the email from test data
- **Expected**: The email appears in the input field

### Step 4: Enter SuperAdmin password
- **Action**: Fill the password input with the password from test data
- **Expected**: The password field shows masked characters

### Step 5: Submit the login form
- **Action**: Click the Sign In button
- **Expected**: Page redirects to the dashboard
- **Verify**: URL is now / (no longer /login)

### Step 6: Verify SuperAdmin nav items
- **Action**: Inspect the sidebar navigation
- **Expected**: SuperAdmin-only items are visible: Manage DB, People Management, AI Models, AI Logs, AI Dashboard
- **Verify**: At least 3 of the 5 SuperAdmin menu items are present

### Step 7: Verify user role badge
- **Action**: Click on the user avatar to open the dropdown
- **Expected**: Dropdown shows full name, email, and role badge
- **Verify**: Role badge text contains "SUPERADMIN"
```
No selectors. No XPath. No `page.locator('#email-input')`. Just descriptions of what a human would do and see.
## How it runs: agent-browser
The execution engine is agent-browser, an open-source CLI from Vercel built for AI agents to automate browsers.
An AI agent reads the markdown file, then translates each step into agent-browser commands:
```shell
# Step 1: Navigate
agent-browser open "https://myapp.azurestaticapps.net/"
agent-browser wait --load networkidle

# Step 2: Discover what's on the page (accessibility snapshot)
agent-browser snapshot -i
# Returns: @e1 heading "Sign In", @e2 textbox "Email", @e3 textbox "Password", @e4 button "Sign In"

# Step 3: Fill email
agent-browser fill @e2 "admin@example.com"

# Step 4: Fill password
agent-browser fill @e3 "secretpassword"

# Step 5: Click sign in
agent-browser click @e4
agent-browser wait --load networkidle

# Step 6: Check what's on the dashboard
agent-browser snapshot -i
# Returns the full accessibility tree, agent checks for nav items

# Step 7: Open user dropdown (Radix UI popover)
agent-browser click @e15
agent-browser snapshot -s "[data-radix-popper-content-wrapper]"
# Agent checks for "SUPERADMIN" in the popover content
```
The important bit: snapshot -i returns an accessibility tree with reference IDs (@e1, @e2). The agent finds elements by their accessible name and role, not by CSS selectors. If someone renames a CSS class or reorders the DOM, the test still passes. It's testing what the user sees, not how the HTML is structured.
## Writing tests as features get built
Here's where it gets interesting. We set up our AI coding workflow so that when the agent builds a new feature, it also writes a regression test for that feature. Same session. Same context.
The /create-ai-test command walks the agent through:
- Read the React source for the UI component being tested
- Identify the user journey
- Write the structured markdown test
- Validate against our test principles
So the agent doesn't just ship code. It writes the test coverage in the same session. Each feature comes with its own regression test, automatically.
## Running tests in parallel
Tests are independent by design, so we run them in parallel:
```
/run-ai-tests P0
```
This launches one AI agent per test, each with its own isolated browser session:
```shell
# Test 001 runs in its own session
agent-browser --session test-001 open "https://..."
agent-browser --session test-001 snapshot -i

# Test 002 runs simultaneously in a separate session
agent-browser --session test-002 open "https://..."
agent-browser --session test-002 snapshot -i
```
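The orchestration itself stays simple because each test owns its session and shares nothing. A hedged Python sketch, where `run_one` stands in for whatever drives a single agent through one markdown test; all names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_suite(
    test_files: list[str],
    run_one: Callable[[str, str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run each markdown test in its own isolated session, in parallel.

    run_one(test_file, session) drives one agent through one test and
    returns 'PASS' or 'FAIL'. Session names follow the --session pattern
    from the shell example above.
    """
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_one, f, f"test-{i:03d}"): f
            for i, f in enumerate(test_files, start=1)
        }
        for future, test_file in futures.items():
            results[test_file] = future.result()
    return results
```

Because tests never share a session, a failure in one can't poison another, and adding a test scales the suite horizontally rather than lengthening a serial run.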
Results get recorded back into each test file and into a central test log:
```markdown
## Last Run

| Field | Value |
|----------------|--------------------|
| **Timestamp** | 2026-03-08 23:35 |
| **Result** | PASS |
| **Steps** | 7/7 passed |

### Step Results

| Step | Description | Result | Notes |
|------|----------------------------|--------|-----------------------------------------|
| 1 | Navigate to the app | PASS | Redirected to /login as expected |
| 2 | Verify login page elements | PASS | Email, Password, Sign in button present |
| 3 | Enter SuperAdmin email | PASS | Filled via fill @e2 |
| 5 | Submit the login form | PASS | Redirected to / after networkidle wait |
| 7 | Verify user role badge | PASS | Radix popover shows "SUPERADMIN" badge |
```
## Where this sits in the test stack
This doesn't replace anything. It's an additional layer.
- **Unit tests** still test logic in isolation. AI writes these too. They're deterministic, and that's fine because they're testing pure functions, not UI.
- **E2E tests (Playwright/Cypress)**: AI writes these as well. They're precise, fast, and good at catching exact regressions. But they're tightly coupled to the DOM. Every refactor risks breaking them.
- **AI regression tests (markdown)**: the new layer. Tests that behave like a human. Find the button by what it looks like, not what it's called in the code. They survive refactors and component library swaps because they test the experience, not the implementation.
The question each layer answers:
| Layer | Question |
|---|---|
| Unit tests | Does the logic work? |
| E2E tests | Does the code work exactly as written? |
| AI regression tests | Does the app work the way a user expects? |
## The elephant in the room (what QA will correctly point out)
If you've been in testing for a while, your alarm bells are probably ringing. Yes, there are pros and cons. Let's address the valid criticisms:
1. "You haven't solved brittleness, you just reinvented Cucumber/BDD."
Writing tests in plain English isn't new. But in old BDD workflows, developers still had to write the glue code to explicitly map those English steps to CSS selectors. Here, there is no glue code. The agent translates the English intent directly against the live accessibility tree at runtime.
2. "These tests are going to be slow and expensive."
Yes, they are. Playwright runs a 100-step test in 3 seconds. An AI interpreting the DOM and making decisions takes minutes and costs API tokens. That's why these don't replace E2E suites, and they don't run on every single PR commit. They run on nightly builds or release branches to validate the human-experience layer, and at build time as a feedback loop for the builder agents.
3. "Determinism is a feature! I want the test to fail if the DOM structure changes."
Absolutely. If a developer accidentally strips out ARIA attributes or breaks element semantics, your standard E2E suite will (and should) catch it instantly. But when the app is functionally healthy yet visually or structurally reorganised, AI tests survive the refactor where deterministic tests shatter.
4. "AI hallucinates. That means flaky tests."
LLMs aren't entirely deterministic. Sometimes they might struggle if multiple elements have identical accessible names. But here's the twist: if an AI utilising the accessibility tree gets confused by your UI, a human using an assistive device probably will too. It forces you to build genuinely accessible, semantic interfaces.
5. "Natural language is ambiguous. How do you know it's testing what you think it's testing?"
Fair point, and this is why the tests use a structured format (Action / Expected / Verify) rather than freeform prose. "Verify: URL ends with /login" is not ambiguous. "Verify: at least 3 of the 5 SuperAdmin menu items are present" is not ambiguous. The structure constrains the language enough that the agent knows exactly what to check. Freeform "test the login page" would be a problem. We don't do that.
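Those two example checks really are mechanical once the agent has the current URL and a snapshot in hand. A small sketch of what "unambiguous" means in practice; the helper names are purely illustrative:

```python
def url_ends_with(url: str, suffix: str) -> bool:
    """'Verify: URL ends with /login' as a mechanical check.
    Query strings are ignored so '/login?next=/' still matches."""
    return url.split("?", 1)[0].endswith(suffix)

def at_least_n_present(snapshot_text: str, items: list[str], n: int) -> bool:
    """'Verify: at least 3 of the 5 SuperAdmin menu items are present'.
    Case-insensitive substring match against the accessibility snapshot."""
    found = sum(1 for item in items if item.lower() in snapshot_text.lower())
    return found >= n
```

The agent supplies the judgment (finding the elements); the verification itself leaves no room for interpretation.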
6. "The real problem is bad test architecture, not bad tools."
Agreed. If your existing tests use copy-pasted CSS selectors from the browser inspector, AI won't fix that. We still use proper test IDs, roles, and aria attributes in our E2E suite. The markdown tests are a separate layer for a separate question: does the app behave the way a user expects after a refactor?
## Why this matters now
AI agents are writing more production code every month. The more autonomous the coding becomes, the more you need quality loops that are written at build time, not bolted on after the fact. Tests that a PM can review without knowing JavaScript. Tests that check behavior, not implementation. Tests where the file itself documents what the feature should do.
The agents that ship reliable software won't just be the ones that write good code. They'll be the ones that write their own quality checks as they go.
## Try it yourself
- Install agent-browser: `brew install agent-browser`
- Create a markdown test file describing a user journey in your app
- Have your AI coding agent execute it step by step
- Record the results back into the file
Start with login. It's the simplest journey and immediately proves the concept.
I'm a developer who can't stop tinkering with how AI agents build software. Early adopter of GenAI tooling, currently obsessed with the feedback loops between AI-written code and AI-driven quality. I've recently started sharing my thoughts on what I'm learning.
Fav Quote Today: The early adopters always look reckless. They also always have a head start.