Agent-Driven E2E Testing with Cypress: A Practical Guide to Harness Engineering with Cursor Subagents

Darpan Shah — Tue, 07 Apr 2026 22:13:28 +0000

Teams have done end-to-end testing deliberately for years: exploring the app, writing tests from what they see, fixing failures in focused sessions. That's skilled work, not guesswork.

The hard part is usually organizational. Knowledge sits in people's heads or scattered across chat histories and tickets. What you see on a live screen is tough to describe clearly to whoever writes the automated test. Each new flow forces everyone to reload the same context from scratch.

Agent-driven development doesn't replace that judgment. It packages skilled work into narrow roles (explore, implement, execute, repair) with clear inputs and outputs. Quality builds over time instead of starting from zero every sprint.

This approach mirrors harness engineering: the system around the agents that makes them reliable, not just capable.

What Is a Harness, and Why Does It Matter?

The term "harness" has emerged as shorthand for everything in an AI agent system except the model itself. Put simply: Agent = Model + Harness. Anthropic's engineering research states the problem this way: "The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before." Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. Without structure, agents drift, repeat work, or declare victory too early.

Their solution? A two-fold approach: an initializer agent that sets up the environment on the first run, and a coding agent that makes incremental progress in every session, while leaving clear artifacts for the next session.

When it comes to coding agents, Martin Fowler's team breaks harness engineering into key components:

Context engineering: Provides us with the means to make guides and sensors available to the agent.
Architectural constraints: Rules that mechanically enforce quality (not just suggestions).
Feedback loops: The human's job is to steer the agent by iterating on the harness. Whenever an issue happens multiple times, the feedforward and feedback controls should be improved to make the issue less probable or even prevent it.

Here's the counterintuitive insight: increasing trust and reliability in AI-generated code requires constraining the solution space rather than expanding it. Narrow roles, explicit handoffs, and clear boundaries make agents more productive, not less.

How This Applies to E2E Testing with Cypress

This article describes four agents specialized for E2E testing using Cypress and how they form a closed loop:

cypress-browser-explorer: Maps UI flows with live browser tooling
cypress-builder: Implements specs per team conventions
cypress-runner: Executes tests consistently
cypress-debugger: Classifies failures and applies fixes

Each agent produces a structured artifact (exploration report, spec file, run summary, debug notes) that becomes the input for the next agent. This is the harness in action: each step creates a plan that keeps the next agent on track.

In Cursor, each of these agents maps directly to a custom subagent -- a markdown file in .cursor/agents/ with a name, description, and focused prompt. The explorer subagent leverages Cursor's built-in browser tool to navigate your app, take snapshots, read the live DOM, and capture network activity without leaving the IDE. That means the exploration report isn't hand-written -- it's generated from real page state.

It seems reasonable that specialized agents like a testing agent, a quality assurance agent, or a code cleanup agent could do an even better job at sub-tasks across the software development lifecycle. That's exactly what this workflow does for E2E automation when Cypress is your tool.

Evidence from the real UI flows into code. Code gets verified by a standard test run. Failures get handled with clear escalation rules instead of improvisation.

The Feedback Loop

The loop in one sentence: Explore → build → run; on failure, debug and re-run; if the UI changed, explore again and rebuild.

This closed loop is where the efficiency gains come from:

Less rework: Selectors and URLs come from live exploration, not memory
Faster green builds: Runner standardizes execution; debugger applies evidence-based fixes
Clear escalation: Stale DOM leads to re-explore; flaky patterns get documented
Single-test discipline: Fix one failure, re-run, then move on
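
The loop described above can be sketched as a small transition table. This is only an illustration of the routing, not how Cursor implements delegation: the agent names come from this article, while the outcome labels (`done`, `pass`, `fail`, `fixed`, `stale-dom`) and the `nextAgent` helper are hypothetical.

```javascript
// Routing table for the closed loop. Outcome labels are assumptions made
// for this sketch; only the agent names come from the article.
const TRANSITIONS = {
  "cypress-browser-explorer": { done: "cypress-builder" },
  "cypress-builder": { done: "cypress-runner" },
  "cypress-runner": { pass: null, fail: "cypress-debugger" },
  "cypress-debugger": {
    fixed: "cypress-runner",
    "stale-dom": "cypress-browser-explorer",
  },
};

const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Returns the next agent name, or null when the loop is finished.
// Throws on a combination the table does not allow, so an unexpected
// handoff fails loudly instead of being improvised.
function nextAgent(current, outcome) {
  if (!hasOwn(TRANSITIONS, current) || !hasOwn(TRANSITIONS[current], outcome)) {
    throw new Error(`No transition from ${current} on "${outcome}"`);
  }
  return TRANSITIONS[current][outcome];
}

console.log(nextAgent("cypress-runner", "fail")); // cypress-debugger
```

Keeping the table explicit makes the "Must Not" rules checkable: the runner has no `fixed` outcome, so it cannot route around the debugger.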

The Four Agents at a Glance

Agent	Role	Primary Inputs	Primary Outputs	Must Not
cypress-browser-explorer	Map scoped UI flows using Cursor's browser tool	URL, steps, ticket scope	Exploration report with selectors, network map	Wander outside scope; invent selectors without proof
cypress-builder	Implement specs per team rules	Exploration report	Spec and support code; handoff to runner	Skip exploration for unfamiliar pages
cypress-runner	Execute tests consistently	Spec path, tags/env	Pass/fail summary with failure context	Fix failing tests (send to debugger)
cypress-debugger	Classify failures, apply fixes	Failure output, artifacts	Code changes; handoff to runner or explorer	Invent selectors when DOM has changed

Important: These agents are blueprints, not universal standards. Your stack, auth flow, and naming conventions will differ. Expect to:

Edit agent instructions to reference your scripts and config
Pair agents with project rules (lint, selector policy, test ID format)
Add or trim steps where your org needs tighter guardrails

The value is the shape of the workflow and clean handoffs, not a one-size-fits-all prompt.

Handoff Templates: Structured Artifacts That Bridge Context

The key insight here was finding a way for agents to quickly understand the state of work when starting with a fresh context window. Structured handoffs are what prevent "context amnesia" between agents.

Explorer → Builder

## Handoff to cypress-builder

Prompt: "Create cypress/e2e/[feature].cy.js using this exploration report:
- Scope source: [quote from ticket/steps]
- URL map: [ordered list]
- Selector inventory: [element, purpose, selector, stability]
- Network map: [method, pattern, suggested alias]"
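
If the exploration report is kept as structured data, the handoff block above can be rendered from it rather than retyped. The sketch below assumes a report object whose field names (`feature`, `scopeSource`, `urlMap`, `selectors`, `network`) are invented for illustration; the sample values are hypothetical too.

```javascript
// Renders the Explorer -> Builder handoff from a report object.
// The report shape is an assumption of this sketch, not a Cursor format.
function renderBuilderHandoff(report) {
  const selectors = report.selectors
    .map((s) => `${s.element} | ${s.purpose} | ${s.selector} | ${s.stability}`)
    .join("; ");
  const network = report.network
    .map((n) => `${n.method} ${n.pattern} as ${n.alias}`)
    .join("; ");
  return [
    "## Handoff to cypress-builder",
    "",
    `Prompt: "Create cypress/e2e/${report.feature}.cy.js using this exploration report:`,
    `- Scope source: ${report.scopeSource}`,
    `- URL map: ${report.urlMap.join(" -> ")}`,
    `- Selector inventory: ${selectors}`,
    `- Network map: ${network}"`,
  ].join("\n");
}

// Hypothetical example data.
const sampleReport = {
  feature: "checkout",
  scopeSource: "Pasted steps: add item, open cart, place order",
  urlMap: ["/cart", "/checkout", "/order/confirmation"],
  selectors: [
    {
      element: "Place order button",
      purpose: "submit order",
      selector: '[data-cy="place-order"]',
      stability: "high",
    },
  ],
  network: [{ method: "POST", pattern: "/api/orders", alias: "post:place-order" }],
};

console.log(renderBuilderHandoff(sampleReport));
```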

Builder → Runner

## Handoff to cypress-runner

Prompt: "Run <spec path> to verify the new/updated spec."

Runner → Debugger (on failure)

## Handoff to cypress-debugger

Prompt: "Triage these E2E test failures (Cypress):

**Failing specs:** cypress/e2e/<spec>.cy.js

**Failures:**
1. [TEST-ID] <describe> > <it>
   Error: <message>
   Screenshot: cypress/screenshots/<path>

**Notes:** <auth errors, timeouts, etc.>"

Debugger → Runner (after fix)

## Handoff to cypress-runner

Prompt: "Re-run <spec path> to verify the fix for [TEST-ID]."

Debugger → Explorer (stale DOM)

## Handoff to cypress-browser-explorer

Prompt: "Re-explore <URL/flow> because selectors are stale for <spec>. Return updated report to builder."
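
The debugger's first decision, fix in place or hand back to the explorer, can be approximated with a few message patterns. This is a heuristic sketch only: the category names and `nextStep` labels are hypothetical, and the regular expressions assume typical Cypress error wording, so tune them against your own failure output. A "selector not found" result is a candidate for re-exploration, not proof that the DOM changed.

```javascript
// Heuristic rules. Patterns are assumptions based on typical Cypress error
// wording; categories and nextStep labels are invented for this sketch.
const FAILURE_RULES = [
  {
    category: "selector-not-found",
    pattern: /expected to find element/i,
    nextStep: "re-explore",
  },
  {
    category: "network-wait-timeout",
    pattern: /cy\.wait\(\).*timed out/i,
    nextStep: "fix-in-place",
  },
  { category: "auth", pattern: /\b(401|403)\b/, nextStep: "fix-in-place" },
];

// Returns the first matching rule's category and suggested next step,
// or an "unclassified" result that asks for manual investigation.
function classifyFailure(errorMessage) {
  const rule = FAILURE_RULES.find((r) => r.pattern.test(errorMessage));
  if (!rule) return { category: "unclassified", nextStep: "investigate" };
  return { category: rule.category, nextStep: rule.nextStep };
}

const example = classifyFailure(
  "Timed out retrying after 4000ms: Expected to find element: " +
    "`[data-cy=submit]`, but never found it."
);
console.log(example.category, example.nextStep); // selector-not-found re-explore
```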

Explorer Report Checklist

When using the explorer agent, require a report that includes:

Scope source: Ticket, pasted steps, or URL/feature
Flow summary: Scoped path, completion or blocked state
URL map: Ordered URLs visited
Selector inventory: Element, purpose, selector, stability rating
Network map: Method, pattern, suggested intercept alias
Test strategy: E2E vs shift-left rationale per scenario
Notes: Gaps, fragile selectors, missing test hooks
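
The checklist above can be enforced mechanically before the report reaches the builder. A minimal sketch, assuming the report is stored as an object with one key per checklist item; the key names are hypothetical and so is the `missingSections` helper.

```javascript
// One key per checklist item. Key names are assumptions of this sketch.
const REQUIRED_SECTIONS = [
  "scopeSource",
  "flowSummary",
  "urlMap",
  "selectorInventory",
  "networkMap",
  "testStrategy",
  "notes",
];

// Returns the checklist keys that are absent or blank. Empty arrays are
// accepted, since a flow may legitimately have no network calls.
function missingSections(report) {
  return REQUIRED_SECTIONS.filter((key) => {
    const value = report[key];
    if (value === undefined || value === null) return true;
    return typeof value === "string" && value.trim() === "";
  });
}

// Hypothetical incomplete report.
const draftReport = {
  scopeSource: "Ticket excerpt",
  flowSummary: "Login to dashboard, completed",
  urlMap: ["/login", "/dashboard"],
  selectorInventory: [],
  networkMap: [],
  notes: "  ",
};

console.log(missingSections(draftReport)); // [ 'testStrategy', 'notes' ]
```

A parent agent or a pre-handoff script could reject the report while this list is non-empty, which turns the checklist from a suggestion into a constraint.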

Steering the Harness: How to Keep Agents Aligned

Rather than personally inspecting what the agents produce, we can make them better at producing it. The collection of specifications, quality checks, and workflow guidance that governs the agent's work loops is the agent's harness. The emerging practice of building and maintaining these harnesses, harness engineering, is how humans work on the loop.

This is working "on the loop" rather than just "in the loop." You're not micromanaging every output. You're improving the harness so agents naturally produce better results.

Practical steps:

Scope every request: URL, role, numbered steps, or ticket excerpt. The explorer especially needs to know what path to follow.
Encode standards in the repo: Lint rules, skills files, and agent instructions should match. Otherwise the model follows whatever file it read most recently.
Use explicit handoffs: Paste the structured blocks so the next agent gets data, not a summary.
Review diffs like any PR: Generated specs need scrutiny, especially auth, network mocks, and assertions on money or permissions.
Keep secrets out of chat: Credentials belong in .env or your secret manager.
Turn fixes into constraints: When an agent makes a mistake, take the time to engineer a solution so the agent does not make that mistake again. Add a lint rule, update the instructions, or create a check.

Review Gates: Keeping Humans on the Loop

Agents execute evaluations automatically, but human oversight remains important for initial calibration and quality validation. Keep humans at judgment points:

After build: Review spec structure, selector quality, and assertion coverage before treating a run as final.
After green: Quick coverage and risk check before merge.
After repeated debug failure: If the same failure persists after three fix attempts, escalate to a person.
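
The third gate above is easy to make mechanical. The sketch below counts failed fix attempts per failure and signals escalation at the limit; the tracker, its method name, and the returned labels are hypothetical, and treating test ID plus error message as "the same failure" is an assumption you may want to loosen.

```javascript
const MAX_FIX_ATTEMPTS = 3; // the limit named in the review gate above

// Tracks failed fix attempts per failure signature (test ID + error message).
function createAttemptTracker(limit = MAX_FIX_ATTEMPTS) {
  const attempts = new Map();
  return {
    // Call after a re-run shows the fix did not work.
    recordFailedFix(testId, errorMessage) {
      const key = `${testId}::${errorMessage}`;
      const count = (attempts.get(key) || 0) + 1;
      attempts.set(key, count);
      return count >= limit ? "escalate-to-human" : "retry";
    },
  };
}

const tracker = createAttemptTracker();
console.log(tracker.recordFailedFix("TC-1", "element not found")); // retry
console.log(tracker.recordFailedFix("TC-1", "element not found")); // retry
console.log(tracker.recordFailedFix("TC-1", "element not found")); // escalate-to-human
```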

The agents handle the repetitive cycle. Engineers keep the judgment calls.

Team-Owned Content

The harness above doesn't define these items. Your team documents them in skills, rules, or extended agent files:

Authentication flows, secrets file layout, and command patterns
Exact run commands (local vs Docker), CI script names
Tag/grep filters, base URLs per environment
Selector policy beyond "prefer stable hooks" (data-*, roles, aria)
Test ID formats, coverage scripts, lint/Cypress config conventions

Why This Approach Works

The principles from Anthropic and Martin Fowler's research explain why the four-agent pattern is effective:

Constraints as multipliers: Paradoxically, constraining the solution space makes agents more productive, not less. When an agent can generate anything, it wastes tokens exploring dead ends.
Structured artifacts bridge context: Structured progress files and feature lists let a new agent quickly understand the state of work, analogous to a shift handoff between engineers who've never met.
Feedback loops catch issues early: Run follows build. Debug follows failure. Re-explore only when needed. This order cuts rework.
Clean escalation prevents endless retries: If the DOM is wrong, hand to explorer. If three fixes fail, hand to a human. No guessing.
The harness evolves: Coding agents make it much cheaper to build more custom controls and more custom static analysis. Agents can help write structural tests, generate draft rules from observed patterns, scaffold custom linters, or create how-to guides from codebase archaeology.

Implementing This in Cursor with Subagents and Browser

The four-agent workflow maps to four Cursor subagents: one markdown file per role under .cursor/agents/, each with YAML frontmatter (name, description, model, and any optional fields you need) plus a focused prompt body. How you create them is always the same -- only the name, description, and instructions change to match explorer, builder, runner, or debugger.

Below is one example (the browser explorer). The other three files use the identical shape; plug in the responsibilities from the agent table and handoff templates earlier in this article instead of pasting four full prompts here.

---
name: cypress-browser-explorer
model: inherit
description: Explores the application UI using browser tools to discover selectors, network calls, and page flows for Cypress test development. Use when exploring a new feature, finding selectors, mapping user flows, building new tests, or when the user says to explore a page. ALWAYS launch the browser - never assume selectors without navigating and snapshotting.
---

You are a browser exploration specialist for E2E tests using Cypress.

When invoked:

1. **Authenticate** if the target page requires login (see Authentication below)
2. **Navigate** to the target URL or flow entry point
3. **Take a snapshot** to capture the page structure
4. **Follow the exploration checklist** below for every flow

## Exploration Checklist

### Page URLs
- Record the entry page URL
- Navigate through each step of the flow, recording intermediate URLs
- Record the confirmation/success page URL

### Selectors (capture in priority order)
Priority order:
1. `[data-cy]`, `[data-test]`, `[data-testid]` -- purpose-built for testing
2. Any other `[data-*]` attribute -- stable, not styling-dependent
3. Any `[test-*]` attribute (e.g. `test-auto`, `test-id`) -- also for testing
4. `[role="..."]`, `[aria-label="..."]`, `[aria-labelledby]` -- semantic/accessible
5. `label[for="..."]` + associated input -- form elements
6. Stable visible text via `cy.contains()` -- only when text itself is the assertion
7. Tag + attribute combos (e.g. `input[name="email"]`) -- last resort

**Never use**: CSS classes, generated IDs, tag names alone, XPath, positional selectors

### Network Calls
- Monitor network requests during the flow using browser tools
- For each significant API call, record:
  - HTTP method and URL pattern
  - Suggested intercept alias (e.g., `get:cart-items`, `post:place-order`)
  - Whether the response contains data needed for assertions
- Pay attention to: auth calls, data fetching, form submissions, redirects

## Authentication

When the target page requires login (e.g. `/dashboard`, `/account`, any page that
redirects to `/login`), authenticate **before** exploring. Never ask the user
for credentials -- resolve them from project files.

### Credential Resolution (priority order)

1. **`.env`** file in the project root -- parse `KEY=VALUE` lines.
2. **`cypress.env.json`** in the project root -- parse JSON object.

## Handoff to cypress-builder

Prompt: "Create cypress/e2e/[feature].cy.js using this exploration report:
- Scope source: [quote from ticket/steps]
- URL map: [ordered list]
- Selector inventory: [element, purpose, selector, stability]
- Network map: [method, pattern, suggested alias]
- Draft spec: [snippet if applicable]
"

## Output Format

Return a structured report:
1. **Scope source:** Ticket, pasted steps, or URL/feature
2. **Flow summary:** Scoped path, completion or blocked state
3. **URL map:** Ordered URLs visited
4. **Selector inventory:** Element, purpose, selector, stability rating
5. **Network map:** Method, pattern, suggested intercept alias
6. **Test strategy:** E2E vs shift-left rationale per scenario
7. **Notes:** Gaps, fragile selectors, missing test hooks

Save as .cursor/agents/cypress-browser-explorer.md. Add cypress-builder.md, cypress-runner.md, and cypress-debugger.md the same way, then invoke with /cypress-browser-explorer (and so on) or let the parent agent delegate from each file’s description.
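
The credential resolution order in the explorer prompt (`.env` first, then `cypress.env.json`) can also be captured in a helper script the agent calls. This is a minimal sketch under stated assumptions: it takes file contents as strings so it stays free of file-system calls, the key names `TEST_USER`, `TEST_PASSWORD`, and `API_TOKEN` are hypothetical, and the `.env` parsing handles only plain `KEY=VALUE` lines with optional surrounding quotes.

```javascript
// Parses simple KEY=VALUE lines; skips blanks and # comments.
function parseDotEnv(text) {
  const values = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=", or no "=" at all
    const key = line.slice(0, eq).trim();
    // Strip one pair of matching surrounding quotes, if present.
    values[key] = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
  }
  return values;
}

const owns = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Resolves each requested key from .env first, then cypress.env.json.
// Keys found in neither source are simply left out of the result.
function resolveCredentials(dotEnvText, cypressEnvJsonText, keys) {
  const fromDotEnv = dotEnvText ? parseDotEnv(dotEnvText) : {};
  const fromJson = cypressEnvJsonText ? JSON.parse(cypressEnvJsonText) : {};
  const resolved = {};
  for (const key of keys) {
    if (owns(fromDotEnv, key)) resolved[key] = fromDotEnv[key];
    else if (owns(fromJson, key)) resolved[key] = String(fromJson[key]);
  }
  return resolved;
}

// Hypothetical file contents.
const creds = resolveCredentials(
  '# local test account\nTEST_USER=alice\nTEST_PASSWORD="s3cret"\n',
  '{"TEST_USER": "bob", "API_TOKEN": "t0ken"}',
  ["TEST_USER", "TEST_PASSWORD", "API_TOKEN"]
);
console.log(Object.keys(creds)); // log key names only, never the values
```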

Cursor's browser tool powers the explorer

The explorer subagent is where Cursor's built-in browser tool becomes essential. Rather than asking an engineer to describe what's on screen, the agent:

Navigates directly to URLs and follows multi-step flows
Takes snapshots of live DOM state, capturing element structure, attributes, and text
Reads selectors from the actual page -- data-testid, ARIA roles, form labels -- instead of guessing
Captures network activity to identify API calls that need cy.intercept() aliases

This means the exploration report is evidence-based from the start. Selectors come from the real DOM, not from memory or other sources that may be out of date. When the debugger detects stale selectors and hands back to the explorer, the browser tool re-navigates and captures the current state -- closing the feedback loop with live data.
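
The selector priority order from the explorer prompt can double as a check on the selector inventory. The sketch below ranks candidate selector strings with regular expressions; the patterns are rough assumptions that will not catch every case (a positional suffix such as `:nth-child(2)` slips through), and the helper names are hypothetical. Priority 6, visible text via `cy.contains()`, is omitted because it is not a selector string.

```javascript
// First matching rule wins, so more specific rules come first.
const SELECTOR_PRIORITY = [
  { rank: 1, label: "test hook", pattern: /\[data-(cy|test|testid)(=|\])/ },
  { rank: 2, label: "data attribute", pattern: /\[data-[\w-]+(=|\])/ },
  { rank: 3, label: "test-* attribute", pattern: /\[test-[\w-]+(=|\])/ },
  {
    rank: 4,
    label: "role/aria",
    pattern: /\[(role|aria-label|aria-labelledby)(=|\])/,
  },
  { rank: 5, label: "label[for]", pattern: /^label\[for(=|\])/ },
  { rank: 7, label: "tag + attribute", pattern: /^[a-z][\w-]*\[[\w-]+(=|\])/ },
];

// Returns { rank, label } or null for selectors on the "never use" list
// (classes, IDs, bare tag names), which match none of the rules.
function rankSelector(selector) {
  const rule = SELECTOR_PRIORITY.find((r) => r.pattern.test(selector));
  return rule ? { rank: rule.rank, label: rule.label } : null;
}

// Picks the highest-priority allowed candidate, or null if none qualify.
function pickBestSelector(candidates) {
  const ranked = candidates
    .map((selector) => ({ selector, match: rankSelector(selector) }))
    .filter((c) => c.match !== null)
    .sort((a, b) => a.match.rank - b.match.rank);
  return ranked.length > 0 ? ranked[0].selector : null;
}

console.log(
  pickBestSelector([
    ".btn-primary",
    'input[name="email"]',
    '[aria-label="Email"]',
    '[data-cy="email"]',
  ])
); // [data-cy="email"]
```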

Why subagents fit this workflow

Cursor subagents provide three properties that align with the harness model:

Context isolation: Each subagent gets its own context window. The explorer's noisy DOM snapshots and network logs don't pollute the builder's context. The debugger's stack traces don't crowd the runner. This is the same isolation principle the harness pattern demands.
Parallel execution: Multiple subagents can run simultaneously, cutting wall-clock time on multi-spec work.
Structured handoffs: A subagent returns a final message to the parent agent. That message is the handoff artifact -- the exploration report, the run summary, the debug notes. The templates in this article become the return format each subagent follows.

The Orchestration Pattern

The parent agent acts as an orchestrator, coordinating the four subagents in sequence:

1. Invoke /cypress-browser-explorer with URL and steps -- get exploration report
2. Pass the report to /cypress-builder -- get spec files
3. Hand spec paths to /cypress-runner -- get pass/fail summary
4. On failure, send details to /cypress-debugger -- get fixes, then back to step 3

Each handoff uses the structured templates from earlier in this article. The parent agent doesn't need deep knowledge of Cypress APIs—it routes data between specialists. This is the same orchestrator pattern Cursor's documentation recommends for complex workflows.
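
The orchestration steps above, combined with the three-attempt escalation rule, can be modeled as a short control loop. In Cursor the parent agent does this routing through prompts, so the code below is only a model of the control flow: the `agents` object, its four functions, and the result shapes are hypothetical stand-ins, and synchronous stubs keep the sketch short.

```javascript
// Models explore -> build -> run, with debug/re-run on failure,
// re-explore on stale DOM, and escalation after maxFixAttempts.
function runLoop(agents, request, maxFixAttempts = 3) {
  let report = agents.explore(request);
  let spec = agents.build(report);
  let fixAttempts = 0;
  for (;;) {
    const run = agents.run(spec);
    if (run.passed) return { status: "green", spec, fixAttempts };
    if (fixAttempts >= maxFixAttempts) {
      return { status: "escalate-to-human", spec, fixAttempts, failures: run.failures };
    }
    const debug = agents.debug(spec, run.failures);
    fixAttempts += 1;
    if (debug.staleDom) {
      // Selectors are stale: go back to live exploration and rebuild.
      report = agents.explore(request);
      spec = agents.build(report);
    } else {
      spec = debug.spec;
    }
  }
}

// Stub agents: the first run fails, the second passes.
let runCount = 0;
const stubAgents = {
  explore: (request) => ({ url: request.url, selectors: ['[data-cy="submit"]'] }),
  build: (report) => ({ path: "cypress/e2e/example.cy.js", revision: 1, report }),
  run: () => {
    runCount += 1;
    return runCount < 2
      ? { passed: false, failures: ["example failure"] }
      : { passed: true, failures: [] };
  },
  debug: (spec) => ({ staleDom: false, spec: { ...spec, revision: spec.revision + 1 } }),
};

const loopResult = runLoop(stubAgents, { url: "/checkout" });
console.log(loopResult.status, loopResult.fixAttempts); // green 1
```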

If you use Cypress MCP, you can also point /cypress-debugger at MCP tools to fetch failures from Cypress Cloud. The debugger triages, patches the spec or support code, then uses the Debugger → Runner handoff to re-run and stays in that loop until failures are addressed. That keeps run, fail, fetch, fix, re-run inside one workflow.

Closing

Treating exploration, implementation, execution, and repair as separate agent roles mirrors how strong teams already work. The harness makes this pattern repeatable and easy to hand off inside the IDE.

The largest efficiency win is the closed loop: run follows build, debug follows failure, re-explore only when the page structure actually changed.

The most effective harnesses don't just constrain the agent. They create an environment where the agent naturally produces better output with less correction needed. This is a critical insight. The best harnesses aren't restrictive. They're enabling.

Since shipping these specialized Cypress agents, I have hardly written tests by hand. The agents produce specs; I review them, merge when they are right, and when something drifts or misfires I adjust the agent definitions, skills, or prompts so the next run is better. The work shifts from typing cy.* to curating the harness -- continuous improvement on the automation itself, not just on individual tests.

The loop is sequential, but each step stays small: one subagent, one job, less noise in context than doing it all in a single chat.

Agent-driven development pays off when agents are blueprints you maintain. With Cursor subagents, those blueprints live in your repo as markdown files -- versioned, reviewable, and shared across the team. The browser tool gives the explorer agent direct access to your running app, so the entire loop from live UI to green test stays inside the IDE. Tighten instructions as your app and pipeline evolve. Keep guidance in the loop so automation stays trustworthy, not just clever.

References

Anthropic: Effective Harnesses for Long-Running Agents
Martin Fowler: Harness engineering for coding agent users
Cursor: Subagents and Browser Tool

DEV Community: Darpan Shah