Last year, I wrote about Docker and Cypress for this blog. It covered containers, layer caching, and parallel runners. Good stuff. Useful stuff.
But I'm not writing that article again.
Here's why.
I could write a perfect container config in my sleep. So could Claude. So could GPT. So could any intern with a prompt. Syntax has become a commodity. The Dockerfile isn't the hard part anymore.
The hard part?
Orchestration and trust when AI agents run the tests.
Let me explain.
The Shift Nobody Talks About
In 2025, Cypress shipped cy.prompt(). Write tests in plain English. The AI figures out the selectors. It even self-heals when your UI changes.
That's powerful. And that's dangerous.
Not because the tool is bad. It's genuinely impressive. But because it changes who is making decisions in your pipeline. And most teams haven't thought about that.
Before cy.prompt(), the chain of trust was simple:
- A human wrote the test
- A human reviewed it
- CI ran it
- If it failed, a human fixed it
Every link in that chain had a name attached.
Now?
- An AI writes the test
- An AI picks the selectors
- An AI heals the test when it breaks
- The human sees green checkmarks
- Everybody ships
Until something goes wrong. And nobody knows why.
Autonomy vs. Augmentation: The Framework That Matters
The industry keeps confusing two very different things.
Autonomy means the agent acts for you. You find out later what happened.
Think: self-driving car. You're the passenger. The AI makes every turn.
Augmentation means the agent helps you decide. You still make the call.
Think: GPS navigation. It suggests the route. You drive.
Most AI testing tools sell autonomy:
- "Never write a test again!"
- "Self-healing pipelines!"
- "Zero maintenance!"
That sounds great in a demo.
It falls apart in production.
Google's testing team found that 1.5% of all test runs were flaky (2016 study). Nearly 16% of tests showed some flakiness over time. Microsoft reported 49,000 flaky tests across 100+ product teams (2022). These numbers haven't gotten better. Now imagine those tests were written by AI.
You don't have a testing problem.
You have a trust problem.
What Actually Happens When AI Writes Your Cypress Tests
I've watched AI code assistants generate test suites. Here's the pattern I see every time:
Day one: Beautiful. High coverage numbers. Clean syntax. The PR merges fast. Everyone celebrates.
Week two: A UI change breaks three tests. The self-healing kicks in. Tests pass again. Nobody checks what changed.
Month two: The self-healed selectors are now targeting the wrong elements. The tests pass. But they're testing the wrong things. Your coverage number says 90%. Your real coverage is closer to 40%.
Quarter end: A production bug ships. The test suite was green. The post-mortem reveals the AI "healed" a critical login test. It now clicks a decorative button instead of the submit button. Both are blue. Both say "Continue."
The AI didn't fail.
The architecture failed.
Nobody designed a system where AI decisions get verified.
The Architecture Cypress Teams Actually Need
Here's the playbook I'd build for any team using Cypress with AI in 2026.
Layer 1: AI Generates, Humans Gate
Use cy.prompt() (or any AI tool) to draft tests. That's the accelerator.
But treat AI-generated tests like pull requests from a junior developer.
```javascript
// cy.prompt() generates the test
cy.prompt([
  'Visit the login page',
  'Type admin@company.com into the email field',
  'Type the password into the password field',
  'Click the sign in button',
  'Verify the dashboard loads'
])
```
Then eject that code. Review the selectors. Commit the explicit version.
```javascript
// The reviewed, committed version
cy.visit('/login')
cy.get('[data-cy=email]').type('admin@company.com')
cy.get('[data-cy=password]').type(Cypress.env('TEST_PASSWORD'))
cy.get('[data-cy=submit-login]').click()
cy.url().should('include', '/dashboard')
cy.get('[data-cy=welcome-banner]').should('be.visible')
```
The AI got you there faster. A human verified the result.
That's augmentation.
Layer 2: The Trust Boundary in CI
Your pipeline needs a clear line:
- On one side: things AI can do alone
- On the other: things that need human eyes
```yaml
# GitHub Actions - Trust Architecture
jobs:
  ai-generated-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Run AI-assisted test suite
        run: |
          docker compose -f docker-compose.cypress.yml up \
            --abort-on-container-exit \
            --exit-code-from cypress

      - name: Validate no self-healed selectors
        run: |
          # Flag any tests that healed since the last commit.
          # Note: requires a custom script that parses the
          # Cypress Cloud API or stdout logs.
          node ./scripts/check-healed-tests.js
          # If selectors changed, block the merge
          # and force a human review.

      - name: Screenshot diff on healed tests
        if: failure()
        run: |
          # Capture what the AI "fixed" and
          # attach it to the PR for human review.
          npx cypress run --spec "healed-tests/**" \
            --config screenshotOnRunFailure=true
```
The key:
Self-healed tests don't auto-merge. They create a review request. A human looks at what changed. Then decides.
Layer 3: The Accountability Layer
Every AI decision in your pipeline needs a log.
Not just "test passed."
But: "test healed selector from .btn-primary to .btn-action on Feb 15."
```javascript
// cypress.config.js
// Note: parseHealingFromLogs and logToAuditTrail are your own helpers,
// not Cypress built-ins — see the options below for how to build them.
module.exports = {
  e2e: {
    experimentalPromptCommand: true,
    setupNodeEvents(on, config) {
      on('after:spec', (spec, results) => {
        // Parse Cypress stdout or the Cloud API for healing events.
        // Self-healing data appears in the Command Log
        // but isn't yet exposed in results.stats.
        //
        // Option A: Parse terminal output for "Self-Healed" tags
        // Option B: Query the Cypress Cloud API for spec run details
        // Option C: Build a custom Cypress plugin that listens
        //           to command events during the run
        const healingEvents = parseHealingFromLogs(spec.name)
        if (healingEvents.length > 0) {
          logToAuditTrail({
            spec: spec.name,
            healed: healingEvents.length,
            timestamp: new Date().toISOString(),
            details: healingEvents
          })
        }
      })
    }
  }
}
```
When something breaks in production, you can trace it back:
"The AI changed this selector on this date. Nobody reviewed it. That's the gap."
Without this layer, your pipeline is a black box.
Green doesn't mean correct. It means unchallenged.
Layer 4: Docker as the Trust Container
Docker isn't just for consistency anymore.
It's your isolation boundary for AI-generated tests.
```yaml
# docker-compose.cypress.yml
services:
  cypress-human-authored:
    build:
      context: .
      dockerfile: Dockerfile.cypress
    command: >
      npx cypress run
      --spec "cypress/e2e/human-authored/**"
    volumes:
      - ./results/human:/results

  # AI tests run in a separate container:
  # different reporting, different trust level.
  cypress-ai-generated:
    build:
      context: .
      dockerfile: Dockerfile.cypress
    command: >
      npx cypress run
      --spec "cypress/e2e/ai-generated/**"
    volumes:
      - ./results/ai:/results
```
Separate the results. Report them differently.
- Human-authored tests are your source of truth
- AI-generated tests are your early warning system
When both agree: high confidence.
When they disagree: investigate.
When only AI tests pass: be suspicious.
The Uncomfortable Question
Here's where I need to be honest.
I've been in tech for 20 years and have spent the last 15 building delivery pipelines. I can debug a failing Docker container at 2 AM with my eyes half closed. I've configured CI/CD systems that run thousands of tests across dozens of services.
And I'm watching AI tools do parts of that job faster than I can.
That's not a threat.
That's a signal.
The value isn't in writing the cy.get() selector anymore.
The value is in designing the system where:
- AI-generated selectors get verified
- Self-healing gets audited
- Trust has a paper trail
The Executor writes the test.
The Architect designs the trust system.
Most teams are building AI-powered testing without building AI-accountable testing. They're adding speed without adding trust.
That's technical debt with a new name.
What I'd Do This Week
If I ran a Cypress team today, here's my Monday morning plan:
Separate your test suites. Human-authored in one folder. AI-generated in another. Track them separately.
Add an audit log for self-healing. Every time cy.prompt() (or any AI tool) changes a selector, log it. Make it visible.
Block auto-merge on healed tests. Self-healed tests go into a review queue. A human approves. Every time.
Run AI tests in a separate Docker container. Different reporting pipeline. Compare results against human-authored tests.
Measure real coverage. Not line coverage. Not selector coverage. "Does this test actually verify the behavior we care about?" AI can inflate coverage numbers without testing anything meaningful.
None of this is anti-AI.
All of this is pro-trust.
The Bottom Line
Cypress + AI is the future. I believe that. cy.prompt() is a genuine leap forward.
The ability to write tests in plain English, the self-healing, the lower barrier to entry — all of it matters.
But the teams that win won't be the ones who automate the most.
They'll be the ones who trust the right things and verify everything else.
The bot that ships the wrong build doesn't get fired. You do.
Design accordingly.
Resources:
- cy.prompt() Documentation
- Cypress Docker Images
- Google: Flaky Tests and How We Mitigate Them (2016)
- Microsoft: Flaky Test Management (2022)
Valdemar is a Docker Captain and Cypress Ambassador based in Canada. He builds CI/CD pipelines that don't lie to you. Find him at valdemar.ai.