David Frei

What Actually Breaks Test Automation After the Demo

Most test automation demos are too clean.

The demo app is stable. The login flow is simple. The selectors are obvious. The data is predictable. CI is not under pressure. Nobody is trying to debug a flaky checkout test five minutes before a release.

Real test automation work is different.

The product changes. The frontend refactors. A locator breaks. A test passes in preview but fails after merge. An AI-generated Playwright test looks good but asserts the wrong thing. A CI job keeps failing only under parallel execution. Someone adds a feature flag. Someone else updates a React component. The browser suite starts taking too long, so people add retries and hope for the best.

That is the world SDETs actually live in.

I went through the current notes on The SDET and grouped them into a practical reading path for teams trying to build test automation that still works after the first exciting week.

Start with the uncomfortable truth: maintenance is the real product

Writing the first test is rarely the hard part.

Maintaining the 300th test is.

That is why I would start with What to Measure in Test Automation Maintenance Before Your Suite Becomes Expensive.

The useful shift is to measure automation as an ongoing system, not a one-time project.

Good maintenance metrics include:

  • flaky test rate
  • selector churn
  • time to debug failures
  • time to update tests after UI changes
  • number of retries
  • number of quarantined tests
  • CI runtime
  • test ownership
  • how often failures are ignored

Those numbers matter more than raw test count.

A suite with 1,000 tests can still be weak if nobody trusts the failures. A suite with 80 tests can be valuable if it covers the right flows and fails for clear reasons.

The worst test suite is not the one that fails.

The worst test suite is the one that fails and everyone says, “It is probably just the tests.”
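To make a few of the metrics above concrete, here is a minimal sketch that derives a flaky rate, a retry count, and a quarantine count from one CI run. The `TestRun` shape and the function name are assumptions made for illustration, not something the linked article prescribes, and "flaky" is defined here as "failed at least once, then passed on a later attempt."

```typescript
// Hypothetical shape of one test's result in a single CI run.
// `attempts` lists the outcome of each try, in order.
type Attempt = "passed" | "failed";

interface TestRun {
  title: string;
  attempts: Attempt[];
  quarantined: boolean;
}

interface MaintenanceSummary {
  flakyRate: number;    // share of executed tests that passed only after a failure
  totalRetries: number; // extra attempts beyond the first one
  quarantined: number;  // tests excluded from the release signal
}

function summarizeMaintenance(runs: TestRun[]): MaintenanceSummary {
  // Quarantined tests are counted separately and kept out of the flaky rate.
  const executed = runs.filter((r) => !r.quarantined && r.attempts.length > 0);

  const flaky = executed.filter(
    (r) =>
      r.attempts.some((a) => a === "failed") &&
      r.attempts[r.attempts.length - 1] === "passed",
  );

  const totalRetries = executed.reduce(
    (sum, r) => sum + (r.attempts.length - 1),
    0,
  );

  return {
    flakyRate: executed.length === 0 ? 0 : flaky.length / executed.length,
    totalRetries,
    quarantined: runs.filter((r) => r.quarantined).length,
  };
}
```

Tracking these three numbers per run, and watching the trend over weeks, is usually more telling than any single value.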

Playwright is powerful, but it still needs engineering discipline

Playwright is one of the best tools for modern browser automation, but adopting Playwright does not automatically give you a good test suite.

For a strong foundation, read How to Build a Playwright Test Framework from Scratch.

A framework needs more than a folder full of specs. It needs structure around:

  • fixtures
  • test data
  • authentication
  • browser projects
  • reporting
  • artifacts
  • retries
  • CI configuration
  • selector strategy
  • cleanup
  • environment handling

This is where a lot of teams underestimate the work.

A simple Playwright test is easy. A reliable Playwright framework that survives product churn is a different thing.

The article Playwright Test Data Strategies That Keep Your Suite Stable is a good companion because test data is often the hidden reason browser tests fail.

Bad test data creates fake flakiness.

The test fails, but not because the UI is broken. It fails because the user already exists, the cart is not empty, the record was deleted by another test, the backend state is stale, or two parallel workers used the same fixture.

A stable suite needs test data that is:

  • isolated
  • disposable
  • predictable
  • parallel-safe
  • easy to reset
  • close enough to real business behavior

Without that, every test failure becomes a guessing game.
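One way to get isolated, parallel-safe data is to derive every identifier from the run and the worker that created it. The sketch below is an assumption about how that could look: `buildTestUser` and its parameters are hypothetical names, and the worker index could come from something like Playwright's `testInfo.workerIndex`.

```typescript
// Hypothetical test user record; adapt the fields to your own product.
interface TestUser {
  email: string;
  displayName: string;
}

// Builds a user whose identifiers cannot collide across CI runs,
// parallel workers, or repeated calls inside one worker.
function buildTestUser(
  runId: string,       // assumed: unique per CI run, e.g. a build number
  workerIndex: number, // assumed: unique per parallel worker
  sequence: number,    // assumed: a counter incremented per created user
): TestUser {
  const suffix = `${runId}-w${workerIndex}-${sequence}`;
  return {
    // example.com is reserved for documentation, so no real inbox is hit.
    email: `e2e+${suffix}@example.com`,
    displayName: `E2E User ${suffix}`,
  };
}
```

Because the identifier encodes where the record came from, a leftover user in the database also tells you which run and worker forgot to clean up.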

Flaky tests should be diagnosed, not tolerated

Flaky tests are not just annoying. They damage trust.

A flaky test creates a decision every time CI goes red:

  • Is this a real bug?
  • Should we block the merge?
  • Should we rerun?
  • Should we quarantine it?
  • Who owns the fix?

That is why How to Stop Flaky Playwright Tests Before They Reach CI is worth reading early.

The article makes an important point: retries are not a strategy. They are evidence.

A retry can tell you that the failure is probably timing-related, state-related, or environment-related. But if the test needs luck to pass, the release signal is already compromised.

For deeper debugging, read How to Debug Flaky Playwright Tests with Trace Viewer, Logs, and Timing Clues.

Trace Viewer is useful because it turns a vague failure into a sequence of facts:

  • what the browser saw
  • what action happened
  • what the DOM looked like
  • what network calls were made
  • what console errors appeared
  • whether an element existed but was not actionable
  • whether the app was still transitioning

A good trace can show that the problem was not Playwright at all. Maybe the product rendered a button before it was ready. Maybe the API returned late. Maybe an animation blocked the click. Maybe the test asserted too early.

A flaky test is usually not random. It just has not been classified yet.

Learn to classify failures before fixing them

One of the most practical SDET skills is knowing what kind of failure you are looking at.

The article How I Decide Whether a Flaky Test Is a Product Bug, a Test Bug, or a CI Bug is useful because it avoids the lazy answer of “the test is flaky.”

A failing test might point to:

  • a real product bug
  • a brittle locator
  • a bad assertion
  • missing test data setup
  • backend state drift
  • CI resource limits
  • browser differences
  • timing assumptions
  • environment mismatch

Each category has a different fix.

If the product allows double submission, the test should probably expose that. If the selector depends on a generated class name, the test needs to change. If the failure only appears when tests run in parallel, the data isolation needs attention.
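The classification step can be written down as explicit rules. This is only a sketch: the signal names, the categories, and especially the order of the checks are assumptions for illustration, and a real triage process would weigh more evidence than four booleans.

```typescript
type FailureCategory =
  | "product-bug"
  | "data-isolation"
  | "ci-environment"
  | "test-bug"
  | "unclassified";

// Hypothetical signals a person (or a script) collects during triage.
interface FailureSignals {
  reproducesManually: boolean;  // a human can trigger the same behavior in the app
  failsOnlyInParallel: boolean; // passes with one worker, fails with several
  failsOnlyInCi: boolean;       // passes locally against the same build
  locatorNotFound: boolean;     // the expected element was never located
}

function classifyFailure(s: FailureSignals): FailureCategory {
  // Assumed priority: a manually reproducible problem outranks everything else.
  if (s.reproducesManually) return "product-bug";
  if (s.failsOnlyInParallel) return "data-isolation";
  if (s.failsOnlyInCi) return "ci-environment";
  if (s.locatorNotFound) return "test-bug";
  return "unclassified";
}
```

The value is less in the function than in forcing the team to agree on which signals it collects before anyone says "flaky."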

This is also where How to Debug Flaky API-Plus-UI Flows When the Browser Is Not the Real Problem becomes important.

Browser tests often get blamed for problems that start below the browser:

  • async backend processing
  • slow API responses
  • stale data
  • eventual consistency
  • test account state
  • feature flags
  • third-party services

A UI failure is sometimes just the visible symptom of backend instability.

CI failures need artifacts, not theories

CI failures are expensive when they lack evidence.

If the only output is “expected button to be visible,” someone has to reconstruct the run from memory and hope.

That is why How to Store Playwright Test Artifacts in CI So Failure Triage Is Actually Fast is one of the most practical notes in the set.

A useful CI failure should include:

  • trace files
  • screenshots
  • video
  • console logs
  • network logs
  • browser version
  • environment details
  • retry information
  • test data identifiers

The goal is to answer one question quickly:

What happened?

Not what might have happened. Not what usually happens. What happened in that run.
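A small check like the one below can flag failures that reached triage without enough evidence. The `FailureEvidence` shape and its field names are hypothetical and cover only part of the list above; the idea is simply to make missing artifacts visible instead of discovering the gap mid-investigation.

```typescript
// Hypothetical record attached to each failed test in CI.
interface FailureEvidence {
  tracePath?: string;
  screenshotPath?: string;
  videoPath?: string;
  consoleLogPath?: string;
  browserVersion?: string;
  testDataIds?: string[];
}

// Returns the names of the evidence items that are absent or empty.
function missingEvidence(e: FailureEvidence): string[] {
  const missing: string[] = [];
  if (!e.tracePath) missing.push("trace");
  if (!e.screenshotPath) missing.push("screenshot");
  if (!e.videoPath) missing.push("video");
  if (!e.consoleLogPath) missing.push("console log");
  if (!e.browserVersion) missing.push("browser version");
  if (!e.testDataIds || e.testDataIds.length === 0) {
    missing.push("test data identifiers");
  }
  return missing;
}
```

A pipeline could print this list next to each failure, so an incomplete report is obvious before anyone starts guessing.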

For broader pipeline confidence, read What to Test in CI Before You Trust a New Release Pipeline.

A release pipeline is part of the product delivery system. It needs testing too.

You need confidence in:

  • build steps
  • deployment steps
  • environment parity
  • secrets
  • test ordering
  • artifact retention
  • rollback behavior
  • failure reporting
  • branch and merge behavior

CI is not just a machine that runs tests. It is where release decisions get made.

Passing in preview does not mean passing after merge

Preview environments are helpful, but they are not production and they are not always the same as post-merge environments.

That is the point of Why Browser Tests Pass in Preview but Fail After Merge.

A browser test can pass in preview and fail after merge because of:

  • caching differences
  • feature flag state
  • environment variables
  • seeded data
  • auth redirects
  • deployment timing
  • CDN behavior
  • hidden dependencies
  • database migrations
  • different browser config

The fix is not to dismiss the failure as “CI being weird.”

The fix is to compare environments carefully and decide what the test is actually proving in each stage.
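Comparing environments is easier when each stage records a flat snapshot of the settings that matter. The sketch below diffs two such snapshots; the `EnvSnapshot` type and the example keys are assumptions, and secrets should be recorded as fingerprints or presence flags rather than raw values.

```typescript
// Hypothetical flat snapshot: setting name -> observed value.
type EnvSnapshot = Record<string, string>;

// Returns the sorted names of settings that differ between two snapshots,
// including settings present in only one of them.
function diffEnvironments(a: EnvSnapshot, b: EnvSnapshot): string[] {
  const keys = Object.keys(a).concat(Object.keys(b));
  const unique = keys.filter((k, i) => keys.indexOf(k) === i);
  return unique.filter((k) => a[k] !== b[k]).sort();
}

// Example values are invented for illustration.
const previewEnv: EnvSnapshot = {
  FLAG_NEW_CHECKOUT: "on",
  TZ: "UTC",
  BROWSER: "chromium",
};
const postMergeEnv: EnvSnapshot = {
  FLAG_NEW_CHECKOUT: "off",
  TZ: "UTC",
  BROWSER: "chromium",
  CDN_CACHE: "enabled",
};

const environmentDrift = diffEnvironments(previewEnv, postMergeEnv);
// environmentDrift is ["CDN_CACHE", "FLAG_NEW_CHECKOUT"]
```

A short list of differing keys is a much better starting point for the investigation than "CI being weird."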

Similarly, Why E2E Tests Fail in CI but Pass Locally: A Root Cause Checklist is a good checklist for the classic local-versus-CI problem.

Local runs often have advantages that CI does not:

  • warmer caches
  • faster CPU
  • different viewport
  • different timezone
  • reused login state
  • different secrets
  • less parallel pressure
  • different browser version

If the environment changes, the test effectively changes.

Network interception is useful, but it changes the meaning of the test

Playwright network interception is powerful.

It can stabilize tests, mock APIs, control third-party calls, and make authentication flows easier to test.

But it should be used intentionally.

The guide Playwright Network Interception Tutorial for Testing APIs, Auth, and Third-Party Calls is useful because it treats interception as a tradeoff, not magic.

Mocking an API can make the UI deterministic, but it can also hide integration risk. Intercepting auth calls can speed up setup, but it may skip important login behavior. Stubbing a third-party service can reduce noise, but it means the test is no longer validating the real dependency.

That is not bad. It just needs to be clear.

A test with mocked network behavior should be labeled and scoped differently from a full end-to-end test.

Modern frontends create special automation problems

A lot of flaky browser tests come from modern frontend behavior.

Shadow DOM, iframes, WebSockets, file uploads, AI-generated frontend changes, React state updates, and fast component churn all create failure modes that simple tests do not cover well.

For Shadow DOM and iframe handling, read How to Test Shadow DOM and Iframes in Playwright Without Turning Every Locator Into a Guess.

The important idea is that boundaries should be explicit. A test should know when it is inside a frame, inside a component boundary, or interacting with a nested widget. Otherwise, selectors become a pile of guesses.

For real-time interfaces, read How to Test WebSocket-Driven UI Flows Without Chasing Race Conditions in E2E.

Real-time UI flows are hard because timing is part of the product. The test has to distinguish between:

  • connection state
  • message delivery
  • UI update behavior
  • reconnection behavior
  • stale data
  • multi-user synchronization

A simple click-and-expect pattern may not be enough.

For file inputs, read How to Test File Upload Components in Modern React Apps Without Flaky Selectors.

File upload tests need to cover more than selecting a file. They should validate the user-visible result: upload accepted, validation shown, progress handled, preview displayed, file attached, or error recovered.

AI-generated frontend changes need QA before they hit release

AI coding tools can change frontend code quickly.

That is useful, but it also means the test strategy has to catch changes that look reasonable in code review but break behavior, selectors, layout, or accessibility.

The article How I Test AI-Generated Frontend Changes Before They Break the Release Branch focuses on that exact problem.

AI-generated frontend changes can introduce:

  • markup drift
  • changed labels
  • weaker accessibility
  • broken selectors
  • missing loading states
  • layout regressions
  • altered button behavior
  • different form validation behavior

The point is not that AI code is bad. Human code can do all of this too.

The point is that AI-generated changes can be large, plausible, and fast. That makes regression checks even more important.

AI-generated Playwright tests are drafts, not finished automation

A big theme on The SDET is AI-generated Playwright code.

Start with How to Generate Playwright Tests with ChatGPT and How to Generate Playwright Tests with Claude.

Both are useful because they treat AI as a drafting tool, not a replacement for test design.

AI can help with:

  • turning user flows into test skeletons
  • generating boilerplate
  • suggesting locator strategies
  • writing first-pass assertions
  • converting manual test cases
  • creating examples quickly

But AI does not know your real application constraints unless you give it that context. It does not know which selectors are dynamic, which user accounts are safe, which feature flags are enabled, or which workflows require special setup.

The same idea appears in How to Generate Playwright Tests with GitHub Copilot and How to Generate Playwright Tests with Cursor.

Coding assistants are useful when you constrain them. They are risky when you let them invent architecture.

If you already have manual test cases, read How to Use AI to Convert Manual Test Cases into Playwright Tests.

Manual test cases often contain the real product intent, but they are written for humans. AI can translate them into code only if the input is structured enough:

  • preconditions
  • test data
  • steps
  • expected results
  • cleanup notes
  • business outcome

If the manual case says “verify checkout works,” the AI still has to guess what “works” means.

That is not safe enough for release automation.
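The structure listed above can be enforced before a manual case is ever handed to an AI tool. This is a minimal sketch: the `ManualTestCase` type and `gapsBeforeGeneration` are hypothetical names, and the check only verifies that each section is present, not that it is well written.

```typescript
// Hypothetical structured form of a manual test case.
interface ManualTestCase {
  title: string;
  preconditions: string[];
  testData: string[];
  steps: string[];
  expectedResults: string[];
  cleanupNotes: string[];
  businessOutcome: string;
}

// Lists the sections that are still empty, so the author can fill them in
// instead of leaving the AI to guess.
function gapsBeforeGeneration(tc: ManualTestCase): string[] {
  const gaps: string[] = [];
  if (tc.preconditions.length === 0) gaps.push("preconditions");
  if (tc.testData.length === 0) gaps.push("test data");
  if (tc.steps.length === 0) gaps.push("steps");
  if (tc.expectedResults.length === 0) gaps.push("expected results");
  if (tc.cleanupNotes.length === 0) gaps.push("cleanup notes");
  if (tc.businessOutcome.trim() === "") gaps.push("business outcome");
  return gaps;
}
```

A case titled "verify checkout works" with nothing but a list of steps would come back with five gaps, which is a fair summary of how much the AI would otherwise have to invent.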

Reviewing AI-generated test code is its own skill

Generated tests should be reviewed like production code.

In fact, they often deserve a more careful review, because they can look correct while encoding weak assumptions.

Read How to Review AI-Generated Playwright Code, How to Debug AI-Generated Playwright Tests, and AI-Generated Playwright Tests: Complete Example.

A generated Playwright test needs review for:

  • locator quality
  • assertion strength
  • wait strategy
  • fixture design
  • cleanup
  • test data isolation
  • CI behavior
  • readability
  • whether it tests the intended business outcome

The easiest AI mistake to miss is a weak assertion.

The test clicks through the flow and passes, but it only checks that a page loaded or a URL changed. That may not prove the product behavior the team cares about.

A useful test should answer:

What user outcome did this verify?

If that answer is vague, the test is not ready.
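As a rough review aid, a script can flag generated tests whose only assertions are navigation-level checks. The heuristic below is an assumption about what "weak" looks like: it scans the test source as text for common matcher names such as `toHaveURL` and `toHaveTitle`, so it will miss custom matchers and cannot judge intent. It is a prompt for a human reviewer, not a replacement for one.

```typescript
// Heuristic only: returns true when a test body has no assertions at all,
// or when every assertion it finds is a URL or title check.
function hasOnlyNavigationAssertions(testSource: string): boolean {
  // Matches common matcher calls such as .toBeVisible( or .toHaveText(
  const assertionPattern = /\.(?:toHave|toBe|toContain|toEqual|toMatch)\w*\s*\(/g;
  const assertions: string[] = testSource.match(assertionPattern) || [];

  if (assertions.length === 0) return true;

  const navigationOnly = /\.(?:toHaveURL|toHaveTitle)\s*\(/;
  return assertions.every((a) => navigationOnly.test(a));
}
```

A test that ends with a URL check alone would be flagged; adding an assertion on something the user actually sees, such as a confirmation message, clears the flag.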

Testing AI-powered product features is different

AI is not only helping write tests. AI is also becoming part of the product.

That creates a different testing problem.

The article How to Test AI-Powered Form Validation Without Trusting the Model Too Much is a good example.

AI-powered validation can be useful, but tests should not blindly trust the model output.

Instead, tests should focus on deterministic product behavior:

  • required errors appear
  • unsafe input is handled
  • valid input can proceed
  • fallback behavior works
  • model uncertainty is handled
  • user messaging is clear
  • server-side validation still protects the system

For AI features, exact output may vary. That means tests need contracts, meaning assertions about the shape, bounds, and required behavior of a response, not just string matches.
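A contract check might look like the sketch below. The `AiValidationResult` shape, the verdict values, and the two rules are assumptions for illustration; the point is that the test asserts properties every acceptable response must have, while the exact wording stays free to vary.

```typescript
// Hypothetical response shape for an AI-backed validation endpoint.
interface AiValidationResult {
  verdict: "accept" | "reject" | "uncertain";
  message: string;    // user-facing text; the wording may vary between calls
  confidence: number; // assumed to be a number in [0, 1]
}

// Returns the contract rules a response breaks; an empty list means it conforms.
function contractViolations(r: AiValidationResult): string[] {
  const violations: string[] = [];

  // Written as a negation so that NaN is also rejected.
  if (!(r.confidence >= 0 && r.confidence <= 1)) {
    violations.push("confidence out of range");
  }

  // Any verdict that blocks or delays the user must explain itself.
  if (r.verdict !== "accept" && r.message.trim() === "") {
    violations.push("non-accept verdict needs a user-facing message");
  }

  return violations;
}
```

Two differently worded rejection messages both pass this check, which is exactly the flexibility a string match cannot offer.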

AI coding limits are a real operational risk

Several notes on The SDET cover a problem that teams do not talk about enough: AI usage limits and reasoning limits can interrupt real automation work.

Read these together:

The pattern is familiar.

AI is helpful for small tasks. Then the task becomes messy. The model needs more context. The framework has many helpers. The failure needs reruns. The CI issue requires logs, traces, and comparison. The AI tool hits a limit before the fix is complete.

That does not mean AI coding assistants are bad.

It means you should not design your release process around the assumption that AI will always be available, always have context, and always finish the debugging session.

Generated code still needs human ownership.

If your team cannot maintain the framework without the assistant, the framework is probably too fragile, and the weakness is organizational rather than technical.

Endtest appears as the lower-maintenance alternative in several scenarios

A recurring comparison on The SDET is when a managed platform like Endtest makes more sense than owning the whole framework yourself.

These pieces cover that angle:

The useful framing is not “Playwright versus Endtest” or “Selenium versus Endtest” as a religious debate.

The useful question is:

How much framework ownership can this team realistically support?

If the team has strong SDET capacity, a custom Playwright framework can be a good choice.

If the team is small, moving fast, and struggling with browser coverage, test maintenance, and CI triage, a managed platform can be more practical.

The hidden cost of automation is not only writing code. It is maintaining everything around the code:

  • browser infrastructure
  • reports
  • screenshots
  • videos
  • logs
  • selectors
  • test data
  • retries
  • flaky failure triage
  • CI integration
  • onboarding
  • framework conventions

That cost is easy to underestimate.

Screenshot regression can be useful without a giant visual framework

Visual regression is another area where teams can overbuild.

The article How to Use Endtest for Screenshot-Based Regression Checks Without Writing a Heavy Framework focuses on a lighter approach.

Screenshot checks are useful when they are targeted:

  • critical pages
  • checkout
  • dashboards
  • layout-sensitive forms
  • important responsive states
  • design system components
  • pages recently touched by frontend changes

They become painful when teams try to snapshot everything and then ignore the noise.

Visual checks should support release confidence, not create a second review process nobody wants to own.

A practical SDET reading order

If I were using The SDET as a learning path, I would read the notes in this order.

1. Understand maintenance

Start with maintenance metrics and the cost of growing suites.

Read:

2. Build the Playwright foundation

Then focus on framework structure, data, and network control.

Read:

3. Learn CI failure triage

Then move to release pipeline behavior.

Read:

4. Handle modern frontend surfaces

Then cover the tricky UI categories.

Read:

5. Use AI carefully

Finally, use AI as an accelerator, not an autopilot.

Read:

Final thought

The hardest part of test automation is not getting a browser to click a button.

It is keeping the test suite meaningful after the product changes, the team grows, CI gets noisy, browser behavior shifts, and the original framework author is no longer the only person touching the tests.

That is why SDET work is part engineering, part debugging, part product thinking, and part risk management.

A good automated test does not merely pass.

It proves a useful behavior, fails with evidence, and stays maintainable when the application evolves.

That is the standard worth aiming for.
