David Frei

Posted on Jun 12

What Actually Breaks Test Automation After the Demo

#testing #playwright #qa #automation

Most test automation demos are too clean.

The demo app is stable. The login flow is simple. The selectors are obvious. The data is predictable. CI is not under pressure. Nobody is trying to debug a flaky checkout test five minutes before a release.

Real test automation work is different.

The product changes. The frontend gets refactored. A locator breaks. A test passes in preview but fails after merge. An AI-generated Playwright test looks good but asserts the wrong thing. A CI job fails only under parallel execution. Someone adds a feature flag. Someone else updates a React component. The browser suite starts taking too long, so people add retries and hope for the best.

That is the world SDETs actually live in.

I went through the current notes on The SDET and grouped them into a practical reading path for teams trying to build test automation that still works after the first exciting week.

Start with the uncomfortable truth: maintenance is the real product

Writing the first test is rarely the hard part.

Maintaining the 300th test is.

That is why I would start with “What to Measure in Test Automation Maintenance Before Your Suite Becomes Expensive.”

The useful shift is to measure automation as an ongoing system, not a one-time project.

Good maintenance metrics include:

flaky test rate
selector churn
time to debug failures
time to update tests after UI changes
number of retries
number of quarantined tests
CI runtime
test ownership
how often failures are ignored

Those numbers matter more than raw test count.

A suite with 1,000 tests can still be weak if nobody trusts the failures. A suite with 80 tests can be valuable if it covers the right flows and fails for clear reasons.

The worst test suite is not the one that fails.

The worst test suite is the one that fails and everyone says, “It is probably just the tests.”
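A few of the metrics listed above can be computed from plain run records. The sketch below is only an illustration: the `TestRun` shape, the field names, and the sample data are assumptions, not the format of any particular reporter.

```typescript
// Hypothetical record shape: one entry per test per CI run.
interface TestRun {
  testId: string;
  attempts: number; // 1 means no retry was needed
  finalStatus: "passed" | "failed";
  durationMs: number;
}

interface SuiteMetrics {
  totalRuns: number;
  flakyRate: number; // share of runs that passed only after a retry
  failureRate: number; // share of runs that still failed after all attempts
  totalRetries: number;
  totalRuntimeMs: number;
}

function summarize(runs: TestRun[]): SuiteMetrics {
  const totalRuns = runs.length;
  const flaky = runs.filter(r => r.finalStatus === "passed" && r.attempts > 1).length;
  const failed = runs.filter(r => r.finalStatus === "failed").length;
  const totalRetries = runs.reduce((sum, r) => sum + (r.attempts - 1), 0);
  const totalRuntimeMs = runs.reduce((sum, r) => sum + r.durationMs, 0);
  return {
    totalRuns,
    flakyRate: totalRuns === 0 ? 0 : flaky / totalRuns,
    failureRate: totalRuns === 0 ? 0 : failed / totalRuns,
    totalRetries,
    totalRuntimeMs,
  };
}

// Made-up sample data for illustration only.
const sampleRuns: TestRun[] = [
  { testId: "checkout", attempts: 1, finalStatus: "passed", durationMs: 4000 },
  { testId: "login", attempts: 3, finalStatus: "passed", durationMs: 9000 },
  { testId: "search", attempts: 2, finalStatus: "failed", durationMs: 6000 },
  { testId: "profile", attempts: 1, finalStatus: "passed", durationMs: 1000 },
];

const suiteMetrics = summarize(sampleRuns);
console.log(suiteMetrics);
```

Tracking these numbers over time, rather than as a single snapshot, is what shows whether the suite is getting cheaper or more expensive to own.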

Playwright is powerful, but it still needs engineering discipline

Playwright is one of the best tools for modern browser automation, but adopting Playwright does not automatically give you a good test suite.

For a strong foundation, read “How to Build a Playwright Test Framework from Scratch.”

A framework needs more than a folder full of specs. It needs structure around:

fixtures
test data
authentication
browser projects
reporting
artifacts
retries
CI configuration
selector strategy
cleanup
environment handling

This is where a lot of teams underestimate the work.

A simple Playwright test is easy. A reliable Playwright framework that survives product churn is a different thing.

The article “Playwright Test Data Strategies That Keep Your Suite Stable” is a good companion because test data is often the hidden reason browser tests fail.

Bad test data creates fake flakiness.

The test fails, but not because the UI is broken. It fails because the user already exists, the cart is not empty, the record was deleted by another test, the backend state is stale, or two parallel workers used the same fixture.

A stable suite needs test data that is:

isolated
disposable
predictable
parallel-safe
easy to reset
close enough to real business behavior

Without that, every test failure becomes a guessing game.
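One common way to get isolated, parallel-safe data is to derive every identifier from the run, the worker, and the test. The helper below is a minimal sketch under that assumption; `buildTestUser`, the `runId` argument, and the `example.test` domain are hypothetical names chosen for illustration.

```typescript
// Hypothetical shape for the data a checkout test might need.
interface TestUser {
  email: string;
  cartId: string;
}

// Turn a test title into a lowercase, dash-separated fragment.
function slug(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Build identifiers that cannot collide across runs, workers, or tests.
function buildTestUser(runId: string, workerIndex: number, testTitle: string): TestUser {
  const key = `${runId}-w${workerIndex}-${slug(testTitle)}`;
  return {
    email: `${key}@example.test`,
    cartId: `cart-${key}`,
  };
}

// Two workers running the same test get different users and carts.
const userA = buildTestUser("run42", 0, "Checkout: applies coupon");
const userB = buildTestUser("run42", 1, "Checkout: applies coupon");
console.log(userA.email);
console.log(userB.email);
```

In a Playwright suite, the worker number could come from `testInfo.workerIndex`. Because the identifiers are derived rather than random, a failing run can be traced back to the exact records it created.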

Flaky tests should be diagnosed, not tolerated

Flaky tests are not just annoying. They damage trust.

A flaky test creates a decision every time CI goes red:

Is this a real bug?
Should we block the merge?
Should we rerun?
Should we quarantine it?
Who owns the fix?

That is why “How to Stop Flaky Playwright Tests Before They Reach CI” is worth reading early.

The article makes an important point: retries are not a strategy. They are evidence.

A retry can tell you that the failure is probably timing-related, state-related, or environment-related. But if the test needs luck to pass, the release signal is already compromised.

For deeper debugging, read “How to Debug Flaky Playwright Tests with Trace Viewer, Logs, and Timing Clues.”

Trace Viewer is useful because it turns a vague failure into a sequence of facts:

what the browser saw
what action happened
what the DOM looked like
what network calls were made
what console errors appeared
whether an element existed but was not actionable
whether the app was still transitioning

A good trace can show that the problem was not Playwright at all. Maybe the product rendered a button before it was ready. Maybe the API returned late. Maybe an animation blocked the click. Maybe the test asserted too early.

A flaky test is usually not random. It just has not been classified yet.

Learn to classify failures before fixing them

One of the most practical SDET skills is knowing what kind of failure you are looking at.

The article “How I Decide Whether a Flaky Test Is a Product Bug, a Test Bug, or a CI Bug” is useful because it avoids the lazy answer of “the test is flaky.”

A failing test might point to:

a real product bug
a brittle locator
a bad assertion
missing test data setup
backend state drift
CI resource limits
browser differences
timing assumptions
environment mismatch

Each category has a different fix.

If the product allows double submission, the test should probably expose that. If the selector depends on a generated class name, the test needs to change. If the failure only appears when tests run in parallel, the data isolation needs attention.
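The reasoning in the paragraph above can be written down as a first-pass triage rule. This is a rough sketch, not a method taken from the article: the signal names, the category labels, and the order of the checks are all assumptions, and a real triage would weigh more evidence.

```typescript
type FailureCategory =
  | "product bug"
  | "test bug"
  | "test data"
  | "CI or environment"
  | "unclassified";

// Hypothetical signals a person might record while investigating a failure.
interface FailureSignals {
  reproducesManually: boolean; // a human can trigger the same wrong behavior
  failsOnlyInParallel: boolean; // passes when run alone, fails alongside other tests
  locatorNotFound: boolean; // element lookup failed although the feature works by hand
  failsOnlyInCi: boolean; // passes locally with the same code
}

// Check the strongest signal first; the ordering is a judgment call.
function classifyFailure(s: FailureSignals): FailureCategory {
  if (s.reproducesManually) return "product bug";
  if (s.failsOnlyInParallel) return "test data";
  if (s.locatorNotFound) return "test bug";
  if (s.failsOnlyInCi) return "CI or environment";
  return "unclassified";
}

const parallelOnly = classifyFailure({
  reproducesManually: false,
  failsOnlyInParallel: true,
  locatorNotFound: false,
  failsOnlyInCi: true,
});
console.log(parallelOnly);
```

Even a crude rule like this forces the team to record what it observed, which is more useful than a ticket that only says the test is flaky.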

This is also where “How to Debug Flaky API-Plus-UI Flows When the Browser Is Not the Real Problem” becomes important.

Browser tests often get blamed for problems that start below the browser:

async backend processing
slow API responses
stale data
eventual consistency
test account state
feature flags
third-party services

A UI failure is sometimes just the visible symptom of backend instability.

CI failures need artifacts, not theories

CI failures are expensive when they lack evidence.

If the only output is “expected button to be visible,” someone has to reconstruct the run from memory and hope.

That is why “How to Store Playwright Test Artifacts in CI So Failure Triage Is Actually Fast” is one of the most practical notes in the set.

A useful CI failure should include:

trace files
screenshots
video
console logs
network logs
browser version
environment details
retry information
test data identifiers

The goal is to answer one question quickly:

What happened?

Not what might have happened. Not what usually happens. What happened in that run.
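One way to enforce the evidence list above is to check each failure record for gaps before anyone starts triage. The sketch below assumes a hypothetical `FailureReport` shape assembled by the pipeline; none of the field names come from Playwright or any CI product.

```typescript
// Hypothetical record the pipeline assembles for each failed test.
interface FailureReport {
  testTitle: string;
  tracePath?: string;
  screenshotPath?: string;
  videoPath?: string;
  consoleLogPath?: string;
  networkLogPath?: string;
  browserVersion?: string;
  environment?: string;
  retryIndex?: number;
  testDataIds?: string[];
}

const requiredEvidence: Array<keyof FailureReport> = [
  "tracePath",
  "screenshotPath",
  "videoPath",
  "consoleLogPath",
  "networkLogPath",
  "browserVersion",
  "environment",
  "retryIndex",
  "testDataIds",
];

// Return the names of evidence fields that are absent or empty.
function missingEvidence(report: FailureReport): string[] {
  return requiredEvidence.filter(key => {
    const value = report[key];
    if (value === undefined) return true;
    if (Array.isArray(value)) return value.length === 0;
    return value === "";
  });
}

// Made-up example: a failure that kept a trace and screenshot but little else.
const thinReport: FailureReport = {
  testTitle: "checkout applies coupon",
  tracePath: "artifacts/trace.zip",
  screenshotPath: "artifacts/failure.png",
  browserVersion: "chromium",
  environment: "staging",
  retryIndex: 1,
};

const evidenceGaps = missingEvidence(thinReport);
console.log(evidenceGaps);
```

In Playwright, the `trace`, `screenshot`, and `video` options under `use` in the config control whether those artifacts are recorded at all, so a check like this mostly catches upload and retention gaps in the pipeline.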

For broader pipeline confidence, read “What to Test in CI Before You Trust a New Release Pipeline.”

A release pipeline is part of the product delivery system. It needs testing too.

You need confidence in:

build steps
deployment steps
environment parity
secrets
test ordering
artifact retention
rollback behavior
failure reporting
branch and merge behavior

CI is not just a machine that runs tests. It is where release decisions get made.

Passing in preview does not mean passing after merge

Preview environments are helpful, but they are not production and they are not always the same as post-merge environments.

That is the point of “Why Browser Tests Pass in Preview but Fail After Merge.”

A browser test can pass in preview and fail after merge because of:

caching differences
feature flag state
environment variables
seeded data
auth redirects
deployment timing
CDN behavior
hidden dependencies
database migrations
different browser config

The fix is not to dismiss the failure as “CI being weird.”

The fix is to compare environments carefully and decide what the test is actually proving in each stage.
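Comparing environments carefully is easier when each stage records a small snapshot of its settings. The following is a minimal sketch under that assumption; the `EnvSnapshot` type, the key names, and the values are hypothetical.

```typescript
// Hypothetical snapshot: setting name mapped to its value in one environment.
type EnvSnapshot = Record<string, string>;

function valueOrUnset(env: EnvSnapshot, key: string): string {
  const value = env[key];
  return value === undefined ? "(unset)" : value;
}

// List every setting whose value differs between the two stages.
function diffEnvironments(preview: EnvSnapshot, postMerge: EnvSnapshot): string[] {
  const extraKeys = Object.keys(postMerge).filter(k => preview[k] === undefined);
  const keys = Object.keys(preview).concat(extraKeys).sort();
  const differences: string[] = [];
  for (const key of keys) {
    const left = valueOrUnset(preview, key);
    const right = valueOrUnset(postMerge, key);
    if (left !== right) {
      differences.push(`${key}: preview=${left} postMerge=${right}`);
    }
  }
  return differences;
}

// Made-up snapshots for illustration.
const previewEnv: EnvSnapshot = {
  FEATURE_NEW_CHECKOUT: "on",
  BROWSER_CHANNEL: "chromium",
};
const postMergeEnv: EnvSnapshot = {
  FEATURE_NEW_CHECKOUT: "off",
  BROWSER_CHANNEL: "chromium",
  SEED_VERSION: "7",
};

const envDifferences = diffEnvironments(previewEnv, postMergeEnv);
console.log(envDifferences.join("\n"));
```

Attaching this diff to a failing post-merge run turns the comparison into a fact in the report instead of a guess.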

Similarly, “Why E2E Tests Fail in CI but Pass Locally: A Root Cause Checklist” is a useful reference for the classic local-versus-CI problem.

Local runs often have advantages that CI does not:

warmer caches
faster CPU
different viewport
different timezone
reused login state
different secrets
less parallel pressure
different browser version

If the environment changes, the test effectively changes.

Network interception is useful, but it changes the meaning of the test

Playwright network interception is powerful.

It can stabilize tests, mock APIs, control third-party calls, and make authentication flows easier to test.

But it should be used intentionally.

The guide “Playwright Network Interception Tutorial for Testing APIs, Auth, and Third-Party Calls” is useful because it treats interception as a tradeoff, not magic.

Mocking an API can make the UI deterministic, but it can also hide integration risk. Intercepting auth calls can speed up setup, but it may skip important login behavior. Stubbing a third-party service can reduce noise, but it means the test is no longer validating the real dependency.

That is not bad. It just needs to be clear.

A test with mocked network behavior should be labeled and scoped differently from a full end-to-end test.

Modern frontends create special automation problems

A lot of flaky browser tests come from modern frontend behavior.

Shadow DOM, iframes, WebSockets, file uploads, AI-generated frontend changes, React state updates, and fast component churn all create failure modes that simple tests do not cover well.

For Shadow DOM and iframe handling, read “How to Test Shadow DOM and Iframes in Playwright Without Turning Every Locator Into a Guess.”

The important idea is that boundaries should be explicit. A test should know when it is inside a frame, inside a component boundary, or interacting with a nested widget. Otherwise, selectors become a pile of guesses.

For real-time interfaces, read “How to Test WebSocket-Driven UI Flows Without Chasing Race Conditions in E2E.”

Real-time UI flows are hard because timing is part of the product. The test has to distinguish between:

connection state
message delivery
UI update behavior
reconnection behavior
stale data
multi-user synchronization

A simple click-and-expect pattern may not be enough.

For file inputs, read “How to Test File Upload Components in Modern React Apps Without Flaky Selectors.”

File upload tests need to cover more than selecting a file. They should validate the user-visible result: upload accepted, validation shown, progress handled, preview displayed, file attached, or error recovered.

AI-generated frontend changes need QA before they hit release

AI coding tools can change frontend code quickly.

That is useful, but it also means the test strategy has to catch changes that look reasonable in code review but break behavior, selectors, layout, or accessibility.

The article “How I Test AI-Generated Frontend Changes Before They Break the Release Branch” focuses on that exact problem.

AI-generated frontend changes can introduce:

markup drift
changed labels
weaker accessibility
broken selectors
missing loading states
layout regressions
altered button behavior
different form validation behavior

The point is not that AI code is bad. Human code can do all of this too.

The point is that AI-generated changes can be large, plausible, and fast. That makes regression checks even more important.

AI-generated Playwright tests are drafts, not finished automation

A big theme on The SDET is AI-generated Playwright code.

Start with “How to Generate Playwright Tests with ChatGPT” and “How to Generate Playwright Tests with Claude.”

Both are useful because they treat AI as a drafting tool, not a replacement for test design.

AI can help with:

turning user flows into test skeletons
generating boilerplate
suggesting locator strategies
writing first-pass assertions
converting manual test cases
creating examples quickly

But AI does not know your real application constraints unless you give it that context. It does not know which selectors are dynamic, which user accounts are safe, which feature flags are enabled, or which workflows require special setup.

The same idea appears in “How to Generate Playwright Tests with GitHub Copilot” and “How to Generate Playwright Tests with Cursor.”

Coding assistants are useful when you constrain them. They are risky when you let them invent architecture.

If you already have manual test cases, read “How to Use AI to Convert Manual Test Cases into Playwright Tests.”

Manual test cases often contain the real product intent, but they are written for humans. AI can translate them into code only if the input is structured enough:

preconditions
test data
steps
expected results
cleanup notes
business outcome

If the manual case says “verify checkout works,” the AI still has to guess what “works” means.

That is not safe enough for release automation.
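The structure listed above can be checked mechanically before a manual case is handed to an AI tool. This is a sketch only: the `ManualTestCase` shape and the `findGaps` helper are hypothetical and simply mirror the six items in the list.

```typescript
// Hypothetical structured form of a manual test case.
interface ManualTestCase {
  title: string;
  preconditions: string[];
  testData: string[];
  steps: string[];
  expectedResults: string[];
  cleanupNotes: string[];
  businessOutcome: string;
}

// Name every section the AI would otherwise have to guess.
function findGaps(tc: ManualTestCase): string[] {
  const gaps: string[] = [];
  if (tc.preconditions.length === 0) gaps.push("preconditions");
  if (tc.testData.length === 0) gaps.push("test data");
  if (tc.steps.length === 0) gaps.push("steps");
  if (tc.expectedResults.length === 0) gaps.push("expected results");
  if (tc.cleanupNotes.length === 0) gaps.push("cleanup notes");
  if (tc.businessOutcome.trim() === "") gaps.push("business outcome");
  return gaps;
}

// The vague case from the text: a single step and nothing else.
const vagueCase: ManualTestCase = {
  title: "Checkout",
  preconditions: [],
  testData: [],
  steps: ["verify checkout works"],
  expectedResults: [],
  cleanupNotes: [],
  businessOutcome: "",
};

const caseGaps = findGaps(vagueCase);
console.log(caseGaps);
```

A case with open gaps goes back to its author first; only a complete one is worth converting.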

Reviewing AI-generated test code is its own skill

Generated tests should be reviewed like production code.

Actually, they should often be reviewed more carefully, because they can look correct while encoding weak assumptions.

Read “How to Review AI-Generated Playwright Code,” “How to Debug AI-Generated Playwright Tests,” and “AI-Generated Playwright Tests: Complete Example.”

A generated Playwright test needs review for:

locator quality
assertion strength
wait strategy
fixture design
cleanup
test data isolation
CI behavior
readability
whether it tests the intended business outcome

The easiest AI mistake to miss is a weak assertion.

The test clicks through the flow and passes, but it only checks that a page loaded or a URL changed. That may not prove the product behavior the team cares about.

A useful test should answer:

What user outcome did this verify?

If that answer is vague, the test is not ready.
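A crude review aid for the weak-assertion problem is to scan a generated test for its matchers and flag drafts that only check navigation. The sketch below is a heuristic under stated assumptions, not a real linter: it treats any `.toSomething(` call as a matcher, and the choice of `toHaveURL` and `toHaveTitle` as "navigation only" is a judgment call. The test snippets are made-up strings.

```typescript
// Matchers that only prove a page loaded or a URL changed (an assumption).
const navigationOnlyMatchers = ["toHaveURL", "toHaveTitle"];

// Collect every ".toSomething(" call name found in the test source text.
function assertionMatchers(source: string): string[] {
  const found: string[] = [];
  const pattern = /\.(to[A-Z][A-Za-z]*)\(/g;
  let match = pattern.exec(source);
  while (match !== null) {
    found.push(match[1]);
    match = pattern.exec(source);
  }
  return found;
}

// Weak means: no assertions at all, or only navigation-level assertions.
function looksWeak(source: string): boolean {
  const matchers = assertionMatchers(source);
  if (matchers.length === 0) return true;
  return matchers.every(m => navigationOnlyMatchers.indexOf(m) !== -1);
}

// Made-up generated drafts, kept as strings so nothing is executed.
const weakDraft =
  "await page.getByRole('button', { name: 'Pay' }).click();\n" +
  "await expect(page).toHaveURL(/confirmation/);";

const strongerDraft =
  weakDraft + "\n" +
  "await expect(page.getByTestId('order-total')).toHaveText('$42.00');";

console.log(looksWeak(weakDraft));
console.log(looksWeak(strongerDraft));
```

A flag from a check like this is a prompt for a human reviewer to ask what user outcome the test verifies, not a verdict on the test.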

Testing AI-powered product features is different

AI is not only helping write tests. AI is also becoming part of the product.

That creates a different testing problem.

The article “How to Test AI-Powered Form Validation Without Trusting the Model Too Much” is a good example.

AI-powered validation can be useful, but tests should not blindly trust the model output.

Instead, tests should focus on deterministic product behavior:

required errors appear
unsafe input is handled
valid input can proceed
fallback behavior works
model uncertainty is handled
user messaging is clear
server-side validation still protects the system

For AI features, exact output may vary. That means tests need contracts, not just string matches.
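A contract in this sense checks properties that must hold whatever wording the model produces. The sketch below illustrates the idea with a hypothetical `FieldValidationResult` shape; the three rules and the 200-character limit are assumptions picked for the example.

```typescript
// Hypothetical shape of what the validation feature returns to the UI.
interface FieldValidationResult {
  field: string;
  valid: boolean;
  message: string; // model-generated wording, expected to vary between runs
  source: "model" | "fallback";
}

// Check invariants instead of comparing the message to an exact string.
function contractViolations(
  result: FieldValidationResult,
  input: string,
  required: boolean
): string[] {
  const violations: string[] = [];
  if (required && input.trim() === "" && result.valid) {
    violations.push("empty required input was accepted");
  }
  if (!result.valid && result.message.trim() === "") {
    violations.push("rejection has no user-facing message");
  }
  if (result.message.length > 200) {
    violations.push("message is too long to display");
  }
  return violations;
}

// Two differently worded rejections both satisfy the contract.
const wordingA = contractViolations(
  { field: "email", valid: false, message: "Please enter an email address.", source: "model" },
  "",
  true
);
const wordingB = contractViolations(
  { field: "email", valid: false, message: "An email is required to continue.", source: "fallback" },
  "",
  true
);
// A silent acceptance of empty required input breaks it.
const acceptedEmpty = contractViolations(
  { field: "email", valid: true, message: "", source: "model" },
  "   ",
  true
);
console.log(wordingA, wordingB, acceptedEmpty);
```

The exact message is free to change between runs; the test fails only when a rule the product depends on is broken.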

AI coding limits are a real operational risk

Several notes on The SDET cover a problem that teams do not talk about enough: AI usage limits and reasoning limits can interrupt real automation work.

The pattern is familiar.

AI is helpful for small tasks. Then the task becomes messy. The model needs more context. The framework has many helpers. The failure needs reruns. The CI issue requires logs, traces, and comparison. The AI tool hits a limit before the fix is complete.

That does not mean AI coding assistants are bad.

It means you should not design your release process around the assumption that AI will always be available, always have context, and always finish the debugging session.

Generated code still needs human ownership.

If your team cannot maintain the framework without the assistant, ownership of the framework is probably too fragile.

Endtest appears as the lower-maintenance alternative in several scenarios

A recurring comparison on The SDET is when a managed platform like Endtest makes more sense than owning the whole framework yourself.

The useful framing is not “Playwright versus Endtest” or “Selenium versus Endtest” as a religious debate.

The useful question is:

How much framework ownership can this team realistically support?

If the team has strong SDET capacity, a custom Playwright framework can be a good choice.

If the team is small, moving fast, and struggling with browser coverage, test maintenance, and CI triage, a managed platform can be more practical.

The hidden cost of automation is not only writing code. It is maintaining everything around the code:

browser infrastructure
reports
screenshots
videos
logs
selectors
test data
retries
flaky failure triage
CI integration
onboarding
framework conventions

That cost is easy to underestimate.

Screenshot regression can be useful without a giant visual framework

Visual regression is another area where teams can overbuild.

The article “How to Use Endtest for Screenshot-Based Regression Checks Without Writing a Heavy Framework” focuses on a lighter approach.

Screenshot checks are useful when they are targeted:

critical pages
checkout
dashboards
layout-sensitive forms
important responsive states
design system components
pages recently touched by frontend changes

They become painful when teams try to snapshot everything and then ignore the noise.

Visual checks should support release confidence, not create a second review process nobody wants to own.

A practical SDET reading order

If I were using The SDET as a learning path, I would read the notes in this order.

1. Understand maintenance

Start with maintenance metrics and the cost of growing suites.

2. Build the Playwright foundation

Then focus on framework structure, data, and network control.

3. Learn CI failure triage

Then move to release pipeline behavior.

4. Handle modern frontend surfaces

Then cover the tricky UI categories.

5. Use AI carefully

Finally, use AI as an accelerator, not an autopilot.

Final thought

The hardest part of test automation is not getting a browser to click a button.

It is keeping the test suite meaningful after the product changes, the team grows, CI gets noisy, browser behavior shifts, and the original framework author is no longer the only person touching the tests.

That is why SDET work is part engineering, part debugging, part product thinking, and part risk management.

A good automated test does not merely pass.

It proves a useful behavior, fails with evidence, and stays maintainable when the application evolves.

That is the standard worth aiming for.
