DEV Community: Markus Gasser

AI Can Generate a Test Suite. That Does Not Mean You Have One.

Markus Gasser — Wed, 29 Jul 2026 09:08:16 +0000

Generating browser tests with AI has become almost comically easy.

You paste a requirement into Claude, ChatGPT, Copilot, or another coding assistant. A few seconds later, you have a folder full of Playwright tests, fixtures, page objects, and configuration files.

The tests may even pass.

This creates a dangerous moment because the output looks much closer to a finished system than it really is.

A generated test suite is not valuable because it contains many tests. It is valuable when the team can understand it, trust it, maintain it, and use its results to make release decisions.

Those are very different things.

The demo is not the product

Most AI test-generation demos follow the same path:

Give the model a user story.
Ask it to generate tests.
Run the tests.
Show several green checkmarks.

That proves the model can produce executable code. It does not prove that the resulting suite covers the right risks.

Before trusting a generated suite, QA teams should measure much more than the number of tests or the initial pass rate. This guide on what QA teams should measure before trusting a test suite generated by Claude or another coding assistant covers the kinds of signals that matter: useful coverage, false confidence, maintainability, failure clarity, and the suite’s ability to detect real regressions.

A suite with 300 generated tests may be less useful than 30 carefully chosen ones.

The larger suite can create more noise, more duplicated setup, more brittle selectors, and more opportunities for failures that nobody understands.

AI expands whatever process you already have

AI does not automatically fix a weak testing strategy.

It scales it.

When your requirements are vague, the generated tests are vague. When your application has inconsistent test data, AI generates more code around inconsistent data. When your team cannot agree on what should be tested, the model fills in the gaps with assumptions.

This is why hallucinations in test automation are not just a model-quality problem. They are often an input-quality and architecture problem.

The practical guide to reducing AI hallucinations in test automation makes an important distinction: the more context you send and the less structured that context is, the more opportunities the model has to invent details.

The obvious response is to improve the prompt. That helps, but only up to a point.

A better long-term approach is to reduce how much the model must infer.

Instead of repeatedly asking an AI assistant to reconstruct a large Playwright framework, teams can store tests in a structured, human-readable format. AI can help create or modify individual actions without having to regenerate the entire implementation.

That is one of the more useful ideas in this practical guide to AI test automation: use AI where it reduces work, but keep the resulting tests visible and editable.
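One way to picture a structured, human-readable test is as plain data that a reviewer can read and an assistant can edit one step at a time. The sketch below is illustrative only: the `Step` shape, the action names, the example URL, and `replaceStep` are all assumptions made for this example, not the format of any particular tool.

```typescript
// Hypothetical step format: each step is plain data, not framework code.
type Step =
  | { action: "goto"; url: string }
  | { action: "fill"; target: string; value: string }
  | { action: "click"; target: string }
  | { action: "expectText"; target: string; text: string };

interface StructuredTest {
  title: string;
  steps: Step[];
}

// Replace exactly one step and leave the rest untouched, so a reviewer
// sees a one-step change rather than a regenerated file.
function replaceStep(test: StructuredTest, index: number, next: Step): StructuredTest {
  if (index < 0 || index >= test.steps.length) {
    throw new RangeError(`No step at index ${index}`);
  }
  const steps = test.steps.map((step, i) => (i === index ? next : step));
  return { ...test, steps };
}

const login: StructuredTest = {
  title: "User can sign in",
  steps: [
    { action: "goto", url: "https://example.test/login" },
    { action: "fill", target: "Email", value: "qa@example.test" },
    { action: "click", target: "Sign in" },
    { action: "expectText", target: "Heading", text: "Dashboard" },
  ],
};

// An assistant (or a person) changes only the third step.
const updated = replaceStep(login, 2, { action: "click", target: "Log in" });
```

The point of the design is the size of the change: the original test is not mutated, and everything the model did not touch stays exactly as a human last approved it.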

Logging matters more when the system is autonomous

A human-written browser test generally performs a known sequence of actions.

An AI agent may inspect the page, choose an action, fail, reinterpret the screen, retry, select a different locator, and continue.

That flexibility can make the agent more resilient. It can also make failures much harder to understand.

Suppose an agent tries to click a button three times. On the fourth attempt, it chooses a text link with a similar label and reaches the next page.

Did the test recover intelligently?

Or did it stop testing the intended path?

Without the right evidence, you cannot know.

At a minimum, the run should preserve:

The original goal
The chosen action
The locator or target
The page state before the action
The error returned by the browser
Every retry and the reason for it
The alternative strategy selected
Screenshots and DOM evidence
The final outcome

This article about what to log when an AI test agent retries a browser step and still fails provides a useful starting point.

For more complex systems, ordinary text logs are rarely enough. Instrumenting AI test agents with OpenTelemetry spans, structured logs, and replayable artifacts gives the team a better way to reconstruct the agent’s decisions.

The important word is reconstruct.

When an autonomous test fails, the person investigating should not have to guess what the agent believed it was doing.
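The button example above can be made concrete with a small record per attempt. This is a minimal sketch: the `StepAttempt` fields mirror the list of evidence above, but the field names, the `summarizeAttempts` helper, and the sample targets are assumptions for illustration.

```typescript
// Hypothetical record of one attempt at a single browser step.
interface StepAttempt {
  goal: string;               // what the agent was trying to achieve
  action: string;             // for example "click"
  target: string;             // locator or description of the element
  pageUrl: string;            // page state before the action
  error: string | null;       // error returned by the browser, if any
  retryReason: string | null; // why the agent tried again, if it did
  succeeded: boolean;
}

interface AttemptSummary {
  recovered: boolean;     // did any attempt succeed?
  targetChanged: boolean; // did the successful attempt use a different target?
  originalTarget: string | null;
  finalTarget: string | null;
}

function summarizeAttempts(attempts: StepAttempt[]): AttemptSummary {
  const first = attempts[0];
  const winner = attempts.find((attempt) => attempt.succeeded);
  const originalTarget = first ? first.target : null;
  const finalTarget = winner ? winner.target : null;
  return {
    recovered: winner !== undefined,
    targetChanged:
      originalTarget !== null && finalTarget !== null && originalTarget !== finalTarget,
    originalTarget,
    finalTarget,
  };
}

const failedClick = (retryReason: string | null): StepAttempt => ({
  goal: "Submit the order",
  action: "click",
  target: "Submit button",
  pageUrl: "/checkout",
  error: "Timed out waiting for element",
  retryReason,
  succeeded: false,
});

const attempts: StepAttempt[] = [
  failedClick(null),
  failedClick("element not visible"),
  failedClick("element not visible"),
  {
    goal: "Submit the order",
    action: "click",
    target: "Submit text link",
    pageUrl: "/checkout",
    error: null,
    retryReason: "switched to an element with a similar label",
    succeeded: true,
  },
];

const summary = summarizeAttempts(attempts);
```

A summary with `recovered` and `targetChanged` both true does not say the run was wrong. It says a human should decide whether the intended path was still tested.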

Prompt drift becomes test drift

Traditional browser tests usually fail because the application changes.

AI-assisted tests introduce another source of change: the prompt.

A small prompt edit can alter which paths are selected, how assertions are interpreted, and what the system considers a successful result.

The same thing happens when the underlying model changes, even when the prompt does not.

That means teams need versioning for more than test code. They may need to preserve:

Prompt versions
Model or agent configuration
Input datasets
Generated steps
Human approvals
Expected outputs
Screenshots and evidence
The reason a generated change was accepted

The article on prompt replay, human review, and evidence retention explains why these capabilities matter once AI-generated behavior enters a release process.

Similarly, this guide to choosing an AI testing platform when prompt changes make UI signals harder to trust highlights a problem many teams discover late: a passing result is not useful unless you understand what remained consistent between runs.
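Knowing what remained consistent between runs can start with something as plain as a per-run manifest and a comparison. The sketch below assumes four hypothetical fields; a real manifest would likely carry more, and the version strings are invented for the example.

```typescript
// Hypothetical manifest stored alongside every run.
interface RunManifest {
  promptVersion: string;
  model: string;
  datasetVersion: string;
  appVersion: string;
}

const MANIFEST_KEYS: (keyof RunManifest)[] = [
  "promptVersion",
  "model",
  "datasetVersion",
  "appVersion",
];

// Return the fields that differ between two runs.
function changedFields(before: RunManifest, after: RunManifest): (keyof RunManifest)[] {
  return MANIFEST_KEYS.filter((key) => before[key] !== after[key]);
}

const lastGreenRun: RunManifest = {
  promptVersion: "p-14",
  model: "model-a",
  datasetVersion: "d-3",
  appVersion: "2.8.0",
};

const todayRun: RunManifest = { ...lastGreenRun, promptVersion: "p-15" };

const drift = changedFields(lastGreenRun, todayRun);
```

If a result changes and the only differing field is the prompt version, the investigation starts with the prompt rather than the application.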

AI interfaces create new categories of assertions

Testing an AI feature is not the same as testing a deterministic form.

A traditional application might return a fixed validation message. An AI assistant may produce several valid answers with different wording.

This changes the question from:

Did the output exactly match this string?

to:

Did the output satisfy the user’s goal without violating important constraints?

Consider AI-generated copy. The text may be semantically correct while still breaking the interface through unusually long sentences, unsupported characters, missing accessible labels, or poor localization. The article on testing AI-generated copy in web interfaces covers these less obvious failure modes.
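A constraint-based assertion accepts any wording that meets the goal and stays inside the limits. The following is a rough sketch: the `CopyConstraints` fields, the 80-character limit, and the forbidden-character pattern are assumptions chosen for the example, not recommended values.

```typescript
// Hypothetical constraints for one piece of generated interface copy.
interface CopyConstraints {
  maxLength: number;          // longest text the layout can hold
  requiredPhrases: string[];  // must appear, compared case-insensitively
  forbiddenPattern: RegExp;   // must not match (use a non-global regex)
}

// Return a list of violations; an empty list means the copy is acceptable.
function checkCopy(text: string, constraints: CopyConstraints): string[] {
  const violations: string[] = [];
  if (text.length > constraints.maxLength) {
    violations.push(`too long: ${text.length} > ${constraints.maxLength}`);
  }
  const lower = text.toLowerCase();
  for (const phrase of constraints.requiredPhrases) {
    if (lower.indexOf(phrase.toLowerCase()) === -1) {
      violations.push(`missing phrase: ${phrase}`);
    }
  }
  if (constraints.forbiddenPattern.test(text)) {
    violations.push("contains forbidden characters");
  }
  return violations;
}

const resetNotice: CopyConstraints = {
  maxLength: 80,
  requiredPhrases: ["password"],
  forbiddenPattern: /[<>{}]/, // stray markup or template characters
};

// Two different wordings, both acceptable.
const wordingA = checkCopy("Your password was reset. Check your email for next steps.", resetNotice);
const wordingB = checkCopy("We have reset your password and sent a confirmation email.", resetNotice);
// One output that breaks the interface.
const broken = checkCopy("<b>Done</b>", resetNotice);
```

This does not judge whether the copy is good. It only catches the mechanical failures that an exact-string assertion would either miss or report for every harmless rewording.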

Form assistants create similar complications. You need to test not just whether a suggestion appears, but whether the user can reject it, edit it, recover from an invalid suggestion, and reset the state. This practical look at testing AI-powered form assistants goes deeper into those flows.

Floating copilots and AI sidebars introduce another set of problems. They can obscure controls, preserve stale context, or behave differently when reopened. This comparison of Endtest and Playwright for testing AI sidebars and floating command panels illustrates how quickly a seemingly simple chat panel turns into a state-management problem.

RAG testing needs evidence, not confidence

Retrieval-augmented generation applications are especially easy to test badly.

A chatbot provides a polished answer, the answer sounds reasonable, and the test passes.

But the real questions are:

Were the correct documents retrieved?
Did the answer reflect those documents?
Were citations attached to the correct claims?
Did the system ignore outdated or unauthorized sources?
Did a ranking change alter the answer?
Could the result be reproduced?

A practical review of testing RAG chatbots, retrieval drift, and source-citation flows shows why testing only the final text is insufficient.

At a broader level, this market map of browser-testing platforms for AI search and reranking validation provides a useful way to think about the available tooling.

The system needs to record enough evidence to distinguish a model problem from a retrieval problem, a ranking problem, a permissions problem, or a frontend problem.

Otherwise, every failure becomes “the AI gave a bad answer,” which is not actionable.
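To make that distinction actionable, the recorded evidence can be run through a simple ordered check. This is a sketch under stated assumptions: the `RagEvidence` fields, the verdict names, and the order of the checks are all hypothetical, and a real system would need richer evidence than document IDs.

```typescript
// Hypothetical evidence captured for one RAG test run.
interface RagEvidence {
  expectedDocIds: string[];  // documents the answer should rely on
  retrievedDocIds: string[]; // what the retriever returned, in ranked order
  citedDocIds: string[];     // what the answer cited
  allowedDocIds: string[];   // documents this user is permitted to see
  answerRendered: boolean;   // did the UI display the answer?
}

type RagVerdict =
  | "permissions"
  | "retrieval"
  | "ranking"
  | "citation"
  | "frontend"
  | "needs-answer-review";

// Check the cheapest, most objective causes first.
function classifyRagRun(evidence: RagEvidence, topK: number): RagVerdict {
  const allowed = new Set(evidence.allowedDocIds);
  if (evidence.retrievedDocIds.some((id) => !allowed.has(id))) return "permissions";

  const retrieved = new Set(evidence.retrievedDocIds);
  if (evidence.expectedDocIds.some((id) => !retrieved.has(id))) return "retrieval";

  const topRanked = new Set(evidence.retrievedDocIds.slice(0, topK));
  if (evidence.expectedDocIds.some((id) => !topRanked.has(id))) return "ranking";

  if (evidence.citedDocIds.some((id) => !retrieved.has(id))) return "citation";
  if (!evidence.answerRendered) return "frontend";

  // Nothing mechanical is wrong; the answer text itself needs review.
  return "needs-answer-review";
}

const healthyRun: RagEvidence = {
  expectedDocIds: ["d1"],
  retrievedDocIds: ["d1", "d2"],
  citedDocIds: ["d1"],
  allowedDocIds: ["d1", "d2"],
  answerRendered: true,
};

const rankingDrift = classifyRagRun({ ...healthyRun, retrievedDocIds: ["d2", "d1"] }, 1);
```

Only the final verdict points at the model. Every earlier one names a different owner, which is what makes the failure actionable.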

The affordable option is not always the free framework

Open-source libraries are inexpensive to download.

The system built around them may not be inexpensive at all.

The real cost includes test generation, framework design, code review, CI integration, debugging, reporting, test-data management, browser infrastructure, retries, artifact storage, and ongoing maintenance.

AI can reduce some of that work. It can also produce more code than the team can realistically review.

That is why the discussion around affordable AI test automation should focus on total cost rather than licence cost.

A generated Playwright repository may be free in the narrowest possible sense. If several engineers spend every sprint repairing it, explaining it, and rebuilding its infrastructure, it is not free in any business sense.

Start with the decision the test should support

The most useful question is not:

How many tests can AI generate for us?

It is:

What release decision will these tests help us make?

Once that is clear, the rest becomes easier.

You can choose the smallest useful scope, define the evidence required, limit agent autonomy, preserve the generated steps, and decide where human review belongs.

AI can dramatically reduce the work required to create tests.

It cannot decide what your organization should trust.

That remains the team’s job.

The Test Framework Is Not the Product

Markus Gasser — Mon, 27 Jul 2026 21:29:05 +0000

A few years ago, the hardest part of building a browser test framework was getting started.

You had to choose a runner, configure browsers, create page objects, wire up reporting, add retries, manage secrets, connect it to CI, and convince someone else on the team to learn how the whole thing worked.

Today, you can open an AI assistant and ask it to generate most of that before lunch.

That sounds like a dramatic improvement. In some ways, it is.

But it also moves the bottleneck.

The question is no longer, “Can we create a framework?”

The question is, “Can we operate what was created?”

That distinction matters more than it appears.

Generation cost is not ownership cost

A generated framework feels cheap because the first version arrives quickly. The code compiles, a few tests pass, and the pull request looks more complete than anything you could have written in an afternoon.

Then reality starts applying pressure.

The application changes. Authentication behaves differently in staging. A shared helper starts hiding failures. Parallel workers collide over test data. Someone upgrades a dependency and three reporters stop agreeing with one another.

The initial generation was fast. The ownership cost was merely deferred.

This is the central problem described in what actually breaks when Claude generates a large Playwright framework. Large generated systems often fail in the seams: fixtures, abstractions, environment assumptions, test data, and conventions that were never explicitly agreed upon.

The code may be readable line by line while the system remains difficult to reason about as a whole.

That is a dangerous form of complexity because it looks productive.

More code can hide less understanding

Teams sometimes evaluate AI-generated automation by counting output:

number of test files;
number of scenarios;
number of passing checks;
number of prompts completed;
number of lines added.

Those numbers are easy to produce and easy to report.

They are also weak proxies for confidence.

A suite with 500 generated tests can be less useful than a suite with 40 deliberately chosen journeys. The larger suite may validate superficial states repeatedly while missing the handful of transitions that actually put revenue, customer trust, or data integrity at risk.

That is why AI test coverage breaks down when teams optimize for prompt pass rate instead of user journey risk. A prompt can succeed while the resulting test strategy remains badly shaped.

The goal is not to prove that the AI followed instructions.

The goal is to reduce the probability of an expensive surprise.

Those are not the same thing.

The stack keeps expanding quietly

A common setup now looks like this:

Playwright runs the browser.
Claude generates or modifies test code.
GitHub Actions runs the suite.
A reporting service stores results.
A visual tool compares screenshots.
A test data service creates accounts.
Slack receives alerts.
Someone maintains prompts, conventions, and guardrails.

Every component can be reasonable on its own.

The problem is the integration surface.

When a test fails, the team has to determine whether the issue came from the product, generated code, browser timing, fixture state, environment configuration, a model assumption, or the reporting layer.

This is the point where Playwright plus Claude starts feeling like too many moving parts. The burden is not necessarily that either tool is bad. The burden is that your team has effectively become the vendor responsible for assembling, documenting, and supporting the combined system.

That can be a good trade for some companies.

It is not automatically a good trade for yours.

Generated frameworks inherit generated inconsistencies

Ask an AI to add ten tests over several weeks and you may receive several competing ideas about architecture.

One test uses page objects. Another uses fixtures directly. One helper waits for network idle. Another waits for a locator. One file creates data through an API. Another drives the setup through the UI. Naming conventions shift with the wording of the prompt.

Each individual decision can look defensible.

Together, they create entropy.

The same concern applies whether the output is Playwright or Selenium. The thing to watch for when Claude generates a large Playwright or Selenium framework is not merely syntax quality. It is whether the generated system develops a coherent internal model that humans can consistently extend.

Without a strong architecture owner, AI often accelerates local decisions faster than the team can establish global consistency.

You get more automation and less standardization at the same time.

The real test is the second year

The first month of a new framework is unusually flattering.

The original author remembers everything. The application has not drifted much. Dependencies are current. The test count is manageable. Failures still feel novel enough to investigate.

The second year is where the economics become visible.

Can a new engineer understand why a helper exists?

Can QA modify a business flow without rewriting TypeScript?

Can you identify unused fixtures?

Can you upgrade the runner without a migration project?

Can you distinguish a product defect from a brittle assertion in ten minutes?

A useful evaluation should focus on operational questions like these. This is also why guidance on choosing a browser testing tool for stable runs on fast-changing frontends should be read as an organizational decision, not a feature checklist.

Fast-changing products punish unclear ownership and fragile abstractions.

They reward systems that remain legible under change.

Lightweight is a promise that needs testing

“Lightweight” sounds good because nobody wants another platform rollout.

But lightweight can mean several different things:

fewer features;
less configuration;
a smaller runtime;
a simpler interface;
less vendor involvement;
more work delegated to your own team.

The last meaning is often omitted.

Before adopting a small AI runner, compare what is included and what you will need to build around it. What to compare before adopting a lightweight AI test runner is less about raw capability than about the boundary between the product and your internal engineering work.

A tool can have a tiny installation footprint and a very large organizational footprint.

That is not necessarily wrong. It just needs to be priced honestly.

Buy versus build is now build versus continuously regenerate

The old debate was straightforward:

Should we buy a testing platform or build our own framework?

AI has introduced a third option that feels different but often behaves much like building your own:

Continuously regenerate and patch an internal framework with AI.

This can reduce the labour required for individual changes. It does not remove the need for architecture, review, debugging, security decisions, test data management, release policies, and maintenance ownership.

AI changes the speed of implementation.

It does not eliminate the consequences of implementation.

The better question is not, “How quickly can we generate this?”

It is:

What permanent responsibility are we creating for the team?

That question is boring, which is usually a sign that it is useful.

A generated test framework can absolutely be the right choice. But the framework is not the product your company sells. It is infrastructure supporting the product.

Treat it accordingly.

Optimize for confidence, comprehensibility, and maintenance cost—not for the excitement of watching a model produce 4,000 lines of code in one sitting.

Your Test Suite Isn't Slow. It's Accumulating Decisions

Markus Gasser — Fri, 24 Jul 2026 22:09:41 +0000

Most browser test suites do not collapse in one dramatic moment.

They get a little slower on Monday.

A little noisier on Tuesday.

Someone adds a retry on Wednesday.

By Friday, the pipeline takes 28 minutes, three tests fail for reasons nobody can reproduce, and the team has quietly learned to merge anyway.

That is how reliability debt works. It rarely looks urgent while you are creating it.

The common explanation is that browser tests are inherently flaky. That explanation is convenient because it makes the problem feel unavoidable. But many of the failures we call “flakiness” are really the accumulated result of dozens of small technical decisions:

an animation that behaves differently in CI;
a feature flag that changes the DOM after the test starts;
a performance threshold that ignores natural variance;
more parallel workers than the environment can actually support;
an assertion that checks implementation details instead of user-visible outcomes.

The test suite is not betraying you. It is reporting the architecture you gave it.

The environment is part of the product

A test that passes locally and fails in CI is often treated as a tooling problem. Sometimes it is. More often, the two environments are not equivalent.

One subtle example is CSS motion. A developer machine may use normal motion preferences while a CI browser reports reduced motion, or vice versa. That can change transition duration, animation timing, element visibility, and even which branch of a component renders.

The result is a test that appears to fail randomly even though it is responding consistently to different inputs. This article on why browser tests fail when CSS motion preferences differ between local and CI environments is a good reminder that browser configuration is test data.

The same principle applies to locale, timezone, colour scheme, viewport size, available fonts, GPU behaviour, network conditions, and feature flags.

Teams often spend hours debugging the final assertion when the real difference was established before the first line of the test ran.

A useful rule is this:

If an environment setting can change the user experience, make it explicit in the test configuration.

Do not depend on whatever default the laptop, container, or hosted runner happens to provide.
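One lightweight way to enforce that rule is to list the settings that must be explicit and fail fast when any is left to a default. The sketch below is framework-agnostic: the `BrowserEnvironment` fields and `missingSettings` are assumptions for this example and would need mapping onto whatever options your runner actually exposes.

```typescript
// Hypothetical set of settings the team has agreed must never be implicit.
interface BrowserEnvironment {
  locale: string;
  timezone: string;
  colorScheme: "light" | "dark";
  reducedMotion: boolean;
  viewport: { width: number; height: number };
}

const REQUIRED_SETTINGS: (keyof BrowserEnvironment)[] = [
  "locale",
  "timezone",
  "colorScheme",
  "reducedMotion",
  "viewport",
];

// Return every setting that was left unspecified.
function missingSettings(env: Partial<BrowserEnvironment>): (keyof BrowserEnvironment)[] {
  return REQUIRED_SETTINGS.filter((key) => env[key] === undefined);
}

const ciEnvironment: BrowserEnvironment = {
  locale: "en-GB",
  timezone: "UTC",
  colorScheme: "light",
  reducedMotion: true,
  viewport: { width: 1280, height: 720 },
};

// A configuration that silently relies on machine defaults.
const laptopDefaults: Partial<BrowserEnvironment> = {
  locale: "en-GB",
  viewport: { width: 1280, height: 720 },
};

const gaps = missingSettings(laptopDefaults);
```

Running a check like this before the suite starts turns "works on my machine" into a named list of settings to pin.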

Feature flags create multiple applications

A feature flag is not just a Boolean variable. It creates another version of your product.

Five independent on/off flags can theoretically create 32 combinations (2 × 2 × 2 × 2 × 2). Most teams do not test all of them, nor should they. But many teams also fail to define which combinations matter.

That is where browser tests become confusing. The same test name may execute against different UI structures depending on rollout state, account assignment, cached configuration, or timing.

The practical guide on debugging frontend tests that fail after feature flag changes highlights the first thing to verify: what state did the application actually render?

That question sounds obvious. Yet many test reports preserve screenshots and logs without preserving the active flag set.

When a failure is tied to gradual rollout logic, capture at least:

the evaluated flag values;
the user or account segment;
the application version;
the relevant API response;
the DOM or screenshot at the moment the branch appeared.

Without that context, you are not debugging a test. You are reconstructing a missing environment.
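The combination count quoted earlier in this section is easy to verify, and enumerating the combinations is a useful first step toward choosing the few that matter. A minimal sketch, assuming every flag is a simple on/off switch; the flag names are invented for the example.

```typescript
// Number of on/off combinations for a given number of independent flags.
function combinationCount(flagCount: number): number {
  return 2 ** flagCount;
}

// List every on/off combination for the named flags.
// Uses 32-bit bit masks, so it is only meant for a small number of flags.
function enumerateCombinations(flags: string[]): Record<string, boolean>[] {
  const total = combinationCount(flags.length);
  const combinations: Record<string, boolean>[] = [];
  for (let mask = 0; mask < total; mask++) {
    const combination: Record<string, boolean> = {};
    flags.forEach((flag, bit) => {
      combination[flag] = (mask & (1 << bit)) !== 0;
    });
    combinations.push(combination);
  }
  return combinations;
}

const fiveFlags = combinationCount(5);
const checkoutMatrix = enumerateCombinations(["newCheckout", "darkMode"]);
```

The full list is rarely what you run. It is what you prune from, so that the combinations left untested are a recorded decision rather than an accident.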

Modern UI patterns require different assertions

React Server Actions and optimistic interfaces make applications feel faster by showing the expected result before the server confirms it.

That is good product design. It also creates several states that browser tests can accidentally confuse:

the original state;
the optimistic state;
the confirmed state;
the rollback state after an error.

A test that sees the optimistic update and immediately passes may miss a server failure. A test that waits only for a network request may ignore a rollback bug. A test that asserts every transitional DOM detail becomes brittle whenever the implementation changes.

The article on testing React Server Actions and optimistic UI offers a more durable approach: assert the state transitions that matter to the user, not every intermediate implementation detail.

For a “save” action, that might mean:

the user sees the immediate optimistic update;
the server request succeeds;
the state remains correct after a reload;
an error produces a clear rollback or recovery path.

This is a broader lesson. Browser automation should validate product promises. The closer your assertions are to internal mechanics, the more maintenance you purchase.
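The four states listed above can be treated as a small state machine, which lets a test assert the sequence the user saw without caring how it was rendered. This is an illustrative sketch: the state names and the allowed transitions are assumptions about one hypothetical "save" flow, and how the test observes each state is left out.

```typescript
type SaveState = "original" | "optimistic" | "confirmed" | "rolledBack";

// Hypothetical rules: an optimistic update must end in confirmation or rollback.
const ALLOWED_TRANSITIONS: Record<SaveState, SaveState[]> = {
  original: ["optimistic"],
  optimistic: ["confirmed", "rolledBack"],
  confirmed: [],
  rolledBack: [],
};

// True only if the observed states form a legal path that reaches a final state.
function isValidSaveSequence(observed: SaveState[]): boolean {
  if (observed.length === 0 || observed[0] !== "original") return false;
  for (let i = 1; i < observed.length; i++) {
    const previous = observed[i - 1];
    const next = observed[i];
    if (previous === undefined || next === undefined) return false;
    if (ALLOWED_TRANSITIONS[previous].indexOf(next) === -1) return false;
  }
  const last = observed[observed.length - 1];
  return last === "confirmed" || last === "rolledBack";
}

// A test that stops at the optimistic update has not finished observing.
const stoppedTooEarly = isValidSaveSequence(["original", "optimistic"]);
const happyPath = isValidSaveSequence(["original", "optimistic", "confirmed"]);
const recoveredFromError = isValidSaveSequence(["original", "optimistic", "rolledBack"]);
```

Note that the check rejects a run that ends on the optimistic state, which is exactly the case where a server failure would otherwise go unnoticed.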

Parallelism has a ceiling

When a suite becomes slow, the reflex is to add more workers.

That works—until it does not.

Parallel test execution competes for CPU, memory, browser processes, network bandwidth, database connections, test accounts, and shared environments. Once one of those resources saturates, adding workers can make the suite slower rather than faster.

This framework for understanding why CI test suites get slower when parallelism increases is worth using before paying for larger runners or increasing concurrency again.

The important metric is not the number of workers. It is throughput: how many tests the suite completes per unit of wall-clock time.

Run a simple experiment:

execute the same representative group with 2 workers;
repeat with 4, 8, and 16;
record total duration, failure rate, CPU, memory, and external-service latency;
stop increasing concurrency when throughput stops improving reliably.

Most teams have an efficient range, not an efficient maximum.

And beware of shared state. Two tests using the same account, inbox, cart, project, or database record are not truly independent. Parallelism merely makes the collision occur sooner.
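The worker-count experiment described above reduces to a small calculation once the runs are recorded. The sketch below is one possible reading of "stops improving reliably": the 10% minimum gain, the 2% failure-rate ceiling, and the sample timings are assumptions, not measurements.

```typescript
// One recorded run of the same representative group of tests.
interface Trial {
  workers: number;
  tests: number;
  durationSeconds: number;
  failures: number;
}

// Tests completed per second of wall-clock time.
function throughput(trial: Trial): number {
  return trial.tests / trial.durationSeconds;
}

// Walk up the worker counts and stop at the last one that still paid off.
function pickWorkerCount(
  trials: Trial[],
  minGain = 0.1,        // require at least 10% more throughput per step
  maxFailureRate = 0.02 // stop if more than 2% of tests fail
): number | null {
  const sorted = [...trials].sort((a, b) => a.workers - b.workers);
  let best: Trial | null = null;
  for (const trial of sorted) {
    if (trial.failures / trial.tests > maxFailureRate) break;
    if (best !== null && throughput(trial) < throughput(best) * (1 + minGain)) break;
    best = trial;
  }
  return best === null ? null : best.workers;
}

// Invented numbers: 16 workers is both slower and noisier than 8.
const trials: Trial[] = [
  { workers: 2, tests: 200, durationSeconds: 600, failures: 0 },
  { workers: 4, tests: 200, durationSeconds: 330, failures: 1 },
  { workers: 8, tests: 200, durationSeconds: 240, failures: 2 },
  { workers: 16, tests: 200, durationSeconds: 250, failures: 9 },
];

const chosenWorkers = pickWorkerCount(trials);
```

A single run per worker count is noisy. Repeating each trial and feeding in the median duration would make the choice more trustworthy.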

Dynamic interfaces punish vague synchronization

Tables, filters, and infinite scrolling are common sources of false confidence because the first visible state looks complete before the application is finished.

A filter click can trigger debouncing, multiple requests, a loading placeholder, a DOM replacement, and finally a stable result. Waiting for “the table to be visible” proves almost nothing.

The comparison of Playwright, Cypress, and Selenium for dynamic tables, filters, and infinite scroll shows that tooling matters, but the test model matters more.

Reliable tests usually wait for a business-level condition:

the loading indicator disappears;
the result count changes;
a known row appears;
the final page cursor updates;
the API response corresponding to the action completes.

“Sleep for two seconds” is not synchronization. It is a bet.

Sometimes the bet wins for months. Then the CI environment becomes slightly slower, and the suite suddenly looks haunted.

Performance budgets need tolerance, not wishful thinking

Performance checks are valuable because regressions can be invisible in functional tests. But a strict threshold without an understanding of variance produces alert fatigue.

A page that normally loads in 0.9 to 1.2 seconds should not fail every time it reaches 1.21 seconds. At the same time, using a generous fixed ceiling can hide a gradual decline.

A better model is explained in this guide to building a CI gate for frontend performance budgets without flagging every normal fluctuation.

Useful performance gates often combine:

an absolute maximum;
a percentage regression from a baseline;
several samples instead of one;
separate budgets for different page types;
a warning range before a hard failure.

The goal is not to produce a perfectly stable number. The goal is to detect meaningful degradation early enough to act.
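Several of the ingredients listed above fit into one small function: an absolute maximum, a percentage regression from a baseline, multiple samples, and a warning range. This is a sketch only; the `Budget` fields, the thresholds, and the sample timings are assumptions chosen to echo the 0.9 to 1.2 second example.

```typescript
// Hypothetical budget for one page type.
interface Budget {
  hardMaxMs: number;      // absolute ceiling
  warnRegression: number; // e.g. 0.10 means 10% slower than baseline
  failRegression: number; // e.g. 0.25 means 25% slower than baseline
}

type GateResult = "pass" | "warn" | "fail";

// Median of several samples, so one slow run does not decide the outcome.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median needs at least one sample");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? 0;
  const lower = sorted[mid - 1] ?? upper;
  return sorted.length % 2 === 1 ? upper : (lower + upper) / 2;
}

function evaluateBudget(samplesMs: number[], baselineMs: number, budget: Budget): GateResult {
  const typical = median(samplesMs);
  const regression = (typical - baselineMs) / baselineMs;
  if (typical > budget.hardMaxMs || regression > budget.failRegression) return "fail";
  if (regression > budget.warnRegression) return "warn";
  return "pass";
}

const productPage: Budget = { hardMaxMs: 1500, warnRegression: 0.1, failRegression: 0.25 };
const baselineMs = 1050;

const normalNoise = evaluateBudget([980, 1210, 1040], baselineMs, productPage);
const creeping = evaluateBudget([1200, 1230, 1180], baselineMs, productPage);
const regressed = evaluateBudget([1400, 1350, 1320], baselineMs, productPage);
```

In this example the single 1,210 ms sample does not fail the gate, because the median of the three samples is still close to the baseline.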

Healthy load tests can still describe an unhealthy product

A load test can show low server response times while real users experience slow pages.

This happens because the test measures the backend request but not the full browser experience: JavaScript execution, hydration, third-party scripts, image decoding, layout shifts, client-side rendering, and long tasks.

The article on why load test results can look healthy while users still experience slow pages makes the distinction clear: infrastructure health and user experience overlap, but they are not identical.

You need both views.

Load tests answer questions such as:

Can the service handle 5,000 concurrent requests?
Where does database latency increase?
When do queues and connection pools saturate?

Browser performance tests answer different questions:

When can the user interact?
Did the main thread freeze?
Did the page shift while loading?
Did a third-party dependency delay the critical path?

A green load test is useful evidence. It is not a certificate of speed.

Measure the cost before adding more coverage

Teams track how many tests they have, but fewer track what those tests cost to own.

That is a mistake, because test count is only an input. The maintenance burden is what the business actually pays for.

This guide on measuring test suite maintenance cost before it eats your sprint suggests looking beyond execution time.

Track:

engineering hours spent fixing tests;
failures caused by product defects versus test defects;
median time to diagnose a failed run;
repeated failures by component;
tests ignored or retried;
percentage of the suite that has not caught a defect in the last year.

The last metric is uncomfortable, which is why it is useful.

Some tests protect critical workflows and should survive years of product changes. Others were created because coverage looked good in a planning document and now provide very little signal.

Deleting a low-value test is not reducing quality. It can increase quality by making failures credible again.
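The last metric in the list above is straightforward to compute if each test records when it last caught a real defect. A minimal sketch, assuming a hypothetical `TestRecord` shape and treating "a year" as 365 days; the test IDs and dates are invented.

```typescript
// Hypothetical record kept per test.
interface TestRecord {
  id: string;
  lastDefectCaughtAt: string | null; // ISO 8601 timestamp, or null if never
}

const YEAR_MS = 365 * 24 * 60 * 60 * 1000;

// Share of the suite (0 to 1) that has not caught a defect in the last year.
function shareWithoutRecentCatch(records: TestRecord[], now: Date): number {
  if (records.length === 0) return 0;
  const cutoff = now.getTime() - YEAR_MS;
  const quiet = records.filter(
    (record) =>
      record.lastDefectCaughtAt === null || Date.parse(record.lastDefectCaughtAt) < cutoff
  );
  return quiet.length / records.length;
}

const suiteRecords: TestRecord[] = [
  { id: "checkout-happy-path", lastDefectCaughtAt: "2026-05-01T00:00:00Z" },
  { id: "login-lockout", lastDefectCaughtAt: "2025-12-01T00:00:00Z" },
  { id: "footer-links", lastDefectCaughtAt: "2024-01-01T00:00:00Z" },
  { id: "legacy-banner", lastDefectCaughtAt: null },
];

const quietShare = shareWithoutRecentCatch(suiteRecords, new Date("2026-07-24T00:00:00Z"));
```

A quiet test is a candidate for review, not automatically for deletion. A test guarding a critical workflow may go a long time without catching a defect and still be worth its cost.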

The operating principle

A reliable test suite is not the one with the most sophisticated framework or the highest test count.

It is the one the team still believes.

That belief comes from boring fundamentals:

explicit environments;
observable feature states;
meaningful synchronization;
realistic concurrency;
outcome-based assertions;
performance thresholds that understand variance;
continuous measurement of maintenance cost.

When a suite becomes slow and noisy, do not begin with retries.

Begin with the decisions.

Your CI Is Not Flaky. Your Failure Triage Is.

Markus Gasser — Thu, 23 Jul 2026 20:08:37 +0000

A red CI build is not a diagnosis.

It is a notification that something happened. That “something” could be a product regression, an unreliable test, a broken test environment, a stale fixture, a browser update, a third-party outage, or a timing issue that only exists under shared infrastructure.

Yet many teams still react to every failure in exactly the same way:

Open the failed job.
Re-run it.
Hope it turns green.
Merge when it does.

That workflow feels fast because it avoids investigation. In reality, it transfers the cost downstream. The same failure returns later, confidence in the suite declines, and people eventually stop treating red builds as meaningful.

The real problem is rarely “too many flaky tests.” It is usually the absence of a dependable failure-triage system.

The three buckets every failure should enter

A useful CI process starts by classifying failures into three broad categories.

1. Product failures

These are the failures you actually want the suite to find:

A button no longer submits a form.
A permission rule exposes the wrong action.
A search result is missing.
A checkout flow breaks after a backend change.
A UI component renders but cannot be interacted with.

The key signal is repeatability. The failure normally appears under the same product state and can be reproduced outside the original CI job.

2. Test failures

These happen when the application is working but the test is not:

A locator depends on brittle DOM structure.
An assertion checks an intermediate state.
A fixed wait is shorter than the real loading time.
A test leaks state into the next test.
A screenshot baseline includes unstable content.

A good starting point is this CI failure triage checklist, which treats product bugs, test noise, and environment drift as separate operational problems instead of one generic “automation failure.”

3. Environment failures

These are often the hardest to identify because they imitate product and test failures:

The test environment was partially deployed.
Seed data was missing.
A shared account was locked.
DNS, storage, email, or a third-party API responded slowly.
The browser version changed underneath the suite.
A worker ran out of memory or disk space.

Environment failures are especially common when browser tests run on shared CI infrastructure. A practical evaluation of Endtest for shared CI browser testing explores the connection between infrastructure stability, coverage, and triage speed.

Stop using retries as your first diagnostic tool

Retries are useful, but only when they generate evidence.

A blind retry answers one question: did the test pass the second time?

It does not tell you why the first run failed.

A better retry captures a comparison set:

First-run screenshot and retry screenshot
Browser console output
Network failures
DOM snapshot or trace
Test data identifiers
Browser and operating system versions
Deployment version
Exact step where timing diverged

This turns a retry from an eraser into an experiment.

When the original and retry runs fail in different places, the problem is likely broader than one selector. When both fail at the same step with the same application state, the odds of a product issue rise. When the retry passes after a much longer load time, the environment or synchronization strategy deserves attention.
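Those three rules of thumb can be written down as a comparison between the first run and the retry. The sketch below is deliberately cautious and returns hints, not verdicts; the `RunEvidence` fields, the hint names, and the "twice the typical load time" threshold are assumptions.

```typescript
// Hypothetical evidence captured for one run of a test.
interface RunEvidence {
  passed: boolean;
  failedStep: string | null; // null when the run passed
  appStateId: string;        // fingerprint of the application state at the end of the run
  loadMs: number;            // how long the page took to become ready
}

type TriageHint =
  | "likely-product"
  | "broader-instability"
  | "likely-environment-or-sync"
  | "inconclusive";

function compareRuns(
  first: RunEvidence,
  retry: RunEvidence,
  typicalLoadMs: number,
  slowFactor = 2 // "much longer" is taken to mean at least twice the typical load
): TriageHint {
  if (!retry.passed) {
    // Failed in different places: probably not a single selector problem.
    if (first.failedStep !== retry.failedStep) return "broader-instability";
    // Same step, same application state: the odds of a product issue rise.
    if (first.appStateId === retry.appStateId) return "likely-product";
    return "inconclusive";
  }
  // Retry passed, but only after an unusually slow load.
  if (retry.loadMs >= typicalLoadMs * slowFactor) return "likely-environment-or-sync";
  return "inconclusive";
}

const firstRun: RunEvidence = { passed: false, failedStep: "pay", appStateId: "s1", loadMs: 1400 };
const sameFailure = compareRuns(firstRun, { ...firstRun }, 1500);
const slowPass = compareRuns(
  firstRun,
  { passed: true, failedStep: null, appStateId: "s2", loadMs: 5000 },
  1500
);
```

The value is less in the classification than in the requirement behind it: the comparison is only possible if both runs preserved their evidence.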

Build a CI gate around risk, not perfection

Many teams want a strict rule: one failed test blocks the deployment.

That sounds disciplined. It becomes counterproductive when the suite contains known instability.

The opposite rule—ignore failures and investigate later—is worse.

The useful middle ground is a risk-based gate. This article on building a CI gate for flaky tests without slowing every deployment describes the kind of tradeoff teams need to make.

A practical gate can treat failures differently:

A repeatable failure in a critical checkout or authentication flow blocks immediately.
A first-time failure in a low-risk area triggers one diagnostic retry.
A known flaky test does not silently pass; it creates a tracked reliability event.
Multiple unrelated failures in the same worker flag an environment problem.
A sudden increase in suite-wide duration triggers investigation even if tests pass.

The goal is not to make CI green. The goal is to make CI trustworthy.
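Most of the rules above can be expressed as a small decision function. This is a sketch, not a policy: the `FailureSignal` fields, the action names, the precedence of the rules, and the threshold of three failures per worker are assumptions, and the suite-wide duration rule is left out for brevity.

```typescript
// Hypothetical summary of one failed test in a CI run.
interface FailureSignal {
  testId: string;
  area: "critical" | "low-risk";
  repeatable: boolean; // failed again under the same conditions
  knownFlaky: boolean; // already tracked as unreliable
  workerId: string;
}

type GateAction =
  | "block"
  | "flag-environment"
  | "record-reliability-event"
  | "diagnostic-retry";

function decideGate(failures: FailureSignal[], workerThreshold = 3): Record<string, GateAction> {
  // Count failures per worker to spot clusters that suggest a sick machine.
  const perWorker = new Map<string, number>();
  for (const failure of failures) {
    perWorker.set(failure.workerId, (perWorker.get(failure.workerId) ?? 0) + 1);
  }

  const actions: Record<string, GateAction> = {};
  for (const failure of failures) {
    if (failure.area === "critical" && failure.repeatable) {
      actions[failure.testId] = "block";
    } else if ((perWorker.get(failure.workerId) ?? 0) >= workerThreshold) {
      actions[failure.testId] = "flag-environment";
    } else if (failure.knownFlaky) {
      actions[failure.testId] = "record-reliability-event";
    } else {
      actions[failure.testId] = "diagnostic-retry";
    }
  }
  return actions;
}

const lowRisk = (testId: string, workerId: string, knownFlaky = false): FailureSignal => ({
  testId,
  area: "low-risk",
  repeatable: false,
  knownFlaky,
  workerId,
});

const gateActions = decideGate([
  { testId: "checkout-pay", area: "critical", repeatable: true, knownFlaky: false, workerId: "w1" },
  lowRisk("profile-avatar", "w2"),
  lowRisk("search-suggest", "w2", true),
  lowRisk("help-page", "w3"),
  lowRisk("footer-links", "w3"),
  lowRisk("news-banner", "w3"),
]);
```

Writing the rules as code has a side benefit: the precedence between them becomes an explicit, reviewable decision instead of something each engineer resolves differently at merge time.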

Reliability needs a benchmark

Teams often say a suite is “mostly stable,” but that description is too vague to operate.

Start with a few numbers:

First-run pass rate
Retry recovery rate
Failures per 100 executions
Median time to classify a failure
Percentage of failures with enough evidence to diagnose
Failure concentration by test, environment, browser, and application area
Number of quarantined tests and average time spent in quarantine

Modern frontends make these metrics more important. Lazy loading, Suspense boundaries, and route-level code splitting create legitimate intermediate states that tests can easily misread. This guide to benchmarking browser test reliability on lazy-loaded applications is useful because it focuses on repeatable measurement rather than intuition.

A 98% pass rate may sound strong. If you run 2,000 tests per day, it still produces 40 failures. If most of those require manual inspection, the suite is expensive even when its percentage looks impressive.
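The arithmetic is worth keeping next to the dashboard. Here is a minimal sketch of three of the numbers above. The function names are mine, and the inputs are whatever counts your CI system already exposes.

```python
def expected_daily_failures(executions_per_day: int, pass_rate: float) -> int:
    """Failures per day implied by a pass rate, rounded to whole tests."""
    return round(executions_per_day * (1 - pass_rate))


def first_run_pass_rate(first_run_passes: int, executions: int) -> float:
    """Share of executions that passed without any retry."""
    return first_run_passes / executions


def retry_recovery_rate(recovered_on_retry: int, first_run_failures: int) -> float:
    """Share of first-run failures that a retry turned green."""
    if first_run_failures == 0:
        return 0.0
    return recovered_on_retry / first_run_failures


print(expected_daily_failures(2000, 0.98))  # 40
```

A high retry recovery rate is not automatically good news. It can mean the suite is hiding instability behind second attempts.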

Flaky-test triage should reduce future work

A flaky-test process fails when it becomes a place to store unresolved failures.

The output of triage should be one of four actions:

Fix the product.
Fix the test.
Fix the environment.
Remove or redesign a test that provides less value than it costs.

The workflow described in this flaky-test triage guide is centered on reducing recurring noise, which is the right objective. Counting flaky tests is not progress. Preventing the same failure pattern from returning is progress.

A useful triage record should include:

Failure category
Evidence
Owner
Temporary containment
Permanent corrective action
Deadline
Similar tests that may share the same weakness

That last field matters. A brittle locator found in one test is often present in twenty others.

AI failures need additional context

AI-assisted workflows add another layer to failure analysis. The UI may be unchanged while a prompt, model response, citation order, or streaming sequence has changed.

For those systems, capture:

Prompt version
Model configuration
Input data
Retrieved context
Response chunks in arrival order
Final assembled response
Screenshots and traces
Human-review decision

This AI test failure triage workflow shows why screenshots alone are insufficient when the behavior depends on prompts and generated output.

The economics are easy to ignore

Teams tend to compare test tools by license price while ignoring the cost of investigation, maintenance, infrastructure, and engineering attention.

That is why the more useful comparison is return on investment, not syntax. The argument is developed in Playwright vs Selenium in 2026: The Real Question Is ROI, Not Syntax.

The same principle applies to CI reliability. A free framework can still support an expensive testing operation. A paid platform can still be poor value. What matters is the total effort required to produce dependable release information.

For another perspective on browser automation strategy, this video discussion is worth adding to your research list.

A red build should create knowledge

The best CI systems do more than approve or reject a commit.

They teach the team:

Which product areas are fragile
Which tests are expensive to maintain
Which environments drift most often
Which failures lack evidence
Which types of regressions escape until production
Where engineering effort will improve confidence fastest

A red build is valuable when it creates a clear decision.

Without classification, evidence, ownership, and follow-through, it is just another notification everyone learns to ignore.

Modern Frontends Don’t Have One “Ready” State

Markus Gasser — Wed, 22 Jul 2026 20:46:41 +0000

A lot of browser tests are still written around a simple mental model:

Open the page.
Wait for it to load.
Interact with the final UI.
Assert that the expected result appears.

That model worked reasonably well when pages arrived as complete documents and JavaScript added a few interactions afterward.

Modern frontends are different.

The page can be visible before it is interactive. A component can render three times before it settles. Text can arrive before the buttons around it. A skeleton can disappear while the real content is still being measured. A route transition can keep the previous screen in the DOM for a few hundred milliseconds. A theme preference can be applied after hydration and briefly produce the wrong colors.

There is no longer one obvious moment when the page is “ready.”

That is why some test suites look healthy in CI while users still report flickering controls, broken keyboard focus, stale content, or clicks that land on elements that are about to disappear.

The intermediate UI is part of the product

Teams often treat loading states as temporary implementation details. Users do not experience them that way.

A user on a slower device may spend several seconds looking at a skeleton screen. Someone opening a server-rendered page may try to click before hydration finishes. A user with reduced-motion preferences may receive a completely different transition path. A returning user may see the wrong theme for half a second before local storage is read.

These are real product states, even if they are short-lived.

A useful starting point is to stop asking only:

Did the final screen appear?

Instead, ask:

What states did the user pass through before the final screen appeared?

That shift immediately changes what you test.

For server-rendered applications, it is worth tracking whether your suite can actually detect client/server divergence rather than merely waiting until the browser repairs it. This guide on measuring whether a frontend test suite catches hydration mismatches provides a useful way to think about coverage.

The same problem appears in streaming interfaces. Content may be inserted in chunks, replaced, or reordered as additional data arrives. An automation agent that assumes the first plausible element is the final element can act too early. The article Why AI Test Agents Break on Streaming UIs, Skeleton States, and Incremental Renders explores why this is especially difficult for AI-driven automation.

“Visible” is not the same as “stable”

A visible element can still be unsafe to interact with.

It may be:

moving because a font has just loaded;
covered by a fading transition layer;
attached to a component that is about to re-render;
a placeholder that will be replaced;
part of the previous route;
visually complete but not yet connected to event handlers.

This is one reason fixed sleeps are so seductive. A two-second pause seems to make the problem disappear.

Until CI gets slower. Or the application gets faster. Or an animation duration changes. Or a third-party request takes 2.3 seconds.

A stronger test waits for a meaningful application condition. That might be the disappearance of a loading marker, the presence of the final record count, the completion of a network request, or a stable element that remains attached across consecutive checks.
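The "stable across consecutive checks" idea is framework-independent. The sketch below polls any observation you can make, such as an element's text or bounding box, until it repeats. `observe` and the default values are assumptions, and most frameworks ship their own waiting primitives that should be preferred where they fit.

```python
import time
from typing import Callable, Hashable, Optional


def wait_until_stable(
    observe: Callable[[], Optional[Hashable]],
    required_matches: int = 3,
    interval_s: float = 0.1,
    timeout_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Hashable:
    """Return the observed value once it repeats `required_matches` times in a row.

    `None` means "nothing there yet" and never counts toward the streak.
    """
    deadline = clock() + timeout_s
    last = observe()
    streak = 0 if last is None else 1
    while streak < required_matches:
        if clock() >= deadline:
            raise TimeoutError("observation never stabilized")
        sleep(interval_s)
        current = observe()
        if current is not None and current == last:
            streak += 1
        else:
            last = current
            streak = 0 if current is None else 1
    return last
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails loudly when it never does.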

Skeleton screens deserve particular attention because they can create both false positives and false negatives. A test may mistake a skeleton row for a real record, or wait for all placeholders to disappear even though infinite scrolling intentionally keeps one visible. This practical guide to testing skeleton screens, loading shimmers, and progressive rendering covers the problem in more detail.

Animation changes the meaning of timing

CSS View Transitions and animated route changes make applications feel smoother, but they also blur the boundary between two screens.

During a transition:

the old screen may still be present;
the new screen may already be present;
both can match the same selector;
focus may move before the animation completes;
screenshots may capture a blended frame;
clicks may hit an element that is technically visible but not usable.

Disabling all animation in tests is usually the wrong response. It can be useful for a narrow set of visual checks, but it also means the test environment no longer exercises the behavior users receive.

A better strategy is to separate functional checks from motion-specific checks. Most tests can wait for a final route marker or stable state. A smaller set should deliberately verify transitions, reduced-motion behavior, focus continuity, and interruption handling.

How to Test CSS View Transitions, Route Animations, and Motion-Safe UI Changes offers a good framework for doing that without filling the suite with arbitrary delays.

Persisted preferences create hidden branches

Theme switching looks simple until persistence enters the picture.

A complete test may need to cover:

the default theme for a first-time visitor;
the operating system preference;
a manually selected theme;
persistence after refresh;
persistence in a new tab;
behavior after logout;
synchronization across sessions;
contrast and icon changes;
the brief state before stored preferences are applied.

This is not just a visual regression problem. It is also a browser-state problem.

The comparison in Endtest vs Playwright for testing theme switching, persisted preferences, and dark mode regression risk is useful because it looks beyond the obvious “click the theme toggle” scenario.

Whichever tool you use, the important part is making state explicit. A test should know whether it is starting with clean storage, seeded storage, or a previous session. Otherwise, failures become dependent on execution order.

High-churn interfaces expose maintenance problems quickly

Fast-changing products amplify every weakness in a test suite.

A selector tied to button text breaks when the copy team runs an experiment. A screenshot assertion fails after a spacing adjustment. A locator based on a generated class disappears after a framework upgrade. A test that expects a modal becomes invalid when the flow moves to an inline panel.

This is where maintainability matters more than how quickly the first test was recorded or coded.

A Practical Look at Endtest for Fast-Changing Frontends With Frequent Copy and Selector Drift examines that exact pressure. Two related evaluations—testing dynamic SaaS interfaces with less maintenance and using Endtest for high-churn web apps—are also useful when comparing maintenance approaches.

The broader lesson is tool-independent: design tests around stable product intent rather than incidental markup.

A test should care that the user can submit an invoice, not that the third nested <div> contains a button with exactly the same text forever.

That does not mean avoiding precise assertions. It means being precise about outcomes while being deliberate about which implementation details deserve to become dependencies.

A better model for frontend readiness

For each important workflow, define readiness at three levels:

1. Render readiness

Is the expected structure present?

This catches missing components, server failures, and major rendering problems.

2. Interaction readiness

Can the user actually operate the interface?

This includes event handlers, focus behavior, overlays, enabled controls, and elements that are no longer moving or being replaced.

3. Business readiness

Has the application reached the state that matters?

For example, the order appears in the account, the preference survives refresh, or the newly created record is available from another screen.

Many flaky tests stop at render readiness and immediately perform a business action. The unstable space between those levels is where failures hide.
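One way to keep the levels honest is to check them in order and report the first one that is not met. In this sketch each check is any callable returning a boolean. The level names follow the list above, and the checks themselves are placeholders you would back with real browser queries.

```python
from typing import Callable, Optional, Sequence, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_unmet_level(levels: Sequence[Check]) -> Optional[str]:
    """Run readiness checks in order and name the first level that fails.

    Later checks are skipped, so a business assertion never runs against
    an interface that is not yet interactive.
    """
    for name, check in levels:
        if not check():
            return name
    return None
```

A failure message that says "interaction readiness not reached" is far easier to triage than a click timeout.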

Test the journey, not just the screenshot at the end

Modern frontend testing is increasingly about transitions between states.

A robust suite observes:

what appears first;
what changes next;
what can be interacted with at each point;
which state is persisted;
which state is temporary;
what happens when rendering is interrupted;
whether accessibility preferences produce a different path.

The final screen still matters. It is just no longer the entire story.

When a team starts treating hydration, streaming, animation, skeletons, and preference restoration as first-class behavior, a surprising number of “random” failures become understandable. More importantly, the suite starts catching the same awkward moments that users notice before those moments become support tickets.

Why Browser Tests Fail Everywhere Except Your Laptop

Markus Gasser — Tue, 21 Jul 2026 09:54:30 +0000

A browser test that fails everywhere is usually easy to diagnose.

A browser test that passes on your laptop, passes when you open DevTools, and then fails in CI is much more dangerous. It encourages the team to blame timing, rerun the job, add another sleep, and move on.

That is how flaky suites become permanent infrastructure.

The problem is rarely that CI is “random.” More often, CI is exposing a difference that your local workflow hides: a different build, a different dependency tree, a different network sequence, a different browser lifecycle, or a different rendering path.

Here is how I approach those failures without immediately reaching for longer timeouts.

Start by proving which environment you are actually testing

Preview deployments often look identical to production while behaving differently underneath.

They may use:

a different API base URL
feature flags tied to branch names
temporary authentication callbacks
incomplete seed data
stricter cookie policies
edge caching that has not warmed up
environment variables injected by a different build pipeline

Before debugging the test, capture the page URL, build identifier, commit SHA, enabled flags, API host, browser version, and viewport. A screenshot is helpful, but it is not enough. Two pages can look the same while loading different JavaScript bundles or talking to different services.

This guide on debugging browser tests that fail only in preview environments is a useful checklist because it treats the preview environment as its own system rather than a smaller copy of production.

A simple rule helps: if the environment can differ, log the difference before the test starts.
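That rule is cheap to apply. The sketch below compares a snapshot of the environment you expected with the one the test actually observed. The keys are hypothetical; use whichever values your build exposes.

```python
from typing import Any, Dict, Tuple


def environment_diff(expected: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (expected, actual)} for every value that differs.

    A key missing on one side is reported with None on that side.
    """
    keys = sorted(set(expected) | set(actual))
    return {
        key: (expected.get(key), actual.get(key))
        for key in keys
        if expected.get(key) != actual.get(key)
    }
```

Logging this dictionary at the start of every run means the first question in triage is already answered.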

Compare the production build, not just the source code

Developers commonly reproduce a failure by running the application locally in development mode. That can be misleading.

Development builds usually preserve readable function names, include detailed error overlays, skip aggressive minification, and generate source maps differently. CI may be exercising a production bundle where stack traces are compressed, chunks are loaded in a different order, and an exception is swallowed by an error boundary.

This is why a test can appear stable while DevTools is open yet fail in a headless CI run. DevTools changes timing, keeps more diagnostics available, and sometimes makes the failure easier for a human to interpret without changing the underlying cause.

A better reproduction workflow is:

Build the exact artifact produced in CI.
Serve that artifact locally.
Use the same environment variables and feature flags.
Run the same browser version in the same mode.
Preserve console errors and unhandled promise rejections.

The article on tests that pass in DevTools but fail in CI because of source maps, minification, and error stacks goes deeper into this class of failure.

Stop treating live UI updates like normal page loads

Server-sent events, streaming responses, and live notification systems create a different synchronization problem.

The page may be loaded, the button may be visible, and the application may still be waiting for a message that arrives later. A test that asserts immediately after an action is not necessarily flaky; it may simply be observing the wrong state transition.

Avoid waiting for arbitrary delays. Instead, wait for evidence that the application reached the state you care about:

a specific notification ID appears
a counter changes from one known value to another
a row receives a final status
a loading marker disappears
a network stream produces a recognizable event

When possible, record the event payload or a correlation ID alongside the UI artifact. That gives you a way to distinguish “the server never sent the update” from “the browser received it but the UI did not render it.”
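With a correlation ID recorded at each hop, locating the break becomes a lookup. This sketch assumes three sets of IDs collected from server logs, the browser's received events, and the rendered UI. How you collect them depends on your stack.

```python
from typing import Set


def locate_missing_update(
    correlation_id: str,
    server_sent: Set[str],
    browser_received: Set[str],
    rendered: Set[str],
) -> str:
    """Report how far an update travelled before it went missing."""
    if correlation_id in rendered:
        return "rendered"
    if correlation_id in browser_received:
        return "received-but-not-rendered"
    if correlation_id in server_sent:
        return "sent-but-not-received"
    return "never-sent"
```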

For more patterns, see testing server-sent events, live notifications, and partial UI updates without flaky assertions.

Hydration failures are state mismatches, not selector problems

Modern React applications can render usable HTML before the client-side application has fully taken control. Suspense boundaries, streaming server rendering, and hydration make the page feel faster, but they also create brief periods where the DOM is present without being stable.

A test may find a button and click it while React is replacing that exact node. The resulting error looks like a detached element, an intercepted click, or a selector that suddenly stopped matching.

Creating a more complicated selector is usually the wrong response.

The better response is to identify an application-level readiness signal. Examples include:

a root element gains a hydrated attribute
a skeleton disappears
a client-side event handler becomes active
a streaming region reaches its completed state
the same node remains stable across two observations

This overview of browser testing for React hydration, Suspense, and streaming UI explains why these applications need more than generic “element visible” checks.

Verify the dependency graph used by CI

A lockfile is supposed to make builds reproducible, but teams still end up with differences caused by package manager versions, optional dependencies, platform-specific packages, cached modules, or install commands that do not enforce the lockfile strictly.

When a browser test starts failing after a dependency update, record:

the package manager and version
the lockfile checksum
the runtime version
the resolved version of the suspected package
whether the dependency cache was restored
whether the install modified the lockfile

Do not limit the comparison to direct dependencies. A transitive update in a router, date library, UI component, or polyfill can change browser behavior without appearing in the application code diff.
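The checksum and the "did the install modify the lockfile" check take only a few lines. The sketch below hashes the file bytes with SHA-256. The choice of algorithm and the function names are mine, and the lockfile path is whatever your package manager writes.

```python
import hashlib
from pathlib import Path


def content_checksum(data: bytes) -> str:
    """SHA-256 hex digest of raw file content."""
    return hashlib.sha256(data).hexdigest()


def lockfile_checksum(path: Path) -> str:
    """Checksum of the lockfile exactly as it exists on disk."""
    return content_checksum(path.read_bytes())


def install_modified_lockfile(before: bytes, after: bytes) -> bool:
    """True when the install step rewrote the lockfile."""
    return content_checksum(before) != content_checksum(after)
```

Capture the checksum before and after the install step. If the two differ, the build did not use the dependency graph that was reviewed.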

The checklist in what to log when browser tests fail only in CI after a dependency lockfile change is especially useful when the test failure begins immediately after routine dependency maintenance.

Treat third-party scripts as asynchronous dependencies

Analytics, support widgets, consent managers, fraud tools, payment SDKs, and experimentation platforms can all change the page after your application considers itself ready.

They may inject iframes, move focus, modify the DOM, register global event handlers, delay the main thread, or place overlays above interactive elements. Worse, their behavior can vary by geography, cookie state, account, or time of day.

When one of these scripts is involved, capture more than a screenshot. Log which third-party resources loaded, their response status, load duration, and whether they created new frames or overlays.

This article on what QA teams should log when a browser test fails only after a third-party script loads provides a practical baseline.

Build a failure package, not a failure message

“Element not clickable” is not a diagnosis.

A useful CI failure should preserve enough context for someone to investigate without rerunning the test five times. At minimum, keep:

screenshot and video
browser console output
relevant network failures
application build ID
browser and operating system versions
feature flags
dependency lockfile checksum
current URL and viewport
the last meaningful user action
a DOM snapshot or page source around the failure

The goal is not to collect everything forever. The goal is to make the first failure actionable.

Once you compare environments, production bundles, live update states, hydration readiness, dependency resolution, and third-party behavior, many “random CI failures” stop being random. They become ordinary engineering problems with observable causes.

Modern Frontend Testing Is Mostly About State, Timing, and Geometry

Markus Gasser — Fri, 17 Jul 2026 21:22:01 +0000

Frontend testing used to sound simple: open a page, find an element, click it, and verify the result.

That description still works for basic workflows, but modern interfaces are no longer a single static DOM that changes in obvious ways. Components can render inside Shadow DOM. Modals can be portaled to a different part of the document. Server-rendered HTML can be replaced during hydration. Content can move because of CSS container queries. A page can look finished while several progressive-loading states are still changing underneath it.

The hardest frontend bugs now tend to sit at the intersection of three things:

State: what the application believes is happening.
Timing: when the browser and framework apply changes.
Geometry: where elements appear and whether users can actually interact with them.

A stable test strategy has to observe all three.

Shadow DOM and portals break naive assumptions about element location

Component encapsulation is useful, but it changes how automation finds and interacts with elements.

A control inside an open shadow root is not always reachable through the same selector strategy used for the main document. A portaled modal may appear visually next to a component while being rendered near the end of document.body. Focus can move into the modal even though the DOM hierarchy suggests that it belongs elsewhere.

This guide to testing Shadow DOM and portaled modals without breaking browser automation suites covers the key challenges.

The test should verify behavior, not merely the presence of a node:

Can the user reach the control?
Is the expected element visible above overlays?
Does keyboard focus enter the modal?
Is focus trapped correctly?
Does Escape close it?
Does focus return to the triggering element?
Can a screen reader identify its label and role?

Selectors still matter, but interaction boundaries matter more. A test that locates a button hidden behind another layer is not testing what the user experiences.

Hydration creates a period where the page exists but is not ready

Server-rendered applications can show content before the client-side application becomes interactive. This improves perceived performance, but it also creates a deceptive state for automation.

The test sees a button. The button may even match the correct text. But the event handler is not attached yet, or the framework is about to replace the element during hydration.

That is why hydration bugs often look like flaky clicks.

The practical issues are explored in how to test hydration mismatches, client-side re-renders, and server/client UI drift in modern React apps.

A good test does not assume that visible means interactive. It waits for evidence that hydration finished, such as:

An application-ready marker.
A known client-side state transition.
The absence of hydration warnings in the console.
A control responding without being detached or replaced.
Stable content after the initial client render.

Also test the mismatch itself. Intentionally create cases where the server and client receive different data or locale settings. These conditions reveal whether the application recovers cleanly or silently replaces content in a way that confuses users.

Scroll-linked interfaces need real geometry checks

CSS scroll snap, sticky sections, and scroll-driven effects are difficult to test with simple DOM assertions.

A section can be present and technically visible while being partially covered by a sticky header. A scroll-snap container can settle on the wrong card at a specific viewport width. A mobile browser can calculate the visual viewport differently after its address bar collapses.

The article on testing CSS scroll snap, sticky sections, and scroll-linked UI without missing mobile bugs offers a useful checklist.

These tests should inspect coordinates and viewport relationships:

Is the target section fully inside the usable viewport?
Is it obscured by a sticky element?
Did scrolling stop at the expected snap point?
Can the user still reach the next section?
Does the behavior change after orientation or viewport resizing?
Does reduced-motion mode disable or simplify the effect correctly?

A screenshot is often valuable here because layout bugs can satisfy DOM assertions while remaining obvious to a human.
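The first two questions reduce to rectangle arithmetic once you have bounding boxes. The sketch below assumes coordinates relative to the viewport, as a browser's bounding-box APIs typically report them. The `Rect` type and the fixed sticky-header height are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def contains(outer: Rect, inner: Rect) -> bool:
    """True when `inner` lies completely within `outer`."""
    return (
        inner.left >= outer.left
        and inner.top >= outer.top
        and inner.right <= outer.right
        and inner.bottom <= outer.bottom
    )


def overlap_area(a: Rect, b: Rect) -> float:
    """Area shared by two rectangles, 0 when they do not intersect."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    return width * height if width > 0 and height > 0 else 0.0


def usable_viewport(viewport: Rect, sticky_header_height: float) -> Rect:
    """Viewport minus the strip covered by a sticky header at the top."""
    return Rect(
        viewport.left,
        viewport.top + sticky_header_height,
        viewport.width,
        viewport.height - sticky_header_height,
    )
```

A section can pass `contains(viewport, section)` and still fail against `usable_viewport`. That is exactly the bug a DOM-only assertion misses.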

Progressive loading has more than two states

Many tests model loading as a binary condition:

Loading.
Loaded.

Real interfaces often have many intermediate states:

Initial skeleton.
Partial content.
Deferred images.
Empty result.
Background refresh.
Stale data with a loading indicator.
Error state with partial content.
Pagination or infinite-scroll loading.

This creates a common testing mistake: waiting for the skeleton to disappear and then assuming the page is correct.

A better approach is described in how to test skeleton screens, progressive loading, and empty states without masking real UI regressions.

Test each meaningful state deliberately. For example:

Verify that skeletons resemble the eventual layout closely enough to avoid major shifts.
Confirm that empty states appear only after the request resolves.
Ensure stale content is labeled correctly during refresh.
Check that retry actions work after an error.
Confirm that partially loaded controls cannot submit incomplete data.
Measure whether key content moves unexpectedly when assets arrive.

Avoid hiding these transitions with long waits. The transitions are part of the product.

Responsive CSS makes visual comparison noisy

Modern layouts are influenced by viewport width, content length, font rendering, grid calculations, flex behavior, and container queries. A tiny difference can ripple through the page, shifting elements by a few pixels and creating a large screenshot diff.

The result is visual regression noise: teams approve many harmless changes and eventually stop paying attention.

The guide to benchmarking visual regression noise on CSS Grid, Flexbox, and container query layouts explains why this deserves measurement rather than guesswork.

A practical benchmark can include:

Several representative viewport sizes.
Short and long localized text.
Slow-loading fonts and images.
Different device pixel ratios.
Content near container-query breakpoints.
Stable and intentionally changed reference screenshots.

Track the false-positive rate. If a visual suite flags hundreds of insignificant differences, its theoretical coverage does not matter. The team will not trust it.
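Including both stable and intentionally changed references is what makes the rate measurable. In this sketch a comparison against an unchanged reference that still gets flagged counts as a false positive, and a changed reference that gets flagged counts as a detection. The `Comparison` record is an assumption about how you store benchmark results.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Comparison:
    intentionally_changed: bool  # the reference was altered on purpose
    flagged: bool                # the visual tool reported a difference


def false_positive_rate(results: Iterable[Comparison]) -> float:
    """Share of unchanged references that were flagged anyway."""
    stable: List[Comparison] = [r for r in results if not r.intentionally_changed]
    if not stable:
        return 0.0
    return sum(r.flagged for r in stable) / len(stable)


def detection_rate(results: Iterable[Comparison]) -> float:
    """Share of intentionally changed references that were flagged."""
    changed: List[Comparison] = [r for r in results if r.intentionally_changed]
    if not changed:
        return 0.0
    return sum(r.flagged for r in changed) / len(changed)
```

Tuning thresholds or masks should lower the first number without lowering the second.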

Mask genuinely dynamic regions, but do not mask broad sections simply to make the test green. That turns visual testing into screenshot decoration.

Stability in Cypress still depends on application-aware waits

The tool does not eliminate the need to understand the interface.

Even with automatic retry behavior, a Cypress test can become unreliable when it selects an element that is about to be replaced, uses a forced click to bypass a real usability problem, or waits on a network alias that does not represent the final UI state.

This guide to stable Cypress tests is useful because it focuses on the habits behind stability rather than only syntax.

The most reliable tests tend to:

Select elements by durable intent.
Avoid arbitrary sleeps.
Assert the state that matters to users.
Control test data explicitly.
Observe requests without assuming that one request completes the whole workflow.
Treat detached elements as a rendering signal, not merely a runner inconvenience.
Keep each test focused enough that failures remain understandable.

A forced action should be rare. When automation has to bypass visibility, overlap, or actionability checks, the test may be hiding a real frontend defect.

Build a state matrix before writing the test

For a complex component, list the important states first.

Consider a search panel:

Dimension	Example states
Data	loading, results, empty, error
Rendering	server HTML, hydrating, interactive
Layout	desktop, tablet, mobile
Interaction	mouse, keyboard, touch
Overlay	closed, open, nested dialog
Scroll	top, sticky header active, snapped section
Network	fast, delayed, failed, retried

You do not need to test every mathematical combination. The matrix helps identify the combinations most likely to expose bugs.

For example, “mobile + sticky header + open portaled modal + virtual keyboard” is much more informative than running the same happy path at six arbitrary widths.
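Counting the matrix makes the point concrete. The sketch below encodes the table above, counts every combination, and then enumerates only the ones that include a risky trio. The pinned values are an illustrative choice, not a recommendation.

```python
from itertools import product
from math import prod
from typing import Dict, List

MATRIX: Dict[str, List[str]] = {
    "data": ["loading", "results", "empty", "error"],
    "rendering": ["server HTML", "hydrating", "interactive"],
    "layout": ["desktop", "tablet", "mobile"],
    "interaction": ["mouse", "keyboard", "touch"],
    "overlay": ["closed", "open", "nested dialog"],
    "scroll": ["top", "sticky header active", "snapped section"],
    "network": ["fast", "delayed", "failed", "retried"],
}


def total_combinations(matrix: Dict[str, List[str]]) -> int:
    """Size of the full cross product of all dimensions."""
    return prod(len(states) for states in matrix.values())


def combinations_with(matrix: Dict[str, List[str]], pinned: Dict[str, str]) -> List[Dict[str, str]]:
    """Enumerate combinations while holding the pinned dimensions fixed."""
    narrowed = {
        dim: [pinned[dim]] if dim in pinned else states
        for dim, states in matrix.items()
    }
    return [dict(zip(narrowed, values)) for values in product(*narrowed.values())]


risky = {"layout": "mobile", "overlay": "open", "scroll": "sticky header active"}
print(total_combinations(MATRIX))             # 3888
print(len(combinations_with(MATRIX, risky)))  # 144
```

Even the pinned subset is too large to run in full, which is why the remaining dimensions should be sampled by risk and not enumerated.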

Final thought

Modern frontend automation is not primarily a selector problem.

It is a synchronization problem, a state-modeling problem, and increasingly a geometry problem. The browser may contain the correct node while presenting the wrong experience. The page may look complete while hydration is still replacing controls. A visual diff may be large even though the change is harmless—or tiny even though a sticky header blocks the main action.

The strongest tests connect DOM state with real user-visible behavior. They verify not only that an element exists, but that it is stable, reachable, correctly positioned, and meaningful at the moment the user needs it.

Why Browser Tests Fail When State Outlives the Page

Markus Gasser — Fri, 17 Jul 2026 08:18:52 +0000

One of the most misleading assumptions in browser automation is that a new page means a new test state.

It often does not.

The page may be new, but the browser can still remember an old login, restore form values, reuse cached assets, reconnect a service worker, process a delayed background event, or reconstruct the interface from local data before the API has finished responding.

That is why some tests fail even though the selectors are correct and the waits look reasonable.

The failure did not begin when the test clicked the button. It began earlier, in state the test never explicitly controlled.

The page is only one layer of the test environment

A modern browser session can contain state in several places:

Cookies
Local storage
Session storage
IndexedDB
Cache Storage
Service workers
Autofill data
Saved permissions
Browser profiles
Open tabs
Background notifications
Application data restored after refresh

These mechanisms are useful for users. They make applications faster and help people continue where they left off.

They also make automated tests harder to reason about.

Consider a test that creates a draft, refreshes the page, and verifies that the draft is still available.

What exactly is being tested?

The draft could have been restored from the server, local storage, IndexedDB, an in-memory cache, or some combination of the four. The UI might display stale local data first and replace it after the network request finishes.

A test that immediately sees the draft and passes may still miss a rehydration bug a few hundred milliseconds later.

This is why testing browser storage, refresh state, and rehydration deserves more attention than it usually gets. Refresh behavior is not just navigation. It is a synchronization problem between multiple sources of truth.

A reliable test should know:

Which state is expected to survive
Where that state is stored
When the application considers rehydration complete
What happens when local and server data disagree
How the application recovers from incomplete state

Without those answers, the test is often validating whatever happened to load first.

Returning users need different tests than new users

Many test suites create a fresh browser context for every scenario. That is a good default because isolation makes failures easier to understand.

But it also means the suite spends most of its time testing a user who is rare in production: someone with no history.

Returning users have saved sessions, remembered fields, previous searches, browser permissions, and partially completed workflows.

AI agents make this more important.

An agent that operates inside a browser may behave differently when it can see autofilled values, an authenticated session, saved preferences, or content left behind by a previous interaction. A clean profile can tell you whether the agent works in ideal conditions. It cannot tell you how the agent behaves in a realistic browser.

The article on testing AI agents that rely on browser memory, autofill, and saved sessions highlights an increasingly important test dimension: persistence is part of the agent's input.

A useful approach is to create explicit browser-state personas:

New user with no saved data
Returning user with valid state
Returning user with stale state
User with conflicting autofill data
User whose session expired in another tab
User who upgraded from an older application version

The goal is not to make every test use a dirty browser. The goal is to stop pretending that clean-state testing covers every meaningful user journey.
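
As an illustration, three of these personas could be expressed as data. The object shape below is modeled on the JSON that Playwright's `storageState` option reads (cookies plus per-origin `localStorage` entries), as far as I understand it; check the Playwright documentation before relying on the exact fields. The host name, cookie name, storage key, and schema versions are all invented for the example.

```typescript
// Shape modeled on Playwright's storage state JSON (an assumption to verify).
interface PersonaCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix time in seconds
  httpOnly: boolean;
  secure: boolean;
  sameSite: "Strict" | "Lax" | "None";
}

interface PersonaOrigin {
  origin: string;
  localStorage: { name: string; value: string }[];
}

interface PersonaStorageState {
  cookies: PersonaCookie[];
  origins: PersonaOrigin[];
}

type PersonaName = "new-user" | "returning-valid" | "returning-stale";

// Hypothetical application details, used only for illustration.
const APP_HOST = "app.example.test";
const CURRENT_SCHEMA = 2;

function buildPersonaState(persona: PersonaName, nowSeconds: number): PersonaStorageState {
  if (persona === "new-user") {
    // A new user has no cookies and no stored data.
    return { cookies: [], origins: [] };
  }
  // The stale persona carries a draft written by an older schema version.
  const schemaVersion = persona === "returning-stale" ? CURRENT_SCHEMA - 1 : CURRENT_SCHEMA;
  return {
    cookies: [{
      name: "session",
      value: "persona-" + persona,
      domain: APP_HOST,
      path: "/",
      expires: nowSeconds + 3600,
      httpOnly: true,
      secure: true,
      sameSite: "Lax",
    }],
    origins: [{
      origin: "https://" + APP_HOST,
      localStorage: [{
        name: "draft",
        value: JSON.stringify({ schemaVersion: schemaVersion, text: "Unsent reply" }),
      }],
    }],
  };
}
```

The returned object could be written to a JSON file or passed directly when creating a browser context, so each scenario names the persona it starts from instead of inheriting whatever the previous run left behind.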

Service workers can break tests after the page has closed

Service workers are especially confusing because they live outside the normal page lifecycle.

They can intercept requests, serve cached assets, receive push messages, run background synchronization, and survive after a tab is closed. A test may therefore inherit behavior from code that was registered during a previous test or deployment.

One common symptom is a test that works until the application changes its cache version.

After invalidation, the browser may briefly combine:

A new HTML document
An old JavaScript bundle
A stale API response
A newly installed service worker waiting to activate

That mixture can produce failures that disappear after manually refreshing the page.
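
One way to make that mixture visible is to record which build each loaded resource claims to come from and fail when they disagree. The sketch below assumes the application exposes a build version for each resource, for example through a response header or a global variable. That is an assumption about the application, and `LoadedResource` and `findVersionSkew` are hypothetical names.

```typescript
interface LoadedResource {
  kind: "document" | "script" | "api" | "serviceWorker";
  url: string;
  buildVersion: string; // assumed to be reported by the application
}

// Returns the distinct build versions when more than one is present,
// or an empty list when every resource comes from the same build.
function findVersionSkew(resources: LoadedResource[]): string[] {
  const versions: string[] = [];
  for (const resource of resources) {
    if (versions.indexOf(resource.buildVersion) === -1) {
      versions.push(resource.buildVersion);
    }
  }
  return versions.length > 1 ? versions.sort() : [];
}
```

Attaching the result to a failed run turns "it worked after I refreshed" into "the document was from build B while the script was from build A".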

The guide on debugging browser tests that fail only after a service worker cache is invalidated is useful because it treats cache invalidation as a first-class test condition rather than an infrastructure accident.

Teams should deliberately test transitions such as:

First visit with no service worker
Visit with an active older service worker
New service worker installed but waiting
Cache version changed after deployment
Application opened while offline
Application restored after connectivity returns
Push event received while the page is in the background

These scenarios are also relevant when benchmarking browser-test stability on applications that rely on web push, service workers, and background events.

A stability benchmark that only repeats the same clean-session test 100 times can miss the failures users encounter after deployments, reconnects, and background activity.

Preview environments remove the conditions that cause failures

Preview environments are helpful, but they can make browser testing look more reliable than it actually is.

They often have:

Smaller datasets
Fresh caches
Fewer concurrent users
Simplified authentication
Mocked third-party integrations
Different domains
No historical browser state
Different feature-flag configurations
No production service-worker upgrade path

A test can pass in preview because the environment has removed the exact conditions that trigger the production bug.

The article on why preview environments hide browser test failures until production describes a problem many teams discover too late: environment similarity matters more than environment convenience.

This does not mean every pull request needs a perfect clone of production.

It means the team should identify which production characteristics affect behavior and reproduce those intentionally.

For example:

Test against realistic data volume
Exercise real authentication redirects
Use production-like domain boundaries
Test service-worker upgrades
Include slow third-party responses
Run against multiple browser versions
Preserve selected user state between runs

A preview environment should be predictable, but not unrealistically clean.

Console errors are test evidence

A browser test may complete its intended flow while the console records serious problems:

Unhandled promise rejections
Failed resource requests
Hydration mismatches
Content Security Policy violations
Deprecated API warnings
Cross-origin errors
Exceptions inside analytics or payment scripts

If the assertion only checks that the final page is visible, the test may pass.

That does not mean the release is healthy.

The argument in Why Console Errors Belong in Your Release Readiness Scorecard is not that every warning should fail every build. The useful idea is to classify console output and treat unexpected errors as part of the release signal.

A practical policy might be:

Fail immediately on uncaught exceptions
Fail on new errors introduced by the current change
Track known third-party errors separately
Ignore explicitly approved warnings
Attach console output to every failed run
Alert on sudden increases even when tests pass

This makes console data actionable without turning it into noise.
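
A policy like this is small enough to express as a function. The following is a sketch under stated assumptions: `ConsoleEntry`, `ConsolePolicy`, and `classifyConsoleEntry` are hypothetical names, matching on exact message text is a simplification, and treating unapproved warnings as "track" is my own choice rather than part of the list above.

```typescript
type ConsoleVerdict = "fail" | "track" | "ignore";

interface ConsoleEntry {
  level: "error" | "warning" | "info";
  text: string;
  sourceUrl: string;
  uncaught: boolean; // true for uncaught exceptions and unhandled rejections
}

interface ConsolePolicy {
  knownBaseline: string[];    // error texts that existed before the current change
  thirdPartyHosts: string[];  // hosts whose errors are tracked separately
  approvedWarnings: string[]; // warnings that were explicitly approved
}

function classifyConsoleEntry(entry: ConsoleEntry, policy: ConsolePolicy): ConsoleVerdict {
  // Uncaught exceptions fail immediately, regardless of origin.
  if (entry.uncaught) {
    return "fail";
  }
  if (entry.level === "warning") {
    // Assumption: warnings that nobody approved are tracked, not failed.
    return policy.approvedWarnings.indexOf(entry.text) !== -1 ? "ignore" : "track";
  }
  if (entry.level !== "error") {
    return "ignore";
  }
  const fromThirdParty = policy.thirdPartyHosts.some(function (host) {
    return entry.sourceUrl.indexOf(host) !== -1;
  });
  if (fromThirdParty || policy.knownBaseline.indexOf(entry.text) !== -1) {
    return "track";
  }
  // Anything left is an error introduced by the current change.
  return "fail";
}
```

The baseline list is the part that needs maintenance. If it only ever grows, the policy slowly turns back into "ignore everything".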

AI products need better failure classification

AI interfaces introduce another source of misleading test failures.

The product may be working, but a model returns a different phrasing. A provider may be temporarily slow. An evaluator may misclassify a valid answer. A monitoring prompt may be unstable. Or the UI may genuinely be broken.

Those are very different incidents, but they often arrive in the same queue.

The guide on separating product bugs from monitoring noise in AI UI release triage makes an important point: teams need a taxonomy for failures before they can automate sensible release decisions.

At minimum, distinguish between:

Product UI failure
Backend integration failure
Model-quality regression
Provider timeout
Evaluator inconsistency
Test-data issue
Environment issue
Monitoring configuration issue

Without this classification, teams either ignore too many failures or block releases for harmless variation.
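
Once the categories exist, the release decision can depend on them explicitly. The mapping below is a hypothetical policy for illustration only. Which categories block a release is a team decision, and `TriageAction`, `TRIAGE_POLICY`, and `shouldBlockRelease` are invented names.

```typescript
type FailureCategory =
  | "product-ui"
  | "backend-integration"
  | "model-quality"
  | "provider-timeout"
  | "evaluator-inconsistency"
  | "test-data"
  | "environment"
  | "monitoring-config";

type TriageAction = "block-release" | "retry-then-review" | "fix-test-setup";

// Hypothetical policy: one action per category, so nothing is left undecided.
const TRIAGE_POLICY: Record<FailureCategory, TriageAction> = {
  "product-ui": "block-release",
  "backend-integration": "block-release",
  "model-quality": "block-release",
  "provider-timeout": "retry-then-review",
  "evaluator-inconsistency": "retry-then-review",
  "test-data": "fix-test-setup",
  "environment": "fix-test-setup",
  "monitoring-config": "fix-test-setup",
};

// A release is blocked only when at least one failure maps to "block-release".
function shouldBlockRelease(failures: FailureCategory[]): boolean {
  return failures.some(function (category) {
    return TRIAGE_POLICY[category] === "block-release";
  });
}
```

Because the policy is typed against every category, adding a new category without deciding what to do about it becomes a compile error instead of a silent gap.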

Stable Playwright tests require stable systems

Playwright provides excellent waiting behavior and browser control, but it cannot make an unpredictable system deterministic by itself.

A test can still be unstable because:

Data is shared between workers
An account is reused by several tests
The application has no reliable completion signal
A third-party service responds inconsistently
Browser state leaks between scenarios
The test checks the UI before rehydration finishes
Retries hide recurring failures
The environment changes during execution

The Ultimate Guide to Stable Playwright Tests covers the framework-level practices, but the larger lesson applies to every automation stack: stability is a property of the whole system.

The most effective fixes are often not clever locators or longer timeouts.

They are things like:

A dedicated test-data API
Unique users for parallel runs
Explicit application readiness signals
Controlled browser profiles
Better environment reset tools
Observable background jobs
Clear ownership of flaky tests
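
To make one of these concrete: unique users for parallel runs can be as simple as deriving the account from the run, the worker, and the test. This is a sketch with invented conventions. The `e2e+` prefix and the `example.test` domain are assumptions, and the worker index would come from the test runner (Playwright, for instance, exposes one on `testInfo`).

```typescript
// Builds a deterministic, collision-free email address for a test account.
// runId identifies the CI run, workerIndex the parallel worker.
function uniqueTestUserEmail(runId: string, workerIndex: number, testTitle: string): string {
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything unusual into a dash
    .replace(/^-+|-+$/g, "");    // trim dashes at both ends
  return "e2e+" + runId + "-w" + workerIndex + "-" + slug + "@example.test";
}
```

The address is derived rather than random, so a failed run can be traced back to the exact account it used.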

A better way to think about browser-test isolation

Isolation should not mean “delete everything before every test.”

It should mean that the test begins from a known state.

Sometimes that state is empty. Sometimes it intentionally includes persisted data, an older service worker, a saved session, or a partially completed workflow.

The difference is control.

A test suite becomes easier to trust when it can say:

This scenario starts with no state
This one starts with valid returning-user state
This one starts with stale cached data
This one simulates an application upgrade
This one verifies recovery after background activity

That is much stronger than hoping every run begins clean.

Final thought

Many browser tests are called flaky because their failures appear inconsistent.

Often, the system is behaving consistently. The test simply does not know which state it inherited.

Once browser storage, service workers, console output, environment differences, and background activity become visible parts of the test model, those failures become easier to reproduce and explain.

The page is not the whole browser.

And the browser is not the whole system.

The Best Test Automation Tool Is the One Your Team Still Uses a Year Later

Markus Gasser — Wed, 15 Jul 2026 09:41:58 +0000

Most test automation tools look good during a demo.

You record a login flow, add an assertion, run it in Chrome, and get a green result.

Everyone is impressed.

Then the real application gets involved.

There are dynamic elements, delayed API responses, test accounts, verification emails, downloaded files, several deployment environments, and a checkout flow that behaves differently on Safari.

A few months later, the original test suite has grown from 10 tests to 300. Some failures are product bugs. Others are test problems. A few only happen in CI. Nobody is completely sure which is which.

That is when you discover whether you selected a test automation tool or merely a good demo.

Creating tests is rarely the main problem

When teams compare automation tools, they often begin with questions such as:

How quickly can we record a test?
Can AI generate the steps?
Does it support plain-English instructions?
Can a manual tester use it?
Does it integrate with our CI pipeline?

These are reasonable questions, but they mostly describe the beginning of an automation project.

The harder questions appear later:

Who updates the tests after a redesign?
How do we investigate failures?
Can another person understand a test created six months ago?
What happens when the original automation engineer leaves?
Can we test workflows that involve email, APIs, files, or mobile devices?
How much infrastructure do we have to manage?
Does the cost increase every time we run the regression suite?

The first test tells you whether the tool works.

The hundredth test tells you whether the approach works.

Maintenance should be part of the evaluation

A stable automated test is not a test that never changes.

Applications are supposed to change. Buttons move. Components are replaced. Authentication flows evolve. APIs return different data. Product teams redesign entire sections of the interface.

The objective is not to prevent tests from changing. It is to make those changes inexpensive and understandable.

Before selecting a platform, I would test at least four maintenance scenarios.

1. Change a shared workflow

Update the login or checkout process and see how many tests must be edited.

A suite with reusable components should allow one targeted change. A poorly structured suite may require dozens of nearly identical updates.

2. Break a locator intentionally

Rename an element, move it inside another component, or change its attributes.

Then inspect what happens.

Does the platform recover safely? Does it explain what changed? Can you review the decision? Or does it silently perform a different action and still report success?

Self-healing is useful, but only when it preserves trust.

3. Give the test to someone else

Ask a tester or developer who did not create the workflow to explain what it does.

This is especially important for AI-generated tests. Generating a large suite quickly is not very helpful when only the AI can understand or modify it.

4. Investigate a realistic failure

Do not evaluate debugging using a missing button on a sample page.

Create a failure involving test data, an API response, a loading delay, an iframe, or an email verification step. Then see whether the platform provides enough evidence to identify the cause.

Screenshots are useful. Video, browser logs, network information, step-level output, variables, and execution history are even more useful.

No-code does not mean “for non-technical people only”

There is a tendency to frame no-code testing as a simplified option for teams that cannot write Selenium or Playwright tests.

That misses the larger point.

The value of no-code is not merely avoiding syntax. It is avoiding the need to turn every testing requirement into an internal software project.

A code-based framework may require the team to build or integrate:

Test runners
Browser infrastructure
Parallel execution
Reports
Screenshots and video
User permissions
Test data management
CI/CD integrations
Notifications
Email testing
File validation
Mobile device access
Versioning
Failure analysis
Maintenance conventions

Using an open-source library can still be the right choice. A developer-led team may want every test stored as code in the same repository as the application.

But the absence of a license fee does not mean the absence of a cost.

Engineering time is a cost. Infrastructure is a cost. Debugging is a cost. Keeping the framework compatible with changing browsers, dependencies, and application architecture is a cost.

Compare complete workflows, not feature checklists

I recently read this practical comparison of nine no-code test automation tools for 2026.

What I liked about the comparison is that it does not evaluate tools only by how quickly they can record a login test. It also considers maintenance, workflow coverage, infrastructure, debugging, collaboration, predictable usage, and the ability to handle advanced scenarios.

The shortlist includes platforms such as Endtest, mabl, testRigor, Katalon, Testsigma, ACCELQ, Leapwork, BrowserStack Low-Code Automation, and Ghost Inspector.

Those products do not all solve the same problem.

A small team that wants scheduled browser checks has different requirements from an enterprise that tests workflows across web, mobile, desktop, APIs, and internal systems.

That is why selecting a tool from a feature matrix is difficult. Almost every platform can place a checkmark next to AI, CI/CD, reporting, and self-healing.

The more useful approach is to reproduce one of your real workflows.

Not the easiest workflow. The annoying one.

Choose something with:

Authentication
Dynamic data
Multiple pages
An iframe or custom component
An email or SMS verification step
A downloaded file
An API call
At least one expected failure

Then ask several people to create, run, modify, and debug it.

You will learn more from that exercise than from ten sales presentations.

AI should reduce work without removing control

AI can make test automation significantly faster.

It can generate initial scenarios, suggest assertions, extract variables, identify alternative locators, summarize failures, and help update tests after application changes.

But “AI-powered” is not a complete testing strategy.

An AI agent can generate a large number of tests that technically execute but do not validate important business risks. It can also hide complexity behind natural-language instructions that become difficult to debug.

The best balance is usually:

AI accelerates repetitive work.
The resulting test remains visible.
Humans can edit the exact actions and assertions.
Changes can be reviewed.
The platform explains failures with evidence.

AI should help the team understand and maintain the suite.

It should not become the only entity capable of interpreting it.

Think about who will own the suite

The most important evaluation question may be surprisingly simple:

Who will maintain these tests a year from now?

The honest answer is rarely “the same person who created the proof of concept.”

People change teams. Contractors leave. Priorities shift. The developer who enthusiastically built the initial framework becomes busy with product work.

A sustainable suite should be understandable by more than one specialist.

That does not mean every product manager needs to edit tests. It means the automation should not become a private system whose logic exists only in one engineer’s head.

Readable steps, reusable components, sensible naming, version history, shared ownership, and useful debugging information matter more than they appear to during the first week.

The best tool is the one that survives contact with reality

There is no universal winner between no-code platforms, low-code tools, Selenium, Playwright, Cypress, and custom frameworks.

The right choice depends on the application, the team, the required control, the available engineering time, and the expected lifetime of the test suite.

But there is one principle that applies almost everywhere:

Do not optimize only for creating the first test.

Optimize for changing the 200th test, understanding the 500th failure, and onboarding the next person who has to maintain the suite.

The best test automation tool is not necessarily the one with the most AI features, the cleanest recorder, or the lowest initial price.

It is the one your team still trusts and uses after the application—and the team itself—has changed.

14 Browser Testing Articles That Changed How I Think About Release Confidence

Markus Gasser — Mon, 13 Jul 2026 20:57:56 +0000

Modern browser testing is no longer just about clicking a button and checking whether the next page loads.

Frontend applications now contain animated route changes, Shadow DOM components, dynamic validation, session refresh logic, AI-generated interfaces, visual transitions, and increasingly complex release pipelines.

At the same time, AI is making it easier to generate both application code and automated tests. That sounds like it should simplify testing, but it also creates a new problem: teams can produce more code and more tests without necessarily improving confidence in their releases.

I recently went through several articles that explore these problems from different angles. Here are some of the ideas that stood out.

Some failures only happen when the browser actually renders the page

A test can pass repeatedly and then fail when a layout shift causes the browser to recalculate the position of an element.

That is particularly frustrating because the selector may be correct and the element may technically exist. The failure happens because the page is moving while the automation is trying to interact with it.

This guide on debugging browser tests that fail when layout shifts trigger a reflow explains why these failures can be difficult to reproduce and what evidence is worth collecting.

CSS animations introduce a similar problem. With view transitions and animated navigation, a page may appear ready even though the browser is still changing its visual state.

The article How to Debug Browser Tests That Fail Only After CSS View Transitions or Animated Route Changes looks specifically at failures that happen during these transitions.

The important lesson is that waiting for an element to exist is not always enough. Sometimes you need to wait for the interface to become stable.
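
"Stable" can be given a concrete definition. One possible definition, sketched below, is that the element's bounding box has not moved across the last few samples. `BoxSample` and `isLayoutStable` are hypothetical names, and the sample count and tolerance are arbitrary choices. The samples themselves would come from whatever the automation tool offers for reading an element's position.

```typescript
interface BoxSample {
  x: number;
  y: number;
  width: number;
  height: number;
}

// True when the most recent `requiredSamples` boxes all lie within
// `tolerancePx` of the first box in that window.
function isLayoutStable(samples: BoxSample[], requiredSamples: number, tolerancePx: number): boolean {
  if (requiredSamples < 2 || samples.length < requiredSamples) {
    return false; // not enough evidence yet
  }
  const recent = samples.slice(samples.length - requiredSamples);
  const first = recent[0];
  return recent.every(function (sample) {
    return Math.abs(sample.x - first.x) <= tolerancePx
      && Math.abs(sample.y - first.y) <= tolerancePx
      && Math.abs(sample.width - first.width) <= tolerancePx
      && Math.abs(sample.height - first.height) <= tolerancePx;
  });
}
```

A test would take one sample per animation frame or polling interval and interact only once this returns true. That is different from sleeping for a fixed time, because the wait ends when the layout actually settles.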

Green CI does not automatically mean a release is safe

Most teams still treat automated tests as a binary signal:

Green means release.
Red means investigate.

That model is useful, but it becomes less reliable when applications and tests are changing quickly.

A passing test suite might still miss a visual regression, an untested AI-generated code path, or a risky frontend change that deserves additional review.

How to Build a Release Signal for Frontend Changes When Green CI Is Not Enough discusses how teams can combine test results with other forms of evidence instead of relying on a single green checkmark.

For AI-generated interfaces, the release decision may need to include screenshots, visual differences, risk indicators, and test evidence. This is explored further in How to Build a Release Gate for AI UI Changes Using Test Evidence, Screenshot Diffs, and Risk Signals.

This does not mean every release needs a complicated scoring system. It means the release signal should reflect the actual risks of the application.
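
Even a very small rule set can be more informative than a single checkmark. The sketch below is a hypothetical example, not a recommended policy: the evidence fields and the three outcomes are my own simplification of the ideas in those articles.

```typescript
interface ReleaseEvidence {
  testsPassed: boolean;
  newConsoleErrors: number;
  unreviewedVisualDiffs: number;
  touchesHighRiskArea: boolean; // e.g. checkout or authentication (assumed labels)
}

type ReleaseDecision = "release" | "needs-review" | "blocked";

function decideRelease(evidence: ReleaseEvidence): ReleaseDecision {
  // Hard failures block regardless of anything else.
  if (!evidence.testsPassed || evidence.newConsoleErrors > 0) {
    return "blocked";
  }
  // Green tests with open questions go to a human instead of straight to production.
  if (evidence.unreviewedVisualDiffs > 0 || evidence.touchesHighRiskArea) {
    return "needs-review";
  }
  return "release";
}
```

The useful property is the middle outcome. Green CI with unreviewed visual changes is neither a pass nor a failure, and the signal should be able to say so.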

AI-generated pull requests still require human judgment

AI can generate a working frontend feature surprisingly quickly. It can also generate tests for that feature.

But generated tests often focus on the most obvious path. They may not cover interrupted workflows, existing user state, permission differences, session expiration, validation errors, or interactions with older parts of the application.

How to Review AI-Generated Frontend Pull Requests for Test Coverage Gaps Before Merge provides a useful way to think about these gaps during code review.

The goal is not to distrust every line of AI-generated code. It is to avoid confusing generated test volume with meaningful coverage.

Tool comparisons are becoming more situational

The Playwright-versus-everything discussion is often reduced to features, syntax, and execution speed.

In practice, the correct choice depends heavily on what the team is testing and who will maintain the automation.

For example, testing web components introduces questions about Shadow DOM boundaries, slots, reusable components, and selectors that behave differently from those used in traditional pages.

Endtest vs Playwright for Testing Web Components, Shadow DOM, and Slot-Based Layouts compares the two approaches in that context.

Maintenance is another major factor. A code-first framework can offer a great deal of flexibility, but the team must own the framework, integrations, debugging process, reporting, infrastructure, and long-term updates.

Endtest vs Playwright for Teams That Need Less Test Maintenance in Fast-Changing Frontends focuses more directly on that tradeoff.

There is also a useful comparison of mabl vs Playwright for teams choosing between an AI-assisted platform and a code-first browser automation library.

For checkout testing, the requirements can be different again. Redirects, conditional fields, dynamic totals, third-party payment pages, and changing validation messages can make a simple purchase flow surprisingly difficult to automate reliably.

Endtest vs Cypress for Teams Testing Multi-Step Checkout Flows With Dynamic Validation and Redirects examines that narrower use case.

There is no universal winner across all of these comparisons. The better question is usually: which approach creates the least operational friction for this particular team?

Authentication tests are rarely just login tests

A login test that enters an email and password is easy.

Testing the full authentication lifecycle is much harder.

Real applications may silently refresh access tokens, redirect users back to their previous location, require a second verification step, recover from an expired session, or behave differently when authentication partially succeeds.

Endtest Review for Teams Testing Login, Session Refresh, and Multi-Step Recovery Flows looks at the practical requirements behind these scenarios.

Internal tools have similarly complicated workflows. A request may need to move through several users, roles, and approval states before it is complete.

Endtest Review for Teams Testing Multi-Step Approval Flows in Admin and Internal Tools covers the challenges involved in automating those processes.

These are the kinds of tests where setup, test data, user permissions, and cleanup often require more work than the visible browser interactions.

Human approval is still important in AI-powered workflows

Some AI applications do not produce a simple deterministic result.

They generate a recommendation, draft, action, or decision that must be reviewed by a human before the workflow continues.

Testing that process requires more than checking that a button exists. The test may need to validate the generated output, verify the approval state, simulate rejection, and confirm what happens when the AI response is delayed or incomplete.

The Endtest Buyer Guide for Teams Testing AI-Powered Browser Workflows With Human Approval Gates discusses what teams should evaluate when choosing an automation approach for these workflows.

This will probably become a more common testing pattern as AI features are added to existing SaaS products.

What does “automating tests with AI” actually mean?

AI testing is becoming an increasingly broad category.

It can mean:

Generating test code from a prompt.
Creating tests from natural-language instructions.
Repairing selectors after an interface changes.
Identifying visual differences.
Suggesting missing test cases.
Analyzing failures and logs.
Executing browser actions through an agent.

These capabilities solve different problems and have different costs.

What Is the Best Way to Automate Tests with AI? provides a useful overview of the main approaches and the tradeoffs between them.

AI can accelerate test creation, but test creation is only one part of automation. Teams still need reliable execution, understandable results, maintainable workflows, and a process for deciding which failures matter.

Playwright, Selenium, and the arrival of MCP

Playwright MCP adds another interesting dimension by allowing AI agents to interact with browsers through Playwright.

The Playwright MCP Guide covers how this approach works, what it can be used for, and where it fits alongside traditional Playwright tests.

I also recently published a video comparing Playwright and Selenium in the current testing landscape.

The Playwright-versus-Selenium debate is often framed as a battle between an old tool and a new tool. The reality is more nuanced.

Both can automate browsers. The larger differences usually involve architecture, ecosystem compatibility, team experience, maintenance expectations, and how much supporting infrastructure the team is prepared to build.

The recurring theme: test automation is a system

The common thread across all of these topics is that browser automation libraries are only one part of the solution.

A reliable testing process also depends on:

Application stability.
Test data.
Browser infrastructure.
Debugging evidence.
Reporting.
Permissions.
Release policies.
Team adoption.
Long-term maintenance.

AI can help with many of these areas, but it does not eliminate the need to design the overall system carefully.

The teams that get the most value from automation are rarely the teams that generate the largest number of tests. They are the teams that create a dependable feedback loop and keep it useful as the product changes.

The Frontend Testing Problems That Happy-Path E2E Tests Usually Miss

Markus Gasser — Fri, 10 Jul 2026 20:37:50 +0000

A frontend can look stable while becoming much harder to test.

The main user journey may still work. The login button is clickable, the checkout completes, and the dashboard loads. Yet underneath that apparently healthy flow, the application may have accumulated several new sources of risk:

components respond to their containers rather than the viewport;
design tokens change spacing across dozens of screens;
browser autofill creates states that tests never reproduce;
popovers are rendered outside their apparent component hierarchy;
third-party scripts load on their own schedule;
portals, Shadow DOM, and nested frames complicate element access;
a build optimization changes timing or code execution order.

None of these problems is unusual in a modern frontend. The mistake is assuming that a conventional set of happy-path end-to-end tests will reveal them automatically.

Responsive behavior is now component behavior

Responsive testing used to mean running the same page at a few viewport widths. That still matters, but it is no longer enough.

With container queries, two copies of the same component can render differently on the same page. A product card inside a narrow sidebar may use a compact layout while the same card in the main content area displays additional controls. The browser width has not changed; the component’s available space has.

That creates a different testing question. Instead of asking, “Does this page work at 768 pixels?” teams need to ask:

What happens when the component’s parent becomes narrower?
Does the layout switch at the correct container threshold?
Are controls hidden, rearranged, or duplicated?
Does JavaScript resize logic agree with CSS behavior?
Can the component recover after repeated resizing?

A useful starting point is this guide on evaluating a test automation platform for container queries, resize logic, and responsive component states. It frames responsive testing as a state-transition problem rather than a screenshot-at-three-widths exercise.

The distinction matters because many failures occur during the transition. A menu may render correctly when the test starts at a narrow width but fail when the viewport or container shrinks after the page has loaded. Event listeners, cached measurements, animation frames, and debounced resize handlers all become part of the test surface.
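
If transitions are the test surface, the test matrix should list transitions rather than widths. A minimal sketch, assuming a short list of representative widths (the function name and the widths are illustrative):

```typescript
interface ResizeTransition {
  startWidth: number;
  endWidth: number;
}

// Every ordered pair of distinct widths: shrinking and growing are separate cases.
function resizeTransitions(widths: number[]): ResizeTransition[] {
  const transitions: ResizeTransition[] = [];
  for (const startWidth of widths) {
    for (const endWidth of widths) {
      if (startWidth !== endWidth) {
        transitions.push({ startWidth: startWidth, endWidth: endWidth });
      }
    }
  }
  return transitions;
}
```

Three widths produce six transitions. A test for each one loads the page at `startWidth`, resizes to `endWidth`, and only then asserts, which is exactly the path that a fixed-width run never exercises.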

Design systems can create wide regressions from tiny changes

A one-line design-token change can affect more screens than a large feature pull request.

Changing a spacing token from 12px to 16px may look harmless in isolation. Across a mature product, however, it can cause:

buttons to wrap inside toolbars;
form labels to shift;
cards to exceed their expected height;
table controls to overlap;
mobile navigation to move below the fold;
text truncation to occur in languages with longer labels.

The difficult part is that each individual component may still be technically valid. Nothing crashes. The DOM contains the right elements. Functional assertions remain green.

This is why testing design systems requires more than checking whether a component exists. The article on testing design tokens, spacing drift, and component variants offers a useful way to think about the problem: validate representative variants, compare states intentionally, and avoid treating every visual difference as equally important.

Visual regression can help, but only when the comparison strategy matches the product. A team with light and dark themes, several brands, and frequently changing tokens can create an enormous screenshot matrix without gaining much confidence. Before scaling that matrix, it is worth deciding what to measure. This piece on visual regression for theme-switching and design-token-heavy interfaces highlights the metrics and review questions that matter before snapshot volume becomes unmanageable.

A good visual suite should reduce uncertainty. It should not merely produce a larger approval queue.
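
The growth is easy to underestimate because it is multiplicative. A small worked example, with dimension names and numbers that are purely illustrative:

```typescript
interface SnapshotMatrix {
  themes: number;
  brands: number;
  viewports: number;
  componentStates: number;
}

// Number of screenshots produced when every dimension is fully combined.
function snapshotCount(matrix: SnapshotMatrix): number {
  return matrix.themes * matrix.brands * matrix.viewports * matrix.componentStates;
}

// Hypothetical full matrix: 2 themes x 3 brands x 4 viewports x 120 component states.
const fullMatrix = snapshotCount({ themes: 2, brands: 3, viewports: 4, componentStates: 120 });

// Hypothetical representative subset: both themes, one brand, two viewports, 40 states.
const representativeMatrix = snapshotCount({ themes: 2, brands: 1, viewports: 2, componentStates: 40 });
```

With these numbers the full matrix is 2,880 screenshots and the representative subset is 160. Whether the remaining 2,720 add confidence or only review work is the question to answer before scaling up.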

New browser UI primitives introduce unfamiliar failure modes

Modern browser features can simplify application code while complicating test assumptions.

CSS Anchor Positioning and the Popover API are good examples. They remove some of the custom positioning and visibility logic that teams previously maintained themselves. That is a positive change, but tests still need to validate behavior such as:

whether the popover is anchored to the correct element;
whether it stays within the visible viewport;
what happens near scroll-container boundaries;
whether focus moves correctly;
whether Escape closes the right layer;
whether clicking outside dismisses it;
whether overlapping popovers follow the expected stacking order.

A practical checklist for testing CSS Anchor Positioning and popover interactions is useful here because it focuses on interaction, focus, placement, and dismissal—not just visibility.

The biggest trap is asserting only that the popover appeared. A popover can be visible and still be unusable because it is positioned off-screen, detached from its trigger, behind another layer, or outside the expected keyboard sequence.
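
The off-screen case is simple to check with plain geometry. This sketch assumes the box coordinates are relative to the top-left corner of the viewport, which is my understanding of what Playwright's `boundingBox()` returns. `PopoverBox`, `ViewportSize`, and `isFullyInViewport` are hypothetical names.

```typescript
interface PopoverBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface ViewportSize {
  width: number;
  height: number;
}

// True only when the whole box lies inside the viewport.
function isFullyInViewport(box: PopoverBox, viewport: ViewportSize): boolean {
  return box.x >= 0
    && box.y >= 0
    && box.x + box.width <= viewport.width
    && box.y + box.height <= viewport.height;
}
```

This covers only placement. Stacking order, focus, and dismissal still need their own assertions.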

Browser-managed state deserves its own scenarios

Autofill is a classic example of behavior that users rely on and automated suites often ignore.

A test that types an email address into a blank field does not reproduce the same conditions as a browser restoring a saved value. Autofill can affect:

whether input and change events fire;
whether floating labels move;
whether validation messages clear;
whether controlled components recognize the value;
whether dependent fields update;
whether a submit button becomes enabled.

This is one area where tool comparisons should be tied to the actual workflow rather than broad feature lists. The discussion of Playwright vs Selenium for browser autofill, saved forms, and input prepopulation shows why browser state and event behavior can matter more than basic syntax.

In practice, teams should separate at least three scenarios:

The user manually enters a value.
The application restores a previously saved value.
The browser or password manager supplies a value.

Those states may look identical on screen while exercising different code paths.
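The difference is easier to see in a toy model. The sketch below is not a real framework or browser API. `ControlledFieldModel` is a hypothetical stand-in for a controlled input, and whether autofill fires an input event is left as a parameter because that depends on the browser and password manager involved.

```typescript
type ValueSource = "typed" | "app-restored" | "browser-autofill";

// Hypothetical stand-in for a controlled form field: `domValue` is what the
// user sees, `stateValue` is what the application believes.
class ControlledFieldModel {
  domValue: string = "";
  stateValue: string = "";

  // Writes the visible value; application state follows only when an
  // input event reaches the application.
  setValue(value: string, firesInputEvent: boolean): void {
    this.domValue = value;
    if (firesInputEvent) {
      this.stateValue = value;
    }
  }

  // The submit button in this model is driven by application state only.
  isSubmitEnabled(): boolean {
    return this.stateValue.length > 0;
  }
}

// Applies a value through one of the three paths described above.
function simulate(
  source: ValueSource,
  value: string,
  autofillFiresInput: boolean
): ControlledFieldModel {
  const field = new ControlledFieldModel();
  switch (source) {
    case "typed":
      // Typing produces input events.
      field.setValue(value, true);
      break;
    case "app-restored":
      // The application sets its own state and renders it.
      field.domValue = value;
      field.stateValue = value;
      break;
    case "browser-autofill":
      // The browser writes the value; event delivery is an assumption.
      field.setValue(value, autofillFiresInput);
      break;
  }
  return field;
}
```

With `autofillFiresInput` set to `false`, the model ends up showing an email address next to a disabled submit button. That is the kind of mismatch a typed-value test never reaches.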

Build changes can break tests without changing the feature

One of the most frustrating failures is a test that starts failing after a production-build optimization even though the feature code appears unchanged.

Minification, tree shaking, chunk splitting, lazy loading, module preloading, asset hashing, and bundler upgrades can change timing and execution order. A test may have been depending on an accidental property of the old build:

a component always loaded before an assertion;
an event handler was registered synchronously;
a global variable remained available;
CSS arrived before the first interaction;
a chunk never failed to load because it was previously bundled with the main application.

When this happens, increasing a timeout may hide the symptom while preserving the underlying race condition. A better process is to compare the development and optimized builds, inspect network and console output, and identify which readiness signal changed. This debugging guide for frontend tests that fail after build optimization changes provides a structured way to investigate those differences.

The key lesson is that “the page loaded” is not a sufficiently precise synchronization condition. Tests should wait for the application state they actually need.
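One way to make "the state they actually need" concrete is to name the readiness signals and have each test declare which ones it depends on. The sketch below assumes the application exposes such flags to tests. `AppReadiness` and its field names are hypothetical.

```typescript
// Hypothetical readiness flags an application could expose to its tests.
interface AppReadiness {
  routeRendered: boolean;
  dataLoaded: boolean;
  handlersAttached: boolean;
  stylesApplied: boolean;
}

type ReadinessSignal = keyof AppReadiness;

// Returns the required signals that are not yet true, in the order given.
// An empty array means the test may proceed.
function missingSignals(
  state: AppReadiness,
  required: ReadinessSignal[]
): ReadinessSignal[] {
  return required.filter((signal) => !state[signal]);
}

// Builds a failure message that names the signal that never arrived,
// which is more useful than a bare timeout.
function describeMissing(missing: ReadinessSignal[]): string {
  return missing.length === 0
    ? "ready"
    : "not ready, waiting for: " + missing.join(", ");
}
```

A test could poll `missingSignals` until it returns an empty array, for example with Playwright's `expect.poll`. When a bundler upgrade changes load order, the failure then names the signal that changed instead of reporting a generic timeout.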

Third-party widgets are separate systems inside your system

Support chat, analytics, consent managers, payment fields, maps, and scheduling tools often arrive as third-party JavaScript. They introduce variability that the application team does not fully control:

scripts may be blocked or delayed;
content may be served from another origin;
iframe structure may change;
regional settings may alter the UI;
the vendor may deploy independently;
test environments may use different keys or modes.

The goal should not be to make every test depend on the live vendor. Nor should teams mock the integration so completely that they never test the real boundary.

A balanced approach usually includes:

fast tests for the application’s own integration code;
a small number of realistic end-to-end checks;
explicit handling for unavailable or delayed widgets;
contract checks for messages, callbacks, and expected state;
clear ownership when the external dependency changes.
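A contract check for widget messages can be small. The sketch below assumes a widget that posts three kinds of messages to the host page. The `WidgetMessage` shapes and the `widget:` type names are invented for illustration, and a real vendor's contract would replace them.

```typescript
// Hypothetical message contract for an embedded third-party widget.
type WidgetMessage =
  | { type: "widget:ready"; version: string }
  | { type: "widget:error"; code: number }
  | { type: "widget:closed" };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Validates an untrusted payload (for example the `data` of a message event)
// against the assumed contract. Returns null for anything unexpected.
function parseWidgetMessage(data: unknown): WidgetMessage | null {
  if (!isRecord(data)) {
    return null;
  }
  const type = data["type"];
  if (type === "widget:ready") {
    const version = data["version"];
    return typeof version === "string" ? { type, version } : null;
  }
  if (type === "widget:error") {
    const code = data["code"];
    return typeof code === "number" ? { type, code } : null;
  }
  if (type === "widget:closed") {
    return { type };
  }
  return null;
}
```

Fast tests can feed recorded payloads through a parser like this, while the few realistic end-to-end checks confirm that the live widget still sends messages the parser accepts. When the vendor changes a field, the failure points at the contract rather than at an unrelated assertion.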

This guide on testing third-party JavaScript widgets without creating cross-origin flakiness explains how to preserve confidence without turning the suite into a monitor for someone else’s uptime.

The DOM is no longer always one simple tree

Shadow DOM, framework portals, and nested iframes can all make an element appear visually close to a trigger while being structurally far away.

A modal opened from a component may be rendered under a document-level portal. A custom element may hide internal controls inside an open or closed shadow root. A payment flow may place fields inside several cross-origin frames. Locators that assume a straightforward parent-child path become brittle quickly.

When evaluating tooling, it is worth testing these structures directly rather than relying on a generic “supports modern web apps” claim. The guide to evaluating a QA platform for Shadow DOM, portals, and nested iframes identifies practical scenarios that expose the difference between nominal support and usable support.

The best locator strategy also changes by boundary:

Prefer user-facing roles and labels where the accessibility tree exposes them.
Use stable component contracts rather than generated class names.
Treat iframe switching as an explicit context change.
Avoid long selectors that mirror implementation structure.
Capture enough evidence to show which document or root contained the failure.
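Some of these rules can be checked mechanically during review. The sketch below is a rough heuristic, not a CSS parser: it flags selectors that are deep, positional, or built on class names that look generated. The function name, the three-step threshold, and the regular expressions are all assumptions to tune for your own codebase.

```typescript
interface SelectorReview {
  selector: string;
  warnings: string[];
}

// Positional pseudo-classes such as :nth-child( or :nth-last-of-type(.
const POSITIONAL = /:nth-(?:last-)?(?:child|of-type)\(/;

// Class names ending in a hash-like suffix that contains a digit,
// for example .css-1x2y3z or .Button_root__a1B2c. This is a guess at
// what build tools emit and will have false positives and negatives.
const GENERATED_CLASS =
  /\.[A-Za-z_][\w-]*?(?:__|-)(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{5,}(?![\w-])/;

// Flags common brittleness patterns in a CSS selector string.
function reviewSelector(selector: string, maxSteps: number = 3): SelectorReview {
  const warnings: string[] = [];

  // Approximate the number of steps by splitting on combinators and spaces.
  const steps = selector
    .split(/\s*[>+~]\s*|\s+/)
    .filter((part) => part.length > 0);

  if (steps.length > maxSteps) {
    warnings.push("mirrors DOM structure (" + steps.length + " steps)");
  }
  if (POSITIONAL.test(selector)) {
    warnings.push("depends on element position");
  }
  if (GENERATED_CLASS.test(selector)) {
    warnings.push("uses a generated-looking class name");
  }
  return { selector, warnings };
}
```

A check like this is most useful on AI-generated suites, where long structural selectors tend to appear in bulk. It cannot judge whether a role or label locator would have been available, so it complements review rather than replacing it.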

A better frontend testing model

Modern frontend reliability is not achieved by adding one more broad end-to-end test. It comes from matching tests to the type of risk.

A useful model is:

Functional checks confirm that the user can complete the workflow.

State-transition checks verify resize behavior, restored values, async loading, and theme changes.

Visual checks detect meaningful layout and token regressions.

Boundary checks validate widgets, frames, portals, and browser-managed behavior.

Build checks compare the behavior of optimized production artifacts with that of the local development build.

The important part is not maximizing test count. It is making sure the suite can observe the failures that the current frontend architecture is capable of producing.

A green happy path is useful. It is simply not the whole product anymore.

The Browser Testing Bugs That Only Show Up in Real Web Apps

Markus Gasser — Fri, 10 Jul 2026 07:21:37 +0000

Browser testing used to feel simpler.

Open a page. Click a button. Fill a form. Assert that something changed.

That still matters, of course. But modern web apps have become much more dynamic. The hard bugs now often live in the edges: sticky headers that cover elements only on mobile, cookie banners that change the first page view, browser extensions that rewrite the DOM, or cached pages that restore stale state after the user presses Back.

This is why I think teams should be careful when they evaluate browser testing tools. The question is not only: “Can this tool automate Chrome?”

A better question is:

Can it handle the messy behavior of the real app?

Responsive navigation is a good starting point

A lot of teams test the happy-path desktop layout and assume the mobile version is covered because the components are technically the same.

They are not.

Responsive navigation can introduce completely different behavior: hamburger menus, sticky headers, collapsed dropdowns, off-canvas panels, and elements that are technically present in the DOM but not actually usable.

That is why I liked this breakdown on what to look for in a browser testing platform for responsive navigation, sticky headers, and mobile menu breakpoints. Those are exactly the kinds of issues that create “works on my machine” bugs when teams only test one viewport.

Browser extensions can quietly change the page

Another under-tested area is browser extension behavior.

Extensions can inject UI, rewrite forms, add side panels, modify styles, or add network activity. If your product depends on an extension, or if your users commonly run one, your page can behave differently from the clean browser session used in most automated tests.

This article on testing browser extensions that inject UI, rewrite pages, or add side panels without flaky E2E runs covers a useful point: extension testing is not only about checking that the extension loads. It is about proving that the injected behavior does not destabilize the rest of the user flow.

Cookie banners and region logic can break analytics

Cookie consent is another example of a feature that teams often treat as a legal or marketing detail, not a testing concern.

But cookie banners affect:

page load timing
analytics initialization
conversion tracking
locale-specific content
first-session behavior
user segmentation

A user in one region may see a completely different entry path than a user in another region. That makes testing cookie consent, region banners, and locale-specific entry paths worth thinking about as part of the main regression suite, not as a one-off check.
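Treating consent as part of the regression suite usually means enumerating the combinations instead of testing one. The sketch below builds that matrix. The region codes, consent states, and the idea of passing the expected behavior in as a `policy` function are assumptions, and the real expectations have to come from your own legal and analytics requirements.

```typescript
// Hypothetical region and consent values; replace with your own.
type Region = "EU" | "US" | "OTHER";
type Consent = "accepted" | "rejected" | "undecided";

interface Expectation {
  bannerVisible: boolean;
  analyticsInitialized: boolean;
}

interface EntryScenario extends Expectation {
  region: Region;
  consent: Consent;
}

// Builds one scenario per region/consent pair, with expectations supplied
// by the caller's policy so the matrix itself stays free of legal rules.
function buildScenarios(
  regions: Region[],
  consents: Consent[],
  policy: (region: Region, consent: Consent) => Expectation
): EntryScenario[] {
  const scenarios: EntryScenario[] = [];
  for (const region of regions) {
    for (const consent of consents) {
      const expected = policy(region, consent);
      scenarios.push({
        region,
        consent,
        bannerVisible: expected.bannerVisible,
        analyticsInitialized: expected.analyticsInitialized,
      });
    }
  }
  return scenarios;
}
```

Each scenario can then drive a parameterized test that sets the region and any stored consent before the first page view, and asserts on the banner and on analytics initialization separately.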

PWAs add another layer of state

Progressive Web Apps introduce another testing challenge: the app is not always an online page freshly loaded from the server.

You may have service workers, cached assets, offline states, reconnect flows, background updates, and update prompts. A test that only verifies the online happy path can miss the most important reliability problems.

This is why checking PWA offline mode, update prompts, and reconnect flows matters. The bugs here are often not obvious until users are on unstable connections or returning to a stale tab.

Back/forward cache creates surprising bugs

The browser back/forward cache is another area where automated tests can give teams false confidence.

A page may appear to work when loaded fresh, but behave incorrectly when restored from cache. Form state, event listeners, authentication state, timers, and UI flags can all behave differently after a bfcache restore.
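Browsers signal a bfcache restore through the `pageshow` event, whose `persisted` property is `true` when the page came back from the cache. The sketch below shows the kind of decision logic a test should exercise. `PageState`, the action names, and the session-age threshold are hypothetical.

```typescript
// Minimal shape of the event passed to a `pageshow` listener.
interface PageShowLike {
  persisted: boolean;
}

// Hypothetical application state that can go stale while a page is cached.
interface PageState {
  sessionCheckedAt: number; // epoch milliseconds of the last auth check
  timersRunning: boolean;
}

type RestoreAction = "revalidate-session" | "restart-timers";

// Decides what the page should do when it is shown. A fresh load returns
// no actions because normal initialization already covers it.
function actionsOnPageShow(
  event: PageShowLike,
  state: PageState,
  now: number,
  maxSessionAgeMs: number
): RestoreAction[] {
  const actions: RestoreAction[] = [];
  if (!event.persisted) {
    return actions;
  }
  if (now - state.sessionCheckedAt > maxSessionAgeMs) {
    actions.push("revalidate-session");
  }
  if (!state.timersRunning) {
    actions.push("restart-timers");
  }
  return actions;
}
```

Unit tests can cover this logic directly. An end-to-end check still needs to navigate away and come back, for example with Playwright's `page.goBack()`, and it is worth confirming that the test browser actually restores from bfcache in that run, because not every automation configuration does.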

This guide on testing browser back/forward cache without missing state restoration bugs is a reminder that browser navigation is not just “go to URL.” Real users move backward, forward, refresh, switch tabs, and return later.

Accessibility-driven workflows need real interaction testing

Keyboard navigation, focus traps, and ARIA-driven modals are another example of behavior that can look fine visually while being broken functionally.

A modal can appear on screen and still fail because focus escapes behind it. A menu can open but be impossible to navigate with the keyboard. A dialog can close visually while the state exposed to screen readers still suggests it is open.
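A focus-trap test needs an expected sequence to compare against. The sketch below computes it, assuming the modal's focusable elements are known in tab order and that focus should wrap at both ends. The function names are illustrative.

```typescript
// Index of the element that should receive focus after one Tab
// (or Shift+Tab) press inside a trap with `focusableCount` elements.
// Returns -1 when the trap contains nothing focusable.
function nextFocusIndex(
  current: number,
  focusableCount: number,
  shiftKey: boolean
): number {
  if (focusableCount <= 0) {
    return -1;
  }
  const step = shiftKey ? -1 : 1;
  return (current + step + focusableCount) % focusableCount;
}

// Expected focus indices after each of `presses` key presses,
// starting from index `start`.
function expectedFocusSequence(
  start: number,
  focusableCount: number,
  presses: number,
  shiftKey: boolean = false
): number[] {
  const sequence: number[] = [];
  let current = start;
  for (let i = 0; i < presses; i++) {
    current = nextFocusIndex(current, focusableCount, shiftKey);
    sequence.push(current);
  }
  return sequence;
}
```

For a modal with three focusable elements, four Tab presses from the first element should yield `[1, 2, 0, 1]`. A browser test can press Tab with `page.keyboard.press('Tab')` and compare the focused element after each press against that sequence. A focus index that never wraps means focus has escaped the modal.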

That is why comparisons like Playwright vs Selenium for testing keyboard navigation, focus traps, and ARIA-driven modal workflows are useful. The tool choice matters less than whether the team is actually testing these workflows as users experience them.

Third-party scripts are a common source of flakes

Analytics scripts, tag managers, chat widgets, tracking pixels, A/B testing scripts, and personalization tools can all change timing and network activity.

A test might pass in a clean environment and fail in production because a tag manager injected something extra or delayed a request.
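A common way to control this is to decide, per test, which third-party requests are allowed to run. The sketch below only classifies URLs, and the hostname parsing is deliberately simplified (it ignores credentials in URLs, for example). The domain names in the usage note are placeholders.

```typescript
// Extracts the lowercase hostname from an absolute URL.
// Returns null for relative URLs and for schemes without "//" (data:, blob:).
function hostnameOf(url: string): string | null {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/:?#]+)/i.exec(url);
  const host = match ? match[1] : undefined;
  return host ? host.toLowerCase() : null;
}

// True when the URL points at a host outside the given first-party domains.
// Subdomains of a first-party domain count as first-party.
function isThirdParty(url: string, firstPartyDomains: string[]): boolean {
  const host = hostnameOf(url);
  if (host === null) {
    // Relative and inline URLs are treated as first-party.
    return false;
  }
  return !firstPartyDomains.some((domain) => {
    const d = domain.toLowerCase();
    return host === d || host.slice(-(d.length + 1)) === "." + d;
  });
}
```

With `["example.com"]` as the first-party list, `https://cdn.example.com/app.js` is first-party and `https://tags.example.net/gtm.js` is third-party. In Playwright, a classifier like this could sit inside a `page.route` handler that calls `route.abort()` for third-party requests in most tests, with a smaller set of tests leaving them enabled to catch the interactions the article describes.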

This article on why browser tests flake when analytics and tag managers inject extra network activity makes a practical point: flakes are not always caused by bad tests. Sometimes they are caused by uncontrolled runtime behavior.

File uploads, downloads, and storage are not edge cases

Finally, browser storage and file handling deserve more attention.

Many important workflows depend on:

uploading documents
downloading reports
preserving local storage
clearing session storage
validating generated files
checking browser permissions

A QA platform that cannot handle these flows reliably will struggle with real business applications. This article on QA platforms for file uploads, downloads, and browser storage persistence is a good checklist for that category.

The bigger lesson

The best browser testing strategy is not the one with the most tests.

It is the one that covers the behaviors most likely to break for real users.

Modern web apps are full of hidden state: viewport state, browser state, cache state, consent state, extension state, storage state, accessibility state, and third-party script state.

If your automated tests ignore those layers, the suite may look green while users still hit broken flows.

That is the uncomfortable part of browser testing in 2026: the easy clicks are already automated. The value is in testing the messy edges.