Simon Gerber

Posted on Jul 1

The UI Testing Problems That Look Simple Until They Reach CI

#testing #automation #selenium #playwright

A lot of browser testing advice is written around clean examples.

Open a page. Fill in a form. Click a button. Check that a confirmation message appears.

That is useful when you are learning a tool, but it is not where most teams lose time.

The difficult tests are usually attached to interfaces that change constantly, update asynchronously, expose elements that are hard to target with stable selectors, or depend on data that behaves differently from one run to the next. Search suggestions reorder themselves. Marketing pages change copy every week. Cookie banners appear only in certain regions. Drag-and-drop components behave differently depending on the browser, viewport, or implementation.

The test may look simple when described in a ticket. The automation behind it often is not.

Here are several areas where modern browser testing becomes more complicated than it first appears, along with some useful resources for exploring each problem in more depth.

Search is not just an input and a result page

Testing search used to mean entering a phrase and checking that the results page contained an expected item.

Modern search interfaces have many more moving parts:

Suggestions appear while the user is typing.
Results change before the form is submitted.
Filters alter the URL, the visible results, or both.
Ranking depends on personalization, inventory, location, or recent activity.
The UI may debounce requests and discard slower responses.
AI-generated suggestions can look plausible while still being irrelevant.

This creates an important distinction between testing whether search works and testing whether search produces good results.

A browser test can easily confirm that a suggestion panel appeared. It can also confirm that five options were displayed. Neither assertion tells you whether the suggestions were relevant, correctly ranked, or based on the current query rather than a stale response.
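One way to catch the stale-response case is to seed the search backend with fixture data where every prefix returns a distinct list, then compare what the panel shows against that fixture. The sketch below assumes such a fixture exists; the function name, the fixture shape, and the sample queries are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def diagnose_suggestions(typed: str, shown: List[str], fixture: Dict[str, List[str]]) -> str:
    """Classify the visible suggestions against controlled fixture data.

    Assumes the fixture maps each query prefix to a distinct suggestion list,
    so a match against an earlier prefix points at a stale response.
    """
    if shown == fixture.get(typed):
        return "current"
    # Walk back through shorter prefixes of what the user typed.
    for length in range(len(typed) - 1, 0, -1):
        prefix = typed[:length]
        if shown == fixture.get(prefix):
            return f"stale:{prefix}"
    return "unknown"


# Hypothetical fixture: each prefix has its own suggestions.
FIXTURE = {
    "re": ["red", "rent"],
    "red": ["red shoes", "red boots"],
}

print(diagnose_suggestions("red", ["red shoes", "red boots"], FIXTURE))
print(diagnose_suggestions("red", ["red", "rent"], FIXTURE))
```

A result of "stale:re" tells the reader of the test report that the panel is showing the answer to an earlier keystroke, which is more useful than a generic mismatch.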

The article “How to Test AI-Powered Search Suggestions Without Masking Relevance Bugs” explores this problem directly. It is especially relevant for teams that are tempted to make assertions so flexible that the test keeps passing even when the search experience gets worse.

The same problem appears in conventional search interfaces. “Endtest Review for Teams Testing Dynamic Filters, Search Suggestions, and Result Ranking in Web Apps” looks at the broader workflow: entering queries, waiting for suggestion states, applying filters, and validating the resulting order.

A useful search test normally needs to separate at least three concerns:

Interaction: Can the user enter a query, select a suggestion, and apply a filter?
State: Does the interface show the correct query, active filters, loading state, and result count?
Quality: Are the returned suggestions and results acceptable for that input?

The first two are classic browser automation problems. The third may require controlled datasets, API-level checks, ranking thresholds, or human review.

Trying to force all three into one fragile end-to-end test usually produces a test that is difficult to understand and even harder to trust.
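The quality concern can often be checked outside the browser. As a rough sketch, assuming a controlled dataset where the relevant items for a query are known in advance, a ranking threshold might look like this. The function names, the threshold value, and the sample product identifiers are all illustrative assumptions.

```python
from typing import List, Optional, Sequence, Set


def precision_at_k(results: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k positions filled by known-relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in results[:k] if item in relevant)
    return hits / k


def check_ranking(
    results: Sequence[str],
    relevant: Set[str],
    k: int,
    threshold: float,
    must_be_first: Optional[str] = None,
) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems: List[str] = []
    score = precision_at_k(results, relevant, k)
    if score < threshold:
        problems.append(f"precision@{k} is {score:.2f}, below threshold {threshold:.2f}")
    if must_be_first is not None and (not results or results[0] != must_be_first):
        problems.append(f"expected {must_be_first!r} in first position")
    return problems


# Hypothetical controlled dataset for the query "red".
results = ["red-shoe", "blue-shoe", "sock", "red-boot", "hat"]
relevant = {"red-shoe", "red-boot", "blue-shoe"}
print(check_ranking(results, relevant, k=4, threshold=0.75, must_be_first="red-shoe"))
```

Keeping this check separate from the interaction test means a ranking regression and a broken input field produce different failures.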

Drag-and-drop testing exposes the limits of simple automation

Drag-and-drop is another feature that sounds straightforward until you try to automate it reliably.

The test description may be only one sentence:

Move the card from “In Progress” to “Done.”

But the browser may need to reproduce a sequence of pointer events, maintain the correct coordinates, trigger a drop zone, wait for an animation, and confirm that the change persisted after the UI updated.

The implementation matters too. A board built with native HTML drag events can behave differently from one built with pointer events, touch abstractions, canvas elements, or a frontend framework’s gesture library.

That is why a test that works against one sortable list may fail completely against another.

“Endtest Review for Teams Testing Drag-and-Drop Builders, Reorderable Lists, and Gesture-Heavy UI” examines this category from the perspective of teams evaluating a managed testing platform.

For teams already working with code-based tooling, “Endtest vs Cypress for Testing Drag-and-Drop Boards, Reorderable Lists, and Gesture-Heavy Flows” compares the tradeoffs involved.

The most reliable drag-and-drop tests tend to verify more than the visual motion itself. After the gesture, check the durable outcome:

Did the item move to the expected container?
Did its order change?
Was the change saved?
Does the new state remain after a reload?
Was the correct backend request made?
Did another item move unexpectedly?

This matters because an animation can succeed visually while persistence fails. The reverse can also happen: the backend state changes, but the UI does not reflect it correctly.

A good test treats the drag gesture as the action, not as the final proof.
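The durable-outcome checks can be expressed independently of how the gesture was performed. The sketch below assumes the test can read the board as a mapping from column name to ordered card ids, whether from the DOM, from an API, or after a reload. The `verify_move` helper and the board shape are hypothetical.

```python
from typing import Dict, List

Board = Dict[str, List[str]]  # column name -> ordered card ids


def verify_move(before: Board, after: Board, card: str,
                source: str, target: str, index: int) -> List[str]:
    """Compare board snapshots and list anything that contradicts the intended move."""
    problems: List[str] = []
    if source != target and card in after.get(source, []):
        problems.append(f"{card!r} is still in {source!r}")
    target_cards = after.get(target, [])
    if card not in target_cards:
        problems.append(f"{card!r} is not in {target!r}")
    elif target_cards.index(card) != index:
        problems.append(
            f"{card!r} is at position {target_cards.index(card)} in {target!r}, expected {index}"
        )
    # Every other card should keep its column and its relative order.
    for column in sorted(set(before) | set(after)):
        old = [c for c in before.get(column, []) if c != card]
        new = [c for c in after.get(column, []) if c != card]
        if old != new:
            problems.append(f"unexpected change in {column!r}: {old} -> {new}")
    return problems


before = {"In Progress": ["a", "b"], "Done": ["c"]}
after = {"In Progress": ["a"], "Done": ["c", "b"]}
print(verify_move(before, after, "b", "In Progress", "Done", index=1))
```

Running the same check twice, once on the state right after the drop and once on the state read after a reload, covers both the visual result and persistence.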

Marketing websites are difficult for a different reason

Product applications are often difficult because of state and interaction complexity.

Marketing websites are difficult because they change constantly.

Headlines are rewritten. Calls to action move. Campaign banners appear and disappear. Pricing sections are rearranged. Experiments swap components. Localization changes the amount of text on the page. A consent platform inserts an overlay that did not exist in the previous run.

This creates a maintenance problem even when the underlying user journey is simple.

“Endtest vs Playwright for Testing Marketing Websites With Frequent Copy Changes and Campaign Swaps” discusses the difference between testing stable behavior and tying tests too closely to frequently edited content.

The key is to decide which changes should fail the test.

For example, a checkout button disappearing should probably fail. A headline changing from “Start Free” to “Try It Free” may not deserve the same response unless that exact copy is legally or commercially important.

Selectors and assertions should reflect that distinction.

For volatile marketing pages, it is often safer to anchor tests to:

Stable semantic attributes
Form behavior
Navigation destinations
Component presence
Analytics events
Accessibility roles
Business-critical text only

The goal is not to make the tests so loose that they miss defects. It is to avoid turning every approved content edit into a false alarm.
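As an illustration of that distinction, the following sketch checks a call to action by a stable attribute and its destination, and checks the copy only when the caller says the text is business-critical. It parses static HTML with Python's standard `html.parser` to stay self-contained; the `data-testid` value, the markup, and the helper names are assumptions, and a real suite would run equivalent checks through its browser tool.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class AnchorCollector(HTMLParser):
    """Collects tag, href, and text for elements that carry a data-testid.

    Simplification: nested elements with the same tag name are not handled.
    """

    def __init__(self) -> None:
        super().__init__()
        self.elements: Dict[str, Dict[str, str]] = {}
        self._open: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr_map = {name: value or "" for name, value in attrs}
        test_id = attr_map.get("data-testid")
        if test_id:
            self.elements[test_id] = {"tag": tag, "href": attr_map.get("href", ""), "text": ""}
            self._open = test_id

    def handle_data(self, data):
        if self._open:
            self.elements[self._open]["text"] += data

    def handle_endtag(self, tag):
        if self._open and self.elements[self._open]["tag"] == tag:
            self._open = None


def check_cta(html: str, test_id: str, expected_href: str,
              required_text: Optional[str] = None) -> List[str]:
    """Fail on a missing element or wrong destination; check copy only if required."""
    collector = AnchorCollector()
    collector.feed(html)
    element = collector.elements.get(test_id)
    if element is None:
        return [f"element {test_id!r} is missing"]
    problems: List[str] = []
    if element["href"] != expected_href:
        problems.append(f"{test_id!r} links to {element['href']!r}, expected {expected_href!r}")
    if required_text is not None and element["text"].strip() != required_text:
        problems.append(f"{test_id!r} text is {element['text'].strip()!r}, expected {required_text!r}")
    return problems


page_v1 = '<h1>Start Free</h1><a data-testid="signup-cta" href="/signup">Start Free</a>'
page_v2 = '<h1>Try It Free</h1><a data-testid="signup-cta" href="/signup">Try It Free</a>'
print(check_cta(page_v1, "signup-cta", "/signup"))
print(check_cta(page_v2, "signup-cta", "/signup"))
```

Both page versions pass because the copy edit did not touch the behavior. Passing `required_text` turns the same check into a strict one for the few strings that must not change.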

Consent banners and interstitials change the initial state of the page

Cookie banners, ad interstitials, regional notices, newsletter popups, age gates, and privacy dialogs all have one thing in common: they can block the page before the actual test begins.

They are also rarely consistent.

The same visitor may see different overlays based on region, browser storage, previous consent, referrer, campaign parameters, or experimentation rules. Some overlays appear immediately. Others appear after a delay or only after the first scroll.

“Endtest Buyer Guide for Teams Testing Ad Interstitials, Cookie Banners, and Consent Overlays” focuses on this exact class of UI.

There are two common mistakes here.

The first is dismissing every overlay automatically at the start of every test. That can hide real defects in the overlay itself.

The second is allowing the overlay to appear unpredictably in unrelated tests. That makes otherwise stable tests fail for reasons that have nothing to do with the feature being tested.

A better strategy is to divide the coverage:

Create dedicated tests for the consent or interstitial flow.
Establish a known consent state for unrelated regression tests.
Test both accepted and rejected states when behavior differs.
Include regional configurations where regulations or content change.
Verify that the page remains usable when consent is declined.

The important idea is control. Browser tests become more reliable when the starting state is intentional rather than accidental.
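One common way to make the starting state intentional is to seed the consent decision before the page loads, for example as a cookie. The sketch below only builds the cookie data. The cookie name and values are hypothetical, since every consent platform defines its own, and the dictionary shape (name, value, domain, path) is the general form that tools such as Selenium and Playwright accept when adding cookies, so check the exact fields your tool expects.

```python
from typing import Dict, List

# Hypothetical cookie name and values; a real consent platform defines its own.
CONSENT_COOKIE_NAME = "consent_state"
CONSENT_VALUES = {"accepted": "all", "rejected": "none"}


def consent_cookies(state: str, domain: str) -> List[Dict[str, str]]:
    """Return the cookies that put a fresh browser session into a known consent state.

    "unset" returns no cookies, which is what the dedicated banner tests want.
    """
    if state == "unset":
        return []
    if state not in CONSENT_VALUES:
        raise ValueError(f"unknown consent state: {state!r}")
    return [{
        "name": CONSENT_COOKIE_NAME,
        "value": CONSENT_VALUES[state],
        "domain": domain,
        "path": "/",
    }]


# Regression tests pick a state explicitly instead of dismissing whatever appears.
for state in ("accepted", "rejected", "unset"):
    print(state, consent_cookies(state, "shop.example.test"))
```

Unrelated regression tests would start from "accepted" or "rejected", while the dedicated consent tests start from "unset" and exercise the banner itself.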

Parallel execution is an infrastructure problem as much as a Selenium problem

When a test suite becomes slow, running tests in parallel sounds like the obvious fix.

It can be, but parallel execution exposes assumptions that were invisible when tests ran one at a time.

Two tests may use the same account. They may edit the same record. They may rely on a shared download directory, test environment, inbox, database row, or browser profile. Once they run simultaneously, they interfere with each other.

“How to Run Selenium Tests in Parallel” covers the implementation side of parallel Selenium execution.

The harder part is often preparing the suite for concurrency.

Before increasing the worker count, check whether tests have:

Independent test data
Separate browser sessions
Unique users or tenants
Isolated downloads
Deterministic cleanup
No dependence on execution order
Enough environment capacity
Clear ownership of shared resources
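Several items on this list come down to giving each worker its own resources. A minimal sketch, assuming the test runner exposes some worker identifier, could allocate a unique user name and download directory per worker and remove them afterwards. The naming scheme, the `example.test` address, and the helper names are illustrative assumptions.

```python
import shutil
import tempfile
import uuid
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class WorkerResources:
    worker_id: str
    username: str
    download_dir: Path


def allocate_resources(worker_id: str, run_id: Optional[str] = None) -> WorkerResources:
    """Create per-worker test data so parallel tests never share a user or a folder."""
    run_id = run_id or uuid.uuid4().hex[:8]
    username = f"e2e-{run_id}-{worker_id}@example.test"  # hypothetical naming scheme
    download_dir = Path(tempfile.mkdtemp(prefix=f"downloads-{run_id}-{worker_id}-"))
    return WorkerResources(worker_id, username, download_dir)


def release_resources(resources: WorkerResources) -> None:
    """Deterministic cleanup: remove the worker's download directory."""
    shutil.rmtree(resources.download_dir, ignore_errors=True)


first = allocate_resources("worker-0", run_id="demo")
second = allocate_resources("worker-1", run_id="demo")
print(first.username, second.username)
release_resources(first)
release_resources(second)
```

The browser session would then be configured to download into `download_dir` and to log in as `username`, so two workers never compete for the same account or file.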

Parallelism does not reduce the total amount of work. It spreads the same work across more infrastructure.

If the application, database, browser grid, or third-party service cannot handle the additional load, the suite may become faster on paper but less reliable in practice.

The useful metric is not simply total runtime. It is how much trustworthy feedback the suite produces per unit of time and infrastructure cost.

A green CI run is not proof that an AI test agent is safe

AI test agents introduce another version of the trust problem.

A team may see a high pass rate and assume the agent is performing well. But pass rate only tells you how often the final tests were green. It does not tell you whether the agent changed the test correctly, weakened assertions, ignored a defect, or adapted to the wrong behavior.

“Why CI Pass Rates Don’t Tell You Whether an AI Test Agent Is Safe to Trust” makes this distinction clear.

An AI agent can improve the pass rate in both good and bad ways.

A good repair might replace a brittle selector with a stable one.

A bad repair might remove an assertion, accept any visible element, increase a timeout until the symptom disappears, or update the expected result to match a regression.

Both repairs can turn red into green.

That means teams need more than a final status. They need visibility into what changed and why.

Useful controls include:

Reviewing agent-generated changes
Recording the previous and new locator
Tracking assertion modifications separately
Limiting which parts of a test the agent can edit
Requiring approval for behavior-changing repairs
Measuring defect detection, not only pass rate
Replaying repairs against known negative cases
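A few of these controls can be approximated with a simple comparison of a test step before and after the agent touched it. The sketch below assumes each step can be summarized as a locator, a timeout, and a list of assertion strings. That data model, the flag names, and the timeout factor are hypothetical, not any agent's real output format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepSnapshot:
    locator: str
    timeout_ms: int
    assertions: List[str] = field(default_factory=list)


def classify_repair(before: StepSnapshot, after: StepSnapshot,
                    max_timeout_factor: float = 2.0) -> List[str]:
    """Describe what a repair changed, keeping locator edits separate from assertion edits."""
    flags: List[str] = []
    if before.locator != after.locator:
        # Recorded with both values so a reviewer can judge the new selector.
        flags.append(f"locator-changed: {before.locator!r} -> {after.locator!r}")
    removed = [a for a in before.assertions if a not in after.assertions]
    added = [a for a in after.assertions if a not in before.assertions]
    if removed:
        flags.append(f"assertions-removed: {removed}")
    if added:
        flags.append(f"assertions-added: {added}")
    if after.timeout_ms > before.timeout_ms * max_timeout_factor:
        flags.append(f"timeout-increased: {before.timeout_ms} -> {after.timeout_ms}")
    return flags


def needs_approval(flags: List[str]) -> bool:
    """Anything beyond a locator change is treated as behavior-changing."""
    return any(not flag.startswith("locator-changed") for flag in flags)


original = StepSnapshot("#submit-btn-3", 5000, ["text == 'Order confirmed'"])
good_repair = StepSnapshot("[data-testid='submit']", 5000, ["text == 'Order confirmed'"])
bad_repair = StepSnapshot("#submit-btn-3", 30000, [])

print(classify_repair(original, good_repair), needs_approval(classify_repair(original, good_repair)))
print(classify_repair(original, bad_repair), needs_approval(classify_repair(original, bad_repair)))
```

Both repairs could turn the same red test green, but only the second one is routed to a human before it is accepted.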

A test system should optimize for accurate feedback, not for the maximum possible number of green checks.

Generated code and managed automation solve different problems

The arrival of AI coding tools has made it much easier to generate Playwright or Selenium code. That is valuable, especially for experienced engineers who want to accelerate setup and repetitive implementation work.

But code generation is not the same as test management.

“Cursor for Playwright Tests vs Endtest: Generated Code or Managed Test Automation?” explores that distinction.

An AI coding assistant can help create a test file. The team still needs to decide how that test will be:

Reviewed
Stored
Executed
Scheduled
Debugged
Maintained
Reported
Shared with non-developers
Connected to environments and credentials
Governed over time

For some teams, owning all of that in code is exactly the right choice. They already have the engineering capacity, framework conventions, and CI infrastructure.

Other teams primarily need reliable regression coverage and do not want to build an internal testing platform around generated scripts.

The decision is less about whether code is good or bad. It is about which layer the team wants to own.

Generated code gives you implementation output.

Managed automation attempts to provide the operating layer around the tests: execution, scheduling, reporting, and maintenance.

Maintenance is the real cost of regression coverage

Most test automation tools look effective during a proof of concept.

The difficult question is what happens six months later, after the product has changed, the original author has moved to another project, and the suite contains hundreds of scenarios.

“Endtest Review for Teams That Need Maintainable Regression Coverage Across Fast-Changing Web Apps” looks at testing through that longer-term lens.

Maintenance cost is influenced by more than the number of failures.

It includes:

Time spent understanding why a test failed
Time spent distinguishing product defects from test defects
Knowledge required to update the test
Delays caused by unavailable test owners
Flaky reruns
Changes to shared helpers
Infrastructure upkeep
Reporting and triage overhead
The cost of tests that silently stopped checking the right thing

This is why the fastest tool for creating the first ten tests is not necessarily the fastest tool for managing the next thousand.

A realistic evaluation should include deliberate change.

After creating the initial test, modify the application:

Rename an element.
Move a component.
Change the order of results.
Introduce a loading delay.
Add an overlay.
Alter the text.
Run the same scenario in another browser.

Then measure how easy it is to understand and repair the failure.

That exercise usually reveals more than another polished demo.

The common thread is control over uncertainty

Search suggestions, drag-and-drop interfaces, marketing pages, consent overlays, parallel execution, and AI-generated repairs look like separate topics.

They share the same underlying problem: uncontrolled variability.

Reliable browser testing depends on deciding which variables should be fixed, which should be observed, and which should be allowed to change.

You need controlled data for ranking tests.

You need controlled state for overlays.

You need controlled resources for parallel execution.

You need controlled permissions for AI agents.

You need controlled assertions for pages with frequently changing content.

The best browser tests are not the ones that attempt to predict every possible UI detail. They are the ones that clearly identify the business behavior that must remain true, create the conditions needed to observe it, and fail for reasons that a human can understand.

That is much harder than clicking a button and checking a message.

It is also where test automation starts becoming genuinely useful.
