Markus Gasser

Posted on Jun 1

Why Tests Start Failing Over Time, and What Teams Can Actually Do About It

#testing #qa #automation #devops

A belief that sounds reasonable is this: if a test passed last week, it should still be a reliable signal this week. When a suite starts failing, the instinct is often to blame the app, rerun the job, or add another retry. That can work for a day. Over time, it turns into a maintenance tax that nobody planned for.

The real problem is usually not one thing. Tests fail over time because the product changes, the test code ages, and the team’s habits slowly drift toward convenience over signal. The good news is that most of that drift is manageable if you treat automation like code that needs design, review, and cleanup, not just execution.

Myth 1: A flaky test is just a flaky test

Reality: flakes usually point to weak assumptions

When a test fails intermittently, it is tempting to label it as random noise. Sometimes it is environmental noise, but often the test depends on something unstable: a timing edge, an async state that is not ready yet, a locator that changes with every UI tweak, or shared data that another test mutates.

A useful way to think about flakes is to ask, “What assumption is this test making that the product does not guarantee?” If a test assumes an element appears immediately, that is a timing assumption. If it assumes a row count will stay fixed while background work is still settling, that is a state assumption. If it depends on a CSS class that design keeps renaming, that is a selector assumption.

The fix is not to hide the failure with more retries. The fix is to make the test wait for the right condition, use stable identifiers, and isolate the data or environment it depends on. If a test still flakes after that, it is often telling you the workflow itself is not deterministic enough for the level of automation you are asking of it.
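As a rough illustration of "wait for the right condition," here is a minimal polling helper in Python. It is a sketch, not a replacement for the explicit waits your test framework already provides. The names `wait_until` and `FakeOrderTable` are hypothetical and exist only for this example; the fake table stands in for a UI that fills in asynchronously.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> None:
    """Poll `condition` until it returns True, or raise once `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakeOrderTable:
    """Hypothetical stand-in for a table whose rows arrive after background work."""

    def __init__(self, ready_after_polls: int) -> None:
        self._polls_left = ready_after_polls

    def row_count(self) -> int:
        # Report zero rows until the simulated background work has settled.
        if self._polls_left > 0:
            self._polls_left -= 1
            return 0
        return 3


table = FakeOrderTable(ready_after_polls=4)

# Wait for the state the test cares about, not for a fixed sleep to elapse.
wait_until(lambda: table.row_count() == 3, timeout=2.0, interval=0.01)
settled_rows = table.row_count()
```

The point of the design is that the wait names the condition the test depends on, and a timeout produces a clear failure instead of a silent pass or an arbitrary sleep.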

Myth 2: Fragile locators are just a frontend problem

Reality: selector choices become a maintenance contract

Many teams discover locator fragility only after a design refresh, a component library change, or a checkout redesign breaks half the suite. The issue is rarely that someone picked the wrong selector once. It is that the suite accumulated too many tests tied to visual structure instead of stable meaning.

If a test reaches for nth-child chains, generated class names, or deeply nested DOM structure, it is usually borrowing implementation details that the product team is free to change. That may feel fast in the moment, but every future UI refactor turns into a test rewrite.

Stable locators are not just about test convenience. They are part of maintainable product engineering. Prefer semantic HTML, accessible roles where they fit, and explicit test ids where your team agrees they are appropriate. That advice lines up well with the accessibility angle in “Why Frontend Teams Keep Missing Accessibility Regressions in Review.” The piece makes a practical case that when teams rely on review alone, regressions in semantics and keyboard behavior slip through. The same gap affects automation, because tests that cannot “see” meaningful structure end up locking onto brittle markup instead.
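One way to make the selector contract visible is a small review-time check that flags selectors borrowing implementation details. The sketch below is a heuristic under stated assumptions: the function name `brittleness_reasons` is hypothetical, the `css-` and `sc-` prefixes are only examples of what generated class names can look like, and the nesting threshold of three child combinators is arbitrary.

```python
import re
from typing import List


def brittleness_reasons(selector: str) -> List[str]:
    """Return heuristic reasons why a CSS selector may break on a UI refactor."""
    reasons: List[str] = []
    # Positional selectors depend on sibling order, which layouts change freely.
    if re.search(r":nth-(child|of-type)\(", selector):
        reasons.append("positional selector")
    # Assumed prefixes for generated class names; adjust to your own tooling.
    if re.search(r"\.(css|sc)-[A-Za-z0-9]+", selector):
        reasons.append("generated class name")
    # Three or more child combinators suggest reliance on DOM structure.
    if selector.count(">") >= 3:
        reasons.append("deep DOM nesting")
    return reasons


candidates = [
    "div.checkout > ul > li:nth-child(3) > button",
    ".css-1x2y3z button",
    '[data-testid="place-order"]',
    'button[aria-label="Place order"]',
]

# Map each selector to its list of warnings; an empty list means no flags.
report = {selector: brittleness_reasons(selector) for selector in candidates}
```

A check like this will not judge whether a selector is meaningful, but it can surface the nth-child chains and generated classes before they spread through the suite.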

Myth 3: Retries and self-healing are free reliability

Reality: they can improve signal, or hide breakage

There is a real place for retries, locator recovery, and self-healing in CI. They can smooth over transient failures and reduce noise when the issue is clearly external, like a momentary network hiccup or a known platform instability. But when they are used as a blanket response, they start masking actual product problems.

A test that silently heals its selector may pass even though the page changed in a way users would notice. A retry may turn a useful failure into a green build without explaining whether the app is now slower, inconsistent, or half-broken. That is not resilience; it is ambiguity.

This is why governance matters. Define which failures are safe to retry, which should fail fast, and which should open an investigation. Keep the logs visible enough that a healed run still leaves a trail. The article “Self-Healing Tests in CI: When They Help, When They Hide Real Breakages” is a good reminder that automation trust depends on knowing when the tool corrected a test and when it should have stopped the line.
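A retry policy along these lines can be written down as code rather than left to habit. The following is a minimal sketch, assuming failures can be classified by exception type. The mapping in `POLICY`, the `Action` names, and `run_with_policy` are all hypothetical choices for illustration, and a real suite would likely classify on richer signals than the exception class.

```python
from enum import Enum
from typing import Callable, Dict, List, NamedTuple, Optional, Type


class Action(Enum):
    RETRY = "retry"
    FAIL_FAST = "fail_fast"
    INVESTIGATE = "investigate"


# Hypothetical policy: which failure types are safe to retry and which are not.
POLICY: Dict[Type[BaseException], Action] = {
    ConnectionError: Action.RETRY,       # transient network hiccup
    AssertionError: Action.FAIL_FAST,    # the product did something unexpected
    TimeoutError: Action.INVESTIGATE,    # may mean the app got slower
}


class RunResult(NamedTuple):
    passed: bool
    attempts: int
    last_action: Optional[Action]
    trail: List[str]


def classify(exc: BaseException) -> Action:
    """Look up the action for a failure; unknown failures get investigated."""
    for exc_type, action in POLICY.items():
        if isinstance(exc, exc_type):
            return action
    return Action.INVESTIGATE


def run_with_policy(test: Callable[[], None], max_retries: int = 2) -> RunResult:
    """Run a test, retrying only failures the policy marks as retryable."""
    trail: List[str] = []
    attempts = 0
    while True:
        attempts += 1
        try:
            test()
        except Exception as exc:
            action = classify(exc)
            # Record every failure so a run that later passes still leaves a trail.
            trail.append(f"attempt {attempts}: {type(exc).__name__}: {exc} -> {action.value}")
            if action is Action.RETRY and attempts <= max_retries:
                continue
            return RunResult(False, attempts, action, trail)
        return RunResult(True, attempts, None, trail)


calls = {"count": 0}


def flaky_network_check() -> None:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("connection reset")


def broken_total_check() -> None:
    raise AssertionError("order total does not match")


retried = run_with_policy(flaky_network_check)
failed = run_with_policy(broken_total_check)
```

In this sketch the network failure is retried once and passes with a one-line trail, while the assertion failure stops on the first attempt. The trail is the part worth keeping: a green build that needed a retry should still say so.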

Myth 4: More coverage automatically means less risk

Reality: coverage without maintenance becomes debt with a dashboard

Teams often expand suites because they want confidence, and that is understandable. But if every new test adds another brittle selector, another shared fixture, and another slow setup path, the suite gets harder to trust even as it gets larger.

A bloated suite creates a strange kind of drag. Engineers stop reading failures carefully because they are too common. QA spends more time triaging noise than improving coverage. Product changes slow down because the suite needs babysitting after every release. At that point, the test system is no longer protecting velocity; it is consuming it.

This is where thoughtful scope matters. Not every flow deserves the same level of end-to-end coverage. Some paths need exhaustive automation; others need a smaller smoke layer and a few targeted integration tests. For fast-changing user journeys, especially checkout-style flows, the practical lesson in “Endtest for QA Teams Testing Fast-Moving Checkout Flows: What Actually Breaks First” is that readable assertions and stable reruns matter more than trying to capture every visual detail. That is a useful pattern to borrow even if you are not using the same tool.

Myth 5: Test maintenance is just the QA team’s job

Reality: the whole team shapes automation quality

It is easy for a team to treat broken tests like a QA backlog item. In practice, most of the causes sit upstream. Developers choose the structure that locators depend on. Designers influence how often the UI churns. Product decisions determine whether workflows can be made deterministic. Infrastructure choices affect timing, network stability, and environment consistency.

So if a suite keeps breaking, the answer is not to ask QA to “own it harder.” The answer is to build a shared maintenance model. That means agreeing on naming conventions for stable selectors, reviewing testability during feature work, and treating test failures as product signals, not just pipeline noise.

It also means having an honest conversation about what to do when automation outgrows the team that maintains it. In some cases, outsourcing regression support can make sense, but only if the provider understands process maturity, reporting quality, and long-term ownership. The checklist in “Checklist for Reviewing a QA Agency Before You Outsource Regression Testing” is helpful precisely because it focuses on the operating model, not just promised coverage.

Myth 6: If a test passes locally, it is good enough

Reality: local success can hide environment-sensitive failures

Local runs are useful, but they are also forgiving in ways CI is not. A developer machine may have warm caches, different timing, a cleaner browser state, or a dataset that happened to be in the right shape. CI exposes the parts of the test that depend on order, latency, or shared state.

That is why maintenance is not only about fixing failures. It is about making failures reproducible. A good test failure should answer three questions quickly: what broke, where did it break, and under what condition. If a team cannot answer those questions, the suite is not just flaky; it is hard to debug.

The more a suite leans on explicit setup, scoped fixtures, stable waits, and clear assertions, the easier it becomes to distinguish product failures from test failures. That reduces the constant churn that makes automation feel expensive.
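To make "scoped fixtures" and "clear assertions" concrete, here is a small sketch in Python. It assumes an in-memory dictionary in place of a real shared database, and the names `FAKE_DB`, `isolated_cart`, and `check_cart_count` are hypothetical. The fixture gives each test its own record and always cleans it up, and the assertion message is shaped to answer what broke, where, and under what condition.

```python
import uuid
from contextlib import contextmanager
from typing import Dict, Iterator

# Hypothetical in-memory stand-in for a shared test database.
FAKE_DB: Dict[str, Dict[str, int]] = {}


@contextmanager
def isolated_cart(items: int) -> Iterator[str]:
    """Create a cart with a unique id for one test and remove it afterwards."""
    cart_id = f"cart-{uuid.uuid4().hex[:8]}"
    FAKE_DB[cart_id] = {"items": items}
    try:
        yield cart_id
    finally:
        # Clean up even when the test body raises, so no state leaks to other tests.
        FAKE_DB.pop(cart_id, None)


def check_cart_count(cart_id: str, expected: int, env: str) -> None:
    """Fail with a message that says what broke, where, and under what condition."""
    actual = FAKE_DB[cart_id]["items"]
    if actual != expected:
        raise AssertionError(
            f"what: cart item count was {actual}, expected {expected}; "
            f"where: cart {cart_id}; condition: env={env}"
        )


with isolated_cart(items=2) as cart_id:
    check_cart_count(cart_id, expected=2, env="ci")

# Nothing is left behind for the next test to trip over.
leftover = len(FAKE_DB)
```

Because every test creates and removes its own data, run order stops mattering, which is often the difference between a test that passes locally and one that also passes in CI.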

The maintenance habits that actually reduce drag

There is no single trick that makes tests stay healthy forever, but a few habits consistently help:

Keep selectors meaningful

Use locators that reflect how a user or assistive technology would identify the element, not how the DOM happened to be arranged last sprint.

Make async states explicit

Wait for the state you care about, not just for the page to exist. A loaded page is not the same thing as a ready workflow.

Review failures like code

A flaky failure should be taken as seriously as a bug in production. Ask whether the problem is the product, the test, or the environment.

Trim tests that no longer earn their keep

If a test is expensive to maintain and rarely catches meaningful regressions, retire it or replace it with a cheaper layer.

Set rules for retries and healing

Use them to reduce noise, not to redefine success.

Treat testability as part of feature quality

If a feature is impossible to automate without brittle hacks, the feature probably needs better hooks, better semantics, or better observability.

A better mental model for automation

The most useful shift I have seen is this: stop treating automation as a one-time asset and start treating it as a living system. Living systems drift. UI changes, data changes, dependencies change, and teams change. That does not mean automation is unreliable by nature. It means reliability has to be designed and maintained.

When teams accept that, they stop asking, “Why did the test fail again?” and start asking, “What changed in the system, and what does this failure teach us?” That is a much healthier place to be, because it turns debugging into learning instead of triage.

If you reduce flaky failures, choose stable locators, and put guardrails around retries and healing, your suite becomes easier to trust. More importantly, it becomes easier to keep.

DEV Community