
Tudor Brad

Posted on • Originally published at betterqa.co

Why your test suite becomes unmaintainable in year two

Last quarter we onboarded a client whose lead developer sent us their automation repo with a warning in the README: "tests are flaky, just rerun them if they fail." We opened the project. There was a single Page Object file called MainPage.js that was 3,147 lines long. Every selector in the suite was an XPath that started with //div[1]/div[2]/div[3]/span[4]. The CI pipeline had a retry policy of five attempts per test because the suite was so unstable that a single pass was statistically unlikely.

The previous QA contractor had been paid for 18 months to build this. It was worse than having no tests at all, because the team trusted it enough to not check things manually, and not enough to believe a red build.

This is not a rare story. We run QA for clients across 24 countries out of our Cluj-Napoca office, and we inherit test suites like this constantly. The patterns that break them are almost always the same. Here is what we find, what we tried that made it worse, and what we wish the original authors had done differently.

The 3000-line Page Object is always the first sign

The Page Object Model is a good idea. The problem is that most teams learn about POM from a tutorial that shows one login page with three elements, and they scale that pattern by adding more lines to the same file.

What ends up in the repo is a class named MainPage or AppPage that contains locators for the header, the sidebar, the dashboard, the settings modal, the user profile, the checkout flow, and the help widget. When the checkout button moves, someone searches the 3000-line file, finds four candidate selectors that all look plausible, updates the wrong one, and the tests still fail.

The fix is not "write smaller Page Objects." The fix is that a Page Object should represent a URL or a self-contained component, not an application. If your product has a dashboard, a settings page, and a checkout flow, that is three Page Objects minimum. If the dashboard has a sidebar that appears on every authenticated page, the sidebar is its own component object that gets composed into the pages that use it.
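A minimal sketch of that composition, using a stand-in `Page` interface instead of a real Playwright page (the class names, method names, and selectors here are hypothetical examples, not the client's actual code):

```typescript
// Stand-in for Playwright's Page; a real suite would import it from
// '@playwright/test' instead.
interface Page {
  click(selector: string): void;
}

// The sidebar appears on every authenticated page, so it gets its own
// component object instead of its locators living in each page class.
class Sidebar {
  constructor(private page: Page) {}
  openSettings(): void {
    this.page.click('[data-testid="sidebar-settings"]');
  }
}

// Each page object owns only its own locators and composes the shared
// component rather than duplicating its selectors.
class DashboardPage {
  readonly sidebar: Sidebar;
  constructor(private page: Page) {
    this.sidebar = new Sidebar(page);
  }
  openReport(): void {
    this.page.click('[data-testid="dashboard-report"]');
  }
}

class CheckoutPage {
  readonly sidebar: Sidebar;
  constructor(private page: Page) {
    this.sidebar = new Sidebar(page);
  }
  submit(): void {
    this.page.click('[data-testid="checkout-submit"]');
  }
}
```

When the sidebar changes, one component file changes, and every page that composes it picks up the fix.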

We refactored that 3,147-line file into 23 smaller ones over two sprints. Same test coverage. The bugs we were paid to find started showing up in test failures instead of getting lost in retry noise.

Selectors are the thing that kills you

Almost every inherited test suite we have seen uses selectors that will break the next time a developer touches the HTML. CSS selectors like div.container > div:nth-child(2) > button. XPaths that walk the entire DOM tree. Class names copied from Tailwind or Bootstrap that change whenever the design team updates a color.

The root cause is that developers were not involved in writing the tests, so they never added stable hooks for the QA team. The QA team then used whatever they could find in the DevTools inspector, which was whatever the framework emitted.

This is a conversation, not a technical fix. When we start on a new project, the first thing we ask is whether we can add data-testid attributes to the application. Most developers say yes. It costs them nothing. A five-minute PR that adds data-testid="checkout-submit" to one button is worth more than three hours of XPath archaeology when that button moves.

If the developers refuse, or if we inherit a codebase where that conversation cannot happen, the next best option is text-based selectors. Playwright's getByRole and getByText work on semantic HTML that tends to change less often than class names. They are not bulletproof. But they are more resilient than div.MuiButton-root-284.
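A toy illustration of why the test id survives a refactor that kills the class-based selector. This uses plain string matching on HTML snapshots rather than a real browser, and the class name and test id are hypothetical; a real suite would use Playwright's `getByTestId` against a live page:

```typescript
// Snapshot of the button before a design-system upgrade.
const before =
  '<button class="MuiButton-root-284" data-testid="checkout-submit">Pay</button>';
// After the upgrade, the framework-generated class name changed, but the
// hand-added data-testid did not.
const after =
  '<button class="MuiButton-root-519" data-testid="checkout-submit">Pay</button>';

// Two "selector" strategies, reduced to substring checks for the sketch.
const byClass = (html: string) => html.includes('MuiButton-root-284');
const byTestId = (html: string) => html.includes('data-testid="checkout-submit"');
```

The class-based lookup finds the button before the upgrade and loses it after; the test-id lookup finds it in both snapshots. That asymmetry is the whole argument for the five-minute `data-testid` PR.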

We built Flows, our self-healing test recorder, specifically because selector maintenance eats so much QA time on long-running projects. Flows records tests by watching what you click, and when a selector breaks later, it tries alternative ways to find the same element before giving up. It is not magic, and it cannot save a test that points to a button that got deleted, but it buys you a lot of time on projects where the UI is still moving.

The tests that failed randomly on Tuesdays

One client had a suite that failed every Tuesday morning at about 11:15 Romanian time. Not every test. Just a handful. The pattern was consistent enough that the team had started ignoring red builds on Tuesdays.

It took us two weeks to figure out what was happening. The tests hit a staging environment that shared a database with a scheduled data refresh job. The job ran every Tuesday around 09:00 UTC (11:00 in Romania during winter, 12:00 during summer), and for about 20 minutes while it was running, certain records the tests relied on were in an inconsistent state.

The tests were not flaky. The environment was. But because nobody had written the tests to verify the preconditions they depended on, the failures looked random, and "just rerun them" was the accepted workaround.

This is the category of pain the Test Automation Pyramid is supposed to prevent. If most of your coverage is at the UI level, every test is coupled to the full stack including the database, the network, the third-party services, and the scheduled jobs. Moving coverage down to unit and integration tests does not mean "write fewer UI tests because they are slow." It means the UI tests should only cover what can only be tested at the UI level, because every additional UI test is another thing that can fail for reasons that have nothing to do with your code.

We did not fix this client's suite by adding more retries. We fixed it by deleting about 40 percent of the UI tests, moving their logic down to API-level tests that did not touch the UI at all, and adding setup code to the remaining UI tests that verified the database state before running.
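The precondition check in the remaining UI tests looked roughly like this sketch. `fetchRecordCount` is a hypothetical stand-in for whatever API or database query the test depends on; the point is that the test fails fast with a diagnosis instead of failing somewhere mid-flow and looking flaky:

```typescript
// Guard run in test setup: verify the environment state the test
// depends on, and fail with a clear reason if it is not met.
function assertPreconditions(
  fetchRecordCount: () => number, // stand-in for an API/DB query
  expectedMin: number
): void {
  const count = fetchRecordCount();
  if (count < expectedMin) {
    throw new Error(
      `Precondition failed: expected at least ${expectedMin} records, got ${count}. ` +
        `Staging data is likely mid-refresh; this is an environment problem, not a product bug.`
    );
  }
}
```

A red build that says "staging data is mid-refresh" gets handled very differently from a red build that says a checkout assertion failed, which is exactly the distinction the Tuesday failures were missing.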

The refactor that made things worse

A few years ago we tried to help a client migrate their Selenium suite to Cypress because Cypress would "fix the flakiness." We were wrong, and we made things worse.

The flakiness was not Selenium's fault. It was the same problem as the Tuesday failures: the tests were coupled to unstable state. Moving them to Cypress did not fix the state, it just rewrote the same bad tests in a new framework. What it did do was burn three months of budget, throw away the one thing the client had that partially worked (the old suite was unreliable, but it still caught real bugs), and leave them with a half-migrated suite that nobody wanted to finish.

The lesson we took from that project is that test framework migrations are almost never the answer. If tests are flaky, the flakiness is usually in how the tests are written or what they depend on, not the framework. We now refuse to migrate a suite to a new framework unless the client has a specific reason that the old framework cannot solve, and even then we migrate one module at a time and keep both suites running until the new one has proven itself.

Data-driven testing sounds great until you do it wrong

Separating test data from test scripts is good advice. What the advice does not tell you is that data-driven tests can amplify your problems instead of reducing them.

We had a client with a test that read a 400-row CSV of user accounts and ran the same login flow for each one. When the login page changed, one test file broke 400 times, and the test report was unreadable. The team disabled the entire test rather than fix it.

The right move is to be honest about what data-driven tests are for. They are good when you need to verify that a workflow handles different categories of input correctly: a valid email, an invalid email, an email at the maximum length, a unicode email. That is maybe 5 to 15 rows, chosen deliberately. It is not good for running the same test against every row in a database dump "because we have the data."
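A deliberately small table like that might look like this sketch. The validator is a simplified hypothetical stand-in for real login-form logic; what matters is that each row is a named category of input, not a row from a database dump:

```typescript
// Simplified stand-in for the real email validation under test.
const isPlausibleEmail = (s: string) =>
  s.length > 3 && s.length <= 254 && s.includes("@") && !s.startsWith("@");

// Each row is a category chosen deliberately, with a label that makes
// a failure in the report immediately readable.
const cases: Array<{ label: string; input: string; valid: boolean }> = [
  { label: "valid email", input: "ana@example.com", valid: true },
  { label: "missing @", input: "ana.example.com", valid: false },
  { label: "leading @", input: "@example.com", valid: false },
  { label: "max-length email", input: "a".repeat(242) + "@example.com", valid: true },
  { label: "unicode local part", input: "ană@example.com", valid: true },
];

// Run every category through the validator; surviving rows are bugs.
const failures = cases.filter((c) => isPlausibleEmail(c.input) !== c.valid);
```

Five labeled rows tell you which behavior broke. Four hundred anonymous rows tell you the login page changed, which you already knew.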

If you find yourself writing a data-driven test with more than 20 rows, ask whether you are testing behavior or just making the test report longer. Usually you are making the report longer.

What "continuous improvement" actually looks like

The original version of this article said to "review and update your test scripts regularly." That is true but useless advice. Every team already knows they should do this. Nobody does, because there is no budget for it and no way to measure it.

What we do on client projects is different. We track two numbers per week:

  1. Flake rate: percentage of test runs that failed on the first attempt but passed on a rerun without any code change. If this number is above 2 percent, we stop writing new tests and fix existing ones until it comes down.
  2. Selector churn: how many tests had their selectors updated in the last sprint. A high number means the UI is changing faster than the tests can keep up, which is a signal to talk to the developers about data-testid or to invest in tools like Flows that can absorb the change automatically.
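The flake-rate number is simple to compute from CI history. A sketch, assuming a hypothetical per-run record shape; real pipelines expose this data in their own formats:

```typescript
// One entry per test run in the tracking window.
interface RunRecord {
  firstAttemptPassed: boolean;
  // Only meaningful when the first attempt failed: did a rerun pass
  // with no code change in between?
  passedOnRerun: boolean;
}

// Flake rate: fraction of runs that failed first, then passed on a
// rerun without any code change.
function flakeRate(runs: RunRecord[]): number {
  if (runs.length === 0) return 0;
  const flakes = runs.filter((r) => !r.firstAttemptPassed && r.passedOnRerun);
  return flakes.length / runs.length;
}
```

Runs that fail on both the first attempt and the rerun are real failures, not flakes, which is why the guard condition checks both fields.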

Neither number is in any testing tutorial we have seen. They are what we learned to watch after the fifth inherited broken suite.

The honest limitation

None of this guarantees a maintainable suite. If the product pivots every six weeks, or if the developers refuse to add test hooks, or if the deadline culture rewards "ship now, test later," your suite will become unmaintainable no matter how carefully you write it. We have walked away from projects where the problem was upstream of anything QA could fix.

The best predictor we have found for whether a test suite will stay maintainable is whether the developers see the test suite as their problem too. When developers add data-testid attributes without being asked, fix tests they broke in their own PRs, and treat red CI as a block on merging, the suite stays healthy. When QA is a wall that builds throw things over, no pattern in this article will save you.


BetterQA runs independent QA for companies that want someone outside their dev team verifying the software before users see it. We built Flows because selector maintenance was eating too much of our time, and we keep it as part of the toolkit we bring to client projects. If you are inheriting a test suite that feels like the one in the story at the top of this article, we have probably seen worse, and we can help you untangle it. betterqa.co
