RAAZU shanigarapu

Posted on • Originally published at raju-shanigarapu.vercel.app

Why Your Automation Framework is Failing (It's the Architecture)

I've seen it happen at every company I've joined.

Someone built an automation framework 2 years ago. It has 800 tests. Half of them fail on any given day. Nobody touches it except to disable the failing ones. The team has silently agreed to stop trusting it.

This is the most common QA failure mode. And it has almost nothing to do with the tool choice.

The Real Reason Frameworks Fail

Every post-mortem I've done on a dead automation framework finds the same root causes:

1. Tests were written for the happy path, not the system.

The team automated what users should do, not what the system could encounter. The first time a timeout, a race condition, or an API degradation hit production, the tests were useless.

2. No ownership model.

Who fixes a failing test? If the answer is "whoever broke it," the answer is actually "nobody." Automation without explicit ownership is automation in decline.

3. The framework grew faster than its architecture.

Someone wrote test_login.py and copy-pasted it 300 times. No page object model. No fixtures. No hierarchy. 300 tests that all fail when the login selector changes by one character.

4. It wasn't treated as code.

Tests have linting standards, code reviews, and refactoring cadence — or they don't. Teams that treat test code as second-class code produce second-class automation.

5. No feedback loop with engineering.

If developers can merge code without seeing automation results, the automation is decorative. Tests that don't block pipelines don't protect pipelines.

What Good Architecture Actually Looks Like

I've built automation systems that outlive teams and survive platform migrations. Here's what they have in common:

A Single Source of Locator Truth

Locators live in one place. Not scattered across test files. Not duplicated across 40 helpers. When the UI changes, you update one layer and every test that touches that element is fixed.

Page Object Model is table stakes. If you're not using it, stop reading this and implement it today.
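Here's a minimal sketch of what that layer looks like, assuming Selenium; the page class and selectors are illustrative, not from any specific framework:

```python
# A minimal page object: locators live here and nowhere else.
from selenium.webdriver.common.by import By


class LoginPage:
    # Single source of locator truth. When the UI changes,
    # you update these tuples and every test follows.
    USERNAME = (By.ID, "username")
    PASSWORD = (By.ID, "password")
    SUBMIT = (By.CSS_SELECTOR, "button[type='submit']")

    def __init__(self, driver):
        self.driver = driver

    def login(self, username, password):
        self.driver.find_element(*self.USERNAME).send_keys(username)
        self.driver.find_element(*self.PASSWORD).send_keys(password)
        self.driver.find_element(*self.SUBMIT).click()
```

Tests import LoginPage and never touch a selector directly. The selector change that broke 300 copy-pasted tests now breaks exactly one line.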

Test Isolation as a Non-Negotiable

Every test must be independently executable. No test should depend on another test having run first. No shared mutable state between tests.

If you can't run test_checkout_flow.py in isolation and have it pass, you don't have a test — you have a dependency chain waiting to cascade.
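The cheapest way to enforce that, assuming pytest: give every test its own state through a function-scoped fixture instead of module-level globals. The fixture below is a hypothetical sketch:

```python
import pytest


@pytest.fixture
def cart():
    # Function scope (pytest's default): every test gets a fresh cart,
    # so no test can leak state into the next one.
    cart = {"items": [], "total": 0}
    yield cart
    cart.clear()  # teardown runs even when the test fails


def test_checkout_flow(cart):
    cart["items"].append({"sku": "ABC-123", "price": 10})
    cart["total"] = sum(item["price"] for item in cart["items"])
    assert cart["total"] == 10
```

Run it alone, run it last, run it in parallel: same result. That's the bar.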

Retry Logic That's Honest

Retry is not a solution. It's a suppressor. Use it sparingly, with a hard cap (three retries max) and logging that exposes every retry. A test that passes on retry 3 is not a passing test — it's a flaky test with a mask on.

Track your retry rate. If it's above 5%, you have a structural problem.
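Here's a sketch of what honest retry can look like; honest_retry is a hypothetical helper, not a standard library or plugin (pytest-rerunfailures exists if you'd rather use one):

```python
import functools
import logging

log = logging.getLogger("retry")


def honest_retry(max_attempts=3):
    """Retry with a hard cap and a log line for every extra attempt,
    so flakiness feeds a metric instead of disappearing."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    # Every retry is visible: this is your retry-rate data.
                    log.warning("%s flaked on attempt %d/%d",
                                fn.__name__, attempt, max_attempts)
        return wrapper
    return decorator
```

Grep your logs for those warnings weekly. That's your retry rate.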

Contract-First API Testing

Before testing behavior, test the contract. Does the API schema match what you expect? Does it match what downstream services expect?

API contract tests are the highest ROI automation you can write. They catch breaking changes before any UI test ever runs, and they run in milliseconds.
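A minimal contract test, assuming requests and jsonschema; the endpoint and schema are illustrative:

```python
import requests
from jsonschema import validate  # pip install jsonschema


# The contract: the fields downstream consumers actually depend on.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["id", "status", "total"],
    "properties": {
        "id": {"type": "string"},
        "status": {"type": "string", "enum": ["pending", "paid", "shipped"]},
        "total": {"type": "number"},
    },
}


def test_order_contract():
    resp = requests.get("https://api.example.com/orders/123", timeout=5)
    assert resp.status_code == 200
    # Fails the moment a field is renamed, removed, or retyped.
    validate(instance=resp.json(), schema=ORDER_SCHEMA)
```

No browser, no rendering, no wait conditions. Just the promise the API made, checked.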

CI/CD Integration From Day One

Not after the framework "matures." Day one. Tests that don't run in the pipeline don't matter. Test results that don't block merges don't influence behavior.

If you can't run your smoke suite in under 5 minutes, fix that before writing more tests.
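One way to carve out that fast tier, assuming pytest: tag smoke tests with a marker and run only that subset on every PR. The marker name and endpoint are illustrative:

```python
import pytest
import requests


# Register the marker once in pytest.ini or pyproject.toml:
#   markers = ["smoke: fast checks that gate every PR"]
@pytest.mark.smoke
def test_service_health():
    # A smoke test should take seconds, not minutes.
    resp = requests.get("https://api.example.com/health", timeout=5)
    assert resp.status_code == 200


# In the PR pipeline, run only the smoke tier and fail fast:
#   pytest -m smoke -x
```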

The Framework Health Checklist

Before expanding any automation project, run this checklist:

  • [ ] Can I run any single test in isolation?
  • [ ] Does every test have an owner?
  • [ ] Are failing tests blocking merges?
  • [ ] Is the flaky test rate below 5%?
  • [ ] Are locators centralized?
  • [ ] Is test code reviewed like production code?
  • [ ] Do tests run in CI on every PR?

If more than 2 of those are "no," you're building on sand.

The Architecture Decision That Matters Most

Here's the one I see teams skip most often:

Define your test pyramid before writing test one.

How many unit tests? How many integration tests? How many E2E tests? What's the expected execution time for each tier?

Without this contract, teams default to writing whatever's easiest. Usually E2E tests. Slow, brittle, expensive E2E tests that replace the fast, reliable tests that should have been written first.

The pyramid isn't a suggestion. It's load-bearing.
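One lightweight way to make the pyramid load-bearing rather than aspirational: write the budgets down where CI can check them. This is a hypothetical convention, not a standard tool, and the ratios are illustrative:

```python
# The pyramid as a checkable contract. Pick your own numbers.
TEST_PYRAMID = {
    "unit":        {"share": 0.70, "max_runtime_s": 60},
    "integration": {"share": 0.20, "max_runtime_s": 300},
    "e2e":         {"share": 0.10, "max_runtime_s": 900},
}


def pyramid_violations(counts: dict) -> list:
    """Flag any tier that has outgrown its agreed share of the suite."""
    total = sum(counts.values()) or 1
    return [
        f"{tier} is {counts.get(tier, 0) / total:.0%} of the suite; "
        f"the contract says {spec['share']:.0%}"
        for tier, spec in TEST_PYRAMID.items()
        if counts.get(tier, 0) / total > spec["share"] + 0.05  # 5% tolerance
    ]
```

A suite of 500 unit, 200 integration, and 300 E2E tests fails this check on the E2E tier. That failure is the conversation you want to have early, not after the E2E tier has swallowed your pipeline.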

My Rule of Thumb

If your automation framework requires more than 20% of QA time to maintain, it's not working for you — you're working for it.

Good automation accelerates. Bad automation accumulates.

The difference is almost always in the first 30 days of decisions.


Originally published at https://raju-shanigarapu.vercel.app/blog/why-automation-frameworks-fail
