DEV Community: Tudor Brad

When to test what: honest notes from eight years of picking the wrong strategy

Tudor Brad — Thu, 09 Apr 2026 20:01:46 +0000

A founder called me on a Thursday. He had paid his agency for three weeks of "full test coverage" on a prototype he was about to demo to investors. The demo was on Monday. Over the weekend, his team pivoted the entire data model based on feedback from an advisor, and every single one of those tests became garbage. Three weeks of billed work, gone.

He was upset with the agency. I told him it wasn't the agency's fault. Someone should have asked what stage the product was in before writing a line of test code, and nobody did, including us when he had asked us for a quote two months earlier.

That call stuck with me. Testing advice on the internet tends to be one-size-fits-all. "Write unit tests." "Automate everything." "Shift left." All of it is technically correct, and all of it is wrong if you apply it at the wrong stage. The honest truth is that the test strategy for a prototype is almost the opposite of the strategy for a mature product, and picking the wrong one costs you either quality or time, sometimes both.

I run QA operations across 50+ engineers working with teams in 24 countries. We've made a lot of these mistakes. Here's what we've actually learned about what to test, and what to leave alone, at each stage of a product's life.

Stage one: the prototype that probably won't survive

A prototype is code written to answer a question. Usually the question is something like "would users click this?" or "can we actually build this integration?" or "does this algorithm produce the output we need?"

The honest truth about prototypes is that most of them get thrown away. Not rewritten. Thrown away. The code exists to prove or disprove a hypothesis, and once you have your answer, you move on.

Writing unit tests for a prototype is usually a waste. Writing E2E tests is almost always a waste. I say this as someone who has literally billed clients for doing both, and I'm not proud of it. The story above with the founder wasn't an isolated incident. We've had at least two other projects where we wrote comprehensive automated test suites for prototypes that got scrapped within a month.

What actually helps at the prototype stage is manual exploratory testing. Have someone who didn't write the code sit down and try to use it. Take notes. Not bug reports in Jira, notes. What confused them? What broke in an obvious way? Does the core thing it's supposed to prove actually work?

If you must write automated tests at this stage, write a handful of smoke tests that verify the happy path still runs after changes. Nothing more. The goal is to move fast and learn, not to build a bulletproof codebase that nobody will ever use.

Stage two: the MVP that has paying users

An MVP is different. Real people are giving you real money or real attention, and if it breaks, they leave. But an MVP is also still changing fast. You're probably rewriting chunks of it every two weeks based on what users actually do versus what you assumed they'd do.

This is the stage where most teams get testing wrong in both directions. Either they under-test because "we're still pivoting," and they ship bugs that erode the trust of their first users. Or they over-test, building an elaborate test pyramid for features that get deleted the following sprint.

The right move at MVP stage is selective: heavy testing on the parts that touch money, auth, or data integrity, and light testing on everything else. If your app takes payments, your payment flow gets unit tests, integration tests, and E2E tests. If your app has login, your auth gets the same. Everything else gets smoke tests and exploratory coverage.

I'll give you the under-tested story I promised. A client MVP had a subscription upgrade flow that worked fine in our test environment. We covered the happy path, the cancellation path, and the "invalid card" path. We didn't cover the "user upgrades from a grandfathered legacy plan that had a different billing cycle" path, because nobody mentioned that legacy plans existed. That bug shipped to production. Three customers were double-charged. We refunded them, apologized, and added the missing test. But the trust hit with those customers was real. One of them churned a month later.

That bug wouldn't have been caught by writing more tests. It would have been caught by asking the question "what user states exist that we don't have in our test data?" before we started writing tests. This is the part that tools don't solve.

Stage three: the growth phase where everything breaks

Your product works. Users are signing up. Revenue is climbing. And everything is on fire.

Growth-stage products break in ways that earlier-stage products don't, because scale exposes assumptions that were invisible when you had 50 users. A query that ran in 40 milliseconds at 50 users takes 8 seconds at 50,000. A cache that had 99% hit rate fills up and starts evicting hot data. An API rate limit that was fine for a while suddenly isn't.

This is where you actually need load testing, performance testing, and observability. Not because the gurus say so. Because real things are breaking at 2am and your team is burning out fixing them.

This is also where automated regression testing starts to pay off, because the codebase is big enough that humans can't hold all of it in their heads anymore. You need a computer to tell you that the password reset flow still works, because no human is going to manually test the password reset flow every sprint.

The honest lesson from this stage: build the automated regression suite a little earlier than feels comfortable, but not as early as the textbooks say. If you do it in the MVP stage, you'll throw half of it away. If you do it after you've already scaled, you're playing catch-up while the fires get worse. The sweet spot is when you have enough users that manual testing has become slow, but not so many that you're firefighting.

Stage four: the mature product nobody wants to break

A mature product has thousands or millions of users and an existing reputation. At this stage, the calculus of testing flips entirely. The cost of shipping a bug is massive. The cost of shipping a feature a week late is comparatively small.

This is the stage where you can justify everything the textbooks tell you to do. Unit tests for core logic. Integration tests for service boundaries. E2E tests for critical user journeys. Performance tests before every major release. Security testing. Accessibility audits. Regression suites that run on every commit.

But here's the thing nobody mentions: mature products also accumulate test debt. Tests that were written for features that got deprecated. Flaky tests that nobody has time to fix. Fixtures that reference test users who were deleted from the database. Test environments that have drifted from production.

We audited a mature product for a client last year. They had 3,400 automated tests. About 800 of them were testing features that no longer existed in the product. Another 400 were so flaky that the team had started ignoring them. The effective test coverage was maybe half of what the numbers said. Pruning tests is as important as writing them, and nobody budgets time for it.

The question we should have been asking all along

Looking back at eight years of running QA across dozens of products, the pattern is clear. The biggest testing mistakes we made weren't about choosing the wrong tool or writing the wrong assertion. They were about not asking "what stage is this product in?" before we wrote our first test.

A prototype needs speed and learning. An MVP needs protection of critical paths and permission to break the rest. A growth product needs scale testing and automation catching up to the size of the codebase. A mature product needs discipline and ongoing maintenance of the test suite itself.

If you mix these up, you either waste weeks on tests that get thrown away, or you ship bugs that cost you users. Both are expensive. Both are avoidable if you start with the stage question instead of the tool question.

This is one of the reasons we lean on the idea of independent testing at BetterQA. A team too close to the code usually defaults to the habits of the last stage they worked in. The chef doesn't certify his own dish, and the developer who built the prototype is usually the wrong person to decide whether it needs E2E coverage. Someone from outside can look at the product honestly and ask where it actually is, not where the team wishes it was.

What I'd tell that founder now

If I could go back to the call with the founder who lost three weeks of test work over the weekend, here's what I'd say.

Before you hire anyone to test your product, answer one question out loud: is this code going to exist in three months in roughly the same shape it's in today? If the answer is no, you don't need automated tests. You need an experienced human to bang on it for a few hours and tell you what's broken. If the answer is yes, then you can start thinking about what categories of test to invest in, and in what order.

Everything else is just choosing between tools, and tools are the easy part.

The production disasters we've watched happen, and the habit that would have prevented all of them

Tudor Brad — Thu, 09 Apr 2026 20:01:41 +0000

The Tuesday the database got smaller

A client of ours ran a loyalty program with about 120,000 members. On a Tuesday afternoon their agency pushed a "cleanup migration" to production. The intent was to merge duplicate accounts where the same email had signed up twice with different casing. The script ran, the dashboard was snappier than usual, and someone in the client's marketing team noticed by Wednesday morning that roughly 40,000 members had vanished from the list.

The migration had matched on normalized email, yes, but it had also silently deleted the "loser" row instead of merging the points balances first. There was no soft delete. There was no dry run log. The backup was 19 hours old, which meant a full day of new signups and point redemptions was gone by the time anyone restored it.

The agency's post-mortem used the word "oversight" four times. The real word is "untested". Nobody had run the script against a production-shaped dataset before Tuesday. The staging DB had 300 rows in it. I know because I asked.

The other one was a checkout

Different client, different year. An e-commerce build. Clean code, good developers, the sort of team you would hire again. They shipped a payment integration update to handle a new 3D Secure flow. It worked on desktop Chrome. It worked on Android. It worked in their test Stripe account.

What it did on mobile Safari was charge the card, fail to register the webhook response, show the user a generic error, and then charge again when the user tapped the retry button. For about six hours on a Saturday, roughly one in four mobile checkouts double-billed. Support found out from Twitter. The refund process took two weeks because the agency had to reconcile it manually against Stripe exports.

I want to be fair to the developers here. The bug was subtle. Safari's handling of redirect-based payment flows has edge cases that will age you a year per hour you spend debugging them. But the reason it reached production was not the subtlety. It was that the team's QA process was one developer clicking through on one phone before the PR merged. There was no mobile test matrix. There was no webhook replay in staging. There was no synthetic buyer running the full flow every 10 minutes after deploy.

The part where I admit we missed one ourselves

BetterQA runs its own internal tools and we eat our own cooking, mostly. But in 2024 we shipped a change to our timesheet platform that quietly broke PTO accrual for contractors who had been hired mid-month. The math was off by a fraction of a day. It was the kind of thing you would only notice if you looked at your own balance closely and thought "that seems low".

One contractor did notice, a month later, and flagged it politely in Slack. When we dug in, we found the test coverage for the accrual logic had a gap exactly where new-hire proration lived. We had tested the happy path, the leaver path, and the anniversary path. We had not tested the "joined halfway through April" path. A QA team. Missing a date edge case. On our own product.

I bring this up because the usual posture of a QA company writing a blog post is to sound like the wise adult in the room. We are not. We are the people who got burned enough times to build checklists, and we still occasionally miss things. The checklists are how we miss them less often than we used to.

What these stories have in common

None of the three disasters I just described were caused by bad engineers. They were caused by the same structural problem, which is that the team doing the work was also the team signing off on the work. Our founder has a line he uses in sales conversations, and it sounds glib until you have lived through one of these: the chef should not certify his own dish.

When a developer tests the feature they just wrote, they test the paths they imagined while writing it. They do not test the paths they did not imagine, because by definition they did not imagine them. An independent tester, even one who is less skilled technically, will wander into those unimagined paths because they are not carrying the mental model the developer built. This is why the migration script passed review, and why the mobile Safari checkout looked fine in the PR, and why the accrual bug sat in our own codebase for weeks.

Independent QA is not about hiring a second tier of people to do clicks. It is about introducing a viewpoint that is not contaminated by the assumptions of the build.

What I would actually tell a client who asked me how to avoid this

Stop asking agencies "do you do QA". Every agency says yes. Start asking: who on your team will physically run my feature on a device they did not use to build it, before it gets to my users. If the answer is "the same developer who wrote it", you already know what kind of outage you are buying.

Ask to see the last bug report they wrote for a previous client. Not a metric. An actual bug report, with steps and severity and what they did about it. If they cannot show you one because "we don't really write those", they are not doing QA, they are doing vibes.

Ask what happens after launch. Not "do you offer support", every agency offers support. Ask what the response time is for a production incident at 9pm on a Friday, and ask who specifically takes that call. The answer should be a name, not a policy.

Ask them about something they got wrong on a past project and how they caught it. If they cannot think of one, they either have a very bad memory or they are not looking closely enough at their own work. Both are disqualifying.

A short note on feature creep, because it keeps showing up in disasters

I did not want to turn this into a listicle but I have to mention one more thing, because it is the quiet cause of maybe half the production incidents I have seen in the last few years. Scope.

Every feature you ship is a surface area you have to test, monitor, and eventually rewrite. When an agency pitches you a fancy real-time dashboard and you do not actually need real-time, you are not getting a free gift. You are agreeing to debug a WebSocket connection at 2am six months from now when it starts dropping on AWS's us-east-1 for reasons nobody can reproduce. The cheapest feature is the one you did not build.

Most of the "digital transformation" budgets I see waste about 30% of their money on things that looked exciting in the pitch deck and got tested by nobody before launch. The honest agencies will talk you out of half of it. The dishonest ones will let you pay for all of it and then blame the complexity when it breaks.

The uncomfortable summary

If I had to give you one habit that would have prevented the loyalty migration disaster, the Safari double-charge, and our own accrual bug, it is this: before any change reaches users, a person who did not write the change has to try to break it on production-realistic data, on the real devices users use, with the real integrations in place. Not a dev clicking through staging. Not an automated test of the happy path. A human being, with fresh eyes, who gets paid to think nastily.

You can hire this role in-house. You can bring in a partner. We happen to think the partner model works better for most teams, which is why BetterQA exists, but I would rather you have bad independent QA than no independent QA. The point is not the logo on the invoice. The point is that no chef certifies their own dish.

If you are about to ship a migration, a payment change, or anything that touches money, authentication, or customer data, ask yourself one question before you merge the PR. Who is going to break this on purpose before my customers break it by accident? If the answer is nobody, you are the nobody, and the Tuesday is coming.

Why your test suite becomes unmaintainable in year two

Tudor Brad — Thu, 09 Apr 2026 19:56:17 +0000

Last quarter we onboarded a client whose lead developer sent us their automation repo with a warning in the README: "tests are flaky, just rerun them if they fail." We opened the project. There was a single Page Object file called MainPage.js that was 3,147 lines long. Every selector in the suite was an XPath that started with //div[1]/div[2]/div[3]/span[4]. The CI pipeline had a retry policy of five attempts per test because the suite was so unstable that a single pass was statistically unlikely.

The previous QA contractor had been paid for 18 months to build this. It was worse than having no tests at all, because the team trusted it enough to not check things manually, and not enough to believe a red build.

This is not a rare story. We run QA for clients across 24 countries out of our Cluj-Napoca office, and we inherit test suites like this constantly. The patterns that break them are almost always the same. Here is what we find, what we tried that made it worse, and what we wish the original authors had done differently.

The 3000-line Page Object is always the first sign

The Page Object Model is a good idea. The problem is that most teams learn about POM from a tutorial that shows one login page with three elements, and they scale that pattern by adding more lines to the same file.

What ends up in the repo is a class named MainPage or AppPage that contains locators for the header, the sidebar, the dashboard, the settings modal, the user profile, the checkout flow, and the help widget. When the checkout button moves, someone searches the 3000-line file, finds four candidate selectors that all look plausible, updates the wrong one, and the tests still fail.

The fix is not "write smaller Page Objects." The fix is that a Page Object should represent a URL or a self-contained component, not an application. If your product has a dashboard, a settings page, and a checkout flow, that is three Page Objects minimum. If the dashboard has a sidebar that appears on every authenticated page, the sidebar is its own component object that gets composed into the pages that use it.

We refactored that 3147-line file into 23 smaller ones over two sprints. Same test coverage. The bugs we were paid to find started showing up in test failures instead of getting lost in retry noise.

Selectors are the thing that kills you

Almost every inherited test suite we have seen uses selectors that will break the next time a developer touches the HTML. CSS selectors like div.container > div:nth-child(2) > button. XPaths that walk the entire DOM tree. Class names copied from Tailwind or Bootstrap that change whenever the design team updates a color.

The root cause is that developers were not involved in writing the tests, so they never added stable hooks for the QA team. The QA team then used whatever they could find in the DevTools inspector, which was whatever the framework emitted.

This is a conversation, not a technical fix. When we start on a new project, the first thing we ask is whether we can add data-testid attributes to the application. Most developers say yes. It costs them nothing. A five-minute PR that adds data-testid="checkout-submit" to one button is worth more than three hours of XPath archaeology when that button moves.

If the developers refuse, or if we inherit a codebase where that conversation cannot happen, the next best option is text-based selectors. Playwright's getByRole and getByText work on semantic HTML that tends to change less often than class names. They are not bulletproof. But they are more resilient than div.MuiButton-root-284.

We built Flows, our self-healing test recorder, specifically because selector maintenance eats so much QA time on long-running projects. Flows records tests by watching what you click, and when a selector breaks later, it tries alternative ways to find the same element before giving up. It is not magic, and it cannot save a test that points to a button that got deleted, but it buys you a lot of time on projects where the UI is still moving.

The tests that failed randomly on Tuesdays

One client had a suite that failed every Tuesday morning at about 11:15 Romanian time. Not every test. Just a handful. The pattern was consistent enough that the team had started ignoring red builds on Tuesdays.

It took us two weeks to figure out what was happening. The tests hit a staging environment that shared a database with a scheduled data refresh job. The job ran every Tuesday around 11:00 UTC (13:00 Romania in winter, 12:00 in summer), and for about 20 minutes while it was running, certain records the tests relied on were in an inconsistent state.

The tests were not flaky. The environment was. But because nobody had written the tests to verify the preconditions they depended on, the failures looked random, and "just rerun them" was the accepted workaround.

This is the category of pain the Test Automation Pyramid is supposed to prevent. If most of your coverage is at the UI level, every test is coupled to the full stack including the database, the network, the third-party services, and the scheduled jobs. Moving coverage down to unit and integration tests does not mean "write fewer UI tests because they are slow." It means the UI tests should only cover what can only be tested at the UI level, because every additional UI test is another thing that can fail for reasons that have nothing to do with your code.

We did not fix this client's suite by adding more retries. We fixed it by deleting about 40 percent of the UI tests, moving their logic down to API-level tests that did not touch the UI at all, and adding setup code to the remaining UI tests that verified the database state before running.

The refactor that made things worse

A few years ago we tried to help a client migrate their Selenium suite to Cypress because Cypress would "fix the flakiness." We were wrong, and we made things worse.

The flakiness was not Selenium's fault. It was the same problem as the Tuesday failures: the tests were coupled to unstable state. Moving them to Cypress did not fix the state, it just rewrote the same bad tests in a new framework. What it did do was burn three months of budget, delete the one thing the client had that was working (the existing test runs were unreliable but not zero), and leave them with a half-migrated suite that nobody wanted to finish.

The lesson we took from that project is that test framework migrations are almost never the answer. If tests are flaky, the flakiness is usually in how the tests are written or what they depend on, not the framework. We now refuse to migrate a suite to a new framework unless the client has a specific reason that the old framework cannot solve, and even then we migrate one module at a time and keep both suites running until the new one has proven itself.

Data-driven testing sounds great until you do it wrong

Separating test data from test scripts is good advice. What the advice does not tell you is that data-driven tests can amplify your problems instead of reducing them.

We had a client with a test that read a 400-row CSV of user accounts and ran the same login flow for each one. When the login page changed, one test file broke 400 times, and the test report was unreadable. The team disabled the entire test rather than fix it.

The right move is to be honest about what data-driven tests are for. They are good when you need to verify that a workflow handles different categories of input correctly: a valid email, an invalid email, an email at the maximum length, a unicode email. That is maybe 5 to 15 rows, chosen deliberately. It is not good for running the same test against every row in a database dump "because we have the data."

If you find yourself writing a data-driven test with more than 20 rows, ask whether you are testing behavior or just making the test report longer. Usually you are making the report longer.

What "continuous improvement" actually looks like

The original version of this article said to "review and update your test scripts regularly." That is true but useless advice. Every team already knows they should do this. Nobody does, because there is no budget for it and no way to measure it.

What we do on client projects is different. We track two numbers per week:

Flake rate: percentage of test runs that failed on the first attempt but passed on a rerun without any code change. If this number is above 2 percent, we stop writing new tests and fix existing ones until it comes down.
Selector churn: how many tests had their selectors updated in the last sprint. A high number means the UI is changing faster than the tests can keep up, which is a signal to talk to the developers about data-testid or to invest in tools like Flows that can absorb the change automatically.

Neither number is in any testing tutorial we have seen. They are what we learned to watch after the fifth inherited broken suite.

The honest limitation

None of this guarantees a maintainable suite. If the product pivots every six weeks, or if the developers refuse to add test hooks, or if the deadline culture rewards "ship now, test later," your suite will become unmaintainable no matter how carefully you write it. We have walked away from projects where the problem was upstream of anything QA could fix.

The best predictor we have found for whether a test suite will stay maintainable is whether the developers see the test suite as their problem too. When developers add data-testid attributes without being asked, fix tests they broke in their own PRs, and treat red CI as a block on merging, the suite stays healthy. When QA is a wall that builds throw things over, no pattern in this article will save you.

BetterQA runs independent QA for companies that want someone outside their dev team verifying the software before users see it. We built Flows because selector maintenance was eating too much of our time, and we keep it as part of the toolkit we bring to client projects. If you are inheriting a test suite that feels like the one in the story at the top of this article, we have probably seen worse, and we can help you untangle it. betterqa.co

The Angular E2E testing setup we actually ship in 2026

Tudor Brad — Thu, 09 Apr 2026 19:56:12 +0000

The Angular E2E testing setup we actually ship in 2026

I have a folder on my laptop called graveyard. It contains the last three Protractor test suites I wrote before Angular 15 shipped, each one a little shrine to confidence I no longer have. When the Angular team announced Protractor deprecation at the end of 2022, a lot of us hoped there'd be an official replacement coming in a release or two. There wasn't. There still isn't. Angular 17 came and went, standalone components took over, and the answer to "what do we use for E2E?" quietly became "figure it out."

So this is what we figured out at BetterQA after running Angular E2E for roughly 40 client projects since the Protractor funeral. Some of it is boring best practice. Some of it is warnings. One section near the end is me complaining about a Playwright bug that still hasn't been fixed.

What we tried first (and why most of it failed)

When Protractor was declared end-of-life, the migration suggestions from the Angular team were Cypress, WebDriverIO, Nightwatch, TestCafe, and Playwright. We tried four of those on real client codebases in 2023. Here's the honest scorecard.

Cypress worked great until we hit cross-origin auth flows (common in fintech, which is half our portfolio). We were on Cypress 12 at the time, before the real cy.origin() stabilization, and we burned two weeks on a workaround for a Keycloak redirect that turned out to be unfixable without rearchitecting the test suite. We shipped it, but I still feel bad about it.

Nightwatch had a beautiful config file and absolutely nothing else going for it in Angular-land. The waiting story was worse than Playwright's and the community examples all assumed jQuery selectors. We killed it after a week.

WebDriverIO was fine. Really, fine. Not exciting, not broken. Good Protractor migration tooling. We still use it on one legacy project where the client is locked into Selenium Grid infrastructure.

Playwright won because it was the only one that didn't lie about flakiness. Specifically: when a test failed, the trace viewer showed me exactly why, and the answer was almost never "Playwright got confused." That alone made it worth switching.

The version that actually works

If you're setting this up today (April 2026), this is the combination I've tested against:

Angular 19.2 (works fine on 17.3 and 18.1 too)
Playwright 1.51
Node 20.12 LTS (do not use Node 22 yet, we hit two issues with the webServer block, I'll get to those)
@playwright/test (not playwright directly, this matters more than people think)

Starting from an existing Angular CLI project:

cd your-angular-project
npm init playwright@latest

When the wizard asks where to put tests, say e2e. When it asks about GitHub Actions, say no, you'll write your own, the generated workflow is fine for a blog post but it will not survive contact with a real monorepo.

The webServer gotcha nobody mentions

Here is the config we actually ship, and then I'll tell you about the part that cost me four hours last month:

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './e2e',
  fullyParallel: true,
  forbidOnly: !!process.env.CI,
  retries: process.env.CI ? 2 : 0,
  workers: process.env.CI ? 2 : undefined,
  reporter: process.env.CI ? [['html'], ['github']] : 'html',
  use: {
    baseURL: 'http://localhost:4200',
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
  webServer: {
    command: 'npm run start -- --configuration=test',
    url: 'http://localhost:4200',
    reuseExistingServer: !process.env.CI,
    timeout: 180_000,
  },
});

That timeout: 180_000 is there because Angular's first compile on a cold CI cache regularly takes 90 seconds or more on a project of any real size. Playwright defaults to 60 seconds and then kills the server. We lost a whole afternoon to flaky CI runs before I realized Playwright was timing out the dev server, not the test.

The other thing, and this is the one I yell about: the webServer block uses a child process. If your Angular build fails (say, a TypeScript error in a component), Playwright does not surface the compiler error. It just sits there until timeout expires and says "webServer did not start." You have to go look at the server stdout yourself, and even then the output is buffered weirdly. We now have a pre-flight in our CI that runs ng build --configuration=test before Playwright runs, purely so we get a readable error when the build breaks.

Writing tests that don't die on refactor day

Playwright's locator API is genuinely better than Protractor's. That said, everyone writes their first tests wrong. Including me. My first Angular Playwright test looked like this:

await page.locator('.btn.btn-primary.submit').click();

Two weeks later a designer renamed the class and the test broke. This is the lesson that never sinks in until it bites you.

What we do now, and enforce in code review, is this:

// e2e/contact.spec.ts
import { test, expect } from '@playwright/test';

test('contact form submits successfully', async ({ page }) => {
  await page.goto('/contact');

  await page.getByTestId('name-input').fill('Test User');
  await page.getByTestId('email-input').fill('test@example.com');
  await page.getByTestId('submit-button').click();

  await expect(page.getByTestId('success-message')).toBeVisible({ timeout: 10_000 });
});

Every interactive element in an Angular template gets a data-testid. Non-negotiable. We add an ESLint rule via eslint-plugin-angular-template to warn when a (click) binding has no test id. It is annoying. It is also the reason our tests survive the quarterly design refreshes.

One caveat: for route-based assertions, use Playwright's URL matchers, not the text on the page. Angular's router can transition visually before the URL updates depending on your navigation strategy, and we got burned by a race condition where the title changed before the URL did.

await page.getByRole('link', { name: 'Pricing' }).click();
await expect(page).toHaveURL(/\/pricing/);  // do this
// not: await expect(page.locator('h1')).toContainText('Pricing');  // flaky

Network mocking replaces $httpBackend

If you're coming from Protractor, you probably used $httpBackend or ngMocks for HTTP interception. Playwright's page.route() is a drop-in replacement, and honestly it's nicer because it operates at the browser level, which means you catch real fetch calls, not just ones that went through Angular's HttpClient.

test('handles 500 from users API', async ({ page }) => {
  await page.route('**/api/users', (route) =>
    route.fulfill({
      status: 500,
      contentType: 'application/json',
      body: JSON.stringify({ error: 'internal' }),
    })
  );

  await page.goto('/users');
  await expect(page.getByTestId('error-banner')).toContainText('Something went wrong');
});

The one thing page.route() cannot do gracefully is mock streaming responses. We have a client with a server-sent-events endpoint for real-time data, and mocking that required dropping down to a WebSocket-like shim that Playwright doesn't officially support. It works, but it's ugly, and any Playwright release could break it. If you're testing SSE in Angular, budget time for pain.

The part I still hate: standalone component tests

Angular 17+ pushes standalone components hard, and Playwright has a "component testing" mode that's supposed to let you test them without bootstrapping the full app. On paper this is great. In practice, in our experience, the Angular adapter for Playwright Component Testing is less mature than the React and Vue ones. We've had flaky tests where the component's change detection doesn't trigger on the first render, and you have to manually fixture.detectChanges() in a way that feels very 2018.

For now, we skip Playwright CT for Angular and use the regular full-app E2E approach, accepting the slower startup. When a client asks "why don't we use component testing?", I tell them honestly: it's not quite ready yet, and the debugging experience when it fails is worse than the thing it's supposed to replace. Maybe by Angular 21. Ask me again next year.

The GitHub Actions workflow we actually use

This is close to what we deploy on client projects. It's not the fanciest version, but it's the one that has survived real CI conditions for two years.

name: E2E

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  e2e:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        shard: [1/3, 2/3, 3/3]
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20.12'
          cache: 'npm'

      - run: npm ci

      - name: Build Angular (fail fast on compile errors)
        run: npx ng build --configuration=test

      - name: Install Playwright browsers
        run: npx playwright install --with-deps chromium

      - name: Run E2E
        run: npx playwright test --shard=${{ matrix.shard }}

      - name: Upload trace on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-trace-${{ strategy.job-index }}
          path: test-results/
          retention-days: 7

A few things worth calling out:

The separate ng build step is the pre-flight I mentioned earlier. If the Angular code doesn't compile, we fail in 30 seconds with a readable error, not in three minutes with "webServer did not start."

We only install chromium in CI, not all three browsers. This saves about two minutes per run and we run Firefox and WebKit nightly on a separate workflow. Running all three on every PR is a luxury tax most projects can't afford.

fail-fast: false on the matrix is deliberate. If shard 1 fails, we still want to see what happened in shards 2 and 3. Otherwise you fix one flake, re-run, find another, fix that, re-run, find a third. Been there.

When you just don't have time to write all this

I've spent this whole article telling you how to hand-write Playwright setup for Angular, but I'll be honest about something: we don't always hand-write these tests on client projects. When a client comes to us with an Angular app and 0% E2E coverage and a deadline, we use Flows, which is a Chrome extension we built specifically because writing locators manually was the most soul-crushing part of the job.

The pitch is simple: you browse the app, click the things a user would click, and Flows generates the Playwright test. It's not magic, the generated code still needs review, and we've definitely shipped tests where Flows picked a brittle selector because the developer forgot the data-testid on a button. But as a starting point it turns a two-hour test-writing session into a ten-minute recording plus a code review. On a project where you need coverage yesterday, that's the difference between having tests and not.

We also use BugBoard to track the failures these tests catch, because once your E2E suite is actually running on every PR, you will find a surprising number of real bugs that were hiding behind "works on my machine" and spreadsheet-based bug tracking stops scaling around 40 open issues.

The setup we'd recommend if you're starting today

If you're reading this because you're about to set up E2E on a new Angular project, here's the short version of what I'd do:

Start with Playwright 1.51+, Node 20, the config block from the top of this article, and data-testid attributes on every interactive element from day one. Write five tests: login, logout, one happy-path transaction, one error state, one navigation flow. Get those running in CI before you write test number six. Don't try to get to 80% coverage in week one, that's how you end up with a flaky suite everyone ignores.

Then, when you need to scale, either commit to the maintenance cost of hand-written tests or use a recorder for the bulk of coverage and keep hand-written tests for the tricky edge cases. Both approaches work. What doesn't work is pretending a neglected test suite will fix itself.

If you want to read more of what we've learned running QA for 50+ engineers across roughly 200 client projects, we post it at BetterQA and on the BetterQA blog. Come say hi if Angular E2E is the kind of thing that keeps you up at night. You're in good company.

About the author

Tudor B. is founder at BetterQA. He started the company in 2018 after being hired for one healthcare project that had so many bugs the client needed him to scale from one to eight people. That became BetterQA. Today the team is 50+ engineers across 24+ countries, with NATO NCIA approvals and ISO 9001 certification. The philosophy is simple: the chef should not certify his own dish.

The three AI tools we tried for QA and the one we kept

Tudor Brad — Thu, 09 Apr 2026 19:50:39 +0000

The first time I watched an engineer try to use ML to generate test cases for a client project, she had been at it for about six hours. She was using Diffblue Cover on a Java monolith, which had been pitched to her as an automatic unit test generator. It had produced something like 2,400 tests. Every single one of them passed. None of them would have caught the bug we were actually chasing, which was a session expiry race condition in the checkout flow.

I still remember her sitting there staring at the coverage report, which was showing 87 percent, and asking me whether anyone had ever actually shipped software on the back of those numbers. The tests weren't wrong, exactly. They just verified that the code did what the code did. A method that multiplied two numbers had a test confirming it multiplied two numbers. A controller that returned a 200 had a test confirming it returned a 200. The checkout race condition lived in the gap between two services, and no amount of per-method unit tests were going to find it.

That was maybe two years ago. Since then we've tried three different AI-flavoured tools for QA work at BetterQA, and I want to be honest about which ones wasted our time and which one earned a place in our workflow. Because the marketing around "AI in QA" has gotten so loud that you can't find an honest post about any of it, and I'm tired of reading whitepapers that explain the benefits without admitting the failures.

The visual regression disaster

The second tool was an AI-powered visual regression service. I'm not going to name it, because the team behind it is probably working hard and the category has moved on, but the pitch was beautiful. You give it a URL, it takes baseline screenshots, then on every deploy it compares the new render against the baseline and tells you what actually changed visually, not just pixel-by-pixel.

The promise was: no more flaky screenshot diffs. The AI understands when an animated element is just in a different frame. It knows the difference between "the header font changed" and "the ad network served a different creative."

We ran it against a client e-commerce site for two weeks. On one deploy it flagged 400 differences. Four hundred. Most of them were product carousel images that had rotated since the baseline was captured. Some were the cookie banner appearing in a slightly different state. A handful were genuine CSS regressions that we did need to know about, and those were buried under the noise like needles in a haystack made of other needles.

We tuned it. We set thresholds. We excluded regions. We trained it on what to ignore. After two weeks our engineer Mihai said the exact phrase "I would rather write Cypress assertions by hand than tune this thing for another day," and we killed the pilot. The problem wasn't that the AI was bad at image comparison. It was genuinely impressive at that. The problem was that "visual regression" is not actually a question an AI can answer without understanding intent, and intent is exactly what these tools don't have.

The bug triage bot that tried to be helpful

The third experiment was an LLM-based bug triage assistant. We set it up to read incoming bug reports from one client's Jira, classify them by severity, route them to the right team, and draft a first-response message to the reporter. This one wasn't a total failure. The classification was decent. The routing worked most of the time. But the drafted responses had this confident, slightly off quality where they would reference things the reporter hadn't said, or reassure them about a fix timeline that hadn't been agreed. One of our clients got back a response that promised a hotfix within 24 hours for a bug that was clearly a feature request.

We pulled the drafting feature. Kept the classification. It's still running, and it saves our triage lead maybe 20 minutes a day. Useful, but not the 10x productivity boost the vendor's landing page promised.

The one that earned its keep

Here's the one that worked. We built it ourselves. Inside BugBoard, our test management platform, we use the Anthropic API (specifically Claude) to help generate test cases from user stories and requirements documents. You paste in a feature description and it produces a draft set of test cases covering happy paths, edge cases, negative scenarios, and some permission-matrix stuff that humans tend to forget.

The honest version of this story: it's not magic. The draft is rarely shippable as-is. Our engineers review every single generated case, rewrite about 40 percent of them, delete maybe 15 percent as nonsense, and keep the rest. The cases that get kept are usually the boring ones that a tired QA engineer would have missed on a Friday afternoon. The nonsense cases are the ones where Claude hallucinates a feature that doesn't exist because it pattern-matched against something in the requirement doc that looked familiar.

But even with that 40 percent rewrite rate, it's saved us hours per project. A senior QA engineer can produce a first draft of a test plan in 30 minutes instead of a day. The 30 minutes still includes review time. The day used to be mostly typing.

The key thing is that we never let it skip the human. There is no auto-publish. There is no "generate test plan and run it unattended." Every draft goes through a human who understands the product, the client, and what the test is actually supposed to prove. If we stripped out that review step, we'd be shipping the same tautological nonsense that Diffblue produced for us in the first story.

What I actually think is happening

After doing this for long enough, I've stopped believing the framing of "AI for QA." I think AI is good at certain subtasks that QA engineers do, and it's bad at the thing that actually makes QA valuable, which is adversarial creativity. A good tester looks at a form and thinks, what if I paste 10,000 characters in here. What if I open two tabs and submit from both. What if my connection drops halfway through. What if the admin user and the regular user both hit this endpoint at the same time. None of that is pattern matching. It's imagination, and specifically the pessimistic kind of imagination that comes from having been burned by production incidents.

Tudor, who founded BetterQA back in 2018 in Cluj-Napoca, has this line he uses in almost every talk: "AI will replace development before it replaces QA." When I first heard it I thought it was a sales line. After the Diffblue experiment, and the visual regression disaster, and the triage bot that invented hotfix timelines, I think it's just true. Development is about turning intent into code, which LLMs are increasingly good at. QA is about figuring out what the intent missed, what the developer didn't think of, what the product manager didn't write down, what the end user is going to do that nobody predicted. That's a very different job, and current AI is not close to doing it.

Also: the chef should not certify his own dish. An AI trained on your codebase is just a very fast chef tasting its own cooking. It will tell you the food is fine. It will be confidently wrong. You need someone standing outside the kitchen who doesn't care about the deadline and isn't related to the sous-chef.

ML systems are not traditional software, and that matters

One thing the original version of this post got right, and I want to keep, is that testing ML models is a fundamentally different job from testing deterministic software. If you're shipping a model that makes recommendations, classifies images, or scores loan applications, your QA process has to deal with things traditional testing doesn't touch.

Data quality is the first one. I've seen teams train models on data that had a bias baked in from the sampling process nobody documented, and then wonder why the model kept producing weird outputs for one demographic group. The model was working correctly. It was learning exactly what it had been shown. The problem was upstream.

Model drift is the second one. A model that's 94 percent accurate on Monday can be 81 percent accurate on Thursday if the real-world inputs have shifted. You can't catch this with a traditional regression test suite because there isn't a fixed expected output. You need production monitoring that flags when the distribution of predictions starts to look different, and you need someone qualified to look at that and decide whether it's a retraining problem or a data source problem.

And then there's the fairness and compliance layer, which is where things get interesting if you're shipping into regulated industries. A hiring model that disadvantages a protected class isn't just technically incorrect, it's illegal in most jurisdictions. GDPR's right to explanation means you need to be able to tell a user why the model made the decision it made, which is a fun constraint to impose on a neural network. Auditing this stuff requires QA engineers who understand the legal frame, not just the code.

What we actually do for clients shipping ML features

When a client asks us to QA an ML product, we don't run a test suite and sign off. We do the boring deterministic testing around it, because the rest of the product still matters. Then we build a validation harness that checks the model against held-out data, measures precision and recall on the specific subgroups the client cares about, and flags drift from the baseline. Then we do adversarial testing, which means trying to get the model to produce bad outputs on purpose. If it's a content classifier, we try to sneak things past it. If it's a chatbot, we try to get it to say things it shouldn't. If it's a recommendation engine, we look for echo chambers.

The last piece, and this is the one I care about most after the last year of AI adoption, is prompt injection testing for anything LLM-powered. If the product includes a model that can be talked to, we try to talk it into leaking things it shouldn't leak. This is a new category of testing that basically didn't exist three years ago, and I think it's going to be a huge part of QA for the next decade. The attack surface of "what can a user type into a text box that makes the AI do something it wasn't supposed to do" is enormous, and most teams shipping AI features haven't thought about it at all.

The short version

We tried Diffblue-style unit test generation. It produced tests that proved the code did what the code did, and missed the actual bug. We tried AI visual regression. It generated 400 false positives on a single deploy and we killed the pilot after two weeks. We tried LLM bug triage. We kept the classification part and pulled the drafting part after it promised a hotfix nobody had agreed to.

The one that worked is the one where AI drafts and humans decide. Inside BugBoard, Claude helps us generate test case drafts that still get reviewed by a human QA engineer who knows the product. It saves real hours. It does not replace the human. If we ever remove that review step, we will end up with 2,400 passing tests and a broken checkout flow, and we'll deserve it.

That's the thing I wish more people would say out loud. AI can make QA faster. It cannot make QA better on its own. The better part still comes from a person who's been burned enough times to know what to look for, and who is paid to care when everyone else is trying to ship.

Being the only QA in a 6-dev scrum team: honest notes

Tudor Brad — Thu, 09 Apr 2026 19:50:34 +0000

Let me tell you what agile scrum actually looks like when you're the only QA on a team of six developers.

The sprint planning deck will say "shift left" and "quality is everyone's responsibility". The retro board will have sticky notes about collaboration. And then somewhere around 4pm on a Friday, someone will drop a Slack message that says "hey, can you just run through it real quick before we deploy?"

That's the job. The rest is paperwork.

The blocker problem

In every scrum team I've joined as the single tester, there's a moment around sprint three where I stop being "the QA" and start being "the blocker". You can feel it shift. A dev finishes a ticket on Wednesday. It sits in your column until Friday because you're still testing the three tickets from Monday. Burndown chart flattens. Somebody mentions it in standup. Suddenly the conversation is about your throughput, not about the fact that five people are producing work faster than one person can verify it.

I used to try to explain the math. Six devs times two stories per sprint equals twelve things to test, plus regression, plus the bug fixes from last sprint that came back from staging. One person. Two weeks. You do it.

Nobody wants to hear the math. They want the green checkmark.

The PM who sneaks stories in

There's always one. Mine was called Radu. Lovely person, terrible for my sanity.

Radu had a habit of adding tickets to the sprint backlog on day four without acceptance criteria, without a design, and without telling anyone. You'd open Jira Monday morning and there would just be a new story sitting there called "small tweak to the checkout flow" assigned to the sprint. Small tweak. Sure.

When I asked what "done" looked like, the answer was usually "you'll know when you see it". When I asked what the edge cases were, the answer was "the client just wants it to work". When I filed a bug, the answer was "that's not what we agreed" even though nothing had been agreed because nothing was written down.

I got burned enough times that I started refusing to test anything without written acceptance criteria in the ticket. That was my little rebellion. It worked for about two sprints before Radu started writing acceptance criteria like "should work as expected" and I had to start another fight.

The sprint where everything fell over

Here's the one I still think about.

We were building a multi-tenant dashboard for a logistics client. Sprint 14. The team had committed to a big piece of work: role-based permissions across four user types. Admin, manager, driver, viewer. Each role had different screens, different buttons, different data.

I spent the planning session trying to talk the team into splitting it. Four roles is four times the test matrix. Four times the edge cases. Four times the "what happens when a manager tries to access an admin page" questions. The lead dev said it was fine, it was all using the same permission middleware, testing one role would basically cover all of them.

You can guess where this is going.

Day eight of the sprint, a dev finishes the first role and hands it over. I start testing. It's fine. Admin role works. I move to the manager role on day nine. It's also fine, mostly. I find two bugs, file them. Driver role lands day eleven. Broken in three places. Viewer role lands day twelve, which is the Friday before sprint end.

Viewer role was the one where I found that a viewer could hit the admin API directly if they knew the endpoint. No auth check on the backend route. Just the frontend hiding the button. A viewer could delete shipments. A viewer could change other users' passwords. A viewer could do pretty much anything an admin could do, as long as they opened devtools and typed the URL.

I filed it as a critical bug at 3:40pm on Friday. The lead dev told me it was out of scope for this sprint, that the frontend was hiding the buttons so it "wasn't really an issue", and that we should ship and fix it in sprint 15. The PM agreed. Deploy was scheduled for 5pm.

I made a scene. I don't make scenes often. I made this one.

I went to the CTO. I showed him the curl command. I made him run it on his own laptop against staging. He watched a viewer account delete a shipment. Deploy got cancelled. We spent sprint 15 properly implementing backend permission checks across all four roles, and the lead dev and I didn't speak for about a week.

Here's what fixed it long term: we started writing test cases during planning, not after. If a story went into the sprint, it went in with a checklist of what needed to be verified, written by me, reviewed by the dev, agreed by the PM. If you couldn't describe how to test it, you couldn't commit to it. Took me three sprints to enforce, and I had to threaten to quit once, but it stuck.

That was the real shift-left. Not a blog post. A fight.

What the ceremonies actually look like

Sprint planning is where you earn your paycheck. Not by estimating. By asking the dumb questions. "What does done look like?" "What happens when a user is offline?" "What's the behaviour when the API returns 500?" "Is there a timeout on this?" Half the value of a tester in planning is forcing the team to think about things they'd rather not think about. The devs hate it. The PM hates it. The product gets better.

Standups are mostly theatre, and that's fine. I use them to telegraph what I'm blocked on. Never in a passive aggressive way. Just plainly. "I'm testing the search filter today. If the backend contract changes, I'll need to restart. Please don't change the contract without telling me." The word "please" is doing a lot of work in that sentence.

Sprint review is where I quietly make sure the thing we're demoing is the thing I actually tested. You would not believe how often the demo branch is not the branch that was on staging when QA signed off. I learned to ask "which commit are we showing?" every single time. It makes the devs roll their eyes. I don't care.

Retros are where nothing changes, until suddenly something does. I never go in expecting wins. I go in with one observation, delivered calmly, backed by a specific example. "In sprint 14, we didn't split the permissions story and I found a critical bug 90 minutes before deploy. I think we should split stories that touch more than two user roles." That's how the test-case-during-planning rule got in. One retro. One example. No drama.

What I actually do all day

The job title says tester. The actual work is:

Writing test cases nobody asked for because nobody else will. Chasing acceptance criteria that don't exist yet. Running regression on the stuff that was "already tested last sprint" because something in the shared component library changed. Arguing about whether a bug is a bug or "by design". Automating the things that keep coming back. Reviewing PRs for testability even when nobody asked me to. Documenting what the product actually does, because the spec hasn't been updated in six months. Explaining to stakeholders why "it works on my machine" is not the same as "it works".

And on good days, exploratory testing. Which is where I find the weird stuff. The viewer-can-delete-shipments stuff. The "what happens if you paste 10,000 characters into this field" stuff. The "what if the user speaks Arabic and reads right to left" stuff. That's the bit I actually like. That's the bit that justifies the job.

The thing about being the only one

Being the single QA on a scrum team is lonely in a way people don't talk about. You're the one person whose job is to find the problems, on a team whose job is to not have problems. Every bug you file is a tiny friction. Every "can you verify this before merge" is you slowing someone down. You get used to being the person people are slightly annoyed with.

The trick, and it took me years to figure this out, is to remember that the friction isn't personal. The friction is the job. If nobody feels any friction, you're not doing the work. If everybody hates you, you're doing it wrong. The sweet spot is when the devs are mildly annoyed but they also ask you to look at their PRs before merge because they'd rather get your eyes on it than have a customer find it.

You get there by being fair. By never filing a bug in anger. By writing up what you found and what you tried and what you expected, every single time, even when you're tired. By closing bugs that turn out to be wrong, publicly, without ego. By learning the product so well that when you say "this feels wrong", people listen.

The outsourcing pitch I'm not going to make subtly

Here's the honest version. If you're the only QA on a team of six, you are going to burn out. The math doesn't work. You can optimise your process, you can push for automation, you can fight for test cases in planning, and it will still be too much work for one person. At some point the team either hires another tester, or they accept lower quality, or they bring in an independent QA partner who can scale up and down with the sprint load.

I work at BetterQA. We started in Cluj-Napoca in 2018 and now we're 50+ engineers across 24 countries. The reason we exist is because development teams shouldn't validate their own code. The chef doesn't certify his own dish. When you bring in an independent QA team, the PM can't lean on them to close bugs that "make the dev team look bad". The testers don't report to the dev manager. They report to quality. That's the whole pitch.

But even if you don't hire us, hire someone. Hire a second tester. Don't leave one person to do the verification work for a whole team. It's not fair, and it's not how you ship good software.

What I wish someone had told me on day one

Write everything down. If it's not in the ticket, it didn't happen.

Never agree to test something that doesn't have acceptance criteria. "Works as expected" is not acceptance criteria.

Find an ally on the dev team. One person who gets it. One person who will back you up in planning when you say the story is too big. You will need them.

Your job is not to prevent all bugs. Your job is to prevent the bugs that matter from reaching users.

"Can you just run through it real quick?" is never real quick. Schedule it. Put it in the ticket. Test it properly or don't test it at all.

And when you find the critical bug on a Friday afternoon, file it. Escalate it. Make the scene. The deploy can wait. The users can't.

This article was originally published on the BetterQA blog. BetterQA is an independent QA and software testing company based in Cluj-Napoca, founded in 2018, with 50+ engineers working across 24 countries.

Accessibility testing: what automated scanners keep missing

Tudor Brad — Thu, 09 Apr 2026 19:44:51 +0000

The audit that made us take this seriously

A few years back we took on a fintech client who had passed an automated accessibility scan. Green across the board in their internal tool. They were confident enough to quote WCAG 2.1 AA compliance on their marketing page.

Then one of our QA engineers ran through the signup flow with NVDA. The screen reader got stuck on a custom dropdown that had no aria-expanded, no keyboard handler, and no role. A blind user would hit that control and land in a dead end. The automated scanner had flagged nothing because the element was technically a <div> with an onClick handler and the scanner had no way to know what it was supposed to be.

That is the gap nobody talks about when they write accessibility guides. Automated tools catch roughly 30 to 40 percent of WCAG violations. The rest requires a human who understands how people with disabilities actually use software.

What automated scanners are good at

We run axe-core, Pa11y, and our own tool Auditi on every client project. They are fast, consistent, and catch the boring stuff at scale:

Color contrast failures (anything below 4.5:1 on normal text)
Missing alt attributes on images
Form inputs without associated labels
Duplicate IDs in the DOM
Missing language attributes on the <html> element
<button> elements with no accessible name
Heading hierarchy jumps (h2 to h4 with no h3 in between)

axe-core in particular is the workhorse. It is what powers most browser extensions, it runs inside Playwright for CI, and it rarely gives false positives. Pa11y is useful when you want a command line output for Jenkins or GitHub Actions. Auditi is what we built when we needed something that scans an entire sitemap and produces a shareable report for a client who does not want to read JSON.

Here is the catch. Every one of these tools is measuring things a computer can check. Alt text length. Contrast ratio calculations. Whether an input has a for attribute. None of them can tell you whether the alt text actually describes what matters in the image, or whether the reading order makes sense, or whether a screen reader user can figure out where they are after the page updates via JavaScript.

The stuff scanners miss (and real users hit)

Last year we audited a booking platform. The automated scan came back with 14 issues, mostly contrast. Our manual pass found 47 more. A sample of what only showed up with a human in the loop:

A modal that stole focus on open but did not return focus to the triggering button on close. A keyboard-only user would lose their place and have to tab from the top of the document again.
An "add to cart" button that visually changed to "added" but had no aria-live region. Screen reader users got no confirmation anything happened.
A date picker that was fine with a mouse, fine with a keyboard, and completely broken with VoiceOver on iOS. The announcement said "button button button" for every day cell.
A cookie banner that trapped focus inside itself but had no visible close button. Desktop users could tab out by accident. Mobile users with switch control could not.
Skip links that worked but jumped to a container with tabindex="-1" that then fired a focus style that looked like a blinking cursor in the middle of the page.
Error messages that appeared inline in red text. The color was fine. The problem was the error had no programmatic association with the input, so screen readers announced the field as valid.

Not one of those made it into an axe report.

A workflow that actually catches things

This is what we run on client projects now. It is not the only way, but it has survived about 200 audits.

Automated baseline. Run axe-core, Pa11y, and Auditi on every key page. Record the violation count. Fix the low hanging fruit first so the manual pass does not get drowned in contrast warnings.
Keyboard-only walkthrough. Unplug the mouse. Try to complete the main flows using Tab, Shift+Tab, Enter, Space, and arrow keys. Note every place focus disappears, jumps unexpectedly, or gets trapped.
Screen reader pass. NVDA on Windows, VoiceOver on Mac and iOS, TalkBack on Android. You do not need to be fluent. You need to notice when the reader says something confusing or says nothing at all.
Zoom to 200 and 400 percent. WCAG 1.4.10 covers reflow. A lot of sites break at 400 percent. Text gets cut off, sticky headers cover content, buttons move off screen.
Color blind simulation. Chrome DevTools has simulators built in. Check whether anything relies on color alone to convey information (red error text, green success state with no icon).
Real user testing. When the budget allows it. A 30 minute session with someone who uses a screen reader daily will teach you more than a week of our own testing.

The order matters. If you start with manual testing you waste time on issues the scanner would have caught in 10 seconds. If you stop at the scanner you ship a product that is technically compliant and practically unusable.

The limitation we ran into with our own tool

Auditi is good at what it does. Sitemap crawl, WCAG checks, a shareable report with severity grouping. But we hit a wall when clients asked us to test flows behind authentication. A scanner can check the login page. It cannot log in, navigate to the checkout, add an item, and test the checkout modal. Not without either shipping credentials to the cloud (which clients hate) or running the scan locally with cookies.

We ended up pairing Auditi with Playwright for the authenticated paths. Playwright logs in, navigates, and then injects axe-core into the page at each step. It is clunky. We have not found a clean solution that works for every client. This is the honest limitation of automated accessibility testing: if the critical user journey is behind a login, you need a hybrid setup, and that hybrid setup is still work.

Why this is our hill to die on

BetterQA has existed since 2018, we run 50+ engineers across 24 countries, and accessibility is one of the services clients most often ask us to retrofit after something goes wrong. Usually a legal letter. Sometimes a complaint from a user. Once, an ADA lawsuit in the US that cost the client more than our retainer for three years combined.

We believe accessibility testing is exactly the kind of work that should not be done by the team that built the feature. Not because they do not care, but because they cannot see what they built from outside. The chef should not certify his own dish. Devs test the happy path with a mouse on a 27 inch monitor. Real users show up on a cracked iPhone with VoiceOver, a tremor, and 30 seconds of patience.

If you want to see how we approach audits or how we built Auditi, you can start at betterqa.co. If you just want the tools: axe-core, Pa11y, NVDA, VoiceOver. Run them in that order. Fix what they find. Then find someone who actually uses assistive tech and watch them try to use your product.

That last step is the one that will change how you build.

The accessibility lawsuit that almost happened, and what we learned from the audit

Tudor Brad — Thu, 09 Apr 2026 19:44:46 +0000

In 2024, one of our clients got a demand letter from a law firm specializing in ADA compliance lawsuits. The letter cited specific pages on their e-commerce platform that were inaccessible to screen reader users. The client called us in a panic. They'd never done accessibility testing. Not once in four years of operation.

We ran a full WCAG 2.1 AA audit over three days. We found 67 distinct violations. Missing alt text on product images, form inputs without labels, a checkout flow that trapped keyboard focus inside a modal with no escape route, color contrast ratios as low as 2.1:1 on critical action buttons (WCAG AA requires 4.5:1).

The irony? Most of these were fixable. The engineering team cleared 48 of the 67 violations in eight working days. The remaining 19 required design changes that took another three weeks. The legal issue was resolved before it reached a courtroom. But the client had spent four years accumulating this debt, and it took a legal threat to prioritize it.

That engagement changed how we talk to clients about accessibility.

What automated scanners actually catch

We use Auditi, the accessibility platform we built, along with axe-core and Pa11y for automated scanning. Here's what people don't understand about automated accessibility tools: they catch roughly 30-40% of WCAG violations. That's it.

Automated tools are great at finding missing alt text, broken label associations, insufficient color contrast, missing landmark regions, and duplicate IDs. These are structural issues with clear right-and-wrong answers.

What they miss is everything that requires context. An automated tool can verify that an image has alt text. It cannot tell you whether that alt text is useful. alt="image" passes automated scanning. It tells a screen reader user nothing. alt="Blue wool sweater, front view, size medium" is what they actually need.

Automated tools also can't test keyboard navigation flows. They can check that interactive elements are focusable, but they can't tell you that tabbing through your checkout form goes: name, then submit button, then email, then back to name. That ordering makes no sense to a sighted user and makes even less sense to someone navigating by keyboard alone.

This is why we always pair automated scans with manual testing using actual assistive technology. NVDA on Windows, VoiceOver on Mac, TalkBack on Android. Our testers navigate the entire application using only a keyboard and screen reader combination. The bugs they find in those sessions are consistently more severe than what the scanner reports.

The color contrast problem across our entire ecosystem

I'll share something embarrassing. In early 2026, we ran our ecosystem scanner across all 13 BetterQA product sites and discovered that almost every site had color contrast violations. Our standard purple (#a855f7, Tailwind's purple-400) had a contrast ratio of about 3.3:1 on white backgrounds. WCAG AA needs 4.5:1.

We'd built and deployed a dozen products with a brand color that failed basic accessibility. It took us two weeks to fix across the ecosystem, bumping to purple-600 (#9333ea) which hits 4.6:1. The point isn't that we're bad at this. The point is that accessibility issues are that easy to introduce and that easy to miss when you're not testing for them.

What WCAG levels actually mean in practice

WCAG has three levels: A, AA, and AAA. Most regulations and lawsuits reference AA. Here's the practical difference.

Level A is the bare minimum. Your site has alt text, your forms have labels, your content doesn't rely solely on color to convey information. If you fail Level A, screen reader users literally cannot use parts of your site.

Level AA is what most organizations should target. It adds color contrast requirements (4.5:1 for normal text, 3:1 for large text), requires text resizing up to 200% without loss of content, and mandates consistent navigation patterns. This is where the legal bar sits in most jurisdictions.

Level AAA is aspirational for most products. It requires a 7:1 contrast ratio, sign language interpretation for audio content, and reading level accommodations. Very few commercial products achieve full AAA compliance, and it's not typically required by law.

We recommend AA as the target for all clients. It's achievable, legally defensible, and covers the vast majority of user needs.

The retrofit problem

The client with the demand letter spent roughly $45,000 in engineering time fixing accessibility issues retroactively. If they'd included accessibility testing from the start, the incremental cost would have been a fraction of that.

This is the pattern we see consistently. Teams build for two or three years without accessibility testing. Then something forces the issue, either a legal threat, a government contract requirement, or a user complaint that reaches someone with authority. Suddenly it's a priority, but now it's a retrofit.

Retrofitting accessibility is genuinely harder than building it in. When a React component has been in production for two years with no ARIA attributes, adding them often means rethinking the component's DOM structure. A modal that was built without keyboard trap management might need to be rewritten entirely. A custom dropdown that works fine with a mouse but is invisible to a screen reader can't just have role="listbox" slapped on it and work correctly.

We had one client where retrofitting a single date picker component took three days because the original implementation used div elements styled to look like inputs. The screen reader announced them as "group" elements with no indication they were interactive. Rebuilding it with proper semantic HTML and ARIA took the developer into areas of the spec she'd never touched.

The business case nobody wants to make

About 15-20% of the global population has some form of disability. In the US alone, that's roughly 61 million adults. Many of them are potential customers who will leave your site and go to a competitor if they can't use yours.

But here's the thing nobody says out loud: most companies don't invest in accessibility because of the business case. They invest because of the legal risk. The number of ADA-related web accessibility lawsuits has increased every year since 2017. In 2024, over 4,000 were filed in the US alone.

I'd rather people care about accessibility because it's right than because they're afraid of getting sued. But either motivation leads to the same outcome: a more usable product. I'll take it.

What we tell new clients

Start with an automated scan. Use axe-core, Pa11y, or Auditi. It takes an hour and gives you a baseline. Fix the automated findings first, because they're the easiest wins and they're defensible in court.

Then do a manual audit with real assistive technology. This is where the hard bugs surface. Keyboard navigation, screen reader flows, focus management. Budget two to five days depending on the size of the application.

Integrate accessibility checks into your CI pipeline. axe-core has integrations for Playwright, Cypress, and most other test frameworks. This prevents new violations from shipping.

Train your developers. Not a one-hour webinar. Actual hands-on sessions where they use NVDA to navigate their own product. We've found this single exercise changes behavior more than any policy document. When a developer hears their own form announced as "edit text, edit text, edit text, button" with no labels, they understand the problem in a way no Jira ticket can convey.

And test with actual users with disabilities if you can. We've run usability sessions where a blind user completed a task in 90 seconds that our testers assumed would take 30. We've also watched a user with motor impairments struggle for four minutes with a drag-and-drop interface that had no keyboard alternative. Those sessions change what you build.

At BetterQA, we include accessibility testing in our standard QA engagements because we've seen what happens when it's left out. It's always more expensive later. And someone always gets left behind.

The QA team that tried to go 80% automated and what actually happened

Tudor Brad — Thu, 09 Apr 2026 19:36:57 +0000

A fintech client came to us in 2023 with a clear goal: get to 80% test automation in six months. They had 400 manual test cases, a team of four QA engineers who'd never written code, and a CTO who'd read a McKinsey report about automation ROI. The number 80% came from that report, not from any analysis of their actual product.

We got to 45% automation in eight months. The client was disappointed for about two weeks, until they realized their regression cycle dropped from five days to one and a half. The 45% we automated were the right 45%. The remaining 55% were things that genuinely needed human judgment, and trying to automate them would have produced a flaky, unmaintainable mess.

That project taught us more about automation transformation than any best practice guide.

The part everyone skips: figuring out what to automate

The first thing we did was sort those 400 test cases into three buckets. Bucket one: tests that run every sprint, follow the same steps, and check deterministic outcomes. Things like login, CRUD operations on accounts, and balance calculations. These are automation candidates. About 160 of the 400.

Bucket two: tests that involve visual judgment, subjective UX evaluation, or complex multi-system workflows with timing dependencies. A payment reconciliation flow that depends on a third-party bank API response, for example. These stay manual. About 140 tests.

Bucket three: the remaining 100 tests that nobody could clearly categorize. We left these alone for the first three months and revisited them later. About 40 eventually got automated. The rest stayed manual.

Most automation efforts fail because teams skip this sorting step. They try to automate everything, hit the hard cases early, get frustrated, and declare automation "doesn't work for our product." It works fine. You just automated the wrong things first.

The tool fight

The CTO wanted Selenium because he'd used it at a previous company in 2019. Our team recommended Playwright. This turned into a three-week debate that accomplished nothing except burning goodwill.

Here's what we've learned across dozens of these transformations: the tool matters less than people think. Playwright is faster and has better auto-waiting. Cypress has better developer experience for teams already using JavaScript. Selenium has the widest browser support. Pick one, commit, move on. We've seen successful automation suites in all three.

For this client, we went with Playwright because their app was React-based and their dev team already used TypeScript. That alignment matters more than any feature comparison chart.

We use Flows, our own Chrome extension, for teams that want to record tests without writing code. It records browser interactions and replays them with self-healing selectors, which means the test doesn't break every time a developer renames a CSS class. We built it because selector maintenance was eating 30% of our automation team's time on some projects. But for this client, they wanted code-based tests, so Playwright it was.

The first month was painful

We wrote 20 tests in the first month. Most guides would tell you that's too slow. But those 20 tests were solid. They ran in CI, they didn't flake, and they covered the login flow, account creation, the main dashboard load, and basic transaction queries.

What slowed us down was test data. The application didn't have a clean way to seed test data, so every test had to create its own state from scratch. A test that should have been 15 lines was 60 lines because of setup. We spent two weeks building a test data factory before we could move faster.

Nobody talks about test data in automation articles. It's boring. It's also the thing that determines whether your suite takes 4 minutes or 40 minutes to run.

Month three: the flake crisis

By month three we had 80 automated tests and a 72% pass rate on CI. That sounds terrible, and it was. Eight tests were genuinely flaky. They'd pass locally, fail in CI, pass again on retry. The team was spending mornings investigating failures that turned out to be timing issues, not real bugs.

We stopped writing new tests for two weeks and fixed the flaky ones. Most of them had the same root cause: the app used optimistic UI updates, so Playwright would see the expected text before the API call actually completed. When the API was slow in CI (shared resources, less CPU), the test would sometimes catch a loading state instead.

The fix was boring: explicit waits for network idle on specific API calls, not global timeouts. We also added a retry-once policy in CI, which sounds like a hack but reduced our false failure rate from 28% to under 3%.

The honest numbers

After eight months:

180 tests automated out of 400 (45%)
CI run time: 12 minutes for the full suite
Regression cycle: 1.5 days (down from 5)
False failure rate: 2.8%
Tests maintained by: 2 of the 4 QA engineers (the other 2 focused on exploratory testing)
Cost of the automation effort: roughly equivalent to 6 months of one senior engineer's time

Was it worth it? Yes, but not because of some dramatic ROI calculation. It was worth it because those two engineers running manual regressions for five days every sprint were bored, making mistakes, and starting to job-hunt. Automation didn't replace them. It gave them different work. One became the automation lead. The other moved into performance testing, which the team had never done before.

What we'd do differently

We should have built the test data factory in week one, not week six. Every automation engagement we've done since then starts with data setup.

We should have set the target at "automate the right things" instead of a percentage. The 80% number created pressure to automate tests that weren't good candidates, and we pushed back successfully, but it took energy that could have gone elsewhere.

We should have involved the developers earlier. For the first two months, the dev team treated our automation suite as "the QA thing." Once we started contributing test utilities back to their codebase and catching bugs in their PR pipeline, they started adding test IDs to their components voluntarily. That collaboration made everything faster.

The pattern we see now

After doing this across multiple clients, the pattern is consistent. Teams that succeed at automation transformation share three things: they pick the right tests to automate first (not all tests), they invest in infrastructure before writing tests (data factories, CI configuration, environment management), and they accept that the final automation percentage will be lower than whatever number someone put in a slide deck.

At BetterQA, we've run these transformations for healthcare clients, fintech platforms, and SaaS products. The tools change, the domain changes, but the mistakes are always the same. Everyone wants to skip straight to writing tests. Nobody wants to set up the data layer. And the target percentage is always too high.

The honest version is less exciting than the pitch deck version. But it's the version that actually ships working software.

The automation engineer who couldn't reproduce the bug

Tudor Brad — Thu, 09 Apr 2026 19:36:52 +0000

A few years ago we hired a junior tester who was obsessed with Playwright. He'd done a bootcamp, finished a Udemy course, and showed up on day one asking which framework we used so he could start writing specs. He was smart, motivated, and by week two he had a respectable suite of tests running against one of our clients' checkout flows.

The tests were green. The client was shipping. And then a support ticket came in saying the checkout page looked "weird" on certain Android devices when users had two saved addresses and tried to edit the second one.

Our automation guy opened the ticket, shrugged, and said he couldn't reproduce it. His tests were passing. The selectors were stable. The CI pipeline was happy. He genuinely believed the bug didn't exist, because his framework told him it didn't.

A senior tester on our team opened the app on her phone, followed the steps from the ticket, and found the bug in about ninety seconds. It was a z-index issue on a modal that only appeared when the second saved address had a longer street name than the first. No automated suite on earth was going to catch that, because nobody would ever think to write a test for it.

That was the moment I stopped pretending manual testing was some kind of vestigial skill.

I used to think manual testing was a phase people grew out of

I'll be honest. When I started at BetterQA I thought manual testing was the stuff you did before you got good enough to automate. The narrative in most job postings, most conference talks, most LinkedIn posts felt like a moving walkway: you start clicking buttons, you learn Selenium, you graduate to Playwright, you end up writing CI/CD pipelines and eventually you stop touching the actual product entirely. Manual testing was the ground floor. The point was to leave it.

I was wrong about this in a way that took me embarrassingly long to admit.

What I've seen across the fifty-plus engineers we have spread across twenty-four countries is that the testers who never properly learned to sit with a product and break it by hand are the ones who write the most useless automation. Not bad automation in a technical sense. Their code is often cleaner than mine. But the tests they write cover the happy path, they cover what the spec says should work, and they cover the scenarios that are easy to describe in a Jira ticket. They do not cover the weird stuff. They do not cover the stuff that matters.

Automation is a lens, and lenses have blind spots

Here's the thing nobody tells you when you're learning Playwright: an automated test can only find a bug you already suspected might exist. You have to write the assertion. You have to know what "correct" looks like. The test framework is a flashlight pointed exactly where you told it to point, and everything outside that beam is invisible.

Manual testing is what tells you where to point the flashlight.

When I watch a good manual tester work, they're not executing a test plan like a robot. They're forming hypotheses. They open the app and they notice the hover state takes half a second longer than it should. They click the back button twice in a row and see a flash of an unauthenticated view. They resize the browser to a weird width and watch the layout crack in a spot nobody thought to check. These aren't bugs that appear in any requirements document. They're the bugs end users actually hit.

Our founder has a line he repeats to anyone who'll listen: the chef shouldn't certify his own dish. The same thing applies to automation engineers certifying their own coverage. If you never sat with the product manually, you don't know what you don't know, and your automation suite is going to reflect that gap with unsettling accuracy.

The junior hire who wrote beautiful useless tests

Back to our Playwright obsessive. After the Android checkout incident, we put him on a different kind of onboarding. For six weeks he did nothing but manual exploratory sessions on client products. No scripts. No automation. Just him, a notebook, and a list of prompts like "try to confuse the login flow" and "pretend you're an impatient user and click everything twice."

He hated it at first. He told me it felt like going backwards. He'd spent a year learning automation frameworks and now we were making him take screenshots and write in English sentences. I sympathised but I didn't budge, because I'd seen what happens when testers skip this step.

By week four something clicked. He came into a standup and said "I found this thing where if you refresh during the payment redirect, the session token doesn't clear, so if the next person on that machine opens the same URL they land in the previous user's cart." Nobody had asked him to check that. No test plan contained that scenario. He'd developed the instinct for where bugs like to hide, and once he had the instinct, his automation got dramatically better. The tests he wrote a few months later were the ones that caught real regressions, not the ones that rubber-stamped the happy path.

That transformation is the actual case for making people master manual testing first. It's not about saving money on tooling licences, and it's not about the software development lifecycle or any other textbook framing. It's about developing the intuition that tells you what's worth automating in the first place.

What gets missed when you skip this step

The easy argument for manual testing is cost. You don't need a fancy framework, you don't need runners, you don't need infrastructure. True, but boring. The real argument is that manual testing is the only testing that operates on the actual product as a user would experience it, with all the irrational clicking and impatient scrolling and tab-switching that real humans do.

Automation will tell you whether the button submits the form. Manual testing will tell you that the button looks like a link, so nobody clicks it. Automation will tell you that the error message renders. Manual testing will tell you that the error message renders in a modal that traps keyboard focus and can't be dismissed on mobile. Automation will tell you the checkout completes in under three seconds. Manual testing will tell you that users keep giving up on step four because the progress bar goes backwards when you hit continue.

These are not edge cases. These are the bugs that lose you customers. And they are invisible to anyone who hasn't spent real hours touching the product the way a confused, distracted, slightly annoyed human would touch it.

What we actually do with new hires now

Every person who joins our team, regardless of their background, spends their first weeks doing manual exploration on a real client product. Automation engineers, security testers, people with ten years of experience, people with two. They all start the same way. They get a product, they get a vague prompt, and they go hunt.

We tell them explicitly: we are not evaluating how many bugs you find. We are evaluating whether you develop the instinct for where bugs tend to hide. Some people hate this. Some people thrive on it. Almost everyone, regardless of which camp they start in, writes better automation six months later because of it.

If you're early in your QA career and someone tells you manual testing is what you do until you learn Playwright, I'd push back. Manual testing is what teaches you what's worth testing at all. The framework is just the tool you pick up once you've earned the right to use it.

The automation engineer who can't reproduce a bug isn't bad at automation. He's bad at the thing that comes before automation. And the thing that comes before automation is sitting with the product, paying attention, and letting your discomfort guide you to where the real problems are hiding.

Fuzz testing found bugs in our API that unit tests never would

Tudor Brad — Thu, 09 Apr 2026 08:34:50 +0000

I used to think our test suites were solid. We had unit tests, integration tests, contract tests for the API layer. Good coverage numbers. The kind of setup that makes you feel safe when you merge to main on a Friday afternoon.

Then we ran a fuzzer against the same API and watched it fall apart in under an hour.

Fourteen crashes. Server panics on malformed JSON. A file upload endpoint that accepted literally anything as long as you set the right Content-Type header. An input field on a form that crashed the entire backend process when it received a float instead of an integer.

None of these showed up in our existing tests. Not one.

That was the day I stopped treating fuzzing as a "nice to have" and started treating it as the part of security testing that actually finds the bugs hiding between your test cases.

What fuzzing actually does

Fuzzing is simple in concept. You throw garbage at your software and see what breaks.

More precisely: you take valid inputs, mutate them in thousands of ways (wrong types, oversized strings, null bytes, nested objects 500 levels deep, unicode edge cases, truncated payloads), and send them at your application as fast as you can. Then you watch for crashes, hangs, memory leaks, unexpected error codes, and data that leaks out in error messages.

The OWASP fuzzing page describes the technique well if you want the textbook version. But here is what it looks like in practice: you point a tool at an endpoint, go make coffee, and come back to a list of inputs that made your software do something it should not have done.

The reason this works so well is that developers test for what they expect. You write a test that sends valid JSON and checks the response. Maybe you write a test that sends empty JSON and checks for a 400 error. But you probably do not write a test that sends JSON with a key that is 50,000 characters long, or a nested array 200 levels deep, or a number where a string should be with a trailing null byte.

Fuzzers do not have expectations. They just try things. And software has a lot of assumptions baked into it that only surface when those assumptions get violated.

The bugs fuzzing catches that nothing else does

Let me walk through the actual categories of failures we find during fuzz testing engagements. These are real patterns from real projects.

Input type confusion. A registration form expects a string for the phone number field. The API handler parses it and passes it to a validation function that calls .match() on it. Send an integer instead of a string and the backend throws an unhandled TypeError. The server returns a 500 with a stack trace that includes the file path and line number. Now an attacker knows your framework, your file structure, and exactly where to probe next.

Unit tests rarely cover this because the developer wrote the test with the same mental model they used to write the code. They send a string because that is what the field is for.

Malformed JSON handling. We see this constantly. APIs that parse JSON request bodies without validating the structure first. Send {"user": {"name": {"name": {"name": ...}}}} nested 100 times and the server either runs out of memory or hits a recursion limit and crashes. Send JSON with a trailing comma (technically invalid) and some parsers accept it while others throw. Send a 10MB payload to an endpoint that expects 200 bytes and there is no size limit enforced.

These are not exotic attacks. They are basic robustness issues that every public-facing API should handle. Fuzzers find them in minutes.

File upload validation gaps. This one is a classic. An endpoint says it accepts PNG files. It checks the Content-Type header. It does not check the actual file content. So you can upload a PHP script, a shell script, or an SVG containing embedded JavaScript, and the server happily stores it. Depending on the server configuration, that file might be directly executable.

We tested a client's document upload feature and found that it validated the file extension in the filename but not the actual bytes. Rename malicious.php to malicious.php.png and it went straight through.

Error message information leakage. When software crashes on unexpected input, the error messages often contain information that should never reach the client. Database connection strings, internal IP addresses, full stack traces with dependency versions, SQL query fragments. Fuzzers trigger these crashes systematically, and each crash response becomes a reconnaissance opportunity for an attacker.

Integer overflows and boundary values. We worked on a payment processing system where fuzz testing found an integer overflow in the transaction amount field. The field was a 32-bit signed integer. Send a value just past 2,147,483,647 and the system wrapped around to a negative number. In a payment context, that could mean a credit instead of a debit. Standard tests sent amounts like 100, 500, 10000. Nobody tested what happens at the boundary of the data type itself.

Why your existing tests miss these

Your unit tests are written by the same people who wrote the code. They share the same assumptions about what valid input looks like. They test the happy path and a handful of known error cases.

Your integration tests verify that components work together correctly when given correct data. They rarely test what happens when component A sends garbage to component B.

Your end-to-end tests simulate real user behavior. Real users do not typically paste 50,000 characters into a phone number field or send raw bytes to a JSON endpoint. Attackers do.

Fuzzing fills the gap between "does it work correctly?" and "does it fail safely?" Those are two very different questions, and most test suites only answer the first one.

How we actually run fuzz tests

At BetterQA, fuzzing is part of our DAST (Dynamic Application Security Testing) work. We built an AI Security Toolkit with over 30 scanners, and fuzzing is integrated into the dynamic analysis pipeline.

Here is how a typical engagement works:

1. Map the attack surface. Before we fuzz anything, we need to know what exists. We crawl the application, identify all endpoints, document the expected input formats, and note which endpoints handle sensitive data (auth, payments, file uploads, admin functions).

2. Seed the fuzzer with valid inputs. Good fuzzing starts with valid data. We capture real requests from the application (with test accounts, never production data), and the fuzzer uses these as templates. It knows what a valid request looks like, so it can make targeted mutations rather than purely random noise.

3. Run mutation-based fuzzing. The fuzzer takes each valid input and generates thousands of variants. Wrong types, boundary values, encoding tricks, oversized payloads, special characters, null bytes, format string patterns. Each variant gets sent to the endpoint, and we capture the response code, response body, response time, and any server-side logs.

4. Triage the findings. Not every crash is a security vulnerability. Some are just robustness issues (the server returns a 500 but recovers cleanly). Some are actual security holes (the server leaks data, accepts the malformed input as valid, or enters an inconsistent state). We classify each finding by severity and exploitability.

5. Verify and document. Every finding gets manually verified. We reproduce the crash, confirm the root cause, and write up the fix. No false positives in the final report.

For web applications, we often use OWASP ZAP as one of the tools in this pipeline. For APIs, we combine custom fuzzing scripts with tools like Burp Suite's Intruder or purpose-built API fuzzers. For projects with unusual protocols (IoT devices, custom binary formats), we write targeted fuzzers from scratch.

When to fuzz (and when not to)

Fuzzing works best when:

You have a public-facing API that accepts user input
You process file uploads
You handle payment or financial data
You parse complex data formats (JSON, XML, CSV, binary protocols)
You have already done basic security testing and want to go deeper

Fuzzing is less useful when:

The application has no external input surface (purely internal batch processing)
You have not done basic input validation yet (fix the obvious stuff first, then fuzz)
The codebase changes so frequently that findings become stale before they are fixed

The best time to start fuzzing is after your first round of functional testing is stable but before you go to production. That is when the cost of fixing issues is lowest and the risk of missing something is highest.

The security testing reality in 2024

As Tudor Brad, BetterQA's founder, puts it: "It's a good versus evil game right now." AI is accelerating development speed, which means more code ships faster, which means more potential vulnerabilities reach production faster. Features that used to take months now take days. The testing has to keep pace.

Fuzzing is one of the few techniques that scales with code output. You do not need to manually write a test case for every possible malformed input. The fuzzer generates them. You just need to point it at the right targets and have someone who knows what they are looking at to triage the results.

If you have never run a fuzzer against your application, I would strongly suggest trying it on a staging environment. The results will probably surprise you. We have yet to fuzz a non-trivial application and find zero issues. Every single engagement has turned up something the existing test suite missed.

The question is never "does my software have these bugs?" The question is "do I find them before someone else does?"

More on security testing and QA practices on the BetterQA blog.

Payment testing: the card types that break in production

Tudor Brad — Thu, 09 Apr 2026 08:34:46 +0000

The bug that costs you money twice

Last year we tested a fintech client's checkout flow. Everything passed in Stripe test mode. Green across the board. Then they went live in Germany and 30% of transactions started failing silently. No error page. No retry prompt. Just... nothing happened when the user clicked "Pay."

The problem was 3D Secure. Their integration handled the initial charge request fine, but never implemented the redirect flow for SCA (Strong Customer Authentication). In test mode, Stripe skips 3D Secure unless you explicitly use the 4000002760003184 test card. Nobody on the dev team had used that card. So nobody knew the integration was broken for every European card that required authentication.

The client found out when chargebacks started hitting. That is the worst way to discover a payment bug: your payment processor tells you, your bank tells you, and your users have already left.

Why payment bugs are different from other bugs

A broken image on your landing page is embarrassing. A broken payment flow is expensive. Here is what makes payment bugs uniquely painful:

Direct revenue loss. Every failed transaction is money that almost entered your account and didn't. If 5% of your transactions fail due to a card type you never tested, that is 5% of revenue gone. Not "at risk." Gone.

Chargebacks compound the damage. When a payment goes through incorrectly (wrong amount, duplicate charge, currency mismatch), you don't just refund the money. You pay chargeback fees. Enough chargebacks and your payment processor raises your rates or drops you entirely.

User trust evaporates instantly. People are anxious about money. A single failed payment makes a user question whether your site is legitimate. They won't debug it for you. They will close the tab and buy from someone else.

Silent failures hide the problem. Unlike a 500 error that shows up in your monitoring, many payment failures happen at the processor level and return a generic decline. Your logs show "card_declined" but the real cause is that your integration doesn't handle the card network correctly.

This is why we treat payment testing as its own discipline, not just "form validation with a credit card field."

Card types that actually break things

Here are the specific card type issues we run into repeatedly when testing payment integrations for clients.

Amex and the 15-digit problem

American Express cards have 15 digits and a 4-digit CVV (called CID). Visa and Mastercard have 16 digits and a 3-digit CVV. This sounds trivial until you see how many integrations hardcode maxLength="16" on the card number input and maxLength="3" on the CVV field.

We tested a SaaS platform where Amex cards were being silently rejected. No error message. The form just wouldn't submit. The frontend validation required exactly 16 digits, so any 15-digit PAN was treated as incomplete. The user saw a disabled submit button and assumed they typed something wrong.

Test cards to use:

Amex:           3782 822463 10005    (15 digits, 4-digit CID)
Visa:           4242 4242 4242 4242  (16 digits, 3-digit CVV)
Mastercard:     5555 5555 5555 4444  (16 digits, 3-digit CVV)

What to check:

Card number field accepts 15, 16, and 19 digits
CVV field accepts both 3 and 4 digits
Card type detection updates dynamically (Amex logo appears when you type 37xx)
Backend validation matches frontend rules

UnionPay and 19-digit PANs

UnionPay cards can be 16, 17, 18, or 19 digits long. If your validation regex is ^\d{16}$, you are rejecting a card network used by over a billion people.

We see this constantly in integrations targeting Asian markets. The dev team builds and tests with Visa/Mastercard, launches in Singapore or Malaysia, and gets support tickets from users who "can't enter their card number."

UnionPay (19):  6200 0000 0000 0000 003
UnionPay (16):  6200 0000 0000 0005

The fix is straightforward: accept 13-19 digits and let the payment processor handle network-specific validation. Your frontend should not be the gatekeeper for PAN length.

Diners Club and the 14-digit edge case

Diners Club cards traditionally have 14 digits, though newer ones may have 16. If your system strips spaces and then checks length === 16, Diners Club users cannot pay.

Diners Club:    3056 9309 0259 04   (14 digits)

This one is less common globally but still matters if you operate in parts of South America or accept corporate cards. We have seen it break on subscription billing platforms where the initial charge worked (the card was tokenized by Stripe directly) but a later recurring charge failed because the platform's own validation ran during a card update flow.

3D Secure and SCA failures

This is the big one. 3D Secure (3DS) adds an authentication step where the card issuer verifies the cardholder, usually through a redirect or iframe popup. In the EU, SCA regulations make this mandatory for most online transactions.

The problem: Stripe's test mode does not trigger 3DS by default. You need to explicitly use test cards that simulate the 3DS flow:

3DS required:       4000 0027 6000 3184
3DS required (fail): 4000 0084 0000 1629
3DS optional:       4000 0025 0000 3155

What breaks:

The redirect URL is not configured, so the user gets sent to a blank page
The return handler does not check payment_intent.status after the redirect
Mobile webviews block the 3DS popup, so the authentication never completes
The webhook handler does not account for the requires_action status

We tested a client's mobile app where 3DS worked perfectly in the browser but failed 100% of the time in the iOS webview. The app's WKWebView had javaScriptEnabled set to true but blocked popups, which is how the 3DS challenge was presented. Every EU user on iOS could not complete a payment.

Currency and amount edge cases

Currency bugs are sneaky because they often produce a valid charge for the wrong amount. The user gets billed, the amount looks plausible, and nobody notices until reconciliation.

Common issues we test for:

Zero-decimal currencies. JPY, KRW, and several others do not use decimal subunits. If your system sends 1000 to Stripe for a 10.00 USD charge (correct, because Stripe uses cents), sending 1000 for a JPY charge means 1000 yen, not 10 yen. The amount field interpretation changes by currency.

# USD: $10.00 = 1000 (cents)
# JPY: 1000 yen = 1000 (no subunit)
# BHD: 10.000 BD = 10000 (three decimal places)

Rounding on conversion. If your platform shows prices in EUR but charges in USD after conversion, rounding differences can mean the user sees 9.99 EUR but gets charged 10.01 EUR equivalent. Small difference. Big trust problem.

Minimum charge amounts. Stripe requires a minimum of 50 cents USD (or equivalent). If your platform allows a 0.10 USD tip or a discount that reduces the charge below the minimum, the payment fails at the processor level with a generic error.

How we structure payment test suites

When we pick up a payment integration project, here is the sequence we follow. This is not theory. This is what we actually run.

Phase 1: Card type coverage matrix. We build a grid of every card network the client wants to support, crossed with every payment scenario (one-time charge, subscription, refund, partial refund, card update). Each cell gets tested. No assumptions that "if Visa works, Mastercard works."

Phase 2: Authentication flows. We test every 3DS path: success, failure, abandonment (user closes the popup), timeout, and network error during redirect. We test on desktop browsers, mobile browsers, and in-app webviews separately because they behave differently.

Phase 3: Error handling and messaging. We trigger every decline code Stripe can return (insufficient funds, expired card, incorrect CVV, processing error, card not supported) and verify the user sees a specific, actionable message. "Payment failed" is not acceptable. "Your card was declined. Please check your card details or try a different payment method" is the minimum.

Phase 4: Webhook reliability. We verify that payment confirmation does not depend solely on the client-side redirect. If the user closes their browser after 3DS but before the redirect completes, the webhook from Stripe should still update the order. We test this by intentionally killing the browser session mid-payment and confirming the backend processes the webhook correctly.

Phase 5: Currency and locale. We test with cards issued in different countries, in different currencies, with different locale settings on the browser. A Japanese user with a JPY card on a platform that prices in USD should see a coherent experience from price display through to their bank statement.

Stripe test cards quick reference

For developers setting up their own payment test suites, here are the cards we use most often:

Scenario	Card number	Notes
Success	`4242 4242 4242 4242`	Always succeeds
Generic decline	`4000 0000 0000 0002`	Always declined
Insufficient funds	`4000 0000 0000 9995`	Specific decline reason
Incorrect CVC	`4000 0000 0000 0127`	CVC check fails
Expired card	`4000 0000 0000 0069`	Expiry check fails
3DS required	`4000 0027 6000 3184`	Triggers authentication
3DS failure	`4000 0084 0000 1629`	Authentication fails
Amex	`3782 822463 10005`	15 digits, 4-digit CID
Dispute/chargeback	`4000 0000 0000 0259`	Triggers dispute

Use any future expiry date and any 3-digit CVC (4-digit for Amex). For full documentation, check Stripe's testing page.

The test mode trap

Here is the pattern we see over and over: a team builds a payment integration, tests it thoroughly in Stripe test mode, and ships it. Then production breaks in ways that test mode never revealed.

Test mode is not production. It does not enforce SCA. It does not check real BIN ranges. It does not apply real fraud detection rules. It does not connect to actual card networks. It is a simulation, and like all simulations, it has blind spots.

The gap between test mode and production is where payment bugs live. You can narrow that gap by using the right test cards, testing authentication flows explicitly, and verifying webhook handling under failure conditions. But you cannot eliminate it entirely without production monitoring.

We always recommend that clients set up real-time alerting on payment failure rates. A 2% failure rate on day one that creeps to 8% by day thirty means something changed at the processor or issuer level, and no amount of pre-launch testing catches that.

What we have learned from testing payments across clients

After testing payment integrations for fintech and e-commerce clients at BetterQA, a few things stand out:

Card type validation belongs at the processor level, not your frontend. Let Stripe or Adyen validate the PAN. Your job is to not block valid cards before they reach the processor.
3D Secure is not optional in Europe. If you sell to EU customers and your integration does not handle 3DS, you will lose transactions. Not might. Will.
Test the sad paths harder than the happy paths. A successful payment needs to work. A failed payment needs to communicate clearly. Most teams spend 90% of testing time on success and 10% on failure. We flip that ratio.
Webhooks are your safety net. Client-side confirmation is unreliable. Browsers crash, users close tabs, networks drop. Your backend must handle payment confirmation through webhooks independently of what happens in the browser.
Currency handling is a category of bugs, not a single check. Zero-decimal currencies, three-decimal currencies, conversion rounding, minimum amounts: each one is a distinct failure mode.

Payment bugs are expensive, embarrassing, and preventable. The card types and scenarios in this article are the ones we see break most often. Test them before your users find them for you.

More on how we approach QA for complex integrations: betterqa.co/blog