Flytebit

Posted on Jun 4

Vibe Thinking - When QA Becomes the New Bottleneck

#vibecoding #vibethinking #qualityassurance #aitesting

The dev team is shipping fast. Their sprint velocity has genuinely jumped.

But the QA queue is three days long. The pipeline takes 45 minutes to run. Every fast PR is sitting in a holding pattern, waiting for a human tester who has a backlog of 23 tickets. The testers are working harder than ever, and they're still the constraint.

The bottleneck didn't disappear. It just moved to a different part of the org.

This is the pattern that shows up in every organisation that installs AI coding tools and leaves QA unchanged. The developer transformation is real, but it runs ahead of the testing and pipeline model it depends on. The sprint doesn't speed up; the queue just builds up somewhere new.

This is Post 3 in the Vibe Thinking series ↗. The posts covered so far are:

Post 0 - Vibe Thinking - The Full Org Transformation ↗ Why developer-only vibe coding doesn't change the sprint, and what full transformation actually requires.
Post 1 - Vibe Thinking - The Developer Who Codes at the Speed of Thought ↗ The developer layer, the discipline required to make fast output safe.
Post 2 - Vibe Thinking - The PM Who Writes Requirements That an AI Can Actually Use ↗ The PM layer, and how ambiguous requirements become faster garbage in a vibe coding workflow.

This post is about what happens when the code reaches testing.

The Queue Migration

Vibe coding doesn't eliminate bottlenecks. It relocates them.

Before AI coding tools, the constraint was typically writing code - developers were the limiting factor. The sprint was sized around how long it took to build. When vibe coding works well, that constraint moves. Code output per developer can multiply significantly. PRs arrive faster. More tickets hit "Dev Done" per week than ever before.

But the pipeline doesn't know that. The testing environment still runs the same regression suite it always did. The QA team still has the same headcount. The CI/CD pipeline still takes the same time to run. The release cadence still assumes the same weekly output that justified it.

The result is predictable: faster code flowing into a pipe built for slower code, and the pipe becomes the constraint. It gets misread as a QA problem or a DevOps problem. The actual issue is transformation completeness: the org changed one layer and left everything downstream unchanged.

The good news: this is the most solvable bottleneck in the sequence. Most of what QA teams do manually today can be automated or shifted earlier in the pipeline - using tooling that didn't exist five years ago, without reducing quality. Often the quality improves.

Why Manual Regression Breaks Under Vibe Coding Volume

Manual regression testing has always been a compromise. Thorough in theory, chronically under-resourced in practice. Most teams run partial regression at best - covering the critical paths and hoping the edge cases don't surface in production.

That compromise was sustainable when code moved slowly. When a sprint produced twenty changed components, a QA team of three could cover it - barely, but reliably. When vibe coding doubles or triples the output, the same team now faces forty or sixty changed components per sprint. The math doesn't work.
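To put numbers on that, here is a minimal sketch of the per-tester load using the figures from the paragraph above (twenty changed components, three testers, output doubling or tripling). The function name is illustrative, and the even split across testers is an assumption.

```python
def components_per_tester(changed_components: int, testers: int) -> float:
    """Changed components each tester has to cover in one sprint (even split assumed)."""
    return changed_components / testers


TESTERS = 3    # QA team size from the example above
BASELINE = 20  # changed components per sprint before AI coding tools

for multiplier in (1, 2, 3):
    load = components_per_tester(BASELINE * multiplier, TESTERS)
    print(f"{multiplier}x output: {load:.1f} components per tester per sprint")
```

The headcount stays fixed while the numerator grows, so the per-tester load scales linearly with developer output.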

More code arriving faster against the same human review bandwidth means the queue grows, no matter how quickly the code itself was written.

There's a second problem specific to AI-generated code: the patterns are harder to spot manually. AI tends to produce output that is syntactically clean and structurally plausible, which means it reads well in a quick review. The issues tend to be semantic: incorrect assumptions about state, edge cases handled incorrectly, security-relevant behaviours that look fine at a glance. Manual testing that worked well for hand-typed code misses more of what AI generates, for exactly the same effort.

The answer is a different testing model.

Why "QA Sign-Off" Needs to Be Redesigned

The definition of done hasn't changed in most organisations since they adopted sprints: dev complete, QA sign-off, deploy.

That model was designed around a world where testing is a phase - something that happens after the code is written, before the release. It made sense when code came out slowly enough that QA had time to run the suite, log findings, hand back to dev, and cycle through again.

In a vibe coding org, that sequence breaks in two ways.

First: The cycle time assumption is wrong. If a developer can build a feature in a morning, a two-day QA cycle for that feature is a 4:1 delay ratio: four half-days of waiting for one half-day of building. The feature sits finished, waiting. The developer moves to the next thing. By the time QA comes back with findings, the developer has moved context three tasks forward. Context-switching back is expensive.

Second: The ownership assumption is wrong. "QA owns quality" is the default in a sequential model. In a vibe coding world (where AI-generated code is producing output the developer hasn't fully reasoned through), quality has to be everyone's responsibility from the first line, not a function's job at the end of the sequence. Pushing quality to the end of the pipeline is how OWASP-class issues reach production without anyone catching them.

QA sign-off isn't going away, but what it means is changing. The real question is at what point in the flow QA gets embedded, and what "QA" actually means when testing can be automated and AI-generated at every stage.

What Is Shift Left, and Why It Means Something Different with AI

Most QA engineers know the term, and most organisations say they practise it. Few have actually closed the gap between the principle and the pipeline.

Shift left means moving quality activities earlier in the development lifecycle, to the left of a timeline where code moves from requirements through to release. The earlier a defect is found, the cheaper it is to fix. The principle is sound; the challenge has always been execution.

There are three distinct models, and they are not equivalent. To make the differences concrete, the same feature runs through all three: "Add a CSV export button to the filtered report view - users can export up to 10,000 rows of their report data." What changes is how much of the calendar gets consumed by iteration loops, and how much slips through without anyone catching it.

Model 1: Traditional QA (Test Last)

Requirements → Development → Lead code review ↺ → QA ↺ → Release.

Defects surface only after development is finished, so fixing them means context-switching back to code written days or weeks earlier, and that cost is real. This is the model most organisations are still running, regardless of what their job postings say.

Model 2: Traditional shift left (TDD, BDD, CI, pre-AI)

Requirements + AC → QA drafts test cases from AC + Dev builds with unit tests in parallel → CI on every commit → Lead code review ↺ → QA executes pre-drafted cases ↺ → Release.

Better than Model 1 - test cases are ready when the PR lands, and CI catches regressions early. But defects QA finds still trigger the same context-switch loop. Two dependencies also remain: developers writing unit tests manually under sprint pressure, and the AC being complete enough for QA to test against.

In practice, coverage is the first thing to get cut - and gaps in the AC become gaps in the test suite.

Model 3: Shift left with AI (the model this post is about)

AI drafts brief + AC (PM reviews) → AI generates test cases + Gherkin specs (QA reviews, commits specs to repo) + AI generates code + unit tests (Dev reviews) in parallel → AI code review tool flags issues + suggests fixes (Dev applies/modifies) ↺ → Quality gates green → Lead verifies gate summary + Architectural sign-off ↺ → QA wires E2E automation from Gherkin specs + merged implementation → Automated E2E + QA exploratory ↺ → release.

When AI enters the pipeline, the testing model can't stay the same. Output volume changes the math: AI generates a full feature in the time it takes a developer to write three test cases. Teams that add AI coding without updating the test model hit a coverage cliff - output accelerates but coverage stays manual. This model resolves that by having AI generate code and unit tests together, both automated, neither optional.
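As a rough illustration of code and unit tests arriving together, here is a sketch for the CSV export example used across the three models. The helper name `export_rows_to_csv` is hypothetical, and rejecting anything over 10,000 rows (rather than truncating) is an assumption: the acceptance criteria as quoted don't say which, and that is exactly the sort of gap a reviewed test case forces into the open.

```python
import csv
import io

MAX_EXPORT_ROWS = 10_000  # limit taken from the example feature


def export_rows_to_csv(rows, fieldnames):
    """Return CSV text for the given rows (hypothetical helper).

    Assumption: exceeding the limit raises instead of silently truncating.
    """
    rows = list(rows)
    if len(rows) > MAX_EXPORT_ROWS:
        raise ValueError(f"export limited to {MAX_EXPORT_ROWS} rows, got {len(rows)}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def test_exports_header_and_rows():
    text = export_rows_to_csv([{"id": 1, "total": "9.50"}], ["id", "total"])
    assert text.splitlines() == ["id,total", "1,9.50"]


def test_accepts_exactly_the_limit():
    rows = [{"id": i} for i in range(MAX_EXPORT_ROWS)]
    # One header line plus one line per row.
    assert len(export_rows_to_csv(rows, ["id"]).splitlines()) == MAX_EXPORT_ROWS + 1


def test_rejects_one_row_over_the_limit():
    rows = [{"id": i} for i in range(MAX_EXPORT_ROWS + 1)]
    try:
        export_rows_to_csv(rows, ["id"])
    except ValueError:
        return
    raise AssertionError("expected ValueError for 10,001 rows")
```

The boundary tests (exactly 10,000 and 10,001 rows) are the ones that tend to get dropped when tests are written by hand under sprint pressure.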

The risk profile shifts too. AI-generated code introduces predictable, pattern-specific vulnerabilities that functional testing alone doesn't catch. Shift left with AI requires security scanning embedded in the pipeline, not just functional coverage.

And the constraint that made shift left hard in a manual world (writing tests is time-consuming and gets deprioritised) disappears when test generation is automated. Shift left with AI becomes a default state enforced by the pipeline, not a discipline imposed on developers.

Across all three models, the developer effort is the same. What differs is how much of that effort gets multiplied into calendar days by manual process, and how much of what matters gets missed by the people reviewing and testing manually. The sections below cover the tools that make Model 3's pipeline possible.

What to Let Go Of

The Security Dimension

The security implications of vibe coding at scale aren't getting enough attention, and QA is the function best positioned to own the response.

45% of AI code fails security tests
Veracode Spring 2026 GenAI Code Security Update ↗ found that across all tested models and languages, 45% of AI-generated code introduces a known security flaw, with Java failing at 71% and XSS vulnerabilities failing at 85%. Those numbers have barely moved in two years of model releases. That's the part I find more unsettling than the 45% itself: the models are getting more capable, not more security-aware.

When nearly half of AI-generated code arrives with known security vulnerabilities, the question is where in the pipeline those vulnerabilities are being caught.

In most organisations today, the honest answer sits somewhere between code review (occasionally) and production (often). Neither is the right place.

Take a common pattern. An AI coding agent, given the prompt "Add an endpoint that returns user order history", will generate a working endpoint. It will also, in a significant proportion of cases, do one or more of the following: fail to scope the returned data to the authenticated user (returning other users' orders if the user ID is passed directly), skip input sanitisation on the order ID parameter, or expose fields that contain PII beyond what the use case requires. The endpoint works and passes a happy-path test. It ships.
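A framework-free sketch of that pattern, with an in-memory list standing in for the database. Every name and record here is made up for illustration; it is not output from any particular model or tool.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Order:
    order_id: int
    user_id: int
    total: str
    email: str  # PII the order-history view does not need


# In-memory stand-in for an orders table (illustrative data only).
ORDERS = [
    Order(1, 100, "19.99", "alice@example.com"),
    Order(2, 200, "5.00", "bob@example.com"),
]


def order_history_unscoped(requested_user_id: int) -> list[dict]:
    """Risky pattern: trusts a caller-supplied user ID and returns every field."""
    return [asdict(o) for o in ORDERS if o.user_id == requested_user_id]


def order_history_scoped(authenticated_user_id: int) -> list[dict]:
    """Scopes to the authenticated user and returns only the fields the view needs."""
    return [
        {"order_id": o.order_id, "total": o.total}
        for o in ORDERS
        if o.user_id == authenticated_user_id
    ]


# Happy path: user 100 asks for their own orders, and both versions "work".
print(order_history_unscoped(100))
print(order_history_scoped(100))

# Negative case: user 100 passes user 200's ID. The unscoped version returns
# someone else's orders, email included.
print(order_history_unscoped(200))
```

A happy-path test passes against both functions. Only a negative test (one user requesting another user's ID) separates them, which is why that test belongs in the pipeline rather than in someone's memory.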

This is the default output pattern of a model working from a prompt without security constraints baked into the brief.

The fragmented ownership problem makes this worse. When the prompt author, the AI, and the reviewer are different people (or when the reviewer is doing a light pass), nobody truly owns the security posture of what shipped. Diffuse accountability is how OWASP vulnerabilities stay in production for months.

Shadow IT extends the problem further. As vibe coding lowers the technical barrier to building, non-engineering functions (operations teams, marketing, analytics) will start shipping their own tools and automations, often outside any security governance model. QA and AppSec teams need to extend their remit to account for this, before a self-built internal tool becomes the attack surface for a production system it touches.

What QA's new security remit includes:

Security test coverage as a standard pipeline component, not a separate audit phase
Automated scanning for OWASP Top 10 patterns on every PR - not post-release
A governance policy for AI-built tools created outside the core engineering team
Explicit ownership assigned for every AI-generated component that handles sensitive data
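One way to picture "on every PR, not post-release" is a small gate that reads a scanner's findings and blocks the merge above a severity threshold. This is a sketch under assumptions: the findings format, the rule names, and the default threshold are all hypothetical, not the output of any specific scanner.

```python
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def evaluate_gate(findings, fail_at="high"):
    """Return (passed, blocking_findings) for one PR's scan results.

    `findings` is a list of dicts with at least a "severity" key
    (hypothetical format, assumed for this sketch).
    """
    threshold = SEVERITY_RANK[fail_at]
    blocking = [f for f in findings if SEVERITY_RANK[f["severity"]] >= threshold]
    return len(blocking) == 0, blocking


# Illustrative scan results for a single PR.
findings = [
    {"rule": "sql-injection", "severity": "critical", "file": "orders.py"},
    {"rule": "verbose-error", "severity": "low", "file": "views.py"},
]

passed, blocking = evaluate_gate(findings)
print("gate passed" if passed else f"gate blocked by: {[f['rule'] for f in blocking]}")
```

The threshold is a policy decision, and putting it in code makes that policy reviewable the same way everything else in the PR is.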

TESTR - Automated Unit Test Coverage at Every Commit

Learn more about TESTR ↗

PASSR - Automated Engineering Review Across Every Commit

Learn more about PASSR ↗

The Pipeline Has to Catch Up Too

Test coverage and security scanning address what's in the code. The pipeline itself is a separate constraint.

A CI/CD configuration designed for weekly releases creates artificial latency that compounds across every fast PR. If the pipeline runs for 45 minutes and a team is merging six PRs per day, that's four and a half hours of pipeline time per day - which means PRs are waiting in queue, not running in parallel, and the feedback loop between code and validated deployment is measured in hours rather than minutes.
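The arithmetic is easy to check, and a toy queue model shows where the waiting comes from. The sketch below assumes a single serial runner and a made-up arrival pattern (two bursts of three PRs); real pipelines with several runners will behave differently.

```python
def queue_waits(arrival_minutes, run_minutes):
    """Minutes each PR waits before its run starts on one serial runner."""
    waits = []
    runner_free_at = 0
    for arrival in sorted(arrival_minutes):
        start = max(arrival, runner_free_at)
        waits.append(start - arrival)
        runner_free_at = start + run_minutes
    return waits


PRS_PER_DAY = 6
RUN_MINUTES = 45
total_minutes = PRS_PER_DAY * RUN_MINUTES
print(f"pipeline time per day: {total_minutes} min ({total_minutes / 60} h)")

# Assumed arrivals, in minutes after the start of the day: two bursts of three PRs.
arrivals = [0, 5, 10, 240, 245, 250]
for run_minutes in (45, 10):
    print(f"{run_minutes}-minute run, waits per PR: {queue_waits(arrivals, run_minutes)}")
```

With a 45-minute run, the third PR in each burst waits longer than the run itself. With a hypothetical 10-minute run, the same arrival pattern barely queues at all.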

Pipeline evolution in a vibe coding org works along a few structural dimensions.

Parallelisation. Tests that used to run sequentially can run in parallel. A suite that takes 45 minutes sequentially can often run in under 10 when the test jobs are properly split. Most teams have never parallelised their pipelines because the weekly release cadence didn't make the latency painful enough to fix. Vibe coding makes it painful enough.
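A minimal sketch of what "properly split" can mean: assign test files to shards longest-first, always onto the currently lightest shard. The file names, the durations, and the choice of six shards are invented for illustration; in practice the timings would come from previous pipeline runs, and CI systems differ in how they distribute shards.

```python
import heapq


def split_into_shards(test_durations, shard_count):
    """Greedy longest-first split: each test goes to the currently lightest shard."""
    heap = [(0, index) for index in range(shard_count)]  # (load, shard index)
    shards = [[] for _ in range(shard_count)]
    longest_first = sorted(test_durations.items(), key=lambda item: item[1], reverse=True)
    for name, duration in longest_first:
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + duration, index))
    return shards


def wall_clock(shards, test_durations):
    """The pipeline finishes when the slowest shard finishes."""
    return max(sum(test_durations[name] for name in shard) for shard in shards)


# Hypothetical per-file durations in minutes (45 minutes when run back to back).
durations = {
    "test_checkout.py": 8, "test_reports.py": 7, "test_auth.py": 6,
    "test_search.py": 6, "test_billing.py": 5, "test_profile.py": 4,
    "test_export.py": 4, "test_admin.py": 3, "test_health.py": 2,
}

shards = split_into_shards(durations, shard_count=6)
print("sequential minutes:", sum(durations.values()))
print("parallel minutes:", wall_clock(shards, durations))
```

The wall-clock time can never drop below the single slowest test file, so one oversized file caps the gain until it is broken up.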

Incremental testing. Running the full suite on every commit is expensive. Running only the tests relevant to the changed code paths - with a full suite scheduled less frequently - dramatically reduces per-commit pipeline time without reducing coverage.
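One simple way to select "only the relevant tests" is a map from each test file to the source files it exercises. The sketch below assumes such a map exists (built by hand or from coverage data); the file names are hypothetical. It falls back to the full suite whenever a changed file is missing from the map, so an incomplete map costs time rather than coverage.

```python
def select_tests(changed_files, test_dependencies, full_suite):
    """Pick tests whose declared dependencies overlap the changed files.

    Falls back to the full suite if any changed file is unknown to the map.
    """
    changed = set(changed_files)
    known = set()
    for deps in test_dependencies.values():
        known |= deps
    if not changed <= known:
        return sorted(full_suite)
    return sorted(test for test, deps in test_dependencies.items() if deps & changed)


# Hypothetical dependency map: test file -> source files it covers.
TEST_DEPENDENCIES = {
    "test_export.py": {"export.py", "reports.py"},
    "test_reports.py": {"reports.py"},
    "test_auth.py": {"auth.py"},
}
FULL_SUITE = list(TEST_DEPENDENCIES)

print(select_tests(["reports.py"], TEST_DEPENDENCIES, FULL_SUITE))
print(select_tests(["settings.py"], TEST_DEPENDENCIES, FULL_SUITE))
```

The scheduled full run is what keeps the map honest: if it fails on something the incremental run skipped, the map is missing a dependency.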

Environment parity. Builds that pass in one environment but fail in another (staging versus production, for example) are a sign that environments have drifted. Containerised environments with infrastructure-as-code remove the most common source of "works on my machine" pipeline failures - which are even more common when the code was generated by AI rather than hand-typed by a developer who knows the environment.
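Drift is easier to fix when it is visible. As a small sketch, assume each environment can report its pinned versions as a flat name-to-version mapping (how that manifest is produced is left open, and the versions below are made up); a pipeline step can then diff the two.

```python
def find_drift(staging, production):
    """Return {component: (staging_version, production_version)} for every mismatch.

    A component missing from one environment shows up with None on that side.
    """
    drift = {}
    for component in sorted(staging.keys() | production.keys()):
        left = staging.get(component)
        right = production.get(component)
        if left != right:
            drift[component] = (left, right)
    return drift


# Illustrative manifests only.
staging = {"python": "3.12.4", "postgres": "16.3", "redis": "7.2"}
production = {"python": "3.12.4", "postgres": "15.7"}

for component, (stage_version, prod_version) in find_drift(staging, production).items():
    print(f"{component}: staging={stage_version} production={prod_version}")
```

Failing the pipeline on a non-empty result turns drift into a build error instead of a surprise at deploy time.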

The pipeline is infrastructure, and it needs to be treated as a first-class engineering concern rather than an operational afterthought. In a vibe coding org, a slow pipeline causes as much friction as a slow developer.

Working with Flytebit

At FLYTEBIT TECHNOLOGIES, Vibe Coding Transformation ↗ is a structured engagement.

QA processes and pipeline health are part of every transformation feasibility study we run. In most engagements, the QA layer is where we find the biggest gap between what teams believe is happening and what the pipeline metrics actually show. Teams describe their QA process as "solid," and then the pipeline data shows a three-day average from PR open to QA sign-off, a 60% manual regression rate, and security scanning that runs quarterly rather than on every commit.

We map the current testing model, identify where test coverage falls below the risk level of the code, and establish what an automated-first pipeline looks like for that team's specific codebase and deployment model. TESTR and PASSR are part of that picture, as is the DevOps configuration that feeds them.

Not sure where your organisation stands today? The Vibe Coding Transformation Readiness Quick Check ↗ takes five minutes and gives you a per-function view of where your pipeline is most exposed.

If your team is shipping faster with vibe coding and the QA queue is growing to match, that's the conversation to start ↗.

Key takeaways

✅ Vibe coding relocates the bottleneck; it doesn't remove it: If QA and DevOps are unchanged, the pipeline becomes the new constraint almost immediately after the developer transformation takes hold.
✅ Manual regression doesn't scale with AI-generated output volume: The testing model has to change at the same time as the development model - not after the queue builds up.
✅ Shift-left quality means acceptance signals defined in the brief: Testing at the end of the sequence is the most expensive time to do it. Quality has to enter the workflow at the requirements stage, not the QA stage.
✅ 45% of AI-generated code fails security tests (Veracode Spring 2026): This is a pipeline governance problem, not just a developer problem. QA owns the catch point - and most teams don't have one in the right place.
✅ Fragmented ownership is how vulnerabilities stay in production: When the prompt author, AI, and reviewer are separate people without explicit accountability, nobody truly owns the security posture of what shipped.
✅ Shadow IT risk grows with vibe coding adoption: Non-engineering teams will build their own tools. AppSec governance needs to extend to everything that touches production systems - not just what engineering ships.
✅ TESTR keeps unit test coverage current as output volume increases: Auto-generated, auto-run on every commit - coverage scales with the team instead of lagging behind it, without anyone having to schedule it.
✅ PASSR catches Performance, Availability, Security, and Scalability issues on every PR: Before human review, with description, impact, and a ready fix. The PASSR portal makes quality trends visible across all repos over time.
✅ Pipeline configuration is a first-class engineering concern: Parallelisation, incremental testing, and environment parity aren't optional refinements. In a vibe coding org, a slow pipeline is as much a bottleneck as a slow developer.

Originally published at flytebit.com ↗ on May 28, 2026.
