
Amine

AI Writes Code in Hours. It Still Takes Weeks to Ship.

Teams adopt AI-assisted coding tools, feature code ships in hours instead of days, PRs fly in faster than ever. Stripe recently shared that their AI agents produce over 1,300 merged pull requests per week with no human-written code.

Yet overall delivery speed stalls. QA is still a week behind. Staging is broken because three teams deployed conflicting changes to the same shared environment on Tuesday. The PRs pile up, waiting for a slot in an environment that’s currently on fire.

The bottleneck was never writing code. It was everything that happens after: coordination, validation, integration. And now that code is being produced by AI agents at machine speed, those bottlenecks are more painful and more visible than ever.

The fix isn’t an AI problem, it’s a delivery infrastructure problem. And I’ve been working on it since long before AI coding tools existed. Eight years ago, I joined a 30-person startup on a greenfield project and had the chance to implement preview environments from day one, starting with a simple Python script that spun up an environment per Git branch. Over the years, as the company grew into a French Next40 company valued at $5 billion, with hundreds of engineers shipping across dozens of services, that setup evolved into a full CD pipeline built on Kubernetes and ArgoCD.

This post shares the key delivery practices and lessons that most improved our shipping speed, offering actionable strategies for teams facing the same bottlenecks. Those foundations matter even more now that AI is flooding pipelines with code faster than most teams can validate it. Whether you’re just starting out or retrofitting an existing pipeline, there’s something here you can steal.


The Queue Where Lead Time Goes to Die

To talk about delivery speed, it helps to agree on what “fast” actually means. The DORA research program gives us four metrics for this:

Deployment frequency
Lead time for changes
Change failure rate
Time to restore service

Preview environments and branching strategies primarily attack the first two, and, done wrong, blow up the other two.

Most organizations I’ve worked with share some version of the same workflow. Developers create feature branches. Those branches eventually get merged into a staging branch. That branch gets deployed to a single shared staging environment. QA tests it. If staging is “blessed,” a release gets cut and shipped to production, usually a batch of changes from multiple teams.

Before I got the chance to build things differently, I lived inside this workflow at previous jobs, and it has at least four problems.

Staging is a scarce resource. To test meaningfully, QA needs a stable, known state. If team A deploys to staging and starts testing, and team B deploys on top of it mid-session, team A’s results are invalidated. So teams take turns. Only one set of changes can meaningfully occupy staging at a time. Every other team queues up, and that queue is where lead time goes to die.

Multiple features are mixed together in the same environment, so when something breaks, you genuinely can’t tell which change caused it. Debugging becomes archaeology, digging through layers of other people’s commits to find the one that broke the build.

Parallel work collides. Environment drift accumulates silently. Leftover test data from one team’s session pollutes another team’s tests. Config changes conflict in ways nobody anticipated.

And it only gets worse when branches live too long. They diverge from main over days or weeks. When they finally merge, the integration itself becomes a source of defects. The longer the branch lives, the more painful the merge. At a previous job, I saw two-week-old branches that required a full day of conflict resolution just to get back to green, and then the merge introduced bugs that neither branch had in isolation.

A common response is to create more staging environments. Teams go from one broken staging environment to four broken staging environments. More environments just move the chaos around. You don’t fix a systemic workflow problem by throwing more infrastructure at it.

Those experiences were exactly why I pushed to do things differently from the start.


What If Every PR Got Its Own World?

When I joined the startup, I had the chance to answer that question from day one. Our small team had few repositories and few CI pipelines. We wrote a simple Python script that spun up an environment per pull request. Someone opened a PR, and within a few minutes, a fresh environment was running with exactly that branch’s changes, accessible at a unique URL. QA could test it. Product could see it. Designers could click through it. When the PR merged, the environment disappeared.

A preview environment is exactly that: an on-demand, isolated, ephemeral environment spun up automatically for a specific branch or pull request, mirroring production as closely as feasible. It’s created by your CI/CD pipeline when a branch or PR is opened. It’s destroyed when the branch is merged or closed. It uses the same stack, configuration shape, and realistic data as production. And it gets a stable, shareable URL that anyone can open in a browser.
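The “stable, shareable URL” part is worth pinning down, because branch names are rarely DNS-safe. A minimal sketch of how a pipeline might derive one, assuming an illustrative `preview.example.com` wildcard domain (the domain and the 40-character cap are assumptions, not what we used):

```python
import re

def preview_host(branch: str, base_domain: str = "preview.example.com") -> str:
    """Derive a stable, DNS-safe preview hostname from a branch name.

    Assumes a <slug>.<base_domain> convention; the domain is illustrative.
    """
    # Collapse anything that isn't a lowercase letter or digit into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    slug = slug[:40].rstrip("-")  # keep the DNS label reasonably short
    return f"{slug}.{base_domain}"

print(preview_host("feature/ABC-123_new-pricing"))
# feature-abc-123-new-pricing.preview.example.com
```

Because the mapping is deterministic, every push to the same branch lands at the same URL, which is what lets QA and stakeholders bookmark it.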

Feedback loops compress from days to minutes: reviewers see and test running software within minutes of a push. QA cycle time drops when every PR gets its own testable environment instead of waiting for a slot in shared staging. Bugs get isolated to a single change, because the preview contains exactly one team’s work on top of a known-good baseline.

In the early days, with that Python script, this worked beautifully. It was almost trivial.


When the Script Stopped Being Enough

As the company grew, with more teams, more services, and more integration edges, the challenges scaled with us. Spin-up times crept longer as the stack grew. This was already painful with human-speed development. With AI agents opening PRs around the clock, we were about to hit a wall.

That Python script was simple, built for a world with few services. It used Helm to deploy apps into Kubernetes, but it ran sequentially, one service after another, in a single chain. If any deployment in the chain broke, the whole thing had to be rerun from the start. As the number of services grew, this became increasingly painful. A failure in service twelve meant re-deploying services one through eleven all over again. On top of that, database seeding took longer and longer as the dataset grew, dragging spin-up times out further with every new service we added.

That Python script got us far, but we outgrew it. We needed a proper CD pipeline. We built one on ArgoCD: deployments could run in parallel, fail independently, and recover without taking everything else down with them.
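The core difference is easy to see in miniature. A rough sketch of the idea, not our actual pipeline: fan deployments out concurrently and collect per-service results, so a failure in one service never forces a rerun of the others. The `deploy_fn` stub stands in for whatever really applies a service (a `helm upgrade`, an ArgoCD sync):

```python
from concurrent.futures import ThreadPoolExecutor

def deploy_all(services, deploy_fn, max_workers=8):
    """Deploy every service concurrently; one failure doesn't block the rest.

    Returns {service: error_message_or_None} so only the failed
    services need to be retried.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(deploy_fn, svc): svc for svc in services}
        for fut, svc in futures.items():
            try:
                fut.result()          # blocks until this deploy finishes
                results[svc] = None
            except Exception as exc:  # record the failure, keep going
                results[svc] = str(exc)
    return results

def fake_deploy(svc):  # stand-in deploy that fails for one service
    if svc == "billing":
        raise RuntimeError("chart render failed")

status = deploy_all(["api", "billing", "web"], fake_deploy)
failed = [s for s, err in status.items() if err]
print(failed)  # ['billing'] — only the failed service needs a rerun
```

Contrast with the old sequential chain, where a failure in service twelve meant redeploying services one through eleven all over again.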

You have to start small, get the basics right, and build from there. That’s where the maturity comes from. I’ve watched other organizations try to skip straight to preview environments with complex systems, and it didn’t go well. Flaky previews that take twenty minutes to spin up just teach developers to ignore them. We avoided most of those pitfalls because we took that path.


The Branch Is the Unit of Preview

Before you can make preview environments work, you need to get your branching strategy right. And this is where I see a lot of teams get confused.

Preview-driven workflows need a discrete unit to preview, an isolated branch with a clear scope that can be spun up, tested, and validated before it merges. That’s why short-lived feature branching is a natural fit. The pattern is to create a feature branch from main, push your work, and a preview environment spins up automatically from that branch. QA, stakeholders, and automated tests validate the change in isolation. Only when the branch passes review and testing does it get merged back into main.

The feature branch is the unit of preview. It’s what triggers the environment, what gets tested, and what gets promoted. That workflow is impossible if you’re committing straight to the trunk.

The discipline is in keeping these branches short-lived. A feature branch that lives for two or three days is healthy. One that lives for three weeks is another staging environment in disguise, and you’ll get the same merge pain and drift.

Short-lived branches reduce integration risk, so per-branch previews are cheap and safe. You’re previewing a small, well-scoped change, not a two-week divergence. Main stays releasable at all times, so previews serve their intended purpose: validation before merge. And the whole system reinforces small-batch thinking, which is a strong lever for improving both lead time and change failure rate.

The rules are simple: no long-lived feature branches; every change is expected to be mergeable and releasable when tests pass; and teams use feature flags to keep incomplete features dark while still merging frequently into main. Flags decouple deployment (a technical event: new code is running) from release (a business event: users can see it). That’s what makes short-lived branches work for features that span multiple PRs. You merge each piece into main as it’s done, hidden from users, without keeping branches alive for weeks. And with progressive rollout, you can extend this further: ship to 5% of users, then 25%, then everyone, with kill switches for instant rollback if something goes wrong.
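A progressive rollout flag can be as small as a deterministic hash bucket. A sketch under assumptions (the flag name and user IDs are made up; real systems would layer targeting rules and a config store on top):

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage rollout: the same user always gets the
    same answer for a given flag, and raising the percentage only ever
    adds users; it never flips someone back off."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket 0-99 per (flag, user)
    return bucket < rollout_percent

# The kill switch is just setting the percentage to 0.
enabled = sum(flag_enabled("new-pricing", f"user-{i}", 25) for i in range(1000))
print(enabled)  # roughly 250 of the 1000 sampled users
```

Hashing on `(flag, user)` rather than `user` alone keeps different flags’ rollouts uncorrelated, so the same unlucky 5% of users don’t receive every experiment at once.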


The Foundations Nobody Wants to Build

The forty-five-minute pipeline. At one point, our CI pipeline was taking forty-five minutes per run. Developers started batching changes and pushing less often. Instead of small, frequent pushes feeding into fast previews, we got large, infrequent pushes that defeated the whole point. We invested weeks trimming it down, and the improvement in delivery cadence was immediate.

Your system needs to build and test every commit on every PR automatically (unit tests, integration tests, static analysis, security scanning), but two things matter more than coverage numbers: speed and trustworthiness. Flaky tests are worse than slow tests. Flaky tests teach developers to ignore failures, and once that happens, your entire quality gate degrades. We enforced a hard gate early on: no merge to main without green CI and at least one code review. No exceptions.

The deploy that nobody could reproduce. Early on, deployments to preview environments sometimes drifted from what ran in production. A config value hardcoded here, a manual step skipped there. Every deployment to preview, staging, and production must be fully automated and repeatable. No manual SSH. No copy-pasting configs. The principle we settled on was “build once, deploy many”: produce a single artifact (a Docker container) from CI and promote that exact artifact through environments. Only configuration differs. Define your deployment pipeline as version-controlled code (CI workflows, ArgoCD manifests, Helm charts) so every environment gets deployed the same way. That’s what prevents config drift and keeps preview environments trustworthy.
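“Build once, deploy many” reduces to a simple invariant you can encode and check: the artifact reference is pinned once, and environments only layer configuration on top. A minimal sketch with made-up registry and config values:

```python
# One immutable artifact, pinned by digest, shared by every environment.
BASE = {
    "image": "registry.example.com/api@sha256:abc123",
    "replicas": 3,
    "log_level": "info",
}

# Per-environment overrides touch configuration only, never the image.
OVERRIDES = {
    "preview":    {"replicas": 1, "log_level": "debug"},
    "staging":    {"replicas": 2},
    "production": {},
}

def render(env: str) -> dict:
    """Merge base settings with an environment's overrides."""
    return {**BASE, **OVERRIDES[env]}

# The invariant: the exact same artifact runs everywhere.
assert render("preview")["image"] == render("production")["image"]
```

Helm values files and ArgoCD application manifests express the same layering; the point is that nothing per-environment can ever change *which* code runs, only how it’s configured.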

The “works on my machine” problem, at infrastructure scale. Preview, staging, and production should be as similar as possible in topology, runtime, and configuration shape. If the preview runs on a fundamentally different stack, you’re testing the wrong thing. IaC (Terraform, Pulumi, Kubernetes manifests, Helm charts) lets you define infrastructure once as templates, parameterizing only what must differ between environments. Containerization makes per-branch spin-up and tear-down practical. Without it, spinning up a fresh environment per PR is an exercise in pain. With it, a preview environment becomes just another parameterized deployment of the same infrastructure your production already runs on.

The bug that nobody saw coming. One week, we noticed storage for our staging logs spiking for no apparent reason, and a few endpoints showed slightly higher latency. It turned out a bug in our logging library was outputting raw PDFs, base64-encoded, into the logs. We caught it in a preview environment before it reached production and before it could leak customer data into our log pipeline. That’s why observability matters at every layer: centralized logging, metrics, dashboards, and distributed tracing. In previews, observability catches error spikes and performance regressions before they reach shared environments. In production, it’s what keeps your change failure rate and time to restore under control as you ship more frequently. Track your DORA metrics before and after adopting previews: the improvement should be measurable. If it’s not, something in the foundation is weak.


Previews at Microservices Scale

At microservices scale, these foundations get tested hard. A preview becomes a composite: frontend, backend services, databases, message queues, the full vertical slice. A backend developer changes the pricing service, a frontend developer builds the new pricing UI, and the CI/CD system spins up a preview where both run at their PR versions while every other service runs at the stable main version. Both developers share the same preview URL and iterate together.

In my experience, every preview environment should be a full clone of your entire stack. All services, all backing infrastructure. Partial deployments, where you spin up only the changed services and route to a shared baseline for the rest, introduce subtle, hard-to-reproduce bugs. You’re testing against a routing illusion, not a real environment. You trade infrastructure cost for debugging cost, and in my experience, debugging cost is almost always more expensive.

The cost question always comes up, but it’s manageable. Previews are ephemeral and fault-tolerant by nature. Use spot instances to cut compute cost by 60–90%. Right-size aggressively: one small replica per service, minimum CPU and memory. You’re testing correctness, not load capacity. Use clever lifecycle management: hard TTLs, automatic teardown on PR merge/close, and a sweeper job for anything that slips through. Zombie environments are the real cost killer, not active previews. Scale to zero when idle, pause everything outside business hours. With these strategies, full-clone previews often cost roughly the same as a single shared staging environment once you account for the developer time wasted fighting that shared environment. What you get in return is a preview that behaves like production, every time.
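The sweeper job mentioned above is worth making concrete, since zombie environments are the real cost killer. A sketch of the core decision, assuming each environment record carries its PR state and creation time (the 48-hour TTL and field names are illustrative):

```python
from datetime import datetime, timedelta, timezone

def sweep(envs, ttl_hours=48, now=None):
    """Return the environments a scheduled job should tear down:
    anything whose PR is closed, or anything past its hard TTL."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=ttl_hours)
    return [e["name"] for e in envs
            if e["pr_state"] == "closed" or e["created_at"] < cutoff]

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
envs = [
    {"name": "pr-101", "pr_state": "open",   "created_at": now - timedelta(hours=6)},
    {"name": "pr-87",  "pr_state": "closed", "created_at": now - timedelta(hours=6)},
    {"name": "pr-42",  "pr_state": "open",   "created_at": now - timedelta(hours=72)},
]
print(sweep(envs, ttl_hours=48, now=now))  # ['pr-87', 'pr-42']
```

Teardown-on-merge handles the happy path; the sweeper exists precisely for the environments that slip through when a webhook is missed or a PR is closed without merging.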

For microservices specifically, a few additional prerequisites apply: each service needs its own CI/CD pipeline producing versioned artifacts, you need an orchestration layer that can spin up multi-service previews, and service discovery must handle ephemeral environments cleanly. On the testing side, contract tests against actual preview APIs catch breaking changes far more reliably than mocks, and end-to-end tests running against the preview URL exercise critical flows across service boundaries in an isolated, stable environment.
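A consumer-side contract check against a live preview can be very small. A hedged sketch: the endpoint path and fields are invented for illustration, and `fetch_json` abstracts the HTTP call so the same check runs against any preview URL (in CI it would wrap something like `requests.get(f"{preview_url}{path}").json()`):

```python
def check_pricing_contract(fetch_json):
    """Verify the pricing endpoint returns the fields the frontend
    depends on, with the expected types. Returns a list of violations,
    empty when the contract is satisfied."""
    body = fetch_json("/api/v1/pricing")
    required = {"plan": str, "amount_cents": int, "currency": str}
    return [f"{field}: expected {t.__name__}"
            for field, t in required.items()
            if not isinstance(body.get(field), t)]

# A stub stands in for the running preview in this sketch.
stub = lambda path: {"plan": "pro", "amount_cents": 4900, "currency": "EUR"}
print(check_pricing_contract(stub))  # [] — contract satisfied
```

Unlike a mock, running this against the actual preview API catches the day the backend renames `amount_cents`, which is exactly the class of breaking change mocks hide.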


Start With One Team, One Service

If you’re early on this path, start with an honest assessment.

Can you deploy to production whenever you want?
Is main always in a releasable state?
Are your tests fast and trustworthy?

If the answer to any of those is “not really,” that’s where to focus first, not on spinning up preview infrastructure.

The adoption path I’ve seen work looks roughly like this: first, clean up your branching model, move to short-lived feature branches and keep main deployable at all times. Then, invest in CI, automated tests, hard merge gates, and fast feedback. Next, automate your deployments end to end and codify your infrastructure. Add feature flags so you can decouple deployment from release and start doing smaller, safer releases. Only then should you enable preview environments. On each PR, spin up an environment, run automated tests, publish the URL, tear it down on merge. If you’re running microservices, add composite previews, full-clone environments with aggressive cost controls, contract tests, and cross-service end-to-end tests. And throughout: measure your DORA metrics before and after. The improvement should be visible. If it’s not, something in the foundation is weak.

Each layer builds on the last. Skip a layer, and you’ll end up with a fragile, expensive system that creates more problems than it solves. I’ve watched organizations bolt previews onto an immature pipeline and get flaky environments that take twenty minutes to spin up and fail half the time. I’ve seen teams pair previews with long-lived branches. The previews work in isolation, but merges to main still break everything. I’ve seen organizations stack too many staging layers: dev, QA, UAT, pre-prod, staging, each with its own queue, slowing delivery without proportional safety. I’ve seen production data leak into preview environments because nobody thought about data isolation. Anonymized seed data and per-environment secret scoping aren’t glamorous work, but they’re essential. And I’ve seen plenty of cloud bills spike from zombie environments that nobody remembered to clean up or was too afraid to remove.

That’s essentially what we did. We started with one team, a few services, and a Python script. Then, we tightened the pipeline as we grew. Added more services. Evolved the tooling. Measured the impact. Let the results justify the investment in each next layer. And when we needed to bring new teams onboard, the approach was the same: start with the foundations, add previews once they’re solid, and let the numbers make the case.


The Amplifier

By the time the company reached 700 engineers, preview environments spanning more than 200 applications spun up in roughly five minutes. On any given day, around 150 preview environments were running in parallel, each one an isolated, full-clone world where a team could validate their changes without waiting for anyone else.

AI tools will keep making code faster to write. Coding assistants will get better, code generation will get cheaper, and PRs will keep flying in faster than ever. When an AI agent opens a PR, a preview environment spins up automatically, and now you can validate the AI’s output in a running system, in isolation, before it touches main. No shared staging queue. No guessing whether the generated code actually works end to end. The teams that win won’t be the fastest coders; they’ll be the ones who can validate and ship without breaking stride.

Preview environments amplify your delivery process: invest first in solid foundations, fast, trustworthy CI, automated deployments, and short-lived branches, then layer on previews to unlock real speed and reliability.

So the next time your team’s AI assistant churns out a feature in an afternoon and the PR sits in a queue for a week, don’t blame the tools. Look at what happens after the code is written. That’s where the real bottleneck lives, and that’s where preview environments, done right, make all the difference.

Originally published on Medium.
