Rivail Pinto

Posted on Apr 9

Stop Treating Deploys Like a Gamble

#cicd #devops #management #softwareengineering

Here's what changed.

A while back, our team had a problem that looked like a success: we were shipping constantly. PRs merged daily, features going out every week, stakeholders happy. Then one Friday afternoon, a release took down a core flow for 20% of our users. No rollback plan. No feature flags. No one sure which commit was the culprit.

We weren't moving fast. We were just falling forward.

If you're a tech lead or engineering manager, you've probably lived some version of this. So here's what we actually changed — not the ideal textbook version, but the stuff that made a real difference.

The bad habits that sneak in

These aren't rookie mistakes. They're patterns that emerge when teams are under pressure and moving quickly:

Treating "merged" as "shipped"
Merging to main and deploying to production become the same action. No staging. No smoke tests. No window to catch regressions before they reach users.

Manual release checklists
A shared doc with 30 steps that someone runs through by hand before every deploy. It works until it doesn't — because humans skip steps, especially at 5pm on a Friday.

All-or-nothing deployments
Every release goes to 100% of users immediately. No gradual rollout, no way to limit blast radius when something goes wrong.

No observability at deploy time
The team ships and then... waits. No automated checks on error rates, latency, or key business metrics. Bugs get reported by users, not caught by the system.

What actually helped us

1. Treat your pipeline as a product

Your CI/CD pipeline is infrastructure that your team uses every day. If it's slow, flaky, or hard to understand, people work around it — and that's when shortcuts happen.

We invested time making our pipeline fast and reliable: parallelizing test suites, caching aggressively, and making failure messages actually useful. When the pipeline became trustworthy, people stopped bypassing it.
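One common way to parallelize a suite is to split test files across several CI jobs so that each job finishes in about the same time. The sketch below illustrates the technique and is not the setup described above. It assumes you already have per-file timings from a previous run (the `timings` dict and file names are hypothetical), and it balances shards with the greedy "longest first" heuristic.

```python
import heapq
from typing import Dict, List


def shard_tests(durations: Dict[str, float], num_shards: int) -> List[List[str]]:
    """Split test files into num_shards groups with roughly equal total runtime.

    Greedy "longest processing time first": sort files slowest-first and
    always give the next file to the shard with the smallest total so far.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    # Min-heap of (total_seconds, shard_index): the lightest shard pops first.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    # Slowest files first; tie-break on name so the split is deterministic.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + secs, idx))
    return shards


# Hypothetical per-file timings (seconds) taken from a previous CI run.
timings = {"test_api.py": 120, "test_models.py": 90, "test_views.py": 60, "test_utils.py": 30}
shards = shard_tests(timings, 2)
# → [["test_api.py", "test_utils.py"], ["test_models.py", "test_views.py"]]
```

Each parallel job would then run only `shards[i]` for its own index. Splitting by timing usually beats splitting by file count, because a single slow integration file can otherwise leave one job running long after the others have finished.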

2. Feature flags before feature branches

Long-lived feature branches are a trap. They accumulate merge debt and create painful integration moments. We shifted toward shipping behind feature flags: code goes to production, but it's switched off. That decoupled deploy from release, and it changed how we thought about risk. A deploy no longer exposed anything new to users, and turning a feature off became a flag flip instead of an emergency revert.

Tools like LaunchDarkly, Unleash, or even a simple home-rolled flag system work fine here. The point isn't the tool — it's the mindset shift.
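To make the home-rolled option concrete, here is a minimal sketch of a flag check with a master switch, explicit user targeting, and a percentage rollout. The `Flag` and `FlagStore` names and the hashing scheme are assumptions made for illustration. Hosted tools like LaunchDarkly or Unleash ship their own SDKs with richer targeting.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Flag:
    enabled: bool = False        # master switch: off means off for everyone
    rollout_percent: int = 0     # 0-100, share of users who see the feature
    allow_users: Set[str] = field(default_factory=set)  # always-on targeting (e.g. QA)


def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


class FlagStore:
    """Hypothetical in-memory flag store; a real one would load from config."""

    def __init__(self, flags: Dict[str, Flag]):
        self.flags = flags

    def is_enabled(self, flag_name: str, user_id: str) -> bool:
        flag = self.flags.get(flag_name)
        if flag is None or not flag.enabled:
            return False  # unknown or switched-off flags fail closed
        if user_id in flag.allow_users:
            return True
        return bucket(flag_name, user_id) < flag.rollout_percent


store = FlagStore({
    "new_checkout": Flag(enabled=True, rollout_percent=10, allow_users={"qa-1"}),
})
store.is_enabled("new_checkout", "qa-1")  # → True (explicitly targeted)
```

Hashing the flag name together with the user ID keeps each user's answer stable across requests. Raising `rollout_percent` only adds users to the cohort and never reshuffles it. Unknown or disabled flags return `False`, so a missing config fails safe.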

3. Automate the release checklist

If a step in your release process can be scripted, it should be. We moved our manual checklist into the pipeline itself: health checks, smoke tests, migration validations. What used to take 45 minutes of human attention became a 3-minute automated gate.

The checklist still exists — but now it's code, not a doc.
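A checklist-as-code can start as an ordered list of named checks that the pipeline runs before promoting a build. This is a minimal sketch under stated assumptions: the three check functions are hypothetical stubs standing in for a real health endpoint, smoke test, and migration check, and the runner stops at the first failure.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_release_gate(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run checks in order and stop at the first failure.

    Returns (passed, log). An exception inside a check counts as a failure,
    so a crashing script can never wave a release through.
    """
    log: List[str] = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception as exc:  # a broken check is a failed check
            log.append(f"FAIL {name}: {exc}")
            return False, log
        log.append(f"{'PASS' if ok else 'FAIL'} {name}")
        if not ok:
            return False, log
    return True, log


# Hypothetical stand-ins for real checks; each returns True on success.
def health_check() -> bool:
    return True  # e.g. the new build's health endpoint answers 200


def smoke_test_login() -> bool:
    return True  # e.g. a scripted login against staging succeeds


def migrations_applied() -> bool:
    return True  # e.g. the schema version matches what the build expects


RELEASE_CHECKS: List[Check] = [
    ("health_check", health_check),
    ("smoke_test_login", smoke_test_login),
    ("migrations_applied", migrations_applied),
]

passed, log = run_release_gate(RELEASE_CHECKS)
```

In a pipeline step you would turn the result into an exit code, for example `sys.exit(0 if passed else 1)`, because a nonzero exit is what makes the CI job fail and block the release.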

4. Progressive delivery as the default

Canary releases and gradual rollouts aren't just for big companies. Even at smaller scale, shipping to 5% of users first and watching your error rate for 15 minutes before going wider is a completely achievable practice. It's also the thing that will save you on the days something goes wrong.

Tools like Argo Rollouts, Spinnaker, or native cloud provider features (AWS CodeDeploy, traffic splitting in Google Cloud Run and App Engine) make this more accessible than it used to be.
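The "5% first, watch, then go wider" loop described above can be sketched in a few lines. This is an illustration only and does not reflect how Argo Rollouts or CodeDeploy work internally. `set_traffic` and `is_healthy` are hypothetical hooks that you would wire to your load balancer and metrics, and the step percentages and 15-minute soak are example values.

```python
import time
from typing import Callable, Sequence


def progressive_rollout(
    set_traffic: Callable[[int], None],   # hypothetical: % of traffic to the new version
    is_healthy: Callable[[], bool],       # hypothetical: metric check after each soak
    steps: Sequence[int] = (5, 25, 50, 100),
    soak_seconds: float = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Shift traffic to the new version step by step.

    After each step, wait soak_seconds and ask is_healthy(). On the first
    failed check, send all traffic back to the old version and return False.
    """
    for percent in steps:
        set_traffic(percent)
        sleep(soak_seconds)
        if not is_healthy():
            set_traffic(0)  # roll back: 0% to the new version
            return False
    return True
```

Injecting `sleep` lets you test the loop without waiting in real time. Dedicated tools add the operational pieces this sketch leaves out, which is a good reason to reach for them once the basic loop is proven.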

5. Automated rollback, not heroics

The worst post-mortems I've been part of involved someone manually reverting commits under pressure while Slack was on fire. Define what "bad" looks like in metrics (error rate spikes, p95 latency jumps), automate detection, and automate rollback. Make the rollback boring.
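Defining "bad" in metrics can start as a function that compares the new version's traffic window against the current baseline. This is a minimal sketch that assumes you can pull request counts, error counts, and latency samples for both versions over the same window. The thresholds are hypothetical starting points, not numbers from this post: an error rate more than 2x the baseline and at least one percentage point higher, or a p95 more than 1.5x the baseline.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class WindowMetrics:
    requests: int
    errors: int
    latencies_ms: Sequence[float]

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def p95(samples: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def rollback_reasons(
    baseline: WindowMetrics,
    candidate: WindowMetrics,
    max_error_ratio: float = 2.0,    # hypothetical threshold
    min_error_delta: float = 0.01,   # ignore jumps under 1 percentage point
    max_p95_ratio: float = 1.5,      # hypothetical threshold
) -> List[str]:
    """Return why the candidate looks bad; an empty list means keep going."""
    reasons: List[str] = []
    b_err, c_err = baseline.error_rate, candidate.error_rate
    if c_err - b_err > min_error_delta and c_err > b_err * max_error_ratio:
        reasons.append(f"error rate {c_err:.2%} vs baseline {b_err:.2%}")
    b_p95, c_p95 = p95(baseline.latencies_ms), p95(candidate.latencies_ms)
    if b_p95 > 0 and c_p95 > b_p95 * max_p95_ratio:
        reasons.append(f"p95 {c_p95:.0f} ms vs baseline {b_p95:.0f} ms")
    return reasons
```

A non-empty list means roll back, and the reasons can go straight into the alert or the deploy log. The minimum absolute delta stops a quiet window that moves from 0.5% to 1.2% errors from triggering a rollback on noise alone.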

Tools worth knowing

A few that have been genuinely useful, not just popular:

GitHub Actions / GitLab CI — solid defaults for most teams, good ecosystem
LaunchDarkly / Unleash — feature flag management with proper targeting and audit trails
Datadog / Grafana + Prometheus — deploy markers + dashboards so you can see the moment things shift
Argo Rollouts — Kubernetes-native progressive delivery with solid rollback support
Sentry — error tracking that surfaces issues before users start complaining

The real shift

The technical changes mattered, but the bigger shift was cultural: we stopped treating deployment as the finish line. Deployment is just one checkpoint. The finish line is "users are getting value and nothing is on fire."

When your pipeline is automated, your releases are gradual, and your rollback is boring — shipping stops feeling like a gamble.

That's when you can actually move fast.

What's one thing your team changed that made the biggest difference in how you ship? Drop it in the comments.

DEV Community