
Kubernetes is the de facto standard for running containerized workloads at scale.
But when it comes to deploying safely, its default approach is surprisingly limited.
You get:
- Readiness probes
- Rolling updates
And that's about it.
For production systems, that's not enough.
Rolling updates don't control risk — they just distribute it.
That's exactly the gap Argo Rollouts is designed to solve.
## The Problem with Rolling Updates
Kubernetes rolling updates provide a basic safety net:
- Pods are replaced gradually
- Health checks ensure they're alive
But they don't give you:
- Control over who sees the new version
- Visibility into real production impact
- Metric-based validation before proceeding
- Automatic rollback on failure
So most deployments still look like:
Deploy → Wait → Hope nothing breaks
And when things go wrong — they affect everyone at once.
## What Argo Rollouts Actually Does
Argo Rollouts brings progressive delivery to Kubernetes.
Instead of pushing changes globally, it lets you:
- Gradually expose changes to a subset of users
- Measure real-world impact using your existing metrics
- Automatically decide whether to proceed or roll back
It introduces a new custom resource: `Rollout` — a drop-in replacement for `Deployment` with progressive delivery built in.

💡 Note: Argo Rollouts does not interfere with existing `Deployment` resources. It only acts on `Rollout` objects — so you can introduce it incrementally, one service at a time.
## How Argo Rollouts Works (Under the Hood)
It's not a single tool — it's a system of components working together.
### `Rollout` (CRD)
The core resource. Defines:
- Strategy (blue-green or canary)
- Step-by-step rollout logic
- Analysis and traffic rules
### Controller

The brain of the system:
- Watches `Rollout` changes
- Creates and manages `ReplicaSets`
- Progresses or aborts deployments
- Ignores standard `Deployment` objects completely
### `ReplicaSets`
Managed automatically:
- Old version → scaled down
- New version → scaled up
- You never touch these directly
### Services & Ingress
- Control traffic routing between versions
- Enable canary and blue-green switching
- Integrate with service meshes
### `AnalysisTemplate`
Defines:
- What metrics to check
- How often to check them
- What counts as success or failure
- Reusable across multiple rollouts
### `AnalysisRun`
The live execution of those checks:
- ✅ Success → rollout continues
- ❌ Failure → automatic rollback
- ⏸ Inconclusive → rollout pauses for human judgement
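A hedged sketch of how these three outcomes map onto fields in the Argo Rollouts metric spec (`successCondition`, `failureCondition`, `failureLimit`, and `inconclusiveLimit` are real fields; the thresholds and the Prometheus address are illustrative):

```yaml
metrics:
  - name: success-rate
    interval: 1m
    # ✅ measurement succeeds when this holds
    successCondition: result[0] >= 0.95
    # ❌ measurement fails when this holds
    failureCondition: result[0] < 0.90
    # a result between 0.90 and 0.95 matches neither condition,
    # so that measurement counts as ⏸ inconclusive
    failureLimit: 3          # abort the rollout after this many failed measurements
    inconclusiveLimit: 2     # pause for human judgement after this many inconclusive ones
    provider:
      prometheus:
        address: http://prometheus.example.com:9090   # illustrative address
        query: sum(rate(http_requests_total{status!~"5.."}[1m])) / sum(rate(http_requests_total[1m]))
```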
### `Experiment`
Run stable and canary versions side-by-side under identical traffic. This gives you true A/B testing in production — no timing bias, no guesswork.
## Progressive Delivery in Plain Language
Instead of:
Deploy → Hope
You move to:
Deploy → Observe → Decide
You define the signals that indicate a healthy deploy:
- HTTP success rate
- Error rate
- Latency P99
And the system enforces them automatically. If metrics degrade, the rollout stops. No pager. No incident call. No scrambling.
## 🔵 Blue-Green vs 🟡 Canary: Which Should You Use?
This is where most teams overcomplicate things.
### Blue-Green
Two versions run simultaneously:
- Old → serves 100% of live traffic
- New → idle, being tested via a preview service
When you're satisfied, traffic switches instantly.
Why it works: Only one version is ever active — no version-conflict issues.
Best for: Shared databases, queue workers, legacy applications.
Trade-off: The switch is all-or-nothing. No gradual exposure.
### Canary
Traffic is gradually shifted:
5% → 25% → 50% → 100%
At each step, metrics are evaluated. The rollout proceeds, pauses, or aborts based on what it sees.
Why it works: Limits the blast radius of failures to a small percentage of users.
Requirements: A traffic routing layer (Istio, NGINX, etc.) and an app that can safely run two versions simultaneously.
Trade-off: More complexity, but far lower risk per release.
### Quick Comparison
| Feature | Blue-Green | Canary |
|---|---|---|
| Complexity | Low | Medium–High |
| Risk (blast radius) | High (instant switch) | Low |
| Traffic control | 0% or 100% | Gradual % |
| Works with legacy systems | ✅ Yes | ❌ Often no |
| Works with queue workers | ✅ Yes | ❌ No |
| Requires traffic manager | ❌ No | ✅ Usually |
Rule of thumb: Start with blue-green. No extra infrastructure, works everywhere, immediate improvement over rolling updates. Evolve to canary once you trust your metrics and your system supports dual versions.
## Traffic Management: The Layer Kubernetes Is Missing
Native Kubernetes Services can only route traffic based on pod selectors. They can't:
- Split traffic by exact percentage
- Route based on HTTP headers
- Mirror traffic silently
That's where service meshes come in. Argo Rollouts integrates with:
- Istio
- NGINX Ingress
- AWS ALB
- Traefik
- Ambassador, Kong, Apache APISIX, SMI, Google Cloud and more
### Three Advanced Routing Techniques

#### 1. Percentage-Based Routing

The baseline. Route N% to canary, the rest to stable. Works with all providers.

```text
90% → stable service
10% → canary service
```
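As a sketch, this split is expressed as canary steps plus a `trafficRouting` block. This example assumes the NGINX Ingress provider and a hypothetical existing ingress named `my-app-ingress`; the weights and pause duration are illustrative:

```yaml
strategy:
  canary:
    # tells the controller which ingress to manipulate for the split
    trafficRouting:
      nginx:
        stableIngress: my-app-ingress   # hypothetical ingress name
    steps:
      - setWeight: 10   # 10% → canary, 90% → stable
      - pause: {duration: 10m}
      - setWeight: 100
```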
#### 2. Header-Based Routing (Istio only)
Route internal users, QA teams, or beta testers to the new version based on a custom HTTP header — regardless of the overall traffic percentage.
```yaml
- setHeaderRoute:
    name: "internal-test"
    match:
      - headerName: X-Canary-User
        headerValue:
          exact: "true"
```
#### 3. Traffic Mirroring (Istio only)
Copy real production traffic to the canary silently. Users always see the stable response — the canary's response is discarded. This lets you validate the new version under real load with zero user impact.
```yaml
- setMirrorRoute:
    name: mirror-route
    percentage: 35
    match:
      - method:
          exact: GET
        path:
          prefix: /
```
💡 Mirroring is one of the safest ways to validate changes before exposing them to any users.
## Automated Analysis: Where the Real Power Is
This is what separates Argo Rollouts from basic deployment tools.
You define rules:
```yaml
successCondition: result[0] >= 0.95   # success rate >= 95%
failureLimit: 3                       # abort after 3 failures
```
Argo will:
- Continue the rollout if metrics are healthy
- Rollback automatically if metrics fail
- Pause if the picture is unclear
No manual intervention needed.
### Types of Analysis
Background Analysis — runs continuously during the canary steps. Fails at any point → rollout aborts.
Inline Analysis — a blocking step in your rollout sequence. Rollout waits until this completes before proceeding.
Blue-Green Pre-Promotion — validates the new version before traffic switches. Fails → traffic never switches.
Blue-Green Post-Promotion — validates the new version after traffic switches. Fails → traffic switches back automatically.
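For the two blue-green variants, the hooks are the `prePromotionAnalysis` and `postPromotionAnalysis` fields on the strategy. A minimal sketch, assuming a `success-rate` `AnalysisTemplate` exists and hypothetical service names:

```yaml
strategy:
  blueGreen:
    activeService: my-app-active     # hypothetical service names
    previewService: my-app-preview
    # runs against the preview stack; failure blocks the traffic switch
    prePromotionAnalysis:
      templates:
        - templateName: success-rate
    # runs after the switch; failure flips traffic back automatically
    postPromotionAnalysis:
      templates:
        - templateName: success-rate
```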
### Full Example: Background Analysis with Prometheus
```yaml
# Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2   # don't start analysis until 40% traffic
        args:
          - name: service-name
            value: my-service.default.svc.cluster.local
      steps:
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 40
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 10m}
        - setWeight: 80
        - pause: {duration: 10m}
```

```yaml
# AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 5m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com:9090
          query: |
            sum(irate(
              istio_requests_total{
                destination_service=~"{{args.service-name}}",
                response_code!~"5.*"
              }[5m]
            )) /
            sum(irate(
              istio_requests_total{
                destination_service=~"{{args.service-name}}"
              }[5m]
            ))
```
If the success rate drops below 95% in three 5-minute measurement windows, the rollout aborts: the canary weight resets to zero and the old version continues serving 100% of traffic.
### Supported Metric Providers
| Provider | Notes |
|---|---|
| Prometheus | Most common, full query support |
| Datadog | Use default() to handle nil results |
| New Relic | NRQL queries |
| CloudWatch | AWS-native metrics |
| Wavefront | Tanzu environments |
| Graphite | Self-hosted metrics |
| InfluxDB | Time-series metrics |
| Kayenta | Canary analysis from Spinnaker |
| Web (HTTP) | Custom webhook endpoints |
| Kubernetes Jobs | Run arbitrary analysis as a Job |
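As one concrete example from the table, the Web (HTTP) provider polls an endpoint and evaluates a field from its JSON response. A minimal sketch, assuming a hypothetical health endpoint that returns `{"ok_ratio": 0.97}`:

```yaml
metrics:
  - name: webhook-check
    successCondition: result >= 0.95
    provider:
      web:
        url: http://health.example.com/api/status   # hypothetical endpoint
        jsonPath: "{$.ok_ratio}"   # extracted value becomes `result`
```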
## Deploying with Helm: A Real Walkthrough
In real systems you won't apply raw YAML manually. Here's a complete Helm-based setup.
### Step 1: Install the Controller

```shell
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```
### Step 2: Install the kubectl Plugin

```shell
# macOS
brew install argoproj/tap/kubectl-argo-rollouts

# Linux
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
```
### Step 3: Scaffold a Chart

```shell
helm create my-app
cd my-app
```

Delete `templates/deployment.yaml` and create `templates/rollout.yaml`.
### Step 4: Blue-Green Rollout Template

```yaml
# templates/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "my-app.fullname" . }}
spec:
  replicas: 3
  selector:
    matchLabels:
      app: {{ include "my-app.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "my-app.name" . }}
    spec:
      containers:
        - name: app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 80
  strategy:
    blueGreen:
      activeService: {{ include "my-app.fullname" . }}-active
      previewService: {{ include "my-app.fullname" . }}-preview
      autoPromotionEnabled: false
```
### Step 5: Define Two Services

```yaml
# templates/service-active.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-app.fullname" . }}-active
spec:
  selector:
    app: {{ include "my-app.name" . }}
  ports:
    - port: 80
      targetPort: 80
```

```yaml
# templates/service-preview.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-app.fullname" . }}-preview
spec:
  selector:
    app: {{ include "my-app.name" . }}
  ports:
    - port: 80
      targetPort: 80
```
### Step 6: Deploy and Upgrade

```shell
# Initial deploy
helm install my-app ./my-app

# Deploy new version
helm upgrade my-app ./my-app --set image.tag=v2
```
Argo Rollouts deploys v2 to the preview service. Production traffic stays on v1 untouched.
Test against `my-app-preview`. When satisfied:
```shell
kubectl argo rollouts promote my-app
```
Traffic switches to v2 instantly.
### Graduating to Canary

```yaml
strategy:
  canary:
    steps:
      - setWeight: 20
      - pause: {duration: 30s}
      - setWeight: 50
      - pause: {duration: 60s}
      - setWeight: 100
```
Traffic shifts 20% → 50% → 100% with a pause at each stage for observation.
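A `pause` step can also omit `duration`; the rollout then holds at that weight until someone promotes it manually, which is a useful middle ground before full automation. A sketch:

```yaml
strategy:
  canary:
    steps:
      - setWeight: 20
      - pause: {}   # no duration: wait here until manually promoted
      - setWeight: 100
```

Resume a rollout paused this way with `kubectl argo rollouts promote my-app`.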
### Adding Metric-Driven Analysis

```yaml
# templates/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status!~"5.."}[1m])) /
            sum(rate(http_requests_total[1m]))
```
Attach it to your rollout:
```yaml
strategy:
  canary:
    analysis:
      templates:
        - templateName: success-rate
    steps:
      - setWeight: 20
      - pause: {duration: 1m}
```
Now if success rate drops below 95% → rollout aborts automatically. No human needed.
## Common Mistakes
### "Rolling updates are enough"
They're not. You're still guessing. A bad deploy reaches all your users before you know it's bad.
### "Canary is always better"
Not if your system can't handle two versions running simultaneously. Shared databases, queue workers, and locked resources all rule it out. Start with blue-green.
### "Metrics are optional"
They're not optional — they're the entire decision-making system. Without metrics, Argo Rollouts is just a more complicated way to do a rolling update.
### "Start with the full setup"
Don't introduce a traffic provider, metric analysis, header routing, and canary all at once. You'll be debugging the infrastructure instead of your application. Layer in complexity one step at a time.
## Recommended Adoption Path
| Step | What to add | What you gain |
|---|---|---|
| 1 | Blue-green + Helm | Instant rollback, zero infrastructure change |
| 2 | Manual promotion gates | Human review before traffic switches |
| 3 | Metric-based analysis | Automatic rollback on failure |
| 4 | Traffic provider (Istio/NGINX) | Exact percentage splits |
| 5 | Canary strategy | Low blast radius per release |
| 6 | Header routing + mirroring | Zero-impact production validation |
Stop at whichever step still feels worth the complexity for your team. Step 3 alone is a meaningful improvement for most organisations.
## Final Takeaway
The shift Argo Rollouts enables isn't just technical. It's this:
You're no longer deploying code. You're managing risk.
Argo Rollouts gives you control, visibility, and automation. But it doesn't define "safe" for you — that's still your responsibility. You set the thresholds. You define what success looks like. You decide how aggressive your rollout steps are.
The system executes your definition of safe, automatically, at every deploy.
When rollbacks are instant, the cost of a bad deploy drops dramatically. When canaries give you real signal before full exposure, the fear that slows release cadence starts to lift. Teams that once deployed weekly because deployments were scary start deploying daily — not because they became less careful, but because the system became more forgiving.
That's the real promise of progressive delivery.