DEV Community

aman kohli
Argo Rollouts: Stop Gambling with Kubernetes Deployments

Kubernetes is the de facto standard for running containerized workloads at scale.

But when it comes to deploying safely, its default approach is surprisingly limited.

You get:

  • Readiness probes
  • Rolling updates

And that's about it.

For production systems, that's not enough.

Rolling updates don't control risk — they just distribute it.

That's exactly the gap Argo Rollouts is designed to solve.


The Problem with Rolling Updates

Kubernetes rolling updates provide a basic safety net:

  • Pods are replaced gradually
  • Health checks ensure they're alive

But they don't give you:

  • Control over who sees the new version
  • Visibility into real production impact
  • Metric-based validation before proceeding
  • Automatic rollback on failure

So most deployments still look like:

Deploy → Wait → Hope nothing breaks

And when things go wrong — they affect everyone at once.


What Argo Rollouts Actually Does

Argo Rollouts brings progressive delivery to Kubernetes.

Instead of pushing changes globally, it lets you:

  • Gradually expose changes to a subset of users
  • Measure real-world impact using your existing metrics
  • Automatically decide whether to proceed or roll back

It introduces a new custom resource:

Rollout — a drop-in replacement for Deployment with progressive delivery built in

💡 Note: Argo Rollouts does not interfere with existing Deployment resources. It only acts on Rollout objects — so you can introduce it incrementally, one service at a time.
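To make the drop-in nature concrete, here is a minimal sketch of a Rollout — everything except `kind` and the `strategy` block is a plain Deployment spec (the name, image, and labels are illustrative, not from a real setup):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout              # was: apps/v1 Deployment
metadata:
  name: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:                # identical pod template to a Deployment
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: demo
          image: demo:v1
  strategy:                # replaces the Deployment's rollingUpdate strategy
    canary:
      steps:
        - setWeight: 20
        - pause: {}        # wait indefinitely for manual promotion
```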


How Argo Rollouts Works (Under the Hood)

It's not a single tool — it's a system of components working together.

Rollout (CRD)

The core resource. Defines:

  • Strategy (blue-green or canary)
  • Step-by-step rollout logic
  • Analysis and traffic rules

Controller

The brain of the system:

  • Watches Rollout changes
  • Creates and manages ReplicaSets
  • Progresses or aborts deployments
  • Ignores standard Deployment objects completely

ReplicaSets

Managed automatically:

  • Old version → scaled down
  • New version → scaled up
  • You never touch these directly

Services & Ingress

  • Control traffic routing between versions
  • Enable canary and blue-green switching
  • Integrate with service meshes

AnalysisTemplate

Defines:

  • What metrics to check
  • How often to check them
  • What counts as success or failure
  • Reusable across multiple rollouts

AnalysisRun

The live execution of those checks:

  • Success → rollout continues
  • Failure → automatic rollback
  • Inconclusive → rollout pauses for human judgement

Experiment

Run stable and canary versions side-by-side under identical traffic. This gives you true A/B testing in production — no timing bias, no guesswork.
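As a rough sketch (names and images are illustrative — see the Argo Rollouts Experiment reference for the full spec), an Experiment runs two pod templates side-by-side for a fixed duration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: compare-v1-v2
spec:
  duration: 20m              # tear both variants down after 20 minutes
  templates:
    - name: baseline         # current stable version
      replicas: 2
      selector:
        matchLabels:
          app: demo
          variant: baseline
      template:
        metadata:
          labels:
            app: demo
            variant: baseline
        spec:
          containers:
            - name: demo
              image: demo:v1
    - name: canary           # candidate version, same replica count
      replicas: 2
      selector:
        matchLabels:
          app: demo
          variant: canary
      template:
        metadata:
          labels:
            app: demo
            variant: canary
        spec:
          containers:
            - name: demo
              image: demo:v2
```

Because both variants run at the same replica count over the same window, any metric difference between them reflects the code change, not the traffic.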


Progressive Delivery in Plain Language

Instead of:

Deploy → Hope

You move to:

Deploy → Observe → Decide

You define the signals that indicate a healthy deploy:

  • HTTP success rate
  • Error rate
  • Latency P99

And the system enforces them automatically. If metrics degrade, the rollout stops. No pager. No incident call. No scrambling.


🔵 Blue-Green vs 🟡 Canary: Which Should You Use?

This is where most teams overcomplicate things.

Blue-Green

Two versions run simultaneously:

  • Old → serves 100% of live traffic
  • New → idle, being tested via a preview service

When you're satisfied, traffic switches instantly.

Why it works: Only one version is ever active — no version-conflict issues.

Best for: Shared databases, queue workers, legacy applications.

Trade-off: The switch is all-or-nothing. No gradual exposure.


Canary

Traffic is gradually shifted:

```
5% → 25% → 50% → 100%
```

At each step, metrics are evaluated. The rollout proceeds, pauses, or aborts based on what it sees.

Why it works: Limits the blast radius of failures to a small percentage of users.

Requirements: A traffic routing layer (Istio, NGINX, etc.) and an app that can safely run two versions simultaneously.

Trade-off: More complexity, but far lower risk per release.


Quick Comparison

| Feature | Blue-Green | Canary |
|---|---|---|
| Complexity | Low | Medium–High |
| Risk (blast radius) | High (instant switch) | Low |
| Traffic control | 0% or 100% | Gradual % |
| Works with legacy systems | ✅ Yes | ❌ Often no |
| Works with queue workers | ✅ Yes | ❌ No |
| Requires traffic manager | ❌ No | ✅ Usually |

Rule of thumb: Start with blue-green. No extra infrastructure, works everywhere, immediate improvement over rolling updates. Evolve to canary once you trust your metrics and your system supports dual versions.


Traffic Management: The Layer Kubernetes Is Missing

Native Kubernetes Services can only route traffic based on pod selectors. They can't:

  • Split traffic by exact percentage
  • Route based on HTTP headers
  • Mirror traffic silently

That's where service meshes come in. Argo Rollouts integrates with:

  • Istio
  • NGINX Ingress
  • AWS ALB
  • Traefik
  • Ambassador, Kong, Apache APISIX, SMI, Google Cloud and more

Three Advanced Routing Techniques

1. Percentage-Based Routing

The baseline. Route N% to canary, the rest to stable. Works with all providers.

```
90% → stable service
10% → canary service
```
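In a Rollout, this split comes from the `trafficRouting` block of the canary strategy. A minimal sketch using the NGINX Ingress integration (the service and ingress names are assumptions for illustration):

```yaml
strategy:
  canary:
    canaryService: my-app-canary       # Service selecting canary pods
    stableService: my-app-stable       # Service selecting stable pods
    trafficRouting:
      nginx:
        stableIngress: my-app-ingress  # your existing Ingress for the stable Service
    steps:
      - setWeight: 10                  # NGINX now routes 10% of requests to the canary
      - pause: {}
```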

2. Header-Based Routing (Istio only)

Route internal users, QA teams, or beta testers to the new version based on a custom HTTP header — regardless of the overall traffic percentage.

```yaml
- setHeaderRoute:
    name: "internal-test"
    match:
    - headerName: X-Canary-User
      headerValue:
        exact: "true"
```

3. Traffic Mirroring (Istio only)

Copy real production traffic to the canary silently. Users always see the stable response — the canary's response is discarded. This lets you validate the new version under real load with zero user impact.

```yaml
- setMirrorRoute:
    name: mirror-route
    percentage: 35
    match:
      - method:
          exact: GET
        path:
          prefix: /
```

💡 Mirroring is one of the safest ways to validate changes before exposing them to any users.


Automated Analysis: Where the Real Power Is

This is what separates Argo Rollouts from basic deployment tools.

You define rules:

```yaml
successCondition: result[0] >= 0.95  # success rate >= 95%
failureLimit: 3                      # abort after 3 failures
```

Argo will:

  • Continue the rollout if metrics are healthy
  • Rollback automatically if metrics fail
  • Pause if the picture is unclear

No manual intervention needed.


Types of Analysis

Background Analysis — runs continuously during the canary steps. Fails at any point → rollout aborts.

Inline Analysis — a blocking step in your rollout sequence. Rollout waits until this completes before proceeding.

Blue-Green Pre-Promotion — validates the new version before traffic switches. Fails → traffic never switches.

Blue-Green Post-Promotion — validates the new version after traffic switches. Fails → traffic switches back automatically.
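In a blue-green Rollout, both hooks are declared alongside the strategy. A sketch (the template and service names are illustrative):

```yaml
strategy:
  blueGreen:
    activeService: my-app-active
    previewService: my-app-preview
    autoPromotionEnabled: false
    prePromotionAnalysis:          # must pass before traffic switches
      templates:
        - templateName: smoke-tests
      args:
        - name: service-name
          value: my-app-preview
    postPromotionAnalysis:         # failure here switches traffic back
      templates:
        - templateName: success-rate
```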


Full Example: Background Analysis with Prometheus

```yaml
# Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2   # don't start analysis until 40% traffic
        args:
        - name: service-name
          value: my-service.default.svc.cluster.local
      steps:
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 40
      - pause: {duration: 10m}
      - setWeight: 60
      - pause: {duration: 10m}
      - setWeight: 80
      - pause: {duration: 10m}
```
```yaml
# AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{
              destination_service=~"{{args.service-name}}",
              response_code!~"5.*"
            }[5m]
          )) /
          sum(irate(
            istio_requests_total{
              destination_service=~"{{args.service-name}}"
            }[5m]
          ))
```

If the success rate drops below 95% in three 5-minute measurement windows → the rollout aborts, the canary weight resets to zero, and the stable version continues serving 100% of traffic.

Supported Metric Providers

| Provider | Notes |
|---|---|
| Prometheus | Most common, full query support |
| Datadog | Use `default()` to handle nil results |
| New Relic | NRQL queries |
| CloudWatch | AWS-native metrics |
| Wavefront | Tanzu environments |
| Graphite | Self-hosted metrics |
| InfluxDB | Time-series metrics |
| Kayenta | Canary analysis from Spinnaker |
| Web (HTTP) | Custom webhook endpoints |
| Kubernetes Jobs | Run arbitrary analysis as a Job |
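The Kubernetes Jobs provider is handy when no metrics backend fits — any container that exits 0 counts as a successful measurement. A sketch (the image and health endpoint are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-tests
spec:
  metrics:
    - name: smoke-tests
      provider:
        job:                         # analysis passes if the Job succeeds
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl
                    args: ["-fsS", "http://my-app-preview/healthz"]
```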

Deploying with Helm: A Real Walkthrough

In real systems you won't apply raw YAML manually. Here's a complete Helm-based setup.

Step 1: Install the Controller

```shell
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```

Step 2: Install the kubectl Plugin

```shell
# macOS
brew install argoproj/tap/kubectl-argo-rollouts

# Linux
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
```

Step 3: Scaffold a Chart

```shell
helm create my-app
cd my-app
```

Delete templates/deployment.yaml and create templates/rollout.yaml.

Step 4: Blue-Green Rollout Template

```yaml
# templates/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "my-app.fullname" . }}
spec:
  replicas: 3
  selector:
    matchLabels:
      app: {{ include "my-app.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "my-app.name" . }}
    spec:
      containers:
        - name: app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 80
  strategy:
    blueGreen:
      activeService: {{ include "my-app.fullname" . }}-active
      previewService: {{ include "my-app.fullname" . }}-preview
      autoPromotionEnabled: false
```

Step 5: Define Two Services

```yaml
# templates/service-active.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-app.fullname" . }}-active
spec:
  selector:
    app: {{ include "my-app.name" . }}
  ports:
    - port: 80
      targetPort: 80
```
```yaml
# templates/service-preview.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-app.fullname" . }}-preview
spec:
  selector:
    app: {{ include "my-app.name" . }}
  ports:
    - port: 80
      targetPort: 80
```

Both Services start with the same selector — that's intentional. The Rollouts controller injects a `rollouts-pod-template-hash` selector into each at runtime, so the active Service always targets the stable ReplicaSet and the preview Service targets the new one.

Step 6: Deploy and Upgrade

```shell
# Initial deploy
helm install my-app ./my-app

# Deploy new version
helm upgrade my-app ./my-app --set image.tag=v2
```

Argo Rollouts deploys v2 to the preview service. Production traffic stays on v1 untouched.

Test against my-app-preview. When satisfied:

```shell
kubectl argo rollouts promote my-app
```

Traffic switches to v2 instantly.
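If you want to watch the switch — or back out of it — the kubectl plugin covers both (shown against the `my-app` rollout from this walkthrough):

```shell
# Watch ReplicaSets, traffic status, and analysis results live
kubectl argo rollouts get rollout my-app --watch

# Abort an in-progress update and send all traffic back to stable
kubectl argo rollouts abort my-app

# Roll back to a previous revision after promotion
kubectl argo rollouts undo my-app
```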


Graduating to Canary

```yaml
strategy:
  canary:
    steps:
      - setWeight: 20
      - pause: {duration: 30s}
      - setWeight: 50
      - pause: {duration: 60s}
      - setWeight: 100
```

Traffic shifts 20% → 50% → 100% with a pause at each stage for observation.

Adding Metric-Driven Analysis

```yaml
# templates/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status!~"5.."}[1m])) /
            sum(rate(http_requests_total[1m]))
```

Attach it to your rollout:

```yaml
strategy:
  canary:
    analysis:
      templates:
        - templateName: success-rate
    steps:
      - setWeight: 20
      - pause: {duration: 1m}
```

Now if success rate drops below 95% → rollout aborts automatically. No human needed.


Common Mistakes

"Rolling updates are enough"

They're not. You're still guessing. A bad deploy reaches all your users before you know it's bad.

"Canary is always better"

Not if your system can't handle two versions running simultaneously. Shared databases, queue workers, and locked resources all rule it out. Start with blue-green.

"Metrics are optional"

They're not optional — they're the entire decision-making system. Without metrics, Argo Rollouts is just a more complicated way to do a rolling update.

"Start with the full setup"

Don't introduce a traffic provider, metric analysis, header routing, and canary all at once. You'll be debugging the infrastructure instead of your application. Layer in complexity one step at a time.


Recommended Adoption Path

| Step | What to add | What you gain |
|---|---|---|
| 1 | Blue-green + Helm | Instant rollback, zero infrastructure change |
| 2 | Manual promotion gates | Human review before traffic switches |
| 3 | Metric-based analysis | Automatic rollback on failure |
| 4 | Traffic provider (Istio/NGINX) | Exact percentage splits |
| 5 | Canary strategy | Low blast radius per release |
| 6 | Header routing + mirroring | Zero-impact production validation |

Stop at whichever step still feels worth the complexity for your team. Step 3 alone is a meaningful improvement for most organisations.

Final Takeaway

The shift Argo Rollouts enables isn't just technical. It's this:

You're no longer deploying code. You're managing risk.

Argo Rollouts gives you control, visibility, and automation. But it doesn't define "safe" for you — that's still your responsibility. You set the thresholds. You define what success looks like. You decide how aggressive your rollout steps are.

The system executes your definition of safe, automatically, at every deploy.

When rollbacks are instant, the cost of a bad deploy drops dramatically. When canaries give you real signal before full exposure, the fear that slows release cadence starts to lift. Teams that once deployed weekly because deployments were scary start deploying daily — not because they became less careful, but because the system became more forgiving.

That's the real promise of progressive delivery.
