DEV Community

aman kohli
Argo Rollouts: Stop Gambling with Kubernetes Deployments

Kubernetes is the de facto standard for running containerized workloads at scale.

But when it comes to deploying safely, its default approach is surprisingly limited.

You get:

  • Readiness probes
  • Rolling updates

And that's about it.

For production systems, that's not enough.

Rolling updates don't control risk — they just distribute it.

That's exactly the gap Argo Rollouts is designed to solve.


The Problem with Rolling Updates

Kubernetes rolling updates provide a basic safety net:

  • Pods are replaced gradually
  • Health checks ensure they're alive

But they don't give you:

  • Control over who sees the new version
  • Visibility into real production impact
  • Metric-based validation before proceeding
  • Automatic rollback on failure

So most deployments still look like:

Deploy → Wait → Hope nothing breaks

And when things go wrong — they affect everyone at once.


What Argo Rollouts Actually Does

Argo Rollouts brings progressive delivery to Kubernetes.

Instead of pushing changes globally, it lets you:

  • Gradually expose changes to a subset of users
  • Measure real-world impact using your existing metrics
  • Automatically decide whether to proceed or roll back

It introduces a new custom resource:

Rollout — a drop-in replacement for Deployment with progressive delivery built in

💡 Note: Argo Rollouts does not interfere with existing Deployment resources. It only acts on Rollout objects — so you can introduce it incrementally, one service at a time.
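To make the drop-in nature concrete, here is a minimal sketch of a Rollout — everything except `kind` and the `strategy` block is a plain Deployment spec (the name, image, and labels are illustrative, not from a real setup):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout              # was: apps/v1 Deployment
metadata:
  name: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:                # identical pod template to a Deployment
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: demo
          image: demo:v1
  strategy:                # replaces the Deployment's rollingUpdate strategy
    canary:
      steps:
        - setWeight: 20
        - pause: {}        # wait indefinitely for manual promotion
```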


How Argo Rollouts Works (Under the Hood)

It's not a single tool — it's a system of components working together.

Rollout (CRD)

The core resource. Defines:

  • Strategy (blue-green or canary)
  • Step-by-step rollout logic
  • Analysis and traffic rules

Controller

The brain of the system:

  • Watches Rollout changes
  • Creates and manages ReplicaSets
  • Progresses or aborts deployments
  • Ignores standard Deployment objects completely

ReplicaSets

Managed automatically:

  • Old version → scaled down
  • New version → scaled up
  • You never touch these directly

Services & Ingress

  • Control traffic routing between versions
  • Enable canary and blue-green switching
  • Integrate with service meshes

AnalysisTemplate

Defines:

  • What metrics to check
  • How often to check them
  • What counts as success or failure
  • Reusable across multiple rollouts

AnalysisRun

The live execution of those checks:

  • Success → rollout continues
  • Failure → automatic rollback
  • Inconclusive → rollout pauses for human judgement

Experiment

Run stable and canary versions side-by-side under identical traffic. This gives you true A/B testing in production — no timing bias, no guesswork.
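As a rough sketch (names and images are illustrative — see the Argo Rollouts Experiment reference for the full spec), an Experiment runs two pod templates side-by-side for a fixed duration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: compare-v1-v2
spec:
  duration: 20m              # tear both variants down after 20 minutes
  templates:
    - name: baseline         # current stable version
      replicas: 2
      selector:
        matchLabels:
          app: demo
          variant: baseline
      template:
        metadata:
          labels:
            app: demo
            variant: baseline
        spec:
          containers:
            - name: demo
              image: demo:v1
    - name: canary           # candidate version, same replica count
      replicas: 2
      selector:
        matchLabels:
          app: demo
          variant: canary
      template:
        metadata:
          labels:
            app: demo
            variant: canary
        spec:
          containers:
            - name: demo
              image: demo:v2
```

Because both variants run at the same replica count over the same window, any metric difference between them reflects the code change, not the traffic.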


Progressive Delivery in Plain Language

Instead of:

Deploy → Hope

You move to:

Deploy → Observe → Decide

You define the signals that indicate a healthy deploy:

  • HTTP success rate
  • Error rate
  • Latency P99

And the system enforces them automatically. If metrics degrade, the rollout stops. No pager. No incident call. No scrambling.


🔵 Blue-Green vs 🟡 Canary: Which Should You Use?

This is where most teams overcomplicate things.

Blue-Green

Two versions run simultaneously:

  • Old → serves 100% of live traffic
  • New → idle, being tested via a preview service

When you're satisfied, traffic switches instantly.

Why it works: Only one version is ever active — no version-conflict issues.

Best for: Shared databases, queue workers, legacy applications.

Trade-off: The switch is all-or-nothing. No gradual exposure.


Canary

Traffic is gradually shifted:

```
5% → 25% → 50% → 100%
```

At each step, metrics are evaluated. The rollout proceeds, pauses, or aborts based on what it sees.

Why it works: Limits the blast radius of failures to a small percentage of users.

Requirements: A traffic routing layer (Istio, NGINX, etc.) and an app that can safely run two versions simultaneously.

Trade-off: More complexity, but far lower risk per release.


Quick Comparison

| Feature | Blue-Green | Canary |
|---|---|---|
| Complexity | Low | Medium–High |
| Risk (blast radius) | High (instant switch) | Low |
| Traffic control | 0% or 100% | Gradual % |
| Works with legacy systems | ✅ Yes | ❌ Often no |
| Works with queue workers | ✅ Yes | ❌ No |
| Requires traffic manager | ❌ No | ✅ Usually |

Rule of thumb: Start with blue-green. No extra infrastructure, works everywhere, immediate improvement over rolling updates. Evolve to canary once you trust your metrics and your system supports dual versions.


Traffic Management: The Layer Kubernetes Is Missing

Native Kubernetes Services can only route traffic based on pod selectors. They can't:

  • Split traffic by exact percentage
  • Route based on HTTP headers
  • Mirror traffic silently

That's where service meshes come in. Argo Rollouts integrates with:

  • Istio
  • NGINX Ingress
  • AWS ALB
  • Traefik
  • Ambassador, Kong, Apache APISIX, SMI, Google Cloud and more

Three Advanced Routing Techniques

1. Percentage-Based Routing

The baseline. Route N% to canary, the rest to stable. Works with all providers.

```
90% → stable service
10% → canary service
```
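In a Rollout, this split comes from the `trafficRouting` block of the canary strategy. A minimal sketch using the NGINX Ingress integration (the service and ingress names are assumptions for illustration):

```yaml
strategy:
  canary:
    canaryService: my-app-canary       # Service selecting canary pods
    stableService: my-app-stable       # Service selecting stable pods
    trafficRouting:
      nginx:
        stableIngress: my-app-ingress  # your existing Ingress for the stable Service
    steps:
      - setWeight: 10                  # NGINX now routes 10% of requests to the canary
      - pause: {}
```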

2. Header-Based Routing (Istio only)

Route internal users, QA teams, or beta testers to the new version based on a custom HTTP header — regardless of the overall traffic percentage.

```yaml
- setHeaderRoute:
    name: "internal-test"
    match:
    - headerName: X-Canary-User
      headerValue:
        exact: "true"
```

3. Traffic Mirroring (Istio only)

Copy real production traffic to the canary silently. Users always see the stable response — the canary's response is discarded. This lets you validate the new version under real load with zero user impact.

```yaml
- setMirrorRoute:
    name: mirror-route
    percentage: 35
    match:
      - method:
          exact: GET
        path:
          prefix: /
```

💡 Mirroring is one of the safest ways to validate changes before exposing them to any users.


Automated Analysis: Where the Real Power Is

This is what separates Argo Rollouts from basic deployment tools.

You define rules:

```yaml
successCondition: result[0] >= 0.95  # success rate >= 95%
failureLimit: 3                      # abort after 3 failures
```

Argo will:

  • Continue the rollout if metrics are healthy
  • Rollback automatically if metrics fail
  • Pause if the picture is unclear

No manual intervention needed.


Types of Analysis

Background Analysis — runs continuously during the canary steps. Fails at any point → rollout aborts.

Inline Analysis — a blocking step in your rollout sequence. Rollout waits until this completes before proceeding.

Blue-Green Pre-Promotion — validates the new version before traffic switches. Fails → traffic never switches.

Blue-Green Post-Promotion — validates the new version after traffic switches. Fails → traffic switches back automatically.
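In a blue-green Rollout, both hooks are declared alongside the strategy. A sketch (the template and service names are illustrative):

```yaml
strategy:
  blueGreen:
    activeService: my-app-active
    previewService: my-app-preview
    autoPromotionEnabled: false
    prePromotionAnalysis:          # must pass before traffic switches
      templates:
        - templateName: smoke-tests
      args:
        - name: service-name
          value: my-app-preview
    postPromotionAnalysis:         # failure here switches traffic back
      templates:
        - templateName: success-rate
```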


Full Example: Background Analysis with Prometheus

```yaml
# Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2   # don't start analysis until 40% traffic
        args:
        - name: service-name
          value: my-service.default.svc.cluster.local
      steps:
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 40
      - pause: {duration: 10m}
      - setWeight: 60
      - pause: {duration: 10m}
      - setWeight: 80
      - pause: {duration: 10m}
```
```yaml
# AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{
              destination_service=~"{{args.service-name}}",
              response_code!~"5.*"
            }[5m]
          )) /
          sum(irate(
            istio_requests_total{
              destination_service=~"{{args.service-name}}"
            }[5m]
          ))
```

If the success rate drops below 95% in three 5-minute measurement windows → the rollout aborts, the canary weight resets to zero, and the stable version continues serving 100% of traffic.

Supported Metric Providers

| Provider | Notes |
|---|---|
| Prometheus | Most common, full query support |
| Datadog | Use `default()` to handle nil results |
| New Relic | NRQL queries |
| CloudWatch | AWS-native metrics |
| Wavefront | Tanzu environments |
| Graphite | Self-hosted metrics |
| InfluxDB | Time-series metrics |
| Kayenta | Canary analysis from Spinnaker |
| Web (HTTP) | Custom webhook endpoints |
| Kubernetes Jobs | Run arbitrary analysis as a Job |
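The Kubernetes Jobs provider is handy when no metrics backend fits — any container that exits 0 counts as a successful measurement. A sketch (the image and health endpoint are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-tests
spec:
  metrics:
    - name: smoke-tests
      provider:
        job:                         # analysis passes if the Job succeeds
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: curlimages/curl
                    args: ["-fsS", "http://my-app-preview/healthz"]
```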

Deploying with Helm: A Real Walkthrough

In real systems you won't apply raw YAML manually. Here's a complete Helm-based setup.

Step 1: Install the Controller

```shell
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```

Step 2: Install the kubectl Plugin

```shell
# macOS
brew install argoproj/tap/kubectl-argo-rollouts

# Linux
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
```

Step 3: Scaffold a Chart

```shell
helm create my-app
cd my-app
```

Delete templates/deployment.yaml and create templates/rollout.yaml.

Step 4: Blue-Green Rollout Template

```yaml
# templates/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "my-app.fullname" . }}
spec:
  replicas: 3
  selector:
    matchLabels:
      app: {{ include "my-app.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "my-app.name" . }}
    spec:
      containers:
        - name: app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 80
  strategy:
    blueGreen:
      activeService: {{ include "my-app.fullname" . }}-active
      previewService: {{ include "my-app.fullname" . }}-preview
      autoPromotionEnabled: false
```

Step 5: Define Two Services

```yaml
# templates/service-active.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-app.fullname" . }}-active
spec:
  selector:
    app: {{ include "my-app.name" . }}
  ports:
    - port: 80
      targetPort: 80
```
```yaml
# templates/service-preview.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-app.fullname" . }}-preview
spec:
  selector:
    app: {{ include "my-app.name" . }}
  ports:
    - port: 80
      targetPort: 80
```

Both Services start with the same selector — that's intentional. The Rollouts controller injects a `rollouts-pod-template-hash` selector into each at runtime, so the active Service always targets the stable ReplicaSet and the preview Service targets the new one.

Step 6: Deploy and Upgrade

```shell
# Initial deploy
helm install my-app ./my-app

# Deploy new version
helm upgrade my-app ./my-app --set image.tag=v2
```

Argo Rollouts deploys v2 to the preview service. Production traffic stays on v1 untouched.

Test against my-app-preview. When satisfied:

```shell
kubectl argo rollouts promote my-app
```

Traffic switches to v2 instantly.
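If you want to watch the switch — or back out of it — the kubectl plugin covers both (shown against the `my-app` rollout from this walkthrough):

```shell
# Watch ReplicaSets, traffic status, and analysis results live
kubectl argo rollouts get rollout my-app --watch

# Abort an in-progress update and send all traffic back to stable
kubectl argo rollouts abort my-app

# Roll back to a previous revision after promotion
kubectl argo rollouts undo my-app
```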


Graduating to Canary

```yaml
strategy:
  canary:
    steps:
      - setWeight: 20
      - pause: {duration: 30s}
      - setWeight: 50
      - pause: {duration: 60s}
      - setWeight: 100
```

Traffic shifts 20% → 50% → 100% with a pause at each stage for observation.

Adding Metric-Driven Analysis

```yaml
# templates/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status!~"5.."}[1m])) /
            sum(rate(http_requests_total[1m]))
```

Attach it to your rollout:

```yaml
strategy:
  canary:
    analysis:
      templates:
        - templateName: success-rate
    steps:
      - setWeight: 20
      - pause: {duration: 1m}
```

Now if success rate drops below 95% → rollout aborts automatically. No human needed.


Common Mistakes

"Rolling updates are enough"

They're not. You're still guessing. A bad deploy reaches all your users before you know it's bad.

"Canary is always better"

Not if your system can't handle two versions running simultaneously. Shared databases, queue workers, and locked resources all rule it out. Start with blue-green.

"Metrics are optional"

They're not optional — they're the entire decision-making system. Without metrics, Argo Rollouts is just a more complicated way to do a rolling update.

"Start with the full setup"

Don't introduce a traffic provider, metric analysis, header routing, and canary all at once. You'll be debugging the infrastructure instead of your application. Layer in complexity one step at a time.


Recommended Adoption Path

| Step | What to add | What you gain |
|---|---|---|
| 1 | Blue-green + Helm | Instant rollback, zero infrastructure change |
| 2 | Manual promotion gates | Human review before traffic switches |
| 3 | Metric-based analysis | Automatic rollback on failure |
| 4 | Traffic provider (Istio/NGINX) | Exact percentage splits |
| 5 | Canary strategy | Low blast radius per release |
| 6 | Header routing + mirroring | Zero-impact production validation |

Stop at whichever step still feels worth the complexity for your team. Step 3 alone is a meaningful improvement for most organisations.

Final Takeaway

The shift Argo Rollouts enables isn't just technical. It's this:

You're no longer deploying code. You're managing risk.

Argo Rollouts gives you control, visibility, and automation. But it doesn't define "safe" for you — that's still your responsibility. You set the thresholds. You define what success looks like. You decide how aggressive your rollout steps are.

The system executes your definition of safe, automatically, at every deploy.

When rollbacks are instant, the cost of a bad deploy drops dramatically. When canaries give you real signal before full exposure, the fear that slows release cadence starts to lift. Teams that once deployed weekly because deployments were scary start deploying daily — not because they became less careful, but because the system became more forgiving.

That's the real promise of progressive delivery.
