
Aisalkyn Aidarova


Production Canary Architecture (what actually guarantees zero downtime)


Client
  ↓
Ingress (NGINX / ALB)
  ↓
Service
  ↓
Pods
   ├─ Stable (v1) 90%
   └─ Canary (v2) 10%

Zero downtime comes from 4 protections working together:

  1. Readiness probe
  2. Rolling pod startup
  3. Traffic splitting at ingress
  4. Fast rollback

COMPONENTS (Production Required)

Component                    | Why
Deployment (stable + canary) | Parallel versions
Readiness probe              | Prevents early traffic
Service                      | Stable endpoint
Ingress (NGINX / ALB)        | Traffic split
Canary weight                | Controlled exposure
Fast rollback                | Safety net

1️⃣ STABLE Deployment (v1)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-stable
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: web
      track: stable
  template:
    metadata:
      labels:
        app: web
        track: stable
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:0.2.3
        args:
          - "-listen=:8080"
          - "-text=STABLE v1"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 2

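Why this is production-safe:

  • During a rolling update, old pods keep serving while their replacements start (the default RollingUpdate settings allow at most 25% of pods to be unavailable)
  • New pods receive traffic only after the readiness probe passes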

2️⃣ CANARY Deployment (v2)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
      track: canary
  template:
    metadata:
      labels:
        app: web
        track: canary
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:0.2.3
        args:
          - "-listen=:8080"
          - "-text=CANARY v2"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 2

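Key:

  • Canary pods run alongside stable pods
  • They carry track: canary, so a Service that selects track: stable never routes to them
  • They receive traffic only once something explicitly points at them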

3️⃣ SERVICES (split by label)

Stable Service

apiVersion: v1
kind: Service
metadata:
  name: web-stable
spec:
  selector:
    app: web
    track: stable
  ports:
  - port: 80
    targetPort: 8080

Canary Service

apiVersion: v1
kind: Service
metadata:
  name: web-canary
spec:
  selector:
    app: web
    track: canary
  ports:
  - port: 80
    targetPort: 8080

4️⃣ INGRESS WITH TRAFFIC SPLITTING (PRODUCTION CORE)

NGINX Ingress Canary (10%)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-stable
spec:
  rules:
  - host: web.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-stable
            port:
              number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
  - host: web.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-canary
            port:
              number: 80

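Result:

  • ~90% of requests → stable
  • ~10% of requests → canary (the weight is applied per request at random, so small samples will vary)
  • No pod restarts
  • No downtime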

5️⃣ LIVE TRAFFIC VERIFICATION

while true; do
  curl -s http://web.local
  sleep 0.3
done

Expected (about one response in ten comes from the canary; the order is random):

STABLE v1
STABLE v1
CANARY v2
STABLE v1
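Eyeballing the loop output makes it hard to judge the actual split. As a minimal sketch, the hypothetical helper below (tally_split is not part of any tool) counts responses by version from standard input and prints the observed canary percentage. It assumes the response bodies contain the strings STABLE and CANARY, as in the http-echo setup above.

```shell
# tally_split: read responses on stdin, count them by version.
# Hypothetical helper for illustration; relies only on awk.
tally_split() {
  awk '
    /STABLE/ { stable++ }
    /CANARY/ { canary++ }
    END {
      total = stable + canary
      if (total == 0) { print "no responses"; exit 1 }
      printf "stable=%d canary=%d canary_pct=%.1f\n", stable, canary, 100 * canary / total
    }'
}

# Usage (assumes web.local resolves to the ingress controller):
#   for i in $(seq 1 200); do curl -s http://web.local; done | tally_split
```

With a weight of 10, a sample of a few hundred requests should land near 10% canary, though individual runs will drift in either direction.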

6️⃣ METRICS & OBSERVATION (DevOps responsibility)

You monitor:

  • Error rate
  • Latency
  • Logs
kubectl logs -l app=web,track=canary --tail=100
kubectl get pods -l app=web

If metrics are clean → promote.
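"Clean" is easier to act on when it is a number. The sketch below is one hedged way to express a promotion gate: should_promote is a hypothetical function, and the error and request counts are assumed to come from whatever metrics source you already have. It succeeds only when the canary error rate is at or below a threshold given in whole percent.

```shell
# should_promote ERRORS TOTAL MAX_ERROR_PCT
# Hypothetical gate: exit 0 if errors/total <= MAX_ERROR_PCT percent,
# exit 1 if the rate is too high, exit 2 if there is no traffic to judge.
should_promote() {
  errors="$1"; total="$2"; max_pct="$3"
  # Refuse to decide on an empty sample.
  [ "$total" -gt 0 ] || return 2
  # Integer form of: errors / total <= max_pct / 100
  [ $((errors * 100)) -le $((max_pct * total)) ]
}

# Usage (numbers are illustrative):
#   should_promote 5 1000 1 && echo "promote" || echo "hold"
```

The 1% threshold in the usage line is only an example; the right value depends on the error budget of the service.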


7️⃣ PROMOTION (ZERO DOWNTIME)

Increase canary traffic:

nginx.ingress.kubernetes.io/canary-weight: "50"

Then:

nginx.ingress.kubernetes.io/canary-weight: "100"
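The weight can also be changed on the live Ingress without editing the manifest, using kubectl annotate with --overwrite. The wrapper below is a sketch: set_canary_weight and the KUBECTL override variable are hypothetical names introduced here, and the Ingress name web-canary is taken from the manifest above.

```shell
# set_canary_weight WEIGHT
# Hypothetical wrapper: validate the weight, then update the annotation
# on the web-canary Ingress. Set KUBECTL=echo to preview the command.
set_canary_weight() {
  weight="$1"
  case "$weight" in
    ''|*[!0-9]*) echo "weight must be an integer 0-100" >&2; return 1 ;;
  esac
  [ "$weight" -le 100 ] || { echo "weight must be an integer 0-100" >&2; return 1; }
  "${KUBECTL:-kubectl}" annotate ingress web-canary \
    "nginx.ingress.kubernetes.io/canary-weight=$weight" --overwrite
}

# Usage:
#   set_canary_weight 50
#   set_canary_weight 100
```

If the manifests live in version control, remember to commit the same weight there, or the next apply will reset it.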

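Finally:

  • Scale canary up to full capacity and wait until it is ready (ideally before raising the weight, since a single replica may not carry full traffic)
  • Scale stable down
  • Make v2 the new stable (a Deployment cannot be renamed in place; in practice, roll the v2 image into app-stable, then remove the canary Ingress)
kubectl scale deploy app-canary --replicas=6
kubectl rollout status deploy app-canary
kubectl scale deploy app-stable --replicas=0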

8️⃣ ROLLBACK (INSTANT, ZERO DOWNTIME)

One command:

kubectl delete ingress web-canary

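Traffic returns to stable as soon as the ingress controller applies the change:

  • 100% → stable
  • Canary pods still exist (for debugging)

This holds only while app-stable still has enough ready replicas; if you already scaled it down during promotion, scale it back up first.

Rollback is a routing change rather than a second rollout, which is why canary is safer than a plain rolling update.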
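Deleting the canary Ingress only helps if app-stable can absorb the full load at that moment. As a hedged sketch, the hypothetical rollback_canary function below first makes sure stable is at its full replica count and ready, then removes the canary Ingress. The KUBECTL and STABLE_REPLICAS variables are assumptions introduced for illustration; the resource names come from the manifests above.

```shell
# rollback_canary
# Hypothetical rollback: restore stable capacity, wait for readiness,
# then remove the canary route. Set KUBECTL=echo to preview the commands.
rollback_canary() {
  k="${KUBECTL:-kubectl}"
  # A no-op if stable is already at full size.
  "$k" scale deploy app-stable --replicas="${STABLE_REPLICAS:-6}" &&
  # Returns immediately when all stable pods are already ready.
  "$k" rollout status deploy app-stable --timeout=120s &&
  # --ignore-not-found keeps the rollback safe to run twice.
  "$k" delete ingress web-canary --ignore-not-found
}

# Usage:
#   rollback_canary
```

When stable was never scaled down, the first two steps finish almost immediately and this behaves like the single delete command.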


Why this is 100% no downtime

Protection      | Result
Readiness probe | No early traffic
Parallel pods   | No replacement gap
Ingress split   | Gradual exposure
Fast rollback   | Instant recovery

What REAL companies add on top

Production teams usually add:

  • Prometheus + alerts
  • Auto-promotion
  • Error budget checks
  • Argo Rollouts

But this design already covers the essentials on its own: readiness gating, gradual exposure, and fast rollback.


Final DevOps rule (remember this)

Rolling update replaces pods.
Canary protects users.
