Aisalkyn Aidarova

Posted on Jan 7

Production Canary Architecture (what actually guarantees zero downtime)

#architecture #devops #kubernetes #sre

Client
  ↓
Ingress (NGINX / ALB)
   ├─ 90% → Service web-stable → Stable pods (v1)
   └─ 10% → Service web-canary → Canary pods (v2)

Zero downtime comes from 4 protections working together:

Readiness probe
Rolling pod startup
Traffic splitting at ingress
Fast rollback

COMPONENTS (Production Required)

Component	Why
Deployment (stable + canary)	Parallel versions
Readiness probe	Prevents early traffic
Service	Stable endpoint
Ingress (NGINX / ALB)	Traffic split
Canary weight	Controlled exposure
Fast rollback	Safety net

1️⃣ STABLE Deployment (v1)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-stable
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: web
      track: stable
  template:
    metadata:
      labels:
        app: web
        track: stable
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:0.2.3
        args:
          - "-listen=:8080"
          - "-text=STABLE v1"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 2

Why this is production-safe:

Old pods stay serving traffic
New pods join only when ready
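How strictly "old pods stay serving traffic" holds depends on the rolling-update settings. Kubernetes defaults both maxSurge and maxUnavailable to 25%, which with 6 replicas still allows one old pod to go away before its replacement is Ready. The sketch below is one way to tighten that, assuming kubectl is already pointed at the right cluster. The function names and the KUBECTL override variable are illustrative, not part of the original setup.

```shell
# Assumption: kubectl is configured for the target cluster. Set KUBECTL=echo
# to preview the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# Keep every old pod running until a new, Ready pod has replaced it.
harden_rollout() {
  local deploy="$1"
  "$KUBECTL" patch deployment "$deploy" --type merge \
    -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0}}}}'
}

# Block until the rollout completes; exits non-zero if the timeout expires.
wait_for_rollout() {
  local deploy="$1" timeout="${2:-120s}"
  "$KUBECTL" rollout status "deployment/$deploy" --timeout="$timeout"
}

# Usage:
#   harden_rollout app-stable && wait_for_rollout app-stable
```

With maxUnavailable set to 0 the rollout needs spare capacity for one extra pod, which is the trade-off for never dipping below 6 serving replicas.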

2️⃣ CANARY Deployment (v2)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
      track: canary
  template:
    metadata:
      labels:
        app: web
        track: canary
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:0.2.3
        args:
          - "-listen=:8080"
          - "-text=CANARY v2"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 2

Key:

Canary pods exist
But they receive no traffic until the canary Service and Ingress below route to them

3️⃣ SERVICES (split by label)

Stable Service

apiVersion: v1
kind: Service
metadata:
  name: web-stable
spec:
  selector:
    app: web
    track: stable
  ports:
  - port: 80
    targetPort: 8080

Canary Service

apiVersion: v1
kind: Service
metadata:
  name: web-canary
spec:
  selector:
    app: web
    track: canary
  ports:
  - port: 80
    targetPort: 8080

4️⃣ INGRESS WITH TRAFFIC SPLITTING (PRODUCTION CORE)

NGINX Ingress Canary (10%)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-stable
spec:
  rules:
  - host: web.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-stable
            port:
              number: 80

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
  - host: web.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-canary
            port:
              number: 80

Result:

90% → stable
10% → canary
No restart
No downtime

5️⃣ LIVE TRAFFIC VERIFICATION

while true; do
  curl -s http://web.local
  sleep 0.3
done

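Expected (with a 10% weight, roughly one response in ten comes from the canary; the split is random per request, so short samples vary):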

STABLE v1
STABLE v1
CANARY v2
STABLE v1
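Eyeballing the loop output makes it hard to judge the actual split. As a rough sketch, the responses can be piped into a small tally that prints the count and share of each version. The helper name tally_responses is made up for this example, and it assumes each response body is a single line, as with the http-echo image used here.

```shell
# Count how often each response body appears on stdin and print its share.
tally_responses() {
  awk '
    NF { count[$0]++; total++ }
    END {
      for (body in count)
        printf "%s %d %.0f%%\n", body, count[body], 100 * count[body] / total
    }
  ' | sort
}

# Usage (assumes web.local resolves to the ingress controller):
#   for i in $(seq 1 200); do curl -s http://web.local; done | tally_responses
```

A few hundred requests usually give a share close to the configured weight; a handful of requests will not.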

6️⃣ METRICS & OBSERVATION (DevOps responsibility)

You monitor:

Error rate
Latency
Logs

kubectl logs -l track=canary
kubectl get pods

If metrics are clean → promote.
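"Clean" is easier to act on when it is a number. The sketch below is a minimal client-side gate: it reads HTTP status codes and succeeds only if the share of 5xx responses stays at or below a threshold. The function name and the default 1% threshold are assumptions for illustration. It measures the blended stable + canary traffic as seen by one client, so it is no substitute for per-backend metrics.

```shell
# Read one HTTP status code per line on stdin. Succeed only if the share of
# 5xx responses is at or below the given percentage (default: 1).
error_rate_ok() {
  local max_pct="${1:-1}"
  awk -v max="$max_pct" '
    /^[0-9]+$/ { total++; if ($1 >= 500) errors++ }
    END {
      if (total == 0) exit 1          # no samples counts as a failure
      pct = 100 * errors / total
      printf "requests=%d errors=%d error_rate=%.1f%%\n", total, errors, pct
      if (pct > max) exit 1
    }
  '
}

# Usage (assumes web.local resolves to the ingress controller):
#   for i in $(seq 1 100); do
#     curl -s -o /dev/null -w '%{http_code}\n' http://web.local
#   done | error_rate_ok 1 && echo "clean: safe to promote"
```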

7️⃣ PROMOTION (ZERO DOWNTIME)

Increase canary traffic:

nginx.ingress.kubernetes.io/canary-weight: "50"

Then:

nginx.ingress.kubernetes.io/canary-weight: "100"
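The weight changes above can be applied without editing the manifest by overwriting the annotation on the live Ingress. This is a small sketch, assuming the ingress-nginx canary annotations used in this post. The helper name set_canary_weight and the KUBECTL override are illustrative.

```shell
# Assumption: kubectl is configured for the target cluster. Set KUBECTL=echo
# to preview the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Set the canary weight (integer 0-100) on an ingress-nginx canary Ingress.
set_canary_weight() {
  local ingress="$1" weight="$2"
  case "$weight" in
    ''|*[!0-9]*) echo "weight must be an integer 0-100" >&2; return 1 ;;
  esac
  if [ "$weight" -gt 100 ]; then
    echo "weight must be an integer 0-100" >&2
    return 1
  fi
  "$KUBECTL" annotate ingress "$ingress" \
    "nginx.ingress.kubernetes.io/canary-weight=$weight" --overwrite
}

# Usage:
#   set_canary_weight web-canary 50
#   set_canary_weight web-canary 100
```

Validating the value before calling kubectl avoids writing a typo into the annotation during a live promotion.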

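Finally:

Scale canary up (do this before the weight reaches 100, because a single replica would otherwise serve all traffic)
Scale stable down only once the canary pods are Ready
Rename canary → stable (a Deployment cannot be renamed in place, so in practice this means rolling app-stable to v2, waiting until it is Ready, and only then removing the canary Ingress)

kubectl scale deploy app-canary --replicas=6
kubectl rollout status deploy/app-canary
kubectl scale deploy app-stable --replicas=0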

8️⃣ ROLLBACK (INSTANT, ZERO DOWNTIME)

One command:

kubectl delete ingress web-canary

Traffic shifts as soon as the ingress controller picks up the change:

100% → stable
Canary pods still exist (for debugging)

This is why a canary release is safer than a plain rolling update: rollback is a routing change, not a redeploy.
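The one-command rollback has a gentler variant: setting the canary weight to 0 also sends all traffic to stable but keeps the canary Ingress in place for a later retry. The sketch below wraps both options. The function name, the mode argument, and the KUBECTL override are assumptions for illustration.

```shell
# Assumption: kubectl is configured for the target cluster. Set KUBECTL=echo
# to preview the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Send all traffic back to stable.
#   mode=delete (default): remove the canary Ingress entirely.
#   mode=weight:           keep the Ingress but set its weight to 0.
rollback_canary() {
  local ingress="${1:-web-canary}" mode="${2:-delete}"
  if [ "$mode" = "weight" ]; then
    "$KUBECTL" annotate ingress "$ingress" \
      "nginx.ingress.kubernetes.io/canary-weight=0" --overwrite
  else
    "$KUBECTL" delete ingress "$ingress" --ignore-not-found
  fi
}

# Usage:
#   rollback_canary                    # delete web-canary
#   rollback_canary web-canary weight  # keep it, route 0% to the canary
```

Either way the canary pods keep running, so logs and state remain available for debugging.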

Why this is 100% no downtime

Protection	Result
Readiness probe	No early traffic
Parallel pods	No replacement gap
Ingress split	Gradual exposure
Fast rollback	Instant recovery

What REAL companies add on top

Production teams usually add:

Prometheus + alerts
Auto-promotion
Error budget checks
Argo Rollouts

But the design above already covers the essentials: readiness gating, parallel versions, weighted routing, and a fast rollback path.

Final DevOps rule (remember this)

Rolling update replaces pods.
Canary protects users.
