Introduction
In the fast-paced world of software deployment, the ability to release new features safely and efficiently can make or break your application's reliability. Canary deployments have emerged as a critical strategy for minimizing risk while maintaining continuous delivery. In this comprehensive guide, we'll explore how to implement robust canary deployments using Flagger, a progressive delivery operator for Kubernetes.
What is Canary Deployment?
Canary deployment is a technique for rolling out new features or changes to a small subset of users before releasing the update to the entire system. Named after the "canary in a coal mine" practice, this approach allows you to detect issues early and roll back quickly if problems arise.
Instead of replacing your entire application at once, canary deployments gradually shift traffic from the stable version (primary) to the new version (canary), monitoring key metrics throughout the process. If the metrics indicate problems, the deployment automatically rolls back to the stable version.
Why Choose Flagger?
Flagger is a progressive delivery operator that automates the promotion or rollback of canary deployments based on metrics analysis. Here's why it stands out:
- Automated Traffic Management: Gradually shifts traffic between versions
- Metrics-Driven Decisions: Uses Prometheus metrics to determine deployment success
- Multiple Ingress Support: Works with NGINX, Istio, Linkerd, and more
- Webhook Integration: Supports custom testing and validation hooks
- HPA Integration: Seamlessly works with Horizontal Pod Autoscaler
Prerequisites and Setup
As noted above, Flagger supports several ingress and service mesh integrations. In this guide I use the NGINX Ingress Controller for traffic management and Prometheus for metrics.
Required Components
- NGINX Ingress Controller (v1.0.2 or newer)
- Horizontal Pod Autoscaler (HPA) enabled
- Prometheus for metrics collection and analysis
- Flagger deployed in your cluster
Verification Commands
# Check NGINX ingress controller
kubectl get service --all-namespaces | grep nginx
# Verify HPA is enabled
kubectl get hpa --all-namespaces
# Confirm Flagger installation
kubectl get all -n flagger-system
Step 1: Installing Flagger
Flagger can be deployed using Helm or ArgoCD. Once installed, it creates several Custom Resource Definitions (CRDs):
kubectl get crds | grep flagger
# Expected output:
# alertproviders.flagger.app
# canaries.flagger.app
# metrictemplates.flagger.app
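If you go the Helm route, a minimal install might look like the following. This is a sketch: the meshProvider and metricsServer values come from the Flagger Helm chart, and you should adjust the namespace and Prometheus address to match your cluster.
# Add the Flagger Helm repository
helm repo add flagger https://flagger.app
helm repo update
# Install Flagger configured for NGINX ingress and an existing Prometheus
helm upgrade -i flagger flagger/flagger \
  --namespace flagger-system \
  --create-namespace \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus:9090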
Step 2: Understanding Flagger's Architecture
When you deploy a canary with Flagger, it automatically creates and manages several Kubernetes objects:
Original Objects (You Provide)
deployment.apps/your-app
horizontalpodautoscaler.autoscaling/your-app
ingresses.extensions/your-app
canary.flagger.app/your-app
Generated Objects (Flagger Creates)
deployment.apps/your-app-primary
horizontalpodautoscaler.autoscaling/your-app-primary
service/your-app
service/your-app-canary
service/your-app-primary
ingresses.extensions/your-app-canary
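After you create a Canary resource (Step 3 below), you can confirm that Flagger generated these objects by listing them in the application namespace; production is the namespace used throughout the examples here:
kubectl -n production get deployments,services,ingresses,canaries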
Step 3: Creating Your First Canary Configuration
Here's a comprehensive canary configuration example:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: production
spec:
  provider: nginx
  # Reference to your deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  # Reference to your ingress
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: my-app
  # Optional HPA reference
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: my-app
  # Maximum time for the canary to make progress before rollback
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 8080
    portDiscovery: true
  analysis:
    # Analysis runs every minute
    interval: 1m
    # Maximum failed checks before rollback
    threshold: 5
    # Maximum traffic percentage sent to the canary
    maxWeight: 50
    # Traffic increment step
    stepWeight: 10
    # Metrics to monitor
    metrics:
    - name: "error-rate"
      templateRef:
        name: error-rate
      thresholdRange:
        max: 0.02 # 2% error rate threshold
      interval: 1m
    - name: "latency"
      templateRef:
        name: latency
      thresholdRange:
        max: 500 # 500ms latency threshold
      interval: 1m
    # Optional webhooks for testing
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 15s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://my-app-canary.production/"
Step 4: Setting Up Service Monitors
For Prometheus to collect metrics from both primary and canary services, you need to create separate ServiceMonitor resources:
# Canary ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-canary
spec:
  endpoints:
  - port: metrics
    path: /metrics
    interval: 5s
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app-canary
---
# Primary ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-primary
spec:
  endpoints:
  - port: metrics
    path: /metrics
    interval: 5s
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app-primary
At this point, the canary and primary endpoints should appear as scrape targets in Prometheus, and their metrics should be queryable.
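One way to confirm this is to port-forward to Prometheus and inspect the scrape targets; the monitoring namespace and service name below are assumptions that depend on how Prometheus was installed in your cluster:
kubectl -n monitoring port-forward svc/prometheus 9090:9090
# Then open http://localhost:9090/targets and look for the
# my-app-canary and my-app-primary ServiceMonitor targets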
Step 5: Creating Custom Metric Templates
Flagger uses MetricTemplate resources to define how metrics are calculated. Here's an example for error rate comparison:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    sum(
      rate(
        http_requests_total{
          service="my-app-canary",
          status=~"5.*"
        }[1m]
      ) or on() vector(0)
    )
    /
    sum(
      rate(
        http_requests_total{
          service="my-app-canary"
        }[1m]
      )
    )
    -
    sum(
      rate(
        http_requests_total{
          service="my-app-primary",
          status=~"5.*"
        }[1m]
      ) or on() vector(0)
    )
    /
    sum(
      rate(
        http_requests_total{
          service="my-app-primary"
        }[1m]
      )
    )
This query calculates the difference in error rates between the canary and primary versions. The or on() vector(0) clause ensures the query returns 0 when no metrics are available instead of failing.
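The canary analysis above also references a latency template that we haven't defined yet. Here is a sketch of what it could look like, assuming your application exposes a standard http_request_duration_seconds histogram and you want the P99 in milliseconds to match the 500ms threshold:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          http_request_duration_seconds_bucket{
            service="my-app-canary"
          }[1m]
        )
      ) by (le)
    ) * 1000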
Understanding the Canary Analysis Process
The Promotion Flow
When Flagger detects a new deployment, it follows this process:
- Initialization: Scale up canary deployment alongside primary
- Pre-rollout Checks: Execute pre-rollout webhooks
- Traffic Shifting: Gradually increase traffic to canary (10% → 20% → 30% → 40% → 50%)
- Metrics Analysis: Check error rates, latency, and custom metrics at each step
- Promotion Decision: If all checks pass, promote canary to primary
- Cleanup: Scale down old primary, update primary with canary spec
Rollback Scenarios
Flagger automatically rolls back when:
- Error rate exceeds threshold
- Latency exceeds threshold
- Custom metric checks fail
- Webhook tests fail
- Failed checks counter reaches threshold
Monitoring Canary Progress
# Watch all canaries in real-time
watch kubectl get canaries --all-namespaces
# Get detailed canary status
kubectl describe canary/my-app -n production
# View Flagger logs
kubectl logs -f deployment/flagger -n flagger-system
Advanced Features
Webhooks for Enhanced Testing
Flagger supports multiple webhook types for comprehensive testing:
webhooks:
# Manual approval before rollout
- name: "confirm-rollout"
  type: confirm-rollout
  url: http://approval-service/gate/approve
# Pre-deployment testing
- name: "integration-test"
  type: pre-rollout
  url: http://test-service/
  timeout: 5m
  metadata:
    type: bash
    cmd: "run-integration-tests.sh"
# Load testing during rollout
- name: "load-test"
  type: rollout
  url: http://loadtester/
  metadata:
    cmd: "hey -z 2m -q 10 -c 5 http://my-app-canary/"
# Manual promotion approval
- name: "confirm-promotion"
  type: confirm-promotion
  url: http://approval-service/gate/approve
# Post-deployment notifications
- name: "slack-notification"
  type: post-rollout
  url: http://notification-service/slack
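The approval-service URLs above are placeholders for your own gating service. If you don't have one, Flagger's load tester can act as the gate: point the confirm-rollout and confirm-promotion webhooks at its /gate/check endpoint and open or close the gate on demand. The commands below are a sketch, assuming the load tester runs as deploy/flagger-loadtester in the test namespace:
# Open the gate so the rollout or promotion can proceed
# (name and namespace must match the Canary resource)
kubectl -n test exec -it deploy/flagger-loadtester -- \
  curl -d '{"name": "my-app","namespace": "production"}' http://localhost:8080/gate/open
# Close the gate again to pause subsequent rollouts
kubectl -n test exec -it deploy/flagger-loadtester -- \
  curl -d '{"name": "my-app","namespace": "production"}' http://localhost:8080/gate/close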
HPA Integration
When using HPA with canary deployments, Flagger pauses traffic increases while scaling operations are in progress:
autoscalerRef:
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  name: my-app
  primaryScalerReplicas:
    minReplicas: 2
    maxReplicas: 10
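For reference, the HPA that the autoscalerRef points to can be a standard autoscaling/v2 resource such as the one below; the CPU target and replica bounds are illustrative:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80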
Alerting and Notifications
Configure alerts to be notified of canary deployment status:
analysis:
  alerts:
  - name: "canary-status"
    severity: info
    providerRef:
      name: slack-alert
      namespace: flagger-system
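The providerRef above expects a matching AlertProvider resource. Here is a sketch of a Slack provider, assuming the incoming webhook URL is stored in a secret named slack-url with an address key:
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: slack-alert
  namespace: flagger-system
spec:
  type: slack
  channel: deployments
  username: flagger
  # The referenced secret is expected to contain an "address" key
  # holding the Slack webhook URL
  secretRef:
    name: slack-url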
Production Considerations
Traffic Requirements
For effective canary analysis, you need sufficient traffic to generate meaningful metrics. If your production traffic is low:
- Consider using load testing webhooks
- Implement synthetic traffic generation
- Adjust analysis intervals and thresholds accordingly
Metrics Selection
Choose metrics that accurately reflect your application's health:
- Error Rate: Monitor 5xx responses
- Latency: Track P95 or P99 response times
- Custom Business Metrics: Application-specific indicators
Deployment Timing
Calculate your deployment duration:
Minimum time = interval × (maxWeight / stepWeight)
Rollback time = interval × threshold
For example, with interval=1m, maxWeight=50%, stepWeight=10%, threshold=5:
- Minimum deployment time: 1m × (50/10) = 5 minutes
- Rollback time: 1m × 5 = 5 minutes
Troubleshooting Common Issues
Missing Metrics
Problem: Canary fails due to missing metrics
Solution: Verify ServiceMonitor selectors match service labels
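A quick way to check is to compare the labels on the generated services with the ServiceMonitor selectors; the monitoring namespace below is an assumption based on a typical Prometheus Operator setup:
kubectl -n production get svc my-app-canary my-app-primary --show-labels
kubectl -n monitoring get servicemonitor my-app-canary -o yaml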
Webhook Failures
Problem: Load testing webhooks time out
Solution: Increase webhook timeout and verify load tester accessibility
HPA Conflicts
Problem: Scaling issues during canary deployment
Solution: Ensure HPA references are correctly configured for both primary and canary
Network Policies
Problem: Traffic routing issues
Solution: Verify network policies allow communication between services
Best Practices
- Start Small: Begin with low traffic percentages and gradual increases
- Monitor Actively: Set up comprehensive alerting for canary deployments
- Test Thoroughly: Use webhooks for automated testing at each stage
- Plan for Rollback: Ensure your rollback process is well-tested
- Document Everything: Maintain clear documentation of your canary processes
Conclusion
Flagger provides a robust, automated solution for implementing canary deployments in Kubernetes environments. By gradually shifting traffic while monitoring key metrics, it enables safe deployments with automatic rollback capabilities.
The combination of metrics-driven analysis, webhook integration, and seamless traffic management makes Flagger an excellent choice for teams looking to implement progressive delivery practices. Start with simple configurations and gradually add more sophisticated monitoring and testing as your confidence grows.
Remember that successful canary deployments depend not just on the tooling, but also on having appropriate metrics, sufficient traffic, and well-defined success criteria. With proper implementation, Flagger can significantly reduce deployment risks while maintaining the agility your development teams need.