Soumyajyoti Mahalanobish

Monitoring Celery Workers with Flower: Your Tasks Need Babysitting

So you've got Celery workers happily executing tasks in your Kubernetes cluster, but you're flying blind. Your workers could be on fire or stuck behind an endless queue, and you'd be the one left taking the blame.

We've all been there: staring at logs, hoping to divine the health of our distributed systems. Time to set up some proper monitoring.

Celery is one way of doing distributed task processing, but it's opaque when it comes to observability. You can see logs, but logs don't tell you whether workers are healthy, how long tasks are taking, or whether your queue is backing up. That's where Flower comes in: it's one of the go-to monitoring tools for Celery environments.

This guide covers integrating Flower with Prometheus and Grafana to get proper metrics-driven monitoring. Whether you're using Grafana Cloud, self-hosted Grafana, the k8s-monitoring Helm chart, or individual components, we'll walk through the setup, explain why each piece matters, and tackle the gotchas.

What You'll Need

  • Kubernetes cluster with Celery workers already running
  • Some form of Prometheus-compatible metrics collection (Alloy, Prometheus Operator, plain Prometheus, etc.)
  • Grafana instance (cloud or self-hosted)
  • Basic Kubernetes knowledge
  • Patience for the inevitable configuration mysteries

Understanding the Architecture

Before diving into configuration, let's understand what we're building. Flower sits between your Celery workers and your monitoring system. It connects to your message broker (Redis/RabbitMQ), watches worker activity, and exposes metrics in Prometheus format.

The flow looks like this:

  1. Celery workers process tasks from the broker
  2. Flower monitors the broker and worker activity
  3. Flower exposes metrics at /metrics endpoint
  4. Your metrics collector (Prometheus/Alloy) scrapes these metrics
  5. Grafana visualizes the data

The key insight is that Flower doesn't directly monitor workers; it watches the broker's state and worker events, which is why it can give you a complete picture of your distributed system.
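One prerequisite worth calling out: Flower can only report what your workers broadcast, so Celery task events need to be enabled on the worker side. A minimal sketch of the relevant part of a worker container spec, assuming a Celery app module named tasks (a placeholder):

# Worker container command (sketch); the -E flag is the important part
command:
- celery
- -A
- tasks               # placeholder: your Celery app module
- worker
- -E                  # alias for --task-events: broadcast task events that Flower consumes
- --loglevel=info

If you also care about the task-sent series (flower_events_total{type="task-sent"}), the producing side needs task_send_sent_event enabled in its Celery configuration as well.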

The Setup: Flower with Prometheus Metrics

Here's the thing about Flower: it's great at showing you pretty graphs in its web UI, but getting it to export metrics for Prometheus requires a specific flag that's easy to miss. By default, Flower only exposes basic Python process metrics, which are useless for understanding your Celery workload.

Deploy Flower (the Right Way)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flower
  labels:
    app: flower
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flower
  template:
    metadata:
      labels:
        app: flower
    spec:
      containers:
      - name: flower
        image: mher/flower:latest
        ports:
        - containerPort: 5555
        env:
        - name: CELERY_BROKER_URL
          value: "redis://your-redis-service:6379/0"
        command: 
        - celery
        - flower
        - --broker=redis://your-redis-service:6379/0
        - --port=5555
        - --prometheus_metrics  # This is the magic flag

That --prometheus_metrics flag is doing the heavy lifting here. Without it, you'll get basic Python process metrics (memory usage, GC stats, etc.) but none of the Celery-specific goodness like worker status, task counts, or queue depths. This flag tells Flower to export its internal monitoring data in Prometheus format.

The broker URL needs to match exactly what your Celery workers are using. Flower connects to the same broker to observe worker activity and task flow. If there's a mismatch, Flower won't see your workers.
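One way to keep the two in lockstep is to source the URL from a single place, such as a Secret that both the worker and Flower Deployments reference. A sketch of the env wiring, assuming a hypothetical celery-broker Secret with a url key:

env:
- name: CELERY_BROKER_URL
  valueFrom:
    secretKeyRef:
      name: celery-broker   # hypothetical Secret holding the broker URL
      key: url

Kubernetes expands $(CELERY_BROKER_URL) inside command arguments, so the flag in the Flower command can be written as --broker=$(CELERY_BROKER_URL) and will always match whatever the Secret contains.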

Service Configuration

apiVersion: v1
kind: Service
metadata:
  name: flower-service
  labels:
    app: flower
spec:
  selector:
    app: flower
  ports:
  - name: metrics  # Named ports help with service discovery
    port: 5555
    targetPort: 5555

The named port (metrics) is crucial for ServiceMonitor configurations later. Many monitoring setups rely on port names rather than numbers for service discovery, making your configuration more resilient to port changes.

Metrics Collection: Choose Your Adventure

How you get these metrics into your monitoring system depends entirely on your infrastructure setup. Kubernetes monitoring has evolved into several different patterns, each with its own tradeoffs.

Option 1: ServiceMonitor (Prometheus Operator/k8s-monitoring)

ServiceMonitors are part of the Prometheus Operator ecosystem and provide declarative configuration for scrape targets. They're the cleanest approach if you're using Prometheus Operator or the k8s-monitoring Helm chart.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flower-metrics
  labels:
    app: flower
spec:
  selector:
    matchLabels:
      app: flower
  endpoints:
  - port: metrics      # References the named port
    path: /metrics
    interval: 30s
    scrapeTimeout: 20s
  namespaceSelector:
    matchNames:
    - your-namespace

The critical detail here is port: metrics vs targetPort: metrics. ServiceMonitors reference the service's port definition, not the container port directly. This indirection allows you to change container ports without updating monitoring configs.

Getting this configuration right requires the same attention to detail as any other infrastructure code. A one-character difference can mean the difference between working monitoring and hours of debugging.

Here, the namespaceSelector restricts which namespaces this ServiceMonitor applies to. Without it, the ServiceMonitor tries to find matching services across all namespaces, which can cause confusion in multitenant clusters.
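One related gotcha: the Prometheus custom resource usually has a serviceMonitorSelector and only picks up ServiceMonitors carrying a matching label. On a typical kube-prometheus-stack install that's the Helm release label, so you may need an extra label along these lines (the release value is an assumption; check your own Prometheus CR):

metadata:
  name: flower-metrics
  labels:
    app: flower
    release: kube-prometheus-stack   # placeholder: must match your Prometheus CR's serviceMonitorSelector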

Option 2: Prometheus Annotations

If you're using vanilla Prometheus with annotation-based discovery, you configure scraping through service annotations. This is simpler but less flexible than ServiceMonitors.

apiVersion: v1
kind: Service
metadata:
  name: flower-service
  labels:
    app: flower
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "5555"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: flower
  ports:
  - name: metrics
    port: 5555
    targetPort: 5555

The annotations tell Prometheus to scrape this service. Your Prometheus configuration needs to include a job that discovers services with these annotations. This approach is more straightforward but offers less control over scraping behavior.
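For reference, that job is usually a variant of the classic annotation-driven endpoints configuration. A sketch of the widely used pattern; adjust the job name and any extra relabeling to your setup:

# prometheus.yml (sketch): annotation-driven service discovery
scrape_configs:
  - job_name: 'kubernetes-service-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only services annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honour a custom metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      # Rewrite the target address to the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Carry namespace and service name through as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service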

Option 3: Alloy Configuration (Manual)

Grafana Alloy offers more flexibility than traditional Prometheus. You can configure complex discovery and relabeling rules to handle dynamic environments.

# Add to your Alloy config
discovery.kubernetes "flower_pods" {
  role = "pod"
  selectors {
    role = "pod"
    label = "app=flower"
  }
}

discovery.relabel "flower_pods" {
  targets = discovery.kubernetes.flower_pods.targets
  rule {
    source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
    action = "keep"
    regex = "true"
  }
  rule {
    source_labels = ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"]
    action = "replace"
    regex = "([^:]+)(?:\\d+)?;(\\d+)"
    replacement = "${1}:${2}"
    target_label = "__address__"
  }
}

prometheus.scrape "flower_metrics" {
  targets    = discovery.relabel.flower_pods.output
  forward_to = [prometheus.remote_write.your_destination.receiver]
  scrape_interval = "30s"
}

This configuration discovers pods with the app=flower label, applies relabeling rules to construct proper scrape targets, and forwards metrics to your storage backend. The relabeling rules transform Kubernetes metadata into the format Prometheus expects. Note that the keep rule matches on pod annotations, not Service annotations, so for this to work the Flower pod template needs to carry the prometheus.io/scrape and prometheus.io/port annotations as well; a sketch of the extra metadata follows.
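Something along these lines, merged into the Flower Deployment's template block from earlier; the values mirror the Service annotations from Option 2:

template:
  metadata:
    labels:
      app: flower
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "5555"
      prometheus.io/path: "/metrics"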

Option 4: Static Prometheus Config

For simple setups or development environments, static configuration is the most straightforward approach.

# prometheus.yml
scrape_configs:
  - job_name: 'flower'
    static_configs:
      - targets: ['flower-service.your-namespace.svc.cluster.local:5555']
    metrics_path: '/metrics'
    scrape_interval: 30s

This hardcodes the service endpoint, which works fine for stable environments but doesn't handle dynamic scaling or service changes gracefully.

Verification: Making Sure It Actually Works

Before diving into dashboard creation, verify that metrics are flowing correctly. This saves hours of troubleshooting later when you're wondering why your graphs are empty.

Check the Metrics Endpoint

kubectl port-forward svc/flower-service 5555:5555
curl http://localhost:5555/metrics

You should see metrics that look like this:

flower_worker_online{worker="celery@worker-1"} 1.0
flower_events_total{task="process_data",type="task-sent"} 127.0
flower_worker_number_of_currently_executing_tasks{worker="celery@worker-1"} 3.0
flower_task_prefetch_time_seconds{task="process_data",worker="celery@worker-1"} 0.001

If you're only seeing basic Python metrics (python_gc_objects_collected_total, process_resident_memory_bytes, etc.), you're missing the --prometheus_metrics flag. The Celery-specific metrics are what make this whole exercise worthwhile.

Check Your Monitoring System

The verification process depends on your monitoring setup:

For ServiceMonitor setups: Check the Prometheus Operator or Alloy UI for discovered targets. Look for your Flower service in the targets list with status "UP".

For annotation-based: Navigate to your Prometheus targets page (/targets) and verify the Flower job appears with healthy status.

For manual configs: Check your collector's logs for any scraping errors and verify the target appears in the monitoring system's target list.

Understanding the Metrics

Flower exports several categories of metrics, each providing different insights into your Celery system:

Worker Metrics: flower_worker_online tells you which workers are active. flower_worker_number_of_currently_executing_tasks shows current load per worker.

Event Metrics: flower_events_total tracks task lifecycle events (sent, received, started, succeeded, failed). These form the basis for throughput and success rate calculations.

Timing Metrics: flower_task_runtime_seconds (histogram) shows task execution duration. flower_task_prefetch_time_seconds measures queue wait time.

Queue Metrics: Various metrics help you understand queue depth and processing patterns.

Building Useful Dashboards

Now for the payoff - turning those metrics into actionable insights. The key is building dashboards that help you answer specific operational questions.

Essential Queries

Worker Health Questions: "Are my workers running? How many are active?"

# Total online workers
sum(flower_worker_online)

# Per-worker status
flower_worker_online

Throughput Questions: "How many tasks are we processing? Is throughput increasing?"

# Tasks being sent to workers (per second)
rate(flower_events_total{type="task-sent"}[5m])

# Tasks being processed (per second)
rate(flower_events_total{type="task-received"}[5m])

Queue Health Questions: "Is my queue backing up? How long do tasks wait?"

# Tasks currently executing
sum(flower_worker_number_of_currently_executing_tasks)

# Time tasks spend waiting in queue
flower_task_prefetch_time_seconds

Performance Questions: "How long do tasks take? Are they getting slower?"

# 95th percentile task duration
histogram_quantile(0.95, rate(flower_task_runtime_seconds_bucket[5m]))

# Median task duration
histogram_quantile(0.50, rate(flower_task_runtime_seconds_bucket[5m]))

Dashboard Design Philosophy for Celery

Start with high-level health indicators, then provide drill-down capabilities. A good Celery dashboard answers these questions in order:

  1. System Health: Are workers running? Is the system processing tasks?
  2. Throughput: How much work are we doing? Is it increasing or decreasing?
  3. Performance: How fast are tasks completing? Are there performance regressions?
  4. Queue Health: Are tasks backing up? Where are the bottlenecks?

Scaling Considerations

Multiple Worker Types

Real Celery deployments often have specialized workers for different task types. CPU-intensive tasks, I/O-bound tasks, and priority queues all need separate monitoring.

# CPU-intensive work monitor
command: ["celery", "-A", "tasks.cpu", "flower", "--port=5555", "--prometheus_metrics"]

# I/O-bound work monitor  
command: ["celery", "-A", "tasks.io", "flower", "--port=5555", "--prometheus_metrics"]

Each Flower instance monitors a specific Celery app, giving you granular visibility into different workload types. You'll need separate services and scrape configurations for each instance.

This approach lets you set different SLAs and alerting thresholds for different workload types. Your real-time fraud detection tasks might need sub-second response times, while your batch report generation can tolerate longer delays.
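If you're on the Prometheus Operator, one way to tell the instances apart in queries is to give each Flower Service a distinguishing label and copy it onto the scraped series with the ServiceMonitor's targetLabels. A sketch for the CPU-bound instance; the workload label and flower-cpu naming are examples, not anything from the manifests above:

apiVersion: v1
kind: Service
metadata:
  name: flower-cpu
  labels:
    app: flower
    workload: cpu            # example label distinguishing this Flower instance
spec:
  selector:
    app: flower-cpu          # assumes the CPU-focused Flower Deployment carries this label
  ports:
  - name: metrics
    port: 5555
    targetPort: 5555
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flower-cpu
  labels:
    app: flower
spec:
  selector:
    matchLabels:
      app: flower
      workload: cpu
  targetLabels:
  - workload                 # copy the Service's workload label onto every scraped series
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s

Queries can then be grouped or filtered per workload, for example sum by (workload) (flower_worker_online).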

Resource Allocation

Flower itself is lightweight, but its resource needs scale with worker count and task frequency. A busy system with hundreds of workers and thousands of tasks per minute will use more memory to track state.

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Self-Hosted Prometheus

For self-hosted setups, configure Grafana to read from your Prometheus instance:

# grafana datasource config
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server.monitoring.svc.cluster.local:9090
    access: proxy

This assumes Prometheus and Grafana are in the same cluster. For cross-cluster or external access, you'll need appropriate networking and authentication configuration.
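For the external case, Grafana's datasource provisioning supports basic auth; a rough sketch with a placeholder URL and credentials:

apiVersion: 1
datasources:
  - name: Prometheus-external
    type: prometheus
    url: https://prometheus.example.com   # placeholder external endpoint
    access: proxy
    basicAuth: true
    basicAuthUser: your-user
    secureJsonData:
      basicAuthPassword: your-password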

Security Considerations

Production Flower deployments need proper security controls. Flower's web interface shows detailed information about your task processing, which could be sensitive.

Authentication

Enable basic authentication at minimum:

env:
- name: FLOWER_BASIC_AUTH
  value: "admin:your_secure_password"
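Hard-coding credentials in a manifest is fine for a demo but not something you want committed to git. A common alternative, sketched here with a hypothetical flower-auth Secret, is to pull the value from a Secret instead:

env:
- name: FLOWER_BASIC_AUTH
  valueFrom:
    secretKeyRef:
      name: flower-auth    # hypothetical Secret containing "user:password"
      key: basic-auth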

For production systems, consider OAuth integration or running Flower behind an authentication proxy.

If you only need the metrics and not the web UI, celery-exporter provides similar data without the interface overhead. It's purpose-built for Prometheus integration and might use fewer resources than Flower; the tradeoff is that you lose Flower's web interface for ad-hoc investigation.

Caution!

Getting Celery monitoring right requires attention to several key details:

  • The --prometheus_metrics flag transforms Flower from a simple web interface into a proper metrics exporter
  • Your metrics collection method should match your infrastructure setup and operational preferences
  • ServiceMonitor port configuration matters - port references service ports, targetPort references container ports
  • Label matching between ServiceMonitors, services, and pods must be exact
  • Your monitoring system's target discovery UI is invaluable for debugging configuration issues

The setup might seem complicated, but each piece serves a specific purpose in building a robust monitoring system. Once you have this foundation, you can extend it with alerting rules, additional dashboards, and integration with your incident response workflow.
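If you're using the Prometheus Operator, for example, a first pass at alerting rules might look roughly like this; the thresholds and severities are placeholders to tune for your workload, and as with ServiceMonitors, your Prometheus CR's ruleSelector may require an extra label:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: celery-flower-alerts
  labels:
    app: flower
spec:
  groups:
  - name: celery
    rules:
    - alert: CeleryNoWorkersOnline
      expr: sum(flower_worker_online) == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "No Celery workers are reporting as online"
    - alert: CeleryHighTaskFailureRate
      expr: |
        sum(rate(flower_events_total{type="task-failed"}[10m]))
          /
        sum(rate(flower_events_total{type="task-received"}[10m])) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "More than 5% of received Celery tasks are failing"

Note that sum() over an absent series returns nothing, so if Flower itself disappears the first alert won't fire; consider pairing it with absent(flower_worker_online) or an alert on the scrape target being down.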
