Samson Tanimawo

Prometheus at Scale: Surviving the Cardinality Cliff

The Day Prometheus Fell Over

Our Prometheus server's memory usage spiked from 8 GB to 32 GB overnight. The kernel OOM-killed it, and monitoring was down for 20 minutes while we scrambled.

The cause? Someone had added a request_path label to an HTTP metric. With 50,000 unique paths, multiplied across the metric's existing labels, that one label created 500,000 new time series.

This is the cardinality cliff.

Understanding Cardinality

Cardinality = number of unique time series. Every unique combination of metric name + label values = one time series.

```
http_requests_total{method="GET", status="200", service="api"}
http_requests_total{method="GET", status="404", service="api"}
http_requests_total{method="POST", status="200", service="api"}
```

That's 3 time series. Manageable. But:

```
http_requests_total{method="GET", status="200", service="api", user_id="u1"}
http_requests_total{method="GET", status="200", service="api", user_id="u2"}
... (100,000 users)
```

That's 100,000 time series from ONE metric. Multiply by methods, status codes, and services, and you're at millions.
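The multiplication is worth making explicit. A back-of-the-envelope sketch — the label counts below are illustrative, not measurements from any real cluster:

```python
# Worst-case series count for one metric: the product of the number
# of distinct values each label can take.
def series_count(label_value_counts):
    total = 1
    for n in label_value_counts.values():
        total *= n
    return total

# Bounded labels stay manageable:
print(series_count({"method": 5, "status": 10, "service": 20}))  # 1000

# One unbounded label multiplies everything else:
print(series_count({"method": 5, "status": 10, "service": 20,
                    "user_id": 100_000}))  # 100000000
```

Each new label doesn't add to cardinality — it multiplies it. That's why the cliff is so steep.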

Detection: Finding High-Cardinality Metrics

```promql
# Top 20 metrics by series count
topk(20, count by (__name__) ({__name__=~".+"}))
```

Or from the Prometheus API:

```bash
curl -s http://prometheus:9090/api/v1/status/tsdb | \
  jq '.data.seriesCountByMetricName | sort_by(.value | tonumber) | reverse | .[0:20]'
```
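Once you know which metric is bloated, PromQL can also tell you which label is responsible. Standard patterns (using the article's metric and label names):

```promql
# Distinct values a suspect label contributes on one metric
count(count by (url_path) (http_requests_total))

# Total series count for a single metric
count({__name__="http_requests_total"})
```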

Prevention: The Label Policy

We enforce a label policy:

```yaml
label_policy:
  allowed_high_cardinality:
    - pod_name        # Bounded by cluster size
    - container_name  # Bounded by deployments
    - node_name       # Bounded by cluster size
    - namespace       # Bounded and controlled

  forbidden_labels:
    - user_id         # Unbounded
    - request_id      # Unbounded
    - email           # Unbounded
    - ip_address      # Unbounded
    - url_path        # Semi-unbounded

  max_cardinality_per_metric: 10000
  review_required_above: 5000
```
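A policy only helps if it's enforced. A minimal sketch of a CI-style check — this function is hypothetical, not part of any Prometheus tooling:

```python
# Hypothetical pre-merge check: flag metrics that declare forbidden labels.
FORBIDDEN_LABELS = {"user_id", "request_id", "email", "ip_address", "url_path"}

def policy_violations(metric_name, labels):
    """Return a message for each forbidden label the metric uses."""
    bad = sorted(set(labels) & FORBIDDEN_LABELS)
    return [f"{metric_name}: forbidden label '{label}'" for label in bad]

print(policy_violations("http_requests_total", ["method", "status", "service"]))
# []  -- passes review
print(policy_violations("http_requests_total", ["method", "status", "url_path"]))
# ["http_requests_total: forbidden label 'url_path'"]  -- blocks the merge
```

Running a check like this against instrumentation code (or against `/api/v1/status/tsdb` output in a nightly job) catches the `request_path`-style mistake before it reaches production.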

Mitigation: Relabeling Rules

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'api-service'
    metric_relabel_configs:
      # Option 1: drop the high-cardinality label before ingestion.
      # Note: labeldrop matches label *names*, not metric names, so it
      # applies to every series in this scrape job.
      - action: labeldrop
        regex: 'url_path'

      # Option 2: aggregate URL paths into categories instead.
      # (Pick one option per label -- the labeldrop above would remove
      # url_path before this rule ever sees it.)
      - source_labels: [url_path]
        regex: '/api/v1/users/.*'
        target_label: url_path
        replacement: '/api/v1/users/:id'

      # Drop metrics we don't need at all
      - source_labels: [__name__]
        regex: 'go_gc_.*|process_virtual_.*'
        action: drop
```
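Relabeling at scrape time works, but it's cheaper to never emit the raw path at all. The same aggregation can live in the instrumentation layer — a sketch, with illustrative route patterns:

```python
import re

# Collapse unbounded request paths into bounded templates *before*
# they become label values. Routes here are examples, not a real API.
PATH_TEMPLATES = [
    (re.compile(r"^/api/v1/users/[^/]+$"), "/api/v1/users/:id"),
    (re.compile(r"^/api/v1/orders/[^/]+$"), "/api/v1/orders/:id"),
]

def normalize_path(path: str) -> str:
    for pattern, template in PATH_TEMPLATES:
        if pattern.match(path):
            return template
    return path

print(normalize_path("/api/v1/users/u-8841"))  # /api/v1/users/:id
print(normalize_path("/healthz"))              # /healthz
```

Most web frameworks already expose the matched route template, which is an even better label value than a regex-normalized path.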

Scaling Strategy: Federation + Thanos

Architecture:

```
Service A ──→ Prometheus-1 ──→ Thanos Sidecar ──→ Object Storage
Service B ──→ Prometheus-1                             ↓
Service C ──→ Prometheus-2 ──→ Thanos Sidecar ──→ Thanos Store
Service D ──→ Prometheus-2                             ↓
                                                 Thanos Query
                                                       ↓
                                                    Grafana
```

```yaml
# Thanos sidecar on each Prometheus
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-with-thanos
spec:
  template:
    spec:
      containers:
        - name: prometheus
          args:
            - '--storage.tsdb.retention.time=6h'      # Short local retention
            - '--storage.tsdb.min-block-duration=2h'
            - '--storage.tsdb.max-block-duration=2h'
        - name: thanos-sidecar
          args:
            - 'sidecar'
            - '--objstore.config-file=/etc/thanos/objstore.yml'
```

Local Prometheus keeps 6 hours. Thanos handles long-term storage in S3. Query across all Prometheus instances through Thanos Query.
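The sidecar's `--objstore.config-file` flag points at a bucket configuration. A minimal S3 example — bucket name and endpoint are placeholders:

```yaml
# /etc/thanos/objstore.yml
type: S3
config:
  bucket: "metrics-long-term"              # placeholder bucket name
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  # Credentials can come from env vars or IAM roles instead of this file.
```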

Recording Rules: The Performance Multiplier

```yaml
# Pre-compute expensive queries
groups:
  - name: service-level-aggregations
    interval: 30s
    rules:
      - record: service:http_request_duration:p99_5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (service, le))

      - record: service:http_error_rate:5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)

      - record: service:http_requests_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (service)
```

Dashboards query recording rules (instant) instead of raw metrics (slow).
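Alerting rules benefit the same way: evaluating a pre-computed series is one cheap lookup instead of a full aggregation on every cycle. A sketch — the 5% threshold and severity label are illustrative:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Reuses the recording rule above; one lookup per evaluation
        expr: service:http_error_rate:5m > 0.05
        for: 10m
        labels:
          severity: page
```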

If you want monitoring that handles scale automatically without cardinality headaches, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
