The Day Prometheus Fell Over
Prometheus memory usage spiked from 8GB to 32GB overnight. OOM-killed. Monitoring was down for 20 minutes while we scrambled.
The cause? Someone added a request_path label to an HTTP metric. With 50,000 unique paths, that one label created 500,000 new time series.
This is the cardinality cliff.
Understanding Cardinality
Cardinality = number of unique time series. Every unique combination of metric name + label values = one time series.
http_requests_total{method="GET", status="200", service="api"}
http_requests_total{method="GET", status="404", service="api"}
http_requests_total{method="POST", status="200", service="api"}
That's 3 time series. Manageable. But:
http_requests_total{method="GET", status="200", service="api", user_id="u1"}
http_requests_total{method="GET", status="200", service="api", user_id="u2"}
... (100,000 users)
That's 100,000 time series from ONE metric. Multiply by methods, status codes, and services, and you're at millions.
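The multiplication is worth making explicit: series count is roughly the product of each label's distinct-value count. A back-of-envelope sketch (the per-label counts here are illustrative, not from the incident):

```python
# Rough cardinality estimate: the series count for one metric is the
# product of the distinct-value counts of its labels.
label_values = {
    "method": 5,    # GET, POST, PUT, DELETE, PATCH
    "status": 10,   # status codes actually observed
    "service": 20,  # services emitting the metric
}

series = 1
for label, distinct in label_values.items():
    series *= distinct
print(f"Without user_id: {series:,} series")  # 1,000

# Adding one unbounded label multiplies everything by its value count.
series_with_users = series * 100_000
print(f"With user_id:    {series_with_users:,} series")  # 100,000,000
```

One unbounded label doesn't add to your series count; it multiplies it.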
Detection: Finding High-Cardinality Metrics
# Top 20 metrics by series count
topk(20, count by (__name__) ({__name__=~".+"}))
Or from the Prometheus API:
curl -s http://prometheus:9090/api/v1/status/tsdb | \
  jq '.data.seriesCountByMetricName | sort_by(.value | tonumber) | reverse | .[0:20]'
Prevention: The Label Policy
We enforce a label policy:
label_policy:
  allowed_high_cardinality:
    - pod_name        # Bounded by cluster size
    - container_name  # Bounded by deployments
    - node_name       # Bounded by cluster size
    - namespace       # Bounded and controlled
  forbidden_labels:
    - user_id      # Unbounded
    - request_id   # Unbounded
    - email        # Unbounded
    - ip_address   # Unbounded
    - url_path     # Semi-unbounded
  max_cardinality_per_metric: 10000
  review_required_above: 5000
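A policy is only useful if something enforces it. One way is a small check that takes the `seriesCountByMetricName` list from `/api/v1/status/tsdb` (fetched separately) and flags metrics over the policy thresholds. This is a hypothetical sketch; the function name and sample counts are made up:

```python
# Sketch of a policy check against per-metric series counts, using the
# thresholds from the label policy above.
MAX_CARDINALITY = 10_000
REVIEW_ABOVE = 5_000

def check_cardinality(series_counts):
    """series_counts: list of {"name": ..., "value": ...} entries,
    as returned in seriesCountByMetricName from /api/v1/status/tsdb."""
    violations, reviews = [], []
    for entry in series_counts:
        count = int(entry["value"])
        if count > MAX_CARDINALITY:
            violations.append((entry["name"], count))
        elif count > REVIEW_ABOVE:
            reviews.append((entry["name"], count))
    return violations, reviews

# Example with made-up counts:
stats = [
    {"name": "http_requests_total", "value": 12_500},
    {"name": "http_request_duration_bucket", "value": 7_200},
    {"name": "up", "value": 340},
]
violations, reviews = check_cardinality(stats)
print(violations)  # [('http_requests_total', 12500)]
print(reviews)     # [('http_request_duration_bucket', 7200)]
```

Wired into CI or a nightly job, this turns the policy from a wiki page into a gate.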
Mitigation: Relabeling Rules
# prometheus.yml
scrape_configs:
  - job_name: 'api-service'
    metric_relabel_configs:
      # Aggregate URL paths into categories before ingestion
      - source_labels: [url_path]
        regex: '/api/v1/users/.*'
        target_label: url_path
        replacement: '/api/v1/users/:id'
      # Drop unbounded labels outright. Note: labeldrop matches label
      # *names* against the regex, ignores source_labels, and cannot
      # be scoped to a single metric.
      - action: labeldrop
        regex: 'request_id|user_id'
      # Drop metrics we don't need at all
      - source_labels: [__name__]
        regex: 'go_gc_.*|process_virtual_.*'
        action: drop
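Prometheus anchors relabel regexes, so a rule only fires on a full match. Before shipping a relabel rule, it's worth checking what it will actually touch; `re.fullmatch` in Python is a close stand-in for that anchored behavior. The paths here are illustrative:

```python
import re

# Simulate the url_path aggregation rule above. Prometheus relabel
# regexes are fully anchored, so fullmatch approximates the semantics.
rule_regex = re.compile(r"/api/v1/users/.*")
replacement = "/api/v1/users/:id"

paths = ["/api/v1/users/u123", "/api/v1/users/u123/orders", "/healthz"]
for p in paths:
    out = replacement if rule_regex.fullmatch(p) else p
    print(f"{p} -> {out}")
# /api/v1/users/u123 -> /api/v1/users/:id
# /api/v1/users/u123/orders -> /api/v1/users/:id
# /healthz -> /healthz
```

Catching an over-broad (or under-anchored) regex here is much cheaper than discovering it in the TSDB.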
Scaling Strategy: Federation + Thanos
Architecture:
Service A, B ──→ Prometheus-1 ──→ Thanos Sidecar ──→ Object Storage
Service C, D ──→ Prometheus-2 ──→ Thanos Sidecar ──→ Object Storage
                                        │                  │
                                        │             Thanos Store
                                        │                  │
                                        └─→ Thanos Query ←─┘
                                                 ↓
                                              Grafana
# Thanos sidecar on each Prometheus
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-with-thanos
spec:
  template:
    spec:
      containers:
        - name: prometheus
          args:
            - '--storage.tsdb.retention.time=6h'  # Short local retention
            - '--storage.tsdb.min-block-duration=2h'
            - '--storage.tsdb.max-block-duration=2h'
        - name: thanos-sidecar
          args:
            - 'sidecar'
            - '--tsdb.path=/prometheus'               # Shared with Prometheus
            - '--prometheus.url=http://localhost:9090'
            - '--objstore.config-file=/etc/thanos/objstore.yml'
Local Prometheus keeps 6 hours. Thanos handles long-term storage in S3. Query across all Prometheus instances through Thanos Query.
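The fan-out piece is Thanos Query, which needs the gRPC address of every sidecar plus the store gateway. A minimal sketch, with placeholder service names and the default Thanos gRPC port:

```yaml
# Thanos Query fans out to every sidecar and the store gateway
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  template:
    spec:
      containers:
        - name: thanos-query
          args:
            - 'query'
            - '--endpoint=prometheus-1-sidecar:10901'  # placeholder addresses
            - '--endpoint=prometheus-2-sidecar:10901'
            - '--endpoint=thanos-store:10901'
```

Grafana then points at Thanos Query instead of any single Prometheus, and every query transparently spans fresh data (sidecars) and historical data (store gateway).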
Recording Rules: The Performance Multiplier
# Pre-compute expensive queries
groups:
- name: service-level-aggregations
interval: 30s
rules:
- record: service:http_request_duration:p99_5m
expr: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (service, le))
- record: service:http_error_rate:5m
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
- record: service:http_requests_rate:5m
expr: sum(rate(http_requests_total[5m])) by (service)
Dashboards then query the pre-computed series (near-instant) instead of aggregating raw metrics at render time (slow).
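Concretely, a panel or alert references the recorded series by name. The alert threshold below is illustrative:

```
# Grafana panel: pre-computed p99 latency per service
service:http_request_duration:p99_5m{service="api"}

# Alert on the pre-computed error rate (5% threshold as an example)
service:http_error_rate:5m > 0.05
```

The `level:metric:operations` naming convention in the rules above also makes it obvious at query time that you're reading an aggregate, not raw samples.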
If you want monitoring that handles scale automatically without cardinality headaches, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com