Last month, a team I was talking to added a pod_id label to debug a networking issue. Seemed harmless - only 200 pods.
But with 50 metrics per pod and 2-minute pod churn during deployments, they created 150,000 new series per hour. Prometheus memory climbed from 8GB to 32GB in a week. They didn't notice until it got OOMKilled during a production incident.
The fix took 10 minutes. The outage took 3 hours. The postmortem took a week.
The Checklist
Before adding any label that could explode, ask:
1. Which system stores this?
Prometheus pays cardinality costs at write time (memory). ClickHouse pays at query time (aggregation). Know your failure mode.
2. Is this for alerting or investigation?
Alerting must be bounded. Investigation can be unbounded - but unbounded data probably shouldn't live in Prometheus.
3. What's the expected cardinality?
distinct_values × other_label_combinations = series count
200 pods × 50 metrics × 10 endpoints = 100,000 series. Per deployment. (A way to measure this on a live Prometheus is sketched after the checklist.)
4. What's the growth rate?
Will this 10x in a year? Containers, request IDs, user IDs - these grow with traffic.
5. Is there a fallback?
Can you drop this label via metric_relabel_configs if it explodes? Test this before you need it - there's a config sketch after the checklist.
6. Who owns this label?
When it causes problems at 3am, who gets paged?
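For question 3, the multiplication above can be sanity-checked against a live Prometheus before the label ships. A minimal sketch, assuming http_requests_total is the metric you want to enrich and the candidate label already exists somewhere else, e.g. on kube_pod_info from kube-state-metrics:

# Series the target metric produces today, across all existing label combinations:
count(http_requests_total)

# Distinct values of the candidate label, measured where it already exists:
count(count by (pod) (kube_pod_info))

# Expected series after adding the label ≈ first result × second result.

If the product makes you wince, so will your memory graphs.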
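For question 5, the escape hatch is a scrape-level relabel rule. A minimal sketch of the relevant piece of prometheus.yml, assuming the runaway label is pod_id; the job name and target are placeholders:

scrape_configs:
  - job_name: app                     # hypothetical job name
    static_configs:
      - targets: ['app:9100']         # placeholder target
    metric_relabel_configs:
      # Strip the offending label from every scraped series before ingestion.
      - action: labeldrop
        regex: pod_id

Series that differ only in the dropped label will collide after the drop, so this trades detail for survival. Rehearse it in a test environment before you need it at 3am.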
Metrics to Watch
Before cardinality bites:
prometheus_tsdb_head_series # Active series count
prometheus_tsdb_head_chunks_created_total # Rate of new chunks
prometheus_tsdb_symbol_table_size_bytes # Memory for interned strings
process_resident_memory_bytes # Actual memory usage
If head_series grows faster than expected, you have a problem brewing.
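If you'd rather have Prometheus tell you than find out from an OOMKill, a rule along these lines can page early. This is a sketch, not a recommendation - the 20%-in-an-hour threshold and the group name are assumptions to tune against your own baseline:

groups:
  - name: cardinality-watch           # hypothetical rule group
    rules:
      - alert: ActiveSeriesGrowingFast
        # Fires when the head block holds 20% more active series than an hour ago.
        expr: prometheus_tsdb_head_series > 1.2 * (prometheus_tsdb_head_series offset 1h)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Active series count is climbing faster than expected

When it fires, topk(10, count by (__name__) ({__name__=~".+"})) usually points at the metric driving the growth - it's an expensive query, so run it sparingly.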
Going Deeper
I wrote a full breakdown of how Prometheus and ClickHouse handle cardinality differently at the storage engine level - head blocks, posting lists, Gorilla encoding, columnar storage, GROUP BY explosions.
https://last9.io/blog/high-cardinality-metrics-prometheus-clickhouse/ covers why they fail in completely different ways and how to design pipelines with that in mind.