Why Is Cardinality Explosion Always a Problem?

#observability #monitoring

There's an insidious problem that frequently surfaces in metric systems, sometimes slowing down the entire infrastructure or sending costs through the roof without anyone noticing: cardinality explosion. It seriously degrades data storage and query performance, especially in monitoring and observability platforms. I've seen firsthand, more than once, how failing to manage high cardinality leads to significant headaches for both me as a system administrator and the teams I work with.

Cardinality explosion means an excessive increase in the number of unique label combinations within a metric set. For instance, when monitoring HTTP request status codes, we use labels like status_code: 200. However, when we start labeling these metrics with each user's ID, each request's unique session ID, or the full path of each URL, the number of unique combinations rapidly multiplies. This inflates the index sizes in metric databases, slows down queries, and drives storage costs to unpredictable levels.

What is Cardinality and Why is it Important?

Cardinality is the number of unique values in a dataset. In the context of monitoring systems, it refers to the number of unique label combinations available for a metric name. For example, let's say we have a metric called http_requests_total. If we only monitor this metric with method (GET, POST) and status (200, 404, 500) labels, its cardinality will be relatively low (e.g., 2 methods * 3 statuses = 6 unique combinations).

However, when I add labels like path (every unique URL path) or user_id (every unique user ID) to this metric, cardinality can suddenly explode. Consider an e-commerce site with thousands of different product pages and millions of users. In this scenario, labeling each request with path and user_id creates millions, or even billions, of unique metric series. In a production ERP scenario I encountered, when I tried to add a unique transaction_id label for every operation coming from operator screens, my metric system suddenly started trying to create 10,000 new series per second. This meant an unmanageable load for the system.
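
To make the arithmetic concrete, here is a minimal sketch that computes the worst-case number of series for a metric as the product of the distinct values of each label. The function name and the e-commerce figures (5,000 paths, 1,000,000 users) are my own assumptions for illustration; in practice only the combinations that actually occur become series, so this is an upper bound.

```python
from math import prod


def worst_case_series(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values_per_label.values())


# The low-cardinality case from the text: 2 methods * 3 statuses.
low = worst_case_series({"method": 2, "status": 3})

# Hypothetical e-commerce numbers, chosen only for illustration.
high = worst_case_series(
    {"method": 2, "status": 3, "path": 5_000, "user_id": 1_000_000}
)

print(low)   # 6
print(high)  # 30000000000
```

Two extra labels turn 6 series into a theoretical 30 billion, which is why a single unbounded label is enough to overwhelm a metric backend.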

ℹ️ Metric Series and Cardinality

A metric series is defined by its metric name and all its label combinations. High cardinality means a large number of unique metric series. Each of these series stores its own values over time.

Such high cardinality places a significant burden on the time-series databases (TSDBs) that store your metrics. Each unique series requires separate storage space and an index entry. This directly increases disk space, memory usage, and CPU power requirements. While it might seem like an easily overlooked detail, in my experience, I've seen a small label addition increase disk usage by 200% within a week.

Impacts on Storage and Performance

The most visible consequences of cardinality explosion are a sharp rise in storage costs and a steep drop in query performance. Metric systems are often built on time-series databases. These databases maintain a header and an index entry for each unique metric series. When there are millions of unique series, these headers and indexes consume a vast amount of disk space. In a manufacturing company's ERP, when we mistakenly added high-cardinality labels like worker_id and task_id, our Prometheus server's disk usage jumped from 2 TB to 6 TB, and that covered only 3 days of data.

Query performance is also directly affected. When you run a metric query, the database must scan the indexes to find all relevant time series. If there are millions of series, this scan process takes a very long time. In an incident I experienced on April 28th, some panels on my Grafana dashboards went from taking 5 seconds to load to 30 seconds. When I performed a root cause analysis, I found that a colleague from the development team had added a request_uuid label to a service's endpoints. Because this request_uuid changed with every request, it instantly led to millions of new series.

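# Prometheus metric_relabel_configs example: dropping the request_uuid label
# (placed under the affected scrape_config; applied to scraped samples before ingestion)
metric_relabel_configs:
  # labeldrop matches the regex against label names and removes every label that matches
  - regex: 'request_uuid'
    action: labeldrop

Once the request_uuid label was dropped with a metric_relabel_configs rule like the one above, disk usage returned to normal and the dashboards started loading within seconds again. Note that the rule uses labeldrop rather than drop: action: drop discards every series whose source labels match the regex, and an action: keep rule on __name__ would discard every other metric from that scrape job. After a labeldrop the remaining labels must still identify each series uniquely, so the lasting fix is to remove the label in the application's instrumentation. This was a concrete example showing how critical each label is. In systems that use a relational database like PostgreSQL for metric storage (which I've used for the backend of some of my own side projects), high cardinality can bloat B-tree indexes and inflate WAL volume. This increases disk I/O and slows down VACUUM operations, degrading overall database performance.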

Cardinality Explosion in Monitoring and Alerting Systems

Cardinality explosion doesn't stop at storage and performance issues; it also undermines the effectiveness of our monitoring and alerting systems. With an excessive number of unique metric series, defining meaningful alerts becomes nearly impossible, because you don't know which series to write an alert for. For example, instead of writing an alert for an overly specific selector like http_requests_total{status="500", path=~"/api/v1/users/.*", user_id="12345"}, you'd want to write an alert for a more general one like http_requests_total{status="500"}. However, in a high-cardinality situation, even general metrics are divided into so many sub-series that you'd need to examine all of them individually to find the root cause of a problem.

In a situation I encountered on an internal platform at a bank, each microservice exposed separate instance_id and deployment_id labels. Every new deployment changed the deployment_id, so a fresh set of series appeared and the old ones became "dead." Alerts written against the old series no longer matched anything, while the new series were not covered by any rule, so they appeared healthy even when they were actually problematic. As long as the deployment_id label wasn't removed, this repeated with every deployment, which required constant adjustment of alerts and increased operational overhead.
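
The churn effect can be sketched with a small simulation. Everything here is hypothetical (the service name, the pod naming, 10 instances, 20 deployments); it only counts how many distinct label sets a metric accumulates over time with and without a deployment_id label.

```python
def series_ever_created(instances: int, deployments: int, with_deployment_id: bool) -> int:
    """Count the distinct label sets seen across a sequence of deployments."""
    seen = set()
    for deploy in range(deployments):
        for instance in range(instances):
            labels = {"service": "payments", "instance": f"pod-{instance}"}
            if with_deployment_id:
                # A new value on every rollout turns every series into a new one.
                labels["deployment_id"] = f"deploy-{deploy}"
            seen.add(frozenset(labels.items()))
    return len(seen)


print(series_ever_created(10, 20, with_deployment_id=True))   # 200
print(series_ever_created(10, 20, with_deployment_id=False))  # 10
```

Without the label, the same 10 series simply continue across deployments, so alert rules and dashboards keep matching them.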

⚠️ False Positives and Negatives

High cardinality can cause your system to trigger numerous unnecessary alerts (false positives) or cause real problems to be overlooked (false negatives). It's impossible to monitor every unique series individually.

Another problem arises with log cardinality, as seen in systems like fail2ban. The RateLimitIntervalSec and RateLimitBurst settings in journald are designed to prevent overly chatty services from overwhelming the system. If a service starts generating unique error messages containing each user's IP address, journald will rate-limit that service and drop the excess messages, and I won't be able to see the real problem. This is a form of cardinality explosion, just occurring on logs instead of metrics. In one of my projects, I saw that journald's rate limit was hit because a service was generating thousands of logs related to invalid_auth_token errors from different IPs at 3:14 AM. This delayed my understanding of why the system was malfunctioning during those critical hours.

Costs and Resource Consumption

One of the most tangible and painful consequences of cardinality explosion is its direct impact on costs. Especially if you use cloud-based metric services, high cardinality can quickly lead to uncontrollable bills. Many hosted services charge by the number of active series, by ingestion volume, or by storage, and millions of unique metric series push all three up, which translates to astronomical bills.

In the backend of my side project's financial calculators, I use Redis as a cache. Initially, I included each user's unique query parameters in the Redis keys. Before long, the Redis server's memory filled up and the maxmemory eviction policy kicked in, evicting important data. This was essentially a cardinality explosion of Redis keys. I resolved this issue by using more generalized keys, or by setting an explicit TTL (Time-To-Live), rather than embedding user_id and query_hash directly into the keys. Switching the maxmemory-policy setting to allkeys-lru also relieved the system somewhat, since Redis then evicts the least recently used keys first.

# Checking and setting the maxmemory eviction policy via Redis CLI
redis-cli config get maxmemory-policy
redis-cli config set maxmemory-policy allkeys-lru
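
As one possible reading of "more generalized keys", the sketch below builds a cache key from the calculation alone, with inputs rounded to a fixed precision, so that different users asking for the same calculation share one entry. The key layout, the cache_key helper, the parameter names, and the TTL value are all assumptions of mine, not the scheme I described above in detail.

```python
# Hypothetical TTL; the right value depends on how stale a cached result may be.
DEFAULT_TTL_SECONDS = 3600


def cache_key(calculator: str, params: dict[str, float], precision: int = 2) -> str:
    """Build a key from the calculation only, never from who asked for it."""
    parts = [
        f"{name}={round(float(value), precision)}"
        for name, value in sorted(params.items())  # sorted: order-independent key
    ]
    return ":".join(["calc", calculator, *parts])


key_a = cache_key("loan", {"rate": 5.249, "years": 30}, precision=1)
key_b = cache_key("loan", {"years": 30.0, "rate": 5.2}, precision=1)

print(key_a)           # calc:loan:rate=5.2:years=30.0
print(key_a == key_b)  # True
```

With a Redis client, such a key would be written together with an expiry (for example SET with the EX option), so that entries nobody asks for again disappear on their own instead of waiting for eviction.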

This situation manifests differently on bare-metal servers as well. High cardinality causes metric collection services like Prometheus to consume more CPU and RAM, because reading, processing, and indexing the data requires more processing power. In a client project, I observed that 80% of the memory on a server with 128 GB of RAM was being used solely by Prometheus to index and query metrics. Those were resources that could have gone to other critical workloads. In another case, the cgroup memory.high limit on my Prometheus container was hit continuously because the cardinality explosion pushed memory usage past what the container was sized for. This indicated constant memory pressure on the system, and I eventually had to migrate to a larger server.

Solution Approaches and My Experiences

I've tried several different approaches to deal with cardinality explosion and found that each has its own trade-offs. The basic strategy is to either discard unnecessary labels before collection or drop them after collection and aggregate data at higher levels.

Filtering and Dropping Labels (Relabeling): In systems like Prometheus, relabeling is one of the most powerful tools. relabel_configs rewrites a target's labels before it is scraped, and metric_relabel_configs modifies or drops the labels of scraped samples before they are stored. The request_uuid example I gave earlier was a concrete application of this. In another scenario, by dropping the git_commit_hash label, which changes with each deployment of a microservice, I ensured that metric series remained independent of deployments. This was critical when implementing blue-green deployment strategies.
Data Aggregation: Storing some metrics at a lower resolution or collecting them with less detailed labels can significantly reduce cardinality. For example, we can store CPU usage metrics that arrive every 10 seconds as 1-minute averages. Or we can generalize the path label from /api/v1/users/123 to /api/v1/users/*. Aggregation is usually done with recording rules, while label generalization is done in the instrumentation or with a replace relabeling rule. In my production ERP, I reduced cardinality by 95% by replacing the product_serial_number label, which contained each product's unique serial number, with just the product_type label.
Sampling: For very high-volume events, sampling instead of collecting all data is an option. For example, collecting metrics for only one out of every 100 requests. However, this leads to a loss of detail in the data and carries the risk of overlooking certain rare situations. Therefore, it should be used carefully in critical systems. In my Android spam app, instead of logging the details of every incoming call individually, I collected metrics only for specific types of calls or for calls with a spam score exceeding a certain threshold.
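
The path generalization mentioned under Data Aggregation can be sketched in a few lines. This is a minimal version under my own assumptions: only purely numeric segments and UUID-shaped segments are treated as identifiers, and the normalize_path name is hypothetical. A real service would more likely use the route template its web framework already knows.

```python
import re

# Assumption: identifiers are either all digits or UUID-shaped.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Replace identifier-like path segments with '*' and drop the query string."""
    segments = path.split("?", 1)[0].split("/")
    return "/".join("*" if _ID_SEGMENT.match(segment) else segment for segment in segments)


raw_paths = [f"/api/v1/users/{user_id}" for user_id in range(1000)]
label_values = {normalize_path(path) for path in raw_paths}

print(normalize_path("/api/v1/users/123"))  # /api/v1/users/*
print(len(label_values))                    # 1
```

A thousand distinct paths collapse into a single label value, while a segment like v1 is left alone because it is not purely numeric.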

💡 Aggregation and Detail Loss Trade-off

When aggregating to reduce cardinality, you must carefully determine how much detail loss you are willing to accept. Overly aggressive aggregation can make it difficult to find the root cause of a problem. This is a matter of trade-off.

Proper Metric Design: From the very beginning, it's critical to think about which labels are truly necessary and which will lead to high cardinality. When designing a metric, I always ask myself: "Will each unique value of this label truly help me find the cause of a problem, or is it just noise?" Most of the time, values that are always unique, like request_id or session_id, should not be used as metric labels. These are more meaningful in trace or log systems.
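
That design question can be turned into a small review check that runs before a metric ships. The deny-list, the label budget of 5, and the review_labels name are hypothetical choices of mine; the point is only that the check is cheap to automate.

```python
# Hypothetical deny-list of labels whose values are unique (or nearly so) per event.
UNBOUNDED_LABELS = {"request_id", "request_uuid", "session_id", "transaction_id", "user_id"}


def review_labels(metric_name: str, label_names: list[str], max_labels: int = 5) -> list[str]:
    """Return human-readable findings; an empty list means the labels look fine."""
    findings = []
    for name in label_names:
        if name in UNBOUNDED_LABELS:
            findings.append(
                f"{metric_name}: label '{name}' is unbounded; keep it in logs or traces"
            )
    if len(label_names) > max_labels:
        findings.append(
            f"{metric_name}: {len(label_names)} labels exceeds the budget of {max_labels}"
        )
    return findings


for finding in review_labels("http_requests_total", ["method", "status", "request_id"]):
    print(finding)
```

A check like this could run in code review or CI against the metric definitions of a service, so that an unbounded label is caught before it reaches production.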

Practical Applications and Recommendations

I'd like to share the practical experiences and recommendations I've gained for preventing and managing cardinality explosion. These are approaches that have been useful to me both when setting up systems from scratch and when fixing problems in existing systems.

Constrain and Standardize Labels: Determine the number of labels you will use for a metric and the range of unique values each label can take from the outset. For example, using more general labels like service_name and environment instead of hostname allows you to monitor the same service running on different machines as a single logical entity. In my company's general network segmentation, instead of using a separate label for each VLAN, I add meaning to metrics using only more general labels like network_zone and segment_type.
Configure Auto-Discovery Systems Carefully: Systems like Prometheus automatically discover targets and start collecting metrics through service discovery. However, nothing guarantees that a newly discovered service or container will not bring high-cardinality labels with it. In a Docker Compose-based deployment of mine, each new container automatically sending metrics with a container_id label caused an unexpected increase in cardinality. In this situation, dropping the container_id label with relabel_configs was the best solution.
Regularly Monitor Cardinality: It's crucial for your metric system to monitor its own cardinality. Prometheus exposes internal metrics such as prometheus_tsdb_head_series, which tracks the number of series currently held in the head block, that is, the active series. A sudden increase in this metric is a sign that a cardinality explosion is starting. I've set up an alert on it in the monitoring system running on my own VPS. When the number of active series exceeds a certain threshold (e.g., 1 million), I receive a notification.

# Query showing the number of active metric series in Prometheus
sum(prometheus_tsdb_head_series)

This query provides the instantaneous number of active metric series, and keeping this number below a certain threshold is generally good practice. For example, if this value suddenly increases by 50%, I investigate it immediately.
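
The two rules of thumb above (an absolute threshold and a sudden 50% jump) can be combined into one small check. This is a sketch under my own assumptions: the function name, the return values, and the idea of comparing two consecutive readings of the active series count are illustrative, not something Prometheus provides.

```python
from typing import Optional


def cardinality_alert(
    previous: int,
    current: int,
    hard_limit: int = 1_000_000,
    max_growth: float = 0.5,
) -> Optional[str]:
    """Compare two readings of the active series count and name the rule that fired."""
    if current > hard_limit:
        return "hard_limit"
    if previous > 0 and (current - previous) / previous >= max_growth:
        return "sudden_growth"
    return None


print(cardinality_alert(400_000, 420_000))    # None
print(cardinality_alert(400_000, 650_000))    # sudden_growth
print(cardinality_alert(900_000, 1_100_000))  # hard_limit
```

The growth rule matters because an explosion is easiest to fix while it is still small: a jump from 400,000 to 650,000 series is far below the hard limit but already worth investigating.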

Use SLOs (Service Level Objectives) and Error Budgets: SLOs guide me in determining which metrics are truly important and at what level of detail they need to be monitored. If a metric does not directly contribute to an SLO or is not necessary for managing an error budget, there's no point in monitoring it with high-cardinality labels. While tracking the production line's efficiency SLO in a production ERP, using labels that only contained the machine type and production line, instead of each machine's unique serial number, was sufficient.

Conclusion

Cardinality explosion is an insidious problem that, if ignored, slows down your systems, increases your costs, and weakens your monitoring capabilities. My approach to this issue has always been proactive; that is, to prevent the problem before it arises. When designing our metrics, we must carefully consider which labels truly carry operational meaning and which ones only create noise.

My experiences have shown me that proper relabel_configs rules, smart aggregation strategies, and continuous monitoring of metric cardinality are the keys to dealing with this problem. High-cardinality metrics often create new problems rather than solving existing ones. Therefore, while keeping unique identifiers like request_id in logs and tracing systems, we should feed our metric systems with more generalized and meaningful labels. This way, our systems will run more stably, and we can focus on the data that truly matters. In the next post, I will explain how I automated these approaches using a systemd unit and how I managed cgroup limits.