High Cardinality Metrics: Does the Benefit Outweigh the Cost?

#monitoring #observability #performance #metrics

High Cardinality Metrics: The Silent Monster or Hidden Hero of Our Systems?

Monitoring the health of our systems is critical, especially in today's dynamic, distributed architectures, and metric collection is one of its cornerstones. Not all metrics carry the same value, though: some offer a general overview, while others carry very fine-grained detail. That fine-grained detail is where "high cardinality metrics" come from. High cardinality means that a metric's labels can take a very large number of unique values. Identifiers such as a user ID, a request ID, or a trace ID are typical examples, because every new value adds another time series. So how much does access to this level of detail affect our systems' cost and performance? In this post, I will look at the impact of high cardinality metrics on our systems, their benefits, and their costs.

High cardinality metrics might seem indispensable at first glance for understanding system behavior. Labeling every request and every operation with a unique identifier makes finding the root cause during an issue incredibly easy. Let's consider a problem we encountered a few months ago in an e-commerce platform's order processing system. We were experiencing intermittent delays in our orders, and detailed metric collection was essential to find the source of this delay.

# Example query (in a system like Prometheus/Mimir)
sum by (user_id, order_id) (
  rate(http_requests_total{job="order-processor", status="500"}[5m])
)

With a query like the one above, we could see within seconds which users and which orders were receiving 500 errors. This helped us understand if the problem was a general database issue or specific to a particular order flow. If we hadn't collected this order_id as a label, finding the source of the problem could have taken hours, or even days. Situations like these demonstrate how valuable high cardinality can actually be.

What is High Cardinality and Why is it Important?

High cardinality is the situation where the labels of a metric take a very large number of unique values. As a simple example, consider monitoring the response times of a web server with a metric named http_request_duration_seconds. If we collect this metric with only the endpoint label, its cardinality will be relatively low. However, when we add labels such as user_id, request_id, or tenant_id alongside endpoint, every unique combination of label values becomes its own time series, and the series count multiplies with each label.

For instance, in a SaaS application with thousands of active users, assume each user has their own tenant_id. If we collect a metric like active_users_count with tenant_id, the cardinality instantly jumps to thousands. This directly impacts the storage, processing, and querying costs of the monitoring system.
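To make the arithmetic concrete, here is a minimal Python sketch of the worst case: the number of series for one metric is bounded by the product of the distinct values of each label. The function name and the label counts are illustrative assumptions, not figures from a real system.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each key is a label name, each value the number of distinct values
    that label can take. The bound is the product of those counts.
    """
    return prod(label_value_counts.values())


# Hypothetical label counts, chosen only for illustration.
endpoint_only = worst_case_series({"endpoint": 20})
with_tenant = worst_case_series({"endpoint": 20, "tenant_id": 5_000})

print(endpoint_only)  # 20
print(with_tenant)    # 100000
```

The real series count is usually lower than this bound, because not every combination of label values actually occurs, but the bound shows how quickly a single extra label can dominate.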

ℹ️ Technical Definition

High cardinality typically refers to metrics with millions of unique label value combinations. Each combination is stored as a separate time series, so the amount of data that metric databases (e.g., Prometheus, VictoriaMetrics, Mimir) hold in memory and on disk grows in proportion to the series count, which itself multiplies with every high cardinality label added.

Reaching such a high level of cardinality allows us to capture unique information from every corner of the system. A few years ago, while working at a financial technology company, I was trying to resolve a performance issue in the company's internal banking system. The problem was intermittent slowdowns in a specific transaction type. Initially, general metrics were not sufficient. However, once I ensured that each transaction was labeled with a unique transaction_id, I was able to find the source of the problem with incredible speed.

# Example query (in a system like Prometheus/Mimir)
avg by (transaction_id) (
  http_request_duration_seconds{job="transaction-processor", status="200"}
) > 2.0

This query showed all transactions taking longer than 2 seconds, broken down by transaction_id. Without the transaction_id, I would have had to scour logs to understand which transaction was slow. This experience is concrete proof of how much high cardinality can accelerate troubleshooting, especially in complex, distributed systems.

Storage and Processing Costs: The Price of High Cardinality

While the detailed information provided by high cardinality metrics is appealing, it comes at a price. The most apparent cost is storage. Metric databases store each unique label combination as a separate time series. As cardinality increases, the amount of data that needs to be stored grows with the number of series, and that number multiplies with every additional high cardinality label. This imposes a significant burden on both disk space and memory usage.

A few years ago, while working at a fintech company, I noticed that the costs of our metric collection system (a customized Prometheus setup) were rapidly increasing. When we analyzed it, we found that the main reason was the session_id label we had added to monitor user sessions. Each user session had a unique ID, and adding this ID to our metrics doubled the size of our database within a few months.

⚠️ Cost Increase

High cardinality metrics can significantly increase the cost of metric database infrastructure, especially when long-term data retention policies are applied. This situation should be considered in budget planning.
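For budget planning, a rough disk estimate can be sketched in Python. It follows the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name and the input numbers below are assumptions for illustration; index overhead, series churn, and memory are not modeled.

```python
def estimated_disk_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # assumed upper end of the 1-2 byte rule of thumb
) -> float:
    """Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    # samples_per_second = active_series / scrape_interval_s
    return retention_seconds * active_series * bytes_per_sample / scrape_interval_s


# Hypothetical setup: 100,000 series, 15 s scrape interval, 30 days retention.
baseline = estimated_disk_bytes(100_000, 15, 30)
doubled = estimated_disk_bytes(200_000, 15, 30)

print(f"{baseline / 1e9:.2f} GB")  # 34.56 GB
print(f"{doubled / 1e9:.2f} GB")   # 69.12 GB
```

The estimate is linear in the number of series, so a label that doubles the series count roughly doubles the disk requirement for the same retention period.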

In addition to storage costs, processing costs should not be overlooked. Queries on high cardinality metrics consume more CPU and memory on the database, because every matching series has to be loaded and evaluated. The load increases further when complex aggregations and functions (sum, avg, rate, etc.) are applied. This can lead to longer query response times and a decrease in overall system performance.

For example, in a client project, querying millions of requests labeled with request_id could take several minutes to complete. During this time, other metric queries of the system were also affected. This was a significant problem not only in terms of cost but also operational efficiency.

# Example of a slow query
sum by (pod_name) (
  rate(container_cpu_usage_seconds_total{namespace="production", pod_name=~"app-.*"}[5m])
)

When there are thousands of pods and the query calculates CPU usage for each pod individually, queries like this can lead to performance issues. Therefore, when using high cardinality metrics, query optimization and database configuration are of great importance.

Using High Cardinality Wisely: Trade-offs and Best Practices

Instead of completely abandoning high cardinality metrics, using them wisely is often the best approach. This means identifying which metrics truly need high cardinality and keeping costs under control.

First, it's important to question whether every metric label truly requires high cardinality. For example, labels like environment (production, staging, development) or region (us-east-1, eu-west-2) generally have low cardinality and don't cause problems. However, labels like user_id, request_id, session_id have high cardinality potential.

A few years ago, while working for a telecommunications company, we were monitoring the performance of a service. Initially, we added a unique connection_id label to the metrics for each connection. With approximately 500,000 active connections, this meant at least one series per connection for every such metric. To solve the problem, we decided to attach the connection_id only to connections that were in an error state or had been slow for a certain period.

💡 Example Application

Contextual Cardinality: In specific scenarios like troubleshooting or debugging, you can enable high cardinality labels only for the relevant metrics. For example, you might record the error_code and request_id labels only when a specific error code is returned.

This approach significantly reduced the overall metric collection load. In other words, instead of recording everything all the time, being able to access the relevant details when needed was a more sustainable solution. This is a reflection of the "good enough" philosophy; we avoid unnecessary complexity and collect only the details that truly add value.
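The idea of labeling only problematic connections can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the setup we actually ran: the metric names, the `Connection` type, and the 2-second threshold are all hypothetical. Because many client libraries require a fixed label set per metric, the sketch emits the detailed label on a separate metric rather than adding it conditionally to the general one.

```python
from dataclasses import dataclass

SLOW_THRESHOLD_S = 2.0  # hypothetical threshold for "slow"


@dataclass
class Connection:
    connection_id: str
    region: str
    duration_s: float
    failed: bool


def samples_for(conn: Connection) -> list[tuple[str, dict[str, str]]]:
    """Return (metric_name, labels) pairs to record for one connection.

    Every connection contributes to a low cardinality metric. Only failed
    or slow connections also contribute to a second metric that carries
    the high cardinality connection_id label.
    """
    base = {"region": conn.region, "outcome": "error" if conn.failed else "ok"}
    samples = [("connections_total", base)]
    if conn.failed or conn.duration_s > SLOW_THRESHOLD_S:
        detailed = {**base, "connection_id": conn.connection_id}
        samples.append(("connection_problems_total", detailed))
    return samples


healthy = Connection("c-1", "eu-west-2", 0.3, failed=False)
slow = Connection("c-2", "eu-west-2", 4.1, failed=False)

print(samples_for(healthy))
print(samples_for(slow))
```

With this split, the series count of the general metric depends only on region and outcome, while the detailed metric grows only with the number of problematic connections.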

Second, careful selection of the metric database is crucial. Some metric databases are optimized to handle high cardinality better. For instance, distributed metric storage systems like Mimir or Cortex offer advantages in terms of horizontal scalability and query performance. Solutions like VictoriaMetrics also claim to optimize memory and disk usage.

Finally, query optimization is vital. For queries on high cardinality metrics to be efficient, you must carefully design your metric schema. In your queries, group only by the labels you actually need, and filter as narrowly as possible so that fewer series have to be loaded.

# Optimized query example
sum by (endpoint) (
  rate(http_requests_total{job="api-gateway", status=~"5..|4.."}[5m])
)

This query shows error rates broken down by endpoint only. If we want to analyze a specific request_id, we can make this query more specific:

# More specific query
# A per-request series is usually incremented only once, so select it directly instead of taking a rate
http_requests_total{job="api-gateway", status=~"5..", request_id="xyz123"}

This approach allows us to use metrics more efficiently and keep costs under control.

High Cardinality in Monitoring Systems: Real-World Scenarios

One of the most common areas where high cardinality metrics are encountered is in applications where we monitor user behavior. In a mobile application, each user might have a unique user_id, and each interaction might have an event_id. Collecting this information is invaluable for improving user experience and identifying issues.

A few months ago, in an Android application I developed, I was monitoring the rate at which users were using a specific feature. Initially, I used a simple counter named feature_usage. However, when I wanted to understand which users were using the feature and under what conditions, I realized I needed to add labels like user_id and device_model.

# Example query (in a system like Prometheus/Mimir)
sum by (feature_name, device_model) (
  increase(app_feature_usage_total{user_id!="", device_model!=""}[1d])
)

This query showed which features were being used more and on which device models this usage was concentrated. Information like this directly influences product development decisions. However, these added labels also rapidly increased the size of the metric database as the application's user base grew. Therefore, I was careful to label only truly critical interactions and user segments with high cardinality.

Another example is distributed tracing. Each request carries a unique trace_id as it travels between different services. This trace_id allows us to understand which services the request passed through and how much time it spent in each one. This is an incredibly powerful tool for finding performance bottlenecks.

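# Example query by Trace ID
# No job filter, so every service that handled the trace is returned;
# the series is read directly because a single trace has no meaningful rate
avg by (service_name) (
  http_request_duration_seconds{trace_id="trace-12345"}
)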

A query like this shows in which service a specific request (trace) is stuck or slow. However, recording every request with a trace_id generates an enormous amount of data. Therefore, sampling is often used when collecting trace data. Collecting only a fixed fraction of traces provides sufficient detail while keeping costs under control.
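One common way to sample is to decide from a hash of the trace_id, so that every service reaches the same keep-or-drop decision for a given trace without coordination. The Python sketch below illustrates that idea under stated assumptions: the function name and the 10% rate are hypothetical, and real tracing systems ship their own samplers.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all traces.

    The first 8 bytes of the SHA-256 digest are read as an integer in
    [0, 2**64) and compared against sample_rate * 2**64, so the same
    trace_id always produces the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Hypothetical trace IDs; about 10% of them should be kept.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]
print(len(kept))
```

Because the decision depends only on the trace_id, a kept trace is kept in every service it passes through, which keeps sampled traces complete.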

Conclusion: A Balanced Approach

High cardinality metrics are a powerful tool that allows us to delve into the depths of our systems. They can accelerate troubleshooting processes, help us understand user behavior, and provide critical information for optimizing system performance. However, this power comes at a price: increased storage and processing costs, more complex queries, and potential performance issues.

In my experience, the most effective approach is a balanced one. Carefully evaluating whether each metric label truly requires high cardinality, collecting only the details that genuinely add value, and acting wisely in metric database selection and query optimization allows us to leverage the benefits of high cardinality while keeping costs under control.

We must remember that the purpose of monitoring systems is to provide us with actionable information. Excessive detail can sometimes be like "searching for a needle in a haystack" and can unnecessarily increase costs. The important thing is to be able to access the right data, at the right time, at a reasonable cost. This requires a continuous process of evaluation and optimization.