Managing High Cardinality Metrics in 3 Steps: Cost vs. Detail

#career #observability #metrics #costoptimization

Today, I'm diving into a problem I frequently encounter in operational metrics, but one that isn't discussed enough: High Cardinality Metrics. We face this situation especially in large-scale systems when we want to keep a separate metric record for each request, each connection, or each user. While this means our system provides highly detailed information, it can also turn into a monster that rapidly drains our budget. Based on my experience, I'll explain how we can manage these high cardinality metrics in three steps and how to strike that delicate balance between cost and detail level.

What is High Cardinality and Why is it a Problem?

First, let's clarify the concept of high cardinality. A metric's cardinality is the number of unique label combinations it has, and each combination is stored as its own time series. For example, when monitoring an HTTP request, using labels like method, path, status_code is common and manageable, because each of them takes only a small set of values. However, when you add labels whose values are unique per user or per request, such as user_id, request_id, trace_id, cardinality quickly escalates to astronomical levels.

So, why is this a problem? The fundamental reason is the amount of resources required to store, process, and query these metrics. Time Series Databases (TSDBs) consume more disk space, more CPU, and more memory as cardinality increases. Your queries become slower because the database has to scan more unique keys. This can cause your bills to swell rapidly, especially with cloud-based metric collection services. For example, in systems like Prometheus, each new label combination means a new series. The number of possible series is the product of the number of values each label can take, so a single unbounded label multiplies storage and query load.

ℹ️ Cardinality Example

Consider a web server monitoring HTTP requests.

Low Cardinality:

http_requests_total{method="GET", path="/api/v1/users", status_code="200"}

http_requests_total{method="POST", path="/api/v1/products", status_code="201"}

High Cardinality:

http_requests_total{method="GET", path="/api/v1/users", status_code="200", user_id="user-12345", request_id="req-abcde"}

http_requests_total{method="GET", path="/api/v1/users", status_code="200", user_id="user-67890", request_id="req-fghij"}

The user_id and request_id labels in the second example inflate cardinality dramatically: user_id adds one series per user, and request_id turns every single request into its own series.
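To make the growth concrete, here is a minimal sketch that computes the worst-case series count for one metric as the product of the distinct values per label. The value counts below are hypothetical and chosen only for illustration; real systems rarely hit every combination, so treat the result as an upper bound.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical numbers of distinct values per label (assumptions, not measurements).
low = {"method": 5, "path": 50, "status_code": 10}
high = {**low, "user_id": 100_000}

print(worst_case_series(low))   # 2500
print(worst_case_series(high))  # 250000000
```

Adding one label with 100,000 values multiplies the bound by 100,000, which is why a single identifier label can dominate the cost of a metric.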

At this point, the "I must record everything, just in case" approach quickly leads to an unsustainable cost burden. Therefore, understanding which metrics are truly valuable and managing them intelligently is critically important.

Step 1: Needs Analysis and the Question "What Should We Monitor?"

The first step, as always, is to understand what we need to monitor. This requires a much deeper analysis than simply saying "let's monitor the error rate." Do we want to know which users are affected? Which specific API endpoints are experiencing issues? Which infrastructure component is causing this slowdown? The answers to these questions will determine which labels we need.

In my experience, most of the time we don't need all the details. For example, do we really need to know which user's request took 5 seconds, or is it enough to know how many requests took that long? If we are doing general performance analysis, instead of tracking individual users with a user_id label, we can group request durations into categories (e.g., 0-1 second, 1-5 seconds, 5-10 seconds, 10+ seconds). Such grouping (like histogram buckets) gives up per-request precision, but it keeps enough detail to identify critical performance issues while significantly reducing cardinality and cost.
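The grouping described above can be sketched in a few lines. This is only an illustration: the bucket names and the choice to put a value that sits exactly on a boundary into the lower bucket are my assumptions, not a fixed convention.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds in seconds for the ranges mentioned above; the last bucket is open-ended.
BOUNDS = [1.0, 5.0, 10.0]
NAMES = ["0-1s", "1-5s", "5-10s", "10s+"]  # hypothetical bucket names


def duration_bucket(seconds: float) -> str:
    """Map a request duration to one of a fixed set of buckets."""
    # bisect_left places a value equal to a bound in the lower bucket (1.0 -> "0-1s").
    return NAMES[bisect_left(BOUNDS, seconds)]


durations = [0.2, 0.9, 1.0, 3.5, 7.0, 300.0]  # sample data for illustration
counts = Counter(duration_bucket(d) for d in durations)
print(counts)
```

However many users or requests there are, this label can only ever take four values, so its contribution to cardinality is fixed.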

In one project, we were adding user_id and tenant_id labels directly to our metrics for every request, on the logic of "let's monitor every user and every customer." We soon saw our data storage costs increase by 40%. With 100,000 active users and 500 tenants in our system, these two labels multiplied every metric they touched, and the number of label combinations ran into the millions. After our analysis, we decided that it was sufficient to monitor only problematic requests or the specific tenants that triggered slow performance. We removed the user_id label completely and kept the tenant_id label only on metrics that exceeded a certain error threshold. The outcome? A 25% reduction in costs, while still being able to detect issues.

Step 2: Aggressive Filtering and Summarization Techniques

After identifying our needs, the second step is to filter and summarize our metrics to meet those needs. The techniques that come into play here allow us to keep cardinality under control.

The first method is to eliminate unnecessary labels from the outset. Labels like request_id or trace_id are only used during debugging, so we rarely need them on every metric. They are generally unnecessary for normal operational monitoring and they inflate cardinality. We can send these identifiers to logs collected during error situations, or to specialized tracing tools, instead.
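As a rough sketch of that split, the snippet below keeps low-cardinality labels for the metric and routes debug-only identifiers to a structured log line. The DEBUG_ONLY set, the split_labels helper, and the log message format are hypothetical names I made up for the example.

```python
import json

# Hypothetical denylist of identifiers that belong in logs or traces, not metrics.
DEBUG_ONLY = {"request_id", "trace_id"}


def split_labels(labels: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Return (labels safe for metrics, identifiers to attach to a log record)."""
    metric_labels = {k: v for k, v in labels.items() if k not in DEBUG_ONLY}
    debug_context = {k: v for k, v in labels.items() if k in DEBUG_ONLY}
    return metric_labels, debug_context


requests = [
    {"method": "GET", "status_code": "500", "request_id": "req-abcde"},
    {"method": "GET", "status_code": "500", "request_id": "req-fghij"},
]

series = set()
for labels in requests:
    metric_labels, debug_context = split_labels(labels)
    series.add(tuple(sorted(metric_labels.items())))  # one entry per distinct series
    print(json.dumps({"msg": "request failed", **debug_context}))

print(len(series))  # 1
```

Both failed requests land in the same series, while the request_id stays available in the log line for anyone who needs to debug a specific request.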

The second method involves aggregation and summarization techniques. When processing metrics in our collection tool or database, we can combine some labels or record only counts within specific ranges. For example, for a duration_seconds metric, instead of storing every individual value, we can create a histogram with a fixed set of upper bounds:

duration_seconds_bucket{le="0.1"}
duration_seconds_bucket{le="0.5"}
duration_seconds_bucket{le="1.0"}
duration_seconds_bucket{le="+Inf"}

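This approach, instead of storing the exact duration of each request, counts how many requests finished at or below each upper bound (le stands for "less than or equal", so the buckets are cumulative). Each bucket is its own series, but the number of buckets is fixed, so cardinality no longer grows with the number of requests, and we can still understand the general state of performance. Prometheus's histogram_quantile function can calculate approximate percentiles from such histogram data.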
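To show how a percentile can be recovered from cumulative buckets, here is a simplified sketch in the spirit of histogram_quantile. It assumes observations are spread evenly inside each bucket and that the lowest bucket starts at 0; it is not Prometheus's actual implementation, and the bucket counts are made up.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No upper limit to interpolate towards: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Same bounds as the buckets above; the counts are hypothetical.
buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (math.inf, 100)]
p50 = estimate_quantile(0.5, buckets)
p90 = estimate_quantile(0.9, buckets)
p99 = estimate_quantile(0.99, buckets)
print(p50, round(p90, 3), p99)
```

Note that the 99th percentile falls in the +Inf bucket and is reported as 1.0, the highest finite bound. If the tail matters, the bucket bounds need to extend far enough to cover it.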

Another effective technique is to use "recording rules." In Prometheus, we can create new metrics that are pre-aggregated over specific labels. For example, we can define a new metric that sums http_requests_total across the path label and keeps only the status_code and method labels. This gives us a lower cardinality summary metric for frequent queries, while the detailed main metric stays available for the cases that need it.
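The aggregation such a rule performs can be mimicked in plain Python. The sketch below does what a PromQL expression like sum by (method, status_code) (http_requests_total) does to a set of samples; the sum_by helper and the sample values are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Collapse series by summing values over every label not listed in `keep`."""
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        out[key] += value
    return dict(out)


# Sample counter values for illustration only.
samples = [
    ({"method": "GET", "path": "/api/v1/users", "status_code": "200"}, 120),
    ({"method": "GET", "path": "/api/v1/products", "status_code": "200"}, 80),
    ({"method": "POST", "path": "/api/v1/products", "status_code": "201"}, 15),
]

summary = sum_by(samples, keep=("method", "status_code"))
for key, value in summary.items():
    print(dict(key), value)
```

Three input series collapse into two here. With dozens or hundreds of distinct paths, the reduction for dashboards that only need method and status code becomes much larger.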

⚠️ Caution When Using Recording Rules!

Every recording rule creates new series that are stored alongside the original metric. Defining too many rules, or keeping too many labels in them, can therefore increase total cardinality instead of reducing it. It is important to carefully determine which summary metrics are truly necessary.

Additionally, some metric collection tools offer features like "label dropping" or "label filtering." These features help reduce cardinality by automatically removing specific labels before metrics are stored or processed. For example, we can automatically drop labels like request_id in the main collection pipeline.

Step 3: Cost Tracking and Optimization Cycle

The final step is to turn this process into a continuous improvement cycle. Managing high cardinality metrics is not a one-time task; it requires continuous monitoring and adjustment.

First, we must regularly track the cost of our metrics. By examining the billing details of the metric collection service we use, we should understand which metrics or labels affect the cost the most. Some hosted services break usage down by metric or by label; where that is available, we can use it to identify which labels are "expensive."

If we notice that a metric or label group is causing unexpectedly high costs, we should return to the first two steps: review the needs analysis and apply aggressive filtering/summarization techniques. Perhaps a label that was once important is no longer as critical. Or perhaps we can make our summarization technique more efficient.

For example, on an e-commerce platform, we were monitoring metrics related to payment transactions. Initially, we were recording them with labels like payment_method, currency, transaction_id, user_id. When we performed a cost analysis, we found that transaction_id and user_id accounted for 70% of the total cost. While these labels were useful for debugging, they were unnecessary for general performance tracking. The actions we took were:

We kept the transaction_id and user_id labels only on metrics that exceeded a certain error threshold (e.g., payment errors) and removed them from all other metrics.
We kept the currency label only where we compared different currencies. In most cases it was unnecessary, because a single currency was used.
For the remaining metrics, we created lower cardinality metrics using recording rules.

As a result of these changes, we achieved more than a 50% reduction in our metric costs. The process taught us to keep looking for the cheapest set of labels that still answers the questions we actually ask.

💡 Recommended Checkpoints

Monthly Cost Reports: Regularly check the cost of your metric infrastructure.

Cardinality Monitoring: Monitor the cardinality values of the metric systems you use. Some systems offer special tools for this.

Label Audit: Periodically review which labels are being used and how many unique combinations they create.
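A label audit can start as simply as counting distinct values per label. The sketch below assumes you can export the label sets of your active series as dictionaries; the audit helper and the payment series are hypothetical examples.

```python
from collections import defaultdict


def audit(series: list[dict[str, str]]) -> list[tuple[str, int]]:
    """Return (label name, number of distinct values), most expensive label first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return sorted(
        ((name, len(vals)) for name, vals in values.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical label sets for a payment metric.
series = [
    {"payment_method": "card", "currency": "EUR", "transaction_id": "t1"},
    {"payment_method": "card", "currency": "EUR", "transaction_id": "t2"},
    {"payment_method": "wallet", "currency": "EUR", "transaction_id": "t3"},
]

for name, distinct in audit(series):
    print(f"{name}: {distinct}")
```

Labels whose distinct-value count keeps growing with traffic (here transaction_id) are the first candidates for removal, while a label with a single value (here currency) adds no information at all.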

This continuous optimization cycle allows us to keep our operational costs under control and enables us to deeply understand our system's performance. We must remember that the best metric is one that provides accurate information without straining our budget. High cardinality management is key to striking this balance.

In conclusion, while high cardinality metrics may seem daunting at first glance, we can overcome this challenge with proper analysis, intelligent filtering, and continuous optimization. By following these three steps, we can keep the detail we need about our systems while managing our costs. This approach is not only budget-friendly but also makes our monitoring infrastructure more efficient and sustainable.