Metric Cardinality: High or Low? 4 Steps to Making the Right Choice

#life #metrics #cardinality #monitoring

In metric collection systems, cardinality is a critical concept for balancing performance and cost. I have prepared a 4-step guide on how this balance is established in the real world.

In this post, I will explain what metric cardinality is, why it matters, and how we can find the right balance in our systems based on my own experiences. We won't just stick to theoretical knowledge; we will address this topic with concrete examples and steps.

What is Metric Cardinality and Why Should We Care?

Metric cardinality is the number of unique label combinations across the metrics in our monitoring systems. Simply put, every additional label, and every additional distinct value a label can take, multiplies the number of combinations. For example, when monitoring a server's CPU usage, adding labels like instance, job, region, and az increases cardinality.
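To make the definition concrete, here is a minimal Python sketch that counts label combinations for a single metric. The series and label values are made up for illustration; only the label names come from the example above.

```python
from math import prod

# Hypothetical label sets for one CPU usage metric; each dict is one series.
series = [
    {"instance": "web-1", "job": "node", "region": "eu-central-1", "az": "a"},
    {"instance": "web-2", "job": "node", "region": "eu-central-1", "az": "b"},
    {"instance": "web-3", "job": "node", "region": "us-east-1", "az": "a"},
    {"instance": "web-1", "job": "node", "region": "eu-central-1", "az": "a"},  # repeat
]


def cardinality(label_sets):
    """Number of unique label combinations actually observed."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


def worst_case(label_sets):
    """Upper bound: product of the distinct value counts of every label."""
    values = {}
    for labels in label_sets:
        for name, value in labels.items():
            values.setdefault(name, set()).add(value)
    return prod(len(seen) for seen in values.values())


print("observed:", cardinality(series))
print("upper bound:", worst_case(series))
```

The observed count is usually well below the upper bound because labels are correlated (a given instance lives in one region), but the upper bound shows how quickly independent labels multiply.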

This has a direct impact on storage space, query performance, and costs. High cardinality requires more disk space, causes queries to run slower, and leads to higher costs in cloud environments.

ℹ️ The Relationship Between Cardinality and Cost

For example, in systems like Prometheus, each unique label combination is stored as a separate time series. This directly increases disk usage and database load. Suppose you have 1000 servers. Two static labels such as environment: production and region: eu-central-1 do not multiply the series count, because every server carries the same values. A dynamic per-server label like instance_id, on the other hand, turns each metric into 1000 series, and combining it with further dynamic labels can quickly push the total to millions of unique time series. This is a cost item that should not be ignored, especially in large-scale distributed systems.
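The arithmetic behind that example can be sketched in a few lines of Python. The number of metrics (50) and the extra path label with 200 values are assumptions chosen only to show the multiplication; they are not measurements.

```python
def series_count(metrics, label_value_counts):
    """Worst-case series: metric names times the distinct values of each label."""
    total = metrics
    for count in label_value_counts.values():
        total *= count
    return total


METRICS = 50  # assumed number of metric names exposed per server

static_only = series_count(METRICS, {"environment": 1, "region": 1})
per_server = series_count(
    METRICS, {"environment": 1, "region": 1, "instance_id": 1000}
)
# "path" is a hypothetical second dynamic label with 200 distinct values.
per_server_and_path = series_count(
    METRICS, {"environment": 1, "region": 1, "instance_id": 1000, "path": 200}
)

print(static_only, per_server, per_server_and_path)
```

Static labels with a single value leave the count untouched, while each dynamic label multiplies it.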

The mindset of "let's label everything" when collecting metrics might provide more visibility initially, but in the long run, it can make our systems unmanageable. Therefore, we must definitely consider cardinality when defining our metric collection strategies.

Step 1: Analyze Your Existing Metrics and Labels

The first step is to understand which metrics you are collecting and which labels you have attached to them. This analysis helps you identify unnecessary or over-labeled metrics. Most monitoring systems can list their current metrics and labels. In Prometheus, you can run promtool tsdb analyze against the TSDB data directory, or explore the same data through visualization tools like Grafana.
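If you prefer to script this, Prometheus also exposes TSDB statistics over its HTTP API (GET /api/v1/status/tsdb). The sketch below only ranks the entries of such a response; the payload is an invented sample in the shape I understand that endpoint to return, so check the field names against your Prometheus version before relying on them.

```python
import json

# Illustrative payload shaped like a TSDB stats response; the numbers are invented.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "seriesCountByMetricName": [
      {"name": "up", "value": 1000},
      {"name": "http_requests_total", "value": 48000},
      {"name": "node_cpu_seconds_total", "value": 6400}
    ],
    "labelValueCountByLabelName": [
      {"name": "environment", "value": 1},
      {"name": "user_agent", "value": 12000},
      {"name": "instance", "value": 1000}
    ]
  }
}
""")


def top_entries(payload, section, limit=2):
    """Return the largest (name, value) pairs from one section of the payload."""
    rows = payload["data"].get(section, [])
    ranked = sorted(rows, key=lambda row: row["value"], reverse=True)
    return [(row["name"], row["value"]) for row in ranked[:limit]]


print(top_entries(SAMPLE, "seriesCountByMetricName"))
print(top_entries(SAMPLE, "labelValueCountByLabelName"))
```

Metrics with the most series and labels with the most distinct values are the first places to look.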

During this analysis, you should determine which labels provide truly distinctive information and which are just repetitive or static values. For example, a label like deployment_version: v1.2.3 on every metric gives you nothing to filter on while your entire system is running the same version, and every rollout replaces all existing series with new ones. Instead, it usually makes more sense to expose this information once, for example on a single dedicated metric that carries the version as its label.

💡 Practical Tips for Label Analysis

When examining labels for a metric, ask these questions:

Is the value of this label always the same? (e.g., environment: staging)

Do the values of this label actually provide distinctive filtering when querying the metric?

Can this label be removed without losing the meaning of the metric?

Can another method with lower cardinality be used instead of this label? (e.g., grouping its values into a few coarse buckets)

For example, if you have an API request count metric and use the http_method label, this makes sense because you might want to monitor GET and POST requests separately. However, adding a label like user_agent: <browser_info> to every request blows up cardinality and is usually unnecessary.
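The first two questions can be answered mechanically. Below is a minimal sketch that counts distinct values per label and flags labels that never vary or vary too much. The series data, the threshold of 100, and the verdict names are all assumptions for illustration.

```python
from collections import defaultdict


def analyze_labels(label_sets, high_threshold=100):
    """Classify each label by how many distinct values it takes."""
    values = defaultdict(set)
    for labels in label_sets:
        for name, value in labels.items():
            values[name].add(value)

    report = {}
    for name, seen in values.items():
        if len(seen) == 1:
            verdict = "constant"          # never distinguishes anything
        elif len(seen) >= high_threshold:
            verdict = "high-cardinality"  # candidate for removal or bucketing
        else:
            verdict = "keep"
        report[name] = (len(seen), verdict)
    return report


# Hypothetical series of an API request count metric.
series = [
    {"environment": "staging", "http_method": method, "user_agent": f"agent-{i}"}
    for i, method in enumerate(["GET", "POST"] * 150)
]

for name, (count, verdict) in analyze_labels(series).items():
    print(f"{name}: {count} distinct values -> {verdict}")
```

A label flagged here still needs a human decision: a constant label may be harmless, and a high-cardinality one may be worth its cost if you query by it every day.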

This analysis will reveal the "cardinality monsters" in your system. Recognizing these monsters is the first step to defeating them.

Step 2: Clean Up Non-Distinctive Labels

You should remove the unnecessary or repetitive labels you identified during the analysis. This usually requires updating the configuration of your metric collectors (agents). In Prometheus, for example, relabel_configs rewrites the labels of a target before it is scraped, while metric_relabel_configs rewrites the labels of the scraped samples before they are stored, so either can be used to remove unwanted labels.

It is important to be careful during this cleanup process. Accidentally removing an important label can limit your monitoring capabilities. Therefore, it is best to apply changes in a test environment first and carefully observe their effects. As an example, in a microservices architecture, a label like service_name is critical. However, more dynamic and always-unique labels like pod_name can increase cardinality and can be removed if they are not used in general queries.
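Before touching any configuration, it can help to rehearse on sample data what removing a label does. The sketch below drops pod_name and sums the series that become identical. The pod names and values are hypothetical, and summing is an assumption that fits counters; other metric types may need a different aggregation.

```python
from collections import Counter


def drop_label(samples, label):
    """Remove one label and sum the values of series that become identical."""
    merged = Counter()
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        merged[kept] += value
    return dict(merged)


# Hypothetical request counters, one series per pod.
samples = [
    ({"service_name": "auth-service", "pod_name": "auth-7d9f-abcde"}, 120),
    ({"service_name": "auth-service", "pod_name": "auth-7d9f-fghij"}, 80),
    ({"service_name": "user-service", "pod_name": "user-5c4b-klmno"}, 45),
]

by_service = drop_label(samples, "pod_name")
for labels, value in by_service.items():
    print(dict(labels), value)
```

Three per-pod series collapse into two per-service series, and any query that grouped by pod_name would stop working, which is exactly what the test environment should reveal.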

⚠️ Things to Consider in Label Cleanup

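When cleaning up labels, make sure you do not lose the meaning or queryability of the metrics. A metric_relabel_configs rule like the following removes the http_status_code label from the http_requests_total metric only:
  metric_relabel_configs:
    # Match only the http_requests_total metric by its name.
    - source_labels: [__name__]
      regex: 'http_requests_total'
      # Overwriting a label with an empty value removes it from the series.
      target_label: http_status_code
      replacement: ''
      action: replace
A labeldrop action with regex: 'http_status_code' would also remove the label, but labeldrop matches label names on every scraped metric, not just this one. Avoid action: keep on __name__ here, because it discards every metric that does not match.
Before applying such a rule, think about the cases where you filter the http_requests_total metric by http_status_code. If this filtering is done frequently, try to find a solution with lower cardinality instead of removing the label entirely. Also keep in mind that series which differed only by the removed label end up with identical label sets, and their values are not summed automatically.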
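One lower-cardinality option is to keep only the status class (2xx, 4xx, 5xx) instead of the exact code. The Python sketch below shows the effect on sample data; the request counts are invented, and in practice the bucketing would happen in the application or in relabeling rather than in a script like this.

```python
def status_class(code):
    """Map an HTTP status code such as 404 to its class, such as '4xx'."""
    return f"{int(code) // 100}xx"


def bucket_status(samples):
    """Sum request counts per status class instead of per exact status code."""
    totals = {}
    for labels, value in samples:
        key = status_class(labels["http_status_code"])
        totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical http_requests_total samples, one series per exact status code.
samples = [
    ({"http_status_code": "200"}, 900),
    ({"http_status_code": "201"}, 40),
    ({"http_status_code": "404"}, 25),
    ({"http_status_code": "500"}, 3),
    ({"http_status_code": "503"}, 7),
]

print(bucket_status(samples))
```

Five series become three, and the question "how many requests failed on the server side?" can still be answered.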

The cleanup done in this step will directly provide storage space savings and query performance improvements.

Step 3: Adjust the Metric Collection Level

In some cases, adjusting the collection level of metrics also helps keep the overall load under control. For less critical systems, or where less detail is needed, you can collect metrics with fewer labels or sample them less often. Fewer labels reduce cardinality itself; less frequent sampling leaves the number of series unchanged but reduces the number of samples stored per series. Many monitoring tools let you adjust the sampling rate; in Prometheus, for example, scrape_interval can be set per scrape job.
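A rough way to see the effect of the sampling interval is the usual capacity estimate: retention time multiplied by ingested samples per second multiplied by bytes per sample. The sketch below assumes 2 bytes per sample, one million series, and 15 days of retention; all three numbers are illustrative assumptions, not measurements.

```python
GIB = 1024 ** 3


def samples_per_second(series, scrape_interval_s):
    """Each series contributes one sample per scrape interval."""
    return series / scrape_interval_s


def disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Estimate disk usage as retention * ingestion rate * bytes per sample."""
    retention_s = retention_days * 24 * 60 * 60
    rate = samples_per_second(series, scrape_interval_s)
    return retention_s * rate * bytes_per_sample


for interval in (15, 60):
    size = disk_bytes(series=1_000_000, scrape_interval_s=interval, retention_days=15)
    print(f"{interval}s interval: about {size / GIB:.0f} GiB")
```

Going from a 15-second to a 60-second interval cuts the estimate to a quarter, while the number of series that must be indexed and held in memory stays the same.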

However, this approach also has trade-offs. Lower sampling rates can make it harder to detect transient issues. Therefore, when adjusting the collection level, you need to carefully evaluate your monitoring needs and potential risks.

🔥 Risks of Low Sampling Rates

Suppose you are measuring the response time of a web server. If you take this measurement too infrequently (low sampling), you might not notice sudden, short-lived performance drops. This can hurt the user experience and make it harder to find the root cause of the problem. Especially during security incidents or sudden performance spikes, detailed metrics with a high sampling rate can be lifesavers.
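The effect is easy to reproduce with a toy signal. In the sketch below, latency sits at 50 ms except for a 20-second spike to 900 ms; both the signal and the intervals are hypothetical, and the model assumes each sample is a point-in-time reading.

```python
def latency_ms(t):
    """Hypothetical latency: 50 ms baseline with a spike from t=130s to t=150s."""
    return 900 if 130 <= t < 150 else 50


def observed_max(interval_s, duration_s=300):
    """Highest latency seen when sampling once every interval_s seconds."""
    return max(latency_ms(t) for t in range(0, duration_s, interval_s))


for interval in (10, 60):
    print(f"sampling every {interval}s: max observed = {observed_max(interval)} ms")
```

With a 60-second interval the samples fall at 120 s and 180 s, on either side of the spike, so the graph stays flat even though users saw an 18-fold slowdown.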

The goal in this step is to ensure that the collection level of each metric is at the level of detail truly needed.

Step 4: Use Static Values Instead of Dynamic Labels

Using static values instead of dynamic labels in metrics is one of the most effective ways to manage cardinality. For example, instead of a unique ID for each pod or container, it is better to use static labels that only indicate the environment (production, staging) or the service (auth-service, user-service).

This is especially important in cases where the values of labels change constantly. If you are assigning a constantly changing value to a metric, this metric is probably not the right way to access the desired information. For such information, using different mechanisms like logging or distributed tracing might be more appropriate.
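To see why constantly changing values are a problem, here is a small simulation of how many series a single metric accumulates across deployments. The pod naming scheme, the replica count, and the number of deploys are assumptions made up for the example.

```python
def series_after_deploys(deploys, replicas, label):
    """Count the distinct series one metric accumulates over several deploys."""
    seen = set()
    for deploy in range(deploys):
        for replica in range(replicas):
            if label == "pod_name":
                # Pods get a fresh name on every deploy (hypothetical naming scheme).
                seen.add(("pod_name", f"auth-service-{deploy}-{replica}"))
            else:
                # The service name stays the same no matter how often we deploy.
                seen.add(("service_name", "auth-service"))
    return len(seen)


print("pod_name:", series_after_deploys(20, 5, "pod_name"))
print("service_name:", series_after_deploys(20, 5, "service_name"))
```

Old series stop receiving samples after each deploy, but they still occupy storage until retention expires, so the dynamic label keeps growing the total while the static one does not.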

ℹ️ Dynamic Labels and Their Alternatives

There may also be cases where per-instance detail really is needed. For example, in a microservices architecture, you might want to know which specific pod a request came from. Even then, it makes more sense to get this information from a trace ID associated with the request than from metrics, which keeps dynamic labels like pod_name out of the metrics themselves. For the requests_total metric generated by a service, for example, the service_name label can be used instead of pod_name: service_name is static and predictable, so it avoids the cardinality problem caused by pod_name constantly changing.

By following these steps, you can effectively manage your metric cardinality, optimize your systems' performance, and reduce costs.

In conclusion, metric cardinality is an important but often overlooked aspect of system monitoring. With the right analysis, cleanup, and labeling strategies, it is possible to build both more efficient and more cost-effective monitoring systems. By applying these 4 steps, you can fully leverage the power of metrics in your systems.