Metric Cardinality: An Overlooked Performance Burden or a Developer Mistake?

#technology #observability #performance #monitoring

What is Metric Cardinality and Why is it Important?

Metric cardinality is the number of unique time series in a monitoring system. Each series is identified by a metric name plus one specific set of label values. For example, a metric like http_requests_total with the labels method, path, and status yields one series for method="GET", path="/api/users", status="200" and another for every other combination that occurs. Because the number of possible series is the product of the distinct values of each label, a handful of labels can produce thousands of series.
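To make the arithmetic concrete, here is a minimal Python sketch. The label values are invented for illustration and are not taken from any real system; the point is only that the upper bound on series is the product of the per-label value counts.

```python
from itertools import product

# Hypothetical label values for http_requests_total (illustrative only).
label_values = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "path": ["/api/users", "/api/orders", "/api/products"],
    "status": ["200", "400", "404", "500"],
}

def max_series(label_values):
    """Upper bound on series count: product of each label's distinct values."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total

def enumerate_series(name, label_values):
    """Yield every possible (metric name, label set) identity."""
    keys = list(label_values)
    for combo in product(*(label_values[k] for k in keys)):
        yield name, tuple(zip(keys, combo))

print(max_series(label_values))  # 4 * 3 * 4 = 48
```

Adding one more label with ten values would raise the bound from 48 to 480, which is why each new label deserves a second look.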

This situation can impose a significant performance burden, especially in large-scale systems. Monitoring systems must store, process, and query these series. Uncontrolled increases in cardinality can lead to database overload, slow queries, and even system instability. In this post, we will examine in detail why metric cardinality becomes an overlooked performance burden and what mistakes developers might make in this regard.

The cardinality problem typically emerges as the system grows. A system that starts with a few thousand metric series can eventually reach hundreds of thousands, or even millions. The causes usually lie in how metrics are designed and how labels are used: a developer might add too many labels to make a metric more informative, or use dynamically generated values as labels. Either choice increases the load on the system without anyone noticing at the time.

Performance Impacts of Uncontrolled Cardinality

High metric cardinality causes performance issues at every layer of a monitoring system. First, the agents collecting metrics have to process more data, which increases CPU and memory usage. Next, a significant load falls on the time-series database (TSDB) that stores this data. Each series typically needs its own index entries and in-memory state, so indexing, storing, and querying millions of series costs far more than handling the same number of samples spread across a few series.

One of the most visible consequences is degraded query performance. When a developer or operations team checks a specific metric, the monitoring system might have to scan millions of series, so queries can take seconds or even minutes. Alerting is affected too: alert rules are themselves queries, so slow or timed-out evaluations can delay real alarms or trigger false ones. Finally, the cost cannot be ignored. More storage space, more powerful servers, and longer processing times all increase operational costs.

⚠️ Point to Consider

High metric cardinality not only leads to performance issues but also to increased costs. Scaling monitoring systems directly impacts hardware and licensing expenses. Therefore, considering cardinality from the outset of metric design is critical.

To give an example, consider a metric like order_processed_total on an e-commerce platform. If labels such as user_id, session_id, or ip_address, which could be unique for each order, are added to this metric, each order creates a new metric series. In a system processing thousands of orders per hour, this metric can quickly reach millions of series, creating an immense load during querying.
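The effect is easy to reproduce with synthetic data. The sketch below assumes two bounded labels, payment_method and region, which are hypothetical and not part of the example above, and compares the series count with and without user_id. It takes the worst case where every order comes from a different user.

```python
PAYMENT_METHODS = ["card", "transfer", "wallet"]  # hypothetical bounded label
REGIONS = ["eu", "us", "apac"]                    # hypothetical bounded label

def make_orders(n):
    """Generate n synthetic orders, each from a distinct user (worst case)."""
    return [
        {
            "payment_method": PAYMENT_METHODS[i % 3],
            "region": REGIONS[(i // 3) % 3],
            "user_id": f"user-{i}",
        }
        for i in range(n)
    ]

def count_series(orders, label_names):
    """Count distinct label combinations, i.e. the series that would exist."""
    return len({tuple(order[name] for name in label_names) for order in orders})

orders = make_orders(10_000)
print(count_series(orders, ["payment_method", "region"]))             # 9
print(count_series(orders, ["payment_method", "region", "user_id"]))  # 10000
```

With the bounded labels alone the count stops at 9 no matter how many orders arrive; with user_id it grows with every new user.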

Developer Mistakes: Common Pitfalls That Increase Cardinality

Developers can make some common mistakes while instrumenting metrics that increase cardinality without their noticing. The foremost is the overuse of labels: every label added for extra detail multiplies the number of possible series. The biggest mistake of all is using values that are unique per instance, or that change very frequently, as label values.

Another common error is using dynamically generated labels. For instance, using values like a user session ID or request ID directly as labels creates a new metric series for every new session or request. Such information is often more suitable for logging or distributed tracing tools, not for metrics.

ℹ️ Example Error Scenario

A developer added the request_id label to the api_calls_total metric. This was a unique ID for each API call. In a system with hundreds of thousands of API calls per day, this single label caused the monitoring system to encounter millions of new metric series. As a result, queries slowed down, and storage costs exceeded expectations significantly.

Lastly, a lack of naming standards for labels can also lead to cardinality issues. Different teams using different naming conventions can result in the creation of labels that mean the same thing but have different names. This also makes metric aggregation and analysis difficult.

Strategies for Managing Metric Cardinality

Several strategies can be employed to combat high metric cardinality. Firstly, labels must be chosen carefully. Every label added to a metric potentially increases cardinality. Therefore, only labels with truly meaningful and stable values should be used. For example, labels like environment (production, staging) or service (auth-service, user-service) are generally acceptable. However, dynamic or unique values like user_id or session_id are usually not appropriate.

Secondly, using aggregation techniques when collecting or instrumenting your metrics can be beneficial. This involves aggregating data at an earlier stage and sending fewer metric series to the monitoring system. For example, you can send average values or totals over a specific time interval.
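As a rough illustration of aggregating before export, the following sketch sums counts per reduced label set. The pod label and the sample values are assumptions made up for this example; a real collector or client library would do this differently.

```python
from collections import defaultdict

def pre_aggregate(events, keep_labels):
    """Sum event counts per reduced label set before export."""
    totals = defaultdict(int)
    for labels, value in events:
        # Keep only the agreed labels; everything else is aggregated away.
        key = tuple((name, labels[name]) for name in keep_labels)
        totals[key] += value
    return dict(totals)

# Hypothetical per-pod counts collected during one interval.
events = [
    ({"service": "auth-service", "environment": "production", "pod": "auth-1"}, 5),
    ({"service": "auth-service", "environment": "production", "pod": "auth-2"}, 7),
    ({"service": "user-service", "environment": "production", "pod": "user-1"}, 3),
]

aggregated = pre_aggregate(events, ["service", "environment"])
for key, total in aggregated.items():
    print(dict(key), total)
```

Three input series collapse into two exported ones here. The trade-off is that per-pod detail is no longer available in the monitoring system.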

💡 Creating a Label Strategy

Establishing a clear labeling strategy for each project is one of the most effective ways to prevent cardinality issues in the long run. This strategy should clearly define which types of labels are acceptable and which should be avoided.
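One way to make such a strategy enforceable is a small check that runs in code review or CI. The sketch below is only an assumption about what a policy could look like; the allowlist, the forbidden list, and the function name check_labels are hypothetical.

```python
# Hypothetical policy: labels every team has agreed are bounded and stable.
ALLOWED_LABELS = {"environment", "service", "method", "status"}
# Identifiers that are unique per user or request belong in logs or traces.
FORBIDDEN_LABELS = {"user_id", "session_id", "request_id", "ip_address"}

def check_labels(labels):
    """Return a list of policy violations for a proposed set of label names."""
    problems = []
    for name in labels:
        if name in FORBIDDEN_LABELS:
            problems.append(f"{name}: unbounded identifier, use logs or traces")
        elif name not in ALLOWED_LABELS:
            problems.append(f"{name}: not in the agreed allowlist")
    return problems

for problem in check_labels(["service", "request_id", "team"]):
    print(problem)
```

A check like this will not catch a bounded-looking label that turns out to have thousands of values, so it complements rather than replaces watching the actual series counts.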

Thirdly, understanding the capabilities of your monitoring system is important. Some monitoring systems may have special features to manage high-cardinality metrics more efficiently. Leveraging these features can mitigate performance issues.

Advanced Techniques: Aggregation and Summarization

Another important way to manage metric cardinality is by using aggregation and summarization techniques. These techniques reduce storage and processing load by transforming raw metric data into more meaningful and summarized forms. For instance, instead of storing the duration of every individual request for a request_duration_seconds metric, you can store the average duration, percentiles, or standard deviation over a specific time interval (e.g., 1 minute).
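The idea of summarizing one time window can be sketched with Python's standard statistics module. The function name summarize and the choice of the "inclusive" quantile method are assumptions for this illustration; monitoring tools compute these values in their own ways.

```python
import statistics

def summarize(durations):
    """Reduce one window of request durations (seconds) to summary values.

    Needs at least two observations, because statistics.quantiles
    cannot interpolate from a single data point on older Python versions.
    """
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(durations, n=100, method="inclusive")
    return {
        "count": len(durations),
        "mean": statistics.fmean(durations),
        "p95": cuts[94],
        "stdev": statistics.pstdev(durations),
    }

# One hypothetical 1-minute window of request durations.
window = [0.12, 0.08, 0.31, 0.09, 0.11, 1.40, 0.10, 0.15]
print(summarize(window))
```

However many requests arrive in the window, only four numbers are kept, which is where the storage saving comes from.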

Such aggregations are often performed by the monitoring tools themselves or through specialized collector agents. For example, in systems like Prometheus, aggregations can be precomputed and stored as new series using recording rules.

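# Prometheus recording rules example
groups:
- name: aggregation_rules
  rules:
  # Mean request duration: rate of total time divided by rate of request count
  - record: http_request_duration_seconds:mean
    expr: |
      sum(rate(http_request_duration_seconds_sum[5m])) by (method, path, status)
        /
      sum(rate(http_request_duration_seconds_count[5m])) by (method, path, status)
  # 95th percentile estimated from the histogram buckets
  - record: http_request_duration_seconds:95p
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, method, path, status))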

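In the example above, the mean and the 95th percentile (95p) of the http_request_duration_seconds metric are calculated over a 5-minute window and stored as new series grouped only by method, path, and status. Any label outside that list, such as a per-instance label, is aggregated away, so dashboards and alerts can query a small set of precomputed series instead of scanning every raw series. Note that recording rules add series rather than remove them; the total number of stored series goes down only if the raw series are then dropped or kept for a shorter period.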

ℹ️ Histogram Metrics

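Histogram metrics are a good way to capture the distribution itself. The _bucket series track the cumulative count of values at or below each upper bound (the le label), which lets you estimate percentiles with functions like histogram_quantile. Each bucket is a separate series, so the number of buckets multiplies the series count for every label combination. That number is fixed when the metric is defined, though, and does not grow with traffic, so cardinality stays bounded while you still get valuable insight into the distribution.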
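The following sketch shows the idea behind bucketed histograms and quantile estimation in plain Python. It is a simplified illustration, not the actual histogram_quantile implementation; the bucket bounds and function names are assumptions.

```python
import bisect

BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.5]  # hypothetical upper bounds (le), seconds

def observe_all(durations, bounds=BOUNDS):
    """Build cumulative bucket counts: one per bound plus a final +Inf bucket."""
    counts = [0] * (len(bounds) + 1)
    for d in durations:
        # Index of the first bound >= d; len(bounds) means the +Inf bucket.
        counts[bisect.bisect_left(bounds, d)] += 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running)
    return cumulative

def estimate_quantile(q, cumulative, bounds=BOUNDS):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    idx = next(i for i, c in enumerate(cumulative) if c >= rank)
    if idx == len(bounds):
        return bounds[-1]  # in +Inf: report the highest finite bound
    lower = bounds[idx - 1] if idx > 0 else 0.0
    prev = cumulative[idx - 1] if idx > 0 else 0
    in_bucket = cumulative[idx] - prev
    if in_bucket == 0:
        return lower  # only reachable for q == 0 with an empty first bucket
    return lower + (bounds[idx] - lower) * (rank - prev) / in_bucket

durations = [0.05, 0.2, 0.2, 0.4, 0.9, 0.9, 0.9, 2.0, 2.0, 3.0]
buckets = observe_all(durations)
print(buckets)
print(estimate_quantile(0.5, buckets))
```

With these five bounds, every label combination costs six bucket series regardless of how many requests are observed. The price is precision: the estimate can only be as fine as the bucket boundaries.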

These types of summarized metrics save storage space and improve query performance. However, the disadvantage of this method is the loss of access to raw data. If it's necessary to drill down into the details of each individual request, these aggregations might not be sufficient. In such cases, logging or distributed tracing systems come into play.

Optimizing Monitoring Systems

Managing metric cardinality is not solely the responsibility of developers; optimizing the monitoring systems themselves is also necessary. Time-series databases (TSDBs) are designed to store and query time-series data efficiently, but each one has practical limits on how many active series it can handle. The configuration and maintenance of these systems therefore matter as well.

For instance, database indexing strategies directly impact query performance. Proper indexing allows for the rapid retrieval of the desired series from billions of data points. Furthermore, data retention policies are also crucial. Storing large datasets for extended periods increases storage costs and lengthens query times. Regularly deleting or archiving unnecessary old data helps maintain system performance.

⚠️ Database Selection and Configuration

Thoroughly understand the capabilities and limitations of the time-series database you are using (e.g., Prometheus, InfluxDB, VictoriaMetrics). Each database may have different optimizations and configuration options for cardinality management.

Moreover, monitoring systems themselves collect metrics. The cardinality of these "meta-metrics" should also be kept under control. Otherwise, the monitoring system itself can become a bottleneck. Regular performance tests and monitoring are vital for early detection and resolution of potential issues.
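Early detection usually starts with knowing which metrics own the most series. The sketch below assumes you can export the current series as (name, labels) pairs, which is a hypothetical input format; it simply ranks metric names by their distinct label sets.

```python
from collections import Counter

def top_metrics_by_series(series, n=3):
    """Return the n metric names with the most distinct label sets."""
    # Deduplicate first: the same name and labels are one series.
    distinct = {(name, tuple(sorted(labels.items()))) for name, labels in series}
    return Counter(name for name, _ in distinct).most_common(n)

# Hypothetical snapshot of active series.
snapshot = [
    ("api_calls_total", {"request_id": "r-1"}),
    ("api_calls_total", {"request_id": "r-2"}),
    ("api_calls_total", {"request_id": "r-3"}),
    ("http_requests_total", {"method": "GET"}),
    ("http_requests_total", {"method": "POST"}),
    ("up", {"service": "auth-service"}),
    ("api_calls_total", {"request_id": "r-1"}),  # duplicate, counted once
]

print(top_metrics_by_series(snapshot, n=2))
# [('api_calls_total', 3), ('http_requests_total', 2)]
```

Running a report like this on a schedule makes a sudden jump in one metric's series count visible before it turns into slow queries.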

Conclusion: Embracing Cardinality as a Developer Responsibility

Metric cardinality is a frequently overlooked issue that can significantly impact system performance. This situation often arises from mistakes developers make when designing and instrumenting metrics. Every unique metric series imposes storage, processing, and query load on monitoring systems. When this load increases uncontrollably, it degrades system performance, increases costs, and leads to operational challenges.

To cope with this problem, developers need to be more careful in their label selection, avoid using dynamic or unique values as labels, and adopt aggregation techniques. Optimizing and correctly configuring the monitoring systems themselves is also an integral part of this effort. Shifting metric cardinality from being a "developer mistake" to a "developer responsibility" will enable us to build more performant, scalable, and cost-effective systems.

This is not just about performance optimization; it's also a part of good software engineering practice. Making informed decisions when designing and using monitoring systems makes a big difference in the long run.