Mustafa ERBAY

Posted on May 22 • Originally published at mustafaerbay.com.tr

Cardinality Management in Observability: 3 Ways to Reduce Costs

#observability #maliyetyonetimi #kardinalite #teknoloji

Are you struggling with high cardinality issues in your observability metrics, watching your data storage and querying costs spiral out of control? This situation can render the valuable data you collect to understand your systems' health completely unusable. I have experienced this problem firsthand multiple times, both in my own projects and in the enterprise systems I've worked on. High cardinality simply means having too many unique label combinations for a metric. This leads to overloaded databases, prolonged query times, and ultimately, bloated bills. In this post, I will share three fundamental methods I use to solve this annoying problem and reduce costs, complete with concrete examples.

When I first encountered this issue, we were monitoring the order processing flow of an e-commerce platform. We used unique identifiers like customer_id, product_id, and order_id as labels for every order. In a short time, millions of different label combinations were generated, and our Prometheus storage quickly filled up. Queries started taking almost 10 minutes. This was one of those moments when "observability" stopped helping us and started hindering us. Drawing from these experiences, I learned how to manage high cardinality.

Narrowing Down Metrics: Choose Your Labels Wisely

The root cause of high cardinality is using too many highly specific labels unnecessarily. When monitoring a metric, the first step is to question what we actually need to know. For example, when monitoring the latency of an API request, using unique labels for every request like request_id or trace_id usually doesn't make sense. Distributed tracing systems (such as Jaeger or Tempo) are much better suited for collecting this type of information. Metrics are generally used to understand broader trends and the overall health of the system.

While working on a production ERP system, we were monitoring CPU usage on a per-server basis. Initially, we used exactly 15 different labels, including server_name, datacenter, rack_number, os_version, kernel_version, and cpu_model. As the number of servers grew, this pushed cardinality to astronomical levels. After an analysis, we decided that just the server_name and datacenter labels were sufficient for our basic monitoring needs. Information like cpu_model was rarely needed, and we could turn to more specific system tools for those details. Following this change, the metric's cardinality count dropped by 90%.
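To make the effect of dropping labels concrete, here is a minimal sketch that counts distinct label combinations before and after the reduction. The server names, versions, and the `count_series` helper are hypothetical and are not taken from the ERP system described above. The sketch assumes that a change in a label value such as `kernel_version` creates a new series for the same server.

```python
def count_series(samples, label_names):
    """Count distinct label combinations when only label_names are kept."""
    return len({tuple(s[name] for name in label_names) for s in samples})


# Hypothetical label sets observed over time. An OS or kernel upgrade changes
# a label value, so the same server shows up as a new series.
samples = [
    {"server_name": "erp-01", "datacenter": "ist", "os_version": "22.04", "kernel_version": "5.15"},
    {"server_name": "erp-01", "datacenter": "ist", "os_version": "22.04", "kernel_version": "6.5"},
    {"server_name": "erp-02", "datacenter": "ist", "os_version": "20.04", "kernel_version": "5.4"},
    {"server_name": "erp-02", "datacenter": "ist", "os_version": "22.04", "kernel_version": "5.15"},
    {"server_name": "erp-03", "datacenter": "ank", "os_version": "22.04", "kernel_version": "5.15"},
]

all_labels = ["server_name", "datacenter", "os_version", "kernel_version"]
kept_labels = ["server_name", "datacenter"]

full = count_series(samples, all_labels)      # every distinct combination is a series
reduced = count_series(samples, kept_labels)  # one series per server and datacenter
print(f"series with all labels: {full}, with kept labels: {reduced}")
```

Running the same count against an export of your real label sets is a cheap way to check a proposed label reduction before changing any instrumentation.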

💡 Tip for Label Selection

Before adding a label to a metric, ask yourself these questions:

Does this label change the meaning of the metric?

Can I still get the necessary information from the metric without this label?

How many unique values can this label take, and how does it affect the overall cardinality of the system?

Is this label only needed to diagnose a specific issue, or is it for general monitoring?
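The third question can be answered with simple arithmetic: in the worst case, the number of series for a metric is the product of the number of distinct values of each label. The sketch below illustrates that upper bound. The value counts are made-up assumptions, and real cardinality is usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical counts of distinct values for each label.
base = {"method": 5, "path": 40, "status": 8}
with_customer = {**base, "customer_id": 50_000}

base_bound = worst_case_series(base)
customer_bound = worst_case_series(with_customer)
print(f"without customer_id: {base_bound:,}")
print(f"with customer_id:    {customer_bound:,}")
```

A single unbounded label multiplies the bound by its own value count, which is why identifiers are the first labels to question.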

This narrowing strategy not only reduces storage costs but also improves query performance noticeably. A query that touches fewer series requires far less processing from the database. In time-series databases like Prometheus, this can be the difference between a query that takes minutes and one that returns almost immediately.

Real-World Example: `customer_id` vs. `customer_segment`

On a SaaS platform, we were monitoring API calls per user. Initially, we used the customer_id label. This meant a separate time series for every single customer. As the number of customers grew, cardinality skyrocketed.

To solve the problem, we identified the segment each customer belonged to (e.g., free, premium, enterprise) and started using the customer_segment label instead of customer_id. This simple change reduced cardinality by a factor of thousands, because the label now takes a handful of values instead of one per customer. We could still track general trends, such as API calls from users in the premium segment. When we needed detailed analysis at the level of an individual customer, we retrieved that information from logs or tracing systems instead of metrics.

# Old metric definition (causes high cardinality)
http_requests_total{method="GET", path="/api/v1/users", status="200", customer_id="cust_abc123xyz"} 1

# New metric definition (lower cardinality)
http_requests_total{method="GET", path="/api/v1/users", status="200", customer_segment="premium"} 1

This strategy helps you group your metrics better and avoid creating useless data dumps. Remember, the goal of observability is to see the big picture and detect anomalies; recording every individual event with metrics is usually inefficient.
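As a rough illustration of the swap from `customer_id` to `customer_segment`, the sketch below resolves the segment at instrumentation time, before the counter is incremented. It uses a plain `Counter` in place of a real metrics client, and the `CUSTOMER_SEGMENTS` lookup and `record_request` helper are hypothetical. In practice the segment would probably come from a billing or account service.

```python
from collections import Counter

# Hypothetical lookup table from customer ID to segment.
CUSTOMER_SEGMENTS = {
    "cust_abc123xyz": "premium",
    "cust_def456": "free",
    "cust_ghi789": "enterprise",
}

# Stand-in for a counter metric: one entry per label combination.
requests_total = Counter()


def record_request(method, path, status, customer_id):
    """Count a request under the customer's segment, never under the ID itself."""
    segment = CUSTOMER_SEGMENTS.get(customer_id, "unknown")
    requests_total[(method, path, status, segment)] += 1


for customer_id in ["cust_abc123xyz", "cust_def456", "cust_abc123xyz", "cust_new"]:
    record_request("GET", "/api/v1/users", "200", customer_id)

for labels, count in sorted(requests_total.items()):
    print(labels, count)
```

Mapping unknown customers to a fixed fallback value keeps the label bounded even when the lookup is incomplete.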

Aggregation and Summarization Strategies

Another effective way to manage high cardinality is to summarize data at ingestion time or at regular intervals. This involves converting raw data into more meaningful summary metrics with lower cardinality. This approach can significantly reduce costs, especially for data that needs to be stored for a long time or is rarely queried.

In a bank's internal systems, we were collecting detailed logs for every transaction. These logs contained a lot of information, such as transaction type, customer ID, and transaction amount. Over time, the size and volume of these logs grew so much that searching through them became nearly impossible. As a solution, we started generating summary metrics, such as the total number of transactions and average transaction amount, grouped by customer segment and transaction type for specific time intervals (e.g., 1-hour periods). These summary metrics were sufficient for understanding general trends and reduced our reliance on raw logs.
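A roll-up of this kind can be sketched in a few lines. The example below groups transactions by hour, customer segment, and transaction type, and keeps a count and a sum so that the average can be derived. The record fields and sample values are assumptions for illustration, and the segment is assumed to be already resolved from the customer ID.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw transaction records.
transactions = [
    {"ts": "2024-05-01T09:05:00", "segment": "premium", "type": "transfer", "amount": 100.0},
    {"ts": "2024-05-01T09:40:00", "segment": "premium", "type": "transfer", "amount": 300.0},
    {"ts": "2024-05-01T10:10:00", "segment": "premium", "type": "transfer", "amount": 50.0},
    {"ts": "2024-05-01T09:15:00", "segment": "free", "type": "payment", "amount": 20.0},
]


def summarize_hourly(records):
    """Roll raw records up to one summary row per (hour, segment, type)."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for record in records:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(record["ts"]).replace(minute=0, second=0, microsecond=0)
        key = (hour.isoformat(), record["segment"], record["type"])
        totals[key]["count"] += 1
        totals[key]["sum"] += record["amount"]
    # Derive the average from count and sum.
    return {key: {**agg, "avg": agg["sum"] / agg["count"]} for key, agg in totals.items()}


summary = summarize_hourly(transactions)
for key, agg in sorted(summary.items()):
    print(key, agg)
```

Storing the count and the sum, rather than only the average, lets you later combine hourly rows into daily ones without losing accuracy.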

ℹ️ Why is Summarization Important?

Instead of storing raw data, summarization allows you to store a processed and condensed version of it that serves a specific purpose. This both saves storage space and speeds up queries.

Summarization strategies are usually implemented as a "pre-aggregation" or "roll-up" process. This can be part of your data collection tooling (for example, recording rules in a Prometheus rules file, or stream aggregation in VictoriaMetrics' vmagent). These rules take specific metrics, group them by a small set of labels, and generate new summary metrics.

Real-World Example: Data Collected for Daily Reports

We were preparing daily reports for a manufacturing plant's ERP system. These reports included information such as the number of parts produced by each machine, error rates, and downtime. Initially, we recorded data produced by every machine every minute. This meant terabytes of data every day.

To solve the problem, we developed a workflow that calculated summary values at the end of each hour, such as the total number of parts produced, total number of errors, and total downtime for each machine. These summary values were stored with fewer labels (such as machine name and shift). Daily reports were now generated using this summary data. As a result, our data storage needs dropped by 95%, and reporting times went down to seconds.

# Prometheus rules.yml example
groups:
- name: aggregation_rules
  rules:
  # Per-second request rate over 5 minutes, keeping only path and status
  - record: path_status:http_requests:rate5m
    expr: sum by (path, status) (rate(http_requests_total[5m]))

  # Parts produced in the last hour, keeping only machine_id and shift
  - record: machine_shift:machine_production:increase1h
    expr: sum by (machine_id, shift) (increase(machine_production_counter[1h]))

These types of summarization rules are critical for both reducing costs and improving performance. Especially if you don't need constant access to raw data for long-term analysis, summarized data will often be more than enough.
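To see why a `sum by (path, status)` rule lowers cardinality, the sketch below imitates the aggregation on a handful of in-memory series. It illustrates the idea only and is not how Prometheus evaluates rules. The `sum_by` helper, the `instance` and `method` labels, and the sample values are assumptions.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Sum values of series that share the same values for the kept labels."""
    result = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels[name]) for name in keep)
        result[key] += value
    return dict(result)


# Hypothetical input: (labels, per-second request rate) pairs.
series = [
    ({"path": "/api/v1/users", "status": "200", "instance": "app-1", "method": "GET"}, 4.0),
    ({"path": "/api/v1/users", "status": "200", "instance": "app-2", "method": "GET"}, 6.0),
    ({"path": "/api/v1/users", "status": "500", "instance": "app-1", "method": "GET"}, 0.5),
    ({"path": "/api/v1/orders", "status": "200", "instance": "app-2", "method": "POST"}, 2.0),
]

result = sum_by(series, ["path", "status"])
print(f"{len(series)} input series -> {len(result)} output series")
for key, value in sorted(result.items()):
    print(dict(key), value)
```

Every label that is not listed in `keep` disappears from the output, so the summary metric can never have more series than there are combinations of the kept labels.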

Optimizing Data Retention Periods

One of the most direct ways to reduce the cost of observability data is to optimize retention periods. How long each metric should be kept should be determined based on business needs and regulatory compliance. Storing all data forever is both costly and unnecessary.

During a mobile app development process, we were tracking user interactions. Initially, we stored all events for 30 days. However, in our analysis, we realized that we almost never looked at data older than 7 days. In fact, for some metrics, even 2-3 days was sufficient. When we reduced the retention period to 7 days, we experienced a 75% drop in storage costs.
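The size of that saving follows from simple arithmetic if storage grows roughly linearly with retention. The sketch below uses the rough sizing rule from the Prometheus documentation (retention time × ingested samples per second × bytes per sample). The ingestion rate and the 2 bytes per sample are assumed figures, not measurements from the project described above.

```python
SECONDS_PER_DAY = 86_400


def estimated_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * ingestion rate * sample size."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Hypothetical ingestion rate of 10,000 samples per second.
before = estimated_bytes(30, 10_000)
after = estimated_bytes(7, 10_000)
saving = 1 - after / before

print(f"30 days: {before / 1e9:.1f} GB, 7 days: {after / 1e9:.1f} GB, saving: {saving:.0%}")
```

Under this linear model, going from 30 to 7 days removes about 77% of stored data, which is in line with the roughly 75% drop we observed.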

⚠️ Considerations for Retention Period Decisions

When determining the data retention period, consider not only technical needs but also legal requirements (such as mandatory data retention periods for financial transactions) and business analysis needs.

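Most observability tools let you control retention, but the granularity differs. A Prometheus server applies a single retention period to all of its data, so different periods usually mean separate instances or a long-term storage layer. Grafana Loki can apply retention per tenant or per log stream, and Elasticsearch can apply it per index through lifecycle policies. For example, you can keep critical system metrics for 30 days, less critical ones for 7 days, and error logs for only 2 days. This flexibility allows you to use resources more efficiently.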

Real-World Example: Legal Compliance and Anomaly Detection

At a financial technology (FinTech) company, we were monitoring transaction-related metrics and logs. Due to regulatory compliance, we had to store certain transaction details for at least 1 year. However, we didn't need such a long period for real-time anomaly detection and short-term performance analysis.

Therefore, we applied different retention policies:

Critical transaction details (raw logs): Stored for 1 year for regulatory compliance.
Summary metrics like transaction count and volume: Stored at high resolution for 30 days, then at low resolution (downsampled data) for 1 year.
Metrics used for short-term anomaly detection: Stored for only 7 days.

This multi-tiered retention strategy allowed us to meet regulatory requirements while significantly reducing costs for our short-term operational needs. By optimizing the data retention period, we saved approximately 60% on storage costs.
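The tiers listed above can be expressed as a small policy table. The sketch below is one hypothetical way to encode them. The data class names, tier names, and the `storage_tier` helper are illustrative only, and in a real setup these decisions are enforced by the storage backend's retention and downsampling settings rather than by application code.

```python
from datetime import timedelta

# Each data class maps to (maximum age, tier) pairs, checked in order.
RETENTION_POLICY = {
    "raw_transaction_logs": [(timedelta(days=365), "raw")],
    "summary_metrics": [
        (timedelta(days=30), "high_resolution"),
        (timedelta(days=365), "low_resolution"),
    ],
    "anomaly_metrics": [(timedelta(days=7), "high_resolution")],
}


def storage_tier(data_class, age):
    """Return the tier that data of this class and age belongs to, or 'delete'."""
    for max_age, tier in RETENTION_POLICY[data_class]:
        if age <= max_age:
            return tier
    return "delete"


print(storage_tier("summary_metrics", timedelta(days=10)))
print(storage_tier("summary_metrics", timedelta(days=90)))
print(storage_tier("anomaly_metrics", timedelta(days=8)))
```

Keeping the policy in one table makes it easier to review with compliance and to check that every data class has an explicit upper limit.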

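# Prometheus configuration (prometheus.yml example)
rule_files:
  - /etc/prometheus/rules/*.yml

# Retention is set with command-line flags when Prometheus starts:
#   prometheus --config.file=/etc/prometheus/prometheus.yml \
#     --storage.tsdb.retention.time=7d   # keep 7 days (the built-in default is 15d)
# Leave --storage.tsdb.retention.size at its default of 0 so only the time limit applies.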

# Different storage regions or tools can be used for different retention periods.
# For example, different retention periods can be set for Loki.

When determining the data retention period, asking "how long must we absolutely keep it?" instead of "how long might we need it?" is a more pragmatic approach. This prevents unnecessary data accumulation and ensures your observability system runs more efficiently.

Observability is a powerful tool that increases the visibility of our systems. However, it is important not to let this power turn into uncontrolled cost growth. Managing high cardinality, choosing metrics wisely, summarizing data, and optimizing retention periods will help you both lower costs and increase the effectiveness of your observability system. By implementing these three methods, you can get the most out of your observability investments.
