
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Managing Cardinality Explosion in Observability in 3 Steps

In this profession, which I started by crimping cables in server rooms, I learned firsthand the hidden costs and operational burdens that every new technology brings along. In the past, when the disk filled up, we would SSH into the server and delete logs; now, we spend extra hours trying to save the disks and budgets of our centralized logging and metric systems (observability).

Especially in distributed architectures and containerized environments, one of the most insidious monsters we face is "cardinality explosion." A tiny label (tag) that developers add to metrics for ease of debugging can become the main culprit for thousands of dollars in bills from the finance team at the end of the month, or alarms ringing in the middle of the night because the disk is 100% full. In this post, I will explain how I solved this problem in my own projects and in the organizations I consulted for, using concrete steps.

What is Cardinality and Why Does It Threaten a Platform Engineer's Career?

Mathematically speaking, cardinality is the number of unique elements in a set. In the world of observability, the cardinality of a metric is the number of unique label-value combinations it can produce. In the worst case, this is the Cartesian product of the value counts of all its labels. In time series databases (TSDB) like Prometheus, VictoriaMetrics, or Grafana Mimir, each unique label combination creates a new "time series" on disk and in memory.

Let me explain with an example. Suppose we have a simple metric measuring the duration of HTTP requests: http_request_duration_seconds_bucket. Let's assume we add the following labels to this metric:

  • method: GET, POST, PUT, DELETE (4 unique values)
  • status: 200, 201, 400, 401, 500 (5 unique values)
  • handler: /api/v1/login, /api/v1/checkout, /api/v1/users (3 unique values)

In this case, the total number of time series is: 4 * 5 * 3 = 60. This is an extremely reasonable and manageable number. However, what happens if an overzealous developer friend of ours decides to add a user_id label to the metric to make debugging easier? Let's assume there are 50,000 active users registered in the system:

4 (method) * 5 (status) * 3 (handler) * 50,000 (user_id) = 3,000,000 Time Series!

You have produced 3 million active time series for a single metric. The TSDB must keep an index in memory (RAM) for each of these series. Memory consumption suddenly jumps from 4 GB to 64 GB, disk I/O hits the roof, and finally, the OOM (Out Of Memory) killer kicks in and shuts down your database. If you are using a SaaS solution like Datadog, you find yourself explaining to the general manager after an $18,500 bill arrives at the end of the month. This is why cardinality management is not just a technical detail, but a critical FinOps skill that determines the career lifespan of a platform engineer.
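
The arithmetic above is easy to script when you want to estimate the impact of a new label before it ships. The following is a minimal Python sketch, not part of any library: the function name series_upper_bound is hypothetical, and the result is a worst-case upper bound that assumes every label combination actually occurs.

```python
from math import prod


def series_upper_bound(label_value_counts: dict) -> int:
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_value_counts.values())


# Label value counts taken from the example above.
base_labels = {"method": 4, "status": 5, "handler": 3}
with_user_id = {**base_labels, "user_id": 50_000}

print(series_upper_bound(base_labels))   # 60
print(series_upper_bound(with_user_id))  # 3000000
```

Running a check like this during code review makes the cost of a "harmless" label visible before it reaches production.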

⚠️ Important Cost Warning

SaaS observability providers usually bill based on "Active Series per Month." A single wrong dynamic label can consume your company's monthly budget overnight. This casts a direct shadow on your technical leadership.

Step 1: Detecting and Blocking Dynamic Metric Labels

The number one reason for high cardinality is the use of dynamic data as label values. UUIDs, user email addresses, IP addresses, query parameters, or order numbers should never ever be metric labels. The place for such detailed and dynamic data is not metrics, but logs or distributed tracing (trace ID) systems.
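
One way to catch such values early is a small lint step in tests or CI. The sketch below is only a heuristic under my own assumptions: the pattern set, the four-digit threshold for numeric IDs, and the name classify_label_value are all hypothetical, and real systems will need their own rules.

```python
import re
from typing import Optional

# Heuristic patterns for values that should not become metric labels.
DYNAMIC_PATTERNS = {
    "uuid": re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ipv4": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
    # Assumption: 4+ digits looks like an ID, so status codes like 200 pass.
    "numeric_id": re.compile(r"\d{4,}"),
}


def classify_label_value(value: str) -> Optional[str]:
    """Return the kind of dynamic data the value resembles, or None."""
    for kind, pattern in DYNAMIC_PATTERNS.items():
        if pattern.fullmatch(value):
            return kind
    return None


for candidate in ["GET", "200", "10.0.0.12", "alice@example.com", "984512"]:
    print(candidate, "->", classify_label_value(candidate))
```

A check like this will produce false positives and negatives, so treat a hit as a prompt for review rather than a hard failure.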

If this mistake has already been made in your system, the first thing you need to do is find out which labels are causing this explosion. If you are using Prometheus, you can see the label names with the highest cardinality by running the following API query to examine the TSDB status:

curl -sg 'http://localhost:9090/api/v1/status/tsdb' | jq '.data.labelValueCountByLabelName'

The output will likely tell you directly which label is the culprit:

[
  {
    "name": "user_id",
    "value": 142580
  },
  {
    "name": "transaction_id",
    "value": 98450
  }
]
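
If you want to turn this check into an automated alert, the response can be filtered in a few lines. This is a sketch that assumes the response shape documented for the Prometheus TSDB status endpoint, where each labelValueCountByLabelName entry carries a name and a value field. The function name high_cardinality_labels, the threshold of 1,000, and the sample payload are my own illustrative choices.

```python
def high_cardinality_labels(tsdb_status: dict, threshold: int = 1000) -> list:
    """Return (label, distinct value count) pairs at or above the threshold."""
    entries = tsdb_status["data"]["labelValueCountByLabelName"]
    flagged = [(e["name"], e["value"]) for e in entries if e["value"] >= threshold]
    # Worst offenders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Illustrative payload; in practice this is the parsed JSON API response.
sample_status = {
    "status": "success",
    "data": {
        "labelValueCountByLabelName": [
            {"name": "status", "value": 5},
            {"name": "transaction_id", "value": 98450},
            {"name": "user_id", "value": 142580},
        ]
    },
}

print(high_cardinality_labels(sample_status))
# [('user_id', 142580), ('transaction_id', 98450)]
```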

Cleaning these labels at the code level can take time. As an immediate intervention, you can drop or normalize these labels at scrape (collection) time using Prometheus's metric_relabel_configs mechanism, which rewrites the labels of scraped samples before they are stored. (relabel_configs, by contrast, only acts on target labels before the scrape, so it never sees a metric label such as path.) For example, converting dynamic URLs like /api/v1/users/12345 into a single /api/v1/users/:id pattern can be a lifesaver.

Dynamic Path Normalization with metric_relabel_configs

The Prometheus configuration below detects UUIDs or numeric IDs in the path label and replaces them with a placeholder. This prevents the formation of thousands of different time series:

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['localhost:8080']
    # Applied to scraped samples before ingestion.
    # Prometheus anchors the regex, so it must match the whole label value.
    metric_relabel_configs:
      # Normalize paths containing UUIDs
      - source_labels: [path]
        regex: '(.*)/[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}(/.*)?'
        target_label: path
        replacement: '${1}/:uuid${2}'
      # Normalize paths containing numeric IDs
      - source_labels: [path]
        regex: '(.*)/[0-9]+(/.*)?'
        target_label: path
        replacement: '${1}/:id${2}'

Thanks to this rule, all dynamic paths are merged during the scrape phase before being written to the database. You will have prevented millions of time series from forming on the disk before the data even enters the TSDB.
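
Before deploying normalization patterns of this kind, it helps to test them offline against real paths from your access logs. The Python sketch below mimics the fully anchored matching that Prometheus applies to relabel regexes by using fullmatch. The patterns, placeholders, and the name normalize_path are illustrative assumptions, and Python's re engine is only a stand-in for the RE2 engine Prometheus uses.

```python
import re

UUID = r"[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"

# (pattern, placeholder) pairs, applied in order like relabel rules.
RULES = [
    (re.compile(rf"(.*)/{UUID}(/.*)?"), ":uuid"),
    (re.compile(r"(.*)/[0-9]+(/.*)?"), ":id"),
]


def normalize_path(path: str) -> str:
    """Replace one ID-like segment per rule with a fixed placeholder."""
    for pattern, placeholder in RULES:
        match = pattern.fullmatch(path)  # anchored, like Prometheus relabeling
        if match:
            path = f"{match.group(1)}/{placeholder}{match.group(2) or ''}"
    return path


print(normalize_path("/api/v1/users/12345"))  # /api/v1/users/:id
print(normalize_path("/api/v1/login"))        # /api/v1/login
print(normalize_path("/api/v1/orders/3f2b8c1e-9d4a-4b6f-8e2a-1c5d7f9a0b3e/items"))
# /api/v1/orders/:uuid/items
```

Note that each rule rewrites only one segment per pass, so paths with several IDs of the same kind would need the rule repeated.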

Step 2: Reducing Data Upfront with Aggregation and Recording Rules

Not every time series needs to be stored on a per-second basis. For example, we might want to see CPU usage for the last 3 hours in per-second detail, but seeing CPU usage from 6 months ago as hourly averages is more than sufficient for us. Instead of storing raw data for a long time, pre-aggregating it saves disk space and increases the loading speed of your dashboards.

In the Prometheus world, we call this Recording Rules. These rules run queries you define periodically in the background and record the result as a new metric with lower cardinality.

For example, the rule below runs every 1 minute, sums HTTP request rates across all pods, and turns them into a single metric at the service level:

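groups:
  - name: api_rules
    # Evaluate this group every minute (overrides the global evaluation_interval)
    interval: 1m
    rules:
      # Aggregate away pod-level labels, keeping only service and status
      - record: service:http_requests:rate5m
        expr: sum by (service, status) (rate(http_requests_total[5m]))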

After writing this rule, there are two things you need to do:

  1. Replace the complex and slow queries in your Grafana dashboards with the new service:http_requests:rate5m metric.
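  2. Keep the data retention period of the original high cardinality http_requests_total metric short (e.g., 3 days), while storing the new aggregated metric for 1 year. A single Prometheus server applies one global retention period (--storage.tsdb.retention.time), so in practice this split usually means sending the aggregated series to a separate long-term store, for example via remote write.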
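
To see why an aggregation like sum by (service, status) shrinks the series count, here is a rough Python model of what it does to a set of labeled samples. It is a conceptual sketch only: the sum_by helper and the sample data are hypothetical, and the real evaluation happens inside the PromQL engine.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Group samples by the kept labels and sum their values."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


# Per-pod request rates: 2 pods x 2 status codes = 4 input series.
per_pod_rates = [
    ({"service": "api", "status": "200", "pod": "api-1"}, 10.0),
    ({"service": "api", "status": "200", "pod": "api-2"}, 12.5),
    ({"service": "api", "status": "500", "pod": "api-1"}, 0.5),
    ({"service": "api", "status": "500", "pod": "api-2"}, 1.0),
]

service_level = sum_by(per_pod_rates, keep=("service", "status"))
print(len(per_pod_rates), "->", len(service_level))  # 4 -> 2
```

With hundreds of pods the input side grows linearly while the output stays at one series per service and status, which is exactly what makes the recorded metric cheap to keep for a long time.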

ℹ️ Trade-off Analysis

When you use recording rules, you lose the ability to analyze micro-details in the past (e.g., the state of a specific pod at that moment). However, you store system-wide trends and service-based SLA/SLO metrics much more performantly and cheaply.

Step 3: Filtering at the OpenTelemetry Collector Level with Drop and Keep Filters

If you are building a modern architecture, placing an OpenTelemetry (OTel) Collector in between instead of sending metrics directly to Prometheus is one of the most logical architectural decisions. The OTel Collector acts as a proxy that takes data from the source, processes it, and routes it to the desired target (Prometheus, Datadog, Mimir, etc.).

This architecture gives us the advantage of having full control over the data before it reaches the central storage. With the filter and transform processors we define on the Collector, we can eliminate unnecessary metrics or labels while they are still passing through the network.

I have seen similar proxy layers save the day during VPS migrations and in large-scale infrastructures. Here is an example config.yaml you can apply on the OTel Collector:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # Drop certain metrics entirely
  filter/drop_noisy_metrics:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*\\.internal\\..*"
          - "process\\.cpu\\.time"
          - "jvm\\.gc\\.memory\\.allocated"

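  # Delete high cardinality labels within metrics
  transform/clean_labels:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # Keep only these data point attributes; every other key is removed
          - keep_keys(attributes, ["service.name", "http.status_code", "http.method"])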

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter/drop_noisy_metrics, transform/clean_labels]
      exporters: [prometheus]

Thanks to this configuration:

  1. We completely eliminate noisy metrics produced by the JVM or internal services that we don't use operationally.
  2. We delete all extra labels like user_id and session_id in the remaining metrics with the transform processor, keeping only the service.name, http.status_code, and http.method attributes that are critical for analysis.
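
The effect of an allow-list on attributes can be illustrated outside the Collector. The Python sketch below models the idea behind keep_keys on plain dictionaries and counts distinct attribute sets before and after. It is an approximation for intuition only: the helper names and the generated data points are hypothetical, and it says nothing about how the Collector handles data points that end up with identical attributes.

```python
def keep_keys(attributes: dict, allowed: list) -> dict:
    """Return a copy of the attributes containing only the allowed keys."""
    return {key: value for key, value in attributes.items() if key in allowed}


def distinct_series(points, allowed=None) -> int:
    """Count distinct attribute sets, optionally after applying an allow-list."""
    seen = set()
    for attributes in points:
        if allowed is not None:
            attributes = keep_keys(attributes, allowed)
        seen.add(frozenset(attributes.items()))
    return len(seen)


ALLOWED = ["service.name", "http.status_code", "http.method"]

# 500 users x 2 methods, each with a unique user_id attribute.
points = [
    {
        "service.name": "checkout",
        "http.method": method,
        "http.status_code": "200",
        "user_id": str(user),
    }
    for user in range(500)
    for method in ("GET", "POST")
]

print(distinct_series(points))           # 1000
print(distinct_series(points, ALLOWED))  # 2
```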

A Real-Life Scenario: How I Reduced 12 Million Active Time Series in a Production ERP

A few years ago, we experienced a serious bottleneck in the monitoring infrastructure of a production ERP I was working on. The flow of metrics from operator screens, handheld terminals, and IoT devices on the production line was so intense that our Prometheus server was locking up every 12 hours due to hitting disk I/O limits.

When we investigated the issue, we saw that the work order ID and operator ID were stamped as labels on the temperature and pressure metrics coming from the IoT devices. Since work orders were constantly changing, the number of active time series reached 12,450,000. The TSDB indexes didn't fit in RAM, and the server was constantly swapping to disk.

| Metric | Before Optimization | After Optimization | Rate of Change |
| --- | --- | --- | --- |
| Active Time Series | 12,450,000 | 385,000 | 96.9% reduction |
| Server RAM Usage | 48 GB | 6.2 GB | 87.0% savings |
| Disk I/O (Write Ops) | 12,500 IOPS | 850 IOPS | 93.2% decrease |
| Dashboard Query Time | 14.2 sec | 0.18 sec | 98.7% speedup |

I applied the following steps for the solution:

  1. Label Migration: I completely removed work order and operator information from the metrics. I directed this information to the log tables in our PostgreSQL database and OpenTelemetry trace attributes.
  2. Prometheus Relabeling: During the time until the code update was deployed, I defined a labeldrop rule on Prometheus to instantly delete dynamic labels in incoming metrics.
  3. Mimir Integration: For long-term analysis, I set up Grafana Mimir and sent the data to an S3-compatible object storage area, eliminating the dependency on the local disk.

Thanks to these simple but decisive steps, we extended the life of the existing infrastructure without incurring additional hardware costs and ensured the stability of the system.

Career Perspective: Financially Conscious Engineering (FinOps) and Budget Management

The most important thing I noticed as I gained seniority in the tech world is that technical achievements are not measured solely by "using the newest library" or "building the most complex architecture." Real engineering is producing the optimal solution with limited resources (budget, hardware, time).

As a platform engineer or team leader, keeping your company's cloud bill under control is directly your responsibility. Instead of going to the CFO and saying, "Sir, our Kubernetes cluster is very stable," saying, "Thanks to the cardinality optimization we did, we reduced our observability budget from $12,000 to $2,500 per month and made our queries 10 times faster," is the actual output that will take your career to the next level.

Engineering decisions always involve trade-offs. While lowering cardinality, you give up the level of detail; however, with correct segmentation and tool selection (metric vs. log vs. trace), it is possible to both protect the budget and monitor the health of the system.

As a next step, go into the interface of Prometheus or the SaaS tool you use in your own system and identify the top 3 metrics with the highest cardinality. You will see that just normalizing those 3 metrics will provide a serious relief in your infrastructure.
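
If you prefer a script to clicking through the UI, this exercise can be sketched in Python against the Prometheus TSDB status endpoint. The sketch assumes a Prometheus server reachable at http://localhost:9090 and the documented seriesCountByMetricName field; the helper names fetch_tsdb_status and top_metrics and the sample data are my own, and SaaS tools expose this information through their own, different APIs.

```python
import json
from urllib.request import urlopen


def fetch_tsdb_status(base_url: str = "http://localhost:9090") -> dict:
    """Fetch TSDB statistics from a running Prometheus server."""
    with urlopen(f"{base_url}/api/v1/status/tsdb", timeout=10) as response:
        return json.load(response)


def top_metrics(tsdb_status: dict, n: int = 3) -> list:
    """Return the n metric names with the most series, largest first."""
    entries = tsdb_status["data"]["seriesCountByMetricName"]
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:n]]


# Offline example with made-up numbers; use fetch_tsdb_status() on a live server.
example_status = {
    "data": {
        "seriesCountByMetricName": [
            {"name": "up", "value": 40},
            {"name": "http_request_duration_seconds_bucket", "value": 3000000},
            {"name": "process_cpu_seconds_total", "value": 120},
            {"name": "http_requests_total", "value": 250000},
        ]
    }
}

for name, count in top_metrics(example_status):
    print(f"{count:>10}  {name}")
```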
