The $50K/Month Monitoring Bill
I audited our monitoring stack last quarter. The total cost across all tools: $52,000/month. For a company with 200 engineers. That's $260 per engineer per month just to watch our systems.
Something had to change.
Where the Money Goes
Breakdown of $52K/month:
Custom metrics ingestion: $18,000 (35%)
Log storage & search: $14,000 (27%)
APM/Tracing: $9,000 (17%)
Alerting platform: $4,000 (8%)
Synthetic monitoring: $3,000 (6%)
Dashboards & visualization: $2,000 (4%)
Status page: $2,000 (4%)
The top two — metrics and logs — were 62% of the bill.
Strategy 1: Metrics Cardinality Audit
High-cardinality metrics are the #1 cost driver. One bad label can 10x your bill:
```
# This is fine: ~100 time series
http_requests_total{method="GET", status="200", service="api"}

# This is expensive: ~1,000,000 time series
http_requests_total{method="GET", status="200", service="api", user_id="..."}
#                                                              ^^^^^^^^^^^^^
#                                                              100K unique values
```
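Cardinality multiplies across labels, so the blow-up is easy to estimate before it shows up on the bill. A quick back-of-envelope sketch (the per-label counts here are illustrative, not from our stack):

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case time-series count: the product of each label's unique values."""
    return prod(label_cardinalities.values())

base = {"method": 5, "status": 10, "service": 2}    # ~100 series
with_user = {**base, "user_id": 10_000}             # adding one label: ~1,000,000 series

print(series_count(base))       # 100
print(series_count(with_user))  # 1000000
```

One unbounded label turns a hundred series into a million, which is why the audit below targets labels, not metrics.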
I wrote a script to find high-cardinality offenders:
```python
import requests

def find_high_cardinality_metrics(prometheus_url, threshold=10000):
    # Get all metric names
    r = requests.get(f"{prometheus_url}/api/v1/label/__name__/values")
    metrics = r.json()['data']

    expensive = []
    for metric in metrics:
        # Count time series per metric
        r = requests.get(f"{prometheus_url}/api/v1/series",
                         params={'match[]': metric})
        count = len(r.json()['data'])
        if count > threshold:
            expensive.append({'metric': metric, 'series_count': count})

    return sorted(expensive, key=lambda x: x['series_count'], reverse=True)

# Found: request_duration_bucket had 2.3M series due to a URL path label
```
We found three metrics responsible for 60% of our series count. Fixed them in a day.
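One way to neutralize an offender without waiting on an application change is to drop the label at scrape time with Prometheus `metric_relabel_configs`. A sketch (the job name is a placeholder, not our actual config), with the caveat that if the dropped label was the only thing distinguishing two series, they will collide after relabeling, so the durable fix is still removing the label at the instrumentation source:

```yaml
scrape_configs:
  - job_name: api
    metric_relabel_configs:
      # labeldrop removes any label whose name matches the regex
      # from every scraped series, before ingestion.
      - action: labeldrop
        regex: user_id
```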
Strategy 2: Aggregation at Collection
Instead of sending raw metrics and aggregating at query time, aggregate at collection:
```yaml
# rules.yml - recording rules (loaded via rule_files in prometheus.yml)
groups:
  - name: aggregations
    rules:
      # Pre-aggregate p99 latency per service (not per pod)
      - record: service:http_request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (service, le))

      # Pre-aggregate error rate per service
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
```
Pre-aggregated metrics are cheaper to store AND faster to query.
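In practice this means dashboards and alerts query the recorded series instead of recomputing the quantile from raw buckets on every refresh. Using the rule names above:

```
# Before: computed from raw histogram buckets at query time
histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (service, le))

# After: one pre-computed series per service
service:http_request_duration:p99{service="api"}
```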
Strategy 3: Right-Size Your Retention
Do you really need 13 months of 15-second resolution metrics? Probably not.
```yaml
retention_policy:
  raw_metrics:
    resolution: 15s
    retention: 7 days
  5_minute_rollups:
    resolution: 5m
    retention: 30 days
  1_hour_rollups:
    resolution: 1h
    retention: 13 months
  1_day_rollups:
    resolution: 1d
    retention: 5 years
```
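The storage win from a schedule like this is easy to quantify, since samples per series scale inversely with resolution. A rough sketch of what each tier above stores per series:

```python
def samples_per_series(resolution_s, retention_s):
    """Number of stored samples for one series at a given resolution and retention."""
    return retention_s // resolution_s

DAY = 86_400
tiers = {
    "raw (15s, 7d)":     samples_per_series(15, 7 * DAY),
    "5m rollup (30d)":   samples_per_series(300, 30 * DAY),
    "1h rollup (13mo)":  samples_per_series(3600, 396 * DAY),  # ~13 months
    "1d rollup (5y)":    samples_per_series(DAY, 5 * 365 * DAY),
}
for name, n in tiers.items():
    print(f"{name}: {n:,} samples")
# Keeping raw 15s data for the full 13 months would be ~2.3M samples
# per series; the tiered schedule totals about 60K.
```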
Strategy 4: Eliminate Unused Dashboards
We had 340 dashboards. I checked access logs:
340 total dashboards
82 viewed in the last 30 days (24%)
31 viewed in the last 90 days but not the last 30 (9%)
227 not viewed in 6 months (67%)
67% of our dashboards were zombie dashboards. Nobody looked at them, but they drove metric queries.
We archived everything not viewed in 90 days. Savings: $3,200/month from reduced query load.
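The access-log check itself is simple to script. A minimal sketch, assuming one log line per dashboard view with an ISO timestamp and a dashboard ID (this log format is hypothetical, not our vendor's):

```python
from datetime import datetime, timedelta

def find_zombies(log_lines, all_dashboards, now, window_days=90):
    """Return dashboards with no recorded view inside the window."""
    cutoff = now - timedelta(days=window_days)
    recently_viewed = set()
    for line in log_lines:
        ts_str, dashboard_id = line.split(" ", 1)
        if datetime.fromisoformat(ts_str) >= cutoff:
            recently_viewed.add(dashboard_id)
    return sorted(set(all_dashboards) - recently_viewed)

logs = [
    "2024-05-01T09:00:00 api-latency",
    "2024-01-10T14:30:00 old-batch-jobs",
]
print(find_zombies(logs, ["api-latency", "old-batch-jobs", "never-viewed"],
                   now=datetime(2024, 5, 15)))
# ['never-viewed', 'old-batch-jobs']
```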
Results
Before: $52,000/month
Metrics cardinality fix: -$11,000
Log tier/sampling changes: -$8,000
Dashboard cleanup: -$3,200
Retention right-sizing: -$5,800
Duplicate tool consolidation: -$6,000
After: $18,000/month
Annual savings: $408,000
And we actually improved our observability because the remaining data was higher quality.
If you want monitoring that's cost-effective by design, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com