The $50K/Month Monitoring Bill
I audited our monitoring stack last quarter. The total cost across all tools: $52,000/month. For a company with 200 engineers. That's $260 per engineer per month just to watch our systems.
Something had to change.
Where the Money Goes
Breakdown of $52K/month:
Custom metrics ingestion: $18,000 (35%)
Log storage & search: $14,000 (27%)
APM/Tracing: $9,000 (17%)
Alerting platform: $4,000 (8%)
Synthetic monitoring: $3,000 (6%)
Dashboards & visualization: $2,000 (4%)
Status page: $2,000 (4%)
The top two — metrics and logs — were 62% of the bill.
Strategy 1: Metrics Cardinality Audit
High-cardinality metrics are the #1 cost driver. One bad label can 10x your bill:
```
# This is fine: ~100 time series
http_requests_total{method="GET", status="200", service="api"}

# This is expensive: ~1,000,000 time series
http_requests_total{method="GET", status="200", service="api", user_id="..."}
#                                                              ^^^^^^^^^^^^^
#                                                              100K unique values
```
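Cardinality multiplies across labels, so the blow-up is easy to estimate before it shows up on the bill. A quick back-of-envelope sketch (the per-label counts here are illustrative, not from our stack):

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case time-series count: the product of each label's unique values."""
    return prod(label_cardinalities.values())

base = {"method": 5, "status": 10, "service": 2}    # ~100 series
with_user = {**base, "user_id": 10_000}             # adding one label: ~1,000,000 series

print(series_count(base))       # 100
print(series_count(with_user))  # 1000000
```

One unbounded label turns a hundred series into a million, which is why the audit below targets labels, not metrics.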
I wrote a script to find high-cardinality offenders:
```python
import requests

def find_high_cardinality_metrics(prometheus_url, threshold=10000):
    # Get all metric names
    r = requests.get(f"{prometheus_url}/api/v1/label/__name__/values")
    metrics = r.json()['data']

    expensive = []
    for metric in metrics:
        # Count time series per metric
        r = requests.get(f"{prometheus_url}/api/v1/series",
                         params={'match[]': metric})
        count = len(r.json()['data'])
        if count > threshold:
            expensive.append({'metric': metric, 'series_count': count})

    return sorted(expensive, key=lambda x: x['series_count'], reverse=True)

# Found: request_duration_bucket had 2.3M series due to a URL path label
```
We found three metrics responsible for 60% of our series count. Fixed them in a day.
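One way to neutralize an offender without waiting on an application change is to drop the label at scrape time with Prometheus `metric_relabel_configs`. A sketch (the job name is a placeholder, not our actual config), with the caveat that if the dropped label was the only thing distinguishing two series, they will collide after relabeling, so the durable fix is still removing the label at the instrumentation source:

```yaml
scrape_configs:
  - job_name: api
    metric_relabel_configs:
      # labeldrop removes any label whose name matches the regex
      # from every scraped series, before ingestion.
      - action: labeldrop
        regex: user_id
```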
Strategy 2: Aggregation at Collection
Instead of sending raw metrics and aggregating at query time, aggregate at collection:
```yaml
# rules.yml - recording rules (loaded via rule_files in prometheus.yml)
groups:
  - name: aggregations
    rules:
      # Pre-aggregate p99 latency per service (not per pod)
      - record: service:http_request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (service, le))

      # Pre-aggregate error rate per service
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
```
Pre-aggregated metrics are cheaper to store AND faster to query.
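In practice this means dashboards and alerts query the recorded series instead of recomputing the quantile from raw buckets on every refresh. Using the rule names above:

```
# Before: computed from raw histogram buckets at query time
histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (service, le))

# After: one pre-computed series per service
service:http_request_duration:p99{service="api"}
```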
Strategy 3: Right-Size Your Retention
Do you really need 13 months of 15-second resolution metrics? Probably not.
```yaml
retention_policy:
  raw_metrics:
    resolution: 15s
    retention: 7 days
  5_minute_rollups:
    resolution: 5m
    retention: 30 days
  1_hour_rollups:
    resolution: 1h
    retention: 13 months
  1_day_rollups:
    resolution: 1d
    retention: 5 years
```
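The storage win from a schedule like this is easy to quantify, since samples per series scale inversely with resolution. A rough sketch of what each tier above stores per series:

```python
def samples_per_series(resolution_s, retention_s):
    """Number of stored samples for one series at a given resolution and retention."""
    return retention_s // resolution_s

DAY = 86_400
tiers = {
    "raw (15s, 7d)":     samples_per_series(15, 7 * DAY),
    "5m rollup (30d)":   samples_per_series(300, 30 * DAY),
    "1h rollup (13mo)":  samples_per_series(3600, 396 * DAY),  # ~13 months
    "1d rollup (5y)":    samples_per_series(DAY, 5 * 365 * DAY),
}
for name, n in tiers.items():
    print(f"{name}: {n:,} samples")
# Keeping raw 15s data for the full 13 months would be ~2.3M samples
# per series; the tiered schedule totals about 60K.
```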
Strategy 4: Eliminate Unused Dashboards
We had 340 dashboards. I checked access logs:
340 total dashboards
82 viewed in the last 30 days (24%)
31 viewed in the last 90 days but not the last 30 (9%)
227 not viewed in 6 months (67%)
67% of our dashboards were zombie dashboards. Nobody looked at them, but they drove metric queries.
We archived everything not viewed in 90 days. Savings: $3,200/month from reduced query load.
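The access-log check itself is simple to script. A minimal sketch, assuming one log line per dashboard view with an ISO timestamp and a dashboard ID (this log format is hypothetical, not our vendor's):

```python
from datetime import datetime, timedelta

def find_zombies(log_lines, all_dashboards, now, window_days=90):
    """Return dashboards with no recorded view inside the window."""
    cutoff = now - timedelta(days=window_days)
    recently_viewed = set()
    for line in log_lines:
        ts_str, dashboard_id = line.split(" ", 1)
        if datetime.fromisoformat(ts_str) >= cutoff:
            recently_viewed.add(dashboard_id)
    return sorted(set(all_dashboards) - recently_viewed)

logs = [
    "2024-05-01T09:00:00 api-latency",
    "2024-01-10T14:30:00 old-batch-jobs",
]
print(find_zombies(logs, ["api-latency", "old-batch-jobs", "never-viewed"],
                   now=datetime(2024, 5, 15)))
# ['never-viewed', 'old-batch-jobs']
```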
Results
Before: $52,000/month
Metrics cardinality fix: -$11,000
Log tier/sampling changes: -$8,000
Dashboard cleanup: -$3,200
Retention right-sizing: -$5,800
Duplicate tool consolidation: -$6,000
After: $18,000/month
Annual savings: $408,000
And we actually improved our observability because the remaining data was higher quality.
If you want monitoring that's cost-effective by design, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com