A CTO asked me: "Should we move off Datadog? It's eating our runway."
I said: "Before you migrate, show me your retention config."
They didn't have one. Everything was still running on the defaults.
60% of the bill was DEBUG logs nobody had queried in 90 days. CloudWatch forwarders were pushing everything — access logs, auth logs, health checks. All at 30-day retention. All indexed. All paid for.
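That kind of breakdown is easy to reproduce for your own stack. A minimal sketch (the volumes below are hypothetical placeholders, not this team's real numbers; pull yours from your vendor's usage page) of how spend share by log level falls out of ingest volume:

```python
# Sketch: estimate what share of indexed-log spend each severity drives.
# At a flat per-GB indexing rate, share of volume == share of the bill.
# The GB figures here are made up for illustration.

def spend_share(ingested_gb_by_level: dict[str, float]) -> dict[str, float]:
    """Return each log level's fraction of total indexed volume."""
    total = sum(ingested_gb_by_level.values())
    return {level: gb / total for level, gb in ingested_gb_by_level.items()}

# Hypothetical month of ingest, with DEBUG dominating:
shares = spend_share({"DEBUG": 600.0, "INFO": 250.0, "WARN": 100.0, "ERROR": 50.0})
```

Run that against real usage data before any migration RFP — if one level dominates, you have a config problem, not a vendor problem.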
The migration would have taken 3 months, cost the team's sanity, and moved the same problem to Grafana.
The actual fix was a 2-week config exercise:
→ Tag logs by severity + service ownership
→ 3 retention tiers: P0 incidents keep 90d, operational 7d, DEBUG 24h
→ Stop indexing health-check logs. Archive them raw to S3 at $0.023/GB-month
→ Custom metrics audit: 18% of them weren't on any dashboard or alert
→ APM sampling reduced from 100% to 10% on non-critical services
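The tiering above is just a decision table. A sketch of it (tier durations are from this post; the function and its inputs are illustrative — in Datadog this lives in log pipeline exclusion filters and per-index retention settings, not application code):

```python
# Retention-tier routing as described above:
# P0 incident logs 90d, operational 7d, DEBUG 24h,
# health checks never indexed — archived raw to S3 instead.

RETENTION_HOURS = {"p0_incident": 90 * 24, "operational": 7 * 24, "debug": 24}

def route_log(severity: str, is_p0_incident: bool, is_health_check: bool) -> dict:
    """Decide whether a log record is indexed, and for how long."""
    if is_health_check:
        # Cheapest tier: skip indexing entirely, keep the raw copy in S3.
        return {"index": False, "archive": "s3", "retention_hours": None}
    if is_p0_incident:
        tier = "p0_incident"
    elif severity.upper() == "DEBUG":
        tier = "debug"
    else:
        tier = "operational"
    return {"index": True, "archive": None, "retention_hours": RETENTION_HOURS[tier]}
```

Writing it out like this is worth doing even if nobody runs it: it forces someone to own the answer to "who decided DEBUG needs 30 days?"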
Result: Datadog bill dropped 51% in 6 weeks. No vendor change. No re-training. No migration risk.
The observability industry loves selling you a new tool. But the problem isn't usually the tool. It's:
→ Defaults that were set when your traffic was 10x smaller
→ Nobody owns retention policy
→ Custom metrics piled up and nothing ever got deleted
→ Alerts firing so often everyone muted them
If you're about to RFP a new observability vendor: audit your current one first. You'll save 6 months and 60% of the spend.
If this sounds like your stack, repost. There's a VP Engineering reading a Grafana pitch deck right now who needs to hear it.