The "Obvious" Math That's Wrong
Engineer A: "Datadog is $15K/month. Prometheus is free. We should self-host."
Engineer B: "But we'd need to pay an SRE to run it. That's $150K/year."
Engineer A: "Prometheus doesn't need a full SRE. It's easy."
Engineer B: "Famous last words."
This conversation happens at every company. Both sides have points. The real math is more complex.
The Total Cost Breakdown
Managed (Datadog, New Relic, Dynatrace):
Licensing: $X/month (scales with hosts, events, logs)
Integration time: 1-2 weeks per service
Training: 1 day per new hire
Ongoing: minimal
Self-hosted (Prometheus + Grafana + Loki + Alertmanager):
Infrastructure: hosting costs (~$500-$5000/month depending on scale)
Initial setup: 2-4 weeks of engineering time
Ongoing maintenance: 10-30% of 1 FTE
Upgrade costs: quarterly, each upgrade ~1 week
Storage growth: ~20% per year
Expertise: junior → senior SRE hire required
The honest answer: managed is cheaper for teams under 50 engineers. Self-hosted becomes cheaper around 200+ engineers if you can run it well.
The Real Variables
It's not just licensing cost vs. hosting cost. These factors matter more:
1. Data volume growth
Managed tools charge per GB ingested or per metric. If your log volume grows 10x, so does your bill.
Self-hosted cost scales with the compute and storage you provision, so you control the growth.
2. Retention requirements
Managed tools often charge extra for long retention. With self-hosted, you can retain as much as your disks allow.
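To make "as much as your disks allow" concrete, here is a minimal sizing sketch. It uses the rough capacity formula from the Prometheus storage documentation (retention time x ingested samples per second x bytes per sample, with roughly 1-2 bytes per sample on average). The series count and scrape interval are hypothetical numbers chosen for illustration, not figures from this article.

```python
def prometheus_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention_seconds * samples/s * bytes/sample.

    bytes_per_sample defaults to 2.0, the upper end of the 1-2 byte
    average quoted in the Prometheus storage docs.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 500k active series scraped every 15 seconds.
samples_per_second = 500_000 / 15

for days in (15, 90, 365):
    gib = prometheus_disk_bytes(days, samples_per_second) / 2**30
    print(f"{days:>3} days of retention: ~{gib:,.0f} GiB")
```

Treat the result as a lower bound for planning: it ignores WAL overhead, compaction headroom, and any logs stored in Loki.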
3. Cardinality
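Prometheus struggles at high cardinality, because every unique combination of label values is a separate time series it has to track in memory. Datadog handles it but charges more. High-cardinality metrics are where self-hosted breaks.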
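A quick way to see why cardinality bites: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses hypothetical label names and counts, purely for illustration.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on time series for one metric.

    label_cardinalities maps label name -> number of distinct values.
    A metric with no labels is a single series (prod of nothing is 1).
    """
    return prod(label_cardinalities.values())


# Hypothetical labels on an HTTP request counter.
http_requests = {"method": 5, "status": 10, "endpoint": 50, "pod": 100}
print(worst_case_series(http_requests))  # 250000

# Adding one unbounded label (a per-user ID) multiplies everything.
with_user_id = {**http_requests, "user_id": 10_000}
print(worst_case_series(with_user_id))  # 2500000000
```

Real series counts are usually lower than the worst case because not every combination occurs, but a single unbounded label such as a user or request ID is enough to push a self-hosted setup past what it can hold.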
4. Incident rate
Heavy incident load means heavy query load on your monitoring tools. Self-hosted needs bigger compute for this.
5. Team expertise
If your team has never run Prometheus, you'll spend 6 months in the pit learning cardinality mistakes, retention tuning, and HA setups. That's not free.
The Break-Even Calculation
Rough calculation for a 50-engineer startup:
Managed (Datadog):
- Licensing: $10K/month = $120K/year
- Maintenance: ~10 hours/month of engineering time
- Total: ~$125K/year
Self-Hosted (Prometheus + Grafana + Loki):
- Infrastructure: $2K/month = $24K/year
- Maintenance: ~30% of 1 SRE = $45K/year (loaded)
- Initial setup: $30K one-time
- Total year 1: $99K
- Total year 2+: $69K/year
On paper, at 50 engineers, self-hosted saves ~$56K/year once you're past the setup phase ($125K vs. $69K).
BUT:
- If the SRE quits, you're in trouble
- If the cardinality explodes, you're in trouble
- If an upgrade fails, you're in trouble
- If the monitoring stack itself needs 24/7 reliability, add another $50K/year
Factor these in honestly before you commit.
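To factor them in, it helps to run the numbers both ways. The sketch below reuses the figures from the calculation above ($125K/year managed, $69K/year self-hosted, $30K setup, $50K/year for 24/7 coverage). The function name and the 10-year horizon are assumptions made for illustration.

```python
def break_even_year(managed_annual, self_hosted_annual, setup_cost, horizon=10):
    """First year in which cumulative self-hosted cost drops below managed.

    Returns None if self-hosted never becomes cheaper within the horizon.
    """
    for year in range(1, horizon + 1):
        managed_total = managed_annual * year
        self_hosted_total = setup_cost + self_hosted_annual * year
        if self_hosted_total < managed_total:
            return year
    return None


MANAGED = 125_000               # licensing plus maintenance time
SELF_HOSTED = 24_000 + 45_000   # infrastructure plus 30% of one SRE
SETUP = 30_000                  # one-time
ALWAYS_ON = 50_000              # extra cost for 24/7 reliable monitoring

print(break_even_year(MANAGED, SELF_HOSTED, SETUP))              # 1
print(break_even_year(MANAGED, SELF_HOSTED + ALWAYS_ON, SETUP))  # 6
```

With the base numbers, self-hosted is ahead in year 1. Add the 24/7 line item and the annual saving shrinks to about $6K, so the setup cost is not recovered until year 6.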
The Hybrid Approach
Most teams we see end up hybrid:
Metrics: Self-hosted Prometheus (cheap at scale)
Logs: Managed service (complex to self-host at scale)
Traces: Managed service (requires specialized knowledge)
Alerting: Self-hosted Alertmanager (simple, stable)
Dashboards: Self-hosted Grafana
This captures the best of both:
- Cheap for commodity metrics
- Paid for complex/specialized (logs, traces)
- Predictable ongoing costs
When Managed Is Obviously Right
- Team has < 20 engineers
- No one has ops expertise
- Growing fast, need to ship features
- Compliance (SOC 2 or HIPAA in particular) requires certified tools
- Need it running in 2 weeks, not 3 months
When Self-Hosted Is Obviously Right
- Team has 100+ engineers
- Strong ops culture already in place
- Budget pressure from managed costs
- Very high data volume (10TB+/day logs)
- Want full control over data (privacy, residency)
The Gray Zone (50-200 engineers)
Most startups land here. The calculation is close. Three factors break the tie:
Factor 1: Engineering time availability
If your SRE team is underwater with incidents, adding "maintain Prometheus" to their plate is a disaster. Pay for managed.
Factor 2: Growth rate
If you're doubling engineers every 6 months, managed costs explode. Build self-hosted capacity now.
Factor 3: Data sovereignty
Some customers won't accept their data leaving your infrastructure. This forces self-hosted regardless of cost.
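The three tie-breakers can be written down as a small rule set. This is a sketch, not a formula: the function and parameter names are hypothetical, and the order in which the rules are checked is an assumption (data sovereignty first, since it applies regardless of cost).

```python
from typing import Optional


def gray_zone_tiebreak(
    sre_team_overloaded: bool,
    months_to_double_headcount: Optional[float],
    data_must_stay_in_house: bool,
) -> str:
    """Apply the three gray-zone factors in an assumed priority order."""
    # Factor 3: customer data residency forces self-hosted regardless of cost.
    if data_must_stay_in_house:
        return "self-hosted"
    # Factor 1: an SRE team already underwater should not take on Prometheus.
    if sre_team_overloaded:
        return "managed"
    # Factor 2: doubling every 6 months or faster makes managed costs explode.
    if months_to_double_headcount is not None and months_to_double_headcount <= 6:
        return "self-hosted"
    return "undecided"


print(gray_zone_tiebreak(True, 6, False))      # managed
print(gray_zone_tiebreak(False, 6, False))     # self-hosted
print(gray_zone_tiebreak(False, None, False))  # undecided
```

If the result is "undecided", none of the three factors applies and the decision falls back to the cost comparison.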
The Hidden Costs of Managed
Managed isn't just the invoice. Also consider:
- Integration debt: every service you instrument with Datadog is another service locked into Datadog
- Custom metric costs: 10M custom metrics in Datadog is expensive
- Long-term pricing risk: vendors raise prices once you're locked in
- Export restrictions: getting your data out is often painful
- Support quality: P1 support response times vary wildly
Factor these into your TCO.
The Hidden Costs of Self-Hosted
- On-call for monitoring: when Prometheus dies, someone has to fix it
- Upgrade risk: each major version can break dashboards
- Scaling anxiety: "will it handle the next 2x growth?" is a question you ask yourself weekly
- Knowledge concentration: if the one expert leaves, you're stranded
- Integration maintenance: every new service needs exporters/agents
These aren't line items on an invoice. They're real costs.
The Decision Framework
Answer these questions honestly:
- How many engineers can dedicate time to monitoring maintenance?
- How fast is your data volume growing?
- What's your current monthly bill for managed tools?
- How much pain is there from vendor lock-in?
- What's the opportunity cost of engineering time spent on monitoring?
- Can you recruit or train the expertise?
If you can't answer all six confidently, stay managed until you can.
The Recommendation
Years 0-2: Managed. Ship features, don't fight your monitoring stack.
Years 3-5: Start evaluating. Maybe hybrid. Hire an SRE with deep ops experience.
Years 5+: Likely hybrid or mostly self-hosted. Costs justify the complexity.
The best monitoring stack is the one your team can actually operate. If that's managed, pay the bill. If that's self-hosted, invest in the expertise. Don't try to do both poorly.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com