Samson Tanimawo

Posted on May 7

The Economics of Self-Hosting vs. Managed Monitoring

#monitoring #costs #devops #selfhosted

The "Obvious" Math That's Wrong

Engineer A: "Datadog is $15K/month. Prometheus is free. We should self-host."

Engineer B: "But we'd need to pay an SRE to run it. That's $150K/year."

Engineer A: "Prometheus doesn't need a full SRE. It's easy."

Engineer B: "Famous last words."

This conversation happens at every company. Both sides have a point, and the real math is more complex than either version.

The Total Cost Breakdown

Managed (Datadog, New Relic, Dynatrace):

Licensing: $X/month (scales with hosts, events, logs)
Integration time: 1-2 weeks per service
Training: 1 day per new hire
Ongoing: minimal

Self-hosted (Prometheus + Grafana + Loki + Alertmanager):

Infrastructure: hosting costs (~$500-$5000/month depending on scale)
Initial setup: 2-4 weeks of engineering time
Ongoing maintenance: 10-20% of 1 FTE
Upgrade costs: quarterly, each upgrade ~1 week
Storage growth: ~20% per year
Expertise: requires hiring an SRE, anywhere from junior to senior

The honest answer: managed is cheaper for teams under 50 engineers. Self-hosted becomes cheaper around 200+ engineers if you can run it well.

The Real Variables

It's not just licensing cost vs. hosting cost. These factors matter more:

1. Data volume growth

Managed tools charge per GB ingested or per metric. If your log volume grows 10x, so does your bill.

Self-hosted cost scales with the compute and storage you provision, so you control the growth.

2. Retention requirements

Managed tools often charge extra for long retention. With self-hosted, you can store as much as your disk allows.
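
For Prometheus specifically, "as much as your disk allows" can be estimated with the capacity-planning rule of thumb from the Prometheus storage documentation: retention time in seconds, times ingested samples per second, times bytes per sample. A rough sketch follows. The series count, scrape interval, and the 2 bytes per sample figure are illustrative assumptions, not measurements from any real deployment:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape interval."""
    return active_series / scrape_interval_s


def needed_disk_bytes(retention_days: float, ingest_rate: float,
                      bytes_per_sample: float = 2.0) -> float:
    """retention_seconds * samples_per_second * bytes_per_sample.

    bytes_per_sample=2.0 is an assumed upper-end average; measure your own.
    """
    return retention_days * SECONDS_PER_DAY * ingest_rate * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical workload: 1.5M active series scraped every 15 seconds.
    rate = samples_per_second(active_series=1_500_000, scrape_interval_s=15)
    for days in (15, 90, 365):
        gb = needed_disk_bytes(days, rate) / 1e9
        print(f"{days:>3} days of retention: ~{gb:,.0f} GB")
```

Under these assumptions, disk grows in direct proportion to retention, so the self-hosted price of keeping a year of metrics is mostly a storage line item rather than a pricing-tier change.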

3. Cardinality

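Prometheus struggles at high cardinality, because every unique label combination becomes its own time series that must be indexed and held in memory. Datadog handles it but charges more. High-cardinality metrics are where self-hosted breaks.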
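
To see how quickly this gets out of hand, note that the worst-case number of time series for a single metric is the product of the distinct values of each label. A small sketch with made-up label names and counts (all hypothetical):

```python
from math import prod
from typing import Mapping


def worst_case_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical HTTP request counter.
    labels = {"endpoint": 50, "status_code": 5, "pod": 200}
    print(f"{worst_case_series(labels):,}")  # 50,000

    # Adding one unbounded label (e.g. a user ID) multiplies everything.
    labels["user_id"] = 100_000
    print(f"{worst_case_series(labels):,}")  # 5,000,000,000
```

This is an upper bound, since not every combination occurs in practice, but it shows why a single unbounded label is usually the thing to look for first.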

4. Incident rate

Heavy incident load means heavy query load on your monitoring tools. Self-hosted needs bigger compute for this.

5. Team expertise

If your team has never run Prometheus, you'll spend 6 months in the pit learning cardinality mistakes, retention tuning, and HA setups. That's not free.

The Break-Even Calculation

Rough calculation for a 50-engineer startup:

Managed (Datadog):
- Licensing: $10K/month = $120K/year
- Maintenance: ~10 hours/month of engineering time
- Total: ~$125K/year

Self-Hosted (Prometheus + Grafana + Loki):
- Infrastructure: $2K/month = $24K/year
- Maintenance: ~30% of 1 SRE = $45K/year (loaded)
- Initial setup: $30K one-time
- Total year 1: $99K
- Total year 2+: $69K/year

On paper, at 50 engineers, self-hosted saves about $56K/year once you're past the setup phase, and about $26K in year 1.
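
The arithmetic above can be reproduced with a few lines of Python, which also makes it easy to plug in your own numbers. This is only a sketch: the figures are the ones from this example, and the `Option` class and `payback_months` helper are names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    recurring_per_year: int
    one_time: int = 0

    def cumulative_cost(self, years: int) -> int:
        """Total spend after `years`, including any one-time setup."""
        return self.one_time + self.recurring_per_year * years


def payback_months(one_time: float, annual_saving: float) -> float:
    """Months of steady-state savings needed to recover the setup cost."""
    return one_time / (annual_saving / 12)


# Figures taken from the 50-engineer example.
managed = Option("managed", recurring_per_year=125_000)
self_hosted = Option(
    "self-hosted",
    recurring_per_year=2_000 * 12 + 45_000,  # infrastructure + 30% of an SRE
    one_time=30_000,                         # initial setup
)

if __name__ == "__main__":
    for years in (1, 2, 3):
        diff = managed.cumulative_cost(years) - self_hosted.cumulative_cost(years)
        print(f"after {years} year(s): self-hosted is ${diff:,} cheaper")
    saving = managed.recurring_per_year - self_hosted.recurring_per_year
    print(f"setup pays back in ~{payback_months(self_hosted.one_time, saving):.1f} months")
```

With these inputs the steady-state difference is $56K/year and the setup cost is recovered in roughly six and a half months, before any of the risks below are priced in.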

BUT:

If the SRE quits, you're in trouble
If the cardinality explodes, you're in trouble
If an upgrade fails, you're in trouble
If you need 24/7 reliability of monitoring itself, add another $50K/year

Factor these in honestly before you commit.
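
As a rough illustration of how much one of these caveats moves the result, the sketch below adds the $50K/year for 24/7 reliability to the example figures. The function name is hypothetical, and the other three risks are not priced here because the example gives no numbers for them.

```python
# Figures from the 50-engineer example.
MANAGED_PER_YEAR = 125_000
SELF_HOSTED_PER_YEAR = 69_000     # infrastructure + 30% of an SRE
SETUP_ONE_TIME = 30_000           # year 1 only
RELIABILITY_24X7_PER_YEAR = 50_000


def self_hosted_cost(year: int, needs_24x7: bool) -> int:
    """Self-hosted cost for a given year (1-based), optionally with 24/7 cover."""
    cost = SELF_HOSTED_PER_YEAR
    if year == 1:
        cost += SETUP_ONE_TIME
    if needs_24x7:
        cost += RELIABILITY_24X7_PER_YEAR
    return cost


if __name__ == "__main__":
    for year in (1, 2):
        cost = self_hosted_cost(year, needs_24x7=True)
        print(f"year {year}: self-hosted ${cost:,} vs managed ${MANAGED_PER_YEAR:,}")
```

With 24/7 reliability included, self-hosted costs $149K in year 1 and $119K afterwards, so the steady-state saving against $125K shrinks to about $6K/year.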

The Hybrid Approach

Most teams we see end up hybrid:

Metrics: Self-hosted Prometheus (cheap at scale)
Logs: Managed service (complex to self-host at scale)
Traces: Managed service (requires specialized knowledge)
Alerting: Self-hosted Alertmanager (simple, stable)
Dashboards: Self-hosted Grafana

This captures the best of both:

Cheap for commodity metrics
Paid for complex/specialized (logs, traces)
Predictable ongoing costs

When Managed Is Obviously Right

Team has < 20 engineers
No one has ops expertise
Growing fast, need to ship features
Compliance requires certified tools (SOC 2 and HIPAA specifically)
Need it running in 2 weeks, not 3 months

When Self-Hosted Is Obviously Right

Team has 100+ engineers
Strong ops culture already in place
Budget pressure from managed costs
Very high data volume (10TB+/day logs)
Want full control over data (privacy, residency)

The Gray Zone (50-200 engineers)

Most startups land here. The calculation is close. Three factors break the tie:

Factor 1: Engineering time availability

If your SRE team is underwater with incidents, adding "maintain Prometheus" to their plate is a disaster. Pay for managed.

Factor 2: Growth rate

If you're doubling engineers every 6 months, managed costs explode. Build self-hosted capacity now.
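
A back-of-the-envelope projection shows what "explode" means here. The sketch assumes, as a simplification, that the managed bill scales one-to-one with headcount. Real bills follow hosts, events, and logs, so treat the doubling as a proxy rather than a pricing model. The function name and the $10K/month starting bill (borrowed from the break-even example) are illustrative.

```python
def projected_monthly_bill(bill_now: float, months_ahead: int,
                           doubling_period_months: int = 6) -> float:
    """Monthly bill after `months_ahead`, ASSUMING it doubles with headcount."""
    return bill_now * 2 ** (months_ahead / doubling_period_months)


if __name__ == "__main__":
    for months in (0, 6, 12, 18, 24):
        bill = projected_monthly_bill(10_000, months)
        print(f"month {months:>2}: ~${bill:,.0f}/month")
```

On that assumption, a $10K/month bill reaches $160K/month after two years, which is why the self-hosted capacity has to be in place before the curve steepens.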

Factor 3: Data sovereignty

Some customers won't accept their data leaving your infrastructure. This forces self-hosted regardless of cost.

The Hidden Costs of Managed

Managed isn't just the invoice. Also consider:

Integration debt: every service you wire into Datadog is one more service locked into Datadog
Custom metric costs: 10M custom metrics in Datadog gets expensive
Long-term pricing risk: vendors raise prices once you're locked in
Export restrictions: getting your data out is often painful
Support quality: P1 support response times vary wildly

Factor these into your TCO.

The Hidden Costs of Self-Hosted

On-call for monitoring: when Prometheus dies, someone has to fix it
Upgrade risk: each major version can break dashboards
Scaling anxiety: "will it handle the next 2x growth?" is a question you ask yourself weekly
Knowledge concentration: if the one expert leaves, you're stranded
Integration maintenance: every new service needs exporters/agents

These aren't line items on an invoice. They're real costs.

The Decision Framework

Answer these questions honestly:

1. How many engineers can dedicate time to monitoring maintenance?
2. How fast is your data volume growing?
3. What's your current monthly bill for managed tools?
4. How much pain is there from vendor lock-in?
5. What's the opportunity cost of engineering time spent on monitoring?
6. Can you recruit or train the expertise?

If you can't answer #1-#6 confidently, stay managed until you can.
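
The rule is simple enough to write down as a checklist. A minimal sketch, where the `recommend` function and its return strings are invented for illustration and an empty or missing answer stands for "can't answer confidently":

```python
from typing import Dict, Optional

QUESTIONS = {
    1: "How many engineers can dedicate time to monitoring maintenance?",
    2: "How fast is your data volume growing?",
    3: "What's your current monthly bill for managed tools?",
    4: "How much pain is there from vendor lock-in?",
    5: "What's the opportunity cost of engineering time spent on monitoring?",
    6: "Can you recruit or train the expertise?",
}


def recommend(answers: Dict[int, Optional[str]]) -> str:
    """Stay managed unless every question has a confident answer."""
    missing = [n for n in QUESTIONS if not answers.get(n)]
    if missing:
        numbers = ", ".join(f"#{n}" for n in missing)
        return f"stay managed (no confident answer for {numbers})"
    return "ready to evaluate self-hosted or hybrid"


if __name__ == "__main__":
    print(recommend({1: "0.3 FTE", 2: "2x per year", 3: "$10K"}))
```

The point of the sketch is the default: an unanswered question counts as a vote for staying managed, not as a neutral.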

The Recommendation

Years 0-2: Managed. Ship features, don't fight your monitoring stack.

Years 3-5: Start evaluating. Maybe hybrid. Hire an SRE with deep ops experience.

Years 5+: Likely hybrid or mostly self-hosted. Costs justify the complexity.

The best monitoring stack is the one your team can actually operate. If that's managed, pay the bill. If that's self-hosted, invest in the expertise. Don't try to do both poorly.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

DEV Community