DEV Community

Anushka B

Posted on • Originally published at aicloudstrategist.com

Datadog alternatives for Indian mid-market (cost-aware)

Ask ten Indian Series B SaaS CTOs what they pay Datadog per month, and eight of them will sigh before answering. The median honest number is ₹10.6 lakh/month (OneUptime's 2024 survey of 140 Indian SaaS companies). The top quartile pays above ₹22 lakh. The bottom quartile, typically Series A with a Starter plan, sits around ₹3 lakh — and climbs fast.

That is a line item on the P&L that nobody budgets for at incorporation, that scales super-linearly with traffic, and that most Indian CFOs cannot defend to a board at year three. This post is the math on why — and what Indian teams can do instead.

Why Datadog bills balloon in the Indian mid-market

Datadog's base pricing is not the problem. The $15/host/month Infrastructure plan is competitive. The problem is the add-ons, each metered separately, most of which you'll hit by your second year if the product works:

  • Log ingestion + retention. $0.10/GB ingested, $1.27/million events indexed at 15-day retention. An Indian SaaS generating 400 GB of logs/day (not unusual for a transactional product at Series B) pays ~₹8 lakh/month on logs alone.
  • APM + distributed tracing. $31/host/month for APM Pro, plus ingestion overages. A 40-host cluster is ₹1 lakh/month before overages.
  • Custom metrics. $0.05/metric/month. Easy to blow past 10,000 custom metrics when one Spring Boot service auto-generates 200 per endpoint. ₹40,000/month, silently.
  • Synthetic monitoring. $5 per 10,000 API tests. Browser tests at $12/1,000 runs. Global checks compound fast.
  • RUM (real user monitoring). $1.50/1,000 sessions. For a consumer-facing SaaS with 2 million sessions/month, ~₹2.5 lakh/month.
  • Security monitoring, CSPM, DBM. Each is its own line item. Each adds ₹50K–₹2L.

The pricing isn't dishonest — Datadog publishes all of this. The problem is that Indian engineering teams adopt Datadog at Series A for its Infrastructure plan, and by Series B have gradually accreted 6–8 product lines, none of which anyone individually approved.

The real spend distribution (OneUptime 2024 survey, Indian SaaS)

| Company stage | Median Datadog spend/mo | Top quartile |
| --- | --- | --- |
| Seed (pre-product) | ₹0 (Free tier) | ₹45,000 |
| Series A | ₹1.8 lakh | ₹4.2 lakh |
| Series B | ₹10.6 lakh | ₹22 lakh |
| Series C+ | ₹28 lakh | ₹70 lakh+ |

Survey sample: n=140 Indian SaaS companies, reporting period Q3 2024. Source: OneUptime public cost benchmark. Numbers converted at ₹83/USD and rounded.

The open-source stack: real infra cost + real SRE cost

The alternative that works for 70% of Indian mid-market SaaS is the Grafana Labs open stack: Prometheus (metrics), Grafana (dashboards), Loki (logs), Tempo (distributed tracing), optionally Alloy (agent) and OnCall (alerting).

Deployed honestly on EKS in ap-south-1, the infrastructure bill breaks down like this:

| Component | Monthly ₹ | Sized for |
| --- | --- | --- |
| Prometheus (3× m5.xlarge + gp3 120 GB) | ₹14,000 | 60 hosts, 15-day retention |
| Thanos (long-term storage, S3-backed) | ₹5,000 | 1-year retention, compactor workload |
| Loki (2× m5.large + S3 storage) | ₹12,000 | 200 GB/day log ingest |
| Tempo (1× m5.large + S3 storage) | ₹4,000 | Distributed tracing, 30-day retention |
| Grafana (1× t3.medium behind ALB) | ₹3,500 | Dashboards + alerting UI |
| S3 storage (logs + metrics cold) | ₹6,500 | ~2.4 TB/mo, IA tier after 30 days |
| Subtotal (infra) | ₹45,000 | mid-market typical |

Range: ₹35,000–₹90,000/month, scaling with log volume and retention horizon. Even at the high end, that is more than 10x cheaper than the median Series B Datadog bill.

But infra isn't the full cost. The honest additional cost is SRE time:

  • Upgrades and patching: ~8 hours/month
  • Dashboard maintenance, alert tuning: ~12 hours/month
  • Incident response (on-call rotation overhead): ~10 hours/month
  • Cost-tuning cardinality, fixing broken exporters: ~6 hours/month

Total: 20–40 hours of SRE time/month, depending on maturity. At a ₹6,000/hour loaded cost, that's ₹1.2L–₹2.4L/month in time. Combined (infra + time): ₹1.7L–₹3.3L/month — still 3–6x cheaper than Datadog, but the SRE cost is the piece Datadog's marketing correctly points out.
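The combined math can be sketched directly from the figures above:

```python
# Self-hosted total cost of ownership vs the median Datadog bill,
# using the ranges quoted above (all figures ₹/month).
infra_low, infra_high = 45_000, 90_000              # typical -> high-end infra
sre_rate = 6_000                                    # loaded ₹/hour
sre_low, sre_high = 20 * sre_rate, 40 * sre_rate    # 20-40 hours/month

tco_low = infra_low + sre_low                       # ₹1.65L/month
tco_high = infra_high + sre_high                    # ₹3.30L/month

datadog_median = 10_60_000                          # ₹10.6 lakh survey median
print(round(datadog_median / tco_high, 1),          # ~3.2x cheaper, worst case
      round(datadog_median / tco_low, 1))           # ~6.4x cheaper, best case
```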

The managed option: ₹74,999/month, all-in

For Indian mid-market teams that don't have dedicated SRE bandwidth — and most Series B teams below 80 engineers don't — we run the entire Prometheus+Grafana+Loki+Tempo stack as a managed service through our Observe module.

The commercial is straightforward:

  • ₹74,999/month, includes infra (hosted in your AWS account) + ongoing operation + 24×5 alert triage
  • We handle upgrades, patching, retention tuning, dashboard library maintenance
  • Migration from Datadog included: 6–8 week parallel-run, dashboard port, alert port, cutover
  • Your engineers keep admin access end-to-end — you own the stack, we operate it

Break-even math: a Series B team paying ₹10L/month on Datadog saves ₹9.25 lakh/month (92.5%) switching to our managed Observe. Over 24 months, that's ₹2.2 crore back on the P&L. Our earlier post on why Indian SaaS gives up on observability covers why the DIY path fails for teams without staff SRE; the managed option closes that gap.
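The arithmetic, spelled out:

```python
# Break-even for the managed option, per the figures above (₹/month).
datadog = 10_00_000          # ₹10 lakh/month Datadog bill
managed = 74_999             # managed Observe price

monthly_saving = datadog - managed            # ₹9,25,001 ~= ₹9.25 lakh
saving_pct = 100 * monthly_saving / datadog   # ~92.5% of the bill
two_year = monthly_saving * 24                # ₹2,22,00,024 ~= ₹2.2 crore
print(monthly_saving, round(saving_pct, 1), two_year)
```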

The migration playbook (what actually happens in 6–8 weeks)

Every Datadog-to-open-stack migration we've run follows the same five-phase shape. Share this with your engineering lead before agreeing to anything — the phases are what determine whether the cutover is painful or routine.

Phase 1 — inventory (week 1). List every Datadog agent, integration, dashboard, monitor, synthetic check, log pipeline, APM service, and custom metric in use. Map each to its owner and its criticality tier (P0 = page on-call, P1 = Slack channel, P2 = weekly report). We use a single Google Sheet for this; no special tooling needed. For a typical Series B team we find 40–80 dashboards, 200–400 monitors, 60–120 logged services. This inventory becomes the acceptance criteria for Phase 5.

Phase 2 — stack stand-up (weeks 2–3). Deploy Prometheus + Thanos + Grafana + Loki + Tempo on EKS in your AWS account. Configure remote-write from existing application exporters (most teams already emit Prometheus-format metrics via Micrometer or prometheus_client libraries). Stand up Alloy as the agent on every host for log shipping to Loki. Point synthetic probes (we use Blackbox Exporter) at the critical endpoints identified in Phase 1. Configure Grafana OSS (not Cloud) with OIDC auth tied to your existing identity provider. Import the baseline dashboard pack — we maintain a library of 30 Indian-SaaS-typical dashboards (JVM, Postgres, Redis, Nginx, Kong, Envoy, Spring Boot, Node, Go, Python/FastAPI).
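For the synthetic probes, the standard Blackbox Exporter scrape pattern looks like this — a sketch; the job name, target URL, and exporter address are placeholders, not values from any real engagement:

```yaml
# prometheus.yml fragment -- probe critical endpoints via Blackbox Exporter.
# Target URL and exporter address are illustrative placeholders.
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]               # expect an HTTP 2xx from the target
    static_configs:
      - targets:
          - https://api.example.internal/healthz
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # pass the URL as ?target=
      - source_labels: [__param_target]
        target_label: instance         # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115   # actually scrape the exporter
```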

Phase 3 — parallel run (weeks 3–5). Dual-emit metrics to both Datadog and Prometheus. Dual-ship logs to both Datadog Logs and Loki. Validate every dashboard port by side-by-side screenshot review. This is where 80% of the work lives — dashboards are rarely perfectly equivalent, and the team discovers several Datadog dashboards nobody has looked at in 14 months. Kill them.

Phase 4 — alerting cutover (week 6). Migrate monitors from Datadog to Grafana Alerting + Prometheus Alertmanager. This is a cautious phase because on-call workflows must not regress. We cut over in three waves: P2 first (weekly reports), then P1 (Slack), then P0 (PagerDuty integration) only after two weeks of silent-mode parallel running where the open stack fires alerts but doesn't page. This catches false positives and gaps before anyone loses sleep over them.
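Silent mode can be expressed as an Alertmanager route whose default receiver has no delivery configuration — a sketch with illustrative names; a real config would also need a global slack_api_url for the Slack receiver:

```yaml
# alertmanager.yml fragment -- evaluate everything, page nothing.
# During the parallel run, the default receiver has no delivery config,
# so P0/P1 alerts fire visibly in Alertmanager/Grafana but reach no
# pager. Receiver names and channel are illustrative.
route:
  receiver: blackhole
  routes:
    - matchers:
        - tier = "P2"
      receiver: slack-weekly        # P2 already cut over in wave one
receivers:
  - name: blackhole                 # no *_configs: alerts are swallowed
  - name: slack-weekly
    slack_configs:
      - channel: "#alerts-weekly"   # assumes global slack_api_url is set
```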

Phase 5 — Datadog decommission (weeks 7–8). Uninstall Datadog agents, disable synthetic tests, export historical data to S3 for archival (Datadog's export is slow — start this in week 4, not week 7), and cancel the subscription. Monthly Datadog spend goes to zero in the next billing cycle.

Cardinality is the gotcha nobody warns you about

The single largest self-hosted observability failure mode is metric cardinality explosion. A well-intentioned team instruments a new service with a user_id label on every metric. A month later, Prometheus is eating 40 GB of RAM, queries are timing out, and the Kubernetes node is OOMKilling the collector.

This is the reason we include a cardinality-governance playbook in every Observe engagement. Rules we enforce:

  • Never label metrics with unbounded dimensions (user IDs, request IDs, session tokens, email addresses).
  • Cap per-metric cardinality at 10,000 series. Auto-alert on any metric whose series count stays above 5,000 for 3 days.
  • Prefer pre-aggregated metrics (Prometheus recording rules) over raw high-cardinality gauges.
  • Run prometheus_tsdb_head_series alerting with a growth-rate threshold — if the count is compounding, someone pushed a bad label.
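The first two rules are just multiplication: a metric's series count is the product of its label cardinalities, so one unbounded label dominates everything else. A quick sketch with illustrative counts:

```python
# Series count = product of per-label value counts. One unbounded label
# turns a cheap metric into a memory problem. Counts are illustrative.
from math import prod

def series_count(label_cardinality: dict[str, int]) -> int:
    return prod(label_cardinality.values())

good = series_count({"method": 7, "status_class": 5, "service": 30})
bad = series_count({"method": 7, "status_class": 5, "user_id": 100_000})

print(good)   # 1,050 -- well under the 10,000-series cap
print(bad)    # 3,500,000 -- one user_id label, 3.5M series
```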

Datadog papers over this problem by billing for it (each unique time series is a "custom metric" at $0.05/month). Prometheus punishes you operationally if you do it wrong. Knowing the difference is the whole craft.

When Datadog is still the right answer

This isn't a hit piece. Datadog is the best-of-breed product in several specific scenarios:

  1. US/global customer base with enterprise-grade compliance needs. Datadog's SOC 2, ISO 27001, HIPAA, and FedRAMP postures are ahead of most open-source alternatives.
  2. AI/ML pipelines where Datadog's profiling + Watchdog anomaly detection are core. These features have no clean open-source equivalent.
  3. Teams > 400 engineers where the operational lift of a self-hosted stack exceeds the cost delta. Above that scale, Datadog's seat cost is the cheap part of your observability budget.

For the Indian mid-market target we work with — ₹5L–₹50L/month cloud spend, 40–150 engineers, ap-south-1-primary — the economics tilt decisively toward the open stack, managed or DIY.

What about SigNoz, Uptrace, New Relic, and the other alternatives?

The decision tree is not binary "Datadog vs. Prometheus stack". Several alternatives are worth knowing about, in rough order of Indian mid-market fit:

  • SigNoz — Indian-built, OpenTelemetry-native, strong APM, open-core. Free self-hosted, SaaS pricing starts at ~$0.4/GB log ingest. Good fit for teams that want a single product with APM built in, not a toolchain to integrate. Our honest feedback: feature parity with Datadog's APM lags by ~12 months but is closing fast.
  • Uptrace — similar positioning to SigNoz, ClickHouse-backed, OpenTelemetry-native. Lighter footprint, less mature ecosystem. Good fit for teams who want OSS-first with a clear migration path to a commercial tier later.
  • New Relic — the legacy incumbent. Pricing model (user-seat-based since 2020) makes it cheaper than Datadog at small engineering team sizes (<25 users) and more expensive above that. The India sales team is strong. Worth comparing if team size is <40 engineers.
  • Last9 — Indian observability platform, TimescaleDB-backed, used by Disney+ Hotstar and others. Strong for high-cardinality metrics, handles ingestion cost well. Good for teams whose primary pain is Datadog's custom-metric pricing.
  • AWS-native (CloudWatch + X-Ray) — rarely the right answer standalone. CloudWatch Logs Insights query costs compound faster than most teams expect. Useful as a complement, not a replacement.

Our bias, clearly stated: the Grafana-stack managed option wins on cost + flexibility + vendor-lock-in resistance. But if your team has a strong preference for a single-product SaaS experience and the budget tolerates it, SigNoz or Last9 are both credible choices for Indian mid-market.

Next step

If your current Datadog bill is above ₹5 lakh/month and you'd like a side-by-side cost comparison against a managed open-stack alternative — with no sales follow-up if the numbers don't work in your favour — we do that as part of our free 24-hour audit.

Start your free observability cost audit → aicloudstrategist.com/audit.html

Or jump straight to the product page: /observe.html.


Founder-led by Anushka B. AICloudStrategist is an Indian founding-cohort cloud consultancy. We publish pricing, benchmarks, and honest comparisons — including when a competitor is the better fit. See how we prove what we claim. The first three Observe customers pay ₹40,000/month under our launch cohort; the standard price is ₹74,999/month thereafter.

AICloudStrategist · Founder-led. Enterprise-reviewed. · Written by Anushka B, Founder.
