You Don't Need Datadog (Yet)
I keep seeing startups with 8 engineers paying Datadog $5,000/month. That's $625 per engineer per month, just for monitoring. At that stage, that money belongs in product development.
Here's the observability stack that costs under $200/month and covers 80% of what you need.
The Stack
Metrics: Prometheus + Grafana (free, self-hosted on K8s)
Logs: Loki (free, self-hosted) or CloudWatch ($)
Tracing: OpenTelemetry → Jaeger (free, self-hosted)
Alerting: Alertmanager → PagerDuty free tier
Status: Upptime (free, GitHub-based)
─────────────────────────────────────────────
Total: ~$150/month (PagerDuty + infrastructure)
Setup: 4 Hours, Not 4 Weeks
Step 1: Prometheus + Grafana (1 hour)
# Add the community chart repo, then install the kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.resources.requests.memory=512Mi \
  --set prometheus.prometheusSpec.resources.limits.memory=1Gi
This gives you:
- Node metrics (CPU, memory, disk)
- Pod metrics
- K8s state metrics
- Pre-built Grafana dashboards
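To sanity-check that metrics are flowing, port-forward any scrape target and look at its /metrics output. The exposition format is plain text, one sample per line; a minimal parser (illustrative only, handles just the simple cases, sample values made up) looks like this:

```python
import re

# Matches simple exposition lines: metric_name{optional="labels"} value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+([0-9.eE+-]+)$')

def parse_metrics(text):
    """Parse Prometheus exposition text into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip HELP/TYPE comment lines
            continue
        m = SAMPLE_RE.match(line)
        if m:
            name, labels, value = m.groups()
            samples[(name, labels or '')] = float(value)
    return samples

# Abbreviated, made-up node-exporter output
text = """
# HELP node_memory_MemAvailable_bytes Memory available.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 2.1e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""
metrics = parse_metrics(text)
```

If you can fetch and parse samples like these, the stack is wired up.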
Step 2: Application Metrics (30 minutes)
# Add to your Python app (FastAPI shown here)
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['endpoint'])

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.time()
    response = await call_next(request)
    LATENCY.labels(endpoint=request.url.path).observe(time.time() - start)
    REQUESTS.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()
    return response

# Expose /metrics on its own port (not 9090 -- that's Prometheus itself)
start_http_server(8000)
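One caveat with labeling by request.url.path: raw paths like /users/12345 create a new label value per ID and blow up metric cardinality. A small normalizer you could apply before calling .labels() looks like this (the regexes are assumptions about your URL scheme, adjust to taste):

```python
import re

def normalize_path(path):
    """Collapse high-cardinality path segments into placeholders so each
    route produces one label value, not one per ID."""
    # UUIDs first, so their digit runs aren't half-eaten by the numeric rule
    path = re.sub(
        r'/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
        '/{uuid}', path)
    # Whole numeric segments (lookahead keeps mixed segments intact)
    path = re.sub(r'/\d+(?=/|$)', '/{id}', path)
    return path
```

Then use `endpoint=normalize_path(request.url.path)` in the middleware above.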
Step 3: Log Aggregation with Loki (1 hour)
# Loki's chart lives in the grafana repo
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi \
  --set promtail.enabled=true
Loki indexes logs only by their labels and stores the log text raw (think "Prometheus for logs"). That tiny index is why it's so much cheaper to run than Elasticsearch.
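The "index the labels, scan the rest" model is easy to picture in code. This toy store (purely illustrative, nothing like Loki's real implementation) shows why a handful of labels keeps the index tiny while queries still filter on full text:

```python
class ToyLogStore:
    """Toy Loki-style store: the index is only the label set per stream;
    log text is kept raw and brute-force scanned at query time."""
    def __init__(self):
        self.streams = {}  # frozenset of (label, value) pairs -> log lines

    def push(self, labels, line):
        self.streams.setdefault(frozenset(labels.items()), []).append(line)

    def query(self, selector, contains=None):
        """Select streams by labels, then grep their text for a substring."""
        out = []
        for labels, lines in self.streams.items():
            if all(item in labels for item in selector.items()):
                out.extend(l for l in lines if contains is None or contains in l)
        return out

store = ToyLogStore()
store.push({'app': 'my-app', 'env': 'prod'}, 'GET /health 200')
store.push({'app': 'my-app', 'env': 'prod'}, 'POST /orders 500 timeout')
store.push({'app': 'other', 'env': 'prod'}, 'GET / 200')

# Roughly the LogQL query: {app="my-app"} |= "500"
errors = store.query({'app': 'my-app'}, contains='500')
```

The index grows with the number of streams (label combinations), not with log volume, which is the whole cost story.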
Step 4: Essential Alerts (1 hour)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: startup-essential-alerts
spec:
  groups:
    - name: essential
      rules:
        # Is the app up?
        - alert: ServiceDown
          expr: up{job="my-app"} == 0
          for: 2m
          labels: { severity: critical }
        # Is it slow?
        - alert: HighLatency
          expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
          for: 5m
          labels: { severity: warning }
        # Is it erroring? (sum() so the 5xx and total series actually divide)
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels: { severity: critical }
        # Is the disk filling?
        - alert: DiskAlmostFull
          expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
          for: 10m
          labels: { severity: warning }
        # Is memory tight?
        - alert: HighMemoryUsage
          expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
          for: 5m
          labels: { severity: warning }
Five alerts. That's all you need to start. Add more as you learn what breaks.
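The HighErrorRate expression is just a ratio of per-second rates over the window. A small sketch of the arithmetic (counter readings are made up) makes the 5% threshold concrete:

```python
def error_ratio(total_then, total_now, errors_then, errors_now, window_seconds):
    """Approximate what the PromQL ratio computes: per-second 5xx rate
    divided by per-second total rate over the same window."""
    total_rate = (total_now - total_then) / window_seconds
    error_rate = (errors_now - errors_then) / window_seconds
    return error_rate / total_rate

# Made-up counter readings taken 5 minutes (300s) apart:
# 1,000 new requests, 80 of them 5xx
ratio = error_ratio(total_then=10_000, total_now=11_000,
                    errors_then=400, errors_now=480, window_seconds=300)
fires = ratio > 0.05
```

Here the window length cancels out: 80 errors over 1,000 requests is 8%, so the alert fires.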
When to Upgrade
Stage            Team Size   Stack                       Monthly Cost
───────────────  ──────────  ──────────────────────────  ────────────
Pre-product      1-5 eng     Prometheus+Grafana+Loki     ~$150
Product-market   5-15 eng    Add Jaeger+PagerDuty        ~$500
Scaling          15-30 eng   Consider managed service    ~$2,000
Growth           30-50 eng   Datadog/New Relic/etc       ~$5,000+
Enterprise       50+ eng     Full platform               $10,000+
Don't skip stages. Each stage's stack is right for that stage.
The Anti-Pattern
Don't build a custom monitoring platform. I've seen three startups try this. All three eventually bought Datadog anyway, having wasted 6+ months of engineering time.
Use off-the-shelf tools. Configure them well. Move on to building your product.
If you want production-grade observability without the enterprise price tag, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com