Samson Tanimawo

Cost-Effective Observability: The 80/20 Stack for Startups

You Don't Need Datadog (Yet)

I see startups spending $5,000/month on Datadog with 8 engineers. That's $625 per engineer per month for monitoring. At that stage, you need that money for product development.

Here's the observability stack that costs under $200/month and covers 80% of what you need.

The Stack

Metrics: Prometheus + Grafana (free, self-hosted on K8s)
Logs: Loki (free, self-hosted) or CloudWatch ($)
Tracing: OpenTelemetry → Jaeger (free, self-hosted)
Alerting: Alertmanager → PagerDuty free tier
Status: Upptime (free, GitHub-based)
─────────────────────────────────────────────
Total: ~$150/month (infrastructure, plus PagerDuty if you outgrow the free tier)

Setup: 4 Hours, Not 4 Weeks

Step 1: Prometheus + Grafana (1 hour)

# Add the chart repo, then install the kube-prometheus-stack chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.resources.requests.memory=512Mi \
  --set prometheus.prometheusSpec.resources.limits.memory=1Gi

This gives you:

  • Node metrics (CPU, memory, disk)
  • Pod metrics
  • K8s state metrics
  • Pre-built Grafana dashboards
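Once the chart is running, you can sanity-check it from a script instead of clicking through Grafana. Here is a minimal sketch using Prometheus's HTTP API (`/api/v1/query`). It assumes Prometheus is reachable at `http://localhost:9090`, for example through a local port-forward; that address and the helper names are my assumptions, not something the chart sets up for you.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here, e.g. through a local port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' sorted label pairs to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = {}
    for series in payload["data"]["result"]:
        # Each sample is [unix_timestamp, "value-as-string"].
        _, value = series["value"]
        key = tuple(sorted(series["metric"].items()))
        results[key] = float(value)
    return results


def query(promql: str) -> dict:
    """Run an instant query against a live Prometheus (needs network access)."""
    with urllib.request.urlopen(build_query_url(promql), timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling `query("up")` should return one entry per scrape target, with a value of 1.0 for targets that are healthy.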

Step 2: Application Metrics (30 minutes)

# Add to your Python app (the middleware signature assumes FastAPI/Starlette)
import time

from fastapi import FastAPI
from prometheus_client import Counter, Histogram, start_http_server

app = FastAPI()  # assumption: reuse your existing app object instead

REQUESTS = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
LATENCY = Histogram('http_request_duration_seconds', 'Request latency', ['endpoint'])

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    # Raw paths that contain IDs create one time series per ID; prefer route templates.
    LATENCY.labels(endpoint=request.url.path).observe(time.perf_counter() - start)
    REQUESTS.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()
    return response

# Serve metrics on a separate port: http://<pod-ip>:9090/metrics
start_http_server(9090)
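One thing to watch with the `endpoint` label: if you pass raw URL paths, every user or order ID becomes its own time series, and Prometheus memory grows with it. A rough sketch of one way to keep that in check is to collapse IDs before labeling. The `normalize_path` helper and its two patterns are hypothetical; adjust them to the ID formats your routes actually use.

```python
import re

# Hypothetical patterns: tune these to the ID formats your routes actually use.
_UUID = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)
_NUMERIC_ID = re.compile(r"/\d+(?=/|$)")


def normalize_path(path: str) -> str:
    """Collapse per-resource IDs so the endpoint label stays low-cardinality."""
    path = _UUID.sub("/:uuid", path)
    path = _NUMERIC_ID.sub("/:id", path)
    return path
```

In the middleware you would then label with `endpoint=normalize_path(request.url.path)` rather than the raw path.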

Step 3: Log Aggregation with Loki (1 hour)

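# Add the Grafana chart repo, then install Loki with Promtail as the log shipper
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi \
  --set promtail.enabled=true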

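Loki indexes only a small set of labels per log stream, not the full text of every line (think Prometheus, but for logs). That keeps storage and compute far cheaper than a full-text index like Elasticsearch.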
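Because Loki does not index log content, it helps to emit logs in a shape you can parse at query time. A minimal sketch, assuming your app logs to stdout and Promtail picks up the container output: write one JSON object per line using only the standard library. The `JsonFormatter` and `get_logger` names and the field names are my own choices, not anything Loki requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger
```

With lines in this shape, LogQL's `json` parser can pull out `level` or `logger` when you query, so you can filter on them without turning them into Loki labels.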

Step 4: Essential Alerts (1 hour)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: startup-essential-alerts
  labels:
    release: monitoring  # by default kube-prometheus-stack only loads rules labeled with its release name
spec:
  groups:
    - name: essential
      rules:
        # Is the app up?
        - alert: ServiceDown
          expr: up{job="my-app"} == 0
          for: 2m
          labels: { severity: critical }

        # Is it slow?
        - alert: HighLatency
          expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
          for: 5m
          labels: { severity: warning }

        # Is it erroring? (sum both sides so 5xx is compared against all requests)
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels: { severity: critical }

        # Is the disk filling?
        - alert: DiskAlmostFull
          expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
          for: 10m
          labels: { severity: warning }

        # Is memory tight? (skip containers without a limit, which report 0)
        - alert: HighMemoryUsage
          expr: container_memory_working_set_bytes / (container_spec_memory_limit_bytes > 0) > 0.9
          for: 5m
          labels: { severity: warning }
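The HighLatency rule leans on `histogram_quantile`, and it helps to know that the p99 it reports is an estimate, not a measured value. Here is a simplified sketch of the idea in Python: find the bucket that contains the target rank, then interpolate linearly inside it. This is my own illustration under simplifying assumptions, not Prometheus's actual implementation, and the bucket counts in the test values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified version of the idea behind PromQL's histogram_quantile():
    locate the bucket holding the target rank, then interpolate linearly.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan
```

The practical takeaway: the estimate can only be as precise as your bucket boundaries. If your alert threshold is 1 second, make sure the histogram has a bucket boundary at or near 1.0, or the p99 will jump between coarse values.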

Five alerts. That's all you need to start. Add more as you learn what breaks.

When to Upgrade

Stage            Team Size   Stack                      Monthly Cost
──────────────   ─────────   ────────────────────────   ────────────
Pre-product      1-5 eng     Prometheus+Grafana+Loki    ~$150
Product-market   5-15 eng    Add Jaeger+PagerDuty       ~$500
Scaling          15-30 eng   Consider managed service   ~$2,000
Growth           30-50 eng   Datadog/New Relic/etc      ~$5,000+
Enterprise       50+ eng     Full platform              $10,000+

Don't skip stages. Each stack is sized for the team and budget you have at that stage, so jumping ahead means paying for capacity you can't use yet.

The Anti-Pattern

Don't build a custom monitoring platform. I've seen three startups try this. All three eventually bought Datadog anyway, having wasted 6+ months of engineering time.

Use off-the-shelf tools. Configure them well. Move on to building your product.

If you want production-grade observability without the enterprise price tag, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
