Stop guessing what's broken in production. Here's a complete, deploy-it-this-week observability stack built on OpenTelemetry and Grafana — the same stack I've deployed for three clients in the last 18 months.
This isn't a toy setup. This is production-grade: traces, metrics, and logs unified under a single pane of glass, with auto-instrumentation for the most common runtimes, alerting that pages on symptoms, not causes, and dashboards your non-SRE teammates can actually read.
What you'll build:
OpenTelemetry Collector (gateway mode) for vendor-agnostic telemetry collection
Grafana Tempo for distributed tracing
Prometheus + Grafana Mimir for metrics at scale
Loki for structured log aggregation
Grafana dashboards with pre-built SLO panels
AlertManager rules tied to error budgets
Prerequisites: Kubernetes 1.25+, Helm 3, basic familiarity with YAML. Estimated time: 3–5 hours end to end.
Why OpenTelemetry? The vendor lock-in argument, settled once and for all
You’ve heard it before: “Just use Datadog.” Then the bill arrives. Or “Use Prometheus alone.” Then you lose traces.
OpenTelemetry (OTel) is the single CNCF standard for generating and exporting telemetry data. Here’s why it wins:
One instrumentation, many backends: Instrument your app once with OTel SDKs. Send to Tempo, Jaeger, Datadog, or New Relic simultaneously.
No vendor lock-in: Your telemetry data remains in your control (S3 for traces, block storage for metrics).
Automatic context propagation: Trace IDs flow seamlessly across services, even across different languages (Java → Python → Node.js).
Future-proof: New backends emerge? Point your OTel Collector there. No code changes.
The bottom line: OTel is the USB-C of observability. Stop writing custom exporters.
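To make "one instrumentation, many backends" concrete, here is a sketch of a Collector pipeline that fans the same spans out to two exporters at once — the Jaeger endpoint is purely illustrative:

```yaml
exporters:
  otlp/tempo:
    endpoint: "tempo-distributor:4317"
    tls:
      insecure: true
  otlp/jaeger:                      # hypothetical second backend
    endpoint: "jaeger-collector:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/jaeger]  # same spans, both backends
```

Swapping or adding a backend is a values-file change and a `helm upgrade`; the application never redeploys.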
Architecture overview: Collector, Backends, Visualization
Here’s what you’re deploying:
```text
[Your App] --(OTLP)--> [OTel Collector (Gateway)] --+--> [Tempo]  (traces)
                                                    +--> [Mimir]  (metrics)
                                                    +--> [Loki]   (logs)
                                                              |
                                                         [Grafana]  (visualization)
                                                              |
                                                      [AlertManager]  (paging)
```
OTel Collector (Gateway mode): Receives OTLP from all services. Validates, batches, and routes telemetry. Single ingress point.
Tempo: Object-storage-backed tracing. Cheap, scalable, no indexing costs.
Mimir: Horizontally scalable Prometheus-compatible metrics store.
Loki: Log aggregation with low-cost object storage.
Grafana: Unified UI with Explore, dashboards, and alerting.
AlertManager: Deduplicates, groups, and routes alerts to PagerDuty/Slack.
Storage requirements (minimal): 50GB for Loki, 100GB for Tempo (can use S3/GCS/MinIO), 50GB for Mimir.
Installing the OTel Collector (gateway mode Helm values)
Create `otel-collector-values.yaml`:

```yaml
mode: deployment  # gateway mode (as opposed to daemonset for agent mode)
config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  processors:
    batch:
      timeout: 1s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 512
    attributes:
      actions:
        - key: environment
          value: production
          action: upsert
  exporters:
    otlp/tempo:
      endpoint: "tempo-distributor:4317"
      tls:
        insecure: true
    prometheusremotewrite/mimir:
      endpoint: "http://mimir-distributor:8080/api/v1/push"
    loki:
      endpoint: "http://loki-gateway:3100/loki/api/v1/push"
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch, attributes]
        exporters: [otlp/tempo]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [prometheusremotewrite/mimir]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loki]
```
Deploy:

```shell
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  -f otel-collector-values.yaml
```
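Before instrumenting real services, it's worth confirming the gateway accepts OTLP at all. A minimal sketch in Python (stdlib only) that builds an OTLP/JSON trace payload and POSTs it to the Collector's HTTP receiver — the Service hostname assumes the Helm release above; from outside the cluster, port-forward and use `localhost:4318` instead:

```python
import json
import urllib.request

def make_test_span_payload(service_name: str) -> dict:
    """Build a minimal OTLP/JSON trace payload containing one span."""
    return {
        "resourceSpans": [{
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": service_name},
            }]},
            "scopeSpans": [{"spans": [{
                "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
                "spanId": "051581bf3cb55c13",
                "name": "smoke-test",
                "kind": 1,  # SPAN_KIND_INTERNAL
                "startTimeUnixNano": "1700000000000000000",
                "endTimeUnixNano": "1700000001000000000",
            }]}],
        }]
    }

def send_test_span(endpoint: str = "http://otel-collector:4318/v1/traces") -> int:
    """POST the span to the Collector; HTTP 200 means the pipeline accepted it."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(make_test_span_payload("smoke-test")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Call send_test_span() from a pod inside the cluster, or via kubectl port-forward.
```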
Auto-instrumentation: Java, Python, Node.js, Go
No code changes for traces/metrics/logs. Use OTel's auto-instrumentation agents.
Java (Spring Boot, any JVM app)
```dockerfile
ENV JAVA_TOOL_OPTIONS="-javaagent:/otel/opentelemetry-javaagent.jar"
ENV OTEL_SERVICE_NAME=payment-service
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```
Python (Django, Flask, FastAPI)
```shell
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument \
  --service_name checkout-service \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py
```
Node.js (Express, NestJS)
```shell
npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node
env OTEL_SERVICE_NAME=api-gateway \
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
    node --require @opentelemetry/auto-instrumentations-node/register server.js
```
Go (manual instrumentation required, but minimal)
```go
import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
)

func initTracer(ctx context.Context) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure())
	if err != nil {
		log.Fatalf("creating OTLP trace exporter: %v", err)
	}
	_ = exporter // register with an sdktrace.TracerProvider — standard setup (~5 lines)
}
```
Verify: tail the Collector's logs (`kubectl logs deploy/otel-collector`) and confirm spans are being received and exported without errors.
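If the logs are quiet, remember the Collector only surfaces telemetry an exporter emits. A temporary `debug` exporter (a sketch — remove it after verification) prints every received span to stdout:

```yaml
exporters:
  debug:
    verbosity: detailed   # print full span/metric/log contents to stdout
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, debug]   # fan out: Tempo and stdout
```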
Deploying Tempo for distributed tracing
Tempo is designed for cost-effective tracing. It stores traces in object storage (S3/MinIO) and indexes only by trace ID.
`tempo-values.yaml`:

```yaml
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces
        endpoint: minio.minio:9000
        access_key: "minioadmin"
        secret_key: "minioadmin"
        insecure: true
      pool:
        max_workers: 100
        queue_depth: 10000
  overrides:
    defaults:
      ingestion:
        rate_limit_bytes: 15000000  # 15 MB/s
        burst_size_bytes: 20000000
  distributor:
    config:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: "0.0.0.0:4317"
```
Deploy:

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install tempo grafana/tempo -f tempo-values.yaml
```
Query Tempo from Grafana: Add data source → Tempo → URL: `http://tempo-query-frontend:3200` (Tempo's HTTP port; 16686 is the legacy Jaeger-compatibility port of the old tempo-query sidecar).
Prometheus + Mimir for long-term metrics storage
Mimir replaces single-instance Prometheus. It provides horizontal scaling, replication, and long-term retention.
`mimir-values.yaml`:

```yaml
mimir:
  structuredConfig:
    blocks_storage:
      backend: s3
      s3:
        endpoint: minio.minio:9000
        bucket_name: mimir-blocks
        access_key_id: "minioadmin"
        secret_access_key: "minioadmin"
        insecure: true
    ingester:
      ring:
        replication_factor: 3  # for HA
    ruler:
      rule_path: /data/rules
      alertmanager_url: http://alertmanager:9093
ingester:
  replicas: 3
distributor:
  replicas: 2
querier:
  replicas: 2
```
Deploy:

```shell
helm upgrade --install mimir grafana/mimir -f mimir-values.yaml
```
Migrate existing Prometheus data
```shell
# Backfill blocks for recording-rule series from a running Prometheus
# (the time range here is illustrative)
promtool tsdb create-blocks-from rules \
  --start 2024-01-01T00:00:00Z \
  --end 2024-02-01T00:00:00Z \
  --url http://prometheus:9090 \
  recording-rules.yaml
```
Then point Prometheus remote write to http://mimir-distributor:8080/api/v1/push.
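That remote-write hookup is a small stanza in `prometheus.yml` — the queue settings here are illustrative knobs to tune, not requirements:

```yaml
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    queue_config:
      max_samples_per_send: 1000
      max_shards: 10
```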
Loki for log aggregation with structured querying
Loki is like Prometheus for logs. It indexes only labels, not full text, making it cheap at scale.
`loki-values.yaml`:

```yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: minio.minio:9000
      bucketnames: loki-chunks
      access_key_id: "minioadmin"
      secret_access_key: "minioadmin"
      s3forcepathstyle: true
      insecure: true
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20
    max_global_streams_per_user: 10000
  chunk_store_config:
    max_look_back_period: 672h  # 28 days
```
Deploy:

```shell
helm upgrade --install loki grafana/loki -f loki-values.yaml
```
Query example (LogQL):

```logql
{namespace="production", app="payment-service"} |= "error"
  | json
  | latency_ms > 500
  | line_format "{{.trace_id}} - {{.message}}"
```
Grafana: Connecting all three data sources
`grafana-values.yaml`:

```yaml
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus-Mimir
        type: prometheus
        url: http://mimir-query-frontend:8080/prometheus
        access: proxy
        isDefault: true
      - name: Tempo
        type: tempo
        url: http://tempo-query-frontend:3200
        access: proxy
        jsonData:
          tracesToLogs:
            datasourceUid: 'loki'
            tags: ['service.name', 'pod']
          serviceMap:
            enabled: true
      - name: Loki
        type: loki
        url: http://loki-gateway:3100
        access: proxy
        jsonData:
          derivedFields:
            - name: trace_id
              matcherRegex: 'trace_id=(\w+)'
              url: '$${__value.raw}'
              datasourceUid: 'tempo'
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: 'slo'
        orgId: 1
        folder: 'SLO Dashboards'
        type: file
        options:
          path: /var/lib/grafana/dashboards
```
Deploy:

```shell
helm upgrade --install grafana grafana/grafana -f grafana-values.yaml
```
Test correlation: In Loki, find a log with trace_id=abc123. Click it → jumps to Tempo trace. In Tempo, see affected service → jumps to Mimir metrics for that service.
Building your first SLO dashboard (template included)
Save as `slo-dashboard.json` and mount into Grafana:

```json
{
  "title": "SLO Dashboard - Payment Service",
  "panels": [
    {
      "title": "Availability (30d SLI)",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status!~'5..'}[$__range])) / sum(rate(http_requests_total[$__range]))",
        "legendFormat": "Availability SLI"
      }],
      "thresholds": [
        {"color": "red", "op": "lt", "valueType": "absolute", "value": 0.995},
        {"color": "yellow", "op": "lt", "valueType": "absolute", "value": 0.999},
        {"color": "green", "op": "gte", "valueType": "absolute", "value": 0.999}
      ]
    },
    {
      "title": "Error Budget Remaining (30d)",
      "targets": [{
        "expr": "(sum(rate(http_requests_total{status!~'5..'}[30d])) / sum(rate(http_requests_total[30d])) - 0.999) / (1 - 0.999)",
        "legendFormat": "Budget remaining"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "color": {"mode": "thresholds"},
          "thresholds": [
            {"color": "red", "op": "lt", "value": 0.7},
            {"color": "yellow", "op": "lt", "value": 0.9},
            {"color": "green", "op": "gte", "value": 0.9}
          ]
        }
      }
    },
    {
      "title": "Latency P99 (30d SLI)",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__range])) by (le))",
        "legendFormat": "P99 latency"
      }]
    }
  ]
}
```
SLO math explained
Availability target: 99.9% → error budget = 0.1% of requests can fail.
Budget remaining: (actual_availability - target) / (1 - target) → 1.0 means on track, 0 means exhausted.
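The formula is worth sanity-checking with numbers. A minimal sketch, plain arithmetic with no dependencies:

```python
def error_budget_remaining(actual_availability: float, target: float) -> float:
    """Fraction of the error budget left: 1.0 = untouched, 0.0 = exhausted."""
    return (actual_availability - target) / (1 - target)

# 99.9% target with 99.95% measured availability: half the budget is gone.
print(round(error_budget_remaining(0.9995, target=0.999), 3))  # 0.5

# Exactly at target: budget exhausted.
print(round(error_budget_remaining(0.999, target=0.999), 3))   # 0.0
```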
AlertManager: Alerting on symptoms, not causes
Bad alert: "CPU on pod payment-7d8f9 is 92%" (cause)
Good alert: "Payment service error budget exhausted" (symptom)
`alertmanager-config.yaml`:

```yaml
route:
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'pagerduty-critical'
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: false
    - match:
        severity: warning
      receiver: slack-warnings
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: <your-pd-key>
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - api_url: <webhook>
        channel: '#alerts-warning'
```
Prometheus alerting rule example (`slo-alerts.yaml`):

```yaml
groups:
  - name: slo
    rules:
      - alert: ErrorBudgetExhausted
        expr: |
          (
            sum by (service) (rate(http_requests_total{status!~"5.."}[30d]))
              / sum by (service) (rate(http_requests_total[30d]))
            - 0.999
          ) / (1 - 0.999) < 0.2
        for: 5m
        labels:
          severity: critical
          service: "{{ $labels.service }}"
        annotations:
          summary: "Error budget for {{ $labels.service }} is below 20%"
          description: "Remaining budget: {{ $value | humanizePercentage }}"
```
Deploy:

```shell
kubectl create secret generic alertmanager-config \
  --from-file=alertmanager.yaml=alertmanager-config.yaml
helm upgrade --install prometheus prometheus-community/prometheus \
  --set alertmanager.enabled=true \
  --set alertmanager.configFromSecret=alertmanager-config
```
The 3 dashboards every on-call engineer needs
Stop building 50-panel dashboards. Start with these three.
Dashboard 1: Service Health (RED method, plus saturation)
Rate (requests per second) per endpoint
Errors (5xx rate, grouped by status code)
Duration (P50, P95, P99 latency)
Saturation (CPU/memory per pod, queue depth)
PromQL snippets
```promql
# Rate
sum(rate(http_requests_total[1m])) by (service, endpoint)

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))

# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```
Dashboard 2: Trace Explorer
Top 10 slowest traces in last hour
Trace heatmap (duration vs. timestamp)
Service dependency graph (from Tempo service graph)
High-error traces panel (filter by status.error=true)
Dashboard 3: The "Burndown" Chart
Error budget remaining (daily trend line)
SLO burn rate (1h, 6h, 24h windows)
Multi-burn alert status (green/yellow/red)
Top offending services by error budget consumption
Why this works: On-call opens Dashboard 1 → sees elevated latency → clicks a trace in Dashboard 2 → finds slow database query → checks Dashboard 3 to decide if paging SREs is urgent.
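The multi-window burn-rate logic behind Dashboard 3 fits in a few lines. A sketch, with hypothetical function names; the 14.4× threshold follows the Google SRE Workbook's fast-burn paging condition:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are burning."""
    return error_ratio / (1 - slo_target)

def should_page(err_1h: float, err_5m: float, slo_target: float = 0.999) -> bool:
    """Fast-burn page: both the long and short window must exceed 14.4x burn.

    Burning at 14.4x for one hour consumes ~2% of a 30-day error budget;
    the short window stops paging once the incident has actually recovered.
    """
    threshold = 14.4
    return (burn_rate(err_1h, slo_target) > threshold
            and burn_rate(err_5m, slo_target) > threshold)

print(should_page(err_1h=0.02, err_5m=0.03))   # True: sustained 2% errors at a 99.9% SLO
print(should_page(err_1h=0.005, err_5m=0.03))  # False: long window below threshold
```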
Final checklist for production readiness
Before you sleep soundly:
Ingestion testing: curl a test span/metric/log through the Collector.
Retention: Set Mimir 30d, Tempo 14d, Loki 30d (adjust to compliance).
Auth: Add Grafana OAuth (Google/GitHub) and basic auth for Mimir/Loki ingesters.
Backups: Object storage (MinIO/S3) should have versioning enabled.
Alert testing: Silence a service, verify PagerDuty gets the page.
Runbook: Link each alert to a Confluence doc (e.g., "ErrorBudgetExhausted → https://wiki/runbooks/slo").
What’s next? Add OpenTelemetry for your database (PostgreSQL, Redis, MongoDB) using OTel collector receivers. Or add synthetic monitoring with Blackbox exporter.
You now have the same stack that cost my clients $0/month (excluding storage) instead of $15k/month for Datadog. Ship it.