Stop guessing what's broken in production. Here's a complete, deploy-it-this-week observability stack built on OpenTelemetry and Grafana — the same stack I've deployed for three clients in the last 18 months.
This isn't a toy setup. This is production-grade: traces, metrics, and logs unified under a single pane of glass, with auto-instrumentation for the most common runtimes, alerting that pages on symptoms, not causes, and dashboards your non-SRE teammates can actually read.
What you'll build:
OpenTelemetry Collector (gateway mode) for vendor-agnostic telemetry collection
Grafana Tempo for distributed tracing
Prometheus + Grafana Mimir for metrics at scale
Loki for structured log aggregation
Grafana dashboards with pre-built SLO panels
AlertManager rules tied to error budgets
Prerequisites: Kubernetes 1.25+, Helm 3, basic familiarity with YAML. Estimated time: 3–5 hours end to end.
Why OpenTelemetry? The vendor lock-in argument, settled once and for all
You’ve heard it before: “Just use Datadog.” Then the bill arrives. Or “Use Prometheus alone.” Then you lose traces.
OpenTelemetry (OTel) is the single CNCF standard for generating and exporting telemetry data. Here’s why it wins:
One instrumentation, many backends: Instrument your app once with OTel SDKs. Send to Tempo, Jaeger, Datadog, or New Relic simultaneously.
No vendor lock-in: Your telemetry data remains in your control (S3 for traces, block storage for metrics).
Automatic context propagation: Trace IDs flow seamlessly across services, even across different languages (Java → Python → Node.js).
Future-proof: New backends emerge? Point your OTel Collector there. No code changes.
The bottom line: OTel is the USB-C of observability. Stop writing custom exporters.
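To make "one instrumentation, many backends" concrete, here is a sketch of a Collector pipeline that fans the same spans out to two exporters at once — the Jaeger endpoint is purely illustrative:

```yaml
exporters:
  otlp/tempo:
    endpoint: "tempo-distributor:4317"
    tls:
      insecure: true
  otlp/jaeger:                      # hypothetical second backend
    endpoint: "jaeger-collector:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, otlp/jaeger]  # same spans, both backends
```

Swapping or adding a backend is a values-file change and a `helm upgrade`; the application never redeploys.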
Architecture overview: Collector, Backends, Visualization
Here’s what you’re deploying:
```text
[Your App] --(OTLP)--> [OTel Collector (Gateway)] --+--> [Tempo]  (traces)
                                                    +--> [Mimir]  (metrics)
                                                    +--> [Loki]   (logs)
                                                              |
                                                         [Grafana]  (visualization)
                                                              |
                                                      [AlertManager]  (paging)
```
OTel Collector (Gateway mode): Receives OTLP from all services. Validates, batches, and routes telemetry. Single ingress point.
Tempo: Object-storage-backed tracing. Cheap, scalable, no indexing costs.
Mimir: Horizontally scalable Prometheus-compatible metrics store.
Loki: Log aggregation with low-cost object storage.
Grafana: Unified UI with Explore, dashboards, and alerting.
AlertManager: Deduplicates, groups, and routes alerts to PagerDuty/Slack.
Storage requirements (minimal): 50GB for Loki, 100GB for Tempo (can use S3/GCS/MinIO), 50GB for Mimir.
Installing the OTel Collector (gateway mode Helm values)
Create `otel-collector-values.yaml`:

```yaml
mode: deployment  # gateway mode (as opposed to daemonset for agent mode)
config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  processors:
    batch:
      timeout: 1s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 512
    attributes:
      actions:
        - key: environment
          value: production
          action: upsert
  exporters:
    otlp/tempo:
      endpoint: "tempo-distributor:4317"
      tls:
        insecure: true
    prometheusremotewrite/mimir:
      endpoint: "http://mimir-distributor:8080/api/v1/push"
    loki:
      endpoint: "http://loki-gateway:3100/loki/api/v1/push"
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch, attributes]
        exporters: [otlp/tempo]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [prometheusremotewrite/mimir]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch]
        exporters: [loki]
```
Deploy:

```shell
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  -f otel-collector-values.yaml
```
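Before instrumenting real services, it's worth confirming the gateway accepts OTLP at all. A minimal sketch in Python (stdlib only) that builds an OTLP/JSON trace payload and POSTs it to the Collector's HTTP receiver — the Service hostname assumes the Helm release above; from outside the cluster, port-forward and use `localhost:4318` instead:

```python
import json
import urllib.request

def make_test_span_payload(service_name: str) -> dict:
    """Build a minimal OTLP/JSON trace payload containing one span."""
    return {
        "resourceSpans": [{
            "resource": {"attributes": [{
                "key": "service.name",
                "value": {"stringValue": service_name},
            }]},
            "scopeSpans": [{"spans": [{
                "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
                "spanId": "051581bf3cb55c13",
                "name": "smoke-test",
                "kind": 1,  # SPAN_KIND_INTERNAL
                "startTimeUnixNano": "1700000000000000000",
                "endTimeUnixNano": "1700000001000000000",
            }]}],
        }]
    }

def send_test_span(endpoint: str = "http://otel-collector:4318/v1/traces") -> int:
    """POST the span to the Collector; HTTP 200 means the pipeline accepted it."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(make_test_span_payload("smoke-test")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Call send_test_span() from a pod inside the cluster, or via kubectl port-forward.
```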
Auto-instrumentation: Java, Python, Node.js, Go
No code changes for traces/metrics/logs. Use OTel's auto-instrumentation agents.
Java (Spring Boot, any JVM app)
```dockerfile
ENV JAVA_TOOL_OPTIONS="-javaagent:/otel/opentelemetry-javaagent.jar"
ENV OTEL_SERVICE_NAME=payment-service
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```
Python (Django, Flask, FastAPI)
```shell
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument \
  --service_name checkout-service \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py
```
Node.js (Express, NestJS)
```shell
npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node
env OTEL_SERVICE_NAME=api-gateway \
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
    node --require @opentelemetry/auto-instrumentations-node/register server.js
```
Go (manual instrumentation required, but minimal)
```go
import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
)

func initTracer(ctx context.Context) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure())
	if err != nil {
		log.Fatalf("creating OTLP trace exporter: %v", err)
	}
	_ = exporter // register with an sdktrace.TracerProvider — standard setup (~5 lines)
}
```
Verify: tail the Collector's logs (`kubectl logs deploy/otel-collector`) and confirm spans are being received and exported without errors.
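If the logs are quiet, remember the Collector only surfaces telemetry an exporter emits. A temporary `debug` exporter (a sketch — remove it after verification) prints every received span to stdout:

```yaml
exporters:
  debug:
    verbosity: detailed   # print full span/metric/log contents to stdout
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, debug]   # fan out: Tempo and stdout
```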
Deploying Tempo for distributed tracing
Tempo is designed for cost-effective tracing. It stores traces in object storage (S3/MinIO) and indexes only by trace ID.
`tempo-values.yaml`:

```yaml
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces
        endpoint: minio.minio:9000
        access_key: "minioadmin"
        secret_key: "minioadmin"
        insecure: true
      pool:
        max_workers: 100
        queue_depth: 10000
  overrides:
    defaults:
      ingestion:
        rate_limit_bytes: 15000000  # 15 MB/s
        burst_size_bytes: 20000000
  distributor:
    config:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: "0.0.0.0:4317"
```
Deploy:

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install tempo grafana/tempo -f tempo-values.yaml
```
Query Tempo from Grafana: Add data source → Tempo → URL: `http://tempo-query-frontend:3200` (Tempo's HTTP port; 16686 is the legacy Jaeger-compatibility port of the old tempo-query sidecar).
Prometheus + Mimir for long-term metrics storage
Mimir replaces single-instance Prometheus. It provides horizontal scaling, replication, and long-term retention.
`mimir-values.yaml`:

```yaml
mimir:
  structuredConfig:
    blocks_storage:
      backend: s3
      s3:
        endpoint: minio.minio:9000
        bucket_name: mimir-blocks
        access_key_id: "minioadmin"
        secret_access_key: "minioadmin"
        insecure: true
    ingester:
      ring:
        replication_factor: 3  # for HA
    ruler:
      rule_path: /data/rules
      alertmanager_url: http://alertmanager:9093
ingester:
  replicas: 3
distributor:
  replicas: 2
querier:
  replicas: 2
```
Deploy:

```shell
helm upgrade --install mimir grafana/mimir -f mimir-values.yaml
```
Migrate existing Prometheus data
```shell
# Backfill blocks for recording-rule series from a running Prometheus
# (the time range here is illustrative)
promtool tsdb create-blocks-from rules \
  --start 2024-01-01T00:00:00Z \
  --end 2024-02-01T00:00:00Z \
  --url http://prometheus:9090 \
  recording-rules.yaml
```
Then point Prometheus remote write to http://mimir-distributor:8080/api/v1/push.
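That remote-write hookup is a small stanza in `prometheus.yml` — the queue settings here are illustrative knobs to tune, not requirements:

```yaml
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    queue_config:
      max_samples_per_send: 1000
      max_shards: 10
```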
Loki for log aggregation with structured querying
Loki is like Prometheus for logs. It indexes only labels, not full text, making it cheap at scale.
`loki-values.yaml`:

```yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: minio.minio:9000
      bucketnames: loki-chunks
      access_key_id: "minioadmin"
      secret_access_key: "minioadmin"
      s3forcepathstyle: true
      insecure: true
  schemaConfig:
    configs:
      - from: 2024-01-01
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20
    max_global_streams_per_user: 10000
  chunk_store_config:
    max_look_back_period: 672h  # 28 days
```
Deploy:

```shell
helm upgrade --install loki grafana/loki -f loki-values.yaml
```
Query example (LogQL):

```logql
{namespace="production", app="payment-service"} |= "error"
  | json
  | latency_ms > 500
  | line_format "{{.trace_id}} - {{.message}}"
```
Grafana: Connecting all three data sources
`grafana-values.yaml`:

```yaml
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus-Mimir
        type: prometheus
        url: http://mimir-query-frontend:8080/prometheus
        access: proxy
        isDefault: true
      - name: Tempo
        type: tempo
        url: http://tempo-query-frontend:3200
        access: proxy
        jsonData:
          tracesToLogs:
            datasourceUid: 'loki'
            tags: ['service.name', 'pod']
          serviceMap:
            enabled: true
      - name: Loki
        type: loki
        url: http://loki-gateway:3100
        access: proxy
        jsonData:
          derivedFields:
            - name: trace_id
              matcherRegex: 'trace_id=(\w+)'
              url: '$${__value.raw}'
              datasourceUid: 'tempo'
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: 'slo'
        orgId: 1
        folder: 'SLO Dashboards'
        type: file
        options:
          path: /var/lib/grafana/dashboards
```
Deploy:

```shell
helm upgrade --install grafana grafana/grafana -f grafana-values.yaml
```
Test correlation: In Loki, find a log with trace_id=abc123. Click it → jumps to Tempo trace. In Tempo, see affected service → jumps to Mimir metrics for that service.
Building your first SLO dashboard (template included)
Save as `slo-dashboard.json` and mount into Grafana:

```json
{
  "title": "SLO Dashboard - Payment Service",
  "panels": [
    {
      "title": "Availability (30d SLI)",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status!~'5..'}[$__range])) / sum(rate(http_requests_total[$__range]))",
        "legendFormat": "Availability SLI"
      }],
      "thresholds": [
        {"color": "red", "op": "lt", "valueType": "absolute", "value": 0.995},
        {"color": "yellow", "op": "lt", "valueType": "absolute", "value": 0.999},
        {"color": "green", "op": "gte", "valueType": "absolute", "value": 0.999}
      ]
    },
    {
      "title": "Error Budget Remaining (30d)",
      "targets": [{
        "expr": "(sum(rate(http_requests_total{status!~'5..'}[30d])) / sum(rate(http_requests_total[30d])) - 0.999) / (1 - 0.999)",
        "legendFormat": "Budget remaining"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "color": {"mode": "thresholds"},
          "thresholds": [
            {"color": "red", "op": "lt", "value": 0.7},
            {"color": "yellow", "op": "lt", "value": 0.9},
            {"color": "green", "op": "gte", "value": 0.9}
          ]
        }
      }
    },
    {
      "title": "Latency P99 (30d SLI)",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__range])) by (le))",
        "legendFormat": "P99 latency"
      }]
    }
  ]
}
```
SLO math explained
Availability target: 99.9% → error budget = 0.1% of requests can fail.
Budget remaining: (actual_availability - target) / (1 - target) → 1.0 means on track, 0 means exhausted.
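The formula is worth sanity-checking with numbers. A minimal sketch, plain arithmetic with no dependencies:

```python
def error_budget_remaining(actual_availability: float, target: float) -> float:
    """Fraction of the error budget left: 1.0 = untouched, 0.0 = exhausted."""
    return (actual_availability - target) / (1 - target)

# 99.9% target with 99.95% measured availability: half the budget is gone.
print(round(error_budget_remaining(0.9995, target=0.999), 3))  # 0.5

# Exactly at target: budget exhausted.
print(round(error_budget_remaining(0.999, target=0.999), 3))   # 0.0
```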
AlertManager: Alerting on symptoms, not causes
Bad alert: "CPU on pod payment-7d8f9 is 92%" (cause)
Good alert: "Payment service error budget exhausted" (symptom)
`alertmanager-config.yaml`:

```yaml
route:
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'pagerduty-critical'
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: false
    - match:
        severity: warning
      receiver: slack-warnings
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: <your-pd-key>
        severity: critical
  - name: 'slack-warnings'
    slack_configs:
      - api_url: <webhook>
        channel: '#alerts-warning'
```
Prometheus alerting rule example (`slo-alerts.yaml`):

```yaml
groups:
  - name: slo
    rules:
      - alert: ErrorBudgetExhausted
        expr: |
          (
            sum by (service) (rate(http_requests_total{status!~"5.."}[30d]))
              / sum by (service) (rate(http_requests_total[30d]))
            - 0.999
          ) / (1 - 0.999) < 0.2
        for: 5m
        labels:
          severity: critical
          service: "{{ $labels.service }}"
        annotations:
          summary: "Error budget for {{ $labels.service }} is below 20%"
          description: "Remaining budget: {{ $value | humanizePercentage }}"
```
Deploy:

```shell
kubectl create secret generic alertmanager-config \
  --from-file=alertmanager.yaml=alertmanager-config.yaml
helm upgrade --install prometheus prometheus-community/prometheus \
  --set alertmanager.enabled=true \
  --set alertmanager.configFromSecret=alertmanager-config
```
The 3 dashboards every on-call engineer needs
Stop building 50-panel dashboards. Start with these three.
Dashboard 1: Service Health (RED method, plus saturation)
Rate (requests per second) per endpoint
Errors (5xx rate, grouped by status code)
Duration (P50, P95, P99 latency)
Saturation (CPU/memory per pod, queue depth)
PromQL snippets
```promql
# Rate
sum(rate(http_requests_total[1m])) by (service, endpoint)

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))

# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```
Dashboard 2: Trace Explorer
Top 10 slowest traces in last hour
Trace heatmap (duration vs. timestamp)
Service dependency graph (from Tempo service graph)
High-error traces panel (filter by status.error=true)
Dashboard 3: The "Burndown" Chart
Error budget remaining (daily trend line)
SLO burn rate (1h, 6h, 24h windows)
Multi-burn alert status (green/yellow/red)
Top offending services by error budget consumption
Why this works: On-call opens Dashboard 1 → sees elevated latency → clicks a trace in Dashboard 2 → finds slow database query → checks Dashboard 3 to decide if paging SREs is urgent.
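The multi-window burn-rate logic behind Dashboard 3 fits in a few lines. A sketch, with hypothetical function names; the 14.4× threshold follows the Google SRE Workbook's fast-burn paging condition:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are burning."""
    return error_ratio / (1 - slo_target)

def should_page(err_1h: float, err_5m: float, slo_target: float = 0.999) -> bool:
    """Fast-burn page: both the long and short window must exceed 14.4x burn.

    Burning at 14.4x for one hour consumes ~2% of a 30-day error budget;
    the short window stops paging once the incident has actually recovered.
    """
    threshold = 14.4
    return (burn_rate(err_1h, slo_target) > threshold
            and burn_rate(err_5m, slo_target) > threshold)

print(should_page(err_1h=0.02, err_5m=0.03))   # True: sustained 2% errors at a 99.9% SLO
print(should_page(err_1h=0.005, err_5m=0.03))  # False: long window below threshold
```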
Final checklist for production readiness
Before you sleep soundly:
Ingestion testing: curl a test span/metric/log through the Collector.
Retention: Set Mimir 30d, Tempo 14d, Loki 30d (adjust to compliance).
Auth: Add Grafana OAuth (Google/GitHub) and basic auth for Mimir/Loki ingesters.
Backups: Object storage (MinIO/S3) should have versioning enabled.
Alert testing: Silence a service, verify PagerDuty gets the page.
Runbook: Link each alert to a Confluence doc (e.g., "ErrorBudgetExhausted → https://wiki/runbooks/slo").
What’s next? Add OpenTelemetry for your database (PostgreSQL, Redis, MongoDB) using OTel collector receivers. Or add synthetic monitoring with Blackbox exporter.
You now have the same stack that cost my clients $0/month (excluding storage) instead of $15k/month for Datadog. Ship it.