---
title: "Distributed Tracing on a Budget with OpenTelemetry and Grafana"
published: true
description: "Set up distributed tracing with OpenTelemetry tail-based sampling, Tempo, Loki, and Grafana for under $50/month at 10k RPM."
tags: devops, cloud, architecture, performance
canonical_url: https://blog.mvpfactory.co/distributed-tracing-on-a-budget-with-opentelemetry-and-grafana
---
## What We Are Building
Let me show you a pattern I use in every project that needs production visibility without the Datadog bill. We will wire up a complete observability pipeline — OpenTelemetry Collector with tail-based sampling, Grafana Tempo for traces, Loki for correlated logs, and Grafana dashboards — that keeps storage under **$50/month at 10,000 requests per minute**.
At 10k RPM, a naive trace-everything approach generates roughly 14.4 million traces per day, a volume that blows well past the span ingestion included in Datadog's APM plans (list price around $31 per host per month) and straight into overage billing. A self-hosted Grafana stack brings that down to ~$45/month in storage costs. Here is the minimal setup to get this working.
## Prerequisites
- A running backend service (Node.js, Kotlin/Spring, or any OTel-supported runtime)
- Docker and Docker Compose for running Tempo, Loki, and Grafana (a minimal Compose file is sketched after this list)
- S3-compatible object storage (AWS S3 or MinIO) for trace and log retention
- Basic familiarity with YAML configuration
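If you want a concrete starting point for that Compose file, here is a minimal sketch. The service names, image tags, ports, and config file names (`otel-collector.yaml`, `tempo.yaml`, `loki.yaml`) are assumptions for illustration; swap in whatever your project already uses.

```yaml
# Minimal single-node stack: collector in front, Tempo + Loki behind it,
# Grafana for the UI, MinIO standing in for S3. All names are illustrative.
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otel-collector.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from your services

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki.yaml"]
    volumes:
      - ./loki.yaml:/etc/loki.yaml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

  minio:
    image: minio/minio:latest
    command: ["server", "/data"]
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: supersecret
```

One detail that matters: the `tail_sampling` processor we configure in Step 2 ships in the **contrib** distribution of the Collector, so use the `-contrib` image rather than the core one.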
## Step 1: Auto-Instrumentation With Zero Code Changes
OpenTelemetry's auto-instrumentation libraries cover most frameworks out of the box. Pick your runtime:
```bash
# Node.js -- add to your entrypoint
node --require @opentelemetry/auto-instrumentations-node/register app.js

# Kotlin/Spring -- use the Java agent
java -javaagent:opentelemetry-javaagent.jar -jar your-service.jar
```
The Java agent automatically instruments Spring Web, gRPC, JDBC, Kafka, and HTTP clients. No code changes. Auto-instrumentation covers about 80% of what you need on day one — add manual spans for business-critical paths later.
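Pointing the instrumented service at the collector takes no code either, just the standard OTel SDK environment variables. A sketch for the Node.js case, assuming the Compose setup from the prerequisites and a hypothetical service named `api`:

```yaml
# Illustrative Compose service: standard OTel environment variables route
# all auto-instrumented telemetry to the collector over OTLP/gRPC.
services:
  api:
    build: .
    command: node --require @opentelemetry/auto-instrumentations-node/register app.js
    environment:
      OTEL_SERVICE_NAME: api
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      OTEL_RESOURCE_ATTRIBUTES: deployment.environment=production
```

The same variables drive the Java agent, so the Kotlin/Spring service needs nothing beyond the `-javaagent` flag and this environment.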
## Step 2: The Collector Config That Controls Costs
This is the piece that makes everything affordable. The OpenTelemetry Collector's **tail-based sampling** waits for the complete trace before deciding whether to keep it. Unlike head-based sampling, you keep 100% of error traces and slow requests while aggressively sampling the happy path.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: high-cardinality-filter
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/health", "/ready", "/metrics"]
          enabled_regex_matching: true
          invert_match: true
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
    decision_cache:
      sampled_cache_size: 100000

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/tempo]
```
This keeps every error, every request over 2 seconds, drops health-check noise entirely, and samples only 5% of normal traffic. That reduces stored traces from ~14.4M/day to roughly 720k/day plus all errors and slow requests. Tempo's storage at that volume sits under $30/month on S3-compatible object storage.
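For reference, this is roughly what the Tempo side of that storage looks like. The bucket name, credentials, MinIO endpoint, and 14-day retention below are placeholder assumptions, not required values:

```yaml
# Sketch of Tempo's storage and retention settings for an S3/MinIO backend.
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: minio:9000
      access_key: admin
      secret_key: supersecret
      insecure: true          # plain HTTP to MinIO inside the Compose network
    wal:
      path: /var/tempo/wal

compactor:
  compaction:
    block_retention: 336h     # ~14 days; shorten this to cut storage further
```

Retention is the other half of the cost lever: the sampling policy controls how much goes in, `block_retention` controls how long it stays.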
## Step 3: Trace-to-Log Correlation
Here is the part that will save you hours: this single pattern replaces most of what teams actually use Datadog for. Inject the trace ID into every log line, then configure Grafana to link them.
Make sure every log line your services ship to Loki carries a `traceID` field, either as structured metadata or a label. Then add a derived field on your Loki data source in Grafana:
```plaintext
Name: TraceID
Regex: traceID=(\w+)
Internal link → Target data source: Tempo
Query: ${__value.raw}
```
Clicking any trace ID in your logs now jumps directly to the full distributed trace in Tempo. If you only set up one thing from this tutorial, make it this.
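If you provision Grafana from files rather than the UI, the same derived field can be declared once and checked into the repo. A sketch, assuming a Tempo data source with UID `tempo` and the default ports:

```yaml
# Grafana data source provisioning: Loki with a derived field that turns
# every traceID=<id> token in a log line into a link to the Tempo data source.
apiVersion: 1
datasources:
  - name: Tempo
    uid: tempo
    type: tempo
    url: http://tempo:3200
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "traceID=(\\w+)"
          datasourceUid: tempo
          url: "$${__value.raw}"   # "$$" escapes "$" in provisioning files
```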
## Step 4: The Dashboard That Tells You What Matters
Build a Grafana dashboard with these panels sourced from Tempo's metrics-generator (the generator needs to be switched on first; see the config sketch after this list):
- **R.E.D. metrics** (Rate, Error rate, Duration) from `traces_spanmetrics_latency_bucket`
- **Service map** using Tempo's built-in service graph
- **Slow error traces** via TraceQL: `{status = error} | avg(duration) > 1s`
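None of those panels light up until Tempo's metrics-generator is enabled and has somewhere to remote-write. A sketch of the relevant Tempo settings, assuming a Prometheus-compatible endpoint at `prometheus:9090` (not part of the stack above) and a recent Tempo release; older versions use a flat `metrics_generator_processors` override instead:

```yaml
# Sketch: enable span-metrics and service-graph generation in Tempo and
# remote-write the results to a Prometheus-compatible backend for Grafana.
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write

overrides:
  defaults:
    metrics_generator:
      processors: [span-metrics, service-graphs]
```

In the Tempo data source settings in Grafana, point the service graph section at that Prometheus data source so the service map panel has metrics to draw from.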
## Storage Budget Breakdown
| Component | Storage Backend | Monthly Cost |
|---|---|---|
| Tempo traces | S3/MinIO (~50 GB) | ~$20 |
| Loki logs | S3/MinIO (~80 GB) | ~$25 |
| Grafana | Stateless | $0 |
| OTel Collector | Stateless | $0 |
| **Total** | | **~$45/month** |
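The Loki line assumes chunks and index live in the same object store as the traces. A minimal sketch of that storage section, with bucket name, credentials, and the 14-day retention as illustrative placeholders:

```yaml
# Sketch: single-binary Loki writing chunks and its TSDB index to S3/MinIO.
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    s3: s3://admin:supersecret@minio:9000/loki-logs
    s3forcepathstyle: true

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: aws
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 336h   # ~14 days; enforced by the compactor
```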
## Gotchas
- **Start with tail-based sampling from day one.** Retrofitting sampling policies after you have committed to a vendor is painful. The collector config above immediately cuts trace volume by 90%+ while keeping every trace that actually matters.
- **The docs do not mention this, but** `decision_wait: 10s` means the collector buffers traces in memory. At high throughput, `num_traces: 50000` prevents OOM — tune this to your actual concurrency (a `memory_limiter` backstop is sketched after this list).
- **Instrument first, optimize later.** Auto-instrumentation gives you immediate coverage. Do not spend a week writing manual spans before your pipeline is even running.
- **Set up trace-to-log correlation before dashboards.** A single derived field in Grafana connecting Loki to Tempo replaces the core workflow teams pay thousands per month for.
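On that memory point, the usual backstop is a `memory_limiter` processor ahead of `tail_sampling`. A sketch against the Step 2 config, with placeholder limits you should size to your container:

```yaml
# Sketch: memory_limiter as a backstop in front of tail_sampling in the
# collector pipeline from Step 2. Limits are illustrative, not recommendations.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 400

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/tempo]
```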
## Wrapping Up
You now have a production-grade observability pipeline that costs roughly 3% of what Datadog charges for equivalent visibility. The tail-based sampling keeps your storage lean, the trace-to-log correlation keeps your debugging fast, and the whole stack runs on four stateless components you can drop into any Docker Compose or Kubernetes setup. Ship it, watch the traces roll in, and enjoy keeping that $50/month budget intact.