How a Missing Trace Led Me to Build a Local Observability Stack

Last year, our team spent three days debugging why traces from a critical payment service weren't appearing in DataDog. This service processed ~15,000 orders daily—roughly $200K in transactions. The service was running, logs showed successful transactions, but the APM dashboard was empty. No traces. No spans. Nothing.

For three days, we couldn't answer basic questions: Was the payment gateway slow? Were retries happening? Where was latency hiding? Without traces, we were debugging blind—adding print statements to production code, tailing logs, guessing at latency sources.

The breakthrough came when someone asked: "Can we just run the same setup locally and see if traces actually leave the application?"

We couldn't. DataDog requires cloud connectivity. The local agent still needs an API key and phones home. There was no way to intercept and visualize traces without a DataDog account—and our staging key had rate limits that made local testing impractical.

So I built a stack that accepts ddtrace telemetry locally and routes it to open-source backends. Within an hour of running it, we found the bug. A config change from two sprints back had introduced this filter rule:

# The bug - intended to filter health checks, matched EVERYTHING
filter/health:
  traces:
    span:
      - 'attributes["http.target"] =~ "/.*"'  # Regex matched all paths!

Instead of filtering only /health endpoints, the regex /.* matched every single span. A one-character fix—changing =~ to == and using exact paths—and traces appeared in production within minutes.

Why did it take three days to find a one-character bug? Because we had no visibility into what the collector was actually doing. The config looked reasonable at a glance. The collector reported healthy. Logs showed "traces exported successfully"—but those were other services' traces passing through. Without a way to isolate our service's telemetry and watch it flow through the pipeline, we were guessing. The local stack gave us that visibility in minutes.

This repository is a cleaned-up, documented version of that debugging tool. It's now used across three teams: the original payments team, our logistics service team (who had a similar "missing traces" panic), and the platform team who adopted it for testing collector configs before production rollouts.


What This Stack Does

Point your ddtrace-instrumented application at localhost:8126. The OpenTelemetry Collector receives DataDog-format traces, converts them to OTLP, and exports to Grafana Tempo. Your application thinks it's talking to a DataDog agent.

Architecture diagram

No code changes required. Set DD_AGENT_HOST=localhost and your existing instrumentation works.
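
If you'd rather wire this up in code than in the shell, here is a minimal sketch; it assumes the variables are set before ddtrace is imported and uses a hypothetical service name:

import os

# Assumption: these must be set before ddtrace is imported, since ddtrace reads
# its configuration from the environment at import time.
os.environ.setdefault("DD_AGENT_HOST", "localhost")
os.environ.setdefault("DD_TRACE_AGENT_PORT", "8126")   # the collector's DataDog receiver
os.environ.setdefault("DD_SERVICE", "payments-local")  # hypothetical service name

import ddtrace.auto  # noqa: E402  # enables auto-instrumentation, like ddtrace-run

In practice the shell-level DD_AGENT_HOST from the Quick Start below is all you need; this is just the programmatic equivalent.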


When To Use This (And When Not To)

This stack is valuable when:

  • You need to verify ddtrace instrumentation works before deploying
  • You're debugging why traces aren't appearing in production DataDog
  • You want local trace visualization without DataDog licensing costs
  • You're testing collector configurations (sampling, filtering, batching) before production rollout

Use something else when:

  • You're starting a new project—use OpenTelemetry native instrumentation for better portability
  • You need DataDog-specific features (APM service maps, profiling, Real User Monitoring)
  • You're processing sustained high throughput (see Performance section below)

Alternatives I evaluated:

  • Jaeger All-in-One: Simpler setup, but no native log correlation. You'd need a separate logging stack and manual trace ID lookup. For debugging, clicking from log → trace is essential.
  • DataDog Agent locally: Requires API key, sends data to cloud, rate limits apply. Defeats the purpose of local-only debugging.
  • OpenTelemetry Demo: Excellent for learning OTLP from scratch, but doesn't help debug existing ddtrace instrumentation—which was our whole problem.

Why Tempo over Jaeger for the backend? Tempo integrates natively with Grafana's Explore view, enabling the bidirectional log↔trace correlation that made debugging fast. Jaeger would require a separate UI and manual correlation.


Quick Start

git clone https://github.com/LukaszGrochal/demo-repo-otel-stack
cd demo-repo-otel-stack
docker-compose up -d

# Verify stack health
curl -s http://localhost:3200/ready   # Tempo
curl -s http://localhost:3100/ready   # Loki
curl -s http://localhost:13133/health # OTel Collector

Run the example application (requires uv):

cd examples/python
uv sync
DD_AGENT_HOST=localhost DD_TRACE_ENABLED=true uv run uvicorn app:app --reload

Generate a trace:

curl -X POST http://localhost:8000/orders \
  -H "Content-Type: application/json" \
  -d '{"user_id": 1, "product": "widget", "amount": 29.99}'

Open Grafana at http://localhost:3000 → Explore → Tempo.

Trace visualization in Grafana Tempo

Traces not appearing?

# Check collector is receiving data
docker-compose logs -f otel-collector | grep -i "trace"

# Common issues:
# - Port 8126 already bound (existing DataDog agent?)
# - DD_TRACE_ENABLED not set to "true"
# - Application not waiting for collector startup
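Beyond watching collector logs, you can confirm end-to-end delivery by querying Tempo's search API directly. A rough sketch, assuming Tempo's default HTTP port 3200 and an illustrative service name (swap in whatever service name your app reports):

import requests

# Ask Tempo for recent traces from the example app. The service name here is an
# assumption; use whatever name your ddtrace instrumentation reports.
resp = requests.get(
    "http://localhost:3200/api/search",
    params={"tags": "service.name=orders-api", "limit": 5},
    timeout=5,
)
resp.raise_for_status()
traces = resp.json().get("traces", [])
print(f"Tempo returned {len(traces)} recent trace(s)")
for t in traces:
    print(t.get("traceID"), t.get("rootServiceName"), t.get("rootTraceName"))

If this returns traces while Grafana shows nothing, the problem is on the Grafana data source side rather than in the pipeline.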

Pattern 1: Subprocess Trace Propagation

The Problem We Hit

Once the filter bug was fixed, we used the local stack to investigate another issue: the payment service spawned worker processes to generate invoice PDFs after each order. In production DataDog, we could see the HTTP request span, but the PDF generation time was invisible—traces stopped at the subprocess boundary.

This made debugging timeouts nearly impossible. When customers complained about slow order confirmations, we couldn't tell if it was the payment gateway or the invoice generation. The worker was a black box.

Why ddtrace Doesn't Handle This

ddtrace automatically propagates trace context for HTTP requests, gRPC calls, Celery tasks, and other instrumented protocols. But subprocess.run() isn't a protocol—it's an OS primitive. ddtrace can't know whether you want context passed via environment variables, command-line arguments, stdin, or files.

The Solution

Inject trace context into environment variables before spawning. The key insight is just 10 lines—the rest is error handling. From examples/python/app.py:272-340:

def spawn_traced_subprocess(command: list[str], timeout: float = 30.0):
    env = os.environ.copy()

    # THE KEY PATTERN: inject trace context into subprocess environment
    current_span = tracer.current_span()
    if current_span:
        env['DD_TRACE_ID'] = str(current_span.trace_id)
        env['DD_PARENT_ID'] = str(current_span.span_id)

    with tracer.trace("subprocess.spawn", service="subprocess") as span:
        span.set_tag("subprocess.command", " ".join(command[:3]))
        result = subprocess.run(command, env=env, capture_output=True, timeout=timeout)
        span.set_tag("subprocess.exit_code", result.returncode)
        return result.returncode, result.stdout, result.stderr

The full implementation includes timeout handling, error tagging, and logging—see the repository for the complete 70-line version with production error handling.

The worker process reads the context automatically. Key insight: ddtrace reads DD_TRACE_ID and DD_PARENT_ID from the environment when it initializes. You don't need to manually link spans—just ensure ddtrace is imported and patched before creating spans.

From examples/python/worker.py:89-105:

def get_parent_trace_context() -> tuple[int | None, int | None]:
    """Read trace context injected by parent process."""
    trace_id = os.environ.get('DD_TRACE_ID')
    parent_id = os.environ.get('DD_PARENT_ID')
    if trace_id and parent_id:
        return int(trace_id), int(parent_id)
    return None, None
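If you'd rather not depend on that automatic environment pickup (a portability concern revisited in "What I'd Do Differently"), the IDs returned above can be activated explicitly before creating spans. A sketch, assuming ddtrace's Context and context_provider API; the Context import path has shifted between ddtrace releases:

from ddtrace import tracer
from ddtrace.context import Context  # newer ddtrace releases may expose this elsewhere


def activate_parent_context() -> None:
    """Continue the parent's trace explicitly instead of relying on env pickup."""
    trace_id, parent_id = get_parent_trace_context()
    if trace_id and parent_id:
        tracer.context_provider.activate(Context(trace_id=trace_id, span_id=parent_id))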

The worker creates nested spans that automatically link to the parent trace. From examples/python/worker.py:108-170:

def process_file(input_path: str, simulate_error: bool = False) -> dict:
    with tracer.trace("worker.process_file", service="file-worker") as span:
        span.set_tag("file.path", input_path)
        span.set_tag("worker.pid", os.getpid())

        with tracer.trace("file.read") as read_span:
            # ... file reading with span tags

        for i in range(chunks):
            with tracer.trace("chunk.process") as chunk_span:
                chunk_span.set_tag("chunk.index", i)
                # ... chunk processing

        with tracer.trace("file.write") as write_span:
            # ... file writing with span tags

        return {"lines_processed": processed_lines, "chunks": chunks}

See worker.py for the full implementation with error simulation and detailed span tagging.

Test it:

curl -X POST http://localhost:8000/process-file \
  -H "Content-Type: application/json" \
  -d '{"file_path": "test.txt"}'

Subprocess trace propagation

The trace shows the complete chain: HTTP request → subprocess.spawn → worker.process_file → file.read → chunk.process (×N) → file.write. All connected under one trace ID.

Limitation

This only works for synchronous subprocess spawning where you control the invocation. For Celery, RQ, or other task queues, use their built-in trace propagation instead.


Pattern 2: Circuit Breaker Observability

We don't need another circuit breaker implementation—libraries like pybreaker and tenacity handle that. What matters for observability is tagging spans with circuit state so you can query failures during incidents.

From examples/python/app.py:609-618:

# Check inventory with circuit breaker
with tracer.trace("inventory.check", service="inventory-service") as span:
    span.set_tag("product", product)
    span.set_tag("circuit_breaker.state", external_service_circuit.state)

    if not external_service_circuit.can_execute():
        span.set_tag("error", True)
        span.set_tag("error.type", "circuit_open")
        PROM_ORDERS_FAILED.labels(reason="circuit_open").inc()
        raise HTTPException(status_code=503, detail="Inventory service circuit breaker open")
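The repo uses its own small circuit helper (external_service_circuit). The same tagging idea carries over to an off-the-shelf library; here is a sketch using pybreaker, with illustrative names and thresholds that are not from the repo:

import pybreaker
from ddtrace import tracer

# Illustrative breaker guarding an inventory call; thresholds are arbitrary.
inventory_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)


def call_remote_inventory(product: str) -> bool:
    """Placeholder for the real downstream inventory request."""
    raise NotImplementedError


def check_inventory(product: str) -> bool:
    with tracer.trace("inventory.check", service="inventory-service") as span:
        span.set_tag("product", product)
        span.set_tag("circuit_breaker.state", inventory_breaker.current_state)
        try:
            return inventory_breaker.call(call_remote_inventory, product)
        except pybreaker.CircuitBreakerError:
            span.set_tag("error", True)
            span.set_tag("error.type", "circuit_open")
            raise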

During an incident, query Tempo for circuit_breaker.state=OPEN to see:

  • When exactly the circuit opened
  • What failure pattern preceded it
  • Which downstream service caused the cascade

Pattern 3: Log-Trace Correlation

Click a log line in Loki, jump directly to the trace in Tempo.

Inject Trace IDs Into Logs

From examples/python/app.py:84-109:

class TraceIdFilter(logging.Filter):
    """Injects trace context into log records for correlation."""

    def filter(self, record):
        # Get current span from ddtrace
        span = tracer.current_span()
        if span:
            record.trace_id = span.trace_id
            record.span_id = span.span_id
            # Convert to hex format for Tempo compatibility
            record.trace_id_hex = format(span.trace_id, 'x')
        else:
            record.trace_id = 0
            record.span_id = 0
            record.trace_id_hex = '0'
        return True


# Set up logging with trace correlation
# Use hex format for trace_id to match Tempo's format
logging.basicConfig(
    format='%(asctime)s %(levelname)s [trace_id=%(trace_id_hex)s span_id=%(span_id)s] %(name)s: %(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)
logger.addFilter(TraceIdFilter())
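With the filter attached, any log emitted inside a traced request carries the IDs that the Loki derived field in the next section matches on. Illustratively (IDs and timestamps will differ):

with tracer.trace("orders.create"):
    logger.info("order created")

# Emits something like (illustrative values):
# 2024-01-15 12:00:00,123 INFO [trace_id=112210f47de98115 span_id=1234567890123456] app: order created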

Configure Grafana to Link Them

The Loki data source includes derived fields that extract trace IDs and create clickable links:

derivedFields:
  - datasourceUid: tempo
    matcherRegex: 'trace_id=([a-fA-F0-9]+)'
    name: TraceID
    url: '$${__value.raw}'

Loki log with trace ID

Correlation works bidirectionally:

  • Loki → Tempo: Click trace ID in any log entry
  • Tempo → Loki: Click "Logs for this span" in trace view

Log-trace correlation


The Collector Pipeline

This is where the debugging power comes from. From config/otel-collector.yaml:146-160:

service:
  extensions: [health_check, zpages]

  pipelines:
    # Main traces pipeline - processes all incoming traces
    traces:
      receivers: [datadog, otlp]
      processors:
        - memory_limiter      # First: prevent OOM
        - filter/health       # Remove health check noise
        - attributes/sanitize # Remove sensitive data
        - probabilistic_sampler # Sample if needed
        - batch               # Batch for efficiency
        - resource            # Add metadata
      exporters: [otlp/tempo]

Why each processor matters:

| Processor | Purpose | What breaks without it |
|---|---|---|
| memory_limiter | Prevents OOM on traffic spikes | Collector crashes, loses all buffered traces |
| filter/health | Removes health check noise | Storage fills with useless spans |
| attributes/sanitize | Strips sensitive headers | Credentials leaked to trace storage |
| batch | Groups spans for efficient export | High CPU, slow exports, Tempo overload |

Here is the corrected version of the filter configuration that caused our original production issue. From config/otel-collector.yaml:82-91:

filter/health:
  error_mode: ignore
  traces:
    span:
      - 'attributes["http.target"] == "/health"'
      - 'attributes["http.target"] == "/ready"'
      - 'attributes["http.target"] == "/metrics"'
      - 'attributes["http.target"] == "/"'
      - 'attributes["http.route"] == "/health"'
      - 'attributes["http.route"] == "/ready"'

Our production bug was a wildcard in one of these expressions that matched everything. Having a local stack to test filter rules before deploying them would have caught this in minutes, not days.


Performance Characteristics

Measured on M1 MacBook Pro, 16GB RAM, Docker Desktop 4.25:

| Metric | Value | Methodology |
|---|---|---|
| Idle memory (full stack) | 1.47 GB | docker stats after 5min idle |
| Collector memory | 89 MB | Under load, batch size 100 |
| Sustained throughput | ~800 spans/sec | hey load test, 50 concurrent, 60 seconds |
| Tempo query latency | 35-80ms | Trace with 50 spans, cold query |
| Export latency (P99) | 18ms | Collector metrics /metrics endpoint |

What does 800 spans/sec mean in practice? A typical request to our payment service generates 8-12 spans (HTTP, DB queries, external calls). That's ~70 requests/second before hitting limits. Our heaviest local testing—running integration suites with parallel workers—peaks at ~200 spans/sec, well within capacity.

At ~1200 spans/second, the collector begins dropping traces. You'll see this in the otelcol_processor_dropped_spans metric. For higher throughput, increase memory_limiter thresholds and batch sizes—but this is a local dev tool, not a production trace pipeline.
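
A quick local check is to scrape the collector's own Prometheus metrics and flag any non-zero drop counters. A sketch, assuming the collector's internal telemetry stays on the default port 8888 (the repo's config may expose it elsewhere):

import requests

# Scrape the collector's self-telemetry and surface any non-zero drop counters.
metrics = requests.get("http://localhost:8888/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("otelcol_processor_dropped_spans") and not line.rstrip().endswith(" 0"):
        print("dropped spans:", line)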


Security Model

What's Implemented

| Measure | Purpose |
|---|---|
| read_only: true | Immutable container filesystem—compromise can't persist |
| no-new-privileges | Blocks privilege escalation via setuid |
| Network isolation | Tempo only accessible from internal Docker network |
| Resource limits | Memory caps prevent container resource exhaustion |

What's NOT Implemented

  • TLS between components: All traffic is plaintext on the Docker network
  • Authentication: Grafana runs with anonymous access
  • Secrets management: No sensitive data in this stack

This is appropriate for local development. For shared dev environments, enable Grafana authentication:

# docker-compose.override.yml
services:
  grafana:
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=false
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}

Alerting

Pre-configured Prometheus and Loki alert rules, evaluated by Grafana:

| Alert | Condition | Purpose |
|---|---|---|
| HighErrorRate | >10% order failures | Catch application bugs early |
| SlowRequests | P95 latency > 2s | Detect performance regressions |
| CircuitBreakerOpen | State = OPEN | External dependency issues |
| ErrorLogSpike | Error log rate > 0.1/sec | Unusual error patterns |
| ServiceDown | Scrape target unreachable | Infrastructure failures |

Grafana alerting rules


Limitations

  1. Trace ID format conversion: DataDog uses 64-bit IDs; OTLP uses 128-bit. The collector zero-pads. Cross-system correlation with 128-bit-native systems may fail. (A conversion sketch follows this list.)

  2. No DataDog APM features: This gives you traces, not service maps, anomaly detection, or profiling integration.

  3. Memory footprint: ~1.5GB at idle. Not suitable for resource-constrained environments.

  4. Retention defaults: 24h for traces, 7d for logs. Configurable in tempo.yaml and loki.yaml.
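
For the first limitation, the padding itself is mechanical. A small illustrative sketch of converting between the two ID widths (not code from the repo):

def dd_to_otlp_trace_id(dd_trace_id: int) -> str:
    """Zero-pad a 64-bit DataDog trace ID into the 32-hex-character OTLP form."""
    return f"{dd_trace_id:032x}"


def otlp_to_dd_trace_id(otlp_trace_id: str) -> int:
    """Keep only the low 64 bits, which is all a 64-bit-native system retains."""
    return int(otlp_trace_id, 16) & ((1 << 64) - 1)


assert dd_to_otlp_trace_id(1234567890123456789) == "0000000000000000112210f47de98115"
assert otlp_to_dd_trace_id("0000000000000000112210f47de98115") == 1234567890123456789

Correlation only breaks when the upstream system generated a genuinely 128-bit ID, since the high bits are lost on the 64-bit side.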


What I'd Do Differently

1. Start with OpenTelemetry native instrumentation. If starting fresh today, I'd use the OpenTelemetry Python SDK rather than ddtrace. The 64-bit/128-bit trace ID mismatch we deal with is a symptom of building on a proprietary format. OTel gives you vendor portability from day one.

2. Use W3C Trace Context for subprocess propagation. The current pattern relies on ddtrace reading DD_TRACE_ID and DD_PARENT_ID from the environment—behavior that's not prominently documented and could change. A more portable approach would serialize W3C Trace Context headers to a temp file or pass via stdin:

# More portable alternative (pseudocode, not implemented here)
# W3C traceparent format: version-trace_id(32 hex)-parent_id(16 hex)-flags
traceparent = f"00-{trace_id:032x}-{span_id:016x}-01"
subprocess.run(cmd, input=json.dumps({"traceparent": traceparent}), ...)

3. Add a config validation mode. The filter regex bug that started this project could have been caught by a "dry run" mode that shows which spans would be filtered without actually dropping them. I may add this in a future version.

4. Consider ClickHouse for trace storage. Tempo is excellent for this use case, but for teams that need SQL queries over traces (e.g., "show me all spans where db.statement contains 'SELECT *'"), ClickHouse with the OTel exporter would be more powerful.

That said, for teams already invested in ddtrace, this stack provides immediate value without code changes—and that was the whole point.


Lessons for Incident Response

This incident changed how we handle observability issues:

  1. "Can we reproduce it locally?" is now our first question. If the answer is no, we build the tooling to make it yes.
  2. Config changes to observability pipelines get the same review rigor as application code. That regex change went through PR review—but nobody caught it because we couldn't test it.
  3. Silent failures are the worst failures. The collector reported healthy while dropping 100% of our traces. We now have alerts on otelcol_processor_dropped_spans > 0.

Repository

github.com/LukaszGrochal/demo-repo-otel-stack

This is a documented, tested version of the debugging tool that helped us fix a production outage. The patterns—subprocess tracing, circuit breaker tagging, log correlation—are used across three teams in our development workflows.

MIT licensed. Issues and PRs welcome.
