Samson Tanimawo
Distributed Tracing: The Missing Piece of Your Observability Stack

When Logs and Metrics Aren't Enough

You have great dashboards. Your log aggregation is solid. But when a user reports "the checkout page is slow," you still spend 30 minutes jumping between services trying to find the bottleneck.

That's the gap distributed tracing fills.

What Tracing Actually Shows You

A trace is a complete picture of a single request as it flows through your system:

User Request → API Gateway → Auth Service → Product Service → DB → Cache → Response
                  5ms          12ms           45ms       120ms  3ms
                                                          ^
                                              This is your bottleneck

Without tracing, you'd see:

  • API Gateway: latency looks fine
  • Auth Service: latency looks fine
  • Product Service: latency is HIGH but why?

With tracing, you see the exact DB query inside Product Service that's taking 120ms.

Getting Started with OpenTelemetry

OpenTelemetry is the standard. Here's a minimal setup:

# Python example with Flask
from flask import Flask

from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

app = Flask(__name__)

# Set up a tracer provider that batches spans and ships them
# to an OpenTelemetry Collector over OTLP/gRPC
provider = TracerProvider(
    resource=Resource.create({"service.name": "product-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument incoming Flask requests, outgoing HTTP calls,
# and SQL queries (db.engine comes from your Flask-SQLAlchemy setup)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db.engine)

That's it. Three auto-instrumentations cover 80% of what you need.
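Each instrumentation ships as its own package, so the snippet above assumes these installs (exporter and instrumentation packages are separate from the SDK):

```
pip install opentelemetry-sdk \
            opentelemetry-exporter-otlp \
            opentelemetry-instrumentation-flask \
            opentelemetry-instrumentation-requests \
            opentelemetry-instrumentation-sqlalchemy
```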

Custom Spans for the Other 20%

Auto-instrumentation gives you HTTP calls and DB queries. Add custom spans for business logic:

tracer = trace.get_tracer(__name__)

def process_order(order):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)

        with tracer.start_as_current_span("validate_inventory"):
            validate_inventory(order.items)

        with tracer.start_as_current_span("charge_payment"):
            charge_payment(order.payment_method, order.total)

        with tracer.start_as_current_span("send_confirmation"):
            send_email(order.customer_email)

Sampling Strategy

You can't trace every request in production. Well, you can, but your bill will be astronomical.

# otel-collector-config.yaml — two approaches; pick one
processors:
  # Head sampling: a blind random cut, decided when the trace starts
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of requests

  # Tail sampling: decided after the whole trace has been collected
  tail_sampling:
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow requests
      - name: slow-requests  
        type: latency
        latency: {threshold_ms: 1000}
      # Sample 5% of everything else
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

Tail sampling is the key. It lets you keep 100% of interesting traces and only 5% of boring ones.
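One gotcha: defining the processor isn't enough; it also has to be wired into the collector's traces pipeline. A sketch of the `service` section, assuming OTLP receivers and exporters (adjust names to your setup):

```
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

Note that tail sampling requires all spans of a trace to reach the same collector instance, so with multiple collectors you'll need a load-balancing tier in front.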

The Three Queries That Matter

Once you have tracing data, these three queries cover 90% of your debugging:

1. "Show me the slowest traces in the last hour"
   → Finds performance regressions

2. "Show me traces with errors, grouped by service"
   → Finds which service is failing

3. "Show me traces for user X's request at time T"
   → Reproduces specific customer issues
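The exact syntax depends on your backend. As one hedged example, in Grafana Tempo's TraceQL (attribute names like `span.user.id` are whatever you put on your spans; the time window comes from the UI's time picker), the three queries look roughly like:

```
# 1. Slowest traces: everything over a second
{ duration > 1s }

# 2. Error traces, narrowed to one service
{ status = error && resource.service.name = "product-service" }

# 3. Traces carrying a specific user's ID
{ span.user.id = "user-123" }
```

Query 3 only works if you followed the advice below about adding business context to spans.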

Common Mistakes

  1. Not propagating trace context — If service A calls service B but doesn't pass the trace ID, you get broken traces
  2. Over-sampling in production — Start at 1-5%, increase as needed
  3. Not adding business context — Adding user.id, order.id, etc. to spans makes traces actually useful
  4. Ignoring async operations — Queues break trace propagation unless you explicitly pass context

If you want AI-powered trace analysis that automatically finds bottlenecks, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
