When Logs and Metrics Aren't Enough
You have great dashboards. Your log aggregation is solid. But when a user reports "the checkout page is slow," you still spend 30 minutes jumping between services trying to find the bottleneck.
That's the gap distributed tracing fills.
What Tracing Actually Shows You
A trace is a complete picture of a single request as it flows through your system:
```
User Request → API Gateway → Auth Service → Product Service → DB → Cache → Response
                   5ms           12ms            45ms        120ms    3ms
                                                               ^
                                                    This is your bottleneck
```
Without tracing, you'd see:
- API Gateway: latency looks fine
- Auth Service: latency looks fine
- Product Service: latency is HIGH but why?
With tracing, you see the exact DB query inside Product Service that's taking 120ms.
Getting Started with OpenTelemetry
OpenTelemetry is the standard. Here's a minimal setup:
```python
# Python example with Flask
from flask import Flask

from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

app = Flask(__name__)

# Setup: export spans to the collector in batches
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument everything
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db.engine)  # db = your SQLAlchemy instance
```
That's it. Three auto-instrumentations cover 80% of what you need.
Custom Spans for the Other 20%
Auto-instrumentation gives you HTTP calls and DB queries. Add custom spans for business logic:
```python
tracer = trace.get_tracer(__name__)

def process_order(order):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)

        with tracer.start_as_current_span("validate_inventory"):
            validate_inventory(order.items)

        with tracer.start_as_current_span("charge_payment"):
            charge_payment(order.payment_method, order.total)

        with tracer.start_as_current_span("send_confirmation"):
            send_email(order.customer_email)
```
Sampling Strategy
You can't trace every request in production. Well, you can, but your bill will be astronomical.
```yaml
# otel-collector-config.yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of requests
  tail_sampling:
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      # Sample 5% of everything else
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
Tail sampling is the key: because the keep/drop decision happens after the whole trace has completed, you can keep 100% of the interesting traces (errors, slow requests) and only 5% of the boring ones.
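To make the policy order concrete, here's a pure-Python sketch of the decision logic the config above encodes (the function and its parameters are illustrative, not collector internals): errors and slow traces always survive, everything else passes a probabilistic gate.

```python
import random

def keep_trace(status_is_error, latency_ms, sample_rate=0.05, rng=random.random):
    """Mimic the tail-sampling policies for one completed trace."""
    if status_is_error:          # policy: errors -> always keep
        return True
    if latency_ms >= 1000:       # policy: slow-requests -> always keep
        return True
    return rng() < sample_rate   # policy: default -> keep ~5%

print(keep_trace(True, 50))                     # → True (error, always kept)
print(keep_trace(False, 1500))                  # → True (slow, always kept)
print(keep_trace(False, 20, rng=lambda: 0.9))   # → False (boring, dropped)
```

In the real collector, any matching policy is enough to keep the trace, which is exactly the early-return structure above.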
The Three Queries That Matter
Once you have tracing data, these three queries cover 90% of your debugging:
1. "Show me the slowest traces in the last hour"
→ Finds performance regressions
2. "Show me traces with errors, grouped by service"
→ Finds which service is failing
3. "Show me traces for user X's request at time T"
→ Reproduces specific customer issues
Common Mistakes
- Not propagating trace context — If service A calls service B but doesn't pass the trace ID, you get broken traces
- Over-sampling in production — Start at 1-5%, increase as needed
- Not adding business context — Adding `user.id`, `order.id`, etc. to spans makes traces actually useful
- Ignoring async operations — Queues break trace propagation unless you explicitly pass context
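The first mistake above comes down to forwarding the W3C `traceparent` header between services. In real code you'd let OpenTelemetry's propagators (`opentelemetry.propagate.inject`/`extract`) handle this; the stdlib-only sketch below just shows what the header contains, so broken propagation is easier to spot.

```python
# Sketch: build and parse a W3C traceparent header by hand.
# Format: version "00" - 32-hex trace id - 16-hex parent span id - 2-hex flags.
import re
import secrets

def make_traceparent(trace_id=None):
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)                 # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"           # flags 01 = sampled

def parse_traceparent(header):
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None  # malformed header: start a fresh trace instead
    return {"trace_id": m.group(1), "parent_id": m.group(2), "flags": m.group(3)}

# Service A sends the header; service B parses it and reuses the trace id,
# so both services' spans land in the same trace.
header = make_traceparent()
ctx = parse_traceparent(header)
print(ctx["trace_id"] == header.split("-")[1])  # → True
```

For queues, the same idea applies: serialize the header into the message payload on the producer side and extract it on the consumer side, instead of relying on HTTP middleware.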
If you want AI-powered trace analysis that automatically finds bottlenecks, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com