Haji Rufai

Posted on May 30

I Built a Distributed Tracing System from Scratch in Python

#python #observability #distributedsystems #backend

Every time I looked at OpenTelemetry's source code, I'd close the tab after five minutes. Not because it's bad — it's excellent engineering — but because the abstractions are so deep that you can't see what's actually happening. What does a span processor do? How does context propagation work across services? What's inside a traceparent header?

So I built my own. From scratch. Zero dependencies. Just Python's standard library and a lot of reading the W3C spec.

The result is TraceLite — a fully functional distributed tracing system with 18 modules, 193 tests, and about 5,700 lines of code.

What Even Is Distributed Tracing?

When a user hits your API and the request touches three services and two databases, something will be slow. Distributed tracing answers the question: which part?

A trace is the full journey of a request. A span is one unit of work within that journey. Spans nest: the top-level "handle HTTP request" span has children like "authenticate user" and "query database."

Each span records its start time, end time, attributes (like db.system = postgresql), and a link to its parent. Stitch them together and you get a waterfall diagram that shows exactly where time was spent.

The Span Data Model

Here's the core of TraceLite — the Span class:

class Span:
    def __init__(self, name, trace_id=None, parent_span_id=None,
                 kind=SpanKind.INTERNAL, resource=None, attributes=None):
        self.trace_id = trace_id or generate_trace_id()
        self.span_id = generate_span_id()
        self.parent_span_id = parent_span_id
        self.name = name
        self.kind = kind
        self.start_time_ns = monotonic_ns()
        self.end_time_ns = None
        self.status = SpanStatus(StatusCode.UNSET)
        self.attributes = dict(attributes or {})
        self.events = []
        self.links = []

Nothing exotic. A span is a named interval with metadata. The trace_id is shared across all spans in a request — that's how you reconstruct the full picture. The parent_span_id tells you which span created this one.
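The constructor calls generate_trace_id() and generate_span_id(), which aren't shown above. Here's a minimal sketch of what such helpers could look like, assuming they return lowercase hex strings in the sizes the W3C traceparent format expects (16 bytes and 8 bytes). The bodies are my illustration, not necessarily what TraceLite does:

```python
import secrets


def generate_trace_id() -> str:
    """Return 16 random bytes as 32 lowercase hex characters."""
    while True:
        trace_id = secrets.token_hex(16)
        # W3C Trace Context treats an all-zero trace-id as invalid, so retry.
        if trace_id != "0" * 32:
            return trace_id


def generate_span_id() -> str:
    """Return 8 random bytes as 16 lowercase hex characters."""
    while True:
        span_id = secrets.token_hex(8)
        # An all-zero span-id is invalid as well.
        if span_id != "0" * 16:
            return span_id
```

Random IDs also matter for sampling later on: a sampler that derives its decision from the trace ID only behaves like a fair coin if the IDs are uniformly distributed.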

Span kinds matter more than you'd think. A SERVER span represents an incoming request being handled. A CLIENT span is an outgoing call. INTERNAL is everything else. When you build a service graph later, these kinds determine the edges.

Context Propagation: The Hard Part

Inside a single process, linking spans is easy: store the current span in a contextvars.ContextVar and check it when creating a new span.

_current_span: ContextVar[Optional[Span]] = ContextVar('current_span', default=None)

def set_current_span(span):
    return _current_span.set(span)

def get_current_span():
    return _current_span.get()

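The set() call returns a token. When the span ends, you use that token to restore the previous value. This matters because each asyncio task runs in its own copy of the current context and each thread has its own context, so concurrent requests never see each other's current span and no locks are needed. One caveat: work handed to a worker thread generally needs the context passed along explicitly, for example with contextvars.copy_context().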

But what about across services? Service A calls Service B over HTTP. How does B know it's part of A's trace?

That's where the W3C Trace Context specification comes in. It defines a traceparent header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
              ^  ^                                ^                 ^
              |  trace-id (32 hex)                span-id (16 hex)  flags
              version

The implementation is a W3CTraceContextPropagator with two methods:

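class W3CTraceContextPropagator:
    def inject(self, span_context, carrier):
        # Flags are one byte, rendered as two lowercase hex digits (01 = sampled)
        flags = f"{span_context.trace_flags:02x}"
        header = f"00-{span_context.trace_id}-{span_context.span_id}-{flags}"
        carrier["traceparent"] = header

    def extract(self, carrier):
        header = carrier.get("traceparent", "")
        parts = header.split("-")
        if len(parts) != 4:
            return None  # Missing or malformed header: the caller starts a new trace
        return SpanContext(trace_id=parts[1], span_id=parts[2],
                           trace_flags=int(parts[3], 16))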

Service A injects the header before making an HTTP call. Service B extracts it and uses the extracted SpanContext as the parent for its root span. Same trace_id, different span_id, linked via parent_span_id.
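Headers arrive from the network, so the receiving side should treat them as untrusted input. Here's a stricter, standalone parsing sketch. The function names and the ParsedTraceparent type are hypothetical, and the sketch only accepts version 00, whereas the spec asks implementations to handle future versions more leniently:

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class ParsedTraceparent(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParsedTraceparent]:
    """Return the parsed header, or None if it is missing or malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return ParsedTraceparent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

Returning None for anything malformed lets the receiving service fall back to starting a fresh trace instead of crashing on a bad header.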

Sampling: Don't Record Everything

In production, recording every span would melt your storage. Sampling decides which traces to keep. TraceLite ships five strategies:

AlwaysOn / AlwaysOff — obvious
Probabilistic — uses the trace ID as a deterministic seed, so all spans in the same trace get the same decision
RateLimiting — token bucket algorithm, caps spans per second
ParentBased — delegates to the parent's sampling decision for consistency
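The RateLimiting strategy in the list above rests on a token bucket. Here's a minimal sketch of the algorithm, assuming one token per sampled span. The class name, parameters, and injectable clock are my own choices for illustration, and the sketch is not thread-safe:

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate_per_sec` acquisitions per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._tokens = capacity  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A sampler built on this would record the span when try_acquire() returns True and drop it otherwise. Passing the clock in as a parameter keeps the behavior testable without sleeping.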

The probabilistic sampler is worth looking at:

class ProbabilisticSampler(Sampler):
    def __init__(self, rate: float):
        self._bound = int(rate * (1 << 64))

    def should_sample(self, trace_id, name, parent_context, attributes):
        hash_val = int(trace_id[:16], 16)
        sampled = hash_val < self._bound
        return SamplingResult(
            decision=SamplingDecision.RECORD if sampled else SamplingDecision.DROP
        )

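Because the decision is derived from the trace ID itself (the first 64 bits, read as an integer), every service configured with the same rate makes the same sampling decision for the same trace. No coordination needed.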
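To see the arithmetic at work, here's the same comparison as a standalone function, applied to the trace ID from the traceparent example. The is_sampled name is hypothetical, and the approach assumes trace IDs are uniformly random in their leading 64 bits:

```python
def is_sampled(trace_id: str, rate: float) -> bool:
    """Keep the trace if its first 64 bits fall below rate * 2**64."""
    bound = int(rate * (1 << 64))
    return int(trace_id[:16], 16) < bound


TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

# 0x4bf92f3577b34da6 / 2**64 is roughly 0.297, so the trace is kept
# at a 50% rate and dropped at a 25% rate, in every service that asks.
kept_at_half = is_sampled(TRACE_ID, 0.5)
kept_at_quarter = is_sampled(TRACE_ID, 0.25)
```

The flip side is that services configured with different rates can still disagree, which is one reason to pair this with a parent-based strategy downstream.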

The Processing Pipeline

Spans flow through a pipeline: Tracer → Processor → Exporter. The SimpleSpanProcessor is synchronous — it exports each span immediately when it ends. The BatchSpanProcessor queues spans and exports them in batches on a background thread:

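import queue
import threading

class BatchSpanProcessor(SpanProcessor):
    def __init__(self, exporter, max_queue=2048, batch_size=512,
                 schedule_delay_ms=5000):
        self._exporter = exporter
        self._batch_size = batch_size
        self._schedule_delay_ms = schedule_delay_ms
        self._queue = queue.Queue(maxsize=max_queue)
        # _run (not shown here) drains the queue and exports in batches
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_end(self, span):
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            pass  # Drop span rather than block the application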

That except queue.Full: pass is a real design decision. In a tracing system, you never want instrumentation to slow down the application. Dropping spans under load is the correct behavior.
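The excerpt leaves out the worker loop. Here's a self-contained sketch of how a drop-on-full queue and a background batching thread can fit together. BatchWorker and its method names are hypothetical, and the export callable stands in for a real exporter:

```python
import queue
import threading
from typing import Any, Callable, List


class BatchWorker:
    def __init__(self, export: Callable[[List[Any]], None], max_queue: int = 2048,
                 batch_size: int = 512, schedule_delay_s: float = 5.0):
        self._export = export
        self._batch_size = batch_size
        self._delay = schedule_delay_s
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=max_queue)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, span: Any) -> bool:
        """Enqueue without blocking; return False if the span was dropped."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            return False

    def _flush(self) -> None:
        """Export everything currently queued, at most batch_size spans per call."""
        while True:
            batch = []
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return
            try:
                self._export(batch)
            except Exception:
                pass  # a failing exporter must not kill the worker thread

    def _run(self) -> None:
        while not self._stop.is_set():
            # Sleep for the schedule delay, waking early on shutdown.
            self._stop.wait(self._delay)
            self._flush()
        self._flush()  # final drain for spans queued just before shutdown

    def shutdown(self) -> None:
        self._stop.set()
        self._thread.join()
```

Using an Event for the sleep means shutdown doesn't have to wait out the full schedule delay, and the final flush keeps spans that were queued right before shutdown from being lost.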

Storage: SQLite with WAL

Spans are stored in SQLite using Write-Ahead Logging mode for concurrent reads and writes. The schema is flat — one row per span with JSON columns for structured data:

CREATE TABLE spans (
    trace_id TEXT,
    span_id TEXT PRIMARY KEY,
    parent_span_id TEXT,
    name TEXT,
    kind TEXT,
    service_name TEXT,
    status_code TEXT,
    start_time_ns INTEGER,
    end_time_ns INTEGER,
    duration_ns INTEGER,
    attributes_json TEXT,
    events_json TEXT
);
CREATE INDEX idx_trace ON spans(trace_id);
CREATE INDEX idx_time ON spans(start_time_ns);
CREATE INDEX idx_service ON spans(service_name);
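WAL mode is a per-database setting that you switch on with a pragma. Here's a minimal sketch of opening such a store from Python, using a cut-down version of the schema to keep it short. The function names and the synchronous=NORMAL setting are my assumptions, not necessarily TraceLite's choices:

```python
import json
import sqlite3
from typing import Any, Dict, List, Tuple


def open_span_store(path: str) -> sqlite3.Connection:
    """Open (or create) a span database in WAL mode."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active; it needs a file-backed DB.
    conn.execute("PRAGMA journal_mode=WAL")
    # NORMAL trades a little durability for fewer fsyncs, a common pairing with WAL.
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spans ("
        " trace_id TEXT, span_id TEXT PRIMARY KEY, name TEXT,"
        " duration_ns INTEGER, attributes_json TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_trace ON spans(trace_id)")
    return conn


def insert_span(conn: sqlite3.Connection, trace_id: str, span_id: str, name: str,
                duration_ns: int, attributes: Dict[str, Any]) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO spans VALUES (?, ?, ?, ?, ?)",
            (trace_id, span_id, name, duration_ns, json.dumps(attributes)),
        )


def spans_for_trace(conn: sqlite3.Connection,
                    trace_id: str) -> List[Tuple[str, int, Dict[str, Any]]]:
    """Return (name, duration_ns, attributes) for one trace, slowest first."""
    rows = conn.execute(
        "SELECT name, duration_ns, attributes_json FROM spans"
        " WHERE trace_id = ? ORDER BY duration_ns DESC",
        (trace_id,),
    )
    return [(name, duration, json.loads(attrs)) for name, duration, attrs in rows]
```

The JSON column keeps the schema flat at the cost of not being able to index individual attributes directly.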

The query builder provides a fluent API:

query = (TraceQuery()
    .service("api-gateway")
    .min_duration_ms(100)
    .errors_only()
    .last_hours(1)
    .limit(20))
traces = storage.get_traces(**query.build())
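The fluent style comes down to every method recording one filter and returning self. Here's a sketch of that pattern. TraceQuerySketch and the keys it produces are hypothetical and may not match what TraceLite's storage.get_traces() actually accepts:

```python
from typing import Any, Dict


class TraceQuerySketch:
    """Hypothetical fluent builder: each method stores a filter and returns self."""

    def __init__(self) -> None:
        self._filters: Dict[str, Any] = {}

    def service(self, name: str) -> "TraceQuerySketch":
        self._filters["service_name"] = name
        return self

    def min_duration_ms(self, ms: float) -> "TraceQuerySketch":
        # Durations are stored in nanoseconds, so convert once here.
        self._filters["min_duration_ns"] = int(ms * 1_000_000)
        return self

    def errors_only(self) -> "TraceQuerySketch":
        self._filters["status_code"] = "ERROR"
        return self

    def last_hours(self, hours: float) -> "TraceQuerySketch":
        self._filters["max_age_s"] = int(hours * 3600)
        return self

    def limit(self, n: int) -> "TraceQuerySketch":
        self._filters["limit"] = n
        return self

    def build(self) -> Dict[str, Any]:
        """Return a copy so later chaining cannot mutate an issued query."""
        return dict(self._filters)
```

Because build() returns a plain dict, it can be unpacked straight into keyword arguments, which is what the `**query.build()` call relies on.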

Analysis: Finding the Bottleneck

The analysis module has four functions that answer the questions you'd actually ask when debugging:

Critical path finds the chain of spans that determine end-to-end latency. If any span in this chain gets slower, the whole trace gets slower. The algorithm walks from root to leaf, always following the child with the latest end time.
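The walk described above fits in a few lines. Here's a sketch under the assumption that spans are plain dicts with span_id, parent_span_id, name, and end_time_ns keys, and that the trace has a single root. The function name and input shape are mine, not TraceLite's API:

```python
from collections import defaultdict
from typing import Any, Dict, List


def critical_path(spans: List[Dict[str, Any]]) -> List[str]:
    """Return span names from root to leaf, following the latest-ending child."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_span_id"] is None:
            root = span
        else:
            children[span["parent_span_id"]].append(span)

    path = []
    current = root
    while current is not None:
        path.append(current["name"])
        kids = children.get(current["span_id"])
        # The child that finishes last is the one the parent was waiting on.
        current = max(kids, key=lambda s: s["end_time_ns"]) if kids else None
    return path
```

This greedy version follows one branch only, so it is a simplification when children overlap or run in parallel, but it matches the root-to-leaf description.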

Latency breakdown groups spans by service and operation, computing total time, average, and percentage of trace duration.

Error summary counts errors by service, operation, and exception type. When 90% of your errors come from one service's database calls, that's where you look.

Gap analysis finds dead time — intervals where the parent span was running but no child span was active. These gaps often reveal hidden work like serialization, GC pauses, or untraced third-party calls.
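Gap analysis is an interval problem: sort the children by start time, sweep a cursor forward, and report every stretch the cursor has to jump over. Here's a sketch that represents each span as a (start, end) tuple. The find_gaps name and the tuple representation are assumptions for illustration:

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def find_gaps(parent: Interval, children: List[Interval]) -> List[Interval]:
    """Return the intervals inside `parent` that no child covers."""
    parent_start, parent_end = parent
    gaps = []
    cursor = parent_start  # everything before the cursor is already accounted for
    for start, end in sorted(children):
        # Clamp each child to the parent's bounds; skip children entirely outside.
        start, end = max(start, parent_start), min(end, parent_end)
        if end <= start:
            continue
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # max() handles overlapping or nested children
    if cursor < parent_end:
        gaps.append((cursor, parent_end))
    return gaps
```

For a 245 ms parent with children covering 0 to 18, 18 to 24, and 40 to 238, this reports two gaps: 24 to 40 and 238 to 245.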

Service Graph

When you have spans from multiple services, you can build a dependency graph:

graph = ServiceGraph()
graph.add_traces(all_traces)
print(graph.to_ascii())

Service Dependency Graph
==================================================
  gateway ──> user-service
    142 calls, 45.2ms avg, 2.1% errors
  user-service ──> database
    89 calls, 12.8ms avg, 0.0% errors

The graph tracks call counts, average latency, and error rates per edge. It also exports to DOT format for Graphviz rendering and computes a topological sort for deployment ordering.
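Here's a sketch of how edges can be derived from span kinds, assuming an edge exists wherever a CLIENT span in one service is the parent of a SERVER span in another. The dict-based span shape and function names are hypothetical, and the ordering uses graphlib from the standard library (Python 3.9+):

```python
from collections import defaultdict
from graphlib import TopologicalSorter
from typing import Any, Dict, List, Tuple

Edge = Tuple[str, str]


def build_edges(spans: List[Dict[str, Any]]) -> Dict[Edge, int]:
    """Count calls per (caller, callee) service pair."""
    by_id = {s["span_id"]: s for s in spans}
    calls: Dict[Edge, int] = defaultdict(int)
    for span in spans:
        parent = by_id.get(span["parent_span_id"])
        if parent is None:
            continue
        crosses_services = parent["service"] != span["service"]
        if parent["kind"] == "CLIENT" and span["kind"] == "SERVER" and crosses_services:
            calls[(parent["service"], span["service"])] += 1
    return dict(calls)


def deployment_order(edges: Dict[Edge, int]) -> List[str]:
    """Order services so that every dependency comes before its callers."""
    sorter = TopologicalSorter()
    for caller, callee in edges:
        sorter.add(caller, callee)  # the caller depends on the callee
    return list(sorter.static_order())
```

static_order() raises graphlib.CycleError if services call each other in a loop, which is a useful signal in its own right.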

The Decorator Shortcut

For application code, the @trace decorator eliminates boilerplate:

from tracelite.decorators import trace

@trace(record_args=True)
def process_order(order_id: str, items: list):
    validate_inventory(items)
    charge_payment(order_id)

@trace
def validate_inventory(items):
    for item in items:
        check_stock(item)

With record_args=True, function arguments are automatically captured as span attributes. The decorator handles context propagation, error recording, and cleanup.

Visualization

Two ASCII renderers make traces readable in terminals and CI logs:

● GET /api/users (245ms)
├── ● auth_middleware (18ms)
├── ● rate_limiter (6ms)
└── ● call_user_service (198ms)
    └── ● GET /internal/users (195ms)
        ├── ● check_cache (3ms)
        └── ● db_query (172ms)
            └── ● SELECT users (170ms)

The waterfall diagram shows timing relationships. The span tree shows nesting. Both derive from the same span data — just different projections.
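The span tree is a depth-first walk that carries a prefix string down the recursion. Here's a sketch assuming spans are dicts with span_id, parent_span_id, name, and duration_ms keys. The render_tree function and that input shape are illustrative only:

```python
from typing import Any, Dict, List


def render_tree(spans: List[Dict[str, Any]]) -> str:
    """Render spans as an indented tree using box-drawing characters."""
    children: Dict[Any, List[Dict[str, Any]]] = {}
    for span in spans:
        children.setdefault(span["parent_span_id"], []).append(span)

    lines: List[str] = []

    def walk(span: Dict[str, Any], prefix: str, is_last: bool, is_root: bool) -> None:
        label = f"● {span['name']} ({span['duration_ms']}ms)"
        if is_root:
            lines.append(label)
            child_prefix = ""
        else:
            lines.append(prefix + ("└── " if is_last else "├── ") + label)
            # Below a last child the column is blank; otherwise the bar continues.
            child_prefix = prefix + ("    " if is_last else "│   ")
        kids = children.get(span["span_id"], [])
        for index, kid in enumerate(kids):
            walk(kid, child_prefix, index == len(kids) - 1, False)

    for root in children.get(None, []):
        walk(root, "", True, True)
    return "\n".join(lines)
```

Children are rendered in input order here. A real renderer would probably sort them by start time first.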

There's also an HTML dashboard generator that produces a self-contained file with trace list, waterfall view, and service map. No server required — just open the HTML file.

What I Learned

Building a tracing system teaches you things that using one doesn't:

Context propagation is the whole game. The span data model is simple. Getting context to flow correctly across threads, async tasks, and service boundaries is where all the complexity lives.
Sampling must be deterministic. If Service A decides to sample a trace but Service B doesn't, you get partial traces. Using the trace ID as the randomness source gives every service the same answer without coordination, as long as they all sample at the same rate.
Dropping data is a feature. The batch processor's except queue.Full: pass pattern felt wrong at first. But tracing is observability, not correctness. It must never slow down the application, even if that means losing some spans.
W3C Trace Context is well-designed. The spec is tight — 26 pages that cover propagation, mutation rules, and vendor-specific tracestate. Reading it directly was more useful than reading any tutorial about it.
SQLite is underrated for observability. WAL mode handles concurrent reads and writes well enough for single-node use. The indexed queries are fast. And the data is immediately queryable without a separate query language.

The full source is at github.com/hajirufai/tracelite — 18 modules, 193 tests, zero dependencies. The landing page is at hajirufai.github.io/tracelite.
