
Distributed Tracing in ML Pipelines: From Preprocessing to Inference

How OpenTelemetry exposes the bottlenecks your metrics will never see

Samuel Desseaux · Erythix


1. The Lie of the Green Dashboard

It is 2 PM on a Tuesday. Your team receives a user report: predictions have been slow since this morning. You open Grafana. CPU at 38%, GPU at 72%, HTTP error rate at 0.2%, p99 latency at 1.4s. Nothing breaches a configured threshold. You tell the user everything looks nominal.

Two hours later, a second report. Then a third. The problem exists. Your tools cannot see it.

This scenario is not hypothetical. It is the daily reality of most teams operating ML pipelines in production without distributed tracing. Classic metrics measure the state of a service at a given moment. They do not measure the life of a request as it travels through multiple services. These are two fundamentally different levels of observation, and conflating them is a systematic source of operational blind spots.

The distinction matters more in ML pipelines than anywhere else in software engineering, because a machine learning pipeline is not a function. It is a chain of distributed transformations, each with its own dependencies, its own timing characteristics, and its own failure modes.


2. Why an ML Pipeline Is Structurally Difficult to Observe

Consider a typical production pipeline for a recommendation engine or an LLM completion service. A request arrives at the API Gateway, which validates it, enriches context via a feature store, assembles a batch, sends it to the model server, retrieves raw output, validates and formats it, then finally responds to the client. Six to eight distinct services, sometimes in different runtimes (Python, Go, Triton), sometimes on different machines, sometimes in different availability zones.

In this context, a performance degradation can originate anywhere in the chain. And its cause is rarely where it appears to be.

The technical impact is direct. Without cross-service visibility, diagnosing a regression takes hours. Engineers manually scan logs from each service, compare timestamps, and mentally reconstruct a sequence that tooling should surface automatically. This is not a knowledge problem; it is an instrumentation problem.

The business impact is consistently underestimated. A slow ML pipeline means a recommendation engine that responds after the user has already scrolled past. It means an LLM assistant that feels broken. In the most critical contexts (fraud detection, credit scoring, medical triage), it means a decision rendered too late to be useful. Latency is not an infrastructure problem. It is a value delivery problem.


3. The Four Bottlenecks That Metrics Will Never See

Before reaching solutions, the problems deserve precise names. Field experience on production ML pipelines consistently surfaces four categories of bottlenecks that are structurally invisible to classic monitoring.

3.1 The Cascade Bottleneck in Feature Extraction

The feature store is the most under-monitored dependency in an ML pipeline. It handles requests efficiently when serving from cache, then falls back to the underlying database for the minority of cases where data is not warm. That minority can have a latency ten to fifty times higher than the cache path.

What the metrics show: p50 at 15ms, p99 at 800ms on the feature store service. If you are already looking at that specific service dashboard, the problem is visible. If you are looking at aggregate pipeline latency, the feature store is buried in the noise. And if you do not know the feature store is the culprit, you will not look at its dashboard until the investigation is already underway.

A distributed trace, by contrast, immediately shows that on slow requests the feature_extraction span accounts for 60% of total pipeline time, and that it is consistently the db_fallback child span driving the duration.

The business impact crystallizes around new users: those whose features are not yet warm in cache experience the worst latency precisely at the moment when engagement is most fragile and first impressions are being formed.

3.2 The GPU Queue Hiding Behind Utilization Numbers

The GPU is the most expensive resource in an ML infrastructure. Its standard monitoring metric is utilization percentage. 75% looks healthy, neither underused nor saturated.

But GPU utilization measures the percentage of time the GPU is executing kernels. It does not measure the time requests spend queued for GPU access. Behind a gauge reading 75%, an individual request can spend 60% of its wall-clock inference time waiting for prior requests to release memory before its own computation begins.

Distributed tracing decomposes the inference span into two distinct measurements: queue_wait_ms and forward_pass_ms. When the ratio of queue wait to forward pass exceeds 1, the GPU is a bottleneck regardless of what the utilization gauge reads.

The economics are stark. On a service handling 10,000 requests per hour, 300ms of GPU queue wait per request adds up to 3,000 seconds (50 minutes) of accumulated client-facing latency every hour. And the GPU billing meter runs equally on queue time and compute time.

3.3 Silent revalidations

This pattern is the most deceptive of the four. Your HTTP error rate is 0%. Users receive valid responses. Everything appears correct by every standard metric.

Behind the scenes, the post-processor is receiving malformed model outputs: truncated JSON, missing required fields, unexpected formats. It repairs them by replaying validation with relaxed parameters, sometimes by re-invoking the model. The end user sees a valid response with slightly elevated latency. The monitoring dashboard sees nothing unusual.

This behavior is an early indicator of model degradation. A model that starts producing malformed outputs 5% of the time will progressively worsen to 10%, then 20%. Without measuring the revalidation rate, the degradation only becomes detectable through HTTP error rate increases, which is to say: too late, after user-facing failures have already begun.

Distributed tracing makes these validation attempts visible as attributes and events on the post_processing span. Aggregated across requests, they form an early warning signal that HTTP-level metrics fundamentally cannot provide.

3.4 The Unmanaged Component Cold Start

The tokenizer, the image preprocessor, the feature scaler: these components are typically loaded from disk or initialized on the first request, then held in memory. Or they are supposed to be.

In practice, unexpected reloads occur for a variety of reasons: worker rotation, memory eviction under pressure, partial deployments, lifecycle bugs. The result is a bimodal latency distribution: the vast majority of requests are fast, while a minority are slow in a pattern that does not look random.

Identifying this on aggregated metrics is difficult because the mean and even the p99 can appear acceptable if the reloads are infrequent. On a trace, the cold start appears as a tokenizer_init span of 300ms on affected requests and absent on clean ones. The pattern is immediately legible on a flamegraph. On a metrics dashboard, it dissolves into the p95 histogram and becomes invisible.
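The guard pattern behind these reloads can be sketched in a few lines (all names here are hypothetical). In a real service the timing would be recorded via span.set_attribute on the active span; a plain dict stands in for span attributes below so the pattern itself stays visible:

```python
import time

_TOKENIZER = None  # held in memory after first use; eviction forces a reload

def load_tokenizer():
    # Stand-in for an expensive disk load (vocab files, merges, etc.)
    time.sleep(0.05)
    return object()

def get_tokenizer(span_attrs: dict):
    """Return the cached tokenizer, recording a cold start when one occurs.

    span_attrs stands in for OpenTelemetry span attributes; in a real
    service you would call span.set_attribute(...) instead.
    """
    global _TOKENIZER
    if _TOKENIZER is None:
        t0 = time.monotonic()
        _TOKENIZER = load_tokenizer()
        # This is the signal that dissolves in aggregate dashboards:
        # present on cold requests, absent on warm ones.
        span_attrs["ml.tokenizer.cold_start"] = True
        span_attrs["ml.tokenizer.init_ms"] = round(
            (time.monotonic() - t0) * 1000, 2)
    return _TOKENIZER
```

Aggregating the cold_start attribute across traces then shows exactly how often the reload fires and which requests pay for it.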


4. What Distributed Tracing Changes Structurally

Before the implementation, the theoretical case deserves to be made clearly.

A metric is an aggregation. It discards the information about individual requests. It cannot tell you that user X's request was slow because of the feature store while user Y's request was slow because of the GPU queue.

A log is an isolated event. It knows what happened inside one service, but not how that event relates to what happened in the other services handling the same request.

A trace is a causal and temporal view of a request as it traverses all services. It links events by their trace_id, preserves the parent-child relationship between spans, measures each stage individually, and allows navigation from the aggregate view (a flamegraph across all requests) to the individual view (one specific slow request) in two clicks.

This is the difference between knowing your pipeline is sometimes slow and knowing why this specific request took 1.8 seconds at 2:03 PM on Tuesday.
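To make the contrast concrete, the structure a trace encodes reduces to a toy model, sketched here without any OpenTelemetry dependency: every span carries the shared trace_id, a pointer to its parent span, and its own timing. That triple is exactly what lets a backend reassemble the causal tree for one request:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    duration_ms: float = 0.0

def start_trace(name: str) -> Span:
    # The root span mints the trace_id; everything downstream inherits it.
    return Span(name=name, trace_id=uuid.uuid4().hex)

def start_child(parent: Span, name: str) -> Span:
    # Children share the trace_id and point back to their parent span.
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)

# One request traversing three stages
root = start_trace("ml.pipeline.request")
features = start_child(root, "ml.feature_extraction")
db = start_child(features, "db_fallback")
```

All three spans share one trace_id, and db.parent_id points at the feature extraction span: the flamegraph in section 6 is nothing more than this tree drawn with durations.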


5. Instrumentation Architecture: Five Services, One Trace

The following implementation covers a complete ML pipeline instrumented end-to-end. Each service produces spans. Context propagates via HTTP headers. Tempo aggregates spans into navigable traces.

[Client]
    |
    v
[API Gateway]              <- trace_id created here
    |
    |---> [Input Validator]        <- child span 1
    |
    |---> [Feature Extractor]      <- child span 2
    |           `---> [Feature Store]   <- grandchild span 2.1
    |
    |---> [Batch Assembler]        <- child span 3
    |           `---> [Tokenizer]       <- grandchild span 3.1
    |
    |---> [Model Inference]        <- child span 4
    |           `---> [Model Server]    <- grandchild span 4.1
    |
    `---> [Post-Processor]         <- child span 5

5.1 Shared Initialization

# tracing_setup.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import os

def setup_tracing(service_name: str) -> trace.Tracer:
    resource = Resource.create({
        "service.name":           service_name,
        "service.version":        os.getenv("SERVICE_VERSION", "0.0.0"),
        "deployment.environment": os.getenv("ENV", "production"),
        "ml.pipeline.name":       os.getenv("PIPELINE_NAME", "default"),
    })

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint=os.getenv(
                    "OTEL_EXPORTER_OTLP_ENDPOINT",
                    "http://otel-collector:4317"
                ),
                insecure=True,
            ),
            max_queue_size=2048,
            max_export_batch_size=512,
        )
    )
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)

5.2 API Gateway: Creating the Root Span

The gateway is where the trace_id is born. Every subsequent service will attach its spans to this root. The attributes set here define the dimensions available for filtering in Tempo and Grafana.

# api_gateway.py
from uuid import uuid4
import time

from fastapi import FastAPI  # assuming FastAPI; any async framework works the same way
from opentelemetry.propagate import inject
from opentelemetry.trace import StatusCode

from tracing_setup import setup_tracing

# PredictRequest and orchestrate_pipeline are defined elsewhere in the service
app = FastAPI()
tracer = setup_tracing("ml.api_gateway")

@app.post("/predict")
async def predict(request: PredictRequest):
    with tracer.start_as_current_span("ml.pipeline.request") as root_span:
        request_id = str(uuid4())

        root_span.set_attribute("ml.request.id",       request_id)
        root_span.set_attribute("ml.model.name",       request.model)
        root_span.set_attribute("ml.input.type",       request.input_type)
        root_span.set_attribute("ml.input.size_bytes", len(request.payload))
        root_span.set_attribute("ml.client.id",        request.client_id)

        headers = {"X-Request-ID": request_id}
        inject(headers)

        t_start = time.time()
        try:
            result = await orchestrate_pipeline(request, headers)
            root_span.set_attribute("ml.output.tokens",    result.token_count)
            root_span.set_attribute("ml.pipeline.success", True)
            root_span.set_attribute("ml.pipeline.latency_ms",
                round((time.time() - t_start) * 1000, 2))
            return result

        except Exception as e:
            root_span.set_status(StatusCode.ERROR, str(e))
            root_span.record_exception(e)
            root_span.set_attribute("ml.pipeline.success", False)
            raise

5.3 Feature Extractor: The Most Critical Stage to Instrument

This is where the most common production bottleneck lives. The instrumentation separates cache latency from database latency, making the two failure modes independently observable.

# feature_extractor.py
import time

from opentelemetry.propagate import extract

from tracing_setup import setup_tracing

# cache and feature_db are the service's existing async clients
tracer = setup_tracing("ml.feature_extractor")

async def extract_features(payload: dict, headers: dict) -> dict:
    ctx = extract(headers)

    with tracer.start_as_current_span(
        "ml.feature_extraction", context=ctx
    ) as span:
        feature_ids = payload["feature_ids"]
        span.set_attribute("ml.features.requested_count", len(feature_ids))

        t_cache = time.time()
        cached      = await cache.mget(feature_ids)
        cache_hits  = sum(1 for v in cached if v is not None)
        cache_misses = len(feature_ids) - cache_hits

        span.set_attribute("ml.features.cache_hits",    cache_hits)
        span.set_attribute("ml.features.cache_misses",  cache_misses)
        span.set_attribute("ml.features.cache_hit_rate",
            round(cache_hits / len(feature_ids), 3))
        span.set_attribute("ml.features.cache_lookup_ms",
            round((time.time() - t_cache) * 1000, 2))

        db_features = {}  # defined even when every feature is served from cache
        if cache_misses > 0:
            t_db = time.time()
            missing_ids = [
                fid for fid, v in zip(feature_ids, cached) if v is None
            ]
            db_features = await feature_db.get_batch(missing_ids)
            span.set_attribute("ml.features.db_lookup_ms",
                round((time.time() - t_db) * 1000, 2))

            still_missing = set(missing_ids) - set(db_features.keys())
            if still_missing:
                span.add_event("ml.features.unavailable", {
                    "count":      len(still_missing),
                    "sample_ids": str(list(still_missing)[:3]),
                })
                span.set_attribute("ml.features.unavailable_count",
                    len(still_missing))

        features = {
            **{k: v for k, v in zip(feature_ids, cached) if v is not None},
            **db_features
        }
        span.set_attribute("ml.features.retrieved_count", len(features))
        return features

5.4 Model Inference: Decomposing GPU Time

The key insight here is separating queue_wait_ms from forward_pass_ms. Without this decomposition, a slow inference span is undiagnosable. With it, the difference between a GPU under memory pressure and an under-provisioned serving tier is immediately visible.

# model_inference.py
import time

import torch
from opentelemetry.propagate import extract

from tracing_setup import setup_tracing

# InferenceBatch, InferenceResult, MODEL_VERSION, model, gpu_semaphore
# and decode_output are defined elsewhere in the serving code
tracer = setup_tracing("ml.inference")

async def run_inference(
    batch: InferenceBatch, headers: dict
) -> InferenceResult:
    ctx = extract(headers)

    with tracer.start_as_current_span(
        "ml.inference", context=ctx
    ) as span:
        span.set_attribute("ml.batch.size",        batch.size)
        span.set_attribute("ml.batch.max_seq_len", batch.max_seq_len)
        span.set_attribute("ml.model.version",     MODEL_VERSION)
        span.set_attribute("ml.device.type",       "cuda")
        span.set_attribute("ml.device.id",
            str(torch.cuda.current_device()))

        # Measure queue wait separately from compute time
        t_queued = time.time()
        async with gpu_semaphore:
            queue_wait_ms = round((time.time() - t_queued) * 1000, 2)
            span.set_attribute("ml.inference.queue_wait_ms", queue_wait_ms)

            if queue_wait_ms > 200:
                span.add_event("ml.inference.high_queue_wait", {
                    "queue_wait_ms": queue_wait_ms,
                    "batch_size":    batch.size,
                })

            t_forward = time.time()
            with torch.no_grad():
                logits = model(
                    batch.input_ids.cuda(),
                    batch.attention_mask.cuda()
                )
            span.set_attribute("ml.inference.forward_pass_ms",
                round((time.time() - t_forward) * 1000, 2))

            span.set_attribute("ml.gpu.memory_allocated_mb",
                round(torch.cuda.memory_allocated() / 1e6, 1))
            span.set_attribute("ml.gpu.memory_reserved_mb",
                round(torch.cuda.memory_reserved() / 1e6, 1))

            return decode_output(logits)

5.5 Post-Processor: Making the Invisible Visible

The validation_attempts attribute is the single most valuable custom signal in this entire pipeline. It costs almost nothing to compute and surfaces model degradation well before it becomes visible in error rates.

# post_processor.py
from opentelemetry.propagate import extract
from opentelemetry.trace import StatusCode

from tracing_setup import setup_tracing

# parse_output, validate_schema, apply_repair and OutputValidationError
# are the service's existing validation helpers
tracer = setup_tracing("ml.post_processor")

async def post_process(
    raw_output: dict, headers: dict
) -> ProcessedOutput:
    ctx = extract(headers)

    with tracer.start_as_current_span(
        "ml.post_processing", context=ctx
    ) as span:
        attempts     = 0
        result       = None
        MAX_ATTEMPTS = 3

        while attempts < MAX_ATTEMPTS:
            attempts += 1
            candidate = parse_output(raw_output)
            error     = validate_schema(candidate)

            if error is None:
                result = candidate
                break

            span.add_event("ml.output.validation_failed", {
                "attempt": attempts,
                "reason":  error.code,
                "field":   error.field or "unknown",
            })

            if attempts < MAX_ATTEMPTS:
                raw_output = apply_repair(raw_output, error)

        # This is the early degradation signal metrics cannot see
        span.set_attribute(
            "ml.post_processing.validation_attempts", attempts
        )
        span.set_attribute("ml.output.valid", result is not None)

        if result is None:
            span.set_status(
                StatusCode.ERROR,
                f"Output validation failed after {MAX_ATTEMPTS} attempts"
            )
            raise OutputValidationError(
                f"Failed after {MAX_ATTEMPTS} attempts"
            )

        if attempts > 1:
            span.set_attribute("ml.output.required_repair", True)

        return result

6. What the Flamegraph Reveals in Practice

With these five services instrumented, a slow 1.8-second request produces the following flamegraph in Tempo:

ml.pipeline.request             [1820ms total]
|-- ml.input_validation         [   18ms]
|-- ml.feature_extraction       [  823ms]   <- 45% of total time
|   |-- cache_lookup            [   11ms]
|   `-- db_fallback             [  812ms]   <- actual bottleneck
|-- ml.batch_assembly           [   47ms]
|   `-- tokenizer_init          [   38ms]   <- cold start
|-- ml.inference                [  895ms]
|   |-- queue_wait_ms:          [  580ms]   <- waiting for GPU
|   `-- forward_pass_ms:        [  315ms]   <- actual compute
`-- ml.post_processing          [   37ms]
    `-- validation_attempts: 2              <- silent repair

In a single view, three distinct problems become apparent: the uncached feature store, the saturated GPU queue, and the post-processor silently repairing malformed outputs. Three problems, three teams to notify, three separate tickets. Found in ten minutes rather than two hours.

This is not a best-case scenario. This is what observability looks like when it is designed to answer the question "why is this specific request slow" rather than "what is the aggregate state of each service."


7. Querying Traces in Grafana and Tempo

Traces are only useful if they can be interrogated at scale. The following queries translate instrumented attributes into actionable alerts.

Identify requests with an abnormal GPU queue-to-compute ratio (this compares two span attributes in one filter; verify that your Tempo version's TraceQL supports it):

# TraceQL in Tempo
{ span.ml.inference.queue_wait_ms > span.ml.inference.forward_pass_ms }
| select(span.ml.batch.size, span.ml.device.id)

The following PromQL queries assume span durations and attributes are also exported as metrics, for example via the collector's spanmetrics connector; the exact metric names depend on that configuration. Detect slow feature store database fallbacks over the last 5 minutes:

histogram_quantile(0.95,
  sum by (service_name, le) (
    rate(ml_feature_extraction_db_lookup_ms_bucket[5m])
  )
)

Alert on rising revalidation rates (the early degradation signal):

sum(rate(ml_post_processing_validation_attempts_sum[10m]))
/
sum(rate(ml_post_processing_validation_attempts_count[10m]))
> 1.3
# Alert fires when average attempts exceeds 1.3
# Baseline is 1.0: all outputs valid on first attempt

Pipeline bottleneck overview by stage:

topk(5,
  histogram_quantile(0.95,
    sum by (service_name, le) (
      rate(
        duration_ms_bucket{ml_pipeline_name="recommendation"}[5m]
      )
    )
  )
)

8. Adaptive Sampling: Tracing at Scale Without the Cost

A production ML pipeline handling 5,000 requests per minute generates 5,000 traces per minute. At an average of 50 spans per trace, that is 250,000 spans per minute. Storing and indexing everything is expensive and degrades query performance.

The solution is tail sampling: deciding after the fact which traces to retain, once their outcome is known. Traces that are slow, erroneous, or exhibit anomalous attribute values are always kept. Routine traces are sampled at a low rate for baseline visibility.

# otelcol.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces:    50000
    policies:
      # Always keep slow traces: they explain user complaints
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 800 }

      # Always keep error traces
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }

      # Keep traces with multiple validation attempts
      - name: multi-attempt-validation
        type: numeric_attribute
        numeric_attribute:
          key:       ml.post_processing.validation_attempts
          min_value: 2

      # Keep traces with high GPU queue wait
      - name: high-gpu-queue
        type: numeric_attribute
        numeric_attribute:
          key:       ml.inference.queue_wait_ms
          min_value: 300

      # Sample 3% of the remainder for baseline coverage
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 3 }

In practice, this configuration retains 3 to 8% of traces while preserving 100% of the traces that are diagnostically useful. The baseline sample ensures that silent degradations accumulating across many normal-looking requests remain detectable through aggregation.


9. What Tracing Does Not Replace

Distributed tracing exposes temporal and cross-service bottlenecks. It does not replace every other observability tool, and it would be misleading to suggest otherwise.

Tracing shows where time is spent between instrumented function boundaries. It does not show what happens inside a function. A forward_pass_ms measurement of 800ms indicates that the forward pass is slow; it does not explain why at the CUDA kernel level. For that, a Python or C++ profiler is necessary.

Tracing does not replace GPU metrics for capacity planning and hardware saturation analysis, structured logs for debugging data and transformation errors, or LLM evaluation frameworks for assessing output quality. These four observability layers are complementary, not substitutable.

The genuine value of tracing in an ML pipeline is navigation. A latency p99 alert leads in two clicks to the corresponding trace in Tempo, which identifies the responsible service, whose trace_id correlates with its logs in Loki. That correlation is what gives power to each individual signal. Metrics tell you something is wrong. Traces tell you where to look. Logs tell you what happened when you get there.
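That trace-to-log correlation can be wired into standard Python logging with a filter that stamps every record with the current trace_id. This is a sketch: the ID source is injectable (a stub lambda here); in a real service it would read opentelemetry.trace.get_current_span().get_span_context().trace_id and render it as 32 hex characters, the form Tempo and Loki match on:

```python
import logging

def format_trace_id(trace_id: int) -> str:
    # OTLP trace IDs are 128-bit integers; backends match the 32-char hex form.
    return format(trace_id, "032x")

class TraceIdFilter(logging.Filter):
    """Attach the current trace_id to every log record.

    get_trace_id is injectable for testing; in production it would read
    the active OpenTelemetry span context instead of this stub.
    """
    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = format_trace_id(self._get_trace_id())
        return True  # never drops records, only annotates them

logger = logging.getLogger("ml.feature_extractor")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter(lambda: 0x4BF92F3577B34DA6A3CE929D0E0E4736))
```

With this in place, a trace_id copied from a Tempo trace pastes directly into a Loki query, closing the metrics-to-traces-to-logs loop.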


10. Conclusion: Metrics Tell You What, Traces Tell You Why

A team monitoring its ML pipeline exclusively with metrics is like a physician who only takes temperature readings. The fever confirms something is wrong. It does not say what.

Distributed tracing instrumented across an entire ML pipeline transforms "something is slow" into "the feature store is slow on new entities because the cache TTL is too short, affecting 23% of requests between 1 PM and 3 PM on high-traffic days." The difference between those two formulations is the difference between a two-hour investigation and a ten-minute ticket.

The four bottlenecks described in this article (the uncached feature store, the GPU queue hidden by utilization percentages, the silent revalidations, the cold-starting unmanaged component) are not rare pathological cases. They are the most frequently encountered patterns on production ML pipelines. They are invisible to classic metrics by design, not by accident.

Instrumenting an ML pipeline with OpenTelemetry end-to-end takes approximately one day of engineering work. Diagnosing these bottlenecks without tracing takes, on average, longer than that per incident.


Appendix: Span Attribute Reference

| Attribute | Type | Stage | Purpose |
| --- | --- | --- | --- |
| ml.request.id | string | Gateway | Correlate with application logs |
| ml.model.name | string | Gateway | Filter by model version |
| ml.features.cache_hit_rate | float | Feature Extractor | Detect cache degradation |
| ml.features.db_lookup_ms | float | Feature Extractor | Isolate database latency |
| ml.features.unavailable_count | int | Feature Extractor | Alert on missing features |
| ml.inference.queue_wait_ms | float | Inference | Detect GPU queue pressure |
| ml.inference.forward_pass_ms | float | Inference | Measure actual compute time |
| ml.gpu.memory_allocated_mb | float | Inference | Track memory pressure |
| ml.post_processing.validation_attempts | int | Post-Processor | Early degradation signal |
| ml.output.required_repair | bool | Post-Processor | Flag repaired outputs |
| ml.pipeline.success | bool | Gateway | End-to-end success tracking |
| ml.pipeline.latency_ms | float | Gateway | Total pipeline duration |

This article is part of an ongoing series on production observability for AI workloads.
Previous articles: OTel Collector as IT/OT Middleware · Instrumenting Industrial Assets with OTel · LLM Instrumentation with OpenTelemetry
