<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Erythix</title>
    <description>The latest articles on DEV Community by Erythix (@erythix_6d20050c4f1039b32).</description>
    <link>https://dev.to/erythix_6d20050c4f1039b32</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3787623%2Fa06cf192-7e4d-4ffa-bfa9-b645f1a92ddf.png</url>
      <title>DEV Community: Erythix</title>
      <link>https://dev.to/erythix_6d20050c4f1039b32</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erythix_6d20050c4f1039b32"/>
    <language>en</language>
    <item>
      <title>Distributed Tracing in ML Pipelines: From Preprocessing to Inference</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Sat, 21 Mar 2026 11:51:14 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/distributed-tracing-in-ml-pipelines-from-preprocessing-to-inference-1a76</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/distributed-tracing-in-ml-pipelines-from-preprocessing-to-inference-1a76</guid>
      <description>&lt;h2&gt;
  
  
  How OpenTelemetry exposes the bottlenecks your metrics will never see
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Samuel Desseaux · Erythix&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Lie of the Green Dashboard
&lt;/h2&gt;

&lt;p&gt;It is 2 PM on a Tuesday. Your team receives a user report: predictions have been slow since this morning. You open Grafana. CPU at 38%, GPU at 72%, HTTP error rate at 0.2%, p99 latency at 1.4s. Nothing breaches a configured threshold. You tell the user everything looks nominal.&lt;/p&gt;

&lt;p&gt;Two hours later, a second report. Then a third. The problem exists. Your tools cannot see it.&lt;/p&gt;

&lt;p&gt;This scenario is not hypothetical. It is the daily reality of most teams operating ML pipelines in production without distributed tracing. Classic metrics measure the state of a service at a given moment. They do not measure the life of a request as it travels through multiple services. These are two fundamentally different levels of observation, and conflating them is a systematic source of operational blind spots.&lt;/p&gt;

&lt;p&gt;The distinction matters more in ML pipelines than anywhere else in software engineering, because a machine learning pipeline is not a function. It is a chain of distributed transformations, each with its own dependencies, its own timing characteristics, and its own failure modes.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why an ML Pipeline Is Structurally Difficult to Observe
&lt;/h2&gt;

&lt;p&gt;Consider a typical production pipeline for a recommendation engine or an LLM completion service. A request arrives at the API Gateway, which validates it, enriches context via a feature store, assembles a batch, sends it to the model server, retrieves raw output, validates and formats it, then finally responds to the client. Six to eight distinct services, sometimes in different runtimes (Python, Go, Triton), sometimes on different machines, sometimes in different availability zones.&lt;/p&gt;

&lt;p&gt;In this context, a performance degradation can originate anywhere in the chain. And its cause is rarely where it appears to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The technical impact&lt;/strong&gt; is direct. Without cross-service visibility, diagnosing a regression takes hours. Engineers manually scan logs from each service, compare timestamps, and mentally reconstruct a sequence that tooling should surface automatically. This is not a knowledge problem; it is an instrumentation problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The business impact&lt;/strong&gt; is consistently underestimated. A slow ML pipeline means a recommendation engine that responds after the user has already scrolled past. It means an LLM assistant that feels broken. In the most critical contexts (fraud detection, credit scoring, medical triage), it means a decision rendered too late to be useful. Latency is not an infrastructure problem. It is a value delivery problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Four Bottlenecks That Metrics Will Never See
&lt;/h2&gt;

&lt;p&gt;Before reaching solutions, the problems deserve precise names. Field experience on production ML pipelines consistently surfaces four categories of bottlenecks that are structurally invisible to classic monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 The Cascade Bottleneck in Feature Extraction
&lt;/h3&gt;

&lt;p&gt;The feature store is the most under-monitored dependency in an ML pipeline. It handles requests efficiently when serving from cache, then falls back to the underlying database for the minority of cases where data is not warm. That minority can have a latency ten to fifty times higher than the cache path.&lt;/p&gt;

&lt;p&gt;What the metrics show: p50 at 15ms, p99 at 800ms on the feature store service. If you are already looking at that specific service dashboard, the problem is visible. If you are looking at aggregate pipeline latency, the feature store is buried in the noise. And if you do not know the feature store is the culprit, you will not look at its dashboard until the investigation is already underway.&lt;/p&gt;

&lt;p&gt;A distributed trace, by contrast, immediately shows that on slow requests the &lt;code&gt;feature_extraction&lt;/code&gt; span accounts for 60% of total pipeline time, and that it is consistently the &lt;code&gt;db_fallback&lt;/code&gt; child span driving the duration.&lt;/p&gt;
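
&lt;p&gt;A minimal sketch of how that &lt;code&gt;db_fallback&lt;/code&gt; child span can be produced on the extractor side (the fuller, attribute-based version appears in section 5.3; &lt;code&gt;cache&lt;/code&gt; and &lt;code&gt;feature_db&lt;/code&gt; are the same clients used there, and the span names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: record the database fallback as its own child span so a flamegraph
# shows "db_fallback" as a distinct bar under feature_extraction.
async def extract_with_fallback_span(feature_ids: list) -&amp;gt; dict:
    with tracer.start_as_current_span("ml.feature_extraction"):
        cached = await cache.mget(feature_ids)                    # fast path
        missing = [f for f, v in zip(feature_ids, cached) if v is None]
        db_features = {}
        if missing:
            with tracer.start_as_current_span("db_fallback") as db_span:
                db_span.set_attribute("ml.features.cache_misses", len(missing))
                db_features = await feature_db.get_batch(missing)  # slow path
        return {
            **{f: v for f, v in zip(feature_ids, cached) if v is not None},
            **db_features,
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;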

&lt;p&gt;The business impact crystallizes around new users: those whose features are not yet warm in cache experience the worst latency precisely at the moment when engagement is most fragile and first impressions are being formed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 The GPU Queue Hiding Behind Utilization Numbers
&lt;/h3&gt;

&lt;p&gt;The GPU is the most expensive resource in an ML infrastructure. Its standard monitoring metric is utilization percentage. 75% looks healthy, neither underused nor saturated.&lt;/p&gt;

&lt;p&gt;But GPU utilization measures the percentage of time the GPU is executing kernels. It does not measure the time requests spend waiting in the queue for GPU access. On a GPU showing 75% utilization, a request can still spend 60% of its billed wall-clock time waiting for earlier requests to release memory before its own computation begins.&lt;/p&gt;

&lt;p&gt;Distributed tracing decomposes the &lt;code&gt;inference&lt;/code&gt; span into two distinct measurements: &lt;code&gt;queue_wait_ms&lt;/code&gt; and &lt;code&gt;forward_pass_ms&lt;/code&gt;. When the ratio of queue wait to forward pass exceeds 1, the GPU is a bottleneck regardless of what the utilization gauge reads.&lt;/p&gt;
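
&lt;p&gt;That rule of thumb translates directly into a classification you can compute from the two attributes recorded in section 5.4. A hedged sketch (the 1.0 threshold is the rule of thumb above, not a universal constant):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: classify a slow inference span from its queue/compute breakdown.
def classify_inference(queue_wait_ms: float, forward_pass_ms: float) -&amp;gt; str:
    ratio = queue_wait_ms / max(forward_pass_ms, 1e-6)
    if ratio &amp;gt; 1.0:
        # Requests spend longer waiting for the GPU than using it.
        return "queue-bound: add replicas or reduce batch size"
    return "compute-bound: optimize the model or the batch shape"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;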

&lt;p&gt;The economics are stark. On a service handling 10,000 requests per hour, if each request waits 300ms in the GPU queue, that is 3,000 seconds of accumulated client-facing latency per hour. And the GPU billing meter runs equally on queue time and compute time.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Silent Revalidations
&lt;/h3&gt;

&lt;p&gt;This pattern is the most deceptive of the four. Your HTTP error rate is 0%. Users receive valid responses. Everything appears correct by every standard metric.&lt;/p&gt;

&lt;p&gt;Behind the scenes, the post-processor is receiving malformed model outputs: truncated JSON, missing required fields, unexpected formats. It repairs them by replaying validation with relaxed parameters, sometimes by re-invoking the model. The end user sees a valid response with slightly elevated latency. The monitoring dashboard sees nothing unusual.&lt;/p&gt;

&lt;p&gt;This behavior is an early indicator of model degradation. A model that starts producing malformed outputs 5% of the time will progressively worsen to 10%, then 20%. Without measuring the revalidation rate, the degradation only becomes detectable through HTTP error rate increases, which is to say: too late, after user-facing failures have already begun.&lt;/p&gt;

&lt;p&gt;Distributed tracing makes these validation attempts visible as attributes and events on the &lt;code&gt;post_processing&lt;/code&gt; span. Aggregated across requests, they form an early warning signal that HTTP-level metrics fundamentally cannot provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 The Unmanaged Component Cold Start
&lt;/h3&gt;

&lt;p&gt;The tokenizer, the image preprocessor, the feature scaler: these components are typically loaded from disk or initialized on the first request, then held in memory. Or they are supposed to be.&lt;/p&gt;

&lt;p&gt;In practice, unexpected reloads occur for a variety of reasons: worker rotation, memory eviction under pressure, partial deployments, lifecycle bugs. The result is a bimodal latency distribution: the vast majority of requests are fast, while a minority are slow in a pattern that does not look random.&lt;/p&gt;

&lt;p&gt;Identifying this on aggregated metrics is difficult because the mean and even the p99 can appear acceptable if the reloads are infrequent. On a trace, the cold start appears as a &lt;code&gt;tokenizer_init&lt;/code&gt; span of 300ms on affected requests and absent on clean ones. The pattern is immediately legible on a flamegraph. On a metrics dashboard, it dissolves into the p95 histogram and becomes invisible.&lt;/p&gt;
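
&lt;p&gt;Instrumenting this requires nothing more than wrapping the lazy initialization in a span. A sketch, assuming a module-level cache and a hypothetical &lt;code&gt;load_tokenizer_from_disk&lt;/code&gt; loader:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: make tokenizer (re)initialization visible as its own span. When a
# worker restart or memory eviction clears the cache, the reload shows up as
# a "tokenizer_init" span on exactly the requests that paid for it.
_tokenizer = None

def get_tokenizer():
    global _tokenizer
    if _tokenizer is None:
        with tracer.start_as_current_span("tokenizer_init") as span:
            span.set_attribute("ml.component", "tokenizer")
            _tokenizer = load_tokenizer_from_disk()  # hypothetical loader
    return _tokenizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;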




&lt;h2&gt;
  
  
  4. What Distributed Tracing Changes Structurally
&lt;/h2&gt;

&lt;p&gt;Before turning to the implementation, the conceptual case deserves to be stated clearly.&lt;/p&gt;

&lt;p&gt;A metric is an aggregation. It discards information about individual requests. It cannot tell you that user X's request was slow because of the feature store while user Y's request was slow because of the GPU queue.&lt;/p&gt;

&lt;p&gt;A log is an isolated event. It knows what happened inside one service, but not how that event relates to what happened in the other services handling the same request.&lt;/p&gt;

&lt;p&gt;A trace is a causal and temporal view of a request as it traverses all services. It links events by their &lt;code&gt;trace_id&lt;/code&gt;, preserves the parent-child relationship between spans, measures each stage individually, and allows navigation from the aggregate view (a flamegraph across all requests) to the individual view (one specific slow request) in two clicks.&lt;/p&gt;

&lt;p&gt;This is the difference between knowing your pipeline is sometimes slow and knowing why this specific request took 1.8 seconds at 2:03 PM on Tuesday.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Instrumentation Architecture: Five Services, One Trace
&lt;/h2&gt;

&lt;p&gt;The following implementation covers a complete ML pipeline instrumented end-to-end. Each service produces spans. Context propagates via HTTP headers. Tempo aggregates spans into navigable traces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Client]
    |
    v
[API Gateway]              &amp;lt;- trace_id created here
    |
    |---&amp;gt; [Input Validator]        &amp;lt;- child span 1
    |
    |---&amp;gt; [Feature Extractor]      &amp;lt;- child span 2
    |           `---&amp;gt; [Feature Store]   &amp;lt;- grandchild span 2.1
    |
    |---&amp;gt; [Batch Assembler]        &amp;lt;- child span 3
    |           `---&amp;gt; [Tokenizer]       &amp;lt;- grandchild span 3.1
    |
    |---&amp;gt; [Model Inference]        &amp;lt;- child span 4
    |           `---&amp;gt; [Model Server]    &amp;lt;- grandchild span 4.1
    |
    `---&amp;gt; [Post-Processor]         &amp;lt;- child span 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.1 Shared Initialization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tracing_setup.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.resources&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tracer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service.version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVICE_VERSION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment.environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ENV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PIPELINE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://otel-collector:4317&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;max_queue_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_export_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
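
&lt;p&gt;One operational detail worth adding around this helper: &lt;code&gt;BatchSpanProcessor&lt;/code&gt; buffers spans in memory, so a worker that exits abruptly can drop its last batch. Registering a flush on shutdown is cheap insurance. A sketch using &lt;code&gt;atexit&lt;/code&gt; (one option among several):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: flush buffered spans before the process exits so short-lived
# workers do not silently drop their final export batch.
import atexit
from opentelemetry import trace
from tracing_setup import setup_tracing

def setup_tracing_with_flush(service_name: str):
    tracer = setup_tracing(service_name)
    provider = trace.get_tracer_provider()
    atexit.register(provider.shutdown)  # shutdown() flushes pending spans
    return tracer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;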



&lt;h3&gt;
  
  
  5.2 API Gateway: Creating the Root Span
&lt;/h3&gt;

&lt;p&gt;The gateway is where the &lt;code&gt;trace_id&lt;/code&gt; is born. Every subsequent service will attach its spans to this root. The attributes set here define the dimensions available for filtering in Tempo and Grafana.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# api_gateway.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.propagate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StatusCode&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.api_gateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PredictRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.request.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.model.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.input.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.input.size_bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.client.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Request-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;inject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;t_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;orchestrate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
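
&lt;p&gt;The &lt;code&gt;inject(headers)&lt;/code&gt; call only pays off if those headers actually travel with every downstream request. A sketch of what &lt;code&gt;orchestrate_pipeline&lt;/code&gt; might look like, assuming an httpx-based orchestrator and hypothetical service URLs; any HTTP client works as long as the same headers dict is forwarded so each service can call &lt;code&gt;extract()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: forward the injected headers on every hop so the trace context
# (traceparent) reaches each downstream service.
import httpx

async def orchestrate_pipeline(request, headers: dict):
    async with httpx.AsyncClient() as client:
        validated = (await client.post(
            "http://input-validator/validate",
            json=request.dict(), headers=headers,
        )).json()
        features = (await client.post(
            "http://feature-extractor/extract",
            json=validated, headers=headers,
        )).json()
        # Batch assembly, inference and post-processing follow the same
        # pattern: same headers forwarded, new child spans on the callee side.
        return features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;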



&lt;h3&gt;
  
  
  5.3 Feature Extractor: The Most Critical Stage to Instrument
&lt;/h3&gt;

&lt;p&gt;This is where the most common production bottleneck lives. The instrumentation separates cache latency from database latency, making the two failure modes independently observable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# feature_extractor.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.propagate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.feature_extractor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.feature_extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;feature_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.requested_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;t_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cached&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cache_hits&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cache_misses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cache_hits&lt;/span&gt;

        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_hits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;cache_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_misses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;cache_misses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_hits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_lookup_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache_misses&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;t_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;missing_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;fid&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;db_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;feature_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.db_lookup_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="n"&gt;still_missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.unavailable_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;db_features&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.retrieved_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.4 Model Inference: Decomposing GPU Time
&lt;/h3&gt;

&lt;p&gt;The key insight here is separating &lt;code&gt;queue_wait_ms&lt;/code&gt; from &lt;code&gt;forward_pass_ms&lt;/code&gt;. Without this decomposition, a slow inference span is undiagnosable. With it, the difference between a GPU under memory pressure and an under-provisioned serving tier is immediately visible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# model_inference.py
&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InferenceBatch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;InferenceResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.batch.size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.batch.max_seq_len&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_seq_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.model.version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="n"&gt;MODEL_VERSION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.device.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.device.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_device&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

        &lt;span class="c1"&gt;# Measure queue wait separately from compute time
&lt;/span&gt;        &lt;span class="n"&gt;t_queued&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gpu_semaphore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_queued&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.queue_wait_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.high_queue_wait&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue_wait_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="n"&gt;t_forward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.forward_pass_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_forward&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.gpu.memory_allocated_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memory_allocated&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.gpu.memory_reserved_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memory_reserved&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;decode_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
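
&lt;p&gt;One caveat on the &lt;code&gt;forward_pass_ms&lt;/code&gt; measurement above: CUDA kernels launch asynchronously, so reading the clock right after the model call can return before the GPU has actually finished the work. A hedged variant of that timing block, trading a small stall for an accurate number:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Variant of the timing block in run_inference: synchronize before reading
# the clock so forward_pass_ms reflects completed GPU work rather than
# kernel launch time. Some teams enable this only on a sampled subset of
# requests to avoid paying the synchronization cost everywhere.
t_forward = time.time()
with torch.no_grad():
    logits = model(batch.input_ids.cuda(), batch.attention_mask.cuda())
torch.cuda.synchronize()
span.set_attribute(
    "ml.inference.forward_pass_ms",
    round((time.time() - t_forward) * 1000, 2),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;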



&lt;h3&gt;
  
  
  5.5 Post-Processor: Making the Invisible Visible
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;validation_attempts&lt;/code&gt; attribute is the single most valuable custom signal in this entire pipeline. It costs nothing to compute and surfaces model degradation weeks before it becomes visible in error rates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# post_processor.py
&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.post_processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProcessedOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.post_processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;attempts&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;error&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.validation_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attempt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;raw_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_repair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# This is the early degradation signal metrics cannot see
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.post_processing.validation_attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.valid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output validation failed after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;OutputValidationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.required_repair&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. What the Flamegraph Reveals in Practice
&lt;/h2&gt;

&lt;p&gt;With these five services instrumented, a slow 1.8-second request produces the following flamegraph in Tempo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml.pipeline.request             [1 820ms total]
|-- ml.input_validation         [   18ms]
|-- ml.feature_extraction       [  823ms]   &amp;lt;- 45% of total time
|   |-- cache_lookup            [   11ms]
|   `-- db_fallback             [  812ms]   &amp;lt;- actual bottleneck
|-- ml.batch_assembly           [   47ms]
|   `-- tokenizer_init          [   38ms]   &amp;lt;- cold start
|-- ml.inference                [  895ms]
|   |-- queue_wait_ms:          [  580ms]   &amp;lt;- waiting for GPU
|   `-- forward_pass_ms:        [  315ms]   &amp;lt;- actual compute
`-- ml.post_processing          [   37ms]
    `-- validation_attempts: 2              &amp;lt;- silent repair
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a single view, three distinct problems become apparent: the uncached feature store, the saturated GPU queue, and the post-processor silently repairing malformed outputs. Three problems, three teams to notify, three separate tickets. Found in ten minutes rather than two hours.&lt;/p&gt;

&lt;p&gt;This is not a best-case scenario. This is what observability looks like when it is designed to answer the question "why is this specific request slow" rather than "what is the aggregate state of each service."&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Querying Traces in Grafana and Tempo
&lt;/h2&gt;

&lt;p&gt;Traces are only useful if they can be interrogated at scale. The following queries translate instrumented attributes into actionable alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify requests with an abnormal GPU queue-to-compute ratio:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;TraceQL&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Tempo&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue_wait_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward_pass_ms&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Detect slow feature store database fallbacks over the last 5 minutes.&lt;/strong&gt; This query and the two that follow are PromQL rather than TraceQL: they assume the same instrumented values are also exported as metrics (for example via the OpenTelemetry metrics SDK or a Collector-side connector), since span attributes alone do not produce histogram series.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95,
  sum by (service_name, le) (
    rate(ml_feature_extraction_db_lookup_ms_bucket[5m])
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alert on rising revalidation rates (the early degradation signal):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(ml_post_processing_validation_attempts_sum[10m]))
/
sum(rate(ml_post_processing_validation_attempts_count[10m]))
&amp;gt; 1.3
# Alert fires when average attempts exceeds 1.3
# Baseline is 1.0: all outputs valid on first attempt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pipeline bottleneck overview by stage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(5,
  histogram_quantile(0.95,
    sum by (service_name, le) (
      rate(
        duration_ms_bucket{ml_pipeline_name="recommendation"}[5m]
      )
    )
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  8. Adaptive Sampling: Tracing at Scale Without the Cost
&lt;/h2&gt;

&lt;p&gt;A production ML pipeline handling 5,000 requests per minute generates 5,000 traces per minute. At an average of 50 spans per trace, that is 250,000 spans per minute. Storing and indexing everything is expensive and degrades query performance.&lt;/p&gt;

&lt;p&gt;The solution is tail sampling: deciding after the fact which traces to retain, once their outcome is known. Traces that are slow, erroneous, or exhibit anomalous attribute values are always kept. Routine traces are sampled at a low rate for baseline visibility.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otelcol.yaml&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;num_traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="m"&gt;50000&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Always keep slow traces: they explain user complaints&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slow-requests&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency&lt;/span&gt;
        &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;threshold_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;800&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;# Always keep error traces&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
        &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;# Keep traces with multiple validation attempts&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;multi-attempt-validation&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;numeric_attribute&lt;/span&gt;
        &lt;span class="na"&gt;numeric_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s"&gt;ml.post_processing.validation_attempts&lt;/span&gt;
          &lt;span class="na"&gt;min_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

      &lt;span class="c1"&gt;# Keep traces with high GPU queue wait&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high-gpu-queue&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;numeric_attribute&lt;/span&gt;
        &lt;span class="na"&gt;numeric_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s"&gt;ml.inference.queue_wait_ms&lt;/span&gt;
          &lt;span class="na"&gt;min_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;

      &lt;span class="c1"&gt;# Sample 3% of the remainder for baseline coverage&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;baseline&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;probabilistic&lt;/span&gt;
        &lt;span class="na"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;3&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, this configuration retains 3 to 8% of traces while preserving 100% of the traces that are diagnostically useful. The baseline sample ensures that silent degradations accumulating across many normal-looking requests remain detectable through aggregation.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What Tracing Does Not Replace
&lt;/h2&gt;

&lt;p&gt;Distributed tracing exposes temporal and cross-service bottlenecks. It does not replace every other observability tool, and it would be misleading to suggest otherwise.&lt;/p&gt;

&lt;p&gt;Tracing shows where time is spent between instrumented function boundaries. It does not show what happens inside a function. An inference span whose &lt;code&gt;forward_pass_ms&lt;/code&gt; attribute reads 800ms indicates that the forward pass is slow; it does not explain why at the CUDA kernel level. For that, a Python or C++ profiler is necessary.&lt;/p&gt;

&lt;p&gt;Tracing does not replace GPU metrics for capacity planning and hardware saturation analysis, structured logs for debugging data and transformation errors, or LLM evaluation frameworks for assessing output quality. These four observability layers are complementary, not substitutable.&lt;/p&gt;

&lt;p&gt;The genuine value of tracing in an ML pipeline is navigation. A latency p99 alert leads in two clicks to the corresponding trace in Tempo, which identifies the responsible service, whose &lt;code&gt;trace_id&lt;/code&gt; correlates with its logs in Loki. That correlation is what gives power to each individual signal. Metrics tell you something is wrong. Traces tell you where to look. Logs tell you what happened when you get there.&lt;/p&gt;
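
&lt;p&gt;That log correlation step is worth making concrete. The snippet below is a minimal sketch, assuming a standard Python &lt;code&gt;logging&lt;/code&gt; setup and the OpenTelemetry SDK: it injects the active &lt;code&gt;trace_id&lt;/code&gt; into every log line so Loki can be filtered on the same identifier Tempo displays. The &lt;code&gt;opentelemetry-instrumentation-logging&lt;/code&gt; package can perform the same injection automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# log_correlation.py (illustrative sketch, not part of the pipeline above)
import logging

from opentelemetry import trace


class TraceIdFilter(logging.Filter):
    """Attach the active trace_id (or '-' when no span is recording) to each record."""

    def filter(self, record: logging.LogRecord) -&amp;gt; bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("ml.pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;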




&lt;h2&gt;
  
  
  10. Conclusion: Metrics Tell You What, Traces Tell You Why
&lt;/h2&gt;

&lt;p&gt;A team monitoring its ML pipeline exclusively with metrics is like a physician who only takes temperature readings. The fever confirms something is wrong. It does not say what.&lt;/p&gt;

&lt;p&gt;Distributed tracing instrumented across an entire ML pipeline transforms "something is slow" into "the feature store is slow on new entities because the cache TTL is too short, affecting 23% of requests between 1 PM and 3 PM on high-traffic days." The difference between those two formulations is the difference between a two-hour investigation and a ten-minute ticket.&lt;/p&gt;

&lt;p&gt;The four bottlenecks described in this article (the uncached feature store, the GPU queue hidden by utilization percentages, the silent revalidations, and the cold-starting unmanaged component) are not rare pathological cases. They are the most frequently encountered patterns in production ML pipelines. They are invisible to classic metrics by design, not by accident.&lt;/p&gt;

&lt;p&gt;Instrumenting an ML pipeline with OpenTelemetry end-to-end takes approximately one day of engineering work. Diagnosing these bottlenecks without tracing takes, on average, longer than that per incident.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Span Attribute Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.request.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Correlate with application logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.model.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Filter by model version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.features.cache_hit_rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Feature Extractor&lt;/td&gt;
&lt;td&gt;Detect cache degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.features.db_lookup_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Feature Extractor&lt;/td&gt;
&lt;td&gt;Isolate database latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.features.unavailable_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;Feature Extractor&lt;/td&gt;
&lt;td&gt;Alert on missing features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.inference.queue_wait_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Detect GPU queue pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.inference.forward_pass_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Measure actual compute time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.gpu.memory_allocated_mb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Track memory pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.post_processing.validation_attempts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;Post-Processor&lt;/td&gt;
&lt;td&gt;Early degradation signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.output.required_repair&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;Post-Processor&lt;/td&gt;
&lt;td&gt;Flag repaired outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.pipeline.success&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;End-to-end success tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.pipeline.latency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Total pipeline duration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;This article is part of an ongoing series on production observability for AI workloads.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Previous articles: OTel Collector as IT/OT Middleware · Instrumenting Industrial Assets with OTel · LLM Instrumentation with OpenTelemetry&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>machinelearning</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>Tracing a RAG Chain End-to-End: Where OpenTelemetry Stops and Where You Need to Instrument Yourself</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Mon, 16 Mar 2026 17:05:36 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/tracing-a-rag-chain-end-to-end-where-opentelemetry-stops-and-where-you-need-to-instrument-yourself-2840</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/tracing-a-rag-chain-end-to-end-where-opentelemetry-stops-and-where-you-need-to-instrument-yourself-2840</guid>
      <description>&lt;p&gt;There are already plenty of "Getting started with OpenTelemetry" tutorials. This is not one of them.&lt;/p&gt;

&lt;p&gt;This article starts with a candid observation: if you have OTel running in your infrastructure and you've just added a RAG pipeline to production, your traces look impressive, but they're mostly lying to you by omission. You have spans and latency numbers. What you don't have is visibility into the five stages that actually determine whether your system is working correctly.&lt;/p&gt;

&lt;p&gt;OTel wasn't designed for RAG. It was designed for distributed systems built around HTTP, databases, and message queues: all well-understood primitives with established semantic conventions. A RAG pipeline adds several new primitives that have no standard OTel semantics yet. The OpenTelemetry GenAI SIG is working on it, but slowly. In the meantime, production systems are running blind.&lt;/p&gt;

&lt;p&gt;The goal here is to be precise about where the boundary is and how to cross it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a RAG chain actually traverses
&lt;/h2&gt;

&lt;p&gt;A minimal RAG pipeline involves eight distinct stages, each with its own failure modes and its own observability requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query ingress&lt;/strong&gt;: the user request arrives, gets validated, gets routed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: the query is converted to a vector representation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: the vector DB is searched, ranked chunks are returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking&lt;/strong&gt;: chunks are rescored by a cross-encoder, poor matches are dropped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt assembly&lt;/strong&gt;: context is injected, the prompt is constructed, tokens are counted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM call&lt;/strong&gt;: the assembled prompt is sent to the model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-processing&lt;/strong&gt;: the response is parsed, validated, formatted, filtered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response&lt;/strong&gt;: the final answer is returned to the caller&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these stages has distinct latency characteristics, distinct failure modes, and distinct diagnostic signals. The problem is that OTel treats them very differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  What OTel covers natively
&lt;/h2&gt;

&lt;p&gt;Auto-instrumentation handles the outer envelope well. For a typical Python service running FastAPI or Flask with OpenTelemetry auto-instrumentation, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A root span for the HTTP request (stage 1)&lt;/li&gt;
&lt;li&gt;The HTTP response span (stage 8)&lt;/li&gt;
&lt;li&gt;Any outbound HTTP calls you make, including the API call to your LLM provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Three spans. In a pipeline with eight meaningful stages.&lt;/p&gt;

&lt;p&gt;Depending on your vector database client, you might get a span for the retrieval call. Weaviate has partial SDK-level instrumentation; most others don't. But even when you get that span, it gives you network latency, not semantic information. You know the query arrived and returned. You don't know how many results came back, what their similarity scores were, or whether the result set was empty.&lt;/p&gt;

&lt;p&gt;The picture after auto-instrumentation: two solid spans at the edges, one partial span in the middle and four stages that are completely invisible.&lt;/p&gt;
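
&lt;p&gt;For reference, that envelope usually comes from a few lines of setup. The sketch below is illustrative, assuming FastAPI and the &lt;code&gt;requests&lt;/code&gt; library for outbound calls; with &lt;code&gt;httpx&lt;/code&gt;, the &lt;code&gt;opentelemetry-instrumentation-httpx&lt;/code&gt; package plays the same role.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# app.py (illustrative): the auto-instrumented "outer envelope"
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = FastAPI()

# Root span per incoming HTTP request (stages 1 and 8)
FastAPIInstrumentor.instrument_app(app)

# Spans for outbound HTTP calls, including the LLM provider API (stage 6)
RequestsInstrumentor().instrument()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;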

&lt;p&gt;The eight stages below show what auto-instrumentation sees and what it misses. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi6fo9outd95bbyai0us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi6fo9outd95bbyai0us.png" alt=" " width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The five dead zones
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Zone 1: Embedding
&lt;/h3&gt;

&lt;p&gt;When you call an embedding model, whether via OpenAI, Cohere, or a local sentence-transformer, you get a latency number if you're lucky and nothing otherwise. What you don't capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Which model, which version.&lt;/strong&gt; Model drift is real. A silent upgrade to your embedding provider changes vector geometry and breaks retrieval. If you're not recording &lt;code&gt;model_name&lt;/code&gt; and &lt;code&gt;model_version&lt;/code&gt; on every span, you'll spend days debugging what looks like a retrieval problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector dimensionality.&lt;/strong&gt; A dimension mismatch between your embedding model and your index is a hard failure that generates cryptic errors. Logging the output dimension takes one attribute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU vs GPU time split.&lt;/strong&gt; For on-premise inference, this is the first signal that hardware saturation is affecting quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity score distribution.&lt;/strong&gt; The embedding stage itself doesn't produce this, but it sets up the retrieval stage. Tracking what "normal" looks like here is your baseline for drift detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 2: Retrieval
&lt;/h3&gt;

&lt;p&gt;The retrieval call to your vector database may produce an outbound HTTP span if you're using a REST-based client. But that span contains only timing and status code. What it doesn't contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Number of chunks returned.&lt;/strong&gt; If your retrieval returns zero results, you want to know immediately, and you want to know the query that triggered it, not just the timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity scores.&lt;/strong&gt; The distribution of top-k scores tells you whether the retrieval was confident or speculative. A max score of 0.94 and a max score of 0.41 both count as "retrieval succeeded" in OTel's view. They're completely different situations operationally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking time as a separate stage.&lt;/strong&gt; Many pipelines combine retrieval and reranking into a single function call. Separating them in your spans is worth the effort: reranking is frequently the actual latency bottleneck, and you'd never know.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 3: Reranking
&lt;/h3&gt;

&lt;p&gt;This stage is the most consistently invisible and the most consistently underestimated. A cross-encoder reranker running a full forward pass over each query-chunk pair adds significant latency, sometimes more than the LLM call itself. OTel sees none of it unless you explicitly instrument it.&lt;/p&gt;

&lt;p&gt;What to capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reranking duration as its own span&lt;/li&gt;
&lt;li&gt;Input chunk count vs. output chunk count (how many were filtered)&lt;/li&gt;
&lt;li&gt;Score threshold applied&lt;/li&gt;
&lt;li&gt;Model name and batch size&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 4: Prompt assembly
&lt;/h3&gt;

&lt;p&gt;This is where silent failures are born.&lt;/p&gt;

&lt;p&gt;When you assemble a prompt, you make decisions: which chunks to include, how to order them, how to truncate if the context window is tight. OTel has no visibility into any of this. You can have a system that routinely truncates critical context and generates factually incomplete responses, and your traces will show a perfectly healthy green pipeline.&lt;/p&gt;

&lt;p&gt;What to capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimated token count before sending&lt;/li&gt;
&lt;li&gt;Whether truncation occurred (&lt;code&gt;context_truncated: true/false&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Number of chunks injected&lt;/li&gt;
&lt;li&gt;Whether conflicting chunks were injected (requires a light coherence check)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 5: LLM call payload
&lt;/h3&gt;

&lt;p&gt;This is the subtlest dead zone. You do have a span for the LLM call: it's the outbound HTTP request. But the span contains no semantic information about what happened inside that call.&lt;/p&gt;

&lt;p&gt;The SDKs for Anthropic, OpenAI, and most other LLM providers do not emit OTel attributes for tokens, stop reason, or model-level parameters. You have to enrich the span yourself after the response arrives. Without this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You cannot track token costs&lt;/li&gt;
&lt;li&gt;You cannot alert on prompts that regularly hit the context limit (&lt;code&gt;stop_reason: length&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;You cannot detect when model behavior changes across versions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Instrumenting manually
&lt;/h2&gt;

&lt;p&gt;The pattern is consistent across all dead zones: wrap the operation in a span, set semantic attributes, and use a naming convention that survives grep.&lt;/p&gt;
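
&lt;p&gt;If you want to avoid repeating that boilerplate in every service, a small helper keeps the pattern uniform. This is an illustrative sketch, not a required abstraction; the per-stage examples below set their attributes inline, which works just as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tracing_helpers.py (illustrative sketch)
from contextlib import contextmanager
from typing import Optional

from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")


@contextmanager
def rag_stage(name: str, attributes: Optional[dict] = None):
    """Open one span per pipeline stage with a consistent tracer and naming."""
    # start_as_current_span already records exceptions and sets ERROR status
    # on unhandled errors; the helper only centralizes tracer and attributes.
    with tracer.start_as_current_span(name, attributes=attributes or {}) as span:
        yield span
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;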

&lt;h3&gt;
  
  
  Embedding
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.query.length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding.dimension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding.latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.query_preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.chunks_returned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.empty_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.max_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.min_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reranking
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.input_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.output_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.filtered_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.top_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prompt assembly
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt_assembly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;assembled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reranked_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.estimated_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.context_truncated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;truncated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.chunks_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunks_injected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.system_prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT_VERSION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;system_prompt_version&lt;/code&gt; attribute. Prompt changes are the most common cause of unexplained behavioral shifts. Versioning your system prompt and logging it on every span costs nothing and will save you multiple production investigations.&lt;/p&gt;
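
&lt;p&gt;One cheap way to produce that version tag is to derive it from the prompt text itself. A minimal sketch; the prompt content and the 12-character truncation are arbitrary choices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# prompt_version.py (illustrative sketch)
import hashlib

SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."

# The hash changes whenever anyone edits the prompt text, so every span
# records exactly which prompt produced the response.
SYSTEM_PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:12]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;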

&lt;h3&gt;
  
  
  LLM call enrichment
&lt;/h3&gt;

&lt;p&gt;The outbound HTTP span from auto-instrumentation already exists, but it carries only network-level data and is awkward to enrich directly. Wrap the call in a thin &lt;code&gt;llm.completion&lt;/code&gt; span instead, and set the semantic attributes once the response arrives; the auto-instrumented HTTP span becomes its child:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;request_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.usage.input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.usage.output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.stop_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.request.temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.request.max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;stop_reason&lt;/code&gt; attribute alone justifies this instrumentation. When &lt;code&gt;stop_reason&lt;/code&gt; is &lt;code&gt;"length"&lt;/code&gt;, it means the model ran out of context and stopped mid-response. In a RAG pipeline, this is almost always a prompt assembly bug. Without this attribute, the response looks valid until a human reads it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Naming conventions
&lt;/h2&gt;

&lt;p&gt;There is no OTel standard for RAG semantic conventions as of early 2026. The GenAI SIG has drafts in progress but nothing stable. Until there is a standard, the wrong choice is to invent arbitrary names per service. The right choice is to define a coherent convention internally and apply it consistently.&lt;/p&gt;

&lt;p&gt;The three-prefix approach:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prefix&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rag.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pipeline logic&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rag.retrieval.chunks_returned&lt;/code&gt;, &lt;code&gt;rag.prompt.context_truncated&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model interaction&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;llm.usage.input_tokens&lt;/code&gt;, &lt;code&gt;llm.stop_reason&lt;/code&gt;, &lt;code&gt;llm.model&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vec.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Vector operations&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vec.index.name&lt;/code&gt;, &lt;code&gt;vec.search.metric&lt;/code&gt;, &lt;code&gt;vec.dimension&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This naming makes your traces queryable by domain in VictoriaMetrics, OpenObserve, or any backend that supports attribute filtering. A query like &lt;code&gt;rag.retrieval.empty_result = true AND llm.stop_reason = "length"&lt;/code&gt; surfaces a specific failure pattern (empty retrieval leading to context-padded fallback response) in seconds.&lt;/p&gt;

&lt;p&gt;Avoid prefixes that shadow existing OTel conventions. &lt;code&gt;db.*&lt;/code&gt; is already used by database instrumentation. &lt;code&gt;http.*&lt;/code&gt; is already HTTP. Pick names that won't collide with auto-instrumented attributes.&lt;/p&gt;
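
&lt;p&gt;A simple way to keep the convention consistent across services is to centralize the attribute names in one module and import it everywhere. Illustrative sketch; the constants mirror the examples above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# rag_attributes.py (illustrative sketch): one registry, imported by every service

# rag.* : pipeline logic
RAG_RETRIEVAL_CHUNKS_RETURNED = "rag.retrieval.chunks_returned"
RAG_RETRIEVAL_EMPTY_RESULT = "rag.retrieval.empty_result"
RAG_PROMPT_CONTEXT_TRUNCATED = "rag.prompt.context_truncated"

# llm.* : model interaction
LLM_USAGE_INPUT_TOKENS = "llm.usage.input_tokens"
LLM_STOP_REASON = "llm.stop_reason"
LLM_MODEL = "llm.model"

# vec.* : vector operations
VEC_INDEX_NAME = "vec.index.name"
VEC_SEARCH_METRIC = "vec.search.metric"
VEC_DIMENSION = "vec.dimension"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Spans then reference the constants instead of string literals, so a rename stays a one-file change and grep finds every usage.&lt;/p&gt;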




&lt;h2&gt;
  
  
  What you do with complete traces
&lt;/h2&gt;

&lt;p&gt;Once the instrumentation is in place, four operational patterns become possible that were invisible before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P95 latency by stage.&lt;/strong&gt; Most teams assume the LLM call dominates pipeline latency. In practice, reranking is frequently the bottleneck, especially for models running on shared inference infrastructure. Without per-stage spans, you're optimizing the wrong thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty retrieval rate as a leading indicator.&lt;/strong&gt; An uptick in &lt;code&gt;rag.retrieval.empty_result = true&lt;/code&gt; before you see quality degradation in user feedback gives you a 24–48 hour warning window. It usually means your document index is stale or your embedding model has been silently upgraded. This is the most valuable leading indicator in a RAG system and it requires exactly one boolean attribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context truncation rate as a prompt quality signal.&lt;/strong&gt; If &lt;code&gt;rag.prompt.context_truncated = true&lt;/code&gt; appears on more than 5–10% of requests, your retrieved chunks are too long for your context window configuration. This is a retrieval tuning problem, not an LLM problem, but without the attribute, it looks like random response degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop reason distribution.&lt;/strong&gt; A rise in &lt;code&gt;llm.stop_reason = "length"&lt;/code&gt; correlates directly with content quality issues. Track it as a metric. Alert on it. It's a better signal than user satisfaction scores because it's available in real time.&lt;/p&gt;
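
&lt;p&gt;Backends differ in how they aggregate span attributes, so the cheapest way to track the stop reason distribution is to emit a counter next to the span. A minimal sketch with the OpenTelemetry metrics API follows; the meter name and the point where &lt;code&gt;stop_reason&lt;/code&gt; becomes available are assumptions about your pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import metrics

meter = metrics.get_meter("rag.pipeline")

# Counter labeled by stop reason: a rising "length" share flags truncated answers in real time.
completions = meter.create_counter(
    "llm.completions",
    description="LLM completions by stop reason",
)

def record_completion(stop_reason: str, model: str) -&gt; None:
    # Call this wherever the LLM response is parsed.
    completions.add(1, {"llm.stop_reason": stop_reason, "llm.model": model})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;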




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OTel is a foundation, not a solution. For conventional infrastructure (HTTP services, databases, queues), auto-instrumentation covers most of what matters. For AI pipelines, the meaningful events happen inside application logic that has no standard semantic model.&lt;/p&gt;

&lt;p&gt;The gap isn't a criticism of OTel. It's the normal boundary between generic infrastructure tooling and domain-specific observability. Every mature domain eventually develops its own semantic layer on top of the generic tracing substrate.&lt;/p&gt;

&lt;p&gt;For RAG pipelines, that layer doesn't exist yet as a standard. Building it yourself is not optional if you're operating these systems in production. The instrumentation described here adds less than 200 lines of code to a typical pipeline and transforms your traces from a latency meter into an operational instrument.&lt;/p&gt;

&lt;p&gt;The five dead zones (embedding, retrieval, reranking, prompt assembly, and LLM payload) are exactly where your system fails in interesting ways. Leaving them dark is a choice, and it's the wrong one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Samuel Desseaux, founder of Erythix · AI Observability &amp;amp; Industrial Monitoring&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;VictoriaMetrics Training Partner · &lt;a href="https://erythix.tech" rel="noopener noreferrer"&gt;erythix.tech&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is part of the AI Observability series. Related: "GPU utilization tells you nothing about inference quality" · "Sovereign observability stack for HPC workloads"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>rag</category>
    </item>
    <item>
      <title>SLURM in a nutshell: Architecture, Observability and Security for HPC Clusters</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Sat, 07 Mar 2026 14:55:38 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/slurm-in-a-nutshell-architecture-observability-and-security-for-hpc-clusters-5gna</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/slurm-in-a-nutshell-architecture-observability-and-security-for-hpc-clusters-5gna</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;SLURM powers Frontier, LUMI, and most of the TOP500. If you work with GPU clusters, AI training infrastructure, or scientific computing, understanding how it works is not optional.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is SLURM?
&lt;/h2&gt;

&lt;p&gt;SLURM (Simple Linux Utility for Resource Management) is an open-source cluster workload manager originally developed at Lawrence Livermore National Laboratory &lt;sup id="fnref1"&gt;1&lt;/sup&gt;. It is now the de facto standard for HPC environments worldwide, deployed on more than 60% of TOP500 systems &lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;It has three core responsibilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource allocation&lt;/strong&gt; assigns compute nodes to jobs based on configured policies: partitions, Quality of Service (QOS) rules, and fairshare weights. It accounts for CPU cores, memory, GPU devices, and network topology simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job scheduling&lt;/strong&gt; queues submitted jobs and launches them when resources become available. The default algorithm is backfill scheduling, which fills scheduling gaps with smaller jobs without delaying the larger ones already queued.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accounting&lt;/strong&gt; records every resource consumption event — who ran what, on which nodes, for how long, consuming how much CPU, memory, and GPU — via a dedicated daemon connected to a relational database.&lt;/p&gt;

&lt;p&gt;It operates on a heartbeat model: nodes report their state to a central controller, which dispatches queued jobs as resources free up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Four Daemons
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------------------------------------------------------+
|                        CONTROL PLANE                             |
|                                                                  |
|   +------------------+          +------------------+            |
|   |   slurmctld      |&amp;lt;--------&amp;gt;|   slurmdbd       |            |
|   |   TCP 6817       |          |   TCP 6819       |            |
|   |                  |          |                  |            |
|   |  Scheduler       |          |  Accounting GW   |            |
|   |  State manager   |          |  Only DB client  |            |
|   +--------+---------+          +--------+---------+            |
|            |                             |                       |
+------------|-----------------------------|-----------------------+
             |                             |
             | TCP 6818                    | SQL TCP 3306
             v                             v
+---------------------------+    +--------------------+
|   COMPUTE NODES           |    |   MariaDB          |
|                           |    |   Accounting DB    |
|   slurmd   slurmd   ...   |    +--------------------+
|   node01   node02         |
|                           |
|   cgroups v2 enforcement  |
|   Prolog / Epilog hooks   |
+---------------------------+
             ^
             |
     +-------+--------+
     |   slurmrestd   |
     |   TCP 6820     |
     |   OpenAPI/JWT  |
     +----------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;slurmctld&lt;/code&gt; — Controller Daemon (TCP 6817)
&lt;/h4&gt;

&lt;p&gt;The brain of the cluster. It maintains the global state of every node and every job in memory, periodically checkpointing to disk (the &lt;code&gt;StateSaveLocation&lt;/code&gt; directory). On restart after a failure, it replays this state to resume operations without losing queued or running jobs.&lt;/p&gt;

&lt;p&gt;Key responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the scheduler plugin (backfill by default, with optional gang scheduling)&lt;/li&gt;
&lt;li&gt;Manages node state transitions (IDLE, ALLOCATED, DOWN, DRAIN, FAIL)&lt;/li&gt;
&lt;li&gt;Dispatches jobs to &lt;code&gt;slurmd&lt;/code&gt; on compute nodes&lt;/li&gt;
&lt;li&gt;Enforces partition and QOS limits&lt;/li&gt;
&lt;li&gt;Processes all client commands (&lt;code&gt;sbatch&lt;/code&gt;, &lt;code&gt;srun&lt;/code&gt;, &lt;code&gt;scontrol&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High availability is supported via a primary/backup pair. If the primary &lt;code&gt;slurmctld&lt;/code&gt; fails, the backup takes over within seconds, with minimal job disruption &lt;sup id="fnref3"&gt;3&lt;/sup&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;slurmd&lt;/code&gt; — Node Daemon (TCP 6818)
&lt;/h4&gt;

&lt;p&gt;One instance runs on every compute node. It is the execution agent: it receives job steps dispatched by &lt;code&gt;slurmctld&lt;/code&gt;, spawns user processes inside cgroup hierarchies, monitors resource consumption continuously, and sends periodic heartbeats back to the controller.&lt;/p&gt;

&lt;p&gt;When a heartbeat is missed beyond the configured &lt;code&gt;SlurmdTimeout&lt;/code&gt;, the controller marks the node as DOWN and can optionally reschedule its jobs.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;slurmd&lt;/code&gt; also runs the site-defined &lt;strong&gt;Prolog&lt;/strong&gt; script before launching each job (environment setup, filesystem mounting, health checks) and the &lt;strong&gt;Epilog&lt;/strong&gt; script after completion (cleanup, unmounting, node validation).&lt;/p&gt;
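
&lt;p&gt;Prolog and Epilog are plain executables, so they can be written in any language. A minimal, hypothetical health-check Prolog might look like the sketch below; the &lt;code&gt;/scratch&lt;/code&gt; mount check is only an example of the kind of fast, conservative test that belongs here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
# Hypothetical Prolog: runs as root on each allocated node before the job starts.
# A non-zero exit code fails the launch, drains the node, and requeues the job,
# so keep the checks fast and conservative.
import os
import sys

job_id = os.environ.get("SLURM_JOB_ID", "unknown")

if not os.path.ismount("/scratch"):  # example check; the path is an assumption
    print(f"prolog: /scratch not mounted, refusing job {job_id}", file=sys.stderr)
    sys.exit(1)

sys.exit(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;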

&lt;h4&gt;
  
  
  &lt;code&gt;slurmdbd&lt;/code&gt; — Database Daemon (TCP 6819)
&lt;/h4&gt;

&lt;p&gt;The exclusive gateway to the accounting database. No other daemon connects to MariaDB directly. This design creates a single point of control for all historical data: job records, resource consumption, user associations, QOS definitions, and the fairshare tree.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;slurmdbd&lt;/code&gt; can run on a dedicated server, isolated from the controller. Losing it does not stop job execution — running jobs continue — but new accounting records are buffered locally on &lt;code&gt;slurmctld&lt;/code&gt; and flushed when connectivity is restored.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;slurmrestd&lt;/code&gt; — REST API Daemon (TCP 6820)
&lt;/h4&gt;

&lt;p&gt;Available since SLURM 20.11 &lt;sup id="fnref4"&gt;4&lt;/sup&gt;, &lt;code&gt;slurmrestd&lt;/code&gt; exposes the full SLURM management interface as an OpenAPI-documented REST API. It bridges REST calls to internal SLURM RPC, enabling integration with web portals, JupyterHub, workflow orchestrators (Nextflow, Snakemake, Apache Airflow), and cloud bursting systems.&lt;/p&gt;

&lt;p&gt;Authentication is via JWT tokens. The API surface is significant and must be treated as a privileged endpoint.&lt;/p&gt;
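
&lt;p&gt;As an illustration of what integration code looks like, the sketch below lists queued and running jobs through &lt;code&gt;slurmrestd&lt;/code&gt;. It assumes the &lt;code&gt;v0.0.40&lt;/code&gt; OpenAPI plugin is loaded (adjust the version segment to your deployment), that the endpoint is reachable through your proxy, and that a token was issued with &lt;code&gt;scontrol token&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import requests  # assumption: the requests library is available

SLURMRESTD = "http://slurm-controller:6820"  # assumption: address of your slurmrestd proxy
VERSION = "v0.0.40"                          # assumption: matches the loaded OpenAPI plugin

def list_jobs(user: str, token: str) -&gt; list:
    """Read-only call: list the jobs known to slurmctld via slurmrestd."""
    resp = requests.get(
        f"{SLURMRESTD}/slurm/{VERSION}/jobs",
        headers={
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,  # issued with: scontrol token username=... lifespan=...
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("jobs", [])

if __name__ == "__main__":
    jobs = list_jobs(os.environ["USER"], os.environ["SLURM_JWT"])
    print(f"{len(jobs)} jobs in queue")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;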




&lt;h3&gt;
  
  
  Communication Flows
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User (sbatch / srun / salloc)
        |
        | TCP 6817 — job submit, validated against associations + QOS
        v
  +-------------+   TCP 6819   +-------------+   SQL   +-----------+
  | slurmctld   |&amp;lt;------------&amp;gt;| slurmdbd    |--------&amp;gt;| MariaDB   |
  +------+------+   accounting +-------------+         +-----------+
         |
         | TCP 6818 — job dispatch (JobID, allocated nodes, resources)
         |
    +----+----+
    |         |
slurmd #1   slurmd #2  ...
    |
    +-- cgroups v2 (memory.max, cpu.max, devices allowlist)
    +-- Prolog  (runs as root before job)
    +-- job step (runs as user)
    +-- Epilog  (runs as root after job)
    +-- heartbeat -&amp;gt; slurmctld every SlurmdTimeout/3

slurmrestd --REST/JWT--&amp;gt; slurmctld (internal RPC)

All inter-daemon messages: signed + timestamped by MUNGE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every message exchanged between &lt;code&gt;slurmctld&lt;/code&gt;, &lt;code&gt;slurmdbd&lt;/code&gt;, and &lt;code&gt;slurmd&lt;/code&gt; is signed and timestamped by &lt;strong&gt;MUNGE&lt;/strong&gt; (MUNGE Uid 'N' Gid Emporium). A credential contains the UID/GID of the originating process, a timestamp, and a configurable TTL. Replayed credentials are rejected &lt;sup id="fnref5"&gt;5&lt;/sup&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scheduling Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Backfill Scheduling
&lt;/h3&gt;

&lt;p&gt;The default &lt;code&gt;sched/backfill&lt;/code&gt; plugin extends simple first-in-first-out scheduling by maintaining a time-ordered reservation list. When a large job cannot start immediately, the scheduler looks for smaller jobs that can be inserted into the scheduling gap without pushing back the start time of the large job &lt;sup id="fnref6"&gt;6&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This is why you sometimes see a small 2-node job start before a 100-node job that was submitted earlier: the 100-node job is waiting for enough nodes to free up, and the 2-node job fits in the current available capacity without affecting the projected start time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Queue state:
  Job A: 100 nodes, submitted T+0, cannot start (only 20 nodes free)
  Job B: 10 nodes, submitted T+10

Backfill logic:
  - Job A projected start: T+45 (when enough nodes finish current jobs)
  - Job B can complete before T+45 if started now
  - Job B is scheduled immediately without delaying Job A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Priority Calculation
&lt;/h3&gt;

&lt;p&gt;SLURM computes a weighted sum for each queued job &lt;sup id="fnref7"&gt;7&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Priority = w_age        * factor_age
         + w_fairshare  * factor_fairshare
         + w_jobsize    * factor_jobsize
         + w_qos        * factor_qos
         + w_partition  * factor_partition
         + w_assoc      * factor_assoc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fairshare factor is the most important for multi-tenant clusters. It is computed using a decay algorithm: resource usage from the past contributes less weight over time (configured by &lt;code&gt;PriorityDecayHalfLife&lt;/code&gt;). A user who ran 10,000 CPU-hours last week has a lower fairshare score than a user who has not submitted a job in two weeks, pushing the inactive user's jobs to higher priority.&lt;/p&gt;

&lt;p&gt;The tool &lt;code&gt;sprio&lt;/code&gt; shows the current priority breakdown for every queued job.&lt;/p&gt;
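
&lt;p&gt;To make the decay concrete, here is a toy calculation of how past usage loses weight with a 7-day half-life. The real implementation applies the decay incrementally at each scheduling iteration, so treat this as an illustration of the shape of the curve, not of SLURM internals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def decayed_usage(raw_cpu_hours: float, age_days: float, half_life_days: float = 7.0) -&gt; float:
    # Usage halves every half_life_days.
    return raw_cpu_hours * 0.5 ** (age_days / half_life_days)

print(decayed_usage(10_000, 0))   # 10000.0 (today's usage counts in full)
print(decayed_usage(10_000, 7))   # 5000.0  (one half-life old)
print(decayed_usage(10_000, 14))  # 2500.0  (two half-lives old)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;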

&lt;h3&gt;
  
  
  QOS and Associations
&lt;/h3&gt;

&lt;p&gt;The association tree controls access at every level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cluster: mycluster
  |
  +-- Account: research_lab        (FairShare: 40)
  |       |
  |       +-- User: alice          (FairShare: 20)
  |       |     QOS: normal, gpu_priority
  |       |     MaxTRES: cpu=256,gres/gpu=8
  |       |
  |       +-- User: bob            (FairShare: 20)
  |             QOS: normal
  |             MaxTRES: cpu=128
  |
  +-- Account: ops_team            (FairShare: 60)
          |
          +-- User: carol          (FairShare: 60)
                QOS: normal, high_priority, infra
                MaxTRES: cpu=512,gres/gpu=32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A QOS defines hard limits (GrpTRES, MaxTRESPerJob, MaxWallDurationPerJob) and soft priority boosts. When a user submits a job requesting resources beyond their association or QOS limits, the job is rejected at submission time, not at scheduling time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Job Lifecycle
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  SUBMIT         QUEUE          ALLOCATE         RUN          COMPLETE
     |               |               |              |               |
 sbatch          PENDING          Nodes          RUNNING        COMPLETED
 script.sh        state          reserved         state           state
     |               |               |              |               |
     v               v               v              v               v
 slurmctld      Scheduler        slurmd         slurmd          slurmdbd
 validates      computes         runs           monitors        records
 resources      priority         Prolog         CPU/mem/GPU     all metrics
 + QOS limits   backfill         cgroups        heartbeats      to MariaDB
                analysis         configured     to controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Submission
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --job-name=train_llm&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --partition=gpu&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --nodes=4&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --ntasks-per-node=8&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --gres=gpu:a100:8&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --mem=512G&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --time=48:00:00&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --account=research_lab&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --qos=gpu_priority&lt;/span&gt;

module load cuda/12.2
srun python train.py &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;slurmctld&lt;/code&gt; validates this script against:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The partition definition (nodes available, max wall time)&lt;/li&gt;
&lt;li&gt;The user's association (account exists, user is a member)&lt;/li&gt;
&lt;li&gt;The QOS (resource limits not exceeded)&lt;/li&gt;
&lt;li&gt;Current cluster capacity (enough GPUs exist)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If all checks pass, the job receives a &lt;code&gt;JobID&lt;/code&gt; and enters the PENDING state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution on Nodes
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;slurmctld&lt;/code&gt; dispatches the job, each &lt;code&gt;slurmd&lt;/code&gt; on the allocated nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Runs the site Prolog (as root)&lt;/li&gt;
&lt;li&gt;Creates the cgroup hierarchy for the job&lt;/li&gt;
&lt;li&gt;Sets &lt;code&gt;memory.max&lt;/code&gt;, &lt;code&gt;cpu.max&lt;/code&gt;, and the GPU device allowlist&lt;/li&gt;
&lt;li&gt;Spawns &lt;code&gt;slurmstepd&lt;/code&gt;, which drops privileges to the user and executes the job step&lt;/li&gt;
&lt;li&gt;Monitors consumption every &lt;code&gt;JobAcctGatherFrequency&lt;/code&gt; seconds&lt;/li&gt;
&lt;li&gt;Runs the Epilog on completion (as root)&lt;/li&gt;
&lt;li&gt;Reports final resource usage to &lt;code&gt;slurmctld&lt;/code&gt;, which forwards it to &lt;code&gt;slurmdbd&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Job Arrays
&lt;/h3&gt;

&lt;p&gt;For parameter sweeps, job arrays avoid submitting thousands of individual jobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --array=0-99%10    # 100 tasks, max 10 running simultaneously&lt;/span&gt;

&lt;span class="nv"&gt;PARAM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SLURM_ARRAY_TASK_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
python experiment.py &lt;span class="nt"&gt;--seed&lt;/span&gt; &lt;span class="nv"&gt;$PARAM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each task gets its own &lt;code&gt;JobID&lt;/code&gt; (formatted as &lt;code&gt;ArrayJobID_TaskID&lt;/code&gt;) and its own accounting record. The &lt;code&gt;%10&lt;/code&gt; limits concurrent tasks to avoid saturating the cluster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compute Nodes
+------------------------+    +------------------------+
| slurmd                 |    | DCGM Exporter          |
|                        |    | (NVIDIA GPU metrics)   |
| slurm-exporter :8080   |    | :9400                  |
|  slurm_jobs_running    |    |  DCGM_FI_DEV_GPU_UTIL  |
|  slurm_jobs_pending    |    |  DCGM_FI_DEV_MEM_COPY  |
|  slurm_nodes_alloc     |    |  DCGM_FI_DEV_NVLINK_*  |
|  slurm_cpus_idle       |    |  label: slurm_job_id   |
+----------+-------------+    +----------+-------------+
           |                             |
           | Prometheus scrape           | Prometheus scrape
           v                             v
+-----------------------------------------------+
|   VMAgent (per node or centralized)           |
|   Relabeling, filtering, remote_write         |
+-------------------+---------------------------+
                    |
                    | remote_write
                    v
+-----------------------------------------------+
|   VictoriaMetrics (vminsert / vmstorage)      |
|   Long-term storage, MetricsQL                |
+-------------------+---------------------------+
                    |
                    | datasource
                    v
+-----------------------------------------------+
|   Grafana                                     |
|   Job efficiency dashboards                   |
|   GPU heatmaps, fairshare visualization       |
|   Alerting (PagerDuty, Slack)                 |
+-----------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  slurm-exporter
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/vpenso/prometheus-slurm-exporter" rel="noopener noreferrer"&gt;&lt;code&gt;prometheus-slurm-exporter&lt;/code&gt;&lt;/a&gt; scrapes SLURM CLI tools (&lt;code&gt;squeue&lt;/code&gt;, &lt;code&gt;sinfo&lt;/code&gt;, &lt;code&gt;sacct&lt;/code&gt;) and exposes metrics on port 8080 &lt;sup id="fnref8"&gt;8&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Key metrics exposed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_jobs_running&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count of running jobs, by partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_jobs_pending&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count of pending jobs, by reason&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_nodes_alloc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nodes in ALLOCATED state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_nodes_idle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nodes in IDLE state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_nodes_down&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nodes in DOWN/DRAIN state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_cpus_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total CPUs in cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_cpus_idle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Idle CPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_account_cpu_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPUs used per account&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A known limitation: the exporter calls CLI binaries, which adds latency and load at scale (thousands of jobs). At very large scale, prefer reading directly from &lt;code&gt;slurmctld&lt;/code&gt;'s state files or using &lt;code&gt;slurmrestd&lt;/code&gt; as a data source.&lt;/p&gt;

&lt;h3&gt;
  
  
  DCGM Exporter and GPU Correlation
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;NVIDIA DCGM Exporter&lt;/a&gt; exposes per-GPU hardware metrics &lt;sup id="fnref9"&gt;9&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="...", hostname="node01"} 94
DCGM_FI_DEV_FB_USED{gpu="0", ...} 38654
DCGM_FI_DEV_POWER_USAGE{gpu="0", ...} 387
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0", ...} 198432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To correlate GPU metrics with SLURM jobs, DCGM can be configured to expose the &lt;code&gt;SLURM_JOB_ID&lt;/code&gt; environment variable as a label. This enables Grafana queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# GPU efficiency for a specific job
DCGM_FI_DEV_GPU_UTIL{slurm_job_id="12345"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key insight for AI/ML workloads: raw GPU utilization tells you if GPUs are busy, but &lt;code&gt;job_id&lt;/code&gt; correlation tells you which specific training run, user, or team is responsible.&lt;/p&gt;
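
&lt;p&gt;Once the label is in place, the correlation is a one-line query against any Prometheus-compatible API. Here is a minimal sketch against a single-node VictoriaMetrics endpoint; the URL, the &lt;code&gt;slurm_job_id&lt;/code&gt; label name, and the job ID are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests  # assumption: the requests library is available

VM_URL = "http://victoriametrics:8428"  # assumption: single-node VictoriaMetrics

def job_gpu_util(job_id: str) -&gt; float:
    """Average GPU utilization for one SLURM job via the /api/v1/query endpoint."""
    resp = requests.get(
        f"{VM_URL}/api/v1/query",
        params={"query": f'avg(DCGM_FI_DEV_GPU_UTIL{{slurm_job_id="{job_id}"}})'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

print(job_gpu_util("12345"))  # placeholder job ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;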

&lt;h3&gt;
  
  
  Why VictoriaMetrics for HPC
&lt;/h3&gt;

&lt;p&gt;Prometheus alone struggles with HPC-scale workloads for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cardinality&lt;/strong&gt;: a 1000-node cluster with 8 GPUs each, running thousands of jobs, generates millions of unique time series&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention&lt;/strong&gt;: HPC accounting requires months or years of metrics for capacity planning and user reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query performance&lt;/strong&gt;: job efficiency reports aggregate over large time ranges with complex label filters&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;VictoriaMetrics addresses all three &lt;sup id="fnref10"&gt;10&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vmagent config: distributed collection on compute nodes&lt;/span&gt;
&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slurm&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:8080"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:9400"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://victoriametrics:8428/api/v1/write"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compression ratios on HPC workloads are typically 10-15x better than Prometheus TSDB, and MetricsQL supports advanced aggregations like &lt;code&gt;quantile_over_time&lt;/code&gt; and &lt;code&gt;increase&lt;/code&gt; that are essential for wait time analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  KPIs That Actually Matter
&lt;/h3&gt;

&lt;p&gt;Most HPC operators track GPU utilization and stop there. That is not enough. The metrics that reveal actual cluster health:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Formula&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU efficiency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;used_cpus / alloc_cpus&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reveals job over-allocation and poor sizing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory waste&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alloc_mem - max_rss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Often 40-60% on ML clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wait time P95&lt;/td&gt;
&lt;td&gt;&lt;code&gt;start_time - submit_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scheduler health indicator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fairshare drift&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;factor_fairshare&lt;/code&gt; over 30d&lt;/td&gt;
&lt;td&gt;Detects long-term resource monopolies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU occupancy&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DCGM_GPU_UTIL&lt;/code&gt; weighted by job&lt;/td&gt;
&lt;td&gt;Distinguishes idle allocation from compute-bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job failure rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;failed / (completed + failed)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure reliability signal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A &lt;code&gt;sacct&lt;/code&gt; query for job efficiency after the fact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sacct &lt;span class="nt"&gt;-j&lt;/span&gt; 12345 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JobID,CPUTime,CPUTimeRAW,AveCPU,MaxRSS,ReqMem,Elapsed &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
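
&lt;p&gt;To turn the memory-waste formula from the table into a number, the sketch below parses machine-readable &lt;code&gt;sacct&lt;/code&gt; output for a single job. The unit handling is simplified (sacct appends K/M/G suffixes, and older releases add an &lt;code&gt;n&lt;/code&gt;/&lt;code&gt;c&lt;/code&gt; per-node/per-CPU marker to ReqMem), and the job ID is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def to_mb(value: str) -&gt; float:
    """Convert a sacct size string like '512G' or '48234K' to megabytes (simplified)."""
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024}
    value = value.strip().rstrip("nc")  # older sacct versions suffix ReqMem with n or c
    if not value:
        return 0.0
    if value[-1] in units:
        return float(value[:-1]) * units[value[-1]]
    return float(value)

def memory_waste_mb(job_id: str) -&gt; float:
    """alloc_mem - max_rss for a finished job, from pipe-separated sacct output."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--noheader", "--parsable2",
         "--format=JobID,ReqMem,MaxRSS"],
        capture_output=True, text=True, check=True,
    ).stdout
    req, rss = 0.0, 0.0
    for line in out.splitlines():
        _, req_mem, max_rss = line.split("|")
        req = max(req, to_mb(req_mem))  # requested memory (job line)
        rss = max(rss, to_mb(max_rss))  # peak resident set (step lines)
    return req - rss

print(memory_waste_mb("12345"))  # "12345" is a placeholder job ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;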






&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Authentication: MUNGE
&lt;/h3&gt;

&lt;p&gt;MUNGE is the default authentication mechanism for all inter-daemon communication &lt;sup id="fnref5"&gt;5&lt;/sup&gt;. Every message is signed with a shared secret (&lt;code&gt;/etc/munge/munge.key&lt;/code&gt;), timestamped, and includes the originating UID/GID. A receiving daemon verifies the signature and rejects credentials outside the configured TTL window, preventing replay attacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node A                              Node B
+------------------+                +------------------+
|  slurmctld       |                |  slurmd          |
|                  |--[credential]-&amp;gt;|                  |
|  signs with      |                |  verifies with   |
|  munge.key       |                |  munge.key       |
|                  |&amp;lt;--[response]---|                  |
+------------------+                +------------------+

Credential contains:
  - UID / GID of sender
  - Timestamp (TTL: 300s default)
  - Realm (optional)
  - Payload (encrypted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key operational requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;munge.key&lt;/code&gt; must be &lt;strong&gt;identical&lt;/strong&gt; on all nodes (controller + compute + login + slurmdbd server)&lt;/li&gt;
&lt;li&gt;File permissions must be &lt;code&gt;0400&lt;/code&gt;, owned by the &lt;code&gt;munge&lt;/code&gt; user&lt;/li&gt;
&lt;li&gt;Distribution should use a secrets manager (HashiCorp Vault, Ansible Vault) rather than manual &lt;code&gt;scp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Key rotation requires a coordinated restart of all SLURM daemons — the most disruptive operation on a live cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key rotation procedure on a live cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Generate new key on the controller&lt;/span&gt;
mungekey &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--keyfile&lt;/span&gt; /etc/munge/munge.key.new

&lt;span class="c"&gt;# 2. Distribute to all nodes (use your config management tool)&lt;/span&gt;
ansible all &lt;span class="nt"&gt;-m&lt;/span&gt; copy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"src=/etc/munge/munge.key.new dest=/etc/munge/munge.key mode=0400 owner=munge"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Restart munge everywhere simultaneously (parallel SSH)&lt;/span&gt;
ansible all &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=munge state=restarted"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Restart SLURM daemons in order&lt;/span&gt;
ansible compute &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=slurmd state=restarted"&lt;/span&gt;
ansible controller &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=slurmctld state=restarted"&lt;/span&gt;
ansible dbd &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=slurmdbd state=restarted"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resource Isolation: cgroups v2
&lt;/h3&gt;

&lt;p&gt;Without cgroup enforcement, a job that allocates 64GB of memory can consume 512GB, triggering OOM kills across all other jobs on the node. SLURM's cgroup plugin prevents this &lt;sup id="fnref11"&gt;11&lt;/sup&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slurmd receives job dispatch
        |
        v
Creates cgroup hierarchy:
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/
        |
        +-- memory.max        = 65536M   (allocated memory)
        +-- memory.swap.max   = 0        (no swap for HPC jobs)
        +-- cpu.max           = 6400000 100000  (64 cores: quota/period in µs)
        +-- device access     = c 195:0, c 195:1 (only the allocated GPUs; eBPF-enforced in v2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Essential &lt;code&gt;cgroup.conf&lt;/code&gt; settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;CgroupPlugin&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;autodetect&lt;/span&gt;
&lt;span class="py"&gt;ConstrainRAMSpace&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes       # OOM kill if job exceeds memory limit&lt;/span&gt;
&lt;span class="py"&gt;ConstrainSwapSpace&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes      # Disable swap for job processes&lt;/span&gt;
&lt;span class="py"&gt;ConstrainCores&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes          # Pin processes to allocated CPU cores&lt;/span&gt;
&lt;span class="py"&gt;ConstrainDevices&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes        # Restrict GPU access to allocated devices&lt;/span&gt;
&lt;span class="py"&gt;AllowedRAMSpace&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100         # No tolerance: enforce hard limit&lt;/span&gt;
&lt;span class="py"&gt;TaskAffinity&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes            # Bind threads to cores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ConstrainRAMSpace=yes&lt;/code&gt; is non-negotiable in any multi-tenant environment. Without it, a misbehaving job can take down an entire node.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authorization: RBAC and Associations
&lt;/h3&gt;

&lt;p&gt;SLURM's authorization model is hierarchical. Access is validated at every layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 — Cluster
  Who can submit at all?

Level 2 — Account
  Which budget/project does the job charge to?
  What is the fairshare allocation?

Level 3 — User
  Individual limits within the account.

Level 4 — QOS
  Hard limits on resources, wall time, and concurrent jobs.
  Priority boosts or penalties.

Level 5 — Partition
  Which physical nodes? What maximum wall time?
  Restricted to specific groups (AllowGroups)?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Managing associations with &lt;code&gt;sacctmgr&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create account hierarchy&lt;/span&gt;
sacctmgr add cluster mycluster
sacctmgr add account research_lab &lt;span class="nv"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mycluster &lt;span class="nv"&gt;fairshare&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;40
sacctmgr add user alice &lt;span class="nv"&gt;account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;research_lab &lt;span class="nv"&gt;defaultaccount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;research_lab

&lt;span class="c"&gt;# Define QOS&lt;/span&gt;
sacctmgr add qos gpu_priority &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;MaxTRESPerUser&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;256,gres/gpu&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;MaxWallDurationPerJob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;48:00:00 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;Priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100

&lt;span class="c"&gt;# Assign QOS to user&lt;/span&gt;
sacctmgr modify user alice &lt;span class="nb"&gt;set &lt;/span&gt;qos+&lt;span class="o"&gt;=&lt;/span&gt;gpu_priority
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  API Security: JWT and TLS
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;slurmrestd&lt;/code&gt; is the largest attack surface in a modern SLURM deployment. A compromised API token provides full cluster control: job submission, node management, user impersonation.&lt;/p&gt;

&lt;p&gt;Hardening checklist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Generate JWT signing key&lt;/span&gt;
openssl rand &lt;span class="nt"&gt;-out&lt;/span&gt; /etc/slurm/jwt_hs256.key 32
&lt;span class="nb"&gt;chmod &lt;/span&gt;0600 /etc/slurm/jwt_hs256.key
&lt;span class="nb"&gt;chown &lt;/span&gt;slurm: /etc/slurm/jwt_hs256.key

&lt;span class="c"&gt;# In slurm.conf:&lt;/span&gt;
&lt;span class="c"&gt;# AuthAltTypes=auth/jwt&lt;/span&gt;
&lt;span class="c"&gt;# AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key&lt;/span&gt;

&lt;span class="c"&gt;# 2. Issue short-lived tokens (1 hour max)&lt;/span&gt;
scontrol token &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;alice &lt;span class="nv"&gt;lifespan&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3600

&lt;span class="c"&gt;# 3. Run behind nginx with rate limiting&lt;/span&gt;
&lt;span class="c"&gt;# nginx.conf excerpt:&lt;/span&gt;
&lt;span class="c"&gt;# limit_req_zone $binary_remote_addr zone=slurm_api:10m rate=10r/s;&lt;/span&gt;
&lt;span class="c"&gt;# location /slurm/ {&lt;/span&gt;
&lt;span class="c"&gt;#   limit_req zone=slurm_api burst=20 nodelay;&lt;/span&gt;
&lt;span class="c"&gt;#   proxy_pass http://127.0.0.1:6820;&lt;/span&gt;
&lt;span class="c"&gt;# }&lt;/span&gt;

&lt;span class="c"&gt;# 4. Restrict port 6820 by firewall&lt;/span&gt;
&lt;span class="c"&gt;# Only the proxy IP should reach slurmrestd directly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For inter-daemon TLS (SLURM 23.x+), add to &lt;code&gt;slurm.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;CommunicationParameters&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;EnableTLS&lt;/span&gt;
&lt;span class="py"&gt;TLSType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;tls/openssl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Audit Trail
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;slurmdbd&lt;/code&gt; maintains a complete, immutable audit trail. Every job submission, modification, start, and completion is recorded with full resource accounting. This data is queryable via &lt;code&gt;sacct&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full accounting for a user, last 30 days&lt;/span&gt;
sacct &lt;span class="nt"&gt;-u&lt;/span&gt; alice &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--starttime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JobID,JobName,Account,QOS,Partition,NCPUS,NNodes,&lt;span class="se"&gt;\&lt;/span&gt;
           ReqMem,MaxRSS,CPUTime,Elapsed,State,ExitCode &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;G

&lt;span class="c"&gt;# Cluster-wide report&lt;/span&gt;
sreport cluster utilization &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-01-01 &lt;span class="nv"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-03-31 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; hourper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For SIEM integration, SLURM writes structured logs to syslog. These can be forwarded to Wazuh, Elastic SIEM, or Splunk for correlation with authentication events and anomaly detection.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Configuration Files
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Critical settings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main config: nodes, partitions, plugins&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SelectType&lt;/code&gt;, &lt;code&gt;PriorityType&lt;/code&gt;, &lt;code&gt;AccountingStorageType&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurmdbd.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accounting daemon: DB credentials&lt;/td&gt;
&lt;td&gt;Permissions must be &lt;code&gt;0600&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cgroup.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resource enforcement&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ConstrainRAMSpace&lt;/code&gt;, &lt;code&gt;ConstrainDevices&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gres.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU/FPGA topology and binding&lt;/td&gt;
&lt;td&gt;GPU count, MIG partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;topology.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Network topology for MPI placement&lt;/td&gt;
&lt;td&gt;Switch hierarchy, InfiniBand fabric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;acct_gather.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-job energy and I/O metrics&lt;/td&gt;
&lt;td&gt;RAPL, InfiniBand, Lustre&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Annotated &lt;code&gt;slurm.conf&lt;/code&gt; for a GPU cluster
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# Identity
&lt;/span&gt;&lt;span class="py"&gt;ClusterName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;mycluster&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldHost&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller01&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldHost&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller02  # HA backup&lt;/span&gt;

&lt;span class="c"&gt;# Ports
&lt;/span&gt;&lt;span class="py"&gt;SlurmctldPort&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6817&lt;/span&gt;
&lt;span class="py"&gt;SlurmdPort&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6818&lt;/span&gt;

&lt;span class="c"&gt;# Scheduler
&lt;/span&gt;&lt;span class="py"&gt;SchedulerType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;sched/backfill&lt;/span&gt;
&lt;span class="py"&gt;SelectType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;select/cons_tres           # Consumable resources: Track&lt;/span&gt;
&lt;span class="py"&gt;SelectTypeParameters&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;CR_Core_Memory   # individual CPUs and memory&lt;/span&gt;
&lt;span class="py"&gt;SchedulerParameters&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;bf_max_job_test=500,bf_resolution=60&lt;/span&gt;

&lt;span class="c"&gt;# Priority (multifactor)
&lt;/span&gt;&lt;span class="py"&gt;PriorityType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;priority/multifactor&lt;/span&gt;
&lt;span class="py"&gt;PriorityWeightFairshare&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100000&lt;/span&gt;
&lt;span class="py"&gt;PriorityWeightAge&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1000&lt;/span&gt;
&lt;span class="py"&gt;PriorityWeightJobSize&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;
&lt;span class="py"&gt;PriorityDecayHalfLife&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7-0             # 7 days half-life for fairshare&lt;/span&gt;
&lt;span class="py"&gt;PriorityMaxAge&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7-0&lt;/span&gt;

&lt;span class="c"&gt;# Accounting
&lt;/span&gt;&lt;span class="py"&gt;AccountingStorageType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;accounting_storage/slurmdbd&lt;/span&gt;
&lt;span class="py"&gt;AccountingStorageHost&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller01&lt;/span&gt;
&lt;span class="py"&gt;AccountingStoragePort&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6819&lt;/span&gt;
&lt;span class="py"&gt;AccountingStorageUser&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;slurm&lt;/span&gt;
&lt;span class="py"&gt;AccountingStoragePass&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;db_password&amp;gt;&lt;/span&gt;
&lt;span class="py"&gt;JobAcctGatherType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;jobacct_gather/cgroup&lt;/span&gt;
&lt;span class="py"&gt;JobAcctGatherFrequency&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;30             # Collect every 30s&lt;/span&gt;

&lt;span class="c"&gt;# Task and process tracking
&lt;/span&gt;&lt;span class="py"&gt;TaskPlugin&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;task/cgroup,task/affinity&lt;/span&gt;
&lt;span class="py"&gt;ProctrackType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;proctrack/cgroup&lt;/span&gt;

&lt;span class="c"&gt;# GRES (GPU)
&lt;/span&gt;&lt;span class="py"&gt;GresTypes&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gpu&lt;/span&gt;

&lt;span class="c"&gt;# Timeouts
&lt;/span&gt;&lt;span class="py"&gt;SlurmdTimeout&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;300&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldTimeout&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120&lt;/span&gt;
&lt;span class="py"&gt;MessageTimeout&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;

&lt;span class="c"&gt;# Logging
&lt;/span&gt;&lt;span class="py"&gt;SlurmctldLogFile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/slurm/slurmctld.log&lt;/span&gt;
&lt;span class="py"&gt;SlurmdLogFile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/slurm/slurmd.log&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldDebug&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;
&lt;span class="py"&gt;SlurmdDebug&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;

&lt;span class="c"&gt;# Nodes (example: 16 nodes, 8x A100 each)
&lt;/span&gt;&lt;span class="py"&gt;NodeName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;node[01-16] &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;CPUs=64 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;RealMemory=512000 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Gres=gpu:a100:8 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;State=UNKNOWN&lt;/span&gt;

&lt;span class="c"&gt;# Partitions
&lt;/span&gt;&lt;span class="py"&gt;PartitionName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gpu &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Nodes=node[01-16] &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;MaxTime=INFINITE &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;DefaultTime=24:00:00 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;State=UP &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Default=YES&lt;/span&gt;

&lt;span class="py"&gt;PartitionName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;debug &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Nodes=node[01-02] &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;MaxTime=1:00:00 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Priority=100 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;State=UP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Operational Runbook: Common Tasks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Drain a node for maintenance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Drain: no new jobs, current jobs finish&lt;/span&gt;
scontrol update &lt;span class="nv"&gt;NodeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;node05 &lt;span class="nv"&gt;State&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DRAIN &lt;span class="nv"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"scheduled maintenance"&lt;/span&gt;

&lt;span class="c"&gt;# Check when node will be empty&lt;/span&gt;
squeue &lt;span class="nt"&gt;-w&lt;/span&gt; node05

&lt;span class="c"&gt;# After jobs finish, confirm drain&lt;/span&gt;
scontrol show node node05 | &lt;span class="nb"&gt;grep &lt;/span&gt;State

&lt;span class="c"&gt;# Return to service&lt;/span&gt;
scontrol update &lt;span class="nv"&gt;NodeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;node05 &lt;span class="nv"&gt;State&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;RESUME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hold and release a job
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hold a pending job (prevents scheduling)&lt;/span&gt;
scontrol hold 12345

&lt;span class="c"&gt;# Release&lt;/span&gt;
scontrol release 12345

&lt;span class="c"&gt;# Requeue a failed running job&lt;/span&gt;
scontrol requeue 12345
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Identify wasted resources
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Jobs where memory usage &amp;lt; 50% of allocation&lt;/span&gt;
sacct &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JobID,ReqMem,MaxRSS,CPUTime,AveCPU &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;COMPLETED &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--starttime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-01-01 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'$3 != 0 &amp;amp;&amp;amp; ($3/$2) &amp;lt; 0.5 {print $0}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLURM in one diagram:

User submits job (sbatch / srun / srun --pty)
        |
        v
slurmctld
   validates resources (partitions + associations + QOS)
   queues job (PENDING)
   computes priority (fairshare + QOS + age + jobsize)
   runs backfill scheduling
   dispatches to allocated nodes (RUNNING)
   records lifecycle to slurmdbd
        |
        +-- slurmdbd -&amp;gt; MariaDB (full accounting, audit trail)
        |
        +-- slurmd on each node
                |
                +-- cgroups v2   (memory, CPU, GPU isolation)
                +-- Prolog       (pre-job setup, root)
                +-- slurmstepd   (user process, MPI launch)
                +-- Epilog       (post-job cleanup, root)
                +-- heartbeat    (node health to slurmctld)
                |
                +-- slurm-exporter :8080  (job + node metrics)
                +-- DCGM Exporter  :9400  (GPU metrics + job_id)
                        |
                        v
                VMAgent -&amp;gt; VictoriaMetrics -&amp;gt; Grafana

Security stack:
  MUNGE           inter-daemon auth (shared key, signed credentials)
  cgroups v2      resource isolation (memory, CPU, GPU per job)
  Associations    RBAC + fairshare (cluster &amp;gt; account &amp;gt; user &amp;gt; QOS)
  JWT + TLS       API security (slurmrestd behind reverse proxy)
  sacct / slurmdbd  audit trail (full accounting, queryable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three files to master before anything else: &lt;code&gt;slurm.conf&lt;/code&gt;, &lt;code&gt;cgroup.conf&lt;/code&gt;, &lt;code&gt;gres.conf&lt;/code&gt;. Everything else builds on top of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;




&lt;p&gt;&lt;em&gt;This article is part of the HPC Observability series. Next: Building GPU efficiency dashboards with VictoriaMetrics and Grafana for AI training workloads.&lt;/em&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Yoo, A.B., Jette, M.A., Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. &lt;em&gt;Lecture Notes in Computer Science&lt;/em&gt;, 2862, 44-60. &lt;a href="https://doi.org/10.1007/10968987_3" rel="noopener noreferrer"&gt;https://doi.org/10.1007/10968987_3&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;TOP500 Editors (2023). Statistics on Resource Management Software. &lt;em&gt;TOP500 Project&lt;/em&gt;. &lt;a href="https://www.top500.org/statistics/details/rmsoftware/1" rel="noopener noreferrer"&gt;https://www.top500.org/statistics/details/rmsoftware/1&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;SchedMD LLC. (2024). High Availability in SLURM. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/high_availability.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/high_availability.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;SchedMD LLC. (2024). REST API Guide. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/rest.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/rest.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;Grondona, M. (2024). MUNGE Authentication Service. &lt;em&gt;GitHub&lt;/em&gt;. &lt;a href="https://github.com/dun/munge" rel="noopener noreferrer"&gt;https://github.com/dun/munge&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;Lifka, D. (1995). The ANL/IBM SP Scheduling System. &lt;em&gt;Job Scheduling Strategies for Parallel Processing&lt;/em&gt;, 295-303. &lt;a href="https://doi.org/10.1007/3-540-60153-8_31" rel="noopener noreferrer"&gt;https://doi.org/10.1007/3-540-60153-8_31&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;SchedMD LLC. (2024). Multifactor Priority Plugin. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/priority_multifactor.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/priority_multifactor.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;Penso, V. et al. (2024). prometheus-slurm-exporter. &lt;em&gt;GitHub&lt;/em&gt;. &lt;a href="https://github.com/vpenso/prometheus-slurm-exporter" rel="noopener noreferrer"&gt;https://github.com/vpenso/prometheus-slurm-exporter&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;NVIDIA Corporation. (2024). DCGM Exporter. &lt;em&gt;GitHub&lt;/em&gt;. &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;https://github.com/NVIDIA/dcgm-exporter&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;VictoriaMetrics Team. (2024). VictoriaMetrics Documentation. &lt;a href="https://docs.victoriametrics.com" rel="noopener noreferrer"&gt;https://docs.victoriametrics.com&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn11"&gt;
&lt;p&gt;SchedMD LLC. (2024). Cgroups Guide. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/cgroups.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/cgroups.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>hpc</category>
      <category>linux</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Monitoring an ML Pipeline in Production: Anatomy of an Open-Source Stack</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Tue, 24 Feb 2026 21:07:17 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/monitoring-an-ml-pipeline-in-production-anatomy-of-an-open-source-stack-55ik</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/monitoring-an-ml-pipeline-in-production-anatomy-of-an-open-source-stack-55ik</guid>
      <description>&lt;p&gt;This isn't a theoretical guide. It's a field report on the observability stack I've built and iterated across engagements and demos on the AI Observability Hub - a demonstration platform I use to validate AI monitoring architectures before deploying them at client sites.&lt;/p&gt;

&lt;p&gt;The goal is straightforward: give an SRE, data engineer, or CTO the building blocks to monitor an ML pipeline in production with VictoriaMetrics, OpenTelemetry, and Grafana. No vendor lock-in. No proprietary platform. Open-source components, assembled with intention.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we actually monitor (and what we forget)
&lt;/h2&gt;

&lt;p&gt;Most organizations deploying ML in production settle for monitoring infrastructure: CPU, RAM, disk space. That's necessary, but it's the equivalent of watching a factory's temperature without looking at the quality of parts coming off the line.&lt;/p&gt;

&lt;p&gt;A production ML pipeline has &lt;strong&gt;four observability layers&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;: the foundation. GPU utilization (compute, VRAM, memory bandwidth), CPU, network, disk I/O. Without it, you don't even know if the machine is running. But with it alone, you don't know if the model is working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data pipeline&lt;/strong&gt;: the invisible layer. Training data freshness, ingestion latency, feature completeness, statistical drift in input distributions. A model receiving degraded data produces degraded results, and nothing in the infra metrics flags it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt;: this is what data scientists care about, but what often goes unmonitored in production. Inference latency, throughput (requests/second), confidence score distribution, fallback rate, prediction vs. ground truth comparison when available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: this is what leadership cares about, and what's often discovered too late. Cost per inference, GPU cost per model, cost/business-value ratio. A model that costs €3 per inference on a use case generating €0.50 in value isn't a technical problem: it's a business problem that only observability makes visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The reference architecture
&lt;/h2&gt;

&lt;p&gt;Here's the stack I've built and deploy in my engagements. Each component was chosen for a specific reason - not by habit or popularity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                        ML APPLICATIONS                          │
│  vLLM / TGI / Triton / Custom Flask-FastAPI                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │         OpenTelemetry SDK + Auto-instrumentation        │    │
│  │    Traces (spans)  │  Metrics (counters/histograms)     │    │
│  └────────────────────┼────────────────────────────────────┘    │
└───────────────────────┼─────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                  OPENTELEMETRY COLLECTOR                         │
│  Receivers: OTLP (gRPC/HTTP), Prometheus scrape                 │
│  Processors: batch, filter, attributes, tail_sampling           │
│  Exporters: prometheusremotewrite, otlp                         │
└──────────┬──────────────────────────────────────────────────────┘
           │                                  │
           ▼                                  ▼
┌──────────────────────┐         ┌────────────────────────────────┐
│   VICTORIAMETRICS    │         │       OPENOBSERVE / LOKI       │
│   (metrics TSDB)     │         │       (logs + traces)          │
│   ┌──────────────┐   │         │                                │
│   │  vmselect    │   │         │  Long retention for audit      │
│   │  vminsert    │   │         │  Full-text search              │
│   │  vmstorage   │   │         │  Trace-log correlation         │
│   └──────────────┘   │         └────────────────────────────────┘
└──────────┬───────────┘                      │
           │                                  │
           ▼                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                          GRAFANA                                │
│  Infra Dashboard  │  Model Dashboard  │  Cost Dashboard         │
│  Alerting (Alertmanager) → PagerDuty / Slack / Email            │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Component by component
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenTelemetry: the instrumentation standard
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry is the instrumentation choice for a non-negotiable reason: it's the only vendor-agnostic standard covering traces, metrics, and logs in a unified framework. Instrumenting with OTel guarantees the freedom to swap backends without re-instrumenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For an ML pipeline, instrumentation covers three levels:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;application SDK&lt;/strong&gt; integrates directly into the inference service code. For a Python service (FastAPI, Flask), OTel auto-instrumentation automatically captures HTTP requests, database calls, and processing spans. For model-specific metrics, custom instruments are added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MeterProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.metric_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPMetricExporter&lt;/span&gt;

&lt;span class="c1"&gt;# Meter configuration
&lt;/span&gt;&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Custom metrics for the ML pipeline
&lt;/span&gt;&lt;span class="n"&gt;inference_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inference duration in milliseconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inference_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.tokens.total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total tokens generated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Confidence score distribution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gpu_cost_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.cost.gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cumulative estimated GPU cost in euros&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EUR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In the inference function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_count&lt;/span&gt;
    &lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;

    &lt;span class="c1"&gt;# Record metrics with labels
&lt;/span&gt;    &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3-8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_case&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maintenance_assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;inference_duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;inference_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# GPU cost estimation (based on hourly rate / time consumed)
&lt;/span&gt;    &lt;span class="n"&gt;gpu_hourly_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.50&lt;/span&gt;  &lt;span class="c1"&gt;# €/h for an A100
&lt;/span&gt;    &lt;span class="n"&gt;gpu_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3_600_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gpu_hourly_rate&lt;/span&gt;
    &lt;span class="n"&gt;gpu_cost_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Collector&lt;/strong&gt; is the consolidation point. It receives telemetry data from all services, transforms, filters, and routes it to storage backends. It's the most underestimated component in the stack — and the most critical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otel-collector-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

  &lt;span class="c1"&gt;# Scrape GPU metrics from DCGM/nvidia-smi exporter&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dcgm-exporter'&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dcgm-exporter:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node-exporter'&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node-exporter:9100'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;

  &lt;span class="c1"&gt;# Filter low-value metrics&lt;/span&gt;
  &lt;span class="na"&gt;filter/drop-debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;match_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;regexp&lt;/span&gt;
        &lt;span class="na"&gt;metric_names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*debug.*"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*test.*"&lt;/span&gt;

  &lt;span class="c1"&gt;# Attribute enrichment&lt;/span&gt;
  &lt;span class="na"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment.environment&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service.namespace&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;

  &lt;span class="c1"&gt;# Trace sampling (keep 100% of errors, 10% of the rest)&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors-always&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
        &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-rest&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;probabilistic&lt;/span&gt;
        &lt;span class="na"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheusremotewrite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://victoriametrics:8428/api/v1/write"&lt;/span&gt;
    &lt;span class="na"&gt;resource_to_telemetry_conversion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;otlp/traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openobserve:5081"&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;otlp/logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openobserve:5081"&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;filter/drop-debug&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheusremotewrite&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/traces&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/logs&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key point here is &lt;strong&gt;tail sampling&lt;/strong&gt;. A production ML pipeline can generate thousands of traces per minute. Storing everything is costly and unnecessary. Tail sampling keeps 100% of error traces (the ones that matter for debugging) and samples the rest - reducing storage volume without losing signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  VictoriaMetrics: storage that handles the load
&lt;/h3&gt;

&lt;p&gt;I chose VictoriaMetrics over Prometheus for a simple reason: cardinality.&lt;/p&gt;

&lt;p&gt;A production ML pipeline generates metrics with a high number of label combinations: model × version × use case × environment × request type × user. Prometheus starts struggling beyond a few million active time series. VictoriaMetrics is designed to handle this scale with significantly lower memory and disk footprint.&lt;/p&gt;

&lt;p&gt;In practice, in my deployments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-node mode&lt;/strong&gt; for mid-market companies with moderate volume (&amp;lt; 500k active series). A single binary, minimal configuration, excellent performance. This is the mode I recommend to start with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Launch VictoriaMetrics single-node&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; victoriametrics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /data/vm:/victoria-metrics-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8428:8428 &lt;span class="se"&gt;\&lt;/span&gt;
  victoriametrics/victoria-metrics:stable &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-retentionPeriod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;6m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-search&lt;/span&gt;.maxUniqueTimeseries&lt;span class="o"&gt;=&lt;/span&gt;5000000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-dedup&lt;/span&gt;.minScrapeInterval&lt;span class="o"&gt;=&lt;/span&gt;15s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cluster mode&lt;/strong&gt; (vmselect + vminsert + vmstorage) when volume exceeds one million series or when high availability is required. This is the mode I use on the AI Observability Hub to simulate realistic loads.&lt;/p&gt;

&lt;p&gt;Retention parameters are an architecture decision, not a configuration detail. For operational observability (SRE), 30 to 90 days suffice. For governance and audit (EU AI Act), plan for 12 to 36 months — and that's where VictoriaMetrics' compression makes a real difference in storage cost compared to alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grafana: three dashboards, three audiences
&lt;/h3&gt;

&lt;p&gt;Grafana isn't just a visualization tool. It's the translation layer between technical data and human decisions. A dashboard that just shows curves without guiding action is a useless dashboard.&lt;/p&gt;

&lt;p&gt;I systematically structure ML observability into &lt;strong&gt;three dashboards&lt;/strong&gt;:&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Dashboard 1: Infra &amp;amp; GPU (audience: SRE/DevOps)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dashboard answers: "Is the platform holding up?"&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# GPU compute utilization (via DCGM exporter)
DCGM_FI_DEV_GPU_UTIL{instance=~"$gpu_node"}

# GPU memory used vs. total
DCGM_FI_DEV_FB_USED{instance=~"$gpu_node"}
  / DCGM_FI_DEV_FB_TOTAL{instance=~"$gpu_node"} * 100

# GPU temperature (alert if &amp;gt; 85°C)
DCGM_FI_DEV_GPU_TEMP{instance=~"$gpu_node"}

# Inference throughput (requests/second)
rate(ml_inference_duration_count[5m])

# Inference p95 latency
histogram_quantile(0.95,
  rate(ml_inference_duration_bucket[5m])
)

# Queue saturation (if applicable)
ml_inference_queue_depth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configured alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization &amp;gt; 95% for 10 minutes → capacity alert&lt;/li&gt;
&lt;li&gt;GPU temperature &amp;gt; 85°C → thermal alert&lt;/li&gt;
&lt;li&gt;p95 latency &amp;gt; SLO threshold (e.g., 2s for a conversational assistant) → performance alert&lt;/li&gt;
&lt;li&gt;Queue depth &amp;gt; 100 requests for 5 minutes → saturation alert&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Dashboard 2: Model &amp;amp; Quality (audience: data engineers, ML engineers)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dashboard answers: "Is the model doing its job?"&lt;/p&gt;

&lt;p&gt;This is the dashboard missing from 90% of ML deployments I audit. The infra is running, the service responds, but nobody knows if the responses are good.&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Confidence score distribution (heatmap)
ml_inference_confidence_bucket

# Rolling 24h average confidence score
# (_sum and _count are cumulative counters, so use increase() over the window)
increase(
  ml_inference_confidence_sum[24h]
) / increase(
  ml_inference_confidence_count[24h]
)

# Low-confidence response rate (&amp;lt; 0.6)
sum(rate(ml_inference_confidence_bucket{le="0.6"}[1h]))
  / sum(rate(ml_inference_confidence_count[1h])) * 100

# Model error rate (timeouts, exceptions, fallbacks)
sum(rate(ml_inference_duration_count{status="error"}[5m]))
  / sum(rate(ml_inference_duration_count[5m])) * 100

# Average tokens generated per request
# (ml.inference.tokens.total is instrumented as a counter, not a histogram)
sum(rate(ml_inference_tokens_total[1h]))
  / sum(rate(ml_inference_duration_count[1h]))

# Drift detector: current vs. baseline distribution comparison
# (requires a periodic compute job publishing the metric)
ml_feature_drift_score{feature="input_length"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configured alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average confidence score &amp;lt; adaptive threshold for 48h → drift alert&lt;/li&gt;
&lt;li&gt;Low-confidence response rate &amp;gt; 20% → quality alert&lt;/li&gt;
&lt;li&gt;Model error rate &amp;gt; 5% for 15 minutes → critical alert&lt;/li&gt;
&lt;li&gt;Drift score &amp;gt; 0.3 on a key feature → data shift alert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The drift alert is the most important and hardest to calibrate. The threshold isn't static — it must be calculated against a baseline established over a reference period (the first 30 days in production, for example). This is a use case where VictoriaMetrics recording rules come into their own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Recording rules for baseline calculation&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml_baseline&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml:confidence:baseline_avg&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_sum[30d]&lt;/span&gt;
          &lt;span class="s"&gt;) / avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_count[30d]&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml:confidence:current_avg&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_sum[24h]&lt;/span&gt;
          &lt;span class="s"&gt;) / avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_count[24h]&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml:confidence:drift_ratio&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;abs(ml:confidence:current_avg - ml:confidence:baseline_avg)&lt;/span&gt;
            &lt;span class="s"&gt;/ ml:confidence:baseline_avg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Dashboard 3: Cost &amp;amp; Business (audience: leadership, finance, product owners)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dashboard answers: "How much does it cost and how much value does it deliver?"&lt;/p&gt;

&lt;p&gt;This is the dashboard that turns a cost center into a value center — and the one that'll keep your budget next year.&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cumulative GPU cost by model (current day)
sum by (model_name)(
  increase(ml_inference_cost_gpu_total[24h])
)

# Average cost per inference
sum(rate(ml_inference_cost_gpu_total[1h]))
  / sum(rate(ml_inference_duration_count[1h]))

# Cost by use case
sum by (use_case)(
  increase(ml_inference_cost_gpu_total[30d])
)

# Daily inference volume
sum(increase(ml_inference_duration_count[24h]))

# Projected end-of-month cost (linear extrapolation)
sum(increase(ml_inference_cost_gpu_total[24h])) * 30

# Tokens/cost ratio (efficiency)
sum(rate(ml_inference_tokens_total[1h]))
  / sum(rate(ml_inference_cost_gpu_total[1h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dashboard must be readable by someone who doesn't know what a quantile is. Big numbers at the top (today's cost, projected monthly cost, inference count), trends below, details by model and use case at the bottom.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment: from docker-compose to cluster
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Prototype (docker-compose)
&lt;/h3&gt;

&lt;p&gt;To validate the architecture, a &lt;code&gt;docker-compose.yml&lt;/code&gt; is enough. This is what I use on the AI Observability Hub for quick demos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;victoriametrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;victoriametrics/victoria-metrics:stable&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8428:8428"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vm-data:/victoria-metrics-data&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-retentionPeriod=90d"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-search.maxUniqueTimeseries=3000000"&lt;/span&gt;

  &lt;span class="na"&gt;otel-collector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otel/opentelemetry-collector-contrib:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4317:4317"&lt;/span&gt;   &lt;span class="c1"&gt;# gRPC&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4318:4318"&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./otel-collector-config.yaml:/etc/otelcol/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;victoriametrics&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openobserve&lt;/span&gt;

  &lt;span class="na"&gt;openobserve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public.ecr.aws/zinclabs/openobserve:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5080:5080"&lt;/span&gt;   &lt;span class="c1"&gt;# UI&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5081:5081"&lt;/span&gt;   &lt;span class="c1"&gt;# Ingestion&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ZO_ROOT_USER_EMAIL=admin@erythix.com&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ZO_ROOT_USER_PASSWORD=changeme&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;oo-data:/data&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana-data:/var/lib/grafana&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/provisioning:/etc/grafana/provisioning&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/dashboards:/var/lib/grafana/dashboards&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_SECURITY_ADMIN_PASSWORD=changeme&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;victoriametrics&lt;/span&gt;

  &lt;span class="c1"&gt;# ML load simulator (for demos)&lt;/span&gt;
  &lt;span class="na"&gt;ml-simulator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./ml-simulator&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;MODEL_NAME=llama-3-8b&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vm-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;oo-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grafana-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 2: Production (Kubernetes)
&lt;/h3&gt;

&lt;p&gt;For production, each component is deployed via Helm charts or Kubernetes manifests with the following considerations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VictoriaMetrics&lt;/strong&gt;: official Helm chart (&lt;code&gt;victoria-metrics-k8s-stack&lt;/code&gt;) including vmoperator, recording rules, and Grafana integration. Cluster mode for HA, with PVCs on performant storage (local SSD or AWS EBS gp3).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OTel Collector&lt;/strong&gt;: deployed as a DaemonSet (one per node, for system and GPU metric collection) + a centralized Deployment (for aggregation, tail sampling, and routing). The DaemonSet collects DCGM metrics and local logs. The central Deployment handles processing and export.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana&lt;/strong&gt;: deployed with automatic datasource and dashboard provisioning via ConfigMaps. Dashboards are versioned in Git and deployed via CI/CD — no manual configuration that drifts over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfalls I learned to avoid
&lt;/h2&gt;

&lt;p&gt;After multiple iterations on the AI Observability Hub and real-world deployments, here are the most expensive mistakes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cardinality explosion
&lt;/h3&gt;

&lt;p&gt;Trap number one. A &lt;code&gt;user_id&lt;/code&gt; label on inference metrics seems useful - until 10,000 users generate 10,000 time series per metric. Multiply by 20 metrics and 3 model versions, and you hit 600,000 series for a single service.&lt;/p&gt;

&lt;p&gt;The rule: high-cardinality labels (user ID, request ID, session ID) belong in &lt;strong&gt;traces and logs&lt;/strong&gt;, not in &lt;strong&gt;metrics&lt;/strong&gt;. Metrics use bounded-cardinality labels: model_name, model_version, environment, use_case, status.&lt;/p&gt;
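
&lt;p&gt;To make the split concrete, here is a minimal sketch with the OpenTelemetry Python SDK (names such as &lt;code&gt;run_model&lt;/code&gt; and the exact attribute keys are illustrative, not taken from a specific codebase): the counter carries only bounded labels, while the request-scoped identifiers ride on the active span, where tail sampling and log correlation can still pick them up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry import metrics, trace

meter = metrics.get_meter("ml.inference")
tracer = trace.get_tracer("ml.inference")

# Bounded-cardinality labels only: a handful of possible values per dimension
request_counter = meter.create_counter("ml.inference.requests.total")

def handle_request(user_id, session_id, prompt):
    with tracer.start_as_current_span("predict") as span:
        # High-cardinality identifiers go on the span (and into logs), not on metrics
        span.set_attribute("enduser.id", user_id)
        span.set_attribute("session.id", session_id)

        result = run_model(prompt)  # run_model is a placeholder for the real inference call

        # The metric only carries labels with a small, known set of values
        request_counter.add(1, {
            "model_name": "llama-3-8b",
            "environment": "production",
            "status": "ok",
        })
        return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this split, "error rate per model version" stays a cheap query in VictoriaMetrics, and "show me everything user X did at 2 PM" is answered from traces and logs instead.&lt;/p&gt;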

&lt;h3&gt;
  
  
  Cargo cult monitoring
&lt;/h3&gt;

&lt;p&gt;Copying someone else's dashboards without understanding what they measure. I've seen teams with 47 panels on a dashboard, 43 of which nobody looked at. A useful dashboard has between 6 and 12 panels, organized by business question, not by metric type.&lt;/p&gt;

&lt;h3&gt;
  
  
  No baseline
&lt;/h3&gt;

&lt;p&gt;Monitoring without a baseline is like having a thermometer without knowing what temperature is normal. The first 30 days of a model's production run should establish baselines for every key metric. Drift alerts are calculated against these baselines, not against arbitrary thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Only monitoring the happy path
&lt;/h3&gt;

&lt;p&gt;Instrumenting only the nominal path and discovering during an incident that the error path isn't traced. Every fallback, every timeout, every exception should produce metrics and spans with an explicit error status. Errors are where observability creates the most value.&lt;/p&gt;
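
&lt;p&gt;A minimal sketch of what that looks like with the OpenTelemetry Python SDK (the metric name, label values, and the &lt;code&gt;fallback_answer&lt;/code&gt; helper are illustrative assumptions): the exception is recorded on the span, the span status is set to error, and the request counter is incremented with an explicit status label so the error rate can be graphed and alerted on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry import metrics, trace
from opentelemetry.trace import StatusCode

meter = metrics.get_meter("ml.inference")
tracer = trace.get_tracer("ml.inference")
request_counter = meter.create_counter("ml.inference.requests.total")

def predict_with_error_path(model, prompt):
    labels = {"model_name": "llama-3-8b", "environment": "production"}
    with tracer.start_as_current_span("predict") as span:
        try:
            result = model.generate(prompt)                      # nominal path
            request_counter.add(1, {**labels, "status": "ok"})
            return result
        except TimeoutError:
            # Fallbacks are counted and traced, not silently swallowed
            request_counter.add(1, {**labels, "status": "timeout"})
            span.set_status(StatusCode.ERROR, "model timeout")
            return fallback_answer(prompt)                       # placeholder fallback helper
        except Exception as exc:
            request_counter.add(1, {**labels, "status": "error"})
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, str(exc))
            raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;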

&lt;h3&gt;
  
  
  The cost of monitoring itself
&lt;/h3&gt;

&lt;p&gt;I've seen observability stacks that cost more than the infrastructure they monitored. VictoriaMetrics helps significantly here (aggressive compression, low memory footprint), but sizing must be planned from the start. Rule of thumb: monitoring cost shouldn't exceed 5-10% of the monitored infrastructure cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  The result: what the stack makes possible
&lt;/h2&gt;

&lt;p&gt;When all four observability layers are in place (infra, pipeline, model, cost), three things become possible that weren't before:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift detection in days, not months.&lt;/strong&gt; On a recent engagement, the stack detected a confidence score degradation in a predictive maintenance model 72 hours after a sensor change on the factory floor. Without model monitoring, the team would have continued following degraded recommendations for weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-driven cost/performance optimization.&lt;/strong&gt; The cost dashboard helped a client discover that a marginal use case (5% of volume) consumed 35% of the GPU budget due to an oversized model. Replaced with a lighter model, same perceived quality, 30% reduction in the overall bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance as a byproduct of observability.&lt;/strong&gt; The audit trails required for the EU AI Act aren't an additional effort; they're the traces and logs the stack already collects. It's a matter of structuring them for audit, not creating them from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you have nothing today, here's the sequence I recommend:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1: Instrument.&lt;/strong&gt; Add the OpenTelemetry SDK to your inference service. Five metrics are enough to start: inference duration, request count, error rate, tokens generated, confidence score. Deploy the Collector in minimal mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2: Store and visualize.&lt;/strong&gt; Deploy VictoriaMetrics in single-node mode and Grafana. Create the infra dashboard first (it's the fastest), then the model dashboard. It doesn't need to look pretty — it needs to be functional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3: Alert.&lt;/strong&gt; Configure three alerts only: p95 latency above SLO, error rate above 5%, confidence score dropping. Three well-calibrated alerts are worth more than twenty that generate fatigue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 2: Refine.&lt;/strong&gt; Add cost metrics. Establish your baselines. Configure recording rules for drift calculation. Create the cost/business dashboard. At this point, you have production-grade ML observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 3+: Extend.&lt;/strong&gt; Add traces for advanced debugging. Integrate structured logs. Connect AI alerts to your SIEM. Explore input feature monitoring for data drift detection.&lt;/p&gt;

&lt;p&gt;Each step delivers immediate value. No need to wait for everything to be in place to benefit from observability.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Samuel Desseaux is the founder of Erythix and Aureonis, a fractional CTO and trainer specializing in IT/AI observability, AI security, and IT/OT convergence. Official VictoriaMetrics partner for France and Benelux, and Arize partner.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The AI Observability Hub is Erythix's demonstration platform for AI workload observability. Contact &lt;a href="https://www.linkedin.com/in/sdesseaux/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/sdesseaux/&lt;/a&gt; for a demo or a stack diagnostic.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>machinelearning</category>
      <category>monitoring</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Observability as the Control Plane for AI: Operations, Security, Governance</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Mon, 23 Feb 2026 21:10:48 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/observability-as-the-control-plane-for-ai-operations-security-governance-1bk7</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/observability-as-the-control-plane-for-ai-operations-security-governance-1bk7</guid>
      <description>&lt;p&gt;Language models are in production. AI agents are making decisions, calling APIs, accessing databases and in most European mid-market industrial companies, nobody knows exactly what they're doing.&lt;/p&gt;

&lt;p&gt;We know how to monitor a web server and how to trace a SQL query, but when an LLM decides to rephrase a maintenance instruction, summarize a quality report, or trigger an action through an external tool, traditional observability stacks are blind. Not because they lack metrics, but because they're asking the wrong questions.&lt;/p&gt;

&lt;p&gt;Observability for AI is no longer "is the system running?" It's &lt;strong&gt;"why did the system make that decision, and should it have?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article proposes a three-layer framework (operational, security, governance) to turn observability into a true control plane for AI systems. Not in five years. Now. With open-source building blocks. For mid-market companies that don't have hyperscaler budgets.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: non-deterministic systems in critical environments
&lt;/h2&gt;

&lt;p&gt;A traditional system is predictable: same input, same output. An LLM, by design, is not. It produces probabilistic responses, follows variable execution paths, and, when connected to tools (function calling, RAG, agents), can trigger real-world actions.&lt;/p&gt;

&lt;p&gt;Three characteristics make traditional monitoring approaches insufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-determinism&lt;/strong&gt; first. Two identical queries can produce different results. Static alert thresholds lose their meaning when normal behavior isn't constant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploding cardinality&lt;/strong&gt; next. Every AI interaction generates new telemetry dimensions: prompts, embeddings, tool calls, intermediate reasoning steps, dynamic identities. The volume of data to ingest and correlate changes by an order of magnitude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-system execution paths&lt;/strong&gt; last. An AI workflow can traverse APIs, data platforms, identity systems, security controls, and multiple infrastructure layers. Root cause analysis requires stitching together signals across domains that historically operated in silos.&lt;/p&gt;

&lt;p&gt;In an industrial context (predictive maintenance, process control, asset management), this opacity isn't a technical inconvenience. It's an operational risk.&lt;/p&gt;




&lt;h2&gt;
  
  
  The framework: three layers of control
&lt;/h2&gt;

&lt;p&gt;AI observability is not a separate discipline. It's an extension of existing monitoring that adds three specific control functions. Each answers a different question, serves a different audience, and is implemented with different tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Operational control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the AI system performing as expected, within acceptable performance boundaries?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it serves:&lt;/strong&gt; SRE, DevOps, and data engineering teams.&lt;/p&gt;

&lt;p&gt;Operational control extends existing SRE practices to AI workloads. It means monitoring not just infrastructure (GPU, memory, network latency) but also model behavior in production: inference time, error rates, response quality, cost per request.&lt;/p&gt;

&lt;p&gt;The challenge specific to AI systems is &lt;strong&gt;drift detection&lt;/strong&gt;. A model that worked correctly three months ago can silently degrade because input data has shifted, business context has changed, or a RAG component was updated without anyone checking the impact on outputs.&lt;/p&gt;
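
&lt;p&gt;To make drift detection concrete, here is a sketch of a periodic drift job (the metric name, the VictoriaMetrics endpoint, the bin count, and the &lt;code&gt;fetch_recent_input_lengths&lt;/code&gt; helper are all illustrative assumptions, not a prescribed implementation): it compares the current distribution of one input feature against a frozen baseline using a Population Stability Index, then pushes the score so an alert rule can evaluate it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import requests

VM_IMPORT_URL = "http://victoriametrics:8428/api/v1/import/prometheus"  # assumed endpoint

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_counts, _ = np.histogram(baseline, bins=edges)
    c_counts, _ = np.histogram(current, bins=edges)
    b_frac = np.clip(b_counts / b_counts.sum(), 1e-6, None)
    c_frac = np.clip(c_counts / c_counts.sum(), 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

def publish_drift(feature, score):
    # One sample in Prometheus text format; VictoriaMetrics ingests it directly
    line = f'ml_feature_drift_score{{feature="{feature}"}} {score}\n'
    requests.post(VM_IMPORT_URL, data=line, timeout=5).raise_for_status()

# Periodic job (cron, Airflow, ...): baseline frozen at go-live, current window = last 24h
baseline_lengths = np.load("baseline_input_length.npy")      # baseline captured at go-live
current_lengths = np.array(fetch_recent_input_lengths())     # placeholder for the real query
publish_drift("input_length", psi(baseline_lengths, current_lengths))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;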

&lt;p&gt;&lt;strong&gt;In practice, this means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defining AI-specific SLOs (p95 inference latency, hallucination rate, cost per token) alongside standard application SLOs&lt;/li&gt;
&lt;li&gt;Instrumenting the ML pipeline end-to-end with OpenTelemetry — from data preprocessing through to the user-facing response&lt;/li&gt;
&lt;li&gt;Collecting and storing model performance metrics in a high-performance time-series database (VictoriaMetrics natively handles the high cardinality these workloads generate)&lt;/li&gt;
&lt;li&gt;Configuring dynamic alerts that adapt to probabilistic behavior, rather than fixed thresholds that generate noise&lt;/li&gt;
&lt;/ul&gt;
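
&lt;p&gt;As a rough illustration of that instrumentation, here is a minimal sketch using the OpenTelemetry Python API. It assumes an SDK already configured with exporters (the bootstrap appears later in the architecture section); the span name, metric names, and the &lt;code&gt;model.generate()&lt;/code&gt; call are illustrative, not a standard.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: wrapping one inference call in a span and recording
# AI-specific measurements. Names are illustrative, not normative.
from opentelemetry import trace, metrics

tracer = trace.get_tracer("ml.pipeline")
meter = metrics.get_meter("ml.pipeline")

latency_ms = meter.create_histogram("ml.inference.latency", unit="ms")
tokens_used = meter.create_counter("ml.inference.tokens")

def predict(request, model, model_version):
    with tracer.start_as_current_span("ml.inference") as span:
        span.set_attribute("ml.model.name", model.name)
        span.set_attribute("ml.model.version", model_version)

        result = model.generate(request.prompt)   # hypothetical model API

        span.set_attribute("ml.response.tokens", result.token_count)
        latency_ms.record(result.latency_ms, {"model": model.name})
        tokens_used.add(result.token_count, {"model": model.name})
        return result
&lt;/code&gt;&lt;/pre&gt;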

&lt;p&gt;&lt;strong&gt;Concrete use case — Predictive maintenance in aerospace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An aerospace subcontractor uses an ML model to anticipate equipment failures on an assembly line. The model runs on an internal HPC cluster, consumes real-time sensor data, and feeds a maintenance planning dashboard.&lt;/p&gt;

&lt;p&gt;The observability stack monitors three levels: infrastructure (GPU utilization, memory throughput, cluster network latency), pipeline (preprocessing time, training data freshness, input feature drift), and the model itself (confidence score distribution, false positive rate over the last 30 days, comparison with field feedback).&lt;/p&gt;

&lt;p&gt;When the average confidence score drops below an adaptive threshold for 48 hours, an alert is routed not to the infrastructure team, but to the data team — because the problem is most likely not a downed server, but a shift in sensor data patterns.&lt;/p&gt;
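
&lt;p&gt;The adaptive threshold itself can stay simple. A minimal sketch, assuming the confidence scores are already queryable as two time windows; the window sizes, the two-sigma margin, and the &lt;code&gt;route_alert()&lt;/code&gt; helper are assumptions, not prescriptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: compare the recent confidence mean against a rolling baseline
# instead of a fixed threshold. Window sizes, the two-sigma margin and
# route_alert() are illustrative assumptions.
from statistics import mean, stdev

def check_confidence_drift(scores_30d, scores_48h, route_alert):
    baseline = mean(scores_30d)
    spread = stdev(scores_30d)
    recent = mean(scores_48h)

    # Two standard deviations below the 30-day baseline: most likely a
    # shift in sensor data patterns, so the alert goes to the data team.
    if baseline - recent &gt; 2 * spread:
        route_alert(team="data", reason="confidence-drift",
                    baseline=baseline, recent=recent)
&lt;/code&gt;&lt;/pre&gt;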

&lt;p&gt;Without this layer, the maintenance team continues following recommendations from a degraded model for weeks. With it, drift is detected in days, not months.&lt;/p&gt;




&lt;h3&gt;
  
  
  Layer 2: Security control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the AI system doing something it shouldn't be doing?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it serves:&lt;/strong&gt; CISOs, SOC teams, cybersecurity leads.&lt;/p&gt;

&lt;p&gt;This is the layer that most fundamentally changes the role of observability. Production LLMs are not just tools — they are active attack surfaces. Prompt injection, data exfiltration through model outputs, business logic abuse through reasoning manipulation, unauthorized use of connected tools: the vectors are specific, and traditional security tools don't cover them.&lt;/p&gt;

&lt;p&gt;The OWASP Top 10 for LLM Applications identifies ten vulnerability categories, the most critical for industrial contexts being prompt injection (direct and indirect), insecure output handling, and excessive agency granted to agents.&lt;/p&gt;

&lt;p&gt;The fundamental problem is &lt;strong&gt;visibility&lt;/strong&gt;. In many organizations, LLM interactions simply aren't logged. When an incident occurs, response teams don't have the data to understand what happened. Prompts aren't recorded. Tool calls aren't traced. Model decisions aren't auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, this means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Systematically logging every LLM interaction: input prompt, system context, generated output, tools called, with timestamps and user attribution (sketched after this list)&lt;/li&gt;
&lt;li&gt;Establishing behavioral baselines and detecting anomalies: unusual request spikes, repetitive prompt patterns (possible signs of model extraction), access to tools or data outside normal scope&lt;/li&gt;
&lt;li&gt;Treating tools exposed to LLMs as privileged interfaces — with access controls, auditing, and enforcement independent of the model's output&lt;/li&gt;
&lt;li&gt;Integrating AI alerts into the existing SIEM rather than creating a parallel security silo&lt;/li&gt;
&lt;/ul&gt;
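
&lt;p&gt;A minimal sketch of that interaction logging, assuming OpenTelemetry tracing is already set up. The attribute names loosely follow the draft GenAI semantic conventions, and the &lt;code&gt;client.complete()&lt;/code&gt; call is hypothetical; adapt both to your own gateway.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: recording one LLM interaction with full attribution.
# Attribute names loosely follow OpenTelemetry's draft GenAI semantic
# conventions; client.complete() is a hypothetical gateway call.
from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

def call_llm(client, user_id, prompt, system_context, tools):
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("enduser.id", user_id)
        # If prompts may contain sensitive data, store a reference to an
        # encrypted copy here instead of the raw text.
        span.set_attribute("gen_ai.prompt", prompt)

        response = client.complete(prompt, system=system_context, tools=tools)

        span.set_attribute("gen_ai.completion", response.text)
        # Tool calls are what turn a bad output into a bad action:
        # record each one so an incident can be reconstructed later.
        for call in response.tool_calls:
            span.add_event("gen_ai.tool.call", {
                "tool.name": call.name,
                "tool.arguments": str(call.arguments),
            })
        return response
&lt;/code&gt;&lt;/pre&gt;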

&lt;p&gt;&lt;strong&gt;Concrete use case — Prompt injection detection on an internal RAG assistant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An energy distributor deploys an internal conversational assistant connected via RAG to its technical documentation (intervention procedures, safety data sheets, standards). The assistant is used by field technicians through a mobile application.&lt;/p&gt;

&lt;p&gt;The security layer observes every interaction and applies three levels of detection. First level: syntactic analysis of incoming prompts to detect known injection patterns (instructions like "ignore your previous instructions," unusual encodings, attempts to exfiltrate the system prompt). Second level: output monitoring to identify responses containing system prompt elements, out-of-scope data, or instructions the model shouldn't generate. Third level: tool call monitoring — if the assistant attempts to access documents outside its authorized scope or if the access pattern changes abruptly, an alert is raised.&lt;/p&gt;
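
&lt;p&gt;The first, syntactic level can be as simple as a normalization pass plus a pattern list. A sketch follows, with a deliberately short and illustrative pattern list; on its own it catches only crude attempts and is meant to feed the behavioral and output-side checks, not replace them.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the first, syntactic detection level. The pattern list is
# deliberately short and illustrative; it feeds the behavioral and
# output-side checks rather than replacing them.
import re
import unicodedata

INJECTION_PATTERNS = [
    r"ignore (all|your) (previous|prior) instructions",
    r"reveal (the|your) system prompt",
    r"you are now .{0,40} without restrictions",
]

def flag_suspicious_prompt(prompt):
    # Normalize unusual encodings (homoglyphs, width tricks) before matching.
    normalized = unicodedata.normalize("NFKC", prompt).lower()
    return [p for p in INJECTION_PATTERNS if re.search(p, normalized)]
&lt;/code&gt;&lt;/pre&gt;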

&lt;p&gt;One morning, the system detects a user submitting a series of short, structured prompts that resemble a systematic attempt to map the documents accessible via the RAG pipeline. The volume and pattern don't match normal technician usage. The alert reaches the SOC, which isolates the session and analyzes the vector.&lt;/p&gt;

&lt;p&gt;Without this layer, the progressive exfiltration of the technical documentation base goes unnoticed.&lt;/p&gt;




&lt;h3&gt;
  
  
  Layer 3: Governance control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the AI system complying with the rules the organization is subject to?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it serves:&lt;/strong&gt; executive leadership, compliance, legal, DPOs.&lt;/p&gt;

&lt;p&gt;The EU AI Act enters full enforcement in 2026. For high-risk AI systems — and many industrial use cases fall into this category — the obligations are concrete: decision traceability, technical documentation, risk assessment, human oversight, robustness, and cybersecurity. Fines can reach €35 million or 7% of global annual turnover.&lt;/p&gt;

&lt;p&gt;AI governance is not a PowerPoint topic. It's a data topic. And that data is what observability produces.&lt;/p&gt;

&lt;p&gt;Observability provides the evidence needed for compliance: who used the system, when, with what inputs, what decisions were made, what actions were triggered, and whether safeguards were active at the time of the interaction. Without these traces, a mid-market company cannot demonstrate compliance — it can only assert it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, this means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building complete, timestamped audit trails of every AI interaction, retained according to regulatory timeframes (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Automatically documenting data lineage: where did the training data, RAG data, and context data come from? Were they filtered, anonymized, validated?&lt;/li&gt;
&lt;li&gt;Producing compliance dashboards that translate technical data into indicators understandable by a board of directors or an auditor&lt;/li&gt;
&lt;li&gt;Implementing traceable human oversight mechanisms — not a cosmetic "approve" button, but proof that human decision-making actually occurred&lt;/li&gt;
&lt;/ul&gt;
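
&lt;p&gt;A sketch of what one audit-trail entry can look like, assuming an append-only JSONL sink. The field names and the storage choice are illustrative; in practice these records are assembled from the telemetry the first two layers already emit.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: one timestamped, attributable record per AI interaction,
# appended to write-once storage. Field names and the JSONL sink are
# illustrative; retention must follow the applicable regulatory timeframe.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    timestamp: str
    user_id: str
    model_version: str
    input_ref: str          # pointer to the stored prompt/input, not a copy
    output_ref: str
    tools_called: list
    safeguards_active: bool
    human_review: bool

def append_audit(record, path="ai_audit.jsonl"):
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

append_audit(AuditRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    user_id="tech-017",
    model_version="assistant-rag-1.4.2",
    input_ref="s3://prompts/2026/03/21/abc123.json",
    output_ref="s3://completions/2026/03/21/abc123.json",
    tools_called=["search_documents"],
    safeguards_active=True,
    human_review=False,
))
&lt;/code&gt;&lt;/pre&gt;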

&lt;p&gt;&lt;strong&gt;Concrete use case — EU AI Act audit for a quality control system&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An automotive component manufacturer uses a computer vision system coupled with an LLM for end-of-line quality control. The system classifies parts (conforming / non-conforming / requires inspection) and generates anomaly reports in natural language.&lt;/p&gt;

&lt;p&gt;This system falls under the "high-risk" category as defined by the EU AI Act (safety components for products covered by Union harmonization legislation). The company must demonstrate traceability for every classification decision.&lt;/p&gt;

&lt;p&gt;The governance layer builds on data produced by the two preceding layers and structures it for audit. Every classification decision is recorded with: the source image, the vision model result, the confidence score, the LLM-generated report, the timestamp, and the identifier of the model version used. When the confidence score falls below a defined threshold, the system forces human verification and records the operator's decision.&lt;/p&gt;
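
&lt;p&gt;The oversight gate can be expressed in a few lines. In the sketch below, the 0.85 threshold, the &lt;code&gt;request_human_review()&lt;/code&gt; step, and the &lt;code&gt;record_decision()&lt;/code&gt; sink are placeholders for the manufacturer's own values and systems.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the human-oversight gate: below the confidence threshold the
# classification is never auto-applied, and the operator's verdict is
# recorded as evidence. Threshold and helpers are placeholders.
CONFIDENCE_THRESHOLD = 0.85

def finalize_classification(result, request_human_review, record_decision):
    auto_ok = result.confidence &gt;= CONFIDENCE_THRESHOLD

    if auto_ok:
        final_label = result.label
        operator_id = None
    else:
        verdict = request_human_review(result)    # blocking, traceable step
        final_label = verdict.label
        operator_id = verdict.operator_id

    record_decision(
        image_ref=result.image_ref,
        model_version=result.model_version,
        confidence=result.confidence,
        automated=auto_ok,
        final_label=final_label,
        operator_id=operator_id,
    )
    return final_label
&lt;/code&gt;&lt;/pre&gt;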

&lt;p&gt;A governance dashboard aggregates this data into monthly indicators: automated vs. supervised classification rates, confidence score distribution, number of human-machine disagreements, average processing time. These indicators feed directly into the EU AI Act compliance file.&lt;/p&gt;

&lt;p&gt;During an audit, the company doesn't present a static document describing what the system is supposed to do. It shows the actual data of what the system did, decision by decision, with proof that safeguards were functioning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture: building on what exists with open-source components
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes is treating AI observability as a greenfield project requiring a dedicated stack. In most mid-market companies, the foundations already exist: metrics collection, log aggregation, trace correlation. The work is to extend that stack, not replace it.&lt;/p&gt;

&lt;p&gt;Here is a reference architecture built on open-source, sovereign components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collection and instrumentation&lt;/strong&gt; — OpenTelemetry as the single instrumentation standard. The benefit is twofold: vendor-agnostic (no lock-in) and extensible to AI-specific signals (prompt traces, inference metrics, tool call spans). OpenTelemetry SDKs integrate with existing ML frameworks (LangChain, LlamaIndex, vLLM) through dedicated instrumentations.&lt;/p&gt;
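
&lt;p&gt;The bootstrap for that collection side fits in a dozen lines per service. A sketch, assuming an OpenTelemetry Collector reachable on the usual OTLP gRPC port; the endpoint and service name are placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: bootstrapping OpenTelemetry once per service, exporting over
# OTLP to a local collector. Endpoint and service name are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "feature-store"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Framework instrumentations (LangChain, LlamaIndex, vLLM) can then emit through this same provider, so prompt traces and pipeline traces end up in one trace tree.&lt;/p&gt;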

&lt;p&gt;&lt;strong&gt;Metric storage and querying&lt;/strong&gt; — VictoriaMetrics for time series. The high cardinality of AI metrics (one dimension per model × per version × per request type × per user) overwhelms traditional monitoring solutions. VictoriaMetrics is designed to handle this scale with a controlled resource footprint — a critical point for mid-market companies that can't provision Thanos clusters.&lt;/p&gt;
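
&lt;p&gt;To make the cardinality point concrete, here is a sketch using &lt;code&gt;prometheus_client&lt;/code&gt; (VictoriaMetrics scrapes the same exposition format); the metric and label names are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: every label added below multiplies the number of active series.
# model x version x request_type already produces hundreds of series;
# adding per-user labels is usually where traditional TSDBs give up.
from prometheus_client import Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "ml_inference_latency_seconds",
    "Inference latency per model, version and request type",
    ["model", "version", "request_type"],
)

start_http_server(9100)  # /metrics endpoint scraped by VictoriaMetrics

def observe(model, version, request_type, seconds):
    INFERENCE_LATENCY.labels(model, version, request_type).observe(seconds)
&lt;/code&gt;&lt;/pre&gt;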

&lt;p&gt;&lt;strong&gt;Logs and traces&lt;/strong&gt; — OpenObserve or Grafana Loki for log aggregation, with long retention for audit requirements. OpenTelemetry traces enable following a user request from the initial prompt to the final action, through every reasoning step and every tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualization and alerting&lt;/strong&gt; — Grafana as the unified presentation layer. Three types of dashboards: operational (SRE/DevOps), security (SOC), governance (leadership/compliance). The same underlying data, views adapted to each audience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security integration&lt;/strong&gt; — AI alerts feed the existing SIEM through standardized exports. No silos. The goal is for the SOC analyst to see AI events in the same stream as network and application events.&lt;/p&gt;

&lt;p&gt;What makes this architecture viable for a mid-market company is that it relies on components many already use. Extending to AI observability doesn't require a dedicated budget of several hundred thousand euros. It requires competence, architecture, and an understanding of what needs to be measured.&lt;/p&gt;




&lt;h2&gt;
  
  
  What really changes: from passive to active observability
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring is fundamentally passive. It collects data, displays dashboards, and sends alerts when a threshold is breached. The human then decides what to do.&lt;/p&gt;

&lt;p&gt;For AI systems in production, this approach is no longer sufficient. When an AI agent makes a bad decision, the time between detection and action must be measured in seconds, not minutes. Observability becomes active: it informs automated actions — throttling, rollback, isolation, escalation — when system behavior deviates from expected patterns.&lt;/p&gt;
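
&lt;p&gt;What "active" can look like in practice: a sketch of a webhook receiver that maps alert names to actions. The payload shape follows the Alertmanager webhook format; the alert names and the throttle, rollback, and isolate functions are placeholders for whatever your serving and identity layers expose.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: an alert-webhook handler that acts instead of only notifying.
# Payload shape follows the Alertmanager webhook format; the three
# action functions are placeholders for your own serving/identity APIs.
from flask import Flask, request

app = Flask(__name__)

def throttle_model(model, rate):        # e.g. tighten a gateway rate limit
    ...

def rollback_model(model, version):     # e.g. call the deployment API
    ...

def isolate_session(session_id):        # e.g. revoke the session or token
    ...

ACTIONS = {
    "AIOutputAnomaly": lambda labels: throttle_model(labels["model"], rate=0.1),
    "AIDriftCritical": lambda labels: rollback_model(labels["model"], labels["previous_version"]),
    "AIToolAbuse":     lambda labels: isolate_session(labels["session_id"]),
}

@app.post("/alerts")
def handle_alert():
    for alert in request.json.get("alerts", []):
        action = ACTIONS.get(alert["labels"].get("alertname"))
        if action:
            action(alert["labels"])     # seconds between detection and action
    return "", 204
&lt;/code&gt;&lt;/pre&gt;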

&lt;p&gt;This shift to active observability is the real paradigm change. It's no longer about knowing what happened after the fact. It's about intervening while it's happening.&lt;/p&gt;

&lt;p&gt;For European industrial mid-market companies, this is also a strategic opportunity. Hyperscalers are building these capabilities into their proprietary platforms. Building the equivalent on open-source, sovereign components means preserving your decision-making capacity and technical mastery — without depending on a vendor that can change its terms, its pricing, or its data policies overnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to start?
&lt;/h2&gt;

&lt;p&gt;If you're deploying or planning to deploy AI systems in production, here are three concrete actions to take now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrument before you deploy.&lt;/strong&gt; Integrate OpenTelemetry into your ML pipelines during development, not after production deployment. The cost of retrofitting instrumentation is always higher, and the first weeks in production are precisely when you need visibility the most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define your AI SLOs.&lt;/strong&gt; Just as you have availability and latency SLOs for your applications, define measurable objectives for your AI systems: response quality, cost per inference, drift rate, audit coverage. What isn't measured won't be governed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat AI security as an observability problem.&lt;/strong&gt; Don't create an AI security silo next to your existing SOC. Extend your monitoring to cover the new threat vectors. The data is the same — the questions are different.&lt;/p&gt;

&lt;p&gt;Observability isn't just another tool in the stack. It's the control plane that makes AI governable. Without it, you're flying blind. With it, you know what your systems are doing, why they're doing it, and you can prove they're doing it correctly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Samuel Desseaux is the founder of Erythix and Aureonis, a CTO-Advocate specializing in IT/OT observability, AI security, and AI observability. Official VictoriaMetrics Training Partner for Europe and Arize partner.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
