<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Erythix</title>
    <description>The latest articles on DEV Community by Erythix (@erythix_6d20050c4f1039b32).</description>
    <link>https://dev.to/erythix_6d20050c4f1039b32</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3787623%2Fa06cf192-7e4d-4ffa-bfa9-b645f1a92ddf.png</url>
      <title>DEV Community: Erythix</title>
      <link>https://dev.to/erythix_6d20050c4f1039b32</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erythix_6d20050c4f1039b32"/>
    <language>en</language>
    <item>
      <title>Distributed Tracing in ML Pipelines: From Preprocessing to Inference</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Sat, 21 Mar 2026 11:51:14 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/distributed-tracing-in-ml-pipelines-from-preprocessing-to-inference-1a76</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/distributed-tracing-in-ml-pipelines-from-preprocessing-to-inference-1a76</guid>
      <description>&lt;h2&gt;
  
  
  How OpenTelemetry exposes the bottlenecks your metrics will never see
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Samuel Desseaux · Erythix&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Lie of the Green Dashboard
&lt;/h2&gt;

&lt;p&gt;It is 2 PM on a Tuesday. Your team receives a user report: predictions have been slow since this morning. You open Grafana. CPU at 38%, GPU at 72%, HTTP error rate at 0.2%, p99 latency at 1.4s. Nothing breaches a configured threshold. You tell the user everything looks nominal.&lt;/p&gt;

&lt;p&gt;Two hours later, a second report. Then a third. The problem exists. Your tools cannot see it.&lt;/p&gt;

&lt;p&gt;This scenario is not hypothetical. It is the daily reality of most teams operating ML pipelines in production without distributed tracing. Classic metrics measure the state of a service at a given moment. They do not measure the life of a request as it travels through multiple services. These are two fundamentally different levels of observation, and conflating them is a systematic source of operational blind spots.&lt;/p&gt;

&lt;p&gt;The distinction matters more in ML pipelines than anywhere else in software engineering, because a machine learning pipeline is not a function. It is a chain of distributed transformations, each with its own dependencies, its own timing characteristics, and its own failure modes.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why an ML Pipeline Is Structurally Difficult to Observe
&lt;/h2&gt;

&lt;p&gt;Consider a typical production pipeline for a recommendation engine or an LLM completion service. A request arrives at the API Gateway, which validates it, enriches context via a feature store, assembles a batch, sends it to the model server, retrieves raw output, validates and formats it, then finally responds to the client. Six to eight distinct services, sometimes in different runtimes (Python, Go, Triton), sometimes on different machines, sometimes in different availability zones.&lt;/p&gt;

&lt;p&gt;In this context, a performance degradation can originate anywhere in the chain. And its cause is rarely where it appears to be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The technical impact&lt;/strong&gt; is direct. Without cross-service visibility, diagnosing a regression takes hours. Engineers manually scan logs from each service, compare timestamps, and mentally reconstruct a sequence that tooling should surface automatically. This is not a knowledge problem; it is an instrumentation problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The business impact&lt;/strong&gt; is consistently underestimated. A slow ML pipeline means a recommendation engine that responds after the user has already scrolled past. It means an LLM assistant that feels broken. In the most critical contexts (fraud detection, credit scoring, medical triage), it means a decision rendered too late to be useful. Latency is not an infrastructure problem. It is a value delivery problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Four Bottlenecks That Metrics Will Never See
&lt;/h2&gt;

&lt;p&gt;Before reaching solutions, the problems deserve precise names. Field experience on production ML pipelines consistently surfaces four categories of bottlenecks that are structurally invisible to classic monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 The Cascade Bottleneck in Feature Extraction
&lt;/h3&gt;

&lt;p&gt;The feature store is the most under-monitored dependency in an ML pipeline. It handles requests efficiently when serving from cache, then falls back to the underlying database for the minority of cases where data is not warm. That minority can have a latency ten to fifty times higher than the cache path.&lt;/p&gt;

&lt;p&gt;What the metrics show: p50 at 15ms, p99 at 800ms on the feature store service. If you are already looking at that specific service dashboard, the problem is visible. If you are looking at aggregate pipeline latency, the feature store is buried in the noise. And if you do not know the feature store is the culprit, you will not look at its dashboard until the investigation is already underway.&lt;/p&gt;

&lt;p&gt;A distributed trace, by contrast, immediately shows that on slow requests the &lt;code&gt;feature_extraction&lt;/code&gt; span accounts for 60% of total pipeline time, and that it is consistently the &lt;code&gt;db_fallback&lt;/code&gt; child span driving the duration.&lt;/p&gt;
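
&lt;p&gt;A minimal sketch of how that &lt;code&gt;db_fallback&lt;/code&gt; child span can be produced on the extractor side (the fuller, attribute-based version appears in section 5.3; &lt;code&gt;cache&lt;/code&gt; and &lt;code&gt;feature_db&lt;/code&gt; are the same clients used there, and the span names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: record the database fallback as its own child span so a flamegraph
# shows "db_fallback" as a distinct bar under feature_extraction.
async def extract_with_fallback_span(feature_ids: list) -&amp;gt; dict:
    with tracer.start_as_current_span("ml.feature_extraction"):
        cached = await cache.mget(feature_ids)                    # fast path
        missing = [f for f, v in zip(feature_ids, cached) if v is None]
        db_features = {}
        if missing:
            with tracer.start_as_current_span("db_fallback") as db_span:
                db_span.set_attribute("ml.features.cache_misses", len(missing))
                db_features = await feature_db.get_batch(missing)  # slow path
        return {
            **{f: v for f, v in zip(feature_ids, cached) if v is not None},
            **db_features,
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;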

&lt;p&gt;The business impact crystallizes around new users: those whose features are not yet warm in cache experience the worst latency precisely at the moment when engagement is most fragile and first impressions are being formed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 The GPU Queue Hiding Behind Utilization Numbers
&lt;/h3&gt;

&lt;p&gt;The GPU is the most expensive resource in an ML infrastructure. Its standard monitoring metric is utilization percentage. 75% looks healthy, neither underused nor saturated.&lt;/p&gt;

&lt;p&gt;But GPU utilization measures the percentage of time the GPU is executing kernels. It does not measure the time requests spend waiting in the queue for GPU access. On a GPU showing 75% utilization, a request can still spend 60% of its billed wall-clock time waiting for earlier requests to release memory before its own computation begins.&lt;/p&gt;

&lt;p&gt;Distributed tracing decomposes the &lt;code&gt;inference&lt;/code&gt; span into two distinct measurements: &lt;code&gt;queue_wait_ms&lt;/code&gt; and &lt;code&gt;forward_pass_ms&lt;/code&gt;. When the ratio of queue wait to forward pass exceeds 1, the GPU is a bottleneck regardless of what the utilization gauge reads.&lt;/p&gt;
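
&lt;p&gt;That rule of thumb translates directly into a classification you can compute from the two attributes recorded in section 5.4. A hedged sketch (the 1.0 threshold is the rule of thumb above, not a universal constant):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: classify a slow inference span from its queue/compute breakdown.
def classify_inference(queue_wait_ms: float, forward_pass_ms: float) -&amp;gt; str:
    ratio = queue_wait_ms / max(forward_pass_ms, 1e-6)
    if ratio &amp;gt; 1.0:
        # Requests spend longer waiting for the GPU than using it.
        return "queue-bound: add replicas or reduce batch size"
    return "compute-bound: optimize the model or the batch shape"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;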

&lt;p&gt;The economics are stark. On a service handling 10,000 requests per hour, if each request waits 300ms in the GPU queue, that is 3,000 seconds of accumulated client-facing latency per hour. And the GPU billing meter runs equally on queue time and compute time.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Silent Revalidations
&lt;/h3&gt;

&lt;p&gt;This pattern is the most deceptive of the four. Your HTTP error rate is 0%. Users receive valid responses. Everything appears correct by every standard metric.&lt;/p&gt;

&lt;p&gt;Behind the scenes, the post-processor is receiving malformed model outputs: truncated JSON, missing required fields, unexpected formats. It repairs them by replaying validation with relaxed parameters, sometimes by re-invoking the model. The end user sees a valid response with slightly elevated latency. The monitoring dashboard sees nothing unusual.&lt;/p&gt;

&lt;p&gt;This behavior is an early indicator of model degradation. A model that starts producing malformed outputs 5% of the time will progressively worsen to 10%, then 20%. Without measuring the revalidation rate, the degradation only becomes detectable through HTTP error rate increases, which is to say: too late, after user-facing failures have already begun.&lt;/p&gt;

&lt;p&gt;Distributed tracing makes these validation attempts visible as attributes and events on the &lt;code&gt;post_processing&lt;/code&gt; span. Aggregated across requests, they form an early warning signal that HTTP-level metrics fundamentally cannot provide.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 The Unmanaged Component Cold Start
&lt;/h3&gt;

&lt;p&gt;The tokenizer, the image preprocessor, the feature scaler: these components are typically loaded from disk or initialized on the first request, then held in memory. Or they are supposed to be.&lt;/p&gt;

&lt;p&gt;In practice, unexpected reloads occur for a variety of reasons: worker rotation, memory eviction under pressure, partial deployments, lifecycle bugs. The result is a bimodal latency distribution: the vast majority of requests are fast, while a minority are slow in a pattern that does not look random.&lt;/p&gt;

&lt;p&gt;Identifying this on aggregated metrics is difficult because the mean and even the p99 can appear acceptable if the reloads are infrequent. On a trace, the cold start appears as a &lt;code&gt;tokenizer_init&lt;/code&gt; span of 300ms on affected requests and absent on clean ones. The pattern is immediately legible on a flamegraph. On a metrics dashboard, it dissolves into the p95 histogram and becomes invisible.&lt;/p&gt;
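
&lt;p&gt;Instrumenting this requires nothing more than wrapping the lazy initialization in a span. A sketch, assuming a module-level cache and a hypothetical &lt;code&gt;load_tokenizer_from_disk&lt;/code&gt; loader:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: make tokenizer (re)initialization visible as its own span. When a
# worker restart or memory eviction clears the cache, the reload shows up as
# a "tokenizer_init" span on exactly the requests that paid for it.
_tokenizer = None

def get_tokenizer():
    global _tokenizer
    if _tokenizer is None:
        with tracer.start_as_current_span("tokenizer_init") as span:
            span.set_attribute("ml.component", "tokenizer")
            _tokenizer = load_tokenizer_from_disk()  # hypothetical loader
    return _tokenizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;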




&lt;h2&gt;
  
  
  4. What Distributed Tracing Changes Structurally
&lt;/h2&gt;

&lt;p&gt;Before turning to the implementation, the conceptual case deserves to be stated clearly.&lt;/p&gt;

&lt;p&gt;A metric is an aggregation. It discards information about individual requests. It cannot tell you that user X's request was slow because of the feature store while user Y's request was slow because of the GPU queue.&lt;/p&gt;

&lt;p&gt;A log is an isolated event. It knows what happened inside one service, but not how that event relates to what happened in the other services handling the same request.&lt;/p&gt;

&lt;p&gt;A trace is a causal and temporal view of a request as it traverses all services. It links events by their &lt;code&gt;trace_id&lt;/code&gt;, preserves the parent-child relationship between spans, measures each stage individually, and allows navigation from the aggregate view (a flamegraph across all requests) to the individual view (one specific slow request) in two clicks.&lt;/p&gt;

&lt;p&gt;This is the difference between knowing your pipeline is sometimes slow and knowing why this specific request took 1.8 seconds at 2:03 PM on Tuesday.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Instrumentation Architecture: Five Services, One Trace
&lt;/h2&gt;

&lt;p&gt;The following implementation covers a complete ML pipeline instrumented end-to-end. Each service produces spans. Context propagates via HTTP headers. Tempo aggregates spans into navigable traces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Client]
    |
    v
[API Gateway]              &amp;lt;- trace_id created here
    |
    |---&amp;gt; [Input Validator]        &amp;lt;- child span 1
    |
    |---&amp;gt; [Feature Extractor]      &amp;lt;- child span 2
    |           `---&amp;gt; [Feature Store]   &amp;lt;- grandchild span 2.1
    |
    |---&amp;gt; [Batch Assembler]        &amp;lt;- child span 3
    |           `---&amp;gt; [Tokenizer]       &amp;lt;- grandchild span 3.1
    |
    |---&amp;gt; [Model Inference]        &amp;lt;- child span 4
    |           `---&amp;gt; [Model Server]    &amp;lt;- grandchild span 4.1
    |
    `---&amp;gt; [Post-Processor]         &amp;lt;- child span 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.1 Shared Initialization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tracing_setup.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.resources&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tracer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service.version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVICE_VERSION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment.environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ENV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PIPELINE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://otel-collector:4317&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;max_queue_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_export_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
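
&lt;p&gt;One operational detail worth adding around this helper: &lt;code&gt;BatchSpanProcessor&lt;/code&gt; buffers spans in memory, so a worker that exits abruptly can drop its last batch. Registering a flush on shutdown is cheap insurance. A sketch using &lt;code&gt;atexit&lt;/code&gt; (one option among several):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: flush buffered spans before the process exits so short-lived
# workers do not silently drop their final export batch.
import atexit
from opentelemetry import trace
from tracing_setup import setup_tracing

def setup_tracing_with_flush(service_name: str):
    tracer = setup_tracing(service_name)
    provider = trace.get_tracer_provider()
    atexit.register(provider.shutdown)  # shutdown() flushes pending spans
    return tracer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;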



&lt;h3&gt;
  
  
  5.2 API Gateway: Creating the Root Span
&lt;/h3&gt;

&lt;p&gt;The gateway is where the &lt;code&gt;trace_id&lt;/code&gt; is born. Every subsequent service will attach its spans to this root. The attributes set here define the dimensions available for filtering in Tempo and Grafana.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# api_gateway.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.propagate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;inject&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StatusCode&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.api_gateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/predict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PredictRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.request.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.model.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.input.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.input.size_bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.client.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Request-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;inject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;t_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;orchestrate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.pipeline.success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
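
&lt;p&gt;The &lt;code&gt;inject(headers)&lt;/code&gt; call only pays off if those headers actually travel with every downstream request. A sketch of what &lt;code&gt;orchestrate_pipeline&lt;/code&gt; might look like, assuming an httpx-based orchestrator and hypothetical service URLs; any HTTP client works as long as the same headers dict is forwarded so each service can call &lt;code&gt;extract()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: forward the injected headers on every hop so the trace context
# (traceparent) reaches each downstream service.
import httpx

async def orchestrate_pipeline(request, headers: dict):
    async with httpx.AsyncClient() as client:
        validated = (await client.post(
            "http://input-validator/validate",
            json=request.dict(), headers=headers,
        )).json()
        features = (await client.post(
            "http://feature-extractor/extract",
            json=validated, headers=headers,
        )).json()
        # Batch assembly, inference and post-processing follow the same
        # pattern: same headers forwarded, new child spans on the callee side.
        return features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;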



&lt;h3&gt;
  
  
  5.3 Feature Extractor: The Most Critical Stage to Instrument
&lt;/h3&gt;

&lt;p&gt;This is where the most common production bottleneck lives. The instrumentation separates cache latency from database latency, making the two failure modes independently observable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# feature_extractor.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.propagate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.feature_extractor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.feature_extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;feature_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.requested_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;t_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cached&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cache_hits&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cache_misses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cache_hits&lt;/span&gt;

        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_hits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;cache_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_misses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;cache_misses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_hits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.cache_lookup_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache_misses&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;t_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;missing_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;fid&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;db_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;feature_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.db_lookup_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="n"&gt;still_missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_features&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.unavailable_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;still_missing&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;db_features&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.features.retrieved_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.4 Model Inference: Decomposing GPU Time
&lt;/h3&gt;

&lt;p&gt;The key insight here is separating &lt;code&gt;queue_wait_ms&lt;/code&gt; from &lt;code&gt;forward_pass_ms&lt;/code&gt;. Without this decomposition, a slow inference span is undiagnosable. With it, the difference between a GPU under memory pressure and an under-provisioned serving tier is immediately visible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# model_inference.py
&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InferenceBatch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;InferenceResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.batch.size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.batch.max_seq_len&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_seq_len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.model.version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="n"&gt;MODEL_VERSION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.device.type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.device.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_device&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

        &lt;span class="c1"&gt;# Measure queue wait separately from compute time
&lt;/span&gt;        &lt;span class="n"&gt;t_queued&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gpu_semaphore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_queued&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.queue_wait_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.high_queue_wait&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queue_wait_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;queue_wait_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="n"&gt;t_forward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.forward_pass_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_forward&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.gpu.memory_allocated_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memory_allocated&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.gpu.memory_reserved_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memory_reserved&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;decode_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
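
&lt;p&gt;One caveat on the &lt;code&gt;forward_pass_ms&lt;/code&gt; measurement above: CUDA kernels launch asynchronously, so reading the clock right after the model call can return before the GPU has actually finished the work. A hedged variant of that timing block, trading a small stall for an accurate number:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Variant of the timing block in run_inference: synchronize before reading
# the clock so forward_pass_ms reflects completed GPU work rather than
# kernel launch time. Some teams enable this only on a sampled subset of
# requests to avoid paying the synchronization cost everywhere.
t_forward = time.time()
with torch.no_grad():
    logits = model(batch.input_ids.cuda(), batch.attention_mask.cuda())
torch.cuda.synchronize()
span.set_attribute(
    "ml.inference.forward_pass_ms",
    round((time.time() - t_forward) * 1000, 2),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;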



&lt;h3&gt;
  
  
  5.5 Post-Processor: Making the Invisible Visible
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;validation_attempts&lt;/code&gt; attribute is the single most valuable custom signal in this entire pipeline. It costs nothing to compute and surfaces model degradation weeks before it becomes visible in error rates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# post_processor.py
&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_tracing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.post_processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ProcessedOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.post_processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;attempts&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;error&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.validation_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attempt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;raw_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_repair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# This is the early degradation signal metrics cannot see
&lt;/span&gt;        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.post_processing.validation_attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.valid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output validation failed after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;OutputValidationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MAX_ATTEMPTS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.output.required_repair&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. What the Flamegraph Reveals in Practice
&lt;/h2&gt;

&lt;p&gt;With these five services instrumented, a slow 1.8-second request produces the following flamegraph in Tempo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml.pipeline.request             [1 820ms total]
|-- ml.input_validation         [   18ms]
|-- ml.feature_extraction       [  823ms]   &amp;lt;- 45% of total time
|   |-- cache_lookup            [   11ms]
|   `-- db_fallback             [  812ms]   &amp;lt;- actual bottleneck
|-- ml.batch_assembly           [   47ms]
|   `-- tokenizer_init          [   38ms]   &amp;lt;- cold start
|-- ml.inference                [  895ms]
|   |-- queue_wait_ms:          [  580ms]   &amp;lt;- waiting for GPU
|   `-- forward_pass_ms:        [  315ms]   &amp;lt;- actual compute
`-- ml.post_processing          [   37ms]
    `-- validation_attempts: 2              &amp;lt;- silent repair
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a single view, three distinct problems become apparent: the uncached feature store, the saturated GPU queue, and the post-processor silently repairing malformed outputs. Three problems, three teams to notify, three separate tickets. Found in ten minutes rather than two hours.&lt;/p&gt;

&lt;p&gt;This is not a best-case scenario. This is what observability looks like when it is designed to answer the question "why is this specific request slow" rather than "what is the aggregate state of each service."&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Querying Traces in Grafana and Tempo
&lt;/h2&gt;

&lt;p&gt;Traces are only useful if they can be interrogated at scale. The following queries translate instrumented attributes into actionable alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify requests with an abnormal GPU queue-to-compute ratio:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;TraceQL&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Tempo&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue_wait_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;forward_pass_ms&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Detect slow feature store database fallbacks over the last 5 minutes.&lt;/strong&gt; This query and the two that follow are PromQL rather than TraceQL: they assume the same instrumented values are also exported as metrics (for example via the OpenTelemetry metrics SDK or a Collector-side connector), since span attributes alone do not produce histogram series.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.95,
  sum by (service_name, le) (
    rate(ml_feature_extraction_db_lookup_ms_bucket[5m])
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alert on rising revalidation rates (the early degradation signal):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(ml_post_processing_validation_attempts_sum[10m]))
/
sum(rate(ml_post_processing_validation_attempts_count[10m]))
&amp;gt; 1.3
# Alert fires when average attempts exceeds 1.3
# Baseline is 1.0: all outputs valid on first attempt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pipeline bottleneck overview by stage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(5,
  histogram_quantile(0.95,
    sum by (service_name, le) (
      rate(
        duration_ms_bucket{ml_pipeline_name="recommendation"}[5m]
      )
    )
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  8. Adaptive Sampling: Tracing at Scale Without the Cost
&lt;/h2&gt;

&lt;p&gt;A production ML pipeline handling 5,000 requests per minute generates 5,000 traces per minute. At an average of 50 spans per trace, that is 250,000 spans per minute. Storing and indexing everything is expensive and degrades query performance.&lt;/p&gt;

&lt;p&gt;The solution is tail sampling: deciding after the fact which traces to retain, once their outcome is known. Traces that are slow, erroneous, or exhibit anomalous attribute values are always kept. Routine traces are sampled at a low rate for baseline visibility.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otelcol.yaml&lt;/span&gt;
&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;num_traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="m"&gt;50000&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Always keep slow traces: they explain user complaints&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slow-requests&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency&lt;/span&gt;
        &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;threshold_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;800&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;# Always keep error traces&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
        &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;# Keep traces with multiple validation attempts&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;multi-attempt-validation&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;numeric_attribute&lt;/span&gt;
        &lt;span class="na"&gt;numeric_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s"&gt;ml.post_processing.validation_attempts&lt;/span&gt;
          &lt;span class="na"&gt;min_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

      &lt;span class="c1"&gt;# Keep traces with high GPU queue wait&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high-gpu-queue&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;numeric_attribute&lt;/span&gt;
        &lt;span class="na"&gt;numeric_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s"&gt;ml.inference.queue_wait_ms&lt;/span&gt;
          &lt;span class="na"&gt;min_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;

      &lt;span class="c1"&gt;# Sample 3% of the remainder for baseline coverage&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;baseline&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;probabilistic&lt;/span&gt;
        &lt;span class="na"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;3&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, this configuration retains 3 to 8% of traces while preserving 100% of the traces that are diagnostically useful. The baseline sample ensures that silent degradations accumulating across many normal-looking requests remain detectable through aggregation.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What Tracing Does Not Replace
&lt;/h2&gt;

&lt;p&gt;Distributed tracing exposes temporal and cross-service bottlenecks. It does not replace every other observability tool, and it would be misleading to suggest otherwise.&lt;/p&gt;

&lt;p&gt;Tracing shows where time is spent between instrumented function boundaries. It does not show what happens inside a function. An inference span whose &lt;code&gt;forward_pass_ms&lt;/code&gt; attribute reads 800ms indicates that the forward pass is slow; it does not explain why at the CUDA kernel level. For that, a Python or C++ profiler is necessary.&lt;/p&gt;

&lt;p&gt;Tracing does not replace GPU metrics for capacity planning and hardware saturation analysis, structured logs for debugging data and transformation errors, or LLM evaluation frameworks for assessing output quality. These four observability layers are complementary, not substitutable.&lt;/p&gt;

&lt;p&gt;The genuine value of tracing in an ML pipeline is navigation. A latency p99 alert leads in two clicks to the corresponding trace in Tempo, which identifies the responsible service, whose &lt;code&gt;trace_id&lt;/code&gt; correlates with its logs in Loki. That correlation is what gives power to each individual signal. Metrics tell you something is wrong. Traces tell you where to look. Logs tell you what happened when you get there.&lt;/p&gt;
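
&lt;p&gt;That log correlation step is worth making concrete. The snippet below is a minimal sketch, assuming a standard Python &lt;code&gt;logging&lt;/code&gt; setup and the OpenTelemetry SDK: it injects the active &lt;code&gt;trace_id&lt;/code&gt; into every log line so Loki can be filtered on the same identifier Tempo displays. The &lt;code&gt;opentelemetry-instrumentation-logging&lt;/code&gt; package can perform the same injection automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# log_correlation.py (illustrative sketch, not part of the pipeline above)
import logging

from opentelemetry import trace


class TraceIdFilter(logging.Filter):
    """Attach the active trace_id (or '-' when no span is recording) to each record."""

    def filter(self, record: logging.LogRecord) -&amp;gt; bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("ml.pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;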




&lt;h2&gt;
  
  
  10. Conclusion: Metrics Tell You What, Traces Tell You Why
&lt;/h2&gt;

&lt;p&gt;A team monitoring its ML pipeline exclusively with metrics is like a physician who only takes temperature readings. The fever confirms something is wrong. It does not say what.&lt;/p&gt;

&lt;p&gt;Distributed tracing instrumented across an entire ML pipeline transforms "something is slow" into "the feature store is slow on new entities because the cache TTL is too short, affecting 23% of requests between 1 PM and 3 PM on high-traffic days." The difference between those two formulations is the difference between a two-hour investigation and a ten-minute ticket.&lt;/p&gt;

&lt;p&gt;The four bottlenecks described in this article (the uncached feature store, the GPU queue hidden by utilization percentages, the silent revalidations, and the cold-starting unmanaged component) are not rare pathological cases. They are the most frequently encountered patterns in production ML pipelines. They are invisible to classic metrics by design, not by accident.&lt;/p&gt;

&lt;p&gt;Instrumenting an ML pipeline with OpenTelemetry end-to-end takes approximately one day of engineering work. Diagnosing these bottlenecks without tracing takes, on average, longer than that per incident.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Span Attribute Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.request.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Correlate with application logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.model.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Filter by model version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.features.cache_hit_rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Feature Extractor&lt;/td&gt;
&lt;td&gt;Detect cache degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.features.db_lookup_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Feature Extractor&lt;/td&gt;
&lt;td&gt;Isolate database latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.features.unavailable_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;Feature Extractor&lt;/td&gt;
&lt;td&gt;Alert on missing features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.inference.queue_wait_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Detect GPU queue pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.inference.forward_pass_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Measure actual compute time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.gpu.memory_allocated_mb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;Track memory pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.post_processing.validation_attempts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;int&lt;/td&gt;
&lt;td&gt;Post-Processor&lt;/td&gt;
&lt;td&gt;Early degradation signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.output.required_repair&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;Post-Processor&lt;/td&gt;
&lt;td&gt;Flag repaired outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.pipeline.success&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bool&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;End-to-end success tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ml.pipeline.latency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;float&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Total pipeline duration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;This article is part of an ongoing series on production observability for AI workloads.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Previous articles: OTel Collector as IT/OT Middleware · Instrumenting Industrial Assets with OTel · LLM Instrumentation with OpenTelemetry&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>machinelearning</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>Tracing a RAG Chain End-to-End: Where OpenTelemetry Stops and Where You Need to Instrument Yourself</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Mon, 16 Mar 2026 17:05:36 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/tracing-a-rag-chain-end-to-end-where-opentelemetry-stops-and-where-you-need-to-instrument-yourself-2840</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/tracing-a-rag-chain-end-to-end-where-opentelemetry-stops-and-where-you-need-to-instrument-yourself-2840</guid>
      <description>&lt;p&gt;There are already plenty of "Getting started with OpenTelemetry" tutorials. This is not one of them.&lt;/p&gt;

&lt;p&gt;This article starts with a candid observation: if you have OTel running in your infrastructure and you've just added a RAG pipeline to production, your traces look impressive, but they're mostly lying to you by omission. You have spans and latency numbers. What you don't have is visibility into the five stages that actually determine whether your system is working correctly.&lt;/p&gt;

&lt;p&gt;OTel wasn't designed for RAG. It was designed for distributed systems built around HTTP, databases, and message queues: all well-understood primitives with established semantic conventions. A RAG pipeline adds several new primitives that have no standard OTel semantics yet. The OpenTelemetry GenAI SIG is working on it, but slowly. In the meantime, production systems are running blind.&lt;/p&gt;

&lt;p&gt;The goal here is to be precise about where the boundary is and how to cross it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a RAG chain actually traverses
&lt;/h2&gt;

&lt;p&gt;A minimal RAG pipeline involves eight distinct stages, each with its own failure modes and its own observability requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query ingress&lt;/strong&gt;: the user request arrives, gets validated, gets routed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: the query is converted to a vector representation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: the vector DB is searched, ranked chunks are returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking&lt;/strong&gt;: chunks are rescored by a cross-encoder, poor matches are dropped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt assembly&lt;/strong&gt;: context is injected, the prompt is constructed, tokens are counted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM call&lt;/strong&gt;: the assembled prompt is sent to the model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-processing&lt;/strong&gt;: the response is parsed, validated, formatted, filtered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response&lt;/strong&gt;: the final answer is returned to the caller&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these stages has distinct latency characteristics, distinct failure modes, and distinct diagnostic signals. The problem is that OTel treats them very differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  What OTel covers natively
&lt;/h2&gt;

&lt;p&gt;Auto-instrumentation handles the outer envelope well. For a typical Python service running FastAPI or Flask with OpenTelemetry auto-instrumentation, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A root span for the HTTP request (stage 1)&lt;/li&gt;
&lt;li&gt;The HTTP response span (stage 8)&lt;/li&gt;
&lt;li&gt;Any outbound HTTP calls you make, including the API call to your LLM provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Three spans. In a pipeline with eight meaningful stages.&lt;/p&gt;

&lt;p&gt;Depending on your vector database client, you might get a span for the retrieval call. Weaviate has partial SDK-level instrumentation; most others don't. But even when you get that span, it gives you network latency, not semantic information. You know the query arrived and returned. You don't know how many results came back, what their similarity scores were, or whether the result set was empty.&lt;/p&gt;

&lt;p&gt;The picture after auto-instrumentation: two solid spans at the edges, one partial span in the middle and four stages that are completely invisible.&lt;/p&gt;
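
&lt;p&gt;For reference, that envelope usually comes from a few lines of setup. The sketch below is illustrative, assuming FastAPI and the &lt;code&gt;requests&lt;/code&gt; library for outbound calls; with &lt;code&gt;httpx&lt;/code&gt;, the &lt;code&gt;opentelemetry-instrumentation-httpx&lt;/code&gt; package plays the same role.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# app.py (illustrative): the auto-instrumented "outer envelope"
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = FastAPI()

# Root span per incoming HTTP request (stages 1 and 8)
FastAPIInstrumentor.instrument_app(app)

# Spans for outbound HTTP calls, including the LLM provider API (stage 6)
RequestsInstrumentor().instrument()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;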

&lt;p&gt;The eight stages below show what auto-instrumentation sees and what it misses. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi6fo9outd95bbyai0us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi6fo9outd95bbyai0us.png" alt=" " width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The five dead zones
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Zone 1: Embedding
&lt;/h3&gt;

&lt;p&gt;When you call an embedding model, whether via OpenAI, Cohere, or a local sentence-transformer, you get a latency number if you're lucky and nothing otherwise. What you don't capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Which model, which version.&lt;/strong&gt; Model drift is real. A silent upgrade to your embedding provider changes vector geometry and breaks retrieval. If you're not recording &lt;code&gt;model_name&lt;/code&gt; and &lt;code&gt;model_version&lt;/code&gt; on every span, you'll spend days debugging what looks like a retrieval problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector dimensionality.&lt;/strong&gt; A dimension mismatch between your embedding model and your index is a hard failure that generates cryptic errors. Logging the output dimension takes one attribute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU vs GPU time split.&lt;/strong&gt; For on-premise inference, this is the first signal that hardware saturation is affecting quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity score distribution.&lt;/strong&gt; The embedding stage itself doesn't produce this, but it sets up the retrieval stage. Tracking what "normal" looks like here is your baseline for drift detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 2: Retrieval
&lt;/h3&gt;

&lt;p&gt;The retrieval call to your vector database may produce an outbound HTTP span if you're using a REST-based client. But that span contains only timing and status code. What it doesn't contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Number of chunks returned.&lt;/strong&gt; If your retrieval returns zero results, you want to know immediately, and you want to know the query that triggered it, not just the timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity scores.&lt;/strong&gt; The distribution of top-k scores tells you whether the retrieval was confident or speculative. A max score of 0.94 and a max score of 0.41 both count as "retrieval succeeded" in OTel's view. They're completely different situations operationally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking time as a separate stage.&lt;/strong&gt; Many pipelines combine retrieval and reranking into a single function call. Separating them in your spans is worth the effort: reranking is frequently the actual latency bottleneck, and you'd never know.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 3: Reranking
&lt;/h3&gt;

&lt;p&gt;This stage is the most consistently invisible and the most consistently underestimated. A cross-encoder reranker running a full forward pass over each query-chunk pair adds significant latency, sometimes more than the LLM call itself. OTel sees none of it unless you explicitly instrument it.&lt;/p&gt;

&lt;p&gt;What to capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reranking duration as its own span&lt;/li&gt;
&lt;li&gt;Input chunk count vs. output chunk count (how many were filtered)&lt;/li&gt;
&lt;li&gt;Score threshold applied&lt;/li&gt;
&lt;li&gt;Model name and batch size&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 4: Prompt assembly
&lt;/h3&gt;

&lt;p&gt;This is where silent failures are born.&lt;/p&gt;

&lt;p&gt;When you assemble a prompt, you make decisions: which chunks to include, how to order them, how to truncate if the context window is tight. OTel has no visibility into any of this. You can have a system that routinely truncates critical context and generates factually incomplete responses, and your traces will show a perfectly healthy green pipeline.&lt;/p&gt;

&lt;p&gt;What to capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimated token count before sending&lt;/li&gt;
&lt;li&gt;Whether truncation occurred (&lt;code&gt;context_truncated: true/false&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Number of chunks injected&lt;/li&gt;
&lt;li&gt;Whether conflicting chunks were injected (requires a light coherence check)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zone 5: LLM call payload
&lt;/h3&gt;

&lt;p&gt;This is the subtlest dead zone. You do have a span for the LLM call: it's the outbound HTTP request. But the span contains no semantic information about what happened inside that call.&lt;/p&gt;

&lt;p&gt;The SDKs for Anthropic, OpenAI, and most other LLM providers do not emit OTel attributes for tokens, stop reason, or model-level parameters. You have to enrich the span yourself after the response arrives. Without this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You cannot track token costs&lt;/li&gt;
&lt;li&gt;You cannot alert on prompts that regularly hit the context limit (&lt;code&gt;stop_reason: length&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;You cannot detect when model behavior changes across versions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Instrumenting manually
&lt;/h2&gt;

&lt;p&gt;The pattern is consistent across all dead zones: wrap the operation in a span, set semantic attributes, and use a naming convention that survives grep.&lt;/p&gt;
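
&lt;p&gt;If you want to avoid repeating that boilerplate in every service, a small helper keeps the pattern uniform. This is an illustrative sketch, not a required abstraction; the per-stage examples below set their attributes inline, which works just as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# tracing_helpers.py (illustrative sketch)
from contextlib import contextmanager
from typing import Optional

from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")


@contextmanager
def rag_stage(name: str, attributes: Optional[dict] = None):
    """Open one span per pipeline stage with a consistent tracer and naming."""
    # start_as_current_span already records exceptions and sets ERROR status
    # on unhandled errors; the helper only centralizes tracer and attributes.
    with tracer.start_as_current_span(name, attributes=attributes or {}) as span:
        yield span
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;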

&lt;h3&gt;
  
  
  Embedding
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.query.length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding.dimension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.embedding.latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.query_preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.chunks_returned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.empty_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.max_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.retrieval.min_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reranking
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;reranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.input_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.output_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.filtered_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.reranking.top_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reranked&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prompt assembly
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt_assembly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;assembled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reranked_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.estimated_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.context_truncated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;truncated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.chunks_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunks_injected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag.prompt.system_prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT_VERSION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;system_prompt_version&lt;/code&gt; attribute. Prompt changes are the most common cause of unexplained behavioral shifts. Versioning your system prompt and logging it on every span costs nothing and will save you multiple production investigations.&lt;/p&gt;
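
&lt;p&gt;One cheap way to produce that version tag is to derive it from the prompt text itself. A minimal sketch; the prompt content and the 12-character truncation are arbitrary choices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# prompt_version.py (illustrative sketch)
import hashlib

SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."

# The hash changes whenever anyone edits the prompt text, so every span
# records exactly which prompt produced the response.
SYSTEM_PROMPT_VERSION = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()[:12]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;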

&lt;h3&gt;
  
  
  LLM call enrichment
&lt;/h3&gt;

&lt;p&gt;The outbound HTTP span from auto-instrumentation already exists, but it carries only network-level data and is awkward to enrich directly. Wrap the call in a thin &lt;code&gt;llm.completion&lt;/code&gt; span instead, and set the semantic attributes once the response arrives; the auto-instrumented HTTP span becomes its child:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;request_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.usage.input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.usage.output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.stop_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.request.temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.request.max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;stop_reason&lt;/code&gt; attribute alone justifies this instrumentation. When &lt;code&gt;stop_reason&lt;/code&gt; is &lt;code&gt;"length"&lt;/code&gt;, it means the model ran out of context and stopped mid-response. In a RAG pipeline, this is almost always a prompt assembly bug. Without this attribute, the response looks valid until a human reads it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Naming conventions
&lt;/h2&gt;

&lt;p&gt;There is no OTel standard for RAG semantic conventions as of early 2026. The GenAI SIG has drafts in progress but nothing stable. Until there is a standard, the wrong choice is to invent arbitrary names per service. The right choice is to define a coherent convention internally and apply it consistently.&lt;/p&gt;

&lt;p&gt;The three-prefix approach:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prefix&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rag.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pipeline logic&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rag.retrieval.chunks_returned&lt;/code&gt;, &lt;code&gt;rag.prompt.context_truncated&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model interaction&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;llm.usage.input_tokens&lt;/code&gt;, &lt;code&gt;llm.stop_reason&lt;/code&gt;, &lt;code&gt;llm.model&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vec.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Vector operations&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;vec.index.name&lt;/code&gt;, &lt;code&gt;vec.search.metric&lt;/code&gt;, &lt;code&gt;vec.dimension&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This naming makes your traces queryable by domain in VictoriaMetrics, OpenObserve, or any backend that supports attribute filtering. A query like &lt;code&gt;rag.retrieval.empty_result = true AND llm.stop_reason = "length"&lt;/code&gt; surfaces a specific failure pattern (empty retrieval leading to context-padded fallback response) in seconds.&lt;/p&gt;

&lt;p&gt;Avoid prefixes that shadow existing OTel conventions. &lt;code&gt;db.*&lt;/code&gt; is already used by database instrumentation. &lt;code&gt;http.*&lt;/code&gt; is already HTTP. Pick names that won't collide with auto-instrumented attributes.&lt;/p&gt;
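
&lt;p&gt;A simple way to keep the convention consistent across services is to centralize the attribute names in one module and import it everywhere. Illustrative sketch; the constants mirror the examples above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# rag_attributes.py (illustrative sketch): one registry, imported by every service

# rag.* : pipeline logic
RAG_RETRIEVAL_CHUNKS_RETURNED = "rag.retrieval.chunks_returned"
RAG_RETRIEVAL_EMPTY_RESULT = "rag.retrieval.empty_result"
RAG_PROMPT_CONTEXT_TRUNCATED = "rag.prompt.context_truncated"

# llm.* : model interaction
LLM_USAGE_INPUT_TOKENS = "llm.usage.input_tokens"
LLM_STOP_REASON = "llm.stop_reason"
LLM_MODEL = "llm.model"

# vec.* : vector operations
VEC_INDEX_NAME = "vec.index.name"
VEC_SEARCH_METRIC = "vec.search.metric"
VEC_DIMENSION = "vec.dimension"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Spans then reference the constants instead of string literals, so a rename stays a one-file change and grep finds every usage.&lt;/p&gt;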




&lt;h2&gt;
  
  
  What you do with complete traces
&lt;/h2&gt;

&lt;p&gt;Once the instrumentation is in place, four operational patterns become possible that were invisible before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P95 latency by stage.&lt;/strong&gt; Most teams assume the LLM call dominates pipeline latency. In practice, reranking is frequently the bottleneck, especially for models running on shared inference infrastructure. Without per-stage spans, you're optimizing the wrong thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty retrieval rate as a leading indicator.&lt;/strong&gt; An uptick in &lt;code&gt;rag.retrieval.empty_result = true&lt;/code&gt; before you see quality degradation in user feedback gives you a 24–48 hour warning window. It usually means your document index is stale or your embedding model has been silently upgraded. This is the most valuable leading indicator in a RAG system and it requires exactly one boolean attribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context truncation rate as a prompt quality signal.&lt;/strong&gt; If &lt;code&gt;rag.prompt.context_truncated = true&lt;/code&gt; appears on more than 5–10% of requests, your retrieved chunks are too long for your context window configuration. This is a retrieval tuning problem, not an LLM problem, but without the attribute, it looks like random response degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop reason distribution.&lt;/strong&gt; A rise in &lt;code&gt;llm.stop_reason = "length"&lt;/code&gt; correlates directly with content quality issues. Track it as a metric. Alert on it. It's a better signal than user satisfaction scores because it's available in real time.&lt;/p&gt;
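
&lt;p&gt;Backends differ in how they aggregate span attributes, so the cheapest way to track the stop reason distribution is to emit a counter next to the span. A minimal sketch with the OpenTelemetry metrics API follows; the meter name and the point where &lt;code&gt;stop_reason&lt;/code&gt; becomes available are assumptions about your pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import metrics

meter = metrics.get_meter("rag.pipeline")

# Counter labeled by stop reason: a rising "length" share flags truncated answers in real time.
completions = meter.create_counter(
    "llm.completions",
    description="LLM completions by stop reason",
)

def record_completion(stop_reason: str, model: str) -&gt; None:
    # Call this wherever the LLM response is parsed.
    completions.add(1, {"llm.stop_reason": stop_reason, "llm.model": model})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;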




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OTel is a foundation, not a solution. For conventional infrastructure (HTTP services, databases, queues), auto-instrumentation covers most of what matters. For AI pipelines, the meaningful events happen inside application logic that has no standard semantic model.&lt;/p&gt;

&lt;p&gt;The gap isn't a criticism of OTel. It's the normal boundary between generic infrastructure tooling and domain-specific observability. Every mature domain eventually develops its own semantic layer on top of the generic tracing substrate.&lt;/p&gt;

&lt;p&gt;For RAG pipelines, that layer doesn't exist yet as a standard. Building it yourself is not optional if you're operating these systems in production. The instrumentation described here adds less than 200 lines of code to a typical pipeline and transforms your traces from a latency meter into an operational instrument.&lt;/p&gt;

&lt;p&gt;The five dead zones (embedding, retrieval, reranking, prompt assembly, and LLM payload) are exactly where your system fails in interesting ways. Leaving them dark is a choice, and it's the wrong one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Samuel Desseaux, founder of Erythix · AI Observability &amp;amp; Industrial Monitoring&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;VictoriaMetrics Training Partner · &lt;a href="https://erythix.tech" rel="noopener noreferrer"&gt;erythix.tech&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is part of the AI Observability series. Related: "GPU utilization tells you nothing about inference quality" · "Sovereign observability stack for HPC workloads"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>rag</category>
    </item>
    <item>
      <title>SLURM in a nutshell: Architecture, Observability and Security for HPC Clusters</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Sat, 07 Mar 2026 14:55:38 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/slurm-in-a-nutshell-architecture-observability-and-security-for-hpc-clusters-5gna</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/slurm-in-a-nutshell-architecture-observability-and-security-for-hpc-clusters-5gna</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;SLURM powers Frontier, LUMI, and most of the TOP500. If you work with GPU clusters, AI training infrastructure, or scientific computing, understanding how it works is not optional.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is SLURM?
&lt;/h2&gt;

&lt;p&gt;SLURM (Simple Linux Utility for Resource Management) is an open-source cluster workload manager originally developed at Lawrence Livermore National Laboratory &lt;sup id="fnref1"&gt;1&lt;/sup&gt;. It is now the de facto standard for HPC environments worldwide, deployed on more than 60% of TOP500 systems &lt;sup id="fnref2"&gt;2&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;It has three core responsibilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource allocation&lt;/strong&gt; assigns compute nodes to jobs based on configured policies: partitions, Quality of Service (QOS) rules, and fairshare weights. It accounts for CPU cores, memory, GPU devices, and network topology simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job scheduling&lt;/strong&gt; queues submitted jobs and launches them when resources become available. The default algorithm is backfill scheduling, which fills scheduling gaps with smaller jobs without delaying the larger ones already queued.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accounting&lt;/strong&gt; records every resource consumption event — who ran what, on which nodes, for how long, consuming how much CPU, memory, and GPU — via a dedicated daemon connected to a relational database.&lt;/p&gt;

&lt;p&gt;It operates on a heartbeat model: nodes report their state to a central controller, which dispatches queued jobs as resources free up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Four Daemons
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------------------------------------------------------+
|                        CONTROL PLANE                             |
|                                                                  |
|   +------------------+          +------------------+            |
|   |   slurmctld      |&amp;lt;--------&amp;gt;|   slurmdbd       |            |
|   |   TCP 6817       |          |   TCP 6819       |            |
|   |                  |          |                  |            |
|   |  Scheduler       |          |  Accounting GW   |            |
|   |  State manager   |          |  Only DB client  |            |
|   +--------+---------+          +--------+---------+            |
|            |                             |                       |
+------------|-----------------------------|-----------------------+
             |                             |
             | TCP 6818                    | SQL TCP 3306
             v                             v
+---------------------------+    +--------------------+
|   COMPUTE NODES           |    |   MariaDB          |
|                           |    |   Accounting DB    |
|   slurmd   slurmd   ...   |    +--------------------+
|   node01   node02         |
|                           |
|   cgroups v2 enforcement  |
|   Prolog / Epilog hooks   |
+---------------------------+
             ^
             |
     +-------+--------+
     |   slurmrestd   |
     |   TCP 6820     |
     |   OpenAPI/JWT  |
     +----------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;slurmctld&lt;/code&gt; — Controller Daemon (TCP 6817)
&lt;/h4&gt;

&lt;p&gt;The brain of the cluster. It maintains the global state of every node and every job in memory, periodically checkpointing to disk (the &lt;code&gt;StateSaveLocation&lt;/code&gt; directory). On restart after a failure, it replays this state to resume operations without losing queued or running jobs.&lt;/p&gt;

&lt;p&gt;Key responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the scheduler plugin (backfill by default, with optional gang scheduling)&lt;/li&gt;
&lt;li&gt;Manages node state transitions (IDLE, ALLOCATED, DOWN, DRAIN, FAIL)&lt;/li&gt;
&lt;li&gt;Dispatches jobs to &lt;code&gt;slurmd&lt;/code&gt; on compute nodes&lt;/li&gt;
&lt;li&gt;Enforces partition and QOS limits&lt;/li&gt;
&lt;li&gt;Processes all client commands (&lt;code&gt;sbatch&lt;/code&gt;, &lt;code&gt;srun&lt;/code&gt;, &lt;code&gt;scontrol&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High availability is supported via a primary/backup pair. If the primary &lt;code&gt;slurmctld&lt;/code&gt; fails, the backup takes over within seconds, with minimal job disruption &lt;sup id="fnref3"&gt;3&lt;/sup&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;slurmd&lt;/code&gt; — Node Daemon (TCP 6818)
&lt;/h4&gt;

&lt;p&gt;One instance runs on every compute node. It is the execution agent: it receives job steps dispatched by &lt;code&gt;slurmctld&lt;/code&gt;, spawns user processes inside cgroup hierarchies, monitors resource consumption continuously, and sends periodic heartbeats back to the controller.&lt;/p&gt;

&lt;p&gt;When a heartbeat is missed beyond the configured &lt;code&gt;SlurmdTimeout&lt;/code&gt;, the controller marks the node as DOWN and can optionally reschedule its jobs.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;slurmd&lt;/code&gt; also runs the site-defined &lt;strong&gt;Prolog&lt;/strong&gt; script before launching each job (environment setup, filesystem mounting, health checks) and the &lt;strong&gt;Epilog&lt;/strong&gt; script after completion (cleanup, unmounting, node validation).&lt;/p&gt;
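
&lt;p&gt;Prolog and Epilog are plain executables, so they can be written in any language. A minimal, hypothetical health-check Prolog might look like the sketch below; the &lt;code&gt;/scratch&lt;/code&gt; mount check is only an example of the kind of fast, conservative test that belongs here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
# Hypothetical Prolog: runs as root on each allocated node before the job starts.
# A non-zero exit code fails the launch, drains the node, and requeues the job,
# so keep the checks fast and conservative.
import os
import sys

job_id = os.environ.get("SLURM_JOB_ID", "unknown")

if not os.path.ismount("/scratch"):  # example check; the path is an assumption
    print(f"prolog: /scratch not mounted, refusing job {job_id}", file=sys.stderr)
    sys.exit(1)

sys.exit(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;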

&lt;h4&gt;
  
  
  &lt;code&gt;slurmdbd&lt;/code&gt; — Database Daemon (TCP 6819)
&lt;/h4&gt;

&lt;p&gt;The exclusive gateway to the accounting database. No other daemon connects to MariaDB directly. This design creates a single point of control for all historical data: job records, resource consumption, user associations, QOS definitions, and the fairshare tree.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;slurmdbd&lt;/code&gt; can run on a dedicated server, isolated from the controller. Losing it does not stop job execution — running jobs continue — but new accounting records are buffered locally on &lt;code&gt;slurmctld&lt;/code&gt; and flushed when connectivity is restored.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;slurmrestd&lt;/code&gt; — REST API Daemon (TCP 6820)
&lt;/h4&gt;

&lt;p&gt;Available since SLURM 20.11 &lt;sup id="fnref4"&gt;4&lt;/sup&gt;, &lt;code&gt;slurmrestd&lt;/code&gt; exposes the full SLURM management interface as an OpenAPI-documented REST API. It bridges REST calls to internal SLURM RPC, enabling integration with web portals, JupyterHub, workflow orchestrators (Nextflow, Snakemake, Apache Airflow), and cloud bursting systems.&lt;/p&gt;

&lt;p&gt;Authentication is via JWT tokens. The API surface is significant and must be treated as a privileged endpoint.&lt;/p&gt;
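
&lt;p&gt;As an illustration of what integration code looks like, the sketch below lists queued and running jobs through &lt;code&gt;slurmrestd&lt;/code&gt;. It assumes the &lt;code&gt;v0.0.40&lt;/code&gt; OpenAPI plugin is loaded (adjust the version segment to your deployment), that the endpoint is reachable through your proxy, and that a token was issued with &lt;code&gt;scontrol token&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import requests  # assumption: the requests library is available

SLURMRESTD = "http://slurm-controller:6820"  # assumption: address of your slurmrestd proxy
VERSION = "v0.0.40"                          # assumption: matches the loaded OpenAPI plugin

def list_jobs(user: str, token: str) -&gt; list:
    """Read-only call: list the jobs known to slurmctld via slurmrestd."""
    resp = requests.get(
        f"{SLURMRESTD}/slurm/{VERSION}/jobs",
        headers={
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,  # issued with: scontrol token username=... lifespan=...
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("jobs", [])

if __name__ == "__main__":
    jobs = list_jobs(os.environ["USER"], os.environ["SLURM_JWT"])
    print(f"{len(jobs)} jobs in queue")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;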




&lt;h3&gt;
  
  
  Communication Flows
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User (sbatch / srun / salloc)
        |
        | TCP 6817 — job submit, validated against associations + QOS
        v
  +-------------+   TCP 6819   +-------------+   SQL   +-----------+
  | slurmctld   |&amp;lt;------------&amp;gt;| slurmdbd    |--------&amp;gt;| MariaDB   |
  +------+------+   accounting +-------------+         +-----------+
         |
         | TCP 6818 — job dispatch (JobID, allocated nodes, resources)
         |
    +----+----+
    |         |
slurmd #1   slurmd #2  ...
    |
    +-- cgroups v2 (memory.max, cpu.max, devices allowlist)
    +-- Prolog  (runs as root before job)
    +-- job step (runs as user)
    +-- Epilog  (runs as root after job)
    +-- heartbeat -&amp;gt; slurmctld every SlurmdTimeout/3

slurmrestd --REST/JWT--&amp;gt; slurmctld (internal RPC)

All inter-daemon messages: signed + timestamped by MUNGE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every message exchanged between &lt;code&gt;slurmctld&lt;/code&gt;, &lt;code&gt;slurmdbd&lt;/code&gt;, and &lt;code&gt;slurmd&lt;/code&gt; is signed and timestamped by &lt;strong&gt;MUNGE&lt;/strong&gt; (MUNGE Uid 'N' Gid Emporium). A credential contains the UID/GID of the originating process, a timestamp, and a configurable TTL. Replayed credentials are rejected &lt;sup id="fnref5"&gt;5&lt;/sup&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scheduling Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Backfill Scheduling
&lt;/h3&gt;

&lt;p&gt;The default &lt;code&gt;sched/backfill&lt;/code&gt; plugin extends simple first-in-first-out scheduling by maintaining a time-ordered reservation list. When a large job cannot start immediately, the scheduler looks for smaller jobs that can be inserted into the scheduling gap without pushing back the start time of the large job &lt;sup id="fnref6"&gt;6&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This is why you sometimes see a small 2-node job start before a 100-node job that was submitted earlier: the 100-node job is waiting for enough nodes to free up, and the 2-node job fits in the current available capacity without affecting the projected start time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Queue state:
  Job A: 100 nodes, submitted T+0, cannot start (only 20 nodes free)
  Job B: 10 nodes, submitted T+10

Backfill logic:
  - Job A projected start: T+45 (when enough nodes finish current jobs)
  - Job B can complete before T+45 if started now
  - Job B is scheduled immediately without delaying Job A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Priority Calculation
&lt;/h3&gt;

&lt;p&gt;SLURM computes a weighted sum for each queued job &lt;sup id="fnref7"&gt;7&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Priority = w_age        * factor_age
         + w_fairshare  * factor_fairshare
         + w_jobsize    * factor_jobsize
         + w_qos        * factor_qos
         + w_partition  * factor_partition
         + w_assoc      * factor_assoc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fairshare factor is the most important for multi-tenant clusters. It is computed using a decay algorithm: resource usage from the past contributes less weight over time (configured by &lt;code&gt;PriorityDecayHalfLife&lt;/code&gt;). A user who ran 10,000 CPU-hours last week has a lower fairshare score than a user who has not submitted a job in two weeks, pushing the inactive user's jobs to higher priority.&lt;/p&gt;

&lt;p&gt;The tool &lt;code&gt;sprio&lt;/code&gt; shows the current priority breakdown for every queued job.&lt;/p&gt;
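
&lt;p&gt;To make the decay concrete, here is a toy calculation of how past usage loses weight with a 7-day half-life. The real implementation applies the decay incrementally at each scheduling iteration, so treat this as an illustration of the shape of the curve, not of SLURM internals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def decayed_usage(raw_cpu_hours: float, age_days: float, half_life_days: float = 7.0) -&gt; float:
    # Usage halves every half_life_days.
    return raw_cpu_hours * 0.5 ** (age_days / half_life_days)

print(decayed_usage(10_000, 0))   # 10000.0 (today's usage counts in full)
print(decayed_usage(10_000, 7))   # 5000.0  (one half-life old)
print(decayed_usage(10_000, 14))  # 2500.0  (two half-lives old)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;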

&lt;h3&gt;
  
  
  QOS and Associations
&lt;/h3&gt;

&lt;p&gt;The association tree controls access at every level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cluster: mycluster
  |
  +-- Account: research_lab        (FairShare: 40)
  |       |
  |       +-- User: alice          (FairShare: 20)
  |       |     QOS: normal, gpu_priority
  |       |     MaxTRES: cpu=256,gres/gpu=8
  |       |
  |       +-- User: bob            (FairShare: 20)
  |             QOS: normal
  |             MaxTRES: cpu=128
  |
  +-- Account: ops_team            (FairShare: 60)
          |
          +-- User: carol          (FairShare: 60)
                QOS: normal, high_priority, infra
                MaxTRES: cpu=512,gres/gpu=32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A QOS defines hard limits (GrpTRES, MaxTRESPerJob, MaxWallDurationPerJob) and soft priority boosts. When a user submits a job requesting resources beyond their association or QOS limits, the job is rejected at submission time, not at scheduling time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Job Lifecycle
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  SUBMIT         QUEUE          ALLOCATE         RUN          COMPLETE
     |               |               |              |               |
 sbatch          PENDING          Nodes          RUNNING        COMPLETED
 script.sh        state          reserved         state           state
     |               |               |              |               |
     v               v               v              v               v
 slurmctld      Scheduler        slurmd         slurmd          slurmdbd
 validates      computes         runs           monitors        records
 resources      priority         Prolog         CPU/mem/GPU     all metrics
 + QOS limits   backfill         cgroups        heartbeats      to MariaDB
                analysis         configured     to controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Submission
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --job-name=train_llm&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --partition=gpu&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --nodes=4&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --ntasks-per-node=8&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --gres=gpu:a100:8&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --mem=512G&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --time=48:00:00&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --account=research_lab&lt;/span&gt;
&lt;span class="c"&gt;#SBATCH --qos=gpu_priority&lt;/span&gt;

module load cuda/12.2
srun python train.py &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;slurmctld&lt;/code&gt; validates this script against:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The partition definition (nodes available, max wall time)&lt;/li&gt;
&lt;li&gt;The user's association (account exists, user is a member)&lt;/li&gt;
&lt;li&gt;The QOS (resource limits not exceeded)&lt;/li&gt;
&lt;li&gt;Current cluster capacity (enough GPUs exist)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If all checks pass, the job receives a &lt;code&gt;JobID&lt;/code&gt; and enters the PENDING state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution on Nodes
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;slurmctld&lt;/code&gt; dispatches the job, each &lt;code&gt;slurmd&lt;/code&gt; on the allocated nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Runs the site Prolog (as root)&lt;/li&gt;
&lt;li&gt;Creates the cgroup hierarchy for the job&lt;/li&gt;
&lt;li&gt;Sets &lt;code&gt;memory.max&lt;/code&gt;, &lt;code&gt;cpu.max&lt;/code&gt;, and the GPU device allowlist&lt;/li&gt;
&lt;li&gt;Spawns &lt;code&gt;slurmstepd&lt;/code&gt;, which drops privileges to the user and executes the job step&lt;/li&gt;
&lt;li&gt;Monitors consumption every &lt;code&gt;JobAcctGatherFrequency&lt;/code&gt; seconds&lt;/li&gt;
&lt;li&gt;Runs the Epilog on completion (as root)&lt;/li&gt;
&lt;li&gt;Reports final resource usage to &lt;code&gt;slurmctld&lt;/code&gt;, which forwards it to &lt;code&gt;slurmdbd&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Job Arrays
&lt;/h3&gt;

&lt;p&gt;For parameter sweeps, job arrays avoid submitting thousands of individual jobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#SBATCH --array=0-99%10    # 100 tasks, max 10 running simultaneously&lt;/span&gt;

&lt;span class="nv"&gt;PARAM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SLURM_ARRAY_TASK_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
python experiment.py &lt;span class="nt"&gt;--seed&lt;/span&gt; &lt;span class="nv"&gt;$PARAM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each task gets its own &lt;code&gt;JobID&lt;/code&gt; (formatted as &lt;code&gt;ArrayJobID_TaskID&lt;/code&gt;) and its own accounting record. The &lt;code&gt;%10&lt;/code&gt; limits concurrent tasks to avoid saturating the cluster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compute Nodes
+------------------------+    +------------------------+
| slurmd                 |    | DCGM Exporter          |
|                        |    | (NVIDIA GPU metrics)   |
| slurm-exporter :8080   |    | :9400                  |
|  slurm_jobs_running    |    |  DCGM_FI_DEV_GPU_UTIL  |
|  slurm_jobs_pending    |    |  DCGM_FI_DEV_MEM_COPY  |
|  slurm_nodes_alloc     |    |  DCGM_FI_DEV_NVLINK_*  |
|  slurm_cpus_idle       |    |  label: slurm_job_id   |
+----------+-------------+    +----------+-------------+
           |                             |
           | Prometheus scrape           | Prometheus scrape
           v                             v
+-----------------------------------------------+
|   VMAgent (per node or centralized)           |
|   Relabeling, filtering, remote_write         |
+-------------------+---------------------------+
                    |
                    | remote_write
                    v
+-----------------------------------------------+
|   VictoriaMetrics (vminsert / vmstorage)      |
|   Long-term storage, MetricsQL                |
+-------------------+---------------------------+
                    |
                    | datasource
                    v
+-----------------------------------------------+
|   Grafana                                     |
|   Job efficiency dashboards                   |
|   GPU heatmaps, fairshare visualization       |
|   Alerting (PagerDuty, Slack)                 |
+-----------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  slurm-exporter
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/vpenso/prometheus-slurm-exporter" rel="noopener noreferrer"&gt;&lt;code&gt;prometheus-slurm-exporter&lt;/code&gt;&lt;/a&gt; scrapes SLURM CLI tools (&lt;code&gt;squeue&lt;/code&gt;, &lt;code&gt;sinfo&lt;/code&gt;, &lt;code&gt;sacct&lt;/code&gt;) and exposes metrics on port 8080 &lt;sup id="fnref8"&gt;8&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Key metrics exposed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_jobs_running&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count of running jobs, by partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_jobs_pending&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count of pending jobs, by reason&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_nodes_alloc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nodes in ALLOCATED state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_nodes_idle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nodes in IDLE state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_nodes_down&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Nodes in DOWN/DRAIN state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_cpus_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total CPUs in cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_cpus_idle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Idle CPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm_account_cpu_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPUs used per account&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A known limitation: the exporter calls CLI binaries, which adds latency and load at scale (thousands of jobs). At very large scale, prefer reading directly from &lt;code&gt;slurmctld&lt;/code&gt;'s state files or using &lt;code&gt;slurmrestd&lt;/code&gt; as a data source.&lt;/p&gt;

&lt;h3&gt;
  
  
  DCGM Exporter and GPU Correlation
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;NVIDIA DCGM Exporter&lt;/a&gt; exposes per-GPU hardware metrics &lt;sup id="fnref9"&gt;9&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="...", hostname="node01"} 94
DCGM_FI_DEV_FB_USED{gpu="0", ...} 38654
DCGM_FI_DEV_POWER_USAGE{gpu="0", ...} 387
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0", ...} 198432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To correlate GPU metrics with SLURM jobs, DCGM can be configured to expose the &lt;code&gt;SLURM_JOB_ID&lt;/code&gt; environment variable as a label. This enables Grafana queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# GPU efficiency for a specific job
DCGM_FI_DEV_GPU_UTIL{slurm_job_id="12345"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key insight for AI/ML workloads: raw GPU utilization tells you if GPUs are busy, but &lt;code&gt;job_id&lt;/code&gt; correlation tells you which specific training run, user, or team is responsible.&lt;/p&gt;
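
&lt;p&gt;Once the label is in place, the correlation is a one-line query against any Prometheus-compatible API. Here is a minimal sketch against a single-node VictoriaMetrics endpoint; the URL, the &lt;code&gt;slurm_job_id&lt;/code&gt; label name, and the job ID are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests  # assumption: the requests library is available

VM_URL = "http://victoriametrics:8428"  # assumption: single-node VictoriaMetrics

def job_gpu_util(job_id: str) -&gt; float:
    """Average GPU utilization for one SLURM job via the /api/v1/query endpoint."""
    resp = requests.get(
        f"{VM_URL}/api/v1/query",
        params={"query": f'avg(DCGM_FI_DEV_GPU_UTIL{{slurm_job_id="{job_id}"}})'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

print(job_gpu_util("12345"))  # placeholder job ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;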

&lt;h3&gt;
  
  
  Why VictoriaMetrics for HPC
&lt;/h3&gt;

&lt;p&gt;Prometheus alone struggles with HPC-scale workloads for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cardinality&lt;/strong&gt;: a 1000-node cluster with 8 GPUs each, running thousands of jobs, generates millions of unique time series&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention&lt;/strong&gt;: HPC accounting requires months or years of metrics for capacity planning and user reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query performance&lt;/strong&gt;: job efficiency reports aggregate over large time ranges with complex label filters&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;VictoriaMetrics addresses all three &lt;sup id="fnref10"&gt;10&lt;/sup&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vmagent config: distributed collection on compute nodes&lt;/span&gt;
&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slurm&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:8080"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:9400"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;remote_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://victoriametrics:8428/api/v1/write"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compression ratios on HPC workloads are typically 10-15x better than Prometheus TSDB, and MetricsQL supports advanced aggregations like &lt;code&gt;quantile_over_time&lt;/code&gt; and &lt;code&gt;increase&lt;/code&gt; that are essential for wait time analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  KPIs That Actually Matter
&lt;/h3&gt;

&lt;p&gt;Most HPC operators track GPU utilization and stop there. That is not enough. The metrics that reveal actual cluster health:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Formula&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU efficiency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;used_cpus / alloc_cpus&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reveals job over-allocation and poor sizing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory waste&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alloc_mem - max_rss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Often 40-60% on ML clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wait time P95&lt;/td&gt;
&lt;td&gt;&lt;code&gt;start_time - submit_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scheduler health indicator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fairshare drift&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;factor_fairshare&lt;/code&gt; over 30d&lt;/td&gt;
&lt;td&gt;Detects long-term resource monopolies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU occupancy&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DCGM_GPU_UTIL&lt;/code&gt; weighted by job&lt;/td&gt;
&lt;td&gt;Distinguishes idle allocation from compute-bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job failure rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;failed / (completed + failed)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure reliability signal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A &lt;code&gt;sacct&lt;/code&gt; query for job efficiency after the fact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sacct &lt;span class="nt"&gt;-j&lt;/span&gt; 12345 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JobID,CPUTime,CPUTimeRAW,AveCPU,MaxRSS,ReqMem,Elapsed &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
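
&lt;p&gt;To turn the memory-waste formula from the table into a number, the sketch below parses machine-readable &lt;code&gt;sacct&lt;/code&gt; output for a single job. The unit handling is simplified (sacct appends K/M/G suffixes, and older releases add an &lt;code&gt;n&lt;/code&gt;/&lt;code&gt;c&lt;/code&gt; per-node/per-CPU marker to ReqMem), and the job ID is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

def to_mb(value: str) -&gt; float:
    """Convert a sacct size string like '512G' or '48234K' to megabytes (simplified)."""
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024}
    value = value.strip().rstrip("nc")  # older sacct versions suffix ReqMem with n or c
    if not value:
        return 0.0
    if value[-1] in units:
        return float(value[:-1]) * units[value[-1]]
    return float(value)

def memory_waste_mb(job_id: str) -&gt; float:
    """alloc_mem - max_rss for a finished job, from pipe-separated sacct output."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--noheader", "--parsable2",
         "--format=JobID,ReqMem,MaxRSS"],
        capture_output=True, text=True, check=True,
    ).stdout
    req, rss = 0.0, 0.0
    for line in out.splitlines():
        _, req_mem, max_rss = line.split("|")
        req = max(req, to_mb(req_mem))  # requested memory (job line)
        rss = max(rss, to_mb(max_rss))  # peak resident set (step lines)
    return req - rss

print(memory_waste_mb("12345"))  # "12345" is a placeholder job ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;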






&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Authentication: MUNGE
&lt;/h3&gt;

&lt;p&gt;MUNGE is the default authentication mechanism for all inter-daemon communication &lt;sup id="fnref5"&gt;5&lt;/sup&gt;. Every message is signed with a shared secret (&lt;code&gt;/etc/munge/munge.key&lt;/code&gt;), timestamped, and includes the originating UID/GID. A receiving daemon verifies the signature and rejects credentials outside the configured TTL window, preventing replay attacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node A                              Node B
+------------------+                +------------------+
|  slurmctld       |                |  slurmd          |
|                  |--[credential]-&amp;gt;|                  |
|  signs with      |                |  verifies with   |
|  munge.key       |                |  munge.key       |
|                  |&amp;lt;--[response]---|                  |
+------------------+                +------------------+

Credential contains:
  - UID / GID of sender
  - Timestamp (TTL: 300s default)
  - Realm (optional)
  - Payload (encrypted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key operational requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;munge.key&lt;/code&gt; must be &lt;strong&gt;identical&lt;/strong&gt; on all nodes (controller + compute + login + slurmdbd server)&lt;/li&gt;
&lt;li&gt;File permissions must be &lt;code&gt;0400&lt;/code&gt;, owned by the &lt;code&gt;munge&lt;/code&gt; user&lt;/li&gt;
&lt;li&gt;Distribution should use a secrets manager (HashiCorp Vault, Ansible Vault) rather than manual &lt;code&gt;scp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Key rotation requires a coordinated restart of all SLURM daemons — the most disruptive operation on a live cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key rotation procedure on a live cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Generate new key on the controller&lt;/span&gt;
mungekey &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--keyfile&lt;/span&gt; /etc/munge/munge.key.new

&lt;span class="c"&gt;# 2. Distribute to all nodes (use your config management tool)&lt;/span&gt;
ansible all &lt;span class="nt"&gt;-m&lt;/span&gt; copy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"src=/etc/munge/munge.key.new dest=/etc/munge/munge.key mode=0400 owner=munge"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Restart munge everywhere simultaneously (parallel SSH)&lt;/span&gt;
ansible all &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=munge state=restarted"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Restart SLURM daemons in order&lt;/span&gt;
ansible compute &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=slurmd state=restarted"&lt;/span&gt;
ansible controller &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=slurmctld state=restarted"&lt;/span&gt;
ansible dbd &lt;span class="nt"&gt;-m&lt;/span&gt; service &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"name=slurmdbd state=restarted"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resource Isolation: cgroups v2
&lt;/h3&gt;

&lt;p&gt;Without cgroup enforcement, a job that allocates 64GB of memory can consume 512GB, triggering OOM kills across all other jobs on the node. SLURM's cgroup plugin prevents this &lt;sup id="fnref11"&gt;11&lt;/sup&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slurmd receives job dispatch
        |
        v
Creates cgroup hierarchy:
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/
        |
        +-- memory.max        = 65536M   (allocated memory)
        +-- memory.swap.max   = 0        (no swap for HPC jobs)
        +-- cpu.max           = 6400000 100000  (64 cores: quota/period in µs)
        +-- device access     = c 195:0, c 195:1 (only the allocated GPUs; eBPF-enforced in v2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Essential &lt;code&gt;cgroup.conf&lt;/code&gt; settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;CgroupPlugin&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;autodetect&lt;/span&gt;
&lt;span class="py"&gt;ConstrainRAMSpace&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes       # OOM kill if job exceeds memory limit&lt;/span&gt;
&lt;span class="py"&gt;ConstrainSwapSpace&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes      # Disable swap for job processes&lt;/span&gt;
&lt;span class="py"&gt;ConstrainCores&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes          # Pin processes to allocated CPU cores&lt;/span&gt;
&lt;span class="py"&gt;ConstrainDevices&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes        # Restrict GPU access to allocated devices&lt;/span&gt;
&lt;span class="py"&gt;AllowedRAMSpace&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100         # No tolerance: enforce hard limit&lt;/span&gt;
&lt;span class="py"&gt;TaskAffinity&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;yes            # Bind threads to cores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ConstrainRAMSpace=yes&lt;/code&gt; is non-negotiable in any multi-tenant environment. Without it, a misbehaving job can take down an entire node.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authorization: RBAC and Associations
&lt;/h3&gt;

&lt;p&gt;SLURM's authorization model is hierarchical. Access is validated at every layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 — Cluster
  Who can submit at all?

Level 2 — Account
  Which budget/project does the job charge to?
  What is the fairshare allocation?

Level 3 — User
  Individual limits within the account.

Level 4 — QOS
  Hard limits on resources, wall time, and concurrent jobs.
  Priority boosts or penalties.

Level 5 — Partition
  Which physical nodes? What maximum wall time?
  Restricted to specific groups (AllowGroups)?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Managing associations with &lt;code&gt;sacctmgr&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create account hierarchy&lt;/span&gt;
sacctmgr add cluster mycluster
sacctmgr add account research_lab &lt;span class="nv"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mycluster &lt;span class="nv"&gt;fairshare&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;40
sacctmgr add user alice &lt;span class="nv"&gt;account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;research_lab &lt;span class="nv"&gt;defaultaccount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;research_lab

&lt;span class="c"&gt;# Define QOS&lt;/span&gt;
sacctmgr add qos gpu_priority &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;MaxTRESPerUser&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;256,gres/gpu&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;MaxWallDurationPerJob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;48:00:00 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;Priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100

&lt;span class="c"&gt;# Assign QOS to user&lt;/span&gt;
sacctmgr modify user alice &lt;span class="nb"&gt;set &lt;/span&gt;qos+&lt;span class="o"&gt;=&lt;/span&gt;gpu_priority
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  API Security: JWT and TLS
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;slurmrestd&lt;/code&gt; is the largest attack surface in a modern SLURM deployment. A compromised API token provides full cluster control: job submission, node management, user impersonation.&lt;/p&gt;

&lt;p&gt;Hardening checklist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Generate JWT signing key&lt;/span&gt;
openssl rand &lt;span class="nt"&gt;-out&lt;/span&gt; /etc/slurm/jwt_hs256.key 32
&lt;span class="nb"&gt;chmod &lt;/span&gt;0600 /etc/slurm/jwt_hs256.key
&lt;span class="nb"&gt;chown &lt;/span&gt;slurm: /etc/slurm/jwt_hs256.key

&lt;span class="c"&gt;# In slurm.conf:&lt;/span&gt;
&lt;span class="c"&gt;# AuthAltTypes=auth/jwt&lt;/span&gt;
&lt;span class="c"&gt;# AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key&lt;/span&gt;

&lt;span class="c"&gt;# 2. Issue short-lived tokens (1 hour max)&lt;/span&gt;
scontrol token &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;alice &lt;span class="nv"&gt;lifespan&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3600

&lt;span class="c"&gt;# 3. Run behind nginx with rate limiting&lt;/span&gt;
&lt;span class="c"&gt;# nginx.conf excerpt:&lt;/span&gt;
&lt;span class="c"&gt;# limit_req_zone $binary_remote_addr zone=slurm_api:10m rate=10r/s;&lt;/span&gt;
&lt;span class="c"&gt;# location /slurm/ {&lt;/span&gt;
&lt;span class="c"&gt;#   limit_req zone=slurm_api burst=20 nodelay;&lt;/span&gt;
&lt;span class="c"&gt;#   proxy_pass http://127.0.0.1:6820;&lt;/span&gt;
&lt;span class="c"&gt;# }&lt;/span&gt;

&lt;span class="c"&gt;# 4. Restrict port 6820 by firewall&lt;/span&gt;
&lt;span class="c"&gt;# Only the proxy IP should reach slurmrestd directly&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For inter-daemon TLS (SLURM 23.x+), add to &lt;code&gt;slurm.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;CommunicationParameters&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;EnableTLS&lt;/span&gt;
&lt;span class="py"&gt;TLSType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;tls/openssl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Audit Trail
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;slurmdbd&lt;/code&gt; maintains a complete, immutable audit trail. Every job submission, modification, start, and completion is recorded with full resource accounting. This data is queryable via &lt;code&gt;sacct&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full accounting for a user, last 30 days&lt;/span&gt;
sacct &lt;span class="nt"&gt;-u&lt;/span&gt; alice &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--starttime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JobID,JobName,Account,QOS,Partition,NCPUS,NNodes,&lt;span class="se"&gt;\&lt;/span&gt;
           ReqMem,MaxRSS,CPUTime,Elapsed,State,ExitCode &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;G

&lt;span class="c"&gt;# Cluster-wide report&lt;/span&gt;
sreport cluster utilization &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-01-01 &lt;span class="nv"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-03-31 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; hourper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For SIEM integration, SLURM writes structured logs to syslog. These can be forwarded to Wazuh, Elastic SIEM, or Splunk for correlation with authentication events and anomaly detection.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Configuration Files
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Critical settings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurm.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main config: nodes, partitions, plugins&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SelectType&lt;/code&gt;, &lt;code&gt;PriorityType&lt;/code&gt;, &lt;code&gt;AccountingStorageType&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;slurmdbd.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accounting daemon: DB credentials&lt;/td&gt;
&lt;td&gt;Permissions must be &lt;code&gt;0600&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cgroup.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resource enforcement&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ConstrainRAMSpace&lt;/code&gt;, &lt;code&gt;ConstrainDevices&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gres.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU/FPGA topology and binding&lt;/td&gt;
&lt;td&gt;GPU count, MIG partitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;topology.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Network topology for MPI placement&lt;/td&gt;
&lt;td&gt;Switch hierarchy, InfiniBand fabric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;acct_gather.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-job energy and I/O metrics&lt;/td&gt;
&lt;td&gt;RAPL, InfiniBand, Lustre&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Annotated &lt;code&gt;slurm.conf&lt;/code&gt; for a GPU cluster
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# Identity
&lt;/span&gt;&lt;span class="py"&gt;ClusterName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;mycluster&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldHost&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller01&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldHost&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller02  # HA backup&lt;/span&gt;

&lt;span class="c"&gt;# Ports
&lt;/span&gt;&lt;span class="py"&gt;SlurmctldPort&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6817&lt;/span&gt;
&lt;span class="py"&gt;SlurmdPort&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6818&lt;/span&gt;

&lt;span class="c"&gt;# Scheduler
&lt;/span&gt;&lt;span class="py"&gt;SchedulerType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;sched/backfill&lt;/span&gt;
&lt;span class="py"&gt;SelectType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;select/cons_tres           # Consumable resources: Track&lt;/span&gt;
&lt;span class="py"&gt;SelectTypeParameters&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;CR_Core_Memory   # individual CPUs and memory&lt;/span&gt;
&lt;span class="py"&gt;SchedulerParameters&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;bf_max_job_test=500,bf_resolution=60&lt;/span&gt;

&lt;span class="c"&gt;# Priority (multifactor)
&lt;/span&gt;&lt;span class="py"&gt;PriorityType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;priority/multifactor&lt;/span&gt;
&lt;span class="py"&gt;PriorityWeightFairshare&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100000&lt;/span&gt;
&lt;span class="py"&gt;PriorityWeightAge&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1000&lt;/span&gt;
&lt;span class="py"&gt;PriorityWeightJobSize&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;100&lt;/span&gt;
&lt;span class="py"&gt;PriorityDecayHalfLife&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7-0             # 7 days half-life for fairshare&lt;/span&gt;
&lt;span class="py"&gt;PriorityMaxAge&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7-0&lt;/span&gt;

&lt;span class="c"&gt;# Accounting
&lt;/span&gt;&lt;span class="py"&gt;AccountingStorageType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;accounting_storage/slurmdbd&lt;/span&gt;
&lt;span class="py"&gt;AccountingStorageHost&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;controller01&lt;/span&gt;
&lt;span class="py"&gt;AccountingStoragePort&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6819&lt;/span&gt;
&lt;span class="py"&gt;AccountingStorageUser&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;slurm&lt;/span&gt;
&lt;span class="py"&gt;AccountingStoragePass&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;db_password&amp;gt;&lt;/span&gt;
&lt;span class="py"&gt;JobAcctGatherType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;jobacct_gather/cgroup&lt;/span&gt;
&lt;span class="py"&gt;JobAcctGatherFrequency&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;30             # Collect every 30s&lt;/span&gt;

&lt;span class="c"&gt;# Task and process tracking
&lt;/span&gt;&lt;span class="py"&gt;TaskPlugin&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;task/cgroup,task/affinity&lt;/span&gt;
&lt;span class="py"&gt;ProctrackType&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;proctrack/cgroup&lt;/span&gt;

&lt;span class="c"&gt;# GRES (GPU)
&lt;/span&gt;&lt;span class="py"&gt;GresTypes&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gpu&lt;/span&gt;

&lt;span class="c"&gt;# Timeouts
&lt;/span&gt;&lt;span class="py"&gt;SlurmdTimeout&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;300&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldTimeout&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120&lt;/span&gt;
&lt;span class="py"&gt;MessageTimeout&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;

&lt;span class="c"&gt;# Logging
&lt;/span&gt;&lt;span class="py"&gt;SlurmctldLogFile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/slurm/slurmctld.log&lt;/span&gt;
&lt;span class="py"&gt;SlurmdLogFile&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/log/slurm/slurmd.log&lt;/span&gt;
&lt;span class="py"&gt;SlurmctldDebug&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;
&lt;span class="py"&gt;SlurmdDebug&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;

&lt;span class="c"&gt;# Nodes (example: 16 nodes, 8x A100 each)
&lt;/span&gt;&lt;span class="py"&gt;NodeName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;node[01-16] &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;CPUs=64 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;RealMemory=512000 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Gres=gpu:a100:8 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;State=UNKNOWN&lt;/span&gt;

&lt;span class="c"&gt;# Partitions
&lt;/span&gt;&lt;span class="py"&gt;PartitionName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gpu &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Nodes=node[01-16] &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;MaxTime=INFINITE &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;DefaultTime=24:00:00 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;State=UP &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Default=YES&lt;/span&gt;

&lt;span class="py"&gt;PartitionName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;debug &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Nodes=node[01-02] &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;MaxTime=1:00:00 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;Priority=100 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;State=UP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Operational Runbook: Common Tasks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Drain a node for maintenance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Drain: no new jobs, current jobs finish&lt;/span&gt;
scontrol update &lt;span class="nv"&gt;NodeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;node05 &lt;span class="nv"&gt;State&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DRAIN &lt;span class="nv"&gt;Reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"scheduled maintenance"&lt;/span&gt;

&lt;span class="c"&gt;# Check when node will be empty&lt;/span&gt;
squeue &lt;span class="nt"&gt;-w&lt;/span&gt; node05

&lt;span class="c"&gt;# After jobs finish, confirm drain&lt;/span&gt;
scontrol show node node05 | &lt;span class="nb"&gt;grep &lt;/span&gt;State

&lt;span class="c"&gt;# Return to service&lt;/span&gt;
scontrol update &lt;span class="nv"&gt;NodeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;node05 &lt;span class="nv"&gt;State&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;RESUME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hold and release a job
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hold a pending job (prevents scheduling)&lt;/span&gt;
scontrol hold 12345

&lt;span class="c"&gt;# Release&lt;/span&gt;
scontrol release 12345

&lt;span class="c"&gt;# Requeue a failed running job&lt;/span&gt;
scontrol requeue 12345
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Identify wasted resources
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Jobs where memory usage &amp;lt; 50% of allocation&lt;/span&gt;
sacct &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JobID,ReqMem,MaxRSS,CPUTime,AveCPU &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;COMPLETED &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--starttime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2024-01-01 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'$3 != 0 &amp;amp;&amp;amp; ($3/$2) &amp;lt; 0.5 {print $0}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLURM in one diagram:

User submits job (sbatch / srun / srun --pty)
        |
        v
slurmctld
   validates resources (partitions + associations + QOS)
   queues job (PENDING)
   computes priority (fairshare + QOS + age + jobsize)
   runs backfill scheduling
   dispatches to allocated nodes (RUNNING)
   records lifecycle to slurmdbd
        |
        +-- slurmdbd -&amp;gt; MariaDB (full accounting, audit trail)
        |
        +-- slurmd on each node
                |
                +-- cgroups v2   (memory, CPU, GPU isolation)
                +-- Prolog       (pre-job setup, root)
                +-- slurmstepd   (user process, MPI launch)
                +-- Epilog       (post-job cleanup, root)
                +-- heartbeat    (node health to slurmctld)
                |
                +-- slurm-exporter :8080  (job + node metrics)
                +-- DCGM Exporter  :9400  (GPU metrics + job_id)
                        |
                        v
                VMAgent -&amp;gt; VictoriaMetrics -&amp;gt; Grafana

Security stack:
  MUNGE           inter-daemon auth (shared key, signed credentials)
  cgroups v2      resource isolation (memory, CPU, GPU per job)
  Associations    RBAC + fairshare (cluster &amp;gt; account &amp;gt; user &amp;gt; QOS)
  JWT + TLS       API security (slurmrestd behind reverse proxy)
  sacct / slurmdbd  audit trail (full accounting, queryable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three files to master before anything else: &lt;code&gt;slurm.conf&lt;/code&gt;, &lt;code&gt;cgroup.conf&lt;/code&gt;, &lt;code&gt;gres.conf&lt;/code&gt;. Everything else builds on top of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;




&lt;p&gt;&lt;em&gt;This article is part of the HPC Observability series. Next: Building GPU efficiency dashboards with VictoriaMetrics and Grafana for AI training workloads.&lt;/em&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Yoo, A.B., Jette, M.A., Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. &lt;em&gt;Lecture Notes in Computer Science&lt;/em&gt;, 2862, 44-60. &lt;a href="https://doi.org/10.1007/10968987_3" rel="noopener noreferrer"&gt;https://doi.org/10.1007/10968987_3&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;TOP500 Editors (2023). Statistics on Resource Management Software. &lt;em&gt;TOP500 Project&lt;/em&gt;. &lt;a href="https://www.top500.org/statistics/details/rmsoftware/1" rel="noopener noreferrer"&gt;https://www.top500.org/statistics/details/rmsoftware/1&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;SchedMD LLC. (2024). High Availability in SLURM. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/high_availability.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/high_availability.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;SchedMD LLC. (2024). REST API Guide. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/rest.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/rest.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;Grondona, M. (2024). MUNGE Authentication Service. &lt;em&gt;GitHub&lt;/em&gt;. &lt;a href="https://github.com/dun/munge" rel="noopener noreferrer"&gt;https://github.com/dun/munge&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;Lifka, D. (1995). The ANL/IBM SP Scheduling System. &lt;em&gt;Job Scheduling Strategies for Parallel Processing&lt;/em&gt;, 295-303. &lt;a href="https://doi.org/10.1007/3-540-60153-8_31" rel="noopener noreferrer"&gt;https://doi.org/10.1007/3-540-60153-8_31&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;SchedMD LLC. (2024). Multifactor Priority Plugin. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/priority_multifactor.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/priority_multifactor.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;Penso, V. et al. (2024). prometheus-slurm-exporter. &lt;em&gt;GitHub&lt;/em&gt;. &lt;a href="https://github.com/vpenso/prometheus-slurm-exporter" rel="noopener noreferrer"&gt;https://github.com/vpenso/prometheus-slurm-exporter&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;NVIDIA Corporation. (2024). DCGM Exporter. &lt;em&gt;GitHub&lt;/em&gt;. &lt;a href="https://github.com/NVIDIA/dcgm-exporter" rel="noopener noreferrer"&gt;https://github.com/NVIDIA/dcgm-exporter&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn10"&gt;
&lt;p&gt;VictoriaMetrics Team. (2024). VictoriaMetrics Documentation. &lt;a href="https://docs.victoriametrics.com" rel="noopener noreferrer"&gt;https://docs.victoriametrics.com&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn11"&gt;
&lt;p&gt;SchedMD LLC. (2024). Cgroups Guide. &lt;em&gt;SLURM Documentation&lt;/em&gt;. &lt;a href="https://slurm.schedmd.com/cgroups.html" rel="noopener noreferrer"&gt;https://slurm.schedmd.com/cgroups.html&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>hpc</category>
      <category>linux</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Monitoring an ML Pipeline in Production: Anatomy of an Open-Source Stack</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Tue, 24 Feb 2026 21:07:17 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/monitoring-an-ml-pipeline-in-production-anatomy-of-an-open-source-stack-55ik</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/monitoring-an-ml-pipeline-in-production-anatomy-of-an-open-source-stack-55ik</guid>
      <description>&lt;p&gt;This isn't a theoretical guide. It's a field report on the observability stack I've built and iterated across engagements and demos on the AI Observability Hub - a demonstration platform I use to validate AI monitoring architectures before deploying them at client sites.&lt;/p&gt;

&lt;p&gt;The goal is straightforward: give an SRE, data engineer, or CTO the building blocks to monitor an ML pipeline in production with VictoriaMetrics, OpenTelemetry, and Grafana. No vendor lock-in. No proprietary platform. Open-source components, assembled with intention.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we actually monitor (and what we forget)
&lt;/h2&gt;

&lt;p&gt;Most organizations deploying ML in production settle for monitoring infrastructure: CPU, RAM, disk space. That's necessary, but it's the equivalent of watching a factory's temperature without looking at the quality of parts coming off the line.&lt;/p&gt;

&lt;p&gt;A production ML pipeline has &lt;strong&gt;four observability layers&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;: the foundation. GPU utilization (compute, VRAM, memory bandwidth), CPU, network, disk I/O. Without it, you don't even know if the machine is running. But with it alone, you don't know if the model is working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data pipeline&lt;/strong&gt;: the invisible layer. Training data freshness, ingestion latency, feature completeness, statistical drift in input distributions. A model receiving degraded data produces degraded results, and nothing in the infra metrics flags it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt;: this is what data scientists care about, but what often goes unmonitored in production. Inference latency, throughput (requests/second), confidence score distribution, fallback rate, prediction vs. ground truth comparison when available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: this is what leadership cares about, and what's often discovered too late. Cost per inference, GPU cost per model, cost/business-value ratio. A model that costs €3 per inference on a use case generating €0.50 in value isn't a technical problem: it's a business problem that only observability makes visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The reference architecture
&lt;/h2&gt;

&lt;p&gt;Here's the stack I've built and deploy in my engagements. Each component was chosen for a specific reason - not by habit or popularity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                        ML APPLICATIONS                          │
│  vLLM / TGI / Triton / Custom Flask-FastAPI                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │         OpenTelemetry SDK + Auto-instrumentation        │    │
│  │    Traces (spans)  │  Metrics (counters/histograms)     │    │
│  └────────────────────┼────────────────────────────────────┘    │
└───────────────────────┼─────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                  OPENTELEMETRY COLLECTOR                         │
│  Receivers: OTLP (gRPC/HTTP), Prometheus scrape                 │
│  Processors: batch, filter, attributes, tail_sampling           │
│  Exporters: prometheusremotewrite, otlp                         │
└──────────┬──────────────────────────────────────────────────────┘
           │                                  │
           ▼                                  ▼
┌──────────────────────┐         ┌────────────────────────────────┐
│   VICTORIAMETRICS    │         │       OPENOBSERVE / LOKI       │
│   (metrics TSDB)     │         │       (logs + traces)          │
│   ┌──────────────┐   │         │                                │
│   │  vmselect    │   │         │  Long retention for audit      │
│   │  vminsert    │   │         │  Full-text search              │
│   │  vmstorage   │   │         │  Trace-log correlation         │
│   └──────────────┘   │         └────────────────────────────────┘
└──────────┬───────────┘                      │
           │                                  │
           ▼                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                          GRAFANA                                │
│  Infra Dashboard  │  Model Dashboard  │  Cost Dashboard         │
│  Alerting (Alertmanager) → PagerDuty / Slack / Email            │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Component by component
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenTelemetry: the instrumentation standard
&lt;/h3&gt;

&lt;p&gt;OpenTelemetry is the instrumentation choice for a non-negotiable reason: it's the only vendor-agnostic standard covering traces, metrics, and logs in a unified framework. Instrumenting with OTel guarantees the freedom to swap backends without re-instrumenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For an ML pipeline, instrumentation covers three levels:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;application SDK&lt;/strong&gt; integrates directly into the inference service code. For a Python service (FastAPI, Flask), OTel auto-instrumentation automatically captures HTTP requests, database calls, and processing spans. For model-specific metrics, custom instruments are added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MeterProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.metric_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPMetricExporter&lt;/span&gt;

&lt;span class="c1"&gt;# Meter configuration
&lt;/span&gt;&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Custom metrics for the ML pipeline
&lt;/span&gt;&lt;span class="n"&gt;inference_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Inference duration in milliseconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inference_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.tokens.total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total tokens generated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Confidence score distribution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gpu_cost_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.inference.cost.gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cumulative estimated GPU cost in euros&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EUR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In the inference function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_count&lt;/span&gt;
    &lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;

    &lt;span class="c1"&gt;# Record metrics with labels
&lt;/span&gt;    &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3-8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_case&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maintenance_assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;inference_duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;inference_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;confidence_score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# GPU cost estimation (based on hourly rate / time consumed)
&lt;/span&gt;    &lt;span class="n"&gt;gpu_hourly_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.50&lt;/span&gt;  &lt;span class="c1"&gt;# €/h for an A100
&lt;/span&gt;    &lt;span class="n"&gt;gpu_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3_600_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gpu_hourly_rate&lt;/span&gt;
    &lt;span class="n"&gt;gpu_cost_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Collector&lt;/strong&gt; is the consolidation point. It receives telemetry data from all services, transforms, filters, and routes it to storage backends. It's the most underestimated component in the stack — and the most critical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otel-collector-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

  &lt;span class="c1"&gt;# Scrape GPU metrics from DCGM/nvidia-smi exporter&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dcgm-exporter'&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dcgm-exporter:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node-exporter'&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node-exporter:9100'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;

  &lt;span class="c1"&gt;# Filter low-value metrics&lt;/span&gt;
  &lt;span class="na"&gt;filter/drop-debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;match_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;regexp&lt;/span&gt;
        &lt;span class="na"&gt;metric_names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*debug.*"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*test.*"&lt;/span&gt;

  &lt;span class="c1"&gt;# Attribute enrichment&lt;/span&gt;
  &lt;span class="na"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment.environment&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service.namespace&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml-platform&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;

  &lt;span class="c1"&gt;# Trace sampling (keep 100% of errors, 10% of the rest)&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors-always&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
        &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-rest&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;probabilistic&lt;/span&gt;
        &lt;span class="na"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheusremotewrite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://victoriametrics:8428/api/v1/write"&lt;/span&gt;
    &lt;span class="na"&gt;resource_to_telemetry_conversion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;otlp/traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openobserve:5081"&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;otlp/logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openobserve:5081"&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;filter/drop-debug&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheusremotewrite&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/traces&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes/env&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/logs&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key point here is &lt;strong&gt;tail sampling&lt;/strong&gt;. A production ML pipeline can generate thousands of traces per minute. Storing everything is costly and unnecessary. Tail sampling keeps 100% of error traces (the ones that matter for debugging) and samples the rest - reducing storage volume without losing signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  VictoriaMetrics: storage that handles the load
&lt;/h3&gt;

&lt;p&gt;I chose VictoriaMetrics over Prometheus for a simple reason: cardinality.&lt;/p&gt;

&lt;p&gt;A production ML pipeline generates metrics with a high number of label combinations: model × version × use case × environment × request type × user. Prometheus starts struggling beyond a few million active time series. VictoriaMetrics is designed to handle this scale with significantly lower memory and disk footprint.&lt;/p&gt;

&lt;p&gt;In practice, in my deployments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-node mode&lt;/strong&gt; for mid-market companies with moderate volume (&amp;lt; 500k active series). A single binary, minimal configuration, excellent performance. This is the mode I recommend to start with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Launch VictoriaMetrics single-node&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; victoriametrics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /data/vm:/victoria-metrics-data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8428:8428 &lt;span class="se"&gt;\&lt;/span&gt;
  victoriametrics/victoria-metrics:stable &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-retentionPeriod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;6m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-search&lt;/span&gt;.maxUniqueTimeseries&lt;span class="o"&gt;=&lt;/span&gt;5000000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-dedup&lt;/span&gt;.minScrapeInterval&lt;span class="o"&gt;=&lt;/span&gt;15s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cluster mode&lt;/strong&gt; (vmselect + vminsert + vmstorage) when volume exceeds one million series or when high availability is required. This is the mode I use on the AI Observability Hub to simulate realistic loads.&lt;/p&gt;

&lt;p&gt;Retention parameters are an architecture decision, not a configuration detail. For operational observability (SRE), 30 to 90 days suffice. For governance and audit (EU AI Act), plan for 12 to 36 months — and that's where VictoriaMetrics' compression makes a real difference in storage cost compared to alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grafana: three dashboards, three audiences
&lt;/h3&gt;

&lt;p&gt;Grafana isn't just a visualization tool. It's the translation layer between technical data and human decisions. A dashboard that just shows curves without guiding action is a useless dashboard.&lt;/p&gt;

&lt;p&gt;I systematically structure ML observability into &lt;strong&gt;three dashboards&lt;/strong&gt;:&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Dashboard 1: Infra &amp;amp; GPU (audience: SRE/DevOps)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dashboard answers: "Is the platform holding up?"&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# GPU compute utilization (via DCGM exporter)
DCGM_FI_DEV_GPU_UTIL{instance=~"$gpu_node"}

# GPU memory used vs. total
DCGM_FI_DEV_FB_USED{instance=~"$gpu_node"}
  / DCGM_FI_DEV_FB_TOTAL{instance=~"$gpu_node"} * 100

# GPU temperature (alert if &amp;gt; 85°C)
DCGM_FI_DEV_GPU_TEMP{instance=~"$gpu_node"}

# Inference throughput (requests/second)
rate(ml_inference_duration_count[5m])

# Inference p95 latency
histogram_quantile(0.95,
  rate(ml_inference_duration_bucket[5m])
)

# Queue saturation (if applicable)
ml_inference_queue_depth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configured alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization &amp;gt; 95% for 10 minutes → capacity alert&lt;/li&gt;
&lt;li&gt;GPU temperature &amp;gt; 85°C → thermal alert&lt;/li&gt;
&lt;li&gt;p95 latency &amp;gt; SLO threshold (e.g., 2s for a conversational assistant) → performance alert&lt;/li&gt;
&lt;li&gt;Queue depth &amp;gt; 100 requests for 5 minutes → saturation alert&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Dashboard 2: Model &amp;amp; Quality (audience: data engineers, ML engineers)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dashboard answers: "Is the model doing its job?"&lt;/p&gt;

&lt;p&gt;This is the dashboard missing from 90% of ML deployments I audit. The infra is running, the service responds, but nobody knows if the responses are good.&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Confidence score distribution (heatmap)
ml_inference_confidence_bucket

# Rolling 24h average confidence score
# (_sum and _count are cumulative counters, so use increase() over the window)
increase(
  ml_inference_confidence_sum[24h]
) / increase(
  ml_inference_confidence_count[24h]
)

# Low-confidence response rate (&amp;lt; 0.6)
sum(rate(ml_inference_confidence_bucket{le="0.6"}[1h]))
  / sum(rate(ml_inference_confidence_count[1h])) * 100

# Model error rate (timeouts, exceptions, fallbacks)
sum(rate(ml_inference_duration_count{status="error"}[5m]))
  / sum(rate(ml_inference_duration_count[5m])) * 100

# Average tokens generated per request
# (ml.inference.tokens.total is instrumented as a counter, not a histogram)
sum(rate(ml_inference_tokens_total[1h]))
  / sum(rate(ml_inference_duration_count[1h]))

# Drift detector: current vs. baseline distribution comparison
# (requires a periodic compute job publishing the metric)
ml_feature_drift_score{feature="input_length"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configured alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average confidence score &amp;lt; adaptive threshold for 48h → drift alert&lt;/li&gt;
&lt;li&gt;Low-confidence response rate &amp;gt; 20% → quality alert&lt;/li&gt;
&lt;li&gt;Model error rate &amp;gt; 5% for 15 minutes → critical alert&lt;/li&gt;
&lt;li&gt;Drift score &amp;gt; 0.3 on a key feature → data shift alert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The drift alert is the most important and hardest to calibrate. The threshold isn't static — it must be calculated against a baseline established over a reference period (the first 30 days in production, for example). This is a use case where VictoriaMetrics recording rules come into their own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Recording rules for baseline calculation&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml_baseline&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml:confidence:baseline_avg&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_sum[30d]&lt;/span&gt;
          &lt;span class="s"&gt;) / avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_count[30d]&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml:confidence:current_avg&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_sum[24h]&lt;/span&gt;
          &lt;span class="s"&gt;) / avg_over_time(&lt;/span&gt;
            &lt;span class="s"&gt;ml_inference_confidence_count[24h]&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml:confidence:drift_ratio&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;abs(ml:confidence:current_avg - ml:confidence:baseline_avg)&lt;/span&gt;
            &lt;span class="s"&gt;/ ml:confidence:baseline_avg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Dashboard 3: Cost &amp;amp; Business (audience: leadership, finance, product owners)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dashboard answers: "How much does it cost and how much value does it deliver?"&lt;/p&gt;

&lt;p&gt;This is the dashboard that turns a cost center into a value center — and the one that'll keep your budget next year.&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cumulative GPU cost by model (current day)
sum by (model_name)(
  increase(ml_inference_cost_gpu_total[24h])
)

# Average cost per inference
sum(rate(ml_inference_cost_gpu_total[1h]))
  / sum(rate(ml_inference_duration_count[1h]))

# Cost by use case
sum by (use_case)(
  increase(ml_inference_cost_gpu_total[30d])
)

# Daily inference volume
sum(increase(ml_inference_duration_count[24h]))

# Projected end-of-month cost (linear extrapolation)
sum(increase(ml_inference_cost_gpu_total[24h])) * 30

# Tokens/cost ratio (efficiency)
sum(rate(ml_inference_tokens_total[1h]))
  / sum(rate(ml_inference_cost_gpu_total[1h]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dashboard must be readable by someone who doesn't know what a quantile is. Big numbers at the top (today's cost, projected monthly cost, inference count), trends below, details by model and use case at the bottom.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment: from docker-compose to cluster
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Prototype (docker-compose)
&lt;/h3&gt;

&lt;p&gt;To validate the architecture, a &lt;code&gt;docker-compose.yml&lt;/code&gt; is enough. This is what I use on the AI Observability Hub for quick demos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;victoriametrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;victoriametrics/victoria-metrics:stable&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8428:8428"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;vm-data:/victoria-metrics-data&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-retentionPeriod=90d"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-search.maxUniqueTimeseries=3000000"&lt;/span&gt;

  &lt;span class="na"&gt;otel-collector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otel/opentelemetry-collector-contrib:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4317:4317"&lt;/span&gt;   &lt;span class="c1"&gt;# gRPC&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4318:4318"&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./otel-collector-config.yaml:/etc/otelcol/config.yaml&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;victoriametrics&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openobserve&lt;/span&gt;

  &lt;span class="na"&gt;openobserve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public.ecr.aws/zinclabs/openobserve:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5080:5080"&lt;/span&gt;   &lt;span class="c1"&gt;# UI&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5081:5081"&lt;/span&gt;   &lt;span class="c1"&gt;# Ingestion&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ZO_ROOT_USER_EMAIL=admin@erythix.com&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ZO_ROOT_USER_PASSWORD=changeme&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;oo-data:/data&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana-data:/var/lib/grafana&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/provisioning:/etc/grafana/provisioning&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/dashboards:/var/lib/grafana/dashboards&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_SECURITY_ADMIN_PASSWORD=changeme&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;victoriametrics&lt;/span&gt;

  &lt;span class="c1"&gt;# ML load simulator (for demos)&lt;/span&gt;
  &lt;span class="na"&gt;ml-simulator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./ml-simulator&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;MODEL_NAME=llama-3-8b&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vm-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;oo-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grafana-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 2: Production (Kubernetes)
&lt;/h3&gt;

&lt;p&gt;For production, each component is deployed via Helm charts or Kubernetes manifests with the following considerations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VictoriaMetrics&lt;/strong&gt;: official Helm chart (&lt;code&gt;victoria-metrics-k8s-stack&lt;/code&gt;) including vmoperator, recording rules, and Grafana integration. Cluster mode for HA, with PVCs on performant storage (local SSD or AWS EBS gp3).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OTel Collector&lt;/strong&gt;: deployed as a DaemonSet (one per node, for system and GPU metric collection) + a centralized Deployment (for aggregation, tail sampling, and routing). The DaemonSet collects DCGM metrics and local logs. The central Deployment handles processing and export.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grafana&lt;/strong&gt;: deployed with automatic datasource and dashboard provisioning via ConfigMaps. Dashboards are versioned in Git and deployed via CI/CD — no manual configuration that drifts over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfalls I learned to avoid
&lt;/h2&gt;

&lt;p&gt;After multiple iterations on the AI Observability Hub and real-world deployments, here are the most expensive mistakes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cardinality explosion
&lt;/h3&gt;

&lt;p&gt;Trap number one. A &lt;code&gt;user_id&lt;/code&gt; label on inference metrics seems useful - until 10,000 users generate 10,000 time series per metric. Multiply by 20 metrics and 3 model versions, and you hit 600,000 series for a single service.&lt;/p&gt;

&lt;p&gt;The rule: high-cardinality labels (user ID, request ID, session ID) belong in &lt;strong&gt;traces and logs&lt;/strong&gt;, not in &lt;strong&gt;metrics&lt;/strong&gt;. Metrics use bounded-cardinality labels: model_name, model_version, environment, use_case, status.&lt;/p&gt;
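
&lt;p&gt;To make the split concrete, here is a minimal sketch with the OpenTelemetry Python SDK (names such as &lt;code&gt;run_model&lt;/code&gt; and the exact attribute keys are illustrative, not taken from a specific codebase): the counter carries only bounded labels, while the request-scoped identifiers ride on the active span, where tail sampling and log correlation can still pick them up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry import metrics, trace

meter = metrics.get_meter("ml.inference")
tracer = trace.get_tracer("ml.inference")

# Bounded-cardinality labels only: a handful of possible values per dimension
request_counter = meter.create_counter("ml.inference.requests.total")

def handle_request(user_id, session_id, prompt):
    with tracer.start_as_current_span("predict") as span:
        # High-cardinality identifiers go on the span (and into logs), not on metrics
        span.set_attribute("enduser.id", user_id)
        span.set_attribute("session.id", session_id)

        result = run_model(prompt)  # run_model is a placeholder for the real inference call

        # The metric only carries labels with a small, known set of values
        request_counter.add(1, {
            "model_name": "llama-3-8b",
            "environment": "production",
            "status": "ok",
        })
        return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this split, "error rate per model version" stays a cheap query in VictoriaMetrics, and "show me everything user X did at 2 PM" is answered from traces and logs instead.&lt;/p&gt;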

&lt;h3&gt;
  
  
  Cargo cult monitoring
&lt;/h3&gt;

&lt;p&gt;Copying someone else's dashboards without understanding what they measure. I've seen teams with 47 panels on a dashboard, 43 of which nobody looked at. A useful dashboard has between 6 and 12 panels, organized by business question, not by metric type.&lt;/p&gt;

&lt;h3&gt;
  
  
  No baseline
&lt;/h3&gt;

&lt;p&gt;Monitoring without a baseline is like having a thermometer without knowing what temperature is normal. The first 30 days of a model's production run should establish baselines for every key metric. Drift alerts are calculated against these baselines, not against arbitrary thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Only monitoring the happy path
&lt;/h3&gt;

&lt;p&gt;Instrumenting only the nominal path and discovering during an incident that the error path isn't traced. Every fallback, every timeout, every exception should produce metrics and spans with an explicit error status. Errors are where observability creates the most value.&lt;/p&gt;
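
&lt;p&gt;A minimal sketch of what that looks like with the OpenTelemetry Python SDK (the metric name, label values, and the &lt;code&gt;fallback_answer&lt;/code&gt; helper are illustrative assumptions): the exception is recorded on the span, the span status is set to error, and the request counter is incremented with an explicit status label so the error rate can be graphed and alerted on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry import metrics, trace
from opentelemetry.trace import StatusCode

meter = metrics.get_meter("ml.inference")
tracer = trace.get_tracer("ml.inference")
request_counter = meter.create_counter("ml.inference.requests.total")

def predict_with_error_path(model, prompt):
    labels = {"model_name": "llama-3-8b", "environment": "production"}
    with tracer.start_as_current_span("predict") as span:
        try:
            result = model.generate(prompt)                      # nominal path
            request_counter.add(1, {**labels, "status": "ok"})
            return result
        except TimeoutError:
            # Fallbacks are counted and traced, not silently swallowed
            request_counter.add(1, {**labels, "status": "timeout"})
            span.set_status(StatusCode.ERROR, "model timeout")
            return fallback_answer(prompt)                       # placeholder fallback helper
        except Exception as exc:
            request_counter.add(1, {**labels, "status": "error"})
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, str(exc))
            raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;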

&lt;h3&gt;
  
  
  The cost of monitoring itself
&lt;/h3&gt;

&lt;p&gt;I've seen observability stacks that cost more than the infrastructure they monitored. VictoriaMetrics helps significantly here (aggressive compression, low memory footprint), but sizing must be planned from the start. Rule of thumb: monitoring cost shouldn't exceed 5-10% of the monitored infrastructure cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  The result: what the stack makes possible
&lt;/h2&gt;

&lt;p&gt;When all four observability layers are in place (infra, pipeline, model, cost), three things become possible that weren't before:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift detection in days, not months.&lt;/strong&gt; On a recent engagement, the stack detected a confidence score degradation in a predictive maintenance model 72 hours after a sensor change on the factory floor. Without model monitoring, the team would have continued following degraded recommendations for weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-driven cost/performance optimization.&lt;/strong&gt; The cost dashboard helped a client discover that a marginal use case (5% of volume) consumed 35% of the GPU budget due to an oversized model. Replaced with a lighter model, same perceived quality, 30% reduction in the overall bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance as a byproduct of observability.&lt;/strong&gt; The audit trails required for the EU AI Act aren't an additional effort; they're the traces and logs the stack already collects. It's a matter of structuring them for audit, not creating them from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you have nothing today, here's the sequence I recommend:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1: Instrument.&lt;/strong&gt; Add the OpenTelemetry SDK to your inference service. Five metrics are enough to start: inference duration, request count, error rate, tokens generated, confidence score. Deploy the Collector in minimal mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 2: Store and visualize.&lt;/strong&gt; Deploy VictoriaMetrics in single-node mode and Grafana. Create the infra dashboard first (it's the fastest), then the model dashboard. It doesn't need to look pretty — it needs to be functional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3: Alert.&lt;/strong&gt; Configure three alerts only: p95 latency above SLO, error rate above 5%, confidence score dropping. Three well-calibrated alerts are worth more than twenty that generate fatigue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 2: Refine.&lt;/strong&gt; Add cost metrics. Establish your baselines. Configure recording rules for drift calculation. Create the cost/business dashboard. At this point, you have production-grade ML observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 3+: Extend.&lt;/strong&gt; Add traces for advanced debugging. Integrate structured logs. Connect AI alerts to your SIEM. Explore input feature monitoring for data drift detection.&lt;/p&gt;

&lt;p&gt;Each step delivers immediate value. No need to wait for everything to be in place to benefit from observability.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Samuel Desseaux is the founder of Erythix and Aureonis, a fractional CTO and trainer specializing in IT/AI observability, AI security, and IT/OT convergence. Official VictoriaMetrics partner for France and Benelux, and Arize partner.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The AI Observability Hub is Erythix's demonstration platform for AI workload observability. Contact &lt;a href="https://www.linkedin.com/in/sdesseaux/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/sdesseaux/&lt;/a&gt; for a demo or a stack diagnostic.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>machinelearning</category>
      <category>monitoring</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Observability as the Control Plane for AI: Operations, Security, Governance</title>
      <dc:creator>Erythix</dc:creator>
      <pubDate>Mon, 23 Feb 2026 21:10:48 +0000</pubDate>
      <link>https://dev.to/erythix_6d20050c4f1039b32/observability-as-the-control-plane-for-ai-operations-security-governance-1bk7</link>
      <guid>https://dev.to/erythix_6d20050c4f1039b32/observability-as-the-control-plane-for-ai-operations-security-governance-1bk7</guid>
      <description>&lt;p&gt;Language models are in production. AI agents are making decisions, calling APIs, accessing databases and in most European mid-market industrial companies, nobody knows exactly what they're doing.&lt;/p&gt;

&lt;p&gt;We know how to monitor a web server and how to trace a SQL query, but when an LLM decides to rephrase a maintenance instruction, summarize a quality report, or trigger an action through an external tool, traditional observability stacks are blind. Not because they lack metrics, but because they're asking the wrong questions.&lt;/p&gt;

&lt;p&gt;Observability for AI is no longer "is the system running?" It's &lt;strong&gt;"why did the system make that decision, and should it have?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article proposes a three-layer framework (operational, security, governance) to turn observability into a true control plane for AI systems. Not in five years. Now. With open-source building blocks. For mid-market companies that don't have hyperscaler budgets.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: non-deterministic systems in critical environments
&lt;/h2&gt;

&lt;p&gt;A traditional system is predictable: same input, same output. An LLM, by design, is not. It produces probabilistic responses, follows variable execution paths, and, when connected to tools (function calling, RAG, agents), can trigger real-world actions.&lt;/p&gt;

&lt;p&gt;Three characteristics make traditional monitoring approaches insufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-determinism&lt;/strong&gt; first. Two identical queries can produce different results. Static alert thresholds lose their meaning when normal behavior isn't constant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploding cardinality&lt;/strong&gt; next. Every AI interaction generates new telemetry dimensions: prompts, embeddings, tool calls, intermediate reasoning steps, dynamic identities. The volume of data to ingest and correlate changes by an order of magnitude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-system execution paths&lt;/strong&gt; last. An AI workflow can traverse APIs, data platforms, identity systems, security controls, and multiple infrastructure layers. Root cause analysis requires stitching together signals across domains that historically operated in silos.&lt;/p&gt;

&lt;p&gt;In an industrial context (predictive maintenance, process control, asset management), this opacity isn't a technical inconvenience. It's an operational risk.&lt;/p&gt;




&lt;h2&gt;
  
  
  The framework: three layers of control
&lt;/h2&gt;

&lt;p&gt;AI observability is not a separate discipline. It's an extension of existing monitoring that adds three specific control functions. Each answers a different question, serves a different audience, and is implemented with different tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Operational control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the AI system performing as expected, within acceptable performance boundaries?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it serves:&lt;/strong&gt; SRE, DevOps, and data engineering teams.&lt;/p&gt;

&lt;p&gt;Operational control extends existing SRE practices to AI workloads. It means monitoring not just infrastructure (GPU, memory, network latency) but also model behavior in production: inference time, error rates, response quality, cost per request.&lt;/p&gt;

&lt;p&gt;The challenge specific to AI systems is &lt;strong&gt;drift detection&lt;/strong&gt;. A model that worked correctly three months ago can silently degrade because input data has shifted, business context has changed, or a RAG component was updated without anyone checking the impact on outputs.&lt;/p&gt;
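
&lt;p&gt;To make drift detection concrete, here is a sketch of a periodic drift job (the metric name, the VictoriaMetrics endpoint, the bin count, and the &lt;code&gt;fetch_recent_input_lengths&lt;/code&gt; helper are all illustrative assumptions, not a prescribed implementation): it compares the current distribution of one input feature against a frozen baseline using a Population Stability Index, then pushes the score so an alert rule can evaluate it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import requests

VM_IMPORT_URL = "http://victoriametrics:8428/api/v1/import/prometheus"  # assumed endpoint

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_counts, _ = np.histogram(baseline, bins=edges)
    c_counts, _ = np.histogram(current, bins=edges)
    b_frac = np.clip(b_counts / b_counts.sum(), 1e-6, None)
    c_frac = np.clip(c_counts / c_counts.sum(), 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

def publish_drift(feature, score):
    # One sample in Prometheus text format; VictoriaMetrics ingests it directly
    line = f'ml_feature_drift_score{{feature="{feature}"}} {score}\n'
    requests.post(VM_IMPORT_URL, data=line, timeout=5).raise_for_status()

# Periodic job (cron, Airflow, ...): baseline frozen at go-live, current window = last 24h
baseline_lengths = np.load("baseline_input_length.npy")      # baseline captured at go-live
current_lengths = np.array(fetch_recent_input_lengths())     # placeholder for the real query
publish_drift("input_length", psi(baseline_lengths, current_lengths))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;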

&lt;p&gt;&lt;strong&gt;In practice, this means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defining AI-specific SLOs (p95 inference latency, hallucination rate, cost per token) alongside standard application SLOs&lt;/li&gt;
&lt;li&gt;Instrumenting the ML pipeline end-to-end with OpenTelemetry — from data preprocessing through to the user-facing response&lt;/li&gt;
&lt;li&gt;Collecting and storing model performance metrics in a high-performance time-series database (VictoriaMetrics natively handles the high cardinality these workloads generate)&lt;/li&gt;
&lt;li&gt;Configuring dynamic alerts that adapt to probabilistic behavior, rather than fixed thresholds that generate noise&lt;/li&gt;
&lt;/ul&gt;
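
&lt;p&gt;As a rough illustration of that instrumentation, here is a minimal sketch using the OpenTelemetry Python API. It assumes an SDK already configured with exporters (the bootstrap appears later in the architecture section); the span name, metric names, and the &lt;code&gt;model.generate()&lt;/code&gt; call are illustrative, not a standard.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: wrapping one inference call in a span and recording
# AI-specific measurements. Names are illustrative, not normative.
from opentelemetry import trace, metrics

tracer = trace.get_tracer("ml.pipeline")
meter = metrics.get_meter("ml.pipeline")

latency_ms = meter.create_histogram("ml.inference.latency", unit="ms")
tokens_used = meter.create_counter("ml.inference.tokens")

def predict(request, model, model_version):
    with tracer.start_as_current_span("ml.inference") as span:
        span.set_attribute("ml.model.name", model.name)
        span.set_attribute("ml.model.version", model_version)

        result = model.generate(request.prompt)   # hypothetical model API

        span.set_attribute("ml.response.tokens", result.token_count)
        latency_ms.record(result.latency_ms, {"model": model.name})
        tokens_used.add(result.token_count, {"model": model.name})
        return result
&lt;/code&gt;&lt;/pre&gt;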

&lt;p&gt;&lt;strong&gt;Concrete use case — Predictive maintenance in aerospace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An aerospace subcontractor uses an ML model to anticipate equipment failures on an assembly line. The model runs on an internal HPC cluster, consumes real-time sensor data, and feeds a maintenance planning dashboard.&lt;/p&gt;

&lt;p&gt;The observability stack monitors three levels: infrastructure (GPU utilization, memory throughput, cluster network latency), pipeline (preprocessing time, training data freshness, input feature drift), and the model itself (confidence score distribution, false positive rate over the last 30 days, comparison with field feedback).&lt;/p&gt;

&lt;p&gt;When the average confidence score drops below an adaptive threshold for 48 hours, an alert is routed not to the infrastructure team, but to the data team — because the problem is most likely not a downed server, but a shift in sensor data patterns.&lt;/p&gt;
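
&lt;p&gt;The adaptive threshold itself can stay simple. A minimal sketch, assuming the confidence scores are already queryable as two time windows; the window sizes, the two-sigma margin, and the &lt;code&gt;route_alert()&lt;/code&gt; helper are assumptions, not prescriptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: compare the recent confidence mean against a rolling baseline
# instead of a fixed threshold. Window sizes, the two-sigma margin and
# route_alert() are illustrative assumptions.
from statistics import mean, stdev

def check_confidence_drift(scores_30d, scores_48h, route_alert):
    baseline = mean(scores_30d)
    spread = stdev(scores_30d)
    recent = mean(scores_48h)

    # Two standard deviations below the 30-day baseline: most likely a
    # shift in sensor data patterns, so the alert goes to the data team.
    if baseline - recent &gt; 2 * spread:
        route_alert(team="data", reason="confidence-drift",
                    baseline=baseline, recent=recent)
&lt;/code&gt;&lt;/pre&gt;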

&lt;p&gt;Without this layer, the maintenance team continues following recommendations from a degraded model for weeks. With it, drift is detected in days, not months.&lt;/p&gt;




&lt;h3&gt;
  
  
  Layer 2: Security control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the AI system doing something it shouldn't be doing?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it serves:&lt;/strong&gt; CISOs, SOC teams, cybersecurity leads.&lt;/p&gt;

&lt;p&gt;This is the layer that most fundamentally changes the role of observability. Production LLMs are not just tools — they are active attack surfaces. Prompt injection, data exfiltration through model outputs, business logic abuse through reasoning manipulation, unauthorized use of connected tools: the vectors are specific, and traditional security tools don't cover them.&lt;/p&gt;

&lt;p&gt;The OWASP Top 10 for LLM Applications identifies ten vulnerability categories, the most critical for industrial contexts being prompt injection (direct and indirect), insecure output handling, and excessive agency granted to agents.&lt;/p&gt;

&lt;p&gt;The fundamental problem is &lt;strong&gt;visibility&lt;/strong&gt;. In many organizations, LLM interactions simply aren't logged. When an incident occurs, response teams don't have the data to understand what happened. Prompts aren't recorded. Tool calls aren't traced. Model decisions aren't auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, this means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Systematically logging every LLM interaction: input prompt, system context, generated output, tools called, with timestamps and user attribution (sketched after this list)&lt;/li&gt;
&lt;li&gt;Establishing behavioral baselines and detecting anomalies: unusual request spikes, repetitive prompt patterns (possible signs of model extraction), access to tools or data outside normal scope&lt;/li&gt;
&lt;li&gt;Treating tools exposed to LLMs as privileged interfaces — with access controls, auditing, and enforcement independent of the model's output&lt;/li&gt;
&lt;li&gt;Integrating AI alerts into the existing SIEM rather than creating a parallel security silo&lt;/li&gt;
&lt;/ul&gt;
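
&lt;p&gt;A minimal sketch of that interaction logging, assuming OpenTelemetry tracing is already set up. The attribute names loosely follow the draft GenAI semantic conventions, and the &lt;code&gt;client.complete()&lt;/code&gt; call is hypothetical; adapt both to your own gateway.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: recording one LLM interaction with full attribution.
# Attribute names loosely follow OpenTelemetry's draft GenAI semantic
# conventions; client.complete() is a hypothetical gateway call.
from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

def call_llm(client, user_id, prompt, system_context, tools):
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("enduser.id", user_id)
        # If prompts may contain sensitive data, store a reference to an
        # encrypted copy here instead of the raw text.
        span.set_attribute("gen_ai.prompt", prompt)

        response = client.complete(prompt, system=system_context, tools=tools)

        span.set_attribute("gen_ai.completion", response.text)
        # Tool calls are what turn a bad output into a bad action:
        # record each one so an incident can be reconstructed later.
        for call in response.tool_calls:
            span.add_event("gen_ai.tool.call", {
                "tool.name": call.name,
                "tool.arguments": str(call.arguments),
            })
        return response
&lt;/code&gt;&lt;/pre&gt;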

&lt;p&gt;&lt;strong&gt;Concrete use case — Prompt injection detection on an internal RAG assistant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An energy distributor deploys an internal conversational assistant connected via RAG to its technical documentation (intervention procedures, safety data sheets, standards). The assistant is used by field technicians through a mobile application.&lt;/p&gt;

&lt;p&gt;The security layer observes every interaction and applies three levels of detection. First level: syntactic analysis of incoming prompts to detect known injection patterns (instructions like "ignore your previous instructions," unusual encodings, attempts to exfiltrate the system prompt). Second level: output monitoring to identify responses containing system prompt elements, out-of-scope data, or instructions the model shouldn't generate. Third level: tool call monitoring — if the assistant attempts to access documents outside its authorized scope or if the access pattern changes abruptly, an alert is raised.&lt;/p&gt;
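
&lt;p&gt;The first, syntactic level can be as simple as a normalization pass plus a pattern list. A sketch follows, with a deliberately short and illustrative pattern list; on its own it catches only crude attempts and is meant to feed the behavioral and output-side checks, not replace them.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the first, syntactic detection level. The pattern list is
# deliberately short and illustrative; it feeds the behavioral and
# output-side checks rather than replacing them.
import re
import unicodedata

INJECTION_PATTERNS = [
    r"ignore (all|your) (previous|prior) instructions",
    r"reveal (the|your) system prompt",
    r"you are now .{0,40} without restrictions",
]

def flag_suspicious_prompt(prompt):
    # Normalize unusual encodings (homoglyphs, width tricks) before matching.
    normalized = unicodedata.normalize("NFKC", prompt).lower()
    return [p for p in INJECTION_PATTERNS if re.search(p, normalized)]
&lt;/code&gt;&lt;/pre&gt;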

&lt;p&gt;One morning, the system detects a user submitting a series of short, structured prompts that resemble a systematic attempt to map the documents accessible via the RAG pipeline. The volume and pattern don't match normal technician usage. The alert reaches the SOC, which isolates the session and analyzes the vector.&lt;/p&gt;

&lt;p&gt;Without this layer, the progressive exfiltration of the technical documentation base goes unnoticed.&lt;/p&gt;




&lt;h3&gt;
  
  
  Layer 3: Governance control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the AI system complying with the rules the organization is subject to?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it serves:&lt;/strong&gt; executive leadership, compliance, legal, DPOs.&lt;/p&gt;

&lt;p&gt;The EU AI Act enters full enforcement in 2026. For high-risk AI systems — and many industrial use cases fall into this category — the obligations are concrete: decision traceability, technical documentation, risk assessment, human oversight, robustness, and cybersecurity. Fines can reach €35 million or 7% of global annual turnover.&lt;/p&gt;

&lt;p&gt;AI governance is not a PowerPoint topic. It's a data topic. And that data is what observability produces.&lt;/p&gt;

&lt;p&gt;Observability provides the evidence needed for compliance: who used the system, when, with what inputs, what decisions were made, what actions were triggered, and whether safeguards were active at the time of the interaction. Without these traces, a mid-market company cannot demonstrate compliance — it can only assert it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, this means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building complete, timestamped audit trails of every AI interaction, retained according to regulatory timeframes (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Automatically documenting data lineage: where did the training data, RAG data, and context data come from? Were they filtered, anonymized, validated?&lt;/li&gt;
&lt;li&gt;Producing compliance dashboards that translate technical data into indicators understandable by a board of directors or an auditor&lt;/li&gt;
&lt;li&gt;Implementing traceable human oversight mechanisms — not a cosmetic "approve" button, but proof that human decision-making actually occurred&lt;/li&gt;
&lt;/ul&gt;
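
&lt;p&gt;A sketch of what one audit-trail entry can look like, assuming an append-only JSONL sink. The field names and the storage choice are illustrative; in practice these records are assembled from the telemetry the first two layers already emit.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: one timestamped, attributable record per AI interaction,
# appended to write-once storage. Field names and the JSONL sink are
# illustrative; retention must follow the applicable regulatory timeframe.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    timestamp: str
    user_id: str
    model_version: str
    input_ref: str          # pointer to the stored prompt/input, not a copy
    output_ref: str
    tools_called: list
    safeguards_active: bool
    human_review: bool

def append_audit(record, path="ai_audit.jsonl"):
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

append_audit(AuditRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    user_id="tech-017",
    model_version="assistant-rag-1.4.2",
    input_ref="s3://prompts/2026/03/21/abc123.json",
    output_ref="s3://completions/2026/03/21/abc123.json",
    tools_called=["search_documents"],
    safeguards_active=True,
    human_review=False,
))
&lt;/code&gt;&lt;/pre&gt;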

&lt;p&gt;&lt;strong&gt;Concrete use case — EU AI Act audit for a quality control system&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An automotive component manufacturer uses a computer vision system coupled with an LLM for end-of-line quality control. The system classifies parts (conforming / non-conforming / requires inspection) and generates anomaly reports in natural language.&lt;/p&gt;

&lt;p&gt;This system falls under the "high-risk" category as defined by the EU AI Act (safety components for products covered by Union harmonization legislation). The company must demonstrate traceability for every classification decision.&lt;/p&gt;

&lt;p&gt;The governance layer builds on data produced by the two preceding layers and structures it for audit. Every classification decision is recorded with: the source image, the vision model result, the confidence score, the LLM-generated report, the timestamp, and the identifier of the model version used. When the confidence score falls below a defined threshold, the system forces human verification and records the operator's decision.&lt;/p&gt;
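
&lt;p&gt;The oversight gate can be expressed in a few lines. In the sketch below, the 0.85 threshold, the &lt;code&gt;request_human_review()&lt;/code&gt; step, and the &lt;code&gt;record_decision()&lt;/code&gt; sink are placeholders for the manufacturer's own values and systems.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the human-oversight gate: below the confidence threshold the
# classification is never auto-applied, and the operator's verdict is
# recorded as evidence. Threshold and helpers are placeholders.
CONFIDENCE_THRESHOLD = 0.85

def finalize_classification(result, request_human_review, record_decision):
    auto_ok = result.confidence &gt;= CONFIDENCE_THRESHOLD

    if auto_ok:
        final_label = result.label
        operator_id = None
    else:
        verdict = request_human_review(result)    # blocking, traceable step
        final_label = verdict.label
        operator_id = verdict.operator_id

    record_decision(
        image_ref=result.image_ref,
        model_version=result.model_version,
        confidence=result.confidence,
        automated=auto_ok,
        final_label=final_label,
        operator_id=operator_id,
    )
    return final_label
&lt;/code&gt;&lt;/pre&gt;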

&lt;p&gt;A governance dashboard aggregates this data into monthly indicators: automated vs. supervised classification rates, confidence score distribution, number of human-machine disagreements, average processing time. These indicators feed directly into the EU AI Act compliance file.&lt;/p&gt;

&lt;p&gt;During an audit, the company doesn't present a static document describing what the system is supposed to do. It shows the actual data of what the system did, decision by decision, with proof that safeguards were functioning.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture: building on what exists with open-source components
&lt;/h2&gt;

&lt;p&gt;One of the most common mistakes is treating AI observability as a greenfield project requiring a dedicated stack. In most mid-market companies, the foundations already exist: metrics collection, log aggregation, trace correlation. The work is to extend that stack, not replace it.&lt;/p&gt;

&lt;p&gt;Here is a reference architecture built on open-source, sovereign components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collection and instrumentation&lt;/strong&gt; — OpenTelemetry as the single instrumentation standard. The benefit is twofold: vendor-agnostic (no lock-in) and extensible to AI-specific signals (prompt traces, inference metrics, tool call spans). OpenTelemetry SDKs integrate with existing ML frameworks (LangChain, LlamaIndex, vLLM) through dedicated instrumentations.&lt;/p&gt;
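
&lt;p&gt;The bootstrap for that collection side fits in a dozen lines per service. A sketch, assuming an OpenTelemetry Collector reachable on the usual OTLP gRPC port; the endpoint and service name are placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: bootstrapping OpenTelemetry once per service, exporting over
# OTLP to a local collector. Endpoint and service name are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "feature-store"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Framework instrumentations (LangChain, LlamaIndex, vLLM) can then emit through this same provider, so prompt traces and pipeline traces end up in one trace tree.&lt;/p&gt;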

&lt;p&gt;&lt;strong&gt;Metric storage and querying&lt;/strong&gt; — VictoriaMetrics for time series. The high cardinality of AI metrics (one dimension per model × per version × per request type × per user) overwhelms traditional monitoring solutions. VictoriaMetrics is designed to handle this scale with a controlled resource footprint — a critical point for mid-market companies that can't provision Thanos clusters.&lt;/p&gt;
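
&lt;p&gt;To make the cardinality point concrete, here is a sketch using &lt;code&gt;prometheus_client&lt;/code&gt; (VictoriaMetrics scrapes the same exposition format); the metric and label names are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: every label added below multiplies the number of active series.
# model x version x request_type already produces hundreds of series;
# adding per-user labels is usually where traditional TSDBs give up.
from prometheus_client import Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "ml_inference_latency_seconds",
    "Inference latency per model, version and request type",
    ["model", "version", "request_type"],
)

start_http_server(9100)  # /metrics endpoint scraped by VictoriaMetrics

def observe(model, version, request_type, seconds):
    INFERENCE_LATENCY.labels(model, version, request_type).observe(seconds)
&lt;/code&gt;&lt;/pre&gt;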

&lt;p&gt;&lt;strong&gt;Logs and traces&lt;/strong&gt; — OpenObserve or Grafana Loki for log aggregation, with long retention for audit requirements. OpenTelemetry traces enable following a user request from the initial prompt to the final action, through every reasoning step and every tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualization and alerting&lt;/strong&gt; — Grafana as the unified presentation layer. Three types of dashboards: operational (SRE/DevOps), security (SOC), governance (leadership/compliance). The same underlying data, views adapted to each audience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security integration&lt;/strong&gt; — AI alerts feed the existing SIEM through standardized exports. No silos. The goal is for the SOC analyst to see AI events in the same stream as network and application events.&lt;/p&gt;

&lt;p&gt;What makes this architecture viable for a mid-market company is that it relies on components many already use. Extending to AI observability doesn't require a dedicated budget of several hundred thousand euros. It requires competence, architecture, and an understanding of what needs to be measured.&lt;/p&gt;




&lt;h2&gt;
  
  
  What really changes: from passive to active observability
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring is fundamentally passive. It collects data, displays dashboards, and sends alerts when a threshold is breached. The human then decides what to do.&lt;/p&gt;

&lt;p&gt;For AI systems in production, this approach is no longer sufficient. When an AI agent makes a bad decision, the time between detection and action must be measured in seconds, not minutes. Observability becomes active: it informs automated actions — throttling, rollback, isolation, escalation — when system behavior deviates from expected patterns.&lt;/p&gt;
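
&lt;p&gt;What "active" can look like in practice: a sketch of a webhook receiver that maps alert names to actions. The payload shape follows the Alertmanager webhook format; the alert names and the throttle, rollback, and isolate functions are placeholders for whatever your serving and identity layers expose.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: an alert-webhook handler that acts instead of only notifying.
# Payload shape follows the Alertmanager webhook format; the three
# action functions are placeholders for your own serving/identity APIs.
from flask import Flask, request

app = Flask(__name__)

def throttle_model(model, rate):        # e.g. tighten a gateway rate limit
    ...

def rollback_model(model, version):     # e.g. call the deployment API
    ...

def isolate_session(session_id):        # e.g. revoke the session or token
    ...

ACTIONS = {
    "AIOutputAnomaly": lambda labels: throttle_model(labels["model"], rate=0.1),
    "AIDriftCritical": lambda labels: rollback_model(labels["model"], labels["previous_version"]),
    "AIToolAbuse":     lambda labels: isolate_session(labels["session_id"]),
}

@app.post("/alerts")
def handle_alert():
    for alert in request.json.get("alerts", []):
        action = ACTIONS.get(alert["labels"].get("alertname"))
        if action:
            action(alert["labels"])     # seconds between detection and action
    return "", 204
&lt;/code&gt;&lt;/pre&gt;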

&lt;p&gt;This shift to active observability is the real paradigm change. It's no longer about knowing what happened after the fact. It's about intervening while it's happening.&lt;/p&gt;

&lt;p&gt;For European industrial mid-market companies, this is also a strategic opportunity. Hyperscalers are building these capabilities into their proprietary platforms. Building the equivalent on open-source, sovereign components means preserving your decision-making capacity and technical mastery — without depending on a vendor that can change its terms, its pricing, or its data policies overnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to start?
&lt;/h2&gt;

&lt;p&gt;If you're deploying or planning to deploy AI systems in production, here are three concrete actions to take now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrument before you deploy.&lt;/strong&gt; Integrate OpenTelemetry into your ML pipelines during development, not after production deployment. The cost of retrofitting instrumentation is always higher, and the first weeks in production are precisely when you need visibility the most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define your AI SLOs.&lt;/strong&gt; Just as you have availability and latency SLOs for your applications, define measurable objectives for your AI systems: response quality, cost per inference, drift rate, audit coverage. What isn't measured won't be governed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat AI security as an observability problem.&lt;/strong&gt; Don't create an AI security silo next to your existing SOC. Extend your monitoring to cover the new threat vectors. The data is the same — the questions are different.&lt;/p&gt;

&lt;p&gt;Observability isn't just another tool in the stack. It's the control plane that makes AI governable. Without it, you're flying blind. With it, you know what your systems are doing, why they're doing it, and you can prove they're doing it correctly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Samuel Desseaux is the founder of Erythix and Aureonis, a CTO-Advocate specializing in IT/OT observability, AI security, and AI observability. Official VictoriaMetrics Training Partner for Europe and Arize partner.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
