<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: samuel desseaux</title>
    <description>The latest articles on DEV Community by samuel desseaux (@samuel_desseaux_815f9c463).</description>
    <link>https://dev.to/samuel_desseaux_815f9c463</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3943733%2F2a28145a-e50c-448f-8598-9569cbbf7ef7.jpg</url>
      <title>DEV Community: samuel desseaux</title>
      <link>https://dev.to/samuel_desseaux_815f9c463</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samuel_desseaux_815f9c463"/>
    <language>en</language>
    <item>
      <title>End-to-End Observability for vLLM and TGI: from DCGM to Tokens</title>
      <dc:creator>samuel desseaux</dc:creator>
      <pubDate>Thu, 21 May 2026 11:37:13 +0000</pubDate>
      <link>https://dev.to/samuel_desseaux_815f9c463/end-to-end-observability-for-vllm-and-tgi-from-dcgm-to-tokens-4fbj</link>
      <guid>https://dev.to/samuel_desseaux_815f9c463/end-to-end-observability-for-vllm-and-tgi-from-dcgm-to-tokens-4fbj</guid>
      <description>&lt;p&gt;Running large language model inference servers in production exposes gaps that neither stock Prometheus dashboards nor the official documentation of vLLM or TGI cover completely. This article maps the layers that matter, names the exact signals to scrape and flags the traps most teams only hit after real traffic arrives.&lt;/p&gt;

&lt;p&gt;Audience: SREs, ML platform engineers and observability engineers who operate or are about to operate vLLM or TGI on GPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LLM serving breaks standard observability
&lt;/h2&gt;

&lt;p&gt;A model server is not a regular web service. Four properties invalidate the usual playbook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency is not scalar.&lt;/strong&gt; Time to first token (TTFT), inter-token latency (ITL) and end-to-end latency tell three different stories. Optimizing one usually degrades another. Prefill-bound workloads (long prompts, short outputs) and decode-bound workloads (chat, agents, RAG) have inverse profiles. A single p99 number is meaningless without saying which latency it refers to and what input distribution produced it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batching is dynamic and preemptive.&lt;/strong&gt; Continuous batching schedules in-flight requests into the same forward pass. Throughput rises with batch size up to a point where KV cache pressure forces evictions or swaps. Standard "queue depth" metrics still apply, but the relationship between queue depth and tail latency is non-linear and bursty. A queue that looks shallow for ninety seconds and explodes for ten is more useful to detect than a steady moderate queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The KV cache is the real bottleneck.&lt;/strong&gt; It lives in VRAM, grows with sequence length and dominates memory pressure. When it fills, vLLM preempts or swaps requests. TGI rejects new arrivals. Neither outcome is visible from CPU or network metrics. The KV cache is the single most informative signal on the engine layer, and it has no equivalent in a stateless web service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware reaches into the application.&lt;/strong&gt; A degraded NVLink, a thermal throttle or an NCCL all-reduce stall propagates directly to the request queue. The observability stack has to reach down to the silicon or it will produce dashboards that look fine while users wait.&lt;/p&gt;

&lt;p&gt;The right answer is a layered pipeline that correlates a token rendered to a user with what happened on the silicon a few milliseconds earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer map
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────┐
│ Business and cost (€/token, €/tenant, €/h GPU) │
├────────────────────────────────────────────────┤
│ API and distributed tracing (OTel GenAI)       │
├────────────────────────────────────────────────┤
│ Inference engine (vLLM, TGI: Prometheus)       │
├────────────────────────────────────────────────┤
│ Container and OS (cAdvisor, kubelet, eBPF)     │
├────────────────────────────────────────────────┤
│ CUDA runtime and collectives (NCCL, CUPTI)     │
├────────────────────────────────────────────────┤
│ GPU silicon (DCGM exporter, NVLink, PCIe)      │
└────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer has its own native signals. The value of an end-to-end stack comes from the ability to cross-reference them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer by layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GPU silicon
&lt;/h3&gt;

&lt;p&gt;DCGM exporter is the right entry point. The signals worth wiring up from day one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DCGM metric&lt;/th&gt;
&lt;th&gt;What it actually says&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_DEV_GPU_UTIL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Coarse indicator. It reports the share of time at least one kernel was running, so it reaches 100 % even when kernels occupy only a fraction of the SMs. Do not use alone.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_PROF_SM_ACTIVE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fraction of cycles where at least one warp is active on an SM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_PROF_SM_OCCUPANCY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Average warps active per SM normalized to the maximum.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_PROF_PIPE_TENSOR_ACTIVE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fraction of cycles the tensor cores are working. The real utilization signal for LLM inference.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;DCGM_FI_PROF_PIPE_FP16_ACTIVE&lt;/code&gt;, &lt;code&gt;_FP32_ACTIVE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Pipeline activity by precision. Useful to spot fallbacks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_PROF_DRAM_ACTIVE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HBM traffic. Identifies memory-bound workloads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;DCGM_FI_DEV_FB_USED&lt;/code&gt;, &lt;code&gt;_FB_FREE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;VRAM in use and free. Cross with &lt;code&gt;vllm:gpu_cache_usage_perc&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;DCGM_FI_PROF_NVLINK_RX_BYTES&lt;/code&gt;, &lt;code&gt;_TX_BYTES&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Inter-GPU traffic. Essential under tensor parallelism.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;DCGM_FI_PROF_PCIE_RX_BYTES&lt;/code&gt;, &lt;code&gt;_TX_BYTES&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;GPU to host traffic. Surfaces pressure during model loading and CPU paging.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;DCGM_FI_DEV_POWER_USAGE&lt;/code&gt;, &lt;code&gt;_GPU_TEMP&lt;/code&gt;, &lt;code&gt;_MEMORY_TEMP&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Power and thermal. Throttling shows up here before it shows up in user latency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;DCGM_FI_DEV_SM_CLOCK&lt;/code&gt;, &lt;code&gt;_MEM_CLOCK&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Effective clocks. A persistent drop is the first sign of thermal throttling.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DCGM exporter ships as a Helm chart and runs as a DaemonSet on GPU nodes. Set the collection interval explicitly rather than relying on the default (30 seconds in upstream dcgm-exporter at the time of writing). One second is fine for steady-state dashboards but still too coarse to catch sub-second incidents like an eviction storm. Two profiles in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;steady&lt;/code&gt;: 5 seconds, full field set.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;incident&lt;/code&gt;: 250 ms, reduced field set, enabled on alert.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few hardware notes that change what you should monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MIG&lt;/strong&gt; (Multi-Instance GPU). When MIG slices are active, DCGM exposes per-slice metrics under the same field IDs with a different device label. Pin labels in your relabel config or you will see metrics merge or vanish across reschedules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVSwitch&lt;/strong&gt; (DGX, HGX). Add the NVSwitch exporter alongside DCGM. NVLink saturation at the switch is invisible from the per-GPU NVLink counters alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InfiniBand&lt;/strong&gt;. Use the Mellanox &lt;code&gt;ibutils&lt;/code&gt; exporter or &lt;code&gt;ucx&lt;/code&gt; counters. RDMA traffic for distributed inference does not appear in the GPU metrics path.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CUDA runtime and collectives
&lt;/h3&gt;

&lt;p&gt;Tensor parallelism and pipeline parallelism rely on NCCL. When one GPU waits for its peers, application latency shows anomalies with no visible CPU or network cause.&lt;/p&gt;

&lt;p&gt;Sources worth wiring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NCCL_DEBUG=WARN&lt;/code&gt; in production with parseable output, ingested as structured logs. &lt;code&gt;INFO&lt;/code&gt; is too verbose and has a non-trivial overhead.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nvidia-nccl-exporter&lt;/code&gt; where the version supports your CUDA stack.&lt;/li&gt;
&lt;li&gt;CUPTI for kernel-level and collective-level tracing. Enable it on demand only: the overhead is measurable and biases what you are trying to observe.&lt;/li&gt;
&lt;li&gt;On InfiniBand fabric, export UCX counters and SHARP statistics. NCCL alone does not surface fabric congestion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Collective patterns to remember when reading dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All-reduce&lt;/strong&gt; dominates tensor-parallel matmul splits. Saturated NVLink with idle SMs means you are bandwidth-bound on the collective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All-gather&lt;/strong&gt; appears in some attention implementations and in pipeline-parallel weight gathering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send/recv&lt;/strong&gt; dominates pipeline parallelism. Imbalance between stages shows up as one GPU with low SM activity and a long send wait.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These traces are not meant to be on all the time. Continuous lightweight counters with on-demand deep tracing is the pattern that scales.&lt;/p&gt;

&lt;h3&gt;
  
  
  Container and OS
&lt;/h3&gt;

&lt;p&gt;Platform layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cAdvisor and kubelet for pod CPU, RAM and IO.&lt;/li&gt;
&lt;li&gt;kube-state-metrics for Pod state, OOM events and restarts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kube_pod_info&lt;/code&gt; joined to GPU identity (&lt;code&gt;nvidia.com/gpu&lt;/code&gt; device id) to map pod to physical GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kernel layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF via Tetragon, bpftrace or Pixie for syscalls, unexpected network egress and model file reads.&lt;/li&gt;
&lt;li&gt;On-CPU profiling via parca or pyroscope without instrumenting the binary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;eBPF is also where the security observability lives. A minimal Tetragon policy that watches model file reads and unexpected egress on the inference pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cilium.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TracingPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-runtime-watch&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
  &lt;span class="na"&gt;kprobes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security_file_open"&lt;/span&gt;
      &lt;span class="na"&gt;syscall&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file"&lt;/span&gt;
      &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prefix"&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/models/"&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/root/.cache/huggingface/"&lt;/span&gt;
          &lt;span class="na"&gt;matchActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tcp_connect"&lt;/span&gt;
      &lt;span class="na"&gt;syscall&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sock"&lt;/span&gt;
      &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
              &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NotDAddr"&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8"&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.0/8"&lt;/span&gt;
          &lt;span class="na"&gt;matchActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



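&lt;p&gt;This is a starter: it logs every open of a file under the model paths and every outbound TCP connection from vLLM pods to a destination outside &lt;code&gt;10.0.0.0/8&lt;/code&gt; and loopback. If your cluster also uses &lt;code&gt;172.16.0.0/12&lt;/code&gt; or &lt;code&gt;192.168.0.0/16&lt;/code&gt;, add those ranges to the exclusion list, otherwise internal traffic will be reported as egress. Convert to alerts only after a quiet-period baseline.&lt;/p&gt;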

&lt;h3&gt;
  
  
  Inference engine
&lt;/h3&gt;

&lt;p&gt;This is the layer most teams neglect the longest, even though it is the densest in business signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vLLM&lt;/strong&gt; exposes &lt;code&gt;/metrics&lt;/code&gt; by default. The base set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Reading&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:time_to_first_token_seconds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;histogram&lt;/td&gt;
&lt;td&gt;Server-side TTFT. Compare to gateway TTFT.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:time_per_output_token_seconds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;histogram&lt;/td&gt;
&lt;td&gt;ITL. What the user feels in streaming.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:e2e_request_latency_seconds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;histogram&lt;/td&gt;
&lt;td&gt;Server-side end-to-end latency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:num_requests_running&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;gauge&lt;/td&gt;
&lt;td&gt;Requests in the active batch.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:num_requests_waiting&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;gauge&lt;/td&gt;
&lt;td&gt;Queue depth. First saturation indicator.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:num_requests_swapped&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;gauge&lt;/td&gt;
&lt;td&gt;Requests paged to CPU. VRAM pressure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:gpu_cache_usage_perc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;gauge&lt;/td&gt;
&lt;td&gt;KV cache occupation. At 1.0 with &lt;code&gt;swapped &amp;gt; 0&lt;/code&gt;, you are in eviction territory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:num_preemptions_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;counter&lt;/td&gt;
&lt;td&gt;Cumulative preemptions. Take the per-second rate.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:prompt_tokens_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;counter&lt;/td&gt;
&lt;td&gt;Input tokens processed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vllm:generation_tokens_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;counter&lt;/td&gt;
&lt;td&gt;Generated tokens. Cost calculation base.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Recent vLLM versions also expose prefix caching and speculative decoding metrics. The exact names depend on the version, but the families to look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vllm:gpu_prefix_cache_hits_total&lt;/code&gt;, &lt;code&gt;vllm:gpu_prefix_cache_queries_total&lt;/code&gt;. Hit rate dominates the gain from prefix caching in agent and RAG workloads.&lt;/li&gt;
&lt;li&gt;Speculative decoding counters that let you derive the acceptance rate of the draft model. If acceptance falls below the break-even point against the draft model overhead, spec decode is costing you throughput.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TGI&lt;/strong&gt; exposes &lt;code&gt;/metrics&lt;/code&gt; with a different naming convention:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Reading&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tgi_batch_current_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Active batch size.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tgi_batch_next_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Next batch being formed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tgi_queue_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Queue depth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tgi_request_queue_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time in queue.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tgi_request_inference_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Engine time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tgi_batch_inference_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-batch latency, decomposable into forward and decode.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;tgi_request_input_length&lt;/code&gt;, &lt;code&gt;tgi_request_generated_tokens&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Token counters per request.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

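&lt;p&gt;Both engines emit Prometheus histograms as cumulative &lt;code&gt;le&lt;/code&gt; buckets. Quantiles are computed at query time (&lt;code&gt;histogram_quantile&lt;/code&gt; in PromQL or MetricsQL), so their accuracy depends on how well the bucket boundaries match your latency range.&lt;/p&gt;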

&lt;p&gt;A practical reading habit: never look at a single engine metric in isolation. The useful patterns are paired.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;vllm:num_requests_waiting&lt;/code&gt; rising with &lt;code&gt;vllm:gpu_cache_usage_perc&lt;/code&gt; at 1.0 and &lt;code&gt;vllm:num_preemptions_total&lt;/code&gt; rate &amp;gt; 0: you are in cache thrash. Reduce &lt;code&gt;max_num_seqs&lt;/code&gt; or &lt;code&gt;max_num_batched_tokens&lt;/code&gt;, or give the KV cache more room by raising &lt;code&gt;gpu_memory_utilization&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vllm:num_requests_waiting&lt;/code&gt; rising with healthy cache: you are compute-bound. Add capacity or reduce &lt;code&gt;max_num_batched_tokens&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tgi_queue_size&lt;/code&gt; high with &lt;code&gt;tgi_batch_current_size&lt;/code&gt; plateauing below maximum: scheduler is starving on token budget. Inspect &lt;code&gt;max_batch_total_tokens&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
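
&lt;p&gt;The paired readings above can be written down as a small decision function. This is a minimal sketch, not part of vLLM: the function name &lt;code&gt;classify_vllm_saturation&lt;/code&gt;, its arguments and the 0.95 cache threshold are assumptions chosen for illustration. The inputs are expected to come from your own queries, for example the change in the waiting gauge over a window and the per-second rate of the preemption counter.&lt;/p&gt;

```python
def classify_vllm_saturation(waiting_delta, cache_usage, preemption_rate,
                             cache_full_threshold=0.95):
    """Map a paired metric snapshot to one of the readings described above.

    waiting_delta   -- change in vllm:num_requests_waiting over the window
    cache_usage     -- vllm:gpu_cache_usage_perc, between 0.0 and 1.0
    preemption_rate -- per-second rate of vllm:num_preemptions_total
    The 0.95 threshold is an illustrative assumption, not a vLLM default.
    """
    queue_growing = waiting_delta > 0
    cache_full = cache_usage >= cache_full_threshold
    if not queue_growing:
        return "queue_stable"
    if cache_full and preemption_rate > 0:
        return "cache_thrash"
    if not cache_full:
        return "compute_bound"
    # Queue grows and the cache is full, but nothing is preempted yet.
    return "cache_full_no_preemption"
```

&lt;p&gt;The point of the sketch is that no branch looks at a single metric: every verdict needs the queue trend plus at least one cache signal.&lt;/p&gt;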

&lt;h3&gt;
  
  
  API and distributed tracing
&lt;/h3&gt;

&lt;p&gt;Tracing answers "where did my request spend its time" independently of aggregate metrics.&lt;/p&gt;

&lt;p&gt;Adopt OpenTelemetry with the GenAI semantic conventions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.system&lt;/code&gt; (for example &lt;code&gt;vllm&lt;/code&gt;, &lt;code&gt;tgi&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.operation.name&lt;/code&gt; (&lt;code&gt;chat&lt;/code&gt;, &lt;code&gt;text_completion&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.request.model&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.request.max_tokens&lt;/code&gt;, &lt;code&gt;gen_ai.request.temperature&lt;/code&gt;, &lt;code&gt;gen_ai.request.top_p&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;, &lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gen_ai.response.finish_reasons&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful span breakdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http.server.request
└── gen_ai.completion
    ├── tokenize
    ├── schedule
    ├── prefill
    ├── decode  (loop, span per batch step)
    ├── detokenize
    └── stream_out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Available instrumentation libraries: OpenLIT, openllmetry (Traceloop) and OpenInference (Arize). Pick one and stick to it. Mixing them produces inconsistent attribute names that break dashboard queries.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;request_id&lt;/code&gt; propagated from the ingress through to the engine is the key that makes downstream correlation possible. Declare it at the ingress (header &lt;code&gt;x-request-id&lt;/code&gt;), propagate it through OTel baggage, log it on the engine side and attach it as a trace attribute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus exemplars&lt;/strong&gt; are worth the configuration cost. They link a histogram bucket to one or more traces, so a click on a latency p99 spike in Grafana jumps directly to the slowest traces. vLLM does not expose exemplars natively today, but the OTel collector can derive duration histograms from the spans it receives via the &lt;code&gt;spanmetrics&lt;/code&gt; connector and attach trace IDs to them as exemplars. The exemplars live on these span-derived histograms, not on the histograms scraped from the engine's &lt;code&gt;/metrics&lt;/code&gt; endpoint. Sample collector snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;connectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;spanmetrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;histogram&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;explicit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;10ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;50ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;100ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;250ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;500ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;1s&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2s&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;5s&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gen_ai.system&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant&lt;/span&gt;
    &lt;span class="na"&gt;exemplars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you metric-to-trace navigation without changing the engine code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs
&lt;/h3&gt;

&lt;p&gt;Structured, JSON. VictoriaLogs handles the volume without forcing a complex query syntax.&lt;/p&gt;

&lt;p&gt;Minimum fields for the inference layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;request_id&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tenant&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prompt_tokens&lt;/code&gt;, &lt;code&gt;generation_tokens&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ttft_ms&lt;/code&gt;, &lt;code&gt;e2e_ms&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;finish_reason&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpu_id&lt;/code&gt; (resolved at pod level),&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;span_id&lt;/code&gt; (for cross-reference with traces).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not log prompts and outputs by default. If you need to, allocate a separate channel with short retention and active PII filtering. The legal exposure of an unfiltered prompt log dwarfs any operational benefit.&lt;/p&gt;
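
&lt;p&gt;The minimum field set and the no-prompt rule above can be enforced at the point where the log line is built. The following is a minimal sketch using only the standard library; the helper name &lt;code&gt;inference_log_line&lt;/code&gt; and the rejected key names &lt;code&gt;prompt&lt;/code&gt; and &lt;code&gt;output&lt;/code&gt; are assumptions for illustration, not something vLLM or TGI provide.&lt;/p&gt;

```python
import json

# Minimum fields for the inference layer, as listed above.
REQUIRED_FIELDS = {
    "request_id", "tenant", "model", "prompt_tokens",
    "generation_tokens", "ttft_ms", "e2e_ms", "finish_reason",
    "gpu_id", "trace_id", "span_id",
}


def inference_log_line(**fields):
    """Serialize one inference log record as a single JSON line.

    Raises ValueError if a minimum field is missing or if a key named
    "prompt" or "output" is passed (hypothetical key names).
    """
    missing = REQUIRED_FIELDS - set(fields)
    if missing:
        raise ValueError(f"missing log fields: {sorted(missing)}")
    if "prompt" in fields or "output" in fields:
        raise ValueError("prompt and output text must not be logged here")
    return json.dumps(fields, sort_keys=True)
```

&lt;p&gt;Failing loudly on a missing &lt;code&gt;trace_id&lt;/code&gt; or &lt;code&gt;gpu_id&lt;/code&gt; is deliberate: a log line that cannot be joined to traces or hardware metrics is of little use during an incident.&lt;/p&gt;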

&lt;h3&gt;
  
  
  Business and cost
&lt;/h3&gt;

&lt;p&gt;The only layer that talks to leadership. From the native counters you derive three indicators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost per request, per tenant, per model.&lt;/strong&gt; The denominator changes the answer, so surface all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hourly cost of a GPU normalized by tokens produced in the same window.&lt;/strong&gt; This is the closest thing to a useful efficiency metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful tokens over billed tokens.&lt;/strong&gt; A measure of batching efficiency: how many tokens you deliver for each token's worth of GPU compute time you pay for.&lt;/p&gt;

&lt;p&gt;Cost per tenant, in PromQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (tenant) (
  rate(vllm:generation_tokens_total{tenant=~".+"}[5m])
)
* on(model) group_left
  cost_per_generation_token_eur
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;cost_per_generation_token_eur&lt;/code&gt; is a reference series, labeled by &lt;code&gt;model&lt;/code&gt;, pushed by a configuration job. The result is a spend rate in euros per second; multiply by the window length to get a cost. Maintain prompt and generation rates separately: most providers price them differently, and they have different production costs (prefill is a single forward pass, decode is autoregressive).&lt;/p&gt;

&lt;p&gt;A useful refinement is to include idle cost. A GPU running at 30 % utilization still costs the full hourly rate. The "effective cost per token" should distribute the full GPU hour over the tokens actually produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gpu_hourly_cost_eur)
/
(sum by (gpu) (rate(vllm:generation_tokens_total[1h])) * 3600)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the number that drives capacity decisions, not the marginal cost per token.&lt;/p&gt;
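
&lt;p&gt;A worked example makes the idle-cost effect concrete. The sketch below mirrors the formula above in plain Python. The hourly price (2.50 EUR) and the two throughput figures are invented inputs, and it assumes throughput scales with utilization, which is a simplification.&lt;/p&gt;

```python
def effective_cost_per_token(gpu_hourly_cost_eur, tokens_per_second):
    """Spread the full GPU hour over the tokens actually produced in it."""
    tokens_per_hour = tokens_per_second * 3600
    if tokens_per_hour == 0:
        return float("inf")  # an idle GPU has no finite cost per token
    return gpu_hourly_cost_eur / tokens_per_hour


# Illustrative inputs only: one 2.50 EUR/h GPU at two load levels.
busy = effective_cost_per_token(2.50, 1000)
idle = effective_cost_per_token(2.50, 300)  # 30 % of the busy rate
print(f"busy: {busy * 1e6:.3f} EUR per million tokens")
print(f"idle: {idle * 1e6:.3f} EUR per million tokens")
print(f"ratio: {idle / busy:.2f}")
```

&lt;p&gt;Whatever the hourly price, a GPU producing 30 % of the tokens costs 1 / 0.3, roughly 3.3 times, more per token, because the numerator does not shrink with load.&lt;/p&gt;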

&lt;h2&gt;
  
  
  The hard problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cross-layer correlation
&lt;/h3&gt;

&lt;p&gt;Linking a rendered token to a physical GPU is trivial in theory and hard in practice. The concrete plumbing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;request_id&lt;/code&gt; propagated from ingress through engine spans.&lt;/li&gt;
&lt;li&gt;Engine-side spans carry &lt;code&gt;gpu_id&lt;/code&gt; as an attribute.&lt;/li&gt;
&lt;li&gt;Metric series carry &lt;code&gt;pod&lt;/code&gt; and &lt;code&gt;gpu_uuid&lt;/code&gt; labels, joined via &lt;code&gt;kube_pod_info&lt;/code&gt; to a &lt;code&gt;pod&lt;/code&gt;-to-&lt;code&gt;gpu_uuid&lt;/code&gt; mapping (DCGM exposes &lt;code&gt;UUID&lt;/code&gt; and &lt;code&gt;device&lt;/code&gt; labels).&lt;/li&gt;
&lt;li&gt;Dashboards join temporally on time windows and spatially on &lt;code&gt;gpu_uuid&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DCGM samples per GPU, not per request. Fine-grained correlation is always done by time window, never by exact identifier. The illusion of per-request hardware metrics is exactly that, an illusion.&lt;/p&gt;
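
&lt;p&gt;The time-window join described above can be sketched in a few lines. The types &lt;code&gt;GpuSample&lt;/code&gt; and &lt;code&gt;RequestSpan&lt;/code&gt;, the function &lt;code&gt;samples_for_span&lt;/code&gt; and the one-second padding are all hypothetical names and values for illustration; in practice the same join is usually done in the dashboard or query layer rather than in application code.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_uuid: str
    ts: float          # sample timestamp, in seconds
    sm_active: float   # for example DCGM_FI_PROF_SM_ACTIVE


@dataclass
class RequestSpan:
    request_id: str
    gpu_uuid: str
    start: float
    end: float


def samples_for_span(span, samples, pad=1.0):
    """Return GPU samples on the span's GPU within its padded time window.

    The padding widens the window so that a span shorter than the scrape
    interval still picks up the nearest samples.
    """
    lo, hi = span.start - pad, span.end + pad
    return [s for s in samples
            if s.gpu_uuid == span.gpu_uuid and hi >= s.ts >= lo]
```

&lt;p&gt;Note what the function returns: every sample that overlaps the window, shared with all other requests running on that GPU at the time. It narrows the search, it does not attribute hardware activity to one request.&lt;/p&gt;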

&lt;h3&gt;
  
  
  Cardinality
&lt;/h3&gt;

&lt;p&gt;Labeling by &lt;code&gt;tenant&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; is healthy. Labeling by &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;session_id&lt;/code&gt; or &lt;code&gt;request_id&lt;/code&gt; on metrics is forbidden. Those dimensions belong to traces and logs.&lt;/p&gt;

&lt;p&gt;VictoriaMetrics absorbs moderate cardinality well, especially with &lt;code&gt;vmagent&lt;/code&gt; stream aggregation pre-rolling histograms. But multi-tenant inference explodes fast. Run the math at design time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tenants × models × quantiles × histogram_buckets × instances
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten tenants, five models, six quantiles, ten buckets, fifty instances gives 150 000 series for one histogram metric alone. Add three histograms (TTFT, ITL, e2e) and you are at half a million series before counters and gauges. Plan accordingly or use stream aggregation to drop unused dimensions before storage.&lt;/p&gt;
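
&lt;p&gt;The design-time math is easy to script. The sketch below counts stored series only, under the assumption that quantiles are derived at query time with &lt;code&gt;histogram_quantile&lt;/code&gt; and therefore add no series of their own. It also assumes the bucket count includes the &lt;code&gt;+Inf&lt;/code&gt; bucket and adds the &lt;code&gt;_sum&lt;/code&gt; and &lt;code&gt;_count&lt;/code&gt; series that every Prometheus histogram carries. The function name and the input values are illustrative.&lt;/p&gt;

```python
def histogram_series(tenants, models, buckets, instances):
    """Stored series for one Prometheus histogram.

    Each label combination stores one series per bucket plus one _sum
    and one _count series.
    """
    label_combinations = tenants * models * instances
    return label_combinations * (buckets + 2)


per_histogram = histogram_series(tenants=10, models=5, buckets=10, instances=50)
total = 3 * per_histogram  # TTFT, ITL and e2e
print(per_histogram, total)
```

&lt;p&gt;Rerun it with your real bucket count before committing to a label scheme: engine histograms often ship with more than ten buckets, and the total scales linearly with each of them.&lt;/p&gt;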

&lt;h3&gt;
  
  
  Sampling
&lt;/h3&gt;

&lt;p&gt;Three rhythms coexist: DCGM at 1 s, vLLM at 10 s, traces sometimes at 1 in 100. For brief incidents (preemption bursts, KV eviction storms), prepare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OTel collector with tail-based sampling, rule "if error or slow then keep",&lt;/li&gt;
&lt;li&gt;DCGM in incident mode at 250 ms, switched on by an alert webhook,&lt;/li&gt;
&lt;li&gt;eBPF in continuous collection on critical syscalls (no sampling, the overhead is minimal),&lt;/li&gt;
&lt;li&gt;vLLM kept at 10 s, no faster path exists without patching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A tail-based sampling policy that works in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
      &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slow_ttft&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency&lt;/span&gt;
      &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;threshold_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;2000&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high_value_tenant&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string_attribute&lt;/span&gt;
      &lt;span class="na"&gt;string_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant&lt;/span&gt;
        &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enterprise_a&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;enterprise_b&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;baseline&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;probabilistic&lt;/span&gt;
      &lt;span class="na"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



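&lt;p&gt;This keeps every errored trace, every trace that lasts longer than 2 s end to end, every trace from high-value tenants and a 5 % baseline of normal traffic. The &lt;code&gt;latency&lt;/code&gt; policy evaluates whole-trace duration, so &lt;code&gt;slow_ttft&lt;/code&gt; is a proxy for slow TTFT rather than a direct measure of it.&lt;/p&gt;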
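
&lt;p&gt;To reason about what such a policy set keeps, it can help to model the decision in a few lines of Python. This is a sketch of the intent only: the function name &lt;code&gt;sampling_decision&lt;/code&gt;, the CRC32-based baseline and the first-match ordering are assumptions for illustration and do not reproduce the collector's internal implementation.&lt;/p&gt;

```python
import zlib

HIGH_VALUE_TENANTS = {"enterprise_a", "enterprise_b"}

def sampling_decision(trace_id, has_error, duration_ms, tenant, baseline_percent=5):
    """Return the name of the first rule that keeps the trace, or None if it is dropped."""
    if has_error:
        return "errors"
    if duration_ms > 2000:
        return "slow_ttft"
    if tenant in HIGH_VALUE_TENANTS:
        return "high_value_tenant"
    # Hash the trace id into 100 buckets so the same trace always gets the same verdict.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 100
    if baseline_percent > bucket:
        return "baseline"
    return None

print(sampling_decision("trace-1", True, 120, "free_tier"))       # errors
print(sampling_decision("trace-2", False, 3400, "free_tier"))     # slow_ttft
print(sampling_decision("trace-3", False, 300, "enterprise_a"))   # high_value_tenant
```

&lt;p&gt;A fast, successful trace from an ordinary tenant reaches only the baseline rule, so roughly 5 % of that traffic is retained.&lt;/p&gt;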

&lt;h3&gt;
  
  
  Time origin
&lt;/h3&gt;

&lt;p&gt;Server-side TTFT is not what the user feels. Streaming, proxy buffering, HTTP buffer flushes and WAN traversal all change the perceived value. Also measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gateway-side TTFT (Envoy &lt;code&gt;upstream_rq_time&lt;/code&gt; or equivalent),&lt;/li&gt;
&lt;li&gt;client-side TTFT where possible (SDK instrumentation).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, you optimize a number that does not reflect the experience. The gap between engine TTFT and gateway TTFT is also a useful health signal in itself: a sudden divergence usually means a proxy buffering regression.&lt;/p&gt;
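
&lt;p&gt;One way to watch that gap is to pair engine and gateway measurements for the same requests and compare the median difference with a known-good baseline. The sketch below assumes the pairs have already been matched (for example by &lt;code&gt;request_id&lt;/code&gt;); the function names, the sample values and the factor of 3 are hypothetical.&lt;/p&gt;

```python
from statistics import median

def ttft_gap(engine_ttft_s, gateway_ttft_s):
    """Median of (gateway TTFT - engine TTFT) over paired requests, in seconds."""
    gaps = [gateway - engine for engine, gateway in zip(engine_ttft_s, gateway_ttft_s)]
    return median(gaps)

def gap_diverged(current_gap_s, baseline_gap_s, factor=3.0):
    """Flag a divergence when the current gap exceeds `factor` times the baseline."""
    return current_gap_s > factor * baseline_gap_s

# Hypothetical paired measurements for the same four requests.
engine = [0.20, 0.25, 0.22, 0.30]
gateway_before = [0.23, 0.28, 0.26, 0.33]
gateway_after = [0.60, 0.72, 0.65, 0.80]

baseline = ttft_gap(engine, gateway_before)
current = ttft_gap(engine, gateway_after)
print(gap_diverged(current, baseline))  # True
```

&lt;p&gt;Engine TTFT is identical in both runs, so the divergence points at something between the engine and the gateway rather than at the model server.&lt;/p&gt;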

&lt;h3&gt;
  
  
  SLO design for LLM serving
&lt;/h3&gt;

&lt;p&gt;Standard SRE SLO patterns need adjustment for LLM serving. A defensible starting set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SLO&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TTFT availability&lt;/td&gt;
&lt;td&gt;p95 TTFT below threshold over rolling window&lt;/td&gt;
&lt;td&gt;Streaming UX collapses without it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ITL stability&lt;/td&gt;
&lt;td&gt;p95 ITL below threshold&lt;/td&gt;
&lt;td&gt;Decode stalls feel worse than a long initial wait.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completion success&lt;/td&gt;
&lt;td&gt;success rate of requests that produce at least one token&lt;/td&gt;
&lt;td&gt;Hard failure metric.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming completeness&lt;/td&gt;
&lt;td&gt;percentage of streams that emit &lt;code&gt;finish_reason=stop&lt;/code&gt; (not &lt;code&gt;length&lt;/code&gt;, not &lt;code&gt;error&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Quality proxy.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity headroom&lt;/td&gt;
&lt;td&gt;p95 queue depth below a threshold&lt;/td&gt;
&lt;td&gt;Forward-looking, drives autoscaling.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thresholds depend on the model and workload. Chat: TTFT p95 under 1 s, ITL p95 under 80 ms. RAG: TTFT p95 under 3 s, ITL p95 under 50 ms (long outputs amplify ITL). Code completion: TTFT p95 under 500 ms, ITL p95 under 30 ms.&lt;/p&gt;

&lt;p&gt;Express them as multi-window multi-burn-rate alerts on the underlying SLI series, not as single-threshold alerts. The Google SRE workbook formulas apply unchanged.&lt;/p&gt;
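
&lt;p&gt;For readers who have not used burn rates, the sketch below shows the calculation: the observed error ratio is divided by the error ratio the SLO allows, and a page fires only when both a long and a short window exceed the threshold. The 14.4 factor is the SRE workbook's fast-burn value for a 1 h / 5 m window pair on a 30-day SLO; the function names and the event counts are assumptions for illustration.&lt;/p&gt;

```python
def burn_rate(good, total, slo_target):
    """Error-budget burn rate: observed error ratio divided by the allowed error ratio."""
    if total == 0:
        return 0.0
    error_ratio = 1.0 - good / total
    return error_ratio / (1.0 - slo_target)

def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn the budget faster than the threshold.

    Each window is a (good_events, total_events) pair.
    """
    return (burn_rate(*long_window, slo_target) > threshold
            and burn_rate(*short_window, slo_target) > threshold)

# Hypothetical counts of requests with TTFT under 1 s, against a 95 % target.
print(should_page((200, 1000), (10, 100), 0.95))   # True: about 16x and 18x
print(should_page((900, 1000), (10, 100), 0.95))   # False: the 1 h window burns at about 2x
```

&lt;p&gt;With a 95 % target the burn rate can never exceed 20, so a 14.4 threshold only trips when roughly three quarters of requests miss the TTFT bound. Tighter targets leave more room for graded thresholds.&lt;/p&gt;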

&lt;h2&gt;
  
  
  Reference stack
&lt;/h2&gt;

&lt;p&gt;Components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Alternative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;VictoriaMetrics cluster with vmagent&lt;/td&gt;
&lt;td&gt;Prometheus with Thanos or Mimir&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;VictoriaLogs&lt;/td&gt;
&lt;td&gt;Loki, OpenSearch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traces&lt;/td&gt;
&lt;td&gt;Tempo, Jaeger&lt;/td&gt;
&lt;td&gt;SaaS (Honeycomb, Datadog)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application collection&lt;/td&gt;
&lt;td&gt;OTel collector (agent and gateway)&lt;/td&gt;
&lt;td&gt;Vector, Fluent Bit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU collection&lt;/td&gt;
&lt;td&gt;DCGM exporter (DaemonSet)&lt;/td&gt;
&lt;td&gt;nvidia_gpu_exporter (legacy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eBPF&lt;/td&gt;
&lt;td&gt;Tetragon, Pixie&lt;/td&gt;
&lt;td&gt;Falco&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visualization&lt;/td&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;Perses&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  OTel collector pipeline (agent)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:8000'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm&lt;/span&gt;
          &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
          &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
  &lt;span class="na"&gt;k8sattributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;auth_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serviceAccount&lt;/span&gt;
    &lt;span class="na"&gt;extract&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;k8s.pod.name&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;k8s.namespace.name&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;k8s.node.name&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tag_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
          &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tag_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant&lt;/span&gt;
          &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod&lt;/span&gt;
  &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment.environment&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
        &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slow&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency&lt;/span&gt;
        &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;threshold_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;2000&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;baseline&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;probabilistic&lt;/span&gt;
        &lt;span class="na"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="na"&gt;connectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;spanmetrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;histogram&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;explicit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;10ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;50ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;100ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;250ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;500ms&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;1s&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2s&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;5s&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;10s&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gen_ai.system&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant&lt;/span&gt;
    &lt;span class="na"&gt;exemplars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheusremotewrite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vmagent.observability.svc:8429/api/v1/write&lt;/span&gt;
    &lt;span class="na"&gt;external_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-eu-west&lt;/span&gt;
  &lt;span class="na"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo.observability.svc:4317&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;spanmetrics&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;k8sattributes&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheusremotewrite&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;k8sattributes&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;spanmetrics&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



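&lt;p&gt;The &lt;code&gt;spanmetrics&lt;/code&gt; connector turns traces into low-cardinality histograms with exemplars, giving you click-through from metrics to traces without changing engine code. Because it is wired as an exporter of the traces pipeline, it sees only the spans that survive &lt;code&gt;tail_sampling&lt;/code&gt;, so its histograms lean toward errors and slow requests. Feed it from an unsampled traces pipeline if you need representative latency distributions.&lt;/p&gt;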

&lt;h3&gt;
  
  
  Useful starter queries
&lt;/h3&gt;

&lt;p&gt;TTFT p99 by model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.99,
  sum by (model, le) (
    rate(vllm:time_to_first_token_seconds_bucket[5m])
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
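
&lt;p&gt;The value returned by &lt;code&gt;histogram_quantile&lt;/code&gt; is an estimate interpolated inside a bucket, so its precision depends on where the bucket boundaries sit. The Python sketch below is a simplified model of that interpolation for classic cumulative buckets, not Prometheus's own code; the bucket counts are hypothetical.&lt;/p&gt;

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    q must be greater than 0 and at most 1.
    """
    if not (1 >= q > 0):
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Overflow bucket: the estimate is capped at the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper_bound - prev_bound) * fraction
        prev_bound, prev_count = upper_bound, count
    return prev_bound

# Hypothetical cumulative TTFT buckets, in seconds.
ttft_buckets = [(0.25, 40), (0.5, 70), (1.0, 90), (2.5, 98), (math.inf, 100)]

print(histogram_quantile(0.50, ttft_buckets))
print(histogram_quantile(0.95, ttft_buckets))
print(histogram_quantile(0.99, ttft_buckets))  # 2.5, the highest finite bound
```

&lt;p&gt;With these buckets the p99 can never exceed 2.5 s, however slow the tail really is, so the bucket boundaries need to bracket the SLO threshold you care about.&lt;/p&gt;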



&lt;p&gt;Preemptions per second, to overlay with cache occupation (&lt;code&gt;vllm:gpu_cache_usage_perc&lt;/code&gt;) on the same panel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(vllm:num_preemptions_total[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Effective tensor core utilization per GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg by (gpu) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



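&lt;p&gt;Tokens per GPU-second (efficiency), assuming the vLLM series have been enriched with a &lt;code&gt;gpu&lt;/code&gt; label through the pod-to-GPU mapping:&lt;br&gt;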
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (gpu) (rate(vllm:generation_tokens_total[5m]))
/
count by (gpu) (DCGM_FI_DEV_GPU_UTIL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Normalized TGI queue pressure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tgi_queue_size / on(instance) tgi_batch_current_size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost per hour per tenant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (tenant) (
  rate(vllm:generation_tokens_total[1h]) * 3600
  * on(model) group_left cost_per_generation_token_eur
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
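
&lt;p&gt;The same calculation in plain Python makes the order of operations explicit: the price is a per-model value, so it has to be applied while the model dimension is still present, and only then summed per tenant. Tenant names, model names, rates and prices below are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

def cost_per_hour_by_tenant(token_rates, price_per_token_eur):
    """Sum the hourly generation cost per tenant.

    `token_rates` maps (tenant, model) to generated tokens per second.
    `price_per_token_eur` maps model to a price per generated token.
    """
    totals = defaultdict(float)
    for (tenant, model), tokens_per_second in token_rates.items():
        # Price each (tenant, model) pair first, then aggregate by tenant.
        totals[tenant] += tokens_per_second * 3600 * price_per_token_eur[model]
    return dict(totals)

# Hypothetical generation rates (tokens per second) and prices (EUR per token).
rates = {
    ("acme", "small"): 200.0,
    ("acme", "large"): 50.0,
    ("globex", "large"): 100.0,
}
prices = {"small": 0.000001, "large": 0.00001}

print(cost_per_hour_by_tenant(rates, prices))
```

&lt;p&gt;Here the tenant running the large model alone costs more per hour than the tenant producing two and a half times as many tokens across both models, which a tenant-level token count alone would hide.&lt;/p&gt;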



&lt;h2&gt;
  
  
  Alerting that does not lie
&lt;/h2&gt;

&lt;p&gt;Alerts on inference servers should fire on user-visible degradation, not on resource thresholds. A working starter set:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT burn-rate (multi-window).&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLMTTFTBudgetFastBurn&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(&lt;/span&gt;
      &lt;span class="s"&gt;sum by (model) (rate(vllm:time_to_first_token_seconds_bucket{le="1.0"}[5m]))&lt;/span&gt;
      &lt;span class="s"&gt;/&lt;/span&gt;
      &lt;span class="s"&gt;sum by (model) (rate(vllm:time_to_first_token_seconds_count[5m]))&lt;/span&gt;
    &lt;span class="s"&gt;) &amp;lt; 0.95&lt;/span&gt;
    &lt;span class="s"&gt;and&lt;/span&gt;
    &lt;span class="s"&gt;(&lt;/span&gt;
      &lt;span class="s"&gt;sum by (model) (rate(vllm:time_to_first_token_seconds_bucket{le="1.0"}[1h]))&lt;/span&gt;
      &lt;span class="s"&gt;/&lt;/span&gt;
      &lt;span class="s"&gt;sum by (model) (rate(vllm:time_to_first_token_seconds_count[1h]))&lt;/span&gt;
    &lt;span class="s"&gt;) &amp;lt; 0.95&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;page&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache thrash detector.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLMCacheThrash&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;vllm:gpu_cache_usage_perc &amp;gt; 0.95&lt;/span&gt;
    &lt;span class="s"&gt;and&lt;/span&gt;
    &lt;span class="s"&gt;rate(vllm:num_preemptions_total[2m]) &amp;gt; 0.5&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tensor core idle under load.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GPUTensorIdleUnderLoad&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[10m]) &amp;lt; 0.2&lt;/span&gt;
    &lt;span class="s"&gt;and on (pod)&lt;/span&gt;
    &lt;span class="s"&gt;vllm:num_requests_running &amp;gt; 4&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This last alert catches the case where the engine reports work in flight but the tensor cores are idle. The usual cause is a stalled NCCL collective or a CPU-bound bottleneck before the GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming completion regression.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLMStreamingTruncations&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(&lt;/span&gt;
      &lt;span class="s"&gt;sum by (model) (rate(vllm:request_success_total{finished_reason="length"}[10m]))&lt;/span&gt;
      &lt;span class="s"&gt;/&lt;/span&gt;
      &lt;span class="s"&gt;sum by (model) (rate(vllm:request_success_total[10m]))&lt;/span&gt;
    &lt;span class="s"&gt;) &amp;gt; 0.1&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When more than 10 % of requests stop on &lt;code&gt;length&lt;/code&gt;, either &lt;code&gt;max_tokens&lt;/code&gt; is too low for the use case or quality has regressed.&lt;/p&gt;

&lt;p&gt;Avoid alerting directly on queue depth or GPU utilization. Both vary widely under healthy load. They are diagnostic, not actionable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-patterns
&lt;/h2&gt;

&lt;p&gt;To review every quarter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treating &lt;code&gt;DCGM_FI_DEV_GPU_UTIL&lt;/code&gt; as utilization. The right read is &lt;code&gt;DCGM_FI_PROF_PIPE_TENSOR_ACTIVE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Tuning batching against mean latency. Tail latency and queue depth tell the truth.&lt;/li&gt;
&lt;li&gt;Labeling metrics by &lt;code&gt;request_id&lt;/code&gt;. That belongs to traces.&lt;/li&gt;
&lt;li&gt;Measuring latency only at the engine. Add the gateway, add the client where possible.&lt;/li&gt;
&lt;li&gt;Capturing prompts and outputs in traces without an active PII filter.&lt;/li&gt;
&lt;li&gt;Counting "tokens" without separating prompt and generation. Pricing is asymmetric, batching capacity is asymmetric.&lt;/li&gt;
&lt;li&gt;Leaving CUPTI and &lt;code&gt;NCCL_DEBUG=INFO&lt;/code&gt; on in production. Measurable overhead, biased measurements.&lt;/li&gt;
&lt;li&gt;Sampling traces uniformly. Tail-based sampling with rules for errors, slow requests and high-value tenants catches more value at lower volume.&lt;/li&gt;
&lt;li&gt;Storing everything at maximum resolution. Cardinality cost explodes before retention cost.&lt;/li&gt;
&lt;li&gt;Building alerts on resource thresholds. Alert on user-visible SLOs, treat resource metrics as diagnostic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Maturity ladder
&lt;/h2&gt;

&lt;p&gt;Where teams typically stand and where to move next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 0: nothing specific.&lt;/strong&gt; Generic node and pod metrics. No idea how the engine is doing. Move to level 1 by scraping the engine's &lt;code&gt;/metrics&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: engine metrics only.&lt;/strong&gt; vLLM or TGI metrics scraped, basic dashboard. Sufficient for an initial deployment, blind to hardware-rooted issues. Move to level 2 by adding DCGM and pod-to-GPU mapping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: engine plus GPU correlated.&lt;/strong&gt; Most pragmatic teams stop here. Resolves 70 % of incidents in practice. Move to level 3 when multi-tenant pressure starts and when latency complaints exceed throughput complaints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: distributed tracing with GenAI semconv.&lt;/strong&gt; Per-request visibility, exemplar-driven debugging, tenant-aware SLOs. Required at scale. Move to level 4 for regulated workloads and HPC fabrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4: kernel and fabric depth.&lt;/strong&gt; eBPF policies in alerting paths, NCCL and InfiniBand observability, audit-grade logging with retention policies, confidential computing where applicable. Required for regulated industries, sovereign deployments and large-scale training-adjacent serving.&lt;/p&gt;

&lt;p&gt;Move one level at a time. Skipping levels produces dashboards no one trusts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;p&gt;Three topics deserve their own articles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KV cache observability: eviction, fragmentation, swap. Native metrics, stress experiments, mitigations.&lt;/li&gt;
&lt;li&gt;NCCL and tensor parallelism: observing inter-GPU flows and finding the collective that stalls the batch.&lt;/li&gt;
&lt;li&gt;Securing an inference server: attack surface, eBPF detection, sandboxing, AI Act audit trail.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The right implementation order in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inference engine metrics (vLLM, TGI native scrape).&lt;/li&gt;
&lt;li&gt;GPU metrics (DCGM exporter).&lt;/li&gt;
&lt;li&gt;Distributed tracing with OTel GenAI semconv.&lt;/li&gt;
&lt;li&gt;Structured logs with &lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;request_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Business and cost layer.&lt;/li&gt;
&lt;li&gt;eBPF policies for security and runtime observability.&lt;/li&gt;
&lt;li&gt;NCCL and CUPTI on demand for hard-to-reproduce issues.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Starting with layers 1 and 2 alone resolves most of the incidents observed in production. Everything above that compounds value once the base is solid.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Corrections and operational war stories welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>observability</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
