DEV Community: samuel desseaux

Observability Cost: The Collector is the lever you are not pulling

samuel desseaux — Mon, 22 Jun 2026 13:40:27 +0000

The observability bill has become a board-level worry and the numbers back it up. In the 2026 industry predictions, more than a third of teams expect to spend over a million dollars and a meaningful slice expects to cross five million. The common responses treat the symptoms. Teams shorten retention, cut sampling blindly or reopen the vendor contract. These levers exist but they all act too late in the chain. The real control point sits earlier, before storage and billing, inside the Collector.

The idea fits in one sentence. Observability cost is not driven by what you observe but by what you decide to keep, and that decision is made most effectively at the source.

Let's look at this in detail.

Why the bill drifts

Three mechanisms combine and the first one dwarfs the other two.

Cardinality is the cost driver that surprises people most. A metric is not billed per data point, it is billed per active time series, meaning per unique combination of labels. A latency metric carrying service, pod, instance, endpoint and status_code already produces thousands of series. Add one unbounded label, a user_id or a raw request_path and you jump from thousands to millions of series with no gain in readability. Cardinality grows multiplicatively rather than additively and it is what inflates storage and query compute.
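To make the multiplicative growth concrete, here is a minimal sketch. The label names and their cardinalities are hypothetical values chosen for illustration, not measurements from a real system.

```python
from math import prod

# Hypothetical number of distinct values per label on one latency metric.
labels = {"service": 20, "endpoint": 15, "status_code": 5}


def series_count(label_cardinalities):
    """Upper bound on active series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = series_count(labels)
# Adding one unbounded label multiplies the total, it does not add to it.
with_user_id = series_count({**labels, "user_id": 10_000})

print(f"bounded labels only: {bounded:,} series")
print(f"with user_id:        {with_user_id:,} series")
```

The product is an upper bound: real systems rarely emit every combination, but the order of magnitude is what drives the bill.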

Raw volume comes next. Traces kept at 100 percent, redundant application logs, health checks logged every second on every pod: the mass piles up and the transport and ingestion cost follows.

The reflex closes the list. The "keep everything, sort it out later" habit feels safe at instrumentation time and turns expensive six months on. Whatever is never queried is rarely deleted, so the telemetry debt grows quietly.

The Collector as a control point

A Collector pipeline, whether the OpenTelemetry Collector or vmagent on the metrics side, always follows the same shape: receivers on the way in, processors in the middle, exporters on the way out. The cost battle is won in the processors because that is the last place you control the data before it becomes a line on an invoice.

Working at the source rather than at the backend changes the nature of the problem. Filtering at the backend means you pay to ingest data that you then throw away. The same filtering at the Collector drops the data before transport, ingestion and storage. You stop paying for what you do not keep.

Four families of processors carry most of the gain:

Filtering, to drop signals with no operational value, typically health checks and debug logs in production.
Attribute transformation, to remove or recode high-cardinality labels before they create series.
Aggregation, to move from per-pod to per-service granularity on metrics you never query at the pod level.
Trace sampling, to keep only what serves diagnosis.

Cardinality, where the biggest win lives

Before you sample anything, deal with cardinality, because it is the lever with the best effort-to-gain ratio.

The base rule is simple. A label whose values are unbounded has no place on a metric. A request id, a user id or a non-normalized URL path belongs in a trace or a log, not in a time series. When you need to tie a metric back to a specific trace, use exemplars, which are built for exactly that.

Next comes aggregating away the dimensions you do not use. For each label, ask whether you actually queried it in the last quarter. Labels that never serve as a filter or a grouping in your queries are direct candidates for removal at the relabeling step. On the Prometheus and VictoriaMetrics side, vmagent stream aggregation pre-aggregates those series on the fly before write, which cuts the active series count without touching application code.
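The effect of aggregating away an unused dimension can be sketched in a few lines. This is a Python illustration of the idea, not vmagent configuration, and the sample data is made up. Summing is appropriate for counters and rates; gauges and histograms need an aggregation that fits their meaning.

```python
from collections import defaultdict


def aggregate_away(samples, drop):
    """Sum sample values after removing the labels listed in `drop`."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels identify the aggregated series.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)


# Hypothetical per-pod request rates.
samples = [
    ({"service": "checkout", "pod": "checkout-a"}, 3.0),
    ({"service": "checkout", "pod": "checkout-b"}, 5.0),
    ({"service": "search", "pod": "search-a"}, 2.0),
]

per_service = aggregate_away(samples, drop={"pod"})
print(f"{len(samples)} series in, {len(per_service)} series out")
```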

Sampling, the right trade-offs

Sampling is often framed badly. Uniform random sampling drops incidents at the same rate as routine traffic, so a 1 percent sample also keeps only 1 percent of your error traces, which is the opposite of what you need.

For traces, tail sampling changes the picture. The keep-or-drop decision is made once the trace is complete, so you can keep 100 percent of error traces, 100 percent of traces above a latency threshold and a small percentage of nominal traces. You retain everything useful for diagnosis and discard the background noise with no loss of signal.
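The keep-or-drop rule described above can be sketched as a small decision function. The 2000 ms threshold, the 8 percent baseline and the `Trace` fields are assumptions for illustration. A real Collector applies the same logic through its tail sampling processor rather than through application code.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace, latency_threshold_ms=2000.0, baseline_percent=8.0):
    """Decide on a completed trace: all errors, all slow traces, a slice of the rest."""
    if trace.has_error:
        return True
    if trace.duration_ms >= latency_threshold_ms:
        return True
    # Deterministic bucket in [0, 10000) derived from the trace id, so the
    # same trace always gets the same decision.
    bucket = zlib.crc32(trace.trace_id.encode()) % 10_000
    return bucket < baseline_percent * 100


print(keep_trace(Trace("t-1", has_error=True, duration_ms=12.0)))
print(keep_trace(Trace("t-2", has_error=False, duration_ms=3500.0)))
```

Hashing the trace id instead of calling a random generator keeps the decision reproducible across Collector replicas.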

For metrics, you do not sample, you aggregate and downsample over time. Fine granularity on the recent hours and coarser granularity beyond covers almost every use case.

For logs, filtering and routing come first. The noise goes to the bin and the rest is sorted by severity and destination.

Data contracts, to make it stick

Everything above stays fragile if it depends on each team's goodwill. The data contract is the piece that turns one-off cuts into durable policy.

A data contract defines, per signal, what is allowed: which labels, which cardinality budget, which retention and which destination. It is written in the Collector configuration and checked in continuous integration, the same way as any other infrastructure artifact. The OpenTelemetry semantic conventions give the shared vocabulary that keeps those contracts readable across teams. From there, a new high-cardinality metric no longer reaches production by accident, it is stopped at review.
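A contract check of this kind could run in continuous integration along the following lines. The contract layout, the metric name and the budget are hypothetical. In practice the rules would be loaded from the same repository as the Collector configuration.

```python
# Hypothetical contract: allowed labels and a series budget per metric.
CONTRACT = {
    "http_server_duration": {
        "allowed_labels": {"service", "endpoint", "status_code"},
        "max_series": 5_000,
    },
}


def check_metric(name, labels, estimated_series, contract=CONTRACT):
    """Return a list of violations; an empty list means the metric passes."""
    rule = contract.get(name)
    if rule is None:
        return [f"{name}: no contract declared"]
    violations = []
    extra = set(labels) - rule["allowed_labels"]
    if extra:
        violations.append(f"{name}: labels not in contract: {sorted(extra)}")
    if estimated_series > rule["max_series"]:
        violations.append(
            f"{name}: {estimated_series} series exceeds budget {rule['max_series']}"
        )
    return violations


for problem in check_metric("http_server_duration", ["service", "user_id"], 9_000):
    print(problem)
```

A CI job would fail the build when the returned list is not empty, which is what stops the metric at review.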

An illustrative before and after

The figures below describe a representative scenario, not a specific engagement. They show the order of magnitude of the savings rather than promise a result.

Take a mid-size Kubernetes platform, roughly 120 services and 800 pods, instrumented across all three signals.

Signal	Before	Action at the Collector	After
Metrics	6,000,000 active series	Drop 2 unbounded labels, aggregate per service	1,800,000 series (down 70 percent)
Traces	100 percent of spans, around 800 GB per day	Tail sampling, errors and slow traces at 100 percent, nominal at 8 percent	around 140 GB per day (down 82 percent)
Logs	1.2 TB per day	Filter noise, route compliance data to cold storage	400 GB hot plus 300 GB cold archived

On that basis, a monthly observability bill in the region of 40,000 euros typically lands in a 14,000 to 18,000 euro range, with the final figure depending on the vendor pricing model. The absolute number is not the point, the proportion is. Most of the gain comes from cardinality and tail sampling, two Collector settings, with no change to application code.
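The proportions quoted in the table and in the bill estimate can be checked with a few lines of arithmetic. The figures are the ones from the scenario above; only the helper name is new.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return round(100 * (before - after) / before, 1)


metrics = reduction_percent(6_000_000, 1_800_000)  # active series
traces = reduction_percent(800, 140)               # GB per day
logs_hot = reduction_percent(1200, 400)            # GB per day kept in hot storage
bill_low = reduction_percent(40_000, 18_000)       # euros per month, upper end of range
bill_high = reduction_percent(40_000, 14_000)      # euros per month, lower end of range

print(metrics, traces, logs_hot, bill_low, bill_high)
```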

The regulated constraint

In a regulated environment, cost control cannot come at the price of compliance. Some data, audit logs and traces with evidentiary value, must be kept whatever happens, sometimes for long durations imposed by frameworks such as DORA or NIS2.

The answer is not to keep everything in hot storage, it is to route. Operational signals live in fast expensive storage with short retention. Compliance signals go to cold storage, slow and cheap but durable and exportable if an audit calls for it. The Collector is the natural place to apply that split through routing rules. You cut the hot bill without touching your obligations.
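The routing split can be sketched as a single decision function. The record fields (`compliance`, `severity`, `path`) and the destination names are hypothetical. The point of the sketch is the ordering: the compliance check runs first, so an audit record is never dropped by a noise filter.

```python
def route(record):
    """Pick a destination for one log record."""
    # Compliance data is routed before any noise filtering is applied.
    if record.get("compliance"):
        return "cold"
    if record.get("severity") == "debug" or record.get("path") == "/healthz":
        return "drop"
    return "hot"


records = [
    {"severity": "info", "compliance": True, "msg": "payment authorized"},
    {"severity": "debug", "msg": "cache miss"},
    {"severity": "error", "msg": "upstream timeout"},
]
print([route(r) for r in records])
```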

Wrapping up

The Collector is not plumbing, it is where your cost policy is enforced. Treat telemetry as a budget. Decide upfront what is worth keeping, write that decision into data contracts and apply it at the source. Cardinality first, sampling next, routing for the rest. That is the difference between observability that drifts and observability that holds over time.

A false positive is not noise but a gap in your model of normal.

samuel desseaux — Sun, 21 Jun 2026 15:47:25 +0000

You have just switched Tetragon to enforce mode on a production cluster. A few minutes later your phone lights up. Four alerts, on a single payment service.

A gcc compiler ran inside the container. The service mesh sidecar spawned processes and opened connections. A nightly migration job started a shell script. And an on-call engineer opened a kubectl exec to understand an incident.

Four alerts, all legitimate in some sense. The natural reflex, allowlist everything for some peace, is the worst thing to do for three of them.

We are taught to detect, not to triage

The available Tetragon content teaches you to cast a wide net. Install, write a first TracingPolicy, capture every exec, replay a container escape demo, switch to enforce. It is useful and it is already very well covered.

The problem is that it stops there. Nobody teaches triage. False positives are treated as folklore you swap with peers, not as an engineering problem. You endure them, you comment on them, you do not model them. And a noisy detection always ends the same way, ignored then disabled. A disabled rule detects nothing.

The flip

Here is the idea that changes everything. A detection rule is a hypothesis about what is normal. A false positive is therefore not a flaw in the tool, it is proof that your hypothesis of normal is wrong or incomplete.

Noise is not noise. It is information about your model. The right question is not how to filter this alert, it is what the alert tells you about what you believed was normal.

One symptom, four opposite treatments

Back to the payment service. The four alerts look like the same symptom, an unexpected exec. They have nothing in common.

The gcc should not be there. It is an image defect, a missing multi-stage build. You fix it at the source, you never allowlist it. Allowlisting a compiler in prod hands a tool to an attacker.

The mesh sidecar is doing exactly its job. Stable and permanent behavior, to be described as normal, by workload class and not by pod.

The migration job is legitimate but transient. You allow it within its lifecycle window, not in steady state.

The kubectl exec is legitimate but human and unpredictable. You never allowlist an interactive shell, that leaves the door wide open. You attribute the action to a person and you audit it.

Same symptom, four opposite treatments. Triage is not a detail, it is the job.

Stop guessing

The good news: you do not have to imagine the list of normal behaviors. You can observe it.

Tetragon runs in observation by default, a policy only blocks if you add an action to it. There is no magic switch between seeing and blocking. So you let it run in observation over a representative window, two weeks to catch the weekly cycles and the batch jobs, then you generate the allowlist from the behavior actually observed, aggregated by workload class.

Then you climb the rungs one at a time. Observe, alert only, permissive, narrow block. One workload class at a time, never all at once. It is the same logic as seccomp or AppArmor profile generation, record then permissive then enforce, carried over to runtime security.
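Generating a candidate allowlist from observed behavior can be sketched as follows. The event fields, the minimum count and the review list are assumptions for illustration, and the events would come from an export of what Tetragon observed during the window. Entries that match the review list are set aside for a human rather than proposed automatically.

```python
from collections import Counter

# Binaries that should never be frozen into an allowlist without human review
# (hypothetical list, following the compiler and interactive shell cases above).
REVIEW_REQUIRED = {"/usr/bin/gcc", "/bin/sh", "/bin/bash"}


def candidate_allowlist(events, min_count=3):
    """Group observed execs by workload class and propose allowlist entries."""
    counts = Counter((e["workload_class"], e["binary"]) for e in events)
    allow, review = {}, {}
    for (workload, binary), seen in counts.items():
        if seen < min_count:
            continue  # too rare in the window to call normal, keeps alerting
        target = review if binary in REVIEW_REQUIRED else allow
        target.setdefault(workload, set()).add(binary)
    return allow, review


events = (
    [{"workload_class": "payment", "binary": "/usr/bin/envoy"}] * 3
    + [{"workload_class": "payment", "binary": "/usr/bin/gcc"}] * 3
    + [{"workload_class": "payment", "binary": "/usr/bin/curl"}]
)
allow, review = candidate_allowlist(events)
print("propose:", allow)
print("needs a human:", review)
```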

What baselining does not tell you

Two traps, which I detail in the full module.

Observed is not approved. Automatic generation can perfectly well capture a behavior you absolutely do not want to make permanent, the gcc for example. A human validates before freezing it.

And baselining an already compromised environment bakes the attacker into normal. The observation window assumes a healthy environment, a hypothesis to verify, not to presume.

In the end the allowlist becomes an audit artifact, dated and justified. Because an auditor does not only ask what you detect, they ask why you allow what you allow.

What comes next

This article is a taste. It belongs to a series on the blind spots of Tetragon training, the topics the beginner content does not cover. The ground covered by install, first policies, escape demo and observe-then-enforce is saturated. The real questions start after.

Each angle has its talk then its lab, on the same rhythm. This one, signal engineering, treats false positives as a measurable discipline, with a taxonomy, baselining and signal quality indicators.

Which false positive has already cost you a night and how did you handle it?

End-to-End Observability for vLLM and TGI: from DCGM to Tokens

samuel desseaux — Thu, 21 May 2026 11:37:13 +0000

Running large language model inference servers in production exposes gaps that neither stock Prometheus dashboards nor the official documentation of vLLM or TGI cover completely. This article maps the layers that matter, names the exact signals to scrape and flags the traps most teams only hit after real traffic arrives.

Audience: SREs, ML platform engineers and observability engineers who operate or are about to operate vLLM or TGI on GPUs.

Why LLM serving breaks standard observability

A model server is not a regular web service. Four properties invalidate the usual playbook.

Latency is not scalar. Time to first token (TTFT), inter-token latency (ITL) and end-to-end latency tell three different stories. Optimizing one usually degrades another. Prefill-bound workloads (long prompts, short outputs) and decode-bound workloads (chat, agents, RAG) have inverse profiles. A single p99 number is meaningless without saying which latency it refers to and what input distribution produced it.

Batching is dynamic and preemptive. Continuous batching schedules in-flight requests into the same forward pass. Throughput rises with batch size up to a point where KV cache pressure forces evictions or swaps. Standard "queue depth" metrics still apply, but the relationship between queue depth and tail latency is non-linear and bursty. A queue that looks shallow for ninety seconds and explodes for ten is more useful to detect than a steady moderate queue.

The KV cache is the real bottleneck. It lives in VRAM, grows with sequence length and dominates memory pressure. When it fills, vLLM preempts or swaps requests. TGI rejects new arrivals. Neither outcome is visible from CPU or network metrics. The KV cache is the single most informative signal on the engine layer, and it has no equivalent in a stateless web service.

Hardware reaches into the application. A degraded NVLink, a thermal throttle or an NCCL all-reduce stall propagates directly to the request queue. The observability stack has to reach down to the silicon or it will produce dashboards that look fine while users wait.

The right answer is a layered pipeline that correlates a token rendered to a user with what happened on the silicon a few milliseconds earlier.
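The three latencies named above can be derived from the timestamps at which tokens are emitted. This is a minimal sketch with made-up timestamps in seconds. It shows two requests with the same end-to-end latency and very different TTFT.

```python
def latency_breakdown(request_start, token_times):
    """Return (TTFT, mean ITL, end-to-end latency) from token emission timestamps."""
    if not token_times:
        raise ValueError("no tokens emitted")
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    e2e = token_times[-1] - request_start
    return ttft, itl, e2e


# Long prompt, short output: most of the time is spent before the first token.
prefill_heavy = latency_breakdown(0.0, [1.5, 1.75, 2.0])
# Short prompt, longer output: the first token arrives quickly.
decode_heavy = latency_breakdown(0.0, [0.25 * i for i in range(1, 9)])

print(prefill_heavy)
print(decode_heavy)
```

Both requests finish in two seconds, which is why a single latency number hides the difference a streaming user actually perceives.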

Layer map

┌────────────────────────────────────────────────┐
│ Business and cost (€/token, €/tenant, €/h GPU) │
├────────────────────────────────────────────────┤
│ API and distributed tracing (OTel GenAI)       │
├────────────────────────────────────────────────┤
│ Inference engine (vLLM, TGI: Prometheus)       │
├────────────────────────────────────────────────┤
│ Container and OS (cAdvisor, kubelet, eBPF)     │
├────────────────────────────────────────────────┤
│ CUDA runtime and collectives (NCCL, cuPTI)     │
├────────────────────────────────────────────────┤
│ GPU silicon (DCGM exporter, NVLink, PCIe)      │
└────────────────────────────────────────────────┘

Each layer has its own native signals. The value of an end-to-end stack comes from the ability to cross-reference them.

Layer by layer

GPU silicon

DCGM exporter is the right entry point. The signals worth wiring up from day one:

DCGM metric	What it actually says
`DCGM_FI_DEV_GPU_UTIL`	Coarse indicator. Reaches 100 % for badly vectorized kernels. Do not use alone.
`DCGM_FI_PROF_SM_ACTIVE`	Fraction of cycles where at least one warp is active on an SM.
`DCGM_FI_PROF_SM_OCCUPANCY`	Average warps active per SM normalized to the maximum.
`DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`	Fraction of cycles the tensor cores are working. The real utilization signal for LLM inference.
`DCGM_FI_PROF_PIPE_FP16_ACTIVE`, `_FP32_ACTIVE`	Pipeline activity by precision. Useful to spot fallbacks.
`DCGM_FI_PROF_DRAM_ACTIVE`	HBM traffic. Identifies memory-bound workloads.
`DCGM_FI_DEV_FB_USED`, `_FB_FREE`	VRAM in use and free. Cross with `vllm:gpu_cache_usage_perc`.
`DCGM_FI_PROF_NVLINK_RX_BYTES`, `_TX_BYTES`	Inter-GPU traffic. Essential under tensor parallelism.
`DCGM_FI_PROF_PCIE_RX_BYTES`, `_TX_BYTES`	GPU to host traffic. Surfaces pressure during model loading and CPU paging.
`DCGM_FI_DEV_POWER_USAGE`, `_GPU_TEMP`, `_MEMORY_TEMP`	Power and thermal. Throttling shows up here before it shows up in user latency.
`DCGM_FI_DEV_SM_CLOCK`, `_MEM_CLOCK`	Effective clocks. A persistent drop is the first sign of thermal throttling.

DCGM exporter ships as a Helm chart and runs as a DaemonSet on GPU nodes. Do not assume one-second data: the collection interval is a setting (30 seconds by default in recent exporter releases, so check yours). Even at one second it is fine for steady-state dashboards but too coarse to catch sub-second incidents like an eviction storm. Two profiles in production:

steady: 5 seconds, full field set.
incident: 250 ms, reduced field set, enabled on alert.

A few hardware notes that change what you should monitor:

MIG (Multi-Instance GPU). When MIG slices are active, DCGM exposes per-slice metrics under the same field IDs with a different device label. Pin labels in your relabel config or you will see metrics merge or vanish across reschedules.
NVSwitch (DGX, HGX). Add the NVSwitch exporter alongside DCGM. NVLink saturation at the switch is invisible from the per-GPU NVLink counters alone.
InfiniBand. Use the Mellanox ibutils exporter or ucx counters. RDMA traffic for distributed inference does not appear in the GPU metrics path.

CUDA runtime and collectives

Tensor parallelism and pipeline parallelism rely on NCCL. When one GPU waits for its peers, application latency shows anomalies with no CPU or network cause visible.

Sources worth wiring:

NCCL_DEBUG=WARN in production with parseable output, ingested as structured logs. INFO is too verbose and has a non-trivial overhead.
nvidia-nccl-exporter where the version supports your CUDA stack.
cuPTI for kernel-level and collective-level tracing. Enable on demand only, the overhead is measurable and biases what you are trying to observe.
On InfiniBand fabric, export UCX counters and SHARP statistics. NCCL alone does not surface fabric congestion.

Collective patterns to remember when reading dashboards:

All-reduce dominates tensor-parallel matmul splits. Saturated NVLink with idle SMs means you are bandwidth-bound on the collective.
All-gather appears in some attention implementations and in pipeline-parallel weight gathering.
Send/recv dominates pipeline parallelism. Imbalance between stages shows up as one GPU with low SM activity and a long send wait.

These traces are not meant to be on all the time. Continuous lightweight counters with on-demand deep tracing is the pattern that scales.

Container and OS

Platform layer:

cAdvisor and kubelet for pod CPU, RAM and IO.
kube-state-metrics for Pod state, OOM events and restarts.
kube_pod_info joined to GPU identity (nvidia.com/gpu device id) to map pod to physical GPU.

Kernel layer:

eBPF via Tetragon, bpftrace or Pixie for syscalls, unexpected network egress and model file reads.
On-CPU profiling via parca or pyroscope without instrumenting the binary.

eBPF is also where the security observability lives. A minimal Tetragon policy that watches model file reads and unexpected egress on the inference pod:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: vllm-runtime-watch
spec:
  podSelector:
    matchLabels:
      app: vllm
  kprobes:
    - call: "security_file_open"
      syscall: false
      args:
        - index: 0
          type: "file"
      selectors:
        - matchArgs:
            - index: 0
              operator: "Prefix"
              values:
                - "/models/"
                - "/root/.cache/huggingface/"
          matchActions:
            - action: Post
    - call: "tcp_connect"
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchArgs:
            - index: 0
              operator: "NotDAddr"
              values:
                - "10.0.0.0/8"
                - "127.0.0.0/8"
          matchActions:
            - action: Post

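This is a starter: it logs every model file read and every outbound connection from vLLM pods to a destination outside 10.0.0.0/8 and loopback. If your cluster also uses 172.16.0.0/12 or 192.168.0.0/16, add them to the NotDAddr values to cover the full RFC 1918 space. Convert to alerts only after a quiet-period baseline.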

Inference engine

This is the layer most teams neglect the longest, although it is the densest in business signal.

vLLM exposes /metrics by default. The base set:

Metric	Type	Reading
`vllm:time_to_first_token_seconds`	histogram	Server-side TTFT. Compare to gateway TTFT.
`vllm:time_per_output_token_seconds`	histogram	ITL. What the user feels in streaming.
`vllm:e2e_request_latency_seconds`	histogram	Server-side end-to-end latency.
`vllm:num_requests_running`	gauge	Requests in the active batch.
`vllm:num_requests_waiting`	gauge	Queue depth. First saturation indicator.
`vllm:num_requests_swapped`	gauge	Requests paged to CPU. VRAM pressure.
`vllm:gpu_cache_usage_perc`	gauge	KV cache occupation. At 1.0 with `swapped > 0`, you are in eviction territory.
`vllm:num_preemptions_total`	counter	Cumulative preemptions. Take the per-second rate.
`vllm:prompt_tokens_total`	counter	Input tokens processed.
`vllm:generation_tokens_total`	counter	Generated tokens. Cost calculation base.

Recent vLLM versions also expose prefix caching and speculative decoding metrics. The exact names depend on the version, but the families to look for:

vllm:gpu_prefix_cache_hits_total, vllm:gpu_prefix_cache_queries_total. Hit rate dominates the gain from prefix caching in agent and RAG workloads.
Speculative decoding counters that let you derive the acceptance rate of the draft model. If acceptance falls below the break-even point against the draft model overhead, spec decode is costing you throughput.

TGI exposes /metrics with a different naming convention:

Metric	Reading
`tgi_batch_current_size`	Active batch size.
`tgi_batch_next_size`	Next batch being formed.
`tgi_queue_size`	Queue depth.
`tgi_request_queue_duration`	Time in queue.
`tgi_request_inference_duration`	Engine time.
`tgi_batch_inference_duration`	Per-batch latency, decomposable into forward and decode.
`tgi_request_input_length`, `tgi_request_generated_tokens`	Token counters per request.

Both engines emit classic Prometheus histograms with explicit buckets. Quantiles are computed at query time (histogram_quantile in PromQL or MetricsQL).
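Because quantiles are estimated from buckets at query time, their precision depends on where the bucket boundaries sit. The sketch below reimplements the linear interpolation used for classic bucketed histograms. It is a simplified illustration with made-up counts, not the query engine's code.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket list must end with an infinite upper bound, as in the
    Prometheus exposition format (le="+Inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # The quantile falls beyond the last finite boundary.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT buckets in seconds, cumulative counts.
ttft_buckets = [(0.5, 50), (1.0, 90), (2.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, ttft_buckets))
```

With these buckets the p95 lands somewhere between 1 s and 2 s by interpolation. If your SLO threshold sits inside a wide bucket, the estimate can be off by a large margin, so place a bucket boundary on the threshold itself.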

A practical reading habit: never look at a single engine metric in isolation. The useful patterns are paired.

vllm:num_requests_waiting rising with vllm:gpu_cache_usage_perc at 1.0 and vllm:num_preemptions_total rate > 0: you are in cache thrash. Reduce max_num_seqs or max_num_batched_tokens so fewer sequences compete for the cache, or give the cache more room by raising gpu_memory_utilization.
vllm:num_requests_waiting rising with healthy cache: you are compute-bound. Add capacity or reduce max_num_batched_tokens.
tgi_queue_size high with tgi_batch_current_size plateauing below maximum: scheduler is starving on token budget. Inspect max_batch_total_tokens.

API and distributed tracing

Tracing answers "where did my request spend its time" independently of aggregate metrics.

Adopt OpenTelemetry with the GenAI semantic conventions:

gen_ai.system (for example vllm, tgi),
gen_ai.operation.name (chat, text_completion),
gen_ai.request.model,
gen_ai.request.max_tokens, gen_ai.request.temperature, gen_ai.request.top_p,
gen_ai.usage.input_tokens, gen_ai.usage.output_tokens,
gen_ai.response.finish_reasons.

A useful span breakdown:

http.server.request
└── gen_ai.completion
    ├── tokenize
    ├── schedule
    ├── prefill
    ├── decode  (loop, span per batch step)
    ├── detokenize
    └── stream_out

Available instrumentation libraries: OpenLIT, openllmetry (Traceloop) and OpenInference (Arize). Pick one and stick to it. Mixing them produces inconsistent attribute names that break dashboard queries.

The request_id propagated from the ingress through to the engine is the key that makes downstream correlation possible. Declare it at the ingress (header x-request-id), propagate it through OTel baggage, log it on the engine side and attach it as a trace attribute.

Prometheus exemplars are worth the configuration cost. They link a histogram bucket to one or more traces, so a click on a TTFT p99 spike in Grafana jumps directly to the slowest traces. vLLM does not expose exemplars natively today, but the OTel collector can derive latency histograms from spans with the spanmetrics connector and attach trace IDs to them as exemplars. Sample collector snippet:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    dimensions:
      - name: gen_ai.system
      - name: gen_ai.request.model
      - name: tenant
    exemplars:
      enabled: true

This gives you metric-to-trace navigation without changing the engine code.

Logs

Structured, JSON. VictoriaLogs handles the volume without forcing a complex query syntax.

Minimum fields for the inference layer:

request_id,
tenant,
model,
prompt_tokens, generation_tokens,
ttft_ms, e2e_ms,
finish_reason,
gpu_id (resolved at pod level),
trace_id, span_id (for cross-reference with traces).

Do not log prompts and outputs by default. If you need to, allocate a separate channel with short retention and active PII filtering. The legal exposure of an unfiltered prompt log dwarfs any operational benefit.

Business and cost

The only layer that talks to leadership. From the native counters you derive three indicators.

Cost per request, per tenant, per model. The denominator changes the answer, surface all three.

Hourly cost of a GPU normalized by tokens produced in the same window. This is the closest thing to a useful efficiency metric.

Useful tokens over billed tokens. A measure of batching efficiency: how many tokens you actually deliver per unit of GPU compute time you pay for.

Cost per tenant, in PromQL:

sum by (tenant) (
  # keep the model label through the join, then collapse to tenant
  sum by (tenant, model) (
    rate(vllm:generation_tokens_total{tenant=~".+"}[5m])
  )
  * on(model) group_left
    cost_per_generation_token_eur
)

Where cost_per_generation_token_eur is a reference series pushed by a configuration job. Maintain prompt and generation rates separately: most providers price them differently and they have different production costs (prefill is a single forward pass, decode is autoregressive).

A useful refinement is to include idle cost. A GPU running at 30 % utilization still costs the full hourly rate. The "effective cost per token" should distribute the full GPU hour over the tokens actually produced:

(gpu_hourly_cost_eur)
/
(sum by (gpu) (rate(vllm:generation_tokens_total[1h])) * 3600)

This is the number that drives capacity decisions, not the marginal cost per token.
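The same calculation in plain arithmetic shows how idle capacity inflates the effective cost. The hourly price and the throughput figures are hypothetical.

```python
def effective_cost_per_token(gpu_hourly_cost_eur, tokens_per_second):
    """Spread the full GPU hour over the tokens actually produced in that hour."""
    tokens_per_hour = tokens_per_second * 3600
    if tokens_per_hour == 0:
        return float("inf")  # the GPU was paid for and produced nothing
    return gpu_hourly_cost_eur / tokens_per_hour


# Hypothetical GPU at 3.60 EUR per hour.
busy = effective_cost_per_token(3.6, tokens_per_second=1000)
mostly_idle = effective_cost_per_token(3.6, tokens_per_second=300)

print(f"busy:        {busy * 1e6:.2f} EUR per million tokens")
print(f"mostly idle: {mostly_idle * 1e6:.2f} EUR per million tokens")
```

At 30 % of the throughput the same GPU hour makes every token roughly 3.3 times more expensive, which is the gap the marginal cost per token hides.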

The hard problems

Cross-layer correlation

Linking a rendered token to a physical GPU is trivial in theory and hard in practice. The concrete plumbing:

request_id propagated from ingress through engine spans.
Engine-side spans carry gpu_id as an attribute.
Metric series carry pod and gpu_uuid labels, joined via kube_pod_info to a pod to gpu_uuid mapping (DCGM exposes UUID and device labels).
Dashboards join temporally on time windows and spatially on gpu_uuid.

DCGM samples per GPU, not per request. Fine-grained correlation is always done by time window, never by exact identifier. The illusion of per-request hardware metrics is exactly that, an illusion.
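Correlation by time window can be sketched as a lookup of the GPU samples that overlap a request span. The sample series, the span boundaries and the padding are made-up values. The padding compensates for the coarse sampling interval.

```python
from bisect import bisect_left, bisect_right


def gpu_samples_in_window(sample_times, sample_values, start, end, pad=1.0):
    """Return GPU sample values with timestamps in [start - pad, end + pad].

    `sample_times` must be sorted in ascending order.
    """
    lo = bisect_left(sample_times, start - pad)
    hi = bisect_right(sample_times, end + pad)
    return sample_values[lo:hi]


# Hypothetical tensor core activity for one gpu_uuid, sampled every 5 seconds.
times = [0.0, 5.0, 10.0, 15.0, 20.0]
tensor_active = [0.9, 0.8, 0.2, 0.85, 0.9]

# A slow request span running from t=9.5 s to t=11.0 s on the same GPU.
print(gpu_samples_in_window(times, tensor_active, start=9.5, end=11.0))
```

The result is the set of samples that were plausibly concurrent with the request, not a measurement of the request itself.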

Cardinality

Labeling by tenant and model is healthy. Labeling by user_id, session_id or request_id on metrics is forbidden. Those dimensions belong to traces and logs.

VictoriaMetrics absorbs moderate cardinality well, especially with vmagent stream aggregation pre-rolling histograms. But multi-tenant inference explodes fast. Run the math at design time:

tenants × models × quantiles × histogram_buckets × instances

Ten tenants, five models, six quantiles, ten buckets, fifty instances gives 150 000 series for one histogram metric alone. With three histograms (TTFT, ITL, e2e) you are at 450 000 series, close to half a million, before counters and gauges. Plan accordingly or use stream aggregation to drop unused dimensions before storage.
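Running the design-time math as code makes it easy to test what dropping one dimension buys. The dimension sizes are the ones from the example above. Dropping the instance dimension is an assumed stream aggregation choice, shown here only for its effect on the count.

```python
from math import prod

dims = {
    "tenants": 10,
    "models": 5,
    "quantiles": 6,
    "histogram_buckets": 10,
    "instances": 50,
}

HISTOGRAMS = 3  # TTFT, ITL, e2e

per_histogram = prod(dims.values())
all_histograms = HISTOGRAMS * per_histogram

# Same budget once the per-instance dimension is aggregated away.
without_instances = HISTOGRAMS * prod(v for k, v in dims.items() if k != "instances")

print(per_histogram, all_histograms, without_instances)
```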

Sampling

Three rhythms coexist: DCGM at 1 s, vLLM at 10 s, traces sometimes at 1 in 100. For brief incidents (preemption bursts, KV eviction storms), prepare:

OTel collector with tail-based sampling, rule "if error or slow then keep",
DCGM in incident mode at 250 ms, switched on by an alert webhook,
eBPF in continuous collection on critical syscalls (no sampling, the overhead is minimal),
vLLM kept at 10 s, no faster path exists without patching.

A tail-based sampling policy that works in practice:

tail_sampling:
  decision_wait: 10s
  policies:
    - name: errors
      type: status_code
      status_code: { status_codes: [ERROR] }
    - name: slow_ttft
      type: latency
      latency: { threshold_ms: 2000 }
    - name: high_value_tenant
      type: string_attribute
      string_attribute:
        key: tenant
        values: [enterprise_a, enterprise_b]
    - name: baseline
      type: probabilistic
      probabilistic: { sampling_percentage: 5 }

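This keeps every error, every trace longer than 2 s end to end, every trace from high-value tenants and a 5 % baseline of normal traffic. The latency policy measures whole-trace duration, not TTFT, so slow_ttft is a proxy for a slow first token rather than a direct measurement.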

Time origin

Server-side TTFT is not what the user feels. Streaming, proxy buffering, HTTP buffer flushes and WAN traversal all change the perceived value. Measure also:

gateway-side TTFT (Envoy upstream_rq_time or equivalent),
client-side TTFT where possible (SDK instrumentation).

Without these, you optimize a number that does not reflect the experience. The gap between engine TTFT and gateway TTFT is also a useful health signal in itself, a sudden divergence usually means a proxy buffering regression.

SLO design for LLM serving

Standard SRE SLO patterns need adjustment for LLM serving. A defensible starting set:

SLO	Definition	Why
TTFT availability	p95 TTFT below threshold over rolling window	Streaming UX collapses without it.
ITL stability	p95 ITL below threshold	Decode stalls feel worse than a long initial wait.
Completion success	success rate of requests that produce at least one token	Hard failure metric.
Streaming completeness	percentage of streams that emit `finish_reason=stop` (not `length`, not `error`)	Quality proxy.
Capacity headroom	p95 queue depth below a threshold	Forward-looking, drives autoscaling.

The thresholds depend on the model and workload. Chat: TTFT p95 under 1 s, ITL p95 under 80 ms. RAG: TTFT p95 under 3 s, ITL p95 under 50 ms (long outputs amplify ITL). Code completion: TTFT p95 under 500 ms, ITL p95 under 30 ms.

Express them as multi-window multi-burn-rate alerts on the underlying SLI series, not as single-threshold alerts. The Google SRE workbook formulas apply unchanged.
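The multi-window logic can be sketched in a few lines. The SLI here is the fraction of requests whose TTFT exceeds the threshold, with a 95 % target to match the p95 wording above. The 14.4 burn-rate threshold on a 1 hour and 5 minute window pair is the fast-burn example from the SRE workbook. The function names are illustrative.

```python
def burn_rate(bad_fraction, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    return bad_fraction / (1.0 - slo_target)


def fast_burn_alert(bad_1h, bad_5m, slo_target=0.95, threshold=14.4):
    """Fire only if both the long and the short window burn above the threshold."""
    return (
        burn_rate(bad_1h, slo_target) > threshold
        and burn_rate(bad_5m, slo_target) > threshold
    )


# Sustained degradation: both windows are burning fast.
print(fast_burn_alert(bad_1h=0.8, bad_5m=0.9))
# Already recovering: the short window has calmed down, so no page.
print(fast_burn_alert(bad_1h=0.8, bad_5m=0.1))
```

The short window is what makes the alert reset quickly once the incident is over, while the long window keeps a brief spike from paging anyone.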

Reference stack

Components:

Role	Recommended	Alternative
Metrics	VictoriaMetrics cluster with vmagent	Prometheus with Thanos or Mimir
Logs	VictoriaLogs	Loki, OpenSearch
Traces	Tempo, Jaeger	SaaS (Honeycomb, Datadog)
Application collection	OTel collector (agent and gateway)	Vector, Fluent Bit
GPU collection	DCGM exporter (DaemonSet)	nvidia_gpu_exporter (legacy)
eBPF	Tetragon, Pixie	Falco
Visualization	Grafana	Perses

OTel collector pipeline (agent)

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: vllm
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:8000']
        - job_name: dcgm
          scrape_interval: 5s
          static_configs:
            - targets: ['localhost:9400']
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
      labels:
        - tag_name: app
          key: app
          from: pod
        - tag_name: tenant
          key: tenant
          from: pod
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s]
    dimensions:
      - name: gen_ai.system
      - name: gen_ai.request.model
      - name: tenant
    exemplars:
      enabled: true

exporters:
  prometheusremotewrite:
    endpoint: http://vmagent.observability.svc:8429/api/v1/write
    external_labels:
      cluster: prod-eu-west
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls:
      insecure: true

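service:
  pipelines:
    metrics:
      receivers: [prometheus, spanmetrics]
      # batch goes last so batching happens after enrichment
      processors: [k8sattributes, resource, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      # k8sattributes relies on the request context, so it runs first;
      # batch comes after tail_sampling so only kept spans are batched
      processors: [k8sattributes, resource, tail_sampling, batch]
      # spanmetrics sits after tail_sampling here, so it only sees sampled spans
      exporters: [otlp/tempo, spanmetrics]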

The spanmetrics connector turns traces into low-cardinality histograms with exemplars, giving you click-through from metrics to traces without changing engine code.
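To make the three tail_sampling policies in the pipeline concrete, here is a simplified Python model of the keep-or-drop decision. It is a sketch of the logic only, not the processor's implementation: the `TraceSummary` type, the hash-based baseline and the function names are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    trace_id: str
    has_error: bool      # at least one span ended with status ERROR
    duration_ms: float   # end-to-end duration of the trace


def baseline_keep(trace_id: str, percentage: float) -> bool:
    """Deterministic keep decision derived from the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percentage * 100


def keep_trace(t: TraceSummary, slow_ms: float = 2000,
               baseline_pct: float = 5) -> bool:
    """Keep the trace as soon as one policy matches: errors, slow, baseline."""
    return (t.has_error
            or t.duration_ms > slow_ms
            or baseline_keep(t.trace_id, baseline_pct))
```

Deriving the baseline decision from the trace id keeps it stable across replicas, which matters once several collectors can see the same trace.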

Useful starter queries

TTFT p99 by model:

histogram_quantile(0.99,
  sum by (model, le) (
    rate(vllm:time_to_first_token_seconds_bucket[5m])
  )
)
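For readers who want to know what histogram_quantile does with those buckets, the following Python sketch reproduces the usual estimate for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified model under stated assumptions (buckets already summed and sorted, lowest bucket starting at 0), not Prometheus source code.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Linear interpolation, assuming observations are spread evenly in the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

The interpolation step is why bucket boundaries matter: a p99 that lands in a wide bucket, say between 2.5 s and 5 s, can be off by seconds even when the counts are exact.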

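Preemptions per second overlaid with cache occupancy (two queries on the same panel):

# Query A: preemptions per second
rate(vllm:num_preemptions_total[1m])
# Query B: KV cache usage as a ratio, plotted on a second axis
vllm:gpu_cache_usage_perc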

Effective tensor core utilization per GPU:

avg by (Hostname, gpu) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)

Tokens per GPU-second (efficiency, fleet-wide):

# vLLM series carry no gpu label, so aggregate both sides before dividing
sum(rate(vllm:generation_tokens_total[5m]))
/
count(DCGM_FI_DEV_GPU_UTIL)

Normalized TGI queue pressure:

tgi_queue_size / on(instance) tgi_batch_current_size

Cost per hour per tenant:

# Join on model before aggregating, while the model label still exists
sum by (tenant) (
  rate(vllm:generation_tokens_total[1h]) * 3600
  * on(model) group_left cost_per_generation_token_eur
)
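The arithmetic behind the cost query can be checked by hand with a short Python sketch. Tenant names, model names, token rates and prices below are all hypothetical. The point it illustrates is that the per-model price is applied to each (tenant, model) pair first, and only then summed per tenant.

```python
def hourly_cost_per_tenant(
    token_rates: dict[tuple[str, str], float],
    price_per_token: dict[str, float],
) -> dict[str, float]:
    """Return EUR per hour per tenant.

    token_rates maps (tenant, model) to generated tokens per second.
    price_per_token maps model to EUR per generated token.
    """
    costs: dict[str, float] = {}
    for (tenant, model), tokens_per_second in token_rates.items():
        hourly = tokens_per_second * 3600 * price_per_token[model]
        costs[tenant] = costs.get(tenant, 0.0) + hourly
    return costs


# Hypothetical figures for illustration only.
rates = {("acme", "model-a"): 100.0, ("acme", "model-b"): 50.0, ("globex", "model-a"): 10.0}
prices = {"model-a": 2e-6, "model-b": 1e-6}
costs = hourly_cost_per_tenant(rates, prices)
```

A tenant that spreads traffic over two models with different prices cannot be costed from its total token rate alone, which is why the model dimension has to survive until the multiplication.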

Alerting that does not lie

Alerts on inference servers should fire on user-visible degradation, not on resource thresholds. A working starter set:

TTFT burn-rate (multi-window).

- alert: VLLMTTFTBudgetFastBurn
  expr: |
    (
      sum by (model) (rate(vllm:time_to_first_token_seconds_bucket{le="1.0"}[5m]))
      /
      sum by (model) (rate(vllm:time_to_first_token_seconds_count[5m]))
    ) < 0.95
    and
    (
      sum by (model) (rate(vllm:time_to_first_token_seconds_bucket{le="1.0"}[1h]))
      /
      sum by (model) (rate(vllm:time_to_first_token_seconds_count[1h]))
    ) < 0.95
  for: 2m
  labels:
    severity: page

Cache thrash detector.

- alert: VLLMCacheThrash
  expr: |
    vllm:gpu_cache_usage_perc > 0.95
    and
    rate(vllm:num_preemptions_total[2m]) > 0.5
  for: 5m
  labels:
    severity: ticket

Tensor core idle under load.

- alert: GPUTensorIdleUnderLoad
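  expr: |
    # Assumes both exporters carry a shared node label (named "node" here);
    # without "on", the differing label sets of DCGM and vLLM never match.
    avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[10m]) < 0.2
    and on (node)
    max by (node) (vllm:num_requests_running) > 4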
  for: 10m
  labels:
    severity: ticket

This last alert catches the case where the engine reports work in flight but the tensor cores are idle. The usual cause is a stalled NCCL collective or a CPU-bound bottleneck before the GPU.

Streaming completion regression.

- alert: VLLMStreamingTruncations
  expr: |
    (
      sum by (model) (rate(vllm:request_success_total{finished_reason="length"}[10m]))
      /
      sum by (model) (rate(vllm:request_success_total[10m]))
    ) > 0.1
  for: 15m
  labels:
    severity: ticket

When more than 10 % of requests stop on length, either max_tokens is too low for the use case or quality has regressed.

Avoid alerting directly on queue depth or GPU utilization. Both vary widely under healthy load. They are diagnostic, not actionable.

Anti-patterns

To review every quarter:

Treating DCGM_FI_DEV_GPU_UTIL as utilization. The right read is DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.
Tuning batching against mean latency. Tail latency and queue depth tell the truth.
Labeling metrics by request_id. That belongs to traces.
Measuring latency only at the engine. Add the gateway, add the client where possible.
Capturing prompts and outputs in traces without an active PII filter.
Counting "tokens" without separating prompt and generation. Pricing is asymmetric, batching capacity is asymmetric.
Leaving CUPTI and NCCL_DEBUG=INFO on in production. Measurable overhead, biased measurements.
Sampling traces uniformly. Tail-based sampling with rules for errors, slow requests and high-value tenants catches more value at lower volume.
Storing everything at maximum resolution. Cardinality cost explodes before retention cost.
Building alerts on resource thresholds. Alert on user-visible SLOs, treat resource metrics as diagnostic.
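The request_id anti-pattern is easiest to see with numbers. The sketch below computes the worst-case series count for one metric as the product of its label cardinalities. The label names and counts are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for illustration only.
bounded = {"model": 4, "tenant": 50, "pod": 20}
with_request_id = {**bounded, "request_id": 100_000}

bounded_series = max_series(bounded)
unbounded_series = max_series(with_request_id)
```

With these figures the bounded label set tops out at 4,000 series, while adding request_id raises the ceiling to 400 million, for information that a trace already carries.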

Maturity ladder

Where teams typically stand and where to move next.

Level 0: nothing specific. Generic node and pod metrics. No idea how the engine is doing. Move to level 1 by scraping the engine's /metrics.

Level 1: engine metrics only. vLLM or TGI metrics scraped, basic dashboard. Sufficient for an initial deployment, blind to hardware-rooted issues. Move to level 2 by adding DCGM and pod-to-GPU mapping.

Level 2: engine plus GPU correlated. Most pragmatic teams stop here. Resolves 70 % of incidents in practice. Move to level 3 when multi-tenant pressure starts and when latency complaints exceed throughput complaints.

Level 3: distributed tracing with GenAI semconv. Per-request visibility, exemplar-driven debugging, tenant-aware SLOs. Required at scale. Move to level 4 for regulated workloads and HPC fabrics.

Level 4: kernel and fabric depth. eBPF policies in alerting paths, NCCL and InfiniBand observability, audit-grade logging with retention policies, confidential computing where applicable. Required for regulated industries, sovereign deployments and large-scale training-adjacent serving.

Move one level at a time. Skipping levels produces dashboards no one trusts.

Where to go next

Three topics deserve their own articles:

KV cache observability: eviction, fragmentation, swap. Native metrics, stress experiments, mitigations.
NCCL and tensor parallelism: observing inter-GPU flows and finding the collective that stalls the batch.
Securing an inference server: attack surface, eBPF detection, sandboxing, AI Act audit trail.

The right implementation order in production:

1. Inference engine metrics (vLLM, TGI native scrape).
2. GPU metrics (DCGM exporter).
3. Distributed tracing with OTel GenAI semconv.
4. Structured logs with trace_id and request_id.
5. Business and cost layer.
6. eBPF policies for security and runtime observability.
7. NCCL and CUPTI on demand for hard-to-reproduce issues.

Starting with layers 1 and 2 alone resolves most of the incidents observed in production. Everything above that compounds value once the base is solid.

Corrections and operational war stories welcome.