Jeff Geiser


Distributed Inference Observability Gaps

It seems that distributed inference observability has some gaps.

To frame this: I am referring to inference deployments at the edge (or so-called near edge), i.e. PoPs close to end users. Let's say you are using Ollama for early testing and/or smaller-scale serving, but vLLM in production.

Traditional monitoring platforms will report on GPU/CPU load, memory usage, network status, and so on.

However, other things are also happening:

  • GPU throttled: 100% utilization, but clock speed dropped 33%
  • KV cache saturated, causing a queue backlog
  • Time to first token spiked 200% from CPU contention
  • Another tenant's PCIe traffic impacted inference

Call it contextual drift: hardware stress that degrades inference performance in ways that are generally invisible to standard system metrics.
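To make that concrete, here is a rough sketch of catching the "busy but throttled" case with pynvml (assuming an NVIDIA GPU; the thresholds are made up for illustration):

```python
# Rough sketch: detect "busy but throttled" GPUs via pynvml (pip install nvidia-ml-py).
# Thresholds below are illustrative, not tuned.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu                      # % GPU busy
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)       # current SM clock (MHz)
max_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)   # rated SM clock (MHz)
throttle_reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask of active reasons

clock_ratio = sm_clock / max_clock
if util > 90 and clock_ratio < 0.75:
    # "100% utilization but clocks dropped" -- invisible on a plain utilization graph
    print(f"GPU busy ({util}%) but running at {clock_ratio:.0%} of max SM clock "
          f"(throttle reason bitmask: {throttle_reasons:#x})")

pynvml.nvmlShutdown()
```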

Most monitoring in the market is built for general-purpose servers and samples at intervals that may not make sense for inference:

  • token generation: 20-100 tokens per second
  • cache saturation: spikes within seconds
  • thermal throttling: happens instantly

Traditional monitoring might see all of this as smooth if it only glances at the server every 30 seconds. But you also can't naively grab data every 2 seconds, or the agent itself starts contributing to CPU scheduling pressure.
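The direction I am leaning toward is adaptive sampling: poll fast while signals are volatile, back off when they are quiet. A hypothetical sketch, where read_signals() is a placeholder for the real collectors:

```python
# Hypothetical adaptive-sampling loop: 2 s when metrics are volatile, backing off
# toward 30 s when they are stable, so the agent itself stays cheap.
import time

MIN_INTERVAL, MAX_INTERVAL = 2.0, 30.0

def read_signals() -> dict:
    """Placeholder for the real collectors (GPU clocks, KV cache %, queue depth, ...)."""
    return {"gpu_util": 0.0, "kv_cache_pct": 0.0, "queue_depth": 0.0}

def volatility(prev: dict, cur: dict) -> float:
    """Largest relative change across tracked signals since the last sample."""
    return max(abs(cur[k] - prev[k]) / (abs(prev[k]) + 1e-9) for k in cur)

interval = MIN_INTERVAL
prev = read_signals()
while True:
    time.sleep(interval)
    cur = read_signals()
    if volatility(prev, cur) > 0.2:        # things are moving: sample fast
        interval = MIN_INTERVAL
    else:                                  # quiet: back off gradually
        interval = min(interval * 1.5, MAX_INTERVAL)
    prev = cur
```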

So if you are going to run both Ollama (for dev/test or smaller loads) and vLLM for production, you have two engines with completely different failure modes, but traditional monitoring treats them the same.

We also have a blind spot around time to first token (TTFT) and time per output token (TPOT). A dashboard might show request latency spiking, but we need to know whether TTFT spiked or TPOT spiked, because they point at different bottlenecks: TTFT is dominated by queueing and prefill, TPOT by decode throughput.
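A crude client-side split is possible just by timing a streaming request: the gap to the first chunk approximates TTFT, and the spacing of the remaining chunks approximates TPOT. A sketch against an OpenAI-compatible streaming endpoint (vLLM's server is one such target; the URL and model name are placeholders):

```python
# Rough client-side TTFT/TPOT probe against an OpenAI-compatible streaming endpoint.
import time
import requests

url = "http://localhost:8000/v1/completions"   # placeholder endpoint
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 64, "stream": True}

start = time.monotonic()
first_chunk_at = None
chunk_times = []

with requests.post(url, json=payload, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        # OpenAI-style SSE: lines look like `data: {...}`, ending with `data: [DONE]`
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        now = time.monotonic()
        if first_chunk_at is None:
            first_chunk_at = now           # first token arrived -> TTFT
        chunk_times.append(now)

ttft = first_chunk_at - start
tpot = (chunk_times[-1] - first_chunk_at) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, approx TPOT: {tpot * 1000:.1f} ms/chunk")
```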

So I am thinking about an open source project: a lightweight observability agent. Large companies will likely solve this by building a giant observability layer on top of their distributed inference stack, but I think a more bottom-up approach that anyone can deploy might make sense.

The observability agent would strive to:

  • keep CPU impact/overhead minimal
  • sample every 2 seconds, with intelligent backoff
  • split TTFT/TPOT out of the box
  • detect contextual drift
  • work with vLLM's Prometheus metrics and Ollama's API stats (rough sketch below)
  • use embedded storage (DuckDB?) with no external dependencies
  • run at the edge, and maybe federate
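To make the vLLM/Prometheus plus embedded-storage idea concrete, here is a rough sketch that scrapes vLLM's /metrics endpoint and appends a couple of gauges to a local DuckDB file. The endpoint URL and the vllm:* metric names are assumptions about the deployment, not something I have validated everywhere:

```python
# Sketch: scrape vLLM's Prometheus /metrics endpoint and append to an embedded DuckDB file.
# The URL and the vllm:* metric names are assumptions about the deployment.
import duckdb
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"
WANTED = {"vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc"}

con = duckdb.connect("inference_metrics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS samples (ts TIMESTAMP, name VARCHAR, value DOUBLE)")

text = requests.get(METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WANTED:
            con.execute("INSERT INTO samples VALUES (now(), ?, ?)", [sample.name, sample.value])

con.close()
```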

Curious to hear where people are hitting observability gaps. This is a new area for me to spend time on, so all feedback is welcome.

What are you doing to monitor vLLM and/or other inference engines?

What metrics do you wish you had?

Drop the war stories here. Thanks.

(Apologies for the lack of formatting; maybe I will get better over time.)
