Jeff Geiser


Distributed Inference Observability Gaps

It seems that distributed inference observability has some gaps.

To frame this: I am referring to inference deployments at the edge (or so-called near edge), i.e. PoPs close to end users. Let's say you are using Ollama for early testing and/or smaller-scale serving, but vLLM in production.

Traditional monitoring platforms will report on GPU/CPU load, memory usage, network status, and so on.

However, other things are also happening:

  • GPU throttled: 100% utilization, but clock speed dropped 33%
  • KV cache saturated, causing a queue backlog
  • Time to first token spiked 200% from CPU contention
  • Another tenant's PCIe traffic impacted inference

Call it contextual drift: hardware stress that degrades inference performance in ways that are generally invisible to standard system metrics.
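To make that concrete, here is a rough sketch of catching the "busy but throttled" case with pynvml (assuming an NVIDIA GPU; the thresholds are made up for illustration):

```python
# Rough sketch: detect "busy but throttled" GPUs via pynvml (pip install nvidia-ml-py).
# Thresholds below are illustrative, not tuned.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu                      # % GPU busy
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)       # current SM clock (MHz)
max_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)   # rated SM clock (MHz)
throttle_reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask of active reasons

clock_ratio = sm_clock / max_clock
if util > 90 and clock_ratio < 0.75:
    # "100% utilization but clocks dropped" -- invisible on a plain utilization graph
    print(f"GPU busy ({util}%) but running at {clock_ratio:.0%} of max SM clock "
          f"(throttle reason bitmask: {throttle_reasons:#x})")

pynvml.nvmlShutdown()
```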

Most monitoring in the market is built for general-purpose servers and samples at intervals that may not make sense for inference:

  • token generation: 20-100 tokens per second
  • cache saturation: spikes within seconds
  • thermal throttling: happens instantly

Traditional monitoring might see all of this as smooth if it only glances at the server every 30 seconds. But you also can't naively grab data every 2 seconds, or the agent itself starts contributing to CPU scheduling pressure.
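The direction I am leaning toward is adaptive sampling: poll fast while signals are volatile, back off when they are quiet. A hypothetical sketch, where read_signals() is a placeholder for the real collectors:

```python
# Hypothetical adaptive-sampling loop: 2 s when metrics are volatile, backing off
# toward 30 s when they are stable, so the agent itself stays cheap.
import time

MIN_INTERVAL, MAX_INTERVAL = 2.0, 30.0

def read_signals() -> dict:
    """Placeholder for the real collectors (GPU clocks, KV cache %, queue depth, ...)."""
    return {"gpu_util": 0.0, "kv_cache_pct": 0.0, "queue_depth": 0.0}

def volatility(prev: dict, cur: dict) -> float:
    """Largest relative change across tracked signals since the last sample."""
    return max(abs(cur[k] - prev[k]) / (abs(prev[k]) + 1e-9) for k in cur)

interval = MIN_INTERVAL
prev = read_signals()
while True:
    time.sleep(interval)
    cur = read_signals()
    if volatility(prev, cur) > 0.2:        # things are moving: sample fast
        interval = MIN_INTERVAL
    else:                                  # quiet: back off gradually
        interval = min(interval * 1.5, MAX_INTERVAL)
    prev = cur
```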

So if you are going to run both Ollama (for dev/test or smaller loads) and vLLM for production, you have two engines with completely different failure modes, but traditional monitoring treats them the same.

We also have a blind spot around time to first token (TTFT) and time per output token (TPOT). A dashboard might show request latency spiking, but we need to know whether TTFT spiked or TPOT spiked, because they point at different bottlenecks: TTFT is dominated by queueing and prefill, TPOT by decode throughput.
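A crude client-side split is possible just by timing a streaming request: the gap to the first chunk approximates TTFT, and the spacing of the remaining chunks approximates TPOT. A sketch against an OpenAI-compatible streaming endpoint (vLLM's server is one such target; the URL and model name are placeholders):

```python
# Rough client-side TTFT/TPOT probe against an OpenAI-compatible streaming endpoint.
import time
import requests

url = "http://localhost:8000/v1/completions"   # placeholder endpoint
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 64, "stream": True}

start = time.monotonic()
first_chunk_at = None
chunk_times = []

with requests.post(url, json=payload, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        # OpenAI-style SSE: lines look like `data: {...}`, ending with `data: [DONE]`
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        now = time.monotonic()
        if first_chunk_at is None:
            first_chunk_at = now           # first token arrived -> TTFT
        chunk_times.append(now)

ttft = first_chunk_at - start
tpot = (chunk_times[-1] - first_chunk_at) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms, approx TPOT: {tpot * 1000:.1f} ms/chunk")
```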

So I am thinking about an open source project: a lightweight observability agent. Large companies will likely solve this by building a giant observability layer on top of their distributed inference stack, but I think a more bottom-up approach that anyone can deploy might make sense.

The observability agent would strive to:

  • keep CPU impact/overhead minimal
  • sample every 2 seconds, with intelligent backoff
  • split TTFT/TPOT out of the box
  • detect contextual drift
  • work with vLLM's Prometheus metrics and Ollama's API stats (rough sketch below)
  • use embedded storage (DuckDB?) with no external dependencies
  • run at the edge, and maybe federate
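To make the vLLM/Prometheus plus embedded-storage idea concrete, here is a rough sketch that scrapes vLLM's /metrics endpoint and appends a couple of gauges to a local DuckDB file. The endpoint URL and the vllm:* metric names are assumptions about the deployment, not something I have validated everywhere:

```python
# Sketch: scrape vLLM's Prometheus /metrics endpoint and append to an embedded DuckDB file.
# The URL and the vllm:* metric names are assumptions about the deployment.
import duckdb
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"
WANTED = {"vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc"}

con = duckdb.connect("inference_metrics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS samples (ts TIMESTAMP, name VARCHAR, value DOUBLE)")

text = requests.get(METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WANTED:
            con.execute("INSERT INTO samples VALUES (now(), ?, ?)", [sample.name, sample.value])

con.close()
```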

Curious to hear where people are hitting observability gaps. This is a new area for me to spend time on, so all feedback is welcome.

What are you doing to monitor vLLM and/or other inference engines?

What metrics do you wish you had?

Drop the war stories here. Thanks.

(Apologies for the lack of formatting; maybe I will get better over time.)
