Mohammad Heydari

Posted on Jun 23

Kubernetes in LLMOps (Part 2): GPU Efficiency, Cost Engineering, and Real-World Failure Modes

#infrastructure #kubernetes #llm #performance

Introduction: Scaling Is Easy, Efficiency Is Not

By the time a team reaches Kubernetes in their LLM journey, they have usually solved one class of problems: orchestration.

Services restart automatically. Deployments become safer. Scaling becomes possible.

But a new class of problems emerges, one that is subtler, more expensive, and far more difficult to fix:

GPUs are allocated but underutilized
Costs grow faster than traffic
Latency improves… until it suddenly doesn’t
Systems appear healthy but perform poorly

At this stage, the challenge is no longer making the system work.

The challenge is making it efficient, predictable, and economically viable.

The GPU Utilization Paradox

One of the most counterintuitive realities in LLM systems is this:

You can have 100% GPU allocation and still have terrible efficiency.

This happens because allocation is not utilization.

A typical inference workload behaves like this:

It processes requests in bursts
It waits for new requests
It suffers from memory fragmentation
It is constrained by batch size and token generation speed

As a result, a GPU may be “busy” from a scheduler’s perspective but idle from a compute perspective.

This is the GPU utilization paradox.
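The gap between allocation and utilization can be made concrete with a small calculation. The sketch below is illustrative only: `busy_fraction` and the burst timings are hypothetical, standing in for a trace of when kernels were actually running on a GPU that a pod held for the whole window.

```python
from typing import List, Tuple


def busy_fraction(busy_intervals: List[Tuple[float, float]],
                  window: Tuple[float, float]) -> float:
    """Fraction of the window during which the GPU was running kernels.

    Overlapping intervals are only counted once.
    """
    start, end = window
    if end <= start:
        raise ValueError("window must have positive length")
    busy = 0.0
    cursor = start  # everything before the cursor is already counted
    for lo, hi in sorted(busy_intervals):
        lo, hi = max(lo, cursor), min(hi, end)
        if hi > lo:
            busy += hi - lo
            cursor = hi
    return busy / (end - start)


# Hypothetical trace: the pod held the GPU for a full 10 s window,
# but kernels only ran during three short bursts of requests.
bursts = [(0.0, 1.0), (4.0, 5.5), (8.0, 8.5)]
allocated = 1.0  # from the scheduler's perspective the GPU is fully taken
utilized = busy_fraction(bursts, (0.0, 10.0))
print(f"allocated={allocated:.0%} utilized={utilized:.0%}")
```

In this made-up trace the scheduler reports a fully allocated GPU while kernels ran for well under half of the window.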

Batching: The Most Powerful (and Misused) Optimization

Batching is often introduced as a simple idea: process multiple requests together to maximize GPU throughput.

In practice, batching is one of the most delicate trade-offs in LLM systems.

Larger batches:

Increase throughput
Improve GPU efficiency
Reduce cost per request

But they also:

Increase latency for individual users
Introduce queuing delays
Complicate scheduling

The real challenge is not enabling batching—it is controlling it dynamically.

A production system must continuously balance:

Queue length
Latency targets
GPU saturation

This often leads to adaptive batching strategies, where batch size changes based on real-time conditions.

Kubernetes does not implement batching—but it enables architectures (queue + workers) where batching becomes possible.
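One possible way to let batch size follow real-time conditions is an additive-increase, multiplicative-decrease rule. The sketch below is an illustration, not the policy of any particular serving framework; `BatchPolicy`, `next_batch_size`, and the 500 ms p95 target are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BatchPolicy:
    min_batch: int = 1
    max_batch: int = 32
    target_p95_ms: float = 500.0  # assumed latency target


def next_batch_size(current: int, queue_len: int,
                    observed_p95_ms: float, policy: BatchPolicy) -> int:
    """Pick the batch size for the next scheduling round.

    Halve the batch when latency is over target, grow it by one when
    there is a backlog and latency headroom, otherwise keep it.
    """
    if observed_p95_ms > policy.target_p95_ms:
        proposed = current // 2
    elif queue_len > current:
        proposed = current + 1
    else:
        proposed = current
    return max(policy.min_batch, min(policy.max_batch, proposed))


policy = BatchPolicy()
print(next_batch_size(8, queue_len=20, observed_p95_ms=300.0, policy=policy))
print(next_batch_size(8, queue_len=20, observed_p95_ms=700.0, policy=policy))
```

Backing off quickly and growing slowly favors the latency target over throughput. A worker pulling from a queue would call something like this once per round.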

Model Multiplexing: Running More with Less

Another advanced optimization is model multiplexing—running multiple models on a single GPU.

At first glance, this seems like an obvious way to improve utilization. But in practice, it introduces significant complexity:

Memory contention between models
Unpredictable latency due to shared resources
Difficult debugging when performance degrades

The key insight is that multiplexing is not just a technical problem—it is a scheduling problem.

You must decide:

Which models can safely share a GPU
How to isolate workloads
How to prioritize requests

Kubernetes can assist through node-level isolation and resource constraints, but the logic of multiplexing often lives at the application layer.
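The "which models can safely share a GPU" decision can be approximated, at its simplest, as a bin-packing problem over memory. The following sketch uses first-fit decreasing; the model names, memory figures, and the `headroom_gb` reserve are hypothetical, and a real placement would also need to weigh compute contention and latency targets, which this ignores.

```python
from typing import Dict, List


def place_models(model_mem_gb: Dict[str, float], gpu_mem_gb: float,
                 headroom_gb: float = 8.0) -> List[List[str]]:
    """Group models onto GPUs with first-fit decreasing on memory.

    headroom_gb is kept free on every GPU for KV cache and activations.
    """
    usable = gpu_mem_gb - headroom_gb
    gpus: List[List[str]] = []   # model names per GPU
    free: List[float] = []       # remaining usable memory per GPU
    by_size = sorted(model_mem_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, mem in by_size:
        if mem > usable:
            raise ValueError(f"{name} does not fit on a single GPU")
        for i, remaining in enumerate(free):
            if mem <= remaining:
                gpus[i].append(name)
                free[i] -= mem
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append([name])
            free.append(usable - mem)
    return gpus


# Hypothetical models and weights-in-memory sizes.
models = {"chat-7b": 16.0, "embedder": 2.0,
          "reranker": 4.0, "summarizer-13b": 28.0}
placement = place_models(models, gpu_mem_gb=40.0)
print(placement)
```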

MIG: Hardware-Level Partitioning

For workloads that require stronger isolation, Multi-Instance GPU (MIG) provides a hardware-level solution.

Instead of sharing a GPU dynamically, MIG partitions it into smaller, independent units.

This allows you to:

Run multiple inference workloads in isolation
Reduce contention
Improve predictability

However, MIG introduces its own trade-offs:

Reduced flexibility compared to full GPUs
Fixed partition sizes
More complex scheduling requirements

In Kubernetes, MIG-enabled GPUs can be exposed as separate resources, allowing more granular scheduling.
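As a rough illustration of what "separate resources" looks like to a workload, the sketch below builds a Pod manifest that requests one MIG slice instead of a whole GPU. The resource name `nvidia.com/mig-1g.5gb` is an assumption: it matches what the NVIDIA device plugin advertises under its "mixed" MIG strategy for one A100 profile, and the names available in a given cluster depend on the GPU model and plugin configuration. The pod name and image are hypothetical.

```python
import json


def mig_pod_manifest(name: str, image: str,
                     mig_resource: str = "nvidia.com/mig-1g.5gb") -> dict:
    """Build a minimal Pod manifest that requests one MIG slice."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "inference",
                "image": image,
                # Extended resources are requested under limits, in whole units.
                "resources": {"limits": {mig_resource: 1}},
            }],
        },
    }


# Hypothetical pod name and image; the JSON can be applied with kubectl.
manifest = mig_pod_manifest("llm-small", "example.com/llm-server:latest")
print(json.dumps(manifest, indent=2))
```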

Cost Engineering: The Missing Discipline in LLMOps

Most teams think about scaling before they think about cost.

This is a mistake.

In LLM systems, cost is not a byproduct—it is a first-class constraint.

A poorly optimized system can easily cost 5–10x more than necessary without delivering better performance.

Key cost drivers include:

GPU idle time
Over-provisioned replicas
Inefficient batching
Redundant computation (lack of caching)

Cost engineering requires visibility and control.

You need to understand:

Cost per request
Cost per token
Cost per user session

And then design your system to optimize these metrics.
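These three metrics fall out of a simple division once GPU spend and traffic are measured over the same window. The sketch below assumes billing by allocated GPU-hours, so idle time is included in the cost; `UsageWindow`, `unit_costs`, and all of the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UsageWindow:
    gpu_hours: float        # allocated GPU-hours, busy or idle
    gpu_hourly_usd: float   # assumed price per GPU-hour
    requests: int
    tokens: int
    sessions: int


def unit_costs(w: UsageWindow) -> dict:
    """Derive per-unit costs from total GPU spend in the window."""
    if min(w.requests, w.tokens, w.sessions) <= 0:
        raise ValueError("traffic counts must be positive")
    total = w.gpu_hours * w.gpu_hourly_usd
    return {
        "total_usd": total,
        "per_request_usd": total / w.requests,
        "per_1k_tokens_usd": total / w.tokens * 1000,
        "per_session_usd": total / w.sessions,
    }


# Hypothetical day: 4 GPUs allocated for 24 hours.
day = UsageWindow(gpu_hours=96, gpu_hourly_usd=2.5,
                  requests=120_000, tokens=60_000_000, sessions=30_000)
costs = unit_costs(day)
for metric, value in costs.items():
    print(f"{metric}: {value:.4f}")
```

Because the numerator is allocated time rather than busy time, idle GPUs and over-provisioned replicas raise all three numbers at once.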

Kubernetes helps by enabling:

Fine-grained scaling
Resource limits
Workload isolation

But cost efficiency ultimately depends on system design decisions.

Failure Modes You Only See in Production

Some failures only emerge at scale. They do not appear in testing or staging environments.

1. Silent Latency Degradation

The system does not crash. It does not throw errors.

It just becomes slower.

This is often caused by:

Retriever bottlenecks
Cache inefficiencies
Suboptimal batching

These issues are difficult to detect without proper observability.

2. GPU Memory Fragmentation

Over time, repeated allocations and deallocations lead to memory fragmentation.

Even if total memory is sufficient, large contiguous blocks may not be available.

This results in:

Unexpected OOM errors
Pod crashes under seemingly safe conditions

Restarting pods temporarily fixes the issue—but does not solve the root cause.
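A toy model shows why "enough total memory" is not the same as "enough contiguous memory". The sketch below treats memory as a row of equal-sized slots; it is a deliberate simplification and does not reflect how any real GPU allocator manages memory.

```python
from typing import List


def free_runs(occupied: List[bool]) -> List[int]:
    """Lengths of each contiguous run of free slots."""
    runs: List[int] = []
    current = 0
    for in_use in occupied:
        if in_use:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


# Toy memory map of 16 slots after many allocations and frees.
# True means the slot is still in use.
memory = [True, False, False, True, False, False, True, False,
          False, True, False, False, True, False, False, True]

runs = free_runs(memory)
total_free = sum(runs)
largest_block = max(runs)
request = 4  # slots needed by the next allocation

print(f"total free: {total_free}, largest contiguous: {largest_block}")
print("allocation succeeds" if largest_block >= request else "OOM despite free memory")
```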

3. Thundering Herd Problem

A sudden spike in traffic (or a wave of simultaneous cache misses) causes a flood of requests to hit the system at once.

This leads to:

Queue explosion
Increased latency
Cascading failures

Mitigation strategies include:

Rate limiting
Request deduplication
Better caching
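Request deduplication is the easiest of these to sketch. The example below coalesces concurrent identical requests into a single backend call, a pattern often called single-flight. `SingleFlight`, the key, and the `expensive` coroutine are hypothetical, and the sketch only covers one process; deduplicating across replicas would need a shared store.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SingleFlight:
    """Coalesce concurrent calls with the same key into one backend call."""

    def __init__(self) -> None:
        self._inflight: Dict[str, "asyncio.Task[str]"] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable[str]]) -> str:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the real work.
            task = asyncio.ensure_future(fn())
            self._inflight[key] = task
            # Forget the key once the work finishes, success or failure.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        # Every caller, first or not, waits on the same task.
        return await task


async def demo() -> int:
    calls = 0

    async def expensive() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stands in for retrieval plus generation
        return "answer"

    sf = SingleFlight()
    results = await asyncio.gather(
        *(sf.do("same-prompt", expensive) for _ in range(50))
    )
    assert all(r == "answer" for r in results)
    return calls


backend_calls = asyncio.run(demo())
print(f"50 identical requests -> {backend_calls} backend call(s)")
```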

4. Cold Start Amplification

When scaling up, new pods need time to load models.

During this time:

Existing pods become overloaded
Latency increases
Autoscaling may overreact

This creates a feedback loop that destabilizes the system.
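One way to dampen that loop is to count pods that are still loading the model as capacity that is already on its way. The sketch below is a simplified scaling decision under that assumption; `desired_new_pods` and its parameters are hypothetical and are not how any specific autoscaler computes replicas.

```python
import math


def desired_new_pods(queue_len: int, per_pod_capacity: int,
                     ready_pods: int, warming_pods: int,
                     max_pods: int) -> int:
    """How many pods to add right now.

    Pods that are still loading the model (warming) count toward
    provisioned capacity, so the same backlog is not scaled for twice.
    """
    if per_pod_capacity <= 0:
        raise ValueError("per_pod_capacity must be positive")
    needed = math.ceil(queue_len / per_pod_capacity)
    target = min(max_pods, needed)
    return max(0, target - (ready_pods + warming_pods))


# Backlog of 400 requests, each pod handles about 50 at a time.
# Four pods are serving and four more are still loading weights.
print(desired_new_pods(400, 50, ready_pods=4, warming_pods=4, max_pods=20))

# The same decision made while ignoring the warming pods asks for 4 more.
print(desired_new_pods(400, 50, ready_pods=4, warming_pods=0, max_pods=20))
```

The second call shows the overreaction: the backlog has not shrunk yet, so a controller that ignores warming pods keeps adding replicas.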

Debugging in a Distributed LLM System

Debugging LLM systems is fundamentally different from debugging traditional applications.

You are not just debugging code—you are debugging interactions between services.

A typical debugging workflow might involve:

Tracing a request across API, retriever, and model
Inspecting queue delays
Analyzing GPU utilization patterns
Correlating logs across multiple pods

This requires:

Structured logging
Distributed tracing
Time-synchronized metrics

Kubernetes provides the environment—but effective debugging requires discipline in instrumentation.
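A minimal form of that discipline is emitting every log line as JSON with a shared trace ID, so lines from different stages can be joined later. The sketch below uses only the standard library and works within a single process; the stage names and field names are assumptions, and carrying the ID across services would additionally require passing it in request headers or using a tracing library.

```python
import contextvars
import json
import logging
import uuid

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the current trace ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "stage": record.name,
            "trace_id": trace_id_var.get(),
            "msg": record.getMessage(),
        }
        # Extra structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def get_logger(stage: str) -> logging.Logger:
    logger = logging.getLogger(stage)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


def handle_request(prompt: str) -> str:
    """Hypothetical request path: API, retriever, then model."""
    trace_id_var.set(uuid.uuid4().hex)
    get_logger("api").info("received", extra={"fields": {"prompt_chars": len(prompt)}})
    get_logger("retriever").info("retrieved", extra={"fields": {"docs": 0}})
    get_logger("model").info("generated", extra={"fields": {"tokens": 0}})
    return trace_id_var.get()


trace_id = handle_request("hello")
```

Filtering the combined logs on one `trace_id` value then reconstructs the path of a single request across stages.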

Designing for Predictability, Not Just Performance

A common mistake is optimizing purely for peak performance.

But in production systems, predictability is often more valuable than raw speed.

Users tolerate slightly slower responses.
They do not tolerate inconsistent behavior.

Designing for predictability means:

Avoiding extreme batching strategies
Isolating workloads when necessary
Prioritizing stable latency over maximum throughput

Kubernetes helps enforce these constraints through resource limits and isolation mechanisms.
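The difference between performance and predictability shows up when mean latency is compared with tail latency. The sketch below uses made-up latency samples for two hypothetical configurations, one with aggressive batching and one with a conservative setting.

```python
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Mean, median, and 95th percentile of a latency sample."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
    }


# Made-up samples: the aggressive setup is usually fast, but one request
# in ten waits a long time behind a large batch.
aggressive = [150.0] * 90 + [2000.0] * 10
conservative = [400.0] * 100

a = summarize(aggressive)
b = summarize(conservative)
print("aggressive  ", a)
print("conservative", b)
```

In this sample the aggressive configuration wins on mean latency and loses badly at p95, which is the number users experience as inconsistency.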

The Evolution of an LLM System

Most LLM systems evolve through stages:

1. Prototype (single service, no orchestration)
2. Early production (basic scaling, manual fixes)
3. Orchestrated system (Kubernetes, microservices)
4. Optimized system (cost-aware, efficient, observable)

Many teams reach stage 3 and stop.

But real competitive advantage lies in stage 4.

Conclusion: The Real Work Begins After Deployment

Kubernetes solves the problem of orchestration.

But orchestration is only the beginning.

The real challenges in LLMOps are:

Efficient GPU utilization
Cost control
Failure handling at scale
System predictability

These are not problems you solve once.

They are problems you continuously manage.

And the teams that do this well are the ones that turn AI capabilities into real, sustainable products.

What’s Next (Part 3)

In Part 3, we will explore:

Real-world architecture patterns (RAG at scale, streaming inference)
Advanced scheduling strategies
Hybrid cloud and on-prem GPU setups
Lessons learned from production incidents

Because at scale, every design decision becomes an operational decision.

Top comments (1)

Max Quimby • Jun 24

The allocation-≠-utilization point deserves the emphasis you give it. The trap I'd add: nvidia-smi "GPU-Util" makes this worse, because it reports the share of the sampling window in which at least one kernel was executing, not how much of the GPU that kernel occupied, so you can stare at 95% util while most of the SMs sit idle. For inference, the number that actually tracks money is closer to MFU, or pragmatically the prefill/decode split and KV-cache occupancy.

On batching: adaptive batch size is the right instinct, but for autoregressive serving the bigger lever is usually continuous (in-flight) batching à la vLLM. Static request-level batches waste the GPU through the long tail of decode, where most requests in the batch have finished and a couple are still generating — token-level scheduling reclaims exactly that.

And strong agree on cost-per-session as a first-class metric; it's underrated because multi-turn KV-cache reuse changes the economics in a way per-request hides. One question: what latency target drives your adaptive batch ceiling? In my experience the batch policy is really an SLO knob in disguise — teams that don't tie it to an explicit p95 end up oscillating between idle and overloaded.