Designing GPU scheduling in Kubernetes requires more than assigning one pod per GPU. Learn production-grade patterns for AI and ML workloads, including job queues, batching strategies, GPU sharing, and throughput-optimized scheduling.
From Waste to Design: Where We’re Picking Up
By now, the pattern should be clear.
We started this series by uncovering how Kubernetes clusters quietly waste CPU and memory due to inflated requests. Then we saw how requests and limits distort scheduling behavior, and how autoscaling — instead of fixing the issue — often amplifies it when the inputs are wrong.
In Part 4, things escalated. GPU clusters took all of those inefficiencies and turned them into direct financial impact. Idle time became expensive. Allocation without utilization became the default. And the traditional “one pod per resource” model started to fall apart under real AI workloads.
- Part 1: Kubernetes Resource Management at Scale: Why Your Clusters Are Full, Idle, and Still Starving for Resources
- Part 2: Kubernetes Requests and Limits: The Most Misunderstood Feature in Production
- Part 3: Kubernetes Autoscaling Myths: Why HPA Alone Won’t Fix Your Resource Problems
- Part 4: Why GPU Clusters Bleed Money in Kubernetes (and How to Stop It)
So now we’re at the point where theory isn’t enough.
If you’re running GPU workloads in Kubernetes, the question is no longer “why is this inefficient?”
The real question is:
What does a well-designed GPU scheduling system actually look like?
The First Mental Shift: You’re Not Scheduling Pods — You’re Scheduling Work
Kubernetes is built around pods, but GPU platforms are built around work units. That difference matters.
A long-running deployment holding a GPU is almost always the wrong abstraction for machine learning workloads. Training jobs, inference batches, data processing pipelines — these are all finite pieces of work with a clear start and end.
When you treat them as services, you inherit all the inefficiencies of service-style scheduling:
- GPUs stay allocated between tasks
- Idle time accumulates silently
- Scaling becomes reactive instead of intentional
The first step toward efficiency is to model workloads as jobs, not services. This alone changes how resources flow through the system.
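To make the contrast concrete, here is a minimal sketch of a training workload modeled as a Kubernetes Job rather than a Deployment. The name, image, and command are placeholders; the point is that the GPU is requested by a finite unit of work that runs to completion and exits, instead of by a long-running service that keeps holding it.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-run-42                  # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never               # a Job runs to completion; it is not restarted like a service
      containers:
        - name: trainer
          image: registry.example.com/ml/trainer:latest   # placeholder image
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1          # the GPU is held only while this pod is running
```

The same shape works for batch inference and data processing steps: each run is a Job (or a workflow step that creates Jobs), so the GPU’s allocation lifetime matches the lifetime of the work itself.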
Queue-Based Scheduling: The Backbone of Efficient GPU Platforms
Once workloads are modeled as jobs, the next step is introducing a queue. Instead of immediately scheduling pods when they are created, jobs enter a queue and are scheduled only when resources are available and it makes sense to run them.

This might feel counterintuitive at first. Engineers are used to immediate execution. But queues introduce something critical: control over contention and utilization.
A queue allows you to:
- Avoid fragmenting GPU resources
- Prioritize important workloads
- Batch compatible jobs together
- Maintain high utilization without overcommitting
Without a queue, Kubernetes will try to schedule everything immediately, often leading to inefficient placement and unnecessary scaling.
With a queue, you move from reactive scheduling to intentional scheduling.
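Kubernetes does not provide this queue by itself, but add-ons such as Kueue implement exactly this pattern. The sketch below, with illustrative names and quotas, defines a cluster-wide queue with a GPU budget and a namespace-local queue that feeds into it; treat the specific quota values and flavor setup as assumptions for the example.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor                   # illustrative: one undifferentiated pool of nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-cluster-queue                 # illustrative
spec:
  namespaceSelector: {}                  # accept workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 64
            - name: "memory"
              nominalQuota: 256Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 8            # at most 8 GPUs' worth of jobs admitted at once
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-jobs
  namespace: ml-team                     # illustrative namespace
spec:
  clusterQueue: ml-cluster-queue
```

A Job opts in by carrying the `kueue.x-k8s.io/queue-name: ml-jobs` label; Kueue keeps it suspended until the queue has capacity, which is exactly the shift from reactive to intentional scheduling described above.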
Throughput vs Latency: The Trade-Off Most Teams Ignore
One of the biggest design decisions in GPU scheduling is choosing between throughput optimization and latency optimization.
Service-oriented thinking prioritizes latency. You want requests to start immediately and complete as fast as possible. This works for APIs and user-facing systems.
GPU workloads are different.
Most AI training and batch inference jobs are not latency-sensitive. They are throughput-sensitive. What matters is how much work gets done over time, not how quickly an individual job starts.
When you optimize for throughput:
- Jobs may wait in a queue briefly
- GPUs stay consistently busy
- Overall system efficiency increases
When you optimize for latency:
- Jobs start immediately
- GPUs may sit idle between tasks
- Utilization drops significantly
Mature platforms make this trade-off explicit. They don’t accidentally drift into a latency-first model — they choose their priorities based on workload characteristics.
GPU Packing: Breaking the “One Pod = One GPU” Model
The default Kubernetes GPU model assumes exclusive allocation. One pod requests one GPU, and that GPU is reserved entirely. This is simple, but often wasteful.
Many workloads don’t need a full GPU continuously. Some use only a fraction of memory or compute capacity. Others are bursty, alternating between active and idle phases.
This opens the door to GPU packing — running multiple workloads on the same GPU.
There are several approaches to this:
- Running multiple containers sharing a GPU
- Using frameworks that allow partial GPU allocation
- Structuring workloads to interleave compute phases
Each approach comes with trade-offs in isolation, performance predictability, and operational complexity.
The key is not to force packing everywhere, but to identify workloads that can safely share without impacting correctness or performance. Even modest improvements in packing efficiency can lead to significant cost savings.
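As one concrete example, NVIDIA’s Kubernetes device plugin supports a time-slicing mode that advertises each physical GPU as several schedulable replicas, letting multiple pods land on the same card. The ConfigMap below sketches that configuration; the name, replica count, and the way the plugin is pointed at the ConfigMap depend on your plugin version and deployment, so treat it as illustrative rather than drop-in.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config      # illustrative name
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4                  # each physical GPU is exposed as 4 schedulable slices
```

Note that time slicing shares compute without isolating memory, so it suits cooperative, bursty workloads; MIG-capable GPUs offer partitioning with stronger isolation, at the cost of fixed slice sizes.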
Job Lifecycle Discipline: Where Most Savings Come From
One of the most overlooked areas in GPU platforms is job lifecycle management.
A GPU is only useful while it’s actively executing work. The moment a job finishes — or effectively stops doing useful computation — that GPU should be released. In practice, this doesn’t always happen.
Common issues include:
- Jobs that linger after completion
- Processes waiting indefinitely on external dependencies
- Cleanup steps that unnecessarily hold GPU resources
- Orchestration workflows that don’t terminate cleanly
These small inefficiencies accumulate quickly.
The most effective platforms enforce strict lifecycle discipline:
- Jobs have clear completion criteria
- Resources are released immediately after completion
- Idle states are minimized or eliminated
This is not glamorous work, but it often delivers the highest return on investment.
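Much of this discipline can be encoded directly in the Job spec rather than left to convention. The sketch below shows the guardrails Kubernetes already provides; the specific values are placeholders and should reflect your own job durations and retry tolerance.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-batch-inference          # illustrative
spec:
  activeDeadlineSeconds: 14400           # hard stop after 4 hours: a hung job cannot hold a GPU forever
  ttlSecondsAfterFinished: 120           # finished Jobs (and their pods) are garbage-collected promptly
  backoffLimit: 1                        # avoid endless retry loops that keep re-acquiring GPUs
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/ml/batch-infer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

The harder case, jobs that keep running but stop doing useful work while waiting on external dependencies, still needs application-level timeouts, since Kubernetes only sees a running container.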
Scheduling Policies: Turning Infrastructure into a Platform
At scale, GPU scheduling is no longer just about placing workloads — it becomes about defining policies. These policies answer questions like:
- Which jobs get priority during contention?
- Can lower-priority jobs be preempted?
- How are resources shared across teams?
- What happens when demand exceeds supply?
Without explicit policies, the system defaults to first come, first served, which is rarely optimal. With policies, you can align infrastructure behavior with business priorities.

For example, production inference workloads might take precedence over experimental training jobs. High-priority research might preempt lower-value batch processing. Teams might be allocated quotas to prevent resource monopolization.

These decisions are not purely technical. They reflect how the organization values different types of work.
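Kubernetes has native primitives for encoding a first cut of these policies. The sketch below, with illustrative names and values, gives production inference a preempting priority class, gives experiments a non-preempting one, and caps how many GPUs a single team’s namespace can hold.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-inference                   # illustrative
value: 100000
preemptionPolicy: PreemptLowerPriority   # may evict lower-priority GPU pods under contention
description: "Production inference outranks experimental work."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experimental-training            # illustrative
value: 1000
preemptionPolicy: Never                  # experiments wait instead of preempting anyone
description: "Best-effort research and experimentation."
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-research               # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"         # this namespace can hold at most 4 GPUs at a time
```

Richer fairness models, such as weighted sharing or borrowing between teams, usually come from a queueing layer rather than from these primitives alone.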
Why Kubernetes Alone Is Not Enough
Kubernetes provides the primitives for scheduling, but it does not provide a complete GPU scheduling system out of the box. This is where many teams get stuck.
They expect Kubernetes to solve higher-level scheduling problems that it was never designed to handle:
- Queue management
- Fairness across teams
- Workload prioritization
- Efficient batching
To address these gaps, teams often introduce additional layers:
- Job schedulers
- Queueing systems
- Custom controllers
- Workflow orchestration tools
The goal is not to replace Kubernetes, but to build on top of it with a system that understands the semantics of AI workloads.
The Most Important Metric: GPU Busy Time
If you had to track one metric to evaluate your GPU platform, it wouldn’t be raw utilization. It would be GPU busy time as a percentage of allocation time.
This captures the real efficiency of your system:
- How long GPUs are allocated
- How much of that time is spent doing useful work
Everything in this post — queues, packing, lifecycle management, policies — ultimately aims to improve this metric.
When GPU busy time increases, costs stabilize and throughput improves.
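One way to put a number on this, assuming NVIDIA GPUs with the DCGM exporter feeding Prometheus, is a recording rule like the sketch below. Metric and label names vary with exporter version and configuration, so treat both the expression and the rule name as a starting point rather than a drop-in definition.

```yaml
groups:
  - name: gpu-efficiency
    rules:
      # Rough proxy for "busy time over allocation time": average utilization of GPUs
      # that the exporter attributes to a pod (i.e. GPUs that are currently allocated).
      # DCGM_FI_DEV_GPU_UTIL and the pod label depend on your dcgm-exporter setup.
      - record: cluster:allocated_gpu_busy_ratio:avg
        expr: avg(DCGM_FI_DEV_GPU_UTIL{pod!=""}) / 100
```

Tracked over days rather than minutes, this ratio tells you whether queues, packing, and lifecycle changes are actually converting allocated GPU hours into useful work.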
What a Mature GPU Platform Looks Like
In well-designed systems, things feel very different.
Workloads don’t immediately grab GPUs — they enter a queue and are scheduled intentionally. GPUs rarely sit idle because jobs are batched and packed efficiently. Resource allocation reflects priority and business value, not just timing.
Engineers understand that GPUs are shared infrastructure, not personal resources. Jobs are designed to release resources quickly. Metrics are trusted, and inefficiencies are visible.
Most importantly, the system behaves predictably. And just like we discussed in earlier parts of this series, predictability is what allows efficiency to emerge.
Closing Thoughts
Efficient GPU scheduling is not about squeezing every last percentage point of utilization. It’s about designing a system where waste is hard to hide and easy to correct.
Kubernetes gives you the foundation, but it’s not the full solution. The real work lies in how you model workloads, how you control scheduling, and how you align infrastructure with organizational priorities.
If you treat GPUs like CPUs, you will overspend.
If you treat GPU scheduling as a first-class system, you will gain control.
Key Takeaways
- GPU scheduling must be job-oriented, not pod-oriented, to eliminate idle allocation and improve utilization.
- Queues and scheduling policies are essential, enabling intentional resource allocation and higher throughput.
- Lifecycle discipline and GPU packing drive the biggest efficiency gains, not just better configuration.
So, what’s coming next?
Next up, in Part 6, we’ll tackle something equally important and often ignored: how to make Kubernetes costs visible — without turning it into a political battle between teams.