DEV Community: Daya Shankar

MVCC Differences Between PostgreSQL and MySQL

Daya Shankar — Mon, 29 Jun 2026 08:30:52 +0000

I've seen PostgreSQL databases slow down because VACUUM couldn't keep up with dead tuples. I've seen MySQL databases accumulate massive undo histories because long-running transactions prevented purge operations from doing their job. In both cases, the root cause was the same: MVCC was working exactly as designed.

That's what makes MVCC such an interesting database feature. Most of the time, nobody thinks about it. Reads and writes happen concurrently, applications remain responsive, and everything appears normal. Then one day storage consumption starts growing unexpectedly, query performance begins drifting, maintenance operations become a bottleneck, or a database that handled yesterday's workload comfortably starts struggling under today's traffic. Suddenly, MVCC becomes impossible to ignore.

When I'm evaluating PostgreSQL and MySQL, I don't think of MVCC as a concurrency feature anymore. I think of it as a version-management system. Both databases solve the same problem by maintaining multiple versions of data, but they make a fundamentally different decision about where those versions live. Everything that follows (VACUUM activity, undo logs, storage growth, cleanup behavior, and long-transaction impact) stems from that single architectural choice.

The Most Important Difference: Where Old Row Versions Go

At a high level, MVCC allows readers and writers to operate concurrently without constantly blocking one another. Instead of immediately overwriting data, the database maintains multiple versions of a row and determines which version each transaction should see. The concept is similar in PostgreSQL and MySQL. The implementation is not.

The most important distinction can be summarized in a single table:

Area	PostgreSQL	MySQL InnoDB
Old Row Versions	Stored in the table itself	Stored in Undo Logs
Update Behavior	Creates new tuple versions	Updates row and stores previous state in Undo
Cleanup Mechanism	VACUUM	Purge Threads
Storage Impact	Table and index bloat	Undo history growth
Visibility Tracking	Transaction IDs (xmin/xmax) on each tuple	Transaction ID and roll pointer on each row, leading into undo version chains

Whenever I'm explaining MVCC to engineers, I start here because almost every operational difference between PostgreSQL and MySQL can be traced back to this design decision.

When a row is updated in PostgreSQL, the database does not modify the existing row in place. Instead, it creates a new tuple version and leaves the previous version in the table until it can be safely removed. MySQL's InnoDB engine takes a different approach. The current row is updated, while information about its previous state is stored separately in undo records. Both systems preserve historical versions, but PostgreSQL keeps those versions close to the data itself while MySQL moves them into a dedicated version-management system.

A simple update statement looks identical in both databases:

UPDATE orders
SET status = 'shipped'
WHERE id = 101;

What happens underneath is where the divergence begins. PostgreSQL creates another tuple. InnoDB extends an undo chain. The application never notices the difference, but database administrators eventually do.
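On PostgreSQL, the new tuple can be observed directly through the system columns ctid, xmin, and xmax. The sketch below assumes a hypothetical orders table with id and status columns, matching the statement above. InnoDB keeps its equivalent fields hidden, so this check is PostgreSQL-only.

```sql
-- PostgreSQL only. Assumes a hypothetical orders(id, status) table.
-- ctid is the physical location of the row version; xmin is the ID of
-- the transaction that created it.
SELECT ctid, xmin, xmax, status FROM orders WHERE id = 101;

UPDATE orders
SET status = 'shipped'
WHERE id = 101;

-- Run the same query again. ctid and xmin normally differ from the
-- first result, because the UPDATE wrote a new tuple version instead
-- of overwriting the old one.
SELECT ctid, xmin, xmax, status FROM orders WHERE id = 101;
```

The previous version is still in the table at this point. It becomes a dead tuple once no transaction can see it, and it stays there until VACUUM removes it.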

What Happens When Cleanup Falls Behind

The easiest way to understand MVCC in production is not to focus on how row versions are created but on what happens when they stop getting cleaned up efficiently.

When I'm troubleshooting PostgreSQL performance, one of the first things I look at is whether VACUUM is keeping pace with update activity. Because PostgreSQL stores historical versions inside the table, every update contributes to a growing collection of dead tuples. Under normal conditions, VACUUM reclaims that space and keeps storage growth under control. When VACUUM falls behind, however, dead tuples accumulate inside tables, indexes grow larger than necessary, storage consumption increases, and query performance gradually deteriorates as the database works through increasingly bloated structures.

MySQL experiences a similar challenge, but the symptoms appear in a different location. Since historical versions live inside undo logs rather than tables, the focus shifts from table bloat to undo growth. Purge threads are responsible for removing obsolete undo records once they are no longer needed. When purge activity cannot keep up with write volume, undo history grows, version chains become longer, and the database spends more effort reconstructing historical row versions.

The interesting part is that both databases are paying the same operational cost. They're simply paying it in different places.

A useful comparison looks like this:

Production Symptom	PostgreSQL	MySQL InnoDB
Storage Growth	Table Bloat	Undo Growth
Cleanup Process	VACUUM	Purge Threads
Cleanup Delay Cause	Dead Tuples Remain	Undo Records Remain
Performance Impact	Larger Tables and Indexes	Longer Version Chains
Maintenance Focus	Vacuum Health	Purge Efficiency
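A rough way to check each symptom in the table is to query the statistics views that ship with both databases. This is a sketch rather than a monitoring setup: the row limit is arbitrary, n_dead_tup is an estimate, and what counts as too many dead tuples or too long a history list depends on the workload.

```sql
-- PostgreSQL: tables carrying the most dead tuples, with the last time
-- a manual VACUUM or autovacuum processed each one.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- MySQL InnoDB: length of the undo history list, meaning undo that
-- purge has not yet processed.
SELECT NAME, `COUNT`
FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';
```

The same InnoDB figure appears as "History list length" in the output of SHOW ENGINE INNODB STATUS. In both databases the trend over time is usually more informative than a single reading.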

This is one reason I've stopped evaluating MVCC implementations based solely on architecture diagrams. Where versions are stored determines where the symptoms show up. What matters in production is how effectively those versions are cleaned up once workloads start generating millions of row changes.

Why Long-Running Transactions Cause Problems in Both Databases

One of the few things PostgreSQL and MySQL administrators generally agree on is that long-running transactions are dangerous.

The implementation details differ, but the operational outcome is remarkably similar.

In PostgreSQL, an open transaction can prevent dead tuples from being removed because those older row versions may still be visible to the transaction. In MySQL, an open transaction can prevent purge threads from removing undo records because those historical versions may still be required for consistent reads.

Different storage mechanisms ultimately lead to the same result: historical versions remain alive longer than expected.

I've seen PostgreSQL environments struggle with table bloat because a reporting query remained open for hours. I've seen MySQL environments accumulate enormous undo histories for exactly the same reason. The symptoms looked different. The root cause was identical.

This is why transaction duration is one of the first metrics I investigate when MVCC-related issues appear. Engineers often focus on CPU utilization, memory pressure, or indexing strategies when performance degrades. Sometimes the real problem is a transaction that has quietly prevented cleanup operations from doing their job.
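Transaction age can be read from the system views in both databases. The queries below are a minimal sketch; the row limit and the 60-character truncation are arbitrary choices, and seeing other users' sessions may require elevated privileges.

```sql
-- PostgreSQL: open transactions, oldest first. Sessions in the
-- 'idle in transaction' state are the usual suspects.
SELECT pid, usename, state, xact_start,
       now() - xact_start AS xact_age,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- MySQL InnoDB: open transactions, oldest first.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 10;
```

If the oldest transaction has been open far longer than the application's normal request time, it is a reasonable first suspect for delayed cleanup.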

Concurrency Is Similar. Locking Behavior Isn't Always the Same.

One area where engineers occasionally get caught off guard is transaction isolation.

Both PostgreSQL and MySQL use MVCC to reduce reader-writer contention, but they do not always achieve consistency using the same mechanisms. PostgreSQL relies heavily on row-version visibility and snapshot semantics. InnoDB combines MVCC with next-key locking (a record lock plus a lock on the gap before it) to prevent phantom rows for locking reads and writes under REPEATABLE READ, its default isolation level.

The result is that identical workloads can sometimes exhibit different contention patterns even when schemas and queries remain unchanged. Most applications never notice these differences. High-concurrency transactional systems often do.

When I'm evaluating database behavior under load, I pay attention not just to MVCC itself but also to how MVCC interacts with locking and isolation-level choices. Those interactions frequently explain why a workload behaves differently after moving between PostgreSQL and MySQL.
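One way to see the difference is a two-session experiment. The sketch below assumes a hypothetical orders table with primary key id and a status column, no existing row with id 150, and REPEATABLE READ in both databases. The outcomes in the comments are what the documented locking behavior predicts, not measured results.

```sql
-- Session 1: lock a range of rows and keep the transaction open.
-- (On PostgreSQL, start with BEGIN ISOLATION LEVEL REPEATABLE READ,
-- because its default isolation level is READ COMMITTED.)
BEGIN;
SELECT id FROM orders WHERE id BETWEEN 100 AND 200 FOR UPDATE;

-- Session 2: insert a new row inside that range.
INSERT INTO orders (id, status) VALUES (150, 'new');
-- InnoDB: expected to wait, because session 1 holds next-key locks
--         that cover the gaps in the range.
-- PostgreSQL: expected to proceed, because FOR UPDATE locks only rows
--             that already exist. Session 1 still does not see the
--             new row in its snapshot.

-- Session 1: finish the transaction and release the locks.
COMMIT;
```

The same statements, schema, and isolation level produce a blocked writer in one database and not in the other, which is the kind of contention difference that shows up after a migration.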

Which Approach Is Better?

After working with both databases, I've stopped thinking about MVCC implementations in terms of better or worse. They're different engineering trade-offs.

PostgreSQL keeps version history close to the data and depends on VACUUM to maintain storage health. MySQL keeps version history in undo records and depends on purge processes to prevent historical state from accumulating. Neither approach is inherently superior. Each shifts the maintenance burden to a different part of the system.

A practical way to think about it is:

If You Care Most About...	Often Favored Approach
Visibility Simplicity	PostgreSQL
Smaller Table Footprint	MySQL InnoDB
Vacuum-Based Maintenance	PostgreSQL
Undo-Based Version Management	MySQL InnoDB

This distinction becomes increasingly important as workloads scale. In managed database environments, storage growth, maintenance overhead, and transaction behavior directly influence operational costs and long-term performance. Cloud platforms such as AceCloud support both PostgreSQL and MySQL deployments, but understanding how each database manages row versions often has a greater impact on performance than simply selecting larger infrastructure.

The longer I work with PostgreSQL and MySQL, the less I think about MVCC as a concurrency feature and the more I think about it as a cleanup strategy.

Both databases allow readers and writers to coexist efficiently. Both deliver excellent concurrency under demanding workloads. The real difference lies in how historical row versions are stored and how aggressively they must be maintained. PostgreSQL stores those versions inside the table and depends on VACUUM to reclaim space. MySQL stores them in undo logs and depends on purge operations to keep history under control.

Everything else (table bloat, undo growth, maintenance overhead, long-transaction behavior, and storage efficiency) flows from that decision. When production issues appear, that's usually the difference that matters most.

Kubernetes Networking for High-Performance Computing: CNI, Overlays, and Latency

Daya Shankar — Mon, 29 Jun 2026 08:27:15 +0000

One of the most frustrating Kubernetes performance investigations I've worked on started with a simple assumption: the cluster needed more compute. Additional nodes were added, more powerful processors were introduced, and GPU capacity was expanded, yet application performance barely improved. At first, the numbers didn't make sense because CPU utilization wasn't saturated, memory wasn't constrained, and storage performance looked healthy. The infrastructure appeared to have everything the workload needed.

The bottleneck turned out to be neither compute nor storage. It was the networking layer connecting the workloads together.

What made the issue difficult to identify was that Kubernetes appeared to be functioning perfectly. Pods were healthy, services were reachable, and the cluster looked stable from an operational perspective. The problem only became visible when I started examining how frequently workloads were communicating with each other and how much time was being spent moving data between nodes.

I've seen variations of this problem repeatedly in high-performance computing environments. Teams spend significant time evaluating processors, GPUs, storage systems, and autoscaling policies because those resources are visible and easy to measure. Networking often receives attention much later, usually after performance improvements begin producing diminishing returns. By that stage, the workload is no longer limited by how quickly individual nodes process information. It is limited by how efficiently those nodes exchange information with each other.

That distinction is important because many high-performance workloads spend almost as much time communicating as they do computing. Once that happens, networking stops being background infrastructure and becomes part of the application's performance profile.

Why HPC Workloads Expose Networking Problems Faster

One thing I've learned over the years is that not every Kubernetes workload experiences networking in the same way. A typical web application may tolerate a few additional milliseconds of latency without creating a meaningful impact on user experience because most requests are relatively independent. The majority of processing time is often spent inside the application itself rather than moving data between services.

High-performance computing environments operate very differently.

When I'm evaluating HPC workloads, one of the first things I look at is communication behavior because performance is often determined as much by data movement as by compute power. Distributed simulations, scientific computing applications, MPI-based clusters, large-scale analytics platforms, and GPU-powered computing environments all share a similar characteristic: they constantly exchange information between nodes.

Common examples include:

Distributed simulation environments

Scientific computing applications

MPI clusters

Large-scale analytics workloads

GPU computing clusters

Parallel processing frameworks

What I've noticed is that these workloads spend a substantial portion of their execution time synchronizing processes, exchanging datasets, coordinating distributed operations, or moving intermediate results across the network. Whether it's an MPI simulation distributing tasks across dozens of nodes or a GPU cluster synchronizing workloads between accelerators, communication becomes a critical part of execution.

This is why networking bottlenecks tend to appear much earlier in HPC environments than in traditional Kubernetes deployments. Small inefficiencies that go unnoticed in web applications become measurable performance constraints when multiplied across thousands or millions of communications. A few microseconds here and a few milliseconds there may seem insignificant in isolation, but over time they directly affect throughput, scalability, and execution times.

Once I understand how heavily a workload depends on communication, the networking architecture becomes far easier to evaluate.

Why the CNI Matters More Than Most Teams Realize

When I'm troubleshooting networking performance, the Container Network Interface (CNI) is one of the first components I examine because every packet entering or leaving a pod is influenced by decisions made at this layer. Over the years, I've found that many teams choose a CNI during cluster deployment and rarely revisit the decision. That approach works perfectly well until networking performance becomes part of application performance.

The CNI determines how pods connect to the network, how traffic moves throughout the cluster, and how networking policies are enforced. In many environments, those details remain largely invisible. In HPC environments, however, they can directly influence latency and communication efficiency.

Some of the most common options include:

CNI	Common Characteristics
Calico	Routing-focused, scalable, policy-rich
Cilium	eBPF-based networking and security
Flannel	Simplicity and ease of deployment
Weave Net	Overlay-focused networking
Multus	Multiple network interfaces per pod

What I've learned is that every CNI makes trade-offs. Some prioritize simplicity and operational ease. Others emphasize policy enforcement, observability, advanced routing capabilities, or performance optimization. The right choice depends entirely on workload requirements.

This is why I rarely think about CNI selection as a networking decision alone. In high-performance environments, it is fundamentally a workload decision because the networking model influences how efficiently applications communicate.
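As one illustration of a workload-driven choice, Multus can attach a second interface to a pod alongside the cluster's default network. The following is a minimal sketch, assuming Multus and the macvlan CNI plugin are already installed. The names hpc-fabric and mpi-worker, the host interface eth1, the subnet, and the image are all hypothetical.

```yaml
# Secondary network definition read by Multus.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: hpc-fabric
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.10.0/24"
      }
    }
---
# Pod that requests the extra interface through an annotation.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: hpc-fabric
spec:
  containers:
    - name: worker
      image: example.com/mpi-worker:latest
```

The pod keeps its normal cluster interface for services and policy. Whether the second interface actually improves latency depends on the workload and the physical network behind it.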

Where Latency Actually Comes From

One misconception I encounter frequently is the belief that latency is a single problem with a single cause. In practice, latency is usually the cumulative result of multiple infrastructure decisions working together.

When a packet moves between two Kubernetes workloads, it may pass through several layers before reaching its destination:

Overlay encapsulation

Routing decisions

Network policy enforcement

Service proxying

Node-to-node communication

Physical network infrastructure

Individually, each layer introduces only a small amount of overhead. Collectively, those delays can become significant for workloads that exchange large amounts of data or require constant synchronization.

What I've learned is that overlay networking is often where the trade-off between operational simplicity and performance becomes most visible. Overlay networks create virtual networking layers on top of the physical infrastructure, making clusters easier to deploy and manage while abstracting many networking complexities. For general-purpose Kubernetes environments, that trade-off often makes perfect sense.

Communication-heavy workloads tend to expose the cost of that abstraction more quickly.

A simplified comparison looks like this:

Networking Model	Characteristics
Overlay Network	Greater abstraction, easier deployment
Underlay Network	Direct infrastructure connectivity, lower overhead

This does not mean overlay networking is inherently wrong. I've seen many production environments operate successfully using overlay architectures. The important point is understanding the trade-off. Overlay networking often improves flexibility and operational simplicity. Underlay approaches frequently reduce overhead and improve communication efficiency.

One of the most common mistakes I've seen is selecting a networking model before understanding workload communication patterns. When that happens, networking decisions are driven by deployment convenience rather than application performance requirements.

How I Evaluate Kubernetes Networking for HPC

When I'm designing Kubernetes networking for performance-sensitive environments, I spend far less time comparing feature lists and far more time understanding how workloads communicate.

The questions I usually ask are:

How frequently do workloads communicate?

How sensitive are they to latency?

How much east-west traffic exists between nodes?

Are workloads exchanging large datasets or small messages?

Do they require direct network access?

Is operational simplicity more important than performance?

The answers usually determine where optimization efforts should focus.

I've found that many networking debates disappear once communication patterns become clear. Some workloads benefit significantly from networking optimizations because communication directly affects execution time. Others see little measurable improvement because networking is not their primary constraint.

This is one reason networking has become a much larger consideration in modern HPC environments. Running distributed analytics platforms, simulation workloads, and GPU-powered computing clusters requires more than fast processors and powerful accelerators. It requires infrastructure where cloud networking performance scales alongside compute performance. Cloud platforms such as AceCloud support high-performance Kubernetes deployments because improving CPUs or GPUs alone rarely solves communication bottlenecks. If data cannot move efficiently between workloads, additional compute resources often deliver diminishing returns.

The most effective environments I've worked with do not treat networking as supporting infrastructure. They treat it as a performance-critical component of the workload itself.

The longer I work with Kubernetes, the less I think about networking as a connectivity problem and the more I think about it as a performance problem.

Most clusters can successfully connect pods and services. The real question is whether they can do so efficiently enough for the workloads they support. Traditional application environments may never expose networking bottlenecks because communication patterns are relatively simple. High-performance computing environments are different. They amplify every inefficiency because workloads spend so much time exchanging information across the network.

What I've learned is that networking performance is rarely determined by a single technology choice. It is the combined result of workload communication patterns, CNI architecture, overlay design, latency characteristics, and infrastructure decisions made throughout the stack. When those pieces align, applications scale more predictably and infrastructure investments produce the performance gains teams expect.

In my experience, the most successful Kubernetes networking strategies begin with understanding how workloads communicate. Once that foundation exists, decisions around CNIs, overlays, and latency optimization become far easier to make and far more likely to deliver meaningful results.

Pod Scheduling for Mixed Workloads: CPU, GPU, and Memory-Optimized Nodes

Daya Shankar — Mon, 29 Jun 2026 08:20:07 +0000

One of the most expensive Kubernetes environments I've worked with wasn't short on CPU, memory, or GPU capacity. In fact, most dashboards suggested the cluster was healthy. Applications were running, autoscaling was active, and resource utilization appeared reasonable.

The problem was that workloads were landing on the wrong infrastructure.

Lightweight APIs were running on GPU nodes. Memory-intensive applications were competing for general-purpose resources. GPU workloads occasionally waited for capacity while expensive accelerator nodes hosted services that didn't need them. Nothing was technically broken, but the cluster was operating far less efficiently than it should have.

What I've learned is that mixed-workload Kubernetes environments introduce a scheduling challenge that doesn't exist in simpler clusters. Once CPU-intensive services, memory-heavy applications, and GPU-powered workloads begin sharing the same platform, Kubernetes needs help understanding where each workload belongs.

That's when pod scheduling stops being an operational detail and becomes one of the most important infrastructure decisions in the cluster.

Why Mixed Workloads Break Simple Scheduling Assumptions

When I'm evaluating a mixed-workload cluster, the first question I ask isn't how many nodes exist. It's whether workloads are consuming the type of infrastructure they were designed for. The Kubernetes scheduler is very good at finding available capacity, but it has no built-in understanding of whether a database should run on a memory-optimized node or whether a lightweight API should consume GPU-backed infrastructure. Without explicit scheduling rules, the scheduler simply places workloads where resources are available.

That approach works reasonably well in small clusters where most nodes are identical. It becomes far less effective once different node types enter the environment.

Today, it's common to see Kubernetes clusters running a combination of CPU-optimized, memory-optimized, GPU-enabled, and general-purpose nodes. The reason is straightforward: different workloads stress different resources.

A web API may consume significant CPU resources while using relatively little memory. A Redis deployment or database may care far more about memory availability than processor performance. Analytics platforms often need large amounts of both. GPU inference services depend almost entirely on accelerator resources.

A simple mapping looks like this:

Workload Type	Best Node Type
APIs & Microservices	CPU-Optimized
Web Applications	CPU-Optimized
Databases	Memory-Optimized
Redis & Caching Layers	Memory-Optimized
Analytics Jobs	CPU + Memory Optimized
GPU Inference	GPU Nodes
Training Workloads	GPU Nodes

What I've found is that mixed-workload clusters become inefficient the moment these workloads start competing for infrastructure that wasn't designed for their requirements. Kubernetes may successfully schedule the workload, but successful scheduling and efficient scheduling are not the same thing.

What Happens When Workloads Land on the Wrong Nodes

One thing I've learned over the years is that Kubernetes and platform teams often define success differently. Kubernetes considers scheduling successful when a workload lands on a node that satisfies its requirements. Platform teams care whether that workload landed on the most appropriate infrastructure.

Those are not always the same thing.

CPU Workloads Running on GPU Nodes

One of the most common scheduling mistakes I've seen is lightweight services landing on GPU infrastructure simply because capacity exists.

The application runs perfectly. Users notice nothing. Monitoring systems remain green.

The problem is that an expensive GPU node is now hosting a workload that gains no value from accelerator hardware. Every hour that workload occupies GPU-backed infrastructure, specialized resources become unavailable for workloads that actually require them.

This is particularly common in clusters where GPU nodes are not adequately isolated through scheduling controls.

Memory-Intensive Applications Running on CPU Infrastructure

I've also seen the opposite problem.

Databases, Elasticsearch clusters, and caching platforms often end up running on infrastructure optimized primarily for compute. CPU utilization remains relatively low while memory pressure becomes the dominant bottleneck.

Teams often respond by scaling horizontally or adding additional nodes. In reality, the workload may simply be running on infrastructure that doesn't match its resource profile.

The result is unnecessary infrastructure growth driven by poor placement rather than genuine demand. This is why comparing standard, compute-optimized and memory-optimized instances matters before teams design node pools for mixed workloads.

GPU Workloads Competing with General-Purpose Services

GPU workloads are usually the most sensitive to scheduling decisions because accelerator resources are finite and significantly more expensive than standard compute infrastructure.

When inference services, training workloads, and general-purpose applications share the same scheduling pool, resource contention becomes difficult to predict. I've seen GPU workloads wait for capacity while non-GPU services occupied nodes that should have been reserved for accelerator-dependent applications.

What makes these situations difficult to diagnose is that Kubernetes is often behaving exactly as expected. The scheduler is successfully placing workloads. The issue is that it wasn't given enough information to make infrastructure-aware decisions.

Over time, the consequences become visible:

Specialized nodes remain underutilized

General-purpose infrastructure becomes overloaded

Pending workloads increase

Autoscaling adds additional nodes

Infrastructure costs rise faster than workload demand

Many teams initially respond by adding capacity. In my experience, the better solution is usually improving placement.

How I Control Placement in Mixed-Workload Clusters

Once a cluster contains specialized node pools, I stop thinking about scheduling as a Kubernetes feature and start thinking about it as infrastructure governance.

The objective is simple: ensure workloads consume the resources they were designed for and prevent them from occupying resources they don't need.

When I'm building mixed-workload environments, I rely heavily on three Kubernetes scheduling controls:

Node Labels

Node Affinity

Taints and Tolerations

The first step is making the infrastructure describe itself.

For example:

node-role=cpu
node-role=memory
node-role=gpu

If Kubernetes cannot distinguish a GPU node from a memory-optimized node, I cannot reasonably expect workloads to land in the right place.

Once nodes are labeled, workloads can express placement requirements using node affinity:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role
              operator: In
              values:
                - gpu

This ensures GPU-dependent workloads are scheduled only on nodes labeled node-role=gpu. On its own, it does nothing to keep other workloads off those nodes.

For highly specialized resources, I usually add taints and tolerations as an additional layer of protection. GPU nodes are a perfect example because they represent some of the most expensive resources in the cluster. Allowing general-purpose workloads onto those nodes simply because capacity happens to be available often creates significant waste.
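A taint on the node and a matching toleration on the pod might look like the sketch below. The node name, pod name, and image are hypothetical, and reusing the node-role key for the taint is an assumption rather than a requirement. The Node fragment only shows where the fields live; in practice taints are usually applied through node pool settings or kubectl rather than by editing the Node object.

```yaml
# GPU node: the taint repels every pod without a matching toleration.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    node-role: gpu
spec:
  taints:
    - key: node-role
      value: gpu
      effect: NoSchedule
---
# GPU workload: the toleration allows it onto tainted nodes, and the
# node affinity keeps it off every other node type.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  tolerations:
    - key: node-role
      operator: Equal
      value: gpu
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role
                operator: In
                values:
                  - gpu
  containers:
    - name: inference
      image: example.com/inference:latest
```

The two controls cover different directions: affinity pulls the GPU workload toward GPU nodes, while the taint pushes everything else away from them.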

I've found that labels, affinity rules, and taints work best when they're treated as infrastructure guardrails rather than scheduling features. Their purpose is not simply to influence placement. Their purpose is to prevent costly mistakes before they happen.

What Mature Kubernetes Environments Do Differently

One pattern I've consistently noticed is that mature Kubernetes environments rarely design infrastructure around nodes. They design infrastructure around workload behavior.

Before defining node pools, scheduling policies, or autoscaling rules, they first understand how applications consume resources.

That usually means evaluating:

CPU utilization patterns

Memory consumption trends

GPU utilization metrics

Scaling behavior

Performance requirements

Infrastructure costs

Only after those patterns become clear do scheduling decisions start making sense.

This is one reason infrastructure planning has become increasingly workload-aware. Running APIs, databases, analytics platforms, caching layers, and GPU-powered services inside the same Kubernetes environment requires more than simply adding nodes. It requires matching workloads with infrastructure characteristics.

Cloud platforms such as AceCloud support CPU-optimized, memory-optimized, and GPU-backed environments because modern Kubernetes deployments rarely fail due to a lack of resources. More often, they struggle because resources are being consumed inefficiently. For GPU-heavy clusters, a deeper look at multi-GPU orchestration in Kubernetes can help when scheduling moves beyond simple node selection and into queues, retries, accelerators, and workload-aware placement.

The challenge is not giving every workload access to every node. The challenge is ensuring each workload lands on infrastructure that reflects how it actually consumes resources.

The most efficient clusters I've worked with are not necessarily the ones with the most resources. They're the ones where workloads consistently consume the right resources.

The longer I work with Kubernetes, the less I think about scheduling as a process of finding available capacity and the more I think about it as a process of matching workloads to infrastructure.

CPU-intensive services, memory-heavy applications, and GPU-powered workloads all place different demands on the cluster. Kubernetes can schedule them successfully, but successful scheduling is not always efficient scheduling. Without deliberate placement policies, workloads gradually drift toward whatever capacity happens to be available, and the result is often lower utilization, higher costs, and more infrastructure than the environment actually needs.

In my experience, the most efficient mixed-workload clusters aren't the ones with the most nodes. They're the ones where every workload consistently lands on the node type it was designed for. That's where scheduling stops being a Kubernetes feature and starts becoming a competitive advantage in how infrastructure is operated.

Kubernetes for Stateful vs Stateless Workloads: Storage and Networking Implications

Daya Shankar — Mon, 29 Jun 2026 07:42:27 +0000

The difference between stateful and stateless workloads becomes obvious the first time Kubernetes restarts something important.

If a stateless API pod disappears, Kubernetes creates another one.

Traffic moves.

The application keeps running.

But if a database pod disappears, the conversation changes immediately.

Now nobody is just thinking about containers.

They are thinking about data, replication, recovery, consistency and what might break if the wrong pod comes back in the wrong way.

That is the real difference.

Not whether the workload runs in Kubernetes.

Both do.

The real question is simpler:

If this pod disappears, what needs to survive?

If the answer is “almost nothing,” the workload is probably stateless.

If the answer includes data, identity, ordering or cluster membership, the workload is stateful.

And once that happens, Kubernetes is no longer only keeping containers running. It is helping preserve relationships the application depends on.

Stateless Workloads Work Because Pods Are Disposable

Stateless applications usually do not care which pod handles a request.

Most web apps, APIs, backend services and microservices are designed this way.

The pod is just a runtime.

If Kubernetes removes it during a node failure, rolling update or scaling event, another pod can replace it.

User sessions, business data and long-term state usually live somewhere else.

That is why Kubernetes feels natural for stateless workloads.

The platform can make scheduling decisions without carrying much application history.

Scale up?

Add replicas.

Upgrade?

Replace pods gradually.

Node failure?

Start the pod somewhere else.

Characteristic	Stateless workloads	Stateful workloads
Pod identity	Disposable	Stable or meaningful
Storage dependency	Minimal	Critical
Scaling	Usually straightforward	Data-aware
Failure recovery	Replace the pod	Recover pod and data safely
Network identity	Usually irrelevant	Often required
Examples	APIs, web apps, microservices	Databases, Kafka, Elasticsearch

The moment the workload expects Kubernetes to preserve something, the model changes.

Stateful Workloads Change the Storage Conversation

Stateful workloads rarely become difficult because of containers.

They become difficult because the application cares about what survives after the container restarts.

A database pod is not only running a process.

It is attached to data that must survive rescheduling, upgrades, node failures and maintenance windows.

This is why Persistent Volumes and Persistent Volume Claims become foundational for stateful workloads.

A PVC might look simple:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-storage
spec:
  accessModes:
    - ReadWriteOnce   # mounted read-write by a single node at a time
  resources:
    requests:
      storage: 100Gi  # requested capacity for the claim

But the YAML is not the important part.

The relationship is.

Kubernetes is now managing how a pod connects to persistent storage.

That means storage performance, access modes, backup policies, snapshots and recovery procedures all become part of the application architecture.

This is also where many teams underestimate the operational side.

Creating a volume is easy.

Restoring the right version of that volume after failure is the hard part.

For multi-volume applications, teams should also think carefully about Kubernetes CSI volume snapshots, because independent snapshots can create consistency problems when data, logs and write-ahead records live on separate PVCs.

Stateful Workloads Also Change Networking

Storage is usually the first problem teams notice.

Networking is usually the second.

Stateless services generally need to reach any healthy pod.

A Kubernetes Service works well because the individual pod identity does not matter.

Stateful systems are different.

Database replicas, search clusters and message brokers may need stable names, predictable identities and ordered membership.

In those cases, the workload does not only need connectivity.

It needs identity.

This is where StatefulSets matter.

Instead of creating interchangeable pods, StatefulSets give pods stable ordinal identities such as:

mysql-0
mysql-1
mysql-2

That may look like a small naming detail.

It is not.

Stable identities help distributed systems maintain replication relationships, leader election, cluster membership and predictable communication patterns.

For some workloads, headless Services are also used so applications can discover individual pods directly instead of only reaching a load-balanced service address.
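A minimal sketch of what that direct discovery looks like, assuming a StatefulSet named mysql governed by a headless Service also named mysql, in the default namespace, with the default cluster.local cluster domain. All four values are illustrative assumptions. Each pod gets a predictable DNS name built from its ordinal:

```python
def statefulset_pod_dns(statefulset, service, namespace, replicas,
                        cluster_domain="cluster.local"):
    """Return the stable per-pod DNS names exposed through a headless Service."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Hypothetical names: StatefulSet "mysql", headless Service "mysql".
names = statefulset_pod_dns("mysql", "mysql", "default", replicas=3)
for name in names:
    print(name)
```

Because the names do not change when a pod is rescheduled, a replica can keep pointing at the same primary address across restarts.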

Scaling Stateful Workloads Is Not Just Adding Replicas

Stateless scaling is usually simple.

Add more replicas.

Let the Service send traffic to them.

Stateful scaling is slower and more deliberate.

A new database replica is not just another pod.

Storage may need to be provisioned.

Data may need to sync.

Cluster membership may need to update.

The new instance may not be ready to serve traffic until recovery or replication catches up.

This is why I do not treat storage, networking and scaling as separate decisions for stateful workloads.

They are connected.

Once data must survive, identity often must survive.

Once identity must survive, scaling becomes more careful.

The Operational Trade-Off Teams Discover Late

Deploying a stateful app on Kubernetes is not the hardest part anymore.

Operators, storage integrations and managed platforms have made deployment much easier.

The real challenge appears later.

Backups.

Failover.

Restore testing.

Version upgrades.

Replication lag.

Data consistency during maintenance.

Operational area	Stateless workloads	Stateful workloads
Scaling	Add replicas	Add replicas and sync data
Failover	Replace pod	Recover service and data
Upgrades	Rolling updates	Coordinated updates
Networking	Service-based access	Identity-aware access
Storage	Often external	Part of reliability design

Most incidents do not happen because Kubernetes cannot run the workload.

They happen because assumptions were never tested.

Will the backup restore cleanly?

Can the replica catch up after a node failure?

What is the actual RPO?

How long does failover take?

If those questions matter to the business, the workload is not just stateful.

It is operationally sensitive.

For that reason, stateful systems also need a clear cross-region disaster recovery strategy when downtime or data loss would directly affect customers.
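To make the RPO and failover questions concrete, here is a rough sketch of how I would put numbers on them. It assumes a simplified model: without a replica, the worst-case data loss is the time since the last backup, and with an asynchronous replica that survives the failure, it is the replication lag at that moment. The function names and all timings are hypothetical.

```python
def worst_case_rpo_seconds(backup_interval_s, replication_lag_s=None):
    """Estimate the worst-case data-loss window under the simplified model."""
    if replication_lag_s is not None:
        # A surviving asynchronous replica is behind by its current lag.
        return replication_lag_s
    # Backups only: everything written since the last backup can be lost.
    return backup_interval_s


def measured_rto_seconds(detect_s, promote_s, reconnect_s):
    """Failover time is the sum of the steps, not just the pod restart."""
    return detect_s + promote_s + reconnect_s


# Hypothetical figures: hourly backups, 5 s replication lag.
rpo_backup_only = worst_case_rpo_seconds(backup_interval_s=3600)
rpo_with_replica = worst_case_rpo_seconds(3600, replication_lag_s=5)
rto = measured_rto_seconds(detect_s=30, promote_s=45, reconnect_s=15)
print(rpo_backup_only, rpo_with_replica, rto)
```

The real numbers only come from failure drills. The sketch is useful mainly for showing which inputs need to be measured.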

What Should You Check Before Choosing an Architecture?

Do not start with Deployments, StatefulSets or storage classes.

Start with application behaviour.

Ask:

Can the workload tolerate pod replacement?
Does data need to survive rescheduling?
Does the application need stable network identity?
Can instances be treated as interchangeable?
What happens during node failure?
How difficult is recovery if a pod disappears permanently?
What RPO and RTO does the business expect?

The answers usually reveal the architecture.

If pods are disposable, a Deployment and Service may be enough.

If storage and identity must survive, the design needs Persistent Volumes, StatefulSets, backup workflows and recovery testing.
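The checklist can be sketched as a small decision helper. This is an illustration of the reasoning above, not a rule engine: the class, the function and the returned labels are hypothetical, and real designs involve more inputs than three booleans.

```python
from dataclasses import dataclass


@dataclass
class WorkloadAnswers:
    data_must_survive_rescheduling: bool
    needs_stable_network_identity: bool
    instances_interchangeable: bool


def suggest_building_blocks(a):
    """Map checklist answers to the Kubernetes building blocks named above."""
    disposable = (
        not a.data_must_survive_rescheduling
        and not a.needs_stable_network_identity
        and a.instances_interchangeable
    )
    if disposable:
        return ["Deployment", "Service"]

    blocks = ["StatefulSet", "backup workflow", "restore testing"]
    if a.data_must_survive_rescheduling:
        blocks.insert(1, "PersistentVolumeClaim")
    if a.needs_stable_network_identity:
        blocks.insert(1, "headless Service")
    return blocks


api = suggest_building_blocks(WorkloadAnswers(False, False, True))
database = suggest_building_blocks(WorkloadAnswers(True, True, False))
print(api)
print(database)
```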

Final Thought

I do not think about stateful and stateless workloads as just application categories.

I think about them as operational commitments.

Stateless workloads ask Kubernetes to keep applications running.

Stateful workloads ask Kubernetes to keep applications running while preserving data, identity and consistency.

That extra responsibility changes storage, networking, scaling and recovery planning.

So before choosing the architecture, come back to the simple question:

If this pod disappears, what needs to survive?

The answer usually tells you how complex the platform really needs to be.

GPU Resource Requests and Limits in Kubernetes: Why Default Settings Break Production

Daya Shankar — Mon, 29 Jun 2026 07:31:50 +0000

One of the strangest GPU incidents I have seen involved a Kubernetes cluster that looked underused and overloaded at the same time.

GPU utilisation was below 40%.

The dashboards looked healthy.

But new workloads were stuck in Pending. Application teams wanted more GPU nodes. The autoscaler kept expanding the cluster.

At first, the numbers did not make sense.

If the GPUs were not fully used, why could Kubernetes not schedule more work?

The answer was not compute capacity.

It was allocation.

And this is where many GPU Kubernetes clusters quietly become expensive.

Kubernetes Does Not Treat GPUs Like CPUs

Most teams understand CPU requests and limits reasonably well.

GPUs are different.

In Kubernetes, GPUs are scheduled as extended resources and are requested in whole-number units. The official Kubernetes GPU scheduling documentation states that GPUs must be specified in limits, and if requests are also specified, requests and limits must be equal.

That means a pod asking for one GPU is not asking for 30% of a GPU.

It is asking for one whole GPU device.

So, if a workload requests:

resources:
  limits:
    nvidia.com/gpu: 1  # whole devices only; the request defaults to this limit

Kubernetes treats one GPU as allocated to that pod.

Even if the workload only uses 25% or 40% of the GPU in practice.

This is the part that surprises many teams.

A GPU can be partially utilised and fully allocated at the same time.

Why Default GPU Requests Break Production

There is nothing technically wrong with requesting one GPU.

For model training, large inference jobs or workloads that genuinely need exclusive access, it may be exactly the right configuration.

But many production workloads do not use an entire GPU all the time.

Small inference services, development notebooks, internal tools and batch jobs often consume only a fraction of available compute.

Still, once they request one GPU, Kubernetes reserves the whole device.

That creates a strange situation.

8 GPUs are available
8 pods request one GPU each
Average GPU utilisation stays below 40%
New GPU pods remain pending

From the dashboard, the cluster looks underused.

From the Kubernetes scheduler's point of view, the cluster is full.

Both views are true.

And that is the problem.
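The two views can be put side by side in a few lines. This is a sketch of the scenario above, with made-up per-GPU utilisation figures; the function name and the numbers are assumptions for illustration only.

```python
def gpu_snapshot(total_gpus, gpus_requested_per_pod, utilisation_per_gpu):
    """Compare what the scheduler sees (allocation) with what dashboards see."""
    allocated = sum(gpus_requested_per_pod)
    free = total_gpus - allocated
    avg_utilisation = sum(utilisation_per_gpu) / len(utilisation_per_gpu)
    return {"allocated": allocated, "free": free,
            "avg_utilisation": avg_utilisation}


snap = gpu_snapshot(
    total_gpus=8,
    gpus_requested_per_pod=[1] * 8,  # eight pods, one whole GPU each
    utilisation_per_gpu=[0.35, 0.40, 0.25, 0.30, 0.38, 0.33, 0.28, 0.36],
)

# The scheduler only checks whether a whole device is free.
can_schedule_new_gpu_pod = snap["free"] >= 1
print(snap, can_schedule_new_gpu_pod)
```

Average utilisation stays under 40%, yet there is no free device, so the next GPU pod stays Pending.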

Why Autoscaling Can Make the Problem More Expensive

Once pods remain pending, node autoscaling usually enters the conversation.

A node autoscaler provisions or consolidates nodes so the cluster has enough capacity for workloads, as described in Kubernetes’ node autoscaling documentation.

So if a pod needs one GPU and no allocatable GPU is available, adding a GPU node can be the correct autoscaler response.

Technically, the autoscaler is doing its job.

But here is the catch.

The cluster may not need more GPU compute.

It may need a better allocation strategy.

If every lightweight workload reserves a full device, autoscaling solves the scheduling problem by buying more capacity. It does not fix the utilisation problem.

That is how teams end up with larger GPU fleets, higher cloud bills and only a small improvement in actual throughput.

If the problem appears repeatedly across training, inference and batch workloads, the issue is no longer only resource requests.

It becomes a GPU orchestration problem.

At that point, teams may need workload-aware scheduling, queues, node pools and retry policies rather than simply adding more GPU nodes. A deeper look at multi-GPU orchestration in Kubernetes can help when the cluster is moving beyond simple one-pod-one-GPU scheduling.

The Metric That Misleads Teams

GPU utilisation is useful.

But by itself, it can be misleading.

A low utilisation number does not automatically mean Kubernetes can place another GPU workload on the node.

You also need to check:

GPU allocation
GPU memory usage
Pending GPU pods
Node allocatable GPU count
Autoscaler activity
Workload concurrency

Looking only at utilisation is like seeing empty seats in a theatre after every ticket has already been sold.

The room looks available.

The booking system disagrees.

Kubernetes works more like the booking system.

How Should Teams Evaluate GPU Requests?

Before assigning a full GPU to a workload, ask a few practical questions:

How much GPU memory does the workload actually use?
Does it need exclusive GPU access?
What does utilisation look like at peak load?
Can it safely share a GPU with another workload?
Is the workload continuous or intermittent?
Does it need latency isolation?

These questions usually reveal the real shape of the workload.

A training job may need a dedicated GPU.

A production inference service may need one too, especially when latency is strict.

But a development notebook, lightweight batch job or small internal inference service may not.

Without that distinction, GPU requests become guesses.

And expensive guesses tend to scale badly.

What Do Efficient GPU Clusters Do Differently?

Efficient GPU Kubernetes clusters do not start by giving every workload a full device.

They start by measuring workload behaviour.

They monitor memory, utilisation, scheduling frequency, peak demand and contention.

Then they choose the allocation model.

Workload type	Common allocation approach
Large training jobs	Dedicated GPU
Latency-sensitive inference	Dedicated GPU or isolated slice
Small inference services	Shared GPU or time-slicing
Development environments	Shared GPU
Batch processing jobs	Time-sliced or shared GPU

The right answer depends on isolation needs.

If workloads need stronger isolation, NVIDIA Multi-Instance GPU can partition supported GPUs into separate GPU instances with dedicated compute and memory resources.

If workloads mainly need opportunistic sharing, NVIDIA GPU time-slicing can allow multiple workloads to share access to the same underlying GPU over time. Teams comparing shared and exclusive access models should also understand the trade-off between GPU time-slicing and passthrough before changing production allocation rules.

But these are not magic switches.

MIG, time-slicing and dedicated allocation all have trade-offs around isolation, scheduling flexibility, memory guarantees and operational complexity.

The point is not to share every GPU.

The point is to stop treating every workload as if it needs a full one.
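A rough sketch of what that means for node count. It assumes 24 lightweight pods, 8 GPUs per node and a sharing factor of 4 pods per GPU. All three numbers are hypothetical, and the sharing factor in particular has to be validated under real load, because sharing changes isolation and memory guarantees.

```python
import math


def gpu_nodes_needed(pods, gpus_per_node, pods_per_gpu=1):
    """Nodes required when each GPU can be handed to pods_per_gpu pods."""
    schedulable_per_node = gpus_per_node * pods_per_gpu
    return math.ceil(pods / schedulable_per_node)


# One whole GPU per pod versus an assumed 4-way share for small workloads.
exclusive_nodes = gpu_nodes_needed(pods=24, gpus_per_node=8)
shared_nodes = gpu_nodes_needed(pods=24, gpus_per_node=8, pods_per_gpu=4)
print(exclusive_nodes, shared_nodes)
```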

When Should You Use Dedicated GPUs?

Use dedicated GPUs when the workload needs predictable performance or consumes most of the device.

This is common for large training jobs, high-throughput model serving, strict latency workloads or applications with heavy GPU memory requirements.

In those cases, sharing can create more risk than savings.

Dedicated allocation may look less efficient on a dashboard, but it can be the right design when predictability matters.

When Should You Consider GPU Sharing?

Consider sharing when workloads are smaller, intermittent or tolerant of variable performance.

This often includes development environments, internal tools, lightweight inference services and batch jobs.

These workloads may not need an entire GPU reserved all day.

Sharing can improve utilisation and reduce unnecessary node growth.

But it should be tested under real load.

A setup that works for idle notebooks may not work for production inference with strict response-time targets.

The Real Question Before Adding More GPU Nodes

When GPU workloads start pending, the instinct is to add nodes.

Sometimes that is correct.

But before expanding the cluster, ask:

Are the existing GPUs actually being used, or are they just allocated?

That one question changes the troubleshooting path.

If GPUs are heavily utilised and memory is saturated, you probably need more capacity.

If GPUs are allocated but lightly used, the problem is allocation design.

More nodes may hide the issue for a while.

They will not fix it.
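That troubleshooting path can be sketched as a simple triage function. The 70% threshold and the function name are assumptions I picked for illustration; the inputs would come from whatever monitoring the cluster already has.

```python
def diagnose_pending_gpu_pods(allocated, allocatable, avg_utilisation,
                              avg_memory_used, busy_threshold=0.70):
    """Classify why GPU pods are pending, using allocation and usage together."""
    if allocated < allocatable:
        # Free devices exist, so something other than GPU count is blocking.
        return "free GPUs exist: check taints, selectors or other constraints"
    # Saturated memory alone is enough to rule out sharing a device.
    if avg_utilisation >= busy_threshold or avg_memory_used >= busy_threshold:
        return "capacity problem: GPUs are allocated and busy"
    return "reservation problem: GPUs are allocated but lightly used"


lightly_used = diagnose_pending_gpu_pods(8, 8, avg_utilisation=0.35,
                                         avg_memory_used=0.30)
saturated = diagnose_pending_gpu_pods(8, 8, avg_utilisation=0.90,
                                      avg_memory_used=0.85)
print(lightly_used)
print(saturated)
```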

Final Thought

Most GPU Kubernetes problems do not begin with a shortage of GPUs.

They begin with a mismatch between how workloads consume GPU resources and how Kubernetes allocates them.

A GPU can sit at 35% utilisation and still be unavailable to every new pod.

Once that makes sense, many scheduling, autoscaling and cost problems become easier to explain.

So before increasing autoscaling limits or buying more GPU nodes, check the allocation model.

You may not have a capacity problem.

You may have a reservation problem.

Best Managed Kubernetes Hosting in India in 2026

Daya Shankar — Wed, 17 Jun 2026 10:49:47 +0000

Most teams comparing managed Kubernetes hosting start with one question:

Which provider is the best?

But that depends on what you are trying to run.

A large enterprise may need deep IAM controls, multi-region recovery and integration with hundreds of cloud services.

A startup may care more about simple pricing, local support and a free control plane.

So instead of putting every provider into one list, it makes more sense to compare them by category.

First, the hyperscalers.

Then, India-based cloud providers.

And finally, global providers with Indian regions.

Managed Kubernetes Providers in India at a Glance

Provider	Category	Best suited for
Amazon EKS	Hyperscaler	AWS-based enterprise environments
Google Kubernetes Engine	Hyperscaler	Kubernetes automation and cloud-native teams
Azure Kubernetes Service	Hyperscaler	Microsoft and hybrid environments
Oracle Kubernetes Engine	Global provider with India region	Oracle and OCI workloads
AceCloud	India-based provider	Managed GPU Kubernetes and local support
Utho	India-based provider	Startups and cost-conscious teams
OVHcloud	Global provider with India region	Open cloud and network-heavy workloads
DigitalOcean Kubernetes	Global provider with India region	Simple developer-focused deployments

Best Hyperscaler Kubernetes Platforms in India

1. Amazon EKS: Best for AWS Environments

Amazon Elastic Kubernetes Service is usually the obvious choice when applications already use EC2, IAM, RDS, CloudWatch or other AWS services.

AWS manages the Kubernetes control plane, while teams can choose managed node groups, Fargate or EKS Auto Mode for worker infrastructure.

The biggest advantage is not EKS alone.

It is everything around EKS.

Networking, security, storage, databases and observability can remain inside the same cloud ecosystem.

But that flexibility also creates complexity. Control-plane charges, compute, NAT gateways, load balancers, storage and monitoring are billed separately.

Choose EKS when: your organisation already runs on AWS and needs deep enterprise integrations.

2. Google Kubernetes Engine: Best for Automation

Google Kubernetes Engine is one of the most mature managed Kubernetes platforms.

Standard mode gives infrastructure teams more control over nodes and cluster settings.

Autopilot manages more of the infrastructure, including node provisioning and scaling.

This makes GKE useful for teams that want Kubernetes flexibility without managing every underlying component.

Its strengths include release channels, workload identity, autoscaling and integration with Google Cloud’s data and AI services.

Choose GKE when: automation and a Kubernetes-first developer experience matter most.

3. Azure Kubernetes Service: Best for Microsoft-Centric Teams

Azure Kubernetes Service fits naturally into organisations using Microsoft Entra ID, Azure DevOps, Azure Monitor or Windows-based applications.

Its biggest advantage is ecosystem alignment.

Identity, policies, monitoring and CI/CD can remain connected to the Microsoft tools the organisation already uses.

AKS is also relevant for hybrid environments through Azure Arc.

Choose AKS when: Microsoft identity, development tools and hybrid infrastructure already shape your environment.

Best India-Based Managed Kubernetes Providers

4. AceCloud: Best for GPU Kubernetes and Managed Support

AceCloud Managed Kubernetes is designed for teams that want managed clusters without the operational overhead of running the control plane themselves.

Its offering includes a high-availability control plane, autoscaling, managed upgrades, monitoring and support.

The main differentiator is GPU infrastructure.

Teams can create GPU-enabled node groups for AI training, inference, analytics and other compute-heavy workloads.

This makes it relevant for Indian AI startups and enterprises that want Kubernetes and GPU capacity from the same provider.

Choose AceCloud when: GPU workloads, managed operations and local support are important.

5. Utho: Best for Cost-Conscious Indian Teams

Utho Managed Kubernetes focuses on simple deployment and transparent infrastructure pricing.

It offers a managed control plane, autoscaling, storage integrations and cloud infrastructure hosted in India.

Its smaller ecosystem can be an advantage for teams that do not want to navigate dozens of managed services before deploying a cluster.

But teams running complex enterprise workloads should compare support, integrations and service maturity carefully.

Choose Utho when: you want an Indian provider with straightforward pricing and simpler cluster management.

Other Global Providers with India Regions

6. OVHcloud: Best for Open Cloud Infrastructure

OVHcloud Managed Kubernetes is available through its Mumbai Public Cloud region.

The platform supports autoscaling, Terraform, managed load balancing and Cilium-based networking.

OVHcloud also positions its service around open standards and predictable networking costs.

It does not offer the same service breadth as AWS, Azure or Google Cloud.

But for teams trying to reduce dependency on a large hyperscaler, that may be a reasonable trade-off.

Choose OVHcloud when: open cloud infrastructure, Mumbai hosting and predictable networking matter.

7. DigitalOcean Kubernetes: Best for Simplicity

DigitalOcean Kubernetes is built for developers and smaller engineering teams that want to deploy clusters without a steep cloud-management learning curve.

It includes a managed control plane, autoscaling and integration with DigitalOcean storage, load balancers and container registries.

The service is available in Bangalore.

It offers fewer enterprise services than the hyperscalers, but its simpler platform can make deployments easier to understand and operate.

Choose DigitalOcean when: developer experience and fast deployment matter more than enterprise service breadth.

8. Oracle Kubernetes Engine: Best for OCI Workloads

Oracle Kubernetes Engine is most relevant when applications already depend on Oracle Database, Exadata or other OCI services.

OKE supports managed nodes, virtual nodes, autoscaling and OCI networking and security integrations.

It becomes a practical choice when Kubernetes is part of a broader Oracle architecture.

Choose OKE when: Oracle services already sit at the centre of your workload.

What Should You Compare Before Choosing?

Do not compare providers only on worker-node prices.

Also check:

Control-plane fees and SLA
Indian data-location requirements
Load balancer and public IP costs
Storage, snapshots and backup pricing
Network egress charges
GPU node availability
Upgrade and patching responsibility
Support response times

A cheap worker node can still produce an expensive cluster.

Which Managed Kubernetes Provider Should You Choose?

Choose EKS, GKE or AKS when you need the scale and ecosystem of a hyperscaler.

Choose AceCloud or Utho when local support, Indian infrastructure or simpler billing matters more.

Choose OVHcloud or DigitalOcean when you want a global platform with Indian hosting but less hyperscaler complexity.

Choose OKE when Oracle services already shape your architecture.

The best provider is not the one with the longest feature list.

It is the one that removes the Kubernetes work your team does not want to manage without taking away the control it still needs.

Stable Diffusion Inference: Memory Requirements, Speed and GPU Selection

Daya Shankar — Wed, 17 Jun 2026 10:16:44 +0000

When teams plan infrastructure for Stable Diffusion, the conversation usually starts with GPU speed.

Should we use an L40S?

Would an H100 generate images faster?

How many images can it produce per minute?

Those are useful questions.

But they often skip the constraint that decides whether the workload can run properly in the first place:

GPU memory.

A model may generate one image quickly during testing and still struggle in production.

Why?

Because production adds larger resolutions, concurrent requests, ControlNet models, LoRA adapters and multiple pipeline components competing for the same VRAM.

What looks like a slow GPU can actually be a workload that no longer fits comfortably in memory.

What Actually Uses VRAM During Stable Diffusion Inference?

The model weights are only part of the memory requirement.

VRAM is also used by:

The UNet or diffusion transformer
Text encoders
The VAE
Latent representations and intermediate tensors
Attention operations
Image buffers
Batch and concurrency overhead
ControlNet models and other adapters

Resolution matters because larger images create larger latent tensors and attention workloads.

Batch size matters because the GPU must process more images at the same time.

Concurrency matters because multiple active requests may require their own intermediate data.

And ControlNet matters because its weights and activations add another model component to the pipeline. The official Diffusers ControlNet documentation explains how these additional conditioning models work alongside the base diffusion model.
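To see why resolution has such a large effect, here is a small worked sketch. It assumes a VAE that downsamples each side by 8, which is the case for Stable Diffusion 1.x and SDXL, and it treats every latent position as one attention token. Real pipelines run attention at several internal resolutions, and memory-efficient attention kernels avoid storing the full score matrix, so treat the last number as an upper-bound growth factor rather than a measurement.

```python
def latent_grid(width, height, vae_scale=8):
    """Latent width and height for a VAE that downsamples by vae_scale."""
    return width // vae_scale, height // vae_scale


def latent_tokens(width, height, vae_scale=8):
    """Number of spatial positions in the latent grid."""
    w, h = latent_grid(width, height, vae_scale)
    return w * h


small = latent_tokens(512, 512)      # 64 x 64 grid
large = latent_tokens(1024, 1024)    # 128 x 128 grid

token_growth = large / small
# A naive self-attention score matrix has tokens x tokens entries.
naive_attention_growth = (large ** 2) / (small ** 2)
print(small, large, token_growth, naive_attention_growth)
```

Doubling each side of the image quadruples the latent positions, and a naive attention matrix over those positions grows sixteen-fold.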

How Much VRAM Does Stable Diffusion Need?

There is no single correct number.

The requirement changes with the model, resolution, precision, framework, attention backend, batch size and memory optimisations.

Still, the following ranges provide a practical starting point for planning:

Deployment scenario	Practical VRAM range
Stable Diffusion 1.x at 512×512	6-8 GB
SDXL base at 1024×1024	12-16 GB
SDXL with LoRA or a light extended workflow	16-24 GB
SDXL with ControlNet or multiple pipeline components	20-24 GB or more
Concurrent production inference	24-48 GB or more

These are planning ranges, not fixed minimums. Memory-saving techniques can reduce VRAM use, while larger batches, multiple ControlNets and concurrent requests can push requirements higher.
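The weights portion of those ranges is the easiest part to estimate: parameter count multiplied by bytes per parameter. The sketch below uses 2.6 billion parameters as an illustrative input, roughly the size commonly cited for the SDXL UNet; substitute the real count for your model. Activations, text encoders, the VAE and framework overhead come on top of this figure.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gib(param_count, precision):
    """Memory needed just to hold the weights, in GiB."""
    return param_count * BYTES_PER_PARAM[precision] / 1024 ** 3


params = 2_600_000_000  # illustrative input, not a measured value
fp16_gib = weights_gib(params, "fp16")
fp32_gib = weights_gib(params, "fp32")
print(f"fp16: {fp16_gib:.2f} GiB, fp32: {fp32_gib:.2f} GiB")
```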

For example, SDXL can run on lower-memory hardware by moving pipeline components to system memory. Hugging Face documents several options in its Diffusers memory optimisation guide.

But there is a trade-off.

CPU offloading saves VRAM by moving model components between the CPU and GPU. That movement can also increase generation time.

So fitting the model into memory and running it efficiently are not always the same thing.

Does More VRAM Make Stable Diffusion Faster?

Not directly.

This is where GPU selection often becomes confusing.

VRAM capacity determines whether the model, batch and active requests fit on the GPU.

Once they fit, generation speed depends more heavily on:

GPU compute performance
Memory bandwidth
Image resolution
Number of sampling steps
Batch size
Precision such as FP16, BF16 or FP8
Inference framework and kernel optimisations

Imagine two GPUs that finish one SDXL image in a similar amount of time.

One has 24 GB of VRAM. The other has 48 GB.

The 48 GB GPU may not generate that single image any faster.

But it may support larger batches, more complex pipelines or more concurrent requests before running out of memory.

That is the real value of additional VRAM.

It creates headroom.

Why Does Performance Change Under Concurrent Load?

A single-image benchmark answers one question:

How quickly can this GPU complete one controlled request?

A production service asks something different:

How many requests can it complete while keeping latency within an acceptable range?

Suppose one user generates a 1024×1024 image.

The GPU may look fast and lightly loaded.

Now add ten users, different LoRA adapters and a ControlNet workflow.

The hardware has not changed.

The memory requirement has.

When the workload no longer fits comfortably, the deployment may need to reduce batch size, queue requests, offload components to the CPU or fail requests with an out-of-memory error.

This is why production performance often looks very different from a benchmark.
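A back-of-the-envelope sketch of that headroom calculation. Every number here is hypothetical: 10 GB for loaded models, 4 GB of working memory per in-flight request and a 2 GB safety reserve. Measure the real values with your own pipeline before relying on the result.

```python
def max_concurrent_requests(vram_gb, weights_gb, per_request_gb,
                            reserve_gb=2.0):
    """How many requests fit at once after weights and a safety reserve."""
    usable = vram_gb - weights_gb - reserve_gb
    if usable < per_request_gb:
        return 0
    return int(usable // per_request_gb)


on_24gb = max_concurrent_requests(vram_gb=24, weights_gb=10, per_request_gb=4)
on_48gb = max_concurrent_requests(vram_gb=48, weights_gb=10, per_request_gb=4)
print(on_24gb, on_48gb)
```

Under these assumptions, doubling VRAM triples the number of requests that fit, because the fixed cost of the weights is paid only once.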

How Does Batch Size Affect Memory and Speed?

Batching allows the GPU to process multiple prompts together.

This can improve throughput because each pass through the model produces several images, which keeps more of the GPU busy.

But a larger batch also requires more VRAM.

The official Diffusers batch inference guide describes the same trade-off: batching can improve GPU utilisation, but it increases memory use and may increase latency.

So the largest possible batch is not automatically the best batch.

A batch service may prioritise maximum images per minute.

An interactive application may use smaller batches because users care more about how quickly each request starts and finishes.
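The trade-off can be sketched with a toy cost model. It assumes each batch costs a fixed overhead plus a constant time per image; both coefficients are made up, and real GPUs are not this linear, so only the direction of the result matters.

```python
def batch_stats(batch_size, fixed_s=1.0, per_image_s=2.0):
    """Latency and throughput under a simple linear batch-time model."""
    batch_time = fixed_s + per_image_s * batch_size
    return {
        "latency_s": batch_time,  # every image waits for the whole batch
        "images_per_minute": 60 * batch_size / batch_time,
    }


results = {size: batch_stats(size) for size in (1, 4, 8)}
for size, stats in results.items():
    print(size, stats["latency_s"], round(stats["images_per_minute"], 1))
```

Throughput rises with batch size while each request waits longer, which is why batch services and interactive applications settle on different batch sizes.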

Which GPU Should You Choose for Stable Diffusion?

There is no universal “best GPU.”

The better choice depends on whether you are optimising for single-user development, cost-efficient inference, concurrency or large shared environments.

Deployment goal	Suitable GPU class	Why it fits
Development and testing	16-24 GB GPU	Enough for common single-user SDXL workflows with sensible optimisation
Professional workstation workflows	RTX 6000 Ada, 48 GB	Large VRAM pool for complex local workflows and multiple extensions
Production image inference	L4, 24 GB or L40S, 48 GB	L4 suits lighter serving, while L40S adds headroom for larger batches and concurrency
High-concurrency inference	H100, 80 GB or 94 GB	Higher compute, bandwidth and memory capacity for demanding serving environments
Memory-heavy shared environments	H200, 141 GB	Large HBM3e capacity for high concurrency and larger mixed AI workloads

The L40S provides 48 GB of GDDR6 memory, while the H200 provides 141 GB of HBM3e. Those specifications are useful, but they do not mean every Stable Diffusion deployment should move directly to an H200.

For standard SDXL inference, an H200 may be unnecessary unless the environment also needs substantial concurrency, large batches or broader memory-heavy AI workloads.

Buying more headroom than the workload can use does not improve efficiency.

What Should You Measure Before Selecting a GPU?

Do not test only one image.

Test the workload you actually expect to operate.

Measure:

Peak VRAM usage
Average and p95 generation latency
Images generated per minute
Maximum stable concurrency
Queue length during peak traffic
Failure and out-of-memory rates
Cost per completed image

Use the same model, resolution, sampling steps, adapters and concurrency level expected in production.

Otherwise, the benchmark may tell you which GPU wins a test without telling you which GPU fits the deployment.
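A minimal sketch of turning raw load-test results into the metrics listed above. The sample latencies, failure count and test duration are invented, and the percentile uses the simple nearest-rank method; a monitoring stack may compute p95 slightly differently.

```python
def p95(samples):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def summarise(latencies_s, oom_failures, wall_clock_s):
    """Summarise one load-test run of completed and failed requests."""
    completed = len(latencies_s)
    return {
        "avg_latency_s": sum(latencies_s) / completed,
        "p95_latency_s": p95(latencies_s),
        "images_per_minute": 60 * completed / wall_clock_s,
        "oom_rate": oom_failures / (completed + oom_failures),
    }


# Hypothetical run: 20 completed requests, 1 out-of-memory failure, 2 minutes.
latencies = [4.0] * 18 + [9.0, 12.0]
report = summarise(latencies, oom_failures=1, wall_clock_s=120)
print(report)
```

The average looks comfortable here while the p95 is more than twice as high, which is the gap a single-image test hides.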

The Infrastructure Mistake Most Teams Make

The common mistake is selecting a GPU first and defining the workload later.

Teams see that an H100 is faster than an L4 and assume it must be the better choice.

But faster hardware only creates value when the workload uses that performance.

A low-volume internal tool may run efficiently on a 24 GB GPU.

A public image platform may need 48 GB or more because many users are generating images at once.

A large batch pipeline may care less about individual request latency and more about total images per GPU hour.

Same model.

Different operating conditions.

Different GPU decision.

Memory, Speed and Cost Are Connected

Stable Diffusion infrastructure is not only a GPU performance problem.

It is a resource-balancing problem.

VRAM capacity determines what can fit and how much work can run together.

Compute performance and memory bandwidth affect how quickly that work completes.

Concurrency and response-time targets determine how much spare capacity the deployment needs.

And all of those decisions affect cost.

So before asking, “Which GPU is fastest?” ask something more useful:

What must this GPU handle at the busiest point of the day?

That answer will tell you far more than a single-image benchmark.

Batch Processing vs Real-Time Inference: When to Use Each for Image Generation

Daya Shankar — Wed, 17 Jun 2026 10:01:18 +0000

Two companies use the same image generation model.

One needs 100,000 product images for an e-commerce catalogue. The other runs a design platform where users expect an image within seconds.

Same model. Possibly the same GPUs.

Completely different infrastructure.

Why?

Because one company needs the images completed. The other has users waiting for them.

Most teams begin by comparing models, inference frameworks and GPU specifications. Those choices matter, but another question often has a bigger effect on cost and GPU utilisation:

Does the image need to exist now, or can it be generated later?

The answer usually determines whether batch processing, real-time inference or a combination of both is the right approach.

Batch Processing vs Real-Time Inference at a Glance

Factor	Batch Processing	Real-Time Inference
Primary goal	Maximum throughput	Fast response time
User waiting	No	Yes
Queueing	Expected	Kept within a latency limit
GPU utilisation	Usually easier to maximise	Often requires spare capacity
Capacity planning	Based on job volume and deadlines	Based on traffic and latency targets
Cost priority	Lower cost per completed image	Consistent user experience
Infrastructure priority	Efficiency	Availability

The difference may look operational.

In reality, it shapes the entire deployment architecture.

When Does Batch Processing Make Sense?

Batch processing treats image generation as work that must be completed, not as a service that must respond immediately.

It works well for:

Product catalogue generation
Bulk image enhancement
Marketing asset production
Media rendering pipelines
Large-scale design automation

In these cases, the business cares about total output and delivery time. It does not usually matter whether every image appears seconds after the request.

That flexibility is useful.

Requests can wait in a queue. Compatible jobs can be grouped together. GPUs can continue processing without keeping capacity available for unpredictable user traffic.

The goal is simple:

Keep the GPU busy and complete as much work as possible.

Think of it like filling a delivery truck. When the delivery is not urgent, sending a full truck is more efficient than making several half-empty trips.

Batch image generation follows the same principle.

Technologies such as NVIDIA Triton dynamic batching can combine compatible inference requests into larger batches to improve throughput.

Here, the queue is not necessarily a bottleneck.

It is part of the optimisation strategy.

Why Can Batch Processing Cost Less?

Batch workloads give teams more control over when and how GPU capacity is used.

They can group similar requests, schedule jobs during available capacity and process work continuously for longer periods.

This can increase the number of images completed per GPU hour and reduce the effective cost per image.

But batching is not automatic magic.

It works best when requests use compatible settings such as the same model, resolution or inference configuration. Highly varied requests may require separate queues or scheduling rules.

Speed still matters, but the metric changes.

A batch pipeline may take several hours to generate 100,000 images. If the output is ready before the business deadline, it has done exactly what it was designed to do.
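To make the deadline framing concrete, here is a minimal sketch of the sizing arithmetic. Every input (throughput per GPU, deadline, hourly price) is a hypothetical placeholder, not a benchmark, and `batch_plan` is just an illustrative helper name.

```python
import math

def batch_plan(total_images, images_per_gpu_hour, deadline_hours, price_per_gpu_hour):
    """Size a batch job: GPU hours, GPUs needed to hit the deadline, cost per image."""
    gpu_hours = total_images / images_per_gpu_hour
    gpus_needed = math.ceil(gpu_hours / deadline_hours)
    cost_per_image = price_per_gpu_hour / images_per_gpu_hour
    return gpu_hours, gpus_needed, cost_per_image

# Hypothetical inputs: 100,000 images, 1,200 images per GPU hour,
# an 8-hour deadline and $2.00 per GPU hour.
gpu_hours, gpus_needed, cost_per_image = batch_plan(100_000, 1_200, 8, 2.00)
print(f"{gpu_hours:.1f} GPU hours, {gpus_needed} GPUs, ${cost_per_image:.4f} per image")
```

The useful part is that the deadline, not per-request latency, decides how many GPUs you need. Relax the deadline and the GPU count drops while the cost per image stays the same.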

When Does Real-Time Inference Make Sense?

Now imagine a user entering a prompt and clicking Generate Image.

They are not thinking about GPU utilisation.

They are watching the loading screen.

The infrastructure must have capacity available when the request arrives. It cannot comfortably hold every request for several minutes while waiting to build a larger batch.

Every extra second becomes part of the product experience.

This makes real-time inference suitable for:

Interactive image generation tools
AI design platforms
Live photo-editing applications
Customer-facing content creation tools
Applications with strict response-time targets

Real-time infrastructure may need spare GPU capacity during quieter periods so it can handle sudden traffic increases.

From an infrastructure perspective, that capacity may look underused.

From a product perspective, it protects the user experience.

Does Real-Time Inference Mean No Batching?

No.

This is an important distinction.

Real-time systems can still use small or dynamic batches. The difference is that requests can only wait for a limited time.

For example, an inference server may hold a request for a few milliseconds to see whether another compatible request arrives. It can then process both together without creating a noticeable delay.

But here is the trade-off.

The longer the system waits to create a batch, the more throughput it may gain. It also adds more latency.

NVIDIA’s Triton optimisation guidance treats minimum latency and maximum throughput as different tuning goals. You rarely maximise both at the same time.
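The wait-versus-throughput trade-off can be sketched with a toy batcher. This is not Triton’s implementation; it is a simplified model of bounded-wait batching, similar in spirit to a maximum queue delay setting. The arrival times and the names `form_batches` and `queue_delays` are made up for illustration.

```python
def form_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches.

    A batch is dispatched when it is full or when its oldest request
    has waited max_wait_ms, whichever comes first.
    Returns a list of (dispatch_time_ms, [arrival times]) tuples.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        # The open batch timed out before this request arrived: flush it.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + max_wait_ms, current))
    return batches

def queue_delays(batches):
    """Time each request spent waiting for its batch to dispatch."""
    return [dispatch - t for dispatch, requests in batches for t in requests]

arrivals = [0, 2, 3, 40, 41, 90]  # hypothetical arrival times in ms
for wait in (5, 50):
    batches = form_batches(arrivals, max_batch_size=4, max_wait_ms=wait)
    worst = max(queue_delays(batches))
    print(f"max_wait={wait}ms batches={len(batches)} worst_queue_delay={worst}ms")
```

With the longer wait, the same six requests fit into fewer, larger batches, but the unluckiest request waits ten times longer before any GPU work starts.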

The Real Trade-Off: Utilisation vs Responsiveness

Many techniques that improve batch efficiency can make interactive applications feel slower.

Larger queues can improve throughput but increase waiting time.
Higher utilisation can lower idle capacity but leave less room for traffic spikes.
Aggressive scheduling can keep GPUs busy but delay interactive requests.

What looks like optimisation in a batch environment can become a bottleneck in a real-time one.

In batch processing, waiting can improve efficiency.

In real-time inference, waiting affects the customer experience.

Which Processing Model Should You Choose?

Ask one question:

What happens if the image arrives ten minutes later?

If the answer is “nothing important,” batch processing is probably the better choice.

If the delay interrupts a workflow or frustrates a waiting user, real-time inference may be justified.

Choose Batch Processing When:

No user is actively waiting for each image
The workload contains many similar requests
Images must meet a deadline rather than appear immediately
Cost per image matters more than individual request latency
Jobs can tolerate queueing or rescheduling

Choose Real-Time Inference When:

A customer is waiting for the result
Response time affects the product experience
Requests arrive unpredictably
The application has a clear latency target
Slow generation could cause users to abandon the workflow

What If Your Workload Needs Both?

Many production applications use a hybrid architecture.

Interactive requests go to infrastructure designed for low latency. Bulk tasks move to a queue and run on capacity optimised for throughput.

For example, a design platform may generate a preview in real time. Once the user approves it, high-resolution exports, different aspect ratios and additional variations can move to a batch pipeline.

The user gets a fast preview.

The infrastructure avoids treating every output as urgent.

Why Workload Behaviour Matters More Than GPU Size

Teams often begin by asking which GPU they should use.

But the fastest GPU does not automatically create the most cost-effective architecture.

A powerful GPU running at low utilisation in an oversized real-time environment may cost more per image than a smaller GPU running continuously in a batch pipeline.
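A rough way to check that claim is to divide the hourly price by the images actually produced, not by the peak rate. The prices, throughputs and utilisation figures below are invented for illustration only.

```python
def cost_per_image(price_per_hour, peak_images_per_hour, utilisation):
    """Effective cost per image when the GPU is only busy part of the time."""
    images_actually_produced = peak_images_per_hour * utilisation
    return price_per_hour / images_actually_produced

# Hypothetical figures, not benchmarks.
large_realtime = cost_per_image(price_per_hour=4.00, peak_images_per_hour=3_000, utilisation=0.15)
small_batch = cost_per_image(price_per_hour=1.50, peak_images_per_hour=1_000, utilisation=0.90)
print(f"large GPU, 15% busy: ${large_realtime:.4f} per image")
print(f"small GPU, 90% busy: ${small_batch:.4f} per image")
```

Under these assumptions the faster GPU costs several times more per image, purely because most of its paid hours are idle.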

The hardware matters.

But workload behaviour determines how efficiently that hardware is used.

Before selecting a GPU instance, define whether the workload needs maximum throughput, low latency or a balance of both.

You can then compare hourly and longer-term configurations against cloud GPU pricing instead of paying for capacity you do not need.

So, Who Is Waiting for the Image?

Choose batch processing when completion matters more than immediate delivery.

Choose real-time inference when the user experience depends on receiving the image quickly.

Use a hybrid architecture when only part of the workflow needs an instant response.

Before comparing GPUs or benchmarking inference frameworks, ask:

Who is waiting for the image?

If nobody is waiting, let the workload queue.

If a user is watching the screen, design the infrastructure around that moment.

Cold Starts, Model Loading, and Their Impact on Latency SLAs

Daya Shankar — Mon, 02 Mar 2026 06:37:05 +0000

Cold start latency breaks SLAs because “pod is Running” isn’t “model is ready.” On Kubernetes with vLLM, cold start includes image pulls, weight downloads, tensor load into GPU memory, and warm-up work (often CUDA-graph-related). These events are rare but huge, so they dominate p95/p99, especially when you scale from zero.

The on-call version of this problem

Bridge: SLAs die on tails, and cold starts are tail generators.

You deploy a new vLLM revision. HPA scales up. Pods come up fast. Traffic shifts. p50 looks fine. p99 explodes.

Nothing “crashed.” You just routed users onto instances still doing model loading and warm-up. That’s not a bug. That’s physics plus orchestration.

If you run strict SLAs on a GPU fleet, you need to treat cold start like a first-class SLI.

What “cold start” actually contains for vLLM on Kubernetes

Bridge: Break the chain into phases so you can measure and fix the slowest link.

Cold start is not one thing. It’s a pipeline:

DIAGRAM 1 — Cold start timeline (what you must budget)

Scale event
    |
    v
[1] Image pull ---> [2] Container start ---> [3] Model fetch ---> [4] Weight load ---> [5] Warm-up ---> Ready
    |                   |                        |                    |                    |
Registry            Python init              Network/FS           Disk->RAM->GPU       Graph/caches

The phase that usually dominates: model storage path

Bridge: If weights sit on slow shared storage, everything else is noise.

vLLM calls this out bluntly: loading large models from shared/network filesystems can be slow, and it’s better to store the model on local disk when possible. It also warns that CPU memory pressure can trigger swapping and slow the OS down.

Translation:

If your weights live on a slow network filesystem, you built a cold-start machine.

If you swap while loading weights, you built a cold-start machine that also hurts neighbors.

Warm-up is real work, not “nice to have”

Bridge: If you don’t pre-warm, the first user request becomes your warm-up job.

vLLM provides tooling specifically to benchmark cold vs warm startup, including model loading and compilation/cache operations.
If vLLM ships a benchmark for startup, that’s your sign: startup cost matters.

Why L40S changes the tuning you should do

Bridge: PCIe-only GPUs expose bad data paths immediately.

On NVIDIA L40S, you’re on PCIe Gen4 x16 (64 GB/s bidirectional), with no NVLink and no MIG support.

What this means for cold starts:

Host↔GPU traffic rides PCIe. Extra copies hurt.

You can’t “hide” cold starts by slicing a big GPU into tiny always-warm MIG slices.

Your operational levers are boring: caching, warm replicas, and reducing churn.

SLA math: cold starts don’t ruin averages, they ruin p95/p99

Bridge: You can’t hand-wave tails with “but it’s rare.”

Cold starts are low-frequency, high-impact latency events. That’s exactly what percentiles punish.

If you allow scale-to-zero, your probability of cold starts after idle becomes close to 1 for the first request. Knative documents scale-to-zero as a feature and exposes config to enable/disable it.
Knative also documents Scale Down Delay specifically to keep containers around for a configurable time to avoid cold start penalties.

Even if you don’t use Knative, the principle holds:

Every time you delete a pod, you re-pay model load.

Every time you scale to zero, you guarantee a cold start on the next burst.
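A quick sketch of that percentile math, using nearest-rank percentiles. The latencies (0.8 s warm, 45 s cold) and request counts are hypothetical; plug in your own.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

def tail_report(total_requests, cold_requests, warm_s=0.8, cold_s=45.0):
    """Latency percentiles when some requests land on a cold replica."""
    latencies = [warm_s] * (total_requests - cold_requests) + [cold_s] * cold_requests
    return {p: percentile(latencies, p) for p in (50, 95, 99)}

# Hypothetical numbers: 0.8 s warm, 45 s cold, 1,000 requests.
for cold in (20, 60):
    print(f"{cold / 10:.0f}% cold ->", tail_report(1_000, cold))
```

In this toy model p50 never moves. At 2% cold requests p99 already equals the full cold-start time, and at 6% p95 follows it.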

Fix cold start latency by attacking three things

Bridge: You reduce cold starts by moving fewer bytes, repeating less work, and avoiding scale-to-zero surprises.

1) Cache model artifacts on the node (prefer local disk)

Bridge: Put the bytes next to the GPU node or pay for network + FS latency on every churn event.

vLLM recommends local disk when shared filesystems are slow.
So do this:

Mirror model artifacts to a controlled location (object store, internal registry, or artifact repo).

Cache on node-local SSD/NVMe where possible.

Point vLLM/HF cache directories at that local path.

Practical rule for SREs: don’t download weights from the public internet in the hot path. vLLM itself recommends downloading first (via huggingface-cli) and passing the local path to isolate issues.
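One way to enforce that rule is a startup check that fails fast when the node-local copy is missing, rather than quietly falling back to a download. This is a minimal sketch; `resolve_local_model` is a hypothetical helper, and the demo uses a temporary directory in place of a real cache path.

```python
import tempfile
from pathlib import Path

def resolve_local_model(cache_root, model_name):
    """Return the node-local model directory, or fail fast if it is missing.

    Failing here is deliberate: the serving path should never fall back
    to downloading weights over the network.
    """
    model_dir = Path(cache_root) / model_name
    has_files = model_dir.is_dir() and any(model_dir.iterdir())
    if not has_files:
        raise FileNotFoundError(
            f"{model_dir} is missing or empty; pre-populate the cache before serving"
        )
    return model_dir

# Demo against a throwaway directory standing in for the node-local cache.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "my-model").mkdir()
    (Path(root) / "my-model" / "config.json").write_text("{}")
    found = resolve_local_model(root, "my-model")
    print("serving from", found)
    try:
        resolve_local_model(root, "other-model")
    except FileNotFoundError as err:
        missing_error = str(err)
        print("refused:", err)
```

A crash at startup is loud and cheap. A silent download in the hot path shows up later as a p99 incident.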

2) Pre-pull images on GPU nodes

Bridge: Image pulls are pure waste during a scale event.

Use a DaemonSet that pins to GPU nodes and pulls your serving image. Keep it dumb.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vllm-prepull
  namespace: kube-system
spec:
  selector:
    matchLabels: { app: vllm-prepull }
  template:
    metadata:
      labels: { app: vllm-prepull }
    spec:
      nodeSelector:
        accelerator: nvidia
      tolerations:
        - key: "accelerator"
          operator: "Equal"
          value: "nvidia"
          effect: "NoSchedule"
      containers:
        - name: sleep
          image: your-registry/vllm-serving:TAG
          command: ["sh","-c","sleep 365000"]
          resources:
            requests: { cpu: "10m", memory: "32Mi" }
            limits: { cpu: "50m", memory: "64Mi" }

3) Keep a warm floor for SLA-bound services

Bridge: If your SLA can’t tolerate cold starts, don’t scale to zero.

Set:

min replicas > 0 (HPA floor or Deployment replicas)

a “warm pool” per model

separate burst capacity if you need it

Scale-to-zero is a cost tool. It is not an SLA tool. Knative’s own docs bake in knobs to control that behavior for a reason.

Two diagrams that match how this should be deployed

Bridge: These are the shapes that keep p99 sane without paying for always-on waste.

Reference deployment YAML (vLLM on L40S with readiness gating + node-local cache)

Bridge: This is the “copy/paste then edit” block you can review in PRs.

This example does three things:

pins onto GPU nodes

caches model files on a node-local path

gates readiness until a warm-up completes

Important: This assumes you control the container entrypoint and can add a small wrapper script. That’s the cleanest way to tie readiness to “model is hot.”

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-serve
  namespace: inference
spec:
  replicas: 2 # warm floor for SLA
  selector:
    matchLabels:
      app: vllm-serve
  template:
    metadata:
      labels:
        app: vllm-serve
    spec:
      nodeSelector:
        accelerator: nvidia
        gpu: l40s
      tolerations:
        - key: "accelerator"
          operator: "Equal"
          value: "nvidia"
          effect: "NoSchedule"
      containers:
        - name: vllm
          image: your-registry/vllm-serving:TAG
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "4"
              memory: "24Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          env:
            - name: HF_HOME
              value: /models/hf
            - name: MODEL_PATH
              value: /models/hf/my-model # pre-downloaded or mirrored
          command: ["/bin/bash","-lc"]
          args:
            - |
              set -euo pipefail
              rm -f /tmp/ready

              # Start vLLM in background.
              # $! is the PID of tee (last command in the pipeline), which exits when vLLM does.
              vllm serve "${MODEL_PATH}" \
                --host 0.0.0.0 --port 8000 \
                --dtype auto \
                --max-model-len 8192 \
                --tensor-parallel-size 1 \
                2>&1 | tee /var/log/vllm.log &
              VLLM_PID=$!

              # Wait up to 10 minutes for the server socket (same budget as the startupProbe).
              for i in {1..600}; do
                (echo > /dev/tcp/127.0.0.1/8000) >/dev/null 2>&1 && break
                sleep 1
              done

              # Minimal warm-up: one short completion through the OpenAI-compatible API.
              # Replace this with your own internal client/probe if needed.
              if curl -sf http://127.0.0.1:8000/v1/completions \
                  -H 'Content-Type: application/json' \
                  -d "{\"model\": \"${MODEL_PATH}\", \"prompt\": \"warm-up\", \"max_tokens\": 8}" \
                  >/dev/null; then
                # Mark ready only after the warm-up request succeeded.
                touch /tmp/ready
              else
                echo "warm-up request failed; pod stays unready" >&2
              fi

              # Keep foreground
              wait $VLLM_PID
          readinessProbe:
            exec:
              command: ["/bin/sh","-c","test -f /tmp/ready"]
            periodSeconds: 2
            timeoutSeconds: 1
            failureThreshold: 30
          startupProbe:
            exec:
              command: ["/bin/sh","-c","test -f /var/log/vllm.log"]
            periodSeconds: 2
            timeoutSeconds: 1
            failureThreshold: 300 # allow long first load
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          hostPath:
            path: /var/lib/model-cache
            type: DirectoryOrCreate

Notes for senior SREs

hostPath is powerful and dangerous. In a managed Kubernetes environment, you may prefer node-local ephemeral SSD mounts that the platform team controls, or a LocalPV setup with strict node affinity.

Set replicas to your SLA floor. Use HPA for burst, but don’t let it go to zero if p99 matters.

Measure it like an SRE: phase timings and startup benchmarks

Bridge: You can’t improve what you can’t attribute.

Use vLLM’s startup benchmark tooling

Bridge: Benchmark cold vs warm startup and block regressions.

vLLM ships a startup benchmark module to measure cold and warm startup times, including model loading and compilation/cache operations.

Run it against:

your container image

your model

your storage backend

your L40S node class

Then fail CI when cold start time regresses.

Log phase timestamps

Bridge: Turn “it’s slow” into numbers you can grep.

Log these timestamps per pod:

image pulled (node event)

process start

model fetch complete

weights loaded

warm-up complete

readiness true

Then build histograms:

cold_start_seconds{phase="fetch"}

cold_start_seconds{phase="load"}

cold_start_seconds{phase="warmup"}

This tells you where to spend effort.
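Turning those timestamps into per-phase durations is a few lines of code. The sketch below assumes each pod logs the timestamps as seconds since the scale event; the key names and sample values are hypothetical, and the phase labels match the histogram labels above.

```python
# (phase label, start timestamp key, end timestamp key)
PHASES = [
    ("fetch", "process_start", "fetch_done"),
    ("load", "fetch_done", "weights_loaded"),
    ("warmup", "weights_loaded", "warmup_done"),
]

def phase_durations(timestamps):
    """Turn absolute per-pod timestamps (seconds) into per-phase durations."""
    return {phase: timestamps[end] - timestamps[start] for phase, start, end in PHASES}

# Hypothetical timestamps for one pod, in seconds since the scale event.
pod = {"process_start": 14.5, "fetch_done": 74.5, "weights_loaded": 104.5, "warmup_done": 119.5}
durations = phase_durations(pod)
slowest = max(durations, key=durations.get)
print(durations)
print("slowest phase:", slowest)
```

Each value in `durations` is one observation for the matching `cold_start_seconds` histogram.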

Managed Kubernetes: what it helps, what it doesn’t

Bridge: Managed Kubernetes runs the plumbing. You still own the SLA path.

Managed Kubernetes can:

keep control plane stable

manage node lifecycle and autoscaler hygiene

standardize storage classes and node pools

It will not:

pick your cache strategy

keep models warm for your SLA

prevent you from scaling to zero and cold-starting on every burst

On AceCloud managed Kubernetes, the clean play is: dedicated GPU node pools for vLLM, pre-pull images, cache weights on node-local storage, set warm floors for SLA services, and script warm-up into readiness. Keep your cold path measured and boring.

Checklist for PR reviews

Bridge: This is the short list that prevents “p99 spikes after deploy.”

Model artifacts are local or cached. Not pulled from the public internet at runtime.

GPU node pools are pinned (L40S), tainted, and isolated.

Image pre-pull exists on GPU nodes.

Readiness gates on “model is hot,” not “process started.”

Warm floor exists (min replicas > 0) for SLA services.

Cold vs warm startup is benchmarked in CI.


Operational Risks of Running Large Multi-Tenant Kubernetes Clusters

Daya Shankar — Mon, 02 Mar 2026 06:30:39 +0000

Large multi-tenant Kubernetes clusters concentrate risk. Tenants share the control plane, core add-ons (CNI/CSI/Ingress/DNS), and scheduling capacity, so one bad deployment or “safe” upgrade can hit everyone.

The common failures are noisy neighbors, weak isolation, quota starvation, identity drift, and upgrade blast radius. Managed Kubernetes helps, but it won’t design tenancy for you.

What “multi-tenancy” means when you’re on call

If you don’t define the tenant boundary, you can’t defend it.

Multi-tenant Kubernetes usually means “many teams share one cluster.” The boundary is often a namespace. Sometimes it’s stronger: separate node pools, stricter network policy, workload identity, dedicated ingress, dedicated GPUs, etc.

Operationally, multi-tenancy is shared failure domains:

One API server.

One DNS stack (CoreDNS).

One CNI and conntrack table.

One CSI and storage path.

One ingress layer (or a few shared controllers).

One scheduler and one pool of allocatable capacity.

If you want a cluster to survive at scale, you need to decide which failures are allowed to be shared and which are not.

Noisy neighbors aren’t a “performance issue”; they’re an outage pattern

Shared capacity turns small mistakes into cluster-level incidents.

CPU: throttling, request inflation, and scheduler lies

CPU is compressible, so people abuse it.

Two classic problems:

No limits + bursty workloads → one tenant burns cores and everyone’s latency climbs.

Overstated requests → scheduler thinks nodes are full → cluster autoscaler spins up nodes → real CPU sits idle.

If you size everything to p95 requests, you don’t just waste money. You also block bin-packing and create “Pending pods” incidents that look like infra failure.

Minimum guardrail

Enforce requests on CPU.

Be cautious with CPU limits for latency-sensitive services (throttling is real).

Use HPA with a real scaling signal. Don’t “set and forget.”
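The “scheduler lies” problem is easy to show with arithmetic: the scheduler packs by requests, not by usage. The node size and pod figures below are invented for illustration.

```python
def node_report(allocatable_cores, pods):
    """Compare what the scheduler sees (requests) with what pods really use."""
    requested = sum(p["request"] for p in pods)
    used = sum(p["usage"] for p in pods)
    return {
        "schedulable_cores_left": allocatable_cores - requested,
        "idle_cores": allocatable_cores - used,
        "request_to_usage_ratio": requested / used,
    }

# Hypothetical 16-core node; each pod requests 4 cores but averages 0.5.
pods = [{"request": 4.0, "usage": 0.5} for _ in range(4)]
report = node_report(16.0, pods)
print(report)
```

The node is “full” as far as scheduling goes, yet 14 of its 16 cores sit idle. The next pod goes Pending or triggers a scale-up.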

Memory: eviction storms and node death spirals

Memory is not compressible; it fails hard.

One tenant can trigger:

node memory pressure

kubelet evictions

cascading restarts

thundering herds as pods all re-pull images and rebuild caches

Minimum guardrail

Set memory requests and limits for all tenant workloads.

Alert on OOMKilled and eviction rates per namespace.

Keep headroom on nodes so eviction doesn’t become a cluster-wide reboot loop.

Disk/inode: the silent killer

Disk fills don’t page until they page everybody.

Common multi-tenant disk failures:

log storms filling /var/log or container runtime storage

inode exhaustion from small-file spam

image cache churn under high pod turnover

Minimum guardrail

Per-namespace log volume controls (don’t let one team spam logs).

Node alerts on disk/inode usage.

Runtime storage quotas where available.

Network: saturation and conntrack exhaustion

Networking failures hit all tenants because the kernel tables are shared.

When one tenant opens too many connections or you get a traffic spike:

conntrack table fills

packets drop

“random” timeouts appear across unrelated services

Minimum guardrail

Rate-limit at ingress.

Enforce egress policies.

Watch conntrack, dropped packets, and retransmits on nodes.

Isolation failures become security incidents

Namespaces are a convenience boundary, not a security boundary.

RBAC drift and privilege creep

RBAC starts clean and rots fast in shared clusters.

The failure mode is predictable:

a team needs one permission

someone grants a broad ClusterRole

later, nobody remembers it exists

now the tenant can list secrets cluster-wide or mutate critical resources

Minimum guardrail

Centralize ClusterRole creation.

Lint RBAC in CI (script it; don’t “review in Slack”).

Ban wildcard verbs/resources for tenant roles.
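A wildcard lint does not need much. The sketch below assumes the Role or ClusterRole manifest has already been parsed into a dict (from YAML or JSON); `find_wildcard_rules` and the sample role are hypothetical.

```python
def find_wildcard_rules(role):
    """Return (field, rule) pairs for rules that use '*' in verbs or resources."""
    offending = []
    for rule in role.get("rules", []):
        for field in ("verbs", "resources"):
            if "*" in rule.get(field, []):
                offending.append((field, rule))
                break  # one finding per rule is enough to fail the check
    return offending

tenant_role = {
    "kind": "ClusterRole",
    "metadata": {"name": "tenant-a-admin"},  # hypothetical role
    "rules": [
        {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]},
        {"apiGroups": [""], "resources": ["secrets"], "verbs": ["*"]},
    ],
}
violations = find_wildcard_rules(tenant_role)
for field, rule in violations:
    print(f"{tenant_role['metadata']['name']}: wildcard in {field}: {rule}")
```

In CI, a non-empty `violations` list would fail the build before the role ever reaches the cluster.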

Workload identity and cloud IAM misbinding

The fastest way to leak data is to bind the wrong identity to the right pod.

In multi-tenant, identity mistakes propagate:

a shared service account gets reused

a workload identity binding is too broad

pods gain access to buckets/queues they should never see

Minimum guardrail

One workload identity per service, not per namespace.

Deny “default” service account usage for real workloads.

Audit “who can assume what” regularly and pipe it to alerts.

Pod security exceptions that never die

The exception list grows until it becomes the policy.

If you allow privileged pods, hostPath mounts, or host networking for one team, you’ve opened a side door for everyone unless you lock it down.

Minimum guardrail

Use Pod Security Admission (baseline/restricted) as default.

Require an exception workflow with expiry.

Grep for privileged/hostPath usage weekly.
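The weekly check can be a small script over pod specs instead of a literal grep. This sketch assumes you feed it the `spec` of each pod as a parsed dict; the function name and the sample spec are hypothetical.

```python
def risky_settings(pod_spec):
    """List the host-level escapes a pod spec asks for."""
    findings = []
    if pod_spec.get("hostNetwork"):
        findings.append("hostNetwork")
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"hostPath:{volume['hostPath'].get('path', '?')}")
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        if container.get("securityContext", {}).get("privileged"):
            findings.append(f"privileged:{container['name']}")
    return findings

pod_spec = {  # hypothetical pod spec, already parsed into a dict
    "hostNetwork": True,
    "volumes": [{"name": "cache", "hostPath": {"path": "/var/lib/model-cache"}}],
    "containers": [
        {"name": "app", "securityContext": {"privileged": True}},
        {"name": "sidecar"},
    ],
}
findings = risky_settings(pod_spec)
print(findings)
```

Compare the findings against the approved exception list. Anything not on it, or past its expiry, is a ticket.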

Network policy gaps turn “one bad app” into “everyone is down”

Flat networks are how tenant bugs become tenant outages.

Default-allow is the default failure

If everything can talk to everything, blast radius is automatic.

A single noisy service can:

hammer shared dependencies

trigger thundering herds

overload DNS

spike cross-namespace traffic

Minimum guardrail

Default-deny egress and ingress per namespace.

Explicit allowlists for shared services.

Treat policies like code (PRs, review, tests).
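If policies are code, the default-deny baseline can be generated per namespace rather than hand-written. The sketch below builds the manifest as a dict and prints it as JSON; the policy name and the `tenant-a` namespace are placeholders.

```python
import json

def default_deny_policy(namespace):
    """Build a NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # Empty selector = every pod in the namespace.
            "podSelector": {},
            # Both types listed with no rules, so both directions are denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }

policy = default_deny_policy("tenant-a")
print(json.dumps(policy, indent=2))
```

Two caveats: denying egress also blocks DNS lookups until you add an explicit allow rule, and the policy only takes effect if your CNI enforces NetworkPolicy.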

Shared ingress controllers amplify mistakes

One bad ingress change can break unrelated tenants.

Failure patterns:

config reload loops

bad annotations triggering expensive behaviors

certificate mis-rotation

Minimum guardrail

Separate ingress controllers by tenancy tier (shared/dev vs prod/critical).

Canary ingress changes.

Enforce annotation allowlists.

DNS is a shared single point of failure

CoreDNS is the cluster’s heartbeat; overload it and nothing resolves.

In big clusters, DNS load grows with:

pod count

churn

retries during incidents

Minimum guardrail

Scale CoreDNS for QPS and cache settings.

Alert on CoreDNS latency/errors.

During incidents: grep logs for upstream timeouts and SERVFAIL.

Scheduling and quota pathologies at scale

“Fair scheduling” is policy you must configure, not something Kubernetes gifts you.

Quota starvation and priority inversion

One tenant can starve others without “breaking rules.”

Common patterns:

Tenant A uses up shared node pool capacity with big requests.

Tenant B is within quota but can’t schedule due to fragmentation.

Everyone blames the scheduler. It did what you told it.
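Fragmentation is simple to demonstrate: free capacity cannot be pooled across nodes. The numbers below are hypothetical.

```python
def can_schedule(pod_request, free_per_node):
    """A pod needs its whole request on a single node; free capacity cannot be pooled."""
    return any(free >= pod_request for free in free_per_node)

# Hypothetical pool: 6 free cores in total, spread thinly over three nodes.
free_cores = [2.0, 2.0, 2.0]
pod_request = 4.0  # Tenant B is under quota and the cluster has 6 cores free
print("total free:", sum(free_cores), "schedulable:", can_schedule(pod_request, free_cores))
```

The cluster has more than enough free CPU in total, and the pod still goes Pending.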

Minimum guardrail

ResourceQuotas per namespace (CPU/memory/pods).

LimitRanges to prevent “no requests” and “ridiculous limits.”

Separate node pools for noisy/bursty tenants.

Preemption can save prod or murder batch

Preemption is a knife. Use it like one.

If you enable priority + preemption:

your critical services can recover capacity

your batch jobs can get killed repeatedly unless they checkpoint

Minimum guardrail

PriorityClasses: “prod-critical”, “prod”, “batch”.

For batch: checkpoint or accept loss.

Measure eviction rates after enabling.

Upgrade and change-management risk is multiplied by tenant count

In big clusters, “safe changes” are the biggest outage source.

The shared add-ons are the sharp edges:

CNI upgrades

CSI upgrades

ingress controller upgrades

API deprecations breaking controllers/operators

node patching + drains deadlocking on PDBs

Failure mode you will hit: PDB deadlock.
A drain starts, pods can’t evict due to strict budgets, the rollout stalls, capacity shrinks, and unrelated tenants get squeezed.

Minimum guardrail

Canary upgrades in a smaller cluster or a dedicated pool.

Script rollback paths.

Set realistic PDBs (protect availability, don’t freeze the cluster).
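A PDB sanity check before drains can be as small as the sketch below. It models only the `minAvailable` form of a budget and assumes all replicas are healthy; the workload list is made up.

```python
def disruptions_allowed(healthy_pods, min_available):
    """How many pods a PDB with minAvailable lets the eviction API remove right now."""
    return max(healthy_pods - min_available, 0)

def blocks_drain(replicas, min_available):
    """True when the budget leaves zero room, so a drain can never evict a pod."""
    return disruptions_allowed(replicas, min_available) == 0

# Hypothetical workloads: (name, replicas, minAvailable)
workloads = [("api", 3, 3), ("worker", 5, 4), ("singleton", 1, 1)]
stuck = [name for name, replicas, min_available in workloads
         if blocks_drain(replicas, min_available)]
print("PDBs that will stall a drain:", stuck)
```

Any workload on that list either needs another replica or a looser budget before node patching starts.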

Observability and incident response get harder as the cluster gets bigger

If you can’t attribute load to a tenant, you can’t run multi-tenant.

Per-tenant attribution is mandatory

“The cluster is slow” is not an actionable alert.

You need:

dashboards by namespace (CPU/mem/network/disk)

request rates at ingress by tenant

top talkers (network) and top allocators (memory)

Cardinality will bite you. Don’t ship every label. Decide which labels you can afford, then enforce it.

Audit logs and “who did what”

Multi-tenant incidents often start as “someone applied something.”

Enable audit logs and make them searchable. When you’re debugging a weird outage, you should be able to answer:

who changed the Deployment?

who changed the NetworkPolicy?

who updated the ingress?

Managed Kubernetes changes the risk profile, not the physics

Managed Kubernetes reduces control-plane toil, not tenant blast radius.

Managed Kubernetes usually helps with:

control plane uptime/patching

some upgrade orchestration

basic integrations

It does not automatically give you:

tenant isolation

safe defaults for quotas and policies

sane RBAC boundaries

disciplined change control for shared add-ons

If you’re on a managed Kubernetes offering like AceCloud, use the managed layer for what it’s good at (platform plumbing), then enforce tenancy guardrails at the cluster policy layer (quotas, PSA defaults, network policies, and tiered node pools). That’s where multi-tenancy succeeds or fails.

Mitigation playbook (week 1 controls that actually reduce incidents)

These are the controls you can deploy fast and feel immediately.

1) Create tenancy tiers

Not every workload deserves to share the same failure domain.

Shared/dev tier: many tenants, lower guarantees

Prod/shared tier: stricter policies, more guardrails

Prod/dedicated tier: separate node pools or separate clusters for the truly critical

2) Enforce default-deny networking

Flat networks are the default blast radius.

Deploy default-deny policies per namespace. Add explicit allow rules.

3) Lock down RBAC and pod security

Security drift is guaranteed unless you block it.

central RBAC templates

Pod Security Admission defaults

expiring exceptions

4) Quotas + LimitRanges everywhere

Multi-tenant without quotas is “first team to deploy wins.”

ResourceQuota per namespace

LimitRange to prevent “no requests” and “unbounded limits”

Alerts on quota saturation and Pending pods

5) Safer change management for shared components

Your add-ons are shared infrastructure. Treat them like prod.

canary upgrades

rollback scripts

PDB sanity checks before drains

runbooks for CNI/CSI/Ingress/DNS failures

Bottom line

Large multi-tenant clusters work when you treat them like a shared operating system.

A big shared Kubernetes cluster isn’t “just more nodes.” It’s a bigger shared failure domain. The operational risks are predictable: noisy neighbors, weak isolation, quota starvation, identity drift, and upgrade blast radius.

If you want reliability, you must configure guardrails, script rollouts, and verify attribution per tenant, whether you run it yourself or on managed Kubernetes.

Hosted control plane: when it simplifies operations and when it adds complexity

Daya Shankar — Fri, 27 Feb 2026 06:13:52 +0000

A hosted control plane moves Kubernetes control-plane components off your worker fleet, either into a provider-managed boundary (EKS) or onto a separate hosting cluster as pods (HyperShift).

It simplifies ops when you want predictable upgrades, less per-cluster snowflake work, and cleaner separation between “management” and “workloads.”

It adds complexity when control-plane connectivity, IAM, and shared blast radius become your new failure modes, especially with private clusters.

Define hosted control plane in concrete terms

If you can’t say where the API server and etcd live, you can’t model risk.

“Hosted control plane” is a placement decision.

EKS: hosted by AWS in an EKS-managed VPC

AWS owns the masters; you own nodes and workloads.

AWS documents that the EKS-managed control plane runs inside an AWS-managed VPC and includes Kubernetes API server nodes and an etcd cluster. API server nodes run in an Auto Scaling group across at least two AZs; etcd nodes span three AZs.

What that means operationally:

You don’t patch control-plane instances.

You don’t rebuild etcd.

You do still own access, RBAC, node lifecycle, and add-ons.

kubeadm on EC2: not hosted, you host it

You run the masters, the etcd, the upgrades, and the recovery drills.

Kubeadm HA requires you to pick a topology (stacked etcd vs external etcd) and wire up the endpoints (often via a load balancer DNS name). External etcd needs explicit endpoint configuration; stacked etcd is “managed automatically” by kubeadm’s topology.

What that means operationally:

You patch and upgrade the control plane.

You own etcd snapshots and restore tests.

You own certificates and rotation edge cases.

HyperShift (hosted control planes): control planes as pods on a hosting cluster

You consolidate many control planes onto one management cluster.

Red Hat’s hosted control planes model runs control planes as pods on a management/hosting cluster, without dedicated VMs per control plane.

HyperShift then introduces a new question: where do those control plane pods land? Docs show “shared everything” by default, and you can dedicate nodes for control plane workloads via labels/taints.

Side-by-side: what gets simpler, what gets harder

Feature lists lie. Ownership and failure modes don’t.

Model	What simplifies	What gets harder	The new “pager line”
EKS hosted control plane	Control plane HA, scaling, replacement; less etcd babysitting	Endpoint access + SG design for private clusters; version planning	“Can we reach the API endpoint from the right networks?”
kubeadm on EC2	Full control; no managed constraints	Everything: HA wiring, etcd ops, upgrades, certs	“etcd is sick” is your incident
HyperShift	Reduce per-cluster control-plane VMs; faster cluster churn; multi-tenant mgmt	Hosting cluster becomes shared blast radius; two-layer debugging	“Hosting cluster health” pages everyone

When a hosted control plane simplifies operations

Hosted control planes help when your bottleneck is “running too many control planes.”

1) You operate many clusters (multi-tenant SaaS, env sprawl)

Cluster count is the multiplier.

If you run 20+ clusters, self-managed control planes become a tax:

patch windows multiply

certificate and etcd risk multiplies

“one-off cluster drift” becomes normal

EKS removes the control plane instances from your fleet and gives you a standardized control plane architecture across AZs.

HyperShift goes further: it removes dedicated control-plane machines per cluster and runs them as pods on a hosting cluster.

2) You want predictable control-plane availability without building an etcd practice

etcd is not hard until it’s hard at 3 AM.

kubeadm HA docs are clear: external etcd adds configuration surface area (explicit endpoints); stacked etcd is simpler but still your operational problem.

If your team doesn’t want to own etcd restores as a practiced drill, a hosted control plane removes that class of work from your team’s backlog.

3) You need fast cluster create/delete (ephemeral clusters, tenant clusters)

Provisioning speed is operational leverage.

HyperShift is designed around the concept of creating control planes as pods on a management cluster, which reduces the need to “spin up” dedicated control-plane machines per hosted cluster.

That’s useful when:

you create short-lived clusters for CI

you provision tenant clusters and churn them

you want cluster lifecycle to look like deploying an app

4) You’re private-cluster-heavy and want a supported endpoint model

Private changes the operational shape more than any “feature.”

EKS lets you run a private-only API server endpoint (no public access), where kubectl must come from within the VPC or connected networks. Access to the private endpoint is controlled by rules on the cluster security group.

That’s not “simpler” in absolute terms. It’s simpler because it’s a supported, documented pattern with fewer moving parts than self-hosting your own API endpoint VIP/LB and cert story.

When a hosted control plane adds complexity

You trade “masters on VMs” for “network + IAM + shared blast radius.”

1) Control-plane connectivity becomes a first-class dependency

The API server is now “across a boundary,” and boundaries fail.

With EKS private-only clusters:

your kubectl, CI runners, and controllers must live inside the VPC or connected networks

your security group rules become part of cluster availability

For clusters that keep a public endpoint: the EKS default has historically been public access enabled and private access disabled, and you can toggle both.
Either way, endpoint mode is now a design choice you must document, test, and audit.

What changes for on-call:

“API is down” might really be “route to endpoint is broken”

DNS, TGW/peering, SG rules, and client network become suspects
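
One way to make that triage concrete is a probe that reports which stage failed instead of a bare "down". The sketch below is a minimal illustration using only the Python standard library; `classify_endpoint` and its stage names are hypothetical, and it checks DNS and TCP only, not TLS or authentication.

```python
import socket


def classify_endpoint(host: str, port: int = 443, timeout: float = 3.0) -> str:
    """Return which stage failed when reaching host:port: 'dns', 'tcp', or 'ok'."""
    try:
        # Stage 1: name resolution (private DNS zones, resolver forwarding).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            # Stage 2: TCP connect (routes, TGW/peering, security group rules).
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, unreachable, or timed out: try the next resolved address.
            continue
    return "tcp"
```

Run it from each client network (laptop VPN, CI runner, in-cluster pod). A "dns" result points at resolver configuration, a "tcp" result at routing or security groups, and "ok" means the problem is above the network layer.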

2) Identity boundaries get sharper (and easier to misconfigure)

Hosted control planes push you into “who can reach what” decisions.

Private endpoint + security group control is good. It’s also easy to get wrong:

over-broad SG rules turn “private endpoint” into “private but reachable from everything”

too-tight rules break controllers and CI/CD in weird ways

Hosted doesn’t remove IAM work. It moves it to the center of the blast radius.
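
The over-broad half of that problem is easy to lint. Here is a minimal sketch, assuming you can export ingress rules as simple dicts; the rule shape, the `audit_ingress` name, and the /16 threshold are all hypothetical, and the check is IPv4-oriented.

```python
import ipaddress


def audit_ingress(rules, min_prefix=16):
    """Return findings for rules whose source CIDR is broader than /min_prefix.

    `rules` is a list of dicts like {"cidr": "10.0.0.0/8", "port": 443};
    this shape is a hypothetical export, not an AWS API response.
    """
    findings = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"], strict=False)
        # A shorter prefix means a larger source range.
        if network.prefixlen < min_prefix:
            findings.append(
                f"port {rule['port']}: {network} is broader than /{min_prefix}"
            )
    return findings
```

This only catches rules that are too open. Rules that are too tight show up as failed probes from the networks that should have access, so you need both checks.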

3) HyperShift’s hosting cluster becomes shared infrastructure

You didn’t delete control planes. You consolidated them.

HyperShift runs control planes as pods on a hosting cluster.
Docs show that hosted control plane pods can be scheduled broadly (“shared everything”), and you can taint/label nodes to dedicate capacity.

This is the operational trade:

Pro: fewer dedicated control-plane machines per tenant cluster

Con: hosting cluster saturation, upgrades, or outages can hit multiple hosted clusters at once

If you adopt HyperShift, treat the hosting cluster like tier-0 infrastructure:

separate node pools

aggressive monitoring

strict change control

tested disaster recovery

4) Debug becomes two-layer

Symptoms show up in the guest cluster; root cause can live elsewhere.

With EKS, the control plane is managed. You troubleshoot via endpoint reachability, AWS telemetry, and cluster behavior. You can’t SSH into masters, and that’s the point.

With HyperShift, you can often inspect control plane pods on the hosting cluster. That’s powerful, and it means your runbooks must cover two clusters:

guest cluster symptoms

hosting cluster root cause

Private clusters: the “hosted” decision that matters most

Private mode turns networking into part of the control plane.

EKS private endpoint: supported, but policy-heavy

SG rules are now part of cluster uptime.

AWS states that for private-only API servers:

there is no public access from the internet

kubectl must come from the VPC or connected network

cluster security group rules control private endpoint access

This is clean if you already run:

TGW / VPC peering / Direct Connect

private DNS resolution patterns

locked-down egress

It’s messy if your ops tooling lives outside the network boundary and you aren’t ready to move it.

kubeadm private: you own the endpoint and its failure modes

You don’t get a managed endpoint; you build one.

kubeadm HA guides assume you configure a load balancer in front of the control plane nodes and wire up DNS names and endpoints.

That’s flexible. It’s also more work:

API endpoint LB health checks

TLS/cert rotation

routing changes during upgrades

HyperShift private: you design exposure between hosting and guest clusters

Hosted control planes still need reachable endpoints.

Hosted control plane pods live on the hosting cluster. That’s good for consolidation. It also means you must design:

how guest nodes reach the hosted API server

how admins reach it (private networks, bastions, CI runners)

how you segment tenants

The exact networking patterns vary by environment, but the invariant is: private hosted control planes increase the importance of network design.

Terraform: what you actually manage in each model

IaC doesn’t disappear. The resource graph changes.

EKS Terraform surface area

You configure endpoint modes, SGs, node groups, and IAM.

Minimum Terraform concerns:

endpoint access mode (public/private/both)

cluster security group rules for private access

node groups and AMI strategy

IRSA and IAM boundaries

Hosted control plane simplifies the “masters” part. It does not simplify the access-control part.

kubeadm Terraform surface area

Terraform becomes your control-plane installer, not just a cluster creator.

You end up managing:

control plane EC2 instances

LB/VIP in front of API servers (common HA pattern)

etcd instances (external) or colocated etcd (stacked)

bootstrap scripts, cert distribution, upgrade workflows

This can be clean if you have mature automation. If not, it’s a lot of state to keep consistent.

HyperShift Terraform surface area

You manage the hosting cluster like a platform, then declaratively create hosted clusters.

HyperShift adds:

hosting cluster lifecycle (upgrade, capacity, resilience)

hosted cluster objects and their infra mappings

scheduling policies for control plane pods (dedicated nodes via labels/taints)

Terraform can drive parts of this, but you’ll also lean on cluster-native controllers.

Prometheus: what you need to watch so hosted doesn’t surprise you

Hosted control planes move failure modes. Your dashboards must follow.

At minimum, split monitoring into two planes:

Workload plane (guest cluster apps)
request rates, latency, errors
node saturation
queue depth / retries
Control plane (API server and controllers)
API server availability/latency from where your clients run
controller health signals
for HyperShift: hosting cluster resource pressure, because control planes are pods

For private clusters, add synthetic checks from the networks that matter:

from CI runner network

from admin network

from in-cluster controllers

If the API endpoint is unreachable from your automation network, you don’t have a cluster. You have a museum exhibit.
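
A synthetic check is only useful if its result lands in Prometheus with the source network as a label. The sketch below renders probe results in the Prometheus text exposition format; the metric names and the `render_probe_metrics` helper are illustrative assumptions, not names from any existing exporter.

```python
def render_probe_metrics(results):
    """Render probe results as Prometheus text exposition lines.

    `results` maps a source-network label (e.g. "ci") to a tuple
    (success: bool, seconds: float). Metric names are illustrative.
    """
    lines = ["# TYPE apiserver_probe_success gauge"]
    for source, (ok, _seconds) in sorted(results.items()):
        lines.append(f'apiserver_probe_success{{source="{source}"}} {int(ok)}')
    lines.append("# TYPE apiserver_probe_duration_seconds gauge")
    for source, (_ok, seconds) in sorted(results.items()):
        lines.append(
            f'apiserver_probe_duration_seconds{{source="{source}"}} {seconds:.3f}'
        )
    return "\n".join(lines) + "\n"


print(render_probe_metrics({"ci": (True, 0.042), "admin": (False, 3.0)}))
```

Serve that text from each probe location and alert per `source` label, so "API is down from CI only" is visible as its own condition.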

Decision checklist for SaaS and platform teams

Answer these honestly and the right model usually falls out.

How many clusters will you run in 12 months?
If the number is growing fast, hosted control plane saves toil.
Do you have an etcd practice?
If “restore drill” isn’t something you run quarterly, kubeadm HA is a risk trade.
Is private-only mandatory?
If yes, model endpoint reachability and SG rules as part of uptime.
Can you tolerate shared blast radius?
HyperShift consolidates control planes. Treat hosting cluster as tier-0.
What do you want to debug at 3 AM: VMs or networks?
kubeadm tends toward VM-level debugging.
hosted control planes tend toward network/identity debugging.

Where AceCloud fits

Hosted control plane only helps if the day-2 loop is owned and scripted.

If you’re buying hosted control plane benefits but don’t want to run the surrounding ops (endpoint policies, Terraform hygiene, Prometheus wiring, upgrade runbooks), a managed Kubernetes provider like AceCloud can own that platform loop while your team focuses on workload correctness and SLOs.

Bottom line

Hosted control plane is not “less complexity.” It’s different complexity.

Pick a hosted control plane (EKS) when you want AWS to own control plane HA, scaling, and replacement across AZs.
Pick kubeadm when you need maximum control and you’re willing to own HA topology, etcd ops, and endpoint plumbing.
Pick HyperShift when you need to run many clusters and you’re ready to operate a tier-0 hosting cluster that runs control planes as pods.

The correct choice is the one that gives every failure mode a clear owner—and keeps your pager quiet for the right reasons.

Serving LLMs on IaaS: throughput vs latency tuning with practical guardrails

Daya Shankar — Fri, 27 Feb 2026 06:11:05 +0000

Serving LLMs on IaaS is queueing plus memory pressure dressed up as ML. Every request has a prefill phase (prompt → KV cache) and a decode phase (token-by-token output).

Throughput tuning pushes batching and concurrency. Latency tuning caps them to protect TTFT and ITL. With vLLM on a single L40S (PCIe), you win by setting hard limits and enforcing admission control.

TTFT, ITL, TPS: stop mixing the metrics

If you tune the wrong metric, you’ll ship a fast benchmark and a slow product.

You need three numbers, and they mean different things:

TTFT (time to first token): how long the user waits before anything shows up. Interactive UX lives here.
ITL (inter-token latency): the “smoothness” of streaming output once decoding starts. Chat feels broken when this jitters.
Throughput (tokens/sec): the finance metric. It decides cost per request.

One important detail: E2E latency includes queueing + prefill + decode. TTFT is where queueing hides when you’re overloaded.

Practical measurement rule: measure TTFT and ITL at the client (or gateway), not inside the GPU server. Internal timings miss queueing in front of vLLM.
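
To keep the three numbers separate, compute them from token arrival timestamps recorded at the client. This is a minimal sketch; `stream_metrics` and its field names are hypothetical, and it assumes you already capture one timestamp per streamed token.

```python
from statistics import mean


def stream_metrics(sent_at, token_times):
    """Compute client-side TTFT, ITL gaps, and token rate for one streamed response.

    sent_at: timestamp (seconds) when the request left the client.
    token_times: timestamps (seconds) when each output token arrived.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - sent_at  # includes queueing + prefill
    # Gaps between consecutive tokens are the inter-token latencies.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    e2e = token_times[-1] - sent_at
    return {
        "ttft_s": ttft,
        "itl_mean_s": mean(itl) if itl else 0.0,
        "itl_max_s": max(itl) if itl else 0.0,
        "e2e_s": e2e,
        "tokens_per_s": len(token_times) / e2e if e2e > 0 else 0.0,
    }
```

Because `sent_at` is taken at the client, any time spent queued in front of vLLM shows up in `ttft_s`, which is exactly the part that server-side timings miss.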

Hardware reality check: single L40S on PCIe

You can’t tune around a bus you don’t have.

An L40S is a strong inference GPU, but it’s not an NVLink box. It’s 48GB GDDR6 on PCIe Gen4 x16.
That matters because:

You have one GPU’s worth of memory for weights + KV cache.
You don’t get multi-GPU model parallel tricks for free.
Your main enemies are KV-cache pressure and batch/concurrency overshoot, not “GPU topology.”

On a single GPU server, latency failures usually look like:

TTFT spikes because the prefill queue grows.
ITL spikes because decode gets starved or the batch gets too big.
OOM/restarts because KV cache math was wishful thinking.
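
The KV cache math is worth doing on paper before the first load test. The sketch below uses the standard per-token estimate (one K and one V vector per layer per KV head). The model shape, weight size, and overhead figures are assumptions chosen for illustration, not measurements from a specific model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_capacity(gpu_bytes, utilization, weight_bytes, overhead_bytes, per_token):
    """Tokens of KV cache that fit once weights and runtime overhead are subtracted."""
    budget = int(gpu_bytes * utilization) - weight_bytes - overhead_bytes
    return max(budget, 0) // per_token


GIB = 2**30
# Assumed model shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
# Assumed budget: 48 GiB card, 0.85 utilization, 16 GiB weights, 2 GiB overhead.
capacity = kv_token_capacity(48 * GIB, 0.85, 16 * GIB, 2 * GIB, per_token)
# Third value: how many full 4096-token sequences fit in that cache.
print(per_token, capacity, capacity // 4096)
```

Divide the token capacity by your realistic worst-case sequence length (prompt plus output) to get a ceiling on concurrency. If that ceiling is below your planned concurrency limit, the limit is the wishful part.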

vLLM’s default behavior: TTFT-first scheduling (and the trade)

vLLM already picks a side; your job is to set guardrails around it.

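By default, vLLM’s scheduler has historically prioritized prefills and does not batch prefill and decode into the same batch. That typically optimizes TTFT, but can worsen ITL and GPU utilization. Newer releases enable chunked prefill by default, which shifts the policy toward decodes, so confirm the behavior of the version you run.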

Translation: out of the box, vLLM tries to be responsive. You can still break it by feeding it mixed traffic with no limits.

The knobs that actually move TTFT, ITL, and OOM risk

You don’t “optimize latency.” You configure concurrency and KV-cache headroom.

These four knobs do most of the work in production vLLM.

1) --max-num-seqs caps concurrency

This is your “how many requests can be active” ceiling.

--max-num-seqs is the maximum number of sequences per iteration.
Lowering it:

reduces concurrent KV cache usage
reduces queue contention inside the engine
usually helps tail latency (until you underutilize the GPU)

2) --max-num-batched-tokens controls batch size per iteration

This is where you trade throughput for TTFT/ITL stability.

--max-num-batched-tokens limits batched tokens per iteration.
Lowering it:

reduces “one huge prefill” events
reduces KV cache demand per cycle
can protect TTFT and prevent decode jitter

Raising it:

can increase throughput
can increase queueing and tail spikes if your traffic is bursty or prompts are long

3) --gpu-memory-utilization sets KV-cache headroom

This decides how much VRAM vLLM pre-allocates for cache.

vLLM pre-allocates GPU cache using gpu_memory_utilization. Increase it to provide more KV cache space.
If you set it too high, you leave too little VRAM for everything outside the cache (activations, CUDA graphs, other processes on the GPU) and risk OOM. If you set it too low, you’ll hit KV cache limits early and TTFT will spike under concurrency.

4) --enable-chunked-prefill tames long prompts

Long prompts are TTFT killers; chunking makes them less explosive.

When enabled, vLLM can chunk prefill requests based on max_num_batched_tokens.
This is a practical guardrail when you can’t control prompt length perfectly.
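
To see why chunking helps, here is a deliberately simplified model of one long prompt sharing a per-iteration token budget with in-flight decodes. It is not vLLM's scheduler code; `prefill_chunks` is a hypothetical helper that assumes each decode consumes one token of budget per iteration and the prefill takes what is left.

```python
def prefill_chunks(prompt_tokens, max_num_batched_tokens, decodes_in_flight=0):
    """Split one long prompt into per-iteration chunks under a shared token budget.

    Simplified model: each in-flight decode takes one token of budget per
    iteration, and the prefill gets whatever is left.
    """
    per_iter = max_num_batched_tokens - decodes_in_flight
    if per_iter <= 0:
        raise ValueError("token budget is fully consumed by decodes")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(per_iter, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 20,000-token prompt with 32 decodes in flight and an 8192-token budget.
print(prefill_chunks(20000, 8192, decodes_in_flight=32))
```

The point of the model: the long prompt is spread across several iterations, and the in-flight decodes keep producing tokens in each of them instead of stalling behind one giant prefill.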

A sane starting config for your SLA (p95 TTFT 250ms, p99 800ms)

Start conservative, hit the TTFT target, then earn throughput back.

On a single L40S, don’t begin with “maximum throughput.” Begin with “stable TTFT.”

Example vllm serve baseline (single GPU):

vllm serve /models/your-llm \
--host 0.0.0.0 --port 8000 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 64 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill

Why these shapes:

max_num_seqs prevents unlimited concurrency blowups.
max_num_batched_tokens prevents one batch from ballooning.
gpu_memory_utilization keeps cache headroom explicit.
chunked prefill reduces “one giant prompt ruins the minute.”

You will tune these. But you need a stable base first.
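
If you want the baseline in version control rather than shell history, one option is to generate the command from a settings dict. This is a small sketch; `vllm_serve_args` and `BASELINE` are hypothetical names, and the flags are only the ones from the baseline above.

```python
def vllm_serve_args(model_path, settings):
    """Build a `vllm serve` argument list from a settings dict.

    Keys map to the flags used in the baseline above; True-valued keys
    become bare switches, False/None keys are omitted.
    """
    args = ["vllm", "serve", model_path]
    for flag, value in settings.items():
        name = "--" + flag.replace("_", "-")
        if value is True:
            args.append(name)
        elif value is not False and value is not None:
            args.extend([name, str(value)])
    return args


BASELINE = {
    "host": "0.0.0.0",
    "port": 8000,
    "gpu_memory_utilization": 0.85,
    "max_num_seqs": 64,
    "max_num_batched_tokens": 8192,
    "enable_chunked_prefill": True,
}
print(" ".join(vllm_serve_args("/models/your-llm", BASELINE)))
```

Each tuning change then becomes a one-line diff to `BASELINE`, which makes rollbacks and per-node consistency easier to reason about.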

Practical guardrails for mixed chat + batch traffic

Throughput tuning is easy. Guardrails are what keep p99 alive.

Mixed traffic (interactive + batch) is where systems get weird. Batch clients tend to:

send long prompts
request long generations
retry aggressively
keep load constant

Interactive chat needs:

fast TTFT
consistent ITL
predictable tail behavior

So you need admission control in front of vLLM. Not “best effort.”

Guardrail table (start here)

These caps stop one client from torching everyone else.

Guardrail	Default starting point	Why it exists
Max prompt tokens	4k–8k (per request)	Long prefills blow TTFT and batch size
Max output tokens	256–512 (interactive), 1k+ (batch)	Protect tail latency for chat
Max in-flight requests	Tie to max_num_seqs	Prevent internal queue explosion
Max queue depth	1–2× in-flight	If queue > that, reject/429 fast
Request timeout	Slightly above p99 target	Don’t let zombie requests clog decode
Retry policy	capped + jitter	Stops retry storms multiplying load

These aren’t theoretical. They’re how you keep a single GPU server usable.
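
As a rough illustration, the size and queue rows of the table collapse into one admission function at the gateway. The `Limits` defaults below pick one value from each suggested range, and the names and return shape are assumptions for this sketch rather than any gateway's API.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    max_prompt_tokens: int = 8192
    max_output_tokens: int = 512
    max_in_flight: int = 64      # tied to max_num_seqs
    max_queue_depth: int = 64    # 1x in-flight


def admit(prompt_tokens, max_tokens, in_flight, queued, limits=None):
    """Return (decision, http_status) for a request arriving at the gateway.

    decision is "admit", "queue", or "reject"; http_status is set only on reject.
    """
    limits = limits or Limits()
    # Oversized requests are a client error, regardless of load.
    if prompt_tokens > limits.max_prompt_tokens:
        return "reject", 400
    if max_tokens > limits.max_output_tokens:
        return "reject", 400
    if in_flight < limits.max_in_flight:
        return "admit", None
    if queued < limits.max_queue_depth:
        return "queue", None
    # Queue is full: fail fast instead of letting TTFT absorb the wait.
    return "reject", 429
```

Size checks run before load checks on purpose: an oversized request should get the same 400 whether the server is idle or saturated.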

Two-lane routing (interactive vs batch)

If you mix traffic in one FIFO queue, batch wins and chat loses.

On one GPU, the clean pattern is two lanes at the gateway:

Interactive lane: strict caps (short prompts, short outputs), low queue depth.
Batch lane: looser caps, but it yields when interactive is busy.

You can implement this with a thin gateway that:

inspects request size (prompt tokens + requested output tokens)
routes “interactive” to the main lane
routes “batch” to a background lane with looser size caps but stricter admission (it is shed first under load)

Even if both lanes hit the same vLLM backend, the queue policy changes outcomes.

Concrete rule that works:
If interactive queue depth > N, reject batch (429) instead of letting it sit and inflate TTFT.
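
That rule fits in a few lines. The sketch below classifies by request size and sheds batch when the interactive queue is backed up; the caps and the `shed_threshold` default (the N in the rule) are placeholder values you would set from your own traffic.

```python
def route(prompt_tokens, max_tokens, interactive_queue_depth,
          interactive_prompt_cap=2048, interactive_output_cap=512,
          shed_threshold=8):
    """Pick a lane by request size; shed batch when the interactive lane is backed up."""
    is_interactive = (
        prompt_tokens <= interactive_prompt_cap
        and max_tokens <= interactive_output_cap
    )
    if is_interactive:
        return "interactive"
    # Batch work is only accepted while interactive traffic has headroom.
    if interactive_queue_depth > shed_threshold:
        return "reject_429"
    return "batch"
```

Classifying by size rather than by a client-supplied label means a batch client cannot opt itself into the interactive lane by setting a header.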

The tuning loop that converges (without cargo cult)

Tune one knob at a time and measure TTFT and ITL separately.

Here’s the loop to run on a GPU cloud server before you call it “production.”

Step 1: Fix the workload mix

Your traffic generator must match reality.

Build two test profiles:

Chat: short prompts, short outputs, bursty concurrency.
Batch: longer prompts and outputs, steady concurrency.

If you benchmark only one, you’ll tune only one.

Step 2: Lock SLOs first

You already have targets; enforce them.

Targets:

TTFT p95 ≤ 250ms
TTFT p99 ≤ 800ms

Keep a red line on the dashboard. If a tuning change crosses it, roll back.
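
The red line can be a pass/fail check in the benchmark script as well as a dashboard annotation. This is a minimal sketch using nearest-rank percentiles over client-side TTFT samples in milliseconds; `percentile` and `within_slo` are hypothetical helper names.

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) using integer math
    return ordered[rank - 1]


def within_slo(ttft_ms, p95_budget=250, p99_budget=800):
    """True when both TTFT targets from this section hold for the sample set."""
    return (
        percentile(ttft_ms, 95) <= p95_budget
        and percentile(ttft_ms, 99) <= p99_budget
    )
```

Run it per lane and per test profile. A config that passes on the chat profile alone has not passed.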

Step 3: Set limits, then raise carefully

Earn throughput; don’t steal it from p99.

Order of operations:

Set max_num_seqs low enough that you never OOM under your worst prompt mix.
Set max_num_batched_tokens to prevent giant prefills from blocking decode.
Adjust gpu_memory_utilization to give KV cache room.
Enable chunked prefill if long prompts exist in real traffic.

Then:

increase max_num_seqs until TTFT p95 hits the edge of your budget
increase max_num_batched_tokens only if ITL stays stable and TTFT doesn’t spike
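
The "raise carefully" step can be automated as a simple ramp. In this sketch, `measure_p95_ms` is a caller-supplied function (hypothetical) that restarts the server with a given max_num_seqs, runs the load profile, and returns TTFT p95 in milliseconds; the start, step, and ceiling values are placeholders.

```python
def ramp_max_num_seqs(measure_p95_ms, start=8, step=8, ceiling=256, budget_ms=250):
    """Raise max_num_seqs stepwise; return the last value whose TTFT p95 fit the budget.

    Returns None if even the starting value misses the budget.
    """
    best = None
    value = start
    while value <= ceiling:
        if measure_p95_ms(value) > budget_ms:
            break  # first value over budget: stop and keep the previous one
        best = value
        value += step
    return best


# Usage with a fake measurement that degrades linearly with concurrency.
print(ramp_max_num_seqs(lambda n: 100 + 2 * n))
```

The same loop works for max_num_batched_tokens if the measurement function also checks ITL, but change only one knob per ramp.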

Step 4: Add overload behavior on purpose

A good system fails fast, not slowly.

Define overload mode:

when queue depth exceeds threshold → return 429 with Retry-After
when prompt/output exceeds limits → return 400 with a clear message
when the interactive lane is busy → shed batch first

If you don’t define this, your system will “define it” by melting.
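
Fail-fast only works if clients back off instead of hammering the gateway. Here is a sketch of a client-side delay calculation, assuming capped exponential backoff with full jitter and a floor at the server's Retry-After value; `retry_delay` and its defaults are illustrative.

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=8.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Uses capped exponential backoff with full jitter, and never waits less
    than a server-supplied Retry-After value.
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng.uniform(0, ceiling)
    if retry_after is not None:
        # Honor the server's hint as a minimum wait.
        delay = max(delay, float(retry_after))
    return delay
```

Pair this with a hard cap on the number of attempts. Jitter spreads retries out, but only a retry limit stops a saturated server from being asked forever.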

Dashboards that catch trouble before users do

You can’t grep production. You need signals that predict tail spikes.

Track:

TTFT p50/p95/p99 (interactive lane, batch lane)
ITL distribution (interactive lane)
queue depth and reject rate (the guardrail is working if it fires)
GPU memory usage and cache pressure (OOM risk proxy)

vLLM already frames TTFT/ITL as the core performance story, and its scheduler tradeoffs explain why TTFT can look good while ITL suffers (or vice versa).

Where AceCloud fits (one honest paragraph)

IaaS isn’t the problem; inconsistency is.

If you’re serving on an IaaS GPU cloud server from a provider like AceCloud, treat it like any other VM: bake a known image, pin driver/CUDA versions, and script your vLLM flags so every node behaves the same. The tuning work above only sticks when the box is predictable.

Bottom line

Throughput is what you brag about. Latency is what users feel.

On vLLM + single L40S, you don’t win by chasing max tokens/sec. You win by controlling concurrency and batch size, allocating KV cache intentionally, and enforcing guardrails that keep mixed traffic from turning into a queueing disaster. Hit TTFT p95/p99 first. Then scale throughput without stealing it from your tail.