DEV Community

Muskan
Going to Production: Spot Instances, Karpenter, and the Graviton Advantage

The mathematics of Kubernetes in production is brutal and undeniable. Ninety-six percent of enterprises now run Kubernetes at scale, yet the economic reality underneath these deployments tells a troubling story.

Research from industry analysts consistently shows that 30% of cloud spending on Kubernetes workloads is wasted money that delivers zero operational value. When an organization invests $1,000,000 annually in Kubernetes infrastructure, $300,000 evaporates without improving performance, reliability, or throughput.

This waste compounds annually. 88% of teams report year-over-year increases in total cost of ownership, a trend that makes cloud-native economics increasingly difficult to justify to finance leadership. The root cause isn't Kubernetes itself; it's the gap between how development clusters operate and what production environments demand.

Development environments tolerate over-provisioned resources and idle capacity. Production environments do not. Production workloads require availability guarantees, consistent performance, and resilience against infrastructure failures. These legitimate requirements create the economic tension that defines modern cloud operations: how do you meet production SLAs while preventing costs from growing unchecked?

The Data Shows This Is Solvable

  • An e-commerce platform running seasonal workloads reduced monthly Kubernetes spending from $89,000 to $52,000 in six weeks, achieving a 42% reduction by applying production-appropriate optimization patterns.
  • A fintech company with steady state workloads achieved a 38% reduction, moving from $34,000 to $21,000 monthly in four weeks.

These results are not anomalies. They represent what becomes possible when you approach production Kubernetes with the same rigor applied to other infrastructure decisions. This article examines three techniques that make this possible: Spot instance integration, Karpenter provisioning, and Graviton ARM migration.


Workload Tolerance Classification Framework


Before applying Spot instances, Karpenter, or Graviton migrations, you need to understand what you're running. Not all Kubernetes workloads deserve identical treatment. This classification determines which optimization techniques each workload can safely use:

  1. Mission-Critical Workloads: Cannot tolerate Spot interruptions under any circumstances. These require always-on capacity (On-Demand or Reserved) with zero tolerance for disruption. Examples: Payment processing pods, core database instances, customer-facing APIs.
  2. Stateful Workloads: Occupy a middle ground. They can handle limited Spot usage but require persistent volumes and graceful shutdown handling. Examples: Databases with replicas, message queues with durable storage, and caching layers. (Spot instances work for non-primary replicas; primary instances stay On-Demand.)
  3. Batch-Tolerant Workloads: Ideal Spot candidates. Examples: Data pipelines, CI/CD jobs, ML training, report generation. These achieve the highest Spot savings because interruptions simply trigger a retry.
  4. Development and Test Workloads: Highest tolerance for interruption. Non-production environments can run entirely on Spot instances with aggressive scheduling.
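
The classification above maps directly onto scheduling constraints. As a minimal sketch (assuming Karpenter, which labels the nodes it provisions with karpenter.sh/capacity-type; the Job name and image are placeholders), a batch-tolerant workload can be pinned to Spot capacity while mission-critical Deployments simply omit the selector:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report                     # hypothetical batch-tolerant workload
spec:
  backoffLimit: 6                          # a Spot interruption simply triggers a retry
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # schedule only onto Spot-backed nodes
      restartPolicy: OnFailure
      containers:
        - name: report
          image: registry.example.com/report-gen:latest   # placeholder image
```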

Spot Instance Interruption Handling

When AWS reclaims a Spot instance, it doesn't disappear without warning. The cloud provider emits a two-minute termination notice through the instance metadata service. This brief window enables a sophisticated choreography of graceful shutdown, workload migration, and state preservation.

[Image: Kubernetes workload distribution across nodes with autoscaling and cost optimization]

Kubernetes surfaces these infrastructure events through node taints. When a termination handler on the node detects the notice, it communicates the impending departure to the control plane by cordoning and tainting the node. This triggers eviction of susceptible pods while preventing new ones from landing on condemned infrastructure.

Managing Evictions

Pod Disruption Budgets (PDBs) provide the declarative mechanism for controlling evacuation behavior. A PDB specifies the minimum number/percentage of replicas that must remain available during voluntary disruptions.
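
A minimal PDB sketch (the Deployment name and labels are illustrative): this guarantees that at least two replicas of a hypothetical checkout-api survive any voluntary disruption, including Spot-driven drains:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb
spec:
  minAvailable: 2            # alternatively maxUnavailable, or a percentage like "50%"
  selector:
    matchLabels:
      app: checkout-api      # must match the Deployment's pod labels
```

Evictions that would violate the budget are refused, so a node drain blocks until replacement pods are running elsewhere.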

  • Infrastructure-Level Handling: Relies on Kubernetes primitives (PDBs, node taints, lifecycle controller) to manage evacuation declaratively. Effective for stateless services.
  • Application-Level Handling: Involves active state management: checkpointing in-memory state, completing in-flight transactions, and replicating writes.

The savings-versus-complexity trade-off becomes explicit here. Extending Spot usage to stateful databases demands substantial engineering investment in graceful termination. Restricting Spot instances to stateless API tiers, by contrast, captures most of the savings without multiplying operational complexity.

The Termination Handler Pattern: A DaemonSet polls the instance metadata endpoint every five seconds. When it detects a termination notice, it triggers kubectl drain to gracefully evict pods, using a 90-second grace period that fits inside the two-minute window before the instance disappears.
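
The pattern above can be sketched as a DaemonSet. This is a minimal sketch: the image, the IMDSv1 endpoint, and the RBAC setup are assumptions, and production clusters typically run the AWS-maintained aws-node-termination-handler instead.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spot-termination-handler
spec:
  selector:
    matchLabels: {app: spot-termination-handler}
  template:
    metadata:
      labels: {app: spot-termination-handler}
    spec:
      hostNetwork: true                  # needed to reach the instance metadata service
      # serviceAccountName with RBAC permitting node cordon/drain is omitted here
      containers:
        - name: handler
          image: bitnami/kubectl:latest  # any image providing kubectl and curl works
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                # Returns 404 until AWS issues the two-minute notice (IMDSv1 shown;
                # IMDSv2 requires fetching a session token first)
                if curl -sf http://169.254.169.254/latest/meta-data/spot/instance-action; then
                  kubectl drain "$NODE_NAME" --ignore-daemonsets \
                    --delete-emptydir-data --grace-period=90 --force
                  sleep 120
                fi
                sleep 5
              done
          env:
            - name: NODE_NAME
              valueFrom: {fieldRef: {fieldPath: spec.nodeName}}
```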


Karpenter vs. Cluster Autoscaler: Decision Framework

When graduating to production, the choice between Cluster Autoscaler and Karpenter directly impacts cost.

  • Cluster Autoscaler (CA): Operates within the constraints of predefined, static node groups. It provides predictability but creates inefficiency. If your node group contains only m5.large instances and demand requires m5.xlarge, the cluster scales horizontally by adding more m5.large nodes rather than right-sizing to the actual need.
  • Karpenter: Eliminates node groups. You define provisioners (declarative specifications of workload requirements), and Karpenter dynamically selects instance types from the broader AWS capacity pool.
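
A hedged example of such a specification (field names follow the Karpenter v1 NodePool API; older releases used the Provisioner resource, and the EC2NodeClass name is a placeholder):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # assumes a matching EC2NodeClass exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Karpenter prefers Spot, falls back to On-Demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]      # lets Karpenter pick Graviton instances too
  limits:
    cpu: "200"                            # cap on total provisioned capacity
```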

The 3 Operational Advantages of Karpenter

  1. Expansion Speed: Provisions new nodes in seconds rather than minutes.
  2. Bin-Packing Efficiency: Matches pod resource requests to optimal instance sizes, reducing wasted vCPU and memory.
  3. Consolidation: Combines smaller pods onto larger instances and decommissions underutilized nodes.

The Verdict: Choose Cluster Autoscaler when compliance requires predictable, pre-approved instance types. Choose Karpenter when cost efficiency and scaling speed outweigh the need for rigid infrastructure control.


Node Pool Consolidation Strategies

Once Karpenter provisions nodes, it optimizes by consolidating them. Karpenter supports two consolidation modes:

  • Aggressive consolidation: Karpenter actively migrates pods as soon as opportunities arise, terminating empty nodes immediately. Delivers rapid cost reduction but generates pod churn. Best for variable, stateless workloads.
  • Conservative consolidation: Karpenter detects opportunities but only reclaims nodes as pods naturally terminate or scale down. Best for long-running stateful workloads where relocation incurs network and stability costs.
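
The two modes above correspond to the disruption block of a NodePool. A sketch using Karpenter v1 field names (older releases expose different settings):

```yaml
# Fragment of a NodePool spec (Karpenter v1; field names differ in older releases)
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # aggressive: migrate pods proactively
    # consolidationPolicy: WhenEmpty                # conservative: only reap empty nodes
    consolidateAfter: 30s                           # idle time required before acting
```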

ARM Architecture Migration: The Graviton Advantage

AWS Graviton processors (custom ARM-based silicon) deliver 20-40% better price-performance than comparable x86 instances. This efficiency stems from the ARM instruction set requiring fewer transistors per instruction, reducing power consumption and heat generation.

Application compatibility is often better than teams expect. Applications written in Go, Java, Python, and Node.js execute on ARM without source code modification. The critical dependency is native libraries (compiled C/C++ extensions), which must ship ARM builds.

  • When to migrate: Teams running sustained compute workloads at scale, containerized apps lacking x86-specific dependencies, and environments using Karpenter.
  • When to avoid: Workloads dependent on x86-specific binaries without ARM equivalents, or teams without the capacity to perform adequate ARM testing.


Multi-Architecture Container Images

The foundation for Graviton migration rests on your container images. Without properly constructed multi-architecture manifests, Karpenter cannot seamlessly route workloads to ARM nodes.

Multi-arch support begins at build time through Docker buildx. Rather than maintaining separate pipelines, buildx builds for linux/amd64 and linux/arm64 in a single invocation. It pushes the per-architecture images along with a manifest list: a single tag that resolves to the correct architecture at pull time based on the node's platform.
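
A typical invocation looks like the following (the builder name, image name, and tag are placeholders):

```shell
# One-time: create a builder capable of targeting multiple platforms
docker buildx create --name multiarch --use

# Build amd64 and arm64 variants and push them under a single tag
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/myapp:1.0.0 \
  --push .
```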

The Dockerfile Pattern:
Utilize the platform argument in multi-stage builds. The build stage accepts $BUILDPLATFORM and $TARGETARCH to compile binaries, while the runtime stage pulls a matching base image.
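
A sketch of that pattern, using Go purely for illustration (any language with a cross-compiling toolchain follows the same shape):

```dockerfile
# Build stage runs on the builder's native platform and cross-compiles for the target
FROM --platform=$BUILDPLATFORM golang:1.22 AS build
ARG TARGETOS
ARG TARGETARCH
WORKDIR /src
COPY . .
# Disabling CGO sidesteps the native-library concern called out above
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app .

# Runtime stage: the matching base image is pulled per target architecture
FROM gcr.io/distroless/static
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```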

Migration Sequencing Strategy:
Begin with stateless, embarrassingly parallel workloads (API gateways, CI runners, batch processors). These reschedule without state concerns, allowing you to validate Graviton performance at production scale before tackling databases or caching layers.


Conclusion

Production Kubernetes doesn’t become expensive by accident; it becomes expensive through default decisions left unchallenged. Over-provisioned nodes, static scaling models, and architecture inertia quietly compound into significant financial waste.

The path to efficiency isn’t a single tool, but a combination of deliberate choices:

  • Spot instances unlock immediate cost savings for interruption-tolerant workloads.
  • Karpenter introduces real-time, intelligent infrastructure decisions that eliminate wasted capacity.
  • Graviton (ARM) delivers structural price-performance gains at the compute level.

Individually, each strategy improves efficiency. Together, they fundamentally reshape the economics of running Kubernetes in production.

The key is not to optimize everything at once, but to start with workload awareness. Classify what you run, apply the right strategy incrementally, and validate outcomes under production conditions, not assumptions.
