Why We Moved from GKE to EKS
When we initially adopted Kubernetes, Google Kubernetes Engine (GKE) Autopilot seemed like the perfect choice — fully managed, minimal operational overhead, and quick to get started.
But as our workloads matured, three major challenges started to surface:
- Rising and unpredictable costs
- Compliance constraints
- The need for deeper infrastructure control
This blog walks through why we migrated to Amazon Elastic Kubernetes Service (EKS) with Karpenter, the architectural changes we made, and the lessons we learned running production workloads post-migration.
⚠️ Why GKE Autopilot Started Falling Short
1. Cost Inefficiencies at Scale
GKE Autopilot pricing is based on requested resources, not actual usage. This sounds fine at small scale — but as traffic grows, the gaps between requested and actual usage start to compound.
Problems we observed:
- Over-provisioned workloads leading to higher bills
- No access to Spot/Preemptible node strategies with the flexibility of self-managed node pools
- Very few cost optimization knobs to tune
As traffic grew, costs increased almost linearly with no meaningful way to optimize without restructuring our entire workload configuration.
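To make the billing gap concrete, here's a hypothetical Deployment fragment (the names and numbers are illustrative, not from our actual workloads). Autopilot charges for what the pod requests, regardless of what it actually consumes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api               # hypothetical workload
spec:
  replicas: 10
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      containers:
        - name: api
          image: example/api:latest
          resources:
            requests:
              cpu: "2"      # Autopilot bills the full 2 vCPU per replica...
              memory: 4Gi   # ...even if `kubectl top` shows ~300m / ~700Mi in use
```

Multiply that request-vs-usage gap across ten replicas and dozens of services, and the over-provisioning compounds quickly.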
2. Compliance and Governance Constraints
Operating in a regulated environment required:
- Fine-grained IAM control at the workload level
- Strict network isolation between services
- Audit-level visibility into infrastructure activity
With GKE Autopilot, several configurations are abstracted away or restricted by design. This made it harder to enforce organization-wide security policies and satisfy compliance requirements from auditors. Specifically:
- Enforcing per-pod IAM permissions cleanly was non-trivial
- Network policy enforcement had gaps in our specific setup
- Generating audit-ready logs tied to individual workload actions required workarounds
We needed something that gave us first-class integration with cloud-native IAM and security tooling — without layering on custom solutions.
3. Limited Infrastructure Control
When performance-sensitive services started hitting bottlenecks, the inability to choose instance types became a real blocker. We had no control over:
- CPU vs. memory-optimized instance selection
- ARM-based workloads on Graviton processors
- Custom AMIs or low-level networking tuning
For teams running general-purpose workloads, this abstraction is a feature. For us, it was a ceiling.
🎯 Why We Chose EKS + Karpenter
Full Infrastructure Control
Moving to EKS gave us direct control over:
- Instance families — CPU-optimized, memory-optimized, ARM (Graviton)
- Custom AMIs — hardened images meeting our internal security baseline
- Networking — VPC-native networking with fine-grained subnet and security group control
This unlocked workload-specific performance tuning that simply wasn't possible before.
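As a sketch of what that control looks like in practice, a workload can now pin itself to Graviton capacity with a plain node selector (the Deployment name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingest-worker      # hypothetical service built for arm64
spec:
  replicas: 3
  selector:
    matchLabels: {app: ingest-worker}
  template:
    metadata:
      labels: {app: ingest-worker}
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # schedules only onto Graviton (ARM) nodes
      containers:
        - name: worker
          image: example/ingest-worker:arm64
```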
Advanced Cost Optimization with Karpenter
Karpenter is not your traditional cluster autoscaler. Instead of scaling pre-defined node groups, it:
- Watches for unschedulable pods in real time
- Selects the right-sized instance based on actual pod requirements
- Prioritizes Spot instances where workloads allow, falling back to On-Demand seamlessly
- Bin-packs nodes efficiently, reducing idle capacity
The result: faster scaling reactions and a dramatically lower compute bill — without sacrificing reliability.
Compliance Alignment
AWS gave us the compliance story we needed:
- IRSA (IAM Roles for Service Accounts) — precise, per-pod IAM permissions with no shared credentials
- VPC-level isolation — full control over ingress, egress, and inter-service communication
- CloudTrail integration — every API call, every node action, fully auditable out of the box
- AWS Config + Security Hub — continuous compliance checks against CIS benchmarks and custom rules
This made our next compliance audit significantly smoother. Auditors got clear, traceable logs without us having to build custom instrumentation.
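For reference, wiring a pod to a dedicated IAM role via IRSA is a single annotation on its ServiceAccount (the account ID, role, and service names below are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api       # hypothetical service
  namespace: payments
  annotations:
    # Placeholder account ID and role name — each service gets its own
    # narrowly scoped role instead of shared node credentials.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-api
```

Pods running under this ServiceAccount receive short-lived credentials for that role only, with no long-lived keys to manage.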
🏗️ Target Architecture
Here's what the high-level migration looked like architecturally:
| Component | Before (GKE) | After (EKS) |
|---|---|---|
| Cluster | GKE Autopilot | EKS (Managed Node Groups + Karpenter) |
| Autoscaling | Built-in Autopilot scaling | Karpenter |
| Spot Strategy | No control | Karpenter Spot-first provisioning |
| IAM | GCP Workload Identity | AWS IRSA |
| Audit Logging | Cloud Audit Logs | CloudTrail + CloudWatch |
| Networking | GKE VPC-native | AWS VPC with custom subnets |
⚙️ Karpenter Setup — The Game Changer
Karpenter replaced our traditional Cluster Autoscaler, and the difference was immediately visible.
How we configured it:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-arm64
spec:
  template:
    metadata:
      labels:
        # -----------------------------------------------
        # These labels land on the EC2 node.
        # Your pod affinity rules match against these.
        # -----------------------------------------------
        node-pool: spot-arm64
        capacity-type: spot
        arch: arm64
        workload-class: standard
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
          # c6g, c7g, c8g — compute optimized Graviton
          # m6g, m7g, m8g — general purpose Graviton
          # r6g, r7g, r8g — memory optimized Graviton
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"] # Graviton2+ only (gen 6, 7, 8)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 168h # 7 days — shorter for Spot nodes
  limits:
    cpu: 500
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 100
```
Key decisions we made:
- Spot-first provisioning — workloads that tolerate interruptions run on Spot; stateful services stay on On-Demand
- Multiple instance families — Karpenter picks the cheapest right-sized option across families
- Interruption handling — we use the Karpenter interruption queue (SQS) to gracefully drain Spot nodes before AWS reclaims them
- `consolidateAfter: 2m` — nodes deprovision two minutes after going idle, eliminating ghost capacity
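The NodePool references an EC2NodeClass named `default`, which is where the AWS-side details live. A minimal sketch of one (cluster name, instance role, and discovery tags are placeholders — a hardened custom AMI ID would replace the alias for locked-down images):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest          # Amazon Linux 2023; swap for a custom AMI ID
  role: KarpenterNodeRole-prod      # placeholder node instance role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster
```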
📊 Results After Migration
💰 Cost
Compute costs dropped significantly. The main drivers:
- Spot instances covering the majority of our non-critical workloads
- Karpenter's bin-packing eliminating idle node waste
- Right-sized instances instead of over-provisioned static node groups
⚡ Performance
- Faster pod scheduling — Karpenter provisions new nodes in under 60 seconds in most cases
- Better workload isolation through custom node selectors and taints
- Graviton (ARM) instances for compatible workloads gave us a meaningful price-performance improvement
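Workloads opt into the `spot-arm64` pool by selecting on the labels the NodePool stamps onto its nodes. A hypothetical interruption-tolerant pod might look like this (the pool shown earlier defines labels only; pools that also set taints would need matching tolerations here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job            # hypothetical interruption-tolerant workload
spec:
  nodeSelector:
    node-pool: spot-arm64    # labels stamped by the NodePool
    capacity-type: spot
  containers:
    - name: worker
      image: example/batch:arm64
```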
🔐 Compliance
- Audit reports now generated directly from CloudTrail without custom tooling
- IRSA eliminated shared IAM credential risks
- Security Hub provides continuous posture monitoring against our compliance framework
⚠️ Challenges We Faced
Being honest here — this is what makes a migration story actually useful.
1. Karpenter Learning Curve
Karpenter's provisioning model is fundamentally different from Cluster Autoscaler. Debugging why a node wasn't provisioned — or why Karpenter chose a specific instance type — required understanding its internal decision logic. The logs are verbose but not always immediately readable.
What helped: Running Karpenter in dry-run mode first, and adding structured logging to correlate provisioning decisions with pod events.
2. Networking Model Differences
GKE's VPC-native networking and AWS VPC behave differently in non-obvious ways — especially around CIDR planning, secondary IP ranges, and how pod IPs are allocated. We had to redesign our subnet layout and revisit some service-to-service communication assumptions.
3. IAM Complexity
IRSA is powerful but requires careful role design. Mapping GCP Workload Identity bindings to AWS IRSA role assumptions took time, especially for services that had assumed broad IAM permissions under GCP and needed to be tightened properly.
🧠 Key Lessons Learned
- Managed ≠ always optimal at scale. Autopilot is excellent for getting started, but production-grade platforms eventually need control surfaces that fully managed offerings deliberately hide.
- Cost optimization requires infrastructure access. You can't tune what you can't see.
- Autoscaling strategy matters more than cluster size. Karpenter's approach of provisioning for the pod rather than scaling a group changed how we think about capacity planning entirely.
- Compliance is easier when the platform is designed for it. AWS's native compliance tooling removed a category of work that we were previously solving with custom scripts and log forwarding pipelines.
- Migration should always be incremental. Parallel environment, gradual DNS cutover, canary deployments — this approach meant we caught issues in staging before they became production incidents.
🏁 Conclusion
GKE Autopilot is an excellent choice for teams that want Kubernetes without the operational overhead — and we'd still recommend it for that use case.
But for production environments that require cost control at scale, fine-grained compliance posture, and workload-specific infrastructure decisions, EKS with Karpenter provided a more flexible and efficient platform.
The migration wasn't trivial, but the control, visibility, and cost profile on the other side made it worth it.
Have you gone through a similar migration? Or are you evaluating EKS vs GKE for your stack? Drop your questions in the comments — happy to dig into specifics.
Tags: kubernetes aws devops cloud karpenter