Why We Moved from GKE to EKS
When we initially adopted Kubernetes, Google Kubernetes Engine (GKE) Autopilot seemed like the perfect choice — fully managed, minimal operational overhead, and quick to get started.
But as our workloads matured, three major challenges started to surface:
- Rising and unpredictable costs
- Compliance constraints
- The need for deeper infrastructure control
This blog walks through why we migrated to Amazon Elastic Kubernetes Service (EKS) with Karpenter, the architectural changes we made, and the lessons we learned running production workloads post-migration.
⚠️ Why GKE Autopilot Started Falling Short
1. Cost Inefficiencies at Scale
GKE Autopilot pricing is based on requested resources, not actual usage. This sounds fine at small scale — but as traffic grows, the gaps between requested and actual usage start to compound.
Problems we observed:
- Over-provisioned workloads leading to higher bills
- No access to Spot/Preemptible node strategies with the flexibility of self-managed node pools
- Very few cost optimization knobs to tune
As traffic grew, costs increased almost linearly with no meaningful way to optimize without restructuring our entire workload configuration.
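To make the billing gap concrete, here's a hypothetical Deployment fragment (the names and numbers are illustrative, not from our actual workloads). Autopilot charges for what the pod requests, regardless of what it actually consumes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api               # hypothetical workload
spec:
  replicas: 10
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      containers:
        - name: api
          image: example/api:latest
          resources:
            requests:
              cpu: "2"      # Autopilot bills the full 2 vCPU per replica...
              memory: 4Gi   # ...even if `kubectl top` shows ~300m / ~700Mi in use
```

Multiply that request-vs-usage gap across ten replicas and dozens of services, and the over-provisioning compounds quickly.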
2. Compliance and Governance Constraints
Operating in a regulated environment required:
- Fine-grained IAM control at the workload level
- Strict network isolation between services
- Audit-level visibility into infrastructure activity
With GKE Autopilot, several configurations are abstracted away or restricted by design. This made it harder to enforce organization-wide security policies and satisfy compliance requirements from auditors. Specifically:
- Enforcing per-pod IAM permissions cleanly was non-trivial
- Network policy enforcement had gaps in our specific setup
- Generating audit-ready logs tied to individual workload actions required workarounds
We needed something that gave us first-class integration with cloud-native IAM and security tooling — without layering on custom solutions.
3. Limited Infrastructure Control
When performance-sensitive services started hitting bottlenecks, the inability to choose instance types became a real blocker. We had no control over:
- CPU vs. memory-optimized instance selection
- ARM-based workloads on Graviton processors
- Custom AMIs or low-level networking tuning
For teams running general-purpose workloads, this abstraction is a feature. For us, it was a ceiling.
🎯 Why We Chose EKS + Karpenter
Full Infrastructure Control
Moving to EKS gave us direct control over:
- Instance families — CPU-optimized, memory-optimized, ARM (Graviton)
- Custom AMIs — hardened images meeting our internal security baseline
- Networking — VPC-native networking with fine-grained subnet and security group control
This unlocked workload-specific performance tuning that simply wasn't possible before.
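As a sketch of what that control looks like in practice, a workload can now pin itself to Graviton capacity with a plain node selector (the Deployment name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingest-worker      # hypothetical service built for arm64
spec:
  replicas: 3
  selector:
    matchLabels: {app: ingest-worker}
  template:
    metadata:
      labels: {app: ingest-worker}
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # schedules only onto Graviton (ARM) nodes
      containers:
        - name: worker
          image: example/ingest-worker:arm64
```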
Advanced Cost Optimization with Karpenter
Karpenter is not your traditional cluster autoscaler. Instead of scaling pre-defined node groups, it:
- Watches for unschedulable pods in real time
- Selects the right-sized instance based on actual pod requirements
- Prioritizes Spot instances where workloads allow, falling back to On-Demand seamlessly
- Bin-packs nodes efficiently, reducing idle capacity
The result: faster scaling reactions and a dramatically lower compute bill — without sacrificing reliability.
Compliance Alignment
AWS gave us the compliance story we needed:
- IRSA (IAM Roles for Service Accounts) — precise, per-pod IAM permissions with no shared credentials
- VPC-level isolation — full control over ingress, egress, and inter-service communication
- CloudTrail integration — every API call, every node action, fully auditable out of the box
- AWS Config + Security Hub — continuous compliance checks against CIS benchmarks and custom rules
This made our next compliance audit significantly smoother. Auditors got clear, traceable logs without us having to build custom instrumentation.
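For reference, wiring a pod to a dedicated IAM role via IRSA is a single annotation on its ServiceAccount (the account ID, role, and service names below are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api       # hypothetical service
  namespace: payments
  annotations:
    # Placeholder account ID and role name — each service gets its own
    # narrowly scoped role instead of shared node credentials.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-api
```

Pods running under this ServiceAccount receive short-lived credentials for that role only, with no long-lived keys to manage.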
🏗️ Target Architecture
Here's what the high-level migration looked like architecturally:
| Component | Before (GKE) | After (EKS) |
|---|---|---|
| Cluster | GKE Autopilot | EKS (Managed Node Groups + Karpenter) |
| Autoscaling | Built-in Autopilot scaling | Karpenter |
| Spot Strategy | No control | Karpenter Spot-first provisioning |
| IAM | GCP Workload Identity | AWS IRSA |
| Audit Logging | Cloud Audit Logs | CloudTrail + CloudWatch |
| Networking | GKE VPC-native | AWS VPC with custom subnets |
⚙️ Karpenter Setup — The Game Changer
Karpenter replaced our traditional Cluster Autoscaler, and the difference was immediately visible.
How we configured it:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-arm64
spec:
  template:
    metadata:
      labels:
        # -----------------------------------------------
        # These labels land on the EC2 node.
        # Your pod affinity rules match against these.
        # -----------------------------------------------
        node-pool: spot-arm64
        capacity-type: spot
        arch: arm64
        workload-class: standard
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
          # c6g, c7g, c8g — compute optimized Graviton
          # m6g, m7g, m8g — general purpose Graviton
          # r6g, r7g, r8g — memory optimized Graviton
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"] # Graviton2+ only (gen 6, 7, 8)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 168h # 7 days — shorter for Spot nodes
  limits:
    cpu: 500
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 100
```
Key decisions we made:
- Spot-first provisioning — workloads that tolerate interruptions run on Spot; stateful services stay on On-Demand
- Multiple instance families — Karpenter picks the cheapest right-sized option across families
- Interruption handling — we use the Karpenter interruption queue (SQS) to gracefully drain Spot nodes before AWS reclaims them
- `consolidateAfter: 2m` — nodes deprovision two minutes after going idle, eliminating ghost capacity
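The NodePool references an EC2NodeClass named `default`, which is where the AWS-side details live. A minimal sketch of one (cluster name, instance role, and discovery tags are placeholders — a hardened custom AMI ID would replace the alias for locked-down images):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest          # Amazon Linux 2023; swap for a custom AMI ID
  role: KarpenterNodeRole-prod      # placeholder node instance role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-cluster
```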
📊 Results After Migration
💰 Cost
Compute costs dropped significantly. The main drivers:
- Spot instances covering the majority of our non-critical workloads
- Karpenter's bin-packing eliminating idle node waste
- Right-sized instances instead of over-provisioned static node groups
⚡ Performance
- Faster pod scheduling — Karpenter provisions new nodes in under 60 seconds in most cases
- Better workload isolation through custom node selectors and taints
- Graviton (ARM) instances for compatible workloads gave us a meaningful price-performance improvement
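Workloads opt into the `spot-arm64` pool by selecting on the labels the NodePool stamps onto its nodes. A hypothetical interruption-tolerant pod might look like this (the pool shown earlier defines labels only; pools that also set taints would need matching tolerations here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job            # hypothetical interruption-tolerant workload
spec:
  nodeSelector:
    node-pool: spot-arm64    # labels stamped by the NodePool
    capacity-type: spot
  containers:
    - name: worker
      image: example/batch:arm64
```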
🔐 Compliance
- Audit reports now generated directly from CloudTrail without custom tooling
- IRSA eliminated shared IAM credential risks
- Security Hub provides continuous posture monitoring against our compliance framework
⚠️ Challenges We Faced
Being honest here — this is what makes a migration story actually useful.
1. Karpenter Learning Curve
Karpenter's provisioning model is fundamentally different from Cluster Autoscaler. Debugging why a node wasn't provisioned — or why Karpenter chose a specific instance type — required understanding its internal decision logic. The logs are verbose but not always immediately readable.
What helped: Running Karpenter in dry-run mode first, and adding structured logging to correlate provisioning decisions with pod events.
2. Networking Model Differences
GKE's VPC-native networking and AWS VPC behave differently in non-obvious ways — especially around CIDR planning, secondary IP ranges, and how pod IPs are allocated. We had to redesign our subnet layout and revisit some service-to-service communication assumptions.
3. IAM Complexity
IRSA is powerful but requires careful role design. Mapping GCP Workload Identity bindings to AWS IRSA role assumptions took time, especially for services that had assumed broad IAM permissions under GCP and needed to be tightened properly.
🧠 Key Lessons Learned
- Managed ≠ always optimal at scale. Autopilot is excellent for getting started, but production-grade platforms eventually need control surfaces that fully managed offerings deliberately hide.
- Cost optimization requires infrastructure access. You can't tune what you can't see.
- Autoscaling strategy matters more than cluster size. Karpenter's approach of provisioning for the pod rather than scaling a group changed how we think about capacity planning entirely.
- Compliance is easier when the platform is designed for it. AWS's native compliance tooling removed a category of work that we were previously solving with custom scripts and log forwarding pipelines.
- Migration should always be incremental. Parallel environment, gradual DNS cutover, canary deployments — this approach meant we caught issues in staging before they became production incidents.
🏁 Conclusion
GKE Autopilot is an excellent choice for teams that want Kubernetes without the operational overhead — and we'd still recommend it for that use case.
But for production environments that require cost control at scale, fine-grained compliance posture, and workload-specific infrastructure decisions, EKS with Karpenter provided a more flexible and efficient platform.
The migration wasn't trivial, but the control, visibility, and cost profile on the other side made it worth it.
Have you gone through a similar migration? Or are you evaluating EKS vs GKE for your stack? Drop your questions in the comments — happy to dig into specifics.
Tags: kubernetes aws devops cloud karpenter