Ajinkya

Posted on • Originally published at dev.to

Why We Moved from GKE to EKS

When we initially adopted Kubernetes, Google Kubernetes Engine (GKE) Autopilot seemed like the perfect choice — fully managed, minimal operational overhead, and quick to get started.

But as our workloads matured, three major challenges started to surface:

  • Rising and unpredictable costs
  • Compliance constraints
  • The need for deeper infrastructure control

This blog walks through why we migrated to Amazon Elastic Kubernetes Service (EKS) with Karpenter, the architectural changes we made, and the lessons we learned running production workloads post-migration.


⚠️ Why GKE Autopilot Started Falling Short

1. Cost Inefficiencies at Scale

GKE Autopilot pricing is based on requested resources, not actual usage. This sounds fine at small scale — but as traffic grows, the gaps between requested and actual usage start to compound.

Problems we observed:

  • Over-provisioned workloads leading to higher bills
  • No access to Spot/Preemptible node strategies with the same level of flexibility
  • Very few cost optimization knobs to tune

As traffic grew, costs increased almost linearly with no meaningful way to optimize without restructuring our entire workload configuration.

2. Compliance and Governance Constraints

Operating in a regulated environment required:

  • Fine-grained IAM control at the workload level
  • Strict network isolation between services
  • Audit-level visibility into infrastructure activity

With GKE Autopilot, several configurations are abstracted away or restricted by design. This made it harder to enforce organization-wide security policies and satisfy compliance requirements from auditors. Specifically:

  • Enforcing per-pod IAM permissions cleanly was non-trivial
  • Network policy enforcement had gaps in our specific setup
  • Generating audit-ready logs tied to individual workload actions required workarounds

We needed something that gave us first-class integration with cloud-native IAM and security tooling — without layering on custom solutions.

3. Limited Infrastructure Control

When performance-sensitive services started hitting bottlenecks, the inability to choose instance types became a real blocker. We had no control over:

  • CPU vs. memory-optimized instance selection
  • ARM-based workloads on Graviton processors
  • Custom AMIs or low-level networking tuning

For teams running general-purpose workloads, this abstraction is a feature. For us, it was a ceiling.


🎯 Why We Chose EKS + Karpenter

Full Infrastructure Control

Moving to EKS gave us direct control over:

  • Instance families — CPU-optimized, memory-optimized, ARM (Graviton)
  • Custom AMIs — hardened images meeting our internal security baseline
  • Networking — VPC-native networking with fine-grained subnet and security group control

This unlocked workload-specific performance tuning that simply wasn't possible before.
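As an illustration of where that control surfaces, node-level choices on EKS with Karpenter live in an EC2NodeClass. This is a minimal sketch, not our actual configuration: the IAM role name, discovery tag value, and AMI alias are hypothetical placeholders.

```yaml
# Sketch of a Karpenter v1 EC2NodeClass.
# Role name, tag values, and AMI alias below are placeholders.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-example        # hypothetical node IAM role
  amiSelectorTerms:
    - alias: al2023@latest               # or pin a hardened custom AMI by ID/tag
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster
```

Swapping the AMI selector to a custom hardened image, or constraining subnets and security groups per workload class, is exactly the kind of knob Autopilot abstracts away.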

Advanced Cost Optimization with Karpenter

Karpenter is not your traditional cluster autoscaler. Instead of scaling pre-defined node groups, it:

  • Watches for unschedulable pods in real time
  • Selects the right-sized instance based on actual pod requirements
  • Prioritizes Spot instances where workloads allow, falling back to On-Demand seamlessly
  • Bin-packs nodes efficiently, reducing idle capacity

The result: faster scaling reactions and a dramatically lower compute bill — without sacrificing reliability.

Compliance Alignment

AWS gave us the compliance story we needed:

  • IRSA (IAM Roles for Service Accounts) — precise, per-pod IAM permissions with no shared credentials
  • VPC-level isolation — full control over ingress, egress, and inter-service communication
  • CloudTrail integration — every API call, every node action, fully auditable out of the box
  • AWS Config + Security Hub — continuous compliance checks against CIS benchmarks and custom rules

This made our next compliance audit significantly smoother. Auditors got clear, traceable logs without us having to build custom instrumentation.
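For context, IRSA works by annotating a Kubernetes ServiceAccount with an IAM role ARN; any pod using that ServiceAccount can then assume only that role via a projected web identity token. The account ID, role name, and service name below are placeholders, not our real setup:

```yaml
# Hypothetical example: binding a pod to a narrowly scoped IAM role via IRSA.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api               # placeholder service name
  namespace: prod
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-api-role
```

Pods reference it with `serviceAccountName: payments-api`, and CloudTrail then attributes every AWS API call to that specific role, which is what makes workload-level audit trails possible without shared credentials.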


🏗️ Target Architecture

Here's what the high-level migration looked like architecturally:

| Component | Before (GKE) | After (EKS) |
| --- | --- | --- |
| Cluster | GKE Autopilot | EKS (Managed Node Groups + Karpenter) |
| Autoscaling | Built-in Autopilot scaling | Karpenter |
| Spot Strategy | No control | Karpenter Spot-first provisioning |
| IAM | GCP Workload Identity | AWS IRSA |
| Audit Logging | Cloud Audit Logs | CloudTrail + CloudWatch |
| Networking | GKE VPC-native | AWS VPC with custom subnets |

⚙️ Karpenter Setup — The Game Changer

Karpenter replaced our traditional Cluster Autoscaler, and the difference was immediately visible.

How we configured it:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-arm64
spec:
  template:
    metadata:
      labels:
        # -----------------------------------------------
        # These labels land on the EC2 node.
        # Your pod affinity rules match against these.
        # -----------------------------------------------
        node-pool: spot-arm64
        capacity-type: spot
        arch: arm64
        workload-class: standard
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
          # c6g, c7g, c8g — compute optimized Graviton
          # m6g, m7g, m8g — general purpose Graviton
          # r6g, r7g, r8g — memory optimized Graviton
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]             # Graviton2+ only (gen 6, 7, 8)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 168h             # 7 days — shorter for spot nodes
  limits:
    cpu: 500
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 100

Key decisions we made:

  • Spot-first provisioning — workloads that tolerate interruptions run on Spot; stateful services stay on On-Demand
  • Multiple instance families — Karpenter picks the cheapest right-sized option across families
  • Interruption handling — we use the Karpenter interruption queue (SQS) to gracefully drain Spot nodes before AWS reclaims them
  • consolidateAfter: 2m — nodes deprovision two minutes after going idle, eliminating ghost capacity
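To tie this together, a workload opts into the spot-arm64 pool by selecting on the node labels the NodePool stamps onto its nodes. The deployment name and image here are hypothetical placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker                  # placeholder: an interruption-tolerant service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        node-pool: spot-arm64         # matches the NodePool's node labels
      containers:
        - name: worker
          image: example/worker:latest    # image must be built for arm64
          resources:
            requests:
              cpu: 500m
              memory: 512Mi           # Karpenter sizes nodes from these requests
```

Accurate resource requests matter more here than under Autopilot in one specific way: Karpenter uses them to pick the cheapest instance that fits, so inflated requests directly inflate the node it launches.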

📊 Results After Migration

💰 Cost

Compute costs dropped significantly. The main drivers:

  • Spot instances covering the majority of our non-critical workloads
  • Karpenter's bin-packing eliminating idle node waste
  • Right-sized instances instead of over-provisioned static node groups

⚡ Performance

  • Faster pod scheduling — Karpenter provisions new nodes in under 60 seconds in most cases
  • Better workload isolation through custom node selectors and taints
  • Graviton (ARM) instances for compatible workloads gave us a meaningful price-performance improvement

🔐 Compliance

  • Audit reports now generated directly from CloudTrail without custom tooling
  • IRSA eliminated shared IAM credential risks
  • Security Hub provides continuous posture monitoring against our compliance framework

⚠️ Challenges We Faced

Being honest here — this is what makes a migration story actually useful.

1. Karpenter Learning Curve

Karpenter's provisioning model is fundamentally different from Cluster Autoscaler. Debugging why a node wasn't provisioned — or why Karpenter chose a specific instance type — required understanding its internal decision logic. The logs are verbose but not always immediately readable.

What helped: Running Karpenter in dry-run mode first, and adding structured logging to correlate provisioning decisions with pod events.
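When tracing a provisioning decision, these are the commands we reach for first; the exact controller namespace varies by install method, so treat the last one as an assumption:

```shell
# Why didn't my pod get a node? Check its Events for Karpenter messages.
kubectl describe pod <pending-pod>

# NodeClaims are Karpenter's record of each provisioning decision.
kubectl get nodeclaims
kubectl describe nodeclaim <name>     # shows chosen instance type and capacity type

# Controller logs (namespace may differ depending on how Karpenter was installed)
kubectl logs -n kube-system deploy/karpenter
```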

2. Networking Model Differences

GKE's VPC-native networking and AWS VPC behave differently in non-obvious ways — especially around CIDR planning, secondary IP ranges, and how pod IPs are allocated. We had to redesign our subnet layout and revisit some service-to-service communication assumptions.

3. IAM Complexity

IRSA is powerful but requires careful role design. Mapping GCP Workload Identity bindings to AWS IRSA role assumptions took time, especially for services that had been granted broad IAM permissions under GCP and needed tighter scoping on AWS.


🧠 Key Lessons Learned

  • Managed ≠ always optimal at scale. Autopilot is excellent for getting started, but production-grade platforms eventually need control surfaces that fully managed offerings deliberately hide.
  • Cost optimization requires infrastructure access. You can't tune what you can't see.
  • Autoscaling strategy matters more than cluster size. Karpenter's approach of provisioning for the pod rather than scaling a group changed how we think about capacity planning entirely.
  • Compliance is easier when the platform is designed for it. AWS's native compliance tooling removed a category of work that we were previously solving with custom scripts and log forwarding pipelines.
  • Migration should always be incremental. Parallel environment, gradual DNS cutover, canary deployments — this approach meant we caught issues in staging before they became production incidents.

🏁 Conclusion

GKE Autopilot is an excellent choice for teams that want Kubernetes without the operational overhead — and we'd still recommend it for that use case.

But for production environments that require cost control at scale, fine-grained compliance posture, and workload-specific infrastructure decisions, EKS with Karpenter provided a more flexible and efficient platform.

The migration wasn't trivial, but the control, visibility, and cost profile on the other side made it worth it.


Have you gone through a similar migration? Or are you evaluating EKS vs GKE for your stack? Drop your questions in the comments — happy to dig into specifics.


Tags: kubernetes aws devops cloud karpenter
