Why AWS Still Wins (Despite GCP's Better Design)
Introduction
This is a follow-up to my previous articles: AWS SRE's First Day with GCP: 7 Surprising Differences and AWS Multi-Account Architecture: The Organizational Chaos No One Talks About.
A few months ago, I wrote enthusiastically about GCP after my first hands-on experience. The infrastructure design was cleaner. The networking model made more sense. The pricing was better. I genuinely believed GCP had solved many of AWS's fundamental architectural problems.
After actually building and running my personal ML project on GCP for several months, I need to eat some humble pie.
Here's what I've learned: Infrastructure elegance doesn't win. Ecosystem breadth does.
GCP's design is still superior from an architectural purity standpoint. But AWS remains the better choice for most organizations—and now I understand why.
The Managed Services Gap: Bigger Than I Thought
When I praised GCP's cleaner architecture, I focused on foundational services: compute, networking, storage, Kubernetes. These are areas where GCP genuinely excels.
But here's what I didn't account for: The majority of production workloads don't just need foundational services. They need the ecosystem around them.
The Services GCP Doesn't Have (That You Desperately Need)
1. Kafka: AWS MSK vs... Nothing
In AWS:
Amazon Managed Streaming for Apache Kafka (MSK) gives you:
- Fully managed Kafka clusters
- Automatic patching and upgrades
- Built-in monitoring with CloudWatch
- Integration with AWS IAM, VPC, and KMS
- Multi-AZ deployment with automatic failover
- Starting at ~$200/month for production-grade setup
In GCP:
You build it yourself with open-source Kafka on GCE instances or GKE.
The reality check:
Running Kafka in-house is not impossible—SREs have been doing it for years. But it's a significant operational burden:
- Cluster sizing and capacity planning
- ZooKeeper management (pre-Kafka 3.x) or KRaft mode configuration
- Replication and partition rebalancing
- Performance tuning (JVM heap, OS parameters, disk I/O)
- Security configuration (SSL/TLS, SASL authentication, ACLs)
- Monitoring and alerting setup
- Upgrade orchestration
- Disaster recovery planning
For a dedicated SRE, this becomes a part-time to full-time job if Kafka is core to your business. For a small team, it's a distraction from product development.
AWS MSK doesn't make this complexity disappear; it shifts the responsibility to AWS. For most organizations, that shift is worth hundreds of thousands of dollars in annual salary costs.
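The "capacity planning" bullet alone hides real arithmetic you now own. Here's a rough sketch of the disk-sizing math for a self-managed cluster — all the input figures are hypothetical illustrations, not recommendations:

```python
# Rough Kafka disk-capacity estimate. All figures are hypothetical
# illustrations of the kind of math capacity planning involves.

def kafka_disk_gb(write_mb_per_sec: float,
                  retention_days: int,
                  replication_factor: int = 3,
                  headroom: float = 0.4) -> float:
    """Total disk needed across the cluster, in GB, with free-space headroom."""
    seconds = retention_days * 24 * 3600
    raw_gb = write_mb_per_sec * seconds / 1024      # one copy of the data
    replicated = raw_gb * replication_factor        # every partition is replicated
    return replicated * (1 + headroom)              # spare room for rebalances and spikes

# Example: 10 MB/s sustained writes, 7-day retention, replication factor 3
total = kafka_disk_gb(10, 7, 3)
print(f"~{total:,.0f} GB across the cluster")
```

And that's before you've touched broker counts, partition counts, or page-cache sizing — each with its own back-of-envelope formula.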
2. Elasticsearch/OpenSearch: AWS vs DIY Hell
In AWS:
Amazon OpenSearch Service (formerly Elasticsearch Service):
- Managed clusters with automatic node recovery
- Built-in Kibana/OpenSearch Dashboards
- Automated snapshots and point-in-time recovery
- Fine-grained access control integration
- Index State Management for data lifecycle
- ~$150/month for small production clusters
In GCP:
Roll your own Elasticsearch cluster, or use Elastic Cloud via the GCP Marketplace (third-party, typically more expensive).
The operational nightmare:
Elasticsearch is notoriously finicky in production:
- Memory management (heap sizing, JVM tuning)
- Shard allocation and rebalancing strategies
- Split-brain scenarios and quorum configuration
- Index mapping explosions
- Query performance optimization
- Storage capacity management (indices grow fast)
- Version upgrades (breaking changes between major versions)
- Cluster state management at scale
I've seen dedicated SRE teams with 2-3 engineers just managing Elasticsearch clusters for logging and observability. It's that complex at scale.
Unless search is your core business (like Elastic.co itself), running it in-house is resource-intensive compared to using a managed service.
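To make the "memory management" bullet concrete: the widely cited guidance is to give the JVM heap at most half the node's RAM (Lucene wants the rest for the OS page cache) and keep it under roughly 31 GB so compressed object pointers stay enabled. A minimal sketch of that rule:

```python
# Elasticsearch heap-sizing rule of thumb: at most half of RAM,
# capped below the ~32 GB compressed-oops threshold (~31 GB to be safe).

def es_heap_gb(node_ram_gb: float, compressed_oops_limit: float = 31.0) -> float:
    """Recommended Elasticsearch JVM heap for a node, in GB."""
    return min(node_ram_gb / 2, compressed_oops_limit)

for ram in (16, 64, 128):
    print(f"{ram} GB node -> {es_heap_gb(ram):g} GB heap")
```

Get this wrong in either direction and you pay for it: too small and you hit circuit breakers under load; too large and GC pauses (or the loss of compressed oops) eat the gain. A managed service sets this for you.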
3. Airflow: Both Have Managed, But...
AWS:
Amazon Managed Workflows for Apache Airflow (MWAA)
- Starting at ~$350/month for a small environment
- Integrated with AWS services (S3, Glue, EMR, etc.)
GCP:
Cloud Composer (managed Airflow)
- Starting at ~$300-400/month
- But consistently more expensive at scale
- My testing showed GCP pricing increases faster as you add workers and schedulers
My experience:
I previously ran Airflow in-house on Docker. Both managed services are better than DIY. But AWS MWAA integrates more naturally with the broader AWS ecosystem (Lambda, Step Functions, Glue, etc.).
For GCP, if you're already heavily invested in BigQuery and Dataflow, Cloud Composer makes sense. For multi-service orchestration, MWAA edges ahead.
EKS vs GKE: The Unexpected Reversal
In my first article, I praised GKE as more mature and better integrated. After deeper experience, I've changed my mind.
First Impressions: GKE Seems Superior
Why GKE looks better on day 1:
- CLI consistency: `gcloud container` commands mirror `kubectl` patterns
- Earlier launch: GKE launched in 2015; EKS in 2018 (3 years later)
- Native integration: GCP services integrate with Kubernetes more naturally
- Mature ecosystem: More GCP-native tools built on Kubernetes primitives
As an SRE coming from AWS, GKE genuinely felt cleaner and more Kubernetes-idiomatic.
Reality Check: EKS Has Caught Up (and Pulled Ahead)
1. Add-Ons: Complexity with Purpose
In EKS:
You need to install and maintain add-ons for AWS integration:
- AWS Load Balancer Controller (ALB/NLB integration)
- EBS CSI Driver (persistent volumes)
- EFS CSI Driver (shared file storage)
- Secrets Manager CSI Driver (secret injection)
- IAM Roles for Service Accounts (IRSA) (pod-level IAM)
First reaction: "Why isn't this built-in? GKE is cleaner!"
After practicing with both: This separation is actually better for enterprise environments:
- Version control: Update add-ons independently from cluster upgrades
- Rollback safety: If an add-on breaks, rollback without touching the control plane
- Customization: Fork and modify add-ons for specialized needs
- Debugging: Clear separation between Kubernetes issues and AWS integration issues
Reality: If you manage these through Terraform and hide the complexity in IaC, the operational overhead is minimal. After initial setup, add-ons are stable and rarely require attention.
2. Cluster Autoscaling: EKS is Cheaper
This was the biggest surprise.
Cost comparison for a production cluster:
Scenario: 10-50 nodes, scaling based on load, mix of workload types
GKE (with Google-managed node pools):
- Control plane: free for one zonal cluster via the free-tier credit; otherwise ~$73/month ($0.10/hour)
- Nodes: Standard pricing
- Node pool autoscaling: Built-in
- Typical monthly cost: $2,500-4,000
EKS (with managed node groups + Karpenter):
- Control plane: $73/month per cluster
- Nodes: Standard pricing (often cheaper than GCP equivalent)
- Managed node groups: Built-in autoscaling
- Karpenter: Advanced provisioning (free, OSS)
- Typical monthly cost: $2,200-3,500
EKS is 10-15% cheaper for equivalent workloads at scale, even with the control plane cost.
Why? Two reasons:
- EC2 instance pricing is generally lower than equivalent GCE instances for compute-optimized and memory-optimized workloads
- Karpenter (AWS open-source) is more efficient at bin-packing than GKE's native autoscaler
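A quick sanity check of the "10-15% cheaper" figure, using the midpoints of the monthly ranges I quoted above (illustration only — your node mix will move these numbers):

```python
# Back-of-envelope check of the "10-15% cheaper" claim, using the
# midpoints of the monthly cost ranges quoted above.

def midpoint(lo: float, hi: float) -> float:
    return (lo + hi) / 2

gke_monthly = midpoint(2500, 4000)   # assumes the free-tier credit covers the control plane
eks_monthly = midpoint(2200, 3500)   # already includes the $73/month control plane

savings_pct = (gke_monthly - eks_monthly) / gke_monthly * 100
print(f"GKE ${gke_monthly:,.0f} vs EKS ${eks_monthly:,.0f} -> {savings_pct:.1f}% cheaper")
```

The midpoints land at roughly 12% — squarely inside the 10-15% range I observed.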
3. Karpenter: The Game-Changer
What is Karpenter?
An open-source Kubernetes cluster autoscaler built by AWS, designed to replace the standard Cluster Autoscaler.
Why it's better:
Traditional autoscaling (GKE and EKS Cluster Autoscaler):
- Pre-defined node groups/pools
- Fixed instance types per node pool
- Scales existing node groups up/down
- Can get stuck in suboptimal configurations
Karpenter:
- No pre-defined node groups
- Dynamically selects optimal instance type based on pending pod requirements
- Provisions exactly what's needed (mix of instance types in a single scaling event)
- Consolidates underutilized nodes automatically
- Faster provisioning (30-45 seconds vs 3-5 minutes)
GKE alternative: GKE has improved its autoscaling, but as of 2025, it doesn't match Karpenter's flexibility and intelligence.
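Karpenter's core trick — choose an instance to fit the pending pods, rather than scale a fixed pool up and down — can be shown with a toy selector. The instance catalog and prices below are invented for illustration; real Karpenter solves a much richer optimization across hundreds of EC2 types, spot pricing, and consolidation:

```python
# Toy Karpenter-style provisioning: given pending pods' resource requests,
# pick the cheapest instance type that fits them all on one node.
# The catalog and prices are invented for illustration.

from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str
    vcpu: float
    mem_gib: float
    hourly_usd: float

CATALOG = [
    InstanceType("small",   2,  4, 0.05),
    InstanceType("medium",  4,  8, 0.10),
    InstanceType("large",   8, 16, 0.20),
    InstanceType("xlarge", 16, 32, 0.40),
]

def pick_instance(pending_pods: list) -> InstanceType:
    """pending_pods: list of (vcpu_request, mem_gib_request) tuples."""
    need_cpu = sum(cpu for cpu, _ in pending_pods)
    need_mem = sum(mem for _, mem in pending_pods)
    candidates = [i for i in CATALOG
                  if i.vcpu >= need_cpu and i.mem_gib >= need_mem]
    if not candidates:
        raise ValueError("no single instance fits; split pods across nodes")
    return min(candidates, key=lambda i: i.hourly_usd)  # cheapest that fits

# Three pods requesting 1 vCPU / 2 GiB each -> needs 3 vCPU, 6 GiB total
print(pick_instance([(1, 2)] * 3).name)
```

A node-group autoscaler in the same situation can only add another copy of whatever instance type the group was created with — which is exactly the "stuck in suboptimal configurations" failure mode above.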
SRE Perspective: What Actually Matters
After running workloads on both:
GKE advantages:
- ✅ Cleaner initial setup
- ✅ Fewer moving parts (no add-ons to install)
- ✅ Better out-of-box experience
EKS advantages:
- ✅ Better cost efficiency at scale (10-15% cheaper)
- ✅ Karpenter enables superior autoscaling intelligence
- ✅ Add-on separation = better enterprise change management
- ✅ Broader ecosystem integration (AWS has more services)
For SRE teams managing production infrastructure at scale, EKS wins. The cost savings and Karpenter's intelligence outweigh GKE's cleaner initial experience.
My Revised Recommendation
For most organizations, AWS remains the better choice.
Not because the infrastructure is better designed (it's not).
Not because networking is simpler (it's definitely not).
But because AWS reduces the operational burden more completely through breadth of managed services.
Decision Framework
Choose GCP if:
- You're building on BigQuery/Dataflow/BigTable
- Your workload is data-intensive with high cross-zone transfer
- You don't need Kafka or Elasticsearch
- You have GCP expertise in-house
Choose AWS if:
- You need managed Kafka (MSK)
- You need managed Elasticsearch (OpenSearch)
- You want the broadest set of managed services
- You're building a complex, multi-service architecture
- You need mature ML infrastructure (SageMaker)
Have you compared AWS and GCP in production? What was your experience? Did you find the managed services gap as significant as I did? Let me know in the comments.
This article is part of a series exploring practical cloud architecture. Check out the previous articles for more context on AWS multi-account architecture and GCP's design advantages.
Connect with me on LinkedIn: https://www.linkedin.com/in/rex-zhen-b8b06632/
I share insights on cloud architecture, SRE practices, and honest takes on cloud platforms. Let's connect!