Why AWS Still Wins (Despite GCP's Better Design)
Introduction
This is a follow-up to my previous articles: AWS SRE's First Day with GCP: 7 Surprising Differences and AWS Multi-Account Architecture: The Organizational Chaos No One Talks About.
A few months ago, I wrote enthusiastically about GCP after my first hands-on experience. The infrastructure design was cleaner. The networking model made more sense. The pricing was better. I genuinely believed GCP had solved many of AWS's fundamental architectural problems.
After actually building and running my personal ML project on GCP for several months, I need to eat some humble pie.
Here's what I've learned: Infrastructure elegance doesn't win. Ecosystem breadth does.
GCP's design is still superior from an architectural purity standpoint. But AWS remains the better choice for most organizations—and now I understand why.
The Managed Services Gap: Bigger Than I Thought
When I praised GCP's cleaner architecture, I focused on foundational services: compute, networking, storage, Kubernetes. These are areas where GCP genuinely excels.
But here's what I didn't account for: The majority of production workloads don't just need foundational services. They need the ecosystem around them.
The Services GCP Doesn't Have (That You Desperately Need)
1. Kafka: AWS MSK vs... Nothing
In AWS:
Amazon Managed Streaming for Apache Kafka (MSK) gives you:
- Fully managed Kafka clusters
- Automatic patching and upgrades
- Built-in monitoring with CloudWatch
- Integration with AWS IAM, VPC, and KMS
- Multi-AZ deployment with automatic failover
- Starting at ~$200/month for production-grade setup
In GCP:
You build it yourself with open-source Kafka on GCE instances or GKE.
The reality check:
Running Kafka in-house is not impossible—SREs have been doing it for years. But it's a significant operational burden:
- Cluster sizing and capacity planning
- ZooKeeper management (pre-Kafka 3.x) or KRaft mode configuration
- Replication and partition rebalancing
- Performance tuning (JVM heap, OS parameters, disk I/O)
- Security configuration (SSL/TLS, SASL authentication, ACLs)
- Monitoring and alerting setup
- Upgrade orchestration
- Disaster recovery planning
For a dedicated SRE, this becomes a part-time to full-time job if Kafka is core to your business. For a small team, it's a distraction from product development.
AWS MSK doesn't make this complexity disappear; it shifts the responsibility to AWS. For most organizations, that shift is worth hundreds of thousands of dollars in annual salary costs.
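The "capacity planning" bullet alone hides real arithmetic you now own. Here's a rough sketch of the disk-sizing math for a self-managed cluster — all the input figures are hypothetical illustrations, not recommendations:

```python
# Rough Kafka disk-capacity estimate. All figures are hypothetical
# illustrations of the kind of math capacity planning involves.

def kafka_disk_gb(write_mb_per_sec: float,
                  retention_days: int,
                  replication_factor: int = 3,
                  headroom: float = 0.4) -> float:
    """Total disk needed across the cluster, in GB, with free-space headroom."""
    seconds = retention_days * 24 * 3600
    raw_gb = write_mb_per_sec * seconds / 1024      # one copy of the data
    replicated = raw_gb * replication_factor        # every partition is replicated
    return replicated * (1 + headroom)              # spare room for rebalances and spikes

# Example: 10 MB/s sustained writes, 7-day retention, replication factor 3
total = kafka_disk_gb(10, 7, 3)
print(f"~{total:,.0f} GB across the cluster")
```

And that's before you've touched broker counts, partition counts, or page-cache sizing — each with its own back-of-envelope formula.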
2. Elasticsearch/OpenSearch: AWS vs DIY Hell
In AWS:
Amazon OpenSearch Service (formerly Elasticsearch Service):
- Managed clusters with automatic node recovery
- Built-in Kibana/OpenSearch Dashboards
- Automated snapshots and point-in-time recovery
- Fine-grained access control integration
- Index State Management for data lifecycle
- ~$150/month for small production clusters
In GCP:
Roll your own Elasticsearch cluster, or use Elastic Cloud via the GCP Marketplace (third-party, typically more expensive).
The operational nightmare:
Elasticsearch is notoriously finicky in production:
- Memory management (heap sizing, JVM tuning)
- Shard allocation and rebalancing strategies
- Split-brain scenarios and quorum configuration
- Index mapping explosions
- Query performance optimization
- Storage capacity management (indices grow fast)
- Version upgrades (breaking changes between major versions)
- Cluster state management at scale
I've seen dedicated SRE teams with 2-3 engineers just managing Elasticsearch clusters for logging and observability. It's that complex at scale.
Unless search is your core business (like Elastic.co itself), running it in-house is resource-intensive compared to using a managed service.
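To make the "memory management" bullet concrete: the widely cited guidance is to give the JVM heap at most half the node's RAM (Lucene wants the rest for the OS page cache) and keep it under roughly 31 GB so compressed object pointers stay enabled. A minimal sketch of that rule:

```python
# Elasticsearch heap-sizing rule of thumb: at most half of RAM,
# capped below the ~32 GB compressed-oops threshold (~31 GB to be safe).

def es_heap_gb(node_ram_gb: float, compressed_oops_limit: float = 31.0) -> float:
    """Recommended Elasticsearch JVM heap for a node, in GB."""
    return min(node_ram_gb / 2, compressed_oops_limit)

for ram in (16, 64, 128):
    print(f"{ram} GB node -> {es_heap_gb(ram):g} GB heap")
```

Get this wrong in either direction and you pay for it: too small and you hit circuit breakers under load; too large and GC pauses (or the loss of compressed oops) eat the gain. A managed service sets this for you.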
3. Airflow: Both Have Managed, But...
AWS:
Amazon Managed Workflows for Apache Airflow (MWAA)
- Starting at ~$350/month for a small environment
- Integrated with AWS services (S3, Glue, EMR, etc.)
GCP:
Cloud Composer (managed Airflow)
- Starting at ~$300-400/month
- But consistently more expensive at scale
- My testing showed GCP pricing increases faster as you add workers and schedulers
My experience:
I previously ran Airflow in-house on Docker. Both managed services are better than DIY. But AWS MWAA integrates more naturally with the broader AWS ecosystem (Lambda, Step Functions, Glue, etc.).
For GCP, if you're already heavily invested in BigQuery and Dataflow, Cloud Composer makes sense. For multi-service orchestration, MWAA edges ahead.
EKS vs GKE: The Unexpected Reversal
In my first article, I praised GKE as more mature and better integrated. After deeper experience, I've changed my mind.
First Impressions: GKE Seems Superior
Why GKE looks better on day 1:
- CLI consistency: `gcloud container` commands mirror `kubectl` patterns
- Earlier launch: GKE launched in 2015; EKS in 2018 (3 years later)
- Native integration: GCP services integrate with Kubernetes more naturally
- Mature ecosystem: More GCP-native tools built on Kubernetes primitives
As an SRE coming from AWS, GKE genuinely felt cleaner and more Kubernetes-idiomatic.
Reality Check: EKS Has Caught Up (and Pulled Ahead)
1. Add-Ons: Complexity with Purpose
In EKS:
You need to install and maintain add-ons for AWS integration:
- AWS Load Balancer Controller (ALB/NLB integration)
- EBS CSI Driver (persistent volumes)
- EFS CSI Driver (shared file storage)
- Secrets Manager CSI Driver (secret injection)
- IAM Roles for Service Accounts (IRSA) (pod-level IAM)
First reaction: "Why isn't this built-in? GKE is cleaner!"
After practicing with both: This separation is actually better for enterprise environments:
- Version control: Update add-ons independently from cluster upgrades
- Rollback safety: If an add-on breaks, rollback without touching the control plane
- Customization: Fork and modify add-ons for specialized needs
- Debugging: Clear separation between Kubernetes issues and AWS integration issues
Reality: If you manage these through Terraform and hide the complexity in IaC, the operational overhead is minimal. After initial setup, add-ons are stable and rarely require attention.
2. Cluster Autoscaling: EKS is Cheaper
This was the biggest surprise.
Cost comparison for a production cluster:
Scenario: 10-50 nodes, scaling based on load, mix of workload types
GKE (with Google-managed node pools):
- Control plane: free for one zonal cluster via the free-tier credit; otherwise ~$73/month ($0.10/hour)
- Nodes: Standard pricing
- Node pool autoscaling: Built-in
- Typical monthly cost: $2,500-4,000
EKS (with managed node groups + Karpenter):
- Control plane: $73/month per cluster
- Nodes: Standard pricing (often cheaper than GCP equivalent)
- Managed node groups: Built-in autoscaling
- Karpenter: Advanced provisioning (free, OSS)
- Typical monthly cost: $2,200-3,500
EKS is 10-15% cheaper for equivalent workloads at scale, even with the control plane cost.
Why? Two reasons:
- EC2 instance pricing is generally lower than equivalent GCE instances for compute-optimized and memory-optimized workloads
- Karpenter (AWS open-source) is more efficient at bin-packing than GKE's native autoscaler
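A quick sanity check of the "10-15% cheaper" figure, using the midpoints of the monthly ranges I quoted above (illustration only — your node mix will move these numbers):

```python
# Back-of-envelope check of the "10-15% cheaper" claim, using the
# midpoints of the monthly cost ranges quoted above.

def midpoint(lo: float, hi: float) -> float:
    return (lo + hi) / 2

gke_monthly = midpoint(2500, 4000)   # assumes the free-tier credit covers the control plane
eks_monthly = midpoint(2200, 3500)   # already includes the $73/month control plane

savings_pct = (gke_monthly - eks_monthly) / gke_monthly * 100
print(f"GKE ${gke_monthly:,.0f} vs EKS ${eks_monthly:,.0f} -> {savings_pct:.1f}% cheaper")
```

The midpoints land at roughly 12% — squarely inside the 10-15% range I observed.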
3. Karpenter: The Game-Changer
What is Karpenter?
An open-source Kubernetes cluster autoscaler built by AWS, designed to replace the standard Cluster Autoscaler.
Why it's better:
Traditional autoscaling (GKE and EKS Cluster Autoscaler):
- Pre-defined node groups/pools
- Fixed instance types per node pool
- Scales existing node groups up/down
- Can get stuck in suboptimal configurations
Karpenter:
- No pre-defined node groups
- Dynamically selects optimal instance type based on pending pod requirements
- Provisions exactly what's needed (mix of instance types in a single scaling event)
- Consolidates underutilized nodes automatically
- Faster provisioning (30-45 seconds vs 3-5 minutes)
GKE alternative: GKE has improved its autoscaling, but as of 2025, it doesn't match Karpenter's flexibility and intelligence.
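Karpenter's core trick — choose an instance to fit the pending pods, rather than scale a fixed pool up and down — can be shown with a toy selector. The instance catalog and prices below are invented for illustration; real Karpenter solves a much richer optimization across hundreds of EC2 types, spot pricing, and consolidation:

```python
# Toy Karpenter-style provisioning: given pending pods' resource requests,
# pick the cheapest instance type that fits them all on one node.
# The catalog and prices are invented for illustration.

from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str
    vcpu: float
    mem_gib: float
    hourly_usd: float

CATALOG = [
    InstanceType("small",   2,  4, 0.05),
    InstanceType("medium",  4,  8, 0.10),
    InstanceType("large",   8, 16, 0.20),
    InstanceType("xlarge", 16, 32, 0.40),
]

def pick_instance(pending_pods: list) -> InstanceType:
    """pending_pods: list of (vcpu_request, mem_gib_request) tuples."""
    need_cpu = sum(cpu for cpu, _ in pending_pods)
    need_mem = sum(mem for _, mem in pending_pods)
    candidates = [i for i in CATALOG
                  if i.vcpu >= need_cpu and i.mem_gib >= need_mem]
    if not candidates:
        raise ValueError("no single instance fits; split pods across nodes")
    return min(candidates, key=lambda i: i.hourly_usd)  # cheapest that fits

# Three pods requesting 1 vCPU / 2 GiB each -> needs 3 vCPU, 6 GiB total
print(pick_instance([(1, 2)] * 3).name)
```

A node-group autoscaler in the same situation can only add another copy of whatever instance type the group was created with — which is exactly the "stuck in suboptimal configurations" failure mode above.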
SRE Perspective: What Actually Matters
After running workloads on both:
GKE advantages:
- ✅ Cleaner initial setup
- ✅ Fewer moving parts (no add-ons to install)
- ✅ Better out-of-box experience
EKS advantages:
- ✅ Better cost efficiency at scale (10-15% cheaper)
- ✅ Karpenter enables superior autoscaling intelligence
- ✅ Add-on separation = better enterprise change management
- ✅ Broader ecosystem integration (AWS has more services)
For SRE teams managing production infrastructure at scale, EKS wins. The cost savings and Karpenter's intelligence outweigh GKE's cleaner initial experience.
My Revised Recommendation
For most organizations, AWS remains the better choice.
Not because the infrastructure is better designed (it's not).
Not because networking is simpler (it's definitely not).
But because AWS reduces the operational burden more completely through breadth of managed services.
Decision Framework
Choose GCP if:
- You're building on BigQuery/Dataflow/BigTable
- Your workload is data-intensive with high cross-zone transfer
- You don't need Kafka or Elasticsearch
- You have GCP expertise in-house
Choose AWS if:
- You need managed Kafka (MSK)
- You need managed Elasticsearch (OpenSearch)
- You want the broadest set of managed services
- You're building a complex, multi-service architecture
- You need mature ML infrastructure (SageMaker)
Have you compared AWS and GCP in production? What was your experience? Did you find the managed services gap as significant as I did? Let me know in the comments.
This article is part of a series exploring practical cloud architecture. Check out the previous articles for more context on AWS multi-account architecture and GCP's design advantages.
Connect with me on LinkedIn: https://www.linkedin.com/in/rex-zhen-b8b06632/
I share insights on cloud architecture, SRE practices, and honest takes on cloud platforms. Let's connect!