Is CloudNativePG Ready to Replace RDS/Aurora for High-Traffic Databases?

#aws #database #kubernetes #postgres

Short answer: Not for most teams. But for the right team, in the right conditions, it's a legitimate move.

I spent time digging into CloudNativePG (CNPG) — the CNCF-backed Postgres operator for Kubernetes — to answer a question I keep hearing from platform engineers: "Can we drop RDS/Aurora and run Postgres on EKS ourselves?"

Here's what I found, without the vendor spin.

The Evidence Problem

Before anything else: the public evidence base for CNPG at true high-traffic scale is thin.

Named enterprise adopters exist — GEICO Tech, HSBC, EDB's BigAnimal, IBM Cloud Paks. HSBC presented a real migration story at KubeCon EU 2026. The strongest concrete number comes from Mirakl: 450 Postgres clusters, 31 TB of data, 3,650 CPUs — impressive fleet scale.

But there is no published, independent benchmark of CNPG handling thousands of TPS on a single multi-TB database that explicitly migrated off Aurora or RDS. If you're evaluating this for a tier-1 OLTP workload, you're operating without a public proof point at that scale.

That's not a dealbreaker. But it's a flag.

Where CNPG Actually Falls Short Today

Failover is slower — and that's mostly Kubernetes, not CNPG

Aurora Multi-AZ typically fails over in ~30 seconds. RDS Multi-AZ DB clusters typically fail over in under 35 seconds, with zero data loss.

The only public measured number for CNPG on node failure is a community report of 40–80 seconds — where the dominant term isn't CNPG's logic, it's Kubernetes' node-monitor-grace-period (40s on K8s 1.29–1.31, raised to 50s in 1.32+). After node detection, add EBS detach/reattach (20–90 seconds) before the new primary is writable.

Pod-level failures are faster. But losing a node — the failure mode that actually keeps you up at night — is gated by K8s infrastructure, not the operator.

RDS Proxy and Aurora also mask failover from clients by holding connections and re-routing. CNPG's PgBouncer Pooler doesn't queue connections across a failover the same way. Your clients see the blip.

No automatic primary fencing — by design

CNPG explicitly chose not to auto-fence an isolated primary. In a network partition, it prevents split-brain at rejoin time instead of force-shutting the isolated primary immediately, which is the Patroni model.

With async replication, an isolated primary that keeps accepting writes is a real consistency risk. The mitigation is 3-node synchronous (quorum) replication across 3 AZs. It works, but it adds write latency, and writes pause if a required standby becomes unavailable.

Barman backup is mature, but you own it entirely

CNPG uses Barman Cloud for backups. It's solid. But a near-term migration is required: the in-tree barmanObjectStore is deprecated as of CNPG 1.26 and removed in 1.28. New deployments need the plugin-barman-cloud sidecar architecture.

Documented failure modes in the wild include stalled WAL archiving that fills the PGDATA volume and causes crashloops. The XAmzContentSHA256Mismatch bug after a routine image bump is the kind of incident that reminds you who's on call when AWS isn't managing your backups.

Where CNPG Genuinely Wins

Cost at large scale — but the math is less obvious than it looks

Aurora I/O billing can be brutal. A busy OLTP workload sustaining ~50,000 I/Os per second costs over $25,000/month in I/O charges alone under Standard pricing. Aurora I/O-Optimized helps once I/O exceeds 25% of your Aurora bill, but you are still paying the AWS premium.
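
The arithmetic behind that figure is easy to check. Here is a rough sketch, assuming a list price of $0.20 per million I/O requests (my assumption for Aurora Standard in us-east-1; check the pricing page for your region) and a 30-day month. The function name is made up for illustration.

```python
# Assumed Aurora Standard list price per million I/O requests (verify for your region).
PRICE_PER_MILLION_IO_USD = 0.20
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


def monthly_io_cost_usd(avg_iops: float,
                        price_per_million: float = PRICE_PER_MILLION_IO_USD) -> float:
    """Estimate monthly I/O charges for a sustained average I/O rate."""
    total_ios = avg_iops * SECONDS_PER_MONTH
    return total_ios / 1_000_000 * price_per_million


print(f"${monthly_io_cost_usd(50_000):,.0f}/month")  # $25,920/month
```

Plug in your own average I/O rate from CloudWatch rather than a peak number. Billing follows total I/Os, so a bursty workload can cost far less than its peak suggests.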

CNPG on EKS pays EC2 + EBS + S3. The savings are real. The catch: a senior engineer in Canada runs CAD 150k–250k/year fully loaded. Even a 0.25 FTE allocation for on-call, patching, upgrades, and restore drills is CAD 40–60k/year of fixed cost. At small/medium managed-DB spend, CNPG's infrastructure savings don't cover the labor. The crossover only happens when you're spending tens of thousands per month on RDS/Aurora — large fleets or I/O-heavy workloads — AND you already employ the expertise, making the marginal labor near-zero.
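
To make the labor side concrete, here is a small sketch that turns the FTE numbers above into the monthly infrastructure savings you need just to break even. The function name is hypothetical, and the model deliberately ignores everything except the labor allocation.

```python
def monthly_savings_needed(fte_fraction: float, loaded_annual_cost_cad: float) -> float:
    """Monthly infrastructure savings (CAD) required just to pay for the labor."""
    return fte_fraction * loaded_annual_cost_cad / 12


# The article's range: 0.25 FTE of an engineer costing CAD 150k-250k fully loaded.
for loaded in (150_000, 250_000):
    needed = monthly_savings_needed(0.25, loaded)
    print(f"0.25 FTE at CAD {loaded:,}/year -> break-even at CAD {needed:,.0f}/month saved")
```

That works out to roughly CAD 3,100 to 5,200 per month of real savings before CNPG is even cost-neutral, and that is savings, not spend. If self-hosting only trims a fraction of your bill, the spend needed to clear that bar is correspondingly higher.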

The unexpected win: Debezium / CDC workloads

This is the one area where the "default to managed services" recommendation actually flips.

Debezium depends on a logical replication slot surviving failover. Historically, on any self-managed HA Postgres, it didn't — which forced a full re-snapshot of a multi-TB database after every failover. Not acceptable at scale.

Of the three options (RDS, Aurora, CNPG), Aurora is the weakest here. Its shared-storage replication does not preserve logical slots across failover. You re-snapshot.

CNPG ≥1.27 on PostgreSQL 17 natively synchronizes logical decoding slots across the HA cluster via synchronizeLogicalDecoding. Combined with CNPG's stable -rw service endpoint, Debezium reconnects after failover without re-snapshotting — the slot survives and the connection follows automatically. For a CDC-heavy shop, this is a legitimate reason to choose CNPG over Aurora.

Caveat: this is feature-existence, not a published production benchmark at scale. Validate it before trusting it.
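
For orientation, here is roughly what the relevant part of a Cluster manifest could look like, written as a Python dict and dumped to JSON (which kubectl accepts). Treat it as a sketch: the cluster name and storage size are placeholders, the nesting of synchronizeLogicalDecoding under replicationSlots.highAvailability is my reading of the CNPG API and should be checked against the docs for your version, and any PostgreSQL parameters or slot options the feature requires are not shown.

```python
import json

CLUSTER_NAME = "cdc-demo"  # hypothetical name

cluster_manifest = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",
    "metadata": {"name": CLUSTER_NAME},
    "spec": {
        "instances": 3,                # one primary plus two standbys
        "storage": {"size": "100Gi"},  # placeholder size
        "replicationSlots": {
            "highAvailability": {
                "enabled": True,
                # Field named above; its exact location in the spec is an assumption.
                "synchronizeLogicalDecoding": True,
            }
        },
    },
}

# Debezium should target the stable read-write service, not a pod address.
debezium_overrides = {"database.hostname": f"{CLUSTER_NAME}-rw"}

print(json.dumps(cluster_manifest, indent=2))
print(json.dumps(debezium_overrides, indent=2))
```

Pointing Debezium at the -rw service is what lets the connection follow the new primary. The slot synchronization setting is what keeps the slot there when it arrives.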

Full Postgres control

RDS/Aurora restrict extensions to an allow-list and withhold true superuser. CNPG gives you any extension, any version, on your schedule. For teams blocked by RDS extension limitations or needing newer Postgres majors immediately — this is real.

GitOps-native operations

CNPG was designed for Kubernetes from day one. Declarative cluster specs, immutable images with SBOM provenance, native Prometheus metrics, PGAudit support. If your team lives in GitOps, CNPG fits that model in a way RDS/Aurora never will.

The Honest Decision Framework

Stay on Aurora / RDS if:

You don't have ≥2 engineers with deep overlapping Postgres + Kubernetes + storage/CSI expertise (bus factor matters — losing one person shouldn't be an existential risk)
Your managed-DB spend isn't large enough to justify the fixed operational overhead
Your RTO requirements are tight and you haven't benchmarked CNPG under your actual write load
You need compliance coverage you can point at (SOC, PCI, HIPAA eligibility) without building your own evidence chain

Consider CNPG if — and only if — at least 3 of these are true:

You already employ the Postgres + Kubernetes + storage expertise (marginal labor ≈ zero)
You're spending tens of thousands/month on managed DBs — fleet scale or I/O-heavy Aurora
You have a hard requirement managed services can't meet: extension control, version control, multi-cloud portability, data sovereignty, or CDC/Debezium slot failover
You can run 3 AZs with synchronous replication and accept the write-latency tradeoff
You've completed the in-tree → plugin Barman migration and have tested PITR at your actual data size
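
If it helps to make the "at least 3 of these" rule explicit for a team discussion, it can be written down as a tiny checklist. This is only a sketch of the rule as stated above; the class and field names are my own shorthand for the five criteria.

```python
from dataclasses import dataclass, fields


@dataclass
class CnpgReadiness:
    in_house_expertise: bool             # Postgres + Kubernetes + storage already on staff
    large_managed_db_spend: bool         # tens of thousands/month on RDS/Aurora
    hard_requirement_unmet: bool         # extensions, versions, portability, sovereignty, CDC
    three_az_sync_replication: bool      # can run it and accept the write latency
    barman_plugin_and_pitr_tested: bool  # plugin migration done, PITR tested at real size


def should_consider_cnpg(r: CnpgReadiness, threshold: int = 3) -> bool:
    """True when at least `threshold` of the criteria hold."""
    met = sum(bool(getattr(r, f.name)) for f in fields(r))
    return met >= threshold


team = CnpgReadiness(True, True, False, True, False)
print(should_consider_cnpg(team))  # True: three criteria met
```

The "stay on Aurora / RDS" list is not encoded here on purpose. Any one of those items is reason enough to stop, whatever this function returns.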

Before You Decide: Do This First

The public evidence gap is the biggest risk. Before any migration decision, run your own benchmark:

Stand up a 3-node CNPG cluster on EKS across 3 AZs. Drive it at your real write TPS. Measure pod-kill failover, node-kill failover (tuned node-monitor-grace-period + EBS reattach), and AZ-loss — under both async and sync replication. Compare against your Aurora/RDS SLOs.

If CNPG can't hit your RTO target under load, stop there. If it can, you have a data point nobody else seems to have published.
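
One way to turn such a test into a number: run a client that attempts a small write every second or so during the failure drill, log each attempt as (timestamp, succeeded), and compute the longest gap between successful writes. The sketch below covers only that last step. It assumes you collect the probe log yourself, and the function names and the 35-second target are illustrative.

```python
from typing import Iterable, Optional, Tuple


def longest_write_outage(probes: Iterable[Tuple[float, bool]]) -> float:
    """Longest gap in seconds between the last successful write before a run of
    failures and the first successful write after it. Probes must be sorted by
    time. Outages with no success before them, or that never recover, are ignored."""
    longest = 0.0
    last_ok: Optional[float] = None
    in_outage = False
    for ts, ok in probes:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest


def meets_rto(probes: Iterable[Tuple[float, bool]], rto_seconds: float) -> bool:
    """True when the longest observed write outage is within the RTO target."""
    return longest_write_outage(probes) <= rto_seconds


# Writes succeed at t=0 and t=1, fail from t=2 through t=44, recover at t=45.
log = [(0, True), (1, True)] + [(t, False) for t in range(2, 45)] + [(45, True)]
print(longest_write_outage(log), meets_rto(log, rto_seconds=35))  # 44 False
```

Measuring from the last good write to the next good write is deliberately pessimistic. It is also what your application actually experiences.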

Bottom Line

CNPG is excellent software, backed by a serious community (a CNCF Sandbox project that has applied for Incubation), and used in production at real enterprises. It is not vaporware.

But for a single high-traffic, multi-TB production database, the managed-service default holds. Aurora absorbs the 3am pages. CNPG hands them back to you, along with full control, which is either a feature or a burden depending entirely on your team's depth.

The clearest exception isn't a cost argument. It's CDC. If Debezium failing over without re-snapshotting is critical to your architecture, CNPG ≥1.27 on PG17 deserves a serious look.

I'm a Senior Staff DevOps/Platform Engineer working with Kubernetes, AWS, and Postgres at scale. What's your team's experience with CNPG in production? Drop it in the comments — there's a real gap in public production data here.