DEV Community

Wahyu Tricahyo

Active-Passive vs Active-Active: How to Know Which Pattern Fits Your Database

OpenAI recently shared how they scale PostgreSQL for ChatGPT's 800 million weekly users. The part worth paying attention to wasn't the scale itself, it was the decision. They chose Active-Passive over Active-Active, and they didn't shard. For a product growing that fast, that's a bet most teams wouldn't make.

But the real question isn't "what did OpenAI do?" It's: how do you figure out which pattern is right for your own workload?

That's what this post is about. OpenAI's setup is a reference point, but the focus is on giving you the tools and framework to make this decision yourself.


First: Measure Before You Architect

Teams often jump to Active-Active or sharding because their user count "feels" big enough. The better approach is to let your actual data tell you what to do.

Check Your Read/Write Ratio in PostgreSQL

This is the single most important number for deciding your replication pattern.

Database-Level: pg_stat_database

SELECT
  datname,
  tup_returned + tup_fetched AS total_reads,
  tup_inserted + tup_updated + tup_deleted AS total_writes,
  ROUND(
    (tup_returned + tup_fetched)::numeric /
    NULLIF((tup_returned + tup_fetched + tup_inserted + tup_updated + tup_deleted), 0) * 100, 2
  ) AS read_percentage,
  ROUND(
    (tup_inserted + tup_updated + tup_deleted)::numeric /
    NULLIF((tup_returned + tup_fetched + tup_inserted + tup_updated + tup_deleted), 0) * 100, 2
  ) AS write_percentage
FROM pg_stat_database
WHERE datname = 'your_database_name';

Quick reference: tup_returned counts rows read by sequential scans plus index entries returned by index scans, tup_fetched counts live rows fetched by index scans, and the write fields are self-explanatory. Note that tup_returned includes internal operations (e.g., sequential scans by autovacuum), which can inflate your read count, so treat the ratio as directional, not exact. These counters are cumulative since the last stats reset. To measure a specific window (peak hours vs. off-peak), reset first with SELECT pg_stat_reset(); and query again after the window ends.

Per-Table: pg_stat_user_tables

The database-level number hides a lot. A database can be 85% reads overall while two tables are 95% writes. Those tables are the ones that matter.

SELECT
  schemaname,
  relname AS table_name,
  seq_tup_read + idx_tup_fetch AS total_reads,
  n_tup_ins + n_tup_upd + n_tup_del AS total_writes,
  ROUND(
    (seq_tup_read + idx_tup_fetch)::numeric /
    NULLIF((seq_tup_read + idx_tup_fetch + n_tup_ins + n_tup_upd + n_tup_del), 0) * 100, 2
  ) AS read_percentage
FROM pg_stat_user_tables
ORDER BY (n_tup_ins + n_tup_upd + n_tup_del) DESC
LIMIT 20;

This is the query that actually drives decisions. It shows which tables are write-heavy and might need to live somewhere else, while the rest stay on a single-primary Postgres.

Per-Query: pg_stat_statements

For deeper analysis, pg_stat_statements shows exactly which queries hit the database hardest:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top write queries by call count
SELECT query, calls, ROUND((total_exec_time / calls)::numeric, 2) AS avg_time_ms
FROM pg_stat_statements
WHERE query ILIKE '%INSERT%' OR query ILIKE '%UPDATE%' OR query ILIKE '%DELETE%'
ORDER BY calls DESC
LIMIT 20;

-- Top read queries by call count
SELECT query, calls, ROUND((total_exec_time / calls)::numeric, 2) AS avg_time_ms
FROM pg_stat_statements
WHERE query ILIKE 'SELECT%'
ORDER BY calls DESC
LIMIT 20;

This also catches problematic queries before they cause incidents. OpenAI found an ORM-generated query joining 12 tables that caused multiple high-severity outages. That's not a scaling problem, it's a query problem. pg_stat_statements would have surfaced it immediately.

A note on the ILIKE filters above: matching on %INSERT% can catch queries that merely mention INSERT in a comment, string literal, or subquery, and 'SELECT%' misses statements that start with WITH. Anchoring a case-insensitive regex (e.g. query ~* '^\s*(insert|update|delete)') is more precise, and pg_stat_statements groups statements by queryid if you need to aggregate. These queries are a starting point, not a production monitoring setup.

What Your Numbers Mean

| Your Read % | Likely Pattern | Notes |
|---|---|---|
| 90%+ reads | Active-Passive | Same territory as ChatGPT. Add replicas, not complexity |
| 70-90% reads | Active-Passive (probably) | Check per-table breakdown. Offload the write-heavy outliers |
| 50-70% reads | It depends | Look at where writes concentrate. Selective sharding might be enough |
| Below 50% reads | Active-Active or Sharding | A single primary will become the bottleneck at scale |

These percentages assume writes are spread across many tables. If your writes are concentrated in two or three tables, the database-level ratio can be misleading. A database that's 70% reads overall might work fine on Active-Passive if the write-heavy tables can be offloaded. Always cross-check with the per-table query.
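The thresholds in the table are easy to express as a tiny helper, which makes them convenient to drop into a monitoring script. A sketch (the function name and messages are illustrative, and the output is a starting point, not a verdict):

```python
def suggest_pattern(read_pct: float) -> str:
    """Map a measured read percentage to the table's likely pattern.

    Thresholds mirror the table above. Always cross-check the
    per-table breakdown before acting on the result.
    """
    if read_pct >= 90:
        return "Active-Passive"
    if read_pct >= 70:
        return "Active-Passive (check per-table outliers)"
    if read_pct >= 50:
        return "It depends (look at where writes concentrate)"
    return "Active-Active or Sharding"
```

Feed it the read_percentage from the pg_stat_database query and use the answer as a prompt for the per-table investigation, not a final decision.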

Important caveat: These percentages reflect steady-state traffic. Spikes matter too.

OpenAI's workload is heavily read-dominant, but during the ImageGen launch, 100 million users signed up in a single week and writes spiked 10x overnight. That spike caused their only SEV-0 Postgres incident in twelve months. The primary got overwhelmed, requests timed out, and users retried, creating a vicious cycle where retry traffic pushed the already-drowning database into full outage. They recovered by design (hot standby failover, rate limiting kicked in), but the lesson is clear: a read-heavy ratio can temporarily flip during a viral launch.

Measure steady-state, but also look at the worst traffic day in the last year. If writes can spike 5-10x during launches, the architecture needs headroom for that.
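One application-side mitigation for that retry-driven death spiral is exponential backoff with jitter, so retrying clients spread out instead of arriving in synchronized waves. A minimal sketch of the delay schedule only (the base and cap values are assumptions to tune per workload):

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 30.0):
    """Full-jitter exponential backoff.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    which decorrelates clients that all failed at the same moment and
    keeps retry traffic from hammering a recovering primary in lockstep.
    """
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

Pair this with server-side rate limiting; backoff alone only softens the wave, it doesn't cap total load.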


The Two Patterns: What Actually Matters

Active-Passive: The Gotchas

The selling point is simplicity: one source of truth, no conflict resolution, scale reads by adding replicas. But:

  • Failover isn't instant. Promoting a replica means brief downtime. Without recent testing, "brief" can stretch into minutes. OpenAI runs a hot standby matching the primary's specs, paying double for the writer tier as insurance.
  • Replicas have diminishing returns. 50 replicas doesn't mean 50x read capacity. Each one adds WAL streaming overhead on the primary. After ~15-20, the primary's CPU budget for writes starts getting eaten. OpenAI is hitting this now, hence cascading replication. There's also a cost question: 50 read replicas plus a hot standby can exceed the cost of a smaller Active-Active cluster with fewer nodes that handle both reads and writes. The math depends on instance sizes and cloud provider, but it's worth running before committing to either pattern.
  • Replication lag is invisible until it isn't. Usually milliseconds. During write spikes, lag balloons, and replicas serve stale data. Without routing recent-write reads to the primary, users see inconsistencies they can't explain. Monitor it with pg_stat_replication on the primary:
SELECT
  client_addr,
  state,
  sent_lsn,
  replay_lsn,
  replay_lag
FROM pg_stat_replication;

If replay_lag is consistently above a few hundred milliseconds, replicas are falling behind. Set up alerts for this, not just dashboards.
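"Consistently" is the key word. Alerting on a single lag spike produces noise; a sketch of a sustained-threshold check instead (the threshold and sample count are illustrative, tune them per workload):

```python
def should_alert(lag_samples_ms, threshold_ms=500, min_consecutive=3):
    """Alert only when replay lag stays above the threshold for several
    consecutive samples, so one transient spike doesn't page anyone."""
    streak = 0
    for lag in lag_samples_ms:
        streak = streak + 1 if lag > threshold_ms else 0
        if streak >= min_consecutive:
            return True
    return False
```

Feed it replay_lag values polled from pg_stat_replication (converted to milliseconds) on a fixed interval.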

Active-Active: The Complexity Tax

[Image: Active-Active architecture diagram. Source: pgEdge]

The selling point is resilience and global write latency. But:

  • Conflict resolution leaks into application code. "Last-write-wins" sounds fine until two users update the same record simultaneously and one silently loses their change. Especially nasty for account balances, permissions, or anything where silent overwrites cause real harm.
  • Testing is exponentially harder. Every code path needs testing against the possibility of concurrent conflicting writes across regions.
  • It couples database choice to architecture. PostgreSQL doesn't natively support multi-master. Active-Active means adding third-party tooling (pgEdge, BDR) or switching databases entirely (CockroachDB, YugabyteDB). That's a bigger commitment than adding read replicas.
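The silent-loss failure in the first bullet is easy to demonstrate. Below, two nodes apply concurrent updates to the same record and a last-write-wins merge keeps only the later timestamp; one user's change vanishes with no error (the node names and record shape are invented for illustration):

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Keep whichever version has the later timestamp ('last write wins')."""
    return a if a["ts"] >= b["ts"] else b

# Two regions start from the same record and update it concurrently.
base = {"balance": 100, "ts": 1}
node_a = {"balance": base["balance"] - 30, "ts": 2}  # user A withdraws 30
node_b = {"balance": base["balance"] - 50, "ts": 3}  # user B withdraws 50

merged = lww_merge(node_a, node_b)
# merged["balance"] is 50: user A's withdrawal is silently lost.
# The correct balance after both withdrawals would be 20.
```

Avoiding this means version checks, CRDTs, or routing conflicting keys to a single writer, and all of that logic lives in your application or tooling, not in vanilla Postgres.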

Side-by-Side

| Factor | Active-Passive | Active-Active |
|---|---|---|
| Write throughput | Limited to one primary | Scales with nodes |
| Read throughput | Scales with replicas | Scales with nodes |
| Consistency | Simple, one source of truth | Needs conflict resolution |
| Write latency | Higher for distant regions | Low everywhere |
| Operational complexity | Lower | Significantly higher |
| Failover | Promote a replica (brief downtime) | Other masters keep serving |
| Best for | Read-heavy workloads | Write-heavy or geo-distributed writes |

How OpenAI Applies This

OpenAI describes their workload as heavily read-dominant. Active-Passive was the obvious call.

The setup: one primary PostgreSQL on Azure Flexible Server, ~50 read replicas across regions, millions of QPS, low double-digit millisecond p99 latency, five-nines availability.

The interesting part isn't the pattern, it's what keeps it alive at this scale.

Operational Defenses

A single-primary setup at 800M users is fragile without these layers. Most of these apply regardless of replication pattern:

  • PgBouncer for connection pooling. Dropped connection setup from 50ms to 5ms. At millions of QPS, this is the difference between a system that works and one that doesn't. Note that PgBouncer handles pooling, not read/write routing. Sending reads to replicas and writes to the primary requires something else: HAProxy, pgpool-II, or application-level routing logic.
  • Cache locking. When a cache key expires, one request fetches from Postgres. Everyone else waits for the cache to repopulate instead of all hammering the database at once.
  • Workload isolation. Low-priority and high-priority traffic route to separate replica instances. A background analytics job can't slow down user-facing reads.
  • Multi-layer rate limiting. At application, pooler, proxy, and query levels. All of them, not just one.

Moving Write-Heavy Tables, Not the Whole Database

Instead of sharding Postgres or migrating to a distributed database, OpenAI identified specific write-heavy workloads and moved only those to Cosmos DB. Everything else stays on Postgres.

The rule: no new tables in the main Postgres. New features needing heavy writes go to sharded systems from day one. Existing write-heavy tables get migrated out when possible.

This maps directly to the per-table query above. Run it, find the top write-heavy tables, and ask: can these live somewhere else?

Cascading Replication: Scaling Past the Replica Ceiling

With ~50 replicas, the primary streams WAL to every single one, and each replica adds CPU and network overhead. Cascading replication solves this by introducing intermediate replicas: the primary streams to a handful of intermediates, and each intermediate relays WAL to a group of downstream replicas. This means the primary only maintains a few replication connections instead of fifty. The tradeoff is slightly more lag for downstream replicas (they're one hop further from the source) and more complex failover, since promoting an intermediate requires repointing its downstream replicas. OpenAI is testing this with Azure but hasn't deployed it to production yet.
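In stock PostgreSQL, a cascading standby is configured the same way as a direct one; its primary_conninfo simply points at an intermediate standby instead of the primary. A sketch of the relevant settings (host names are placeholders):

```ini
# On a downstream replica: stream WAL from an intermediate standby,
# not from the primary. The intermediate relays what it receives.
primary_conninfo = 'host=intermediate-replica-1 port=5432 user=replicator'

# On the intermediate standby: it must be able to serve WAL itself.
max_wal_senders = 10
hot_standby = on
```

The intermediate forwards WAL it has received, which is where the extra hop of lag for downstream replicas comes from.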

The Hybrid Approach

This is the option most teams overlook, and probably the right answer for most:

  • Keep the main database Active-Passive
  • Identify write-heavy outliers with pg_stat_user_tables
  • Move those specific workloads to a system built for writes
  • Don't touch anything that doesn't need to be touched

Full Active-Active is expensive in complexity. Full Active-Passive might have a few tables that don't fit. The hybrid gets 90% of the benefit with 10% of the migration pain.


The Real Lesson

The most common mistake is choosing a replication pattern based on the user count a team hopes to reach, not the workload they actually have. Active-Active gets adopted for a scale problem that doesn't exist yet, and the team spends two years debugging the complexity.

Meanwhile, the problems that actually cause outages are mundane: an unoptimized ORM query, missing connection pooling, a background job competing with user-facing reads. None of these require a different replication pattern. They require running pg_stat_statements and paying attention.

Run the queries. Look at the ratio. If it says Active-Passive, trust it, even if it feels too simple. Complexity can always be added later. It can never easily be removed.

When to revisit the decision:

  • Write latency on the primary is consistently above your SLA and you've already optimized queries and offloaded what you can
  • Replication lag (replay_lag) stays above acceptable thresholds during normal traffic, not just spikes
  • Failover tests are taking longer than your downtime budget allows
  • You're expanding to regions where write latency to the primary is unacceptable for the user experience

If none of those are true, the current pattern is working. Keep measuring.

