PostgreSQL has a reputation.
Great database.
Rock‑solid.
But “web‑scale”? Eh…
OpenAI just published how they’re running Postgres at millions of queries per second for ~800 million users.
With:
- One primary
- ~50 read replicas
- And no sharding (for Postgres)
Let’s talk about what actually made this work.
The Architecture (Surprisingly Simple)
- Single primary PostgreSQL
- Writes go to the primary
- Reads go almost everywhere else
- ~50 read replicas across regions
This only works because the workload is:
- Extremely read‑heavy
- Ruthlessly optimized
Postgres wasn’t the bottleneck.
Bad assumptions were.
The Failure Mode That Kept Biting Them
Every major incident followed the same pattern:
- Cache fails / misses spike
- Expensive queries flood Postgres
- Latency rises
- Requests start timing out
- Retries kick in
- Load gets worse
Congratulations, you’ve built a feedback loop from hell.
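The standard way to break that loop is a hard retry budget with jittered exponential backoff, so timeouts stop multiplying load. A minimal sketch in Python (the attempt count and delays are illustrative, not OpenAI's values):

```python
import random
import time

def fetch_with_bounded_retries(query_fn, max_attempts=3, base_delay=0.1):
    """Run query_fn with a small retry budget and jittered exponential backoff.

    Unbounded retries turn a latency blip into a traffic multiplier;
    a capped budget keeps the feedback loop from forming.
    """
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error instead of piling on load
            # Full jitter: sleep somewhere in [0, base_delay * 2^attempt] seconds
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Even a budget of two or three attempts is enough to keep a latency blip from turning into a self-inflicted traffic spike.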
Writes Are the Real Problem (Thanks, MVCC)
Postgres MVCC means:
- Every UPDATE writes a new copy of the entire row, even if one column changed
- Dead tuples pile up
- Autovacuum becomes a full‑time job
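You can watch this happen in pg_stat_user_tables. A quick sketch with psycopg2 (the DSN is a placeholder):

```python
import psycopg2

# Placeholder DSN; point it at your own database.
conn = psycopg2.connect("dbname=app host=localhost user=postgres")

with conn, conn.cursor() as cur:
    # n_dead_tup grows with every UPDATE/DELETE until autovacuum reclaims it.
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10
    """)
    for relname, live, dead, last_av in cur.fetchall():
        print(f"{relname}: {live} live, {dead} dead, last autovacuum {last_av}")
```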
At scale:
- Reads are cheap
- Writes are dangerous
So OpenAI made a call most teams avoid.
Rule #1: Don’t Let Postgres Do Write‑Heavy Work
- Shardable, write‑heavy workloads → CosmosDB
- PostgreSQL stays unsharded
- No new tables allowed on the main Postgres cluster
Why?
Because retroactively sharding a massive OLTP system is how roadmaps go to die.
ORMs Will Absolutely Hurt You If You Let Them
One real incident:
- A query joining 12 tables
- Traffic spike
- CPU pegged
- SEV triggered
Lessons learned:
- ORMs generate wild SQL
- Multi‑table joins at scale are a footgun
- Complex joins often belong in the app layer
Also:
- Kill idle transactions
- Set `idle_in_transaction_session_timeout`
- Always read the SQL your ORM emits
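A hedged sketch of both habits with SQLAlchemy; the role name, DSN, and 30-second value are placeholders, not OpenAI's settings:

```python
from sqlalchemy import create_engine, text, select, table, column

# echo=True logs every SQL statement the ORM actually sends to Postgres.
engine = create_engine("postgresql+psycopg2://app@localhost/app", echo=True)

with engine.begin() as conn:
    # Kill sessions that sit idle inside an open transaction.
    # (The role name and 30s value are illustrative.)
    conn.execute(text(
        "ALTER ROLE app_rw SET idle_in_transaction_session_timeout = '30s'"
    ))

# You can also inspect the SQL a statement will produce before it ever runs.
users = table("users", column("id"), column("email"))
print(select(users).where(users.c.id == 1).compile(engine))
```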
PgBouncer Is Not Optional
They hit:
- Connection storms
- Exhausted connection limits
- Cascading failures
Fix:
- PgBouncer in transaction/statement pooling mode (see the config sketch below)
- Deployed via Kubernetes per replica
- Co‑located with clients
Results:
- Connection time: 50ms → 5ms
- Way fewer idle connections
- Fewer “why is Postgres on fire” moments
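For reference, a minimal pgbouncer.ini in that spirit might look like this (hosts, pool sizes, and limits are placeholders, not OpenAI's configuration):

```ini
[databases]
; Each client-side PgBouncer points at its local replica (or the primary).
app = host=postgres-replica-1.internal port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Transaction pooling: a server connection is only held for the duration
; of a transaction, so a few dozen backend connections can serve
; thousands of client connections.
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 40
server_idle_timeout = 60
```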
Cache Misses Can Take You Down
A cache miss storm is just a DDoS you did to yourself.
Solution:
- Cache locking / leasing
- One request fetches from Postgres
- Everyone else waits
This single pattern prevented multiple outages.
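A minimal sketch of the leasing pattern with Redis; the key names, TTLs, and redis-py client are my assumptions, not OpenAI's implementation:

```python
import time
import redis

r = redis.Redis()  # placeholder connection

def get_with_lease(key, fetch_from_postgres, ttl=300, lease_ttl=10):
    """On a cache miss, only the lease holder queries Postgres.

    Everyone else polls the cache briefly instead of stampeding the database.
    """
    value = r.get(key)
    if value is not None:
        return value

    # SET NX acts as a short-lived lease: exactly one caller wins it.
    if r.set(f"lease:{key}", "1", nx=True, ex=lease_ttl):
        value = fetch_from_postgres()
        r.set(key, value, ex=ttl)
        return value

    # Lost the lease: wait for the winner to fill the cache.
    for _ in range(20):
        time.sleep(0.05)
        value = r.get(key)
        if value is not None:
            return value
    return fetch_from_postgres()  # fallback if the lease holder died
```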
Rate Limiting at Every Layer
They rate‑limit:
- Endpoints
- Connection pools
- Proxies
- Individual query digests
They’ll even:
- Block specific queries at the ORM layer
Retries are treated as dangerous, not helpful.
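One way to rate-limit per query digest is a token bucket keyed by a hash of the normalized SQL. A sketch in Python (the rates, bursts, and hashing scheme are illustrative):

```python
import hashlib
import time

class TokenBucket:
    """Allow at most `rate` requests per second per key, with a small burst."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per query digest (hash of the normalized SQL text).
buckets = {}

def allow_query(normalized_sql, rate=50, burst=100):
    digest = hashlib.sha1(normalized_sql.encode()).hexdigest()
    bucket = buckets.setdefault(digest, TokenBucket(rate, burst))
    if not bucket.allow():
        raise RuntimeError(f"query digest {digest[:8]} is rate-limited")
    return True
```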
One Primary Is a Risk — Reduce the Blast Radius
If the primary dies:
- Writes fail
- But reads still work
How:
- Critical read paths are replica‑only
- Primary runs in HA with a hot standby
- Fast, reliable failovers
This turns “everything is down” into “writes are temporarily unavailable.”
Huge difference.
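One way to make replica-only read paths concrete: keep separate connection pools and route by intent. A sketch with psycopg2 (hostnames and pool sizes are placeholders):

```python
from psycopg2 import pool

# Separate pools: writes go to the primary, reads to replicas.
primary = pool.SimpleConnectionPool(1, 10, host="pg-primary.internal", dbname="app")
replicas = pool.SimpleConnectionPool(1, 50, host="pg-replicas.internal", dbname="app")

def run(sql, params=(), readonly=True):
    """Route by intent: if the primary is down, read-only paths keep working."""
    p = replicas if readonly else primary
    conn = p.getconn()
    try:
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall() if readonly else None
    finally:
        p.putconn(conn)
```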
Noisy Neighbors Are Real
New feature launches caused:
- CPU spikes
- Latency for unrelated traffic
Fix:
- Workload isolation
- High‑priority vs low‑priority instances
- Product‑level separation
One bad feature should not take down ChatGPT.
Schema Changes Are Basically Treated as Incidents
Rules:
- No table rewrites
- No long migrations
- 5‑second schema change timeout
- Indexes only with `CONCURRENTLY`
- Backfills are rate‑limited and can take weeks
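A sketch of what a "safe by default" migration step can look like with psycopg2 (the table and index names are made up; the 5-second value mirrors the rule above):

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN

# CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
conn.autocommit = True

with conn.cursor() as cur:
    # Fail fast if we can't get the (brief) locks we need, rather than
    # queueing behind live traffic and blocking everything behind us.
    cur.execute("SET lock_timeout = '5s'")

    # The build itself is slow by design but only takes weak locks,
    # so it gets no statement timeout.
    cur.execute("SET statement_timeout = 0")
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id "
        "ON orders (user_id)"
    )
```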
Speed is optional.
Stability isn’t.
The Results
- Millions of QPS
- Low double‑digit ms p99 latency
- ~50 global read replicas
- Five‑nines availability
- One SEV‑0 in a year
Postgres didn’t fail.
Engineering discipline scaled it.
The Takeaway
PostgreSQL can scale insanely far if:
- Your workload is read‑heavy
- Writes are treated with respect
- Caching is defensive
- Retries are controlled
- ORMs are kept on a leash
Most Postgres horror stories aren’t about Postgres.
They’re about letting the system do work it was never meant to do.