PostgreSQL has a reputation.
Great database.
Rock‑solid.
But “web‑scale”? Eh…
OpenAI just published how they’re running Postgres at millions of queries per second for ~800 million users.
With:
- One primary
- ~50 read replicas
- And no sharding (for Postgres)
Let’s talk about what actually made this work.
The Architecture (Surprisingly Simple)
- Single primary PostgreSQL
- Writes go to the primary
- Reads go almost everywhere else
- ~50 read replicas across regions
This only works because the workload is:
- Extremely read‑heavy
- Ruthlessly optimized
Postgres wasn’t the bottleneck.
Bad assumptions were.
The Failure Mode That Kept Biting Them
Every major incident followed the same pattern:
- Cache fails / misses spike
- Expensive queries flood Postgres
- Latency rises
- Requests start timing out
- Retries kick in
- Load gets worse
Congratulations, you’ve built a feedback loop from hell.
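The standard way to break that loop is a hard retry budget with jittered exponential backoff, so timeouts stop multiplying load. A minimal sketch in Python (the attempt count and delays are illustrative, not OpenAI's values):

```python
import random
import time

def fetch_with_bounded_retries(query_fn, max_attempts=3, base_delay=0.1):
    """Run query_fn with a small retry budget and jittered exponential backoff.

    Unbounded retries turn a latency blip into a traffic multiplier;
    a capped budget keeps the feedback loop from forming.
    """
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error instead of piling on load
            # Full jitter: sleep somewhere in [0, base_delay * 2^attempt] seconds
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Even a budget of two or three attempts is enough to keep a latency blip from turning into a self-inflicted traffic spike.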
Writes Are the Real Problem (Thanks, MVCC)
Postgres MVCC means:
- Every UPDATE writes a new copy of the entire row, even if one column changed
- Dead tuples pile up
- Autovacuum becomes a full‑time job
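You can watch this happen in pg_stat_user_tables. A quick sketch with psycopg2 (the DSN is a placeholder):

```python
import psycopg2

# Placeholder DSN; point it at your own database.
conn = psycopg2.connect("dbname=app host=localhost user=postgres")

with conn, conn.cursor() as cur:
    # n_dead_tup grows with every UPDATE/DELETE until autovacuum reclaims it.
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10
    """)
    for relname, live, dead, last_av in cur.fetchall():
        print(f"{relname}: {live} live, {dead} dead, last autovacuum {last_av}")
```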
At scale:
- Reads are cheap
- Writes are dangerous
So OpenAI made a call most teams avoid.
Rule #1: Don’t Let Postgres Do Write‑Heavy Work
- Shardable, write‑heavy workloads → CosmosDB
- PostgreSQL stays unsharded
- No new tables allowed on the main Postgres cluster
Why?
Because retroactively sharding a massive OLTP system is how roadmaps go to die.
ORMs Will Absolutely Hurt You If You Let Them
One real incident:
- A query joining 12 tables
- Traffic spike
- CPU pegged
- SEV triggered
Lessons learned:
- ORMs generate wild SQL
- Multi‑table joins at scale are a footgun
- Complex joins often belong in the app layer
Also:
- Kill idle transactions
- Set `idle_in_transaction_session_timeout`
- Always read the SQL your ORM emits
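A hedged sketch of both habits with SQLAlchemy; the role name, DSN, and 30-second value are placeholders, not OpenAI's settings:

```python
from sqlalchemy import create_engine, text, select, table, column

# echo=True logs every SQL statement the ORM actually sends to Postgres.
engine = create_engine("postgresql+psycopg2://app@localhost/app", echo=True)

with engine.begin() as conn:
    # Kill sessions that sit idle inside an open transaction.
    # (The role name and 30s value are illustrative.)
    conn.execute(text(
        "ALTER ROLE app_rw SET idle_in_transaction_session_timeout = '30s'"
    ))

# You can also inspect the SQL a statement will produce before it ever runs.
users = table("users", column("id"), column("email"))
print(select(users).where(users.c.id == 1).compile(engine))
```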
PgBouncer Is Not Optional
They hit:
- Connection storms
- Exhausted connection limits
- Cascading failures
Fix:
- PgBouncer in transaction/statement pooling mode (see the config sketch below)
- Deployed via Kubernetes per replica
- Co‑located with clients
Results:
- Connection time: 50ms → 5ms
- Way fewer idle connections
- Fewer “why is Postgres on fire” moments
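For reference, a minimal pgbouncer.ini in that spirit might look like this (hosts, pool sizes, and limits are placeholders, not OpenAI's configuration):

```ini
[databases]
; Each client-side PgBouncer points at its local replica (or the primary).
app = host=postgres-replica-1.internal port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Transaction pooling: a server connection is only held for the duration
; of a transaction, so a few dozen backend connections can serve
; thousands of client connections.
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 40
server_idle_timeout = 60
```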
Cache Misses Can Take You Down
A cache miss storm is just a DDoS you did to yourself.
Solution:
- Cache locking / leasing
- One request fetches from Postgres
- Everyone else waits
This single pattern prevented multiple outages.
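A minimal sketch of the leasing pattern with Redis; the key names, TTLs, and redis-py client are my assumptions, not OpenAI's implementation:

```python
import time
import redis

r = redis.Redis()  # placeholder connection

def get_with_lease(key, fetch_from_postgres, ttl=300, lease_ttl=10):
    """On a cache miss, only the lease holder queries Postgres.

    Everyone else polls the cache briefly instead of stampeding the database.
    """
    value = r.get(key)
    if value is not None:
        return value

    # SET NX acts as a short-lived lease: exactly one caller wins it.
    if r.set(f"lease:{key}", "1", nx=True, ex=lease_ttl):
        value = fetch_from_postgres()
        r.set(key, value, ex=ttl)
        return value

    # Lost the lease: wait for the winner to fill the cache.
    for _ in range(20):
        time.sleep(0.05)
        value = r.get(key)
        if value is not None:
            return value
    return fetch_from_postgres()  # fallback if the lease holder died
```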
Rate Limiting at Every Layer
They rate‑limit:
- Endpoints
- Connection pools
- Proxies
- Individual query digests
They’ll even:
- Block specific queries at the ORM layer
Retries are treated as dangerous, not helpful.
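One way to rate-limit per query digest is a token bucket keyed by a hash of the normalized SQL. A sketch in Python (the rates, bursts, and hashing scheme are illustrative):

```python
import hashlib
import time

class TokenBucket:
    """Allow at most `rate` requests per second per key, with a small burst."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per query digest (hash of the normalized SQL text).
buckets = {}

def allow_query(normalized_sql, rate=50, burst=100):
    digest = hashlib.sha1(normalized_sql.encode()).hexdigest()
    bucket = buckets.setdefault(digest, TokenBucket(rate, burst))
    if not bucket.allow():
        raise RuntimeError(f"query digest {digest[:8]} is rate-limited")
    return True
```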
One Primary Is a Risk — Reduce the Blast Radius
If the primary dies:
- Writes fail
- But reads still work
How:
- Critical read paths are replica‑only
- Primary runs in HA with a hot standby
- Fast, reliable failovers
This turns “everything is down” into “writes are temporarily unavailable.”
Huge difference.
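One way to make replica-only read paths concrete: keep separate connection pools and route by intent. A sketch with psycopg2 (hostnames and pool sizes are placeholders):

```python
from psycopg2 import pool

# Separate pools: writes go to the primary, reads to replicas.
primary = pool.SimpleConnectionPool(1, 10, host="pg-primary.internal", dbname="app")
replicas = pool.SimpleConnectionPool(1, 50, host="pg-replicas.internal", dbname="app")

def run(sql, params=(), readonly=True):
    """Route by intent: if the primary is down, read-only paths keep working."""
    p = replicas if readonly else primary
    conn = p.getconn()
    try:
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall() if readonly else None
    finally:
        p.putconn(conn)
```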
Noisy Neighbors Are Real
New feature launches caused:
- CPU spikes
- Latency for unrelated traffic
Fix:
- Workload isolation
- High‑priority vs low‑priority instances
- Product‑level separation
One bad feature should not take down ChatGPT.
Schema Changes Are Basically Treated as Incidents
Rules:
- No table rewrites
- No long migrations
- 5‑second schema change timeout
- Indexes only with `CONCURRENTLY`
- Backfills are rate‑limited and can take weeks
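A sketch of what a "safe by default" migration step can look like with psycopg2 (the table and index names are made up; the 5-second value mirrors the rule above):

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN

# CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
conn.autocommit = True

with conn.cursor() as cur:
    # Fail fast if we can't get the (brief) locks we need, rather than
    # queueing behind live traffic and blocking everything behind us.
    cur.execute("SET lock_timeout = '5s'")

    # The build itself is slow by design but only takes weak locks,
    # so it gets no statement timeout.
    cur.execute("SET statement_timeout = 0")
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id "
        "ON orders (user_id)"
    )
```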
Speed is optional.
Stability isn’t.
The Results
- Millions of QPS
- Low double‑digit ms p99 latency
- ~50 global read replicas
- Five‑nines availability
- One SEV‑0 in a year
Postgres didn’t fail.
Engineering discipline scaled it.
The Takeaway
PostgreSQL can scale insanely far if:
- Your workload is read‑heavy
- Writes are treated with respect
- Caching is defensive
- Retries are controlled
- ORMs are kept on a leash
Most Postgres horror stories aren’t about Postgres.
They’re about letting the system do work it was never meant to do.