OpenAI · Databases · 18 May 2026
ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.
- 800M users, 1 primary PG instance
- ~50 read replicas globally
- Millions of QPS, p99 <20ms
- PgBouncer: 50ms → 5ms connect
- One SEV-0 in 12 months
- 5-second DDL timeout enforced
The Story
The conventional wisdom about database scaling at 800 million users is straightforward: you shard. You move to a distributed SQL system. You decompose into microservices each with their own database. You do not run a single primary PostgreSQL instance. OpenAI's ChatGPT does not follow this conventional wisdom. It runs on one Azure PostgreSQL Flexible Server that handles all writes — backed by approximately 50 read replicas spread across multiple regions. The system handles millions of queries per second at low double-digit millisecond p99 latency and has maintained five-nines availability. In twelve months, they had one SEV-0. The story is not that Postgres is magic. The story is that relentless optimization of a boring, proven technology can outperform premature architectural complexity.
WHY SINGLE-PRIMARY WORKS AT THIS SCALE
ChatGPT's workload is overwhelmingly read-heavy. When 800 million users open the app, browse their chat history, or load their settings, those are reads. Writes happen on message submission and account updates — a much smaller fraction of the total traffic. This access pattern is exactly what a single-primary with many read replicas handles well: the write path stays narrow, the read load fans out horizontally across replicas. The architecture is not brilliant. It is appropriate for the workload. That fit is what makes it work.
OpenAI's blog published at PGConf.dev 2025 was unusually candid about both the decisions that worked and the ones that nearly broke the system. The database load grew by more than 10x in a single year following ChatGPT's viral growth. The team responded with aggressive optimization at every layer: connection management, query design, caching, write path discipline, and schema change governance. Each of these deserves examination — not because the techniques are novel, but because executing all of them simultaneously, under extreme growth pressure, with production at risk, is far harder than any one technique in isolation.
🔌
OpenAI's Azure PostgreSQL Flexible Server has a maximum of 5,000 concurrent connections. At ChatGPT's scale, application servers would easily exhaust this limit without connection pooling. Before deploying PgBouncer (a lightweight connection pooler for PostgreSQL that multiplexes many application connections into a smaller pool of real database connections, dramatically reducing connection overhead), average connection time was 50ms. After deployment in statement-pooling mode: 5ms. A 10x improvement from one infrastructure change.
Problem
10x Database Load Growth in One Year
ChatGPT's viral growth — 100 million users in two months at launch, 800 million by 2025 — drove database load up more than 10x in a single year. Connection exhaustion became a recurring threat. A 12-table ORM-generated join was causing multiple high-severity incidents when traffic spiked. Write pressure on the single primary was approaching dangerous levels during high-demand events.
Cause
Invisible Query Complexity and Write Pressure
ORMs ORM (Object-Relational Mapping — a framework layer like Django or SQLAlchemy that automatically generates SQL from application code, abstracting away the database — convenient but capable of generating complex, inefficient queries that are invisible until they cause production incidents) generate SQL automatically, hiding complexity from developers. Under low load, even a 12-table join is fast enough to not notice. Under 10x load, the same query saturates database CPU. Meanwhile, write-heavy workloads that could be migrated to sharded systems like Azure Cosmos DB remained on the single primary longer than optimal.
Solution
Multi-Layer Defense: Pool + Cache + Rate Limit + Migrate
OpenAI implemented PgBouncer connection pooling (cutting connect time 10x), a cache-locking mechanism to prevent thundering herd on cache misses, multi-layer rate limiting at application, proxy, and query levels, surgical elimination of the worst ORM-generated queries, strict schema change governance (5-second DDL timeout), and a policy of migrating all new write-heavy workloads to sharded systems by default.
Result
One SEV-0 in Twelve Months, Five-Nines Availability
One SEV-0 in twelve months — triggered by the viral launch of ChatGPT ImageGen, which caused a 10x write surge as over 100 million users signed up within a week. Postgres recovered by design. p99 latency held at low double-digit milliseconds. The single-primary architecture remained viable at a scale that surprised the entire database engineering community.
⚠️
The ORM Query That Caused Multiple SEV-0s
OpenAI's engineers discovered that a single ORM-generated SQL query was joining 12 tables. Under normal load, the query executed in acceptable time. Under traffic spikes, it saturated the primary database's CPU and caused multiple high-severity incidents. The query had been auto-generated by the ORM framework and never explicitly reviewed. ORMs are excellent for developer productivity and terrible for query performance visibility. OpenAI now requires that all ORM-generated queries against high-traffic tables be reviewed and analyzed with EXPLAIN ANALYZE before deployment.
OpenAI's schema change governance is one of the most operationally distinctive aspects of their Postgres setup. They enforce a strict rule: schema changes that trigger a full table rewrite are prohibited in production. Postgres's MVCC (Multi-Version Concurrency Control — Postgres's mechanism for allowing readers and writers to operate concurrently without blocking each other, at the cost of retaining multiple versions of each row and requiring periodic vacuum to reclaim space) model means that operations like ALTER TABLE ADD COLUMN DEFAULT on large tables can hold an exclusive lock for hours while rewriting billions of rows. This would be catastrophic at ChatGPT's scale. All DDL operations have a 5-second timeout : if the schema change cannot acquire a lock within 5 seconds, it is cancelled automatically. Long-running queries that would block vacuum or DDL are automatically terminated.
ℹ️
The Hot Standby in High-Availability Mode
OpenAI runs the primary database in High-Availability mode with a hot standby — a continuously synchronized replica specifically designated as the failover target. If the primary goes down, the hot standby can be promoted to primary with ~30–60 seconds of downtime. During a primary failure, read traffic on replicas is unaffected — since most ChatGPT requests are reads, a primary failure is not a SEV-0 (because reads remain available). Writes fail until promotion completes. This asymmetry between read and write availability is a conscious architectural tradeoff: the 800 million users who are just browsing conversation history continue being served.
ℹ️
Why Not Shard? The Honest Answer
The engineering question 'why didn't OpenAI shard PostgreSQL?' has a straightforward answer: sharding is expensive and their workload didn't require it yet. Horizontal sharding introduces cross-shard transaction complexity, scatter-gather query patterns, operational overhead of multiple database instances, and application-layer awareness of shard routing. For a read-heavy workload that can be served from replicas, these costs are not justified. OpenAI chose to pay the operational cost of extreme Postgres optimization rather than the architectural cost of sharding — and the math worked out. The 'no new tables' policy ensures this calculation will be revisited for write-heavy workloads as they emerge.
IDLE TRANSACTION TIMEOUTS: THE QUIET KILLER
OpenAI identified a subtle but devastating Postgres pattern at scale: idle transactions. When application code opens a database connection, starts a transaction, does unrelated work (calling an external API, waiting for user input), and only then commits — the transaction holds locks for the entire duration. At ChatGPT's scale, applications that hold open transactions for seconds can block vacuum, block DDL, and degrade query performance for all other connections. OpenAI enforces strict idle_in_transaction_session_timeout settings — any connection idle inside a transaction for more than a few seconds is automatically terminated. This breaks poorly-written code immediately in staging rather than causing incidents in production.
📊
Despite having ~50 read replicas across multiple geographic regions, OpenAI reports near-zero replication lag on most replicas under normal conditions. This is achieved by co-locating PgBouncer, application servers, and replicas in the same region (minimizing network latency in the replication path) and by keeping primary write load within the replication throughput capacity of the replicas. Heavy write events — like the ImageGen launch surge — temporarily increase replication lag, which is why read-your-own-write operations are always routed to the primary.
⚠️
The Write Ceiling Is Real
OpenAI's single-primary architecture has an acknowledged limit: write-heavy events can overwhelm it. The ImageGen SEV-0 was caused by a write surge, not a read surge. The architecture is not defended against arbitrary write load — it is defended against the current write profile, which remains manageable because most new write-heavy workloads are being routed to Cosmos DB. If write load grows faster than the migration effort proceeds, the single-primary architecture will face a harder ceiling. The 'no new tables in Postgres' policy is the operational discipline that buys time.
The Fix
The Seven-Layer Defense
OpenAI's Postgres scaling is not one clever trick — it is seven mutually reinforcing operational practices applied simultaneously. Any one of them in isolation would help marginally. Together they have produced an architecture that handles a scale that its underlying technology was not originally designed for. The practices are: connection pooling, thundering herd prevention, multi-layer rate limiting, hot standby failover, write offloading, query surgery, and DDL governance. Each addresses a specific failure mode that appeared as ChatGPT grew.
- 10x — Database load growth in a single year following ChatGPT's viral expansion — the growth rate that forced each of the seven defensive layers to be implemented under production pressure
- 5ms — Average connection setup time after PgBouncer deployment — down from 50ms before pooling, a 10x improvement that eliminated connection exhaustion as a recurring incident cause
- 5 sec — Maximum DDL lock wait timeout — schema changes that cannot acquire a lock within 5 seconds are automatically cancelled, preventing table-lock incidents on billion-row tables
- 1 SEV-0 — High-severity incidents in the twelve months after full defensive architecture was deployed — triggered by ImageGen launch write surge, resolved by design without full platform outage
# Cache-locking pattern: prevents thundering herd on cache misses
# When cache expires, only ONE request repopulates it — others wait
import threading
# Simplified cache-lock implementation
_cache = {}
_locks = {}
_lock_mutex = threading.Lock()
def get_with_cache_lock(key: str, fetch_fn, ttl_seconds: int):
"""Get value from cache. On miss, only one thread fetches;
others block and receive the result once available."""
# Fast path: cache hit
if key in _cache:
return _cache[key]
# Slow path: cache miss — acquire per-key lock
with _lock_mutex:
if key not in _locks:
_locks[key] = threading.Event()
should_fetch = True
else:
should_fetch = False
event = _locks[key]
if should_fetch:
try:
# This thread does the database read
value = fetch_fn() # ONE database query
_cache[key] = value # populate cache
return value
finally:
with _lock_mutex:
event = _locks.pop(key)
event.set() # wake all waiters
else:
# Other threads wait for the fetching thread to complete
event.wait(timeout=5) # don't wait forever
return _cache.get(key) # return from cache
ℹ️
The Cosmos DB Migration Policy
OpenAI's most forward-looking operational decision is a standing policy: no new tables are created in PostgreSQL. All new workloads default to sharded systems — primarily Azure Cosmos DB. Existing write-heavy workloads that can be horizontally partitioned are gradually migrated out. This policy doesn't fix the current architecture; it fixes the future architecture. Over time, the Postgres primary handles a smaller and smaller share of writes while remaining the canonical store for core user and conversation data. The single-primary architecture is not defended forever — it's being gracefully phased toward a hybrid model.
REVIEW ORM-GENERATED SQL IN PRODUCTION
OpenAI's most actionable advice: add ORM-generated SQL review to your production deployment process. ORM frameworks are brilliant for development velocity. They are silent performance landmines at scale. A query that joins 12 tables, a query that does a full table scan on an unindexed column, a query that triggers N+1 loads — none of these are visible in code review because the ORM generates them at runtime. OpenAI now requires that SQL generated by ORM frameworks for high-traffic tables be logged, analyzed with EXPLAIN ANALYZE at peak load, and reviewed by a database engineer before the code ships. This practice is cheap. Not having it costs SEV-0s.
✅
Lazy Writes: Smoothing Write Spikes
OpenAI introduced lazy writes for certain workloads — deferring non-critical writes instead of executing them immediately. For example, updating a user's last-seen timestamp or incrementing a view counter doesn't need to hit the database synchronously with every request. Batching these writes and flushing them periodically smooths write traffic from a spiky real-time pattern to a steadier background pattern. Lazy writes reduced write load on the primary meaningfully without any change to user-visible behavior.
✅
Covering Indexes: The Query Surgery Tool
Beyond eliminating bad ORM queries, OpenAI invested heavily in covering indexes — indexes that contain all columns needed by a query, so Postgres can answer it from the index alone without reading table rows. A covering index on a high-frequency query can reduce query cost from a sequential scan of billions of rows to a few hundred index lookups. OpenAI's database engineers regularly audit slow query logs and apply targeted index improvements, particularly after any traffic increase reveals latent query inefficiencies that weren't visible at lower load.
ℹ️
Feature Traffic Isolation
OpenAI isolates low-priority features from critical traffic paths. If a secondary feature — say, a background data analysis job — starts behaving poorly and consuming database resources, it should not degrade ChatGPT's core conversational experience. This isolation is implemented through separate connection pools for different traffic classes , Kubernetes resource quotas for background workloads, and rate limiting that gives core product queries priority access to database capacity. The principle: a misbehaving low-priority feature should degrade itself, not the entire platform.
Architecture
OpenAI's Postgres architecture is simple at the macro level — one writer, many readers — but densely engineered at the micro level. The simplicity is intentional: every additional layer of infrastructure complexity is a potential failure mode. The dense engineering at the application and proxy layers is what makes the simple macro architecture viable at unprecedented scale. Understanding why this architecture works requires understanding both its strengths (read-heavy workload perfectly matched to replica fan-out) and its known limits (write spikes, ORM-generated queries, connection exhaustion).
OpenAI's PostgreSQL Architecture: Single Primary, Global Read Scale
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Multi-Layer Rate Limiting: Defense in Depth for Write Spikes
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
THE REPLICATION LAG TRADEOFF
Asynchronous replication to read replicas introduces a tradeoff: reads may return slightly stale data. For most ChatGPT operations — loading conversation history, displaying user settings, browsing the interface — a few hundred milliseconds of staleness is imperceptible and acceptable. For the small fraction of requests that require current data (a write followed immediately by a read-your-own-write pattern), OpenAI routes those reads to the primary. This explicit differentiation between 'reads that can tolerate lag' and 'reads that cannot' is a design discipline, not an accident — and it's what allows the read load to be distributed across 50 replicas.
⚠️
The Backfill Rate Limit: So Slow It Takes a Week
OpenAI enforces strict rate limits on database backfill operations — migrations that populate new columns or update existing rows across large tables. These rate limits are aggressive enough that a large backfill can take over a week to complete. This is deliberate: a fast backfill on a billion-row table would compete with live traffic for I/O, degrade query latency, and risk triggering the DDL timeout. Slow backfills are boring and invisible. Fast backfills cause incidents. OpenAI chose boring.
✅
The Five-Nines Achieved
OpenAI reports achieving 99.999% availability on their Postgres infrastructure — five nines, which means less than 5.26 minutes of downtime per year. This is achieved despite running a single primary, primarily because most customer traffic is read-only (served by replicas even during primary downtime), write failures during primary maintenance are brief (hot standby promotion in 30–60 seconds), and the defensive layers prevent the most common failure modes from escalating. Five nines on a single-primary setup requires more engineering discipline, not less, than achieving the same availability on a distributed system.
Lessons
OpenAI's PostgreSQL story is the strongest available evidence that conventional wisdom about 'you must shard at scale' is not a law — it's a heuristic that depends heavily on workload shape. The lessons here are about operational discipline, honest workload analysis, and knowing the limits of your architecture before they find you.
- 01. Analyze your workload before choosing your architecture. OpenAI's single-primary architecture works because ChatGPT is overwhelmingly read-heavy. A write-heavy workload at the same scale would fail with this architecture. The lesson is not 'use a single primary' — it's 'design for your actual access patterns, not for the scale number on the slide.'
- 02. Connection pooling (deploying a proxy like PgBouncer between application servers and PostgreSQL that multiplexes thousands of application connections into a smaller pool of database connections, reducing connection overhead and preventing connection exhaustion) is not optional at scale. At ChatGPT's traffic volume, hitting Postgres's 5,000-connection limit without pooling would have caused regular outages. PgBouncer turned a recurring incident cause into a non-issue. Deploy it before you need it.
- 03. Review ORM-generated SQL for high-traffic tables before shipping. A 12-table join that worked fine at 1x traffic caused multiple SEV-0s at 10x. ORMs are invisible query generators. Add explicit review of ORM-generated queries — EXPLAIN ANALYZE at production load levels — as a standard pre-deployment step for database-touching code.
- 04. Enforce schema change governance with hard timeouts. A DDL operation that holds a table lock for hours will cause an incident. OpenAI's 5-second DDL timeout automatically cancels any schema change that cannot acquire a lock quickly. This constraint forces engineers to use online DDL tools (pg_repack, zero-downtime column addition) rather than naive ALTER TABLE on large tables.
- 05. Plan the exit from your current architecture before you need it. OpenAI's 'no new tables in PostgreSQL' policy and ongoing write workload migration to Cosmos DB are the planned evolution of the current architecture. A single-primary Postgres at 800M users is viable today because write load is bounded. It's viable tomorrow because write-heavy workloads are being systematically migrated out. Know the limits of your current architecture and have a credible plan for crossing them.
ℹ️
The ImageGen Launch: The One That Got Through
In twelve months of operation with the fully hardened architecture, OpenAI had one SEV-0: the launch of ChatGPT ImageGen. Over 100 million new users signed up within a week, driving a >10x spike in write traffic — specifically new account creation and preference storage — that temporarily overwhelmed the primary's write capacity. The system recovered by design, but the event validated the 'no new tables in Postgres' policy. Write surges at viral launch scale are the known limit of single-primary architecture. The Cosmos DB migration is the known fix.
THE HYBRID MIGRATION STRATEGY
OpenAI's hybrid approach — keep Postgres for what it does well, migrate write-heavy workloads to Cosmos DB, enforce 'no new tables in Postgres' — is a template for any team running a successful legacy database under growth pressure. The alternative extremes (migrate everything at once, or never migrate anything) are both wrong. Incremental migration guided by workload characteristics is boring, slow, and correct. The discipline is in writing down the policy and enforcing it before the crisis arrives.
OpenAI runs ChatGPT for 800 million users on one Postgres instance and the most complex part of their database engineering is telling people not to use ORMs without reading the SQL they generate.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
Top comments (0)