TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 18

OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works

#webdev #programming #database #backend

800M users — 1 primary PostgreSQL instance on Azure, ~50 read replicas globally
Millions of QPS — p99 latency held at low double-digit milliseconds
50ms → 5ms connection setup time after PgBouncer deployment — a 10× improvement
10× database load growth in a single year following ChatGPT's viral expansion
5 seconds — maximum DDL lock wait timeout; schema changes that can't acquire a lock in time are automatically cancelled
1 SEV-0 in twelve months — triggered by ImageGen launch write surge, resolved by design

ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimisation and ruthless operational discipline.

The Story

The conventional wisdom about database scaling at 800 million users is straightforward: you shard. You move to a distributed SQL system. You do not run a single primary PostgreSQL instance. OpenAI's ChatGPT does not follow this conventional wisdom. It runs on one Azure PostgreSQL Flexible Server that handles all writes — backed by approximately 50 read replicas spread across multiple regions. The system handles millions of queries per second at low double-digit millisecond p99 latency and has maintained five-nines availability. In twelve months, they had one SEV-0.

The story is not that Postgres is magic. The story is that relentless optimisation of a boring, proven technology can outperform premature architectural complexity.

Why Single-Primary Works at This Scale

ChatGPT's workload is overwhelmingly read-heavy. When 800 million users open the app, browse their chat history, or load their settings, those are reads. Writes happen on message submission and account updates — a much smaller fraction of total traffic. This access pattern is exactly what a single-primary with many read replicas handles well: the write path stays narrow, the read load fans out horizontally across replicas. The architecture is not brilliant. It is appropriate for the workload. That fit is what makes it work.

OpenAI's blog published at PGConf.dev 2025 was unusually candid about both the decisions that worked and the ones that nearly broke the system. The database load grew by more than 10× in a single year following ChatGPT's viral growth. The team responded with aggressive optimisation at every layer: connection management, query design, caching, write path discipline, and schema change governance.

Problem

10× Database Load Growth in One Year

ChatGPT's viral growth — 100 million users in two months at launch, 800 million by 2025 — drove database load up more than 10× in a single year. Connection exhaustion became a recurring threat. A 12-table ORM (Object-Relational Mapping — a framework layer like Django or SQLAlchemy that automatically generates SQL from application code, abstracting away the database — convenient but capable of generating complex, inefficient queries invisible until they cause production incidents) generated join was causing multiple high-severity incidents when traffic spiked. Write pressure on the single primary was approaching dangerous levels during high-demand events.

Cause

Invisible Query Complexity and Write Pressure

ORMs generate SQL automatically, hiding complexity from developers. Under low load, even a 12-table join is fast enough to not notice. Under 10× load, the same query saturates database CPU. Meanwhile, write-heavy workloads that could be migrated to sharded systems like Azure Cosmos DB remained on the single primary longer than optimal.

Solution

Multi-Layer Defence: Pool + Cache + Rate Limit + Migrate

OpenAI implemented PgBouncer connection pooling (cutting connect time 10×), a cache-locking mechanism to prevent thundering herd on cache misses, multi-layer rate limiting at application, proxy, and query levels, surgical elimination of the worst ORM-generated queries, strict schema change governance (5-second DDL timeout), and a policy of migrating all new write-heavy workloads to sharded systems by default.

Result

One SEV-0 in Twelve Months, Five-Nines Availability

One SEV-0 in twelve months — triggered by the viral launch of ChatGPT ImageGen, which caused a 10× write surge as over 100 million users signed up within a week. Postgres recovered by design. p99 latency held at low double-digit milliseconds. The single-primary architecture remained viable at a scale that surprised the entire database engineering community.

The Fix

The Seven-Layer Defence

OpenAI's Postgres scaling is not one clever trick — it is seven mutually reinforcing operational practices applied simultaneously. Any one in isolation would help marginally. Together they have produced an architecture that handles a scale its underlying technology was not originally designed for.

10× — database load growth in a single year; the growth rate that forced each defensive layer to be implemented under production pressure
5ms — average connection setup time after PgBouncer deployment; down from 50ms, a 10× improvement
5 sec — maximum DDL lock wait timeout; prevents table-lock incidents on billion-row tables
1 SEV-0 — high-severity incidents in twelve months after the full defensive architecture was deployed

# Cache-locking pattern: prevents thundering herd on cache misses
# When cache expires, only ONE request repopulates it — others wait for the result

import threading

_cache = {}
_locks = {}
_lock_mutex = threading.Lock()

def get_with_cache_lock(key: str, fetch_fn, ttl_seconds: int):
    """
    On cache hit: return immediately.
    On cache miss: one thread fetches from DB; all others wait for the result.
    Prevents N simultaneous DB queries for the same expired cache key.
    """
    # Fast path: cache hit
    if key in _cache:
        return _cache[key]

    # Slow path: cache miss — acquire per-key lock
    with _lock_mutex:
        if key not in _locks:
            _locks[key] = threading.Event()
            should_fetch = True
        else:
            should_fetch = False
            event = _locks[key]

    if should_fetch:
        try:
            value = fetch_fn()       # ONE database query, not N
            _cache[key] = value      # populate cache for all waiters
            return value
        finally:
            with _lock_mutex:
                event = _locks.pop(key)
            event.set()              # wake all waiting threads
    else:
        event.wait(timeout=5)        # wait for the fetching thread
        return _cache.get(key)       # return from cache

ORM Query Review: The Most Actionable Lesson

OpenAI discovered a single ORM-generated query joining 12 tables. Under normal load it was acceptable. Under traffic spikes it saturated the primary's CPU and caused multiple SEV-0s. The query had been auto-generated by the ORM framework and never explicitly reviewed. OpenAI now requires that all ORM-generated queries against high-traffic tables be analysed with EXPLAIN ANALYZE before deployment. This practice is cheap. Not having it costs SEV-0s.

OpenAI's schema change governance is one of the most operationally distinctive aspects of their Postgres setup. They enforce a strict rule: schema changes that trigger a full table rewrite are prohibited in production. Postgres's MVCC (Multi-Version Concurrency Control — Postgres's mechanism allowing readers and writers to operate concurrently without blocking each other, at the cost of retaining multiple versions of each row and requiring periodic vacuum to reclaim space) model means that operations like ALTER TABLE ADD COLUMN DEFAULT on large tables can hold an exclusive lock for hours while rewriting billions of rows. All DDL operations have a 5-second timeout: if the schema change cannot acquire a lock within 5 seconds, it is cancelled automatically.

The seven defensive layers:

Connection pooling — PgBouncer in statement mode; 50ms → 5ms connect time; eliminates connection exhaustion
Thundering herd prevention — cache-locking on cache misses; only one thread fetches from DB; all others wait
Multi-layer rate limiting — at application, proxy, and query levels; core product queries get priority
Hot standby failover — continuously synchronised replica; primary promotion in 30–60 seconds
Write offloading — all new write-heavy workloads default to Cosmos DB; no new tables in Postgres
Query surgery — ORM-generated SQL reviewed with EXPLAIN ANALYZE; covering indexes on hot query paths
DDL governance — 5-second lock timeout; backfills rate-limited so aggressively they can take a week

The Cosmos DB migration policy: fixing the future architecture

OpenAI's most forward-looking operational decision is a standing policy: no new tables are created in PostgreSQL. All new workloads default to sharded systems — primarily Azure Cosmos DB. Existing write-heavy workloads that can be horizontally partitioned are gradually migrated out. This policy doesn't fix the current architecture; it fixes the future architecture. Over time, the Postgres primary handles a smaller and smaller share of writes while remaining the canonical store for core user and conversation data. The single-primary architecture is not defended forever — it is being gracefully phased toward a hybrid model.

Idle transaction timeouts: the quiet killer

OpenAI identified a subtle but devastating Postgres pattern at scale: idle transactions. When application code opens a database connection, starts a transaction, does unrelated work (calling an external API, waiting for user input), and only then commits — the transaction holds locks for the entire duration. At ChatGPT's scale, applications holding open transactions for seconds can block vacuum, block DDL, and degrade query performance for all other connections. OpenAI enforces strict idle_in_transaction_session_timeout settings — any connection idle inside a transaction for more than a few seconds is automatically terminated. This breaks poorly-written code immediately in staging rather than causing incidents in production.

The backfill rate limit: so slow it takes a week

OpenAI enforces strict rate limits on database backfill operations — migrations that populate new columns or update existing rows across large tables. These rate limits are aggressive enough that a large backfill can take over a week to complete. This is deliberate: a fast backfill on a billion-row table would compete with live traffic for I/O, degrade query latency, and risk triggering the DDL timeout. Slow backfills are boring and invisible. Fast backfills cause incidents. OpenAI chose boring.

Architecture

OpenAI's Postgres architecture is simple at the macro level — one writer, many readers — but densely engineered at the micro level. The simplicity is intentional: every additional layer of infrastructure complexity is a potential failure mode. The dense engineering at the application and proxy layers is what makes the simple macro architecture viable at unprecedented scale.

OpenAI's PostgreSQL Architecture: Single Primary, Global Read Scale

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Multi-Layer Rate Limiting: Defence in Depth for Write Spikes

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

The Replication Lag Tradeoff

Asynchronous replication to read replicas introduces a tradeoff: reads may return slightly stale data. For most ChatGPT operations — loading conversation history, displaying user settings, browsing the interface — a few hundred milliseconds of staleness is imperceptible and acceptable. For the small fraction of requests that require current data (write followed immediately by read-your-own-write), OpenAI routes those reads to the primary. This explicit differentiation between "reads that can tolerate lag" and "reads that cannot" is a design discipline that allows read load to be distributed across 50 replicas — not an accident.

Lessons

Analyse your workload before choosing your architecture. OpenAI's single-primary architecture works because ChatGPT is overwhelmingly read-heavy. A write-heavy workload at the same scale would fail with this architecture. The lesson is not "use a single primary" — it's "design for your actual access patterns, not for the scale number on the slide."
Connection pooling (deploying a proxy like PgBouncer between application servers and PostgreSQL that multiplexes thousands of application connections into a smaller pool of database connections) is not optional at scale. At ChatGPT's traffic volume, hitting Postgres's 5,000-connection limit without pooling would have caused regular outages. PgBouncer turned a recurring incident cause into a non-issue. Deploy it before you need it.
Review ORM-generated SQL for high-traffic tables before shipping. A 12-table join that worked fine at 1× traffic caused multiple SEV-0s at 10×. ORMs are invisible query generators. Add explicit review of ORM-generated queries — EXPLAIN ANALYZE at production load levels — as a standard pre-deployment step for database-touching code.
Enforce schema change governance with hard timeouts. A DDL operation that holds a table lock for hours will cause an incident. OpenAI's 5-second DDL timeout automatically cancels any schema change that cannot acquire a lock quickly. This constraint forces engineers to use online DDL tools (pg_repack, zero-downtime column addition) rather than naive ALTER TABLE on large tables.
Plan the exit from your current architecture before you need it. OpenAI's "no new tables in PostgreSQL" policy and ongoing write workload migration to Cosmos DB are the planned evolution of the current architecture. A single-primary Postgres at 800M users is viable today because write load is bounded. It's viable tomorrow because write-heavy workloads are being systematically migrated out. Know the limits of your current architecture and have a credible plan for crossing them.

Engineering Glossary

Covering index — a database index that contains all columns needed by a query, allowing Postgres to answer it from the index alone without reading table rows. Can reduce query cost from a sequential scan of billions of rows to a few hundred index lookups on high-frequency query paths.

DDL (Data Definition Language) — SQL statements that change database structure, such as ALTER TABLE and CREATE INDEX. DDL operations in Postgres can acquire exclusive locks that block all reads and writes on a table while they execute — catastrophic at scale. OpenAI enforces a 5-second DDL timeout and prohibits operations that require full table rewrites.

Hot standby — a continuously synchronised replica specifically designated as the failover target. When the primary goes down, the hot standby can be promoted to primary with ~30–60 seconds of downtime. During a primary failure, read traffic on other replicas is unaffected.

Idle transaction timeout — a database session setting that automatically terminates connections that are idle inside an open transaction for longer than a configured period. Prevents applications that open transactions and do unrelated work from holding locks indefinitely and blocking vacuum and DDL.

MVCC (Multi-Version Concurrency Control) — Postgres's mechanism allowing readers and writers to operate concurrently without blocking each other, at the cost of retaining multiple versions of each row. Requires periodic vacuum to reclaim space from deleted and updated rows.

ORM (Object-Relational Mapping) — a framework layer like Django or SQLAlchemy that automatically generates SQL from application code. Convenient for development velocity; capable of generating complex, inefficient queries (like a 12-table join) that are invisible in code review because they are generated at runtime.

PgBouncer — a lightweight connection pooler for PostgreSQL that multiplexes many application connections into a smaller pool of real database connections. Reduces connection overhead and prevents connection exhaustion. Reduced OpenAI's average connection setup time from 50ms to 5ms.

Thundering herd — a failure pattern where many concurrent requests simultaneously attempt to repopulate an expired cache entry, each independently querying the database and generating N identical queries. Prevented by a cache-locking mechanism where only one thread fetches while all others wait for the result.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

DEV Community