Alexis

OpenAI's testament to PostgreSQL reliability

As someone who has seen the "Postgres vs The World" wars more times than I’ve seen successful schema migrations, I find OpenAI's engineering blog on scaling PostgreSQL to be a fascinating study in Brute Force meets Elegance.

It is truly heartwarming to see a trillion-dollar AI powerhouse wrestling with the same VACUUM issues and connection storms that haunt a weekend Shopify plugin. Here's my analysis of their journey, contrasted with the "grass is greener" alternatives of the RDBMS world.

1. Cracks in the Initial Design ("Postgres is slow" phase)

The Challenge: OpenAI hit the classic wall: write amplification. In Postgres, every UPDATE is effectively an INSERT of a new row version (MVCC), which triggers index updates and bloat.

The OpenAI Fix: Scaling up (bigger instances) and scaling out (more read replicas), then eventually surrendering and moving write-heavy shards to Azure Cosmos DB.

How other engines compare:

MySQL/MariaDB: These would have smirked. MySQL’s InnoDB uses an "Update-in-Place" mechanism with an Undo Log. It doesn't copy the whole row for every change, making it inherently more efficient for write-heavy workloads. OpenAI might have avoided moving to Cosmos DB so early if they’d started with a storage engine that doesn't treat an update like a funeral for the old row.
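If you want to watch that write amplification happen on your own instance, the dead row versions MVCC leaves behind are already counted in pg_stat_user_tables. A quick sketch (the LIMIT is arbitrary):

```sql
-- Dead row versions (bloat in progress) per table, worst offenders first
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```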

2. Reducing Load on the Primary ("One Writer to Rule Them All")

The Challenge: Relying on a single primary for writes for 800 million users is… bold. One "write storm" from a new feature, and the whole house of cards wobbles.

The OpenAI Fix: Moving sharded workloads away and aggressively optimizing application-side writes.

How other engines compare:

TiDB: This and other NewSQL engines were built specifically to avoid this single-writer existential dread. They offer native horizontal write scaling. While OpenAI is busy manually sharding to Cosmos DB (a completely different API), a TiDB cluster would have just asked for more nodes and kept the SQL interface intact. Just ask Atlassian, Plaid, Bolt, Pinterest and many others.

3. Query Optimization & ORM Sins

The Challenge: A 12-table join generated by an ORM nearly brought down ChatGPT.

The OpenAI Fix: Moving join logic to the application layer and hunting down "idle in transaction" sessions.

How other engines compare:

PostgreSQL actually has one of the best query planners in existence. If PostgreSQL couldn't handle that 12-table join, MySQL likely would have just timed out and retired. However, MariaDB has some interesting hash join optimizations that can occasionally handle ORM-induced trauma better than standard nested loops.
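As for the "idle in transaction" hunt, at least that part needs no exotic tooling: pg_stat_activity already shows who is holding a snapshot (and locks) while doing nothing. Something like this, with an arbitrary one-minute threshold:

```sql
-- Sessions stuck "idle in transaction", oldest first
SELECT pid,
       usename,
       application_name,
       now() - xact_start AS xact_age,
       left(query, 80)    AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '1 minute'
ORDER BY xact_age DESC;
```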

4. Workload Isolation ("Noisy Neighbor")

The Challenge: A new feature launch can starve the API of resources.

The OpenAI Fix: Physically splitting high-priority and low-priority traffic onto different instances.

How other engines compare:

Oracle / SQL Server (The "Expensive" Engines): While we love open source, Oracle’s Resource Manager and SQL Server's Resource Governor are decades ahead here, allowing fine-grained CPU/IO caps within a single instance. In the open-source world, OpenAI's "separate instance" approach is the standard solution, though PostgreSQL's lack of native multi-tenancy resource controls makes this inevitable.
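For a sense of what that fine-grained control looks like, here is roughly what a SQL Server Resource Governor setup involves (pool, group, and login names are made up, and the real option list is longer than shown):

```sql
-- Cap a low-priority batch workload inside a single SQL Server instance
CREATE RESOURCE POOL batch_pool WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 30);
CREATE WORKLOAD GROUP batch_group USING batch_pool;
GO

-- Classifier routes sessions from the 'batch_user' login into the restricted group
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE WHEN SUSER_SNAME() = N'batch_user' THEN N'batch_group' ELSE N'default' END;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```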

There are a few open-source alternatives for resource governing (other than OpenAI's proxy layer), though:

Linux Control Groups (cgroups)

This is the most direct open-source equivalent, but it operates at the OS level rather than the SQL level.
You create cgroups (CPU shares, memory limits, I/O bandwidth) and assign specific database processes to them.
This works best for PostgreSQL because Postgres is process-based (one PID per connection). You can move a PID into a restricted cgroup.
However, it’s a manual nightmare to automate at scale. You end up writing custom watcher scripts to catch rogue queries and shove them into the "penalty box" cgroup.

MySQL 8.4 LTS & 9.x Resource Groups ("thread-pinning")

MySQL (the free one) actually introduced a native Resource Group feature to compete with the big boys.
You can define resource groups and assign them to specific threads. You can even use SQL hints like SELECT /*+ RESOURCE_GROUP(Batch_Group) */ ... to force a query into a restricted lane.
You can also assign thread priorities within a group (0 to 19 for user groups). These follow nice-style semantics, so lower numbers get more CPU face time under contention.
Alas, it currently only manages CPU/vCPU affinity. It doesn’t yet have the sophisticated IOPS or memory throttling found in SQL Server’s Resource Governor or Oracle's Resource Manager.
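For reference, the whole feature fits in a few statements; a minimal sketch (group name, vCPU range, and table are illustrative):

```sql
-- Pin batch work to two vCPUs at a low (nice-style) priority -- MySQL 8.0+
CREATE RESOURCE GROUP Batch_Group
  TYPE = USER
  VCPU = 2-3
  THREAD_PRIORITY = 15;

-- Route one statement into the restricted lane via an optimizer hint
SELECT /*+ RESOURCE_GROUP(Batch_Group) */ COUNT(*) FROM orders;

-- ...or move the current session's thread into it
SET RESOURCE GROUP Batch_Group;
```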

MariaDB 11.x Thread-Pooling

MariaDB has historically rolled its eyes at MySQL’s Resource Groups. Instead, they’ve perfected the Thread Pool.

Priority Queuing: Instead of pinning threads to CPUs, MariaDB’s Thread Pool (especially in the latest 11.x LTS releases) uses a sophisticated queuing system. You can set thread_pool_priority=high for specific connections.

The "LIFO" Trick: To prevent "clogging," MariaDB can prioritize existing transactions over new ones, ensuring that work already in progress finishes faster and releases resources—a "clean up your own mess" philosophy.

Resource Limits (the old school way): MariaDB still leans heavily on MAX_USER_CONNECTIONS, MAX_QUERIES_PER_HOUR, and MAX_UPDATES_PER_HOUR. It’s less "dynamic throttling" and more "hard ceiling". It still lacks a "CREATE RESOURCE GROUP" similar to MySQL's.
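Those old-school ceilings look like this (account name and numbers are illustrative):

```sql
-- Hard per-account ceilings -- MariaDB (and MySQL) classic resource limits
CREATE USER 'batch_user'@'%' IDENTIFIED BY 'change_me'
  WITH MAX_USER_CONNECTIONS 20
       MAX_QUERIES_PER_HOUR 10000
       MAX_UPDATES_PER_HOUR 1000;
```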

pg_resgroup (Greenplum / CloudNativePG)

While standard Postgres doesn't have native resource groups, its specialized cousin Greenplum (open-source) does.

Some extensions and operators for Kubernetes (like CloudNativePG) are trying to bridge this gap by using Kubernetes "Resource Quotas" to mimic governance.
In standard Postgres, the closest you get without a specialized fork is setting statement_timeout or using work_mem limits per user, which is a blunt instrument compared to a governor.
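For reference, those blunt instruments look like this (role name and values are illustrative):

```sql
-- Per-role guard rails in stock PostgreSQL -- crude, but better than nothing
ALTER ROLE reporting_user SET statement_timeout = '30s';
ALTER ROLE reporting_user SET work_mem = '64MB';
ALTER ROLE reporting_user SET idle_in_transaction_session_timeout = '60s';
```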

5. Connection Pooling (The PgBouncer Tax)

The Challenge: Postgres creates a process per connection. 5,000 connections and the kernel starts sweating.

The OpenAI Fix: Deploying fleets of PgBouncer in Kubernetes.

How other engines compare:

MySQL: Handles connections as threads, which is lighter, but still struggles at massive scale.

MariaDB: Specifically implemented a Thread Pool plugin years ago to solve exactly what OpenAI is doing with PgBouncer. Using MariaDB, they might have avoided the complexity of managing a middle-tier proxy layer entirely.
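Whichever engine you run, the first diagnostic is the same: count how many of your thousands of backends are actually doing work. In Postgres (v10+ for backend_type), that's one query:

```sql
-- How many client backends are working vs. just occupying a process slot?
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY connections DESC;
```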

6. Scaling Read Replicas (The WAL Avalanche)

The Challenge: The primary has to ship Write Ahead Logs (WAL) to 50+ replicas, eating its own bandwidth.

The OpenAI Fix: Cascading replication (replicas of replicas).

In this model, the Primary sends logs to level 1 replicas, which in turn act as masters for level 2 replicas.

This strategy protects the Primary's bandwidth. If you have 50 replicas, the Primary only talks to 3 or 4; those 4 handle the traffic for the rest.
However, every hop adds latency. By the time data reaches a level 2 replica, it might be hundreds of milliseconds (or seconds) behind the Primary. This makes "Read-Your-Own-Writes" nearly impossible.
Worse, if a level 1 replica dies, all its children become "orphans" and stop updating. You need a very smart automation layer (like Patroni) to re-home those orphans.
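Whatever topology you pick, you end up watching pg_stat_replication on every sending node to see how far each downstream has fallen behind. Roughly, run on the primary (PostgreSQL 10+):

```sql
-- Per-replica lag as seen by the sending node
SELECT application_name,
       state,
       replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
ORDER BY replay_lag_bytes DESC;
```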

Other options:

Amazon Aurora PostgreSQL

Aurora laughed at the "Cascading" problem and deleted it entirely.
Here, replicas do not have their own copy of the data. They all look at a single, massive, distributed storage volume.

Since there is no "Log Shipping" between instances, there is zero replication lag in the traditional sense. You can have 15 replicas, and they all see the data almost instantly (typically <10ms).
You don't need "Levels." You just add more readers to the cluster endpoint.
So, you gain incredible scaling but lose the ability to move that database back to a standard on-prem server without a full export/import.

Google Cloud AlloyDB

AlloyDB is Google's answer to Aurora, but it adds an "Intelligent Cache" layer.

AlloyDB uses a specialized "Log Processing Layer" that sits between the database and storage. It pre-digests the WALs so the replicas don't have to work as hard. Here's how:

AlloyDB splits the database into a Compute Layer (where SQL lives) and an Intelligent Storage Layer. The "pre-digesting" happens in the middle, handled by the Log Processing Service (LPS).

  • Step 1: The Primary ships "thin" logs: When you update a row, the primary node only sends the WAL record to the storage layer. It doesn't worry about updating data blocks on disk.
  • Step 2: The LPS takes over: The LPS is a fleet of regional workers that receive these WAL records. Their entire job is to materialize the logs. They take the stream of changes and apply them to the blocks in the distributed storage system asynchronously.
  • Step 3: Replicas stay "fresh": Because the storage itself is being updated by the LPS, the read replicas don't have to replay anything. When a replica needs a data block, it simply fetches the already-updated block from the storage layer.

It is designed specifically for "Analytical-Lite" queries on read replicas, often outperforming standard PostgreSQL by 4x–5x for complex scans.
But, like Aurora, it’s a "black box." You get the performance, but you pay the "Cloud Premium."

7. Schema Management

The Challenge: Table rewrites for column changes.

The OpenAI Fix: A strict 5-second timeout and forbidding any change that triggers a rewrite.

How other engines compare:

MySQL 8.0+: Introduced "Instant DDL," allowing users to add columns in milliseconds regardless of table size.

MariaDB introduced Instant Add Column in version 10.3 (back in 2018) and has been expanding it ever since.
In MariaDB, adding a column at the end of a table, changing a default value, or renaming a column is a metadata-only operation. It doesn't matter if your table has 10 rows or 10 billion; the change happens in milliseconds because the engine doesn't touch the existing data files.
By MariaDB 10.6 and 11.x, even more operations became "Instant," including dropping columns (which is handled by logically hiding them rather than physically deleting them immediately).
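In both engines you can even demand the cheap path explicitly, so the statement fails fast instead of silently falling back to a rewrite (table and column are made up):

```sql
-- Metadata-only column add -- MySQL 8.0+ / MariaDB 10.3+
ALTER TABLE events
  ADD COLUMN trace_id VARCHAR(64) NULL,
  ALGORITHM = INSTANT;
```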

Postgres is getting better at this (it no longer rewrites the table for many ALTER TABLE operations), but OpenAI’s trauma suggests they are still haunted by the ghosts of Postgres versions past.
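The modern Postgres version of that discipline looks something like this rough sketch of the pattern OpenAI describes (names made up):

```sql
-- Give up after 5 seconds rather than queueing behind production traffic
SET lock_timeout = '5s';
ALTER TABLE events ADD COLUMN trace_id text;  -- nullable add (or constant default on 11+) is metadata-only
RESET lock_timeout;
```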

Final thoughts

OpenAI's achievement is a testament to PostgreSQL's reliability. They have successfully pushed a general-purpose, single-primary database to do things it was never meant to do.

However, looking at their "Road Ahead," they are slowly reinventing Vitess or TiDB one architectural SEV at a time. They chose the "Devil they knew" (PostgreSQL) and are paying the "Engineering Tax" to keep it alive. It’s a masterclass in stability over novelty, even if that stability requires 50 replicas and a separate NoSQL database just to handle the overflow.
