TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

How Stripe Moves Petabytes Between Database Shards Without Stopping the Money

#database #stripe #systemdesign #scalability

Stripe · Databases · 17 May 2026

Stripe processed over $1 trillion in payment volume in 2023 while maintaining 99.999% uptime — five nines, fewer than 6 minutes of downtime all year. The infrastructure secret is a database platform called DocDB and a migration engine that moves petabytes of financial data between shards without any application knowing it happened.

$1T+ payment volume 2023
99.999% uptime achieved
5M database queries/sec
1.5 PB migrated in 2023
Thousands of shards managed
Zero-downtime migrations

The Story

$1T+ — Payment volume processed by Stripe in 2023 — making their database reliability requirements some of the most demanding in the industry
99.999% — Uptime achieved — five nines means less than 5.26 minutes of total downtime per year across all Stripe APIs and payment processing
5M QPS — Database queries per second sustained across Stripe's DocDB fleet — comparable to some of the largest databases in the world
1.5 PB — Data migrated between shards in 2023 alone using the Data Movement Platform — transparently to all applications

When Stripe launched in 2011, it chose MongoDB (a document-oriented NoSQL database that stores data as flexible JSON-like documents rather than rows in fixed relational schemas) because it was more developer-friendly than standard relational databases for a fast-moving startup. Over the next decade, as Stripe grew into a financial infrastructure company processing trillion-dollar payment volumes, the team built a layer on top of MongoDB called DocDB: a Database-as-a-Service (an abstraction layer that gives application developers a simple API for data access while hiding the complexity of sharding, replication, failover, and migrations). DocDB handles horizontal sharding (distributing documents across many independent database instances, or shards, based on a partition key, so that no single instance holds all the data or serves all the traffic) across thousands of shards, manages replication for high availability, and, crucially, enables the zero-downtime data migrations that let Stripe's database fleet scale continuously without ever taking payments offline.

The central innovation of DocDB is its Data Movement Platform, a system that can migrate chunks of data between shards while both the source and target shards continue serving live production traffic. This capability is essential to Stripe's operations. When a merchant grows rapidly and fills its shard, the shard needs to be split. When shards become underutilized as the fleet evolves, they can be consolidated. When a new MongoDB version is released, shards can be upgraded by fork-lifting: migrating the data to a new instance that already runs the target version, rather than performing a multi-step in-place upgrade through each intermediate version. All of these operations share one requirement: Stripe cannot stop accepting payments while they happen.

THE FIVE NINES CONSTRAINT

99.999% uptime means less than 5.26 minutes of downtime per year. For a payment processor, downtime is not just an SLA violation: it means merchants unable to complete sales, customers unable to pay, and real-time revenue loss for the businesses Stripe serves. Every database operation (migration, split, consolidation, upgrade) must happen transparently. The constraint is absolute: there is no maintenance window at Stripe's scale.
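
The 5.26-minute figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming an average year of 365.25 days (the function name is illustrative, not something from Stripe's tooling):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_budget_minutes(availability: float) -> float:
    """Return the yearly downtime allowance, in minutes, for an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, target in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(target):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why a single conventional maintenance window would exhaust a five-nines budget many times over.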

The Six-Step Migration Protocol

The Data Movement Platform executes every shard migration through a six-step protocol: (1) register the migration plan in the chunk metadata service (a central catalog that tracks which data chunks live on which shards — the source of truth for query routing across Stripe's fleet), (2) build indexes on the target shard before data arrives (avoiding the performance penalty of indexing after a large data load), (3) bulk-copy a snapshot of the chunk from source to target, (4) stream async replication to apply changes made on the source since the snapshot was taken, (5) perform correctness checks to verify data consistency, (6) switch traffic to the target and deregister the chunk from the source. Steps 3 and 4 were where Stripe hit unexpected engineering challenges — and where the most creative solutions emerged.

Problem

Shard Splits and Consolidations Required Downtime

Without the Data Movement Platform, scaling Stripe's database fleet required either accepting downtime during shard operations or building complex dual-write logic for every migration. As Stripe's fleet grew to thousands of shards, this was operationally unsustainable and created real risk for every migration event.

Cause

Financial Data Cannot Tolerate Inconsistency

Payment data has zero tolerance for consistency errors — a payment record that exists on the source shard but hasn't yet appeared on the target is a payment that could be double-charged, lost, or corrupted if traffic switches at the wrong moment. The six-step protocol was designed specifically to guarantee that by the time traffic switches, the target is exactly consistent with the source including all writes made during migration.

Solution

CDC Replication + Correctness Verification

Stripe solved the consistency problem with Change Data Capture (CDC): a process that continuously reads the source shard's MongoDB operation log (oplog) and streams every write to the target, keeping the target synchronized even as live traffic modifies the source. Once CDC replication has caught up to near real time, correctness checks compare source and target before traffic is switched. The switch itself is atomic from the application's perspective.

Result

1.5 Petabytes Moved in 2023 Transparently

In 2023 alone, Stripe migrated 1.5 petabytes of data between shards, consolidated thousands of databases through bin packing, and upgraded the entire MongoDB fleet — all with zero application downtime and no payment processing interruptions. 99.999% uptime was maintained throughout.

DocDB's ability to migrate data between shards in a consistent, granular and reliable way has made it significantly easier for Stripe to scale.

Jimmy Morzaria and Suraj Narkhede, via Stripe Engineering Blog, June 2024

⚠️

The Bulk Load Throughput Problem

Step 3 of the migration, bulk loading a snapshot of the chunk onto the target shard, hit a significant throughput limitation during testing. Stripe's engineering team tried batching writes and tuning DocDB engine parameters, but neither approach resolved the bottleneck. The root cause was a mismatch between the order in which the bulk loader wrote documents and the way the target shard stores them: DocDB arranges data in a B-tree, and inserts arriving in arbitrary key order scatter writes across it. According to the Stripe engineering post, the fix was a purpose-built bulk import path that sorts the data by the most common index attributes and inserts it in sorted order, which improved write locality and raised bulk-load throughput by about 10x.

🗃️

Stripe manages thousands of DocDB shards — and periodically performs bin-packing consolidations where underutilized shards are merged to reduce operational overhead and hardware costs. In 2023 they reduced the total number of underlying DocDB shards by approximately three-quarters through such consolidation, migrating 1.5 petabytes of data in the process.

⬆️

The Fork-Lift Upgrade Strategy

Traditional in-place major version upgrades must pass through each intermediate release in sequence (for MongoDB, 4.0 to 4.2 to 4.4 to 5.0 to 6.0, for example), and each step requires careful validation. Stripe's Data Movement Platform enables a fork-lift strategy: provision a new shard running the target version, migrate the data to it, switch traffic, and decommission the old shard. A shard can move from its current version to the target version in a single migration step, which avoids the risk that accumulates across multi-step in-place upgrades.
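
To make the difference concrete, here is a small sketch that counts the hops each strategy needs. The release list and both function names are assumptions chosen for illustration; they do not describe Stripe's tooling.

```python
# Release ordering assumed for this illustration only.
RELEASES = ["4.0", "4.2", "4.4", "5.0", "6.0"]


def in_place_hops(current: str, target: str) -> list[tuple[str, str]]:
    """An in-place upgrade moves one release at a time, so list every hop."""
    start, end = RELEASES.index(current), RELEASES.index(target)
    if start > end:
        raise ValueError("in-place downgrades are not modelled here")
    return [(RELEASES[i], RELEASES[i + 1]) for i in range(start, end)]


def fork_lift_hops(current: str, target: str) -> list[tuple[str, str]]:
    """A fork-lift migrates data straight onto a shard already running the target."""
    return [] if current == target else [(current, target)]


print(len(in_place_hops("4.0", "6.0")), "in-place hops")
print(len(fork_lift_hops("4.0", "6.0")), "fork-lift hop")
```

Every in-place hop is a separate validation and rollback point on a live shard, while the fork-lift path reuses the same migration machinery that shard splits and consolidations already exercise.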

ℹ️

DocDB: Not a Rewrite, an Extension

A key decision in Stripe's database evolution was building DocDB on top of MongoDB Community rather than replacing MongoDB with a different database. This preserved compatibility with existing application code, the existing data model, and years of operational knowledge. The extensions — sharding, proxy routing, migration tooling — were added as a platform layer, not a fork. This pragmatic approach to building on existing foundations rather than starting from scratch is characteristic of Stripe's infrastructure philosophy.

The Fix

DocDB Architecture: The Database-as-a-Service Abstraction

DocDB's architecture is a three-tier system sitting between Stripe's application code and raw MongoDB instances. The Database Proxy is the entry point for all application read/write requests — it performs access control checks, validates queries, and routes requests to the correct shard by consulting the chunk metadata service. The Chunk Metadata Service maintains the authoritative map of which data chunks live on which shards. The Database Shards are replicated MongoDB instances that store the actual data. Applications talk only to the proxy; they are completely unaware of sharding, shard splits, or migrations in progress.
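
The routing relationship between the proxy and the chunk metadata service can be sketched in a few lines. This is a simplified model under stated assumptions: chunks are contiguous string key ranges, and the class and method names (`ChunkMetadataService`, `add_chunk`, `shard_for`, `DatabaseProxy.route`) are hypothetical rather than Stripe's actual interfaces.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ChunkMetadataService:
    """Maps key ranges [start, next_start) to the shard that currently owns them."""
    _starts: list[str] = field(default_factory=list)
    _shards: list[str] = field(default_factory=list)

    def add_chunk(self, start_key: str, shard: str) -> None:
        i = bisect.bisect_left(self._starts, start_key)
        if i < len(self._starts) and self._starts[i] == start_key:
            raise ValueError(f"chunk starting at {start_key!r} already registered")
        self._starts.insert(i, start_key)
        self._shards.insert(i, shard)

    def set_active_shard(self, start_key: str, shard: str) -> None:
        i = bisect.bisect_left(self._starts, start_key)
        if i == len(self._starts) or self._starts[i] != start_key:
            raise KeyError(start_key)
        self._shards[i] = shard  # one assignment: the routing change is all-or-nothing

    def shard_for(self, key: str) -> str:
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            raise KeyError(f"no chunk covers {key!r}")
        return self._shards[i]


class DatabaseProxy:
    """Applications call the proxy; only the proxy knows about shards."""

    def __init__(self, metadata: ChunkMetadataService) -> None:
        self._metadata = metadata

    def route(self, key: str) -> str:
        return self._metadata.shard_for(key)


metadata = ChunkMetadataService()
metadata.add_chunk("", "shard-a")    # keys below "m"
metadata.add_chunk("m", "shard-b")   # keys from "m" upward
proxy = DatabaseProxy(metadata)

before = proxy.route("merchant_42")
metadata.set_active_shard("m", "shard-c")  # the "m" chunk finishes migrating
after = proxy.route("merchant_42")
print(before, "->", after)
```

The caller uses the same key before and after the migration; only the metadata entry changes, which is the property that keeps applications unaware of shard topology.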

# Simplified 6-step Data Movement Platform migration flow
# Each step is atomic and resumable — migrations can be paused and continued

class DataMovementPlatform:
    def migrate_chunk(self, chunk_id: str, source_shard: str, target_shard: str):
        # Step 1: Register migration plan — makes migration visible to monitoring
        self.chunk_metadata.register_migration(
            chunk_id=chunk_id,
            source=source_shard,
            target=target_shard
        )

        # Step 2: Pre-build indexes on target BEFORE data arrives
        # Avoids the performance penalty of indexing a large loaded dataset
        self.build_indexes_on_target(target_shard, chunk_id)

        # Step 3: Bulk copy snapshot at time T
        # Uses purpose-built I/O patterns for high-throughput sequential writes
        snapshot_timestamp = self.bulk_copy_snapshot(chunk_id, source_shard, target_shard)

        # Step 4: Stream CDC replication — catch up all writes since snapshot
        # Reads MongoDB oplog on source; applies to target until near-real-time
        self.cdc_replicate_to_target(
            source_shard, target_shard, since=snapshot_timestamp
        )

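        # Step 5: Correctness verification: compare source and target
        # Financial data requires full consistency before any traffic switch.
        # Raise explicitly: a bare assert is skipped when Python runs with -O.
        if not self.verify_consistency(chunk_id, source_shard, target_shard):
            raise RuntimeError(f"chunk {chunk_id}: source/target mismatch, aborting switch")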

        # Step 6: Atomic traffic switch — update chunk metadata, switch routing
        self.chunk_metadata.set_active_shard(chunk_id, target_shard)
        # Applications querying the chunk now get routed to target
        # Deregister from source after confirmation
        self.chunk_metadata.deregister_from_source(chunk_id, source_shard)
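
The comment at the top of the flow says each step is resumable. One way to get that property, sketched here under the assumption that every step is idempotent and that progress is persisted somewhere durable (a plain dict stands in for that store), is to checkpoint after each completed step. The step names and `run_migration` are hypothetical.

```python
STEPS = ["register", "build_indexes", "bulk_copy", "cdc_replicate", "verify", "switch_traffic"]


def run_migration(handlers: dict, checkpoint: dict) -> None:
    """Run the remaining steps in order, recording progress after each one."""
    start = checkpoint.get("completed", 0)
    for index in range(start, len(STEPS)):
        handlers[STEPS[index]]()             # if this raises, the checkpoint is unchanged
        checkpoint["completed"] = index + 1  # a real system would persist this durably


# Demo: the verify step fails once, then the migration resumes from that step.
calls = []
attempts = {"verify": 0}


def make_step(name):
    return lambda: calls.append(name)


def flaky_verify():
    attempts["verify"] += 1
    if attempts["verify"] == 1:
        raise RuntimeError("replication lag too high, retry later")
    calls.append("verify")


handlers = {name: make_step(name) for name in STEPS}
handlers["verify"] = flaky_verify

checkpoint = {}
try:
    run_migration(handlers, checkpoint)
except RuntimeError:
    pass  # the operator or scheduler retries later

run_migration(handlers, checkpoint)  # resumes at "verify", not at "register"
print(calls)
```

The expensive bulk copy is not repeated after the failed verification, which matters when a single chunk can hold a large amount of data.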

BIN-PACKING: REDUCING THE FLEET BY 75%

In 2023, Stripe used the Data Movement Platform to bin-pack thousands of underutilized shards into a smaller number of larger shards. Bin-packing is the reverse of splitting: instead of one shard becoming two, many small shards are consolidated into fewer shards with more data. This reduced the total number of DocDB shards by approximately 75% while moving 1.5 petabytes — dramatically reducing operational overhead and hardware costs without any application code changes.
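
The source does not say which packing algorithm Stripe uses. As an illustration of the idea only, here is first-fit decreasing, a common bin-packing heuristic, applied to made-up shard sizes and a made-up target capacity.

```python
def first_fit_decreasing(shard_sizes_gb: dict[str, float], capacity_gb: float) -> list[list[str]]:
    """Group source shards into target shards without exceeding capacity_gb each."""
    bins: list[list[str]] = []   # names of source shards placed on each target
    free: list[float] = []       # remaining capacity of each target
    for name, size in sorted(shard_sizes_gb.items(), key=lambda kv: kv[1], reverse=True):
        if size > capacity_gb:
            raise ValueError(f"{name} ({size} GB) exceeds target capacity")
        for i, remaining in enumerate(free):
            if size <= remaining:
                bins[i].append(name)
                free[i] -= size
                break
        else:  # no existing target had room, so open a new one
            bins.append([name])
            free.append(capacity_gb - size)
    return bins


# Hypothetical sizes: eight lightly used shards, 180 GB in total.
sizes = {"s1": 40, "s2": 10, "s3": 25, "s4": 30, "s5": 15, "s6": 20, "s7": 5, "s8": 35}
plan = first_fit_decreasing(sizes, capacity_gb=100)
for target_index, members in enumerate(plan):
    print(f"target-{target_index}: {members}")
```

With these invented numbers, eight shards pack into two. Each entry in the plan then becomes a set of chunk migrations executed by the same six-step protocol, so consolidation needs no special-case data path.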

ℹ️

Multitenant to Single-Tenant: Isolation on Demand

DocDB supports migrating a large merchant's data from a shared multitenant shard (multiple merchants on one shard) to a dedicated single-tenant shard (one merchant per shard). This is done transparently via the Data Movement Platform: the merchant's data is migrated to a dedicated shard, traffic routing is updated atomically, and the merchant gets dedicated resources without any downtime or visible change in behavior. This capability is increasingly important as Stripe's largest customers grow to the scale of Shopify, Amazon, and OpenAI.

✅

The Heat Management System: Next Chapter

At the time of the June 2024 blog post, Stripe was prototyping a heat management system that proactively balances data across shards based on real-time access patterns. Rather than waiting for a shard to become a bottleneck and then splitting it reactively, the heat management system would detect access pattern shifts and pre-emptively migrate hot data to shards with more capacity. Reactive sharding at Stripe's scale will eventually give way to predictive sharding.

Correctness verification (Step 5) is the most cautious part of the migration protocol, and deliberately so. After CDC replication has caught up, the platform compares point-in-time snapshots of the source and target shards, which the Stripe engineering post describes as a choice made to avoid loading the live shards. For financial data, even a single inconsistency at the traffic switch would be unacceptable: a payment that exists on the source but not on the target could be double-charged or lost if the switch happens before it replicates. The verification step is the safety gate that makes five-nines availability compatible with live shard migrations. The cost is time, since the verification window lengthens every migration, but that is the explicit price of correctness guarantees on financial data.
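
A minimal sketch of what such a comparison could look like, assuming both sides can be exported as dictionaries keyed by document id and that documents are JSON-serializable (real BSON documents carry types that would need their own encoding). The function names are hypothetical.

```python
import hashlib
import json


def digest(doc: dict) -> str:
    """Stable hash of a document: keys are sorted so field order does not matter."""
    payload = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def diff_snapshots(source: dict, target: dict) -> dict:
    """Compare two sets of records keyed by document id and report every difference."""
    shared = source.keys() & target.keys()
    return {
        "missing_on_target": sorted(source.keys() - target.keys()),
        "unexpected_on_target": sorted(target.keys() - source.keys()),
        "mismatched": sorted(k for k in shared if digest(source[k]) != digest(target[k])),
    }


def is_consistent(source: dict, target: dict) -> bool:
    """True only when no record is missing, extra, or different."""
    return not any(diff_snapshots(source, target).values())


source_snapshot = {
    "pay_1": {"amount": 1000, "currency": "usd"},
    "pay_2": {"amount": 250, "currency": "eur"},
}
target_snapshot = {
    "pay_1": {"currency": "usd", "amount": 1000},  # same content, different field order
}

report = diff_snapshots(source_snapshot, target_snapshot)
print(report)
```

Returning the full list of differences, rather than a bare boolean, gives an operator something to act on when the gate refuses to open.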

⚠️

The Bulk Load Throughput Engineering Challenge

During testing, Stripe found that standard MongoDB write patterns were too slow for bulk data loading during shard migrations. Batching writes and tuning engine parameters both failed to resolve the throughput bottleneck. The root cause: the standard write path is optimized for low-latency individual writes, not for high-throughput bulk loads. The engineering team built a custom write pattern for the bulk copy phase; the Stripe engineering post describes sorting data by the most common index attributes before insertion so that writes land close together in the underlying B-tree, trading generality for throughput.

THE OPLOG AND FINANCIAL CONSISTENCY

MongoDB's oplog (a capped collection that records write operations in order and drives replication within a replica set) is the technical foundation of CDC replication in DocDB. Every write to the source shard appears in the oplog in order. By replaying those entries on the target shard in sequence, the Data Movement Platform ensures that every write applied to the source during migration is also applied to the target, preserving the consistency of financial records. The oplog is not just a replication mechanism: it is the ordered history of every change to the source shard.
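
The replay idea can be shown with an in-memory model. This is a sketch under simplifying assumptions: the `OplogEntry` fields and the `replay` function are hypothetical and much simpler than MongoDB's real oplog format, and timestamps are plain integers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OplogEntry:
    ts: int                      # position in the log, strictly increasing
    op: str                      # "upsert" or "delete"
    doc_id: str
    doc: Optional[dict] = None   # full document for upserts


def replay(target: dict, oplog: list, since: int) -> int:
    """Apply entries newer than `since` to target in log order; return the last ts applied."""
    last = since
    for entry in sorted(oplog, key=lambda e: e.ts):
        if entry.ts <= since:
            continue  # already reflected in the snapshot
        if entry.op == "upsert":
            if entry.doc is None:
                raise ValueError(f"upsert at ts={entry.ts} has no document")
            target[entry.doc_id] = dict(entry.doc)
        elif entry.op == "delete":
            target.pop(entry.doc_id, None)
        else:
            raise ValueError(f"unknown op {entry.op!r}")
        last = entry.ts
    return last


oplog = [
    OplogEntry(1, "upsert", "pay_1", {"status": "pending"}),
    OplogEntry(2, "upsert", "pay_2", {"status": "pending"}),
    OplogEntry(3, "upsert", "pay_1", {"status": "succeeded"}),
    OplogEntry(4, "delete", "pay_2"),
    OplogEntry(5, "upsert", "pay_3", {"status": "pending"}),
]

# The bulk copy captured the source as of ts=2.
target = {"pay_1": {"status": "pending"}, "pay_2": {"status": "pending"}}
checkpoint = replay(target, oplog, since=2)
print(checkpoint, target)
```

Because each entry sets or removes a whole document, replaying from the returned checkpoint again changes nothing, so a restarted replicator cannot apply a write twice.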

Architecture

DocDB's architecture enforces a clean separation between application code and database topology. Applications at Stripe never connect directly to MongoDB instances. They connect to the Database Proxy, the single entry point that enforces access control and routes each request using the chunk metadata service. This indirection is what makes zero-downtime migrations possible: routing for a chunk can be switched atomically as its migration completes, and applications never see anything other than consistent data.

Diagram: DocDB Architecture: Three-Tier Database-as-a-Service (interactive version available on TechLogStack)

Data Movement Platform: Six-Step Migration Protocol