TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

How Discord Migrated Trillions of Messages and Fired Their Garbage Collector

#webdev #programming #database #backend

177 → 72 nodes — cluster footprint after migration; 59% fewer nodes handling the same workload
40–125ms → 15ms — p99 read latency; unpredictable range became stable flat line
4+ trillion messages migrated in 9 days — vs original 3-month estimate
3.2M records/second — peak throughput of the Rust-rewritten migration tool
0 user-visible downtime during the migration
Weekly on-call node babysitting eliminated after go-live

It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. The system holding every message ever sent is becoming the thing everyone fears touching most.

The Story

Our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve.

— Bo Ingram, Senior Software Engineer, via Discord Engineering Blog

Discord launched in 2015 and by 2017 had migrated their message store to Apache Cassandra (a distributed wide-column NoSQL database designed for high availability across many nodes without a single point of failure). For a few years it held. By 2022, the promises had curdled into a maintenance nightmare. The cluster had grown to 177 nodes holding trillions of messages, and keeping it alive required the kind of vigilance that should be reserved for nuclear reactor operators, not chat app engineers.

The core problem was architectural. Cassandra is written in Java, and Java's garbage collector periodically halts all threads in the JVM to reclaim heap memory — JVM GC pauses (Java Virtual Machine garbage collection — periodic stop-the-world pauses where the JVM freezes all threads to reclaim memory). Under Discord's workloads, these pauses lasted long enough to cause cascading latency spikes, and in severe cases a node would fall out of the cluster entirely. An on-call engineer would manually reboot it and babysit it back to health. The p99 latency on historical message reads ranged between 40 and 125 milliseconds depending on whether compaction was running — an unpredictability that made SLO planning impossible.

The Hot Partition Problem

Discord's message data model organised messages by channel ID and time bucket. Cassandra performs writes cheaply by appending to a commit log and an in-memory memtable (an in-memory write buffer flushed to disk as SSTables when it fills up). Reads, however, must query the memtable and potentially multiple SSTables (immutable on-disk files that must all be merged on read to reconstruct the current value) — a dramatically more expensive operation. When a popular Discord server made a major announcement and thousands of users simultaneously opened their apps, every single read hammered the same partition. These hot partitions were Discord's most common and painful operational incident.

Problem

The Maintenance Spiral

By early 2022, Discord's on-call rotation was spending more time nursing Cassandra than building features. GC pause alerts fired multiple times a week. The p99 latency on reads ranged from 40ms to 125ms depending on whether compaction was running on the affected node — an unpredictability engineers had simply learned to live with.

Cause

JVM GC + Hot Partition Physics

The root cause split into two layers: JVM garbage collection (Java's memory management system that periodically pauses all threads to reclaim heap memory — in large heaps these pauses could last hundreds of milliseconds) on write-heavy nodes created latency cliffs, while Cassandra's read path — requiring merges across multiple SSTables — meant any popular channel partition would spike latency under concurrent user load.

Solution

ScyllaDB + Rust Data Services

Discord chose ScyllaDB, a Cassandra-compatible database rewritten in C++ with a shard-per-core architecture (each CPU core is assigned its own exclusive subset of data and handles requests independently, avoiding cross-core coordination and lock contention — and crucially, no garbage collector). They also built a Rust-based data services layer between the API and the database to absorb hot-partition spikes via request coalescing.

Result

9 Days, 4 Trillion Messages, Zero Users Noticed

The migration completed in nine days — the original estimate using ScyllaDB's Spark migrator had been three months. The cluster shrank from 177 nodes to 72. P99 latency for historical reads settled at a stable, predictable 15 milliseconds.

The Fix

The Rust Migrator Rewrite: From 3 Months to 9 Days

The turning point in the migration timeline was a one-day engineering sprint. ScyllaDB's off-the-shelf Spark migrator (an Apache Spark-based tool for bulk data migration that reads token ranges from Cassandra and writes them to ScyllaDB) estimated three months to move the message data. Bo Ingram decided that was three months too long. He and two colleagues rewrote the migrator in Rust in a single day — reading token ranges from a database, checkpointing locally via SQLite for crash recovery, and firing records into ScyllaDB as fast as possible. The result: 3.2 million records per second. The new estimate was nine days.

177 → 72 — Cassandra nodes replaced by ScyllaDB nodes; 59% reduction in cluster footprint
15ms — stable p99 read latency on ScyllaDB, down from an unpredictable 40–125ms range
9 days — total migration time for 4+ trillion messages vs original 3-month estimate
3.2M/s — peak migration throughput of the Rust-rewritten migrator

// Simplified version of Discord's request coalescing logic in the Rust data service
// Real implementation uses Tokio async runtime

use std::collections::HashMap;
use tokio::sync::broadcast;

struct CoalescingDataService {
    // Map from cache_key -> active broadcast sender
    // If a query is in flight, all subsequent requests for the same key
    // subscribe to it rather than issuing their own database query
    in_flight: HashMap<String, broadcast::Sender<Vec<Message>>>,
}

impl CoalescingDataService {
    async fn get_messages(
        &mut self,
        channel_id: u64,
        before_id: u64,
    ) -> Result<Vec<Message>, Error> {
        let key = format!("{}:{}", channel_id, before_id);

        if let Some(sender) = self.in_flight.get(&key) {
            // A query for this channel is already in flight
            // Subscribe and wait — NO second database round-trip
            // 1,000 simultaneous requests → 1 database query
            let mut rx = sender.subscribe();
            return Ok(vec![rx.recv().await?]);
        }

        // First request for this key — create the broadcast channel
        let (tx, _rx) = broadcast::channel(16);
        self.in_flight.insert(key.clone(), tx.clone());

        // Execute the SINGLE database query to ScyllaDB
        let results = self.scylladb.query_messages(channel_id, before_id).await?;

        // Broadcast to ALL waiting subscribers at once
        let _ = tx.send(results.clone());
        self.in_flight.remove(&key);

        Ok(results)
    }
}
// Consistent hashing routes all requests for a given channel_id to the same
// data service instance — maximising the chance that an in-flight coalescing
// task for that channel already exists when the next request arrives.

The Tombstone Trap at 99.9999%

After running the Rust migrator for days at 3.2 million records per second, the progress bar hit 99.9999% — and stopped. The migrator was timing out trying to read the last few token ranges (contiguous slices of the Cassandra hash ring assigned to a node) because they contained gigantic ranges of tombstones (deletion markers in Cassandra — when data is deleted, a tombstone is written rather than the row being removed, because immutable SSTables cannot be modified in-place; these tombstones must be read and skipped during every subsequent read until compaction removes them) that had never been compacted. Engineers manually triggered compaction on that token range. Seconds later, the migration hit 100%.

Before and after migration:

Metric	Cassandra (Before)	ScyllaDB (After)
Cluster Nodes	177	72
Disk per Node (avg)	4 TB	9 TB
p99 Read Latency	40–125ms (variable)	~15ms (stable)
GC Pauses	Frequent stop-the-world	None (C++, no GC)
Hot Partition Risk	High — no coalescing	Mitigated by Rust data services
On-Call Toil	Weekly node babysitting	Dramatically reduced

The validation strategy: shadow reads during migration

Discord ran automated correctness validation throughout the migration by sending a small percentage of reads to both databases simultaneously and comparing results. Only when reads matched across Cassandra and ScyllaDB was a partition considered successfully migrated. This shadow-read approach caught data inconsistencies without any user-visible impact, and gave the team confidence to flip the cutover as a single atomic event rather than a long, hedged transition. Any database migration touching production data should implement this pattern before flipping the final switch.

The SuperDisk: custom hardware for cloud durability

ScyllaDB is optimised for NVMe SSDs (Non-Volatile Memory Express solid-state drives — extremely fast local storage that dramatically reduces I/O latency) but in cloud environments NVMe is ephemeral — a node restart wipes the disk. Discord engineered a custom RAID 1 configuration they called the Superdisk: writes go to both fast local NVMe and slower persistent network-attached storage simultaneously; reads prefer the NVMe for speed. This gave them NVMe-level read performance with cloud-level data durability.

Why cassandra-messages was the last to move

By 2020 Discord had migrated every other database to ScyllaDB — the messages cluster was the lone holdout. They deliberately waited because it was the most critical dataset: trillions of messages, nearly 200 nodes, and the one cluster whose failure would be immediately visible to every user. They used the other migrations to tune ScyllaDB for their access patterns first, including filing and waiting on performance improvements to ScyllaDB's reverse query support. The long preparation was the reason the execution was clean.

Architecture

Before the migration, Discord's message read path ran through a monolithic API server directly into Cassandra. Every user action that read messages translated directly into a database query, with no protection against fan-out or hot partition amplification (when many users simultaneously request data stored in the same database partition, causing that node to receive far more traffic than its neighbours, creating latency spikes). Post-migration, a Rust data services layer sits between the API and ScyllaDB, coalescing concurrent reads for the same channel into single database queries via consistent hashing.

Before: Direct API-to-Cassandra Architecture (Hot Partition Risk)

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

After: Rust Data Services + ScyllaDB Architecture (Hot Partition Mitigated)

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Shard-Per-Core Architecture: Why ScyllaDB Handles Concurrency Differently

The fundamental reason ScyllaDB handles concurrent reads so much better than Cassandra is its shard-per-core architecture. Each CPU core is assigned its own exclusive slice of the data and handles all requests for that data without coordination with other cores. In Cassandra's JVM model, all threads compete for heap memory under a single garbage collector. In ScyllaDB's C++ model, each core is an independent actor: no cross-core locking, no GC, no stop-the-world. When one partition gets hot, it affects only the core assigned to that shard — it cannot cascade to neighbours.

Lessons

Migrate your riskiest system last, but commit to a deadline once operational confidence is achieved. Discord deliberately kept the messages database in Cassandra for two years after migrating everything else, using that time to build ScyllaDB expertise on less critical workloads. They then committed to a hard deadline — avoiding the trap of indefinite deferral that plagues many large migrations.
Request coalescing (combining multiple concurrent requests for identical data into a single database query, broadcasting the result to all waiters) is a force multiplier against hot partitions that no amount of database scaling alone can provide. When you have popular content that thousands of users read simultaneously, add a coalescing layer between your application and your database — the reduction in query fan-out is often more impactful than hardware upgrades.
Rewrite your migration tooling if the estimated duration is unacceptable. A three-month migration estimate is not a constraint — it's a scope definition that you can change. Discord's one-day Rust rewrite of the migrator turned a three-month project into nine days. Always ask: what would it take to make this ten times faster?
Stop-the-world GC pauses (periodic halts in JVM-based systems where all threads freeze while the garbage collector reclaims memory) are a predictable, structural problem in Java-based databases at high concurrency — not a tuning problem you can engineer your way out of at Discord's scale. When your on-call team spends more time maintaining a database than improving it, that's the signal to evaluate architecturally different alternatives, not just different JVM flags.
Run shadow reads for data validation before any large-scale cutover. Sending a percentage of reads to both old and new systems simultaneously — and comparing results automatically — gives you objective confidence that your migration is correct without user-visible risk. This pattern is applicable to any database migration and should be standard practice before any atomic cutover switch.

Engineering Glossary

Cassandra commit log — a sequential on-disk journal where writes are recorded before being applied to the in-memory memtable structure, enabling fast writes at the cost of read complexity. Not to be confused with a database transaction log.

Hot partition — a condition where many users simultaneously request data stored in the same database partition, causing that node to receive far more traffic than its neighbours. Creates latency spikes and potential instability. Discord's most common and painful operational incident before the Rust data services coalescing layer.

JVM GC pause (stop-the-world) — a periodic halt in JVM-based systems (including Cassandra) where all threads freeze while the garbage collector reclaims heap memory. In large heaps under high write load, these pauses can last hundreds of milliseconds — causing latency cliffs visible to users and, in extreme cases, node instability.

Memtable — an in-memory write buffer in Cassandra that is flushed to disk as immutable SSTables when it fills up. Writes are fast (append to memtable); reads are expensive (must merge memtable with all relevant SSTables to reconstruct current value).

Request coalescing — a pattern where the first incoming request for a piece of data triggers an active lookup, and all subsequent requests for the same data subscribe to that lookup's result rather than issuing their own query. Reduces N database hits to 1 for concurrent requests for identical data.

Shard-per-core architecture — ScyllaDB's design where each CPU core is assigned its own exclusive subset of data and handles all requests for that data independently, with no cross-core coordination or shared garbage collector. Enables high concurrency without the GC overhead that plagues JVM-based databases.

SSTable (Sorted String Table) — an immutable on-disk file in Cassandra or ScyllaDB that holds flushed memtable data. Immutable means they cannot be modified in place — deletes write tombstones instead. Reads must merge across all relevant SSTables, making read amplification a key cost driver.

Tombstone — a deletion marker in Cassandra/ScyllaDB. When data is deleted, a tombstone is written rather than the row being removed, because immutable SSTables cannot be modified in place. Tombstones must be read and skipped during every subsequent read until compaction removes them — causing performance problems when they accumulate in uncompacted token ranges.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

DEV Community