Discord · Databases · 17 May 2026
It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. The system holding every message ever sent is becoming the thing everyone fears touching most.
- 177 → 72 nodes
- p99 latency 15ms (was 40–125ms)
- 9-day migration (was 3-month est.)
- 3.2M records/sec migrated
- 4T+ messages moved
- Zero user-visible downtime
The Story
"Our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve."
Bo Ingram, Senior Software Engineer, via Discord Engineering Blog
Discord launched in 2015 with a mission to build the best voice and text chat platform for gamers. By 2017 they had outgrown MongoDB and migrated their entire message store to Apache Cassandra (a distributed wide-column NoSQL database designed for high availability across many nodes without a single point of failure). Cassandra's promise was compelling: write anywhere, replicate everywhere, scale horizontally forever. For a few years it held. By 2022, however, the promises had curdled into a maintenance nightmare that consumed engineering cycles every single week. The database cluster had grown to 177 nodes holding trillions of messages, and keeping it alive required the kind of expertise and vigilance that should be reserved for nuclear reactor operators, not chat app engineers.
🔥
At peak, Discord's Cassandra cluster required engineers to manually reboot individual nodes after JVM GC pauses (Java Virtual Machine garbage collection — periodic stop-the-world pauses where the JVM freezes all threads to reclaim memory) spiraled long enough to drop the node from the cluster. This was not a rare emergency — it was routine on-call work.
The core problem was architectural. Cassandra is written in Java, and Java's garbage collector periodically halts all threads in the JVM to reclaim heap memory — a moment engineers call a stop-the-world pause. Under Discord's workloads, these pauses could last long enough to cause cascading latency spikes visible to users, and in severe cases, the JVM's consecutive GC pauses got so bad that a node would effectively fall out of the cluster entirely. An on-call engineer would then have to manually reboot it and babysit it back to health. The p99 latency on historical message reads ranged between 40 and 125 milliseconds depending on whether compaction was running — an unpredictability that made SLO planning impossible. Every time someone tried to improve the cluster rather than merely maintain it, they risked triggering a cascade.
The Hot Partition Problem
Discord's message data model organized messages by channel ID and a fixed time window called a bucket (a fixed time slice, e.g. 10 days, used as part of the partition key so that a channel's messages are spread across multiple Cassandra partitions rather than stored in a single one). This was efficient for write distribution and replication, but created a painful read problem. Cassandra performs writes cheaply by appending to a commit log (a sequential on-disk journal where writes are recorded before being applied to the in-memory structure, enabling fast writes at the cost of read complexity) and an in-memory structure called a memtable (an in-memory write buffer in Cassandra that is flushed to disk as SSTables when it fills up). Reads, however, must query the memtable and potentially multiple SSTables (Sorted String Tables: immutable on-disk files in Cassandra that hold flushed memtable data, which must all be merged on read to reconstruct the current value), a dramatically more expensive operation. When a popular Discord server made a major announcement and thousands of users simultaneously opened their apps to read it, every single one of those reads would hammer the same partition. The team called these hot partitions, and they were Discord's most common and painful operational incident.
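To make the bucketing idea concrete, here is a minimal sketch of how a partition key of (channel ID, bucket) could be derived from a message ID. It assumes message IDs are Discord snowflakes whose upper bits hold milliseconds since the Discord epoch, as described in Discord's public API documentation, and it uses the 10-day example window from the text. The function names are hypothetical, not taken from Discord's code.

```rust
/// Discord epoch (2015-01-01T00:00:00Z) in Unix milliseconds.
const DISCORD_EPOCH_MS: u64 = 1_420_070_400_000;

/// Bucket width: 10 days in milliseconds (the example size used in the text).
const BUCKET_SIZE_MS: u64 = 10 * 24 * 60 * 60 * 1000;

/// Build a snowflake-style ID from a Unix timestamp (helper for illustration;
/// the low 22 bits, normally worker/process/sequence, are left as zero).
fn snowflake_from_unix_ms(unix_ms: u64) -> u64 {
    (unix_ms - DISCORD_EPOCH_MS) << 22
}

/// Milliseconds since the Discord epoch, read from the upper bits of the ID.
fn snowflake_timestamp_ms(message_id: u64) -> u64 {
    message_id >> 22
}

/// Which fixed time window a message falls into.
fn bucket_for(message_id: u64) -> u64 {
    snowflake_timestamp_ms(message_id) / BUCKET_SIZE_MS
}

/// The partition key: all messages in one channel and one bucket share a partition.
fn partition_key(channel_id: u64, message_id: u64) -> (u64, u64) {
    (channel_id, bucket_for(message_id))
}
```

Two messages posted in the same channel within the same 10-day window map to the same key, which is why a burst of reads on a busy channel lands on one partition.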
Problem
The Maintenance Spiral
By early 2022, Discord's on-call rotation was spending more time nursing Cassandra than building features. GC pause alerts fired multiple times a week, and the p99 latency on reads ranged from 40ms to 125ms depending on whether compaction was running on the affected node — an unpredictability engineers had simply learned to live with.
Cause
JVM GC + Hot Partition Physics
The root cause split into two layers: JVM garbage collection (Java's memory management system that periodically pauses all threads to reclaim heap memory — in large heaps, these pauses could last hundreds of milliseconds) on write-heavy nodes created latency cliffs, while Cassandra's read path — requiring merges across multiple SSTables — meant any popular channel partition would spike latency under concurrent user load. The combination made the cluster inherently unpredictable at scale.
Solution
ScyllaDB + Rust Data Services
Discord chose ScyllaDB, a Cassandra-compatible database rewritten in C++ with a shard-per-core architecture (a design where each CPU core is assigned its own exclusive subset of data and handles requests independently, avoiding cross-core coordination and lock contention). They also built a Rust-based data services layer between the API and the database to absorb hot-partition spikes via request coalescing. The migration tool was rewritten in Rust to achieve 3.2 million records per second transfer speed.
Result
9 Days, 4 Trillion Messages, Zero Users Noticed
The migration completed in nine days — the original estimate using ScyllaDB's Spark migrator had been three months. The cluster footprint shrank from 177 nodes to 72, each ScyllaDB node running with 9 TB of disk versus the average 4 TB on Cassandra. P99 latency for historical reads settled at a stable, predictable 15 milliseconds.
⚠️
Why cassandra-messages Was the Last to Move
By 2020 Discord had migrated every other database to ScyllaDB, leaving the messages cluster as the lone holdout. They deliberately left it for last because it was the most critical dataset: trillions of messages, nearly 200 nodes, and the one cluster whose failure would be immediately visible to every user. They used the other migrations to tune ScyllaDB for their access patterns first, including filing and waiting on performance improvements to ScyllaDB's reverse query support.
The Tombstone Trap at 99.9999%
The migration nearly ended in drama rather than triumph. After running the Rust migrator for days at 3.2 million records per second, the progress bar hit 99.9999% and stopped. The migrator was timing out trying to read the last few token ranges (in Cassandra, data is distributed across the ring by assigning each partition a hash token, and a token range is a contiguous slice of that ring assigned to a node) because they contained gigantic ranges of tombstones (deletion markers in Cassandra: when data is deleted, a tombstone is written instead of the row being removed, because immutable SSTables cannot be modified in place; these tombstones must be read and skipped during every subsequent read until compaction removes them) that had never been compacted away. Engineers had to manually trigger compaction on that token range; seconds later, the migration hit 100%. Automated data validation confirmed correctness by sending a sample of reads to both databases and comparing results. Discord switched to ScyllaDB in May 2022.
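The cost of uncompacted tombstones can be illustrated with a toy model. This is a deliberately simplified sketch, not Cassandra's storage engine: the `Cell` type and both functions are hypothetical, and real compaction only purges tombstones once they are old enough to be safely dropped.

```rust
/// One stored cell: either live data or a deletion marker.
#[derive(Clone, Debug, PartialEq)]
enum Cell {
    Live(String),
    Tombstone,
}

/// Scan a range: return the live values and how many tombstones had to be
/// read and skipped along the way.
fn scan(cells: &[Cell]) -> (Vec<&str>, usize) {
    let mut live = Vec::new();
    let mut skipped = 0;
    for cell in cells {
        match cell {
            Cell::Live(value) => live.push(value.as_str()),
            Cell::Tombstone => skipped += 1,
        }
    }
    (live, skipped)
}

/// Simplified compaction: rewrite the range without its tombstones.
fn compact(cells: Vec<Cell>) -> Vec<Cell> {
    cells.into_iter().filter(|c| *c != Cell::Tombstone).collect()
}
```

In this model a range holding one live row behind a long run of tombstones does work proportional to the tombstone count on every scan; after compaction the same scan touches only the live row.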
THE WORLD CUP TEST
The real stress test came months after go-live: the 2022 FIFA World Cup Final between Argentina and France. Every goal by Messi, every equalizer by Mbappé, every moment in the shootout created a massive spike of simultaneous message reads across Discord's biggest servers. Under the old Cassandra architecture this would have triggered hot-partition alerts and cascading latency. Under ScyllaDB with the Rust data services layer, the monitoring dashboards showed nothing unusual. The system held flat through 120 minutes of football and a penalty shootout.
The Rust data services layer, not the database choice alone, was the architectural insight that made ScyllaDB viable. When a popular server makes an announcement and thousands of users simultaneously open their clients, all those read requests arrive at the data service within milliseconds of each other, all asking for the same messages in the same channel. Without coalescing, each request would hit the database separately, creating a hot partition. With request coalescing (a pattern where the first incoming request for a piece of data triggers an active lookup, and all subsequent requests for the same data subscribe to that lookup's result rather than issuing their own query, reducing N database hits to 1), only one query goes to ScyllaDB; every subsequent request receives the answer when that single query returns. The data services layer also used consistent hashing (a ring-based routing scheme where each data service instance is responsible for a specific subset of channel IDs) to route requests for the same channel to the same service instance, maximizing coalescing opportunity.
The Fix
- 177→72 — Cassandra nodes replaced by ScyllaDB nodes — a 59% reduction in cluster footprint while handling the same workload
- 15ms — Stable p99 read latency on ScyllaDB, down from an unpredictable 40–125ms range on Cassandra depending on compaction status
- 9 days — Total migration time for 4+ trillion messages — versus the original 3-month estimate with ScyllaDB's Spark migrator
- 3.2M/s — Peak migration throughput of the Rust-rewritten migrator, unlocking a single-flip cutover instead of a complex time-based phased approach
The fix had three distinct components, and Discord was deliberate about not rushing any of them. First, they spent years migrating every other database to ScyllaDB to build operational expertise before touching the one cluster that mattered most. Second, they collaborated with the ScyllaDB team to improve reverse query performance, a blocker they hit in early testing, and waited until that was production-grade before proceeding. Third, they built the Rust data services layer before starting the migration, so the new database would go live already protected from hot-partition load patterns. This sequencing was the engineering discipline that made the migration look easy in retrospect.
The Rust Migrator Rewrite
The turning point in the migration timeline was a one-day engineering sprint. ScyllaDB's off-the-shelf Spark migrator (an Apache Spark-based tool provided by ScyllaDB for bulk data migration that reads token ranges from Cassandra and writes them to ScyllaDB) estimated three months to move the message data: three months of dual-running two massive database clusters, three months of operational complexity, and three months of potential failure modes. Bo Ingram decided that was three months too long. He and two colleagues rewrote the migrator in Rust in a single day. The new migrator read token ranges from a database, checkpointed them locally via SQLite for crash recovery, and fired them into ScyllaDB as fast as possible. The result: 3.2 million records per second. The new estimate was nine days, and the team chose a single-flip cutover instead of a phased, time-based approach.
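The range-by-range, checkpointed structure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Discord's migrator: it assumes a signed 64-bit token space as in Cassandra's default Murmur3 partitioner, uses an in-memory `HashSet` as a stand-in for the SQLite checkpoint store, and takes the actual copy step as a caller-supplied closure. All type and function names are hypothetical.

```rust
use std::collections::HashSet;

/// A contiguous slice of the token ring, inclusive on both ends.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct TokenRange {
    start: i64,
    end: i64,
}

/// Split the full signed 64-bit token ring into `n` contiguous ranges.
fn split_ring(n: u32) -> Vec<TokenRange> {
    let total: i128 = (i64::MAX as i128) - (i64::MIN as i128) + 1; // 2^64 tokens
    let n = n as i128;
    (0..n)
        .map(|i| {
            let start = i64::MIN as i128 + total * i / n;
            let end = i64::MIN as i128 + total * (i + 1) / n - 1;
            TokenRange { start: start as i64, end: end as i64 }
        })
        .collect()
}

/// Stand-in for the local checkpoint store (the text mentions SQLite).
#[derive(Default)]
struct Checkpoint {
    done: HashSet<TokenRange>,
}

/// Copy every range not yet recorded as done. A failed range is left
/// unrecorded so a later run retries it. Returns how many ranges were copied.
fn migrate<F>(ranges: &[TokenRange], checkpoint: &mut Checkpoint, mut copy_range: F) -> usize
where
    F: FnMut(TokenRange) -> Result<(), String>,
{
    let mut copied = 0;
    for &range in ranges {
        if checkpoint.done.contains(&range) {
            continue; // finished in an earlier run
        }
        if copy_range(range).is_ok() {
            checkpoint.done.insert(range);
            copied += 1;
        }
    }
    copied
}
```

Because progress is recorded per range, a crash or a timeout on one range costs only that range: rerunning `migrate` with the same checkpoint skips everything already copied.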
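The request coalescing pattern described earlier can be sketched as follows.

```rust
// Simplified sketch of request coalescing in a Rust data service (not Discord's actual code).
// Assumes the Tokio async runtime; `Message`, `ScyllaClient`, and `Error` are placeholders.
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::broadcast;

type Error = Box<dyn std::error::Error + Send + Sync>;

#[derive(Clone, Debug)]
struct Message {
    id: u64,
    content: String,
}

// Placeholder for a real database client.
struct ScyllaClient;

impl ScyllaClient {
    async fn query_messages(&self, _channel_id: u64, _before_id: u64) -> Result<Vec<Message>, Error> {
        Ok(Vec::new()) // a real client would run a CQL query here
    }
}

struct CoalescingDataService {
    scylladb: ScyllaClient,
    // Map from (channel_id, before_id) -> broadcast sender for the in-flight query.
    // The mutex lets `get_messages` take `&self`, so requests can overlap.
    in_flight: Mutex<HashMap<(u64, u64), broadcast::Sender<Vec<Message>>>>,
}

impl CoalescingDataService {
    async fn get_messages(&self, channel_id: u64, before_id: u64) -> Result<Vec<Message>, Error> {
        // Stable key for this exact query
        let key = (channel_id, before_id);
        // Under the lock, either subscribe to an in-flight query or register as the first caller.
        let waiter = {
            let mut in_flight = self.in_flight.lock().unwrap();
            let waiter = in_flight.get(&key).map(|sender| sender.subscribe());
            if waiter.is_none() {
                let (tx, _rx) = broadcast::channel(1);
                in_flight.insert(key, tx);
            }
            waiter
        };
        if let Some(mut rx) = waiter {
            // A query for this key is already in flight: wait for its result,
            // with no second database round-trip.
            return Ok(rx.recv().await?);
        }
        // We are the first caller: execute the single database query.
        let result = self.scylladb.query_messages(channel_id, before_id).await;
        // Stop accepting new subscribers, then wake everyone already waiting.
        // On error the sender is dropped, so waiters see a closed channel.
        let tx = self.in_flight.lock().unwrap().remove(&key);
        if let (Some(tx), Ok(messages)) = (tx, &result) {
            let _ = tx.send(messages.clone());
        }
        result
    }
}
```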
ℹ️
The SuperDisk: Custom Hardware for Cloud Durability
ScyllaDB is optimized for NVMe SSDs (Non-Volatile Memory Express solid-state drives, extremely fast local storage that dramatically reduces I/O latency), but in cloud environments NVMe is ephemeral: a node restart wipes the disk. Discord engineered a custom RAID 1 configuration they called the SuperDisk: writes go to both fast local NVMe and slower persistent network-attached storage simultaneously; reads prefer the NVMe for speed. This gave them NVMe-level read performance with cloud-level data durability.
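The write-to-both, read-from-fast behavior can be modeled in a few lines. This is only a conceptual sketch: the real mirroring happens at the block-device level, whereas the hypothetical `MirroredDisk` below uses two in-memory maps and simple counters to show which device serves each read.

```rust
use std::collections::HashMap;

/// Toy model of a mirrored volume: one fast ephemeral device, one slow durable one.
#[derive(Default)]
struct MirroredDisk {
    local_nvme: HashMap<u64, Vec<u8>>,   // fast, lost on restart
    network_disk: HashMap<u64, Vec<u8>>, // slower, persistent
    nvme_reads: u64,
    network_reads: u64,
}

impl MirroredDisk {
    /// Every write goes to both devices.
    fn write(&mut self, block: u64, data: &[u8]) {
        self.local_nvme.insert(block, data.to_vec());
        self.network_disk.insert(block, data.to_vec());
    }

    /// Reads prefer the fast device and fall back to the durable one.
    fn read(&mut self, block: u64) -> Option<Vec<u8>> {
        if let Some(data) = self.local_nvme.get(&block) {
            self.nvme_reads += 1;
            return Some(data.clone());
        }
        let data = self.network_disk.get(&block)?.clone();
        self.network_reads += 1;
        Some(data)
    }

    /// Simulate a node restart that wipes the ephemeral device.
    fn simulate_restart(&mut self) {
        self.local_nvme.clear();
    }
}
```

After `simulate_restart`, reads still succeed because the durable copy survives; they are simply served from the slower device until the fast one is repopulated.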
✅
Zstandard Compression: 50–60% Disk Reduction
Alongside the database migration, Discord enabled Zstandard compression on their ScyllaDB tables. Message data compresses extremely well. The result was a 50–60% reduction in raw disk usage compared to uncompressed Cassandra storage — effectively giving each physical node far more useful capacity at zero hardware cost.
THE VALIDATION STRATEGY
Discord ran automated correctness validation throughout the migration by sending a small percentage of reads to both databases simultaneously and comparing results. Only when reads matched across Cassandra and ScyllaDB was a partition considered successfully migrated. This shadow-read approach caught data inconsistencies without any user-visible impact, and gave the team confidence to flip the cutover switch as a single atomic event rather than a long, hedged transition.
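A shadow-read comparator along these lines might look like the sketch below. The `MessageStore` trait, the in-memory `MapStore`, and the every-Nth-read sampling rule are all assumptions made for illustration; the source says only that a small percentage of reads was sent to both databases and compared.

```rust
use std::collections::HashMap;

/// Minimal read interface shared by the old and new databases (hypothetical).
trait MessageStore {
    fn read(&self, channel_id: u64) -> Vec<String>;
}

/// In-memory stand-in for a real database client.
struct MapStore(HashMap<u64, Vec<String>>);

impl MessageStore for MapStore {
    fn read(&self, channel_id: u64) -> Vec<String> {
        self.0.get(&channel_id).cloned().unwrap_or_default()
    }
}

/// Serves reads from the primary and compares a sample against the shadow.
struct ShadowReader<P, S> {
    primary: P,
    shadow: S,
    sample_every: u64, // compare every Nth read; 0 disables shadow reads
    reads: u64,
    compared: u64,
    mismatches: u64,
}

impl<P: MessageStore, S: MessageStore> ShadowReader<P, S> {
    fn new(primary: P, shadow: S, sample_every: u64) -> Self {
        ShadowReader { primary, shadow, sample_every, reads: 0, compared: 0, mismatches: 0 }
    }

    /// The caller always gets the primary's answer; the shadow result is
    /// used only to update the mismatch counter.
    fn read(&mut self, channel_id: u64) -> Vec<String> {
        let result = self.primary.read(channel_id);
        self.reads += 1;
        if self.sample_every > 0 && self.reads % self.sample_every == 0 {
            self.compared += 1;
            if self.shadow.read(channel_id) != result {
                self.mismatches += 1;
            }
        }
        result
    }
}
```

The key property is that users only ever see the primary's response, so a discrepancy in the new database shows up as a counter increment rather than as a user-visible error.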
Cassandra vs ScyllaDB at Discord: Before and After Migration
| Metric | Cassandra (Before) | ScyllaDB (After) |
|---|---|---|
| Cluster Nodes | 177 | 72 |
| Disk per Node (avg) | 4 TB | 9 TB |
| p99 Read Latency | 40–125ms (variable) | ~15ms (stable) |
| GC Pauses | Frequent stop-the-world | None (C++, no GC) |
| Hot Partition Risk | High — no coalescing | Mitigated by Rust data services |
| On-Call Toil | Weekly node babysitting | Dramatically reduced |
Architecture
Before the migration, Discord's message write and read path ran through a monolithic API server directly into the Cassandra cluster. There was no intermediary — every user action that read messages translated directly into a database query, with no protection against fan-out or hot partition amplification (when many users simultaneously request data stored in the same database partition, causing that node to receive far more traffic than its neighbors, creating latency spikes and potential instability). The API server held connection pools to Cassandra, handled CQL queries (Cassandra Query Language — a SQL-like interface for querying Cassandra) for message pagination, and relied on Cassandra's own internal mechanisms (memtable, SSTables, compaction) to handle read pressure. Under normal load this worked. Under peak load — a major announcement, a viral moment, a World Cup Final — it did not.
Before: Direct API-to-Cassandra Architecture (Hot Partition Risk)
SHARD-PER-CORE ARCHITECTURE
The fundamental reason ScyllaDB handles concurrent reads so much better than Cassandra is its shard-per-core architecture. Each CPU core is assigned its own exclusive slice of the data and handles all requests for that data without coordination with other cores. In Cassandra's JVM-based model, all threads compete for heap memory under a single garbage collector. In ScyllaDB's C++ model, each core is an independent actor: no cross-core locking, no GC, no stop-the-world. When one partition gets hot, it affects only the core assigned to that shard and cannot cascade to neighbors.
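The ownership model can be illustrated with a small single-threaded sketch. It is a simplification under stated assumptions: the `Node` and `Shard` types are hypothetical, the key-to-shard mapping is a plain hash modulo the core count rather than ScyllaDB's actual token-to-shard algorithm, and a real node would run each shard on its own core instead of in one loop.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// State owned exclusively by one core: no other shard ever touches it.
struct Shard {
    rows: HashMap<u64, Vec<String>>,
    reads_served: u64,
}

/// A node with one shard per core.
struct Node {
    shards: Vec<Shard>,
}

impl Node {
    fn new(cores: usize) -> Self {
        assert!(cores > 0, "a node needs at least one core");
        let shards = (0..cores)
            .map(|_| Shard { rows: HashMap::new(), reads_served: 0 })
            .collect();
        Node { shards }
    }

    /// Deterministically map a partition key to the shard that owns it.
    fn shard_for(&self, partition_key: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        partition_key.hash(&mut hasher);
        (hasher.finish() % self.shards.len() as u64) as usize
    }

    fn write(&mut self, partition_key: u64, message: &str) {
        let idx = self.shard_for(partition_key);
        self.shards[idx].rows.entry(partition_key).or_default().push(message.to_string());
    }

    fn read(&mut self, partition_key: u64) -> Vec<String> {
        let idx = self.shard_for(partition_key);
        let shard = &mut self.shards[idx];
        shard.reads_served += 1;
        shard.rows.get(&partition_key).cloned().unwrap_or_default()
    }
}
```

Reading one hot key a thousand times increments `reads_served` on exactly one shard, which is the containment property the paragraph describes: the other shards do no work for that partition.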
ℹ️
Consistent Hashing: Routing Channels to Service Instances
Each Rust data service instance is responsible for a deterministic subset of channel IDs via consistent hashing (a routing scheme where each channel_id is mapped to a specific service instance using a hash ring, so all requests for channel #12345 always go to Data Service Instance B — maximizing the chance that an in-flight coalescing task for that channel already exists). This means if 1,000 users simultaneously load the same popular channel, all 1,000 requests arrive at the same service instance and collapse into one database query.
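As a rough illustration of the routing step, the sketch below builds a small hash ring and looks up the owning instance for a channel ID. The `HashRing` type, the instance names, and the use of virtual nodes are assumptions for the example; the source does not describe how Discord's ring is implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// A hash ring mapping points on a u64 circle to service instance names.
struct HashRing {
    ring: BTreeMap<u64, String>,
}

impl HashRing {
    /// Place each instance on the ring at `vnodes` pseudo-random points.
    fn new(instances: &[&str], vnodes: u32) -> Self {
        let mut ring = BTreeMap::new();
        for name in instances {
            for v in 0..vnodes {
                ring.insert(hash_of(&(*name, v)), (*name).to_string());
            }
        }
        HashRing { ring }
    }

    /// Walk clockwise from the channel's hash to the first instance point,
    /// wrapping around to the start of the ring if necessary.
    fn instance_for(&self, channel_id: u64) -> Option<&str> {
        let point = hash_of(&channel_id);
        self.ring
            .range(point..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, name)| name.as_str())
    }
}
```

Every lookup for the same channel ID returns the same instance, and removing an instance that does not own a channel leaves that channel's owner unchanged, so in-flight coalescing state for most channels survives a change in the instance set.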
After: Rust Data Services + ScyllaDB Architecture (Hot Partition Mitigated)
🦀
Why Rust for the Data Services Layer
Discord chose Rust for data services because it offered C-level throughput with memory safety guarantees that prevent entire classes of concurrency bugs common in C++ — exactly what you want in a layer handling millions of concurrent subscribers. The Tokio async runtime gave them non-blocking I/O without the GC overhead that had plagued their Cassandra setup. As Bo Ingram noted with characteristic candor: it also let them say they rewrote it in Rust.
Lessons
Discord's migration took years of preparation and nine days of execution. The long preparation was not waste — it was the reason the execution was clean. The lessons here are as much about sequencing and courage as they are about database choice.
- 01. Migrate your riskiest system last, but don't use that as an excuse to never migrate it. Discord deliberately kept the messages database in Cassandra for two years after migrating everything else, using that time to build ScyllaDB expertise on less critical workloads. However, they committed to a hard deadline once operational confidence was achieved — avoiding the trap of indefinite deferral that plagues many large migrations.
- 02. Request coalescing (combining multiple concurrent requests for identical data into a single database query, broadcasting the result to all waiters) is a force multiplier against hot partitions that no amount of database scaling alone can provide. When you have popular content that thousands of users read simultaneously, add a coalescing layer between your application and your database — the reduction in query fan-out is often more impactful than hardware upgrades.
- 03. Rewrite your migration tooling if the estimated duration is unacceptable. A three-month migration estimate is not a constraint — it's a scope definition that you can change. Discord's one-day Rust rewrite of the migrator turned a three-month project into nine days, enabling a simpler single-flip cutover instead of a complex phased approach. Always ask: what would it take to make this ten times faster?
- 04. Stop-the-world GC pauses (periodic halts in JVM-based systems where all threads freeze while the garbage collector reclaims memory) are a predictable, structural problem in Java-based databases at high concurrency — not a tuning problem you can engineer your way out of at Discord's scale. When your on-call team spends more time maintaining a database than improving it, that's the signal to evaluate architecturally different alternatives, not just different JVM flags.
- 05. Run shadow reads for data validation before any large-scale cutover. Sending a percentage of reads to both old and new systems simultaneously — and comparing results automatically — gives you objective confidence that your migration is correct without user-visible risk. This pattern is applicable to any database migration and should be standard practice before any atomic cutover switch.
✅
The World Cup Validation
The 2022 FIFA World Cup Final was Discord's unplanned load test — and the system passed cleanly. Every goal, every save, every penalty created message spikes across thousands of servers simultaneously. The combination of ScyllaDB's shard-per-core architecture and Rust data services coalescing kept latency flat through all 120 minutes plus penalties. No hot partition alerts. No on-call pages. No post-match war rooms.
SHADOW READ VALIDATION
Discord's validation strategy during migration was elegantly simple: send a small percentage of reads to both Cassandra and ScyllaDB simultaneously, compare results automatically, and flag any discrepancy. This meant correctness was continuously verified during the nine days of data transfer rather than checked at the end in a tense manual review. Any database migration touching production data should implement this pattern before flipping the final switch.
They migrated four trillion messages in nine days, and the most stressful moment was the progress bar stopping at 99.9999% — because even tombstones refuse to die quietly.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).