In October 2021, Roblox suffered the longest outage in its history: 73 hours of complete downtime affecting millions of players worldwide. What makes this case fascinating is that the root cause wasn't an obvious system failure or an external attack. It was a slow-burning piece of technical debt buried deep inside an outdated database.
I recently did a deep dive into the post-mortem and gained valuable insights. This is arguably one of the most detailed and thorough outage reports available online.
🌐 Background:
Roblox employs a microservices architecture for its backend, using HashiCorp’s Consul for service discovery, enabling internal services to locate and communicate with each other. On the afternoon of October 28th, a single Consul server experienced high CPU load. The Consul cluster's performance continued to degrade, ultimately bringing down the entire Roblox system since Consul acted as a single point of failure. Engineers from both Roblox and HashiCorp collaborated to diagnose and resolve the issue.
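To make the service-discovery piece concrete, here is a minimal sketch of how a backend service might register itself with Consul and look up healthy peers using HashiCorp's official Go client. The service name and port here are illustrative assumptions, not Roblox's actual setup:

```go
// Minimal sketch: register a service with Consul and discover healthy peers.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default: 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this instance so other services can find it.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		Name: "player-session", // hypothetical service name
		Port: 8080,             // hypothetical port
	})
	if err != nil {
		log.Fatal(err)
	}

	// Discover healthy instances of the same service.
	entries, _, err := client.Health().Service("player-session", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Printf("%s:%d\n", e.Service.Address, e.Service.Port)
	}
}
```

When this lookup layer degrades, every inter-service call degrades with it, which is why a sick Consul cluster could take down all of Roblox.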
🔧 Debugging Attempts:
- Suspected Hardware Failure: The team replaced one of the Consul cluster nodes, but the issue persisted.
- Suspected Increased Traffic: The team replaced all Consul cluster nodes with more powerful machines featuring 128 cores (2x increase) and faster NVMe SSD disks. This also did not resolve the issue.
- Resetting Consul’s State: The team shut down the entire Consul cluster and restored its state using a snapshot from a few hours before the outage began. Initially, the system appeared stable, but it soon degraded again, returning to an unhealthy state.
- Reducing Consul Usage: Roblox services that typically had hundreds of instances running were scaled down to single digits. This approach provided temporary relief for a few hours before Consul again became unhealthy.
- Identifying Resource Contention Issues: Upon deeper inspection of debug logs, the team discovered resource contention problems and reverted to machines similar to those used before the outage. Eventually, they identified the issue: Consul's new streaming feature. It used fewer concurrency-control elements (Go channels) than the older code path, which led to excessive contention on a single Go channel under high read/write load. Disabling streaming dramatically improved the Consul cluster's health (a simplified sketch of this contention pattern follows this list).
- Leader Election Optimization: The team observed Consul intermittently electing new cluster leaders, which was normal. However, some leaders exhibited the same latency issues. The team worked around this by preventing problematic leaders from staying elected.
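To illustrate the contention problem from the streaming step above, here is a simplified, self-contained sketch (not Consul's actual code) showing how funneling many producers through a single Go channel serializes them, while sharding the same work across multiple channels spreads the contention:

```go
// Simplified illustration of single-channel contention vs. sharded channels.
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	producers = 64
	eventsPer = 100_000
)

func run(shards int) time.Duration {
	chans := make([]chan int, shards)
	var consumers sync.WaitGroup
	for i := range chans {
		chans[i] = make(chan int, 1024)
		consumers.Add(1)
		go func(c chan int) { // one consumer drains each shard
			defer consumers.Done()
			for range c {
			}
		}(chans[i])
	}

	start := time.Now()
	var prods sync.WaitGroup
	for p := 0; p < producers; p++ {
		prods.Add(1)
		go func(p int) {
			defer prods.Done()
			c := chans[p%shards] // shards == 1 reproduces the single-channel hotspot
			for i := 0; i < eventsPer; i++ {
				c <- i
			}
		}(p)
	}
	prods.Wait()
	for _, c := range chans {
		close(c)
	}
	consumers.Wait()
	return time.Since(start)
}

func main() {
	fmt.Println("1 shared channel:   ", run(1))
	fmt.Println("64 sharded channels:", run(64))
}
```

On a multi-core machine, the single-channel run is typically several times slower, which mirrors how one hot channel can bottleneck an otherwise well-provisioned cluster.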
✅ Following these measures, the system was finally stable. The team carefully brought Roblox back online by restoring caching systems and gradually allowing randomly selected players to reconnect. After 73 hours, Roblox was fully operational.
🔍 BoltDB’s Freelist: A Silent Culprit
The root cause of the outage stemmed from two key issues: the Consul streaming feature (as mentioned above) and severe performance degradation in Consul's underlying database, BoltDB. The latter is the part I find most interesting.
Consul uses Raft consensus for leader election to ensure data consistency in a distributed environment. For persistence, it relies on BoltDB, a popular embedded key-value store, to store Raft logs.
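For readers unfamiliar with embedded key-value stores, here is a minimal sketch of the BoltDB usage pattern. I import the maintained bbolt fork, which kept the original BoltDB API; the bucket name and key layout are illustrative assumptions, not Consul's actual schema:

```go
// Minimal sketch of the BoltDB API: a single-file, transactional KV store.
package main

import (
	"encoding/binary"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open (or create) a single-file database, similar to Consul's raft.db.
	db, err := bolt.Open("raft.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Append a log entry keyed by its index in one write transaction.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("logs"))
		if err != nil {
			return err
		}
		key := make([]byte, 8)
		binary.BigEndian.PutUint64(key, 42) // hypothetical log index
		return b.Put(key, []byte("log entry payload"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```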
Like many databases, BoltDB has a freelist, which tracks free pages—disk space that was previously occupied but is now available for reuse. This mechanism is crucial for efficient database performance, preventing unnecessary disk growth and optimizing read/write operations.
However, BoltDB’s freelist implementation had a critical inefficiency (see source code here). It used an array to store the ID of each free page, meaning that every database read/write operation involved a linear scan of the freelist. As the freelist grew large, the cost of operations increased significantly.
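Here is a simplified sketch (not BoltDB's actual source) of that array-based design, showing why every allocation pays an O(n) scan over the freelist:

```go
// Simplified array-based freelist: free page IDs live in a sorted slice, so
// allocating n contiguous pages requires a linear scan on every allocation.
package freelist

type pgid uint64

type freelist struct {
	ids []pgid // sorted IDs of all free pages
}

// allocate returns the starting ID of a run of n contiguous free pages,
// or 0 if none exists. Cost grows linearly with the freelist size.
func (f *freelist) allocate(n int) pgid {
	var runStart pgid
	var runLen int
	for i, id := range f.ids {
		if i > 0 && id != f.ids[i-1]+1 {
			runLen = 0 // gap: the contiguous run is broken
		}
		if runLen == 0 {
			runStart = id
		}
		runLen++
		if runLen == n {
			// Found a run: remove those n pages from the free list.
			start := i - n + 1
			f.ids = append(f.ids[:start], f.ids[i+1:]...)
			return runStart
		}
	}
	return 0
}
```

With millions of free pages accumulated, this scan ran on every write, which is exactly the kind of slow-burning cost that stays invisible until the dataset crosses a threshold.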
📈 Interestingly, this performance issue was first reported in 2016 (GitHub Issue), but it was never fixed. The author of BoltDB, Ben Johnson, stopped maintaining the project in 2017, stating:
"Maintaining an open-source database requires an immense amount of time and energy. Changes to the code can have unintended and sometimes catastrophic effects, so even simple changes require hours of careful testing and validation.
Unfortunately, I no longer have the time or energy to continue this work. Bolt is in a stable state and has years of successful production use. As such, I feel that leaving it in its current state is the most prudent course of action."
Although BoltDB was no longer maintained, the Go community forked it into a new project called bbolt (bbolt GitHub) to continue active maintenance and add new features. Unfortunately, Consul was still using the outdated, unmaintained version of BoltDB.
In 2019, the freelist performance issue was finally resolved in bbolt (Alibaba Cloud Blog). The fix was straightforward: using a hashmap instead of an array, eliminating the linear scan and allowing near-constant-time lookups. (🚀 I love how such a simple idea brings a huge performance boost!)
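Here is a simplified sketch of the hashmap idea behind the bbolt fix (modeled on the concept, not bbolt's exact code): indexing free runs by their length turns allocation into a map lookup. The real implementation also splits larger runs when an exact size is missing:

```go
// Simplified hashmap-based freelist: free runs are indexed by run length,
// so allocating n contiguous pages is a map lookup instead of a linear scan.
package freelist

type pgid uint64

type hashmapFreelist struct {
	// freemaps maps run-size -> set of starting page IDs of free runs.
	freemaps map[uint64]map[pgid]struct{}
}

// allocate returns the start of a free run of exactly n pages, or 0 if none
// is available at that size.
func (f *hashmapFreelist) allocate(n uint64) pgid {
	ids, ok := f.freemaps[n]
	if !ok || len(ids) == 0 {
		return 0
	}
	for start := range ids { // take any run of the right size: O(1) on average
		delete(ids, start)
		if len(ids) == 0 {
			delete(f.freemaps, n)
		}
		return start
	}
	return 0
}
```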
Since this fix was committed to bbolt—not BoltDB—Consul did not benefit from the improvement, ultimately leading to the three-day Roblox outage in 2021.
🤔 Unanswered Questions
This post has covered a lot, but there’s only so much that can fit into a single post. As an engineer, I find myself eager to explore further details. Several intriguing questions remain unanswered:
Why didn’t Roblox roll back Consul’s streaming feature sooner?
Given that Consul was clearly the culprit early on and a significant change had just been made to its infrastructure, rolling back should have been one of the first things attempted. What factors delayed this decision?
Why did only some Consul servers experience the BoltDB freelist performance issue?
In theory, all servers should have been in a similar state since the leader is usually ahead of its followers only by a small margin. Yet, only some instances suffered severe degradation. What caused this inconsistency?
Why didn’t restoring Consul’s state using a previous snapshot fix the issue?
My hypothesis is that restoring Consul's state did not reset BoltDB's underlying raft.db file on each server, meaning the bloated freelist persisted even after the rollback. If true, this suggests that restoring a snapshot recovers the logical state but does not rebuild the database file's internal structures, such as the freelist.
Why did reducing Consul usage work temporarily before failing again?
If the freelist was already too large, reducing usage shouldn’t have provided any relief. Did scaling down slow the growth of the freelist temporarily, delaying the inevitable, or was another factor at play?
Why did the new streaming feature work for a day before the outage occurred?
If the new Consul streaming feature was inherently flawed, why didn’t the system immediately degrade? Was there an initial buffer that temporarily masked the issue, or did a specific traffic pattern trigger the breakdown?
Since the BoltDB freelist performance issue has been around for years, why didn’t Roblox experience such system performance degradation in earlier months?
BoltDB’s freelist inefficiency had been a known issue since 2016. What changed in Roblox’s workload or data structure that made this issue surface now? Did the new Consul streaming feature exacerbate the problem by dramatically increasing write operations to BoltDB?
💡 Endnotes
This post-mortem report offers invaluable lessons, and I highly encourage everyone to check it out! There are also extensive discussions about this outage on Hacker News, where BoltDB's author, Ben Johnson, participated in the conversation.
As a software engineer, I firmly believe that a key trait of a great engineer is the ability to efficiently navigate large, complex systems and diagnose issues under pressure. I deeply admire and respect the engineers from Roblox and HashiCorp, who worked tirelessly under immense pressure to investigate and resolve the issue. Hats off to them for their resilience and expertise.
Thank you for reading this far! If you found this post useful, I’d truly appreciate it if you could share it with others.