In the sphere of real-time communication, Slack stands out as a central platform for workplace connectivity and collaboration. On October 12, 2022, however, the system's reliability was put to the test. The Datastores team, responsible for overseeing Slack's database clusters, encountered a significant challenge: a sudden increase in failed database queries pointed to an underlying issue that required immediate attention.
Incident Onset and Immediate Response
During an onsite in Amsterdam, the Datastores team, freshly augmented with new engineers, was alerted to a troubling rise in database query failures. An investigation revealed the cause: a long-running asynchronous job was purging substantial amounts of data, leading to an overload on the database cluster. To mitigate this, the team executed a temporary solution known as 'shimming,' which allowed ongoing jobs to complete while halting the initiation of new processes.
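To make the idea of the shim concrete, here is a minimal sketch in Python, assuming a simple in-process worker loop; the halt flag, the job objects, and their start()/done() methods are hypothetical and not Slack's actual tooling. The point is only that work already in flight runs to completion while no new work is admitted.

```python
import queue
import time

# A minimal sketch of the 'shim' idea (names and mechanism are illustrative, not
# Slack's actual tooling): jobs that are already running are allowed to finish,
# while a halt flag stops the worker from picking up any new ones.
halt_new_jobs = True

def worker_loop(job_queue: queue.Queue, in_flight: list) -> None:
    while in_flight or not job_queue.empty():
        # Drop jobs that have finished; anything already in flight runs to completion.
        # (Job objects with start()/done() methods are assumed for this sketch.)
        in_flight[:] = [job for job in in_flight if not job.done()]

        # Only dequeue new work once operators lift the shim.
        if not halt_new_jobs and not job_queue.empty():
            job = job_queue.get_nowait()
            job.start()
            in_flight.append(job)

        time.sleep(0.5)
```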
Recurrence and Escalation
The remedy seemed effective until the very next day when the problem resurfaced with greater severity. This incident shed light on an edge-case bug within Datastores' automation, which had failed to manage a surge in requests, directly affecting customer access to Slack. The team responded by disabling certain features to alleviate the load on the cluster, which provided the necessary room for recovery.
Slack's Datastore Strategy
Before discussing the trigger and the cascading effects of the incident, it is essential to understand Slack's datastore strategy, which is at the core of its operational integrity. Slack's data architecture is built around Vitess, a database clustering system for horizontal scaling of MySQL. By employing Vitess, Slack effectively shards its extensive dataset across multiple MySQL instances. This not only allows for more efficient data management and retrieval but also ensures that operations can continue seamlessly even if one shard encounters an issue. Each shard contains a portion of the database and operates in conjunction with replicas to balance the load and facilitate quick data access. This setup is designed to maximize uptime and performance, a necessity for a platform supporting millions of concurrent users. The strategic use of sharding and replication is central to Slack's ability to scale dynamically and maintain robust data integrity, even as user numbers and data volumes continue to grow.
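To illustrate the sharding idea, the sketch below routes a workspace's rows to one of a fixed number of shards by hashing a sharding key. This is a generic, assumed example: Vitess actually performs routing through its own vindex functions and keyspace IDs, and the key name, hash function, and shard count here are placeholders rather than Slack's configuration.

```python
import hashlib

# Illustrative only: hash-based routing of a sharding key (e.g. a workspace/team id)
# to one of N shards. Vitess uses its own vindex functions and keyspace IDs; the
# hash, key name, and shard count here are assumptions made for this sketch.
NUM_SHARDS = 16

def shard_for(team_id: int) -> int:
    digest = hashlib.sha1(str(team_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Queries for a given team go to the primary (or a replica) of its shard only.
print(shard_for(42))    # one team lands on one shard...
print(shard_for(1337))  # ...another team may land on a different shard
```

Because each query is scoped to a single shard, a hot or degraded shard affects only the data that hashes to it, which is what lets the rest of the platform keep operating while that shard recovers.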
The Trigger and Subsequent Measures
Upon further analysis, the team identified that the incident was triggered by a customer removing a large number of users from their workspace, an operation that initiated a cascade of data modifications beyond the usual scope. To address the immediate issue, the Datastores team manually provisioned larger instance types for replicas, circumventing the automated systems that were not equipped to handle such an anomaly.
Long-Term Solutions and Preventative Actions
Moving forward, the Datastores team has adopted several strategic measures to prevent a recurrence. They have implemented throttling mechanisms and the circuit breaker pattern, both of which serve as safeguards against query overload. These measures enable the team to proactively limit or cancel queries to affected shards, thereby maintaining database stability.
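As a rough illustration of the circuit-breaker half of that safeguard, the sketch below wraps shard queries and begins rejecting them after repeated failures, then re-admits traffic after a cool-down. This is the generic pattern under assumed thresholds, not Slack's implementation; throttling complements it by capping the query rate even while the circuit is closed.

```python
import time

# A minimal circuit-breaker sketch (the generic pattern, not Slack's implementation):
# after too many consecutive failures against a shard, further queries are rejected
# immediately for a cool-down period, giving overloaded replicas room to recover.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed and traffic flows

    def call(self, query_fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: shed load instead of piling more queries onto the shard.
                raise RuntimeError("circuit open: query rejected")
            # Cool-down elapsed: half-open, let one trial query through.
        try:
            result = query_fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```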
In addition, the team addressed the specific challenges posed by the 'forgetUser' job, which was central to the incident, by optimizing and streamlining it so that large-scale user removal operations place less load on the database.
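One common way to streamline such a job, assumed here purely for illustration, is to delete rows in small batches with pauses between them rather than in one large sweep; the table name, schema, connection object, and batch size below are hypothetical and not taken from Slack's 'forgetUser' implementation.

```python
import time

# Sketch of one way to make a bulk-deletion job gentler on the database: delete in
# small batches and pause between them instead of issuing a single huge statement.
# The table, columns, DB-API connection, and batch size are illustrative assumptions,
# not the actual schema or code behind Slack's 'forgetUser' job.
BATCH_SIZE = 500
PAUSE_SECONDS = 0.2

def purge_user_rows(conn, user_id: int) -> None:
    while True:
        cursor = conn.cursor()
        # DELETE ... LIMIT keeps each transaction small and replication lag bounded.
        cursor.execute(
            f"DELETE FROM user_data WHERE user_id = %s LIMIT {BATCH_SIZE}",
            (user_id,),
        )
        conn.commit()
        if cursor.rowcount < BATCH_SIZE:
            break  # nothing, or only a partial batch, was left to delete
        time.sleep(PAUSE_SECONDS)  # yield headroom to foreground queries
```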
Conclusion
The incidents of October 2022 highlighted the intricate balance required in managing large-scale, distributed databases. The Datastores team's adept response and the subsequent refinements to their systems underscore the continuous need for vigilance, adaptability, and innovation in database management. As a result of these efforts, Slack's infrastructure has been fortified, showcasing a commitment to resilience and the uninterrupted service that users rely on.