shalomtash
# Simulink server crash incident report

## By the Simulink backend developer team

Earlier this week we experienced a crash in our servers that led to the interruption of over 500,000 client transactions.

How could such a massive failure happen?

We are providing an incident report that details the nature of the outage and our response.

The following is the incident report for the Simulink server crash that occurred on November 1, 2022. We understand this service issue has impacted our valued developers and users, and we apologize to everyone who was affected.

## Issue summary

From 12:10 PM to 8:50 PM EAT, all server requests to Simulink resulted in internal server error messages. This issue affected 100% of traffic to our servers; the only services that remained reachable were certain APIs that run on separate infrastructure. The root cause of this outage was an invalid configuration change that exposed a bug in a widely used internal library.

## Timeline

- 11:05 AM: System update released
- 12:05 PM: Servers restart after update
- 12:10 PM: Outage begins
- 2:30 PM: Pagers alert teams
- 6:00 PM: Issue identified
- 6:30 PM: Configuration change rollback fails
- 6:50 PM: Configuration change rollback succeeds
- 7:10 PM: Server restarts begin
- 8:50 PM: 100% of traffic back online

## Root cause

At 11:05 AM EAT, a biannual system update was released to our production environment. The update was accompanied by an undetected configuration change that specified an invalid address for the authentication servers in production. This exposed a bug in the authentication libraries which caused them to block permanently while attempting to resolve the invalid address to physical services. In addition, the internal monitoring systems permanently blocked on this call to the authentication library. The combination of the update, bug, and configuration error quickly caused all of the serving threads to be consumed. Traffic was permanently queued waiting for a serving thread to become available. The servers began repeatedly hanging and restarting as they attempted to recover and at 12:10 PM EAT, the service outage began.
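The failure mode above, a library call that blocks forever and pins a serving thread, can be sketched in a few lines of Python. This is an illustrative model only: the function names, the simulated hang, and the two-second timeout are assumptions, not Simulink's internal authentication library.

```python
import threading
import time

def resolve_auth_address(address):
    # Stand-in for the internal authentication library call that
    # blocked permanently on the invalid configured address
    # (hypothetical; the real library is internal to Simulink).
    if address == "invalid-auth-host":
        time.sleep(3600)  # simulate a call that never returns
    return "10.0.0.1"

def resolve_with_timeout(address, timeout=2.0):
    # Run the potentially blocking resolution in a daemon thread and
    # give up after `timeout` seconds, instead of letting the call
    # pin a serving thread forever -- which is how all serving
    # threads were consumed during the outage.
    result = {}
    worker = threading.Thread(
        target=lambda: result.update(value=resolve_auth_address(address)),
        daemon=True,
    )
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        raise RuntimeError(f"auth resolution timed out for {address!r}")
    return result["value"]
```

With a timeout like this, an invalid address surfaces as a fast, loggable error rather than a permanently consumed thread.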

## Resolution and recovery

At 2:30 PM EAT, the Datadog monitoring systems alerted our backend engineers, who investigated the issue. However, the ongoing cycle of server hangs and restarts delayed resolution of the problem. An abrupt stop of all server processes would have risked the inadvertent loss of client data, so care had to be taken to back up and transfer all data to an availability zone before stopping all processes and identifying the underlying issue. By 6:00 PM, the incident response team had identified that the monitoring system was exacerbating the problem caused by this bug.
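The shutdown ordering described above, drain, back up, then stop, can be sketched as follows. The data structures and names are illustrative, not Simulink's actual tooling.

```python
def safe_shutdown(servers, backup_store):
    # Drain first so no new client data arrives mid-backup.
    for server in servers:
        server["draining"] = True
    # Copy all pending client data to the backup store (standing in
    # for the availability-zone transfer) before anything stops.
    for server in servers:
        backup_store.append(list(server["pending"]))
    # Only now is it safe to stop the processes without losing data.
    for server in servers:
        server["pending"].clear()
        server["running"] = False
```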

At 6:30 PM, we attempted to roll back the problematic configuration change. This rollback failed due to complexity in the configuration system, which caused our security checks to reject the rollback. These problems were addressed and we successfully rolled back at 6:50 PM. We then redid the system update with the servers still in offline monitoring mode.

We decided to restart servers gradually (at 7:10 PM), to avoid possible cascading failures from a wide-scale restart. By 8:00 PM, 25% of traffic was restored and 100% of traffic was routed to the API infrastructure at 8:50 PM.
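A gradual ramp-up like the one used here can be sketched as a stage loop gated on a health signal. The stage percentages and the error-rate threshold below are assumptions, not the values Simulink used.

```python
RAMP_STAGES = [5, 25, 50, 100]  # percent of traffic per stage (assumed)

def ramp_traffic(get_error_rate, route_fraction, threshold=0.01):
    # Restore traffic one stage at a time, checking the error rate
    # before ramping further; back off entirely on a bad signal to
    # avoid the cascading failures a wide-scale restart risks.
    for pct in RAMP_STAGES:
        route_fraction(pct / 100.0)
        if get_error_rate() >= threshold:
            route_fraction(0.0)   # stop routing traffic
            return pct            # report the stage that failed
    return 100
```

Gating each stage on an error-rate check means a recurrence of the bug is caught while only a small fraction of traffic is exposed.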

## Corrective and preventative measures

In the last two days, we’ve conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying causes of the issue and to help prevent recurrence and improve response times:

- Disable the current combined system update and configuration release mechanism until safer measures are implemented. (Completed.)
- Change the monitoring alert process to be quicker and more robust.
- Fix the underlying authentication libraries and monitoring to correctly time out or interrupt on errors.
- Programmatically enforce staged rollouts of all configuration changes.
- Improve the process for auditing all high-risk configuration options.
- Add a faster rollback mechanism and improve the traffic ramp-up process, so any future problems of this type can be corrected quickly.
- Develop a better mechanism for quickly delivering status notifications during incidents.

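As one example of the kind of programmatic check these measures call for, a pre-release audit could reject a configuration whose authentication server address is malformed, which is exactly the error that triggered this outage. This validator is a sketch under assumed field formats (`host:port`), not Simulink's actual release tooling.

```python
import ipaddress

def validate_auth_address(value):
    # Pre-release check for a high-risk config field: require a
    # well-formed host:port before the configuration can ship.
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        return False
    if not 0 < int(port) < 65536:
        return False
    try:
        ipaddress.ip_address(host)   # a literal IP address is fine
        return True
    except ValueError:
        # Otherwise require a plausible dotted hostname.
        return all(part and part.replace("-", "").isalnum()
                   for part in host.split("."))
```

Rejecting the release at audit time turns an eight-hour outage into a failed pre-flight check.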
Simulink is committed to continually and quickly improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

*Sincere apologies*
