Dayisi Tobi Iyanu

Budgetr Downtime Incident Report

We would like to apologize to all our users for the downtime experienced last week. We understand the inconvenience this may have caused, and we have carried out the necessary checks and fixes to ensure that it does not happen again.

Below is an incident report for the downtime that occurred on the 5th of June, 2024, along with an outline of our response to the issue.

Issue Summary

From 6:03 AM to 8:52 AM WAT, requests to the website returned a 500 error and users could not access the service. The cause of the outage was an untested change pushed to production, which introduced a bug.

Timeline (all times in West Africa Time)

  • 6:03 AM - New change was pushed to production.
  • 6:55 AM - The first outage occurrence was logged.
  • 6:56 AM - Our monitoring system alerted us.
  • 7:20 AM - First rollback attempt failed.
  • 7:30 AM - Change was successfully rolled back.
  • 8:00 AM - Fixed change was tested and pushed to production.
  • 8:30 AM - Servers were restarted.
  • 8:52 AM - System was restored to 100% functionality.
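
The intervals implied by this timeline can be worked out with a short script. This is only a sketch for reference: the event labels and the helper name `minutes_between` are shorthand introduced here, not part of our tooling.

```python
from datetime import datetime

# Times taken from the timeline above (WAT, 5 June 2024).
# The dictionary keys are shorthand labels, not official event names.
TIMELINE = {
    "change_pushed": "6:03 AM",
    "monitoring_alert": "6:56 AM",
    "rollback_succeeded": "7:30 AM",
    "service_restored": "8:52 AM",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day times such as '6:03 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_alert = minutes_between(TIMELINE["change_pushed"], TIMELINE["monitoring_alert"])
alert_to_rollback = minutes_between(TIMELINE["monitoring_alert"], TIMELINE["rollback_succeeded"])
total_duration = minutes_between(TIMELINE["change_pushed"], TIMELINE["service_restored"])

print(f"Change pushed to alert: {time_to_alert} min")
print(f"Alert to successful rollback: {alert_to_rollback} min")
print(f"Change pushed to full restoration: {total_duration} min")
```

On these figures, 53 minutes passed between the change being pushed and the alert, 34 minutes between the alert and the successful rollback, and 169 minutes (2 hours 49 minutes) between the push and full restoration.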

Root Cause

At 6:03 AM a new feature, which had been discussed and approved for development within the team, was pushed to production without being tested. The feature is a payment infrastructure that Budgetr will be using, but it did not consume the payment APIs correctly. This broke the whole code base and caused the 500 error.

Resolution and Recovery

At 6:56 AM our monitoring system alerted our engineers to the error, and the issue was escalated immediately. At 7:20 AM our engineers attempted to roll back the change so it could be fixed locally, but the rollback failed because of authorization constraints.

At 7:30 AM the required access was granted and our engineers successfully rolled back the change. They went straight to work on the error, fixed it, and pushed the fix to the test environment, where it passed testing.

At 8:00 AM our engineers pushed the fix to production. As a precaution, we restarted the servers at 8:30 AM, and the service was confirmed to be 100% stable at 8:52 AM.

Corrective and Preventative Measures

In the last 4 days, we have conducted an internal review and analysis of the outage. We will take the following actions to ensure this issue does not occur again:

  1. All new features will be pushed to the test environment by default.
  2. Only authorized personnel can push tested and approved changes to production.
  3. Detailed information concerning the features or changes being pushed to test should be provided in the commit messages.
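
As an illustration of how the three measures above could be enforced automatically, here is a minimal sketch of a promotion check. It is not our actual deployment tooling: the `Change` fields, the `AUTHORIZED_DEPLOYERS` set, the minimum message length, and the function name are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical values for illustration only.
AUTHORIZED_DEPLOYERS = {"release-manager"}
MIN_COMMIT_MESSAGE_LENGTH = 20


@dataclass
class Change:
    commit_message: str
    tests_passed: bool  # result reported by the test environment
    approved: bool      # change was reviewed and approved


def can_promote_to_production(change: Change, deployer: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for promoting a change to production."""
    if len(change.commit_message.strip()) < MIN_COMMIT_MESSAGE_LENGTH:
        return False, "commit message lacks detail"
    if not change.tests_passed:
        return False, "change has not passed the test environment"
    if not change.approved:
        return False, "change has not been approved"
    if deployer not in AUTHORIZED_DEPLOYERS:
        return False, "deployer is not authorized"
    return True, "ok"


change = Change(
    commit_message="Add payment provider integration to checkout flow",
    tests_passed=False,
    approved=True,
)
allowed, reason = can_promote_to_production(change, "release-manager")
print(allowed, reason)
```

With a check like this in the release path, an untested change such as the one behind this incident would be rejected before reaching production, whoever attempted the push.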

Budgetr is committed to ensuring a seamless service for all our customers, and we constantly improve our technology and operational processes to prevent issues like this one. We sincerely apologize for the inconvenience this issue may have caused you or your businesses, and we appreciate your patience and understanding.

Sincerely,
The Budgetr Team.
