Real-Time Incident Recovery with Event-Driven Microservices Architecture and Early Monitoring

#aws #cloudwatch #eventdriven #rds

Introduction

It was a regular day. I had just deployed a few more features to our core infrastructure. They included one new SQL query and a migration that added new indexes to optimize that query.

I assumed this wouldn’t affect the current DB load or production processes, so I deployed it during the day, embracing the Deploy Early, Deploy Often methodology.

The Problem Begins

The issue was that the affected table had millions of rows and was under constant load during the day. A single new query against the not-yet-indexed table put a lot of load on the database.

The deployment process for each stage is as follows:

• First, the code is deployed to the production infrastructure.

• Then, the migration is applied (adding a new index).

• Lastly, end-to-end (e2e) tests are run.

Our CI/CD pipeline runs these same steps on the test and staging environments before going to production. However, because we had no load testing in CI/CD on staging, the problem had never shown up there.

The Deployment Incident

The fun started when the deployment to production completed and the index migration began. Since the table is big, applying the index migration can take a while.

During this time, the database CPU load climbed to 90%, which slowed our services down.

I noticed something was wrong when I received an alarm: ApproximateAgeOfOldestMessage in SQS had exceeded its threshold.

Identifying the Issue

• We had set up a low threshold (60 seconds) to observe unusual behaviour early.

• Our infrastructure is based on Lambda and RDS, and the usage is typically below 50%, even during high traffic.

So, if a message in SQS waits longer than 60 seconds, something is wrong.
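An alarm like this can be defined in code. The sketch below only builds the parameters for CloudWatch's PutMetricAlarm call and does not talk to AWS. The queue name, alarm name, and SNS topic ARN are placeholders, and the statistic and evaluation period are assumptions rather than our exact configuration.

```python
def sqs_age_alarm_params(queue_name: str, threshold_seconds: int, sns_topic_arn: str) -> dict:
    """Build keyword arguments for a CloudWatch alarm on the age of the oldest SQS message."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",          # assumption: alert on the worst case in the period
        "Period": 60,                    # seconds per datapoint
        "EvaluationPeriods": 1,          # assumption: fire on the first breaching datapoint
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder queue name and topic ARN, for illustration only.
params = sqs_age_alarm_params(
    "example-commands-queue",
    60,
    "arn:aws:sns:eu-west-1:123456789012:example-alerts",
)

# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the threshold low means the alarm can be noisy if consumers are ever slow by design, so the value only makes sense when normal processing is well under it.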

While I was looking at metrics and logs, I received another alarm: RDS cluster CPU utilization > 70%. Together with the metrics, this led me to two conclusions:

• Messages were waiting in SQS because Lambda was processing them slowly, and Lambda was slow because of the RDS load.

• The load was not caused by traffic: the traffic metrics showed no spikes.
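The reasoning above can be written down as a small decision function. This is a sketch of the thought process, not a tool we actually run: the `Snapshot` type, the `spike_factor` parameter, and the sample numbers are all made up for illustration, while the 60-second and 70% thresholds come from the alarms described here.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metric values read off the dashboards at one moment (hypothetical structure)."""
    oldest_message_age_s: float
    db_cpu_percent: float
    requests_per_min: float
    baseline_requests_per_min: float


def likely_cause(s: Snapshot, age_threshold_s: float = 60,
                 cpu_threshold: float = 70, spike_factor: float = 1.5) -> str:
    """Classify the situation from queue age, database CPU, and traffic."""
    queue_backed_up = s.oldest_message_age_s > age_threshold_s
    db_hot = s.db_cpu_percent > cpu_threshold
    # Assumption: "spike" means traffic well above its usual level.
    traffic_spike = s.requests_per_min > spike_factor * s.baseline_requests_per_min

    if not queue_backed_up:
        return "healthy"
    if traffic_spike:
        return "traffic spike"
    if db_hot:
        return "database overloaded without extra traffic: suspect a recent change"
    return "slow consumers: check Lambda errors and downstream services"


# Illustrative numbers only: old messages, hot database, normal traffic.
incident = Snapshot(oldest_message_age_s=180, db_cpu_percent=90,
                    requests_per_min=1000, baseline_requests_per_min=1000)
print(likely_cause(incident))
```

With these sample values the function lands on the third branch, which is the same conclusion that pointed me at my own deployment.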

The Cause

Was it a coincidence that I had just deployed a new feature and then received alarms? Apparently not.

I waited a few minutes, but nothing changed: the RDS load was still high. Had I waited longer, we might have had an incident. Meanwhile, the migration step in CI/CD hit its timeout, the pipeline failed, and the index was never created.

Rolling Back

I rolled back my deployment, but it was already too late:

• RDS load reached 100%.

• Messages started failing and ending up in the DLQ (dead-letter queue).

I didn’t want to wait until customer support reported an incident.

Resolution

Since our architecture is a distributed microservice architecture based on events and commands, with retry and failover mechanisms for each part, I was confident that restarting the RDS cluster was safe.

• I restarted the RDS cluster, and everything went fine.
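The retry and failover setup is not detailed in this post, but one common building block with SQS is a redrive policy: a message that fails processing becomes visible again and is retried, and only after a set number of receives is it moved to the DLQ. The sketch below builds such a policy as queue attributes. The DLQ ARN and the `max_receive_count` value are placeholders, not our real settings.

```python
import json


def queue_attributes_with_dlq(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build SQS queue attributes that retry a message before sending it to a DLQ."""
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        # Number of receives allowed before SQS moves the message to the DLQ.
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy as a JSON string inside the attributes map.
    return {"RedrivePolicy": json.dumps(redrive_policy)}


# Placeholder ARN, for illustration only.
attrs = queue_attributes_with_dlq(
    "arn:aws:sqs:eu-west-1:123456789012:example-commands-dlq",
    max_receive_count=5,
)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```

Because unprocessed messages stay in the queue while consumers are failing, a short database restart delays work rather than losing it.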

To properly apply the index, I created an additional deployment that would:

• First, apply the index to the table.

• Only after that, deploy the new feature.
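The corrected ordering can be sketched as a tiny step runner that stops at the first failure, so the feature is never deployed before its index exists. The step functions here are hypothetical stand-ins that simply report success; a real pipeline would call the migration tool and the deployment tooling instead.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_in_order(steps: Sequence[Step]) -> List[str]:
    """Run steps one by one and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, step in steps:
        if not step():
            break  # later steps depend on this one, so do not continue
        completed.append(name)
    return completed


# Hypothetical stand-ins for the real pipeline steps.
def apply_index_migration() -> bool:
    return True


def deploy_feature() -> bool:
    return True


def run_e2e_tests() -> bool:
    return True


done = run_in_order([
    ("apply index migration", apply_index_migration),
    ("deploy feature", deploy_feature),
    ("run e2e tests", run_e2e_tests),
])
```

If the migration step fails or times out, `run_in_order` returns before the deploy step runs, which is the opposite of what happened in the original deployment.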

Everything went smoothly. I learned from my mistake and prevented a major incident.

Conclusion

• Embracing distributed event-driven architecture with efficient monitoring allowed me to detect issues early and mitigate the risk of a major incident.

• Monitoring tools like SQS alarms and RDS CPU metrics played a critical role in identifying the root cause of the issue.

• Rolling back and restarting services in a distributed system helped prevent customer impact, and this experience underlined the importance of load testing in CI/CD environments.