DEV Community

Oleksandr Hanhaliuk

Real-Time Incident Recovery with Event-Driven Microservices Architecture and Early Monitoring

Introduction

It was a regular day. I had just deployed a few more features to our core infrastructure. The release contained one new SQL query and a migration adding new indexes to optimize that query.

I assumed this wouldn’t affect the current DB load or production processes, so I deployed it during the day, in the spirit of the Deploy Early, Deploy Often methodology.

The Problem Begins

The problem was that the affected table had millions of rows and was under constant load during the day, so even one query against the not-yet-indexed table put a lot of load on the database.
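
To illustrate why a missing index matters, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the index name are hypothetical, and SQLite only stands in for the RDS engine, which the story does not name. The query plan changes from a full table scan to an index lookup once the index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Hypothetical table standing in for the large production table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

sql = "SELECT id, total FROM orders WHERE customer_id = ?"

# Before the index exists, the planner has to read every row.
plan_before = query_plan(conn, sql, (42,))

# After the index exists, the planner can look rows up by customer_id.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print(plan_before)
print(plan_after)
```

On a table with millions of rows, the scan in the first plan is repeated for every execution of the query, which is the kind of load described above.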

The deployment process for each stage is as follows:

• First, the code is deployed to that stage’s infrastructure.

• Then, the migration is applied (adding a new index).

• Lastly, end-to-end (e2e) tests are run.

Our CI/CD pipeline runs these steps against the test and staging environments before going to production. However, because we had no load testing on staging, we had never seen any problems there.

The Deployment Incident

The fun started when the deployment to production was completed and the index migration started. Since the table is big, applying the index migration might take a while.

During this time, database CPU load climbed to 90%, which slowed our services down.

I noticed something was wrong when I received an alarm: ApproximateAgeOfOldestMessage in SQS had exceeded its threshold.

Identifying the Issue

• We had set up a low threshold (60 seconds) to observe unusual behaviour early.

• Our infrastructure is based on Lambda and RDS, and resource usage is typically below 50%, even during high traffic.

So, if a message in SQS waits longer than 60 seconds, something is wrong.
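
An alarm like this can be described as a set of parameters for CloudWatch's PutMetricAlarm operation. The sketch below only builds those parameters. The queue name, alarm name, statistic, and evaluation period are assumptions for illustration, not the actual configuration from this story; only the metric name and the 60-second threshold come from the text.

```python
def sqs_age_alarm_params(queue_name, threshold_seconds=60, alarm_actions=()):
    """Build keyword arguments for a CloudWatch PutMetricAlarm call.

    The alarm fires when the oldest message in the queue has been
    waiting longer than `threshold_seconds`.
    """
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",          # assumption: react to the worst case
        "Period": 60,                    # assumption: evaluate once per minute
        "EvaluationPeriods": 1,          # assumption: alert on the first breach
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": list(alarm_actions),  # e.g. an SNS topic ARN
    }


# "orders-events" is a hypothetical queue name.
params = sqs_age_alarm_params("orders-events")
print(params["MetricName"], params["Threshold"])
```

With boto3 installed and AWS credentials configured, these parameters could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.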

While I was looking at metrics and logs, I received another alarm: RDS cluster CPU utilization > 70%. Putting the signals together:

• Messages were waiting in SQS because Lambda was processing them slowly, and Lambda was slow because of the RDS load.

• Traffic metrics showed no spikes, so the extra load was not coming from users.
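
The reasoning above can be written down as a small triage rule. This is only a sketch: the `Snapshot` fields, the `spike_factor` parameter, and the returned labels are hypothetical, while the 60-second and 70% limits are the alarm thresholds mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    oldest_message_age_s: float        # SQS ApproximateAgeOfOldestMessage
    rds_cpu_percent: float             # RDS cluster CPU utilization
    requests_per_min: float            # current incoming traffic
    baseline_requests_per_min: float   # typical traffic for this time of day


def likely_cause(s, age_limit_s=60, cpu_limit=70, spike_factor=1.5):
    """Combine the three signals into a first guess about the cause."""
    backlog = s.oldest_message_age_s > age_limit_s
    db_hot = s.rds_cpu_percent > cpu_limit
    spike = s.requests_per_min > spike_factor * s.baseline_requests_per_min

    if not backlog:
        return "healthy"
    if db_hot and spike:
        return "traffic spike overloading the database"
    if db_hot:
        return "database overloaded without extra traffic: suspect a recent change"
    return "slow consumers: check Lambda errors and concurrency"


# Values resembling the situation described: old messages, hot database, normal traffic.
print(likely_cause(Snapshot(95, 90, 1000, 1000)))
```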

The Cause

Was it a coincidence that the alarms fired right after I deployed a new feature? Apparently not.

I waited a few minutes, but nothing changed: the RDS load was still high. Waiting longer risked a real incident. Meanwhile, the migration had not been applied: the CI/CD pipeline failed with a timeout, and the index was never created.

Rolling Back

I rolled back my deployment, but it was already too late:

• RDS load reached 100%.

• Messages started failing and ending up in the DLQ (dead-letter queue).

I didn’t want to wait until customer support reported an incident.

Resolution

Since ours is a distributed microservice architecture based on events and commands, with retry and failover mechanisms for each part, I was confident that restarting the RDS cluster was safe.

• I restarted the RDS cluster, and everything went fine.
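
To show why retries and a dead-letter queue make such a restart tolerable, here is a simplified in-memory simulation. It is not the actual SQS or Lambda behaviour of this system: the `drain` and `redrive` helpers, the retry limit of 3, and the idea of redriving messages after recovery are assumptions made for illustration.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Raised by the handler while the database cannot serve queries."""


def drain(queue, dlq, handler, max_receive_count=3):
    """Process every message; retry failures, then park them in the DLQ."""
    processed = []
    while queue:
        body, receives = queue.popleft()
        try:
            handler(body)
            processed.append(body)
        except DatabaseUnavailable:
            receives += 1
            if receives >= max_receive_count:
                dlq.append(body)                 # retries exhausted
            else:
                queue.append((body, receives))   # make it available again
    return processed


def redrive(dlq, queue):
    """Move parked messages back to the main queue with a fresh counter."""
    while dlq:
        queue.append((dlq.popleft(), 0))


db_state = {"up": False}  # the database is overloaded or restarting


def handler(body):
    if not db_state["up"]:
        raise DatabaseUnavailable(body)


queue = deque((f"event-{i}", 0) for i in range(3))
dlq = deque()

during_outage = drain(queue, dlq, handler)   # nothing succeeds, messages land in the DLQ
db_state["up"] = True                        # the cluster is back after the restart
redrive(dlq, queue)
after_restart = drain(queue, dlq, handler)   # the same messages are now processed

print(during_outage, after_restart)
```

No message is lost in this sketch: work that fails during the outage is parked rather than dropped, and it can be processed once the database is healthy again.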

To properly apply the index, I created an additional deployment that would:

  1. Apply the index to the table.

  2. Only after that, deploy the new feature.
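
The two-step order above can be sketched as a release function that refuses to deploy until the index is verified. SQLite again stands in for the real database, and the table, the index name, and the `deploy_feature` callback are hypothetical. On a production engine the index build would typically use an online option where one is available (PostgreSQL, for example, offers `CREATE INDEX CONCURRENTLY`); which engine was used here is not stated.

```python
import sqlite3


def index_exists(conn, name):
    """Check SQLite's catalog for an index with the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def release(conn, index_name, create_index_sql, deploy_feature):
    """Step 1: build the index. Step 2: deploy only once it is verified."""
    conn.execute(create_index_sql)
    conn.commit()
    if not index_exists(conn, index_name):
        raise RuntimeError(f"index {index_name} is missing; feature not deployed")
    deploy_feature()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")

deployed = []  # records what the hypothetical deploy step did
release(
    conn,
    "idx_orders_customer",
    "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)",
    lambda: deployed.append("new-feature"),
)
print(deployed)
```

The point of the ordering is that the new query never runs against the table before its index is in place.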

Everything went smoothly. I learned from my mistake and prevented a major incident.

Conclusion

• Embracing a distributed event-driven architecture with efficient monitoring allowed me to detect issues early and mitigate the risk of a major incident.

• Monitoring tools like SQS alarms and RDS CPU metrics played a critical role in identifying the root cause of the issue.

• Rolling back and restarting services in a distributed system helped prevent customer impact, and this experience underlined the importance of load testing in CI/CD environments.
