<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aleksandr Zakharov</title>
    <description>The latest articles on DEV Community by Aleksandr Zakharov (@xezed).</description>
    <link>https://dev.to/xezed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F614818%2F32310f9c-bb1d-4d34-8404-3b56c773b13b.jpeg</url>
      <title>DEV Community: Aleksandr Zakharov</title>
      <link>https://dev.to/xezed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xezed"/>
    <language>en</language>
    <item>
      <title>Lessons we've learned after burning many thousands thanks to AWS Lambda. Expect no mercy from AWS.</title>
      <dc:creator>Aleksandr Zakharov</dc:creator>
      <pubDate>Wed, 20 Oct 2021 02:10:55 +0000</pubDate>
      <link>https://dev.to/xezed/lessons-weve-learned-after-burning-many-thousands-thanks-to-aws-lambda-expect-no-mercy-from-aws-39ph</link>
      <guid>https://dev.to/xezed/lessons-weve-learned-after-burning-many-thousands-thanks-to-aws-lambda-expect-no-mercy-from-aws-39ph</guid>
<description>&lt;h2&gt;Preface.&lt;/h2&gt;

&lt;p&gt;A year ago, we decided to transition to a serverless architecture. Our management was very excited about it, and that excitement resulted in many tries and failures for developers (including me). So one Monday, we started our working day and realized that one of our Lambdas had been spiraling down a rabbit hole the whole weekend. We were astonished, management was dissatisfied, and I was happy to have new material for this article.&lt;/p&gt;

&lt;h2&gt;Our setup.&lt;/h2&gt;

&lt;p&gt;Several of our core microservices rely heavily on S3 event notifications. So what happened? &lt;br&gt;
A developer made a mistake and invoked a Lambda from within the same Lambda, for the same S3 file that had originally triggered it. Those invocations created other S3 files, which triggered different Lambdas... You get the idea. &lt;/p&gt;
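&lt;p&gt;The loop can be sketched in Terraform (the resource names and setup here are hypothetical, not our actual configuration). With no event filter, and with the function writing its results back into the same bucket, each output object triggers the function again:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical sketch of the recursive setup.
resource "aws_s3_bucket_notification" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.processor.arn
    events              = ["s3:ObjectCreated:*"]
    # No filter_prefix / filter_suffix, and the function writes its
    # results back into the same bucket -- so every invocation
    # spawns more invocations.
  }
}&lt;/code&gt;&lt;/pre&gt;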

&lt;p&gt;The developer wasn't fired or sanctioned in any way, because it was an architectural problem: anyone can make a silly mistake.&lt;/p&gt;

&lt;h2&gt;How much did we lose? Tens of thousands.&lt;/h2&gt;

&lt;p&gt;We filed a ticket with AWS afterward and were compensated only $5k, because that was how much we had spent before the alarm came through.&lt;/p&gt;

&lt;h2&gt;Precautions we implemented to prevent future incidents.&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We set up budget notifications and created alarms that go to the email, Slack channel, and mobile phones of key technical staff. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most of our Lambdas must have a reserved concurrency limit set.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most of the Lambdas must be invoked via SQS only.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We also implemented an AWS Config rule that checks all our Lambdas for reserved concurrency.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
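&lt;p&gt;Point 1 can be expressed in Terraform roughly as follows; the limit, threshold, and subscriber address are placeholders (Slack and SMS delivery would fan out from an SNS topic via &lt;code&gt;subscriber_sns_topic_arns&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "1000"   # placeholder: monthly limit in USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80   # alert at 80% of actual spend
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["oncall@example.com"]
  }
}&lt;/code&gt;&lt;/pre&gt;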

&lt;p&gt;With reserved concurrency, a function cannot run more concurrent invocations than we allow; in effect, we throttle it. &lt;/p&gt;
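&lt;p&gt;Points 2 and 4 can be sketched like this. &lt;code&gt;LAMBDA_CONCURRENCY_CHECK&lt;/code&gt; is an AWS-managed Config rule; the function definition itself is a placeholder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resource "aws_lambda_function" "processor" {
  function_name = "processor"               # placeholder
  role          = aws_iam_role.lambda.arn   # placeholder
  runtime       = "python3.9"
  handler       = "handler.handle"
  filename      = "processor.zip"

  # Hard cap on parallel executions: beyond this, the function is
  # throttled instead of scaling without bound.
  reserved_concurrent_executions = 10
}

# Managed rule that flags functions without a concurrency limit.
resource "aws_config_config_rule" "lambda_concurrency" {
  name = "lambda-concurrency-check"

  source {
    owner             = "AWS"
    source_identifier = "LAMBDA_CONCURRENCY_CHECK"
  }
}&lt;/code&gt;&lt;/pre&gt;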

&lt;p&gt;And SQS helps us prevent data loss: when the concurrency limit is reached, messages simply stay in the queue until the function can pick up the next one. &lt;/p&gt;
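&lt;p&gt;Point 3 can be wired up with an event source mapping (names are placeholders). Batches that get throttled remain in the queue and are retried rather than dropped:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resource "aws_sqs_queue" "work" {
  name                       = "work-queue"
  visibility_timeout_seconds = 360   # should exceed the function timeout
}

resource "aws_lambda_event_source_mapping" "work" {
  event_source_arn = aws_sqs_queue.work.arn
  function_name    = aws_lambda_function.processor.arn
  batch_size       = 10
}&lt;/code&gt;&lt;/pre&gt;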

&lt;h2&gt;Questions to think about.&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Why is there no option to shut down all AWS activity after a spending threshold is reached? &lt;/li&gt;
&lt;li&gt;Is it really that complicated to build a tool that helps AWS customers catch this situation and avoid losing money?&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cloud</category>
      <category>terraform</category>
    </item>
  </channel>
</rss>
