A year ago, we decided to transition to a serverless architecture. Our management was very excited about it, and that excitement resulted in many attempts and failures for developers (including me). So one Monday, we started our working day and realized that one of our Lambdas had been running down a rabbit hole the entire weekend. We were astonished, management was dissatisfied, and I was happy to have new material for this article.
Several of our core microservices rely heavily on S3 event notifications. So what happened?
A developer screwed up: a Lambda invoked itself for the same S3 file that had triggered it in the first place. Those invocations created other S3 files, which started different Lambdas... You get the idea.
The developer wasn't fired or sanctioned in any way, because it's an architectural problem: anyone can make a silly mistake.
We filed a support ticket afterward and were compensated $5k, only because that's how much we had spent before the alarm came through.
We set up budget notifications and created alarms that go to email, a Slack channel, and the mobile phones of key technical staff in the company.
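One way to wire up such an alarm is a CloudWatch billing alarm that fires into an SNS topic, which can then fan out to email, Slack, and SMS. A minimal sketch with the AWS CLI; the alarm name, threshold, and topic ARN are illustrative:

```shell
# Billing metrics live in us-east-1 and must first be enabled
# in the account's billing preferences.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "monthly-spend-over-500" \
  --namespace "AWS/Billing" \
  --metric-name "EstimatedCharges" \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 500 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:billing-alerts"
```

Note that `EstimatedCharges` updates only every few hours, so this catches sustained runaway spend rather than a short burst.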
Most of our Lambdas must have a reserved concurrency limit set.
Most of our Lambdas must be invoked via SQS only.
We also implemented an AWS Config rule to check that all our Lambdas have reserved concurrency configured.
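Both of these safeguards can be set up from the CLI. A sketch, assuming a hypothetical function name; `LAMBDA_CONCURRENCY_CHECK` is AWS's managed Config rule for exactly this check:

```shell
# Cap a single function at 5 concurrent executions.
aws lambda put-function-concurrency \
  --function-name my-s3-processor \
  --reserved-concurrent-executions 5

# Managed AWS Config rule that flags any function
# without a concurrency limit set.
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "lambda-concurrency-check",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "LAMBDA_CONCURRENCY_CHECK"
  }
}'
```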
With reserved concurrency, a function can't be invoked more often than we intend; excess invocations are simply throttled.
And SQS helps us prevent data loss: when the function hits its concurrency limit, messages simply wait in the queue until there is capacity to process them.
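Wiring a function to a queue is one CLI call. A sketch with illustrative names; the `--scaling-config` option additionally caps how much of the function's concurrency this queue alone can consume:

```shell
# Lambda polls the queue on our behalf; throttled batches return
# to the queue after the visibility timeout instead of being dropped.
aws lambda create-event-source-mapping \
  --function-name my-s3-processor \
  --event-source-arn "arn:aws:sqs:us-east-1:123456789012:ingest-queue" \
  --batch-size 10 \
  --scaling-config MaximumConcurrency=5
```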
- Why is there no option to kill all AWS activity once a usage threshold is reached?
- Is it really that complicated to build an intelligent tool that helps AWS customers catch this situation and avoid losing money?