Preface.
A year ago, we decided to transition to a serverless architecture. Our management was very excited about it, and that excitement resulted in many tries and failures for developers (including me). Then one Monday we started our working day and realized that one of our Lambdas had been spiraling down a rabbit hole all weekend. We were astonished, management was dissatisfied, and I was happy to have new material for this article.
Our setup.
Several of our core microservices rely heavily on S3 event notifications. So what happened?
A developer made a mistake: a Lambda ended up invoking itself for the same S3 file that had triggered it in the first place. These invocations created other S3 files, which triggered different Lambdas... You get the idea.
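To make the failure mode concrete, here is a minimal sketch of that kind of recursive trigger (the bucket layout and names are hypothetical, not our actual code): a Lambda fired by an S3 notification writes its output back into the same bucket that fires it, so every run schedules the next one.

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an s3:ObjectCreated:* notification on the bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        processed = body.upper()  # stand-in for the real processing

        # BUG: writing back into the bucket that triggers this Lambda.
        # The new object emits another notification, which invokes this
        # function again, and so on, indefinitely.
        s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=processed)
```

The usual fixes are writing results to a separate output bucket, or scoping the S3 notification to a prefix the function never writes to.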
The dev wasn't fired or sanctioned in any way, because it's an architectural problem: anyone can make a silly mistake.
How much did we lose? Tens of thousands of dollars.
We filed a support ticket afterward and were compensated only $5k, because that's how much we had spent before the alarm came through.
Precautions we implemented to prevent future incidents.
We set up budget notifications and created alarms that go to email, a Slack channel, and the mobile phones of key technical figures in the company.
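As a sketch of the budget side (the amount, email address, and SNS topic ARN below are placeholders), setting up such a notification with boto3 looks roughly like this; the SNS topic can then fan out to Slack and phones:

```python
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "monthly-cost-guard",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            # Fire when actual spend crosses 80% of the monthly limit.
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "oncall@example.com"},
                # SNS topic that fans out to Slack and SMS.
                {
                    "SubscriptionType": "SNS",
                    "Address": "arn:aws:sns:us-east-1:123456789012:billing-alerts",
                },
            ],
        }
    ],
)
```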
Most of our Lambdas must have a reserved concurrency limit set.
Most of our Lambdas must be invoked via SQS only.
We also implemented an AWS Config rule that checks all our Lambdas for reserved concurrency.
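Our exact rule isn't reproduced here, but a custom Config rule doing this check can be sketched roughly as follows (the handler follows the standard custom-rule contract; treat the details as illustrative):

```python
import json
import boto3

config = boto3.client("config")
lambda_client = boto3.client("lambda")

def handler(event, context):
    # AWS Config invokes custom rules with the configuration item
    # serialized inside event["invokingEvent"].
    item = json.loads(event["invokingEvent"])["configurationItem"]

    if item["resourceType"] != "AWS::Lambda::Function":
        return

    # get_function_concurrency returns no ReservedConcurrentExecutions
    # key when the function has no reserved concurrency configured.
    concurrency = lambda_client.get_function_concurrency(
        FunctionName=item["resourceName"]
    )
    compliant = "ReservedConcurrentExecutions" in concurrency

    config.put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item["resourceType"],
                "ComplianceResourceId": item["resourceId"],
                "ComplianceType": "COMPLIANT" if compliant else "NON_COMPLIANT",
                "OrderingTimestamp": item["configurationItemCaptureTime"],
            }
        ],
        ResultToken=event["resultToken"],
    )
```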
With reserved concurrency, a function can't be called more often than we allow; excess invocations are essentially throttled.
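Setting the cap itself is a single call (the function name and limit below are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the function at 10 concurrent executions; invocation number 11
# is throttled instead of silently scaling (and billing) further.
lambda_client.put_function_concurrency(
    FunctionName="process-uploads",
    ReservedConcurrentExecutions=10,
)
```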
And SQS helps us prevent data loss: when the function hits its concurrency limit, messages simply wait in the queue until Lambda can pick them up again.
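Wiring a queue to a function is likewise one call (the ARN and names are again placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Drive the function from a queue instead of direct invocations.
# While the function is throttled, unprocessed messages stay in the
# queue (or eventually a dead-letter queue) rather than being lost.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:uploads-queue",
    FunctionName="process-uploads",
    BatchSize=10,
)
```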
Questions to think about.
- Why is there no option to kill all AWS activities after reaching some usage threshold?
- Is it really that complicated to create an intelligent tool that helps AWS customers catch this situation and avoid losing money?
Top comments (6)
Could you provide a bit more insight into the architecture of your system and what exactly triggered this crazy bill? Your suggestions are absolutely valid, but it would be useful to understand the context and the causes.
Thank you for the suggestion :)
Will provide more details. Just wanted to warn anyone who deals with serverless and Lambda that this situation is not something unheard of.
Sorry to hear that :( FWIW you're not alone, I've heard many cloud cost horror stories over the years.
Cost estimation can be pretty complicated. With github.com/infracost/infracost we're building an open source tool to help engineers get cost estimates before launching resources. Initially we're focused on Terraform (so it couldn't have helped you in this case, unfortunately), but we have plans to go beyond that. It's taken a fairly large community over 6 months to code the price mappings for ~200 cloud resources across AWS/Azure/GCP.
What don't you love about the AWS Config solution you implemented? I'm wondering if we should code up the precautions from your article into infracost...
If you are using Node.js, then just use this in the future:
github.com/getndazn/dazn-lambda-po...
Woof, thank you :)
but unfortunately our choice is Python
We at Lumigo have a solution for that, and it sounds like we can add some other value points.
Would love to schedule an intro call to show you how you can use us :)
Yehuda@lumigo.io