I've worked in serverless computing for over five years and wanted to share some lessons learned.
By "costly," I don't necessarily mean dollars. "Costly" can also mean difficulty in troubleshooting, difficulty in architecture, and other pain points I've discovered over the years.
There are many more, but this post focuses on the costly findings that have stood out.
An overview of serverless
Read the following post for an overview of serverless.
https://medium.com/serverless-is-cool/a-super-quick-overview-of-serverless-52b1e5cf9e2f
Input injection
Until recently, the OWASP Top Ten listed input injection as the top risk. It is still the top risk in the OWASP Serverless Top Ten.
The OWASP API Security Top Ten lists broken object-level authorization as the highest risk. In other words, if I ask an API for my data, it gives it to me; if I then ask it for another user's data while still logged in with my own account, it gives me that, too. Since I'm not meant to get someone else's data, it seems appropriate that this ranks higher than input injection.
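To make that concrete, here is a minimal sketch of the kind of object-level check that prevents the problem. The handler, table name, and authorizer claim used here are illustrative assumptions, not code from the project described in this post:

```python
import json
import boto3

# Hypothetical table holding per-user records; the names are illustrative only.
table = boto3.resource("dynamodb").Table("orders")

def get_order(event, context):
    # Identity established by the API's authorizer (e.g., a Cognito claim).
    caller_id = event["requestContext"]["authorizer"]["claims"]["sub"]
    order_id = event["pathParameters"]["orderId"]

    item = table.get_item(Key={"orderId": order_id}).get("Item")

    # Broken object-level authorization is the absence of this check:
    # returning the item to any authenticated caller who asks for it.
    if item is None or item.get("ownerId") != caller_id:
        return {"statusCode": 403, "body": json.dumps({"message": "Forbidden"})}

    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```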
Input injection is still high on the list, and it would be important to address that first.
In the scenario above, we designed a Lambda function to receive an event from a DynamoDB stream whenever the table data changed.
The Lambda function then sends a clear cache command to the API Gateway.
This workflow allowed the API Gateway to cache data and its cache to be cleared when there was new data. The DynamoDB table data stayed the same most of the time, and caching would reduce the number of read transactions on the table.
Unfortunately, this resulted in an accidental input injection attack. Somehow, the DynamoDB stream resent old events, some of them very old, and it sent many of them within a short period.
The Lambda function was designed to clear the API Gateway cache when it received an event. The numerous events resulted in multiple commands to the API. Shortly after, the API reached its rate limit and stopped responding to requests.
We experienced a self-induced denial of service attack!
We reworked our Lambda function when we realized our users could not get data because of this DoS event.
The Lambda function would now validate each input rather than assume it was "safe" to process. It would check the timestamp and only process events created within the last few minutes. Furthermore, it would keep track of the last time it cleared the API cache and would not clear it again sooner than five minutes after the previous time.
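Here is a rough sketch of that validation and throttling logic, assuming the function tracks the last flush time in an SSM parameter; the parameter name, environment variables, and thresholds are illustrative, not the project's actual values:

```python
import os
import time
import boto3

apigateway = boto3.client("apigateway")
ssm = boto3.client("ssm")

MAX_EVENT_AGE_SECONDS = 300        # ignore stream records older than a few minutes
MIN_FLUSH_INTERVAL_SECONDS = 300   # never flush the cache more than once per five minutes
LAST_FLUSH_PARAM = "/cache/last-flush"  # hypothetical SSM parameter tracking the last flush

def handler(event, context):
    now = time.time()

    # Validate the input instead of assuming every stream record is safe to act on.
    fresh = [
        r for r in event.get("Records", [])
        if now - r["dynamodb"]["ApproximateCreationDateTime"] <= MAX_EVENT_AGE_SECONDS
    ]
    if not fresh:
        return {"flushed": False, "reason": "no recent records"}

    # Throttle: only flush if at least five minutes have passed since the last flush.
    try:
        last_flush = float(ssm.get_parameter(Name=LAST_FLUSH_PARAM)["Parameter"]["Value"])
    except ssm.exceptions.ParameterNotFound:
        last_flush = 0.0
    if now - last_flush < MIN_FLUSH_INTERVAL_SECONDS:
        return {"flushed": False, "reason": "flushed recently"}

    apigateway.flush_stage_cache(
        restApiId=os.environ["REST_API_ID"],
        stageName=os.environ["STAGE_NAME"],
    )
    ssm.put_parameter(Name=LAST_FLUSH_PARAM, Value=str(now), Type="String", Overwrite=True)
    return {"flushed": True}
```

The key design point is that the function treats every stream record as untrusted input: stale records are dropped, and even valid records cannot trigger more than one cache flush per five-minute window.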
The image above shows some input injection risks and mitigations to consider.
Over-privileged policies
I joined an existing serverless project. I discovered all the Lambda functions within a Serverless Framework service used the default IAM role instead of having one least-privileged role per function.
The image above shows an example of a role with an over-privileged IAM policy.
I emulated an injection attack by sending malicious commands to the AWS CLI. (The Lambda function code allowed me to send Linux commands to the underlying operating system, which had the AWS CLI installed.) I was able to modify other parts of the AWS infrastructure.
An attacker could have issued a malicious command to IAM and started an account takeover.
We created an IAM role for each function and assigned it a unique IAM policy with only the privileges that function needed to execute successfully.
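As a sketch of what "least privilege" looks like in practice, the policy below grants a single function only the DynamoDB actions it uses on a single table; the table, role, and account identifiers are placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical: this function only ever reads and writes one DynamoDB table.
least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        }
    ],
}

iam.put_role_policy(
    RoleName="get-orders-function-role",       # one role per function
    PolicyName="get-orders-least-privilege",
    PolicyDocument=json.dumps(least_privilege_policy),
)
```

In the Serverless Framework, the equivalent is a dedicated role per function (for example, via the serverless-iam-roles-per-function plugin) rather than the shared default role.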
I attempted the same malicious exploit, and the function threw an error.
The image above shows some over-privileged policy risks and mitigations to consider.
Failures and service outages
It is easy to assume that AWS services will never go down or degrade, so designing only for the "happy path" feels like the logical approach. But it is essential to consider the "negative path," where things go wrong.
One day, we experienced degradation in CloudWatch Events rules. (The service has since been rebranded as EventBridge.) Some of our functions failed to work for hours.
Some functions relied on scheduled events to fetch third-party data and store it in our DynamoDB tables. The third-party source only provided new data every six minutes. If we failed to save the data within that window, we would never have access to it again.
We lost data for hours. That affected our user experience and billing.
To remediate this, we started monitoring the health of the AWS services we depended on so we would be notified when there was an outage.
We made sure all functions exited gracefully by using try-catch blocks and logged errors that could be monitored. If our monitoring noticed an increase in logged errors, that was an indicator to investigate.
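A minimal sketch of that pattern, assuming a hypothetical process() function doing the real work:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    try:
        result = process(event)  # hypothetical business logic
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception:
        # Log a structured error that a metric filter or alarm can count;
        # a spike in these entries is the signal to investigate.
        logger.exception("processing_failed", extra={"source": event.get("source")})
        return {"statusCode": 503, "body": json.dumps({"message": "Temporarily unavailable"})}

def process(event):
    # Placeholder for the real work (calling the third-party source, writing to DynamoDB, ...).
    return {"ok": True}
```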
We explored having multi-region scheduled events. For example, if the North Virginia region was down, we could still get scheduled events from the Ohio region.
The image above shows some failure and service outage risks and mitigations to consider.
Cost engineering
We found our AWS bill was high. For almost a year, we accepted it as the "cost of moving" to the cloud.
We decided to investigate whether we were making the best use of the AWS cloud and being as efficient as possible.
We observed our DynamoDB traffic patterns and realized they were irregular. We would save money by moving to on-demand capacity: even though its per-request price is higher than provisioned capacity, we would only pay for the requests we made instead of paying for provisioned throughput that sat idle most of the time.
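The switch itself is a one-line table update; the table name below is a placeholder:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Move a table with irregular traffic from provisioned to on-demand capacity.
dynamodb.update_table(TableName="orders", BillingMode="PAY_PER_REQUEST")
```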
We realized we had old log files and S3 objects that were no longer needed. We took advantage of lifecycle rules to delete old, unnecessary data.
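One way to automate that cleanup is an S3 lifecycle rule plus a CloudWatch Logs retention policy; the bucket, prefix, and log group names below are placeholders:

```python
import boto3

s3 = boto3.client("s3")
logs = boto3.client("logs")

# Expire objects under a prefix after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-exports",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-exports",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},
                "Expiration": {"Days": 90},
            }
        ]
    },
)

# Stop keeping Lambda logs forever.
logs.put_retention_policy(logGroupName="/aws/lambda/get-orders", retentionInDays=30)
```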
An initial version of an app used Kinesis, but we found it was more cost-effective to use SQS.
The image above shows some cost risks and mitigations to consider.
Encryption
We wanted to be super secure and enabled customer-managed KMS keys for all our data. We found our AWS bill increased considerably.
We discovered that AWS-managed keys were less expensive than customer-managed keys (CMKs). We did not have security requirements that forced us to use CMKs, so we moved our encryption to AWS-managed keys where possible.
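For example, on S3 the difference is just which server-side encryption settings you pass when writing an object; the bucket, key, and key alias below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Customer-managed key (CMK): a specific KMS key we create, manage, and pay for.
s3.put_object(
    Bucket="my-app-data",
    Key="report.json",
    Body=b"{}",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-cmk",   # hypothetical key alias
)

# AWS-managed encryption: drop the key ID (or use SSE-S3) and let AWS manage the key.
s3.put_object(
    Bucket="my-app-data",
    Key="report.json",
    Body=b"{}",
    ServerSideEncryption="AES256",
)
```

A customer-managed key carries a monthly per-key charge that AWS-managed keys do not.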
Better security did not provide business value in this case.
The image above shows some encryption risks and mitigations to consider.
Spaghetti
We sometimes created "serverless spaghetti" designs.
Much like "spaghetti code," which is difficult to trace and full of complex dependencies, a serverless design can become a tangle of functions and services that is just as hard to follow.
The image above shows an unnecessarily complex design in an attempt to implement a "pure" and "clean" microservice design.
We reviewed our design and attempted to simplify it. The image above shows how we simplified the previous design example: getting the data directly from the table was cleaner than going through a microservice.
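As a sketch of the simplification, the consuming function reads the record straight from the table instead of calling another service to do it; the table and key names are placeholders:

```python
import boto3

# Instead of invoking another Lambda or an internal API just to fetch a record,
# read it directly from the table.
table = boto3.resource("dynamodb").Table("orders")

def get_order(order_id):
    response = table.get_item(Key={"orderId": order_id})
    return response.get("Item")
```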
The image above shows some serverless spaghetti risks and mitigations to consider.
Watch the LayerOne presentation
Before you go
Here are other posts you might enjoy.
AWS CDK Serverless Cookbook: A Step-by-step Guide to Build a Serverless App in the Cloud
Speed up AWS CDK deploys up to 80%: The ts-node package needed the speed boost
Build Robust Cloud Architectures: Applying Military Design Principles to Cloud Applications
Introduction to Cloud Computing Security: An excerpt from the Serverless Security book