Lately I've been involved in reviewing services developed using 'serverless' that had been struggling with performance issues. Rather than talk about all the benefits of serverless, this post looks at some top tips for keeping your service operational and production-ready, gained from analysing these real life issues.
As an engineer, or someone responsible for monitoring a service, the easiest way to find and fix issues is through observability and serverless monitoring tools rather than by focusing on logs. Although AWS provides a service called AWS X-Ray for distributed tracing, there are stronger tools in the partner ecosystem, such as those provided by Lumigo, Thundra, Instana and several others. For example, Lumigo provides some key benefits over X-Ray (thanks to Yan Cui for highlighting some of these):
- support for streams
- dashboard with overview of environment - lambda invocations, most errors, cold starts
- issues page showing recent failures including timeouts
- captures request and response for every HTTP request
- supports auto-scrubbing of sensitive data
- supports searching for specific transactions
- built in alerting
Although there is still value in logging when stepping through complex business logic, you will soon find that these observability tools are your first port of call for understanding the behaviour and current state of your service.
Historically, I'd been involved in writing systems where you would log a failure using an ERROR log level. These logs would be shipped, another component would scrape the files, and alerts would fire based on log patterns. Unfortunately, it is still common to see this same approach replicated in a cloud environment.
My recommended approach today is to use metrics. For example, when your function has finished processing, Lambda sends metrics about the invocation to Amazon CloudWatch. You can set alarms to respond to changes in these metrics, and out of the box Lambda provides a number of Invocation, Performance and Concurrency metrics. You can also create your own custom metrics using the CloudWatch embedded metric format, taking into account your own specific context, which can be alerted on in the same way.
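As a minimal sketch of the embedded metric format, a Lambda function can emit a custom metric simply by printing a structured JSON log line that CloudWatch parses into a metric. The namespace, dimension and metric name below are hypothetical, not from a real service:

```python
import json
import time

def emit_order_metric(order_value: float) -> dict:
    """Build and print an EMF log line that CloudWatch turns into a metric.

    The "OrderService" namespace and "OrderValue" metric name are
    illustrative stand-ins for your own domain-specific metric.
    """
    emf_event = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "OrderService",
                    "Dimensions": [["Environment"]],
                    "Metrics": [{"Name": "OrderValue", "Unit": "None"}],
                }
            ],
        },
        "Environment": "production",
        "OrderValue": order_value,
    }
    # In Lambda, printing to stdout ships the line to CloudWatch Logs,
    # where the EMF parser extracts the metric automatically.
    print(json.dumps(emf_event))
    return emf_event
```

You can then create a CloudWatch alarm on this custom metric just as you would on a built-in Lambda metric.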
Although it is possible to ingest metrics into external systems, this should not be done for the purpose of alerting. There is often a delay due to the asynchronous nature of ingesting metrics; by the time the alert fires, you are already hearing about the errors from disappointed customers.
The Serverless Application Lens of the AWS Well-Architected Framework notes that:
"you can no longer sustain complex workflows using synchronous transactions. The more service interaction you need the more you end up chaining calls that may end up increasing the risk on service stability as well as response time."
The first two fallacies of "the 8 fallacies of distributed computing" are that "the network is reliable" and "latency is zero". Chaining together synchronous HTTP calls leads to a brittle service that is prone to break. This led Tim Bray, a former distinguished engineer at AWS, to state:
“if your application is cloud-native, or large-scale, or distributed, and doesn’t include a messaging component, that’s probably a bug.”
There are plenty of options available to decouple components in the AWS ecosystem. To get a better understanding of the differences between Events, Queues, Topics and Streams it's worth watching this TechTalk by Julian Wood.
Personally, I'm excited at the growing adoption of event-driven applications using an event bus provided by
Amazon EventBridge. Rather than orchestrating calls between components, you can publish an event that represents a significant change in state, and other components can subscribe to these events and respond accordingly. A good place to start here is looking at the talks and blogs by Sheen Brisals on how Lego have adopted EventBridge.
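As a hedged sketch of what publishing such an event involves, the producer builds an entry with a source, detail-type and JSON detail, and passes it to the EventBridge `PutEvents` API. The bus name, source and detail-type below are made-up examples:

```python
import json

def build_order_created_event(order_id: str, total: float) -> dict:
    """Build an EventBridge PutEvents entry for a hypothetical
    'order created' change in state. All names are illustrative."""
    return {
        "EventBusName": "orders-bus",       # hypothetical custom event bus
        "Source": "com.example.orders",     # hypothetical producer source
        "DetailType": "OrderCreated",
        "Detail": json.dumps({"orderId": order_id, "total": total}),
    }

entry = build_order_created_event("order-123", 99.95)
# With boto3 this entry would be published as:
#   boto3.client("events").put_events(Entries=[entry])
# Subscribers then match on Source/DetailType via EventBridge rules,
# without the producer orchestrating calls to each of them.
```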
There is a fantastic article on the AWS Builders Library that talks about 'Timeouts, retries and backoff with jitter'.
When you design a service, you need to be aware of the timeout settings and retry behaviour of all components in the flow of data from the frontend all the way to the backend. There is no point setting a Lambda timeout of 60 seconds if it is triggered by API Gateway with a default timeout of 29 seconds that cannot be increased.
Alongside the timeout within a service, there are also timeout settings within the SDK. The default behaviour for the AWS SDK is different based on the service being invoked and the specific driver being used as highlighted in this knowledge centre article.
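To make the interplay concrete, here is a small sketch of my own (not from the knowledge centre article) for budgeting per-attempt SDK timeouts so that all retries fit inside the upstream API Gateway limit of 29 seconds; the commented lines show where the equivalent botocore `Config` settings would go:

```python
def per_attempt_timeout(upstream_budget_s: float, max_attempts: int,
                        overhead_s: float = 1.0) -> float:
    """Split an upstream timeout budget across SDK retry attempts.

    upstream_budget_s: total time the caller (e.g. API Gateway, 29 s) allows.
    max_attempts: the initial attempt plus any retries.
    overhead_s: slack reserved for function startup and response handling.
    """
    usable = upstream_budget_s - overhead_s
    return usable / max_attempts

# With a 29 s API Gateway limit and 3 total attempts, each attempt
# gets roughly 9.3 seconds rather than the SDK's much larger defaults.
timeout = per_attempt_timeout(29.0, 3)

# The equivalent boto3/botocore configuration would look like:
#   from botocore.config import Config
#   cfg = Config(connect_timeout=2, read_timeout=timeout,
#                retries={"max_attempts": 2})
#   client = boto3.client("dynamodb", config=cfg)
```

The point is that timeouts are derived from the tightest upstream constraint, not chosen independently per component.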
After saying that you should focus on asynchronous communication, there are times when you have no choice but to make a synchronous call. This may be because you are reliant on an external interface. In that case, if the external API is failing, best practice is to fail fast by opening the circuit and responding to the calling client immediately, whilst probing the API in the background until it recovers and the circuit can be closed again. This is the basis of the Circuit Breaker design pattern. Jeremy Daly has written up a number of Serverless Microservice Patterns for AWS.
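A minimal in-memory sketch of the pattern, for illustration only; in Lambda you would typically persist the circuit state somewhere durable such as DynamoDB, since execution environments are recycled:

```python
import time

class CircuitBreaker:
    """Tiny circuit breaker: opens after `max_failures` consecutive
    failures, then allows a trial call once `reset_after_s` has passed."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Fail fast instead of waiting on a broken dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: let a single trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Once the threshold is hit, callers get an immediate error rather than each waiting out a full timeout against the failing API.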
The bottom line is the more you understand both the programming language and the AWS services you are using, the better the chances of delivering a resilient, secure and highly available service.
At a high level, I will break down optimising your service into three main areas:
a) General Optimisations
There is a set of optimisations that has emerged over time when working with AWS Lambda functions. AWS publishes a set of best practices for working with AWS Lambda functions.
Having a good knowledge of the AWS Lambda lifecycle alongside associated configurations is essential. Some examples include:
- Ensure HTTP keep-alive is enabled if using Node.js
- Reduce package size to reduce cold start times using tools like
- Use the AWS Lambda Power Tuning tool by Alex Casalboni to determine the right memory/power configuration and instruction set architecture
- Use reserved concurrency to limit the blast radius of a function where appropriate. I have seen an example where, because this was not set during a batch load, the entire AWS Lambda concurrent execution quota for an account was consumed
- Initialise SDK clients and database connections outside the function handler
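The last point can be sketched as follows; the `Database` class is a hypothetical stand-in for an expensive client such as a boto3 client or a database connection pool. Anything created at module level runs once during the init phase and survives across warm invocations of the same execution environment:

```python
class Database:
    """Hypothetical stand-in for an expensive client (boto3, DB pool...)."""
    instances_created = 0

    def __init__(self):
        Database.instances_created += 1  # simulates costly connection setup

    def query(self, key):
        return {"key": key}

# Created once per execution environment, during the init phase,
# not on every invocation.
db = Database()

def handler(event, context):
    # Warm invocations reuse the module-level connection.
    return db.query(event["key"])
```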
b) Reduce network calls
Each time a call is made over the network, additional latency and error handling is introduced. These calls should be reduced to the minimum possible, and two specific examples come to mind.
The first is where a call was made to DynamoDB to check that a record didn't exist before a second call was made to put an item. In this case, using a condition expression enabled this to be done in a single call.
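As a hedged sketch, the single-call version builds a `PutItem` request with a condition expression so the write fails atomically if the key already exists; the table and attribute names below are made up:

```python
def build_put_if_absent(table: str, user_id: str, name: str) -> dict:
    """Build PutItem kwargs that only succeed when the item is new.

    The "users" table and "pk" key name are illustrative. With boto3
    this dict would be unpacked into dynamodb.put_item(**kwargs); a
    duplicate key raises ConditionalCheckFailedException instead of
    silently overwriting the existing item.
    """
    return {
        "TableName": table,
        "Item": {
            "pk": {"S": user_id},
            "name": {"S": name},
        },
        # Rejects the write if an item with this pk already exists,
        # replacing the separate "does it exist?" read.
        "ConditionExpression": "attribute_not_exists(pk)",
    }

request = build_put_if_absent("users", "user-1", "Ada")
```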
The second involved multiple calls to DynamoDB to retrieve data one after the other. In this case, the challenge was down to poor schema design. The growing adoption of single-table design in DynamoDB is driven by the desire to reduce the number of calls needed.
c) Parallelise where possible
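Where two downstream calls are independent of each other, issuing them concurrently keeps overall latency close to the slowest call rather than the sum of both. A minimal `asyncio` sketch with hypothetical fetchers:

```python
import asyncio

async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.05)          # stands in for a network call
    return {"user": user_id}

async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0.05)          # stands in for a network call
    return [{"order": 1}]

async def load_dashboard(user_id: str) -> list:
    # Both coroutines run concurrently; total wait is ~0.05 s, not ~0.10 s.
    return await asyncio.gather(fetch_profile(user_id),
                                fetch_orders(user_id))

profile, orders = asyncio.run(load_dashboard("user-1"))
```

The same idea applies with `Promise.all` in Node.js, or at the architecture level with parallel states in Step Functions.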
NOTE: some optimisations are always worth doing, since they are essentially free to carry out. Others will take up engineering time to set up, run, analyse and refactor. It makes sense to focus on the most frequently invoked services or functions, or those that observability tools have highlighted as poorly performing. Always be careful not to over-optimise without taking engineering effort into account.
Finally, remember that AWS Lambda is not a silver bullet. When AWS Lambda was first launched, the configuration options were limited, which made it a simple service to use. Now, with Lambda extensions, layers, a timeout of up to 15 minutes, and up to 10 GB of memory allocation, all kinds of use cases have been opened up. However, this often leads to writing a new function being the default choice. If you are writing an ETL batch job using AWS Lambda where you are having to chunk down files to process within the 15 minute timeout and 10 GB of RAM, it is likely there are better "serverless" options out there.
One thing better than writing a function that you have to manage and maintain is not writing a function at all; in this way we are moving to an approach of "favouring configuration over code".
At the last AWS re:Invent, there was an announcement for AWS Glue Elastic Views, which is currently in preview. This removes the need to write custom functions to combine and replicate data. More recent is the announcement that Step Functions now has direct integrations with over 200 AWS services. Again, this removes the need to write custom code.