Taavi Rehemägi for Dashbird

Posted on Oct 29, 2020 • Edited on Apr 22, 2021 • Originally published at dashbird.io

Solving the Challenges of Serverless at Scale

#aws #serverless #devops

Best Practices of Serverless at Scale

A serverless application in its infancy looks and runs vastly differently from one at scale. When there are more components to manage, the key to operational excellence is rooted in serverless best practices. Dashbird was created with the mission to help developers succeed with modern cloud environments, no matter their size. As experienced developers ourselves, we've faced and understand the challenges of running serverless architecture at scale. In this article, we run through the common serverless challenges, and the architectural patterns and best practices to combat them.

Find out more about scalable serverless designs for enterprises.

Exploring the Challenges

As with anything, we should be constantly aspiring to catch problems sooner rather than later. Here is an example of an established but early-stage serverless application:

As you can see, its workflow is simple and there is minimal load, meaning the requests, execution times and concurrency are all manageable.

In just a few months, that same architecture can look like this:

As load increases, the existing infrastructure comes under stress. This is a great exercise in identifying the potential points of failure in your system, and the scenarios in which they could occur. In this example, you can see clearly how each resource has its own limit, leading to either failure or performance degradation. It's important to remember that while different services have different API limits and throttling limits, failures can also happen through configuration mistakes and code errors.

Common issues at higher loads:

Lambda Concurrency

Lambda concurrency is the number of requests that your function serves at any time. A good formula for estimated Lambda concurrency is:

Average Execution Time (in seconds) × Average Requests Per Second = Estimated Concurrency

This helps to determine the number of containers that'll be used simultaneously. With this in mind, let's remind ourselves of some default AWS limits in place.
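As a quick illustration of the formula above, here is a minimal sketch in Python. The function name and the input figures are illustrative assumptions, not measurements from a real workload.

```python
import math


def estimate_concurrency(avg_execution_seconds: float, avg_requests_per_second: float) -> int:
    """Estimated concurrency = average execution time (s) * average requests per second."""
    # Round up: a partially used container is still a container.
    return math.ceil(avg_execution_seconds * avg_requests_per_second)


# Illustrative numbers only: 2.5 s average execution time at 3,000 requests per second.
needed = estimate_concurrency(2.5, 3000)
print(needed)  # 7500
```

Comparing the result against your limits tells you early whether a projected load is likely to be throttled.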

Function-Based Burst Limits

Burst limits can still be hit even when overall concurrency is within bounds. Functions have an initial burst limit of between 500 and 3,000 concurrent executions (region dependent), after which concurrency can scale up by a further 500 every minute.
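To get a feel for what the burst limit means in practice, the sketch below estimates how long a function would need to scale from its initial burst to a target concurrency. It assumes the figures quoted in this section (an initial burst, then 500 more per minute); the function name is hypothetical, and the current AWS documentation should be checked for the limits that apply to your region.

```python
import math


def minutes_to_reach(target: int, initial_burst: int, step_per_minute: int = 500) -> int:
    """Minutes of sustained scaling needed after the initial burst to reach `target`."""
    if target <= initial_burst:
        return 0  # the initial burst already covers the target
    return math.ceil((target - initial_burst) / step_per_minute)


# Illustrative: reaching 7,500 concurrent executions from two different burst limits.
print(minutes_to_reach(7500, 3000))  # 9
print(minutes_to_reach(7500, 500))   # 14
```

Requests arriving faster than this ramp allows are throttled, which is why sudden spikes can fail even when the steady-state estimate looks safe.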

Account Wide Limit

This is a soft limit, built in for your protection. By default, it's set to 1,000 concurrent executions, but it can be increased.

API Gateway Limit

There is a 10k requests-per-second limit per region, which can be increased as needed. However, the 5k concurrency burst limit and the 29-second timeout limit cannot be changed.

Other AWS API Limits

All AWS APIs have limits, which is important to factor in when building and mapping out your application for scale. For example, KMS has a limit between 5,500-10,000 requests per second, depending on the region.

As your application scales, or if it often experiences spiky loads, these limits need to be kept in mind for stable performance.

Architectural Patterns and Best Practice

An unoptimized at-scale serverless application would look like this:

With so many requests per second, the stress becomes clear as other resources multiply. For a relational database, 3,000 new connections per second is a huge load and can cause lag in your system. Additionally, the 7,500 containers now needed increase your costs significantly.

These are our top tips for code-level optimizations to help with this.

Keep all setup in the initialization phase, and only connect to the database once the KMS query results have been cached. This way, each execution runs only your main logic.
Keep orchestration out of code.
Manage all connections outside of the handler code.
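The first and third tips can be sketched as follows. To keep the example runnable anywhere, `decrypt_db_password` and `FakeConnection` are hypothetical stand-ins for a real KMS decrypt call and a real database driver, and the `CALLS` counter exists only to show how often each one runs.

```python
# Hypothetical stand-ins: in a real function these would be a KMS decrypt
# call and a database driver's connect(). CALLS just counts invocations.
CALLS = {"decrypt": 0, "connect": 0}


def decrypt_db_password() -> str:
    CALLS["decrypt"] += 1
    return "example-password"


class FakeConnection:
    def __init__(self, password: str) -> None:
        CALLS["connect"] += 1
        self.password = password

    def query(self, user_id: str) -> dict:
        return {"user_id": user_id, "status": "active"}


# Initialization phase: runs once per container (cold start), not per request.
DB_PASSWORD = decrypt_db_password()       # cache the decrypted secret first
CONNECTION = FakeConnection(DB_PASSWORD)  # then open the connection once


def handler(event: dict, context: object) -> dict:
    # Only the main logic runs on each invocation.
    return CONNECTION.query(event["user_id"])
```

Calling `handler` repeatedly in the same container reuses `DB_PASSWORD` and `CONNECTION`, so the decrypt and connect steps are paid for once rather than on every request.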

Using the above, the optimized at-scale serverless architecture now looks like this:

You can see a huge reduction in execution time, as the connection no longer needs to be established on every invocation, and a reduction in the total number of connections, resulting in far smoother performance.

Additional Serverless Patterns to Question

Do you need an API response?

A habit we can fall into is always returning a detailed database response from the API, when sometimes a simple acknowledgment is all that's needed. By returning only an acknowledgment, you can decouple the database from the KMS request and create an asynchronous processing model using SQS and Lambda, allowing you to set your own concurrency limit and control the load. There is no change to the model.
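The shape of this acknowledge-then-process model can be sketched as below. This is a simplified illustration, not a production setup: an in-memory `queue.Queue` stands in for SQS so the example runs anywhere, and the handler names and `PROCESSED` list are hypothetical.

```python
import json
import queue

# In-memory stand-in for an SQS queue so the sketch runs without AWS access.
WORK_QUEUE = queue.Queue()
PROCESSED = []


def api_handler(event: dict, context: object) -> dict:
    """Front-end function: enqueue the work and acknowledge immediately."""
    WORK_QUEUE.put(json.dumps(event["body"]))
    return {"statusCode": 202, "body": json.dumps({"accepted": True})}


def worker_handler(batch_size: int = 10) -> int:
    """Back-end function: drain up to batch_size messages and do the slow work."""
    handled = 0
    while handled < batch_size and not WORK_QUEUE.empty():
        message = json.loads(WORK_QUEUE.get())
        PROCESSED.append(message)  # the decrypt and database write would happen here
        handled += 1
    return handled
```

Because the worker pulls from the queue at its own pace, its concurrency can be capped independently of how fast requests arrive at the API.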

Definitely need an API response?

If an API response is needed, there are a few optimization tweaks to consider.

Switch to a serverless database such as DynamoDB (non-relational) or Aurora Serverless. Using the HTTP interface and the proxy/cache elements, there is no connection limit to manage, and a non-relational model such as DynamoDB's means less lag and slowness.
Implement client retries and backoffs, to wait for the response outside of the synchronous call*.
Implement webhooks or polling for long tasks*.

*These features may have a negative impact for the client, however at a very high scale, the compromise can be worthwhile.
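Client retries with backoff, mentioned in the list above, might look like the following minimal sketch. The function name and default values are assumptions for illustration; a real client would normally retry only on throttling or transient errors rather than on every exception.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(); on failure, wait with exponential backoff plus jitter and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the waiting behavior can be replaced in tests or adapted to an async runtime.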

Don't orchestrate in code

The purpose of serverless is to keep code focused on business logic, meaning that the parts of your serverless application that offer undifferentiated value can be handed off to managed services. Make use of the best services to support your application's functionality.

Additionally, don't wait in code; instead, use Step Functions to run tasks in parallel and to enable automatic triggers and retries. This is one of the most effective optimizations many of our customers have seen, in terms of both performance and cost reduction.
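As a rough sketch of what this looks like, the snippet below builds a small Step Functions state machine definition (Amazon States Language) as a Python dictionary, with two tasks running in parallel and a retry policy on each. The function ARNs, state names, and retry values are hypothetical placeholders.

```python
import json

# Hypothetical function ARNs; replace with your own.
RESIZE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:resize-image"
SCAN_ARN = "arn:aws:lambda:us-east-1:123456789012:function:scan-image"

# Retry policy handled by Step Functions instead of by loops in handler code.
RETRY = [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}]


def task(resource_arn: str) -> dict:
    """A single Task state that ends its branch and retries on failure."""
    return {"Type": "Task", "Resource": resource_arn, "Retry": RETRY, "End": True}


state_machine = {
    "StartAt": "ProcessInParallel",
    "States": {
        "ProcessInParallel": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "Resize", "States": {"Resize": task(RESIZE_ARN)}},
                {"StartAt": "Scan", "States": {"Scan": task(SCAN_ARN)}},
            ],
            "End": True,
        }
    },
}

print(json.dumps(state_machine, indent=2))
```

With the waiting and retrying expressed in the definition, no function sits idle and billed while another one finishes.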

Tackling Operating and Monitoring Challenges

With the benefits of serverless comes a host of new monitoring challenges to overcome, which is where Dashbird can provide value and expertise.

Challenges Using Managed Services

There is no code access like we are used to. It's no longer a case of attaching an agent to the API to send a failure alarm; instead, we have a more abstract control panel to work from.
Serverless components also have a huge amount of data output, with each resource providing logs, tracing data, errors, and configuration data; it rapidly piles up.
Failures are very specific to the service used. The issues found in API Gateway differ from those found in Lambda, for example, which emphasizes the need for deep knowledge of individual services and all their possible errors.
The large scale of these systems means that problems are potentially larger and more widespread.

Challenges Using a Distributed System

There is a lot of surface area to manage. There can be hundreds or thousands of parts to your infrastructure, which naturally increases the likelihood of failures, errors, and vulnerabilities that attackers can exploit.
It's a dynamic and ever-changing system, adapting to demand and requirements.
Understanding the resource relationships and their interactions is new in the serverless world.

Dashbird is built on three core pillars that target all these issues:

Centralized Observability and Visualization
Automated Failure and Error Alert
Actionable Well-Architected and Best Practice Insights

Centralized Observability and Visualization

It's important to make the mass of data output that is already available work efficiently for us. Democratizing data breaks down traditional silos and enables users to navigate their own data more easily through customizable queries and searches. Dashbird's prebuilt views and simple dashboards offer visualizations of your data for easier and quicker understanding.

The centralized platform offers dynamic resource management, where you're able to understand resource relationships and view your entire application in one place.

Automated Failure and Error Alert

Monitoring is only effective if there is continuous alert coverage across your entire infrastructure. Dashbird provides out-of-the-box automated alerts that notify you of failures and errors, and that integrate seamlessly into a developer's workflow by being sent in real time via Slack or email.

Dashbird also proactively listens to log and metric data, meaning that any potential negative trends (not yet failures) are highlighted and can be investigated before they escalate.

Actionable Well-Architected and Best Practice Insights

Building serverless applications requires consistent best practice habits, which can be difficult to maintain or even start. Using the AWS Well-Architected lens, Dashbird helps to ensure your system is built and fixed based on industry-standard best practices.

The Insights Engine detects non-binary issues such as delays, consumption issues, or limits, enabling users to take action to improve and optimize their architecture so that it is reliable at any scale. Within its periodic assessments, Dashbird also helps to instill strong security and compliance practices, discovering areas needing encryption, inactive resources, and over- or under-provisioned components, all of which can increase exposure to attacks.
