Wojciech Matuszewski for AWS Community Builders

Posted on Apr 27, 2022

Processing large payloads with Amazon API Gateway asynchronously

#aws #serverless #apigw #sfn

In my previous article, I talked about synchronous AWS Lambda & Amazon API Gateway limits and what to do about them.

As stated in the blog post, the ultimate solution for a "big payload" problem is making the architecture asynchronous. Let us then zoom in on the aspect of asynchronous communication in the Amazon API Gateway service context and build a serverless architecture based on the notion of "pending work".

Problem refresher

The following illustrates the Amazon API Gateway & AWS Lambda payload size problem I touched on in the previous article.

No matter how hard you try, you are unable to synchronously send more than 6 MB of data to your AWS Lambda function. The limit can seriously mess with your significant data-processing needs.

Luckily there are ways to process much more data via AWS Lambda using asynchronous workflows.

The starting line - pushing the data to the storage layer

Suppose you were to ask the community about the potential solution to this problem. In that case, I wager that the most common answer would be to use Amazon S3 as the data storage layer and utilize presigned URLs to push the data to Amazon S3 storage.

While very universal, the flow of presigned URL to Amazon S3 can be nuanced, especially since one can create the presigned URL in two ways.

Describing these would make this article a bit too long for my liking, so I'm going to refer you to this great article by my colleague Zac Charles, who did the topic much more justice than I could ever do.
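For orientation only, here is a minimal sketch of both variants. It assumes the "two ways" are a presigned PUT URL and a presigned POST, which is a reading of the text rather than something the linked article is quoted on. The boto3 S3 client calls `generate_presigned_url` and `generate_presigned_post` are real; the 100 MB cap and the default expiry are hypothetical values, and the client is passed in as a parameter so the snippet does not need credentials to be imported.

```python
from typing import Any, Dict

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # hypothetical 100 MB cap, pick your own


def presigned_put(s3_client: Any, bucket: str, key: str, expires_in: int = 300) -> str:
    """Return a URL the client can HTTP PUT the object body to."""
    return s3_client.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


def presigned_post(s3_client: Any, bucket: str, key: str, expires_in: int = 300) -> Dict[str, Any]:
    """Return a URL plus form fields for a multipart/form-data POST.

    The POST policy lets the server side constrain the upload size.
    """
    return s3_client.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        Conditions=[["content-length-range", 1, MAX_UPLOAD_BYTES]],
        ExpiresIn=expires_in,
    )
```

In a real function you would pass `boto3.client("s3")` as `s3_client` and hand the returned URL (or URL and fields) back to the caller through Amazon API Gateway.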

I have the data in Amazon S3. Now what?

Before processing the data, our system must know whether the client used the presigned URL to push the data to the storage layer. To the best of my knowledge, there are two options we can pursue here (please let me know if there are other ways to go about it, I'm very keen to learn!)

I omitted the AWS CloudTrail to Amazon EventBridge flow on purpose as I think engineers should favor the direct integration with Amazon EventBridge instead.

S3 Event Notifications

Until recently, Amazon S3 event notifications were the de facto way of knowing whether data had landed in Amazon S3.

While sufficient for most use-cases, the feature is not without its problems, the biggest of which, I would argue, are the misunderstandings around event filtering and IaC implementation.

The most essential thing to have in mind when it comes to event filtering is that you cannot use wildcards for prefix or suffix matching in your filtering rules. There are more things to consider, though. If you plan to use the filtering feature of S3 event notifications, I strongly encourage you to read this documentation page thoroughly.
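To make the wildcard point concrete, here is a small local model of how prefix and suffix rules behave. It is a sketch of the documented starts-with / ends-with semantics, not Amazon S3's implementation, and the function name is made up for illustration.

```python
def matches_notification_filter(key: str, prefix: str = "", suffix: str = "") -> bool:
    """Model S3 notification key filtering as literal starts-with / ends-with."""
    return key.startswith(prefix) and key.endswith(suffix)


# A prefix written as a glob is compared literally, so it does not match.
print(matches_notification_filter("uploads/2022/data.csv", prefix="uploads/*/"))  # False

# Plain prefix and suffix values behave as expected.
print(matches_notification_filter("uploads/2022/data.csv", prefix="uploads/", suffix=".csv"))  # True
```

If you need "any folder under uploads/" style matching, the filtering has to happen either in the consumer or in a service with richer matching rules.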

On the IaC side of things, know that, when creating the AWS CloudFormation template, you might end up with circular dependency problems. Deployment frameworks like AWS CDK will make sure that never happens. However, I still think you should be aware of this potential problem, even if you use deployment frameworks.

EventBridge events

In late 2021, AWS announced Amazon EventBridge support for S3 Event Notifications. The announcement had a warm welcome in the serverless community as EventBridge integration solves most of the "native" S3 event notifications problems.

Utilizing Amazon EventBridge for S3 events gives you more extensive integration surface area (you can forward the event to more targets) and better filtering capabilities (one of the strong points of Amazon EventBridge). The integration is not without its problems, though.

For starters, the events are sent to the "default" bus, and using EventBridge might be more costly for high event volumes. To learn more about different caveats, consider giving this great article a read.
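As an illustration of the filtering side, the sketch below builds an EventBridge event pattern for S3 "Object Created" events and pulls the object location out of a delivered event. The `source`, `detail-type` and `detail` field names follow the S3 event format for EventBridge; the helper names, bucket and prefix are hypothetical, and the bucket is assumed to have EventBridge delivery switched on.

```python
from typing import Any, Dict, Tuple


def object_created_pattern(bucket: str, key_prefix: str) -> Dict[str, Any]:
    """Event pattern matching new objects under key_prefix in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            # Prefix matching is one of the content filters EventBridge offers.
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }


def object_from_event(event: Dict[str, Any]) -> Tuple[str, str, int]:
    """Return (bucket, key, size in bytes) from an S3 Object Created event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"], detail["object"]["size"]
```

The pattern dictionary can be serialized with `json.dumps` and used as the rule's event pattern in whichever IaC tool you deploy with.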

Munching on the data

You have the data in the Amazon S3 storage, and you have a way to notify your system about that fact. Now what? How could you process the data and yield the result back to the user?

Given the nature of AWS, for better or worse, there are multiple ways to move forward. We will start from the "simplest" architecture and work our way up to deploying an orchestrator with shared, high-speed data access.

Processing with a dedicated AWS Lambda

The AWS Lambda service is often called the "Swiss Army knife" of serverless. In most cases, processing the data in-memory within the AWS Lambda is sufficient.

The only limitations are your imagination and the AWS Lambda service timeout – the 15-minute maximum function runtime. Please note that in this setup, your AWS Lambda function must spend some of that time downloading the object.
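A minimal sketch of what such a function might do, assuming it is triggered by a native S3 event notification: find the object, then work through it in chunks while keeping an eye on the remaining time. `context.get_remaining_time_in_millis()` is part of the Lambda Python runtime; the 30-second safety margin, the status strings and the helper names are assumptions made for this example.

```python
from typing import Any, Dict, Iterable, Tuple
from urllib.parse import unquote_plus

SAFETY_MARGIN_MS = 30_000  # hypothetical buffer left for saving results


def locate_object(event: Dict[str, Any]) -> Tuple[str, str]:
    """Extract bucket and key from the first S3 notification record."""
    s3 = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    return s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def process_chunks(chunks: Iterable[bytes], context: Any) -> Dict[str, Any]:
    """Process chunks until done or until the time budget is nearly spent."""
    processed = 0
    for chunk in chunks:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return {"status": "OUT_OF_TIME", "bytes_processed": processed}
        processed += len(chunk)  # placeholder for the real work
    return {"status": "FINISHED", "bytes_processed": processed}
```

In a deployed handler, the chunks would typically come from the streaming body returned by the S3 `get_object` call, so the download time counts against the same budget.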

I'm vague about saving the results of the performed work on purpose as it is very use-case dependent. You might want to keep the outcome of your work back on Amazon S3 or add an entry to a database. Up to you.

But what if that 15-minute timeout limitation is a thorn in your side? What if the process within the compute layer of the solution is complex and would benefit from splitting it into multiple chunks? If that is the case, keep on reading, we are going to be talking about AWS Step Functions next.

Processing with AWS Step Functions

If the compute process within the AWS Lambda we looked at previously is complex or takes more time than the hard timeout limit, the AWS Step Functions might be just the service you need.

Take note of the number of times the code within your Step Functions definition needs to download the object from Amazon S3 storage (the illustration being an example, of course). Depending on the workload, that number may vary, but no matter the workflow, after a certain threshold, it does feel "wasteful" to me to have to download the object again and again.

Keep in mind that the AWS Step Functions maximum payload size is 256 KB, so a large object cannot travel between states as part of the workflow input or output. That is why every step that needs the object has to download it again.
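One common way to live with that limit is to pass a pointer to the object between states rather than the object itself. The sketch below checks a payload against the 256 KB figure and builds such a pointer. The size check is an approximation based on the UTF-8 length of the serialized JSON, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict

MAX_STATE_PAYLOAD_BYTES = 256 * 1024  # the Step Functions limit cited above


def payload_size(payload: Any) -> int:
    """Approximate payload size as UTF-8 bytes of its JSON serialization."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_in_state(payload: Any) -> bool:
    return payload_size(payload) <= MAX_STATE_PAYLOAD_BYTES


def as_reference(bucket: str, key: str) -> Dict[str, str]:
    """State payload that points at the object instead of embedding it."""
    return {"bucket": bucket, "key": key}


inline = {"data": "x" * (300 * 1024)}  # roughly 300 KB of data
print(fits_in_state(inline))                                 # False
print(fits_in_state(as_reference("my-bucket", "big.bin")))  # True
```

Each AWS Lambda function in the workflow then receives the small reference and fetches the object itself, which is exactly the repeated download described above.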

You could implement in-memory caching in your AWS Lambda function, but that technique only applies to a single AWS Lambda and is tied to its container lifecycle.

Depending on the requirements and constraints, I like to use Amazon EFS integration with AWS Lambda in such situations. Amazon EFS allows me to have a storage layer shared by all AWS Lambda functions that partake in the workflow. Let us look into that next.

Shared AWS Lambda storage

Before starting, understand that using Amazon EFS with AWS Lambda functions requires Amazon VPC. For some workflows, this fact does not change anything. For others, it does. I strongly advocate keeping an open mind (I know some people from the serverless community despise VPCs) and evaluating the architecture according to your business needs.

This architecture is, arguably, quite complex. You have to consider some networking concerns. If your object is quite large and downloading it takes a lot of time, I would advise you to look into this architecture – VPCs do not bite!

If you yearn for a practical example, here is a sample architecture I've built for parsing media files. It utilizes Amazon EFS for fast access to the video file across all AWS Lambda functions involved in the AWS Step Functions workflow.
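The core idea can be sketched in a few lines: the first function that needs the object copies it onto the shared mount, and every later function reads the local file. This is not the code from the sample architecture; `ensure_local_copy` and the `download` callable are hypothetical, and the mount directory is whatever local path you configured for the Amazon EFS access point.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def ensure_local_copy(mount_dir: Path, key: str,
                      download: Callable[[Path], None]) -> Path:
    """Return the path of the object on the shared mount, fetching it once."""
    target = mount_dir / key
    if target.exists():
        return target  # another step already placed the object here
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download into a temp file first so readers never see a partial object.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    os.close(fd)
    try:
        download(Path(tmp_name))
        os.replace(tmp_name, target)
    except Exception:
        os.unlink(tmp_name)
        raise
    return target
```

With boto3, the `download` argument could be something like `lambda p: s3.download_file(bucket, key, str(p))`. Cleaning the file up once the workflow finishes is left out of the sketch.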

Yielding back to the user

With the compute part behind us, it is time to see how we might notify the user that our system processed the object and that the results are available.

No matter what kind of solution you choose to notify the user about the result, the state of the work has to be saved somewhere. Is the work pending? Is it finished? Maybe an error occurred?

Keeping track of the work status

The following is a diagram of an example architecture that keeps track of the status of the performed work.

The database sends change data capture (CDC) events to the system. We could use these CDC events to notify the user about the work progress in real time! That might not be necessary, though: in most scenarios, polling for the results by the client is sufficient.

In my humble opinion, the key:value nature of the "work progress" data makes the Amazon DynamoDB a perfect choice for the database component (nothing is stopping you from using RDS, which also supports CDC events, which themselves are optional here).
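As a sketch of how a status record might be advanced safely, the helper below builds the arguments for a DynamoDB `update_item` call on a boto3 `Table` resource. The `workId` key, the `status` attribute and the set of states are assumptions made for this example. The condition expression rejects the write if another worker has already moved the item on, and `status` is aliased because it is a DynamoDB reserved word.

```python
from typing import Any, Dict

# Hypothetical lifecycle of a unit of work.
ALLOWED = {
    "PENDING": {"PROCESSING", "FAILED"},
    "PROCESSING": {"FINISHED", "FAILED"},
}


def transition_kwargs(work_id: str, current: str, new: str) -> Dict[str, Any]:
    """Build update_item arguments for a guarded status transition."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return {
        "Key": {"workId": work_id},
        "UpdateExpression": "SET #s = :new",
        # Only apply the update if the stored status is the one we expect.
        "ConditionExpression": "#s = :current",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":new": new, ":current": current},
    }
```

Usage would look like `table.update_item(**transition_kwargs("job-123", "PENDING", "PROCESSING"))`, where `table` is a boto3 DynamoDB `Table` resource.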

Notifying the user about the results

The last piece of the puzzle is making sure the user has a way to be notified about (or to retrieve) the status of the request.

Usually, in such situations, we have two options to consider – either we implement an API which the user is going to poll for updates, or we push the status directly to the user.

The first option, polling, would look similar to the following diagram in terms of architecture.

This architecture is sufficient up to a particular scale. The architecture must scale according to the pollers hitting the API. The more pollers actively engage with the API, the higher the chance of throttling or other issues (refer to the Thundering herd problem).
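On the client side, one way to soften that pressure is to poll with exponential backoff and random jitter, so that many clients do not hit the API in lockstep. Below is a minimal sketch of that technique. The status strings, the delays and the `fetch_status` callable (which would wrap the HTTP call to your API) are all assumptions.

```python
import random
import time
from typing import Callable, Optional

TERMINAL = {"FINISHED", "FAILED"}  # hypothetical terminal states


def poll_status(fetch_status: Callable[[], str],
                max_attempts: int = 8,
                base_delay: float = 1.0,
                max_delay: float = 30.0,
                sleep: Callable[[float], None] = time.sleep,
                rng: Callable[[], float] = random.random) -> Optional[str]:
    """Poll until a terminal status is seen; return None if attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL:
            return status
        # Exponential ceiling, capped, with a random wait below it.
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        sleep(rng() * ceiling)
    return None
```

The `sleep` and `rng` parameters exist only to make the function easy to test; production callers can leave them at their defaults.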

In higher throughput scenarios, one might want to replace the polling behavior with a push model based on WebSockets. For AWS-specific implementations, I would recommend looking at Amazon API Gateway WebSocket support or AWS IoT Core WebSockets for broadcast-like use-cases.
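To give the push model some shape, here is a sketch of a function that reads DynamoDB stream records and forwards status changes over an Amazon API Gateway WebSocket connection. `post_to_connection` is the real API Gateway Management API call; storing a `connectionId` next to the work item, the attribute names, and a stream configured to include the new item image are assumptions of this example. Stale connections are not handled.

```python
import json
from typing import Any, Dict


def push_status_updates(stream_event: Dict[str, Any], management_client: Any) -> int:
    """Send each inserted or modified status to its WebSocket connection."""
    sent = 0
    for record in stream_event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        # Stream images use DynamoDB's typed format, e.g. {"S": "FINISHED"}.
        image = record.get("dynamodb", {}).get("NewImage", {})
        connection_id = image.get("connectionId", {}).get("S")
        status = image.get("status", {}).get("S")
        if not connection_id or not status:
            continue
        message = {"workId": image.get("workId", {}).get("S"), "status": status}
        management_client.post_to_connection(
            ConnectionId=connection_id,
            Data=json.dumps(message).encode("utf-8"),
        )
        sent += 1
    return sent
```

The `management_client` would be a boto3 `apigatewaymanagementapi` client created with the WebSocket API's endpoint URL.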

I've written an article solely focused on serverless WebSockets on AWS. You can find the blog post here.

Closing words

And that is the end of our journey. We have looked at how to get the large payload into our system, process it, and respond to the user. Implementing this architecture is no small feat, and I hope you found the walkthrough helpful.

For more AWS serverless content, consider following me on Twitter - @wm_matuszewski

Thank you for your precious time.