Integrating with Lambda MicroVMs

#lambda #proxylity #systemdesign #microvm

I was excited to see the public release of AWS Lambda MicroVMs last week, and have been exploring the best way to integrate them with Proxylity UDP Gateway (PUG). Here are my unfiltered thoughts so far.

MicroVMs are a bit different from the other destination services PUG supports, which are the "meat and potatoes" of serverless on AWS: Lambda Functions, Step Functions, SNS, SQS, S3, DynamoDB, Firehose, EventBridge, IoT Data, etc. Using PUG with any of these services is straightforward and well optimized.

Our customers simply take the ARN of any resource they want to receive UDP requests and paste it into our console (or, more likely, configure it using our CloudFormation custom resources). Behind the scenes, we set up the infrastructure needed to handle UDP traffic and route it to the resource identified by the ARN (region, service, resource name).
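To make the "region, service, resource name" part concrete, here is a minimal sketch of pulling those fields out of an ARN. It relies only on the standard six-field ARN layout (`arn:partition:service:region:account-id:resource`); the `Arn` type and `parse_arn` helper are names made up for illustration and are not PUG's actual code.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    """The fields of an ARN (hypothetical helper type for this post)."""
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    # An ARN has six colon-separated fields. The resource field may contain
    # colons itself (e.g. "function:my-handler"), so split at most five times.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts
    return Arn(partition, service, region, account_id, resource)


if __name__ == "__main__":
    arn = parse_arn("arn:aws:lambda:us-west-2:123456789012:function:my-handler")
    print(arn.service, arn.region, arn.resource)
```

The service and region fields are enough to choose which client and regional endpoint to use, which is why a bare ARN works as a complete destination description for persistent resources.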

Conceptually, this works for MicroVMs as well: A running instance has an ARN and endpoint to which requests can be sent. But it's not quite that simple...

From an integration perspective, the first big issue with MicroVMs is their lifecycle:

Lambda Functions, for example, are persistent resources: once given an ARN we can invoke the function as many times as we like, whenever we like. They don't go away unless we delete them. The same is true for SQS queues and DynamoDB tables.
Requests can be sent to MicroVM instances, but only while they are active. Since instances have a maximum life of 8 hours, their ARNs are temporary and expire. We can control when they terminate, but they will terminate automatically if we don't. New instances aren't created automatically; there is no autoscaling or load balancing.

Another lifecycle issue comes up related to permissions/access control:

Access to Lambda is controlled by persistent IAM permissions (Roles and Policies). Once set up, access remains unchanged unless we make changes.
Access to MicroVM instances is controlled by short-lived, service-specific authorization tokens. These tokens last a maximum of 60 minutes and must be refreshed to maintain access. Moreover, generating an auth token requires an API call (added latency), unlike SigV4 signing for IAM, which is computed locally.
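The usual way to hide that extra API call is to cache the token and refresh it a little before it expires. Below is a minimal sketch of that idea. The `fetch` callable stands in for whatever API actually issues the token (I'm not showing the real call here), and `TokenCache`, `refresh_margin` and the `(token, ttl_seconds)` return shape are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TokenCache:
    """Caches one auth token per instance ARN and refreshes it early."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[str, float]],  # hypothetical: returns (token, ttl_seconds)
        refresh_margin: float = 300.0,              # refresh this many seconds before expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock
        self._tokens: Dict[str, Tuple[str, float]] = {}  # arn -> (token, expires_at)

    def get(self, instance_arn: str) -> str:
        now = self._clock()
        cached = self._tokens.get(instance_arn)
        if cached is not None and now < cached[1] - self._margin:
            return cached[0]  # still comfortably valid, no API call needed
        # Missing or close to expiry: pay the API latency once and re-cache.
        token, ttl = self._fetch(instance_arn)
        self._tokens[instance_arn] = (token, now + ttl)
        return token
```

With a 60-minute token and a 5-minute margin, only about one request per instance every 55 minutes pays the token-generation latency; the rest are served from memory.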

So in addition to the MicroVM image resource (which is persistent), we have two other resources that need to be managed to maintain continuous access as a PUG Destination: the instance and the auth token that allows using it.

One approach for PUG would be to allow using a MicroVM instance ARN as the destination. The auth token could be generated as needed in our infrastructure and cached for efficiency. But since the instance will live for at most 8 hours, this approach would require customers to constantly update PUG with new instance ARNs. It would also limit scaling, since only a single instance destination ARN could be configured per region.

A more sophisticated approach would be to use the image ARN for the PUG destination and manage instances as needed. PUG would be responsible for running, suspending, resuming and terminating instances based on the rate of requests. In other words, re-creating autoscaling. We could avoid autoscaling by allowing our customers to specify the number of instances to keep running, but that leads to reserved and underutilized capacity, and is antithetical to the value PUG embodies (don't pay for idle).

This is a difficult design decision with many trade-offs to balance.

The approach we've landed on for the initial, select-availability release borrows from Lambda's tenant isolation feature and the support for it we already implement for Lambda Functions.

Using this approach, we allow the destination to be configured with a BREX expression that extracts a tenant ID from each packet and maps it to a MicroVM instance. If no instance is currently running for that tenant, we start one. If one is available, all packets with the matching tenant ID are sent to it.
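The routing logic described above can be sketched in a few lines. To be clear about what is assumed: the tenant extractor here is a plain Python function standing in for the BREX expression (I'm not showing BREX syntax), `start_instance` is a hypothetical callback for whatever actually launches a MicroVM, and `TenantRouter` is a name invented for this sketch.

```python
from typing import Callable, Dict


class TenantRouter:
    """Maps each tenant ID to one instance, starting instances on demand."""

    def __init__(
        self,
        extract_tenant: Callable[[bytes], str],  # stand-in for the BREX expression
        start_instance: Callable[[str], str],    # hypothetical: tenant ID -> instance ID
    ) -> None:
        self._extract = extract_tenant
        self._start = start_instance
        self._instances: Dict[str, str] = {}     # tenant ID -> instance ID

    def route(self, packet: bytes) -> str:
        tenant_id = self._extract(packet)
        instance = self._instances.get(tenant_id)
        if instance is None:
            # No instance for this tenant yet: start one and remember it.
            instance = self._start(tenant_id)
            self._instances[tenant_id] = instance
        return instance

    def forget(self, tenant_id: str) -> None:
        # Call when an instance terminates so the next packet starts a fresh one.
        self._instances.pop(tenant_id, None)


if __name__ == "__main__":
    # Assumed packet layout for the demo: the first 4 bytes carry the tenant ID.
    router = TenantRouter(lambda p: p[:4].hex(), lambda t: f"instance-for-{t}")
    print(router.route(b"\x00\x00\x00\x01payload"))
```

A real implementation would also need to handle concurrent first packets for the same tenant (so two instances aren't started at once) and packets that arrive while an instance is still starting; the sketch ignores both.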

We're also including max duration, max idle time and max suspended time options in the destination arguments to help customers minimize instance idle time for their use cases.
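One plausible way those three options could interact is sketched below. The semantics are my assumption for illustration, not a statement of how the feature behaves: an instance past its max duration is terminated, a running instance that has been idle too long is suspended, and a suspended instance that stays suspended too long is terminated. The names `Limits`, `Action` and `next_action` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP = "keep"
    SUSPEND = "suspend"
    TERMINATE = "terminate"


@dataclass(frozen=True)
class Limits:
    max_duration: float   # seconds an instance may exist in total
    max_idle: float       # seconds a running instance may go without packets
    max_suspended: float  # seconds an instance may stay suspended


def next_action(
    limits: Limits,
    now: float,
    started_at: float,
    last_packet_at: float,
    suspended_at: Optional[float] = None,  # None means the instance is running
) -> Action:
    # Total lifetime is checked first: it applies whether running or suspended.
    if now - started_at >= limits.max_duration:
        return Action.TERMINATE
    if suspended_at is not None:
        if now - suspended_at >= limits.max_suspended:
            return Action.TERMINATE
        return Action.KEEP
    if now - last_packet_at >= limits.max_idle:
        return Action.SUSPEND
    return Action.KEEP
```

For example, with `Limits(max_duration=3600, max_idle=60, max_suspended=600)`, an instance that received its last packet two minutes ago would be suspended, and terminated if no traffic resumed it within the following ten minutes.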

This feels like a great approach to leveraging MicroVMs in UDP Gateway. Customers will be able to serve highly episodic (spiky) workloads with strong tenant isolation, and to more easily incorporate stateful processing for protocols that require it. Tenant isolation was highlighted by AWS as a key use case for MicroVMs, so this approach seems well aligned.

I'm very eager to see how this new integration turns out in general availability, and how it gets used.
