DEV Community

Why We Moved From Lambda to ECS

Taylor Reece on April 21, 2021

After many months of development, my team just announced the general availability of our platform. That milestone seems like a perfect opportunity ...
Libert S

Thanks for the insights. I wasn't aware of the process isolation issue. As far as I knew, every new request starts a new Lambda environment in a new container.

Taylor Reece

That's what surprised us, too - we thought every Lambda got a new environment in a new container. It turns out if you invoke a Lambda that you haven't in a while, it "cold starts", so you get a new environment. Then, that Lambda sits around "warm" waiting for more invocations. That same environment might be used several times before it gets removed from the pool.

That's usually fine, since Lambdas tend to be stateless for most use cases, but in our case state could potentially be mucked with by a user's custom code that we execute.
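
The warm-environment reuse described above can be sketched as follows. This is a minimal illustration, not code from the article: `invocationCount` and `handler` are hypothetical names, and calling the function twice in the same process stands in for two invocations that land on the same warm environment.

```javascript
// Module scope: evaluated once per cold start, then kept alive while the
// execution environment stays warm.
let invocationCount = 0;

// In a real deployment this function would be exported as the Lambda handler.
const handler = async () => {
  // State written here is still visible to the next invocation that reuses
  // this environment.
  invocationCount += 1;
  return { invocationCount };
};
```

Anything the handler (or code it runs on a user's behalf) leaves in module or global scope is still there for the next caller.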

K

Did you try a custom runtime?

Taylor Reece

We didn't. What has your experience been with custom runtimes in Lambda?

K

I don't have much, but as far as I understand it, a custom runtime is basically an HTTP API that passes event data to a "function", whatever that may be.

I'd guessed that you could have used a customer ID in the event data and have the custom runtime spin up isolated "functions" for every customer.
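
For context, here is a rough sketch of one iteration of a custom runtime's loop against the Lambda Runtime API. The endpoint paths and the request-ID header come from that API. Everything else is an assumption made for illustration: the `processNextInvocation` name, the injected `fetchFn` (so the loop can be exercised outside Lambda), and the idea that `handleEvent` would route on a `customerId` field.

```javascript
// One iteration of a custom runtime loop. Inside Lambda, baseUrl would be
// `http://${process.env.AWS_LAMBDA_RUNTIME_API}` and fetchFn the global fetch
// (assumption: a Node.js version that provides fetch).
const processNextInvocation = async (baseUrl, fetchFn, handleEvent) => {
  // Long-poll the Runtime API for the next event.
  const next = await fetchFn(`${baseUrl}/2018-06-01/runtime/invocation/next`);
  const requestId = next.headers.get("lambda-runtime-aws-request-id");
  const event = await next.json();

  // Hypothetical: handleEvent could inspect event.customerId and run that
  // customer's code in its own isolated process.
  const result = await handleEvent(event);

  // Report the result for this specific invocation.
  await fetchFn(
    `${baseUrl}/2018-06-01/runtime/invocation/${requestId}/response`,
    { method: "POST", body: JSON.stringify(result) }
  );
  return requestId;
};
```

A real runtime would call this in an endless loop and also report failures to the invocation's error endpoint, which is omitted here.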

Lou (🚀 Open Up The Cloud ☁️)

I wrote a thread about this a while back:

twitter.com/loujaybee/status/13463...

Ambar

"To get around the SQS size limit issues we faced, we swapped in a Redis-backed queuing service."

Interesting, wouldn't this Redis queue have solved this particular issue even within Lambda-land also? Or does it only work for ECS?

Taylor Reece

Great question. The Redis queue would have solved the size issue in Lambda, for sure. Is there a good way to invoke the next Lambda in line using a Redis queue? Looking through StackOverflow, people seem to suggest using SQS to invoke Lambdas in series and pass data between them, but there may be better ways to do this with a Redis queue that I'm not aware of.

Ambar • Edited

I've heard good things about RSMQ (simple, fast queue abstractions on top of plain old Redis). Their TL;DR is:

If you run a Redis server and currently use Amazon SQS or a similar message queue you might as well use this fast little replacement. Using a shared Redis server multiple Node.js processes can send / receive messages.

Sounds like it could be a good match for your requirements. Perhaps you could set up an AWS Step Functions state machine to trigger the appropriate next Lambda (with the RSMQ ID and message ID in the payload to the subsequent Lambda function). We use AWS Step Functions to great success for similar asynchronous Lambda processing in a state machine.
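
The idea of passing only an ID between steps, rather than the data itself, can be sketched like this. It is an illustration under stated assumptions: `payloadStore` is an in-memory `Map` standing in for the Redis-backed queue, and `storePayload` and `loadPayload` are hypothetical helpers, not RSMQ or Step Functions APIs.

```javascript
const { randomUUID } = require("node:crypto");

// In-memory stand-in for a Redis-backed store (hypothetical, illustration only).
const payloadStore = new Map();

// Step N: park the large payload and hand back only a small reference.
const storePayload = (payload) => {
  const messageId = randomUUID();
  payloadStore.set(messageId, JSON.stringify(payload));
  return { messageId };
};

// Step N+1: receives just the reference and fetches the payload itself.
const loadPayload = ({ messageId }) => {
  const raw = payloadStore.get(messageId);
  if (raw === undefined) throw new Error(`No payload stored for ${messageId}`);
  payloadStore.delete(messageId); // consume once, like receiving then deleting a message
  return JSON.parse(raw);
};
```

The object returned by `storePayload` stays tiny regardless of payload size, so it fits comfortably in whatever carries state from one step to the next.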

Rolf Streefkerk

With the addition of EFS for Lambda, a lot of your problems can be solved within Lambda.
Latency for network disk operations should be vastly reduced with EFS when you provision throughput and configure it for high IOPS.

Process isolation can be an issue if you have code that executes outside the handler functions; its state remains until the Lambda container is thrown away. If you require such isolation, you need to cut that code down a lot and keep it inside your execution handler.

Taylor Reece • Edited

That's a good point. EFS in Lambda is exciting.

WRT the process isolation thing, try running a test of this code in Lambda twice. The first time, you get a nice logged "Hello, world!". The second time you run it, console.log has been redefined and you get a less desirable "Your message has been hijacked".
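
For readers who cannot open the gist, here is a sketch of the kind of handler that produces this behavior. It is an assumption-based reconstruction, not the gist's actual contents; `untrustedUserCode` is a hypothetical stand-in for customer code that runs during an invocation.

```javascript
// Hypothetical stand-in for customer code that tampers with a global.
const untrustedUserCode = () => {
  const realLog = console.log;
  console.log = () => realLog("Your message has been hijacked");
};

// In a real deployment this function would be exported as the Lambda handler.
const handler = async () => {
  console.log("Hello, world!");
  // The override outlives this invocation for as long as the environment
  // stays warm, so the next invocation logs the hijacked message instead.
  untrustedUserCode();
};
```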

gist.github.com/taylorreece/70ed16...

Matt Morgan

There aren't a lot of languages or runtimes where you'd want to allow end users to hack the global scope. You can certainly use Lambda safely with process isolation by not creating globals and by creating and setting any runtime variables inside your handler. Moving to ECS won't solve your problem. Polite suggestion: don't allow your customers to attach things to the global scope. Node.js has the vm module for running code in an isolated context, or you can just regex the code.
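
A minimal sketch of the vm approach, assuming a hypothetical `runInFreshContext` helper and a stub `console` supplied by the host. It shows that globals set by the snippet stay inside the new context. The Node.js documentation states that the vm module is not a security mechanism, so treat this as scoping of globals rather than a sandbox for untrusted code.

```javascript
const vm = require("node:vm");

// Runs a snippet in a brand-new V8 context and returns what it logged.
const runInFreshContext = (code) => {
  const logs = [];
  // Only what is placed on this object is visible to the snippet as a global.
  const sandbox = { console: { log: (...args) => logs.push(args.join(" ")) } };
  vm.runInNewContext(code, sandbox, { timeout: 1000 });
  return logs;
};

const userCode = `
  console.log("Hello, world!");
  console.log = () => {};    // replaces the stub console of this context only
  globalThis.leaked = true;  // lands on this context's global, not the host's
`;
```

Calling `runInFreshContext(userCode)` twice logs the greeting both times, because each call builds a new context and a new stub console.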

Taylor Reece

Hey Matt, thanks for linking the vm module - it's good to know about. It seems like that should work, though the docs note:

The vm module enables compiling and running code within V8 Virtual Machine contexts. The vm module is not a security mechanism. Do not use it to run untrusted code.

For our use case, where our platform runs customers' code which could contain anything, we've had to be a bit more heavy-handed with isolating our runtime environments. We ended up creating chroot jails and distinct Node.js processes within our ECS containers to run our customers' code, so each run is guaranteed not to interact with any other.
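
The "distinct process per run" part can be sketched as below. This is an illustration under assumptions, not Prismatic's implementation: `runInSeparateProcess` is a hypothetical helper, and the chroot jail mentioned above is not shown. A separate process on its own only keeps in-memory state from leaking between runs.

```javascript
const { execFileSync } = require("node:child_process");

// Runs a snippet in a brand-new Node.js process and returns what it printed.
const runInSeparateProcess = (code) =>
  execFileSync(process.execPath, ["-e", code], { encoding: "utf8", timeout: 5000 });

const userCode = `
  console.log("Hello, world!");
  // This override dies with the child process.
  console.log = () => process.stdout.write("Your message has been hijacked\\n");
`;
```

Every call to `runInSeparateProcess(userCode)` starts from a clean global scope, so the override made by one run is never seen by the next.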

Matt Morgan

That makes sense, and it's obvious that your business puts you in a position to do something that most apps would not want to do (execute untrusted end-user code). My comment was really in response to your gist above. The behavior of globals in Lambda is well documented and predictable. This didn't fit your rather unusual use case, but for most users, a quick read of the docs will arm them with what they need to understand process isolation in Lambda.

Scott Simontis

I think you did a great job of highlighting what I would say is the ideal path for AWS development nowadays. Start with Lambda, move to ECS if Lambda doesn't fit your needs, and only consider EC2-based applications as a last resort or a lift n' shift strategy for migration.

Taylor Reece

Thanks, Scott! Lambda sure makes it easy to iterate quickly without needing to sink hours into DevOps tasks, but for some use cases it definitely makes sense to invest the time needed to run on ECS, EC2, etc.

Omri Gabay

Great read. Would AWS Step Functions have worked for your business use cases?

Taylor Reece

Hey Omri - thanks, and great question! I'll start with a disclaimer that I'm no expert in AWS Step Functions, so correct me if I get anything wrong :-)

Similar to Step Functions, Prismatic's platform allows you to build workflows (we call them integrations) where data flows through a series of steps. The integration's steps do a variety of things, like interact with third party APIs, mutate data, manage files, etc. Users can create branches and loops that run based on some conditionals, etc. - all of that is pretty similar to what you'd see in a Step Function state machine definition.

Our platform differs from Step Functions in a number of ways, too. Most notably, our users (typically B2B software companies) can develop one integration and deploy instances of the integration to their own customers. Those instances are configurable, and can be driven by customer-specific configuration variables and credentials, so one integration can handle a variety of customer-specific setups. We also offer tooling - things like logging, monitoring and alerting - so our users can easily track instance executions and can configure if/how they are notified if something goes wrong in their integrations.

It might have been possible to create some sort of shim that dynamically converted a Prismatic integration to a Step Functions state machine definition - I'd need to look more into Step Functions to figure out what difficulties we'd have there. The biggest thing keeping us from doing something like that is probably vendor lock-in. We have customers who would like to run Prismatic on other cloud platforms (or on their on-premises stacks), and implementing our integration runner as a container gives us more flexibility to migrate cloud providers as needed.

Omri Gabay

Good looking out on the vendor lock-in issue. And I was under the impression that Prismatic was only being sold as a cloud solution, but self-hosted options are great as well. Thanks for responding!

Kimmo Sääskilahti

That's a very interesting read, thanks a lot!

Are you using Bull as queue library over Redis?

My experience of SQS is that it's great for decoupling services and queueing jobs when you don't have an end-user waiting for the job to complete. If you need near-real-time user experience, I'd similarly go for e.g. Redis, RabbitMQ or even Kafka. Does that sound reasonable?

Taylor Reece

Yep! We're big fans of Bull :-)

Luítame de Oliveira

It gave me a lot of insights. Thanks for sharing.