Dustin Liu for AWS Community Builders

Injecting Machine Learning Model into AWS Lambda for Accelerated Inference Performance

In my previous article, I introduced a cost-effective inference architecture that handles approximately one million inferences per month in our live production environment.

Image by author, previous logic

Previous logic:

  1. Upon receiving an inference request, the Lambda function loads the specified model from an Amazon Elastic File System (EFS) file system mounted to the function.

  2. If loading the model from EFS fails, the function falls back to fetching the model from Amazon S3 (Alternative 1), as sketched below.
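For reference, here is a minimal sketch of that loading order in Python. It assumes a scikit-learn style model loaded with joblib; the EFS mount path, bucket name, and model file name are placeholders, not the exact values from our production setup.

```python
import os

import boto3
import joblib  # assuming a scikit-learn style model; swap in your framework's loader

EFS_MODEL_DIR = "/mnt/models"   # EFS access point mounted to the Lambda function
S3_BUCKET = "my-model-bucket"   # placeholder bucket name
TMP_DIR = "/tmp"                # Lambda's writable scratch space

s3 = boto3.client("s3")


def load_model(model_name: str):
    """Load a model from EFS, falling back to S3 if the EFS read fails."""
    efs_path = os.path.join(EFS_MODEL_DIR, model_name)
    try:
        return joblib.load(efs_path)  # primary: read from the EFS mount
    except OSError:
        # Backup: download the model from S3 to /tmp and load it from there.
        local_path = os.path.join(TMP_DIR, model_name)
        s3.download_file(S3_BUCKET, model_name, local_path)
        return joblib.load(local_path)
```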

One challenge with this solution arises during scale-up: when a large number of concurrent Lambda functions are invoked, they all attempt to load models from EFS simultaneously. This surge in demand can quickly exhaust the file system's available throughput, leading to numerous inference timeouts.

Image by author, high Lambda error rate due to timeouts

Solution 1 (Not recommended):
While increasing the EFS provisioned throughput may seem like a straightforward fix, it can prove costly and is therefore not recommended.

Solution 2 (Recommended):

Image by author, new logic

  1. Injecting the model into the base image used by Lambda lets each invocation load the model from the local filesystem, reducing inference latency and saving on EFS costs. If the local load fails, the following backups are in place (see the sketch after this list):

  2. Loading the model from EFS (Backup Solution 1)

  3. Loading the model from S3 if EFS loading fails (Backup Solution 2)
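To make the new loading order concrete, here is a minimal sketch, again with placeholder paths, bucket name, and model file name. It assumes the model file is copied into the container image at build time (for example, a COPY instruction in the Dockerfile that places it under /opt/ml) and that joblib can load it; the handler and its input shape are purely illustrative.

```python
import os

import boto3
import joblib  # placeholder loader; use your framework's equivalent

LOCAL_MODEL_DIR = "/opt/ml"     # model files baked into the image at build time
EFS_MODEL_DIR = "/mnt/models"   # Backup Solution 1: EFS mount
S3_BUCKET = "my-model-bucket"   # Backup Solution 2: placeholder bucket name
TMP_DIR = "/tmp"

s3 = boto3.client("s3")


def load_model(model_name: str):
    """Load from the image's local filesystem first, then EFS, then S3."""
    local_path = os.path.join(LOCAL_MODEL_DIR, model_name)
    if os.path.exists(local_path):
        return joblib.load(local_path)  # fastest path: no network read at all
    try:
        return joblib.load(os.path.join(EFS_MODEL_DIR, model_name))  # Backup 1
    except OSError:
        tmp_path = os.path.join(TMP_DIR, model_name)  # Backup 2
        s3.download_file(S3_BUCKET, model_name, tmp_path)
        return joblib.load(tmp_path)


# Loading at module scope means the model is read once per execution environment
# and reused across warm invocations.
MODEL = load_model("model.joblib")


def handler(event, context):
    return {"prediction": MODEL.predict([event["features"]]).tolist()}
```

Loading the model at module scope keeps the cost of the load on the cold start only; warm invocations reuse the already loaded model.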

Results:

After implementing the model injection, there is a notable reduction in both EFS throughput usage and Lambda inference duration, resulting in lower costs and improved performance.

Image by author, EFS throughput before implementation

Image by author, Lambda duration before implementation

Image by author, Lambda duration after implementation
