Matia Rašetina for AWS Community Builders

Posted on May 18

Stop Using Lambda for ML at This Scale (Benchmark + Cost Analysis)

#webdev #aws #programming #startup

As a CTO, your job during a Proof of Concept (POC) is deceptively simple: don’t over-engineer, and don’t overspend.

You don’t need the perfect ML infrastructure—you need the cheapest architecture that works well enough.

Here’s the pipeline we built for our ML POC:

Audio file → S3 → Compute → Prediction → DynamoDB

The real question isn’t how to run inference—it’s:

At what point does Lambda stop being the smartest choice?

In this blog post, we compare three ways of running inference with an already trained machine learning model:

AWS Lambda with standard configuration
AWS Lambda with SnapStart enabled
AWS Lambda used as a proxy in front of an Amazon SageMaker endpoint

To access the full project code, you can click the link here.

Experiment setup

The architecture across the board is very similar — all compute resources (Lambdas and SageMaker instance) have the same 4GB RAM configuration.

There is a subtle difference in vCPU allocation. Lambda allocates CPU in proportion to memory, with the equivalent of one vCPU per 1,769 MB, so our 4 GB Lambdas get about 2.31 vCPUs (based on AWS docs here), while the SageMaker instance (ml.t2.medium) has 2 vCPUs.
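As a rough check of that figure, the proportional allocation can be sketched in a few lines. The 1,769 MB per vCPU constant is the one quoted above; the helper name is purely illustrative.

```python
# Memory size at which Lambda provides the equivalent of one vCPU,
# as quoted in this post from the AWS documentation.
MB_PER_VCPU = 1769


def estimated_lambda_vcpus(memory_mb: int) -> float:
    """Estimate the vCPU share for a Lambda memory size (illustrative helper)."""
    return memory_mb / MB_PER_VCPU


# 4 GB functions, as used for both Lambda stacks in this benchmark.
lambda_vcpus = estimated_lambda_vcpus(4096)
print(f"Estimated Lambda vCPUs at 4096 MB: {lambda_vcpus:.3f}")
```

The result lands slightly above the 2 vCPUs of the ml.t2.medium instance, which is the small imbalance mentioned above.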

In addition, the SageMaker stack has a proxy Lambda with 128 MB of RAM, which reads the details of the uploaded file from S3, forwards them to the SageMaker endpoint, and saves the results to DynamoDB.
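A minimal sketch of what such a proxy flow might look like is shown below. This is not the project's actual handler: the function names, the JSON payload shape, and the injected `invoke` and `save` callables are assumptions, chosen so the flow can be read and tested without AWS credentials. In a real deployment they would typically wrap boto3's `sagemaker-runtime` `invoke_endpoint` call and a DynamoDB `put_item` call.

```python
import json
from typing import Any, Callable
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def handle_event(
    event: dict[str, Any],
    invoke: Callable[[str], Any],           # hypothetical: calls the endpoint
    save: Callable[[dict[str, Any]], None],  # hypothetical: writes to DynamoDB
) -> int:
    """Forward each uploaded object to the endpoint and store the result."""
    processed = 0
    for bucket, key in extract_s3_objects(event):
        # Assumed payload shape: the endpoint receives the object location.
        payload = json.dumps({"bucket": bucket, "key": key})
        prediction = invoke(payload)
        save({"audio_key": key, "prediction": prediction})
        processed += 1
    return processed
```

Keeping the AWS calls behind two small callables also keeps the proxy thin, which fits the idea of running it at the lowest memory setting.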

None of the stacks use GPU instances, which keeps the playing field as level as possible.

Here are some other experiment choices made to keep the benchmark fair:

Same CPU architecture everywhere (x86_64): Lambda functions use the x86_64 architecture, dependencies are bundled with the SAM x86_64 Python 3.12 image, and the SageMaker container image is built for linux/amd64 so ONNX and wheels behave the same across paths.
Same language runtime: All Lambda handlers run Python 3.12 with the same packaged lambda_src layout (only the handler and SnapStart wiring differ).
Same model, with an intentional container vs. zip trade-off: All paths use one shared ONNX artifact from S3; the standard and SnapStart functions load it inside the function, while SageMaker serves it from a dedicated container behind an endpoint.

SageMaker Serverless Inference was intentionally excluded from the benchmark. The reason is to keep the cost of running the ML model as low as possible and to keep the performance comparison fair across the board.

The architecture diagram for this benchmark can be seen in the following picture:

Here is an overview of all stacks in this experiment:

Option	Setup	Cost Model	Pros	Cons
Lambda (4GB)	Model runs directly inside Lambda, ~2.31 vCPU, 4GB of RAM	Pay-per-request	Scales to zero, no idle cost, fast per request	High memory cost, not ideal at very high traffic
Lambda with SnapStart Enabled	Model runs directly inside Lambda, ~2.31 vCPU, 4GB of RAM	Pay-per-request	Predictable performance, cost-efficient at scale, SnapStart helping in cold starts	High memory cost, not ideal at very high traffic, additional SnapStart cost if traffic is sporadic
SageMaker Endpoint	Model hosted on ml.t2.medium (2 vCPU with 4GB of RAM), invoked via 128MB Lambda	Fixed monthly	Predictable performance, cost-efficient at scale	Always-on, pays even when idle, slightly higher latency

Configuration in CDK code

Here are the code snippets of all compute resources used in this experiment. All Lambdas are created with the following method, to reduce code duplication.

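from aws_cdk import Duration, aws_lambda as _lambda
from constructs import Construct

# _DEFAULT_RUNTIME_ENV and bundled_lambda_code() are defined elsewhere in the project.
def create_python_function(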
    *,
    scope: Construct,
    function_name: str,
    handler: str,
    environment: dict[str, str] | None = None,
    timeout: Duration = Duration.seconds(60),
    memory_size: int = 4096,
    architecture: _lambda.Architecture = _lambda.Architecture.X86_64,
    snapstart: _lambda.SnapStartConf | None = None,
) -> _lambda.Function:
    runtime_environment: dict[str, str] = {**_DEFAULT_RUNTIME_ENV}
    if environment:
        runtime_environment.update(environment)

    return _lambda.Function(
        scope,
        function_name,
        function_name=function_name,
        runtime=_lambda.Runtime.PYTHON_3_12,
        architecture=architecture,
        handler=handler,
        code=bundled_lambda_code(),
        timeout=timeout,
        memory_size=memory_size,
        environment=runtime_environment,
        snap_start=snapstart,
    )

Standard Lambda:

# Initialize a Lambda with the standard configuration
standard_lambda = create_python_function(
    scope=self,
    function_name="standard-predictor",
    handler="standard_handler.handler",
    timeout=Duration.seconds(90),
    environment={
        "PREDICTIONS_TABLE": self.predictions_table.table_name,
        "PREDICTOR": "standard",
        "MODEL_S3_URI": f"s3://{self.model_asset.s3_bucket_name}/{self.model_asset.s3_object_key}",
    },
)

SnapStart Lambda:

# Initialize the SnapStart Lambda
snapstart_lambda = create_python_function(
    scope=self,
    function_name="snapstart-predictor",
    handler="snapstart_handler.handler",
    timeout=Duration.seconds(90),
    snapstart=_lambda.SnapStartConf.ON_PUBLISHED_VERSIONS, # Very important to configure this parameter!
    environment={
        "PREDICTIONS_TABLE": predictions_table.table_name,
        "PREDICTOR": "snapstart",
        "MODEL_S3_URI": f"s3://{model_asset.s3_bucket_name}/{model_asset.s3_object_key}",
    },
)

# Create an alias for the published version: SnapStart only applies to published versions, not $LATEST
live_alias = _lambda.Alias(
    self,
    "SnapStartLiveAlias",
    alias_name="live",
    version=snapstart_lambda.current_version,
)

SageMaker endpoint + Lambda proxy:

# Configure the SageMaker endpoint
endpoint_config = SageMaker.CfnEndpointConfig(
    self,
    "AudioPredictorEndpointConfig",
    production_variants=[
        SageMaker.CfnEndpointConfig.ProductionVariantProperty(
            variant_name="AllTraffic",
            model_name=model.attr_model_name,
            initial_instance_count=1,
            instance_type="ml.t2.medium",
            initial_variant_weight=1.0,
        )
    ],
)

# Define the endpoint
endpoint = SageMaker.CfnEndpoint(
    self,
    "AudioPredictorEndpoint",
    endpoint_config_name=endpoint_config.attr_endpoint_config_name,
)

# Initialize the SageMaker Lambda proxy
SageMaker_trigger = create_python_function(
    scope=self,
    function_name="SageMaker-predictor",
    handler="SageMaker_trigger_handler.handler",
    timeout=Duration.seconds(90),
    memory_size=128,  # the proxy only forwards requests, so it uses the 128 MB described in the setup
    environment={
        "PREDICTIONS_TABLE": predictions_table.table_name,
        "PREDICTOR": "SageMaker",
        "MODEL_S3_URI": f"s3://{model_asset.s3_bucket_name}/{model_asset.s3_object_key}",
        "ENDPOINT_NAME": endpoint.attr_endpoint_name,
    },
)

Results

In the following image, you can see the execution duration of each stack (note: the SnapStart Lambda was run once beforehand to create the snapshot, then left idle for 10 minutes so that it would cold start again):

Method	Mean Latency	Median Latency	Std Dev (Stability)
Standard Lambda	280.72 ms	127.65 ms	443.29 ms
SnapStart	178.60 ms	124.69 ms	166.35 ms
SageMaker	339.18 ms	226.91 ms	350.76 ms

From the graph, we can see that the Lambdas execute faster than the SageMaker endpoint, with most invocations staying under the 200 ms mark. The circles represent cold starts, and the SnapStart Lambda's cold start was at least 2x faster than the others, thanks to SnapStart. The SageMaker stack performed the worst, but not by much: most of its invocations landed just above the 200 ms mark, and its cold start took almost 1.4 seconds.
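The gap between mean and median in the table is what you would expect when a few cold starts sit among many fast warm invocations. A minimal sketch with made-up latency samples (not the benchmark's data) shows the effect:

```python
import statistics

# Hypothetical latencies in ms: five warm invocations and one cold start.
# These numbers are illustrative only, not measurements from this benchmark.
samples_ms = [120.0, 125.0, 130.0, 128.0, 122.0, 1400.0]

mean_ms = statistics.mean(samples_ms)      # pulled upward by the cold start
median_ms = statistics.median(samples_ms)  # close to the typical warm latency
std_ms = statistics.stdev(samples_ms)      # sample standard deviation

print(f"mean={mean_ms:.1f} ms, median={median_ms:.1f} ms, std={std_ms:.1f} ms")
```

A single slow outlier is enough to push the mean and standard deviation far above the median, so the median is the better guide to what a warm request feels like.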

Cost Breakdown

Lambda Cost (per request)

Formula: Cost = Duration (seconds) × Memory (GB) × $0.0000166667 per GB-second

Duration: ~200 ms (0.2 s)
Memory: 4 GB

Cost per request: ~$0.0000133

Cost per 1M requests: ~$13.80
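The formula above can be checked with a few lines of Python. This is a sketch of the duration term only, with an illustrative function name. It does not include Lambda's separate per-request charge or any difference between the rounded 200 ms and the measured durations, which may be why the per-1M figure quoted in this post is somewhat higher than the compute-only number.

```python
# Price per GB-second, as used in the formula above.
PRICE_PER_GB_SECOND = 0.0000166667


def lambda_compute_cost(duration_ms: float, memory_gb: float) -> float:
    """Duration-based cost of one invocation in USD (request fee not included)."""
    return (duration_ms / 1000.0) * memory_gb * PRICE_PER_GB_SECOND


per_request = lambda_compute_cost(duration_ms=200, memory_gb=4)
per_million = per_request * 1_000_000

print(f"per request: ${per_request:.7f}")
print(f"per 1M requests (compute only): ${per_million:.2f}")
```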

SageMaker Cost (fixed)

ml.t2.medium ≈ $40.88/month (the fixed monthly cost used in the comparison table below)
Runs 24/7, even when idle

Takeaway:

Lambda has variable costs that scale with usage. SageMaker has fixed costs, making the tradeoff clear when requests grow.

The main question is:

When does SageMaker become the better option?

I’ve done the math.

SageMaker becomes a better option at ~72 requests per minute — take a look at the following graph:

At low traffic, Lambda's pay-per-request pricing is cheaper, because the SageMaker endpoint carries a fixed monthly price regardless of usage. As traffic grows, SageMaker handles it more cheaply.

You may notice that the green line, representing the SageMaker endpoint, rises as well. That is expected, since every request still goes through the proxy Lambda, but the increase is manageable because the proxy runs with the lowest memory configuration.

Here is a broader view of the expected cost at different traffic levels, based on the latest pricing:

Traffic Volume	Standard Lambda (4GB)	SnapStart Lambda (4GB)	SageMaker (ml.t2.medium + 128MB Caller)
Price per 1M Req (Variable)	$13.80	$16.82	$0.81
Fixed Monthly Cost	$0.00	$0.00	$40.88
Total: 10 RPM (~438k req/mo)	$6.05	$7.37	$41.24
Total: 50 RPM (~2.1M req/mo)	$30.24	$36.87	$42.66
Total: 72 RPM (~3.1M req/mo)	$43.51	$53.05	$43.43 (Crossover point)
Total: 200 RPM (~8.7M req/mo)	$120.84	$147.32	$47.97
Total: 1000 RPM (~43.8M req/mo)	$604.22	$736.62	$76.38
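The crossover can be reproduced approximately from the table's own figures. The sketch below assumes a 730-hour month (43,800 minutes) and uses the rounded per-1M prices from the table, so its totals may differ from the table by a few cents. Function and constant names are illustrative.

```python
# Figures taken from the cost table in this post (USD).
LAMBDA_PER_MILLION = 13.80       # Standard Lambda, variable cost per 1M requests
SAGEMAKER_PER_MILLION = 0.81     # proxy Lambda, variable cost per 1M requests
SAGEMAKER_FIXED_MONTHLY = 40.88  # always-on ml.t2.medium endpoint

# Assumption: a 730-hour month, i.e. 43,800 minutes.
MINUTES_PER_MONTH = 730 * 60


def lambda_monthly_cost(rpm: float) -> float:
    """Monthly cost of the Standard Lambda stack at a steady request rate."""
    requests = rpm * MINUTES_PER_MONTH
    return requests * LAMBDA_PER_MILLION / 1_000_000


def sagemaker_monthly_cost(rpm: float) -> float:
    """Monthly cost of the SageMaker stack: fixed endpoint plus proxy Lambda."""
    requests = rpm * MINUTES_PER_MONTH
    return SAGEMAKER_FIXED_MONTHLY + requests * SAGEMAKER_PER_MILLION / 1_000_000


def break_even_rpm() -> float:
    """Request rate at which both stacks cost the same per month."""
    saving_per_request = (LAMBDA_PER_MILLION - SAGEMAKER_PER_MILLION) / 1_000_000
    return SAGEMAKER_FIXED_MONTHLY / saving_per_request / MINUTES_PER_MONTH


for rpm in (10, 50, 72, 200, 1000):
    print(f"{rpm:>5} RPM  Lambda ${lambda_monthly_cost(rpm):8.2f}  "
          f"SageMaker ${sagemaker_monthly_cost(rpm):8.2f}")
print(f"break-even at roughly {break_even_rpm():.1f} RPM")
```

The break-even is simply the fixed endpoint cost divided by the per-request saving, which is why it moves if either the instance price or the Lambda duration changes.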

References:

SageMaker pricing - link
Lambda pricing - link

CTO Verdict: A Decision Framework

Think in thresholds, not services.

Use Standard Lambda when:

You’re in POC or early stage
Traffic is low or unpredictable
You want zero idle cost

Use Lambda with SnapStart when:

Traffic is low and sporadic
You are willing to pay for the SnapStart snapshot restoration
You also want a zero idle cost

Use SageMaker when:

You consistently exceed the ~72 requests/minute crossover
Traffic is steady
You want predictable cost
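The thresholds above can be expressed as a small helper. This is only a sketch of the decision framework: the function name is illustrative, and the `cold_start_sensitive` flag is an assumption standing in for "willing to pay for SnapStart snapshot restoration".

```python
CROSSOVER_RPM = 72  # break-even point reported in this post


def recommend_compute(
    avg_rpm: float,
    steady_traffic: bool,
    cold_start_sensitive: bool = False,  # hypothetical flag, see lead-in
) -> str:
    """Map the traffic profile to one of the three stacks from this post."""
    # SageMaker only pays off with steady traffic above the crossover.
    if steady_traffic and avg_rpm > CROSSOVER_RPM:
        return "SageMaker endpoint"
    # Below the crossover, SnapStart is worth it if cold starts matter.
    if cold_start_sensitive:
        return "Lambda with SnapStart"
    return "Standard Lambda"


print(recommend_compute(avg_rpm=5, steady_traffic=False))
print(recommend_compute(avg_rpm=5, steady_traffic=False, cold_start_sensitive=True))
print(recommend_compute(avg_rpm=200, steady_traffic=True))
```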

Final Rule:

Lambda is the default
SageMaker is the optimization
