TL;DR: I built an AutoML platform on AWS that handles training and inference for pennies. By moving from always-on SageMaker endpoints to a Serverless Lambda + ONNX architecture, I eliminated idle costs completely—bringing the monthly bill for a side project from ~$150 down to practically $0.
As an AWS Community Builder, I'm always looking for ways to leverage cloud-native services to solve problems efficiently. In my previous article, I shared how to train models cheaply using AWS Fargate Spot. But training is only half the battle.
To turn a trained model into a product, you need Inference—the ability to make predictions on new data. The industry standard is Amazon SageMaker Real-time Endpoints. They are powerful, scalable, and enterprise-ready. But for a personal project or sporadic workload, they have one major downside: You pay for them 24/7, even when no one is using them.
In this post, I'll walk you through how I evolved AWS AutoML Lite into a full-stack ML platform by adding Serverless Inference, Dark Mode, and robust Model Comparison, all while keeping costs at rock bottom.
1. The Inference Challenge: $0 vs $150/month
The standard "easy" path on AWS is deploying a SageMaker Endpoint.
- Production Standard: An `ml.c5.xlarge` instance costs ~$147/month.
- Budget Option: An `ml.t3.medium` instance costs ~$36/month.
But what about traffic?
Let's compare costs for a Side Project (100k reqs) vs Startup Scale (10M reqs).
| Scenario | Monthly Requests | SageMaker (ml.t3.medium) | Lambda (Serverless) |
|---|---|---|---|
| Side Project | 100,000 | $36.00 | $0.35 |
| Startup Scale | 10,000,000 | $36.00 | ~$35.00 |
Transparent Accounting: "Serverless" isn't magic. You also pay for S3 Storage (to hold the model) and S3 GET Requests (every time a cold Lambda downloads it).
- Storage: 50MB model = ~$0.001/month.
- Access: 100k cold starts = ~$0.04 in request fees.
Even adding these "hidden" costs, the total remains ~$35.04 vs $36.00. The math holds up.
The Verdict: You would need over 10 million predictions/month just to match the cost of the cheapest, burstable-performance SageMaker instance. For the production-grade ml.c5.xlarge ($147), the break-even point is closer to 40 million requests.
Until you hit that scale, Serverless is vastly cheaper.
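For the curious, the whole table reduces to a few lines of arithmetic. Here's a sketch, assuming us-east-1 list prices and a 1 GB function running ~200 ms per prediction (which is roughly what my numbers imply; adjust for your own model):

```python
# Back-of-envelope Lambda cost model behind the table above.
# Assumptions: us-east-1 list prices, 1 GB memory, ~200 ms per invocation.
REQUEST_PRICE = 0.20 / 1_000_000   # $ per Lambda request
GB_SECOND_PRICE = 0.0000166667     # $ per GB-second of compute
MEMORY_GB, DURATION_S = 1.0, 0.2

per_request = REQUEST_PRICE + GB_SECOND_PRICE * MEMORY_GB * DURATION_S

for label, reqs in [("Side Project", 100_000), ("Startup Scale", 10_000_000)]:
    print(f"{label}: ${reqs * per_request:,.2f}/month")

print(f"Break-even vs ml.t3.medium ($36/mo): ~{36 / per_request:,.0f} reqs")
print(f"Break-even vs ml.c5.xlarge ($147/mo): ~{147 / per_request:,.0f} reqs")
```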
The Solution: ONNX Runtime on Lambda
I decided to move the inference logic to AWS Lambda. Since Lambda is event-driven, you only pay when code runs.
- Cost: ~$0.20 per 1 million requests, plus a small compute charge billed per GB-second.
- Idle Cost: $0.00.
To make this work efficiently, I used the ONNX (Open Neural Network Exchange) format.
- Export: During training (on Fargate), we convert models (Scikit-Learn, LightGBM) to `.onnx`.
- Deploy: The deployment process simply flags the job in DynamoDB; no server provisioning required.
- Predict: A Python Lambda function loads the model from S3 into memory (caching it for subsequent "warm" invocations) and uses `onnxruntime` to generate predictions in milliseconds (sketched below).
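Here is a minimal sketch of that Predict step. The bucket and key are placeholders and the real handler has more error handling, but the shape is exactly this:

```python
import json

import boto3
import numpy as np
import onnxruntime as ort

s3 = boto3.client("s3")
_session = None  # survives across "warm" invocations of the same container


def _load_session(bucket: str, key: str) -> ort.InferenceSession:
    global _session
    if _session is None:  # cold start: download once, then reuse
        s3.download_file(bucket, key, "/tmp/model.onnx")
        _session = ort.InferenceSession("/tmp/model.onnx")
    return _session


def handler(event, context):
    sess = _load_session("automl-lite-artifacts", "jobs/abc123/model.onnx")
    features = np.array(json.loads(event["body"])["features"], dtype=np.float32)
    outputs = sess.run(None, {sess.get_inputs()[0].name: features})
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": outputs[0].tolist()}),
    }
```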
This architecture is the definition of Cloud Native Efficiency: maximum value, minimum waste.
We added a "Deploy" button to the results page. Under the hood, this doesn't spin up a server. It simply flags the job as "deployed" in DynamoDB and ensures the model artifacts are ready in S3. The API then knows to route prediction requests for that Job ID to the inference engine.
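In code, "deploying" is a single attribute update; the table and attribute names below are illustrative:

```python
import boto3

# Mark the job as deployed so the API routes predictions to it.
table = boto3.resource("dynamodb").Table("automl-jobs")
table.update_item(
    Key={"job_id": "abc123"},
    UpdateExpression="SET deployed = :d",
    ExpressionAttributeValues={":d": True},
)
```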
Bonus: Portable Models
If you don't even want to run Lambda, you can just download the model. We provide both the raw .pkl (Pickle) file and the .onnx file. This means you can run your trained model locally, on your own server, or even inside a browser using ONNX Runtime Web.
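For example, once the artifacts are downloaded, local predictions need nothing but Python (the feature values below are made up):

```python
import joblib
import numpy as np
import onnxruntime as ort

x = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)  # dummy feature row

# Option 1: the raw scikit-learn pickle
print(joblib.load("model.pkl").predict(x))

# Option 2: the portable ONNX graph
sess = ort.InferenceSession("model.onnx")
print(sess.run(None, {sess.get_inputs()[0].name: x}))
```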
2. Comparing Models: Data-Driven Decisions
Training one model is rarely enough. You usually train 4-5 variations with different time budgets or datasets to see what sticks.
Version 1.1.0 introduces a Compare Page. You can select up to 4 training runs and see them side-by-side.
Once selected, the platform visualizes the differences. This is crucial for spotting trade-offs. Maybe Model A has slightly better accuracy (94%) but took 20 minutes to train, while Model B is close enough (92%) but finished in 2 minutes.
3. UI Polish: Dark Mode & UX
A modern developer tool isn't complete without Dark Mode. We implemented a system-aware theme switcher using next-themes and Tailwind CSS. It respects your OS preference by default but lets you toggle it manually.
We also moved from a "developer console" look to a cleaner design, optimizing the history table to show just the key stats you need.
4. Engineering Lessons Learned
Building features is fun, but fixing bugs teaches you more. Here are two big technical hurdles we overcame in this release.
Lesson A: The "Deleted" Resource that Wasn't
We encountered a frustrating bug: users would delete a training job, but if they clicked "Back" in their browser, the Job Details page would still load perfectly—serving stale data.
The Culprit: Browser Caching.
Even though the DELETE API call was successful, the browser had cached the previous GET request for the Job Details. When the user navigated back, the browser served the cached "200 OK" response from disk without asking the server.
The Fix:
We had to implement a strict "Trust No One" caching strategy:
- Backend: The `DELETE` response now sends aggressive anti-cache headers (`no-store`, `max-age=0`); see the sketch after this list.
- Frontend: The critical `getJobDetails` fetch call now uses `cache: 'no-cache'`, which forces the browser to revalidate with the server via an ETag check.
  - If the job exists: the server replies "304 Not Modified" (fast).
  - If the job is deleted: the server replies "404 Not Found" and the UI immediately shows the error state.
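On the backend side, the fix is just response headers. A sketch in the shape of a Lambda proxy handler, with the deletion logic elided:

```python
import json


def delete_job(event, context):
    job_id = event["pathParameters"]["jobId"]
    # ... delete the DynamoDB item and S3 artifacts for job_id ...
    return {
        "statusCode": 200,
        "headers": {
            # Tell the browser never to serve this exchange from cache.
            "Cache-Control": "no-store, max-age=0",
        },
        "body": json.dumps({"deleted": job_id}),
    }
```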
Takeaway: Deleting a resource on the server doesn't automatically purge it from your user's browser cache. You have to be explicit.
Lesson B: Preserving State via Polling
Our UI polls the backend every 5 seconds to update the training progress bar. However, we found that sometimes valid presigned URLs (for downloading models) would disappear or break during these updates.
Ideally, you shouldn't re-fetch the whole object if you only need the status. But since we do fetch the full state, we implemented a client-side merger. It takes the new status from the API but preserves any existing valid URLs from the old state. This prevents the "flickering" download buttons and ensures a smooth user experience.
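The merge itself is tiny. Our real implementation lives in the TypeScript frontend, but the logic reads like this (field names are illustrative):

```python
PRESERVED_FIELDS = ("model_url", "onnx_url")  # presigned download links


def merge_job_state(old: dict, new: dict) -> dict:
    """Take fresh status/metrics from the API, but never drop a URL we already have."""
    merged = {**old, **new}
    for field in PRESERVED_FIELDS:
        if not new.get(field) and old.get(field):
            merged[field] = old[field]
    return merged
```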
What's Next? (Roadmap)
We aren't stopping here. The roadmap for v1.2 includes:
- Multi-user Authentication (Cognito)
- Email Notifications (SNS) when long training jobs finish
- Hyperparameter Tuning UI for advanced users
Try It Yourself
AutoML Lite is open source and can be deployed to your own AWS account in about 20 minutes.
If you have questions or want to contribute, drop a comment or open an issue on GitHub!