Running ML models in production sounds simple until you realize you're paying for servers 24/7 even when nobody is using them. That was my situation.
I had a model running on EC2, serving predictions through Flask. It worked. It also quietly burned money every hour of the day. So I rebuilt the entire inference pipeline using AWS Lambda and reduced costs to almost zero during idle time.
This post walks through exactly how I did it.
The Problem with "Always-On" ML Inference
When I first deployed a machine learning model, I followed the standard approach:
- Flask API
- EC2 instance
- Load model at startup
- Serve predictions over HTTP
It worked.
But it also meant:
- Paying for compute 24/7
- Even at 3AM when traffic = 0
For systems like AquaChain, inference is event-driven:
- Bursts of requests from devices
- Long idle periods
Running a server continuously for this pattern is wasteful.
Enter: Serverless ML Inference
With AWS Lambda:
- You pay only when your model runs
- No idle infrastructure
- Fully event-driven execution
The Stack
- scikit-learn 1.4.0
- XGBoost 2.0.3
- numpy 1.26.3 + pandas 2.1.4
- Python 3.11
- AWS Lambda (container image)
- Amazon ECR (container registry)
- S3 (model artifact storage)
Project Structure
```
ml_inference/
├── handler.py            # Lambda entry point
├── model_loader.py       # S3 model caching logic
├── feature_extractor.py
├── Dockerfile
└── requirements.txt
```
The Dockerfile
The key is using AWS's official Lambda base image. It includes the Lambda runtime interface client, so your container behaves exactly like a standard Lambda function.
```dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy function code
COPY handler.py model_loader.py feature_extractor.py ./

# Lambda handler entrypoint
CMD ["handler.lambda_handler"]
```
The requirements.txt for the ML stack:
```txt
scikit-learn==1.4.0
xgboost==2.0.3
numpy==1.26.3
pandas==2.1.4
boto3==1.34.34
joblib==1.3.2
```
One important detail: put `COPY requirements.txt` and `RUN pip install` before copying your application code. Docker caches each layer, so if your code changes but your dependencies don't, the pip install layer is reused and your build takes seconds instead of minutes.
The Handler
```python
import json
import logging

from model_loader import get_model
from feature_extractor import extract_features

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    try:
        readings = event.get('readings', {})
        device_id = event.get('deviceId', 'unknown')

        # Validate inputs before touching the model
        required = ['pH', 'turbidity', 'tds', 'temperature']
        missing = [f for f in required if f not in readings]
        if missing:
            return {
                'statusCode': 400,
                'body': json.dumps({
                    'error': f"Missing fields: {missing}",
                    'code': 'VALIDATION_ERROR'
                })
            }

        # Extract features (includes trend calculations)
        features = extract_features(readings)

        # Get model — cached in /tmp after first load
        model = get_model()

        # Run inference
        wqi = float(model.predict([features])[0])
        confidence = float(model.predict_proba([features]).max())
        quality = classify_wqi(wqi)

        logger.info("Inference complete", extra={
            'deviceId': device_id,
            'wqi': wqi,
            'quality': quality,
            'confidence': confidence
        })

        return {
            'statusCode': 200,
            'body': json.dumps({
                'wqi': round(wqi, 2),
                'quality': quality,
                'confidence': round(confidence, 4),
                'deviceId': device_id
            })
        }
    except Exception as e:
        logger.error(f"Inference error: {e}", exc_info=True)
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': 'Inference failed',
                'code': 'INFERENCE_ERROR'
            })
        }


def classify_wqi(wqi: float) -> str:
    if wqi >= 90: return 'Excellent'
    if wqi >= 70: return 'Good'
    if wqi >= 50: return 'Fair'
    if wqi >= 25: return 'Poor'
    return 'Very Poor'
```
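Threshold ladders like `classify_wqi` are easy to get subtly wrong at the boundaries. A quick standalone check (duplicating the function so it runs outside the Lambda):

```python
def classify_wqi(wqi: float) -> str:
    # Same thresholds as the handler above
    if wqi >= 90: return 'Excellent'
    if wqi >= 70: return 'Good'
    if wqi >= 50: return 'Fair'
    if wqi >= 25: return 'Poor'
    return 'Very Poor'

# Boundary values land in the higher band because the comparisons use >=
assert classify_wqi(90.0) == 'Excellent'
assert classify_wqi(89.99) == 'Good'
assert classify_wqi(50.0) == 'Fair'
assert classify_wqi(25.0) == 'Poor'
assert classify_wqi(24.99) == 'Very Poor'
```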
Model Caching: The Most Important Optimization
Lambda's /tmp directory persists across warm invocations of the same container instance. Loading a model from S3 on every request would add 200–500ms of latency and unnecessary S3 GET costs. Cache it in /tmp on first load:
```python
import os
import boto3
import joblib
import logging

logger = logging.getLogger()

MODEL_S3_BUCKET = os.environ['MODEL_BUCKET']
MODEL_S3_KEY = os.environ['MODEL_KEY']
LOCAL_MODEL_PATH = '/tmp/model.joblib'

_model_cache = None  # Module-level cache — survives across warm invocations


def get_model():
    global _model_cache

    if _model_cache is not None:
        logger.debug("Using in-memory model cache")
        return _model_cache

    # Check /tmp first (warm container, model already downloaded)
    if os.path.exists(LOCAL_MODEL_PATH):
        logger.info("Loading model from /tmp cache")
        _model_cache = joblib.load(LOCAL_MODEL_PATH)
        return _model_cache

    # Cold start — download from S3
    logger.info(f"Downloading model from s3://{MODEL_S3_BUCKET}/{MODEL_S3_KEY}")
    s3 = boto3.client('s3')
    s3.download_file(MODEL_S3_BUCKET, MODEL_S3_KEY, LOCAL_MODEL_PATH)

    _model_cache = joblib.load(LOCAL_MODEL_PATH)
    logger.info("Model loaded and cached")
    return _model_cache
```
Two levels of caching here:
- `_model_cache`: in-memory, fastest possible, survives as long as the container is warm
- `/tmp/model.joblib`: survives container reuse even if the Python process restarts
On a cold start you pay the S3 download once. Every subsequent warm invocation skips it entirely.
Building and Pushing to ECR
```bash
# Authenticate Docker with ECR
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS --password-stdin \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com

# Build the image
docker build -t aquachain-ml-inference .

# Tag for ECR
docker tag aquachain-ml-inference:latest \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest

# Push
docker push \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest
```
Then deploy the Lambda pointing at the ECR image:
```bash
aws lambda update-function-code \
  --function-name aquachain-function-ml-inference-dev \
  --image-uri 758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest \
  --region ap-south-1
```
CDK Definition
```python
from aws_cdk import (
    aws_lambda as lambda_,
    aws_ecr as ecr,
    aws_iam as iam,
    Duration
)

# Reference existing ECR repo
repo = ecr.Repository.from_repository_name(
    self, "MLInferenceRepo", "aquachain-ml-inference"
)

ml_inference_fn = lambda_.DockerImageFunction(
    self, "MLInferenceFunction",
    function_name="aquachain-function-ml-inference-dev",
    code=lambda_.DockerImageCode.from_ecr(
        repo,
        tag_or_digest="latest"
    ),
    memory_size=1024,  # ML models benefit from more memory
    timeout=Duration.seconds(30),
    environment={
        "MODEL_BUCKET": "aquachain-models-dev",
        "MODEL_KEY": "wqi/model_v2.joblib",
        "LOG_LEVEL": "INFO"
    }
)

# Grant S3 read access for model download
model_bucket.grant_read(ml_inference_fn)
```
Memory sizing matters here. I started at 512MB and saw ~180ms inference times. Bumping to 1024MB dropped it to ~85ms — Lambda allocates CPU proportionally to memory, so more memory = faster CPU = faster inference. Run a few tests at different memory sizes; the cost difference is often negligible compared to the latency improvement.
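The back-of-the-envelope math behind that claim: Lambda bills compute as GB-seconds (allocated memory times duration), so if doubling memory roughly halves duration, the billed compute barely moves. A quick sketch using the timings above:

```python
def billed_gb_seconds(memory_mb: int, duration_ms: float) -> float:
    """Lambda compute charge scales with memory (GB) x duration (seconds)."""
    return (memory_mb / 1024) * (duration_ms / 1000)

low = billed_gb_seconds(512, 180)    # 0.5 GB * 0.180 s = 0.0900 GB-s
high = billed_gb_seconds(1024, 85)   # 1.0 GB * 0.085 s = 0.0850 GB-s

print(f"512MB @ 180ms:  {low:.4f} GB-s per invocation")
print(f"1024MB @ 85ms:  {high:.4f} GB-s per invocation")
# Doubling memory here is marginally *cheaper* per invocation,
# while cutting latency roughly in half.
```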
Handling Cold Starts
Cold starts for container-based Lambdas are longer than zip-based ones — typically 2–5 seconds for a 500MB image. For AquaChain this is acceptable because inference is triggered asynchronously (the data processing Lambda doesn't wait for the result). But if you need synchronous inference with strict latency SLAs, two options:
1. Provisioned Concurrency — keeps N container instances warm at all times. Eliminates cold starts, but you pay for idle time. Only worth it if your p99 latency requirement is under 500ms and you have consistent traffic.
2. Scheduled warm-up ping — an EventBridge rule that invokes the function every 5 minutes with a dummy payload. Cheap, effective for low-traffic functions, but not a guarantee.
For most ML inference use cases, async invocation + accepting occasional cold starts is the right trade-off.
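If you do add a scheduled warm-up ping, have the handler short-circuit on ping payloads so they never reach the model or S3. The `warmup` key here is just a convention I'm assuming for illustration, not an AWS feature:

```python
import json

def lambda_handler(event, context):
    # EventBridge warm-up pings carry {"warmup": true}; return immediately
    # so the ping stays fast and never triggers a model load
    if event.get('warmup'):
        return {'statusCode': 200, 'body': json.dumps({'warmed': True})}

    # ... normal inference path continues here ...
    return {'statusCode': 200, 'body': json.dumps({'ok': True})}

# Simulated warm-up invocation
resp = lambda_handler({'warmup': True}, None)
assert json.loads(resp['body']) == {'warmed': True}
```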
Updating the Model
One of the best things about this setup: updating the model doesn't require a code deployment. You just upload a new model.joblib to S3 with the same key. The next cold start picks it up automatically.
For controlled rollouts and rollbacks, keep each model version under its own S3 key (or enable S3 versioning) and point the Lambda env var at the one you want:
```bash
# Upload a new model to the live key — the next cold start picks it up
aws s3 cp model_v3.joblib s3://aquachain-models-dev/wqi/model_v2.joblib

# To roll back, update the env var to point at a previous version's key
aws lambda update-function-configuration \
  --function-name aquachain-function-ml-inference-dev \
  --environment "Variables={MODEL_KEY=wqi/model_v1.joblib,MODEL_BUCKET=aquachain-models-dev}" \
  --region ap-south-1
```
The Numbers
Running in production on AquaChain:
| Metric | Value |
|---|---|
| Cold start (image download + model load) | ~2.1s |
| Warm inference (in-memory cache) | ~85ms |
| Warm inference (first call, /tmp cache) | ~120ms |
| Memory used | ~310MB of 1024MB allocated |
| Cost per 1M inferences | ~$0.21 |
Compare that to a t3.small EC2 instance running 24/7: ~$15/month regardless of traffic. At our current inference volume, Lambda costs under $1/month.
When NOT to Use Lambda
Serverless ML is not a silver bullet.
Avoid Lambda if:
- You need ultra-low latency (<50ms)
- You have constant high traffic
- Your model is extremely large (>5GB and slow to load)
In those cases, a dedicated endpoint (SageMaker / ECS / EC2) is a better fit.
What I'd Do Differently
1. Use multi-stage Docker builds. The current image includes build tools that aren't needed at runtime. A multi-stage build copies only the installed packages into the final image, reducing image size by 30–40% and speeding up cold starts.
2. Pin the base image digest, not just the tag. The `public.ecr.aws/lambda/python:3.11` tag can change underneath you. Use the SHA256 digest for reproducible builds in production.
3. Add model validation on load. Before caching the model, run a quick sanity check — predict on a known input and assert the output is in the expected range. Catches corrupted model files before they serve bad predictions.
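Item 3 is a few lines inside `get_model()`, run right before caching. A sketch with a dummy object standing in for the joblib-loaded model; `KNOWN_INPUT` and the 0–100 range are assumptions you'd pin to your own training data:

```python
class DummyModel:
    # Stand-in for the real joblib-loaded model
    def predict(self, X):
        return [72.5 for _ in X]

# A known-good reading: pH, turbidity, tds, temperature (illustrative values)
KNOWN_INPUT = [7.0, 1.2, 340.0, 24.5]


def validate_model(model):
    """Sanity-check a freshly loaded model before caching it."""
    pred = float(model.predict([KNOWN_INPUT])[0])
    if not 0.0 <= pred <= 100.0:
        raise ValueError(f"Model sanity check failed: WQI {pred} out of range")
    return model


model = validate_model(DummyModel())  # raises before a bad model can serve
```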
Serverless ML inference isn't for every system. But for event-driven workloads like AquaChain, it hits a rare sweet spot: low cost, zero idle infrastructure, and production-grade performance.

If your model doesn't need to run 24/7, your infrastructure shouldn't either.