Running ML models in production sounds simple until you realize you're paying for servers 24/7 even when nobody is using them. That was my situation.
I had a model running on EC2, serving predictions through Flask. It worked. It also quietly burned money every hour of the day. So I rebuilt the entire inference pipeline using AWS Lambda and reduced costs to almost zero during idle time.
This post walks through exactly how I did it.
The Problem with "Always-On" ML Inference
When I first deployed a machine learning model, I followed the standard approach:
- Flask API
- EC2 instance
- Load model at startup
- Serve predictions over HTTP
It worked.
But it also meant:
- Paying for compute 24/7
- Even at 3AM when traffic = 0
For systems like AquaChain, inference is event-driven:
- Bursts of requests from devices
- Long idle periods
Running a server continuously for this pattern is wasteful.
Enter: Serverless ML Inference
With AWS Lambda:
- You pay only when your model runs
- No idle infrastructure
- Fully event-driven execution
The Stack
- scikit-learn 1.4.0
- XGBoost 2.0.3
- numpy 1.26.3 + pandas 2.1.4
- Python 3.11
- AWS Lambda (container image)
- Amazon ECR (container registry)
- S3 (model artifact storage)
Project Structure
```
ml_inference/
├── handler.py            # Lambda entry point
├── model_loader.py       # S3 model caching logic
├── feature_extractor.py
├── Dockerfile
└── requirements.txt
```
The Dockerfile
The key is using AWS's official Lambda base image. It includes the Lambda runtime interface client, so your container behaves exactly like a standard Lambda function.
```dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy function code
COPY handler.py model_loader.py feature_extractor.py ./

# Lambda handler entrypoint
CMD ["handler.lambda_handler"]
```
The requirements.txt for the ML stack:
```txt
scikit-learn==1.4.0
xgboost==2.0.3
numpy==1.26.3
pandas==2.1.4
boto3==1.34.34
joblib==1.3.2
```
One important detail: put `COPY requirements.txt` and `RUN pip install` before copying your application code. Docker caches each layer, so if your code changes but your dependencies don't, the pip install layer is reused and your build takes seconds instead of minutes.
The Handler
```python
import json
import logging

from model_loader import get_model
from feature_extractor import extract_features

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    try:
        readings = event.get('readings', {})
        device_id = event.get('deviceId', 'unknown')

        # Validate inputs before touching the model
        required = ['pH', 'turbidity', 'tds', 'temperature']
        missing = [f for f in required if f not in readings]
        if missing:
            return {
                'statusCode': 400,
                'body': json.dumps({
                    'error': f"Missing fields: {missing}",
                    'code': 'VALIDATION_ERROR'
                })
            }

        # Extract features (includes trend calculations)
        features = extract_features(readings)

        # Get model — cached in /tmp after first load
        model = get_model()

        # Run inference
        wqi = float(model.predict([features])[0])
        confidence = float(model.predict_proba([features]).max())
        quality = classify_wqi(wqi)

        logger.info("Inference complete", extra={
            'deviceId': device_id,
            'wqi': wqi,
            'quality': quality,
            'confidence': confidence
        })

        return {
            'statusCode': 200,
            'body': json.dumps({
                'wqi': round(wqi, 2),
                'quality': quality,
                'confidence': round(confidence, 4),
                'deviceId': device_id
            })
        }
    except Exception as e:
        logger.error(f"Inference error: {e}", exc_info=True)
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': 'Inference failed',
                'code': 'INFERENCE_ERROR'
            })
        }


def classify_wqi(wqi: float) -> str:
    if wqi >= 90: return 'Excellent'
    if wqi >= 70: return 'Good'
    if wqi >= 50: return 'Fair'
    if wqi >= 25: return 'Poor'
    return 'Very Poor'
```
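Threshold ladders like `classify_wqi` are easy to get subtly wrong at the boundaries. A quick standalone check (duplicating the function so it runs outside the Lambda):

```python
def classify_wqi(wqi: float) -> str:
    # Same thresholds as the handler above
    if wqi >= 90: return 'Excellent'
    if wqi >= 70: return 'Good'
    if wqi >= 50: return 'Fair'
    if wqi >= 25: return 'Poor'
    return 'Very Poor'

# Boundary values land in the higher band because the comparisons use >=
assert classify_wqi(90.0) == 'Excellent'
assert classify_wqi(89.99) == 'Good'
assert classify_wqi(50.0) == 'Fair'
assert classify_wqi(25.0) == 'Poor'
assert classify_wqi(24.99) == 'Very Poor'
```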
Model Caching: The Most Important Optimization
Lambda's /tmp directory persists across warm invocations of the same container instance. Loading a model from S3 on every request would add 200–500ms of latency and unnecessary S3 GET costs. Cache it in /tmp on first load:
```python
import os
import boto3
import joblib
import logging

logger = logging.getLogger()

MODEL_S3_BUCKET = os.environ['MODEL_BUCKET']
MODEL_S3_KEY = os.environ['MODEL_KEY']
LOCAL_MODEL_PATH = '/tmp/model.joblib'

_model_cache = None  # Module-level cache — survives across warm invocations


def get_model():
    global _model_cache

    if _model_cache is not None:
        logger.debug("Using in-memory model cache")
        return _model_cache

    # Check /tmp first (warm container, model already downloaded)
    if os.path.exists(LOCAL_MODEL_PATH):
        logger.info("Loading model from /tmp cache")
        _model_cache = joblib.load(LOCAL_MODEL_PATH)
        return _model_cache

    # Cold start — download from S3
    logger.info(f"Downloading model from s3://{MODEL_S3_BUCKET}/{MODEL_S3_KEY}")
    s3 = boto3.client('s3')
    s3.download_file(MODEL_S3_BUCKET, MODEL_S3_KEY, LOCAL_MODEL_PATH)

    _model_cache = joblib.load(LOCAL_MODEL_PATH)
    logger.info("Model loaded and cached")
    return _model_cache
```
Two levels of caching here:
- `_model_cache`: in-memory, fastest possible, survives as long as the container is warm
- `/tmp/model.joblib`: survives container reuse even if the Python process restarts
On a cold start you pay the S3 download once. Every subsequent warm invocation skips it entirely.
Building and Pushing to ECR
```bash
# Authenticate Docker with ECR
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS --password-stdin \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com

# Build the image
docker build -t aquachain-ml-inference .

# Tag for ECR
docker tag aquachain-ml-inference:latest \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest

# Push
docker push \
  758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest
```
Then deploy the Lambda pointing at the ECR image:
```bash
aws lambda update-function-code \
  --function-name aquachain-function-ml-inference-dev \
  --image-uri 758346259059.dkr.ecr.ap-south-1.amazonaws.com/aquachain-ml-inference:latest \
  --region ap-south-1
```
CDK Definition
```python
from aws_cdk import (
    aws_lambda as lambda_,
    aws_ecr as ecr,
    aws_iam as iam,
    Duration
)

# Reference existing ECR repo
repo = ecr.Repository.from_repository_name(
    self, "MLInferenceRepo", "aquachain-ml-inference"
)

ml_inference_fn = lambda_.DockerImageFunction(
    self, "MLInferenceFunction",
    function_name="aquachain-function-ml-inference-dev",
    code=lambda_.DockerImageCode.from_ecr(
        repo,
        tag_or_digest="latest"
    ),
    memory_size=1024,  # ML models benefit from more memory
    timeout=Duration.seconds(30),
    environment={
        "MODEL_BUCKET": "aquachain-models-dev",
        "MODEL_KEY": "wqi/model_v2.joblib",
        "LOG_LEVEL": "INFO"
    }
)

# Grant S3 read access for model download
model_bucket.grant_read(ml_inference_fn)
```
Memory sizing matters here. I started at 512MB and saw ~180ms inference times. Bumping to 1024MB dropped it to ~85ms — Lambda allocates CPU proportionally to memory, so more memory = faster CPU = faster inference. Run a few tests at different memory sizes; the cost difference is often negligible compared to the latency improvement.
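The back-of-the-envelope math behind that claim: Lambda bills compute as GB-seconds (allocated memory times duration), so if doubling memory roughly halves duration, the billed compute barely moves. A quick sketch using the timings above:

```python
def billed_gb_seconds(memory_mb: int, duration_ms: float) -> float:
    """Lambda compute charge scales with memory (GB) x duration (seconds)."""
    return (memory_mb / 1024) * (duration_ms / 1000)

low = billed_gb_seconds(512, 180)    # 0.5 GB * 0.180 s = 0.0900 GB-s
high = billed_gb_seconds(1024, 85)   # 1.0 GB * 0.085 s = 0.0850 GB-s

print(f"512MB @ 180ms:  {low:.4f} GB-s per invocation")
print(f"1024MB @ 85ms:  {high:.4f} GB-s per invocation")
# Doubling memory here is marginally *cheaper* per invocation,
# while cutting latency roughly in half.
```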
Handling Cold Starts
Cold starts for container-based Lambdas are longer than zip-based ones — typically 2–5 seconds for a 500MB image. For AquaChain this is acceptable because inference is triggered asynchronously (the data processing Lambda doesn't wait for the result). But if you need synchronous inference with strict latency SLAs, two options:
1. Provisioned Concurrency — keeps N container instances warm at all times. Eliminates cold starts, but you pay for idle time. Only worth it if your p99 latency requirement is under 500ms and you have consistent traffic.
2. Scheduled warm-up ping — an EventBridge rule that invokes the function every 5 minutes with a dummy payload. Cheap, effective for low-traffic functions, but not a guarantee.
For most ML inference use cases, async invocation + accepting occasional cold starts is the right trade-off.
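If you do add a scheduled warm-up ping, have the handler short-circuit on ping payloads so they never reach the model or S3. The `warmup` key here is just a convention I'm assuming for illustration, not an AWS feature:

```python
import json

def lambda_handler(event, context):
    # EventBridge warm-up pings carry {"warmup": true}; return immediately
    # so the ping stays fast and never triggers a model load
    if event.get('warmup'):
        return {'statusCode': 200, 'body': json.dumps({'warmed': True})}

    # ... normal inference path continues here ...
    return {'statusCode': 200, 'body': json.dumps({'ok': True})}

# Simulated warm-up invocation
resp = lambda_handler({'warmup': True}, None)
assert json.loads(resp['body']) == {'warmed': True}
```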
Updating the Model
One of the best things about this setup: updating the model doesn't require a code deployment. You just upload a new model.joblib to S3 with the same key. The next cold start picks it up automatically.
For controlled rollouts and rollbacks, keep each model version under its own S3 key (or enable S3 versioning) and point the Lambda env var at the one you want:
```bash
# Upload a new model to the live key — the next cold start picks it up
aws s3 cp model_v3.joblib s3://aquachain-models-dev/wqi/model_v2.joblib

# To roll back, update the env var to point at a previous version's key
aws lambda update-function-configuration \
  --function-name aquachain-function-ml-inference-dev \
  --environment "Variables={MODEL_KEY=wqi/model_v1.joblib,MODEL_BUCKET=aquachain-models-dev}" \
  --region ap-south-1
```
The Numbers
Running in production on AquaChain:
| Metric | Value |
|---|---|
| Cold start (image download + model load) | ~2.1s |
| Warm inference (in-memory cache) | ~85ms |
| Warm inference (first call, /tmp cache) | ~120ms |
| Memory used | ~310MB of 1024MB allocated |
| Cost per 1M inferences | ~$0.21 |
Compare that to a t3.small EC2 instance running 24/7: ~$15/month regardless of traffic. At our current inference volume, Lambda costs under $1/month.
When NOT to Use Lambda
Serverless ML is not a silver bullet.
Avoid Lambda if:
- You need ultra-low latency (<50ms)
- You have constant high traffic
- Your model is extremely large (>5GB and slow to load)
In those cases, a dedicated endpoint (SageMaker / ECS / EC2) is a better fit.
What I'd Do Differently
1. Use multi-stage Docker builds. The current image includes build tools that aren't needed at runtime. A multi-stage build copies only the installed packages into the final image, reducing image size by 30–40% and speeding up cold starts.
2. Pin the base image digest, not just the tag. The `public.ecr.aws/lambda/python:3.11` tag can change underneath you. Use the SHA256 digest for reproducible builds in production.
3. Add model validation on load. Before caching the model, run a quick sanity check — predict on a known input and assert the output is in the expected range. Catches corrupted model files before they serve bad predictions.
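Item 3 is a few lines inside `get_model()`, run right before caching. A sketch with a dummy object standing in for the joblib-loaded model; `KNOWN_INPUT` and the 0–100 range are assumptions you'd pin to your own training data:

```python
class DummyModel:
    # Stand-in for the real joblib-loaded model
    def predict(self, X):
        return [72.5 for _ in X]

# A known-good reading: pH, turbidity, tds, temperature (illustrative values)
KNOWN_INPUT = [7.0, 1.2, 340.0, 24.5]


def validate_model(model):
    """Sanity-check a freshly loaded model before caching it."""
    pred = float(model.predict([KNOWN_INPUT])[0])
    if not 0.0 <= pred <= 100.0:
        raise ValueError(f"Model sanity check failed: WQI {pred} out of range")
    return model


model = validate_model(DummyModel())  # raises before a bad model can serve
```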
Serverless ML inference isn't for every system. But for event-driven workloads like AquaChain, it hits a rare sweet spot: low cost, zero idle infrastructure, and production-grade performance.

If your model doesn't need to run 24/7, your infrastructure shouldn't either.