Christian Nuss

Posted on Nov 26, 2024

Deploy Hugging Face Models to AWS Lambda in 3 steps

#python #aws #serverless #machinelearning

Ever wanted to deploy a Hugging Face model to AWS Lambda but got stuck with container builds, cold starts, and model caching? Here's how to do it in under 5 minutes using Scaffoldly.

TL;DR

Create an EFS filesystem named .cache in AWS:
- Go to AWS EFS Console
- Click "Create File System"
- Name it .cache
- Select any VPC (Scaffoldly will take care of the rest!)

Create your app from the python-huggingface branch:

 npx scaffoldly create app --template python-huggingface

Deploy it:
```
 cd my-app && npx scaffoldly deploy
```

That's it! You'll get a Hugging Face model running on Lambda (using openai-community/gpt2 as an example), complete with proper caching and container deployment.

Pro-Tip: For the EFS setup, you can customize it down to a One Zone (single-AZ) file system with Bursting throughput for even more cost savings. Scaffoldly will match the Lambda Function to the EFS's VPC, Subnets, and Security Group.

✨ Check out the Live Demo and the example code!

The Problem

Deploying ML models to AWS Lambda traditionally involves:

- Building and managing Docker containers
- Figuring out model caching and storage
- Dealing with Lambda's size limits
- Managing cold starts
- Setting up API endpoints

It's a lot of infrastructure work when you just want to serve a model!

The Solution

Scaffoldly handles all this complexity with a simple configuration file. Here's a complete application that serves a Hugging Face model (using openai-community/gpt2 as an example):

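# app.py
from flask import Flask
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup so every request reuses it.
generator = pipeline('text-generation', model='openai-community/gpt2')


@app.route("/")
def hello_world():
    # The pipeline returns a list of dicts; return the first generated string.
    output = generator("Hello, world,")
    return output[0]['generated_text']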

# requirements.txt
Flask ~= 3.0
gunicorn ~= 23.0
torch ~= 2.5
numpy ~= 2.1
transformers ~= 4.46
huggingface_hub[cli] ~= 0.26

// scaffoldly.json
{
  "name": "python-huggingface",
  "runtime": "python:3.12",
  "handler": "localhost:8000",
  "files": ["app.py"],
  "packages": ["pip:requirements.txt"],
  "resources": ["arn::elasticfilesystem:::file-system/.cache"],
  "schedules": {
    "@immediately": "huggingface-cli download openai-community/gpt2"
  },
  "scripts": {
    "start": "gunicorn app:app"
  },
  "memorySize": 1024
}
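
The route in app.py always sends the same prompt. As a minimal sketch (not part of the template), the helper below shows how the pipeline's return value could be unpacked for an arbitrary prompt. `generate_text` and `fake_generator` are hypothetical names; the stand-in only mimics the list-of-dicts shape that a text-generation pipeline returns, so the snippet runs without downloading a model.

```python
from typing import Callable, Dict, List

# Shape returned by a transformers text-generation pipeline:
# a list with one dict per generated sequence.
Generator = Callable[..., List[Dict[str, str]]]


def generate_text(generator: Generator, prompt: str, max_new_tokens: int = 30) -> str:
    """Run the generator on a prompt and return the first generated string."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    output = generator(prompt, max_new_tokens=max_new_tokens)
    return output[0]["generated_text"]


def fake_generator(prompt: str, max_new_tokens: int = 30) -> List[Dict[str, str]]:
    """Hypothetical stand-in for pipeline(...), used only for this demo."""
    return [{"generated_text": prompt + " this is a test."}]


if __name__ == "__main__":
    print(generate_text(fake_generator, "Hello, world,"))
```

In app.py you would pass the real `generator` instead of the stand-in, for example with a prompt read from `request.args.get("prompt", "Hello, world,")` in the Flask route.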

How It Works

Scaffoldly does some clever things behind the scenes:

Smart Container Building:
- Automatically creates a Docker container optimized for Lambda
- Handles all Python dependencies including PyTorch
- Pushes to ECR without you writing any Docker commands
Efficient Model Handling:
- Uses Amazon EFS to cache the model files
- Pre-downloads models after deployment for faster cold starts
- Mounts the cache automatically in Lambda
Lambda-Ready Setup:
- Runs a proper WSGI server (gunicorn)
- Creates a public Lambda Function URL
- Proxies Function URL requests to gunicorn
- Manages IAM roles and permissions
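
The caching bullets above rely on the Hugging Face libraries' on-disk cache. As a rough sketch, assuming the EFS volume ends up wherever that cache lives (the mount path Scaffoldly uses is not stated in this post), the following resolves the hub cache directory in the order `huggingface_hub` documents (`HF_HUB_CACHE`, then `HF_HOME`, then `~/.cache/huggingface`) and checks whether a model snapshot is already present. `hub_cache_dir` and `is_model_cached` are illustrative names, not Scaffoldly or Hugging Face APIs.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def hub_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the Hugging Face hub cache directory from environment variables."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    cache_root = env.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    return Path(cache_root) / "huggingface" / "hub"


def is_model_cached(repo_id: str, cache_dir: Path) -> bool:
    """Return True if at least one snapshot of repo_id exists in cache_dir."""
    # The hub cache stores each model under models--<org>--<name>/snapshots/<revision>.
    repo_folder = cache_dir / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_folder / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    print("Hub cache:", hub_cache_dir())
    # Demo against a throwaway directory that mimics the cache layout.
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        print(is_model_cached("openai-community/gpt2", cache))
        (cache / "models--openai-community--gpt2" / "snapshots" / "demo").mkdir(parents=True)
        print(is_model_cached("openai-community/gpt2", cache))
```

If the check returns False inside the function, the first request would have to download the model itself, which is the situation the `@immediately` download is meant to avoid.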

What `deploy` looks like

Here's output from an `npx scaffoldly deploy` command I ran on this example:

Real World Performance & Costs

✅ Costs: ~$0.20/day for AWS Lambda, ECR, and EFS

✅ Cold Start: ~20s for first request (model loading)

✅ Warm Requests: 5-20s (CPU-based inference)

While this setup uses CPU inference (which is slower than GPU), it's an incredibly cost-effective way to experiment with ML models or serve low-traffic endpoints.
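
To see the cold-versus-warm difference on your own deployment, one option is to time a few back-to-back requests. This is a minimal sketch using only the standard library; `fetch` and `time_requests` are hypothetical helper names, and the demo at the bottom times a short sleep instead of a real endpoint so it runs anywhere.

```python
import time
import urllib.request
from typing import Callable, List


def fetch(url: str, timeout: float = 60.0) -> str:
    """GET the URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def time_requests(call: Callable[[], object], count: int = 3) -> List[float]:
    """Invoke `call` `count` times in sequence and return each duration in seconds."""
    durations = []
    for _ in range(count):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload so the demo needs no network access.
    for index, seconds in enumerate(time_requests(lambda: time.sleep(0.01)), start=1):
        print(f"request {index}: {seconds:.3f}s")
```

Against a real deployment you would pass `lambda: fetch(url)` with the Function URL printed by the deploy. If the function was idle beforehand, the first duration should include the cold start and the later ones should reflect warm inference.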

Customizing for Other Models

Want to use a different model? Just update two files:

Change the model in app.py:

generator = pipeline('text-generation', model='your-model-here')

Update the download in scaffoldly.json:

"schedules": {
  "@immediately": "huggingface-cli download your-model-here"
}
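
Because the model ID lives in two files, it is easy to change one and forget the other. The following is a small sketch of a consistency check, assuming the two files keep the shapes shown in this post (a `pipeline(..., model='...')` call and a `huggingface-cli download ...` command under `@immediately`). The function names are illustrative and not part of Scaffoldly.

```python
import json
import re
from typing import Optional

# Matches "huggingface-cli download <model-id>" in the schedule command.
DOWNLOAD_PATTERN = re.compile(r"huggingface-cli\s+download\s+(\S+)")
# Matches model='<model-id>' or model="<model-id>" in app.py.
PIPELINE_PATTERN = re.compile(r"model\s*=\s*['\"]([^'\"]+)['\"]")


def model_in_config(config_text: str) -> Optional[str]:
    """Extract the model ID from the @immediately download command."""
    config = json.loads(config_text)
    command = config.get("schedules", {}).get("@immediately", "")
    match = DOWNLOAD_PATTERN.search(command)
    return match.group(1) if match else None


def model_in_app(app_source: str) -> Optional[str]:
    """Extract the model ID passed to pipeline(...) in the app source."""
    match = PIPELINE_PATTERN.search(app_source)
    return match.group(1) if match else None


def models_match(config_text: str, app_source: str) -> bool:
    """Return True when both files name the same model."""
    config_model = model_in_config(config_text)
    return config_model is not None and config_model == model_in_app(app_source)


if __name__ == "__main__":
    config_text = '{"schedules": {"@immediately": "huggingface-cli download your-model-here"}}'
    app_source = "generator = pipeline('text-generation', model='your-model-here')"
    print(models_match(config_text, app_source))
```

In a real project you would read `scaffoldly.json` and `app.py` from disk and run the check before deploying.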

Using Private or Gated Models

Scaffoldly supports private and gated models via the HF_TOKEN environment variable. You can provide your Hugging Face token in either of these places:

Local Development: Add to your shell profile (.bashrc, .zprofile, etc.):

  export HF_TOKEN="hf_rH...A"

CI/CD: Add as a GitHub Actions Secret:

  # In your repository settings -> Secrets and Variables -> Actions
  HF_TOKEN: hf_rH...A

The token will be automatically used for both downloading and accessing your private or gated models.
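
Before deploying a private or gated model, it can help to confirm that the token is actually visible in your environment. This is a hedged sketch of such a check; `masked_token` is a hypothetical helper, and it prints only a masked form so the secret never lands in logs.

```python
import os
from typing import Mapping, Optional


def masked_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a masked HF_TOKEN for logging, or None if it is not set."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return None
    if len(token) <= 4:
        # Too short to mask meaningfully; reveal nothing.
        return "..."
    # Show only the prefix and the last character, never the full secret.
    return token[:3] + "..." + token[-1]


if __name__ == "__main__":
    masked = masked_token()
    if masked is None:
        print("HF_TOKEN is not set; private or gated models will fail to download.")
    else:
        print("HF_TOKEN found:", masked)
```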

CI/CD Bonus

Scaffoldly even generates a GitHub Action for automated deployments:

name: Scaffoldly Deploy
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: scaffoldly/scaffoldly@v1
        with:
          secrets: ${{ toJSON(secrets) }}

Try It Yourself

The complete example is available on GitHub:
scaffoldly/scaffoldly-examples#python-huggingface

And you can create your own copy of this example by running:

npx scaffoldly create app --template python-huggingface

You can see it running live (though responses might be slow due to CPU inference):
Live Demo

What's Next?

- Try deploying different Hugging Face models
- Join the Scaffoldly Community on Discord
- Check out other examples
- Star our repos if you found this useful!
  - The scaffoldly toolchain
  - The Scaffoldly Examples repository