Part 3 of my series on building a low-cost personal AI stack on AWS.
Part 1 — Squeezing my $1k/month API bill to $20/month with AWS Credits
Part 2 — Drop-in Perplexity Sonar replacement with AWS Bedrock Nova Grounding
TL;DR
I built a self-hosted speech-to-text API on AWS Lambda using faster-whisper. After trying Amazon Transcribe, SageMaker Serverless, and Lambda with a bundled model, I landed on a Lambda + EFS + S3 architecture that achieves ~20-30 second cold starts (once the model is cached on EFS) for ~$0.21/month in storage costs. Once warm, specifying the language drops response time to ~10s.
Open source: gabrielkoo/aws-lambda-whisper-adaptor
The Problem
I wanted to automatically transcribe Telegram voice messages. The requirements were simple:
- Accuracy: Good enough for Cantonese
- Cost: Pay-per-use, scales to zero when idle
- Latency: Cold start under 60 seconds
There's a fourth constraint that's easy to overlook outside Hong Kong: most managed STT APIs simply aren't available here. OpenAI's Whisper API falls under the company's regional restrictions, which group Hong Kong with mainland China. Google's Gemini models are available and actually competitive on both accuracy and price — Gemini 3 Flash achieves 3.1% WER at ~$1.92/1000 minutes (Artificial Analysis STT leaderboard), cheaper than OpenAI's Whisper API and competitive with Lambda at low volume. The real reason I went with Lambda: AWS Credits from the Community Builder program (same theme as the rest of this series) make it effectively free.
Simple enough. Except it took four attempts to get there.
What I Tried (and Why It Didn't Work)
Option 1: Amazon Transcribe
The obvious first choice — fully managed, pay-per-use, native AWS integration.
Why I rejected it before even trying:
Amazon Transcribe supports zh-CN and zh-TW, but not yue (Cantonese). Whisper large-v3-turbo handles Cantonese significantly better, and accuracy matters more than convenience here.
Option 2: SageMaker Serverless Inference
SageMaker Serverless scales to zero and handles model serving — sounds perfect.
What happened:
I deployed a SageMaker Serverless endpoint with faster-whisper. The first invocation after idle:
- Container provisioning: ~30s
- Model loading: ~45-60s
- Total cold start: 60-90 seconds
For a voice message that's 5-10 seconds long, waiting 90 seconds is a terrible experience.
The 6GB memory wall:
SageMaker Serverless maxes out at 6144 MB (6 GB) RAM. Here's why that's a problem for Whisper:
- `whisper-large-v3-turbo` (INT8): ~780MB model + ~2GB Python/runtime overhead ≈ 2.8GB minimum
- `whisper-large-v3` (FP16): ~3GB model alone — barely fits, zero headroom for audio processing
- Any concurrent requests? You're OOM.
Lambda goes up to 10,240 MB. That headroom matters.
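A quick back-of-the-envelope check on those limits, using the approximate sizes quoted above:

```python
# Approximate figures from the text above (MB); treat as rough estimates.
RUNTIME_OVERHEAD = 2000  # Python + faster-whisper/CTranslate2 runtime
MODELS = {
    "large-v3-turbo (INT8)": 780,
    "large-v3 (FP16)": 3000,
}
SAGEMAKER_MAX = 6144   # SageMaker Serverless memory ceiling
LAMBDA_MAX = 10240     # Lambda memory ceiling

for name, model_mb in MODELS.items():
    need = model_mb + RUNTIME_OVERHEAD
    print(f"{name}: ~{need} MB needed | "
          f"SageMaker headroom: {SAGEMAKER_MAX - need} MB | "
          f"Lambda headroom: {LAMBDA_MAX - need} MB")
```

The FP16 model leaves barely 1GB of headroom under SageMaker's cap; on Lambda the same model has over 5GB to spare.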
Cost comparison:
SageMaker Serverless bills per GB-second of inference time. For sporadic voice message transcription (~10s per request, a few times a day), Lambda's per-invocation pricing is significantly cheaper. My Lambda setup costs ~$0.21/month in storage — the compute is essentially free at this volume.
I deleted the endpoint after testing.
Option 2b: Bedrock Marketplace
AWS Bedrock Marketplace does list Whisper Large V3 Turbo — but it deploys on a dedicated endpoint instance. Auto-scaling is available (including scale-to-zero), but that creates a different problem:
- Keep minimum 1 instance: always paying for idle time, even at 3am
- Scale to zero: cold starts when traffic resumes — SageMaker cold starts are measured in minutes, not seconds
- Not token/usage-based pricing either way
For a Telegram bot that gets a few voice messages a day, you're either burning money on idle instances or waiting minutes for the first message to transcribe. Lambda's ~20-30s cold start looks great by comparison.
Option 3: Lambda with Bundled Model
Next idea: bundle the model directly into the Docker image. No external dependencies, simple architecture.
What happened:
```dockerfile
# Download model during build
# Note: using openai/whisper-large-v3-turbo converted to int8 via sync-model workflow
RUN python -c "from faster_whisper import WhisperModel; WhisperModel('openai/whisper-large-v3-turbo')"
```
- Docker image size: ~10GB
- ECR push time: 5+ minutes
- Lambda cold start: 2 minutes 51 seconds
The cold start is dominated by Lambda pulling the 10GB image from ECR. AWS Lambda caches images, but any cold start after the cache expires hits this wall.
Why it didn't work:
- 3-minute cold start is unusable for interactive transcription
- Every code change requires rebuilding and pushing a 10GB image
- ECR storage: ~$1/month just for the image
Option 4: Lambda + S3 (No EFS)
What if Lambda downloads the model from S3 on cold start, storing it in /tmp?
The problem:
Lambda's /tmp is ephemeral. Every cold start re-downloads the model from S3:
- S3 download for 1.6GB FP16 model: 30-60 seconds
- S3 download for 780MB INT8 model: 15-30 seconds
This is better than the bundled model approach, but there's a bigger issue: no caching between Lambda instances. If you have 3 concurrent invocations, all 3 download the model independently. You're paying for S3 transfer on every cold start.
What about Lambda SnapStart or Durable Functions?
AWS added two relevant features since this was written:
SnapStart for Python (Nov 2024): snapshots the initialized execution environment — sounds perfect for caching a loaded model. The catch: SnapStart doesn't support container images. This adaptor is container-based, so it's off the table.
Lambda Durable Functions (re:Invent 2025): enables multi-step workflows with automatic checkpointing, pause/resume for up to one year, and failure recovery. This is workflow orchestration (think Azure Durable Functions) — useful for multi-step AI pipelines, but not for persisting a 780MB model binary between cold starts.
EFS remains the right solution for model caching.
What Actually Worked: Lambda + EFS + S3
The solution: use EFS as a persistent model cache, bootstrapped from S3. I've used EFS for persistent Streamlit state on ECS before — same pattern, different compute layer.
```
Request → Lambda Function URL
            ↓
        Lambda (VPC)
            ↓  first cold start only: S3 → EFS
        EFS (model cached here permanently)
```
How it works:
- First cold start (once per model): Lambda checks for a marker file on EFS. If missing, downloads model from S3 to EFS (~55s for INT8). Writes marker file. Then loads model into RAM.
- Subsequent cold starts (new container, model already on EFS): Marker file exists → load model from EFS into RAM (~20-30s for INT8).
- Warm invocations (same container reused): Model already in memory → transcription-only time (~10-22s depending on audio length and whether language is specified).
```python
import os

import boto3
from faster_whisper import WhisperModel

HF_MODEL_REPO = os.environ.get('HF_MODEL_REPO', 'openai/whisper-large-v3-turbo')
MODEL_SLUG = HF_MODEL_REPO.replace('/', '--')
EFS_MODEL_DIR = f'/mnt/whisper-models/{MODEL_SLUG}'
MODEL_MARKER = f'/mnt/whisper-models/.ready-{MODEL_SLUG}'

def bootstrap_model():
    if os.path.exists(MODEL_MARKER):
        return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')

    # First run: sync model from S3 to EFS
    s3 = boto3.client('s3')
    prefix = f'models/{MODEL_SLUG}/'
    os.makedirs(EFS_MODEL_DIR, exist_ok=True)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=os.environ['MODEL_S3_BUCKET'], Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            local_path = os.path.join(EFS_MODEL_DIR, key[len(prefix):])
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(os.environ['MODEL_S3_BUCKET'], key, local_path)
    open(MODEL_MARKER, 'w').close()  # Mark as ready
    return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')

MODEL = bootstrap_model()  # Runs at Lambda init time, cached for warm invocations
```
Why EFS works:
- EFS persists across Lambda instances — model is downloaded once, reused forever
- EFS is mounted at `/mnt/whisper-models` — Lambda reads it like a local filesystem
- S3 VPC Gateway Endpoint is free — no NAT Gateway needed (saves ~$32/month)
- Zero internet egress — Lambda → S3 via VPC Gateway Endpoint, Lambda → EFS within VPC. The Lambda function never reaches the internet. This is a meaningful security benefit when using third-party models from HuggingFace — model weights never leave the AWS network once synced to S3.
- EFS storage: ~$0.19/month for the 780MB INT8 model
🔒 Security note: The Lambda runs in a VPC with no internet access — no NAT Gateway, no public subnet. It can only reach EFS (VPC-internal) and S3 (via the free VPC Gateway Endpoint). This means even if you're using a third-party HuggingFace model, the model weights and your audio data never leave the AWS network. No data exfiltration risk, no outbound calls to unknown endpoints.
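For reference, the free S3 Gateway endpoint can be created with a single AWS CLI call. The VPC ID, route table ID, and region below are placeholders — substitute your own:

```shell
# Create a free S3 Gateway endpoint so the Lambda's VPC can reach S3
# without a NAT Gateway. All IDs below are placeholders.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```

Unlike Interface endpoints, Gateway endpoints for S3 carry no hourly or data processing charges.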
INT8 vs FP16: The Model Size Trade-off
The `openai/whisper-large-v3-turbo` model on HuggingFace needs conversion to CTranslate2 format. The sync-model workflow handles this, converting to INT8 and fixing the `num_mel_bins` config. Alternatively, use `Zoont/faster-whisper-large-v3-turbo-int8-ct2` — a pre-converted CTranslate2 INT8 model that works out of the box with `quantization=none`:
| Model | Size (EFS) | First Bootstrap | EFS Cold Start | Warm (2.5s audio) | Memory |
|---|---|---|---|---|---|
| `Zoont/faster-whisper-large-v3-turbo-int8-ct2` | ~780MB | ~55s | ~22s ✅ | ~10s ✅ | ~2.8GB |
| `openai/whisper-large-v3-turbo` (INT8, via sync-model) | ~780MB | ~55s | ~22s | ~10s | ~2.8GB |
| `openai/whisper-large-v3-turbo` (FP16) | ~1.5GB | ~126s | ~40s | ~15s | ~4GB |
| `Systran/faster-whisper-large-v3` (FP16, loaded as int8) | ~1.6GB | ~54s | ~30s | ~13s | 6GB |
Recommended: `Zoont/faster-whisper-large-v3-turbo-int8-ct2` — no conversion step needed, identical performance to the openai model converted to INT8. Use `quantization=none` in the sync-model workflow since it's already in CTranslate2 format.
Cost Breakdown
| Resource | Monthly Cost |
|---|---|
| EFS storage (780MB INT8) | ~$0.19 |
| S3 storage (780MB) | ~$0.02 |
| Lambda compute | ~$0.00167/warm invocation* |
| S3 VPC Gateway Endpoint | Free |
| NAT Gateway | Not needed ($0) |
| Total (storage only) | ~$0.21/month |
\*10GB × 10s = 100 GB-seconds per warm invocation. The Lambda free tier covers 400,000 GB-seconds/month — roughly 4,000 warm invocations. For a personal bot, compute cost is effectively $0. Storage dominates.
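Spelled out as arithmetic, with the memory, duration, and free-tier figures quoted above:

```python
MEMORY_GB = 10                   # Lambda configured memory
WARM_SECONDS = 10                # typical warm transcription time
FREE_TIER_GB_SECONDS = 400_000   # monthly Lambda free tier

per_invocation = MEMORY_GB * WARM_SECONDS                  # GB-seconds per warm call
free_invocations = FREE_TIER_GB_SECONDS // per_invocation  # covered by free tier
print(per_invocation, free_invocations)  # 100 4000
```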
Compare to SageMaker Serverless: minimum ~$5-10/month for similar workloads, plus the 60-90s cold start penalty.
Why not Provisioned Concurrency? PC keeps Lambda permanently warm (no cold starts), but costs ~$0.0000097222/GB-second. For a 10GB function running 24/7: ~$252/month. Even a minimal 4GB setup runs ~$100/month — roughly 500x more than the $0.21 storage approach. For a personal bot with a few voice messages a day, the occasional ~60s cold start is a fine trade-off.
vs. OpenAI Whisper API
OpenAI's Whisper API costs $0.006/minute. Here's how it compares for a bot averaging 15s voice messages:
| Volume | OpenAI Whisper API | Self-hosted Lambda |
|---|---|---|
| 50 msgs/month | $0.08 | $0.21 (storage only) |
| 140 msgs/month | $0.21 | $0.21 ← break-even |
| 500 msgs/month | $0.75 | $0.21 (storage only) |
| 1,000 msgs/month | $1.50 | $0.21 (storage only) |
| 4,000 msgs/month | $6.00 | $0.21 (storage only) |
Lambda compute is free within the free tier (~4,000 warm invocations/month). Beyond that, it's ~$0.00167/invocation — but 4,000 messages a month is already far beyond typical personal-bot volume.
Break-even: ~140 messages/month. Above that, Lambda wins on cost.
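The break-even point falls out of a one-liner, using the $0.006/minute OpenAI price and 15-second average messages from above:

```python
OPENAI_PER_MINUTE = 0.006      # $ per audio minute (OpenAI Whisper API)
AVG_MESSAGE_MINUTES = 15 / 60  # 15-second voice messages
MONTHLY_STORAGE = 0.21         # $ per month for EFS + S3

per_message = OPENAI_PER_MINUTE * AVG_MESSAGE_MINUTES  # $0.0015 per message
break_even = MONTHLY_STORAGE / per_message             # messages per month
print(round(break_even))  # 140
```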
But cost isn't the only reason to self-host:
- Geographic availability: OpenAI's API is not available in Hong Kong — HK falls under China's regional restriction. Azure OpenAI does offer Whisper, but only `whisper-1` (large-v2 based) — large-v3 and large-v3-turbo are not available. If you're in HK (or other restricted regions), this approach isn't just cheaper — it's the only option for v3-quality transcription.
- Cantonese accuracy: `language=yue` with Whisper large-v3-turbo is noticeably better than the managed API for Cantonese
- Privacy: audio never leaves your infrastructure
- No rate limits: Lambda scales independently
Architecture
```
Telegram voice message
        ↓
OpenClaw (gateway)
        ↓
Lambda Function URL (auth via token)
        ↓
Lambda (VPC, 10GB RAM, 900s timeout)
        ↓
EFS /mnt/whisper-models/{model-slug}
        ↓
faster-whisper (CTranslate2, INT8)
        ↓
Transcript
```
Lambda configuration:
- Memory: 10,240 MB — actual usage is ~2.2GB (INT8 model), but Lambda allocates CPU proportional to memory. 10GB gives ~6 vCPUs vs ~2.3 vCPUs at 4GB, cutting warm transcription from ~16s to ~10s. You're paying for CPU, not RAM.
- Timeout: 900s (handles long audio files)
- VPC: Default VPC (no NAT Gateway)
- EFS: Mounted at `/mnt/whisper-models`
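The vCPU figures in the memory bullet follow from Lambda's documented allocation of roughly one full vCPU per 1,769 MB of configured memory:

```python
# Lambda allocates CPU in proportion to memory: ~1 vCPU per 1,769 MB.
MB_PER_VCPU = 1769

for memory_mb in (4096, 6144, 8192, 10240):
    vcpus = memory_mb / MB_PER_VCPU
    print(f"{memory_mb:>6} MB -> ~{vcpus:.1f} vCPUs")
# 4096 MB -> ~2.3 vCPUs ... 10240 MB -> ~5.8 vCPUs
```

CTranslate2 parallelizes decoding across cores, which is why warm inference speeds up at higher memory settings.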
Memory vs. cost trade-off (tested, 3 runs each):
| Config | Cold Start | Warm (2.5s audio) | GB-seconds/invocation |
|---|---|---|---|
| 4,096 MB | ~30s | ~21s | 84 (~$0.00140) |
| 6,144 MB | ~25s | ~16s | 96 (~$0.00160) |
| 8,192 MB | ~24s | ~18s | 144 (~$0.00240) |
| 10,240 MB | ~22s | ~15s | 150 (~$0.00250) |
Cold start is ~20-30s across all configs — it's EFS I/O bound, not CPU bound, so more memory doesn't help much here. Warm inference time does scale with memory (more vCPUs = faster CTranslate2 decoding). Interestingly, 4GB is the cheapest per invocation — the warm time savings at higher memory don't offset the extra GB-seconds. Within the free tier, cost differences are negligible regardless.
API Compatibility
The adaptor exposes two endpoints so it works as a drop-in replacement for existing integrations:
OpenAI compatible (/v1/audio/transcriptions):
```shell
curl -X POST https://<function-url>/v1/audio/transcriptions \
  -H "Authorization: Token <secret>" \
  -F "file=@audio.ogg" \
  -F "language=yue"
```

```json
{"text": "transcript here"}
```
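The same call from Python with `requests` — a sketch, not part of the adaptor; the Function URL, secret, and the `transcribe` helper name are placeholders:

```python
import requests  # third-party: pip install requests

def transcribe(function_url: str, secret: str, audio_path: str,
               language: str = "yue") -> str:
    """POST an audio file to the adaptor's OpenAI-compatible endpoint."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{function_url}/v1/audio/transcriptions",
            headers={"Authorization": f"Token {secret}"},
            files={"file": ("audio.ogg", f, "audio/ogg")},
            data={"language": language},  # skip auto-detect for a ~2x speedup
            timeout=900,  # match the Lambda's 900s timeout
        )
    resp.raise_for_status()
    return resp.json()["text"]

# Usage (placeholders):
# print(transcribe("https://<function-url>", "<secret>", "audio.ogg"))
```

Note the `Token` auth scheme — the official OpenAI SDK sends `Bearer` by default, so a plain HTTP client is the simplest fit here.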
Deepgram compatible (/v1/listen):
```shell
curl -X POST "https://<function-url>/v1/listen?language=yue" \
  -H "Authorization: Token <secret>" \
  -H "Content-Type: audio/ogg" \
  --data-binary @audio.ogg
```
Model Management API
Once you've synced multiple models to EFS, there's no SSH access to see what's there or clean up. I added two non-standard endpoints:
List models on EFS:
```shell
curl https://<function-url>/v1/models -H "Authorization: Token <secret>"
```

```json
{
  "object": "list",
  "data": [
    {"id": "openai/whisper-large-v3-turbo", "object": "model", "owned_by": "openai"},
    {"id": "Systran/faster-distil-whisper-large-v3", "object": "model", "owned_by": "Systran"}
  ]
}
```
Delete a model from EFS (the currently loaded model returns 409):
```shell
curl -X DELETE https://<function-url>/v1/models/Systran/faster-distil-whisper-large-v3 \
  -H "Authorization: Token <secret>"
```

```json
{"id": "Systran/faster-distil-whisper-large-v3", "object": "model", "deleted": true}
```
Slashes in model IDs work naturally — `rawPath` preserves the full path, so `DELETE /v1/models/openai/whisper-large-v3-turbo` correctly maps to model ID `openai/whisper-large-v3-turbo`.
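A minimal sketch of that routing logic (an illustrative helper, not the adaptor's actual code):

```python
def model_id_from_raw_path(raw_path: str) -> str:
    """Recover a slash-containing model ID from a Function URL rawPath.

    Lambda Function URL events carry the unmodified path in event["rawPath"],
    so "openai/whisper-large-v3-turbo" survives without URL-encoding tricks.
    Illustrative helper only, not the adaptor's actual code.
    """
    prefix = "/v1/models/"
    if not raw_path.startswith(prefix):
        raise ValueError(f"not a model route: {raw_path}")
    return raw_path[len(prefix):]

print(model_id_from_raw_path("/v1/models/openai/whisper-large-v3-turbo"))
# openai/whisper-large-v3-turbo
```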
Performance Tip: Always Specify Language
When no language is specified, Whisper runs language detection on the first audio chunk — adding noticeable overhead. For a 2.5s voice message:
| Request | Response Time |
|---|---|
| No language (auto-detect) | ~22s |
| `language=yue` (Cantonese) | ~10s |
That's a 2x speedup just from passing a language hint. Two ways to do it:
Option A — per-request query param (recommended, keeps Lambda language-agnostic):
```shell
# Deepgram endpoint
curl -X POST "https://<function-url>/v1/listen?language=yue" \
  -H "Authorization: Token <secret>" \
  -H "Content-Type: audio/ogg" \
  --data-binary @audio.ogg

# OpenAI endpoint
curl -X POST https://<function-url>/v1/audio/transcriptions \
  -H "Authorization: Token <secret>" \
  -F "file=@audio.ogg" \
  -F "language=yue"
```
Option B — Lambda env var (simpler if you only ever transcribe one language):
`WHISPER_LANGUAGE=yue`
I use Option A — the language is set in my OpenClaw config (language: "yue" in the audio model), which passes it as ?language=yue to the Lambda on every request.
Real-time Factor
Once warm, the Lambda transcribes faster than real-time for typical voice messages:
| Audio Duration | Warm Response Time | Real-time Factor |
|---|---|---|
| 2.5s | ~10s | 4x |
| 33s | ~23s | 0.68x ✅ faster than real-time |
The 2.5s result looks slow (4x), but Whisper processes audio in 30-second chunks — the overhead is fixed regardless of audio length. For longer messages, the real-time factor drops well below 1x.
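Real-time factor here is simply response time divided by audio duration; values below 1 mean transcription keeps up with playback:

```python
def real_time_factor(response_s: float, audio_s: float) -> float:
    """Ratio of processing time to audio length; <1 is faster than real-time."""
    return response_s / audio_s

print(real_time_factor(10, 2.5))           # 4.0 (fixed 30s-chunk overhead dominates)
print(round(real_time_factor(23, 33), 2))  # well below 1: faster than real-time
```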
Open Source
The project is open source at gabrielkoo/aws-lambda-whisper-adaptor.
Key features:
- Any faster-whisper model via `HF_MODEL_REPO` env var
- GitHub Actions workflow to sync models from HuggingFace → S3 (`quantization=int8` for HF-format models, `quantization=none` for pre-converted CTranslate2 models)
- `GET /v1/models` — list all models currently on EFS
- `DELETE /v1/models/{owner}/{model}` — remove a model from EFS on demand
- Pre-built Docker image: `ghcr.io/gabrielkoo/aws-lambda-whisper-adaptor:latest`
- Configurable language detection via `WHISPER_LANGUAGE` env var or per-request parameter
Pre-warming
For OpenClaw voice prompts: the ~20-30s cold start is often negligible in practice — if you're asking the agent to run a multi-step job, it'll take a few minutes anyway. Pre-warming only matters if you need the very first transcription to be fast.
Cold starts happen when Lambda hasn't been invoked recently. For predictable usage patterns (e.g. a morning standup bot), pre-warm the Lambda before you need it:
```shell
#!/bin/bash
# prewarm.sh — trigger Lambda init before expected usage
curl -s -o /dev/null \
  -X POST "$WHISPER_LAMBDA_URL/v1/listen?language=yue" \
  -H "Authorization: Token $WHISPER_API_SECRET" \
  -H "Content-Type: audio/ogg" \
  --data-binary @sample.ogg
echo "Lambda pre-warmed"
```
Schedule with cron: `0 8 * * * /path/to/prewarm.sh` (runs at 8am daily).
Alternatively, use an EventBridge rule to ping the Lambda every few minutes — though at that frequency, Provisioned Concurrency starts making more sense cost-wise.
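If you do go the EventBridge route, the rule can be wired up with the AWS CLI. The rule name, region, account ID, and function ARN below are placeholders; a direct invocation runs the Lambda's init phase, which is what keeps a container warm:

```shell
# Invoke the Lambda every 5 minutes to keep a warm container around.
# Names, region, account ID, and ARNs are placeholders.
aws events put-rule \
  --name whisper-prewarm \
  --schedule-expression "rate(5 minutes)"

aws lambda add-permission \
  --function-name whisper-adaptor \
  --statement-id whisper-prewarm \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/whisper-prewarm

aws events put-targets \
  --rule whisper-prewarm \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:whisper-adaptor"
```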
Conclusion
The Lambda + EFS + S3 architecture achieves:
- ~20-30s cold start (INT8 model on EFS); first-ever bootstrap from S3 takes ~55s (one-time only)
- ~10s warm invocations with `language=yue`
- ~$0.21/month storage cost
- Zero idle cost (scales to zero)
- Deepgram and OpenAI compatible APIs
The key insight: EFS is the missing piece. It provides persistent, fast storage that Lambda can access without a NAT Gateway (using the free S3 VPC Gateway Endpoint for bootstrapping).
I couldn't find any existing write-up of Whisper on Lambda using EFS for persistent model caching — most approaches either bundle the model in Docker (3-minute cold starts) or re-download from S3 on every cold start (no caching between instances). If you've seen this done before, I'd love to know.
Two things worth knowing before you deploy:
- Use `Zoont/faster-whisper-large-v3-turbo-int8-ct2` with `quantization=none` in the sync-model workflow — it's pre-converted to CTranslate2 INT8 and works out of the box (the `openai/whisper-large-v3-turbo` model requires conversion and can hit `num_mel_bins` config issues)
- Always pass a `language` parameter if you know it — cuts response time roughly in half
If you're building voice transcription on AWS and want Whisper-quality accuracy without the SageMaker complexity, give it a try.
