Part 3 of my series on building a low-cost personal AI stack on AWS.
Part 1 — Squeezing my $1k/month API bill to $20/month with AWS Credits
Part 2 — Drop-in Perplexity Sonar replacement with AWS Bedrock Nova Grounding
TL;DR
I built a self-hosted speech-to-text API on AWS Lambda using faster-whisper. After trying Amazon Transcribe, SageMaker Serverless, and Lambda with a bundled model, I landed on a Lambda + EFS + S3 architecture that achieves ~20-30 second cold starts (once the model is cached on EFS) for ~$0.21/month in storage costs. Once warm, specifying the language drops response time to ~10s.
Open source: gabrielkoo/aws-lambda-whisper-adaptor
The Problem
I wanted to automatically transcribe Telegram voice messages. The requirements were simple:
- Accuracy: Good enough for Cantonese
- Cost: Pay-per-use, scales to zero when idle
- Latency: Cold start under 60 seconds
There's a fourth constraint that's easy to overlook outside Hong Kong: most managed STT APIs simply aren't available here. OpenAI's Whisper API falls under the company's regional restrictions, which group Hong Kong with mainland China. Google's Gemini models are available and actually competitive on both accuracy and price — Gemini 3 Flash achieves 3.1% WER at ~$1.92/1000 minutes (Artificial Analysis STT leaderboard), cheaper than OpenAI's Whisper API and competitive with Lambda at low volume. The real reason I went with Lambda: AWS Credits from the Community Builder program (same theme as the rest of this series) make it effectively free.
Simple enough. Except it took four attempts to get there.
What I Tried (and Why It Didn't Work)
Option 1: Amazon Transcribe
The obvious first choice — fully managed, pay-per-use, native AWS integration.
Why I rejected it before even trying:
Amazon Transcribe supports zh-CN and zh-TW, but not yue (Cantonese). Whisper large-v3-turbo handles Cantonese significantly better, and accuracy matters more than convenience here.
Option 2: SageMaker Serverless Inference
SageMaker Serverless scales to zero and handles model serving — sounds perfect.
What happened:
I deployed a SageMaker Serverless endpoint with faster-whisper. The first invocation after idle:
- Container provisioning: ~30s
- Model loading: ~45-60s
- Total cold start: 60-90 seconds
For a voice message that's 5-10 seconds long, waiting 90 seconds is a terrible experience.
The 6GB memory wall:
SageMaker Serverless maxes out at 6144 MB (6 GB) RAM. Here's why that's a problem for Whisper:
- `whisper-large-v3-turbo` (INT8): ~780MB model + ~2GB Python/runtime overhead ≈ 2.8GB minimum
- `whisper-large-v3` (FP16): ~3GB model alone — barely fits, zero headroom for audio processing
- Any concurrent requests? You're OOM.
Lambda goes up to 10,240 MB. That headroom matters.
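A quick back-of-the-envelope check on those limits, using the approximate sizes quoted above:

```python
# Approximate figures from the text above (MB); treat as rough estimates.
RUNTIME_OVERHEAD = 2000  # Python + faster-whisper/CTranslate2 runtime
MODELS = {
    "large-v3-turbo (INT8)": 780,
    "large-v3 (FP16)": 3000,
}
SAGEMAKER_MAX = 6144   # SageMaker Serverless memory ceiling
LAMBDA_MAX = 10240     # Lambda memory ceiling

for name, model_mb in MODELS.items():
    need = model_mb + RUNTIME_OVERHEAD
    print(f"{name}: ~{need} MB needed | "
          f"SageMaker headroom: {SAGEMAKER_MAX - need} MB | "
          f"Lambda headroom: {LAMBDA_MAX - need} MB")
```

The FP16 model leaves barely 1GB of headroom under SageMaker's cap; on Lambda the same model has over 5GB to spare.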
Cost comparison:
SageMaker Serverless bills per GB-second of inference time. For sporadic voice message transcription (~10s per request, a few times a day), Lambda's per-invocation pricing is significantly cheaper. My Lambda setup costs ~$0.21/month in storage — the compute is essentially free at this volume.
I deleted the endpoint after testing.
Option 2b: Bedrock Marketplace
AWS Bedrock Marketplace does list Whisper Large V3 Turbo — but it deploys on a dedicated endpoint instance. Auto-scaling is available (including scale-to-zero), but that creates a different problem:
- Keep minimum 1 instance: always paying for idle time, even at 3am
- Scale to zero: cold starts when traffic resumes — SageMaker cold starts are measured in minutes, not seconds
- Not token/usage-based pricing either way
For a Telegram bot that gets a few voice messages a day, you're either burning money on idle instances or waiting minutes for the first message to transcribe. Lambda's ~20-30s cold start looks great by comparison.
Option 3: Lambda with Bundled Model
Next idea: bundle the model directly into the Docker image. No external dependencies, simple architecture.
What happened:
```dockerfile
# Download model during build
# Note: using openai/whisper-large-v3-turbo converted to int8 via sync-model workflow
RUN python -c "from faster_whisper import WhisperModel; WhisperModel('openai/whisper-large-v3-turbo')"
```
- Docker image size: ~10GB
- ECR push time: 5+ minutes
- Lambda cold start: 2 minutes 51 seconds
The cold start is dominated by Lambda pulling the 10GB image from ECR. AWS Lambda caches images, but any cold start after the cache expires hits this wall.
Why it didn't work:
- 3-minute cold start is unusable for interactive transcription
- Every code change requires rebuilding and pushing a 10GB image
- ECR storage: ~$1/month just for the image
Option 4: Lambda + S3 (No EFS)
What if Lambda downloads the model from S3 on cold start, storing it in /tmp?
The problem:
Lambda's /tmp is ephemeral. Every cold start re-downloads the model from S3:
- S3 download for 1.6GB FP16 model: 30-60 seconds
- S3 download for 780MB INT8 model: 15-30 seconds
This is better than the bundled model approach, but there's a bigger issue: no caching between Lambda instances. If you have 3 concurrent invocations, all 3 download the model independently. You're paying for S3 transfer on every cold start.
What about Lambda SnapStart or Durable Functions?
AWS added two relevant features since this was written:
SnapStart for Python (Nov 2024): snapshots the initialized execution environment — sounds perfect for caching a loaded model. The catch: SnapStart doesn't support container images. This adaptor is container-based, so it's off the table.
Lambda Durable Functions (re:Invent 2025): enables multi-step workflows with automatic checkpointing, pause/resume for up to one year, and failure recovery. This is workflow orchestration (think Azure Durable Functions) — useful for multi-step AI pipelines, but not for persisting a 780MB model binary between cold starts.
EFS remains the right solution for model caching.
What Actually Worked: Lambda + EFS + S3
The solution: use EFS as a persistent model cache, bootstrapped from S3. I've used EFS for persistent Streamlit state on ECS before — same pattern, different compute layer.
```
Request → Lambda Function URL
            ↓
        Lambda (VPC)
            ↓  first cold start only: S3 → EFS
        EFS (model cached here permanently)
```
How it works:
- First cold start (once per model): Lambda checks for a marker file on EFS. If missing, downloads model from S3 to EFS (~55s for INT8). Writes marker file. Then loads model into RAM.
- Subsequent cold starts (new container, model already on EFS): Marker file exists → load model from EFS into RAM (~20-30s for INT8).
- Warm invocations (same container reused): Model already in memory → transcription-only time (~10-22s depending on audio length and whether language is specified).
```python
import os

import boto3
from faster_whisper import WhisperModel

HF_MODEL_REPO = os.environ.get('HF_MODEL_REPO', 'openai/whisper-large-v3-turbo')
MODEL_SLUG = HF_MODEL_REPO.replace('/', '--')
EFS_MODEL_DIR = f'/mnt/whisper-models/{MODEL_SLUG}'
MODEL_MARKER = f'/mnt/whisper-models/.ready-{MODEL_SLUG}'

def bootstrap_model():
    if os.path.exists(MODEL_MARKER):
        return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')

    # First run: sync model from S3 to EFS
    s3 = boto3.client('s3')
    prefix = f'models/{MODEL_SLUG}/'
    os.makedirs(EFS_MODEL_DIR, exist_ok=True)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=os.environ['MODEL_S3_BUCKET'], Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            local_path = os.path.join(EFS_MODEL_DIR, key[len(prefix):])
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(os.environ['MODEL_S3_BUCKET'], key, local_path)
    open(MODEL_MARKER, 'w').close()  # Mark as ready
    return WhisperModel(EFS_MODEL_DIR, device='cpu', compute_type='int8')

MODEL = bootstrap_model()  # Runs at Lambda init time, cached for warm invocations
```
Why EFS works:
- EFS persists across Lambda instances — model is downloaded once, reused forever
- EFS is mounted at `/mnt/whisper-models` — Lambda reads it like a local filesystem
- S3 VPC Gateway Endpoint is free — no NAT Gateway needed (saves ~$32/month)
- Zero internet egress — Lambda → S3 via VPC Gateway Endpoint, Lambda → EFS within VPC. The Lambda function never reaches the internet. This is a meaningful security benefit when using third-party models from HuggingFace — model weights never leave the AWS network once synced to S3.
- EFS storage: ~$0.19/month for the 780MB INT8 model
🔒 Security note: The Lambda runs in a VPC with no internet access — no NAT Gateway, no public subnet. It can only reach EFS (VPC-internal) and S3 (via the free VPC Gateway Endpoint). This means even if you're using a third-party HuggingFace model, the model weights and your audio data never leave the AWS network. No data exfiltration risk, no outbound calls to unknown endpoints.
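For reference, the free S3 Gateway endpoint can be created with a single AWS CLI call. The VPC ID, route table ID, and region below are placeholders — substitute your own:

```shell
# Create a free S3 Gateway endpoint so the Lambda's VPC can reach S3
# without a NAT Gateway. All IDs below are placeholders.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```

Unlike Interface endpoints, Gateway endpoints for S3 carry no hourly or data processing charges.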
INT8 vs FP16: The Model Size Trade-off
The `openai/whisper-large-v3-turbo` model on HuggingFace needs conversion to CTranslate2 format. The sync-model workflow handles this, converting to INT8 and fixing the `num_mel_bins` config. Alternatively, use `Zoont/faster-whisper-large-v3-turbo-int8-ct2` — a pre-converted CTranslate2 INT8 model that works out of the box with `quantization=none`:
| Model | Size (EFS) | First Bootstrap | EFS Cold Start | Warm (2.5s audio) | Memory |
|---|---|---|---|---|---|
| `Zoont/faster-whisper-large-v3-turbo-int8-ct2` | ~780MB | ~55s | ~22s ✅ | ~10s ✅ | ~2.8GB |
| `openai/whisper-large-v3-turbo` (INT8, via sync-model) | ~780MB | ~55s | ~22s | ~10s | ~2.8GB |
| `openai/whisper-large-v3-turbo` (FP16) | ~1.5GB | ~126s | ~40s | ~15s | ~4GB |
| `Systran/faster-whisper-large-v3` (FP16, loaded as int8) | ~1.6GB | ~54s | ~30s | ~13s | 6GB |
Recommended: `Zoont/faster-whisper-large-v3-turbo-int8-ct2` — no conversion step needed, identical performance to the openai model converted to INT8. Use `quantization=none` in the sync-model workflow since it's already in CTranslate2 format.
Cost Breakdown
| Resource | Monthly Cost |
|---|---|
| EFS storage (780MB INT8) | ~$0.19 |
| S3 storage (780MB) | ~$0.02 |
| Lambda compute | ~$0.00167/warm invocation* |
| S3 VPC Gateway Endpoint | Free |
| NAT Gateway | Not needed ($0) |
| Total (storage only) | ~$0.21/month |
\*10GB × 10s = 100 GB-seconds per warm invocation. The Lambda free tier covers 400,000 GB-seconds/month — roughly 4,000 warm invocations. For a personal bot, compute cost is effectively $0. Storage dominates.
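Spelled out as arithmetic, with the memory, duration, and free-tier figures quoted above:

```python
MEMORY_GB = 10                   # Lambda configured memory
WARM_SECONDS = 10                # typical warm transcription time
FREE_TIER_GB_SECONDS = 400_000   # monthly Lambda free tier

per_invocation = MEMORY_GB * WARM_SECONDS                  # GB-seconds per warm call
free_invocations = FREE_TIER_GB_SECONDS // per_invocation  # covered by free tier
print(per_invocation, free_invocations)  # 100 4000
```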
Compare to SageMaker Serverless: minimum ~$5-10/month for similar workloads, plus the 60-90s cold start penalty.
Why not Provisioned Concurrency? PC keeps Lambda permanently warm (no cold starts), but costs ~$0.0000097222/GB-second. For a 10GB function running 24/7: ~$252/month. Even a minimal 4GB setup runs ~$100/month — roughly 500x more than the $0.21 storage approach. For a personal bot with a few voice messages a day, the occasional ~60s cold start is a fine trade-off.
vs. OpenAI Whisper API
OpenAI's Whisper API costs $0.006/minute. Here's how it compares for a bot averaging 15s voice messages:
| Volume | OpenAI Whisper API | Self-hosted Lambda |
|---|---|---|
| 50 msgs/month | $0.08 | $0.21 (storage only) |
| 140 msgs/month | $0.21 | $0.21 ← break-even |
| 500 msgs/month | $0.75 | $0.21 (storage only) |
| 1,000 msgs/month | $1.50 | $0.21 (storage only) |
| 4,000 msgs/month | $6.00 | $0.21 (storage only) |
Lambda compute is free within the free tier (~4,000 warm invocations/month). Beyond that, it's ~$0.00167/invocation — but 4,000 messages a month is already far beyond typical personal-bot volume.
Break-even: ~140 messages/month. Above that, Lambda wins on cost.
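The break-even point falls out of a one-liner, using the $0.006/minute OpenAI price and 15-second average messages from above:

```python
OPENAI_PER_MINUTE = 0.006      # $ per audio minute (OpenAI Whisper API)
AVG_MESSAGE_MINUTES = 15 / 60  # 15-second voice messages
MONTHLY_STORAGE = 0.21         # $ per month for EFS + S3

per_message = OPENAI_PER_MINUTE * AVG_MESSAGE_MINUTES  # $0.0015 per message
break_even = MONTHLY_STORAGE / per_message             # messages per month
print(round(break_even))  # 140
```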
But cost isn't the only reason to self-host:
- Geographic availability: OpenAI's API is not available in Hong Kong — HK falls under China's regional restriction. Azure OpenAI does offer Whisper, but only `whisper-1` (large-v2 based) — large-v3 and large-v3-turbo are not available. If you're in HK (or other restricted regions), this approach isn't just cheaper — it's the only option for v3-quality transcription.
- Cantonese accuracy: `language=yue` with Whisper large-v3-turbo is noticeably better than the managed API for Cantonese
- Privacy: audio never leaves your infrastructure
- No rate limits: Lambda scales independently
Architecture
```
Telegram voice message
        ↓
OpenClaw (gateway)
        ↓
Lambda Function URL (auth via token)
        ↓
Lambda (VPC, 10GB RAM, 900s timeout)
        ↓
EFS /mnt/whisper-models/{model-slug}
        ↓
faster-whisper (CTranslate2, INT8)
        ↓
Transcript
```
Lambda configuration:
- Memory: 10,240 MB — actual usage is ~2.2GB (INT8 model), but Lambda allocates CPU proportional to memory. 10GB gives ~6 vCPUs vs ~2.3 vCPUs at 4GB, cutting warm transcription from ~16s to ~10s. You're paying for CPU, not RAM.
- Timeout: 900s (handles long audio files)
- VPC: Default VPC (no NAT Gateway)
- EFS: Mounted at `/mnt/whisper-models`
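The vCPU figures in the memory bullet follow from Lambda's documented allocation of roughly one full vCPU per 1,769 MB of configured memory:

```python
# Lambda allocates CPU in proportion to memory: ~1 vCPU per 1,769 MB.
MB_PER_VCPU = 1769

for memory_mb in (4096, 6144, 8192, 10240):
    vcpus = memory_mb / MB_PER_VCPU
    print(f"{memory_mb:>6} MB -> ~{vcpus:.1f} vCPUs")
# 4096 MB -> ~2.3 vCPUs ... 10240 MB -> ~5.8 vCPUs
```

CTranslate2 parallelizes decoding across cores, which is why warm inference speeds up at higher memory settings.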
Memory vs. cost trade-off (tested, 3 runs each):
| Config | Cold Start | Warm (2.5s audio) | GB-seconds/invocation |
|---|---|---|---|
| 4,096 MB | ~30s | ~21s | 84 (~$0.00140) |
| 6,144 MB | ~25s | ~16s | 96 (~$0.00160) |
| 8,192 MB | ~24s | ~18s | 144 (~$0.00240) |
| 10,240 MB | ~22s | ~15s | 150 (~$0.00250) |
Cold start is ~20-30s across all configs — it's EFS I/O bound, not CPU bound, so more memory doesn't help much here. Warm inference time does scale with memory (more vCPUs = faster CTranslate2 decoding). Interestingly, 4GB is the cheapest per invocation — the warm time savings at higher memory don't offset the extra GB-seconds. Within the free tier, cost differences are negligible regardless.
API Compatibility
The adaptor exposes two endpoints so it works as a drop-in replacement for existing integrations:
OpenAI compatible (/v1/audio/transcriptions):
```shell
curl -X POST https://<function-url>/v1/audio/transcriptions \
  -H "Authorization: Token <secret>" \
  -F "file=@audio.ogg" \
  -F "language=yue"
```

```json
{"text": "transcript here"}
```
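The same call from Python with `requests` — a sketch, not part of the adaptor; the Function URL, secret, and the `transcribe` helper name are placeholders:

```python
import requests  # third-party: pip install requests

def transcribe(function_url: str, secret: str, audio_path: str,
               language: str = "yue") -> str:
    """POST an audio file to the adaptor's OpenAI-compatible endpoint."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{function_url}/v1/audio/transcriptions",
            headers={"Authorization": f"Token {secret}"},
            files={"file": ("audio.ogg", f, "audio/ogg")},
            data={"language": language},  # skip auto-detect for a ~2x speedup
            timeout=900,  # match the Lambda's 900s timeout
        )
    resp.raise_for_status()
    return resp.json()["text"]

# Usage (placeholders):
# print(transcribe("https://<function-url>", "<secret>", "audio.ogg"))
```

Note the `Token` auth scheme — the official OpenAI SDK sends `Bearer` by default, so a plain HTTP client is the simplest fit here.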
Deepgram compatible (/v1/listen):
```shell
curl -X POST "https://<function-url>/v1/listen?language=yue" \
  -H "Authorization: Token <secret>" \
  -H "Content-Type: audio/ogg" \
  --data-binary @audio.ogg
```
Model Management API
Once you've synced multiple models to EFS, there's no SSH access to see what's there or clean up. I added two non-standard endpoints:
List models on EFS:
```shell
curl https://<function-url>/v1/models -H "Authorization: Token <secret>"
```

```json
{
  "object": "list",
  "data": [
    {"id": "openai/whisper-large-v3-turbo", "object": "model", "owned_by": "openai"},
    {"id": "Systran/faster-distil-whisper-large-v3", "object": "model", "owned_by": "Systran"}
  ]
}
```
Delete a model from EFS (the currently loaded model returns 409):
```shell
curl -X DELETE https://<function-url>/v1/models/Systran/faster-distil-whisper-large-v3 \
  -H "Authorization: Token <secret>"
```

```json
{"id": "Systran/faster-distil-whisper-large-v3", "object": "model", "deleted": true}
```
Slashes in model IDs work naturally — `rawPath` preserves the full path, so `DELETE /v1/models/openai/whisper-large-v3-turbo` correctly maps to model ID `openai/whisper-large-v3-turbo`.
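A minimal sketch of that routing logic (an illustrative helper, not the adaptor's actual code):

```python
def model_id_from_raw_path(raw_path: str) -> str:
    """Recover a slash-containing model ID from a Function URL rawPath.

    Lambda Function URL events carry the unmodified path in event["rawPath"],
    so "openai/whisper-large-v3-turbo" survives without URL-encoding tricks.
    Illustrative helper only, not the adaptor's actual code.
    """
    prefix = "/v1/models/"
    if not raw_path.startswith(prefix):
        raise ValueError(f"not a model route: {raw_path}")
    return raw_path[len(prefix):]

print(model_id_from_raw_path("/v1/models/openai/whisper-large-v3-turbo"))
# openai/whisper-large-v3-turbo
```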
Performance Tip: Always Specify Language
When no language is specified, Whisper runs language detection on the first audio chunk — adding noticeable overhead. For a 2.5s voice message:
| Request | Response Time |
|---|---|
| No language (auto-detect) | ~22s |
| `language=yue` (Cantonese) | ~10s |
That's a 2x speedup just from passing a language hint. Two ways to do it:
Option A — per-request query param (recommended, keeps Lambda language-agnostic):
```shell
# Deepgram endpoint
curl -X POST "https://<function-url>/v1/listen?language=yue" \
  -H "Authorization: Token <secret>" \
  -H "Content-Type: audio/ogg" \
  --data-binary @audio.ogg

# OpenAI endpoint
curl -X POST https://<function-url>/v1/audio/transcriptions \
  -H "Authorization: Token <secret>" \
  -F "file=@audio.ogg" \
  -F "language=yue"
```
Option B — Lambda env var (simpler if you only ever transcribe one language):
`WHISPER_LANGUAGE=yue`
I use Option A — the language is set in my OpenClaw config (language: "yue" in the audio model), which passes it as ?language=yue to the Lambda on every request.
Real-time Factor
Once warm, the Lambda transcribes faster than real-time for typical voice messages:
| Audio Duration | Warm Response Time | Real-time Factor |
|---|---|---|
| 2.5s | ~10s | 4x |
| 33s | ~23s | 0.68x ✅ faster than real-time |
The 2.5s result looks slow (4x), but Whisper processes audio in 30-second chunks — the overhead is fixed regardless of audio length. For longer messages, the real-time factor drops well below 1x.
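Real-time factor here is simply response time divided by audio duration; values below 1 mean transcription keeps up with playback:

```python
def real_time_factor(response_s: float, audio_s: float) -> float:
    """Ratio of processing time to audio length; <1 is faster than real-time."""
    return response_s / audio_s

print(real_time_factor(10, 2.5))           # 4.0 (fixed 30s-chunk overhead dominates)
print(round(real_time_factor(23, 33), 2))  # well below 1: faster than real-time
```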
Open Source
The project is open source at gabrielkoo/aws-lambda-whisper-adaptor.
Key features:
- Any faster-whisper model via `HF_MODEL_REPO` env var
- GitHub Actions workflow to sync models from HuggingFace → S3 (`quantization=int8` for HF-format models, `quantization=none` for pre-converted CTranslate2 models)
- `GET /v1/models` — list all models currently on EFS
- `DELETE /v1/models/{owner}/{model}` — remove a model from EFS on demand
- Pre-built Docker image: `ghcr.io/gabrielkoo/aws-lambda-whisper-adaptor:latest`
- Configurable language detection via `WHISPER_LANGUAGE` env var or per-request parameter
Pre-warming
For OpenClaw voice prompts: the ~20-30s cold start is often negligible in practice — if you're asking the agent to run a multi-step job, it'll take a few minutes anyway. Pre-warming only matters if you need the very first transcription to be fast.
Cold starts happen when Lambda hasn't been invoked recently. For predictable usage patterns (e.g. a morning standup bot), pre-warm the Lambda before you need it:
```shell
#!/bin/bash
# prewarm.sh — trigger Lambda init before expected usage
curl -s -o /dev/null \
  -X POST "$WHISPER_LAMBDA_URL/v1/listen?language=yue" \
  -H "Authorization: Token $WHISPER_API_SECRET" \
  -H "Content-Type: audio/ogg" \
  --data-binary @sample.ogg
echo "Lambda pre-warmed"
```
Schedule with cron: `0 8 * * * /path/to/prewarm.sh` (runs at 8am daily).
Alternatively, use an EventBridge rule to ping the Lambda every few minutes — though at that frequency, Provisioned Concurrency starts making more sense cost-wise.
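If you do go the EventBridge route, the rule can be wired up with the AWS CLI. The rule name, region, account ID, and function ARN below are placeholders; a direct invocation runs the Lambda's init phase, which is what keeps a container warm:

```shell
# Invoke the Lambda every 5 minutes to keep a warm container around.
# Names, region, account ID, and ARNs are placeholders.
aws events put-rule \
  --name whisper-prewarm \
  --schedule-expression "rate(5 minutes)"

aws lambda add-permission \
  --function-name whisper-adaptor \
  --statement-id whisper-prewarm \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/whisper-prewarm

aws events put-targets \
  --rule whisper-prewarm \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:whisper-adaptor"
```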
Conclusion
The Lambda + EFS + S3 architecture achieves:
- ~20-30s cold start (INT8 model on EFS); first-ever bootstrap from S3 takes ~55s (one-time only)
- ~10s warm invocations with `language=yue`
- ~$0.21/month storage cost
- Zero idle cost (scales to zero)
- Deepgram and OpenAI compatible APIs
The key insight: EFS is the missing piece. It provides persistent, fast storage that Lambda can access without a NAT Gateway (using the free S3 VPC Gateway Endpoint for bootstrapping).
I couldn't find any existing write-up of Whisper on Lambda using EFS for persistent model caching — most approaches either bundle the model in Docker (3-minute cold starts) or re-download from S3 on every cold start (no caching between instances). If you've seen this done before, I'd love to know.
Two things worth knowing before you deploy:
- Use `Zoont/faster-whisper-large-v3-turbo-int8-ct2` with `quantization=none` in the sync-model workflow — it's pre-converted to CTranslate2 INT8 and works out of the box (the `openai/whisper-large-v3-turbo` model requires conversion and can hit `num_mel_bins` config issues)
- Always pass a `language` parameter if you know it — cuts response time roughly in half
If you're building voice transcription on AWS and want Whisper-quality accuracy without the SageMaker complexity, give it a try.
