💡 The Problem
We recently ran into a production issue — our Speech-to-Text (STT) service stopped working for a few hours.
The feature was fixed quickly, but the transcripts for that downtime were missing.
Luckily, in Amazon Connect, all call recordings are stored in S3.
So the audio was there, but no transcripts.
We needed to reprocess all those missed files — fast.
🧠 First Attempt: Lambda (and its Limitations)
We quickly built a Lambda function to process unprocessed files from S3.
It worked fine — until it didn’t.
AWS Lambda has a 15-minute execution limit, and processing large audio files can easily exceed that.
We could have switched to EC2, but that felt like using a hammer for a small screw: we'd have had to build the auto-scaling, graceful shutdown, retries, and job management ourselves.
We needed something that behaved like a job, not a script.
🚀 Enter AWS Batch + Fargate
That’s when AWS Batch came to the rescue.
It’s perfect for this kind of workload — long-running, batch-style, event-driven jobs.
Here’s the setup we used:
- Created a Compute Environment
  - Backed by AWS Fargate → no EC2 management.
  - Scales automatically depending on job load.
- Defined a Job Queue
  - All reprocessing jobs are submitted here.
  - The queue ensures controlled concurrency and retries.
- Built a Job Definition
  - Packaged our STT processing logic as a Docker image.
  - Uploaded it to Amazon ECR.
  - Defined the required vCPU and memory for each job.
- Triggered via Lambda
  - A small Lambda fetches the list of unprocessed S3 files.
  - For each batch (say, 50 files), it submits a Batch job (see the sketch below).
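For illustration, here's a minimal sketch of what that trigger Lambda could look like. The bucket name, prefixes, chunk size, and the `has_transcript` check are assumptions for the example, not our exact production logic.

```python
# Hypothetical trigger Lambda: find audio files without transcripts and
# submit one AWS Batch job per chunk of ~50 keys.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
batch = boto3.client("batch")

BUCKET = "my-connect-recordings"     # assumption: Connect recordings bucket
AUDIO_PREFIX = "recordings/"         # assumption: where the call audio lives
TRANSCRIPT_PREFIX = "transcripts/"   # assumption: where transcripts land
CHUNK_SIZE = 50


def has_transcript(key: str) -> bool:
    """Assume a transcript exists if a .json object with the same basename does."""
    transcript_key = TRANSCRIPT_PREFIX + key.split("/")[-1] + ".json"
    try:
        s3.head_object(Bucket=BUCKET, Key=transcript_key)
        return True
    except ClientError:
        return False


def handler(event, context):
    # List the audio files and keep only the ones missing a transcript.
    pending = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=AUDIO_PREFIX):
        for obj in page.get("Contents", []):
            if not has_transcript(obj["Key"]):
                pending.append(obj["Key"])

    # Submit one Batch job per chunk of CHUNK_SIZE files.
    for i in range(0, len(pending), CHUNK_SIZE):
        chunk = pending[i:i + CHUNK_SIZE]
        batch.submit_job(
            jobName=f"reprocess-audio-{i // CHUNK_SIZE}",
            jobQueue="my-queue",
            jobDefinition="my-job-def",
            containerOverrides={
                "environment": [
                    {"name": "S3_BUCKET", "value": BUCKET},
                    {"name": "S3_KEYS", "value": ",".join(chunk)},
                ]
            },
        )

    return {"pending_files": len(pending)}
```

Passing the keys through an environment variable keeps the sketch simple; for very large chunks you'd more likely write a manifest to S3 and hand the job its key instead.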
⚙️ The Flow in Action
1. Lambda → checks for unprocessed audio files in S3.
2. Lambda → AWS Batch: submits a job to process them.
3. AWS Batch (Fargate) spins up compute and runs the job.
4. Job → downloads audio → runs STT → uploads transcript → updates metadata (sketched below).
5. Fargate shuts down automatically when the job finishes.
No idle servers, no manual cleanup, no stress.
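For completeness, here's a rough sketch of the container side. The `run_stt` function is a placeholder for whatever STT engine you use, and the environment variables match the hypothetical trigger Lambda above.

```python
# Hypothetical container entrypoint: download audio, run STT, upload transcript.
# S3_BUCKET and S3_KEYS are set by the trigger Lambda via containerOverrides.
import json
import os

import boto3

s3 = boto3.client("s3")


def run_stt(local_path: str) -> dict:
    # Placeholder: call your real STT service here and return its result.
    return {"source": local_path, "text": ""}


def main():
    bucket = os.environ["S3_BUCKET"]
    keys = os.environ["S3_KEYS"].split(",")

    for key in keys:
        local_path = "/tmp/" + key.split("/")[-1]
        s3.download_file(bucket, key, local_path)   # download audio
        transcript = run_stt(local_path)            # run STT (can take well over 15 minutes)
        s3.put_object(                              # upload the transcript
            Bucket=bucket,
            Key="transcripts/" + key.split("/")[-1] + ".json",
            Body=json.dumps(transcript),
            ContentType="application/json",
        )


if __name__ == "__main__":
    main()
```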
🧩 Why This Design Rocks
✅ Serverless all the way — Lambda + Fargate + S3
✅ Auto-scaling compute — no EC2 to babysit
✅ Long-running safe zone — runs beyond Lambda’s 15-min cap
✅ Reusable — we can reprocess any backlog anytime
✅ Cost-efficient — pay only for what’s used
🪄 Bonus Tip
You can even schedule a “missed transcript” job to run daily or weekly,
checking for any files without transcripts and triggering a Batch job automatically.
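If you go that route, a scheduled EventBridge rule pointed at the trigger Lambda is all you need. A rough sketch using boto3 (the rule name, schedule, and Lambda ARN are placeholders):

```python
# Hypothetical daily sweep: an EventBridge rule that invokes the trigger Lambda.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:submit-reprocess-jobs"  # placeholder

# Run once a day; swap in a cron(...) expression for a specific time.
rule = events.put_rule(
    Name="daily-missed-transcript-sweep",
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
)

# Point the rule at the Lambda...
events.put_targets(
    Rule="daily-missed-transcript-sweep",
    Targets=[{"Id": "trigger-lambda", "Arn": LAMBDA_ARN}],
)

# ...and allow EventBridge to invoke it.
lambda_client.add_permission(
    FunctionName=LAMBDA_ARN,
    StatementId="allow-eventbridge-sweep",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```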
🧩 Understanding AWS Batch Scaling
In AWS Batch, the number of tasks (containers) that run in parallel depends on three things working together:
1. Compute Environment capacity → e.g., your environment has a maximum of 10 vCPUs.
2. Job Definition requirements → e.g., each job needs 1 vCPU.
3. How many jobs are in the queue (and their array size, if used).
🔹 Case 1: You Submit Multiple Independent Jobs
If you submit 10 jobs, each with 1 vCPU, and your environment allows 10 vCPUs,
then AWS Batch can run all 10 in parallel (subject to available Fargate capacity).
Example:
```bash
# pseudo example: submit 10 independent jobs
for i in {1..10}; do
  aws batch submit-job \
    --job-name process-audio-$i \
    --job-queue my-queue \
    --job-definition my-job-def
done
```
Each job = 1 vCPU → up to 10 can run simultaneously.
AWS Batch’s Job Scheduler will automatically pack as many as possible based on available compute.
🔹 Case 2: You Use an Array Job
Instead of manually looping, you can submit an array job.
Example:
```bash
aws batch submit-job \
  --job-name process-audios \
  --job-queue my-queue \
  --job-definition my-job-def \
  --array-properties size=10
```
This creates 10 child jobs under a single parent, each running independently (great for S3 list chunking).
Same result — 10 parallel containers, each with 1 vCPU.
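Inside each child, Batch sets the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable, which the container can use to pick its own slice of the work. A minimal sketch (the shared manifest of pending keys is an assumption):

```python
# Hypothetical array-job child: use AWS_BATCH_JOB_ARRAY_INDEX to claim a
# disjoint slice of a shared manifest of S3 keys.
import json
import os

import boto3

s3 = boto3.client("s3")

# Assumption: the trigger uploaded a manifest listing every pending audio key.
manifest = s3.get_object(Bucket="my-connect-recordings", Key="manifests/pending.json")
all_keys = json.loads(manifest["Body"].read())

index = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])  # 0..size-1, set by Batch
size = 10                                             # must match --array-properties size

my_keys = all_keys[index::size]  # every child gets a disjoint slice
print(f"child {index}: processing {len(my_keys)} files")
```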
🔹 Case 3: You Submit a Single Job that Needs More vCPUs
If you set in your job definition:
"vcpus": 4
and your environment has 10 total vCPUs →
then Batch will reserve 4 vCPUs for that job, leaving room for other smaller jobs.
So the compute environment doesn’t spawn “10 copies automatically” —
it just enforces a maximum pool of total CPU that concurrent jobs can consume.
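One nuance worth noting: on Fargate, a job's vCPU and memory are usually declared through `resourceRequirements` in the job definition rather than the older top-level `vcpus` field. A hedged sketch of registering such a definition (the image URI, role ARN, and memory value are placeholders):

```python
# Hypothetical Fargate job definition asking for 4 vCPUs via resourceRequirements.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="my-job-def",
    type="container",
    platformCapabilities=["FARGATE"],
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/stt-reprocess:latest",  # placeholder
        "executionRoleArn": "arn:aws:iam::123456789012:role/batch-execution-role",     # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"},  # MiB; Fargate only allows certain vCPU/memory combos
        ],
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
    },
)
```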
⚙️ TL;DR — How to Scale
| Goal | What to Do |
|---|---|
| Run multiple tasks concurrently | Submit multiple jobs or an array job |
| Set each job's CPU/memory needs | Define them in the Job Definition (e.g., 1 vCPU) |
| Cap the maximum parallelism | Set the compute environment's max vCPU capacity |
| Control parallelism at runtime | Pass --array-properties size=N dynamically |
| Scale the underlying capacity | Batch scales Fargate/EC2 capacity up/down automatically |
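The runtime control is the handy part: the trigger can compute the array size from the current backlog and pass it on every submission. A small sketch reusing the placeholder names from above:

```python
# Hypothetical dynamic sizing: one array child per chunk of ~50 pending files.
import math

import boto3

batch = boto3.client("batch")

pending_count = 480  # e.g., number of audio files still missing transcripts
chunk_size = 50
array_size = math.ceil(pending_count / chunk_size)  # -> 10 children

batch.submit_job(
    jobName="reprocess-backlog",
    jobQueue="my-queue",
    jobDefinition="my-job-def",
    arrayProperties={"size": array_size},
)
```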
🏁 Closing Thoughts
This experience reminded me —
“When your script starts feeling like a job, give it job-like powers.”
AWS Batch (especially with Fargate) is often underrated,
but it’s a powerful tool when you need on-demand, containerized, long-running compute
without managing any servers.