Every modern product that handles user-generated media - a podcast platform, a video CMS, a learning product, a content-moderation layer - eventually runs into the same problem. A file lands in storage, and now the system needs to understand it: extract the speech, identify what's on screen, summarise it, tag it, and make it queryable. Doing that on a single server is expensive, fragile, and impossible to scale predictably.
This article walks through a serverless design for that problem on AWS. The pipeline ingests audio, video, or images, runs them through managed AI services (Rekognition, Transcribe, Bedrock), and persists the extracted intelligence into DynamoDB for downstream use - all without operating a single EC2 instance. The focus is on why each piece exists and how it fits together, not just naming the services.
Why serverless is the right shape for this problem
Media workloads are spiky and long-running in a way that punishes traditional compute.
A single 45-minute video can take several minutes to transcribe; ten of them landing at once shouldn't require keeping ten servers warm all day. The workflow also fans out naturally: transcription, visual analysis, and thumbnail generation are independent steps that want to run in parallel.
Three properties of a serverless, event-driven design map cleanly onto this:
Elastic concurrency: Lambda, Rekognition, and Transcribe scale out on demand. You pay per request and per second of processing, not for idle capacity.
Native asynchrony: Step Functions turn a multi-stage AI pipeline into a declarative state machine with retries, timeouts, and parallel branches built in - no message queues or cron jobs to wire by hand.
Composability: Every stage is a managed service with a clear IAM contract. Swapping Transcribe for a different ASR provider, or Bedrock for a self-hosted model, is a local change.
Architecture at a glance
The pipeline has five logical layers:
- Ingestion: S3 receives the upload.
- Event routing: EventBridge picks up the Object Created event and triggers the workflow.
- Orchestration: Step Functions coordinates the processing stages.
- Intelligence: Lambda functions call Rekognition, Transcribe, and Bedrock to extract structured data.
- Persistence: DynamoDB stores the metadata, keyed by media ID, for downstream querying.
Walking through each component
1. S3 as the ingestion surface
The pipeline starts where most media pipelines start: a single S3 bucket configured for object uploads. A few things are worth doing at this layer that are easy to skip:
- Separate buckets (or at least prefixes) for raw input and processed artefacts. Keeping incoming/ and processed/ distinct means lifecycle rules, replication, and IAM policies stay clean as the system grows.
- Multipart upload configuration. Large video files need multipart uploads; the default S3 SDKs handle this, but the bucket should have a lifecycle rule to abort incomplete multipart uploads after N days, or costs quietly accumulate.
- Server-side encryption and versioning. If the system ever handles sensitive content (KYC video, medical, private podcasts), this is non-negotiable.
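As a sketch of that lifecycle rule, here is what the bucket's lifecycle configuration might look like (rule ID, prefix, and the seven-day window are illustrative choices, not fixed values):

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart-uploads",
      "Status": "Enabled",
      "Filter": { "Prefix": "incoming/" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```

This is applied once at bucket creation (via the console, CLI, or infrastructure-as-code) and quietly prevents abandoned uploads from accruing storage charges.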
2. EventBridge as the event router
You could wire S3 notifications directly to a Lambda, and it works, but it's rigid. Putting EventBridge between S3 and the rest of the pipeline gives you something much more valuable: a declarative event bus where rules pattern-match on the event payload.
The immediate benefit is filtering. A rule can fire the pipeline only for specific prefixes, file extensions, or size ranges:
{
"source": ["aws.s3"],
"detail-type": ["Object Created"],
"detail": {
"bucket": { "name": ["media-ingest-prod"] },
"object": {
"key": [{ "prefix": "incoming/" }],
"size": [{ "numeric": [">", 1024] }]
}
}
}
The longer-term benefit is that the event bus becomes a seam. When a second consumer appears - say, an analytics team that wants to count uploads per tenant - it attaches its own rule without touching the core pipeline.
3. Step Functions as the orchestrator
This is the heart of the design. Step Functions expresses the pipeline as a JSON (or YAML) state machine, and the payoff is enormous: retries with exponential backoff, per-state timeouts, parallel branches, and error paths are all configuration rather than code.
A reasonable shape for the state machine looks like this:
- ClassifyMedia: a Lambda that inspects the MIME type and branches on whether the file is an image, audio, or video.
- ParallelAnalysis: a Parallel state that, for a video, runs:
  - StartTranscriptionJob (Transcribe) on the extracted audio track.
  - StartLabelDetection and StartContentModeration (Rekognition Video) on the visual track.
- WaitForCompletion: asynchronous Rekognition and Transcribe jobs publish completion events; the state machine uses .waitForTaskToken or polling to resume once results are ready.
- SummariseWithBedrock: a Lambda that sends the transcript and label set to a Bedrock model and receives a structured summary.
- PersistMetadata: writes the final record into DynamoDB.
- CatchAll: a catch block that routes any failure to a dead-letter state, writes the error into a failures table, and optionally notifies via SNS.
The thing to internalise is that Step Functions is not "just" a workflow visualiser; it's the reason the pipeline is resilient. A transient Bedrock throttling error retries automatically; a malformed file fails into the catch branch instead of silently leaving the system in a half-processed state.
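As a condensed Amazon States Language sketch of that shape (function names are placeholders, task parameters and the task-token wait plumbing are elided for brevity):

```json
{
  "StartAt": "ClassifyMedia",
  "States": {
    "ClassifyMedia": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "classify-media", "Payload.$": "$" },
      "Retry": [
        { "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0 }
      ],
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "RecordFailure" }],
      "Next": "ParallelAnalysis"
    },
    "ParallelAnalysis": {
      "Type": "Parallel",
      "Branches": [
        { "StartAt": "Transcribe", "States": { "Transcribe": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:transcribe:startTranscriptionJob", "End": true } } },
        { "StartAt": "LabelDetection", "States": { "LabelDetection": { "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:rekognition:startLabelDetection", "End": true } } }
      ],
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "RecordFailure" }],
      "Next": "SummariseWithBedrock"
    },
    "SummariseWithBedrock": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "summarise-with-bedrock", "Payload.$": "$" },
      "Next": "PersistMetadata"
    },
    "PersistMetadata": { "Type": "Task", "Resource": "arn:aws:states:::dynamodb:putItem", "End": true },
    "RecordFailure": { "Type": "Task", "Resource": "arn:aws:states:::dynamodb:putItem", "End": true }
  }
}
```

The retry and catch blocks are the point: resilience lives in the definition, not in application code.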
4. Lambda as the glue
Each state in the machine is either a direct service integration (Step Functions can call Rekognition, Transcribe, and DynamoDB without a Lambda in between) or a small Lambda function for the bits that need custom logic: MIME classification, payload shaping before Bedrock, post-processing transcripts.
A rule of thumb: prefer direct service integrations wherever possible. Every Lambda you add is code to test, package, monitor, and patch. The direct integrations are zero-code and zero-cold-start.
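The ClassifyMedia step is one of the few places that genuinely needs a Lambda. A minimal sketch of its core logic, assuming an extension-based mapping and a simplified event shape (a real handler might also check the S3 object's Content-Type):

```python
# Sketch of the ClassifyMedia handler. The extension sets and event shape
# are assumptions; adapt them to the actual Step Functions input.
import os

AUDIO = {".mp3", ".wav", ".flac", ".m4a"}
VIDEO = {".mp4", ".mov", ".mkv", ".webm"}
IMAGE = {".jpg", ".jpeg", ".png", ".gif"}

def classify_media(key: str) -> str:
    """Return 'audio', 'video', 'image', or 'unsupported' for an S3 key."""
    ext = os.path.splitext(key)[1].lower()
    if ext in AUDIO:
        return "audio"
    if ext in VIDEO:
        return "video"
    if ext in IMAGE:
        return "image"
    return "unsupported"

def handler(event, context):
    # Step Functions passes through the bucket/key captured from EventBridge;
    # we annotate the payload so a Choice state can branch on media_type.
    media_type = classify_media(event["object"]["key"])
    return {**event, "media_type": media_type}
```

The Choice state that follows branches on `media_type`, and the "unsupported" value routes straight to the failure path rather than wasting an AI call.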
5. The AI layer: Rekognition, Transcribe, and Bedrock
Each of these handles a different modality:
Amazon Rekognition handles visual analysis: object and scene labels, celebrity detection, content moderation flags, and text-in-image (OCR), for both video and still images. For video, jobs are asynchronous: you start a job and receive an SNS notification when results are ready.
Amazon Transcribe turns audio into a structured transcript with timestamps, speaker diarisation, and optional custom vocabularies. For a podcast pipeline, custom vocabularies dramatically improve accuracy on domain-specific terms (product names, acronyms, people).
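As a sketch, a StartTranscriptionJob request with diarisation and a custom vocabulary enabled might look like this (job name, bucket names, and vocabulary name are illustrative):

```json
{
  "TranscriptionJobName": "podcast-ep-42",
  "LanguageCode": "en-US",
  "Media": { "MediaFileUri": "s3://media-ingest-prod/incoming/tenant-a/ep42.mp4" },
  "OutputBucketName": "media-ingest-prod-artifacts",
  "Settings": {
    "ShowSpeakerLabels": true,
    "MaxSpeakerLabels": 4,
    "VocabularyName": "podcast-domain-terms"
  }
}
```

The vocabulary itself is created once per domain and referenced by name, so improving accuracy for a new tenant is a data change, not a code change.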
Amazon Bedrock is where the "AI-powered" part lives in the modern sense. Given the raw transcript and Rekognition labels, a Bedrock model (Claude, Nova, Llama, depending on preference) produces the artefacts downstream systems actually want: a one-paragraph summary, chapter markers with timestamps, topic tags, a short SEO description, a list of named entities.
The prompt design for the Bedrock step deserves its own attention. A pattern that works well is to ask the model for a strict JSON response against a schema you define, rather than free-form text, so the persistence step can store it without fragile parsing:
You will receive a transcript and a list of visual labels from a video.
Return ONLY a JSON object with this exact shape:
{
"summary": string (max 80 words),
"chapters": [{ "title": string, "start_seconds": number }],
"topics": string[],
"entities": { "people": string[], "organisations": string[], "places": string[] }
}
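Even with a strict prompt, the response should be validated before it reaches DynamoDB. A minimal validation sketch mirroring the schema above (field names follow the prompt; the checks are illustrative, not exhaustive):

```python
# Validate the model's JSON response before persisting it. Strictness here
# is what keeps the DynamoDB write step free of fragile parsing.
import json

def parse_summary(raw: str) -> dict:
    """Parse and validate the Bedrock response; raise ValueError if malformed."""
    data = json.loads(raw)
    required = {"summary": str, "chapters": list, "topics": list, "entities": dict}
    for field, kind in required.items():
        if not isinstance(data.get(field), kind):
            raise ValueError(f"missing or mistyped field: {field}")
    for chapter in data["chapters"]:
        if not isinstance(chapter.get("title"), str) or not isinstance(
            chapter.get("start_seconds"), (int, float)
        ):
            raise ValueError("malformed chapter entry")
    return data
```

A ValueError here fails the state, which triggers the Step Functions retry (the model often produces valid JSON on a second attempt) and ultimately the catch branch if it never does.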
6. DynamoDB as the metadata store
DynamoDB is the right fit here because the access patterns are simple and the shape is predictable. A single table keyed by media_id (partition key) with created_at as the sort key handles the primary "fetch processing result for this upload" pattern. A GSI on tenant or status lets you answer "show me all videos processed in the last 24 hours" without scanning.
Keep the Bedrock-generated summary and the raw transcript reference separate: summaries are small and hot, transcripts can be large and belong back in S3 with just a pointer in DynamoDB.
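A sketch of the PersistMetadata write shape under that single-table design (attribute names are illustrative assumptions, not a fixed schema):

```python
# Shape the DynamoDB item: small hot summary inline, large transcript
# left in S3 with only a pointer stored here.
from datetime import datetime, timezone

def build_item(media_id: str, tenant: str, summary: dict,
               transcript_bucket: str, transcript_key: str) -> dict:
    """Build the metadata record written by the PersistMetadata step."""
    return {
        "media_id": media_id,        # partition key
        "created_at": datetime.now(timezone.utc).isoformat(),  # sort key
        "tenant": tenant,            # GSI partition key for per-tenant queries
        "status": "PROCESSED",       # GSI attribute for "what finished recently"
        "summary": summary["summary"],
        "topics": summary["topics"],
        "chapters": summary["chapters"],
        "transcript_ref": f"s3://{transcript_bucket}/{transcript_key}",
    }
```

Readers that need the full transcript follow `transcript_ref` back to S3; everything a list view needs stays in the small, hot item.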
Putting it together: an end-to-end request
Concretely, here's what happens when a user uploads a 20-minute interview video:
1. The client performs a multipart upload to s3://media-ingest-prod/incoming/<tenant-id>/<uuid>.mp4.
2. S3 emits an Object Created event to the default event bus.
3. An EventBridge rule matches on the incoming/ prefix and starts a Step Functions execution, passing the bucket and key as input.
4. The state machine classifies the file as video, then fans out in parallel: Transcribe starts on the audio track, Rekognition starts label detection and content moderation on the visual track.
5. Both jobs complete asynchronously; the state machine resumes via task tokens.
6. A Lambda gathers the transcript and labels, constructs a Bedrock prompt, and receives a structured JSON summary.
7. A final step writes the full metadata record to DynamoDB and moves the original file from incoming/ to processed/.
8. Any failure along the way is caught, logged to a failures table, and surfaced via an SNS topic that pages the on-call engineer if it crosses a threshold.
The whole thing runs without a single long-lived server.
Things that bite you in production
Diagrams make this look clean. A few practical notes from taking a pipeline like this to production:
IAM is the hard part. Each Lambda and each Step Functions state needs a narrowly scoped role. Do not share a single execution role across the pipeline; one overly broad policy is how s3:GetObject on * turns into a security incident.
Bedrock throttling is real. Regional model quotas can and will throttle you under load. Wrap the Bedrock call in a retry policy with jitter, and consider provisioned throughput if the pipeline is on a critical path.
Large transcripts exceed context windows. A two-hour podcast transcript can be larger than a single model context. Chunk on speaker turns or paragraph boundaries, summarise each chunk, then do a reduce pass - a classic map-reduce over text.
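The map step of that reduce can be sketched as chunking a diarised transcript on speaker turns under a rough word budget (the turn format and the budget are assumptions; a real pipeline would budget in tokens):

```python
# Group consecutive speaker turns into chunks that each fit a model
# context, never splitting a turn across chunks.

def chunk_transcript(turns: list[dict], max_words: int = 3000) -> list[list[dict]]:
    """Split a list of {'speaker', 'text'} turns into word-budgeted chunks."""
    chunks, current, count = [], [], 0
    for turn in turns:
        words = len(turn["text"].split())
        if current and count + words > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(turn)
        count += words
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is summarised independently (the map), then the per-chunk summaries are concatenated and summarised once more (the reduce).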
Cost lurks in the idle corners. The obvious costs (Transcribe per audio-minute, Rekognition per video-minute, Bedrock per token) are easy to forecast. The easily-missed ones are Step Functions state transitions on high-volume pipelines and CloudWatch Logs ingestion if you log every event verbatim. Sample your logs.
Observability needs to span the whole flow. A single correlation ID - the S3 object key or a generated media ID - should travel through every Lambda, Step Functions state, and DynamoDB record, so that when something breaks you can reconstruct exactly what happened with a single query.
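In practice that means every log line is a structured record carrying the same ID. A minimal sketch, assuming the media ID doubles as the correlation ID and illustrative field names:

```python
# Render one structured log line carrying the pipeline-wide correlation ID,
# so CloudWatch Logs Insights can reconstruct a whole execution by one field.
import json

def format_log(correlation_id: str, stage: str, message: str, **fields) -> str:
    """Return a single JSON log line; callers pass it to their logger."""
    return json.dumps({
        "correlation_id": correlation_id,
        "stage": stage,
        "message": message,
        **fields,
    })
```

Every Lambda in the pipeline calls this with the same `correlation_id`, making "show me everything that happened to this upload" a single query.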
Where this design shines
The same skeleton supports a surprising range of products with only the Bedrock prompt and the DynamoDB schema changing:
Podcast platforms that want automatic show notes, chapter markers, and searchable transcripts.
Video intelligence tools for media libraries: tagging, searching, and moderating large content archives.
Learning and compliance products that need to extract key points and generate quizzes from lecture recordings.
Content moderation systems combining Rekognition's moderation labels with LLM-based policy reasoning for edge cases.
Customer support analytics processing recorded calls to surface sentiment, topics, and escalation signals.
Where to take it next
Once the base pipeline is in place, the interesting extensions are mostly at the edges:
Realtime mode. Swap batch Transcribe for Transcribe Streaming and emit partial results over WebSockets for live captioning.
Semantic search. Pipe the Bedrock-generated summary and transcript chunks into an embedding model and store the vectors in OpenSearch or a vector store; now the media library is searchable by meaning, not just tags.
Human-in-the-loop review. For content moderation or compliance, route low-confidence Bedrock decisions to an SQS queue backed by a reviewer UI, and feed the decisions back as training data.
Multi-tenant isolation. Use S3 access points and dynamic Step Functions execution roles to enforce tenant boundaries at the infrastructure layer rather than in application code.
My closing thought
The interesting thing about this architecture isn't any individual service; it's that the boundaries between them are event-driven and declarative. You can reason about the system by looking at the state machine definition and the EventBridge rules, not by reading through layers of application code. That's what makes it durable: when the product team asks for a new capability ("can we also detect languages automatically?"), you add a state, not a service.
Serverless isn't the right answer for every system, but for "something landed in storage, now go understand it," it's hard to beat. The pipeline scales with the work, costs track usage, and the blast radius of any one failure is a single execution, not the whole platform.
