Originally published on Build With AWS. Subscribe for weekly AWS builds.
This design was inspired by Miguel Otero Pedrido and Alex Razvant’s “Kubrick” course, but rebuilt using native AWS primitives instead of custom frameworks.
Video is impossible to search.
You can scrub through it manually, or rely on YouTube’s auto-generated captions that only match exact keywords.
But what if you want to find “the outdoor mountain scene” or “where they discuss AI ethics”?
Traditional video platforms fail here because they treat video as a single data type.
This system treats video as three parallel search problems.
- Speech gets transcribed with word-level timestamps and indexed for semantic search.
- Every frame generates a semantic description through Claude Vision and goes into a separate index.
- Those same frames become 1,024-dimensional vectors for visual similarity search.
Users ask questions in natural language, and an intelligent agent figures out which index to query. Results come back with exact timestamps.
The architecture runs entirely on serverless AWS: AgentCore Gateway for tool orchestration, Bedrock Knowledge Bases for RAG, S3 Vectors for image search, and Lambda tying everything together.
Processing cost is front-loaded (heavy on first upload), but once videos are indexed, the system runs for roughly $3 per month per 100 videos. Query latency stays under 2 seconds.
The Three-Index Architecture
Most video systems treat search as a single problem: match keywords in titles or auto-generated captions. That works if users know exactly what they’re looking for and can describe it with the exact words spoken in the video.
It breaks down when someone asks “show me outdoor mountain scenes” or wants to find visually similar shots.
The solution is to treat video as three separate, parallel search problems.
- First, transcribe the audio track completely and index every spoken word with word-level timestamps. This handles “what was said” queries.
- Second, extract frames throughout the video, generate semantic descriptions using Claude Vision, and index those descriptions. This handles “what was shown” queries.
- Third, create vector embeddings of those same frames using Titan Multimodal and store them in S3 Vectors for visual similarity search.
Each index serves a different user intent.
The speech index answers “find where they discuss machine learning.”
The caption index answers “show me celebration scenes.”
The image index answers “find shots that look like this” when users upload a reference image.
Users don’t need to know which index exists. An intelligent agent analyzes their query, determines which tool to invoke, executes the search, and returns results with exact timestamps.
System Architecture

The frontend is a single-page app hosted on S3 and delivered via CloudFront. Users upload videos through a presigned URL directly to S3, which triggers the processing pipeline. Searches go through API Gateway to the agent Lambda, which either invokes tools directly (Manual Mode) or asks Claude Sonnet to analyze intent and select the right tool (Auto Mode). Tools are exposed via AgentCore Gateway using the Model Context Protocol.
Video Processing Pipeline
When a user uploads a video, the orchestrator Lambda kicks off two parallel tracks: frame extraction and transcription.
The frame track extracts frames using FFmpeg, sends them to Claude Vision for semantic descriptions, and creates vector embeddings for similarity search.
The transcription track uses AWS Transcribe to generate word-level timestamps, then chunks and indexes the transcript for semantic search.
Both complete in roughly 5-6 minutes for a 2-minute video.
Frame extraction doesn’t use a fixed frame rate like 6fps or 1fps. Instead, it extracts a fixed number of frames evenly distributed across the video duration. A 30-second clip gets 45-120 frames. A 10-minute video also gets 45-120 frames. This matters because caption generation costs scale with frame count, not video length.
Timestamps are calculated using (frame_number - 1) × duration / (total_frames - 1) to ensure frames are spread evenly from start to finish, with the first frame at 0 seconds and the last frame at the video’s end.
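That formula translates directly into a small helper (a sketch; the function name is mine, not the project's):

```python
def frame_timestamp(frame_number: int, total_frames: int, duration_sec: float) -> float:
    """Evenly distribute frames: frame 1 lands at 0s, the last at the video's end."""
    if total_frames == 1:
        return 0.0
    return (frame_number - 1) * duration_sec / (total_frames - 1)

# An 80-frame extraction of a 120-second video puts frame 1 at 0.0s,
# frame 80 at 120.0s, and spaces the rest evenly in between.
```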
FFmpeg runs inside a Lambda function with 2GB of memory and a 10-minute timeout. For videos longer than 10 minutes, the system would need Fargate or Step Functions to handle the extended processing time. But the processing logic stays the same, just a different execution environment.
Transcription happens in parallel via AWS Transcribe. The service processes the audio track asynchronously and typically finishes in about 1/4 of the video duration. A 10-minute video transcribes in roughly 2.5 minutes. A polling Lambda checks the job status with a 5-second delay between attempts (up to 60 attempts max, allowing roughly 5 minutes of polling).
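The polling loop might look like this sketch, with the Transcribe client injected rather than constructed so the logic stays testable (the defaults mirror the numbers above; the job-name parameter is illustrative):

```python
import time

def wait_for_transcript(transcribe, job_name: str,
                        max_attempts: int = 60, delay_sec: int = 5) -> str:
    """Poll Transcribe until the job finishes; ~5 minutes of polling at defaults.

    `transcribe` is a boto3 Transcribe client, passed in for testability.
    """
    for _ in range(max_attempts):
        job = transcribe.get_transcription_job(
            TranscriptionJobName=job_name)["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        time.sleep(delay_sec)
    raise TimeoutError(f"{job_name} not finished after {max_attempts} attempts")
```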
Transcribe returns word-level timestamps in JSON format: each word gets a start time, end time, and confidence score. Punctuation appears as separate items without timing. This granularity is critical because when Bedrock Knowledge Base returns a text snippet later, we need to map that snippet back to exact timestamps in the original video.
The chunk_transcript Lambda processes the Transcribe output into 10-second audio chunks, each preserving the original word-level timestamps. Each chunk becomes a separate JSON file (chunk_0001.json, chunk_0002.json, etc.) containing the chunk text, precise start_time_sec and end_time_sec boundaries, and metadata.
This pre-chunking ensures that search results can be mapped back to exact video positions while maintaining semantic coherence within each searchable segment.
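A minimal sketch of that chunking step, assuming Transcribe's standard `results["items"]` layout (function names are illustrative, not the project's):

```python
def chunk_transcript(items, chunk_sec: float = 10.0):
    """Group Transcribe word items into ~10-second chunks, keeping word timings.

    "pronunciation" items carry start_time/end_time strings;
    "punctuation" items carry no timing and attach to the preceding word.
    """
    chunks, current, chunk_end = [], [], chunk_sec
    for item in items:
        if item["type"] == "pronunciation":
            start = float(item["start_time"])
            if start >= chunk_end and current:
                chunks.append(_to_chunk(current))
                current = []
                chunk_end = (start // chunk_sec + 1) * chunk_sec
            current.append(item)
        elif current:
            current.append(item)  # punctuation joins the current chunk
    if current:
        chunks.append(_to_chunk(current))
    return chunks

def _to_chunk(items):
    words = [i for i in items if i["type"] == "pronunciation"]
    text = ""
    for i in items:
        content = i["alternatives"][0]["content"]
        text += content if i["type"] == "punctuation" or not text else " " + content
    return {"text": text,
            "start_time_sec": float(words[0]["start_time"]),
            "end_time_sec": float(words[-1]["end_time"]),
            "words": items}
```

Each returned dict maps to one chunk_NNNN.json document in the processed bucket.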
Documents are stored at {video_id}/speech_index/ and {video_id}/caption_index/ within the processed bucket. Caption data follows a similar pattern, with one JSON file per frame containing the Claude Vision-generated description, frame number, and timestamp.
Bedrock Knowledge Base has a limitation: it doesn’t support wildcards in S3 inclusion prefixes. You cannot configure it to scan */speech_index/ across multiple video folders. The deployed Bedrock Knowledge Bases are configured to work with the current bucket structure. The chunk_transcript and embed_captions Lambdas trigger KB ingestion jobs after uploading new documents, ensuring search indexes stay synchronized with processed content. Bedrock KB generates embeddings for each document, enabling semantic search while preserving the timestamp metadata attached to each chunk.
The current implementation prioritizes organizing all video-related data under a single video_id prefix for easier management and deletion. An alternative architecture would place the index type at the top level (speech_index/{video_id}/...) allowing a single KB inclusion prefix to scan all videos, but would sacrifice per-video organizational simplicity.
Caption generation is where processing costs concentrate. Each frame goes to Claude 3.5 Sonnet via Bedrock with a prompt that asks for 2-3 sentence descriptions focusing on subjects, actions, setting, and atmosphere. Claude returns natural language like “A chef in a white uniform demonstrates knife skills in a modern kitchen, dicing vegetables while explaining technique to a camera.” Each caption saves as a JSON file with the description, frame metadata, and timestamp.
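One caption call might be sketched like this via the Bedrock runtime, assuming the Anthropic messages format; the exact model ID and prompt wording here are assumptions, not the project's actual values:

```python
import base64
import json

CAPTION_PROMPT = ("Describe this video frame in 2-3 sentences. "
                  "Focus on subjects, actions, setting, and atmosphere.")

def caption_frame(bedrock, jpeg_bytes: bytes,
                  model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> str:
    """Ask Claude Vision for a semantic description of a single frame.

    `bedrock` is a boto3 bedrock-runtime client, injected for testability.
    """
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg",
                            "data": base64.b64encode(jpeg_bytes).decode()}},
                {"type": "text", "text": CAPTION_PROMPT},
            ],
        }],
    }
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```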
At roughly $0.005-0.008 per frame, a video with 100 frames costs $0.50-0.80 to caption. That’s 5-8x more expensive than Amazon Rekognition, which would return structured labels like “Person” (93% confidence), “Kitchen” (89%), “Knife” (85%). The cost premium buys search quality. When users ask “show me cooking demonstrations,” Claude’s semantic descriptions match the intent. Rekognition’s labels don’t connect to natural language queries the same way. For a system built around conversational search, Claude’s cost is justified.
The same frames that get captions also become vector embeddings. Titan Multimodal Embeddings generates 1,024-dimensional vectors at $0.00006 per frame, essentially free compared to caption costs. These vectors go into S3 Vectors, a serverless vector store that handles indexing and similarity search without infrastructure management. Each vector record includes the embedding wrapped in a float32 format plus metadata for video ID, frame number, and timestamp. This enables “find similar shots” queries where users upload a reference image and get back visually similar frames.
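The embed-and-store step might look like this sketch; the bucket and index names are placeholders, and both clients are injected rather than constructed:

```python
import base64
import json

def embed_and_store_frame(bedrock, s3vectors, jpeg_bytes: bytes, video_id: str,
                          frame_number: int, timestamp_sec: float,
                          vector_bucket: str = "video-frame-vectors",
                          index_name: str = "frames"):
    """Embed one frame with Titan Multimodal, then upsert it into S3 Vectors.

    `bedrock` is a bedrock-runtime client and `s3vectors` an s3vectors client;
    both are passed in so the function is testable.
    """
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps({"inputImage": base64.b64encode(jpeg_bytes).decode()}))
    embedding = json.loads(resp["body"].read())["embedding"]  # 1,024 floats
    s3vectors.put_vectors(
        vectorBucketName=vector_bucket,
        indexName=index_name,
        vectors=[{
            "key": f"{video_id}/frame_{frame_number:04d}",
            "data": {"float32": embedding},
            "metadata": {"video_id": video_id,
                         "frame_number": frame_number,
                         "timestamp_sec": timestamp_sec},
        }])
    return embedding
```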
Search and Retrieval
The three indexes sit behind Bedrock Knowledge Bases (for speech and captions) and S3 Vectors (for images). AgentCore Gateway exposes six tools via the Model Context Protocol: search_by_speech, search_by_caption, search_by_image, list_videos, get_video_metadata, and get_full_transcript. The agent Lambda invokes these tools either directly when users pick Manual Mode, or through Claude’s analysis in Auto Mode.

The Knowledge Base has a timestamp problem. When it returns a text snippet from a transcript, it doesn’t include the original timestamps from the Transcribe JSON.
The snippet is just text. But users need “go to 2:34 in the video,” not “this text appears somewhere in there.”
The solution is to have Claude match the snippet back to the word-level timeline. The agent downloads the Transcribe JSON, extracts all words with their start and end times, and asks Claude to find which words semantically match the returned snippet. Claude returns {"start_time": 154.2, "end_time": 157.8}. This adds about 500ms to query latency, but the precision is worth it.
The Knowledge Base might paraphrase “we’re exploring artificial intelligence” while the original transcript says “we are exploring AI,” and Claude maps them correctly anyway.
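A sketch of that matching call, with the prompt wording and timeline format as my assumptions:

```python
import json

MATCH_PROMPT = (
    "Here is a word-level timeline from a video transcript:\n{timeline}\n\n"
    "Find the span of words that best matches this snippet, even if it is a "
    "paraphrase:\n{snippet}\n\n"
    'Reply with JSON only, e.g. {{"start_time": 154.2, "end_time": 157.8}}.'
)

def locate_snippet(bedrock, words, snippet: str,
                   model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0"):
    """Map a Knowledge Base snippet back to exact transcript timestamps.

    `words` is a list of {"content", "start_time", "end_time"} dicts pulled
    from the Transcribe JSON; the bedrock-runtime client is injected.
    """
    timeline = " ".join(
        f"[{w['start_time']}-{w['end_time']}] {w['content']}" for w in words)
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [{"role": "user",
                      "content": MATCH_PROMPT.format(timeline=timeline,
                                                     snippet=snippet)}],
    }
    resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    text = json.loads(resp["body"].read())["content"][0]["text"]
    return json.loads(text)
```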
Intelligent Routing
The agent Lambda receives user queries and decides which tool to invoke. In Manual Mode, users explicitly pick speech, caption, or image search, and the agent calls that tool directly. In Auto Mode, users just type natural language, and Claude Sonnet 4 figures out the intent.
A query like “find where they discuss machine learning” goes to speech search. “Show me outdoor mountain scenes” goes to caption search. “Find similar shots” triggers image search.

Claude gets a system prompt explaining what each tool does: search_by_speech queries transcripts, search_by_caption queries frame descriptions, search_by_image handles visual similarity. Claude analyzes the user’s question and returns structured JSON with the tool name, parameters, and reasoning. The agent then invokes that tool via AgentCore Gateway using SigV4-signed requests. Results come back with video IDs, timestamps, matched text, and confidence scores, all formatted for the frontend to display.
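Auto Mode routing might be sketched like this using Claude's tool-use API on Bedrock; the tool schemas are abbreviated to two tools and the model ID is an assumption:

```python
import json

TOOLS = [
    {"name": "search_by_speech",
     "description": "Search spoken words in video transcripts.",
     "input_schema": {"type": "object",
                      "properties": {"query": {"type": "string"}},
                      "required": ["query"]}},
    {"name": "search_by_caption",
     "description": "Search semantic descriptions of video frames.",
     "input_schema": {"type": "object",
                      "properties": {"query": {"type": "string"}},
                      "required": ["query"]}},
]

def route_query(bedrock, query: str,
                model_id: str = "anthropic.claude-sonnet-4-20250514-v1:0"):
    """Let Claude pick a tool; tool_choice "any" forces exactly one selection."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "tools": TOOLS,
        "tool_choice": {"type": "any"},
        "messages": [{"role": "user", "content": query}],
    }
    resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    for block in json.loads(resp["body"].read())["content"]:
        if block["type"] == "tool_use":
            return {"tool": block["name"], "parameters": block["input"]}
    raise ValueError("Claude returned no tool selection")
```

The returned tool name and parameters then go straight to the Gateway invocation.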
This design skips Bedrock Agents entirely. Bedrock Agents handle orchestration automatically, but that comes with limited control over error handling, no support for custom timestamp extraction logic, and extra cost for features this system doesn’t need. Building the agent from scratch using Claude’s tool-use API gives full control over the routing logic, parallel tool execution, and response formatting.
AgentCore Gateway sits between the agent and the tools, hosting an MCP (Model Context Protocol) server that exposes the six search and utility tools. Each tool is backed by a Lambda function, and the Gateway handles SigV4 authentication, tool discovery, and request routing. When the agent invokes search_by_speech, the Gateway routes that to the speech search Lambda, waits for results, and returns them. Adding new tools means registering them in the Gateway configuration. No agent code changes required.
Design Trade-Offs

The three-index architecture trades infrastructure complexity for search quality.
A single Knowledge Base containing transcripts, captions, and image data would be simpler to manage. But speech needs dense text with context windows. Captions need short, precise matching. Images need vector similarity, not text search. Separate indexes let each modality optimize independently, and the search quality difference is measurable. Users asking “show me outdoor scenes” get relevant results from the caption index that a combined index would miss.
Claude Vision costs 5-8x more than Rekognition per frame. For 100 frames, that’s $0.50-0.80 versus $0.10. The cost premium comes from Claude generating full semantic descriptions while Rekognition returns structured labels with confidence scores. When users search with natural language like “cooking demonstrations,” Claude’s narrative captions match their intent. Rekognition’s labels ("Person", "Kitchen", "Utensil") don’t connect to conversational queries the same way.
The system prioritizes search experience over processing cost because users abandon systems that don’t find what they’re looking for.
S3 Vectors handles vector storage without managing clusters or configuring indexes. Query latency runs 200-300ms, which is acceptable for this use case.
OpenSearch Serverless would deliver sub-100ms queries and support hybrid keyword+vector search, but it adds complexity and cost that the system doesn’t need yet. The switch point is around 10k videos or when query latency becomes the primary user complaint. Below that threshold, S3 Vectors is simpler and cheaper.
Lambda handles all processing because video workflows are bursty. A system might process 10 videos in an hour, then sit idle for three hours. Fargate would cost roughly $30 per month per service even when doing nothing. Lambda costs $0 when idle.
The breaking point is continuous processing at 100+ videos per hour, where Fargate’s flat rate becomes cheaper than Lambda’s per-execution pricing. Most video systems never hit that threshold.
Frame extraction uses a fixed frame count (45-120 frames) evenly distributed across video duration rather than a fixed frame rate. This decision controls caption costs: 100 frames to caption regardless of whether the video runs 30 seconds or 10 minutes. A 6fps approach would generate 1800 frames for a 5-minute video and 600 frames for a 100-second video, wildly different costs. Fixed frame count makes processing costs predictable and avoids redundant captions when adjacent frames look nearly identical.
Cost Analysis
Processing 100 two-minute videos costs roughly $53 up front, then $3 per month to keep running. With a fixed frame count of 80 frames per video (middle of the 45-120 range), the math is straightforward: 100 videos × 80 frames = 8,000 frames total. Claude Vision at $0.006 per frame comes to $48. AWS Transcribe adds $4.80 for speech transcription (200 minutes at $0.024 per minute). Titan image embeddings cost $0.48 for those same 8,000 frames. Lambda invocations are negligible at $0.10.
Storage runs about $0.40 per month. Frames take up roughly 5-8GB (8,000 full-resolution JPEGs at roughly 0.6-1MB each across 100 videos) at $0.12-0.18. S3 Vectors holds 120-160MB of embeddings (8,000 vectors × 15-20KB each including metadata) for $0.003-0.004. Transcripts take about 20MB at $0.0005. Bedrock Knowledge Base vectors are stored in S3 and already counted in the frame storage cost. The dominant cost is always frame storage.
Queries cost $0.27 per thousand. Bedrock Knowledge Base retrieval is $0.10, Claude Sonnet 4 for routing is $0.15, and S3 Vectors queries are $0.02. API Gateway and Lambda execution costs are minimal enough to ignore at this scale. A system running 10,000 queries per month pays $2.70 in query costs.
The cost structure is front-loaded. Month 1 with 100 new videos: roughly $56 (processing + storage + queries). Month 2 with no new uploads: $3.10 (storage + queries). Month 3: $3.10. The system essentially costs $3 per month to operate once videos are processed, with spikes when new content arrives.
Frame count directly controls caption costs. Using 120 frames per video instead of 80 increases caption costs from $48 to $72 per 100 videos. Using 45 frames drops it to $27. Bedrock Batch Inference offers a 50% discount on Claude pricing but delays results by 24 hours, acceptable for async workflows. Combining lower frame counts (45-60) with batch inference brings processing costs down to $15-20 per 100 videos.
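The arithmetic above fits a small cost model, using the per-unit prices stated in this section:

```python
def processing_cost(videos: int = 100, frames_per_video: int = 80,
                    minutes_per_video: float = 2.0,
                    caption_per_frame: float = 0.006,
                    transcribe_per_min: float = 0.024,
                    embed_per_frame: float = 0.00006,
                    batch_discount: float = 0.0) -> float:
    """One-time processing cost in dollars for a batch of uploads."""
    frames = videos * frames_per_video
    captions = frames * caption_per_frame * (1 - batch_discount)
    transcription = videos * minutes_per_video * transcribe_per_min
    embeddings = frames * embed_per_frame
    return round(captions + transcription + embeddings, 2)

# Frame count is the main lever: 45 frames per video with a 50% batch
# discount cuts the caption bill to roughly a quarter of the 120-frame,
# on-demand figure.
```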
Performance and Scaling
A 2-minute video takes 5-6 minutes to become fully searchable. Frame extraction completes in 10-15 seconds. Transcription runs asynchronously and finishes in about 30 seconds. Caption generation is the bottleneck at 3-5 minutes, processing 100 frames at 2-3 seconds each. Image embedding adds another 20-30 seconds. Batch inference trades processing speed for cost savings: results take 24 hours instead of 5 minutes, but cut costs in half.
Query latency stays under 2 seconds for speech and caption search. Speech queries run 800-1200ms: Bedrock Knowledge Base retrieves matching snippets in 400-600ms, then Claude extracts precise timestamps from the Transcribe JSON in another 400-500ms.
Caption queries run faster at 600-900ms since frame timestamps come directly from metadata. Image similarity search is fastest at 300-500ms, just a vector query against S3 Vectors. The agent routing overhead (Claude analyzing intent and selecting tools) adds 400-600ms in Auto Mode.
Production Considerations
When Lambda crashes during caption generation, AWS automatically retries async invocations twice by default. The generate_captions Lambda catches individual frame failures and continues processing remaining frames rather than halting the entire batch.
The process_video and extract_frames Lambdas update DynamoDB status to ‘error’ on failure, but caption generation failures are logged to CloudWatch without explicit DynamoDB status tracking.
Partial results persist in S3 - frames remain available even if caption generation crashes afterward. There’s no automatic recovery mechanism, so resuming a failed step requires manually re-invoking the specific Lambda function with the video_id parameter, which reprocesses that entire step rather than resuming from the failure point.
Search query failures depend on a cascade of timeouts. The agent_api Lambda has a 60-second timeout, though internal Bedrock requests use a 30-second timeout. API Gateway enforces a 29-second maximum integration timeout, which would typically trigger first and return a timeout error to the user.
Query performance depends heavily on result count and metadata filtering - requesting 50 results from a large Knowledge Base performs worse than requesting 5 with specific video_id filters.
Knowledge Base synchronization happens programmatically, not on a schedule. After uploading transcripts or captions to S3, the chunk_transcript and embed_captions Lambdas explicitly trigger ingestion jobs via bedrock_agent.start_ingestion_job(). This ensures new content becomes searchable without waiting for automatic syncs.
The code logs indicate ingestion typically completes in around 2 minutes, though actual time varies with document count and KB size.
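The sync step might be sketched like this against the bedrock-agent API; the KB and data-source IDs are placeholders, and the completion poll is my addition:

```python
import time

def sync_knowledge_base(bedrock_agent, kb_id: str, data_source_id: str,
                        timeout_sec: int = 300, poll_sec: int = 5) -> str:
    """Start a KB ingestion job after new documents land in S3, then wait.

    `bedrock_agent` is a boto3 "bedrock-agent" client, injected for testability.
    """
    job = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=data_source_id)["ingestionJob"]
    job_id = job["ingestionJobId"]
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        status = bedrock_agent.get_ingestion_job(
            knowledgeBaseId=kb_id, dataSourceId=data_source_id,
            ingestionJobId=job_id)["ingestionJob"]["status"]
        if status == "COMPLETE":
            return job_id
        if status == "FAILED":
            raise RuntimeError(f"ingestion job {job_id} failed")
        time.sleep(poll_sec)
    raise TimeoutError(f"ingestion job {job_id} still running")
```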
The architecture scales from 100 to 1,000 videos without structural changes. Storage costs scale linearly with video count - 10x the videos means 10x the S3 storage costs. Query latency depends more on index size and query complexity than sheer video count, since Bedrock KB and S3 Vectors both use vector indexes that grow with content volume. Lambda concurrency rarely becomes an issue because video processing happens asynchronously over time rather than simultaneously.
At 10,000+ videos, you’d monitor specific bottlenecks as they emerge.
Bedrock Knowledge Base query latency could increase as vector indexes grow larger. S3 Vectors performance might degrade with hundreds of thousands or millions of frame vectors.
The list_videos DynamoDB scan would slow down, requiring pagination and potentially a Global Secondary Index on upload_timestamp for efficient retrieval.
These are optimization problems, not architectural redesigns - the core processing logic stays the same while execution environments might shift from Lambda to Fargate for longer videos, or from S3 Vectors to OpenSearch Serverless for consistently sub-100ms vector queries at scale.
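The list_videos fix could be sketched like this, with the index and attribute names purely hypothetical:

```python
def list_videos_page(table, limit: int = 20, start_key=None):
    """Return one newest-first page of videos plus a cursor for the next page.

    Assumes a hypothetical GSI "by-upload-time" whose partition key is a
    constant `entity_type` ("VIDEO") and whose sort key is `upload_timestamp`.
    `table` is a boto3 DynamoDB Table resource, injected for testability.
    """
    kwargs = {
        "IndexName": "by-upload-time",
        "KeyConditionExpression": "entity_type = :t",
        "ExpressionAttributeValues": {":t": "VIDEO"},
        "ScanIndexForward": False,  # descending by upload_timestamp
        "Limit": limit,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key  # resume from the previous page
    resp = table.query(**kwargs)
    return resp["Items"], resp.get("LastEvaluatedKey")
```

Unlike a full-table scan, the query reads only the requested page, so latency stays flat as the video count grows.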
Deployment and Production Readiness
The infrastructure deploys through AWS CDK with a single command: cdk deploy --all. This creates two stacks - InfrastructureStack with 19 Lambda functions, 2 Bedrock Knowledge Bases, 4 S3 buckets, a DynamoDB table, and API Gateway, plus FrontendStack with CloudFront distribution and frontend bucket. The Bedrock Knowledge Bases and AgentCore Gateway are pre-configured in AWS rather than created by the CDK deployment. The entire stack is version-controlled and reproducible across environments.
All Lambda functions log to CloudWatch with 731 days (2 years) of retention. The deployment includes no CloudWatch alarms, SNS topics, or automated monitoring by default - production deployments would need to add metric filters for processing duration, query latency, and failure rates. The CloudWatch logs capture every Lambda invocation but require manual querying or external tooling for insights beyond basic log inspection.
Native AWS services handle complex multimodal AI workloads without custom frameworks or infrastructure. AgentCore Gateway provides MCP standardization for tool orchestration. Bedrock Knowledge Bases manage retrieval-augmented generation across speech and caption indexes. S3 Vectors store image embeddings. Lambda processes videos and routes queries. The system runs at a low monthly cost after initial video processing, with predictable scaling characteristics up to 10,000 videos.
The three-index architecture is a practical solution to a real problem. Users can’t find specific moments in video content using traditional keyword search. This system lets them ask natural language questions and get back exact timestamps, whether they’re searching for spoken content, visual scenes, or similar-looking shots.
The design prioritizes search quality over processing cost because users abandon systems that don’t find what they’re looking for.
The architecture scales from prototype to production without rewrites.
Start with 100 videos on Lambda and S3 Vectors. Grow to 1,000 videos without changes. Push to 10,000 videos with monitoring and metadata filters.
Beyond that, swap Lambda for Fargate, S3 Vectors for OpenSearch, and add ElastiCache. The core logic stays the same.
What’s next?
Challenge the Blueprint: Share your advanced use case or propose an upgrade in the comments.
You can find a detailed account of how each part is built, the criteria for the options chosen, and other details in the project’s repo. Feel free to contribute or open any issues you find.
I publish every week at buildwithaws.substack.com. Subscribe. It's free.

