This is a copy of my original submission to the AWS Lambda Hackathon on DevPost: https://devpost.com/software/auto-vid-serverless-video-processing
The Inspiration
As a developer who's spent countless hours manually editing videos for side projects, I was frustrated by the repetitive nature of adding voiceovers, background music, and sound effects. Every marketing team I knew was struggling with the same "content treadmill" - needing to produce 5-10 videos per week but lacking the time or budget for professional editing.
The breakthrough moment came when I realized that most video editing follows predictable patterns: add a voiceover at specific timestamps, duck the background music during speech, insert sound effects at key moments. This seemed perfect for automation, but existing solutions were either too expensive or required complex video editing skills.
I wanted to create something that could transform a simple JSON specification into a professionally edited video - making video production as easy as writing a configuration file.
What It Does
Auto-Vid transforms video creation from a manual, time-consuming process into an automated workflow. Users submit a simple JSON specification describing their video requirements - the base video file, background music, voiceover text, and sound effects with precise timing. The platform then automatically generates a professionally edited video with AI-powered text-to-speech, intelligent audio mixing (including automatic ducking of background music during speech), crossfading between music tracks, and synchronized sound effects. The entire process happens serverlessly on AWS, scaling from zero to hundreds of concurrent video processing jobs, with results delivered via secure download URLs and optional webhook notifications.
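To make that concrete, here is a minimal sketch of the kind of job spec the API accepts, built as a Python dict and submitted with requests. The assets, timeline, and backgroundMusic keys mirror the fields referenced in the processing code further down; everything else (field names like text and voiceId, the endpoint URL, the /jobs path) is an illustrative assumption rather than the authoritative schema.
import requests

# Illustrative job spec; top-level keys mirror those used by the processing
# code below, but the detailed field names here are assumptions.
job_spec = {
    "assets": {
        "video": {"source": "s3://my-bucket/inputs/demo.mp4"},
        "audio": [
            {"id": "bg1", "source": "s3://my-bucket/music/upbeat.mp3"},
        ],
    },
    "backgroundMusic": {"playlist": ["bg1"], "volume": 0.8},
    "timeline": [
        {
            "type": "tts",
            "start": 2.5,
            "data": {
                "text": "Welcome to our product demo!",
                "voiceId": "Joanna",
                "duckingLevel": 0.2,        # lower the music to 20% during speech
                "duckingFadeDuration": 0.5,  # half-second fade down and back up
            },
        }
    ],
}

# Hypothetical endpoint; replace with the API Gateway URL from the SAM deploy.
response = requests.post(
    "https://<api-id>.execute-api.us-east-1.amazonaws.com/Prod/jobs", json=job_spec
)
print(response.json())  # expected to contain a job ID for status polling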
How I Built It
Architecture Decision: I chose a fully serverless approach to handle unpredictable workloads - from zero videos per day to hundreds during peak times. The architecture uses three main components (see the submission-handler sketch after this list):
- API Layer (Lambda + API Gateway): Lightweight functions for job submission and status checking
- Processing Engine (Lambda Container): Heavy-duty video processing with MoviePy and AWS Polly
- Storage & Orchestration (S3 + SQS + DynamoDB): Managed storage with reliable job queuing
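As a sketch of how the API layer hands work to the processing engine: a lightweight submission Lambda records the job in DynamoDB and enqueues it on SQS. Resource names, environment variables, and the response shape below are assumptions for illustration, not the project's actual code.
# Sketch of a job-submission Lambda (API layer); names are illustrative.
import json
import os
import uuid

import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["JOBS_TABLE"])  # assumed env var
queue_url = os.environ["JOBS_QUEUE_URL"]          # assumed env var


def submit_job(event, context):
    job_spec = json.loads(event["body"])
    job_id = str(uuid.uuid4())

    # Record the job so the status endpoint can report on it
    table.put_item(Item={"jobId": job_id, "status": "QUEUED"})

    # Hand the heavy work to the container-based processing Lambda via SQS
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"jobId": job_id, "spec": job_spec}),
    )

    return {"statusCode": 202, "body": json.dumps({"jobId": job_id, "status": "QUEUED"})}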
Development Workflow: Local development was tricky since video processing requires the full AWS environment. I created a hybrid approach:
- Individual components (TTS generation, S3 upload, webhooks) can be tested locally
- Full integration testing requires AWS deployment
- SAM handles the complex container build and ECR management automatically
Key Technical Implementation:
# Core video processing pipeline
def _process_video_internal(self, job_id, job_spec, job_temp_dir, start_time):
# 1. Download assets from S3
audio_assets = self._download_audio_assets(job_spec.assets.audio, job_temp_dir)
video_path = self.asset_manager.download_asset(job_spec.assets.video.source, job_temp_dir)
# 2. Load video and get duration
video = VideoFileClip(video_path)
video_duration = video.duration
# 3. Generate TTS and audio clips for timeline events
audio_clips = []
ducking_ranges = []
for index, event in enumerate(job_spec.timeline):
if event.type == "tts":
clip = self._create_tts_clip(event, index, job_temp_dir)
audio_clips.append(clip)
if event.data.duckingLevel is not None:
ducking_ranges.append({
"start": event.start,
"end": event.start + clip.duration,
"ducking_level": event.data.duckingLevel,
"fade_duration": event.data.duckingFadeDuration,
})
# 4. Create background music with crossfading
background_music = self._create_background_music(
job_spec.backgroundMusic, audio_assets, video_duration
)
# 5. Apply audio ducking during speech
if background_music and ducking_ranges:
background_music = self._apply_ducking(background_music, ducking_ranges)
# 6. Composite final video
all_audio = [background_music] if background_music else []
all_audio.extend(audio_clips)
final_audio = CompositeAudioClip(all_audio)
final_video = video.with_audio(final_audio)
return final_video
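The pipeline above delegates text-to-speech to a helper (_create_tts_clip). A rough sketch of what such a helper can look like with Polly and MoviePy 2.x, assuming the event data carries text and voice fields and that the MP3 is written to the job temp directory; the project's actual implementation may differ.
# Rough sketch of a _create_tts_clip-style helper; field names and defaults
# are assumptions, and the real version likely adds error handling.
import os

import boto3
from moviepy import AudioFileClip

polly = boto3.client("polly")


def create_tts_clip(event, index, job_temp_dir):
    # Synthesize speech with Polly using the voice/engine from the event data
    response = polly.synthesize_speech(
        Text=event.data.text,
        VoiceId=event.data.voiceId,
        Engine=getattr(event.data, "engine", "neural"),
        OutputFormat="mp3",
    )

    # Persist the audio stream so MoviePy can load it
    tts_path = os.path.join(job_temp_dir, f"tts_{index}.mp3")
    with open(tts_path, "wb") as f:
        f.write(response["AudioStream"].read())

    # Place the clip at its timeline position so CompositeAudioClip lines it up
    return AudioFileClip(tts_path).with_start(event.start)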
Infrastructure as Code: Everything is defined in a single SAM template that creates the following (see the worker-entry sketch after this list):
- Lambda functions with proper IAM roles
- S3 bucket with organized folder structure
- SQS queue for reliable job processing
- DynamoDB table for status tracking
- API Gateway endpoints with CORS support
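To show how the SQS queue and DynamoDB table fit together at runtime, here is a minimal sketch of an SQS-triggered worker entry point. Table and environment variable names, message fields, and status values are assumptions for illustration, not the project's actual code.
# Sketch of the container-based processing Lambda's entry point.
import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["JOBS_TABLE"])  # assumed env var


def handler(event, context):
    # SQS-triggered Lambdas receive a batch of records
    for record in event["Records"]:
        message = json.loads(record["body"])
        job_id = message["jobId"]

        # "status" is a DynamoDB reserved word, hence the attribute-name alias
        table.update_item(
            Key={"jobId": job_id},
            UpdateExpression="SET #s = :s",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": "PROCESSING"},
        )

        # ...run the MoviePy pipeline shown earlier, upload the result to S3,
        # then mark the job COMPLETED (or FAILED, letting SQS retries / the DLQ take over)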
Challenges Faced
Lambda Memory Limits: The biggest surprise was discovering that many AWS accounts have a 3GB Lambda memory limit by default. Video processing needs significantly more - I configured 10GB for optimal performance. This required users to request quota increases through AWS Support, which I documented thoroughly in the deployment guide.
Container Size Optimization: My initial Docker image was 800MB, which caused slow cold starts. I implemented multi-stage builds, removed unnecessary dependencies, and optimized the Python environment to get down to 360MB while maintaining full functionality.
Audio Synchronization: Getting perfect audio ducking was surprisingly complex. Background music needs to fade down smoothly when speech starts, maintain the lower volume during the entire speech clip, then fade back up. I developed a custom algorithm that merges overlapping ducking ranges and applies the volume changes:
def _apply_ducking(self, background_music, ducking_ranges):
"""Apply ducking to background music based on ranges"""
# Sort ranges by start time
ducking_ranges.sort(key=lambda x: x["start"])
# Merge overlapping ranges
merged_ranges = []
if ducking_ranges:
current_range = ducking_ranges[0]
for next_range in ducking_ranges[1:]:
if next_range["start"] <= current_range["end"]:
# Ranges overlap, merge them
current_range["end"] = max(current_range["end"], next_range["end"])
# Use more aggressive ducking level (lower value)
current_range["ducking_level"] = min(
current_range["ducking_level"], next_range["ducking_level"]
)
# Use longer fade duration
current_range["fade_duration"] = max(
current_range["fade_duration"], next_range["fade_duration"]
)
else:
# No overlap, add current range and start new one
merged_ranges.append(current_range)
current_range = next_range
merged_ranges.append(current_range)
# Apply ducking for each merged range
for range_info in merged_ranges:
if range_info["fade_duration"] > 0:
# Apply fade effects when fade duration is specified
background_music = background_music.with_effects(
[
afx.AudioFadeIn(range_info["fade_duration"]),
afx.AudioFadeOut(range_info["fade_duration"]),
]
).with_volume_scaled(
range_info["ducking_level"], range_info["start"], range_info["end"]
)
else:
# Apply instant volume change when fade duration is 0
background_music = background_music.with_volume_scaled(
range_info["ducking_level"], range_info["start"], range_info["end"]
)
return background_music
Error Handling Across Distributed Components: With multiple Lambda functions, S3 operations, and external webhook calls, failure scenarios were complex. I implemented comprehensive retry logic, dead letter queues for failed jobs, and detailed error reporting that helps users understand what went wrong and how to fix it.
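As one illustration of the retry idea, here is a simple exponential-backoff wrapper around a webhook POST; the project's actual retry and error-reporting logic may be structured differently.
# Illustrative webhook delivery with exponential backoff.
import time

import requests


def notify_webhook(url, payload, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(url, json=payload, timeout=10)
            response.raise_for_status()
            return True
        except requests.RequestException as exc:
            if attempt == max_attempts:
                # Give up; the result is still retrievable via the status API
                print(f"Webhook delivery failed after {attempt} attempts: {exc}")
                return False
            time.sleep(2 ** attempt)  # back off: 2s, 4s, ...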
Empty Timeline Support: A late addition was supporting videos with just background music (empty timeline). This seemed simple but required refactoring the entire processing pipeline to handle the edge case gracefully while maintaining all the audio mixing capabilities.
Accomplishments I'm Proud Of
Solving Real Business Problems: Auto-Vid addresses genuine pain points in content creation - the "content treadmill" that marketing teams face, the high cost of video production, and the lack of scalable solutions for repetitive editing tasks.
Technical Excellence in Serverless Architecture: Successfully implemented complex video processing in a fully serverless environment, handling memory optimization, container builds, and distributed error handling across multiple Lambda functions while maintaining production-ready reliability.
Declarative Video Editing: Created an intuitive JSON-based specification format that makes professional video editing accessible to non-technical users, transforming complex MoviePy operations into simple configuration files.
Advanced Audio Processing: Developed sophisticated audio ducking algorithms that automatically lower background music during speech with smooth fade transitions, plus crossfading between music tracks - features typically found only in professional editing software.
Production-Ready Infrastructure: Built comprehensive error handling, retry logic, webhook notifications, and automatic resource cleanup - demonstrating that hackathon projects can achieve enterprise-grade quality and reliability.
What I Learned
Building Auto-Vid taught me several crucial lessons about serverless video processing:
Lambda Container Optimization: Video processing requires significant memory and storage. I learned to optimize Docker containers for Lambda, reducing the image size from 800MB to 360MB through multi-stage builds and careful dependency management. The biggest challenge was working within Lambda's memory limits - many AWS accounts default to 3GB, requiring quota increase requests for the full 10GB needed for complex video processing.
Advanced MoviePy Techniques: Processing video in a serverless environment requires different approaches than traditional desktop editing. I developed techniques for precise audio ducking (automatically lowering background music during speech), crossfading between music tracks, and synchronizing multiple audio layers without memory overflow.
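The crossfading piece can be expressed with MoviePy 2.x fade effects and overlapping start times. Below is a simplified sketch assuming two local music files; the hypothetical crossfade_tracks helper is mine, and the project's _create_background_music presumably handles more (multiple tracks, trimming to the video duration, and so on).
# Simplified crossfade between two music tracks with MoviePy 2.x.
import moviepy.audio.fx as afx
from moviepy import AudioFileClip, CompositeAudioClip


def crossfade_tracks(path_a, path_b, crossfade=2.0):
    track_a = AudioFileClip(path_a).with_effects([afx.AudioFadeOut(crossfade)])
    # Start track B before track A ends so the fade-out and fade-in overlap
    track_b = (
        AudioFileClip(path_b)
        .with_effects([afx.AudioFadeIn(crossfade)])
        .with_start(track_a.duration - crossfade)
    )
    return CompositeAudioClip([track_a, track_b])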
AWS Polly's Evolution: I discovered the differences between Polly's engines - standard voices for basic needs, neural for natural speech, long-form for extended content, and the new generative engine for ultra-realistic voices. Each has different latency and cost characteristics that affect the overall user experience.
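Switching engines is just a parameter on the synthesize call, so comparing them is straightforward. The voice/engine pairings below are examples only; not every voice supports every engine.
# Compare Polly engines by synthesizing the same line with each one.
import boto3

polly = boto3.client("polly")

for engine, voice in [("standard", "Joanna"), ("neural", "Joanna"),
                      ("long-form", "Danielle"), ("generative", "Matthew")]:
    response = polly.synthesize_speech(
        Text="Auto-Vid makes video editing as easy as writing a config file.",
        Engine=engine,
        VoiceId=voice,
        OutputFormat="mp3",
    )
    with open(f"sample_{engine}.mp3", "wb") as f:
        f.write(response["AudioStream"].read())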
Serverless Architecture Patterns: Managing a complex workflow across multiple Lambda functions taught me about event-driven architecture, proper error handling with SQS dead letter queues, and designing for eventual consistency with DynamoDB.
What's Next
Real-World Applications: Auto-Vid solves genuine business problems. I've identified use cases ranging from automated social media content creation to e-commerce product demos at scale. The declarative JSON approach means it can integrate with existing content management systems and marketing workflows.
Technical Improvements: Future enhancements include:
- AI-powered video spec generation from natural language prompts using AWS Bedrock
- Support for multiple video inputs (picture-in-picture, transitions)
- Visual effects and text overlays
- Integration with more TTS providers
- Batch processing for multiple videos
- Cost optimization through spot instances for non-urgent jobs
Business Potential: The serverless architecture means zero infrastructure costs when idle, making it viable for both small businesses and enterprise customers. The pay-per-use model aligns costs directly with value delivered.
Auto-Vid demonstrates that complex, traditionally expensive workflows can be democratized through thoughtful serverless architecture. By combining AWS Lambda's scalability with modern video processing libraries, it transforms video editing from a specialized skill into a simple API call.
Built With
- AWS Lambda - Serverless compute for video processing
- AWS Polly - Text-to-speech generation with multiple voice engines
- AWS S3 - Storage for video assets, audio files, and processed outputs
- AWS SQS - Message queuing for reliable job processing
- AWS DynamoDB - Status tracking and job metadata storage
- AWS API Gateway - RESTful API endpoints with CORS support
- AWS SAM - Infrastructure as Code deployment
- MoviePy - Python library for video editing and processing
- Docker - Container packaging for Lambda deployment
- Python 3.12 - Core programming language
- Pydantic - Data validation and JSON schema management
Try It
Ready to experience serverless video processing?
DevPost Submission: https://devpost.com/software/auto-vid-serverless-video-processing
GitHub Repository: https://github.com/ossamaweb/auto-vid