Edmon Marine Clota

Posted on Apr 7

Building a Video Automation Pipeline with Remotion and AI APIs

#react #video #automation #ai

Most video tools treat automation as an afterthought. You get a drag-and-drop editor, maybe some presets, and a "batch export" button that barely works. Real video automation requires a different approach: one that treats video as code.

That is exactly what Remotion enables. And when you wire it up to modern AI APIs for voice, images, and motion, you get something powerful: a pipeline that turns a text prompt into a finished video with zero manual editing.

This article walks through the architecture of a system like this. No toy demos. A real pipeline that generates narrated, captioned, multi-scene videos for platforms like YouTube Shorts, TikTok, and Instagram Reels.

Why Remotion as the Rendering Engine

Remotion turns React components into video frames. Each frame is a function of time. That sounds simple, but it unlocks a lot.

Traditional video editors store projects as opaque binary files. Remotion describes them as code and typed data. A video is a composition driven by a JSON-serializable props object. A scene is a React component. A transition is a function. This means you can generate videos programmatically, test them, version control them, and render them in CI.

For automation, this is everything. You do not need a GUI to produce a video. You need data in, video out.
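To make "each frame is a function of time" concrete, here is a minimal sketch with no dependencies. The `TitleCardProps` shape and both function names are invented for illustration, and `lerpClamped` is a hand-written stand-in for an animation helper, not Remotion's own API.

```typescript
// Hypothetical props for a title card: plain, JSON-serializable data.
interface TitleCardProps {
  title: string;
  fadeInFrames: number;
}

// Linear interpolation with clamping, written by hand so the sketch
// runs without any library.
function lerpClamped(
  frame: number,
  start: number,
  end: number,
  from: number,
  to: number
): number {
  if (end <= start) return to;
  const t = Math.min(1, Math.max(0, (frame - start) / (end - start)));
  return from + (to - from) * t;
}

// The visual state of a frame is a pure function of (frame, props).
// The same inputs always produce the same frame, which is what makes
// the output testable and reproducible in CI.
function titleCardStyle(
  frame: number,
  props: TitleCardProps
): { opacity: number; translateY: number } {
  return {
    opacity: lerpClamped(frame, 0, props.fadeInFrames, 0, 1),
    translateY: lerpClamped(frame, 0, props.fadeInFrames, 20, 0),
  };
}
```

In a real component, the frame number would come from the rendering engine and the returned values would feed a `style` prop. The point is only that nothing here depends on wall-clock time or hidden state.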

The Five Layer Pipeline

A fully automated video pipeline has five layers. Each layer handles one concern and passes its output to the next.

Layer 1: Script Generation

Every video starts with a script. For automated content, this comes from an LLM. You send a topic, tone, target duration, and structure requirements. The model returns a structured script with segments, hooks, body content, and a call to action.

The key detail: the script output is not free text. It is structured JSON with explicit segment boundaries. Each segment has a type (hook, narration, transition cue, CTA), text content, and estimated duration.
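One way to enforce that contract is to validate the model's output before anything downstream touches it. The sketch below is an assumption about what such a schema could look like: the field names (`topic`, `segments`, `estimatedSeconds`) and the `parseVideoScript` helper are hypothetical, not the format any particular tool uses.

```typescript
type SegmentType = "hook" | "narration" | "transition" | "cta";

interface ScriptSegment {
  type: SegmentType;
  text: string;
  estimatedSeconds: number;
}

interface VideoScript {
  topic: string;
  segments: ScriptSegment[];
}

const SEGMENT_TYPES: readonly SegmentType[] = ["hook", "narration", "transition", "cta"];

function isSegmentType(value: unknown): value is SegmentType {
  return typeof value === "string" &&
    (SEGMENT_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Parse the raw LLM response and reject anything that does not match
// the contract, instead of letting bad data reach the audio layer.
function parseVideoScript(raw: string): VideoScript {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("script must be a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const topic = obj["topic"];
  const rawSegments = obj["segments"];
  if (typeof topic !== "string") throw new Error("script.topic must be a string");
  if (!Array.isArray(rawSegments) || rawSegments.length === 0) {
    throw new Error("script.segments must be a non-empty array");
  }
  const segments = rawSegments.map((item: unknown, i: number): ScriptSegment => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`segment ${i} must be an object`);
    }
    const seg = item as Record<string, unknown>;
    const type = seg["type"];
    const text = seg["text"];
    const estimatedSeconds = seg["estimatedSeconds"];
    if (!isSegmentType(type)) throw new Error(`segment ${i} has an unknown type`);
    if (typeof text !== "string" || text.length === 0) {
      throw new Error(`segment ${i} has no text`);
    }
    if (typeof estimatedSeconds !== "number" || !(estimatedSeconds > 0)) {
      throw new Error(`segment ${i} needs a positive estimatedSeconds`);
    }
    return { type, text, estimatedSeconds };
  });
  return { topic, segments };
}
```

A failed parse is a natural place to retry the LLM call, since the alternative is discovering the malformed segment several paid API calls later.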

Models like Gemini 2.5 Flash handle this at roughly $0.15 per million tokens. A typical video script costs a fraction of a cent to generate.

Layer 2: Audio Synthesis

The script feeds into a text-to-speech engine. This is where model choice matters most for quality.

Kokoro runs at $0.02 per 1,000 characters. ElevenLabs costs $0.05 per 1,000 characters but offers voice cloning. The pipeline lets you swap providers without changing anything else.

The audio layer returns two things: the audio file and a word-level timing map. Every word in the script gets a start time and end time in milliseconds. This timing map is the backbone of the entire video.
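Because the renderer thinks in frames rather than milliseconds, the timing map usually needs a conversion step. This is a rough sketch of that step; the `WordTiming` shape is an assumed format, and real TTS providers each return timings in their own structure.

```typescript
// Assumed shape of one entry in the word-level timing map.
interface WordTiming {
  word: string;
  startMs: number;
  endMs: number;
}

interface FrameTiming {
  word: string;
  startFrame: number;
  endFrame: number; // exclusive
}

function msToFrame(ms: number, fps: number): number {
  return Math.round((ms / 1000) * fps);
}

// Convert millisecond timings to frame ranges. Very short words are
// widened to at least one frame so they never vanish from the video.
function toFrameTimings(words: WordTiming[], fps: number): FrameTiming[] {
  return words.map((w) => {
    const startFrame = msToFrame(w.startMs, fps);
    const endFrame = Math.max(startFrame + 1, msToFrame(w.endMs, fps));
    return { word: w.word, startFrame, endFrame };
  });
}
```

Doing this conversion once, in one place, avoids each downstream component rounding differently and drifting out of sync with the audio.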

Layer 3: Visual Generation

With the script segmented and audio timed, the pipeline generates visuals for each scene. Each script segment gets an image prompt derived from the segment text.

Flux Schnell generates images at $0.003 each. Recraft costs $0.04 per image but produces higher quality illustrations. For motion, Veo 3.1 adds camera movement at $0.10 per second. Alternatively, you can skip generated motion and use Ken Burns-style pan-and-zoom effects rendered directly in Remotion at zero API cost.
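The pan-and-zoom option is simple enough to sketch as a pure function of the frame number. The `KenBurnsConfig` fields and function names below are illustrative assumptions, and the easing is plain linear for brevity.

```typescript
// Hypothetical configuration for one still image.
interface KenBurnsConfig {
  startScale: number;
  endScale: number;
  panXPx: number; // total horizontal travel over the scene
  panYPx: number; // total vertical travel over the scene
}

interface KenBurnsState {
  scale: number;
  x: number;
  y: number;
}

// Compute zoom and pan for a given frame of the scene.
function kenBurnsAt(
  frame: number,
  durationInFrames: number,
  cfg: KenBurnsConfig
): KenBurnsState {
  const raw = durationInFrames <= 1 ? 0 : frame / (durationInFrames - 1);
  const progress = Math.min(1, Math.max(0, raw));
  return {
    scale: cfg.startScale + (cfg.endScale - cfg.startScale) * progress,
    x: cfg.panXPx * progress,
    y: cfg.panYPx * progress,
  };
}

// Format the state as a CSS transform for an image element's style.
function toCssTransform(s: KenBurnsState): string {
  return `scale(${s.scale.toFixed(3)}) translate(${s.x.toFixed(1)}px, ${s.y.toFixed(1)}px)`;
}
```

Each scene can get a slightly different config (zoom in, zoom out, pan left) so a sequence of stills does not feel repetitive.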

Layer 4: Caption Rendering

Remotion handles captions natively. The word-level timing map from Layer 2 drives a caption component that highlights each word as it is spoken, TikTok style, one to three words at a time.

Captions are not burned in during post-production. They are React components rendered frame by frame with the rest of the composition. You can change fonts, colors, animations, and positioning by updating props.
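The logic behind that caption component boils down to two steps: group words into short pages, then find the active word for the current time. Here is a hand-rolled sketch of both, assuming a `TimedWord` shape of my own invention rather than any library's caption format.

```typescript
interface TimedWord {
  text: string;
  startMs: number;
  endMs: number;
}

interface CaptionPage {
  words: TimedWord[];
  startMs: number;
  endMs: number;
}

// Group words into pages of at most `maxWords` words each.
function paginate(words: TimedWord[], maxWords: number = 3): CaptionPage[] {
  const size = Math.max(1, Math.floor(maxWords));
  const pages: CaptionPage[] = [];
  for (let i = 0; i < words.length; i += size) {
    const chunk = words.slice(i, i + size);
    pages.push({
      words: chunk,
      startMs: chunk[0].startMs,
      endMs: chunk[chunk.length - 1].endMs,
    });
  }
  return pages;
}

// Find which page is visible and which word on it is being spoken.
// wordIndex is -1 during a pause between two words on the same page.
function activeWordAt(
  pages: CaptionPage[],
  timeMs: number
): { pageIndex: number; wordIndex: number } | null {
  for (let p = 0; p < pages.length; p++) {
    const page = pages[p];
    if (timeMs < page.startMs || timeMs >= page.endMs) continue;
    for (let w = 0; w < page.words.length; w++) {
      const word = page.words[w];
      if (timeMs >= word.startMs && timeMs < word.endMs) {
        return { pageIndex: p, wordIndex: w };
      }
    }
    return { pageIndex: p, wordIndex: -1 };
  }
  return null;
}
```

A caption component would convert the current frame to milliseconds, call `activeWordAt`, render the page's words, and apply the highlight style to the word at `wordIndex`.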

Layer 5: Composition and Rendering

The final layer assembles everything in Remotion. A Composition component receives the full data payload: script segments, audio file paths, visual asset manifest, caption timing, transition configuration, and output format.

Remotion's calculateMetadata function computes the total duration dynamically based on the audio length. The render step uses Remotion's headless renderer. On a modern machine, a 60-second video renders in about 30 to 90 seconds.
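The duration calculation itself is small. The sketch below is a dependency-free function shaped like a metadata callback: it takes the composition props and returns a frame count. The `VideoProps` fields, the fixed 30 fps, and the tail padding are all assumptions for illustration; check the Remotion documentation for the exact callback signature before wiring it in.

```typescript
// Hypothetical props passed to the composition.
interface VideoProps {
  audioDurationSeconds: number;
  tailPaddingSeconds: number; // short hold after narration ends
}

const FPS = 30; // assumed frame rate

// Round up so the final syllable of audio is never cut off.
function durationInFramesFor(props: VideoProps, fps: number): number {
  const seconds = props.audioDurationSeconds + props.tailPaddingSeconds;
  return Math.max(1, Math.ceil(seconds * fps));
}

// Shaped like a metadata callback: props in, render metadata out.
function calculateVideoMetadata(input: { props: VideoProps }): {
  durationInFrames: number;
  fps: number;
  props: VideoProps;
} {
  return {
    durationInFrames: durationInFramesFor(input.props, FPS),
    fps: FPS,
    props: input.props,
  };
}
```

In practice the audio duration would be measured from the synthesized file in Layer 2 and passed in as a prop, so the composition never needs a hard-coded length.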

Architecture Decisions That Matter

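Typed Data Contracts: Every layer communicates through typed interfaces. If the types compile, the layers agree on the shape of the data they exchange, which rules out a whole class of integration mistakes before any API call is made.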

Provider Abstraction: Each AI service sits behind an adapter interface. The pipeline does not call FAL.ai directly. It calls a voice provider, which happens to be backed by FAL.ai today but could be anything tomorrow.
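A minimal sketch of that adapter idea follows. `VoiceProvider`, `FakeVoiceProvider`, and `synthesizeWithFallback` are hypothetical names, and the fake provider stands in for a real HTTP client so the example runs without network access or API keys.

```typescript
interface VoiceResult {
  audioPath: string;
  durationMs: number;
  provider: string;
}

// The only surface the pipeline sees. Real adapters would wrap a
// specific vendor's API behind this same method.
interface VoiceProvider {
  readonly id: string;
  synthesize(text: string): Promise<VoiceResult>;
}

// Stand-in implementation for illustration and tests.
class FakeVoiceProvider implements VoiceProvider {
  readonly id: string;
  private readonly shouldFail: boolean;

  constructor(id: string, shouldFail: boolean = false) {
    this.id = id;
    this.shouldFail = shouldFail;
  }

  async synthesize(text: string): Promise<VoiceResult> {
    if (this.shouldFail) throw new Error(`${this.id} unavailable`);
    // Fake duration: 60 ms per character, purely for demonstration.
    return {
      audioPath: `/tmp/${this.id}.wav`,
      durationMs: text.length * 60,
      provider: this.id,
    };
  }
}

// Try providers in order and return the first success.
async function synthesizeWithFallback(
  providers: VoiceProvider[],
  text: string
): Promise<VoiceResult> {
  let lastError: unknown = new Error("no providers configured");
  for (const provider of providers) {
    try {
      return await provider.synthesize(text);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Because callers only depend on the interface, swapping or reordering providers is a configuration change, and a provider outage degrades to the next entry in the list instead of failing the whole render.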

BYOK (Bring Your Own Keys): Instead of proxying all API calls through a central server, each user signs up with AI providers like FAL.ai, ElevenLabs, or OpenAI directly and uses their own API keys. No rate limit sharing. No markup. No data routing through a third party.

Cost Estimation Before Rendering: The pipeline calculates cost before executing. A typical faceless narrated video costs $0.05 to $0.50 in total API calls.
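The estimate is plain arithmetic over the per-unit prices quoted earlier. This sketch plugs in the article's figures for Gemini 2.5 Flash, Kokoro, Flux Schnell, and Veo 3.1 as example values; the type names and the `estimateCostUsd` helper are hypothetical, and real prices change, so treat the table as a placeholder.

```typescript
interface CostInputs {
  scriptTokens: number;
  narrationChars: number;
  imageCount: number;
  motionSeconds: number;
}

interface PriceTable {
  llmPerMillionTokens: number;
  ttsPer1kChars: number;
  perImage: number;
  motionPerSecond: number;
}

// Example prices taken from the figures quoted in this article.
const EXAMPLE_PRICES: PriceTable = {
  llmPerMillionTokens: 0.15,
  ttsPer1kChars: 0.02,
  perImage: 0.003,
  motionPerSecond: 0.1,
};

interface CostBreakdown {
  script: number;
  audio: number;
  images: number;
  motion: number;
  total: number;
}

// Compute a per-layer and total estimate before any API call is made.
function estimateCostUsd(inputs: CostInputs, prices: PriceTable): CostBreakdown {
  const script = (inputs.scriptTokens / 1_000_000) * prices.llmPerMillionTokens;
  const audio = (inputs.narrationChars / 1000) * prices.ttsPer1kChars;
  const images = inputs.imageCount * prices.perImage;
  const motion = inputs.motionSeconds * prices.motionPerSecond;
  return { script, audio, images, motion, total: script + audio + images + motion };
}
```

For example, 2,000 script tokens, 1,000 narration characters, and 10 still images with no generated motion comes to roughly $0.05, while adding a few seconds of generated motion quickly dominates the total.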

What This Looks Like in Practice

VideoAIStudio implements this exact architecture as a desktop app for Windows and macOS. The Faceless Studio, which handles narrated and educational content, is free forever. You bring your own API keys and pay only for the AI calls you make.

The full app includes five studios covering different video types: faceless narrated content, UGC with AI avatars, gameplay montages, cinematic documentaries, and product marketing videos. The paid version is a $199 one-time payment. No monthly fees.

Closing Thoughts for Developers

If you are building video automation, Remotion is the right rendering layer. It gives you programmatic control without sacrificing quality. The React model means your video logic is testable, composable, and version controlled.

The AI layer is modular by nature. Every major AI capability is available as a stateless API call. You do not need to train models or manage GPU infrastructure.

The hard part is not any single layer. It is the orchestration. Getting timing right across audio, visuals, and captions. Handling provider failures gracefully. Keeping costs predictable. That is where the real engineering lives.

And that is what makes this space interesting. The building blocks are commodity. The pipeline is the product.

DEV Community