Video to Text to Video: Building LLM-Ready Pipelines from MP4 Files

#video2flow

Every LLM engineer I know has hit this wall: you have hours of video content — tutorials, meeting recordings, dashcam footage — and you want to feed it into an LLM for summarization, Q&A, or fine-tuning. But LLMs don't eat MP4 files. They eat text.

So you write a script. ffmpeg to extract frames. A loop to encode each frame as base64. A JSON blob with timestamps. It works. It's also fragile, one-off, and doesn't scale to the next project.

I built video2flow to close that gap in a single CLI command.

The core idea

video2flow extracts frames from any video file at a configurable FPS, assigns timestamps, and outputs a structured JSON "text flow" — ready to drop into any LLM prompt that supports vision.

# Extract 1 frame per second, describe scenes, export as JSON
video2flow extract demo.mp4 --fps 1 --describe --output flow.json

What you get back is a

frames: list of base64-encoded frames with their offset in seconds
scenes: automatically grouped sequences with descriptions
total_duration, fps, frame_count: everything the LLM needs to understand temporal context

Real use-case: I piped this output into a GPT-4 system prompt that said "You are a meeting summarizer. These are frames from a 30-minute product demo. Summarize decisions, code snippets shown, and unresolved questions." The result was shockingly coherent — it caught details from slide 5 that appeared at minute 12.

Beyond extraction: generation

The same tool also goes the other direction. Text-to-video generation:

video2flow generate "A developer typing code at sunset, drone shot" \
    --duration 8 --output sunset_demo.mp4

By default it uses OpenAI DALL-E 3 to generate frames, then stitches them with ffmpeg into an MP4. No video generation model to host. No GPU needed.

Need a local-only pipeline? Use --mode local for placeholder frames (colored rectangles with text overlays) — zero API calls.

The slideshow trick

For content teams: dump a folder of images into a video with crossfade transitions.

video2flow slideshow ./screenshots/ --transition crossfade --duration 3

What it is not

This is not a Sora competitor. The generated videos are frame-stitched, not diffusion-native. If you need 30-second cinematic clips with coherent motion, this isn't that tool.

What it is: a Swiss Army knife for developers who need to bridge video content and LLMs — fast, scriptable, local-first, and free.