Every LLM engineer I know has hit this wall: you have hours of video content — tutorials, meeting recordings, dashcam footage — and you want to feed it into an LLM for summarization, Q&A, or fine-tuning. But LLMs don't eat MP4 files. They eat text.
So you write a script. ffmpeg to extract frames. A loop to encode each frame as base64. A JSON blob with timestamps. It works. It's also fragile, one-off, and doesn't scale to the next project.
I built video2flow to close that gap in a single CLI command.
The core idea
video2flow extracts frames from any video file at a configurable FPS, assigns timestamps, and outputs a structured JSON "text flow" — ready to drop into any LLM prompt that supports vision.
# Extract 1 frame per second, describe scenes, export as JSON
video2flow extract demo.mp4 --fps 1 --describe --output flow.json
What you get back is a
- frames: list of base64-encoded frames with their offset in seconds
- scenes: automatically grouped sequences with descriptions
- total_duration, fps, frame_count: everything the LLM needs to understand temporal context
Real use-case: I piped this output into a GPT-4 system prompt that said "You are a meeting summarizer. These are frames from a 30-minute product demo. Summarize decisions, code snippets shown, and unresolved questions." The result was shockingly coherent — it caught details from slide 5 that appeared at minute 12.
Beyond extraction: generation
The same tool also goes the other direction. Text-to-video generation:
video2flow generate "A developer typing code at sunset, drone shot" \
--duration 8 --output sunset_demo.mp4
By default it uses OpenAI DALL-E 3 to generate frames, then stitches them with ffmpeg into an MP4. No video generation model to host. No GPU needed.
Need a local-only pipeline? Use --mode local for placeholder frames (colored rectangles with text overlays) — zero API calls.
The slideshow trick
For content teams: dump a folder of images into a video with crossfade transitions.
video2flow slideshow ./screenshots/ --transition crossfade --duration 3
What it is not
This is not a Sora competitor. The generated videos are frame-stitched, not diffusion-native. If you need 30-second cinematic clips with coherent motion, this isn't that tool.
What it is: a Swiss Army knife for developers who need to bridge video content and LLMs — fast, scriptable, local-first, and free.
pip install video2flow
Repo: https://github.com/massiron/video2flow
Docs: https://deepstrain.dev
No API keys required to start. No data leaves your machine unless you choose DALL-E generation.
Top comments (0)