Extract Video Frames for LLM Vision Pipelines with video2flow


If you're building AI pipelines that need to understand video (captioning, RAG over video content, or multimodal dataset creation), you've likely hit the same wall: LLMs take text and still images, but your source material is video.

You need timestamped frames and text descriptions of them. You need metadata. You need it without shipping your data to a cloud API.

video2flow is an open-source CLI tool that does exactly this. Install it, point it at a video, and get structured JSON output ready for any LLM.

What it does

pip install video2flow gives you a v2f command with four operations:

v2f extract — pull frames from MP4/MOV/AVI at custom FPS
v2f describe — generate timestamped text descriptions of each scene
v2f generate — create a video from a text prompt (uses DALL-E 3 or local placeholder mode)
v2f slideshow — turn an image directory into a video with transitions

Tutorial: Frame extraction for an LLM pipeline

Let's say I have a 5-minute product demo and I want an LLM to answer questions about what's shown.

Step 1: Install

pip install video2flow

Step 2: Extract frames with metadata

v2f extract demo.mp4 --fps 1 --output frames/

This writes one frame per second into frames/, along with a metadata.json file that records each frame's timestamp.

Step 3: Get the JSON

cat frames/metadata.json

Output looks like:

[
  {
    "frame": "frame_0000.jpg",
    "timestamp_s": 0.0,
    "timestamp": "00:00:00"
  },
  {
    "frame": "frame_0001.jpg",
    "timestamp_s": 1.0,
    "timestamp": "00:00:01"
  }
]
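
In the sample above, the three fields look like they follow directly from the frame index and the --fps value. Here is a minimal sketch of that relationship, useful for sanity-checking a metadata file. It assumes the frame_NNNN.jpg naming and the truncated HH:MM:SS format seen in the sample; format_timestamp and expected_entry are hypothetical helpers, not part of video2flow, and the tool may round differently.

```python
def format_timestamp(seconds: float) -> str:
    """Format a second offset as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def expected_entry(index: int, fps: float) -> dict:
    """Rebuild the entry we would expect for frame `index` extracted at `fps`."""
    timestamp_s = index / fps  # assumption: frames are evenly spaced from 0
    return {
        "frame": f"frame_{index:04d}.jpg",
        "timestamp_s": timestamp_s,
        "timestamp": format_timestamp(timestamp_s),
    }


print(expected_entry(1, fps=1))
```

At 1 FPS, index 1 reproduces the second entry of the sample. At higher frame rates several frames share the same HH:MM:SS string, so timestamp_s is the safer field to sort or filter on.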

Step 4: Feed into an LLM

import json

with open("frames/metadata.json") as f:
    timeline = json.load(f)

# Each entry is a frame you can pass to a vision model
for entry in timeline[:3]:
    print(f"{entry['timestamp']} — {entry['frame']}")
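
Printing paths is only the first half. A vision model needs the image bytes, and most APIs accept them base64-encoded. The sketch below picks the frames inside a time window and pairs each timestamp with its encoded image. It assumes the metadata.json layout shown in Step 3; the helper names and the returned dict keys (media_type, data_b64) are made up for illustration, so map them onto whatever request schema your model provider documents.

```python
import base64
import json
from pathlib import Path


def load_timeline(frames_dir: str) -> list[dict]:
    """Read metadata.json from the extraction directory."""
    with open(Path(frames_dir) / "metadata.json", encoding="utf-8") as f:
        return json.load(f)


def frames_between(timeline: list[dict], start_s: float, end_s: float) -> list[dict]:
    """Keep entries whose timestamp_s falls within [start_s, end_s]."""
    return [e for e in timeline if start_s <= e["timestamp_s"] <= end_s]


def encode_frame(frames_dir: str, entry: dict) -> dict:
    """Pair a frame's timestamp with its base64-encoded JPEG bytes."""
    image_bytes = (Path(frames_dir) / entry["frame"]).read_bytes()
    return {
        "timestamp": entry["timestamp"],
        "media_type": "image/jpeg",  # hypothetical key names, adapt to your API
        "data_b64": base64.b64encode(image_bytes).decode("ascii"),
    }
```

For a question about what happens between 1:00 and 1:30 of the demo, you would call frames_between(load_timeline("frames"), 60, 90) and pass each result through encode_frame. Sending only the relevant window keeps the request small instead of uploading all 300 frames.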

Where this shines

Local-first: v2f extract and v2f describe run on your machine, so your video never leaves it. Only v2f generate with an OpenAI key calls an external API.
LLM-ready output: timestamps, paths, and metadata in one structured file.
Reverse pipeline: use v2f generate to create video from text when you need demo content.

Trade-offs to know

Frame-by-frame processing at high FPS is I/O heavy. Start at 1 FPS and raise it only if the model misses events that happen between frames.
v2f describe uses a local vision model, so description quality depends on the model you run and on whether a GPU is available.
v2f generate without an OpenAI key produces placeholder frames. For production-quality generation, set OPENAI_API_KEY.
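
To see why the FPS setting matters, it helps to do the arithmetic before extracting. This is a rough estimate only, assuming evenly spaced frames; the real count can be off by a frame depending on how the video ends, and estimate_frames is a hypothetical helper, not a video2flow function.

```python
def estimate_frames(duration_s: float, fps: float) -> int:
    """Approximate number of frames extracted from a clip of `duration_s` seconds."""
    return int(duration_s * fps)


# The 5-minute demo from the tutorial is 300 seconds long.
for fps in (1, 5, 30):
    print(f"{fps:>2} FPS -> {estimate_frames(300, fps)} frames")
```

For the 5-minute demo that is 300 frames at 1 FPS, 1500 at 5 FPS, and 9000 at 30 FPS, each one a file to write, describe, and possibly send to a model.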

Why not just ffmpeg?

ffmpeg extracts frames. video2flow extracts frames and also gives you timestamped JSON, plus text descriptions via v2f describe, in an LLM-ready structure from a single tool. If you're already piping ffmpeg output into a Python script, video2flow replaces that boilerplate.
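
For comparison, here is roughly what that hand-rolled boilerplate tends to look like. It is a sketch of the do-it-yourself route, not a description of how video2flow works internally. It assumes ffmpeg is on your PATH, uses ffmpeg's fps filter and the image2 -start_number option, and approximates each timestamp as index / fps. The function names are illustrative.

```python
import json
import subprocess
from pathlib import Path


def ffmpeg_command(video: str, out_dir: str, fps: float) -> list[str]:
    """Build an ffmpeg call that writes frame_0000.jpg, frame_0001.jpg, ..."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",      # resample to the requested frame rate
        "-start_number", "0",     # start numbering at 0 instead of 1
        str(Path(out_dir) / "frame_%04d.jpg"),
    ]


def write_metadata(out_dir: str, fps: float) -> list[dict]:
    """Index the extracted frames and save the result as metadata.json."""
    frames = sorted(
        Path(out_dir).glob("frame_*.jpg"),
        key=lambda p: int(p.stem.split("_")[1]),  # numeric sort, safe past 9999
    )
    timeline = []
    for index, path in enumerate(frames):
        seconds = index / fps  # approximation: evenly spaced frames from 0
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        timeline.append({
            "frame": path.name,
            "timestamp_s": seconds,
            "timestamp": f"{h:02d}:{m:02d}:{s:02d}",
        })
    meta_path = Path(out_dir) / "metadata.json"
    meta_path.write_text(json.dumps(timeline, indent=2), encoding="utf-8")
    return timeline


def extract(video: str, out_dir: str, fps: float = 1) -> list[dict]:
    """Run ffmpeg, then build the timeline from whatever frames it wrote."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_command(video, out_dir, fps), check=True)
    return write_metadata(out_dir, fps)
```

Calling extract("demo.mp4", "frames", fps=1) would produce a directory similar to the one in Step 2. That is about forty lines to maintain before you have written any description or LLM logic, which is the gap the single v2f command is meant to close.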