If you're building AI pipelines that need to understand video — captioning, RAG over video content, or multimodal dataset creation — you've likely hit the same wall: LLMs eat text, but your source material is video.
You need frames as timestamped text. You need metadata. You need it without shipping your data to a cloud API.
video2flow is an open-source CLI tool that does exactly this. Install it, point it at a video, and get structured JSON output ready for any LLM.
What it does
pip install video2flow gives you a v2f command with four operations:
- v2f extract: pull frames from MP4/MOV/AVI at custom FPS
- v2f describe: generate timestamped text descriptions of each scene
- v2f generate: create a video from a text prompt (uses DALL-E 3 or local placeholder mode)
- v2f slideshow: turn an image directory into a video with transitions
Tutorial: Frame extraction for an LLM pipeline
Let's say I have a 5-minute product demo and I want an LLM to answer questions about what's shown.
Step 1: Install
pip install video2flow
Step 2: Extract frames with metadata
v2f extract demo.mp4 --fps 1 --output frames/
This writes one frame per second into frames/, along with a metadata.json file that maps each frame to its timestamp.
Step 3: Get the JSON
cat frames/metadata.json
Output looks like:
[
  {
    "frame": "frame_0000.jpg",
    "timestamp_s": 0.0,
    "timestamp": "00:00:00"
  },
  {
    "frame": "frame_0001.jpg",
    "timestamp_s": 1.0,
    "timestamp": "00:00:01"
  }
]
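Given entries shaped like this, a small helper can map a question such as "what is on screen at 0:42?" back to a frame. The sketch below is illustrative only: nearest_frame is a hypothetical helper, not part of video2flow, and it assumes the entries are sorted by timestamp_s as in the sample.

```python
from bisect import bisect_left

# Sample entries copied from the metadata.json format shown above.
timeline = [
    {"frame": "frame_0000.jpg", "timestamp_s": 0.0, "timestamp": "00:00:00"},
    {"frame": "frame_0001.jpg", "timestamp_s": 1.0, "timestamp": "00:00:01"},
]


def nearest_frame(entries, t):
    """Return the entry whose timestamp_s is closest to t (hypothetical helper).

    Assumes entries is non-empty and sorted by timestamp_s.
    """
    times = [e["timestamp_s"] for e in entries]
    i = bisect_left(times, t)
    if i == 0:
        return entries[0]
    if i == len(entries):
        return entries[-1]
    before, after = entries[i - 1], entries[i]
    # Prefer the earlier frame on an exact tie.
    if t - before["timestamp_s"] <= after["timestamp_s"] - t:
        return before
    return after


print(nearest_frame(timeline, 0.6)["frame"])
```

Binary search keeps the lookup cheap even when a long video produces thousands of entries.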
Step 4: Feed into an LLM
import json

with open("frames/metadata.json") as f:
    timeline = json.load(f)

# Each entry is a frame you can pass to a vision model
for entry in timeline[:3]:
    print(f"{entry['timestamp']} - {entry['frame']}")
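To hand frames to a vision model, the image bytes usually need to travel alongside their timestamps. Here is a minimal sketch, assuming the metadata.json layout shown earlier. The helper frame_payloads and the dictionary keys it returns are hypothetical, and would need adapting to whatever request format your model expects.

```python
import base64
import json
from pathlib import Path


def frame_payloads(frames_dir, limit=3):
    """Yield timestamped, base64-encoded frames (hypothetical helper).

    Assumes frames_dir contains metadata.json in the format shown earlier,
    next to the frame images it lists.
    """
    frames_dir = Path(frames_dir)
    timeline = json.loads((frames_dir / "metadata.json").read_text())
    for entry in timeline[:limit]:
        image_bytes = (frames_dir / entry["frame"]).read_bytes()
        yield {
            "timestamp": entry["timestamp"],
            # Illustrative key name; adapt to your model's request schema.
            "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        }


# Usage (requires an extracted frames/ directory):
# for payload in frame_payloads("frames"):
#     print(payload["timestamp"], len(payload["image_b64"]))
```

Using a generator means only one frame is held in memory at a time, which matters once you raise the limit or the FPS.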
Where this shines
- Local-first: extraction and description run on your machine. The describe command runs frame analysis locally, and only v2f generate with an OpenAI key calls an external API.
- LLM-ready output: timestamps, paths, and metadata in one structured file.
- Reverse pipeline: use v2f generate to create video from text when you need demo content.
Trade-offs to know
- Frame-by-frame processing at high FPS is I/O heavy. Start at 1 FPS and adjust up.
- v2f describe uses a local vision model, so quality depends on the model and GPU availability.
- v2f generate without an OpenAI key produces placeholder frames. For production-quality generation, set OPENAI_API_KEY.
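To see why FPS matters for I/O, it helps to estimate how many frames a run will write. This is a back-of-the-envelope sketch. The 100 KB average JPEG size is an assumed placeholder, not a measured video2flow figure, and the real frame count may differ by a frame depending on how the extractor rounds.

```python
def estimate_frames(duration_s, fps):
    """Approximate number of frames extracted from a clip."""
    return int(duration_s * fps)


def estimate_disk_mb(duration_s, fps, avg_frame_kb=100):
    """Approximate disk usage in MB; avg_frame_kb is an assumed placeholder."""
    return estimate_frames(duration_s, fps) * avg_frame_kb / 1000


# The 5-minute demo from the tutorial, at a few sampling rates.
for fps in (1, 5, 30):
    frames = estimate_frames(5 * 60, fps)
    print(f"{fps:>2} FPS: {frames} frames, ~{estimate_disk_mb(5 * 60, fps):.0f} MB")
```

At 1 FPS the 5-minute demo comes to about 300 frames. At 30 FPS it is about 9,000, which is why starting low and adjusting up is the safer default.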
Why not just ffmpeg?
ffmpeg extracts frames. video2flow extracts frames and also produces timestamped JSON and scene descriptions in an LLM-ready structure, all from one CLI. If you're already piping ffmpeg output into a Python script, video2flow replaces that boilerplate.
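For comparison, the hand-rolled version typically looks something like the sketch below: call ffmpeg, then rebuild the timestamps yourself. It illustrates the boilerplate being replaced, not how video2flow works internally. The names extract_frames and build_metadata are hypothetical, and the metadata fields simply mirror the sample output from the tutorial.

```python
import json
import subprocess
from pathlib import Path


def format_timestamp(seconds):
    """Format seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_metadata(frame_names, fps):
    """Pair each frame with its timestamp, assuming evenly spaced frames."""
    return [
        {
            "frame": name,
            "timestamp_s": index / fps,
            "timestamp": format_timestamp(index / fps),
        }
        for index, name in enumerate(frame_names)
    ]


def extract_frames(video, out_dir, fps=1):
    """Run ffmpeg (must be on PATH), then write metadata.json by hand."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}",
         "-start_number", "0", str(out / "frame_%04d.jpg")],
        check=True,
    )
    names = sorted(p.name for p in out.glob("frame_*.jpg"))
    (out / "metadata.json").write_text(json.dumps(build_metadata(names, fps), indent=2))


print(build_metadata(["frame_0000.jpg", "frame_0001.jpg"], fps=1))
```

Even this short version has to handle frame numbering, timestamp formatting, and file layout, which is the glue code a dedicated tool takes off your hands.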
Next steps
pip install video2flow
v2f --help
Repo: github.com/massiron/video2flow
Docs: deepstrain.dev
It's free, open-source, and pip-installable. No API key required for basic extraction.