Most “video AI” demos stop at transcription.
That’s useful, but it leaves a lot of signal on the table.
I wanted a pipeline that could take a video upload and automatically extract multi-modal context: transcript, scenes, objects, audio segments, OCR text, and technical metadata — then merge it into one structured output that another AI system can actually use.
So I open-sourced VideoAnalyzer:
GitHub: https://github.com/chrisk60331/VideoAnalyzer
## What it does
Upload a video → everything below runs automatically in the background:
| Step | What happens |
|---|---|
| Metadata probe | Duration, resolution, FPS via ffprobe |
| Whisper transcription | Full VTT transcript with timestamps using faster-whisper |
| YOLO object detection | Frame-by-frame detection at 1 FPS with YOLOv8 |
| Scene segmentation | Cut detection + per-scene brightness, motion, color palette |
| Audio classification | Speech / silence / music+noise segmentation |
| OCR | On-screen text extracted from scene keyframes using EasyOCR |
| Context assembly | Everything merged into a structured document for the AI |
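To make the first row concrete: the metadata probe boils down to parsing ffprobe's JSON output. Here's a minimal sketch of that step; the sample JSON and the `parse_probe` helper are my illustration, not the repo's actual code:

```python
import json

# Illustrative sample of the JSON that
# `ffprobe -v quiet -print_format json -show_format -show_streams input.mp4`
# would emit, truncated to the fields we need.
SAMPLE_FFPROBE_JSON = """
{
  "streams": [
    {"codec_type": "video", "width": 1920, "height": 1080,
     "avg_frame_rate": "30000/1001"}
  ],
  "format": {"duration": "125.400000"}
}
"""

def parse_probe(raw: str) -> dict:
    """Extract duration, resolution, and FPS from ffprobe's JSON output."""
    data = json.loads(raw)
    video = next(s for s in data["streams"] if s["codec_type"] == "video")
    # ffprobe reports frame rate as a rational like "30000/1001"
    num, den = (int(x) for x in video["avg_frame_rate"].split("/"))
    return {
        "duration_s": float(data["format"]["duration"]),
        "width": video["width"],
        "height": video["height"],
        "fps": round(num / den, 3),
    }

meta = parse_probe(SAMPLE_FFPROBE_JSON)
print(meta)  # duration, resolution, and fps, ready for the context document
```

Note the rational frame rate: NTSC-style sources report `30000/1001`, so dividing the two integers (rather than trusting a pre-rounded float) gives the true 29.97 fps.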
## Why this matters
If you want LLMs or downstream systems to work well with video, a transcript alone usually isn’t enough.
A better representation includes:
- What was said
- What appeared on screen
- What text was visible
- When scenes changed
- What the audio environment was doing
- What the source video technically looked like
That richer context opens the door for better:
- video search
- summarization
- highlight extraction
- compliance review
- media intelligence
- RAG pipelines over video
- agent workflows that need grounded visual/audio evidence
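To make that representation concrete, here's a rough sketch of what a merged context document could look like. The schema and field names are illustrative, not VideoAnalyzer's actual output format:

```python
# Hypothetical merged multi-modal context document: one key per signal
# layer listed above. Field names and shapes are illustrative.
def assemble_context(metadata, transcript, objects, scenes, ocr, audio):
    """Merge per-stage outputs into one AI-friendly document."""
    return {
        "metadata": metadata,      # what the source video technically looked like
        "transcript": transcript,  # what was said, with timestamps
        "objects": objects,        # what appeared on screen
        "ocr": ocr,                # what text was visible
        "scenes": scenes,          # when scenes changed
        "audio": audio,            # what the audio environment was doing
    }

doc = assemble_context(
    metadata={"duration_s": 12.0, "fps": 30},
    transcript=[{"start": 0.0, "end": 4.2, "text": "Welcome to the demo."}],
    objects=[{"t": 1.0, "label": "person", "conf": 0.91}],
    scenes=[{"start": 0.0, "end": 6.5, "brightness": 0.62}],
    ocr=[{"t": 2.0, "text": "LIVE"}],
    audio=[{"start": 0.0, "end": 12.0, "kind": "speech"}],
)
```

Because every layer carries timestamps, a downstream system can line up "a person was on screen, the caption said LIVE, and the speaker said X" at the same moment — which is exactly the grounding a transcript alone can't provide.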
## Pipeline overview
The basic idea is simple:
- A user uploads a video
- Background jobs process it automatically
- Each stage extracts a different layer of signal
- The results are merged into a structured AI-friendly representation
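As a sketch, the flow above could be orchestrated like this, with stub stage functions standing in for the real workers (ffprobe, faster-whisper, YOLOv8, and so on) — this is my simplified illustration, not the repo's job code:

```python
# Hypothetical orchestration of the background jobs. Each stub stands in
# for a real extraction stage and returns its layer of signal.
def probe(path):      return {"stage": "metadata", "path": path}
def transcribe(path): return {"stage": "transcript", "path": path}
def detect(path):     return {"stage": "objects", "path": path}
def segment(path):    return {"stage": "scenes", "path": path}

STAGES = [probe, transcribe, detect, segment]

def process_upload(path: str) -> dict:
    """Run every stage on the uploaded video and merge results by stage name."""
    results = (stage(path) for stage in STAGES)
    return {r["stage"]: r for r in results}

context = process_upload("upload.mp4")
```

In practice you'd run the stages as queued background jobs (several are independent and can run concurrently), but the shape is the same: each stage contributes one layer, and the merge step keys them into a single document.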
Instead of treating video as just text, VideoAnalyzer treats it as a multi-modal document.
## Stack
The current pipeline includes tools like:
- ffprobe
- faster-whisper
- YOLOv8
- EasyOCR
And additional processing for:
- scene detection
- brightness/motion analysis
- color palette extraction
- audio segmentation
- structured context generation
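The brightness/motion analysis is conceptually simple. Here's a toy sketch on grayscale frames represented as flat lists of pixel values in [0, 255] — the real pipeline would decode actual frames (e.g. with OpenCV), but the math is the same idea:

```python
# Illustrative per-frame brightness and motion measures. Frames are
# flat lists of grayscale pixel intensities in [0, 255].
def mean_brightness(frame):
    """Average pixel intensity of one frame."""
    return sum(frame) / len(frame)

def motion(prev, curr):
    """Mean absolute per-pixel difference between consecutive frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

frames = [
    [10, 10, 10, 10],      # dark, static...
    [10, 10, 10, 10],
    [200, 200, 200, 200],  # ...then a hard cut to a bright frame
]

brightness = [mean_brightness(f) for f in frames]
motions = [motion(a, b) for a, b in zip(frames, frames[1:])]
# A spike in `motions` (here: 0.0, then 190.0) is a simple cut-detection
# signal; averaging brightness within each detected scene gives the
# per-scene stats.
```

Production cut detectors are more robust (histogram comparisons, adaptive thresholds), but a frame-difference spike is the core intuition.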
## Example use cases
A few obvious ones:
- Searchable video archives
- AI agents that answer questions about video
- Automatic tagging/indexing
- Detecting visual entities across time
- Extracting on-screen text for knowledge pipelines
- Building structured context for summarization and QA
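The first use case falls out almost for free once you have timestamped segments. A minimal sketch of "searchable video archives" — the segment shape here is illustrative, not VideoAnalyzer's actual output:

```python
# Toy keyword search over timestamped transcript segments: find when
# something was said, not just whether it was said.
SEGMENTS = [
    {"start": 0.0, "end": 3.5, "text": "Welcome to the quarterly review."},
    {"start": 3.5, "end": 8.0, "text": "Revenue grew twelve percent."},
    {"start": 8.0, "end": 12.0, "text": "Next, the product roadmap."},
]

def search(segments, keyword):
    """Return (start, end) spans whose text mentions the keyword."""
    kw = keyword.lower()
    return [(s["start"], s["end"]) for s in segments if kw in s["text"].lower()]

hits = search(SEGMENTS, "revenue")
print(hits)  # [(3.5, 8.0)]
```

Swap the substring match for embeddings over the same segments and you have the retrieval half of a RAG pipeline over video — and because OCR and object detections carry timestamps too, the same pattern works for "when did this logo appear on screen".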
## Why I open-sourced it
Because this kind of tooling is useful beyond one project.
There’s a growing need for systems that can convert unstructured media into something AI can reason over reliably. I’d rather put the foundation out in the open so other builders can use it, improve it, or adapt it to their own stack.
## Repo
If you want to check it out, here it is:
https://github.com/chrisk60331/VideoAnalyzer
If it’s useful, feel free to star it, open issues, or contribute.