Most “video AI” demos stop at transcription.
That’s useful, but it leaves a lot of signal on the table.
I wanted a pipeline that could take a video upload and automatically extract multi-modal context: transcript, scenes, objects, audio segments, OCR text, and technical metadata — then merge it into one structured output that another AI system can actually use.
So I open-sourced VideoAnalyzer:
GitHub: https://github.com/chrisk60331/VideoAnalyzer
## What it does
Upload a video → everything below runs automatically in the background:
| Step | What happens |
|---|---|
| Metadata probe | Duration, resolution, FPS via ffprobe |
| Whisper transcription | Full VTT transcript with timestamps using faster-whisper |
| YOLO object detection | Frame-by-frame detection at 1 FPS with YOLOv8 |
| Scene segmentation | Cut detection + per-scene brightness, motion, color palette |
| Audio classification | Speech / silence / music+noise segmentation |
| OCR | On-screen text extracted from scene keyframes using EasyOCR |
| Context assembly | Everything merged into a structured document for the AI |
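To make the first row concrete: the metadata probe boils down to parsing ffprobe's JSON output. Here's a minimal sketch of that step; the sample JSON and the `parse_probe` helper are my illustration, not the repo's actual code:

```python
import json

# Illustrative sample of the JSON that
# `ffprobe -v quiet -print_format json -show_format -show_streams input.mp4`
# would emit, truncated to the fields we need.
SAMPLE_FFPROBE_JSON = """
{
  "streams": [
    {"codec_type": "video", "width": 1920, "height": 1080,
     "avg_frame_rate": "30000/1001"}
  ],
  "format": {"duration": "125.400000"}
}
"""

def parse_probe(raw: str) -> dict:
    """Extract duration, resolution, and FPS from ffprobe's JSON output."""
    data = json.loads(raw)
    video = next(s for s in data["streams"] if s["codec_type"] == "video")
    # ffprobe reports frame rate as a rational like "30000/1001"
    num, den = (int(x) for x in video["avg_frame_rate"].split("/"))
    return {
        "duration_s": float(data["format"]["duration"]),
        "width": video["width"],
        "height": video["height"],
        "fps": round(num / den, 3),
    }

meta = parse_probe(SAMPLE_FFPROBE_JSON)
print(meta)  # duration, resolution, and fps, ready for the context document
```

Note the rational frame rate: NTSC-style sources report `30000/1001`, so dividing the two integers (rather than trusting a pre-rounded float) gives the true 29.97 fps.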
## Why this matters
If you want LLMs or downstream systems to work well with video, a transcript alone usually isn’t enough.
A better representation includes:
- What was said
- What appeared on screen
- What text was visible
- When scenes changed
- What the audio environment was doing
- What the source video technically looked like
That richer context opens the door for better:
- video search
- summarization
- highlight extraction
- compliance review
- media intelligence
- RAG pipelines over video
- agent workflows that need grounded visual/audio evidence
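To make that representation concrete, here's a rough sketch of what a merged context document could look like. The schema and field names are illustrative, not VideoAnalyzer's actual output format:

```python
# Hypothetical merged multi-modal context document: one key per signal
# layer listed above. Field names and shapes are illustrative.
def assemble_context(metadata, transcript, objects, scenes, ocr, audio):
    """Merge per-stage outputs into one AI-friendly document."""
    return {
        "metadata": metadata,      # what the source video technically looked like
        "transcript": transcript,  # what was said, with timestamps
        "objects": objects,        # what appeared on screen
        "ocr": ocr,                # what text was visible
        "scenes": scenes,          # when scenes changed
        "audio": audio,            # what the audio environment was doing
    }

doc = assemble_context(
    metadata={"duration_s": 12.0, "fps": 30},
    transcript=[{"start": 0.0, "end": 4.2, "text": "Welcome to the demo."}],
    objects=[{"t": 1.0, "label": "person", "conf": 0.91}],
    scenes=[{"start": 0.0, "end": 6.5, "brightness": 0.62}],
    ocr=[{"t": 2.0, "text": "LIVE"}],
    audio=[{"start": 0.0, "end": 12.0, "kind": "speech"}],
)
```

Because every layer carries timestamps, a downstream system can line up "a person was on screen, the caption said LIVE, and the speaker said X" at the same moment — which is exactly the grounding a transcript alone can't provide.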
## Pipeline overview
The basic idea is simple:
- A user uploads a video
- Background jobs process it automatically
- Each stage extracts a different layer of signal
- The results are merged into a structured AI-friendly representation
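As a sketch, the flow above could be orchestrated like this, with stub stage functions standing in for the real workers (ffprobe, faster-whisper, YOLOv8, and so on) — this is my simplified illustration, not the repo's job code:

```python
# Hypothetical orchestration of the background jobs. Each stub stands in
# for a real extraction stage and returns its layer of signal.
def probe(path):      return {"stage": "metadata", "path": path}
def transcribe(path): return {"stage": "transcript", "path": path}
def detect(path):     return {"stage": "objects", "path": path}
def segment(path):    return {"stage": "scenes", "path": path}

STAGES = [probe, transcribe, detect, segment]

def process_upload(path: str) -> dict:
    """Run every stage on the uploaded video and merge results by stage name."""
    results = (stage(path) for stage in STAGES)
    return {r["stage"]: r for r in results}

context = process_upload("upload.mp4")
```

In practice you'd run the stages as queued background jobs (several are independent and can run concurrently), but the shape is the same: each stage contributes one layer, and the merge step keys them into a single document.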
Instead of treating video as just text, VideoAnalyzer treats it as a multi-modal document.
## Stack
The current pipeline includes tools like:
- ffprobe
- faster-whisper
- YOLOv8
- EasyOCR
And additional processing for:
- scene detection
- brightness/motion analysis
- color palette extraction
- audio segmentation
- structured context generation
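The brightness/motion analysis is conceptually simple. Here's a toy sketch on grayscale frames represented as flat lists of pixel values in [0, 255] — the real pipeline would decode actual frames (e.g. with OpenCV), but the math is the same idea:

```python
# Illustrative per-frame brightness and motion measures. Frames are
# flat lists of grayscale pixel intensities in [0, 255].
def mean_brightness(frame):
    """Average pixel intensity of one frame."""
    return sum(frame) / len(frame)

def motion(prev, curr):
    """Mean absolute per-pixel difference between consecutive frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

frames = [
    [10, 10, 10, 10],      # dark, static...
    [10, 10, 10, 10],
    [200, 200, 200, 200],  # ...then a hard cut to a bright frame
]

brightness = [mean_brightness(f) for f in frames]
motions = [motion(a, b) for a, b in zip(frames, frames[1:])]
# A spike in `motions` (here: 0.0, then 190.0) is a simple cut-detection
# signal; averaging brightness within each detected scene gives the
# per-scene stats.
```

Production cut detectors are more robust (histogram comparisons, adaptive thresholds), but a frame-difference spike is the core intuition.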
## Example use cases
A few obvious ones:
- Searchable video archives
- AI agents that answer questions about video
- Automatic tagging/indexing
- Detecting visual entities across time
- Extracting on-screen text for knowledge pipelines
- Building structured context for summarization and QA
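The first use case falls out almost for free once you have timestamped segments. A minimal sketch of "searchable video archives" — the segment shape here is illustrative, not VideoAnalyzer's actual output:

```python
# Toy keyword search over timestamped transcript segments: find when
# something was said, not just whether it was said.
SEGMENTS = [
    {"start": 0.0, "end": 3.5, "text": "Welcome to the quarterly review."},
    {"start": 3.5, "end": 8.0, "text": "Revenue grew twelve percent."},
    {"start": 8.0, "end": 12.0, "text": "Next, the product roadmap."},
]

def search(segments, keyword):
    """Return (start, end) spans whose text mentions the keyword."""
    kw = keyword.lower()
    return [(s["start"], s["end"]) for s in segments if kw in s["text"].lower()]

hits = search(SEGMENTS, "revenue")
print(hits)  # [(3.5, 8.0)]
```

Swap the substring match for embeddings over the same segments and you have the retrieval half of a RAG pipeline over video — and because OCR and object detections carry timestamps too, the same pattern works for "when did this logo appear on screen".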
## Why I open-sourced it
Because this kind of tooling is useful beyond one project.
There’s a growing need for systems that can convert unstructured media into something AI can reason over reliably. I’d rather put the foundation out in the open so other builders can use it, improve it, or adapt it to their own stack.
## Repo
If you want to check it out, here it is:
https://github.com/chrisk60331/VideoAnalyzer
If it’s useful, feel free to star it, open issues, or contribute.