Chris King

I Open-Sourced VideoAnalyzer: Turn Raw Video Into Structured AI Context

Most “video AI” demos stop at transcription.

That’s useful, but it leaves a lot of signal on the table.

I wanted a pipeline that could take a video upload and automatically extract multi-modal context: transcript, scenes, objects, audio segments, OCR text, and technical metadata — then merge it into one structured output that another AI system can actually use.

So I open-sourced VideoAnalyzer:

GitHub: https://github.com/chrisk60331/VideoAnalyzer

What it does

Upload a video → everything below runs automatically in the background:

| Step | What happens |
| --- | --- |
| Metadata probe | Duration, resolution, FPS via ffprobe |
| Whisper transcription | Full VTT transcript with timestamps using faster-whisper |
| YOLO object detection | Frame-by-frame detection at 1 FPS with YOLOv8 |
| Scene segmentation | Cut detection + per-scene brightness, motion, color palette |
| Audio classification | Speech / silence / music+noise segmentation |
| OCR | On-screen text extracted from scene keyframes using EasyOCR |
| Context assembly | Everything merged into a structured document for the AI |
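To make the first step concrete: the metadata probe boils down to one ffprobe call plus a little JSON parsing. Here's a minimal sketch — the helper names and return shape are my own assumptions for illustration, not the repo's actual code:

```python
import json
import subprocess


def parse_probe(info: dict) -> dict:
    """Pull duration, resolution, and FPS out of ffprobe's JSON output."""
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    num, den = video["r_frame_rate"].split("/")  # e.g. "30000/1001"
    return {
        "duration_s": float(info["format"]["duration"]),
        "width": video["width"],
        "height": video["height"],
        "fps": float(num) / float(den),
    }


def probe_metadata(path: str) -> dict:
    """Run ffprobe on a video file and return basic technical metadata."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_probe(json.loads(out))
```

Keeping the parsing separate from the subprocess call makes the interesting part testable without a real video file.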

Why this matters

If you want LLMs or downstream systems to work well with video, a transcript alone usually isn’t enough.

A better representation includes:

  • What was said
  • What appeared on screen
  • What text was visible
  • When scenes changed
  • What the audio environment was doing
  • What the source video technically looked like

That richer context opens the door to better:

  • video search
  • summarization
  • highlight extraction
  • compliance review
  • media intelligence
  • RAG pipelines over video
  • agent workflows that need grounded visual/audio evidence

Pipeline overview

The basic idea is simple:

  1. A user uploads a video
  2. Background jobs process it automatically
  3. Each stage extracts a different layer of signal
  4. The results are merged into a structured AI-friendly representation

Instead of treating video as just text, VideoAnalyzer treats it as a multi-modal document.
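The assembly step can be sketched as a merge over independent extraction stages. This is a simplified, hypothetical version — the real project runs stages as background jobs rather than a sequential loop, and the names here are mine:

```python
from typing import Any, Callable


def build_context(path: str, stages: dict[str, Callable[[str], Any]]) -> dict:
    """Run each extraction stage on a video and merge the results
    into one structured, AI-friendly document."""
    context: dict[str, Any] = {"source": path}
    for name, stage in stages.items():
        # Each stage takes the video path and returns one layer of signal,
        # e.g. {"metadata": ..., "transcript": ..., "scenes": ...}.
        context[name] = stage(path)
    return context
```

The payoff of this shape is that every modality lands under its own key, so downstream consumers can pick the layers they need.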

Stack

The current pipeline includes tools like:

  • ffprobe
  • faster-whisper
  • YOLOv8
  • EasyOCR

And additional processing for:

  • scene detection
  • brightness/motion analysis
  • color palette extraction
  • audio segmentation
  • structured context generation
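As an example of wiring one of these tools in: faster-whisper yields timestamped segments that serialize naturally to WebVTT. A minimal sketch, assuming a small model and int8 compute (the function names and settings are mine, not the repo's):

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def transcribe_to_vtt(path: str) -> str:
    """Transcribe a video's audio track and return a WebVTT document."""
    # Imported lazily so the (large) model only loads when this stage runs.
    from faster_whisper import WhisperModel

    model = WhisperModel("base", compute_type="int8")
    segments, _info = model.transcribe(path)
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{fmt_ts(seg.start)} --> {fmt_ts(seg.end)}")
        lines.append(seg.text.strip())
        lines.append("")
    return "\n".join(lines)
```

Keeping timestamps in VTT form means the transcript layer stays aligned with the scene and detection layers when everything is merged.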

Example use cases

A few obvious ones:

  • Searchable video archives
  • AI agents that answer questions about video
  • Automatic tagging/indexing
  • Detecting visual entities across time
  • Extracting on-screen text for knowledge pipelines
  • Building structured context for summarization and QA
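To make the search use case concrete: once the layers are merged, even naive cross-modal keyword search falls out almost for free. The context shape below is hypothetical — it illustrates the idea, not the project's actual schema:

```python
def search_context(context: dict, query: str) -> list[dict]:
    """Find transcript and OCR entries mentioning a query term,
    returning each hit tagged with the layer it came from."""
    q = query.lower()
    hits = []
    for layer in ("transcript", "ocr"):
        for item in context.get(layer, []):
            if q in item["text"].lower():
                hits.append({"layer": layer, **item})
    return hits
```

Because every hit carries a timestamp, the same lookup can drive highlight extraction or feed grounded evidence to an agent.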

Why I open-sourced it

Because this kind of tooling is useful beyond one project.

There’s a growing need for systems that can convert unstructured media into something AI can reason over reliably. I’d rather put the foundation out in the open so other builders can use it, improve it, or adapt it to their own stack.

Repo

If you want to check it out, here it is:

https://github.com/chrisk60331/VideoAnalyzer

If it’s useful, feel free to star it, open issues, or contribute.
