DEV Community

larry

How I Built a Browser-Based Video to PowerPoint Converter

Long lecture recordings are a strange format.

The useful part is often not the video itself. It is the slide deck trapped inside the video: a lecture, webinar, product walkthrough, Zoom recording, or screen share where the presenter spends most of the time on mostly static slides.

So I built Video2Any, a browser-based tool that turns videos and screen recordings into PowerPoint decks, PDFs, image frames, and subtitles. The unusual part is that the video does not need to be uploaded. The browser does the decoding, slide detection, and export locally.

In my tests, an hour-long lecture-style video can usually be sampled in under a minute on a normal laptop. The exact speed depends on the video, browser, codec, resolution, and device, but the important design choice is this: the tool does not need to understand every pixel of every frame at full resolution.

It only needs to answer one question quickly:

Did the visible slide change?

The first version was not AI

The tempting answer today is: use AI.

But for this problem, AI is not the first tool I reach for.

A recorded slide deck has a useful property: it is mostly still. The same slide remains on screen for seconds or minutes, then a large part of the frame changes when the presenter advances to the next slide.

That means a simple visual detector can work surprisingly well:

  1. Sample the video every few seconds.
  2. Downscale each sampled frame to a small size, such as 160x90.
  3. Compare the current frame to the last kept slide frame.
  4. Keep the frame if enough visual blocks changed.
  5. Export the kept frames into a deck.

This is not semantic understanding. It does not know what the title says. It does not know whether a chart is important. It only detects meaningful visual changes.

For lecture videos, webinars, and screen recordings, that is often enough to get a useful first deck.
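
The five steps above can be sketched as one loop. This is a minimal sketch, not Video2Any's actual code: `selectSlideFrames` and the `isDifferent` callback are hypothetical names, and the frames are assumed to be already sampled and downscaled.

```javascript
// Hypothetical helper: walk the sampled frames in order and keep each one
// that differs enough from the last kept frame.
// `frames` is an array of { time, pixels }; `isDifferent(a, b)` is any
// comparison that returns true when the slide visibly changed.
function selectSlideFrames(frames, isDifferent) {
  const kept = [];
  for (const frame of frames) {
    const last = kept[kept.length - 1];
    // Always keep the first frame; afterwards compare with the last kept one.
    if (!last || isDifferent(last.pixels, frame.pixels)) {
      kept.push(frame);
    }
  }
  return kept;
}

// Toy example: each "frame" is a single number standing in for its pixels.
const frames = [
  { time: 0, pixels: 10 },
  { time: 3, pixels: 11 }, // small noise, same slide
  { time: 6, pixels: 90 }, // new slide
  { time: 9, pixels: 91 },
];
const slides = selectSlideFrames(frames, (a, b) => Math.abs(a - b) > 20);
console.log(slides.map((f) => f.time)); // [ 0, 6 ]
```

Comparing against the last kept frame, rather than the previous sample, means slow drift does not quietly accumulate into a missed slide change.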

Why downscale first?

A 1080p frame has more than two million pixels. Comparing full-resolution frames repeatedly is wasteful if the goal is only to detect slide transitions.

Instead, Video2Any compares a small version of the frame. At 160x90, the detector is looking at 14,400 pixels. That is small enough to be fast, but still large enough to tell whether a slide changed.

The export step can still capture a higher-quality image later. Detection and export do not need to use the same resolution.

This split matters:

  • Low-resolution frames for fast detection.
  • Higher-resolution frames for final PowerPoint/PDF/image export.
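
To make the downscaling step concrete, here is a small box-averaging sketch over a flat RGBA buffer. The function name `downscaleRGBA` is hypothetical and this is probably not how Video2Any does it: in a browser, drawing the video onto a small canvas would normally do the scaling for you. The sketch only shows what "compare a small version of the frame" means.

```javascript
// Hypothetical helper: box-average an RGBA buffer down to dstW x dstH.
// Assumes dstW <= srcW and dstH <= srcH, and 4 bytes per pixel, row-major.
function downscaleRGBA(src, srcW, srcH, dstW, dstH) {
  const dst = new Uint8ClampedArray(dstW * dstH * 4);
  for (let dy = 0; dy < dstH; dy++) {
    const y0 = Math.floor((dy * srcH) / dstH);
    const y1 = Math.max(y0 + 1, Math.floor(((dy + 1) * srcH) / dstH));
    for (let dx = 0; dx < dstW; dx++) {
      const x0 = Math.floor((dx * srcW) / dstW);
      const x1 = Math.max(x0 + 1, Math.floor(((dx + 1) * srcW) / dstW));
      const sum = [0, 0, 0, 0];
      // Add up every source pixel that falls inside this output pixel's box.
      for (let y = y0; y < y1; y++) {
        for (let x = x0; x < x1; x++) {
          const i = (y * srcW + x) * 4;
          for (let c = 0; c < 4; c++) sum[c] += src[i + c];
        }
      }
      const n = (y1 - y0) * (x1 - x0);
      const o = (dy * dstW + dx) * 4;
      for (let c = 0; c < 4; c++) dst[o + c] = Math.round(sum[c] / n);
    }
  }
  return dst;
}

// A 4x2 source image: left half black, right half white, fully opaque.
const src = new Uint8ClampedArray(4 * 2 * 4);
for (let y = 0; y < 2; y++) {
  for (let x = 0; x < 4; x++) {
    const v = x < 2 ? 0 : 255;
    src.set([v, v, v, 255], (y * 4 + x) * 4);
  }
}
const small = downscaleRGBA(src, 4, 2, 2, 1);
console.log(small.join(',')); // 0,0,0,255,255,255,255,255

// The saving described above: 1080p pixels per 160x90 pixel.
console.log((1920 * 1080) / (160 * 90)); // 144
```

Each comparison at 160x90 touches 144 times fewer pixels than a 1080p comparison, which is where most of the speed comes from.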

Block-based frame differencing

A naive frame diff compares individual pixels. That is too sensitive for compressed video.

Video compression introduces small pixel-level changes even when the slide itself has not changed. Small on-screen elements add more noise: a cursor might move, a webcam bubble might flicker, or a progress bar might animate.

So the detector uses blocks.

At a high level:

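// High-level sketch: blocks() and meanAbsoluteDifference() are helpers
// that are not shown here.
function changedEnough(previousFrame, currentFrame) {
  const blockSize = 8;       // edge length of each square block, in pixels
  const blockDelta = 14;     // minimum mean difference for a block to count as changed
  const changedRatio = 0.02; // fraction of changed blocks that marks a new slide

  let changedBlocks = 0;
  let totalBlocks = 0;

  for (const block of blocks(currentFrame, blockSize)) {
    // Average the difference over the whole block so single noisy pixels do not count.
    const diff = meanAbsoluteDifference(previousFrame, currentFrame, block);
    if (diff > blockDelta) changedBlocks++;
    totalBlocks++;
  }

  return changedBlocks / totalBlocks > changedRatio;
}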

The exact implementation has more edge cases, but the idea is simple: a block counts as changed only if its average visual difference is high enough. Then a frame becomes a new slide only if enough blocks changed.

This makes the detector less jumpy than pixel-by-pixel comparison.
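
For readers who want something they can run, here is one way the two missing helpers could be filled in. This is a sketch under assumptions, not the actual Video2Any implementation: frames are assumed to be flat RGBA arrays of equal size (the layout `getImageData` returns), only the RGB channels are compared, and the names `blockMeanAbsDiff` and `changedBlockRatio` are made up for illustration. The default thresholds are the ones from the snippet above.

```javascript
// Mean absolute difference over the RGB channels of one block.
// Frames are assumed to be flat RGBA arrays (4 bytes per pixel, row-major).
function blockMeanAbsDiff(prev, curr, width, height, bx, by, blockSize) {
  const xEnd = Math.min(bx + blockSize, width);
  const yEnd = Math.min(by + blockSize, height);
  let sum = 0;
  let samples = 0;
  for (let y = by; y < yEnd; y++) {
    for (let x = bx; x < xEnd; x++) {
      const i = (y * width + x) * 4;
      sum += Math.abs(curr[i] - prev[i]);
      sum += Math.abs(curr[i + 1] - prev[i + 1]);
      sum += Math.abs(curr[i + 2] - prev[i + 2]);
      samples += 3;
    }
  }
  return sum / samples;
}

// Fraction of blocks whose mean difference exceeds blockDelta.
function changedBlockRatio(prev, curr, width, height, blockSize = 8, blockDelta = 14) {
  let changed = 0;
  let total = 0;
  for (let by = 0; by < height; by += blockSize) {
    for (let bx = 0; bx < width; bx += blockSize) {
      const diff = blockMeanAbsDiff(prev, curr, width, height, bx, by, blockSize);
      if (diff > blockDelta) changed++;
      total++;
    }
  }
  return changed / total;
}

// Demo on 160x90 frames: a flat gray frame, a slightly noisy copy,
// and a copy where an 80x40 region was repainted white.
const W = 160;
const H = 90;
const base = new Uint8ClampedArray(W * H * 4).fill(128);
const noisy = base.map((v) => v + 3);
const repainted = base.slice();
for (let y = 0; y < 40; y++) {
  for (let x = 0; x < 80; x++) {
    const i = (y * W + x) * 4;
    repainted.fill(255, i, i + 4);
  }
}

console.log(changedBlockRatio(base, noisy, W, H) > 0.02);     // false
console.log(changedBlockRatio(base, repainted, W, H) > 0.02); // true
```

Uniform low-level noise never pushes any block over `blockDelta`, while the repainted region flips 50 of the 240 blocks, far above the 2% ratio.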

Why the browser is enough

Modern browsers can do a lot of media work locally:

  • Decode video from a File or Blob.
  • Seek through timestamps.
  • Draw frames to a canvas.
  • Read image data for diffing.
  • Generate PowerPoint, PDF, or ZIP exports in JavaScript.

The privacy benefit is obvious: if the browser does the work, the file does not need to touch a server.

The cost benefit is also important. If users are converting large private lecture recordings, uploading and processing everything server-side becomes expensive and slow. Local processing makes the free tier much more practical.
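
The first four capabilities in the list above can be combined in a few lines of standard DOM code. The sketch below is an assumption about how such a sampler might look, not the code Video2Any ships: `sampleTimestamps`, `once`, and `sampleFrames` are hypothetical names, error handling is omitted, and seek accuracy varies between browsers and codecs. Only `sampleTimestamps` runs outside a browser.

```javascript
// Timestamps (in seconds) at which to sample a video of the given duration.
function sampleTimestamps(duration, intervalSeconds) {
  const times = [];
  // Guard against live streams (Infinity) and invalid intervals.
  if (!Number.isFinite(duration) || !(intervalSeconds > 0)) return times;
  for (let t = 0; t < duration; t += intervalSeconds) times.push(t);
  return times;
}

// Browser-only: resolve once the named event fires on the target.
function once(target, eventName) {
  return new Promise((resolve) => {
    target.addEventListener(eventName, resolve, { once: true });
  });
}

// Browser-only sketch: decode a local File and yield small RGBA frames.
async function* sampleFrames(file, intervalSeconds = 3, width = 160, height = 90) {
  const video = document.createElement('video');
  video.muted = true;
  video.preload = 'auto';
  video.src = URL.createObjectURL(file); // local blob URL, nothing is uploaded
  await once(video, 'loadedmetadata');

  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  const ctx = canvas.getContext('2d', { willReadFrequently: true });

  try {
    for (const time of sampleTimestamps(video.duration, intervalSeconds)) {
      const seeked = once(video, 'seeked');
      video.currentTime = time; // ask the browser to seek
      await seeked;
      ctx.drawImage(video, 0, 0, width, height); // downscale while drawing
      yield { time, rgba: ctx.getImageData(0, 0, width, height).data };
    }
  } finally {
    URL.revokeObjectURL(video.src);
  }
}

console.log(sampleTimestamps(10, 3)); // [ 0, 3, 6, 9 ]
```

In a page, `for await (const frame of sampleFrames(file)) { ... }` would feed each small frame to the detector while the original file stays on the user's machine.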

The open-core detector

I extracted the basic detector into a tiny open-source package:

npm install video-slide-extractor

Usage looks like this:

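import { createSlideDetector } from 'video-slide-extractor';

// Frames are expected at the same small size the detector was created with.
const width = 160;
const height = 90;
const detect = createSlideDetector(width, height);

// sampledFrames is not defined here: it stands for frames you have already
// decoded and downscaled, each carrying a timestamp and its pixel data.
for (const frame of sampledFrames) {
  const result = detect(frame.rgba);
  if (result.keep) {
    console.log('new slide near', frame.time);
  }
}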

The package source is on GitHub; the link is in the "Try it" section at the end of this post.

It is intentionally small. It does not export PowerPoint files or run OCR. It just detects slide and scene changes from frames.

What is harder than it sounds?

A few cases make this problem more interesting.

Fade transitions

If a deck uses fade animations, the detector might catch a half-transition frame instead of the final clean slide. A better pipeline should refine the timestamp after a change and capture the settled frame.
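
One possible way to find the settled frame, sketched under assumptions: after a change is detected, sample a few more frames at a finer interval and stop once two consecutive frames are nearly identical. `findSettledIndex` and its parameters are hypothetical, and the toy data uses single numbers in place of real frames.

```javascript
// Hypothetical helper: starting at the frame where a change was detected,
// advance until two consecutive frames are nearly identical, then return
// the index of the later one (the "settled" frame).
// `diff(a, b)` returns a change score; lower means more similar.
function findSettledIndex(frames, startIndex, diff, stableBelow) {
  for (let i = startIndex + 1; i < frames.length; i++) {
    if (diff(frames[i - 1], frames[i]) < stableBelow) return i;
  }
  // Never settled within the window: fall back to the last frame available.
  return frames.length - 1;
}

// Toy fade: brightness ramps from slide A (0) to slide B (100), then holds.
const fade = [0, 0, 25, 50, 75, 100, 100, 100];
const detectedAt = 2; // first frame that differs from slide A
const settled = findSettledIndex(fade, detectedAt, (a, b) => Math.abs(a - b), 1);
console.log(settled, fade[settled]); // 6 100
```

The detector fires at index 2, in the middle of the fade, but the exported image comes from index 6, where the new slide is fully visible.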

Webcam overlays

If a webcam bubble moves constantly, it can create visual noise. Block-based diffing helps, but large overlays can still confuse the detector. One improvement is activity masking: identify small regions that move all the time and reduce their weight.
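
A minimal sketch of activity masking, assuming you already have a per-block "changed" flag for each pair of consecutive samples. The names `buildActivityMask` and `maskedChangedRatio` are hypothetical, the 0.8 cutoff is an arbitrary illustration, and this version ignores busy blocks entirely, which is the simplest form of reducing their weight.

```javascript
// `blockChanges[k][b]` is true when block b changed between sample k and k+1.
// Returns a mask where true means "this block moves almost all the time".
function buildActivityMask(blockChanges, maxActivity = 0.8) {
  const blockCount = blockChanges[0].length;
  const counts = new Array(blockCount).fill(0);
  for (const pair of blockChanges) {
    pair.forEach((changed, b) => {
      if (changed) counts[b]++;
    });
  }
  return counts.map((c) => c / blockChanges.length > maxActivity);
}

// Changed-block ratio computed only over blocks that are not masked out.
function maskedChangedRatio(changedBlocks, mask) {
  let changed = 0;
  let total = 0;
  changedBlocks.forEach((isChanged, b) => {
    if (mask[b]) return; // skip always-moving regions such as a webcam bubble
    total++;
    if (isChanged) changed++;
  });
  return total === 0 ? 0 : changed / total;
}

// Toy video with 4 blocks; block 3 holds a webcam bubble that always moves.
const blockChanges = [
  [false, false, false, true],
  [false, false, false, true],
  [true, false, false, true],
  [false, false, false, true],
  [false, false, false, true],
];
const mask = buildActivityMask(blockChanges);
console.log(mask); // [ false, false, false, true ]

// Only the webcam block changed: with the mask, nothing counts as changed.
console.log(maskedChangedRatio([false, false, false, true], mask)); // 0
```

Without the mask, that last comparison would report 1 of 4 blocks changed and fire a false slide transition.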

Returning to an earlier slide

Presenters often jump back. Local frame comparison only knows about the last kept frame. A stronger pipeline also compares against earlier kept slides and removes duplicates.
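
The duplicate removal can be sketched as a second pass over the kept slides. `dedupeSlides` and the `isSameSlide` callback are hypothetical names; in practice `isSameSlide` could be the same block comparison used for detection, possibly with a stricter threshold.

```javascript
// Hypothetical helper: keep only the first occurrence of each slide by
// comparing every kept slide against all earlier unique ones.
function dedupeSlides(slides, isSameSlide) {
  const unique = [];
  for (const slide of slides) {
    const seen = unique.some((earlier) => isSameSlide(earlier.pixels, slide.pixels));
    if (!seen) unique.push(slide);
  }
  return unique;
}

// Toy example: the presenter shows A, then B, jumps back to A, then shows C.
const keptSlides = [
  { time: 0, pixels: 10 },   // slide A
  { time: 60, pixels: 50 },  // slide B
  { time: 120, pixels: 11 }, // slide A again, with a little noise
  { time: 180, pixels: 90 }, // slide C
];
const uniqueSlides = dedupeSlides(keptSlides, (a, b) => Math.abs(a - b) <= 5);
console.log(uniqueSlides.map((s) => s.time)); // [ 0, 60, 180 ]
```

This pass is quadratic in the number of kept slides, which is usually fine because a deck has tens or hundreds of slides, not thousands.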

Text-only changes

The hardest useful case is a mostly identical slide where only a line of text changes. The threshold needs to be sensitive enough to catch that, without firing on video noise.

This is where calibration helps. Instead of using one fixed threshold for every video, you can sample the video first and estimate the gap between same-slide noise and real slide changes.
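
One simple calibration heuristic, offered as an illustration rather than as what Video2Any does: collect the change scores between consecutive samples, sort them, and put the threshold in the middle of the largest gap. `calibrateThreshold` is a hypothetical name, and the heuristic assumes the sampled video contains at least one real slide change.

```javascript
// Hypothetical helper: pick a threshold in the widest gap between sorted
// change scores, on the assumption that same-slide noise clusters low and
// real slide changes cluster high.
function calibrateThreshold(scores, fallback = 0.02) {
  if (scores.length < 2) return fallback;
  const sorted = [...scores].sort((a, b) => a - b);
  let bestGap = 0;
  let threshold = fallback;
  for (let i = 1; i < sorted.length; i++) {
    const gap = sorted[i] - sorted[i - 1];
    if (gap > bestGap) {
      bestGap = gap;
      threshold = (sorted[i] + sorted[i - 1]) / 2;
    }
  }
  return threshold;
}

// Changed-block ratios from a pre-pass: mostly noise, three real transitions.
const scores = [0.004, 0.006, 0.005, 0.31, 0.007, 0.42, 0.005, 0.28];
const threshold = calibrateThreshold(scores);
console.log(threshold.toFixed(4)); // 0.1435
```

Here the noise tops out at 0.007 and the smallest real change is 0.28, so the threshold lands between them. If a video has no slide changes at all, the widest gap sits inside the noise, so a real pipeline would likely also enforce a minimum threshold.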

What AI is still useful for

AI still has a place here. I just think it belongs later in the pipeline.

Frame differencing is good for finding slide frames. AI and OCR are useful after that:

  • OCR to rebuild editable text.
  • Speech-to-text for subtitles or speaker notes.
  • Layout analysis to turn screenshots into structured slides.
  • Summarization to produce study notes.

That split keeps the first pass fast, private, and cheap.

Where this works best

Video2Any works best for:

  • Lecture recordings.
  • Webinars.
  • Conference talks.
  • Zoom, Meet, and Teams recordings.
  • Screen recordings and product walkthroughs.
  • Mostly static slide decks captured as video.

It is less suited to:

  • Handheld camera footage.
  • Highly animated videos.
  • Videos where slides are tiny or blurred.
  • Recordings where the presenter covers the slide content.

Try it

The browser tool is here:

https://video2any.com

The open-source detector is here:

https://github.com/larry-xue/video-slide-extractor

I also started a small curated list of related tools and libraries:

https://github.com/larry-xue/awesome-video-to-slides

I would be especially interested in feedback from people who deal with long lecture recordings, internal training videos, or webinar archives. The core question is simple: is the extracted deck good enough to replace scrubbing through the video?
