larry

Posted on Jul 4

How I Built a Browser-Based Video to PowerPoint Converter

#javascript #webdev #video #opensource

Long lecture recordings are a strange format.

The useful part is often not the video itself. It is the slide deck trapped inside the video: a lecture, webinar, product walkthrough, Zoom recording, or screen share where the presenter spends most of the time on mostly-static slides.

So I built Video2Any, a browser-based tool that turns videos and screen recordings into PowerPoint decks, PDFs, image frames, and subtitles. The unusual part is that the video does not need to be uploaded. The browser does the decoding, slide detection, and export locally.

In my tests, an hour-long lecture-style video can usually be sampled in under a minute on a normal laptop. The exact speed depends on the video, browser, codec, resolution, and device, but the important design choice is this: the tool does not need to understand every pixel of every frame at full resolution.

It only needs to answer one question quickly:

Did the visible slide change?

The first version was not AI

The tempting answer today is: use AI.

But for this problem, AI is not the first tool I reach for.

A recorded slide deck has a useful property: it is mostly still. The same slide remains on screen for seconds or minutes, then a large part of the frame changes when the presenter advances to the next slide.

That means a simple visual detector can work surprisingly well:

Sample the video every few seconds.
Downscale each sampled frame to a small size, such as 160x90.
Compare the current frame to the last kept slide frame.
Keep the frame if enough visual blocks changed.
Export the kept frames into a deck.
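
The steps above can be sketched as a small loop. This is a minimal sketch under assumptions, not Video2Any's actual code: `sampleTimes`, `pickSlides`, and the `changed` callback are hypothetical names, and a "frame" can be any value the callback knows how to compare.

```javascript
// Hypothetical helper: timestamps (in seconds) at which to sample the video.
function sampleTimes(durationSec, intervalSec) {
  const times = [];
  for (let t = 0; t < durationSec; t += intervalSec) times.push(t);
  return times;
}

// Hypothetical helper: keep a frame whenever it differs enough from the
// last kept frame. `changed(prev, curr)` is supplied by the caller.
function pickSlides(frames, changed) {
  const kept = [];
  let lastKept = null;
  for (const frame of frames) {
    // The first frame is always kept; later frames must pass the change test.
    if (lastKept === null || changed(lastKept, frame)) {
      kept.push(frame);
      lastKept = frame;
    }
  }
  return kept;
}
```

Comparing against the last kept frame, rather than the previous sample, means slow drift cannot sneak past the detector one small step at a time.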

This is not semantic understanding. It does not know what the title says. It does not know whether a chart is important. It only detects meaningful visual changes.

For lecture videos, webinars, and screen recordings, that is often enough to get a useful first deck.

Why downscale first?

A 1080p frame has more than two million pixels. Comparing full-resolution frames repeatedly is wasteful if the goal is only to detect slide transitions.

Instead, Video2Any compares a small version of the frame. At 160x90, the detector is looking at 14,400 pixels. That is small enough to be fast, but still large enough to tell whether a slide changed.

The export step can still capture a higher-quality image later. Detection and export do not need to use the same resolution.

This split matters:

Low-resolution frames for fast detection.
Higher-resolution frames for final PowerPoint/PDF/image export.
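
To put numbers on it: 1920x1080 is 2,073,600 pixels, and 160x90 is 14,400, so the detector looks at 144 times fewer pixels. In a browser the usual way to downscale is to draw the video frame onto a small canvas. The sketch below does the same thing in plain JavaScript so the arithmetic is visible. `downscaleToGray` is a hypothetical helper, and converting to grayscale is my assumption, not something the tool is confirmed to do.

```javascript
// Hypothetical helper: box-average an RGBA frame down to a small grayscale frame.
// rgba holds srcW * srcH * 4 bytes (as returned by canvas getImageData).
function downscaleToGray(rgba, srcW, srcH, dstW, dstH) {
  const out = new Uint8Array(dstW * dstH);
  for (let dy = 0; dy < dstH; dy++) {
    const y0 = Math.floor((dy * srcH) / dstH);
    const y1 = Math.max(y0 + 1, Math.floor(((dy + 1) * srcH) / dstH));
    for (let dx = 0; dx < dstW; dx++) {
      const x0 = Math.floor((dx * srcW) / dstW);
      const x1 = Math.max(x0 + 1, Math.floor(((dx + 1) * srcW) / dstW));
      let sum = 0;
      for (let y = y0; y < y1; y++) {
        for (let x = x0; x < x1; x++) {
          const i = (y * srcW + x) * 4;
          // Rec. 601 luma weights; the alpha channel is ignored.
          sum += 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
        }
      }
      out[dy * dstW + dx] = Math.round(sum / ((x1 - x0) * (y1 - y0)));
    }
  }
  return out;
}
```

Averaging each source region also smooths out some compression noise before any comparison happens.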

Block-based frame differencing

A naive frame diff compares individual pixels. That is too sensitive for compressed video.

Video compression creates small pixel-level changes even when the slide did not actually change. On top of that, a cursor might move, a webcam bubble might flicker, or a progress bar might animate.

So the detector uses blocks.

At a high level:

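function changedEnough(previousFrame, currentFrame) {
  const blockSize = 8;       // block edge length, in pixels
  const blockDelta = 14;     // mean difference above which a block counts as changed
  const changedRatio = 0.02; // fraction of changed blocks that signals a new slide

  let changedBlocks = 0;
  let totalBlocks = 0;

  // blocks() and meanAbsoluteDifference() are helpers not shown here.
  for (const block of blocks(currentFrame, blockSize)) {
    const diff = meanAbsoluteDifference(previousFrame, currentFrame, block);
    if (diff > blockDelta) changedBlocks++;
    totalBlocks++;
  }

  return changedBlocks / totalBlocks > changedRatio;
}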

The exact implementation has more edge cases, but the idea is simple: a block counts as changed only if its average visual difference is high enough. Then a frame becomes a new slide only if enough blocks changed.

This makes the detector less jumpy than pixel-by-pixel comparison.
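
For readers who want something they can run, here is a self-contained version of the same idea. It is a sketch under assumptions: frames are flat grayscale arrays of width * height bytes, partial blocks at the edges are clamped rather than dropped, and `blockMeanAbsDiff` and `changedBlockRatio` are hypothetical names, not the published package's API.

```javascript
// Mean absolute difference of one block between two grayscale frames.
function blockMeanAbsDiff(prev, curr, width, x0, y0, x1, y1) {
  let sum = 0;
  for (let y = y0; y < y1; y++) {
    for (let x = x0; x < x1; x++) {
      const i = y * width + x;
      sum += Math.abs(curr[i] - prev[i]);
    }
  }
  return sum / ((x1 - x0) * (y1 - y0));
}

// Fraction of blocks whose mean absolute difference exceeds blockDelta.
function changedBlockRatio(prev, curr, width, height, blockSize = 8, blockDelta = 14) {
  let changed = 0;
  let total = 0;
  for (let y0 = 0; y0 < height; y0 += blockSize) {
    for (let x0 = 0; x0 < width; x0 += blockSize) {
      const x1 = Math.min(x0 + blockSize, width);  // clamp partial edge blocks
      const y1 = Math.min(y0 + blockSize, height);
      if (blockMeanAbsDiff(prev, curr, width, x0, y0, x1, y1) > blockDelta) changed++;
      total++;
    }
  }
  return changed / total;
}
```

A frame would then count as a new slide when `changedBlockRatio(prev, curr, 160, 90) > 0.02`. In this sketch a 160x90 frame has 20 columns by 12 rows of blocks (the last row is partial), so 240 blocks, and a ratio above 0.02 means at least 5 changed blocks. A single pixel flipping from 0 to 255 moves its block's mean by only 255 / 64, about 4, which stays under the block threshold of 14.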

Why the browser is enough

Modern browsers can do a lot of media work locally:

Decode video from a File or Blob.
Seek through timestamps.
Draw frames to a canvas.
Read image data for diffing.
Generate PowerPoint, PDF, or ZIP exports in JavaScript.
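
The seek, draw, and read steps map onto standard DOM APIs. The sketch below shows the simplest route, a seek-and-draw loop on a video element. `seekTo` and `grabFrame` are hypothetical helper names, and this is an illustration of the browser calls involved rather than the tool's actual code.

```javascript
// Resolve once the video element has finished seeking to `timeSec`.
function seekTo(video, timeSec) {
  return new Promise((resolve) => {
    video.addEventListener('seeked', () => resolve(), { once: true });
    video.currentTime = timeSec;
  });
}

// Draw the current video frame onto a small canvas context and
// return its RGBA bytes (a Uint8ClampedArray of width * height * 4).
function grabFrame(video, ctx, width, height) {
  ctx.drawImage(video, 0, 0, width, height);
  return ctx.getImageData(0, 0, width, height).data;
}

// Browser usage (assumes `file` came from a file input and that the
// video's metadata has loaded before the first seek):
//   const video = document.createElement('video');
//   video.src = URL.createObjectURL(file);
//   const canvas = document.createElement('canvas');
//   canvas.width = 160; canvas.height = 90;
//   const ctx = canvas.getContext('2d', { willReadFrequently: true });
//   await seekTo(video, 30);
//   const rgba = grabFrame(video, ctx, 160, 90);
```

Because the file is only ever referenced through an object URL, nothing in this path sends video data over the network.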

The privacy benefit is obvious: if the browser does the work, the file does not need to touch a server.

The cost benefit is also important. If users are converting large private lecture recordings, uploading and processing everything server-side becomes expensive and slow. Local processing makes the free tier much more practical.

The open-core detector

I extracted the basic detector into a tiny open-source package:

npm install video-slide-extractor

Usage looks like this:

import { createSlideDetector } from 'video-slide-extractor';

const width = 160;
const height = 90;
const detect = createSlideDetector(width, height);

for (const frame of sampledFrames) {
  const result = detect(frame.rgba);
  if (result.keep) {
    console.log('new slide near', frame.time);
  }
}

The package repository is linked in the Try it section at the end of this post.

It is intentionally small. It does not export PowerPoint files or run OCR. It just detects slide and scene changes from frames.

What is harder than it sounds?

A few cases make this problem more interesting.

Fade transitions

If a deck uses fade animations, the detector might catch a half-transition frame instead of the final clean slide. A better pipeline should refine the timestamp after a change and capture the settled frame.
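
One way to sketch that refinement: once a change is detected, sample more densely after it and wait until two consecutive samples agree. `settledIndex` and its `maxLookahead` parameter are hypothetical, and this is only one possible approach.

```javascript
// Hypothetical helper: starting at the sample where a change was detected,
// advance until two consecutive samples look the same, then return that index.
function settledIndex(frames, startIndex, isSame, maxLookahead = 10) {
  const last = Math.min(frames.length - 1, startIndex + maxLookahead);
  for (let i = startIndex; i < last; i++) {
    if (isSame(frames[i], frames[i + 1])) return i;
  }
  return last; // never settled within the window: take the latest candidate
}
```

The lookahead cap keeps a video that never stops moving from stalling the scan.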

Webcam overlays

If a webcam bubble moves constantly, it can create visual noise. Block-based diffing helps, but large overlays can still confuse the detector. One improvement is activity masking: identify small regions that move all the time and reduce their weight.
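
A minimal sketch of activity masking, assuming you already have per-block changed flags for several consecutive sample pairs. The names `activityWeights` and `weightedChangedRatio` and the 0.8 cutoff are hypothetical, and this version simply zeroes busy blocks instead of reducing their weight gradually.

```javascript
// Hypothetical helper: given per-block "changed" flags for several consecutive
// sample pairs, return a weight per block: 1 for normal blocks, 0 for blocks
// that changed in at least `busyFraction` of the pairs.
function activityWeights(changedFlagsPerPair, busyFraction = 0.8) {
  const pairs = changedFlagsPerPair.length;
  const blocks = changedFlagsPerPair[0].length;
  const weights = new Array(blocks).fill(1);
  for (let b = 0; b < blocks; b++) {
    let hits = 0;
    for (const flags of changedFlagsPerPair) if (flags[b]) hits++;
    if (hits / pairs >= busyFraction) weights[b] = 0;
  }
  return weights;
}

// Weighted version of the changed-block ratio.
function weightedChangedRatio(flags, weights) {
  let changed = 0;
  let total = 0;
  for (let b = 0; b < flags.length; b++) {
    changed += flags[b] ? weights[b] : 0;
    total += weights[b];
  }
  return total === 0 ? 0 : changed / total;
}
```

Real slide regions change only at transitions, so their hit fraction stays low and they keep full weight.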

Returning to an earlier slide

Presenters often jump back. Local frame comparison only knows about the last kept frame. A stronger pipeline also compares against earlier kept slides and removes duplicates.
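
A sketch of that duplicate pass, assuming some `isSame(a, b)` comparison is available (for example, a changed-block ratio below the threshold). `dedupeSlides` is a hypothetical name.

```javascript
// Hypothetical helper: keep a slide only if it differs from every earlier
// kept slide. `isSame(a, b)` is supplied by the caller.
function dedupeSlides(slides, isSame) {
  const unique = [];
  for (const slide of slides) {
    if (!unique.some((earlier) => isSame(earlier, slide))) unique.push(slide);
  }
  return unique;
}
```

This is quadratic in the number of kept slides, which is fine for a deck of a few dozen or a few hundred slides compared at low resolution.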

Text-only changes

The hardest useful case is a mostly identical slide where only a line of text changes. The threshold needs to be sensitive enough to catch that, without firing on video noise.

This is where calibration helps. Instead of using one fixed threshold for every video, you can sample the video first and estimate the gap between same-slide noise and real slide changes.
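
One simple way to estimate that gap, sketched under assumptions: collect change scores between consecutive samples, sort them, and put the threshold in the middle of the widest gap. `calibrateThreshold` is a hypothetical name, and this heuristic is my illustration, not necessarily what Video2Any does.

```javascript
// Hypothetical helper: given change scores between consecutive samples,
// place the threshold in the middle of the widest gap between sorted scores.
function calibrateThreshold(scores, fallback = 0.02) {
  const sorted = [...scores].sort((a, b) => a - b);
  let bestGap = 0;
  let threshold = fallback;
  for (let i = 1; i < sorted.length; i++) {
    const gap = sorted[i] - sorted[i - 1];
    if (gap > bestGap) {
      bestGap = gap;
      threshold = (sorted[i] + sorted[i - 1]) / 2;
    }
  }
  return threshold;
}
```

The heuristic assumes the sample contains both same-slide noise and at least one real change. On a video with no slide changes at all, the widest gap is just noise, so a sanity bound on the result is still worth having.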

What AI is still useful for

AI still has a place here. I just think it belongs later in the pipeline.

Frame differencing is good for finding slide frames. AI and OCR are useful after that:

OCR to rebuild editable text.
Speech-to-text for subtitles or speaker notes.
Layout analysis to turn screenshots into structured slides.
Summarization to produce study notes.

That split keeps the first pass fast, private, and cheap.

Where this works best

Video2Any works best for:

Lecture recordings.
Webinars.
Conference talks.
Zoom, Meet, and Teams recordings.
Screen recordings and product walkthroughs.
Mostly-static slide decks captured as video.

It works less well for:

Handheld camera footage.
Highly animated videos.
Videos where slides are tiny or blurred.
Recordings where the presenter covers the slide content.

Try it

The browser tool is here:

https://video2any.com

The open-source detector is here:

https://github.com/larry-xue/video-slide-extractor

I also started a small curated list of related tools and libraries:

https://github.com/larry-xue/awesome-video-to-slides

I would be especially interested in feedback from people who deal with long lecture recordings, internal training videos, or webinar archives. The core question is simple: is the extracted deck good enough to replace scrubbing through the video?

Top comments (6)

Pratik sharma • Jul 5

This is a great idea. Nice work with the application. Looks fantastic.

Idan Bakal • Jul 5

wow it is great to see developers like you

Frank • Jul 4

How did you handle video encoding and decoding in the browser? Was it a challenge? I'd love to swap ideas on this, following for more content on video processing.

larry • Jul 5 • Edited

Thanks! Funny thing — the encoding side turned out to be a non-issue, because I never re-encode video. The output is just still frames assembled into PPTX/PDF, so the whole problem collapses to decoding + picking the right frames. Decoding is where all the fun (and pain) was.

I ended up with a three-tier ladder that degrades gracefully:

WebCodecs VideoDecoder — the fast path. The catch: WebCodecs won't take a file, it wants encoded chunks. So you have to demux the container yourself and feed it EncodedVideoChunks. I hand-wrote MP4 (ISO-BMFF) and WebM/MKV demuxers instead of pulling in ffmpeg.wasm — parsing the sample tables (stsz/stco/stss…), rebuilding keyframe→target GOP chains, adding margin for B-frame reordering. That was the hard but satisfying part.
An HTML video element plus requestVideoFrameCallback at 16× playback when WebCodecs/demux isn't available.
Canvas seek loop as the universal last resort.

Two gotchas if you go down this road:

Codec support is genuinely fragmented — lean on VideoDecoder.isConfigSupported() and be ready to drop a tier. (H.264 needs its avcC box passed as the decoder description.)
My own MediaRecorder output bit me — WebM with unknown-size clusters and no Cues, so the demuxer has to handle unknown-length EBML elements.
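
To make the unknown-size case concrete, here is a minimal sketch of reading an EBML element size field. In EBML the number of leading zero bits in the first byte gives the field's length, and a size whose value bits are all ones means "unknown". `readElementSize` is a hypothetical name and this is not the demuxer from the project.

```javascript
// Hypothetical helper: read an EBML element data size starting at `offset`.
// Returns the field length in bytes and the size, or size: null if unknown.
function readElementSize(bytes, offset) {
  const first = bytes[offset];
  if (first === undefined || first === 0) throw new Error('invalid EBML size field');
  // Field length = number of leading zero bits in the first byte + 1.
  let length = 1;
  let mask = 0x80;
  while ((first & mask) === 0) { mask >>= 1; length++; }
  if (offset + length > bytes.length) throw new Error('truncated EBML size field');
  let value = first & (mask - 1);        // drop the length marker bit
  let allOnes = value === mask - 1;      // are all value bits set so far?
  for (let i = 1; i < length; i++) {
    const b = bytes[offset + i];
    if (b !== 0xff) allOnes = false;
    value = value * 256 + b;             // multiply to avoid 32-bit overflow
  }
  // Note: 8-byte sizes above 2^53 would lose precision as a Number.
  return { length, size: allOnes ? null : value };
}
```

When the size comes back as null, a demuxer has to find the end of the element some other way, typically by parsing children until it meets an element ID that cannot belong to the current parent.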

One more that mattered for memory: scan MP4 boxes header-only and slice the File on demand, so a multi-GB video never fully lands in RAM.
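
A header-only scan can be sketched like this. Every ISO-BMFF box starts with a 32-bit big-endian size and a four-character type, so the scanner can hop from box to box without reading payloads. `scanBoxHeaders` is a hypothetical name, and for simplicity the sketch walks an in-memory ArrayBuffer. With a real File, the same loop would read only the header bytes through `file.slice(offset, offset + 16).arrayBuffer()` and then jump ahead by the box size.

```javascript
// Hypothetical helper: list top-level boxes in an ISO-BMFF buffer by reading
// only each box header (size + four-character type), never the payload.
function scanBoxHeaders(buffer) {
  const view = new DataView(buffer);
  const boxes = [];
  let offset = 0;
  while (offset + 8 <= view.byteLength) {
    let size = view.getUint32(offset); // DataView reads big-endian by default
    const type = String.fromCharCode(
      view.getUint8(offset + 4), view.getUint8(offset + 5),
      view.getUint8(offset + 6), view.getUint8(offset + 7));
    let headerSize = 8;
    if (size === 1) {
      // A size of 1 means a 64-bit "largesize" follows the type.
      if (offset + 16 > view.byteLength) break;
      size = Number(view.getBigUint64(offset + 8));
      headerSize = 16;
    } else if (size === 0) {
      // A size of 0 means the box extends to the end of the file.
      size = view.byteLength - offset;
    }
    if (size < headerSize) break; // malformed box: stop instead of looping forever
    boxes.push({ type, offset, size, headerSize });
    offset += size;
  }
  return boxes;
}
```

The large mdat payload is skipped entirely here. Only the moov box needs to be read in full to get at the sample tables.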

Happy to go deeper on any of these — the demux path is the most reusable bit. What are you building?
