Yurukusa

Posted on Feb 23 • Edited on Jun 29

I Built a CLI That Searches 3 Hours of Twitch VODs in 7 Minutes

#claudecode #opensource #productivity #python

I watch a lot of streams. And every few days, I find myself doing the same thing: scrubbing through a 3-hour VOD at 2x speed, trying to find the 30 seconds where the streamer mentioned a specific game.

That's 90 minutes of my life to find half a minute of content.

So I built a CLI tool that solves this in 7 minutes. It downloads the VOD audio, transcribes it with Whisper on my local GPU, and lets me search by keyword. No cloud. No API keys. Just timestamps.

The tool is called VOD Search, and it's open source.

The Problem

Finding a specific moment in a stream archive is surprisingly painful. You have two options:

Rewatch the whole thing. A 3-hour stream at 2x speed still takes 90 minutes of your time.
Use a cloud transcription service. Upload the audio to a third-party server. Pay per minute. Hope they handle the data responsibly.

Both options are overkill when all you want to know is "did the streamer mention Elden Ring, and when?"

I wanted something closer to grep for stream audio. Type a keyword, get a timestamp, move on.

What VOD Search Does

One command. Timestamped results.

vod-search https://www.twitch.tv/CHANNEL --keyword "elden ring"

This downloads the audio from the channel's recent VODs, transcribes them using Whisper on your local GPU, and searches the transcripts for your keyword. The output looks like this:

============================================================
Found 3 matches for: elden ring
============================================================

--- v2198374561_full.txt ---
  [1:23:40] yeah I was thinking about going back to
  >>> [1:23:45] elden ring actually, the DLC is pretty good
  [1:23:51] I haven't finished it yet though

  [2:05:12] chat is asking about
  >>> [2:05:15] elden ring again, I'll probably stream it next week
  [2:05:20] we'll see how the schedule works out

--- v2197650832_full.txt ---
  [0:45:01] someone in chat recommended
  >>> [0:45:05] elden ring as a good game for stream
  [0:45:10] I've heard that a lot lately

Each match shows the timestamp and a few lines of surrounding context, so you can jump straight to the right moment in the VOD.

How It Works

The pipeline is simple: three well-known tools chained together.

Channel URL → yt-dlp (audio download) → faster-whisper (transcribe) → regex search
                                                ↑
                                         Runs on your local GPU

Step 1: Download audio with yt-dlp

yt-dlp handles the download. It supports Twitch, YouTube, Niconico, and over 1,000 other sites. VOD Search extracts just the audio track as WAV -- no video needed, which keeps the download fast and the file sizes manageable.
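As a rough illustration of this step, here is a minimal sketch of an audio-only download that shells out to the yt-dlp CLI. It is not taken from the VOD Search source: the function names, the output directory, and the use of --playlist-end to cap the number of VODs are assumptions made for the example. The yt-dlp flags themselves (--extract-audio, --audio-format, --playlist-end, --output) are standard ones, and the WAV conversion needs ffmpeg.

```python
import subprocess


def build_ytdlp_command(url, out_dir="audio", limit=5):
    """Build a yt-dlp command line that keeps only the audio track as WAV.

    out_dir and limit are illustrative defaults, not VOD Search's own.
    """
    return [
        "yt-dlp",
        "--extract-audio",                 # discard the video stream
        "--audio-format", "wav",           # convert to WAV via ffmpeg
        "--playlist-end", str(limit),      # stop after the first N listed entries
        "--output", f"{out_dir}/%(id)s.%(ext)s",  # name files by video id
        url,
    ]


def download_audio(url, out_dir="audio", limit=5):
    """Run yt-dlp; raises subprocess.CalledProcessError if it fails."""
    subprocess.run(build_ytdlp_command(url, out_dir, limit), check=True)
```

Calling download_audio("https://www.twitch.tv/CHANNEL") would then leave one WAV file per VOD in the audio directory, ready for transcription.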

Step 2: Transcribe with faster-whisper

faster-whisper is a reimplementation of OpenAI's Whisper built on CTranslate2. It runs the same models but significantly faster, especially on GPU. The tool loads the model once and processes all VODs sequentially, so there's no repeated startup overhead.

Each segment gets a timestamp. The output is saved as a plain text file:

[0:00:12] hey everyone welcome back to the stream
[0:00:18] today we're going to be playing some...
[0:01:45] let me check the chat real quick
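
A transcript in this format could be produced along the following lines. This is a sketch, not the script's actual code: the helper names are made up for illustration, and language="ja" simply mirrors the Japanese default mentioned later in the article. The faster-whisper calls used (WhisperModel, transcribe, and the start and text fields on each segment) are part of that library's public API.

```python
def format_timestamp(seconds):
    """Render a number of seconds as H:MM:SS."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def segments_to_lines(segments):
    """Turn (start_seconds, text) pairs into '[H:MM:SS] text' lines."""
    return [f"[{format_timestamp(start)}] {text.strip()}" for start, text in segments]


def transcribe_to_file(audio_path, out_path, model_size="medium", language="ja"):
    """Transcribe one audio file and save a timestamped plain-text transcript."""
    # Third-party dependency, imported lazily: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, language=language)
    pairs = ((seg.start, seg.text) for seg in segments)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(segments_to_lines(pairs)) + "\n")
```

Keeping the formatting helpers separate from the model call makes them easy to check without a GPU.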

Step 3: Search with regex

Once transcribed, searching is just regex over text files. This means:

New keyword? Instant results. No re-downloading, no re-transcribing.
Multiple keywords? Comma-separated: --keyword "elden ring,dark souls,fromsoft"
Need context? --context 3 shows 3 lines before and after each match.
Building something on top? --json gives you structured output for piping into other tools.
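
To make the search step concrete, here is a minimal sketch of keyword matching with context over transcript lines. It assumes keywords are treated as literal text (escaped before being compiled into a regex) and matched case-insensitively; the real script may handle patterns differently, and the function name is hypothetical.

```python
import re


def search_transcript(lines, keywords, context=1):
    """Find lines matching any keyword and return them with surrounding context.

    lines:    transcript lines such as '[1:23:45] elden ring actually...'
    keywords: comma-separated string, e.g. 'elden ring,dark souls'
    context:  number of lines to include before and after each match
    Returns a list of (line_index, window_of_lines) tuples.
    """
    terms = [re.escape(k.strip()) for k in keywords.split(",") if k.strip()]
    if not terms:
        return []
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    hits = []
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append((i, lines[start:end]))
    return hits
```

Because this works on lists of strings, the same function can run over any transcript file loaded with read().splitlines(), with no download or GPU involved.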

The whole thing is a single Python script under 350 lines.

Performance

Here are the approximate transcription times I've seen, by hardware:

GPU	1 hour of stream	3 hours of stream
RTX 3080	~2-3 min	~7 min
RTX 3060	~4-6 min	~15 min
CPU only	~30-60 min	~1.5-3 hours

These are measured with the default medium model, which balances accuracy and speed well for most stream audio. You can trade off between the two:

--model tiny is the fastest but least accurate. Good enough for keyword spotting in clear audio where the streamer is speaking directly into a good microphone.
--model medium (default) is the sweet spot for most use cases. Handles background noise and casual speech reasonably well.
--model large is the most accurate but slowest. Worth it for noisy audio, heavy accents, or when you need near-perfect transcription.

The first run downloads the Whisper model (~1.5 GB for medium). After that, it's cached locally and loads in a few seconds.

One detail worth noting: VOD Search loads the model once and reuses it across all VODs in a batch. If you're processing 10 VODs, the model initialization cost is paid only once. This matters because model loading can take 5-10 seconds -- negligible for a single VOD, but it would add up if you loaded it fresh for each one.
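
The load-once pattern can be sketched in a few lines. This is an illustration under assumptions rather than the script's code: transcribe_batch, load_model, and transcribe_one are hypothetical names, and the model is passed in through callables so the idea can be shown without a GPU.

```python
def transcribe_batch(audio_paths, load_model, transcribe_one):
    """Load the model a single time, then reuse it for every file.

    load_model:     zero-argument callable returning a model object
    transcribe_one: callable taking (model, path) and returning a transcript
    """
    model = load_model()  # the expensive step, paid once per batch
    return {path: transcribe_one(model, path) for path in audio_paths}
```

With faster-whisper, load_model could be something like lambda: WhisperModel("medium", device="cuda"), and transcribe_one a function that calls model.transcribe on the given path.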

Why Local?

I could have built this as a web service that calls the OpenAI Whisper API. It would have been easier. But I had three reasons to keep everything local:

Privacy. Stream audio often contains off-the-cuff remarks, personal details, or things said in the heat of the moment. I didn't want any of that leaving my machine. With VOD Search, nothing gets uploaded anywhere. The audio goes from your disk to your GPU and back. That's it.

Cost. Cloud transcription services charge per minute of audio. If you're processing multiple 3-hour streams per week, that adds up. VOD Search costs nothing to run after the initial setup -- your GPU is already sitting there.

Offline access. Once you've transcribed a set of VODs, the transcripts live on your machine forever. Search them on a plane, on a train, on a network with no internet. The transcripts are plain text files. You can grep them directly if you want.

Language Support

Whisper supports close to 100 languages. VOD Search works with all of them. The default config targets Japanese (since that's what I use it for), but changing the language is a one-line edit in the script. English, Korean, Chinese, Spanish, French -- if Whisper can transcribe it, VOD Search can search it.

Platform Support

Because VOD Search uses yt-dlp under the hood, it works with any site yt-dlp supports. That's over 1,000 platforms:

Twitch -- VODs and past broadcasts
YouTube -- Stream archives, regular videos
Niconico -- Timeshifts and archives
And hundreds more

If yt-dlp can download the audio, VOD Search can transcribe and search it.

Installation

# vod-search (no longer available)
cd vod-search
bash install.sh

The installer checks for Python 3.10+, installs yt-dlp and faster-whisper via pip, and sets up CUDA libraries if you have an NVIDIA GPU. You'll also need ffmpeg installed separately (sudo apt install ffmpeg on Ubuntu, brew install ffmpeg on macOS).
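
If you want to verify the prerequisites yourself before installing, a check along these lines would cover the basics. This is a hypothetical helper, not part of install.sh; it only looks at the Python version and whether ffmpeg and yt-dlp are on your PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Report whether the interpreter and external tools look usable."""
    return {
        "python_ok": sys.version_info >= min_python,            # Python 3.10 or newer
        "ffmpeg_found": shutil.which("ffmpeg") is not None,     # needed for audio conversion
        "yt_dlp_found": shutil.which("yt-dlp") is not None,     # needed for downloads
    }
```

Printing check_prerequisites() gives a quick yes/no summary of what is still missing.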

Usage Examples

# Search a Twitch channel's recent VODs
vod-search https://www.twitch.tv/CHANNEL --keyword "keyword1,keyword2"

# Search YouTube stream archives
vod-search "https://www.youtube.com/@CHANNEL/streams" --keyword "keyword1"

# Only search existing transcripts (skip download + transcription)
vod-search --dir ./transcripts --keyword "keyword1,keyword2" --search-only

# Output as JSON for scripting
vod-search https://www.twitch.tv/CHANNEL --keyword "word" --json

# Limit to last 5 VODs
vod-search https://www.twitch.tv/CHANNEL --keyword "word" --limit 5

Limitations (Honest Ones)

No tool is perfect, and I want to be upfront about where this one falls short:

Transcription accuracy depends on audio quality. Loud background music, sound effects, or multiple people talking at once will degrade accuracy. Whisper is impressive, but it's not magic.
GPU recommended. CPU mode works, but it's 10-30x slower. For regular use, you'll want an NVIDIA GPU.
First run takes extra time. The Whisper model (~1.5 GB for medium) needs to download once. After that, it's cached.
Legal responsibility is on you. Downloading and transcribing streams may violate platform Terms of Service depending on your jurisdiction and use case. VOD Search shows a disclaimer on first run. Use it responsibly.

What I Use It For

My primary use case is finding specific moments in Japanese streams. A streamer mentions a game I'm interested in, and I want to find exactly when -- so I can clip it, reference it, or just rewatch that particular segment.

Before VOD Search, this meant scrubbing through hours of footage manually. Now it's a single command and a 7-minute wait. I run the command, do something else for a few minutes, and come back to a list of timestamps.

I also use the --search-only flag a lot. Once a set of VODs is transcribed, I'll come back days later with a new keyword and search the existing transcripts instantly. No re-downloading, no re-transcribing. The transcripts accumulate into a searchable archive over time -- and since they're just plain text files, they take up almost no disk space compared to the original video.

The --json output has been useful too. I've piped results into scripts that generate a simple HTML page with clickable timestamp links. If you're a clip creator or a fan wiki editor, you could build similar workflows on top of it.
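
A workflow like that might look roughly like this. The shape of the match records (vod_id, timestamp, text) is an assumption made for the example, since the actual --json schema isn't shown here, and the function names are hypothetical. The links use Twitch's ?t=1h23m45s query parameter to start playback at a given time.

```python
import html


def twitch_link(vod_id, timestamp):
    """Build a Twitch VOD URL that starts playback at an H:MM:SS timestamp."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return f"https://www.twitch.tv/videos/{vod_id}?t={hours}h{minutes}m{seconds}s"


def matches_to_html(matches):
    """Render match dicts (assumed keys: vod_id, timestamp, text) as an HTML list."""
    items = []
    for m in matches:
        url = twitch_link(m["vod_id"], m["timestamp"])
        # Escape transcript text so stray < or & cannot break the page
        label = html.escape(f'[{m["timestamp"]}] {m["text"]}')
        items.append(f'<li><a href="{url}">{label}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Writing the returned string into a file gives a page where each timestamp opens the VOD at the matching moment.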

Try It

The code is MIT-licensed and open source:

GitHub: vod-search

If you want the full package with extended features and priority support, a Pro version is available on Gumroad.

VOD Search was built with Claude Code. The pipeline design, code, and this article were developed collaboratively with an AI coding agent.

