Introduction
Video content dominates today's digital landscape, yet accessibility through captions remains underutilized. Traditional approaches rely on expensive cloud APIs, compromise data privacy, or demand tedious manual work. This guide explores building broadcast-quality captions locally using open-source AI—keeping your sensitive content on your own hardware while eliminating recurring costs.
The Rise of Local LLM Video Captioning: A Paradigm Shift
Cloud-based automatic speech recognition (ASR) services like Google Cloud Speech-to-Text, Azure AI Speech, and AWS Transcribe deliver solid results but carry significant expenses—typically $0.016 to $0.024 per minute, which quickly escalates for long-form content creators or businesses processing many hours of video weekly. A team transcribing 100 hours of video monthly could spend roughly $100-$145—well over $1,200 annually.
Beyond costs, data privacy concerns are paramount. When uploading to cloud APIs, you entrust providers with intellectual property and confidential information. Organizations in healthcare, finance, legal, and government sectors often cannot accept this risk.
Open-source innovations have democratized powerful AI. Projects like OpenAI's Whisper, Llama 2, Mistral, and Gemma—paired with inference engines like llama.cpp and ctransformers—now enable running sophisticated models on consumer hardware with performance matching or exceeding cloud alternatives.
The advantages of local processing are compelling:
- Data Security: Content stays on your machine
- Compliance: Meets strict governance requirements
- Connectivity Independence: Works offline
- Predictable Performance: No cloud bottlenecks
How Open Source AI Captioning Works Locally
Building a complete solution involves multiple specialized components working together:
1. Automatic Speech Recognition (ASR)
Whisper, OpenAI's open-source ASR model, excels at converting spoken words to text. Different sizes of Whisper models (tiny, base, small, medium, large-v2, large-v3) offer trade-offs between accuracy and computational cost.
The model uses a transformer encoder-decoder architecture, processing audio waveforms to output text. For example, large-v3 can achieve word error rates (WER) as low as 3-4% on clean audio, which is competitive with many cloud offerings.
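For intuition, WER counts the substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. A minimal sketch using a standard Levenshtein alignment (the function name is illustrative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

One wrong word out of six reference words yields a WER of ~16.7%, which puts the 3-4% figure above in perspective: roughly one error per 25-30 words.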
2. Audio Extraction & Preprocessing
Before ASR processing, extract audio from video files using tools like FFmpeg. This step handles format conversion, audio extraction, and level normalization.
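FFmpeg can also be driven directly from Python for this step. A sketch that resamples to mono 16 kHz WAV—the input format Whisper models expect—using standard FFmpeg flags (assumes `ffmpeg` is on your PATH):

```python
import subprocess

def build_ffmpeg_cmd(video_path: str, audio_path: str) -> list:
    """Build an FFmpeg command that extracts mono 16 kHz 16-bit PCM audio."""
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-vn",                 # drop the video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz (Whisper's expected rate)
        "-c:a", "pcm_s16le",   # 16-bit PCM WAV
        audio_path,
    ]

def extract_audio_ffmpeg(video_path: str, audio_path: str) -> None:
    """Run the extraction; raises CalledProcessError on FFmpeg failure."""
    subprocess.run(build_ffmpeg_cmd(video_path, audio_path), check=True)
```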
3. Large Language Models for Refinement
While Whisper excels at transcription, raw output often lacks proper punctuation and capitalization. LLMs like Mistral, Llama 2, or Gemma can enhance transcripts by:
- Punctuation & Capitalization: Adding proper sentence structure
- Speaker Identification: Improving speaker labels from diarization
- Summarization: Extracting key themes and metadata
- Error Correction: Fixing common ASR mistakes using context
- Translation: Converting to different languages
4. Inference Engines
llama.cpp with Python bindings enables running various LLMs in GGUF format. GGUF (GPT-Generated Unified Format) models are quantized versions of larger models, reducing their size and memory footprint without significant performance degradation.
Example: A Mistral 7B model (13GB in full float16 precision) becomes ~4.5GB in Q4_K_M quantized format, runnable on laptops with 8-16GB RAM.
Hugging Face transformers library provides unified APIs for loading pre-trained models, handling GPU acceleration automatically.
Workflow Overview
Video → Audio Extraction → Whisper Transcription → LLM Refinement → SRT/VTT Captions
Practical Implementation Guide
Prerequisites
- Python 3.8+ for development
- FFmpeg for multimedia processing
- GGUF Model: Download from Hugging Face
- Whisper Model: Auto-downloaded by the `transformers` library
Step 1: Environment Setup
```bash
# Create virtual environment
python -m venv llm_captioning_env

# Activate (macOS/Linux)
source llm_captioning_env/bin/activate

# Install dependencies (srt is used in Step 5)
pip install "transformers[torch]" accelerate soundfile moviepy llama-cpp-python srt
```
Step 2: Audio Extraction
```python
from moviepy.editor import VideoFileClip
import os

def extract_audio(video_path: str, audio_output_path: str):
    """Extracts the audio track from a video file."""
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")
    print(f"Extracting audio from {video_path}...")
    video_clip = VideoFileClip(video_path)
    video_clip.audio.write_audiofile(audio_output_path)
    video_clip.close()  # release file handles
    print(f"Audio extracted to {audio_output_path}")
```
Step 3: Whisper Transcription
```python
from transformers import pipeline
import torch

def transcribe_audio_whisper(audio_path: str,
                             model_name: str = "openai/whisper-large-v3"):
    """Transcribe audio using a Whisper model."""
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device} for Whisper transcription.")
    asr_pipeline = pipeline(
        "automatic-speech-recognition",
        model=model_name,
        torch_dtype=torch.float16 if device == "cuda:0" else torch.float32,
        device=device,
    )
    print(f"Transcribing {audio_path}... This may take a while.")
    transcription = asr_pipeline(
        audio_path,
        return_timestamps=True,  # required for audio longer than 30 seconds
        generate_kwargs={"task": "transcribe", "language": "en"},
    )
    # Returns a dict: 'text' holds the full transcript,
    # 'chunks' holds timestamped segments for caption timing.
    return transcription
```
Model Selection Notes:
- `small`/`medium`: Faster, acceptable for clean audio
- `large-v3`: Best accuracy (recommended)
Step 4: LLM Refinement
```python
import os
import torch
from llama_cpp import Llama

def refine_transcript_with_llm(raw_transcript: str, llm_model_path: str):
    """Refine a transcript using a local LLM."""
    if not os.path.exists(llm_model_path):
        raise FileNotFoundError(f"LLM model not found: {llm_model_path}")
    print(f"Loading local LLM from {llm_model_path}...")
    llm = Llama(
        model_path=llm_model_path,
        n_ctx=4096,
        n_gpu_layers=-1 if torch.cuda.is_available() else 0,
        verbose=False,
    )
    prompt = f"""[INST] You are an expert copy editor. Correct the following
raw transcript for punctuation and capitalization. Do not add or remove content,
only fix formatting.

Raw Transcript:
{raw_transcript}

Corrected Transcript:
[/INST]"""
    output = llm(
        prompt,
        # character count comfortably exceeds the token count of the reply
        max_tokens=len(raw_transcript) + 100,
        stop=["</s>"],
        echo=False,
        temperature=0.1,
    )
    return output["choices"][0]["text"].strip()
```
Step 5: Generate SRT Captions
```python
import srt
from datetime import timedelta

def create_srt_from_segments(segments: list, output_srt_path: str):
    """Create an SRT file from text segments with timings."""
    print(f"Generating SRT file: {output_srt_path}")
    subs = []
    for i, segment in enumerate(segments):
        subs.append(srt.Subtitle(
            index=i + 1,
            start=timedelta(seconds=segment['start']),
            end=timedelta(seconds=segment['end']),
            content=segment['text'].strip(),
        ))
    with open(output_srt_path, "w", encoding="utf-8") as f:
        f.write(srt.compose(subs))
    print("SRT file created successfully.")
```
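To connect Step 3 to Step 5, the Whisper output has to be reshaped into the segment dicts the SRT function expects. Assuming the transformers ASR pipeline was run with `return_timestamps=True` (which yields a `chunks` list of `{'timestamp': (start, end), 'text': ...}` entries), a small converter might look like:

```python
def whisper_chunks_to_segments(chunks: list) -> list:
    """Convert transformers pipeline chunks into {'start', 'end', 'text'}
    dicts suitable for create_srt_from_segments."""
    segments = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end is None:
            # The final chunk can lack an end timestamp; pad with a short
            # fixed duration (assumption—tune to your content).
            end = start + 5.0
        segments.append({"start": start, "end": end, "text": chunk["text"]})
    return segments
```

With this glue in place the full pipeline is: extract audio, transcribe, convert chunks to segments, then write the SRT file.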
Advanced Techniques & Optimization
Model Selection & Benchmarks
Whisper Variants:
- `tiny`/`base`/`small`: Fastest, lower accuracy (~70MB-500MB)
- `medium`: Balanced speed/accuracy (~1.5GB)
- `large-v2`/`large-v3`: Highest accuracy, most resource-intensive (~3GB)
GGUF Quantization Levels:
- Q4_K_M: Best balance—4.5GB for Mistral 7B with minimal quality loss
- Q8_0: Higher fidelity but larger files and increased VRAM needs
- Q2_K/Q3_K: Smaller but noticeably reduced quality
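The quantization levels above can be sanity-checked with a back-of-the-envelope size estimate. The bits-per-weight figures below are approximate community values, not exact file sizes (real GGUF files add metadata overhead):

```python
# Approximate effective bits per weight for common GGUF quantization schemes.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}

def estimate_gguf_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough on-disk size estimate: parameters x bits-per-weight / 8."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billion * 1e9 * bits / 8 / 1e9
```

For a 7.24B-parameter model, Q4_K_M works out to roughly 4.4GB—consistent with the ~4.5GB Mistral 7B figure cited earlier.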
Hardware Requirements
GPU Setup (Recommended):
- NVIDIA RTX 3060 (12GB VRAM) or better
- 16GB+ system RAM (32GB preferred)
- Time for large-v3 Whisper: ~5-10 minutes per 60 minutes audio
CPU-Only Setup:
- Modern multi-core processor (Intel i7/i9 or AMD Ryzen 7/9)
- 32GB RAM minimum
- Time for large-v3 Whisper: ~1+ hour per 60 minutes audio (5-10x slower)
Speaker Diarization
Identifying individual speakers requires specialized tools. pyannote-audio is the leading open-source solution. Processing order:
1. Extract audio
2. Run diarization (generates speaker timestamps)
3. Feed segments to Whisper
4. Use LLM to refine speaker labels with context
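Once diarization produces speaker turns, they must be merged with the transcript. A simple overlap-based alignment sketch—the data shapes here are assumptions: pyannote-audio's actual output is an annotation object you would first flatten into `(start, end, label)` tuples:

```python
def assign_speakers(segments: list, speaker_turns: list) -> list:
    """Label each transcript segment with the speaker whose diarization
    turn overlaps it the most. speaker_turns: [(start, end, label), ...]."""
    labeled = []
    for seg in segments:
        best_label, best_overlap = "UNKNOWN", 0.0
        for start, end, label in speaker_turns:
            # Length of the intersection between segment and speaker turn
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        labeled.append({**seg, "speaker": best_label})
    return labeled
```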
Batch Processing & Optimization
- Parallel Processing: Use `multiprocessing` for concurrent video handling
- Model Caching: Reuse loaded instances rather than reloading for each segment
- Memory Management: Monitor peak usage; adjust context window sizes as needed
Error Handling
Poor Audio Quality: Apply preprocessing with pydub or librosa for noise reduction.
Multilingual Content: Omit the `language` argument (or pass `language=None`) so Whisper auto-detects the spoken language—it handles multiple languages seamlessly.
Context Limitations: For long transcripts, process in chunks with overlap.
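A minimal helper for the chunking point above—window and overlap sizes are illustrative and should be tuned to your model's context length:

```python
def chunk_text(text: str, max_words: int = 600, overlap_words: int = 50) -> list:
    """Split a long transcript into overlapping word windows so each chunk
    fits the LLM context; the overlap preserves continuity at boundaries."""
    words = text.split()
    chunks, step = [], max_words - overlap_words
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break
    return chunks
```

Each chunk is refined independently, then the outputs are stitched back together, dropping the duplicated overlap region.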
Cost Analysis: Local vs. Cloud
| | Cloud Services | Local LLM |
|---|---|---|
| ASR cost | ~$0.016/minute | $0 |
| 100 hrs/month | ~$100 | ~$0 |
| 1000 hrs/month | ~$1,000 | ~$0 |
| Hardware | None | $500-$1,500 (one-time) |
| Break-even | N/A | ~15 months at 100 hrs/month |
For high-volume operations, local processing delivers compelling ROI.
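The break-even figure in the table follows from simple arithmetic, sketched here using the table's own assumptions (rate and hardware cost):

```python
def break_even_months(hardware_cost: float, hours_per_month: float,
                      cloud_rate_per_min: float = 0.016) -> float:
    """Months until a one-time hardware cost equals cumulative cloud ASR fees."""
    monthly_cloud_cost = hours_per_month * 60 * cloud_rate_per_min
    return hardware_cost / monthly_cloud_cost
```

At $1,500 of hardware and 100 hours per month, this gives ~15.6 months; at 1,000 hours per month the payback drops to under two months.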
Why Local LLMs Matter
- Privacy: Sensitive content remains under your control
- Cost-Effectiveness: Eliminate recurring API fees after initial investment
- Customization: Fine-tune models for specific domains, accents, or jargon
- Offline Capability: Works without internet connectivity
- Performance: Avoids cloud upload/download bottlenecks for large batches
Frequently Asked Questions
Q: What hardware do I need?
A dedicated NVIDIA GPU (RTX 3060 12GB+) is recommended for optimal performance with Whisper large-v3 and 7B+ LLMs. CPU-only systems require modern multi-core processors and 32GB+ RAM but process 5-10x slower.
Q: How accurate are local models compared to cloud APIs?
Open-source models like Whisper large-v3 offer accuracy highly competitive with cloud-based ASR services. Combined with LLM refinement, results rival commercial solutions.
Q: Can I use this for real-time captioning?
True real-time captioning with large-v3 and LLM refinement is challenging on consumer hardware. Smaller models or dedicated streaming ASR (NVIDIA NeMo) work better for live applications.
Q: What's the best LLM for transcript refinement?
Mistral 7B Instruct and Gemma 7B Instruct excel at punctuation, capitalization, and grammar corrections.
Conclusion
Building local LLM video captioning tools is now practical and accessible. By combining Whisper for transcription with local LLMs for refinement, you achieve broadcast-quality results while maintaining complete data privacy. The financial and technical advantages over cloud-dependent approaches are substantial and growing as open-source models improve.
Start experimenting with different models, optimize prompts for your use case, and discover the potential of truly private, locally-controlled AI video processing.
Originally published at buildzn.com