Pritom Mazumdar

How I Built Video Token Optimization for Vision LLMs: Cutting Costs 13-45% with Frame Dedup + Scene Detection

A few weeks ago I launched Token0 -- an open-source proxy that optimizes images before they hit vision LLMs like GPT-4o, Claude, and Ollama models. The reception was good, so I kept building.

The most requested feature was video. If images are expensive, video is brutal -- every second at 30fps is 30 images. This post covers how I built the video optimization pipeline, what I learned benchmarking it across 5 models, and the model-aware edge case that nearly broke everything.


The Problem with Naive Video

Most apps that analyze video do one of two things:

  1. Extract frames at 1fps and send every one of them
  2. Send a handful of manually selected keyframes

Both approaches waste tokens in predictable ways. At 1fps on a 60-second product demo video:

  • You get 60 frames
  • Frames 1-29 of the same talking head are near-identical (Hamming distance < 10 between perceptual hashes)
  • The only frames with unique information are at scene transitions

You're paying for 60 images when 8-12 contain all the information.


The Pipeline: 4 Layers

Token0's video optimization runs in four stages, each optional and composable:

Layer 1: Frame Extraction

OpenCV extracts frames at 1fps (configurable). A 60s video at 30fps → 60 frames. Hard cap at 32 frames sent to the LLM.

import cv2
from PIL import Image

def extract_frames(video_path, fps=1.0, max_frames=32):
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frame_interval = max(1, int(video_fps / fps))
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_interval == 0:
            # keep every frame_interval-th frame as a (timestamp, PIL image) pair
            frames.append((idx / video_fps, Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))))
        idx += 1
    cap.release()
    return frames

Layer 2: QJL Perceptual Hash Deduplication

This is the core insight. I reused the same QJL (Quantized Johnson-Lindenstrauss) hash infrastructure I built for the image cache:

  1. Compute 256-bit perceptual hash of each frame (dhash on 16x16 grayscale)
  2. Compress to 128-bit binary signature using a random JL projection matrix
  3. Compute Hamming distance between consecutive frames
  4. If distance < 12, drop the frame (near-duplicate)

DEDUP_HAMMING_THRESHOLD = 12  # tighter than cache (consecutive frames are very similar)

def deduplicate_frames(frames, hamming_threshold=DEDUP_HAMMING_THRESHOLD):
    kept = [frames[0]]
    prev_sig = _jl_compress(_image_hash(frames[0][1]))

    for timestamp, frame in frames[1:]:
        sig = _jl_compress(_image_hash(frame))
        dist = _hamming_distance(prev_sig, sig)
        if dist > hamming_threshold:
            kept.append((timestamp, frame))
            prev_sig = sig
    return kept

On a document scanning video (invoice + receipt + screenshot on screen), this collapsed 15 consecutive near-duplicate frames down to 3 unique ones.

Layer 3: Scene Change Detection

Pixel-level diff between consecutive frames (160x120 downsampled, mean absolute difference). Frames above the threshold (15.0 mean pixel diff) are kept as scene boundaries.

def detect_scene_changes(frames, threshold=15.0):
    kept = [frames[0]]
    for i in range(1, len(frames)):
        prev_arr = np.array(frames[i-1][1].resize((160, 120))).astype(np.float32)
        curr_arr = np.array(frames[i][1].resize((160, 120))).astype(np.float32)
        diff = np.mean(np.abs(curr_arr - prev_arr))
        if diff > threshold:
            kept.append(frames[i])
    return kept

Layer 4: CLIP Scoring (optional)

If sentence-transformers is installed, Token0 scores each remaining frame against the user's prompt using CLIP (ViT-B/32) and keeps the top-K most relevant. Code is wired in but CLIP is an optional dependency -- most deployments skip this and the first three layers are already effective.
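The selection step itself is simple once you have scores. A sketch with a pluggable `scorer` callable (my abstraction, not Token0's API; with sentence-transformers installed, `scorer` would embed the prompt and frames with CLIP ViT-B/32 and return cosine similarities):

```python
def clip_top_k(frames, prompt, scorer, k=8):
    """Keep the k frames most relevant to the prompt, preserving time order.
    frames: list of (timestamp, image); scorer(prompt, images) -> list[float]."""
    scores = scorer(prompt, [img for _, img in frames])
    top = sorted(range(len(frames)), key=lambda i: -scores[i])[:k]
    return [frames[i] for i in sorted(top)]  # re-sort indices to restore chronology
```

Restoring chronological order matters: the LLM reads the keyframes as a sequence, so relevance ranking should only decide *which* frames survive, not their order.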


Each Keyframe Goes Through the Full Image Pipeline

After frame selection, every keyframe runs through the existing image optimization stack:

  • Smart resize (downscale to provider max)
  • OCR routing (if the frame is text-heavy)
  • JPEG recompression
  • Prompt-aware detail mode
  • Tile-optimized resize

This means you get compounding savings: fewer frames and each frame is smaller.
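Stitched together, the whole pipeline is a fold over filter stages. A hypothetical sketch (`optimize_video` and the pluggable `stages`/`per_frame` arguments are illustrative, not Token0's actual API):

```python
def optimize_video(video_path, stages, per_frame):
    """stages[0] extracts (timestamp, frame) pairs from the path; each later
    stage filters the list; per_frame is the image-optimization stack applied
    to every surviving keyframe."""
    frames = stages[0](video_path)
    for stage in stages[1:]:  # dedup, scene detection, optional CLIP scoring
        frames = stage(frames)
    return [(t, per_frame(f)) for t, f in frames]
```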


Benchmark Results

I tested against 5 Ollama vision models using 3 videos (product showcase, document montage, mixed content). Naive baseline = all frames at 1fps sent raw. Token0 = full pipeline.

Model           Naive Tokens   Token0 Tokens   Savings
gemma3:4b             14,706           8,081     45.0%
llava:7b              15,731          12,845     18.3%
llava-llama3          15,658          12,789     18.3%
minicpm-v              7,428           6,447     13.2%
moondream             12,288          11,714      4.7%

Why the spread? Gemma3 uses a high-resolution image encoder, so each dropped frame removes a lot of tokens and savings reach 45%. Moondream uses a tiny encoder (~50 tokens/frame), so frame dedup has much less absolute impact even when it removes the same number of frames.

GPT-4o extrapolation (using OpenAI's published tile formula):

60s video, 30fps → 1fps = 60 frames → dedup to ~10 keyframes:

  • Naive: 60 × 425 tokens = 25,500 tokens (~$0.064/video)
  • Token0: 10 × 425 = 4,250 tokens (~$0.011/video)
  • ~83% savings per video

At 10K videos/day: $19,125/mo → $3,188/mo.
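The extrapolation uses OpenAI's published tile formula for GPT-4o image inputs: 85 base tokens plus 170 per 512px tile, after scaling the image to fit 2048x2048 and shrinking the short side to 768. The 425 tokens/frame above corresponds to a 2-tile frame:

```python
import math

def gpt4o_image_tokens(width, height, detail="high"):
    if detail == "low":
        return 85  # low detail is a flat 85 tokens regardless of size
    # scale to fit within 2048x2048, then shrink the short side to 768
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

# e.g. a 1024x512 keyframe -> 2 tiles -> 425 tokens
```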


The Edge Case That Nearly Broke Everything

While benchmarking, I discovered that llama3.2-vision was showing -124% savings (negative -- Token0 was making it worse).

The root cause was two bugs stacked on top of each other:

Bug 1: Provider detection miss

get_provider_from_model() didn't include llama3.2-vision, so it fell through to the "openai" default. OCR routing was then skipped because it's only enabled for models where image tokens > OCR text tokens -- but with the wrong provider, the estimate formula was wrong.

Fix: explicitly add llama3.2-vision, llama3.2, gemma3, granite3.2, qwen2.5vl, qwen3-vl to the Ollama model list.
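A sketch of what that fix could look like. The function body and prefix handling here are illustrative; only the model names come from the actual fix:

```python
OLLAMA_MODEL_PREFIXES = ("llava", "moondream", "minicpm", "llama3.2-vision",
                         "llama3.2", "gemma3", "granite3.2", "qwen2.5vl", "qwen3-vl")

def get_provider_from_model(model: str) -> str:
    name = model.lower()
    if name.startswith(("gpt-", "o1")):  # cloud models keep the openai path
        return "openai"
    if name.startswith(OLLAMA_MODEL_PREFIXES):
        return "ollama"
    return "openai"  # the fall-through default that caused the bug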

Bug 2: Ultra-efficient encoders break the OCR savings assumption

llama3.2-vision uses ~8-27 tokens per image natively. The standard OCR flow routes text-heavy images to EasyOCR and returns extracted text (~200-700 tokens depending on content). For a model that uses 15 tokens/image, returning 300 tokens of OCR text is 20x more expensive, not cheaper.

The fix was a named allowlist of ultra-efficient models that skip OCR entirely:

_ultra_efficient_models = ("llama3.2-vision", "llama3.2")
is_ultra_efficient = any(k in model.lower() for k in _ultra_efficient_models)

if provider == "ollama" and is_ultra_efficient:
    # Skip OCR -- image tokens are already cheaper than text extraction
    plan.reasons.append(f"skip OCR: ultra-efficient encoder (~{estimated_image_tokens} tokens < OCR cost)")

After both fixes: llama3.2-vision went from -124% to 0% (correct passthrough). gemma3 stayed at 24.8% (was briefly broken by an intermediate fix attempt). granite3.2-vision: 53.1%.

The lesson: optimization strategies that help high-token-count models hurt ultra-efficient ones. You need model-aware routing, not just image-aware routing.


How to Use Video in Token0

pip install token0
token0 serve

from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-...",
)

with open("product_demo.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What happens in this video?"},
            {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}}
        ]
    }],
    extra_headers={"X-Provider-Key": "sk-..."}
)

# Token0 extracted keyframes, deduped, optimized, forwarded
# response.token0.tokens_saved, optimizations_applied, etc.

Already using LiteLLM? Video works through the hook too:

import litellm
from token0.litellm_hook import Token0Hook

litellm.callbacks = [Token0Hook()]
# video_url content type automatically handled

What's Next

  • CLIP scoring (Layer 4): score each frame against the user's prompt and keep the top-K most relevant. Code is wired, needs pip install sentence-transformers clip to activate.
  • Saliency-based ROI cropping: detect what region the prompt is asking about, crop and send only that. "What's the total?" on an invoice → crop to bottom-right only.
  • Adaptive quality escalation: send low-detail first (85 tokens), retry at high-detail only if the response shows uncertainty. Happy path (60-70% of cases) = massive savings.
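For the adaptive-escalation idea, the control flow could be as simple as this hypothetical sketch (`ask` and the uncertainty phrases are placeholders, not Token0 code):

```python
def analyze_with_escalation(ask, frame_b64,
                            uncertain=("can't tell", "unclear", "not sure")):
    # first pass at low detail (85 tokens); escalate only on an uncertain answer
    answer = ask(frame_b64, detail="low")
    if any(phrase in answer.lower() for phrase in uncertain):
        answer = ask(frame_b64, detail="high")
    return answer
```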

Apache 2.0. pip install token0.

GitHub: github.com/Pritom14/token0

If you're processing video through vision LLMs and have benchmarks on your own models, I'd love to compare notes. Especially curious about Gemini 2.5 Pro's native video support vs frame-by-frame through Token0.

