Every time you send an image to GPT-4o, Claude, or Gemini, you are paying for vision tokens. And most of them are wasted.
I built Token0: an open-source API proxy that sits between your app and the LLM provider, optimizes every image request automatically, and typically saves 70-99% on vision costs. It is now live on PyPI.
In this post, I will walk through the problem, the seven optimization strategies, the benchmarks, and how to get started in under a minute.
The Problem: Vision Tokens Are Expensive and Poorly Optimized
Text token optimization is a solved problem. Prompt caching, compression, smart routing: the tooling is mature.
But images, the modality that costs 2-5x more per token, have almost no optimization tooling.
Here is what happens today:
Wasted pixels. You send a 4000x3000 photo to Claude. Claude silently downscales it to 1568px max. You paid for the original resolution. Those tokens are gone.
Wrong modality. A screenshot of a document costs ~765 tokens on GPT-4o as an image. The same information extracted as text costs ~30 tokens. That is a 25x markup for identical information.
Wrong detail level. "Classify this image" on GPT-4o uses high-detail mode at 1,105 tokens. Low-detail mode gives the same answer for 85 tokens. A 13x difference that nobody is optimizing for.
Wasted tiles. GPT-4o tiles images into 512x512 blocks. A 1280x720 image creates 6 tiles (1,105 tokens). Resizing to 1024x768 gives 4 tiles (765 tokens). A 31% saving with no meaningful quality loss.
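The tile arithmetic is easy to check yourself. Here is a minimal sketch of the high-detail token calculation that OpenAI documents for GPT-4o (the 85-token base and 170 tokens per tile come from OpenAI's vision pricing docs):

```python
import math

def gpt4o_image_tokens(width: int, height: int) -> int:
    """High-detail vision token estimate for GPT-4o, per OpenAI's docs."""
    # Step 1: the image is scaled to fit inside a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: it is then scaled so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512 px tiles; each costs 170 tokens on top of an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(1024, 1024))  # 4 tiles -> 765 tokens
print(gpt4o_image_tokens(2048, 4096))  # 6 tiles -> 1105 tokens
```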
How Token0 Works
Your App --> Token0 Proxy --> [Analyze -> Classify -> Route -> Transform -> Cache] --> LLM Provider
You change one line -- your base URL -- and Token0 handles everything automatically.
Token0 applies seven optimizations:
1. Smart Resize
Each provider has a maximum resolution it actually processes: Claude caps at 1568px, GPT-4o at 2048px. Token0 downscales to these limits before sending. No quality is lost, because the provider would have downscaled anyway; you just stop paying for the discarded pixels.
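The resize step reduces to a single scale factor. A sketch, with the caps hard-coded from the figures above (the real proxy presumably reads these from per-provider config):

```python
# Longest edge each provider actually processes (values from provider docs).
PROVIDER_MAX_EDGE = {"anthropic": 1568, "openai": 2048}

def fit_within(width: int, height: int, provider: str) -> tuple[int, int]:
    """Downscale so the longest edge fits the provider's cap; never upscale."""
    cap = PROVIDER_MAX_EDGE[provider]
    scale = min(1.0, cap / max(width, height))
    return round(width * scale), round(height * scale)

print(fit_within(4000, 3000, "anthropic"))  # (1568, 1176)
```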
2. OCR Routing
When an image is mostly text (screenshots, receipts, invoices, documents), Token0 extracts the text via OCR and sends that instead. Text tokens cost 10-50x less than vision tokens.
The detection uses a multi-signal heuristic: background uniformity, color variance, horizontal line structure, and edge density. It was validated at 91% accuracy on real-world images, and photos are never falsely OCR-routed.
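One of those signals is simple enough to sketch. The snippet below implements only the edge-density check on a grayscale pixel grid; the threshold values are hypothetical stand-ins, not Token0's actual numbers, and the real heuristic combines all four signals:

```python
def edge_density(gray: list[list[int]], jump: int = 30) -> float:
    """Fraction of horizontally adjacent pixel pairs with a sharp jump.

    Rendered text on a flat background produces many hard transitions;
    natural photos mostly produce gradual ones.
    """
    edges = total = 0
    for row in gray:
        for a, b in zip(row, row[1:]):
            total += 1
            edges += abs(a - b) > jump
    return edges / total if total else 0.0

def looks_like_text(gray: list[list[int]], min_density: float = 0.05) -> bool:
    # Single-signal stand-in for the multi-signal heuristic.
    return edge_density(gray) > min_density

screenshot_like = [[255, 0] * 8] * 8           # hard black-on-white strokes
photo_like = [[i * 4 for i in range(16)]] * 8  # smooth gradient
print(looks_like_text(screenshot_like), looks_like_text(photo_like))  # True False
```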
3. JPEG Recompression
PNG screenshots get converted to optimized JPEG when transparency is not needed. Smaller payload, faster upload, same visual information.
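A sketch of that conversion with Pillow, guarding against images that actually use their alpha channel (the quality setting here is my assumption, not Token0's choice):

```python
from io import BytesIO

from PIL import Image  # pip install Pillow

def png_to_jpeg(png_bytes: bytes, quality: int = 85) -> bytes:
    """Recompress a PNG as JPEG unless it has real transparency."""
    img = Image.open(BytesIO(png_bytes))
    if "A" in img.getbands() and img.getchannel("A").getextrema() != (255, 255):
        return png_bytes  # transparency in use: keep the PNG
    buf = BytesIO()
    img.convert("RGB").save(buf, "JPEG", quality=quality, optimize=True)
    return buf.getvalue()
```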
4. Prompt-Aware Detail Mode
This is the interesting one. Token0 analyzes your prompt, not just the image, to decide the detail level.
"What is in this image?" --> low detail (85 tokens)
"Extract all the text from this receipt" --> high detail (1,105 tokens)
A keyword classifier on the prompt text makes this decision. Simple queries get low-detail mode automatically.
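The classifier can be as small as a keyword set. This is a hypothetical sketch; the post does not publish Token0's actual keyword list:

```python
# Hypothetical keywords signalling that fine detail is required.
HIGH_DETAIL_HINTS = {
    "extract", "read", "transcribe", "ocr", "text",
    "receipt", "invoice", "document", "table", "count",
}

def choose_detail(prompt: str) -> str:
    """Pick the vision detail level from the prompt alone."""
    words = set(prompt.lower().replace("?", " ").replace(".", " ").split())
    return "high" if words & HIGH_DETAIL_HINTS else "low"

print(choose_detail("What is in this image?"))                  # low
print(choose_detail("Extract all the text from this receipt"))  # high
```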
5. Tile-Optimized Resize
OpenAI charges by 512px tiles. Token0 resizes images to land exactly on tile boundaries, minimizing the number of tiles without changing the aspect ratio meaningfully.
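A hypothetical version of the boundary snap: shrink uniformly so the longer side lands exactly on a 512px multiple, which usually drops a row or column of tiles while keeping the aspect ratio intact. This is my sketch of the idea, not Token0's exact algorithm:

```python
import math

def tile_count(w: int, h: int, tile: int = 512) -> int:
    return math.ceil(w / tile) * math.ceil(h / tile)

def tile_snap(width: int, height: int, tile: int = 512) -> tuple[int, int]:
    """Uniformly shrink so the longer side hits a tile boundary."""
    long_side = max(width, height)
    snapped = (long_side // tile) * tile
    if snapped == 0:
        return width, height  # smaller than one tile: nothing to gain
    s = snapped / long_side
    return round(width * s), round(height * s)

w, h = tile_snap(1600, 900)
print((w, h), tile_count(1600, 900), "->", tile_count(w, h))  # (1536, 864) 8 -> 6
```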
6. Model Cascade
Not every image needs the flagship model. Token0 analyzes task complexity and routes simple tasks to cheaper models:
- GPT-4o --> GPT-4o-mini (16.7x cheaper)
- Claude Opus --> Claude Haiku (6.25x cheaper)
Complex tasks stay on the original model.
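The routing decision can be sketched in a few lines. The model IDs and complexity hints below are hypothetical; only the cascade pairs mirror the list above:

```python
# Cheaper sibling for each flagship model (hypothetical IDs).
CASCADE = {"gpt-4o": "gpt-4o-mini", "claude-opus": "claude-haiku"}

# Hypothetical hints that a task needs the flagship model.
COMPLEX_HINTS = ("explain", "compare", "analyze", "reason", "why", "detailed")

def route_model(model: str, prompt: str) -> str:
    """Send simple tasks to the cheaper sibling; keep complex ones put."""
    if any(hint in prompt.lower() for hint in COMPLEX_HINTS):
        return model
    return CASCADE.get(model, model)

print(route_model("gpt-4o", "What color is the car?"))       # gpt-4o-mini
print(route_model("gpt-4o", "Explain why this chart dips"))  # gpt-4o
```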
7. Semantic Response Cache
Token0 generates a perceptual hash of each image combined with the prompt text. If a similar request has been seen before, the cached response is returned. Zero tokens consumed.
This is particularly effective on repetitive workloads: product image classification, document processing pipelines, batch operations.
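A stdlib-only sketch of the cache key: an average-hash over a downsampled grayscale grid, combined with the prompt. Real perceptual-hash libraries (and whatever similarity matching Token0 actually uses) are more sophisticated; this only shows the shape of the idea:

```python
import hashlib

def average_hash(gray: list[list[int]], size: int = 8) -> str:
    """Block-average the grid to size x size, threshold against the mean."""
    h, w = len(gray), len(gray[0])
    cells = []
    for i in range(size):
        for j in range(size):
            block = [gray[y][x]
                     for y in range(i * h // size, (i + 1) * h // size)
                     for x in range(j * w // size, (j + 1) * w // size)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return "".join("1" if c > mean else "0" for c in cells)

def cache_key(gray: list[list[int]], prompt: str) -> str:
    return hashlib.sha256(
        (average_hash(gray) + "|" + prompt).encode()
    ).hexdigest()
```

An exact-match lookup on this key already catches re-submitted images; near-duplicate matching would instead compare Hamming distance between the 64-bit hashes.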
Benchmarks
I tested Token0 on four Ollama vision models with real-world images -- actual photos, a real store receipt, a typed invoice, and a desktop screenshot.
| Model | Params | Token Savings |
|---|---|---|
| moondream | 1.7B | 36.3% |
| llava-llama3 | 8B | 31.2% |
| minicpm-v | 8B | 25.9% |
| llava:7b | 7B | 24.2% |
On GPT-4o with all seven optimizations enabled:
| Scale | Direct Cost | Token0 Cost | Monthly Savings |
|---|---|---|---|
| 1K images/day | $67.58 | $0.74 | $66.83 |
| 10K images/day | $675.75 | $7.45 | $668.30 |
| 100K images/day | $6,757.50 | $74.47 | $6,683.03 |
That is a 98.9% cost reduction.
Key finding: OCR routing alone delivers 47-70% token savings on text-heavy images. If you do nothing else, just routing screenshots and documents through OCR instead of vision is worth it.
Quick Start
Install from PyPI:
```shell
pip install token0
```
Add your API key to a .env file:
```
OPENAI_API_KEY=sk-...
```
Start the server:
```shell
token0 serve
```
That is it. No Docker, no Postgres, no Redis. Token0 starts in lite mode by default with SQLite and in-memory cache.
Now change your base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-...",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    }],
    extra_headers={"X-Provider-Key": "sk-..."},
)
```
Check your savings:
```shell
curl http://localhost:8000/v1/usage
```

```json
{
  "total_requests": 47,
  "total_tokens_saved": 12840,
  "total_cost_saved_usd": 0.0321,
  "avg_compression_ratio": 3.2
}
```
Works With Everything
Token0 supports four providers:
- OpenAI -- GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
- Anthropic -- Claude Sonnet, Claude Opus, Claude Haiku
- Google -- Gemini 2.5 Flash, Gemini 2.5 Pro
- Ollama -- moondream, llava, llava-llama3, minicpm-v, any vision model
For production, switch to full mode with PostgreSQL, Redis, and S3:
```shell
pip install "token0[full]"
```
Try It
- PyPI: `pip install token0`
- GitHub: github.com/Pritom14/token0
- License: Apache 2.0
Fully open source. If you are sending images to LLMs and paying for vision tokens, give it a try and let me know what savings you see.