Every time you send an image to GPT-4o, Claude, or Gemini, you are paying for vision tokens. And most of them are wasted.
I built Token0: an open-source API proxy that sits between your app and the LLM provider, optimizes every image request automatically, and typically saves 70-99% on vision costs. It is now live on PyPI.
In this post, I will walk through the problem, the seven optimization strategies, the benchmarks, and how to get started in under a minute.
The Problem: Vision Tokens Are Expensive and Poorly Optimized
Text token optimization is a solved problem. Prompt caching, compression, smart routing: the tooling is mature.
But images, the modality that costs 2-5x more per token, have almost no optimization tooling.
Here is what happens today:
Wasted pixels. You send a 4000x3000 photo to Claude. Claude silently downscales it to 1568px max. You paid for the original resolution. Those tokens are gone.
Wrong modality. A screenshot of a document costs ~765 tokens on GPT-4o as an image. The same information extracted as text costs ~30 tokens. That is a 25x markup for identical information.
Wrong detail level. "Classify this image" on GPT-4o uses high-detail mode at 1,105 tokens. Low-detail mode gives the same answer for 85 tokens. A 13x difference that nobody is optimizing for.
Wasted tiles. GPT-4o tiles images into 512x512 blocks. A 1280x720 image creates 6 tiles (1,105 tokens). Resizing to 1024x768 gives 4 tiles (765 tokens). A 31% saving with no meaningful quality loss.
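The tile arithmetic is easy to check yourself. Here is a minimal sketch of the high-detail token calculation that OpenAI documents for GPT-4o (the 85-token base and 170 tokens per tile come from OpenAI's vision pricing docs):

```python
import math

def gpt4o_image_tokens(width: int, height: int) -> int:
    """High-detail vision token estimate for GPT-4o, per OpenAI's docs."""
    # Step 1: the image is scaled to fit inside a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: it is then scaled so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512 px tiles; each costs 170 tokens on top of an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(1024, 1024))  # 4 tiles -> 765 tokens
print(gpt4o_image_tokens(2048, 4096))  # 6 tiles -> 1105 tokens
```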
How Token0 Works
Your App --> Token0 Proxy --> [Analyze -> Classify -> Route -> Transform -> Cache] --> LLM Provider
You change one line -- your base URL -- and Token0 handles everything automatically.
Token0 applies seven optimizations:
1. Smart Resize
Each provider has a maximum resolution it actually processes: Claude caps at 1568px, GPT-4o at 2048px. Token0 downscales to these limits before sending. No quality is lost, because the provider would have downscaled anyway; you just stop paying for the discarded pixels.
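The resize step reduces to a single scale factor. A sketch, with the caps hard-coded from the figures above (the real proxy presumably reads these from per-provider config):

```python
# Longest edge each provider actually processes (values from provider docs).
PROVIDER_MAX_EDGE = {"anthropic": 1568, "openai": 2048}

def fit_within(width: int, height: int, provider: str) -> tuple[int, int]:
    """Downscale so the longest edge fits the provider's cap; never upscale."""
    cap = PROVIDER_MAX_EDGE[provider]
    scale = min(1.0, cap / max(width, height))
    return round(width * scale), round(height * scale)

print(fit_within(4000, 3000, "anthropic"))  # (1568, 1176)
```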
2. OCR Routing
When an image is mostly text (screenshots, receipts, invoices, documents), Token0 extracts the text via OCR and sends that instead. Text tokens cost 10-50x less than vision tokens.
The detection uses a multi-signal heuristic: background uniformity, color variance, horizontal line structure, and edge density. It was validated at 91% accuracy on real-world images, and photos are never falsely OCR-routed.
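One of those signals is simple enough to sketch. The snippet below implements only the edge-density check on a grayscale pixel grid; the threshold values are hypothetical stand-ins, not Token0's actual numbers, and the real heuristic combines all four signals:

```python
def edge_density(gray: list[list[int]], jump: int = 30) -> float:
    """Fraction of horizontally adjacent pixel pairs with a sharp jump.

    Rendered text on a flat background produces many hard transitions;
    natural photos mostly produce gradual ones.
    """
    edges = total = 0
    for row in gray:
        for a, b in zip(row, row[1:]):
            total += 1
            edges += abs(a - b) > jump
    return edges / total if total else 0.0

def looks_like_text(gray: list[list[int]], min_density: float = 0.05) -> bool:
    # Single-signal stand-in for the multi-signal heuristic.
    return edge_density(gray) > min_density

screenshot_like = [[255, 0] * 8] * 8           # hard black-on-white strokes
photo_like = [[i * 4 for i in range(16)]] * 8  # smooth gradient
print(looks_like_text(screenshot_like), looks_like_text(photo_like))  # True False
```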
3. JPEG Recompression
PNG screenshots get converted to optimized JPEG when transparency is not needed. Smaller payload, faster upload, same visual information.
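A sketch of that conversion with Pillow, guarding against images that actually use their alpha channel (the quality setting here is my assumption, not Token0's choice):

```python
from io import BytesIO

from PIL import Image  # pip install Pillow

def png_to_jpeg(png_bytes: bytes, quality: int = 85) -> bytes:
    """Recompress a PNG as JPEG unless it has real transparency."""
    img = Image.open(BytesIO(png_bytes))
    if "A" in img.getbands() and img.getchannel("A").getextrema() != (255, 255):
        return png_bytes  # transparency in use: keep the PNG
    buf = BytesIO()
    img.convert("RGB").save(buf, "JPEG", quality=quality, optimize=True)
    return buf.getvalue()
```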
4. Prompt-Aware Detail Mode
This is the interesting one. Token0 analyzes your prompt, not just the image, to decide the detail level.
"What is in this image?" --> low detail (85 tokens)
"Extract all the text from this receipt" --> high detail (1,105 tokens)
A keyword classifier on the prompt text makes this decision. Simple queries get low-detail mode automatically.
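The classifier can be as small as a keyword set. This is a hypothetical sketch; the post does not publish Token0's actual keyword list:

```python
# Hypothetical keywords signalling that fine detail is required.
HIGH_DETAIL_HINTS = {
    "extract", "read", "transcribe", "ocr", "text",
    "receipt", "invoice", "document", "table", "count",
}

def choose_detail(prompt: str) -> str:
    """Pick the vision detail level from the prompt alone."""
    words = set(prompt.lower().replace("?", " ").replace(".", " ").split())
    return "high" if words & HIGH_DETAIL_HINTS else "low"

print(choose_detail("What is in this image?"))                  # low
print(choose_detail("Extract all the text from this receipt"))  # high
```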
5. Tile-Optimized Resize
OpenAI charges by 512px tiles. Token0 resizes images to land exactly on tile boundaries, minimizing the number of tiles without changing the aspect ratio meaningfully.
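A hypothetical version of the boundary snap: shrink uniformly so the longer side lands exactly on a 512px multiple, which usually drops a row or column of tiles while keeping the aspect ratio intact. This is my sketch of the idea, not Token0's exact algorithm:

```python
import math

def tile_count(w: int, h: int, tile: int = 512) -> int:
    return math.ceil(w / tile) * math.ceil(h / tile)

def tile_snap(width: int, height: int, tile: int = 512) -> tuple[int, int]:
    """Uniformly shrink so the longer side hits a tile boundary."""
    long_side = max(width, height)
    snapped = (long_side // tile) * tile
    if snapped == 0:
        return width, height  # smaller than one tile: nothing to gain
    s = snapped / long_side
    return round(width * s), round(height * s)

w, h = tile_snap(1600, 900)
print((w, h), tile_count(1600, 900), "->", tile_count(w, h))  # (1536, 864) 8 -> 6
```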
6. Model Cascade
Not every image needs the flagship model. Token0 analyzes task complexity and routes simple tasks to cheaper models:
- GPT-4o --> GPT-4o-mini (16.7x cheaper)
- Claude Opus --> Claude Haiku (6.25x cheaper)
Complex tasks stay on the original model.
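The routing decision can be sketched in a few lines. The model IDs and complexity hints below are hypothetical; only the cascade pairs mirror the list above:

```python
# Cheaper sibling for each flagship model (hypothetical IDs).
CASCADE = {"gpt-4o": "gpt-4o-mini", "claude-opus": "claude-haiku"}

# Hypothetical hints that a task needs the flagship model.
COMPLEX_HINTS = ("explain", "compare", "analyze", "reason", "why", "detailed")

def route_model(model: str, prompt: str) -> str:
    """Send simple tasks to the cheaper sibling; keep complex ones put."""
    if any(hint in prompt.lower() for hint in COMPLEX_HINTS):
        return model
    return CASCADE.get(model, model)

print(route_model("gpt-4o", "What color is the car?"))       # gpt-4o-mini
print(route_model("gpt-4o", "Explain why this chart dips"))  # gpt-4o
```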
7. Semantic Response Cache
Token0 generates a perceptual hash of each image combined with the prompt text. If a similar request has been seen before, the cached response is returned. Zero tokens consumed.
This is particularly effective on repetitive workloads: product image classification, document processing pipelines, batch operations.
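A stdlib-only sketch of the cache key: an average-hash over a downsampled grayscale grid, combined with the prompt. Real perceptual-hash libraries (and whatever similarity matching Token0 actually uses) are more sophisticated; this only shows the shape of the idea:

```python
import hashlib

def average_hash(gray: list[list[int]], size: int = 8) -> str:
    """Block-average the grid to size x size, threshold against the mean."""
    h, w = len(gray), len(gray[0])
    cells = []
    for i in range(size):
        for j in range(size):
            block = [gray[y][x]
                     for y in range(i * h // size, (i + 1) * h // size)
                     for x in range(j * w // size, (j + 1) * w // size)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return "".join("1" if c > mean else "0" for c in cells)

def cache_key(gray: list[list[int]], prompt: str) -> str:
    return hashlib.sha256(
        (average_hash(gray) + "|" + prompt).encode()
    ).hexdigest()
```

An exact-match lookup on this key already catches re-submitted images; near-duplicate matching would instead compare Hamming distance between the 64-bit hashes.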
Benchmarks
I tested Token0 on four Ollama vision models with real-world images -- actual photos, a real store receipt, a typed invoice, and a desktop screenshot.
| Model | Params | Token Savings |
|---|---|---|
| moondream | 1.7B | 36.3% |
| llava-llama3 | 8B | 31.2% |
| minicpm-v | 8B | 25.9% |
| llava:7b | 7B | 24.2% |
On GPT-4o with all seven optimizations enabled:
| Scale | Direct Cost | Token0 Cost | Monthly Savings |
|---|---|---|---|
| 1K images/day | $67.58 | $0.74 | $66.83 |
| 10K images/day | $675.75 | $7.45 | $668.30 |
| 100K images/day | $6,757.50 | $74.47 | $6,683.03 |
That is a 98.9% cost reduction.
Key finding: OCR routing alone delivers 47-70% token savings on text-heavy images. If you do nothing else, just routing screenshots and documents through OCR instead of vision is worth it.
Quick Start
Install from PyPI:
```shell
pip install token0
```
Add your API key to a .env file:
```
OPENAI_API_KEY=sk-...
```
Start the server:
```shell
token0 serve
```
That is it. No Docker, no Postgres, no Redis. Token0 starts in lite mode by default with SQLite and in-memory cache.
Now change your base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-...",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    }],
    extra_headers={"X-Provider-Key": "sk-..."},
)
```
Check your savings:
```shell
curl http://localhost:8000/v1/usage
```

```json
{
  "total_requests": 47,
  "total_tokens_saved": 12840,
  "total_cost_saved_usd": 0.0321,
  "avg_compression_ratio": 3.2
}
```
Works With Everything
Token0 supports four providers:
- OpenAI -- GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano
- Anthropic -- Claude Sonnet, Claude Opus, Claude Haiku
- Google -- Gemini 2.5 Flash, Gemini 2.5 Pro
- Ollama -- moondream, llava, llava-llama3, minicpm-v, any vision model
For production, switch to full mode with PostgreSQL, Redis, and S3:
```shell
pip install "token0[full]"
```
Try It
- PyPI: `pip install token0`
- GitHub: github.com/Pritom14/token0
- License: Apache 2.0
Fully open source. If you are sending images to LLMs and paying for vision tokens, give it a try and let me know what savings you see.