A few days ago I launched Token0 -- an open-source API proxy that makes vision LLM calls cheaper by optimizing images before they hit the model. The response was great, so here is the first real update: v0.2.0 with full streaming support and expanded benchmarks.
What's New in v0.2.0
1. Streaming support (`stream=True`)
This was the most requested feature. Token0 now supports Server-Sent Events streaming across all four providers -- OpenAI, Anthropic, Google, and Ollama.
How it works: Token0 optimizes your images before streaming begins, then tokens stream back in real time, exactly as with the native provider APIs. You get the cost savings without sacrificing the real-time UX.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-...",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    }],
    stream=True,
    extra_headers={"X-Provider-Key": "sk-..."},
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Final chunk includes token0 stats (tokens_saved, optimizations_applied)
```
A few details worth noting:
- OpenAI-compatible SSE format -- `data: {...}\n\n` chunks with `delta` (not `message`), ending with `data: [DONE]`
- Optimization stats on the final chunk -- the last streaming chunk includes a `token0` field with tokens saved and which optimizations were applied
- Cached responses stream too -- if Token0 has a cache hit, it simulates streaming by sending the cached response in small chunks, so your client code does not need to handle two different response formats
- Zero overhead on text-only -- if there are no images in the request, streaming passes through with no added latency
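If you consume the stream without an SDK, the wire format described above is plain SSE. Here is a minimal parsing sketch; the payload is hand-written to mimic the shapes described above, and the exact JSON Token0 emits may differ in detail:

```python
import json

# Hand-written SSE payload mimicking the chunk format described above
raw = (
    'data: {"choices": [{"delta": {"content": "A red"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " bicycle."}}], '
    '"token0": {"tokens_saved": 412, "optimizations_applied": ["resize"]}}\n\n'
    "data: [DONE]\n\n"
)

text, stats = [], None
for line in raw.splitlines():
    if not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":          # end-of-stream sentinel
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    if delta.get("content"):
        text.append(delta["content"])
    if "token0" in chunk:            # optimization stats ride the final data chunk
        stats = chunk["token0"]

print("".join(text))         # → A red bicycle.
print(stats["tokens_saved"]) # → 412
```

Because cached responses are replayed as the same chunk format, this one code path covers both live and cached streams.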
2. Expanded benchmarks (full suite)
In v0.1.0, I only benchmarked on the real-world image suite (5 images). For v0.2.0, I ran the full benchmark suite across 6 categories: single images, text passthrough, multi-image requests, multi-turn conversations, different task types (classification, extraction, description, Q&A), and real-world images.
Results across all four Ollama vision models:
| Model | Params | Direct Tokens | Token0 Tokens | Savings |
|---|---|---|---|---|
| minicpm-v | 8B | 10,877 | 6,276 | 42.3% |
| moondream | 1.7B | 16,457 | 10,240 | 37.8% |
| llava-llama3 | 8B | 13,365 | 8,486 | 36.5% |
| llava:7b | 7B | 13,384 | 8,701 | 35.0% |
The numbers are higher than v0.1.0 because the full suite includes more text-heavy test cases where OCR routing delivers 93-97% savings per image.
Key findings from the expanded benchmarks:
- OCR routing is the biggest win: 93-97% savings on text-heavy images (documents, screenshots, receipts)
- Zero overhead on text-only: confirmed 0 extra tokens across all 4 models on text-only requests
- Multi-turn conversations: images in conversation history get optimized too -- no wasted tokens on re-sent images
- Latency improves in most cases: OCR routing is actually faster than sending the full image to the model
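The savings column follows directly from the token counts in the table; a quick sanity check:

```python
# (direct tokens, Token0 tokens) from the benchmark table above
results = {
    "minicpm-v":    (10_877, 6_276),
    "moondream":    (16_457, 10_240),
    "llava-llama3": (13_365, 8_486),
    "llava:7b":     (13_384, 8_701),
}

for model, (direct, token0) in results.items():
    savings = (direct - token0) / direct * 100
    print(f"{model:<14} {savings:.1f}%")
# minicpm-v      42.3%
# moondream      37.8%
# llava-llama3   36.5%
# llava:7b       35.0%
```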
3. Ollama provider routing fix
v0.1.0 had a bug where Ollama models (moondream, llava, etc.) could be incorrectly routed to the OpenAI provider. Fixed -- Token0 now correctly detects and routes all Ollama vision models.
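To illustrate the kind of bug this was, here is a plausible sketch of name-based provider routing. This is not Token0's actual implementation; the `route` helper and model list are illustrative, using only model names mentioned in this post:

```python
# Illustrative routing sketch -- not Token0's actual code.
# Known Ollama vision models must match before any generic fallback.
OLLAMA_VISION_MODELS = {"moondream", "llava", "llava-llama3", "minicpm-v"}

def route(model: str) -> str:
    base = model.split(":")[0]          # "llava:7b" -> "llava"
    if base in OLLAMA_VISION_MODELS:
        return "ollama"
    if model.startswith("gpt-"):
        return "openai"
    if model.startswith("claude"):
        return "anthropic"
    if model.startswith("gemini"):
        return "google"
    return "openai"                     # generic fallback (the v0.1.0 bug path)

print(route("llava:7b"))  # → ollama
print(route("gpt-4o"))    # → openai
```

The v0.1.0 bug was effectively models like `moondream` falling through to the fallback; checking the Ollama list first fixes it.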
GPT-4o Cost Projections (unchanged)
These projections from v0.1.0 still hold -- they are based on OpenAI's published token formulas, not local model benchmarks:
| Scale | Without Token0 | With Token0 | Savings |
|---|---|---|---|
| 1K images/day | $67.58/mo | $0.74/mo | 98.9% |
| 100K images/day | $6,757.50/mo | $74.47/mo | 98.9% |
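The percentage holds at both scales; checking the arithmetic on the table's monthly figures:

```python
# (without Token0, with Token0) monthly costs from the projection table
projections = {"1K/day": (67.58, 0.74), "100K/day": (6_757.50, 74.47)}

for scale, (without, with_t0) in projections.items():
    pct = (1 - with_t0 / without) * 100
    print(f"{scale:<8} ${without - with_t0:,.2f} saved/mo ({pct:.1f}%)")
```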
Upgrade
```bash
pip install --upgrade token0
```
That is it. No config changes needed. Streaming works automatically when you pass `stream=True`.
What's Next
- Video optimization -- keyframe extraction + per-frame optimization for video LLM calls
- More provider-specific optimizations as new models launch
Links
- PyPI: `pip install token0`
- GitHub: github.com/Pritom14/token0
- License: Apache 2.0
Already using LiteLLM? Token0 plugs in as a callback hook -- `litellm.callbacks = [Token0Hook()]` -- no proxy needed. If you tried v0.1.0, upgrade and let me know how streaming works on your workload. If you haven't tried it yet: `pip install token0 && token0 serve` and change your base URL. That is all it takes.