Pritom Mazumdar
Token0 v0.2.0: Streaming Support + Updated Benchmarks: 35-42% Savings Across 4 Vision Models

A few days ago I launched Token0 -- an open-source API proxy that makes vision LLM calls cheaper by optimizing images before they hit the model. The response was great, so here is the first real update: v0.2.0 with full streaming support and expanded benchmarks.

What's New in v0.2.0

1. Streaming support (stream=true)

This was the most requested feature. Token0 now supports Server-Sent Events streaming across all four providers -- OpenAI, Anthropic, Google, and Ollama.

How it works: Token0 optimizes your images before streaming begins, then tokens flow word-by-word exactly like native provider APIs. You get the cost savings without sacrificing the real-time UX.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point the SDK at the local Token0 proxy
    api_key="sk-...",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
    }],
    stream=True,
    extra_headers={"X-Provider-Key": "sk-..."}  # your upstream provider API key
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
# Final chunk includes token0 stats (tokens_saved, optimizations_applied)

A few details worth noting:

  • OpenAI-compatible SSE format -- data: {...}\n\n chunks with delta (not message), ending with data: [DONE]
  • Optimization stats on the final chunk -- the last streaming chunk includes a token0 field with tokens saved and which optimizations were applied
  • Cached responses stream too -- if Token0 has a cache hit, it simulates streaming by sending the cached response in small chunks, so your client code does not need to handle two different response formats
  • Zero overhead on text-only -- if there are no images in the request, streaming passes through with no added latency
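Because the format is OpenAI-compatible, you can parse the raw SSE stream without any SDK. Here is a minimal sketch that assumes the `data: {...}` / `data: [DONE]` framing described above; the sample payloads (including the `tokens_saved` value) are illustrative, not real output:

```python
import json

def parse_sse_stream(raw: str) -> list[str]:
    """Collect delta content strings from an OpenAI-compatible SSE body.

    Assumes each event is a `data: {...}` line separated by blank lines,
    terminated by `data: [DONE]`.
    """
    deltas = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            deltas.append(delta)
    return deltas

# Two content chunks, a final chunk carrying token0 stats (empty delta),
# and the terminator.
raw = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " world"}}]}\n\n'
    'data: {"choices": [{"delta": {}}], "token0": {"tokens_saved": 312}}\n\n'
    "data: [DONE]\n\n"
)
print("".join(parse_sse_stream(raw)))  # Hello world
```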

2. Expanded benchmarks (full suite)

In v0.1.0, I only benchmarked on the real-world image suite (5 images). For v0.2.0, I ran the full benchmark suite across 6 categories: single images, text passthrough, multi-image requests, multi-turn conversations, different task types (classification, extraction, description, Q&A), and real-world images.

Results across all four Ollama vision models:

| Model | Params | Direct Tokens | Token0 Tokens | Savings |
| --- | --- | --- | --- | --- |
| minicpm-v | 8B | 10,877 | 6,276 | 42.3% |
| moondream | 1.7B | 16,457 | 10,240 | 37.8% |
| llava-llama3 | 8B | 13,365 | 8,486 | 36.5% |
| llava:7b | 7B | 13,384 | 8,701 | 35.0% |
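The savings column follows directly from the two token counts. A quick sanity check, using the numbers from the table above:

```python
# Benchmark token counts per model: (direct, via Token0)
results = {
    "minicpm-v": (10_877, 6_276),
    "moondream": (16_457, 10_240),
    "llava-llama3": (13_365, 8_486),
    "llava:7b": (13_384, 8_701),
}

for model, (direct, optimized) in results.items():
    savings = (direct - optimized) / direct * 100
    print(f"{model}: {savings:.1f}% savings")
# minicpm-v: 42.3%, moondream: 37.8%, llava-llama3: 36.5%, llava:7b: 35.0%
```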

The numbers are higher than v0.1.0 because the full suite includes more text-heavy test cases where OCR routing delivers 93-97% savings per image.

Key findings from the expanded benchmarks:

  • OCR routing is the biggest win: 93-97% savings on text-heavy images (documents, screenshots, receipts)
  • Zero overhead on text-only: confirmed 0 extra tokens across all 4 models on text-only requests
  • Multi-turn conversations: images in conversation history get optimized too -- no wasted tokens on re-sent images
  • Latency improves in most cases: OCR routing is actually faster than sending the full image to the model
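To see why OCR routing saves so much, compare what an image costs in tokens against what its extracted text costs. The sketch below uses OpenAI's published high-detail image formula (85 base tokens plus 170 per 512x512 tile; the pre-tiling resize step is omitted for brevity) and assumes roughly 4 characters per text token. The 1024x1024 screenshot and 200-character receipt are made-up examples, not benchmark data:

```python
import math

def high_detail_image_tokens(width: int, height: int) -> int:
    """Approximate GPT-4o high-detail image tokens:
    85 base + 170 per 512x512 tile (resize step omitted)."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

image_tokens = high_detail_image_tokens(1024, 1024)  # 85 + 170*4 = 765
text_tokens = 200 // 4                               # ~200-char receipt -> ~50 tokens
savings = (1 - text_tokens / image_tokens) * 100
print(f"{savings:.0f}% fewer tokens by sending OCR text instead")  # ~93%
```

For denser documents the ratio tilts even further toward text, which is consistent with the 93-97% range reported above.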

3. Ollama provider routing fix

v0.1.0 had a bug where Ollama models (moondream, llava, etc.) could be incorrectly routed to the OpenAI provider. Fixed -- Token0 now correctly detects and routes all Ollama vision models.

GPT-4o Cost Projections (unchanged)

These projections from v0.1.0 still hold -- they are based on OpenAI's published token formulas, not local model benchmarks:

| Scale | Without Token0 | With Token0 | Savings |
| --- | --- | --- | --- |
| 1K images/day | $67.58/mo | $0.74/mo | 98.9% |
| 100K images/day | $6,757.50/mo | $74.47/mo | 98.9% |
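The savings percentage is just the ratio of the two monthly figures, and the per-image cost scales linearly with volume. Checking the table's numbers:

```python
# Monthly cost projections from the table: (without Token0, with Token0)
projections = {
    "1K images/day": (67.58, 0.74),
    "100K images/day": (6_757.50, 74.47),
}

for scale, (direct, with_token0) in projections.items():
    savings = (1 - with_token0 / direct) * 100
    print(f"{scale}: {savings:.1f}% savings")
# both scales: 98.9%
```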

Upgrade

pip install --upgrade token0

That is it. No config changes needed. Streaming works automatically when you pass stream=True.

What's Next

  • Video optimization -- keyframe extraction + per-frame optimization for video LLM calls
  • More provider-specific optimizations as new models launch

Links

Already using LiteLLM? Token0 plugs in as a callback hook -- litellm.callbacks = [Token0Hook()] -- no proxy needed.

If you tried v0.1.0, upgrade and let me know how streaming works on your workload. If you haven't tried it yet: pip install token0 && token0 serve, then change your base URL. That is all it takes.

