Pritom Mazumdar
Token0 v0.2.0: Streaming Support + Updated Benchmarks: 35-42% Savings Across 4 Vision Models

A few days ago I launched Token0 -- an open-source API proxy that makes vision LLM calls cheaper by optimizing images before they hit the model. The response was great, so here is the first real update: v0.2.0 with full streaming support and expanded benchmarks.

What's New in v0.2.0

1. Streaming support (stream=true)

This was the most requested feature. Token0 now supports Server-Sent Events streaming across all four providers -- OpenAI, Anthropic, Google, and Ollama.

How it works: Token0 optimizes your images before streaming begins, then tokens flow word-by-word exactly like native provider APIs. You get the cost savings without sacrificing the real-time UX.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point the SDK at the local Token0 proxy
    api_key="sk-...",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
        ]
    }],
    stream=True,
    extra_headers={"X-Provider-Key": "sk-..."}  # your upstream provider API key
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
# Final chunk includes token0 stats (tokens_saved, optimizations_applied)

A few details worth noting:

  • OpenAI-compatible SSE format -- data: {...}\n\n chunks with delta (not message), ending with data: [DONE]
  • Optimization stats on the final chunk -- the last streaming chunk includes a token0 field with tokens saved and which optimizations were applied
  • Cached responses stream too -- if Token0 has a cache hit, it simulates streaming by sending the cached response in small chunks, so your client code does not need to handle two different response formats
  • Zero overhead on text-only -- if there are no images in the request, streaming passes through with no added latency
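Because the format is OpenAI-compatible, you can parse the raw SSE stream without any SDK. Here is a minimal sketch that assumes the `data: {...}` / `data: [DONE]` framing described above; the sample payloads (including the `tokens_saved` value) are illustrative, not real output:

```python
import json

def parse_sse_stream(raw: str) -> list[str]:
    """Collect delta content strings from an OpenAI-compatible SSE body.

    Assumes each event is a `data: {...}` line separated by blank lines,
    terminated by `data: [DONE]`.
    """
    deltas = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            deltas.append(delta)
    return deltas

# Two content chunks, a final chunk carrying token0 stats (empty delta),
# and the terminator.
raw = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " world"}}]}\n\n'
    'data: {"choices": [{"delta": {}}], "token0": {"tokens_saved": 312}}\n\n'
    "data: [DONE]\n\n"
)
print("".join(parse_sse_stream(raw)))  # Hello world
```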

2. Expanded benchmarks (full suite)

In v0.1.0, I only benchmarked on the real-world image suite (5 images). For v0.2.0, I ran the full benchmark suite across 6 categories: single images, text passthrough, multi-image requests, multi-turn conversations, different task types (classification, extraction, description, Q&A), and real-world images.

Results across all four Ollama vision models:

| Model | Params | Direct Tokens | Token0 Tokens | Savings |
| --- | --- | --- | --- | --- |
| minicpm-v | 8B | 10,877 | 6,276 | 42.3% |
| moondream | 1.7B | 16,457 | 10,240 | 37.8% |
| llava-llama3 | 8B | 13,365 | 8,486 | 36.5% |
| llava:7b | 7B | 13,384 | 8,701 | 35.0% |
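The savings column follows directly from the two token counts. A quick sanity check, using the numbers from the table above:

```python
# Benchmark token counts per model: (direct, via Token0)
results = {
    "minicpm-v": (10_877, 6_276),
    "moondream": (16_457, 10_240),
    "llava-llama3": (13_365, 8_486),
    "llava:7b": (13_384, 8_701),
}

for model, (direct, optimized) in results.items():
    savings = (direct - optimized) / direct * 100
    print(f"{model}: {savings:.1f}% savings")
# minicpm-v: 42.3%, moondream: 37.8%, llava-llama3: 36.5%, llava:7b: 35.0%
```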

The numbers are higher than v0.1.0 because the full suite includes more text-heavy test cases where OCR routing delivers 93-97% savings per image.

Key findings from the expanded benchmarks:

  • OCR routing is the biggest win: 93-97% savings on text-heavy images (documents, screenshots, receipts)
  • Zero overhead on text-only: confirmed 0 extra tokens across all 4 models on text-only requests
  • Multi-turn conversations: images in conversation history get optimized too -- no wasted tokens on re-sent images
  • Latency improves in most cases: OCR routing is actually faster than sending the full image to the model
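To see why OCR routing saves so much, compare what an image costs in tokens against what its extracted text costs. The sketch below uses OpenAI's published high-detail image formula (85 base tokens plus 170 per 512x512 tile; the pre-tiling resize step is omitted for brevity) and assumes roughly 4 characters per text token. The 1024x1024 screenshot and 200-character receipt are made-up examples, not benchmark data:

```python
import math

def high_detail_image_tokens(width: int, height: int) -> int:
    """Approximate GPT-4o high-detail image tokens:
    85 base + 170 per 512x512 tile (resize step omitted)."""
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

image_tokens = high_detail_image_tokens(1024, 1024)  # 85 + 170*4 = 765
text_tokens = 200 // 4                               # ~200-char receipt -> ~50 tokens
savings = (1 - text_tokens / image_tokens) * 100
print(f"{savings:.0f}% fewer tokens by sending OCR text instead")  # ~93%
```

For denser documents the ratio tilts even further toward text, which is consistent with the 93-97% range reported above.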

3. Ollama provider routing fix

v0.1.0 had a bug where Ollama models (moondream, llava, etc.) could be incorrectly routed to the OpenAI provider. Fixed -- Token0 now correctly detects and routes all Ollama vision models.

GPT-4o Cost Projections (unchanged)

These projections from v0.1.0 still hold -- they are based on OpenAI's published token formulas, not local model benchmarks:

| Scale | Without Token0 | With Token0 | Savings |
| --- | --- | --- | --- |
| 1K images/day | $67.58/mo | $0.74/mo | 98.9% |
| 100K images/day | $6,757.50/mo | $74.47/mo | 98.9% |
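The savings percentage is just the ratio of the two monthly figures, and the per-image cost scales linearly with volume. Checking the table's numbers:

```python
# Monthly cost projections from the table: (without Token0, with Token0)
projections = {
    "1K images/day": (67.58, 0.74),
    "100K images/day": (6_757.50, 74.47),
}

for scale, (direct, with_token0) in projections.items():
    savings = (1 - with_token0 / direct) * 100
    print(f"{scale}: {savings:.1f}% savings")
# both scales: 98.9%
```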

Upgrade

pip install --upgrade token0

That is it. No config changes needed. Streaming works automatically when you pass stream=True.

What's Next

  • Video optimization -- keyframe extraction + per-frame optimization for video LLM calls
  • More provider-specific optimizations as new models launch

Links

Already using LiteLLM? Token0 plugs in as a callback hook -- litellm.callbacks = [Token0Hook()] -- no proxy needed.

If you tried v0.1.0, upgrade and let me know how streaming works on your workload. If you haven't tried it yet: pip install token0 && token0 serve, then change your base URL. That is all it takes.

