DEV Community

Dor Amir
NadirClaw 0.8: Vision Routing and the Silent Failure It Fixed

Here's a bug that's annoying to diagnose: you send a screenshot to Cursor, get a response that clearly didn't look at the image. You try again. Same thing. You figure it's a model issue and move on.

If you're running NadirClaw in front of Cursor, the bug was in the router.


How NadirClaw routes requests

Before 0.8, here's what happened when you sent an image:

  1. NadirClaw's classifier embedded your prompt using sentence embeddings and compared it to two pre-computed centroid vectors (one for "simple", one for "complex"). That took ~10ms, with no extra API call.
  2. Your screenshot was probably attached to a short message like "what's wrong here?" - which classified as simple.
  3. Simple routed to your cheap model. If that was DeepSeek or an Ollama model, neither supports vision.
  4. The multimodal content array (the image_url part) got flattened to text before hitting LiteLLM. The image disappeared.
  5. DeepSeek answered based on the text alone. Looked wrong. Was wrong.

No error. No log warning. Just a bad answer.
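The nearest-centroid step is simple enough to sketch. This is not NadirClaw's actual code - the function names and vectors here are hypothetical - but it shows why a prompt like "what's wrong here?" lands in "simple": the text carries almost no complexity signal, and the image never enters the comparison.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def classify(prompt_embedding, simple_centroid, complex_centroid):
    # Nearest-centroid decision: whichever pre-computed centroid the
    # prompt embedding is closer to (by cosine similarity) wins.
    # No API call, just two dot products - hence the ~10ms budget.
    if cosine(prompt_embedding, simple_centroid) >= cosine(prompt_embedding, complex_centroid):
        return "simple"
    return "complex"
```

Note what this classifier never sees: the attached image. The decision is made purely on the prompt text.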


What 0.8 changes

The model registry now has a has_vision field on every model:

"gemini-2.5-flash":       {"has_vision": True,  "cost_per_m_input": 0.15},
"deepseek/deepseek-chat": {"has_vision": False, "cost_per_m_input": 0.28},
"ollama/llama3.1:8b":     {"has_vision": False, "cost_per_m_input": 0},

When NadirClaw detects image_url or base64 image content in a request, it checks the selected model's has_vision flag. If it's False, it swaps to the cheapest vision-capable model in your configured tiers.
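In outline, the fallback looks like the sketch below. The function names (`has_image_content`, `resolve_model`) are my own shorthand, not NadirClaw's API; the registry entries are the ones quoted above, assuming OpenAI-style typed content parts.

```python
REGISTRY = {
    "gemini-2.5-flash":       {"has_vision": True,  "cost_per_m_input": 0.15},
    "deepseek/deepseek-chat": {"has_vision": False, "cost_per_m_input": 0.28},
    "ollama/llama3.1:8b":     {"has_vision": False, "cost_per_m_input": 0},
}

def has_image_content(messages):
    # OpenAI-style multimodal messages carry a list of typed content
    # parts; any image_url part (URL or base64 data URL) counts.
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list) and any(
            part.get("type") == "image_url" for part in content
        ):
            return True
    return False

def resolve_model(selected, registry=REGISTRY):
    # Keep the classifier's pick if it can see images; otherwise swap
    # to the cheapest vision-capable model in the registry.
    if registry[selected]["has_vision"]:
        return selected
    vision_models = [m for m, spec in registry.items() if spec["has_vision"]]
    return min(vision_models, key=lambda m: registry[m]["cost_per_m_input"])
```

With this registry, a "simple" classification that picked DeepSeek gets swapped to Gemini Flash as soon as an image part is present.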

That's usually Gemini Flash ($0.15/M input) rather than Sonnet ($3.00/M) or GPT-5.2 ($1.75/M). You're not paying premium rates for vision; you're paying the cheapest rate that actually works.


The fix that mattered as much as the routing

Separately from the routing logic, there was a second bug: even if you'd manually pointed your image request at a vision-capable model, the content array was still being flattened to text-only before reaching LiteLLM, on both the streaming and non-streaming paths.

That's fixed in 0.8. Image content parts now pass through unchanged.
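Roughly, the pre-0.8 flattening behaved like this sketch (the function name and message are illustrative, not NadirClaw's actual code): text parts were joined into a plain string and everything else was dropped on the floor.

```python
def flatten_to_text(content):
    # Pre-0.8 behavior (the bug): keep only the text parts and join
    # them, silently discarding every image_url part.
    if isinstance(content, list):
        return " ".join(p["text"] for p in content if p.get("type") == "text")
    return content

# A typical Cursor-style multimodal payload: short text plus a screenshot.
message_content = [
    {"type": "text", "text": "what's wrong here?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]
```

Running `flatten_to_text(message_content)` yields just `"what's wrong here?"` - the model never had a chance to see the image, no matter which model you picked. The 0.8 fix is to forward the content list unchanged.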


Upgrade

pip install --upgrade nadirclaw

If you've been getting inconsistent answers on image-heavy requests, this is probably why. Run nadirclaw report after upgrading and look at the has_images field in your request logs to see how often this was silently misfiring.

Full changelog: v0.7.0...v0.8.0

(Full disclosure: I work on this project.)

Top comments (1)

Hamza KONTE

The silent failure mode here is the worst kind — the system returns something plausible but wrong, which means you can't tell from the output alone that routing failed. Diagnosing that requires knowing what the router decided, not just what the model said.

The embedding-based routing approach makes sense but exposes a classic failure surface: the prompt text doesn't always carry enough signal to distinguish vision from text tasks. A user writing "look at this" provides almost no embedding signal — the intent is in the image, not the text.

One related thing I've noticed in agent systems: when the upstream prompt is poorly structured, routing classifiers downstream have to work harder because intent is buried in prose rather than explicitly declared. A structured prompt with a typed Input block (e.g. "image: screenshot of...") gives the router a cleaner signal. Basically, prompt structure is part of the routing infrastructure.

I've been working on exactly this — flompt (flompt.dev) is a visual prompt builder that structures prompts into 12 semantic blocks so intent is always explicit. The MCP server could sit in a pipeline like NadirClaw to generate well-structured prompts before they hit the router. Open-source at github.com/Nyrok/flompt if curious.