as1as · Posted on DEV Community
I Read One Paper and Ended Up Swapping Visual AI Models 3 Times

One day I stumbled across a paper called ShowUI. A vision model that looks at screenshots and understands UI elements. "That sounds fun" — I thought. That curiosity led to 3 model swaps, an accessibility app concept, and a project I never shipped.


🧪 It Started with a Paper

I came across ShowUI-2B by OpenBMB. Feed it a screenshot, and it detects buttons, text fields, icons — all the UI elements on screen. A Vision model purpose-built for understanding interfaces.

"I could build something with this." That thought started everything.

Testing Reality: Underwhelming

When I actually ran it, the results didn't match the paper. On Korean-language UIs — especially heavily styled sites with custom CSS — it was bad. It couldn't even locate the username and password input fields. Not "low accuracy." It couldn't find them at all. Maybe 1 success out of 10 attempts. The model was also 4.7GB — not small.

The testing environment was painful too. I couldn't set up a proper GPU environment, so I force-quantized the model and ran it on CPU. A simple test — feed a screenshot, get back UI element coordinates — took up to 5 minutes to return results. On a GPU, this would take seconds; on my setup, I could make coffee and come back to find it still running.

The concept of "AI that understands UIs" was compelling. This particular model just wasn't good enough.

🔀 The Pivot: "Could This Help Blind People?"

ShowUI wasn't perfect, but the idea of "AI that sees and understands screens" stuck with me. As I searched for better models, the concept expanded.

If AI can understand UI elements on screen, could it also read traffic light colors? Bus numbers? Could it help visually impaired people navigate daily life?

That's how I started planning an accessibility assistant app for Android — the camera sees the world, AI processes it, and voice tells the user what's happening.

Features I needed:

  • Traffic light recognition (red/green)
  • Bus number reading (OCR)
  • App UI automation (detect buttons and fields → automate interactions)

🟡 Attempt #2: UI-TARS-2B — "A Better UI Model"

To fix ShowUI's accuracy issues, I found UI-TARS-2B by ByteDance (the company behind TikTok).

It was definitely better than ShowUI. More accurate at distinguishing specific UI elements, and about 2GB with INT8 quantization — less than half the size. But this model could only understand UIs.

My tech stack at this point looked like:

```
Traffic lights / bus numbers → Qwen-7B    (general vision model)
UI detection                 → UI-TARS-2B (UI specialist)
```

Two models to manage simultaneously. Memory allocation, model switching logic, error handling — everything doubled in complexity.
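To make the doubled complexity concrete, here's a minimal sketch of the kind of task-to-model dispatch this stage required. The model names match the stack above, but `Task` and `pick_model` are hypothetical names for illustration, not a real API:

```python
from enum import Enum, auto

class Task(Enum):
    TRAFFIC_LIGHT = auto()
    BUS_NUMBER = auto()
    UI_DETECTION = auto()

# Which model handles which task: UI work goes to the specialist,
# everything else to the generalist. Keeping both loaded means double
# the memory budget and double the error-handling paths.
MODEL_FOR_TASK = {
    Task.TRAFFIC_LIGHT: "qwen-7b",
    Task.BUS_NUMBER: "qwen-7b",
    Task.UI_DETECTION: "ui-tars-2b",
}

def pick_model(task: Task) -> str:
    """Return the model responsible for a given task."""
    return MODEL_FOR_TASK[task]
```

Every branch in this table is a place where memory allocation, model switching, and failure recovery have to be handled separately — which is exactly the overhead a single generalist model removes.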

💡 The Turning Point: "Why Do I Need Two?"

Then it clicked. Qwen-7B is a general-purpose Vision Language Model. Can't it understand UIs too?

I tested on desktop and the results were promising:

| Task | Qwen-7B Accuracy (Desktop) |
| --- | --- |
| Traffic light recognition | 88% |
| Bus number OCR | 80% |
| UI element detection | 75% |

75% is lower than UI-TARS's 90%, but being able to do everything with a single model meant cutting complexity in half. UI-TARS was no longer necessary.

The question was: could this run on a phone?

🔒 Why It Had to Run On-Device

"Just run the AI on a server" — obviously that would be easier. But for this app, it wasn't an option.

When a visually impaired person points their camera at the world, the footage captures their home, their routes, the people around them. Continuously streaming this sensitive video data to a server is a serious privacy problem. Especially since assistive devices tend to stay on all day — you'd essentially be enabling real-time location tracking.

So all AI inference had to happen on the device itself. Camera data never leaves the phone. This decision became the constraint that shaped every model choice that followed.

With that in mind, I set out to move Qwen-7B from desktop to mobile.

🔴 Putting Qwen-7B on Mobile — Failed

Here's where reality hit hard.

First, Android can't run PyTorch or HuggingFace checkpoints directly — for this pipeline, that meant converting to ONNX format. Finding a good model isn't enough; you also need to confirm it can be converted to ONNX and that performance holds after conversion.

I tried converting Qwen-7B to ONNX myself, but converting a 7B-parameter VLM turned out to be far more complex than expected. I gave up. And even before conversion was a problem, the model was simply too large — most devices ran out of memory and couldn't even load it.
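The "too large" problem is easy to see with back-of-the-envelope math: parameter count times bytes per parameter gives a rough lower bound on weight memory, before activations, KV cache, and the OS's own footprint are even counted. A quick sketch (the byte-per-parameter figures are standard for FP16 and INT4; the phone RAM comparison is my own framing):

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB: params × bytes-per-param."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

# 7B at FP16 (2 bytes/param): ~13 GiB — far beyond any phone's RAM.
fp16_7b = weight_gb(7, 2.0)

# 7B at INT4 (~0.5 bytes/param): ~3.3 GiB — still hopeless once the app,
# the camera pipeline, and Android itself claim their share of 6-8 GB RAM.
int4_7b = weight_gb(7, 0.5)

# 2B at INT4: under 1 GiB of weights, which lines up with the 1.2-1.5 GB
# the smaller model actually occupied with runtime overhead.
int4_2b = weight_gb(2, 0.5)
```

The arithmetic alone rules out 7B on a phone, no matter how clever the conversion.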

The direction — "one model for everything" — was right. But 7B was more than mobile could handle.

✅ Qwen-2B VL — "The Small Generalist"

The final answer was Qwen-2B VL. A smaller version of Qwen-7B that retains Vision Language capabilities. Where 7B couldn't even load on mobile, 2B VL actually ran.

| Spec | Qwen-2B VL |
| --- | --- |
| Size | 1.2–1.5GB (INT4) |
| Inference speed | 7–9 seconds |
| Battery life | 3–6 hours |
| Heat | 42–44°C |
| Traffic lights | 88% |
| Bus numbers | 80% |
| UI elements | 75% |

The accuracy wasn't stellar, but I figured that could be improved later with fine-tuning. What mattered was that it actually ran on a phone. And an ONNX-converted version was already available on HuggingFace — no manual conversion needed.

Technically, I'd finally found the answer.

🛑 Why I Stopped: Solo Dev Reality

The model problem was solved. But when I stepped back and looked at the full project, the scope was impossible for one person.

Traffic light recognition, bus OCR, UI automation, voice guidance, GPS navigation, accessibility testing — each of these is a project on its own. The biggest burden was accessibility testing. Building an app that blind users can actually use requires TalkBack (screen reader) compatibility, voice feedback timing, haptic pattern design — specialized domains that you can't just learn solo. It requires iterative testing with actual visually impaired users.

And for a service targeting blind users, "mostly works" isn't acceptable. 88% accuracy is fine for a regular app, but misreading a traffic light is a matter of life and safety. Even if fine-tuning could improve accuracy, collecting and validating that fine-tuning data would be yet another project in itself.
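The safety argument is starker when you turn the accuracy number into expected failures. The crossings-per-day figure below is an assumed number purely for illustration:

```python
def expected_errors(accuracy: float, events: int) -> float:
    """Expected number of misreads: failure rate × event count."""
    return (1 - accuracy) * events

# At 88% accuracy and an assumed 10 street crossings a day,
# that's roughly 1.2 misread traffic lights per day...
daily = expected_errors(0.88, 10)

# ...and about 36 per month. For a sighted user that's an annoyance;
# for a blind user, each one is a potential safety incident.
monthly = expected_errors(0.88, 10 * 30)
```

Framed this way, "88% is pretty good" becomes "dozens of dangerous failures a month" — which is why the accuracy bar for this product category is so much higher than for a typical app.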

Researching existing apps like BlindSquare confirmed it. This space has dedicated teams who've been refining their products for years. Trying to build an MVP solo in 4 weeks wasn't a technology problem — it was a scope problem.

Stopping the project wasn't giving up. It was redirecting my resources where they could actually make an impact.

📊 The Journey

```
Read ShowUI paper   → "This looks fun"                → Tested it, 1 in 10 success rate
Idea expands        → "Could this help blind people?" → Start planning the app
Switch to UI-TARS   → "A better model"                → Complexity doubles
Merge into Qwen-7B  → "One model for everything"      → Can't run on mobile
Find Qwen-2B VL     → "This is it!"                   → Actually works on phone
Reality check       → "Too big for one person"        → Project ends
```

🎓 What I Learned

1. Paper benchmarks ≠ real-world performance

ShowUI's paper looked impressive, but on Korean UIs with heavy CSS styling, it couldn't even find input fields — 1 in 10 attempts. Papers report results under optimal conditions. Your environment will be different.

2. One generalist model > multiple specialists

Rather than ShowUI for UIs, a separate model for traffic lights, another for OCR — a single Vision Language Model like Qwen VL did "everything well enough." One model at 75% across all tasks beats several specialists at 90% each, in practice.

3. Mobile is a different world

A 7B model that runs beautifully on desktop couldn't even load on a phone. If you're planning on-device AI, start with mobile constraints — memory, battery, heat — not desktop performance.

4. The detour was worth it

This app never shipped, but I don't regret it. Quantizing models by hand, waiting 5 minutes for CPU inference on a single coordinate test, measuring phone temperatures — you can't learn this stuff from tutorials. Most importantly, it was fun. That experience is why I can confidently run local AI (Ollama + Qwen3) in TalkWith.chat today.


Top comments (3)

Hamza KONTE

This kind of empirical paper-driven model evaluation is exactly what the community needs more of — too much "I switched because of a tweet" and not enough actual testing.

One angle worth noting: across all these model swaps, prompt structure often matters as much as model capability. I've seen the same model produce dramatically different outputs when given structured vs unstructured prompts. I built flompt (flompt.dev) — a free visual prompt builder that decomposes prompts into typed semantic blocks (role, context, constraints, examples, output format, chain-of-thought, etc.) and compiles to Claude-optimized XML. When evaluating visual AI models, running structured prompts gives you cleaner benchmark data since you're reducing the variable of prompt quality.

Great story — curious which model you finally settled on and what paper tipped the scale?

Hamza KONTE

Qwen-2B VL is a great pick for that use case — and yes, the ShowUI paper is a classic example of how benchmarking actually changes your model selection rather than just confirming your existing choice.

Totally agree on the token-efficiency point with smaller models. With 2B params, a well-structured prompt with explicit role + constraints + output format blocks can recover a lot of the gap vs larger models. Flompt actually makes this pretty easy to iterate on — you can tweak individual blocks independently without rewriting the whole prompt. Worth trying if you're still tuning your Qwen-2B VL prompts!

as1as

Thanks! I ended up going with Qwen-2B VL — the ShowUI paper was what started it all. You're right that prompt structure matters a lot, especially with smaller models where every token counts.