xulingfeng

Posted on May 20

I Cut My AI Test Automation Cost by 300x by Ditching Vision Models

#ai #testing #deepseek #playwright

From $0.011 per step to $0.00004 — here's how I learned vision models are overkill for most web testing, and what I built instead.

It started with a $400 monthly API bill (and yes, that's USD — I'm in China, but you'll feel the same pain in any currency).

I was running an AI-powered test automation platform built on Midscene.js with Qwen-VL vision models. Every test step meant sending a full-page screenshot to a multimodal LLM — and paying about $0.011 per step.

A 50-step test case cost about $0.55. Run it daily? $16.50/month. Add a few more test scenarios, and suddenly I was spending more on API calls than on coffee.

And the worst part? Most of those screenshots contained information I already had for free.

The Platform That Taught Me a Lesson

First, a quick backstory.

I built ai-test-platform, a full-stack test automation management system:

Frontend: Vue 3 + Element Plus
Backend: Express + Node.js + MySQL
Test engine: Midscene.js 1.5.2 + Playwright + Qwen-VL
Dockerized, with a management UI for test cases, reports, and models

It worked. Beautiful reports, clean UI, easy test management. I even pushed it to Docker Hub (xulingfeng/ai-test-platform:latest).

But every time I ran a test, I could almost hear the coins dropping. $0.011 here, $0.011 there. A 29-step doctor-onboarding flow cost $0.32.

For a solo QA engineer running tests multiple times a day, that adds up fast.

The Moment It Clicked

I was watching a test run one afternoon. The AI was analyzing a screenshot of a web page — and I realized something:

The AI could see 45 interactive elements in the screenshot. But Playwright had already extracted all 45 of them as clean structured text.

I was paying to process pixels when the data was already neatly organized in the DOM tree.

Here's what a page looks like to a vision model:

[screenshot image with pixel data, rendering details, colors, shadows...]

And here's what it looks like in the DOM:

[0] <input placeholder="Search..." name="q">
[1] <button>Sign in</button>
[2] <a>Add new doctor</a>
...

The AI doesn't need to "see" the page. It needs to understand the structure and decide what to click. And structured text does that perfectly.
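A minimal sketch of how such an indexed element list could be built. The dictionary shape (`tag`, `attrs`, `text`) and the name `format_elements` are assumptions for illustration, not deep-test's actual code; in a real run the dictionaries would come from the browser, for example by collecting interactive nodes through Playwright.

```python
from html import escape


def format_elements(elements: list[dict]) -> str:
    """Render interactive elements as indexed, HTML-like lines for an LLM prompt."""
    lines = []
    for index, el in enumerate(elements):
        # Keep only non-empty attributes so the prompt stays small.
        attrs = "".join(
            f' {name}="{escape(str(value), quote=True)}"'
            for name, value in el.get("attrs", {}).items()
            if value
        )
        text = el.get("text", "").strip()
        if text:
            lines.append(f"[{index}] <{el['tag']}{attrs}>{escape(text)}</{el['tag']}>")
        else:
            lines.append(f"[{index}] <{el['tag']}{attrs}>")
    return "\n".join(lines)


# Hypothetical extraction result mirroring the listing above.
page_elements = [
    {"tag": "input", "attrs": {"placeholder": "Search...", "name": "q"}},
    {"tag": "button", "text": "Sign in"},
    {"tag": "a", "text": "Add new doctor"},
]
print(format_elements(page_elements))
```

The index in brackets is what the model refers to when it picks an element, so the executor can map the choice back to a concrete node without any coordinates.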

The 300x Optimization: deep-test

I built deep-test — a pure-text AI testing framework.

The architecture is embarrassingly simple:

Task: "Login system, search product, add to cart"
         ↓
① Extract interactive elements (DOM tree / uiautomator)
   (No screenshots. No vision models.)
         ↓
② DeepSeek V4 analyzes structure + decides next action
   (~2000 input tokens/step × $0.14/M ≈ $0.0003/step at list price;
    observed cost was lower, ~$0.00004/step)
         ↓
③ Execute action (Playwright click / ADB tap)
         ↓
④ Back to ① until task completes
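The loop above can be sketched in a few lines. This is an illustration under assumptions, not the deep-test source: the names `run_task` and `parse_action`, the JSON action format, and the prompt wording are all hypothetical. The three callables stand in for DOM extraction, the DeepSeek call, and the Playwright or ADB executor.

```python
import json
from typing import Callable


def parse_action(reply: str) -> dict:
    """Parse the model reply, assumed to be a JSON object such as
    {"action": "click", "index": 1} or {"action": "done"}."""
    action = json.loads(reply)
    if "action" not in action:
        raise ValueError(f"reply has no 'action' field: {reply!r}")
    return action


def run_task(
    task: str,
    extract: Callable[[], str],
    decide: Callable[[str], str],
    execute: Callable[[dict], None],
    max_steps: int = 50,
) -> int:
    """Loop extract -> decide -> execute until the model answers 'done'.
    Returns the number of actions that were executed."""
    history: list[dict] = []
    for step in range(max_steps):
        elements = extract()  # step 1: structured text, no screenshot
        prompt = (
            f"Task: {task}\n"
            f"Previous actions: {json.dumps(history)}\n"
            f"Elements:\n{elements}\n"
            "Reply with one JSON action."
        )
        action = parse_action(decide(prompt))  # step 2: text-only LLM call
        if action["action"] == "done":
            return step
        execute(action)  # step 3: click / type / tap
        history.append(action)
    raise RuntimeError(f"task not finished after {max_steps} steps")


# Stand-ins so the sketch runs without a browser or an API key.
scripted_replies = iter([
    '{"action": "type", "index": 0, "text": "aspirin"}',
    '{"action": "click", "index": 1}',
    '{"action": "done"}',
])
executed: list[dict] = []
steps = run_task(
    task="Search product",
    extract=lambda: '[0] <input name="q">\n[1] <button>Search</button>',
    decide=lambda prompt: next(scripted_replies),
    execute=executed.append,
)
print(steps, executed)
```

Keeping the three stages as plain callables is one way to let the same loop drive both a web page and an Android device: only `extract` and `execute` change.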

The cost comparison is ridiculous:

Approach	Per step	50-step test
Midscene.js + Qwen-VL-Plus	~$0.011	~$0.55
browser-use + Claude	~$0.10	~$5.00
deep-test + DeepSeek V4	~$0.00004	~$0.002

200-300x cheaper. The 50-step test that cost $0.55 now costs less than a cent.
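For anyone who wants to check the arithmetic, here is a small sketch that recomputes the table from the per-step prices quoted above. The helper names `run_cost` and `savings_factor` are made up for this example.

```python
# Per-step prices in USD, copied from the comparison table.
PER_STEP_USD = {
    "Midscene.js + Qwen-VL-Plus": 0.011,
    "browser-use + Claude": 0.10,
    "deep-test + DeepSeek V4": 0.00004,
}


def run_cost(per_step_usd: float, steps: int) -> float:
    """Cost of one test run with the given number of steps."""
    return per_step_usd * steps


def savings_factor(baseline_usd: float, new_usd: float) -> float:
    """How many times more expensive the baseline is than the new approach."""
    return baseline_usd / new_usd


STEPS = 50
cheapest = PER_STEP_USD["deep-test + DeepSeek V4"]
for name, per_step in PER_STEP_USD.items():
    total = run_cost(per_step, STEPS)
    factor = savings_factor(per_step, cheapest)
    print(f"{name}: ${total:.3f} per {STEPS}-step test ({factor:.0f}x the deep-test cost)")
```

With these inputs the Qwen-VL setup comes out at about 275 times the deep-test price per step, which is where the "200-300x" range comes from.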

The Real-World Numbers

I ran a complete hospital management workflow — login, navigate menus, add a new doctor with 12 fields, verify the result. 29 steps total.

Result: 81.8 seconds, ~$0.001 total cost.

For context, that's less than the price of a single step on the vision-based approach.

But Wait — What About Android Apps?

Here's where it gets even more interesting.

Android apps can't give you a clean DOM tree like a web page. So I added a hybrid approach:

Use uiautomator2 to extract the native UI tree (it's text, just like DOM)
Use ADB screencap + OCR only when the UI tree doesn't have enough info
Same DeepSeek V4 decision engine — just different input sources
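A rough sketch of the first two points, assuming the UI tree arrives as the XML string that uiautomator2's `dump_hierarchy()` returns. The function names, the output line format, and the "more than half unlabeled" fallback rule are assumptions made for this example; the sample XML is invented.

```python
import xml.etree.ElementTree as ET


def summarize_ui_tree(hierarchy_xml: str) -> tuple[list[str], int]:
    """Turn clickable nodes into indexed text lines.
    Returns the lines and the number of nodes that carry no usable label."""
    root = ET.fromstring(hierarchy_xml)
    lines: list[str] = []
    unlabeled = 0
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        label = node.get("text") or node.get("content-desc") or node.get("resource-id")
        if not label:
            unlabeled += 1
            label = "?"
        widget = node.get("class", "").rsplit(".", 1)[-1]  # e.g. "Button"
        bounds = node.get("bounds", "")
        lines.append(f"[{len(lines)}] <{widget} bounds={bounds}> {label}")
    return lines, unlabeled


def needs_ocr(lines: list[str], unlabeled: int, max_unlabeled_ratio: float = 0.5) -> bool:
    """Assumed fallback rule: use screencap + OCR when the tree is empty
    or most clickable nodes have no text to reason about."""
    if not lines:
        return True
    return unlabeled / len(lines) > max_unlabeled_ratio


# Invented sample in the shape of a uiautomator hierarchy dump.
SAMPLE_XML = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" clickable="false" text="" content-desc="" resource-id="" bounds="[0,0][1080,2340]">
    <node class="android.widget.EditText" clickable="true" text="" content-desc="" resource-id="com.example:id/search" bounds="[40,200][1040,320]" />
    <node class="android.widget.Button" clickable="true" text="Sign in" content-desc="" resource-id="com.example:id/login" bounds="[40,400][1040,520]" />
    <node class="android.view.View" clickable="true" text="" content-desc="" resource-id="" bounds="[40,600][1040,720]" />
  </node>
</hierarchy>"""

ui_lines, unlabeled_count = summarize_ui_tree(SAMPLE_XML)
print("\n".join(ui_lines))
print("OCR fallback needed:", needs_ocr(ui_lines, unlabeled_count))
```

The resulting lines play the same role as the indexed DOM listing on the web side, so the decision prompt does not need to know which platform it is driving.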

This means one AI agent handles both Web and Android with the same architecture.

I also solved the notorious hybrid-app WebView input problem, where in-app web views ignore standard text-input commands. The fix: call uiautomator2's device-level send_keys() instead of the element's set_text(). It took days to figure out and one line to implement.
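A hedged sketch of what that switch might look like. The helper `type_into_webview` is hypothetical, and tapping the field first to focus it is an assumption on my part; `device` is expected to behave like a uiautomator2 device object. The fake classes exist only so the example runs without a phone attached.

```python
def type_into_webview(device, text: str, **selector) -> None:
    """Focus a field, then type through the input method instead of
    calling set_text() on the element."""
    device(**selector).click()          # focus the field inside the WebView
    device.send_keys(text, clear=True)  # device-level typing via the IME


class FakeElement:
    """Records clicks in place of a real UI object."""

    def __init__(self, log: list, selector: dict):
        self.log = log
        self.selector = selector

    def click(self) -> None:
        self.log.append(("click", self.selector))


class FakeDevice:
    """Mimics only the two calls used above: device(**selector) and send_keys()."""

    def __init__(self):
        self.log: list = []

    def __call__(self, **selector) -> FakeElement:
        return FakeElement(self.log, selector)

    def send_keys(self, text: str, clear: bool = False) -> None:
        self.log.append(("send_keys", text, clear))


fake = FakeDevice()
# Sample selector and value, invented for the demo.
type_into_webview(fake, "Dr. Chen", resourceId="com.example:id/name")
print(fake.log)
```

With a real device, `fake` would be replaced by the object returned from `uiautomator2.connect()`.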

What I Learned

Vision models are overkill for most web testing.

They're great for:

Visual regression testing (did the layout break?)
CAPTCHA solving
Canvas/SVG-heavy applications

But for standard CRUD operations — filling forms, clicking buttons, navigating menus — the DOM already has all the information you need.

The real optimization isn't about better prompting or smarter AI. It's about choosing the right data format for the job.

The Tools

Neither project is public yet: they contain real test data from production healthcare applications. I plan to clean them up and open-source them once the company-specific content is stripped out. If you'd like early access or want to discuss the approach, feel free to reach out.

The tech stack:

LLM: DeepSeek V4 Flash ($0.14/M input, $0.28/M output)
Web automation: Playwright
Android automation: uiautomator2 + ADB
OCR: EasyOCR (local, no API cost)

I'm a test manager with 15 years of experience. I've been building AI testing tools on the side because I believe good testing shouldn't cost a fortune. If this resonates, I share more practical testing prompts and techniques in my toolkit: xulingfeng.gumroad.com/l/vkhhq