Browser-Use Is Solving the Wrong Half of the Problem

#ai #opensource #productivity #discuss

TL;DR — when to use browserground (and when to use UI-TARS-MLX instead)

If you're on Apple Silicon with ≥16 GB RAM and you need generic, max-accuracy UI grounding, use mlx-community/UI-TARS-1.5-7B-4bit. It's the obvious default — ~94% on ScreenSpot-v2, MLX-native, drops into mlx-vlm directly. ByteDance research-lab compute, you couldn't reproduce it on a budget. I'm not the right pick for that workload.

browserground is for two narrower jobs:

1. The recipe for your product's custom UI grounder. UI-TARS is a finished model — closed pipeline, proprietary data, hard to extend. browserground is the opposite: a reproducible template. Open base (Qwen3-VL-2B), open training scripts, open data mix (26k records, OS-Atlas + wave-ui). Swap in your dashboard's screenshots / your customer app / your internal tooling → ship a domain-trained UI grounder over a weekend. The 60% generic ScreenSpot-v2 score isn't the deliverable; the recipe is. A 60-point baseline on generic screens becomes 85-95% on your own product's narrow surface because the test distribution finally matches the training distribution.

2. The smallest viable slot in a multi-model stack. browserground 4-bit MLX is ~1 GB on disk / ~2 GB RAM. UI-TARS-1.5-7B-MLX is ~4 GB / ~5-6 GB RAM. The difference matters on 8 GB Macs and in agent stacks that already run a 7B planner + an OCR model + an embedding model in the same RAM budget. Plus strict JSON output (100% parseable, no regex on prose) — small win, but real.

A direct head-to-head benchmark of browserground vs UI-TARS-1.5-7B-MLX on the same Apple Silicon hardware is forthcoming.

The broader argument — why a parser-stage specialist matters at all

And if you're new to the hybrid pattern — why this exists at all

Everyone's posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.

Mine broke too. The submit button on a checkout form that the frontier vision model literally couldn't see. Billed at $0.01-0.05 per call, called 20-50 times per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $5 hits that same button 3.3x more reliably on ScreenSpot-v2 (60.0% vs GPT-4o's 18.3%).

The architecture is the bug, not the model.

Two Jobs, One Forward Pass

Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy.

browser-use (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step:

Given this page, what should the agent do next, and where exactly does it click?

Two jobs welded together. Reasoning ("what next") is a probabilistic problem worth a frontier model. Grounding ("where exactly") is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels.

You're paying frontier-tier rates for the second job. Per screenshot. Every step of the loop.

Grounding Is a Parser Problem

Once you name it as a parser problem, the right tool changes. You don't need 200 billion parameters to emit a JSON list of clickable elements. You need a model that:

Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision
Outputs strict JSON without hallucinating bounding boxes
Runs locally so the per-step cost is zero

A 2B specialist trained on screen-parsing data. Not a frontier model.

I trained one. Total cost: ~$5 of RunPod compute on a single A6000 GPU. The result, browserground, hits 60.0% on ScreenSpot-v2 vs GPT-4o's 18.3% — a 3.3x beat at the click-grounding job. More telling: it beats SeeClick (9.6B params, 55.1%) at 4.8x smaller. A drop-in for any agent loop currently handing screenshots to a frontier API. Today the CLI runs via transformers on Apple Silicon (~14 s/call); MLX-native build coming for the ~1.5 s path.

The Reasoning Model Gets Its Reasoning Capacity Back

When you split the call, the frontier model stops seeing pixels. It sees:

{
  "elements": [
    {"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]},
    {"id": "e8", "label": "Edit cart",    "type": "link",   "bbox": [...]}
  ]
}

Now the frontier model does the job it's good at: deciding e7 vs e8 given the agent's goal. A reasoning question over structured input. Cheap. Reliable. Auditable.

Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 35M LoRA parameters on a Qwen3-VL-2B base). Agent traces become debuggable. You read the structured grounding output before the reasoning step ever runs.

What Anthropic and OpenAI Ship Next

The frontier providers will absorb grounding into their own small models. Within twelve months, "fast vision" or "tool vision" tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-5 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull ~300k category downloads a month between them.

When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn't. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.

Stack owners who didn't split will find their grounding step has quietly become someone else's API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor's deprecation calendar.

The Diagnostic

Pull up your last failed agent trace. Three numbers:

Total tokens spent on vision calls per agent step.
Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).
Per-run vision cost at your current API rates.

If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.

Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.

I build the split layer. browserground is the open-source reference for the local grounding half. v0.3 ships three packagings so it drops into any stack:

npm CLI (daemon, HTTP REST server, batch, confidence, eval): npm install -g browserground → browserground parse <img> --target "..."
PyPI (no Node required, MLX or transformers): pip install "browserground[mlx]" → from browserground import click_xy
Ollama (cross-platform, GGUF Q4_K_M + f16 mmproj): ollama run renezander030/browserground

Adapters land in the repo for browser-use (drop-in Controller action) and Skyvern (ground_with_fallback for local-first + cloud-fallback). Model: huggingface.co/renezander030/browserground. MLX 4-bit: browserground-mlx. GGUF: browserground-gguf. Source: github.com/renezander030/browserground. Apache-2.0. v0.2 LoRA trained on 26k mixed-domain examples (macOS + Android + UIBert + web). PRs welcome, especially eval cases where it fails.

I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at renezander.com.