Run the same image through the same Qwen2.5-VL model on different runtimes, and it can cost anywhere from 8 to 16,384 visual tokens — a 2,000× spread — depending on which inference stack you picked. Nobody changed the model. They just disagree about one config value: max_pixels.
This post is three things: a one-line identity that makes max_pixels legible, a measured set of accuracy-vs-budget curves showing that the optimal cap depends on the size of the thing you're trying to find, and a survey of what every major runtime defaults to.
TL;DR: max_pixels ÷ 28² is your image-token budget. The vendor recommends ≤ 1,280 tokens; most runtimes silently ship 16,384. Measured on real 4K screenshots: big targets (panels) peak at 1,280, and more tokens make them worse. Tiny targets (icons) never stop gaining. Set the cap to the size of what you're hunting.
Decoding max_pixels: one line of math
Qwen-VL models turn pixels into tokens in two stages. The vision transformer slices the image into non-overlapping 14×14 px patches; then a small MLP merges every 2×2 block of neighboring patches into one token. Compose the two and each LLM token owns a contiguous 28×28 px square of the image:
factor = patch_size × merge_size # 28 for Qwen2/2.5-VL, 32 for Qwen3-VL
tokens = (W × H) / factor² # area ÷ area-per-token
Which means the max_pixels value every config ships is just a token cap wearing a big number:
max_pixels = max_tokens × factor²
Decode the Qwen2.5-VL checkpoint default with it: max_pixels = 12,845,056 = 16,384 × 28². That's a 16,384-token image budget. The model card's own recommended range is 256–1,280 tokens. The shipped default is 12.8× the recommended ceiling — and because smart_resize only downscales to the cap, every sufficiently large image silently runs at a resolution the model wasn't tuned for.
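To make the effect of the cap concrete, here is a minimal sketch of the resize-then-count step. It is a simplified approximation of smart_resize, not the library code: it snaps each side to a multiple of factor and shrinks the image only when its area exceeds the cap. The name resized_tokens and its parameters are my own, and the minimum-pixel floor and aspect-ratio check of the real preprocessing are left out.

```python
import math


def resized_tokens(width, height, max_tokens, factor=28):
    """Approximate the resized size and token count for one image.

    Simplified sketch (an assumption, not the library implementation):
    snap sides to multiples of `factor`, then downscale with the aspect
    ratio preserved only if the area exceeds the pixel cap.
    """
    max_pixels = max_tokens * factor * factor
    # Snap each side to the nearest multiple of `factor`, at least one cell.
    w = max(factor, round(width / factor) * factor)
    h = max(factor, round(height / factor) * factor)
    if w * h > max_pixels:
        # Over budget: shrink both sides by the same ratio, rounding down
        # so the result never exceeds the cap.
        scale = math.sqrt((width * height) / max_pixels)
        w = max(factor, math.floor(width / scale / factor) * factor)
        h = max(factor, math.floor(height / scale / factor) * factor)
    return w, h, (w // factor) * (h // factor)


# A 3840x2160 screenshot under the shipped cap and under the recommended cap.
uncapped = resized_tokens(3840, 2160, max_tokens=16384)
capped = resized_tokens(3840, 2160, max_tokens=1280)
print(uncapped)
print(capped)
```

Under these assumptions, the 4K frame fits inside the 16,384-token cap and is encoded at roughly 10.5k tokens, while the 1,280 cap shrinks it to about 1.2k tokens.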
(The recommended token budget is the same across the whole Qwen-VL line; only factor changes per model — 28 vs 32. So Qwen3-VL's recommended pixel cap is 1280 × 32² = 1,310,720, not 1,003,520. The token number is the portable one; the pixel number is not.)
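The identity is small enough to check by hand, but a short sketch makes the portability point explicit. The helper names below are illustrative, not part of any library; the factors are the ones quoted above.

```python
def max_pixels_to_tokens(max_pixels, factor):
    """Token budget implied by a max_pixels value."""
    return max_pixels // (factor * factor)


def tokens_to_max_pixels(max_tokens, factor):
    """Pixel cap that corresponds to a token budget."""
    return max_tokens * factor * factor


QWEN25_FACTOR = 14 * 2  # patch_size x merge_size for Qwen2/2.5-VL
QWEN3_FACTOR = 32       # factor quoted above for Qwen3-VL

# Decode the shipped Qwen2.5-VL default, then encode the recommended
# 1,280-token ceiling for both factors.
default_tokens = max_pixels_to_tokens(12_845_056, QWEN25_FACTOR)
cap_qwen25 = tokens_to_max_pixels(1280, QWEN25_FACTOR)
cap_qwen3 = tokens_to_max_pixels(1280, QWEN3_FACTOR)
print(default_tokens, cap_qwen25, cap_qwen3)  # 16384 1003520 1310720
```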
So what's the right cap? Wrong question — right for what target?
An earlier article established that this cap is the load-bearing preprocessing variable for grounding (capping at 1,280 alone lifted mean IoU from 0.54 to 0.74 on high-res screenshots — Bug #5 there). This post asks the next question: is 1,280 right for everything?
I first ran the obvious follow-up — sweep the budget, find the knee — and got a clean answer that fell apart the moment I controlled for one variable: how big the target is relative to the frame. So here is the experiment that holds up.
Setup. Qwen2.5-VL-7B-4bit on Apple Silicon (mlx-vlm), grounding on real ~4K screenshots from ScreenSpot-Pro (26 professional apps). 200 target elements spanning the size spectrum — from sub-0.1% icons (ScreenSpot-Pro's own human-annotated targets) up to >5% panels and canvases. Budgets 256 / 1,280 / 4,096 / 16,384 tokens; metric is IoU against the reference box.
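For reference, a minimal sketch of the two quantities the experiment leans on: IoU between a predicted and a reference box, and a target's share of the frame. The (x1, y1, x2, y2) pixel box format and the function names are assumptions made for illustration; the actual harness may represent boxes differently.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def area_fraction(box, frame_w, frame_h):
    """Share of the frame covered by a box, used to pick the size bin."""
    return (box[2] - box[0]) * (box[3] - box[1]) / (frame_w * frame_h)


# A prediction that covers the reference but overshoots its right edge.
reference = (100, 100, 300, 300)
predicted = (100, 100, 400, 300)
score = iou(predicted, reference)

# A 64x64 px icon on a 3840x2160 frame.
icon_share = area_fraction((0, 0, 64, 64), 3840, 2160)
print(round(score, 3), icon_share)
```

The overshooting box scores about 0.67 even though it fully contains the target, and the 64 px icon covers under 0.1% of the frame, which puts it in the smallest size bin.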
Where the large-target references come from (ScreenSpot-Pro has essentially no large-target annotations — its biggest is 4.7% of frame):
- The model proposes named regions it recognizes on each screen.
- Each proposal is re-localized by an independent zoom-in pass at native resolution.
- Every element passes an ambiguity screen: if its description also matches another region on the same screen, or the box fails crop-level verification, it's out.
That screen removed 64% of candidates.
The screen isn't pedantry — it's load-bearing. An ambiguous description doesn't just add noise; it systematically fakes a high-budget penalty: at a low budget the model picks the one salient candidate, at a high budget it can suddenly resolve the other match and box that instead. On the contaminated subset the high-budget drop measures −30%. On the cleaned set, −18%. The cleaned number is the claim.
The result — peaks march right as targets shrink:
| target size (share of frame) | n | GT source | IoU @ 256 tokens | IoU @ 1,280 | IoU @ 4,096 | IoU @ 16,384 | optimum budget (tokens) |
|---|---|---|---|---|---|---|---|
| large (>5% of frame) | 36 | audited | 0.343 | 0.643 | 0.574 | 0.526 | 1,280 |
| medium (1–5%) | 8 | audited | 0.221 | 0.295 | 0.279 | 0.277 | 1,280, then flat |
| small (0.1–1%) | 9 | human | 0.029 | 0.111 | 0.179 | 0.154 | ~4,096 |
| tiny (<0.1%) | 43 | human | 0.003 | 0.040 | 0.056 | 0.079 | still climbing at 16,384 |
Two failure directions, and they belong to different size regimes:
- Large targets degrade above ~1,280 tokens (−18% IoU by 16k, audited — and 9.7s → 62s per query on these 4K images, a 6× latency tax). The failure mode is specific: the model still finds the panel (click-through-center holds at 0.86 → 0.75) but the predicted box drifts past the true boundary. More tokens, worse geometry.
- Tiny targets never stop gaining. Their curve is still rising at the model's architectural ceiling. An icon a few dozen pixels wide on a 4K frame dissolves at any budget the large-target regime would call sane.
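The per-bin optima and the headline percentages can be rechecked directly from the table. A small sketch, with the IoU values copied from the rows above and the latency figures from the large-target bullet; the helper names are mine.

```python
BUDGETS = (256, 1280, 4096, 16384)

# IoU per size bin at each budget, copied from the results table.
IOU = {
    "large": (0.343, 0.643, 0.574, 0.526),
    "medium": (0.221, 0.295, 0.279, 0.277),
    "small": (0.029, 0.111, 0.179, 0.154),
    "tiny": (0.003, 0.040, 0.056, 0.079),
}


def best_budget(scores):
    """Budget with the highest IoU in one row."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return BUDGETS[best_index]


def relative_change(scores, from_budget, to_budget):
    """Relative IoU change between two budgets in one row."""
    before = scores[BUDGETS.index(from_budget)]
    after = scores[BUDGETS.index(to_budget)]
    return (after - before) / before


optima = {name: best_budget(scores) for name, scores in IOU.items()}
large_drop = relative_change(IOU["large"], 1280, 16384)
latency_ratio = 62 / 9.7  # seconds per query at 16,384 vs 1,280 tokens

print(optima)
print(f"{large_drop:.1%}", f"{latency_ratio:.1f}x")
```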
Why would more detail hurt? The budget isn't a compute limit; it's a distribution limit: 16,384 tokens is 12.8× the top of the recommended range, and long-range box geometry is what goes first when position statistics drift out of the range the model was tuned on.
So the model card's 1,280 isn't a universal sweet spot — it's the large-target optimum, and the crossover point where the two regimes trade places. If your targets are panels, 1,280 is exactly right. If they're icons on 4K screens, no flat cap is right, which is why the current research wave (more below) zooms instead of raising resolution.
Honest scope notes: one model family, one task family (UI grounding), and the large-target references are model-anchored (validated, but not third-party human labels — those don't exist for large UI regions; the small/tiny rows are pure human GT). The medium/small bins are thin (n=8/9). And one finding about the method itself: asked to propose elements, the model essentially cannot name tiny ones uniquely — only 2 of the 148 elements it proposed were tiny — so the tiny bin rides on ScreenSpot-Pro's human annotations.
This is an old law in new units
If "optimal scale depends on target size" sounds familiar, it should — other fields measured it years ago. Object detection: SNIP (CVPR 2018) showed large objects "become too big to be correctly classified" at high resolution, and HRDNet measured large-category AP dropping 7.6 points when input resolution doubled. Aerial imagery: SAHI tiles images for small objects but has to add full-frame inference back specifically to recover large ones. The principle is scale-space theory's "characteristic scale" (Lindeberg, IJCV 1998): every structure has one scale at which it's best detected.
What hasn't existed — as far as I can find — is this curve in VLM token-budget units, binned by target size, on screens. The pieces are published separately: Phi-Ground measured token-budget ablations (plateau past ~2k, no size bins); AdaZoom-GUI measured unconditional zoom hurting easy targets; Qwen's own report (Table 7) saw the off-distribution mechanism from the upscaling side; I reported the shipped-default discrepancy in mlx-vlm #1175. This post adds the size-binned curve that connects them. Connecting dots, not discovering them.
It also explains why the 2025–26 GUI-grounding literature (ZoomClick, UI-Zoomer, MEGA-GUI, AdaZoom-GUI) converged on coarse-pass-then-conditional-zoom: a first pass at a moderate cap is the large-target optimum and localizes coarsely; the zoom pass gives small targets high effective resolution inside a crop. The curve above is, in effect, the operating-point table those policies have been picking by ad-hoc sweep.
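The coordinate bookkeeping behind a coarse-then-zoom pass is simple enough to sketch. This is a generic illustration, not the method of any of the papers named above: the function names, the padding ratio, and the assumption that the second-pass box is expressed in the crop's native pixels are all mine.

```python
def padded_crop(box, frame_w, frame_h, pad=0.5):
    """Expand a coarse (x1, y1, x2, y2) box by `pad` of its size per side,
    clamped to the frame. Returns the crop rectangle in frame coordinates."""
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * pad, (y2 - y1) * pad
    return (
        max(0, int(x1 - dx)),
        max(0, int(y1 - dy)),
        min(frame_w, int(x2 + dx)),
        min(frame_h, int(y2 + dy)),
    )


def crop_to_frame(box_in_crop, crop):
    """Map a box predicted inside the crop back to full-frame coordinates."""
    off_x, off_y = crop[0], crop[1]
    x1, y1, x2, y2 = box_in_crop
    return (x1 + off_x, y1 + off_y, x2 + off_x, y2 + off_y)


def area_ratio(crop, frame_w, frame_h):
    """Frame area over crop area: an upper bound on the resolution gain
    when both are resized to the same pixel cap (lower if the crop is
    already under the cap)."""
    crop_area = (crop[2] - crop[0]) * (crop[3] - crop[1])
    return frame_w * frame_h / crop_area


# First pass localizes coarsely; second pass runs on the padded crop.
coarse = (1000, 500, 1200, 600)
crop = padded_crop(coarse, 3840, 2160)
refined = crop_to_frame((110, 60, 150, 90), crop)
print(crop, refined, area_ratio(crop, 3840, 2160))
```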
What every runtime actually defaults to
Every major stack implements the same resize algorithm (HF's smart_resize, sometimes imported verbatim) — and then they all pick different budgets:
| runtime | effective default budget | notes |
|---|---|---|
| transformers | 16,384 tokens | the checkpoint config wins over the class default of 1,280 |
| vLLM | 16,384 | inherits the checkpoint; profiling reserves 16,384/image |
| SGLang | 16,384 | env default SGLANG_IMAGE_MAX_PIXELS = 16384·28² |
| TensorRT-LLM / NeMo | 16,384 | reuse the HF processor as-is |
| llama.cpp | max 4,096, min 8 (!) | so low it logs a warning telling you to pass --image-min-tokens 1024 for grounding |
| Ollama | ~1,280 | hardcoded ~1 MP cap — closest to the recommended value |
| mlx-swift-lm | 1,280 | defaults to the card's recommendation as of PR #243, override per request |
Same model, same algorithm, defaults spanning three orders of magnitude. If your Qwen-VL grounding accuracy differs across serving stacks, check this value before suspecting the weights. The inflated default is dangerous precisely because it only bites on large inputs: test on 1080p screenshots and everything looks fine; feed a Retina capture hunting for a panel and the boxes quietly bloat.
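Checking the value takes only a few lines. The sketch below assumes a preprocessor_config.json that exposes max_pixels, patch_size, and merge_size as top-level keys, as Qwen2.5-VL-style configs do; other checkpoints may lay this out differently, and the JSON string here is a trimmed stand-in rather than a real file.

```python
import json


def budget_from_config(config_text):
    """Token budget implied by a preprocessor config (assumed key names)."""
    cfg = json.loads(config_text)
    factor = cfg["patch_size"] * cfg["merge_size"]
    return cfg["max_pixels"] // factor ** 2


# Stand-in for the contents of a preprocessor_config.json.
example = '{"max_pixels": 12845056, "patch_size": 14, "merge_size": 2}'
tokens = budget_from_config(example)
print(tokens, "inflated" if tokens > 1280 else "within recommended range")
```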
What to actually do
- Know your size regime, then set the cap from the table. Panels, windows, layout regions: max_pixels = 1280 × factor², the vendor number, now with a measured reason. Buttons and fields on ordinary screenshots: headroom to ~2–4k tokens. Icons on 4K: no flat cap saves you; budget what latency allows, or use a coarse-then-zoom pass like the papers above.
- Never let large images run uncapped into the ≥5k zone when your targets are large: the curve says you pay 6× latency to lose accuracy.
- Per runtime. transformers/vLLM/SGLang: pass max_pixels explicitly (e.g. mm_processor_kwargs) rather than trusting the checkpoint config. llama.cpp: raise the floor (--image-min-tokens 1024) exactly as its warning says. Ollama and current mlx-swift-lm: you're already at the large-target optimum; raise deliberately for small-target workloads.
- Don't copy pixel numbers across models. Qwen3's factor is 32, so the same token budget is a different max_pixels.
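The advice above can be sketched as a small policy helper. The thresholds mirror the size bins in the results table; the function names and the choice to return None for tiny targets are illustrative assumptions, not a tested policy. The 256-token floor is the low end of the recommended range.

```python
def suggested_budget(area_fraction):
    """Token budget for a target covering `area_fraction` of the frame.

    Returns None for tiny targets, where no flat cap was enough in the
    measurements and a zoom pass is the better tool.
    """
    if area_fraction >= 0.01:   # medium and large bins: optimum at 1,280
        return 1280
    if area_fraction >= 0.001:  # small bin: peak near 4,096
        return 4096
    return None                 # tiny bin: still climbing at 16,384


def pixel_kwargs(max_tokens, factor=28, min_tokens=256):
    """Pixel bounds for a token budget; `factor` is 28 or 32 per model."""
    return {
        "min_pixels": min_tokens * factor * factor,
        "max_pixels": max_tokens * factor * factor,
    }


panel_budget = suggested_budget(0.08)   # a panel covering 8% of the frame
kwargs_qwen25 = pixel_kwargs(panel_budget)
kwargs_qwen3 = pixel_kwargs(panel_budget, factor=32)
print(kwargs_qwen25, kwargs_qwen3)
```

The resulting dictionary holds the kind of values you would hand to your runtime's processor settings; how they are passed differs per stack, so check the documentation for the exact parameter names.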
Provenance
This grew out of an earlier project getting Qwen2.5-VL grounding working natively in Swift on macOS, where the pixel cap turned out to be the load-bearing preprocessing variable. The default-to-recommended-budget change is merged upstream in mlx-swift-lm #243. The sweep harness, the element set with its audit verdicts, and the raw per-test results are reproducible; the budget math is checkable from any model's preprocessor_config.json in two lines.
If you maintain a runtime that serves Qwen-VL models and ships the 16,384 default, the fix is one line, and your users' large-target grounding gets better and 6× faster.


