Online skill distillation and graph‑guided knowledge substantially reduce the compute bill of LLM agents while keeping success rates competitive. These reductions could enable shifting agents from cloud‑only services to modest on‑device hardware.
Earlier web and mobile agents leaned on heavyweight tricks: multi‑rollout searches, separate verifier passes, and stacks of specialist vision‑language models that balloon token counts and memory footprints. Such pipelines achieved high task success, but only at a steep inference cost.
PANDO reaches a 58.3 % success rate on the full 910‑task VisualWebArena suite at 115 K tokens per task, and it does so “while using 58% fewer tokens than SGV and 61% fewer than WALT, with no pre‑evaluation discovery budget” [1]. This headline figure suggests that a single‑rollout, online distillation loop can surpass strong baselines without the token overhead that web agents have traditionally relied on.
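As a rough sanity check, the baseline budgets implied by those percentages can be back‑computed. The sketch below assumes the quoted reductions are measured on mean tokens per task against the same 115 K figure, which the quote does not state explicitly; `implied_baseline_tokens` is a hypothetical helper, and the outputs are estimates rather than reported numbers.

```python
def implied_baseline_tokens(tokens: float, fraction_fewer: float) -> float:
    """Solve tokens = baseline * (1 - fraction_fewer) for the baseline budget."""
    if not 0.0 <= fraction_fewer < 1.0:
        raise ValueError("fraction_fewer must be in [0, 1)")
    return tokens / (1.0 - fraction_fewer)


PANDO_TOKENS = 115_000  # tokens per task, as reported in [1]

# Reductions quoted in [1]; treating them as per-task means is an assumption.
for name, fraction in [("SGV", 0.58), ("WALT", 0.61)]:
    implied = implied_baseline_tokens(PANDO_TOKENS, fraction)
    print(f"{name}: roughly {implied / 1000:.0f} K tokens per task (implied)")
```

Under that assumption, the baselines would sit at roughly 274 K and 295 K tokens per task.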
Beyond task success, PANDO also reports the best intrinsic efficiency among automated methods: the lowest Action Repetition Rate (9.1 %), the lowest Step Overhead Ratio (1.8), and the highest prompt‑cache utilization (72.4 %) [1]. Fewer repeated actions and fewer surplus steps mean fewer model calls per task, and a larger cached share of each prompt should lower per‑step latency.
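To make these three metrics concrete, here is a minimal sketch of how they could be computed from a logged trajectory. The definitions are assumptions, not the ones used in [1]: repetition is counted as a step identical to the one immediately before it, step overhead is steps taken divided by a reference step count, and cache utilization is the cached share of prompt tokens. All function names and the toy trajectory are hypothetical.

```python
from typing import Sequence


def action_repetition_rate(actions: Sequence[str]) -> float:
    """Share of steps that repeat the immediately preceding action (assumed definition)."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / len(actions)


def step_overhead_ratio(steps_taken: int, reference_steps: int) -> float:
    """Steps the agent used per step of a reference solution (assumed definition)."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return steps_taken / reference_steps


def prompt_cache_utilization(cached_tokens: int, prompt_tokens: int) -> float:
    """Share of prompt tokens served from the provider's prompt cache (assumed definition)."""
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return cached_tokens / prompt_tokens


# Toy trajectory, invented for illustration only.
trajectory = ["click:search", "type:query", "type:query", "click:submit", "click:result"]
arr = action_repetition_rate(trajectory)          # 1 repeat over 5 steps
sor = step_overhead_ratio(steps_taken=9, reference_steps=5)
cache = prompt_cache_utilization(cached_tokens=724, prompt_tokens=1000)
```

Read this way, a ratio of 1.8 would mean the agent takes nearly twice as many steps as the reference path.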
UI‑KOBE lifts a lightweight 4 B‑parameter backbone to 70.7 % success on mobile GUI tasks, “substantially outperforming the same backbone model without graph guidance, which achieves 58.6 %” [2]. The gain comes from re‑using an app‑specific knowledge graph that steers the agent through local decisions instead of forcing a monolithic end‑to‑end planner.
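The idea of reducing a task to guided local decisions can be illustrated with a toy sketch. It is not UI‑KOBE's algorithm: it assumes the knowledge graph maps each screen to the actions known to leave it, and it uses a plain breadth‑first search as a stand‑in for whatever guidance [2] actually applies. The graph contents, screen names, and `next_guided_action` are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical app graph: screen -> list of (action, resulting screen).
AppGraph = Dict[str, List[Tuple[str, str]]]


def next_guided_action(graph: AppGraph, current: str, goal: str) -> Optional[str]:
    """Return the first action on a shortest known path from current to goal.

    Returns None when already at the goal or when the explored graph has no path.
    """
    if current == goal:
        return None
    queue = deque([(current, None)])  # (screen, first action taken from `current`)
    seen = {current}
    while queue:
        screen, first_action = queue.popleft()
        for action, next_screen in graph.get(screen, []):
            if next_screen in seen:
                continue
            head = first_action if first_action is not None else action
            if next_screen == goal:
                return head
            seen.add(next_screen)
            queue.append((next_screen, head))
    return None


toy_graph: AppGraph = {
    "home": [("tap_settings", "settings"), ("tap_profile", "profile")],
    "settings": [("tap_display", "display")],
    "display": [("toggle_dark_mode", "dark_mode_on")],
}

hint = next_guided_action(toy_graph, "home", "dark_mode_on")
```

In this setup the small model only has to confirm or execute one suggested step per screen, and a `None` result marks the point where the explored graph runs out.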
The results leave practical questions open. PANDO still exhibits a 9.1 % action‑repetition rate and a step‑overhead ratio of 1.8, so some inefficiency remains even after distillation. UI‑KOBE “does introduce an exploration cost, averaging $6.2 and 6.4 hours per app” [2], and its graph is only as good as the explored UI surface, which limits portability to unseen applications. Moreover, the formulation “reduces GUI task execution to a sequence of guided local decisions, significantly lowering the reasoning burden on small models” [2], a reduction that may not hold for highly dynamic or multimodal interfaces.
If these savings hold across domains, the community should start evaluating agents under strict token‑budget constraints rather than on raw success alone. Revisiting VisualWebArena and mobile GUI benchmarks with a ≤ 120 K token ceiling per task would surface designs that can realistically move to on‑device deployment.
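A budget‑constrained score of this kind is simple to compute from existing logs. The sketch below assumes each episode records its outcome and total token use, and it counts any episode over the ceiling as a failure; the `Episode` record, `budgeted_success_rate`, and the sample episodes are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Episode:
    task_id: str
    success: bool
    tokens_used: int


def budgeted_success_rate(episodes: Iterable[Episode], token_ceiling: int = 120_000) -> float:
    """Success rate where any episode exceeding the token ceiling counts as a failure."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("no episodes to score")
    solved = sum(1 for e in episodes if e.success and e.tokens_used <= token_ceiling)
    return solved / len(episodes)


# Invented sample log, for illustration only.
log = [
    Episode("t1", True, 90_000),
    Episode("t2", True, 150_000),   # solved, but over budget
    Episode("t3", False, 60_000),
    Episode("t4", True, 120_000),   # exactly at the ceiling still counts
]

raw_rate = sum(e.success for e in log) / len(log)
capped_rate = budgeted_success_rate(log)
```

Reporting both numbers side by side shows how much of an agent's success depends on spending past the ceiling.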