Vision-language-action models, the AI systems that let a robot see a scene, read an instruction, and act on it, lose most of their general world knowledge when they are fine-tuned for physical control. A study released this week finds these robot models keep the narrow knowledge that helps them manipulate objects, like color, but collapse to near coin-flip accuracy on questions their underlying vision-language models answered almost perfectly, such as whether something is alive or who a public figure is.
Key facts
- Robot control models scored roughly 38-58% (near or below chance on two-choice questions) on "living world," "celebrity," and "attribute" questions where their source models scored 64-100%.
- The one category they held onto was color (roughly 82-91%), the knowledge most useful for grasping and sorting objects.
- The team tested 7-8 robot models against 9 vision-language baselines on 1,720 questions across 12 categories.
- Source: the paper "Does VLA Even Know the Basics?" with an open project page and code on GitHub.
The pitch behind vision-language-action (VLA) models is seductive: take a model that already understands the visual world from internet-scale training, teach it to output robot movements, and you get a machine with broad common sense that can also act. But turning a general model into a robot controller means fine-tuning it heavily on manipulation data, and this study asks the uncomfortable question of what that process quietly destroys.
The clever part is the test itself. You cannot just ask a robot "is this a dog?" and check its words, because a robot's whole job is to act, not talk. So the researchers, a team spanning several Russian AI labs including HSE University's Cognitive AI and FusionBrain groups, built a protocol called Act2Answer. It converts standard knowledge quizzes into tabletop episodes: the robot must physically place a cube on one of two candidate answer images. That isolates the pure question, does the model still know the fact, from the confound of whether it can control the arm at all.
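To make the idea concrete, here is a minimal sketch of how a quiz item could be turned into a two-choice placement episode and scored. It is not the authors' code: the names `QuizItem`, `Episode`, `to_episode`, and `score` are hypothetical, and randomizing which side holds the correct image is an assumption added here so that a policy that always goes to one side cannot score above chance.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuizItem:
    """A standard knowledge question with one right and one wrong image (hypothetical schema)."""
    question: str
    correct_image: str
    distractor_image: str


@dataclass
class Episode:
    """A tabletop episode: the robot must put the cube on the left or right image."""
    instruction: str
    left_image: str
    right_image: str
    correct_side: str  # "left" or "right"


def to_episode(item: QuizItem, rng: random.Random) -> Episode:
    """Convert a quiz item into an episode, randomizing the side of the correct image (assumption)."""
    correct_side = rng.choice(["left", "right"])
    if correct_side == "left":
        left, right = item.correct_image, item.distractor_image
    else:
        left, right = item.distractor_image, item.correct_image
    instruction = f"Place the cube on the image that answers: {item.question}"
    return Episode(instruction, left, right, correct_side)


def score(episodes: List[Episode], placements: List[Optional[str]]) -> float:
    """Fraction of episodes where the cube landed on the correct image.

    A placement of None (cube on neither image) counts as wrong.
    """
    hits = sum(1 for ep, side in zip(episodes, placements) if side == ep.correct_side)
    return hits / len(episodes)


# Usage: a policy that ignores the question and always goes left.
rng = random.Random(0)
items = [QuizItem("Which one is alive?", f"animal_{i}.png", f"object_{i}.png") for i in range(100)]
episodes = [to_episode(item, rng) for item in items]
always_left = ["left"] * len(episodes)
print("always-left accuracy:", score(episodes, always_left))
```

With two candidates per episode, a model that has lost the fact but can still move the arm should land near 50%, which is what makes the coin-flip comparison meaningful.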
The results are stark. On "is this alive and what kind of animal," the vision-language baselines scored in the mid-90s; the robot models scored around 45-58%, no better than guessing. On celebrity recognition, baselines hit 99-100% while robot models fell to 38-55%. Even on basic object attributes, several robot models could not beat a coin flip on questions their own un-fine-tuned backbone answered two-thirds of the time. The single exception was color, where the robot models held up, precisely because color is directly useful when your training data is all about picking things up. As the authors put it, "VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs."
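"No better than guessing" can be checked with an exact binomial test against a 50% chance rate. The sketch below assumes about 143 questions per category, which is simply 1,720 divided by 12 and may not match the real per-category counts; the two accuracies are illustrative values picked from inside the ranges quoted above, not figures for any specific model.

```python
from math import comb


def binom_two_sided_p(hits: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis of success rate `p`.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[hits]
    # Small relative tolerance so outcomes tied with the observed one are included.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


n = 143  # hypothetical: 1,720 questions / 12 categories, assuming an even split
for label, accuracy in [("near-chance category", 0.52), ("color", 0.91)]:
    hits = round(accuracy * n)
    print(f"{label}: {hits}/{n} correct, p-value vs. chance = {binom_two_sided_p(hits, n):.3g}")
```

Under these assumed counts, a score in the low 50s is statistically indistinguishable from guessing, while a score around 90% is far outside what guessing could produce.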
The analogy is a brilliant student who trains so hard for one specific job that they forget everything outside it. Ask them anything about their narrow task and they shine; ask them who painted the Mona Lisa and they blank, even though they knew it a year ago. The paper adds a hopeful wrinkle: probing the network layer by layer shows the correct answer often still lives in the model's middle layers. The knowledge is not fully erased; it just fails to reach the part of the network that chooses the action. The models that fared best, like Magma, were the ones trained with visual-question-answering mixed in alongside robot control.
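Layer-by-layer probing can be sketched as follows: take the activations from one layer, fit a very simple classifier on them, and see whether it can recover the right answer. The example uses a nearest-centroid classifier as a stand-in for whatever probe the paper uses, and it runs on synthetic activations, not real model internals. The `signal_by_layer` values are invented purely to mimic the reported pattern (answer readable mid-network, not at the end), and every function name is hypothetical.

```python
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train, test):
    """Fit class centroids on `train`, then classify `test` by nearest centroid.

    Both arguments are lists of (activation_vector, label) with labels 0 or 1.
    """
    c0 = centroid([v for v, y in train if y == 0])
    c1 = centroid([v for v, y in train if y == 1])
    correct = sum(
        1 for v, y in test if (sq_dist(v, c1) < sq_dist(v, c0)) == (y == 1)
    )
    return correct / len(test)


def make_layer(rng, signal, n=200, dim=4):
    """Synthetic activations: label-1 vectors are shifted by `signal` in every dimension."""
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(signal * label, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


# Invented signal strengths, chosen only to illustrate the pattern described above.
signal_by_layer = {"early": 0.5, "middle": 3.0, "final": 0.0}
rng = random.Random(0)
for name, signal in signal_by_layer.items():
    train, test = make_layer(rng, signal), make_layer(rng, signal)
    print(f"{name} layer probe accuracy: {probe_accuracy(train, test):.2f}")
```

A probe that reads the answer well from a middle layer but not from the final one is the signature the paper describes: the information is present inside the network but is not carried through to the action output.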
Why this matters: robotics companies routinely market these models as broadly capable, open-world generalists. This is a rigorous, numbers-backed demonstration that a robot which looks competent at its trained tasks may have silently lost the everyday understanding you would assume it has, and standard robot benchmarks, which only measure task success, would never catch it. A home robot that cannot reliably tell whether something is alive poses a real safety concern, not a hypothetical one.
The honest caveat: this measures knowledge retention, not real-world task performance, and one category (color) survived intact, so the loss is selective rather than total. It also does not prove the trade-off is unavoidable; the better-retaining models point to fixes like co-training on general questions. But it sharpens a warning the field has been circling, echoing earlier findings that world models forget what they are not actively using. Broad competence and narrow skill may not come free together, at least not yet.
Originally published on Ground Truth, where every claim is checked against the primary source.