<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yanzhe Xie</title>
    <description>The latest articles on DEV Community by Yanzhe Xie (@fn8211).</description>
    <link>https://dev.to/fn8211</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3951127%2F44767ad8-38e9-42b6-a4f1-09f21f41abfd.png</url>
      <title>DEV Community: Yanzhe Xie</title>
      <link>https://dev.to/fn8211</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fn8211"/>
    <language>en</language>
    <item>
      <title>VLA or IL? A Controlled Dataset for Testing Whether Finetuning Turns Your VLA into a Fancy Imitation Learner</title>
      <dc:creator>Yanzhe Xie</dc:creator>
      <pubDate>Tue, 26 May 2026 00:33:13 +0000</pubDate>
      <link>https://dev.to/fn8211/vla-or-il-a-controlled-dataset-for-testing-whether-finetuning-turns-your-vla-into-a-fancy-4m6i</link>
      <guid>https://dev.to/fn8211/vla-or-il-a-controlled-dataset-for-testing-whether-finetuning-turns-your-vla-into-a-fancy-4m6i</guid>
      <description>&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;Robot manipulation is a robot's ability to interact with objects in the physical world: grasping them, moving them precisely, and adapting to changes in the environment. Traditional Imitation Learning (IL) approaches [&lt;a href="https://arxiv.org/abs/2304.13705" rel="noopener noreferrer"&gt;ACT&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2303.04137" rel="noopener noreferrer"&gt;Diffusion Policy&lt;/a&gt;] learn directly from human demonstrations, mapping visual observations to actions. While effective in controlled settings, these policies generalize poorly. Vision-Language-Action (VLA) models [&lt;a href="https://arxiv.org/abs/2307.15818" rel="noopener noreferrer"&gt;RT-2&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2406.09246" rel="noopener noreferrer"&gt;OpenVLA&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2410.24164" rel="noopener noreferrer"&gt;π series&lt;/a&gt;] represent a promising new paradigm. A VLA typically consists of a VLM backbone and an action expert: the VLM, pretrained on internet-scale vision-language data, provides rich high-level semantic understanding of the scene and the natural language instruction; the action expert then takes this semantic representation and outputs concrete robot actions. The entire architecture is trained end-to-end, enabling VLAs not only to understand what they are asked to do but also to execute it, rather than simply memorizing fixed scene-action mappings as traditional IL approaches do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k9kdlyjkjayhpblcs3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k9kdlyjkjayhpblcs3k.png" alt="VLA architecture" width="800" height="326"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A typical VLA model consisting of a VLM backbone and an action expert (image from &lt;a href="https://arxiv.org/abs/2410.24164" rel="noopener noreferrer"&gt;π₀&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A VLA model is first pretrained on large-scale diverse data to acquire general visual and language understanding, then finetuned on a smaller dataset of demonstrations for a target task and environment. However, recent work has raised serious concerns about this finetuning process. Several studies suggest that finetuning causes VLAs to degrade into imitation learners that memorize scene-specific action sequences from the training distribution rather than genuinely understanding the scene through the VLM backbone. &lt;a href="https://arxiv.org/abs/2510.03827" rel="noopener noreferrer"&gt;LIBERO-PRO&lt;/a&gt; finds that model trajectories remain nearly identical when the target object is replaced or removed, or when the instruction is corrupted. &lt;a href="https://arxiv.org/abs/2510.13626" rel="noopener noreferrer"&gt;LIBERO-Plus&lt;/a&gt; further shows that models fail when the target object is displaced.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Property I Test
&lt;/h2&gt;

&lt;p&gt;These observations raise a fundamental question: &lt;strong&gt;after finetuning, does the VLA degrade into a fancy imitation learner that relies purely on memorized scene-action mappings?&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;To test this, I identify two key properties that an effective VLA should satisfy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language grounding&lt;/strong&gt;: the action output by the model should correctly follow the given instruction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spatial generalization&lt;/strong&gt;: the model should locate the correct target object regardless of its position in the scene.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I design a controlled dataset that independently varies these two properties, forming a 2x2 experimental design. &lt;/p&gt;

&lt;p&gt;If a VLA truly understands language, changing the prompt to refer to a different object that is present in the scene should change the model's behavior accordingly. If a VLA truly generalizes spatially, moving the target object to an unseen position should not affect its ability to locate and grasp it. Failure in either case would suggest that the model relies on memorized scene-action mappings rather than genuine understanding.&lt;/p&gt;
&lt;h2&gt;
  
  
  Dataset Design
&lt;/h2&gt;

&lt;p&gt;VLA models are commonly finetuned on the &lt;a href="https://libero-project.github.io/datasets" rel="noopener noreferrer"&gt;LIBERO&lt;/a&gt; simulation benchmark. To precisely test language grounding and spatial generalization, I construct a controlled dataset based on one of its sub-suites, LIBERO-Object, which allows me to independently vary the prompt and object positions while keeping everything else fixed.&lt;/p&gt;

&lt;p&gt;In LIBERO-Object, each task shares the same structure: a floor scene with one target object and 5 distractor objects, where the robot must pick up the target object and place it in a basket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx87m2mbooj2zubtpdkxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx87m2mbooj2zubtpdkxj.png" alt=" " width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 10 tasks in LIBERO-Object are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick up the milk and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the tomato sauce and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the butter and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the cream cheese and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the orange juice and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the chocolate pudding and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the bbq sauce and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the ketchup and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the alphabet soup and place it in the basket&lt;/li&gt;
&lt;li&gt;Pick up the salad dressing and place it in the basket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To construct the 2x2 controlled dataset, I vary two factors independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt (Seen vs. Unseen)&lt;/strong&gt;: In the seen condition, the original training prompt is used (e.g., "Pick the milk and place it in the basket"). In the unseen condition, the prompt is changed to refer to a different object that is physically present in the scene as a distractor (e.g., "Pick the tomato sauce and place it in the basket"). Because the newly named object is always in the scene, a failure in this condition can be attributed to a lack of language grounding rather than to the object's absence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Position (Original vs. Shuffled)&lt;/strong&gt;: In the original condition, all objects remain in their training positions. In the shuffled condition, object positions are randomly reassigned across regions, such that the target object appears in a location never seen during training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This yields 4 conditions per task, and 40 controlled scenes in total:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Seen prompt&lt;/th&gt;
&lt;th&gt;Unseen prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Original position&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;Tests language grounding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shuffled position&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tests spatial generalization&lt;/td&gt;
&lt;td&gt;Tests both&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
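
&lt;p&gt;As a quick sanity check on the count, the 2x2 grid can be enumerated in a few lines of Python. This is only a sketch: the condition names follow the &lt;code&gt;original_seen&lt;/code&gt; / &lt;code&gt;shuffled_unseen&lt;/code&gt; pattern used later in this post, and the helper &lt;code&gt;enumerate_conditions&lt;/code&gt; is a hypothetical name, not part of LIBERO.&lt;/p&gt;

```python
from itertools import product

# Target objects of the 10 LIBERO-Object tasks listed above.
TARGETS = [
    "milk", "tomato sauce", "butter", "cream cheese", "orange juice",
    "chocolate pudding", "bbq sauce", "ketchup", "alphabet soup",
    "salad dressing",
]
POSITIONS = ["original", "shuffled"]
PROMPTS = ["seen", "unseen"]


def enumerate_conditions(targets):
    """Return one (target, condition_name) pair per controlled scene."""
    return [
        (target, f"{position}_{prompt}")
        for target, position, prompt in product(targets, POSITIONS, PROMPTS)
    ]


scenes = enumerate_conditions(TARGETS)
print(len(scenes))  # 10 tasks x 2 positions x 2 prompts
print(scenes[:4])   # the four conditions of the milk task
```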
&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;One example series from the controlled dataset is shown below. To highlight the target object in each scene, a blue circle is drawn around it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcsaw4lv2tcl8ovchjnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcsaw4lv2tcl8ovchjnz.png" alt=" " width="800" height="662"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How to Generate
&lt;/h2&gt;

&lt;p&gt;In LIBERO, each task is defined by a BDDL configuration file, which specifies the scene layout, object placements, and the natural language prompt. During both training and inference, the VLA model receives the &lt;code&gt;:language&lt;/code&gt; field as its prompt.&lt;/p&gt;

&lt;p&gt;Below is the baseline BDDL for the milk task (&lt;code&gt;original_seen&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(define (problem LIBERO_Floor_Manipulation)
  (:domain robosuite)
  (:language Pick the milk and place it in the basket)  ; [CHANGEABLE] language prompt

  (:objects
    milk_1 - milk
    basket_1 - basket
    cream_cheese_1 - cream_cheese
    tomato_sauce_1 - tomato_sauce
    butter_1 - butter
    orange_juice_1 - orange_juice
    chocolate_pudding_1 - chocolate_pudding
  )

  (:obj_of_interest
    milk_1    ; [CHANGEABLE] target object
    basket_1
  )

  (:init
    (On milk_1 floor_target_object_region)           ; [CHANGEABLE] object positions
    (On cream_cheese_1 floor_other_object_region_0)
    (On tomato_sauce_1 floor_other_object_region_1)
    (On butter_1 floor_other_object_region_2)
    (On orange_juice_1 floor_other_object_region_3)
    (On chocolate_pudding_1 floor_other_object_region_4)
    (On basket_1 floor_bin_region)                   ; fixed
  )

  (:goal
    (And (In milk_1 basket_1_contain_region))  ; [CHANGEABLE] target object
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate the controlled variants, I modify the fields marked &lt;code&gt;[CHANGEABLE]&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unseen prompt conditions&lt;/strong&gt;: The &lt;code&gt;:language&lt;/code&gt; field is changed to refer to a distractor object that is physically present in the scene. For example, “Pick the milk and place it in the basket” is changed to “Pick the tomato sauce and place it in the basket”. The &lt;code&gt;:obj_of_interest&lt;/code&gt; field is updated from &lt;code&gt;milk_1&lt;/code&gt; to &lt;code&gt;tomato_sauce_1&lt;/code&gt;, and the &lt;code&gt;:goal&lt;/code&gt; field is updated from &lt;code&gt;(In milk_1 basket_1_contain_region)&lt;/code&gt; to &lt;code&gt;(In tomato_sauce_1 basket_1_contain_region)&lt;/code&gt;.&lt;/p&gt;
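
&lt;p&gt;The three edits for the unseen prompt condition could be sketched as plain string rewriting. This is a minimal illustration, not necessarily how the released generation script works: &lt;code&gt;make_unseen_prompt&lt;/code&gt; is a hypothetical helper, &lt;code&gt;BASELINE&lt;/code&gt; is a trimmed copy of the milk BDDL, and the code assumes the prompt always has the form "Pick the X and place it in the basket".&lt;/p&gt;

```python
import re

# Trimmed copy of the baseline milk BDDL (assumption: same field layout).
BASELINE = """(define (problem LIBERO_Floor_Manipulation)
  (:language Pick the milk and place it in the basket)
  (:obj_of_interest
    milk_1
    basket_1
  )
  (:init
    (On milk_1 floor_target_object_region)
    (On tomato_sauce_1 floor_other_object_region_1)
  )
  (:goal
    (And (In milk_1 basket_1_contain_region))
  )
)"""


def make_unseen_prompt(bddl, old, new):
    """Retarget the task from `old` to `new` (instance names such as milk_1)."""
    # Prompt wording: "tomato_sauce_1" becomes "tomato sauce".
    old_words = old.rsplit("_", 1)[0].replace("_", " ")
    new_words = new.rsplit("_", 1)[0].replace("_", " ")
    # 1. Rewrite the :language prompt.
    bddl = bddl.replace(
        f"(:language Pick the {old_words} and",
        f"(:language Pick the {new_words} and",
    )
    # 2. Swap the target inside :obj_of_interest only.
    bddl = re.sub(
        r"(\(:obj_of_interest\s+)" + re.escape(old) + r"\b",
        lambda m: m.group(1) + new,
        bddl,
    )
    # 3. Swap the target inside the :goal predicate; :init is left untouched.
    return bddl.replace(f"(In {old} ", f"(In {new} ")


unseen = make_unseen_prompt(BASELINE, "milk_1", "tomato_sauce_1")
print(unseen)
```

&lt;p&gt;Note that the &lt;code&gt;:init&lt;/code&gt; section is deliberately left alone, so the objects stay where they were and only the instruction and success criterion change.&lt;/p&gt;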

&lt;p&gt;&lt;strong&gt;Shuffled position conditions&lt;/strong&gt;: The object placements in the &lt;code&gt;:init&lt;/code&gt; section are randomly reassigned across the available floor regions (&lt;code&gt;floor_target_object_region&lt;/code&gt; and &lt;code&gt;floor_other_object_region_0&lt;/code&gt; to &lt;code&gt;floor_other_object_region_4&lt;/code&gt;). For example, &lt;code&gt;milk_1&lt;/code&gt;, which was originally at &lt;code&gt;floor_target_object_region&lt;/code&gt;, may be reassigned to &lt;code&gt;floor_other_object_region_3&lt;/code&gt; after shuffling. The basket position at &lt;code&gt;floor_bin_region&lt;/code&gt; remains fixed.&lt;/p&gt;
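
&lt;p&gt;The shuffling step might look roughly like the sketch below. It makes two assumptions that go beyond the description above: a shuffle is redrawn until the target leaves its original region, and a fixed seed is used for reproducibility. The helpers &lt;code&gt;shuffle_positions&lt;/code&gt; and &lt;code&gt;to_init_lines&lt;/code&gt; are hypothetical names for illustration.&lt;/p&gt;

```python
import random

# Object placements from the baseline milk BDDL (basket excluded: it is fixed).
ORIGINAL = {
    "milk_1": "floor_target_object_region",
    "cream_cheese_1": "floor_other_object_region_0",
    "tomato_sauce_1": "floor_other_object_region_1",
    "butter_1": "floor_other_object_region_2",
    "orange_juice_1": "floor_other_object_region_3",
    "chocolate_pudding_1": "floor_other_object_region_4",
}


def shuffle_positions(placement, target, seed=0):
    """Randomly reassign regions so that `target` leaves its original region."""
    rng = random.Random(seed)
    objects = list(placement)
    regions = list(placement.values())
    while True:
        rng.shuffle(regions)
        shuffled = dict(zip(objects, regions))
        # Assumption: reject draws that keep the target where it was trained.
        if shuffled[target] != placement[target]:
            return shuffled


def to_init_lines(placement):
    """Render the placement as BDDL :init predicates; the basket stays fixed."""
    lines = [f"(On {obj} {region})" for obj, region in placement.items()]
    lines.append("(On basket_1 floor_bin_region)")
    return lines


shuffled = shuffle_positions(ORIGINAL, "milk_1", seed=0)
print("\n".join(to_init_lines(shuffled)))
```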

&lt;p&gt;The generation script and the full dataset are available at: &lt;a href="https://github.com/FN8211/Control-Dataset" rel="noopener noreferrer"&gt;https://github.com/FN8211/Control-Dataset&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Preliminary Results
&lt;/h2&gt;

&lt;p&gt;To validate the dataset, I ran &lt;a href="https://arxiv.org/abs/2504.16054" rel="noopener noreferrer"&gt;pi0.5&lt;/a&gt; with the LIBERO finetuned checkpoint on the four conditions using the milk task. The results are shown below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Seen prompt&lt;/th&gt;
&lt;th&gt;Unseen prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Original position&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Success&lt;/td&gt;
&lt;td&gt;❌ Failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shuffled position&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Failure&lt;/td&gt;
&lt;td&gt;❌ Failure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d6cg34630zk79s1oheu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d6cg34630zk79s1oheu.gif" alt=" " width="320" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;original_seen: Pick the milk and place it in the basket (original position) — Success&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrskquyourhedgxn3ed9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrskquyourhedgxn3ed9.gif" alt=" " width="320" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;original_unseen: Pick the tomato sauce and place it in the basket (original position) — Failure&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlw9bt2zi92t1safk1in.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlw9bt2zi92t1safk1in.gif" alt=" " width="320" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;shuffled_seen: Pick the milk and place it in the basket (shuffled position) — Failure&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfxps9m5o7d2v2g5ghcl.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfxps9m5o7d2v2g5ghcl.gif" alt=" " width="320" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;shuffled_unseen: Pick the tomato sauce and place it in the basket (shuffled position) — Failure&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this preliminary run on the milk task, the model succeeds only in the baseline condition, where both the prompt and the object positions match the training distribution. Changing either the prompt or the object positions causes failure, even though the target object is still present in the scene.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
