
Artem X

Posted on • Originally published at habr.com

How I Loaded a Compact Open LLM Into a Robot and Told It to Walk (and Grab Things)

Let us get straight to it.

All artifacts, as usual, are linked at the end of the article: model weights on Hugging Face and source code on Codeberg.

What Is This Article About?

I will describe how I trained Google's 270M-parameter Gemma-3 language model to control a tracked robot with a robot arm in the MuJoCo environment, using natural-language commands from a human.

It can move freely around the map, go forward and backward, turn left and right, grab objects, and put them down.

How the Idea Appeared

I have a DIY tracked robot kit with a claw at home. When I came across Gemma-3-270M, which is surprisingly capable for its size, I had the idea of loading it onto that robot and letting it do the driving.

Here is the DIY tracked robot kit with a claw.

According to my estimates, this model should fit into a Raspberry Pi Zero 2 W without problems. The idea was to make an intelligent robot that can move according to human commands written in free form, such as:

  • "turn left"
  • "go forward 10 meters"
  • "pick up the box"

But before digging into hardware, it was worth checking the capabilities of the model itself. Fortunately, all of this can be tested in simulation first. That is what this article is about.

Starting the Experiments

First, it was worth checking how the default Gemma-3-270M, without fine-tuning, handles instruction following.

User:
You are a robot controller. Convert user commands to JSON.
Available actions:
 - {"action": "move", "direction": "forward|backward", "distance_cm":
 - {"action": "rotate", "direction": "left|right", "angle_deg":
 - {"action": "pick_up", "target": "
 - {"action": "place", "target": "
 - {"action": "stop"}
Output only JSON, no explanations. Now just say Ready, and then the commands will follow.

Model: Ready
User: Command: drive forward 30 centimeters
Model: Drive forward 30 centimeters.
User: Are you idiot?
Model: No.

As you can judge from this Tarantino-esque dialogue, the model does not handle the task very well. Fine-tuning is needed.

Apart from the model itself, I also had to choose the environment where the model would act.

I chose MuJoCo, an excellent simulator for robotics. Its physics is not as realistic as that of NVIDIA Isaac Sim, but it runs smoothly on an ordinary laptop, and I was already familiar with it from previous pet projects.

Sometimes the MuJoCo documentation site shows truly strange things...

The command language was also easy to decide. Gemma-3-270M is a very small model, and it works best with English text. There is no reason to "stress" its weights by also teaching it to understand Russian better; it might simply not have enough capacity for everything at once. So the command language will be English only.

The dataset will be synthetic, generated through powerful free models available on OpenRouter: gpt-oss-120b from OpenAI and nemotron-super-120b from NVIDIA.

In the end, the initial goal can be formulated like this:

Fine-tune Gemma-3 270M to translate English commands into valid JSON that controls a tracked robot in MuJoCo, using a synthetic dataset generated by gpt-oss-120b and nemotron-super-120b.

The experiment is split into two phases:

  • Phase 1: create a tracked robot in MuJoCo, without the arm; create a synthetic dataset for controlling it; train Gemma-3 270M on that dataset; test the trained model in simulation.
  • Phase 2: add a claw limb to the MuJoCo tracked robot; add two new actions, grasp and release; generate an expanded synthetic dataset describing those actions; fine-tune the model again; test it again in simulation with the new actions.

Phase 1: Generating the Synthetic Dataset

Goal of this step: obtain pairs of {"instruction": ..., "output": ...} on which Gemma-3-270M will be fine-tuned.

Source: synthetic data. About 70 manually written seed examples are expanded by a large 120B model through OpenRouter, and every generated pair is then strictly validated against a JSON Schema.

1. The Actual Generator Prompt

This can be reproduced with:

python -m dataset_gen.generate --dry-run

SYSTEM: Schema and Rules

The prompt begins with the full JSON Schema. For example, here is a fragment of the movement schema:

{
  "type": "object",
  "required": ["commands"],
  "properties": {
    "commands": {
      "type": "array",
      "items": { "$ref": "#/definitions/command" }
    }
  },
  "definitions": {
    "command": {
      "oneOf": [
        { "$ref": "#/definitions/move" },
        { "$ref": "#/definitions/turn" },
        { "$ref": "#/definitions/stop" },
        { "$ref": "#/definitions/wait" },
        { "$ref": "#/definitions/grasp" },
        { "$ref": "#/definitions/release" }
      ]
    },
    "move": {
      "required": ["action", "direction", "distance_m"],
      "properties": {
        "action": { "const": "move" },
        "direction": { "enum": ["forward", "backward"] },
        "distance_m": {
          "type": "number",
          "exclusiveMinimum": 0,
          "maximum": 100
        },
        "speed": { "enum": ["slow", "normal", "fast"] }
      }
    }
  }
}

Then come the text rules:

ROLE: You translate ONE English natural-language instruction for a
tracked robot into a JSON object that strictly validates against the
schema above.

HARD RULES:
- Output ONLY the JSON object, no prose, no markdown fences.
- Always wrap commands in {"commands": [ ... ]}.
- distance_m and angle_deg are ALWAYS positive; sign via "direction".
- "speed" optional enum (slow|normal|fast) - only if pace implied.
- left = counter-clockwise (CCW). right = clockwise (CW).

INTERPRETATION CONVENTIONS:
- Distance: unspecified -> 1.0 m; "a bit"/"a little" -> 0.5;
  "a touch"/"slightly" -> 0.3; "a meter and a half" -> 1.5;
  word numbers -> the number ("two" -> 2.0).
- Angle: "quarter turn" -> 90; "half turn"/"turn around" -> 180;
  "full turn"/"360"/"spin around" -> 360. Unstated dir -> "right".
- Speed: slowly/creep/gently -> "slow"; quickly/fast/rush -> "fast".
- "stop"/"halt"/"freeze" -> [{"action":"stop"}].
- "wait/pause for N seconds" -> [{"action":"wait","duration_s":N}].
- Multi-step ("X then Y") -> ordered list.

USER: Example and Task

EXAMPLES (format reference, do not copy):
{"instruction": "turn left then pick up the cube", "output":
 {"commands": [{"action":"turn","direction":"left","angle_deg":90},
               {"action":"grasp"}]}}
{"instruction": "put it down", "output":
 {"commands": [{"action":"release"}]}}
{"instruction": "do a half turn", "output":
 {"commands": [{"action":"turn","direction":"right","angle_deg":180}]}}

TASK: Produce 25 NEW and DIVERSE training pairs. Vary phrasing heavily:
imperative, polite, terse, conversational, robotic, with/without
units, word-numbers vs digits, multi-step, and include some
out-of-scope/nonsense mapped to {"commands": []}. Do NOT copy the
examples verbatim.

Return ONLY a JSON array; each element:
  {"instruction": "", "output": }

The examples are sampled randomly with --fewshot N, so each batch sees different examples and the dataset becomes more diverse.
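A minimal sketch of that few-shot sampling, assuming the seeds are held as a list of instruction/output dictionaries. The names SEEDS, sample_fewshot, and format_examples are hypothetical and do not come from the repository; only the EXAMPLES header line is taken from the prompt above.

```python
import json
import random

# Hypothetical seed pairs; the real seed set has about 70 hand-written examples.
SEEDS = [
    {"instruction": "put it down", "output": {"commands": [{"action": "release"}]}},
    {"instruction": "stop", "output": {"commands": [{"action": "stop"}]}},
    {"instruction": "do a half turn",
     "output": {"commands": [{"action": "turn", "direction": "right", "angle_deg": 180}]}},
    {"instruction": "wait 2 seconds",
     "output": {"commands": [{"action": "wait", "duration_s": 2}]}},
]


def sample_fewshot(seeds, n, rng):
    """Pick n distinct seed pairs at random (n is capped at the seed count)."""
    return rng.sample(seeds, min(n, len(seeds)))


def format_examples(examples):
    """Render the sampled pairs as one JSON object per line for the prompt."""
    lines = ["EXAMPLES (format reference, do not copy):"]
    lines += [json.dumps(example) for example in examples]
    return "\n".join(lines)


rng = random.Random(0)  # fixed seed only to make this sketch repeatable
block = format_examples(sample_fewshot(SEEDS, 3, rng))
print(block)
```

In a real run the generator would be left unseeded, or reseeded per batch, so that every request sees a different subset.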

2. Pipeline Mechanics (dataset_gen/generate.py)

  • build_messages() assembles SYSTEM + USER as shown above; few-shot examples are sampled randomly from the seed set.
  • A POST request goes to OpenRouter using only urllib from the Python standard library, so the HTTP part has no third-party dependencies. Two 120B models are rotated: openai/gpt-oss-120b:free and nvidia/nemotron-3-super-120b-a12b:free.
  • The response is parsed as a JSON array. Then each pair is validated by jsonschema against the same schema, deduplicated by normalized instruction, and appended to JSONL.
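The last stage of that pipeline (parse, validate, deduplicate, append) could look roughly like the sketch below. It is an illustration under assumptions, not the code from dataset_gen/generate.py: the function names are hypothetical, and looks_valid is a deliberately crude stand-in for the jsonschema validation the article describes.

```python
import json
import re

ALLOWED_ACTIONS = {"move", "turn", "stop", "wait", "grasp", "release"}


def normalize(instruction):
    """Lowercase, drop punctuation and collapse whitespace for deduplication."""
    text = re.sub(r"[^a-z0-9 ]+", " ", instruction.lower())
    return " ".join(text.split())


def looks_valid(output):
    """Rough stand-in for jsonschema validation (an assumption, not the real schema)."""
    if not isinstance(output, dict) or set(output) != {"commands"}:
        return False
    commands = output["commands"]
    if not isinstance(commands, list):
        return False
    return all(isinstance(c, dict) and c.get("action") in ALLOWED_ACTIONS
               for c in commands)


def accept_pairs(raw_response, seen):
    """Parse a model response and return the new, valid, non-duplicate pairs."""
    try:
        pairs = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # the whole batch is unusable
    if not isinstance(pairs, list):
        return []
    accepted = []
    for pair in pairs:
        if not isinstance(pair, dict):
            continue
        instruction, output = pair.get("instruction"), pair.get("output")
        if not isinstance(instruction, str) or not looks_valid(output):
            continue
        key = normalize(instruction)
        if not key or key in seen:
            continue  # empty or already present in the dataset
        seen.add(key)
        accepted.append({"instruction": instruction, "output": output})
    return accepted


def to_jsonl(pairs):
    """Serialize pairs as JSONL text, one pair per line."""
    return "".join(json.dumps(pair) + "\n" for pair in pairs)
```

Keeping the seen set across batches is what makes deduplication work over the whole run rather than within a single response.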

3. Final Dataset (data/dataset.jsonl)

Total examples, instruction-to-JSON pairs: 2505.

Commands by type, summed across all commands in all pairs:

Action   Command count
move     1938
turn     1133
wait     456
stop     281
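Counts of this kind can be derived with a few lines of Python. The sketch below assumes each JSONL line holds an object with "instruction" and "output" keys, as in the pairs shown earlier; count_actions and the three sample lines are made up for illustration.

```python
import json
from collections import Counter


def count_actions(jsonl_text):
    """Count actions over every command of every pair in JSONL text."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        pair = json.loads(line)
        for command in pair["output"]["commands"]:
            counts[command["action"]] += 1
    return counts


# Three made-up pairs standing in for data/dataset.jsonl.
SAMPLE = "\n".join([
    json.dumps({"instruction": "go forward then stop", "output": {"commands": [
        {"action": "move", "direction": "forward", "distance_m": 1.0},
        {"action": "stop"}]}}),
    json.dumps({"instruction": "make coffee", "output": {"commands": []}}),
    json.dumps({"instruction": "back up a bit", "output": {"commands": [
        {"action": "move", "direction": "backward", "distance_m": 0.5}]}}),
])
print(count_actions(SAMPLE).most_common())
```

Because multi-step instructions contribute several commands each, the per-action totals add up to more than the number of pairs.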

In the end, there are four actions. Let us look at some of them, and their modifiers, in more detail.

wait: Waiting in Seconds

The wait command has a duration_s parameter, where the model specifies the wait time in seconds. Nothing complicated. For example, this sequence from the dataset:

"Move backward one meter, then pause for three seconds, then move forward one meter"
  -> [{"action":"move","direction":"backward","distance_m":1.0},
      {"action":"wait","duration_s":3},
      {"action":"move","direction":"forward","distance_m":1.0}]

turn: Rotating the Body in Place

The robot turns in place, differential-drive style: the two sides rotate in opposite directions, changing the heading by a specified angle. Unlike move, which is translational forward/backward motion, turn is rotation only.

Schema:

{
  "action": "turn",
  "direction": "left|right",
  "angle_deg": ">0 and <=360",
  "speed": "slow|normal|fast"
}

left means counter-clockwise, right means clockwise. angle_deg is always positive; the sign is encoded in direction. speed is optional.

Examples from the dataset: "turn left 90 degrees" becomes {"action":"turn","direction":"left","angle_deg":90}; "Rotate 360 degrees to the left." becomes angle_deg: 360.

Speed Control: Optional speed Enum

Speed is not specified as a number, but as the enum {slow, normal, fast} for move and turn. stop and wait do not have it.

This was a deliberate choice: the model should not have to guess meters per second.

"as fast as you can" -> ???

The model emits speed only if the tempo is implied in the text: "slowly" -> slow, "quickly", "rush", "swiftly" -> fast. Otherwise the field is omitted and the controller uses normal.

Speed             move linear speed   turn angular speed
slow              0.2 m/s             0.5 rad/s
normal (default)  0.5 m/s             1.0 rad/s
fast              1.0 m/s             2.0 rad/s

Some commands in the dataset carry an explicit speed; the rest omit the field, implying normal.

Example from the data: "turn left ninety degrees then creep back 0.5 meters" becomes [{turn left 90}, {move backward 0.5, "speed":"slow"}]. Only the move step gets "speed":"slow", because "creep" describes that step alone; the turn carries no tempo, so its speed field is omitted.
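To make the speed table concrete, here is a small sketch of how an open-loop controller could turn a command into a run time. The two dictionaries copy the table; the function command_duration and the idea of dividing distance (or angle) by the nominal speed are assumptions for illustration, not necessarily how the article's controller schedules motion.

```python
import math

# Values from the speed table.
LINEAR_SPEED = {"slow": 0.2, "normal": 0.5, "fast": 1.0}    # m/s
ANGULAR_SPEED = {"slow": 0.5, "normal": 1.0, "fast": 2.0}   # rad/s


def command_duration(command):
    """Open-loop duration in seconds for a move, turn or wait command."""
    speed = command.get("speed", "normal")  # an omitted field means "normal"
    if command["action"] == "move":
        return command["distance_m"] / LINEAR_SPEED[speed]
    if command["action"] == "turn":
        return math.radians(command["angle_deg"]) / ANGULAR_SPEED[speed]
    if command["action"] == "wait":
        return float(command["duration_s"])
    return 0.0  # stop, or anything without a duration


plan = [
    {"action": "turn", "direction": "left", "angle_deg": 90},
    {"action": "move", "direction": "backward", "distance_m": 0.5, "speed": "slow"},
]
for step in plan:
    print(step["action"], round(command_duration(step), 3), "s")
```

With these numbers the "creep back 0.5 meters" step takes longer than the quarter turn before it, which is exactly the effect the slow modifier is meant to have.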

As you may have noticed, there is no map information here at all. The robot is "blind" by design, to keep things simple. Future stages of the experiment are supposed to add this. For example, the robot should eventually be able to hear "pick up the red cube", find that cube itself, and pick it up.

Phase 1: Making a Tank in MuJoCo

Only with a robot claw instead of a cannon, and without anime girls.

The robot grew gradually from an empty MJCF file: first the world, then the body, wheels, supports, and only after that did it move without flying into orbit or some other astral plane.

We will make an obvious simplification. A real track, meaning a closed belt with dozens of segments, is not needed for this experiment, so the tracks will not be physically accurate.

Instead, we use a differential drive: two driven wheels, one on each side, each with its own velocity actuator. Turning happens because the two sides have different speeds, exactly as in a real tracked chassis.

World and Floor

Everything starts with the physical settings of the scene, in XML format. Apart from the global settings, there are also floor-specific settings. Let us go through the parameters one by one.

timestep="0.002"

A step of 0.002 seconds means 500 Hz, a familiar frequency for simulations with contacts. If the step is larger, the robot starts "twitching" on wheel contacts.

gravity="0 0 -9.81"

This is the gravity vector: three numbers that specify where everything is pulled and how strongly.

  • The three numbers are the X, Y, Z axes: forward, sideways, upward.
  • 0 0 -9.81 means there is no pull along X or Y, and the pull along Z is -9.81.
  • 9.81 is Earth's gravitational acceleration, 9.81 m/s², the same g from school physics.
  • The minus sign is there because the Z axis points upward and gravity pulls downward.

In simple terms, this line means "enable normal Earth gravity pulling toward the floor". If we wrote 0 0 0, the robot would float in weightlessness; 0 0 -1.62 would be the Moon; 0 0 -9.81 is ordinary Earth.

integrator="implicitfast"

The simulator does not compute motion continuously. It advances in small time steps; in our case, each step is 0.002 seconds. At each step it has to answer the question: "the robot is here and moving like this now; where will it be one step later?" The method the engine uses to answer that question is called the integrator.

Explicit Euler is the simplest method. It looks only at what is happening right now and assumes the whole next step will remain exactly the same.

The problem appears when forces are sharp, for example when a wheel hits the floor hard. During one step the situation can change a lot, but explicit Euler does not notice: it acts on an outdated picture. As a result, it does not damp the impulse but amplifies it. The robot gains energy from nowhere, starts shaking, and in the worst case flies away.

An implicit integrator, and implicitfast is its faster lightweight version, works more cleverly. It takes into account not only "where the robot is now", but also "where it will have reached by the end of the step", and picks an answer that does not contradict itself.

That is why with implicitfast the simulation stays calm even with hard contacts and a relatively large timestep.

"Fast" is in the name because the full implicit scheme is expensive: implicitfast keeps the implicit treatment where it matters most, such as velocity-dependent damping forces, and skips the costliest derivative terms (those of the Coriolis and centripetal forces) rather than recalculating everything.
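The difference between the two families is easy to see on a toy problem. The sketch below is not MuJoCo's integrator: it applies textbook explicit and implicit Euler to a single velocity with strong damping, dv/dt = -k*v. The damping coefficient K is a hypothetical value chosen so that timestep times K equals 3, which is past the stability limit of the explicit scheme.

```python
def explicit_euler(v, k, h, steps):
    """Explicit Euler for dv/dt = -k*v: uses the force at the start of the step."""
    for _ in range(steps):
        v = v + h * (-k * v)
    return v


def implicit_euler(v, k, h, steps):
    """Implicit Euler for dv/dt = -k*v: uses the force at the end of the step.

    Solving v_next = v + h * (-k * v_next) for v_next gives v / (1 + h*k).
    """
    for _ in range(steps):
        v = v / (1.0 + h * k)
    return v


H = 0.002      # timestep from the article
K = 1500.0     # hypothetical stiff damping coefficient, so that H*K = 3

print(explicit_euler(1.0, K, H, 10))  # magnitude grows every step
print(implicit_euler(1.0, K, H, 10))  # decays toward zero
```

The explicit update multiplies the velocity by (1 - H*K) = -2 on every step, so a force that should only remove energy ends up adding it. The implicit update divides by (1 + H*K) and cannot overshoot, whatever the step size.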

geom name="floor" type="plane" size="0 0 0.05" material="grid"

Now the floor parameters, except for friction, which is discussed below:

  • name="floor" is the name of the geom, so it can be referenced in contacts, exclusions, and so on.
  • type="plane" means an infinite plane, not a finite slab.
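  • size="0 0 0.05": for a plane, the first two numbers are the rendered half-sizes along X and Y, and 0 means the plane is drawn as infinite (for collisions a plane is always infinite); the third is the spacing of the visual grid lines.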
  • material="grid" is just appearance: a checkered texture so movement is visible in the viewer.

friction="1.0 0.005 0.0001"

The floor itself is one flat geom. Its friction is not a single number, but three: 1.0, 0.005, 0.0001.

Why three? Because an object can rub against a surface in three different ways, and the physics engine treats them separately:

  • Sliding friction (1.0): when something slides across the floor, like a pushed box. This is high because the wheels should grip, not slide apart.
  • Torsional friction (0.005): when an object rotates in place around the contact point. This is almost zero because otherwise it would interfere with turning.
  • Rolling friction (0.0001): resistance to rolling, like a ball that rolls and gradually slows down. This is tiny so the wheel can roll freely.

So one number answers "how hard is it to slide", the second answers "how hard is it to twist in place", and the third answers "how hard is it to roll". For the floor we need sliding to be hard, while twisting and rolling should be easy.
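Put together, the settings discussed in this section fit into a very small MJCF document. The fragment below is a sketch: the option and floor attributes are the ones quoted in the article, while the texture and material definitions are assumptions added only so that material="grid" refers to something. The Python around it uses the standard library XML parser to read the values back; loading the same string into MuJoCo itself should be possible with mujoco.MjModel.from_xml_string, which is not done here to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET

SCENE_XML = """
<mujoco model="floor_only">
  <option timestep="0.002" gravity="0 0 -9.81" integrator="implicitfast"/>
  <asset>
    <!-- Assumed checker texture and material; the article only names the material. -->
    <texture name="grid" type="2d" builtin="checker" width="512" height="512"
             rgb1="0.2 0.3 0.4" rgb2="0.1 0.2 0.3"/>
    <material name="grid" texture="grid" texrepeat="1 1" texuniform="true"/>
  </asset>
  <worldbody>
    <geom name="floor" type="plane" size="0 0 0.05" material="grid"
          friction="1.0 0.005 0.0001"/>
  </worldbody>
</mujoco>
"""

root = ET.fromstring(SCENE_XML)
option = root.find("option")
floor = root.find("worldbody/geom[@name='floor']")

timestep = float(option.get("timestep"))
print("physics rate, Hz:", round(1.0 / timestep))

# The three friction numbers: sliding, torsional, rolling.
sliding, torsional, rolling = (float(x) for x in floor.get("friction").split())
print("friction (sliding, torsional, rolling):", sliding, torsional, rolling)
```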

Body

The body is a chassis body with a free joint: six degrees of freedom. The robot is not attached to the world and can drive, turn, and, in the worst case, fall over.

It is placed at a height of 8.5 cm so the wheels touch the floor rather than hanging in the air or sinking through it.

The geometry is a box with half-sizes 0.15 0.08 0.035, meaning dimensions of 30 x 16 x 7 cm. Mass and center of gravity are specified: 2 kg, with the center of mass shifted backward by 2 cm. The reason for that shift becomes clear below, in the slipping story.

Two Driven Wheels

Each wheel is a child body of the chassis:

Parameter    Value
Geometry     cylinder, 6 cm radius, 2 cm half-width
Orientation  rotated so the axis points sideways, along Y
Position     on the sides, ±11 cm, slightly below the body center
Joint        hinge rotating around the side axis
Mass         0.3 kg
Friction     high, 2.0, to prevent slipping

The distance between the sides is exactly 22 cm, and the effective radius is 6 cm.
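These two numbers are all that standard differential-drive kinematics needs. As a sketch, assuming an ideal no-slip drive and the X forward, Y left, Z up convention used in the article, the wheel speeds for a desired body motion can be computed as follows. The function name wheel_speeds is hypothetical, and the sign of each wheel joint in the actual model depends on how its axis is oriented.

```python
WHEEL_RADIUS = 0.06   # m, effective wheel radius
TRACK_WIDTH = 0.22    # m, distance between the two sides


def wheel_speeds(linear, angular):
    """Return (left, right) wheel angular velocities in rad/s.

    linear is the body speed in m/s (forward positive); angular is the yaw
    rate in rad/s (counter-clockwise, i.e. a left turn, positive).
    """
    left = (linear - angular * TRACK_WIDTH / 2.0) / WHEEL_RADIUS
    right = (linear + angular * TRACK_WIDTH / 2.0) / WHEEL_RADIUS
    return left, right


print(wheel_speeds(0.5, 0.0))   # straight line at the "normal" move speed
print(wheel_speeds(0.0, 1.0))   # turning left in place at the "normal" turn speed
```

Driving straight gives both wheels the same speed; turning in place gives them equal and opposite speeds, with the right wheel going forward for a left turn.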

Two Support "Skis"

On two wheels alone, the robot would nose-dive or fall backward. To prevent this, two small spherical supports were added at the front and rear: radius 1.5 cm, 50 g each.

They are deliberately very slippery. Their job is to prevent the robot from tipping over, while not slowing it down and, most importantly, not carrying its weight.

They are also raised slightly above the floor. This way they barely touch it, and the wheels carry almost all the load.

Actuators

Each wheel joint has one velocity actuator.

kv controls how stiffly the actuator chases the target angular velocity. ctrlrange limits the velocity command itself. forcerange, however, is the thing that saved us from a very spectacular bug described below.
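How the three parameters interact can be sketched in a few lines. A velocity actuator behaves like a proportional servo on the velocity error, and forcerange caps what it is allowed to output. The numbers used below (kv of 10, a command range of ±20 rad/s, a torque limit of ±2) are hypothetical, since the article does not list the actual values.

```python
def velocity_actuator_force(ctrl, wheel_velocity, kv, ctrlrange, forcerange):
    """Velocity servo: force = kv * (target - actual), with both limits applied."""
    lo, hi = ctrlrange
    target = min(max(ctrl, lo), hi)            # clamp the velocity command
    force = kv * (target - wheel_velocity)     # proportional velocity servo
    f_lo, f_hi = forcerange
    return min(max(force, f_lo), f_hi)         # clamp the output torque


# Wheel at rest, full-speed command: the raw servo output would be 200.
print(velocity_actuator_force(20.0, 0.0, kv=10.0,
                              ctrlrange=(-20.0, 20.0), forcerange=(-2.0, 2.0)))
# Wheel almost at the target speed: the output stays well inside the limit.
print(velocity_actuator_force(1.0, 0.9, kv=10.0,
                              ctrlrange=(-20.0, 20.0), forcerange=(-2.0, 2.0)))
```

Without the final clamp, the first case would apply the full unclamped torque in a single step, which is the kind of kick that launches a light robot off the floor.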

Hard Contacts

By default, the joints have a small amount of damping, and the contacts use hard, almost non-penetrating parameters. If contacts are made soft, the wheels literally "sink" into the floor and slip. I also explicitly disabled collisions between the body and the wheels and between the wheels themselves: on the real robot those parts can never collide.

Battles With Physics

None of the "magic numbers" above appeared immediately. Each of them was obtained through a bug.

Robot catapult. Without a force limit, the robot flew away on the very first step: height 1.27 m, body almost vertical. This was fixed by torque limiting through forcerange and stiffer contacts.

Everything was exactly like this.

Slipping. At first, the support spheres were level with the wheels and took part of the weight. Drive efficiency dropped to 25-50%, meaning the body covered only a quarter to a half of the distance implied by the wheel rotation. The measurement showed it directly: the wheels rotate at the required speed, but the body barely moves. Pure slipping.

The fix required three changes at once:

  • raise the supports,
  • shift the center of gravity of the body backward,
  • increase wheel-floor friction.

Together, these changes put about 88% of the weight onto the wheels.

Turn sign. Trivial, but it has to be checked: using velocities from the simulation, I verified that the command "left" produces counter-clockwise rotation when viewed from above. It matched the standard convention: X forward, Y left, Z up. Nothing had to be inverted.

Result: the robot loads, stands stably, drives and turns in the correct direction, and the accuracy of open-loop control is good enough for further work.

Phase 1: Fine-Tuning the Model

The training dataset and the physical module the model will control are ready. Now the model has to be trained and tested.

For fine-tuning I chose a Kaggle machine. Kaggle gives users 30 free hours per week for ML experiments, with a choice between two T4 GPUs, each 16 GB, or one P100 with 16 GB.

I had to drop the P100 because of compatibility issues, so training was done on a single NVIDIA T4 to avoid dealing with parallelism. The model was small and did not require many resources anyway.

The first attempt failed. The training example was built like this: the same JSON Schema used for synthetic dataset generation was placed into every single example.

In shortened form, it looked like this:

JSON SCHEMA (draft-07):
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Robot command list",
  "type": "object",
  "additionalProperties": false,
  "required": ["commands"],
  "properties": {
    "commands": { "type": "array", "items": { "$ref": "#/definitions/command" } }
  },
  "definitions": {
    "move": {
      "required": ["action","direction","distance_m"],
      "properties": {
        "action": { "const": "move" },
        "direction": { "enum": ["forward","backward"] },
        "distance_m": { "type": "number", "exclusiveMinimum": 0, "maximum": 100 },
        ...
      }
    },
    "turn": { ... "angle_deg": { "maximum": 360 } ... },
    "stop": { ... }, "wait": { ... }, "grasp": { ... }, "release": { ... }
  }
}

ROLE: You translate ONE English instruction ... into a JSON object that
strictly validates against the schema above.
HARD RULES:
- Output ONLY the JSON object, no prose, no markdown fences.
- distance_m and angle_deg are ALWAYS positive ...
- "speed" is an optional enum (slow|normal|fast) ...
INTERPRETATION CONVENTIONS:
- "quarter turn" -> 90; "turn around" -> 180; "full turn" -> 360 ...
- word numbers -> the number ("two" -> 2.0) ...
   ... dozens more lines of rules ...

---
INSTRUCTION: turn left 90 degrees, then go forward 2 meters

This whole wall of text is about 1500 tokens, and it repeats in each of thousands of training examples, even though only the final INSTRUCTION: ... line changes.

As a result, training took 3.5 hours, and the notebook crashed in the evaluation cell before the model had been saved.

I did not want to spend another 3.5 hours, so I chose a different approach: each example would contain only a short instruction.

You translate ONE English instruction for a tracked robot with a gripper
arm into a single JSON object {"commands":[...]} using actions: move, turn,
stop, wait, grasp, release. Output ONLY the JSON object, no prose, no
markdown. If the instruction is out of scope or nonsense, output
{"commands": []}.

---
INSTRUCTION: turn left 90 degrees, then go forward 2 meters

That is about 60 tokens instead of 1500. No schema, no rule tables, just the request:

You translate a phrase into JSON of this shape; here is the action list; answer nonsense with an empty list.
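As a sketch, building one training example with the short preamble might look like this. The preamble text is taken from the article; the exact whitespace, the helper names, and the "prompt"/"completion" field names are assumptions, since the training notebook is not shown.

```python
import json

SHORT_PROMPT = (
    "You translate ONE English instruction for a tracked robot with a gripper "
    'arm into a single JSON object {"commands":[...]} using actions: move, turn, '
    "stop, wait, grasp, release. Output ONLY the JSON object, no prose, no "
    "markdown. If the instruction is out of scope or nonsense, output "
    '{"commands": []}.'
)


def build_prompt(instruction):
    """Short fixed preamble followed by the only part that changes per example."""
    return f"{SHORT_PROMPT}\n\n---\nINSTRUCTION: {instruction.strip()}"


def build_example(instruction, commands):
    """One training pair: prompt text plus the compact JSON target."""
    target = json.dumps({"commands": commands}, separators=(",", ":"))
    return {"prompt": build_prompt(instruction), "completion": target}


example = build_example(
    "turn left 90 degrees, then go forward 2 meters",
    [{"action": "turn", "direction": "left", "angle_deg": 90},
     {"action": "move", "direction": "forward", "distance_m": 2.0}],
)
print(example["prompt"])
print(example["completion"])
```

The fixed part is now a few sentences, so almost all of the tokens the model sees per example belong to the instruction and its JSON answer.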

With this approach, training took 30-40 minutes. The model weights were downloaded, and all that remained was to test them in real conditions, or rather in the MuJoCo simulation.

Phase 1: Testing the Model

Testing was done the usual way fine-tuning results are tested: on a held-out split. The model never saw 10% of the pairs during training, with a fixed random split so the result could be reproduced.
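A reproducible hold-out split of this kind takes only a few lines. In the sketch below the seed value, the function name, and the use of round() for the hold-out size are assumptions; the 10% fraction is the only detail taken from the article.

```python
import random


def split_holdout(pairs, holdout_fraction=0.10, seed=42):
    """Shuffle indices with a fixed seed and cut off the hold-out part."""
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)     # same seed, same split every time
    n_holdout = round(len(pairs) * holdout_fraction)
    holdout = [pairs[i] for i in indices[:n_holdout]]
    train = [pairs[i] for i in indices[n_holdout:]]
    return train, holdout


items = [f"pair-{i}" for i in range(200)]
train, holdout = split_holdout(items)
print(len(train), len(holdout))
```

Using a private random.Random instance rather than the global generator keeps the split independent of any other random calls in the notebook.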

Several metrics were computed, each answering its own question:

  • schema_valid: what fraction of responses are valid JSON according to our schema. Result: 1.000, or 100%.
  • exact_match: how often the response matched the reference literally. Result: 0.920.
  • action_seq: whether the action sequence is semantically correct, even if a number differs slightly. "Turn, then drive" matters more than exactly 90.0 versus 90. Result: 0.980.
  • ood_f1: whether the model avoids inventing commands for nonsense like "make coffee". The correct answer is an empty list, meaning "do nothing". Result: 0.846.
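The four text-level metrics can be approximated as follows. This is a simplified sketch under assumptions: schema validity is reduced to a shape check instead of full JSON Schema validation, exact_match compares parsed objects rather than raw strings, action_seq compares only the order of action names, and ood_f1 is computed as F1 for the "empty command list" class. The article's own definitions may differ in detail.

```python
import json

ALLOWED_ACTIONS = {"move", "turn", "stop", "wait", "grasp", "release"}


def parse_commands(text):
    """Return the command list, or None if the text is not JSON of the right shape."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict) or not isinstance(obj.get("commands"), list):
        return None
    if not all(isinstance(c, dict) and c.get("action") in ALLOWED_ACTIONS
               for c in obj["commands"]):
        return None
    return obj["commands"]


def evaluate(predictions, references):
    """predictions: raw model strings; references: lists of command dicts."""
    n = len(references)
    valid = exact = seq = 0
    tp = fp = fn = 0  # "positive" means out-of-scope, i.e. an empty command list
    for text, ref in zip(predictions, references):
        pred = parse_commands(text)
        valid += pred is not None
        exact += pred == ref
        pred_actions = None if pred is None else [c["action"] for c in pred]
        seq += pred_actions == [c["action"] for c in ref]
        pred_empty, ref_empty = pred == [], ref == []
        tp += pred_empty and ref_empty
        fp += pred_empty and not ref_empty
        fn += ref_empty and not pred_empty
    denominator = 2 * tp + fp + fn
    return {
        "schema_valid": valid / n,
        "exact_match": exact / n,
        "action_seq": seq / n,
        "ood_f1": 2 * tp / denominator if denominator else 0.0,
    }
```

Note how a prediction of 91 degrees against a reference of 90 fails exact_match but still counts for action_seq, which is the distinction the list above draws.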

But the most important check is the last one. All previous metrics compare texts. What matters to us is whether the robot actually drives where it should.

So we took the reference command and the model prediction, ran both in MuJoCo, and compared where the robot ended up, with a reasonable tolerance because the control is open-loop and the physics is slightly noisy.

This metric is task_success = 0.975: 39 runs out of 40 brought the robot to the same place as the reference, with zero execution errors.
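The idea of comparing outcomes rather than texts can be illustrated without a physics engine. The sketch below replaces the MuJoCo run with ideal dead reckoning, so it shows only the comparison logic; the function names and both tolerances (0.10 m and 10 degrees) are hypothetical, as the article does not state the values used in Phase 1.

```python
import math


def simulate(commands, pose=(0.0, 0.0, 0.0)):
    """Ideal dead-reckoning stand-in for the MuJoCo run: returns (x, y, heading_rad)."""
    x, y, heading = pose
    for c in commands:
        if c["action"] == "move":
            sign = 1.0 if c["direction"] == "forward" else -1.0
            x += sign * c["distance_m"] * math.cos(heading)
            y += sign * c["distance_m"] * math.sin(heading)
        elif c["action"] == "turn":
            sign = 1.0 if c["direction"] == "left" else -1.0  # left = CCW = positive yaw
            heading += sign * math.radians(c["angle_deg"])
    return x, y, heading


def same_outcome(reference, prediction, pos_tol=0.10, yaw_tol=math.radians(10)):
    """True if both command lists end within the (hypothetical) tolerances."""
    rx, ry, rh = simulate(reference)
    px, py, ph = simulate(prediction)
    yaw_error = math.atan2(math.sin(rh - ph), math.cos(rh - ph))  # wrap to [-pi, pi]
    return math.hypot(rx - px, ry - py) <= pos_tol and abs(yaw_error) <= yaw_tol


reference = [{"action": "turn", "direction": "left", "angle_deg": 90},
             {"action": "move", "direction": "forward", "distance_m": 2.0}]
close = [{"action": "turn", "direction": "left", "angle_deg": 90.0},
         {"action": "move", "direction": "forward", "distance_m": 2.05}]
wrong = [{"action": "turn", "direction": "right", "angle_deg": 90},
         {"action": "move", "direction": "forward", "distance_m": 2.0}]
print(same_outcome(reference, close), same_outcome(reference, wrong))
```

A prediction that differs textually from the reference (2.05 instead of 2.0) still passes, while one with the same actions but the wrong turn direction fails, which is the behavior a task-level metric should have.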

We showed experimentally that, for this task, the schema had been "imprinted" into the model weights during training, and that feeding the full schema to the model was redundant.

Now we could move to Phase 2 with a clear conscience.

Phase 2: Updating the Dataset for Claw Examples

We add two manipulation actions: release and grasp. Each has an optional cell parameter, which in turn is an enum: front | front_left | front_right | left | right, with front as the default when the parameter is omitted.

Before forming the expanded dataset, I had to fix a small mistake in the first version. Some examples treated object-pickup phrases as impossible by definition. Phrases like "pick up the box" were considered nonsense and trained to the answer "do nothing", or [].

Before changing anything, we measured the scale of the problem. A script went through all 2505 old pairs with regexes and counted how many truly conflicted: manipulation phrase -> empty answer.

Phase-1 pairs=2505 | manip-phrase=17 | CONFLICT=17
reuse unchanged=2488 (99.3%)

There were 17 conflicts out of 2505, or 0.7%. Phase 1 had barely generated such phrases. The conflicting examples were removed from the new sample.
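An audit of this kind might be sketched as below. The regular expression is a hypothetical approximation of "manipulation phrase", not the pattern from the actual script, and the four sample pairs are invented; only the report format mirrors the output shown above.

```python
import re

# Hypothetical pattern for phrases that ask the arm to do something.
MANIP_RE = re.compile(
    r"\b(pick\s+up|grab|grasp|lift|put\s+(?:it\s+)?down|drop|release)\b",
    re.IGNORECASE,
)


def audit(pairs):
    """Split pairs into manipulation phrases, conflicts, and reusable pairs."""
    manip = [p for p in pairs if MANIP_RE.search(p["instruction"])]
    conflicts = [p for p in manip if p["output"]["commands"] == []]
    conflict_ids = {id(p) for p in conflicts}
    reusable = [p for p in pairs if id(p) not in conflict_ids]
    return manip, conflicts, reusable


pairs = [
    {"instruction": "pick up the box", "output": {"commands": []}},
    {"instruction": "Grab the cube then stop",
     "output": {"commands": [{"action": "grasp"}, {"action": "stop"}]}},
    {"instruction": "go forward",
     "output": {"commands": [{"action": "move", "direction": "forward", "distance_m": 1.0}]}},
    {"instruction": "make coffee", "output": {"commands": []}},
]
manip, conflicts, reusable = audit(pairs)
print(f"pairs={len(pairs)} | manip-phrase={len(manip)} | CONFLICT={len(conflicts)}")
print(f"reuse unchanged={len(reusable)} ({100 * len(reusable) / len(pairs):.1f}%)")
```

Genuinely out-of-scope pairs such as "make coffee" are kept, since an empty answer is still the right target for them.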

Then we ran the same generation pipeline through OpenRouter, already tested in Phase 1, with two targeted changes:

  • few-shot examples in the prompt were taken only from manipulation seeds, meaning pairs with grasp / release;
  • the prompt explicitly emphasized: "generate ONLY arm tasks; every pair must contain grasp and/or release; do NOT output pure locomotion."

The target was 1000 pairs. The final run accepted all 1000 examples, with 0 schema rejections across the entire run.

What We Got

Final combined dataset: 3518 pairs.

Action    Count
move      2239
turn      1322
grasp     814
release   584
wait      488
stop      285

Phase 2: Adding the Claw to the Robot

Fortunately, we do not need as many claws as a crab.

For the hand to reach the target point, we need to calculate the arm joint angles. This is inverse kinematics, or IK.

We implemented it with the damped least-squares method over the Jacobian of the grasp point. Roughly speaking, the joints are adjusted in small steps toward the target until the tip of the arm reaches where it needs to be.

Two details are worth noting:

  • IK moves only the arm joints; the base is considered fixed. This simplifies the math and makes logical sense: while grasping, the robot stands still.
  • IK does not touch the live simulation. It saves the world state, computes the solution on a copy, restores everything, and returns only the target angles. Otherwise every IK calculation would disturb the physics.
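To show what damped least squares looks like in code, here is a self-contained sketch on a planar two-link arm. It is not the article's implementation: the real arm, its link lengths, and its Jacobian come from the MuJoCo model, whereas the lengths, damping, step size, and function names below are hypothetical. The update rule itself, dq = J^T (J J^T + damping^2 I)^-1 * error, is the standard one.

```python
import math

L1, L2 = 0.18, 0.16   # hypothetical link lengths in metres


def forward(q):
    """Tip position of a planar two-link arm for joint angles q = (q1, q2)."""
    q1, q2 = q
    return (L1 * math.cos(q1) + L2 * math.cos(q1 + q2),
            L1 * math.sin(q1) + L2 * math.sin(q1 + q2))


def jacobian(q):
    """2x2 Jacobian of the tip position with respect to the joint angles."""
    q1, q2 = q
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return ((-L1 * s1 - L2 * s12, -L2 * s12),
            (L1 * c1 + L2 * c12, L2 * c12))


def solve_ik(target, q_start, damping=0.05, step=0.5, iters=200, tol=1e-4):
    """Damped least squares: dq = J^T (J J^T + damping^2 I)^-1 * error."""
    q = list(q_start)   # work on a copy, the caller's state is left untouched
    for _ in range(iters):
        x, y = forward(q)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            break
        (a, b), (c, d) = jacobian(q)
        # A = J J^T + damping^2 I, a symmetric 2x2 matrix
        a11 = a * a + b * b + damping ** 2
        a12 = a * c + b * d
        a22 = c * c + d * d + damping ** 2
        det = a11 * a22 - a12 * a12
        # w = A^-1 * error, using the closed-form 2x2 inverse
        wx = (a22 * ex - a12 * ey) / det
        wy = (-a12 * ex + a11 * ey) / det
        # dq = J^T w, applied as a partial step toward the target
        q[0] += step * (a * wx + c * wy)
        q[1] += step * (b * wx + d * wy)
    return tuple(q)


angles = solve_ik(target=(0.25, 0.10), q_start=(0.3, 0.8))
print(angles, forward(angles))
```

The damping term keeps the matrix invertible even when the arm is nearly stretched out, at the cost of slightly smaller steps. Returning only the final angles, and never mutating the starting state, mirrors the "compute on a copy" rule from the list above.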

The model sends {"action":"grasp"}, optionally with a cell, and the controller expands it into a fixed canonical routine:

  • cell -> point in the world, accounting for the robot's orientation;
  • find the nearest free object nearby;
  • open the claw;
  • move the arm above the object;
  • lower it;
  • close the claw;
  • lift it.

release works in the opposite direction: move to the cell, lower, open.

There is no "creativity" at execution time. The model decides what to do and roughly where, through the cell. The exact joint motion is handled by a deterministic procedure and IK.

Now let us look at the cell system. It has this structure:

_CELLS = {
    "front":       (0.30, 0.00),
    "front_left":  (0.27, 0.13),
    "front_right": (0.27, -0.13),
    "left":        (0.23, 0.20),
    "right":       (0.23, -0.20),
}

If we compute the distance and angle of each cell from the robot center, in the robot frame where X is forward and Y is left:

Cell         Radius sqrt(x^2 + y^2)   Angle from nose
front        0.300 m                  0 deg
front_left   0.300 m                  +25.7 deg
front_right  0.300 m                  -25.7 deg
left         0.305 m                  +41.0 deg
right        0.305 m                  -41.0 deg

The principle is visible: all five cells lie on the same arc with a radius of about 0.30 m in front of the robot, simply spread across different directions. It is not a coordinate grid, but one comfortable reach radius divided into five directions in the front sector of roughly ±41 degrees.
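Both the table and the "cell -> point in the world" step from the grasp routine can be reproduced with a little trigonometry. In the sketch below _CELLS is copied from the article, while cell_polar and cell_to_world are hypothetical helper names; the world transform is the usual 2D rotation by the robot's yaw followed by a translation to the robot's position.

```python
import math

_CELLS = {
    "front":       (0.30, 0.00),
    "front_left":  (0.27, 0.13),
    "front_right": (0.27, -0.13),
    "left":        (0.23, 0.20),
    "right":       (0.23, -0.20),
}


def cell_polar(name):
    """Radius in metres and angle from the nose in degrees (left positive)."""
    x, y = _CELLS[name]
    return math.hypot(x, y), math.degrees(math.atan2(y, x))


def cell_to_world(name, robot_x, robot_y, robot_yaw):
    """Rotate the cell offset by the robot's yaw and add the robot's position."""
    cx, cy = _CELLS[name]
    wx = robot_x + cx * math.cos(robot_yaw) - cy * math.sin(robot_yaw)
    wy = robot_y + cx * math.sin(robot_yaw) + cy * math.cos(robot_yaw)
    return wx, wy


for name in _CELLS:
    radius, angle = cell_polar(name)
    print(f"{name:12s} {radius:.3f} m {angle:+.1f} deg")

# Robot at (1, 2) facing along world +Y: its "front" cell is 0.30 m further along +Y.
print(cell_to_world("front", 1.0, 2.0, math.pi / 2))
```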

The radius of about 0.30 m is the "arm zone". It was chosen empirically so IK can reach it reliably.

Phase 2: Fine-Tuning, Testing, and Adding a User API

Fine-tuning on the updated dataset went without surprises: the same 30-40 minutes. One point worth noting: the fine-tune started again from the base Gemma-3-270M rather than from the Phase 1 weights, because Phase 1 had been trained on instructions where picking up objects was treated as nonsense. Nothing was lost by this, because the model trains very quickly.

Next came testing. To measure training quality, we introduced a metric that compares the final position of the grasped object between the reference and the prediction, with a tolerance of 0.10 m.

If the model says "pick it up and put it on the left", success is counted only if the cube really ends up on the left, in the same place where the reference puts it.

What the Numbers Showed

Run on the held-out split of 352 pairs, made the same way as in Phase 1:

Metric                      Phase 2   Phase 1
schema_valid                0.991     1.000
exact_match                 0.943     0.920
action_seq                  0.980     0.980
ood_f1                      0.857     0.846
task_success (MuJoCo, 40)   0.975     0.975

How to read this:

  • schema_valid = 0.991, not 1.0: a small regression.
  • exact_match = 0.943: even higher than Phase 1, which had 0.920. The model learns manipulation patterns more sharply than conversational distance formulations.
  • task_success = 0.975 with zero execution errors: grasp, release, and cells work cleanly in physics, and the cube ends up where the reference places it.

Adding Interactive Mode

The pipeline now works end to end, but running a fixed dataset is still "laboratory" work. I wanted the ability to command the robot with free text in real time.

We chose the most visual option: a REPL plus a MuJoCo window. You type a phrase, and the robot immediately executes it in a live simulation. State accumulates between phrases, with no reset: if you say "drive forward" and then "now left", it moves from wherever it stopped.

Architecturally, this required two threads:

  • Input thread: blocking input() plus model inference; it puts parsed commands into a queue.
  • Main thread: owns the window and physics. If there is a command in the queue, it executes it in real time. If not, it runs idle steps so the robot stands still and the window remains alive. Only this thread calls physics steps, so there are no data races in MuJoCo.
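The two-thread layout can be sketched with the standard library alone. Everything below is illustrative: input_worker and main_loop are hypothetical names, the "model" is a lookup table, and physics is replaced by appending to a list. In the real REPL, read_line would wrap input(), parse would call the fine-tuned model, and execute and idle_step would advance the MuJoCo simulation.

```python
import queue
import threading


def input_worker(read_line, parse, commands):
    """Input thread: read phrases, run the (slow) model, enqueue parsed commands."""
    while True:
        line = read_line()
        if line is None:          # end of input
            commands.put(None)    # sentinel tells the main loop to finish
            return
        for command in parse(line):
            commands.put(command)


def main_loop(commands, execute, idle_step):
    """Main thread: the only place that advances physics."""
    while True:
        try:
            command = commands.get(timeout=0.01)
        except queue.Empty:
            idle_step()           # keep the robot still and the window alive
            continue
        if command is None:
            return
        execute(command)


# Scripted stand-ins so the sketch runs without a console, a model, or a simulator.
phrases = iter(["drive forward", "now left"])
fake_model = {
    "drive forward": [{"action": "move", "direction": "forward", "distance_m": 1.0}],
    "now left": [{"action": "turn", "direction": "left", "angle_deg": 90}],
}
executed = []
commands = queue.Queue()
worker = threading.Thread(
    target=input_worker,
    args=(lambda: next(phrases, None), lambda s: fake_model.get(s, []), commands),
    daemon=True,
)
worker.start()
main_loop(commands, executed.append, lambda: None)
worker.join()
print(executed)
```

The queue is the only object shared between the threads, and it is thread-safe, so the simulation state is never touched from the input side.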

REPL means Read-Eval-Print Loop: a cycle of read, execute, show result, wait for input again. In plain terms, an interactive console that loops through:

  • Read: waits for you to enter a line;
  • Eval: executes it;
  • Print: shows the result;
  • Loop: returns to step 1 and waits for the next input.

Everything worked without problems here. You can see the result in the GIF at the beginning of the article.

Results

In the end, here is what was done:

  • A physical model of a tracked robot with a claw was created from scratch in MuJoCo.
  • Gemma-3 with 270M parameters was trained to accept natural-language commands from a human and translate them into JSON for controlling the tracked robot.
  • Training was performed on a free Kaggle machine, and the synthetic dataset was collected using the free gpt-oss-120b and nemotron-super-120b models on OpenRouter.

And so we reached the end of this journey. But the project is not finished yet. Two main things still need to be improved:

  • Add map perception to the robot so it can act more independently and analyze its surroundings. For example, it should be able to pick up objects from free-form human commands such as "pick up the red cube", overcome obstacles, and so on.
  • Move the model into a real tracked robot. For that, I need to assemble the RaspTank DIY kit I have at home, the photo of which was shown at the beginning of the article.

If you find this article interesting, I will try to work on that in the next part.

Until next time.

Sources

Source code on Codeberg: https://codeberg.org/imperius/llm-tank

Model weights on Hugging Face: https://huggingface.co/Imperius/llm-tank
