Introduction
NVIDIA's "Project GR00T," a foundation model for humanoid robots, has the potential to revolutionize the world of AI and robotics. The GR00T N1 model can be fine-tuned not just with simulation data but also with real-world data, enabling it to generate behaviors specialized for specific tasks.
This article provides a comprehensive, step-by-step guide to the entire process: from fine-tuning the GR00T N1 model using a custom dataset collected with a "SO-ARM101" single arm, to running inference on the physical robot, and finally, evaluating the trained model. We'll include concrete steps and detailed troubleshooting advice along the way.
By the end of this guide, you'll be ready to take your first steps toward building a robot model that can perform your own custom tasks.
What This Article Covers
- Data Selection: An overview of the training dataset and the settings used during data collection.
- Fine-Tuning: Fine-tuning the GR00T N1 model with our custom data.
- Inference: Controlling the physical robot using the fine-tuned model.
- Evaluation: Quantitatively evaluating the performance of the trained model.
References
This article is based on the following official documentation and blog posts. All procedures were verified against the repository versions available at the time of writing.
- Official Blog: Fine-tuning GR00T N1 with LeRobot on a custom dataset
- GitHub: NVIDIA/Isaac-GR00T / huggingface/lerobot
NOTE: These repositories are updated frequently. Please be aware of this if you are trying to replicate this environment.
Part 1: Data Selection
High-quality training data is essential for fine-tuning. This article assumes the data collection process is already complete, but we'll explain what kind of dataset we used and how it was collected.
Task Selection
We chose the simplest of the following three tasks: "pick up a single piece of tape and place it in a box." Complex tasks have a lower success rate, so we recommend starting with something simple.
You can visualize the datasets we created using the LeRobot Dataset Visualizer.
- Complex Task: pen-cleanup
- Simple Task: tapes-cleanup
- Easiest Task: onetape-cleanup (used in this guide)
Data Collection Settings
Below are the collection settings for the dataset we used (50 episodes).
The cameras configuration is particularly important. The camera names defined here (e.g., tip, front) will be referenced during the fine-tuning process, so you must remember them exactly. We recommend using the name wrist to align with GR00T's default settings, which will save you some configuration changes later.
dataset:
  repo_id: "yuk6ra/so101-onetape-cleanup"  # Hugging Face repository ID
  single_task: "Grab the tape and place it in the box."  # Task description
  num_episodes: 50       # Number of episodes to record
  fps: 30                # Frame rate
  episode_time_s: 15     # Max time per episode (seconds)
  reset_time_s: 15       # Reset time after episode recording (seconds)

# Follower Arm
robot:
  type: "so101_follower"
  port: "/dev/ttyACM0"   # Serial port
  id: "white"            # Follower ID

  # Camera settings (the names defined here are used later)
  cameras:
    tip:                 # Using 'wrist' here will streamline later steps
      type: "opencv"
      index_or_path: 0
      fps: 30
      width: 640
      height: 480
    front:
      type: "opencv"
      index_or_path: 2
      fps: 30
      width: 640
      height: 480

# Leader Arm
teleop:
  type: "so101_leader"
  port: "/dev/ttyACM1"   # Serial port
  id: "black"            # Leader ID

# Additional options
options:
  display_data: false    # Whether to display camera feed
  push_to_hub: true      # Whether to automatically upload to Hugging Face Hub
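For reference, these settings roughly correspond to a lerobot recording command like the sketch below. Flag names can vary between lerobot versions, so treat this as an illustration rather than an exact command to copy.
# Sketch of a recording command matching the settings above (flags may differ by lerobot version)
python -m lerobot.record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=white \
  --robot.cameras="{
    tip: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30},
    front: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30}
  }" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=black \
  --dataset.repo_id=yuk6ra/so101-onetape-cleanup \
  --dataset.single_task="Grab the tape and place it in the box." \
  --dataset.num_episodes=50 \
  --dataset.episode_time_s=15 \
  --dataset.reset_time_s=15 \
  --dataset.push_to_hub=true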
Part 2: Fine-Tuning
Now, let's use the collected data to fine-tune the GR00T N1 model.
Preparing the Execution Environment
Fine-tuning requires a high-spec machine. We used the following cloud environment for this process.
| Component | Specs |
| --- | --- |
| GPU | NVIDIA H100 SXM (80GB VRAM) |
| Disk | 300GB+ (5000 steps consumed ~100GB) |
| RAM | 128GB+ |
| OS | Ubuntu 24.04 |
| Network Speed | 4Gbps (Upload/Download) |
NOTE: A slow network connection will significantly increase the time it takes to upload the model.
After connecting to the remote server via SSH, let's verify the specs with a few commands.
ssh -p 30454 root@xxx.xxx.xxx.xx -L 8080:localhost:8080
$ nvidia-smi
Sun Jul 13 06:57:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:E4:00.0 Off | 0 |
| N/A 47C P0 73W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$ df /home -h
Filesystem Size Used Avail Use% Mounted on
overlay 300G 90M 300G 1% /
$ free -h
total used free shared buff/cache available
Mem: 503Gi 34Gi 372Gi 47Mi 101Gi 469Gi
Swap: 8.0Gi 186Mi 7.8Gi
$ lsb_release -d
No LSB modules are available.
Description: Ubuntu 24.04.2 LTS
Next, following the official instructions, we'll set up a Conda virtual environment.
# Clone the repository
git clone https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
# Create and activate the Conda environment
conda create -n gr00t python=3.10
conda activate gr00t
# Install required packages
pip install --upgrade setuptools
pip install -e .[base]
pip install --no-build-isolation flash-attn==2.7.1.post4
Finally, log in to Hugging Face and Weights & Biases (Wandb).
- Hugging Face Token: huggingface.co/settings/tokens
- Wandb API Key: wandb.ai/authorize
huggingface-cli login
wandb login
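On a headless cloud server you can also log in non-interactively by passing the credentials directly; the environment variable names below are just a convention (export them however you manage secrets).
# Non-interactive login (HF_TOKEN / WANDB_API_KEY are placeholders you set yourself)
huggingface-cli login --token "$HF_TOKEN"
wandb login "$WANDB_API_KEY"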
Preparing the Training Data
Download your chosen dataset from the Hugging Face Hub.
huggingface-cli download \
--repo-type dataset yuk6ra/so101-onetape-cleanup \
--local-dir ./demo_data/so101-onetape-cleanup
To ensure GR00T correctly recognizes the data format, a configuration file named modality.json is required. Copy the sample and modify its contents to match your environment.
# Copy the sample file
cp getting_started/examples/so100_dualcam__modality.json ./demo_data/so101-onetape-cleanup/meta/modality.json
Change the wrist entry to tip to match the camera name you set during data collection.
# Edit the configuration file
vim ./demo_data/so101-onetape-cleanup/meta/modality.json
{
...
"video": {
"front": {
"original_key": "observation.images.front"
},
- "wrist": {
- "original_key": "observation.images.wrist"
+ "tip": {
+ "original_key": "observation.images.tip"
}
},
...
Verify that the data can be loaded correctly with the following script.
python scripts/load_dataset.py \
--dataset-path ./demo_data/so101-onetape-cleanup \
--plot-state-action \
--video-backend torchvision_av
If successful, it will output the dataset structure and frame information.
====================================================================================================
========================================= Humanoid Dataset =========================================
====================================================================================================
{'action.gripper': 'np scalar: 1.1111111640930176 [1, 1] float64',
'action.single_arm': 'np: [1, 5] float64',
'annotation.human.task_description': ['Grab the tape and place it in the '
'box.'],
'state.gripper': 'np scalar: 2.410423517227173 [1, 1] float64',
'state.single_arm': 'np: [1, 5] float64',
'video.front': 'np: [1, 480, 640, 3] uint8',
'video.tip': 'np: [1, 480, 640, 3] uint8'}
dict_keys(['video.front', 'video.tip', 'state.single_arm', 'state.gripper', 'action.single_arm', 'action.gripper', 'annotation.human.task_description'])
==================================================
video.front: (1, 480, 640, 3)
video.tip: (1, 480, 640, 3)
state.single_arm: (1, 5)
state.gripper: (1, 1)
action.single_arm: (1, 5)
action.gripper: (1, 1)
annotation.human.task_description: ['Grab the tape and place it in the box.']
...
Warning: Skipping left_arm as it's not found in both state and action dictionaries
Warning: Skipping right_arm as it's not found in both state and action dictionaries
Warning: Skipping left_hand as it's not found in both state and action dictionaries
Warning: Skipping right_hand as it's not found in both state and action dictionaries
Plotted state and action space
Running the Training
Once everything is set up, start the fine-tuning process. On an H100 GPU, this took about 30 minutes and consumed around 100GB of disk space for 5000 steps.
python scripts/gr00t_finetune.py \
--dataset-path ./demo_data/so101-onetape-cleanup/ \
--num-gpus 1 \
--output-dir ./so101-checkpoints \
--max-steps 5000 \
--data-config so100_dualcam \
--video-backend torchvision_av
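Since the checkpoints consumed roughly 100GB over 5000 steps in our run, it can be worth watching disk and GPU usage from a second terminal while training runs; for example:
# In a separate terminal: checkpoint disk usage, refreshed every 60 seconds
watch -n 60 "du -sh ./so101-checkpoints"
# GPU utilization and memory, sampled every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5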
If you run into memory issues or have lower specs, try reducing --dataloader-num-workers or --batch-size.
For lower-spec machines:
python scripts/gr00t_finetune.py \
--dataset-path ./demo_data/so101-onetape-cleanup/ \
--num-gpus 1 \
--output-dir ./so101-checkpoints \
--max-steps 5000 \
--data-config so100_dualcam \
--batch-size 8 \
--video-backend torchvision_av \
--dataloader-num-workers 0
Troubleshooting
Error 1: ValueError: Video key wrist not found
This error occurs because the training script is looking for the default camera name wrist but finds tip in your dataset.
Solution: Directly edit gr00t/experiment/data_config.py to fix the camera name.
# Around line 225
class So100DualCamDataConfig(So100DataConfig):
- video_keys = ["video.front", "video.wrist"]
+ video_keys = ["video.front", "video.tip"]
Error 2: av.error.MemoryError: [Errno 12] Cannot allocate memory
This error happens if you run out of memory while decoding video data.
...
RuntimeError: Caught MemoryError in DataLoader worker process 0.
Original Traceback (most recent call last):
...
File "av/error.pyx", line 326, in av.error.err_check
av.error.MemoryError: [Errno 12] Cannot allocate memory
Solution: Updating the PyAV library to the latest version might solve this.
pip install -U av
Uploading the Trained Model
Once training is complete, upload the generated checkpoint to the Hugging Face Hub.
cd so101-checkpoints/checkpoint-5000/
# Remove unnecessary files (optional)
# rm -rf scheduler.pt optimizer.pt
# Upload to Hugging Face Hub
huggingface-cli upload \
--repo-type model yuk6ra/so101-onetape-cleanup . \
--commit-message="Finetuned model with 5000 steps"
NOTE: If you're using a cloud server, don't forget to shut down the instance after the upload is complete.
Part 3: Inference
Let's use our fine-tuned model to control a physical robot. The inference setup consists of two main components: an inference server to host the model and a client node to control the robot.
Inference Server Setup and Execution
The inference server can run on a local or cloud GPU machine. We used the following local environment:
- GPU: NVIDIA GeForce RTX 4070 Ti (12GB)
- RAM: 128GB
- OS: Ubuntu 22.04
If you're running it on the cloud, make sure to open the necessary port for the inference server (e.g., 5555).
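Alternatively, instead of exposing the port publicly, you can tunnel it over SSH from the client machine; a minimal sketch (the host and SSH port below are placeholders):
# Forward local port 5555 to the inference server over SSH
ssh -p 30454 root@xxx.xxx.xxx.xx -L 5555:localhost:5555 -N
# The client can then point --policy_host at localhost and --policy_port at 5555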
First, set up the GR00T environment just as you did for fine-tuning. Then, download the model from the Hugging Face Hub and start the server.
# Set up GR00T environment (see Fine-Tuning section)
git clone https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
conda create -n gr00t python=3.10
conda activate gr00t
pip install --upgrade setuptools
pip install -e .[base]
pip install --no-build-isolation flash-attn==2.7.1.post4
# Download the model
huggingface-cli download \
--repo-type model yuk6ra/so101-onetape-cleanup \
--local-dir ./model/so101-onetape-cleanup
# Start the inference server
python scripts/inference_service.py \
--model_path ./model/so101-onetape-cleanup \
--embodiment_tag new_embodiment \
--data_config so100_dualcam \
--server \
--port 5555
If you see Server is ready and listening on tcp://0.0.0.0:5555, the server has started successfully.
NOTE: You will need to modify data_config.py to change the camera name to tip here as well.
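If the client runs on another machine, you can confirm the port is reachable before launching the full client; a quick TCP check (replace the IP with your server's address):
# Quick reachability check from the client machine
nc -zv xxx.xx.xx.xx 5555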
Troubleshooting
Error 1: OSError: CUDA_HOME environment variable is not set
This happens during the flash-attn installation if the CUDA path isn't found.
...
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
...
Solution: Install the CUDA Toolkit via conda.
conda install -c nvidia cuda-toolkit=12.4
Error 2: ModuleNotFoundError: No module named 'flash_attn'
Solution: Make sure you have correctly activated the gr00t conda environment with conda activate gr00t. You might have accidentally ended up in the base or lerobot environment.
Client Node Setup and Execution
The client runs in the lerobot environment used during data collection. If you don't already have a lerobot virtual environment, create one with the following steps:
# Clone the LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create and activate the Conda environment
conda create -y -n lerobot python=3.10
conda activate lerobot
# Install required packages
conda install ffmpeg -c conda-forge
pip install -e .
pip install -e ".[feetech]"
conda activate lerobot
After activating the lerobot environment, install any other necessary packages.
pip install matplotlib
Navigate to your Isaac-GR00T directory.
cd ~/Documents/Isaac-GR00T/ # or your path to Isaac-GR00T
First, use lerobot.find_cameras to identify the camera IDs connected to your system. You'll use these IDs as arguments when launching the client.
$ python -m lerobot.find_cameras opencv
# ... (From the output, find the camera IDs for 'tip' and 'front')
# Example: tip is 2, front is 0
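If it's unclear which index maps to which physical camera, you can also preview each device directly; for example, assuming v4l-utils and ffmpeg are installed:
# List video devices and their names
v4l2-ctl --list-devices
# Preview a specific device to confirm which camera it is
ffplay -f v4l2 /dev/video2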
Next, we need to modify getting_started/examples/eval_lerobot.py so that its imports match the current lerobot module layout and the script can find GR00T's inference client.
- from lerobot.common.cameras.opencv.configuration_opencv import (
+ from lerobot.cameras.opencv.configuration_opencv import (
OpenCVCameraConfig,
)
- from lerobot.common.robots import (
+ from lerobot.robots import (
Robot,
RobotConfig,
koch_follower,
make_robot_from_config,
so100_follower,
so101_follower,
)
- from lerobot.common.utils.utils import (
+ from lerobot.utils.utils import (
init_logging,
log_say,
)
# NOTE:
# Sometimes we would like to abstract different env, or run this on a separate machine
# User can just move this single python class method gr00t/eval/service.py
# to their code or do the following line below
- # sys.path.append(os.path.expanduser("~/Isaac-GR00T/gr00t/eval/"))
+ import os
+ import sys
+ sys.path.append(os.path.expanduser("./gr00t/eval/")) # Fix the path
from service import ExternalRobotInferenceClient
Launch the client with the following command to send instructions to the robot.
For a local environment:
python getting_started/examples/eval_lerobot.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM1 \
--robot.id=white \
--robot.cameras="{
tip: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30},
front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}
}" \
--lang_instruction="Grab the tape and place it in the box."
If you are using a cloud server for inference, set --policy_host and --policy_port accordingly.
For a cloud environment:
python getting_started/examples/eval_lerobot.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=white \
--robot.cameras="{
tip: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30},
front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}
}" \
--policy_host xxx.xx.xx.xx \
--policy_port xxxxx \
--lang_instruction="Grab tapes and place into pen holder."
Execution Results
Here are the results for GR00T N1. For comparison, we've also included success and failure examples from an ACT model.
GR00T N1 (Success)
ACT (Success)
GR00T N1 (Failure)
ACT (Failure)
Troubleshooting
Error: Inference results are NaN, or the robot movement is jerky.
The model itself might be fine, but the input from the robot could be incorrect.
Solution: Double-check your cameras configuration and try rebuilding the environment from scratch.
Part 4: Model Evaluation
Finally, let's evaluate how well the trained model can reproduce the tasks from the dataset.
Preparing for Evaluation
We'll work in the gr00t environment. Download the evaluation dataset and model, and prepare the modality.json file. This process is the same as in the fine-tuning section.
conda activate gr00t
# Download the dataset
huggingface-cli download \
--repo-type dataset yuk6ra/so101-onetape-cleanup \
--local-dir ./demo_data/so101-onetape-cleanup
# Download the model (use --revision to evaluate a specific step)
huggingface-cli download \
--repo-type model yuk6ra/so101-onetape-cleanup \
--local-dir ./model/so101-onetape-cleanup
# --revision checkpoint-2000
# Prepare modality.json
cp getting_started/examples/so100_dualcam__modality.json ./demo_data/so101-onetape-cleanup/meta/modality.json
# Use vim to change the camera name to 'tip'
Running the Evaluation
Once ready, run the evaluation script.
python scripts/eval_policy.py \
--plot \
--embodiment_tag new_embodiment \
--model_path ./model/so101-onetape-cleanup/ \
--data_config so100_dualcam \
--dataset_path ./demo_data/so101-onetape-cleanup/ \
--video_backend torchvision_av \
--modality_keys single_arm gripper \
--denoising_steps 4
NOTE: You'll need to modify data_config.py to change the camera name to tip here as well.
Troubleshooting
Error: ModuleNotFoundError: No module named 'tyro'
Solution: Double-check that you are in the correct gr00t virtual environment.
Evaluation Results
When the script finishes, it will generate a plot comparing the model's prediction (Prediction: green line) with the actual recorded action (Ground truth: orange line). This graph allows you to visually confirm how accurately the model has learned the task.
Comparative Analysis
Let's use the evaluation script to compare how training progress and task difficulty affect model performance. The plots visualize the difference between the model's prediction (green line) and the ground truth (orange line). The closer the curves are, the more accurately the model is reproducing the task.
Performance Comparison by Training Steps
We'll compare the performance on the same task (onetape-cleanup) at 2000 and 5000 training steps.
Evaluation at 2000 Steps
Analysis: At 2000 steps, the prediction (green) roughly follows the ground truth (orange), but there are noticeable deviations and oscillations. The arm's movement (single_arm) in certain dimensions is not smooth, indicating that the model has not yet fully learned to reproduce the task.
Evaluation at 5000 Steps
Analysis: After 5000 steps, the prediction and ground truth curves are nearly identical, showing that the model can reproduce the motion very smoothly. This clearly demonstrates that additional training improved the model's performance.
Performance Comparison by Task Complexity
Next, let's compare the performance of models trained for 5000 steps on tasks of varying complexity.
Easiest Task: onetape-cleanup
Analysis: As shown before, the model reproduces the easiest task almost perfectly.
Simple Task: tapes-cleanup
Analysis: With multiple tapes in the scene, the prediction (green line) deviates slightly more from the ground truth.
Complex Task: pen-cleanup
Analysis: For the more complex task of cleaning up pens (featured in the official blog), the gap between the prediction and ground truth becomes significant. There are large deviations in specific joint movements (e.g., single_arm_4), suggesting that 5000 training steps are insufficient for this level of complexity.
Effect of Further Training (7000 Steps)
Let's see the evaluation for the complex pen-cleanup task after training for 7000 steps.
Analysis: Even after 7000 steps, a significant gap remains between the prediction and the ground truth. In fact, the MSE is higher than at 5000 steps, meaning the predictions have moved further from the ground truth. This suggests that simply increasing the number of training steps may not solve the problem; it could point to other issues, such as a lack of diversity or quantity in the dataset, or a task complexity that the model cannot handle with this data.
These comparisons reaffirm that you need a sufficient number of training steps and a high-quality, diverse dataset that corresponds to the complexity of your task.
Conclusion
In this article, we walked through the entire workflow of fine-tuning an NVIDIA GR00T N1 model with custom data collected via LeRobot, followed by inference and evaluation on a physical robot, complete with detailed commands and logs.
Here are the key takeaways:
- Data Consistency: It is crucial to ensure that settings, especially camera names, are consistent between data collection (LeRobot) and training/inference (GR00T).
- Environment Setup: Properly separating virtual environments with conda and installing the correct libraries are key to success.
- Troubleshooting: You'll need to read error logs carefully and adapt to errors specific to your custom dataset, such as by directly editing configuration files.
I hope this guide helps you in developing your own robotics applications with GR00T. Future work could involve tackling more complex tasks or comparing model performance across different numbers of training steps.
If you have any feedback or find any mistakes, please don't hesitate to reach out. Let's build this exciting future together.