Introduction
NVIDIA's "Project GR00T," a foundation model for humanoid robots, has the potential to revolutionize the world of AI and robotics. The GR00T N1 model can be fine-tuned not just with simulation data but also with real-world data, enabling it to generate behaviors specialized for specific tasks.
This article provides a comprehensive, step-by-step guide to the entire process: from fine-tuning the GR00T N1 model using a custom dataset collected with a "SO-ARM101" single arm, to running inference on the physical robot, and finally, evaluating the trained model. We'll include concrete steps and detailed troubleshooting advice along the way.
By the end of this guide, you'll be ready to take your first steps toward building a robot model that can perform your own custom tasks.
What This Article Covers
- Data Selection: An overview of the training dataset and the settings used during data collection.
- Fine-Tuning: Fine-tuning the GR00T N1 model with our custom data.
- Inference: Controlling the physical robot using the fine-tuned model.
- Evaluation: Quantitatively evaluating the performance of the trained model.
References
This article is based on the following official documentation and blog posts. All procedures were verified against the repository versions available at the time of writing.
- Official Blog: Fine-tuning GR00T N1 with LeRobot on a custom dataset
- GitHub: NVIDIA/Isaac-GR00T / huggingface/lerobot
NOTE: These repositories are updated frequently. Please be aware of this if you are trying to replicate this environment.
Part 1: Data Selection
High-quality training data is essential for fine-tuning. This article assumes the data collection process is already complete, but we'll explain what kind of dataset we used and how it was collected.
Task Selection
We chose the simplest of the following three tasks: "pick up a single piece of tape and place it in a box." Complex tasks have a lower success rate, so we recommend starting with something simple.
You can visualize the datasets we created using the LeRobot Dataset Visualizer.
- Complex Task: pen-cleanup
- Simple Task: tapes-cleanup
- Easiest Task: onetape-cleanup (used in this guide)
Data Collection Settings
Below are the collection settings for the dataset we used (50 episodes).
The cameras configuration is particularly important. The camera names defined here (e.g., tip, front) will be referenced during the fine-tuning process, so you must remember them exactly. We recommend using the name wrist to align with GR00T's default settings, which will save you some configuration changes later.
dataset:
  repo_id: "yuk6ra/so101-onetape-cleanup"  # Hugging Face repository ID
  single_task: "Grab the tape and place it in the box."  # Task description
  num_episodes: 50       # Number of episodes to record
  fps: 30                # Frame rate
  episode_time_s: 15     # Max time per episode (seconds)
  reset_time_s: 15       # Reset time after episode recording (seconds)

# Follower Arm
robot:
  type: "so101_follower"
  port: "/dev/ttyACM0"   # Serial port
  id: "white"            # Follower ID

  # Camera settings (the names defined here are used later)
  cameras:
    tip:                 # Using 'wrist' here will streamline later steps
      type: "opencv"
      index_or_path: 0
      fps: 30
      width: 640
      height: 480
    front:
      type: "opencv"
      index_or_path: 2
      fps: 30
      width: 640
      height: 480

# Leader Arm
teleop:
  type: "so101_leader"
  port: "/dev/ttyACM1"   # Serial port
  id: "black"            # Leader ID

# Additional options
options:
  display_data: false    # Whether to display camera feed
  push_to_hub: true      # Whether to automatically upload to Hugging Face Hub
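For reference, these settings roughly correspond to a lerobot recording command like the sketch below. Flag names can vary between lerobot versions, so treat this as an illustration rather than an exact command to copy.
# Sketch of a recording command matching the settings above (flags may differ by lerobot version)
python -m lerobot.record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=white \
  --robot.cameras="{
    tip: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30},
    front: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30}
  }" \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM1 \
  --teleop.id=black \
  --dataset.repo_id=yuk6ra/so101-onetape-cleanup \
  --dataset.single_task="Grab the tape and place it in the box." \
  --dataset.num_episodes=50 \
  --dataset.episode_time_s=15 \
  --dataset.reset_time_s=15 \
  --dataset.push_to_hub=true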
Part 2: Fine-Tuning
Now, let's use the collected data to fine-tune the GR00T N1 model.
Preparing the Execution Environment
Fine-tuning requires a high-spec machine. We used the following cloud environment for this process.
| Component | Specs |
| --- | --- |
| GPU | NVIDIA H100 SXM (80GB VRAM) |
| Disk | 300GB+ (5000 steps consumed ~100GB) |
| RAM | 128GB+ |
| OS | Ubuntu 24.04 |
| Network Speed | 4Gbps (Upload/Download) |
NOTE: A slow network connection will significantly increase the time it takes to upload the model.
After connecting to the remote server via SSH, let's verify the specs with a few commands.
ssh -p 30454 root@xxx.xxx.xxx.xx -L 8080:localhost:8080
$ nvidia-smi
Sun Jul 13 06:57:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:E4:00.0 Off | 0 |
| N/A 47C P0 73W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$ df /home -h
Filesystem Size Used Avail Use% Mounted on
overlay 300G 90M 300G 1% /
$ free -h
total used free shared buff/cache available
Mem: 503Gi 34Gi 372Gi 47Mi 101Gi 469Gi
Swap: 8.0Gi 186Mi 7.8Gi
$ lsb_release -d
No LSB modules are available.
Description: Ubuntu 24.04.2 LTS
Next, following the official instructions, we'll set up a Conda virtual environment.
# Clone the repository
git clone https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
# Create and activate the Conda environment
conda create -n gr00t python=3.10
conda activate gr00t
# Install required packages
pip install --upgrade setuptools
pip install -e .[base]
pip install --no-build-isolation flash-attn==2.7.1.post4
Finally, log in to Hugging Face and Weights & Biases (Wandb).
- Hugging Face Token: huggingface.co/settings/tokens
- Wandb API Key: wandb.ai/authorize
huggingface-cli login
wandb login
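On a headless cloud server you can also log in non-interactively by passing the credentials directly; the environment variable names below are just a convention (export them however you manage secrets).
# Non-interactive login (HF_TOKEN / WANDB_API_KEY are placeholders you set yourself)
huggingface-cli login --token "$HF_TOKEN"
wandb login "$WANDB_API_KEY"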
Preparing the Training Data
Download your chosen dataset from the Hugging Face Hub.
huggingface-cli download \
--repo-type dataset yuk6ra/so101-onetape-cleanup \
--local-dir ./demo_data/so101-onetape-cleanup
To ensure GR00T correctly recognizes the data format, a configuration file named modality.json is required. Copy the sample and modify its contents to match your environment.
# Copy the sample file
cp getting_started/examples/so100_dualcam__modality.json ./demo_data/so101-onetape-cleanup/meta/modality.json
Change the wrist entry to tip to match the camera name you set during data collection.
# Edit the configuration file
vim ./demo_data/so101-onetape-cleanup/meta/modality.json
{
...
"video": {
"front": {
"original_key": "observation.images.front"
},
- "wrist": {
- "original_key": "observation.images.wrist"
+ "tip": {
+ "original_key": "observation.images.tip"
}
},
...
Verify that the data can be loaded correctly with the following script.
python scripts/load_dataset.py \
--dataset-path ./demo_data/so101-onetape-cleanup \
--plot-state-action \
--video-backend torchvision_av
If successful, it will output the dataset structure and frame information.
====================================================================================================
========================================= Humanoid Dataset =========================================
====================================================================================================
{'action.gripper': 'np scalar: 1.1111111640930176 [1, 1] float64',
'action.single_arm': 'np: [1, 5] float64',
'annotation.human.task_description': ['Grab the tape and place it in the '
'box.'],
'state.gripper': 'np scalar: 2.410423517227173 [1, 1] float64',
'state.single_arm': 'np: [1, 5] float64',
'video.front': 'np: [1, 480, 640, 3] uint8',
'video.tip': 'np: [1, 480, 640, 3] uint8'}
dict_keys(['video.front', 'video.tip', 'state.single_arm', 'state.gripper', 'action.single_arm', 'action.gripper', 'annotation.human.task_description'])
==================================================
video.front: (1, 480, 640, 3)
video.tip: (1, 480, 640, 3)
state.single_arm: (1, 5)
state.gripper: (1, 1)
action.single_arm: (1, 5)
action.gripper: (1, 1)
annotation.human.task_description: ['Grab the tape and place it in the box.']
...
Warning: Skipping left_arm as it's not found in both state and action dictionaries
Warning: Skipping right_arm as it's not found in both state and action dictionaries
Warning: Skipping left_hand as it's not found in both state and action dictionaries
Warning: Skipping right_hand as it's not found in both state and action dictionaries
Plotted state and action space
Running the Training
Once everything is set up, start the fine-tuning process. On an H100 GPU, this took about 30 minutes and consumed around 100GB of disk space for 5000 steps.
python scripts/gr00t_finetune.py \
--dataset-path ./demo_data/so101-onetape-cleanup/ \
--num-gpus 1 \
--output-dir ./so101-checkpoints \
--max-steps 5000 \
--data-config so100_dualcam \
--video-backend torchvision_av
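Since the checkpoints consumed roughly 100GB over 5000 steps in our run, it can be worth watching disk and GPU usage from a second terminal while training runs; for example:
# In a separate terminal: checkpoint disk usage, refreshed every 60 seconds
watch -n 60 "du -sh ./so101-checkpoints"
# GPU utilization and memory, sampled every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5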
If you run into memory issues or have lower specs, try reducing --dataloader-num-workers or --batch-size.
For lower-spec machines:
python scripts/gr00t_finetune.py \
--dataset-path ./demo_data/so101-onetape-cleanup/ \
--num-gpus 1 \
--output-dir ./so101-checkpoints \
--max-steps 5000 \
--data-config so100_dualcam \
--batch-size 8 \
--video-backend torchvision_av \
--dataloader-num-workers 0
Troubleshooting
Error 1: ValueError: Video key wrist not found
This error occurs because the training script is looking for the default camera name wrist but finds tip in your dataset.
Solution: Directly edit gr00t/experiment/data_config.py to fix the camera name.
# Around line 225
class So100DualCamDataConfig(So100DataConfig):
- video_keys = ["video.front", "video.wrist"]
+ video_keys = ["video.front", "video.tip"]
Error 2: av.error.MemoryError: [Errno 12] Cannot allocate memory
This error happens if you run out of memory while decoding video data.
...
RuntimeError: Caught MemoryError in DataLoader worker process 0.
Original Traceback (most recent call last):
...
File "av/error.pyx", line 326, in av.error.err_check
av.error.MemoryError: [Errno 12] Cannot allocate memory
Solution: Updating the PyAV library to the latest version might solve this.
pip install -U av
Uploading the Trained Model
Once training is complete, upload the generated checkpoint to the Hugging Face Hub.
cd so101-checkpoints/checkpoint-5000/
# Remove unnecessary files (optional)
# rm -rf scheduler.pt optimizer.pt
# Upload to Hugging Face Hub
huggingface-cli upload \
--repo-type model yuk6ra/so101-onetape-cleanup . \
--commit-message="Finetuned model with 5000 steps"
NOTE: If you're using a cloud server, don't forget to shut down the instance after the upload is complete.
Part 3: Inference
Let's use our fine-tuned model to control a physical robot. The inference setup consists of two main components: an inference server to host the model and a client node to control the robot.
Inference Server Setup and Execution
The inference server can run on a local or cloud GPU machine. We used the following local environment:
- GPU: NVIDIA GeForce RTX 4070 Ti (12GB)
- RAM: 128GB
- OS: Ubuntu 22.04
If you're running it on the cloud, make sure to open the necessary port for the inference server (e.g., 5555).
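Alternatively, instead of exposing the port publicly, you can tunnel it over SSH from the client machine; a minimal sketch (the host and SSH port below are placeholders):
# Forward local port 5555 to the inference server over SSH
ssh -p 30454 root@xxx.xxx.xxx.xx -L 5555:localhost:5555 -N
# The client can then point --policy_host at localhost and --policy_port at 5555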
First, set up the GR00T environment just as you did for fine-tuning. Then, download the model from the Hugging Face Hub and start the server.
# Set up GR00T environment (see Fine-Tuning section)
git clone https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T
conda create -n gr00t python=3.10
conda activate gr00t
pip install --upgrade setuptools
pip install -e .[base]
pip install --no-build-isolation flash-attn==2.7.1.post4
# Download the model
huggingface-cli download \
--repo-type model yuk6ra/so101-onetape-cleanup \
--local-dir ./model/so101-onetape-cleanup
# Start the inference server
python scripts/inference_service.py \
--model_path ./model/so101-onetape-cleanup \
--embodiment_tag new_embodiment \
--data_config so100_dualcam \
--server \
--port 5555
If you see Server is ready and listening on tcp://0.0.0.0:5555, the server has started successfully.
NOTE: You will need to modify data_config.py to change the camera name to tip here as well.
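If the client runs on another machine, you can confirm the port is reachable before launching the full client; a quick TCP check (replace the IP with your server's address):
# Quick reachability check from the client machine
nc -zv xxx.xx.xx.xx 5555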
Troubleshooting
Error 1: OSError: CUDA_HOME environment variable is not set
This happens during the flash-attn installation if the CUDA path isn't found.
...
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
...
Solution: Install the CUDA Toolkit via conda.
conda install -c nvidia cuda-toolkit=12.4
Error 2: ModuleNotFoundError: No module named 'flash_attn'
Solution: Make sure you have correctly activated the gr00t conda environment with conda activate gr00t. You might have accidentally ended up in the base or lerobot environment.
Client Node Setup and Execution
The client runs in the lerobot environment used during data collection. If you don't already have a lerobot virtual environment, create one with the following steps:
# Clone the LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create and activate the Conda environment
conda create -y -n lerobot python=3.10
conda activate lerobot
# Install required packages
conda install ffmpeg -c conda-forge
pip install -e .
pip install -e ".[feetech]"
conda activate lerobot
After activating the lerobot environment, install any other necessary packages.
pip install matplotlib
Navigate to your Isaac-GR00T directory.
cd ~/Documents/Isaac-GR00T/ # or your path to Isaac-GR00T
First, use lerobot.find_cameras to identify the camera IDs connected to your system. You'll use these IDs as arguments when launching the client.
$ python -m lerobot.find_cameras opencv
# ... (From the output, find the camera IDs for 'tip' and 'front')
# Example: tip is 2, front is 0
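If it's unclear which index maps to which physical camera, you can also preview each device directly; for example, assuming v4l-utils and ffmpeg are installed:
# List video devices and their names
v4l2-ctl --list-devices
# Preview a specific device to confirm which camera it is
ffplay -f v4l2 /dev/video2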
Next, we need to modify getting_started/examples/eval_lerobot.py so that its imports match the current lerobot module layout and the script can find GR00T's inference client.
- from lerobot.common.cameras.opencv.configuration_opencv import (
+ from lerobot.cameras.opencv.configuration_opencv import (
OpenCVCameraConfig,
)
- from lerobot.common.robots import (
+ from lerobot.robots import (
Robot,
RobotConfig,
koch_follower,
make_robot_from_config,
so100_follower,
so101_follower,
)
- from lerobot.common.utils.utils import (
+ from lerobot.utils.utils import (
init_logging,
log_say,
)
# NOTE:
# Sometimes we would like to abstract different env, or run this on a separate machine
# User can just move this single python class method gr00t/eval/service.py
# to their code or do the following line below
- # sys.path.append(os.path.expanduser("~/Isaac-GR00T/gr00t/eval/"))
+ import os
+ import sys
+ sys.path.append(os.path.expanduser("./gr00t/eval/")) # Fix the path
from service import ExternalRobotInferenceClient
Launch the client with the following command to send instructions to the robot.
For a local environment:
python getting_started/examples/eval_lerobot.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM1 \
--robot.id=white \
--robot.cameras="{
tip: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30},
front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}
}" \
--lang_instruction="Grab the tape and place it in the box."
If you are using a cloud server for inference, set --policy_host and --policy_port accordingly.
For a cloud environment:
python getting_started/examples/eval_lerobot.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=white \
--robot.cameras="{
tip: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30},
front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}
}" \
--policy_host xxx.xx.xx.xx \
--policy_port xxxxx \
--lang_instruction="Grab tapes and place into pen holder."
Execution Results
Here are the results for GR00T N1. For comparison, we've also included success and failure examples from an ACT model.
GR00T N1 (Success)
ACT (Success)
GR00T N1 (Failure)
ACT (Failure)
Troubleshooting
Error: Inference results are NaN, or the robot movement is jerky.
The model itself might be fine, but the input from the robot could be incorrect.
Solution: Double-check your cameras configuration and try rebuilding the environment from scratch.
Part 4: Model Evaluation
Finally, let's evaluate how well the trained model can reproduce the tasks from the dataset.
Preparing for Evaluation
We'll work in the gr00t environment. Download the evaluation dataset and model, and prepare the modality.json file. This process is the same as in the fine-tuning section.
conda activate gr00t
# Download the dataset
huggingface-cli download \
--repo-type dataset yuk6ra/so101-onetape-cleanup \
--local-dir ./demo_data/so101-onetape-cleanup
# Download the model (use --revision to evaluate a specific step)
huggingface-cli download \
--repo-type model yuk6ra/so101-onetape-cleanup \
--local-dir ./model/so101-onetape-cleanup
# --revision checkpoint-2000
# Prepare modality.json
cp getting_started/examples/so100_dualcam__modality.json ./demo_data/so101-onetape-cleanup/meta/modality.json
# Use vim to change the camera name to 'tip'
Running the Evaluation
Once ready, run the evaluation script.
python scripts/eval_policy.py \
--plot \
--embodiment_tag new_embodiment \
--model_path ./model/so101-onetape-cleanup/ \
--data_config so100_dualcam \
--dataset_path ./demo_data/so101-onetape-cleanup/ \
--video_backend torchvision_av \
--modality_keys single_arm gripper \
--denoising_steps 4
NOTE: You'll need to modify data_config.py to change the camera name to tip here as well.
Troubleshooting
Error: ModuleNotFoundError: No module named 'tyro'
Solution: Double-check that you are in the correct gr00t virtual environment.
Evaluation Results
When the script finishes, it will generate a plot comparing the model's prediction (Prediction: green line) with the actual recorded action (Ground truth: orange line). This graph allows you to visually confirm how accurately the model has learned the task.
Comparative Analysis
Let's use the evaluation script to compare how training progress and task difficulty affect model performance. The plots visualize the difference between the model's prediction (green line) and the ground truth (orange line). The closer the curves are, the more accurately the model is reproducing the task.
Performance Comparison by Training Steps
We'll compare the performance on the same task (onetape-cleanup) at 2000 and 5000 training steps.
Evaluation at 2000 Steps
Analysis: At 2000 steps, the prediction (green) roughly follows the ground truth (orange), but there are noticeable deviations and oscillations. The arm's movement (single_arm) in certain dimensions is not smooth, indicating that the model has not yet fully learned to reproduce the task.
Evaluation at 5000 Steps
Analysis: After 5000 steps, the prediction and ground truth curves are nearly identical, showing that the model can reproduce the motion very smoothly. This clearly demonstrates that additional training improved the model's performance.
Performance Comparison by Task Complexity
Next, let's compare the performance of models trained for 5000 steps on tasks of varying complexity.
Easiest Task: onetape-cleanup
Analysis: As shown before, the model reproduces the easiest task almost perfectly.
Simple Task: tapes-cleanup
Analysis: With multiple tapes in the scene, the prediction (green line) deviates slightly more from the ground truth.
Complex Task: pen-cleanup
Analysis: For the more complex task of cleaning up pens (featured in the official blog), the gap between the prediction and ground truth becomes significant. There are large deviations in specific joint movements (e.g., single_arm_4), suggesting that 5000 training steps are insufficient for this level of complexity.
Effect of Further Training (7000 Steps)
Let's see the evaluation for the complex pen-cleanup task after training for 7000 steps.
Analysis: Even after 7000 steps, a significant gap remains between the prediction and the ground truth. In fact, the MSE is higher than at 5000 steps, meaning the predictions have moved further from the ground truth. This suggests that simply increasing the number of training steps may not solve the problem; it could point to other issues, such as a lack of diversity or quantity in the dataset, or a task complexity that the model cannot handle with this data.
These comparisons reaffirm that you need a sufficient number of training steps and a high-quality, diverse dataset that corresponds to the complexity of your task.
Conclusion
In this article, we walked through the entire workflow of fine-tuning an NVIDIA GR00T N1 model with custom data collected via LeRobot, followed by inference and evaluation on a physical robot, complete with detailed commands and logs.
Here are the key takeaways:
- Data Consistency: It is crucial to ensure that settings, especially camera names, are consistent between data collection (LeRobot) and training/inference (GR00T).
- Environment Setup: Properly separating virtual environments with conda and installing the correct libraries are key to success.
- Troubleshooting: You'll need to read error logs carefully and adapt to errors specific to your custom dataset, such as by directly editing configuration files.
I hope this guide helps you in developing your own robotics applications with GR00T. Future work could involve tackling more complex tasks or comparing model performance across different numbers of training steps.
If you have any feedback or find any mistakes, please don't hesitate to reach out. Let's build this exciting future together.