SimTooReal

Posted on Jun 6

How to Add Live Telemetry and Failure Diagnosis to Isaac Lab, MuJoCo, or Gazebo Training in Under 5 Minutes

#ai #robotics #mujoco #reinforcementlearning

If you train robot policies long enough, you eventually realize the main problem is not launching runs.

It is answering these questions fast enough:

Is the run actually learning?
Is it stuck?
Did the reward improve for the right reason?
Is one joint or one failure mode quietly ruining transfer?
Did the crash already tell us what to fix?

This post walks through a practical approach using SimTooReal, a platform built for robotics teams working across Isaac Lab, MuJoCo, Gazebo, and LeRobot workflows.

The nice part is that the basic setup does not require rewriting your training loop.

What SimTooReal gives you

According to the public docs and feature pages, the platform is built around a few concrete capabilities:

a lightweight Python agent that wraps existing training commands
live streaming of 28+ training signals per iteration
automatic classification of 50+ failure patterns
sim-to-real scoring and readiness checks
deployment gates before hardware promotion

For this post, we will focus on the fastest path: getting live metrics and failure diagnosis around an existing run.

1. Install the Python agent

pip install simtooreal-agent

That is the shortest path shown in the docs.

2. Wrap your existing training command

For a generic training script:

simtooreal-agent run -- python train.py --env HalfCheetah-v5 --algo ppo

For an Isaac Lab run:

simtooreal-agent run -- python scripts/rsl_rl/train.py --task Isaac-Ant-v0 --headless

The agent wraps your existing command, parses its stdout in real time, and streams the metrics it finds into the SimTooReal dashboard.
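
To make the wrapping idea concrete, here is a minimal sketch of a stdout-parsing wrapper. It is not SimTooReal's implementation: the `key=value` log format, `parse_metrics`, and `run_and_stream` are all assumptions made for illustration, and a real agent would send the metrics over the network instead of calling a local callback.

```python
import re
import subprocess
import sys

# Assumed log format (hypothetical): "iter 12 mean_reward=3.5 entropy=0.91"
METRIC_RE = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_metrics(line):
    """Return every key=value numeric pair found in one log line."""
    return {key: float(value) for key, value in METRIC_RE.findall(line)}


def run_and_stream(cmd, on_metrics):
    """Run cmd, echo its output, and call on_metrics for each line with metrics."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    )
    for line in proc.stdout:
        sys.stdout.write(line)  # keep the original log visible
        metrics = parse_metrics(line)
        if metrics:
            on_metrics(metrics)
    return proc.wait()  # exit code of the wrapped command
```

Because the wrapper only reads the child's output, the training script itself stays untouched, which is the property the agent relies on. Calling `run_and_stream(["python", "train.py"], print)` would print each parsed metrics dict as it arrives.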

3. Open the dashboard

Start from the docs:

https://www.simtooreal.com/docs

Once the run is wrapped, you can monitor training behavior live rather than waiting for the run to finish and reading the logs afterward.

4. Watch the signals that actually change decisions

The platform advertises 28+ streamed signals. In practice, the most useful ones for day-to-day robotics training are likely to be:

mean reward
entropy
KL divergence
reward component breakdowns
GPU utilization
failure counts
convergence behavior

The reason this matters is simple: a reward curve alone is not enough.

I have seen runs where:

reward rises while entropy collapses
one bad contact event creates unstable gradients
a policy seems healthy until a single joint starts violating limits repeatedly

If you only check the final chart, you catch those issues too late.
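
The first pattern in that list, reward rising while entropy collapses, is easy to check per iteration. The sketch below is one simple way to do it, not the platform's detector: `CollapseWatch`, the window size, and the entropy floor are hypothetical and would need tuning per task.

```python
from collections import deque


class CollapseWatch:
    """Flag windows where reward rises while entropy drops below a floor."""

    def __init__(self, window=50, entropy_floor=0.05):
        # window and entropy_floor are illustrative defaults, not recommendations
        self.rewards = deque(maxlen=window)
        self.entropies = deque(maxlen=window)
        self.entropy_floor = entropy_floor

    def update(self, mean_reward, entropy):
        """Record one iteration; return True if the pattern is present."""
        self.rewards.append(mean_reward)
        self.entropies.append(entropy)
        if len(self.rewards) < self.rewards.maxlen:
            return False  # not enough history yet
        reward_rising = self.rewards[-1] > self.rewards[0]
        entropy_collapsed = (
            self.entropies[-1] < self.entropy_floor
            and self.entropies[-1] < self.entropies[0]
        )
        return reward_rising and entropy_collapsed
```

Calling `update` once per iteration is enough to raise the flag while the run is still alive, instead of spotting the collapse on the final chart.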

5. Use failure diagnosis while the run is still alive

This is the part I like most in the public feature set.

SimTooReal's failure intelligence is designed to recognize more than 50 patterns from live logs, including:

CUDA OOM
illegal memory access
entropy collapse
reward plateau
KL runaway
NaN reward from physics contact
joint limit violations
unstable simulation timestep
task config mismatch
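
As a rough illustration of pattern-based diagnosis, the sketch below matches a few of these failure types with regular expressions. The rule table and `classify` are hypothetical and far smaller than a real catalogue; only the CUDA strings are taken from the error text that PyTorch and the CUDA runtime actually print, and the other two regexes are guesses at typical log wording.

```python
import re

# Illustrative rules only; labels and regexes are assumptions for this sketch.
PATTERNS = [
    ("cuda_oom", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    ("illegal_memory_access", re.compile(r"illegal memory access", re.IGNORECASE)),
    ("nan_reward", re.compile(r"reward.*\bnan\b", re.IGNORECASE)),
    ("joint_limit_violation", re.compile(r"joint limit", re.IGNORECASE)),
]


def classify(log_text):
    """Return the labels of all rules that match log_text, in rule order."""
    return [label for label, rx in PATTERNS if rx.search(log_text)]
```

Running `classify` on each new chunk of log output is what lets a diagnosis appear during the run rather than after the crash.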

The platform also exposes a free public diagnosis page:

https://www.simtooreal.com/diagnose

That page accepts logs from frameworks like:

Isaac Lab
rsl_rl
Stable-Baselines3
LeRobot
MuJoCo
Gazebo

So even before you wire the full workflow, you can already use the diagnosis engine as a quick debugging surface.

6. Optional: add richer telemetry through SDK-style logging

The training monitor page also shows an optional direct integration path for teams that want deeper instrumentation.

Example:

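from isaacmonitor_sdk import MonitorClient

client = MonitorClient(run_id="my-run-001")

# Inside your training loop: i, mean_reward, and entropy come from your trainer.
client.log_metrics(
    iteration=i,
    mean_reward=mean_reward,
    entropy=entropy,
    reward_components={
        "velocity": 0.8,
        "upright": 0.6,
        "contact": -0.1,
    },
)

# Report a failure event when your own checks detect one.
client.log_failure(joint="knee_left", reason="joint_limit_exceeded")

# Call once, after training ends.
client.finish_run()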

If your team already has a structured training loop and wants richer event data than stdout parsing alone, this is the cleaner path.

7. Score the sim-to-real gap directly

Once you start collecting trajectories from both sim and real hardware, the next useful step is the public transfer scoring tool:

https://www.simtooreal.com/score

The CLI example from the product pages is:

simtooreal score --sim sim_traj.csv --real real_traj.csv

According to the site, this compares trajectories with Dynamic Time Warping and returns a transfer score out of 100.

That is useful because it turns the vague statement "transfer seems okay" into something you can compare over time.
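
For readers who have not used Dynamic Time Warping, here is a minimal sketch of the textbook algorithm on 1-D trajectories. It shows only the distance computation: how SimTooReal maps a DTW distance to a score out of 100 is not described in this post, so that step is left out, and `dtw_distance` is a name chosen for this example.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

The distance between `[1, 2, 3]` and `[1, 2, 2, 3]` is 0, because the repeated sample is absorbed by the warping. That tolerance to timing differences is what makes DTW a reasonable fit for comparing sim and real trajectories that run at slightly different speeds.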

8. Move from run tracking to deployment discipline

This is where the product goes beyond metrics collection.

The deployment flow on the site is built around:

transfer validation
physics safety gates
operator preflight
shadow mode
canary rollout

If your team is already training policies successfully, this is the next maturity step. Training visibility helps you produce better checkpoints. Deployment gates help you avoid promoting the wrong ones.
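
The gating idea can be sketched as an ordered list of checks that a checkpoint must pass before promotion. Everything here is hypothetical: the `Candidate` fields, the gate names, and the thresholds are placeholders for illustration, not SimTooReal's gate definitions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    transfer_score: float        # 0-100, as in the scoring step
    joint_limit_violations: int  # counted during evaluation rollouts


# Hypothetical thresholds; tune them per robot and task.
GATES = [
    ("transfer validation", lambda c: c.transfer_score >= 80.0),
    ("physics safety", lambda c: c.joint_limit_violations == 0),
]


def first_failed_gate(candidate):
    """Return the name of the first gate the candidate fails, or None."""
    for name, check in GATES:
        if not check(candidate):
            return name
    return None
```

Keeping the gates ordered means the cheapest checks can run first, and returning the failed gate's name gives an operator a concrete reason when a checkpoint is held back.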

Why this setup is practical

A lot of robotics infra tools fail because they ask for too much up front.

What makes this workflow practical is that it gives you a progressive path:

wrap an existing run
get live metrics
use automatic diagnosis
add transfer scoring
enforce deployment gates

That is a much better adoption curve than "rebuild your training stack around our SDK."

Final thoughts

If you work in robot learning, the gap between simulation and hardware is not just a modeling problem. It is a visibility problem and a decision problem.

A good workflow should help you answer:

what is happening now
why it failed
whether transfer is improving
whether a policy is safe enough to move forward

That is the problem SimTooReal aims to solve.

Useful links:

Docs: https://www.simtooreal.com/docs
Training monitor: https://www.simtooreal.com/features/train
Failure intelligence: https://www.simtooreal.com/features/failures
Free diagnosis: https://www.simtooreal.com/diagnose
Transfer score: https://www.simtooreal.com/score

