SimTooReal

How to Add Live Telemetry and Failure Diagnosis to Isaac Lab, MuJoCo, or Gazebo Training in Under 5 Minutes

If you train robot policies long enough, you eventually realize the main problem is not launching runs.

It is answering these questions fast enough:

  • Is the run actually learning?
  • Is it stuck?
  • Did the reward improve for the right reason?
  • Is one joint or one failure mode quietly ruining transfer?
  • Did the crash already tell us what to fix?

This post walks through a practical approach using SimTooReal, a platform built for robotics teams working across Isaac Lab, MuJoCo, Gazebo, and LeRobot workflows.

The nice part is that the basic setup does not require rewriting your training loop.

What SimTooReal gives you

From the public docs and feature pages, the platform is built around a few concrete capabilities:

  • a lightweight Python agent that wraps existing training commands
  • live streaming of 28+ training signals per iteration
  • automatic classification of 50+ failure patterns
  • sim-to-real scoring and readiness checks
  • deployment gates before hardware promotion

For this post, we will focus on the fastest path: getting live metrics and failure diagnosis around an existing run.

1. Install the Python agent

pip install simtooreal-agent

That is the shortest path shown in the docs.

2. Wrap your existing training command

For a generic training script:

simtooreal-agent run -- python train.py --env HalfCheetah-v5 --algo ppo

For an Isaac Lab run:

simtooreal-agent run -- python scripts/rsl_rl/train.py --task Isaac-Ant-v0 --headless

The agent wraps your existing command, parses its stdout in real time, and streams the metrics it finds into the SimTooReal dashboard.
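
To make the "wrap and parse stdout" idea concrete, here is a minimal sketch of the general technique in plain Python. It is not SimTooReal's implementation: the `key=value` log format, `parse_metrics`, and `run_and_stream` are all assumptions made up for illustration.

```python
import re
import subprocess
import sys

# Assumed log format (hypothetical): "iter 12 | mean_reward=3.41 | entropy=0.87"
KV_PATTERN = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_metrics(line):
    """Return {name: float} for every numeric key=value pair in one log line."""
    return {key: float(value) for key, value in KV_PATTERN.findall(line)}


def run_and_stream(command, on_metrics):
    """Run `command`, echo its output, and call `on_metrics` per parsed line."""
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr in so crashes are seen too
        text=True,
        bufsize=1,  # line-buffered reads on our side of the pipe
    )
    for line in proc.stdout:
        sys.stdout.write(line)  # keep the original console output intact
        metrics = parse_metrics(line)
        if metrics:
            on_metrics(metrics)
    return proc.wait()  # exit code of the wrapped command
```

Calling `run_and_stream(["python", "train.py"], print)` would print each parsed metrics dict as the run emits it. A real agent would send those dicts to a backend instead of printing them.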

3. Open the dashboard

The docs are here:

https://www.simtooreal.com/docs

Once the run is wrapped, you can monitor live training behavior rather than waiting for logs to finish writing.

4. Watch the signals that actually change decisions

The platform advertises 28+ streamed signals. In practice, the most useful ones for day-to-day robotics training are likely to be:

  • mean reward
  • entropy
  • KL divergence
  • reward component breakdowns
  • GPU utilization
  • failure counts
  • convergence behavior

The reason this matters is simple: a reward curve alone is not enough.

I have seen runs where:

  • reward rises while entropy collapses
  • one bad contact event creates unstable gradients
  • a policy seems healthy until a single joint starts violating limits repeatedly

If you only check the final chart, you catch those issues too late.
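
As a rough illustration of the first case, here is a sketch of a check that flags "reward rising while entropy collapses" from two metric histories. The window size, the entropy floor, and both function names are assumptions I picked for the example, not values from SimTooReal.

```python
def slope(values):
    """Least-squares slope of `values` against their index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def entropy_collapse_while_improving(rewards, entropies, window=50, entropy_floor=0.05):
    """True if, over the last `window` iterations, reward trends up while
    entropy trends down and has already dropped below `entropy_floor`."""
    recent_rewards = rewards[-window:]
    recent_entropies = entropies[-window:]
    if len(recent_rewards) < 2 or len(recent_entropies) < 2:
        return False  # not enough history to estimate a trend
    return (
        slope(recent_rewards) > 0
        and slope(recent_entropies) < 0
        and recent_entropies[-1] < entropy_floor
    )
```

The point of checking both trends together is that either signal alone looks fine: rising reward is what you want, and entropy is supposed to fall somewhat as a policy commits. The combination with a near-zero entropy is the warning sign.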

5. Use failure diagnosis while the run is still alive

This is the part I like most in the public feature set.

SimTooReal's failure intelligence is designed to recognize more than 50 patterns from live logs, including:

  • CUDA OOM
  • illegal memory access
  • entropy collapse
  • reward plateau
  • KL runaway
  • NaN reward from physics contact
  • joint limit violations
  • unstable simulation timestep
  • task config mismatch
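
For a feel of how log-based classification can work, here is a small regex sketch covering three of the patterns above. The two CUDA strings follow the wording PyTorch typically prints; the label names, the NaN rule, and the functions are my own assumptions and say nothing about how SimTooReal's classifier is built.

```python
import re
from collections import Counter

# Hypothetical label -> pattern table, for illustration only.
FAILURE_PATTERNS = {
    "cuda_oom": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "illegal_memory_access": re.compile(r"illegal memory access", re.IGNORECASE),
    "nan_value": re.compile(r"\bnan\b", re.IGNORECASE),
}


def classify_line(line):
    """Return the labels of every pattern that matches this log line."""
    return [label for label, pattern in FAILURE_PATTERNS.items() if pattern.search(line)]


def classify_log(lines):
    """Count how often each failure label appears across a whole log."""
    counts = Counter()
    for line in lines:
        counts.update(classify_line(line))
    return counts
```

Running `classify_log(open("train.log"))` on a saved log gives a quick per-pattern tally. Metric-based patterns such as reward plateau or KL runaway need history rather than single-line matching, so they would not fit this table.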

The platform also exposes a free public diagnosis page:

https://www.simtooreal.com/diagnose

That page accepts logs from frameworks like:

  • Isaac Lab
  • rsl_rl
  • Stable-Baselines3
  • LeRobot
  • MuJoCo
  • Gazebo

So even before you wire the full workflow, you can already use the diagnosis engine as a quick debugging surface.

6. Optional: add richer telemetry through SDK-style logging

The training monitor page also shows an optional direct integration path for teams that want deeper instrumentation.

Example:

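from isaacmonitor_sdk import MonitorClient

client = MonitorClient(run_id="my-run-001")

# Placeholder values; in a real training loop these come from your trainer.
i, mean_reward, entropy = 0, 0.0, 1.0

# Log one iteration, including a per-term reward breakdown.
client.log_metrics(
    iteration=i,
    mean_reward=mean_reward,
    entropy=entropy,
    reward_components={
        "velocity": 0.8,
        "upright": 0.6,
        "contact": -0.1,
    },
)

# Report a structured failure event, then close out the run.
client.log_failure(joint="knee_left", reason="joint_limit_exceeded")
client.finish_run()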

If your team already has a structured training loop and wants richer event data than stdout parsing alone, this is the cleaner path.

7. Score the sim-to-real gap directly

Once you start collecting trajectories from both sim and real hardware, the next useful step is the public transfer scoring tool:

https://www.simtooreal.com/score

The CLI example from the product pages is:

simtooreal score --sim sim_traj.csv --real real_traj.csv

According to the site, this compares trajectories with Dynamic Time Warping and returns a transfer score out of 100.

That is useful because it turns the vague statement "transfer seems okay" into something you can compare over time.
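
If Dynamic Time Warping is new to you, here is a minimal sketch of the textbook algorithm on 1-D trajectories, plus one possible way to squash the distance into a 0 to 100 score. The exponential mapping and the `scale` parameter are my assumptions; the site does not say how its score is derived from the DTW result.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance between two 1-D sequences."""
    if not a or not b:
        raise ValueError("both trajectories must be non-empty")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def transfer_score(sim, real, scale=1.0):
    """Hypothetical mapping: 100 for identical shapes, decaying with DTW distance."""
    per_step = dtw_distance(sim, real) / max(len(sim), len(real))
    return 100.0 * math.exp(-per_step / scale)
```

DTW is a reasonable choice here because sim and real trajectories rarely line up sample for sample: a real robot that reaches the same pose slightly later should not be penalized the way a pointwise error metric would.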

8. Move from run tracking to deployment discipline

This is where the product goes beyond metrics collection.

The deployment flow on the site is built around:

  • transfer validation
  • physics safety gates
  • operator preflight
  • shadow mode
  • canary rollout

If your team is already training policies successfully, this is the next maturity step. Training visibility helps you produce better checkpoints. Deployment gates help you avoid promoting the wrong ones.
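
To show what "gates before promotion" can look like in code, here is a small sketch of ordered pre-promotion checks that stop at the first failure. Everything in it is hypothetical: the field names, the thresholds, and the `first_failed_gate` helper are mine, and shadow mode and canary rollout are left out because they are rollout stages rather than static checks.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A gate is a name plus a predicate over a checkpoint's summary stats.
Gate = Tuple[str, Callable[[Dict], bool]]

# Hypothetical thresholds and field names, for illustration only.
GATES: List[Gate] = [
    ("transfer_validation", lambda c: c["transfer_score"] >= 80.0),
    ("physics_safety", lambda c: c["joint_limit_violations"] == 0),
    ("operator_preflight", lambda c: bool(c["preflight_signed_off"])),
]


def first_failed_gate(candidate: Dict, gates: List[Gate]) -> Optional[str]:
    """Return the name of the first gate the candidate fails, or None if all pass."""
    for name, check in gates:
        if not check(candidate):
            return name
    return None
```

Returning the name of the failed gate, rather than a bare boolean, keeps the "why was this checkpoint blocked" answer attached to the decision.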

Why this setup is practical

A lot of robotics infra tools fail because they ask for too much up front.

What makes this workflow practical is that it gives you a progressive path:

  1. wrap an existing run
  2. get live metrics
  3. use automatic diagnosis
  4. add transfer scoring
  5. enforce deployment gates

That is a much better adoption curve than "rebuild your training stack around our SDK."

Final thoughts

If you work in robot learning, the gap between simulation and hardware is not just a modeling problem. It is a visibility problem and a decision problem.

A good workflow should help you answer:

  • what is happening now
  • why it failed
  • whether transfer is improving
  • whether a policy is safe enough to move forward

That is what SimTooReal is aiming to solve.

Useful links:

  • Docs: https://www.simtooreal.com/docs
  • Training monitor: https://www.simtooreal.com/features/train
  • Failure intelligence: https://www.simtooreal.com/features/failures
  • Free diagnosis: https://www.simtooreal.com/diagnose
  • Transfer score: https://www.simtooreal.com/score
