The missing layer between W&B and Datadog: observability for AI robots

#robotics #ai #observability #mlops

A backend service falls over at 2am and you know the drill: open the dashboard, follow the trace, find the bad deploy, roll back. Twenty years of tooling (logs, metrics, traces, APM) exists to answer "what just happened, and why?"

Now your robot bricks a grasp at 2am. What do you open?

There's no trace. The "request" was a 40-second episode of a policy reacting to the physical world. The failure isn't in a log line. It's in the half-second where the gripper closed early, which only makes sense if you can see the wrist camera, the joint torques, and the policy's action outputs on the same clock. You can't grep that. And the regression that caused it shipped because "it worked in sim" and nobody re-ran it against the 3,000 episodes where it used to work.

We have Datadog for services and Weights & Biases for training. We have almost nothing for the part in between: the run itself. That gap is where robotics observability lives, and it's about to matter a lot, because every team shipping vision-language-action (VLA), imitation, and RL policies is hitting the same wall.

The unit of debugging is the episode, not the log line

This is the whole thesis. Backend observability is built around the request. Robotics observability has to be built around the episode, a synchronized bundle of:

video (one or more camera streams),
sensors (joint states, forces, IMU, battery),
actions (what the policy actually commanded),

all locked to a single timeline, and tagged with the four things that make it reproducible: policy_version, env_version, git_sha, seed.

Drop any of those and you've got a video you can't trust. Keep them, and an episode stops being a memory and becomes a test case.
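To make "an episode becomes a test case" concrete, here is a minimal sketch of what such a record could look like. The `Episode` class, `REPRO_FIELDS`, and `missing_repro_fields` are hypothetical names for illustration, not RoboTrace's actual schema; only the four tag names come from the text above.

```python
from dataclasses import dataclass, field

# Hypothetical data model for illustration; not RoboTrace's actual schema.
# The four tags are the ones named in the text.
REPRO_FIELDS = ("policy_version", "env_version", "git_sha", "seed")


@dataclass(frozen=True)
class Episode:
    policy_version: str
    env_version: str
    git_sha: str
    seed: int
    # Stream name -> artifact path; every stream is assumed to share one clock.
    streams: dict = field(default_factory=dict)

    def repro_key(self) -> tuple:
        """The tuple of tags needed to re-roll this run."""
        return tuple(getattr(self, name) for name in REPRO_FIELDS)


def missing_repro_fields(metadata: dict) -> list:
    """Return the reproducibility tags that are absent or empty in raw metadata."""
    return [name for name in REPRO_FIELDS if metadata.get(name) in (None, "")]


meta = {"policy_version": "grasp-v7", "env_version": "cell-3", "seed": 42}
print(missing_repro_fields(meta))  # ['git_sha']
```

A check like this at ingest time is one way to refuse "a video you can't trust" before it ever lands in the archive.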

What "observability" has to mean here

Steal the shape of service observability, but redefine each pillar for physical runs:

Record. Capture episodes with near-zero friction from inside the training loop. If logging a run takes more than a few lines of code, nobody does it. Heavy bytes (tens of GB of video) go straight to object storage; only metadata hits your API.

import robotrace as rt

# One call: uploads the artifacts, stamps the reproducibility fields,
# and returns an Episode you can open in the portal.
ep = rt.log_episode(
    policy_version="grasp-v7",
    env_version="cell-3",
    seed=42,
    video="wrist_cam.mp4",
    sensors="joint_states.npz",   # timestamped
    actions="policy_outputs.npz", # same clock
)

Replay. Scrub every run with frame accuracy, all streams on one timeline. Pause on the frame where it broke, copy a ?t=12840ms link, and a teammate lands on the exact moment. This is the "follow the trace" of robotics.

Explain. A failed run should tell you why, not hand you a metadata dump. Ranked root causes (replay regression, raised exception, battery brownout, action saturation) are surfaced the moment the episode finalizes.

Verify. This is the one pillar service observability never needed. Before you ship a new policy to real hardware, re-roll it against thousands of historical episodes and read the diff: where does the candidate do better, and where does it regress? Gate the deploy on that, without booking another hour on the arm.
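The Replay pillar depends on resolving one timestamp against streams that tick at different rates. Here is a rough sketch of that lookup, assuming each stream carries a sorted list of millisecond timestamps on the shared clock. The function names and the example URL are hypothetical; only the ?t=...ms link shape comes from the text.

```python
from bisect import bisect_right
from urllib.parse import parse_qs, urlparse


def parse_t_ms(url: str) -> int:
    """Extract the timestamp from a link of the form ...?t=12840ms."""
    value = parse_qs(urlparse(url).query)["t"][0]
    if value.endswith("ms"):
        value = value[:-2]
    return int(value)


def sample_at(timestamps_ms: list, t_ms: int) -> int:
    """Index of the latest sample at or before t_ms, clamped to the first sample."""
    return max(bisect_right(timestamps_ms, t_ms) - 1, 0)


def snapshot(streams: dict, t_ms: int) -> dict:
    """For each stream, pick the sample that was current at t_ms."""
    return {name: sample_at(ts, t_ms) for name, ts in streams.items()}


# Hypothetical timestamps: video near 30 fps, actions at 20 Hz.
streams = {"video": [0, 33, 67, 100], "actions": [0, 50, 100]}
t = parse_t_ms("https://portal.example/episodes/ep_123?t=67ms")
print(snapshot(streams, t))  # {'video': 2, 'actions': 1}
```

Taking the latest sample at or before the timestamp (rather than the nearest one) is a design choice in this sketch: it never shows data from after the moment you paused on.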

That last pillar is the point. It closes the loop from "we recorded a run" to "we won't ship that regression to a real robot."
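As a sketch of what "read the diff and gate the deploy" might look like once the re-rolls are done: the code below assumes you already have a per-episode success flag for the baseline and the candidate policy. `GateReport`, `diff_rerolls`, and the `max_regressions` threshold are hypothetical names, and how the re-roll itself is executed is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class GateReport:
    improved: list   # failed under baseline, succeeds under candidate
    regressed: list  # succeeded under baseline, fails under candidate
    unchanged: int   # same outcome under both policies

    def passes(self, max_regressions: int = 0) -> bool:
        """Allow the deploy only if regressions stay within the budget."""
        return len(self.regressed) <= max_regressions


def diff_rerolls(baseline: dict, candidate: dict) -> GateReport:
    """Compare per-episode success (episode_id -> bool) for two policies.

    Only episodes present in both result sets are compared.
    """
    improved, regressed, unchanged = [], [], 0
    for episode_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[episode_id], candidate[episode_id]
        if after and not before:
            improved.append(episode_id)
        elif before and not after:
            regressed.append(episode_id)
        else:
            unchanged += 1
    return GateReport(improved, regressed, unchanged)


baseline = {"ep1": True, "ep2": False, "ep3": True}
candidate = {"ep1": True, "ep2": True, "ep3": False}
report = diff_rerolls(baseline, candidate)
print(report.regressed, report.passes())  # ['ep3'] False
```

Listing the regressed episode IDs, not just a pass rate, matters here: each one is an episode you can open and replay to see exactly where the candidate went wrong.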

[Image: RoboTrace: log an episode in a few lines, get a portal URL back]

Why now

Three things are converging. Foundation/VLA policies are getting cheap enough to iterate on weekly, so the bottleneck moves from training to trusting. Real-robot time stays expensive and scarce, so you can't validate by re-running on hardware; you validate against history. And teams are scaling from one robot to fleets, where "it worked on my arm" is no longer an argument.

The teams that win won't be the ones with the flashiest policy. They'll be the ones who can answer "is this safe to ship?" in minutes instead of days, because they treated every run as a reproducible, replayable, re-rollable artifact from day one.

The bet

Robotics is about to get its observability moment, the same way backend did in the 2010s and ML training did with experiment trackers. The winning tool won't be a log viewer bolted onto robots. It'll be episode-first: replay, explain, and regression-gating as first-class primitives, with reproducibility baked into the data model instead of bolted on later.

That's the bet we're making with RoboTrace: observability and evals for AI-powered robots. Install the SDK with pip install robotrace-dev (early access during alpha); the source lives on GitHub. If you're training policies and debugging them with screen recordings and a spreadsheet, I'd genuinely like to hear how you're doing it today.

You can't grep a robot. So let's build the thing you can do instead.