I built a tool that finds silent corruption in robot training datasets. Now it fixes it too.

Kunal — Mon, 06 Jul 2026 13:49:38 +0000

That audit of 100 public LeRobot datasets became trajlens, an open-source lint tool for LeRobot datasets. v0.1 could detect 20+ classes of these problems and give a dataset a trust score.

Today I released v0.2. It doesn't just detect anymore. It repairs.

What's new

Three fixers, one command.

pip install trajlens
trajlens fix ./my_dataset --out ./my_dataset_repaired --apply

timestamp_dedrift rewrites drifted timestamps back to frame_index / fps, matching the exact float32 quantization the detection check uses.
stats_recompute streams the full dataset through a Welford pass and rewrites stats.json with the correct mean, std, min, max, and count.
episode_reindex derives true episode boundaries from the per-row index column in the data itself and corrects the metadata to match.
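To make the Welford pass concrete, here is a minimal sketch of a streaming accumulator for a single scalar feature. It is an illustration of the technique, not trajlens's actual code: the class name RunningStats, its methods, and the choice of population standard deviation (dividing by count) are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford accumulator for one scalar feature (hypothetical helper)."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Classic Welford update: one pass, no need to keep earlier values.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def update_many(self, chunk) -> None:
        # A chunk stands in for one batch of rows streamed from disk.
        for x in chunk:
            self.update(x)

    @property
    def std(self) -> float:
        # Population standard deviation (assumed convention, ddof=0).
        return math.sqrt(self.m2 / self.count) if self.count else float("nan")

    def summary(self) -> dict:
        return {
            "mean": self.mean,
            "std": self.std,
            "min": self.minimum,
            "max": self.maximum,
            "count": self.count,
        }
```

Because each update only touches a handful of running values, memory use stays constant no matter how many rows are streamed through.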

Safety was the whole design problem. A repair tool that writes wrong data is worse than no tool. So every fixer is:

Copy-on-write. The original dataset is never touched. Ever.
Dry run by default. You see a full diff of what would change before anything is written.
Round-trip tested. Every fixer's test suite proves that repair followed by re-lint clears the finding and introduces no new ones.
Fail closed. If a dataset is internally inconsistent in a way that makes repair ambiguous, you get a clear error, not a guess.
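The copy-on-write and dry-run properties can be sketched in a few lines. This is a rough illustration under stated assumptions, not how trajlens is implemented: the file name meta.json, the functions plan_changes and fix_metadata, and the repair callback are all hypothetical.

```python
import json
import shutil
from pathlib import Path


def plan_changes(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


def fix_metadata(src, out, repair, apply: bool = False) -> dict:
    """Report the diff; write only when apply=True, and only into `out`."""
    src, out = Path(src), Path(out)
    if out.resolve() == src.resolve():
        # Fail closed: never allow the repair to target the original.
        raise ValueError("output directory must differ from the source dataset")

    old = json.loads((src / "meta.json").read_text())
    new = repair(dict(old))  # repair works on a copy of the parsed metadata
    diff = plan_changes(old, new)

    if apply and diff:
        # copytree raises FileExistsError if `out` exists, so nothing is overwritten.
        shutil.copytree(src, out)
        (out / "meta.json").write_text(json.dumps(new, indent=2))
    return diff
```

Calling fix_metadata without apply=True returns the diff and leaves the filesystem untouched, which is the behaviour a dry-run default implies.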

The hardest case was episode reindexing. The corruption it fixes is silent by nature, so a buggy fixer could manufacture the exact corruption it claims to repair. The fixer treats the actual data as ground truth and only ever corrects metadata to match it, never the reverse. I also verified the load-bearing assumption (that every v3.0 data row carries a trustworthy global index column) against the lerobot source at the exact commit, not from memory.
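A minimal sketch of the data-as-ground-truth idea follows. It assumes each row exposes an episode number and a global index that starts at 0 and increases by one per row, and that metadata stores half-open (from, to) ranges per episode. The names derive_boundaries, metadata_corrections, and AmbiguousDatasetError are hypothetical, not part of trajlens or lerobot.

```python
class AmbiguousDatasetError(ValueError):
    """Raised when the rows themselves are inconsistent, so no repair is attempted."""


def derive_boundaries(rows):
    """rows: iterable of (episode_index, index) pairs in storage order.

    Returns {episode_index: (from_index, to_index)} with half-open ranges.
    """
    bounds = {}
    expected_index = 0
    last_episode = None
    for episode, index in rows:
        if index != expected_index:
            raise AmbiguousDatasetError(
                f"global index {index} found where {expected_index} was expected"
            )
        if episode != last_episode:
            if episode in bounds:
                raise AmbiguousDatasetError(
                    f"episode {episode} appears in two separate runs"
                )
            bounds[episode] = [index, index + 1]
            last_episode = episode
        else:
            bounds[episode][1] = index + 1
        expected_index += 1
    return {episode: (lo, hi) for episode, (lo, hi) in bounds.items()}


def metadata_corrections(rows, metadata):
    """Return only the metadata entries that disagree with the data."""
    truth = derive_boundaries(rows)
    return {ep: rng for ep, rng in truth.items() if metadata.get(ep) != rng}
```

The rows are only ever read. The one thing that can change is the metadata, and any inconsistency in the rows raises an error instead of producing a guess.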

A web dashboard.

trajlens web ./my_dataset

One command, localhost only, renders the full lint report in your browser. Trust score gauge, severity color coding, per-check drill-down. No React build chain, no CDN calls, no analytics. It's a single static page served by FastAPI, because a tool you point at private robot data should not phone home through your browser.

Fun bug from building it: all 437 tests were green and the dashboard was completely broken in a real browser. The Content Security Policy I'd set (no inline scripts) was blocking the page's own inline JavaScript. Test clients check that headers exist. Only browsers enforce them. The fix was externalizing the JS and CSS rather than loosening the policy, plus a structural test that fails if anyone ever adds an inline script block again.
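A structural test of that kind could look roughly like the sketch below, which uses only the standard library's HTML parser. The names InlineScriptFinder and find_inline_scripts are hypothetical, and flagging on* event-handler attributes alongside script blocks is an added assumption (a CSP without 'unsafe-inline' blocks those too).

```python
from html.parser import HTMLParser


class InlineScriptFinder(HTMLParser):
    """Collects markup that a no-inline-scripts CSP would block (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        line = self.getpos()[0]
        if tag == "script" and "src" not in names:
            # A <script> without src carries its code inline.
            self.violations.append(f"inline <script> block at line {line}")
        for name in sorted(names):
            if name.startswith("on"):
                # Inline event handlers such as onclick are inline script too.
                self.violations.append(f"inline handler {name} on <{tag}> at line {line}")


def find_inline_scripts(html: str) -> list:
    finder = InlineScriptFinder()
    finder.feed(html)
    finder.close()
    return finder.violations


def assert_no_inline_scripts(html: str) -> None:
    violations = find_inline_scripts(html)
    assert not violations, "CSP would block: " + "; ".join(violations)
```

Run against the served page's HTML in the test suite, a check like this fails as soon as an inline script block reappears, without needing a real browser.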

Why this matters

Imitation learning is only as good as its demonstrations. The ecosystem has great tools for collecting data and training policies, and almost nothing for answering "is this dataset actually what it claims to be." The bugs trajlens catches are real, documented issues in the wild (LeRobot #2401 and #3177 are two of them), and they don't announce themselves.

If you train on LeRobot-format data, two minutes of linting might save you a week of debugging a policy that won't converge.

Links

GitHub: https://github.com/Kunal-Somani/trajlens
PyPI: https://pypi.org/project/trajlens/
Changelog: https://github.com/Kunal-Somani/trajlens/releases/tag/v0.2.0

v0.3 direction: more fixers, deeper per-episode analytics, and upstreaming the boundary-corruption fix to LeRobot itself. Issues and PRs welcome. If you run it on your dataset and it finds something weird, I genuinely want to hear about it.

We linted 100 public LeRobot datasets. Here's what we found.

Kunal — Mon, 29 Jun 2026 07:05:27 +0000

The Hugging Face Hub now hosts 58,000+ community LeRobotDataset repos, the single largest dataset category on the Hub, up roughly 50x in five months. LeRobotDataset has won the format war for robot-learning data. Nobody has been checking whether that data is actually safe to train on.

So I built trajlens, a linter for LeRobotDataset data, and pointed it at 100 real public datasets to find out.

The numbers

Status	Count	What it means
PASS	19	Clean, no issues found
WARN	0
FAIL	13	A real validation check fired: schema mismatch, corrupted episode metadata, missing language annotations
ERROR	47	The dataset couldn't even be loaded: unsupported format version, missing metadata, dead or mistagged repos
TIMEOUT	21	Exceeded the lint budget, mostly genuinely large datasets

81% of the datasets I tested had something wrong with them, or couldn't be linted cleanly at all.

The two named bugs

Two specific upstream lerobot issues show up often enough to be worth naming directly, not just calling them "quality issues":

Timestamp float drift (#3177). Accumulating floating-point rounding error in stored timestamps causes video decode to fail partway through training, often dozens of episodes in. Found in 3.1% of datasets that linted successfully.
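One way this kind of drift can arise is sketched below. It assumes timestamps were produced by repeatedly adding a float32 step of 1/fps, which is an illustrative guess at the mechanism, not a claim about how any particular dataset was recorded. The helper names and the 1e-4 s tolerance are likewise hypothetical. The canonical value used for comparison is frame_index / fps rounded to float32.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulated_timestamps(n_frames: int, fps: int) -> list:
    """Timestamps built by repeated float32 addition (assumed drift mechanism)."""
    step = f32(1.0 / fps)
    t = 0.0
    out = []
    for _ in range(n_frames):
        out.append(t)
        t = f32(t + step)  # rounding error accumulates on every addition
    return out


def canonical_timestamps(n_frames: int, fps: int) -> list:
    """Timestamps derived independently per frame: float32(frame_index / fps)."""
    return [f32(i / fps) for i in range(n_frames)]


def max_drift(n_frames: int, fps: int) -> float:
    drifted = accumulated_timestamps(n_frames, fps)
    clean = canonical_timestamps(n_frames, fps)
    return max(abs(a - b) for a, b in zip(drifted, clean))


tolerance_s = 1e-4  # hypothetical tolerance, for illustration only
drift = max_drift(20_000, 30)
print(f"max drift over 20000 frames at 30 fps: {drift:.6f} s")
print("exceeds tolerance:", drift > tolerance_s)
```

Deriving each timestamp from its own frame index keeps the error bounded by a single rounding step, whereas the accumulated version carries every earlier rounding error forward.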

v2.1 to v3.0 conversion corruption (#2401). The episode-to-frame index boundaries written during the v2.1 to v3.0 migration can silently disagree with the actual data. No error is raised. Frames get assigned to the wrong episode. A policy trains on mislabeled data and nobody notices until results look wrong for reasons no one can pin down. Found in 18.8% of successfully-linted datasets, the single most common real failure in the whole sample.

Neither of these crashes a training run immediately. Both are the kind of bug that burns a GPU-day before you find out your data was the problem, not your model.
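For readers checking the percentages, the arithmetic can be reproduced from the status table. The sketch assumes "linted successfully" means PASS + WARN + FAIL, and the per-bug dataset counts (1 and 6) are back-solved from the reported percentages, not figures stated in the audit.

```python
counts = {"PASS": 19, "WARN": 0, "FAIL": 13, "ERROR": 47, "TIMEOUT": 21}

total = sum(counts.values())
not_clean = total - counts["PASS"]
pct_not_clean = 100 * not_clean / total  # the headline 81%

# Assumption: ERROR and TIMEOUT runs never produced a lint result,
# so they are excluded from the "linted successfully" denominator.
linted = counts["PASS"] + counts["WARN"] + counts["FAIL"]

# Back-solved counts (assumptions, not reported figures).
drift_datasets = 1
boundary_datasets = 6

pct_drift = 100 * drift_datasets / linted        # 3.125, reported as 3.1%
pct_boundary = 100 * boundary_datasets / linted  # 18.75, reported as 18.8%

print(total, not_clean, linted)
print(pct_not_clean, pct_drift, pct_boundary)
```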

What trajlens actually checks

16 checks across six categories, including structural integrity, timestamp and temporal consistency, video decodability, semantic correctness (task labels, feature shapes), and statistics divergence, all run by a single pluggable check engine. The full catalog and severities are in the README.

Try it

pip install trajlens
trajlens lint your-org/your-dataset

Under 30 seconds for a 100-episode local dataset. CI-friendly exit codes (0/1/2), JSON, HTML, and SARIF report formats, and a --deep flag if you want full video decode instead of the default spot-check.

What's next

fix (safe, dry-run-by-default auto-repair for what lint finds) and a web dashboard are next. After that, synthetic demonstration generation: turning a handful of seed demos into hundreds of clean, lint-passing training trajectories, free and runnable without a GPU cluster.

The check registry is pluggable. If you've hit a LeRobot data bug that isn't on this list, a contributed check is the fastest way to get it caught for everyone, not just you.

GitHub · PyPI · Full audit script (rerun it yourself: it resamples a fresh random subset each time, so exact percentages will vary from run to run, but the overall shape holds)

DEV Community: Kunal