Kunal

We linted 100 public LeRobot datasets. Here's what we found.

The Hugging Face Hub now hosts 58,000+ community LeRobotDataset repos, the single largest dataset category on the Hub, up roughly 50x in five months. LeRobotDataset has won the format war for robot-learning data. Nobody has been checking whether that data is actually safe to train on.

So I built trajlens, a linter for LeRobotDataset data, and pointed it at 100 real public datasets to find out.

The numbers

| Status | Count | What it means |
| --- | --- | --- |
| PASS | 19 | Clean, no issues found |
| WARN | 0 | |
| FAIL | 13 | A real validation check fired: schema mismatch, corrupted episode metadata, missing language annotations |
| ERROR | 47 | The dataset couldn't even be loaded: unsupported format version, missing metadata, dead or mistagged repos |
| TIMEOUT | 21 | Exceeded the lint budget, mostly genuinely large datasets |

81% of the datasets I tested had something wrong with them, or couldn't be linted cleanly at all.
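The headline figure falls straight out of the table. Here is a quick sketch of the arithmetic. The `linted` line assumes that "linted successfully" means the lint ran to a verdict (PASS, WARN, or FAIL); that is a reading of the table, not something it states.

```python
# Status counts copied from the table above.
counts = {"PASS": 19, "WARN": 0, "FAIL": 13, "ERROR": 47, "TIMEOUT": 21}

total = sum(counts.values())
not_clean = total - counts["PASS"]

# Assumption: a dataset "linted successfully" if the run reached a verdict.
linted = counts["PASS"] + counts["WARN"] + counts["FAIL"]

print(f"{not_clean}/{total} = {not_clean / total:.0%} not clean")
print(f"{linted} datasets reached a verdict")
```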

The two named bugs

Two specific upstream lerobot issues show up often enough to be worth naming directly, not just calling them "quality issues":

Timestamp float drift (#3177). Accumulating floating-point rounding error in stored timestamps causes video decode to fail partway through training, often dozens of episodes in. Found in 3.1% of datasets that linted successfully.
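To make the failure mode concrete, here is a minimal sketch of how accumulated rounding error can push timestamps away from the ideal `frame_index / fps`. It is an illustration of the general effect, not a reproduction of #3177: the single-precision accumulation, the 30 fps rate, and the `TOLERANCE_S` value are all assumptions chosen for the demo.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulated_timestamps(n_frames: int, fps: int, single_precision: bool) -> list[float]:
    """Build timestamps by repeatedly adding 1/fps, optionally rounding to float32 each step."""
    rnd = to_float32 if single_precision else (lambda v: v)
    dt = rnd(1.0 / fps)
    t, out = 0.0, []
    for _ in range(n_frames):
        out.append(t)
        t = rnd(t + dt)
    return out

def max_drift(timestamps: list[float], fps: int) -> float:
    """Largest gap between a stored timestamp and the ideal frame_index / fps."""
    return max(abs(t - i / fps) for i, t in enumerate(timestamps))

FPS = 30
N_FRAMES = 9_000      # five minutes of video at 30 fps
TOLERANCE_S = 1e-4    # illustrative sync tolerance, not a trajlens setting

drift32 = max_drift(accumulated_timestamps(N_FRAMES, FPS, True), FPS)
drift64 = max_drift(accumulated_timestamps(N_FRAMES, FPS, False), FPS)
print(f"float32 accumulation drift: {drift32:.6f} s (over tolerance: {drift32 > TOLERANCE_S})")
print(f"float64 accumulation drift: {drift64:.2e} s")
```

The early frames stay well within tolerance, which matches the symptom described above: nothing looks wrong until the error has had time to build up.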

v2.1 to v3.0 conversion corruption (#2401). The episode-to-frame index boundaries written during the v2.1 to v3.0 migration can silently disagree with the actual data. No error is raised. Frames get assigned to the wrong episode. A policy trains on mislabeled data and nobody notices until results look wrong for reasons no one can pin down. Found in 18.8% of successfully-linted datasets, the single most common real failure in the whole sample.
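The core of a check for this kind of bug is a cross-reference between the boundary metadata and the per-frame data. Here is a minimal sketch, assuming boundaries are half-open `[from_index, to_index)` ranges and each frame stores its own episode id. The key names and the function are hypothetical, not trajlens internals or the exact LeRobot schema.

```python
def find_boundary_mismatches(episodes, frame_episode_index):
    """Return (frame, claimed_episode, actual_episode) for each disagreeing frame.

    episodes: list of dicts with hypothetical keys "episode_index", "from_index"
              (inclusive) and "to_index" (exclusive), i.e. the boundary metadata.
    frame_episode_index: per-frame episode id as stored alongside the frame data.
    """
    mismatches = []
    for ep in episodes:
        for frame in range(ep["from_index"], ep["to_index"]):
            # A boundary pointing past the end of the data is also a mismatch.
            actual = frame_episode_index[frame] if frame < len(frame_episode_index) else None
            if actual != ep["episode_index"]:
                mismatches.append((frame, ep["episode_index"], actual))
    return mismatches

# Six frames: the data says episode 0 owns frames 0-2 and episode 1 owns 3-5.
frame_episode_index = [0, 0, 0, 1, 1, 1]

# Metadata whose boundary is off by one: episode 0 claims frame 3 as well.
episodes = [
    {"episode_index": 0, "from_index": 0, "to_index": 4},
    {"episode_index": 1, "from_index": 4, "to_index": 6},
]

for frame, claimed, actual in find_boundary_mismatches(episodes, frame_episode_index):
    print(f"frame {frame}: metadata says episode {claimed}, data says episode {actual}")
```

An off-by-one boundary like this raises no error at load time, which is why it has to be checked explicitly.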

Neither of these crashes a training run immediately. Both are the kind of bug that burns a GPU-day before you find out your data was the problem, not your model.

What trajlens actually checks

16 checks across six categories, including structural integrity, timestamp and temporal consistency, video decodability, semantic correctness (task labels, feature shapes), and statistics divergence, all run through a single pluggable check engine. The full catalog and severities are in the README.
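For readers unfamiliar with the pattern, a pluggable check engine usually boils down to a registry of functions that each take a dataset and return findings. The sketch below shows that general shape only. `Finding`, `register`, `run_all`, the check name, and the severity strings are hypothetical and are not trajlens's actual API; the README has the real interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    check: str
    severity: str   # hypothetical severity names, e.g. "warn" or "fail"
    message: str

# Registry mapping a check name to the function that implements it.
CHECKS: dict[str, Callable[[dict], list[Finding]]] = {}

def register(name: str):
    """Decorator that adds a check function to the registry under `name`."""
    def wrap(fn):
        CHECKS[name] = fn
        return fn
    return wrap

@register("task-labels-present")
def task_labels_present(dataset: dict) -> list[Finding]:
    # Hypothetical check: every episode should carry a non-empty task string.
    return [
        Finding("task-labels-present", "fail", f"episode {i} has no task label")
        for i, ep in enumerate(dataset["episodes"])
        if not ep.get("task")
    ]

def run_all(dataset: dict) -> list[Finding]:
    """Run every registered check and collect the findings."""
    findings = []
    for check in CHECKS.values():
        findings.extend(check(dataset))
    return findings

demo = {"episodes": [{"task": "pick up the cube"}, {"task": ""}]}
for f in run_all(demo):
    print(f"[{f.severity}] {f.check}: {f.message}")
```

The appeal of this design is that adding a check means adding one decorated function; the engine itself does not change.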

Try it

```
pip install trajlens
trajlens lint your-org/your-dataset
```

Under 30 seconds for a 100-episode local dataset. CI-friendly exit codes (0/1/2), JSON, HTML, and SARIF report formats, and a --deep flag if you want full video decode instead of the default spot-check.
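Exit codes are what make the tool easy to gate on in CI. Here is a minimal sketch of such a gate in Python. It distinguishes only zero from non-zero, because this post does not spell out what 1 and 2 each mean. `lint_gate` is a hypothetical helper, and the stand-in commands exist only so the sketch runs without trajlens installed.

```python
import subprocess
import sys

def lint_gate(command: list[str]) -> bool:
    """Run a lint command; return True only if it exits with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"lint exited with code {result.returncode}", file=sys.stderr)
    return result.returncode == 0

# In CI this would be: lint_gate(["trajlens", "lint", "your-org/your-dataset"])
# Stand-in commands that simply exit with a chosen status:
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(2)"]
print(lint_gate(passing), lint_gate(failing))
```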

What's next

Next up are fix (safe, dry-run-by-default auto-repair for what lint finds) and a web dashboard. After that comes synthetic demonstration generation: turning a handful of seed demos into hundreds of clean, lint-passing training trajectories, free and runnable without a GPU cluster.

The check registry is pluggable. If you've hit a LeRobot data bug that isn't on this list, a contributed check is the fastest way to get it caught for everyone, not just you.

GitHub · PyPI · Full audit script (rerun it yourself; it resamples a fresh random subset each time, so exact percentages will vary from run to run, but the shape holds)
