<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kunal</title>
    <description>The latest articles on DEV Community by Kunal (@kunalsomani).</description>
    <link>https://dev.to/kunalsomani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4007474%2F89b8cd5c-ee4d-4df1-a1aa-23df9318c33d.jpg</url>
      <title>DEV Community: Kunal</title>
      <link>https://dev.to/kunalsomani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kunalsomani"/>
    <language>en</language>
    <item>
      <title>We linted 100 public LeRobot datasets. Here's what we found.</title>
      <dc:creator>Kunal</dc:creator>
      <pubDate>Mon, 29 Jun 2026 07:05:27 +0000</pubDate>
      <link>https://dev.to/kunalsomani/we-linted-100-public-lerobot-datasets-heres-what-we-found-3no0</link>
      <guid>https://dev.to/kunalsomani/we-linted-100-public-lerobot-datasets-heres-what-we-found-3no0</guid>
      <description>&lt;p&gt;The Hugging Face Hub now hosts 58,000+ community LeRobotDataset repos, the single largest dataset category on the Hub, up roughly 50x in five months. LeRobotDataset has won the format war for robot-learning data. Nobody has been checking whether that data is actually safe to train on.&lt;/p&gt;

&lt;p&gt;So I built trajlens, a linter for LeRobotDataset data, and pointed it at 100 real public datasets to find out.&lt;/p&gt;

&lt;h2&gt;The numbers&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;Clean, no issues found&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WARN&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;A real validation check fired: schema mismatch, corrupted episode metadata, missing language annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ERROR&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;The dataset couldn't even be loaded: unsupported format version, missing metadata, dead or mistagged repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIMEOUT&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;Exceeded the lint budget, mostly genuinely large datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only 19 of the 100 datasets I tested passed cleanly. The other 81% either failed a real check, couldn't be loaded, or ran past the lint budget.&lt;/p&gt;
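
&lt;p&gt;That headline number is just the non-PASS rows of the table added up. Here is a minimal Python sketch of the arithmetic, using only the counts above. Treating PASS, WARN, and FAIL together as the pool that "linted successfully" is an assumption made for illustration, not something taken from the audit script:&lt;/p&gt;

```python
# Status counts copied from the table above.
counts = {"PASS": 19, "WARN": 0, "FAIL": 13, "ERROR": 47, "TIMEOUT": 21}

total = sum(counts.values())
not_clean = total - counts["PASS"]

# Assumption: "linted successfully" means the run ended with a verdict
# (PASS, WARN, or FAIL); ERROR and TIMEOUT runs never produced one.
linted = counts["PASS"] + counts["WARN"] + counts["FAIL"]

# Share of the verdict pool that a single dataset represents.
share_of_one = 100 / linted

print(f"{total} datasets sampled, {100 * not_clean / total:.0f}% not clean")
print(f"{linted} produced a verdict, so one dataset is {share_of_one:.1f}% of them")
```

&lt;p&gt;If that assumption holds, any percentage quoted against the successfully linted pool is a share of 32 datasets rather than 100, so a single dataset moves it by about three points.&lt;/p&gt;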

&lt;h2&gt;The two named bugs&lt;/h2&gt;

&lt;p&gt;Two specific upstream &lt;code&gt;lerobot&lt;/code&gt; issues show up often enough to be worth naming directly rather than filing under "quality issues":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timestamp float drift (&lt;a href="https://github.com/huggingface/lerobot/issues/3177" rel="noopener noreferrer"&gt;#3177&lt;/a&gt;)&lt;/strong&gt;. Accumulated floating-point rounding error in stored timestamps causes video decoding to fail partway through training, often dozens of episodes in. Found in 3.1% of successfully linted datasets.&lt;/p&gt;
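
&lt;p&gt;To see how this class of bug arises, here is a small standard-library sketch of the general mechanism. It is not the upstream &lt;code&gt;lerobot&lt;/code&gt; code, and float32 storage, the 30 fps rate, and all function names are assumptions chosen for illustration. It compares timestamps built by repeatedly adding the frame period against timestamps computed from the integer frame index:&lt;/p&gt;

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulated_timestamps(n_frames, fps):
    """Build timestamps with a running float32 sum of the frame period."""
    dt = to_f32(1.0 / fps)
    t = 0.0
    out = []
    for _ in range(n_frames):
        out.append(t)
        t = to_f32(t + dt)  # each addition rounds again, so error piles up
    return out


def indexed_timestamps(n_frames, fps):
    """Build timestamps from the frame index, with no running sum."""
    return [i / fps for i in range(n_frames)]


def max_drift(n_frames, fps):
    """Largest absolute gap, in seconds, between the two constructions."""
    acc = accumulated_timestamps(n_frames, fps)
    ref = indexed_timestamps(n_frames, fps)
    return max(abs(a - r) for a, r in zip(acc, ref))


fps = 30  # hypothetical frame rate
for n in (100, 10_000, 100_000):
    print(n, "frames, max drift in seconds:", max_drift(n, fps))
```

&lt;p&gt;The gap is negligible for short recordings and grows with length, which fits the pattern of a failure that only appears deep into a run. Deriving each timestamp from the frame index avoids the running sum entirely.&lt;/p&gt;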

&lt;p&gt;&lt;strong&gt;v2.1 to v3.0 conversion corruption (&lt;a href="https://github.com/huggingface/lerobot/issues/2401" rel="noopener noreferrer"&gt;#2401&lt;/a&gt;)&lt;/strong&gt;. The episode-to-frame index boundaries written during the v2.1 to v3.0 migration can silently disagree with the actual data. No error is raised. Frames get assigned to the wrong episode. A policy trains on mislabeled data and nobody notices until results look wrong for reasons no one can pin down. Found in 18.8% of successfully linted datasets, making it the single most common real failure in the whole sample.&lt;/p&gt;
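
&lt;p&gt;The idea behind catching this is to cross-check the boundaries in the episode metadata against the episode id stored on each frame. The sketch below shows that cross-check on toy data. The data shapes and the function name are hypothetical stand-ins, not the LeRobot schema and not how trajlens implements it:&lt;/p&gt;

```python
def find_boundary_mismatches(frame_episode_ids, boundaries):
    """Return (episode, frame_index, found_id) for every mislabeled frame.

    frame_episode_ids: the episode id stored on each frame, in frame order.
    boundaries: {episode_id: (start, stop)} from metadata, stop exclusive.
    Both shapes are hypothetical, chosen only for this illustration.
    """
    mismatches = []
    for episode, (start, stop) in sorted(boundaries.items()):
        window = frame_episode_ids[start:stop]
        if len(window) != stop - start:
            # The metadata range runs past the end of the frame data.
            mismatches.append((episode, None, None))
        for offset, found in enumerate(window):
            if found != episode:
                mismatches.append((episode, start + offset, found))
    return mismatches


# Toy data: the metadata ends episode 1 one frame too early.
frames = [0, 0, 0, 1, 1, 1, 2, 2]
metadata = {0: (0, 3), 1: (3, 5), 2: (5, 8)}

for episode, frame_index, found in find_boundary_mismatches(frames, metadata):
    print(f"episode {episode}: frame {frame_index} is labeled {found}")
```

&lt;p&gt;Nothing in the toy data would raise on load. The disagreement only shows up when the two sources are compared, which is why it goes unnoticed during training.&lt;/p&gt;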

&lt;p&gt;Neither of these crashes a training run immediately. Both are the kind of bug that burns a GPU-day before you find out your data was the problem, not your model.&lt;/p&gt;

&lt;h2&gt;What trajlens actually checks&lt;/h2&gt;

&lt;p&gt;16 checks across six categories, including structural integrity, timestamp and temporal consistency, video decodability, semantic correctness (task labels, feature shapes), and statistics divergence. All of them run through a single pluggable check engine. The full catalog and severities are in the README.&lt;/p&gt;
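
&lt;p&gt;For readers who haven't seen the pattern, a pluggable check engine usually comes down to a registry that check functions add themselves to. The sketch below shows one common way to build that in Python. Every name in it (&lt;code&gt;Finding&lt;/code&gt;, &lt;code&gt;register&lt;/code&gt;, &lt;code&gt;run_checks&lt;/code&gt;, the two example checks, the dict-shaped dataset) is hypothetical and does not reflect the real trajlens API:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    check: str
    severity: str  # assumption: reuses the status names from the table
    message: str


CHECKS = {}  # check name mapped to (category, function)


def register(name, category):
    """Decorator that adds a check function to the registry."""
    def wrap(func):
        CHECKS[name] = (category, func)
        return func
    return wrap


@register("has-episodes", category="structural integrity")
def check_has_episodes(dataset):
    if not dataset.get("episodes"):
        yield Finding("has-episodes", "FAIL", "dataset contains no episodes")


@register("task-labels", category="semantic correctness")
def check_task_labels(dataset):
    for i, episode in enumerate(dataset.get("episodes", [])):
        if not episode.get("task"):
            yield Finding("task-labels", "FAIL", f"episode {i} has no task label")


def run_checks(dataset):
    """Run every registered check and collect findings in registration order."""
    findings = []
    for _name, (_category, func) in CHECKS.items():
        findings.extend(func(dataset))
    return findings


demo = {"episodes": [{"task": "pick up the cube"}, {"task": ""}]}
for finding in run_checks(demo):
    print(finding.severity, finding.check, finding.message)
```

&lt;p&gt;The appeal of this shape is that adding a check means writing one decorated function. The runner never changes.&lt;/p&gt;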

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;trajlens
trajlens lint your-org/your-dataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 100-episode local dataset lints in under 30 seconds. You get CI-friendly exit codes (0/1/2), reports in JSON, HTML, or SARIF, and a &lt;code&gt;--deep&lt;/code&gt; flag if you want full video decode instead of the default spot-check.&lt;/p&gt;
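
&lt;p&gt;Because the exit code carries the result, a CI gate can be a thin wrapper around the command above. The sketch below uses only the &lt;code&gt;trajlens lint&lt;/code&gt; invocation and the &lt;code&gt;--deep&lt;/code&gt; flag mentioned in this post. The helper names, the position of the flag, and the rule that any non-zero code fails the job are assumptions, and the wrapper deliberately does not interpret what 1 versus 2 means:&lt;/p&gt;

```python
import subprocess


def build_command(repo_id, deep=False):
    """Assemble the CLI call; where --deep goes is an assumption."""
    command = ["trajlens", "lint", repo_id]
    if deep:
        command.append("--deep")
    return command


def lint_exit_code(repo_id, deep=False, run=subprocess.run):
    """Run the linter and return its exit code without interpreting it.

    `run` is injectable so the wrapper can be tried without trajlens installed.
    """
    completed = run(build_command(repo_id, deep=deep))
    return completed.returncode


def gate(repo_ids, deep=False, run=subprocess.run):
    """Return the highest exit code seen, so any non-zero result fails the job."""
    codes = (lint_exit_code(repo, deep=deep, run=run) for repo in repo_ids)
    return max(codes, default=0)
```

&lt;p&gt;In a CI script, passing the result of &lt;code&gt;gate([...])&lt;/code&gt; to &lt;code&gt;sys.exit&lt;/code&gt; would fail the build whenever any listed dataset returns a non-zero code.&lt;/p&gt;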

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;fix&lt;/code&gt; (safe, dry-run-by-default auto-repair for what &lt;code&gt;lint&lt;/code&gt; finds) and a web dashboard are next. After that, synthetic demonstration generation: turning a handful of seed demos into hundreds of clean, lint-passing training trajectories, free and runnable without a GPU cluster.&lt;/p&gt;

&lt;p&gt;The check registry is pluggable. If you've hit a LeRobot data bug that isn't on this list, a contributed check is the fastest way to get it caught for everyone, not just you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Kunal-Somani/trajlens" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://pypi.org/project/trajlens/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; · &lt;a href="https://github.com/Kunal-Somani/trajlens/blob/main/scripts/audit_hub.py" rel="noopener noreferrer"&gt;Full audit script&lt;/a&gt; (rerun it yourself: it resamples a fresh random subset each time, so exact percentages will vary from run to run, but the shape holds)&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>robotics</category>
    </item>
  </channel>
</rss>
