Ertugrul

DataLens: A Read‑Only Image Dataset Sanity Checker

Training a model rarely fails loudly.

Most of the time, it kind of works — the loss decreases, accuracy moves in the right direction, but the results feel unstable, brittle, or just wrong.

In my experience, when that happens, the root cause is often not the model, but the dataset.

So I built DataLens: a lightweight, read‑only sanity checker for image datasets.


The Problem: Silent Dataset Failures

Before training even starts, datasets often contain issues like:

  • Corrupted or unreadable images
  • Duplicate or near‑duplicate samples
  • Broken CSV → image mappings
  • Large numbers of orphan images
  • Severe class imbalance
  • Extremely small images or extreme aspect ratios
  • Mixed image modes (RGB / RGBA / grayscale)

None of these necessarily crash training.
They just quietly degrade everything downstream.
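As a concrete example, the first issue on that list — corrupted or unreadable images — can be caught before training with a few lines of Pillow. This is a minimal sketch of the idea, not DataLens's actual implementation; the extension list is an illustrative assumption:

```python
from pathlib import Path

from PIL import Image

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}

def find_corrupted(root: str) -> list[Path]:
    """Return image files under root that Pillow cannot open and verify."""
    bad = []
    for path in Path(root).rglob("*"):
        if path.suffix.lower() not in IMAGE_EXTS:
            continue
        try:
            with Image.open(path) as img:
                img.verify()  # integrity check without decoding all pixels
        except Exception:
            bad.append(path)
    return sorted(bad)
```

Note that `verify()` catches structural damage; a fully paranoid check would also re-open and `load()` each file, since some truncation only surfaces at decode time.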


Why Read‑Only Matters

Many tools try to auto‑fix datasets.

I deliberately didn’t.

DataLens follows a simple rule:

Inspect, don’t mutate.

  • No files are moved
  • No labels are rewritten
  • No assumptions are silently applied

The tool’s job is to surface problems clearly, so you can decide what to do next.


What DataLens Does

DataLens is a Streamlit‑based audit tool with two modes.

Mode A — Images Only

For raw image folders:

  • Recursively scans images
  • Detects corrupted files
  • Finds exact and near‑duplicate images
  • Optionally infers classes from subfolders
  • Correctly handles unlabeled datasets (no fake “missing label” warnings)
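The "infer classes from subfolders" step can be sketched with the standard library alone. The key detail is the last bullet: images sitting directly in the root imply an unlabeled dataset, so they simply get no class rather than a fake warning. Function and constant names here are mine, not DataLens's:

```python
from collections import Counter
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}

def infer_classes(root: str) -> Counter:
    """Infer a class per image from its immediate parent folder.

    Images directly under root get no class, so a flat unlabeled
    folder yields an empty Counter instead of 'missing label' noise.
    """
    root_path = Path(root)
    counts: Counter = Counter()
    for path in root_path.rglob("*"):
        if path.suffix.lower() not in IMAGE_EXTS:
            continue
        if path.parent != root_path:  # only a subfolder implies a label
            counts[path.parent.name] += 1
    return counts
```

The resulting `Counter` doubles as the input for a class-imbalance check: compare its largest and smallest counts.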

Mode B — Images + Labels (CSV)

For supervised datasets:

  • Robust CSV reading (UTF‑8 with fallback)
  • Automatic filename & label column detection
  • Support for IDs without file extensions
  • Optional label normalization
  • Coverage analysis: how many CSV rows actually resolve to images?
  • Orphan analysis: how many images are never referenced by the CSV?
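Coverage and orphan analysis are two sides of one set comparison: resolve every CSV row to a file (trying extensions for bare IDs), then diff against what is actually on disk. A simplified stdlib sketch — the column name and extension list are assumptions for illustration:

```python
import csv
from pathlib import Path

IMAGE_EXTS = (".png", ".jpg", ".jpeg")

def coverage_and_orphans(csv_path: str, image_dir: str, filename_col: str):
    """Return (missing CSV references, images never referenced)."""
    img_dir = Path(image_dir)
    on_disk = {p.name for p in img_dir.iterdir()
               if p.suffix.lower() in IMAGE_EXTS}

    referenced, missing = set(), []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row[filename_col]
            # Bare IDs (no extension) are tried against each known extension.
            candidates = ([name] if Path(name).suffix
                          else [name + ext for ext in IMAGE_EXTS])
            hit = next((c for c in candidates if c in on_disk), None)
            if hit:
                referenced.add(hit)
            else:
                missing.append(name)

    return missing, sorted(on_disk - referenced)
```

Coverage is then `1 - len(missing) / total_rows`, and every entry in the second list is an orphan worth investigating.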

Duplicate Detection That Actually Helps

Exact duplicates are easy.
Near‑duplicates are not.

DataLens supports three methods:

  • sha256 — byte‑exact duplicates
  • quick hash — fast approximation
  • pHash — visually similar images (resized, recompressed, slightly cropped)

This is especially useful for datasets collected via scraping or merging multiple sources.
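The first two methods reduce to hashing and bucketing. Here is a stdlib sketch of how exact (sha256) and quick (prefix-based) grouping might work; the "quick" variant shown — first 64 KiB plus file size — is my illustrative stand-in, and pHash would typically come from a library such as imagehash rather than being hand-rolled:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def hash_file(path: Path, method: str = "sha256") -> str:
    data = path.read_bytes()
    if method == "quick":
        # Fast approximation: hash only the first 64 KiB plus the size.
        return hashlib.sha256(data[:65536] + str(len(data)).encode()).hexdigest()
    return hashlib.sha256(data).hexdigest()

def duplicate_groups(paths, method: str = "sha256"):
    """Bucket files by hash; any bucket of size >= 2 is a duplicate group."""
    buckets = defaultdict(list)
    for p in map(Path, paths):
        buckets[hash_file(p, method)].append(p)
    return [sorted(group) for group in buckets.values() if len(group) > 1]
```

The quick method can produce false positives on files that share a prefix and size, which is exactly why it is an approximation for triage, not a verdict.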


Data Hygiene Warnings

Beyond basic checks, DataLens flags issues that usually show up too late:

  • Very small images (e.g. <64px)
  • Extreme aspect ratios
  • High RGBA share (alpha channel surprises)
  • High image mode variance
  • Extension mismatches between CSV references and actual files

These are the kinds of things that quietly break training pipelines or augmentations.
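Most of these checks are simple predicates over per-image metadata (width, height, mode). A sketch of how such flagging might look — the thresholds are illustrative defaults, not DataLens's actual ones:

```python
def hygiene_flags(width: int, height: int, mode: str,
                  min_side: int = 64, max_ratio: float = 4.0) -> list[str]:
    """Flag one image's metadata against illustrative hygiene thresholds."""
    flags = []
    if min(width, height) < min_side:
        flags.append("very_small")
    ratio = max(width, height) / max(1, min(width, height))
    if ratio > max_ratio:
        flags.append("extreme_aspect_ratio")
    if mode == "RGBA":
        flags.append("has_alpha")  # alpha channels surprise many pipelines
    elif mode not in ("RGB", "L"):
        flags.append("unusual_mode")
    return flags
```

Aggregating these flags across the whole dataset gives the "high RGBA share" and "high mode variance" warnings as simple percentages.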


Outputs You Can Share

After a run, DataLens produces:

  • An interactive dashboard (Streamlit)
  • A deterministic dataset_report.md
  • An issues.csv containing:
    • missing images
    • orphan images
    • corrupted files
    • duplicate groups

The report is designed to be:

  • commit‑friendly
  • reviewable
  • attachable to issues or PRs
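Commit-friendliness mostly comes down to determinism: sort before you write, so two runs on the same dataset diff cleanly. A sketch of how the issues.csv might be emitted — the column names here are assumptions, not the tool's documented schema:

```python
import csv

def write_issues_csv(path: str, issues: list[dict]) -> None:
    """Write issues sorted by (type, path) so repeated runs on the same
    dataset produce byte-identical, diff-friendly output."""
    fieldnames = ["type", "path", "detail"]
    rows = sorted(issues, key=lambda r: (r["type"], r["path"]))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```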

Design Principles

I kept the scope intentionally tight:

  • Read‑only
  • Deterministic
  • Transparent
  • Pre‑training focused

This is not a dataset cleaning tool.
It’s a dataset inspection tool.


When I Use DataLens

  • Before starting any new training run
  • When receiving datasets from external sources
  • When debugging unstable or suspicious training behavior
  • As a lightweight QA step before investing GPU hours

Final Thought

We spend a lot of time tuning models.

But models are only as good as the data we feed them — and bad data usually doesn’t announce itself.

DataLens is my way of making datasets talk.

