Ertugrul

DataLens: A Read‑Only Image Dataset Sanity Checker

Training a model rarely fails loudly.

Most of the time, it kind of works — the loss decreases, accuracy moves in the right direction, but the results feel unstable, brittle, or just wrong.

In my experience, when that happens, the root cause is often not the model, but the dataset.

So I built DataLens: a lightweight, read‑only sanity checker for image datasets.


The Problem: Silent Dataset Failures

Before training even starts, datasets often contain issues like:

  • Corrupted or unreadable images
  • Duplicate or near‑duplicate samples
  • Broken CSV → image mappings
  • Large numbers of orphan images
  • Severe class imbalance
  • Extremely small images or extreme aspect ratios
  • Mixed image modes (RGB / RGBA / grayscale)

None of these necessarily crash training.
They just quietly degrade everything downstream.
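As a concrete example, the first issue on that list — corrupted or unreadable images — can be caught before training with a few lines of Pillow. This is a minimal sketch of the idea, not DataLens's actual implementation; the extension list is an illustrative assumption:

```python
from pathlib import Path

from PIL import Image

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}

def find_corrupted(root: str) -> list[Path]:
    """Return image files under root that Pillow cannot open and verify."""
    bad = []
    for path in Path(root).rglob("*"):
        if path.suffix.lower() not in IMAGE_EXTS:
            continue
        try:
            with Image.open(path) as img:
                img.verify()  # integrity check without decoding all pixels
        except Exception:
            bad.append(path)
    return sorted(bad)
```

Note that `verify()` catches structural damage; a fully paranoid check would also re-open and `load()` each file, since some truncation only surfaces at decode time.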


Why Read‑Only Matters

Many tools try to auto‑fix datasets.

I deliberately didn’t.

DataLens follows a simple rule:

Inspect, don’t mutate.

  • No files are moved
  • No labels are rewritten
  • No assumptions are silently applied

The tool’s job is to surface problems clearly, so you can decide what to do next.


What DataLens Does

DataLens is a Streamlit‑based audit tool with two modes.

Mode A — Images Only

For raw image folders:

  • Recursively scans images
  • Detects corrupted files
  • Finds exact and near‑duplicate images
  • Optionally infers classes from subfolders
  • Correctly handles unlabeled datasets (no fake “missing label” warnings)
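The "infer classes from subfolders" step can be sketched with the standard library alone. The key detail is the last bullet: images sitting directly in the root imply an unlabeled dataset, so they simply get no class rather than a fake warning. Function and constant names here are mine, not DataLens's:

```python
from collections import Counter
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}

def infer_classes(root: str) -> Counter:
    """Infer a class per image from its immediate parent folder.

    Images directly under root get no class, so a flat unlabeled
    folder yields an empty Counter instead of 'missing label' noise.
    """
    root_path = Path(root)
    counts: Counter = Counter()
    for path in root_path.rglob("*"):
        if path.suffix.lower() not in IMAGE_EXTS:
            continue
        if path.parent != root_path:  # only a subfolder implies a label
            counts[path.parent.name] += 1
    return counts
```

The resulting `Counter` doubles as the input for a class-imbalance check: compare its largest and smallest counts.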

Mode B — Images + Labels (CSV)

For supervised datasets:

  • Robust CSV reading (UTF‑8 with fallback)
  • Automatic filename & label column detection
  • Support for IDs without file extensions
  • Optional label normalization
  • Coverage analysis: how many CSV rows actually resolve to images?
  • Orphan analysis: how many images are never referenced by the CSV?
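Coverage and orphan analysis are two sides of one set comparison: resolve every CSV row to a file (trying extensions for bare IDs), then diff against what is actually on disk. A simplified stdlib sketch — the column name and extension list are assumptions for illustration:

```python
import csv
from pathlib import Path

IMAGE_EXTS = (".png", ".jpg", ".jpeg")

def coverage_and_orphans(csv_path: str, image_dir: str, filename_col: str):
    """Return (missing CSV references, images never referenced)."""
    img_dir = Path(image_dir)
    on_disk = {p.name for p in img_dir.iterdir()
               if p.suffix.lower() in IMAGE_EXTS}

    referenced, missing = set(), []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row[filename_col]
            # Bare IDs (no extension) are tried against each known extension.
            candidates = ([name] if Path(name).suffix
                          else [name + ext for ext in IMAGE_EXTS])
            hit = next((c for c in candidates if c in on_disk), None)
            if hit:
                referenced.add(hit)
            else:
                missing.append(name)

    return missing, sorted(on_disk - referenced)
```

Coverage is then `1 - len(missing) / total_rows`, and every entry in the second list is an orphan worth investigating.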

Duplicate Detection That Actually Helps

Exact duplicates are easy.
Near‑duplicates are not.

DataLens supports three methods:

  • sha256 — byte‑exact duplicates
  • quick hash — fast approximation
  • pHash — visually similar images (resized, recompressed, slightly cropped)

This is especially useful for datasets collected via scraping or merging multiple sources.
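The first two methods reduce to hashing and bucketing. Here is a stdlib sketch of how exact (sha256) and quick (prefix-based) grouping might work; the "quick" variant shown — first 64 KiB plus file size — is my illustrative stand-in, and pHash would typically come from a library such as imagehash rather than being hand-rolled:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def hash_file(path: Path, method: str = "sha256") -> str:
    data = path.read_bytes()
    if method == "quick":
        # Fast approximation: hash only the first 64 KiB plus the size.
        return hashlib.sha256(data[:65536] + str(len(data)).encode()).hexdigest()
    return hashlib.sha256(data).hexdigest()

def duplicate_groups(paths, method: str = "sha256"):
    """Bucket files by hash; any bucket of size >= 2 is a duplicate group."""
    buckets = defaultdict(list)
    for p in map(Path, paths):
        buckets[hash_file(p, method)].append(p)
    return [sorted(group) for group in buckets.values() if len(group) > 1]
```

The quick method can produce false positives on files that share a prefix and size, which is exactly why it is an approximation for triage, not a verdict.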


Data Hygiene Warnings

Beyond basic checks, DataLens flags issues that usually show up too late:

  • Very small images (e.g. <64px)
  • Extreme aspect ratios
  • High RGBA share (alpha channel surprises)
  • High image mode variance
  • Extension mismatches between CSV references and actual files

These are the kinds of things that quietly break training pipelines or augmentations.
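Most of these checks are simple predicates over per-image metadata (width, height, mode). A sketch of how such flagging might look — the thresholds are illustrative defaults, not DataLens's actual ones:

```python
def hygiene_flags(width: int, height: int, mode: str,
                  min_side: int = 64, max_ratio: float = 4.0) -> list[str]:
    """Flag one image's metadata against illustrative hygiene thresholds."""
    flags = []
    if min(width, height) < min_side:
        flags.append("very_small")
    ratio = max(width, height) / max(1, min(width, height))
    if ratio > max_ratio:
        flags.append("extreme_aspect_ratio")
    if mode == "RGBA":
        flags.append("has_alpha")  # alpha channels surprise many pipelines
    elif mode not in ("RGB", "L"):
        flags.append("unusual_mode")
    return flags
```

Aggregating these flags across the whole dataset gives the "high RGBA share" and "high mode variance" warnings as simple percentages.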


Outputs You Can Share

After a run, DataLens produces:

  • An interactive dashboard (Streamlit)
  • A deterministic dataset_report.md
  • An issues.csv containing:
    • missing images
    • orphan images
    • corrupted files
    • duplicate groups

The report is designed to be:

  • commit‑friendly
  • reviewable
  • attachable to issues or PRs
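Commit-friendliness mostly comes down to determinism: sort before you write, so two runs on the same dataset diff cleanly. A sketch of how the issues.csv might be emitted — the column names here are assumptions, not the tool's documented schema:

```python
import csv

def write_issues_csv(path: str, issues: list[dict]) -> None:
    """Write issues sorted by (type, path) so repeated runs on the same
    dataset produce byte-identical, diff-friendly output."""
    fieldnames = ["type", "path", "detail"]
    rows = sorted(issues, key=lambda r: (r["type"], r["path"]))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```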

Design Principles

I kept the scope intentionally tight:

  • Read‑only
  • Deterministic
  • Transparent
  • Pre‑training focused

This is not a dataset cleaning tool.
It’s a dataset inspection tool.


When I Use DataLens

  • Before starting any new training run
  • When receiving datasets from external sources
  • When debugging unstable or suspicious training behavior
  • As a lightweight QA step before investing GPU hours

Final Thought

We spend a lot of time tuning models.

But models are only as good as the data we feed them — and bad data usually doesn’t announce itself.

DataLens is my way of making datasets talk.

