# DataLens: A Read‑Only Image Dataset Sanity Checker
Training a model rarely fails loudly.
Most of the time, it kind of works — loss decreases, accuracy moves, but the results feel unstable, brittle, or just wrong.
In my experience, when that happens, the root cause is often not the model, but the dataset.
So I built DataLens: a lightweight, read‑only sanity checker for image datasets.
## The Problem: Silent Dataset Failures
Before training even starts, datasets often contain issues like:
- Corrupted or unreadable images
- Duplicate or near‑duplicate samples
- Broken CSV → image mappings
- Large numbers of orphan images
- Severe class imbalance
- Extremely small images or extreme aspect ratios
- Mixed image modes (RGB / RGBA / grayscale)
None of these necessarily crash training.
They just quietly degrade everything downstream.
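The first item on that list is also the cheapest to check. A minimal sketch, assuming Pillow is available; the function name `find_corrupted` is mine for illustration, not DataLens's API:

```python
from pathlib import Path
from PIL import Image

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}

def find_corrupted(root):
    """Return paths of files that look like images but fail to load."""
    bad = []
    for p in Path(root).rglob("*"):
        if p.suffix.lower() not in IMAGE_EXTS:
            continue
        try:
            with Image.open(p) as im:
                im.verify()  # cheap header/integrity check, no full decode
        except Exception:
            bad.append(p)
    return bad
```

`Image.verify()` only validates structure, so it is fast enough to run over a whole dataset before every training run.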
## Why Read‑Only Matters
Many tools try to auto‑fix datasets.
I deliberately didn’t.
DataLens follows a simple rule:
Inspect, don’t mutate.
- No files are moved
- No labels are rewritten
- No assumptions are silently applied
The tool’s job is to surface problems clearly, so you can decide what to do next.
## What DataLens Does
DataLens is a Streamlit‑based audit tool with two modes.
### Mode A — Images Only
For raw image folders:
- Recursively scans images
- Detects corrupted files
- Finds exact and near‑duplicate images
- Optionally infers classes from subfolders
- Correctly handles unlabeled datasets (no fake “missing label” warnings)
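The scan-plus-inference part of Mode A can be sketched with the stdlib alone. Assumptions here: the helper name `scan_images` is mine, and "infer classes from subfolders" is interpreted as "label = parent folder name", with images sitting directly under the root treated as unlabeled:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}

def scan_images(root, infer_classes=True):
    """Map each image path to an inferred class (its parent folder name),
    or None for images directly under the root (unlabeled dataset)."""
    root_path = Path(root)
    result = {}
    for p in sorted(root_path.rglob("*")):  # sorted for deterministic output
        if p.suffix.lower() not in IMAGE_EXTS:
            continue
        parent = p.parent
        label = parent.name if (infer_classes and parent != root_path) else None
        result[str(p)] = label
    return result
```

Returning `None` instead of a fake "missing" label is what keeps unlabeled datasets from producing spurious warnings.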
### Mode B — Images + Labels (CSV)
For supervised datasets:
- Robust CSV reading (UTF‑8 with fallback)
- Automatic filename & label column detection
- Support for IDs without file extensions
- Optional label normalization
- Coverage analysis: how many CSV rows actually resolve to images?
- Orphan analysis: how many images are never referenced by the CSV?
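Coverage and orphan analysis are both set operations at heart. A stdlib-only sketch; the helper name `coverage_report` and the stem-based matching are my assumptions, chosen to mirror the extension-less ID support mentioned above:

```python
from pathlib import Path

def coverage_report(csv_ids, image_paths):
    """Compare CSV references against files found on disk.
    csv_ids: filename column values (may lack extensions).
    image_paths: actual image paths from the scan."""
    by_name = {Path(p).name for p in image_paths}
    by_stem = {Path(p).stem for p in image_paths}  # matches IDs without extensions
    resolved = {i for i in csv_ids if i in by_name or i in by_stem}
    missing = set(csv_ids) - resolved
    referenced_stems = {Path(i).stem for i in csv_ids}
    orphans = {p for p in image_paths if Path(p).stem not in referenced_stems}
    return resolved, missing, orphans
```

Both numbers matter: low coverage means the labels are broken, many orphans mean you are training on less data than you think you collected.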
## Duplicate Detection That Actually Helps
Exact duplicates are easy.
Near‑duplicates are not.
DataLens supports three methods:
- sha256 — byte‑exact duplicates
- quick hash — fast approximation
- pHash — visually similar images (resized, recompressed, slightly cropped)
This is especially useful for datasets collected via scraping or merging multiple sources.
## Data Hygiene Warnings
Beyond basic checks, DataLens flags issues that usually show up too late:
- Very small images (e.g. <64px)
- Extreme aspect ratios
- High RGBA share (alpha channel surprises)
- High image mode variance
- Extension mismatches between CSV references and actual files
These are the kinds of things that quietly break training pipelines or augmentations.
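Most of these warnings are pure threshold logic over per-image metadata. A sketch of that logic; the function name, threshold defaults, and warning wording are all illustrative, not DataLens's exact values:

```python
def hygiene_flags(meta, min_side=64, max_ratio=4.0, rgba_share_warn=0.2):
    """Flag hygiene issues from per-image (width, height, mode) tuples.
    Thresholds are illustrative defaults."""
    flags = []
    small = [m for m in meta if min(m[0], m[1]) < min_side]
    if small:
        flags.append(f"{len(small)} very small images (<{min_side}px)")
    extreme = [m for m in meta
               if max(m[0], m[1]) / min(m[0], m[1]) > max_ratio]
    if extreme:
        flags.append(f"{len(extreme)} extreme aspect ratios (>{max_ratio}:1)")
    rgba = sum(1 for m in meta if m[2] == "RGBA")
    if meta and rgba / len(meta) > rgba_share_warn:
        flags.append(f"high RGBA share: {rgba}/{len(meta)}")
    modes = {m[2] for m in meta}
    if len(modes) > 2:
        flags.append("mixed image modes: " + ", ".join(sorted(modes)))
    return flags
```

The RGBA check in particular catches the classic failure where a `convert("RGB")` is missing somewhere and the alpha channel surfaces mid-pipeline.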
## Outputs You Can Share
After a run, DataLens produces:
- An interactive dashboard (Streamlit)
- A deterministic `dataset_report.md`
- An `issues.csv` containing:
  - missing images
  - orphan images
  - corrupted files
  - duplicate groups
The report is designed to be:
- commit‑friendly
- reviewable
- attachable to issues or PRs
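"Deterministic" is what makes the report commit-friendly: the same dataset must produce byte-identical output, so diffs only show real changes. A sketch of the idea for the CSV side; the helper name and two-column schema are my assumptions:

```python
import csv
import io

def write_issues_csv(issues):
    """Serialize issues deterministically: fixed column order, sorted rows,
    fixed line endings, so reruns on the same dataset diff clean.
    `issues` is an iterable of (kind, path) pairs, e.g. ("missing", "img/a.png")."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["kind", "path"])
    for kind, path in sorted(issues):  # sorting removes scan-order nondeterminism
        writer.writerow([kind, path])
    return buf.getvalue()
```

Because filesystem traversal order varies between machines, sorting before writing is the simplest way to make two runs comparable.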
## Design Principles
I kept the scope intentionally tight:
- Read‑only
- Deterministic
- Transparent
- Pre‑training focused
This is not a dataset cleaning tool.
It’s a dataset inspection tool.
## When I Use DataLens
- Before starting any new training run
- When receiving datasets from external sources
- When debugging unstable or suspicious training behavior
- As a lightweight QA step before investing GPU hours
## Final Thought
We spend a lot of time tuning models.
But models are only as good as the data we feed them — and bad data usually doesn’t announce itself.
DataLens is my way of making datasets talk.
## Links
- GitHub: https://github.com/Ertugrulmutlu/DataLens
- LinkedIn: https://www.linkedin.com/in/ertugrul-mutlu/
- Personal Website: https://ertugrulmutlu.github.io