Ertugrul

Posted on Aug 2 • Edited on Aug 16

Modular Snip Recorder: A Data Collection Tool for Behavior Cloning (2/2)

#datascience #streamlit #python #data

📊 Part 2:

After building the data collection tool, I had my neatly packed .npz and .mp4 files. But then came the real question:

"Now that I’ve collected the data, how do I make sense of it?"

So I set out to build a set of tools that would help me not just review the data — but actually understand it. What came out of that is something I’m really proud of.

🖥️ Desktop Dataset Viewer: For Seeing the Details

This tool is all about feeling the data, one sample at a time. The desktop GUI is built with Tkinter and designed to make browsing, filtering, and understanding each entry as intuitive as flipping through photos.

🛍 Navigating Like a Human

Use Prev/Next buttons (or arrow keys) to flip through samples.
Apply filters for phases and keys — e.g., only show 'press' events where 'd' was active.
Optionally include broken or empty samples, which is great for debugging.

💡 Pro tip: Filters are applied in real time by self._apply_filters(), which updates a filtered index map on the fly.
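A minimal sketch of how a filtered index map like this might work. Only the name `_apply_filters` comes from the viewer; the `SampleBrowser` class, its attributes, and the entry fields are assumptions for illustration, and Tkinter is left out so the logic stands alone.

```python
class SampleBrowser:
    """Hypothetical stand-in for the viewer's filtering state (no GUI code)."""

    def __init__(self, entries):
        self.entries = entries        # list of dicts, one per sample
        self.phase_filter = None      # e.g. "press", or None for any phase
        self.key_filter = None        # e.g. "d", or None for any key
        self.filtered_indices = []    # original indices of entries that pass
        self.cursor = 0               # position inside filtered_indices
        self._apply_filters()

    def _apply_filters(self):
        """Rebuild the map from filtered position to original entry index."""
        self.filtered_indices = [
            i for i, entry in enumerate(self.entries)
            if (self.phase_filter is None or entry["phase"] == self.phase_filter)
            and (self.key_filter is None or self.key_filter in entry["active_keys"])
        ]
        self.cursor = 0

    def next(self):
        """Move to the next matching sample and return its original index."""
        if not self.filtered_indices:
            return None
        self.cursor = min(self.cursor + 1, len(self.filtered_indices) - 1)
        return self.filtered_indices[self.cursor]


entries = [
    {"phase": "press", "active_keys": ["d"]},
    {"phase": "release", "active_keys": []},
    {"phase": "press", "active_keys": ["a", "d"]},
]
browser = SampleBrowser(entries)
browser.phase_filter, browser.key_filter = "press", "d"
browser._apply_filters()
print(browser.filtered_indices)  # [0, 2]
```

Keeping a list of original indices, rather than copying the entries, means Prev/Next only has to move a cursor, and the displayed sample number can still refer to the unfiltered dataset.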

🔍 Rich Metadata, Always Visible

Each entry displays:

Keys held (multi-hot vector)
Duration each key has been held
Whether it’s a "press" or "release" event
The exact video frame it maps to (yes, the viewer can open the matching .mp4 and seek there)

Active Keys: ['d']
Hold Durations: [0.0, 0.0, 0.12, 0.0]
Phase: press
Video Frame: 156
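As a rough sketch, the lines above could be produced from a multi-hot vector like this. The key order in `KEY_ORDER` is an assumption (chosen so that 'd' sits at index 2, matching the example), and `describe_entry` is a hypothetical helper, not a function from the repo.

```python
KEY_ORDER = ("w", "a", "d", "s")  # assumed column order of the multi-hot vector


def describe_entry(multi_hot, hold_durations, phase, frame):
    """Format one sample's metadata as the four display lines."""
    active = [key for key, bit in zip(KEY_ORDER, multi_hot) if bit]
    return "\n".join([
        f"Active Keys: {active}",
        f"Hold Durations: {list(hold_durations)}",
        f"Phase: {phase}",
        f"Video Frame: {frame}",
    ])


text = describe_entry([0, 0, 1, 0], [0.0, 0.0, 0.12, 0.0], "press", 156)
print(text)
```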

🎮 Synced Video Playback

Using OpenCV, the viewer can play the corresponding .mp4 inline. You can even jump to the frame number synced with the .npz entry.
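Jumping to a frame with OpenCV usually comes down to setting `cv2.CAP_PROP_POS_FRAMES` before reading. The sketch below assumes that approach; `clamp_frame` and `read_frame_at` are hypothetical names, and the viewer's real playback loop is likely more involved.

```python
def clamp_frame(frame_idx, total_frames):
    """Keep a requested frame index inside the valid range of the video."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    return max(0, min(int(frame_idx), total_frames - 1))


def read_frame_at(video_path, frame_idx):
    """Open the video, seek to frame_idx, and return that frame (BGR array)."""
    import cv2  # imported lazily so clamp_frame works without OpenCV installed

    cap = cv2.VideoCapture(str(video_path))
    try:
        if not cap.isOpened():
            raise IOError(f"could not open {video_path}")
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, clamp_frame(frame_idx, total))
        ok, frame = cap.read()
        if not ok:
            raise IOError(f"could not decode frame {frame_idx}")
        return frame
    finally:
        cap.release()  # always free the file handle, even on errors
```

Frame-accurate seeking depends on the codec and container, so on some files the decoded frame can be off by a few frames from the requested index.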

It took a while to get the video + metadata sync right, but once it worked, it became one of my favorite parts.

🌐 Streamlit Dashboard: Zooming Out to Patterns

The desktop GUI helped me inspect individual samples in detail, but what if I wanted to zoom out and understand the forest instead of the trees?

That’s why I built this Streamlit-powered dashboard — to give me an interactive, global view of the dataset. It's designed to help with pattern recognition, anomaly detection, and feature exploration, all within a clean, responsive UI.

With just a few clicks, I can switch from reviewing a single sample to analyzing 1,000+ entries at once.

Built with Plotly, Pandas, and Streamlit, the dashboard supports:

Interactive visuals (pan, zoom, hover, tooltips)
Real-time filtering
Modular visual breakdowns across statistical, temporal, and image-derived domains

🧪 How the Data is Processed

Each entry is parsed using a pipeline that normalizes it into a feature-rich dictionary:

{
  'phase': 'press',
  'active_keys': ['d'],
  'avg_hold_duration': 0.42,
  'entropy': 4.87,
  'complexity_score': 2.04
}

The complexity score multiplies the number of active keys (interaction density), the average hold duration (temporal footprint), and the entropy:

complexity_score = num_keys * avg_duration * entropy

This formula helps surface "difficult" or unusual samples quickly.
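Here is a small sketch of that calculation. `summarize_entry` is a hypothetical helper, and two details are assumptions: the average hold duration is taken over keys with a non-zero duration, and the entropy is passed in precomputed.

```python
def complexity_score(num_keys, avg_duration, entropy):
    """The product described above: key count x average hold x entropy."""
    return num_keys * avg_duration * entropy


def summarize_entry(phase, active_keys, hold_durations, entropy):
    """Build a feature dictionary for one sample (assumed field layout)."""
    held = [d for d in hold_durations if d > 0]
    avg = sum(held) / len(held) if held else 0.0
    return {
        "phase": phase,
        "active_keys": list(active_keys),
        "avg_hold_duration": avg,
        "entropy": entropy,
        "complexity_score": complexity_score(len(active_keys), avg, entropy),
    }


features = summarize_entry("press", ["d"], [0.0, 0.0, 0.42, 0.0], entropy=4.87)
print(features["complexity_score"])
```

With one key, a 0.42 s hold, and an entropy of 4.87, the product is about 2.05. Because the score is a product, any sample with no active keys scores exactly 0, which is worth remembering when sorting by it.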

📊 What You Can See in the Dashboard

🔢 Advanced Filtering

The sidebar lets you apply multi-criteria filters like:

Only show entries where key='d'
Entropy > 3.5
Phase = 'press'

These filters update every visualization in real time.

🥧 Phase Distribution

This pie chart gives a bird’s eye view of the ratio between press and release events. A heavily imbalanced dataset here might hint at errors during data collection — for example, if a key gets stuck or releases weren’t registered.

🛠 Key Usage Histogram

Each key is a binary (0/1) column of the multi-hot vector. Summing each column with .sum() across all rows gives a histogram of usage frequency, which shows which keys dominate and which are underrepresented.

Example: In one session, I noticed the spacebar wasn’t used at all — turned out the overlay missed capturing it entirely.
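The column sum is easy to reproduce without Pandas. This sketch assumes a key order in `KEYS` and uses made-up rows; it also flags keys that never appear, which is the kind of check that would have caught the missing spacebar.

```python
KEYS = ("w", "a", "s", "d", "space")  # assumed column order


def key_usage(rows):
    """Sum each key column across all multi-hot rows."""
    totals = [sum(column) for column in zip(*rows)]
    return dict(zip(KEYS, totals))


def unused_keys(usage):
    """Keys that were never active in any sample."""
    return [key for key, count in usage.items() if count == 0]


rows = [
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
]
usage = key_usage(rows)
print(usage)               # {'w': 2, 'a': 1, 's': 0, 'd': 2, 'space': 0}
print(unused_keys(usage))  # ['s', 'space']
```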

🕐 Timeline Plots

This section plots metrics like:

Average key hold duration
Number of keys pressed per frame
Entropy of the key state vector
Mean image intensity

All of these are shown over time, helping you spot fatigue effects, repetitive patterns, or abnormal spikes.
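To make spikes easier to see, a timeline metric can be smoothed and compared against its recent baseline. The sketch below is one simple way to do that; the window size, the 2x factor, and both function names are assumptions, not settings from the dashboard.

```python
from collections import deque


def rolling_mean(values, window=3):
    """Trailing moving average; early points average over what is available."""
    buf, out = deque(maxlen=window), []
    for value in values:
        buf.append(value)
        out.append(sum(buf) / len(buf))
    return out


def find_spikes(values, window=3, factor=2.0):
    """Indices whose value exceeds factor times the mean of the previous window."""
    spikes = []
    for i in range(window, len(values)):
        baseline = sum(values[i - window:i]) / window
        if baseline > 0 and values[i] > factor * baseline:
            spikes.append(i)
    return spikes


hold = [0.10, 0.12, 0.11, 0.55, 0.12, 0.10]  # made-up hold durations (seconds)
print(find_spikes(hold))  # [3]
```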

🔗 Key Combination Frequency

Ever wondered which multi-key combos are most common? This bar chart aggregates those. It's helpful when you’re testing systems that require complex inputs (like A+S or D+B).
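Counting combos is a natural fit for `collections.Counter`. In this sketch the keys are sorted before joining so that A+S and S+A land in the same bucket; the sample data and the `combo_counts` name are made up for illustration.

```python
from collections import Counter


def combo_counts(entries):
    """Count how often each set of two or more simultaneous keys appears."""
    counts = Counter()
    for keys in entries:
        if len(keys) >= 2:
            counts["+".join(sorted(keys))] += 1
    return counts


samples = [["a", "s"], ["s", "a"], ["d"], ["d", "b"], []]
top = combo_counts(samples).most_common()
print(top)  # [('a+s', 2), ('b+d', 1)]
```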

📆 Box/Violin Plots

These are especially good for:

Comparing hold durations across press and release
Spotting outliers in entropy distributions

Pro tip: You can toggle between Box and Violin for different statistical insights.
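The points a box plot draws beyond its whiskers typically follow the 1.5 x IQR rule, which can be reproduced with the standard library. This is a sketch of that rule rather than the dashboard's code; quartile conventions differ slightly between libraries, so borderline points may not match a given plot exactly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


entropy = [4.1, 4.3, 4.5, 4.6, 4.8, 4.9, 5.0, 9.7]  # made-up entropy values
print(iqr_outliers(entropy))  # [9.7]
```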

🔥 Correlation Matrix

This heatmap shows how features like entropy, duration, key count, and image brightness relate to each other.

px.imshow(df.corr(numeric_only=True))  # numeric_only skips text columns such as phase

Do more keys mean more entropy? Does brightness correlate with longer hold times? This tab helps answer such questions.
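Each cell of that heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below spells the calculation out in plain Python on made-up columns, just to show what the numbers mean.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


columns = {
    "key_count": [1, 2, 3, 4],
    "entropy": [2.0, 4.0, 6.0, 8.0],
    "brightness": [4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
print(matrix["key_count"]["entropy"], matrix["key_count"]["brightness"])
```

A constant column has zero spread, so this version would divide by zero on it; Pandas reports NaN in that case.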

🌐 3D Feature Space

Using plotly.graph_objects, I render 3D scatter plots across features like:

x: avg_hold_duration
y: entropy
z: mean_image_intensity

It’s surprisingly intuitive — just rotate the cube and you start to see clusters and outliers.
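A 3D scatter like this takes only a few lines with `plotly.graph_objects`. The sketch assumes each row is a dict with the three feature names listed above; `axis_values` and `build_scatter3d` are hypothetical helpers.

```python
AXES = ("avg_hold_duration", "entropy", "mean_image_intensity")


def axis_values(rows, axes=AXES):
    """Pull one list of values per axis out of a list of feature dicts."""
    return tuple([row[name] for row in rows] for name in axes)


def build_scatter3d(rows):
    """Return a Plotly figure with one marker per sample."""
    import plotly.graph_objects as go  # lazy import; only needed for plotting

    xs, ys, zs = axis_values(rows)
    fig = go.Figure(
        data=[go.Scatter3d(x=xs, y=ys, z=zs, mode="markers", marker=dict(size=4))]
    )
    fig.update_layout(
        scene=dict(xaxis_title=AXES[0], yaxis_title=AXES[1], zaxis_title=AXES[2])
    )
    return fig


rows = [
    {"avg_hold_duration": 0.42, "entropy": 4.87, "mean_image_intensity": 120.0},
    {"avg_hold_duration": 0.10, "entropy": 3.20, "mean_image_intensity": 95.0},
]
xs, ys, zs = axis_values(rows)
```

Inside a Streamlit app, the returned figure can be shown with `st.plotly_chart(fig)`.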

📅 Descriptive Statistics & Distribution

The Statistical Analysis tab offers:

Mean, min, max, std
Histograms and KDE plots for every metric

You can use it to sanity check your dataset or define thresholds for cleaning.
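As an illustration of turning those statistics into a cleaning threshold, here is a standard-library sketch. The "mean plus three standard deviations" cutoff is an assumption chosen for the example, not a rule the dashboard applies.

```python
import statistics


def describe(values):
    """Mean, min, max, and sample standard deviation of one metric."""
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def upper_threshold(values, n_std=3.0):
    """Hypothetical cleaning cutoff: mean plus n_std standard deviations."""
    stats = describe(values)
    return stats["mean"] + n_std * stats["std"]


stats = describe([1.0, 2.0, 3.0])
print(stats)
```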

🧰 Image Feature Visualization

This tab bridges image stats with metadata:

Mean vs std of intensity
Per-channel RGB histograms
Intensity vs entropy plots

Perfect for checking visual consistency across samples.

🔄 Similarity-Based Browsing

Find entries with similar statistical profiles (entropy, duration, brightness). Helpful for debugging or curating subsets for training.
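One plausible way to implement this kind of lookup is a nearest-neighbor search over scaled features. The sketch below assumes min-max scaling and Euclidean distance; the dashboard may use a different metric, and `most_similar` is a hypothetical name.

```python
import math

FEATURES = ("entropy", "avg_hold_duration", "mean_image_intensity")


def _scaled(rows):
    """Min-max scale each feature to [0, 1] so no single unit dominates."""
    bounds = {f: (min(r[f] for r in rows), max(r[f] for r in rows)) for f in FEATURES}
    vectors = []
    for row in rows:
        vec = []
        for f in FEATURES:
            lo, hi = bounds[f]
            vec.append((row[f] - lo) / (hi - lo) if hi > lo else 0.0)
        vectors.append(vec)
    return vectors


def most_similar(rows, query_index, top_n=3):
    """Indices of the top_n entries closest to the query entry."""
    vectors = _scaled(rows)
    query = vectors[query_index]
    others = (i for i in range(len(rows)) if i != query_index)
    return sorted(others, key=lambda i: math.dist(query, vectors[i]))[:top_n]


rows = [
    {"entropy": 4.8, "avg_hold_duration": 0.40, "mean_image_intensity": 120},
    {"entropy": 4.9, "avg_hold_duration": 0.42, "mean_image_intensity": 118},
    {"entropy": 2.1, "avg_hold_duration": 0.05, "mean_image_intensity": 30},
    {"entropy": 4.7, "avg_hold_duration": 0.38, "mean_image_intensity": 125},
]
print(most_similar(rows, 0, top_n=2))  # [1, 3]
```

Scaling matters here: without it, brightness (0 to 255) would swamp hold durations measured in fractions of a second.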

🏐 Parallel Coordinates Plot

This visualization shows how multiple dimensions change per sample. Hovering over a line shows the exact stats, letting you spot rare or edge-case entries.

✨ Single Entry Inspector

Select any entry index and get a complete breakdown:

Metadata
Keys pressed
Video frame
Image preview (inline)

You can even compare two entries side-by-side to study variations.

📥 Export Tools

A simple download button lets you:

Export filtered results as CSV
Export summary statistics as JSON

Perfect for carrying forward into model training or reports.
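The payloads behind such buttons can be built with the standard library alone. This sketch assumes the filtered rows are flat dicts; `to_csv` and `to_summary_json` are hypothetical helpers.

```python
import csv
import io
import json


def to_csv(rows):
    """Serialize a list of flat dicts to CSV text, header first."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_summary_json(rows, metric):
    """Serialize count, mean, min, and max of one metric as JSON text."""
    values = [row[metric] for row in rows]
    summary = {
        "count": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
    return json.dumps({metric: summary}, indent=2)


rows = [{"phase": "press", "entropy": 4.0}, {"phase": "release", "entropy": 2.0}]
csv_text = to_csv(rows)
summary_text = to_summary_json(rows, "entropy")
```

In Streamlit, either string can be handed to `st.download_button`, for example `st.download_button("Export CSV", data=csv_text, file_name="filtered.csv", mime="text/csv")`.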

✅ Why Streamlit?

I chose Streamlit for 3 main reasons:

Speed — I could prototype visuals in minutes.
Shareability — It runs in the browser; easy to demo.
Interactivity — Sliders, dropdowns, filters — all without heavy JS.

With just Python and a few lines of config, I had a production-ready, dark-themed analytics dashboard that made my dataset finally come alive.

🛠 Bonus Tools I Built

🔗 find_linked_video() — Matches .npz with its .mp4 by filename prefix
🧪 entry_to_dict() — Converts entries to detailed dicts with calculated stats
🧹 Error handling — No crash if a file is broken, empty, or malformed
🔐 Custom CSS for a dark and clean UI
🌐 In-browser report generation: just click 'Generate Report' in the sidebar
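For the prefix matching, a sketch with `pathlib` might look like the following. It assumes the video lives in the same folder as the .npz and that its file name starts with the .npz stem; `find_video_for` is a hypothetical name, not the repo's `find_linked_video()`.

```python
from pathlib import Path


def find_video_for(npz_path):
    """Return the first .mp4 next to npz_path whose name starts with its stem."""
    npz_path = Path(npz_path)
    for video in sorted(npz_path.parent.glob("*.mp4")):
        if video.stem.startswith(npz_path.stem):
            return video
    return None  # no match: the caller decides how to report it
```

Returning `None` instead of raising fits the "no crash" goal: the viewer can simply disable playback for samples without a matching video.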


What I Learned

This project reminded me of something simple but powerful:

"You can't improve what you can't see."

The second I could see what I had collected — the patterns, the outliers, the gaps — I started thinking more clearly about how to train and debug my models.

Now my pipeline looks like this:

Record with key sync
Review & fix visually
Analyze patterns and extract features

…and it all just flows.

🙏 Thanks for Reading

If this inspired you or helped your own data workflows, I’d love to hear from you. Fork the repo, star it, or just reach out.

🔗 GitHub: https://github.com/Ertugrulmutlu/-Data-Scrap-Tool-Advanced-Dataset-Viewer

🎥 YouTube Demo: https://youtu.be/s50oPjmyJ1w

Let’s build better datasets together. 🦿📀
