Ertugrul

Modular Snip Recorder: A Data Collection Tool for Behavior Cloning (2/2)

Part 2:

After building the data collection tool, I had my neatly packed .npz and .mp4 files. But then came the real question:

"Now that I've collected the data, how do I make sense of it?"

So I set out to build a set of tools that would help me not just review the data, but actually understand it. What came out of that is something I'm really proud of.


Desktop Dataset Viewer: For Seeing the Details

This tool is all about feeling the data, one sample at a time. The desktop GUI is built with Tkinter and designed to make browsing, filtering, and understanding each entry as intuitive as flipping through photos.

Navigating Like a Human

  • Use Prev/Next buttons (or arrow keys) to flip through samples.
  • Apply filters for phases and keys, e.g., only show 'press' events where 'd' was active.
  • It can even include broken or empty samples if you want, which is great for debugging.

Pro tip: Filters are applied in real time using self._apply_filters() and update a filtered index map on the fly.
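
To make the "filtered index map" idea concrete, here is a minimal sketch in plain Python. It is not the body of self._apply_filters(); the function name build_filtered_indices and the entry layout (dicts with 'phase' and 'active_keys') are assumptions made for illustration.

```python
from typing import Optional, Sequence


def build_filtered_indices(
    entries: Sequence[dict],
    phase: Optional[str] = None,
    required_key: Optional[str] = None,
) -> list:
    """Return the positions in `entries` that pass every active filter."""
    kept = []
    for index, entry in enumerate(entries):
        # A filter set to None is treated as "not active".
        if phase is not None and entry.get("phase") != phase:
            continue
        if required_key is not None and required_key not in entry.get("active_keys", []):
            continue
        kept.append(index)
    return kept


# Made-up entries, only for demonstration.
entries = [
    {"phase": "press", "active_keys": ["d"]},
    {"phase": "release", "active_keys": ["d"]},
    {"phase": "press", "active_keys": ["a", "s"]},
    {"phase": "press", "active_keys": ["a", "d"]},
]

# Only 'press' events where 'd' was active.
filtered = build_filtered_indices(entries, phase="press", required_key="d")
print(filtered)
```

Keeping a list of original indices, rather than copying the matching entries, means Prev/Next can step through the filtered list while each sample is still looked up by its position in the full dataset.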

Editor

Rich Metadata, Always Visible

Each entry displays:

  • Keys held (multi-hot vector)
  • Duration each key has been held
  • Whether it's a "press" or "release" event
  • The exact video frame it maps to (yes, the viewer can open the matching .mp4 and seek there)

For example:
Active Keys: ['d']
Hold Durations: [0.0, 0.0, 0.12, 0.0]
Phase: press
Video Frame: 156
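
As a rough illustration of how a multi-hot vector turns into the readable fields above, the sketch below decodes one entry. The key order in KEY_ORDER is a guess (the real tool defines its own mapping), and decode_entry is a hypothetical helper, not a function from the repo.

```python
# Hypothetical key order; the real tool defines its own mapping.
KEY_ORDER = ("a", "s", "d", "w")


def decode_entry(multi_hot, hold_durations, key_order=KEY_ORDER):
    """Turn a multi-hot vector and per-key durations into readable metadata."""
    # A key is active when its slot in the multi-hot vector is non-zero.
    active = [key for key, flag in zip(key_order, multi_hot) if flag]
    # Keep durations only for the keys that are currently held.
    held = {
        key: hold_durations[i]
        for i, key in enumerate(key_order)
        if multi_hot[i]
    }
    return active, held


active, held = decode_entry([0, 0, 1, 0], [0.0, 0.0, 0.12, 0.0])
print("Active Keys:", active)
print("Hold Durations:", held)
```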

Synced Video Playback

Using OpenCV, the viewer can play the corresponding .mp4 inline. You can even jump to the frame number synced with the .npz entry.

It took a while to get the video + metadata sync right, but once it worked, it became one of my favorite parts.
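
For readers who want to try the seek step themselves, here is a small sketch of jumping to a frame with OpenCV's VideoCapture. It is an assumption about how such a seek could be done, not the viewer's actual code; read_frame_at is a hypothetical helper, and the file name in the usage comment is a placeholder.

```python
try:
    import cv2
    POS_FRAMES = cv2.CAP_PROP_POS_FRAMES
except ImportError:  # lets the helper be exercised without OpenCV installed
    POS_FRAMES = 1  # numeric value of cv2.CAP_PROP_POS_FRAMES


def read_frame_at(capture, frame_index):
    """Seek `capture` to `frame_index` and return that frame, or None on failure."""
    if frame_index < 0:
        return None
    capture.set(POS_FRAMES, frame_index)
    ok, frame = capture.read()
    return frame if ok else None


# Usage with a real file (the path is a placeholder):
# capture = cv2.VideoCapture("session.mp4")
# frame = read_frame_at(capture, 156)
# capture.release()
```

Depending on the codec and backend, seeking by frame index may not always land on the exact frame, which could be one reason the sync takes some tuning.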

Editor with video


Streamlit Dashboard: Zooming Out to Patterns

The desktop GUI helped me inspect individual samples in detail, but what if I wanted to zoom out and understand the forest instead of the trees?

That's why I built this Streamlit-powered dashboard: to give me an interactive, global view of the dataset. It's designed to help with pattern recognition, anomaly detection, and feature exploration, all within a clean, responsive UI.

With just a few clicks, I can switch from reviewing a single sample to analyzing 1,000+ entries at once.

Built with Plotly, Pandas, and Streamlit, the dashboard supports:

  • Interactive visuals (pan, zoom, hover, tooltips)
  • Real-time filtering
  • Modular visual breakdowns across statistical, temporal, and image-derived domains

How the Data is Processed

Each entry is parsed using a pipeline that normalizes it into a feature-rich dictionary:

{
  'phase': 'press',
  'active_keys': ['d'],
  'avg_hold_duration': 0.42,
  'entropy': 4.87,
  'complexity_score': 2.04
}

The complexity score combines interaction density and temporal footprint:

complexity_score = num_keys * avg_duration * entropy

This formula helps surface "difficult" or unusual samples quickly.
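
As a quick worked example of the formula, the sketch below plugs in the sample dictionary from above, assuming num_keys is simply the number of active keys. The function name is illustrative.

```python
def complexity_score(num_keys, avg_duration, entropy):
    """Product of key count, average hold duration, and entropy."""
    return num_keys * avg_duration * entropy


entry = {"active_keys": ["d"], "avg_hold_duration": 0.42, "entropy": 4.87}
score = complexity_score(
    len(entry["active_keys"]),
    entry["avg_hold_duration"],
    entry["entropy"],
)
print(f"{score:.4f}")
```

With one key, 1 * 0.42 * 4.87 comes out at roughly 2.045, in line with the 2.04 in the sample entry. Because the score is a product, an entry with no active keys always scores 0, whatever its entropy.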


What You Can See in the Dashboard

Advanced Filtering

The sidebar lets you apply multi-criteria filters like:

  • Only show entries where key='d'
  • Entropy > 3.5
  • Phase = 'press'

These filters update every visualization in real time.

Advanced Filtering

Phase Distribution

This pie chart gives a bird's-eye view of the ratio between press and release events. A heavily imbalanced dataset here might hint at errors during data collection, for example if a key got stuck or releases weren't registered.
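
The same check can be done numerically. This is a small sketch using made-up phase labels; the 70% cut-off is an arbitrary rule of thumb chosen for illustration, not something the dashboard defines.

```python
from collections import Counter

# Made-up phase labels, one per sample.
phases = ["press", "release", "press", "press", "release", "press"]

counts = Counter(phases)
total = sum(counts.values())
shares = {phase: count / total for phase, count in counts.items()}

# Hypothetical rule of thumb: warn if one phase exceeds 70% of samples.
imbalanced = max(shares.values()) > 0.70
print(shares, imbalanced)
```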

Phase Distribution Chart

Key Usage Histogram

Each key is stored as its own binary column (the multi-hot encoding mentioned earlier). Calling .sum() down each column gives the usage count per key, which is then plotted as a histogram. It is useful for seeing which keys dominate, or whether any were underrepresented.

Example: In one session, I noticed the spacebar wasn't used at all. It turned out the overlay missed capturing it entirely.
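
Here is the same column sum written in plain Python instead of pandas, so it can be run without any dependencies. The key names and rows are invented for the example.

```python
KEY_ORDER = ("a", "s", "d", "space")  # hypothetical column names

# One multi-hot row per sample (made-up data).
rows = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
]

# zip(*rows) transposes rows into columns; summing a column counts
# how many samples had that key active.
usage = {key: sum(column) for key, column in zip(KEY_ORDER, zip(*rows))}
unused = [key for key, count in usage.items() if count == 0]

print(usage)
print("Never used:", unused)
```

A key that shows up in the "never used" list is worth a second look, since it can mean either that the key really was not pressed or that it was not captured.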

Key Usage Histogram

Timeline Plots

This section plots metrics like:

  • Average key hold duration
  • Number of keys pressed per frame
  • Entropy of the key state vector
  • Mean image intensity

All of these are shown over time, helping you spot fatigue effects, repetitive patterns, or abnormal spikes.

Timeline Plots

Key Combination Frequency

Ever wondered which multi-key combos are most common? This bar chart aggregates those. It's helpful when you're testing systems that require complex inputs (like A+S or D+B).
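
One way such an aggregation could work is sketched below with collections.Counter. The sample data is made up, and treating only entries with two or more keys as "combos" is an assumption.

```python
from collections import Counter

# Active keys per sample (made-up data).
samples = [["a"], ["a", "s"], ["s", "a"], ["d", "b"], ["d"], ["a", "s"]]

# Sort each key list so that A+S and S+A count as the same combination,
# and keep only samples with at least two keys held.
combos = Counter("+".join(sorted(keys)) for keys in samples if len(keys) >= 2)

for combo, count in combos.most_common():
    print(combo, count)
```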

Key Combination Frequency

Box/Violin Plots

These are especially good for:

  • Comparing hold durations across press and release
  • Spotting outliers in entropy distributions

Pro tip: You can toggle between Box and Violin for different statistical insights.

Box/Violin Plots

Correlation Matrix

This heatmap shows how features like entropy, duration, key count, and image brightness relate to each other.

px.imshow(df.corr())

Do more keys mean more entropy? Does brightness correlate with longer hold times? This tab helps answer such questions.
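
Each cell of that heatmap is a correlation coefficient; pandas' DataFrame.corr() uses Pearson correlation by default. The sketch below computes it by hand for two pairs of made-up columns, just to show what the numbers mean.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column makes the denominator zero.
    return cov / math.sqrt(var_x * var_y)


key_count = [1, 2, 3, 4]
entropy = [2.0, 4.0, 6.0, 8.0]      # made-up values, perfectly linear
brightness = [8.0, 6.0, 4.0, 2.0]   # made-up values, perfectly inverse

print(pearson(key_count, entropy))
print(pearson(key_count, brightness))
```

Values near +1 mean the two features rise together, values near -1 mean one falls as the other rises, and values near 0 mean no linear relationship.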

Correlation Matrix

3D Feature Space

Using plotly.graph_objects, I render 3D scatter plots across features like:

  • x: avg_hold_duration
  • y: entropy
  • z: mean_image_intensity

It's surprisingly intuitive: just rotate the cube and you start to see clusters and outliers.

3D Feature Space

Descriptive Statistics & Distribution

The Statistical Analysis tab offers:

  • Mean, min, max, std
  • Histograms and KDE plots for every metric

You can use it to sanity check your dataset or define thresholds for cleaning.
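
As an example of turning those statistics into a cleaning threshold, here is a sketch with the standard library's statistics module. The durations are invented, and the "mean plus two standard deviations" rule is only one possible choice.

```python
import statistics

# Made-up hold durations with one obvious outlier.
durations = [0.10, 0.12, 0.11, 0.13, 0.12, 0.10, 0.11, 0.12, 0.13, 0.95]

summary = {
    "mean": statistics.mean(durations),
    "min": min(durations),
    "max": max(durations),
    "std": statistics.stdev(durations),  # sample standard deviation
}

# Hypothetical cleaning rule: flag values more than 2 std above the mean.
threshold = summary["mean"] + 2 * summary["std"]
flagged = [d for d in durations if d > threshold]

print(summary)
print("Flagged:", flagged)
```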

Descriptive Statistics & Distribution

Image Feature Visualization

This tab bridges image stats with metadata:

  • Mean vs std of intensity
  • Per-channel RGB histograms
  • Intensity vs entropy plots

Perfect for checking visual consistency across samples.

Similarity-Based Browsing

Find entries with similar statistical profiles (entropy, duration, brightness). Helpful for debugging or curating subsets for training.
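
The post does not say how similarity is measured, so the following is only one plausible approach: scale each feature to z-scores and rank entries by Euclidean distance. The function name most_similar and the feature values are made up.

```python
import math
import statistics


def most_similar(features, query_index, top_k=2):
    """Indices of the entries closest to `query_index` in z-scored feature space."""
    columns = list(zip(*features))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    scaled = [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in features
    ]
    query = scaled[query_index]
    distances = [
        (math.dist(query, row), index)
        for index, row in enumerate(scaled)
        if index != query_index
    ]
    return [index for _, index in sorted(distances)[:top_k]]


# Columns: entropy, avg_hold_duration, mean image intensity (made-up values).
features = [
    [4.8, 0.40, 120.0],
    [4.9, 0.42, 118.0],
    [2.1, 0.05, 30.0],
    [4.7, 0.39, 125.0],
    [2.0, 0.06, 35.0],
]
print(most_similar(features, query_index=0))
```

Scaling first matters because brightness (0 to 255) would otherwise dominate the distance over features like hold duration, which are fractions of a second here.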

Parallel Coordinates Plot

This visualization shows how multiple dimensions change per sample. Hovering over a line shows the exact stats, letting you spot rare or edge-case entries.

Image Feature Visualization 1

Image Feature Visualization 2

Single Entry Inspector

Select any entry index and get a complete breakdown:

  • Metadata
  • Keys pressed
  • Video frame
  • Image preview (inline)

Single Entry Inspector 1

Single Entry Inspector 2

Single Entry Inspector 3

You can even compare two entries side-by-side to study variations.

Comparing

Export Tools

A simple download button lets you:

  • Export filtered results as CSV
  • Export summary statistics as JSON

Perfect for carrying forward into model training or reports.
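
A minimal sketch of what the two exports could look like, using only the standard library. The helper names to_csv and summary_json and the sample rows are assumptions; the dashboard's real export code may differ.

```python
import csv
import io
import json
import statistics


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def summary_json(rows, metric):
    """Serialize basic statistics for one numeric column to JSON text."""
    values = [row[metric] for row in rows]
    stats = {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }
    return json.dumps({metric: stats}, indent=2)


# Made-up filtered results.
filtered = [
    {"phase": "press", "entropy": 4.0},
    {"phase": "press", "entropy": 5.0},
]
csv_text = to_csv(filtered, ["phase", "entropy"])
json_text = summary_json(filtered, "entropy")
print(csv_text)
print(json_text)
```

In a Streamlit app, strings like these could be passed as the data argument of st.download_button.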


Export_sidebar


Why Streamlit?

I chose Streamlit for 3 main reasons:

  1. Speed: I could prototype visuals in minutes.
  2. Shareability: It runs in the browser, so it's easy to demo.
  3. Interactivity: Sliders, dropdowns, and filters, all without heavy JS.

With just Python and a few lines of config, I had a production-ready, dark-themed analytics dashboard that made my dataset finally come alive.


Bonus Tools I Built

  • find_linked_video(): Matches .npz with its .mp4 by filename prefix
  • entry_to_dict(): Converts entries to detailed dicts with calculated stats
  • Error handling: No crash if a file is broken, empty, or malformed
  • Custom CSS for a dark and clean UI
  • In-browser report generation: just click 'Generate Report' in the sidebar
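
To illustrate the filename-prefix matching idea, here is a small pathlib sketch. It is not the repo's find_linked_video(); the name find_video_for and the rule "first .mp4 in the same folder whose name starts with the .npz stem" are assumptions.

```python
from pathlib import Path
from typing import Optional


def find_video_for(npz_path) -> Optional[Path]:
    """Return an .mp4 in the same folder whose name starts with the .npz stem."""
    npz_path = Path(npz_path)
    candidates = sorted(
        path
        for path in npz_path.parent.iterdir()
        if path.suffix.lower() == ".mp4" and path.name.startswith(npz_path.stem)
    )
    # Returning None instead of raising lets the caller show the entry
    # without video when no match exists.
    return candidates[0] if candidates else None


# Usage (the file name is a placeholder):
# video = find_video_for("recordings/session_01.npz")
```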

More Photos

Data_Explorer

What I Learned

This project reminded me of something simple but powerful:

"You can't improve what you can't see."

The second I could see what I had collected (the patterns, the outliers, the gaps), I started thinking more clearly about how to train and debug my models.

Now my pipeline looks like this:

  1. Record with key sync
  2. Review & fix visually
  3. Analyze patterns and extract features

...and it all just flows.


Thanks for Reading

If this inspired you or helped your own data workflows, I'd love to hear from you. Fork the repo, star it, or just reach out.

GitHub: https://github.com/Ertugrulmutlu/-Data-Scrap-Tool-Advanced-Dataset-Viewer

YouTube Demo: https://youtu.be/s50oPjmyJ1w

Let's build better datasets together.
