Part 2:
After building the data collection tool, I had my neatly packed .npz and .mp4 files. But then came the real question:
"Now that I've collected the data, how do I make sense of it?"
So I set out to build a set of tools that would help me not just review the data, but actually understand it. What came out of that is something I'm really proud of.
Desktop Dataset Viewer: For Seeing the Details
This tool is all about feeling the data, one sample at a time. The desktop GUI is built with Tkinter and designed to make browsing, filtering, and understanding each entry as intuitive as flipping through photos.
Navigating Like a Human
- Use Prev/Next buttons (or arrow keys) to flip through samples.
- Apply filters for phases and keys, e.g., only show 'press' events where 'd' was active.
- It can even include broken or empty samples if you want, which is great for debugging.
Pro tip: Filters are applied in real time using self._apply_filters(), which updates a filtered index map on the fly.
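Here's a minimal sketch of what a filtered index map can look like. The entry layout (dicts with 'phase' and 'active_keys', None for a broken sample) and the name build_filtered_indices are assumptions for illustration, not the viewer's actual code:

```python
def build_filtered_indices(entries, phase=None, key=None, include_broken=False):
    """Return the positions in `entries` that pass the active filters."""
    indices = []
    for i, entry in enumerate(entries):
        if entry is None:  # assumed marker for a broken or empty sample
            if include_broken:
                indices.append(i)
            continue
        if phase is not None and entry["phase"] != phase:
            continue
        if key is not None and key not in entry["active_keys"]:
            continue
        indices.append(i)
    return indices


entries = [
    {"phase": "press", "active_keys": ["d"]},
    None,
    {"phase": "release", "active_keys": ["d"]},
    {"phase": "press", "active_keys": ["a", "d"]},
]
filtered = build_filtered_indices(entries, phase="press", key="d")
```

With a map like this, Prev/Next can step through `filtered` instead of the raw list, so the underlying data never has to be copied or reordered.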
Rich Metadata, Always Visible
Each entry displays:
- Keys held (multi-hot vector)
- Duration each key has been held
- Whether it's a "press" or "release" event
- The exact video frame it maps to (yes, the viewer can open the matching .mp4 and seek there)
    Active Keys: ['d']
    Hold Durations: [0.0, 0.0, 0.12, 0.0]
    Phase: press
    Video Frame: 156
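A readout like this can be derived from the multi-hot vector with a few lines of Python. This is only a sketch: KEY_ORDER is a made-up key layout and describe_entry is a hypothetical helper, neither is taken from the tool.

```python
KEY_ORDER = ["w", "a", "d", "s"]  # assumed layout, not the tool's real key order


def describe_entry(multi_hot, hold_durations, phase, frame):
    """Turn raw per-key arrays into the labelled fields shown in the viewer."""
    active = [name for name, on in zip(KEY_ORDER, multi_hot) if on]
    return {
        "Active Keys": active,
        "Hold Durations": list(hold_durations),
        "Phase": phase,
        "Video Frame": frame,
    }


readout = describe_entry([0, 0, 1, 0], [0.0, 0.0, 0.12, 0.0], "press", 156)
```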
Synced Video Playback
Using OpenCV, the viewer can play the corresponding .mp4 inline. You can even jump to the frame number synced with the .npz entry.
It took a while to get the video + metadata sync right, but once it worked, it became one of my favorite parts.
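For anyone trying the same thing, here's a rough sketch of seeking to a frame with OpenCV. The function names are hypothetical and this is not the viewer's code; it only uses the standard cv2.VideoCapture calls. Depending on the codec, seeking may not land on the exact frame, so treat it as a starting point.

```python
def clamp_frame_index(frame_index, frame_count):
    """Keep the requested index inside [0, frame_count - 1]."""
    if frame_count <= 0:
        raise ValueError("video has no frames")
    return max(0, min(frame_index, frame_count - 1))


def read_frame_at(video_path, frame_index):
    """Seek to `frame_index` and return that frame (BGR array), or None."""
    import cv2  # imported lazily so clamp_frame_index works without OpenCV

    cap = cv2.VideoCapture(str(video_path))
    try:
        if not cap.isOpened():
            return None
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        if total <= 0:
            return None
        cap.set(cv2.CAP_PROP_POS_FRAMES, clamp_frame_index(frame_index, total))
        ok, frame = cap.read()
        return frame if ok else None
    finally:
        cap.release()  # always free the file handle
```

Clamping first means a stale frame number from the .npz entry degrades to the nearest valid frame instead of a failed read.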
Streamlit Dashboard: Zooming Out to Patterns
The desktop GUI helped me inspect individual samples in detail, but what if I wanted to zoom out and understand the forest instead of the trees?
That's why I built this Streamlit-powered dashboard: to give me an interactive, global view of the dataset. It's designed to help with pattern recognition, anomaly detection, and feature exploration, all within a clean, responsive UI.
With just a few clicks, I can switch from reviewing a single sample to analyzing 1,000+ entries at once.
Built with Plotly, Pandas, and Streamlit, the dashboard supports:
- Interactive visuals (pan, zoom, hover, tooltips)
- Real-time filtering
- Modular visual breakdowns across statistical, temporal, and image-derived domains
How the Data is Processed
Each entry is parsed using a pipeline that normalizes it into a feature-rich dictionary:
    {
        'phase': 'press',
        'active_keys': ['d'],
        'avg_hold_duration': 0.42,
        'entropy': 4.87,
        'complexity_score': 2.04
    }
The complexity score combines interaction density and temporal footprint:
    complexity_score = num_keys * avg_duration * entropy
This formula helps surface "difficult" or unusual samples quickly.
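Here's a small sketch of that normalization step, built around the formula above. Two things are assumptions on my side: entry_to_features is a hypothetical name, and the average hold duration is taken over the keys that are actually held. Entropy is passed in as-is because its exact definition isn't covered here.

```python
def entry_to_features(phase, active_keys, hold_durations, entropy):
    """Build a feature dict, including the complexity score from the formula."""
    held = [d for d in hold_durations if d > 0]  # assumption: ignore idle keys
    avg_duration = sum(held) / len(held) if held else 0.0
    return {
        "phase": phase,
        "active_keys": list(active_keys),
        "avg_hold_duration": avg_duration,
        "entropy": entropy,
        # complexity_score = num_keys * avg_duration * entropy
        "complexity_score": len(active_keys) * avg_duration * entropy,
    }


features = entry_to_features("press", ["a", "d"], [0.0, 0.25, 0.75, 0.0], 3.0)
```

Because the score is a plain product, any entry with no active keys or a zero hold duration scores 0 regardless of its entropy, which is worth keeping in mind when sorting by it.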
What You Can See in the Dashboard
Advanced Filtering
The sidebar lets you apply multi-criteria filters like:
- Only show entries where key='d'
- Entropy > 3.5
- Phase = 'press'
These filters update every visualization in real time.
Phase Distribution
This pie chart gives a bird's-eye view of the ratio between press and release events. A heavily imbalanced dataset here might hint at errors during data collection, for example if a key got stuck or releases weren't registered.
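The numbers behind a chart like this are just per-phase shares. A minimal sketch, with phase_ratio as a hypothetical helper name:

```python
from collections import Counter


def phase_ratio(phases):
    """Share of each phase label, e.g. the slices of the pie chart."""
    counts = Counter(phases)
    total = sum(counts.values())
    return {phase: n / total for phase, n in counts.items()} if total else {}


ratio = phase_ratio(["press", "release", "press", "press"])
```

In a healthy recording every press is eventually followed by a release, so a split far from even is a cue to go back and inspect the raw samples.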
Key Usage Histogram
Each key is represented as a binary (0/1) column. We simply .sum() across all rows to produce a histogram of usage frequency. Useful for seeing which keys dominate, or whether any were underrepresented.
Example: In one session, I noticed the spacebar wasn't used at all. It turned out the overlay missed capturing it entirely.
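The same column sum can be written in plain Python, which also makes it easy to flag keys that never show up. The key names and the key_usage helper below are illustrative assumptions:

```python
def key_usage(rows, key_names):
    """Column-wise sum over multi-hot rows: one usage count per key."""
    totals = [sum(column) for column in zip(*rows)]
    return dict(zip(key_names, totals))


rows = [
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
usage = key_usage(rows, ["w", "a", "d", "space"])
unused = [key for key, count in usage.items() if count == 0]
```

A non-empty `unused` list is the kind of signal that points at a capture problem like the missing spacebar.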
Timeline Plots
This section plots metrics like:
- Average key hold duration
- Number of keys pressed per frame
- Entropy of the key state vector
- Mean image intensity
All of these are shown over time, helping you spot fatigue effects, repetitive patterns, or abnormal spikes.
Key Combination Frequency
Ever wondered which multi-key combos are most common? This bar chart aggregates those. It's helpful when you're testing systems that require complex inputs (like A+S or D+B).
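One way to aggregate combos is to sort each entry's active keys so that A+S and S+A count as the same combination. This is a sketch with a hypothetical combo_counts helper, not the dashboard's own code:

```python
from collections import Counter


def combo_counts(active_key_lists):
    """Count multi-key combinations, ignoring the order keys were listed in."""
    combos = (tuple(sorted(keys)) for keys in active_key_lists if len(keys) > 1)
    return Counter(combos)


counts = combo_counts([["a", "s"], ["s", "a"], ["d"], ["d", "b"]])
top = counts.most_common(1)
```

Single-key entries are skipped on purpose here, since the key usage histogram already covers them.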
Box/Violin Plots
These are especially good for:
- Comparing hold durations across press and release
- Spotting outliers in entropy distributions
Pro tip: You can toggle between Box and Violin for different statistical insights.
Correlation Matrix
This heatmap shows how features like entropy, duration, key count, and image brightness relate to each other.
    px.imshow(df.corr())
Do more keys mean more entropy? Does brightness correlate with longer hold times? This tab helps answer such questions.
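Each cell of that heatmap is a Pearson correlation coefficient, which is what df.corr() computes by default. A plain-Python sketch of a single cell, with made-up sample values:

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / (spread_x * spread_y)


key_counts = [1, 2, 3, 4]      # illustrative values only
entropies = [2.0, 4.0, 6.0, 8.0]
r = pearson(key_counts, entropies)
```

One practical note: in recent pandas versions, df.corr() may need numeric_only=True (or a numeric-only frame) when the DataFrame also holds columns like phase or active_keys.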
3D Feature Space
Using plotly.graph_objects, I render 3D scatter plots across features like:
- x: avg_hold_duration
- y: entropy
- z: mean_image_intensity
It's surprisingly intuitive: just rotate the cube and you start to see clusters and outliers.
Descriptive Statistics & Distribution
The Statistical Analysis tab offers:
- Mean, min, max, std
- Histograms and KDE plots for every metric
You can use it to sanity check your dataset or define thresholds for cleaning.
Image Feature Visualization
This tab bridges image stats with metadata:
- Mean vs std of intensity
- Per-channel RGB histograms
- Intensity vs entropy plots
Perfect for checking visual consistency across samples.
Similarity-Based Browsing
Find entries with similar statistical profiles (entropy, duration, brightness). Helpful for debugging or curating subsets for training.
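A simple way to do this kind of lookup is nearest neighbours on scaled features. The sketch below assumes min-max scaling and Euclidean distance; the dashboard may use a different metric, and most_similar is a hypothetical name.

```python
import math


def most_similar(features, query_index, k=3):
    """Indices of the k rows closest to features[query_index]."""
    columns = list(zip(*features))
    lows = [min(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    scaled = [
        [(value - low) / span for value, low, span in zip(row, lows, spans)]
        for row in features
    ]
    query = scaled[query_index]
    distances = [
        (math.dist(query, row), i)
        for i, row in enumerate(scaled)
        if i != query_index
    ]
    return [i for _, i in sorted(distances)[:k]]


# Columns: avg_hold_duration, entropy, mean brightness (illustrative values)
features = [
    [0.40, 4.8, 120],
    [0.41, 4.9, 118],
    [1.50, 1.0, 30],
    [0.90, 3.0, 200],
]
neighbours = most_similar(features, query_index=0, k=2)
```

Scaling matters here: without it, brightness (0 to 255) would dominate the distance and duration would barely count.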
Parallel Coordinates Plot
This visualization shows how multiple dimensions change per sample. Hovering over a line shows the exact stats, letting you spot rare or edge-case entries.
Single Entry Inspector
Select any entry index and get a complete breakdown:
- Metadata
- Keys pressed
- Video frame
- Image preview (inline)
You can even compare two entries side-by-side to study variations.
Export Tools
A simple download button lets you:
- Export filtered results as CSV
- Export summary statistics as JSON
Perfect for carrying forward into model training or reports.
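Both exports boil down to building a string in memory. Here's a standard-library sketch; the helper names and the choice of summary fields are assumptions, not the dashboard's actual output format.

```python
import csv
import io
import json
import statistics


def rows_to_csv(rows, fieldnames):
    """Serialize filtered rows (a list of dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def summary_to_json(rows, metric):
    """Serialize basic summary statistics for one metric to JSON text."""
    values = [row[metric] for row in rows]
    summary = {
        "metric": metric,
        "count": len(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }
    return json.dumps(summary, indent=2)


rows = [
    {"phase": "press", "entropy": 4.0},
    {"phase": "release", "entropy": 2.0},
]
csv_text = rows_to_csv(rows, ["phase", "entropy"])
json_text = summary_to_json(rows, "entropy")
```

In a Streamlit app, strings like these can be handed to st.download_button as the data argument.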
Why Streamlit?
I chose Streamlit for 3 main reasons:
- Speed: I could prototype visuals in minutes.
- Shareability: It runs in the browser, so it's easy to demo.
- Interactivity: Sliders, dropdowns, and filters, all without heavy JS.
With just Python and a few lines of config, I had a production-ready, dark-themed analytics dashboard that made my dataset finally come alive.
Bonus Tools I Built
- find_linked_video(): Matches .npz with its .mp4 by filename prefix
- entry_to_dict(): Converts entries to detailed dicts with calculated stats
- Error handling: No crash if a file is broken, empty, or malformed
- Custom CSS for a dark and clean UI
- In-browser report generation: just click 'Generate Report' in the sidebar
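To give an idea of the prefix matching, here's a sketch of a find_linked_video()-style helper. The matching rule (the video's file name starts with the .npz file's stem, both in the same folder) is my assumption; the real function in the repo may differ.

```python
import tempfile
from pathlib import Path


def find_linked_video(npz_path):
    """Return the first .mp4 next to `npz_path` whose name starts with its stem."""
    npz_path = Path(npz_path)
    for candidate in sorted(npz_path.parent.iterdir()):
        is_video = candidate.suffix.lower() == ".mp4"
        if is_video and candidate.stem.startswith(npz_path.stem):
            return candidate
    return None  # no video found; callers can skip playback instead of crashing


# Tiny demo in a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("session_01.npz", "session_01.mp4", "other.mp4"):
        (folder / name).touch()
    linked = find_linked_video(folder / "session_01.npz")
    linked_name = linked.name if linked else None
    missing = find_linked_video(folder / "session_02.npz")
```

Returning None for a missing video fits the error-handling goal above: the viewer can still show the metadata even when there is nothing to play.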
What I Learned
This project reminded me of something simple but powerful:
"You can't improve what you can't see."
The second I could see what I had collected (the patterns, the outliers, the gaps), I started thinking more clearly about how to train and debug my models.
Now my pipeline looks like this:
- Record with key sync
- Review & fix visually
- Analyze patterns and extract features
...and it all just flows.
Thanks for Reading
If this inspired you or helped your own data workflows, I'd love to hear from you. Fork the repo, star it, or just reach out.
GitHub: https://github.com/Ertugrulmutlu/-Data-Scrap-Tool-Advanced-Dataset-Viewer
YouTube Demo: https://youtu.be/s50oPjmyJ1w
Let's build better datasets together.