
🎧 Expanding the GTZAN Dataset: A Journey from YouTube to Mel Spectrograms

I am currently finishing my specialization in Artificial Intelligence and Big Data, and I’ve decided to document the progress of my final project. This work integrates everything I’ve learned in the modules of AI Models (Modelos de Inteligencia Artificial) and Machine Learning Systems (Sistemas de Aprendizaje Automático).

The goal? A robust Music Genre Classifier. But as any data scientist will tell you, the model is only as good as the data. Today, I focused on building the "Data Kitchen": the pipeline that fetches, cleans, and prepares audio for training.

1. The Challenge: Expanding the GTZAN Dataset

While the GTZAN dataset is the industry standard, it lacks modern genres. To make my project unique, I wanted to include Lofi and Rap, among others.

I used yt-dlp to source high-quality audio from YouTube. However, I ran into a classic "Junior vs. Environment" boss fight: FFmpeg.

Technical Tip: Even if you install FFmpeg via Conda, Windows sometimes hides the binaries from your Python subprocesses. I solved this by explicitly setting ffmpeg_location in my script so the conversion to .wav never fails.
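To make the tip concrete, here is a minimal sketch of how yt-dlp's Python API accepts an explicit FFmpeg path. The helper name `build_ydl_opts` and the Windows path in the usage comment are my own illustrative choices; adjust the path to wherever your Conda environment keeps the binaries.

```python
from pathlib import Path

def build_ydl_opts(ffmpeg_dir: str, out_dir: str = "downloads") -> dict:
    """Options for yt_dlp.YoutubeDL that force WAV extraction
    through an explicitly located FFmpeg binary."""
    return {
        "format": "bestaudio/best",
        "outtmpl": str(Path(out_dir) / "%(title)s.%(ext)s"),
        # Point yt-dlp straight at the FFmpeg folder instead of
        # trusting the (sometimes broken) Windows PATH lookup.
        "ffmpeg_location": ffmpeg_dir,
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"}
        ],
    }

# Usage (requires: pip install yt-dlp):
# import yt_dlp
# with yt_dlp.YoutubeDL(build_ydl_opts(r"C:\tools\ffmpeg\bin")) as ydl:
#     ydl.download(["https://www.youtube.com/watch?v=..."])
```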


2. Standardizing for Machine Learning Systems

In our Machine Learning Systems module, we emphasized that consistency is key. To make my new data compatible with GTZAN, I had to "clone" its technical specifications:

  • Sample Rate: 22,050 Hz.

  • Channels: Mono.

  • Duration: Exactly 30-second segments.

I developed a script that takes a 1-hour "Lofi Study Beats" mix and slices it into perfect 30-second chunks, maintaining a strict naming convention: lofi.00000.wav, lofi.00001.wav, etc. This ensures the data is ready for bulk processing without manual intervention.
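The slicing logic itself is simple once the audio is mono at 22,050 Hz. This is a sketch of the core function (the name `slice_track` is mine, not from my actual repo); it drops any trailing audio shorter than 30 seconds and produces the GTZAN-style names described above.

```python
import numpy as np

SR = 22050        # GTZAN sample rate
CHUNK_SEC = 30    # GTZAN clip length

def slice_track(samples: np.ndarray, genre: str, start_index: int = 0):
    """Split a mono signal into exact 30 s chunks with GTZAN-style names.
    Returns (filename, chunk) pairs; a trailing partial chunk is dropped."""
    chunk_len = SR * CHUNK_SEC
    n_chunks = len(samples) // chunk_len
    out = []
    for i in range(n_chunks):
        # Zero-padded counter keeps files in order: lofi.00000.wav, lofi.00001.wav, ...
        name = f"{genre}.{start_index + i:05d}.wav"
        out.append((name, samples[i * chunk_len:(i + 1) * chunk_len]))
    return out
```

Each chunk can then be written out with a library such as soundfile, keeping the whole hour of audio processed without manual intervention.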


3. Feature Extraction: The Mel Spectrogram

For the AI Models part of the project, we aren't just "listening" to the audio—we are "seeing" it. Using librosa, I transform the raw waveforms into Mel Spectrograms.

The Mel scale is vital because it represents frequencies the way humans actually perceive them. It turns a complex audio signal into a 2D image, allowing me to use Convolutional Neural Networks (CNNs) to identify patterns, like the low-pass filters typical of Lofi or the sharp transients in Rap.

Mel Spectrogram of Lofi


4. Key Takeaways for Fellow Students

  • Relative Paths are Dangerous: When running scripts from the terminal, ../data might point to nowhere. I switched to Path(__file__).resolve() to make my project portable.

  • Data Validation: GTZAN has a famous corrupt file (jazz.00054.wav). Learning to handle these exceptions programmatically is a crucial skill I've sharpened during this project.
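Both takeaways fit in a few lines. This sketch combines them; `project_root` and `safe_load` are illustrative names, and the catch-all exception handler is deliberately broad so one bad file (like jazz.00054.wav) never kills a batch run.

```python
from pathlib import Path

def project_root() -> Path:
    """Anchor paths to the script's own location, not the shell's
    current directory, so `python scripts/prep.py` works from anywhere."""
    return Path(__file__).resolve().parent

def safe_load(path, sr: int = 22050):
    """Return the mono waveform, or None for unreadable/corrupt files."""
    try:
        import librosa
        y, _ = librosa.load(path, sr=sr, mono=True)
        return y
    except Exception as exc:
        # Log and skip instead of crashing the whole pipeline.
        print(f"Skipping {path}: {exc}")
        return None
```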


Next Steps

The pipeline is clean. The data is standardized. The next phase of my final project involves designing the CNN architecture and beginning the long-awaited training phase.

Are you a student or a pro in AI? How do you handle your audio preprocessing pipelines? Let’s discuss in the comments!
