I want to talk about making computers understand and work with sound. Think about your favorite song, a podcast, or the voice command you give to a smart speaker. All of that is just a pattern of pressure waves in the air. My job, in audio and signal processing, is to take those complex patterns and teach a computer to read, analyze, and even create them. Python is my tool of choice for this, and I'm going to show you how I use it, one clear step at a time. Forget complicated jargon; I'll explain it as if we're looking at a picture of sound together.
First, we need to get the sound into the computer. An audio file is like a container. Inside it, sound is stored as a long list of numbers. Each number is a snapshot, or a sample, of the air pressure at a tiny moment in time. The number of snapshots taken per second is the sample rate. A common one is 44,100 samples per second, which is what's used on CDs.
Let me show you how I open these containers with code. We'll use a library called soundfile to read a file and librosa for some helpful utilities.
```python
import soundfile as sf
import librosa
import numpy as np

# Let's say I have a file called 'guitar_riff.wav'
file_path = 'guitar_riff.wav'

# Method 1: Using soundfile for direct control
data, samplerate = sf.read(file_path)
print(f"Loaded with soundfile. Shape: {data.shape}, Sample rate: {samplerate} Hz")

# Method 2: Using librosa for consistency (it resamples to a common rate easily)
data_librosa, sr_librosa = librosa.load(file_path, sr=22050, mono=True)  # Force to 22050 Hz, one channel
print(f"Loaded with librosa. Shape: {data_librosa.shape}, Sample rate: {sr_librosa} Hz")
```
When I run this, soundfile reports a shape like (220500, 2). This means my audio has 220,500 samples and 2 channels (left and right stereo): five seconds at 44,100 Hz. The librosa version comes back as (110250,), just one long list of numbers, because mono=True mixed the two channels down and sr=22050 resampled those five seconds to half as many samples. This is my raw material. To hear it, I can save it back out.
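That stereo-to-mono mixdown is nothing mysterious: it's just an average of the two channels, which is essentially what librosa does internally. Here's a quick sketch with a synthetic stereo array (the 440 Hz and 660 Hz tones are arbitrary stand-ins for real left and right channels):

```python
import numpy as np

# A hypothetical 1-second stereo signal at 22050 Hz with different
# content in each channel
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
left = 0.5 * np.sin(2 * np.pi * 440 * t)
right = 0.5 * np.sin(2 * np.pi * 660 * t)
stereo = np.stack([left, right], axis=1)  # shape (22050, 2), like sf.read output

# Mixing down to mono: average the channels sample by sample
mono = stereo.mean(axis=1)                # shape (22050,)
print(stereo.shape, mono.shape)
```

Averaging (rather than summing) keeps the mono signal in the same amplitude range as the originals, so nothing clips.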
```python
# Let's save a modified version. First, I'll make a simple copy.
output_path = 'guitar_riff_copy.wav'
sf.write(output_path, data_librosa, sr_librosa)
print(f"Saved a mono copy to {output_path}")

# What if I want to convert it to an MP3? I might use pydub.
# (pydub relies on ffmpeg being installed for MP3 encoding.)
from pydub import AudioSegment

sound = AudioSegment.from_wav('guitar_riff.wav')
sound.export('guitar_riff.mp3', format="mp3", bitrate="192k")
print("Converted to MP3 format.")
```
Now we have the sound as numbers. The first thing I often do is look at it in the time domain. This is the most straightforward view. I just plot the numbers in the order they were recorded. The x-axis is time, and the y-axis is the amplitude—how much the speaker cone would move.
Looking at this plot tells me a story. I can see where the loud drum hits are (big spikes), the quiet parts (flat lines near zero), and the rhythm of the music. I calculate some basic things to understand this story better. One useful measure is the amplitude envelope. It smooths out the rapid wiggles of the sound wave and shows me just the overall loudness over time, like tracing the outline of the waveform's peaks.
```python
import matplotlib.pyplot as plt
from scipy.ndimage import uniform_filter1d

def plot_waveform_and_envelope(audio, sr, title="Audio Waveform"):
    """
    Plots the raw waveform and its amplitude envelope.
    """
    time = np.arange(len(audio)) / sr  # Create time axis in seconds

    # Calculate a simple amplitude envelope: a moving average of the
    # rectified (absolute-value) signal over a 10 ms window
    window_size = int(0.01 * sr)
    envelope = uniform_filter1d(np.abs(audio), size=window_size, mode='nearest')

    # Create the plot
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))
    ax1.plot(time, audio, alpha=0.6, linewidth=0.5, label='Raw Signal')
    ax1.set_ylabel('Amplitude')
    ax1.set_title(f'{title} - Raw Waveform')
    ax1.grid(True, alpha=0.3)
    ax1.legend()

    ax2.plot(time, envelope, 'r-', linewidth=1.5, label='Amplitude Envelope')
    ax2.fill_between(time, envelope, alpha=0.3, color='red')
    ax2.set_xlabel('Time (seconds)')
    ax2.set_ylabel('Smoothed Amplitude')
    ax2.set_title('Amplitude Envelope')
    ax2.grid(True, alpha=0.3)
    ax2.legend()

    plt.tight_layout()
    plt.show()
    return envelope

# Use it on our loaded audio
envelope = plot_waveform_and_envelope(data_librosa, sr_librosa, "Guitar Riff")
```
Another simple but powerful time-domain feature is the Zero-Crossing Rate (ZCR). It counts how many times the signal crosses the zero-amplitude line per second. Why do I care? Think of the difference between a sustained violin note (smooth, few zero crossings) and a snare drum hit (noisy, jagged, many zero crossings). ZCR is a quick way to guess if a sound is more tonal or more noisy.
```python
def calculate_zcr(audio, sr, frame_length=1024, hop_length=512):
    """
    Calculates the Zero-Crossing Rate over short frames of audio.
    """
    zcr_list = []
    num_frames = 1 + (len(audio) - frame_length) // hop_length
    for i in range(num_frames):
        start = i * hop_length
        frame = audio[start:start + frame_length]
        # Count sign changes. A zero crossing happens when neighboring
        # samples have different signs.
        signs = np.sign(frame)
        zero_crossings = np.sum(np.abs(np.diff(signs))) / 2
        zcr_per_second = zero_crossings / (frame_length / sr)  # Normalize to Hz
        zcr_list.append(zcr_per_second)
    times = np.arange(num_frames) * (hop_length / sr)
    return times, np.array(zcr_list)

# Let's calculate and plot it
frame_duration = 0.023  # 23 ms is a common frame size
frame_length = int(frame_duration * sr_librosa)
hop_length = frame_length // 2  # 50% overlap
zcr_times, zcr_values = calculate_zcr(data_librosa, sr_librosa, frame_length, hop_length)

plt.figure(figsize=(12, 4))
plt.plot(zcr_times, zcr_values, 'g-', linewidth=1.5)
plt.xlabel('Time (seconds)')
plt.ylabel('Zero-Crossing Rate (Hz)')
plt.title('Zero-Crossing Rate Over Time')
plt.grid(True, alpha=0.3)
plt.show()
```
When I run this on a piece of music, I can see the ZCR jump during percussive sections and stay lower during melodic parts. It's a simple number that carries a lot of information.
So far, we've only looked at sound as it changes over time. This is natural for us. But to understand the pitch or the timbre—why a piano sounds different from a flute playing the same note—I need a different view. I need to see which frequencies are present. This is where the frequency domain comes in, and the key to getting there is the Fourier Transform.
The most common algorithm used is the Fast Fourier Transform (FFT). Think of it this way: the time-domain view is like looking at the ingredients for a soup all mixed together. The FFT is like a magic filter that separates the carrots, the onions, and the broth, showing me exactly how much of each is in there. In sound, the "ingredients" are frequencies.
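Before moving on, here's that soup-separating idea as a single FFT on a signal where we know the answer: a pure 440 Hz sine. (The one-second duration and 440 Hz are arbitrary choices for the demo.)

```python
import numpy as np

# One second of a 440 Hz sine at 22050 Hz: which "ingredient" does the FFT find?
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

spectrum = np.fft.rfft(tone)                  # one-sided FFT for real-valued signals
freqs = np.fft.rfftfreq(len(tone), d=1 / sr)  # the frequency each FFT bin represents
peak_freq = freqs[np.argmax(np.abs(spectrum))]
print(f"Dominant frequency: {peak_freq:.1f} Hz")
```

With exactly one second of audio, each FFT bin is 1 Hz wide, so the peak lands right on 440 Hz.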
A single FFT gives me a snapshot of the frequencies present in a short slice of time. To see how frequencies change over the whole duration of a sound, I use the Short-Time Fourier Transform (STFT). It's just like taking many FFTs in a row, each on a small, overlapping window of the audio. The result is called a spectrogram—a heat map of sound. Time runs left to right, frequency runs bottom to top, and the color shows how strong each frequency is at each moment.
```python
import librosa.display  # specshow lives in a submodule that needs an explicit import

def create_spectrogram(audio, sr, n_fft=2048, hop_length=512):
    """
    Creates and plots a spectrogram using the STFT.
    """
    # Calculate the STFT
    D = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    # Convert the complex values to magnitude (amplitude)
    magnitude = np.abs(D)
    # Convert amplitude to decibels for a better visual range
    dB = librosa.amplitude_to_db(magnitude, ref=np.max)

    # Plot the spectrogram
    plt.figure(figsize=(14, 6))
    librosa.display.specshow(dB, sr=sr, hop_length=hop_length,
                             x_axis='time', y_axis='log', cmap='magma')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Spectrogram (Log Frequency Scale)')
    plt.tight_layout()
    plt.show()
    return magnitude, dB, D

# Generate a simple test signal to see clear frequencies:
# a chirp that sweeps from low to high frequency
duration = 3.0
sr_test = 22050
test_signal = 0.5 * librosa.chirp(fmin=200, fmax=2000, sr=sr_test, duration=duration)
magnitude, dB, D_complex = create_spectrogram(test_signal, sr_test)
```
When you run this, you'll see a bright diagonal line sweeping upward. That's our chirp! For real music, a spectrogram is breathtaking. You can see the horizontal lines of steady notes (constant frequency), the vertical smears of drum hits (all frequencies at once), and the unique textures of different instruments.
The spectrogram is my most used tool. It helps me find the fundamental frequency of a voice (pitch tracking), see the harmonics that give an instrument its character, and identify problems like background hum (a thin, steady horizontal line).
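Pitch tracking itself can be sketched in a few lines. Real trackers (librosa ships yin and pyin) are far more robust, but the core idea, finding the lag at which a frame best repeats itself, fits in a toy autocorrelation example. The 220 Hz test tone and the 2048-sample frame here are arbitrary choices:

```python
import numpy as np

# A toy pitch estimator: autocorrelation on one short frame
sr = 22050
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 220 * t)

# Autocorrelation: how similar is the frame to a shifted copy of itself?
ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
ac[:int(sr / 1000)] = 0      # ignore implausibly short lags (pitches above 1000 Hz)
period = np.argmax(ac)       # lag (in samples) of the strongest repetition
f0 = sr / period             # convert the period to a frequency
print(f"Estimated pitch: {f0:.1f} Hz")
```

The strongest repetition shows up at a lag of about 100 samples, giving an estimate close to the true 220 Hz.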
Sometimes, I don't want to analyze sound; I want to change it. A fundamental operation is filtering. Think of it like the bass and treble knobs on a stereo. A low-pass filter lets low frequencies through but muffles the highs. A high-pass filter does the opposite. An equalizer is just a collection of filters.
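Here's what that treble knob looks like in code: a minimal sketch of a Butterworth low-pass with scipy. The 1 kHz cutoff, the 4th-order design, and the two test tones are all arbitrary choices for the demo:

```python
import numpy as np
from scipy import signal

# Mix a 200 Hz tone (should pass) with a 5 kHz tone (should be muffled)
sr = 22050
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 200 * t)
high_tone = np.sin(2 * np.pi * 5000 * t)
mixed = low_tone + high_tone

# 4th-order Butterworth low-pass with a 1 kHz cutoff, in stable SOS form
sos = signal.butter(4, 1000, btype='low', fs=sr, output='sos')
filtered = signal.sosfiltfilt(sos, mixed)  # forward-backward filtering: zero phase shift

rms_before = np.sqrt(np.mean(mixed ** 2))
rms_after = np.sqrt(np.mean(filtered ** 2))
print(f"RMS before: {rms_before:.3f}, after: {rms_after:.3f}")
```

The RMS level drops from about 1.0 to about 0.71 (the level of the low tone alone), showing the 5 kHz component has been almost entirely removed while the 200 Hz tone passes untouched.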
Let's say I have a recording with an annoying, constant 60 Hz electrical hum (from AC power). I can build a filter to remove it. A common and effective design is a band-stop or notch filter.
```python
from scipy import signal

def apply_notch_filter(audio, sr, notch_freq=60.0, quality_factor=30.0):
    """
    Applies a notch filter to remove a specific frequency and its harmonics.
    """
    # Design the notch filter
    b, a = signal.iirnotch(notch_freq, quality_factor, sr)
    # Apply it. filtfilt runs the filter forward and backward,
    # which avoids phase distortion.
    filtered_audio = signal.filtfilt(b, a, audio)

    # Hum usually brings harmonics along, so notch those out too
    for harmonic in [2 * notch_freq, 3 * notch_freq]:
        if harmonic < sr / 2:  # Stay below the Nyquist limit
            b_h, a_h = signal.iirnotch(harmonic, quality_factor, sr)
            filtered_audio = signal.filtfilt(b_h, a_h, filtered_audio)
    return filtered_audio

# Let's create a synthetic signal with hum to test it
t = np.linspace(0, 5.0, int(5.0 * sr_librosa))
clean_signal = 0.3 * np.sin(2 * np.pi * 440 * t)  # A 440 Hz tone
hum_noise = 0.2 * np.sin(2 * np.pi * 60 * t)      # A 60 Hz hum
noisy_signal = clean_signal + hum_noise
filtered_signal = apply_notch_filter(noisy_signal, sr_librosa, notch_freq=60.0)
```
```python
# Plot the results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
ax1.plot(t[:2000], noisy_signal[:2000], 'b-', alpha=0.7, label='Noisy Signal (with 60 Hz hum)')
ax1.plot(t[:2000], filtered_signal[:2000], 'r--', linewidth=2, label='Filtered Signal')
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude')
ax1.set_title('Time Domain: Before and After Filtering (Zoomed In)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot the frequency content before and after
freqs_before, psd_before = signal.welch(noisy_signal, sr_librosa, nperseg=1024)
freqs_after, psd_after = signal.welch(filtered_signal, sr_librosa, nperseg=1024)
ax2.semilogy(freqs_before, psd_before, 'b-', alpha=0.6, label='Before Filtering')
ax2.semilogy(freqs_after, psd_after, 'r-', label='After 60 Hz Notch Filter')
ax2.axvline(x=60, color='k', linestyle=':', label='Notch at 60 Hz')
ax2.axvline(x=440, color='g', linestyle=':', label='Desired tone at 440 Hz')
ax2.set_xlabel('Frequency (Hz)')
ax2.set_ylabel('Power Spectral Density')
ax2.set_title('Frequency Domain: Effect of Notch Filter')
ax2.set_xlim([0, 1000])
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
When you run this, you'll see in the frequency plot that the spike at 60 Hz is almost gone, while the spike at our desired 440 Hz tone remains strong. Filtering is powerful for cleaning up audio.
Now let's talk about something fascinating: breaking sound apart. One advanced technique is source separation. The classic example is the "cocktail party problem": how do you focus on one person's voice in a noisy room? Full separation of vocals and instruments is mostly done with dedicated deep-learning tools these days, but librosa includes classic signal-processing decompositions that get you surprisingly far.
A simpler, but very useful, form of separation is harmonic-percussive source separation (HPSS). It tries to split an audio signal into two parts: one containing the harmonic, tonal elements (like melody, vocals, strings), and one containing the percussive, transient elements (like drums, claps).
```python
def separate_harmonic_percussive(audio, sr, margin=3.0):
    """
    Separates an audio signal into harmonic and percussive components.
    """
    # Let's use librosa's built-in HPSS
    audio_harmonic, audio_percussive = librosa.effects.hpss(audio, margin=margin)

    # Let's visualize what we've separated
    fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

    # Original spectrogram
    D_orig = librosa.stft(audio)
    S_db_orig = librosa.amplitude_to_db(np.abs(D_orig), ref=np.max)
    librosa.display.specshow(S_db_orig, y_axis='log', x_axis='time', sr=sr, ax=axes[0], cmap='magma')
    axes[0].set(title='Original Signal Spectrogram', ylabel='Frequency')
    axes[0].label_outer()

    # Harmonic spectrogram
    D_harm = librosa.stft(audio_harmonic)
    S_db_harm = librosa.amplitude_to_db(np.abs(D_harm), ref=np.max)
    librosa.display.specshow(S_db_harm, y_axis='log', x_axis='time', sr=sr, ax=axes[1], cmap='magma')
    axes[1].set(title='Harmonic Component (Tonal)', ylabel='Frequency')
    axes[1].label_outer()

    # Percussive spectrogram
    D_perc = librosa.stft(audio_percussive)
    S_db_perc = librosa.amplitude_to_db(np.abs(D_perc), ref=np.max)
    img = librosa.display.specshow(S_db_perc, y_axis='log', x_axis='time', sr=sr, ax=axes[2], cmap='magma')
    axes[2].set(title='Percussive Component (Rhythmic)', xlabel='Time', ylabel='Frequency')

    fig.colorbar(img, ax=axes, format="%+2.f dB")
    plt.show()
    return audio_harmonic, audio_percussive

# Use a short segment of our audio for clarity
# (this assumes the loaded file is at least 10 seconds long)
segment = data_librosa[int(5 * sr_librosa):int(10 * sr_librosa)]
harmonic, percussive = separate_harmonic_percussive(segment, sr_librosa)
```
In the resulting plots, you'll likely see that the harmonic component has strong horizontal lines (sustained notes), while the percussive component has vertical streaks (drum hits). I use this to create instrumental or "a cappella" style versions of songs, or to analyze the rhythm separately from the melody.
Speaking of rhythm, beat and tempo tracking is a whole other world. I remember trying to write a program to clap along with a song—it's harder than it sounds! The computer has to listen to the energy in different frequency bands and find the regular pulses. librosa makes this surprisingly accessible.
```python
def analyze_rhythm(audio, sr):
    """
    Estimates tempo, detects beats, and visualizes the pulse.
    """
    # First, get a feature that highlights rhythmic activity: the onset strength envelope
    onset_env = librosa.onset.onset_strength(y=audio, sr=sr, aggregate=np.median)

    # Estimate the global tempo (beats per minute) and the beat positions
    tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
    # Depending on the librosa version, tempo may be a scalar or a one-element array
    tempo = float(np.atleast_1d(tempo)[0])
    print(f"Estimated tempo: {tempo:.2f} BPM")

    # Convert frame numbers of beats to time in seconds
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    # Also get a tempogram (how the local tempo changes over time)
    tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr, hop_length=512)

    # Plot everything
    fig, axes = plt.subplots(3, 1, figsize=(14, 10))

    # Waveform with beat markers
    time_wave = np.arange(len(audio)) / sr
    axes[0].plot(time_wave, audio, alpha=0.5)
    axes[0].vlines(beat_times, -audio.max(), audio.max(), color='r', linestyle='--', alpha=0.8, label='Detected Beats')
    axes[0].set_xlabel('Time (s)')
    axes[0].set_ylabel('Amplitude')
    axes[0].set_title('Waveform with Beat Markers')
    axes[0].legend(loc='upper right')
    axes[0].grid(True, alpha=0.3)

    # Onset strength envelope
    times_onset = librosa.frames_to_time(np.arange(len(onset_env)), sr=sr, hop_length=512)
    axes[1].plot(times_onset, onset_env, label='Onset Strength', linewidth=2, color='orange')
    axes[1].vlines(beat_times, 0, onset_env.max(), color='r', linestyle='--', alpha=0.8)
    axes[1].set_xlabel('Time (s)')
    axes[1].set_ylabel('Strength')
    axes[1].set_title('Onset Strength Envelope')
    axes[1].grid(True, alpha=0.3)

    # Tempogram as a heatmap
    librosa.display.specshow(tempogram, sr=sr, hop_length=512, x_axis='time', y_axis='tempo',
                             ax=axes[2], cmap='coolwarm')
    axes[2].axhline(y=tempo, color='k', linestyle='--', linewidth=2, label=f'Est. Tempo {tempo:.0f} BPM')
    axes[2].set_title('Tempogram - Local Tempo Over Time')
    axes[2].legend()

    plt.tight_layout()
    plt.show()
    return tempo, beat_times

tempo, beats = analyze_rhythm(segment, sr_librosa)
```
This code gives me a number—the beats per minute—and draws vertical red lines on the waveform where it thinks the beat is. It's not always perfect, especially with complex music, but it's a great starting point for creating visualizations that pulse with the music or for automatically segmenting a song into bars.
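A fun way to check the beats by ear is to render them as a click track and mix it over the song (librosa.clicks will do this for you). Below is a hand-rolled numpy sketch, using a made-up 120 BPM grid in place of real detections:

```python
import numpy as np

# Turn beat times into an audible click track
sr = 22050
duration = 4.0
beat_times = np.arange(0, duration, 0.5)  # a pretend 120 BPM grid: one beat every 0.5 s

click_track = np.zeros(int(sr * duration))
# A 20 ms, 1 kHz blip as the click sound
click = 0.5 * np.sin(2 * np.pi * 1000 * np.arange(int(0.02 * sr)) / sr)
for bt in beat_times:
    start = int(bt * sr)
    click_track[start:start + len(click)] += click

print(f"{len(beat_times)} clicks placed")
```

Save the result with soundfile (or add it to the original audio) and you can immediately hear whether the tracker locked onto the pulse or drifted off-beat.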
Finally, let's touch on audio synthesis. This is where we come full circle: from analyzing sound to creating it from scratch. The most basic synthesis is generating pure tones, like the beep of an alarm.
```python
from scipy import signal

def generate_simple_tone(frequency, duration, sr=44100, amplitude=0.5, wave_type='sine'):
    """
    Generates a simple synthesized tone.
    """
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    if wave_type == 'square':
        tone = amplitude * signal.square(2 * np.pi * frequency * t)
    elif wave_type == 'sawtooth':
        tone = amplitude * signal.sawtooth(2 * np.pi * frequency * t)
    else:
        tone = amplitude * np.sin(2 * np.pi * frequency * t)  # default to sine
    return tone

# Let's make a little C major chord: C4, E4, G4
sr_synth = 22050
c_freq = 261.63  # C4
e_freq = 329.63  # E4
g_freq = 392.00  # G4
c_note = generate_simple_tone(c_freq, 2.0, sr=sr_synth, amplitude=0.3)
e_note = generate_simple_tone(e_freq, 2.0, sr=sr_synth, amplitude=0.3)
g_note = generate_simple_tone(g_freq, 2.0, sr=sr_synth, amplitude=0.3)

# Mix them together
chord = c_note + e_note + g_note
# Normalize to prevent clipping (going above 1.0 / below -1.0)
chord = chord / np.max(np.abs(chord))
```
```python
# Let's add a simple amplitude envelope (fade in and out) to make it less abrupt
def apply_adr_envelope(tone, sr, attack=0.05, decay=0.1, release=0.3):
    """
    Applies a simple Attack-Decay-Release envelope with a 0.8 sustain level.
    """
    total_samples = len(tone)
    attack_samples = int(attack * sr)
    decay_samples = int(decay * sr)
    release_samples = int(release * sr)
    sustain_level = 0.8

    # Start everything at the sustain level, then shape the edges
    envelope = np.full(total_samples, sustain_level)
    # Attack: ramp up from 0 to 1
    if attack_samples > 0:
        envelope[:attack_samples] = np.linspace(0, 1, attack_samples)
    # Decay: drop from the attack peak down to the sustain level
    if decay_samples > 0:
        end_decay = min(attack_samples + decay_samples, total_samples)
        envelope[attack_samples:end_decay] = np.linspace(1, sustain_level, end_decay - attack_samples)
    # Release: fade out from the current level to 0 at the end
    if release_samples > 0:
        start_release = max(0, total_samples - release_samples)
        envelope[start_release:] = np.linspace(envelope[start_release], 0, total_samples - start_release)
    return tone * envelope

chord_smoothed = apply_adr_envelope(chord, sr_synth, attack=0.1, decay=0.2, release=0.5)

# Plot and listen
plt.figure(figsize=(12, 4))
plt.plot(chord_smoothed[:5000], alpha=0.7)  # Plot first 5000 samples
plt.title('Synthesized C Major Chord (First ~0.23 seconds)')
plt.xlabel('Sample Number')
plt.ylabel('Amplitude')
plt.grid(True, alpha=0.3)
plt.show()

# Save it to listen
sf.write('c_major_chord.wav', chord_smoothed, sr_synth)
print("Saved 'c_major_chord.wav'. Try playing it!")
```
This is a very basic example. Real synthesizers add filters, modulation, and multiple oscillators to create rich sounds. But the principle is the same: you build sound from fundamental mathematical waves.
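To give a taste of that richness, here's one small step beyond static tones: vibrato via frequency modulation, where a slow LFO (low-frequency oscillator) wobbles the pitch of a carrier. The 440 Hz carrier, 6 Hz rate, and 10 Hz depth are arbitrary, musical-ish choices:

```python
import numpy as np

sr = 22050
duration = 2.0
t = np.arange(int(sr * duration)) / sr

# The LFO: the instantaneous frequency wobbles +/-10 Hz, six times per second
lfo = 10.0 * np.sin(2 * np.pi * 6.0 * t)
# Integrate the instantaneous frequency (440 Hz plus the wobble) to get phase
phase = 2 * np.pi * np.cumsum(440.0 + lfo) / sr
vibrato_tone = 0.4 * np.sin(phase)

print(vibrato_tone.shape)
```

The key trick is integrating the frequency to get the phase; naively writing `sin(2*pi*(440+lfo)*t)` produces a far wilder sweep than intended, because the modulation gets multiplied by the growing time axis.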
From opening a file to generating a new chord, these techniques form a toolkit. I use them to clean up old recordings, visualize music for art projects, extract features to train machine learning models that recognize sounds, and sometimes just to experiment and see what interesting noises I can create. The key is to start simple: load a sound, plot it, look at its spectrogram. Listen to what happens when you change it. The numbers and graphs are just a new way to explore the world of sound that's all around us. It's a process of discovery, where each line of code brings you closer to understanding the hidden patterns in the noise.