Bradston Henry

How I Saved Myself Hours Using Python, Google Gemini, & Meta Llama to Create a Time Saving Script

I have been editing audio since I was in high school using the freeware software tool Audacity, and there is ONE thing that has consistently driven me mad since I started editing audio.

The simple task of removing small areas of "empty space" within an audio waveform.

To take an audio file that looks like this:

[Image: a waveform dotted with small gaps of "empty space"]

And transform it into this:

[Image: the same waveform with the empty space removed]

In all honesty, removing "empty space" from an audio waveform is relatively easy: simply highlight the area, press the delete key on your keyboard, and you're done!

[Image: highlighting and deleting a section of empty space]

In a short audio file that's only a few minutes long, it's at most a minor inconvenience. But in a long-form audio file, like a podcast recording or a voice-over session, it's much more than tedious; it's a truly brain-melting process.

Think about it this way:

On average, it takes me about 5 seconds to locate, identify, select, and delete an area of "empty space" in a typical audio recording waveform.

Let's say that in a 1-minute-long audio clip there are approximately 8 empty spaces. So for a 1-minute audio clip it would take...

5 seconds * (8 edits per minute * 1 minute) = 40 seconds

...approximately 40 seconds to edit out all empty spaces.

Not too bad, right?

Now imagine needing to edit an audio file that is approximately 30 minutes long.

That would take...

5 seconds * (8 edits per minute * 30 minutes) = 1,200 seconds = 20 minutes

...approximately 20 minutes to edit.

I cannot express in words how "not fun" 20 minutes of non-stop deleting of empty spaces in an audio file happens to be. Maybe someone out there enjoys it, but I, personally, do not.

And recently, after doing some planning, I realized that I would need to do this very task for at least 60 audio tracks, each ranging from 15-30 minutes in length, in the relatively near future. (At 40 seconds of deleting per minute of audio, that works out to somewhere between 10 and 20 hours of pure empty-space removal.)

When that became clear to me, a deep and powerful dread came upon me.

[Gif: a wave of dread (Bleach, via Tenor)]

Searching for a Hero to Save the Day

From the moment I realized how much work was coming my way, I knew I had to find a way to optimize my audio editing routine. Knowing the power of AI and some of the tools already out there, I figured I could find something that could take care of this undesirable task "auto-magically".

So I began my search...

Because I moonlight as a podcaster and often research podcasting tools, I was aware of the podcasting platform Descript and have used it for one of the podcasts I help co-host. Descript is packed with tools that help with recording and editing, and I remembered reading about its ability to use AI to remove "umms and uhhs" and unwanted spaces between words.

So, I decided to try it out.

[Image: trying out Descript]

It kinda worked, but I found myself struggling with the UI and the lack of control over the audio files. It also seemed to arbitrarily delete sections of audio that were valuable to me, and I didn't trust that I could use it on very long files without re-checking its work, which more or less defeated the point of using it.

So I decided to move on.

Following that, I did some basic Google research and found other "AI-powered" tools, but they either misrepresented their capabilities or required expensive subscriptions just to get access, which was a non-starter.

So... after some frustration and wasted time, I decided to turn to my old friends, Ollama and Meta Llama.

Llamas to the Rescue

So I asked Llama a pretty simple question:

Do you know of any ai powered tools that can remove word gaps or umms and ahhs?

This was Llama's response:

Yes, there are several AI-powered tools that can help remove word gaps, umms, and ahhs from audio recordings. Here
are a few examples:

1. **Rev.com**: Rev.com is an online platform that uses AI-powered speech recognition technology to transcribe and
edit audio files. They offer a feature called "Automatic Speech Reproduction" (ASR) that can help remove filler
words like "umms" and "ahhs".
2. **TranscribeMe**: TranscribeMe is another online platform that offers AI-powered transcription services. They
use a combination of machine learning algorithms to detect and remove filler words, including word gaps.
3. **Speechmatics**: Speechmatics is an AI-powered speech recognition platform that can help remove filler words

.....

I took a few moments to investigate some of the tools Llama suggested, and though none of them were up to the task, Llama reminded me that, at the end of the day, all of these sites and tools were just built with code.

And more importantly, that I could write code, and that I might be able to do exactly what I needed without a pre-built tool's help.

A Developer Enters the Chat

As a developer, I've had the opportunity to work with a few audio frameworks for mobile and web applications, so I was confident that I could find a framework that would allow me to accomplish my task with a little elbow grease.

And as I geared up to write some code and functionality from scratch, I recalled the lesson I learned not too long ago: "AI is a developer's friend".

But this time, instead of turning to Llama for assistance, I turned to another generative AI I had become pretty good friends with: Gemini. I went with Gemini in particular because I have had some solid code generation outcomes from it in the past.

So, I booted up Gemini and immediately asked:

Are there any programming frameworks good at evaluating audio (e.g. volume level, length, etc)?

Gemini quickly responded with some suggestions:

Yes, there are several programming frameworks well-suited for audio evaluation tasks like measuring volume level, length, and more. Here are a few popular options:

Python-based Frameworks:

- LibROSA: A powerful library for music and audio analysis. It offers a wide range of tools for feature extraction, classification, and signal processing.
- PyDub: A simple library for manipulating audio with Python. It's great for basic tasks like trimming, concatenating, and analyzing audio files.
- SoundFile: A library for reading and writing audio files in various formats. It can be used to extract information about audio files, such as duration and sample rate.

Other Frameworks:

- FFmpeg: A versatile command-line tool and library for handling video and audio. It can be used to analyze audio streams and extract metadata like duration and bitrate.
- MATLAB: A powerful mathematical computing environment that includes tools for audio signal processing and analysis.

I have personal experience with both FFmpeg and MATLAB and felt that they alone wouldn't be what I needed, but in its response Gemini included a quick example of some PyDub code, so I decided to go ahead and check out PyDub to see how it worked.

from pydub import AudioSegment

def analyze_audio(file_path):
    audio = AudioSegment.from_file(file_path)

    # Get duration in milliseconds
    duration = len(audio)

    # Get RMS amplitude (a measure of volume)
    rms = audio.rms

    return duration, rms

# Example usage
file_path = "your_audio_file.mp3"
duration, rms = analyze_audio(file_path)
print("Duration:", duration, "ms")
print("RMS amplitude:", rms)
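A side note from poking around the library: PyDub also exposes volume directly in decibels through its dBFS property, which is handier than raw RMS when you want to reason about thresholds. A quick sketch of my own (not from Gemini's reply):

from pydub import AudioSegment

audio = AudioSegment.from_file("your_audio_file.mp3")

# dBFS is measured relative to the loudest possible sample:
# 0 dBFS is full scale, and quieter audio is increasingly negative
print("Loudness:", audio.dBFS, "dBFS")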

I chatted a bit longer with Gemini about tools and possible approaches before finally deciding to ask it directly how to code the exact solution to my problem:

Now, Could you create code using Pydub that identifies sections of low volume or silence of a specific length and replaces it with silence of a different length, then outputs a new audio file?

Gemini responded with this code, which on first inspection looked like it might be what I needed:

from pydub import AudioSegment

def replace_silent_sections(input_file, output_file, silence_threshold=-16, min_silence_len=1000, replacement_silence_len=500):
    """
    Replaces silent sections in an audio file with silence of a different length.

    Args:
        input_file (str): Path to the input audio file.
        output_file (str): Path to the output audio file.
        silence_threshold (int, optional): RMS amplitude threshold for silence. Defaults to -16 dBFS.
        min_silence_len (int, optional): Minimum length of a silent segment to be replaced, in milliseconds. Defaults to 1000 ms.
        replacement_silence_len (int, optional): Length of the replacement silence, in milliseconds. Defaults to 500 ms.
    """

    audio = AudioSegment.from_file(input_file)

    # Split the audio into 100ms segments
    segments = []
    for i in range(0, len(audio), 100):
        segments.append(audio[i:i+100])

    # Identify and replace silent segments
    non_silent_segments = []
    current_silent_segment = None
    for segment in segments:
        if segment.dBFS < silence_threshold:  # compare in dBFS, not raw RMS, so the -16 dBFS default is meaningful
            if current_silent_segment is None:
                current_silent_segment = segment
            else:
                current_silent_segment += segment
        else:
            if current_silent_segment is not None:
                if len(current_silent_segment) >= min_silence_len:
                    # Replace with silence of a different length
                    replacement_silence = AudioSegment.silent(duration=replacement_silence_len)
                    non_silent_segments.append(replacement_silence)
                else:
                    # Add the original silent segment
                    non_silent_segments.append(current_silent_segment)
                current_silent_segment = None
            non_silent_segments.append(segment)

    # Concatenate the segments
    output_audio = sum(non_silent_segments)
    output_audio.export(output_file, format="mp3")

# Example usage:
input_file = "input.mp3"
output_file = "output_replaced_silence.mp3"
replace_silent_sections(input_file, output_file)

BUT shortly after testing out the code, I ran into a problem that is often one of the weaknesses of asking AI tools to help with code generation...

PyDub, though a possible option for solving my problem, had not been updated in several years, and some of its dependencies had been deprecated in newer versions of Python (most notably the standard library's audioop module, which was removed outright in Python 3.13).

And though PyDub was a dead end, it prepared me to be a bit more discerning about the next framework I investigated and helped me narrow my efforts to libROSA.

Using a similar prompt as before, I asked Gemini:

Now, Could you create code using LibROSA that identifies sections of low volume or silence of a specific length and replaces it with silence of a different length, then outputs a new audio file?

To save some time, I will summarize what happened after getting a response from Gemini.

Gemini provided some workable code that had all the requirements I "thought" I needed to automate my task. I asked Gemini some follow-up questions to clarify the intent behind its code and, coincidentally, received some clarification on an audio editing concept I had never fully grasped:

is that RMS value in dB?

Gemini's Response:

No, the RMS value in the code is not directly in dB. It's a normalized value between 0 and 1, where 0 represents complete silence and values closer to 1 indicate louder segments.

To convert this normalized RMS value to a dB scale, you would typically use a logarithmic function. However, in the context of silence detection, comparing the normalized RMS value to a threshold is sufficient.

The silence_threshold is also a normalized value, and comparing it to the calculated RMS values allows us to identify segments that are significantly quieter than the average.
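To make Gemini's explanation concrete, here's a tiny sketch of my own (not from the chat) of the conversion it's describing: a normalized RMS value r maps to 20 * log10(r) dBFS.

import math

def rms_to_dbfs(rms: float) -> float:
    """Convert a normalized RMS value (0.0 to 1.0) into dBFS."""
    if rms <= 0:
        return float("-inf")  # true digital silence has no finite dB value
    return 20 * math.log10(rms)

print(rms_to_dbfs(1.0))   # 0.0 dBFS (full scale)
print(rms_to_dbfs(0.01))  # -40.0 dBFS (the neighborhood a silence threshold lives in)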

As I investigated the code and gained a better understanding of the framework, I realized I didn't really need exactly what I had been asking for in my previous prompt.

So, I updated my prompt to this:

Now, Could you create code using LibROSA that splices out sections of low volume or silence of a specific length?

And Gemini responded with the code that would eventually save me HOURS of editing time in the near and long term:

import librosa
import numpy as np
import soundfile as sf

def splice_silent_sections(input_file, output_file, silence_threshold=0.01, min_silence_len=1000, sr=22050):
    """
    Splices out silent sections from an audio file.

    Args:
        input_file (str): Path to the input audio file.
        output_file (str): Path to the output audio file.
        silence_threshold (float, optional): RMS amplitude threshold for silence. Defaults to 0.01.
        min_silence_len (int, optional): Minimum length of a silent segment to be spliced out, measured in RMS frames (hop_length samples each). Defaults to 1000 frames.
        sr (int, optional): Sample rate of the audio. Defaults to 22050 Hz.
    """

    # Load the audio file
    y, sr = librosa.load(input_file, sr=sr)

    # Calculate the RMS amplitude of each frame
    frame_size = 1024
    hop_length = 512
    rms = librosa.feature.rms(y=y, frame_length=frame_size, hop_length=hop_length)[0]

    # Identify silent segments
    silent_segments = []
    current_silent_segment = []
    for i, r in enumerate(rms):
        if r < silence_threshold:
            current_silent_segment.append(i)
        else:
            if len(current_silent_segment) >= min_silence_len:
                silent_segments.append(current_silent_segment)
            current_silent_segment = []

    # Splice out silent segments
    spliced_audio = []
    last_end = 0
    for segment in silent_segments:
        start_sample = segment[0] * hop_length
        end_sample = (segment[-1] + 1) * hop_length
        spliced_audio.extend(y[last_end:start_sample])
        last_end = end_sample
    spliced_audio.extend(y[last_end:])

    # Write the spliced audio to a new file
    sf.write(output_file, np.array(spliced_audio), sr)

# Example usage:
input_file = "input.wav"
output_file = "output_spliced.wav"
splice_silent_sections(input_file, output_file)

Gemini also gave me some context for its code:

This code works by:

1. Loading the audio file: Uses librosa.load() to load the audio file into a NumPy array y and its sample rate sr.
2. Calculating RMS amplitude: Calculates the RMS amplitude of each frame using librosa.feature.rms().
3. Identifying silent segments: Iterates through the RMS values, identifying segments that are below the silence_threshold and longer than the min_silence_len.
4. Splicing out silent segments: Iterates over the silent segments, extracting the audio portions before and after each segment and adding them to a new list spliced_audio.
5. Writing the spliced audio: Concatenates the spliced audio segments into a new NumPy array and writes it to a new file using soundfile.write().

This code effectively removes the silent sections from the audio file, resulting in a new file with only the non-silent portions.
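One detail worth calling out before you tune the parameters: min_silence_len is counted in RMS frames, not raw samples or milliseconds, and each frame advances by hop_length samples. A small helper of my own (not part of Gemini's response) translates a human-friendly duration into frames; at the defaults above (sr=22050, hop_length=512), each frame covers roughly 23 ms of audio.

def ms_to_rms_frames(ms: float, sr: int = 22050, hop_length: int = 512) -> int:
    """Convert a duration in milliseconds into the equivalent number of RMS frames."""
    return int((ms / 1000) * sr / hop_length)

# e.g. only splice out silences of at least half a second:
print(ms_to_rms_frames(500))  # -> 21 frames at the defaults above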

The End of Music

Though getting to my solution wasn't exactly a straight path, in the course of an hour or two, I was able to use Llama, Gemini and Python to create a piece of code that did EXACTLY what I needed it to do.

It is hard to overstate how efficient this process was, from the initial research all the way through to the final working code.

In a non-AI-infused development workflow, I would have had to discover much of this information about tools and code frameworks through the traditional means of development: searching Google, digging through Stack Overflow, trawling obscure blogs for framework documentation, and then rinsing and repeating until hopefully finding the right resources and tools to solve my problem.

Though I'm confident I could have developed a solution through that "older style" process, in a pre-AI world I imagine this whole endeavor would have taken me a few days to complete.

This workflow is an absolute boon to developers like myself who are often looking for novel ways to solve obscure problems. And I hope this little tale has inspired you to go on an AI-powered journey of your own!


For those interested, here is the code I ended up using in my audio editing workflow. (Note: this is a basic implementation. I plan to add some additional features to it in the future, so please feel free to do the same.) And if you're curious where I've taken this code since its initial implementation, check out my personal project around automating YouTube content creation on my GitHub:

import librosa
import numpy as np
import soundfile as sf

input_file_name = "your-file-name" # Insert name of file to trim silence
input_file = input_file_name + ".wav" # Change file type if needed (e.g. ".wav", ".mp3")

def splice_silent_sections(input_file, output_file, silence_threshold_rms=0.001, min_silence_len=10, sample_rate=44100):
    """
    Splices out silent/low volume sections from an audio file.

    Args:
        input_file (str): Path to the input audio file.
        output_file (str): Path to the output audio file.
        silence_threshold_rms (float, optional): RMS amplitude threshold for silence. Defaults to 0.001.
        min_silence_len (int, optional): Minimum length of a silent segment to be spliced out, measured in RMS frames (hop_length samples each). Defaults to 10 frames.
        sample_rate (int, optional): Sample rate to load the audio at. Defaults to 44100 Hz.
    """

    # Load the audio file
    y, sr = librosa.load(input_file, sr=sample_rate)

    # Calculate the RMS amplitude of each frame
    frame_size = 1024
    hop_length = 512
    rms = librosa.feature.rms(y=y, frame_length=frame_size, hop_length=hop_length)[0]

    # Identify silent segments
    silent_segments = []
    current_silent_segment = []
    for i, r in enumerate(rms):
        if r < silence_threshold_rms:
            current_silent_segment.append(i)
        else:
            if len(current_silent_segment) >= min_silence_len:
                silent_segments.append(current_silent_segment)
            current_silent_segment = []

    # Catch a trailing silent stretch that runs to the end of the file
    if len(current_silent_segment) >= min_silence_len:
        silent_segments.append(current_silent_segment)

    # Splice out silent segments
    spliced_audio = []
    last_end = 0
    for segment in silent_segments:
        start_sample = segment[0] * hop_length
        end_sample = (segment[-1] + 1) * hop_length
        spliced_audio.extend(y[last_end:start_sample])
        last_end = end_sample
    spliced_audio.extend(y[last_end:])

    # Write the spliced audio to a new file
    sf.write(output_file, np.array(spliced_audio), sr)

# Example usage:
output_file = input_file_name + "_spliced.wav" # Append "_spliced" to the file name for clarity
splice_silent_sections(input_file, output_file)
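And since my real workload is dozens of tracks, the natural next step is a thin batch wrapper around that function. Here's a sketch (assuming all the recordings sit in a single folder as .wav files; the "recordings" folder name is made up for illustration):

from pathlib import Path

# Hypothetical folder holding the recordings to process in one pass
audio_dir = Path("recordings")

for wav_path in sorted(audio_dir.glob("*.wav")):
    out_path = wav_path.with_name(wav_path.stem + "_spliced.wav")
    splice_silent_sections(str(wav_path), str(out_path))
    print(f"Trimmed {wav_path.name} -> {out_path.name}")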

Photo Credits(Order of Appearance):

Cover Photo: Dragon Ball Z (FUNimation Productions, Ltd.)

Bleach gif via Tenor


Follow me on my Socials:
https://linktr.ee/bradstondev
