I have been editing audio since I was in high school using the freeware tool Audacity, and there is ONE thing that has consistently driven me mad ever since I started.
The simple task of removing small areas of "empty space" within an audio waveform.
To take an audio file that looks like this:
And transform it into this:
In all honesty, removing "empty space" or areas of an audio waveform is relatively easy. Simply highlight the area and press the Delete key on your keyboard, and you're done!
In a short audio file that's only a few minutes long, it's at most a minor inconvenience. But in a long-form audio file, like a podcast recording or voice-over session, it's much more than tedious; it's a truly brain-melting process.
Think about it this way:
On average, it takes me about 5 seconds to locate, identify, select, and delete an area of "empty space" within the average audio recording waveform.
Let's say that in a 1-minute audio clip, there are approximately 8 empty spaces. So for a 1-minute audio clip it would take...
5 seconds * (8 edits per minute * 1 minute) = 40 seconds
...approximately 40 seconds to edit out all empty spaces.
Not too bad, right?
Now imagine needing to edit an audio file that is approximately 30 minutes long.
That would take...
5 seconds * (8 edits per minute * 30 minutes) = 1,200 seconds = 20 minutes
...approximately 20 minutes to edit.
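The back-of-the-envelope math above generalizes to any clip length. Here is a minimal sketch using the same rough figures (5 seconds per edit, 8 gaps per minute); the function name is made up for illustration, and the numbers are my estimates, not measurements:

```python
def estimated_edit_seconds(clip_minutes, gaps_per_minute=8, seconds_per_edit=5):
    """Rough manual editing time, in seconds, for a clip of the given length."""
    return seconds_per_edit * gaps_per_minute * clip_minutes


for minutes in (1, 30):
    total = estimated_edit_seconds(minutes)
    print(f"{minutes}-minute clip: {total} seconds (~{total / 60:.1f} minutes)")
```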
I cannot express in words how "not fun" 20 minutes of non-stop deleting of empty spaces in an audio file happens to be. Maybe someone out there enjoys it, but I, personally, do not.
And recently, after doing some planning, I realized that I would need to do this very task for at least 60 audio tracks, each ranging from 15-30 minutes in length, in the relatively near future.
When that became clear to me, a deep and powerful dread came upon me.
Searching for a Hero to Save the Day
From the moment I realized how much work was coming my way, I knew I had to find some way to optimize my audio editing routine.
Knowing the power of AI and some of the tools that already exist, I figured I could find something that could take care of this undesirable task "auto-magically".
So I began my search...
Because I moonlight as a podcaster and often research podcasting tools, I was aware of the podcasting platform Descript and have used it for one of the podcasts I co-host. Descript comes with many tools that help with recording and editing, and I remembered reading about its ability to use AI to remove "umms and uhhs" and unwanted spaces between words.
So, I decided to try it out.
It kinda worked, but I found myself struggling with the UI and the lack of control over the audio files. It also seemed to be deleting sections of audio that were valuable to me, and I didn't trust that I could use it on very long files without needing to re-check its work, which more or less defeated the purpose of using it.
So I decided to move on.
Following that, I did some basic Google research and found other "AI-powered" tools, but they either misrepresented themselves or required expensive subscriptions to get access, which was a non-starter.
So, after some frustration and wasted time, I decided to turn to my old friends, Ollama and Meta Llama.
LLamas to the Rescue
So I asked Llama a pretty simple question:
Do you know of any ai powered tools that can remove word gaps or umms and ahhs?
This was Llama's response:
Yes, there are several AI-powered tools that can help remove word gaps, umms, and ahhs from audio recordings. Here
are a few examples:
1. **Rev.com**: Rev.com is an online platform that uses AI-powered speech recognition technology to transcribe and
edit audio files. They offer a feature called "Automatic Speech Reproduction" (ASR) that can help remove filler
words like "umms" and "ahhs".
2. **TranscribeMe**: TranscribeMe is another online platform that offers AI-powered transcription services. They
use a combination of machine learning algorithms to detect and remove filler words, including word gaps.
3. **Speechmatics**: Speechmatics is an AI-powered speech recognition platform that can help remove filler words
.....
I took a few moments to investigate some of the tools that Llama suggested and though none of the tools were up to the task to do what I needed, Llama reminded me that at the end of the day, all of these sites and tools were just created via code.
And more importantly, that I could write code myself and might be able to do exactly what I needed without a pre-built tool's help.
A Developer Enters the Chat
As a developer, I've had the opportunity to work with a few audio frameworks for mobile and web applications, so I was confident that I could find a framework that would allow me to accomplish my task with a little bit of elbow grease.
And as I geared up to write some code and functionality from scratch, I recalled the lesson I learned not too long ago, "AI is a developer's friend".
But this time, instead of turning to Llama for assistance, I turned to another Generative AI tool I had become pretty good friends with, Gemini. And I particularly decided to go with Gemini because I have had some solid code generation outcomes from it in the past.
So, I booted up Gemini and immediately asked:
Are are there any programming frameworks good at evaluating audio (e.g. volume level, length, etc)?
Gemini quickly responded with some suggestions:
Yes, there are several programming frameworks well-suited for audio evaluation tasks like measuring volume level, length, and more. Here are a few popular options:
Python-based Frameworks:
- LibROSA: A powerful library for music and audio analysis. It offers a wide range of tools for feature extraction, classification, and signal processing.
- PyDub: A simple library for manipulating audio with Python. It's great for basic tasks like trimming, concatenating, and analyzing audio files.
- SoundFile: A library for reading and writing audio files in various formats. It can be used to extract information about audio files, such as duration and sample rate.
Other Frameworks:
- FFmpeg: A versatile command-line tool and library for handling video and audio. It can be used to analyze audio streams and extract metadata like duration and bitrate.
- MATLAB: A powerful mathematical computing environment that includes tools for audio signal processing and analysis.
I have personal experience with both FFmpeg and MATLAB and felt that they alone wouldn't be what I needed. But in its response, Gemini also gave a quick example of some PyDub code, so I decided to check out PyDub to see how it worked.
from pydub import AudioSegment

def analyze_audio(file_path):
    audio = AudioSegment.from_file(file_path)
    # Get duration in milliseconds
    duration = len(audio)
    # Get RMS amplitude (a measure of volume)
    rms = audio.rms
    return duration, rms

# Example usage
file_path = "your_audio_file.mp3"
duration, rms = analyze_audio(file_path)
print("Duration:", duration, "ms")
print("RMS amplitude:", rms)
I chatted a bit longer with Gemini about tools and possible approaches before finally deciding to ask it directly how to code the exact solution to my problem:
Now, Could you create code using Pydub that identifies sections of low volume or silence of a specific length and replaces it with silence of a different length, then outputs a new audio file?
Gemini responded with this code which on first inspection looked like it might be what I needed.
from pydub import AudioSegment

def replace_silent_sections(input_file, output_file, silence_threshold=-16, min_silence_len=1000, replacement_silence_len=500):
    """
    Replaces silent sections in an audio file with silence of a different length.
    Args:
        input_file (str): Path to the input audio file.
        output_file (str): Path to the output audio file.
        silence_threshold (int, optional): RMS amplitude threshold for silence. Defaults to -16 dBFS.
        min_silence_len (int, optional): Minimum length of a silent segment to be replaced, in milliseconds. Defaults to 1000 ms.
        replacement_silence_len (int, optional): Length of the replacement silence, in milliseconds. Defaults to 500 ms.
    """
    audio = AudioSegment.from_file(input_file)
    # Split the audio into 100ms segments
    segments = []
    for i in range(0, len(audio), 100):
        segments.append(audio[i:i+100])
    # Identify and replace silent segments
    non_silent_segments = []
    current_silent_segment = None
    for segment in segments:
        if segment.rms < silence_threshold:
            if current_silent_segment is None:
                current_silent_segment = segment
            else:
                current_silent_segment += segment
        else:
            if current_silent_segment is not None:
                if len(current_silent_segment) >= min_silence_len:
                    # Replace with silence of a different length
                    replacement_silence = AudioSegment.silent(duration=replacement_silence_len)
                    non_silent_segments.append(replacement_silence)
                else:
                    # Add the original silent segment
                    non_silent_segments.append(current_silent_segment)
                current_silent_segment = None
            non_silent_segments.append(segment)
    # Concatenate the segments
    output_audio = sum(non_silent_segments)
    output_audio.export(output_file, format="mp3")

# Example usage:
input_file = "input.mp3"
output_file = "output_replaced_silence.mp3"
replace_silent_sections(input_file, output_file)
BUT shortly after testing out the code, I encountered a problem that is often one of the weaknesses of asking AI tools to help with code generation...
PyDub, though a possible option for solving my problem, had not been updated in several years and some of its dependencies had been deprecated in newer versions of Python.
And though PyDub was a dead end, it did prepare me to be a bit more discerning about the next framework I decided to investigate, and it helped me narrow my efforts down to LibROSA.
Using a similar prompt as before, I asked Gemini:
Now, Could you create code using LibROSA that identifies sections of low volume or silence of a specific length and replaces it with silence of a different length, then outputs a new audio file?
To save some time, I will summarize what happened after getting a response from Gemini.
Gemini provided some workable code that had all the requirements I "thought" I needed to automate my task. I asked Gemini some follow-up questions to help clarify what its intent was in the code and coincidentally, received some clarification on a concept in audio editing that I never had a full grasp on:
is that RMS value in dB?
Gemini's Response:
No, the RMS value in the code is not directly in dB. It's a normalized value between 0 and 1, where 0 represents complete silence and values closer to 1 indicate louder segments.
To convert this normalized RMS value to a dB scale, you would typically use a logarithmic function. However, in the context of silence detection, comparing the normalized RMS value to a threshold is sufficient.
The silence_threshold is also a normalized value, and comparing it to the calculated RMS values allows us to identify segments that are significantly quieter than the average.
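For anyone who does want to think in decibels, the "logarithmic function" Gemini mentions is the standard amplitude-to-dB formula, 20 * log10(amplitude). The sketch below is my own illustration rather than part of Gemini's answer. It assumes samples are floats in the range -1.0 to 1.0, so that an RMS of 1.0 counts as full scale. The function names and the `floor` parameter are made up for this example (librosa ships its own `librosa.amplitude_to_db` if you'd rather not roll your own):

```python
import math


def amplitude_to_db(rms, floor=1e-10):
    """Convert a linear RMS amplitude (1.0 = full scale) to decibels."""
    # Clamp to a tiny floor so that digital silence (0.0) does not hit log10(0).
    return 20.0 * math.log10(max(rms, floor))


def db_to_amplitude(db):
    """Convert decibels relative to full scale back to a linear amplitude."""
    return 10.0 ** (db / 20.0)


for rms in (1.0, 0.1, 0.01, 0.001):
    print(f"RMS {rms:<6} -> {amplitude_to_db(rms):6.1f} dB")
```

Under that assumption, a linear threshold such as 0.01 sits at roughly -40 dB, which is a handy way to sanity-check a threshold against the dB meter in an audio editor.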
As I investigated the code and got a better understanding of the framework, I realized I didn't really need exactly what I had been asking for in my previous prompt.
So, I updated my prompt to this:
Now, Could you create code using LibROSA that splices out sections of low volume or silence of a specific length?
And Gemini responded with the code that would eventually save me HOURS of editing time in the near and long term:
import librosa
import numpy as np
import soundfile as sf

def splice_silent_sections(input_file, output_file, silence_threshold=0.01, min_silence_len=1000, sr=22050):
    """
    Splices out silent sections from an audio file.
    Args:
        input_file (str): Path to the input audio file.
        output_file (str): Path to the output audio file.
        silence_threshold (float, optional): RMS amplitude threshold for silence. Defaults to 0.01.
        min_silence_len (int, optional): Minimum length of a silent segment to be spliced out, in samples. Defaults to 1000 samples.
        sr (int, optional): Sample rate of the audio. Defaults to 22050 Hz.
    """
    # Load the audio file
    y, sr = librosa.load(input_file, sr=sr)
    # Calculate the RMS amplitude of each frame
    frame_size = 1024
    hop_length = 512
    rms = librosa.feature.rms(y=y, frame_length=frame_size, hop_length=hop_length)[0]
    # Identify silent segments
    silent_segments = []
    current_silent_segment = []
    for i, r in enumerate(rms):
        if r < silence_threshold:
            current_silent_segment.append(i)
        else:
            if len(current_silent_segment) >= min_silence_len:
                silent_segments.append(current_silent_segment)
            current_silent_segment = []
    # Splice out silent segments
    spliced_audio = []
    last_end = 0
    for segment in silent_segments:
        start_sample = segment[0] * hop_length
        end_sample = (segment[-1] + 1) * hop_length
        spliced_audio.extend(y[last_end:start_sample])
        last_end = end_sample
    spliced_audio.extend(y[last_end:])
    # Write the spliced audio to a new file
    sf.write(output_file, np.array(spliced_audio), sr)

# Example usage:
input_file = "input.wav"
output_file = "output_spliced.wav"
splice_silent_sections(input_file, output_file)
Gemini also gave me some context for its code:
This code works by:
1. Loading the audio file: Uses librosa.load() to load the audio file into a NumPy array y and its sample rate sr.
2. Calculating RMS amplitude: Calculates the RMS amplitude of each frame using librosa.feature.rms().
3. Identifying silent segments: Iterates through the RMS values, identifying segments that are below the silence_threshold and longer than the min_silence_len.
4. Splicing out silent segments: Iterates over the silent segments, extracting the audio portions before and after each segment and adding them to a new list spliced_audio.
5. Writing the spliced audio: Concatenates the spliced audio segments into a new NumPy array and writes it to a new file using soundfile.write().
This code effectively removes the silent sections from the audio file, resulting in a new file with only the non-silent portions.
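To make steps 3 and 4 easier to follow, here is a dependency-free sketch of the same idea that runs on a plain list of per-frame RMS values instead of a real audio file. It is my own illustration, not Gemini's output: the function names and the sample values are made up, and unlike the code above it also flushes a quiet run that lasts until the very end of the recording.

```python
def find_silent_runs(rms_values, silence_threshold=0.01, min_silence_frames=3):
    """Return (start_frame, end_frame) pairs, end exclusive, for each silent run."""
    runs = []
    run_start = None
    for i, value in enumerate(rms_values):
        if value < silence_threshold:
            if run_start is None:
                run_start = i  # a new quiet run begins here
        else:
            if run_start is not None and i - run_start >= min_silence_frames:
                runs.append((run_start, i))
            run_start = None
    # Flush a quiet run that lasts until the end of the recording.
    if run_start is not None and len(rms_values) - run_start >= min_silence_frames:
        runs.append((run_start, len(rms_values)))
    return runs


def frames_to_keep(num_frames, silent_runs):
    """Invert the silent runs into the (start, end) frame ranges to keep."""
    keep = []
    last_end = 0
    for start, end in silent_runs:
        if start > last_end:
            keep.append((last_end, start))
        last_end = end
    if last_end < num_frames:
        keep.append((last_end, num_frames))
    return keep


# Made-up values: speech, a long pause, speech, a short pause, speech, trailing pause.
rms_values = [0.2, 0.3, 0.001, 0.002, 0.001, 0.004, 0.25, 0.003, 0.3, 0.001, 0.002, 0.001]
runs = find_silent_runs(rms_values)
print(runs)                                    # [(2, 6), (9, 12)]
print(frames_to_keep(len(rms_values), runs))   # [(0, 2), (6, 9)]
```

The one-frame pause at index 7 is shorter than the minimum, so it survives. To get sample positions for slicing the audio array, multiply each frame index by `hop_length`, as the code above does.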
The End of Music
Though getting to my solution wasn't exactly a straight path, in the course of an hour or two, I was able to use Llama, Gemini and Python to create a set of code that did EXACTLY what I needed it to do.
It's hard to explain how efficient this process was in terms of research and development and time saved in developing the end product.
In a non-AI infused development workflow, I would have had to discover much of this information about tools and code frameworks through the traditional means of development; searching Google, investigating Stack Overflow, trolling obscure blogs for framework documentation and then rinse and repeat, until hopefully finding the correct resources and tools to solve my problem.
Though I'm confident I could have developed a similar solution through the "older style" approach, in a pre-AI world, I imagine this whole process would have taken me a few days to complete.
This type of workflow is an absolute boon to developers like myself who often are looking for novel ways to solve obscure problems. And I hope this little tale has inspired you to go on an AI-powered journey of your own!
For those interested, here is the code I ended up using in my audio editing workflow. (Note: This is a basic implementation. I plan to add some additional features to this code in the future, so please feel free to do the same.) And for those who are interested in where I've taken this code since its initial implementation, check out my personal project around automating YouTube content creation on my GitHub:
import librosa
import numpy as np
import soundfile as sf

input_file_name = "your-file-name"  # Insert name of file to trim silence
input_file = input_file_name + ".wav"  # Change file type if needed (e.g. ".wav", ".mp3")

def splice_silent_sections(input_file, output_file, silence_threshold_rms=0.001, min_silence_len=10, sample_rate=44100):
    """
    Splices out silent/low volume sections from an audio file.
    Args:
        input_file (str): Path to the input audio file.
        output_file (str): Path to the output audio file.
        silence_threshold_rms (float, optional): RMS amplitude threshold for silence. Defaults to 0.001.
        min_silence_len (int, optional): Minimum length of a silent segment to be spliced out, in RMS frames (each frame advances by hop_length samples). Defaults to 10 frames.
        sample_rate (int, optional): Sample rate the audio is resampled to on load. Defaults to 44100 Hz.
    """
    # Load the audio file
    y, sr = librosa.load(input_file, sr=sample_rate)
    # Calculate the RMS amplitude of each frame
    frame_size = 1024
    hop_length = 512
    rms = librosa.feature.rms(y=y, frame_length=frame_size, hop_length=hop_length)[0]
    # Identify silent segments (each one is a list of consecutive frame indices)
    silent_segments = []
    current_silent_segment = []
    for i, r in enumerate(rms):
        if r < silence_threshold_rms:
            current_silent_segment.append(i)
        else:
            if len(current_silent_segment) >= min_silence_len:
                silent_segments.append(current_silent_segment)
            current_silent_segment = []
    # Also catch a silent run that lasts until the end of the file
    if len(current_silent_segment) >= min_silence_len:
        silent_segments.append(current_silent_segment)
    # Splice out silent segments, converting frame indices to sample positions
    spliced_audio = []
    last_end = 0
    for segment in silent_segments:
        start_sample = segment[0] * hop_length
        end_sample = (segment[-1] + 1) * hop_length
        spliced_audio.extend(y[last_end:start_sample])
        last_end = end_sample
    spliced_audio.extend(y[last_end:])
    # Write the spliced audio to a new file
    sf.write(output_file, np.array(spliced_audio), sr)

# Example usage:
output_file = input_file_name + "_spliced.wav"  # Append "_spliced" to the file name for clarity on the output file
splice_silent_sections(input_file, output_file)
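One thing worth knowing when tuning this: the detection loop counts RMS frames, and each frame advances by `hop_length` samples, so `min_silence_len` is effectively a number of frames rather than a number of samples. The sketch below shows a rough way to convert between frames and seconds. The helper names are made up for illustration, and the conversion is approximate because it ignores the extra width of the analysis window:

```python
import math


def frames_to_seconds(num_frames, hop_length=512, sample_rate=44100):
    """Approximate duration covered by a run of consecutive RMS frames."""
    return num_frames * hop_length / sample_rate


def seconds_to_frames(seconds, hop_length=512, sample_rate=44100):
    """Number of consecutive frames needed to span at least the given duration."""
    return max(1, math.ceil(seconds * sample_rate / hop_length))


print(f"10 frames at 44.1 kHz  ~ {frames_to_seconds(10):.3f} s")
print(f"1000 frames at 22.05 kHz ~ {frames_to_seconds(1000, sample_rate=22050):.1f} s")
print(f"0.5 s at 44.1 kHz      ~ {seconds_to_frames(0.5)} frames")
```

By this estimate, 10 frames at 44.1 kHz is a little over a tenth of a second, while 1000 frames at 22.05 kHz would be a pause of more than 20 seconds, so it pays to pick the value with your own recordings in mind.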
Photo Credits(Order of Appearance):
Cover Photo: Dragon Ball Z (FUNimation Productions, Ltd.)
Bleach Gif via tenor
Follow me on my Socials:
https://linktr.ee/bradstondev
Top comments (5)
Have you tried working with videos, such as voice-over videos? Suppose there are many silent parts in both the audio and video, and these silent sections should be removed in the final production. How can we cut them together automatically? Do you have any ideas?
That's actually a nice step in this process I could start working toward. I've continued to build on my initial project and more recently added MoviePy to my automation toolchain, so I can now automatically trim videos based on auto-generated highlights. It's a bit different from the original audio trimming, but it's been my first step working with video.
I think one approach for doing both audio and video would look like this:
I would note, we could probably save a step by doing the Audio analysis with Librosa without having to extract the audio from the video but I haven't tried that functionality yet.
I might actually start working on this next to see how this works. Have been working with other ideas but I like doing the silence trim directly with video. Might be a nice challenge.
Sounds good. Thanks for sharing the plan. I will look into the repo. It would be great one if we can solve this problem. I tried with raw python. But the speed was terrible and also it broke the whole things. With that speed can edit lots of video. Haha.
haha! I feel like speed is definitely a thing I've noted. The audio trim process has tended to be pretty quick, but the video editing is a beast. 😅 I need to spend some more time seeing how to optimize things. The big thing that concerns me is the video splitting/trimming and then merging. Just from what I've experienced so far, trimming video can take a looong time depending on the length.
Yes true thing. Even it worked, It will take huge time. Because during the process it use the audio video to decode to another format so in the LLM it also should be same to match the pattern. I'm unsure of it but just my guess. So still curious about this challenge. Good luck for the challenge 🙌