Joe
Elastic D&D - Update 10 - Audio Transcription Changes

Last week we talked about FastAPI. If you missed it, you can check that out here!

Introduction

I decided to write about the audio transcription changes this week, as I finally got some code in place to give users an alternative method. Previously, audio-to-text ran through a service called AssemblyAI; however, transcribing 15-20 hours of audio was costing roughly $8-15 per month. The new code gives users the option to do it for free, though it takes much longer.

Speech Recognition

SpeechRecognition is a Python library for performing speech recognition via multiple APIs. It supports both online and offline engines, which makes it pretty powerful. For our use case, I utilized the OpenAI Whisper method.

Here's the full code:

import os
import shutil
from tempfile import NamedTemporaryFile

import speech_recognition as sr
from pydub import AudioSegment
from pydub.silence import split_on_silence


def transcribe_audio_free(file_object):
    # get extension
    filename, file_extension = os.path.splitext(file_object.name)

    # create temp file
    with NamedTemporaryFile(suffix=file_extension, delete=False) as temp:
        temp.write(file_object.getvalue())
        temp.seek(0)

        # split file into chunks
        audio = AudioSegment.from_file(temp.name)
        audio_chunks = split_on_silence(audio,
            # experiment with this value for your target audio file
            min_silence_len=3000,
            # adjust this per requirement
            silence_thresh=audio.dBFS-30,
            # keep 100 ms of silence at the chunk boundaries, adjustable as well
            keep_silence=100,
        )

        # create a directory to store the audio chunks
        folder_name = "audio-chunks"
        if not os.path.isdir(folder_name):
            os.mkdir(folder_name)
        whole_text = ""

        # process each chunk 
        for i, audio_chunk in enumerate(audio_chunks, start=1):
            # export audio chunk and save it in the `folder_name` directory.
            chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
            audio_chunk.export(chunk_filename, format="wav")
            # recognize the chunk
            try:
                # audio to text
                r = sr.Recognizer()
                uploaded_chunk = sr.AudioFile(chunk_filename)
                with uploaded_chunk as source:
                    chunk_audio = r.record(source)
                text = r.recognize_whisper(chunk_audio, model="medium")
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text

    # remove the temp file (delete=False means we clean it up manually)
    os.unlink(temp.name)

    # clean up the audio-chunks folders
    shutil.rmtree(folder_name)

    # return the text for all chunks detected
    return whole_text

There's not much to it, so I'll quickly step through the process:

  1. Creates a temporary file
  2. Loads temporary file into PyDub and splits it into smaller files
  3. Creates a directory to store the smaller files
  4. Iterates through the smaller files
     a. Places the file into the directory
     b. Performs speech-to-text via Whisper
     c. Adds the transcribed text to the "whole_text" variable
  5. Removes the temporary file
  6. Removes the directory
  7. Returns "whole_text"
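The directory lifecycle in steps 3, 4, and 6 can be sketched with the standard library alone. Here plain text files stand in for audio chunks, so the sketch runs without pydub or Whisper:

```python
import os
import shutil

# step 3: create a directory to hold the chunks
folder_name = "audio-chunks"
if not os.path.isdir(folder_name):
    os.mkdir(folder_name)

whole_text = ""
# step 4: stand-in "chunks" -- the real code gets these from split_on_silence()
for i, chunk in enumerate(["first part", "second part"], start=1):
    chunk_filename = os.path.join(folder_name, f"chunk{i}.txt")
    with open(chunk_filename, "w") as f:
        f.write(chunk)
    whole_text += f"{chunk.capitalize()}. "

# step 6: remove the directory and every chunk inside it
shutil.rmtree(folder_name)
print(whole_text)
```

Because each chunk is exported to disk before being recognized, cleaning up with `shutil.rmtree()` at the end keeps repeated runs from mixing old and new chunks.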

NOTE:

  - You may have to adjust the values passed to split_on_silence() to better suit your file; 3000, -30, and 100 were the sweet spot in my testing.
  - You may want a different Whisper model: change "medium" in the recognize_whisper() call to one that better fits your use case (Whisper ships tiny, base, small, medium, and large, trading speed for accuracy).

Closing Remarks

Please note that the paid method takes significantly less time and is generally worth using in my opinion. I may work on writing in a progress bar for the free method at some point. Regardless, both methods will be available for use.
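If I do add a progress bar, a minimal text-only version over the chunk loop could look something like this. This is just a sketch; `report_progress` is a hypothetical helper, not code from the repo:

```python
def report_progress(done, total):
    """Return a simple ten-segment text progress bar for the chunk loop."""
    pct = int(100 * done / total)
    return f"[{('#' * (pct // 10)).ljust(10)}] {pct}% ({done}/{total} chunks)"

# print one line per chunk as it finishes
for i in range(1, 5):
    print(report_progress(i, 4))
```

Since `split_on_silence()` returns all chunks up front, the total count is known before recognition starts, which makes a per-chunk progress readout straightforward.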

Next week, I will begin showing off Veverbot and the mechanisms in place to get him to work. I promise.

Check out the GitHub repo below. You can also find my Twitch account in the socials link, where I will be actively working on this during the week while interacting with whoever is hanging out!

GitHub Repo
Socials

Happy Coding,
Joe
