Vicente G. Reyes

Posted on Mar 16 • Originally published at vicentereyes.org

Why I Built My Own Unlimited Audio & Video Transcriber

#ai #python #machinelearning #showdev

We've all been there. You have a massive audio file—a two-hour interview, a long lecture, or a recorded meeting—and you just need it converted to text.

You do a quick search for "free audio to text transcriber" and find dozens of online tools. But there's a catch. Once you upload your file, you're hit with a paywall: "File size exceeds limit" or "Upgrade to pro to transcribe more than 60 minutes."

I recently ran into this exact problem. I needed a reliable video and audio transcriber that didn't complain when I threw a massive, multi-hour file at it. After hitting limits on almost every online service, I realized the best solution wasn't to pay for a subscription I'd rarely use—it was to build it myself.

Here is how I built my own unlimited, completely free, and private transcriber using Python, OpenAI's Whisper, and Streamlit, accelerated by NVIDIA's CUDA.

The Tech Stack

I wanted the tool to be simple to use, run locally (so my data stays private), and handle both audio and video files flawlessly. Looking at my requirements.txt, here are the core tools making it possible:

Python: The backbone of the project.
Streamlit (v1.55): A fantastic library that turns Python scripts into interactive web apps in minutes. No HTML/CSS needed.
OpenAI's Whisper: An incredibly powerful and accurate open-source speech recognition model that runs locally.
PyTorch & NVIDIA CUDA (cu12): To speed up transcription, I included torch along with NVIDIA dependencies like nvidia-cudnn-cu12 and nvidia-cublas-cu12. This lets Whisper run on the GPU, cutting transcription times from hours to minutes.
MoviePy (v2.2): To handle video files seamlessly by extracting the audio tracks before passing them to Whisper.
python-docx & fpdf2: To allow exporting the final transcriptions into easily shareable Word documents and PDFs.

How It Works

The magic of the application lies in its simplicity. Here is the life cycle of a transcription task in the app:

1. Handling the Uploads

Using Streamlit's file_uploader, I built a quick UI that accepts various formats. Streamlit makes this incredibly easy with surprisingly few lines of code:

import streamlit as st

st.title("🎙️ Audio & Video Transcriber")
st.write("Upload a file to convert speech to text.")

uploaded_file = st.file_uploader(
    "Choose a file",
    type=["mp3", "wav", "m4a", "mp4", "mov", "avi"]
)
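Whisper and MoviePy both work with file paths, so the uploaded bytes have to land on disk first. Here is a minimal sketch of that step, assuming a hypothetical helper named save_upload_to_temp (the name and return shape are illustrative, not taken from the repository):

```python
import os
import tempfile


def save_upload_to_temp(original_name: str, data: bytes) -> tuple[str, str]:
    """Write uploaded bytes to a temp file, keeping the original extension.

    Returns (path, suffix), where suffix is lowercased, e.g. ".mp4".
    """
    suffix = os.path.splitext(original_name)[1].lower()
    # delete=False keeps the file on disk after the handle is closed,
    # so later steps can open it by path. The caller should remove it when done.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name, suffix
```

In the app this could be called as input_path, suffix = save_upload_to_temp(uploaded_file.name, uploaded_file.getvalue()) once uploaded_file is not None.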

2. Extracting Audio from Video

Whisper transcribes audio. But often, the content we want to transcribe is locked inside a video format (like a recorded Zoom meeting). To solve this, I integrated MoviePy. If the user uploads a video file, the app intercepts it, extracts the audio track in the background, and forwards that to the AI model:

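import os
from moviepy import VideoFileClip

# If it's a video, extract the audio using MoviePy 2.x syntax
# (suffix and input_path come from the saved upload)
if suffix.lower() in [".mp4", ".mov", ".avi"]:
    st.text("Processing video... extracting audio.")
    video = VideoFileClip(input_path)
    try:
        if video.audio is None:
            st.error("This video has no audio track.")
            st.stop()
        # Swap only the extension; str.replace could match earlier in the path
        audio_path = os.path.splitext(input_path)[0] + ".mp3"
        # Write the audio track to a temp file
        video.audio.write_audiofile(audio_path, logger=None)
        processing_path = audio_path
    finally:
        video.close()  # Important to free up system resources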

3. GPU-Accelerated Transcription

The heavy lifting is done by Whisper. I opted for the base model, one of the smaller Whisper checkpoints, which strikes a good balance between transcription speed and accuracy.

Because Whisper runs locally on my machine utilizing PyTorch and my NVIDIA GPU via CUDA, there are no time limits. I can feed it a 3-hour podcast, and it will methodically transcribe the entire thing without ever asking me for a credit card.
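Not every machine has an NVIDIA GPU, so it can help to make the device choice explicit. This is a small sketch, assuming a hypothetical helper named pick_device; it falls back to the CPU when PyTorch or CUDA is unavailable:

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Possible use with Whisper (assumes openai-whisper is installed):
#   device = pick_device()
#   model = whisper.load_model("base", device=device)
#   result = model.transcribe(path, fp16=(device == "cuda"))
```

Passing fp16=False on the CPU avoids Whisper's warning about half precision not being supported there; transcription still works, just more slowly.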

To make the app snappy and avoid reloading massive AI weights on every action, I used Streamlit's @st.cache_resource decorator:

import streamlit as st
import whisper

# Initialize Whisper model, caching it to avoid reloading
@st.cache_resource
def load_model():
    # Use "base" for a good balance of speed and accuracy
    return whisper.load_model("base")

model = load_model()

def transcribe_file(file_path):
    st.info("Transcribing... This may take a few minutes.")
    result = model.transcribe(file_path)
    return result['text']
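Besides result['text'], the dictionary returned by model.transcribe also carries a 'segments' list whose entries have 'start', 'end', and 'text' keys. For multi-hour recordings, timestamps make the transcript easier to navigate. A minimal sketch, with hypothetical helper names format_timestamp and segments_to_lines:

```python
def format_timestamp(seconds: float) -> str:
    """Turn a second count into HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(segments):
    """Build one '[HH:MM:SS] text' line per Whisper segment."""
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    ]


# Possible use: "\n".join(segments_to_lines(result["segments"]))
```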

4. Exporting the Results

A block of text on a web page is great, but a downloadable file is better. I used python-docx to generate nicely formatted Word documents. Once the transcription is done, the app provides download buttons right there in the UI:

from docx import Document

def save_as_docx(text, filename):
    doc = Document()
    doc.add_heading('Transcription', 0)
    doc.add_paragraph(text)
    doc.save(filename)
    return filename

# In the Streamlit UI later (col1 is assumed to come from st.columns,
# and transcript from transcribe_file)...
docx_file = "transcription.docx"
save_as_docx(transcript, docx_file)
with open(docx_file, "rb") as f:
    col1.download_button("Download as DOCX", f, file_name="transcription.docx")
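The PDF export can follow the same pattern with fpdf2. The following is only a sketch under assumptions: the helper names to_latin1 and build_pdf_bytes are hypothetical, and it uses the built-in Helvetica font, which covers Latin-1 characters only, so anything outside that range is replaced with "?":

```python
def to_latin1(text: str) -> str:
    """Replace characters the built-in PDF fonts cannot encode with '?'."""
    return text.encode("latin-1", "replace").decode("latin-1")


def build_pdf_bytes(text: str) -> bytes:
    """Render the transcript into a simple one-column PDF held in memory."""
    from fpdf import FPDF  # fpdf2, imported lazily

    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    # Width 0 means "extend to the right margin"; 8 is the line height.
    pdf.multi_cell(0, 8, to_latin1(text))
    return bytes(pdf.output())


# Possible use in the UI (col2 is a hypothetical second column):
#   col2.download_button("Download as PDF", build_pdf_bytes(transcript),
#                        file_name="transcription.pdf", mime="application/pdf")
```

Returning bytes instead of writing a fixed filename keeps concurrent sessions from overwriting each other's files. For transcripts in non-Latin scripts, a Unicode TTF font would need to be registered with the PDF object instead of relying on Helvetica.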

The Result

In fewer than 100 lines of Python code, I built a fully functional, local web app that solves a very real, very annoying problem.

No file size limits. No hidden paywalls. Total privacy. Blazing-fast GPU inference.

Building your own tools to solve personal bottlenecks is one of the most rewarding parts of knowing how to code. If you find yourself fighting against the limitations of "free" online software, take a step back and ask yourself: "Could I just build this?"

Chances are, with modern libraries and a spare afternoon, you absolutely can.

Code can be found here: https://github.com/reyesvicente/audio-to-text

Top comments (2)

Harjot Singh • May 31

"Unlimited" is the giveaway that you hit the classic SaaS wall - the hosted transcribers meter you into oblivion, so building your own with local/open models flips it from per-minute pricing to fixed cost. For high-volume transcription that math is overwhelming, and Whisper-class models made it actually feasible to self-host now.

Classic build-vs-rent crossover: rent until the usage-based pricing outgrows the effort of owning it, then build. You clearly crossed it. The interesting follow-on most of these face is the moment you want to share it - then the boring 20% (auth, accounts, deploy) shows up, which is exactly the gap Moonshift (prompt to a shipped SaaS on your own GitHub+Vercel) closes. For a personal tool you nailed the right call. What model did you land on for the transcription, and is it fully local? (Moonshift's first run's free if useful.)

Vicente G. Reyes • Jun 1

You're exactly right about the crossover point. I wasn't trying to build a SaaS initially—I just got tired of watching transcription costs scale linearly with usage while the hardware sitting on my desk was mostly idle.

For the transcription engine, I landed on Faster-Whisper running locally. It gives me Whisper-level accuracy with significantly better performance, especially when paired with GPU acceleration. The entire transcription pipeline runs on my own machine, so there are no per-minute API charges, no upload requirements, and no practical usage limits beyond available compute and storage.

The funny thing is that the transcription itself turned out to be the easy part. As you mentioned, once you start thinking about sharing it with other people, the "boring" SaaS layer suddenly becomes the majority of the work—authentication, user management, billing, deployment, monitoring, support, and all the other things that have nothing to do with transcription.

At the moment it's primarily a personal tool, so keeping everything local and simple has been the biggest win. Out of curiosity, what are you seeing people use most these days when they outgrow hosted transcription APIs—self-hosted Whisper variants, cloud GPU deployments, or something else?