What you’ll learn
- How to transcribe speech into text using OpenAI Whisper
- How to build a web-based transcription app using Gradio
- How to run everything for free on Google Colab’s GPU runtime
Who this article is for
- Developers interested in ChatGPT’s Audio mode
- Anyone curious about building AI-powered Audio tools
- Engineers who want to try Whisper or Gradio without local setup
- Beginners looking to prototype an app quickly using free Colab GPU
Environment
| Item | Details |
|---|---|
| Platform | Google Colab (Free Tier) |
| GPU | NVIDIA T4 |
| Python | 3.12 |
| Key Libraries | openai-whisper, gradio |
| Setup Time | ~5 minutes |
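You can confirm your runtime matches this table with a quick cell before installing anything. A minimal check; the `nvidia-smi` flags below simply print the GPU name, and both commands are standard on Colab:

```python
# Quick environment check: GPU model and Python version.
!nvidia-smi --query-gpu=name --format=csv,noheader
!python --version
```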
Step 1: Set Up the Environment
Run the following cell in Colab to install the required packages and import the libraries.
```python
# Install Whisper (from GitHub) and Gradio quietly.
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio

# Make sure this session picks up the freshly installed packages.
import importlib, sys
importlib.invalidate_caches()
sys.path.append('/usr/local/lib/python3.12/site-packages')

import gradio as gr
import whisper

print("Whisper loaded successfully")
```
Once you see the line `Whisper loaded successfully`, you’re good to go. Even on Colab’s free T4 GPU, Whisper performs smoothly for short recordings.
Step 2: Load the Whisper Model
```python
model = whisper.load_model("small")
```
The "small" variant provides a good balance between accuracy and speed, and it works particularly well for Japanese. It downloads once (~460 MB) and then loads instantly from cache afterward.
Step 3: Create a Gradio Web App
```python
def transcribe(audio):
    # Gradio passes the recorded or uploaded audio as a temporary file path.
    result = model.transcribe(audio, language="ja")
    return result["text"]

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper Test",
    description="Record or upload audio and get Japanese transcription"
).launch(share=True)
```
After running the code, a Gradio web interface appears like this:

You can record your voice directly from the microphone, or upload an audio file from your device. Then click Submit, and your transcribed text will appear in the Output box. When running on Colab, Gradio automatically provides a temporary .gradio.live URL so you can test the app from your phone or another computer — free of charge.
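If you’d rather sanity-check Whisper without the web UI, you can call `model.transcribe()` on a file directly in a Colab cell. A quick sketch, assuming an audio file named `sample.wav` exists in the working directory (the filename is just a placeholder):

```python
# Transcribe a local file directly; "sample.wav" is a placeholder path.
result = model.transcribe("sample.wav", language="ja")
print(result["text"])

# Drop language="ja" to let Whisper auto-detect the spoken language instead.
# result = model.transcribe("sample.wav")
```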
How It Works
| Component | Role |
|---|---|
| Whisper | Converts speech to text with an end-to-end encoder-decoder Transformer |
| Gradio | Creates a web UI and handles audio I/O |
| Colab | Provides free GPU compute for model inference |
Together, these form a lightweight, end-to-end speech-to-text pipeline.
Notes
- The `.gradio.live` URL is temporary and public (no authentication), so don’t share it if your audio contains private data (a simple password gate is sketched after this list).
- Once the Colab runtime stops, the URL expires automatically.
- For a persistent deployment, consider using RunPod, Hugging Face Spaces, or Render.
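If you do need to hand the temporary link to someone else, Gradio’s `launch()` accepts an `auth` parameter that puts a simple username/password prompt in front of the app. A small sketch; the credentials below are placeholders:

```python
# Same interface as above, but the share link now asks for credentials first.
gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper Test",
).launch(share=True, auth=("demo-user", "change-me"))
```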
Wrap-Up
In just about 20 lines of Python, you now have a fully working Japanese speech-to-text web app. This setup is ideal for experimenting with AI transcription, audio notes, or even meeting summaries — all without spending a single dollar.
About me
I’m a site reliability engineer (SRE) working mainly on infrastructure design and automation. Recently, I’ve been exploring the intersection of AI and speech technology, focusing on how to develop custom speech-enabled LLMs. My main stack includes Python, FastAPI, Next.js, and AWS.
My Motivation
I wrote this article because I want to develop my own custom speech-enabled LLM. As a ChatGPT Plus user, I often rely on the audio mode, but I wish I could use it freely for longer sessions throughout the day. Speaking helps me organize my thoughts and trigger new ideas — so I decided to recreate that experience myself. I’ll keep sharing articles about speech AI and LLM integration, so follow along if this project resonates with you.