Whisper + Gradio on Colab: Speech-to-Text in Minutes

What you’ll learn

  • How to transcribe speech into text using OpenAI Whisper
  • How to build a web-based transcription app using Gradio
  • How to run everything for free on Google Colab’s GPU runtime

Who this article is for

  • Developers interested in ChatGPT’s Audio mode
  • Anyone curious about building AI-powered audio tools
  • Engineers who want to try Whisper or Gradio without local setup
  • Beginners looking to prototype an app quickly using free Colab GPU

Environment

| Item | Details |
| --- | --- |
| Platform | Google Colab (Free Tier) |
| GPU | NVIDIA T4 |
| Python | 3.12 |
| Key Libraries | openai-whisper, gradio |
| Setup Time | ~5 minutes |
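
Before installing anything, it’s worth confirming that your runtime actually has a GPU attached (Runtime > Change runtime type > T4 GPU). A minimal check, assuming the PyTorch build that Colab preinstalls:

```python
# Sanity check: confirm the Colab runtime has a GPU before proceeding
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
else:
    print("No GPU detected - switch the runtime type to GPU")
```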

Step 1: Set Up the Environment

Run the following cell in Colab to install all required packages.

```python
# Install Whisper (from GitHub) and Gradio quietly
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio

# Workaround: refresh the import caches so the freshly installed packages
# are importable without restarting the Colab runtime
import importlib, sys
importlib.invalidate_caches()
sys.path.append('/usr/local/lib/python3.12/site-packages')

import gradio as gr
import whisper

print("Whisper loaded successfully")
```

Once you see the line "Whisper loaded successfully", you’re good to go. Even on Colab’s free T4 GPU, Whisper performs smoothly for short recordings.

Step 2: Load the Whisper Model

```python
# Downloads the "small" checkpoint on the first run, then loads from cache
model = whisper.load_model("small")
```

The "small" variant provides a good balance between accuracy and speed, and it works particularly well for Japanese. It downloads once (~460 MB) and then loads instantly from cache afterward.

Step 3: Create a Gradio Web App

```python
def transcribe(audio):
    # Gradio passes the recording or upload in as a temporary file path
    result = model.transcribe(audio, language="ja")
    return result["text"]

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),  # accepts mic recordings or uploaded files
    outputs="text",
    title="Whisper Test",
    description="Record or upload audio and get Japanese transcription",
).launch(share=True)  # share=True creates a temporary public URL
```

After running the code, a Gradio web interface appears inline in the notebook.

You can record your voice directly from the microphone, or upload an audio file from your device. Then click Submit, and your transcribed text will appear in the Output box. When running on Colab, Gradio automatically provides a temporary .gradio.live URL so you can test the app from your phone or another computer — free of charge.
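
If plain text isn’t enough, transcribe() also returns per-segment timestamps in result["segments"]. Here is a small variation on the transcribe function above that formats them (the function name and output layout are just one possible presentation):

```python
def transcribe_with_timestamps(audio):
    result = model.transcribe(audio, language="ja")
    # Each segment carries start/end times (in seconds) plus its text
    lines = [
        f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] {seg['text']}"
        for seg in result["segments"]
    ]
    return "\n".join(lines)
```

Swap fn=transcribe_with_timestamps into the gr.Interface call above to see the timestamps in the output box.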

How It Works

| Component | Role |
| --- | --- |
| Whisper | Converts speech to text using transformer-based acoustic modeling |
| Gradio | Creates a web UI and handles audio I/O |
| Colab | Provides free GPU compute for model inference |

Together, these form a lightweight, end-to-end speech-to-text pipeline.
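
The UI is even optional: the same pipeline runs as a couple of lines of plain Python. If you omit language="ja", Whisper auto-detects the spoken language from the first 30 seconds of audio ("sample.wav" below is a placeholder path):

```python
# Transcribe a file directly, letting Whisper detect the language
result = model.transcribe("sample.wav")  # placeholder path
print(result["language"])  # detected language code, e.g. "ja"
print(result["text"])
```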

Notes

  • The .gradio.live URL is temporary and public (no authentication). Don’t share it if your audio contains private data; a sketch for adding basic auth follows this list.
  • Once the Colab runtime stops, the URL expires automatically.
  • For a persistent deployment, consider using RunPod, Hugging Face Spaces, or Render.
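
If you do need to share the link while the audio might be sensitive, Gradio’s launch() accepts an auth argument for simple username/password protection. A sketch reusing the Step 3 interface (the credentials here are placeholders):

```python
# Same interface as in Step 3, but the share link now requires a login
gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper Test",
).launch(share=True, auth=("user", "change-me"))  # placeholder credentials
```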

Wrap-Up

In just about 20 lines of Python, you now have a fully working Japanese speech-to-text web app. This setup is ideal for experimenting with AI transcription, audio notes, or even meeting summaries — all without spending a single dollar.

About me

I’m an SRE working mainly on infrastructure design and automation. Recently, I’ve been exploring the intersection of AI and speech technology, focusing on how to develop custom speech-enabled LLMs. My main stack includes Python, FastAPI, Next.js, and AWS.

My Motivation

I wrote this article because I want to develop my own custom speech-enabled LLM. As a ChatGPT Plus user, I often rely on the audio mode, but I wish I could use it freely for longer sessions throughout the day. Speaking helps me organize my thoughts and trigger new ideas — so I decided to recreate that experience myself. I’ll keep sharing articles about speech AI and LLM integration, so follow along if this project resonates with you.
