What you’ll learn
- How to transcribe speech into text using OpenAI Whisper
- How to build a web-based transcription app using Gradio
- How to run everything for free on Google Colab’s GPU runtime
Who this article is for
- Developers interested in ChatGPT’s Audio mode
- Anyone curious about building AI-powered Audio tools
- Engineers who want to try Whisper or Gradio without local setup
- Beginners looking to prototype an app quickly using free Colab GPU
Environment
| Item | Details |
|---|---|
| Platform | Google Colab (Free Tier) |
| GPU | NVIDIA T4 |
| Python | 3.12 |
| Key Libraries | openai-whisper, gradio |
| Setup Time | ~5 minutes |
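You can confirm your runtime matches this table with a quick cell before installing anything. A minimal check; the `nvidia-smi` flags below simply print the GPU name, and both commands are standard on Colab:

```python
# Quick environment check: GPU model and Python version.
!nvidia-smi --query-gpu=name --format=csv,noheader
!python --version
```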
Step 1: Set Up the Environment
Run the following cell in Colab to install the required packages and import the libraries.
```python
# Install Whisper (from GitHub) and Gradio quietly.
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio

# Make sure this session picks up the freshly installed packages.
import importlib, sys
importlib.invalidate_caches()
sys.path.append('/usr/local/lib/python3.12/site-packages')

import gradio as gr
import whisper

print("Whisper loaded successfully")
```
Once you see the line `Whisper loaded successfully`, you’re good to go. Even on Colab’s free T4 GPU, Whisper performs smoothly for short recordings.
Step 2: Load the Whisper Model
```python
model = whisper.load_model("small")
```
The "small" variant provides a good balance between accuracy and speed, and it works particularly well for Japanese. It downloads once (~460 MB) and then loads instantly from cache afterward.
Step 3: Create a Gradio Web App
```python
def transcribe(audio):
    # Gradio passes the recorded or uploaded audio as a temporary file path.
    result = model.transcribe(audio, language="ja")
    return result["text"]

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper Test",
    description="Record or upload audio and get Japanese transcription"
).launch(share=True)
```
After running the code, a Gradio web interface appears like this:

You can record your voice directly from the microphone, or upload an audio file from your device. Then click Submit, and your transcribed text will appear in the Output box. When running on Colab, Gradio automatically provides a temporary .gradio.live URL so you can test the app from your phone or another computer — free of charge.
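If you’d rather sanity-check Whisper without the web UI, you can call `model.transcribe()` on a file directly in a Colab cell. A quick sketch, assuming an audio file named `sample.wav` exists in the working directory (the filename is just a placeholder):

```python
# Transcribe a local file directly; "sample.wav" is a placeholder path.
result = model.transcribe("sample.wav", language="ja")
print(result["text"])

# Drop language="ja" to let Whisper auto-detect the spoken language instead.
# result = model.transcribe("sample.wav")
```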
How It Works
| Component | Role |
|---|---|
| Whisper | Converts speech to text with an end-to-end encoder-decoder Transformer |
| Gradio | Creates a web UI and handles audio I/O |
| Colab | Provides free GPU compute for model inference |
Together, these form a lightweight, end-to-end speech-to-text pipeline.
Notes
- The `.gradio.live` URL is temporary and public (no authentication), so don’t share it if your audio contains private data (a simple password gate is sketched after this list).
- Once the Colab runtime stops, the URL expires automatically.
- For a persistent deployment, consider using RunPod, Hugging Face Spaces, or Render.
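If you do need to hand the temporary link to someone else, Gradio’s `launch()` accepts an `auth` parameter that puts a simple username/password prompt in front of the app. A small sketch; the credentials below are placeholders:

```python
# Same interface as above, but the share link now asks for credentials first.
gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper Test",
).launch(share=True, auth=("demo-user", "change-me"))
```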
Wrap-Up
In just about 20 lines of Python, you now have a fully working Japanese speech-to-text web app. This setup is ideal for experimenting with AI transcription, audio notes, or even meeting summaries — all without spending a single dollar.
About me
I’m a site reliability engineer (SRE) working mainly on infrastructure design and automation. Recently, I’ve been exploring the intersection of AI and speech technology, focusing on how to develop custom speech-enabled LLMs. My main stack includes Python, FastAPI, Next.js, and AWS.
My Motivation
I wrote this article because I want to develop my own custom speech-enabled LLM. As a ChatGPT Plus user, I often rely on the audio mode, but I wish I could use it freely for longer sessions throughout the day. Speaking helps me organize my thoughts and trigger new ideas — so I decided to recreate that experience myself. I’ll keep sharing articles about speech AI and LLM integration, so follow along if this project resonates with you.