Have you ever needed to transcribe audio without sending your files to the cloud? Whether you're a journalist interviewing sources, a student recording lectures, or a developer building voice interfaces, privacy matters. In this guide, I'll show you how we built a speech-to-text system that runs entirely in your browser using OpenAI's Whisper model, the same technology behind our free online speech-to-text tool.
## Why Browser-Based Speech Recognition?
Before diving into the code, let's talk about why you'd want speech-to-text processing to happen locally.
### Your Audio Stays Private
When you upload audio to cloud transcription services, you're trusting third parties with potentially sensitive content. Browser-based processing means your recordings never leave your device. This is crucial for journalists, therapists, lawyers, or anyone handling confidential information.
### No API Costs or Rate Limits
Cloud STT services charge by the minute and impose rate limits. With local processing, you can transcribe hours of audio without worrying about bills or throttling.
### Works Offline
Once the page and AI model are loaded, you can transcribe audio even without an internet connection. Perfect for field work, travel, or areas with poor connectivity.
### Real-Time and Batch Processing
Our implementation supports both real-time transcription via the Web Speech API and high-quality batch processing using Whisper AI.
## The Architecture Overview
Our STT system uses a dual approach: the Web Speech API for real-time recording and Whisper AI (via Transformers.js) for high-quality file transcription. Here's how it all fits together.
## Understanding the Dual Approach
Our implementation offers two transcription modes:
### 1. Web Speech API (Real-Time)
Built into modern browsers, this provides instant transcription as you speak:
- Pros: instant results, no model download
- Cons: accuracy varies by browser, and most implementations (including Chrome's) send audio to a server, so an internet connection is usually required
### 2. Whisper AI (High-Quality)
OpenAI's Whisper model running locally via Transformers.js:
- Pros: State-of-the-art accuracy, supports 99 languages, generates timestamps
- Cons: Requires ~75MB model download, slower than real-time API
## Core Data Structures
Let's examine the key data structures in our STT implementation:
### Language Configuration
```typescript
const LANGUAGE_OPTIONS = [
  { code: "en-US", name: "English (US)" },
  { code: "en-GB", name: "English (UK)" },
  { code: "zh-CN", name: "Chinese (Simplified)" },
  { code: "zh-TW", name: "Chinese (Traditional)" },
  { code: "ja-JP", name: "Japanese" },
  { code: "ko-KR", name: "Korean" },
  { code: "es-ES", name: "Spanish" },
  { code: "fr-FR", name: "French" },
  { code: "de-DE", name: "German" },
  { code: "pt-BR", name: "Portuguese" },
  { code: "ru-RU", name: "Russian" },
];
```
We support 11 languages, mapping browser language codes to Whisper's language identifiers.
### React State Management
```typescript
const [text, setText] = useState("");
const [isRecording, setIsRecording] = useState(false);
const [recognitionLanguage, setRecognitionLanguage] = useState("en-US");
const [audioFile, setAudioFile] = useState<File | null>(null);
const [audioUrl, setAudioUrl] = useState<string | null>(null);
const [isProcessing, setIsProcessing] = useState(false);
const [modelLoaded, setModelLoaded] = useState(false);
const [modelLoading, setModelLoading] = useState(false);
const [modelProgress, setModelProgress] = useState(0);
const [transcriptionChunks, setTranscriptionChunks] = useState<Array<{timestamp: [number, number], text: string}>>([]);
const [interimText, setInterimText] = useState("");
// Error state shared by both modes (used by the handlers below)
const [recognitionError, setRecognitionError] = useState("");
const finalTextRef = useRef("");
const recognitionRef = useRef<any>(null);
const pipelineRef = useRef<any>(null);
```
We track recording state, audio files, model loading progress, and transcription chunks with timestamps.
## The Complete Processing Flow
The sections below walk through the entire flow, from audio input to transcribed text.
### Real-Time Speech Recognition
The Web Speech API provides instant transcription:
```typescript
const startRecording = () => {
  if (!isSpeechRecognitionSupported()) {
    setRecognitionError(t.sstNotSupported || "Speech recognition is not supported in your browser.");
    return;
  }

  setRecognitionError("");
  setText("");
  setInterimText("");
  finalTextRef.current = "";

  const SpeechRecognition = (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
  const recognition = new SpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;
  recognition.lang = recognitionLanguage;

  recognition.onstart = () => {
    setIsRecording(true);
  };

  recognition.onresult = (event: any) => {
    let finalTranscript = "";
    let interimTranscript = "";
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const transcript = event.results[i][0].transcript;
      if (event.results[i].isFinal) {
        finalTranscript += transcript + " ";
      } else {
        interimTranscript += transcript;
      }
    }
    // Only append final results to the accumulated text
    if (finalTranscript) {
      finalTextRef.current += finalTranscript;
      setText(finalTextRef.current);
      setInterimText("");
    } else if (interimTranscript) {
      // Show interim results temporarily without saving them
      setInterimText(interimTranscript);
      setText(finalTextRef.current + interimTranscript);
    }
  };

  recognition.onerror = (event: any) => {
    console.error("Speech recognition error:", event.error);
    setRecognitionError(t.sstError || "Recognition error. Please try again.");
    setIsRecording(false);
  };

  recognition.onend = () => {
    setIsRecording(false);
  };

  recognitionRef.current = recognition;
  recognition.start();
};
```
Key features:
- Continuous mode: Keeps listening until manually stopped
- Interim results: Shows live transcription as you speak
- Final results: Only saves confirmed transcriptions
- Language support: Uses the selected language code
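The `isSpeechRecognitionSupported` helper called at the top of `startRecording` isn't shown above; a minimal sketch (hypothetical implementation, assuming only standard globals) could look like this:

```typescript
// Feature check: the Web Speech API ships unprefixed in some browsers and
// as webkitSpeechRecognition in Chrome and Safari
const isSpeechRecognitionSupported = (): boolean => {
  const w = globalThis as any;
  return Boolean(w.SpeechRecognition || w.webkitSpeechRecognition);
};
```

Running this check before constructing the recognizer lets you fall back to the Whisper path in browsers without the API.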
### Loading the Whisper AI Model
For high-quality transcription, we load OpenAI's Whisper model:
```typescript
const loadWhisper = async () => {
  if (pipelineRef.current) return;

  setModelLoading(true);
  setModelProgress(0);

  try {
    // Dynamic import to avoid SSR issues
    const { pipeline } = await import('@xenova/transformers');

    // Create an automatic speech recognition pipeline,
    // using the tiny model (~75MB) for a faster download
    const transcriber = await pipeline(
      'automatic-speech-recognition',
      'Xenova/whisper-tiny',
      {
        progress_callback: (progress: any) => {
          if (progress && typeof progress.loaded === 'number' && typeof progress.total === 'number') {
            const percent = Math.round((progress.loaded / progress.total) * 100);
            setModelProgress(percent);
          }
        }
      }
    );

    pipelineRef.current = transcriber;
    setModelLoaded(true);
  } catch (error) {
    console.error("Failed to load Whisper:", error);
    setRecognitionError("Failed to load AI model. Please check your connection and try again.");
  } finally {
    setModelLoading(false);
  }
};
```
We use the `Xenova/whisper-tiny` model (~75MB) for a balance of accuracy and download size. The `progress_callback` provides real-time download progress.
### Audio Preprocessing
Before sending audio to Whisper, we need to convert it to the right format:
```typescript
const audioFileToFloat32 = async (file: File): Promise<Float32Array> => {
  const audioContext = new (window.AudioContext || (window as any).webkitAudioContext)();
  const arrayBuffer = await file.arrayBuffer();
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  // Get audio data from the first channel
  const channelData = audioBuffer.getChannelData(0);
  const sampleRate = audioBuffer.sampleRate;
  const targetSampleRate = 16000;

  // Resample to 16kHz if needed
  if (sampleRate === targetSampleRate) {
    return channelData;
  }

  // Simple resampling by nearest-neighbour decimation
  const ratio = sampleRate / targetSampleRate;
  const newLength = Math.floor(channelData.length / ratio);
  const resampled = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    const index = Math.floor(i * ratio);
    resampled[i] = channelData[index];
  }
  return resampled;
};
```
Whisper expects 16kHz mono audio, so we:
- decode the audio file using the Web Audio API
- extract the first channel (mono)
- resample to 16kHz using simple nearest-neighbour decimation
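The decimation step can be factored into a pure function and exercised without an `AudioContext` (hypothetical helper name, same logic as above):

```typescript
// Nearest-neighbour decimation: pick the closest source sample for each
// output position. Adequate for speech; a proper low-pass filter before
// decimating would reduce aliasing for music or wideband audio.
const resample = (input: Float32Array, fromRate: number, toRate: number): Float32Array => {
  if (fromRate === toRate) return input;
  const ratio = fromRate / toRate;
  const newLength = Math.floor(input.length / ratio);
  const output = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    output[i] = input[Math.floor(i * ratio)];
  }
  return output;
};
```

One second of 44.1kHz audio (44,100 samples) comes out as exactly 16,000 samples at the 16kHz target rate.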
### Transcribing Audio Files
Here's the complete transcription process:
```typescript
const extractText = async () => {
  if (!audioFile || !audioUrl) return;

  setIsProcessing(true);
  setRecognitionError("");
  setText("");
  setTranscriptionChunks([]);

  try {
    // Load Whisper if it hasn't been loaded yet
    if (!pipelineRef.current) {
      await loadWhisper();
    }
    if (!pipelineRef.current) {
      throw new Error("Failed to initialize speech recognition");
    }

    // Map browser locale codes to Whisper language codes
    const langMap: Record<string, string> = {
      'en-US': 'en', 'en-GB': 'en', 'zh-CN': 'zh', 'zh-TW': 'zh',
      'ja-JP': 'ja', 'ko-KR': 'ko', 'es-ES': 'es', 'fr-FR': 'fr',
      'de-DE': 'de', 'pt-BR': 'pt', 'ru-RU': 'ru',
    };
    const langCode = langMap[recognitionLanguage] || 'en';

    // Transcribe with timestamps
    const result = await pipelineRef.current(audioUrl, {
      language: langCode,
      task: 'transcribe',
      return_timestamps: true,
    });

    if (result.chunks && result.chunks.length > 0) {
      setTranscriptionChunks(result.chunks);
      setText(result.text || result.chunks.map((c: any) => c.text).join(' '));
    } else {
      setText(result.text || '');
    }
  } catch (error: any) {
    console.error("Audio processing error:", error);
    setRecognitionError(error.message || t.sstProcessingError || "Error processing audio file.");
  } finally {
    setIsProcessing(false);
  }
};
```
Key features:
- Language mapping: Converts browser locale codes to Whisper language codes
- Timestamps: Returns start/end times for each segment
- Chunks: Provides granular transcription segments for SRT export
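The locale mapping is easy to pull out into a standalone helper (hypothetical name) so it can be unit-tested in isolation:

```typescript
// Browser locale codes → Whisper language identifiers, falling back to English
const LOCALE_TO_WHISPER: Record<string, string> = {
  'en-US': 'en', 'en-GB': 'en', 'zh-CN': 'zh', 'zh-TW': 'zh',
  'ja-JP': 'ja', 'ko-KR': 'ko', 'es-ES': 'es', 'fr-FR': 'fr',
  'de-DE': 'de', 'pt-BR': 'pt', 'ru-RU': 'ru',
};

const toWhisperLang = (locale: string): string => LOCALE_TO_WHISPER[locale] ?? 'en';
```

Unknown locales fall back to `'en'`, matching the `|| 'en'` behaviour in `extractText`.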
### Generating SRT Subtitles
One powerful feature is exporting transcriptions as SRT subtitle files:
```typescript
// Format seconds to the SRT time format: HH:MM:SS,mmm
const formatSrtTime = (seconds: number): string => {
  const hrs = Math.floor(seconds / 3600);
  const mins = Math.floor((seconds % 3600) / 60);
  const secs = Math.floor(seconds % 60);
  const ms = Math.floor((seconds % 1) * 1000);
  return `${hrs.toString().padStart(2, '0')}:${mins.toString().padStart(2, '0')}:${secs.toString().padStart(2, '0')},${ms.toString().padStart(3, '0')}`;
};

// Generate SRT content from chunks
const generateSrtContent = (chunks: Array<{timestamp: [number, number], text: string}>): string => {
  return chunks.map((chunk, index) => {
    const [start, end] = chunk.timestamp;
    return `${index + 1}\n${formatSrtTime(start)} --> ${formatSrtTime(end)}\n${chunk.text.trim()}\n`;
  }).join('\n');
};

// Download the SRT file
const downloadSrt = () => {
  if (transcriptionChunks.length === 0) return;

  const srtContent = generateSrtContent(transcriptionChunks);
  const blob = new Blob([srtContent], { type: 'text/plain;charset=utf-8' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = 'transcription.srt';
  document.body.appendChild(a);
  a.click();
  document.body.removeChild(a);
  URL.revokeObjectURL(url);
};
```
This converts timestamped chunks into standard SRT format for use in video players.
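Feeding two sample chunks through these helpers shows the shape of the result (`formatSrtTime` and `generateSrtContent` are re-declared here so the snippet runs on its own):

```typescript
const formatSrtTime = (seconds: number): string => {
  const hrs = Math.floor(seconds / 3600);
  const mins = Math.floor((seconds % 3600) / 60);
  const secs = Math.floor(seconds % 60);
  const ms = Math.floor((seconds % 1) * 1000);
  return `${hrs.toString().padStart(2, '0')}:${mins.toString().padStart(2, '0')}:${secs.toString().padStart(2, '0')},${ms.toString().padStart(3, '0')}`;
};

const generateSrtContent = (chunks: Array<{ timestamp: [number, number]; text: string }>): string =>
  chunks
    .map((chunk, index) => {
      const [start, end] = chunk.timestamp;
      return `${index + 1}\n${formatSrtTime(start)} --> ${formatSrtTime(end)}\n${chunk.text.trim()}\n`;
    })
    .join('\n');

// Two chunks as Whisper returns them (note the leading spaces, which trim() removes)
const srt = generateSrtContent([
  { timestamp: [0, 2.5], text: ' Hello world.' },
  { timestamp: [2.5, 5], text: ' Second line.' },
]);
console.log(srt);
// 1
// 00:00:00,000 --> 00:00:02,500
// Hello world.
//
// 2
// 00:00:02,500 --> 00:00:05,000
// Second line.
```

Each cue is a counter, a time range, and the trimmed text, with a blank line between cues, which is exactly what video players expect from an `.srt` file.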
## Transformers.js Configuration
Transformers.js reads its settings from the `env` object it exports, so we configure it to fetch models remotely:

```typescript
import { env } from '@xenova/transformers';

// Fetch model weights from the Hugging Face Hub instead of a local /models path
env.allowLocalModels = false;
// Cache downloaded weights in the browser so reloads are fast
env.useBrowserCache = true;
```

This ensures models are loaded from Hugging Face's CDN rather than being bundled with the app, and cached locally after the first download.
## Performance Considerations
### Model Size Trade-offs
We use `whisper-tiny` (~75MB) for faster downloads, but you can use larger models for better accuracy:
- `tiny`: ~75MB, fastest, good accuracy
- `base`: ~150MB, balanced
- `small`: ~500MB, better accuracy
- `medium`: ~1.5GB, best accuracy
- `large`: ~3GB, state-of-the-art
### File Size Limits
We limit uploads to 10MB for browser processing:
```typescript
if (file.size > 10 * 1024 * 1024) {
  setRecognitionError("File too large. Maximum size is 10MB for browser processing.");
  return;
}
```
This prevents memory issues and ensures reasonable processing times.
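The guard is simple enough to express as a pure predicate (hypothetical helper, same 10MB threshold) if you want to test it or reuse it across upload paths:

```typescript
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // 10MB

// Returns an error message for oversized files, or null when the size is acceptable
const validateAudioFileSize = (sizeInBytes: number): string | null =>
  sizeInBytes > MAX_UPLOAD_BYTES
    ? 'File too large. Maximum size is 10MB for browser processing.'
    : null;
```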
### Memory Management
We clean up object URLs to prevent memory leaks:
```typescript
const clearText = () => {
  setText("");
  setInterimText("");
  finalTextRef.current = "";
  if (audioUrl) {
    URL.revokeObjectURL(audioUrl);
  }
  setAudioFile(null);
  setAudioUrl(null);
  setRecognitionError("");
  setTranscriptionChunks([]);
};
```
## Browser Compatibility
Our STT system works in modern browsers:
- Chrome/Edge: full support (Web Speech API + Whisper)
- Firefox: partial support (Whisper only; `SpeechRecognition` is not enabled by default)
- Safari: partial support (Whisper, plus a limited Web Speech API implementation)
Required APIs:
- `SpeechRecognition` or `webkitSpeechRecognition`: for real-time mode
- `AudioContext`: for audio decoding
- `WebAssembly`: for ONNX Runtime
## Try It Yourself
Ready to transcribe your audio? Visit our free online speech-to-text tool and give it a try. All processing happens locally - your audio files never leave your device.
## Conclusion
Building a browser-based speech-to-text system demonstrates the power of modern AI in the browser:
- Dual approach flexibility: Web Speech API for speed, Whisper for accuracy
- Privacy by design: Local processing keeps sensitive audio private
- Multi-language support: 11 languages with Whisper's universal model
- Export options: Plain text and SRT subtitles for versatility
The complete source is available in our repository. Whether you're building accessibility tools, transcription services, or voice interfaces, I hope this guide helps you add speech recognition to your projects.
Happy transcribing!