Building a Browser-Based Voice-to-Text App with the Web Speech API
I recently built a voice-to-text tool that works entirely in the browser — no backend required for the core functionality. Here's what I learned about the Web Speech API and its quirks.
Why Browser-Based?
Privacy is the main sell. Audio is handled by the browser's own speech engine, so nothing ever touches my servers: no uploads, no storage, no GDPR headaches. For a simple transcription tool, this is a huge advantage.
The Web Speech API Basics
The API is surprisingly simple:
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = 'en-US';
recognition.onresult = (event) => {
  const transcript = Array.from(event.results)
    .map(result => result[0].transcript)
    .join('');
  console.log(transcript);
};
recognition.start();
That's it. You now have live speech-to-text.
The Gotchas Nobody Warns You About
1. Browser support is inconsistent
Chrome uses Google's servers (ironically, not fully local). Safari uses on-device processing. Firefox support is limited. Always check:
if (!('SpeechRecognition' in window || 'webkitSpeechRecognition' in window)) {
  // Show fallback UI
}
2. It stops listening randomly
The API has a habit of stopping after silence. You need to restart it:
recognition.onend = () => {
  if (shouldKeepListening) {
    recognition.start();
  }
};
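For completeness, here's one way to wire up that flag. This is just a sketch; shouldKeepListening and the button IDs are my own names, not part of the API:
let shouldKeepListening = false;

document.querySelector('#start').addEventListener('click', () => {
  shouldKeepListening = true;
  recognition.start();
});

document.querySelector('#stop').addEventListener('click', () => {
  shouldKeepListening = false;
  recognition.stop(); // onend still fires, but the flag prevents the restart
});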
3. Punctuation doesn't exist
The API returns raw words with no periods, commas, or capitalization. You'll need to handle this yourself:
function addAutoPunctuation(text) {
  return text
    // Handle common spoken patterns like "question mark" → "?"
    .replace(/\s*\bquestion mark\b/gi, '?')
    .replace(/\s*\bcomma\b/gi, ',')
    .replace(/\s*\bperiod\b/gi, '.')
    // Capitalize the first letter and the letter after each sentence-ending mark
    .replace(/(^\s*\w|[.?!]\s+\w)/g, (c) => c.toUpperCase());
  // Inserting periods at pauses needs timing from interim results (not shown here)
}
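Then run the cleanup on the transcript before rendering it. Rough usage sketch; #output is a placeholder for wherever you show the text:
const outputEl = document.querySelector('#output');

recognition.onresult = (event) => {
  const raw = Array.from(event.results)
    .map(result => result[0].transcript)
    .join('');
  outputEl.textContent = addAutoPunctuation(raw);
};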
4. Language switching is manual
You need to build your own language selector and set recognition.lang accordingly. The API supports 100+ languages but won't auto-detect.
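A minimal selector sketch, assuming a <select id="lang"> element with BCP 47 codes as values (the API has no built-in UI, so you bring your own):
const langSelect = document.querySelector('#lang');

langSelect.addEventListener('change', () => {
  recognition.lang = langSelect.value; // e.g. 'hi-IN', 'es-ES', 'fr-FR'
  // A language change typically only takes effect on the next start,
  // so stop and let the onend handler above restart recognition
  if (shouldKeepListening) {
    recognition.stop();
  }
});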
When to NOT Use Web Speech API
For anything beyond basic dictation, you'll hit walls:
- Audio file transcription — API only does live mic input
- Speaker identification — Not supported
- Timestamps — Not provided
- Accuracy requirements — Enterprise use cases need Whisper, AssemblyAI, or Deepgram
I ended up building a hybrid: free tier uses Web Speech API for live dictation, Pro tier uses Whisper for file uploads and higher accuracy.
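The routing itself is simple. A sketch of the Pro path, where /api/transcribe is a placeholder for whatever Whisper-backed endpoint you run (not a real endpoint of this site):
async function transcribeFile(file, userPlan) {
  if (userPlan !== 'pro') {
    throw new Error('File transcription is a Pro feature; use live dictation instead.');
  }
  const body = new FormData();
  body.append('audio', file);
  // Hypothetical server route that runs Whisper on the uploaded file
  const response = await fetch('/api/transcribe', { method: 'POST', body });
  if (!response.ok) throw new Error('Transcription failed');
  const { text } = await response.json();
  return text;
}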
Native Language SEO Bonus
One unexpected win: I built language-specific pages with native script UI. The Hindi page is actually in Hindi (हिंदी में वॉइस टू टेक्स्ट), not just "Hindi Voice to Text" in English.
Result: Started ranking for native-language searches with way less competition than English keywords.
Try It
I built this into voicetotextonline.com — free to use, no signup for basic transcription.
If you're building something similar, happy to answer questions in the comments.