The Web Speech API has been in Chrome since 2013. It provides real-time speech-to-text from a few lines of JavaScript, with no API key, no account, and no backend of your own to run. And almost nobody uses it for anything beyond voice search demos.
The API
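```javascript
// Use the standard constructor where available, else the prefixed one.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;      // keep listening after pauses
recognition.interimResults = true;  // emit partial results while speaking
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  // event.results holds every result of the session, so rebuild the full
  // transcript from index 0; starting at event.resultIndex would drop
  // earlier phrases from the display.
  let transcript = '';
  for (let i = 0; i < event.results.length; i++) {
    transcript += event.results[i][0].transcript;
  }
  document.getElementById('output').textContent = transcript;
};

recognition.start();
```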
That's a working speech-to-text implementation in about 15 lines of code. Setting continuous to true keeps listening after pauses. Setting interimResults to true shows text as you speak, not just after you finish.
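The handler above writes final and interim text into one string. If you want to style the two differently (for example, greying out text that may still change), a small helper can split them using the isFinal flag on each result. This is a minimal sketch; splitTranscript is a hypothetical helper name, not part of the API.

```javascript
// Split a SpeechRecognitionResultList-like object into finalized text and
// text that may still be revised. `results` only needs `length`, numeric
// indexes, `isFinal`, and `[0].transcript`, so a plain array works in tests.
function splitTranscript(results) {
  let finalText = '';
  let interimText = '';
  for (let i = 0; i < results.length; i++) {
    const best = results[i][0]; // first (highest-ranked) alternative
    if (results[i].isFinal) {
      finalText += best.transcript;
    } else {
      interimText += best.transcript;
    }
  }
  return { finalText, interimText };
}
```

Inside onresult you would call splitTranscript(event.results) and render the two strings into separate elements.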
What it can and can't do
Accuracy: Good for clear speech in a quiet environment, where native English speakers can expect roughly 90-95% accuracy. Background noise, strong accents, and technical jargon all reduce it.
Languages: Supports 60+ languages and dialects. The accuracy varies by language based on the training data available.
Latency: Real-time. Words appear within 200-500ms of being spoken. Interim results appear even faster but may be revised as more context is available.
Limitations: Despite running "in the browser," Chrome's implementation sends audio to Google's servers for processing. This means it requires an internet connection and raises privacy considerations. Firefox does not enable speech recognition by default, so check that SpeechRecognition (or webkitSpeechRecognition) exists before relying on it.
Duration: Chrome's implementation may stop after a period of silence or after extended continuous use. You need to handle the onend event and restart if continuous recognition is required.
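The restart-on-end pattern can be sketched roughly like this. keepListening is a hypothetical helper name. The sketch assumes you want to give up when the browser reports a permission error ('not-allowed' or 'service-not-allowed'), since restarting in that case would loop forever.

```javascript
// Keep a recognition session alive by restarting it whenever it ends,
// until the caller explicitly stops it. `recognition` is expected to be a
// SpeechRecognition instance (or any object with start/stop methods).
function keepListening(recognition) {
  let active = true;

  recognition.onerror = (event) => {
    // Permission errors will not fix themselves, so stop restarting.
    if (event.error === 'not-allowed' || event.error === 'service-not-allowed') {
      active = false;
    }
  };

  recognition.onend = () => {
    if (active) {
      recognition.start(); // the engine stopped on its own: resume
    }
  };

  recognition.start();

  // Returned function ends the session for good.
  return function stop() {
    active = false; // cleared first so onend does not restart
    recognition.stop();
  };
}
```

Call const stop = keepListening(recognition) in place of recognition.start(), and call stop() when the user is done. Each restart begins a new session with an empty results list, so copy finalized text somewhere of your own if you need the whole transcript.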
Practical applications
Accessibility: Voice input for users who can't type. Forms, search, text editors. The simplest accessibility improvement you can add to a text input.
Note taking: Dictate meeting notes, voice journal entries, or draft emails by speaking. Faster than typing for most people (average speaking rate: 130 wpm vs average typing rate: 40 wpm).
Transcription: Live captioning for video calls or presentations. Not production-quality, but useful as a starting point that can be edited.
Voice commands: "Add to cart," "scroll down," "play next." Combined with basic natural language processing, voice commands can control web applications without complex infrastructure.
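For voice commands, even plain substring matching goes a long way. The sketch below is not real natural language processing. It normalizes the transcript and looks for known phrases. The phrases and handlers are illustrative placeholders, and the normalization assumes English text.

```javascript
// Lowercase, strip punctuation, and collapse whitespace so that
// "Scroll down." and "scroll  down" compare equal.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Return the handler for the first command phrase found in the
// transcript, or null if none matches. `commands` maps phrase -> function.
function matchCommand(transcript, commands) {
  const spoken = normalize(transcript);
  for (const [phrase, handler] of Object.entries(commands)) {
    if (spoken.includes(normalize(phrase))) {
      return handler;
    }
  }
  return null;
}
```

A typical use is to call matchCommand on each final result and run the returned handler if there is one. Matching only final results avoids firing a command on an interim guess that is later revised.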
Improving accuracy
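Grammar hints: The SpeechGrammar interface is meant to let you specify expected words and phrases, which should improve recognition for domain-specific vocabulary. Support is limited, so test whether it changes results in your target browsers before relying on it.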
Post-processing: Apply corrections for common misrecognitions, especially homophones such as "write" vs "right" or "their" vs "there." Using the surrounding words as context helps pick the intended one.
Punctuation: The API doesn't output punctuation by default. You need to add it programmatically, for example based on the pauses between results (the API does not expose intonation), or use a secondary model to insert punctuation.
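Both post-processing ideas can be sketched in a few lines. Everything here is an assumption for illustration: the correction rules are a made-up table that match lowercase text only, and punctuateByPauses expects the caller to measure the silence after each final result and pass it in as pauseAfterMs, since the API does not report pause lengths directly.

```javascript
// Hypothetical context rules: rewrite a homophone only when the word that
// follows suggests the other spelling was meant.
const CORRECTIONS = [
  { pattern: /\bright (down|a note|an email)\b/g, replacement: 'write $1' },
  { pattern: /\bthere (own|names?)\b/g, replacement: 'their $1' },
];

// Apply every rule in order and return the corrected text.
function applyCorrections(text, rules = CORRECTIONS) {
  return rules.reduce(
    (out, { pattern, replacement }) => out.replace(pattern, replacement),
    text
  );
}

// Join segments into sentences: a pause of at least `sentencePauseMs`
// after a segment ends the sentence; shorter pauses become a space.
// `segments` is an array of { text, pauseAfterMs } supplied by the caller.
function punctuateByPauses(segments, sentencePauseMs = 1000) {
  const spoken = segments.filter((s) => s.text.trim() !== '');
  let out = '';
  let startOfSentence = true;
  spoken.forEach((segment, i) => {
    let text = segment.text.trim();
    if (startOfSentence) {
      text = text[0].toUpperCase() + text.slice(1);
    }
    const isLast = i === spoken.length - 1;
    const endsSentence = isLast || segment.pauseAfterMs >= sentencePauseMs;
    out += text + (endsSentence ? '.' : '') + (isLast ? '' : ' ');
    startOfSentence = endsSentence;
  });
  return out;
}
```

The one-second threshold is a guess and will need tuning per speaker. A rule table like this only handles the cases you anticipate, so treat it as a starting point, not a general fix.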
I built a speech-to-text tool at zovo.one/free-tools/speech-to-text that captures continuous speech, displays real-time interim results, supports multiple languages, and lets you copy or download the transcript. No uploads, no accounts, no API keys.
I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.