
Michael Lip

Posted on • Originally published at zovo.one

The Web Speech API Is Built Into Your Browser (and It's Actually Useful)

There's a fully functional text-to-speech engine built into every modern browser. No API keys, no third-party services, no npm packages. It's the Web Speech API, specifically the SpeechSynthesis interface, and it works with two lines of JavaScript.

const utterance = new SpeechSynthesisUtterance('Hello, world');
speechSynthesis.speak(utterance);

That's it. The browser speaks the text aloud using system voices. I didn't know this API existed until I needed to build an accessibility feature for a reading application, and I was genuinely surprised by how capable it is.

Getting the available voices

Each browser and operating system provides a different set of voices. macOS ships with dozens. Windows has a handful. Chrome on Android uses Google's voices. You can list them:

function getVoices() {
  return new Promise(resolve => {
    let voices = speechSynthesis.getVoices();
    if (voices.length) {
      resolve(voices);
      return;
    }
    speechSynthesis.onvoiceschanged = () => {
      resolve(speechSynthesis.getVoices());
    };
  });
}

getVoices().then(voices => {
  voices.forEach(voice => {
    console.log(`${voice.name} (${voice.lang}) ${voice.localService ? 'local' : 'remote'}`);
  });
});

The Promise wrapper is necessary because getVoices() returns an empty array on the first call in some browsers. The voices are loaded asynchronously, and the onvoiceschanged event fires when they're ready.

Common voices you'll see:

  • macOS: "Samantha", "Alex", "Karen" (Australian), "Daniel" (British), plus many non-English voices
  • Windows: "Microsoft David", "Microsoft Zira", "Microsoft Mark"
  • Chrome: "Google US English", "Google UK English Male/Female"

The "remote" voices (like Google's) sound significantly more natural than the local system voices because they use cloud-based synthesis. However, they require an internet connection and may have rate limits.
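If your application needs to keep working offline, you can prefer local voices explicitly. A minimal sketch (pickOfflineVoice is an illustrative helper, not part of the API; it works on plain objects, so it also runs outside the browser):

```javascript
// Prefer an on-device voice (localService === true) for a language prefix,
// falling back to any voice in that language, or null if none match.
function pickOfflineVoice(voices, langPrefix) {
  const inLang = voices.filter(v => v.lang.startsWith(langPrefix));
  return inLang.find(v => v.localService) || inLang[0] || null;
}

// Browser usage:
// utterance.voice = pickOfflineVoice(speechSynthesis.getVoices(), 'en');
```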

Controlling speech parameters

The SpeechSynthesisUtterance object has several configurable properties:

const utterance = new SpeechSynthesisUtterance('Hello, world');

// Speed: 0.1 to 10, default is 1
utterance.rate = 1.2;

// Pitch: 0 to 2, default is 1
utterance.pitch = 1.0;

// Volume: 0 to 1, default is 1
utterance.volume = 0.8;

// Language (BCP 47 tag)
utterance.lang = 'en-US';

// Specific voice
const voices = speechSynthesis.getVoices();
utterance.voice = voices.find(v => v.name === 'Google US English');

speechSynthesis.speak(utterance);

Rate values between 0.8 and 1.5 sound natural. Below 0.8 sounds artificially slow, and above 2.0 becomes unintelligible. For accessibility applications where users listen to long text, letting them control the rate is essential -- preferences vary widely.
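If you expose a speed control, it helps to clamp whatever the UI reports into the range that actually sounds usable. A sketch under those assumptions (clampRate and the 0.5-2 bounds are illustrative choices, not part of the API):

```javascript
// Clamp a requested rate into a range that stays intelligible.
// The spec allows 0.1 to 10, but values outside ~0.5-2 rarely sound good.
function clampRate(value, min = 0.5, max = 2) {
  const n = Number(value);
  if (Number.isNaN(n)) return 1; // fall back to the default rate
  return Math.min(max, Math.max(min, n));
}

// Browser wiring (assumes an <input type="range" id="rate"> element):
// const utterance = new SpeechSynthesisUtterance(text);
// utterance.rate = clampRate(document.getElementById('rate').value);
// speechSynthesis.speak(utterance);
```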

Events for synchronization

The utterance fires events that let you track progress:

utterance.onstart = () => console.log('Started speaking');
utterance.onend = () => console.log('Finished speaking');
utterance.onpause = () => console.log('Paused');
utterance.onresume = () => console.log('Resumed');
utterance.onerror = (e) => console.error('Error:', e.error);

// Word boundary events (not supported in all browsers)
utterance.onboundary = (e) => {
  // e.name is the boundary type ('word' or 'sentence'),
  // e.charIndex is the character offset into the utterance text
  console.log(`${e.name} boundary at character ${e.charIndex}`);
};

The onboundary event is particularly useful for building a "karaoke-style" highlight that follows along with the speech, highlighting each word as it's spoken. Browser support is inconsistent -- Chrome supports it well, Firefox partially, Safari less so.
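Inside onboundary you get a character offset, not the word itself, so the highlight logic needs to look the word up in the utterance text. A sketch of that lookup (wordAt is a hypothetical helper; the actual DOM highlighting is app-specific):

```javascript
// Return the word starting at charIndex in text (empty string if none).
function wordAt(text, charIndex) {
  const match = text.slice(charIndex).match(/^\S+/);
  return match ? match[0] : '';
}

// Browser usage:
// utterance.onboundary = (e) => {
//   if (e.name === 'word') highlight(wordAt(utterance.text, e.charIndex));
// };
```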

Pause, resume, and cancel

// Pause current speech
speechSynthesis.pause();

// Resume paused speech
speechSynthesis.resume();

// Stop everything and clear the queue
speechSynthesis.cancel();

The speech synthesis maintains a queue. If you call speak() while speech is in progress, the new utterance is added to the queue and speaks after the current one finishes. Call cancel() to clear the queue and stop immediately.
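A common pattern built on this is "interrupt and replace": cancel first so a new utterance speaks immediately instead of queueing. A sketch with the browser globals injected as parameters purely so the logic can be exercised outside the browser (speakExclusive is an illustrative name):

```javascript
// Stop whatever is speaking or queued, then speak a single new utterance.
// synth and Utterance default to the browser globals but can be injected.
function speakExclusive(text, {
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance,
} = {}) {
  synth.cancel();                    // stop current speech, drop the queue
  const utterance = new Utterance(text);
  synth.speak(utterance);
  return utterance;
}
```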

Practical use cases

Accessibility. The most important use case. Users with visual impairments, reading disabilities, or attention difficulties benefit from having text read aloud. Adding a "listen to this article" button is straightforward:

function readArticle() {
  const text = document.querySelector('article').textContent;
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = 1.0;
  speechSynthesis.speak(utterance);
}

Language learning applications. Hearing the correct pronunciation of words and sentences is essential for language learners. The Web Speech API supports dozens of languages with native-sounding voices.
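For a language-learning flow, the key is setting lang so the engine picks a voice for that BCP 47 tag. A sketch with injected dependencies for testability (speakPhrase and the 0.9 rate are illustrative choices, not part of the API):

```javascript
// Speak a phrase in a target language, slightly slowed for learners.
// synth and Utterance default to the browser globals but can be injected.
function speakPhrase(phrase, lang, {
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance,
} = {}) {
  const utterance = new Utterance(phrase);
  utterance.lang = lang;   // e.g. 'fr-FR', 'ja-JP'
  utterance.rate = 0.9;    // a touch slower aids comprehension
  synth.speak(utterance);
  return utterance;
}

// Browser usage:
// speakPhrase('Bonjour, comment allez-vous ?', 'fr-FR');
```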

Proofreading. Hearing your writing read aloud catches errors that silent reading misses. Awkward phrasing, missing words, and run-on sentences become obvious when spoken.

Notifications and alerts. For applications where the user's attention might be elsewhere (monitoring dashboards, IoT control panels), a spoken alert supplements visual notifications.

Browser quirks and limitations

There are a few pain points to know about:

Chrome's autoplay policy. Chrome requires a user gesture (click, tap) before speechSynthesis.speak() will work. Calling it on page load produces silence. This is the same policy that blocks autoplay audio and video.

The Chrome 15-second bug. In some Chrome versions, utterances longer than about 15 seconds of audio are cut off silently. The workaround is to split long text into sentences or chunks and queue them individually:

function speakLongText(text) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  sentences.forEach(sentence => {
    const utterance = new SpeechSynthesisUtterance(sentence.trim());
    speechSynthesis.speak(utterance);
  });
}
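Sentence splitting still leaves one gap: a single run-on sentence can itself exceed the cutoff. A further sketch that breaks long runs at word boundaries instead (chunkText and the 200-character default are illustrative choices):

```javascript
// Split text into chunks of at most maxLen characters,
// breaking only at whitespace so no word is cut in half.
function chunkText(text, maxLen = 200) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let current = '';
  for (const word of words) {
    if (current && current.length + 1 + word.length > maxLen) {
      chunks.push(current);
      current = word;
    } else {
      current = current ? current + ' ' + word : word;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Browser usage:
// chunkText(longText).forEach(chunk =>
//   speechSynthesis.speak(new SpeechSynthesisUtterance(chunk)));
```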

Mobile behavior. On iOS, only one voice is available per language, and the rate property has a narrower effective range. On Android, available voices depend on the installed TTS engine (usually Google's).

No SSML support. The Web Speech API does not support SSML (Speech Synthesis Markup Language), which means you can't control pronunciation of specific words, add pauses at specific positions, or use phonetic spelling. For applications that need this level of control, you'll need a cloud TTS service like Google Cloud TTS or Amazon Polly.
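Without SSML, the usual workaround is to pre-process the text yourself: expand abbreviations and symbols into words the synthesizer pronounces correctly. A minimal sketch (the replacement table and preprocessForSpeech are illustrative, not a standard API):

```javascript
// Expand tokens the synthesizer tends to mispronounce.
// This table is an example; extend it for your own content.
const REPLACEMENTS = [
  [/\bAPI\b/g, 'A P I'],
  [/\be\.g\./g, 'for example'],
  [/\betc\./g, 'et cetera'],
];

function preprocessForSpeech(text) {
  return REPLACEMENTS.reduce((t, [pattern, word]) => t.replace(pattern, word), text);
}

// Browser usage:
// speechSynthesis.speak(new SpeechSynthesisUtterance(preprocessForSpeech(text)));
```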

Beyond the browser API

When the Web Speech API isn't sufficient, the cloud alternatives are:

  • Google Cloud Text-to-Speech: high-quality WaveNet voices, SSML support, pay per character
  • Amazon Polly: similar quality, Neural and Standard voices, pay per character
  • Azure Cognitive Services TTS: strong multilingual support, custom voice training

These produce audio files (MP3, WAV) rather than real-time browser speech, which means they can be cached, used offline, and embedded in any context.

For trying out text-to-speech with different voices and settings without writing code, I built a text-to-speech tool at zovo.one/free-tools/text-to-speech that uses the browser's built-in voices with controls for rate, pitch, and voice selection.

The Web Speech API is one of those browser features that's been available for years but remains underused. It's not perfect -- the voice quality varies, the browser support has quirks, and SSML is missing. But for accessibility features, language learning tools, and proofreading aids, it's a capable tool that requires zero infrastructure. That's hard to beat.


I'm Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.
