DEV Community

Arun Prakash Pandey

What happened when I used the Web Speech API?

Behind the scenes: The goal

I wanted to build a web app for my personal use that would:

  • Continuously listen to the user’s speech
  • Convert it to text in real-time
  • Track the frequency of a specific word (e.g., “Shiv”)
  • Display that count on screen and store it locally
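The counting and storage part of that list can be sketched independently of any speech engine. This is a minimal sketch, not the app's actual code: the names `countWord`, `CountStore`, and `addToStoredCount` are made up for illustration, and the tokenizer assumes words are separated by whitespace or common punctuation.

```typescript
// Count case-insensitive, whole-word occurrences of `word` in `transcript`.
// Assumption: tokens are separated by whitespace or common punctuation.
function countWord(transcript: string, word: string): number {
  const target = word.toLowerCase();
  return transcript
    .toLowerCase()
    .split(/[\s.,!?;:"()]+/)
    .filter((token) => token === target).length;
}

// Hypothetical storage shape: the getItem/setItem subset of localStorage.
interface CountStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Add the occurrences found in `transcript` to the stored total and return it.
function addToStoredCount(
  store: CountStore,
  key: string,
  transcript: string,
  word: string
): number {
  // A missing or non-numeric stored value is treated as zero.
  const previous = Number(store.getItem(key) ?? "0") || 0;
  const total = previous + countWord(transcript, word);
  store.setItem(key, String(total));
  return total;
}
```

In a browser, `window.localStorage` could be passed as the `store` argument, and the returned total written to the page.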

Simple idea. Surprisingly tricky to implement.

Expectations vs Reality

The app technically worked using the Web Speech API. It listened, transcribed, and returned results. However, instead of processing each word as it was spoken, it:

  • Listened to full chunks of speech
  • Waited for pauses or silence
  • Then returned a batch of words all at once

Even with { continuous: true }, the speech recognition didn’t behave the way I expected: there were no real-time, word-by-word updates.
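One setting worth checking here is `interimResults`, which the Web Speech API provides alongside `continuous`. The post does not say whether it was enabled, so treat this as a sketch under that assumption. When it is on, `onresult` events carry provisional results that the recognizer may later revise, so only final results are safe to count. The types `ResultLike` and `ResultEventLike` and the function `splitTranscript` are illustrative names that mirror the parts of the result event used here.

```typescript
// Minimal shapes mirroring the fields of a speech recognition result event
// that this sketch reads (illustrative types, not the DOM typings).
type ResultLike = ArrayLike<{ transcript: string }> & { isFinal: boolean };
interface ResultEventLike {
  resultIndex: number; // index of the first result that changed in this event
  results: ArrayLike<ResultLike>;
}

// Separate text the recognizer has committed to from text it may still revise.
function splitTranscript(event: ResultEventLike): {
  finalText: string;
  interimText: string;
} {
  let finalText = "";
  let interimText = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    // Use the first (top-ranked) alternative of each result.
    if (result.isFinal) finalText += result[0].transcript;
    else interimText += result[0].transcript;
  }
  return { finalText, interimText };
}

// Browser wiring, shown as comments because it needs a real recognizer:
//   recognition.continuous = true;
//   recognition.interimResults = true;
//   recognition.onresult = (e) => {
//     const { finalText, interimText } = splitTranscript(e);
//     // count words in finalText only; show interimText as a live preview
//   };
```

Interim text arrives sooner, but it can change between events, so counting it directly risks counting the same word twice.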

Findings

The Web Speech API is not reliable for low-latency, high-frequency word detection:

  1. It buffers audio and processes it after short silences.
  2. It often fails to keep up with fast, continuous speech.
  3. When the same word is repeated (e.g., "Shiv Shiv Shiv Shiv Shiv"), the recognizer often merges or drops the duplicates.

Solution

Stream Audio → Transcribe → Return Text

  1. Use a cloud-based speech-to-text API.
  2. Alternatively, run an OpenAI Whisper STT model on a local server.
  3. Stream the voice data to the backend over a WebSocket.
  4. The backend transcribes the speech and sends the response to the browser.
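The browser side of that pipeline could look roughly like the sketch below. It is not a finished implementation: `createPipeline` is a made-up name, the backend URL in the comments is a placeholder, and the reply format (a JSON object with a `text` field) is an assumption about a backend that does not exist yet.

```typescript
// Glue between an audio source, a socket, and a transcript consumer.
// `send` forwards a chunk to the backend; `onTranscript` receives each
// transcribed segment. Both are supplied by the caller.
function createPipeline<Chunk>(
  send: (chunk: Chunk) => void,
  onTranscript: (text: string) => void
) {
  return {
    // Forward one recorded audio chunk to the backend unchanged.
    handleAudioChunk(chunk: Chunk): void {
      send(chunk);
    },
    // Parse one backend reply. Assumed format: {"text": "..."}.
    // Returns true if a transcript was extracted and delivered.
    handleServerMessage(raw: string): boolean {
      try {
        const parsed = JSON.parse(raw);
        if (parsed && typeof parsed.text === "string") {
          onTranscript(parsed.text);
          return true;
        }
      } catch {
        // Malformed replies are ignored in this sketch.
      }
      return false;
    },
  };
}

// Browser wiring, shown as comments because it needs a microphone and a server:
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
//   const recorder = new MediaRecorder(stream);
//   const socket = new WebSocket("ws://localhost:8000/stt"); // placeholder URL
//   const pipeline = createPipeline<Blob>((c) => socket.send(c), updateCount);
//   recorder.ondataavailable = (e) => pipeline.handleAudioChunk(e.data);
//   socket.onmessage = (e) => pipeline.handleServerMessage(e.data);
//   recorder.start(250); // emit a chunk roughly every 250 ms
```

Keeping the glue free of browser objects means it can be exercised with fake chunks and messages before a real backend is available. `updateCount` in the comments stands for whatever function updates the on-screen count.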

Result

I now understand the limitations of the Web Speech API and will restart the project with a fresh approach.
