What happened when I used the Web Speech API?

#webdev #programming #javascript #ai

Behind the Scenes: The Goal

I wanted to build a web app for personal use that would:

Continuously listen to the user’s speech
Convert it to text in real-time
Track the frequency of a specific word (e.g., “Shiv”)
Display that count on screen and store it locally
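
The counting and storing steps can be sketched on their own, independent of where the transcript comes from. This is a minimal sketch, not the app's actual code: the function names `countWord` and `addToStoredCount` and the storage key `wordCount` are made up for illustration, and the storage object is passed in so the same code works with `window.localStorage` in a browser.

```javascript
// Count whole-word, case-insensitive occurrences of `word` in `transcript`.
function countWord(transcript, word) {
  // Escape regex metacharacters so the target word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const matches = transcript.match(new RegExp(`\\b${escaped}\\b`, "gi"));
  return matches ? matches.length : 0;
}

// Add `increment` to the stored count and return the new total.
// `storage` is anything with getItem/setItem, e.g. window.localStorage.
// The key name "wordCount" is a hypothetical example.
function addToStoredCount(storage, increment, key = "wordCount") {
  const current = Number(storage.getItem(key)) || 0;
  const total = current + increment;
  storage.setItem(key, String(total));
  return total;
}
```

In a browser, something like `addToStoredCount(localStorage, countWord(text, "Shiv"))` would update the saved total each time a new piece of transcript arrives.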

Simple idea. Surprisingly tricky to implement.

Expectations vs Reality

The app technically worked using the Web Speech API. It listened, transcribed, and returned results. However, instead of processing each word as it was spoken, it:

Listened to full chunks of speech
Waited for pauses or silence
Then returned a batch of words all at once

Even with continuous set to true, the speech recognition didn’t behave the way I expected: there were no real-time, word-by-word updates.
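
For reference, a minimal sketch of this kind of setup. In the Web Speech API, `continuous` is a property set on the recognition instance, and a separate `interimResults` property asks the recognizer to also deliver provisional text before a phrase is finalized. The helper names `splitResults` and `startListening` and the `en-US` language are assumptions for illustration. Whether interim results would be fast or accurate enough for word counting is not something this post tested.

```javascript
// Split a SpeechRecognitionEvent into finalized and interim text.
// Only results from event.resultIndex onward are new or changed.
function splitResults(event) {
  let finalText = "";
  let interimText = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const text = result[0].transcript; // first (best) alternative
    if (result.isFinal) finalText += text;
    else interimText += text;
  }
  return { finalText, interimText };
}

// Browser-only wiring; returns null where the API is unavailable.
function startListening(onText) {
  const Recognition =
    globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
  if (!Recognition) return null;
  const recognition = new Recognition();
  recognition.continuous = true;     // keep listening across pauses
  recognition.interimResults = true; // also deliver provisional text
  recognition.lang = "en-US";        // assumed language
  recognition.onresult = (event) => onText(splitResults(event));
  recognition.start();
  return recognition;
}
```

Only the finalized text should feed the counter, since interim text can be revised by the recognizer before it is marked final.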

Findings

The Web Speech API is not reliable for low-latency, high-frequency word detection:

The Web Speech API buffers audio and processes it after short silences.
It often fails to keep up with fast, continuous speech.
Repeating the same word (e.g., "Shiv Shiv Shiv Shiv Shiv") often causes the recognizer to combine or ignore duplicates.

Solution

Stream Audio → Transcribe → Return Text

Use a cloud-based speech-to-text API, or run an OpenAI Whisper speech-to-text model on a local server.
Stream the voice data to the backend over a WebSocket.
The backend transcribes the speech and sends the response back to the browser.
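
The browser side of this pipeline could look roughly like the sketch below. It is a sketch under stated assumptions, not a finished implementation: the function name `streamAudio`, the 250 ms chunk interval, the `ws://localhost:8000/transcribe` URL, and the `{"text": "..."}` reply format are all hypothetical, and the backend itself is not shown. The recorder and socket are passed in as parameters so the logic stays separate from browser setup.

```javascript
const SOCKET_OPEN = 1; // same numeric value as WebSocket.OPEN

// Forward recorded audio chunks to the backend and surface transcripts.
// `recorder` is a MediaRecorder and `socket` a WebSocket, both injected.
function streamAudio(recorder, socket, onTranscript, timesliceMs = 250) {
  recorder.ondataavailable = (event) => {
    // Skip empty chunks and avoid sending on a socket that is not open.
    if (event.data.size > 0 && socket.readyState === SOCKET_OPEN) {
      socket.send(event.data);
    }
  };
  socket.onmessage = (event) => {
    // Assumed message shape from the backend: {"text": "..."}
    const message = JSON.parse(event.data);
    if (typeof message.text === "string") onTranscript(message.text);
  };
  recorder.start(timesliceMs); // emit a chunk roughly every `timesliceMs` ms
}

// Browser-only usage (the URL is a placeholder):
// const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// const socket = new WebSocket("ws://localhost:8000/transcribe");
// socket.onopen = () =>
//   streamAudio(new MediaRecorder(stream), socket, console.log);
```

How the backend buffers the incoming chunks and feeds them to the transcription model is left open here.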

Result

I now understand the limitations of the Web Speech API and will restart the project with a fresh approach.

DEV Community