I thought this was just an hour of work. It wasn't. Here's everything that went wrong and what actually worked in the end.
Starting with the obvious choice
window.SpeechRecognition (still prefixed as webkitSpeechRecognition in Chrome) is built into the browser. No API keys, no server to run, completely free. It felt like the perfect starting point for adding voice search to my extension.
After a day of testing, I dropped it.
The mic permission popup was all over the place. Sometimes it would show up randomly mid-session. Sometimes it would just stop listening with no error, no warning, nothing. Results were different depending on the machine and OS. For a new tab extension that's meant to feel polished, that just wasn't good enough.
I switched to Deepgram's Nova-3 model. It's fast, accurate, and sends results back in real time over WebSocket, so you don't have to wait for a full sentence before seeing any output.
But now I needed a server, which kicked off a whole chain of problems.
Why you can't call Deepgram directly from the extension
Chrome extensions are easy to unpack and read. Any secret you put in the code is basically public. So keeping a Deepgram API key in the client was never going to work.
I needed a server in the middle. My first thought was Railway. I've used it before, and getting a Node server running there takes maybe five minutes. But I second-guessed that pretty quickly: a Railway server cold-starts sometimes, costs money every month, and its latency depends on where the server happens to be hosted. Cloudflare Workers runs at the edge, close to the user, and the free plan is generous enough that a voice search API would run for free indefinitely. Easy choice.
The setup I had in mind: the extension sends raw audio to the Worker, the Worker passes it to Deepgram, and the text comes back the same way.
Simple enough on paper. Actually building it was another story.
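As a sketch, the Worker's half of that pipe looks something like this. The query parameters and the subprotocol-based auth follow Deepgram's live-streaming API, but treat the exact values as assumptions to check against their docs rather than the post's actual code:

```javascript
// Builds the Deepgram live-streaming URL the Worker connects to.
// Parameter names (model, encoding, sample_rate, interim_results)
// come from Deepgram's streaming API; the values are assumptions.
function buildDeepgramUrl({ model = "nova-3", sampleRate = 16000 } = {}) {
  const params = new URLSearchParams({
    model,
    encoding: "linear16",      // raw 16-bit PCM sent by the extension
    sample_rate: String(sampleRate),
    interim_results: "true",   // stream partial results as they arrive
  });
  return `wss://api.deepgram.com/v1/listen?${params}`;
}

// Inside the Worker, roughly — the key lives in a Worker secret,
// never in the extension bundle:
// const deepgramSocket = new WebSocket(buildDeepgramUrl(), [
//   "token",
//   env.DEEPGRAM_API_KEY,
// ]);
```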
Hono looked like the right tool. It wasn't.
I use Hono for pretty much everything on Cloudflare Workers. Clean, fast, easy routing. So I reached for it without thinking twice.
Set up a /transcribe route, handled the WebSocket upgrade, deployed it. The Worker just froze. No error from my code, just Cloudflare eventually killing the request with a 500 and a message saying the Worker "would never generate a response."
Took me a while to figure out what was going on. Cloudflare requires a special webSocket property on the Response object to complete a WebSocket upgrade. That's how it knows which socket to hand back to the client.
Hono wraps the built-in Response to add its own features. Somewhere in that process, the webSocket property gets lost. Cloudflare gets a response with no socket, doesn't know what to do, and kills the request.
The fix was to skip Hono for this route and just write it raw:
export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    if (url.pathname === "/transcribe") {
      const pair = new WebSocketPair();
      const [client, server] = pair;
      server.accept();
      return new Response(null, { status: 101, webSocket: client });
    }
    return new Response("Not found", { status: 404 });
  },
};
101 Switching Protocols. Finally.
The timing bug I should've seen coming
Audio was flowing, but the Worker kept logging that Deepgram wasn't ready. Pretty obvious in hindsight.
The moment the client gets a 101 back, it starts sending audio right away. But the Worker had only just started its own connection to Deepgram. That connection wasn't open yet. So audio was showing up before Deepgram was ready to take it.
My first fix was to wait for the Deepgram connection to open before sending the 101 back:
await new Promise((resolve) => {
  deepgramSocket.onopen = resolve;
});
return new Response(null, { status: 101, webSocket: client });
This worked. But it quietly created a new problem I didn't notice until later.
Cloudflare's 10ms CPU limit is real
The free plan on Cloudflare Workers gives you 10ms of CPU time per request. Not total time, just active processing time. Waiting on a network call doesn't count, so in theory waiting for Deepgram to connect should be fine.
In practice, keeping the request open through the whole connection process added enough overhead that CPU usage was sitting at 40 to 90ms. Cloudflare was going to start dropping requests.
The fix was to stop waiting. Return the 101 right away, let Deepgram connect in the background, and just hold any audio that comes in too early:
let deepgramReady = false;
const pendingChunks = [];

deepgramSocket.addEventListener("open", () => {
  deepgramReady = true;
  for (const chunk of pendingChunks) deepgramSocket.send(chunk);
  pendingChunks.length = 0;
});

server.addEventListener("message", (event) => {
  if (!deepgramReady) {
    pendingChunks.push(event.data);
  } else {
    deepgramSocket.send(event.data);
  }
});

return new Response(null, { status: 101, webSocket: client });
CPU usage dropped to around 2ms. Deepgram still connects, it just doesn't hold everything else up while it does.
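The snippets above only cover audio flowing in; the other half is relaying Deepgram's JSON results back to the extension over the same client socket. The field names here (channel.alternatives, is_final) follow the shape of Deepgram's streaming responses, but this is a hedged sketch, not the post's actual code:

```javascript
// Pull the transcript out of one Deepgram streaming message.
// Returns null for messages with no usable text (e.g. metadata frames).
function extractTranscript(msg) {
  const transcript = msg?.channel?.alternatives?.[0]?.transcript;
  if (!transcript) return null;
  return { transcript, isFinal: msg.is_final === true };
}

// Wired into the Worker, roughly:
// deepgramSocket.addEventListener("message", (event) => {
//   const out = extractTranscript(JSON.parse(event.data));
//   if (out) server.send(JSON.stringify(out));
// });
```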
AudioWorklet and Manifest V3 don't work together
This was the most painful part of the whole thing.
MV3 extensions only let you run scripts that are part of your own extension package. No inline scripts, no scripts from other websites, nothing generated at runtime. AudioWorkletNode is the standard modern way to process audio in a background thread, and it needs you to load a separate file using addModule(url). You can already see where this is going.
First try was using Vite's ?url import:
import processorUrl from "./processor.js?url";
// Output: data:text/javascript;base64,...
Blocked. Makes sense, data URIs aren't allowed.
Second try was putting the file in the public/ folder and loading it with chrome.runtime.getURL:
const url = chrome.runtime.getURL("audio/pcm-processor.js");
await ctx.audioWorklet.addModule(url);
This really should have worked. A chrome-extension:// URL is from your own extension, so the policy should allow it. I was pretty sure this was the answer.
It failed with AbortError: Unable to load worklet module. No extra info, no useful error message.
After going through Chromium bug reports, turns out this is a known broken behaviour in how Chrome handles extension files in Worklet threads. There are open issues, no fix in sight.
The old deprecated API that actually works
I gave up on AudioWorklet and used ScriptProcessorNode instead. It's been marked as deprecated for years but it runs directly in the page using a callback, so there's no extra file to load and no policy issues. For converting Float32 audio to Int16, the performance hit is basically nothing.
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
  const input = e.inputBuffer.getChannelData(0);
  const int16 = float32ToInt16(input);
  ws.send(int16);
};
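The snippet calls float32ToInt16, which isn't shown; a minimal version looks like this. Web Audio hands you samples as floats in [-1, 1], and Deepgram's linear16 encoding wants signed 16-bit integers:

```javascript
// Convert Web Audio Float32 samples ([-1, 1]) to signed 16-bit PCM.
function float32ToInt16(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to avoid overflow
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}
```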
Works every time without fail. The modern recommended way broke. The old way just works. That's extension development.
A few smaller things worth knowing
Cleaning up properly is harder than it sounds. Web Audio can leak memory if you don't clean up after yourself. When voice search stops, you need to close the WebSocket, disconnect the ScriptProcessor, stop all microphone tracks (this turns off the mic light in the browser toolbar), and close the AudioContext. Skip any one of these and you'll have audio processes running in the background that you can't see.
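That teardown list can be sketched as one function. The names here (stopVoiceSearch and its arguments) are hypothetical, not from the post; the point is the order, and that nothing on the list gets skipped:

```javascript
// Release everything voice search holds: socket, processor node,
// mic tracks (this is what turns the toolbar mic light off), context.
async function stopVoiceSearch({ ws, processor, stream, audioCtx }) {
  ws?.close();                                   // stop streaming audio out
  processor?.disconnect();                       // detach the ScriptProcessor
  stream?.getTracks().forEach((t) => t.stop());  // release the microphone
  if (audioCtx && audioCtx.state !== "closed") {
    await audioCtx.close();                      // free the audio graph
  }
}
```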
Silence detection matters a lot. Without it, the mic just stays on forever. Every time Deepgram sends back a transcript or a speech_started event, reset a 2.5 second timer. When the timer runs out, stop listening. It makes the whole thing feel much less annoying to use.
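The timer logic is small enough to factor out. makeSilenceTimer is a hypothetical helper sketching the 2.5-second reset-on-activity behaviour described above:

```javascript
// Calls onSilence once no activity has been reported for timeoutMs.
// Call reset() on every transcript or speech_started event.
function makeSilenceTimer(onSilence, timeoutMs = 2500) {
  let timer = null;
  return {
    reset() {
      clearTimeout(timer);
      timer = setTimeout(onSilence, timeoutMs);
    },
    cancel() {
      clearTimeout(timer);
    },
  };
}

// const silence = makeSilenceTimer(() => stopListening(), 2500);
// // (stopListening is whatever your teardown is)
// ws.onmessage = (e) => { silence.reset(); /* handle transcript */ };
```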
Add the Worker URL to host_permissions in your manifest. You need to allow wss://your-worker.workers.dev/* or the browser blocks the connection before it even gets to any policy check.
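In the manifest that's one extra entry; the subdomain here is a placeholder for your own Worker:

```json
{
  "host_permissions": [
    "wss://your-worker.workers.dev/*"
  ]
}
```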
What actually changed
Before all of this, the extension was using window.SpeechRecognition with around 500ms before it even started listening, results that changed between machines, and random failures. After the rewrite, it connects in around 100ms, the Worker uses 1 to 3ms of CPU per request, and it works the same way every time.
Every single problem in this build came from the same place: the standard tools weren't built for this environment. Hono breaks WebSockets on Cloudflare. AudioWorklet breaks in MV3 extensions. Once you understand why something fails, the fix is usually obvious. The hard part is getting there.
Happy Coding!!!
Thank you for reading! If you found this blog post helpful, please consider sharing it with others who might benefit. Feel free to check out my other blog posts and visit my socials!


