
MaikiDev

Client side audio transcription using Parakeet v3 and WebGPU

Processing audio files into text usually requires sending personal data to an external server. That approach always bothered me because of the privacy implications and the recurring API costs. As browser technologies advanced over the last few years, I started looking into ways to handle speech recognition locally without relying on external servers at all.

OpenAI released Whisper a while ago and it quickly became the standard for open source transcription. Developers did incredible work porting it to run in the browser using WebAssembly and WebGPU. I initially tried building my project around Whisper. The accuracy is great, but the hardware demands are very high.

Running Whisper locally in a browser tab often causes the entire page to freeze or lag. It effectively requires a dedicated GPU to run at a reasonable speed. If you try to run a medium Whisper model on a standard laptop CPU, the transcription can easily take much longer than the audio itself. That makes for a frustrating experience when someone just wants to transcribe a ten minute meeting.

Transcrisper — Free Unlimited Audio & Video AI Transcription

I started searching for lighter alternatives and discovered NVIDIA Parakeet v3. It is a highly optimized acoustic model designed specifically for speed and efficiency. To get it working in a web environment, I integrated a library called parakeet.js. This setup changed the performance profile of my project entirely.

The most noticeable difference between Parakeet and Whisper in the browser is raw execution speed. Parakeet processes audio files significantly faster. Because the model architecture is far more efficient, it does not rely exclusively on heavy WebGPU compute pipelines; it runs at very decent speed on a standard CPU.

This is a massive benefit for web development. Most people browsing the web do not have a dedicated graphics card. Being able to transcribe an hour of audio on a basic office laptop using just the processor makes local machine learning much more accessible to the average person.
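In practice this means you can feature-detect WebGPU and fall back to the CPU path instead of failing outright. A minimal sketch of that decision, where the backend strings ("webgpu", "wasm") follow onnxruntime-web naming conventions and are an assumption here, not necessarily the exact option parakeet.js expects:

```javascript
// Pick an execution backend before loading the model.
// The names "webgpu" and "wasm" are illustrative (onnxruntime-web style);
// check the parakeet.js docs for the option it actually takes.
function pickBackend(nav = globalThis.navigator) {
  // WebGPU is exposed as navigator.gpu in supporting browsers.
  if (nav && "gpu" in nav) return "webgpu";
  // Everywhere else, fall back to the CPU (WebAssembly) path.
  return "wasm";
}
```

Because the CPU path is fast enough with Parakeet, the fallback is a usable experience rather than a last resort.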


The efficiency of parakeet.js also extends to mobile devices. Running Whisper on a phone browser usually crashes the tab immediately due to strict memory limits imposed by mobile operating systems. Parakeet has a much smaller memory footprint. I tested it on several recent mobile phones and the models load and run successfully. You can record a voice memo on your phone and transcribe it directly in your mobile browser without uploading anything to a cloud provider.

I put this technology into a web application called Transcrisper. The goal was to make a simple interface where anyone can drop an audio or video file and get text back. The entire pipeline executes locally. Your media file never leaves your hard drive. No server uploads and no backend databases are storing your private conversations.
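The first step in a pipeline like this is decoding the media in the browser and converting it to the format the model expects; ASR models such as Parakeet typically consume 16 kHz mono PCM, while Web Audio API buffers are usually 44.1 or 48 kHz stereo. A minimal sketch of that conversion, assuming you already have decoded channel data (the function name and linear-interpolation approach are my own illustration, not Transcrisper's actual code):

```javascript
// Downmix decoded channel data to mono and resample to the target rate.
// channels: array of Float32Array, one per channel, all the same length.
function toMono16k(channels, inputRate, targetRate = 16000) {
  const frames = channels[0].length;
  // Average all channels into one mono track.
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    let sum = 0;
    for (const ch of channels) sum += ch[i];
    mono[i] = sum / channels.length;
  }
  if (inputRate === targetRate) return mono;
  // Simple linear-interpolation resampling.
  const outFrames = Math.round((frames * targetRate) / inputRate);
  const out = new Float32Array(outFrames);
  const ratio = inputRate / targetRate;
  for (let i = 0; i < outFrames; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, frames - 1);
    const t = pos - i0;
    out[i] = mono[i0] * (1 - t) + mono[i1] * t;
  }
  return out;
}
```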


I implemented speaker diarization so the output identifies exactly when different people are talking in the audio track. This feature is usually locked behind expensive subscription tiers on commercial platforms. The application also generates standard text files and SRT files for video subtitles. Since the heavy lifting happens on the user device, I do not have to pay for server compute time. This means I can offer the tool completely for free with no artificial limits on file size or length.
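Generating SRT from diarized segments is mostly a matter of formatting timestamps. A small sketch, where the segment shape (start, end, speaker, text) is a hypothetical data model for illustration:

```javascript
// Render diarized transcript segments as an SRT subtitle file.
// segments: [{ start: seconds, end: seconds, speaker?: string, text: string }]
function toSrt(segments) {
  // SRT timestamps look like 00:01:02,345 (comma before milliseconds).
  const ts = (sec) => {
    const ms = Math.round(sec * 1000);
    const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
    const m = String(Math.floor(ms / 60000) % 60).padStart(2, "0");
    const s = String(Math.floor(ms / 1000) % 60).padStart(2, "0");
    const f = String(ms % 1000).padStart(3, "0");
    return `${h}:${m}:${s},${f}`;
  };
  return segments
    .map((seg, i) =>
      `${i + 1}\n${ts(seg.start)} --> ${ts(seg.end)}\n` +
      (seg.speaker ? `[${seg.speaker}] ` : "") + seg.text)
    .join("\n\n") + "\n";
}
```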

Managing browser memory is still the main challenge when building client side tools. The browser has to download the model weights on the first visit. I used the Cache API to store these files locally on the hard drive. Subsequent visits load the model directly from the browser cache, which makes the application ready to use instantly without downloading megabytes of data again.
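The cache-then-fetch pattern is only a few lines with the Cache API. A minimal sketch, assuming a hypothetical weights URL and cache name; the first call downloads and stores the file, subsequent calls are served from disk:

```javascript
// Fetch a model file, serving it from the Cache API after the first visit.
// "model-cache-v1" is an illustrative cache name.
async function fetchModelFile(url, cacheName = "model-cache-v1") {
  const cache = await caches.open(cacheName);
  const hit = await cache.match(url);
  if (hit) return hit.arrayBuffer();   // already on disk, no network
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  await cache.put(url, res.clone());   // persist for the next visit
  return res.arrayBuffer();
}
```

Bumping the cache name (v1 → v2) is a simple way to invalidate old weights when you ship a new model version.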

You also have to be careful with garbage collection in JavaScript when passing large audio buffers around. I spent a lot of time optimizing how the audio chunks are fed into the model so the tab does not run out of memory on long podcast episodes.
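The basic idea behind that chunking can be sketched as a generator over fixed-length windows (30-second windows at 16 kHz are an arbitrary illustrative choice, not necessarily what my app uses):

```javascript
// Yield the audio in fixed-length windows instead of one giant buffer.
// subarray() returns views, so no per-chunk copy of the audio is made;
// only the model's intermediate buffers for one window need to fit in
// memory at a time.
function* chunkAudio(samples, sampleRate = 16000, chunkSeconds = 30) {
  const step = sampleRate * chunkSeconds;
  for (let offset = 0; offset < samples.length; offset += step) {
    yield samples.subarray(offset, Math.min(offset + step, samples.length));
  }
}
```

Feeding the model one window at a time keeps peak memory roughly constant regardless of how long the podcast episode is.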

Moving machine learning to the client side solves major privacy concerns and eliminates expensive server costs. I think we will see many more applications adopt this local first approach as browser standards improve. You can try it out here. I am very interested to hear how it performs on different hardware setups, especially older CPUs and mobile devices. Let me know your thoughts in the comments.
