WebAssembly has been touted as the next big thing in the web space for the better part of a decade now, but it seems to have struggled to find a clear identity and purpose. Perhaps AI is just the thing to fill that role?
AI and WebAssembly
I'd like to imagine a future where we can leverage AI in a fully decentralized manner, without being compelled to rely on our new technocratic overlords. It's great to see the explosion of local AI tooling in the past few years, but for the most part, these tools need local installation and a significant degree of technical depth.
In an attempt to determine the viability of running multiple AI models on the web simultaneously, I put together a fully in-browser platform for configurable, voice-activated AI assistants. It eats more resources than a crypto miner trying to qualify for a mortgage in SF, but hell, it works.
The Architecture
This project takes advantage of a variety of OSS packages and frameworks to handle the heavy lifting:
- Whisper for Speech-to-Text (STT)
- VITS for Text-to-Speech (TTS)
- Silero for Voice Activity Detection (VAD)
- WebLLM for running language models directly in the browser
I configured these to run in their own independent web workers, each with its own service to handle interaction. A global audio worklet handles microphone input and passes data to the AudioService, which routes it to the respective services for handling.
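As a rough sketch of that wiring (all names here are mine, not the project's actual API), the fan-out from the worklet to the per-model services might look like:

```typescript
// Hypothetical sketch of the audio fan-out described above: the audio
// worklet posts PCM frames to a single router, which forwards each
// frame to every registered consumer (VAD, STT, etc.).
type FrameHandler = (frame: Float32Array) => void;

class AudioRouter {
  private handlers = new Map<string, FrameHandler>();

  register(name: string, handler: FrameHandler): void {
    this.handlers.set(name, handler);
  }

  // Called with each PCM frame arriving from the AudioWorklet's message port.
  onFrame(frame: Float32Array): void {
    for (const handler of this.handlers.values()) {
      handler(frame);
    }
  }
}
```

In the real app each handler would hand the frame to its worker via postMessage; the point here is just that one worklet feeds many consumers.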
The Whisper Problem
What I wanted here was an Alexa-like application but with a wake word that can be arbitrarily set by the user. Along with that, I wanted the ability to have a variety of different Assistants running different voices, configurations, and language models, where you can set a unique wake word for each one.
I had initially looked into wake word AI models and standards, notably openWakeWord. While it defines a standardized structure for wake word detection models, I realized that this route just isn't user-friendly. You can't expect people to train their own word detection models. In the current state of the technology, that is still a fairly involved process.
The only clear solution was to use OpenAI's Whisper model (compiled to WebAssembly; shoutout to Georgi Gerganov for open-sourcing his work here). Whisper can perform Speech-to-Text conversion, but there were two major issues:
#1 - Whisper is a Resource Hog
The first issue is that running Whisper eats up some serious computational resources, so it's not ideal to have it running constantly. The Whisper WASM port includes an example wake word detection implementation, command.wasm, which partially addresses this with VAD, only running transcription once voice activity is detected.
However, it only returns the transcription once voice activity stops completely, meaning it won't yield a continuous stream of text as the user talks after the wake word triggers.
#2 - The Whisper Model Isn't Built for "Streamed" Input
The second issue is that Whisper is a pure input/output model: you give it audio, and it outputs text. So how can we provide a constant "stream" for wake word detection? The most straightforward answer is to take 1-second slices of audio and continually feed them in. However, the wake word might then be cut off between slices. For example, if we were to say...
"The development of full artificial intelligence could spell the end of the human race"
Our 1-second time slices might come out something like "the development of", "full artificial intel-", "ligence could spell", with words sliced in half at the boundaries.
You could extend the slices to 5 seconds to reduce that risk, but then it would take 5 seconds to activate your assistant, and nobody wants that.
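The naive fixed-window approach can be sketched like this (illustrative code, not from the project):

```typescript
// Naive fixed-window slicing: chop the incoming PCM stream into
// non-overlapping 1-second chunks. Any word that straddles a chunk
// boundary gets split across two buffers and may transcribe wrong.
function sliceFixed(samples: Float32Array, sampleRate: number): Float32Array[] {
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += sampleRate) {
    chunks.push(samples.subarray(start, Math.min(start + sampleRate, samples.length)));
  }
  return chunks;
}
```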
The "correct" solution would probably involve modifying how the model works fundamentally, but I'm a software engineer, not an AI engineer. So, I played around a bit and put together a sliding window solution, with word de-duplication.
The Sliding Window Approach
What if we were to provide audio buffers to Whisper for transcription every second, but prepend the prior 1-second buffer to each one? Each 2-second slice would then overlap its neighbors by a full second, so any word cut at one boundary appears whole in the adjacent slice.
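In code, each overlapping buffer can be assembled like this (an illustrative sketch; names are mine, not the project's):

```typescript
// Sliding-window assembly: the buffer handed to Whisper is the previous
// 1-second chunk concatenated with the current one, so every chunk
// boundary is covered by an overlapping transcription. The duplicated
// words this produces are removed in a separate de-duplication pass.
function withOverlap(prev: Float32Array | null, curr: Float32Array): Float32Array {
  if (!prev) return curr; // First chunk has nothing to prepend
  const out = new Float32Array(prev.length + curr.length);
  out.set(prev, 0);
  out.set(curr, prev.length);
  return out;
}
```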
We can trace backwards through the current transcriptions until we find a word that matches the last valid word of the prior transcription. Then we can continue our backward trace until we find a word that doesn't match, and throw that out!
std::string deduplicate_transcription(const std::string &new_text, bool is_final) {
    std::vector<std::string> new_words = split_words(new_text);
    // Remove last word from transcription to avoid cutoff words
    if (!is_final && !new_words.empty()) {
        new_words.pop_back();
    }
    if (last_transcribed.empty()) {
        last_transcribed = join_words(new_words);
        return last_transcribed;
    }
    std::vector<std::string> old_words = split_words(last_transcribed);
    // Search BACKWARD for the first matching word
    size_t match_index = old_words.size();
    std::string first_new_word = new_words.empty() ? "" : new_words[0];
    for (size_t i = old_words.size(); i-- > 0;) { // Iterate backward
        if (words_match(old_words[i], first_new_word, false)) { // Exact match only for the first word
            match_index = i;
            break;
        }
    }
    // Move FORWARD and remove overlapping words
    size_t trim_index = 0;
    while (trim_index < new_words.size() && match_index < old_words.size()) {
        if (words_match(old_words[match_index], new_words[trim_index], true)) { // Exact match or Levenshtein for overlapping words after the first match
            trim_index++; // Remove this word
            match_index++;
        } else {
            break; // Stop at first non-match
        }
    }
    // Trim overlapping words
    std::vector<std::string> deduped_words(new_words.begin() + trim_index, new_words.end());
    // Store cleaned transcription & return result
    if (is_final) {
        last_transcribed.clear(); // Clear previous transcription
        g_audio_chunks.clear();   // Clear the 1-second buffer AFTER processing the final frame
    } else {
        last_transcribed = join_words(new_words);
    }
    return join_words(deduped_words);
}
Whisper can also yield different words depending on the sentence context, so instead of direct string comparison we can use Levenshtein distance to measure word similarity and reduce the probability of false negatives.
int levenshtein_distance(const std::string &s1, const std::string &s2) {
    const size_t len1 = s1.size(), len2 = s2.size();
    std::vector<std::vector<int>> d(len1 + 1, std::vector<int>(len2 + 1));
    for (size_t i = 0; i <= len1; i++) d[i][0] = i;
    for (size_t j = 0; j <= len2; j++) d[0][j] = j;
    for (size_t i = 1; i <= len1; i++) {
        for (size_t j = 1; j <= len2; j++) {
            d[i][j] = std::min({
                d[i - 1][j] + 1, // Deletion
                d[i][j - 1] + 1, // Insertion
                d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1) // Substitution
            });
        }
    }
    return d[len1][len2];
}
It's not perfect, but it works well for the most part. You can find my modified fork of Whisper.cpp here.
The VITS Problem
LLMs take a while to generate full responses, especially when running locally. So we probably don't want to wait until the full response is generated before starting audio output. Thankfully, WebLLM supports a streaming response mode, yielding an output stream as the model generates tokens.
But the nature of human speech prevents us from simply converting each word to audio and stitching the pieces together; it would sound robotic and stilted, since sentences are meant to flow as one. To get around this, we can buffer the output text until we have a full sentence, then convert that to audio and play it.
public streamToken(token: string, guid: string, voiceId: string): void {
  if (!this.vitsWorker) {
    throw new Error('TTS Worker not initialized');
  }
  if (this.currentGuid && this.currentGuid !== guid) {
    this.cancelCurrentStream();
  }
  this.currentGuid = guid;
  const cleanToken = this.filterMessage(token);
  if (!cleanToken.trim()) return;
  this.bufferString += cleanToken;
  if (/[.!?]["']?\s*$/.test(this.bufferString)) {
    const sentence = this.bufferString.trim();
    this.bufferString = '';
    this.vitsWorker.postMessage({
      messageOutput: sentence,
      voiceId,
      guid,
    });
  }
}
As the worker completes audio generation for these discretized sentences, we can add them to a queue so we can play them in sequence:
this.vitsWorker.onmessage = (event: MessageEvent) => {
  const { type, wav, error, guid } = event.data;
  if (this.currentGuid !== guid) {
    console.warn(`[TTS] Ignoring stale audio for GUID ${guid}`);
    return;
  }
  if (type === 'tts-result') {
    const audio = new Audio();
    audio.src = URL.createObjectURL(wav);
    audio.addEventListener('ended', () => {
      this.isPlaying = false;
      this.currentAudio = null;
      this.playNextAudio();
    });
    this.audioQueue.push(audio);
    if (!this.isPlaying) {
      this.playNextAudio();
    }
  } else if (type === 'tts-error') {
    console.error('TTS worker error:', error);
    this.isPlaying = false;
    this.currentAudio = null;
    this.playNextAudio(); // Skip to next
  }
};
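The playNextAudio helper referenced above isn't shown in the snippet; its queue-draining logic might look something like this sketch, with the actual HTMLAudioElement playback abstracted behind a callback so the sequencing can be followed in isolation (names are mine):

```typescript
// Sketch of sequential playback: drain a queue one item at a time,
// starting the next item only when the previous one signals it has
// finished. The play callback stands in for audio.play() plus the
// 'ended' event listener in the real service.
class SequentialPlayer<T> {
  private queue: T[] = [];
  private playing = false;

  constructor(private play: (item: T, onEnded: () => void) => void) {}

  enqueue(item: T): void {
    this.queue.push(item);
    if (!this.playing) this.playNext();
  }

  private playNext(): void {
    const next = this.queue.shift();
    if (next === undefined) {
      this.playing = false; // Queue drained, go idle
      return;
    }
    this.playing = true;
    this.play(next, () => this.playNext());
  }
}
```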
Again, it's not perfect and definitely could use some tuning (especially since we have a single global worker, meaning multiple assistants could theoretically interleave their audio if they respond at the same time), but it works well enough for a proof of concept.
The LLM Problem
LLMs sure love to yap, don't they? This is fine for a textual assistant, but nobody wants their voice assistant rambling about the nuances of the downfall of the Roman Empire (well, maybe some people do) for 2-3 minutes in a single audio response. With that in mind, I added a custom system prompt to all assistants, limiting response sizes to keep things manageable.
const formattedMessages = lastMsg
  ? [
      {
        role: 'system' as const,
        content:
          'You are a virtual assistant. Your responses should be short (2-3 sentences) and be able to be read aloud verbatim.',
      },
      {
        role:
          lastMsg.chatParticipant === this
            ? ('assistant' as const)
            : ('user' as const),
        content: lastMsg.text,
      },
    ]
  : [];
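For completeness, here's a sketch of how a streamed response could be pumped into streamToken. The chunk shape ({ choices: [{ delta: { content } }] }) follows the OpenAI-compatible streaming API that WebLLM exposes; the function itself and its names are illustrative, not the project's actual code:

```typescript
// Pump a WebLLM-style token stream into a per-token callback (e.g. the
// TTS service's streamToken). The chunk shape mirrors WebLLM's
// OpenAI-compatible streaming responses.
interface StreamChunk {
  choices: { delta: { content?: string } }[];
}

async function pumpStream(
  chunks: AsyncIterable<StreamChunk>,
  onToken: (token: string) => void,
): Promise<string> {
  let full = '';
  for await (const chunk of chunks) {
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) {
      full += token;
      onToken(token); // Forward to TTS sentence buffering
    }
  }
  return full; // Complete reply, e.g. for chat history
}
```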
Going forward, I'd like to make it so that users can set their own system prompts, and define their own context windows.
Performance, Problems, and The Future
The most obvious issue is that running these massive models locally requires capable hardware. I've got a Razer Blade with an RTX 4080, and even I can hardly run these without putting my PC into overdrive. Part of this is due to WebAssembly: accessing your GPU from the browser is a miracle, sure, but performance is still limited by the sandboxed environment.
Along with this, your browser needs WebGPU with FP16 (half-precision) support, which mobile browsers currently lack. Even if they had it, mobile devices probably don't have the chops for this kind of processing yet anyhow.
Another major issue is that these models are huge and consume massive amounts of memory. The browser limits how much you can consume, and you probably don't want five models loaded and eating up your RAM anyway. With that in mind, I configured WebLLM to load/unload models whenever an Assistant with a different language model is triggered. It adds some latency on the initial request, but it'll have to do.
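The swap can be guarded so the reload cost is only paid when the requested model actually differs from the resident one. A sketch, with the reload call injected (in the real app this would wrap something like WebLLM's engine.reload(modelId), which I'm treating as an assumption about the API rather than a verified signature):

```typescript
// Sketch of lazy model swapping: only trigger an expensive reload when
// the requested model differs from the one currently loaded. The reload
// callback stands in for the actual WebLLM engine call.
class ModelManager {
  private loaded: string | null = null;

  constructor(private reload: (modelId: string) => Promise<void>) {}

  // Returns true if a swap (and its latency) actually happened.
  async ensure(modelId: string): Promise<boolean> {
    if (this.loaded === modelId) return false; // Already resident, no cost
    await this.reload(modelId); // Unload old model, load new one
    this.loaded = modelId;
    return true;
  }
}
```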
As it stands, this is more of a proof of concept. But as models become more efficient and hardware improves, I'd like to think that we'll be able to perform AI tasks purely in the browser at some point in the near future. This would make private, decentralized AI tools much more accessible to technical and non-technical users alike.
Credits
Major thanks to the OSS devs who made this possible:
- OpenAI for the Whisper Model
- Georgi Gerganov for the C++ and WebAssembly Whisper port
- Alexander Veysov for Silero VAD
- Diffusion Studio for the vits-web framework
- MLC AI for the WebLLM engine