Louis

Posted on Nov 5, 2021

Quick guide to AudioWorklet 🔉

#webdev #webaudio #audioworklet #tutorial

Ponder is a voice-first application with audio recording and processing as its core features. Unlike most audio Web Apps where the processing is done after the recording is finished, Ponder processes the audio in real-time to provide advanced business logic and features such as transcription. There are two requirements:

Storing audio with full integrity and cross-device playback
Processing the audio while recording in real-time

This article is a guide to how Ponder accomplishes these requirements using AudioWorklet.

🐞 The problem with ScriptProcessor

The WebAudio ScriptProcessor API has been deprecated since 2014, and AudioWorklet is its designated replacement. However, the AudioWorklet API was unavailable on iOS Safari until the recent 15.0 update.

Over six months of beta testing our product, we received feedback from our customers with concerns about the recorded audio quality when using Safari on iOS.

We investigated two solutions to mitigate the issue:

Splitting the audio stream using two separate ScriptProcessor nodes:
- Node A with a buffer size of 8192 for smoother audio playback
- Node B with a buffer size of 4096 for frequent transcription requests
Using a WebWorker to process the audio buffer in a background thread, using GoogleChromeLabs/comlink and GoogleChromeLabs/worker-plugin

The audio stream splitting method was a neat technique that helped with optimizing our audio processing pipeline, and the WebWorker processor setup, with full TypeScript interop through comlink, was a breeze to work with.

The resulting code was concise and pretty, but it did not solve the audio problem. If anything, these improvements made our system even more fragile. On Android, the audio worked well. On iOS, however, the audio suffered from lagging, stuttering, and even large chunks of missing recorded audio. We observed an improvement in the smoothness of our UI, but our priority is the quality and integrity of the audio.

From a first-principles perspective, the audio buffer that we were reading data from was still being generated by the ScriptProcessor API's callback. This led us to conclude that the problem would likely remain until we replaced the ScriptProcessor API.

🐴 ScriptProcessor implementation

We initially used the approach described in this Google documentation for our old ScriptProcessor-based audio recorder.

Below is the most relevant code snippet inspired by that tutorial:
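A minimal sketch of what a ScriptProcessor-based recorder along those lines might look like, assuming a single input/output channel and a 4096-sample buffer. The function name `createScriptProcessorRecorder` and the `onAudio` callback are hypothetical names used for illustration, not Ponder's actual code. The comments A and B mark the two code paths discussed in the next section.

```javascript
// Sketch only: wires a microphone stream into a (deprecated) ScriptProcessorNode.
// `onAudio` is a hypothetical callback that receives each Float32Array chunk.
function createScriptProcessorRecorder(audioContext, stream, onAudio) {
  const source = audioContext.createMediaStreamSource(stream);

  // A: create the ScriptProcessor node (buffer size, input channels, output channels)
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  // B: receive the buffered samples and hand them to the consumer
  processor.onaudioprocess = (event) => {
    // Copy the samples so the consumer can keep them after the handler returns
    onAudio(new Float32Array(event.inputBuffer.getChannelData(0)));
  };

  source.connect(processor);
  // Connected to the destination so that onaudioprocess keeps firing
  processor.connect(audioContext.destination);

  // Return a cleanup function that tears the graph down again
  return () => {
    source.disconnect();
    processor.disconnect();
  };
}
```

In a browser, `stream` would typically come from `navigator.mediaDevices.getUserMedia({ audio: true })` and `audioContext` from `new AudioContext()`.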

🦄 AudioWorklet implementation

In this section, we implement a drop-in replacement for the above ScriptProcessor code, using the AudioWorkletProcessor API. The two key code paths are marked A and B:

A: Replace ScriptProcessor with an implementation using AudioWorkletProcessor
B: Get the data from AudioWorkletProcessor and consume it

A: Replacing ScriptProcessor

First, let's create a recorder.worklet.js file for our AudioWorkletProcessor:

This file should be served by your static web server. In a Next.js application, you can simply place it under the public directory. Then, in your web app, register the Worklet processor as follows:
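A minimal sketch of the registration step on the main thread. The URL `/recorder.worklet.js`, the processor name `recorder.worklet`, and the function name `createRecorderNode` are assumptions for illustration; the processor name must match whatever the worklet file passes to `registerProcessor()`.

```javascript
// Sketch only: loads the worklet module and creates a node backed by it.
// '/recorder.worklet.js' and 'recorder.worklet' are assumed names.
async function createRecorderNode(audioContext, stream) {
  // Load and evaluate the worklet file on the audio rendering thread
  await audioContext.audioWorklet.addModule('/recorder.worklet.js');

  const source = audioContext.createMediaStreamSource(stream);
  const recorderNode = new AudioWorkletNode(audioContext, 'recorder.worklet');

  source.connect(recorderNode);
  // Connecting to the destination is a common way to keep the node pulled by
  // the audio graph; the processor writes no output, so nothing is played back
  recorderNode.connect(audioContext.destination);

  return recorderNode;
}
```

`addModule()` returns a promise, so the node must only be constructed after it resolves; constructing it earlier fails because the processor name is not yet registered.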

B: Reimplementing ScriptProcessor

Currently, the AudioWorkletProcessor emits audio data in blocks of 128 frames, i.e. a Float32Array of 128 samples per channel (doc 📄). To emulate the behavior of ScriptProcessor, we need to buffer the sample data produced by AudioWorkletProcessor.

While researching this setup, I stumbled upon a helpful gist by flpvsk. The buffer implementation below borrowed many ideas from that gist.

Buffering is a technique in data processing where instead of consuming the input data right away, we store the data in a temporary location and wait until a certain threshold is met to start reading the data. The key use case is when the input is a stream of data, and the output is a decoded representation of the streamed data, such as audio and video.

Why do we need buffering? Let's take a look at a communication example: Alice is sending Bob a sequence of letters: H, E, L, L, O. If Bob's short-term memory were not functioning, he would forget the letter H the moment he received the letter E, and so on. Without buffering, Bob could never form a coherent word from Alice's letters, leading to miscommunication. With normal memory, he buffers the letters until his brain recognizes that a group of letters matches a word in his vocabulary.

The idea is similar when processing microphone input for human speech. The basic groundwork to set up a buffer requires 3 fields:

bufferSize: A variable to track the size of the buffer
_bytesWritten: A pointer to track the position of the last value written (i.e., an index into the buffer; despite the name, it counts samples rather than bytes)
_buffer: The buffer itself

To capture the Float32Array of data samples produced by AudioWorkletProcessor mentioned earlier, we implement its process method and store the data as follows:
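A minimal sketch of the three fields together with `process` and `append`, assuming a single input with a single channel. The class name `RecorderProcessor`, the processor name `recorder.worklet`, and the fallback base class are assumptions; the fallback and the guard around `registerProcessor()` exist only so the file can also be loaded outside a worklet, for example in a unit test.

```javascript
// recorder.worklet.js (sketch). Outside a worklet the AudioWorkletProcessor
// global does not exist, so an empty class is used as a stand-in.
const ProcessorBase =
  typeof AudioWorkletProcessor === 'function' ? AudioWorkletProcessor : class {};

class RecorderProcessor extends ProcessorBase {
  constructor() {
    super();
    this.bufferSize = 4096; // capacity of the buffer, in samples
    this._bytesWritten = 0; // index of the next free slot (counts samples)
    this._buffer = new Float32Array(this.bufferSize);
  }

  // Called on the audio thread once per 128-frame render block
  process(inputs) {
    // inputs[0] is the first input, inputs[0][0] its first channel
    this.append(inputs[0][0]);
    return true; // keep the processor alive
  }

  append(channelData) {
    if (!channelData) return; // input not connected yet
    for (let i = 0; i < channelData.length; i++) {
      this._buffer[this._bytesWritten++] = channelData[i];
    }
  }
}

if (typeof registerProcessor === 'function') {
  registerProcessor('recorder.worklet', RecorderProcessor);
}
```

Note that this version never empties the buffer: once it is full, writes past the end of the typed array are silently ignored, which is why the buffer has to be emitted and reset.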

The append method above allows us to fill the data buffer with microphone samples. Next, we need to check whether the buffer is full and, if so, emit the data back to our main thread for upstream processing. We also want to reset the _bytesWritten pointer afterwards. Let's call this operation flush.

To make it easier to work with our buffer and implement flush, we can add a couple of helper methods:

initBuffer: Reset our _bytesWritten pointer to its initial position
isBufferEmpty: Check if our pointer is at its initial position
isBufferFull: Check if our pointer has reached the size of our pre-allocated buffer

With our helpers, we can implement the flush method as follows:

Finally, we can invoke this flush method at the beginning of our append like so:
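A minimal sketch that puts the helpers, `flush`, and the updated `append` together. It assumes the buffer size is a multiple of the 128-frame block size, so the pointer lands exactly on the end of the buffer. As before, the class name, the processor name `recorder.worklet`, and the fallback base class (with a stand-in `port`) are assumptions added so the file can be loaded outside a worklet.

```javascript
// recorder.worklet.js (sketch). The fallback class mimics the `port` that a
// real AudioWorkletProcessor provides for talking to the main thread.
const ProcessorBase =
  typeof AudioWorkletProcessor === 'function'
    ? AudioWorkletProcessor
    : class { constructor() { this.port = { postMessage() {} }; } };

class RecorderProcessor extends ProcessorBase {
  constructor() {
    super();
    this.bufferSize = 4096; // assumed to be a multiple of 128
    this._buffer = new Float32Array(this.bufferSize);
    this.initBuffer();
  }

  initBuffer() { this._bytesWritten = 0; }
  isBufferEmpty() { return this._bytesWritten === 0; }
  isBufferFull() { return this._bytesWritten === this.bufferSize; }

  process(inputs) {
    this.append(inputs[0][0]);
    return true; // keep the processor alive
  }

  append(channelData) {
    // Flush first, so a full buffer is emitted before new samples are written
    if (this.isBufferFull()) this.flush();
    if (!channelData) return;
    for (let i = 0; i < channelData.length; i++) {
      this._buffer[this._bytesWritten++] = channelData[i];
    }
  }

  flush() {
    // slice() trims the message to the filled portion of the buffer
    this.port.postMessage(this._buffer.slice(0, this._bytesWritten));
    this.initBuffer();
  }
}

if (typeof registerProcessor === 'function') {
  registerProcessor('recorder.worklet', RecorderProcessor);
}
```

With a 4096-sample buffer, 32 consecutive blocks fill the buffer and the 33rd call to `append` triggers the flush before writing its own samples.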

With that, our re-implementation of the ScriptProcessor node with a 4096 buffer size and a single input/output channel is now complete. The full code is published here.

Lastly, you can retrieve the data and consume it the same way you would with ScriptProcessor in your web app:
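A minimal sketch of the consuming side, assuming the processor posts each full buffer through its message port. The function name `consumeRecorderData` and the `onAudio` callback are hypothetical; `recorderNode` is assumed to be the AudioWorkletNode created for the processor.

```javascript
// Sketch only: forwards every buffer posted by the worklet to a callback.
function consumeRecorderData(recorderNode, onAudio) {
  recorderNode.port.onmessage = (event) => {
    // event.data is the Float32Array sent from the processor via postMessage
    onAudio(event.data);
  };
}
```

The callback plays the same role as the body of `onaudioprocess` in the ScriptProcessor version, so existing downstream code can usually be reused unchanged.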

NOTE: GCP's speech to text service requires an Int16Array buffer. An example utility to convert the Float32Array is shown below:
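A common way to do this conversion is to clamp each sample to the range [-1, 1] and scale it to the signed 16-bit range. The sketch below assumes that approach; the function name `float32ToInt16` is illustrative.

```javascript
// Sketch only: converts Web Audio float samples to 16-bit signed PCM.
function float32ToInt16(float32Samples) {
  const int16Samples = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range samples do not wrap around
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // Negative values scale to -32768, positive values to 32767
    int16Samples[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16Samples;
}
```

For example, `float32ToInt16(new Float32Array([0, 1, -1]))` yields an Int16Array containing 0, 32767, and -32768.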

Have any questions or feedback? Feel free to leave a comment here or on the gist itself!

🧪 Why did we choose a buffer size of 4096?

Recall our buffering example above: the unit being buffered in that example is a single letter of the alphabet. For speech audio, however, there is no direct equivalent. Thus, we chose the buffer size experimentally, by measuring the delay in real-time transcription responses coming from the Google Speech-to-Text service.

A smaller buffer size results in more frequent invocation of the flush method, and vice versa. Flush effectively sends data to the speech service.
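To make the trade-off concrete, the sketch below computes how often flush would fire for each power-of-two buffer size from 128 to 16384. The 48 kHz sample rate is an assumption (the real value is `audioContext.sampleRate`), and `flushCadence` is a hypothetical helper name.

```javascript
const RENDER_BLOCK = 128; // frames delivered per AudioWorklet process() call
const SAMPLE_RATE = 48000; // assumed; use audioContext.sampleRate in practice

// Returns how many render blocks and how many milliseconds pass per flush
function flushCadence(bufferSize, sampleRate = SAMPLE_RATE) {
  return {
    bufferSize,
    blocksPerFlush: bufferSize / RENDER_BLOCK,
    msPerFlush: (bufferSize / sampleRate) * 1000,
  };
}

// Candidate sizes: powers of two from 128 up to 16384
const sizes = [];
for (let size = 128; size <= 16384; size *= 2) sizes.push(size);

console.table(sizes.map((size) => flushCadence(size)));
```

Under that assumption, a 4096-sample buffer corresponds to 32 render blocks and one flush roughly every 85 ms, while 128 samples would mean a flush on every block.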

For our experimentation, we track 2 timestamps:

t1: Time when flush was first called
t2: Time when our client received the first transcribed word

By subtracting t1 from t2, we can obtain the total delay: delay = t2 - t1
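One way such a measurement could be instrumented is sketched below. The helper `createDelayTracker` and its method names are hypothetical, and the clock is injectable so the logic can be exercised without real audio.

```javascript
// Sketch only: records the two timestamps once each and reports their difference.
function createDelayTracker(now = Date.now) {
  let t1 = null; // time when flush was first called
  let t2 = null; // time when the first transcribed word arrived

  return {
    markFirstFlush() { if (t1 === null) t1 = now(); },
    markFirstWord() { if (t2 === null) t2 = now(); },
    // Total delay in milliseconds, or null until both events have happened
    delay() { return t1 === null || t2 === null ? null : t2 - t1; },
  };
}
```

Only the first call to each `mark` method is recorded, so the tracker can be called on every flush and every transcription response without extra bookkeeping.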

AudioWorklet delivers 128 samples per process call, so that is the smallest possible buffer. Following ScriptProcessor's convention, our buffer size should be a power of 2 between 128 and 16384. For each size, we repeated the experiment 10 times. The results are in the table below:

To analyze the data, we can visualize it with a box plot:

Based on the plot, 4096 was chosen since it has the lowest delay, while also having a reasonable buffer size to reduce processing frequency.

After switching to our new AudioWorklet implementation, Ponder has stopped seeing audio quality issues such as lagging, stuttering, or missing audio. Furthermore, the transcription speed is noticeably faster.

🙏 Thank you to our beta testers!

This post is dedicated to the constructive feedback that motivated our engineering team to keep on improving Ponder's asynchronous coaching experience. The architecture improvement detailed in this post is now on our staging environment and is being tested by our internal team. We will release it into production under Ponder v1.5.0. Stay tuned!