Louis

Quick guide to AudioWorklet 🔉

Ponder is a voice-first application with audio recording and processing as its core features. Unlike most audio web apps, which process audio only after the recording has finished, Ponder processes audio in real time to power features such as transcription. This imposes two requirements:

  1. Storing audio with full integrity and cross-device playback
  2. Processing the audio while recording in real-time

This article is a guide to how Ponder accomplishes these requirements using AudioWorklet.

🐞 The problem with ScriptProcessor

The Web Audio ScriptProcessor API has been deprecated since 2014, and AudioWorklet is its replacement. However, the AudioWorklet API was unavailable on iOS Safari until iOS 14.5.

Over six months of beta testing our product, we received feedback from our customers with concerns about the recorded audio quality when using Safari on iOS.

We investigated two solutions to mitigate the issue:

  1. Splitting the audio stream using two separate ScriptProcessor nodes:
    • Node A with a buffer size of 8192 for smoother audio playback
    • Node B with a buffer size of 4096 for frequent transcription requests
  2. Using a WebWorker to process the audio buffer in a background thread, using GoogleChromeLabs/comlink and GoogleChromeLabs/worker-plugin
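
As a rough sketch of the first option, assuming a mono microphone source: one source node fans out to two ScriptProcessor nodes with different buffer sizes. The name `splitAudioStream` and its two callbacks are hypothetical and only serve the illustration; this is not Ponder's actual code.

```javascript
// Sketch only: fan one microphone source out to two ScriptProcessor nodes.
// `splitAudioStream` and both callbacks are hypothetical names.
function splitAudioStream(context, source, onPlaybackChunk, onTranscriptionChunk) {
  // Node A: larger buffer, fewer callbacks, intended for the stored recording.
  const nodeA = context.createScriptProcessor(8192, 1, 1);
  // Node B: smaller buffer, more frequent callbacks, intended for transcription.
  const nodeB = context.createScriptProcessor(4096, 1, 1);

  nodeA.onaudioprocess = (event) => {
    // Copy the samples: the event's buffer is only valid inside the callback.
    onPlaybackChunk(new Float32Array(event.inputBuffer.getChannelData(0)));
  };
  nodeB.onaudioprocess = (event) => {
    onTranscriptionChunk(new Float32Array(event.inputBuffer.getChannelData(0)));
  };

  for (const node of [nodeA, nodeB]) {
    source.connect(node);
    // Connecting to the destination keeps the node in the rendering graph.
    node.connect(context.destination);
  }
  return { nodeA, nodeB };
}
```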

Audio stream splitting

The audio stream splitting method was a neat technique that helped with optimizing our audio processing pipeline, and the WebWorker processor setup, with full TypeScript interop through comlink, was a breeze to work with.

The resulting code was concise and pretty, but it did not solve the audio problem. If anything, these changes made our system even more fragile. On Android, the audio worked well. On iOS, however, the audio suffered from issues ranging from lag and stuttering to large chunks of the recording going missing. We did observe a smoother UI, but our priority is the quality and integrity of the audio.

From a first-principles perspective, the audio buffer we were reading from was still being generated by the ScriptProcessor API's callback. This led us to conclude that the problem would likely remain until we replaced the ScriptProcessor API.

🐴 ScriptProcessor implementation

We initially used the approach described in this Google documentation for our old ScriptProcessor-based audio recorder.

Below is the most relevant code snippet inspired by that tutorial:

ScriptProcessor Code
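
The original snippet was an image and is not reproduced here. A minimal sketch of what a ScriptProcessor-based recorder typically looks like is shown below, with the two code paths labeled A and B. The names `startScriptProcessorRecorder` and `onAudio` are assumptions made for this illustration.

```javascript
// Sketch of a ScriptProcessor-based recorder.
// `startScriptProcessorRecorder` and `onAudio` are hypothetical names.
function startScriptProcessorRecorder(context, stream, onAudio) {
  const source = context.createMediaStreamSource(stream);

  // A: create the (deprecated) ScriptProcessor node:
  // 4096-sample buffer, 1 input channel, 1 output channel.
  const processor = context.createScriptProcessor(4096, 1, 1);

  // B: receive each filled buffer and hand a copy to the consumer.
  processor.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0); // Float32Array
    onAudio(new Float32Array(samples));
  };

  source.connect(processor);
  processor.connect(context.destination);
  return processor;
}

// Browser usage (needs microphone permission):
// const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// startScriptProcessorRecorder(new AudioContext(), stream, (chunk) => { /* ... */ });
```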

🦄 AudioWorklet implementation

In this section, we implement a drop-in replacement for the above ScriptProcessor code using the AudioWorkletProcessor API. The two key code paths are marked A and B:

  • A: Replace ScriptProcessor with an implementation using AudioWorkletProcessor
  • B: Get the data from AudioWorkletProcessor and consume it

A: Replacing ScriptProcessor

First, let's create a recorder.worklet.js file for our AudioWorkletProcessor:

First AudioWorkletProcessor
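
A bare-bones processor file might look like the sketch below. The processor name "recorder.worklet" is an assumption, and the `ProcessorBase` fallback is an addition for this sketch so that the file also loads outside a worklet scope (for example in a unit test); inside the browser's audio thread the real `AudioWorkletProcessor` is used.

```javascript
// recorder.worklet.js (sketch). The name "recorder.worklet" is an assumption.

// AudioWorkletProcessor and registerProcessor only exist inside an
// AudioWorkletGlobalScope. The fallback below is a convenience for this
// sketch so the file can also be loaded in a plain JS runtime.
const ProcessorBase = globalThis.AudioWorkletProcessor ?? class {};

class RecorderProcessor extends ProcessorBase {
  // Called on the audio rendering thread once per block of audio.
  // inputs[n][m] is the Float32Array of samples for input n, channel m.
  process(inputs, outputs, parameters) {
    // Returning true keeps the processor alive.
    return true;
  }
}

if (typeof registerProcessor === "function") {
  registerProcessor("recorder.worklet", RecorderProcessor);
}
```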

This file should be served by your static web server. In a Next.js application, you can simply place it under the public directory. Then, in your web app, register the Worklet processor as follows:

Register Worklet
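
A sketch of the registration step, assuming the file is served at `/recorder.worklet.js` and the processor was registered under the name "recorder.worklet". The helper name `createRecorderNode` is hypothetical; adjust the URL and name to your own setup.

```javascript
// Sketch: load the worklet module and wire it to the microphone stream.
// `createRecorderNode` is a hypothetical helper name.
async function createRecorderNode(context, stream) {
  // Fetches the module and evaluates it in the audio worklet scope.
  await context.audioWorklet.addModule("/recorder.worklet.js");

  const source = context.createMediaStreamSource(stream);
  // The second argument must match the name passed to registerProcessor().
  const recorder = new AudioWorkletNode(context, "recorder.worklet");

  source.connect(recorder);
  // Mirrors the ScriptProcessor wiring; the processor writes no output,
  // so nothing audible is played back.
  recorder.connect(context.destination);
  return recorder;
}
```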

B: Reimplementing ScriptProcessor

Currently, AudioWorkletProcessor emits audio in blocks of 128 frames, i.e. each channel arrives as a Float32Array of 128 samples (doc 📄). To emulate the behavior of ScriptProcessor, we need to buffer the samples produced by AudioWorkletProcessor.

While researching this setup, I stumbled upon a helpful gist by flpvsk. The buffer implementation below borrowed many ideas from that gist.

Buffering is a technique in data processing where instead of consuming the input data right away, we store the data in a temporary location and wait until a certain threshold is met to start reading the data. The key use case is when the input is a stream of data, and the output is a decoded representation of the streamed data, such as audio and video.

Why do we need buffering? Let's take a look at a communication example: Alice is sending Bob a sequence of letters: H, E, L, L, O. If Bob's short-term memory were not functioning, he might forget the letter H the moment he received the letter E, and so on. The lack of buffering would prevent Bob from forming a coherent word from Alice's letters, leading to miscommunication. If his memory is working normally, he buffers the letters until his brain recognizes that a group of letters matches a word in his vocabulary.

Alice and bob

The idea is similar when processing microphone input for human speech. The basic groundwork to set up a buffer requires 3 fields:

  • bufferSize: A variable to track the size of the buffer
  • _bytesWritten: A pointer to track how much of the buffer has been written so far (i.e. the index of the next free slot)
  • _buffer: The buffer itself

Basic buffer

To capture the Float32Array of data samples produced by AudioWorkletProcessor mentioned earlier, we implement its process method and store the data as follows:

Capture buffer
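
A sketch of the buffer fields together with `process` and `append` is shown below. It deliberately stops writing once the buffer is full, since emitting and resetting the buffer (flush) is covered next. The `ProcessorBase` fallback is an addition so the snippet also loads outside a worklet scope.

```javascript
// Fallback so the sketch can also run outside an AudioWorkletGlobalScope.
const ProcessorBase = globalThis.AudioWorkletProcessor ?? class {};

class RecorderProcessor extends ProcessorBase {
  constructor() {
    super();
    this.bufferSize = 4096;                           // capacity, in samples
    this._bytesWritten = 0;                           // index of the next free slot
    this._buffer = new Float32Array(this.bufferSize); // the buffer itself
  }

  process(inputs) {
    // First input, first channel; undefined when nothing is connected.
    this.append(inputs[0]?.[0]);
    return true;
  }

  append(channelData) {
    if (!channelData) return;
    // Copy samples until the block is consumed or the buffer is full.
    for (let i = 0; i < channelData.length && this._bytesWritten < this.bufferSize; i++) {
      this._buffer[this._bytesWritten++] = channelData[i];
    }
  }
}
```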

The append method above allows us to fill the data buffer with microphone samples. Next, we will need to check if the buffer is full to emit the data back to our main thread for upstream processing. We will also want to reset the _bytesWritten pointer. Let's call this operation flush.

To make it easier to work with our buffer and implement flush, we can add a couple of helper methods:

  • initBuffer: Reset our _bytesWritten pointer to its initial position
  • isBufferEmpty: Check if our pointer is at its initial position
  • isBufferFull: Check if our pointer has reached the size of our pre-allocated buffer

Helpers

With our helpers, we can implement the flush method as follows:

Flush method

Finally, we can invoke this flush method at the beginning of our append like so:

invoke flush
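
Putting the pieces together, a complete processor could look like the sketch below. It follows the steps described above but is not the published code; the processor name and the `ProcessorBase` fallback (which lets the file load outside a worklet scope) are assumptions of this sketch.

```javascript
// Fallback with a stub port so the sketch also runs outside a worklet scope.
const ProcessorBase =
  globalThis.AudioWorkletProcessor ??
  class { constructor() { this.port = { postMessage() {} }; } };

class RecorderProcessor extends ProcessorBase {
  constructor() {
    super();
    this.bufferSize = 4096;
    this._buffer = new Float32Array(this.bufferSize);
    this.initBuffer();
  }

  initBuffer() { this._bytesWritten = 0; }
  isBufferEmpty() { return this._bytesWritten === 0; }
  isBufferFull() { return this._bytesWritten === this.bufferSize; }

  process(inputs) {
    this.append(inputs[0]?.[0]);
    return true;
  }

  append(channelData) {
    // Flush first, so a full buffer is emitted before new samples are stored.
    if (this.isBufferFull()) this.flush();
    if (!channelData) return;
    // Assumes bufferSize is a multiple of the 128-sample block length,
    // so the buffer fills exactly at a block boundary.
    for (let i = 0; i < channelData.length; i++) {
      this._buffer[this._bytesWritten++] = channelData[i];
    }
  }

  flush() {
    // Post a copy of the filled part to the main thread, then reset the pointer.
    this.port.postMessage(this._buffer.slice(0, this._bytesWritten));
    this.initBuffer();
  }
}

if (typeof registerProcessor === "function") {
  registerProcessor("recorder.worklet", RecorderProcessor);
}
```

With 128-sample blocks, the buffer fills after 32 calls to `process`, and the flush happens at the start of the 33rd call.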

With that, our re-implementation of the ScriptProcessor node with a 4096 buffer size and a single input/output channel is now complete. The full code is published here.

Lastly, you can retrieve the data and consume it the same way you would with ScriptProcessor in your web app:

Get F32 data
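
On the main thread, the data arrives through the node's message port. A minimal sketch, assuming the worklet posts a Float32Array per flush; `onRecorderData` and `handleChunk` are hypothetical names.

```javascript
// Sketch: receive buffered chunks from the AudioWorkletNode on the main thread.
// `onRecorderData` and `handleChunk` are hypothetical names.
function onRecorderData(recorderNode, handleChunk) {
  recorderNode.port.onmessage = (event) => {
    // event.data is the Float32Array posted by the worklet's flush().
    handleChunk(event.data);
  };
}
```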

NOTE: GCP's speech to text service requires an Int16Array buffer. An example utility to convert the Float32Array is shown below:

convert F32 to I16
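
One common way to do this conversion is sketched below: clamp each sample to [-1, 1] and scale it to the signed 16-bit range. The function name `float32ToInt16` is an assumption, and this is not necessarily the exact utility the original image showed.

```javascript
// Sketch: convert Float32 samples in [-1, 1] to 16-bit signed PCM.
// `float32ToInt16` is a hypothetical name.
function float32ToInt16(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp first, so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, float32[i]));
    // Negative samples scale to -32768, positive samples to 32767.
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}
```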

Have any questions or feedback? Feel free to leave a comment here or on the gist itself!

🧪 Why did we choose a buffer size of 4096?

Recall our buffering example above: the unit being buffered in that example is a single letter of the alphabet. For speech audio, however, there is no direct equivalent. Thus, we need to find a suitable buffer size experimentally, by measuring the delay in real-time transcription responses coming from the Google Speech-to-Text service.

A smaller buffer size results in more frequent invocation of the flush method, and vice versa. Flush effectively sends data to the speech service.
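
To put numbers on this trade-off, the sketch below computes how much audio each buffer holds and how often flush fires. The 48000 Hz sample rate is an assumption for illustration; the real value is whatever your AudioContext reports as `sampleRate`.

```javascript
// Sketch: relate buffer size to flush frequency.
// The sample rate of 48000 Hz is an assumption; use context.sampleRate in practice.
function bufferDurationMs(bufferSize, sampleRate) {
  return (bufferSize / sampleRate) * 1000;
}

function flushesPerSecond(bufferSize, sampleRate) {
  return sampleRate / bufferSize;
}

for (const size of [128, 4096, 16384]) {
  const ms = bufferDurationMs(size, 48000).toFixed(1);
  const perSecond = flushesPerSecond(size, 48000).toFixed(1);
  console.log(`${size} samples: ${ms} ms of audio, ${perSecond} flushes per second`);
}
```

Under that assumption, a 4096-sample buffer holds about 85 ms of audio and flushes roughly 12 times per second.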

For our experimentation, we track 2 timestamps:

  1. Time when flush was first called
  2. Time when our client received the first transcribed word

By subtracting the first timestamp (t1) from the second (t2), we can obtain the total delay:

Delay calculation
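
A minimal sketch of this measurement, assuming both timestamps are taken on the main thread (t1 when the first flushed chunk arrives, t2 when the first transcribed word arrives). The name `createDelayTracker` and its methods are hypothetical.

```javascript
// Sketch: measure the delay between the first flush and the first transcribed word.
// `createDelayTracker` and its method names are hypothetical.
function createDelayTracker(now = () => performance.now()) {
  let t1 = null; // time of the first flush
  let t2 = null; // time the first transcribed word was received

  return {
    markFlush() { if (t1 === null) t1 = now(); },     // only the first call counts
    markFirstWord() { if (t2 === null) t2 = now(); }, // only the first call counts
    // Total delay in milliseconds, or null until both timestamps exist.
    delayMs() { return t1 === null || t2 === null ? null : t2 - t1; },
  };
}
```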

AudioWorklet emits 128 samples per tick. Following ScriptProcessor's implementation, our buffer size should be a power of 2 between 128 and 16384. For each size, we repeated the experiment 10 times. The results are in the table below:

buffer size data

To analyze the data, we can visualize it with a box plot:

Buffer size box plot

Based on the plot, 4096 was chosen since it has the lowest delay, while also having a reasonable buffer size to reduce processing frequency.

After switching to our new AudioWorklet implementation, Ponder has stopped seeing audio quality issues such as lagging, stuttering, or missing audio. Furthermore, the transcription speed is noticeably faster.

🙏 Thank you to our beta testers!

This post is dedicated to the constructive feedback that motivated our engineering team to keep on improving Ponder's asynchronous coaching experience. The architecture improvement detailed in this post is now on our staging environment and is being tested by our internal team. We will release it into production under Ponder v1.5.0. Stay tuned!

Top comments (1)

V

thanks for posting! I was going crazy trying to replace ScriptProcessor until I found this article
