Ponder is a voice-first application with audio recording and processing at its core. Unlike most audio web apps, where processing happens after the recording is finished, Ponder processes the audio in real time to provide advanced business logic and features such as transcription. This leads to two requirements:
- Storing audio with full integrity and cross-device playback
- Processing the audio while recording in real-time
This article is a guide to how Ponder accomplishes these requirements using AudioWorklet.
The Web Audio ScriptProcessorNode API has been deprecated in favor of AudioWorklet since 2014. However, the AudioWorklet API was unavailable on iOS Safari until the recent 15.0 update.
Over six months of beta testing our product, customers reported concerns about the recorded audio quality when using Safari on iOS.
We investigated two solutions to mitigate the issue:
- Splitting the audio stream using two separate ScriptProcessor nodes:
  - Node A with a buffer size of 8192 for smoother audio playback
  - Node B with a buffer size of 4096 for frequent transcription requests
- Using a WebWorker to process the audio buffer in a background thread, using GoogleChromeLabs/comlink and GoogleChromeLabs/worker-plugin
The audio stream splitting method was a neat technique that helped with optimizing our audio processing pipeline, and the WebWorker processor setup, with full TypeScript interop through comlink, was a breeze to work with.
The resulting code was concise and elegant, but it did not solve the audio problem. If anything, these changes made our system even more fragile. On Android, the audio worked well. On iOS, however, the audio suffered issues ranging from lagging and stuttering to missing large chunks of recorded audio. We did observe a smoother UI, but our priority was the quality and integrity of the audio.
From a first-principles perspective, the audio buffer we were reading data from was still being generated by the ScriptProcessor API's callback. This led us to conclude that the problem would likely remain until we replaced the ScriptProcessor API itself.
We initially used the approach described in this Google documentation for our old ScriptProcessor-based audio recorder.
Below is the most relevant code snippet inspired by that tutorial:
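That snippet is not reproduced here, but a minimal sketch of the typical ScriptProcessor-based recorder it describes looks like the following. The function name and wiring are illustrative, not Ponder's actual code; the A and B marks correspond to the code paths discussed in the next section.

```javascript
// Sketch of a ScriptProcessor-based recorder (deprecated API, shown for context).
// `audioContext` is an AudioContext and `stream` comes from getUserMedia.
function createScriptProcessorRecorder(audioContext, stream, onChunk) {
  const source = audioContext.createMediaStreamSource(stream);
  // A: a ScriptProcessor node with buffer size 4096, 1 input and 1 output channel
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    // B: copy the samples out before the engine reuses the underlying buffer
    const samples = new Float32Array(event.inputBuffer.getChannelData(0));
    onChunk(samples);
  };
  source.connect(processor);
  processor.connect(audioContext.destination);
  return processor;
}
```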
In this section, we will implement a drop-in replacement for the ScriptProcessor code above using the AudioWorkletProcessor API. The two key code paths are marked:
- A: Replace ScriptProcessor with an implementation using AudioWorkletProcessor
- B: Get the data from AudioWorkletProcessor and consume it
First, let's create a `recorder.worklet.js` file for our worklet processor. This file should be served by your static web server; in a Next.js application, you can simply place it under the `public` directory. Then, in your web app, register the Worklet processor as follows:
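A sketch of the registration step, under two assumptions: the worklet file is served at `/recorder.worklet.js`, and `'recorder.worklet'` is the name passed to `registerProcessor()` inside that file (both names are illustrative).

```javascript
// Registering the worklet and wiring it into the audio graph (browser code).
// '/recorder.worklet.js' assumes the file sits in Next.js's `public` directory;
// 'recorder.worklet' must match the name used with registerProcessor().
async function createRecorderNode(audioContext, stream, onChunk) {
  await audioContext.audioWorklet.addModule('/recorder.worklet.js');
  const source = audioContext.createMediaStreamSource(stream);
  const recorder = new AudioWorkletNode(audioContext, 'recorder.worklet');
  // The worklet posts buffered samples back through its MessagePort
  recorder.port.onmessage = (event) => onChunk(event.data);
  source.connect(recorder);
  recorder.connect(audioContext.destination);
  return recorder;
}
```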
`AudioWorkletProcessor` emits audio data in blocks of 128 frames, i.e., a `Float32Array` of 128 samples. To emulate the behavior of `ScriptProcessor`, we will need to buffer the sample data produced by `AudioWorkletProcessor`.
Buffering is a technique in data processing where instead of consuming the input data right away, we store the data in a temporary location and wait until a certain threshold is met to start reading the data. The key use case is when the input is a stream of data, and the output is a decoded representation of the streamed data, such as audio and video.
Why do we need buffering? Let's take a look at a communication example: Alice is sending Bob a sequence of letters: H, E, L, L, O. If Bob's short-term memory were not functioning, he might forget the letter H the moment he received the letter E, and so on. The lack of buffering would prevent Bob from forming a coherent word, leading to miscommunication. With normal memory, he buffers the letters until his brain recognizes that a group of them matches a word in his vocabulary.
The idea is similar when processing microphone input for human speech. The basic groundwork to set up a buffer requires 3 fields:
- bufferSize: A variable to track the capacity of the buffer
- _bytesWritten: A pointer to track the position of the last sample written (i.e., an index)
- _buffer: The buffer itself
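A minimal sketch of this state, assuming a 4096-sample buffer. It is shown as a plain class so it can run anywhere; the real worklet class extends `AudioWorkletProcessor`.

```javascript
// Buffering state for the recorder worklet (standalone sketch).
class RecorderProcessor {
  constructor(bufferSize = 4096) {
    this.bufferSize = bufferSize;                // capacity of the buffer
    this._bytesWritten = 0;                      // write pointer (sample index)
    this._buffer = new Float32Array(bufferSize); // pre-allocated sample storage
  }
}
```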
To capture the `Float32Array` of data samples produced by the `AudioWorkletProcessor` mentioned earlier, we implement its `process` method and store the data as follows:
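A runnable sketch of the idea, again as a plain class rather than an `AudioWorkletProcessor` subclass:

```javascript
// process() receives 128-frame blocks from the audio engine;
// append() copies them into our pre-allocated buffer.
class RecorderProcessor {
  constructor(bufferSize = 4096) {
    this.bufferSize = bufferSize;
    this._bytesWritten = 0;
    this._buffer = new Float32Array(bufferSize);
  }

  // inputs[input][channel] is a Float32Array of 128 samples
  process(inputs) {
    this.append(inputs[0][0]);
    return true; // keep the processor alive
  }

  append(channelData) {
    if (!channelData) return; // the engine may pass empty input on early ticks
    for (let i = 0; i < channelData.length; i++) {
      this._buffer[this._bytesWritten++] = channelData[i];
    }
  }
}
```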
The `append` method above allows us to fill the data buffer with microphone samples. Next, we will need to check whether the buffer is full and, if so, emit the data back to our main thread for upstream processing. We will also want to reset the `_bytesWritten` pointer. Let's call this operation `flush`.
To make it easier to work with our buffer and implement `flush`, we can add a couple of helper methods:
- initBuffer: Resets our `_bytesWritten` pointer to its initial position
- isBufferEmpty: Checks whether our pointer is at its initial position
- isBufferFull: Checks whether our pointer has reached the size of our pre-allocated buffer
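A sketch of these helpers on the same class, shown standalone so it can run outside the audio engine:

```javascript
// Helper methods for managing the write pointer.
class RecorderProcessor {
  constructor(bufferSize = 4096) {
    this.bufferSize = bufferSize;
    this._bytesWritten = 0;
    this._buffer = new Float32Array(bufferSize);
  }

  initBuffer() {
    this._bytesWritten = 0; // move the write pointer back to the start
  }

  isBufferEmpty() {
    return this._bytesWritten === 0;
  }

  isBufferFull() {
    return this._bytesWritten === this.bufferSize;
  }
}
```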
With our helpers, we can implement the `flush` method as follows:
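A sketch of `flush` in the same standalone-class setup. In a real worklet, `this.port` is supplied by `AudioWorkletProcessor`; the optional chaining below only keeps the sketch runnable outside the engine.

```javascript
// flush(): post the buffered samples to the main thread, then reset the pointer.
class RecorderProcessor {
  constructor(bufferSize = 4096) {
    this.bufferSize = bufferSize;
    this._bytesWritten = 0;
    this._buffer = new Float32Array(bufferSize);
  }

  initBuffer() { this._bytesWritten = 0; }
  isBufferEmpty() { return this._bytesWritten === 0; }
  isBufferFull() { return this._bytesWritten === this.bufferSize; }

  flush() {
    // Send a trimmed copy if the buffer is only partially filled
    this.port?.postMessage(
      this._bytesWritten < this.bufferSize
        ? this._buffer.slice(0, this._bytesWritten)
        : this._buffer
    );
    this.initBuffer();
  }
}
```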
Finally, we can invoke this `flush` method at the beginning of our `append` like so:
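Putting the pieces together, a runnable sketch of the complete buffering logic. The real `recorder.worklet.js` would extend `AudioWorkletProcessor` and end with a `registerProcessor` call.

```javascript
// Complete buffering logic: append() flushes first whenever the buffer is full.
class RecorderProcessor {
  constructor(bufferSize = 4096) {
    this.bufferSize = bufferSize;
    this._bytesWritten = 0;
    this._buffer = new Float32Array(bufferSize);
  }

  initBuffer() { this._bytesWritten = 0; }
  isBufferFull() { return this._bytesWritten === this.bufferSize; }

  process(inputs) {
    this.append(inputs[0][0]);
    return true;
  }

  append(channelData) {
    if (this.isBufferFull()) this.flush(); // emit before we overflow
    if (!channelData) return;
    for (let i = 0; i < channelData.length; i++) {
      this._buffer[this._bytesWritten++] = channelData[i];
    }
  }

  flush() {
    this.port?.postMessage(
      this._bytesWritten < this.bufferSize
        ? this._buffer.slice(0, this._bytesWritten)
        : this._buffer
    );
    this.initBuffer();
  }
}
```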
With that, our re-implementation of the `ScriptProcessor` node with a 4096 buffer size and a single input/output channel is now complete. The full code is published here.
Lastly, you can retrieve the data and consume it the same way you would with `ScriptProcessor` in your web app:
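A sketch of the main-thread side, where `recorderNode` is the `AudioWorkletNode` created during registration (names are illustrative):

```javascript
// Consume samples flushed by the worklet on the main thread.
function consumeRecorderOutput(recorderNode, onSamples) {
  recorderNode.port.onmessage = (event) => {
    // event.data is the Float32Array the worklet flushed
    onSamples(event.data);
  };
}
```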
NOTE: GCP's Speech-to-Text service requires an `Int16Array` buffer. An example utility to convert the `Float32Array` is shown below:
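One common conversion clamps each float sample to [-1, 1] and scales it to the signed 16-bit range; a generic sketch (not necessarily Ponder's exact code) is:

```javascript
// Convert Web Audio float samples ([-1, 1]) to 16-bit PCM.
function float32ToInt16(float32Array) {
  const int16 = new Int16Array(float32Array.length);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp, then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}
```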
Have any questions or feedback? Feel free to leave a comment here or on the gist itself!
Recall our buffering example above: the buffer size in that example is a single letter. For speech audio, however, there is no direct equivalent, so we need to experiment and measure the delay in real-time transcription responses coming from the Google Speech-to-Text service.
A smaller buffer size results in more frequent invocations of the `flush` method, and vice versa; `flush` is what actually sends data to the speech service.
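To build intuition for this trade-off, the interval between flushes is roughly bufferSize / sampleRate. The 44.1 kHz rate below is an assumption; real devices may run at 48 kHz or other rates.

```javascript
// Approximate interval between flush() calls for a given buffer size.
const flushIntervalMs = (bufferSize, sampleRate = 44100) =>
  (bufferSize / sampleRate) * 1000;

// e.g. a 4096-sample buffer flushes roughly every 93 ms at 44.1 kHz
```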
For our experimentation, we tracked 2 timestamps:
- t1: the time when `flush` was first called
- t2: the time when our client received the first transcribed word
By subtracting t1 from t2, we can obtain the total delay:
AudioWorklet emits a minimum of 128 frames per tick. Following ScriptProcessor's convention, our buffer size should be a power of 2 between 128 and 16384. For each size, we repeated the experiment 10 times. The results are in the table below:
To analyze the data, we can visualize it with a box plot:
Based on the plot, we chose 4096: it showed the lowest delay while still being a large enough buffer to keep processing frequency reasonable.
After switching to our new `AudioWorklet` implementation, Ponder has stopped seeing audio quality issues such as lagging, stuttering, or missing audio. Furthermore, the transcription speed is noticeably faster.
This post is dedicated to the constructive feedback that motivated our engineering team to keep on improving Ponder's asynchronous coaching experience. The architecture improvement detailed in this post is now on our staging environment and is being tested by our internal team. We will release it into production under Ponder v1.5.0. Stay tuned!