Dmitry Tseyler

Posted on May 21 • Originally published at tseiler.tech

How I made a perfect recording button: a simple yet complex thing

#ux #swift #programming #architecture

Intro

From the start of XSpeak, I wanted it to provide the best possible feel for the user: simple, fast, and responsive. Since it's a recording app, one of its main components is the button the user presses to record a conversation.

Actually, it's a control we're used to in many apps. The simplest example is the standard Voice Memos app on iPhone.

The button looks simple and does two things: starts recording and stops recording. However, behind the scenes, it takes many steps to start the whole pipeline, which is far from simple. In this article, I'm going to cover the technical and usability aspects of a recording button and will try to explain why it's important to make it work perfectly and why it's not as simple as it might look.

XSpeak is a fully private app that takes meeting notes live during conversation and helps you in real-time using on-device AI by providing suggestions, relevant context, facts and more as you speak.

Perfect button

I'd call the recording button perfect if it is:

Functional
Responsive

When I say functional, I mean that it should start or stop recording whenever it's enabled and I press it. The button should also indicate that the state has switched. Responsive means that this feedback is immediate, even though the operation behind it usually takes hundreds of milliseconds.

Let's imagine that feedback is not immediate and there's some progress state between the two states. In this case, the user:

Has to wait to make sure that recording has actually started.
Is out of their natural flow. Instead of focusing on their business, they are monitoring the app's status.
Experiences unnecessary cognitive load as a result: "What just happened? Did it start? How long should I wait?"
Loses control over the interface. If they pressed the button by mistake, they cannot press stop to correct it.
Feels slight frustration, as the app temporarily prevents them from doing what they want to do.
Misses the instant, tactile feedback expected from a tool.

This discomfort might feel minor to some people. However, when the user goes through it many times a day, it adds up to significant overhead. And that's not what we should expect from a tool that is meant to help.

Two qualities define a perfect recording button: it must be functional and responsive. The press should never wait on the pipeline.

Imperfect case:

User presses a button.

Button is disabled, shows a progress state or simply doesn't react.

After some time, the button is functional again, and recording is started.

Behind the scenes

Why is starting recording not that simple? When you press the button, the following happens in XSpeak:

All these operations happen asynchronously. This means we give up control at each one, and by the time we resume, the world could have changed: the user might have pressed the button several more times, previously available resources might have become unavailable, and so on.

Besides that, the startup launches side management threads that restart the mixer to prevent drift between two sources and restart the transcriber to prevent model context overflow.

Quite a start, isn't it? Probably, after that, your perception of this simple button will change, sorry for that :)

Let's see how different apps manage this or similar complexity.

iPhone Voice Memos

iOS 26.5

When I press start, it starts. When I press stop, it stops. Nothing more.

Otter

Otter 1.4.2

As you can see, the button becomes disabled while it starts recording. This makes me feel slightly uncomfortable every time I press it. The app feels unresponsive and heavy. And I need to wait before I can stop recording.

Talat

Talat 0.11.5

The button is disabled while recording is starting. The good thing is that the recording start is quite fast here. However, it still leaves a slight feeling of unresponsiveness.

MacWhisper

MacWhisper 13.21.1

There's a slight delay between the press of the start button and the appearance of the stop button. Also, the button changes its position after I start recording, which requires additional cognitive effort from me to find it.

Fireflies

Fireflies 0.1.30

The button is locked during start.

XSpeak

XSpeak 3.7

As you can see, the button reacts instantly to user action. And if you change your mind, it reacts instantly back.

Implementation

I won't write a book here about all the approaches I considered and tried. Instead, I'll go from a naive approach to the solution I implemented.

Let's agree that we want instant feedback from the button and will not disable it during our startup chain. Also, let's declare our states: S_ui, the state the button shows, and S_real, the actual state of the recording pipeline.

Each can be started or stopped. Our goal: keep them eventually consistent without ever blocking the user.
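A minimal sketch of how these states could be modeled in Swift, assuming S_ui is what the button shows and S_real is what the audio pipeline is actually doing. The names `RecordingState` and `RecorderStates` are hypothetical and only serve the illustration.

```swift
/// The two values each state can take.
enum RecordingState {
    case started
    case stopped

    /// The state a button press asks for.
    var toggled: RecordingState {
        self == .started ? .stopped : .started
    }
}

/// Hypothetical container for the two states discussed above.
struct RecorderStates {
    /// S_ui: what the button shows to the user.
    var ui: RecordingState = .stopped
    /// S_real: what the recording pipeline is actually doing.
    var real: RecordingState = .stopped

    /// The goal is that this eventually becomes true after every press.
    var isConsistent: Bool { ui == real }
}

var states = RecorderStates()
states.ui = states.ui.toggled   // the user presses the button
print(states.ui, states.isConsistent)  // prints "started false"
```

Right after a press the two states disagree on purpose. The rest of the design is about bringing S_real back in line with S_ui.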

In the naive approach, when the user presses the button, we:

Change S_ui to started.
Launch startup pipeline.

However, the obvious problem would be a race condition. Imagine the following order of operations:

In the end, we have S_ui = stopped and S_real = started.
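One interleaving that leads to this outcome can be replayed step by step. This is a sketch under an assumption: the stop pipeline finishes before the slower start pipeline does. It is a sequential simulation of the events, not real concurrency, and `replayRace` is a hypothetical name.

```swift
enum RecordingState {
    case started
    case stopped
}

/// Replays one hypothetical interleaving of two unsynchronized pipelines.
func replayRace() -> (ui: RecordingState, real: RecordingState, log: [String]) {
    var ui = RecordingState.stopped    // S_ui
    var real = RecordingState.stopped  // S_real
    var log: [String] = []

    // 1. The user presses start: S_ui flips at once, the slow start pipeline is launched.
    ui = .started
    log.append("press start")

    // 2. The user presses stop before the start pipeline has finished:
    //    S_ui flips back and a stop pipeline is launched in parallel.
    ui = .stopped
    log.append("press stop")

    // 3. The stop pipeline finishes first (assumed to be faster).
    real = .stopped
    log.append("stop finished")

    // 4. The start pipeline finishes last and overwrites S_real.
    real = .started
    log.append("start finished")

    return (ui, real, log)
}

let result = replayRace()
print(result.ui, result.real)  // prints "stopped started"
```

The button says nothing is being recorded while the microphone is live, which is the worst possible mismatch for a recording app.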

We have to linearize this pipeline to prevent such races. The first thing that would help is to prevent start and stop operations from running simultaneously. We'll use a queue for that:

Operations run one at a time, in submission order. No two operations overlap.
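A serial queue like this could be sketched with an `AsyncStream` drained by a single task. This is an illustration, not XSpeak's actual code: `SerialQueue`, `EventLog`, and `demo` are hypothetical names, and `AsyncStream.makeStream` requires Swift 5.9 or later.

```swift
/// Hypothetical serial queue: operations run one at a time, in submission order.
final class SerialQueue {
    typealias Operation = @Sendable () async -> Void

    private let continuation: AsyncStream<Operation>.Continuation
    private let worker: Task<Void, Never>

    init() {
        let (stream, continuation) = AsyncStream<Operation>.makeStream()
        self.continuation = continuation
        // A single consumer awaits each operation before taking the next one,
        // so no two operations overlap.
        self.worker = Task {
            for await operation in stream {
                await operation()
            }
        }
    }

    /// Enqueues an operation; returns immediately without waiting for it.
    func submit(_ operation: @escaping Operation) {
        continuation.yield(operation)
    }

    /// Closes the queue and waits until everything already submitted has run.
    func finish() async {
        continuation.finish()
        await worker.value
    }
}

/// Collects events from concurrent code safely.
actor EventLog {
    private(set) var entries: [String] = []
    func add(_ entry: String) { entries.append(entry) }
}

func demo() async -> [String] {
    let log = EventLog()
    let queue = SerialQueue()
    queue.submit {
        await log.add("start begins")
        try? await Task.sleep(nanoseconds: 50_000_000)  // simulated slow start pipeline
        await log.add("start ends")
    }
    queue.submit {
        await log.add("stop begins")
        await log.add("stop ends")
    }
    await queue.finish()
    return await log.entries
}

let entries = await demo()
print(entries)
```

Even though the stop operation is much faster than the start operation, it only begins after the start operation has ended.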

We also need to introduce one more state: S_op, the target state (started or stopped) that an operation was submitted with.

When we want to start or stop recording, we submit an operation to the queue. This way no two operations overlap and each operation waits for its turn. As a result, S_ui always equals the S_op of the last submitted operation.

However, this means the work is delayed and doesn't start immediately, while we still want to give the user immediate feedback. To achieve that, we'll update S_ui on the MainActor and S_real from the queue: when the button is pressed, S_ui changes immediately, and the work is submitted afterward. This solution brings the following challenges:

When the actual queue operation starts, the world could have changed, and the operation might not be necessary anymore.
If the queue grows, there might be a significant delay. Imagine the button is pressed 100 times in a row: we'll have 100 operations of 0.5 s each, resulting in 50 seconds of work.

The world could have changed during the time we waited for the operation to start. It means the user could have stopped the recording, started it again, or even in a corner case, done it several times. To determine if the operation still makes sense, we should compare each operation's S_op with the current S_ui and S_real. If S_op is started and S_ui is stopped, we shouldn't start anymore, so we just exit. The same is true when S_op is stopped, but S_ui is started. Additionally, if S_op already equals S_real, the work is already done, so we exit as well.

This means that the earliest operation whose S_op equals the current S_ui and does not equal S_real will perform the work. This change significantly reduces the delay between submission and the actual start of work.
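The skip rule can be written as a small pure function. The sketch below also replays a burst of presses, assuming all presses happen before the queue begins processing and the recorder starts out stopped. The names `shouldRun` and `executedCount` are hypothetical.

```swift
enum RecordingState {
    case started
    case stopped
}

/// Decides whether a queued operation still has work to do.
func shouldRun(op: RecordingState, ui: RecordingState, real: RecordingState) -> Bool {
    // The user has changed their mind since this operation was submitted.
    guard op == ui else { return false }
    // The pipeline is already in the requested state.
    guard op != real else { return false }
    return true
}

/// Counts how many operations actually do work after `presses` rapid presses.
func executedCount(presses: Int) -> Int {
    var ui = RecordingState.stopped    // S_ui
    var real = RecordingState.stopped  // S_real
    var submitted: [RecordingState] = []

    // Every press flips S_ui at once and submits an operation with that target (S_op).
    for _ in 0..<presses {
        ui = (ui == .started) ? .stopped : .started
        submitted.append(ui)
    }

    // The queue then processes the operations one by one, in order.
    var executed = 0
    for op in submitted {
        if shouldRun(op: op, ui: ui, real: real) {
            real = op   // the operation performs its work
            executed += 1
        }
    }
    return executed
}

print(executedCount(presses: 100))  // prints "0"
print(executedCount(presses: 101))  // prints "1"
```

Under these assumptions, 100 presses leave S_ui where it began, so every operation exits right away instead of producing 50 seconds of work. With 101 presses, only the first start operation does any work.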

There's one more thing we should do to improve performance further. Imagine the following order of operations:

If the user presses stop when the start operation is already in progress, we have to wait until the start operation finishes. It results in unnecessary delay and extra work.

To resolve this, we'll treat each suspension point where we schedule async work during our operation as a potential interruption point. After every step that awaits, we'll check if the target S_ui is still the same. And if it changes, we'll drop the operation and return.

However, once a step changes state, such as starting physical microphone recording, things become more complex because that change has to be reverted. But reverting is exactly what the opposite operation does. So for consistency, after any step that changes state, we finish the operation and let the opposite operation revert everything. In the end, we get the desired S_real, which is equal to S_ui.

At every suspension point we re-check the target state. If it changed, we drop and return.
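Here is a rough sketch of a start operation with such checkpoints. The step names, the `Step`, `Outcome`, and `Target` types, and the choice of which steps change state are all assumptions made for illustration. A plain actor stands in for S_ui so the sketch can run as a script.

```swift
enum RecordingState {
    case started
    case stopped
}

/// Stand-in for S_ui; the user "presses stop" by calling set(.stopped).
actor Target {
    private(set) var value = RecordingState.started
    func set(_ newValue: RecordingState) { value = newValue }
}

struct Step {
    let name: String
    let changesState: Bool               // true if the step touches real resources
    let work: @Sendable () async -> Void
}

struct Outcome: Equatable {
    let finished: Bool
    let stepsRun: [String]
}

func runStart(_ steps: [Step], target: Target) async -> Outcome {
    var stepsRun: [String] = []
    var committed = false
    for step in steps {
        await step.work()                 // suspension point
        stepsRun.append(step.name)
        committed = committed || step.changesState
        // Once state has changed, finish; the opposite operation will revert it.
        if committed { continue }
        // Otherwise re-check the target and drop out if the user changed their mind.
        if await target.value != .started {
            return Outcome(finished: false, stepsRun: stepsRun)
        }
    }
    return Outcome(finished: true, stepsRun: stepsRun)
}

/// Runs a start operation during which the user presses stop in the given step.
func demo(stopDuring stepName: String) async -> Outcome {
    let target = Target()
    let plan: [(name: String, changesState: Bool)] = [
        ("prepare transcriber", false),
        ("start microphone", true),
        ("start mixer", true),
    ]
    var steps: [Step] = []
    for item in plan {
        steps.append(Step(name: item.name, changesState: item.changesState, work: {
            if item.name == stepName { await target.set(.stopped) }
        }))
    }
    return await runStart(steps, target: target)
}

let early = await demo(stopDuring: "prepare transcriber")
let late = await demo(stopDuring: "start microphone")
print(early.finished, early.stepsRun)  // prints: false ["prepare transcriber"]
print(late.finished, late.stepsRun.count)  // prints: true 3
```

In the early case nothing real has been touched, so the operation simply returns. In the late case the microphone is already running, so the start completes and the queued stop operation brings S_real back to S_ui.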

In practice, there are more complexities because sometimes we have non-standard user flows. But this architecture, where every audio manipulation goes through the queue, allows us to maintain a consistent and reliable state and gives us a good background to improve the app.

All product names, logos, and brands are property of their respective owners. Use of these names, logos, and brands does not imply endorsement.
