From Core Audio to LLMs: Native macOS Audio Capture for AI-Powered Tools

The complete source code for this guide is available on GitHub.

I. Introduction

1.1 Why Capture Audio?

Ever wished you could have a smart note-taker that captures and transcribes audio in real-time? The ability to capture audio from your macOS system – including both application sounds and microphone input – opens up exciting possibilities for AI-powered tools. From real-time transcription and AI assistants to accessibility features and audio analysis, the applications are vast. The key to unlocking this potential lies in effective system audio capture, which we'll explore in detail throughout this post.

1.2 Core Audio Overview

Core Audio is Apple's low-level audio framework for macOS, providing fine-grained control over audio capture, processing, and output. While higher-level frameworks exist for basic audio operations, system audio capture requires direct interaction with Core Audio's advanced APIs. This framework offers the necessary tools for intercepting system audio streams, managing audio devices, and handling complex audio routing scenarios.

1.3 Scope of this Post

In this blog post, we will focus specifically on how to capture system audio on macOS using Core Audio. We will dive into the essential concepts and code needed to intercept and record audio generated by your system and applications. Although we will touch on microphone access briefly, our main goal is to focus on the complexities of system audio capture. We will guide you through setting up the necessary components, handling permissions, processing the audio data, and managing device changes. This is intended to be a practical guide, showing you how to achieve system audio capture on macOS. We won't delve into every detail of the vast Core Audio API, but rather provide a clear path to accomplish this important functionality.

What you'll learn:

  • How to set up and configure Core Audio for system audio capture
  • Managing permissions for both microphone and system audio access
  • Implementing robust audio device change handling
  • Processing and converting audio data efficiently
  • Building a Node.js native module for easy integration

The sample code is written in Objective-C++; let me know if you are looking for a Swift version.

II. Core Audio Basics

2.1 Audio Taps and Aggregate Devices

Alright, let's dive into the weird world of Core Audio. If you want to capture the sound coming from your macOS system, not just your microphone, you'll need to wrap your head around two key concepts: audio taps and aggregate devices. Think of audio taps like little wiretaps on the audio signal. They're points where you can intercept and, well, “tap into” the stream of audio data. This is how we can actually get at the system audio that's usually hidden away. You can't use an audio tap directly as an input device; taps are mostly used together with aggregate devices, which let you route the audio from a selected tap into a device you can read from.

An audio tap lets you subscribe to audio data from specific processes, or from the entire system excluding certain processes, configured through a descriptor called CATapDescription.

Now, audio taps aren't enough on their own. This is where aggregate devices come in. With aggregate devices, you piece together different audio sources (inputs, outputs, even other aggregate devices) into one virtual device. In our case, we’ll be creating an aggregate device that includes both your default input and output devices. We then connect our audio tap to this aggregate device, to allow us to capture the system audio output. It’s kind of like building a little audio lab within your system, where you can route the sound exactly where you need it.

We need to mention a few more important concepts before we move on: the AudioDeviceID is simply an integer used by Core Audio to identify an audio device. Every input and output device has one. Then we have AudioStreamBasicDescription, or ASBD for short; this structure defines the format of our audio stream, including parameters like sample rate, number of channels, and encoding. Finally, we have AudioBufferList, the container for audio data; think of it as an array of audio data chunks. These concepts will come up again later in this post.
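
For illustration, here is what an ASBD for the kind of audio we'll be handing around later in this post might look like; the values are illustrative, not taken from the project:

// Illustrative ASBD: mono, packed 32-bit float linear PCM at 16 kHz.
AudioStreamBasicDescription asbd = {0};
asbd.mSampleRate       = 16000.0;
asbd.mFormatID         = kAudioFormatLinearPCM;
asbd.mFormatFlags      = kAudioFormatFlagIsFloat | kAudioFormatFlagIsPacked;
asbd.mChannelsPerFrame = 1;
asbd.mBitsPerChannel   = 32;
asbd.mBytesPerFrame    = asbd.mChannelsPerFrame * sizeof(Float32); // 4 bytes per frame
asbd.mFramesPerPacket  = 1;                                        // uncompressed PCM
asbd.mBytesPerPacket   = asbd.mBytesPerFrame * asbd.mFramesPerPacket;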

2.2 IO Procs

So, we've got our tap on the system audio, and an aggregate device routing the audio to us, but how do we actually get the data and do something with it? This is where IO Procs come into play. An IO Proc is simply a callback function that the audio hardware triggers every time new audio data is available. Think of them like the delivery guys bringing the audio samples to your door. The IO Proc receives a pointer to an AudioBufferList, which, as we've mentioned previously, is where our audio data samples are stored.

It's important to note that IO Procs run on a dedicated audio thread, which is a special thread that needs to be quick and efficient. You should avoid doing complex processing or UI updates directly inside the IO Proc as it can lead to performance issues such as audio stutter or drop-outs. Instead, do the minimal work needed inside the callback, and then send the data to another thread for further processing.
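
As a rough sketch of that pattern (illustrative only; the project's actual IO Proc appears in section 4.4), you might copy the samples and hand them off to a background queue like this:

// Illustrative IO Proc: do the cheap copy here, push the real work elsewhere.
static OSStatus MinimalIOProc(AudioObjectID inDevice,
                              const AudioTimeStamp *inNow,
                              const AudioBufferList *inInputData,
                              const AudioTimeStamp *inInputTime,
                              AudioBufferList *outOutputData,
                              const AudioTimeStamp *inOutputTime,
                              void *inClientData) {
    const AudioBuffer *buffer = &inInputData->mBuffers[0];
    // Copy the incoming samples; the buffer is only valid for the duration of this call.
    NSData *chunk = [NSData dataWithBytes:buffer->mData length:buffer->mDataByteSize];

    // Hand the copy to a background queue. A lock-free ring buffer would be even
    // friendlier to the real-time thread, but this keeps the sketch short.
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
        // ... resample, convert, forward to your app, etc. ...
        (void)chunk;
    });
    return noErr;
}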

2.3 How It All Fits Together

Here's a visual representation of how these components interact:

┌─────────────────┐
│  System Audio   │  ← System sound output
└────────┬────────┘
         │
         ▼         ← Audio data flow
┌─────────────────┐
│   Audio Tap     │  ← Intercepts audio stream
└────────┬────────┘
         │
         ▼         ← Routed audio
┌─────────────────┐
│ Aggregate Device│  ← Combines audio sources
└────────┬────────┘
         │
         ▼         ← Processed audio
┌─────────────────┐
│    IO Proc      │  ← Callback processing
└────────┬────────┘
         │
         ▼         ← Final audio data
┌─────────────────┐
│  Your App       │
└─────────────────┘

Think of this system like a professional audio recording setup:

  • The Audio Tap is like placing a microphone in your system
  • The Aggregate Device is like a mixing board combining different audio sources
  • The IO Proc is like the sound engineer monitoring and controlling the audio flow

III. Audio Permissions

macOS takes audio privacy seriously, with separate permission systems for microphone access and system audio capture. Let's break down how to handle both.

3.1 Microphone Permissions

Microphone access uses AVFoundation's permission system. While this is straightforward, there are some important considerations to keep in mind.

Basic Implementation

// Query the current microphone permission status from AVFoundation
AVAuthorizationStatus micStatus = [AVCaptureDevice authorizationStatusForMediaType:AVMediaTypeAudio];

// Request microphone access through AVFoundation's permission system
[AVCaptureDevice requestAccessForMediaType:AVMediaTypeAudio completionHandler:^(BOOL granted) {
    if (granted) {
        // Permission granted, proceed with microphone access
    } else {
        // Permission denied
    }
}];

Best Practices

  • Always check permission status before requesting access
  • Handle permission changes during app runtime
  • Provide a clear usage description in Info.plist
  • Consider adding UI to guide users to System Preferences if permission is denied
<!-- Required Info.plist entry -->
<key>NSMicrophoneUsageDescription</key>
<string>We need microphone access to capture audio for transcription.</string>
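
Putting the check and the request together, a small helper (a sketch, not code from the project) could look like this:

#import <AVFoundation/AVFoundation.h>

// Sketch: resolve the microphone permission state, prompting only when undetermined.
static void EnsureMicrophonePermission(void (^completion)(BOOL granted)) {
    switch ([AVCaptureDevice authorizationStatusForMediaType:AVMediaTypeAudio]) {
        case AVAuthorizationStatusAuthorized:
            completion(YES);
            break;
        case AVAuthorizationStatusNotDetermined:
            // Shows the system prompt; the handler may run on an arbitrary queue.
            [AVCaptureDevice requestAccessForMediaType:AVMediaTypeAudio
                                     completionHandler:completion];
            break;
        default:
            // Denied or restricted: consider guiding the user to System Settings.
            completion(NO);
            break;
    }
}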

3.2 System Audio Capture Permissions

(Image: the system audio capture permission dialog)

System audio capture permissions are more complex, requiring interaction with the private TCC (Transparency, Consent, and Control) framework. This is the same framework used for screen recording permissions.

Required Setup

  1. Add the necessary entitlement to your app:
<key>com.apple.security.device.audio-input</key>
<true/>
  2. Include a proper usage description:
<key>NSSystemAudioCaptureUsageDescription</key>
<string>We need to capture system audio for transcription.</string>

The real complexity comes in actually checking and requesting these permissions. Unfortunately, the approach for getting access to system audio is not officially documented; the one used here is inspired by https://github.com/insidegui/AudioCap/blob/main/AudioCap/ProcessTap/AudioRecordingPermission.swift. Since TCC is a private framework, we need to use dynamic loading to access it:

void *tccHandle = dlopen("/System/Library/PrivateFrameworks/TCC.framework/Versions/A/TCC", RTLD_NOW);
if (!tccHandle) {
    // Handle error
    return;
}

We then need to get function pointers for the TCC permission check and request functions:

typedef int (*TCCPreflightFuncType)(CFStringRef service, CFDictionaryRef options);
typedef void (*TCCRequestFuncType)(CFStringRef service, CFDictionaryRef options,
                                 void (^completionHandler)(BOOL granted));

TCCPreflightFuncType preflightFunc = (TCCPreflightFuncType)dlsym(tccHandle, "TCCAccessPreflight");
TCCRequestFuncType requestFunc = (TCCRequestFuncType)dlsym(tccHandle, "TCCAccessRequest");

The actual permission check looks like this:

int result = preflightFunc(CFSTR("kTCCServiceAudioCapture"), NULL);
switch (result) {
    case 0: // Authorized
        // Proceed with audio capture
        break;
    case 1: // Denied
        // Handle denied state
        break;
    case 2: // Not determined
        // Need to request permission
        break;
}

When the permission hasn't been determined yet, we need to request it:

requestFunc(CFSTR("kTCCServiceAudioCapture"), NULL, ^(BOOL granted) {
    if (granted) {
        // Permission granted, proceed with setup
    } else {
        // Permission denied, handle accordingly
    }
});
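
Putting the preflight and request calls together, a hedged sketch of a helper built from the function pointers above might look like this:

// Sketch: resolve system audio permission, prompting only when undetermined.
static void EnsureSystemAudioPermission(TCCPreflightFuncType preflightFunc,
                                        TCCRequestFuncType requestFunc,
                                        void (^completion)(BOOL granted)) {
    int result = preflightFunc(CFSTR("kTCCServiceAudioCapture"), NULL);
    if (result == 0) {
        completion(YES);                                    // Already authorized
    } else if (result == 2) {
        requestFunc(CFSTR("kTCCServiceAudioCapture"), NULL, // Not determined: prompt
                    completion);
    } else {
        completion(NO);                                     // Denied or unknown state
    }
}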

Remember that system audio capture is a privileged operation. Always provide clear feedback to users about what audio is being captured and why. Consider adding UI elements to show when capture is active, similar to how macOS shows the recording indicator in the menu bar.

IV. Capturing System Audio

Now that we've covered the fundamentals, let's roll up our sleeves and explore the core technical aspects of our audio capture implementation. This section will dissect the critical components and algorithms that make our module tick.

4.1 Creating an Audio Tap

The first step in capturing system audio is creating an audio tap. This is our entry point into the system's audio stream. Here's how we implement it:

- (BOOL)setupAudioTapIfNeeded:(NSError **)error {
    if (_tapUID != NULL) {
        return YES;
    }

    CATapDescription *desc = [[CATapDescription alloc]
                           initMonoGlobalTapButExcludeProcesses:@[]];
    _tapUID = [NSUUID UUID];

    desc.name = [NSString stringWithFormat: @"audiorec-tap-%@", _tapUID];
    desc.UUID = _tapUID;
    desc.privateTap = true;
    desc.muteBehavior = CATapUnmuted;
    desc.exclusive = false;
    desc.mixdown = true;

    _tapObjectID = kAudioObjectUnknown;
    OSStatus ret = AudioHardwareCreateProcessTap(desc, &_tapObjectID);

    if (ret != kAudioHardwareNoError) {
        // Handle error
        return NO;
    }

    return YES;
}

The tap configuration is crucial - we're creating a global tap that captures all system audio, mixing it down to a single stream.
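
Our tap is global, but CATapDescription also has process-scoped initializers if you only want audio from specific applications. Those initializers take Core Audio process objects rather than raw PIDs; the sketch below is an assumption based on the AudioHardware headers and the AudioCap project, not code from this repo:

// Sketch: translate a pid_t into the AudioObjectID that process-scoped
// CATapDescription initializers (e.g. initStereoMixdownOfProcesses:) expect.
static AudioObjectID ProcessObjectForPID(pid_t pid) {
    AudioObjectPropertyAddress address = {
        .mSelector = kAudioHardwarePropertyTranslatePIDToProcessObject,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    AudioObjectID processObject = kAudioObjectUnknown;
    UInt32 dataSize = sizeof(processObject);
    OSStatus status = AudioObjectGetPropertyData(kAudioObjectSystemObject,
                                                 &address,
                                                 sizeof(pid), &pid,  // qualifier: the PID
                                                 &dataSize, &processObject);
    return status == noErr ? processObject : kAudioObjectUnknown;
}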

4.2 Creating an Aggregate Device

An aggregate device acts as a virtual audio device that combines multiple audio sources. This is essential for properly routing our audio tap and synchronizing multiple audio streams. The setup process involves:

  1. Getting default device references
  2. Retrieving device identifiers
  3. Creating the aggregate device configuration

Here's how we implement this:

- (BOOL)setupAggregateDeviceIfNeeded:(NSError **)error {
    if (_aggregateDeviceID != kAudioDeviceUnknown) {
        return YES;
    }

    // Retrieve system's default input and output audio device IDs
    AudioDeviceID inputDeviceID, outputDeviceID;
    UInt32 propertySize = sizeof(AudioDeviceID);  // Size of an AudioDeviceID
    AudioObjectPropertyAddress propertyAddress = {
        .mSelector = kAudioHardwarePropertyDefaultInputDevice,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    // Get default input device
    OSStatus status = AudioObjectGetPropertyData(kAudioObjectSystemObject,
                                               &propertyAddress,
                                               0,
                                               NULL,
                                               &propertySize,
                                               &inputDeviceID);

    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get default input device"}];
        }
        return NO;
    }

    // Get default output device
    propertyAddress.mSelector = kAudioHardwarePropertyDefaultOutputDevice;
    status = AudioObjectGetPropertyData(kAudioObjectSystemObject,
                                      &propertyAddress,
                                      0,
                                      NULL,
                                      &propertySize,
                                      &outputDeviceID);

    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get default output device"}];
        }
        return NO;
    }

    // Get device UIDs
    CFStringRef inputUID, outputUID;
    AudioObjectPropertyAddress uidPropertyAddress = {
        .mSelector = kAudioDevicePropertyDeviceUID,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    UInt32 dataSize = sizeof(CFStringRef);

    // Get input device UID
    status = AudioObjectGetPropertyData(inputDeviceID,
                                      &uidPropertyAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &inputUID);
    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get input device UID"}];
        }
        return NO;
    }

    // Get output device UID
    status = AudioObjectGetPropertyData(outputDeviceID,
                                      &uidPropertyAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &outputUID);
    if (status != noErr) {
        CFRelease(inputUID);
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get output device UID"}];
        }
        return NO;
    }

    // Get sample rates for both devices
    Float64 inputSampleRate, outputSampleRate;
    AudioObjectPropertyAddress sampleRateAddress = {
        .mSelector = kAudioDevicePropertyNominalSampleRate,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    dataSize = sizeof(Float64);

    // Get input device sample rate
    status = AudioObjectGetPropertyData(inputDeviceID,
                                      &sampleRateAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &inputSampleRate);
    if (status != noErr) {
        CFRelease(inputUID);
        CFRelease(outputUID);
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get input device sample rate"}];
        }
        return NO;
    }

    // Get output device sample rate
    status = AudioObjectGetPropertyData(outputDeviceID,
                                      &sampleRateAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &outputSampleRate);
    if (status != noErr) {
        CFRelease(inputUID);
        CFRelease(outputUID);
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get output device sample rate"}];
        }
        return NO;
    }

    // Choose master device based on lower sample rate
    NSString *masterDeviceUID = inputSampleRate <= outputSampleRate ?
        (__bridge NSString *)inputUID : (__bridge NSString *)outputUID;

    // Generate unique identifier for aggregate device
    NSUUID* aggregateUID = [NSUUID UUID];

    // Create aggregate device configuration
    NSDictionary* description = @{
        @(kAudioAggregateDeviceUIDKey): [aggregateUID UUIDString],
        @(kAudioAggregateDeviceIsPrivateKey): @(1),
        @(kAudioAggregateDeviceIsStackedKey): @(0),
        @(kAudioAggregateDeviceMasterSubDeviceKey): masterDeviceUID,
        @(kAudioAggregateDeviceSubDeviceListKey): @[
            @{
                @(kAudioSubDeviceUIDKey): (__bridge NSString *)inputUID,
                @(kAudioSubDeviceDriftCompensationKey): @(0),
                @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
            },
            @{
                @(kAudioSubDeviceUIDKey): (__bridge NSString *)outputUID,
                @(kAudioSubDeviceDriftCompensationKey): @(1),
                @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
            },
        ],
        @(kAudioAggregateDeviceTapListKey): @[
            @{
                @(kAudioSubTapDriftCompensationKey): @(1),
                @(kAudioSubTapUIDKey): [_tapUID UUIDString],
            },
        ],
    };

    // Create the aggregate device
    AudioDeviceID aggregateDeviceID;
    status = AudioHardwareCreateAggregateDevice((__bridge CFDictionaryRef)description, &aggregateDeviceID);

    CFRelease(inputUID);
    CFRelease(outputUID);

    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to create aggregate device"}];
        }
        return NO;
    }

    _aggregateDeviceID = aggregateDeviceID;
    return YES;
}

4.3 Configuring the Aggregate Device

The aggregate device's behavior is driven by the description dictionary we pass to AudioHardwareCreateAggregateDevice. Let's look at it more closely:

NSString *masterDeviceUID = inputSampleRate <= outputSampleRate ?
    (__bridge NSString *)inputUID : (__bridge NSString *)outputUID;

NSDictionary* description = @{
    @(kAudioAggregateDeviceUIDKey): [aggregateUID UUIDString],
    @(kAudioAggregateDeviceIsPrivateKey): @(1),
    @(kAudioAggregateDeviceIsStackedKey): @(0),
    @(kAudioAggregateDeviceMasterSubDeviceKey): masterDeviceUID,
    @(kAudioAggregateDeviceSubDeviceListKey): @[
        @{
            @(kAudioSubDeviceUIDKey): (__bridge NSString *)inputUID,
            @(kAudioSubDeviceDriftCompensationKey): @(0),
            @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
        },
        @{
            @(kAudioSubDeviceUIDKey): (__bridge NSString *)outputUID,
            @(kAudioSubDeviceDriftCompensationKey): @(1),
            @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
        },
    ],
    @(kAudioAggregateDeviceTapListKey): @[
        @{
            @(kAudioSubTapDriftCompensationKey): @(1),
            @(kAudioSubTapUIDKey): [_tapUID UUIDString],
        },
    ],
};

Key configuration points include:

  • Setting up drift compensation between devices
  • Linking our audio tap to the aggregate device
  • Specifying the master subdevice. This is crucial because the system uses the master device's sample rate for the output. I'd suggest picking the device with the lower sample rate as the master: the system automatically downsamples the audio for the other device, but it will not upsample if you choose a master device with a higher sample rate (see the sketch below).
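
As a quick sanity check on the master-device choice, you can read the aggregate device's nominal sample rate after it has been created. A minimal sketch, using the _aggregateDeviceID from the code above:

// Sketch: confirm the aggregate device inherited the master device's sample rate.
AudioObjectPropertyAddress rateAddress = {
    .mSelector = kAudioDevicePropertyNominalSampleRate,
    .mScope = kAudioObjectPropertyScopeGlobal,
    .mElement = kAudioObjectPropertyElementMain
};

Float64 aggregateRate = 0;
UInt32 dataSize = sizeof(aggregateRate);
OSStatus status = AudioObjectGetPropertyData(_aggregateDeviceID, &rateAddress,
                                             0, NULL, &dataSize, &aggregateRate);
if (status == noErr) {
    NSLog(@"Aggregate device is running at %.0f Hz", aggregateRate);
}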

4.4 Starting and Stopping Capture

With our setup complete, we can start capturing audio:

- (BOOL)startCapture:(NSError **)error {
    if (_isCapturing) {
        return YES;
    }

    // Set up IO proc for the aggregate device
    OSStatus status = AudioDeviceCreateIOProcID(_aggregateDeviceID,
                                             HandleAudioDeviceIOProc,
                                             (__bridge void *)self,
                                             &_deviceProcID);

    if (status != noErr) {
        return NO;
    }

    // Start the IO proc
    status = AudioDeviceStart(_aggregateDeviceID, _deviceProcID);
    if (status != noErr) {
        AudioDeviceDestroyIOProcID(_aggregateDeviceID, _deviceProcID);
        _deviceProcID = NULL;
        return NO;
    }

    _isCapturing = YES;
    return YES;
}

The IO proc is where we receive our audio data:

static OSStatus HandleAudioDeviceIOProc(AudioDeviceID inDevice,
                                      const AudioTimeStamp* inNow,
                                      const AudioBufferList* inInputData,
                                      const AudioTimeStamp* inInputTime,
                                      AudioBufferList* outOutputData,
                                      const AudioTimeStamp* inOutputTime,
                                      void* inClientData) {
    AudioManager *audioManager = (__bridge AudioManager *)inClientData;
    [audioManager handleAudioInput:inInputData];
    return noErr;
}

4.5 Handling Device Changes

One of the trickier aspects of audio capture is handling device changes gracefully. Users can plug in or unplug devices, or switch their default devices at any time:

- (void)startDeviceMonitoring {
    AudioObjectPropertyAddress propertyAddress = {
        .mSelector = kAudioHardwarePropertyDefaultInputDevice,
        .mScope = kDeviceChangeScope,
        .mElement = kDeviceChangeElement
    };

    // Create block for device changes
    AudioManager* blockSelf = self;
    _deviceChangeListener = ^(UInt32 inNumberAddresses,
                            const AudioObjectPropertyAddress* inAddresses) {
        [blockSelf handleDeviceChange];
    };

    // Add listener for input device changes
    OSStatus status = AudioObjectAddPropertyListenerBlock(kAudioObjectSystemObject,
                                                        &propertyAddress,
                                                        self->_audioQueue,
                                                        self->_deviceChangeListener);

    if (status != noErr) {
        Log("Failed to add input device change listener", "error");
        return;
    }

    // Add listener for output device changes
    propertyAddress.mSelector = kAudioHardwarePropertyDefaultOutputDevice;
    status = AudioObjectAddPropertyListenerBlock(kAudioObjectSystemObject,
                                               &propertyAddress,
                                               self->_audioQueue,
                                               self->_deviceChangeListener);

    if (status != noErr) {
        Log("Failed to add output device change listener", "error");
        return;
    }
}

- (void)handleDeviceChange {
    // If we're currently capturing, we need to recreate the audio setup
    BOOL wasCapturing = _isCapturing;
    if (wasCapturing) {
        NSError *error = nil;
        [self stopCapture:&error];
        if (error) {
            Log(std::string("Failed to stop capture after device change: ") +
                     std::string([error.localizedDescription UTF8String]), "error");
            return;
        }
    }

    // Destroy and recreate audio resources
    [self destroyAudioResources];

    NSError *error = nil;
    if (![self setupAudioTapIfNeeded:&error]) {
        Log(std::string("Failed to setup audio tap after device change: ") +
                 std::string([error.localizedDescription UTF8String]), "error");
        return;
    }

    if (![self setupAggregateDeviceIfNeeded:&error]) {
        Log(std::string("Failed to setup aggregate device after device change: ") +
                 std::string([error.localizedDescription UTF8String]), "error");
        return;
    }

    // If we were capturing before, restart capture
    if (wasCapturing) {
        NSError *error = nil;
        [self startCapture:&error];
        if (error) {
            Log(std::string("Failed to start capture after device change: ") +
                     std::string([error.localizedDescription UTF8String]), "error");
            return;
        }
    }
}

- (void)stopDeviceMonitoring {
    if (_deviceChangeListener) {
        // Remove input device listener
        AudioObjectPropertyAddress propertyAddress = {
            .mSelector = kAudioHardwarePropertyDefaultInputDevice,
            .mScope = kDeviceChangeScope,
            .mElement = kDeviceChangeElement
        };

        AudioObjectRemovePropertyListenerBlock(kAudioObjectSystemObject,
                                             &propertyAddress,
                                             _audioQueue,
                                             _deviceChangeListener);

        // Remove output device listener
        propertyAddress.mSelector = kAudioHardwarePropertyDefaultOutputDevice;
        AudioObjectRemovePropertyListenerBlock(kAudioObjectSystemObject,
                                             &propertyAddress,
                                             _audioQueue,
                                             _deviceChangeListener);

        _deviceChangeListener = nil;
    }
}

- (void)destroyAudioResources {
    if (_deviceProcID && _aggregateDeviceID != kAudioDeviceUnknown) {
        AudioDeviceDestroyIOProcID(_aggregateDeviceID, _deviceProcID);
        _deviceProcID = NULL;
    }

    if (_tapObjectID != 0) {
        AudioHardwareDestroyProcessTap(_tapObjectID);
        _tapObjectID = 0;
    }

    if (_tapUID) {
        _tapUID = NULL;
    }

    if (_aggregateDeviceID != kAudioDeviceUnknown) {
        AudioHardwareDestroyAggregateDevice(_aggregateDeviceID);
        _aggregateDeviceID = kAudioDeviceUnknown;
    }
}

The device change handling system consists of several key components:

  1. Device Monitoring Setup
    • Establishes listeners for both input and output device changes
    • Uses a dedicated audio queue for handling changes
    • Implements proper error handling for listener setup
  2. Change Handling Process
    • Preserves the current capture state
    • Safely stops any ongoing capture
    • Destroys existing audio resources
    • Recreates and reconfigures audio components
    • Restores previous capture state if needed
  3. Resource Cleanup
    • Properly removes device change listeners
    • Cleans up all audio resources
    • Handles tap and aggregate device destruction
    • Ensures complete state reset

This implementation ensures smooth audio capture even when users plug in or unplug devices, change their default audio devices, or make other system audio configuration changes. The system maintains stability by properly managing resource lifecycle and handling state transitions.

V. Processing Audio Data

Processing audio data efficiently and correctly is crucial for any audio application. In our system audio capture implementation, we need to handle raw audio buffers, convert between formats, and manage memory carefully to avoid issues like audio glitches or memory leaks.

5.1 Accessing Audio Data

Audio data is accessed through a callback mechanism that provides buffers of raw PCM data. The implementation uses a block-based callback system to ensure thread safety and efficient data handling. When we receive audio data in our IO Proc callback, it comes in the form of an AudioBufferList. Here's how we handle the incoming data:

- (void)setAudioDataCallback:(void (^)(NSData *audioData))callback {
    _audioDataCallback = [callback copy];
}

- (void)handleAudioInput:(const AudioBufferList *)bufferList {
    if (!_isCapturing || !_audioDataCallback) {
        return;
    }

    @autoreleasepool {
        // Validate input data
        if (!bufferList || bufferList->mNumberBuffers == 0) {
            return;
        }

        const AudioBuffer *buffer = &bufferList->mBuffers[0];
        UInt32 numFrames = buffer->mDataByteSize / sizeof(Float32);

        // Process frames based on channel format
        BOOL isInterleaved = !(_sourceFormat.mFormatFlags & kAudioFormatFlagIsNonInterleaved);
        UInt32 numChannels = _sourceFormat.mChannelsPerFrame;
        if (numChannels == 0) {
            numChannels = bufferList->mNumberBuffers;
        }

        if (isInterleaved) {
            numFrames = numFrames / numChannels;
        }

        // Convert and process the audio data
        Float32 *processedData = [self processAudioData:bufferList
                                             numFrames:numFrames
                                          numChannels:numChannels];

        if (processedData) {
            NSData *audioData = [NSData dataWithBytes:processedData
                                             length:numFrames * sizeof(Float32)];
            dispatch_async(dispatch_get_main_queue(), ^{
                if (self->_audioDataCallback) {
                    self->_audioDataCallback(audioData);
                }
            });
            free(processedData);
        }
    }
}

Key aspects of audio data handling:

  • Data is provided as raw PCM in 32-bit float format
  • Buffer format can be either interleaved or non-interleaved
  • Processing happens on a dedicated audio thread
  • Callbacks are dispatched to the main thread for safety
  • Memory management with autorelease pool for consistent performance
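
To make the interleaved case concrete, here is a small illustrative helper (not part of the project) that extracts one channel from an interleaved Float32 buffer:

// Illustrative: pull a single channel out of an interleaved Float32 buffer.
// `frames` is the number of sample frames, `channels` the samples per frame.
static void ExtractChannel(const Float32 *interleaved, UInt32 frames,
                           UInt32 channels, UInt32 channelIndex, Float32 *out) {
    for (UInt32 frame = 0; frame < frames; frame++) {
        out[frame] = interleaved[frame * channels + channelIndex];
    }
}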

5.2 Sample Rate Conversion

One of the most critical aspects of audio processing is maintaining consistent output regardless of input device changes. We accomplish this through high-quality sample rate conversion:

- (Float32 *)resampleBuffer:(Float32 *)inputBuffer
                inputFrames:(UInt32)inputFrames
               outputFrames:(UInt32 *)outputFrames {
    Float64 sourceRate = _sourceFormat.mSampleRate;
    Float64 ratio = sourceRate / kTargetSampleRate;
    UInt32 newFrameLength = (UInt32)(inputFrames / ratio);

    // Allocate output buffer
    Float32 *resampledBuffer = (Float32 *)calloc(newFrameLength, sizeof(Float32));
    if (!resampledBuffer) {
        return NULL;
    }

    // Perform sinc resampling with Blackman window
    const UInt32 windowSize = 16;
    const UInt32 halfWindow = windowSize / 2;
    const float M_2PI = 2.0f * M_PI;

    for (UInt32 newIndex = 0; newIndex < newFrameLength; newIndex++) {
        float position = newIndex * ratio;
        int32_t centerIndex = (int32_t)floorf(position);
        float fracOffset = position - centerIndex;
        float sum = 0.0f;
        float weightSum = 0.0f;

        // Apply windowed sinc filter
        for (int32_t i = -(int32_t)halfWindow; i <= (int32_t)halfWindow; i++) {
            int32_t sampleIndex = centerIndex + i;

            if (sampleIndex < 0 || (UInt32)sampleIndex >= inputFrames) {
                continue;
            }

            float x = fracOffset - i;
            // Normalized sinc function
            float sincValue = (x == 0.0f) ? 1.0f : sinf(M_PI * x) / (M_PI * x);
            // Blackman window
            float windowValue = 0.42f - 0.5f * cosf(M_PI * (i + halfWindow) / halfWindow)
                            + 0.08f * cosf(M_2PI * (i + halfWindow) / halfWindow);
            float weight = sincValue * windowValue;

            sum += inputBuffer[sampleIndex] * weight;
            weightSum += weight;
        }

        // Normalize by total weight
        resampledBuffer[newIndex] = weightSum > 0.0f ? sum / weightSum : 0.0f;
    }

    *outputFrames = newFrameLength;
    return resampledBuffer;
}


Key features of our sample rate conversion:

  • Uses high-quality sinc interpolation for minimal artifacts
  • Applies Blackman window function to reduce aliasing
  • Maintains phase accuracy across device switches
  • Handles arbitrary input/output sample rate ratios
  • Automatically adapts to device changes

By maintaining a fixed output sample rate (in this case 22.05 kHz, which is good enough for transcription and similar use cases), we ensure that:

  • Downstream processing remains consistent
  • Memory usage is predictable
  • CPU load stays stable
  • Client applications receive a consistent format
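
For context, a hypothetical call site for the resampler inside the processing path might look like the following; the surrounding names (monoSamples, inputFrames) are assumptions, and kTargetSampleRate is the constant referenced in resampleBuffer above:

// Hypothetical glue: resample a mono chunk to the fixed output rate,
// package it for the callback, then free the temporary buffer.
UInt32 outputFrames = 0;
Float32 *resampled = [self resampleBuffer:monoSamples
                              inputFrames:inputFrames
                             outputFrames:&outputFrames];
if (resampled) {
    NSData *chunk = [NSData dataWithBytes:resampled
                                   length:outputFrames * sizeof(Float32)];
    // ... hand `chunk` to _audioDataCallback on the main queue ...
    free(resampled);
}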

VI. Using It as a Node.js Native Extension

Now that we have our Core Audio implementation working, let's package it as a Node.js native module for easy integration into JavaScript applications.

6.1 Module Interface

Our module exposes a clean TypeScript interface for JavaScript applications. The main interface is defined in native-modules.d.ts:

declare module "native-modules" {
  // Permission status types
  export type PermissionStatus =
    | "not_determined"
    | "denied"
    | "authorized"
    | "restricted";
  export type DeviceType = "microphone" | "audio";

  // Main interface for audio capture
  export interface AudioWrapperInstance {
    startCapture(callback: (data: ArrayBuffer) => void): void;
    stopCapture(): void;
    getPermissions(): PermissionResult;
    requestPermissions(deviceType: DeviceType): Promise<PermissionResult>;
  }

  // Module exports
  const addon: {
    AudioWrapper: {
      new (): AudioWrapperInstance;
    };
  };
  export default addon;
}

This interface provides a simple yet powerful API for managing audio capture:

  • Permission management through getPermissions and requestPermissions
  • Audio capture control with startCapture and stopCapture
  • Type-safe callback for receiving audio data as ArrayBuffer

6.2 Native Bindings with N-API

The native bindings are implemented using N-API (Node-API) in NativeModule.mm. Here's a simplified look at the key components:

class AudioWrapper : public Napi::ObjectWrap<AudioWrapper> {
public:
  static Napi::Object Init(Napi::Env env, Napi::Object exports) {
    // Define the JavaScript interface
    Napi::Function func = DefineClass(env, "AudioWrapper", {
      InstanceMethod("getPermissions", &AudioWrapper::GetPermissions),
      InstanceMethod("requestPermissions", &AudioWrapper::RequestPermissions),
      InstanceMethod("startCapture", &AudioWrapper::StartCapture),
      InstanceMethod("stopCapture", &AudioWrapper::StopCapture),
    });

    exports.Set("AudioWrapper", func);
    return exports;
  }

  // Constructor wraps our native AudioManager
  AudioWrapper(const Napi::CallbackInfo &info)
      : Napi::ObjectWrap<AudioWrapper>(info) {
    audioManager = [AudioManager sharedInstance];
  }

private:
  AudioManager *audioManager;
  // ... method implementations ...
};
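
The elided StartCapture implementation has to bridge the Objective-C callback, which fires off the JavaScript thread, back into the Node.js event loop. One way to do that with node-addon-api is a Napi::ThreadSafeFunction. The following is a hedged sketch of how that could look, not the project's exact code; in a real module you would keep the tsfn handle as a member so StopCapture can release it:

// Sketch: forward native audio chunks to the JS callback via a thread-safe function.
void AudioWrapper::StartCapture(const Napi::CallbackInfo &info) {
  Napi::Env env = info.Env();
  Napi::Function jsCallback = info[0].As<Napi::Function>();

  Napi::ThreadSafeFunction tsfn = Napi::ThreadSafeFunction::New(
      env, jsCallback, "AudioDataCallback", 0 /* unlimited queue */, 1 /* producer */);

  [audioManager setAudioDataCallback:^(NSData *audioData) {
    // Retain the chunk across the thread hop; released on the JS thread below.
    void *retained = (void *)CFBridgingRetain(audioData);
    tsfn.NonBlockingCall(retained, [](Napi::Env env, Napi::Function cb, void *data) {
      NSData *chunk = (NSData *)CFBridgingRelease(data);
      // Copy the bytes into a JS-owned ArrayBuffer and invoke the user's callback.
      Napi::ArrayBuffer buffer = Napi::ArrayBuffer::New(env, chunk.length);
      memcpy(buffer.Data(), chunk.bytes, chunk.length);
      cb.Call({buffer});
    });
  }];

  NSError *error = nil;
  [audioManager startCapture:&error];
}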

The module is built using node-gyp with the following configuration in binding.gyp:

{
  "targets": [
    {
      "target_name": "nativeAudioManager",
      "conditions": [
        [
          "OS==\"mac\"",
          {
            "sources": ["mac/*.mm"],
            "libraries": [
              "-framework Cocoa",
              "-framework CoreAudio",
              "-framework AudioToolbox"
            ]
          }
        ]
      ]
    }
  ]
}

6.3 Usage Example

Here's a simple example of using the module in an Electron application:

// usually do this inside the main.ts file
import audioManager from "native-modules";

// Create audio wrapper instance
const audio = new audioManager.AudioWrapper();

async function setupAudioCapture() {
  // Request permissions
  const permissions = await audio.requestPermissions("audio");
  if (permissions.audio !== "authorized") {
    throw new Error("Audio capture permission denied");
  }

  // Start capture with callback
  audio.startCapture((data) => {
    // data is an ArrayBuffer containing raw PCM audio
    console.log(`Received ${data.byteLength} bytes of audio data`);
  });

  // Stop capture after 5 seconds
  setTimeout(() => {
    audio.stopCapture();
  }, 5000);
}

setupAudioCapture().catch(console.error);

This example demonstrates:

  1. Creating an instance of the audio wrapper
  2. Requesting necessary permissions
  3. Starting audio capture with a callback
  4. Receiving audio data as ArrayBuffers
  5. Cleaning up by stopping capture

The native module handles all the complexities of Core Audio while providing a simple, Promise-based API for JavaScript applications.

Note that this is a simplified example and not directly suitable for production. In a real-world Electron application, this code belongs in the main process, exposed through ipcMain.handle(...) so the renderer process can trigger permission requests and capture start/stop separately.

VII. Conclusion

7.1 Summary

In this guide, we've explored how to build a robust system audio capture solution for macOS using Core Audio. We've covered:

  • Core Audio fundamentals including audio taps and aggregate devices
  • Permission handling for both microphone and system audio
  • Implementation of audio capture using native APIs
  • Integration with Node.js through N-API
  • Error handling and device change management

The resulting solution provides a flexible foundation for building audio-based applications, from simple recording tools to complex AI-powered audio analysis systems.

The complete implementation of this guide is available as an open-source project on GitHub. Feel free to use it as a reference or contribute to its development.

7.2 Further Exploration

While our implementation provides a solid foundation, there are several areas worth exploring further:

  • Separating input audio from output audio: Our current implementation combines system audio into a single stream. You could modify the AudioBufferList handling to separate microphone input from system output (the buffer list carries two channels, one for input and one for output), enabling more sophisticated audio routing and processing.

  • Advanced audio processing: Consider adding real-time audio processing capabilities like:

    • Volume normalization
    • Noise reduction
    • Audio filtering
    • Format conversion
  • Performance optimization: Areas for potential improvement include:

    • Buffer size tuning
    • Memory allocation strategies
    • Thread pool management for audio processing
  • Extended platform support: While this implementation focuses on macOS, similar functionality could be implemented for:

    • Windows using WASAPI
    • Linux using PulseAudio or JACK
    • iOS using AVFoundation

7.3 Resources

This implementation draws primarily on insidegui's AudioCap project (https://github.com/insidegui/AudioCap), referenced earlier in the permissions section.

Remember that working with system audio requires careful attention to:

  • User privacy and consent
  • System resource management
  • Error handling and recovery
  • Platform-specific behaviors and limitations

By building on this foundation and exploring these additional areas, you can create sophisticated audio applications that leverage the power of modern AI and audio processing technologies.
