From Core Audio to LLMs: Native macOS Audio Capture for AI-Powered Tools

The complete source code for this guide is available on GitHub.

I. Introduction

1.1 Why Capture Audio?

Ever wished you could have a smart note-taker that captures and transcribes audio in real-time? The ability to capture audio from your macOS system – including both application sounds and microphone input – opens up exciting possibilities for AI-powered tools. From real-time transcription and AI assistants to accessibility features and audio analysis, the applications are vast. The key to unlocking this potential lies in effective system audio capture, which we'll explore in detail throughout this post.

1.2 Core Audio Overview

Core Audio is Apple's low-level audio framework for macOS, providing fine-grained control over audio capture, processing, and output. While higher-level frameworks exist for basic audio operations, system audio capture requires direct interaction with Core Audio's advanced APIs. This framework offers the necessary tools for intercepting system audio streams, managing audio devices, and handling complex audio routing scenarios.

1.3 Scope of this Post

In this blog post, we will focus specifically on how to capture system audio on macOS using Core Audio. We will dive into the essential concepts and code needed to intercept and record audio generated by your system and applications. Although we will touch on microphone access briefly, our main goal is to focus on the complexities of system audio capture. We will guide you through setting up the necessary components, handling permissions, processing the audio data, and managing device changes. This is intended to be a practical guide, showing you how to achieve system audio capture on macOS. We won't delve into every detail of the vast Core Audio API, but rather provide a clear path to accomplish this important functionality.

What you'll learn:

  • How to set up and configure Core Audio for system audio capture
  • Managing permissions for both microphone and system audio access
  • Implementing robust audio device change handling
  • Processing and converting audio data efficiently
  • Building a Node.js native module for easy integration

The sample code is written in Objective-C++; let me know if you are looking for a Swift version.

II. Core Audio Basics

2.1 Audio Taps and Aggregate Devices

Alright, let's dive into the weird world of Core Audio. If you want to capture the sound coming from your macOS system, not just your microphone, you'll need to wrap your head around two key concepts: audio taps and aggregate devices. Think of audio taps like little wiretaps on the audio signal. They're points where you can intercept and, well, “tap into” the stream of audio data. This is how we can actually get at the system audio that's usually hidden away. You can't use an audio tap directly as an input device; taps are mostly used together with aggregate devices, which let you route the audio from a selected tap into a device you can read from.

An audio tap lets you subscribe to audio data from specific processes, or from the entire system excluding certain processes, configured through a descriptor called CATapDescription.

Now, audio taps aren't enough on their own. This is where aggregate devices come in. With aggregate devices, you piece together different audio sources (inputs, outputs, even other aggregate devices) into one virtual device. In our case, we’ll be creating an aggregate device that includes both your default input and output devices. We then connect our audio tap to this aggregate device, to allow us to capture the system audio output. It’s kind of like building a little audio lab within your system, where you can route the sound exactly where you need it.

We need to mention a few more important concepts before we move on: the AudioDeviceID is simply an integer used by Core Audio to identify an audio device. Every input and output device has one. Then we have AudioStreamBasicDescription, or ASBD for short; this structure defines the format of our audio stream, including parameters like sample rate, number of channels, and encoding. Finally, we have AudioBufferList, the container for audio data; think of it as an array of audio data chunks. These concepts will come up again later in this post.
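
For illustration, here is what an ASBD for the kind of audio we'll be handing around later in this post might look like; the values are illustrative, not taken from the project:

// Illustrative ASBD: mono, packed 32-bit float linear PCM at 16 kHz.
AudioStreamBasicDescription asbd = {0};
asbd.mSampleRate       = 16000.0;
asbd.mFormatID         = kAudioFormatLinearPCM;
asbd.mFormatFlags      = kAudioFormatFlagIsFloat | kAudioFormatFlagIsPacked;
asbd.mChannelsPerFrame = 1;
asbd.mBitsPerChannel   = 32;
asbd.mBytesPerFrame    = asbd.mChannelsPerFrame * sizeof(Float32); // 4 bytes per frame
asbd.mFramesPerPacket  = 1;                                        // uncompressed PCM
asbd.mBytesPerPacket   = asbd.mBytesPerFrame * asbd.mFramesPerPacket;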

2.2 IO Procs

So, we've got our tap on the system audio, and an aggregate device routing the audio to us, but how do we actually get the data and do something with it? This is where IO Procs come into play. An IO Proc is simply a callback function that the audio hardware triggers every time new audio data is available. Think of them like the delivery guys bringing the audio samples to your door. The IO Proc receives a pointer to an AudioBufferList, which, as we've mentioned previously, is where our audio data samples are stored.

It's important to note that IO Procs run on a dedicated audio thread, which is a special thread that needs to be quick and efficient. You should avoid doing complex processing or UI updates directly inside the IO Proc as it can lead to performance issues such as audio stutter or drop-outs. Instead, do the minimal work needed inside the callback, and then send the data to another thread for further processing.
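
As a rough sketch of that pattern (illustrative only; the project's actual IO Proc appears in section 4.4), you might copy the samples and hand them off to a background queue like this:

// Illustrative IO Proc: do the cheap copy here, push the real work elsewhere.
static OSStatus MinimalIOProc(AudioObjectID inDevice,
                              const AudioTimeStamp *inNow,
                              const AudioBufferList *inInputData,
                              const AudioTimeStamp *inInputTime,
                              AudioBufferList *outOutputData,
                              const AudioTimeStamp *inOutputTime,
                              void *inClientData) {
    const AudioBuffer *buffer = &inInputData->mBuffers[0];
    // Copy the incoming samples; the buffer is only valid for the duration of this call.
    NSData *chunk = [NSData dataWithBytes:buffer->mData length:buffer->mDataByteSize];

    // Hand the copy to a background queue. A lock-free ring buffer would be even
    // friendlier to the real-time thread, but this keeps the sketch short.
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
        // ... resample, convert, forward to your app, etc. ...
        (void)chunk;
    });
    return noErr;
}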

2.3 How It All Fits Together

Here's a visual representation of how these components interact:

┌─────────────────┐
│  System Audio   │  ← System sound output
└────────┬────────┘
         │
         ▼         ← Audio data flow
┌─────────────────┐
│   Audio Tap     │  ← Intercepts audio stream
└────────┬────────┘
         │
         ▼         ← Routed audio
┌─────────────────┐
│ Aggregate Device│  ← Combines audio sources
└────────┬────────┘
         │
         ▼         ← Processed audio
┌─────────────────┐
│    IO Proc      │  ← Callback processing
└────────┬────────┘
         │
         ▼         ← Final audio data
┌─────────────────┐
│  Your App       │
└─────────────────┘

Think of this system like a professional audio recording setup:

  • The Audio Tap is like placing a microphone in your system
  • The Aggregate Device is like a mixing board combining different audio sources
  • The IO Proc is like the sound engineer monitoring and controlling the audio flow

III. Audio Permissions

macOS takes audio privacy seriously, with separate permission systems for microphone access and system audio capture. Let's break down how to handle both.

3.1 Microphone Permissions

Microphone access uses AVFoundation's permission system. While this is straightforward, there are some important considerations to keep in mind.

Basic Implementation

// Query the current microphone permission status from AVFoundation
AVAuthorizationStatus micStatus = [AVCaptureDevice authorizationStatusForMediaType:AVMediaTypeAudio];

// Request microphone access through AVFoundation's permission system
[AVCaptureDevice requestAccessForMediaType:AVMediaTypeAudio completionHandler:^(BOOL granted) {
    if (granted) {
        // Permission granted, proceed with microphone access
    } else {
        // Permission denied
    }
}];

Best Practices

  • Always check permission status before requesting access
  • Handle permission changes during app runtime
  • Provide a clear usage description in Info.plist
  • Consider adding UI to guide users to System Preferences if permission is denied
<!-- Required Info.plist entry -->
<key>NSMicrophoneUsageDescription</key>
<string>We need microphone access to capture audio for transcription.</string>
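
Putting the check and the request together, a small helper (a sketch, not code from the project) could look like this:

#import <AVFoundation/AVFoundation.h>

// Sketch: resolve the microphone permission state, prompting only when undetermined.
static void EnsureMicrophonePermission(void (^completion)(BOOL granted)) {
    switch ([AVCaptureDevice authorizationStatusForMediaType:AVMediaTypeAudio]) {
        case AVAuthorizationStatusAuthorized:
            completion(YES);
            break;
        case AVAuthorizationStatusNotDetermined:
            // Shows the system prompt; the handler may run on an arbitrary queue.
            [AVCaptureDevice requestAccessForMediaType:AVMediaTypeAudio
                                     completionHandler:completion];
            break;
        default:
            // Denied or restricted: consider guiding the user to System Settings.
            completion(NO);
            break;
    }
}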

3.2 System Audio Capture Permissions

(Image: the system audio capture permission dialog)

System audio capture permissions are more complex, requiring interaction with the private TCC (Transparency, Consent, and Control) framework. This is the same framework used for screen recording permissions.

Required Setup

  1. Add the necessary entitlement to your app:
<key>com.apple.security.device.audio-input</key>
<true/>
  2. Include a proper usage description:
<key>NSSystemAudioCaptureUsageDescription</key>
<string>We need to capture system audio for transcription.</string>

The real complexity comes in actually checking and requesting these permissions. Unfortunately, the approach for getting access to system audio is not officially documented; the one used here is inspired by https://github.com/insidegui/AudioCap/blob/main/AudioCap/ProcessTap/AudioRecordingPermission.swift. Since TCC is a private framework, we need to use dynamic loading to access it:

void *tccHandle = dlopen("/System/Library/PrivateFrameworks/TCC.framework/Versions/A/TCC", RTLD_NOW);
if (!tccHandle) {
    // Handle error
    return;
}

We then need to get function pointers for the TCC permission check and request functions:

typedef int (*TCCPreflightFuncType)(CFStringRef service, CFDictionaryRef options);
typedef void (*TCCRequestFuncType)(CFStringRef service, CFDictionaryRef options,
                                 void (^completionHandler)(BOOL granted));

TCCPreflightFuncType preflightFunc = (TCCPreflightFuncType)dlsym(tccHandle, "TCCAccessPreflight");
TCCRequestFuncType requestFunc = (TCCRequestFuncType)dlsym(tccHandle, "TCCAccessRequest");

The actual permission check looks like this:

int result = preflightFunc(CFSTR("kTCCServiceAudioCapture"), NULL);
switch (result) {
    case 0: // Authorized
        // Proceed with audio capture
        break;
    case 1: // Denied
        // Handle denied state
        break;
    case 2: // Not determined
        // Need to request permission
        break;
}

When the permission hasn't been determined yet, we need to request it:

requestFunc(CFSTR("kTCCServiceAudioCapture"), NULL, ^(BOOL granted) {
    if (granted) {
        // Permission granted, proceed with setup
    } else {
        // Permission denied, handle accordingly
    }
});
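
Putting the preflight and request calls together, a hedged sketch of a helper built from the function pointers above might look like this:

// Sketch: resolve system audio permission, prompting only when undetermined.
static void EnsureSystemAudioPermission(TCCPreflightFuncType preflightFunc,
                                        TCCRequestFuncType requestFunc,
                                        void (^completion)(BOOL granted)) {
    int result = preflightFunc(CFSTR("kTCCServiceAudioCapture"), NULL);
    if (result == 0) {
        completion(YES);                                    // Already authorized
    } else if (result == 2) {
        requestFunc(CFSTR("kTCCServiceAudioCapture"), NULL, // Not determined: prompt
                    completion);
    } else {
        completion(NO);                                     // Denied or unknown state
    }
}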

Remember that system audio capture is a privileged operation. Always provide clear feedback to users about what audio is being captured and why. Consider adding UI elements to show when capture is active, similar to how macOS shows the recording indicator in the menu bar.

IV. Capturing System Audio

Now that we've covered the fundamentals, let's roll up our sleeves and explore the core technical aspects of our audio capture implementation. This section will dissect the critical components and algorithms that make our module tick.

4.1 Creating an Audio Tap

The first step in capturing system audio is creating an audio tap. This is our entry point into the system's audio stream. Here's how we implement it:

- (BOOL)setupAudioTapIfNeeded:(NSError **)error {
    if (_tapUID != NULL) {
        return YES;
    }

    CATapDescription *desc = [[CATapDescription alloc]
                           initMonoGlobalTapButExcludeProcesses:@[]];
    _tapUID = [NSUUID UUID];

    desc.name = [NSString stringWithFormat: @"audiorec-tap-%@", _tapUID];
    desc.UUID = _tapUID;
    desc.privateTap = true;
    desc.muteBehavior = CATapUnmuted;
    desc.exclusive = false;
    desc.mixdown = true;

    _tapObjectID = kAudioObjectUnknown;
    OSStatus ret = AudioHardwareCreateProcessTap(desc, &_tapObjectID);

    if (ret != kAudioHardwareNoError) {
        // Handle error
        return NO;
    }

    return YES;
}

The tap configuration is crucial - we're creating a global tap that captures all system audio, mixing it down to a single stream.
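
Our tap is global, but CATapDescription also has process-scoped initializers if you only want audio from specific applications. Those initializers take Core Audio process objects rather than raw PIDs; the sketch below is an assumption based on the AudioHardware headers and the AudioCap project, not code from this repo:

// Sketch: translate a pid_t into the AudioObjectID that process-scoped
// CATapDescription initializers (e.g. initStereoMixdownOfProcesses:) expect.
static AudioObjectID ProcessObjectForPID(pid_t pid) {
    AudioObjectPropertyAddress address = {
        .mSelector = kAudioHardwarePropertyTranslatePIDToProcessObject,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    AudioObjectID processObject = kAudioObjectUnknown;
    UInt32 dataSize = sizeof(processObject);
    OSStatus status = AudioObjectGetPropertyData(kAudioObjectSystemObject,
                                                 &address,
                                                 sizeof(pid), &pid,  // qualifier: the PID
                                                 &dataSize, &processObject);
    return status == noErr ? processObject : kAudioObjectUnknown;
}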

4.2 Creating an Aggregate Device

An aggregate device acts as a virtual audio device that combines multiple audio sources. This is essential for properly routing our audio tap and synchronizing multiple audio streams. The setup process involves:

  1. Getting default device references
  2. Retrieving device identifiers
  3. Creating the aggregate device configuration

Here's how we implement this:

- (BOOL)setupAggregateDeviceIfNeeded:(NSError **)error {
    if (_aggregateDeviceID != kAudioDeviceUnknown) {
        return YES;
    }

    // Retrieve system's default input and output audio device IDs
    AudioDeviceID inputDeviceID, outputDeviceID;
    UInt32 propertySize = sizeof(AudioDeviceID);  // Size of an AudioDeviceID
    AudioObjectPropertyAddress propertyAddress = {
        .mSelector = kAudioHardwarePropertyDefaultInputDevice,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    // Get default input device
    OSStatus status = AudioObjectGetPropertyData(kAudioObjectSystemObject,
                                               &propertyAddress,
                                               0,
                                               NULL,
                                               &propertySize,
                                               &inputDeviceID);

    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get default input device"}];
        }
        return NO;
    }

    // Get default output device
    propertyAddress.mSelector = kAudioHardwarePropertyDefaultOutputDevice;
    status = AudioObjectGetPropertyData(kAudioObjectSystemObject,
                                      &propertyAddress,
                                      0,
                                      NULL,
                                      &propertySize,
                                      &outputDeviceID);

    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get default output device"}];
        }
        return NO;
    }

    // Get device UIDs
    CFStringRef inputUID, outputUID;
    AudioObjectPropertyAddress uidPropertyAddress = {
        .mSelector = kAudioDevicePropertyDeviceUID,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    UInt32 dataSize = sizeof(CFStringRef);

    // Get input device UID
    status = AudioObjectGetPropertyData(inputDeviceID,
                                      &uidPropertyAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &inputUID);
    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get input device UID"}];
        }
        return NO;
    }

    // Get output device UID
    status = AudioObjectGetPropertyData(outputDeviceID,
                                      &uidPropertyAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &outputUID);
    if (status != noErr) {
        CFRelease(inputUID);
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get output device UID"}];
        }
        return NO;
    }

    // Get sample rates for both devices
    Float64 inputSampleRate, outputSampleRate;
    AudioObjectPropertyAddress sampleRateAddress = {
        .mSelector = kAudioDevicePropertyNominalSampleRate,
        .mScope = kAudioObjectPropertyScopeGlobal,
        .mElement = kAudioObjectPropertyElementMain
    };

    dataSize = sizeof(Float64);

    // Get input device sample rate
    status = AudioObjectGetPropertyData(inputDeviceID,
                                      &sampleRateAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &inputSampleRate);
    if (status != noErr) {
        CFRelease(inputUID);
        CFRelease(outputUID);
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get input device sample rate"}];
        }
        return NO;
    }

    // Get output device sample rate
    status = AudioObjectGetPropertyData(outputDeviceID,
                                      &sampleRateAddress,
                                      0,
                                      NULL,
                                      &dataSize,
                                      &outputSampleRate);
    if (status != noErr) {
        CFRelease(inputUID);
        CFRelease(outputUID);
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to get output device sample rate"}];
        }
        return NO;
    }

    // Choose master device based on lower sample rate
    NSString *masterDeviceUID = inputSampleRate <= outputSampleRate ?
        (__bridge NSString *)inputUID : (__bridge NSString *)outputUID;

    // Generate unique identifier for aggregate device
    NSUUID* aggregateUID = [NSUUID UUID];

    // Create aggregate device configuration
    NSDictionary* description = @{
        @(kAudioAggregateDeviceUIDKey): [aggregateUID UUIDString],
        @(kAudioAggregateDeviceIsPrivateKey): @(1),
        @(kAudioAggregateDeviceIsStackedKey): @(0),
        @(kAudioAggregateDeviceMasterSubDeviceKey): masterDeviceUID,
        @(kAudioAggregateDeviceSubDeviceListKey): @[
            @{
                @(kAudioSubDeviceUIDKey): (__bridge NSString *)inputUID,
                @(kAudioSubDeviceDriftCompensationKey): @(0),
                @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
            },
            @{
                @(kAudioSubDeviceUIDKey): (__bridge NSString *)outputUID,
                @(kAudioSubDeviceDriftCompensationKey): @(1),
                @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
            },
        ],
        @(kAudioAggregateDeviceTapListKey): @[
            @{
                @(kAudioSubTapDriftCompensationKey): @(1),
                @(kAudioSubTapUIDKey): [_tapUID UUIDString],
            },
        ],
    };

    // Create the aggregate device
    AudioDeviceID aggregateDeviceID;
    status = AudioHardwareCreateAggregateDevice((__bridge CFDictionaryRef)description, &aggregateDeviceID);

    CFRelease(inputUID);
    CFRelease(outputUID);

    if (status != noErr) {
        if (error) {
            *error = [NSError errorWithDomain:@"audio-manager"
                                       code:status
                                   userInfo:@{NSLocalizedDescriptionKey: @"Failed to create aggregate device"}];
        }
        return NO;
    }

    _aggregateDeviceID = aggregateDeviceID;
    return YES;
}

4.3 Configuring the Aggregate Device

The aggregate device's behavior is driven by the description dictionary we pass to AudioHardwareCreateAggregateDevice. Let's look at it more closely:

NSString *masterDeviceUID = inputSampleRate <= outputSampleRate ?
    (__bridge NSString *)inputUID : (__bridge NSString *)outputUID;

NSDictionary* description = @{
    @(kAudioAggregateDeviceUIDKey): [aggregateUID UUIDString],
    @(kAudioAggregateDeviceIsPrivateKey): @(1),
    @(kAudioAggregateDeviceIsStackedKey): @(0),
    @(kAudioAggregateDeviceMasterSubDeviceKey): masterDeviceUID,
    @(kAudioAggregateDeviceSubDeviceListKey): @[
        @{
            @(kAudioSubDeviceUIDKey): (__bridge NSString *)inputUID,
            @(kAudioSubDeviceDriftCompensationKey): @(0),
            @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
        },
        @{
            @(kAudioSubDeviceUIDKey): (__bridge NSString *)outputUID,
            @(kAudioSubDeviceDriftCompensationKey): @(1),
            @(kAudioSubDeviceDriftCompensationQualityKey): @(kAudioSubDeviceDriftCompensationMaxQuality),
        },
    ],
    @(kAudioAggregateDeviceTapListKey): @[
        @{
            @(kAudioSubTapDriftCompensationKey): @(1),
            @(kAudioSubTapUIDKey): [_tapUID UUIDString],
        },
    ],
};

Key configuration points include:

  • Setting up drift compensation between devices
  • Linking our audio tap to the aggregate device
  • Specifying the master subdevice. This is crucial because the system uses the master device's sample rate for the output. I'd suggest picking the device with the lower sample rate as the master: the system automatically downsamples the audio for the other device, but it will not upsample if you choose a master device with a higher sample rate (see the sketch below).
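
As a quick sanity check on the master-device choice, you can read the aggregate device's nominal sample rate after it has been created. A minimal sketch, using the _aggregateDeviceID from the code above:

// Sketch: confirm the aggregate device inherited the master device's sample rate.
AudioObjectPropertyAddress rateAddress = {
    .mSelector = kAudioDevicePropertyNominalSampleRate,
    .mScope = kAudioObjectPropertyScopeGlobal,
    .mElement = kAudioObjectPropertyElementMain
};

Float64 aggregateRate = 0;
UInt32 dataSize = sizeof(aggregateRate);
OSStatus status = AudioObjectGetPropertyData(_aggregateDeviceID, &rateAddress,
                                             0, NULL, &dataSize, &aggregateRate);
if (status == noErr) {
    NSLog(@"Aggregate device is running at %.0f Hz", aggregateRate);
}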

4.4 Starting and Stopping Capture

With our setup complete, we can start capturing audio:

- (BOOL)startCapture:(NSError **)error {
    if (_isCapturing) {
        return YES;
    }

    // Set up IO proc for the aggregate device
    OSStatus status = AudioDeviceCreateIOProcID(_aggregateDeviceID,
                                             HandleAudioDeviceIOProc,
                                             (__bridge void *)self,
                                             &_deviceProcID);

    if (status != noErr) {
        return NO;
    }

    // Start the IO proc
    status = AudioDeviceStart(_aggregateDeviceID, _deviceProcID);
    if (status != noErr) {
        AudioDeviceDestroyIOProcID(_aggregateDeviceID, _deviceProcID);
        _deviceProcID = NULL;
        return NO;
    }

    _isCapturing = YES;
    return YES;
}

The IO proc is where we receive our audio data:

static OSStatus HandleAudioDeviceIOProc(AudioDeviceID inDevice,
                                      const AudioTimeStamp* inNow,
                                      const AudioBufferList* inInputData,
                                      const AudioTimeStamp* inInputTime,
                                      AudioBufferList* outOutputData,
                                      const AudioTimeStamp* inOutputTime,
                                      void* inClientData) {
    AudioManager *audioManager = (__bridge AudioManager *)inClientData;
    [audioManager handleAudioInput:inInputData];
    return noErr;
}

4.5 Handling Device Changes

One of the trickier aspects of audio capture is handling device changes gracefully. Users can plug in or unplug devices, or switch their default devices at any time:

- (void)startDeviceMonitoring {
    AudioObjectPropertyAddress propertyAddress = {
        .mSelector = kAudioHardwarePropertyDefaultInputDevice,
        .mScope = kDeviceChangeScope,
        .mElement = kDeviceChangeElement
    };

    // Create block for device changes
    AudioManager* blockSelf = self;
    _deviceChangeListener = ^(UInt32 inNumberAddresses,
                            const AudioObjectPropertyAddress* inAddresses) {
        [blockSelf handleDeviceChange];
    };

    // Add listener for input device changes
    OSStatus status = AudioObjectAddPropertyListenerBlock(kAudioObjectSystemObject,
                                                        &propertyAddress,
                                                        self->_audioQueue,
                                                        self->_deviceChangeListener);

    if (status != noErr) {
        Log("Failed to add input device change listener", "error");
        return;
    }

    // Add listener for output device changes
    propertyAddress.mSelector = kAudioHardwarePropertyDefaultOutputDevice;
    status = AudioObjectAddPropertyListenerBlock(kAudioObjectSystemObject,
                                               &propertyAddress,
                                               self->_audioQueue,
                                               self->_deviceChangeListener);

    if (status != noErr) {
        Log("Failed to add output device change listener", "error");
        return;
    }
}

- (void)handleDeviceChange {
    // If we're currently capturing, we need to recreate the audio setup
    BOOL wasCapturing = _isCapturing;
    if (wasCapturing) {
        NSError *error = nil;
        [self stopCapture:&error];
        if (error) {
            Log(std::string("Failed to stop capture after device change: ") +
                     std::string([error.localizedDescription UTF8String]), "error");
            return;
        }
    }

    // Destroy and recreate audio resources
    [self destroyAudioResources];

    NSError *error = nil;
    if (![self setupAudioTapIfNeeded:&error]) {
        Log(std::string("Failed to setup audio tap after device change: ") +
                 std::string([error.localizedDescription UTF8String]), "error");
        return;
    }

    if (![self setupAggregateDeviceIfNeeded:&error]) {
        Log(std::string("Failed to setup aggregate device after device change: ") +
                 std::string([error.localizedDescription UTF8String]), "error");
        return;
    }

    // If we were capturing before, restart capture
    if (wasCapturing) {
        NSError *error = nil;
        [self startCapture:&error];
        if (error) {
            Log(std::string("Failed to start capture after device change: ") +
                     std::string([error.localizedDescription UTF8String]), "error");
            return;
        }
    }
}

- (void)stopDeviceMonitoring {
    if (_deviceChangeListener) {
        // Remove input device listener
        AudioObjectPropertyAddress propertyAddress = {
            .mSelector = kAudioHardwarePropertyDefaultInputDevice,
            .mScope = kDeviceChangeScope,
            .mElement = kDeviceChangeElement
        };

        AudioObjectRemovePropertyListenerBlock(kAudioObjectSystemObject,
                                             &propertyAddress,
                                             _audioQueue,
                                             _deviceChangeListener);

        // Remove output device listener
        propertyAddress.mSelector = kAudioHardwarePropertyDefaultOutputDevice;
        AudioObjectRemovePropertyListenerBlock(kAudioObjectSystemObject,
                                             &propertyAddress,
                                             _audioQueue,
                                             _deviceChangeListener);

        _deviceChangeListener = nil;
    }
}

- (void)destroyAudioResources {
    if (_deviceProcID && _aggregateDeviceID != kAudioDeviceUnknown) {
        AudioDeviceDestroyIOProcID(_aggregateDeviceID, _deviceProcID);
        _deviceProcID = NULL;
    }

    if (_tapObjectID != 0) {
        AudioHardwareDestroyProcessTap(_tapObjectID);
        _tapObjectID = 0;
    }

    if (_tapUID) {
        _tapUID = NULL;
    }

    if (_aggregateDeviceID != kAudioDeviceUnknown) {
        AudioHardwareDestroyAggregateDevice(_aggregateDeviceID);
        _aggregateDeviceID = kAudioDeviceUnknown;
    }
}

The device change handling system consists of several key components:

  1. Device Monitoring Setup
    • Establishes listeners for both input and output device changes
    • Uses a dedicated audio queue for handling changes
    • Implements proper error handling for listener setup
  2. Change Handling Process
    • Preserves the current capture state
    • Safely stops any ongoing capture
    • Destroys existing audio resources
    • Recreates and reconfigures audio components
    • Restores previous capture state if needed
  3. Resource Cleanup
    • Properly removes device change listeners
    • Cleans up all audio resources
    • Handles tap and aggregate device destruction
    • Ensures complete state reset

This implementation ensures smooth audio capture even when users plug in or unplug devices, change their default audio devices, or make other system audio configuration changes. The system maintains stability by properly managing resource lifecycle and handling state transitions.

V. Processing Audio Data

Processing audio data efficiently and correctly is crucial for any audio application. In our system audio capture implementation, we need to handle raw audio buffers, convert between formats, and manage memory carefully to avoid issues like audio glitches or memory leaks.

5.1 Accessing Audio Data

Audio data is accessed through a callback mechanism that provides buffers of raw PCM data. The implementation uses a block-based callback system to ensure thread safety and efficient data handling. When we receive audio data in our IO Proc callback, it comes in the form of an AudioBufferList. Here's how we handle the incoming data:

- (void)setAudioDataCallback:(void (^)(NSData *audioData))callback {
    _audioDataCallback = [callback copy];
}

- (void)handleAudioInput:(const AudioBufferList *)bufferList {
    if (!_isCapturing || !_audioDataCallback) {
        return;
    }

    @autoreleasepool {
        // Validate input data
        if (!bufferList || bufferList->mNumberBuffers == 0) {
            return;
        }

        const AudioBuffer *buffer = &bufferList->mBuffers[0];
        UInt32 numFrames = buffer->mDataByteSize / sizeof(Float32);

        // Process frames based on channel format
        BOOL isInterleaved = !(_sourceFormat.mFormatFlags & kAudioFormatFlagIsNonInterleaved);
        UInt32 numChannels = _sourceFormat.mChannelsPerFrame;
        if (numChannels == 0) {
            numChannels = bufferList->mNumberBuffers;
        }

        if (isInterleaved) {
            numFrames = numFrames / numChannels;
        }

        // Convert and process the audio data
        Float32 *processedData = [self processAudioData:bufferList
                                             numFrames:numFrames
                                          numChannels:numChannels];

        if (processedData) {
            NSData *audioData = [NSData dataWithBytes:processedData
                                             length:numFrames * sizeof(Float32)];
            dispatch_async(dispatch_get_main_queue(), ^{
                if (self->_audioDataCallback) {
                    self->_audioDataCallback(audioData);
                }
            });
            free(processedData);
        }
    }
}

Key aspects of audio data handling:

  • Data is provided as raw PCM in 32-bit float format
  • Buffer format can be either interleaved or non-interleaved
  • Processing happens on a dedicated audio thread
  • Callbacks are dispatched to the main thread for safety
  • Memory management with autorelease pool for consistent performance
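
To make the interleaved case concrete, here is a small illustrative helper (not part of the project) that extracts one channel from an interleaved Float32 buffer:

// Illustrative: pull a single channel out of an interleaved Float32 buffer.
// `frames` is the number of sample frames, `channels` the samples per frame.
static void ExtractChannel(const Float32 *interleaved, UInt32 frames,
                           UInt32 channels, UInt32 channelIndex, Float32 *out) {
    for (UInt32 frame = 0; frame < frames; frame++) {
        out[frame] = interleaved[frame * channels + channelIndex];
    }
}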

5.2 Sample Rate Conversion

One of the most critical aspects of audio processing is maintaining consistent output regardless of input device changes. We accomplish this through high-quality sample rate conversion:

- (Float32 *)resampleBuffer:(Float32 *)inputBuffer
                inputFrames:(UInt32)inputFrames
               outputFrames:(UInt32 *)outputFrames {
    Float64 sourceRate = _sourceFormat.mSampleRate;
    Float64 ratio = sourceRate / kTargetSampleRate;
    UInt32 newFrameLength = (UInt32)(inputFrames / ratio);

    // Allocate output buffer
    Float32 *resampledBuffer = (Float32 *)calloc(newFrameLength, sizeof(Float32));
    if (!resampledBuffer) {
        return NULL;
    }

    // Perform sinc resampling with Blackman window
    const UInt32 windowSize = 16;
    const UInt32 halfWindow = windowSize / 2;
    const float M_2PI = 2.0f * M_PI;

    for (UInt32 newIndex = 0; newIndex < newFrameLength; newIndex++) {
        float position = newIndex * ratio;
        int32_t centerIndex = (int32_t)floorf(position);
        float fracOffset = position - centerIndex;
        float sum = 0.0f;
        float weightSum = 0.0f;

        // Apply windowed sinc filter
        for (int32_t i = -(int32_t)halfWindow; i <= (int32_t)halfWindow; i++) {
            int32_t sampleIndex = centerIndex + i;

            if (sampleIndex < 0 || (UInt32)sampleIndex >= inputFrames) {
                continue;
            }

            float x = fracOffset - i;
            // Normalized sinc function
            float sincValue = (x == 0.0f) ? 1.0f : sinf(M_PI * x) / (M_PI * x);
            // Blackman window
            float windowValue = 0.42f - 0.5f * cosf(M_PI * (i + halfWindow) / halfWindow)
                            + 0.08f * cosf(M_2PI * (i + halfWindow) / halfWindow);
            float weight = sincValue * windowValue;

            sum += inputBuffer[sampleIndex] * weight;
            weightSum += weight;
        }

        // Normalize by total weight
        resampledBuffer[newIndex] = weightSum > 0.0f ? sum / weightSum : 0.0f;
    }

    *outputFrames = newFrameLength;
    return resampledBuffer;
}


Key features of our sample rate conversion:

  • Uses high-quality sinc interpolation for minimal artifacts
  • Applies Blackman window function to reduce aliasing
  • Maintains phase accuracy across device switches
  • Handles arbitrary input/output sample rate ratios
  • Automatically adapts to device changes

By maintaining a fixed output sample rate (in this case 22.05 kHz, which is good enough for transcription and similar use cases), we ensure that:

  • Downstream processing remains consistent
  • Memory usage is predictable
  • CPU load stays stable
  • Client applications receive a consistent format
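
For context, a hypothetical call site for the resampler inside the processing path might look like the following; the surrounding names (monoSamples, inputFrames) are assumptions, and kTargetSampleRate is the constant referenced in resampleBuffer above:

// Hypothetical glue: resample a mono chunk to the fixed output rate,
// package it for the callback, then free the temporary buffer.
UInt32 outputFrames = 0;
Float32 *resampled = [self resampleBuffer:monoSamples
                              inputFrames:inputFrames
                             outputFrames:&outputFrames];
if (resampled) {
    NSData *chunk = [NSData dataWithBytes:resampled
                                   length:outputFrames * sizeof(Float32)];
    // ... hand `chunk` to _audioDataCallback on the main queue ...
    free(resampled);
}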

VI. Using It as a Node.js Native Extension

Now that we have our Core Audio implementation working, let's package it as a Node.js native module for easy integration into JavaScript applications.

6.1 Module Interface

Our module exposes a clean TypeScript interface for JavaScript applications. The main interface is defined in native-modules.d.ts:

declare module "native-modules" {
  // Permission status types
  export type PermissionStatus =
    | "not_determined"
    | "denied"
    | "authorized"
    | "restricted";
  export type DeviceType = "microphone" | "audio";

  // Main interface for audio capture
  export interface AudioWrapperInstance {
    startCapture(callback: (data: ArrayBuffer) => void): void;
    stopCapture(): void;
    getPermissions(): PermissionResult;
    requestPermissions(deviceType: DeviceType): Promise<PermissionResult>;
  }

  // Module exports
  const addon: {
    AudioWrapper: {
      new (): AudioWrapperInstance;
    };
  };
  export default addon;
}

This interface provides a simple yet powerful API for managing audio capture:

  • Permission management through getPermissions and requestPermissions
  • Audio capture control with startCapture and stopCapture
  • Type-safe callback for receiving audio data as ArrayBuffer

6.2 Native Bindings with N-API

The native bindings are implemented using N-API (Node-API) in NativeModule.mm. Here's a simplified look at the key components:

class AudioWrapper : public Napi::ObjectWrap<AudioWrapper> {
public:
  static Napi::Object Init(Napi::Env env, Napi::Object exports) {
    // Define the JavaScript interface
    Napi::Function func = DefineClass(env, "AudioWrapper", {
      InstanceMethod("getPermissions", &AudioWrapper::GetPermissions),
      InstanceMethod("requestPermissions", &AudioWrapper::RequestPermissions),
      InstanceMethod("startCapture", &AudioWrapper::StartCapture),
      InstanceMethod("stopCapture", &AudioWrapper::StopCapture),
    });

    exports.Set("AudioWrapper", func);
    return exports;
  }

  // Constructor wraps our native AudioManager
  AudioWrapper(const Napi::CallbackInfo &info)
      : Napi::ObjectWrap<AudioWrapper>(info) {
    audioManager = [AudioManager sharedInstance];
  }

private:
  AudioManager *audioManager;
  // ... method implementations ...
};
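
The elided StartCapture implementation has to bridge the Objective-C callback, which fires off the JavaScript thread, back into the Node.js event loop. One way to do that with node-addon-api is a Napi::ThreadSafeFunction. The following is a hedged sketch of how that could look, not the project's exact code; in a real module you would keep the tsfn handle as a member so StopCapture can release it:

// Sketch: forward native audio chunks to the JS callback via a thread-safe function.
void AudioWrapper::StartCapture(const Napi::CallbackInfo &info) {
  Napi::Env env = info.Env();
  Napi::Function jsCallback = info[0].As<Napi::Function>();

  Napi::ThreadSafeFunction tsfn = Napi::ThreadSafeFunction::New(
      env, jsCallback, "AudioDataCallback", 0 /* unlimited queue */, 1 /* producer */);

  [audioManager setAudioDataCallback:^(NSData *audioData) {
    // Retain the chunk across the thread hop; released on the JS thread below.
    void *retained = (void *)CFBridgingRetain(audioData);
    tsfn.NonBlockingCall(retained, [](Napi::Env env, Napi::Function cb, void *data) {
      NSData *chunk = (NSData *)CFBridgingRelease(data);
      // Copy the bytes into a JS-owned ArrayBuffer and invoke the user's callback.
      Napi::ArrayBuffer buffer = Napi::ArrayBuffer::New(env, chunk.length);
      memcpy(buffer.Data(), chunk.bytes, chunk.length);
      cb.Call({buffer});
    });
  }];

  NSError *error = nil;
  [audioManager startCapture:&error];
}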

The module is built using node-gyp with the following configuration in binding.gyp:

{
  "targets": [
    {
      "target_name": "nativeAudioManager",
      "conditions": [
        [
          "OS==\"mac\"",
          {
            "sources": ["mac/*.mm"],
            "libraries": [
              "-framework Cocoa",
              "-framework CoreAudio",
              "-framework AudioToolbox"
            ]
          }
        ]
      ]
    }
  ]
}

6.3 Usage Example

Here's a simple example of using the module in an Electron application:

// usually do this inside the main.ts file
import audioManager from "native-modules";

// Create audio wrapper instance
const audio = new audioManager.AudioWrapper();

async function setupAudioCapture() {
  // Request permissions
  const permissions = await audio.requestPermissions("audio");
  if (permissions.audio !== "authorized") {
    throw new Error("Audio capture permission denied");
  }

  // Start capture with callback
  audio.startCapture((data) => {
    // data is an ArrayBuffer containing raw PCM audio
    console.log(`Received ${data.byteLength} bytes of audio data`);
  });

  // Stop capture after 5 seconds
  setTimeout(() => {
    audio.stopCapture();
  }, 5000);
}

setupAudioCapture().catch(console.error);

This example demonstrates:

  1. Creating an instance of the audio wrapper
  2. Requesting necessary permissions
  3. Starting audio capture with a callback
  4. Receiving audio data as ArrayBuffers
  5. Cleaning up by stopping capture

The native module handles all the complexities of Core Audio while providing a simple, Promise-based API for JavaScript applications.

Note that this is a simplified example and not directly suitable for production. In a real-world Electron application, this code belongs in the main process, exposed through ipcMain.handle(...) so the renderer process can trigger permission requests and capture start/stop separately.

VII. Conclusion

7.1 Summary

In this guide, we've explored how to build a robust system audio capture solution for macOS using Core Audio. We've covered:

  • Core Audio fundamentals including audio taps and aggregate devices
  • Permission handling for both microphone and system audio
  • Implementation of audio capture using native APIs
  • Integration with Node.js through N-API
  • Error handling and device change management

The resulting solution provides a flexible foundation for building audio-based applications, from simple recording tools to complex AI-powered audio analysis systems.

The complete implementation of this guide is available as an open-source project on GitHub. Feel free to use it as a reference or contribute to its development.

7.2 Further Exploration

While our implementation provides a solid foundation, there are several areas worth exploring further:

  • Separating input audio from output audio: Our current implementation combines system audio into a single stream. You could modify the AudioBufferList handling to separate microphone input from system output (the buffer list carries two channels, one for input and one for output), enabling more sophisticated audio routing and processing.

  • Advanced audio processing: Consider adding real-time audio processing capabilities like:

    • Volume normalization
    • Noise reduction
    • Audio filtering
    • Format conversion
  • Performance optimization: Areas for potential improvement include:

    • Buffer size tuning
    • Memory allocation strategies
    • Thread pool management for audio processing
  • Extended platform support: While this implementation focuses on macOS, similar functionality could be implemented for:

    • Windows using WASAPI
    • Linux using PulseAudio or JACK
    • iOS using AVFoundation

7.3 Resources

This implementation draws primarily on insidegui's AudioCap project (https://github.com/insidegui/AudioCap), referenced earlier in the permissions section.

Remember that working with system audio requires careful attention to:

  • User privacy and consent
  • System resource management
  • Error handling and recovery
  • Platform-specific behaviors and limitations

By building on this foundation and exploring these additional areas, you can create sophisticated audio applications that leverage the power of modern AI and audio processing technologies.
