By Eber Cruz | March 2026
If you've ever tried to build a truly conversational AI, you know that latency is the enemy of presence. It's not just about how fast the model generates tokens; it's about how fast the system can "yield the floor" when a human starts to speak.
Standard Java audio stacks and JNI bridges often introduce non-deterministic delays that make real-time, full-duplex interaction feel robotic. To solve this for the C-Fararoni ecosystem, I decided to bypass the legacy abstractions and talk directly to the silicon.
In this deep dive, I share the architecture and real-world benchmarks of a system built on Java 25, Panama FFM, and Apple Metal GPU. We aren't talking about millisecond improvements here—we've measured a playback interrupt cycle that completes in just 833 nanoseconds.
What's inside:
- Zero-JNI Architecture: How we achieved a 42ns overhead using the Foreign Function & Memory API.
- Metal GPU Orchestration: Running 0.6B and 1.7B neural models locally on 32 GPU cores via PyTorch MPS and ggml-metal.
- The "Abort" Benchmark: Why a 6,000x improvement over our initial latency target was necessary for Sovereign AI.
Bye Bye JNI: Metal GPU, CoreAudio and Panama FFM on Apple Silicon
When we set out to build a voice-first AI assistant that could hold a real conversation — not the kind where you wait three seconds for a response, but the kind where the system knows when to stop talking the instant you speak — we realized the entire Java audio stack had to go. No JNI. No abstraction layers. Just Java talking directly to the hardware through Panama FFM, CoreAudio rendering audio at 24kHz mono float32 through a native callback, and Metal GPU running neural inference on all 32 cores of an M1 Max.
This is the architecture behind Fararoni's audio engine. Every number in this document was measured on real hardware, on real code running in a high-fidelity development environment. These are not theoretical projections — they are measurements taken directly from Fararoni's core as we build the foundation for a sovereign, low-latency AI.
What follows is the story of three bridges: Java to native code in 42 nanoseconds, a playback interrupt in 833 nanoseconds, and two neural models — 0.6B and 1.7B parameters — running Metal compute kernels to synthesize human voice.
Direct Metal: How We Talk to the Hardware
The foundation of the audio engine is a C++ library (fararoni_audio.cpp) that programs CoreAudio's AudioUnit directly. No wrappers, no middleware. The output unit is configured at 24kHz mono float32 with a render callback that copies PCM samples via zero-copy memcpy — the audio thread owns the buffer, and we never fight it for a lock.
The critical path, however, is not playback — it's interruption. In a conversational AI, the system must stop speaking the instant the user starts talking. That means the abort command has to be fast. Not "fast for software" — fast enough that the hardware is the bottleneck.
Here is the entire abort implementation:
extern "C" void fararoni_abort_playback(void) {
    AudioUnitReset(outputUnit, kAudioUnitScope_Global, 0); // <1us measured
    AudioOutputUnitStop(outputUnit);
    playbackCtx.position = playbackCtx.size;
    playbackCtx.finished = true;
    g_playing.store(false);
}
AudioUnitReset flushes the AudioUnit's internal buffers. On Apple Silicon this is sub-microsecond because there's no DMA transfer to wait for — the buffer lives in unified memory — no lock contention (AudioUnit runs on its own thread), and the buffer is small (24kHz × buffer_size_frames × 4 bytes).
We measured it. On an M1 Max running Java 25.0.1:
Zero-Overhead Bridge: The jump from Java to C++ via Panama FFM adds just 42ns (P50).
Hardware Reset: The AudioUnitReset command halts the CoreAudio engine in 459ns (P50).
End-to-End: The full abort cycle completes in 833ns (P50) — 0.0008ms — beating the original 5ms target by 6,000x. Even at P99 (8.4µs), it's 600x under target.
An important distinction: this measures the interrupt command — the ability to stop audio that is already playing. It is not the latency of generating audio, which takes seconds. The only physical bottleneck that remains is the microphone buffer itself: the AudioUnit HAL input needs ~5-10ms to fill a capture buffer, and no software can change that. Once the buffer arrives, our software reacts in under 2 microseconds.
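That hardware floor is simple arithmetic: a capture buffer of N frames at 24kHz takes N/24,000 seconds to fill. A quick sanity check, assuming typical HAL buffer sizes of 128 and 256 frames (the article does not state the exact configuration):

```java
public class BufferLatency {
    /** Time to fill a capture buffer: frames / sampleRate, in milliseconds. */
    static double fillMs(int frames, int sampleRateHz) {
        return 1000.0 * frames / sampleRateHz;
    }

    public static void main(String[] args) {
        System.out.printf("128 frames @ 24kHz: %.1f ms%n", fillMs(128, 24_000)); // ~5.3 ms
        System.out.printf("256 frames @ 24kHz: %.1f ms%n", fillMs(256, 24_000)); // ~10.7 ms
    }
}
```

Both values land squarely in the ~5-10ms window quoted above, which is why no software optimization can shrink it further.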
Frameworks linked directly (Makefile):
CoreFoundation, CoreAudio, AudioToolbox, AudioUnit, IOKit,
Metal, MetalKit, Accelerate, Foundation
Dual-Engine TTS: Direct vs. Indirect Metal GPU Execution
The audio engine runs two completely independent TTS backends, both executing inference on the Metal GPU but with fundamentally different architectural paths.
The 0.6B Engine: ggml-metal Compute
The fast path uses qwen3-tts-cli, a C++ binary that loads a GGUF-quantized 0.6B model and runs inference through ggml-metal — a complete Metal compute implementation that compiles .metal shaders at runtime, creates MTLComputePipelineState for each tensor operation, and dispatches MTLComputeCommandEncoder with optimized thread groups. The buffers live in unified memory: zero-copy between CPU and GPU.
To achieve this efficiency, the engine relies on highly optimized Metal implementations of core neural network operations. These "compute shaders" include:
Attention: The mechanism that allows the model to dynamically weight the importance of different parts of the input text sequence when generating each audio frame.
Softmax: The activation function that normalizes raw model scores into a probability distribution, crucial for accurate audio token selection.
RoPE (Rotary Positional Embeddings): An advanced method for encoding token positions in the sequence, improving the model's understanding of context and order compared to traditional absolute embeddings.
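Of the three, softmax is the easiest to see in isolation. Here is a plain CPU reference implementation for intuition — not the ggml Metal kernel, which fuses this work and parallelizes it across GPU thread groups:

```java
public class SoftmaxDemo {
    /** Numerically stable softmax: subtract the max logit before exponentiating. */
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : logits) max = Math.max(max, x);

        double sum = 0.0;
        double[] out = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            out[i] = Math.exp(logits[i] - max); // shift prevents overflow
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum; // normalize to probabilities
        return out;
    }
}
```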
Java launches it as a subprocess:
List<String> cmd = List.of(
    binaryPath.toString(),     // qwen3-tts-cli
    "-m", modelDir.toString(), // GGUF model dir
    "-t", styledText,          // text to synthesize
    "-o", outputWav.toString() // WAV output
);
Process proc = new ProcessBuilder(cmd).start();
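The snippet above only launches the process; the call site also needs a timeout and an exit-code check. A generic sketch of that plumbing — not Fararoni's actual wrapper, and demonstrated here with a harmless echo standing in for the TTS binary:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class SubprocessRunner {
    /** Runs a command, waits up to timeoutSeconds, and returns captured stdout. */
    static String runAndCapture(List<String> cmd, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process proc = new ProcessBuilder(cmd)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        if (!proc.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            proc.destroyForcibly();
            throw new IOException("subprocess timed out: " + cmd);
        }
        // Safe for a short status line; large outputs need a drain thread
        String out = new String(proc.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (proc.exitValue() != 0) throw new IOException("subprocess failed: " + out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // stand-in for the qwen3-tts-cli invocation
        System.out.print(runAndCapture(List.of("echo", "synthesis done"), 10));
    }
}
```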
For a single word ("Hi."), the 0.6B engine produces 0.3 seconds of audio in 3.7s total (including model load). For longer text, autoregressive generation scales linearly with audio duration — 10 words producing 2.8s of audio take ~32s.
The 1.7B Engine: PyTorch MPS on Metal
The high-fidelity path runs a persistent Python sidecar (tts_server.py) with three variants of Qwen3-TTS-12Hz-1.7B loaded into GPU memory via PyTorch's MPS backend. MPS (Metal Performance Shaders) translates PyTorch tensor operations into Metal compute commands — the same MTLComputeCommandEncoder, the same unified memory buffers, the same GPU cores.
def detect_device():
    if torch.backends.mps.is_available():
        return "mps"  # Apple Silicon -> Metal GPU
    elif torch.cuda.is_available():
        return "cuda"
    else:
        return "cpu"

m = Qwen3TTSModel.from_pretrained(model_id, device_map="cpu")
m = m.to("mps")  # Tensors move to Metal GPU
The sidecar stays resident — the model loads once and serves requests via a JSON-line protocol over stdin/stdout. Java orchestrates:
Java (PythonSidecarBackend)
-> stdin: {"command":"synthesize", "speaker":"Aiden", "text":"..."}
-> Python: PyTorch MPS -> Metal GPU (1.7B inference)
-> stdout: {"status":"ok", "wav":"/tmp/fararoni_tts_xxx.wav"}
Python is only the invocation wrapper. The heavy lifting — neural inference — runs 100% on the Metal GPU. We use Python because HuggingFace Transformers publishes Qwen3-TTS models with a Python API, and PyTorch MPS is the bridge to Metal. The data never leaves the machine.
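The JSON-line framing itself is trivial to sketch. The field names below follow the trace above; the escaping is deliberately minimal, and a real client should use a proper JSON library rather than this string slicing:

```java
public class JsonLineProtocol {
    /** Builds a one-line synthesize request, escaping quotes and backslashes only. */
    static String synthesizeRequest(String speaker, String text) {
        return "{\"command\":\"synthesize\",\"speaker\":\"" + escape(speaker)
                + "\",\"text\":\"" + escape(text) + "\"}";
    }

    /** Naively pulls the "wav" path out of a {"status":"ok","wav":"..."} reply line. */
    static String wavPath(String responseLine) {
        int k = responseLine.indexOf("\"wav\":\"");
        if (k < 0) throw new IllegalArgumentException("no wav field: " + responseLine);
        int start = k + 7; // skip past "wav":" to the first path character
        int end = responseLine.indexOf('"', start);
        return responseLine.substring(start, end);
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

One request per line, one reply per line: the newline is the frame delimiter, which is why the sidecar can sit behind plain stdin/stdout with no extra protocol machinery.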
Why two engines? The 0.6B model is instant-quality: fast, stateless, no persistent process. The 1.7B model is studio-quality: speaker embeddings, richer prosody, but requires a warm sidecar. The routing engine (FararoniAudioEngine.synthesizeToFile()) selects the backend based on speaker availability and quality preference — builtin speakers (Aiden, Dylan, Vivian, Eric) route to the 1.7B sidecar, while unknown speakers fall back to the 0.6B CLI.
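Condensed to its core, the speaker-based part of that routing is a set-membership check. A hypothetical sketch — the real synthesizeToFile() weighs six conditions, including quality preference:

```java
import java.util.Set;

public class TtsRouter {
    enum Backend { SIDECAR_1_7B, CLI_0_6B }

    // Builtin speakers with embeddings available in the 1.7B sidecar
    static final Set<String> BUILTIN = Set.of("Aiden", "Dylan", "Vivian", "Eric");

    /** Builtin speakers go to the warm 1.7B sidecar; unknown speakers fall back to the 0.6B CLI. */
    static Backend route(String speaker) {
        return BUILTIN.contains(speaker) ? Backend.SIDECAR_1_7B : Backend.CLI_0_6B;
    }
}
```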
| Aspect | 0.6B (ggml-metal) | 1.7B (PyTorch MPS) |
|---|---|---|
| Metal path | .metal shaders compiled at runtime | MPS precompiled kernels |
| Model format | GGUF quantized | HuggingFace float16/32 |
| Java interface | ProcessBuilder (per-invocation) | stdin/stdout JSON-line (persistent) |
| Speaker selection | No (CLI limitation) | Yes (speaker embedding) |
| Quality | Instant | Studio |
| Both execute on | Metal GPU compute | Metal GPU compute |
| Hardware Access | Direct Metal (zero framework overhead) | Indirect Metal (PyTorch/Python tax) |
High-Fidelity Synthesis: Scaling to 1.7B with PyTorch MPS
The 1.7B sidecar is where data sovereignty meets quality. The model runs locally on 32 GPU cores — no cloud API, no network hop, no third-party data processing. For an AI assistant that handles private conversations, this is not a feature; it's a requirement.
Measured synthesis times on M1 Max (all routed to 1.7B Python sidecar via Metal GPU):
Studio quality — speaker-embedded, full prosody:
| Speaker | Text | Audio Duration | Total Time |
|---|---|---|---|
| Aiden | "Hello, I am Aiden..." (10 words) | 3.9s | 47.5s |
| Dylan | "Hey, I am Dylan..." (10 words) | 3.4s | 40.1s |
| Vivian | "Hola, soy Vivian..." (8 words) | 3.3s | 38.9s |
| Eric/Marcus | "Buenos dias, soy Marcus..." (12 words) | 4.4s | 52.0s |
Instant quality — still 1.7B for builtin speakers:
| Speaker | Text | Audio Duration | Total Time |
|---|---|---|---|
| Aiden | "Hello, I am Aiden..." | 2.6s | 30.3s |
| Vivian | "Hola, soy Vivian..." | 2.6s | 30.9s |
0.6B Metal — unknown speakers, activeBackend fallback:
| Speaker | Text | Audio Duration | Total Time |
|---|---|---|---|
| (unknown, 10 words) | "Hello world..." | 2.8s | 32.3s |
| (unknown, 9 words) | "Buenos dias..." | 2.6s | 31.7s |
| (unknown, 1 word) | "Hi." | 0.3s | 3.7s |
These are real numbers from real synthesis runs, not estimates. The routing was verified by tracing the six conditions in FararoniAudioEngine.synthesizeToFile() (lines 755-802).
Zero JNI: Panama FFM as the Universal Bridge
Every call from Java to native code in this engine goes through Panama FFM (JEP 454). Zero JNI imports. Zero generated headers. Zero boilerplate.
The pattern is the same across all three native-bridging classes:
Linker linker = Linker.nativeLinker();
SymbolLookup nativeLib = SymbolLookup.libraryLookup(path, Arena.global());
MethodHandle fn = linker.downcallHandle(
    nativeLib.find("fararoni_xxx").get(),
    FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS)
);
int result = (int) fn.invokeExact(memorySegment);
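The same three-step pattern works against any exported C symbol. Here is a self-contained sketch against libc's strlen — nothing Fararoni-specific, runnable on any JDK 22+ with the final FFM API:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class FfmStrlenDemo {
    /** Calls libc strlen through a Panama downcall: lookup, descriptor, invoke. */
    static long nativeStrlen(String s) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(s); // NUL-terminated UTF-8 copy
            return (long) strlen.invokeExact(cString);
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(nativeStrlen("unified memory")); // 14
    }
}
```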
Three classes, three domains, one pattern:
NativeAudioPlayer — Playback control. Four downcall handles: initEngine, playBuffer, stopEngine, isInitialized. The buffer transfer is zero-copy via confined arenas:
try (Arena arena = Arena.ofConfined()) {
    MemorySegment nativeBuffer = arena.allocate(ValueLayout.JAVA_FLOAT, samples.length);
    MemorySegment.copy(samples, 0, nativeBuffer, ValueLayout.JAVA_FLOAT, 0, samples.length);
    int result = (int) playBuffer.invokeExact(nativeBuffer, samples.length);
}
Note on Manual Memory (Arenas): In high-performance Java, an Arena is a bounded memory region that allows for deterministic, off-heap allocation. Unlike standard Java objects managed by the Garbage Collector, memory within an Arena is freed deterministically the moment the arena closes. This keeps our 833ns critical path GC-free, providing the microsecond-level determinism required for real-time conversational AI.
Arena.ofConfined() gives us deterministic memory: allocated before the call, freed when the try-with-resources block ends. No GC pressure, no finalizers, no surprises. Measured allocation+copy for 1 second of audio (24,000 float32 samples): 5.3µs (P50). For 100 seconds of audio: 434µs.
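The allocate-and-copy cost is easy to measure in isolation. A minimal sketch, not the project's NativeAudioBenchmark — a single unwarmed sample like this will read noisy; the real harness iterates and reports percentiles:

```java
import java.lang.foreign.*;

public class ArenaCopyDemo {
    /** Copies a float[] into off-heap memory and returns the elapsed nanoseconds. */
    static long timedAllocAndCopy(float[] samples) {
        long t0 = System.nanoTime();
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment buf = arena.allocate(ValueLayout.JAVA_FLOAT, samples.length);
            MemorySegment.copy(samples, 0, buf, ValueLayout.JAVA_FLOAT, 0, samples.length);
            // buf would be handed to the native side here; it is freed when the arena closes
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        float[] oneSecond = new float[24_000]; // 1s of 24kHz mono float32
        System.out.println(timedAllocAndCopy(oneSecond) + " ns");
    }
}
```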
WhisperEngine — Engine control and STT. Eight downcall handles spanning both the TTS abort path (abortPlayback, setVolume, isPlaying, initEngine, getTelemetry) and Whisper STT for voice commands (whisperInit, startTranscription, stopTranscription). This class also demonstrates Panama upcalls — C-to-Java callbacks:
MemorySegment callbackStub = linker.upcallStub(
    MethodHandles.lookup().bind(handler, "onTranscript", ...),
    FunctionDescriptor.ofVoid(ValueLayout.ADDRESS, ValueLayout.ADDRESS),
    callbackArena
);
VadDetector — Voice Activity Detection. Four handles: vadIsSpeech, rmsEnergy, startVadCapture, stopEngine. The VAD runs inline on the audio capture thread — no thread hop, no queue.
The measured FFM overhead: 42ns per downcall (P50), based on 10,000 iterations of a noop function. The JEP 454 spec claims ~10ns; the 4x difference is explained by nanoTime granularity on M1 (42ns resolution), branch prediction variability, and cache state. Still sub-microsecond, still negligible for audio work.
The Anatomy of a Sub-Microsecond Interrupt
Full-duplex means the system listens while it speaks. When the user starts talking mid-sentence, the interrupt chain fires:
User speaks while TTS plays audio
+-- HAL AudioUnit captures mic -> callback
+-- RMS energy > threshold -> speech detected
+-- Panama upcall: C -> Java (~42ns)
+-- WhisperEngine.abortPlayback()
+-- Panama downcall: Java -> C (~42ns)
+-- fararoni_abort_playback()
+-- AudioUnitReset: audio stops (459ns)
+-- AudioOutputUnitStop
Three measured segments tell the whole story:
The Bridge (Panama FFM round-trip): upcall + downcall = ~84ns. Java is not a bottleneck. The foreign function boundary is invisible at audio timescales.
The Command (AudioUnitReset + stop on C side): 459ns (P50). The AudioUnit flushes its buffers in unified memory — no DMA, no contention.
The Full Cycle (Java → Panama → C → AudioUnitReset → C → Panama → Java): 833ns (P50). Under one microsecond. The original design target was 5ms; we beat it by 6,000x.
The one thing software cannot accelerate is physics. The microphone's AudioUnit HAL input buffer takes ~5-10ms to fill — a hardware constraint determined by buffer_size_frames, not by code. Once that buffer delivers the speech event, our stack reacts in under 2 microseconds. The total real-world interrupt latency is dominated entirely by the microphone hardware, not by the software chain.
Real-World Synthesis: Measured, Not Estimated
Every claim in this document traces back to NativeAudioBenchmark.java, a standalone benchmark class that exercises the native bridge through Panama FFM on a live libfararoni_audio.dylib.
The benchmark measures six distinct operations across thousands of iterations on an M1 Max with Java 25.0.1:
- FFM downcall overhead: 10,000 calls to a noop — 42ns (P50), 292ns (P99)
- Abort C-side (AudioUnitReset + Stop): 100 calls — 459ns (P50), 2,750ns (P99)
- Abort end-to-end (Java→C→Java): 100 calls — 833ns (P50), 8,375ns (P99)
- Arena alloc+copy (1s audio, 24K samples): 5.3µs (P50)
- Arena alloc+copy (100s audio, 2.4M samples): 434µs (P50)
- Native buffer read (1s audio): 22.4µs (P50)
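A percentile report like the one above reduces to sorting raw nanoTime samples. A minimal sketch of the arithmetic, using the nearest-rank convention (which may differ from NativeAudioBenchmark's exact method):

```java
import java.util.Arrays;

public class Percentiles {
    /** Nearest-rank percentile over raw nanosecond samples. */
    static long percentile(long[] samplesNs, double p) {
        long[] sorted = samplesNs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        long[] s = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        System.out.println("P50=" + percentile(s, 50) + " P99=" + percentile(s, 99));
    }
}
```

Reporting P50 alongside P99 matters for latency work: the median shows the steady state, while P99 exposes the GC pauses, cache misses, and scheduler jitter that a mean would hide.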
The synthesis times are equally real — every speaker, every quality level, every backend was tested through the REST endpoint (POST /v1/audio/synthesize) with the routing verified by tracing the condition branches in FararoniAudioEngine.
What we do not claim: we do not claim "5ms audio generation latency." Generation is neural inference and takes seconds. What is sub-microsecond is the command to stop — and that distinction matters, because it's the difference between an assistant that talks over you and one that yields the floor instantly.
Why This Matters
An AI assistant that can generate speech is not the same as one that can hold a conversation. Conversation requires knowing when to stop. Not "stop after a timeout" — stop now, mid-phoneme, because the human on the other end just opened their mouth.
That's what 833 nanoseconds buys us. Not speed for speed's sake, but the foundation for an AI that doesn't just respond — it knows when to be silent and listen. Full-duplex interrupt is the mechanical prerequisite for conversational presence: the system must be able to yield the floor faster than a human can perceive the delay.
The architecture we've built — Panama FFM as the zero-overhead bridge, CoreAudio's AudioUnit as the render engine, Metal GPU driving two neural models, and a sub-microsecond abort chain — is not about showing off low-level programming. It's about removing every artificial barrier between the AI and natural conversation, so the only latency that remains is the physics of a microphone filling its buffer.
Everything runs on-device. The voice models, the inference, the audio rendering, the interrupt — all local, all on Metal, all without a single byte leaving the machine. For an assistant that handles private conversations, sovereignty over the audio pipeline is not optional.
The Zero-GC Determinism Factor
While industry-standard frameworks like PyTorch provide incredible flexibility, they often carry a 'latency tax' from their Python-heavy orchestration and complex abstraction layers. By leveraging Java 25's scoped arenas, we've moved the critical path off-heap. This means the Garbage Collector never touches our 833ns interrupt logic. We aren't just calling a model; we are orchestrating silicon without the overhead of the giants.
Appendix: Raw Benchmark Data
Environment: Java 25.0.1 | aarch64 | Mac OS X | Apple M1 Max
========================================================================
FARARONI AUDIO BENCHMARK — Panama FFM + CoreAudio + Metal
========================================================================
[System.nanoTime() overhead]
Iterations: 10,000
Mean: 9 ns | P50: 0 ns | P99: 42 ns | Max: 167 ns
[FFM Downcall Overhead (noop)]
Iterations: 10,000
Mean: 88 ns | P50: 42 ns | P99: 292 ns | Max: 22,916 ns
[Arena alloc+copy (1024 samples = 0.04s audio)]
Mean: 4,916 ns | P50: 3,917 ns | P99: 28,125 ns
[Arena alloc+copy (24000 samples = 1.00s audio)]
Mean: 5,515 ns | P50: 5,334 ns | P99: 9,583 ns
[Arena alloc+copy (240000 samples = 10.00s audio)]
Mean: 35,841 ns | P50: 34,791 ns | P99: 54,584 ns
[Arena alloc+copy (2400000 samples = 100.00s audio)]
Mean: 434,339 ns | P50: 432,083 ns | P99: 471,959 ns
[Native buffer read (24000 floats = 1s audio)]
Mean: 22,970 ns | P50: 22,375 ns | P99: 30,875 ns
[Abort Playback (C-side only: AudioUnitReset+Stop)]
Iterations: 100
Mean: 514 ns | P50: 459 ns | P99: 2,750 ns | Max: 2,750 ns
[Abort Playback (Java->C->AudioUnitReset->Java)]
Iterations: 100
Mean: 1,016 ns | P50: 833 ns | P99: 8,375 ns | Max: 8,375 ns
========================================================================
Benchmark executed with NativeAudioBenchmark.java on libfararoni_audio.dylib
compiled with fararoni_benchmark.cpp (Makefile updated, make && make install).
M1 Max, 32-core GPU, Java 25.0.1, 2026-03-21.