DEV Community: Nariaki Wada

Estimating the Side of Sudden Sounds in Unreal Engine 5.8: 35/35 Detected, 97.14% Side Accuracy

Nariaki Wada — Sat, 25 Jul 2026 11:28:30 +0000

Hello, everyone.

Even in a noisy place, people can quickly direct their attention toward a sudden sound. Giving a game agent a similar input becomes more interesting when it cannot inspect each sound source directly and must decide “something happened” and “it came from the left” using only the audio that reached its ears.

Today, I mix 40 ms transients into continuous ambient sound in Unreal Engine 5.8, then detect them, estimate their side, and extract candidate clips using only post-HRTF stereo PCM. In a short fixed-listener demo, the system detected all 35 events and classified 34 of 35 sides correctly, for 97.14% accuracy. Detection latency ranged from 20 to 60 ms, with a median of 20 ms.

This result needs an important qualification: it predates the movable listener and comes from a short demo. The five-minute negative test and SI-SDR measurement of extraction quality have not been completed.

Binaural Audio in Unreal Engine

Unreal Engine 5 became generally available in April 2022. Its Audio Mixer goes beyond source playback and provides procedural synthesis, submix routing, DSP, and C++ APIs. This experiment uses Unreal Engine 5.8 from 2026.

Resonance Audio is a spatial-audio SDK released by Google in November 2017. It can apply HRTF, or Head-Related Transfer Function, processing that represents source direction through differences in arrival time, level, and frequency response at the two ears.

I configured BINAURAL_HIGH. Unreal Engine 5.8's Resonance Audio settings describe it as the high-quality mode using third-order Ambisonics. Its output here is two-channel binaural audio intended for headphones.

Licenses

Component	License
Unreal Engine 5.8	Unreal Engine EULA
Public Resonance Audio SDK	Apache License 2.0

Unreal Engine is not open-source software, and its EULA governs use and distribution. The public Resonance Audio source uses Apache License 2.0. When working with the plugin bundled into Unreal Engine or with Engine code, Epic's distribution terms must also be checked.

What I Tested

I mix the ambient and target sources, then capture the Main Output Submix as 48 kHz stereo after it reaches the listener. This PCM is the analyzer's only input. The detector does not receive source-actor positions or playback state directly.

Whether per-band energy rise and full-band RMS rise can detect a 40 ms burst in ambient sound with low latency
Whether masked GCC-PHAT on changed time-frequency bins plus ILD voting can determine left versus right
Whether a soft mask derived from detection time and estimated direction can resynthesize a candidate clip

The complete C++ source, configuration, Automation Tests, and aggregate JSON are available in the unreal-mixed-audio-attention lab in kiarina/labs.

Reproducing the Lab

Install Unreal Engine 5.8, a compatible Xcode version, and mise on macOS. Set UE_ROOT to the Engine root. Because this is a C++ project, close the project in the Editor before building it.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/25/unreal-mixed-audio-attention

export UE_ROOT=/path/to/UE_5.8
mise -C 2026/07/25/unreal-mixed-audio-attention run
mise -C 2026/07/25/unreal-mixed-audio-attention run test
mise -C 2026/07/25/unreal-mixed-audio-attention run editor

The first command, mise ... run, builds the Editor target and verifies the project structure. The test task runs Automation Tests for signal generation, side estimation, and the SI-SDR function. The editor task opens the ordinary demo with a five-second calibration phase.

After starting PIE, use W/S to move forward and backward, A/D to move sideways, and the mouse to rotate the camera and head listener. Touching a cyan ENV sphere or orange BURST sphere toggles that source. Press Shift+F1 to return control to the Editor.

From Mixed Audio to a Decision

The processing path is:

two continuous ambient sources + 40 ms targets from four directions
  -> Resonance Audio / HRTF
  -> 48 kHz stereo Main Output Submix
  -> five-second ring buffer
  -> 1024-point STFT, 480-sample hop
       ├─ onset: eight-band energy rise + full-band RMS rise
       ├─ side: 200–1500 Hz masked GCC-PHAT + 1.5–8 kHz ILD
       └─ extraction: build a soft mask from time and direction
  -> raw/extracted stereo WAV + run JSON + HUD

An STFT divides a short segment of audio into frequency components. At 48 kHz, 1,024 samples span about 21.3 ms, and the 480-sample hop advances by 10 ms.
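
The frame arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the figures quoted in this article; the variable names are mine, not the lab's.

```python
SAMPLE_RATE = 48_000  # Hz, capture rate of the Main Output Submix
N_FFT = 1024          # STFT window length in samples
HOP = 480             # hop size in samples
RING_SECONDS = 5      # ring-buffer length in seconds

frame_ms = 1000 * N_FFT / SAMPLE_RATE  # duration covered by one window
hop_ms = 1000 * HOP / SAMPLE_RATE      # time step between analysis frames
bin_hz = SAMPLE_RATE / N_FFT           # spacing between frequency bins
overlap = 1 - HOP / N_FFT              # fraction shared by adjacent windows

# Complete frames that fit into one full ring buffer (per channel).
ring_samples = RING_SECONDS * SAMPLE_RATE
frames_in_ring = 1 + (ring_samples - N_FFT) // HOP

print(f"window {frame_ms:.1f} ms, hop {hop_ms:.1f} ms, bin {bin_hz:.3f} Hz")
print(f"overlap {overlap:.1%}, frames per ring buffer {frames_in_ring}")
```

The 10 ms hop is what sets the granularity of the latency figures reported later.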

1. Detecting a Transient

The analyzer divides the combined left/right spectrogram into eight bands and measures how much energy rose from the preceding context. A separate path measures the full-band RMS rise, and the larger value becomes the onset score.

An early version averaged the rise across three bands, which diluted a narrow burst with unchanged bands. Switching to the maximum band rise made the fast path more responsive to short sounds. The threshold is derived from the median and median absolute deviation of the scores collected during calibration.
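
A minimal sketch of the scoring idea, not the lab's C++ code. It assumes rises are expressed in dB and that the threshold is the median plus a multiple k of the MAD; the article specifies neither the unit nor the multiplier, so both are placeholders.

```python
import math
import statistics

def onset_score(band_now, band_prev, rms_now, rms_prev, eps=1e-12):
    """Larger of the maximum per-band energy rise and the full-band RMS rise (dB)."""
    band_rise = max(
        10 * math.log10((now + eps) / (prev + eps))
        for now, prev in zip(band_now, band_prev)
    )
    rms_rise = 20 * math.log10((rms_now + eps) / (rms_prev + eps))
    return max(band_rise, rms_rise)

def mad_threshold(calibration_scores, k=6.0):
    """Median plus k times the median absolute deviation (k is an assumption)."""
    med = statistics.median(calibration_scores)
    mad = statistics.median(abs(s - med) for s in calibration_scores)
    return med + k * mad

# A narrow burst: only one of eight bands rises tenfold.
prev_bands = [1.0] * 8
now_bands = [1.0] * 7 + [10.0]
score = onset_score(now_bands, prev_bands, rms_now=1.2, rms_prev=1.0)
threshold = mad_threshold([0.5, 0.8, 0.6, 1.0, 0.7])
print(f"score {score:.2f} dB, threshold {threshold:.2f} dB, onset {score > threshold}")
```

In this toy input the maximum band rise is 10 dB, while the mean over the eight bands would be only 1.25 dB, which illustrates how averaging dilutes a narrow burst.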

2. Estimating Left or Right

GCC-PHAT finds the delay between the left and right waveforms through cross-correlation, normalizing the cross-spectrum's magnitude so that only phase contributes. Here, it uses only onset-changed bins between 200 and 1,500 Hz.

At higher frequencies, head shadow tends to produce a stronger level difference, so ILD, or Interaural Level Difference, from 1.5 to 8 kHz also votes on the result. If GCC-PHAT and ILD do not provide enough agreement, the system returns Unknown instead of forcing a side.
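
The two cues can be sketched in pure Python. This is an illustration under stated assumptions, not the analyzer's implementation: it uses every bin between 200 and 1,500 Hz instead of only onset-changed bins, the lag sign convention (positive means the right ear lags) is my choice, and the vote thresholds min_lag and min_ild_db are hypothetical.

```python
import cmath
import math
import random

SAMPLE_RATE = 48_000
N_FFT = 1024

def band_bins(lo_hz, hi_hz):
    """DFT bin indices whose centre frequency lies in [lo_hz, hi_hz]."""
    return [k for k in range(1, N_FFT // 2)
            if lo_hz <= k * SAMPLE_RATE / N_FFT <= hi_hz]

def dft_bins(frame, bins):
    """Naive DFT evaluated only at the requested bins."""
    return {k: sum(x * cmath.exp(-2j * math.pi * k * n / N_FFT)
                   for n, x in enumerate(frame))
            for k in bins}

def gcc_phat_lag(left, right, bins, max_lag=48):
    """Lag in samples by which `right` trails `left` (positive: right is later)."""
    spec_l, spec_r = dft_bins(left, bins), dft_bins(right, bins)
    phat = {}
    for k in bins:
        cross = spec_l[k].conjugate() * spec_r[k]
        phat[k] = cross / (abs(cross) + 1e-12)  # keep phase, drop magnitude

    def correlation(lag):
        return sum((phat[k] * cmath.exp(2j * math.pi * k * lag / N_FFT)).real
                   for k in bins)

    return max(range(-max_lag, max_lag + 1), key=correlation)

def vote_side(lag_samples, ild_db, min_lag=2, min_ild_db=1.0):
    """Return a side only when both cues are strong enough and agree."""
    votes = []
    if abs(lag_samples) >= min_lag:
        votes.append("left" if lag_samples > 0 else "right")
    if abs(ild_db) >= min_ild_db:
        votes.append("left" if ild_db > 0 else "right")
    if len(votes) == 2 and votes[0] == votes[1]:
        return votes[0]
    return "unknown"

rng = random.Random(0)
left = [rng.uniform(-1.0, 1.0) for _ in range(N_FFT)]
delay = 12  # samples, i.e. 0.25 ms at 48 kHz
right = [left[(n - delay) % N_FFT] for n in range(N_FFT)]  # circular delay

lag = gcc_phat_lag(left, right, band_bins(200, 1500))
print(lag, vote_side(lag, ild_db=3.0), vote_side(lag, ild_db=-3.0))
```

The ild_db values passed to vote_side are made up for the demonstration; in the analyzer they would come from the 1.5 to 8 kHz band energies of the two channels.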

3. Extracting a Candidate Clip

The analyzer takes a one-second interval from 250 ms before the event to 750 ms after it. It uses the pre-event interval as a baseline, then gives more mask weight to bins whose energy rises above that baseline and whose ILD agrees with the estimated side. The same mask is applied to both channels before reconstruction with the inverse STFT.

Applying one mask to both channels avoids inventing a new interaural difference during extraction. The raw and extracted signals are saved as PCM16 stereo WAV files under Saved/MixedAudioAttention/.
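
One way to picture the shared mask is the sketch below. The gain formula (rise above baseline divided by current energy) and the 0.25 weight for bins whose ILD disagrees are assumptions chosen for illustration; the article does not give the lab's exact weighting.

```python
def soft_mask(energy_l, energy_r, baseline, side, disagree_weight=0.25, eps=1e-12):
    """Per-bin mask in [0, 1] for one STFT frame.

    energy_l, energy_r: per-bin energies of the two channels.
    baseline: per-bin pre-event energy of left plus right.
    side: "left" or "right", the estimated side of the event.
    """
    mask = []
    for e_l, e_r, base in zip(energy_l, energy_r, baseline):
        total = e_l + e_r
        rise = max(0.0, total - base) / (total + eps)  # share of energy above baseline
        bin_side = "left" if e_l > e_r else "right"
        weight = 1.0 if bin_side == side else disagree_weight
        mask.append(rise * weight)
    return mask

def apply_mask(spec_l, spec_r, mask):
    """Scale both channels by the same real gain so interaural cues are preserved."""
    return ([m * x for m, x in zip(mask, spec_l)],
            [m * x for m, x in zip(mask, spec_r)])

spec_l = [2 + 0j, 1 + 0j, 1 + 0j]
spec_r = [1 + 0j, 1 + 0j, 2 + 0j]
energy_l = [abs(x) ** 2 for x in spec_l]
energy_r = [abs(x) ** 2 for x in spec_r]
mask = soft_mask(energy_l, energy_r, baseline=[1.0, 2.0, 1.0], side="left")
out_l, out_r = apply_mask(spec_l, spec_r, mask)
print([round(m, 3) for m in mask])
```

Because the gain is real and identical for both ears, the level ratio and phase difference of every surviving bin are unchanged, which is the property the paragraph above describes.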

Scene and Test Conditions

Two deterministic continuous-noise sources sit on an inner three-meter ring, while four target sources sit on an outer five-meter ring. Distance attenuation is disabled. Moving the listener therefore changes the HRTF direction relative to the head without using distance-dependent loudness as a cue.

Item	Condition
Engine	Unreal Engine 5.8
OS / machine	macOS 26.5.2 / Apple M4 Max
Audio	48 kHz stereo, Resonance Audio `BINAURAL_HIGH`
Ambient sound	Two deterministic continuous-noise sources
Targets	40 ms chirp / deterministic noise burst
Azimuths	`-120 / -60 / +60 / +120°`
Requested SNR	`+12 / +6 / 0 / -6 dB`
Analyzer	1024-point Hann STFT, 480-sample hop

The ordinary demo calibrates for five seconds and then cycles through enabled burst sources every two seconds. The implemented full mode uses a 60-second calibration, a 300-second negative period, and 160 events: two signal types, four azimuths, four SNR levels, and five repetitions.
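
The 160-event count follows from the full factorial design. A short enumeration, with labels of my own choosing for the two signal types:

```python
from itertools import product

signal_types = ["chirp", "noise_burst"]  # labels are illustrative
azimuths_deg = [-120, -60, 60, 120]
snrs_db = [12, 6, 0, -6]
repetitions = range(5)

events = list(product(signal_types, azimuths_deg, snrs_db, repetitions))
print(len(events))  # 2 * 4 * 4 * 5 conditions
```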

The block pawn at the center of the screenshot carries the listener at head height. Listener yaw follows the camera. The top HUD shows position, orientation, score, threshold, side prediction, confidence, lag, ILD, and latency. The bottom dock displays two seconds of waveform, four seconds of spectrogram, and the latest event for both ears.

Observed Results

On July 25, 2026, I launched PIE through Unreal MCP and observed the fixed-listener demo from before the block pawn was added.

Metric	Observed value
Matched detections	35 / 35
Correct side	34 / 35 (97.14%)
Latency	20–60 ms, 20 ms median
Unmatched detections	0
Analyzer queue overruns	0

Every target had a corresponding detection, and there were no extra detections during this short run. The side was correct on 34 events. With processing advancing in 10 ms hops, responses arrived at a median of 20 ms and a maximum of 60 ms.

The HUD simultaneously updated the left/right stream waveforms, scrolling spectrograms, selected bins, detection markers, and raw/extracted waveforms for the latest event. The run also produced the expected WAV files and JSON record.

A Plain-Language Reading of the Results

The system found short changes and their side quickly from the mixed audio alone.

It did not use the Engine's knowledge of which actor emitted the sound as the detector input. Operating on post-HRTF stereo makes the design easier to adapt to inputs such as recordings or voice chat, where source metadata may not exist.

35/35 is promising, but it is not a false-alarm evaluation.

The zero unmatched detections apply only to a short demo. The planned five-minute negative test has not run, so the provisional requirement of no more than one false alarm in five minutes has not been demonstrated.

Producing an extracted WAV is not the same as separating the sound well.

The soft mask generated output files, but there is no post-HRTF target-only reference yet. Without SI-SDR, improvement in extraction quality remains unmeasured.

Reproducibility Findings

Signal processing was only part of the work. Several Unreal-specific issues mattered:

Adjusting volume alone did not prevent Resonance Audio's external send from becoming silent when PIE lacked focus. The experiment also temporarily overrides the VR-focus and pause-on-focus-loss settings, then restores the previous values in EndPlay.
The first implementation refilled procedural ambient audio on a timer and became silent after about two seconds. Queueing enough audio for the known experiment duration at startup fixed it.
With an ordinary TArray<float> and non-aligned settings, Unreal Engine 5.8's FFT factory returned null. The analyzer now uses Audio::FAlignedFloatBuffer and 128-bit alignment.
The audio callback only copies into a preallocated five-second ring buffer. A dedicated worker performs STFT processing so that analysis does not block the callback.
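
The last point, copy in the callback and analyze elsewhere, can be sketched as a preallocated ring buffer. This Python version is single-threaded and only shows the indexing; the class and method names are hypothetical, and the real C++ code additionally has to be safe across the audio and worker threads.

```python
from array import array

class StereoRingBuffer:
    """Fixed-size buffer of interleaved float samples, allocated once up front."""

    def __init__(self, seconds=5, sample_rate=48_000, channels=2):
        self.capacity = seconds * sample_rate * channels
        self.data = array("f", [0.0]) * self.capacity  # no allocation after this
        self.write_pos = 0
        self.total_written = 0

    def write(self, interleaved):
        """Callback path: copy samples in, overwrite the oldest, do nothing else."""
        for x in interleaved:
            self.data[self.write_pos] = x
            self.write_pos = (self.write_pos + 1) % self.capacity
        self.total_written += len(interleaved)

    def latest(self, count):
        """Worker path: most recent `count` values in chronological order."""
        count = min(count, self.capacity, self.total_written)
        start = (self.write_pos - count) % self.capacity
        return [self.data[(start + i) % self.capacity] for i in range(count)]

# Tiny buffer so that the wrap-around is visible.
ring = StereoRingBuffer(seconds=1, sample_rate=4, channels=2)  # capacity 8
ring.write([1, 2, 3, 4, 5, 6])
ring.write([7, 8, 9, 10])  # wraps past the end
print(ring.latest(4))
```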

Limitations

The aggregate comes from a short fixed-listener demo, not the current movable block pawn.
The full run, with its 60-second calibration and 300-second negative period, has not been performed.
Results have not yet been broken down by SNR, signal type, or azimuth.
No post-HRTF target-only reference exists, so raw and extracted SI-SDR are unmeasured.
Generating a soft-masked WAV alone does not establish extraction quality.
The experiment does not classify sound type, estimate front/back or elevation, or account for occlusion or reverberation.
It uses a generalized HRTF and has not been tested with real-ear recordings or individualized HRTFs.
It was observed on one setup: macOS 26.5.2, Apple M4 Max, and Unreal Engine 5.8.
It assumes that the project is the only audio feeding the Main Output.

My Takeaway

If I let the code inspect source actors, determining left and right would be trivial. Recovering the signal from the actual mixed audio at the listener, detecting all 35 events, and getting 34 sides right felt much more useful. Walking through the scene and turning the listener also makes the HUD's numbers tangible: the same source changes in the binaural stream and in the analyzer's decision as the head rotates.

The extracted waveform looks convincing on screen, which makes it especially important not to call extraction successful while its quality metric is still blank. As a low-latency “auditory attention” sensor for an NPC or interactive agent, this is already a fun system to build with.

Running Qwen3 Through the ExecuTorch MLX Delegate: Up to 4.52x Faster on M1 Max

Nariaki Wada — Wed, 22 Jul 2026 03:57:08 +0000

Hello, everyone.

There are now many ways to run an LLM on a Mac, but exporting a PyTorch model for Apple Silicon and executing it in a lightweight runtime is still an evolving path. How much faster is it, and does 4-bit quantization change the output?

Today, I am looking at ExecuTorch's experimental MLX delegate, released in May 2026. It enables PyTorch models to run on Apple Silicon GPUs. I use ExecuTorch 1.3.1 to run Qwen3-0.6B and compare it with PyTorch MPS.

The short result is that decode throughput was 41.8 tokens/s with PyTorch MPS BF16, 134.8 tokens/s with MLX BF16, and 188.9 tokens/s with MLX INT4. MLX INT4 was 4.52x faster, and its file was 71.8% smaller than BF16. However, INT4 changed the generated output in two of three simple prompts.

What Is the ExecuTorch MLX Delegate?

ExecuTorch is a runtime for running trained PyTorch models on desktops, phones, and embedded devices. Inference means feeding input into a trained model to obtain an output.

The MLX delegate was released on May 18, 2026. It sends a PyTorch computation graph to Apple's MLX framework for execution on an Apple Silicon GPU. It supports BF16, FP16, FP32, and 2/4/8-bit quantization, among other formats. It is currently experimental, so its APIs and supported scope may change.

Quantization stores weights with fewer bits to reduce size and computation. I compared ordinary BF16 with INT4, which quantizes the linear layers and embeddings to four bits.

The Qwen3-0.6B Model Used Here

Qwen3 is an LLM family announced by the Qwen Team on April 29, 2025. A single model can switch between a thinking mode for step-by-step reasoning and a non-thinking mode for shorter answers. The family supports more than 100 languages and dialects.

The Qwen3-0.6B used here is one of the smaller dense models in the family. “Dense” means that it generally uses the full model for each input, unlike a mixture-of-experts model that activates only selected parts. It has a nominal 0.6 billion parameters, 28 layers, and a context length of 32,768 tokens. A token is a small unit of text processed by the model.

I disabled thinking mode for this test and pinned the model revision to c1899de289a04d12100db370d81485cdf75e47ca.

Licenses

Component	License
Qwen3-0.6B weights	Apache License 2.0
ExecuTorch	BSD 3-Clause License
MLX	MIT License
PyTorch	BSD 3-Clause License

These permissive open-source licenses allow broad use, including commercial use, but their conditions—such as retaining copyright notices when redistributing—still apply. Check the linked license text for your use case.

The Three Paths and Their Data Flow

Only one trained model, Qwen3-0.6B, was used. The same weights were tested through three execution paths.

same prompt
  -> tokenizer: convert text to token IDs
  -> Qwen3-0.6B
       ├─ ExecuTorch MLX BF16 -> .pte -> MLX / Metal GPU
       ├─ ExecuTorch MLX INT4 -> 4-bit .pte -> MLX / Metal GPU
       └─ PyTorch MPS BF16 ----------------> MPS / Metal GPU
  -> greedy generation: select the most likely next token each time
  -> tokenizer: convert token IDs back to text

A .pte file is a model program exported for ExecuTorch. The complete computation graph could be lowered into a single MLX subgraph. PyTorch MPS BF16 ran the same BF16 weights through the ordinary Transformers/PyTorch path and served as the reference.

What I Tested

Whether the complete Qwen3-0.6B graph could be lowered to MLX and executed
Whether MLX BF16 and PyTorch MPS BF16 generated the same tokens
How INT4 changed model size, speed, and process memory
Whether the experiment could be reproduced using published wheels only

The complete code and JSON report are available in the executorch-mlx-qwen3 lab in kiarina/labs.

Reproducing the Lab

You need an Apple Silicon Mac, Xcode Command Line Tools, mise, uv, and an internet connection. The first run downloads the Qwen3 weights and creates about 1.53 GB of PTE files in total.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/22/executorch-mlx-qwen3
mise -C 2026/07/22/executorch-mlx-qwen3 run

The export and benchmark steps can also be run separately.

mise -C 2026/07/22/executorch-mlx-qwen3 run export
mise -C 2026/07/22/executorch-mlx-qwen3 run benchmark

Test Conditions

machine: MacBook Pro (Apple M1 Max, 32 GPU cores, 64 GB)
OS: macOS 26.5.2
Python: 3.13.7
ExecuTorch: 1.3.1
PyTorch: 2.12.1
Transformers: 4.56.1
model: Qwen/Qwen3-0.6B
generation: greedy, batch 1, up to 16 tokens
PTE: custom MLX SDPA / KV cache, requested maximum sequence 128

For performance, each backend was given the same Japanese prompt and forced to generate 16 tokens. The reported values are medians from five trials after warm-up. Prefill reads the input and produces the first token; decode generates the remaining tokens one at a time.

Each backend ran in a separate process. The MLX path loaded a fresh forward method and initialized its KV cache for each trial, while the PyTorch path created a fresh cache for every trial.

Results

PTE Size

PTE	Size	Relative to BF16	Export time	SHA-256
MLX BF16	1,192,264,196 bytes	100.0%	41.61 s	`83da47c2…bfb8c0`
MLX INT4	335,662,976 bytes	28.2%	54.55 s	`0e30a054…71267`

INT4 was 856,601,220 bytes, or 71.8% smaller, than BF16. Quantization itself takes work, so exporting INT4 took about 13 seconds longer.
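
The size figures can be recomputed directly from the table. This sketch uses only the byte counts and export times listed above.

```python
bf16_bytes = 1_192_264_196
int4_bytes = 335_662_976

saved_bytes = bf16_bytes - int4_bytes
relative_pct = 100 * int4_bytes / bf16_bytes   # INT4 size as a share of BF16
reduction_pct = 100 - relative_pct
total_gb = (bf16_bytes + int4_bytes) / 1e9     # both PTE files, decimal GB
export_delta_s = 54.55 - 41.61                 # extra export time for INT4

print(f"saved {saved_bytes:,} bytes ({reduction_pct:.1f}% smaller)")
print(f"INT4 is {relative_pct:.1f}% of BF16, both files total {total_gb:.2f} GB")
print(f"INT4 export took {export_delta_s:.2f} s longer")
```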

Generation Speed

The input was the Japanese prompt “Briefly explain local inference on Apple Silicon.”

Backend	Load	Median prefill	Median decode	16-token total	Peak RSS increase
ExecuTorch MLX BF16	0.002 s	0.020 s	134.8 tokens/s	0.131 s	1.27 GiB
ExecuTorch MLX INT4	0.003 s	0.028 s	188.9 tokens/s	0.108 s	0.47 GiB
PyTorch MPS BF16	0.726 s	0.038 s	41.8 tokens/s	0.396 s	0.14 GiB

MLX BF16 decoded 3.22x as fast as PyTorch MPS BF16, while MLX INT4 was 4.52x as fast. INT4 was also 1.40x faster than MLX BF16 and reduced the RSS increase by about 63%.
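
The ratios come straight from the table. The sketch below also estimates the 16-token total as prefill plus 15 decode steps, which is my assumption about how the columns relate; because each column is a separate median, the estimate only lands within a millisecond or two of the reported totals.

```python
decode_tps = {"mps_bf16": 41.8, "mlx_bf16": 134.8, "mlx_int4": 188.9}
prefill_s = {"mps_bf16": 0.038, "mlx_bf16": 0.020, "mlx_int4": 0.028}
reported_total_s = {"mps_bf16": 0.396, "mlx_bf16": 0.131, "mlx_int4": 0.108}
rss_gib = {"mlx_bf16": 1.27, "mlx_int4": 0.47}

speedup_bf16 = decode_tps["mlx_bf16"] / decode_tps["mps_bf16"]
speedup_int4 = decode_tps["mlx_int4"] / decode_tps["mps_bf16"]
int4_vs_bf16 = decode_tps["mlx_int4"] / decode_tps["mlx_bf16"]
rss_reduction_pct = 100 * (1 - rss_gib["mlx_int4"] / rss_gib["mlx_bf16"])

# Assumed decomposition: first token from prefill, remaining 15 from decode.
estimated_total_s = {
    name: prefill_s[name] + 15 / decode_tps[name] for name in decode_tps
}

print(f"{speedup_bf16:.2f}x, {speedup_int4:.2f}x, {int4_vs_bf16:.2f}x")
print(f"RSS increase reduced by {rss_reduction_pct:.0f}%")
for name, est in estimated_total_s.items():
    print(f"{name}: estimated {est:.3f} s, reported {reported_total_s[name]:.3f} s")
```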

There are important qualifications. The MLX load figure measures only opening the PTE program, not all work needed to materialize weights for GPU use. RSS measures the process's main memory, not GPU memory itself. PyTorch reported 1.20 GiB of MPS driver-allocated memory at the end of the run. Because GPU memory was not measured on the same basis, the table does not prove that PyTorch used the least memory.

On the first invocation only, BF16 prefill took 0.434 seconds and INT4 took 1.014 seconds. Cold starts that include Metal setup and initial compilation were much slower than the warmed-up figures.

Generated Output

I compared the generated token sequences on three short prompts.

Prompt	PyTorch MPS BF16	MLX BF16	MLX INT4
Answer Japan's capital in one word	`日本の首都は、大阪です。` (Japan's capital is Osaka.)	Exact token match	`日本の首都は、东京です。` (Japan's capital is Tokyo.)
Answer 1+1 with one digit	`1+1=2`	Exact token match	`1`
Copy `MLX` unchanged	`MLX`	Exact token match	Exact token match

MLX BF16 matched PyTorch MPS BF16 token for token in all three cases. Within this narrow test, changing the execution path to MLX did not introduce a difference.

INT4 matched in only one case. Quantization represents numbers more coarsely, so when candidate next tokens have similar scores, their ranking can change.
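
A toy example shows how coarser weights can flip a close ranking. It uses a single shared scale and symmetric rounding to integers in [-8, 7]; this is a generic illustration with made-up weights, not the quantization scheme that ExecuTorch applies.

```python
def quantize_int4(values, scale):
    """Symmetric 4-bit quantization: integers in [-8, 7] times a shared scale."""
    return [max(-8, min(7, round(v / scale))) * scale for v in values]

def logits(hidden, rows):
    """Dot product of the hidden state with each candidate token's weight row."""
    return {tok: sum(h * w for h, w in zip(hidden, row)) for tok, row in rows.items()}

hidden = [1.0, 1.0]
rows = {"A": [0.48, 0.25], "B": [0.52, 0.19]}  # two candidate next tokens

scale = max(abs(w) for row in rows.values() for w in row) / 7
rows_q = {tok: quantize_int4(row, scale) for tok, row in rows.items()}

full = logits(hidden, rows)
quant = logits(hidden, rows_q)
best_full = max(full, key=full.get)
best_quant = max(quant, key=quant.get)
print(best_full, {t: round(v, 3) for t, v in full.items()})
print(best_quant, {t: round(v, 3) for t, v in quant.items()})
```

With full-precision weights token A leads 0.73 to 0.71; after rounding, B leads, so greedy decoding would pick a different token from that step onward.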

The capital answer was wrong in both BF16 and INT4. The small 0.6B model, prompt, and greedy decoding could all contribute, but three questions are not enough to identify the cause. This probe checks differences between backends; it does not certify the model's knowledge or answer quality.

A Plain-Language Reading of the Results

The same small LLM ran substantially faster through MLX.

Even BF16 decoded 3.22x as fast as PyTorch MPS. Exporting a PyTorch model to a lightweight Mac runtime looks promising.

Four-bit weights are smaller and faster, but answers can change.

INT4 reduced the file from about 1.19 GB to 336 MB and produced the highest decode rate. Yet it changed two of only three token sequences. It should be evaluated on a task-specific quality set before adoption.

The gap cannot be attributed to MLX kernels alone.

This comparison covers an ExecuTorch MLX pipeline versus a Transformers/PyTorch MPS pipeline. Their runtimes, cache handling, and execution paths differ, so the numbers describe the complete pipelines.

Reproducibility Findings

The dependency metadata for executorch==1.3.1 allowed PyTorch 2.13.0, but importing the published ExecuTorch extension failed because the materialize_cow_storage symbol was missing. Pinning PyTorch to 2.12.1 made the same wheel work. The failure can be reproduced with this optional task:

mise -C 2026/07/22/executorch-mlx-qwen3 run probe-torch-2-13

The bundled PTE inspector also failed because its included flatc did not recognize the --json option. I verified full-graph delegation from the partitioner log emitted during export instead.

Limitations

Only one M1 Max, Qwen3-0.6B, batch 1, and short Japanese prompts were tested.
Five short trials do not control for thermal state, power consumption, or other GPU workloads.
Only BF16 and INT4 were compared; FP16, 2/8-bit, Core ML, and other paths were not tested.
The quality probe had only three questions; no standard benchmark or perplexity was measured.
The requested maximum sequence was 128, and long text was not tested.
GPU memory could not be measured on the same basis for MLX and MPS.
Cold-start latency was observed only once.
The MLX delegate is experimental.

My Takeaway

Lowering the entire Qwen3-0.6B graph to MLX and increasing decode throughput by more than 3x without changing the BF16 tokens was a good result. INT4 reduced the file to less than one-third of its BF16 size and reached about 189 tokens/s, which is attractive when embedding a small model on a Mac.

The output changes from 4-bit quantization appeared immediately, even in this tiny probe. Quantization is not a free speedup. Paired with a quality suite for the real task, this path could work well for local helper features or small, responsive agents.

Testing PyTorch 2.13 MPS FlexAttention on M1 Max: Up to 7.83x Faster for Sparse Attention

Nariaki Wada — Tue, 21 Jul 2026 04:03:12 +0000

Hello, everyone.

Attention becomes expensive very quickly as more text is given to an AI model. Can a Mac GPU make it faster when every token is restricted to looking only at nearby tokens?

Today, I am comparing FlexAttention, newly available on Apple Silicon in PyTorch 2.13, with standard SDPA on an M1 Max.

The short answer is that, with 32,768 tokens and a 256-token local window, FlexAttention took 75.27 ms while SDPA took 589.05 ms: a 7.83x speedup. For ordinary causal attention, however, SDPA was about 19x faster. FlexAttention was not universally faster; it helped with long, extremely sparse attention.

FlexAttention and Its MPS Support

FlexAttention was announced as a prototype with PyTorch 2.5 in October 2024. It lets developers express an attention rule as a short Python function, which torch.compile turns into a specialized fused kernel.

Attention determines which tokens in the input should influence one another. The number of possible comparisons grows rapidly with sequence length. FlexAttention can describe rules such as “look only into the past” or “look only at the previous 256 tokens,” allowing unnecessary comparisons to be skipped. A pattern with many skipped comparisons is called sparse.

PyTorch 2.13, announced on July 8, 2026, added Metal/MPS kernels for FlexAttention on Apple Silicon. PyTorch reports up to roughly 12x higher performance than SDPA for sparse patterns. The API and kernel options remain unstable in 2.13.

This experiment does not use a pretrained AI model. It is a kernel benchmark that sends the same randomly generated attention inputs through two implementations.

Roles of the Technologies

Technology	Role
PyTorch 2.13.0 / FlexAttention	Run custom attention rules as compiled kernels
SDPA	Standard PyTorch attention implementation used as the baseline
MPS / Metal	Execute calculations on the Apple Silicon GPU

Roles and Data Flow

Both paths receive the same query, key, and value tensors. Loosely speaking, these represent what to search for, what to match against, and what content to retrieve.

same random inputs (query / key / value)
  ├─ FlexAttention
  │    Python mask rule -> BlockMask -> torch.compile -> Metal kernel
  │
  └─ SDPA
       dense mask --------------------> MPS backend

              -> compare output error and forward time

A BlockMask groups the regions to compute into 128×128-token blocks. FlexAttention can skip unnecessary blocks. The SDPA path received an ordinary boolean mask describing the same rule.

What I Tested

Whether sparse attention is also faster on an M1 Max using the representative official shapes
How wide the window can become at 8,192 tokens before SDPA takes the lead
Whether the benefit survives initial compilation and BlockMask construction
Whether FlexAttention and SDPA produce matching outputs
Whether backward gradient computation works on MPS

The complete code and JSON report are available in the pytorch-2-13-flexattention-mps lab in kiarina/labs.

Reproducing the Lab

You will need an Apple Silicon Mac, mise, and uv. The default task also creates a 32,768×32,768 boolean mask, which alone occupies 1 GiB. Stop other GPU workloads and run it on a machine with sufficient memory.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/21/pytorch-2-13-flexattention-mps
mise -C 2026/07/21/pytorch-2-13-flexattention-mps run

To skip the long case, run this inside the lab:

uv run python benchmark.py --quick

Test Conditions

Query, key, and value were generated independently at random. Both implementations received the same tensors. I timed only the forward pass, excluding mask creation and compilation. Because MPS runs asynchronously, each measurement explicitly waited for the GPU before and after the call. The reported values are medians from ten trials after three warm-ups.
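
The measurement procedure can be sketched as a small harness. It is a generic illustration rather than the lab's benchmark.py: the function name and the sync hook are hypothetical. On MPS, torch.mps.synchronize would be the natural callable to pass as sync.

```python
import statistics
import time

def median_forward_ms(fn, sync=lambda: None, warmups=3, trials=10):
    """Median, minimum, and maximum wall time of fn() in milliseconds.

    sync() should block until queued GPU work has finished; the default is a
    no-op so that the sketch also runs on a machine without a GPU.
    """
    for _ in range(warmups):
        fn()
    sync()
    samples = []
    for _ in range(trials):
        sync()                      # nothing pending before the timer starts
        start = time.perf_counter()
        fn()
        sync()                      # wait for the work this call launched
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples), min(samples), max(samples)

calls = {"fn": 0, "sync": 0}

def fake_forward():
    calls["fn"] += 1
    sum(range(10_000))  # stand-in for the attention forward pass

def fake_sync():
    calls["sync"] += 1

median_ms, fastest_ms, slowest_ms = median_forward_ms(fake_forward, fake_sync)
print(f"median {median_ms:.3f} ms (range {fastest_ms:.3f} to {slowest_ms:.3f})")
```

Without the sync calls, an asynchronous backend would return as soon as the work is queued, and the timer would measure launch overhead instead of kernel time.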

machine: MacBook Pro (Apple M1 Max, 32 GPU cores, 64 GB)
OS: macOS 26.5.2
Python: 3.13.7
PyTorch: 2.13.0
shape: batch 1, 8 heads, head dimension 64
dtype: bfloat16
FlexAttention: torch.compile(..., dynamic=False)
BlockMask: 128×128 blocks
CPU fallback: disabled

A sliding window lets each position see the previous W tokens. With a window of 256, for example, each token is limited to the most recent 256 tokens even in a very long input.

Results

An SDPA / Flex value above one means that FlexAttention was faster. Density is the percentage of all token pairs that the pattern actually allows.

Pattern	Sequence	Window	Token / block density	Flex median	SDPA median	SDPA / Flex
causal	8,192	—	50.01% / 50.78%	231.22 ms	12.33 ms	0.05x
local	8,192	64	0.78% / 3.10%	11.75 ms	25.23 ms	2.15x
local	8,192	256	3.08% / 4.61%	19.14 ms	25.25 ms	1.32x
local	8,192	1,024	11.72% / 13.18%	56.24 ms	25.30 ms	0.45x
local	8,192	4,096	37.50% / 38.67%	169.24 ms	25.30 ms	0.15x
local	32,768	256	0.78% / 1.17%	75.27 ms	589.05 ms	7.83x

At 8,192 tokens, FlexAttention won through a window of 256. At a window of 1,024, SDPA became 2.22x faster. The crossover on this shape lies between 3.08% and 11.72% token density.
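
The density columns can be reproduced without PyTorch. The sketch below assumes that a window of W means each query sees itself and the W - 1 positions before it; I infer that convention because it reproduces the table's percentages, not because the article states it. The function names are mine.

```python
BLOCK = 128  # BlockMask block size from the test conditions

def allowed(q_idx, kv_idx, window=None):
    """Mask rule: causal when window is None, otherwise causal sliding window."""
    if kv_idx > q_idx:
        return False
    return window is None or q_idx - kv_idx < window

def token_density(seq_len, window=None):
    """Percentage of (query, key) pairs that the rule allows."""
    run = seq_len if window is None else window
    # Allowed keys for a query form one contiguous run that ends at the query.
    total = sum(min(q + 1, run) for q in range(seq_len))
    return 100 * total / seq_len ** 2

def block_density(seq_len, window=None):
    """Percentage of BLOCK x BLOCK tiles that contain at least one allowed pair."""
    n_blocks = seq_len // BLOCK
    kept = 0
    for qb in range(n_blocks):
        for kb in range(qb + 1):
            # Closest pair in the tile: first query row and last key column,
            # clamped to the diagonal inside a diagonal tile.
            q = qb * BLOCK
            kv = min(kb * BLOCK + BLOCK - 1, q)
            kept += allowed(q, kv, window)
    return 100 * kept / n_blocks ** 2

cases = [(8192, None), (8192, 64), (8192, 256),
         (8192, 1024), (8192, 4096), (32768, 256)]
for seq_len, window in cases:
    label = "causal" if window is None else f"window {window}"
    print(f"{seq_len:>6} {label:<12} "
          f"{token_density(seq_len, window):.2f}% / {block_density(seq_len, window):.2f}%")
```

Block density is always at least as high as token density because a tile must be computed in full even if only one of its pairs is allowed, which is why narrow windows lose part of their theoretical savings.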

Causal attention allows each token to see the entire past, giving it a density of about 50%. SDPA has a specialized path for this common pattern and was 18.75x faster than FlexAttention. Simply replacing ordinary causal attention is counterproductive.

For the 32,768 / window 256 case, one of ten FlexAttention trials rose to 147.89 ms. Its full range of 74.96–147.89 ms still did not overlap with SDPA's 586.08–590.03 ms range, so the performance ordering was unambiguous.

Difference from the Official Benchmark

PyTorch reports 4.15x for 8,192 / window 64 and approximately 12.3x for 32,768 / window 256. The M1 Max produced 2.15x and 7.83x, respectively.

The expected direction was reproduced: sparser patterns were faster, and the gap grew at longer sequence lengths. The speedups did not reach the official figures. Because the release blog does not identify the Apple Silicon model used for those figures, the difference cannot be attributed to hardware alone.

Initial Setup Cost

In addition to steady-state forward time, FlexAttention needs BlockMask construction and a first compiled call.

Pattern	Sequence / window	BlockMask build	First compiled call
causal	8,192 / —	564.54 ms	546.57 ms
local	8,192 / 64	95.90 ms	94.31 ms
local	8,192 / 256	100.54 ms	102.34 ms
local	8,192 / 1,024	95.02 ms	130.75 ms
local	8,192 / 4,096	96.65 ms	244.62 ms
local	32,768 / 256	1,120.85 ms	162.57 ms

The 32,768 / window 256 case saves about 514 ms per call, so reusing the same mask recovers the setup cost in roughly three forwards. The 8,192 / window 64 case needs about 14. A mask used only once should not be selected based on the steady-state 7.83x or 2.15x figure alone.
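
The break-even estimate is setup cost divided by per-call saving. The sketch treats the whole first compiled call as overhead, which slightly overstates the setup cost because that call also produces a result.

```python
def break_even_calls(mask_build_ms, first_call_ms, flex_ms, sdpa_ms):
    """Number of steady-state forwards needed to recover the one-time setup."""
    setup_ms = mask_build_ms + first_call_ms
    saving_ms = sdpa_ms - flex_ms
    if saving_ms <= 0:
        return float("inf")  # FlexAttention never catches up
    return setup_ms / saving_ms

long_case = break_even_calls(1120.85, 162.57, flex_ms=75.27, sdpa_ms=589.05)
short_case = break_even_calls(95.90, 94.31, flex_ms=11.75, sdpa_ms=25.23)
wide_case = break_even_calls(95.02, 130.75, flex_ms=56.24, sdpa_ms=25.30)

print(f"32,768 / 256: {long_case:.1f} forwards")
print(f"8,192 / 64:   {short_case:.1f} forwards")
print(f"8,192 / 1,024: {wide_case} (SDPA is faster per call)")
```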

The first-call figures depend on the state of the host's compile cache, so they may differ on a fresh machine. BlockMask construction was eager; I did not benchmark compiling mask construction itself.

Output Agreement and Supported Scope

On identical MPS bfloat16 inputs, maximum absolute error between FlexAttention and SDPA ranged from 0.0078125 to 0.015625, while mean absolute error ranged from 0.000079 to 0.000363.

I also compared a smaller input against float32 SDPA on the CPU.

MPS implementation	Maximum absolute error	Mean absolute error
FlexAttention bfloat16	0.012440	0.000553
SDPA bfloat16	0.012440	0.000677

Both MPS paths had the same maximum error, and I did not observe a large discrepancy specific to FlexAttention. This is a numerical comparison on synthetic inputs, not a quality evaluation of a complete model.

A probe with requires_grad=True failed with the error "FlexAttention does not support backward on MPS". The PyTorch 2.13 MPS path is forward-inference only. The deterministic backward feature added in 2.13 applies to CUDA and does not add MPS training support.

A Plain-Language Reading of the Results

It was faster when a long input looked at only a tiny nearby region.

FlexAttention was 7.83x faster when 32,768 tokens were limited to the previous 256. It benefits when a large share of the work can be skipped.

It is not a drop-in speedup for ordinary attention.

Standard SDPA was about 19x faster for causal attention over the full past. The pattern and density must be measured.

Reusing the same rule matters.

Mask construction and compilation add an initial wait. FlexAttention is better suited to work that reuses them across layers or inference calls.

Limitations

Only one M1 Max, one process, and synthetic random inputs were tested.
Measurements covered only bfloat16, batch 1, 8 heads, and head dimension 64.
Ten short trials after three warm-ups do not control for long-term thermal throttling or other GPU workloads.
Only forward prefill was tested; decode, GQA, captured buffers, and score modification were not.
Peak memory, energy use, and per-kernel Metal profiles were not measured.
The hardware, OS, and measurement process were not fully identical to the official benchmark.
The FlexAttention API and kernel options remain unstable in PyTorch 2.13.

My Takeaway

Seeing the gap reach 7.83x for extremely sparse, long attention on an M1 Max was a good result. Writing the mask rule in Python and letting PyTorch produce a specialized Metal kernel substantially lowers the barrier to experimenting with custom attention on a Mac.

The advantage disappeared quickly as the window widened, and the initial setup time was not trivial. Despite the “Flex” name, it should not be treated as a universal optimization; density and reuse count need to be measured. It looks useful for local LLM inference over long documents with nearby context or experiments that operate on sparse relationships.

Testing Japanese Streaming ASR with Apple SpeechAnalyzer: 0% CER and About One Second to Display

Nariaki Wada — Tue, 21 Jul 2026 02:56:57 +0000

Hello, everyone.

If meeting and note transcription can run entirely on a Mac, the audio never has to be sent to an external service. But how accurate and responsive is that experience while someone is speaking Japanese?

Today, I am feeding Japanese audio incrementally into Apple's SpeechAnalyzer and measuring recognition accuracy and partial-result latency. I also built a local browser tool for trying it with a microphone.

The short answer is that both configurations achieved a 0% character error rate (CER) in all three trials on a 14.171-second synthetic conversation. In progressive mode, however, the first text arrived after about 1.09 seconds, and partial-result delivery latency was about 1.01 seconds at p95. The content was accurate, but the text did not appear immediately after each word was spoken.

What Is SpeechAnalyzer?

SpeechAnalyzer is Apple's new speech-to-text API, announced at WWDC25 in June 2025 and introduced with the iOS 26 and macOS 26 generation.

Its SpeechTranscriber module uses a new general-purpose conversational speech-recognition model. Apple describes it as faster and more flexible than the previous model and suitable for long-form or distant audio such as meetings and lectures. Processing runs on the device.

This is not an open-source model whose weights can be downloaded and redistributed independently. AssetInventory downloads the required model from Apple's servers, while the operating system stores and updates it. Apple does not present a standalone open-source license for the model; use of the Speech framework and its model is subject to the applicable Apple OS, Xcode, SDK, and developer terms. The lab code created for this experiment uses the MIT License.

Roles and Data Flow

This experiment does not combine multiple recognition models. It uses one ja_JP SpeechTranscriber model. The surrounding path is:

browser microphone (normally 48 kHz Float32)
  -> AudioWorklet
  -> WebSocket (localhost)
  -> Python bridge (resample to 16 kHz)
  -> Swift CLI (convert to Int16 PCM)
  -> SpeechAnalyzer + SpeechTranscriber
  -> partial / final results
  -> browser display

PCM is a basic audio representation that stores the sound wave as a sequence of numbers. The Python bridge and Swift CLI only prepare the format; they do not recognize speech. Audio remains within localhost and is neither uploaded nor saved.

What I Tested

The experiment asks three questions:

Can it transcribe a known Japanese conversation correctly?
How long do the first and subsequent partial results take?
Can a browser microphone be connected to the on-device SpeechAnalyzer path?

The complete code and JSON measurement report are available in the apple-speech-analyzer-streaming-asr lab in kiarina/labs.

Reproducing the Lab

You will need an Apple Silicon Mac, macOS 26 or later, Xcode Command Line Tools, mise, uv, and FFmpeg. These commands download the shared audio asset, run the tests, build a release binary, and execute three trials per configuration.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/20/apple-speech-analyzer-streaming-asr
mise -C 2026/07/20/apple-speech-analyzer-streaming-asr run

To try a live microphone in the browser, run:

mise -C 2026/07/20/apple-speech-analyzer-streaming-asr run demo

Partial text appears in lime and finalized text in white. The local server has no authentication, so keep it on 127.0.0.1 and do not expose it publicly.

Test Conditions

I fed a 14.171-second synthetic two-speaker conversation in 100 ms chunks, paced to match real time. I compared two presets:

transcription: prioritizes accuracy and returns finalized results
progressiveTranscription: also returns the accumulated in-progress transcript

The main environment was a MacBook Pro with an Apple M1 Max and 64 GB of memory, macOS 26.5.2, and Swift 6.3.3. The locale was ja_JP, and input was 16 kHz mono Int16 PCM.

I measured differences from the reference with CER. It is the proportion of character substitutions, deletions, and insertions needed to match the correct text. This experiment ignored whitespace and punctuation during that comparison.
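As a rough illustration of that metric, the sketch below computes CER with a plain Levenshtein distance after stripping whitespace and punctuation. The function names are hypothetical, and dividing by the reference length is the common convention rather than something confirmed from the lab code.

```python
import unicodedata


def strip_for_cer(text: str) -> str:
    """Drop whitespace and Unicode punctuation so only content characters remain."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, normalized by reference length (assumed convention)."""
    ref, hyp = strip_for_cer(reference), strip_for_cer(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)
```

Under these assumptions, a hypothesis that differs from the reference only in spacing or punctuation scores 0.0, which is why a 0% CER says nothing about readability.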

Results

Preset	CER	Partial results	First partial	Partial delivery p50 / p95	Final delivery
transcription	0.0% × 3	0	—	—	1.160–1.215 s
progressive	0.0% × 3	93, 93, 94	1.082–1.101 s	0.526 / 1.014 s	1.033–1.223 s

End-to-end real-time factor (RTF), from input start to finalization, was 1.072–1.086. An RTF of 1.0 means the full processing time equals the audio duration. Because this test intentionally waits while feeding audio at real-time speed, that figure is not the model's isolated compute speed.

All Six Utterances Retained Their Content

After ignoring punctuation and whitespace, all six utterances matched the reference in every trial. The phrase containing the number, あと5分くらいかな (“about five more minutes”), was also preserved.

Punctuation varied between trials and presets. For example, one result joined そっちはこっちは, while another inserted a full stop between the two phrases. A 0% CER here does not mean that every output was equally readable.

Progressive Display Lagged by About One Second

progressiveTranscription returned its first partial after about 1.09 seconds and then kept updating the accumulated text. Across all 280 partial results, delivery lag relative to the corresponding audio time was 0.526 seconds at p50 and 1.014 seconds at p95.

p95 means that 95% of the results arrived within that amount of time. The preregistered targets, under one second for both the first partial and p95 delivery, were narrowly missed.
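The lag statistics can be sketched as follows. This is a minimal example that assumes each partial result carries the audio time it covers and the wall-clock time it was delivered, both on the same clock; the nearest-rank percentile rule and the function names are assumptions, not the lab's confirmed method.

```python
import math


def delivery_lags(partials):
    """partials: (audio_end_s, delivered_at_s) pairs measured on one clock."""
    return [delivered - audio_end for audio_end, delivered in partials]


def percentile(values, q):
    """Nearest-rank percentile: smallest sample with at least q% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

Other percentile rules (for example linear interpolation) can shift a p95 by a few milliseconds on a few hundred samples, so the rule is worth recording next to the result.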

The Live-Microphone Path Also Worked

The automated end-to-end test sent 48 kHz audio in browser-sized 128-frame chunks, resampled it to 16 kHz, and passed it to SpeechAnalyzer. It received 93 partial results and one final result, with the same final content as the benchmark.

In an informal Chrome trial with a MacBook Pro's built-in microphone, one user also perceived about one second of display delay. Recognition errors seemed infrequent, but 十分 (“enough”) was rendered as 10分 (“ten minutes”), and some punctuation was missing. This was a subjective single-person check, not part of the measured CER evaluation.

A Plain-Language Reading of the Results

It captured the spoken content accurately in one short, clean conversation.

This was only one synthetic recording. A noisy room or unfamiliar proper nouns may produce very different results.

It can update text while someone speaks, but the display is roughly one second behind.

That may be acceptable for meeting notes or draft captions, but it may feel slow for a voice command that should react immediately after one word.

Speech can be processed without sending it away from the device.

The model runs on the Mac. Its initial asset download and later updates still come from Apple's servers.

Limitations and Implementation Notes

The evaluation covered one clean, 14.171-second synthetic conversation and three trials per preset.
It did not score punctuation quality.
It did not test noise, reverberation, dialects, proper nouns, or long streams.
The exact version and hash of Apple's system-managed model cannot be pinned.
The benchmark excludes microphone buffering and browser permission time.

The implementation had to match the model's PCM type as well as its sample rate. Passing Float32 directly failed, while conversion to the 16 kHz Int16 format requested by SpeechAnalyzer worked. Reproductions should record the OS build, Speech framework, and locale together.
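The lab performs this conversion in the Swift CLI. Purely to illustrate the format change, here is a Python sketch that turns Float32-style samples in [-1.0, 1.0] into little-endian 16-bit PCM bytes. The function name and the 32767 scale factor are assumptions; resampling from 48 kHz to 16 kHz is a separate step that is not shown.

```python
import struct


def float_to_int16_pcm(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = []
    for x in samples:
        x = max(-1.0, min(1.0, x))          # clamp anything outside full scale
        ints.append(int(round(x * 32767)))  # scale to the Int16 range
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each sample becomes two bytes, so one second of 16 kHz mono audio is 32,000 bytes.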

My Takeaway

Even though this was only one short conversation, a 0% CER across every on-device trial was better than I expected. The roughly one-second display delay was visible, but it did not feel extremely slow during informal free speech.

Punctuation was less stable than content recognition, so a production tool would benefit from a readability cleanup stage. The current path looks useful for local meeting notes or draft captions when audio must not leave the device.

Can MeanVC Stream 200 ms Voice Conversion in Real Time on an M1 Max? 37.7 ms p95

Nariaki Wada — Sun, 19 Jul 2026 12:52:02 +0000

Hello, everyone.

It would be interesting if a Mac could keep what I say while changing only the characteristics of my voice, all locally and in real time.

Today, I am testing MeanVC's 200 ms streaming model on an Apple M1 Max. I will measure processing speed, the shift in speaker characteristics, and discontinuities between audio chunks.

The short answer is that one CPU thread converted a 200 ms audio chunk in 37.7 ms at p95. All 99 measured chunks finished before the next chunk arrived, and the converted voice became more similar to the target reference than to the source speaker. However, estimated time to the first output was about 0.28–0.37 seconds.

What Is MeanVC?

MeanVC is a lightweight streaming zero-shot voice conversion model whose paper was published on October 9, 2025. Voice conversion changes vocal characteristics while preserving the spoken content. Zero-shot means that a target speaker can be specified from reference audio without training a separate model for that speaker.

MeanVC divides audio into short chunks and carries context forward while converting them. It also uses MeanFlow to move from noise to generated audio features in a small number of steps. I used the officially released 200 ms model with two inference steps.

The published MeanVC code and pretrained model use the Apache License 2.0. It permits use, modification, and redistribution subject to conditions such as preserving copyright and license notices. The original WavLM implementation uses the MIT License, while S3PRL, which loads WavLM here, uses Apache License 2.0. The evaluation-only ECAPA-TDNN ONNX model uses the MIT License.

Model Roles and Data Flow

The conversion path uses four pretrained models.

Model	Role
Fast-U2++	Extracts linguistic content features from the source audio
WavLM speaker encoder	Extracts target-speaker characteristics from the reference
MeanFlow DiT	Generates a converted audio spectrogram from the content and speaker features
Vocos	Decodes the generated spectrogram into a playable waveform

The data flows as follows.

target reference -> WavLM + mel features ----┐
                                             v
source -> 200 ms chunks -> Fast-U2++ -> MeanFlow DiT -> Vocos -> converted audio

evaluation only:
source / target / converted -> ECAPA-TDNN -> compare speaker-feature distances

Mel features describe energy at different frequencies on a scale closer to human hearing. ECAPA-TDNN is not involved in conversion. I used it only to check whether an independent model also saw a shift toward the target speaker.

What I Tested

The experiment asks three questions.

Can each 200 ms chunk finish before the next one arrives?
Do the converted speaker features move closer to the target than the source?
Do the 200 ms boundaries create large waveform discontinuities?

The complete code, pinned model revisions, SHA-256 hashes, and JSON report are available in the meanvc-streaming-apple-silicon lab in kiarina/labs.

Reproducing the Lab

You will need mise, uv, FFmpeg, and an internet connection for the initial download. The first run downloads about 2.7 GiB of models and related files.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/19/meanvc-streaming-apple-silicon
mise -C 2026/07/19/meanvc-streaming-apple-silicon run

The converted WAV files and measurement report are written to output/. To try audio files or a live microphone from a browser, run:

mise -C 2026/07/19/meanvc-streaming-apple-silicon run demo

The browser tool accepts source audio to convert and target audio as the voice reference. It also includes a button for the same synthetic sample used in this experiment and a live mode that converts microphone input in 200 ms chunks. Audio is processed only on localhost and is not uploaded to an external service.

Test Conditions

I used a 14.171-second synthetic conversation between two speakers. Three turns from Speaker 1, totaling 6.588 seconds, formed the source. Three different turns from Speaker 2, totaling 7.583 seconds, formed the target reference. No utterance appeared in both sets.

The main conditions were:

Item	Condition
Mac	MacBook Pro, Apple M1 Max, 64 GB
Execution	PyTorch CPU, one thread
Audio	16 kHz, mono
Chunk	245 ms first, then 200 ms
MeanFlow	Two steps
Measurement	33 chunks × 3 trials, 99 chunks total

Instead of a physical microphone, the lab fed an audio file incrementally, in the order a microphone would deliver it. This measures model compute time but excludes buffering in the microphone, virtual audio device, and speakers.

Results

200 ms Converted in 37.7 ms at p95

Metric	Observed
200 ms chunk inference p50	35.1 ms
200 ms chunk inference p95	37.7 ms
200 ms chunk inference p99	38.0 ms
Maximum	38.4 ms
Deadline misses	0 / 99 chunks
Peak resident memory	2,849 MB

p95 means that 95% of measurements were at or below that value. Here, 95% of the 200 ms chunks finished within 37.7 ms, and the slowest took 38.4 ms. That left about 162 ms at p95 before the next chunk arrived, and the model never fell behind during the measurement.

The first sound still has to wait for the initial 245 ms of input and its inference. Estimated first-output latency was 277 ms at p50, 359 ms at p95, and 368 ms at maximum. Continuous processing has ample compute headroom, but this is not a 20–100 ms low-latency voice changer.
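The deadline and first-output arithmetic can be sketched as below. It assumes a chunk misses its deadline when inference takes at least as long as the chunk itself, and that first-output latency is the first chunk's duration plus its inference time. The function names are hypothetical, and the lab's own estimate may include other terms.

```python
CHUNK_MS = 200.0        # steady-state chunk length from the test conditions
FIRST_CHUNK_MS = 245.0  # longer first chunk from the test conditions


def deadline_misses(inference_ms, chunk_ms=CHUNK_MS):
    """Count chunks whose inference did not finish before the next chunk arrived."""
    return sum(1 for t in inference_ms if t >= chunk_ms)


def headroom_ms(inference_ms, chunk_ms=CHUNK_MS):
    """Idle time left in one chunk period after inference."""
    return chunk_ms - inference_ms


def first_output_latency_ms(first_inference_ms, first_chunk_ms=FIRST_CHUNK_MS):
    """Wait for the first chunk to fill, then for its inference (assumed model)."""
    return first_chunk_ms + first_inference_ms
```

With the reported p95 of 37.7 ms, `headroom_ms(37.7)` is about 162.3 ms, matching the figure above.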

Speaker Features Shifted Toward the Target

ECAPA-TDNN represented each audio sample as 192 numbers, and I compared them with cosine similarity. A value closer to one means that the model considers the speaker characteristics more similar.

Pair	Similarity
Source − target reference	0.210
Converted − source	0.320
Converted − target reference	0.725

The converted voice was clearly closer to the target reference than to the source. For comparison, similarity between individual turns from different speakers averaged 0.175, so 0.725 was a substantial shift for this sample.
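For reference, cosine similarity between two speaker vectors can be computed as in this minimal sketch. It works on plain lists of floats of any length; extracting the 192-dimensional ECAPA-TDNN vectors is outside its scope.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length vector has no direction")
    return dot / (norm_a * norm_b)
```

Because the vectors are normalized by their lengths, the score depends only on direction, not on how loud or long the audio sample was.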

This score alone does not mean “the same person” or “natural speech.” Naturalness, pronunciation, and listener judgments of identity require separate evaluation.

No Large Discontinuity Found at 200 ms Boundaries

I measured the sample-value difference across 32 chunk boundaries. Boundary differences had a p95 of 0.0462, while ordinary adjacent differences across the full audio had a p95 of 0.0567. The boundaries were not systematically larger. The converted waveform also contained 0% clipped samples.

This is a waveform check, not a human listening test for click noise.
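A minimal sketch of that waveform check follows. It assumes the converted audio is 16 kHz and that output chunks line up with the 245 ms and 200 ms input chunks, giving 3,920 and 3,200 samples respectively. Those alignment assumptions and the function names are mine, not taken from the lab.

```python
def chunk_boundaries(total_samples, first=3920, step=3200):
    """Index of the first sample of every chunk after the first one."""
    return list(range(first, total_samples, step))


def adjacent_differences(samples):
    """Absolute difference between every pair of neighbouring samples."""
    return [abs(b - a) for a, b in zip(samples, samples[1:])]


def boundary_differences(samples, boundaries):
    """Absolute jump across each chunk boundary only."""
    return [
        abs(samples[i] - samples[i - 1])
        for i in boundaries
        if 0 < i < len(samples)
    ]
```

For 6.588 seconds at 16 kHz (105,408 samples), `chunk_boundaries` yields 32 indices under these assumptions, consistent with the 32 boundaries measured above. Comparing a percentile of the boundary jumps with the same percentile over all adjacent differences then shows whether the boundaries stand out.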

A Plain-Language Reading of the Results

The M1 Max CPU converted audio faster than it arrived.

Each 200 ms of speech took about 35–38 ms to process, and all 99 chunks met their deadline. The compute speed is suitable for applications such as streaming or online meetings that can tolerate some delayed output.

The voice moved in the target speaker's direction.

An independent speaker model also found the converted audio more similar to the target than the source. This does not establish naturalness or pronunciation quality.

“Real-time processing” is not the same as “imperceptible latency.”

Sustained conversion was fast, but the first output took about 0.28–0.37 seconds. That delay may be very noticeable when monitoring your own voice through headphones.

A Note on Japanese

The MeanVC paper trained the conversion model primarily on Mandarin speech. Fast-U2++, which extracts content, was also trained on the Mandarin-centered WenetSpeech dataset. The Japanese input used here is outside the main language conditions evaluated in the paper.

Japanese features such as geminate consonants, long vowels, and pitch accent may not be fully preserved. With one Japanese input and one speaker pair, this experiment cannot isolate language-related degradation. For the reference, clean single-speaker audio with little noise or reverberation is the safest starting point.

Limitations and Responsible Use

The evaluation used one short Japanese conversation and one speaker pair.
It did not measure full microphone-to-speaker latency.
It did not run listening tests for naturalness, intelligibility, or pronunciation preservation.
It did not test noise, singing, laughter, whispering, or long sessions.
It used PyTorch CPU, not MPS, ONNX, or Core ML.

Voice conversion should only use audio whose speaker has consented to both its processing and intended use. This experiment used a synthetic conversation from shared test assets and did not create a model that imitates a real person.

My Takeaway

Because the model waits for 200 ms of input, I expected the conversion itself to be fairly heavy. In practice, one CPU thread reached 37.7 ms at p95, leaving substantial headroom for continuous processing. The independent speaker model also confirmed a shift toward the target, making this more than a speed-only demonstration.

The full zero-shot path is less lightweight than the conversion core alone suggests: WavLM brings the initial download to about 2.7 GiB and peak memory to about 2.8 GB. Even so, MeanVC could work well for local streaming effects or prototyping character voices with consent when a modest delay is acceptable.

Real-Time Webcam-to-VRM Retargeting with MediaPipe Holistic: 17.3 FPS on M1 Max

Nariaki Wada — Sat, 18 Jul 2026 11:37:06 +0000

Hello, everyone.

If I raise my hand in front of a webcam, can a 3D avatar raise its hand too? More specifically, can a browser do this without dedicated motion-capture hardware?

Today, I am testing MediaPipe Holistic as a way to estimate the body, face, and both hands, then retarget the result to a VRM 1.0 avatar in real time.

The short answer is that an Apple M1 Max processed a 1280×720 webcam stream at an effective 17.3 FPS using CPU/WASM. Every one of the 141 frames with a detected pose updated the VRM skeleton. Body tracking worked, but this short measurement did not establish stable simultaneous capture of the face and both hands.

What Is MediaPipe Holistic?

MediaPipe Holistic is a pipeline that Google announced on December 10, 2020 for estimating the body, face, and both hands from one camera. A landmark is a point representing a location such as a shoulder, elbow, or fingertip. The announced version returned 543 points: 33 for the pose, 468 for the face, and 21 for each hand. The bundle used here includes Face Mesh V2 with ten additional iris points, bringing the face output to 478 and the total to 553.

The 13.7 MB holistic_landmarker.task used here contains seven TensorFlow Lite model files. Their roles can be summarized as follows.

Model	Role
pose detector	Finds the person in the image
pose landmarks detector	Estimates 33 body points and 3D coordinates
face detector	Finds the face region
face landmarks detector	Estimates 478 facial points, including the irises
face blendshapes	Produces 52 coefficients such as eye blink and jaw opening
hand ROI refinement	Corrects the hand crop for a closer look
hand landmarks detector	Estimates 21 points and is shared by the left and right hands

Holistic first uses the body position to propose approximate face and hand regions. It then crops those areas from the full-resolution input. This multi-stage design avoids trying to read small fingertips from the already-downscaled whole-body image.

I pinned the float16 task bundle updated on December 21, 2023. Its component documentation dates BlazePose to April 2021, Hand Tracking to October 2021, Face Mesh V2 to September 2022, and Blendshape V2 to November 2022. The model cards for BlazePose GHUM 3D, Hand Tracking, Face Mesh V2, and Blendshape V2 all specify the Apache License 2.0. MediaPipe itself and @mediapipe/tasks-vision also use Apache License 2.0.

For the lab's bundled avatar and the measurement, I used Seed-san, an official sample model by VirtualCast, Inc. Its embedded settings point to the VRM Public License 1.0 and require credit. Redistribution is allowed, including redistribution of modified data; its avatar permission is everyone, and its commercial-use setting is corporation.

The demo video uses the tool's custom-VRM loading feature to display AvatarSample_A, a VRoid Studio sample model. Its official conditions of use allow free commercial and noncommercial use without attribution. The model is not CC0, and its copyright has not been waived. The @pixiv/three-vrm and Three.js libraries use the MIT License.

What I Tested

The experiment asks five questions.

Can the browser version of Holistic Landmarker process a webcam stream in real time?
Can 33 body landmarks be converted into VRM 1.0 bone rotations?
Can 21 landmarks per hand drive the wrists and fingers?
Can facial coefficients be converted into VRM expressions?
Does smoothing keep the response interactive while reducing jitter?

The complete code and measurement notes are available in the mediapipe-holistic-vrm lab in kiarina/labs.

Reproducing the Lab

You will need mise, a browser with WebGL and camera-input support, and an internet connection for the first download. mise installs Node.js 22.22.0.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/18/mediapipe-holistic-vrm
mise -C 2026/07/18/mediapipe-holistic-vrm run
mise -C 2026/07/18/mediapipe-holistic-vrm run preview

Open the displayed localhost URL and select カメラを開始 (Start Camera). The initial setup downloads the MediaPipe bundle and Seed-san, then verifies their SHA-256 hashes. Camera input and inference remain inside the browser and are not sent to an external server. You can also load another VRM file with the file picker or by dropping it onto the viewer.

Data Flow

webcam (1280×720)
  -> Holistic Landmarker (body, face, and both hands)
  -> landmarks and 52 face blendshapes
  -> MediaPipe-to-Three.js coordinate conversion
  -> directions and body, face, and palm orientation
  -> 34 VRM 1.0 bones and expressions
  -> interpolation against the previous frame to reduce jitter
  -> Three.js rendering

Retargeting means transferring an estimated human pose to another skeleton. For an arm or leg, I align the direction between two landmarks, such as shoulder to elbow, with the corresponding bone's rest direction in the VRM. The hips, shoulders, face, and palms use multiple points to build three axes, providing an approximation of torso rotation and palm orientation as well.
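The two-point alignment can be illustrated with a shortest-arc quaternion that rotates a bone's rest direction onto the direction between two landmarks. The lab runs in the browser with Three.js; this is a Python sketch of the same geometric idea under that assumption, not the lab's code, and the function names are hypothetical.

```python
import math


def unit(v):
    """Scale a vector of any dimension to length one."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def cross(p, r):
    return (p[1] * r[2] - p[2] * r[1],
            p[2] * r[0] - p[0] * r[2],
            p[0] * r[1] - p[1] * r[0])


def shortest_arc(rest, target):
    """Quaternion (x, y, z, w) rotating direction `rest` onto direction `target`."""
    a, b = unit(rest), unit(target)
    dot = sum(x * y for x, y in zip(a, b))
    if dot < -1.0 + 1e-8:
        # Opposite directions: turn 180 degrees about any axis perpendicular to a.
        axis = (-a[1], a[0], 0.0) if abs(a[0]) > abs(a[2]) else (0.0, -a[2], a[1])
        return (*unit(axis), 0.0)
    return unit((*cross(a, b), 1.0 + dot))


def rotate(q, v):
    """Apply quaternion q to vector v."""
    x, y, z, w = q
    u = (x, y, z)
    t = tuple(2.0 * c for c in cross(u, v))
    ut = cross(u, t)
    return tuple(v[i] + w * t[i] + ut[i] for i in range(3))
```

For example, `shortest_arc((0, 1, 0), elbow - shoulder)` would give the rotation for an upper arm whose rest direction points up. A shortest-arc rotation fixes only the bone's direction, so twist around the bone axis stays undetermined.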

The hand chains map MediaPipe joints to VRM finger bones. For the face, mappings include eyeBlinkLeft to blinkLeft and jawOpen to aa. This is not speech recognition; it is a simple conversion from facial-shape coefficients to avatar expressions.

Small changes in estimated points look like shaking when applied directly to an avatar. The implementation therefore holds changes below two degrees for normal bones and below three degrees for wrists and fingers, then interpolates larger movements over time. Leg landmarks below 0.65 visibility are rejected. If the whole pose is lost for more than 0.5 seconds, the avatar returns toward its neutral pose.
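The hold-then-interpolate rule can be sketched on a single angle. The real implementation presumably works on bone rotations rather than scalar degrees, and the `blend` factor here is a hypothetical value, not one taken from the lab.

```python
def smooth_angle(previous_deg, target_deg, deadband_deg=2.0, blend=0.35):
    """Hold small changes; move a fraction of the way toward larger ones."""
    delta = target_deg - previous_deg
    if abs(delta) < deadband_deg:
        return previous_deg                 # below the threshold: treat as jitter
    return previous_deg + blend * delta     # larger motion: interpolate over frames


def leg_landmark_usable(visibility, minimum=0.65):
    """Reject leg landmarks whose visibility falls below the threshold."""
    return visibility >= minimum
```

Calling `smooth_angle` once per frame with `deadband_deg=3.0` would correspond to the wrist and finger setting. A larger `blend` responds faster but passes more jitter through.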

Results

I connected to the local server from the Codex in-app browser on a MacBook Pro with an Apple M1 Max and 64 GB of memory, running macOS 26.5.2. Inference ran through browser WASM using the CPU XNNPACK delegate.

The following values came from a 12-second measurement. I did not ensure that the subject stayed fully visible throughout the run.

Metric	Observed
Input	1280×720
Processed frames	207
Effective throughput	17.3 FPS
Mean inference time	49.17 ms
Median inference time	59.60 ms
Inference time p95	70.60 ms
Final inference rate	23 FPS
Final render rate	43 FPS
Pose detected	141 / 207 frames
Pose applied to VRM	141 / 141 detected frames
Right hand detected	1 / 207 frames
Left hand detected	0 / 207 frames
Face detected	0 / 207 frames

The end-to-end path from webcam input through Holistic inference and coordinate conversion to VRM 1.0 loading and bone updates worked. The 49.17 ms mean measures the model call's wall time. It is not glass-to-glass latency including camera exposure and display scanout.

The demo video shows the tool in operation with VRoid Studio's AvatarSample_A, rather than the Seed-san model bundled with the lab. The raw camera feed can be hidden without stopping inference, which makes the landmarks and avatar response easier to see.

How to Read the Zero Hand and Face Counts

These zeros are not an accuracy result saying that the model cannot detect hands or faces. During the short whole-body measurement, the face and hands were small and were not kept in a controlled, continuously detectable framing. A separate short run did produce a frame where the body and face were active together.

All finger-bone and expression conversions are implemented. Synthetic-input tests confirmed finger rotation and the jawOpen -> aa mapping. However, this experiment does not tell us how stable simultaneous face, hand, and full-body tracking is with a real camera.

What the Throughput Means

At 17.3 FPS, the pose can be updated about 17 times per second. That is usable for checking movement and interactive demos, but it is not smooth 60 FPS motion capture.

In @mediapipe/tasks-vision 0.10.35, detectForVideo is synchronous. Rendering on the same main thread therefore waits during inference. A frame that takes the measured p95 of 70.60 ms also blocks rendering for that time; inference and rendering are not fully isolated.

A Plain-Language Reading of the Results

The findings come down to three points.

A normal webcam drove a 3D avatar.

All 141 frames with a detected pose reached the VRM update. The basic path from body movement to a browser avatar worked without a dedicated sensor.

The speed was suitable for a demo, not smooth motion capture.

Effective throughput was 17.3 FPS and mean inference time was 49.17 ms. The response is visible, but fast movement or live-production use may show stutter.

This was not yet a complete face-and-finger evaluation.

The 12-second run was not a controlled framing test. Successful body retargeting and stable simultaneous tracking of the body, face, and both hands are separate claims.

Limitations

This retargeter is a lightweight directional approximation. Aligning two points leaves rotation around the bone axis unresolved, so forearm, wrist, and ankle twist is not exact. It also lacks hip translation, foot locking, and floor-contact IK. IK, or inverse kinematics, calculates joint angles backward from a target hand or foot position. Without it, the feet can slide while the root remains fixed.

The evaluation is also limited to one M1 Max, 12 seconds, and 207 frames. I did not measure model accuracy, different lighting and backgrounds, multiple people, long-run stability, or full glass-to-glass latency.

My Takeaway

MediaPipe Holistic removed much of the plumbing required to connect separate body, face, and hand estimators. It provided a practical foundation for reaching a VRM avatar entirely inside the browser. The body response from CPU/WASM on an M1 Max was more usable than I expected.

At the same time, the ability to output 553 points does not guarantee stable simultaneous full-body, facial, and finger capture from one monocular camera. A design centered on the body tracking demonstrated here, with face and hand tracking used according to framing and purpose, could work well for lightweight avatar demos or in-browser gesture interfaces.

Can YAMNet Detect Unseen Sudden Sounds in Real Time? A 48-Stream Evaluation

Nariaki Wada — Fri, 17 Jul 2026 01:53:43 +0000

Hello, everyone.

There are many situations where we may want to monitor sounds that happen without warning, such as breaking glass or a car horn. Registering every possible sound in advance, however, is not realistic.

Today, I am testing whether YAMNet can detect that a sound stream has changed, without first specifying which sound it should recognize.

The short answer is that the best configuration detected 22 of 48 events, for 45.8% recall. Processing was easily fast enough, but the results did not support the hypothesis that a simple distance over YAMNet embeddings can reliably detect unseen sudden sounds.

What Is YAMNet?

YAMNet is an audio classification model that Google added to TensorFlow Models on November 21, 2019. It was trained on AudioSet and predicts 521 acoustic event classes, including speech, rain, vehicles, and animals.

The network uses MobileNet V1, a convolutional architecture designed to reduce computation on mobile hardware. It reads roughly 0.96 seconds of 16 kHz mono audio at a time and returns:

scores (521 dimensions): confidence for each class
embedding (1,024 dimensions): a compact numerical representation of the sound
spectrogram (64 bands): frequency content over time

This experiment needs the embedding as well as the class scores, so I used the SavedModel version on TensorFlow Hub.

The relevant licenses are:

TensorFlow Models, including YAMNet: Apache License 2.0
TensorFlow: Apache License 2.0
ESC-50: Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)

ESC-50 includes a noncommercial restriction. The model and dataset have separate licenses, so both need to be considered for redistribution or product use.

What I Tested

The goal was to determine whether the detectors could notice a new sound mixed into a normal background, without being told the event label. The experiment asks four questions:

Whether a simple frequency-change detector can catch sudden events
Whether YAMNet scores or embeddings represent the change more clearly
Whether distance from the previous sound works better than distance from a memory of normal sounds
Whether the pipeline can process audio faster than real time

The complete code and results are available in the yamnet-streaming-novelty lab in kiarina/labs.

Reproducing the Lab

You will need mise, uv, FFmpeg, and an internet connection for the first download. The model and data use about 120 MB, while the Python environment uses about 1.3 GB.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml \
  2026/07/17/yamnet-streaming-novelty
mise -C 2026/07/17/yamnet-streaming-novelty run

The first run downloads the pinned YAMNet model and 208 required WAV files from a fixed ESC-50 revision. The task verifies the SHA-256 of the YAMNet archive and writes the results to output/report.json.

Roles and Data Flow

YAMNet is the only AI model in this pipeline. Small detectors for frequency change and vector distance operate around it.

16 kHz mono audio stream
  ├─ frequency change every 32 ms ─────────> spectral flux
  └─ 0.975 s window sent to YAMNet every 0.48 s
       ├─ 521 class scores
       │    ├─ distance from previous frame -> score delta
       │    └─ distance from normal memory -> score kNN
       └─ 1,024-dimensional embedding
            ├─ distance from previous frame -> embedding delta
            └─ distance from normal memory -> embedding kNN

The delta detectors measure change from the immediately previous sound. The kNN detectors compare each frame with the five nearest examples in a memory of normal sounds. They use cosine distance, which measures the difference in direction between numerical vectors.
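Both detector families can be sketched with one distance function. Averaging the five nearest distances is an assumption about how the neighbours are combined, using cosine distance for the delta detectors is also assumed, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for the same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)


def delta_score(frame, previous_frame):
    """Change relative to the immediately preceding frame."""
    return cosine_distance(frame, previous_frame)


def knn_score(frame, memory, k=5):
    """Mean distance to the k nearest vectors in the normal-sound memory."""
    if not memory:
        raise ValueError("memory is empty")
    nearest = sorted(cosine_distance(frame, m) for m in memory)[:k]
    return sum(nearest) / len(nearest)
```

The same functions apply to either the 521-dimensional scores or the 1,024-dimensional embeddings, since only the vector passed in changes.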

I also evaluated temporal fusion, which produces an alert when either score delta or embedding delta fires.

Evaluation Setup

The normal sounds were rain, sea_waves, wind, and clock_tick. For each positive stream, I inserted one event (crying_baby, door_wood_knock, glass_breaking, siren, car_horn, or fireworks) from 2.0 to 3.0 seconds into a five-second background clip. There were eight examples of each event, for 48 positive streams in total.

Event-to-background level was tested at -10, -5, 0, and +5 dB. At -10 dB, the event is substantially quieter than the background; at +5 dB, it is louder. The mixed streams existed only in memory during evaluation and were not saved as files.
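One way to mix an event at a chosen level is to scale it by the ratio of RMS values. This is a sketch that assumes the level is defined as event RMS relative to background RMS over the whole clip; the lab may define it differently, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sample list."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def event_gain(event, background, level_db):
    """Gain that places the event level_db dB relative to the background RMS."""
    event_rms, background_rms = rms(event), rms(background)
    if event_rms == 0.0:
        raise ValueError("event is silent")
    return background_rms / event_rms * 10 ** (level_db / 20)


def mix(background, event, start, level_db):
    """Add the scaled event into a copy of the background at sample `start`."""
    if start < 0 or start + len(event) > len(background):
        raise ValueError("event does not fit inside the background")
    gain = event_gain(event, background, level_db)
    mixed = list(background)
    for i, x in enumerate(event):
        mixed[start + i] += gain * x
    return mixed
```

At 16 kHz, inserting from 2.0 to 3.0 seconds corresponds to `start=32000` with a 16,000-sample event. A level of -10 dB multiplies the 0 dB gain by roughly 0.316.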

Recordings were separated by purpose:

Purpose	ESC-50 fold	Number of clips
Normal-sound memory	1-3	96
Threshold calibration	4	32
False-alert evaluation	5	32
Sudden-event evaluation	5	48 streams

Each threshold was set high enough to produce no alerts on the normal fold-4 calibration streams. I did not tune it against the final fold-5 evaluation set.
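
A sketch of that calibration rule: take the largest score seen on the calibration streams and place the threshold just above it. The `margin` value and the function names are hypothetical.

```python
def calibrate_threshold(calibration_streams, margin=1e-6):
    """Smallest threshold (plus a margin) that yields no alerts on normal audio."""
    peak = max(score for stream in calibration_streams for score in stream)
    return peak + margin

def count_alerts(stream, threshold):
    """Count the frames whose score exceeds the threshold."""
    return sum(1 for score in stream if score > threshold)

# Hypothetical per-frame scores from two normal calibration streams.
calibration = [[0.10, 0.30], [0.20, 0.25]]
threshold = calibrate_threshold(calibration)
print(all(count_alerts(stream, threshold) == 0 for stream in calibration))  # True
```

Because the threshold is fixed by the loudest normal frame in fold 4, a single unusual calibration clip can raise it and lower recall on the evaluation fold.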

Results

Which Score Ranked Anomalies Best?

I first compared whether positive streams received higher anomaly scores than negative streams without fixing a threshold. AUROC summarizes this ranking: 1.0 is ideal, while a value around 0.5 is close to chance.
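
AUROC can be read as the probability that a randomly chosen positive stream scores higher than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. It assumes each stream has already been reduced to one score (for example its maximum frame score), which is my assumption rather than something stated above.

```python
def auroc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical scores, for illustration only.
print(auroc([0.9, 0.3], [0.5, 0.1]))  # 0.75
```

A value below 0.5 means the negatives tended to score higher than the positives.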

Detector	AUROC	-10 dB	-5 dB	0 dB	+5 dB
spectral flux	0.449	0.393	0.315	0.555	0.534
score delta	0.717	0.622	0.672	0.776	0.797
embedding delta	0.734	0.638	0.625	0.815	0.859
score kNN	0.661	0.484	0.698	0.789	0.672
embedding kNN	0.632	0.479	0.581	0.721	0.745

Embedding delta ranked first at 0.734, but it was less than 0.02 ahead of score delta. The 80 evaluation streams (48 positive and 32 negative) are not enough to conclude that embeddings are generally better.

The broader pattern is clearer: distance from the previous frame worked better than distance from the normal-sound memory. For short events like these, asking "did the sound suddenly change?" was more useful than asking "is this sound globally unlike the normal set?"

Alerts with Strict Thresholds

Detector	Precision	Recall	F1	False alerts/hour	Median latency
spectral flux	0.000	0.0%	0.000	0.0	—
score delta	0.810	35.4%	0.493	22.5	0.415 s
embedding delta	0.765	27.1%	0.400	22.5	0.415 s
score kNN	1.000	6.2%	0.118	0.0	0.895 s
embedding kNN	0.800	16.7%	0.276	0.0	0.895 s
score delta + embedding delta	0.786	45.8%	0.579	22.5	0.415 s

Precision is the proportion of emitted alerts that were correct. Recall is the proportion of the 48 events that were found. Temporal fusion, the best configuration by recall and F1, found 22 events and missed 26.

There was only one false alert in 160 seconds of normal audio. The reported 22.5 false alerts/hour extrapolates that single event to one hour, so it is a highly uncertain estimate rather than a production false-alert rate. Several hours of continuous audio would be needed for a useful measurement.
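
The arithmetic behind two of the table columns, as a short sketch with hypothetical function names. The inputs are the values reported above.

```python
def false_alerts_per_hour(false_alerts, negative_seconds):
    """Extrapolate a false-alert count on negative audio to one hour."""
    return false_alerts * 3600.0 / negative_seconds

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

print(false_alerts_per_hour(1, 160))       # 22.5
print(round(f1_score(0.786, 22 / 48), 3))  # 0.579
```

With only 160 seconds of negative audio, the smallest nonzero rate this measurement can report is 22.5 per hour.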

Recall by event level was:

Event level	Detected	Recall	Median latency
-10 dB	4/12	33.3%	1.855 s
-5 dB	3/12	25.0%	0.415 s
0 dB	7/12	58.3%	0.415 s
+5 dB	8/12	66.7%	0.415 s

By class, the detector found 6/8 glass-breaking events, 5/8 fireworks, and 4/8 car horns. It found only 2/8 crying-baby and 2/8 siren events. Short, sharp changes were relatively easy, while sounds that blended into the background or were weak in the selected one-second segment were harder.

What Did Not Work

With a looser threshold, simple spectral flux reached 29.2% recall but produced 967.5 false alerts/hour. Rain and waves naturally contain many frequency changes, so this score could not isolate unusual events.

Combining spectral flux with embedding kNN also failed to help. Under the strict threshold it matched embedding kNN alone at 16.7% recall. Under the looser threshold it reached 45.8% recall but produced 607.5 false alerts/hour. Adding detectors was not automatically an improvement.

A high novelty score also does not guarantee a correct YAMNet label. Glass was the top label for only three of the eight glass-breaking events. At low event levels, background labels such as Water, Rain, and Vehicle often remained on top. Detecting that something changed and explaining what changed are separate problems.

Processing Speed

I processed 1,040 seconds of audio as sequential windows on a Mac Studio with an Apple M4 Max.

feature extraction elapsed: 6.262 s
real-time factor:           0.0060x

A repeat run while writing this article took 6.921 seconds, for a real-time factor of 0.0067x. The detection counts, AUROC values, thresholds, and latencies matched the original run. Including YAMNet inference and spectral flux, both runs processed audio more than 150 times faster than real time, so compute throughput was not a bottleneck.

However, YAMNet consumes a window of roughly one second (0.975 s in this pipeline), so the shortest observed detection latency was still 0.415 seconds. Fast inference does not remove the wait introduced by the input window.
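
One reading that reproduces the reported latencies is sketched below. It assumes windows start at multiples of 0.48 s from the beginning of the stream and that an alert is timestamped at the end of its window; neither detail is stated above.

```python
WINDOW_S = 0.975  # audio sent to YAMNet per window
HOP_S = 0.48      # spacing between window starts
ONSET_S = 2.0     # events were inserted from 2.0 s

def latency_for_window(window_index, onset_s=ONSET_S):
    """Latency if the alert is stamped at the end of the given window."""
    window_end = window_index * HOP_S + WINDOW_S
    return round(window_end - onset_s, 3)

# Windows 3, 4, and 6 start at 1.44 s, 1.92 s, and 2.88 s.
for index in (3, 4, 6):
    print(index, latency_for_window(index))
# 3 0.415
# 4 0.895
# 6 1.855
```

Under these assumptions, 0.415 s is the earliest latency any window-based detector in this setup could report, because window 3 is the first window that ends after the onset.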

The verification environment was:

machine: Mac Studio (Mac16,9)
chip: Apple M4 Max
OS: macOS 26.5.2, arm64
Python: 3.12.10
TensorFlow: 2.21.0
NumPy: 2.3.5
FFmpeg: 8.1.2
random seed: 20260717

A Plain-Language Reading of the Results

The results come down to three points.

The pipeline was fast enough.

It processed 1,040 seconds of audio in 6.262-6.921 seconds across two runs. The compute throughput needed for real-time monitoring was available.

Detection accuracy was not good enough.

The best method found only 22 of 48 events. Quiet events were especially likely to disappear into the background. This is not ready for a safety-monitoring application.

Recent change mattered more than distance from normal.

Comparing the embedding with a memory of normal sounds was weaker than comparing each frame with the previous one. Local change was the more useful signal for these short events.

This experiment is limited to 48 synthetic five-second streams, 160 seconds of negative audio, ten selected ESC-50 categories, and one Mac. Continuous microphone input, several-hour false-alert measurements, different recording devices, and environments with multiple simultaneous events remain outside its scope.

My Takeaway

YAMNet was lightweight and easy to use as a foundation because it exposes both class scores and embeddings. Still, an embedding may contain useful acoustic information without its raw distance being a reliable anomaly score.

In this experiment, recent change worked more directly than storing a large normal-sound memory. A lightweight first stage that proposes sharp events such as glass breaks or fireworks, followed by YAMNet labels or another decision step, looks like a more promising use of this approach.

Estimating Surface Orientation and 3D from One Image with MoGe-2 on Apple Silicon

Nariaki Wada — Thu, 16 Jul 2026 00:47:39 +0000

Hello, everyone.

Estimating distance from one image is useful, but knowing surface orientation as well—whether a road faces upward or a wall faces sideways—provides richer information for 3D conversion and relighting.

Today, I ran MoGe-2 ViT-S Normal on Apple Silicon and tested its surface-normal output, CPU and MPS speed, and consistency with the 3D information produced by the same inference.

To give the result first, median inference time on an Apple M1 Max was 212.13 ms with MPS and 1,259.34 ms with the CPU, making MPS 5.94 times faster. Surface orientation and object boundaries were visible across a street, tabletop objects, and a crowd. I also turned single images of a street, a cat, and a girl into 2.5D animations with sideways camera movement.

What is MoGe-2?

MoGe-2 is a monocular geometry estimation model released by researchers at Microsoft Research on June 10, 2025. Monocular means that it uses a single RGB image rather than a stereo camera pair. Its paper was submitted to arXiv on July 3 of the same year.

While the original MoGe focused on relative 3D shape within an image, MoGe-2 adds metric scale, sharper details, and normal estimation in a unified model.

A surface normal is a three-dimensional vector that tells us which way the surface at each pixel is facing. In the colorful normal maps in this article, the x, y, and z directions are mapped to RGB. The colors represent orientation, not object categories.
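
A common way to draw a normal map is to shift each component from [-1, 1] to [0, 255]. The sketch below assumes that convention; the exact axis signs and scaling used for the images in this article are not stated, and `normal_to_rgb` is a hypothetical name.

```python
def normal_to_rgb(normal):
    """Map a unit normal with components in [-1, 1] to an 8-bit RGB triple."""
    return tuple(int(round((component * 0.5 + 0.5) * 255)) for component in normal)

print(normal_to_rgb((-1.0, 1.0, 1.0)))  # (0, 255, 255)
```

Under this mapping, two surfaces with the same orientation get the same color regardless of what object they belong to.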

What we are testing

I used Ruicheng/moge-2-vits-normal, the smallest official model with normal output, to check:

Whether it runs on an Apple M1 Max CPU and MPS without patches
How much faster MPS is than the CPU with the same FP32 model
Whether normals are visually readable for a street, tabletop objects, and a crowd
How closely the directly predicted normals agree with normals calculated from the 3D point map
Whether the point map can be converted into a textured 3D mesh

MPS is the PyTorch backend that uses the GPU in a Mac. The target lab is available here:

kiarina/labs/2026/07/16/moge2-surface-normal-apple-silicon

Reproducing the environment

You need an Apple Silicon Mac, mise, uv, and an internet connection for the initial model and shared-image downloads.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml 2026/07/16/moge2-surface-normal-apple-silicon
mise -C 2026/07/16/moge2-surface-normal-apple-silicon run

The first run downloads a checkpoint at a fixed revision and verifies its SHA-256 hash. To also create a GLB and a four-second Blender video from the street image, install Blender 5.1.2 and FFmpeg, then run:

mise -C 2026/07/16/moge2-surface-normal-apple-silicon run render-video

Model, mechanism, and licenses

This test uses one pretrained model. It is not a pipeline that calls multiple AI models in sequence. Inside one checkpoint, an encoder reads image features and several heads produce task-specific outputs.

RGB image
  -> DINOv2 ViT-S encoder: extract global and local image features
  -> shared neck: expand them into multi-resolution features
  -> points head: 3D coordinate at each pixel
     normal head: surface orientation at each pixel
     mask head: reliable pixels
     scale head: metric scale in meters
  -> metric point map / depth / normal / mask / camera intrinsics
  -> optional: point map + source-image texture -> GLB mesh

A head is a small output component that turns shared features into values for one task. A point map assigns a 3D camera-space point to every pixel. Camera intrinsics describe internal camera properties related to focal length and field of view.
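
Under a pinhole camera model, field of view and focal length are two views of the same intrinsic. The sketch below converts the street FoV values reported later in this article into focal lengths in pixels. The 768x384 image size is an assumption (a 2:1 image with a 384-pixel short side), and `focal_length_px` is a hypothetical helper.

```python
import math

def focal_length_px(size_px, fov_deg):
    """Pinhole focal length in pixels for one image side and its field of view."""
    return 0.5 * size_px / math.tan(math.radians(fov_deg) / 2.0)

# Assumed inference size: 768x384. FoV values: 60.52 degrees (x), 32.53 degrees (y).
fx = focal_length_px(768, 60.52)
fy = focal_length_px(384, 32.53)
print(round(fx, 1), round(fy, 1))
```

If the two values come out nearly equal, the horizontal and vertical FoV are consistent with square pixels at that image size.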

Item	Pinned value
checkpoint	`Ruicheng/moge-2-vits-normal/model.pt`
parameters	35,103,656
file size	140,550,416 bytes
SHA-256	`79a16621928c2bf0ed04659218c55c01075e950507f40bb3332fb4c873d3e1dc`
Hugging Face revision	`679230677b4d282c6f304189a93e98e14f085902`
MoGe repository commit	`07444410f1e33f402353b99d6ccd26bd31e469e8`

The licenses are:

MoGe code: MIT License
DINOv2 code under moge/model/dinov2: Apache License 2.0
The Hugging Face checkpoint used here: labeled MIT

The lab does not commit the checkpoint. It downloads the pinned revision at runtime. Check the current official terms again before redistributing the model or embedding it in a product.

Method

I resized each input to a short side of 384 pixels and fixed the settings to resolution_level=5, FP32, and batch size 1. For the street-image benchmark, I measured 10 runs after three warm-up runs. Model loading, image loading, preprocessing, and saving were excluded. Warm-up reduces the effect of one-time setup work on the timing.
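
The timing protocol can be sketched as a small harness: run the function a few times without recording, then time each measured run separately. This is a generic sketch with hypothetical names, not the lab's script. On an asynchronous backend such as MPS, the timed function would also need to wait for the device to finish, for example by calling `torch.mps.synchronize()`, before the timer stops.

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=10):
    """Time `fn` over `runs` calls after `warmup` untimed calls; results in ms."""
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
        # Sample standard deviation; which variant the lab reports is an assumption.
        "stdev": statistics.stdev(samples_ms),
    }

# Stand-in workload, for illustration only.
stats = benchmark(lambda: sum(range(10_000)))
print(sorted(stats))
```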

The comparison target for the predicted normals is not ground truth. I calculated a second normal from neighboring 3D points in the point map produced by the same inference, then measured the per-pixel angular difference. This tests internal consistency between two outputs.
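
A minimal sketch of that consistency check for one pixel: build two tangent vectors from neighboring 3D points, take their cross product, and measure the angle to the predicted normal. The choice of right and lower neighbors and the resulting sign convention are assumptions of this sketch, and all names are hypothetical.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def normal_from_points(point_map, row, col):
    """Normal at a pixel from its right and lower neighbors in the point map."""
    p = point_map[row][col]
    dx = sub(point_map[row][col + 1], p)
    dy = sub(point_map[row + 1][col], p)
    return normalize(cross(dx, dy))

def angle_deg(a, b):
    """Angle in degrees between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.degrees(math.acos(dot))

# Toy point map: a plane tilted 45 degrees about the vertical axis.
tilted = [[(c * 0.1, r * 0.1, 2.0 + c * 0.1) for c in range(3)] for r in range(3)]
derived = normal_from_points(tilted, 1, 1)
print(round(angle_deg(derived, (0.0, 0.0, 1.0)), 3))  # 45.0
```

When the two neighbors lie on different objects, `dx` or `dy` spans a depth jump and the derived normal no longer describes either surface.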

Before and after

The input is on the left, and the directly predicted MoGe-2 normal is on the right.

Street

The road, buildings on both sides, cars, and trees have distinct orientations. The sky is black because it was marked invalid.

Tabletop objects

Large surfaces on the desk and books have consistent colors, while the colors vary smoothly around the curved cup and bottle.

Crowd

Rounded faces and shoulders remain visible, as do boundaries between overlapping people. However, the distant background and small people also show unstable bands.

Results

CPU and MPS speed

backend	mean	median	min	max	std dev
PyTorch CPU	1,272.74 ms	1,259.34 ms	1,244.37 ms	1,340.16 ms	28.24 ms
PyTorch MPS	211.58 ms	212.13 ms	208.75 ms	214.68 ms	1.88 ms

By median time, MPS was 5.94 times faster than the CPU. About 212 ms corresponds to roughly 4.71 FPS for model inference alone. On the same input, the angular difference between the CPU and MPS normals had a mean of 0.0063 degrees, a median of 0 degrees, and a maximum of 0.0396 degrees.
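
The speedup and frame-rate figures follow directly from the median times in the table. A short sketch with hypothetical helper names:

```python
def speedup(slow_ms, fast_ms):
    """How many times faster the second timing is than the first."""
    return slow_ms / fast_ms

def frames_per_second(latency_ms):
    """Throughput if each frame takes `latency_ms` of model inference."""
    return 1000.0 / latency_ms

print(round(speedup(1259.34, 212.13), 2))    # 5.94
print(round(frames_per_second(212.13), 2))   # 4.71
```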

I ran the same command again while writing this article. Median time was 1,299.96 ms on the CPU and 175.88 ms on MPS, making MPS 7.39 times faster in that run. The model hash, normal differences, and three-image consistency data matched the original measurement. Inference time varies with runtime conditions, so these two runs should be read as a measured range rather than one fixed speed.

Consistency between two normal representations

image	valid pixels	mean	median	p90
street	90.80%	30.75°	18.69°	78.55°
objects	100.00%	14.41°	7.05°	35.86°
crowd	98.63%	37.94°	33.04°	73.47°

The two representations agreed most closely on the broad surfaces in the tabletop image. Differences increased at object boundaries, thin structures, trees, crowds, and distant details. Normals derived from a point map use differences between neighboring pixels, so they become unstable when those neighbors cross a sudden depth discontinuity. These angular differences should not be interpreted as the accuracy of the directly predicted normals.

Converting three images into 3D animations

I used the official CLI to turn each point map into triangles and create GLBs textured with the source images. I then moved a Blender camera sideways from the original viewpoint and rendered four-second animations that show the resulting parallax.

The APNGs below are reduced to 512x288 at 10 FPS for article display. The lab produces the source MP4 files at 1280x720 and 30 FPS.

Street with cars

Parallax appears between the foreground cars, road, buildings on both sides, and people. Even a small sideways camera movement makes it clear that the single image has been converted into surfaces with depth.

vertices: 233,120
triangles: 449,292
dimensions: approximately 30.75 x 114.21 x 21.15 m
estimated horizontal FoV: 60.52°
GLB size: 13,342,320 bytes

Cat character

The pink cat is placed in front of the trees, bench, and flower bed, producing clear parallax. Its head has gentle curvature, but the ears, hands, bow tie, and body are thin shapes centered on surfaces visible from the source camera.

Girl character

The girl, bench, streetlight, sign, and background trees are separated into different depths. However, the hair, arms, and legs are not complete independent volumes; they appear as thin surfaces following the visible outlines.

The measured mesh data for all three scenes is below.

image	vertices	triangles	FoV x / y	GLB size
street	233,120	449,292	60.52° / 32.53°	13,342,320 bytes
cat	312,105	599,514	54.27° / 42.05°	17,893,416 bytes
girl	325,649	621,216	60.30° / 47.08°	18,731,304 bytes

I moved the street camera by ±0.8 m. The cat and girl were estimated closer to the camera, so I reduced their movement to ±0.5 m. These results are 2.5D: they look three-dimensional near the original viewpoint but do not reconstruct the hidden backs of objects.

All three examples support small viewpoint changes, but black holes appear where triangles were removed around the sky, leaves, hair, and object boundaries. Stylized images can be animated in the same way as realistic ones, although thin outlines and stylized shapes remain difficult to interpret as 3D geometry.

What I learned during the test

My first attempt calculated normals only from horizontal and vertical differences in the depth image. The median angular difference was an implausible 113–143 degrees because that method ignored perspective projection and camera intrinsics. Switching to neighboring 3D points in the metric point map produced comparable orientations.

The official implementation also emitted a warning because FP32 autocast is unsupported on both CPU and MPS. Autocast was disabled automatically, and inference completed without a patch.

The verification environment was:

machine: MacBook Pro (Apple M1 Max, 64 GB, arm64)
OS: macOS 26.5.2
Python: 3.12.10
PyTorch: 2.13.0
OpenCV: 5.0.0
NumPy: 2.5.1
Blender: 5.1.2
FFmpeg: 8.1.2

Interpretation

The detailed data is above, but the simpler reading has three main points.

The Mac GPU reduced the wait to a practical range

MPS took about 176–212 ms per image. That is still too slow for real-time video, but practical for image-editing or 3D preprocessing tasks.

One inference produced several kinds of 3D information

Metric point maps, depth, normals, masks, and camera intrinsics were produced together. No separate models were needed, making the outputs easy to combine.

Boundaries and hidden surfaces remain difficult

Normals represented broad planes and curved objects, but thin structures and distant details were unstable. A mesh made from one image is also not a complete 3D space with reconstructed backsides.

This test was limited to three generated images without ground-truth normals or depth. Timing used only one street image, a 384-pixel short side, FP32, and one M1 Max. I did not test official benchmarks, real photos, ViT-B/L, FP16, Core ML, the Apple Neural Engine, temporal video consistency, power consumption, or accuracy against ground truth.

Thoughts after verification

Getting road orientation, object curvature, and a metric point map from one image in one pass was more convenient than I expected. In particular, the dedicated normal head produced smoother boundaries than the normals calculated naively from the point map, which made the value of direct normal prediction easy to see.

The 3D mesh is interesting for small viewpoint changes, but it is not a space that can be explored freely.

This looks like a fun way to add a little camera movement to a single image.

Running the Lightweight ZipDepth Model on Apple Silicon

Nariaki Wada — Wed, 15 Jul 2026 00:44:59 +0000

Hello, everyone.

Depth information is useful when we want to distinguish the foreground from the background in a single image. It can support background blur, 3D effects, and perception for robots.

Today, I test the lightweight ZipDepth model on Apple Silicon and compare PyTorch CPU, PyTorch MPS, and ONNX Runtime CPU.

The short result is that MPS was the fastest option on an Apple M1 Max, averaging 15.34 ms. It was about five times faster than PyTorch CPU with the same standard model. The model also produced clear near-to-far relationships for three different images.

What I tested

Depth estimation predicts how near or far each pixel is from the camera. This test uses monocular depth estimation, which needs only one RGB image.

Whether ZipDepth runs on the Apple Silicon CPU and GPU
Whether the NPU-compatible model can be converted to ONNX and executed
Inference time and output differences between backends
Qualitative results for a street, tabletop objects, and a crowd

This is the street image used for the benchmark.

Lab: kiarina/labs/2026/07/15/zipdepth-apple-silicon

Reproducing the test

You need mise, uv, an Apple Silicon Mac, and an internet connection for the first model and image download.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml 2026/07/15/zipdepth-apple-silicon
mise -C 2026/07/15/zipdepth-apple-silicon run

On the first run, the task downloads the checkpoints, verifies their SHA-256 hashes, and exports the NPU-compatible model to ONNX opset 18. The same task also runs the benchmark and generates the result images.

What is ZipDepth?

ZipDepth is a monocular depth estimation model whose paper was released in July 2026 and presented at ECCV 2026. It has 6.1 million parameters and requires 3.0 GMACs for a 384x384 input. It targets devices ranging from server GPUs to mobile hardware.

Its output is relative inverse depth. In the images in this article, brighter pixels are nearer and darker pixels are farther away. It does not return metric distances such as three meters.

The model was distilled on about 14.07 million images across 17 domains, using pseudo-depth produced by Depth Anything V2 Large. Knowledge distillation trains a small model using predictions from a larger teacher model.

The data flow is short:

Training performed by the ZipDepth authors
RGB image -> Depth Anything V2 Large -> pseudo-depth -> train ZipDepth

Inference in this lab
RGB image -> resize short side to 384 -> ZipDepth -> inverse depth map
                                            -> color visualization

Depth Anything V2 Large was the training-time teacher. It is not executed by this lab.

Models and licenses

ZipDepth provides two checkpoints that use different methods to restore the output resolution.

Checkpoint	Role	Runtime
`zipdepth_base.pth`	Standard model with `Unfold`-based upsampling	PyTorch CPU / MPS
`zipdepth_base_npu.pth`	Conversion-friendly, unfold-free model	PyTorch CPU / ONNX Runtime CPU

MPS is the PyTorch backend for running on the Apple Silicon GPU. Despite the NPU-compatible name, this test runs that model on the CPU, not the Apple Neural Engine.

ZipDepth code and checkpoints: MIT License
Commit: a302e5437bc58f15c4efd41d3e8222bf24f7d470
Standard SHA-256: a55910bb0b99c8c5e641cb9206e810b269690ad94e8a2ef08c827c4679391a65
NPU-compatible SHA-256: 627c04fda584133ead4310074884a4a037061b4c01ba86e73e492ea30fab570d

The Depth Anything V2 Large teacher is licensed under CC BY-NC 4.0. The official project lists Small under Apache-2.0 and Base, Large, and Giant under CC BY-NC 4.0. This lab downloads and runs the MIT-licensed ZipDepth checkpoints; it does not download Depth Anything V2 Large.

Method

Each image keeps its aspect ratio, with its shorter side resized to 384 pixels. I benchmarked the 768x384 street image for ten runs after three warm-up runs. A warm-up avoids including one-time initialization work in the measurement.
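
The resize rule can be sketched as a single scale factor applied to both sides. The 1774x887 source size is an assumption, taken from the shared street image described in another article in this series, and rounding to the nearest pixel is also an assumption.

```python
def resize_short_side(width, height, target=384):
    """Scale (width, height) so that the shorter side equals `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

print(resize_short_side(1774, 887))  # (768, 384)
```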

Model loading, image loading, preprocessing, visualization, and file saving are excluded from inference time. I visually inspected the output for a street, tabletop objects, and a crowd.

machine: MacBook Pro (Apple M1 Max, 64 GB, arm64)
OS: macOS 26.5.2
Python: 3.12.10
PyTorch: 2.13.0
ONNX Runtime: 1.27.0
input: FP32, batch 1, 768x384 for the benchmark
warm-up: 3 runs
measurement: 10 runs

Results

The input is on the left and the ZipDepth result is on the right. The color range is normalized independently for each image, so colors cannot be compared directly between rows.
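
Per-image normalization can be sketched as min-max scaling of the inverse depth values. The function name and the toy values are hypothetical; the lab's color mapping may differ in detail.

```python
def normalize_inverse_depth(values):
    """Scale one image's inverse depth to [0, 1]; larger means nearer and brighter."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # flat image: no depth contrast to show
    return [(v - lo) / (hi - lo) for v in values]

# The same raw value 2.0 lands at different brightness in two images.
print(normalize_inverse_depth([1.0, 2.0, 3.0]))  # [0.0, 0.5, 1.0]
print(normalize_inverse_depth([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

In the first image the raw value 2.0 is mid-gray, while in the second it is the darkest pixel, which is why colors in different rows cannot be compared.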

For the street, the nearby road and cars are bright, while distant buildings and the vanishing point are dark. On the tabletop, the desk, books, laptop, and background form separate depth levels. In the crowd, foreground people are brighter. Some small distant people merge into smooth regions instead of remaining individually separated.

These are visual observations, not an accuracy test against ground-truth depth.

Inference speed

Backend / model	Mean	Median	Min	Max	Std. dev.
PyTorch CPU / standard	77.78 ms	77.49 ms	75.75 ms	80.89 ms	1.65 ms
PyTorch MPS / standard	15.34 ms	15.56 ms	14.36 ms	15.89 ms	0.55 ms
PyTorch CPU / NPU-compatible	101.49 ms	100.47 ms	97.44 ms	109.55 ms	3.26 ms
ONNX Runtime CPU / NPU-compatible	47.08 ms	47.17 ms	46.54 ms	47.46 ms	0.33 ms

By median time, MPS was 4.98 times faster than PyTorch CPU with the same standard model. ONNX Runtime CPU was 2.13 times faster than PyTorch CPU with the same NPU-compatible model.

The MPS and ONNX Runtime outputs were also nearly identical by eye.

Numerical output differences

Relative depth can describe the same geometry with a different scale and offset. I therefore aligned scale and shift with least squares before measuring error.
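
A sketch of scale-and-shift alignment using the closed-form least-squares solution for a line fit. Which output is aligned to which, and all names here, are assumptions of this sketch.

```python
import math

def align_scale_shift(source, target):
    """Return (scale, shift) minimizing sum((scale * s + shift - t) ** 2)."""
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var_s = sum((s - mean_s) ** 2 for s in source)
    if var_s == 0.0:
        raise ValueError("source is constant; scale is undefined")
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    scale = cov / var_s
    return scale, mean_t - scale * mean_s

def aligned_errors(source, target):
    """MAE and RMSE between target and the scale/shift-aligned source."""
    scale, shift = align_scale_shift(source, target)
    residuals = [scale * s + shift - t for s, t in zip(source, target)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return mae, rmse

# Toy depth values: the target is exactly 2 * source + 1.
print(align_scale_shift([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # (2.0, 1.0)
```

After alignment, two outputs that differ only by a global scale and offset report an error of zero, so the remaining error reflects differences in predicted structure.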

Comparison	Aligned MAE	Aligned RMSE
Standard: PyTorch CPU vs MPS	0.00000002	0.00000002
NPU-compatible: PyTorch vs ONNX Runtime CPU	0.00000001	0.00000002
PyTorch CPU: standard vs NPU-compatible	0.00044436	0.00086735

Differences between backends using the same checkpoint were at the level of FP32 rounding. The standard and NPU-compatible models were not exactly identical, but the difference was not visible in the street result.

ONNX export issues

The first ONNX export failed because onnxscript was missing. PyTorch 2.13.0's exporter requires it, so adding the dependency fixed the error.

Requesting opset 17 also failed during conversion from the internally generated opset 18 model. The generated opset 18 model passed the ONNX checker and ran with ONNX Runtime, so the lab uses the actual opset 18 output.

Interpretation

The detailed data is above, but there are three main points.

MPS reached roughly 65 FPS of model inference

The model averaged 15.34 ms without any compatibility patch. That is promising for local image-processing features on Apple Silicon.

ONNX Runtime roughly halved CPU inference time

With the same NPU-compatible model, ONNX Runtime took 47.08 ms versus 101.49 ms for PyTorch CPU. The numerical output difference was minimal.

The direct MPS-to-ONNX comparison needs care

They use checkpoints with different upsampling methods. The roughly three-times speed difference cannot be attributed to the backend alone.

This test is limited to three generated images, one benchmark image, and one M1 Max. It does not test metric distance accuracy, official benchmarks, temporal stability in video, Core ML, the Apple Neural Engine, quantization, power use, or peak memory.

Final thoughts

For a model with only 6.1 million parameters, ZipDepth preserved useful object boundaries while producing a clear near-to-far structure. The roughly 15 ms MPS result looks fast enough for experiments such as local background blur or simple 3D effects.

The output is relative depth, so it cannot directly support applications that require real-world distance. Small distant people also tend to merge, making the model unsuitable as the only perception component in safety-critical systems. Even so, combined with object detection and tracking, it could provide useful context for an LLM-controlled agent to understand the relative position of obstacles or follow a person in real time.

Comparing YOLO26 Semantic Segmentation with PyTorch and ONNX

Nariaki Wada — Tue, 14 Jul 2026 05:43:39 +0000

Hello, everyone.

Sometimes it is not enough to know that an image contains a car. We also need the pixel-level boundaries of the road and sidewalk.

Today, I ran a YOLO26n semantic segmentation model on an Apple Silicon CPU and compared its PyTorch and ONNX Runtime outputs and speed.

To give the result first, the two class maps agreed on 99.3129% of all pixels. Mean end-to-end processing time was 27.55 ms with ONNX Runtime and 266.00 ms with PyTorch. Most of the difference came from post-processing rather than inference itself.

What we are verifying

Semantic segmentation assigns every pixel in an image to a class. Unlike object detection, which draws boxes, it produces regions that follow the shapes of roads, sky, cars, and other classes.

I used the official Cityscapes-trained yolo26n-sem.pt model to check:

Whether PyTorch inference and ONNX export work on an Apple M1 Max CPU
How closely the two class maps agree under the same input conditions
The time spent on preprocessing, inference, post-processing, and the full call
How the model divides a generated urban street image into 19 classes

This is the original test image.

Target lab: kiarina/labs/2026/07/14/yolo26-semantic-segmentation

Reproducing the environment

You need mise, uv, and an internet connection for the initial model and shared-image downloads.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml 2026/07/14/yolo26-semantic-segmentation
mise -C 2026/07/14/yolo26-semantic-segmentation run

On the first run, the task downloads the checkpoint, verifies its SHA-256 hash, and exports it to ONNX opset 18. It creates four images: the PyTorch output, ONNX output, disagreement view, and an annotated comparison.

Model and license

This test uses one pretrained model. PyTorch and ONNX are not two different models here; they are two execution paths for the same checkpoint.

Model	Role	Input and output
`yolo26n-sem.pt`	Assign each pixel to one of 19 Cityscapes classes	1x3x640x640 image → class logits
Exported `yolo26n-sem.onnx`	Run the same weights with ONNX Runtime	1x3x640x640 image → 1x640x640 class IDs

Class logits are raw scores indicating how likely each class is at each pixel. A class ID is the number of the selected class.

The data flow is:

1774x887 JPEG
  -> Fit into 640x640 while preserving aspect ratio (letterbox)
  -> Run the same checkpoint through two paths
       PyTorch: class logits -> resize -> class IDs
       ONNX:    640x640 class IDs -> resize
  -> 1774x887 class maps
  -> Compare pixel agreement, per-class IoU, and timing

Letterboxing adds padding to fit an image into a target size without stretching it. IoU measures the overlap between two regions; values closer to 100% mean better agreement.
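
The letterbox geometry for this image can be worked out with a few lines of arithmetic. The sketch assumes the padding is centered and sizes are rounded to the nearest pixel; `letterbox_geometry` is a hypothetical helper, not an Ultralytics function.

```python
def letterbox_geometry(width, height, target=640):
    """Return resized size and (before, after) padding per axis for a square target."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = target - new_w, target - new_h
    return (new_w, new_h,
            (pad_x // 2, pad_x - pad_x // 2),
            (pad_y // 2, pad_y - pad_y // 2))

print(letterbox_geometry(1774, 887))  # (640, 320, (0, 0), (160, 160))
```

For this 2:1 image, half of the 640x640 input is padding.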

Checkpoint: Ultralytics assets v8.4.0
Model size: 3,487,283 bytes
SHA-256: f3f293cca764de1f93044030d8d5612de9c5ffbf37c9c8ea1b69418b73038999
Ultralytics code and trained models: AGPL-3.0 or Ultralytics Enterprise License

The Ultralytics licensing page states that trained models are covered by AGPL-3.0 by default. Proprietary or commercial integration may require an Enterprise License. Check the official terms for your use case at the time of use. The lab does not commit the checkpoint or exported ONNX file; it downloads and generates them at runtime.

Method

The input is one fixed image containing a road, sidewalks, buildings, traffic signals, cars, people, and bicycles.

file: tests/assets/jpg/street_scene_1774x887_287kb.jpg
resolution: 1774x887
SHA-256: d5c865f452599311fbbfd0c132bb4f8b7ade4dd88f0c8ac14ce136490ea53a2e

Both paths use imgsz=640, rect=False, and the CPU. I measured 10 runs after three warm-up runs. Warm-up reduces the effect of setup work that happens only during the first inference. Model initialization, image loading, and image saving are excluded from the benchmark.

The model can output these 19 classes:

road, sidewalk, building, wall, fence, pole, traffic light, traffic sign,
vegetation, terrain, sky, person, rider, car, truck, bus, train,
motorcycle, bicycle

Results

The left side is the original image. The right side overlays the ONNX class map at 50% opacity. The classes that appeared and their pixel shares are shown below the image.

Both PyTorch and ONNX Runtime ran successfully on the Apple M1 Max CPU.

backend / stage	mean	min	max	std dev
PyTorch preprocessing	0.93 ms	0.89 ms	1.04 ms	0.04 ms
PyTorch inference	28.83 ms	26.39 ms	30.31 ms	1.02 ms
PyTorch post-processing	236.07 ms	233.19 ms	239.70 ms	1.67 ms
PyTorch wall time	266.00 ms	264.44 ms	270.02 ms	1.69 ms
ONNX preprocessing	1.01 ms	0.91 ms	1.24 ms	0.10 ms
ONNX inference	25.67 ms	24.32 ms	26.91 ms	0.88 ms
ONNX post-processing	0.72 ms	0.60 ms	1.11 ms	0.15 ms
ONNX wall time	27.55 ms	26.06 ms	28.82 ms	0.95 ms

ONNX inference itself took about 0.89 times the PyTorch time, while its full call took about 0.10 times as long. The main difference was post-processing. PyTorch resizes class logits to the original resolution before choosing class IDs. This exported ONNX model directly returns class IDs, making its post-processing much shorter.
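
The two post-processing orders can be compared on a toy example. This sketch uses nearest-neighbor resizing for both paths and hypothetical names; the interpolation used by the real PyTorch path is not stated here and may differ.

```python
def argmax_classes(logits):
    """logits[c][y][x] -> class_ids[y][x], choosing the highest-scoring class."""
    classes, h, w = len(logits), len(logits[0]), len(logits[0][0])
    return [[max(range(classes), key=lambda c: logits[c][y][x]) for x in range(w)]
            for y in range(h)]

def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbor resize of a 2D grid."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def logits_then_ids(logits, out_h, out_w):
    """Resize every class channel first, then pick class IDs."""
    return argmax_classes([resize_nearest(ch, out_h, out_w) for ch in logits])

def ids_then_resize(logits, out_h, out_w):
    """Pick class IDs at model resolution, then resize a single channel."""
    return resize_nearest(argmax_classes(logits), out_h, out_w)

# Toy logits: 2 classes on a 2x2 grid, upscaled to 4x4.
toy = [[[0.9, 0.2], [0.1, 0.8]], [[0.1, 0.7], [0.6, 0.3]]]
print(logits_then_ids(toy, 4, 4) == ids_then_resize(toy, 4, 4))  # True
```

The first order touches every class channel at output resolution: for 19 classes at 1774x887 that is about 29.9 million values, against about 1.6 million for a single ID map. With nearest-neighbor resizing the two orders agree exactly; with a smoother interpolation on the logits they can differ near class boundaries.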

I ran the lab again while writing this article. Mean wall time was 290.79 ms for PyTorch and 29.76 ms for ONNX. Timings vary between runs, but the roughly 10x gap and 99.3129% pixel agreement were reproduced.

Pixel and per-class agreement

The two paths produced the same class ID for 99.3129% of all pixels and different IDs for 0.6871%. The detailed data for the 16 predicted classes is below.

class	PyTorch area	ONNX area	IoU
road	21.27%	21.22%	99.52%
sidewalk	10.38%	10.39%	98.48%
building	16.28%	16.30%	98.99%
wall	3.05%	3.05%	97.16%
fence	4.18%	4.17%	98.32%
pole	1.62%	1.61%	88.15%
traffic light	0.03%	0.04%	88.03%
traffic sign	0.29%	0.29%	93.30%
vegetation	26.52%	26.54%	98.85%
terrain	0.02%	0.02%	82.92%
sky	8.94%	8.97%	99.15%
person	0.90%	0.90%	96.20%
rider	0.21%	0.21%	94.61%
car	5.11%	5.12%	98.67%
motorcycle	0.74%	0.74%	95.68%
bicycle	0.44%	0.44%	94.24%

Red pixels show where the two paths produced different class IDs. They are concentrated around object boundaries rather than inside large regions.
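For reference, pixel agreement and per-class IoU between two class-ID maps can be computed as in this minimal sketch. It uses flat Python lists to stay dependency-free, while a real script would more likely use NumPy arrays. The function names and the toy maps are illustrative, and the class IDs assume the index order of the class list above.

```python
def pixel_agreement(a, b):
    """Fraction of positions where two flat class-ID sequences match."""
    if len(a) != len(b):
        raise ValueError("class maps must have the same number of pixels")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def per_class_iou(a, b):
    """IoU per class ID, treating map a and map b as two predictions."""
    ious = {}
    for cls in sorted(set(a) | set(b)):
        inter = sum(x == cls and y == cls for x, y in zip(a, b))
        union = sum(x == cls or y == cls for x, y in zip(a, b))
        ious[cls] = inter / union
    return ious


# Toy 2x4 maps flattened row by row (0 = road, 1 = sidewalk, 5 = pole).
pytorch_ids = [0, 0, 1, 1, 0, 5, 1, 1]
onnx_ids = [0, 0, 1, 1, 0, 0, 1, 1]

print(pixel_agreement(pytorch_ids, onnx_ids))  # 0.875
print(per_class_iou(pytorch_ids, onnx_ids))    # {0: 0.75, 1: 1.0, 5: 0.0}
```

In the toy maps a single differing pixel leaves road at 0.75 but drops the one-pixel pole class to 0. The same effect explains why small classes in the table show lower IoU than large ones.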

The first failed comparison

In the first attempt, I did not set the rect option to the same value for both paths, and pixel agreement was 98.9648%. PyTorch used minimal padding while the fixed-shape ONNX model used 640x640 padding. The actual inputs therefore differed even with the same imgsz=640 setting.

Setting rect=False for both paths raised agreement to 99.3129%. A backend comparison must align preprocessing as well as the model.
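To make the input mismatch concrete, here is a simplified model of the two padding behaviors. It is not Ultralytics' code: `padded_shape` is a hypothetical helper, the stride of 32 is an assumption, and the 1920x1080 frame is an example size rather than the test image's.

```python
import math


def padded_shape(width, height, imgsz=640, stride=32, rect=False):
    """Return (width, height) of the model input after resize and padding.

    Simplified model: scale the long side to imgsz, then pad either to a
    full imgsz x imgsz square (rect=False) or only up to the next multiple
    of stride (rect=True, "minimal padding").
    """
    scale = imgsz / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    if not rect:
        return imgsz, imgsz
    return (math.ceil(new_w / stride) * stride,
            math.ceil(new_h / stride) * stride)


# Hypothetical 1920x1080 frame; the article's image size is not assumed here.
print(padded_shape(1920, 1080, rect=True))   # (640, 384)
print(padded_shape(1920, 1080, rect=False))  # (640, 640)
```

Under these assumptions the same imgsz=640 setting yields two different tensors, so small differences near padded borders and object boundaries are plausible.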

The verification environment was:

machine: MacBook Pro (Apple M1 Max, arm64)
OS: macOS 26.5.2
Python: 3.12.10
Ultralytics: 8.4.95
PyTorch: 2.13.0
ONNX: 1.22.0
ONNX Runtime: 1.27.0
OpenCV: 5.0.0
NumPy: 2.5.1
provider: CPUExecutionProvider

Interpretation

The detailed data is above; the short version comes down to three points.

Large regions agreed closely

Road, building, vegetation, and sky all had IoU above 98%. The overall visual results were also nearly identical. Relative differences were larger for thin poles, traffic lights, terrain, and region boundaries.

The output format was the main reason ONNX was faster

Inference differed by only about 3 ms, while post-processing differed by about 235 ms. That is because this ONNX export returns a class ID map directly. The result does not mean that ONNX is always 10 times faster; it applies to this export and the full processing paths tested here.

The street was segmented, but a distant bus was missed

Road, sidewalk, buildings, sky, people, cars, and bicycles were assigned to visually reasonable locations. However, a small bus in the distance was not labeled as bus. Small and distant objects remain difficult.

This test used one generated image without a ground-truth mask. Therefore, 99.3129% is agreement between PyTorch and ONNX, not prediction accuracy. I did not test Cityscapes validation data, real photos, other images, different resolutions, MPS, CoreML, quantization, or memory usage.

Thoughts after verification

Exporting to ONNX preserved almost the same visual result while reducing CPU processing to about 30 ms per image. That looks useful for locally processing road and sidewalk regions in sequence.

The important caveat is that most of the speedup came from different post-processing. Performance should not be judged from the backend name alone; preprocessing and output format matter too. Next, I would like to test temporal stability on real road video and compare CoreML performance.

Removing a Portrait Background with BiRefNet ONNX on CPU

Nariaki Wada — Mon, 13 Jul 2026 05:11:02 +0000

Hello, everyone.

Background removal is common, but I wanted to see whether a local CPU could preserve thin details such as hair.

Today, I ran an official BiRefNet ONNX model with ONNX Runtime and removed the background from a portrait.

The result first: the model produced a transparent PNG that preserved the main subject and most of the hair. Inference with a 1024x1024 input took about 4.32 seconds on average on an Apple M1 Max CPU.

What we are verifying

BiRefNet separates a high-resolution image into foreground and background. This is called dichotomous image segmentation (DIS), meaning segmentation into two kinds of regions.

I used the general-purpose Swin-Tiny ONNX model from the official GitHub Release and checked:

Whether it runs from Python with only the ONNX Runtime CPU backend
Whether it separates a person, white clothing, and thin windblown hair from the background
How long preprocessing, inference, post-processing, and PNG saving take
How alpha values are distributed in the output

This is the original test image.

Target lab: kiarina/labs/2026/07/13/birefnet-onnx

Reproducing the environment

You need mise, uv, and an internet connection for the initial model and shared-image downloads.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml 2026/07/13/birefnet-onnx
mise -C 2026/07/13/birefnet-onnx run

On the first run, the task downloads the model and verifies its SHA-256 hash before using it. It then creates output_removed_bg.png, with the background represented by the alpha channel.
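A hash check of this kind can be written with the standard library alone. The sketch below is my own illustration, not the lab's task code; `sha256_of` and `verify` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large model file is not read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Raise if the file's SHA-256 does not match the pinned value."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(f"SHA-256 mismatch for {path}: got {actual}")
    return True
```

Chunked reading matters here because the ONNX file is over 200 MB.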

Model and license

This test uses one model.

Model	Role	Input and output
`BiRefNet-general-bb_swin_v1_tiny-epoch_232.onnx`	Estimate whether each pixel belongs to the foreground or background	1x3x1024x1024 image → 1x1x1024x1024 logits

Swin-Tiny is the backbone, or the base component that extracts image features. The official model list presents it as a smaller and faster alternative to the larger Swin-Large backbone. The ONNX file used here is 224,005,088 bytes.

As described in the paper, BiRefNet has a localization module that uses the whole image to locate the subject and a reconstruction module that refers to local image details and edges to recover fine structure. Because this test uses the exported ONNX model, these modules are not invoked separately in the code.

The data flow is:

1536x1024 JPEG
  -> Resize to 1024x1024, convert to RGB, and normalize
  -> Estimate foreground logits with BiRefNet ONNX
  -> Convert logits to a 0-1 mask with sigmoid
  -> Resize the mask back to 1536x1024
  -> Add it to the original image as an alpha channel
  -> Transparent PNG

Logits are the model's raw output values. Sigmoid converts them to a range from 0 to 1. The final image treats 0 as transparent and 1 as opaque.
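The logit-to-alpha step can be illustrated for a single value. This is a minimal sketch: rounding to the nearest 8-bit value is my assumption, and the lab may scale or truncate differently.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logit_to_alpha(x):
    """Map one raw logit to an 8-bit alpha value (0 = transparent, 255 = opaque)."""
    return round(sigmoid(x) * 255)


for logit in (-12.0, 0.0, 12.0):
    print(logit, logit_to_alpha(logit))
```

Strongly negative logits land at 0, strongly positive ones at 255, and values near zero end up in the partially transparent range.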

BiRefNet code: MIT License
Official Hugging Face model: labeled MIT
ONNX weight used in this test: official GitHub Release v1

Before redistributing the model or integrating it into a product, check the current license text, attribution requirements, and dependency terms.

Method

The input is one fixed portrait of a person by the sea. It includes white clothing and thin strands of hair extending to the right in the wind.

file: tests/assets/jpg/removebg_1536x1024_141kb.jpg
resolution: 1536x1024
SHA-256: d3d362b876936c57cfaf61eedd0ada05fd4950483ab79502aa5a67ded4a6b910

OpenCV resizes the image to 1024x1024 and normalizes it with the ImageNet mean and standard deviation. After ONNX Runtime inference, the mask is resized to the original resolution and saved as an 8-bit alpha channel.
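As a per-pixel illustration of the normalization step, the sketch below uses the commonly cited ImageNet statistics. I am assuming the lab uses these same constants in RGB order; the function name is mine.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # RGB order
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale 8-bit RGB to 0-1, then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(normalize_pixel((255, 255, 255)))  # white, e.g. the clothing
print(normalize_pixel((0, 0, 0)))        # black
```

OpenCV loads images in BGR order, which is why the data flow includes a conversion to RGB before these per-channel constants are applied.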

The timing excludes model download, SHA-256 verification, InferenceSession initialization, and image loading. The inference benchmark runs 10 times after three warm-up runs. Warm-up reduces the effect of setup work that occurs only during the first inference.

Results

This is the generated transparent PNG. Depending on the viewer, transparent areas may appear white or as a checkerboard.

These results were recorded on the CPU of a MacBook Pro with an Apple M1 Max.

--- One-shot processing time ---
Preprocessing:   23.86 ms
Inference:     4699.83 ms
Postprocessing:  6.80 ms
PNG save:        41.07 ms
Total:         4771.56 ms

--- Inference benchmark (Warmup: 3, Iterations: 10) ---
Average time: 4319.62 ms
Min time:     4206.41 ms
Max time:     4480.35 ms
Std dev:        88.62 ms

I ran the same lab again while writing this article. Average inference time was 4302.47 ms, with a minimum of 4179.94 ms and a maximum of 4477.45 ms. Both runs were close to 4.3 seconds, but these timings are reference values that vary between runs.

The output is a 1536x1024, 8-bit RGBA PNG. The alpha distribution was identical in both runs.

Transparent (alpha=0): 65.92%
Transition (1-254):     9.37%
Opaque (alpha=255):     24.71%

Transition means pixels that are neither fully transparent nor fully opaque. These values help produce smoother hair and clothing boundaries, but their percentage alone does not measure segmentation accuracy.
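The three buckets can be counted as in this small sketch. The input here is a toy list, not the article's alpha channel, and `alpha_distribution` is an illustrative name.

```python
def alpha_distribution(alpha_values):
    """Percentages of fully transparent, partially transparent, and opaque pixels."""
    total = len(alpha_values)
    transparent = sum(a == 0 for a in alpha_values)
    opaque = sum(a == 255 for a in alpha_values)
    transition = total - transparent - opaque
    return {
        "transparent": 100.0 * transparent / total,
        "transition": 100.0 * transition / total,
        "opaque": 100.0 * opaque / total,
    }


# Toy alpha channel with 10 pixels, not the article's data.
print(alpha_distribution([0, 0, 0, 0, 0, 0, 17, 200, 255, 255]))
# {'transparent': 60.0, 'transition': 20.0, 'opaque': 20.0}
```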

The verification environment was:

machine: MacBook Pro (Apple M1 Max, arm64)
OS: macOS 26.5.1
Python: 3.12.10
ONNX Runtime: 1.27.0
OpenCV: 5.0.0
NumPy: 2.5.1
provider: CPUExecutionProvider

Interpretation

The detailed numbers are above; the short version comes down to three points.

It preserved the person and most of the hair

The model separated the face, white clothing, and most of the hair from the background. It retained many thin strands extending to the right, and partially transparent pixels kept the boundary from ending abruptly.

Some flyaway hair disappeared, and color spill remained

Some especially thin strands above the head and on the right disappeared. A small amount of blue from the original background also remained around the hair and clothing edges. Replacing the background with a different color may require an additional edge-color correction step.
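The standard "over" blend shows why the blue survives a background swap. In this sketch the edge pixel value is hypothetical and was not measured from the output.

```python
def composite(fg_rgb, alpha, bg_rgb):
    """Standard 'over' blend of one pixel; alpha is 0-255."""
    a = alpha / 255.0
    return tuple(round(a * f + (1.0 - a) * b) for f, b in zip(fg_rgb, bg_rgb))


# Hypothetical edge pixel: a hair strand already mixed with blue sky in the photo.
edge_pixel = (90, 110, 170)
print(composite(edge_pixel, 128, (255, 255, 255)))  # over white
print(composite(edge_pixel, 128, (200, 40, 40)))    # over red
```

The mask only scales the stored foreground color. If that color already contains blue from the original background, the blue is carried into every new composite, which is what an edge-color correction step would need to address.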

It runs on a CPU, but it is not real-time

Inference took about 4.3 seconds per image. That is usable for processing a small number of local images one by one, but video or large batches would need further tests with a GPU, a smaller model, or quantization. Quantization reduces the numerical precision used for computation to make a model lighter.

This test used only one well-lit outdoor portrait. There was no ground-truth mask, so the quality assessment is visual only. I did not test other subjects, complex backgrounds, low light, low resolution, multiple people, or a quality and speed comparison with the Swin-Large model. The current implementation also stretches the input into a square before inference, so alternative resizing methods remain untested.

Thoughts after verification

I was impressed that a simple ONNX Runtime implementation running only on the CPU could create a transparent image while preserving much of the hair. It seems useful when background removal should stay local instead of sending an image to an external service. Judging from this one image, the quality looks sufficient for an LLM agent that automatically converts collected portraits into transparent assets for display or image compositing, although other subjects and backgrounds remain untested.

Testing Japanese and English OCR with PP-OCRv6-small and RapidOCR

Nariaki Wada — Sun, 12 Jul 2026 07:39:21 +0000

Hello, everyone.

OCR is useful for reading text in images, but it is worth asking how much we can trust a result that merely looks correct.

Today, I ran PP-OCRv6-small through RapidOCR and tested how well it could read an image containing both Japanese and English on a CPU.

The result first: 12 of 14 representative strings matched exactly, and 13 of 14 were recognizable when a character-form difference was accepted. The pipeline also read vertical text, a slanted address, and small email text.

What we are verifying

PP-OCRv6 is a family of OCR models that finds text regions in an image and converts them into strings. It has tiny, small, and medium tiers. I selected small, with about 7.7 million parameters, as the balanced option between accuracy and size. Parameters are the values a model acquires through training.

Using RapidOCR 3.9.1 and the ONNX Runtime CPU backend, I checked:

Whether the PP-OCRv6-small detection and recognition models can be explicitly selected
Whether they can read Japanese, English, numbers, symbols, vertical text, and slanted text
End-to-end OCR pipeline time, excluding model initialization
Whether confidence scores agree with the actual correctness of recognized text

I used the fixed image below for the test. In addition to horizontal text, it contains vertical, small, and slanted text.

Target lab: kiarina/labs/2026/07/12/pp-ocrv6-small-rapidocr

Reproducing the environment

You need mise, uv, and an internet connection for the initial download of the shared image.

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/kiarina/labs.git
cd labs
git sparse-checkout set .gitignore .mise/tasks Makefile mise.toml 2026/07/12/pp-ocrv6-small-rapidocr
mise -C 2026/07/12/pp-ocrv6-small-rapidocr run

The command prints all recognized strings, confidence scores, coordinates, and timing data. It also generates output_ocr.jpg with the detected text regions drawn on the image.

Models and licenses

I used the following three models included in the RapidOCR wheel.

Stage	Model	Role
1. Detection	`PP-OCRv6_det_small.onnx`	Find regions that contain text
2. Orientation classification	`ch_ppocr_mobile_v2.0_cls_mobile.onnx`	Check and correct the orientation of each cropped region
3. Recognition	`PP-OCRv6_rec_small.onnx`	Convert each corrected region into Japanese or other text

The data flow is:

Input image
  -> Detect text regions
  -> Crop each region
  -> Correct text orientation
  -> Recognize text in each region
  -> Output text and confidence scores

PP-OCRv6-small handles detection and recognition, while RapidOCR's default model handles orientation classification. All inference ran with ONNX Runtime's CPUExecutionProvider.

Before redistributing models or integrating them into a product, check the current license text, attribution requirements, and dependency terms.

Method

The input was one fixed synthetic image of an indoor scene. It contains Japanese and English signs, a whiteboard, vertically printed book spines, a PC screen, small contact details, and a slanted envelope.

file: tests/assets/jpg/ocr_1448x1086_242kb.jpg
resolution: 1448x1086
SHA-256: 42d9024588f112ab9fbaf69c0e32a95462613c35b9cdbbb1a9c4bc1ff93ab96e

I selected 14 representative strings from the image and checked for matches after removing spaces and symbols. When a sentence was split across multiple detected regions, it passed if all parts were present.
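One way to express this matching rule is sketched below. It is a guess at the rule's shape, not the lab's checker. `normalize` and `passes` are hypothetical names. Treating "spaces and symbols" as the Unicode separator, punctuation, and symbol categories is my assumption, as is splitting an expected sentence into parts by hand.

```python
import unicodedata


def normalize(text):
    """Drop whitespace, punctuation, and symbol characters (Unicode Z*, P*, S*)."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in "PSZ"
    )


def passes(expected_parts, recognized_lines):
    """True if every expected part appears inside some recognized line."""
    lines = [normalize(line) for line in recognized_lines]
    return all(
        any(normalize(part) in line for line in lines)
        for part in expected_parts
    )


print(passes(["Next review:", "Friday, 3:45 PM"],
             ["Next review:", "Friday,3:45PM"]))      # True
print(passes(["在庫確認：ノート12冊／ペン24本"],
             ["在庫確認：ノ-ト12冊／ペン24本"]))        # False
print(passes(["取扱注意"], ["取极注意"]))              # False
```

Under this normalization the prolonged sound mark ー counts as a letter and is kept, while an ASCII hyphen counts as punctuation and is dropped. A line with a hyphen in place of ー therefore fails the strict check even though a human reader would accept it.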

For timing, I excluded model initialization and the first inference. After three warm-up runs, I passed the already-loaded image through the full pipeline 10 times. Warm-up runs reduce the effect of setup work that only occurs during the first inference.

Results

The left side of the following image shows the detected regions overlaid on the input, while the right side places the recognized strings at their corresponding positions.

These results were measured on the CPU of a Mac Studio with an Apple M4 Max.

detected lines:       40
representative match: 12/14
mean confidence:      0.984
min confidence:       0.883
max confidence:       1.000

average time:         839.01 ms
min time:             758.89 ms
max time:             890.43 ms
standard deviation:    37.32 ms

representative match is the strict automated check, which treats different character forms as a mismatch. If the hyphen recognized in place of the Japanese prolonged sound mark in ノート is accepted because the word remains understandable, the practical result is 13/14.

I ran the same lab again while writing this article. The strict 12/14 match result and confidence scores were unchanged, while timing was 920.47 ms on average, with a minimum of 835.24 ms and a maximum of 1090.61 ms. The 839.01 ms result is a reference measurement that varies between runs, not a fixed performance guarantee.

A confidence score is the model's own estimate of how reliable a recognition result is. A value near 1.000 is high, but it does not guarantee correctness.

The representative string checks were:

Type	Expected text	Result
Japanese	`OCR テストルーム`	PASS
English	`Please knock before entering`	PASS
Numbers	`12345`	PASS
Japanese and numbers	`在庫確認：ノート12冊／ペン24本`	ACCEPT (character-form difference)
English and punctuation	`Next review: Friday, 3:45 PM`	PASS
Small Japanese	`忘れずに水やり`	PASS
Small English	`Call Ken at 18:00`	PASS
Vertical Japanese	`日本語の練習`	PASS
Vertical English	`Deep Learning Basics`	PASS
Japanese	`取扱注意`	MISS
English	`FRAGILE`	PASS
Slanted Japanese	`東京都千代田区1-2-3`	PASS
Small email	`test@example.com`	PASS
Small phone number	`03-1234-5678`	PASS

The two strings that did not pass the strict automated check were:

expected: 在庫確認：ノート12冊／ペン24本
actual:   在庫確認：ノ-ト12冊／ペン24本
score:    0.947

expected: 取扱注意
actual:   取极注意
score:    0.883

The ノ-ト result uses a different character for the prolonged sound mark, but the word and the full sentence remain understandable, so I count it as recognized in this article. The one clear recognition error is 取扱注意, which was read as 取极注意.

Outside the representative set, Project Alpha was recognized as Project Alpi. However, the pencil in the source image covers the final a area. Because the complete string is not visible, I do not count this as a clear model failure.

The verification environment was:

machine: Mac Studio (Apple M4 Max, arm64)
OS: macOS 26.5.1
Python: 3.12.10
RapidOCR: 3.9.1
ONNX Runtime: 1.27.0
OpenCV: 5.0.0

Interpretation

The detailed numbers are above; the short version comes down to three points.

It read many texts despite differences in direction and size

The matches included vertical Japanese and English, a slanted address, and small email and phone text. This is only one fixed image, but it shows potential beyond plain document scans.

The evaluation rule changes how the result looks

The strict exact-match result is 12/14, while accepting ノ-ト as an understandable character-form difference raises it to 13/14. In contrast, recognizing 取扱注意 as 取极注意 changes the text itself, so I count it as an error. OCR should be evaluated according to whether the application needs exact transcription or only understandable content.

The CPU processed one image in about 0.84 to 0.92 seconds

For the 1448x1086 image, the two recorded averages for detection, orientation classification, recognition, and pre/post-processing were 839.01 ms and 920.47 ms. Bulk processing and real-time use would need additional measurement and optimization, but this felt manageable for processing local images one by one.

This test used only one readable synthetic image, so it is not a general OCR accuracy benchmark. I did not test handwriting, low light, blur, strong distortion, smaller text, other fonts, the tiny or medium tiers, or GPU backends. Timing also varies with hardware and ONNX Runtime optimization.

Thoughts after verification

RapidOCR felt like an accessible way to try ONNX-based OCR because it wraps detection, orientation correction, and recognition into one pipeline. Reading vertical and slanted text was better than I expected.

A character-form difference such as ノ-ト should still be useful for search or understanding the content. On the other hand, a character can change entirely, as in 取扱注意. If I integrate this into an LLM agent, I would keep OCR output as one observation rather than confirmed information and add a mechanism to recheck important values.
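A recheck mechanism could start as simply as the sketch below. The `OcrLine` type, the `needs_recheck` helper, and the 0.95 threshold are all hypothetical; only the two text and score pairs come from the results above.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    text: str
    score: float


def needs_recheck(line, threshold=0.95):
    """Flag a line for a second look; the threshold is a hypothetical choice."""
    return line.score < threshold


# The two scored results reported in the article.
lines = [
    OcrLine("在庫確認：ノ-ト12冊／ペン24本", 0.947),
    OcrLine("取极注意", 0.883),
]
for line in lines:
    print(line.text, line.score, "recheck" if needs_recheck(line) else "accept")
```

Both documented mismatches fall below 0.95, but the scores of the other 38 detected lines are not listed here, so this does not show how many correct lines the same threshold would also flag. A score filter also cannot catch errors that the model reports with high confidence, so important values would still need an independent check.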