Running ASR for smart homes in the NPU of Intel processors

Miguel Camba — Mon, 25 May 2026 19:50:26 +0000

I run my own smart home: Home Assistant, voice assistant pipeline, the whole self-hosted thing. The speech-to-text step (Parakeet TDT 0.6B v3 over the Wyoming protocol) had been running for months on my Intel NUC (i3-1220P) with a 12 GB RTX 3060 eGPU. I recently upgraded my home server to a full desktop with an AMD 7900XTX, and since I want to save as much of the VRAM as I can for LLMs, I've been running NVIDIA Parakeet on the CPU since then.
It works fine, but it always nagged me: my new home server has an Intel Core Ultra 7 265K (Arrow Lake) with the built-in "AI Boost" NPU, and that silicon was sitting completely idle.

With the hype of AI, chip manufacturers have started to slap NPUs on their chips mostly so they can put "AI" in their names, but little to no software actually makes use of them, although some projects are starting to pop up here and there.

So I decided to see whether I could put that stupidly underused chunk of silicon to work on a workload that should, on paper, be ideal for it.

And it worked remarkably well, but the road was bumpy.

TL;DR — the result, you came here for this table.

Same Spanish audio clips, similar wyoming-onnx-asr stack, but I swapped the inference backend from plain ORT-CPU to OpenVINO targeting the NPU, and I went from using the INT8 quantized model on the CPU to using the full-precision FP32 model on the NPU.

Results averaged from 10 runs after 1 warmup round.

Audio	Backend	Avg latency	Energy / inference	Power above idle
10 s	CPU INT8	978 ms	44.6 J	45.6 W
10 s	NPU FP32	204 ms ⚡	4.2 J	20.5 W
20 s	CPU INT8	1 708 ms	79.8 J	46.7 W
20 s	NPU FP32	615 ms ⚡	7.8 J	12.7 W
60 s	CPU INT8	5 011 ms	237.7 J	47.4 W
60 s	NPU FP32	818 ms ⚡	11.0 J	13.4 W

3-6× faster wall time. 10-22× less energy per transcription. For a workload that runs quite often in my home (I have 5 satellites and I don't reach for switches often), this is the kind of result that makes me wonder why nobody seems to be doing it.
For a nice voice assistant, response speed is a critical part of the experience. It's not like 500 ms extra makes for a terrible experience, but every little bit you save does improve it.
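As a sanity check on those headline ratios, the table can be recomputed in a few lines. This is only arithmetic on the values listed above (nothing is re-measured), and the variable and function names are illustrative.

```python
# audio seconds -> (average latency in ms, energy per inference in J),
# copied from the table above
cpu_int8 = {10: (978, 44.6), 20: (1708, 79.8), 60: (5011, 237.7)}
npu_fp32 = {10: (204, 4.2), 20: (615, 7.8), 60: (818, 11.0)}

def ratios(cpu, npu):
    """Return {audio_s: (speedup, energy_ratio)} of CPU relative to NPU."""
    out = {}
    for audio_s, (cpu_ms, cpu_j) in cpu.items():
        npu_ms, npu_j = npu[audio_s]
        out[audio_s] = (cpu_ms / npu_ms, cpu_j / npu_j)
    return out

for audio_s, (speedup, energy) in ratios(cpu_int8, npu_fp32).items():
    print(f"{audio_s:>2} s: {speedup:.1f}x faster, {energy:.1f}x less energy")
# 10 s: 4.8x faster, 10.6x less energy
# 20 s: 2.8x faster, 10.2x less energy
# 60 s: 6.1x faster, 21.6x less energy
```

The 20 s clip lands slightly under 3× (about 2.8×), so "3-6×" is a rounded range.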

I've packaged the whole thing into a Docker image: 👉 ghcr.io/cibernox/wyoming-parakeet-on-intel-npu. If you have a Core Ultra chip and run Home Assistant, you can docker run it and skip everything below.

But if you want the story…

Why bother

Quick context. The home server is a Proxmox 9.x box, Intel Core Ultra 7 265K, 64 GB DDR5, an AMD 7900XTX dedicated GPU, and various LXC + Docker workloads (Home Assistant, llama.cpp on GPU, paperless-ngx, the usual). I'd been running Parakeet TDT on CPU at ~0.5-0.8 s per utterance. Acceptable, but not "instant": it was a downgrade from the RTX 3060 I had been running it on, one I could live with but could also feel.

The CPU baseline is genuinely strong on this chip: Parakeet's INT8 ONNX through ORT-CPU benefits from AVX-VNNI INT8 matmuls and the 265K is beefier than most home servers. So when I say the NPU is 3-6× faster, I'm not comparing it to a low-power N150 mini-PC. This is a 20-core desktop-class CPU at 125 W TDP.

The Intel NPU on Arrow Lake is rated at 13 TOPS. By LLM-accelerator standards that's tiny, and AMD already boasts NPUs with 40 TOPS. But Parakeet's encoder is exactly the kind of work an NPU is designed for: matrix multiplications with predictable shapes and modest activation memory. Worth trying.

Trap #1: Ubuntu's Level Zero loader is too old

At first you'd think "yeah, I just install OpenVINO and the NPU driver, right?" And it almost works. The container detected the NPU device node but reported available_devices: ['CPU']. No NPU.

The reason, after some ZE_ENABLE_LOADER_DEBUG_TRACE=1 archeology:

ZE_LOADER_DEBUG_TRACE: Load Library of libze_intel_vpu.so.1 failed

Ubuntu 24.04's bundled Level Zero loader (libze1 v1.16) is looking for the legacy library name libze_intel_vpu.so.1. I should have figured this out faster than I did: this chip was released in 2025, so it's entirely to be expected that Ubuntu needed some help getting it to work. Recent Intel NPU driver builds install libze_intel_npu.so.1 (different name, same library). The loader needs to be v1.17 or newer to know about the new name.

Fix is straightforward once you know:

RUN curl -fL -O \
    "https://github.com/oneapi-src/level-zero/releases/download/v1.28.6/libze1_1.28.6+u24.04_amd64.deb" \
    && apt-get install -y --no-install-recommends ./libze1*.deb

Now ov.Core().available_devices returns ['CPU', 'NPU'] and the full device name comes back as Intel(R) AI Boost. 🎉

Trap #2: don't try to run the INT8 model on NPU

The model I was already using is INT8 quantized. Natural first move: feed the same ONNX to OpenVINO targeting NPU. It blows up:

[OpenVINO-EP] Output names mismatch between OpenVINO and ONNX

What's happening: the INT8 Parakeet ONNX uses DynamicQuantizeLinear/MatMulInteger/DequantizeLinear chains, and OpenVINO's graph optimizer aggressively folds those into native INT8 matmuls. The folding renames or drops intermediate tensors that the runtime is trying to read back. Hard fail at first inference.

Worse: even if you find a way to coax it through (I tried onnxruntime-openvino, raw OpenVINO with enable_qdq_optimizer, even NNCF post-training quantization), INT8 runs slower than FP32 on this NPU. The Intel NPU is BF16-native — it converts everything to BF16 internally. Feeding it INT8 just means extra dequant/requant on every operator boundary.

The right move is the opposite of what I expected: use the FP32 model. It's 4× bigger on disk (2.5 GB vs 650 MB) but the compiler converts it cleanly to BF16 for the NPU and runs full speed.

NOTE: After all these tests I found that someone has created an FP16 version of Parakeet that is ~1.5 GB. I tried it briefly and it performed much better than INT8, but still 15% slower than FP32. I am not sure why, but if you are RAM-constrained you might prefer that one.

Trap #3: NPUs hate dynamic shapes

The Parakeet encoder accepts dynamic input shape (batch, 128, T) where T is the number of mel-feature frames — proportional to audio length. A 1.5 s "lights off" command is 150 frames; a 60 s dictation is 6 000 frames. ONNX Runtime on CPU handles that natively — every call allocates whatever shape comes in.

Quick aside: what's a "mel-feature frame" you may ask? (It's OK, I didn't know until yesterday) Speech models don't ingest raw audio. The audio is sliced into overlapping ~25 ms windows, each window converted into a 128-element vector of mel-frequency magnitudes (energies at different frequency bands, weighted to match human hearing). Parakeet does this conversion at 100 frames per second. T = audio_seconds × 100. That's the dimension that varies with utterance length.
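To make the frame arithmetic concrete, here is a tiny sketch of the T = audio_seconds × 100 rule and the resulting encoder input shape. The function names are made up for illustration, and the count is approximate: a real feature extractor can differ by a frame or so depending on how it pads the edges of the clip.

```python
FRAMES_PER_SECOND = 100   # Parakeet's mel frame rate, per the text above
N_MELS = 128              # size of each mel-feature vector

def mel_frames(audio_seconds: float) -> int:
    """Approximate number of mel frames T for a clip of the given length."""
    return round(audio_seconds * FRAMES_PER_SECOND)

def encoder_input_shape(audio_seconds: float, batch: int = 1) -> tuple:
    """Shape (batch, 128, T) the encoder would see for this clip."""
    return (batch, N_MELS, mel_frames(audio_seconds))

print(encoder_input_shape(1.5))   # (1, 128, 150)
print(encoder_input_shape(60))    # (1, 128, 6000)
```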

The Intel NPU absolutely does not do dynamic shapes. At least I couldn't find a way. The compiler bakes the tile sizes and memory layout into the compiled blob based on the static input dimensions. Hand it an unbounded dynamic shape and OpenVINO refuses to compile:

[ERROR] Upper bounds are not specified for node '/pre_encode/Cast' (type 'Convert'):
        input '0' bounds are '[9223372036854775807]'

I tried bounded dynamic shapes too (ov.PartialShape([1, 128, ov.Dimension(1, 2000)])) — the bounds don't propagate through every internal op of the Conformer, so the compiler still hits unbounded operands and bails out.

Three options:

1. One static shape (e.g., 20 s): pad every utterance with silence up to 20 s and run the full encoder no matter what. Simple but very wasteful: a 2 s command pays the encoder cost of 20 s.
2. Recompile per request: NPU compile takes ~12 s. Hard no.
3. Multi-bucket dispatch: compile a handful of static shapes ahead of time, cache them, and route each request to the smallest bucket that fits.

Option 3 is the only sane answer unless someone can prove me wrong on allowing dynamic shapes. Since smart home commands are usually rather quick, here are the bucket sizes I settled on for my Spanish smart-home traffic and the NPU encoder time for each:

Bucket	Typical traffic	Encoder time on NPU
5 s	"Apaga la luz de la cocina y la del comedor"	~55 ms
20 s	Voice notes, reminders	~150 ms

Without buckets, every single utterance would pay the 20 s bucket's ~150 ms encoder cost. With the 5 s bucket added, the most common commands now spend only 55 ms on the encoder phase. We could have smaller buckets, and I did try that, but each bucket requires a new compilation step and takes space and memory, so I thought that 2 tiers was granular enough.
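The routing rule itself ("smallest bucket that fits") is small enough to sketch. This is a minimal illustration, not the code from the repo: the bucket list mirrors the table above, the names are hypothetical, and raising an error for audio longer than the largest bucket is an assumption (a real server would need some fallback).

```python
from bisect import bisect_left

FRAMES_PER_SECOND = 100
# Bucket sizes in seconds, taken from the table above; must stay sorted.
BUCKET_SECONDS = [5, 20]
BUCKET_FRAMES = [s * FRAMES_PER_SECOND for s in BUCKET_SECONDS]  # [500, 2000]

def pick_bucket(n_frames: int) -> int:
    """Return the smallest bucket size (in frames) that fits n_frames."""
    i = bisect_left(BUCKET_FRAMES, n_frames)
    if i == len(BUCKET_FRAMES):
        # Assumption: handling over-long audio is left to the caller.
        raise ValueError(f"{n_frames} frames exceeds the largest bucket")
    return BUCKET_FRAMES[i]

print(pick_bucket(150))    # 500  (a 1.5 s command uses the 5 s bucket)
print(pick_bucket(1200))   # 2000 (a 12 s voice note uses the 20 s bucket)
```

Keeping the list sorted and using a binary search means adding a third or fourth bucket later changes only the list, not the dispatch logic.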

Trap #4: the false start that wasted a whole afternoon

For most of this investigation I was getting "NPU" and "CPU" timings within noise of each other and was about to declare the NPU not worth it.

Turned out my integration shim was being attached to the wrong attribute on the loaded onnx-asr model.

onnx_asr.load_model() returns a TextResultsAsrAdapter that wraps the actual ASR object on .asr. The wrapper does NOT proxy attribute writes. So this:

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", ...)
model._encoder = OpenVINOEncoderShim(...)  # ← attribute added to wrapper, ignored

…just adds an attribute to the wrapper that nothing reads. model.recognize() still routes through model.asr._encoder, which is the original ORT-CPU session. Every "NPU" benchmark I had been running was secretly plain ORT-CPU with an extra unused NPU encoder warming up uselessly in memory.

One-line fix:

model.asr._encoder = OpenVINOEncoderShim(...)  # ← actually used
model.asr._decoder_joint = OpenVINODecoderShim(...)

Once corrected, the real numbers landed where the silicon could deliver them. Lesson: when integrating with someone else's pipeline, add a tracer that confirms your code is actually being called before you trust any benchmark. This was on me.
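A toy reproduction of the mistake, with a call-counting tracer of the kind that lesson suggests. The classes below are hypothetical stand-ins, not onnx-asr's real ones; they only mimic the "wrapper holds the real object on .asr" layout described above.

```python
class Asr:
    """Stand-in for the real ASR object; it owns the encoder it calls."""
    def __init__(self, encoder):
        self._encoder = encoder
    def recognize(self, audio):
        return self._encoder(audio)

class Wrapper:
    """Stand-in for an adapter that forwards recognize() to .asr."""
    def __init__(self, asr):
        self.asr = asr
    def recognize(self, audio):
        return self.asr.recognize(audio)

class TracingShim:
    """Wraps a callable and counts how often it is actually invoked."""
    def __init__(self, fn):
        self.fn, self.calls = fn, 0
    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.fn(*args, **kwargs)

model = Wrapper(Asr(encoder=lambda audio: "cpu"))

wrong = TracingShim(lambda audio: "npu")
model._encoder = wrong          # lands on the wrapper, nothing reads it
assert model.recognize(b"") == "cpu" and wrong.calls == 0

right = TracingShim(lambda audio: "npu")
model.asr._encoder = right      # replaces the encoder that is really used
assert model.recognize(b"") == "npu" and right.calls == 1
```

Checking that the call counter is non-zero after one test transcription is a cheap guard to run before any benchmark.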

The code that works

Download the FP32 encoder (encoder-model.onnx + 2.4 GB external data) and FP32 decoder from the istupakov HF repo.
For each bucket size, reshape the encoder to a static T and compile for NPU:

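import openvino as ov

T_fixed = 2000  # frames for this bucket, e.g. 20 s × 100 frames/s

core = ov.Core()
model = core.read_model("encoder-model.onnx")
# Pin the dynamic time axis so the NPU compiler sees fully static shapes
model.reshape({"audio_signal": [1, 128, T_fixed], "length": [1]})
compiled = core.compile_model(model, "NPU", config={
    "CACHE_DIR": "/data/ov_cache",  # reuse compiled blobs across restarts
    "PERFORMANCE_HINT": "LATENCY",
    "NPU_TURBO": "YES",
})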

Same idea for the decoder/joint, but only ONE bucket — it's called per-token with fixed shapes regardless of audio length.
At inference time, pick the smallest bucket whose T_fixed ≥ the actual mel-frame count. Zero-pad to that bucket's length, pass length=actual so the encoder knows where real audio ends.
Plug the NPU-compiled encoder + decoder into onnx_asr by assigning to model.asr._encoder and model.asr._decoder_joint (NOT model._encoder — see Trap #4).
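The zero-padding step can be sketched like this. It is a dependency-free illustration using nested lists for a [128][T] feature matrix; real code would more likely hold the features in NumPy arrays, and the function name is hypothetical.

```python
def pad_to_bucket(features, t_fixed):
    """Zero-pad a [n_mels][T] feature matrix along time to t_fixed frames.

    Returns (padded, actual_length). actual_length is the value to pass as
    the encoder's `length` input so it knows where the real audio ends.
    """
    actual = len(features[0])
    if actual > t_fixed:
        raise ValueError(f"{actual} frames do not fit a {t_fixed}-frame bucket")
    padded = [row + [0.0] * (t_fixed - actual) for row in features]
    return padded, actual

features = [[0.5] * 150 for _ in range(128)]    # fake 1.5 s clip
padded, length = pad_to_bucket(features, t_fixed=500)
print(len(padded), len(padded[0]), length)      # 128 500 150
```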

NPU compile time is ~12 s per bucket cold, ~1 s when the CACHE_DIR blob hits. First container start is ~80 s with all buckets; subsequent restarts are fast because everything is cached.

Things I tried that didn't help

So you don't have to:

onnxruntime-openvino with the INT8 model → output-name mismatch bug
NNCF post-training INT8 quantization → still 17% slower than FP32 on this NPU, but not bad at all considering it saves 75% of the RAM. If RAM is tight, this approach is for you, and the quality degradation is very low.
FP16 model (from the grikdotnet repo) → marginally slower than FP32 because of FP32-I/O Cast ops the converter inserts. Saves ~50% RAM though, which is nice. I have plenty, so I didn't bother, but it might be an easy saving, and the quality loss from FP32 to FP16 should be negligible.
Async ping-pong with two InferRequests on the decoder → TDT decoder is auto-regressive, nothing to overlap
INFERENCE_PRECISION_HINT=f16 → no measurable effect; compiler was already running BF16 internally
MODEL_PRIORITY=HIGH → compile-time only, no runtime effect
Bounded dynamic shapes → bounds don't propagate through the Conformer ops, compiler still bails

Benchmarking for smart home commands

Voice commands arrive sporadically: a few seconds of speech after several minutes of silence. The relevant metric isn't steady-state throughput transcribing a 90-minute podcast, it's single-shot cold-after-idle latency, because the CPU's caches/clocks are cold and the NPU might be in a low-power state.
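A rough sketch of how the two kinds of measurement differ. This is not the script behind the numbers in this post; the helper names are invented, and the idle duration is whatever you consider "idle" on your own box.

```python
import time
from statistics import mean

def time_once_ms(fn) -> float:
    """Wall-clock duration of a single call, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0

def warm_latency_ms(fn, runs: int = 10, warmup: int = 1) -> float:
    """Average latency over `runs` calls after `warmup` discarded calls."""
    for _ in range(warmup):
        fn()
    return mean(time_once_ms(fn) for _ in range(runs))

def cold_latency_ms(fn, idle_seconds: float) -> float:
    """Latency of one call made after sitting idle for idle_seconds."""
    time.sleep(idle_seconds)
    return time_once_ms(fn)

# Usage idea (transcribe and clip are placeholders for your own pipeline):
#   cold_latency_ms(lambda: transcribe(clip), idle_seconds=300)
```

The warm variant matches the "10 runs after 1 warmup" setup used for the TL;DR table; the cold variant deliberately has no warmup, since the whole point is to catch any wake-up cost.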

I run my home server with aggressive power-saving (deep C states, PCIe sleep — my AMD 7900XTX idles at 4 W). "Idle" wall power is around 32-38 W (as idle as a server running 20 containers can be). I was worried these would punish cold inference. They don't.

The NPU has no observable wake-up penalty. Cold-after-idle:

Audio	CPU INT8	NPU FP32
10 s	918 ms	276 ms
20 s	1 628 ms	693 ms
60 s	4 756 ms	884 ms

Real Home Assistant trace for "apaga la luz de la cocina y la del comedor" ("turn off the kitchen and dining-room lights", which is a longer-than-average sentence): CPU 0.71 s vs NPU 0.18 s, identical transcript.

What this means in absolute terms

The result that genuinely surprised me: this 13-TOPS NPU running Parakeet ends up as fast as or faster than the same model running on an NVIDIA RTX 3060 (~13 TFLOPS on FP16), which I had been using on my previous server as an eGPU. The RTX did 0.15-0.3 s per utterance. The NPU does 0.1-0.2 s. Same ballpark, and:

The NPU pulls ~13 W during transcription
The RTX 3060 pulled ~170 W active, ~15 W idle

The NPU's active power is lower than the RTX's idle power. While transcribing, that's roughly a 13× difference in power draw (~170 W vs ~13 W), and on a workload that's mostly idle anyway, the RTX's ~15 W idle draw was being paid around the clock.

For 13 TOPS, that's a remarkable use of silicon. The "NPUs are marketing" take is wrong for at least this workload.

Now, I am not claiming that the NPU is more powerful than a 3060. It clearly isn't. But I suspect it's able to match or beat it because (and this is just a theory) it wakes up faster than a discrete GPU, and for a short burst of work like this, that gives it a head start that the NVIDIA card wasn't able to overcome. I'm sure that for commands longer than 10 seconds the GPU would win, but those are very rare.

Try it

I packaged everything into a public Docker image. If you have:

An Intel Core Ultra processor (Meteor Lake / Arrow Lake / Lunar Lake)
/dev/accel/accel0 on your host (lsmod | grep intel_vpu to verify)
Home Assistant or any other Wyoming-protocol client

You can do this:

docker run -d \
  --name wyoming-parakeet-npu \
  --device /dev/accel/accel0 \
  -e LANGUAGE=es \
  -p 10300:10300 \
  -v parakeet-data:/data \
  --restart unless-stopped \
  ghcr.io/cibernox/wyoming-parakeet-on-intel-npu:latest

First boot downloads ~3.2 GB of model weights and compiles the NPU buckets (~60-90 s). Subsequent restarts are under 5 s. Point Home Assistant's Wyoming integration at tcp://<host>:10300 and you're done.

Repo with source, Dockerfile, docs and a docker-compose.yml example: github.com/cibernox/wyoming-parakeet-on-intel-npu.

What I'd love help with

If you're playing with this, things I haven't done yet that I think could move the needle:

VAD gate before the encoder — most wake-word false-positives carry a fraction of a second of speech then silence. Cheap RMS-based VAD on the host could avoid invoking the encoder entirely for those. Probably the single biggest aggregate energy saver in a real smart home.
Lazy bucket loading + LRU eviction — I keep multiple buckets resident, but each compiled blob takes ~1.5 GB of RAM. An LRU policy would let you compile many buckets but only keep N hot. (The repo already has a basic "lazy load one large bucket" mode; full LRU would be the next step.)
Investigating the TDT decoder's Python overhead — even with the decoder itself running at ~1 ms per call on NPU, the surrounding loop in onnx_asr (numpy state-handling, control flow) accounts for a meaningful fraction of total time on long audio.
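For the VAD gate idea, a crude energy-based check could look like the sketch below. Everything numeric here is a placeholder assumption (16 kHz 16-bit PCM, 30 ms frames, the RMS threshold, the 300 ms minimum), and a plain RMS gate will be fooled by loud background noise, so treat it as a starting point rather than a tested design.

```python
import math

def rms(samples) -> float:
    """Root-mean-square amplitude of a sequence of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def has_enough_speech(samples, sample_rate=16000, frame_ms=30,
                      rms_threshold=500.0, min_speech_ms=300) -> bool:
    """True if enough frames are loud enough to be worth transcribing."""
    frame_len = sample_rate * frame_ms // 1000
    loud_frames = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if rms(samples[start:start + frame_len]) >= rms_threshold:
            loud_frames += 1
    return loud_frames * frame_ms >= min_speech_ms

silence = [0] * 16000                 # 1 s of digital silence
tone = [3000, -3000] * 8000           # 1 s of loud signal
print(has_enough_speech(silence), has_enough_speech(tone))  # False True
```

If the gate returns False, the server could reply with an empty transcript and never touch the encoder.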

PRs welcome.

Acknowledgements

This work stands on top of several open-source projects, all of which made this hack possible:

tboby/wyoming-onnx-asr — the Wyoming protocol server I forked from
istupakov/onnx-asr — the ASR pipeline library
istupakov/parakeet-tdt-0.6b-v3-onnx — Parakeet's ONNX export
openvinotoolkit/openvino — Intel's inference runtime
intel/linux-npu-driver — NPU userspace driver
amd/RyzenAI-SW Parakeet-TDT demo — proved the same approach works on a competing NPU; gave me the static-reshape recipe

Thanks to all of them for shipping working code.

DEV Community: Miguel Camba