<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Miguel Camba</title>
    <description>The latest articles on DEV Community by Miguel Camba (@cibernox).</description>
    <link>https://dev.to/cibernox</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3951114%2F66b5de1b-8ca2-4945-81ee-35d69146aaa9.jpeg</url>
      <title>DEV Community: Miguel Camba</title>
      <link>https://dev.to/cibernox</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cibernox"/>
    <language>en</language>
    <item>
      <title>Running ASR for smart homes in the NPU of Intel processors</title>
      <dc:creator>Miguel Camba</dc:creator>
      <pubDate>Mon, 25 May 2026 19:50:26 +0000</pubDate>
      <link>https://dev.to/cibernox/running-asr-for-smart-homes-in-the-npu-of-intel-processors-2iec</link>
      <guid>https://dev.to/cibernox/running-asr-for-smart-homes-in-the-npu-of-intel-processors-2iec</guid>
      <description>&lt;p&gt;I run my own smart home: Home Assistant, voice assistant pipeline, the whole self-hosted thing. The speech-to-text step (Parakeet TDT 0.6B v3 over the &lt;a href="https://github.com/rhasspy/wyoming" rel="noopener noreferrer"&gt;Wyoming protocol&lt;/a&gt;) had been running for months on my Intel NUC (i3-1220P) with a 12 GB RTX 3060 eGPU. I recently upgraded my home server to a full desktop with an AMD 7900XTX, and since I want to save as much of the VRAM as I can for LLMs, I've been running NVIDIA's Parakeet on the CPU since then. &lt;br&gt;
It works fine, but one thing always nagged me: my new home server has an Intel Core Ultra 7 265K (Arrow Lake) with the built-in "AI Boost" NPU, and that silicon was sitting completely idle.&lt;/p&gt;

&lt;p&gt;With the AI hype, chip manufacturers have started to slap NPUs on their chips, mostly so they can put "AI" in the product names, but little to no software actually makes use of them, although some projects are starting to pop up here and there.&lt;/p&gt;

&lt;p&gt;So I decided to actually find out whether I could put that stupidly underused chunk of silicon to work on a workload that should, on paper, be ideal for it.&lt;/p&gt;

&lt;p&gt;And it worked remarkably well, but the road was bumpy.&lt;/p&gt;


&lt;h2&gt;
  
  
  TL;DR — the result, you came here for this table.
&lt;/h2&gt;

&lt;p&gt;Same Spanish audio clips, similar wyoming-onnx-asr stack, but I swapped the inference backend from plain &lt;a href="https://onnxruntime.ai/" rel="noopener noreferrer"&gt;ORT-CPU&lt;/a&gt; to OpenVINO targeting the NPU, and I went from the INT8 quantized model on the CPU to the full-precision FP32 model on the NPU.&lt;/p&gt;

&lt;p&gt;Results averaged from 10 runs after 1 warmup round.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Audio&lt;/th&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Avg latency&lt;/th&gt;
&lt;th&gt;Energy / inference&lt;/th&gt;
&lt;th&gt;Power above idle&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 s&lt;/td&gt;
&lt;td&gt;CPU INT8&lt;/td&gt;
&lt;td&gt;978 ms&lt;/td&gt;
&lt;td&gt;44.6 J&lt;/td&gt;
&lt;td&gt;45.6 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NPU FP32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;204 ms&lt;/strong&gt; ⚡&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.2 J&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20.5 W&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20 s&lt;/td&gt;
&lt;td&gt;CPU INT8&lt;/td&gt;
&lt;td&gt;1 708 ms&lt;/td&gt;
&lt;td&gt;79.8 J&lt;/td&gt;
&lt;td&gt;46.7 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NPU FP32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;615 ms&lt;/strong&gt; ⚡&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.8 J&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12.7 W&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60 s&lt;/td&gt;
&lt;td&gt;CPU INT8&lt;/td&gt;
&lt;td&gt;5 011 ms&lt;/td&gt;
&lt;td&gt;237.7 J&lt;/td&gt;
&lt;td&gt;47.4 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NPU FP32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;818 ms&lt;/strong&gt; ⚡&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.0 J&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.4 W&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;3-6× faster wall time. 10-22× less energy per transcription.&lt;/strong&gt; For a workload that runs quite often in my home (I have 5 satellites and I don't reach for switches often), this is the kind of result that makes me wonder why nobody seems to be doing it. &lt;br&gt;
For a nice voice assistant, response speed is a critical part of the experience. It's not that an extra 500 ms makes for a terrible experience, but every little bit you save does improve it.&lt;/p&gt;
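
&lt;p&gt;If you want to check the headline ratios yourself, they follow from the table with a few lines of arithmetic. A minimal sketch, assuming energy per inference is simply average power above idle multiplied by average latency (the helper names are made up for illustration; the numbers are copied from the table):&lt;/p&gt;

```python
# (audio, cpu_ms, cpu_watts, npu_ms, npu_watts), copied from the table above
rows = [
    ("10 s", 978, 45.6, 204, 20.5),
    ("20 s", 1708, 46.7, 615, 12.7),
    ("60 s", 5011, 47.4, 818, 13.4),
]


def joules(latency_ms, watts):
    """Energy per inference: power above idle times wall time."""
    return watts * latency_ms / 1000.0


def summarize(row):
    """Return (audio, speedup, energy_ratio) for one table row."""
    audio, cpu_ms, cpu_w, npu_ms, npu_w = row
    speedup = cpu_ms / npu_ms
    energy_ratio = joules(cpu_ms, cpu_w) / joules(npu_ms, npu_w)
    return audio, round(speedup, 1), round(energy_ratio, 1)


for audio, speedup, energy_ratio in map(summarize, rows):
    print(f"{audio}: {speedup}x faster, {energy_ratio}x less energy")
```

&lt;p&gt;By this arithmetic the 20 s clip lands at about 2.8× faster, so read "3-6×" as a rounded range.&lt;/p&gt;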

&lt;p&gt;I've packaged the whole thing into a Docker image: 👉 &lt;a href="https://github.com/cibernox/wyoming-parakeet-on-intel-npu" rel="noopener noreferrer"&gt;&lt;code&gt;ghcr.io/cibernox/wyoming-parakeet-on-intel-npu&lt;/code&gt;&lt;/a&gt;. If you have a Core Ultra chip and use Home Assistant, you can &lt;code&gt;docker run&lt;/code&gt; it and skip everything below.&lt;/p&gt;

&lt;p&gt;But if you want the story…&lt;/p&gt;


&lt;h2&gt;
  
  
  Why bother
&lt;/h2&gt;

&lt;p&gt;Quick context. The home server is a Proxmox 9.x box with an Intel Core Ultra 7 265K, 64 GB DDR5, an AMD 7900XTX dedicated GPU, and various LXC + Docker workloads (Home Assistant, llama.cpp on GPU, paperless-ngx, the usual). I'd been running Parakeet TDT on CPU at ~0.5-0.8 s per utterance. Acceptable, but not "instant": it was a downgrade from running it on my RTX 3060, one I could live with but could also feel.&lt;/p&gt;

&lt;p&gt;The CPU baseline is genuinely strong on this chip: Parakeet's INT8 ONNX through ORT-CPU benefits from AVX-VNNI INT8 matmuls, and the 265K is beefier than most home servers. So when I say the NPU is 3-6× faster, I'm not comparing it to a low-power N150 mini-PC. This is a 20-core desktop-class CPU at 125 W TDP.&lt;/p&gt;

&lt;p&gt;The Intel NPU on Arrow Lake is rated at &lt;strong&gt;13 TOPS&lt;/strong&gt;. By LLM-accelerator standards that's tiny, and AMD already boasts NPUs with 40 TOPS. But Parakeet's encoder is exactly the kind of work an NPU is designed for: matrix multiplications with predictable shapes and modest activation memory. Worth trying.&lt;/p&gt;


&lt;h2&gt;
  
  
  Trap #1: Ubuntu's Level Zero loader is too old
&lt;/h2&gt;

&lt;p&gt;At first you'd think, "yeah, I just install OpenVINO and the NPU driver, right?" And it almost works. The container detected the NPU device node, but OpenVINO reported &lt;code&gt;available_devices: ['CPU']&lt;/code&gt;. No NPU.&lt;/p&gt;

&lt;p&gt;The reason, after some &lt;code&gt;ZE_ENABLE_LOADER_DEBUG_TRACE=1&lt;/code&gt; archeology:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ZE_LOADER_DEBUG_TRACE: Load Library of libze_intel_vpu.so.1 failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ubuntu 24.04's bundled Level Zero loader (&lt;code&gt;libze1&lt;/code&gt; v1.16) is looking for the legacy library name &lt;code&gt;libze_intel_vpu.so.1&lt;/code&gt;, while recent Intel NPU driver builds install &lt;code&gt;libze_intel_npu.so.1&lt;/code&gt;: different name, same library. The loader needs to be v1.17 or newer to know about the new name. I should have figured this out faster than I did: this chip is newer than Ubuntu 24.04, so it's totally to be expected that Ubuntu needed some help getting it to work.&lt;/p&gt;

&lt;p&gt;Fix is straightforward once you know:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;curl &lt;span class="nt"&gt;-fL&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s2"&gt;"https://github.com/oneapi-src/level-zero/releases/download/v1.28.6/libze1_1.28.6+u24.04_amd64.deb"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; ./libze1&lt;span class="k"&gt;*&lt;/span&gt;.deb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;ov.Core().available_devices&lt;/code&gt; returns &lt;code&gt;['CPU', 'NPU']&lt;/code&gt; and the full device name comes back as &lt;code&gt;Intel(R) AI Boost&lt;/code&gt;. 🎉&lt;/p&gt;
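
&lt;p&gt;A quick way to confirm the fix from inside the container is a check along these lines. It's a sketch rather than something shipped in the image: the helper names are invented for illustration, and it relies only on &lt;code&gt;ov.Core().available_devices&lt;/code&gt; and the &lt;code&gt;FULL_DEVICE_NAME&lt;/code&gt; device property.&lt;/p&gt;

```python
def npu_visible(devices):
    """True when the device list reported by OpenVINO includes an NPU."""
    return "NPU" in devices


def probe():
    """Return (devices, npu_name); ([], None) if OpenVINO is not installed."""
    try:
        import openvino as ov
    except ImportError:
        return [], None
    core = ov.Core()
    devices = list(core.available_devices)
    name = None
    if npu_visible(devices):
        name = core.get_property("NPU", "FULL_DEVICE_NAME")
    return devices, name


if __name__ == "__main__":
    devices, name = probe()
    print("devices:", devices)
    print("NPU:", name or "not visible (check the Level Zero loader version)")
```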




&lt;h2&gt;
  
  
  Trap #2: don't try to run the INT8 model on NPU
&lt;/h2&gt;

&lt;p&gt;The model I was already using is INT8 quantized. Natural first move: feed the same ONNX to OpenVINO targeting NPU. It blows up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[OpenVINO-EP] Output names mismatch between OpenVINO and ONNX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's happening: the INT8 Parakeet ONNX uses &lt;code&gt;DynamicQuantizeLinear&lt;/code&gt;/&lt;code&gt;MatMulInteger&lt;/code&gt;/&lt;code&gt;DequantizeLinear&lt;/code&gt; chains, and OpenVINO's graph optimizer aggressively folds those into native INT8 matmuls. The folding renames or drops intermediate tensors that the runtime is trying to read back. Hard fail at first inference.&lt;/p&gt;

&lt;p&gt;Worse: even if you find a way to coax it through (I tried &lt;code&gt;onnxruntime-openvino&lt;/code&gt;, raw OpenVINO with &lt;code&gt;enable_qdq_optimizer&lt;/code&gt;, even NNCF post-training quantization), &lt;strong&gt;INT8 runs &lt;em&gt;slower&lt;/em&gt; than FP32 on this NPU&lt;/strong&gt;. The Intel NPU is BF16-native — it converts everything to BF16 internally. Feeding it INT8 just means extra dequant/requant on every operator boundary.&lt;/p&gt;

&lt;p&gt;The right move is the opposite of what I expected: &lt;strong&gt;use the FP32 model&lt;/strong&gt;. It's roughly 4× bigger on disk (2.5 GB vs 650 MB), but the compiler converts it cleanly to BF16 for the NPU and it runs at full speed.&lt;/p&gt;

&lt;p&gt;NOTE: After all these tests I found that someone has created an FP16 version of Parakeet that is ~1.5 GB. I tried it briefly and it performed much better than INT8, but still about 15% slower than FP32. I am not sure why, but if you are RAM-constrained you might prefer that one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trap #3: NPUs hate dynamic shapes
&lt;/h2&gt;

&lt;p&gt;The Parakeet encoder accepts dynamic input shape &lt;code&gt;(batch, 128, T)&lt;/code&gt; where &lt;code&gt;T&lt;/code&gt; is the number of mel-feature frames — proportional to audio length. A 1.5 s "lights off" command is 150 frames; a 60 s dictation is 6 000 frames. ONNX Runtime on CPU handles that natively — every call allocates whatever shape comes in.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick aside: what's a "mel-feature frame" you may ask?&lt;/strong&gt; (It's OK, I didn't know until yesterday) Speech models don't ingest raw audio. The audio is sliced into overlapping ~25 ms windows, each window converted into a 128-element vector of mel-frequency magnitudes (energies at different frequency bands, weighted to match human hearing). Parakeet does this conversion at 100 frames per second. &lt;code&gt;T&lt;/code&gt; = &lt;code&gt;audio_seconds × 100&lt;/code&gt;. That's the dimension that varies with utterance length.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Intel NPU absolutely does not do dynamic shapes. At least I couldn't find a way. The compiler bakes the tile sizes and memory layout into the compiled blob based on the static input dimensions. Hand it an unbounded dynamic shape and OpenVINO refuses to compile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] Upper bounds are not specified for node '/pre_encode/Cast' (type 'Convert'):
        input '0' bounds are '[9223372036854775807]'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tried bounded dynamic shapes too (&lt;code&gt;ov.PartialShape([1, 128, ov.Dimension(1, 2000)])&lt;/code&gt;) — the bounds don't propagate through every internal op of the Conformer, so the compiler still hits unbounded operands and bails out.&lt;/p&gt;

&lt;p&gt;Three options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One static shape (e.g., 20 s)&lt;/strong&gt;: pad every utterance with silence up to 20 s and run the full encoder no matter what. Simple but very wasteful: a 2 s command pays the encoder cost of 20 s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recompile per request&lt;/strong&gt;: NPU compile takes ~12 s. Hard no.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-bucket dispatch&lt;/strong&gt;: compile a handful of static shapes ahead of time, cache them, and route each request to the smallest bucket that fits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 3 is the only sane answer unless someone can prove me wrong on allowing dynamic shapes. Since smart home commands are usually rather quick, here are the bucket sizes I settled on for my Spanish smart-home traffic and the NPU encoder time for each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bucket&lt;/th&gt;
&lt;th&gt;Typical traffic&lt;/th&gt;
&lt;th&gt;Encoder time on NPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Apaga la luz de la cocina y la del comedor"&lt;/td&gt;
&lt;td&gt;~55 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20 s&lt;/td&gt;
&lt;td&gt;Voice notes, reminders&lt;/td&gt;
&lt;td&gt;~150 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With a single 20 s bucket, every utterance would pay the ~150 ms encoder cost. With the 5 s bucket added, the most common commands now spend only ~55 ms in the encoder phase. We could have smaller buckets, and I did try that, but each bucket requires another compilation step and takes disk space and memory, so I thought that 2 tiers was granular enough.&lt;/p&gt;
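
&lt;p&gt;The routing itself is tiny. A minimal sketch, assuming the 100 frames-per-second rate from the aside above and the two tiers from the table (the function names are illustrative, and what to do with audio longer than the largest bucket is left to the caller):&lt;/p&gt;

```python
from bisect import bisect_left

FRAMES_PER_SECOND = 100      # Parakeet's mel-frame rate, per the aside above
BUCKET_SECONDS = (5, 20)     # the two tiers from the table, sorted ascending
BUCKET_FRAMES = tuple(s * FRAMES_PER_SECOND for s in BUCKET_SECONDS)


def mel_frames(audio_seconds):
    """Approximate T for a clip: seconds times 100 frames per second."""
    return int(round(audio_seconds * FRAMES_PER_SECOND))


def pick_bucket(n_frames, buckets=BUCKET_FRAMES):
    """Smallest bucket with T_fixed >= n_frames, or None if nothing fits."""
    i = bisect_left(buckets, n_frames)   # first index whose bucket >= n_frames
    return buckets[i] if i != len(buckets) else None


for seconds in (1.5, 5.0, 12.0, 60.0):
    print(seconds, "s ->", pick_bucket(mel_frames(seconds)))
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result is the cue to fall back to something else, for example a larger bucket loaded on demand.&lt;/p&gt;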




&lt;h2&gt;
  
  
  Trap #4: the false start that wasted a whole afternoon
&lt;/h2&gt;

&lt;p&gt;For most of this investigation I was getting "NPU" and "CPU" timings within noise of each other and was about to declare the NPU not worth it.&lt;/p&gt;

&lt;p&gt;Turned out my integration shim was being attached to the &lt;strong&gt;wrong attribute&lt;/strong&gt; on the loaded onnx-asr model.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;onnx_asr.load_model()&lt;/code&gt; returns a &lt;code&gt;TextResultsAsrAdapter&lt;/code&gt; that wraps the actual ASR object on &lt;code&gt;.asr&lt;/code&gt;. The wrapper does NOT proxy attribute writes. So this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;onnx_asr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nemo-parakeet-tdt-0.6b-v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenVINOEncoderShim&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;# ← attribute added to wrapper, ignored
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…just adds an attribute to the wrapper that nothing reads. &lt;code&gt;model.recognize()&lt;/code&gt; still routes through &lt;code&gt;model.asr._encoder&lt;/code&gt;, which is the original ORT-CPU session. Every "NPU" benchmark I had been running was secretly plain ORT-CPU with an extra unused NPU encoder warming up uselessly in memory.&lt;/p&gt;

&lt;p&gt;One-line fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenVINOEncoderShim&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;# ← actually used
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_decoder_joint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenVINODecoderShim&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once corrected, the real numbers landed where the silicon could deliver them. Lesson: when integrating with someone else's pipeline, &lt;strong&gt;add a tracer that confirms your code is actually being called&lt;/strong&gt; before you trust any benchmark. This was on me.&lt;/p&gt;
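
&lt;p&gt;Such a tracer can be as small as a call counter. The sketch below reproduces the pitfall with hypothetical stand-in classes (&lt;code&gt;InnerAsr&lt;/code&gt; and &lt;code&gt;Adapter&lt;/code&gt; are not the real &lt;code&gt;onnx_asr&lt;/code&gt; classes, they only mimic the wrapper-around-&lt;code&gt;.asr&lt;/code&gt; layout described above):&lt;/p&gt;

```python
class Traced:
    """Wraps a callable and counts how often it is actually invoked."""

    def __init__(self, fn, label):
        self.fn = fn
        self.label = label
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.fn(*args, **kwargs)


# Hypothetical stand-ins, only to show the pitfall.
class InnerAsr:
    def __init__(self):
        self._encoder = lambda x: ("cpu", x)

    def recognize(self, x):
        return self._encoder(x)


class Adapter:
    def __init__(self, asr):
        self.asr = asr

    def recognize(self, x):
        return self.asr.recognize(x)


model = Adapter(InnerAsr())
shim = Traced(lambda x: ("npu", x), "npu-encoder")

model._encoder = shim                 # wrong: lands on the adapter, never read
wrong_result = model.recognize("audio")
calls_after_wrong = shim.calls

model.asr._encoder = shim             # right: replaces what recognize() uses
right_result = model.recognize("audio")

print(shim.label, "calls:", calls_after_wrong, "then", shim.calls)
print(wrong_result, right_result)
```

&lt;p&gt;If the counter is still zero after a transcription, the benchmark is measuring something else.&lt;/p&gt;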




&lt;h2&gt;
  
  
  The code that works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Download the &lt;strong&gt;FP32&lt;/strong&gt; encoder (&lt;code&gt;encoder-model.onnx&lt;/code&gt; + 2.4 GB external data) and FP32 decoder from the &lt;a href="https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx" rel="noopener noreferrer"&gt;istupakov HF repo&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For each bucket size, reshape the encoder to a static &lt;code&gt;T&lt;/code&gt; and compile for NPU:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openvino&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ov&lt;/span&gt;
&lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ov&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Core&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;encoder-model.onnx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio_signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T_fixed&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="n"&gt;compiled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CACHE_DIR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/ov_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PERFORMANCE_HINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LATENCY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU_TURBO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Same idea for the decoder/joint, but only ONE bucket — it's called per-token with fixed shapes regardless of audio length.&lt;/li&gt;
&lt;li&gt;At inference time, pick the smallest bucket whose &lt;code&gt;T_fixed&lt;/code&gt; ≥ the actual mel-frame count. Zero-pad to that bucket's length, pass &lt;code&gt;length=actual&lt;/code&gt; so the encoder knows where real audio ends.&lt;/li&gt;
&lt;li&gt;Plug the NPU-compiled encoder + decoder into &lt;code&gt;onnx_asr&lt;/code&gt; by assigning to &lt;code&gt;model.asr._encoder&lt;/code&gt; and &lt;code&gt;model.asr._decoder_joint&lt;/code&gt; (NOT &lt;code&gt;model._encoder&lt;/code&gt; — see Trap #4).&lt;/li&gt;
&lt;/ol&gt;
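
&lt;p&gt;The zero-padding step can be sketched as follows. Plain Python lists stand in for the NumPy arrays a real pipeline would pass, the input names &lt;code&gt;audio_signal&lt;/code&gt; and &lt;code&gt;length&lt;/code&gt; are the ones used in the reshape call above, and &lt;code&gt;pad_to_bucket&lt;/code&gt; is a made-up helper name:&lt;/p&gt;

```python
def pad_to_bucket(features, t_fixed, pad_value=0.0):
    """Right-pad each mel row to t_fixed frames; return (padded, actual_length)."""
    actual = len(features[0])
    if actual > t_fixed:
        raise ValueError(f"{actual} frames do not fit a {t_fixed}-frame bucket")
    padded = [list(row) + [pad_value] * (t_fixed - actual) for row in features]
    return padded, actual


# 128 mel bins x 150 frames, roughly a 1.5 s command at 100 frames per second
features = [[0.1] * 150 for _ in range(128)]
padded, actual = pad_to_bucket(features, t_fixed=500)

# Batch of one: the padded features plus the count of real (unpadded) frames
inputs = {"audio_signal": [padded], "length": [actual]}
print(len(padded), len(padded[0]), inputs["length"])
```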

&lt;p&gt;NPU compile time is ~12 s per bucket cold, ~1 s when the &lt;code&gt;CACHE_DIR&lt;/code&gt; blob hits. First container start is ~80 s with all buckets; subsequent restarts are fast because everything is cached.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things I tried that didn't help
&lt;/h2&gt;

&lt;p&gt;So you don't have to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;onnxruntime-openvino&lt;/code&gt; with the INT8 model&lt;/strong&gt; → output-name mismatch bug&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NNCF post-training INT8 quantization&lt;/strong&gt; → still 17% slower than FP32 on this NPU, but not bad at all considering it saves 75% of the RAM. If RAM is tight, this approach is for you, and the quality degradation is very low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 model&lt;/strong&gt; (from the &lt;a href="https://huggingface.co/grikdotnet/parakeet-tdt-0.6b-fp16" rel="noopener noreferrer"&gt;grikdotnet repo&lt;/a&gt;) → marginally slower than FP32 because of FP32-I/O Cast ops the converter inserts. Saves ~50% RAM though, which is nice. I have plenty, so I didn't bother, but it might be an easy saving, and the quality loss from FP32 to FP16 should be negligible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async ping-pong with two &lt;code&gt;InferRequest&lt;/code&gt;s on the decoder&lt;/strong&gt; → TDT decoder is auto-regressive, nothing to overlap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INFERENCE_PRECISION_HINT=f16&lt;/code&gt;&lt;/strong&gt; → no measurable effect; compiler was already running BF16 internally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MODEL_PRIORITY=HIGH&lt;/code&gt;&lt;/strong&gt; → compile-time only, no runtime effect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounded dynamic shapes&lt;/strong&gt; → bounds don't propagate through the Conformer ops, compiler still bails&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Benchmarking for smart home commands
&lt;/h2&gt;

&lt;p&gt;Voice commands arrive sporadically: a few seconds of speech after several minutes of silence. The relevant metric isn't steady-state throughput while transcribing a 90-minute podcast, it's &lt;strong&gt;single-shot cold-after-idle latency&lt;/strong&gt;, because the CPU's caches and clocks are cold and the NPU might be in a low-power state.&lt;/p&gt;

&lt;p&gt;I run my home server with aggressive power-saving (deep C states, PCIe sleep — my AMD 7900XTX idles at 4 W). "Idle" wall power is around 32-38 W (as idle as a server running 20 containers can be). I was worried these would punish cold inference. They don't.&lt;/p&gt;

&lt;p&gt;The NPU has &lt;strong&gt;no observable wake-up penalty&lt;/strong&gt;. Cold-after-idle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Audio&lt;/th&gt;
&lt;th&gt;CPU INT8&lt;/th&gt;
&lt;th&gt;NPU FP32&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 s&lt;/td&gt;
&lt;td&gt;918 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;276 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20 s&lt;/td&gt;
&lt;td&gt;1 628 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;693 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60 s&lt;/td&gt;
&lt;td&gt;4 756 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;884 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Real Home Assistant trace for &lt;em&gt;"apaga la luz de la cocina y la del comedor"&lt;/em&gt; (&lt;em&gt;turn off the kitchen and dining-room lights&lt;/em&gt;, which is a longer-than-average sentence): CPU 0.71 s vs NPU 0.18 s, identical transcript.&lt;/p&gt;
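
&lt;p&gt;For anyone who wants to repeat the cold-after-idle measurement, here is a rough sketch of the idea. It is not the exact harness behind the numbers above: &lt;code&gt;transcribe&lt;/code&gt; is a placeholder for whatever call runs one full transcription, and the idle gap would be minutes in a real run rather than the fraction of a second used in the demo.&lt;/p&gt;

```python
import statistics
import time


def cold_latency_ms(transcribe, idle_seconds, rounds=3):
    """Time single calls, each preceded by an idle gap so nothing stays warm."""
    samples = []
    for _ in range(rounds):
        time.sleep(idle_seconds)          # let clocks and power states settle
        start = time.perf_counter()
        transcribe()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), samples


def fake_transcribe():
    """Placeholder workload; swap in a real recognize() call."""
    time.sleep(0.01)


if __name__ == "__main__":
    mean_ms, samples = cold_latency_ms(fake_transcribe, idle_seconds=0.05)
    print(f"mean {mean_ms:.1f} ms over {len(samples)} cold calls")
```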




&lt;h2&gt;
  
  
  What this means in absolute terms
&lt;/h2&gt;

&lt;p&gt;The result that genuinely surprised me: this 13-TOPS NPU running Parakeet ends up &lt;strong&gt;as fast as or faster than&lt;/strong&gt; the same model running on an NVIDIA RTX 3060 (~13 TFLOPS at FP16), which I had been using as an eGPU on my previous server. The RTX did 0.15-0.3 s per utterance. The NPU does 0.1-0.2 s. Same ballpark, and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The NPU pulls &lt;strong&gt;~13 W&lt;/strong&gt; during transcription&lt;/li&gt;
&lt;li&gt;The RTX 3060 pulled &lt;strong&gt;~170 W&lt;/strong&gt; active, &lt;strong&gt;~15 W idle&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The NPU's &lt;em&gt;active&lt;/em&gt; power is lower than the RTX's &lt;em&gt;idle&lt;/em&gt;. While transcribing, that's roughly 13× less power (~170 W vs ~13 W), and on a workload that's mostly idle anyway, the GPU's ~15 W idle draw is gone as well.&lt;/p&gt;

&lt;p&gt;For 13 TOPS, that's a remarkable use of silicon. The "NPUs are marketing" take is wrong for at least this workload.&lt;/p&gt;

&lt;p&gt;Now, I am not claiming that the NPU is more powerful than a 3060. It clearly isn't. But I suspect it's able to match or beat it because (and this is just a theory) it wakes up faster than a discrete GPU, and for a short burst of work like this, that gives it a head start that the NVIDIA card wasn't able to overcome. I'm sure the GPU would win when transcribing audio longer than 10 seconds, but commands that long are very rare.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;I packaged everything into a public Docker image. If you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Intel Core Ultra processor (Meteor Lake / Arrow Lake / Lunar Lake)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/dev/accel/accel0&lt;/code&gt; on your host (&lt;code&gt;lsmod | grep intel_vpu&lt;/code&gt; to verify)&lt;/li&gt;
&lt;li&gt;Home Assistant or any other Wyoming-protocol client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; wyoming-parakeet-npu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/accel/accel0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;LANGUAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;es &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 10300:10300 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; parakeet-data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/cibernox/wyoming-parakeet-on-intel-npu:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First boot downloads ~3.2 GB of model weights and compiles the NPU buckets (~60-90 s). Subsequent restarts are under 5 s. Point Home Assistant's Wyoming integration at &lt;code&gt;tcp://&amp;lt;host&amp;gt;:10300&lt;/code&gt; and you're done.&lt;/p&gt;

&lt;p&gt;Repo with source, Dockerfile, docs and a &lt;code&gt;docker-compose.yml&lt;/code&gt; example: &lt;strong&gt;&lt;a href="https://github.com/cibernox/wyoming-parakeet-on-intel-npu" rel="noopener noreferrer"&gt;github.com/cibernox/wyoming-parakeet-on-intel-npu&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd love help with
&lt;/h2&gt;

&lt;p&gt;If you're playing with this, things I haven't done yet that I think could move the needle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAD gate before the encoder&lt;/strong&gt; — most wake-word false-positives carry a fraction of a second of speech then silence. Cheap RMS-based VAD on the host could avoid invoking the encoder entirely for those. Probably the single biggest aggregate energy saver in a real smart home.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lazy bucket loading + LRU eviction&lt;/strong&gt; — I keep multiple buckets resident, but each compiled blob takes ~1.5 GB of RAM. An LRU policy would let you compile many buckets but only keep N hot. (The repo already has a basic "lazy load one large bucket" mode; full LRU would be the next step.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investigating the TDT decoder's Python overhead&lt;/strong&gt; — even with the decoder itself running at ~1 ms per call on NPU, the surrounding loop in &lt;code&gt;onnx_asr&lt;/code&gt; (numpy state-handling, control flow) accounts for a meaningful fraction of total time on long audio.&lt;/li&gt;
&lt;/ul&gt;
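
&lt;p&gt;To make the first idea concrete, an RMS gate could look roughly like this. It's a sketch under assumptions: 16 kHz 16-bit mono PCM input, and placeholder values for the threshold and minimum voiced duration that would need tuning against real satellites. The function names are invented for illustration.&lt;/p&gt;

```python
import math
from array import array


def frame_rms(samples, frame_len):
    """RMS of each full frame of 16-bit PCM samples."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return out


def has_speech(samples, sample_rate=16000, frame_ms=20,
               rms_threshold=500.0, min_voiced_ms=300):
    """True when enough frames are louder than the threshold to be worth transcribing."""
    frame_len = sample_rate * frame_ms // 1000
    voiced = sum(1 for rms in frame_rms(samples, frame_len) if rms >= rms_threshold)
    return voiced * frame_ms >= min_voiced_ms


# One second of silence and one second of a 220 Hz tone as stand-in inputs
silence = array("h", [0] * 16000)
tone = array("h", [int(3000 * math.sin(2 * math.pi * 220 * n / 16000))
                   for n in range(16000)])

print("silence:", has_speech(silence))
print("tone:", has_speech(tone))
```

&lt;p&gt;The gate counts loud frames anywhere in the clip rather than requiring them to be contiguous, which keeps it cheap but also makes it easy to fool with background noise.&lt;/p&gt;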

&lt;p&gt;PRs welcome.&lt;/p&gt;




&lt;h2&gt;
  
  
  Acknowledgements
&lt;/h2&gt;

&lt;p&gt;This work stands on top of several open-source projects, all of which made this hack possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/tboby/wyoming-onnx-asr" rel="noopener noreferrer"&gt;tboby/wyoming-onnx-asr&lt;/a&gt; — the Wyoming protocol server I forked from&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/istupakov/onnx-asr" rel="noopener noreferrer"&gt;istupakov/onnx-asr&lt;/a&gt; — the ASR pipeline library&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx" rel="noopener noreferrer"&gt;istupakov/parakeet-tdt-0.6b-v3-onnx&lt;/a&gt; — Parakeet's ONNX export&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/openvinotoolkit/openvino" rel="noopener noreferrer"&gt;openvinotoolkit/openvino&lt;/a&gt; — Intel's inference runtime&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/intel/linux-npu-driver" rel="noopener noreferrer"&gt;intel/linux-npu-driver&lt;/a&gt; — NPU userspace driver&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/amd/RyzenAI-SW/tree/main/Demos/ASR/Parakeet-TDT" rel="noopener noreferrer"&gt;amd/RyzenAI-SW Parakeet-TDT demo&lt;/a&gt; — proved the same approach works on a competing NPU; gave me the static-reshape recipe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to all of them for shipping working code.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>smarthome</category>
    </item>
  </channel>
</rss>
