<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: madev</title>
    <description>The latest articles on DEV Community by madev (@mr1azl).</description>
    <link>https://dev.to/mr1azl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871960%2Fe677930f-4eba-4981-86d9-d63a11d3fb83.png</url>
      <title>DEV Community: madev</title>
      <link>https://dev.to/mr1azl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mr1azl"/>
    <language>en</language>
    <item>
      <title>I Tried Running LLMs on Intel's NPU. Here's What Actually Happened.</title>
      <dc:creator>madev</dc:creator>
      <pubDate>Fri, 10 Apr 2026 14:23:50 +0000</pubDate>
      <link>https://dev.to/mr1azl/i-tried-running-llms-on-intels-npu-heres-what-actually-happened-5h17</link>
      <guid>https://dev.to/mr1azl/i-tried-running-llms-on-intels-npu-heres-what-actually-happened-5h17</guid>
      <description>&lt;p&gt;&lt;em&gt;A hands-on guide to local LLM inference on a Lenovo ThinkPad T14 Gen 5 with Intel Core Ultra 7 155U, comparing NPU, CPU, and llama.cpp performance.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Promise
&lt;/h2&gt;

&lt;p&gt;Intel's "AI Boost" NPU (Neural Processing Unit) ships in every Core Ultra laptop. The marketing suggests it's your on-device AI accelerator, ready to run models locally. I wanted to test that claim by running LLMs on my ThinkPad T14 Gen 5. What followed was a journey through compiler errors, dynamic shape limitations, and some surprising benchmark results.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Hardware
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Laptop:&lt;/strong&gt; Lenovo ThinkPad T14 Gen 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; Intel Core Ultra 7 155U (Meteor Lake), 12 cores, 14 logical processors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 32 GB DDR5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NPU:&lt;/strong&gt; Intel AI Boost (NPU 3720), ~10-11 TOPS, 18 GB shared memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU:&lt;/strong&gt; Intel integrated graphics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Windows 11&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NPU Driver:&lt;/strong&gt; 32.0.100.4512 (December 2025)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Attempt 1: The Obvious Approach (OVModelForCausalLM)
&lt;/h2&gt;

&lt;p&gt;The best-documented way to run a model on Intel hardware is through &lt;code&gt;optimum-intel&lt;/code&gt; and OpenVINO. I exported Qwen2.5-7B-Instruct to OpenVINO IR format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;py &lt;span class="nt"&gt;-m&lt;/span&gt; optimum.commands.optimum_cli &lt;span class="nb"&gt;export &lt;/span&gt;openvino &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-7B-Instruct &lt;span class="nt"&gt;--weight-format&lt;/span&gt; int4 ./local-npu-model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then tried loading it on NPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;optimum.intel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OVModelForCausalLM&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OVModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: Immediate crash.&lt;/strong&gt; The NPU compiler threw a fatal error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLVM ERROR: Failed to infer result type(s):
"IE.Convolution"(...) {} : (tensor&amp;lt;1x0x1x1xf16&amp;gt;, tensor&amp;lt;1x28x1x1xf16&amp;gt;) -&amp;gt; ( ??? )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The NPU compiler requires static tensor shapes, but &lt;code&gt;optimum-intel&lt;/code&gt; exports models with dynamic shapes for variable sequence lengths. These two requirements are fundamentally incompatible.&lt;/p&gt;
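&lt;p&gt;You can confirm this yourself before burning time on NPU compilation: the exported IR exposes its input shapes through the standard &lt;code&gt;openvino&lt;/code&gt; Python API. A minimal check (the helper split here is mine, not part of any Intel tooling):&lt;/p&gt;

```python
def has_dynamic_inputs(model):
    """Return True if any input of an OpenVINO model has a dynamic dimension."""
    return any(inp.get_partial_shape().is_dynamic for inp in model.inputs)


def check_ir(xml_path):
    """Read an exported IR and report whether its inputs are dynamic."""
    import openvino as ov  # deferred so the helper above works without openvino
    return has_dynamic_inputs(ov.Core().read_model(xml_path))
```

&lt;p&gt;Run against an &lt;code&gt;optimum-intel&lt;/code&gt; causal LM export, this should report dynamic inputs, which is exactly what the NPU compiler rejects.&lt;/p&gt;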

&lt;h2&gt;
  
  
  Attempt 2: Forcing Static Shapes
&lt;/h2&gt;

&lt;p&gt;I tried passing &lt;code&gt;dynamic_shapes=False&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OVModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dynamic_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; A warning appeared saying the parameter was ignored because "only dynamic shapes are supported" for causal LM models. Then the same crash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt 3: Smaller Model, Same Problem
&lt;/h2&gt;

&lt;p&gt;Maybe 7B was simply too large? I tried Qwen2.5-1.5B-Instruct with the same export. &lt;strong&gt;The exact same crash&lt;/strong&gt;, just with different tensor dimensions (&lt;code&gt;0 != 12&lt;/code&gt; instead of &lt;code&gt;0 != 28&lt;/code&gt;). The problem was never model size; it was the export format.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breakthrough: Correct Export Flags + LLMPipeline
&lt;/h2&gt;

&lt;p&gt;After researching Intel's documentation and community forums, I found the solution. Two things needed to change:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. NPU-specific export flags
&lt;/h3&gt;

&lt;p&gt;The standard export produces models incompatible with the NPU. You need symmetric quantization (&lt;code&gt;--sym&lt;/code&gt;), a ratio of 1.0 (every layer quantized to int4), and a group size of 128:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;py &lt;span class="nt"&gt;-m&lt;/span&gt; optimum.commands.optimum_cli &lt;span class="nb"&gt;export &lt;/span&gt;openvino &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-m&lt;/span&gt; Qwen/Qwen2.5-1.5B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--weight-format&lt;/span&gt; int4 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--sym&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ratio&lt;/span&gt; 1.0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--group-size&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
    ./local-npu-model-1.5b-npu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Use &lt;code&gt;openvino-genai&lt;/code&gt; LLMPipeline, NOT &lt;code&gt;optimum-intel&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;OVModelForCausalLM&lt;/code&gt; does not handle NPU compilation correctly. Intel's dedicated &lt;code&gt;LLMPipeline&lt;/code&gt; from &lt;code&gt;openvino-genai&lt;/code&gt; manages the static shape requirements internally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--pre&lt;/span&gt; openvino openvino-tokenizers openvino-genai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--extra-index-url&lt;/span&gt; https://storage.openvinotoolkit.org/simple/wheels/nightly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openvino_genai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ov_genai&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ov_genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LLMPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./local-npu-model-1.5b-npu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;It worked.&lt;/strong&gt; The NPU showed activity in Task Manager, and the model responded correctly.&lt;/p&gt;
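&lt;p&gt;One practical note: compilation for NPU can fail outright or take a long time, so in my own scripts I wrap the load in a fallback. This is a sketch of my own making, using only the &lt;code&gt;openvino_genai&lt;/code&gt; API shown above; the device order is just my preference:&lt;/p&gt;

```python
def load_pipeline(model_dir, devices=("NPU", "CPU")):
    """Try each device in order, falling back when pipeline creation fails.

    NPU compilation can take a long time or raise a RuntimeError outright,
    so keeping CPU as a fallback avoids a hard crash.
    """
    import openvino_genai as ov_genai  # deferred: only needed at load time
    errors = {}
    for device in devices:
        try:
            return ov_genai.LLMPipeline(model_dir, device), device
        except RuntimeError as err:
            errors[device] = err
    raise RuntimeError(f"no usable device, tried: {errors}")
```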

&lt;h2&gt;
  
  
  The Benchmarks
&lt;/h2&gt;

&lt;p&gt;With the NPU finally working, I ran the same three prompts across every backend I could test. Here are the results.&lt;/p&gt;
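&lt;p&gt;The methodology was nothing fancier than wall-clock timing around each backend's generate call. A minimal sketch of such a harness (not the exact script; &lt;code&gt;generate&lt;/code&gt; stands in for whichever backend is under test):&lt;/p&gt;

```python
import time

def benchmark(generate, prompts):
    """Wall-clock a generate(prompt) callable over a prompt list, in words/s."""
    start = time.perf_counter()
    words = sum(len(generate(p).split()) for p in prompts)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard a zero-length run
    return {"seconds": round(elapsed, 2), "words_per_s": round(words / elapsed, 1)}
```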

&lt;h3&gt;
  
  
  OpenVINO: NPU vs CPU (Qwen 2.5 1.5B, int4)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;NPU&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model load time&lt;/td&gt;
&lt;td&gt;95.90s&lt;/td&gt;
&lt;td&gt;4.73s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total generation time&lt;/td&gt;
&lt;td&gt;24.05s&lt;/td&gt;
&lt;td&gt;22.46s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average speed&lt;/td&gt;
&lt;td&gt;~6.8 words/s&lt;/td&gt;
&lt;td&gt;~7.5 words/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Winner&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CPU was faster at generation AND loaded the model 20x quicker. The NPU's 96-second load time is the compiler building static execution graphs, a one-time cost per session, but a painful one.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp vs OpenVINO (Qwen 2.5 1.5B)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Load Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp CPU&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVINO CPU&lt;/td&gt;
&lt;td&gt;~10 tok/s*&lt;/td&gt;
&lt;td&gt;5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVINO NPU&lt;/td&gt;
&lt;td&gt;~9 tok/s*&lt;/td&gt;
&lt;td&gt;96s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Estimated tok/s converted from words/s (tokens are roughly 1.3x words)&lt;/em&gt;&lt;/p&gt;
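&lt;p&gt;That conversion is just multiplication; spelled out:&lt;/p&gt;

```python
def words_to_tok(words_per_s, tokens_per_word=1.3):
    """Approximate tok/s from a words/s measurement."""
    return words_per_s * tokens_per_word

npu_tok_s = words_to_tok(6.8)  # ~8.8, reported as ~9 tok/s in the table
cpu_tok_s = words_to_tok(7.5)  # ~9.8, reported as ~10 tok/s
```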

&lt;p&gt;llama.cpp was the clear winner, roughly &lt;strong&gt;2x faster&lt;/strong&gt; than OpenVINO CPU on the same model, with near-instant load times.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp on larger models
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Load Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 1.5B (q4_k_m)&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 7B (q3_k_m)&lt;/td&gt;
&lt;td&gt;~3.6 tok/s&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 7B model is noticeably slower, but ~3.6 tok/s is still workable for interactive chat if you can tolerate the pace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The NPU is not for LLMs (yet)
&lt;/h3&gt;

&lt;p&gt;The NPU on Meteor Lake (Core Ultra Series 1) is rated at 10-11 TOPS. It's designed for always-on, low-power tasks: background noise suppression, live captions, Windows Studio Effects camera filters, and grammar checking. (Heavier features like Windows Recall target the 40+ TOPS NPUs in Copilot+ PCs.) Running LLMs on it is technically possible but offers no speed advantage over the CPU, and the compilation overhead is massive.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Export flags matter enormously
&lt;/h3&gt;

&lt;p&gt;The standard &lt;code&gt;optimum-cli export openvino&lt;/code&gt; command produces models that will never compile for the NPU. You must include &lt;code&gt;--sym --ratio 1.0 --group-size 128&lt;/code&gt; for NPU-compatible output. Outside of Intel's official OpenVINO GenAI documentation, these flags are barely mentioned anywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use LLMPipeline, not OVModelForCausalLM
&lt;/h3&gt;

&lt;p&gt;If you're targeting NPU, &lt;code&gt;optimum-intel&lt;/code&gt;'s &lt;code&gt;OVModelForCausalLM&lt;/code&gt; is a dead end. It forces dynamic shapes that the NPU cannot handle. Intel's &lt;code&gt;openvino_genai.LLMPipeline&lt;/code&gt; is the correct tool; it manages static shape compilation internally.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. llama.cpp is king for CPU inference
&lt;/h3&gt;

&lt;p&gt;For local LLM inference on Intel laptop CPUs, llama.cpp (and tools built on it like LM Studio) delivers the best performance. It was 2x faster than OpenVINO in my tests and supports a massive ecosystem of GGUF models.&lt;/p&gt;
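&lt;p&gt;Getting started from Python takes a few lines via the &lt;code&gt;llama-cpp-python&lt;/code&gt; bindings. A sketch (the model path is a placeholder for whatever GGUF file you downloaded):&lt;/p&gt;

```python
def run_gguf(model_path, prompt, max_tokens=50):
    """One-shot chat completion against a local GGUF model via llama-cpp-python."""
    from llama_cpp import Llama  # deferred: heavy native import
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"]
```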

&lt;h3&gt;
  
  
  5. Your laptop is more capable than you think
&lt;/h3&gt;

&lt;p&gt;Even without a discrete GPU, a 32 GB Intel laptop can comfortably run 7B parameter models through llama.cpp at conversational speeds (~4 tok/s). Smaller models like 1.5B-3B run at 20+ tok/s, which feels instant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Setup for Intel Laptops (No Discrete GPU)
&lt;/h2&gt;

&lt;p&gt;If you have a similar Intel Core Ultra laptop and want to run local LLMs, here's what I'd recommend after all this testing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For everyday use:&lt;/strong&gt; Install &lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;. It uses llama.cpp under the hood, has a great UI, and supports downloading models directly. Start with Qwen2.5-7B-Instruct or Gemma 3 4B in Q4_K_M quantization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For programmatic access:&lt;/strong&gt; Use LM Studio's local server feature or run &lt;code&gt;llama-server&lt;/code&gt; directly. Both expose an OpenAI-compatible API you can call from any language.&lt;/p&gt;
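&lt;p&gt;Because the API is OpenAI-compatible, plain standard-library code is enough; no SDK required. A sketch (port 1234 is LM Studio's default, &lt;code&gt;llama-server&lt;/code&gt; defaults to 8080; the model name is a placeholder):&lt;/p&gt;

```python
import json
from urllib import request

def chat(prompt, base_url="http://localhost:1234/v1", model="local-model"):
    """POST one chat turn to an OpenAI-compatible /chat/completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```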

&lt;p&gt;&lt;strong&gt;For experimenting with NPU:&lt;/strong&gt; Use &lt;code&gt;openvino-genai&lt;/code&gt; with &lt;code&gt;LLMPipeline&lt;/code&gt; and models exported with &lt;code&gt;--sym --ratio 1.0 --group-size 128&lt;/code&gt;. Stick to models under 3B parameters. It's a fun experiment but won't outperform CPU for LLM workloads on current hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skip the NPU for LLMs&lt;/strong&gt; unless you specifically need low-power background inference. The CPU (or even iGPU) will serve you better for interactive chat.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future
&lt;/h2&gt;

&lt;p&gt;Intel's NPU story is improving. OpenVINO 2025.3 added dynamic prompt support and 8K context for NPU. Lunar Lake and Arrow Lake have stronger NPUs. The software ecosystem is maturing. In a year or two, NPU-based LLM inference might actually be practical. But today, for Meteor Lake, CPU + llama.cpp is the way to go.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tested on April 10, 2026. Software versions: Python 3.10.11, OpenVINO 2025.3, openvino-genai (nightly), llama-cpp-python 0.3.2, optimum-intel latest.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npu</category>
    </item>
  </channel>
</rss>
