<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: madev</title>
    <description>The latest articles on DEV Community by madev (@mr1azl).</description>
    <link>https://dev.to/mr1azl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871960%2Fe677930f-4eba-4981-86d9-d63a11d3fb83.png</url>
      <title>DEV Community: madev</title>
      <link>https://dev.to/mr1azl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mr1azl"/>
    <language>en</language>
    <item>
      <title>I Tried Running LLMs on Intel's NPU. Here's What Actually Happened.</title>
      <dc:creator>madev</dc:creator>
      <pubDate>Fri, 10 Apr 2026 14:23:50 +0000</pubDate>
      <link>https://dev.to/mr1azl/i-tried-running-llms-on-intels-npu-heres-what-actually-happened-5h17</link>
      <guid>https://dev.to/mr1azl/i-tried-running-llms-on-intels-npu-heres-what-actually-happened-5h17</guid>
      <description>&lt;p&gt;&lt;em&gt;A hands-on guide to local LLM inference on a Lenovo ThinkPad T14 Gen 5 with Intel Core Ultra 7 155U, comparing NPU, CPU, and llama.cpp performance.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Promise
&lt;/h2&gt;

&lt;p&gt;Intel's "AI Boost" NPU (Neural Processing Unit) ships in every Core Ultra laptop. The marketing suggests it's your on-device AI accelerator, ready to run models locally. I wanted to test that claim by running LLMs on my ThinkPad T14 Gen 5. What followed was a journey through compiler errors, dynamic shape limitations, and some surprising benchmark results.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Hardware
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Laptop:&lt;/strong&gt; Lenovo ThinkPad T14 Gen 5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; Intel Core Ultra 7 155U (Meteor Lake), 12 cores, 14 logical processors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 32 GB DDR5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NPU:&lt;/strong&gt; Intel AI Boost (NPU 3720), ~10-11 TOPS, 18 GB shared memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU:&lt;/strong&gt; Intel integrated graphics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Windows 11&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NPU Driver:&lt;/strong&gt; 32.0.100.4512 (December 2025)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Attempt 1: The Obvious Approach (OVModelForCausalLM)
&lt;/h2&gt;

&lt;p&gt;The best-documented way to run a model on Intel hardware is through &lt;code&gt;optimum-intel&lt;/code&gt; and OpenVINO. I exported Qwen2.5-7B-Instruct to OpenVINO IR format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;py &lt;span class="nt"&gt;-m&lt;/span&gt; optimum.commands.optimum_cli &lt;span class="nb"&gt;export &lt;/span&gt;openvino &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-7B-Instruct &lt;span class="nt"&gt;--weight-format&lt;/span&gt; int4 ./local-npu-model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then tried loading it on NPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;optimum.intel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OVModelForCausalLM&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OVModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result: Immediate crash.&lt;/strong&gt; The NPU compiler threw a fatal error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLVM ERROR: Failed to infer result type(s):
"IE.Convolution"(...) {} : (tensor&amp;lt;1x0x1x1xf16&amp;gt;, tensor&amp;lt;1x28x1x1xf16&amp;gt;) -&amp;gt; ( ??? )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The NPU compiler requires static tensor shapes, but &lt;code&gt;optimum-intel&lt;/code&gt; exports models with dynamic shapes for variable sequence lengths. These two requirements are fundamentally incompatible.&lt;/p&gt;
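&lt;p&gt;You can confirm this yourself before burning time on NPU compilation: the exported IR exposes its input shapes through the standard &lt;code&gt;openvino&lt;/code&gt; Python API. A minimal check (the helper split here is mine, not part of any Intel tooling):&lt;/p&gt;

```python
def has_dynamic_inputs(model):
    """Return True if any input of an OpenVINO model has a dynamic dimension."""
    return any(inp.get_partial_shape().is_dynamic for inp in model.inputs)


def check_ir(xml_path):
    """Read an exported IR and report whether its inputs are dynamic."""
    import openvino as ov  # deferred so the helper above works without openvino
    return has_dynamic_inputs(ov.Core().read_model(xml_path))
```

&lt;p&gt;Run against an &lt;code&gt;optimum-intel&lt;/code&gt; causal LM export, this should report dynamic inputs, which is exactly what the NPU compiler rejects.&lt;/p&gt;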

&lt;h2&gt;
  
  
  Attempt 2: Forcing Static Shapes
&lt;/h2&gt;

&lt;p&gt;I tried passing &lt;code&gt;dynamic_shapes=False&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OVModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dynamic_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; A warning appeared saying the parameter was ignored because "only dynamic shapes are supported" for causal LM models. Then the same crash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt 3: Smaller Model, Same Problem
&lt;/h2&gt;

&lt;p&gt;Maybe 7B was simply too large? I tried Qwen2.5-1.5B-Instruct with the same export. &lt;strong&gt;The exact same crash&lt;/strong&gt;, just with different tensor dimensions (&lt;code&gt;0 != 12&lt;/code&gt; instead of &lt;code&gt;0 != 28&lt;/code&gt;). The problem was never model size; it was the export format.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breakthrough: Correct Export Flags + LLMPipeline
&lt;/h2&gt;

&lt;p&gt;After researching Intel's documentation and community forums, I found the solution. Two things needed to change:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. NPU-specific export flags
&lt;/h3&gt;

&lt;p&gt;The standard export produces models incompatible with the NPU. You need symmetric quantization (&lt;code&gt;--sym&lt;/code&gt;), a ratio of 1.0 (every layer quantized to int4), and a group size of 128:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;py &lt;span class="nt"&gt;-m&lt;/span&gt; optimum.commands.optimum_cli &lt;span class="nb"&gt;export &lt;/span&gt;openvino &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-m&lt;/span&gt; Qwen/Qwen2.5-1.5B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--weight-format&lt;/span&gt; int4 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--sym&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ratio&lt;/span&gt; 1.0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--group-size&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
    ./local-npu-model-1.5b-npu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Use &lt;code&gt;openvino-genai&lt;/code&gt; LLMPipeline, NOT &lt;code&gt;optimum-intel&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;OVModelForCausalLM&lt;/code&gt; does not handle NPU compilation correctly. Intel's dedicated &lt;code&gt;LLMPipeline&lt;/code&gt; from &lt;code&gt;openvino-genai&lt;/code&gt; manages the static shape requirements internally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--pre&lt;/span&gt; openvino openvino-tokenizers openvino-genai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--extra-index-url&lt;/span&gt; https://storage.openvinotoolkit.org/simple/wheels/nightly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openvino_genai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ov_genai&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ov_genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LLMPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./local-npu-model-1.5b-npu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;It worked.&lt;/strong&gt; The NPU showed activity in Task Manager, and the model responded correctly.&lt;/p&gt;
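&lt;p&gt;One practical note: compilation for NPU can fail outright or take a long time, so in my own scripts I wrap the load in a fallback. This is a sketch of my own making, using only the &lt;code&gt;openvino_genai&lt;/code&gt; API shown above; the device order is just my preference:&lt;/p&gt;

```python
def load_pipeline(model_dir, devices=("NPU", "CPU")):
    """Try each device in order, falling back when pipeline creation fails.

    NPU compilation can take a long time or raise a RuntimeError outright,
    so keeping CPU as a fallback avoids a hard crash.
    """
    import openvino_genai as ov_genai  # deferred: only needed at load time
    errors = {}
    for device in devices:
        try:
            return ov_genai.LLMPipeline(model_dir, device), device
        except RuntimeError as err:
            errors[device] = err
    raise RuntimeError(f"no usable device, tried: {errors}")
```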

&lt;h2&gt;
  
  
  The Benchmarks
&lt;/h2&gt;

&lt;p&gt;With the NPU finally working, I ran the same three prompts across every backend I could test. Here are the results.&lt;/p&gt;
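&lt;p&gt;The methodology was nothing fancier than wall-clock timing around each backend's generate call. A minimal sketch of such a harness (not the exact script; &lt;code&gt;generate&lt;/code&gt; stands in for whichever backend is under test):&lt;/p&gt;

```python
import time

def benchmark(generate, prompts):
    """Wall-clock a generate(prompt) callable over a prompt list, in words/s."""
    start = time.perf_counter()
    words = sum(len(generate(p).split()) for p in prompts)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard a zero-length run
    return {"seconds": round(elapsed, 2), "words_per_s": round(words / elapsed, 1)}
```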

&lt;h3&gt;
  
  
  OpenVINO: NPU vs CPU (Qwen 2.5 1.5B, int4)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;NPU&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model load time&lt;/td&gt;
&lt;td&gt;95.90s&lt;/td&gt;
&lt;td&gt;4.73s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total generation time&lt;/td&gt;
&lt;td&gt;24.05s&lt;/td&gt;
&lt;td&gt;22.46s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average speed&lt;/td&gt;
&lt;td&gt;~6.8 words/s&lt;/td&gt;
&lt;td&gt;~7.5 words/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Winner&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CPU was faster at generation AND loaded the model 20x quicker. The NPU's 96-second load time is the compiler building static execution graphs, a one-time cost per session, but a painful one.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp vs OpenVINO (Qwen 2.5 1.5B)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Load Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp CPU&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVINO CPU&lt;/td&gt;
&lt;td&gt;~10 tok/s*&lt;/td&gt;
&lt;td&gt;5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVINO NPU&lt;/td&gt;
&lt;td&gt;~9 tok/s*&lt;/td&gt;
&lt;td&gt;96s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Estimated tok/s converted from words/s (tokens are roughly 1.3x words)&lt;/em&gt;&lt;/p&gt;
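&lt;p&gt;That conversion is just multiplication; spelled out:&lt;/p&gt;

```python
def words_to_tok(words_per_s, tokens_per_word=1.3):
    """Approximate tok/s from a words/s measurement."""
    return words_per_s * tokens_per_word

npu_tok_s = words_to_tok(6.8)  # ~8.8, reported as ~9 tok/s in the table
cpu_tok_s = words_to_tok(7.5)  # ~9.8, reported as ~10 tok/s
```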

&lt;p&gt;llama.cpp was the clear winner, roughly &lt;strong&gt;2x faster&lt;/strong&gt; than OpenVINO CPU on the same model, with near-instant load times.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp on larger models
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Load Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 1.5B (q4_k_m)&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 7B (q3_k_m)&lt;/td&gt;
&lt;td&gt;~3.6 tok/s&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 7B model is noticeably slower, but ~3.6 tok/s is still workable for interactive chat if you can tolerate the pace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The NPU is not for LLMs (yet)
&lt;/h3&gt;

&lt;p&gt;The NPU on Meteor Lake (Core Ultra Series 1) is rated at 10-11 TOPS. It's designed for always-on, low-power tasks: background noise suppression, live captions, Windows Studio Effects camera filters, and grammar checking. (Heavier features like Windows Recall target the 40+ TOPS NPUs in Copilot+ PCs.) Running LLMs on it is technically possible but offers no speed advantage over the CPU, and the compilation overhead is massive.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Export flags matter enormously
&lt;/h3&gt;

&lt;p&gt;The standard &lt;code&gt;optimum-cli export openvino&lt;/code&gt; command produces models that will never compile for the NPU. You must include &lt;code&gt;--sym --ratio 1.0 --group-size 128&lt;/code&gt; for NPU-compatible output. Outside of Intel's official OpenVINO GenAI documentation, these flags are barely mentioned anywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use LLMPipeline, not OVModelForCausalLM
&lt;/h3&gt;

&lt;p&gt;If you're targeting NPU, &lt;code&gt;optimum-intel&lt;/code&gt;'s &lt;code&gt;OVModelForCausalLM&lt;/code&gt; is a dead end. It forces dynamic shapes that the NPU cannot handle. Intel's &lt;code&gt;openvino_genai.LLMPipeline&lt;/code&gt; is the correct tool; it manages static shape compilation internally.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. llama.cpp is king for CPU inference
&lt;/h3&gt;

&lt;p&gt;For local LLM inference on Intel laptop CPUs, llama.cpp (and tools built on it like LM Studio) delivers the best performance. It was 2x faster than OpenVINO in my tests and supports a massive ecosystem of GGUF models.&lt;/p&gt;
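&lt;p&gt;Getting started from Python takes a few lines via the &lt;code&gt;llama-cpp-python&lt;/code&gt; bindings. A sketch (the model path is a placeholder for whatever GGUF file you downloaded):&lt;/p&gt;

```python
def run_gguf(model_path, prompt, max_tokens=50):
    """One-shot chat completion against a local GGUF model via llama-cpp-python."""
    from llama_cpp import Llama  # deferred: heavy native import
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"]
```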

&lt;h3&gt;
  
  
  5. Your laptop is more capable than you think
&lt;/h3&gt;

&lt;p&gt;Even without a discrete GPU, a 32 GB Intel laptop can comfortably run 7B parameter models through llama.cpp at conversational speeds (~4 tok/s). Smaller models like 1.5B-3B run at 20+ tok/s, which feels instant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Setup for Intel Laptops (No Discrete GPU)
&lt;/h2&gt;

&lt;p&gt;If you have a similar Intel Core Ultra laptop and want to run local LLMs, here's what I'd recommend after all this testing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For everyday use:&lt;/strong&gt; Install &lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;. It uses llama.cpp under the hood, has a great UI, and supports downloading models directly. Start with Qwen2.5-7B-Instruct or Gemma 3 4B in Q4_K_M quantization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For programmatic access:&lt;/strong&gt; Use LM Studio's local server feature or run &lt;code&gt;llama-server&lt;/code&gt; directly. Both expose an OpenAI-compatible API you can call from any language.&lt;/p&gt;
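&lt;p&gt;Because the API is OpenAI-compatible, plain standard-library code is enough; no SDK required. A sketch (port 1234 is LM Studio's default, &lt;code&gt;llama-server&lt;/code&gt; defaults to 8080; the model name is a placeholder):&lt;/p&gt;

```python
import json
from urllib import request

def chat(prompt, base_url="http://localhost:1234/v1", model="local-model"):
    """POST one chat turn to an OpenAI-compatible /chat/completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```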

&lt;p&gt;&lt;strong&gt;For experimenting with NPU:&lt;/strong&gt; Use &lt;code&gt;openvino-genai&lt;/code&gt; with &lt;code&gt;LLMPipeline&lt;/code&gt; and models exported with &lt;code&gt;--sym --ratio 1.0 --group-size 128&lt;/code&gt;. Stick to models under 3B parameters. It's a fun experiment but won't outperform CPU for LLM workloads on current hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skip the NPU for LLMs&lt;/strong&gt; unless you specifically need low-power background inference. The CPU (or even iGPU) will serve you better for interactive chat.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future
&lt;/h2&gt;

&lt;p&gt;Intel's NPU story is improving. OpenVINO 2025.3 added dynamic prompt support and 8K context for NPU. Lunar Lake and Arrow Lake have stronger NPUs. The software ecosystem is maturing. In a year or two, NPU-based LLM inference might actually be practical. But today, for Meteor Lake, CPU + llama.cpp is the way to go.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tested on April 10, 2026. Software versions: Python 3.10.11, OpenVINO 2025.3, openvino-genai (nightly), llama-cpp-python 0.3.2, optimum-intel latest.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>npu</category>
    </item>
  </channel>
</rss>
