A hands-on guide to local LLM inference on a Lenovo ThinkPad T14 Gen 5 with Intel Core Ultra 7 155U, comparing NPU, CPU, and llama.cpp performance.
The Promise
Intel's "AI Boost" NPU (Neural Processing Unit) ships in every Core Ultra laptop. The marketing suggests it's your on-device AI accelerator, ready to run models locally. I wanted to test that claim by running LLMs on my ThinkPad T14 Gen 5. What followed was a journey through compiler errors, dynamic shape limitations, and some surprising benchmark results.
My Hardware
- Laptop: Lenovo ThinkPad T14 Gen 5
- CPU: Intel Core Ultra 7 155U (Meteor Lake), 12 cores, 14 logical processors
- RAM: 32 GB DDR5
- NPU: Intel AI Boost (NPU 3720), ~10-11 TOPS, 18 GB shared memory
- GPU: Intel integrated graphics
- OS: Windows 11
- NPU Driver: 32.0.100.4512 (December 2025)
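Before doing anything else, it's worth confirming that OpenVINO can actually see the NPU. If the driver is installed correctly, "NPU" should appear alongside CPU and GPU in the device list. A minimal check:

```python
import openvino as ov

core = ov.Core()
# Lists every device the runtime can target, e.g. ['CPU', 'GPU', 'NPU']
print(core.available_devices)
if "NPU" in core.available_devices:
    # Reports the device name, useful as a driver sanity check
    print(core.get_property("NPU", "FULL_DEVICE_NAME"))
```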
Attempt 1: The Obvious Approach (OVModelForCausalLM)
The most documented way to run a model on Intel hardware is through optimum-intel and OpenVINO. I exported Qwen2.5-7B-Instruct to OpenVINO IR format:
py -m optimum.commands.optimum_cli export openvino --model Qwen/Qwen2.5-7B-Instruct --weight-format int4 ./local-npu-model
Then tried loading it on NPU:
from optimum.intel import OVModelForCausalLM

# Directory from the export step above
model_dir = "./local-npu-model"
model = OVModelForCausalLM.from_pretrained(model_dir, device="NPU")
Result: Immediate crash. The NPU compiler threw a fatal error:
LLVM ERROR: Failed to infer result type(s):
"IE.Convolution"(...) {} : (tensor<1x0x1x1xf16>, tensor<1x28x1x1xf16>) -> ( ??? )
The NPU compiler requires static tensor shapes, but optimum-intel exports models with dynamic shapes for variable sequence lengths. These two requirements are fundamentally incompatible.
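You can see the dynamic dimensions for yourself by reading the exported IR back with the OpenVINO runtime. A quick inspection (the path assumes the export command above):

```python
import openvino as ov

# openvino_model.xml is the IR file written by the optimum-cli export
model = ov.Core().read_model("./local-npu-model/openvino_model.xml")
for inp in model.inputs:
    # Dynamic axes print as '?', e.g. [?,?] for input_ids
    print(inp.any_name, inp.get_partial_shape())
```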
Attempt 2: Forcing Static Shapes
I tried passing dynamic_shapes=False:
model = OVModelForCausalLM.from_pretrained(model_dir, device="NPU", dynamic_shapes=False)
Result: A warning appeared saying the parameter was ignored because "only dynamic shapes are supported" for causal LM models. Then the same crash.
Attempt 3: Smaller Model, Same Problem
Maybe 7B was too large? I tried Qwen2.5-1.5B-Instruct with the same export. Same exact crash, just with different tensor dimensions (0 != 12 instead of 0 != 28). The problem was never the model size; it was the export format.
The Breakthrough: Correct Export Flags + LLMPipeline
After researching Intel's documentation and community forums, I found the solution. Two things needed to change:
1. NPU-specific export flags
The standard export produces models incompatible with NPU. You need symmetric quantization, full int4 ratio, and group size 128:
py -m optimum.commands.optimum_cli export openvino \
-m Qwen/Qwen2.5-1.5B-Instruct \
--weight-format int4 \
--sym \
--ratio 1.0 \
--group-size 128 \
./local-npu-model-1.5b-npu
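If you'd rather drive the export from Python, optimum-intel exposes the same knobs through OVWeightQuantizationConfig. A sketch of what I believe is the equivalent export (I used the CLI, so treat this as untested):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Mirrors the CLI flags: symmetric int4, full ratio, group size 128
quant = OVWeightQuantizationConfig(bits=4, sym=True, ratio=1.0, group_size=128)
model = OVModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    export=True,
    quantization_config=quant,
)
model.save_pretrained("./local-npu-model-1.5b-npu")
```

One caveat: LLMPipeline also needs the converted openvino_tokenizer and openvino_detokenizer files, which the CLI export writes alongside the model, so the CLI remains the simpler path.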
2. Use openvino-genai LLMPipeline, NOT optimum-intel
OVModelForCausalLM does not handle NPU compilation correctly. Intel's dedicated LLMPipeline from openvino-genai manages the static shape requirements internally:
pip install --pre openvino openvino-tokenizers openvino-genai \
--extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline("./local-npu-model-1.5b-npu", "NPU")
result = pipe.generate("Hello, how are you?", max_new_tokens=50)
print(result)
It worked. The NPU showed activity in Task Manager, and the model responded correctly.
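Once the pipeline loads, you can also stream tokens as they're produced instead of waiting for the full string, which makes the NPU's pace much easier to judge. A sketch using the streamer callback from openvino-genai (the prompt is arbitrary):

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("./local-npu-model-1.5b-npu", "NPU")

def streamer(subword: str) -> bool:
    print(subword, end="", flush=True)
    return False  # returning False tells the pipeline to keep generating

pipe.generate("What is an NPU good for?", max_new_tokens=100, streamer=streamer)
```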
The Benchmarks
With the NPU finally working, I ran the same three prompts across every backend I could test. Here are the results.
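For transparency, the OpenVINO numbers came from a loop of roughly this shape. The prompts below are placeholders, not the exact three I used; words/s is simply output word count over generation time:

```python
import time
import openvino_genai as ov_genai

# Placeholder prompts standing in for my actual three
PROMPTS = [
    "Explain what an NPU is in two sentences.",
    "Write a short poem about laptops.",
    "Summarize the benefits of local LLM inference.",
]

def bench(model_dir: str, device: str) -> None:
    t0 = time.perf_counter()
    pipe = ov_genai.LLMPipeline(model_dir, device)
    load_s = time.perf_counter() - t0

    words, gen_s = 0, 0.0
    for prompt in PROMPTS:
        t0 = time.perf_counter()
        out = pipe.generate(prompt, max_new_tokens=200)
        gen_s += time.perf_counter() - t0
        # str() in case generate returns a DecodedResults object
        words += len(str(out).split())

    print(f"{device}: load {load_s:.2f}s, gen {gen_s:.2f}s, "
          f"~{words / gen_s:.1f} words/s")

bench("./local-npu-model-1.5b-npu", "NPU")
bench("./local-npu-model-1.5b-npu", "CPU")
```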
OpenVINO: NPU vs CPU (Qwen 2.5 1.5B, int4)
| Metric | NPU | CPU |
|---|---|---|
| Model load time | 95.90s | 4.73s |
| Total generation time | 24.05s | 22.46s |
| Average speed | ~6.8 words/s | ~7.5 words/s |
| Winner | | CPU |
The CPU was faster at generation AND loaded the model 20x quicker. The NPU's 96-second load time is the compiler building static execution graphs, a one-time cost per session, but a painful one.
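One mitigation I didn't benchmark: OpenVINO can cache the compiled model so only the first load pays the compilation cost. Properties pass through the LLMPipeline constructor, so something like this should work (the cache path is my own placeholder):

```python
import openvino_genai as ov_genai

# First load compiles and writes blobs to the cache dir; later loads reuse them
pipe = ov_genai.LLMPipeline("./local-npu-model-1.5b-npu", "NPU",
                            CACHE_DIR="./npu-cache")
```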
llama.cpp vs OpenVINO (Qwen 2.5 1.5B)
| Backend | Speed | Load Time |
|---|---|---|
| llama.cpp CPU | ~22 tok/s | 2s |
| OpenVINO CPU | ~10 tok/s* | 5s |
| OpenVINO NPU | ~9 tok/s* | 96s |
*Estimated tok/s converted from words/s (tokens are roughly 1.3x words)
llama.cpp was the clear winner, roughly 2x faster than OpenVINO CPU on the same model, with near-instant load times.
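For reference, the llama.cpp side of the comparison ran through llama-cpp-python (version 0.3.2, per the footer). A minimal equivalent of the OpenVINO snippet above, with the GGUF filename as a placeholder:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2.5-1.5b-instruct-q4_k_m.gguf",  # placeholder filename
    n_ctx=4096,
    n_threads=8,  # tune to your physical core count
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
)
print(out["choices"][0]["message"]["content"])
```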
llama.cpp on larger models
| Model | Speed | Load Time |
|---|---|---|
| Qwen 2.5 1.5B (q4_k_m) | ~22 tok/s | 2s |
| Qwen 2.5 7B (q3_k_m) | ~3.6 tok/s | 2s |
The 7B model is noticeably slower but perfectly usable for interactive chat.
What I Learned
1. The NPU is not for LLMs (yet)
The NPU on Meteor Lake (Core Ultra Series 1) is rated at 10-11 TOPS. It's designed for always-on, low-power tasks: background noise suppression, live captions, camera effects (Windows Studio Effects), and grammar checking; heavier features like Windows Recall require the 40+ TOPS NPUs in Copilot+ PCs. Running LLMs on it is technically possible but offers no speed advantage over the CPU, and the compilation overhead is massive.
2. Export flags matter enormously
The standard optimum-cli export openvino command produces models that will never work on NPU. You must include --sym --ratio 1.0 --group-size 128 for NPU-compatible output. Outside of Intel's official OpenVINO GenAI documentation, this requirement is barely mentioned anywhere.
3. Use LLMPipeline, not OVModelForCausalLM
If you're targeting NPU, optimum-intel's OVModelForCausalLM is a dead end. It forces dynamic shapes that the NPU cannot handle. Intel's openvino_genai.LLMPipeline is the correct tool; it manages static shape compilation internally.
4. llama.cpp is king for CPU inference
For local LLM inference on Intel laptop CPUs, llama.cpp (and tools built on it like LM Studio) delivers the best performance. It was 2x faster than OpenVINO in my tests and supports a massive ecosystem of GGUF models.
5. Your laptop is more capable than you think
Even without a discrete GPU, a 32 GB Intel laptop can comfortably run 7B parameter models through llama.cpp at conversational speeds (~4 tok/s). Smaller models like 1.5B-3B run at 20+ tok/s, which feels instant.
Recommended Setup for Intel Laptops (No Discrete GPU)
If you have a similar Intel Core Ultra laptop and want to run local LLMs, here's what I'd recommend after all this testing:
For everyday use: Install LM Studio. It uses llama.cpp under the hood, has a great UI, and supports downloading models directly. Start with Qwen2.5-7B-Instruct or Gemma 3 4B in Q4_K_M quantization.
For programmatic access: Use LM Studio's local server feature or run llama-server directly. Both expose an OpenAI-compatible API you can call from any language.
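A quick sketch of that programmatic route, assuming LM Studio's default port of 1234 (llama-server defaults to 8080) and whatever model id your server reports:

```python
from openai import OpenAI

# LM Studio's local server defaults to port 1234; llama-server uses 8080
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",  # use the model id your server lists
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(resp.choices[0].message.content)
```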
For experimenting with NPU: Use openvino-genai with LLMPipeline and models exported with --sym --ratio 1.0 --group-size 128. Stick to models under 3B parameters. It's a fun experiment but won't outperform CPU for LLM workloads on current hardware.
Skip the NPU for LLMs unless you specifically need low-power background inference. The CPU (or even iGPU) will serve you better for interactive chat.
The Future
Intel's NPU story is improving. OpenVINO 2025.3 added dynamic prompt support and 8K context for NPU. Lunar Lake and Arrow Lake have stronger NPUs. The software ecosystem is maturing. In a year or two, NPU-based LLM inference might actually be practical. But today, for Meteor Lake, CPU + llama.cpp is the way to go.
Tested on April 10, 2026. Software versions: Python 3.10.11, OpenVINO 2025.3, openvino-genai (nightly), llama-cpp-python 0.3.2, optimum-intel latest.