Diagnose & Fix Painfully Slow Ollama: 4 Essential Debugging Techniques + Fixes

Rijul Rajesh

Nowadays, more and more people are running local LLMs and using them for day-to-day tasks.
However, the more work you push through a local model, the more likely you are to notice performance issues.

Instead of directly blaming your hardware, you can perform some basic checks to get the best performance out of your system.

Start with the Basics: Is Heat Killing Your Ollama Speed?

Okay, let's start simple. If you're on an NVIDIA GPU (the most common setup for Ollama), open a second terminal and run this while Ollama generates text:

nvidia-smi -l 1

The -l 1 refreshes every second so you see live changes.
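
If you'd rather watch just the numbers that matter instead of the full dashboard, nvidia-smi also has a query mode. Here's a minimal sketch; run nvidia-smi --help-query-gpu to see which fields your driver supports:

nvidia-smi --query-gpu=temperature.gpu,pstate,power.draw,power.limit,utilization.gpu,memory.used --format=csv -l 1

This prints one CSV row per second, which makes trends much easier to spot than eyeballing the full table.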

When the system is idle, the readings will look fairly relaxed. Here’s what you should pay attention to:

  • A temperature around 52°C means the GPU is cool and doing fine
  • P8 indicates a low-power, almost idle state
  • 355MiB out of 4096MiB memory usage means there’s plenty of VRAM available
  • 13% GPU utilization shows the GPU isn’t under much load

Now let’s put some pressure on it. In the worst case, you might see something like this:

  • Temp: 86°C – Danger zone. NVIDIA laptops usually start throttling around ~85–87°C to protect the hardware
  • Perf: P0 – Maximum performance state (which is good), but combined with high temperature, clock speeds get pulled back
  • Pwr: 42W/30W – Drawing more than the power cap, so it’s both power-limited and thermally limited
  • GPU-Util: 100% – Working hard, but tokens/sec feels slow because clocks have dropped
  • Memory: Nearly full – VRAM pressure can add extra slowdown due to swapping

Once your GPU hits around 85°C, the firmware steps in, decides it’s too hot, and reduces clock speeds. You still see 100% utilization (the GPU is trying), but the actual compute speed drops sharply. As a result, tokens/sec can fall from 35 to 12, even though nvidia-smi still looks “busy.”
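
You also don't have to guess whether heat or power is doing the limiting. The driver reports it directly; in the output of the command below, look for the "Clocks Throttle Reasons" section (called "Clocks Event Reasons" on newer drivers), where entries like SW Thermal Slowdown and SW Power Cap flip to Active while the GPU is being held back:

nvidia-smi -q -d PERFORMANCE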

What do these Perf states mean?

Perf State   Meaning                    What it tells you
P0           Highest performance mode   GPU is allowed to run at full speed
P2           High performance           Slightly reduced, still strong
P5           Medium performance         Balanced / light workload
P8           Idle / low power           GPU is mostly resting

Lower numbers mean higher performance.

So P0 is fast, P8 is idle — but even in P0, heat and power limits can still slow things down.
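
You can watch this happen live by putting the performance state next to the actual clock speed. If pstate shows P0 but clocks.sm sits well below clocks.max.sm during generation, the GPU is throttling even though it's in its fastest state (a sketch; field names assume a reasonably recent driver):

nvidia-smi --query-gpu=pstate,clocks.sm,clocks.max.sm,temperature.gpu --format=csv -l 1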

Fixes you can try

  • Use a cooling pad – Can drop temperatures by 5–8°C almost immediately
  • Elevate the laptop – Raise the back by 1–2 inches to improve airflow

Target zone: Keep temperatures between 75–82°C with less than 85% VRAM usage. In this range, performance is generally smooth and stable.
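
To verify a fix is actually working, it helps to log readings to a file while running the same prompt before and after the change (thermal_log.csv is just an example filename):

nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,memory.used --format=csv -l 2 > thermal_log.csv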


Dive Into the Model: Check and Understand Its Quantization

Another thing worth checking is whether your model is quantized to the right level.

At a high level, an LLM is just a huge collection of numbers called weights. These numbers decide how the model thinks and generates text. The number of bits tells the computer how precisely each of those numbers is stored.

By default, LLMs store weights in FP16, which means 16 bits are used for every number. This gives very high precision, but it also uses a lot of memory and compute power.

Quantization reduces this precision by storing those numbers using fewer bits. The model becomes smaller and faster, at the cost of a tiny loss in accuracy.

  • FP16 → 16 bits (original size)
  • Q8_0 → 8 bits
  • Q4_K_M → 4 bits (sweet spot)
  • Q2_K → 2 bits
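
As a rough back-of-the-envelope check, the weights alone need about (parameter count × bits per weight) / 8 bytes; the KV cache and runtime overhead come on top. For an 8B model:

FP16:   8B × 16 bits ≈ 16 GB
Q8_0:   8B ×  8 bits ≈  8 GB
Q4_K_M: 8B ×  4 bits ≈  4–5 GB
Q2_K:   8B ×  2 bits ≈  2–3 GB

K-quants mix precisions internally, so real files run slightly larger than the nominal bit count suggests, but the estimate shows why quantization often decides whether a model fits in VRAM at all.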

You can usually see the quantization level directly in the model name. For example, in Ollama:

llama3:8b-q4_K_M    ← 8B params, Q4_K_M quant
mistral:7b-q8_0     ← 7B params, Q8_0 quant
gemma2:9b           ← 9B params, quant not shown in the tag

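If a tag doesn't spell out the quantization (like the gemma2 example above), you can check what's actually on disk; ollama show prints the model's details, including its quantization level (the exact layout varies by Ollama version):

ollama show gemma2:9b

Note that bare library tags like gemma2:9b typically resolve to a 4-bit quant by default rather than full FP16.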

The table below makes the trade-offs easier to understand:

Format   Bits   VRAM      Quality        Best For
Q4_K_M   4      Low       Great          Most users
Q5_K_M   5      Medium    Excellent      Quality focus
Q8_0     8      High      Near-perfect   Precision tasks
FP16     16     Highest   Original       Big GPUs only
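
To act on this table, pull the quant that fits your VRAM budget and compare sizes on disk (using the llama3 tag from earlier as an example):

ollama pull llama3:8b-q4_K_M
ollama list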

Anything Cacheable? Utilize KV Caches Effectively

Read the rest of the article here
