Diagnose & Fix Painfully Slow Ollama: 4 Essential Debugging Techniques + Fixes

Rijul Rajesh

Nowadays, more and more people are running local LLMs and using them for day-to-day tasks.
However, the more work you push through a local model, the more likely you are to notice performance issues.

Instead of directly blaming your hardware, you can perform some basic checks to get the best performance out of your system.

Start with the Basics: Is Heat Killing Your Ollama Speed?

Okay, let's start simple. If you're on an NVIDIA GPU (the most common setup for Ollama), open a second terminal and run this while Ollama generates text:

nvidia-smi -l 1

The -l 1 refreshes every second so you see live changes.
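
If you'd rather watch just the numbers that matter instead of the full dashboard, nvidia-smi also has a query mode. Here's a minimal sketch; run nvidia-smi --help-query-gpu to see which fields your driver supports:

nvidia-smi --query-gpu=temperature.gpu,pstate,power.draw,power.limit,utilization.gpu,memory.used --format=csv -l 1

This prints one CSV row per second, which makes trends much easier to spot than eyeballing the full table.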

When the system is idle, the readings will look fairly relaxed. Here’s what you should pay attention to:

  • A temperature around 52°C means the GPU is cool and doing fine
  • P8 indicates a low-power, almost idle state
  • 355MiB out of 4096MiB memory usage means there’s plenty of VRAM available
  • 13% GPU utilization shows the GPU isn’t under much load

Now let’s put some pressure on it. In the worst case, you might see something like this:

  • Temp: 86°C – Danger zone. NVIDIA laptops usually start throttling around ~85–87°C to protect the hardware
  • Perf: P0 – Maximum performance state (which is good), but combined with high temperature, clock speeds get pulled back
  • Pwr: 42W/30W – Drawing more than the power cap, so it’s both power-limited and thermally limited
  • GPU-Util: 100% – Working hard, but tokens/sec feels slow because clocks have dropped
  • Memory: Nearly full – VRAM pressure can add extra slowdown due to swapping

Once your GPU hits around 85°C, the firmware steps in, decides it’s too hot, and reduces clock speeds. You still see 100% utilization (the GPU is trying), but the actual compute speed drops sharply. As a result, tokens/sec can fall from 35 to 12, even though nvidia-smi still looks “busy.”
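
You also don't have to guess whether heat or power is doing the limiting. The driver reports it directly; in the output of the command below, look for the "Clocks Throttle Reasons" section (called "Clocks Event Reasons" on newer drivers), where entries like SW Thermal Slowdown and SW Power Cap flip to Active while the GPU is being held back:

nvidia-smi -q -d PERFORMANCE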

What do these Perf states mean?

Perf State   Meaning                    What it tells you
P0           Highest performance mode   GPU is allowed to run at full speed
P2           High performance           Slightly reduced, still strong
P5           Medium performance         Balanced / light workload
P8           Idle / low power           GPU is mostly resting

Lower numbers mean higher performance.

So P0 is fast, P8 is idle — but even in P0, heat and power limits can still slow things down.
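
You can watch this happen live by putting the performance state next to the actual clock speed. If pstate shows P0 but clocks.sm sits well below clocks.max.sm during generation, the GPU is throttling even though it's in its fastest state (a sketch; field names assume a reasonably recent driver):

nvidia-smi --query-gpu=pstate,clocks.sm,clocks.max.sm,temperature.gpu --format=csv -l 1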

Fixes you can try

  • Use a cooling pad – Can drop temperatures by 5–8°C almost immediately
  • Elevate the laptop – Raise the back by 1–2 inches to improve airflow

Target zone: Keep temperatures between 75–82°C with less than 85% VRAM usage. In this range, performance is generally smooth and stable.
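
To verify a fix is actually working, it helps to log readings to a file while running the same prompt before and after the change (thermal_log.csv is just an example filename):

nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,memory.used --format=csv -l 2 > thermal_log.csv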


Dive Into the Model: Check and Understand Its Quantization

Another thing worth checking is whether your model is quantized to the right level.

At a high level, an LLM is just a huge collection of numbers called weights. These numbers decide how the model thinks and generates text. The number of bits tells the computer how precisely each of those numbers is stored.

By default, LLMs store weights in FP16, which means 16 bits are used for every number. This gives very high precision, but it also uses a lot of memory and compute power.

Quantization reduces this precision by storing those numbers using fewer bits. The model becomes smaller and faster, at the cost of a tiny loss in accuracy.

  • FP16 → 16 bits (original size)
  • Q8_0 → 8 bits
  • Q4_K_M → 4 bits (sweet spot)
  • Q2_K → 2 bits
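
As a rough back-of-the-envelope check, the weights alone need about (parameter count × bits per weight) / 8 bytes; the KV cache and runtime overhead come on top. For an 8B model:

FP16:   8B × 16 bits ≈ 16 GB
Q8_0:   8B ×  8 bits ≈  8 GB
Q4_K_M: 8B ×  4 bits ≈  4–5 GB
Q2_K:   8B ×  2 bits ≈  2–3 GB

K-quants mix precisions internally, so real files run slightly larger than the nominal bit count suggests, but the estimate shows why quantization often decides whether a model fits in VRAM at all.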

You can usually see the quantization level directly in the model name. For example, in Ollama:

llama3:8b-q4_K_M    ← 8B params, Q4_K_M quant
mistral:7b-q8_0     ← 7B params, Q8_0 quant
gemma2:9b           ← 9B params, quant not shown in the tag

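If a tag doesn't spell out the quantization (like the gemma2 example above), you can check what's actually on disk; ollama show prints the model's details, including its quantization level (the exact layout varies by Ollama version):

ollama show gemma2:9b

Note that bare library tags like gemma2:9b typically resolve to a 4-bit quant by default rather than full FP16.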

The table below makes the trade-offs easier to understand:

Format   Bits   VRAM      Quality        Best For
Q4_K_M   4      Low       Great          Most users
Q5_K_M   5      Medium    Excellent      Quality focus
Q8_0     8      High      Near-perfect   Precision tasks
FP16     16     Highest   Original       Big GPUs only
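
To act on this table, pull the quant that fits your VRAM budget and compare sizes on disk (using the llama3 tag from earlier as an example):

ollama pull llama3:8b-q4_K_M
ollama list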

Anything Cacheable? Utilize KV Caches Effectively

Read the rest of the article here
