More and more people are running local LLMs for day-to-day tasks. But as you push more workloads onto a local model, you may start noticing performance problems. Before blaming your hardware, a few basic checks can help you squeeze the best performance out of your system.
## Start with the Basics: Is Heat Killing Your Ollama Speed?
Okay, let's start simple: if you're on an NVIDIA GPU (the most common setup for Ollama), open a second terminal and run this while Ollama generates text:
```bash
nvidia-smi -l 1
```
The `-l 1` flag refreshes the output every second so you see live changes.
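The full table can be noisy. If you only want the numbers discussed below, `nvidia-smi` can print selected fields instead; these query properties are standard, though exact availability can vary by driver version:

```bash
# Print just temperature, perf state, power, utilization, and VRAM, once per second
nvidia-smi --query-gpu=temperature.gpu,pstate,power.draw,utilization.gpu,memory.used,memory.total \
  --format=csv -l 1
```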
When the system is idle, the readings stay low. Here's what to pay attention to:
- A temperature around 52°C means the GPU is cool and doing fine
- P8 indicates a low-power, almost idle state
- 355MiB out of 4096MiB memory usage means there’s plenty of VRAM available
- 13% GPU utilization shows the GPU isn’t under much load
Now let’s put some pressure on it. In the worst case, you might see something like this:
- Temp: 86°C – Danger zone. NVIDIA laptops usually start throttling around ~85–87°C to protect the hardware
- Perf: P0 – Maximum performance state (which is good), but combined with high temperature, clock speeds get pulled back
- Pwr: 42W/30W – Drawing more than the power cap, so it's both power-limited and thermally limited
- GPU-Util: 100% – Working hard, but tokens/sec feels slow because clocks have dropped
- Memory: Nearly full – VRAM pressure can add extra slowdown due to swapping
Once your GPU hits around 85°C, the firmware steps in, decides it’s too hot, and reduces clock speeds. You still see 100% utilization (the GPU is trying), but the actual compute speed drops sharply. As a result, tokens/sec can fall from 35 to 12, even though nvidia-smi still looks “busy.”
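You don't have to infer throttling from the numbers alone. On most drivers, `nvidia-smi` reports the active slowdown reasons directly; the section name varies slightly across driver versions ("Clocks Throttle Reasons" on older drivers, "Clocks Event Reasons" on newer ones):

```bash
# Dump the performance section; look for "HW Thermal Slowdown" or
# "SW Power Cap" marked Active while tokens are generating
nvidia-smi -q -d PERFORMANCE
```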
### What do these Perf states mean?
| Perf State | Meaning | What it tells you |
|---|---|---|
| P0 | Highest performance mode | GPU is allowed to run at full speed |
| P2 | High performance | Slightly reduced, still strong |
| P5 | Medium performance | Balanced / light workload |
| P8 | Idle / low power | GPU is mostly resting |
Lower numbers mean higher performance.
So P0 is fast, P8 is idle — but even in P0, heat and power limits can still slow things down.
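To see this in action, watch the perf state next to the actual SM clock while a prompt runs. A P0 state with a clock well below the GPU's usual boost frequency is the signature of thermal or power throttling rather than a lack of work:

```bash
# Perf state, current SM clock, and temperature, refreshed each second
nvidia-smi --query-gpu=pstate,clocks.sm,temperature.gpu --format=csv -l 1
```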
### Fixes you can try
- Use a cooling pad – Can drop temperatures by 5–8°C almost immediately
- Elevate the laptop – Raise the back by 1–2 inches to improve airflow
Target zone: Keep temperatures between 75–82°C with less than 85% VRAM usage. In this range, performance is generally smooth and stable.
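If you'd rather not stare at `nvidia-smi` all day, a small watch script can warn you when you drift out of that zone. This is a rough sketch, assuming bash, a single GPU, and the thresholds above:

```bash
#!/usr/bin/env bash
# Warn whenever the GPU leaves the healthy zone (under 82°C, under 85% VRAM)
while true; do
  IFS=', ' read -r temp used total < <(
    nvidia-smi --query-gpu=temperature.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits
  )
  vram_pct=$(( used * 100 / total ))
  (( temp > 82 ))     && echo "WARN: GPU at ${temp}°C, near the throttle range"
  (( vram_pct > 85 )) && echo "WARN: VRAM at ${vram_pct}% (${used}/${total} MiB)"
  sleep 5
done
```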
## Dive Into the Model: Check and Understand Its Quantization
Another thing worth checking is whether your model is quantized to the right level.
At a high level, an LLM is just a huge collection of numbers called weights. These numbers decide how the model thinks and generates text. The number of bits tells the computer how precisely each of those numbers is stored.
By default, LLMs store weights in FP16, which means 16 bits are used for every number. This gives very high precision, but it also uses a lot of memory and compute power.
Quantization reduces this precision by storing those numbers using fewer bits. The model becomes smaller and faster, at the cost of a tiny loss in accuracy.
- FP16 → 16 bits (original size)
- Q8_0 → 8 bits
- Q4_K_M → 4 bits (sweet spot)
- Q2_K → 2 bits (smallest, with noticeable quality loss)
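To see what those bit counts mean in practice, here's the back-of-the-envelope math for an 8B-parameter model. This is weights only: it ignores the KV cache and runtime overhead, and K-quants like Q4_K_M actually average a bit more than their nominal 4 bits per weight:

```bash
# Rough weights-only memory at each precision: params × bits ÷ 8
params=8   # billions of parameters
for entry in "FP16:16" "Q8_0:8" "Q4_K_M:4" "Q2_K:2"; do
  fmt=${entry%%:*}; bits=${entry##*:}
  echo "$fmt: ~$(( params * bits / 8 )) GB"
done
```

So the same 8B model drops from roughly 16 GB in FP16 to around 4–5 GB at Q4_K_M, which is the difference between not fitting and fitting comfortably on an 8 GB GPU.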
You can usually see the quantization level directly in the model name. For example, in Ollama:
```
llama3:8b-q4_K_M   ← 8B params, Q4_K_M quant
mistral:7b-q8_0    ← 7B params, Q8_0 quant
gemma2:9b          ← 9B params, no quant in the tag
```

A tag without a quant suffix doesn't automatically mean FP16, though: Ollama's default tags are usually 4-bit quantized, so it's worth verifying, as shown below.
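If a tag doesn't spell out the quant, you can ask Ollama directly. `ollama show` prints a model's details, including its quantization; the exact output layout varies across Ollama versions:

```bash
# Inspect parameter count and quantization of an installed model
ollama show gemma2:9b
```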
The table below makes the trade-offs easier to understand:
| Format | Bits | VRAM | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | 4 | Low | Great | Most users |
| Q5_K_M | 5 | Medium | Excellent | Quality focus |
| Q8_0 | 8 | High | Near-perfect | Precision tasks |
| FP16 | 16 | Highest | Original | Big GPUs only |
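Switching levels is just a matter of pulling a different tag. The tag below is illustrative, so check the model's tag list on ollama.com first, since not every model publishes every quant:

```bash
# Trade Q8_0 for Q4_K_M to roughly halve the weights' VRAM footprint
ollama pull mistral:7b-instruct-q4_K_M
ollama run mistral:7b-instruct-q4_K_M
```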
## Anything Cacheable? Use KV Caches Effectively
Read the rest of the article here