Running an LLM locally looks simple from the outside:
```shell
ollama run llama3
```
or
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
```
But under the hood, a lot happens before you get a single token back. The model must be downloaded, validated, and memory-mapped or loaded into RAM/VRAM; your prompt must be tokenized, run through the layers one forward pass at a time, and decoded back into text, for every token it generates.
The hardware and runtime you choose decide whether this feels instant and interactive—or slow and frustrating.
What this blog covers
- What really happens while loading and running a model locally or in a VM
- Why GPU is usually required for good performance
- Why CPU works but often feels too slow
- The easiest ways to run models locally (with tradeoffs)
- Why people start with Q4 quantized models
- Whether Q4 reduces accuracy (and how much it matters)
- Hidden costs: memory, KV cache, context length, disk I/O
- Practical tips for developers
1. Mental model: how a local LLM actually works
Think of a local LLM system as a pipeline of components working together:
| Component | Responsibility |
|---|---|
| Disk | Stores model weights |
| RAM / VRAM | Holds weights + runtime state |
| CPU / GPU | Executes matrix operations |
| Tokenizer | Converts text ↔ tokens |
| Runtime | Orchestrates execution |
| Sampler | Picks next token |
Two major phases:
Phase 1: Model Loading
- Read model files from disk
- Parse architecture + metadata
- Allocate memory
- Load or map weights
Phase 2: Inference (Generation)
- Convert input → tokens
- Run forward pass through layers
- Generate next token
- Repeat until done
Important: The model never "understands text". It only processes token IDs and matrices.
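The two phases above can be sketched as a minimal runtime loop. Everything here is a toy stand-in, not any real library's API: the "forward pass" is a fake arithmetic placeholder, used only to show the load-once, generate-repeatedly shape of the pipeline.

```python
# Toy sketch of the load-then-generate pipeline. All functions are
# illustrative stand-ins for what a real runtime (llama.cpp, transformers) does.

def load_model(path):
    # Phase 1: read file, parse metadata, allocate memory, map weights.
    return {"layers": 32, "weights": f"mmap({path})"}

def generate(model, token_ids, max_new_tokens=3):
    # Phase 2: repeatedly run a forward pass and append the next token ID.
    out = list(token_ids)
    for _ in range(max_new_tokens):
        next_id = (sum(out) + model["layers"]) % 1000  # fake "forward pass"
        out.append(next_id)
    return out

model = load_model("llama3.gguf")
print(generate(model, [1012, 345]))
```

Note that the model stays loaded across iterations: loading happens once, generation loops.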
2. What happens when you start a model (deep dive)
Let’s go step by step in a real system.
Step 1: Model download & format
You typically download one of:
- `.gguf` → optimized for llama.cpp / Ollama
- `.safetensors` → used by Hugging Face Transformers
- sharded checkpoints → large models split into multiple files
Example sizes:
- 7B FP16 → ~13–14 GB
- 7B Q4 → ~3–4 GB
👉 This is your first tradeoff: size vs quality
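These sizes follow directly from bytes-per-parameter arithmetic. A quick sanity check (parameter counts are approximate; the ~4.5 effective bits per weight for Q4 is an assumption that accounts for the per-block scale factors GGUF quantization formats store alongside the 4-bit values):

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight size: parameters x bits per weight, converted to GB."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # a "7B" model
print(f"FP16: {model_size_gb(n, 16):.1f} GB")   # 14.0 GB
print(f"Q4:   {model_size_gb(n, 4.5):.1f} GB")  # ~3.9 GB
```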
Step 2: Metadata parsing
Before loading weights, the runtime reads metadata:
- number of layers
- hidden dimension size
- attention heads
- rope scaling / context length
- quantization type
This tells the runtime how to interpret the binary weights correctly.
Step 3: Memory allocation strategy
This is where things differ across runtimes.
Possible strategies:
- Full load into RAM (simple but heavy)
- Memory-mapped (mmap) → loads on demand
- GPU offloading → part or all layers on GPU
- Hybrid (CPU + GPU) → common in limited VRAM systems
Example:
```
Layers 1–20  → GPU
Layers 21–32 → CPU
```
This allows larger models to run on smaller GPUs—but introduces latency.
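A hybrid split is ultimately just a per-layer device assignment. A toy sketch of the idea, where the fit-as-many-layers-as-VRAM-allows heuristic and the per-layer size are invented for illustration (real runtimes like llama.cpp expose this as a "number of GPU layers" setting):

```python
def assign_layers(n_layers, vram_mb, mb_per_layer):
    """Place as many layers as fit in VRAM on the GPU; the rest fall back to CPU."""
    gpu_layers = min(n_layers, vram_mb // mb_per_layer)
    return {i: ("gpu" if i < gpu_layers else "cpu") for i in range(n_layers)}

# 32-layer model, 8 GB VRAM, ~400 MB per layer → 20 layers on GPU, 12 on CPU
placement = assign_layers(32, 8000, 400)
print(sum(1 for d in placement.values() if d == "gpu"))  # 20
```

Every token then crosses the CPU/GPU boundary once per pass, which is where the extra latency comes from.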
Step 4: Tokenization
Input text is converted into tokens.
"Build a scalable system"
becomes something like:
[1012, 345, 9821, 442]
Why this matters:
- Token count affects latency
- Token count affects memory (KV cache)
- Token count affects cost (in APIs)
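The IDs above are made up, but the mapping works like this toy vocabulary. Real tokenizers use learned subword units (BPE / SentencePiece), not whole words, so actual IDs and splits will differ:

```python
# Toy word-level tokenizer; real LLMs use subword vocabularies of ~32k-128k entries.
vocab = {"Build": 1012, "a": 345, "scalable": 9821, "system": 442}
inv_vocab = {v: k for k, v in vocab.items()}

def encode(text):
    return [vocab[w] for w in text.split()]

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("Build a scalable system")
print(ids)          # [1012, 345, 9821, 442]
print(decode(ids))  # Build a scalable system
```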
Step 5: Forward pass (core computation)
Each token goes through every transformer layer:
- Embedding lookup
- Self-attention (Q, K, V matrices)
- Softmax + weighted sum
- Feed-forward network
- Normalization
This is repeated for:
- every token in prompt
- every generated token
👉 This is where 99% of compute time goes
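The attention arithmetic inside each layer boils down to dot products, a softmax, and a weighted sum. A dependency-free sketch for a single query attending over cached keys and values; real models add learned projection matrices, many heads, and thousands of dimensions, but the shape of the computation is this:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Single-head attention for one token: score, softmax, weighted sum."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print([round(x, 2) for x in out])  # [6.7, 3.3]: output leans toward the matching key
```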
Step 6: Sampling
Model outputs probabilities like:
"the" → 0.25
"a" → 0.20
"this" → 0.15
Sampling strategies:
- temperature → randomness
- top-k → restrict choices
- top-p → cumulative probability cutoff
This step determines how creative vs deterministic the output is.
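All three knobs are easy to see in code. A small sketch over the example distribution above; this follows the usual scheme (temperature scaling, then top-k / top-p filtering, then sampling from the renormalized survivors), though real runtimes work on raw logits rather than probabilities:

```python
import math, random

def sample(probs, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Temperature-scale, filter with top-k / top-p, then draw one token."""
    # Temperature rescales log-probs: <1 sharpens the distribution, >1 flattens it.
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    items = sorted(scaled.items(), key=lambda kv: -kv[1])
    if top_k is not None:          # keep only the k most likely tokens
        items = items[:top_k]
    if top_p is not None:          # keep the smallest set covering probability mass p
        total, kept, cum = sum(v for _, v in items), [], 0.0
        for t, v in items:
            kept.append((t, v))
            cum += v / total
            if cum >= top_p:
                break
        items = kept
    z = sum(v for _, v in items)   # renormalize over the survivors
    random.seed(seed)
    r, cum = random.random() * z, 0.0
    for t, v in items:
        cum += v
        if cum >= r:
            return t

probs = {"the": 0.25, "a": 0.20, "this": 0.15}
print(sample(probs, top_k=1))  # top_k=1 is greedy decoding: always "the"
```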
Step 7: Loop continues
- append token
- update KV cache
- run forward pass again
This repeats until:
- max tokens reached
- stop token generated
3. Why GPU is critical (real explanation)
LLMs are fundamentally linear algebra machines.
Core operation:
Matrix × Matrix → Matrix
CPU vs GPU (practical view)
| Feature | CPU | GPU |
|---|---|---|
| Cores | Few (powerful) | Thousands (parallel) |
| Best at | Logic, branching | Matrix math |
| LLM inference | Slow | Fast |
Why GPU wins
Each layer requires:
- millions to billions of multiplications
- highly parallel operations
GPU executes these in parallel → massive speedup.
Real-world intuition
Imagine:
- CPU = 8 workers doing heavy tasks
- GPU = 5000 workers doing small tasks simultaneously
LLMs prefer many small parallel operations → GPU wins.
4. CPU: technically enough, practically limiting
Yes, CPU can:
- load model
- run inference
- generate output
But the problem is latency.
Example
If:
- 1 token = 400 ms
- 80 tokens = 32 seconds
That’s just for generation—not including prompt processing.
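The arithmetic is worth internalizing. Using the 400 ms/token figure from above (the 25 ms/token GPU number is an equally illustrative stand-in):

```python
def generation_time_s(n_tokens, ms_per_token):
    """Total decode time, ignoring prompt processing."""
    return n_tokens * ms_per_token / 1000

print(generation_time_s(80, 400))  # CPU-ish pace: 32.0 seconds
print(generation_time_s(80, 25))   # GPU-ish pace: 2.0 seconds
```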
Where CPU still makes sense
- local experimentation
- offline batch jobs
- CI pipelines
- small models (≤3B)
Where CPU fails
- chat apps
- real-time APIs
- long prompts
- concurrent users
👉 Key insight:
CPU inference is functionally correct, but it rarely delivers a good interactive experience.
5. Memory breakdown (often misunderstood)
Developers often think only about model size.
That’s incorrect.
Memory components
1. Model weights
- largest chunk
- depends on quantization
2. KV Cache (VERY IMPORTANT)
Stores past tokens for attention.
Formula intuition:
Memory ∝ tokens × layers × hidden_size
Long prompts → huge memory growth.
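Putting numbers on that intuition. The layer and dimension values below are typical for a 7B-class model but are assumptions, not any specific model's config; the first factor of 2 covers the separate key and value tensors, the last is bytes per FP16 value:

```python
def kv_cache_bytes(n_tokens, n_layers, hidden_size, bytes_per_val=2):
    # 2 tensors (K and V) per layer, one hidden-size vector each, per token
    return 2 * n_layers * n_tokens * hidden_size * bytes_per_val

gb = kv_cache_bytes(n_tokens=4096, n_layers=32, hidden_size=4096) / 1e9
print(f"{gb:.1f} GB")  # ~2.1 GB for a full 4096-token context
```

Doubling the context doubles this cost, on top of the weights.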
3. Runtime overhead
- buffers
- tensors
- allocator
4. System overhead
- OS
- background processes
👉 Real takeaway:
A 4GB model may require 6–8GB system memory in practice.
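That takeaway can be turned into a rough budget calculator. All the component sizes here are illustrative defaults, not measurements:

```python
def total_memory_gb(weights_gb, kv_gb, runtime_gb=1.0, system_gb=1.5):
    """Rough end-to-end budget: weights + KV cache + runtime buffers + OS headroom."""
    return weights_gb + kv_gb + runtime_gb + system_gb

print(total_memory_gb(weights_gb=4.0, kv_gb=1.0))  # 7.5 GB for a "4 GB" model
```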
6. Why Q4 is the default starting point
Q4 = 4-bit quantization
What it does
- compress weights
- reduce memory
- improve speed
Why developers start with Q4
- fits in laptops
- works without high-end GPU
- faster load time
- "good enough" for most tasks
Example
| Model | FP16 | Q4 |
|---|---|---|
| 7B | ~14GB | ~4GB |
That’s a 3–4× reduction.
7. Does Q4 reduce accuracy?
Short answer: yes, but often acceptable.
What actually degrades
- reasoning depth (slightly)
- long-chain logic
- rare edge cases
What usually stays fine
- chat
- summarization
- coding assistance
- general Q&A
Why this happens
Lower precision → information loss in weights.
But modern quantization methods preserve most signal.
Practical guideline
| Use case | Recommended |
|---|---|
| Laptop dev | Q4 |
| Balanced | Q5/Q6 |
| High quality | Q8 / FP16 |
8. Easiest ways to run models locally
1. Ollama (best starting point)
```shell
ollama run llama3
```
Pros:
- minimal setup
- auto model management
- local API
2. LM Studio
- GUI-based
- good for testing
- built-in server mode
3. llama.cpp
- lightweight
- CPU-friendly
- supports GGUF
4. Transformers (Python)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Build a scalable system", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Pros:
- maximum flexibility
- integration with ML pipelines
Cons:
- more setup
5. vLLM / TGI (production)
Use when:
- serving APIs
- multiple users
- need batching
9. Execution flow diagram
```
prompt text
    │ tokenize
    ▼
token IDs ──► forward pass (all layers) ──► logits ──► sampler ──► next token
    ▲                                                                  │
    └──────────── append token + update KV cache ◄─────────────────────┘
         (repeat until a stop token is generated or max tokens reached)
```
10. Why generation feels slow (core insight)
LLMs generate one token at a time.
Not a full sentence.
Not a paragraph.
This is why:
- latency is visible
- CPU feels painful
- GPU improves experience dramatically
11. Real-world performance comparison
| Setup | Experience |
|---|---|
| CPU VM | Slow, usable for testing |
| Laptop GPU | Smooth, interactive |
| Server GPU | Production ready |
12. Common mistakes developers make
- Ignoring KV cache memory
- Using huge context unnecessarily
- Assuming model size == total memory
- Running large models on weak VMs
- Ignoring disk speed
13. Recommended developer workflow
Step 1: Start simple
- Ollama
- Q4 model
Step 2: Validate use case
- prompts
- latency
Step 3: Measure
- memory
- tokens/sec
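Measuring tokens/sec needs nothing more than a timer around the decode loop. A sketch with a stubbed-out generator; `fake_generate_token` is a placeholder you would replace with a real call into your runtime:

```python
import time

def fake_generate_token():
    # Stand-in for one real decode step; replace with your runtime's call.
    time.sleep(0.01)
    return "tok"

def measure_tokens_per_sec(n_tokens=20):
    start = time.perf_counter()
    for _ in range(n_tokens):
        fake_generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

print(f"{measure_tokens_per_sec():.0f} tokens/sec")  # ~100 with the 10 ms stub
```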
Step 4: Optimize
- move to GPU
- increase quantization quality
14. Final takeaway
Running an LLM locally is not magic—it’s a pipeline of:
- loading weights
- running matrix operations
- generating tokens sequentially
CPU can run models, but GPU makes them usable.
Q4 makes models accessible, even on limited hardware.
The best approach:
Start small → validate → scale hardware → optimize
Focus less on infrastructure perfection and more on building real use cases.
