Running an LLM locally looks simple from the outside:
```shell
ollama run llama3
```
or
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
```
But under the hood, a lot happens before you get a single token back. The model must be downloaded, validated, and memory-mapped or loaded into RAM/VRAM; your prompt must be tokenized, run through the layers one forward pass at a time, and decoded back into text, for every token it generates.
The hardware and runtime you choose decide whether this feels instant and interactive—or slow and frustrating.
What this blog covers
- What really happens while loading and running a model locally or in a VM
- Why GPU is usually required for good performance
- Why CPU works but often feels too slow
- The easiest ways to run models locally (with tradeoffs)
- Why people start with Q4 quantized models
- Whether Q4 reduces accuracy (and how much it matters)
- Hidden costs: memory, KV cache, context length, disk I/O
- Practical tips for developers
1. Mental model: how a local LLM actually works
Think of a local LLM system as a pipeline of components working together:
| Component | Responsibility |
|---|---|
| Disk | Stores model weights |
| RAM / VRAM | Holds weights + runtime state |
| CPU / GPU | Executes matrix operations |
| Tokenizer | Converts text ↔ tokens |
| Runtime | Orchestrates execution |
| Sampler | Picks next token |
Two major phases:
Phase 1: Model Loading
- Read model files from disk
- Parse architecture + metadata
- Allocate memory
- Load or map weights
Phase 2: Inference (Generation)
- Convert input → tokens
- Run forward pass through layers
- Generate next token
- Repeat until done
Important: The model never "understands text". It only processes token IDs and matrices.
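The two phases above can be sketched as a minimal runtime loop. Everything here is a toy stand-in, not any real library's API: the "forward pass" is a fake arithmetic placeholder, used only to show the load-once, generate-repeatedly shape of the pipeline.

```python
# Toy sketch of the load-then-generate pipeline. All functions are
# illustrative stand-ins for what a real runtime (llama.cpp, transformers) does.

def load_model(path):
    # Phase 1: read file, parse metadata, allocate memory, map weights.
    return {"layers": 32, "weights": f"mmap({path})"}

def generate(model, token_ids, max_new_tokens=3):
    # Phase 2: repeatedly run a forward pass and append the next token ID.
    out = list(token_ids)
    for _ in range(max_new_tokens):
        next_id = (sum(out) + model["layers"]) % 1000  # fake "forward pass"
        out.append(next_id)
    return out

model = load_model("llama3.gguf")
print(generate(model, [1012, 345]))
```

Note that the model stays loaded across iterations: loading happens once, generation loops.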
2. What happens when you start a model (deep dive)
Let’s go step by step in a real system.
Step 1: Model download & format
You typically download one of:
- `.gguf` → optimized for llama.cpp / Ollama
- `.safetensors` → used by Hugging Face Transformers
- sharded checkpoints → large models split into multiple files
Example sizes:
- 7B FP16 → ~13–14 GB
- 7B Q4 → ~3–4 GB
👉 This is your first tradeoff: size vs quality
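These sizes follow directly from bytes-per-parameter arithmetic. A quick sanity check (parameter counts are approximate; the ~4.5 effective bits per weight for Q4 is an assumption that accounts for the per-block scale factors GGUF quantization formats store alongside the 4-bit values):

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight size: parameters x bits per weight, converted to GB."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # a "7B" model
print(f"FP16: {model_size_gb(n, 16):.1f} GB")   # 14.0 GB
print(f"Q4:   {model_size_gb(n, 4.5):.1f} GB")  # ~3.9 GB
```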
Step 2: Metadata parsing
Before loading weights, the runtime reads metadata:
- number of layers
- hidden dimension size
- attention heads
- rope scaling / context length
- quantization type
This tells the runtime how to interpret the binary weights correctly.
Step 3: Memory allocation strategy
This is where things differ across runtimes.
Possible strategies:
- Full load into RAM (simple but heavy)
- Memory-mapped (mmap) → loads on demand
- GPU offloading → part or all layers on GPU
- Hybrid (CPU + GPU) → common in limited VRAM systems
Example:
```
Layers 1–20  → GPU
Layers 21–32 → CPU
```
This allows larger models to run on smaller GPUs—but introduces latency.
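A hybrid split is ultimately just a per-layer device assignment. A toy sketch of the idea, where the fit-as-many-layers-as-VRAM-allows heuristic and the per-layer size are invented for illustration (real runtimes like llama.cpp expose this as a "number of GPU layers" setting):

```python
def assign_layers(n_layers, vram_mb, mb_per_layer):
    """Place as many layers as fit in VRAM on the GPU; the rest fall back to CPU."""
    gpu_layers = min(n_layers, vram_mb // mb_per_layer)
    return {i: ("gpu" if i < gpu_layers else "cpu") for i in range(n_layers)}

# 32-layer model, 8 GB VRAM, ~400 MB per layer → 20 layers on GPU, 12 on CPU
placement = assign_layers(32, 8000, 400)
print(sum(1 for d in placement.values() if d == "gpu"))  # 20
```

Every token then crosses the CPU/GPU boundary once per pass, which is where the extra latency comes from.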
Step 4: Tokenization
Input text is converted into tokens.
"Build a scalable system"
becomes something like:
[1012, 345, 9821, 442]
Why this matters:
- Token count affects latency
- Token count affects memory (KV cache)
- Token count affects cost (in APIs)
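The IDs above are made up, but the mapping works like this toy vocabulary. Real tokenizers use learned subword units (BPE / SentencePiece), not whole words, so actual IDs and splits will differ:

```python
# Toy word-level tokenizer; real LLMs use subword vocabularies of ~32k-128k entries.
vocab = {"Build": 1012, "a": 345, "scalable": 9821, "system": 442}
inv_vocab = {v: k for k, v in vocab.items()}

def encode(text):
    return [vocab[w] for w in text.split()]

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("Build a scalable system")
print(ids)          # [1012, 345, 9821, 442]
print(decode(ids))  # Build a scalable system
```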
Step 5: Forward pass (core computation)
Each token goes through every transformer layer:
- Embedding lookup
- Self-attention (Q, K, V matrices)
- Softmax + weighted sum
- Feed-forward network
- Normalization
This is repeated for:
- every token in prompt
- every generated token
👉 This is where 99% of compute time goes
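The attention arithmetic inside each layer boils down to dot products, a softmax, and a weighted sum. A dependency-free sketch for a single query attending over cached keys and values; real models add learned projection matrices, many heads, and thousands of dimensions, but the shape of the computation is this:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Single-head attention for one token: score, softmax, weighted sum."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print([round(x, 2) for x in out])  # [6.7, 3.3]: output leans toward the matching key
```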
Step 6: Sampling
Model outputs probabilities like:
"the" → 0.25
"a" → 0.20
"this" → 0.15
Sampling strategies:
- temperature → randomness
- top-k → restrict choices
- top-p → cumulative probability cutoff
This step determines how creative vs deterministic the output is.
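All three knobs are easy to see in code. A small sketch over the example distribution above; this follows the usual scheme (temperature scaling, then top-k / top-p filtering, then sampling from the renormalized survivors), though real runtimes work on raw logits rather than probabilities:

```python
import math, random

def sample(probs, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Temperature-scale, filter with top-k / top-p, then draw one token."""
    # Temperature rescales log-probs: <1 sharpens the distribution, >1 flattens it.
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    items = sorted(scaled.items(), key=lambda kv: -kv[1])
    if top_k is not None:          # keep only the k most likely tokens
        items = items[:top_k]
    if top_p is not None:          # keep the smallest set covering probability mass p
        total, kept, cum = sum(v for _, v in items), [], 0.0
        for t, v in items:
            kept.append((t, v))
            cum += v / total
            if cum >= top_p:
                break
        items = kept
    z = sum(v for _, v in items)   # renormalize over the survivors
    random.seed(seed)
    r, cum = random.random() * z, 0.0
    for t, v in items:
        cum += v
        if cum >= r:
            return t

probs = {"the": 0.25, "a": 0.20, "this": 0.15}
print(sample(probs, top_k=1))  # top_k=1 is greedy decoding: always "the"
```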
Step 7: Loop continues
- append token
- update KV cache
- run forward pass again
This repeats until:
- max tokens reached
- stop token generated
3. Why GPU is critical (real explanation)
LLMs are fundamentally linear algebra machines.
Core operation:
Matrix × Matrix → Matrix
CPU vs GPU (practical view)
| Feature | CPU | GPU |
|---|---|---|
| Cores | Few (powerful) | Thousands (parallel) |
| Best at | Logic, branching | Matrix math |
| LLM inference | Slow | Fast |
Why GPU wins
Each layer requires:
- millions to billions of multiplications
- highly parallel operations
GPU executes these in parallel → massive speedup.
Real-world intuition
Imagine:
- CPU = 8 workers doing heavy tasks
- GPU = 5000 workers doing small tasks simultaneously
LLMs prefer many small parallel operations → GPU wins.
4. CPU: technically enough, practically limiting
Yes, CPU can:
- load model
- run inference
- generate output
But the problem is latency.
Example
If:
- 1 token = 400 ms
- 80 tokens = 32 seconds
That’s just for generation—not including prompt processing.
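The arithmetic is worth internalizing. Using the 400 ms/token figure from above (the 25 ms/token GPU number is an equally illustrative stand-in):

```python
def generation_time_s(n_tokens, ms_per_token):
    """Total decode time, ignoring prompt processing."""
    return n_tokens * ms_per_token / 1000

print(generation_time_s(80, 400))  # CPU-ish pace: 32.0 seconds
print(generation_time_s(80, 25))   # GPU-ish pace: 2.0 seconds
```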
Where CPU still makes sense
- local experimentation
- offline batch jobs
- CI pipelines
- small models (≤3B)
Where CPU fails
- chat apps
- real-time APIs
- long prompts
- concurrent users
👉 Key insight:
CPU inference is functionally correct, but it rarely delivers a good interactive experience.
5. Memory breakdown (often misunderstood)
Developers often think only about model size.
That’s incorrect.
Memory components
1. Model weights
- largest chunk
- depends on quantization
2. KV Cache (VERY IMPORTANT)
Stores past tokens for attention.
Formula intuition:
Memory ∝ tokens × layers × hidden_size
Long prompts → huge memory growth.
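Putting numbers on that intuition. The layer and dimension values below are typical for a 7B-class model but are assumptions, not any specific model's config; the first factor of 2 covers the separate key and value tensors, the last is bytes per FP16 value:

```python
def kv_cache_bytes(n_tokens, n_layers, hidden_size, bytes_per_val=2):
    # 2 tensors (K and V) per layer, one hidden-size vector each, per token
    return 2 * n_layers * n_tokens * hidden_size * bytes_per_val

gb = kv_cache_bytes(n_tokens=4096, n_layers=32, hidden_size=4096) / 1e9
print(f"{gb:.1f} GB")  # ~2.1 GB for a full 4096-token context
```

Doubling the context doubles this cost, on top of the weights.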
3. Runtime overhead
- buffers
- tensors
- allocator
4. System overhead
- OS
- background processes
👉 Real takeaway:
A 4GB model may require 6–8GB system memory in practice.
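That takeaway can be turned into a rough budget calculator. All the component sizes here are illustrative defaults, not measurements:

```python
def total_memory_gb(weights_gb, kv_gb, runtime_gb=1.0, system_gb=1.5):
    """Rough end-to-end budget: weights + KV cache + runtime buffers + OS headroom."""
    return weights_gb + kv_gb + runtime_gb + system_gb

print(total_memory_gb(weights_gb=4.0, kv_gb=1.0))  # 7.5 GB for a "4 GB" model
```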
6. Why Q4 is the default starting point
Q4 = 4-bit quantization
What it does
- compress weights
- reduce memory
- improve speed
Why developers start with Q4
- fits in laptops
- works without high-end GPU
- faster load time
- "good enough" for most tasks
Example
| Model | FP16 | Q4 |
|---|---|---|
| 7B | ~14GB | ~4GB |
That’s a 3–4× reduction.
7. Does Q4 reduce accuracy?
Short answer: yes, but often acceptable.
What actually degrades
- reasoning depth (slightly)
- long-chain logic
- rare edge cases
What usually stays fine
- chat
- summarization
- coding assistance
- general Q&A
Why this happens
Lower precision → information loss in weights.
But modern quantization methods preserve most signal.
Practical guideline
| Use case | Recommended |
|---|---|
| Laptop dev | Q4 |
| Balanced | Q5/Q6 |
| High quality | Q8 / FP16 |
8. Easiest ways to run models locally
1. Ollama (best starting point)
```shell
ollama run llama3
```
Pros:
- minimal setup
- auto model management
- local API
2. LM Studio
- GUI-based
- good for testing
- built-in server mode
3. llama.cpp
- lightweight
- CPU-friendly
- supports GGUF
4. Transformers (Python)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Build a scalable system", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Pros:
- maximum flexibility
- integration with ML pipelines
Cons:
- more setup
5. vLLM / TGI (production)
Use when:
- serving APIs
- multiple users
- need batching
9. Execution flow diagram
```
prompt text
    │ tokenize
    ▼
token IDs ──► forward pass (all layers) ──► logits ──► sampler ──► next token
    ▲                                                                  │
    └──────────── append token + update KV cache ◄─────────────────────┘
         (repeat until a stop token is generated or max tokens reached)
```
10. Why generation feels slow (core insight)
LLMs generate one token at a time.
Not a full sentence.
Not a paragraph.
This is why:
- latency is visible
- CPU feels painful
- GPU improves experience dramatically
11. Real-world performance comparison
| Setup | Experience |
|---|---|
| CPU VM | Slow, usable for testing |
| Laptop GPU | Smooth, interactive |
| Server GPU | Production ready |
12. Common mistakes developers make
- Ignoring KV cache memory
- Using huge context unnecessarily
- Assuming model size == total memory
- Running large models on weak VMs
- Ignoring disk speed
13. Recommended developer workflow
Step 1: Start simple
- Ollama
- Q4 model
Step 2: Validate use case
- prompts
- latency
Step 3: Measure
- memory
- tokens/sec
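Measuring tokens/sec needs nothing more than a timer around the decode loop. A sketch with a stubbed-out generator; `fake_generate_token` is a placeholder you would replace with a real call into your runtime:

```python
import time

def fake_generate_token():
    # Stand-in for one real decode step; replace with your runtime's call.
    time.sleep(0.01)
    return "tok"

def measure_tokens_per_sec(n_tokens=20):
    start = time.perf_counter()
    for _ in range(n_tokens):
        fake_generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

print(f"{measure_tokens_per_sec():.0f} tokens/sec")  # ~100 with the 10 ms stub
```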
Step 4: Optimize
- move to GPU
- increase quantization quality
14. Final takeaway
Running an LLM locally is not magic—it’s a pipeline of:
- loading weights
- running matrix operations
- generating tokens sequentially
CPU can run models, but GPU makes them usable.
Q4 makes models accessible, even on limited hardware.
The best approach:
Start small → validate → scale hardware → optimize
Focus less on infrastructure perfection and more on building real use cases.
