Akshit Zatakia

From ollama run to Tokens: What Really Happens When You Run an LLM Locally

Running an LLM locally looks simple from the outside:

ollama run llama3

or

from transformers import AutoModelForCausalLM, AutoTokenizer

But under the hood, a lot happens before you get a single token back. The model must be downloaded, validated, memory-mapped or loaded into RAM/VRAM, tokenized, executed layer-by-layer, and then decoded back into text—for every token it generates.

The hardware and runtime you choose decide whether this feels instant and interactive—or slow and frustrating.


What this blog covers

  • What really happens while loading and running a model locally or in a VM
  • Why GPU is usually required for good performance
  • Why CPU works but often feels too slow
  • The easiest ways to run models locally (with tradeoffs)
  • Why people start with Q4 quantized models
  • Whether Q4 reduces accuracy (and how much it matters)
  • Hidden costs: memory, KV cache, context length, disk I/O
  • Practical tips for developers

1. Mental model: how a local LLM actually works

Think of a local LLM system as a pipeline of components working together:

| Component | Responsibility |
|---|---|
| Disk | Stores model weights |
| RAM / VRAM | Holds weights + runtime state |
| CPU / GPU | Executes matrix operations |
| Tokenizer | Converts text ↔ tokens |
| Runtime | Orchestrates execution |
| Sampler | Picks next token |

Two major phases:

Phase 1: Model Loading

  • Read model files from disk
  • Parse architecture + metadata
  • Allocate memory
  • Load or map weights

Phase 2: Inference (Generation)

  • Convert input → tokens
  • Run forward pass through layers
  • Generate next token
  • Repeat until done

Important: The model never "understands text". It only processes token IDs and matrices.


2. What happens when you start a model (deep dive)

Let’s go step by step in a real system.

Step 1: Model download & format

You typically download one of:

  • .gguf → optimized for llama.cpp / Ollama
  • .safetensors → used by Transformers
  • sharded checkpoints → large models split into multiple files

Example sizes:

  • 7B FP16 → ~13–14 GB
  • 7B Q4 → ~3–4 GB

👉 This is your first tradeoff: size vs quality
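The sizes above are just parameter count × bytes per parameter. A rough back-of-envelope calculator (byte counts are approximations; real GGUF Q4 variants such as Q4_K_M carry extra per-block scale metadata):

```python
# Approximate bytes per parameter for each storage format.
# Real quantized files add per-block scale/offset metadata on top.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8": 1.0,
    "q4": 0.5,
}

def model_size_gb(n_params: float, dtype: str) -> float:
    """Approximate on-disk weight size in GiB (1 GiB = 2**30 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30

print(f"7B FP16 ≈ {model_size_gb(7e9, 'fp16'):.1f} GB")  # ≈ 13.0 GB
print(f"7B Q4   ≈ {model_size_gb(7e9, 'q4'):.1f} GB")    # ≈ 3.3 GB
```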


Step 2: Metadata parsing

Before loading weights, the runtime reads metadata:

  • number of layers
  • hidden dimension size
  • attention heads
  • rope scaling / context length
  • quantization type

This tells the runtime how to interpret the binary weights correctly.


Step 3: Memory allocation strategy

This is where things differ across runtimes.

Possible strategies:

  • Full load into RAM (simple but heavy)
  • Memory-mapped (mmap) → loads on demand
  • GPU offloading → part or all layers on GPU
  • Hybrid (CPU + GPU) → common in limited VRAM systems

Example:

Layer 1–20 → GPU
Layer 21–32 → CPU

This allows larger models to run on smaller GPUs—but introduces latency.
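The split decision itself is simple greedy arithmetic. This is an illustrative sketch, not the actual logic of any specific runtime (real runtimes also budget VRAM for the KV cache, embeddings, and output head):

```python
def split_layers(n_layers: int, layer_bytes: int, vram_budget: int):
    """Place as many layers as fit on the GPU; the rest stay on CPU.
    Illustrative only: real runtimes also reserve VRAM for the
    KV cache, embeddings, and output head."""
    gpu_layers = min(n_layers, vram_budget // layer_bytes)
    return gpu_layers, n_layers - gpu_layers

# 32 layers of ~190 MB each, 4 GB of usable VRAM:
gpu, cpu = split_layers(32, 190 * 2**20, 4 * 2**30)
print(f"GPU: {gpu} layers, CPU: {cpu} layers")  # GPU: 21 layers, CPU: 11 layers
```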


Step 4: Tokenization

Input text is converted into tokens.

"Build a scalable system"

becomes something like:

[1012, 345, 9821, 442]

Why this matters:

  • Token count affects latency
  • Token count affects memory (KV cache)
  • Token count affects cost (in APIs)
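Real tokenizers use subword schemes like BPE, but the text → IDs → text round trip looks the same as this whitespace-level toy:

```python
class ToyTokenizer:
    """Whitespace 'tokenizer' for illustration only.
    Real LLMs use subword BPE vocabularies with tens of
    thousands of entries, but the encode/decode contract
    is the same: text in, integer IDs out, and back."""
    def __init__(self, vocab):
        self.id_of = {tok: i for i, tok in enumerate(vocab)}
        self.tok_of = {i: tok for tok, i in self.id_of.items()}

    def encode(self, text):
        return [self.id_of[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.tok_of[i] for i in ids)

tok = ToyTokenizer(["Build", "a", "scalable", "system"])
ids = tok.encode("Build a scalable system")
print(ids)              # [0, 1, 2, 3]
print(tok.decode(ids))  # Build a scalable system
```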

Step 5: Forward pass (core computation)

Each token goes through every transformer layer:

  1. Embedding lookup
  2. Self-attention (Q, K, V matrices)
  3. Softmax + weighted sum
  4. Feed-forward network
  5. Normalization

This is repeated for:

  • every token in prompt
  • every generated token

👉 This is where the vast majority of compute time goes
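Steps 2 and 3 above (attention scores, softmax, weighted sum) can be sketched in a few lines of plain Python for a single query vector and tiny dimensions; real implementations do this with batched matrix multiplies across many heads:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, keys, values):
    """Scaled dot-product attention for ONE query vector:
    scores = q·K / sqrt(d), softmax, then a weighted sum of values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, keys, values))  # weighted toward the first value vector
```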


Step 6: Sampling

Model outputs probabilities like:

"the" → 0.25
"a" → 0.20
"this" → 0.15

Sampling strategies:

  • temperature → randomness
  • top-k → restrict choices
  • top-p → cumulative probability cutoff

This step determines how creative vs deterministic the output is.
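All three strategies can be combined in one small function. This is an illustrative sketch: real runtimes apply temperature to the logits before the softmax, which is mathematically equivalent to the `p ** (1/T)` renormalization used here.

```python
import random

def sample_next(probs, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick the next token from a {token: probability} map.
    temperature reshapes the distribution; top_k / top_p truncate it."""
    items = sorted(probs.items(), key=lambda kv: -kv[1])
    if top_k:
        items = items[:top_k]                 # keep the k most likely tokens
    if top_p:
        kept, cum = [], 0.0
        for token, p in items:                # keep tokens until cumulative
            kept.append((token, p))           # probability reaches top_p
            cum += p
            if cum >= top_p:
                break
        items = kept
    # temperature: p ** (1/T), then sample proportionally to the new weights
    weights = [p ** (1.0 / temperature) for _, p in items]
    r = rng.random() * sum(weights)
    for (token, _), w in zip(items, weights):
        r -= w
        if r <= 0:
            return token
    return items[-1][0]

probs = {"the": 0.25, "a": 0.20, "this": 0.15}
print(sample_next(probs, temperature=0.01))  # near-greedy: almost always "the"
```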


Step 7: Loop continues

  • append token
  • update KV cache
  • run forward pass again

This repeats until:

  • max tokens reached
  • stop token generated
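The whole loop fits in a dozen lines. The "model" below is a toy stand-in (it just predicts `previous id + 1`) so the structure is visible without any ML machinery; a real runtime replaces `EchoModel` and `argmax_sample` with actual forward passes and sampling:

```python
class EchoModel:
    """Toy stand-in for a real model: 'predicts' previous id + 1
    and keeps a list of seen ids as its 'KV cache'."""
    def prefill(self, ids):
        return list(ids)                  # process the whole prompt once
    def step(self, last_id, cache):
        return last_id + 1, cache + [last_id]

def argmax_sample(logits):
    return logits                         # toy: 'logits' is already an id

def generate(model, prompt_ids, max_tokens=5, stop_id=None):
    ids = list(prompt_ids)
    cache = model.prefill(ids)            # phase 1: prompt processing
    out = []
    for _ in range(max_tokens):           # phase 2: one token per step
        logits, cache = model.step(ids[-1], cache)
        nxt = argmax_sample(logits)
        if nxt == stop_id:                # stop token generated
            break
        ids.append(nxt)                   # append token, cache updated above
        out.append(nxt)
    return out                            # or: max tokens reached

print(generate(EchoModel(), [10], max_tokens=3))  # [11, 12, 13]
```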

3. Why GPU is critical (real explanation)

LLMs are fundamentally linear algebra machines.

Core operation:

Matrix × Matrix → Matrix

CPU vs GPU (practical view)

| Feature | CPU | GPU |
|---|---|---|
| Cores | Few (powerful) | Thousands (parallel) |
| Best at | Logic, branching | Matrix math |
| LLM inference | Slow | Fast |

Why GPU wins

Each layer requires:

  • millions to billions of multiplications
  • highly parallel operations

GPU executes these in parallel → massive speedup.


Real-world intuition

Imagine:

  • CPU = 8 workers doing heavy tasks
  • GPU = 5000 workers doing small tasks simultaneously

LLMs prefer many small parallel operations → GPU wins.


4. CPU: technically enough, practically limiting

Yes, CPU can:

  • load model
  • run inference
  • generate output

But the problem is latency.

Example

If:

  • 1 token = 400 ms
  • 80 tokens = 32 seconds

That’s just for generation—not including prompt processing.
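The arithmetic behind that example, plus the same prompt on a hypothetical GPU at 25 ms/token for contrast:

```python
def generation_time_s(n_tokens: int, ms_per_token: float) -> float:
    """Wall-clock time for the decode phase only; prompt
    processing adds more time on top."""
    return n_tokens * ms_per_token / 1000

print(generation_time_s(80, 400))  # 32.0 seconds on the CPU above
print(generation_time_s(80, 25))   # 2.0 seconds at a GPU-like 25 ms/token
```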

Where CPU still makes sense

  • local experimentation
  • offline batch jobs
  • CI pipelines
  • small models (≤3B)

Where CPU fails

  • chat apps
  • real-time APIs
  • long prompts
  • concurrent users

👉 Key insight:

CPU is functionally correct, but not experience-friendly.


5. Memory breakdown (often misunderstood)

Developers often think only about model size.

That’s incorrect.

Memory components

1. Model weights

  • largest chunk
  • depends on quantization

2. KV Cache (VERY IMPORTANT)

Stores past tokens for attention.

Formula intuition:

Memory ∝ tokens × layers × hidden_size

Long prompts → huge memory growth.
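Making the intuition concrete: the cache stores one K and one V vector per token, per layer. A calculator using Llama-2-7B-like dimensions (32 layers, 32 KV heads, head size 128, FP16 cache) shows how fast it grows with context length:

```python
def kv_cache_bytes(tokens, layers, n_kv_heads, head_dim, bytes_per_val=2):
    """KV cache size: one K and one V vector per token, per layer.
    bytes_per_val=2 assumes the cache is stored in FP16."""
    return 2 * tokens * layers * n_kv_heads * head_dim * bytes_per_val

# Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128
for ctx in (512, 4096):
    gb = kv_cache_bytes(ctx, 32, 32, 128) / 2**30
    print(f"{ctx} tokens -> {gb:.2f} GB")  # 0.25 GB -> 2.00 GB
```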


3. Runtime overhead

  • buffers
  • tensors
  • allocator

4. System overhead

  • OS
  • background processes

👉 Real takeaway:

A 4GB model may require 6–8GB system memory in practice.


6. Why Q4 is the default starting point

Q4 = 4-bit quantization

What it does

  • compress weights
  • reduce memory
  • improve speed

Why developers start with Q4

  • fits in laptops
  • works without high-end GPU
  • faster load time
  • "good enough" for most tasks

Example

| Model | FP16 | Q4 |
|---|---|---|
| 7B | ~14 GB | ~4 GB |

That’s a 3–4× reduction.


7. Does Q4 reduce accuracy?

Short answer: yes, but often acceptable.

What actually degrades

  • reasoning depth (slightly)
  • long-chain logic
  • rare edge cases

What usually stays fine

  • chat
  • summarization
  • coding assistance
  • general Q&A

Why this happens

Lower precision → information loss in weights.

But modern quantization methods preserve most signal.


Practical guideline

| Use case | Recommended |
|---|---|
| Laptop dev | Q4 |
| Balanced | Q5 / Q6 |
| High quality | Q8 / FP16 |

8. Easiest ways to run models locally

1. Ollama (best starting point)

ollama run llama3

Pros:

  • minimal setup
  • auto model management
  • local API
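The local API listens on port 11434 while the Ollama server is running. A minimal sketch of calling it from Python with only the standard library (field names follow the Ollama API docs; the prompt text is just an example):

```python
import json
import urllib.request

# Ollama exposes a local HTTP API on port 11434 while `ollama serve`
# (or the desktop app) is running.
payload = {
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "stream": False,  # one JSON object back instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once an Ollama server is running locally:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```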

2. LM Studio

  • GUI-based
  • good for testing
  • built-in server mode

3. llama.cpp

  • lightweight
  • CPU-friendly
  • supports GGUF

4. Transformers (Python)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",  # spread layers across available GPU(s) and CPU
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Pros:

  • maximum flexibility
  • integration with ML pipelines

Cons:

  • more setup

5. vLLM / TGI (production)

Use when:

  • serving APIs
  • multiple users
  • need batching

9. Execution flow diagram

(Figure: model execution flow. Prompt → tokenizer → transformer layers → sampler → next token, looped until done.)


10. Why generation feels slow (core insight)

LLMs generate one token at a time.

Not a full sentence.

Not a paragraph.

This is why:

  • latency is visible
  • CPU feels painful
  • GPU improves experience dramatically

11. Real-world performance comparison

| Setup | Experience |
|---|---|
| CPU VM | Slow, but usable for testing |
| Laptop GPU | Smooth, interactive |
| Server GPU | Production-ready |

12. Common mistakes developers make

  • Ignoring KV cache memory
  • Using huge context unnecessarily
  • Assuming model size == total memory
  • Running large models on weak VMs
  • Ignoring disk speed

13. Recommended developer workflow

Step 1: Start simple

  • Ollama
  • Q4 model

Step 2: Validate use case

  • prompts
  • latency

Step 3: Measure

  • memory
  • tokens/sec

Step 4: Optimize

  • move to GPU
  • increase quantization quality

14. Final takeaway

Running an LLM locally is not magic—it’s a pipeline of:

  • loading weights
  • running matrix operations
  • generating tokens sequentially

CPU can run models, but GPU makes them usable.

Q4 makes models accessible, even on limited hardware.

The best approach:

Start small → validate → scale hardware → optimize

Focus less on infrastructure perfection and more on building real use cases.
