Run Large Language Models on Your Own PC: A Scientist’s Guide to CPUs, GPUs, RAM, VRAM & Quantization 🚀
“Give me a GPU big enough and a model quantized enough, and I shall inference the world.” — Archimedes, probably
Why read this?
If you’ve ever asked yourself:
“Can I run a GPT‑style model on my rig without mortgaging the cat?”
…this article is for you. We’ll dissect the five hardware pillars that decide whether your local LLM soars or sputters:
Pillar | TL;DR |
---|---|
CPU | General‑purpose brain; great at many things, master of none. |
GPU | Vector/matrix powerhouse; crunches m × x + b millions of times per second. |
RAM | Short‑term memory for all running programs. |
VRAM | GPU‑attached RAM; the model’s penthouse suite. |
Quantization | Shrinks model weights (16 → 8 → 4 bits) so they fit into the suites above. |
1️⃣ CPU vs GPU — Same Goal, Very Different Brains
Feature | CPU | GPU |
---|---|---|
Cores | Few (8‑32) complex cores | Hundreds‑thousands of simple ALUs |
Optimized for | Branching, OS tasks, scalar math | Parallel matrix ops & graphics |
Example workload | Sorting, web browser, OS interrupts | 4096 × 4096 GEMM for a transformer layer |
Why it matters for LLMs | Handles tokenizer, I/O, orchestration | Runs the attention & MLP math |
Key takeaway: You can infer on a CPU, but you’ll wait ✈️. A mid‑range GPU slashes token times from seconds to milliseconds.
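Want to feel that gap rather than take my word for it? Here's a minimal timing sketch, assuming a CUDA‑enabled PyTorch build; the 4096 matrix size and the single warm‑up run are arbitrary choices, not a benchmark.

```python
# Minimal sketch: time the same matrix multiply (GEMM) on CPU vs GPU.
# Assumes a CUDA-enabled PyTorch build; sizes and warm-up are arbitrary choices.
import time
import torch

N = 4096
a = torch.randn(N, N)
b = torch.randn(N, N)

# CPU GEMM
t0 = time.perf_counter()
_ = a @ b
print(f"CPU {N}x{N} GEMM: {time.perf_counter() - t0:.3f} s")

# GPU GEMM (only if a CUDA device is actually present)
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu                 # warm-up: loads kernels, initializes cuBLAS
    torch.cuda.synchronize()          # wait until the GPU is idle before timing
    t0 = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()          # kernels launch asynchronously, so sync again
    print(f"GPU {N}x{N} GEMM: {time.perf_counter() - t0:.3f} s")
```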
2️⃣ RAM — Where Everything Takes a Coffee Break
Role: Holds the model weights and your OS, browser, Spotify, etc.
Recommended for local LLMs
Minimum | Sweet Spot | Power‑User |
---|---|---|
16 GB DDR4/5 | 32‑64 GB | 96 GB+ for massive experiments |
Pro tip: Leave at least 4‑6 GB free for the OS. Linux swap + zram can rescue you in an emergency, but paging 20 GB to disk will feel like dial‑up.
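If you'd rather measure than guess, here's a tiny sketch using `psutil`. The 4.1 GB model size and 6 GB headroom figures are illustrative assumptions, not fixed requirements.

```python
# Minimal sketch: check free system RAM before loading a model.
# Assumes `pip install psutil`; the headroom mirrors the pro tip above.
import psutil

GIB = 1024 ** 3
mem = psutil.virtual_memory()
print(f"Total RAM:     {mem.total / GIB:.1f} GiB")
print(f"Available RAM: {mem.available / GIB:.1f} GiB")

model_file_gib = 4.1   # hypothetical size of a 7-B Q4 GGUF file
headroom_gib = 6       # leave room for the OS and your other apps
if mem.available / GIB < model_file_gib + headroom_gib:
    print("⚠️  Tight fit: expect swapping, or keep fewer layers in RAM.")
```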
3️⃣ VRAM — The GPU’s Penthouse Suite
Role: Stores the active tensors during inference/training. Closer = faster (PCIe ≪ on‑package HBM).
Card tier (2025) | VRAM | What fits?* |
---|---|---|
RTX 4060 / RX 7700 | 8‑12 GB | 3‑4 B models @ 4‑bit |
RTX 4070 Ti SUPER | 16 GB | 7‑8 B @4‑bit or 13‑B @8‑bit w/ offload |
RTX 5090 | 32 GB | 13‑B @ 8‑bit or 30‑34 B @ 4‑bit |
Data‑center Hopper H100 / H200 | 80‑141 GB HBM3 / HBM3e | 70‑B @ 8‑bit with headroom to spare 😎
*Rough rule of thumb: size in bytes ≈ model params × bits per weight / 8, plus headroom for activations and the KV cache.
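That footnote translates directly into a few lines of Python. The 20 % overhead factor below is a loose assumption for activations and the KV cache, not a measured constant.

```python
# Minimal sketch of the rule of thumb above: weights + a fudge factor.
# Real loaders and context lengths differ, so treat the output as an estimate.
def estimated_vram_gb(params_billion: float, bits: int, overhead: float = 0.2) -> float:
    weight_gb = params_billion * bits / 8      # billions of params × bits / 8 = GB of weights
    return weight_gb * (1 + overhead)          # add assumed activation / KV-cache headroom

for params in (7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{params:>2}B @ {bits:>2}-bit ≈ {estimated_vram_gb(params, bits):5.1f} GB")
```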
4️⃣ Quantization — Weight‑Watchers for AI Models
Precision | Memory per parameter | Typical perplexity hit |
---|---|---|
FP16 | 2 bytes | Baseline |
INT8 | 1 byte | +0‑2 pp depending on calibration |
Q4 (4‑bit) | 0.5 byte | +1‑4 pp (still chatty!) |
Real‑world shrink‑ray 🌠
70‑B Llama‑2 FP16 : 140 GB
70‑B Llama‑2 INT8 : 70 GB
70‑B Llama‑2 Q4 : 35 GB ✅ Fits on a single 48 GB workstation card (or a 32 GB RTX 5090 with a few layers offloaded)
Quantized weights + GGUF (llama.cpp) / GPTQ / AWQ loaders = laptop‑level inference.
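As a sketch of what "quantize on load" looks like in practice, here's a 4‑bit load via Hugging Face transformers + bitsandbytes. It assumes you've installed transformers, accelerate and bitsandbytes, and that you have access to the gated Llama‑2 checkpoint named below; any causal‑LM repo id works the same way.

```python
# Minimal sketch: load a model in 4-bit on the fly with bitsandbytes.
# Assumes `pip install transformers accelerate bitsandbytes` and access to
# the checkpoint below (it is gated on Hugging Face).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"    # swap in any causal-LM repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                         # quantize weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,      # do the math in FP16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                         # spread layers across GPU(s) / CPU
)

inputs = tokenizer("Explain VRAM in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```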
5️⃣ Quick Compatibility Checklist
Step | Command | What you're checking |
---|---|---|
CPU info | `lscpu` / `wmic cpu get name` | AVX2 / AVX‑512 support for CPU back‑ends |
RAM free | `free -h` / Task Manager | ≥ 16 GB available |
GPU & VRAM | `nvidia-smi` / `rocm-smi` | CUDA 12.x? HIP? VRAM amount |
Driver & Toolkit | `nvcc --version` | Matches your PyTorch / TensorRT build |
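The same checks can be scripted. Here's a sketch using PyTorch's device queries, assuming a CUDA (or ROCm) build of PyTorch is installed.

```python
# Minimal sketch: run the checklist from Python using torch's device queries.
# Assumes PyTorch is installed; a ROCm build on AMD reports through the same API.
import platform
import torch

print("CPU:", platform.processor() or platform.machine())
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")
    print("Compiled CUDA version:", torch.version.cuda)
```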
6️⃣ From Theory to Tokens: Your First Local Run
- Install Ollama (Mac/Win/Linux) or Text‑Generation‑WebUI.
- Pull a quantized model: `ollama pull llama2:7b-chat-q4_K_M`
- Infer: `ollama run llama2:7b-chat-q4_K_M`
- Watch VRAM with `nvidia-smi dmon`; you'll see ~6 GB in use instead of the ~14 GB an unquantized FP16 7‑B model would need.
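Once the model answers in the terminal, you can hit the same local server from Python. This sketch assumes Ollama is listening on its default port 11434 and that the model tag above has already been pulled.

```python
# Minimal sketch: talk to the local Ollama server over its REST API.
# Assumes Ollama is running on the default port 11434 and the model is pulled.
import json
import urllib.request

payload = {
    "model": "llama2:7b-chat-q4_K_M",
    "prompt": "Why does quantization save VRAM? Answer in two sentences.",
    "stream": False,                  # return one JSON blob instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```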
7️⃣ FAQ Speed‑round 🔄
Question | Short answer |
---|---|
Can I chain GPUs? | Yes, via tensor / pipeline parallelism or vLLM offload, but consumer apps rarely support it out‑of‑the‑box. |
Is Intel ARC good enough? | Now that oneAPI supports BF16 & INT8, ARC 870 / 880 can handle 7‑B Q4 models. |
Does Apple M‑series need VRAM? | Unified memory is VRAM; aim for M2 Pro 32 GB+ for 13‑B Q4. |
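On the multi‑GPU question: here's a hedged sketch of letting Hugging Face accelerate shard one model across two cards plus CPU spill‑over. The memory caps and the 13‑B checkpoint are assumptions; set the caps a little below each card's real VRAM.

```python
# Minimal sketch: shard one model across two GPUs (plus CPU) via a device map.
# Assumes `pip install transformers accelerate` and access to the checkpoint below;
# the per-device memory caps are illustrative, not tuned values.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",          # any large causal LM works
    device_map="auto",                          # let accelerate place the layers
    max_memory={0: "15GiB", 1: "15GiB", "cpu": "48GiB"},
)
print(model.hf_device_map)                      # shows which layer landed on which device
```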
Conclusion — Hardware matters, knowledge matters more
Running state‑of‑the‑art LLMs locally is no longer sci‑fi; it’s a weekend project when you:
- Match model bits to VRAM via quantization.
- Give your GPU room to breathe with adequate system RAM.
- Jump into tooling (Hugging Face, Ollama, LM‑Studio) that abstracts the heavy lifting.
Stay tuned — in the next post we’ll benchmark 4‑bit vs 8‑bit inferencing on a 4070 Ti SUPER and show you how to fine‑tune Dolphin‑Mixtral on 32 GB of system RAM. 🎣
✍️ Written by: Cristian Sifuentes — Full‑stack dev & AI tinkerer. Dark themes, atomic commits, and the belief that every rig deserves its own language model.
Happy local inferencing! 🚀💬
Tags: #LLM
#GPU
#Quantization
#EdgeAI
#Python
#HuggingFace
#Ollama