This article was originally published on runaihome.com
$500 in 2026 buys you hardware that runs 14B models at 22–29 tokens per second. That's chat speed: fast enough to feel like a real assistant, not a loading spinner. Whether you have an existing PC or are starting from a bare table, there is a working path to local LLM inference at or near this budget. None of the paths is magic, and all of them involve trade-offs. Here's what each one actually gets you.
Three valid paths to $500
Before the parts list: the "total build" framing matters. Are you adding a GPU to an existing machine, or buying everything from scratch? The answer changes which path makes sense.
| Path | Who it's for | Total cost | Best model class |
|---|---|---|---|
| A: Used RTX 3060 add-in | You have a PCIe x16 slot and 450W+ PSU | ~$250 | 7B–14B Q4 |
| B: Complete scratch build | Starting from nothing, with desk or floor space for a tower | ~$530–$560 | 7B–14B Q4 |
| C: AMD APU mini PC | No case to fill, quiet operation preferred | ~$499–$550 | 14B dense, 28B MoE |
Each path deserves its own breakdown.
Path A: $250 to add local AI to any PC
If you have a desktop with a free PCIe x16 slot and a PSU rated at 450W or higher, a used RTX 3060 12GB is the fastest path to running useful models locally. eBay completed listings in May 2026 show used RTX 3060 12GB cards selling for around $250, with a range from $220 (OEM/blower models) to $280 (AIB partner triple-fan cards).
The 12GB VRAM is the key number. At that capacity, you can fit:
- Any 7B or 8B model in Q4_K_M (~4–5GB VRAM used): 42 tok/s on the RTX 3060
- Any 14B model in Q4_K_M (~8–9GB VRAM used): 22–29 tok/s on the RTX 3060
- Two smaller models loaded simultaneously if both fit in 12GB (tight, but possible with 2× 4B)
You cannot fit a 30B model entirely in VRAM on a 3060. A Q4_K_M 30B model requires ~17GB. If you try running it with partial CPU offload via Ollama, the CPU-side layers drag throughput below 5 tok/s on most home CPUs. For 30B+, you need more VRAM or a different path entirely.
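The fit-or-not arithmetic above can be approximated in a few lines. This is a rough sketch, not a measurement: the 4.5 bits per weight for Q4_K_M and the 0.5GB allowance for KV cache and buffers are assumptions chosen to land near the figures quoted above, and real usage varies with context length and runtime.

```python
def estimated_vram_gb(params_billion: float,
                      bits_per_weight: float = 4.5,  # assumed average for Q4_K_M
                      overhead_gb: float = 0.5) -> float:  # assumed KV cache + buffers
    """Rough VRAM estimate: quantized weights plus a fixed overhead."""
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits_in_vram(params_billion: float, vram_gb: float = 12.0) -> bool:
    """True if the estimated footprint fits in the card's VRAM."""
    return estimated_vram_gb(params_billion) <= vram_gb


if __name__ == "__main__":
    for size in (7, 8, 14, 30):
        need = estimated_vram_gb(size)
        verdict = "fits" if fits_in_vram(size) else "does not fit"
        print(f"{size}B: ~{need:.1f} GB -> {verdict} in 12 GB")
```

Under these assumptions a 14B model needs a little over 8GB and a 30B model a little over 17GB, which is why the 3060 handles the first and not the second.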
The RTX 3060 runs at 170W TDP. Check that your current PSU has a 6-pin or 8-pin PCIe power connector and at least 450W total capacity — most gaming desktops from the last five years qualify. If your system only has an iGPU right now, you may need to disable it in BIOS after installing the discrete card to avoid driver conflicts.
Benchmark sourcing: singhajit.com tested Q4_K_XL at 16K context and measured 42.0 tok/s on 8B models and 22.7 tok/s on 14B models under CUDA 12.8. A separate test using the Vulkan backend found 29.4 tok/s on 14B Q4_K_M — the gap reflects different backends, not different hardware. Both numbers sit above the 20 tok/s floor where chat feels responsive.
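If you want to check these numbers on your own card, Ollama's local HTTP API reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds) with each non-streaming response. The sketch below assumes Ollama is running on its default port; the `benchmark` helper and its arguments are illustrative names, not part of any cited test.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str) -> float:
    """Send one non-streaming request and return the measured tokens per second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Call `benchmark()` with whatever model tag you have pulled locally and a prompt long enough to generate a few hundred tokens. A single run is noisy, so average several.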
Minimum PSU for this path: 450W. Comfortable: 550W or above. If your current PSU is older than five years or from a no-name brand, verify its actual output — a failing 600W PSU can deliver less stable current than a quality 450W.
Path B: Complete scratch build (~$530–$560)
If you're starting from nothing, here's a parts list that hits close to the target. Every price is from Newegg or eBay in May 2026.
| Component | Pick | Price |
|---|---|---|
| GPU | Used RTX 3060 12GB (eBay) | ~$250 |
| CPU | AMD Ryzen 5 5600 OEM (AM4) | ~$80 |
| Motherboard | Budget B450M (used eBay or Newegg) | ~$65 |
| RAM | 32GB DDR4-3200 (2×16GB kit) | ~$65–$100 |
| Storage | 1TB NVMe Gen3 (Kingston NV3 or WD Blue) | ~$70 |
| PSU | 550W 80+ Bronze (Thermaltake Smart or EVGA) | ~$45–$55 |
| Case | Budget mATX (new, basic airflow) | ~$30–$40 |
| Total | | ~$605–$660 full new; ~$530–$560 mixing used |
The biggest wildcard in 2026 is DDR4 pricing. Manufacturers shifted production capacity to DDR5, and 32GB DDR4 kits that cost $60 in 2024 now run $65–$100 depending on speed and brand. Tom's Hardware's 2026 RAM price index shows DDR4 has risen 30–60% year-over-year because of this supply imbalance. If you can source a used 32GB kit for $55–65 on eBay, take it. Otherwise budget $80–100.
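To see how much of the budget swing comes from RAM, the parts table can be summed as low and high price ranges. This is a small sketch using only the prices listed in the table; the function names are illustrative.

```python
# (low, high) prices in USD, copied from the parts table
parts = {
    "GPU": (250, 250),
    "CPU": (80, 80),
    "Motherboard": (65, 65),
    "RAM": (65, 100),
    "Storage": (70, 70),
    "PSU": (45, 55),
    "Case": (30, 40),
}


def build_total(parts: dict) -> tuple:
    """Sum the low and high ends of every component's price range."""
    low = sum(lo for lo, _ in parts.values())
    high = sum(hi for _, hi in parts.values())
    return low, high


def widest_range(parts: dict) -> str:
    """Name of the component with the largest gap between low and high price."""
    return max(parts, key=lambda name: parts[name][1] - parts[name][0])


if __name__ == "__main__":
    low, high = build_total(parts)
    print(f"Total: ${low}-${high}, widest range: {widest_range(parts)}")
```

RAM accounts for $35 of the $55 spread between the low and high totals, so the RAM deal you find decides where in the range you land.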
AMD no longer adds new CPU support to the B450 platform, but for a pure inference rig that doesn't matter: you don't need a Ryzen 5000X3D chip, and the 5600 works on B450 with a BIOS update that most boards received years ago. If you want a better platform upgrade path, spend $15–20 more for a used B550 board.
Why Ryzen 5 5600 and not something faster? On a pure GPU inference rig, the CPU contributes almost nothing to LLM throughput — the GPU does all the matrix multiplications, and the CPU handles Ollama's server process plus your OS. A Ryzen 5 3600 at $50–60 used would produce identical LLM performance. The 5600 is a safe known-quantity that runs cool and quiet.
Why 32GB RAM and not 16GB? On a pure GPU inference rig, system RAM holds your OS, Ollama process, browser, code editor, and whatever else runs alongside. 16GB gets tight if you're running VSCode plus a browser with multiple tabs while the LLM runs in the background. 32GB keeps things comfortable. RAM doesn't affect LLM throughput here — models live on NVMe and load into VRAM, not into system memory.
Why 1TB NVMe instead of 500GB? At current pricing, a 1TB drive costs ~$70, or roughly $0.07/GB, while 500GB drives are $50, or $0.10/GB. The 500GB market has largely compressed to bad value. GGUF models in the 7B–14B range run 4–8GB each, so 1TB comfortably holds 8–12 models plus your OS without juggling.
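The cost-per-gigabyte comparison is simple division. As a quick sketch using the two prices quoted above:

```python
def cost_per_gb(price_usd: float, capacity_gb: float) -> float:
    """Price divided by capacity, in USD per gigabyte."""
    return price_usd / capacity_gb


if __name__ == "__main__":
    for label, price, capacity in (("1TB", 70, 1000), ("500GB", 50, 500)):
        print(f"{label}: ${cost_per_gb(price, capacity):.2f}/GB")
```

The extra $20 doubles the capacity, which is what makes the smaller drive the worse buy.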
What this build runs
On a Ryzen 5 5600 + RTX 3060 12GB, practical workloads look like this:
- Llama 3.1 8B Q4_K_M: ~42 tok/s in Ollama, fast enough to feel like a real chatbot
- Qwen2.5-Coder 7B Q4_K_M: 40+ tok/s — solid for code completion with Continue.dev (see the local coding stack guide)
- 14B-class models in Q4_K_M: 22–29 tok/s, with more capable reasoning and still interactive
- Mistral Small 24B Q4_K_M: ~17GB VRAM required — won't fit in full GPU mode, falls back to partial CPU offload and drops below 5 tok/s
For anything in the 30B+ class, you're in the wrong budget tier. The used RTX 3090 guide covers what the 24GB VRAM jump unlocks and what it currently costs.
Path C: AMD APU mini PC (~$499–$550)
The third option involves no discrete GPU at all. AMD's Ryzen 7040/8040 series chips, specifically the Ryzen 9 7940HS and Ryzen 9 8945HS, pair Zen 4 CPU cores with an integrated Radeon 780M iGPU that shares one pool of DDR5 memory with the CPU. With 64GB of DDR5-5600, this creates a surprisingly capable LLM inference platform inside a quiet 0.7-liter box.
Tested configuration: Minisforum UM790 Pro (Ryzen 9 7940HS, Radeon 780M) with 64GB DDR5-5600, priced around $300–$350 for the base unit plus $120–$150 for the RAM upgrade, landing at $420–$500 all-in. Some pre-configured 64GB variants from Minisforum and Beelink list at street prices of $499–$550.
Measured performance on this hardware (llama.cpp via Vulkan, April 2026):
| Model | Architecture | Tokens/sec |
|---|---|---|
| Gemma 4 28B Q4_0 | MoE | 19.5 |
| Qwen3.5-32B-A3B Q4_0 | MoE | 20.8 |
| Nemotron-Cascade | MoE | 24.8 |
| Qwen3.5-27B Q4_K | Dense | 5.8 |
| Qwen3.5-32B Q4_K | Dense | 2.8 |
The pattern is obvious once you see it: on this shared-memory architecture, dense models are slow and MoE models are not. A Mixture-of-Experts model with 28–32B total parameters activates only 3–5B parameters per token, so the memory bandwidth consumed per step matches that of a much smaller model. The Radeon 780M's shared memory bandwidth sustains 19–21 tok/s on 28–32B MoE models. The same bandwidth applied to a dense 32B model, which activates all parameters per token, delivers 2.8 tok/s. That's below usable chat speed.
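The bandwidth argument can be turned into a back-of-the-envelope ceiling: each generated token requires reading every active weight once, so tokens per second cannot exceed memory bandwidth divided by the bytes of active weights. The sketch below assumes dual-channel DDR5-5600 on a 128-bit bus and 4.5 bits per weight. It ignores KV cache reads, compute limits, and real-world efficiency, so measured speeds land well below these ceilings.

```python
def peak_bandwidth_gbps(mt_per_s: float = 5600,
                        bus_width_bits: int = 128) -> float:  # assumed dual-channel
    """Theoretical memory bandwidth in GB/s: transfers per second times bytes per transfer."""
    return mt_per_s * 1e6 * (bus_width_bits / 8) / 1e9


def max_tokens_per_second(active_params_billion: float,
                          bits_per_weight: float = 4.5,  # assumed Q4-class average
                          bandwidth_gbps: float = None) -> float:
    """Upper bound on decode speed when every active weight is read once per token."""
    if bandwidth_gbps is None:
        bandwidth_gbps = peak_bandwidth_gbps()
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbps / gb_read_per_token


if __name__ == "__main__":
    print(f"Peak bandwidth: {peak_bandwidth_gbps():.1f} GB/s")
    print(f"Dense 32B ceiling: {max_tokens_per_second(32):.1f} tok/s")
    print(f"MoE with 3B active ceiling: {max_tokens_per_second(3):.1f} tok/s")
```

The absolute ceilings are optimistic. The point is the ratio: a model that activates 3B parameters per token has roughly ten times the headroom of a dense 32B model on the same memory.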
If you're