Yesterday, April 23rd, I sat down with a blank note and a messy spreadsheet. Not for a client project or a work task. Just for something that’s been gnawing at me for weeks: I want to run large language models on my own hardware, at home, on my terms. Not in a cloud notebook that spins down after an hour. Not behind a metered API key. Something I can experiment with at 2 a.m. without worrying about surprise bills.
It started as a casual “what if,” and by the end of the evening I had a parts list, a stack of tradeoffs, and a very real budget scrawled on a piece of paper. I’m sharing that blueprint here—not as a polished build guide, but as a developer’s logbook. The point isn’t to show off a flawless system. It’s to think through the decisions, admit the unknowns, and maybe get some advice from folks who’ve walked this path before.
The Core Idea
The plan: host some websites, tinker with local LLMs (think Llama, Mistral, Phi), and eventually offer a small hosted inference service to friends or local devs—something modest that might grow over time. The long-term dream: a self-contained AI lab where I control the hardware, the models, and the data.
But first, reality: I have a limited budget and I’m piecing this together part by part in a region where used enterprise gear isn’t as easy to come by. I’m in Kenya, so prices here are in Kenyan Shillings (KSh), but I’ll include rough USD equivalents for global context.
The Hardware Blueprint (v0.1)
Here’s what I’m planning to put in the case. It’s not exotic. It’s not a server rack full of A100s. It’s a quiet desktop that I hope will punch above its weight for quantized models.
| Component | Choice | Actual KSh Range | Real-World USD |
|---|---|---|---|
| GPU | RTX 3060 12GB (new/used, local market) | 45,000 – 63,000 | $350 – $490 |
| CPU | Ryzen 5 3600 (tray/used) | ~13,000 | $100 |
| Motherboard | B450 / B550 (decent VRMs) | 7,700 – 14,200 | $60 – $110 |
| RAM | 32GB DDR4 (single 32GB stick) | 16,500 – 18,950 | $128 – $147 |
| PSU | 650W – 750W 80+ Bronze/Gold | 8,000 – 10,000 | $62 – $77 |
| Case + Cooling | Budget airflow case, 3 fans | 3,000 – 6,000 | $23 – $46 |
| Total | | ~93,000 – 125,000 | ~$720 – $970 |
That’s a wide range, I know. The final number will depend heavily on whether I snag a decent used GPU and how fussy I get about the motherboard and PSU. But even at the upper end, it’s cheaper than a mid-range gaming laptop.
Why Each Part Earned Its Spot
GPU: RTX 3060 12GB over the 3090
I spent a long time staring at RTX 3090 listings. 24 GB of VRAM is seductive—it can hold a 13B model at 8-bit with room to spare, maybe even a quantized 30B without breaking a sweat. But at $700–$1000 used, it would eat up my entire budget. I’d be left with a beast of a GPU and no money for a CPU, RAM, or even a case to put it in.
So I pivoted. The RTX 3060 12GB. Not the 8GB variant—that 4 GB matters a lot for LLMs. 12 GB of VRAM will comfortably hold a 4-bit quantized Llama-2-13B or Mistral, and leaves room for the KV cache during inference. For smaller models like Phi-2 or a q4 Llama-7B, I’ll even have VRAM to spare for batching. Yes, I’ll buy used. Yes, I’m nervous about mining cards. But the local market has some ex-gaming units, and I’ll stress-test before committing.
The 3060’s power draw is also a plus: ~170W TDP means I won’t need a 1000W PSU, keeping the build more efficient and quieter.
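To sanity-check that 12 GB figure, here’s the back-of-the-envelope arithmetic I scribbled down. The model shape is Llama-2-13B-ish and the bits-per-weight number is my guess at q4 GGUF overhead, not a measurement:

```python
# Back-of-the-envelope VRAM estimate for a quantized model plus its KV cache.
# These are rough planning numbers, not benchmarks.

def model_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate weight memory in GB, with ~10% padding for buffers and runtime overhead."""
    return params_billion * bits_per_weight / 8 * overhead

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, context_len: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: 2 tensors (K and V) per layer, fp16 elements."""
    return 2 * layers * context_len * kv_heads * head_dim * bytes_per_elem / 1e9

# Llama-2-13B-ish shape at ~4.5 bits/weight (my guess at q4 overhead), 4k context
weights = model_vram_gb(13, 4.5)
cache = kv_cache_gb(layers=40, kv_heads=40, head_dim=128, context_len=4096)
print(f"weights ~ {weights:.1f} GB, KV cache ~ {cache:.1f} GB, total ~ {weights + cache:.1f} GB")
```

That lands around 8 GB of weights plus 3–3.5 GB of KV cache at a full 4k context: tight on a 12 GB card but workable, and a 7B model leaves far more slack.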
CPU: Ryzen 5 3600 (used/refurb)
For pure LLM inference, the CPU isn’t the star—the GPU does the heavy lifting. But I still need enough muscle to handle API serving, web hosting, and any CPU-bound pre/post processing. A Ryzen 5 3600 is a 6-core/12-thread workhorse that’s dirt cheap on the used market. If I ever decide to experiment with CPU offloading for huge models or run a local vector database, it won’t be embarrassingly slow. And the AM4 platform gives me an upgrade path to a 5000-series chip later.
Motherboard: B450 or B550 with spare PCIe
I don’t need bleeding edge. I need one x16 slot for the GPU, maybe a spare x4 slot for a future NVMe adapter or a second NIC. Both B450 and B550 boards handle that fine. I’ll pick whichever has decent VRM cooling and is available at a reasonable price. The ability to drop in a faster Ryzen CPU later is a bonus.
RAM: 32 GB DDR4, no compromises
I'm buying 32GB upfront. For LLM serving, system RAM acts as a safety net—even when the GPU carries the model, the CPU still needs space for the OS, background services, context buffers, and any offloaded layers. A single 32GB DDR4 stick (around KSh 16,500–19,000) gives me all the headroom I need right now and leaves one slot open for a future jump to 64GB if I ever start running heavier multi-model setups or large vector databases. Waiting on RAM is a false economy when I'm this close to a balanced build.
Power Supply: 650W–750W, with tomorrow in mind
I see a lot of online advice shouting “800W minimum!” for high-end GPUs, but that advice is aimed at RTX 3090s and 4090s. For a 3060 + Ryzen 5 setup, a quality 650W unit is more than enough: roughly 170 W for the GPU, 65 W for the CPU, and maybe another 60–80 W for the board, drives, and fans puts the whole system near 300 W under load. I’m leaning toward 750W for a bit of headroom—if I ever upgrade to a hungrier GPU (used 3090 prices might drop), I won’t need to replace the PSU. I’m sticking to a reputable brand, though, because a cheap PSU is a false economy.
Case and Cooling: Airflow over aesthetics
No RGB glass panels here—just a mesh-front case with three fans. Good airflow is important because the GPU will be running flat-out during inference sessions, and I want to keep thermals low enough that the card’s fans aren’t screaming. I’ll probably undervolt the GPU slightly for efficiency.
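Since heat and fan noise are what I’ll actually be living with, I plan to keep a tiny monitoring loop running during long inference sessions. A minimal sketch using NVIDIA’s NVML Python bindings (the 83 °C threshold is just my comfort level, not a spec):

```python
# Quick thermals/power check to leave running while a model is generating.
# Uses NVIDIA's NVML bindings (pip install nvidia-ml-py). Threshold is my own guess.
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)  # degrees C
        watts = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000                        # mW -> W
        fan = pynvml.nvmlDeviceGetFanSpeed(gpu)                                   # % of max
        print(f"temp={temp}C  power={watts:.0f}W  fan={fan}%")
        if temp > 83:  # my comfort ceiling, not an NVIDIA spec
            print("running hot; time to revisit the fan curve or undervolt harder")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```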
What This Build Should Actually Do
I’m setting realistic expectations. With 12 GB VRAM, this machine will shine with:
- Llama-7B and 13B (4-bit quantized) – via GGUF/MLC formats
- Mistral-7B variants – tiny, fast, excellent for code and chat
- Phi-3-mini – small but surprisingly capable
- Ollama for dead-simple local model serving (see the sketch just after this list)
- LM Studio for a GUI-based playground when I just want to test prompts
- vLLM later on, once I’m comfortable with the setup and want higher throughput for an API endpoint.
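To show what I mean by “dead-simple serving” with Ollama, this is roughly the day-one client I have in mind: a few lines of Python against the local HTTP endpoint. The model name and prompt are placeholders, and I’m assuming the default port 11434:

```python
# Minimal client for a local Ollama server (default port 11434).
# Assumes `ollama pull mistral` has already been run; the model name is just an example.
import json
import urllib.request

def ask(prompt: str, model: str = "mistral") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask("Explain KV caching in two sentences."))
```

If that works, everything else I want (coding assistant, summarizer, eventually a private API) is a wrapper around the same call.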
I can actually fine-tune on this box—QLoRA makes it practical. With 12 GB of VRAM, I can load a 4-bit quantized 7B model and train low-rank adapters on top with typical settings (small rank, short sequences, gradient accumulation). It won't be fast, but it'll work. Full fine-tuning is still out of reach, but for my experiments, QLoRA on a 7B model is more than enough.
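Concretely, “typical settings” for me means something like the sketch below: a 4-bit base model loaded through bitsandbytes with LoRA adapters attached via PEFT. The model name, rank, and target modules are assumptions for illustration; I haven’t actually trained anything yet.

```python
# Rough QLoRA setup: 4-bit base model + LoRA adapters via PEFT. Untested sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # example base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # weights sit in VRAM at 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # the 3060 (Ampere) supports bf16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # common attention projections for Llama/Mistral
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # should report well under 1% of weights as trainable
```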
The Uncomfortable Part: What I’m Still Unsure About
Used GPU reliability. I’ve bought used electronics before, but a GPU that’s been run 24/7 in a mining rig or a dusty gaming tower is a gamble. I’ll run FurMark and a VRAM stress test, but there’s always a chance of early failure. I’d love to hear how others vet used cards for AI workloads.
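For what it’s worth, my current vetting plan beyond FurMark is a crude PyTorch soak test: fill most of the VRAM with a known pattern, keep the card busy for a while, then check nothing got corrupted. The sizes and loop counts below are guesses rather than a proven burn-in recipe, so corrections are welcome:

```python
# Naive VRAM integrity check: fill ~8 GB with a known pattern, heat the card up
# with matmuls, then verify nothing flipped. A sanity check, not a proper burn-in.
import torch

assert torch.cuda.is_available(), "no CUDA GPU visible"
device = torch.device("cuda")

# 1. Fill ~8 GB of VRAM with a repeating known pattern (256 MB blocks of int32).
pattern = torch.arange(64 * 1024**2, device=device, dtype=torch.int32)
blocks = [pattern.clone() for _ in range(32)]  # 32 * 256 MB = 8 GB

# 2. Generate sustained load and heat with repeated matmuls in the remaining headroom.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
for _ in range(2000):
    b = a @ b
    b /= b.norm()  # keep values from overflowing across iterations
torch.cuda.synchronize()

# 3. Verify every block still matches the original pattern bit-for-bit.
for i, block in enumerate(blocks):
    if not torch.equal(block, pattern):
        raise RuntimeError(f"block {i} corrupted: suspect VRAM")
print("pattern intact after load; VRAM looks OK (for this crude test)")
```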
VRAM vs model ambitions. 12 GB puts a hard ceiling on model size. Right now I’m happy with 7B–13B quantized, but I can feel the itch to run larger open models (like Command R or a future Llama-3-70B-ish thing). I keep asking myself: will I regret not saving up for a 3090? The honest answer is: maybe. But learning on a 3060 is better than dreaming about a 3090 that never arrives.
Power stability at home. Brownouts and voltage swings aren’t unheard of in my area. I’ll need at least a basic UPS with AVR (automatic voltage regulation) to protect the hardware, and I haven’t priced that in yet.
Noise in a living space. This will sit in my apartment, not a dedicated server room. Inference won’t max out the GPU like gaming, but continuous serving might. I need to test fan curves and maybe swap in quieter case fans.
Monetization reality check. I mentioned “selling as a service.” I mean something humble: maybe offer a private API endpoint to local developers who want to experiment without paying for cloud tokens. It’s not a startup; it’s a side experiment that might offset some electricity costs. I genuinely don’t know if anyone will pay, or how to price it fairly.
Where This Is All Heading
The first goal is simple: get the machine booting, install Ollama, and get a model answering prompts. I’ll spend a week or two just playing with models, measuring tokens-per-second throughput, and learning the quirks of managing a local inference stack.
From there, I plan to:
- Run a persistent LM Studio server or Ollama for personal use—coding assistants, document summarization, etc.
- Set up a Dockerized environment where I can spin up different model backends and test frameworks (LangChain, LlamaIndex, maybe a local RAG pipeline).
- Explore vLLM to understand high-throughput serving, including batching and PagedAttention (a first sketch of what I’d try sits just after this list).
- Host a few lightweight web apps alongside the inference service—personal projects that benefit from local AI.
- Eventually, try to monetize: sell API credits in bulk, or offer a “bring your own model” endpoint for students/hobbyists in my network. If it gains any traction, I’d reinvest every cent into upgrades (more RAM, better GPU, maybe a second identical machine for redundancy).
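The vLLM experiment mentioned above would probably start as offline batched generation with an AWQ-quantized 7B checkpoint so it fits in 12 GB; the specific model repo and parameters here are assumptions I haven’t tested:

```python
# First vLLM experiment I have in mind: offline batched generation with a quantized
# 7B model so it fits in 12 GB. Model repo and parameters are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # hypothetical choice of AWQ checkpoint
    quantization="awq",
    gpu_memory_utilization=0.90,  # leave a little VRAM for the desktop environment
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
    "Summarize what PagedAttention does in one paragraph.",
    "Write a haiku about home servers.",
]

# vLLM batches these requests internally; that batching is where the throughput comes from.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip(), "\n---")
```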
No grand roadmap. No “disrupt the industry” rhetoric. Just a developer growing a system piece by piece, learning along the way.
A Quick Ask to the Dev Community
I’d genuinely appreciate your wisdom:
- Used GPUs for AI: What tests do you run before buying? Any telltale signs of a card that’s been thermally tortured?
- RAM vs VRAM: For a local LLM server, would you prioritize 32 GB system RAM first, or save every penny toward a bigger GPU?
- PSU headroom: Is 750W a reasonable ceiling for a single high-end future card, or should I just bite the bullet and buy an 850W unit now?
- Monetization: Have any of you offered a local LLM API to others? How did you handle billing, rate limiting, or uptime?
If you’ve been down this road, or if you’re staring at a similar blueprint right now, maybe drop a comment. I’d love to hear what you’d do differently, what you’d keep the same, and what you wish someone had told you before you powered on that first home-built AI box.
This is just the beginning. The next post will hopefully have photos of a real, built machine, not just a parts list and a prayer. Until then, I’ll be refreshing listings for used RTX 3060s and trying to figure out if that “too-good-to-be-true” deal is a trap.
What’s your home lab running?