Six months ago I decided to see if I could stop paying for AI APIs entirely. Not "reduce costs." Not "use local models for simple stuff." I mean everything — coding help, document analysis, image understanding, automation, even the AI that writes my Dev.to posts.
The bill since then: $0 in API costs.
But "free" doesn't mean "zero effort." Here's the honest maintenance report from half a year of running my own AI infrastructure.
The Stack I Actually Run
Three machines, one desk, no cloud dependency:
| Machine | Role | Specs |
|---|---|---|
| Mac Mini M4 | Orchestrator + always-on models | 16GB RAM |
| Windows PC | GPU workhorse | RTX 3060 12GB + RTX 3070 8GB |
| Ubuntu box | CPU fallback + storage | Headless, 32GB RAM |
Between them: 8 Ollama models, a Chroma vector database, ~5,000 indexed document chunks from my projects, and a dozen Python scripts that tie it all together.
What Works Better Than Expected
Code generation is actually fine now
I was skeptical. Everyone says you need GPT-4 for "serious" coding. But Qwen 3 Coder 30B on my RTX 3060 writes functions, debugs errors, and suggests refactors at roughly 90% of the quality I'd get from Claude. The other 10%? Usually it's being overly verbose or missing an edge case I'd catch in review anyway.
The difference: the response starts the instant I hit enter, versus pasting a prompt into a web UI and watching a progress bar. Local inference feels instant because there's no network round trip, no queue, and no rate limiter between me and the model.
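For a sense of what "local" means in practice, here's a minimal sketch of the kind of call my glue scripts make, assuming Ollama's default REST endpoint on localhost:11434 (the model tag is a placeholder for whatever you've actually pulled):

```python
import requests

# Minimal sketch: one blocking call to a local Ollama server.
# Assumes Ollama is listening on its default port; the model tag is a placeholder.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local(prompt: str, model: str = "qwen3-coder:30b") -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask_local("Write a Python function that deduplicates a list while preserving order."))
```

No API key, no billing header, nothing leaving the machine.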
No rate limits changes how you work
With cloud APIs, I catch myself thinking "is this query worth the tokens?" That's a weird mental tax. Local models removed it entirely. I can fire off 50 variations of a prompt, iterate on phrasing, test different approaches — all without checking a dashboard or worrying about a bill.
I probably use AI more now that it's free, not less. Counterintuitive.
Privacy isn't theoretical
I index my entire Obsidian vault, all my project documentation, client notes, even personal journal entries. The vector DB is local. The embeddings are local. The inference is local. Nobody's training on my data because nobody sees my data.
That's not just privacy — it's permission to be sloppy. I can paste a full error log without scrubbing paths. I can ask about a client project by name. I don't have to think about it.
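For the curious, a stripped-down sketch of what that indexing loop can look like, assuming chromadb's persistent client and an embedding model served by Ollama (the model name, paths, and sample text here are illustrative, not my exact config):

```python
import requests
import chromadb

# Sketch of a fully local indexing + retrieval loop.
# Assumes an embedding model (e.g. nomic-embed-text) has been pulled into Ollama.
EMBED_URL = "http://localhost:11434/api/embeddings"

def embed(text: str) -> list[float]:
    resp = requests.post(EMBED_URL, json={"model": "nomic-embed-text", "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

client = chromadb.PersistentClient(path="./vault_index")   # on-disk, stays local
notes = client.get_or_create_collection("obsidian_notes")

# Index a chunk: the embedding, the text, and the id all live on this machine.
chunk = "Client X wants the billing migration done before Q3."
notes.add(ids=["note-0001"], documents=[chunk], embeddings=[embed(chunk)])

# Query the same way: embed locally, search locally.
hits = notes.query(query_embeddings=[embed("billing migration deadline")], n_results=3)
print(hits["documents"])
```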
What Broke (And How I Fixed It)
Model updates are a tax
Ollama makes updating easy: `ollama pull qwen3.5:9b` and you're done. But "easy" multiplied by 8 models across 3 machines is still work. Every few weeks there's a new version, a new quantization, a "better" tag.
I used to update everything immediately. Now I have a rule: only update when the current model fails me. If Qwen 3.5 9B answers my question correctly, I don't care that 9.1 exists. This cut my maintenance time by about 70%.
The "which machine is this model on?" problem
For the first two months I was constantly SSHing between machines to check what was installed where. Did I put the vision model on the Windows PC or the Ubuntu box? Is the 30B coder on the GPU machine or did I move it?
I built a simple model router (wrote about it here) that tracks endpoints and health-checks them. Now I just send a prompt and it figures out where to route it. Problem solved, but it took an afternoon to build and another afternoon to debug.
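The actual script is longer, but the core idea fits in a page. A simplified sketch, with placeholder hostnames and model tags, assuming every box runs Ollama:

```python
import requests

# Simplified router sketch: placeholder hosts, standard Ollama endpoints (/api/tags, /api/generate).
ENDPOINTS = [
    "http://windows-pc:11434",   # GPU workhorse
    "http://ubuntu-box:11434",   # CPU fallback
    "http://mac-mini:11434",     # always-on small models
]

def installed_models(base: str) -> set[str]:
    """Health check and inventory in one call; an unreachable host just returns empty."""
    try:
        resp = requests.get(f"{base}/api/tags", timeout=2)
        resp.raise_for_status()
        return {m["name"] for m in resp.json().get("models", [])}
    except requests.RequestException:
        return set()

def route(prompt: str, preferred: list[str]) -> str:
    """Try preferred models in order, on the first healthy host that has each one."""
    for model in preferred:
        for base in ENDPOINTS:
            if model in installed_models(base):
                resp = requests.post(
                    f"{base}/api/generate",
                    json={"model": model, "prompt": prompt, "stream": False},
                    timeout=600,
                )
                resp.raise_for_status()
                return resp.json()["response"]
    raise RuntimeError("No endpoint has any of the requested models")

# Big coder model if its host is up, small fallback otherwise (tags are placeholders).
print(route("Refactor this loop into a comprehension: ...", ["qwen3-coder:30b", "qwen3:4b"]))
```

That preference-order fallback is also what the next section is about: when the big model's host is down, the query still lands somewhere, just on a smaller model.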
One machine goes down, your workflow doesn't stop — but it degrades
The Windows PC rebooted for an update mid-project last month. All my "heavy" models went with it. The router fell back to the Mac Mini's 4B model, which meant my coding assistant became... less helpful. It still answered, but the quality drop was noticeable.
I now schedule Windows updates for Sunday mornings when I'm not working. And I keep a "fallback" version of my most-used models on each machine, even if they're slower. Redundancy isn't just for servers.
VRAM is the real bottleneck
Not compute. Not CPU. VRAM. A 30B parameter model at 4-bit quantization needs ~18GB. My 3060 has 12GB, which means I can't run the biggest models at all. The 3070's 8GB is even more limiting.
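The arithmetic behind that ~18GB figure is simple enough to sanity-check yourself (a rough estimate; real usage also depends on context length and the runtime's overhead):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, plus ~20% for KV cache and runtime overhead."""
    weights_gb = params_billions * bits_per_weight / 8  # 8 bits per byte
    return weights_gb * overhead

print(f"{approx_vram_gb(30):.0f} GB")  # ~18 GB: doesn't fit in 12GB
print(f"{approx_vram_gb(8):.0f} GB")   # ~5 GB: comfortable even on the 8GB card
```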
I spent $150 on a used RTX 3060 with 12GB specifically for this setup. Best investment I made. If you're buying for local AI, VRAM > everything else. A 12GB card beats an 8GB card even if the 8GB card is technically "faster."
The Hidden Costs
| Cost | Monthly |
|---|---|
| Electricity (3 machines, ~200W average) | ~$25-35 |
| Hardware depreciation | Hard to calculate |
| My time maintaining it | ~2-3 hours/month |
So it's not literally free. But compared to $20-200/month for cloud API subscriptions? It's a rounding error.
The real cost is mental overhead. You become your own ops team. When something breaks at 11 PM, there's no support chat. You're the support chat.
The Moment I Almost Switched Back
Last month I had a complex architecture question for a client project. The local model gave me an answer that sounded right but was subtly wrong about how PostgreSQL handles concurrent index creation. I didn't catch it until review.
A cloud model might have gotten it right. Or it might have hallucinated something equally plausible. I don't actually know. But in that moment I considered keeping a Claude subscription as a "reality check" for critical questions.
I didn't do it. Instead, I added a rule: for architectural decisions that affect production, verify with official docs, not AI. Whether the AI is local or cloud, that's always been true. I just got lazy because the local model was so convenient.
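As one illustration of the kind of detail worth pulling from the docs rather than any model, local or cloud: `CREATE INDEX CONCURRENTLY` can't run inside a transaction block, and a failed build leaves an invalid index behind that you have to drop before retrying. A sketch with psycopg2 (the connection string, table, and index names are made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=clientdb")  # connection string is a placeholder
# CREATE INDEX CONCURRENTLY refuses to run inside a transaction block,
# so autocommit has to be on; psycopg2 opens a transaction by default.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user ON orders (user_id)")
# If the build fails partway, the index is left marked INVALID and has to be
# dropped (DROP INDEX) before retrying; that part is straight from the PostgreSQL docs.
conn.close()
```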
Would I Do It Again?
Yes. With caveats.
Do this if: You're technical, you enjoy tinkering, you already have decent hardware, and you use AI heavily enough that API costs would matter.
Don't do this if: You want something that "just works," you need the absolute best model quality for every query, or your time is worth more than the API savings.
For me, the tradeoff is worth it. I spend maybe 2-3 hours a month on maintenance and save $50-150 in API costs. More importantly, I own the stack. No vendor lock-in. No terms-of-service changes. No "we've updated our pricing" emails.
What's Next
I'm experimenting with running smaller models (1-3B parameters) on the Mac Mini for truly instant responses to simple queries. The latency difference between 4B and 30B is massive — sometimes you just need a quick answer, not a thoughtful one.
Also looking into quantization options that let me squeeze bigger models into 12GB. Every GB of VRAM I free up is another model I can keep loaded.
Six months in, I'm not going back. But I'm also not pretending it's effortless. Local AI is a lifestyle choice, not just a technical one.
Questions about the setup or want the specific config files? Drop a comment.
→ I build custom local AI setups on Fiverr
→ Follow along on Telegram