
Sam Hartley


3 Months Running Everything Locally — What Broke, What Worked, What I'd Do Differently

It's been about three months since I made the switch. No more ChatGPT Plus. No more Claude subscription. No more Copilot. Everything I use day-to-day now runs on hardware sitting in my apartment — a Mac mini M4 and a PC with a mix of consumer GPUs.

I wrote the enthusiastic "I ditched OpenAI" post back in early March. This is the honest follow-up, because a lot of people asked me to report back once the honeymoon phase wore off.

Some of it worked better than I expected. Some of it was genuinely annoying. Here's the unfiltered version.

The setup (short version)

  • Mac mini M4, 16 GB RAM — runs the orchestrator, small models, all the glue code
  • PC with an RTX 3060 12GB (plus spare 3070s and 3080s in a drawer) — runs the heavier models via Ollama
  • Everything talks over my LAN. Nothing leaves the house unless I tell it to.

I use this stuff for coding help, writing drafts, summarizing articles, transcribing voice notes, and a handful of personal automations.

What actually worked

1. Coding help for "normal" tasks

qwen3-coder:30b on the 3060 handles maybe 80% of what I used to ask GPT-4 for. Refactoring a function, explaining a gnarly regex, writing a quick shell script, sketching a React component. It's fast enough that I don't miss the cloud.

The latency is actually better than cloud APIs because there's no round trip to the US. I'll type a prompt and have tokens streaming back in under a second.
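For the curious, the whole loop is just an HTTP call to Ollama's `/api/generate` endpoint. A minimal sketch, assuming Ollama is running on its default port (the model name is the one I use; swap in your own):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_payload(prompt: str, model: str = "qwen3-coder:30b") -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_local(prompt: str, model: str = "qwen3-coder:30b") -> str:
    """Send the prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(ask_local("Explain this regex: ^(?=.*\\d)\\w{8,}$"))
```

Set `"stream": True` instead if you want tokens as they arrive; the response then comes back as newline-delimited JSON.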

2. Voice notes and transcription

I didn't expect this to be the killer app, but it is. I talk to my Mac mini while I'm cooking, and Whisper on-device dumps the text into a daily markdown file. Zero cost, zero privacy worries, zero "oh I forgot to renew my API key."
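The glue code is trivial. Here's a sketch of the append-to-daily-file half; the transcription call itself depends on which Whisper build you run, so that part is only a comment (binary name and flags are illustrative, not gospel):

```python
import datetime
from pathlib import Path


def append_note(text: str, notes_dir: str = "~/notes") -> Path:
    """Append a timestamped transcript to today's markdown file."""
    today = datetime.date.today().isoformat()            # e.g. 2025-06-01.md
    path = Path(notes_dir).expanduser() / f"{today}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%H:%M")
    with path.open("a", encoding="utf-8") as f:
        f.write(f"\n- **{stamp}** {text.strip()}\n")
    return path


# The transcript comes from whatever Whisper build you run, e.g. something like:
#   text = subprocess.run(["whisper-cli", "-m", model_path, "-f", wav_path],
#                         capture_output=True, text=True).stdout
# (adjust the command for your install -- this line is a placeholder)
```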

This alone probably saved me more time than the coding stuff.

3. Batch jobs that used to rack up API bills

Summarizing 200 PDFs. Tagging a folder of screenshots. Generating alt text for a blog's image archive. These were the things that made me nervous to click "run" on OpenAI. Now I just let them chew overnight on the PC. Electricity is cheaper than tokens.
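A sketch of the overnight loop, with the one feature that actually matters: it's resumable, so a crash at 3am doesn't lose the night's work. The `summarize` callable is a stand-in for whatever local-model call you use:

```python
from pathlib import Path


def run_batch(src_dir: str, summarize, suffix: str = ".summary.txt") -> int:
    """Summarize every PDF in src_dir, skipping ones already done.

    `summarize` is any callable taking a Path and returning text -- in my
    case it shells out to a local model. Because finished files get a
    sidecar, rerunning the script just picks up where it left off.
    """
    done = 0
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        out = pdf.with_suffix(suffix)
        if out.exists():          # processed on a previous run -- skip
            continue
        out.write_text(summarize(pdf), encoding="utf-8")
        done += 1
    return done
```

Writing the output as a sidecar next to each input is the simplest possible checkpointing; no database, no job queue.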

What broke or annoyed me

1. The "it's almost right" gap on hard stuff

For anything genuinely hard — like debugging a weird async bug in a codebase the model hasn't seen, or reasoning about a tricky algorithm — the gap between local 30B models and frontier cloud models is still real. It's not huge, but it's there.

I caught myself a few times thinking "I bet GPT-5 would get this in one shot" while I was on my fifth prompt with a local model. That's the honest truth.

My workaround: I keep a very small pay-as-you-go budget for the 2-3 times a month I actually need a frontier model. Probably $5/month total. Way cheaper than any subscription.

2. Context windows

Local models with huge contexts exist, but they get slow. Pasting a 50-file codebase into an 8B model and waiting is painful. I ended up writing a little script that does smart file selection instead of just dumping everything — basically a poor man's RAG. Worked better than I expected, but it was a weekend of yak-shaving I wasn't planning on.
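The "smart file selection" is nothing fancy. Here's a toy version of the idea, using keyword overlap as a crude stand-in for embedding similarity (the file names and contents are illustrative):

```python
import re


def select_files(files: dict[str, str], query: str, top_k: int = 5) -> list[str]:
    """Rank files by keyword overlap with the query and keep the top few.

    `files` maps path -> contents. Score = number of distinct query
    terms that appear in the file. Real embeddings do better, but this
    already beats pasting the whole repo into a small context window.
    """
    terms = set(re.findall(r"\w+", query.lower()))

    def score(text: str) -> int:
        return len(terms & set(re.findall(r"\w+", text.lower())))

    ranked = sorted(files, key=lambda p: score(files[p]), reverse=True)
    return ranked[:top_k]
```

The point isn't the scoring function; it's that sending the model 5 relevant files instead of 50 keeps a local model fast and on-topic.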

3. Model sprawl

I have like 14 models pulled right now. qwen3-coder for code. deepseek-r1 for reasoning. A vision model for screenshots. Whisper for audio. A small embedding model. A translator. Each one made sense at the time. Now my Ollama directory is 180 GB and I can't remember what half of them are for.

I need to do a spring cleaning. I keep putting it off.

4. The "is it plugged in" problem

My PC is in another room. Sometimes my wife moves stuff and unplugs the switch. Sometimes Windows decides to reboot for updates at 3am. Sometimes the LAN cable gets bumped.

Cloud APIs just... work. Local stuff requires you to be your own SRE. I have health checks now. I have a Telegram bot that pings me when Ollama stops responding. This is not the kind of "home lab tinkering" I signed up for, but here we are.
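The health check itself is a few lines; the only non-obvious part is debouncing, because one dropped request shouldn't page me. A sketch (the hostname is my LAN name, so adjust it, and the Telegram step is left as a comment since it's just a `sendMessage` call against the Bot API):

```python
import urllib.request


def ollama_alive(base_url: str = "http://pc.lan:11434", timeout: float = 3.0) -> bool:
    """True if the Ollama server answers at all at its root URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except OSError:
        return False


def should_alert(history: list[bool], threshold: int = 3) -> bool:
    """Only alert after `threshold` consecutive failures -- one blip isn't worth a ping."""
    return len(history) >= threshold and not any(history[-threshold:])


# I run this from cron every minute, keep `history` in a small state file,
# and fire the Telegram bot (Bot API sendMessage) when should_alert() flips.
```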

What I'd do differently if I started today

Skip the "run everything locally or nothing" purity thing. I wasted a few weeks trying to make local models do things they're just not good at yet. The sweet spot is a hybrid setup: local for the 90% of boring stuff, a small cloud budget for the 10% that genuinely needs a bigger brain.

Buy one strong GPU instead of three mediocre ones. I have a drawer full of 3070s and 3080s I thought I'd use for "multi-GPU inference." In practice, a single 12GB card running a good 30B model handles almost everything I need, and juggling multiple cards adds complexity I don't want.

Put the models behind one API, not five. Ollama, llama.cpp, a Whisper server, a vision endpoint, embeddings... I should have stood up a single gateway that routes to the right backend. Instead I have five different base URLs in five different config files. It's fine, but it's ugly.
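What I mean by "one gateway," in miniature: a single routing table instead of five config files. The hostnames and ports below are hypothetical, not my actual setup:

```python
# Hypothetical routing table -- one place to change when a backend moves.
BACKENDS = {
    "chat":       "http://pc.lan:11434/v1",      # Ollama (OpenAI-compatible endpoint)
    "embeddings": "http://macmini.lan:8081/v1",
    "audio":      "http://macmini.lan:8082/v1",
    "vision":     "http://pc.lan:11435/v1",
}


def route(task: str) -> str:
    """Pick the base URL for a task type; unknown tasks fall back to chat."""
    return BACKENDS.get(task, BACKENDS["chat"])
```

Every client script then asks `route("embeddings")` instead of hardcoding a URL, and moving a model to a different box becomes a one-line change.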

Write down what each model is for. Past-me assumed future-me would remember. Future-me does not remember.

Would I go back?

No. But I'd stop telling people "local AI is ready, ditch the subscriptions" like it's some binary choice. It's not.

If you're a developer who likes tinkering, has a decent GPU already, and wants to cut a $20-40/month subscription — yeah, it's great. You'll learn a lot, you'll own your stack, and you won't feel weird about pasting private code into someone else's API.

If you just want the best possible model for your work and don't want to babysit anything — stay on the cloud. There's no shame in it.

I'm in the first camp, but I was wrong to pretend the second camp was being lazy. They were being reasonable.


Anyone else running a mixed setup like this? I'm curious what other people's "local for X, cloud for Y" split looks like. Drop it in the comments — I'm especially interested in how you're handling the context-window problem without building your own RAG from scratch.
