I keep seeing posts about running AI on a single machine. "Just use Ollama on your laptop!" Sure, that works — until you want to run a 30B model while your IDE is indexing, your test suite is running, and you're editing a video.
I have three machines. Not because I'm rich — because I kept old hardware alive and gave each one a job. Here's how I split AI workloads across a Mac Mini M4, a Windows PC with an RTX 3060, and an old Ubuntu box, and why one machine was never enough.
The Three Machines
| Machine | Specs | Role |
|---|---|---|
| Mac Mini M4 | 10-core, 16GB RAM | Orchestrator, coding, light inference |
| Windows PC | AMD 9970X, 128GB RAM, RTX 3060 12GB | Heavy inference, image generation, GPU rental |
| Ubuntu Box | Older CPU, 16GB RAM | Background services, OmniParser, CPU inference fallback |
Total cost for the additional machines: close to zero. The Windows PC was a workstation I already had, and the Ubuntu box is a repurposed laptop. The Mac Mini is the only machine I bought specifically for this.
Why Three? Because One Keeps Running Into Walls.
Here's what happens when you try to do everything on a single machine:
On the Mac Mini alone: Run a 9B model → fine. Try a 30B model → the fans sound like a jet engine and inference drops to 2 tokens/second because Apple Silicon shares RAM between CPU and GPU. Start a build while inference is running? Everything crawls.
On the Windows PC alone: The RTX 3060 crushes inference — a 30B model at 4-bit runs at 8-12 tok/s. But the machine is also my GPU rental host (I rent it on Vast.ai for passive income). When someone rents it, I can't use it. And running heavy inference while coding in an IDE on the same machine? Stutter city.
On the Ubuntu box alone: It's slow. CPU-only inference on a 7B model takes 30+ seconds to produce even a short answer. But it runs 24/7 without complaint, never reboots for Windows updates, and costs almost nothing in power.
The answer was never "pick the best one." It was "give each one the right job."
The Division of Labor
Mac Mini: The Brain
This is where I actually work. Code editing, git operations, web browsing, writing. It runs:
- Ollama with small models — Qwen 3 4B for quick chat, granite3.2-vision for screenshots
- My automation agents — OpenClaw runs here 24/7, orchestrating the other machines
- Dev environment — VS Code, terminal, browser, all the tools
The key insight: the Mac Mini is for interactive work. Fast response time matters more than throughput. A 4B model answering in 0.3 seconds beats a 30B model answering in 8 seconds when I'm in the middle of a thought.
What it does NOT do: heavy batch processing, image generation, anything that pins the CPU for more than a minute.
Windows PC: The Muscle
The RTX 3060 with 12GB VRAM is the workhorse. It runs:
- Ollama with big models — Qwen3-Coder 30B for complex code, DeepSeek R1 8B for reasoning
- FLUX image generation — when I need AI photos (I run a Fiverr gig for brand character photos — $0 generation cost when it's your own GPU)
- Vast.ai host — when I'm not using it, someone else pays to rent it. ~$50-130/month passive income.
The scheduling is simple: during my working hours (roughly 9am-11pm Istanbul time), I pause Vast.ai rentals and use the GPU myself. Overnight, it goes back on the rental market.
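The pause/resume decision is just a time-window check. Here's a minimal sketch of the idea; pause_rentals() and resume_rentals() are hypothetical hooks you'd wire up to however you manage your Vast.ai host listing, since I'm not showing the actual Vast.ai tooling here:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

WORK_START, WORK_END = 9, 23  # 9am-11pm, Istanbul time

def in_working_hours() -> bool:
    """True during the hours I want the GPU to myself."""
    now = datetime.now(ZoneInfo("Europe/Istanbul"))
    return WORK_START <= now.hour < WORK_END

# Hypothetical hooks -- wire these to your Vast.ai host management.
def pause_rentals() -> None: ...
def resume_rentals() -> None: ...

if in_working_hours():
    pause_rentals()    # the GPU is mine
else:
    resume_rentals()   # the GPU goes back on the market
```

Run it from a scheduled task every few minutes and the handoff takes care of itself.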
128GB of RAM means I can load truly large models or run multiple services simultaneously. The GPU handles inference; the CPU cores handle data prep and post-processing.
Ubuntu Box: The Backbone
This machine's superpower is reliability. It runs:
- OmniParser — a UI parsing service that my automation agents use to "see" screens (runs on port 8100, I SSH-tunnel to it from the Mac)
- CPU inference fallback — when the Windows GPU is rented out and the Mac is busy, the Ubuntu box runs minicpm-v for vision tasks. Slow (~180 seconds per query) but it works.
- Background cron jobs — data scraping, health checks, monitoring scripts
- Testing ground — I deploy new services here first because if something crashes, my main workflow isn't affected
It's the machine I interact with least directly, but it's the one I'd miss most if it went down.
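Those cron-driven health checks don't need to be fancy. A minimal sketch of the kind of script this box runs; the two Ollama endpoints are the other machines' LAN addresses (listed in the next section), and the OmniParser health path is an assumption you'd adjust to whatever route the service actually exposes:

```python
import requests

# Every service worth pinging, one line each.
CHECKS = {
    "mac-ollama":     "http://192.168.1.102:11434/api/tags",
    "windows-ollama": "http://192.168.1.106:11434/api/tags",
    "omniparser":     "http://localhost:8100/",  # assumed health path
}

for name, url in CHECKS.items():
    try:
        ok = requests.get(url, timeout=5).ok
    except requests.RequestException:
        ok = False
    print(f"{name}: {'up' if ok else 'DOWN'}")
```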
How They Talk to Each Other
This was the hardest part to figure out. Three machines that don't communicate are just three isolated computers.
The network:
Mac Mini (192.168.1.102) ←→ Windows PC (192.168.1.106) ←→ Ubuntu (192.168.1.100)
All on the same local network. No VPN needed at home, but I set up Tailscale for remote access when I'm not home.
Ollama routing:
The Mac Mini's Ollama is configured with OLLAMA_HOST=0.0.0.0 so it's accessible from the other machines. Same for the Windows PC. My automation agent on the Mac Mini routes requests based on model size:
```python
def get_ollama_url(model: str) -> str:
    """Route to the right machine based on model size."""
    small_models = ["qwen3:4b", "granite3.2-vision:2b", "moondream"]
    big_models = ["qwen3-coder:30b", "deepseek-r1:8b"]
    if model in big_models:
        return "http://192.168.1.106:11434"  # Windows GPU
    elif model in small_models:
        return "http://localhost:11434"      # Mac Mini local
    else:
        return "http://192.168.1.100:11434"  # Ubuntu fallback
```
Simple. No Kubernetes, no service mesh. Just a function that knows which machine has which model.
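Calling it is a plain HTTP request to Ollama's /api/generate endpoint. A minimal sketch that uses the router above (the prompt is obviously a placeholder):

```python
import requests

def ask(model: str, prompt: str) -> str:
    """Send a prompt to whichever machine hosts the model."""
    url = f"{get_ollama_url(model)}/api/generate"
    resp = requests.post(
        url,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # a 30B model on the 3060 can take a while
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Small model answers locally; the 30B coder hits the Windows GPU.
print(ask("qwen3:4b", "One-line summary of what SSH tunneling does?"))
```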
SSH tunnels for services:
The Ubuntu box runs OmniParser on port 8100, but I need it on the Mac Mini. Instead of exposing it publicly:
```bash
# Auto-reconnecting SSH tunnel (runs via launchd on Mac)
ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
    -L 8100:localhost:8100 exp@192.168.1.100 -N
```
Now localhost:8100 on the Mac Mini reaches OmniParser on Ubuntu. Clean, secure, zero config on the Ubuntu side.
The Mistakes I Made
Mistake 1: Trying to make the Mac Mini do everything.
For the first month, I ran all inference on the Mac Mini. The 16GB unified memory seems like a lot until a 30B model is eating 20GB and your IDE is using another 4GB and macOS starts swapping to SSD. The SSD is fast, but swap is still swap. Code completion went from instant to 3-second delays.
Mistake 2: Not setting up Ollama access from day one.
I installed Ollama separately on each machine and used them in isolation for weeks. The moment I realized I could call the Windows Ollama from the Mac Mini was a lightbulb moment. The models were always there — I just wasn't using them.
Mistake 3: Ignoring the Ubuntu box because it's "slow."
CPU inference is slow. But "slow" is relative. A background task that takes 3 minutes on CPU doesn't matter if you're not waiting for it. I was treating every request as if it needed to be instant. Turns out, most don't.
Mistake 4: No fallback plan.
When the Windows PC rebooted for an update, my entire inference pipeline died because I had no fallback routing. Now: if Windows is down, Ubuntu takes over. If Ubuntu is down, the Mac runs a tiny model. Something always answers.
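The fallback chain is a few lines on top of the router: probe each machine before committing to it. A minimal sketch, assuming each host's liveness can be judged by whether its Ollama answers /api/tags:

```python
import requests

# Preference order; the last entry is the always-on tiny model.
FALLBACK_CHAIN = [
    ("http://192.168.1.106:11434", "qwen3-coder:30b"),  # Windows GPU
    ("http://192.168.1.100:11434", "minicpm-v"),        # Ubuntu CPU
    ("http://localhost:11434", "qwen3:4b"),             # Mac Mini
]

def is_up(base_url: str) -> bool:
    """A host counts as up if its Ollama answers /api/tags quickly."""
    try:
        return requests.get(f"{base_url}/api/tags", timeout=2).ok
    except requests.RequestException:
        return False

def pick_endpoint() -> tuple[str, str]:
    """Return the first (url, model) pair whose host is alive."""
    for url, model in FALLBACK_CHAIN:
        if is_up(url):
            return url, model
    raise RuntimeError("no inference host is reachable")
```

Slower is fine. Silence is not.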
What This Setup Enables
Running three machines isn't about having the biggest, fastest setup. It's about never being blocked.
- Windows GPU is rented out? Route to Mac or Ubuntu.
- Mac Mini is busy with a build? Use the Windows GPU via network.
- Ubuntu is running a heavy OmniParser job? Queue it, the Mac handles the request.
- Need to generate 50 images for a client? Windows GPU does it in batch while I keep coding on the Mac.
And the cost breakdown:
| Item | Monthly Cost |
|---|---|
| Mac Mini power | ~$8 |
| Windows PC power (idle + rental hours) | ~$15-25 |
| Ubuntu box power | ~$5 |
| Vast.ai earnings (offset) | ~$50-130 |
| Net | -$17 to -$97 (net profit) |
The setup literally pays for itself. The Windows PC's GPU rental income exceeds the total power cost of all three machines.
Do You Need Three Machines?
Probably not. But you might need two.
If you're running AI on a laptop: get a cheap desktop (even without a GPU) and put it on your network as a background worker. Ollama on a second machine means your main machine stays responsive.
If you have a gaming PC: you already have a GPU. Install Ollama on it, set OLLAMA_HOST=0.0.0.0, and call it from your laptop. You now have a two-machine AI lab for zero additional cost.
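To confirm the link works from the laptop side, list the gaming PC's models over the network; the IP here is a placeholder for whatever your PC's LAN address is:

```python
import requests

PC = "http://192.168.1.50:11434"  # placeholder: your gaming PC's LAN IP

# /api/tags lists every model the remote Ollama has pulled.
models = requests.get(f"{PC}/api/tags", timeout=5).json()["models"]
print([m["name"] for m in models])
```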
The Ubuntu box in my setup could be a Raspberry Pi 5 with 8GB RAM (around $80) for the same purpose. It doesn't need to be fast. It needs to be there.
What I'm Adding Next
I have five more GPUs in storage (2× RTX 3070, 1× RTX 3080, 2× RTX 3060). The plan is to build an open-air rig and add them all to the Windows PC for both local inference and Vast.ai rental. That'd give me 62GB total VRAM — enough to run a 70B model locally or rent out as a cluster.
The Ubuntu box is getting a dedicated role as a monitoring and alerting server. Right now my health checks are scattered across cron jobs on all three machines. Centralizing them makes debugging easier.
And the Mac Mini? It stays the brain. More RAM would be nice (16GB is tight with Ollama + IDE + browser), but the M4's efficiency is hard to beat for interactive work.
The Takeaway
One machine is a workstation. Two machines is a lab. Three machines is a distributed system that happens to fit on a desk.
The magic isn't in the hardware — it's in the routing. Knowing which machine should handle which task, building fallback paths, and making sure a single machine going down doesn't kill your workflow.
Start with what you have. Add a second machine when you feel the friction. Route intelligently. The three-machine setup happened organically — each machine earned its place by solving a problem the others couldn't.
I write about running AI locally, home lab setups, and turning hardware into income. New post every few days — follow along if that's your thing.
My GPU rental story: One card → six cards → $250/month passive