How I Run a 50-Agent AI Workforce on a Single 6GB GPU

#ai #selfhosted #machinelearning #showdev

Build-in-public. This is the real architecture behind running ~50 local AI agents on 6GB of VRAM — one GPU lock, an eviction watchdog, a resource governor, and a model router. Originally posted on my blog.

The question I get most often is some version of "there's no way you run that many agents on a 6GB laptop GPU." The honest answer: not the way you're picturing it. I don't run 50 models at once. I run one model at a time, very deliberately — and most of the engineering is about scheduling, not inference. Here's the actual architecture.

The hard constraint: 6GB of VRAM

A single consumer GPU with 6GB of VRAM holds roughly one 7B-parameter model at a usable quantization. Two at once? It thrashes — the GPU starts swapping, latency explodes, and eventually a driver out-of-memory can take the whole machine down. I've had the desktop freeze from exactly that.
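
A rough back-of-the-envelope check of that claim, as a sketch. The 4-bit quantization and the 1 GB allowance for KV cache and runtime overhead are assumptions for illustration, not measured figures from this setup.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(n_models: int, vram_gb: float = 6.0, params_billion: float = 7.0,
         bits_per_weight: int = 4, overhead_gb: float = 1.0) -> bool:
    """True if n_models copies of the weights plus one assumed overhead fit in VRAM.

    overhead_gb is a guessed allowance for KV cache, activations and the
    runtime itself; real numbers depend on context length and backend.
    """
    return n_models * weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


print(weights_gb(7, 16))  # 14.0 GB at fp16: far too big for 6GB
print(weights_gb(7, 4))   # 3.5 GB at 4-bit
print(fits(1))            # True:  3.5 + 1.0 <= 6
print(fits(2))            # False: 7.0 + 1.0 >  6
```

Under these assumptions one 7B model fits with a little headroom, and a second one does not fit at all.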

So the first design rule wrote itself: only one heavy model is allowed on the GPU at any moment.

That sounds limiting. It isn't — because almost nothing I run is latency-sensitive. A blog post that publishes at 7am doesn't care if it was generated at 6:52 or 6:58. Once you accept that your AI workforce is a batch system, not a chat window, the whole problem changes shape.

A lock, not a crowd

Every agent that needs the GPU has to take a lock first. It's a simple file-based queue with:

FIFO ordering
PID-based ownership
Stale-lock detection, so a crashed job can't wedge the line forever

If an agent can't get the lock within its timeout, it skips gracefully and tries again on its next scheduled run instead of piling up.
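
A minimal sketch of how a lock like this could work, assuming a POSIX system and Python 3.8+. It is an illustration, not the production code: the `GpuLock` class, the ticket-file naming scheme, and the default timeout are all hypothetical. Each waiter drops a timestamped ticket file into a queue directory; the oldest ticket whose PID is still alive holds the lock, and tickets left behind by dead processes are deleted.

```python
import os
import time
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX semantics)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


class GpuLock:
    """FIFO file lock: the oldest live ticket in queue_dir owns the GPU."""

    def __init__(self, queue_dir, timeout_s: float = 300.0, poll_s: float = 1.0):
        self.queue_dir = Path(queue_dir)
        self.timeout_s = timeout_s
        self.poll_s = poll_s
        self.ticket = None

    def _live_tickets(self):
        """Return tickets in FIFO order, deleting any whose owner has died."""
        live = []
        for path in sorted(self.queue_dir.glob("*.ticket")):
            pid = int(path.stem.split("-")[1])  # name is "<time_ns>-<pid>.ticket"
            if pid_alive(pid):
                live.append(path)
            else:
                path.unlink(missing_ok=True)  # stale: a crashed job can't wedge the line
        return live

    def acquire(self) -> bool:
        """Join the queue and wait for our turn; return False on timeout."""
        self.queue_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padded timestamp so lexical sort order equals arrival order.
        name = f"{time.time_ns():020d}-{os.getpid()}.ticket"
        self.ticket = self.queue_dir / name
        self.ticket.touch()
        deadline = time.monotonic() + self.timeout_s
        while True:
            live = self._live_tickets()
            if live and live[0] == self.ticket:
                return True
            if time.monotonic() >= deadline:
                self.release()  # leave the queue instead of piling up
                return False
            time.sleep(self.poll_s)

    def release(self) -> None:
        """Remove our ticket so the next waiter becomes the owner."""
        if self.ticket is not None:
            self.ticket.unlink(missing_ok=True)
            self.ticket = None
```

A worker would call `acquire()`, run its model only if that returned `True`, and call `release()` in a `finally` block. This sketch is not hardened against every race (two tickets created within the same instant, or a PID being reused after a crash), which a real implementation would need to handle.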

So at 50 agents, what's really happening is: dozens of cron-scheduled Python workers wake up throughout the day, and the ones that need the model form an orderly line for it. The fleet is huge; the number of jobs on the GPU at any moment is never more than one. That's the trick. It's less "50 models" and more "50 employees sharing one very busy workstation, politely."

Eviction and a VRAM watchdog

Even with the lock, idle models linger in VRAM. So a small monitor checks GPU usage every few minutes and evicts idle models when usage climbs past a threshold. Overnight, when I want the GPU clear for heavier jobs, that threshold drops automatically so daytime models get pushed out sooner.
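
The decision logic of such a monitor can be sketched as two small pure functions. Every number here (the 85% and 50% thresholds, the 23:00 to 06:00 night window, the 10-minute idle limit) is an assumed placeholder, and how VRAM usage is measured or how a model is actually unloaded depends on the inference backend, so both are left out.

```python
from datetime import time as dtime

# All values below are illustrative assumptions, not the real configuration.
DAY_THRESHOLD = 0.85      # evict once more than 85% of VRAM is in use
NIGHT_THRESHOLD = 0.50    # stricter overnight, to clear room for heavy jobs
NIGHT_START = dtime(23, 0)
NIGHT_END = dtime(6, 0)
IDLE_LIMIT_S = 600        # a model counts as idle after 10 minutes unused


def eviction_threshold(now: dtime) -> float:
    """Return the VRAM usage fraction above which idle models are evicted."""
    is_night = now >= NIGHT_START or now < NIGHT_END  # window wraps past midnight
    return NIGHT_THRESHOLD if is_night else DAY_THRESHOLD


def models_to_evict(vram_used_frac: float, idle_seconds_by_model: dict,
                    now: dtime) -> list:
    """Return idle models to unload, longest-idle first, or [] if under threshold."""
    if vram_used_frac <= eviction_threshold(now):
        return []
    idle = [(secs, name) for name, secs in idle_seconds_by_model.items()
            if secs >= IDLE_LIMIT_S]
    return [name for secs, name in sorted(idle, reverse=True)]


usage = {"writer-7b": 900, "tiny-classifier": 30}  # hypothetical model names
print(models_to_evict(0.60, usage, dtime(14, 0)))  # [] : under the daytime threshold
print(models_to_evict(0.60, usage, dtime(2, 0)))   # ['writer-7b'] : night threshold is lower
```

The same 60% usage is tolerated at 2pm and triggers an eviction at 2am, which is the "threshold drops automatically" behavior described above.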

A separate resource governor watches for fragmentation, cache pressure, and swap thrashing, and escalates from gentle (reduce context) to firm (force-evict) before anything can spiral into that driver-OOM freeze.
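
One way to picture that escalation ladder is a small decision function. This is a sketch under assumptions: the `Signals` fields, the soft and hard limits, and the "three soft strikes in a row" rule are all hypothetical, chosen only to show gentle-then-firm escalation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = 0
    REDUCE_CONTEXT = 1  # gentle: shrink the context window
    FORCE_EVICT = 2     # firm: unload the model now


@dataclass
class Signals:
    fragmentation: float   # 0.0 to 1.0, hypothetical fragmentation score
    cache_pressure: float  # 0.0 to 1.0, hypothetical cache pressure score
    swap_mb_per_s: float   # observed swap traffic


# Assumed limits for illustration only.
SOFT = Signals(fragmentation=0.30, cache_pressure=0.70, swap_mb_per_s=5.0)
HARD = Signals(fragmentation=0.60, cache_pressure=0.90, swap_mb_per_s=50.0)
MAX_SOFT_STRIKES = 3


def _exceeds(s: Signals, limit: Signals) -> bool:
    """True if any single signal is above its limit."""
    return (s.fragmentation > limit.fragmentation
            or s.cache_pressure > limit.cache_pressure
            or s.swap_mb_per_s > limit.swap_mb_per_s)


def decide(signals: Signals, soft_strikes: int):
    """Return (action, updated strike count) for one governor check."""
    if _exceeds(signals, HARD):
        return Action.FORCE_EVICT, 0
    if _exceeds(signals, SOFT):
        soft_strikes += 1
        if soft_strikes >= MAX_SOFT_STRIKES:
            return Action.FORCE_EVICT, 0  # gentle measures did not help
        return Action.REDUCE_CONTEXT, soft_strikes
    return Action.NONE, 0  # healthy check resets the count
```

The caller keeps the strike count between checks, so sustained mild pressure eventually gets the same firm response as a sudden spike.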

The four moving parts:

One lock serializes all heavy GPU work.

An eviction monitor frees VRAM when idle models overstay.

A resource governor catches thrashing early and acts before the machine is at risk.

A model router lets agents ask for "a model for task X" instead of naming one, so the right size gets picked for the work.

The router is the real unlock

Agents never hardcode "use the 7B model." They ask the router for a model suited to the task, and the router decides: a tiny model on CPU for a quick classification, the 7B for real writing, or a free cloud tier for something bigger when it makes sense.

That one layer means the same agents run unchanged whether you're on a potato or a workstation — the router absorbs the hardware difference. On a beefier machine it allows more concurrency and bigger models; on a weak one it leans on small local models and slows the cadence. Same code, different gear.
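
To make the idea concrete, here is a minimal sketch of a router along those lines. The task names, model names, and VRAM cutoffs are hypothetical placeholders; the real routing rules are not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str            # hypothetical model identifier
    device: str           # "cpu", "gpu", or "cloud"
    needs_gpu_lock: bool  # only GPU work has to queue for the lock


# Placeholder routes; names and sizes are assumptions for illustration.
TINY_CPU = Route("tiny-classifier", "cpu", False)
LOCAL_7B = Route("local-7b", "gpu", True)
LOCAL_BIG = Route("local-large", "gpu", True)
CLOUD_FREE = Route("cloud-free-tier", "cloud", False)


def route(task: str, vram_gb: float, cloud_available: bool = True) -> Route:
    """Pick a model for a task given the hardware; agents never name a model."""
    can_run_7b = vram_gb >= 6     # assumed cutoff, matching the 6GB setup
    can_run_big = vram_gb >= 16   # assumed cutoff for a larger local model

    if task == "classify":
        return TINY_CPU
    if task == "write":
        return LOCAL_7B if can_run_7b else TINY_CPU
    if task == "long_analysis":
        if can_run_big:
            return LOCAL_BIG
        if cloud_available:
            return CLOUD_FREE
        return LOCAL_7B if can_run_7b else TINY_CPU
    raise ValueError(f"unknown task: {task}")


print(route("write", vram_gb=6).model)           # local-7b
print(route("long_analysis", vram_gb=6).model)   # cloud-free-tier
print(route("long_analysis", vram_gb=24).model)  # local-large
```

The agent code calls `route("write", ...)` the same way on every machine; only the answer changes with the hardware.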