jay wong

How We Ran 28 AI Agents on a Single Server (And What Broke)

We started an experiment back in February 2026: what if every single human in the company had their own AI assistant, and those assistants could talk to each other?
Eight weeks in, we have 28 agents running on a single commodity hardware box. They're handling operational toil, marketing coordination, release approvals, and churning out weekly reports. So far, they've processed over 50,000 Slack messages and made 500+ self-corrections.
I'm writing this to document what actually worked, why we blew up parts of the stack, and why we eventually open-sourced the whole mess.

The Setup

Each agent is isolated in its own Docker container. Here are the gory details:

  • Private memory: Conversation history persists locally.
  • Shared context: Access to the "Knowledge Base" (read-only for agents).
  • Interface: Slack channels are the primary UI.
  • Tasking: Connected to Notion for structured input/output.

There's a "Controller" agent that acts as the brain. It assigns goals, watches the fleet, and synthesizes lessons.

Architecture: Single server (64GB RAM, no GPU) → Controller agent at the top → 28 agent containers (Alice/Mktg, Bob/Ops, Carol/Fin, etc.) → all connected to shared knowledge + team memory layer at the bottom.

The Disaster Timeline

  1. Memory Overflow — Week 2

Every agent keeps a MEMORY.md file as long-term storage. By week 2, we noticed some files growing to 20KB+. They were full of duplicate task notes, outdated fragments, and random chatter.
The symptom: Agents started hallucinating about completed tasks. They'd reference decisions that had since been overturned as if they were gospel.
The fix: We implemented a strict memory hierarchy with aggressive pruning. The gist is simple: not everything deserves to be remembered. Context is expensive, and context decay is real.
We split memory up:

  • Personal memory (per agent): Capped at 5KB. Auto-pruned weekly.
  • Member memory (per team member): Stores correction history and preferences.
  • Channel memory (per topic): Indexed with embeddings for semantic search.
  • Company memory (shared): Vetted knowledge that never expires.
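A minimal sketch of the personal-memory cap described above. The 5KB limit comes from the post; the oldest-first entry layout and function names are illustrative assumptions.

```python
# Prune an agent's personal memory to a byte cap, keeping the newest entries.
# The 5KB cap matches the post; everything else here is an assumption.

PERSONAL_MEMORY_CAP = 5 * 1024  # bytes

def prune_memory(entries: list[str], cap: int = PERSONAL_MEMORY_CAP) -> list[str]:
    """Keep the most recent entries that fit under the byte cap.

    `entries` is ordered oldest-first; walk from the newest end and stop
    once adding an older entry would exceed the cap.
    """
    kept: list[str] = []
    used = 0
    for entry in reversed(entries):
        size = len(entry.encode("utf-8")) + 1  # +1 for the trailing newline
        if used + size > cap:
            break
        kept.append(entry)
        used += size
    return list(reversed(kept))  # restore oldest-first order
```

Running this weekly (as the post describes) means an agent's MEMORY.md can never silently balloon back to 20KB.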

  2. The Correction Loop Problem — Week 3

We built a self-improvement system: when an agent gets corrected, it logs the lesson in a markdown file and adjusts its behavior.
In practice, agents were recording every correction, including contradictory ones. Agent A would get corrected by the CEO to "use formal tone," and an hour later get corrected by the CTO to "be more casual." The result was a confused agent that just started outputting garbage.
The fix: Corrections now go through a promotion pipeline.

  1. Agent records locally in .learnings/corrections.md.
  2. If the same correction happens twice → promote to the agent's "Core Rules."
  3. If it's fundamental for the whole team → promote fleet-wide.
  4. Contradictions are flagged for a human to resolve.

We call this Fleet Learning. Once one agent gets corrected on a specific mistake, the entire fleet gets smarter immediately.

  3. The Coordination Deadlock — Week 5

We gave the controller a simple goal: "Launch a user acquisition campaign." It broke it down into 6 sub-tasks and assigned them to 6 agents.
Two agents needed to coordinate on API design. Neither would start. Both were politely waiting for the other to finish their spec. Classic deadlock, but it happened in natural language, which makes it way harder to detect.
The symptom: Two agents sitting in a deadlock loop, trading "I'm waiting for your specs" messages back and forth for 3 days straight.
The fix: We built GoalOps—a structured protocol for goal decomposition.

  • Goals now have explicit dependencies.
  • Agents must declare what they are blocked on.
  • The controller monitors for stalls (no progress log for 24h? Escalate).
  • P2P handoffs have timeouts.

  4. The Context Window Tax — Week 6

With 28 agents hitting LLM APIs constantly, we were burning through tokens like crazy. Each agent's system prompt alone was 3-4K tokens—personality definition, hardcoded rules, memory snippets, and company knowledge loaded every single time.
The cost: We were burning about $2,400/month. A huge chunk of that was wasted on repeating the same context over and over.
The fix:

  • Moved to on-demand loading of shared knowledge.
  • We built a semantic search tool (call it Teamind) instead of "context stuffing."
  • Agents search the index of all past conversations rather than loading everything into the prompt.
  • Pruned system prompts down to essentials (<2K tokens).

Result: API costs dropped to ~$800/month. It's still not dirt cheap, but it's sustainable enough that we aren't crying every time the bill comes in.
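The shape of the on-demand loading fix can be sketched like this. The word-overlap scorer is a stand-in for real embedding search, and all names are illustrative; the point is that knowledge chunks are ranked against the query and only the top-k make it into the prompt.

```python
# Sketch of on-demand context assembly: score knowledge chunks against the
# query and include only the best matches, instead of stuffing everything
# into every call. Word overlap stands in for embedding similarity.

def score(query: str, chunk: str) -> int:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(system: str, query: str, knowledge: list[str], k: int = 2) -> str:
    relevant = sorted(knowledge, key=lambda c: score(query, c), reverse=True)[:k]
    return "\n\n".join([system, *relevant, query])
```

With a trimmed system prompt plus a handful of relevant chunks, each call stays well under the old 3-4K token floor.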

  5. The Trust Problem — Ongoing

The hardest part isn't code; it's defining who is in charge. Some tasks need human approval (deployments, financial decisions). Some don't (formatting a report, finding docs).
We implemented a 3-tier system:

  • Auto-execute: Low-risk, reversible actions.
  • Notify-and-proceed: Medium-risk, human gets a ping.
  • Wait-for-approval: High-risk, blocks until human clicks "Yes."

The challenge? Every human has a different risk tolerance. Our marketing lead wants full autonomy for social posts. The ops lead wants approval on every deployment. We're still trying to tune this friction.
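The three tiers map naturally onto a small dispatch table. The action names, the default-to-blocking behavior for unknown actions, and the return strings are all assumptions for illustration; per-person risk tolerances would layer on top of this.

```python
# Sketch of the 3-tier approval gate. Unknown actions block by default,
# which is an assumption, not necessarily the authors' choice.
from enum import Enum

class Tier(Enum):
    AUTO = "auto-execute"
    NOTIFY = "notify-and-proceed"
    APPROVE = "wait-for-approval"

RISK_TIERS = {
    "format_report": Tier.AUTO,
    "post_social": Tier.NOTIFY,
    "deploy": Tier.APPROVE,
    "transfer_funds": Tier.APPROVE,
}

def dispatch(action: str, approved: bool = False) -> str:
    tier = RISK_TIERS.get(action, Tier.APPROVE)  # fail closed on unknowns
    if tier is Tier.APPROVE and not approved:
        return "blocked: waiting for human approval"
    if tier is Tier.NOTIFY:
        pass  # would ping the action's owner in Slack here
    return f"executed: {action}"
```

Per-person tuning then becomes a matter of overriding entries in the table rather than changing the gate logic.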

What Actually Works Well

Semantic Team Memory (Teamind)

Every Slack message is indexed with embeddings. Any agent can ask, "What did we decide about pricing last month?" and get an accurate answer with source links.
This alone justifies the whole setup. No more "I think someone mentioned this in a meeting I wasn't at."
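At its core, the query path is nearest-neighbor search over message embeddings, with each hit carrying its source link. A toy version, assuming a simple cosine-similarity scan (the index layout and names are illustrative, not Teamind's actual API):

```python
# Toy semantic memory lookup: find the indexed message whose embedding is
# closest to the query embedding, returning the message and its permalink.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(index: list[tuple[list[float], str, str]],
            query_vec: list[float]) -> tuple[list[float], str, str]:
    """index entries are (embedding, message_text, permalink)."""
    return max(index, key=lambda item: cosine(item[0], query_vec))
```

A production index would use an ANN library rather than a linear scan, but at 50,000 messages the scan is honestly still fast enough.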

Fleet Learning

When one agent gets burned by a common mistake—like using a deprecated API—the correction propagates instantly. After 8 weeks, we went from 500 individual corrections to 47 solidified fleet-wide rules. New agents get these lessons for free, effectively starting out smarter than our first agents did on day one.

Goal Decomposition

You give the controller: "Prepare the quarterly report."
It then:

  1. Figures out which agents have relevant data.
  2. Creates sub-tasks with dependencies.
  3. Monitors progress.
  4. Merges the outputs.

This used to take a human PM 2-3 hours of coordination work. Now it takes one sentence and about 45 minutes of agent-toil. It's not instant, but it's consistent.

The Numbers

  • Agents running: 28
  • Server: Single, 64GB RAM, no GPU
  • Messages indexed: 50,000+
  • Self-corrections: 500+
  • Fleet-wide rules: 47
  • Monthly API cost: ~$800
  • Uptime (8 weeks): 99.2%

Should You Do This?

Probably not yet. Unless you have:

  • A team that already uses AI assistants.
  • Repetitive coordination overhead you want to eliminate.
  • A high tolerance for debugging agents that misunderstand each other.

If you're curious, we open-sourced everything: Corellis on GitHub. MIT license, runs on Docker, works with any OpenClaw setup.
The future isn't just one AI assistant. It's a team of them that learns together.
