jay wong

Posted on • Originally published at raw.githubusercontent.com

How We Ran 28 AI Agents on a Single Server (And What Broke)

Posted by the Corellis team — we're the folks behind CorellisOrg/corellis, an open-source multi-agent coordination layer. This post documents 8 weeks of running it in production.

We started an experiment back in February 2026: what if every single human in the company had their own AI assistant, and those assistants could talk to each other?

Eight weeks in, we have 28 agents running on a single commodity hardware box. They handle operational toil, marketing coordination, release approvals, and weekly reports. So far, they've processed 50,000+ Slack messages and made 500+ self-corrections.

Here's what actually worked, what blew up, and why we eventually open-sourced the whole mess.

The Setup

Each agent is isolated in its own Docker container:

  • Private memory: Conversation history persists locally.
  • Shared context: Read-only access to a company Knowledge Base.
  • Interface: Slack channels are the primary UI.
  • Tasking: Connected to Notion for structured input/output.

A "Controller" agent acts as the brain — assigning goals, watching the fleet, and synthesizing lessons across agents.

```
┌─────────────────────────────────────────────┐
│  Single Server (64GB RAM, no GPU)           │
│                                             │
│  🎛️  Controller                              │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐      │
│  │Alice │ │Bob   │ │Carol │ │...×24│      │
│  │Mktg  │ │Ops   │ │Fin   │ │      │      │
│  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘      │
│     └─────────┴─────────┴────────┘          │
│       Shared knowledge + team memory        │
└─────────────────────────────────────────────┘
```

The Disaster Timeline

1. Memory Overflow — Week 2

Every agent keeps a MEMORY.md file as long-term storage. By week 2, some files had grown to 20KB+, stuffed with duplicate task notes, outdated fragments, and random chatter.

The symptom: Agents started hallucinating about completed tasks. They'd cite decisions that had since been overturned as if they were still gospel.

The fix: A strict memory hierarchy with aggressive pruning. The core insight: not everything deserves to be remembered. Context is expensive, and context decay is real.

We split memory into four tiers:

  • Personal (per agent): Capped at 5KB. Auto-pruned weekly.
  • Member (per human): Correction history and preferences for each team member the agent works with.
  • Channel (per topic): Indexed with embeddings for semantic search.
  • Company (shared): Vetted knowledge that never expires.
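
A minimal sketch of what that hierarchy could look like, assuming a simple byte-based cap and oldest-first pruning (the class and method names here are illustrative, not the actual Corellis code):

```python
from dataclasses import dataclass, field

PERSONAL_CAP_BYTES = 5 * 1024  # the 5KB cap on the personal tier


@dataclass
class MemoryStore:
    """Hypothetical four-tier memory mirroring the hierarchy above."""
    personal: list[str] = field(default_factory=list)            # per agent, capped
    member: dict[str, list[str]] = field(default_factory=dict)   # per human
    channel: dict[str, list[str]] = field(default_factory=dict)  # per topic
    company: list[str] = field(default_factory=list)             # vetted, never expires

    def remember_personal(self, note: str) -> None:
        self.personal.append(note)
        self._prune_personal()

    def _prune_personal(self) -> None:
        # Drop the oldest notes until the tier fits back under the cap.
        while sum(len(n) for n in self.personal) > PERSONAL_CAP_BYTES:
            self.personal.pop(0)
```

The key design choice is that only the personal tier is lossy; the company tier is append-only and curated by humans, so it never gets pruned by a machine.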

2. The Correction Loop Problem — Week 3

We built a self-improvement system: when an agent gets corrected, it logs the lesson and adjusts its behavior.

In practice, agents were recording every correction, including contradictory ones. Agent A would get corrected by the CEO to "use formal tone," then an hour later get corrected by the CTO to "be more casual." Result: a confused agent outputting garbage.

The fix: Corrections go through a promotion pipeline.

  1. Agent records locally in .learnings/corrections.md.
  2. Same correction happens twice → promote to the agent's "Core Rules."
  3. Applies fleet-wide → promote to shared rules.
  4. Contradictions are flagged for a human to resolve.
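
The promotion logic is simple enough to sketch in a few lines. This is an illustrative reconstruction of steps 1, 2, and 4 above, not the shipped implementation:

```python
from collections import Counter


class CorrectionLog:
    """Sketch of the correction promotion pipeline."""

    def __init__(self):
        self.local = Counter()   # step 1: corrections recorded locally
        self.core_rules = set()  # step 2: promoted after a repeat
        self.flagged = set()     # step 4: contradictions for a human

    def record(self, rule, contradicts=None):
        if contradicts and contradicts in self.core_rules:
            # Conflicting guidance is escalated, never silently applied.
            self.flagged.add((rule, contradicts))
            return
        self.local[rule] += 1
        if self.local[rule] >= 2:
            # The same correction twice promotes it to Core Rules.
            self.core_rules.add(rule)
```

The CEO/CTO tone conflict from above would land in `flagged` instead of whipsawing the agent's behavior.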

We call this Fleet Learning. Once one agent gets corrected on a specific mistake, the rest get smarter immediately.

3. The Coordination Deadlock — Week 5

We gave the controller a simple goal: "Launch a user acquisition campaign." It broke it down into 6 sub-tasks and assigned them to 6 agents.

Two of those agents needed to coordinate on an API design. Neither would start. Both were politely waiting for the other to finish their spec. A classic deadlock — but it happened in natural language, which makes it way harder to detect from the outside.

The symptom: Two agents trading "I'm waiting for your specs" messages back and forth for 3 days straight.

The fix: We built GoalOps — a structured protocol for goal decomposition.

  • Goals have explicit dependencies.
  • Agents must declare what they're blocked on.
  • The controller monitors for stalls (no progress log in 24h → escalate).
  • P2P handoffs have timeouts.
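
A stripped-down stall detector in the spirit of GoalOps might look like this (class and field names are ours, for illustration):

```python
STALL_THRESHOLD_S = 24 * 3600  # no progress log in 24h → escalate


class GoalTracker:
    """Minimal stall detection over declared dependencies."""

    def __init__(self):
        self.last_progress: dict[str, float] = {}
        self.blocked_on: dict[str, str] = {}

    def log_progress(self, task: str, now: float) -> None:
        self.last_progress[task] = now

    def declare_blocked(self, task: str, dependency: str) -> None:
        # Agents must say explicitly what they are waiting for.
        self.blocked_on[task] = dependency

    def stalled(self, now: float) -> list[str]:
        # Tasks with no progress log inside the threshold get escalated.
        return [t for t, last in self.last_progress.items()
                if now - last > STALL_THRESHOLD_S]
```

The point isn't the timer; it's that "I'm waiting for your specs" becomes machine-readable state the controller can act on, instead of polite Slack messages nobody audits.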

4. The Context Window Tax — Week 6

With 28 agents hitting LLM APIs constantly, we were burning through tokens. Each agent's system prompt was 3–4K tokens — personality, hardcoded rules, memory snippets, and company knowledge loaded every single call.

The cost: ~$2,400/month, a huge chunk wasted on re-sending the same context.

The fix:

  • Moved to on-demand loading of shared knowledge.
  • Built a semantic search tool (we call it Teamind) instead of context stuffing. Agents search an index of past conversations rather than loading everything into the prompt.
  • Pruned system prompts to essentials (<2K tokens).
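
The shape of the change is easy to show side by side. A toy sketch (the `retrieve` callable stands in for Teamind's semantic search; everything here is illustrative):

```python
def naive_prompt(system_core: str, knowledge_base: list[str], query: str) -> str:
    # The old approach: stuff the whole knowledge base into every call.
    return "\n\n".join([system_core, *knowledge_base, query])


def build_prompt(system_core: str, query: str, retrieve) -> str:
    # The fix: a lean system prompt plus only the snippets a
    # semantic search deems relevant to this specific call.
    relevant = retrieve(query, top_k=3)
    return "\n\n".join([system_core, *relevant, query])
```

With a 3–4K-token prompt replayed on every call across 28 agents, trimming each prompt by even half compounds fast; that's most of the drop from $2,400 to $800.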

Result: API costs dropped to ~$800/month. Still not cheap, but sustainable.

5. The Trust Problem — Ongoing

The hardest part isn't code; it's defining who's in charge. Some tasks need human approval (deployments, financial decisions). Some don't (formatting a report, finding docs).

We run a 3-tier system:

  • Auto-execute: Low-risk, reversible.
  • Notify-and-proceed: Medium-risk, human gets a ping.
  • Wait-for-approval: High-risk, blocks until human clicks "Yes."
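
Routing actions through those tiers can be as simple as a lookup with a safe default. The action names and mapping below are made up for illustration; in practice the policy would be per-team and configurable:

```python
from enum import Enum


class Tier(Enum):
    AUTO = "auto-execute"
    NOTIFY = "notify-and-proceed"
    APPROVE = "wait-for-approval"


# Illustrative policy table; real risk tolerances differ per human.
ACTION_TIERS = {
    "format_report": Tier.AUTO,
    "post_social": Tier.NOTIFY,
    "deploy_prod": Tier.APPROVE,
    "move_funds": Tier.APPROVE,
}


def dispatch(action: str, approved: bool = False) -> str:
    # Unknown actions fall through to the safest tier.
    tier = ACTION_TIERS.get(action, Tier.APPROVE)
    if tier is Tier.APPROVE and not approved:
        return "blocked: waiting for human approval"
    if tier is Tier.NOTIFY:
        return "executed (human notified)"
    return "executed"
```

Defaulting unknown actions to `APPROVE` is the important bit: agents invent new action types constantly, and you want the surprise to block, not execute.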

The challenge? Every human has a different risk tolerance. Our marketing lead wants full autonomy for social posts. The ops lead wants approval on every deployment. We're still tuning this friction.

What Actually Works Well

Semantic Team Memory (Teamind)

Every Slack message is indexed with embeddings. Any agent can ask "what did we decide about pricing last month?" and get an accurate answer with source links.

This alone justifies the whole setup. No more "I think someone mentioned this in a meeting I wasn't at."
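
Conceptually it's nearest-neighbor search over message embeddings. Here's a toy version with a bag-of-words "embedding" standing in for the real model, just to show the query-to-source-link shape:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; the real system uses model embeddings.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(index: list[tuple[str, str]], query: str) -> str:
    # index entries are (message, source_link); answers always cite a source.
    q = embed(query)
    best = max(index, key=lambda entry: cosine(embed(entry[0]), q))
    return f"{best[0]} (source: {best[1]})"
```

Always returning the source link is what makes the answers trustworthy: the agent's claim is one click from the original Slack thread.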

Fleet Learning

When one agent gets burned by a common mistake — like using a deprecated API — the correction propagates instantly. Over 8 weeks, we went from 500 individual corrections to 47 fleet-wide rules. New agents inherit those rules for free, so they start out smarter than our first agents did on day one.

Goal Decomposition

You give the controller: "Prepare the quarterly report."

It figures out which agents have relevant data, creates sub-tasks with dependencies, monitors progress, and merges the outputs. This used to take a human PM 2–3 hours of coordination. Now it's one sentence and ~45 minutes of agent work. Not instant, but consistent.
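
Under the hood this is dependency-ordered scheduling. A sketch using the standard library's topological sorter, with sub-task names we invented for the quarterly-report example:

```python
from graphlib import TopologicalSorter

# Illustrative sub-task graph for "Prepare the quarterly report":
# each key maps to the set of tasks that must finish first.
subtasks = {
    "collect_finance_data": set(),
    "collect_ops_metrics": set(),
    "draft_sections": {"collect_finance_data", "collect_ops_metrics"},
    "merge_report": {"draft_sections"},
}

# A valid execution order; independent collection tasks can run in parallel.
order = list(TopologicalSorter(subtasks).static_order())
```

The controller's real job is layered on top of this ordering: picking which agent owns each node, watching for stalls, and merging the leaves back into one document.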

The Numbers

| Metric | Value |
| --- | --- |
| Agents running | 28 |
| Server | Single box, 64GB RAM, no GPU |
| Messages indexed | 50,000+ |
| Self-corrections | 500+ |
| Fleet-wide rules | 47 |
| Monthly API cost | ~$800 |
| Uptime (8 weeks) | 99.2% |

Should You Do This?

Probably not yet, unless you have:

  • A team that already uses AI assistants.
  • Repetitive coordination overhead you want to eliminate.
  • A high tolerance for debugging agents that misunderstand each other.

If you're curious, we open-sourced everything: CorellisOrg/corellis on GitHub. MIT license, runs on Docker, works with any OpenClaw setup.

The future isn't just one AI assistant. It's a team of them that learns together.


Questions? The lobsters are listening. 🦞

The Corellis team
