<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jay wong</title>
    <description>The latest articles on DEV Community by jay wong (@jay_wong_45c807c6799b4fb7).</description>
    <link>https://dev.to/jay_wong_45c807c6799b4fb7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876192%2F08f4066e-1c7c-48b2-b8cd-c41eeacf4dc6.png</url>
      <title>DEV Community: jay wong</title>
      <link>https://dev.to/jay_wong_45c807c6799b4fb7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jay_wong_45c807c6799b4fb7"/>
    <language>en</language>
    <item>
      <title>How We Ran 28 AI Agents on a Single Server (And What Broke)</title>
      <dc:creator>jay wong</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/jay_wong_45c807c6799b4fb7/how-we-ran-28-ai-agents-on-a-single-server-and-what-broke-39k8</link>
      <guid>https://dev.to/jay_wong_45c807c6799b4fb7/how-we-ran-28-ai-agents-on-a-single-server-and-what-broke-39k8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Posted by the Corellis team — we're the folks behind &lt;a href="https://github.com/CorellisOrg/corellis" rel="noopener noreferrer"&gt;CorellisOrg/corellis&lt;/a&gt;, an open-source multi-agent coordination layer. This post documents 8 weeks of running it in production.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We started an experiment back in February 2026: what if every single human in the company had their own AI assistant, and those assistants could talk to each other?&lt;/p&gt;

&lt;p&gt;Eight weeks in, we have 28 agents running on a single commodity hardware box. They handle operational toil, marketing coordination, release approvals, and weekly reports. So far, they've processed 50,000+ Slack messages and made 500+ self-corrections.&lt;/p&gt;

&lt;p&gt;Here's what actually worked, what blew up, and why we eventually open-sourced the whole mess.&lt;/p&gt;

&lt;h2&gt;The Setup&lt;/h2&gt;

&lt;p&gt;Each agent is isolated in its own Docker container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Private memory:&lt;/strong&gt; Conversation history persists locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared context:&lt;/strong&gt; Read-only access to a company Knowledge Base.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interface:&lt;/strong&gt; Slack channels are the primary UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasking:&lt;/strong&gt; Connected to Notion for structured input/output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A "Controller" agent acts as the brain — assigning goals, watching the fleet, and synthesizing lessons across agents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│  Single Server (64GB RAM, no GPU)           │
│                                             │
│  🎛️  Controller                              │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐      │
│  │Alice │ │Bob   │ │Carol │ │...×24│      │
│  │Mktg  │ │Ops   │ │Fin   │ │      │      │
│  └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘      │
│     └─────────┴─────────┴────────┘          │
│       Shared knowledge + team memory        │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;The Disaster Timeline&lt;/h2&gt;

&lt;h3&gt;1. Memory Overflow — Week 2&lt;/h3&gt;

&lt;p&gt;Every agent keeps a &lt;code&gt;MEMORY.md&lt;/code&gt; file as long-term storage. By week 2, some files had grown to 20KB+, stuffed with duplicate task notes, outdated fragments, and random chatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The symptom:&lt;/strong&gt; Agents started hallucinating about completed tasks. They'd cite decisions that had since been overturned as if they were gospel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; A strict memory hierarchy with aggressive pruning. The core insight: &lt;strong&gt;not everything deserves to be remembered.&lt;/strong&gt; Context is expensive, and context decay is real.&lt;/p&gt;

&lt;p&gt;We split memory into four tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal (per agent):&lt;/strong&gt; Capped at 5KB. Auto-pruned weekly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Member (per human):&lt;/strong&gt; Correction history and preferences for each team member the agent works with.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Channel (per topic):&lt;/strong&gt; Indexed with embeddings for semantic search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company (shared):&lt;/strong&gt; Vetted knowledge that never expires.&lt;/li&gt;
&lt;/ul&gt;
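&lt;p&gt;Roughly, the Personal-tier prune looks like this (a simplified sketch — the function name and tuple shape are illustrative, not the actual Corellis internals):&lt;/p&gt;

```python
# Sketch of the Personal-tier cap: keep the newest entries that fit
# under 5KB, drop the rest. Names here are illustrative only.
MEMORY_CAP_BYTES = 5 * 1024

def prune_personal_memory(entries):
    """Keep the newest entries whose total size fits under the cap.

    `entries` is a list of (timestamp, text) tuples; newest win,
    mirroring the weekly auto-prune described above.
    """
    kept, used = [], 0
    for ts, text in sorted(entries, key=lambda e: e[0], reverse=True):
        size = len(text.encode("utf-8"))
        if used + size > MEMORY_CAP_BYTES:
            break
        kept.append((ts, text))
        used += size
    return list(reversed(kept))  # back to chronological order
```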

&lt;h3&gt;2. The Correction Loop Problem — Week 3&lt;/h3&gt;

&lt;p&gt;We built a self-improvement system: when an agent gets corrected, it logs the lesson and adjusts its behavior.&lt;/p&gt;

&lt;p&gt;In practice, agents were recording &lt;em&gt;every&lt;/em&gt; correction, including contradictory ones. Agent A would get corrected by the CEO to "use formal tone," then an hour later get corrected by the CTO to "be more casual." Result: a confused agent outputting garbage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Corrections go through a promotion pipeline.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent records locally in &lt;code&gt;.learnings/corrections.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Same correction happens twice → promote to the agent's "Core Rules."&lt;/li&gt;
&lt;li&gt;Applies fleet-wide → promote to shared rules.&lt;/li&gt;
&lt;li&gt;Contradictions are flagged for a human to resolve.&lt;/li&gt;
&lt;/ol&gt;
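&lt;p&gt;The first two steps of the pipeline reduce to a small amount of bookkeeping. A minimal sketch (class and field names are our illustration, not the real on-disk format):&lt;/p&gt;

```python
from collections import Counter

# Sketch of the promotion pipeline's first two steps: log locally,
# promote to Core Rules on the second occurrence.
class CorrectionLog:
    def __init__(self):
        self.local = Counter()   # step 1: per-agent correction log
        self.core_rules = set()  # step 2: promoted after a repeat

    def record(self, correction):
        """Log a correction; returns True once it is a Core Rule."""
        self.local[correction] += 1
        if self.local[correction] >= 2:
            self.core_rules.add(correction)
        return correction in self.core_rules
```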

&lt;p&gt;We call this &lt;strong&gt;Fleet Learning&lt;/strong&gt;. Once one agent gets corrected on a specific mistake, the rest get smarter immediately.&lt;/p&gt;

&lt;h3&gt;3. The Coordination Deadlock — Week 5&lt;/h3&gt;

&lt;p&gt;We gave the controller a simple goal: "Launch a user acquisition campaign." It broke it down into 6 sub-tasks and assigned them to 6 agents.&lt;/p&gt;

&lt;p&gt;Two of those agents needed to coordinate on an API design. Neither would start. Both were politely waiting for the other to finish their spec. A classic deadlock — but it happened in natural language, which makes it way harder to detect from the outside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The symptom:&lt;/strong&gt; Two agents trading "I'm waiting for your specs" messages back and forth for 3 days straight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; We built &lt;strong&gt;GoalOps&lt;/strong&gt; — a structured protocol for goal decomposition.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Goals have explicit dependencies.&lt;/li&gt;
&lt;li&gt;Agents must declare what they're blocked on.&lt;/li&gt;
&lt;li&gt;The controller monitors for stalls (no progress log in 24h → escalate).&lt;/li&gt;
&lt;li&gt;P2P handoffs have timeouts.&lt;/li&gt;
&lt;/ul&gt;
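&lt;p&gt;The stall monitor is the simplest piece to show. A sketch, assuming each task carries a last-progress timestamp (the real GoalOps schema is richer than this):&lt;/p&gt;

```python
import time

STALL_SECONDS = 24 * 3600  # no progress log in 24h means escalate

def find_stalled(tasks, now=None):
    """Return ids of unfinished tasks with no progress in 24h.

    `tasks` maps id to {"last_progress": unix_ts, "done": bool};
    the shape is illustrative, not the actual GoalOps schema.
    """
    now = now or time.time()
    return [tid for tid, t in tasks.items()
            if not t["done"] and now - t["last_progress"] > STALL_SECONDS]
```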

&lt;h3&gt;4. The Context Window Tax — Week 6&lt;/h3&gt;

&lt;p&gt;With 28 agents hitting LLM APIs constantly, we were burning through tokens. Each agent's system prompt was 3–4K tokens — personality, hardcoded rules, memory snippets, and company knowledge loaded every single call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost:&lt;/strong&gt; ~$2,400/month, a huge chunk wasted on re-sending the same context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moved to on-demand loading of shared knowledge.&lt;/li&gt;
&lt;li&gt;Built a semantic search tool (we call it &lt;em&gt;Teamind&lt;/em&gt;) instead of context stuffing. Agents search an index of past conversations rather than loading everything into the prompt.&lt;/li&gt;
&lt;li&gt;Pruned system prompts to essentials (&amp;lt;2K tokens).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: API costs dropped to ~$800/month. Still not cheap, but sustainable.&lt;/p&gt;
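&lt;p&gt;The prompt-assembly side amounts to a budget packer: fixed essentials first, then retrieved snippets until the ~2K-token budget runs out. A sketch — the whitespace "tokenizer" here is a stand-in for a real one:&lt;/p&gt;

```python
TOKEN_BUDGET = 2000  # the pruned system-prompt budget from above

def build_prompt(essentials, retrieved, count_tokens=lambda s: len(s.split())):
    """Pack essentials, then retrieved snippets, up to the budget.

    `retrieved` should be ordered most relevant first; snippets that
    would blow the budget are simply skipped.
    """
    parts = [essentials]
    used = count_tokens(essentials)
    for snippet in retrieved:
        cost = count_tokens(snippet)
        if used + cost > TOKEN_BUDGET:
            continue
        parts.append(snippet)
        used += cost
    return "\n\n".join(parts)
```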

&lt;h3&gt;5. The Trust Problem — Ongoing&lt;/h3&gt;

&lt;p&gt;The hardest part isn't code; it's defining who's in charge. Some tasks need human approval (deployments, financial decisions). Some don't (formatting a report, finding docs).&lt;/p&gt;

&lt;p&gt;We run a 3-tier system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-execute:&lt;/strong&gt; Low-risk, reversible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notify-and-proceed:&lt;/strong&gt; Medium-risk, human gets a ping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait-for-approval:&lt;/strong&gt; High-risk, blocks until human clicks "Yes."&lt;/li&gt;
&lt;/ul&gt;
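&lt;p&gt;The three tiers reduce to a small dispatch gate. A sketch, with callables standing in for the Slack ping and the approval button:&lt;/p&gt;

```python
# Sketch of the 3-tier gate. Tier names come from the post;
# the dispatch function itself is illustrative.
def dispatch(action, tier, notify, ask_human):
    """action: zero-arg callable; tier: 'auto', 'notify', or 'approve'."""
    if tier == "auto":
        return action()
    if tier == "notify":
        notify()              # human gets a ping, work proceeds anyway
        return action()
    if ask_human():           # blocks until the human clicks "Yes"
        return action()
    return None               # denied: nothing runs
```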

&lt;p&gt;The challenge? Every human has a different risk tolerance. Our marketing lead wants full autonomy for social posts. The ops lead wants approval on every deployment. We're still tuning this friction.&lt;/p&gt;

&lt;h2&gt;What Actually Works Well&lt;/h2&gt;

&lt;h3&gt;Semantic Team Memory (Teamind)&lt;/h3&gt;

&lt;p&gt;Every Slack message is indexed with embeddings. Any agent can ask "what did we decide about pricing last month?" and get an accurate answer with source links.&lt;/p&gt;

&lt;p&gt;This alone justifies the whole setup. No more "I think someone mentioned this in a meeting I wasn't at."&lt;/p&gt;
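&lt;p&gt;Under the hood this is ordinary embedding search. A toy sketch of the ranking step with hand-made vectors — the production version presumably sits on a proper vector index, and these helper names are illustrative:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=3):
    """index: list of (message, embedding). Rank messages by similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [msg for msg, _ in ranked[:top_k]]
```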

&lt;h3&gt;Fleet Learning&lt;/h3&gt;

&lt;p&gt;When one agent gets burned by a common mistake — like using a deprecated API — the correction propagates instantly. Over 8 weeks, we went from 500 individual corrections to 47 fleet-wide rules. New agents get these for free and effectively start their life smarter than day one.&lt;/p&gt;

&lt;h3&gt;Goal Decomposition&lt;/h3&gt;

&lt;p&gt;You give the controller: "Prepare the quarterly report."&lt;/p&gt;

&lt;p&gt;It figures out which agents have relevant data, creates sub-tasks with dependencies, monitors progress, and merges the outputs. This used to take a human PM 2–3 hours of coordination. Now it's one sentence and ~45 minutes of agent-toil. Not instant, but consistent.&lt;/p&gt;

&lt;h2&gt;The Numbers&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agents running&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server&lt;/td&gt;
&lt;td&gt;Single box, 64GB RAM, no GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Messages indexed&lt;/td&gt;
&lt;td&gt;50,000+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-corrections&lt;/td&gt;
&lt;td&gt;500+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fleet-wide rules&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly API cost&lt;/td&gt;
&lt;td&gt;~$800&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Uptime (8 weeks)&lt;/td&gt;
&lt;td&gt;99.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Should You Do This?&lt;/h2&gt;

&lt;p&gt;Probably not yet. Unless you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A team that already uses AI assistants.&lt;/li&gt;
&lt;li&gt;Repetitive coordination overhead you want to eliminate.&lt;/li&gt;
&lt;li&gt;A high tolerance for debugging agents that misunderstand each other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're curious, we open-sourced everything: &lt;strong&gt;&lt;a href="https://github.com/CorellisOrg/corellis" rel="noopener noreferrer"&gt;CorellisOrg/corellis on GitHub&lt;/a&gt;&lt;/strong&gt;. MIT license, runs on Docker, works with any OpenClaw setup.&lt;/p&gt;

&lt;p&gt;The future isn't just one AI assistant. It's a team of them that learns together.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions? The lobsters are listening.&lt;/em&gt; 🦞&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;The Corellis team&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>multiagent</category>
      <category>devops</category>
    </item>
    <item>
      <title>How We Ran 28 AI Agents on a Single Server (And What Broke)</title>
      <dc:creator>jay wong</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:15:24 +0000</pubDate>
      <link>https://dev.to/jay_wong_45c807c6799b4fb7/how-we-ran-28-ai-agents-on-a-single-server-and-what-broke-1pbf</link>
      <guid>https://dev.to/jay_wong_45c807c6799b4fb7/how-we-ran-28-ai-agents-on-a-single-server-and-what-broke-1pbf</guid>
      <description>&lt;p&gt;We started an experiment back in February 2026: what if every single human in the company had their own AI assistant, and those assistants could talk to each other?&lt;br&gt;
Eight weeks in, we have 28 agents running on a single commodity hardware box. They're handling operational toil, marketing coordination, release approvals, and churning out weekly reports. So far, they've processed over 50,000 Slack messages and made 500+ self-corrections.&lt;br&gt;
I'm writing this to document what actually worked, why we blew up parts of the stack, and why we eventually open-sourced the whole mess.&lt;/p&gt;

&lt;h2&gt;The Setup&lt;/h2&gt;

&lt;p&gt;Each agent is isolated in its own Docker container. Here are the gory details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Private memory: Conversation history persists locally.&lt;/li&gt;
&lt;li&gt;Shared context: Access to the "Knowledge Base" (read-only for agents).&lt;/li&gt;
&lt;li&gt;Interface: Slack channels are the primary UI.&lt;/li&gt;
&lt;li&gt;Tasking: Connected to Notion for structured input/output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's a "Controller" agent that acts as the brain. It assigns goals, watches the fleet, and synthesizes lessons.&lt;/p&gt;

&lt;p&gt;Architecture: Single server (64GB RAM, no GPU) → Controller agent at the top → 28 agent containers (Alice/Mktg, Bob/Ops, Carol/Fin, etc.) → all connected to shared knowledge + team memory layer at the bottom.&lt;/p&gt;

&lt;h2&gt;The Disaster Timeline&lt;/h2&gt;

&lt;h3&gt;1. Memory Overflow — Week 2&lt;/h3&gt;

&lt;p&gt;Every agent keeps a MEMORY.md file as long-term storage. By week 2, we noticed some files growing to 20KB+. They were full of duplicate task notes, outdated fragments, and random chatter.&lt;br&gt;
The symptom: Agents started hallucinating about completed tasks. They'd cite decisions that had since been overturned as if they were gospel.&lt;br&gt;
The fix: We implemented a strict memory hierarchy with aggressive pruning. The gist is simple: not everything deserves to be remembered. Context is expensive, and context decay is real.&lt;br&gt;
We split memory up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personal memory (per agent): Capped at 5KB. Auto-pruned weekly.&lt;/li&gt;
&lt;li&gt;Member memory (per team member): Stores correction history and preferences.&lt;/li&gt;
&lt;li&gt;Channel memory (per topic): Indexed with embeddings for semantic search.&lt;/li&gt;
&lt;li&gt;Company memory (shared): Vetted knowledge that never expires.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2. The Correction Loop Problem — Week 3&lt;/h3&gt;

&lt;p&gt;We built a self-improvement system: when an agent gets corrected, it logs the lesson in a markdown file and adjusts its behavior.&lt;br&gt;
In practice, agents were recording every correction, including contradictory ones. Agent A would get corrected by the CEO to "use formal tone," and an hour later get corrected by the CTO to "be more casual." The result was a confused agent that just started outputting garbage.&lt;br&gt;
The fix: Corrections now go through a promotion pipeline.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent records locally in .learnings/corrections.md.&lt;/li&gt;
&lt;li&gt;If the same correction happens twice → promote to the agent's "Core Rules."&lt;/li&gt;
&lt;li&gt;If it's fundamental for the whole team → promote fleet-wide.&lt;/li&gt;
&lt;li&gt;Contradictions are flagged for a human to resolve.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We call this Fleet Learning. Once one agent gets corrected on a specific mistake, the entire fleet gets smarter immediately.&lt;/p&gt;

&lt;h3&gt;3. The Coordination Deadlock — Week 5&lt;/h3&gt;

&lt;p&gt;We gave the controller a simple goal: "Launch a user acquisition campaign." It broke it down into 6 sub-tasks and assigned them to 6 agents.&lt;br&gt;
Two agents needed to coordinate on API design. Neither would start. Both were politely waiting for the other to finish their spec. Classic deadlock, but it happened in natural language, which makes it way harder to detect.&lt;br&gt;
The symptom: Two agents sitting in a deadlock loop, trading "I'm waiting for your specs" messages back and forth for 3 days straight.&lt;br&gt;
The fix: We built GoalOps—a structured protocol for goal decomposition.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Goals now have explicit dependencies.&lt;/li&gt;
&lt;li&gt;Agents must declare what they are blocked on.&lt;/li&gt;
&lt;li&gt;The controller monitors for stalls (no progress log for 24h? Escalate).&lt;/li&gt;
&lt;li&gt;P2P handoffs have timeouts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;4. The Context Window Tax — Week 6&lt;/h3&gt;

&lt;p&gt;With 28 agents hitting LLM APIs constantly, we were burning through tokens like crazy. Each agent's system prompt alone was 3-4K tokens—personality definition, hardcoded rules, memory snippets, and company knowledge loaded every single time.&lt;br&gt;
The cost: We were burning about $2,400/month. A huge chunk of that was wasted on repeating the same context over and over.&lt;br&gt;
The fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moved to on-demand loading of shared knowledge.&lt;/li&gt;
&lt;li&gt;We built a semantic search tool (we call it Teamind) instead of "context stuffing."&lt;/li&gt;
&lt;li&gt;Agents search the index of all past conversations rather than loading everything into the prompt.&lt;/li&gt;
&lt;li&gt;Pruned system prompts down to essentials (&amp;lt;2K tokens).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: API costs dropped to ~$800/month. It's still not dirt cheap, but it's sustainable enough that we aren't crying every time the bill comes in.&lt;/p&gt;

&lt;h3&gt;5. The Trust Problem — Ongoing&lt;/h3&gt;

&lt;p&gt;The hardest part isn't code; it's defining who is in charge. Some tasks need human approval (deployments, financial decisions). Some don't (formatting a report, finding docs).&lt;br&gt;
We implemented a 3-tier system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-execute: Low-risk, reversible actions.&lt;/li&gt;
&lt;li&gt;Notify-and-proceed: Medium-risk, human gets a ping.&lt;/li&gt;
&lt;li&gt;Wait-for-approval: High-risk, blocks until human clicks "Yes."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenge? Every human has a different risk tolerance. Our marketing lead wants full autonomy for social posts. The ops lead wants approval on every deployment. We're still trying to tune this friction.&lt;/p&gt;

&lt;h2&gt;What Actually Works Well&lt;/h2&gt;

&lt;h3&gt;Semantic Team Memory (Teamind)&lt;/h3&gt;

&lt;p&gt;Every Slack message is indexed with embeddings. Any agent can ask, "What did we decide about pricing last month?" and get an accurate answer with source links.&lt;br&gt;
This alone justifies the whole setup. No more "I think someone mentioned this in a meeting I wasn't at."&lt;/p&gt;

&lt;h3&gt;Fleet Learning&lt;/h3&gt;

&lt;p&gt;When one agent gets burned by a common mistake—like using a deprecated API—the correction propagates instantly. After 8 weeks, we went from 500 individual corrections to solidifying 47 fleet-wide rules. New agents get these lessons for free, effectively starting their life smarter than day one.&lt;/p&gt;

&lt;h3&gt;Goal Decomposition&lt;/h3&gt;

&lt;p&gt;You give the controller: "Prepare the quarterly report."&lt;br&gt;
It then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identifies which agents have relevant data.&lt;/li&gt;
&lt;li&gt;Creates sub-tasks with dependencies.&lt;/li&gt;
&lt;li&gt;Monitors progress.&lt;/li&gt;
&lt;li&gt;Merges the outputs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This used to take a human PM 2-3 hours of coordination work. Now it takes one sentence and about 45 minutes of agent-toil. It's not instant, but it's consistent.&lt;/p&gt;

&lt;h2&gt;The Numbers&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Agents running: 28&lt;/li&gt;
&lt;li&gt;Server: Single, 64GB RAM, no GPU&lt;/li&gt;
&lt;li&gt;Messages indexed: 50,000+&lt;/li&gt;
&lt;li&gt;Self-corrections: 500+&lt;/li&gt;
&lt;li&gt;Fleet-wide rules: 47&lt;/li&gt;
&lt;li&gt;Monthly API cost: ~$800&lt;/li&gt;
&lt;li&gt;Uptime (8 weeks): 99.2%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Should You Do This?&lt;/h2&gt;

&lt;p&gt;Probably not yet. Unless you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A team that already uses AI assistants.&lt;/li&gt;
&lt;li&gt;Repetitive coordination overhead you want to eliminate.&lt;/li&gt;
&lt;li&gt;A high tolerance for debugging agents that misunderstand each other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're curious, we open-sourced everything: Corellis on GitHub. MIT license, runs on Docker, works with any OpenClaw setup.&lt;br&gt;
The future isn't just one AI assistant. It's a team of them that learns together.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
