Lessons from a few intense months of trying to make 100 AI agents write production code without breaking everything
When I started building Coroid, I thought I understood AI. I'd used Claude, ChatGPT, Cursor, Copilot — the whole suite. I thought: "How hard can it be to chain some API calls together and let an agent write code?"
I was wrong. Deeply, expensively, humblingly wrong.
What follows is not a marketing pitch. It's an honest post-mortem of the hardest technical problems I've had to solve — problems that feel effortless when you use Claude Code, Cursor, or Codex, but that took months of pain to get right.
1. How LLMs Actually Work (Not How You Think)
Before Coroid, my mental model of an LLM was: "smart text predictor." After building a harness from scratch — no Pi SDK, no Anthropic SDK, no OpenAI abstractions — I understand them as state machines with opinions.
The biggest revelation was tool calling. It's not magic. The model doesn't "know" about your functions the way you know about an import. It sees a blob of JSON schema in its context window and has been trained to emit specially formatted text that looks like a function call. The actual execution happens in your code.
Getting this right meant learning:
- Naming conventions matter enormously. The model has been trained on billions of examples where certain verb-noun patterns correlate with certain behaviors. `read_file` works better than `file_retrieval` because it matches patterns in the training data.
- Schema design is prompt engineering. The structure of your JSON schema is part of the prompt. Required fields, descriptions, examples — every character influences whether the model hallucinates parameters.
- Different models speak different "tool dialects". OpenAI's function calling format differs from Anthropic's tool use, which differs from what you need for local models via Ollama. We built a `StandardsCompliantToolService` that normalizes across all of them, but getting there required parsing a lot of malformed JSON.
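To make the dialect problem concrete, here's a minimal sketch of the normalization idea (not our actual `StandardsCompliantToolService`, just the core of it): one internal tool shape, emitted in the OpenAI and Anthropic wire formats.

```typescript
// One internal tool definition, normalized per provider dialect.
interface ToolDef {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // a JSON Schema object
}

// OpenAI wraps tools as { type: "function", function: {...} }.
function toOpenAI(tool: ToolDef) {
  return {
    type: "function",
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  };
}

// Anthropic expects { name, description, input_schema }.
function toAnthropic(tool: ToolDef) {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.parameters,
  };
}
```

The inverse direction (parsing each provider's tool-call output back into one internal call shape) is where the malformed JSON lives.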
2. Context Compaction: The 100K Cliff
Here's something the API docs don't tell you: most models start degrading significantly well before their advertised context limit. That "200K context window"? In practice, coherence drops off a cliff around 80-100K tokens, sometimes earlier depending on the model.
For an agent that needs to work on large codebases, this is existential. You can't just feed the entire repo into the context window and hope.
Our solution was a three-layer compaction system:
Task Re-Anchoring: At every iteration of the agent loop, we restate the original objective. Sounds simple, but without it, the agent drifts. We actually measure this — "objective drift" scored via keyword overlap. Without re-anchoring, agents would start fixing linting errors instead of implementing the feature.
Working State Manager: Instead of dumping raw conversation history, we maintain a curated state object: `activeFiles`, `discoveries`, `failedApproaches`, `verificationStatus`. This gets injected into each turn, not the full transcript.
Shared Context Cache: A PostgreSQL-backed cache (`TaskContextCacheService`) that persists file contents, tool results, reasoning traces, and failed attempts across agent handoffs. When one agent passes work to another, the context doesn't travel with the message — it's retrieved from cache.
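A simplified sketch of what that curated state might look like. Field names match the description above; the real service is richer.

```typescript
// Hypothetical shape of the working state injected each turn.
interface WorkingState {
  objective: string;            // restated every iteration (task re-anchoring)
  activeFiles: string[];        // paths the agent is currently touching
  discoveries: string[];        // short facts learned about the codebase
  failedApproaches: string[];   // what not to retry
  verificationStatus: "pending" | "passing" | "failing";
}

// Instead of the full transcript, each turn sees a compact rendering.
function renderState(state: WorkingState): string {
  return [
    `OBJECTIVE: ${state.objective}`,
    `ACTIVE FILES: ${state.activeFiles.join(", ")}`,
    `KNOWN FACTS: ${state.discoveries.join("; ")}`,
    `DO NOT RETRY: ${state.failedApproaches.join("; ")}`,
    `VERIFICATION: ${state.verificationStatus}`,
  ].join("\n");
}
```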
The result? We can run tasks that touch dozens of files across multi-hour sessions without hitting context limits. But getting here required accepting a hard truth: you cannot keep the full thread. You must aggressively curate what the model sees at each step.
3. Making File Editing Consistent Across Models
This was surprisingly brutal.
Line-number-based editing seems natural to humans. "Replace lines 45-52 with this code." But models are terrible at counting lines consistently, especially when:
- The file has changed since they last saw it
- Different tokenizers count whitespace differently
- One model uses 1-based indexing, another gets confused
We tried multiple approaches:
- Line-number patches: Failed ~30% of the time due to drift
- Search-and-replace blocks: Better, but models would slightly alter the search string
- AST-based edits: Too slow, required parsing every language
What worked was a hybrid: `apply_code_changes` with both "create" and "replace" modes, where replace uses a fuzzy search block with context lines. Even then, we had to build retry logic and fallbacks. Claude might handle it fine; GPT-4 might hallucinate line numbers; a local model might ignore the format entirely.
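Here's a minimal sketch of the fuzzy-matching idea, assuming whitespace drift is the main failure mode. The real implementation also handles context lines, retries, and fallbacks.

```typescript
// Trim each line so indentation drift doesn't fail the edit.
function normalize(s: string): string {
  return s.split("\n").map((l) => l.trim()).join("\n");
}

function applyReplace(file: string, search: string, replace: string): string | null {
  // Exact match first: cheap and unambiguous.
  if (file.includes(search)) return file.replace(search, replace);

  // Fuzzy pass: slide a window over the file, comparing trimmed lines.
  const fileLines = file.split("\n");
  const searchLines = normalize(search).split("\n");
  for (let i = 0; i + searchLines.length <= fileLines.length; i++) {
    const window = fileLines.slice(i, i + searchLines.length);
    if (normalize(window.join("\n")) === searchLines.join("\n")) {
      return [
        ...fileLines.slice(0, i),
        replace,
        ...fileLines.slice(i + searchLines.length),
      ].join("\n");
    }
  }
  return null; // no match: caller falls back to retry logic
}
```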
If you're building an agent that edits code, invest heavily here. This is the difference between a demo and a production system.
4. Building the Harness From Scratch
I rewrote our agent harness completely from scratch. No Pi, no Claude SDK, no OpenAI assistants API. Just raw HTTP calls, a state machine, and a lot of prayers.
Why? Because existing abstractions make assumptions that don't hold for autonomous systems:
- They assume human-in-the-loop
- They assume short conversations
- They assume one model, one provider
- They assume you can just "retry" when things go wrong
Our harness is essentially a ReAct loop (Reasoning + Acting), but productionizing it required solving edge cases nobody talks about:
Model misbehavior: All models misbehave, just in different ways. Claude might get philosophical and refuse to use tools. GPT-4 might get stuck in a loop calling the same tool with the same arguments. Local models might output malformed JSON. The harness needs to detect, nudge, and recover from each failure mode.
Context discipline: Every token costs money and attention. We dynamically construct system prompts based on the task phase — planning prompts differ from implementation prompts, which differ from review prompts. There's no "one prompt to rule them all" like in Claude Code.
State vs. reasoning separation: We learned to separate the agent's internal reasoning from its working state. Reasoning can be verbose; state must be compact. Mixing them causes both context bloat and confusion.
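As one concrete example of misbehavior detection, here's a sketch of a loop detector for the "same tool, same arguments" failure mode. The threshold and nudge wording are illustrative.

```typescript
type ToolCall = { name: string; args: unknown };

// Returns a nudge message when the agent repeats an identical call.
function makeLoopDetector(threshold = 3) {
  let lastKey = "";
  let repeats = 0;
  return (call: ToolCall): string | null => {
    const key = call.name + JSON.stringify(call.args);
    repeats = key === lastKey ? repeats + 1 : 1;
    lastKey = key;
    if (repeats >= threshold) {
      repeats = 0;
      // Injected as a turn to break the loop.
      return `You have called ${call.name} ${threshold} times with identical arguments. Summarize what you know and try a different approach.`;
    }
    return null;
  };
}
```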
The result is something I'm proud of, but I have newfound respect for the teams at Anthropic, OpenAI, and Cursor. They make this look easy. It's not.
5. Discoverable/Progressive Skills
In vibe coding, you can just tell the LLM: "Use the React skill." In an autonomous system, the agent needs to discover what skills exist without you holding its hand.
But there's a tension: you can't dump every skill into the context window (token pollution), but you also can't expect the agent to know about skills it hasn't seen.
We built a 4-level progressive disclosure system:
- Registry (~100 tokens): "Here are the categories of skills available"
- Manifest (~500-1000 tokens): "Here's what the React skill contains"
- Resources: Specific files, patterns, examples
- Scripts: Executable code the agent can run
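A simplified sketch of the disclosure levels, with hypothetical shapes; the real registry is backed by services, not an in-memory array.

```typescript
interface Skill {
  category: string;
  name: string;
  manifest: string;                // the ~500-1000 token summary
  resources: Map<string, string>;  // files, patterns, examples
}

class SkillRegistry {
  constructor(private skills: Skill[]) {}

  // Level 0: ~100 tokens, always safe to inject.
  listCategories(): string[] {
    return [...new Set(this.skills.map((s) => s.category))];
  }

  // Level 1: only loaded when the agent names a skill.
  getManifest(name: string): string | undefined {
    return this.skills.find((s) => s.name === name)?.manifest;
  }

  // Level 2: a single resource, fetched on demand.
  getResource(name: string, path: string): string | undefined {
    return this.skills.find((s) => s.name === name)?.resources.get(path);
  }
}
```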
Different agent profiles get different levels:
- Architect agents get the full registry and are encouraged to explore
- Developer agents get suggested skills injected based on the task
- QA/Review agents auto-load domain-specific skills
This is still not perfect. The hardest part is teaching the agent when to ask for a skill it doesn't know about, based on hints in the task spec. We ended up adding "skill discovery cues" to our spec format — metadata that tells the agent "if you see X pattern, you probably need Y skill."
6. MCP: Building a Protocol Bridge
MCP (Model Context Protocol) is the right idea: standardize how AI tools connect to external services. But implementing both a server and client taught me how early the ecosystem is.
Our MCP integration uses progressive loading just like skills:
- Level 0: Registry of available servers (~100 tokens each)
- Level 1: Manifest with server capabilities
- Level 2: Full JSON schemas for tool definitions
We support Sentry, Jira, Slack, and custom organization-level servers. But multi-tenancy is tricky — each organization sees only their configured MCP servers, with project-level access control.
The real learning? Protocols are easy; semantic interoperability is hard. Just because two servers speak MCP doesn't mean their tools compose well. We had to build translation layers and normalization logic that the spec doesn't cover.
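As an illustration of what the spec doesn't cover: the tool names and parameter shapes below are invented for the example, but the point is that two servers exposing "the same" capability still need mapping into one internal call shape before agents can compose them.

```typescript
// Hypothetical internal shape for "create an issue somewhere".
interface CreateIssue { title: string; body: string; project: string }

// Each server speaks MCP, but its tool expects different field names.
const translators: Record<string, (r: CreateIssue) => { tool: string; args: unknown }> = {
  jira: (r) => ({
    tool: "jira_create_issue",
    args: { summary: r.title, description: r.body, projectKey: r.project },
  }),
  tracker: (r) => ({
    tool: "tracker_new_ticket",
    args: { title: r.title, detail: r.body, projectSlug: r.project },
  }),
};

function dispatch(server: keyof typeof translators, req: CreateIssue) {
  return translators[server](req);
}
```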
7. Git Worktrees, Sandboxes, and Cloud Isolation
Running 100 agents in parallel means 100 agents potentially touching the same codebase. Git branches work for humans; they break down when agents create, delete, and force-push branches autonomously.
Our evolution:
- S3 copy architecture: Each agent got a full repo copy. Took 1-5 minutes. Unacceptable.
- Git-in-S3 hybrid: Better, but still slow.
- Shared POSIX storage + Git worktrees: The winner. Instant branch creation (<1 second), isolated working directories, shared object database.
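Provisioning a worktree per agent is a couple of Git commands. Here's a sketch in TypeScript; paths and branch naming are illustrative.

```typescript
import { execFileSync } from "node:child_process";

// One object database, many working directories: creation is near-instant
// because no objects are copied.
function createAgentWorktree(repoPath: string, taskId: string): string {
  const branch = `agent/${taskId}`;
  const worktreePath = `/workspaces/${taskId}`;
  execFileSync(
    "git",
    ["-C", repoPath, "worktree", "add", "-b", branch, worktreePath],
    { stdio: "inherit" },
  );
  return worktreePath;
}

function removeAgentWorktree(repoPath: string, worktreePath: string): void {
  execFileSync("git", ["-C", repoPath, "worktree", "remove", "--force", worktreePath]);
}
```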
But isolation isn't just about Git. It's about sandboxing. Each agent runs in a constrained environment:
- Docker containers with 256MB memory limits
- 30-second timeouts
- No network access (unless explicitly requested)
- File-system restrictions
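Expressed as a `docker run` invocation, the constraints above look roughly like this; the image name is hypothetical.

```typescript
import { execFile } from "node:child_process";

function runSandboxed(cmd: string[], workdir: string) {
  const args = [
    "run", "--rm",
    "--memory=256m",          // hard memory cap
    "--network=none",          // no network unless explicitly requested
    "--read-only",             // immutable root filesystem
    "-v", `${workdir}:/work`,  // only the agent's worktree is writable
    "-w", "/work",
    "coroid/agent-sandbox",    // hypothetical image name
    ...cmd,
  ];
  // The 30-second timeout kills the process if the command hangs.
  return execFile("docker", args, { timeout: 30_000 }, (err, stdout, stderr) => {
    if (err) console.error("sandbox failed or timed out:", stderr);
    else console.log(stdout);
  });
}
```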
I think the hardest part is still ahead: battle-testing security. We continuously try to make agents break out of sandboxes. So far, so good. But this is an arms race I'm not sure anyone has fully solved.
8. Running 100 Agents in Parallel Without Chaos
This is the one I'm most proud of, and I think it's something even big labs struggle with.
The problem isn't parallelism — it's convergence. 100 agents working on different tasks will step on each other. Two agents editing the same file. One agent deleting a function another just started using. A dependency agent and a consumer agent racing.
Our solution came from an unexpected place: npm and package managers. They solved dependency resolution and conflict detection. We built something similar:
Dependency Analyzer Agent: Uses graph thinking + LSP code verification to build a task dependency graph before execution. It answers: "Which tasks must complete before others start?"
Conflict Resolution Agent: A specialized AI that analyzes potential conflicts and produces a waterfall execution plan. Tasks get ordered to minimize collisions.
Cross-project awareness: Tasks aren't isolated to single repos. We define schemas between repos (frontend ↔ backend ↔ admin) so agents know about cross-repo contracts. If the backend agent changes an API, the frontend agent knows to wait.
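The scheduling step that follows the AI analysis is deterministic. Here's a sketch that turns a dependency map into waterfall batches, where each batch can run in parallel; the conflict analysis that builds the edges is the hard (AI) part.

```typescript
// deps maps each task ID to the task IDs it depends on.
function planBatches(deps: Map<string, string[]>): string[][] {
  const remaining = new Set(deps.keys());
  const done = new Set<string>();
  const batches: string[][] = [];
  while (remaining.size > 0) {
    // A task is ready when every dependency has completed.
    const ready = [...remaining].filter((t) =>
      (deps.get(t) ?? []).every((d) => done.has(d)),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    ready.forEach((t) => { remaining.delete(t); done.add(t); });
    batches.push(ready);
  }
  return batches;
}

// Example: a backend API change must land before the frontend consumes it.
// planBatches(new Map([["backend-api", []], ["frontend-ui", ["backend-api"]]]))
// => [["backend-api"], ["frontend-ui"]]
```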
Does it work perfectly? No. But it works most of the time, which for 100 parallel agents feels like a miracle.
9. Recursive Self-Improvement (RSI)
Inspired by OpenClaw's heartbeat pattern, we built an RSI system that's surprisingly simple:
A cron job starts a "reporter agent" every N minutes. It examines:
- Recent task failures and patterns
- Code quality trends
- Performance metrics
- Agent behavior anomalies
If it finds something actionable, it spawns work agents:
- A pattern-fixer agent updates prompts or skills
- A documentation agent updates specs
- A calibration agent runs eval challenges
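The loop itself is small. Here's a sketch with the observation and spawning steps stubbed out:

```typescript
type Finding = { kind: "pattern-fix" | "docs" | "calibration"; detail: string };

async function reporterTick(
  observe: () => Promise<Finding[]>,        // failures, quality trends, anomalies
  spawnAgent: (f: Finding) => Promise<void>, // pattern-fixer, docs, or calibration agent
) {
  const findings = await observe();
  for (const f of findings) {
    await spawnAgent(f);
  }
}

// Driven by a cron job every N minutes, e.g.:
// setInterval(() => reporterTick(observeMetrics, spawnWorkAgent), N * 60_000);
```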
The key insight: RSI doesn't need to be recursive in the sci-fi sense. It just needs to be a closed loop where the system observes itself and initiates improvement tasks. The "recursive" part comes from the fact that improved prompts produce better agents, which produce better observations, which produce better improvements.
10. Consistency: Solving the 8/10 Problem
LLMs are notorious for inconsistency. A task that succeeds 8 times might fail twice for no apparent reason. In a vibe-coding session, you just retry. In an autonomous system, a 20% failure rate is catastrophic.
Our answer: multi-agent consensus with specialization.
For every task, we run 4 agents in a structured pipeline:
- Architect Agent: Plans the approach using "ULTRATHINK" protocol + "INTENTIONAL MINIMALISM"
- Developer Agent: Implements using "CLEAN CODE CULT" protocol
- QA Agent: Tests using "ZERO TOLERANCE" protocol — Playwright-based, tries to break things
- Reviewer Agent: Reviews using "STEELMAN REVIEW" protocol — argues for the code's correctness before critiquing
They talk to each other. The QA agent doesn't just report bugs; it negotiates with the developer. The reviewer doesn't just approve/reject; it suggests specific improvements. They converge when everyone agrees the output matches the spec.
Autonomous remediation handles failures: simple errors get fixed in-task; complex issues spawn remediation tasks. We cap it at 3 QA→Dev cycles to prevent infinite loops.
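The capped loop looks roughly like this, with the agent calls stubbed out:

```typescript
async function converge(
  implement: (feedback?: string) => Promise<string>,
  test: (code: string) => Promise<{ pass: boolean; feedback: string }>,
  maxCycles = 3,
): Promise<string> {
  let code = await implement();
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const result = await test(code);
    if (result.pass) return code;            // QA and spec agree: done
    code = await implement(result.feedback); // developer addresses QA findings
  }
  // Cap reached: escalate instead of looping forever.
  throw new Error("QA/Dev did not converge within 3 cycles; spawning remediation task");
}
```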
This raised our consistency from ~80% to ~95%. The last 5% might require something fundamentally different, but 95% autonomous is life-changing.
11. Kubernetes, Distributed Systems, and OVH
I chose OVH (European provider) over AWS/GCP/Azure. Partly principle, partly cost, partly learning.
Setting up a distributed system that grows dynamically is hard, regardless of provider. But doing it on a less "batteries-included" platform meant learning the fundamentals:
- Managed Kubernetes: Our production runs on OVH Managed Kubernetes
- Staging: A "fat container" on an OVH VPS — Docker Compose with supervisord managing all services
- Observability: Prometheus, Grafana, Loki, GlitchTip — all self-managed
- Database: PostgreSQL 17 with point-in-time recovery
The learning curve was real, but the result is a system I understand deeply. When something breaks, I know which logs to check. When we need to scale, I know which knobs to turn.
Key services in our mesh:
- API Gateway (Hono) on port 3000
- Orchestrator (NestJS) on port 3005 — agent coordination via RabbitMQ
- Repository Sync (NestJS) on port 3006 — Git sync, workspace management
- AI Agents (Bun runtime) on port 3008
- Spec-Driven Service (NestJS) on port 3010
- LSP Service on port 3022 — semantic code intelligence
- Eval Service on port 3025 — model comparison, regression detection
All communicating via REST, WebSocket/Socket.IO, RabbitMQ, and JWT service tokens.
12. Sub-Agents: Smaller, Smarter, Specialized
Anthropic's research on sub-agents resonated with me. One big agent with a huge context window is often worse than multiple small agents with focused contexts.
We use Explorer Subagents (2-5 per parent agent) for parallel codebase exploration:
- One explores the data layer
- One explores the API layer
- One explores tests and examples
- They report back in parallel, minimizing wall-clock time
For complex tasks, a parent agent spawns children, collects results, and synthesizes. Each child has a tight context window and a narrow mandate. It's often both cheaper and faster than one monolithic agent.
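The fan-out/fan-in pattern is simple. Here's a sketch with the sub-agent runner stubbed out and illustrative mandates:

```typescript
// Each explorer gets a narrow mandate and its own small context;
// the parent only ever sees the summaries.
async function explore(
  runSubagent: (mandate: string) => Promise<string>,
): Promise<string[]> {
  const mandates = [
    "Map the data layer: entities, migrations, query patterns.",
    "Map the API layer: routes, auth, validation.",
    "Map tests and examples: what is covered, what conventions exist.",
  ];
  // Parallel execution minimizes wall-clock time.
  return Promise.all(mandates.map((m) => runSubagent(m)));
}
```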
13. The Web Tester Framework
Traditional agent testing: take a screenshot, look at it. This is unbearably slow and expensive.
We built something different: an agent browser that lives inside the JavaScript of our testing framework (Playwright-based). Instead of visual analysis:
- The agent navigates via DOM queries
- Fills forms programmatically
- Clicks elements by semantic role
- Verifies state via JavaScript assertions
- All at native execution speed
The agent "sees" the page through accessibility trees and DOM structure, not pixels. Tests that took minutes now take seconds. It's not as general as visual testing, but for verifying functional correctness, it's transformative.
We still do visual regression for UI components, but the heavy lifting of functional testing is now agent-driven and fast.
14. Prompt Engineering as System Architecture
I don't mean "how to write a good ChatGPT prompt." I mean how to architect a system where prompts are dynamic, versioned, and adaptive.
Our prompt system:
- File-based in `/system-prompts` (Markdown with YAML frontmatter)
- Version controlled, synced to a registry
- Template variables: `{{variableName}}`
- Client repositories can override via their own `/system-prompts`
But the real magic is dynamic construction. Our harness doesn't have a single system prompt. It builds one per task based on:
- The task type (planning vs implementation vs review)
- The target model (Claude prefers different framing than GPT)
- The project tech stack (React skills for frontend tasks, NestJS skills for backend)
- The current phase (first iteration vs recovery from error)
- Recent failures (if the last attempt failed, inject "common failure mode" guidance)
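A sketch of that assembly step, with illustrative fragment names; the real fragments live in `/system-prompts`.

```typescript
interface PromptContext {
  phase: "planning" | "implementation" | "review";
  model: string;              // e.g. "claude-3-5-sonnet", "gpt-4o"
  stack: string[];            // e.g. ["react", "nestjs"]
  lastAttemptFailed: boolean;
}

function buildSystemPrompt(ctx: PromptContext, fragments: Map<string, string>): string {
  const parts = [fragments.get(`phase/${ctx.phase}`)];
  parts.push(fragments.get(`model/${ctx.model}`));           // per-model framing
  for (const tech of ctx.stack) parts.push(fragments.get(`skill/${tech}`));
  if (ctx.lastAttemptFailed) {
    parts.push(fragments.get("recovery/common-failures"));   // failure-mode guidance
  }
  return parts.filter((p): p is string => !!p).join("\n\n");
}
```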
This took months of experimentation. What works for Claude 3.5 Sonnet might confuse GPT-4o. What works for planning might sabotage implementation. Prompt engineering at this level is less about clever wording and more about information architecture — what to show, when to show it, and what to omit.
15. Viability and Bad Pattern Elimination
Here's a problem I haven't seen anyone else solve: LLMs inherit bad patterns from existing code and amplify them.
If your codebase has a sketchy auth check, the agent will copy it. If there's a memory leak pattern, the agent will reproduce it. Vibe coders know this — "it works but the code is worse than before."
We built a Viability System:
- Pattern Detection: Scans codebase for anti-patterns (via AST + regex + LSP)
- Pattern Learning: When a bad pattern is identified, it generates a "negative example" and a "preferred alternative"
- Prompt Injection: Before the agent writes code, injects: "DO NOT use pattern X; USE pattern Y instead"
- Retroactive Cleanup: Spawns improvement tasks to eliminate bad patterns from existing code
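The injection step is the simplest part. A sketch, with illustrative shapes:

```typescript
interface LearnedPattern {
  bad: string;       // the negative example
  preferred: string; // the preferred alternative
  reason: string;
}

// Learned anti-patterns become explicit guardrails before code is written.
function injectGuardrails(basePrompt: string, patterns: LearnedPattern[]): string {
  if (patterns.length === 0) return basePrompt;
  const rules = patterns
    .map((p) => `DO NOT:\n${p.bad}\nUSE INSTEAD:\n${p.preferred}\n(${p.reason})`)
    .join("\n\n");
  return `${basePrompt}\n\n## Known anti-patterns in this codebase\n${rules}`;
}
```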
I don't know of another system that does this. It's unique to Coroid, and it addresses one of the most insidious problems in AI-generated code: the entropy increase of each generation.
16. Building a Real Chat System
"Just build a ChatGPT clone" — famous last words.
Streaming LLM responses with tool calls across multiple providers is genuinely hard. Each model has quirks:
- OpenAI streams tool calls as separate delta events
- Anthropic interleaves tool use with thinking blocks
- Some local models don't stream tool calls at all
Plus we integrated with Slack and Discord, each with their own message format limitations, rate limits, and threading models.
The UX required a lot of "dirty hacks" (as I call them in my notes):
- Filtering malformed chunks
- Reordering out-of-order streaming events
- Handling tool call JSON split across multiple chunks
- Normalizing different models' output formats into a unified message stream
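As one example, here's a sketch of assembling tool-call JSON that arrives split across streaming chunks; the delta shape is simplified, and each provider's real events differ.

```typescript
function makeToolCallAssembler() {
  const buffers = new Map<number, { name: string; argsJson: string }>();
  return {
    // Called per streaming delta; the arguments JSON arrives in fragments.
    onDelta(index: number, name: string | undefined, argsFragment: string) {
      const buf = buffers.get(index) ?? { name: name ?? "", argsJson: "" };
      if (name) buf.name = name;
      buf.argsJson += argsFragment; // accumulate partial JSON
      buffers.set(index, buf);
    },
    // Only parse once the stream signals the call is complete.
    finish(index: number): { name: string; args: unknown } | null {
      const buf = buffers.get(index);
      if (!buf) return null;
      try {
        return { name: buf.name, args: JSON.parse(buf.argsJson) };
      } catch {
        return null; // malformed JSON: caller triggers a retry or nudge
      }
    },
  };
}
```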
What users see as "smooth streaming" is actually a pipeline of parsers, normalizers, buffers, and fallbacks. Building this gave me deep respect for every chat interface that feels effortless.
17. Spec-Driven Development: The Dark Factory Mentality
The most important lesson, and the one that makes everything else possible:
Specs in, PR out.
In vibe coding, you chat with the AI and iterate. This is great for exploration, terrible for production. In Coroid, no agent starts work without a detailed spec. The spec defines:
- What to build
- What NOT to build (guardrails)
- Acceptance criteria
- Dependencies and cross-project contracts
- Test expectations
- UI/UX requirements (if applicable)
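As a skeleton, a spec might be typed like this; the shape is hypothetical, and the real format (with the Spec Kit integration) is richer.

```typescript
interface TaskSpec {
  objective: string;
  guardrails: string[];        // what NOT to build
  acceptanceCriteria: string[];
  dependencies: string[];      // task IDs and cross-repo contracts
  testExpectations: string[];
  uiRequirements?: string[];   // only if applicable
}
```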
This is how industry worked before AI — requirements documents, design specs, test plans. The agents use these as their source of truth. When they disagree, they refer back to the spec. When they're uncertain, they ask for spec clarification.
We even built a GitHub Spec Kit integration with completeness scoring. A vague spec gets rejected before any agent touches it.
The "Dark Factory" mentality: turn off the lights, let the factory run autonomously, come back tomorrow and review the PRs. But factories need blueprints. Specs are those blueprints.
18. Brownfield, Not Greenfield
Coroid is designed for the messy reality: you already have a codebase. Maybe it was vibed in Lovable, V0, or Cursor. Maybe it's a 5-year-old monolith. Coroid doesn't care.
The power is: throw it a project, let it work overnight, review the PRs in the morning.
This works alongside your existing team. While your developers build the next big feature, Coroid clears the backlog: bug fixes, refactors, tests, documentation, dependency updates.
I learned this from previous startups: the biggest bottleneck isn't writing new code — it's maintaining old code while building new code. Coroid addresses that directly.
What I Really Think About AI and Developers
Is AI coming for our jobs? No.
AI is another tool. A powerful one. One that makes previously impossible things possible.
My genuine suggestion to anyone worried about AI: Start building that idea you've been sitting on. The barrier to entry has never been lower. You can prototype in a weekend what used to take months. You can test assumptions faster. You can iterate cheaper.
AI won't destroy things for developers. It will make more things possible, faster. The developers who thrive will be those who learn to work with AI, not against it or in fear of it.
Coroid is my bet on that future. It's not perfect. It fails sometimes. But when it works — when you wake up to 12 pull requests that fix bugs, add tests, and refactor legacy code while you slept — it feels like magic.
Magic that took a few intense months of very unmagical engineering to create.
What's one thing you've learned building with AI that surprised you? I'd genuinely love to hear.
Building Coroid with NestJS, Hono, Next.js, Kysely, PostgreSQL, Kubernetes, RabbitMQ, and more agents than I can count. Deployed on OVH. Evaluated continuously. Improved recursively.