Lessons from a few intense months of trying to make 100 AI agents write production code without breaking everything
When I started building Coroid, I thought I understood AI. I'd used Claude, ChatGPT, Cursor, Copilot — the whole suite. I thought: "How hard can it be to chain some API calls together and let an agent write code?"
I was wrong. Deeply, expensively, humblingly wrong.
What follows is not a marketing pitch. It's an honest post-mortem of the hardest technical problems I've had to solve — problems that feel effortless when you use Claude Code, Cursor, or Codex, but that took months of pain to get right.
1. How LLMs Actually Work (Not How You Think)
Before Coroid, my mental model of an LLM was: "smart text predictor." After building a harness from scratch — no Pi SDK, no Anthropic SDK, no OpenAI abstractions — I understand them as state machines with opinions.
The biggest revelation was tool calling. It's not magic. The model doesn't "know" about your functions the way you know about an import. It sees a blob of JSON schema in its context window and has been trained to emit specially formatted text that looks like a function call. The actual execution happens in your code.
Getting this right meant learning:
- Naming conventions matter enormously. The model has been trained on billions of examples where certain verb-noun patterns correlate with certain behaviors. `read_file` works better than `file_retrieval` because it matches patterns in the training data.
- Schema design is prompt engineering. The structure of your JSON schema is part of the prompt. Required fields, descriptions, examples — every character influences whether the model hallucinates parameters.
- Different models speak different "tool dialects". OpenAI's function calling format differs from Anthropic's tool use, which differs from what you need for local models via Ollama. We built a `StandardsCompliantToolService` that normalizes across all of them, but getting there required parsing a lot of malformed JSON.
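To make the dialect problem concrete, here's a minimal sketch of the normalization idea (not our actual `StandardsCompliantToolService`, just the core of it): one internal tool shape, emitted in the OpenAI and Anthropic wire formats.

```typescript
// One internal tool definition, normalized per provider dialect.
interface ToolDef {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // a JSON Schema object
}

// OpenAI wraps tools as { type: "function", function: {...} }.
function toOpenAI(tool: ToolDef) {
  return {
    type: "function",
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  };
}

// Anthropic expects { name, description, input_schema }.
function toAnthropic(tool: ToolDef) {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.parameters,
  };
}
```

The inverse direction (parsing each provider's tool-call output back into one internal call shape) is where the malformed JSON lives.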
2. Context Compaction: The 100K Cliff
Here's something the API docs don't tell you: most models start degrading significantly well before their advertised context limit. That "200K context window"? In practice, coherence drops off a cliff around 80-100K tokens, sometimes earlier depending on the model.
For an agent that needs to work on large codebases, this is existential. You can't just feed the entire repo into the context window and hope.
Our solution was a three-layer compaction system:
Task Re-Anchoring: At every iteration of the agent loop, we restate the original objective. Sounds simple, but without it, the agent drifts. We actually measure this — "objective drift" scored via keyword overlap. Without re-anchoring, agents would start fixing linting errors instead of implementing the feature.
Working State Manager: Instead of dumping raw conversation history, we maintain a curated state object: `activeFiles`, `discoveries`, `failedApproaches`, `verificationStatus`. This gets injected into each turn, not the full transcript.
Shared Context Cache: A PostgreSQL-backed cache (`TaskContextCacheService`) that persists file contents, tool results, reasoning traces, and failed attempts across agent handoffs. When one agent passes work to another, the context doesn't travel with the message — it's retrieved from cache.
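A simplified sketch of what that curated state might look like. Field names match the description above; the real service is richer.

```typescript
// Hypothetical shape of the working state injected each turn.
interface WorkingState {
  objective: string;            // restated every iteration (task re-anchoring)
  activeFiles: string[];        // paths the agent is currently touching
  discoveries: string[];        // short facts learned about the codebase
  failedApproaches: string[];   // what not to retry
  verificationStatus: "pending" | "passing" | "failing";
}

// Instead of the full transcript, each turn sees a compact rendering.
function renderState(state: WorkingState): string {
  return [
    `OBJECTIVE: ${state.objective}`,
    `ACTIVE FILES: ${state.activeFiles.join(", ")}`,
    `KNOWN FACTS: ${state.discoveries.join("; ")}`,
    `DO NOT RETRY: ${state.failedApproaches.join("; ")}`,
    `VERIFICATION: ${state.verificationStatus}`,
  ].join("\n");
}
```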
The result? We can run tasks that touch dozens of files across multi-hour sessions without hitting context limits. But getting here required accepting a hard truth: you cannot keep the full thread. You must aggressively curate what the model sees at each step.
3. Making File Editing Consistent Across Models
This was surprisingly brutal.
Line-number-based editing seems natural to humans. "Replace lines 45-52 with this code." But models are terrible at counting lines consistently, especially when:
- The file has changed since they last saw it
- Different tokenizers count whitespace differently
- One model uses 1-based indexing, another gets confused
We tried multiple approaches:
- Line-number patches: Failed ~30% of the time due to drift
- Search-and-replace blocks: Better, but models would slightly alter the search string
- AST-based edits: Too slow, required parsing every language
What worked was a hybrid: `apply_code_changes` with both "create" and "replace" modes, where replace uses a fuzzy search block with context lines. Even then, we had to build retry logic and fallbacks. Claude might handle it fine; GPT-4 might hallucinate line numbers; a local model might ignore the format entirely.
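Here's a minimal sketch of the fuzzy-matching idea, assuming whitespace drift is the main failure mode. The real implementation also handles context lines, retries, and fallbacks.

```typescript
// Trim each line so indentation drift doesn't fail the edit.
function normalize(s: string): string {
  return s.split("\n").map((l) => l.trim()).join("\n");
}

function applyReplace(file: string, search: string, replace: string): string | null {
  // Exact match first: cheap and unambiguous.
  if (file.includes(search)) return file.replace(search, replace);

  // Fuzzy pass: slide a window over the file, comparing trimmed lines.
  const fileLines = file.split("\n");
  const searchLines = normalize(search).split("\n");
  for (let i = 0; i + searchLines.length <= fileLines.length; i++) {
    const window = fileLines.slice(i, i + searchLines.length);
    if (normalize(window.join("\n")) === searchLines.join("\n")) {
      return [
        ...fileLines.slice(0, i),
        replace,
        ...fileLines.slice(i + searchLines.length),
      ].join("\n");
    }
  }
  return null; // no match: caller falls back to retry logic
}
```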
If you're building an agent that edits code, invest heavily here. This is the difference between a demo and a production system.
4. Building the Harness From Scratch
I rewrote our agent harness completely from scratch. No Pi, no Claude SDK, no OpenAI assistants API. Just raw HTTP calls, a state machine, and a lot of prayers.
Why? Because existing abstractions make assumptions that don't hold for autonomous systems:
- They assume human-in-the-loop
- They assume short conversations
- They assume one model, one provider
- They assume you can just "retry" when things go wrong
Our harness is essentially a ReAct loop (Reasoning + Acting), but productionizing it required solving edge cases nobody talks about:
Model misbehavior: All models misbehave, just in different ways. Claude might get philosophical and refuse to use tools. GPT-4 might get stuck in a loop calling the same tool with the same arguments. Local models might output malformed JSON. The harness needs to detect, nudge, and recover from each failure mode.
Context discipline: Every token costs money and attention. We dynamically construct system prompts based on the task phase — planning prompts differ from implementation prompts, which differ from review prompts. There's no "one prompt to rule them all" like in Claude Code.
State vs. reasoning separation: We learned to separate the agent's internal reasoning from its working state. Reasoning can be verbose; state must be compact. Mixing them causes both context bloat and confusion.
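As one concrete example of misbehavior detection, here's a sketch of a loop detector for the "same tool, same arguments" failure mode. The threshold and nudge wording are illustrative.

```typescript
type ToolCall = { name: string; args: unknown };

// Returns a nudge message when the agent repeats an identical call.
function makeLoopDetector(threshold = 3) {
  let lastKey = "";
  let repeats = 0;
  return (call: ToolCall): string | null => {
    const key = call.name + JSON.stringify(call.args);
    repeats = key === lastKey ? repeats + 1 : 1;
    lastKey = key;
    if (repeats >= threshold) {
      repeats = 0;
      // Injected as a turn to break the loop.
      return `You have called ${call.name} ${threshold} times with identical arguments. Summarize what you know and try a different approach.`;
    }
    return null;
  };
}
```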
The result is something I'm proud of, but I have newfound respect for the teams at Anthropic, OpenAI, and Cursor. They make this look easy. It's not.
5. Discoverable/Progressive Skills
In vibe coding, you can just tell the LLM: "Use the React skill." In an autonomous system, the agent needs to discover what skills exist without you holding its hand.
But there's a tension: you can't dump every skill into the context window (token pollution), but you also can't expect the agent to know about skills it hasn't seen.
We built a 4-level progressive disclosure system:
- Registry (~100 tokens): "Here are the categories of skills available"
- Manifest (~500-1000 tokens): "Here's what the React skill contains"
- Resources: Specific files, patterns, examples
- Scripts: Executable code the agent can run
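A simplified sketch of the disclosure levels, with hypothetical shapes; the real registry is backed by services, not an in-memory array.

```typescript
interface Skill {
  category: string;
  name: string;
  manifest: string;                // the ~500-1000 token summary
  resources: Map<string, string>;  // files, patterns, examples
}

class SkillRegistry {
  constructor(private skills: Skill[]) {}

  // Level 0: ~100 tokens, always safe to inject.
  listCategories(): string[] {
    return [...new Set(this.skills.map((s) => s.category))];
  }

  // Level 1: only loaded when the agent names a skill.
  getManifest(name: string): string | undefined {
    return this.skills.find((s) => s.name === name)?.manifest;
  }

  // Level 2: a single resource, fetched on demand.
  getResource(name: string, path: string): string | undefined {
    return this.skills.find((s) => s.name === name)?.resources.get(path);
  }
}
```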
Different agent profiles get different levels:
- Architect agents get the full registry and are encouraged to explore
- Developer agents get suggested skills injected based on the task
- QA/Review agents auto-load domain-specific skills
This is still not perfect. The hardest part is teaching the agent when to ask for a skill it doesn't know about, based on hints in the task spec. We ended up adding "skill discovery cues" to our spec format — metadata that tells the agent "if you see X pattern, you probably need Y skill."
6. MCP: Building a Protocol Bridge
MCP (Model Context Protocol) is the right idea: standardize how AI tools connect to external services. But implementing both a server and client taught me how early the ecosystem is.
Our MCP integration uses progressive loading just like skills:
- Level 0: Registry of available servers (~100 tokens each)
- Level 1: Manifest with server capabilities
- Level 2: Full JSON schemas for tool definitions
We support Sentry, Jira, Slack, and custom organization-level servers. But multi-tenancy is tricky — each organization sees only their configured MCP servers, with project-level access control.
The real learning? Protocols are easy; semantic interoperability is hard. Just because two servers speak MCP doesn't mean their tools compose well. We had to build translation layers and normalization logic that the spec doesn't cover.
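As an illustration of what the spec doesn't cover: the tool names and parameter shapes below are invented for the example, but the point is that two servers exposing "the same" capability still need mapping into one internal call shape before agents can compose them.

```typescript
// Hypothetical internal shape for "create an issue somewhere".
interface CreateIssue { title: string; body: string; project: string }

// Each server speaks MCP, but its tool expects different field names.
const translators: Record<string, (r: CreateIssue) => { tool: string; args: unknown }> = {
  jira: (r) => ({
    tool: "jira_create_issue",
    args: { summary: r.title, description: r.body, projectKey: r.project },
  }),
  tracker: (r) => ({
    tool: "tracker_new_ticket",
    args: { title: r.title, detail: r.body, projectSlug: r.project },
  }),
};

function dispatch(server: keyof typeof translators, req: CreateIssue) {
  return translators[server](req);
}
```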
7. Git Worktrees, Sandboxes, and Cloud Isolation
Running 100 agents in parallel means 100 agents potentially touching the same codebase. Git branches work for humans; they break down when agents create, delete, and force-push branches autonomously.
Our evolution:
- S3 copy architecture: Each agent got a full repo copy. Took 1-5 minutes. Unacceptable.
- Git-in-S3 hybrid: Better, but still slow.
- Shared POSIX storage + Git worktrees: The winner. Instant branch creation (<1 second), isolated working directories, shared object database.
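Provisioning a worktree per agent is a couple of Git commands. Here's a sketch in TypeScript; paths and branch naming are illustrative.

```typescript
import { execFileSync } from "node:child_process";

// One object database, many working directories: creation is near-instant
// because no objects are copied.
function createAgentWorktree(repoPath: string, taskId: string): string {
  const branch = `agent/${taskId}`;
  const worktreePath = `/workspaces/${taskId}`;
  execFileSync(
    "git",
    ["-C", repoPath, "worktree", "add", "-b", branch, worktreePath],
    { stdio: "inherit" },
  );
  return worktreePath;
}

function removeAgentWorktree(repoPath: string, worktreePath: string): void {
  execFileSync("git", ["-C", repoPath, "worktree", "remove", "--force", worktreePath]);
}
```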
But isolation isn't just about Git. It's about sandboxing. Each agent runs in a constrained environment:
- Docker containers with 256MB memory limits
- 30-second timeouts
- No network access (unless explicitly requested)
- File-system restrictions
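Expressed as a `docker run` invocation, the constraints above look roughly like this; the image name is hypothetical.

```typescript
import { execFile } from "node:child_process";

function runSandboxed(cmd: string[], workdir: string) {
  const args = [
    "run", "--rm",
    "--memory=256m",          // hard memory cap
    "--network=none",          // no network unless explicitly requested
    "--read-only",             // immutable root filesystem
    "-v", `${workdir}:/work`,  // only the agent's worktree is writable
    "-w", "/work",
    "coroid/agent-sandbox",    // hypothetical image name
    ...cmd,
  ];
  // The 30-second timeout kills the process if the command hangs.
  return execFile("docker", args, { timeout: 30_000 }, (err, stdout, stderr) => {
    if (err) console.error("sandbox failed or timed out:", stderr);
    else console.log(stdout);
  });
}
```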
I think the hardest part is still ahead: battle-testing security. We continuously try to make agents break out of sandboxes. So far, so good. But this is an arms race I'm not sure anyone has fully solved.
8. Running 100 Agents in Parallel Without Chaos
This is the one I'm most proud of, and I think it's something even big labs struggle with.
The problem isn't parallelism — it's convergence. 100 agents working on different tasks will step on each other. Two agents editing the same file. One agent deleting a function another just started using. A dependency agent and a consumer agent racing.
Our solution came from an unexpected place: npm and package managers. They solved dependency resolution and conflict detection. We built something similar:
Dependency Analyzer Agent: Uses graph thinking + LSP code verification to build a task dependency graph before execution. It answers: "Which tasks must complete before others start?"
Conflict Resolution Agent: A specialized AI that analyzes potential conflicts and produces a waterfall execution plan. Tasks get ordered to minimize collisions.
Cross-project awareness: Tasks aren't isolated to single repos. We define schemas between repos (frontend ↔ backend ↔ admin) so agents know about cross-repo contracts. If the backend agent changes an API, the frontend agent knows to wait.
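The scheduling step that follows the AI analysis is deterministic. Here's a sketch that turns a dependency map into waterfall batches, where each batch can run in parallel; the conflict analysis that builds the edges is the hard (AI) part.

```typescript
// deps maps each task ID to the task IDs it depends on.
function planBatches(deps: Map<string, string[]>): string[][] {
  const remaining = new Set(deps.keys());
  const done = new Set<string>();
  const batches: string[][] = [];
  while (remaining.size > 0) {
    // A task is ready when every dependency has completed.
    const ready = [...remaining].filter((t) =>
      (deps.get(t) ?? []).every((d) => done.has(d)),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    ready.forEach((t) => { remaining.delete(t); done.add(t); });
    batches.push(ready);
  }
  return batches;
}

// Example: a backend API change must land before the frontend consumes it.
// planBatches(new Map([["backend-api", []], ["frontend-ui", ["backend-api"]]]))
// => [["backend-api"], ["frontend-ui"]]
```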
Does it work perfectly? No. But it works most of the time, which for 100 parallel agents feels like a miracle.
9. Recursive Self-Improvement (RSI)
Inspired by OpenClaw's heartbeat pattern, we built an RSI system that's surprisingly simple:
A cron job starts a "reporter agent" every N minutes. It examines:
- Recent task failures and patterns
- Code quality trends
- Performance metrics
- Agent behavior anomalies
If it finds something actionable, it spawns work agents:
- A pattern-fixer agent updates prompts or skills
- A documentation agent updates specs
- A calibration agent runs eval challenges
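The loop itself is small. Here's a sketch with the observation and spawning steps stubbed out:

```typescript
type Finding = { kind: "pattern-fix" | "docs" | "calibration"; detail: string };

async function reporterTick(
  observe: () => Promise<Finding[]>,        // failures, quality trends, anomalies
  spawnAgent: (f: Finding) => Promise<void>, // pattern-fixer, docs, or calibration agent
) {
  const findings = await observe();
  for (const f of findings) {
    await spawnAgent(f);
  }
}

// Driven by a cron job every N minutes, e.g.:
// setInterval(() => reporterTick(observeMetrics, spawnWorkAgent), N * 60_000);
```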
The key insight: RSI doesn't need to be recursive in the sci-fi sense. It just needs to be a closed loop where the system observes itself and initiates improvement tasks. The "recursive" part comes from the fact that improved prompts produce better agents, which produce better observations, which produce better improvements.
10. Consistency: Solving the 8/10 Problem
LLMs are notorious for inconsistency. A task that succeeds 8 times might fail twice for no apparent reason. In a vibe-coding session, you just retry. In an autonomous system, a 20% failure rate is catastrophic.
Our answer: multi-agent consensus with specialization.
For every task, we run 4 agents in a structured pipeline:
- Architect Agent: Plans the approach using "ULTRATHINK" protocol + "INTENTIONAL MINIMALISM"
- Developer Agent: Implements using "CLEAN CODE CULT" protocol
- QA Agent: Tests using "ZERO TOLERANCE" protocol — Playwright-based, tries to break things
- Reviewer Agent: Reviews using "STEELMAN REVIEW" protocol — argues for the code's correctness before critiquing
They talk to each other. The QA agent doesn't just report bugs; it negotiates with the developer. The reviewer doesn't just approve/reject; it suggests specific improvements. They converge when everyone agrees the output matches the spec.
Autonomous remediation handles failures: simple errors get fixed in-task; complex issues spawn remediation tasks. We cap it at 3 QA→Dev cycles to prevent infinite loops.
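The capped loop looks roughly like this, with the agent calls stubbed out:

```typescript
async function converge(
  implement: (feedback?: string) => Promise<string>,
  test: (code: string) => Promise<{ pass: boolean; feedback: string }>,
  maxCycles = 3,
): Promise<string> {
  let code = await implement();
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const result = await test(code);
    if (result.pass) return code;            // QA and spec agree: done
    code = await implement(result.feedback); // developer addresses QA findings
  }
  // Cap reached: escalate instead of looping forever.
  throw new Error("QA/Dev did not converge within 3 cycles; spawning remediation task");
}
```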
This raised our consistency from ~80% to ~95%. The last 5% might require something fundamentally different, but 95% autonomous is life-changing.
11. Kubernetes, Distributed Systems, and OVH
I chose OVH (European provider) over AWS/GCP/Azure. Partly principle, partly cost, partly learning.
Setting up a distributed system that grows dynamically is hard, regardless of provider. But doing it on a less "batteries-included" platform meant learning the fundamentals:
- Managed Kubernetes: Our production runs on OVH Managed Kubernetes
- Staging: A "fat container" on an OVH VPS — Docker Compose with supervisord managing all services
- Observability: Prometheus, Grafana, Loki, GlitchTip — all self-managed
- Database: PostgreSQL 17 with point-in-time recovery
The learning curve was real, but the result is a system I understand deeply. When something breaks, I know which logs to check. When we need to scale, I know which knobs to turn.
Key services in our mesh:
- API Gateway (Hono) on port 3000
- Orchestrator (NestJS) on port 3005 — agent coordination via RabbitMQ
- Repository Sync (NestJS) on port 3006 — Git sync, workspace management
- AI Agents (Bun runtime) on port 3008
- Spec-Driven Service (NestJS) on port 3010
- LSP Service on port 3022 — semantic code intelligence
- Eval Service on port 3025 — model comparison, regression detection
All communicating via REST, WebSocket/Socket.IO, RabbitMQ, and JWT service tokens.
12. Sub-Agents: Smaller, Smarter, Specialized
Anthropic's research on sub-agents resonated with me. One big agent with a huge context window is often worse than multiple small agents with focused contexts.
We use Explorer Subagents (2-5 per parent agent) for parallel codebase exploration:
- One explores the data layer
- One explores the API layer
- One explores tests and examples
- They report back in parallel, minimizing wall-clock time
For complex tasks, a parent agent spawns children, collects results, and synthesizes. Each child has a tight context window and a narrow mandate. It's often both cheaper and faster than one monolithic agent.
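The fan-out/fan-in pattern is simple. Here's a sketch with the sub-agent runner stubbed out and illustrative mandates:

```typescript
// Each explorer gets a narrow mandate and its own small context;
// the parent only ever sees the summaries.
async function explore(
  runSubagent: (mandate: string) => Promise<string>,
): Promise<string[]> {
  const mandates = [
    "Map the data layer: entities, migrations, query patterns.",
    "Map the API layer: routes, auth, validation.",
    "Map tests and examples: what is covered, what conventions exist.",
  ];
  // Parallel execution minimizes wall-clock time.
  return Promise.all(mandates.map((m) => runSubagent(m)));
}
```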
13. The Web Tester Framework
Traditional agent testing: take a screenshot, look at it. This is unbearably slow and expensive.
We built something different: an agent browser that lives inside the JavaScript of our testing framework (Playwright-based). Instead of visual analysis:
- The agent navigates via DOM queries
- Fills forms programmatically
- Clicks elements by semantic role
- Verifies state via JavaScript assertions
- All at native execution speed
The agent "sees" the page through accessibility trees and DOM structure, not pixels. Tests that took minutes now take seconds. It's not as general as visual testing, but for verifying functional correctness, it's transformative.
We still do visual regression for UI components, but the heavy lifting of functional testing is now agent-driven and fast.
14. Prompt Engineering as System Architecture
I don't mean "how to write a good ChatGPT prompt." I mean how to architect a system where prompts are dynamic, versioned, and adaptive.
Our prompt system:
- File-based in `/system-prompts` (Markdown with YAML frontmatter)
- Version controlled, synced to a registry
- Template variables: `{{variableName}}`
- Client repositories can override via their own `/system-prompts`
But the real magic is dynamic construction. Our harness doesn't have a single system prompt. It builds one per task based on:
- The task type (planning vs implementation vs review)
- The target model (Claude prefers different framing than GPT)
- The project tech stack (React skills for frontend tasks, NestJS skills for backend)
- The current phase (first iteration vs recovery from error)
- Recent failures (if the last attempt failed, inject "common failure mode" guidance)
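A sketch of that assembly step, with illustrative fragment names; the real fragments live in `/system-prompts`.

```typescript
interface PromptContext {
  phase: "planning" | "implementation" | "review";
  model: string;              // e.g. "claude-3-5-sonnet", "gpt-4o"
  stack: string[];            // e.g. ["react", "nestjs"]
  lastAttemptFailed: boolean;
}

function buildSystemPrompt(ctx: PromptContext, fragments: Map<string, string>): string {
  const parts = [fragments.get(`phase/${ctx.phase}`)];
  parts.push(fragments.get(`model/${ctx.model}`));           // per-model framing
  for (const tech of ctx.stack) parts.push(fragments.get(`skill/${tech}`));
  if (ctx.lastAttemptFailed) {
    parts.push(fragments.get("recovery/common-failures"));   // failure-mode guidance
  }
  return parts.filter((p): p is string => !!p).join("\n\n");
}
```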
This took months of experimentation. What works for Claude 3.5 Sonnet might confuse GPT-4o. What works for planning might sabotage implementation. Prompt engineering at this level is less about clever wording and more about information architecture — what to show, when to show it, and what to omit.
15. Viability and Bad Pattern Elimination
Here's a problem I haven't seen anyone else solve: LLMs inherit bad patterns from existing code and amplify them.
If your codebase has a sketchy auth check, the agent will copy it. If there's a memory leak pattern, the agent will reproduce it. Vibe coders know this — "it works but the code is worse than before."
We built a Viability System:
- Pattern Detection: Scans codebase for anti-patterns (via AST + regex + LSP)
- Pattern Learning: When a bad pattern is identified, it generates a "negative example" and a "preferred alternative"
- Prompt Injection: Before the agent writes code, injects: "DO NOT use pattern X; USE pattern Y instead"
- Retroactive Cleanup: Spawns improvement tasks to eliminate bad patterns from existing code
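The injection step is the simplest part. A sketch, with illustrative shapes:

```typescript
interface LearnedPattern {
  bad: string;       // the negative example
  preferred: string; // the preferred alternative
  reason: string;
}

// Learned anti-patterns become explicit guardrails before code is written.
function injectGuardrails(basePrompt: string, patterns: LearnedPattern[]): string {
  if (patterns.length === 0) return basePrompt;
  const rules = patterns
    .map((p) => `DO NOT:\n${p.bad}\nUSE INSTEAD:\n${p.preferred}\n(${p.reason})`)
    .join("\n\n");
  return `${basePrompt}\n\n## Known anti-patterns in this codebase\n${rules}`;
}
```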
I don't know of another system that does this. It's unique to Coroid, and it addresses one of the most insidious problems in AI-generated code: the entropy increase of each generation.
16. Building a Real Chat System
"Just build a ChatGPT clone" — famous last words.
Streaming LLM responses with tool calls across multiple providers is genuinely hard. Each model has quirks:
- OpenAI streams tool calls as separate delta events
- Anthropic interleaves tool use with thinking blocks
- Some local models don't stream tool calls at all
Plus we integrated with Slack and Discord, each with their own message format limitations, rate limits, and threading models.
The UX required a lot of "dirty hacks" (as I call them in my notes):
- Filtering malformed chunks
- Reordering out-of-order streaming events
- Handling tool call JSON split across multiple chunks
- Normalizing different models' output formats into a unified message stream
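As one example, here's a sketch of assembling tool-call JSON that arrives split across streaming chunks; the delta shape is simplified, and each provider's real events differ.

```typescript
function makeToolCallAssembler() {
  const buffers = new Map<number, { name: string; argsJson: string }>();
  return {
    // Called per streaming delta; the arguments JSON arrives in fragments.
    onDelta(index: number, name: string | undefined, argsFragment: string) {
      const buf = buffers.get(index) ?? { name: name ?? "", argsJson: "" };
      if (name) buf.name = name;
      buf.argsJson += argsFragment; // accumulate partial JSON
      buffers.set(index, buf);
    },
    // Only parse once the stream signals the call is complete.
    finish(index: number): { name: string; args: unknown } | null {
      const buf = buffers.get(index);
      if (!buf) return null;
      try {
        return { name: buf.name, args: JSON.parse(buf.argsJson) };
      } catch {
        return null; // malformed JSON: caller triggers a retry or nudge
      }
    },
  };
}
```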
What users see as "smooth streaming" is actually a pipeline of parsers, normalizers, buffers, and fallbacks. Building this gave me deep respect for every chat interface that feels effortless.
17. Spec-Driven Development: The Dark Factory Mentality
The most important lesson, and the one that makes everything else possible:
Specs in, PR out.
In vibe coding, you chat with the AI and iterate. This is great for exploration, terrible for production. In Coroid, no agent starts work without a detailed spec. The spec defines:
- What to build
- What NOT to build (guardrails)
- Acceptance criteria
- Dependencies and cross-project contracts
- Test expectations
- UI/UX requirements (if applicable)
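As a skeleton, a spec might be typed like this; the shape is hypothetical, and the real format (with the Spec Kit integration) is richer.

```typescript
interface TaskSpec {
  objective: string;
  guardrails: string[];        // what NOT to build
  acceptanceCriteria: string[];
  dependencies: string[];      // task IDs and cross-repo contracts
  testExpectations: string[];
  uiRequirements?: string[];   // only if applicable
}
```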
This is how industry worked before AI — requirements documents, design specs, test plans. The agents use these as their source of truth. When they disagree, they refer back to the spec. When they're uncertain, they ask for spec clarification.
We even built a GitHub Spec Kit integration with completeness scoring. A vague spec gets rejected before any agent touches it.
The "Dark Factory" mentality: turn off the lights, let the factory run autonomously, come back tomorrow and review the PRs. But factories need blueprints. Specs are those blueprints.
18. Brownfield, Not Greenfield
Coroid is designed for the messy reality: you already have a codebase. Maybe it was vibed in Lovable, V0, or Cursor. Maybe it's a 5-year-old monolith. Coroid doesn't care.
The power is: throw it a project, let it work overnight, review the PRs in the morning.
This works alongside your existing team. While your developers build the next big feature, Coroid clears the backlog: bug fixes, refactors, tests, documentation, dependency updates.
I learned this from previous startups: the biggest bottleneck isn't writing new code — it's maintaining old code while building new code. Coroid addresses that directly.
What I Really Think About AI and Developers
Is AI coming for our jobs? No.
AI is another tool. A powerful one. One that makes previously impossible things possible.
My genuine suggestion to anyone worried about AI: Start building that idea you've been sitting on. The barrier to entry has never been lower. You can prototype in a weekend what used to take months. You can test assumptions faster. You can iterate cheaper.
AI won't destroy things for developers. It will make more things possible, faster. The developers who thrive will be those who learn to work with AI, not against it or in fear of it.
Coroid is my bet on that future. It's not perfect. It fails sometimes. But when it works — when you wake up to 12 pull requests that fix bugs, add tests, and refactor legacy code while you slept — it feels like magic.
Magic that took a few intense months of very unmagical engineering to create.
What's one thing you've learned building with AI that surprised you? I'd genuinely love to hear.
Building Coroid with NestJS, Hono, Next.js, Kysely, PostgreSQL, Kubernetes, RabbitMQ, and more agents than I can count. Deployed on OVH. Evaluated continuously. Improved recursively.