
YedanYagami


what i actually learned coordinating 15 MCP servers (it's not what you'd expect)

everyone talks about MCP servers like they're the hard part. they're not. writing a single MCP server is maybe 200 lines of code. the hard part is what happens when you have 15 of them running simultaneously and they all need to cooperate.

i've been building a multi-agent system for the past few months. 9 services, 15 MCP servers, 60+ Cloudflare Workers. here's what i actually learned — most of it the hard way.

lesson 1: the orchestration layer is the real product

anyone can write an MCP server. child_process.exec(), parse the output, return JSON. done.

but when server #7 times out and server #3 depends on its output, and server #12 is rate-limited, and the user is waiting... that's where the real engineering lives.

we built a coordinator daemon that does health checks every 30 seconds across all services. when something goes down, it doesn't just retry — it reroutes through fallback chains. primary fails? try the secondary. secondary fails? degrade gracefully and tell the user what happened.
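the fallback-chain part is easy to sketch. here's roughly the shape of it — names like `call_with_fallbacks` are illustrative, not our actual code:

```python
def call_with_fallbacks(chain, payload, is_healthy):
    """Try each service in the chain in priority order: skip ones the
    health checker has marked down, fall through on failure, and if
    everything is exhausted, surface *why* so the user can be told."""
    errors = []
    for service in chain:
        if not is_healthy(service):          # health-checked every 30s elsewhere
            errors.append((service.name, "unhealthy"))
            continue
        try:
            return service.call(payload)
        except Exception as exc:             # timeout, rate limit, crash...
            errors.append((service.name, str(exc)))
    # every fallback exhausted: degrade gracefully, tell the user what happened
    raise RuntimeError(f"all services failed: {errors}")
```

the important design choice is collecting the errors instead of swallowing them — "degrade gracefully and tell the user what happened" only works if the failure reasons survive the cascade.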

this is boring plumbing work. it's also the thing that makes the difference between a demo and a production system.

lesson 2: security is not optional (and it's scarier than you think)

we run 15 MCP servers. each one is a potential attack surface. the patterns we've seen (and defended against):

  • shell injection: if your MCP server calls child_process.exec() with user input, you're one crafted prompt away from rm -rf /. we use shlex.quote() on literally everything.
  • env variable leakage: secrets loaded from env vars accidentally appearing in LLM context windows through error messages. this one is subtle and terrifying.
  • path traversal: ../../etc/passwd in a file-reading MCP server. os.path.realpath() + directory whitelist, no exceptions.
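the first and third defenses fit in a few lines of python. this is a minimal sketch — `ALLOWED_ROOT` is a hypothetical whitelist directory, not our real layout:

```python
import os
import shlex

ALLOWED_ROOT = "/srv/mcp-data"   # hypothetical: the only directory files may come from

def safe_shell_arg(user_input: str) -> str:
    # one crafted prompt shouldn't become `rm -rf /`
    return shlex.quote(user_input)

def safe_path(requested: str) -> str:
    # resolve symlinks and ../ tricks first, *then* enforce the whitelist
    resolved = os.path.realpath(os.path.join(ALLOWED_ROOT, requested))
    if not resolved.startswith(ALLOWED_ROOT + os.sep):
        raise PermissionError(f"path escapes whitelist: {requested}")
    return resolved
```

the order matters: check the *resolved* path, not the raw string, or `../../etc/passwd` sails through.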

we eventually built a "constitution gate" — a dual-LLM validation layer that checks every input before it reaches any tool. paranoid? maybe. but we haven't been pwned yet.
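the gate itself is just an all-must-approve check over independent validators. a minimal sketch, assuming each validator is a callable wrapping one LLM and returning `(approved, reason)` — the wiring to actual models is left out:

```python
def constitution_gate(user_input, validators):
    """Dual-LLM gate: every validator must independently approve the
    input before it reaches any tool. One veto blocks the whole call."""
    for validate in validators:
        approved, reason = validate(user_input)
        if not approved:
            raise PermissionError(f"blocked by constitution gate: {reason}")
    return user_input
```

using two *different* models as validators is the point — a prompt injection that fools one is less likely to fool both.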

lesson 3: the model is becoming a commodity

we route between groq, cerebras, ollama (local), and claude depending on the task. same prompt, different providers, based on:

  • latency requirements (groq for fast, claude for complex)
  • cost (local ollama for repetitive tasks)
  • availability (if one provider is down, cascade to the next)

the model doesn't matter as much as people think. what matters is the routing logic, the fallback chains, the budget governance that prevents a runaway loop from draining your API credits.
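all three criteria plus the budget cap collapse into one routing function. a sketch with made-up provider records — the fields and numbers are illustrative, not our real config:

```python
def route(task, providers, budget):
    """Pick the first provider (in preference order) that is up, fits the
    task's latency requirement, and won't blow the budget; cascade past
    anything that fails a check."""
    for p in providers:
        if not p["available"]:
            continue                                      # provider is down: cascade
        if p["latency_ms"] > task["max_latency_ms"]:
            continue                                      # too slow for this task
        if budget["spent"] + p["cost_per_call"] > budget["cap"]:
            continue                                      # budget governance: no runaway loops
        budget["spent"] += p["cost_per_call"]
        return p["name"]
    raise RuntimeError("no provider satisfies latency/availability/budget")
```

the hard cap is what stops a retry loop from draining your credits overnight — once `spent` hits `cap`, only free (local) providers remain eligible.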

lesson 4: your agent's memory is more important than its reasoning

we have three layers of memory:

  • session memory (what happened in this conversation)
  • task memory (success/failure patterns across all tasks)
  • playbook memory (reusable templates auto-generated from successful task sequences)

when a new task comes in, the orchestrator checks memory before planning. "have we seen something like this before? what worked? what failed?" this alone cut our error rate by ~40%.
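the lookup order is the interesting part: playbooks first (a ready-made template beats replanning), then task history, then plan from scratch. a sketch — `recall` and the memory layout are illustrative:

```python
def recall(memory, task_signature):
    """Check memory layers before planning: a matching playbook wins
    outright; otherwise reuse the most recent successful plan; otherwise
    signal the planner to start fresh."""
    if task_signature in memory["playbooks"]:
        return {"plan": memory["playbooks"][task_signature], "source": "playbook"}
    history = memory["tasks"].get(task_signature, [])
    wins = [h["plan"] for h in history if h["ok"]]
    if wins:
        return {"plan": wins[-1], "source": "task_memory"}   # most recent success
    return {"plan": None, "source": "fresh"}                 # nothing useful: plan from scratch
```

session memory isn't in this sketch because it lives in the conversation context, not the lookup — only the durable layers get queried here.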

lesson 5: silence is a feature

this is the one nobody talks about. our system has a dead-man's-switch — if the coordinator hasn't checked in for 60 minutes, something is wrong. but the inverse is also true: the system doesn't need to be doing something all the time.
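the dead-man's-switch itself is almost embarrassingly small — the whole trick is comparing timestamps, not doing work. a sketch:

```python
import time

DEADLINE_SECONDS = 60 * 60   # coordinator must check in at least hourly

def coordinator_is_silent(last_checkin, now=None):
    """True if the coordinator has missed its check-in window.
    No check-in within the deadline means something is wrong."""
    now = time.time() if now is None else now
    return (now - last_checkin) > DEADLINE_SECONDS
```

note what it *doesn't* do: poll, ping, or retry. between check-ins the system is idle on purpose — silence inside the window is healthy, silence past it is the alarm.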

the most reliable systems i've built are the ones that know when to shut up and wait.


these aren't revolutionary insights. they're the boring, practical things you learn when you actually try to run multiple MCP servers in production instead of just demoing one in a blog post.

if you're building something similar, i'd genuinely love to hear what patterns you've found. especially around multi-server coordination — i feel like we're all reinventing the same wheels independently.
