Building an MCP server in production: lessons from 2,300 npm downloads

#ai #devtools #javascript #llm

i shipped an MCP server six months ago expecting it to be a weekend toy. it's now been pulled down a little over 2,300 times from npm and runs inside other people's editors every day. that gap — between "stdio script i tested once" and "thing strangers depend on" — is where i learned everything i actually know about the Model Context Protocol. this post is the stuff the spec doesn't tell you.

if you've read the MCP docs you know the happy path: expose some tools, return some content, the client (Claude Desktop, an IDE, whatever) calls them. what the docs are quieter about is what happens when a real client on a real machine hits your server in ways you didn't anticipate. here's what broke, and what i'd tell myself on day one.

the transport will surprise you

the first thing nobody warns you about: stdout is sacred. MCP over stdio uses your process's standard output as the message channel. the moment some dependency deep in your tree does a stray console.log, you've injected garbage into the JSON-RPC stream and the client disconnects with a parse error that points nowhere useful.

i lost an evening to this. the fix is boring and absolute — every diagnostic goes to stderr, and stdout belongs to the protocol alone:

// logger.js — the only logging that's safe in a stdio MCP server
export function log(...args) {
  // stdout is reserved for JSON-RPC frames. everything else -> stderr.
  process.stderr.write(`[2ndopinion] ${args.join(" ")}\n`);
}

// and defensively, at startup, trap anything that didn't get the memo:
console.log = (...args) => log("(captured console.log)", ...args);

that last line caught two transitive dependencies that logged on init. without it the server worked on my machine and failed on half my users'.

tool descriptions are a UX surface, not metadata

the second lesson reframed how i think about the whole protocol. your tool's description field isn't documentation — it's a prompt. the model reads it to decide whether and how to call your tool. a vague description means the tool gets called at the wrong time, with the wrong arguments, or not at all.

my first version of a review tool looked like this:

server.tool(
  "review",
  "Reviews code",
  { code: z.string() },
  async ({ code }) => runReview(code)
);

"reviews code" tested fine when i prompted it, because i knew what i meant. in the wild, models called it on commit messages, on config files, on half-finished functions, and then surfaced confusing output. rewriting the description to say what the tool does, when to use it, and what it returns cut the misfire rate dramatically:

server.tool(
  "review_diff",
  "Run a multi-model code review on a git diff or code snippet. " +
  "Use this BEFORE merging when you want a second opinion on correctness, " +
  "security, or edge cases. Returns issues ranked by severity with file:line refs. " +
  "Do not use for prose, commit messages, or config files.",
  {
    diff: z.string().describe("unified diff or raw code to review"),
    language: z.string().optional().describe("e.g. 'python', 'typescript'"),
  },
  async ({ diff, language }) => runReview(diff, language)
);

the naming matters too. review_diff tells the model what shape of input it wants; review invites it to throw anything at you. treat every string a model reads as a prompt, because it is one.

timeouts, partial failure, and the upstream you don't control

my server fans a diff out to three upstream model APIs and merges the results. in development, all three always answered. in production, one provider would occasionally hang for 40 seconds, and because i was doing a naive Promise.all, the whole tool call blocked behind the slowest — or worse, rejected entirely when one upstream 500'd.

Promise.allSettled with a per-call timeout fixed both the latency tail and the all-or-nothing failure mode. a review from two of three models is still a useful review; a hang is not.

const withTimeout = (p, ms) =>
  Promise.race([
    p,
    new Promise((_, rej) => setTimeout(() => rej(new Error("timeout")), ms)),
  ]);

const settled = await Promise.allSettled(
  models.map((m) => withTimeout(m.review(diff), 25_000))
);

const reviews = settled
  .filter((r) => r.status === "fulfilled")
  .map((r) => r.value);

if (reviews.length === 0) throw new Error("all reviewers failed or timed out");

partial results beat perfect results that never arrive. this is true of most fan-out systems, but it's especially true when an LLM is waiting on the other end and a stalled tool call just looks like the whole agent froze.

version skew is the silent killer

clients implement MCP at slightly different speeds. a capability that works in one editor's build throws "method not found" in another. i now log the client's declared protocol version on the initialize handshake and degrade gracefully instead of assuming everyone is on the latest spec. the single most useful thing i added was a --doctor flag so users could self-diagnose before opening an issue:

$ npx 2ndopinion-cli --doctor

2ndopinion mcp doctor
─────────────────────────────────────────
✓ node v20.11.0            (>= 18 required)
✓ stdio transport          handshake ok
✓ ANTHROPIC_API_KEY        found
✓ OPENAI_API_KEY           found
✗ GEMINI_API_KEY           missing — gemini reviewer disabled
✓ client protocol          2025-03-26
─────────────────────────────────────────
2 of 3 reviewers ready. consensus needs >= 2. you're good to go.

half my early "it's broken" reports were a missing env var. the doctor turned a support thread into a one-line answer the user could read themselves.

what the downloads taught me

the through-line across all of this: an MCP server isn't a library, it's a contract with a non-deterministic caller you can't see. the model decides when to call you, with inputs you didn't write, on a machine you'll never log into. defensive output handling, descriptions written as prompts, partial-failure tolerance, and self-diagnostics aren't polish — they're the difference between 2,300 downloads and 2,300 uninstalls.

i build all of this into 2ndOpinion, which runs Claude, Codex, and Gemini in parallel to cross-review a diff and returns a weighted consensus — each model's accuracy is tracked per language and issue type, so the verdict isn't just a vote, it's a calibrated one. it ships as the MCP server this post is about, plus a REST API, a GitHub PR agent, and the CLI.

if you want to feel the partial-consensus behavior yourself, npx 2ndopinion-cli runs a review in your terminal in about a minute, and the $5 starter pack (100 credits) is enough to put it through a real backlog before you decide anything. and if you're building your own MCP server: log to stderr, write your descriptions like prompts, and ship a doctor command. future-you, reading the bug reports, will be grateful.

Top comments (1)

Alex Shev • Jun 17

That gap between a weekend stdio script and a server strangers depend on is the real MCP lesson. Once other people's editors call it, the protocol surface becomes product surface: versioning, error messages, latency, install docs, and safe defaults all matter. It works on my machine stops being a useful bar.